Squeezing blood from a stone: SQM at 600+ Mbps on IPQ4018 without NSS?

Valent@MeshPoint valent at meshpointone.com
Wed Mar 4 15:59:17 PST 2026


Hi everyone,

I have a silly question. Or maybe not so silly.

We build mesh routers for crisis response (refugee camps, disaster
zones, festivals). Our hardware is the 8dev Jalapeno / MeshPoint.One:
IPQ4018, quad-core Cortex-A7 @ 717 MHz, 256 MB RAM. Old hardware by
today's standards, but it's what's deployed in the field and we can't
swap it out.

We've been doing a deep dive into SQM performance on this platform and
found something interesting that I'd love the community's input on.


THE SITUATION
-------------

IPQ4018 has NO NSS cores.
So all packet processing is software-only on the ARM cores.

Current performance:
   - Raw forwarding (no SQM):      ~780 Mbps
   - With software flow offload:   ~950 Mbps
   - With CAKE (single-queue):     ~200-250 Mbps  <-- the problem
   - With fq_codel (no shaping):   ~600-800 Mbps

CAKE on a single core is the bottleneck. Meanwhile 3 cores sit mostly
idle. Classic.


WHAT WE FOUND
-------------

1) The IPQ4018 EDMA driver exposes 4 TX queues per netdev
    (EDMA_NETDEV_TX_QUEUE = 4 in edma.h, confirmed in both legacy
    essedma and new IPQESS drivers).

2) CAKE_MQ (merged into net-next for Linux 7.0) creates one CAKE
    instance per hardware TX queue, distributing the work across cores.

3) In theory: 4 TX queues x 4 CPU cores = CAKE distributed across
    all cores = potentially 600-800 Mbps with full QoS.

Has anyone tested CAKE_MQ on IPQ40xx hardware? Does the EDMA driver's
multi-queue implementation actually distribute softirq processing across
cores, or does it all end up on core 0 anyway?
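While waiting for CAKE_MQ, the same per-queue layout can be approximated
today with the stock mq qdisc and one CAKE child per hardware TX queue.
A sketch, assuming eth0 is the EDMA netdev with its 4 TX queues and the
rate is just an example value:

```shell
# Approximate CAKE_MQ with stock qdiscs: mq root, one CAKE per TX queue.
# Assumes eth0 exposes 4 TX queues (EDMA_NETDEV_TX_QUEUE = 4).
# Caveat: the four CAKE instances shape independently, so a single
# link-wide rate cannot be enforced exactly -- each queue gets its own
# budget, here 150 Mbit/s apiece.
tc qdisc replace dev eth0 root handle 100: mq
for i in 1 2 3 4; do
    tc qdisc replace dev eth0 parent 100:$i cake besteffort bandwidth 150mbit
done
```

Whether this actually spreads the work depends on the same question as
above: the softirqs for each queue have to land on different cores.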


THE BANDWIDTH PROBLEM
---------------------

Our use case makes bandwidth estimation... interesting:

   - WiFi mesh backhaul: anywhere from 10 to 800 Mbps depending on
     distance, interference, weather, number of mesh hops
   - WiFi AP with clients: 5 to 150 simultaneous users, signal quality
     varies wildly
   - WAN uplink: often Starlink or cellular, bandwidth oscillates
     throughout the day

We can't hardcode a bandwidth value for CAKE because nothing is fixed.

We know about cake-autorate and it looks promising for the WAN side.
But for the mesh backhaul and AP interfaces, we're relying on kernel
fq_codel + mac80211 per-station fq_codel + AQL, with no explicit
shaping.
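For anyone wanting to verify that AQL is actually in play on their
build, the mac80211 debugfs exposes the limits. The paths below are
assumptions (they require CONFIG_MAC80211_DEBUGFS and vary a little by
kernel version):

```shell
# Check mac80211 AQL state via debugfs (paths are kernel-dependent).
# aql_threshold: airtime threshold for switching queue limits.
cat /sys/kernel/debug/ieee80211/phy0/aql_threshold
# aql_txq_limit: per-AC low/high airtime queue limits in microseconds.
cat /sys/kernel/debug/ieee80211/phy0/aql_txq_limit
```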

Question: Is there a simple approach for periodic bandwidth measurement
+ SQM reconfiguration? Something like:

   1. Detect low-traffic period
   2. Run quick bandwidth probe (iperf3 to next mesh hop?)
   3. Reconfigure CAKE bandwidth parameter
   4. Repeat every few hours

Or is this overengineering, and is fq_codel without shaping genuinely
"good enough" for mesh links where bandwidth is unknown?
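The four steps above could look roughly like this. Everything here is
an assumption for illustration: the interface, the next-hop address, an
iperf3 server already running on the next hop, jq available for JSON
parsing, and CAKE already installed as root qdisc on $DEV:

```shell
#!/bin/sh
# Sketch of periodic bandwidth probe + CAKE reconfiguration.
# Assumptions: iperf3 server at $NEXT_HOP, jq installed, CAKE already
# the root qdisc on $DEV. Names and numbers are illustrative only.
DEV=wlan0
NEXT_HOP=192.168.1.2
MARGIN=85    # shape to 85% of the measured rate

while :; do
    # 1) Detect a low-traffic period: tx_bytes delta over 10 seconds.
    b0=$(cat /sys/class/net/$DEV/statistics/tx_bytes)
    sleep 10
    b1=$(cat /sys/class/net/$DEV/statistics/tx_bytes)
    kbps=$(( (b1 - b0) * 8 / 10 / 1000 ))
    if [ "$kbps" -lt 1000 ]; then
        # 2) Quick probe to the next mesh hop (5 s, JSON output).
        mbit=$(iperf3 -c "$NEXT_HOP" -t 5 -J \
               | jq '.end.sum_received.bits_per_second / 1000000 | floor')
        # 3) Reconfigure CAKE with a safety margin below the measurement.
        mbit=$(( mbit * MARGIN / 100 ))
        [ "$mbit" -gt 0 ] && \
            tc qdisc change dev "$DEV" root cake bandwidth "${mbit}mbit"
    fi
    # 4) Repeat every few hours.
    sleep 14400
done
```

The obvious downside is that the probe itself eats airtime on the mesh,
which is part of why I'm asking whether this is worth doing at all.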


WHAT WE'RE DOING NOW
--------------------

Our current stack (all available today on OpenWrt 24.10):

   - Gateway WAN: CAKE besteffort + cake-autorate (only interface with
     explicit shaping)
   - Mesh backhaul: kernel fq_codel (default, zero config)
   - WiFi AP: mac80211 per-station fq_codel + AQL (driver-level)
   - LAN: kernel fq_codel (default)
   - IRQ affinity + RPS/XPS distributed across all 4 cores
   - NAPI budget tuned to 1000
   - CPU governor: performance
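For reference, our IRQ/RPS/XPS spreading is roughly the following. The
EDMA IRQ naming is an assumption (it differs between the legacy essedma
and IPQESS drivers), so treat this as a sketch rather than a recipe:

```shell
# Spread packet processing across all 4 cores.
# IRQ name pattern "edma" is an assumption; check /proc/interrupts.

# Pin each EDMA IRQ to a different core, round-robin (hex CPU bitmask).
i=0
for irq in $(awk '/edma/ { sub(":", "", $1); print $1 }' /proc/interrupts); do
    printf '%x' $(( 1 << (i % 4) )) > /proc/irq/$irq/smp_affinity
    i=$(( i + 1 ))
done

# RPS: let receive processing run on any of the 4 cores (mask 0xf).
for q in /sys/class/net/eth0/queues/rx-*; do
    echo f > $q/rps_cpus
done

# XPS: map each TX queue to one core, matching the IRQ layout.
i=0
for q in /sys/class/net/eth0/queues/tx-*; do
    printf '%x' $(( 1 << (i % 4) )) > $q/xps_cpus
    i=$(( i + 1 ))
done
```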

This gets us to approximately 300-350 Mbps with CAKE on WAN. We're
hoping CAKE_MQ on OpenWrt 25.12 will push that to 600+.


SPECIFIC QUESTIONS
------------------

1. CAKE_MQ on IPQ40xx: Has anyone tested it? Does the EDMA multi-queue
    actually distribute across cores?

2. SFE + egress qdisc: We confirmed from source that SFE calls
    dev_queue_xmit() which preserves the egress qdisc. Anyone running
    SFE + CAKE/fq_codel in production on IPQ40xx? Any gotchas beyond
    the known ingress/IFB issue?
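For context on the ingress side: since a qdisc only shapes egress, the
standard workaround is to redirect ingress traffic through an IFB device
and shape there. As we understand it, SFE-accelerated flows may bypass
this redirect, which is the gotcha referenced above. The plain (non-SFE)
setup looks like this; the rate is an example:

```shell
# Standard ingress shaping via IFB: redirect eth0 ingress into ifb0,
# then shape ifb0's egress with CAKE. 300mbit is a placeholder rate.
ip link add ifb0 type ifb 2>/dev/null
ip link set ifb0 up
tc qdisc replace dev eth0 handle ffff: ingress
tc filter replace dev eth0 parent ffff: matchall \
    action mirred egress redirect dev ifb0
tc qdisc replace dev ifb0 root cake besteffort bandwidth 300mbit
```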

3. Mesh QoS without bandwidth knowledge: For WiFi mesh links where
    bandwidth varies 10-800 Mbps, is fq_codel genuinely the right
    answer? Or are we leaving performance on the table?

4. cake-autorate on Starlink: Anyone running this combination? How
    well does it adapt to Starlink's bandwidth variations?

5. Am I missing something obvious? Any IPQ4018-specific optimizations
    we haven't considered?

We've documented our full analysis, including per-interface
recommendations, CPU impact measurements, and community network
research. Happy to share the write-up privately if anyone is
interested.


A NOTE ON DAVE TAHT
--------------------

I want to say something personal here. Dave was a mentor to me. During
the years we were building MeshPoint - mesh routers for refugee camps
along the Croatian border in 2015-2016 - Dave was incredibly generous
with his time and knowledge. He answered emails within hours, jumped on
calls whenever I asked, and never once made me feel like my questions
were too basic. He genuinely cared about getting networks right for the
people who needed them most.

The fact that fq_codel + AQL "just works" on our mesh nodes without any
configuration - that's Dave's legacy in every packet we forward. We're
trying to build on that foundation for crisis response networks where
connectivity saves lives.

The 25.12 dedication is well deserved. Rest in peace, Dave.

Thanks for any insights. Happy to share our benchmark data and test
methodology if useful.

Best regards,
Valent Turkovic
MeshPoint
https://meshpointone.com



More information about the openwrt-devel mailing list