Optimizing kernel compilation / alignments for network performance

Wed Apr 27 05:56:58 PDT 2022

From: Rafał Miłecki <zajec5 at gmail.com>
Date: Wed, 27 Apr 2022 14:04:54 +0200

> Hi,

Hej,

> 
> I noticed years ago that kernel changes touching code - that I don't use
> at all - can affect network performance for me.
> 
> I work with home routers based on Broadcom Northstar platform. Those
> are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. Main task of
> those devices is NAT masquerade and that is what I test with iperf
> running on two x86 machines.
> 
> ***
> 
> Example of such unused code change:
> ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
> It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3,5%).
> 
> I first reported that issue it in the e-mail thread:
> ARM router NAT performance affected by random/unrelated commits
> https://lkml.org/lkml/2019/5/21/349
> https://www.spinics.net/lists/linux-block/msg40624.html
> 
> Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
> unicast headers")
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
> that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4,3%).
> 
> ***
> 
> It appears Northstar CPUs have little cache size and so any change in
> location of kernel symbols can affect NAT performance. That explains why
> changing unrelated code affects anything & it has been partially proven
> aligning some of cache-v7.S code.
> 
> My question is: is there a way to find out & force an optimal symbols
> locations?

Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B[0]. I've been
fighting with the same issue on some Realtek MIPS boards: random
code changes in random kernel core parts were affecting NAT /
network performance. This option resolved this I'd say, for the cost
of slightly increased vmlinux size (almost no change in vmlinuz
size).
The only thing is that it was recently restricted to a set of
architectures and MIPS and ARM32 are not included now lol. So it's
either a matter of expanding the list (since it was restricted only
because `-falign-functions=` is not supported on some architectures)
or you can just do:

make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache size

The actual alignment is something to play with, I stopped on the
cacheline size, 32 in my case.
Also, this does not provide any guarantees that you won't suffer
from random data cacheline changes. There were some initiatives to
introduce debug alignment of data as well, but since function are
often bigger than 32, while variables are usually much smaller, it
was increasing the vmlinux size by a ton (imagine each u32 variable
occupying 32-64 bytes instead of 4). But the chance of catching this
is much lower than to suffer from I-cache function misplacement.

> 
> Adding .align 5 to the cache-v7.S is a partial success. I'd like to find
> out what other functions are worth optimizing (aligning) and force that
> (I guess  __attribute__((aligned(32))) could be used).
> 
> I can't really draw any conclusions from comparing System.map before and
> after above commits as they relocate thousands of symbols in one go.
> 
> Optimizing is pretty important for me for two reasons:
> 1. I want to reach maximum possible NAT masquerade performance
> 2. I need stable performance across random commits to detect regressions

[0] https://elixir.bootlin.com/linux/v5.18-rc4/K/ident/CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B

Thanks,
Al