Optimizing kernel compilation / alignments for network performance
zajec5 at gmail.com
Wed Apr 27 10:31:33 PDT 2022
On 27.04.2022 14:56, Alexander Lobakin wrote:
> From: Rafał Miłecki <zajec5 at gmail.com>
> Date: Wed, 27 Apr 2022 14:04:54 +0200
>> I noticed years ago that kernel changes touching code - that I don't use
>> at all - can affect network performance for me.
>> I work with home routers based on the Broadcom Northstar platform. Those
>> are SoCs with two not-so-powerful ARM Cortex-A9 CPU cores. The main task
>> of those devices is NAT masquerade and that is what I test with iperf
>> running on two x86 machines.
>> An example of such a change to unused code:
>> ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
>> It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3.5%).
>> I first reported that issue in the e-mail thread:
>> ARM router NAT performance affected by random/unrelated commits
>> Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
>> unicast headers")
>> that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4.3%).
>> It appears Northstar CPUs have small caches, so any change in the
>> location of kernel symbols can affect NAT performance. That explains why
>> changing unrelated code affects anything & it has been partially proven
>> by aligning some of the cache-v7.S code.
>> My question is: is there a way to find out & force an optimal symbols
> Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B. I've been
> fighting with the same issue on some Realtek MIPS boards: random
> code changes in random kernel core parts were affecting NAT /
> network performance. This option resolved this I'd say, at the cost
> of slightly increased vmlinux size (almost no change in vmlinuz
> The only thing is that it was recently restricted to a set of
> architectures and MIPS and ARM32 are not included now lol. So it's
> either a matter of expanding the list (since it was restricted only
> because `-falign-functions=` is not supported on some architectures)
> or you can just do:
> make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache line size
> The actual alignment is something to play with; I settled on the
> cacheline size, 32 in my case.
> Also, this does not provide any guarantees that you won't suffer
> from random data cacheline changes. There were some initiatives to
> introduce debug alignment of data as well, but since functions are
> often bigger than 32 bytes, while variables are usually much smaller, it
> was increasing the vmlinux size by a ton (imagine each u32 variable
> occupying 32-64 bytes instead of 4). But the chance of hitting this
> is much lower than that of suffering from I-cache function misplacement.
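The size argument above can be checked with a bit of arithmetic. This is only a minimal sketch, assuming every symbol starts on an aligned boundary so that its footprint is rounded up to the next multiple of the alignment. The 200-byte function is a hypothetical example size, not a measurement from any kernel.

```python
def aligned_size(size: int, alignment: int) -> int:
    """Round `size` (bytes) up to the next multiple of `alignment` (bytes)."""
    if size <= 0 or alignment <= 0:
        raise ValueError("size and alignment must be positive")
    return -(-size // alignment) * alignment  # ceiling division, then scale back


def overhead_factor(size: int, alignment: int) -> float:
    """How many times more space the symbol occupies once aligned."""
    return aligned_size(size, alignment) / size


if __name__ == "__main__":
    # The u32 case comes from the text; the 200-byte function is hypothetical.
    for label, size in (("u32 variable", 4), ("200-byte function", 200)):
        for alignment in (32, 64):
            print(f"{label}: {size} B -> {aligned_size(size, alignment)} B "
                  f"at {alignment}-byte alignment "
                  f"(x{overhead_factor(size, alignment):.2f})")
```

Under these assumptions a 4-byte variable grows 8 to 16 times, while a function of a couple of hundred bytes grows only by a small fraction, which is why aligning functions is affordable and aligning all data is not.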
Thank you Alexander, this appears to be helpful! I decided to ignore
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
1. Without ce5013ff3bec and with -falign-functions=32
2. Without ce5013ff3bec and with -falign-functions=64
3. With ce5013ff3bec and with -falign-functions=32
4. With ce5013ff3bec and with -falign-functions=64
So it seems that:
1. -falign-functions=32 = pretty stable high speed
2. -falign-functions=64 = very stable slightly lower speed
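For anyone who wants to repeat the four test cases, here is a rough sketch of how the build matrix could be generated. It assumes that "without ce5013ff3bec" means building the parent of that commit (a revert on top of a newer tree would work too), and it only prints the commands instead of running them. An OpenWrt build would normally pass the flag through its own build system, so treat the plain `make` invocation as illustrative.

```python
from itertools import product

COMMIT = "ce5013ff3bec"   # the mtd/spi-nor commit discussed in this thread
ALIGNMENTS = (32, 64)     # the two -falign-functions values tested


def build_matrix(commit: str = COMMIT, alignments=ALIGNMENTS):
    """Return one entry per (without/with commit, alignment) combination."""
    runs = []
    for with_commit, alignment in product((False, True), alignments):
        # Assumption: "without" is modelled as the commit's parent.
        ref = commit if with_commit else f"{commit}~1"
        state = "with" if with_commit else "without"
        runs.append({
            "label": f"{state} {commit}, -falign-functions={alignment}",
            "checkout": ["git", "checkout", ref],
            "build": ["make", f"KCFLAGS=-falign-functions={alignment}"],
        })
    return runs


if __name__ == "__main__":
    for number, run in enumerate(build_matrix(), start=1):
        print(f"{number}. {run['label']}")
        print("   " + " ".join(run["checkout"]))
        print("   " + " ".join(run["build"]))
```

The entries come out in the same order as the numbered test cases above.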
I'm going to perform tests on more commits, but if it stays as reliable
as above, that will be a huge success for me.
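Before benchmarking more commits it may be worth confirming that the alignment flag really took effect. One possible check, sketched below, is to read the System.map produced by the build and count how many text symbols (type `T` or `t`) start on the requested boundary. The symbol names and addresses in the sample are hypothetical. Text symbols also include assembly labels that the compiler flag does not control, so a result below 100% is expected.

```python
def aligned_fraction(system_map_lines, alignment: int) -> float:
    """Fraction of text symbols whose address is a multiple of `alignment`.

    Each System.map line has the form "<hex address> <type> <name>".
    """
    total = 0
    aligned = 0
    for line in system_map_lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        address, symbol_type = parts[0], parts[1]
        if symbol_type not in ("T", "t"):
            continue  # only text (code) symbols are of interest here
        total += 1
        if int(address, 16) % alignment == 0:
            aligned += 1
    return aligned / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical sample; in practice pass open("System.map") instead.
    sample = [
        "c0008000 T stext",
        "c0008020 t hypothetical_helper",
        "c0008044 T hypothetical_unaligned",
        "c0a00000 D hypothetical_data",
    ]
    for alignment in (32, 64):
        share = aligned_fraction(sample, alignment)
        print(f"{alignment}-byte aligned text symbols: {share:.0%}")
```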