Optimizing kernel compilation / alignments for network performance

Thu May 5 08:42:56 PDT 2022

On 29.04.2022 16:49, Arnd Bergmann wrote:
> On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki <zajec5 at gmail.com> wrote:
>> On 27.04.2022 14:56, Alexander Lobakin wrote:
> 
>> Thank you Alexander, this appears to be helpful! I decided to ignore
>> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
>> manually.
>>
>>
>> 1. Without ce5013ff3bec and with -falign-functions=32
>> 387 Mb/s
>>
>> 2. Without ce5013ff3bec and with -falign-functions=64
>> 377 Mb/s
>>
>> 3. With ce5013ff3bec and with -falign-functions=32
>> 384 Mb/s
>>
>> 4. With ce5013ff3bec and with -falign-functions=64
>> 377 Mb/s
>>
>>
>> So it seems that:
>> 1. -falign-functions=32 = pretty stable high speed
>> 2. -falign-functions=64 = very stable slightly lower speed
>>
>>
>> I'm going to perform tests on more commits but if it stays so reliable
>> as above that will be a huge success for me.
> 
> Note that the problem may not just be the alignment of a particular
> function, but also how different functions map into your cache.
> The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or
> 64KB, with a line size of 32 bytes. If you are unlucky and you get
> five different functions that are frequently called and are spaced
> at exactly the wrong distance, so that they need more than
> four ways, calling them in sequence would always evict the other
> ones. The same could of course happen if the problem is the D-cache
> or the L2.
> 
> Can you try to get a profile using 'perf record' to see where most
> time is spent, in both the slowest and the fastest versions?
> If the instruction cache is the issue, you should see how the hottest
> addresses line up.

Your explanation sounds sane of course.
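
To make sure I understand the set-associativity argument, here is a minimal sketch of the arithmetic. It assumes the 32KB variant of the L1 cache (the size on my SoC may differ), a made-up base address, and that the set index is taken directly from the address bits above the line offset. It only illustrates the mapping and measures nothing.

```python
# Assumed geometry: 32 KB, 4-way, 32-byte lines (one of the Cortex-A9 options).
CACHE_SIZE = 32 * 1024
WAYS = 4
LINE_SIZE = 32

NUM_SETS = CACHE_SIZE // (WAYS * LINE_SIZE)  # number of sets in the cache
WAY_SIZE = NUM_SETS * LINE_SIZE              # bytes covered by one way


def cache_set(addr):
    """Return the set index of the cache line that holds addr."""
    return (addr // LINE_SIZE) % NUM_SETS


# Five hypothetical hot functions whose entry points are exactly one
# way size apart. The base address is made up for illustration.
base = 0xC0101000
hot = [base + i * WAY_SIZE for i in range(5)]

sets = {cache_set(a) for a in hot}

# All five land in the same set, but the set only has four ways, so
# calling them in sequence keeps evicting one of the others.
conflict = len(sets) == 1 and len(hot) > WAYS

print("sets:", NUM_SETS, "way size:", WAY_SIZE)
print("set indexes used:", sorted(sets), "conflict:", conflict)
```

With this geometry any two addresses that differ by a multiple of 8KB compete for the same four ways, regardless of how the functions themselves are aligned.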

If you take a look at my old e-mail
ARM router NAT performance affected by random/unrelated commits
https://lkml.org/lkml/2019/5/21/349
https://www.spinics.net/lists/linux-block/msg40624.html

you'll see that the most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

Is there a way to optimize the kernel for better cache usage of the
selected (above) functions?
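
As a first step, one could at least check whether any of those symbols start in the same cache set. Below is a rough sketch that does this for text in System.map format ("address type name"). The addresses in SAMPLE are invented, the cache geometry (32KB, 4-way, 32-byte lines) is an assumption, and only the entry address of each function is considered, although a real function spans several lines.

```python
from collections import defaultdict

# Assumed geometry: 32 KB, 4-way, 32-byte lines gives 256 sets.
LINE_SIZE = 32
NUM_SETS = 256


def parse_system_map(text):
    """Parse 'address type name' lines into a {name: address} dict."""
    syms = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:  # skip anything not in the plain 3-column form
            addr, _type, name = parts
            syms[name] = int(addr, 16)
    return syms


def sets_by_symbol(syms, hot_names):
    """Group the hot symbols by the cache set of their entry address."""
    groups = defaultdict(list)
    for name in hot_names:
        if name in syms:
            groups[(syms[name] // LINE_SIZE) % NUM_SETS].append(name)
    return dict(groups)


# Invented addresses, for illustration only.
SAMPLE = """\
c0118000 T v7_dma_inv_range
c011a000 T v7_dma_clean_range
c0460040 T fib_table_lookup
"""

hot_names = ["v7_dma_inv_range", "v7_dma_clean_range", "fib_table_lookup"]
groups = sets_by_symbol(parse_system_map(SAMPLE), hot_names)

for index, names in sorted(groups.items()):
    print(index, names)
```

For a real build, SAMPLE would be replaced by the contents of the System.map produced for that kernel. Sets that collect more hot symbols than there are ways would be the first candidates to look at in the perf output.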

Meanwhile I was testing -fno-reorder-blocks, which some OpenWrt folks
reported as worth trying. It's another source of randomness: it stabilizes
NAT performance across some commits and breaks stability across others.