ARM board lockups/hangs triggered by locks and mutexes

Rafał Miłecki zajec5 at gmail.com
Wed Aug 2 00:02:20 PDT 2023


On 2.08.2023 00:25, Florian Fainelli wrote:
> Hi Rafal,
> 
> On 8/1/23 15:10, Rafał Miłecki wrote:
>> Hi,
>>
>> Years ago I added support for Broadcom's BCM53573 SoCs. We released
>> firmware based on Linux 4.4 (and later on 4.14) that worked almost
>> fine. There was one small issue we couldn't debug or fix: random hangs
>> and reboots. They were too rare to track down (most devices worked
>> fine for weeks or months).
>>
>> Recently I updated my stable 5.4 kernel and started experiencing
>> stability issues myself! After some uptime (usually 0 to 20 minutes
>> of close to zero activity) the serial console hangs. I can't type
>> anything and I stop getting any messages. I have to wait about a
>> minute for the watchdog to kick in and reboot the device.
>>
>> #####
>>
>> I took that opportunity and decided to track down the regression.
>>
>> The Linux 5.4 stable branch was stable up to release v5.4.197.
>> Starting with v5.4.198 I began experiencing those stability issues. I
>> bisected them down to commit 4460066eb248 ("ipv6: fix locking issues
>> with loops over idev->addr_list"):
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4460066eb2480b9e203c73755e12e2efc820a27e
>>
>> With the above commit reverted I was able to use the stable 5.4 branch
>> up to release v5.4.207. Starting with v5.4.208 it became unstable
>> again. I bisected it down to:
>> commit d0d583484d2e ("locking/refcount: Consolidate implementations of
>> refcount_t")
>> commit dab787c73f6e ("locking/refcount: Consolidate
>> REFCOUNT_{MAX,SATURATED} definitions")
>> commit 0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
>> commit 809554147d60 ("locking/refcount: Improve performance of generic
>> REFCOUNT_FULL code")
>> commit 9c9269977f03 ("locking/refcount: Move the bulk of the
>> REFCOUNT_FULL implementation into the <linux/refcount.h> header")
>> commit 04bff7d7b808 ("locking/refcount: Remove unused
>> refcount_*_checked() variants")
>> commit 513b19a43bec ("locking/refcount: Ensure integer operands are
>> treated as signed")
>> commit 68b4ee68e8c8 ("locking/refcount: Define constants for
>> saturation and max refcount values")
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=d0d583484d2ed9f5903edbbfa7e2a68f78b950b0
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=dab787c73f6e38d8e7ed3c1e683385e8f0fe28a2
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=0d3182fbe689e3808c03b6cde6be98237f9e0a4a
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=809554147d609163cfbaf815c443c575b538a7ef
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=9c9269977f03ab9c448c8b71581a951e0eb4fb7b
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=04bff7d7b8081c4bb2e8171be31d33df297eee5b
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=513b19a43becee5f7af6d283bb9d3d241a8a21a8
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=68b4ee68e8c8800cf8d6b61cc74b4031a0742a4c
>> (I didn't actually check the above commits individually).
>>
>> Reverting the above locking/refcount commits worked fine for a few
>> releases: up to v5.4.219. Starting with v5.4.220 I got hangs again. I
>> bisected that down to commit 131287ff833d ("once: add DO_ONCE_SLOW()
>> for sleepable contexts").
>>
>> Reverting that extra commit from v5.4.238 allows me to run Linux for
>> hours again (currently 3 devices x 6 hours and counting). So in total
>> I need 10+1 reverts from the 5.4 branch to get a stable kernel.
>>
>> #####
>>
>> I'm clueless at this point. Is it possible that the kernel has some
>> locking bug I can hit only on this specific SoC? BCM53573s have a
>> single ARM Cortex-A7 CPU running at 900 MHz. The only unusual thing
>> about this hardware I can think of is a slow arch timer running at
>> 36.8 kHz.
> 
> From the look of it, it seems like the CPU might have bugs with atomics?
> 
> Your log indicates that your Cortex-A7 is r0p5, which is described as susceptible to ARM_ERRATA_814220. Do you have it enabled by any chance? If not, can you enable it and see if it makes any difference?

I had it disabled. Unfortunately CONFIG_ARM_ERRATA_814220=y doesn't help.
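
For a sense of scale on the slow arch timer mentioned above, here is a minimal sketch that only does arithmetic on the two figures quoted in this thread (36.8 kHz timer, 900 MHz CPU). The function names are made up for illustration, and this is not a diagnosis: it just shows how coarse one timer tick is relative to the CPU clock.

```python
# Figures quoted in the thread; everything else here is illustrative.
ARCH_TIMER_HZ = 36_800       # 36.8 kHz arch timer
CPU_HZ = 900_000_000         # 900 MHz Cortex-A7


def tick_period_us(timer_hz):
    """Duration of one timer tick, in microseconds."""
    return 1_000_000 / timer_hz


def cycles_per_tick(cpu_hz, timer_hz):
    """CPU clock cycles that elapse during one timer tick."""
    return cpu_hz / timer_hz


if __name__ == "__main__":
    print(f"tick period: {tick_period_us(ARCH_TIMER_HZ):.2f} us")
    print(f"CPU cycles per tick: {cycles_per_tick(CPU_HZ, ARCH_TIMER_HZ):.0f}")
```

So one timer tick lasts roughly 27 microseconds, or on the order of 24,000 CPU cycles. Whether that granularity has anything to do with the hangs is an open question.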



More information about the openwrt-devel mailing list