Bridge-vlan bug? (mt7621/DSA)

Thibaut hacks at slashdirt.org
Sat Aug 6 02:58:49 PDT 2022


> On 6 Aug 2022, at 00:50, Mark Mentovai <mark at mentovai.com> wrote:
> 
> Thibaut wrote:
>> I’m experiencing a strange bug on Yuncore AX820 (mt7621/mt7905/mt7975, DSA-enabled) when using a bridge-vlan setup. This bug affects at least OpenWRT 22.03.0-rc6.
>> 
>> I’m not sure whether this bug is related to this particular SoC or only to DSA as I was unable to test with another DSA-enabled device (I don’t have any). However this bug does not affect e.g. QCA non-DSA devices.
>> 
>> I’m running out of ideas on how to debug this problem further, so feel free to guide me if more information is needed. Please CC me in replies.
> 
> This sounds very similar to the problem I experienced with the work-in-progress DSA patches for ipq40xx:
> 
> https://github.com/openwrt/openwrt/pull/4721#issuecomment-971162067
> 
> This kernel patch explains the situation fairly well:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5f19486cee79d04c054427577ac96ed123706db
> 
> But the fix isn’t operative unless the switch driver opts in via assisted_learning_on_cpu_port. There were also comments from around that time that there may still be trouble with untagged traffic.
> 
> There’s a bit of discussion about this issue in the comments around there on the pull request. Hopefully you’ll find it helpful. It should at least get you oriented in the right direction, even if it’s not a fix for your untagged use case.

Thanks a lot for these details. Based on your input, and after comparing our current 5.10 source with current upstream, it seems this may already have been fixed upstream:

https://github.com/torvalds/linux/commit/0b69c54c74bcb60e834013ccaf596caf05156a8e

I’ll check if this can be backported without too much fuss.
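
To make the failure mode concrete, here is a toy Python model (not kernel or driver code) of a switch FDB that learns source MACs on front-panel ports but not on frames injected by the CPU. The `assisted_learning_on_cpu_port` flag only loosely mirrors the DSA opt-in mentioned above: as I understand the kernel commit linked earlier, the software bridge learns the MAC on the Wi-Fi interface and DSA then installs an FDB entry pointing at the CPU port. The names `ToySwitchFdb` and `roam`, the port names, the client MAC and the 300 s ageing time (the Linux bridge default) are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

AGEING_TIME_S = 300  # assumed: Linux bridge default ageing time (5 minutes)

Key = Tuple[str, int]  # (MAC address, VLAN id)


@dataclass
class ToySwitchFdb:
    """Hypothetical toy model of a DSA switch's hardware FDB."""

    assisted_learning_on_cpu_port: bool = False
    # (mac, vid) -> (port, time the entry was last refreshed)
    entries: Dict[Key, Tuple[str, float]] = field(default_factory=dict)

    def rx_on_port(self, mac: str, vid: int, port: str, now: float) -> None:
        # Hardware learns (or refreshes) the source MAC on front-panel ports.
        self.entries[(mac, vid)] = (port, now)

    def tx_from_cpu(self, mac: str, vid: int, now: float) -> None:
        # Frames injected by the CPU (e.g. bridged from Wi-Fi) are not learned
        # by hardware. With assisted learning, software installs an entry
        # pointing at the CPU port instead.
        if self.assisted_learning_on_cpu_port:
            self.entries[(mac, vid)] = ("cpu", now)

    def forward(self, dst: str, vid: int, in_port: str, now: float) -> Optional[str]:
        """Return the egress port, "flood" if unknown, or None if dropped."""
        entry = self.entries.get((dst, vid))
        if entry is None or now - entry[1] >= AGEING_TIME_S:
            return "flood"  # unknown or aged out: flood within the VLAN
        port = entry[0]
        # A bridge does not send a frame back out of the port it arrived on.
        return None if port == in_port else port


def roam(assisted: bool, t_check: float) -> Optional[str]:
    """Client learned on 'wan' at t=0, then roams to local Wi-Fi at t=10."""
    fdb = ToySwitchFdb(assisted_learning_on_cpu_port=assisted)
    client, vid = "aa:bb:cc:dd:ee:ff", 8  # hypothetical MAC; VLAN 8 as in the test case
    fdb.rx_on_port(client, vid, "wan", now=0)  # client traffic seen via the other AP
    fdb.tx_from_cpu(client, vid, now=10)        # client now behind this AP's Wi-Fi
    # The router replies to the client: the frame enters the switch on 'wan'.
    return fdb.forward(client, vid, in_port="wan", now=t_check)


print(roam(assisted=False, t_check=60))   # None: dropped, stale entry on 'wan'
print(roam(assisted=False, t_check=301))  # flood: entry aged out, traffic resumes
print(roam(assisted=True, t_check=60))    # cpu: delivered to the CPU, then Wi-Fi
```

With the flag off, replies to the roamed client are dropped until the stale entry ages out, which matches a blackout of several minutes; with it on, the entry follows the client to the CPU port as soon as the client shows up on Wi-Fi.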

>> == Hardware setup ==
>> 
>> - 1 router (any router works for the purpose of the test), serving DHCP on the LAN (the default configuration from a fresh OpenWRT install is enough to reproduce this bug; the router setup plays no part in the bug).
>> - 1 AX820 set up as a « dumb » AP (testcase config provided below, using a bridge-vlan), with one uplink interface (here ‘wan’) directly connected to the router
>> - 1 other AP, make/model irrelevant, provided it has the same dumb config as the AX820 and is also directly connected to the router
>> 
>> The APs use a single bridge-vlan to which their interfaces are attached: in the full scenario, multiple VLANs are assigned to the bridge, each mapped to a separate SSID. All but one of the VLANs are tagged on the uplink interface. The reduced testcase config provided below uses a single untagged VLAN (id 8, for network ‘lan’) and a single SSID, which is enough to expose the bug.
>> 
>> 
>> == Bug description ==
>> 
>> The following bug happens on the untagged VLAN on the uplink interface (see testcase config below):
>> 
>> When a client device roams to the AX820 AP (which can be forced by issuing « wifi off » on the other AP while the client is connected to it), a « blackout » period, typically lasting 2-5 minutes, begins during which the client loses connectivity.
> 
> The stale entries persist in the FDB with a 5-minute timeout, so this aligns. You can use “bridge fdb show” to see this happening, and “bridge fdb del” to delete entries before they time out. This comment and the gists linked in the one after have more information on a test environment:
> 
> https://github.com/openwrt/openwrt/pull/4721#issuecomment-974911742
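
To watch the stale entry, the output of `bridge fdb show` can be filtered for the roaming client's MAC. Below is a minimal Python sketch, assuming the usual one-entry-per-line text format (`<mac> dev <port> [vlan <vid>] [master <bridge>] [flags...]`). `FdbEntry`, `parse_fdb_line`, `entries_for` and the keyword set are hypothetical helpers, and the sample lines are illustrative, not captured from the AX820.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed keywords in `bridge fdb show` output that are followed by a value.
VALUED_KEYS = {"dev", "vlan", "master", "dst", "port", "vni", "via", "nhid", "src_vni"}


@dataclass
class FdbEntry:
    mac: str
    attrs: Dict[str, str] = field(default_factory=dict)  # e.g. dev, vlan, master
    flags: List[str] = field(default_factory=list)       # e.g. self, permanent


def parse_fdb_line(line: str) -> FdbEntry:
    """Parse one line of `bridge fdb show`; bare words are kept as flags."""
    tokens = line.split()
    entry = FdbEntry(mac=tokens[0].lower())
    i = 1
    while i < len(tokens):
        if tokens[i] in VALUED_KEYS and i + 1 < len(tokens):
            entry.attrs[tokens[i]] = tokens[i + 1]
            i += 2
        else:
            entry.flags.append(tokens[i])
            i += 1
    return entry


def entries_for(mac: str, fdb_output: str) -> List[FdbEntry]:
    """Return every FDB entry for `mac` (case-insensitive)."""
    mac = mac.lower()
    parsed = [parse_fdb_line(l) for l in fdb_output.splitlines() if l.strip()]
    return [e for e in parsed if e.mac == mac]


# Illustrative output only; real entries and flags vary by kernel and driver.
SAMPLE = """\
aa:bb:cc:dd:ee:ff dev wan vlan 8 master br-lan
aa:bb:cc:dd:ee:ff dev wan vlan 8 self
33:33:00:00:00:01 dev wan self permanent
"""

for e in entries_for("AA:BB:CC:DD:EE:FF", SAMPLE):
    print(e.attrs.get("dev"), e.attrs.get("vlan"), e.flags)
# prints:
# wan 8 []
# wan 8 ['self']
```

On the AP itself, the text could come from `subprocess.run(["bridge", "fdb", "show"], capture_output=True, text=True).stdout`. An entry still pointing at `wan` after the client has roamed to the local Wi-Fi is the stale one, which `bridge fdb del` can remove before it times out.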

Thanks

Thibaut