Bridge-vlan bug? (mt7621/DSA)

Mark Mentovai mark at mentovai.com
Fri Aug 5 15:50:05 PDT 2022


Thibaut wrote:
> I’m experiencing a strange bug on Yuncore AX820 (mt7621/mt7905/mt7975, 
> DSA-enabled) when using a bridge-vlan setup. This bug affects at least 
> OpenWRT 22.03.0-rc6.
>
> I’m not sure whether this bug is related to this particular SoC or only 
> to DSA as I was unable to test with another DSA-enabled device (I don’t 
> have any). However this bug does not affect e.g. QCA non-DSA devices.
>
> I’m running out of ideas on how to further debug this problem, so feel 
> free to guide me if more information is needed. Please CC-me in replies.

This sounds very similar to the problem I experienced with the 
work-in-progress DSA patches for ipq40xx:

https://github.com/openwrt/openwrt/pull/4721#issuecomment-971162067

This kernel patch explains the situation fairly well:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5f19486cee79d04c054427577ac96ed123706db

But the fix isn’t operative unless the switch driver opts in via 
assisted_learning_on_cpu_port. There were also comments from around that 
time that there may still be trouble with untagged traffic.

There’s a bit of discussion about this issue in the comments around there 
on the pull request. Hopefully you’ll find it helpful. It should at least 
get you oriented in the right direction, even if it’s not a fix for your 
untagged use case.

> == Hardware setup ==
>
> - 1 router (any router works for the purpose of the test), serving DHCP 
> on the LAN (the default configuration from a fresh OpenWRT install works 
> to reproduce this bug - the router setup plays no part in the bug).
> - 1 AX820 set up as a « dumb » AP (testcase config provided below, using a 
> bridge-vlan), with one uplink interface (here ‘wan’) directly connected 
> to the router
> - 1 other AP, make/model irrelevant, provided it has the same dumb 
> config as the AX820 and is also directly connected to the router
>
> The APs use a single bridge-vlan to which their interfaces are hooked: 
> in the full scenario multiple VLANs are assigned to the bridge, and 
> assigned to separate SSIDs. All but one of the VLANs are tagged on the uplink 
> interface. The reduced test case config provided below uses a single 
> untagged VLAN (id 8, for network ‘lan’) and a single SSID: that is 
> enough to expose the bug.
>
>
> == Bug description ==
>
> The following bug happens on the untagged VLAN on the uplink interface 
> (see testcase config below):
>
> When a client device roams to the AX820 AP (which can be forced by 
> issuing « wifi off » on the other AP when the client is connected to 
> it), a « blackout » period that typically lasts 2-5 minutes begins, 
> during which the client loses connectivity.

The stale entries persist in the FDB with a 5-minute timeout, so this 
aligns. You can use “bridge fdb show” to see this happening, and “bridge 
fdb del” to delete entries before they time out. This comment and the 
gists linked in the one after have more information on a test environment:

https://github.com/openwrt/openwrt/pull/4721#issuecomment-974911742
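
As a rough sketch of the “bridge fdb show” / “bridge fdb del” step (this 
is not taken from the pull request), a small shell helper along these 
lines can turn the show output into candidate delete commands for one 
client. The function name stale_fdb_cmds and the MAC address in the usage 
line are made up for illustration. It assumes the usual 
“MAC dev DEV [vlan VID] [master BRIDGE] [self] ...” line layout, and it 
only prints commands so they can be reviewed first. Whether the stale 
entry shows up as “master” or “self” likely depends on the driver, so the 
helper simply reuses whatever the show output reports:

```shell
#!/bin/sh
# stale_fdb_cmds MAC
# Hypothetical helper: reads `bridge fdb show` output on stdin and prints
# one `bridge fdb del` command per non-permanent entry for MAC.
# It never runs `bridge` itself; review the output before executing it.
stale_fdb_cmds() {
    awk -v mac="$(printf '%s' "$1" | tr 'A-F' 'a-f')" '
        tolower($1) == mac {
            dev = ""; vlan = ""; flags = ""
            for (i = 2; i <= NF; i++) {
                if ($i == "dev")            dev = $(++i)
                else if ($i == "vlan")      vlan = " vlan " $(++i)
                # "show" prints "master BRIDGE"; "del" takes a bare "master"
                else if ($i == "master")  { flags = flags " master"; i++ }
                else if ($i == "self")      flags = flags " self"
                # leave permanent (local) entries alone
                else if ($i == "permanent") next
            }
            if (dev != "") print "bridge fdb del " $1 " dev " dev vlan flags
        }'
}
```

On the AP, something like “bridge fdb show | stale_fdb_cmds 
aa:bb:cc:dd:ee:ff” (with the roaming client’s real MAC) would list the 
candidate deletions. An entry for the wireless client that still points 
at the uplink port (“wan” in the testcase config) after the client has 
roamed is the one to look for.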

Mark

> == Analysis ==
>
> Running tcpdump, one obvious symptom is that the client emits DHCP 
> requests which are received by the router, the router sends back DHCP 
> replies (confirmed via tcpdump on router) but these replies never reach 
> the client during the blackout period.
>
> In fact, running tcpdump on all of the connected AP’s interfaces 
> (wireless (wlan0), DSA slave (wan), DSA master (eth0), 8021q (vlan-lan), 
> bridge-vlan (vbridge0)) shows no DHCP reply ever being captured during 
> this blackout period, until one finally makes it through when the 
> blackout ends.
>
>
> == Known unaffected scenarios ==
>
> If the VLAN is configured tagged on the uplink interface (using « list 
> ports ‘wan:t’ ») - and the router is set up to use tagged frames as well 
> of course - the bug does not occur.
>
> If the slaves are configured with regular ‘br-lan’ bridges (no vlans), 
> the bug does not occur: it seems tied to using a bridge-vlan.
>
> « roaming » a wired device from one AP to the other (through the free 
> ethernet port configured untagged, see testcase config below) does not 
> trigger the bug: wireless seems a key part of this problem.
>
> Finally, the exact same AP configuration used on non-DSA QCA9533-based 
> devices works flawlessly.
>
>
> == Other remarks ==
>
> To decode DSA master interface (eth0) captures, I used editcap (from 
> Wireshark) as follows:
> $ editcap -L -T ether -C 12:4 dsamaster.cap master.cap
> this removes the mtk DSA tags that libpcap cannot parse.
>
>
> == Reduced testcase AP configs ==
>
> /etc/config/network (loopback config not quoted, adjust ipaddr for each 
> AP):
>
> config interface 'lan'
> 	option proto 'static'
> 	option netmask '255.255.255.0'
> 	option ip6assign '60'
> 	option device 'vlan-lan'
> 	option ipaddr '192.168.1.2'
> 	option gateway '192.168.1.1'
> 	option dns '192.168.1.1'
>
> config device
> 	option type 'bridge'
> 	option name 'vbridge0'
> 	option ipv6 '0'
> 	option vlan_filtering '1'
> 	list ports 'lan'
> 	list ports 'wan'
>
> config device
> 	option type '8021q'
> 	option ifname 'vbridge0'
> 	option vid '8'
> 	option name 'vlan-lan'
> 	option ipv6 '0'
>
> config bridge-vlan
> 	option vlan '8'
> 	option device 'vbridge0'
> 	list ports 'wan:u*'
> 	list ports 'lan:u*'
>
>
> /etc/config/wireless (wifi-device not quoted):
>
> config wifi-iface 'radio0_test'
> 	option device 'radio0'
> 	option mode 'ap'
> 	option network 'lan'
> 	option ssid 'test'
> 	option encryption 'none'


