[OpenWrt-Devel] [RFC] [PATCH v4] lantiq: IRQ balancing, ethernet driver, wave300

Mon Mar 25 21:23:23 EDT 2019

On 3/26/19 1:24 AM, Petr Cvek wrote:
> 
> 
> On 26. 03. 19 at 0:45, Hauke Mehrtens wrote:
>> On 3/26/19 12:24 AM, Hauke Mehrtens wrote:
>>> Hi Petr
>>>
>>> On 3/14/19 6:46 AM, Petr Cvek wrote:
>>>> Hello again,
>>>>
>>>> I've managed to enhance a few drivers for the lantiq platform. They are
>>>> still in an ugly, commented form (the ethernet part especially), but I
>>>> need some hints before the final version. The patches are based on
>>>> kernel 4.14.99. Copy them into target/linux/lantiq/patches-4.14 (with
>>>> any of my previous patches removed).
>>>
>>> Thanks for working on this.
>>>
>>>> The eth+irq speedup is up to 360/260 Mbps (the vanilla was 170/80 on my
>>>> setup). The iperf3 benchmark results (2 passes for both the vanilla and
>>>> changed versions), together with the script, are in the attachment.
>>>>
>>>> 1) IRQ with SMP and balancing support:
>>>>
>>>> 	0901-add-icu-smp-support.patch
>>>> 	0902-enable-external-irqs-for-second-vpe.patch
>>>> 	0903-add-icu1-node-for-smp.patch
>>>>
>>>> As requested I've changed the patch heavily. The original locking from
>>>> the k3b source code (probably from UGW) didn't work, and under heavy load
>>>> the system could freeze (smp affinity change during irq handling). This
>>>> version fixes that by using generic raw spinlocks with irq.
>>>>
>>>> SMP IRQ handling now switches the VPE before every irq_enable (which
>>>> also serves as unmask). This can be limited by writing into
>>>> /proc/irq/X/smp_affinity (so it can possibly be balanced from
>>>> userspace too).
>>>>
>>>> I've rewritten the device tree reg fields so there are only 2 arrays
>>>> now, one per ICU controller. The original one-per-module layout was
>>>> redundant as the ranges were contiguous. The modules of a single ICU are
>>>> now explicitly computed in a macro:
>>>>
>>>> 	ltq_w32((x), ltq_icu_membase[vpe] + m*0x28 + (y))
>>>> 	ltq_r32(ltq_icu_membase[vpe] + m*0x28 + (x))
>>>>
>>>> Before, there was a pointer for every 0x28 block (there shouldn't be
>>>> a slowdown, only a multiplication and an addition for every
>>>> register access).
>>>>
>>>> Also I've simplified register names from LTQ_ICU_IM0_ISR to LTQ_ICU_ISR
>>>> as "IM0" (module) was confusing (the real module number 0-4 was a part
>>>> of the macro).
>>>>
>>>> The code is written so that it should work fine on a uniprocessor
>>>> configuration (as the for_each_present_cpu etc. macros will cycle over a
>>>> single VPE on uniprocessor). I haven't tested a build without CONFIG_SMP
>>>> yet, but I did check it with the "nosmp" kernel parameter. It works.
>>>>
>>>> Anyway please test if you have the board where the second VPE is used
>>>> for FXS.
>>>>
>>>> The new device tree structure is now incompatible with the old version
>>>> of the driver (and the old device tree with the new driver too). It seems
>>>> the icu driver is used in the Danube, AR9, AmazonSE and Falcon chipsets
>>>> too. I don't know the hardware of these boards, so before a final patch
>>>> I would like to know if they have a second ICU too (at 0x80300 offset).
>>>
>>> Normally the device tree should stay stable, but I already thought about
>>> the same change and I am not aware of any device that ships a U-Boot with
>>> an embedded device tree, so this should be fine.
>>>
>>> The Amazon and Amazon SE only have one ICU block because they only have
>>> one CPU with one VPE.
>>> The Danube SoC has two ICU blocks one for each CPU, each CPU only has
>>> one VPE. The CPUs are not cache coherent, SMP is not possible.
>>>
>>> Falcon, AR9, VR9, AR10, ARX300, GRX300, GRX330 have two ICU blocks one
>>> for each VPE of the single CPU.
>>> GRX350 uses a MIPS InterAptiv CPU with a MIPS GIC.
>>>
>>>> More development could probably be done on the locking, as only
>>>> accesses within a single module (= 1 set of registers) would cause a
>>>> race condition. But as the most contended interrupts are in the same
>>>> module there won't be much speed increase IMO. I can add it if requested
>>>> (just a spinlock array and some lookup code).
>>>
>>> I do not think that this improves the performance significantly, I
>>> assume that the CPUs only have to wait there in rare conditions anyway.
>>>
>>>> 2) Reworked lantiq xrx200 ethernet driver:
>>>>
>>>> 	0904-backport-vanilla-eth-driver.patch
>>>> 	0905-increase-dma-descriptors.patch
>>>> 	0906-increase-dma-burst-size.patch
>>>>
>>>> The code is still ugly, but stable now. There is fragmented skb
>>>> support and napi polling. The DMA ring buffer was increased so it can
>>>> handle faster speeds, and I've fixed some code weirdness. I can split
>>>> the changes into separate patches in the future.
>>>
>>> It would be nice if you could also do the same changes to the upstream
>>> driver in mainline Linux kernel and send this for inclusion to mainline
>>> Linux.
>>>
>>>> I didn't test the ICU and eth patches separately, but I've tested the
>>>> ethernet driver on a single VPE only (by setting smp affinity and
>>>> nosmp). This version of the ethernet driver was used for root over NFS
>>>> on the debug setup for about two weeks (without problems).
>>>>
>>>> Tell me if we should pursue the way for the second DMA channel to PPE so
>>>> both VPEs can send frames at the same time.
>>>
>>> I think it should be ok to use both DMA channels for the CPU traffic.
>>>
>>>> 3) WAVE300
>>>>
>>>> In the past two weeks I've tried to fix and mash together various
>>>> versions of the wave300 wifi driver (there are partial versions in GPL
>>>> sources from router vendors). And I've managed to put the driver into
>>>> "not immediately crashing" mode. If you are interested in the
>>>> development, there is a thread in the openwrt forum. The source repos
>>>> are here:
>>>>
>>>> https://repo.or.cz/wave300.git
>>>> https://repo.or.cz/wave300_rflib.git
>>>>
>>>> (the second one must be copied into the first one)
>>>>
>>>> The driver will often crash when meeting an unknown packet, a request for
>>>> encryption (no encryption support), an unusual combination of
>>>> configuration, or just on module unloading. The code is _really_ ugly and
>>>> it will serve only as a hardware specification for better GPL driver
>>>> development. If you want to help or you have some tips you can join the
>>>> forum (there are links to firmwares and extensive research of available
>>>> source code from vendors).
>>>>
>>>> Links:
>>>> https://forum.openwrt.org/t/support-for-wave-300-wi-fi-chip/24690/129
>>>> https://forum.openwrt.org/t/how-can-we-make-the-lantiq-xrx200-devices-faster/9724/70
>>>> https://forum.openwrt.org/t/xrx200-irq-balancing-between-vpes/29732/25
>>>>
>>>> Petr
>>> Hauke
>>
> 
> Hi
> 
>> It would be nice if you could send your patches as single mails and
>> inline so I can easily comment on them.
> 
> OK
> 
>>
>> The DMA handling in the OpenWrt Ethernet driver is only more flexible in
>> that it can handle an arbitrary number of DMA channels, but I think this
>> is not needed.
>>
>> The DMA memory is already 16 byte aligned, see the byte_offset variable
>> in xmit, so it should not be a problem to use the 4W DMA mode, I assume
>> that the hardware also takes care of this.
>>
> 
> Yes it is 16 byte aligned in the original driver, but my patched driver
> is using 32 byte alignment (8W DMA mode). Using 32B bursts with 16B
> alignment caused crashing.
> 
>> Why are the changes in arch/mips/kernel/smp-mt.c needed? This looks
>> strange to me.
>>
> 
> That is interrupt masking. IP0 and IP1 are (I think) software interrupts
> for IPI communication, IP6/7 are the timer (and something else), and the
> IP2-IP5 range, which is not enabled there, carries the external IRQ
> signals for the ICU. Without this set, the second VPE only receives IPIs
> and not ICU events.
>
> Basically I've set this MIPS C0 Status register to the same value as the
> C0 Status register for the first VPE.

Hmm, strange. It looks like there are not many SoCs with multiple VPEs
which have their own IRQ controller.
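
As a rough illustration of the mask discussed above, here is a minimal
sketch in plain C. It assumes the standard MIPS32 layout where the IM0-IM7
bits of the C0 Status register sit at bits 8-15. The macro and function
names are made up for this example and are not the ones used in smp-mt.c
or in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; IMn at bit 8+n follows the MIPS32 Status layout. */
#define ST_IM_SHIFT 8
#define ST_IP(n)    (1u << (ST_IM_SHIFT + (n)))

/* IP2..IP5: the lines that carry the external ICU interrupts in this thread. */
#define ST_IP_ICU   (ST_IP(2) | ST_IP(3) | ST_IP(4) | ST_IP(5))

/* Return the Status value with IP2..IP5 unmasked on top of what was set. */
uint32_t status_unmask_icu(uint32_t status)
{
	return status | ST_IP_ICU;
}

/* Check whether interrupt line n (0..7) is unmasked in a Status value. */
int status_ip_enabled(uint32_t status, unsigned int n)
{
	assert(n < 8);
	return (status & ST_IP(n)) != 0;
}
```

With only the IPI lines (IP0/IP1) set, status_ip_enabled() reports the ICU
lines as masked, which matches the "second VPE only receives IPI" symptom.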

>> Changing LTQ_DMA_CPOLL could affect the latency of the system, but I
>> think your increase should not harm significantly.
> 
> Yeah I've tested it, there is some minor impact on the maximal
> bandwidth. However I cannot set the value correctly without the model of
> xrx200 SoC (I assume this register controls the check frequency of the
> OWN bit of the first descriptor).

Yes this is the polling frequency in fDMA/16, this value is global and
not per channel. The DMA controller will check the OWN bit on all
descriptors for all DMA channels where polling is activated with this
frequency. fDMA is the same as the FPI frequency, probably 250MHz.
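
To get a feeling for the numbers, a small back-of-the-envelope sketch in C
follows. It rests on two assumptions that are not confirmed here: that
fDMA really is 250 MHz, and that the poll counter counts ticks of fDMA/16
(so one tick is 16 fDMA cycles). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: fDMA equals the FPI clock, "probably 250 MHz". */
#define ASSUMED_FDMA_MHZ 250ULL

/* Assumption: one poll counter tick is 16 fDMA cycles. */
uint64_t cpoll_period_ns(uint32_t poll_count)
{
	assert(poll_count > 0);
	/* cycles * 1000 / MHz gives nanoseconds */
	return (uint64_t)poll_count * 16ULL * 1000ULL / ASSUMED_FDMA_MHZ;
}

/* Time one frame occupies the wire at a given link speed, in nanoseconds. */
uint64_t frame_time_ns(uint32_t frame_bytes, uint64_t link_bps)
{
	assert(link_bps > 0);
	return (uint64_t)frame_bytes * 8ULL * 1000000000ULL / link_bps;
}
```

Under these assumptions a counter value of 4 would mean one poll every
256 ns, while a 1500 byte frame at 1 Gbit/s takes 12000 ns on the wire.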

> I don't even know the clock and width
> of the bus between DMA and RAM (or between DMA and ethernet FIFO). But
> if the original value DMA_CLK_DIV4 means "every fourth clock" it seems
> too frequent to me (if a packet has around 1500 bytes, it would check many
> times before the packet is transferred). The highest values empirically
> make the DMA descriptor ring lag.

The DMA controller uses a 32 bit wide data path to the RAM and 28 bit
word addresses, a word for the DMA controller is 32 bit.

The DMA controller can handle some priorities between the ports and
channels. When you activate PKTARB (BIT(31)) in DMA_CTRL the DMA
controller will transfer the complete packet before the arbitration is
changed. With MBRSTCNT (bits 25:16) in DMA_CTRL you can control after how
many bursts the arbitration should be changed, when MBRSTARB (BIT(30)) in
DMA_CTRL is activated. Both apply to TX and RX.
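
The bit positions above could be written down roughly like this. It is
only a sketch: the macro and helper names are invented for illustration
and do not come from the existing driver headers, and it only computes
the register value without touching hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical macro names; bit positions as described above. */
#define DMA_CTRL_PKTARB          (1u << 31)  /* finish packet before re-arbitration */
#define DMA_CTRL_MBRSTARB        (1u << 30)  /* enable burst-count arbitration */
#define DMA_CTRL_MBRSTCNT_SHIFT  16
#define DMA_CTRL_MBRSTCNT_MASK   (0x3ffu << DMA_CTRL_MBRSTCNT_SHIFT) /* bits 25:16 */

/* Return ctrl with per-packet arbitration switched on. */
uint32_t dma_ctrl_packet_arb(uint32_t ctrl)
{
	return ctrl | DMA_CTRL_PKTARB;
}

/* Return ctrl with arbitration changing after 'bursts' bursts. */
uint32_t dma_ctrl_burst_arb(uint32_t ctrl, uint32_t bursts)
{
	assert(bursts <= 0x3ffu);            /* field is 10 bits wide */
	ctrl &= ~DMA_CTRL_MBRSTCNT_MASK;     /* clear the old count */
	ctrl |= bursts << DMA_CTRL_MBRSTCNT_SHIFT;
	return ctrl | DMA_CTRL_MBRSTARB;
}
```

In a driver the result would be written back with the usual register
accessors after reading the current DMA_CTRL value.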

Hauke
