sysupgrade broken on imx nand targets (and maybe others too)

Lanchon lanchon at gmail.com
Tue Jan 24 12:46:25 PST 2023



On 1/24/23 17:35, Lanchon wrote:
>
>
> On 1/24/23 13:25, Koen Vandeputte wrote:
>> On Tue, Jan 24, 2023 at 4:26 PM Koen Vandeputte
>> <koen.vandeputte at citymesh.com> wrote:
>>> On Tue, Jan 24, 2023 at 7:59 AM Lanchon <lanchon at gmail.com> wrote:
>>>> hi Koen, thanks again.
>>>>
>>>> i copied your log here for ease of reference:
>>>> https://gist.github.com/Lanchon/f24ea9c16eda5ffaa5085a7b240238db
>>>>
>>>>
>>>> first let me say:
>>>>
>>>> - ubinized sysupgrade is not used by any of my devices.
>>>>
>>>> - ubinized sysupgrade happens when when an ubi partition image is 
>>>> fed as
>>>> an upgrade file. the image contains the complete set of ubi volumes 
>>>> that
>>>> are normally stored within the ubi partition on your device: typically
>>>> kernel (raw image), R/O rootfs (sqfs), and R/W overlay (ubifs). during
>>>> said sysupgrade, the current configuration is first copied to RAM, 
>>>> then
>>>> the ubi partition image is written, and finally -if current config is
>>>> kept- the RAM contents are written back to the new overlay volume.
>>>>
>>>> - ubinized sysupgrades were known to be broken by other people at the
>>>> time of my commits, and they wanted to remove the feature altogether
>>>> because it was unused (look it up in the relevant pull request for my
>>>> commits on github). as i remember it, it was broken because some ubi
>>>> volumes within the ubi partition were sometimes mounted or R/O block
>>>> devices were created on top of them (/dev/ubiblock*), locking the ubi
>>>> partition and preventing the upgrade.
>>>>
>>>> - although my devices would normally not use such upgrades, i could
>>>> still take a whole ubi partition backup and then test ubinized
>>>> sysupgrade with it on my devices. in fact, if you restore the ubi
>>>> partition image without conserving the current configuration, then 
>>>> this
>>>> procedure is the best way to do a backup/restore of the complete state
>>>> of the router: kernel, rootfs, and overlay are completely saved and
>>>> restored. btw, i think this should be documented. (and this is mostly
>>>> the reason why i added gzip support on sysupgrade: ubinized images of
>>>> backed-up ubi partitions compress like crazy.)
>>>>
>>>> - my tests of ubinized sysupgrade worked after these changes but not
>>>> before. specifically the fix is in: af34733593 base-files: fix 
>>>> ubinized
>>>> nand sysupgrade
>>>>
>>>>
>>>> regarding your log:
>>>>
>>>> - nand_do_platform_check succeeds:
>>>> https://github.com/openwrt/openwrt/blob/ac21dff5b67698c09f54a4b98d6f9f12af17edd4/package/base-files/files/lib/upgrade/nand.sh#L438-L469 
>>>>
>>>> https://gist.github.com/Lanchon/f24ea9c16eda5ffaa5085a7b240238db#file-imx6dl-gw52xx-ubinized-sysupgrade-log-txt-L192 
>>>>
>>>>
>>>> - next comes the actual nand_do_flash_file:
>>>> https://github.com/openwrt/openwrt/blob/ac21dff5b67698c09f54a4b98d6f9f12af17edd4/package/base-files/files/lib/upgrade/nand.sh#L379-L405 
>>>>
>>>> https://gist.github.com/Lanchon/f24ea9c16eda5ffaa5085a7b240238db#file-imx6dl-gw52xx-ubinized-sysupgrade-log-txt-L2061 
>>>>
>>>>
>>>> - it is determined that passed file is a ubi partition image, so
>>>> nand_upgrade_ubinized is invoked:
>>>> https://github.com/openwrt/openwrt/blob/ac21dff5b67698c09f54a4b98d6f9f12af17edd4/package/base-files/files/lib/upgrade/nand.sh#L260-L269 
>>>>
>>>> https://gist.github.com/Lanchon/f24ea9c16eda5ffaa5085a7b240238db#file-imx6dl-gw52xx-ubinized-sysupgrade-log-txt-L2088 
>>>>
>>>>
>>>> nand_upgrade_ubinized is basically a one-liner:
>>>> ${gz}cat "$ubi_file" | ubiformat "/dev/mtd$mtdnum" -y -f - && 
>>>> ubiattach
>>>> -m "$mtdnum"
>>>>
>>>> cat/zcat the image, feeding that to ubiformat -f - which writes it.
>>>>
>>>>
>>>> and the write does take place, but take a look:
>>>>
>>>> -------------------
>>>>
>>>> + cat /tmp/nandnew.ubi
>>>> + ubiformat /dev/mtd2 -y -f -
>>>> ubiformat: mtd2 (nand), size 250609664 bytes (239.0 MiB), 1912
>>>> eraseblocks of 131072 bytes (128.0 KiB), min. I/O size 2048 bytes
>>>>
>>>> libscan: scanning eraseblock 0 --  0 % complete
>>>> libscan: scanning eraseblock 1 --  0 % complete
>>>> libscan: scanning eraseblock 2 --  0 % complete
>>>>      ...
>>>> libscan: scanning eraseblock 1868 -- 97 % complete
>>>>
>>>> libscan: scanning eraseblock
>>>> [  207.876200] ci_hdrc ci_hdrc.1: remove, state 1
>>>> 1869 -- 97 % complete
>>>>
>>>> libscan:
>>>> [  207.883339] usb usb2: USB disconnect, device number 1
>>>> scanning eraseblock 1870 -- 97 %
>>>> [  207.891238] usb 2-1: USB disconnect, device number 2
>>>> complete
>>>>
>>>> libscan: scanning eras
>>>> [  207.901522] ci_hdrc ci_hdrc.1: USB bus 2 deregistered
>>>> eblock 1871 -- 97 % complete
>>>>
>>>> libscan: scanning eraseblock 1872
>>>> [  207.910396] ci_hdrc ci_hdrc.0: remove, state 4
>>>>    -- 97 % complete
>>>>
>>>> libscan: scan
>>>> [  207.917055] usb usb1: USB disconnect, device number 1
>>>> ning eraseblock 1873 -- 98 % comp
>>>> [  207.925651] ci_hdrc ci_hdrc.0: USB bus 1 deregistered
>>>> lete
>>>>
>>>> libscan: scanning eraseblo
>>>> [  207.934010] imx2-wdt 20bc000.watchdog: Device shutdown: Expect 
>>>> reboot!
>>>> ck 1874 -- 98 % complete
>>>>
>>>> libsca
>>>> [  207.942382] reboot: Restarting system
>>>>
>>>> -----------------------
>>>>
>>>>
>>>> while sysupgrade is flashing UBI the partition, the system is 
>>>> rebooted;
>>>> i don't know why.
>>>>
>>>> here are the cleaned-up kernel messages:
>>>>
>>>> [  207.876200] ci_hdrc ci_hdrc.1: remove, state 1
>>>> [  207.883339] usb usb2: USB disconnect, device number 1
>>>> [  207.891238] usb 2-1: USB disconnect, device number 2
>>>> [  207.901522] ci_hdrc ci_hdrc.1: USB bus 2 deregistered
>>>> [  207.910396] ci_hdrc ci_hdrc.0: remove, state 4
>>>> [  207.917055] usb usb1: USB disconnect, device number 1
>>>> [  207.925651] ci_hdrc ci_hdrc.0: USB bus 1 deregistered
>>>> [  207.934010] imx2-wdt 20bc000.watchdog: Device shutdown: Expect 
>>>> reboot!
>>>> [  207.942382] reboot: Restarting system
>>>>
>>>>
>>>> next the reboot takes place. u-boot mounts what it knows as mtd3 as an
>>>> ubi partition:
>>>> https://gist.github.com/Lanchon/f24ea9c16eda5ffaa5085a7b240238db#file-imx6dl-gw52xx-ubinized-sysupgrade-log-txt-L2183 
>>>>
>>>>
>>>> but openwrt used mtd2 to write to ubi, not mtd3. i don't know if in 
>>>> your
>>>> device what openwrt calls mtd2 is called mtd3 by your current 
>>>> (default)
>>>> u-boot config.
>>>>
>>>>
>>>> so some takeaways...
>>>>
>>>> - something is rebooting the system while i write the sysupgrade 
>>>> image.
>>>> maybe another thread? maybe an unattended watchdog?
>>>>
>>>> - the reboot happens while 98% of the image is written. maybe this 
>>>> issue
>>>> is time dependent and only shows up with my code because it is a 
>>>> little
>>>> slower than the previous code. maybe the previous code reached 100%
>>>> before being rebooted and thus the upgrade went through.
>>>>
>>>> - the image *IS* written, albeit partially, which means that the
>>>> previous image that was there is definitely erased. what is booted 
>>>> then?
>>>> IDK, it depends on the details of your device. maybe it is booting a
>>>> recovery image? or maybe the ubi partition format is not finished, but
>>>> the ubi volumes within are fully written before the reboot, so the
>>>> system doesn't brick by chance. (but then the newer image would be
>>>> booted, but you say that the sysupgrade has no effect and the prior
>>>> image is booted instead, so that can't be it.) maybe two ubi 
>>>> partitions
>>>> are used on your device to implement an A/B dual system boot. so 
>>>> maybe a
>>>> flag needs to be toggled to switch between A and B after the image is
>>>> written, but since that code is never executed, the previous system is
>>>> booted instead.
>>>>
>>>> - several issues cropped up with a set of sysupgrade changes i did
>>>> (among them, these you mention here). there are many device types and
>>>> several sysupgrade mechanisms with their own files, and then some 
>>>> common
>>>> files. i assumed other types of upgrades would invoke common routines
>>>> but not -for instance- nand flash routines. i was wrong: the 
>>>> codebase is
>>>> spaghetti calling any file form any sysupgrade method, and this 
>>>> caused a
>>>> couple of issues with my nand sysupgrade changes. i don't think 
>>>> this is
>>>> one of those instances though. i think that the sysupgrade code is 
>>>> doing
>>>> the right thing and the fault is elsewhere, but i may be wrong. 
>>>> without
>>>> knowledge of your device and without a device to test, it is hard 
>>>> to tell.
>>>>
>>>>
>>>> maybe we should could call the attention of the device maintainer to
>>>> this thread now?
>>>>
>>>>
>>>> thanks!
>>>>
>>>>
>>> Hi,
>>>
>>> Many thanks for the very detailed reply.
>>> It aided hugely in debugging this further.
>>>
>>> I'm happy to share that I found the culprit and it works nicely 
>>> again now.
>>>
>>> within nand.sh:
>>>
>>>
>>> nand_upgrade_ubinized() {
>>>          local ubi_file="$1"
>>>          local gz="$2"
>>>
>>>          nand_detach_ubi "$CI_UBIPART" || return 1
>>>
>>>          local mtdnum="$( find_mtd_index "$CI_UBIPART" )"
>>>          ${gz}cat "$ubi_file" | ubiformat "/dev/mtd$mtdnum" -y -f - &&
>>> ubiattach -m "$mtdnum"
>>> }
>>>
>>> I changed the last line to:
>>>
>>>          ubiformat "/dev/mtd$mtdnum" -y -f "$ubi_file" && ubiattach 
>>> -m "$mtdnum"
>>>
>>>
>>> Although ubiformat indeed states "-f -" to use stdinput, it does not
>>> seem to work.
>>> Writing the image seemed to be simply skipped using that method.
>>>
>>> actually presenting the absolute filepath to "-f" fixed it.
>>>
>>> Any thoughts?
>>>
>>> Thanks again for your prompt reply,
>>>
>>> Koen
>>>
>>>
>
> hi,
>
> my answers inline...
>
>> Summarized after more testing:
>>
>> --> ${gz}cat "$ubi_file" | ubiformat "/dev/mtd$mtdnum" -y -f - &&
>> ubiattach -m "$mtdnum"
>> Does not work.
> it works fine on other openwrt platforms. also, the commands cat and 
> ubiformat are mature and stable and they work fine on your platform 
> too. i am repeating myself, but according to your posted log, 
> something is rebooting the system before the command ends.
>
> from your posted log:
>
> -----
> libscan: scan[  207.917055] usb usb1: USB disconnect, device number 1
> ning eraseblock 1873 -- 98 % comp[  207.925651] ci_hdrc ci_hdrc.0: USB 
> bus 1 deregistered
> lete
>
> libscan: scanning eraseblo[  207.934010] imx2-wdt 20bc000.watchdog: 
> Device shutdown: Expect reboot!
> ck 1874 -- 98 % complete
>
> // check this out
> //    |
> //    v
> libsca[  207.942382] reboot: Restarting system
> \FE
> U-Boot SPL 2020.04-dirty (May 20 2020 - 12:35:32 +0200)
> Booting from NAND
> -----
>
> as you can see, the process is interrupted in the middle of printing a 
> status line beginning with "libsca". there is no way that ubiformat 
> will fail without a diagnostic message, much less fail in the middle 
> of printing a diagnostics message.
>
> this is a platform issue and you will not fix it by changing this code.
>
>> Looks like ubiformat gets empty/invalid input from stdin, writing the
>> image seems to be skipped completely.
>> The board simply boots again in it's original state without any changes
> possibly an A/B dual boot system, possibly with dual ubi partitions. 
> IDK, but the platform maintainer knows.
>
>> I confirmed that this command expands to "cat" and not "zcat"
> yes, this is visible in your log.
>
>>
>>
>> --> ubiformat "/dev/mtd$mtdnum" -y -f "$ubi_file" && ubiattach -m 
>> "$mtdnum"
>> Works fine
>> image is written as expected and boots properly using the newly 
>> written image
> possibly because it is a few percent faster than piping.
>
> IMHO, this is an extremely brittle workaround, not a fix. as soon as 
> your flash becomes slower to erase or write or the openwrt image grows 
> a bit, the problem will reappear.
>
>>
>>
>> --> ${gz}cat "$ubi_file" | ubiformat "/dev/mtd$mtdnum" -y && ubiattach
>> -m "$mtdnum"
>> Omitting "-f -" here
>> 'cat' throws a "Broken Pipe" error during the flash process.
> this is expected. without "-f -" ubiformat will not open stdin and 
> then cat has nowhere to pipe to.
>
>> The data is written, but the image is corrupted and does not boot 
>> afterwards.
> no. without -f the ubi partition is formatted but nothing is written 
> onto it.
>> Is this error/corruption not present on other targets ??
>>
>>
>> So unless anyone has other ideas .. it looks for 'ubinized' the file
>> needs to be provided as-is, and not using stdin.
>> This drops support for gzip'd files .. but it does not seem to be
>> gzipped anyway,  at least on my target.
> the other idea is that the platform maintainer needs to be involved to 
> track down what is rebooting the device in the middle of an update.
>
> i suppose this problem is platform-specific. in that case it might be 
> a watchdog on your platform. my suspicion: the normal userland is 
> brought down before sysupgrade (as rootfs is not available) and thus 
> userland tickle of a watchdog ceases to happen. the watchdog expires 
> and reboots the system mid upgrade. the correct fix is disabling the 
> watchdog before sysupgrade. alternatively, the userland tickle could 
> be kept working during sysupgrade.
>
> in particular, note this log line just 8 milliseconds before the reboot:
> [  207.934010] imx2-wdt 20bc000.watchdog: Device shutdown: Expect reboot!
> [  207.942382] reboot: Restarting system
>
> i don't know nor have the platform, so please contact the maintainer.
>
>>
>> Regards,
>>
>> Koen
>
>
> thanks!


could you collect more logs during sysupgrade? let's see if the reboot 
happens always approximately at the same time.

also, you could you collect a full bootlog?

thanks!


More information about the openwrt-devel mailing list