sysupgrade broken on imx nand targets (and maybe others too)

Lanchon lanchon at gmail.com
Tue Jan 24 12:35:54 PST 2023



On 1/24/23 13:25, Koen Vandeputte wrote:
> On Tue, Jan 24, 2023 at 4:26 PM Koen Vandeputte
> <koen.vandeputte at citymesh.com> wrote:
>> On Tue, Jan 24, 2023 at 7:59 AM Lanchon <lanchon at gmail.com> wrote:
>>> hi Koen, thanks again.
>>>
>>> i copied your log here for ease of reference:
>>> https://gist.github.com/Lanchon/f24ea9c16eda5ffaa5085a7b240238db
>>>
>>>
>>> first let me say:
>>>
>>> - ubinized sysupgrade is not used by any of my devices.
>>>
>>> - ubinized sysupgrade happens when when an ubi partition image is fed as
>>> an upgrade file. the image contains the complete set of ubi volumes that
>>> are normally stored within the ubi partition on your device: typically
>>> kernel (raw image), R/O rootfs (sqfs), and R/W overlay (ubifs). during
>>> said sysupgrade, the current configuration is first copied to RAM, then
>>> the ubi partition image is written, and finally -if current config is
>>> kept- the RAM contents are written back to the new overlay volume.
>>>
>>> - ubinized sysupgrades were known to be broken by other people at the
>>> time of my commits, and they wanted to remove the feature altogether
>>> because it was unused (look it up in the relevant pull request for my
>>> commits on github). as i remember it, it was broken because some ubi
>>> volumes within the ubi partition were sometimes mounted or R/O block
>>> devices were created on top of them (/dev/ubiblock*), locking the ubi
>>> partition and preventing the upgrade.
>>>
>>> - although my devices would normally not use such upgrades, i could
>>> still take a whole ubi partition backup and then test ubinized
>>> sysupgrade with it on my devices. in fact, if you restore the ubi
>>> partition image without conserving the current configuration, then this
>>> procedure is the best way to do a backup/restore of the complete state
>>> of the router: kernel, rootfs, and overlay are completely saved and
>>> restored. btw, i think this should be documented. (and this is mostly
>>> the reason why i added gzip support on sysupgrade: ubinized images of
>>> backed-up ubi partitions compress like crazy.)
>>>
>>> - my tests of ubinized sysupgrade worked after these changes but not
>>> before. specifically the fix is in: af34733593 base-files: fix ubinized
>>> nand sysupgrade
>>>
>>>
>>> regarding your log:
>>>
>>> - nand_do_platform_check succeeds:
>>> https://github.com/openwrt/openwrt/blob/ac21dff5b67698c09f54a4b98d6f9f12af17edd4/package/base-files/files/lib/upgrade/nand.sh#L438-L469
>>> https://gist.github.com/Lanchon/f24ea9c16eda5ffaa5085a7b240238db#file-imx6dl-gw52xx-ubinized-sysupgrade-log-txt-L192
>>>
>>> - next comes the actual nand_do_flash_file:
>>> https://github.com/openwrt/openwrt/blob/ac21dff5b67698c09f54a4b98d6f9f12af17edd4/package/base-files/files/lib/upgrade/nand.sh#L379-L405
>>> https://gist.github.com/Lanchon/f24ea9c16eda5ffaa5085a7b240238db#file-imx6dl-gw52xx-ubinized-sysupgrade-log-txt-L2061
>>>
>>> - it is determined that passed file is a ubi partition image, so
>>> nand_upgrade_ubinized is invoked:
>>> https://github.com/openwrt/openwrt/blob/ac21dff5b67698c09f54a4b98d6f9f12af17edd4/package/base-files/files/lib/upgrade/nand.sh#L260-L269
>>> https://gist.github.com/Lanchon/f24ea9c16eda5ffaa5085a7b240238db#file-imx6dl-gw52xx-ubinized-sysupgrade-log-txt-L2088
>>>
>>> nand_upgrade_ubinized is basically a one-liner:
>>> ${gz}cat "$ubi_file" | ubiformat "/dev/mtd$mtdnum" -y -f - && ubiattach
>>> -m "$mtdnum"
>>>
>>> cat/zcat the image, feeding that to ubiformat -f - which writes it.
>>>
>>>
>>> and the write does take place, but take a look:
>>>
>>> -------------------
>>>
>>> + cat /tmp/nandnew.ubi
>>> + ubiformat /dev/mtd2 -y -f -
>>> ubiformat: mtd2 (nand), size 250609664 bytes (239.0 MiB), 1912
>>> eraseblocks of 131072 bytes (128.0 KiB), min. I/O size 2048 bytes
>>>
>>> libscan: scanning eraseblock 0 --  0 % complete
>>> libscan: scanning eraseblock 1 --  0 % complete
>>> libscan: scanning eraseblock 2 --  0 % complete
>>>      ...
>>> libscan: scanning eraseblock 1868 -- 97 % complete
>>>
>>> libscan: scanning eraseblock
>>> [  207.876200] ci_hdrc ci_hdrc.1: remove, state 1
>>> 1869 -- 97 % complete
>>>
>>> libscan:
>>> [  207.883339] usb usb2: USB disconnect, device number 1
>>> scanning eraseblock 1870 -- 97 %
>>> [  207.891238] usb 2-1: USB disconnect, device number 2
>>> complete
>>>
>>> libscan: scanning eras
>>> [  207.901522] ci_hdrc ci_hdrc.1: USB bus 2 deregistered
>>> eblock 1871 -- 97 % complete
>>>
>>> libscan: scanning eraseblock 1872
>>> [  207.910396] ci_hdrc ci_hdrc.0: remove, state 4
>>>    -- 97 % complete
>>>
>>> libscan: scan
>>> [  207.917055] usb usb1: USB disconnect, device number 1
>>> ning eraseblock 1873 -- 98 % comp
>>> [  207.925651] ci_hdrc ci_hdrc.0: USB bus 1 deregistered
>>> lete
>>>
>>> libscan: scanning eraseblo
>>> [  207.934010] imx2-wdt 20bc000.watchdog: Device shutdown: Expect reboot!
>>> ck 1874 -- 98 % complete
>>>
>>> libsca
>>> [  207.942382] reboot: Restarting system
>>>
>>> -----------------------
>>>
>>>
>>> while sysupgrade is flashing UBI the partition, the system is rebooted;
>>> i don't know why.
>>>
>>> here are the cleaned-up kernel messages:
>>>
>>> [  207.876200] ci_hdrc ci_hdrc.1: remove, state 1
>>> [  207.883339] usb usb2: USB disconnect, device number 1
>>> [  207.891238] usb 2-1: USB disconnect, device number 2
>>> [  207.901522] ci_hdrc ci_hdrc.1: USB bus 2 deregistered
>>> [  207.910396] ci_hdrc ci_hdrc.0: remove, state 4
>>> [  207.917055] usb usb1: USB disconnect, device number 1
>>> [  207.925651] ci_hdrc ci_hdrc.0: USB bus 1 deregistered
>>> [  207.934010] imx2-wdt 20bc000.watchdog: Device shutdown: Expect reboot!
>>> [  207.942382] reboot: Restarting system
>>>
>>>
>>> next the reboot takes place. u-boot mounts what it knows as mtd3 as an
>>> ubi partition:
>>> https://gist.github.com/Lanchon/f24ea9c16eda5ffaa5085a7b240238db#file-imx6dl-gw52xx-ubinized-sysupgrade-log-txt-L2183
>>>
>>> but openwrt used mtd2 to write to ubi, not mtd3. i don't know if in your
>>> device what openwrt calls mtd2 is called mtd3 by your current (default)
>>> u-boot config.
>>>
>>>
>>> so some takeaways...
>>>
>>> - something is rebooting the system while i write the sysupgrade image.
>>> maybe another thread? maybe an unattended watchdog?
>>>
>>> - the reboot happens while 98% of the image is written. maybe this issue
>>> is time dependent and only shows up with my code because it is a little
>>> slower than the previous code. maybe the previous code reached 100%
>>> before being rebooted and thus the upgrade went through.
>>>
>>> - the image *IS* written, albeit partially, which means that the
>>> previous image that was there is definitely erased. what is booted then?
>>> IDK, it depends on the details of your device. maybe it is booting a
>>> recovery image? or maybe the ubi partition format is not finished, but
>>> the ubi volumes within are fully written before the reboot, so the
>>> system doesn't brick by chance. (but then the newer image would be
>>> booted, but you say that the sysupgrade has no effect and the prior
>>> image is booted instead, so that can't be it.) maybe two ubi partitions
>>> are used on your device to implement an A/B dual system boot. so maybe a
>>> flag needs to be toggled to switch between A and B after the image is
>>> written, but since that code is never executed, the previous system is
>>> booted instead.
>>>
>>> - several issues cropped up with a set of sysupgrade changes i did
>>> (among them, these you mention here). there are many device types and
>>> several sysupgrade mechanisms with their own files, and then some common
>>> files. i assumed other types of upgrades would invoke common routines
>>> but not -for instance- nand flash routines. i was wrong: the codebase is
>>> spaghetti calling any file form any sysupgrade method, and this caused a
>>> couple of issues with my nand sysupgrade changes. i don't think this is
>>> one of those instances though. i think that the sysupgrade code is doing
>>> the right thing and the fault is elsewhere, but i may be wrong. without
>>> knowledge of your device and without a device to test, it is hard to tell.
>>>
>>>
>>> maybe we should could call the attention of the device maintainer to
>>> this thread now?
>>>
>>>
>>> thanks!
>>>
>>>
>> Hi,
>>
>> Many thanks for the very detailed reply.
>> It aided hugely in debugging this further.
>>
>> I'm happy to share that I found the culprit and it works nicely again now.
>>
>> within nand.sh:
>>
>>
>> nand_upgrade_ubinized() {
>>          local ubi_file="$1"
>>          local gz="$2"
>>
>>          nand_detach_ubi "$CI_UBIPART" || return 1
>>
>>          local mtdnum="$( find_mtd_index "$CI_UBIPART" )"
>>          ${gz}cat "$ubi_file" | ubiformat "/dev/mtd$mtdnum" -y -f - &&
>> ubiattach -m "$mtdnum"
>> }
>>
>> I changed the last line to:
>>
>>          ubiformat "/dev/mtd$mtdnum" -y -f "$ubi_file" && ubiattach -m "$mtdnum"
>>
>>
>> Although ubiformat indeed states "-f -" to use stdinput, it does not
>> seem to work.
>> Writing the image seemed to be simply skipped using that method.
>>
>> actually presenting the absolute filepath to "-f" fixed it.
>>
>> Any thoughts?
>>
>> Thanks again for your prompt reply,
>>
>> Koen
>>
>>

hi,

my answers inline...

> Summarized after more testing:
>
> --> ${gz}cat "$ubi_file" | ubiformat "/dev/mtd$mtdnum" -y -f - &&
> ubiattach -m "$mtdnum"
> Does not work.
it works fine on other openwrt platforms. also, the commands cat and 
ubiformat are mature and stable and they work fine on your platform too. 
i am repeating myself, but according to your posted log, something is 
rebooting the system before the command ends.

from your posted log:

-----
libscan: scan[  207.917055] usb usb1: USB disconnect, device number 1
ning eraseblock 1873 -- 98 % comp[  207.925651] ci_hdrc ci_hdrc.0: USB 
bus 1 deregistered
lete

libscan: scanning eraseblo[  207.934010] imx2-wdt 20bc000.watchdog: 
Device shutdown: Expect reboot!
ck 1874 -- 98 % complete

// check this out
//    |
//    v
libsca[  207.942382] reboot: Restarting system
\FE
U-Boot SPL 2020.04-dirty (May 20 2020 - 12:35:32 +0200)
Booting from NAND
-----

as you can see, the process is interrupted in the middle of printing a 
status line beginning with "libsca". there is no way that ubiformat will 
fail without a diagnostic message, much less fail in the middle of 
printing a diagnostics message.

this is a platform issue and you will not fix it by changing this code.

> Looks like ubiformat gets empty/invalid input from stdin, writing the
> image seems to be skipped completely.
> The board simply boots again in it's original state without any changes
possibly an A/B dual boot system, possibly with dual ubi partitions. 
IDK, but the platform maintainer knows.

> I confirmed that this command expands to "cat" and not "zcat"
yes, this is visible in your log.

>
>
> --> ubiformat "/dev/mtd$mtdnum" -y -f "$ubi_file" && ubiattach -m "$mtdnum"
> Works fine
> image is written as expected and boots properly using the newly written image
possibly because it is a few percent faster than piping.

IMHO, this is an extremely brittle workaround, not a fix. as soon as 
your flash becomes slower to erase or write or the openwrt image grows a 
bit, the problem will reappear.

>
>
> --> ${gz}cat "$ubi_file" | ubiformat "/dev/mtd$mtdnum" -y && ubiattach
> -m "$mtdnum"
> Omitting "-f -" here
> 'cat' throws a "Broken Pipe" error during the flash process.
this is expected. without "-f -" ubiformat will not open stdin and then 
cat has nowhere to pipe to.

> The data is written, but the image is corrupted and does not boot afterwards.
no. without -f the ubi partition is formatted but nothing is written 
onto it.
> Is this error/corruption not present on other targets ??
>
>
> So unless anyone has other ideas .. it looks for 'ubinized' the file
> needs to be provided as-is, and not using stdin.
> This drops support for gzip'd files .. but it does not seem to be
> gzipped anyway,  at least on my target.
the other idea is that the platform maintainer needs to be involved to 
track down what is rebooting the device in the middle of an update.

i suppose this problem is platform-specific. in that case it might be a 
watchdog on your platform. my suspicion: the normal userland is brought 
down before sysupgrade (as rootfs is not available) and thus userland 
tickle of a watchdog ceases to happen. the watchdog expires and reboots 
the system mid upgrade. the correct fix is disabling the watchdog before 
sysupgrade. alternatively, the userland tickle could be kept working 
during sysupgrade.

in particular, note this log line just 8 milliseconds before the reboot:
[  207.934010] imx2-wdt 20bc000.watchdog: Device shutdown: Expect reboot!
[  207.942382] reboot: Restarting system

i don't know nor have the platform, so please contact the maintainer.

>
> Regards,
>
> Koen


thanks!


More information about the openwrt-devel mailing list