[OpenWrt-Devel] Sysupgrade and Failed to kill all processes

Wed May 13 02:02:13 EDT 2020

Applause, applause.

The first (partial) docs of the magic of sysupgrade. And its pitfalls.

Having had various issues with sysupgrade myself in the past (including OTA sysupgrades), I would add the following notes:
- Having open files on storage devices (e.g. for swap, but also files opened explicitly) broke sysupgrade for me.
- There is no real error feedback when sysupgrade was _not_ done. It can even leave the filesystem in an inconsistent state,
as "sysupgrade ... -f myfilestobesaved.tar.gz" was applied to the (still) running image, without upgrading to the new firmware.bin.
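
A minimal sketch of a pre-flight check for the swap case from the first note, to be run before calling sysupgrade. It only covers swap, not other open files. The helper names list_swaps and check_no_swap are made up for illustration and are not part of sysupgrade; the only assumption about the system is the usual /proc/swaps layout (one header line, then one line per active swap area).

```shell
#!/bin/sh
# Hypothetical helpers, not part of /sbin/sysupgrade.

# Print the device or file name of every active swap area.
# Reads a /proc/swaps-style file and skips its header line.
list_swaps() {
	awk 'NR > 1 { print $1 }' "${1:-/proc/swaps}"
}

# Return 0 if no swap is active, otherwise complain and return 1.
check_no_swap() {
	swaps=$(list_swaps "$1")
	[ -z "$swaps" ] && return 0
	echo "active swap, run swapoff first: $swaps" >&2
	return 1
}
```

Something like "check_no_swap && sysupgrade ..." would then refuse to start while swap is still active.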

Regarding killing the processes in a 10-iteration loop: in addition to a short sleep in every iteration,
it may also help to check whether each process is still alive.
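
Roughly what I have in mind, as a sketch only. This is not the code in /lib/upgrade/stage2; wait_for_exit and kill_with_retries are hypothetical names, and the 1 second sleep and 10 retries are arbitrary choices. "kill -0" sends no signal, it only reports whether the PID still exists.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from /lib/upgrade/stage2.

# wait_for_exit RETRIES PID...
# Check up to RETRIES times, one second apart, whether any PID is still
# alive. Return 0 as soon as all are gone, 1 if some survive every check.
wait_for_exit() {
	retries=$1
	shift
	while :; do
		alive=0
		for pid in "$@"; do
			kill -0 "$pid" 2>/dev/null && alive=1
		done
		[ "$alive" -eq 0 ] && return 0
		retries=$((retries - 1))
		[ "$retries" -le 0 ] && return 1
		sleep 1
	done
}

# kill_with_retries PID...
# Send TERM, wait for the processes to exit, and fall back to KILL
# only for the case where something is still alive afterwards.
kill_with_retries() {
	kill -TERM "$@" 2>/dev/null || :	# ignore PIDs that are already gone
	wait_for_exit 10 "$@" && return 0
	kill -KILL "$@" 2>/dev/null || :
	wait_for_exit 10 "$@"
}
```

The return value of kill_with_retries says whether everything actually went away, so the caller could decide to log and reboot instead of hanging.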

Having read your mail, I am glad that for some time now I have been explicitly killing processes myself
before sysupgrade, especially when non-standard programs such as nginx or squid are running,
as the default config of squid defines a 10s duration for shutdown.
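
A sketch of how such a manual stop could look, assuming the services ship ordinary init scripts under /etc/init.d that accept "stop". stop_services and the INITD variable are names invented here; INITD only exists so the sketch can be tried outside a router.

```shell
#!/bin/sh
# Hypothetical helper; INITD is overridable for testing.
INITD=${INITD:-/etc/init.d}

# stop_services NAME...
# Call "<init script> stop" for every named service that has an
# executable init script. Return 1 if any of them reports a failure.
stop_services() {
	failed=""
	for svc in "$@"; do
		if [ -x "$INITD/$svc" ]; then
			"$INITD/$svc" stop || failed="$failed $svc"
		fi
	done
	[ -z "$failed" ] && return 0
	echo "could not stop:$failed" >&2
	return 1
}
```

On the router this would be called as "stop_services nginx squid", followed by a wait long enough to cover the squid shutdown delay, and only then sysupgrade.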

On 13.05.2020 at 08:17, Michael Jones wrote:
> I've been investigating a problem with sysupgrade failing with the error message "Failed to kill all processes", and 
> then hanging indefinitely.
> 
> This happens maybe once every 10-20 sysupgrades, and it's kind of a pain.
> 
> So far I've determined this workflow that the sysupgrade command follows. Note, I'm not aiming for 100% accuracy, but 
> just broad strokes.
> 
> 
> 1) /sbin/sysupgrade locates the file to upgrade from on the filesystem, or if the second option to sysupgrade starts 
> with http://, it downloads the firmware file using wget.
> 2) /sbin/sysupgrade does some minor validation of various things, and grabs whatever config files it thinks the end user 
> wants to be restored and packs them up into some kind of tarball.
> 3) sysupgrade sends a message, via ubus, to procd, to initiate the upgrade.
> 4) Procd does some stuff which I haven't finished completely understanding just yet, but it looks like firmware 
> verification to make sure we don't upgrade to a bad firmware file.
> 5) It *does not* appear that procd will proactively terminate services until everything (or almost everything) is shut 
> down. Seems like something that should be added to increase reliability.
> 6) procd replaces itself (execvp system call) with the program /sbin/upgraded. This means that procd is *no longer 
> running*, PID 1 is now /sbin/upgraded. So service management is not possible at this point.
> 7) /sbin/upgraded now acts as PID1. It executes the shell script /lib/upgrade/stage2 with parameters.
> 8) The shell script loops on all processes, and sends them the TERM signal, and then the KILL signal. See email subject 
> for problems with this.
> 9) the shell script creates a new ram filesystem, mounts it, then copies over a very small set of binaries into it.
> 10) The shell script changes root into the new ram filesystem
> 11) Inside the ramfilesystem, the shell script writes the upgraded firmware and saved configuration to disk
> 12) Reboot.
> 
> 
> Now that the very rough summary is out of the way, I have 4 questions.
> 
> 1) I notice that the shell script /lib/upgrade/stage2 is doing a tight loop with kill -9 to terminate processes. 
> However, it's only looping a maximum of 10 times, and it's going as fast as the shell can loop.
> 
> What's to stop this loop from quickly going through every process almost immediately 10 times, before a process that 
> would be about to terminate terminates? The process in question may be handling some kind of IO, so the kernel wouldn't 
> immediately terminate it.
> 
> Shouldn't there be some very brief sleep at the end of each loop iteration to ensure that the processes that are going 
> to practically terminate have done so?
> 
> 2) Why is the behavior on failure to terminate processes to just give up? That leaves devices hanging without any 
> network connectivity.
> A reboot with some logging on disk would allow for remote sysupgrades to have some kind of recoverability.
> 
> 3) Is looping over sigkill a reliable way to terminate all processes?
> I was under the impression that the only reliable way to ensure all processes terminate is to use cgroups, and put the 
> processes to terminate in the freezer group and then kill them off after they've been frozen. Otherwise you have 
> basically a race condition between the termination of processes and the creation of children. E.g. a fork-bomb could 
> prevent all processes from being terminated.
> 
> 4) Why doesn't procd, prior to calling execvp on the /sbin/upgraded program, shut down all the services that are running?
> 
> Maybe I'm just not seeing where it does this, so if that's the case, then I'm happy to be corrected.
> 
> But I'm under the impression that when not using cgroups, stopping all services would allow for anything that isn't 
> double forked to be gracefully shut down and to clean up after itself.
> 
> 
> 
> 
> _______________________________________________
> openwrt-devel mailing list
> openwrt-devel at lists.openwrt.org
> https://lists.openwrt.org/mailman/listinfo/openwrt-devel
> 
