[OpenWrt-Devel] Sysupgrade and Failed to kill all processes
mike at meshplusplus.com
Wed May 13 01:17:43 EDT 2020
I've been investigating a problem with sysupgrade failing with the error
message "Failed to kill all processes", and then hanging indefinitely.
This happens maybe once every 10-20 sysupgrades, and it's kind of a pain.
So far I've determined this workflow that the sysupgrade command follows.
Note, I'm not aiming for 100% accuracy, but just broad strokes.
1) /sbin/sysupgrade locates the file to upgrade from on the filesystem, or
if the second option to sysupgrade starts with http://, it downloads the
firmware file using wget.
2) /sbin/sysupgrade does some minor validation of various things, and grabs
whatever config files it thinks the end user wants to be restored and packs
them up into some kind of tarball.
3) sysupgrade sends a message, via ubus, to procd, to initiate the upgrade.
4) Procd does some stuff which I haven't finished completely understanding
just yet, but it looks like firmware verification to make sure we don't
upgrade to a bad firmware file.
5) It *does not* appear that procd proactively terminates services at this
point, so everything (or almost everything) is still running. Stopping them
here seems like something that should be added to increase reliability.
6) procd replaces itself (via an execvp call) with the program
/sbin/upgraded. This means that procd is *no longer running*, PID 1 is now
/sbin/upgraded. So service management is not possible at this point.
7) /sbin/upgraded now acts as PID1. It executes the shell script
/lib/upgrade/stage2 with parameters.
8) The shell script loops on all processes, and sends them the TERM signal,
and then the KILL signal. See email subject for problems with this.
9) the shell script creates a new ram filesystem, mounts it, then copies
over a very small set of binaries into it.
10) The shell script changes root into the new ram filesystem
11) Inside the ramfilesystem, the shell script writes the upgraded firmware
and saved configuration to disk
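
Step 6 hinges on exec semantics: the process image is replaced but the PID
is kept, which is how /sbin/upgraded ends up as PID 1. A minimal sketch of
that behaviour on a POSIX system, in Python purely for illustration (procd
itself is C, and the helper name below is made up):

```python
import subprocess
import sys

# Child program: print its PID, then replace itself via execvp with a second
# program that prints its own PID. Both values should be identical.
CHILD = (
    "import os, sys\n"
    "print(os.getpid(), flush=True)\n"
    "os.execvp(sys.executable, [sys.executable, '-c', 'import os; print(os.getpid())'])\n"
)

def pids_before_and_after_exec():
    """Hypothetical helper: return (PID before exec, PID after exec)."""
    out = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return int(out[0]), int(out[1])
```

Because the old image is gone after the exec, nothing of procd's service
management survives the handover, which matches the observation in step 6.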
Now that the very rough summary is out of the way, I have 4 questions.
1) I notice that the shell script /lib/upgrade/stage2 is doing a tight loop
with kill -9 to terminate processes. However, it's only looping a maximum
of 10 times, and it's going as fast as the shell can loop.
What's to stop this loop from quickly going through every process almost
immediately 10 times, before a process that would be about to terminate
terminates? The process in question may be handling some kind of IO, so the
kernel wouldn't immediately terminate it.
Shouldn't there be some very brief sleep at the end of each loop iteration
to ensure that the processes that are practically about to terminate have
time to do so?
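
As a rough illustration of the idea in question 1, here is a sketch of a
bounded kill loop that pauses between rounds. It is written in Python only
for readability (the real stage2 is a shell script); the function name, the
injected `list_pids` and `send_kill` callables, and the 0.2 second pause are
all assumptions, not what OpenWrt does.

```python
import time

def kill_until_gone(list_pids, send_kill, max_rounds=10, pause=0.2, sleep=time.sleep):
    """Send KILL to every remaining PID, pausing between rounds.

    list_pids and send_kill are injected (hypothetical) callables so the
    retry logic can be exercised without touching real processes.
    Returns the PIDs still alive after the last round (empty on success).
    """
    remaining = list_pids()
    for _ in range(max_rounds):
        if not remaining:
            break
        for pid in remaining:
            send_kill(pid)
        # Give processes that are blocked in the kernel a moment to exit
        # before this round is counted against the retry budget.
        sleep(pause)
        remaining = list_pids()
    return remaining
```

On a real system `send_kill` could be `lambda pid: os.kill(pid, signal.SIGKILL)`.
With the pause, the ten rounds span a couple of seconds instead of a few
milliseconds.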
2) Why is the behavior on failure to terminate processes to just give up?
That leaves devices hanging without any network connectivity.
A reboot with some logging on disk would allow for remote sysupgrades to
have some kind of recoverability.
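
A sketch of what question 2 is asking for: on failure, record which PIDs
survived and reboot rather than hang. Python is used for illustration only.
The function name, the log format, and both parameters are hypothetical, and
whether a writable persistent location still exists at that stage of the
upgrade is an assumption.

```python
import os
import time

def handle_kill_failure(survivors, log_path, reboot):
    """Record which PIDs survived, then reboot instead of hanging.

    log_path (a file on persistent storage) and reboot (a callable that
    triggers the restart) are assumptions supplied by the caller.
    """
    line = "%s sysupgrade aborted, could not kill PIDs: %s\n" % (
        time.strftime("%Y-%m-%dT%H:%M:%S"),
        " ".join(str(p) for p in survivors),
    )
    with open(log_path, "a") as fh:
        fh.write(line)
        fh.flush()
        # Push the line to the device before the restart is requested.
        os.fsync(fh.fileno())
    reboot()
```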
3) Is looping over sigkill a reliable way to terminate all processes?
I was under the impression that the only reliable way to ensure all
processes terminate is to use cgroups, and put the processes to terminate
in the freezer group and then kill them off after they've been frozen.
Otherwise you have basically a race condition between the termination of
processes and the creation of children. E.g. a fork-bomb could prevent all
processes from being terminated.
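
The freeze-then-kill approach from question 3 could look roughly like the
sketch below. It assumes a cgroup v2 hierarchy (a kernel with the
`cgroup.freeze` and `cgroup.procs` files) and that the processes to
terminate have already been placed in one group; the function name and its
parameters are made up for illustration, and Python is used only for
readability.

```python
import os
import signal

def freeze_and_kill(cgroup_dir, kill=os.kill):
    """Freeze a cgroup v2 group, then SIGKILL every member.

    Assumes cgroup_dir is a cgroup v2 directory exposing cgroup.freeze and
    cgroup.procs. Returns the PIDs that were signalled.
    """
    # Freezing first stops members from forking while the list is walked.
    # Freezing is asynchronous; a real implementation would wait for
    # "frozen 1" to show up in cgroup.events before continuing.
    with open(os.path.join(cgroup_dir, "cgroup.freeze"), "w") as fh:
        fh.write("1")
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as fh:
        pids = [int(line) for line in fh if line.strip()]
    for pid in pids:
        kill(pid, signal.SIGKILL)
    return pids
```

Since no member can fork once the group is frozen, the PID list read from
`cgroup.procs` cannot go stale between the read and the kill, which is the
race described above.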
4) Why doesn't procd, prior to calling execvp on the /sbin/upgraded program,
shut down all the services that are running?
Maybe I'm just not seeing where it does this, so if that's the case, then
I'm happy to be corrected.
But I'm under the impression that when not using cgroups, stopping all
services would allow anything that isn't double forked to shut down
gracefully and clean up after itself.
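
For question 4, a sketch of a "stop everything first" pass that could run
before the handover. This is not what procd does. The K-prefixed script
layout under /etc/rc.d, the "stop" argument, and the function name are
assumptions about the target system, and Python is used only for
illustration.

```python
import os
import subprocess

def stop_services(rc_dir="/etc/rc.d", run=subprocess.call):
    """Run every K* script in rc_dir with the "stop" argument, in sorted order.

    The directory layout and the "stop" argument are assumptions; run is
    injectable so the ordering can be checked without executing anything.
    Returns the list of script paths that were invoked.
    """
    scripts = sorted(
        os.path.join(rc_dir, name)
        for name in os.listdir(rc_dir)
        if name.startswith("K")
    )
    for script in scripts:
        run([script, "stop"])
    return scripts
```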