[OpenWrt-Devel] Sysupgrade and Failed to kill all processes

Philip Prindeville philipp_subx at redfish-solutions.com
Wed May 13 19:29:32 EDT 2020

> [snip]
> (2) If the box is in an indeterminate state then it’s not always clear that there’s a safe path forward, and sometimes this is something that a human needs to ascertain.
> There's no human that can ascertain anything. The board that is being upgraded is being upgraded from inside an enclosed shell where no human can access it aside from the ethernet port or wifi card. Given that the upgrade process shuts off all the possible methods of interacting with the system, having the board hang forever complaining "Can't kill all processes" is not useful. Now a human has to go physically interact with it, to cut power and restore power.

Okay. All of my routers are Xeon-D pizza boxes with serial port consoles connected to a terminal server…

> (3) You might also want to collect data about the failure so you can fix it and stop it from happening again.  Proceeding would efface all of that.
> There's no data to be gathered. There is no shell. Even if you had UART / Serial / VGA + Keyboard, the upgrade process has already terminated any CLI interface. So there's no data that can be gathered.

No, no shell… but there is console spew.

> What if the failure left the box in a partially compromised state?  Would you want your firewall to “fail open”?  I wouldn’t.
> I want my router to not render itself useless every 10th time I update it.

Sure, but security is about mitigating failure appropriately.

> As for firewalls failing open, that depends heavily on context. Do I want a device that's in the middle of nowhere, where sending a human to access it costs a lot of money, to fail in a state where there's no possibility of communication? Or do I want it to fail in a way that I can still log into it?

My lights-out machine rooms all have a dialup line, a terminal server, and a power strip where I can cycle outlets… because I don’t like driving 7 hours each way for 1 hour of productive work.

> Further, we're still talking about failing to sigkill processes prior to an upgrade. Nothing has changed on disk at this point in the upgrade, so a failure to kill all processes should not turn the box into a powered-up brick.

Okay, that seems like a good place for the watchdog timer (WDT) to recover the box…

> The man page for signal(2) says:
>        The signals SIGKILL and SIGSTOP cannot be caught or ignored.
> but yeah, if you’re in the kernel when the signal arrives, and you get stuck in there, then your process won’t go away and it becomes a moot point.
> Right, that's what I suspect is happening.

Trying to remember how to get a snapshot of all the kernel threads without panicking the kernel…

But if you have no shell, you’d need to modify your upgrade code to catch an alarm and do this.  Or chain it to the WDT.

> > I was under the impression that the only reliable way to ensure all processes terminate is to use cgroups, and put the processes to terminate in the freezer group and then kill them off after they've been frozen. Otherwise you have basically a race condition between the termination of processes and the creation of children. E.g. a fork-bomb could prevent all processes from being terminated.
> That assumes you have a kernel with CGROUPS compiled in.
> Right, so if you don't have CGROUPS compiled in, there's no reliable way to terminate all processes. It can't be done.

Well, maybe not absolutely, in all conceivable circumstances.  But if we could figure out why you’re getting that kernel hang, then maybe you could fix that and not have to resort to this.

Though, yeah, a bullet-proof mechanism that works in all cases is always nice to have.

Maybe it’s worth making CGROUPS mandatory.

But someone is going to whine about the pittance of memory it takes… while being apparently okay with the possibility of hung upgrades.

Another application of CONFIG_SKINNY!

> For my purposes, I would have Cgroups compiled in. I have no concern about the increase in firmware size. Others may care about firmware size, but they can decide on the tradeoff between having cgroups or not having them.

Yup.  I’m in the same boat where I value functionality (and reliability) over memory footprint.

> […]
> > I’m speculating but it could be for any number of reasons…  Keeping procd simple…  There might be ordering or dependency that requires doing the shutdown in a particular order… There might be services (like squid if SOCKS or proxy web access is required) that might be needed by the upgrade process in some scenarios…
> The upgrade process does not interact with any network services. All necessary files are completely verified as on disk and correct prior to the stage of the upgrade process that I'm talking about.

Okay.  It was speculation.  Maybe for some future capability of doing a PUT to remote storage, and then GETting it back post-image rewrite.

> If there is an ordering dependency, then the services in question must describe that dependency in their /etc/init.d/ files. 

Right, so something like “telinit 1” functionality would be useful here.  Then procd could run all the /etc/rc.d/S??* scripts… and kill any stragglers.

> Procd is the manager of services. There's no meaningful complexity or binary size cost to having it do a "for(auto const& service : services) { service.shutdown(); }" loop (forgive the C++-ism, it's simply an example), and the cost is well worth the additional reliability that it would provide.

Yup.  It’s evolved, it seems, in a sometimes helter-skelter fashion, driven by short-term requirements more than a long-term design.

> > When *not* using cgroups?  I thought you just argued for using cgroups to avoid the fork-race condition above…
> Yes, cgroups are the ideal.
> But even if cgroups are not available, procd should still attempt to use the service management system to shut things down *prior* to looping with SIGTERM and SIGKILL.


openwrt-devel mailing list
openwrt-devel at lists.openwrt.org
