[OpenWrt-Devel] Sysupgrade and Failed to kill all processes

Wed May 13 18:36:46 EDT 2020

>
> How the entire upgrade process works would be a good subject for
> documenting on the Wiki if it’s not already.
>

Feel free :-)

> How long are you thinking this I/O will take to complete?
>

Longer than the blazing speed of /bin/sh looping over /proc/*

> (1) It shouldn’t be happening very often.  Hopefully.
>

It happens all the time for me. I have customers complaining my ear off
about their router not upgrading properly.

> (2) If the box is in an indeterminate state then it’s not always clear
> that there’s a safe path forward, and sometimes this is something that a
> human needs to ascertain.
>

There's no human that can ascertain anything. The board that is being
upgraded is being upgraded from inside an enclosed shell where no human can
access it aside from the ethernet port, or wifi card. Given that the
upgrade process shuts off all the possible methods of interacting with the
system, having the board hang forever complaining "Can't kill all
processes" is not useful. Now a human has to go physically interact with
it, to cut power and restore power.

> (3) You might also want to collect data about the failure so you can fix
> it and stop it from happening again.  Proceeding would efface all of that.
>

There's no data to be gathered. There is no shell. Even if you had UART /
Serial / VGA + Keyboard, the upgrade process has already terminated any CLI
interface. So there's no data that can be gathered.

> What if the failure left the box in a partially compromised state?  Would
> you want your firewall to “fail open”?  I wouldn’t.
>

I want my router to not render itself useless every 10th time I update it.

As for firewalls failing open, that depends heavily on context. Do I want a
device that's in the middle of nowhere, where sending a human to access it
costs a lot of money, to fail in a state where there's no possibility of
communication? Or do I want it to fail in a way that I can still log into
it?

Further, we're still talking about failing to SIGKILL processes prior to an
upgrade. Nothing has changed on disk at this point in the upgrade, so a
failure to kill all processes should not turn the box into a powered-up
brick.
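
To make that concrete, here is a rough sketch of the behaviour I'd expect:
SIGTERM, wait a bounded time, SIGKILL, wait a bounded time, and then hand the
list of survivors back to the caller instead of hanging. It's Python purely
for readability, it is not what procd or sysupgrade actually do, and the
function names and grace periods are made up for the example.

```python
import os
import signal
import time


def alive(pid):
    """Return True if a process with this pid still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permission
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # it exists, we just may not signal it
    return True  # note: a zombie that nobody has reaped still counts as alive


def send(pids, sig):
    """Send sig to every pid, ignoring the ones that already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass


def wait_gone(pids, timeout, poll=0.05):
    """Poll until all pids are gone or the timeout expires; return survivors."""
    deadline = time.monotonic() + timeout
    remaining = [p for p in pids if alive(p)]
    while remaining and time.monotonic() < deadline:
        time.sleep(poll)
        remaining = [p for p in remaining if alive(p)]
    return remaining


def stop_all(pids, term_grace=2.0, kill_grace=2.0):
    """SIGTERM, then SIGKILL, with bounded waits. Never blocks forever."""
    send(pids, signal.SIGTERM)
    survivors = wait_gone(pids, term_grace)
    send(survivors, signal.SIGKILL)
    return wait_gone(survivors, kill_grace)
```

If stop_all() returns a non-empty list, the caller gets to pick a policy
(abort the upgrade and bring services back, or reboot). Either one beats
sitting there forever with the network down.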

> The man page for signal(2) says:
>
>        The signals SIGKILL and SIGSTOP cannot be caught or ignored.
>
> but yeah, if you’re in the kernel when the signal arrives, and you get
> stuck in there, then your process won’t go away and it becomes a moot point.
>

Right, that's what I suspect is happening.
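
If there were any way left to look at the box at that point, the check I'd
run is for processes in state "D" (uninterruptible sleep), since as far as I
know those are the ones a pending SIGKILL won't touch until they return from
the kernel. A small sketch, again in Python just for illustration; the
function names are mine, the /proc/<pid>/stat layout is the standard Linux
one.

```python
import os


def proc_state(stat_text):
    """Extract the state letter from the contents of /proc/<pid>/stat.

    The second field (the command name) is wrapped in parentheses and may
    itself contain spaces or parentheses, so split after the LAST ')'.
    """
    rest = stat_text[stat_text.rindex(")") + 1:]
    return rest.split()[0]


def uninterruptible_pids(proc_root="/proc"):
    """Return the pids that are currently in state 'D'."""
    stuck = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip /proc/self, /proc/meminfo, and so on
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                if proc_state(f.read()) == "D":
                    stuck.append(int(name))
        except (OSError, ValueError, IndexError):
            continue  # process exited, or its stat was unreadable
    return sorted(stuck)
```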

> > I was under the impression that the only reliable way to ensure all
> > processes terminate is to use cgroups, and put the processes to terminate
> > in the freezer group and then kill them off after they've been frozen.
> > Otherwise you have basically a race condition between the termination of
> > processes and the creation of children. E.g. a fork-bomb could prevent all
> > processes from being terminated.
>
>
> That assumes you have a kernel with CGROUPS compiled in.
>
Right, so if you don't have CGROUPS compiled in, there's no reliable way to
terminate all processes. It can't be done.

For my purposes, I would have cgroups compiled in. I have no concern about
the increase in firmware size. Others may care about firmware size, but they
can decide on the tradeoff between having cgroups or not having them.
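
For reference, the freeze-then-kill sequence I have in mind looks roughly like
the sketch below. It assumes the cgroup v1 freezer controller (the
freezer.state and cgroup.procs files), and it assumes the processes to be
killed have already been placed in the given cgroup directory. The function
name and the injectable kill parameter are just for the example. Python for
readability only.

```python
import os
import signal
import time


def freeze_and_kill(cgroup_dir, kill=os.kill, timeout=5.0):
    """Freeze a cgroup v1 freezer group, SIGKILL its members, then thaw it."""
    state_path = os.path.join(cgroup_dir, "freezer.state")
    procs_path = os.path.join(cgroup_dir, "cgroup.procs")

    # Ask the kernel to freeze every task in the group.
    with open(state_path, "w") as f:
        f.write("FROZEN")

    # Freezing is asynchronous: the state reads FREEZING until it completes.
    deadline = time.monotonic() + timeout
    while True:
        with open(state_path) as f:
            if f.read().strip() == "FROZEN":
                break
        if time.monotonic() >= deadline:
            raise TimeoutError("cgroup did not reach FROZEN in time")
        time.sleep(0.05)

    # Nothing in the group can fork while frozen, so this list is complete.
    with open(procs_path) as f:
        pids = [int(line) for line in f if line.strip()]

    # Queue SIGKILL for each member; it takes effect once the group thaws.
    for pid in pids:
        try:
            kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass

    # Thaw so the pending SIGKILLs are delivered.
    with open(state_path, "w") as f:
        f.write("THAWED")
    return pids
```

The point is that the membership list is read while nothing in the group can
run, which is what closes the fork race.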

> Also, if you have fork-bombs, why haven’t they brought down the system
> earlier?  And why would you have untrusted services/programs on your system
> in the first place?  This isn’t a general computing base with naive users
> picking up malware inadvertently, etc.  It’s a closed software ecosystem
> (in theory… how it gets mangled downstream is a different question).
>

It was merely an example.

> I’m speculating but it could be for any number of reasons…  Keeping procd
> simple…  There might be ordering or dependency that requires doing the
> shutdown in a particular order… There might be services (like squid if
> socks or proxy web access is required) that might be needed by the upgrade
> process in some scenarios…
>

The upgrade process does not interact with any network services. All
necessary files are completely verified as on disk and correct prior to the
stage of the upgrade process that I'm talking about.

If there is an ordering dependency, then the services in question must
describe that dependency in their /etc/init.d/ files.

Procd is the manager of services. There's no meaningful complexity or
binary size cost to having it do a "for(auto const& service : services) {
service.shutdown(); }" loop (forgive the C++-ism, it's simply an example),
and the cost is well worth the additional reliability that it would provide.
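
Spelled out a little more, the loop could look like the sketch below. This is
an illustration and not procd code: it assumes the stop order is encoded in
K* entries under /etc/rc.d the way the init scripts' stop priorities normally
are, and the function name, the injectable run parameter and the timeout are
invented for the example.

```python
import glob
import os
import subprocess


def stop_services(rc_dir="/etc/rc.d", run=subprocess.call, timeout=10):
    """Ask each service to stop, in K* order; return the ones that failed."""
    failed = []
    # Lexical sort matches the stop order as long as priorities are two digits.
    for link in sorted(glob.glob(os.path.join(rc_dir, "K*"))):
        try:
            rc = run([link, "stop"], timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            rc = -1  # could not run the script, or it took too long
        if rc != 0:
            failed.append(os.path.basename(link))
    return failed
```

Whatever is still running after this is what the SIGTERM/SIGKILL pass has to
deal with, and it should be a much shorter list.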

> When *not* using cgroups?  I thought you just argued for using cgroups to
> avoid the fork-race condition above…
>
Yes, cgroups are the ideal.

But even if cgroups are not available, procd should still attempt to use
the service management system to shut things down *prior* to looping with
SIGTERM and SIGKILL. If cgroups are available, then the fork-bomb worst-case
scenario can be entirely mitigated.