This question has always appeared to me as academic, with little or no real-world relevance.
If your service manager process were to crash, what are you going to do about it?
If you restart the service manager, it won't know what state the system is in, which services are running and which are not, which services were running at the time when it crashed but then stopped just before it restarted, etc.
How are you going to do that in a race-free and reliable way that is actually better in practice than the alternative (reboot)?
And if your service manager is a single point of failure, it doesn't matter much which PID it's running as; it has to be perfectly reliable anyway (just like the kernel).
There's been a lot of research in fault recovery through message logging and checkpoint-based methods that could be applied here, e.g. [1]. Of course, you use "academic" as a snarl word, so I don't think anything will convince you.
The idea that the service manager would not be able to know the system and service states is completely false. Solaris SMF is one design that does know, via its use of the configuration repository. Simpler designs can deduce enough metadata from the persistent configuration in the supervisor tree. There are many possible approaches.
The idea that such fault recovery is implausible is a naive one that only someone unfamiliar with the research literature could espouse.
If we take your logic to its conclusion, we should just run everything in ring 0 with a single unisolated address space, because hey, anything can fail. Component modularization and communication boundary enforcement is the first step to fault isolation, which is the first step to fault tolerance.
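To make the "deduce state from persistent configuration" point concrete, here is a minimal sketch (in C, with a made-up directory layout and file names, not any real system's) of a restarted manager rescanning its configured services and probing which ones still have a live process:

    /* Hypothetical sketch: a restarted service manager re-deriving runtime
     * state from a persistent on-disk layout (one directory per configured
     * service, each with a "pid" file written by the previous incarnation).
     * The layout and file names are illustrative, not from any real system. */
    #include <dirent.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>

    static void rescan(const char *svcdir)
    {
        DIR *d = opendir(svcdir);
        if (d == NULL)
            return;
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (e->d_name[0] == '.')
                continue;
            char path[1024];
            snprintf(path, sizeof path, "%s/%s/pid", svcdir, e->d_name);
            FILE *f = fopen(path, "r");
            long pid = 0;
            if (f != NULL) {
                if (fscanf(f, "%ld", &pid) != 1)
                    pid = 0;
                fclose(f);
            }
            /* kill(pid, 0) only checks for existence; a PID can be recycled,
             * which is exactly the gap that contracts/cgroups close. */
            if (pid > 0 && kill((pid_t)pid, 0) == 0)
                printf("%s: apparently still running (pid %ld)\n", e->d_name, pid);
            else
                printf("%s: not running, needs restart\n", e->d_name);
        }
        closedir(d);
    }

    int main(void)
    {
        rescan("/run/hypothetical-manager/services");
        return 0;
    }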
* State File and Restartability
* Premature exit by init(1M) is handled as a special case by the kernel:
* init(1M) will be immediately re-executed, retaining its original PID. (PID
* 1 in the global zone.) To track the processes it has previously spawned,
* as well as other mutable state, init(1M) regularly updates a state file
* such that its subsequent invocations have knowledge of its various
* dependent processes and duties.
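The comment doesn't show the state file's path or format, and I won't guess at them; the general technique is just atomic replacement, so a re-executed init always reads either the old state or the new one, never a torn write. A minimal sketch, with an illustrative path and record layout rather than anything init(1M) actually does:

    /* Write the new state to a temporary file, fsync it, then rename() it
     * over the old file. The struct and the ".tmp" naming are illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct managed_proc {
        pid_t pid;
        char  tag[32];   /* e.g. an inittab-style id */
    };

    static int save_state(const char *path, const struct managed_proc *p, size_t n)
    {
        char tmp[1024];
        snprintf(tmp, sizeof tmp, "%s.tmp", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;
        ssize_t want = (ssize_t)(n * sizeof *p);
        if (write(fd, p, (size_t)want) != want || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);
        return rename(tmp, path);   /* atomic replacement on the same filesystem */
    }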
Then init(1) and SMF's svc.startd(1) seem to have a bit of a relationship:
* Process Contracts
* We start svc.startd(1M) in a contract and transfer inherited contracts when
* restarting it. Everything else is started using the legacy contract
* template, and the created contracts are abandoned when they become empty.
So init(1) creates the initial contract for svc.startd(1), then the latter creates nested contracts below that. (Aside: doing the equivalent cgroup manipulation on Linux would run afoul of the notorious one-writer rule.)
If svc.startd(1) crashes, init(1) will restart it inside the existing contract of the crashed instance, so it can find its spawned services (in nested contracts), as well as its companion svc.configd(1).
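For reference, putting a child into its own process contract is only a few libcontract(3LIB) calls. A rough sketch of the mechanism (not init's actual source, with error handling trimmed):

    /* Activate the process-contract template, fork, and the child becomes
     * the initial member of a new contract; see contract(4). */
    #include <fcntl.h>
    #include <libcontract.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pid_t spawn_in_contract(const char *path, char *const argv[])
    {
        int tmpl = open("/system/contract/process/template", O_RDWR);
        if (tmpl < 0)
            return -1;

        /* The next fork() by this thread creates a new contract. */
        if (ct_tmpl_activate(tmpl) != 0) {
            close(tmpl);
            return -1;
        }

        pid_t pid = fork();
        if (pid == 0) {
            execv(path, argv);
            _exit(127);
        }

        /* Parent: stop using the template for subsequent forks. */
        ct_tmpl_clear(tmpl);
        close(tmpl);
        return pid;
    }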
Now during startup, svc.startd(1) calls ct_event_reset(3), and this is really the interesting bit here:
The ct_event_reset() function resets the ___location of the
listener to the beginning of the queue. This function can be
used to re-read events, or read events that were sent before
the event endpoint was opened. Informative and acknowledged
critical events, however, might have been removed from the
queue.
I'm willing to entertain the idea that with this feature, SMF can properly track the state of the services that its previous incarnation launched, even if it crashed in the middle of handling an event.
With any luck it will also handle the situation where a supervised process exits after the service manager crashes and before it is restarted, since the contract should buffer the event in the kernel until it is read.
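Going only by my reading of ct_event_read(3CONTRACT) and ct_event_reset(3CONTRACT), the replay on restart could look roughly like this sketch (not svc.startd's actual code; the pbundle endpoint delivers events from all process contracts held by the caller):

    /* Rewind the event endpoint and replay whatever the kernel still holds. */
    #include <fcntl.h>
    #include <libcontract.h>
    #include <stdio.h>
    #include <sys/contract/process.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void replay_events(void)
    {
        /* O_NONBLOCK so ct_event_read() returns an error (e.g. EAGAIN)
         * once the queue is drained, instead of blocking. */
        int fd = open("/system/contract/process/pbundle", O_RDONLY | O_NONBLOCK);
        if (fd < 0)
            return;

        /* Rewind to the oldest event still queued in the kernel. */
        (void) ct_event_reset(fd);

        ct_evthdl_t ev;
        while (ct_event_read(fd, &ev) == 0) {
            ctid_t ctid = ct_event_get_ctid(ev);
            uint_t type = ct_event_get_type(ev);
            if (type == CT_PR_EV_EMPTY)
                printf("contract %ld became empty while we were away\n", (long)ctid);
            ct_event_free(ev);
        }
        close(fd);
    }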
Notably, this is a Solaris-specific kernel feature of the contract(4) filesystem; does Linux have anything equivalent in cgroups or somewhere?
The other SMF process, svc.configd, uses an SQLite database (actually two: a persistent one and a tmpfs one for runtime state), so it's plausible that it's properly transactional.
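A toy illustration of why that matters (made-up schema, not svc.configd's actual repository layout): wrap the state change in a transaction, and a crash mid-update leaves either the old row or the new one, never a torn write.

    /* Minimal SQLite transaction sketch; build with -lsqlite3. */
    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("/tmp/example-repo.db", &db) != SQLITE_OK)
            return 1;

        const char *sql =
            "CREATE TABLE IF NOT EXISTS instance_state (name TEXT PRIMARY KEY, state TEXT);"
            "BEGIN;"
            "INSERT OR REPLACE INTO instance_state VALUES ('svc:/network/ssh:default', 'online');"
            "COMMIT;";

        char *err = NULL;
        if (sqlite3_exec(db, sql, NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "sqlite error: %s\n", err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return 0;
    }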
> If we take your logic to its conclusion, we should just run everything in ring 0 with a single unisolated address space, because hey, anything can fail.
That is an entirely erroneous extrapolation, as I never claimed any other single point of failure [in user-space] than the service manager.
> I never claimed any other single point of failure [in user-space] than the service manager.
If all of one's system and service management relies upon a system-wide software "bus", then another similar problem is what to do when one has restarted the "bus" broker service and it has lost track of all active clients and servers.
Related problems are what to do when one cannot shut down one's log daemon because the only way to reach its control interface is via a "bus" broker service, and the broker in turn relies upon logging being available until it is shut down. Again, this is an example of engineering tradeoffs. Choose one big centralized logger daemon for logging everything, and this complexity and interdependence are the consequence.

A different design is to have multiple log daemons, independent of one another. With the cyclog@dbus service logging to /var/log and that log daemon's own and the service manager's log output being logged by a different daemon to /run/system-manager/log/, one can shut down the separate logging services at separate points in the shutdown procedure.
It's literally named SRC_kex.ext? So... would it be fair to say that part of SRC is implemented in kernel-space? The manual page gives me this impression.
That could very well be a solution to the problem, but perhaps not one that vezzy-fnord was hoping for.
I actually wanted to link the second of your linked comments but couldn't find it unfortunately.