The actual reason you really need a proper PID 1 is not explained in this post, but a couple of clicks away at [0]:
>[...] the init process must also wait for child processes to terminate, before terminating itself.
>If the init process terminates prematurely then all children are terminated uncleanly by the kernel.
It also needs to reap orphaned processes, or they will become zombies. The dumb-init code does not appear to be doing that, so I reported an issue[1].
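If you want to check whether a container is quietly accumulating zombies, something along these lines does the trick (the container name is a placeholder, and it assumes a procps-style `ps` exists in the image):

    # list any defunct (zombie) processes inside a running container
    docker exec mycontainer ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'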
In general, Docker is half-trying to be the init system, but most people using it put a whole child OS with its own init system in their container. I think rkt's approach of using systemd to run the process is safer. Now if people would just start using lightweight containers...
Considering phusion/baseimage has been around for more than two years and plenty of people have been running an init system inside containers that run multiple processes, why didn't Yelp just pick something up off the shelf? Why not use runit or one of the many more mature lightweight init systems?
I can't speak to the others mentioned in this thread (tini in particular seems to be identical), but the solution used by phusion/baseimage is written in Python[0] - a C-based solution allows for lighter-weight containers.
If I understand correctly, the main goal here can be summarized in the quote below:
"The motivation: modeling Docker containers as regular processes
[...] we want processes to behave just as if they weren’t running inside a container. That means handling user input, responding the same way to signals, and dying when we expect them to. In particular, when we signal the docker run command, we want that same signal to be received by the process inside."
and that seems to me to be the core reason why they can't just use a simple init system (like e.g. runit, I suppose?)
> Having a shell as PID 1 actually makes signaling your process almost impossible. Signals sent to the shell won’t be forwarded to the subprocess, and the shell won’t exit until your process does. The only way to kill your container is by sending it SIGKILL (or if your process happens to die).
Noob question. Why is it impossible? You have the PID, no?
Good question! The problem is trying to signal it from outside the Docker container.
If your container has a process tree like
PID 1: /bin/sh
+--- PID 2: <your Python server>
then if you use `docker kill --signal` (or `docker stop`) from the host, Docker will only send the signal to PID 1, which is the shell. However, the shell won't forward it on to your Python server, so nothing happens (in most cases).
dumb-init basically replaces the shell in that diagram, but forwards signals when it receives them. So when you signal the container, the Python process receives the signal.
Alternatively, just eliminating the shell (so your Python app is PID 1) works for some cases, but then you get the special kernel behavior applied to PID 1, which you usually don't want (for example, signals whose handlers haven't been explicitly installed are ignored rather than killing the process). Avoiding that is the main purpose of dumb-init.
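To make that concrete, here's a rough shell sketch of the idea - dumb-init itself is a C program and handles far more signals and edge cases than this - saved as, say, init.sh and used as the entrypoint with the real command as its arguments:

    #!/bin/sh
    # Minimal sketch only: start the real command as a child, forward
    # TERM/INT to it, and wait. A real init (dumb-init, tini) also
    # re-waits after a signal, reaps any other orphaned children, and
    # propagates the child's exact exit status.
    "$@" &
    child=$!
    trap 'kill -TERM "$child" 2>/dev/null' TERM INT
    wait "$child"

With something like that as the entrypoint, `docker kill --signal=TERM <container>` (or a plain `docker stop`) actually reaches the service instead of stopping at the shell.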
Ah, that makes sense. I did not realize Docker only delivers the signal to PID 1. From its perspective that makes sense, because PID 1 is where the "application" should run, as specified in your Dockerfile.
Yup, tini is really really similar and looks pretty cool! They're solving much of the same problem. It's unfortunate that we didn't find tini before we went and wrote dumb-init.
There are some minor differences (dumb-init looks like it's probably a bit better for interactive commands, since it handles e.g. SIGTSTP). You can also get process group behavior at run-time with dumb-init rather than at compile time, and it's on by default, unlike tini (as far as I can tell from a brief reading). But for most cases it won't make a difference.
Quick disclaimer: I'm the author of Tini (thanks for the hat tip, by the way!).
Note that for interactive usage, Tini actually hands over the tty (if there is one) to the child, so in that case signals that come "from the TTY" (though in a Docker environment this is an over-simplification) actually bypass Tini and are sent to the child directly. This should include SIGTSTP, though I'm not sure I tested this specifically.
That being said, both tools are probably indeed very similar — after all there is little flexibility in that kind of tool! Process group behavior is probably indeed where they differ the most. : )
If it's such a straightforward fix, why isn't it part of Docker core? I'd love to hear from the Docker team why it's not a concern for them. Presumably if it were, they'd have addressed it by now.
From my own experience with Docker in production, I've yet to see any of the described scenarios crop up. Has anyone else, or is this solving an extreme edge case?
> From my own experience with Docker in production, I've yet to see any of the described scenarios crop up. Has anyone else, or is this solving an extreme edge case?
The biggest issue we see at Yelp is leaking containers in test (e.g. Jenkins aborting a job but leaving the containers it spawned still running).
Depending on how you orchestrate containers, you might not encounter the issue in prod. If you're using something like Kubernetes, Marathon, or PaaSTA, they're probably going to do the "right thing" and ensure the containers are actually stopped.
We also use containers a lot in development. For example, we might put a single tool into a container, and then when developers call that tool, they're actually spawning a container without realizing it. For this use case, it's really important that signals are handled properly so that Ctrl-C (and similar) continues working.
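To illustrate that development use case, a Dockerfile along these lines (the base image, tool name, and install path are just placeholders, and it assumes the dumb-init binary has already been fetched into the build context) keeps Ctrl-C working when the tool runs interactively:

    FROM debian:jessie
    # the dumb-init binary is assumed to be in the build context
    COPY dumb-init /usr/local/bin/dumb-init
    RUN chmod +x /usr/local/bin/dumb-init
    # wrap whatever the container runs so signals actually reach it
    ENTRYPOINT ["/usr/local/bin/dumb-init", "--"]
    CMD ["my-dev-tool"]

Then `docker run -ti <image>` behaves much like running the tool directly: Ctrl-C delivers SIGINT, dumb-init forwards it, and the container exits.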
Why did you not use something like supervisord? I run a few containers (obviously not at Yelp scale) and supervisord has been spectacular at restarting, managing, reloading, etc. It handles nginx, gunicorn, Puma, Tomcat, etc. pretty well.
Yes, it's Python - but was that the motivation?
Also, you should comment on https://github.com/docker/docker/pull/5773, which is work on unprivileged systemd in Docker. I think you can influence it with your experience here.
Zombies are not uncommon for an app running in a container that forks off other processes. Not every program does this, but forking child processes is pretty common - common enough to be worrisome.
For instance, some programs watch the Docker event stream and reload, say, HAProxy configuration to automatically load balance any new containers that come up. In my experience, when such a program runs in a container and reloads the HAProxy process frequently, it tends to leave behind a pile of zombie processes - and once they're present, zombies are difficult to eliminate without a reboot.
Generally accepted practice is no - they aren't worth having. Containers should only contain a single process. That process shouldn't be writing logs to disk (hence no logrotate) and timed tasks would generally be done outside the container rather than in it (though there are a lot of ways to skin that cat).
Single process containers generally don't need all the baggage of a full init system or other dependencies - hence this project.
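On the timed-task point, the usual trick is to drive those from the host (or a scheduler) rather than from cron inside the container; e.g. a host crontab entry along these lines (the image and command are placeholders):

    # run the periodic job from the host's cron instead of inside the container
    5 * * * *  docker run --rm my-batch-image /usr/bin/run-report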
At my current job, we're basically using Docker as a sort of package manager and deployment-script runner. Our containers are very fat; things are installed with apt. One of them has GCC in it, but I'm not sure why. One installs Node and runs a few JS scripts during the build process, then never runs Node again but keeps it around. It's obviously wrong, but I think it's just a new set of bad ideas that this software has allowed people to have.
'npm install' needs 'make' half the time. Sometimes it's easier on Debian-based setups to install build-essential than just make, as it pulls in a few other things that help as well. GCC is one of those things - might it be that that container has build-essential installed?
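One quick way to check (the image name here is a placeholder): `dpkg -s` reports whether the package is installed, and its Depends line lists gcc, g++, make, and friends.

    # does the fat image have build-essential, and what does it drag in?
    docker run --rm our-fat-image dpkg -s build-essential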
I used to run Docker with fat containers, and have now just finished getting rid of Docker in favour of .debs. We were basically using it as a package manager, and it is terrible at that job. Docker has its use cases, but package management isn't one of them. The Docker tagging system is particularly bad at it.
Under a time crunch, I've not found a way to use language package managers without winding up with GCC in the container.
The problem is that apt does a poor job of letting you set up something like build-essential, then remove it while leaving behind just the runtime shared libraries that the things you built actually need.
You can use the "dockerception" method: build in a container with devtools, then import the resulting binaries into a container with no devtools.
https://github.com/jamiemccrindle/dockerception
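Roughly, the flow looks like this (the image names, file names, and paths below are placeholders, not the layout of that repo):

    # 1) build the artifact inside a fat image that has the toolchain
    docker build -t myapp-build -f Dockerfile.build .
    # 2) copy the built binary out of a throwaway container
    docker create --name extract myapp-build
    docker cp extract:/src/myapp ./myapp
    docker rm extract
    # 3) build a slim runtime image that only COPYs the binary in
    docker build -t myapp -f Dockerfile.run .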
If you're doing this sort of thing, make sure to accomplish it in a single step (image layer) in the Dockerfile. Otherwise it won't do any good, since the "removed" files still persist beneath a later layer that merely marks them as removed.
If you installed build-essential and then removed it, `apt-get --purge autoremove` should remove the packages that build-essential pulled in that were not already installed.
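Combined with the single-layer point above, that looks something like this in a Dockerfile (the `pip install` step is just an example of a build that needs the toolchain):

    # install the toolchain, build, and purge it again in ONE layer, so
    # the compilers never persist in any layer of the final image
    RUN apt-get update \
     && apt-get install -y build-essential \
     && pip install -r requirements.txt \
     && apt-get purge -y --auto-remove build-essential \
     && rm -rf /var/lib/apt/lists/*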
The problem is that by default this will remove runtime dependencies like libtool as well. Sure, I could figure out what these are and keep them around, but the problem is the time-crunch aspect - it takes time, and if the program changes you have to spend that time again.
Part of the potential gain with containers is that you don't have to treat them like full OS environments, and don't have to worry about administering them as such.
I'd rather have a few tens of simple, single-process containers logging to a shared collector (or to their stdout, which is then collected) than manage logrotate for them all and somehow process all those files on every host.
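For the collection side, Docker's log drivers cover the simple cases; for example (the syslog endpoint here is a placeholder):

    # ship the container's stdout/stderr to a central syslog instead of
    # rotating log files inside the container
    docker run -d --log-driver=syslog \
      --log-opt syslog-address=udp://logs.internal:514 \
      my-service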