We did this for Caddy 1 too [1]. It was really cool. I am not sure how many people used this feature, so I haven't implemented it for Caddy 2 yet, and in the ~two years that Caddy 2 has been released, I've only had the request once. It's a bit tricky/tedious to do properly, but I'm willing to bring it over to Caddy 2 with a sufficient sponsorship.
If Caddy were to support systemd socket activation, this self-restart dance wouldn't be necessary, as the parent process (systemd) holds the socket for you. Other systems can use https://github.com/zimbatm/socketmaster instead. I believe this to be more elegant and robust than the nginx approach, as there are no PID re-parenting issues.
But I suspect that most Caddy deployments are done via Docker, and that requires a whole container restart anyway.
It's kind of fun to watch things go out of fashion and back in. We used to use inetd, mostly because memory was expensive, so it could spawn a service only when a request came in; then the spawned process would exit and give the memory back to the OS. Then someone decided tcpd should sit between inetd and servers, for security and logging. Then, every service just ran as its own daemon. Now I'm occasionally seeing posts like this reviving inetd.
inetd listens on a port and, for each new connection, spawns a new process with stdin/stdout bound to the accepted socket. The main issue was that it could lead to system resource exhaustion pretty easily if too many connections were opened, and there was no good way to control that.
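For illustration, a hedged sketch of what such a spawned process can look like in Go, assuming inetd's "nowait" mode where the accepted connection is wired to stdin/stdout (the line-based echo protocol here is made up):

```go
// Hypothetical service meant to run under inetd in "nowait" mode: inetd
// accepts the TCP connection and starts this process with the socket
// attached to stdin/stdout, so the program only reads and writes those.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	sc := bufio.NewScanner(os.Stdin) // stdin *is* the accepted connection
	for sc.Scan() {
		// Echo each line back upper-cased; exit when the peer closes.
		fmt.Fprintln(os.Stdout, strings.ToUpper(sc.Text()))
	}
}
```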
With systemd, the listening socket is bound only by systemd and passed to the service. The service itself is responsible for accepting and handling the new connections, so it has control over the rate of new connections and can also more easily share memory. The main advantage of that approach is that as soon as systemd binds the socket, new connections won't be rejected by the system; they are held until the service accept()s them. So no connection gets dropped, even during a restart. The service itself is still responsible for gracefully shutting down existing connections on SIGTERM.
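To make that concrete, here's a minimal sketch of the consuming side in Go, assuming a single stream socket passed by systemd in the non-inetd (Accept=no) mode; systemd's convention is that passed descriptors start at fd 3 and are advertised via LISTEN_FDS/LISTEN_PID:

```go
// Minimal sketch of a Go service consuming a systemd-activated socket
// (Accept=no): systemd binds the port and passes the listening fd as fd 3,
// advertised through the LISTEN_FDS/LISTEN_PID environment variables.
package main

import (
	"net"
	"net/http"
	"os"
	"strconv"
)

func main() {
	// Check that the descriptors were really passed to *this* process.
	if pid, _ := strconv.Atoi(os.Getenv("LISTEN_PID")); pid != os.Getpid() {
		panic("not started via systemd socket activation")
	}
	if n, _ := strconv.Atoi(os.Getenv("LISTEN_FDS")); n < 1 {
		panic("no sockets passed")
	}

	// The first passed descriptor is always fd 3 (SD_LISTEN_FDS_START).
	ln, err := net.FileListener(os.NewFile(3, "systemd-socket"))
	if err != nil {
		panic(err)
	}
	// The service calls accept() itself (via http.Serve), so it controls
	// the rate at which queued connections are taken off the backlog.
	http.Serve(ln, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello\n"))
	}))
}
```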
Good point, and I'm not sure which deployment method is more popular.
In general I am personally not a fan of Docker due to added complexities (often unnecessary for static binaries like Caddy) and technical limitations such as this. All my Caddy deployments use systemd (which I don't love either, sigh).
Technically, it should be possible to pass a file descriptor to a Docker container's process, but I haven't seen that done before.
I don't like all parts of systemd but I think the socket activation is pretty elegant. The decoupling also allows binding to ports 80 and 443 (because systemd runs as root) while still having Caddy run as a user process. I think it would unlock nice things for Caddy.
I think one caveat is that systemd stops the old instance before starting the new one. This will give you a latency blip as requests will be queued during that time. The nginx approach doesn't introduce any latency impact.
On the one hand, an application should never be able to replace itself with "random code" to be executed. I want my systems to be immutable. I want my services to run with the smallest set of privileges required.
On the other hand, it encourages "consumer level" users to keep their software up-to-date, even when it wasn't installed from a distribution's repository etc.
So I think it's a good feature to have in general, as advanced users/distributions will restrict what a service/process is able to do anyway, and won't see any downside from not using this feature.
> an application should never be able to replace itself with "random code" to be executed.
To clarify: it doesn't, nor has it ever worked that way. You have to be the one to do that (or someone with privileges to write to that file on disk). Most production setups don't give Caddy that permission. And you have to trigger the upgrade too.
Can you explain any of the technical details around this perchance? I'm super curious. I know that SO_REUSEPORT[1] exists but is that the only little trick to make this work? From what I've read with SO_REUSEPORT it can open up that port to hijacking by rogue processes, so is that fine to rely on?
You don't even need that. If the old server process exec()s the new one, it can pass on its file descriptors -- including the listening socket -- when that happens.
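A rough sketch of that handoff in Go (Linux assumed; the LISTEN_FD variable name is invented for the example): clear close-on-exec on the listening descriptor, record its number in the environment, and exec() the new binary in place. The re-exec'd process would then rebuild the listener from that fd with net.FileListener.

```go
// Sketch of the exec()-based handoff (Linux and Go assumed; the LISTEN_FD
// variable name is invented). The kernel keeps the socket open and bound
// across exec(), so no incoming connection is refused during the swap.
package main

import (
	"fmt"
	"net"
	"os"
	"syscall"

	"golang.org/x/sys/unix"
)

func handoff(ln *net.TCPListener, newBinary string) error {
	f, err := ln.File() // dup of the listening socket
	if err != nil {
		return err
	}
	// The dup comes back with close-on-exec set; clear it so it survives exec().
	if _, err := unix.FcntlInt(f.Fd(), unix.F_SETFD, 0); err != nil {
		return err
	}
	env := append(os.Environ(), fmt.Sprintf("LISTEN_FD=%d", f.Fd()))
	// Replace this process image with the new binary. The re-exec'd process
	// reads LISTEN_FD and rebuilds the listener with net.FileListener.
	return syscall.Exec(newBinary, os.Args, env)
}

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		panic(err)
	}
	// ... serve on ln; later, when an upgrade is triggered:
	if err := handoff(ln.(*net.TCPListener), os.Args[0]); err != nil {
		panic(err)
	}
}
```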
If an attacker is already running rogue processes on your box, the minor details surrounding SO_REUSEPORT is the least of your worries. An attacker could just restart nginx, and won't care about lost requests.
>it can open up that port to hijacking by rogue processes
That seems relevant if the process is using a non-privileged port that's >= 1024. If we're talking about privileged ports (<= 1023), though, only another root process could hijack that, and those can already hijack you many other ways.
I poked around that a bit at a previous job; here's what I remember:
1. there's a control process and worker processes
2. on upgrade, control process launches new worker processes from the new binary
3. requests are drained from old worker processes
4. most of the time nginx request handlers allocate from a per-request allocation pool, so requests mostly don't share memory
5. for the cases where there is global state, there's a separate shared memory pool that you need to allocate from (which is kind of hard to work with if you are not using built-in nginx primitives)
I've implemented this a few times in a few languages based on exactly what nginx does. It works well, and it is pretty straightforward if you are comfortable with POSIX-style signals, sockets, and daemons.
I'm not sure it is super critical in the age of containerized workloads with rolling deploys but at the very least the connection draining is a good pattern to implement to prevent deploy/scaling related error spikes.
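The draining half of that pattern is cheap to get from Go's standard library; a minimal sketch (the :8080 address and 30-second deadline are arbitrary):

```go
// Minimal sketch of connection draining in Go: stop accepting on SIGTERM,
// then let in-flight requests finish before the process exits.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	go srv.ListenAndServe()

	// Wait for the orchestrator (or an operator) to ask us to stop.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Shutdown closes the listener right away, then waits for active
	// requests to complete, bounded here by a 30-second deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}
```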
Even with containerized workloads, you still have an ingress, or SPOF (or multiple, when using multicast), and the seamless restart is meant for exactly those processes. Nginx is often used (https://kubernetes.github.io/ingress-nginx/), or when you use AWS, GCS etc they provide such a service for you.
Not sure how the cloud providers do it though, maybe combination of low DNS TTL and rolling restart since they often have huge fleets of servers which handle ingress?
A container, though, should be immutable and ideally shouldn't have changes made to it. If the container were to die, it'd revert back to the old version? It looks to me like these seamless upgrades would be an anti-pattern for containers.
With ingress you'd have a load balancer in front or have it routed in the network layer using BGP.
I think the parent was talking more about the fact that at some point, you have a component that should be available as much as possible. In the case you mention, that would be the load balancer. Being able to upgrade it in place might be easier than other ways.
You would need more than one to do a rolling restart. Alternatively, doing it with one instance of a software load balancer is a bit more work: spin another instance up and update DNS, wait for traffic to the old one to die as TTLs expire, then decommission it.
But I agree it isn't as easy as an in-place upgrade.
If you really want to have no SPOF you'd probably build something like this:
Multihomed IP <-> Loadbalancer <-> Application
By having the same setup running in multiple locations you can replace the load balancers by taking one ___location offline (stop announcing the corresponding route). Application instances can be replaced by taking the application instance out of the load balancer.
At least some I am familiar with operate at the packet level and can hand off live "connections" to a peer or hot standby, along with full session state.
Remember that with TCP or anything else, the abstract session is an illusion.
UDP itself is stateless, but QUIC is stateful. Without knowing the background, I would assume the issue is that incoming UDP packets will be routed to the new process after the reload, and that new process is not aware of the existing QUIC connections because the state resides in the old process. Thus it is not able to decrypt the packets, for example.
I've been working on something similar in a load balancer I've been writing in Rust. It's still a work in progress.
Basically the parent executes the new binary after it receives a USR1 signal. Once the child is healthy it kills the parent via SIGTERM. The listener socket file descriptor is passed over an environment variable.
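For comparison, roughly the same flow sketched in Go rather than Rust (single TCP listener assumed; the LISTENER_FD variable name is made up):

```go
// Parent/child handoff sketch: the parent spawns the new binary and hands
// it the listening socket; the child rebuilds the listener and, once
// healthy, SIGTERMs the parent so it drains and exits.
package main

import (
	"net"
	"os"
	"os/exec"
	"strconv"
)

// Parent side, run when SIGUSR1 arrives: start the new binary and hand it
// the listening socket. exec.Cmd.ExtraFiles entry 0 becomes fd 3 in the child.
func spawnNewBinary(ln *net.TCPListener, path string) error {
	f, err := ln.File() // dup of the listening descriptor
	if err != nil {
		return err
	}
	cmd := exec.Command(path)
	cmd.ExtraFiles = []*os.File{f}
	cmd.Env = append(os.Environ(), "LISTENER_FD=3")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Start()
}

// Child side: rebuild the listener from the inherited descriptor.
func inheritedListener() (net.Listener, error) {
	fd, err := strconv.Atoi(os.Getenv("LISTENER_FD"))
	if err != nil {
		return nil, err
	}
	return net.FileListener(os.NewFile(uintptr(fd), "inherited-listener"))
}

func main() {
	if os.Getenv("LISTENER_FD") != "" {
		ln, err := inheritedListener()
		if err != nil {
			panic(err)
		}
		_ = ln // serve on ln, then signal the parent to shut down
		return
	}
	// Normal startup: net.Listen, serve, and call spawnNewBinary on SIGUSR1.
}
```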
I've considered building something like this to allow for us to update customer software while it's serving users.
In my proposals, there would be a simple application-aware HTTP proxy process that we'd maintain and install on all environments. It would handle relaying public traffic to the appropriate final process on an alternate port. There would be a special pause command we could invoke on the proxy that would buy us time to swap the processes out from under the TCP requests. A second resume command would be issued once the new process is running and stable. Ideally, the whole deal completes in ~5 seconds; rapid test rollbacks would be double that. You can do most of the work ahead of time by toggling between an A and B install path for the binaries, with a third common data path maintained in the middle (databases, config, etc.).
With the above proposal, the user experience would be a brief delay at time of interaction, but we already have some UX contexts where delays of up to 30 seconds are anticipated. Absolutely no user request would be expected to drop with this approach, even in a rollback scenario. Our product is broad enough that entire sections of it can be a flaming wasteland while other pockets of users are perfectly happy, so keeping the happy users unbroken is key.
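A rough sketch of the pause/resume gate described above, as a Go reverse proxy (the /-/pause and /-/resume endpoints, ports, and backend address are all invented; a real version would want auth, timeouts, and limits):

```go
// Reverse proxy that can be told to hold incoming requests while the
// backend process is swapped out, then resume and flush them through.
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync"
)

type gate struct {
	mu     sync.Mutex
	resume chan struct{} // closed while running; open (blocking) while paused
}

func newGate() *gate {
	g := &gate{resume: make(chan struct{})}
	close(g.resume) // start in the running state
	return g
}

func (g *gate) pause() {
	g.mu.Lock()
	defer g.mu.Unlock()
	select {
	case <-g.resume: // currently running: install a fresh open channel
		g.resume = make(chan struct{})
	default: // already paused
	}
}

func (g *gate) unpause() {
	g.mu.Lock()
	defer g.mu.Unlock()
	select {
	case <-g.resume: // already running
	default:
		close(g.resume)
	}
}

func (g *gate) wait() {
	g.mu.Lock()
	ch := g.resume
	g.mu.Unlock()
	<-ch // blocks only while paused
}

func main() {
	backend, _ := url.Parse("http://127.0.0.1:9000") // the real app's alternate port
	proxy := httputil.NewSingleHostReverseProxy(backend)
	g := newGate()

	http.HandleFunc("/-/pause", func(w http.ResponseWriter, r *http.Request) { g.pause() })
	http.HandleFunc("/-/resume", func(w http.ResponseWriter, r *http.Request) { g.unpause() })
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		g.wait() // hold the request while the backend is swapped out
		proxy.ServeHTTP(w, r)
	})
	http.ListenAndServe(":8080", nil)
}
```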
DNS not required. You can use a load balancer to do the same thing. If you don't want a full second setup, do a rolling restart of application servers instead.
Edit: I forgot... you can do this with containers too.
Once the USR2 signal is received, the master process forks, and the child process inherits the parent's file descriptors, including the listening sockets.
The old process stops accepting connections, so new ones queue in the kernel's backlog. The new process takes over and starts accepting connections.
You can follow the trail by searching for ngx_exec_new_binary in the nginx repo.
Correct, but to clarify: only the master process binds to the ports. The master process creates socketpairs to the workers for interprocess communication. The workers accept connections over the shared listening socket.
There’s a socket option for this on FreeBSD and Linux — SO_REUSEPORT (set with setsockopt). You could also just leave the listening socket open when exec’ing the new httpd, or send it over a Unix ___domain socket.
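A sketch of the SO_REUSEPORT route in Go on Linux (uses the golang.org/x/sys/unix package): both the old and the new process bind the same port with the option set, so they can accept in parallel while traffic shifts to the new one.

```go
// Listener with SO_REUSEPORT set before bind(), so another process owned by
// the same user can bind the same port at the same time.
package main

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

func listenReusePort(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				// Must be set before bind() for the shared binding to be allowed.
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	ln, err := listenReusePort(":8080")
	if err != nil {
		panic(err)
	}
	_ = ln // accept and serve as usual
}
```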
Are there any restrictions on this option? E.g. are only children of the same parent process allowed to bind to the same port? Otherwise, how does the packet distribution work? And how do responses from that port work?
>So long as the first server sets this option before binding its socket, then any number of other servers can also bind to the same port if they also set the option beforehand. [...] To prevent unwanted processes from hijacking a port that has already been bound by a server using SO_REUSEPORT, all of the servers that later bind to that port must have an effective user ID that matches the effective user ID used to perform the first bind on the socket.
I am curious: does anyone know why nginx uses SIGWINCH for this? I know Apache uses WINCH as well, which makes me wonder if there was some historical reason a server process wound up using a signal meant for a TTY.
That's what I suspected - that it would pretty much be guaranteed not to be needed (a detached daemon has no controlling terminal to send it window-change signals) - however I was not able to find any historical UNIX lore mentioning it specifically. Cheers.
Seems like a useful feature for a service manager like systemd to have for its managed services. It is already able to perform inetd-style socket activation; I imagine this would be a welcome feature.
inetd style socket activation (iirc) forks a process for every connection.
So, simply replacing the binary on disk will cause all new connections going forward to use the new binary, while existing connections (served by processes that still hold the old binary's inode open) finish their work. Once they are done and all references to that inode are gone, the old binary's blocks are freed.
> inetd style socket activation (iirc) forks a process for every connection.
inetd supports both process-per-connection and single process/multiple connections using the "nowait" and "wait" declarations, respectively. The former passes an accepted socket, the latter passes the listening socket.
This is already possible. You can configure whether you want inetd-style socket activation (where systemd calls accept() and passes you the client socket), or just systemd listening on the socket (where systemd passes you the listening socket and your binary calls accept()).
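For reference, a sketch of what the units might look like (the myapp name and paths are invented; Accept= is the knob that selects between the two modes):

```ini
# Hypothetical units. Accept=no is the default: systemd passes the listening
# socket to one long-running service, which calls accept() itself.
# Accept=yes would instead have systemd accept each connection and spawn a
# myapp@.service instance per client, inetd-style.

# myapp.socket
[Socket]
ListenStream=8080
Accept=no

[Install]
WantedBy=sockets.target

# myapp.service
[Service]
ExecStart=/usr/local/bin/myapp
```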
I've always found the multi-process approach taken by both nginx and apache to be nothing but a hindrance when you have to write a custom module. It means that you may have to use shared memory, which is a PITA.
I don't know why they haven't moved on from it; it only really made sense when uni-core processors were the norm.
[1]: https://github.com/caddyserver/caddy/blob/v1/upgrade.go