Upgrading Executable on the Fly (nginx.org)
208 points by pantuza on Jan 4, 2022 | 71 comments



We did this for Caddy 1 too [1]. It was really cool. I am not sure how many people used this feature, so I haven't implemented it for Caddy 2 yet, and in the ~two years that Caddy 2 has been released, I've only had the request once. It's a bit tricky/tedious to do properly, but I'm willing to bring it over to Caddy 2 with a sufficient sponsorship.

[1]: https://github.com/caddyserver/caddy/blob/v1/upgrade.go


If Caddy were to support systemd socket activation, this self-restart dance would not be necessary, as the parent process (systemd) holds the socket for you. Other systems can use https://github.com/zimbatm/socketmaster instead. I believe this to be more elegant and robust than the nginx approach, as there are no PID re-parenting issues.

But I suspect that most Caddy deployments are done via docker, and that requires a whole container restart anyways.


It's kind of fun to watch things go out of fashion and back in. We used to use inetd, mostly because memory was expensive, so it could spawn a service only when a request came in; the spawned process would then exit and give the memory back to the OS. Then someone decided tcpd should sit between inetd and servers, for security and logging. Then every service just ran as its own daemon. Now I'm occasionally seeing posts like this reviving inetd.


It's the same but a bit different.

Inetd listens on a port and, for each new connection, spawns a new process, binding stdin/stdout to the socket pair. The main issue was that it could lead to system resource exhaustion pretty easily if too many connections were opened, and there was no good way to control that.

With systemd, the listening socket is only bound by systemd and passed to the service. The service itself is responsible for accepting and handling the new connections. So it has control over the rate of new connections, and can also more easily share memory. The main advantage of this approach is that as soon as systemd binds the socket, new connections won't be rejected by the system; they will be held until the service accept()s them. So no connection gets dropped, even during a restart. The service itself is still responsible for gracefully shutting down existing connections on SIGTERM.


Sure, though systemd does support the inetd style setup also.


Yes, despite the haters, systemd is going in the right direction. I just hope it doesn't crumble under its complexity.


Good point, and I'm not sure which deployment method is more popular.

In general I am personally not a fan of Docker due to added complexities (often unnecessary for static binaries like Caddy) and technical limitations such as this. All my Caddy deployments use systemd (which I don't love either, sigh).


Technically, it should be possible to pass a file descriptor to a docker process, but I haven't seen that before.

I don't like all parts of systemd, but I think socket activation is pretty elegant. The decoupling also allows binding ports 80 and 443 (because systemd runs as root) while still having Caddy run as a user process. I think it would unlock nice things for Caddy.


If I'm running in containers, does this mean that I need to run systemd inside Docker?


For containers you need a load-balancer in front to handle failover and rolling deploys.


I think one caveat is that systemd stops the old instance before starting the new one. This will give you a latency blip as requests will be queued during that time. The nginx approach doesn't introduce any latency impact.


I'm torn on this feature.

On the one hand, an application should never be able to replace itself with "random code" to be executed. I want my systems to be immutable. I want my services to run with the smallest set of privileges required.

On the other hand, it encourages "consumer level" users to keep their software up-to-date, even when it wasn't installed from a distribution's repository etc.

So I think in general it's a good feature to have, as advanced users/distributions will restrict what a service/process is able to do anyway, and won't see any downside from not using this feature.

It should be optional, that's all!


> an application should never be able to replace itself with "random code" to be executed.

To clarify: it doesn't, nor has it ever worked that way. You have to be the one to do that (or someone with privileges to write to that file on disk). Most production setups don't give Caddy that permission. And you have to trigger the upgrade too.


Ok, this explains a lot and also clearly shows I never used this feature :)

I always assumed it would be the Caddy process itself taking care of downloading the update & replacing the binary, before restarting it.


Caddy 2 can do that at least, but it's something you have to command it to do. :)


If you can log into the machine and replace the nginx executable, you are probably capable of running it too.


Can you explain any of the technical details around this perchance? I'm super curious. I know that SO_REUSEPORT[1] exists but is that the only little trick to make this work? From what I've read with SO_REUSEPORT it can open up that port to hijacking by rogue processes, so is that fine to rely on?

[1] https://lwn.net/Articles/542629/


You don't even need that. If the old server process exec()s the new one, it can pass on its file descriptors -- including the listening socket -- when that happens.


Yep, we don't use SO_REUSEPORT. We just pass it from the old process to the new one.
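That handoff can be sketched in Python. Everything here is illustrative — the LISTEN_FD variable and helper names are made up — but the mechanism (mark the fd inheritable, record its number somewhere the new binary can find it, rebuild the socket after exec) is the same idea:

```python
import os
import socket

def prepare_for_exec(listener):
    """Old process: keep the listening socket alive across exec()."""
    fd = listener.fileno()
    os.set_inheritable(fd, True)        # exec() drops non-inheritable fds
    os.environ["LISTEN_FD"] = str(fd)   # tell the new binary where to look
    return fd

def reclaim_listener():
    """New process: rebuild the listener from the inherited fd."""
    return socket.socket(fileno=int(os.environ["LISTEN_FD"]))
```

The old process would then call os.execv() on the new binary; because the socket is never closed, no connection attempt is refused during the swap.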


You could also be fancy and pass open sockets over a unix ___domain socket with sendmsg().
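A sketch of that trick, following the SCM_RIGHTS recipe from the Python socket docs (function names are mine):

```python
import array
import socket

def send_fd(chan, fd):
    """Pass an open file descriptor over a unix ___domain socket."""
    chan.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array("i", [fd]))])

def recv_fd(chan):
    """Receive a file descriptor sent with SCM_RIGHTS."""
    fds = array.array("i")
    msg, ancdata, flags, addr = chan.recvmsg(1, socket.CMSG_SPACE(fds.itemsize))
    for level, typ, data in ancdata:
        if level == socket.SOL_SOCKET and typ == socket.SCM_RIGHTS:
            fds.frombytes(data[:fds.itemsize])
    return fds[0]
```

The receiving process gets its own duplicate of the descriptor, so the two processes don't even need a parent/child relationship.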


This is the best way as it avoids any sort of session/parenting issues which are not always easy to solve portably as a parent.


If an attacker is already running rogue processes on your box, the minor details surrounding SO_REUSEPORT are the least of your worries. An attacker could just restart nginx and won't care about lost requests.


>it can open up that port to hijacking by rogue processes

That seems relevant if the process is using a non-privileged port that's >= 1024. If we're talking about privileged ports (<= 1023), though, only another root process could hijack that, and those can already hijack you many other ways.


What about processes that aren't root but hold CAP_NET_BIND_SERVICE?


Sure, should have mentioned that, and perhaps namespaces too.


I poked around this a bit at a previous job; here's what I remember:

1. there's a control process and worker processes

2. on upgrade, control process launches new worker processes from the new binary

3. requests are drained from old worker processes

4. most of the time nginx request handlers allocate from a per-request allocation pool, so requests mostly don't share memory

5. for the cases where there is global state, there's a separate shared memory pool that you need to allocate from (which is kind of hard to work with if you are not using built-in nginx primitives)


Someone did this for golang. It isn't perfect, but works for some basic use cases... https://github.com/jpillora/overseer


I've implemented this a few times in a few languages, based on exactly what nginx does. It works well, and it is pretty straightforward if you are comfortable with POSIX-style signals, sockets, and daemons.

I'm not sure it is super critical in the age of containerized workloads with rolling deploys, but at the very least the connection draining is a good pattern to implement to prevent deploy/scaling-related error spikes.


Even with containerized workloads, you still have an ingress, a SPOF (or multiple, when using multicast), and the seamless restart is meant for exactly those processes. Nginx is often used (https://kubernetes.github.io/ingress-nginx/), or when you use AWS, GCP, etc. they provide such a service for you.

Not sure how the cloud providers do it, though; maybe a combination of low DNS TTLs and rolling restarts, since they often have huge fleets of servers handling ingress?


A container, though, should be immutable and ideally shouldn't have changes made to it. If the container were to die, it'd revert back to the old version. These seamless upgrades look to me like an anti-pattern for containers.

With ingress you'd have a load balancer in front or have it routed in the network layer using BGP.


I think the parent was talking more about the fact that at some point, you have a component that should be available as much as possible. In the case you mention, that would be the load balancer. Being able to upgrade it in place might be easier than other ways.


How do you restart the load balancer though, without dropping traffic?


You would need more than one to do a rolling restart. Alternatively, doing it with a single instance of a software load balancer is a bit more work: spin up another instance and update DNS, wait for traffic to the old one to die as TTLs expire, then decommission it.

But I agree it isn't as easy as an in-place upgrade.


2 LBs and a VIP/DNS switch.


two nginx load balancers, reroute to the secondary via dns, restart primary


If you really want to have no SPOF you'd probably build something like this:

Multihomed IP <-> Loadbalancer <-> Application

By having the same setup running in multiple locations, you can replace the load balancers by taking one ___location offline (stop announcing the corresponding route). Application instances can be replaced by taking the instance out of the load balancer.


> Not sure how the cloud providers do it though

At least some I am familiar with operate at the packet level and can hand off live "connections" to a peer or hot standby, along with full session state.

Remember that with TCP or anything else, the abstract session is an illusion.



Also HAProxy; they both use UNIX sockets via ancillary messages + SCM_RIGHTS, I believe.

https://www.haproxy.com/blog/truly-seamless-reloads-with-hap...


Just a shout out: it's super hard to do it for UDP / QUIC / H3. Beware.

(but I don't think nginx supports h3 out of the box yet)


It's fundamentally identical. How do you think one hands off a TLS session? Or any other state?

It's only a problem if your state is tangled and impossible to serialize or bundle up to hand off.

UDP is perhaps the easiest because there's nothing to do in the basic case, for example with DNS.


Why so? I thought UDP was stateless, making that process even easier. But I never implemented it.


UDP itself is stateless, but QUIC is stateful. Without knowing the background, I would assume the issue is that incoming UDP packets will be routed to the new process after the reload, and that new process is not aware of the existing QUIC connections because the state resides in the old process. Thus it is not able to decrypt the packets, for example.


How are QUIC/HTTP3 servers usually upgraded? As you say, it seems tricky.


I've been working on something similar in a load balancer I've been writing in Rust. It's still a work in progress.

Basically the parent executes the new binary after it receives a USR1 signal. Once the child is healthy it kills the parent via SIGTERM. The listener socket file descriptor is passed over an environment variable.

https://github.com/monroeclinton/- (this is the proper url, it's called dash)


I've considered building something like this to allow for us to update customer software while it's serving users.

In my proposals, there would be a simple application-aware http proxy process that we'd maintain and install on all environments. It would handle relaying public traffic to the appropriate final process on an alternate port. There would be a special pause command we could invoke on the proxy that would buy us time to swap the processes out from under the TCP requests. A second resume command would be issued once the process is running and stable. Ideally, the whole deal completes in ~5 seconds. Rapid test rollbacks would be double that. You can do most of the work ahead of time by toggling between an A and B install path for the binaries, with a third common data path maintained in the middle (databases, config, etc)

With the above proposal, the user experience would be a brief delay at time of interaction, but we already have some UX contexts where delays of up to 30 seconds are anticipated. Absolutely no user request would be expected to drop with this approach, even in a rollback scenario. Our product is broad enough that entire sections of it can be a flaming wasteland while other pockets of users are perfectly happy, so keeping the happy users unbroken is key.


https://en.wikipedia.org/wiki/Blue-green_deployment

DNS not required. You can use a load balancer to do the same thing. If you don't want a full second setup, do a rolling restart of application servers instead.

Edit: I forgot... you can do this with containers too.


How do the two processes listen to the same port?


Once the USR2 signal is received, the master process forks; the child process inherits the parent's file descriptors, including the listening sockets. One process stops accepting connections, causing a queue to build up in the kernel. The new process takes over and starts accepting connections.
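The fork-and-inherit handoff can be demonstrated in miniature in Python. This is a toy stand-in, not nginx's actual code: the child plays the role of the new master accepting on the inherited socket, while the parent (here doubling as a client) stops accepting:

```python
import os
import socket

def forked_accept_demo():
    """Fork; the child inherits the listening socket and takes over accept()."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen()
    port = srv.getsockname()[1]
    pid = os.fork()
    if pid == 0:
        # "new master": same fd, same kernel listen queue
        conn, _ = srv.accept()
        conn.sendall(b"child")
        conn.close()
        os._exit(0)
    # "old master" stops accepting; pending connections queue in the kernel
    client = socket.create_connection(("127.0.0.1", port))
    data = client.recv(5)
    os.waitpid(pid, 0)
    return data
```

Because both processes share the same listen queue, connections arriving during the handoff are simply queued by the kernel rather than refused.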

You can follow the trail by searching for ngx_exec_new_binary in the nginx repo.


Just to add - Nginx normally spawns several worker processes that all process connections to the same port.


Correct, but to clarify: only the master process binds to the ports. The master process creates socketpairs to the workers for interprocess communication. The workers accept connections on the shared listening socket.

https://www.nginx.com/blog/socket-sharding-nginx-release-1-9...

The page also has an example of how SO_REUSEPORT affects the flow.


Oh, thanks! I didn't know that. I supposed it worked by inheriting the listening socket but I didn't check.


There’s a socket option for this on FreeBSD and Linux: SO_REUSEPORT (set with setsockopt, not an ioctl). You could also just leave the listening socket open when exec’ing the new httpd, or send it over a unix ___domain socket.


This article on how haproxy uses SO_REUSEPORT goes into some more detail: https://www.haproxy.com/blog/truly-seamless-reloads-with-hap...


Using the socket option SO_REUSEPORT allows multiple processes to bind to the same port.
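A quick Linux demonstration in Python (per the LWN article, later binds must come from the same effective UID; the helper name is mine):

```python
import socket

def reuseport_listener(port):
    """Create a listener that tolerates other sockets on the same port (Linux)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s
```

With both sockets bound, the kernel distributes incoming connections between them, so an old and a new server process can overlap during an upgrade.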


Are there any restrictions on this option? E.g., only children of the same parent process are allowed to bind to the same port? Otherwise, how does packet distribution work? And how do responses from that port work?


>So long as the first server sets this option before binding its socket, then any number of other servers can also bind to the same port if they also set the option beforehand. [...] To prevent unwanted processes from hijacking a port that has already been bound by a server using SO_REUSEPORT, all of the servers that later bind to that port must have an effective user ID that matches the effective user ID used to perform the first bind on the socket.

https://lwn.net/Articles/542629/


Is this what it's actually doing, though? It doesn't say the reuseport option to the listen directive is required for this.


I'm curious: does anyone know why nginx uses SIGWINCH for this? I know Apache uses WINCH as well, which makes me wonder whether there was some historical reason a server process wound up using a signal meant for a TTY.


It's otherwise unused in the context of a daemon. SIGHUP is similarly popular and historically designed for use with a TTY.


That's what I suspected: that it would pretty much be guaranteed not to be needed. However, I was not able to find any historical UNIX lore mentioning it specifically. Cheers.


Seems like a useful feature for a service manager like systemd to have for its managed services. It is already able to perform inetd style socket activation, I imagine this would be a welcome feature


inetd style socket activation (iirc) forks a process for every connection.

So simply replacing the binary on disk will cause all new connections going forward to use the new binary, while existing held connections (with in-memory references to the old binary's inode) finish their operations. Once they are done and all references to that inode are gone, the blocks backing the old binary are freed.
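This inode behavior is easy to demonstrate in Python (a toy text file stands in for the binary):

```python
import os
import tempfile

def atomic_replace_demo():
    """A process holding an open fd keeps seeing the old inode even after
    the file is atomically replaced on disk; only new open()s get the new
    content."""
    d = tempfile.mkdtemp()
    path = os.path.join(d, "binary")
    with open(path, "w") as f:
        f.write("old")
    held = open(path)                 # simulates a running process's open image
    tmp = os.path.join(d, "binary.new")
    with open(tmp, "w") as f:
        f.write("new")
    os.replace(tmp, path)             # atomic rename over the old name
    return held.read(), open(path).read()
```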


> inetd style socket activation (iirc) forks a process for every connection.

inetd supports both process-per-connection and single process/multiple connections using the "nowait" and "wait" declarations, respectively. The former passes an accept'd socket, the latter passes the listening socket.


CGI (apache's mod_cgi) has supported this since almost the beginning of the web as well. Deploying a new CGI is as simple as replacing the CGI binary.


This is already possible. You can configure whether you want inetd-style socket activation (where systemd calls accept() and passes you the client socket), or just systemd listening on the socket (where systemd passes you the listening socket and your binary calls accept()).

https://www.freedesktop.org/software/systemd/man/systemd.soc...


One thread with one comment from a long time ago:

Upgrading an Nginx executable on the fly - https://news.ycombinator.com/item?id=8677077 - Nov 2014 (1 comment)


So if I understood correctly, would it be like this:

  cp new/nginx /path/to/nginx
  kill -SIGUSR2 <processid>

That does sound pretty neat if you're not running nginx in a container. I wonder if they've built a Windows equivalent for that.



The SysV init script for nginx had an upgrade operation (in addition to start/stop/reload, etc.) which would send the signal. Worked like a charm.


I've always found the multi-process approach taken by both nginx and apache to be nothing but a hindrance when you have to write a custom module. It means that you may have to use shared memory, which is a PITA.

I don't know why they haven't moved on from it; it only really made sense when single-core processors were the norm.



