Using select(2) the right way (aivarsk.github.io)
181 points by aivarsk on June 4, 2017 | 37 comments



I have worked on servers that had this problem with select(2). poll was better but kqueue was much much better (from near-100% utilization with a few thousand open fds to < 1% cpu utilization; please forgive the vague numbers, it was 7 years ago at a company that I no longer work for).

If you're using nonblocking I/O, or threads, select can become the bottleneck with a few thousand interesting connections, well below the FD_SETSIZE maximum.

The user-space CPU utilization should be near-0 since this is really a kernel I/O problem; the fact that it is much higher is an indication that a lot of wasted user space work is being done.

Select, by design, has a flaw: the kernel must scan the bitmask for the highest-numbered fd passed in, even if there is little change in the fd_set bitmask (as is usually the case). This means incurring a cost if any fd in the set has a high fd number. The fd on which you accept probably has a low number, but your client I/O fds will have increasing numbers as the server gets greater and greater utilization.

kqueue (and epoll, but I haven't used it) resolves this problem nicely. Only things that change state must be communicated. poll is better than select in this respect if the array is sparse, but if a large number of fds are interesting, it will also get expensive.
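For illustration, the usual select loop looks roughly like this (a minimal sketch of my own, assuming all fds stay below FD_SETSIZE; the names are made up):

  #include <sys/select.h>

  /* Wait until the listening socket or any client socket is readable. */
  int wait_readable(int listen_fd, const int *client_fds, int nclients)
  {
      fd_set rfds;
      int maxfd = listen_fd;

      FD_ZERO(&rfds);
      FD_SET(listen_fd, &rfds);            /* usually a low fd number */
      for (int i = 0; i < nclients; i++) {
          FD_SET(client_fds[i], &rfds);    /* client fd numbers grow with load */
          if (client_fds[i] > maxfd)
              maxfd = client_fds[i];
      }

      /* The kernel's scan cost tracks maxfd + 1, not nclients: a single
       * client sitting on fd 9000 makes every call walk ~9000 bits. */
      return select(maxfd + 1, &rfds, NULL, NULL, NULL);
  }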


What you wrote about the design flaws of select is correct, but an fd set with 10,000 sockets is just 1,250 bytes or 157 longs, and hardware is very good at sequential memory access. I agree it will be slower than kqueue or epoll for long-lived connections because there is more work for the kernel and the application to do, but for many applications select is good enough when used correctly.

I have my doubts about select causing 100% CPU utilization; I suspect you were doing other suboptimal things as well. The sample code I wrote runs well with both 100 and 10,000 connections. I have my own anecdotal evidence where an application was barely handling just 100 mostly inactive connections and the OPS guys suggested a limit of just 50 connections per application. After I fixed how fd sets were created and how the result of select was processed, the same application ran just fine with 8,000 connections. We had to support Linux, Solaris, AIX and HP-UX at that time and select/poll were available on all of them. That's why I invested time in optimizing the code instead of switching to epoll. The OPS guys still suggested a limit of 1,000 connections per application, but this time it was for availability and other non-performance reasons.
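Roughly, the fix was the "maintain a master set and copy it" pattern described in the article; a minimal sketch (names are mine, not the actual code):

  #include <string.h>
  #include <sys/select.h>

  static fd_set master_set;   /* updated only on accept()/close() */
  static int    max_fd;

  int wait_for_events(fd_set *ready)
  {
      /* One flat copy instead of rebuilding the set from connection
       * structures (and taking locks) on every iteration. */
      memcpy(ready, &master_set, sizeof(master_set));
      return select(max_fd + 1, ready, NULL, NULL, NULL);
  }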


In my experience, there's definitely a point where poll() beats select() when handling lots of connections, long-lived or short-lived. But at that point you are much better off moving to epoll() or kqueue() depending upon the OS. And if you support those syscalls, there's little point falling back to select().

For most people, if your program is handling 10s or 100s of concurrent connections, select() should be just fine.

If you are going to be handling more, it's worthwhile looking into the other syscalls to improve performance. In any case, I'd recommend abstracting away the event handling so that you can switch between different syscalls, and using benchmarking to try to replicate the traffic you want to handle. There are lots of libraries that will do this work for you, unless you need to get into the low-level stuff.

Trying to judge syscall performance by reasoning about the amount of data transferred between your program and the OS, the number of syscalls, etc., is very difficult. You are much better off measuring the actual behaviour. For example, epoll() seems like a terribly designed API to me as it involves making many syscalls (whereas kqueue() is just one per loop). However, I found epoll() to be very high performance. I guess the cost of syscalls on Linux can be very low in some cases.


System calls on Linux that deal with per-process state try very hard not to invalidate userland memory mappings.

The kernel's real range and userland's virtual range won't overlap, so for a lot of functions kernel memory just has to be mapped/unmapped on the call, without invalidating _all_ userland bindings.

Well, okay, they will overlap. So yes, your mappings may get invalidated, but for synchronized, higher-performance system calls they _shouldn't_ be.

This lets them run in the tens to hundreds of nanoseconds.

Normally the most _expensive_ part of a Linux syscall is the TLB misses after one.

---

Your model of memory transfer size assumes data is being copied.

The Linux kernel has a lot of features that let userland, devices, and itself all share the same memory, copy-free.


This is wrong.

Kernel address space and userspace address space don't overlap at all. There's zero mucking with memory mappings, TLB etc. on syscall.


I've written lots of servers in C and found this article quite interesting.

The techniques that the author presents are new to me and I know that I will take them into consideration when I write a server in C the next time.

More importantly, they are techniques that I think library writers should use. My uneducated guess is that the underlying implementations of servers in Node.js and Iron, etc, could be improved with these ideas, even though they might not be directly applicable since those servers are written in different languages.

I can't vouch for the results, but the author's experiments seem conclusive.


Allocating an fd_set to an arbitrary size, as he does, is somewhat iffy standards-wise, but if it works it works - especially for a solution targeting specific OSes like Linux, where this won't be broken in the future. On that note, it is worth wondering how expensive the copying is, but I would wager he is right that the copying isn't actually that expensive - it's only an 8K buffer.
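For what it's worth, the allocation itself is tiny; something like this sketch (assuming Linux's flat bit-array fd_set; MAX_FDS is a made-up constant, and the standard FD_* macros still assume FD_SETSIZE unless you raise it before including the headers):

  #include <stdlib.h>
  #include <sys/select.h>

  #define MAX_FDS 65536   /* hypothetical limit, well above the default FD_SETSIZE */

  fd_set *alloc_big_fd_set(void)
  {
      /* Round the bit count up to whole longs and zero it: 65536 bits = 8 KB. */
      size_t nlongs = (MAX_FDS + 8 * sizeof(long) - 1) / (8 * sizeof(long));
      return calloc(nlongs, sizeof(long));
  }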

Also, when creating an array that is indexed by fd, if you want to use only the memory you need instead of creating an extremely large array sized for the system's max fd, you can allocate the memory using `mmap`. Systems like Linux won't actually allocate all the memory when you do the `mmap`; instead each spot is mapped to a zero page, and when you write to it the kernel automatically allocates a page to back that memory. Combined with the kernel always selecting the lowest fd, this means the memory backing your array will only grow as large as needed for the maximum number of fds you use. (Note: it's likely that such a large array would be allocated via `mmap` in its own spot by malloc/calloc anyway, but using `mmap` directly ensures you get the behaviour you intended.)
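A sketch of that mmap trick (my own illustrative struct and names), for the curious:

  #include <sys/mman.h>

  struct conn_state { int flags; void *buf; };   /* whatever per-fd state you keep */

  static struct conn_state *conn_table;

  int init_conn_table(size_t max_fds)
  {
      /* Anonymous pages are zero-filled and only get physical memory on
       * first write, so the table costs address space, not RAM, up front. */
      conn_table = mmap(NULL, max_fds * sizeof(*conn_table),
                        PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      return conn_table == MAP_FAILED ? -1 : 0;
  }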


The thing that really resonated for me in this article was the point that for most of us, the performance of select() or accept() really doesn't matter, because any time spent there is easily dwarfed by application logic, database lookups, queries to other services, etc.

You only really need to care about this if you're working on something mainly concerned with the connections themselves, like HAProxy.


The right way to think about these kinds of things is in terms of responsiveness. For example, when I think of any request initiated by a user, there is a ~200ms budget of time to service that request. If you can shave off a couple hundred microseconds, those can be spent on returning a better answer.

Everyone has a piece to contribute to servicing user requests, and everyone should be mindful of how much of the budget they consume. Using select (or epoll) more efficiently makes everyone else's job easier. It is important for everyone in the pipeline to think this way, you included. The user experience depends on it.


In a nutshell: performance still matters because your program isn't the only one running on the computer.


Regarding this:

> epoll uses system calls (“slow”) to add or remove descriptor to the set. select does the same thing in userspace. If you have to add and remove many descriptors per iteration which is typical for short-lived connections then select might be more efficient.

I would have estimated that on a select system call the kernel needs to internally attach some kind of event listener for each passed FD and uninstall it before the system call returns. Which means select is internally a sequence of "iterate over all FDs, check if an event is available and if not install a listener", "wait for at least one event to happen" and finally "uninstall all listeners". While epoll only requires "wait for at least one event to be available and then iterate over the FDs and record all available events". So I would guess that while epoll requires more syscalls for setup, it should be a lot more efficient if an FD is polled more than once. With edge triggering most of the registration/unregistration calls can also be avoided.

Or do kernels have special optimizations that make select internally more efficient than I described? Maybe some memoization of formerly used pollsets?


You're right. Effectively the kernel has to set up the event listeners for each FD on every select() call, whereas with epoll it can be done once and the state is preserved, leading to less overall work. Also, the epoll syscalls are surprisingly fast and lightweight.
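In code, the difference looks something like this (a rough sketch, error handling omitted):

  #include <sys/epoll.h>

  int setup(int listen_fd)
  {
      int epfd = epoll_create1(0);
      struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
      epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   /* registered once */
      return epfd;
  }

  int wait_once(int epfd, struct epoll_event *events, int maxevents)
  {
      /* Each loop iteration is one syscall and only ready fds come back,
       * instead of rebuilding and rescanning the whole set as select does. */
      return epoll_wait(epfd, events, maxevents, -1);
  }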


http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod and search for "EVBACKEND_EPOLL (value 4, Linux)" to read how much fun epoll is :-D



A lot of the advice seems misguided, if not wrong. Rather than optimize around the limitations of select(2), I'd suggest using different APIs instead.

For example:

> Literature ussually says that fd_set is limited to 1024 descriptors but some mention that you can increase that by defining FD_SETSIZE to a larger number.

If you just need a one-off tool for a few descriptors, why not just use poll(2) instead? No problems with high fd numbers, and the API is quite simple.

> To find out which sockets have events some loop through data structures and check each socket they contain. But a much better way is to iterate through fd_set as array of long and check 32-64 bits at a time.

If you've reached the point at which this trade-off matters, you're better off switching to a socket API that scales well (epoll or kevent) and does this filtering for you. Or like another commenter suggested, using a library that abstracts this functionality.

> select modifies descriptor sets passed in. A common mistake is to create descriptor set from zero before calling select by looping through data structures and adding all sockets they contain. I’ve seen that to take huge amount of time due to locking and synchronization between threads. The correct way is to maintain a descriptor set all the time and create a copy of it and pass the copy to select.

Again -- if you've reached the point where this tradeoff matters, just go directly to epoll/kevent.

> Most programs will need to store some state associated with the socket. Map seems to be a good choice at first but you should use an array indexed by socket (it’s an int). The POSIX API says that kernel always has to use the lowest unused file descriptor when creating a new one. So the array will be as big as many connections your program handles and kernel will take care of finding a free slot in your array.

Maybe your program deals with non-socket fds, and the set of socket fds is fairly sparse. Using a map is actually pretty reasonable even if fds are dense.


I probably should have started by saying that this was years ago, when applications had to work on OSes other than Linux and BSD (AIX, HP-UX, Solaris). Today it's libev or Boost.Asio.

> why not just use poll(2) instead?

Because, as I mentioned later, poll is basically the same as select but requires more memory to be copied to/from the kernel, and it was slower in some test cases, although I can come up with cases where poll will be faster than select. Networking libraries like libev and others allocate fd_set the same way.

>> To find out which sockets have events some loop through data structures and check each socket they contain. But a much better way is to iterate through fd_set as array of long and check 32-64 bits at a time.

> you're better off switching to a socket API that scales well (epoll or kevent) and does this filtering for you. Or like another commenter suggested, using a library that abstracts this functionality.

That's how kernel, libev and others work with fd_set.
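In rough code, the scan looks like this (a simplified sketch of mine; __builtin_ctzl is GCC/Clang-specific, and this version is sized for a default FD_SETSIZE fd_set, unlike libev's):

  #include <string.h>
  #include <sys/select.h>

  void scan_ready(const fd_set *result, int maxfd, void (*on_ready)(int fd))
  {
      unsigned long words[FD_SETSIZE / (8 * sizeof(long))];
      size_t nwords = (maxfd + 8 * sizeof(long)) / (8 * sizeof(long));

      memcpy(words, result, nwords * sizeof(long));   /* avoid aliasing the fd_set */

      for (size_t i = 0; i < nwords; i++) {
          unsigned long w = words[i];
          while (w) {                            /* whole empty words are skipped */
              int bit = __builtin_ctzl(w);       /* index of lowest set bit */
              on_ready((int)(i * 8 * sizeof(long) + bit));
              w &= w - 1;                        /* clear that bit */
          }
      }
  }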

>> The correct way is to maintain a descriptor set all the time and create a copy of it and pass the copy to select.

> Again -- if you've reached the point where this tradeoff matters, just go directly to epoll/kevent.

Again -- libev and others do this.

> Maybe your program deals with non-socket fds, and the set of socket fds is fairly sparse. Using a map is actually pretty reasonable even if fds are dense.

Yes, there might be cases where you have non-sockets and you can't use an array indexed by socket. But it's great in most cases, and the kernel will keep it as dense as possible. It might be that you have 10,001 connections, then 10,000 are closed while the highest socket is still in use, so the array memory is wasted. But it will never require more memory than during peaks.

libev use of select: http://cvs.schmorp.de/libev/ev_select.c?revision=1.56


  unsigned long *wfds = (unsigned long *)wfds_out;
Unless fd_set has alignment requirements at least as strict as unsigned long's, and fd_set and unsigned long are compatible effective types, this line of code provokes undefined behaviour twice (it breaks strict alignment and breaks effective-type access rules).


In this case, it is most probably OK, because wfds_out comes from calloc(), which uses the largest possible alignment to work for any type (it returns void* and cannot know which types are inside the allocated memory, so it needs to assume the max on the architecture).
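The alignment part, at least, can also be sidestepped by making the allocation itself a union (a sketch of my own, not what libev does; sizes are illustrative):

  #include <stdlib.h>
  #include <sys/select.h>

  union big_fd_set {
      fd_set        set;            /* gives the allocation fd_set's alignment */
      unsigned long bits[2048];     /* 2048 * 64 bits = 131072 fds on LP64 */
  };

  union big_fd_set *alloc_set(void)
  {
      return calloc(1, sizeof(union big_fd_set));
  }

As far as I know, C (unlike C++) permits reading a union member other than the one last written, so looking at the bits[] view after filling in set stays on firmer ground than the raw pointer cast.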


Another problem with select() is that trying to use a file descriptor above FD_SETSIZE in FD_SET() and the like will actually overwrite memory past the end of the set (which often lives on the stack) and cause random segfaults in other areas of the code. Note that sockets aren't the only thing using up file descriptors, and it is quite easy to run into this issue. The solution is to use poll() as essentially a drop-in replacement. Strangely, the winsock version of select() acts more like poll: instead of a bitmask indexed by socket fds, it uses an array of sockets, so you can actually use up to FD_SETSIZE sockets regardless of their descriptor values.
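A sketch of the drop-in direction (my own wrapper, just to show the shape):

  #include <poll.h>
  #include <stdlib.h>

  /* Wait for readability on an arbitrary list of descriptors. */
  int wait_readable_poll(const int *fds, nfds_t nfds, int timeout_ms)
  {
      struct pollfd *pfds = calloc(nfds, sizeof(*pfds));
      if (!pfds)
          return -1;
      for (nfds_t i = 0; i < nfds; i++) {
          pfds[i].fd = fds[i];        /* values above FD_SETSIZE are fine here */
          pfds[i].events = POLLIN;
      }
      int ret = poll(pfds, nfds, timeout_ms);
      free(pfds);
      return ret;
  }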


Btw "ussually" instead of usually


Does Apache web server use this trick?


In general it cannot, because it does not do asynchronous IO on "sane" platforms. (what it does/did on Netware might pass for async IO)


No, mpm_event uses epoll/kqueue rather than select.


It took me a few seconds to realize this article is not about select2[0], the javascript library.

[0]: https://select2.github.io/


The correct way to use select() is not to use it. You should instead use a library designed to do at a high level whatever it was you were going to use select() for. This library will then (hopefully) figure out what your application needs as efficiently as it can, rather than you having to spend a week researching how modern operating systems work to finally learn that you should have been using a library to do what you were about to try to reinvent.


Are there really no applications for using low-level async I/O APIs/syscalls? Embedded systems? Game servers? Proxies? Software routers? Hidden command & control servers?


For embedded systems, LWIP has a socket API that supports select(). Try finding an event library for that one! I had to roll my own, and I learned quite a bit in the process.


Actually, if you use the low-level LWIP API (sometimes called the raw API) instead of the socket API, it is event-driven (single-threaded, callback-driven, etc.) by design. The socket abstraction layer on top of it then provides blocking APIs and multithreading support - at the cost of lower performance.
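If it helps, the shape of the raw API is roughly this (a from-memory sketch; exact signatures and macros can vary a bit between lwIP versions, so double-check against your tree):

  #include "lwip/tcp.h"

  static err_t on_recv(void *arg, struct tcp_pcb *pcb, struct pbuf *p, err_t err)
  {
      if (p == NULL) {                 /* remote end closed the connection */
          tcp_close(pcb);
          return ERR_OK;
      }
      /* ... consume p->payload here ... */
      tcp_recved(pcb, p->tot_len);     /* reopen the receive window */
      pbuf_free(p);
      return ERR_OK;
  }

  static err_t on_accept(void *arg, struct tcp_pcb *newpcb, err_t err)
  {
      tcp_recv(newpcb, on_recv);       /* everything arrives via callbacks */
      return ERR_OK;
  }

  void start_listener(void)
  {
      struct tcp_pcb *pcb = tcp_new();
      tcp_bind(pcb, IP_ADDR_ANY, 8080);
      pcb = tcp_listen(pcb);
      tcp_accept(pcb, on_accept);
  }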


I'm planning to eventually migrate to the netconn API, but was deterred by a lack of documentation. Do you know of any sources? This doesn't help much: http://www.nongnu.org/lwip/2_0_x/index.html


I also had quite a hard time with the lack of documentation for LWIP. The only other source that I found was the wiki which is also linked there: http://lwip.wikia.com/wiki/LwIP_Wiki

But I still had to figure out lots of things from the actual code.


Rats. I had a feeling you'd say that, thanks!


I found something more in my work bookmarks: http://www.atmel.com/Images/Atmel-42233-Using-the-lwIP-Netwo...

Maybe it also has some extra information which can help.


Thanks!


So in other words, when is it appropriate to avoid using a pre-coded, pre-tested, independently-supported, documented, stable set of code designed to work in a variety of situations as well as it can? When it's not designed for what you need it to do. But i/o libraries are designed for all cases of select() I can think of.

The only time you shouldn't use a library, then, is when it literally doesn't work for you. In that case you should modify the library to add support for what you need, get it upstream, and use the library. Even if you decide using a library isn't right for you, you should at least read the source code of a library to see how they would have done it.


You're right with this comment. There have been times when I would have benefitted by looking at a library's code even when it didn't fit what I was trying to do.

What libraries do you recommend to replace async io system calls in C?


The libevent family of libraries -- libevent, libev, libuv.


check out libev: http://software.schmorp.de/pkg/libev.html

scales well, has a decent API.



