Comparing the C FFI overhead on various languages

kllrnohj · on May 14, 2022

Another major caveat to this benchmark is it doesn't include any significant marshalling costs. For example, passing strings or arrays from Java to C is much, much slower than passing a single integer. Same is going to be true for a lot (all?) of the GC'd languages, and especially true for strings when the language isn't utf8 natively (as in, even though Java can store in utf8 internally, it doesn't expose that publicly so JNI doesn't benefit)

adgjlsfhk1 · on May 14, 2022

Julia (one of the 2 fastest languages here) Is GCed. GC only make C interop hard if you move objects.

kllrnohj · on May 14, 2022

But as I said, this benchmark doesn't include any meaningful marshalling. Julia is the second fastest here when a single int is passed. Julia is very unlikely to still be second fastest when an array (or string) is passed.

adgjlsfhk1 · on May 15, 2022

You only need marshling if one of your languages is using a bad data layout. Julia stores strings as a a length plus a pointer to Uint8 data. Julia structs have the same layout as C (you may need to specify padding in C, but that's easy enough). Arrays of immutable structs are also usually stored inline. There are definitely some types of objects that you might need to marshal (eg Dicts or other more complicated data structures), but for all of the basic stuff (lists of floats etc), Julia can still just pass a pointer.

cbkeller · on May 15, 2022

For an array, you'd have to worry about row- vs column-major orientation if multidimensional, but for simple numeric vectors and base strings (which are just a collection of UInt8s in memory), it appears to be sufficient to merely pass the pointer to C:

  julia> a = "hello there!"
  "hello there!"

  julia> p = pointer(a)
  Ptr{UInt8} @0x000000010a431458

  julia> ccall(:puts, Int, (Ptr{UInt8},), p)
  hello there!
  10

For vectors of structs, then of course you'd need to know the layout of each struct when operating on them from the other language, but still doable enough in principle. Vectors of union types would be trickier though.

jimmaswell · on May 14, 2022

How is that possible? It's not just passing pointers?

lelanthran · on May 14, 2022

> How is that possible? It's not just passing pointers?

No. A Java string is a "pointer" to an array of 16-bit integers (each element is a 2-byte character). A C string is a pointer to an array of 8-bit integers.

You have to first convert the Java string to UTF8, then allocate an array of 1-byte unsigned integers, then copy the UTF8 into it, and only then can you pass it to a C function that expects a string.

vvanders · on May 14, 2022

Let's not forget that it's modified UTF-8[1] you get back from JNI lest you think that you'll be able to use the buffer as-is.

[1] https://docs.oracle.com/javase/10/docs/specs/jni/types.html#...

ReactiveJelly · on May 14, 2022

Qt (C++ framework) is also UTF-16, so maybe if you're lucky you could pass strings between Java and Qt without transcoding?

lelanthran · on May 15, 2022

> Qt (C++ framework) is also UTF-16, so maybe if you're lucky you could pass strings between Java and Qt without transcoding?

Probably not; the other fields in the string will be different (the length field might be unsigned in Qt while it's almost certainly signed in Java. Java strings may have other fields that are not present in the Qt string (and vice versa).

xigoi · on May 15, 2022

Why would signedness be a problem? If you reinterpret a non-negative two's complement integer as unsigned, you get the same value.

lelanthran · on May 15, 2022

> Why would signedness be a problem? If you reinterpret a non-negative two's complement integer as unsigned, you get the same value.

It won't be a problem if the string being passed from Java to C is const. It will be if the C code increases the size of the string enough to set the highest bit. Then Java will be looking at negative length strings.

spullara · on May 14, 2022

Will be interesting to see how Project Panama does on this kind of benchmark.

https://openjdk.java.net/projects/panama/

jimmaswell · on May 14, 2022

Guess I was missing the context, I thought this was just within Java.

kllrnohj · on May 14, 2022

Others already covered the string issue, but broadly you can't have a compacting GC if you need to also have stable C pointers. Can't move the data around to compact it at that point.

In theory this is why JNI has GetPrimitiveArrayElements and GetPrimitiveArrayCritical. The Critical variant could block the GC from running at all for the duration or disable compaction (hence why you also can't make other JNI calls in the interim). In practice the way I've found that's most consistently fast is to actually use the GetArrayRegion methods. You're paying for a copy, but you're often paying for one anyway. So at least you can avoid the release JNI call, and copy to memory you've allocated (and could then also reuse).

glouwbug · on May 14, 2022

Likely a malloc’d copy to appease the 8bit char ABI

haberman · on May 14, 2022

Some of the results look outdated. The Dart results look bad (25x slower than C), but looking at the code (https://github.com/dyu/ffi-overhead/tree/master/dart) it appears to be five years old. Dart has a new FFI as of Dart 2.5 (2019): https://medium.com/dartlang/announcing-dart-2-5-super-charge... I'm curious how the new FFI would fare in these benchmarks.

kcb · on May 14, 2022

Because their environment is using an Ubuntu version from 8 years ago. So a better title would be "Comparing the C FFI overhead on various languages in 2014"

meibo · on May 14, 2022

Same with C#, it only benchmarks an old version of mono and not .NET Core, which has received several big performance boosts in recent releases.

elcritch · on May 14, 2022

Actually looks like most of the languages are seriously outdated. Nim and Julia are both way outdated, Elixir is pretty outdated.

tomas789 · on May 14, 2022

There is no Python benchmark but you can find a PR claiming it has 123,198ms. That would be a worst one by a wide margin.

https://github.com/dyu/ffi-overhead/pull/18

spullara · on May 14, 2022

All python code generally does is call C/C++ code and you're telling me it is slow to do that as well? Yikes.

dralley · on May 15, 2022

It's probably the Python loop that is slow rather than calling the code.

remram · on May 14, 2022

cffi is probably the canonical way to do this on Python, I wonder what the performance is there.

edit: 30% improvement, still 100x slower than e.g. Rust.

kevin_thibedeau · on May 14, 2022

If you need a fast loop in Python then switch to Cython.

cycomanic · on May 14, 2022

Using a cython binding compared to the Ctypes one gives a speedup of a factor of 3. That's still not very fast, now putting the whole thing into a cython program. Like so:

    def extern from "newplus/plus.h":
        cpdef int plusone(int x)

    cdef extern from "newplus/plus.h":
        cpdef long long current_timestamp()


    def run(int count):
        cdef int start 
        cdef int out 
        cdef int x = 0
        start = current_timestamp()
        while x < count:
            x = plusone(x)
        out = current_timestamp() - start
        return out

Actually yields 597 compared to the pure c program yielding 838.

remram · on May 14, 2022

That's fine for a tight loop. Performance might still matter in a bigger application. This benchmark is measuring the overhead, which is relevant in all contexts; the fact that it does it with a loop is a statistical detail.

worik · on May 14, 2022

> If you need a fast loop in Python then switch to Cython.

If you need a fast loop do not use Python.

I am a Python hater, but this is unfair. Python is not designed to do fast loops. Crossing the FFI boundary happens very few times compared to iterations of tight loops.

(I have very little experience using FFI, but I am about to - hence keen interest)

kevin_thibedeau · on May 15, 2022

The point is Python's exception mechanism is a particularly heavyweight way to do loop control. This benchmark is heavily dominated by that overhead in a way other interpreted languages aren't.

Too · on May 15, 2022

Can be optimized by assigning plusone = libplus.plusone, before using it as plusone(i).

Otherwise it will do an attribute lookup in each loop iteration, Python has no way to assume zero side-effects of function calls, in case lib.plusone was overwritten to something new inside the plusone function.

sk0g · on May 14, 2022

C FFI takes 123 seconds?! That's pretty insane, but if you mean 123.2 ms, it's still very bad.

Doesn't feel like that would be the case from using NumPy, PyTorch and the likes, but they also typically run 'fat' functions, where it's one function with a lot of data that returns something. Usually don't chain or loop much there.

Edit: the number was for 500 million calls. Yeah, don't think I've ever made that many calls. 123 seconds feels fairly short then, except for demanding workflows like game dev maybe.

seniorsassycat · on May 14, 2022

500 million calls in 123 seconds

NoahKAndrews · on May 14, 2022

I think that's the time to run the whole benchmark suite. Compare to the results for go, for example.

throw827474737 · on May 14, 2022

So why isn't C the baseline (and zig and rust being pretty close to it quite expected), but both luajit and julia are significantly faster??

gallexme · on May 14, 2022

https://nullprogram.com/blog/2018/05/27/

eatonphil · on May 14, 2022

> For the C “FFI” he used standard dynamic linking, not dlopen(). This distinction is important, since it really makes a difference in the benchmark. There’s a potential argument about whether or not this is a fair comparison to an actual FFI, but, regardless, it’s still interesting to measure

jcelerier · on May 14, 2022

With clang, just compiling with -fno-plt gives me:

    jit: 1.003483 ns/call
    plt: 1.254158 ns/call
    ind: 1.254616 ns/call

GCC does not seem to support it though, even if it accepts the flag and gives me:

    jit: 1.003483 ns/call
    plt: 1.502089 ns/call
    ind: 1.254616 ns/call

(tried everything I could think of that would have a chance to make the PLT disappear:

    cc -fno-plt -Bsymbolic -fno-semantic-interposition -flto -std=c99 -Wall -Wextra -O3 -g3 -Wl,-z,relro,-z,now -o benchmark benchmark.c ./empty.so -ldl

without any change on GCC)

arinlen · on May 14, 2022

> There’s a potential argument about whether or not this is a fair comparison to an actual FFI, but, regardless, it’s still interesting to measure (...)

If there's interest in measuring dynamic linking then wouldn't there be an interest in measuring it on all languages that support dynamic linking?

qalmakka · on May 14, 2022

I'm always pretty surprised when I find out most people writing C or C++ have no idea that PLTs exist. They have a small but not negligible cost.

miohtama · on May 14, 2022

Is there anything akin FFI but with static linking for any foreign (non C) language?

tlb · on May 14, 2022

Calling WebAssembly from Javascript, sort of?

In the early Python 2 era there was an option to build an interpreter binary with statically linked C stubs, and it was noticeably faster and let you access Python data structures from C. I used it for robotics code for speed. It was inconvenient because you had to link in all the modules you needed.

fweimer · on May 14, 2022

For OpenJDK, there is JEP 178: https://openjdk.java.net/jeps/178 I haven't seen it used in practice.

Ocaml's C-implemented functions are linked statically. But like JNI, the C functions have special names and type signatures, so it is slightly different from, say, ctypes in Python.

CGO for Go is statically linked, too. Its overhead stems from significant differences between the Go and C world. The example uses dynamic linking, but it would not have to do that.

junon · on May 14, 2022

The question as it stands makes a few assumptions I don't think one can make, and as such is a bit tricky to answer cleanly, but I'll try.

Yes it's just called linking. The language needs to be aware of calling conventions and perhaps side effects and be prepared for no additional intrinsic support for higher level features.

It probably also needs to be able to read C headers, because C symbols do not contain type signatures like many C++ compilers add.

There's no "library" or some out of the box solution for this, if that's what you're asking. This boils down to how programs are constructed and, moreso, how CPUs work.

In most (all?) cases, anything higher level than straight-up linking is headed toward FFI territory.

samatman · on May 14, 2022

LuaJIT can use the FFI against statically linked object code just fine, I'm not sure if that answers your question since in this context it must be embedded in a C program.

It's a hard requirement of static linking that you have just one binary so it might, answer your question that is.

bachmeier · on May 14, 2022

C, C++, Zig, Rust, D, and Haskell are all similar because they're basically doing the same thing. Someone else linked to the blog post, but Lua and Julia aren't doing the same thing, so they get different results.

> both luajit and julia are significantly faster

I would be interested if anyone has an example where the difference matters in practice. As soon as you move to the more realistic scenario where you're writing a program that does something other than what is measured by these benchmarks, that's not going to be your biggest concern.

kllrnohj · on May 14, 2022

> I would be interested if anyone has an example where the difference matters in practice.

Vulkan. Any sort of binding to Vulkan over a non-trivial FFI (so like, not from C++, Rust, etc...) is going to be murdered by this FFI overhead cost. Especially since for bindings from something like Java you're either paying FFI on every field set on a struct, or you're paying non-trivial marshalling costs to convert from a Java class to a C struct to then finally call the corresponding Vulkan function.

joeld42 · on May 14, 2022

Not really, you're usually setting up commands and buffers and stuff in Vulkan. If you're making millions of calls a frame, you're going to have other bottlenecks.

My favorite example is something like Substance designer's node graph or Disney's SeExpr. You'd often want custom nodes that do often something trivial like a lookup from a custom data format or a small math evaluation, but you're calling the node potentially a handful of times per pixel, on millions of pixels. The calling overhead often comes out to take as much time or more than the operation, but there's no easy way to rearrange the operations without making things a lot more complicated for everyone.

I kind of like python's approach, make it so slow that it's easy to notice when you're hitting the bottleneck. Encourages you to write stuff that works in larger operations, and you get stuff like numpy and tensorflow which are some of the fastest things out there despite the slowest binding.

https://www.disneyanimation.com/technology/seexpr-expression...

kllrnohj · on May 14, 2022

> Not really, you're usually setting up commands and buffers and stuff in Vulkan

Those commands and buffers are represented as C structs. If you're in a language that can't speak C structs (like Java, Go, Dart, JavaScript, etc...), all of those command & buffer setup become function calls rather than simple field writes.

bachmeier · on May 14, 2022

> Especially since for bindings from something like Java

I guess I wasn't clear, but I meant the difference between C and Luajit.

kllrnohj · on May 14, 2022

Ah. The answer to that is a lot more murky, since in an actual C/C++ program you're going to have a mix of local, static, and dynamic linking. You're generally not putting super chatty stuff across a dynamic linkage, since that tends to be where the stable API boundaries go. Anything internal is then going to be static linkage, so comparable to luajit, or inlined (either by the compiler initially or with something like LTO) and then even faster than luajit

vvanders · on May 14, 2022

Oh it totally matters, any sort of chatty interface over FFI you will pay for it.

There's a reason a lot of gamedev uses luajit, I've personally had to refactor many interfaces to avoid JNI calls as much as possible as there was significant overhead(both in the call and from the VM not being able to optimize around it).

kllrnohj · on May 14, 2022

The reason a lot of gamedev uses luajit is the ease at which it can be embedded.

And that's not really even true anymore as the majority of gamedev is using Unreal or Unity, neither of which use luajit.

vvanders · on May 14, 2022

It's not just how easy it is to embed, it's also really small in both code + runtime size. I've shipped it on systems with sub 8mb of total system memory(we used a preallocated 400kb block), until quickjs came along there really wasn't anything comparable. It was also much faster than anything else at the time and regularly beat v8 in the benchmarks I ran.

Unity+Unreal are the public engines out there but there's plenty of in-house engines and tool chains you don't really hear about. I wouldn't be surprised if it's still deployed in quite a few contexts.

mananaysiempre · on May 14, 2022

ETA: I see now I was answering the wrong question: you were asking about the comparison between C and LuaJIT, not heavier FFIs and C/LuaJIT.

Honestly I think of the difference (as discussed in Wellons’s post among others) not as a performance optimization but as an anti-stupidity optimization: regardless of the performance impact, it’s stupid that the standard ELF ABI forces us to jump through these hoops for every foreign call, and even stupider that plain inter- and even intra-compilation-unit calls can also be affected unless you take additional measures. Things are also being fixed on the C side with things such as -fvisibility=, -fno-semantic-interposition, -fno-plt, and new relocation types.

Can this be relevant to performance? Probably—aside from just doing more stuff, there are trickier-to-predict parts of the impact such as buffer pressure on the indirect branch predictor. Does it? Not sure. The theoretical possibility of interposition preventing inlining of publicly-accessible functions is probably much more important, at the very least I have seen it make a difference. But this falls outside the scope of FFI, strictly speaking, even if the cause is related.

---

I don’t have a readily available example, but in the LuaJIT case there are two considerations that I can mention:

- FFI is not just cheap but gets into the realm of a native call (perhaps an indirect one), so a well-adapted inner loop is not ruined even if it makes several FFI calls per iteration (it will still be slower, but this is fractions not multiples unless the loop did not allocate at all before the change). What this influences is perhaps not even the final performance but the shape of the API boundary: similarly to the impact of promise pipelining for RPC[1], you’re no longer forced into the “construct job, submit job” mindset and coarse-grained calls (think NumPy). Even calling libm functions through the FFI, while probably not very smart, isn’t an instant death sentence, so not as many things are forced to be reimplemented in the language as you’re used to.

- The JIT is wonderfully speedy and simple, but draws much of that speed and simplicity from the fact that it really only understands two shapes of control flow: straight-line code; and straight-line code leading into a loop with straight-line code in the body. Other control transfers aren’t banned as such, but are built on top of these, can only be optimized across to a limited extent, and can confuse the machinery that decides what to trace. This has the unpleasant corollary that builtins, which are normally implemented as baked-in bytecode, can’t usefully have loops in them. The solution uses something LuaJIT 2.1 calls trace stitching: the problematic builtins are implemented in normal C and are free to have arbitrarily complex control flow inside, but instead of outright aborting the trace due to an unJITtable builtin the compiler puts what is effectively an FFI call into it.

[1] https://capnproto.org/rpc.html

WalterBright · on May 14, 2022

The D programming language has literally a zero overhead to interface with C. The same calling conventions are used, the types are the same.

D can also access C code by simply importing a .c file:

    import foo;  // call functions from foo.c

analogously to how you can `#include "foo.h"` in C++.

dgan · on May 14, 2022

I had to run it to believe, I confirm it's 183 seconds(!) for python3 on my laptop

Also, OCaml because I was interested (milliseconds):

    ocaml(int,noalloc,native) = 2022
    ocaml(int,alloc,native) = 2344
    ocaml(int,untagged,native) = 1912
    ocaml(int32,noalloc,native) = 1049
    ocaml(int32,alloc,native) = 1556
    ocaml(int32,boxed,native) = 7544

TazeTSchnitzel · on May 14, 2022

It seems Rust has basically no overhead versus C, but it could have negative overhead if you use cross-language LTO. Of course, you can do LTO between C files too, so that would be unfair. But I think this sets it apart from languages that, even with a highly optimised FFI, don't have compiler support for LTO with C code.

moefh · on May 15, 2022

I'm not sure LTO is the correct term for this.

What makes C slower than these other languages is that the external function is in a dynamic library. Calling external functions like this is notoriously slow because it involves a double indirection -- every call jumps to the PLT's function entry, which jumps to the actual function.

It's not that hard to do better in C if you load the library manually with `dlopen` and read the function pointer with `dlysym`. In this case, you directly call the function given its address value.

But that's still slower than having the function address "hardcoded" in the machine code, which is what you get when statically linking in C. That's also what the languages with negative overhead do when JIT-compiling (they basically statically link at runtime).

TazeTSchnitzel · on May 16, 2022

You must be talking about something different to me. I am referring to using the compiler's LTO option to do LTO. I didn't say that's the only way you could get low overhead.

moefh · on May 16, 2022

Right, but traditional LTO is for static linking. How would one even do LTO with a shared library?

TazeTSchnitzel · on May 17, 2022

In principle nothing prevents doing it for shared libraries too, if you control them.

Normal_gaussian · on May 14, 2022

For those of us not up on our compilation acronyms - this is Link Time Optimisation.

TazeTSchnitzel · on May 14, 2022

Ah yes indeed. LTO means the compiler can do certain optimisations across compilation units (in C/C++ those are your .c/.cpp files, in Rust the unit is the entire crate), notably inlining.

DonHopkins · on May 14, 2022

>negative overhead

underhead

otikik · on May 14, 2022

Overtail

cube2222 · on May 14, 2022

Just a caveat, not sure if it matters in practice, but this benchmark is using very old versions of many languages it's comparing (5 year old ones).

kllrnohj · on May 14, 2022

It probably matters for a few of the slower ones, like Java, Go, or Dart. It's also going to matter on what platform. eg, Java may have better FFI on x86 than on ARM. Or similarly Dart's FFI may better on ARM than on x86, particularly given Flutter is the primary user these days.

And then to make it even more complicated it's also going to potentially depend on the GC being used. For example for Java's JNI it's actually the bookkeeping for the GC that takes the most time in that FFI transition (can't pause the thread to mark the stack for a concurrent GC when it's executing random C code, after all). Which is going to potentially depend on what the specific GC being used requires.

exebook · on May 14, 2022

I developed a terminal emulator, file manager and text editor Deodar 8 years ago in JavaScript/V8 with native C++ calls, it worked but I was extremely disappointed by speed, it felt so slow like you need to do a passport control each time you call a C++ function.

Koromix · on May 14, 2022

The official solutions, node-ffi and node-ffi-napi, are extremely slow, with an overhead hundred of times higher than it should be. I don't know what they do to be so slow.

I'm making my own FFI module for Node.js, Koffi, as a much faster alternative. You can see some benchmark here, to compare with node-ffi-napi: https://www.npmjs.com/package/koffi#benchmarks

SemanticStrengh · on May 14, 2022

An interesting alternative would be to not have any FFI and to use transparent polyglot interop between javascript and c++ via GraalJs /sulong

exikyut · on May 14, 2022

Oh nice, with a Norton Commander-alike terminal UI.

Screenshots at https://sourceforge.net/projects/deodar/

Last-modified 2018 over at https://github.com/exebook/deodar

I'm not sure where the use of Yacc ends and https://github.com/exebook/elfu begins in the .yy files (which are sprinkled with very algebraic-looking Unicode throughout). The Pascal-like class definitions may be defined in https://github.com/exebook/intervision.

Very interesting project.

ZiiS · on May 14, 2022

Probably butter smooth after eight years of V8 development and Moore's Law (well whatever passes for it now).

kllrnohj · on May 14, 2022

Probably not since single threaded improvement has barely advanced over the last 8 years, and JS/V8 are still in a single threaded world that stopped existing a decade ago.

ryukoposting · on May 14, 2022

This ia a cool concept, but the implementation is contrived (as many others describe). e.g. JNI array marshalling/unmarshalling has a lot of overhead. The Nim version is super outdated too (not sure about the other languages).

sk0g · on May 14, 2022

For a game scripting language, Wren posts a pretty bad result here. Think it has isn't explicitly game focused though. The version tested is quite old however, having released in 2016.

mhh__ · on May 14, 2022

Needs LTO, with that it will have 0 overhead in the compiled languages.

D can actually compile the C code in this test now.

khoobid_shoma · on May 14, 2022

I guess it is better to measure CPU time instead of wall time (e.g. using clock() ).

KSPAtlas · on May 14, 2022

What about Common Lisp?

medo-bear · on May 14, 2022

there is a pretty powerful CFFI package (library) to achieve this, however performance will be very implementation dependent. in case someone wants to try this, the defacto-standard, free open source, speedy implementation is SBCL

planetis · on May 14, 2022

That Nim version has just left kindergarten and is prepping for elementary.

dunefox · on May 14, 2022

> - julia 0.6.3

That's an ancient version, the current version is v1.7.2.

Sukera · on May 14, 2022

I don't think that matters here - the FFI interface hasn't changed and I wouldn't expect it to differ significantly.

SemanticStrengh · on May 14, 2022

Java has a new API for FFI called the foreign memory interface

alkonaut · on May 14, 2022

Any idea why mono is used rather than .NET here?

DoingIsLearning · on May 14, 2022

From the readme:

> My environment:

> [...]

> - Ubuntu 14.04 x64

> [...]

Mono can run on nix targets but .NET itself (not .NET core) is still very much windows only.

vopi · on May 14, 2022

.NET core doesn't exist anymore. It's now just .NET. It is the successor to the old windows only .NET. It is very much not windows only.

juki · on May 14, 2022

.NET core is .NET nowadays. The old Windows only .NET framework isn't being developed anymore (other than fixes).

alkonaut · on May 14, 2022

The “normal” .NET runs on Linux as much as Java or python.

The framework/core divide no longer exists. That’s why I’m asking.

richardwhiuk · on May 14, 2022

The benchmark is ancient.

bfrog · on May 16, 2022

Go looks horribly slow, I thought segmented stacks have gone away to improve this?

ta988 · on May 14, 2022

Java has project Panama coming that may improve things a little.

thot_experiment · on May 14, 2022

n.b. this is using an absolutely ancient version of node, though i'm not sure that would change anything, worth nothing

sdze · on May 14, 2022

Can you try PHP?

ksec · on May 15, 2022

Missing 2014, or 2018 in the title.

SolitudeSF · on May 15, 2022

this benchmark is afwul (which is expected)