It's interesting to watch the swings and roundabouts in the VM engineering space. I remember many years ago being in a Google engineering all hands where Android was first announced (to the firm) and the technical architecture was explained. I and quite a few others were very surprised to hear that they planned to take slow, limited mobile devices and run a Java bytecode interpreter on them. The stated rationale was also quite surprising: it was done to save memory. I remember being very dubious about the idea of running a GCd interpreted language on a mobile phone (I had J2ME experience!).
In the years since I've seen Android go from interpreter, to interpreter+JIT, to AOT, to JIT+AOT, to interpreted + JIT then AOT at night. V8 has gone from only JIT (compile on first use) to multiple JITs to now, interpreter+single JIT. MS CLR still doesn't have any interpreter and is fully AOT or JIT depending on mode. HotSpot started interpreted, then gained a parallel fast JIT, then gained a parallel optimising JIT, then went to a tiered mechanism where code can be interpreted, compiled and recompiled multiple times before the system stabilises at peak performance.
Looking back, it's apparent that Android's and indeed Java's initial design was really quite insightful. A tight bytecode interpreter isn't quite as awful as it sounds, especially given how far CPU core execution speed has raced ahead of memory and cache availability. If you can fit an interpreter almost entirely in icache, and pack a ton of logic into dense bytecode, and if you can use spare cores that would otherwise be idle (in desktop/mobile scenarios) to do profile guided optimisation, you can end up utilising machine resources more effectively than it might otherwise appear.
That was just catching up with what mainframes have been doing since the 60's. :)
That is why, when you read old papers about compilers, assembly opcodes and bytecode are often intermixed.
It was common for the OS and language runtime to work with bytecodes, with execution being done at the microcode level; what nowadays would be better done as an FPGA.
On IBM i (aka OS/400), there is a kernel JIT and all languages target bytecode, including C. JVM bytecodes also get translated into IBM i ones. It was implemented in PL/M, nowadays with many modules ported to C++. And if you want to generate actual machine code, you need the Metal C compilers.
There are plenty of other examples.
Like with containers, it is just recycling mainframe success stories.
Meanwhile, iOS has used strictly AOT compilation from the beginning for native applications. And doesn't iOS have a reputation for buttery smooth UIs? Perhaps the original AOT compilation approach of ART wouldn't be so bad if it didn't have to be done on the user's device, taking up the user's time while installing apps or upgrading the OS.
Smooth, but for many years, smooth only until you tried to switch apps. iPhones did not have enough RAM to really do more than one thing at once, so multi-tasking came years late, and the legacy of that is still present in the very weak, hacked-together way in which apps collaborate on that platform. Perhaps this lack of emphasis on multi-tasking and linking apps together is partly due to the decision to AOT compile everything, creating additional RAM pressure. Also, Android was able to scale down to cheap devices and up to very high-end devices. Apple basically ceded the entire budget smartphone space to Android, they didn't even try, and budget phones (especially in developing countries) are often characterised by being RAM constrained.
Interesting. I was under the impression that Android had more problems with RAM pressure than iOS, because Android uses tracing GC rather than reference counting. I observed that as late as the iPhone 6, the flagship iOS device had only 1 GB of RAM, whereas contemporaneous flagship Android devices had 2 or 3 GB. I guess I succumbed to the tendency to assume that Apple can do no wrong, leading to the assumption that Apple shipped with less RAM because they didn't need it, not because they were just being cheap.
GC technically speaking can reduce the amount of RAM you need, as it can compact the heap whereas a native malloc can't. In practice this effect is hard to measure as the sorts of languages that use compacting GCs tend to be rather pointer and allocation happy, so any gain from compaction gets outweighed by other factors.
The specs of the various devices have changed over time. Android scales better: you can give it very little RAM, or a lot, and it'll make the best use of it by e.g. keeping more background apps loaded at once. iOS was at least historically much less flexible about this: boosting RAM was not worth as much as in the Android space because app devs would still target older devices for quite a while, so the additional RAM would go unused, and the OS wasn't capable of using the spare capacity for much due to the lack of multi-tasking.
Eventually Apple implemented Android-style task switching, so I don't know if that's still true. I haven't done any mobile dev for years. But I also think at some point Apple realised nobody who buys an iPhone actually cares about whether they're getting value for money or what the specs are, so they just stopped competing in that area. I mean, they have never reduced the price of the iPhone once, right? Despite the huge fall in underlying component prices over the years. They could ship a device with 128 MB of RAM and if it did the same thing as the iPhone 4 people would still buy it, simply because they see themselves as iPhone users and not "smartphone users".
It goes to show that several factors influence the performance, memory usage, and real hardware requirements of a platform. I'll gladly admit that my original, simplistic assumptions were wrong.
ObjC, as typically written (i.e. more object-oriented than C), is also pointer and allocation heavy, so that might make compaction more of a benefit for Android. I wonder what proportion of memory in a typical Android application process is on the C heap rather than the garbage-collected, compacted heap. For example, where does image data usually end up?
To throw in another complication, are there any significant problems that come with layering a garbage-collected runtime on top of a high-level framework based on C heap allocation and reference counting? That's what Xamarin, React Native, and AOT Java runtimes (e.g. Multi OS Engine) do on iOS, and what .NET (even .NET Native) does on UWP. Or how about two garbage-collected runtimes in one process, e.g. Xamarin and React Native on Android?
Google did a lot of work to fit Android onto 512 MB devices, and modern Android uses a parallel concurrent GC with just one pause. They also made JNI stricter, in order to allow compaction in future versions.
Windows Phone .NET runtime also uses tracing GC, although the underlying UWP APIs are COM based.
iOS's advantage in this is that the hardware range is small. It's feasible to do full AOT away from the end user's device.
With Android there are hundreds or thousands of combinations of various bits of hardware and drivers, with various levels of compatibility. Device manufacturers have a terrific advantage in being able to produce hardware that is more specialised for the purpose, but there are trade-offs. Compiling all possible variations AOT away from the end device is going to be difficult.
Not really. Practically all Android devices support ARMv7 binaries. They have to, in order to work with applications that use the NDK. If what you say about Android were true, then AOT compilation for Windows on x86 wouldn't be feasible either.
This comment reminds me, there are actually a bunch of Android tablets that shipped with Intel Atom x86 processors. Intel worked some magic with libhoudini that allowed them to run ARM binaries surprisingly fast. If you're an app maker who actually cares about those devices (like my employer) then you end up building all your native code for x86 too. I recall the ARM binaries running at more than 50% of the speed of the native x86 ones, on optimized, threaded, and vectorized image processing code. I wish Intel would talk more about that library. I'd love to know what techniques it used.
It's done on-device to handle the problem of linking. All calls from the app into the framework have to be linked when you do AOT, and Google can't do that without having device images of every Android device and compiling on demand for whatever device is requesting the install.
And then you'd still have to fall back to on-device compilation if Play Store doesn't know what the device is anyway (aka, all custom ROMs), or when you take an OTA (otherwise you'd need to re-download all installed apps for the new OTA image).
Plus, the dex bytecode is considerably smaller than the AOT result, so wire transmission costs are cheaper doing on-device compilation - call it a form of compression.
In fairness, they managed to do it only for the highest of high end users. Apple institutionally doesn't care about anyone in places like China, India or Africa where data is expensive. That's why Android completely dominates the global smartphone market share.
I guess nothing would stop the Play Store from AOT compiling binaries just for users with fast connections though.
WP 8 and 8.1 use MDIL, Machine Dependent Intermediate Language: basically machine code with jump targets kept as symbolic references, resolved at installation time.
WP 10 uses .NET Native.
I can gladly point out the respective Microsoft documentation, BUILD and Channel 9 presentations.
Yet, now bitcode is a thing on iDevices, to the point that the LLVM folks started to discuss a possible portable version without the low-level, hardware-specific parts of the bitcode.
The difference between Apple's and Microsoft's bytecode solutions and Google's is that the AOT compilation of bytecode takes place at the store instead of on the device.
iOS's smoothness doesn't come from being AOT. If it did, you'd have seen a huge difference in responsiveness in Android between interpreted, JIT & AOT, but you don't. Raw code perf does improve from interpreted -> JIT -> AOT, but the rendering is part of the platform, not the app, and that's been C++ since the beginning.
Dalvik's GC was a big problem for smooth UIs, though.
I think there's a lesson for application developers buried in here: If you have the freedom to choose the language and language features of your application, choose your language and language features so as to minimize the risk of performance cliffs such as those that the article discusses at the beginning. This probably means static typing, and static dispatch (i.e. not virtual functions) by default. And of course, a language like that can be compiled to JavaScript, e.g. Kotlin, C# via Bridge.net, or F# via Fable.
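As a rough, hypothetical illustration of the kind of cliff I mean (names are made up, and exact thresholds depend on the V8 version): a call site that only ever sees one object shape and one method target is easy for the JIT, while one that sees many becomes polymorphic and can get dramatically slower. A statically typed, statically dispatched language avoids that by construction.

    // Hypothetical sketch: same logic, different shape mix at one call site.
    class Circle { constructor(r) { this.r = r; } area() { return Math.PI * this.r * this.r; } }
    class Square { constructor(s) { this.s = s; } area() { return this.s * this.s; } }

    function totalArea(shapes) {
      let sum = 0;
      for (const s of shapes) sum += s.area(); // dispatch depends on the shapes seen here
      return sum;
    }

    // Monomorphic: only Circles ever flow through totalArea.
    totalArea(Array.from({ length: 100000 }, () => new Circle(1)));
    // Mixing several constructors at the same call site makes it polymorphic,
    // which is where the JIT can fall off the fast path.
    totalArea(Array.from({ length: 100000 }, (_, i) => i % 2 ? new Circle(1) : new Square(1)));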
By the way, I know this is a tenuous association, but to me, the name TurboFan makes me think of CPU-hungry code that cranks up the fan(s) on a user's laptop.
This is a huge deal for me. I am creating server[1] for Node.js: "npm install server". The slowest part I noticed in some initial benchmarks was the promise-based code, which is used heavily instead of express's callback-based middleware.
While performance is not one of the main concerns for the library (those are simplicity, "batteries-on" and user experience in general), it is nice to see that my hunch was right: V8 would provide a 500% performance boost to Promises.
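For what it's worth, the shape of the comparison was roughly the following (a sketch, not the exact benchmark I ran; absolute numbers vary a lot between V8 and Node versions): awaiting an already-resolved promise on every iteration versus a plain synchronous path.

    async function viaPromise(n) {
      let sum = 0;
      for (let i = 0; i < n; i++) sum += await Promise.resolve(i);
      return sum;
    }

    function viaSync(n) {
      let sum = 0;
      for (let i = 0; i < n; i++) sum += i;
      return sum;
    }

    (async () => {
      const n = 1000000;
      console.time('promises'); await viaPromise(n); console.timeEnd('promises');
      console.time('sync');     viaSync(n);          console.timeEnd('sync');
    })();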
I've never seen this before and it looks interesting - from a brief scan of the source & about section it looks to me like a wrapper over express to provide sensible defaults to round off some of the sharp corners when setting up a project from scratch.
This could be extremely useful. I've been using express in production for years, and though setting up a new server is not fundamentally difficult, it does involve a lot of common boilerplate (cookie-parser et al) that I think a lot of people already solve with a boilerplate template. This could be a cleaner, more fully-featured approach.
Is this more or less it, or is there more to it / future aims to do more? I also wonder - did you talk to the express developers about making direct contributions to provide solutions to these problems from within the express library itself, rather than externally via a wrapper library?
I'll keep an eye on the development of this project :-)
Thanks! I got started a few months back. I haven't really talked to them besides asking a question or two. The thing is, express used to be like this but it was then split up; from what I read it was mainly due to instability in the subpackages, something which I think/hope will be solved by now. Another point is that it's apparently easier to maintain this way, since there have been some problems internally.
The main differences (for the initial or a later release) are:
* A lot of functionality out-of-the-box. Just install it and get to work.
* Websockets as first-class citizens: since they are one of the major advantages of Node.js itself, it makes sense to make them trivial to use.
* Error handling: intercept some messages from Node.js and provide a more human-readable version.
* Promise-based.
There's also a single parameter for middleware instead of the (err, req, res, next) parameters, since Promises work really well with a single parameter. You might think that this comes from http://koajs.com/ , but only the name comes from there; I used to call it inst (for instance) until I found a better name in Koa's ctx.
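A rough sketch of the difference (hypothetical names only, not the final API):

    // express-style: (req, res, next) middleware, plus a separate
    // (err, req, res, next) signature for error handlers
    function expressStyle(req, res, next) {
      res.setHeader('X-Example', '1');
      next();
    }

    // single-parameter style: one context object; errors travel through the
    // rejected promise rather than a leading err argument
    async function contextStyle(ctx) {
      ctx.res.setHeader('X-Example', '1');
    }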
This is an unsubstantiated claim. If LuaJIT was indeed multiple times faster on arbitrary code, everybody would be just running JS on something like LuaJIT in browsers :)
LuaJIT does a spectacular job on loops, especially if you limit yourself to number crunching and FFI.
But if your code is polymorphic, makes heavy use of normal OOP, and does not decompose into a graph of biased loops and linear traces, V8 will pull ahead thanks to its method-compiler nature, which is not susceptible to tracing pitfalls.
Here is an interesting example: there is an issue[1], "Metatable/__index specialization", in the LuaJIT repository filed by Mike Pall himself. It's about adding infrastructure that would allow traces to make optimistic assumptions about metatable constness, because it would greatly improve the quality of traces produced from OOP-heavy code. (This is also precisely the limitation already imposed on FFI metatypes to allow the JIT to generate better code.) The issue is still unimplemented in LuaJIT... However, this is something V8 has been able to do for several years now, because it is extremely important for the kind of code people are writing in the real world.
To be fair, it probably also helped that Lua is a simpler language and therefore easier to optimize. A tracing JIT also has an inherent downside: the infamous example of an if in an inner loop with roughly the same probability for the then and else branches.
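For readers who haven't seen that case, it looks something like this (written in JavaScript purely for illustration; the point applies to tracing JITs like LuaJIT, not to V8's method JIT):

    function sumHalves(data) {
      let evens = 0, odds = 0;
      for (let i = 0; i < data.length; i++) {
        if (data[i] % 2 === 0) {   // roughly 50/50 branch inside the hot loop
          evens += data[i];
        } else {
          odds += data[i];
        }
      }
      return evens + odds;
    }
    // A tracing JIT records one side of the branch as the hot trace and must
    // side-exit (or stitch extra traces) every time the other side is taken;
    // a method JIT simply compiles both arms of the if once.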
That said LuaJIT is an impressive piece of software. Would love to know what Mike Pall is up to now.
It could easily keep up with Lua if Mike Pall agreed with the changes made to Lua, but he didn't.
The Lua ecosystem is very different from almost any other: its primary use case, probably 95% of projects using it or more, is being embedded into another project for scripting/configuration/control. As such, it is common for projects to pick a version of Lua and stick with it, rather than upgrade to the latest-and-greatest-with-slight-to-major-incompatibilities. Pall liked 5.2, and thought 5.3 didn't offer enough to break compatibility.
"If you're interested in something not shown on the benchmarks game website then please take the program source code and the measurement scripts and publish your own measurements."
JavaScript has some features that make it hard to JIT.
For example, it has no fixed object layout.
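A small illustration of what "no fixed object layout" means in practice (assuming the usual hidden-class/shape machinery; details vary by engine):

    const a = { x: 1, y: 2 };
    const b = { y: 2, x: 1 };   // same fields, different order: different shape
    const c = { x: 1, y: 2 };
    c.z = 3;                    // adding a property transitions c to yet another shape

    function getX(o) { return o.x; }  // this property load now sees several shapes
    [a, b, c].forEach(getX);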
If you are interested in how language design impacts performance, this is worth reading:
http://wren.io/performance.html
> From my very personal point of view, the TurboFan optimizing compiler at that time was probably the most beautiful version we ever had, and the only version (of a JavaScript compiler) where I could imagine that a “sea of nodes” approach might make sense (although it was already showing its weakness at that time).
What were the weaknesses of the Sea of Nodes? Backwards data flow analysis and control flow sensitive analysis being hard?
I worked on a Sea of Nodes compiler [0] myself and "backwards data flow analysis and control flow sensitive analysis" has never been a problem.
Changes to the CFG are hard because of the Phi nodes, which have to be adapted in lockstep with CFG changes. However, this is probably inherent to SSA form, but not sea of nodes.
You want a good graph visualization instead of text-based output, because you actually need to see the "non-order of the sea". Text output with an implicit order can hide issues.
Personally, I believe sea of nodes is better than the alternatives, like SSA form is better than non-SSA. There is nothing which makes it inherently more powerful, but it feels more elegant. Unfortunately, there is no objective comparison, and anecdotes are apples vs oranges. Similarly, how would you compare object-oriented with functional programming?
LibFIRM's sea of nodes representation is a bit different in that nodes are tied to a basic block, if I understand it correctly. That would indeed make control flow sensitive analysis easy. V8's sea of nodes representation does not have this property, so it is not immediately apparent under what control conditions a node may be executed. Its scheduler places nodes in basic blocks at the end of the pipeline.
I loved your paper about PBQP register allocation, by the way!
I've often felt many of the pain points in the article, and they were never explained anywhere (that I saw, anyway).
I'd often profile/benchmark something, make sure it's fast enough to use in our performance critical section of the code, only to find that once in the application I'd only get a fraction of the speed I was expecting.
I would then hop over to using --trace-opt only to find that functions were getting deoptimized or never optimized in the first place and I'd start playing the game of trying things here and there to get it to cooperate. And in some cases --trace-opt wouldn't tell me anything that I could usefully understand yet my code would still be slow.
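For anyone who wants to try that game themselves, a minimal sketch of the workflow (flag names as they existed in Crankshaft-era V8; the exact output format differs between versions, and the file name is just an example):

    // save as hot.js and run with: node --trace-opt --trace-deopt hot.js
    function add(a, b) { return a + b; }

    for (let i = 0; i < 1000000; i++) add(i, i);  // gets optimised for numbers
    add('de', 'opt');                             // a type change can trigger a deopt
    for (let i = 0; i < 1000000; i++) add(i, i);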
Here's to hoping that TurboFan clears up a lot of these weird cases!
And slightly off topic, but what are the plans for the Dart VM? Is it going to end up using TurboFan, or will it stay with Crankshaft?
>Looking at it naively, it seems to follow the rules for the arguments object in Crankshaft
I've never managed to find a good source on the V8 internals and how to target these optimizations (the "rules for the arguments object" the author alludes to).
They are complementary resources, "Optimization killers" helps you avoid pitfalls in practice, "v8-bailout-reasons" tries to document and explain the various Crankshaft bailouts.
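To give one concrete example of the kind of rule being alluded to (as documented in those resources; the behaviour described is Crankshaft-specific and version-dependent):

    // Reading arguments.length and arguments[i] was generally fine:
    function sumOk() {
      let s = 0;
      for (let i = 0; i < arguments.length; i++) s += arguments[i];
      return s;
    }

    // Leaking the arguments object itself was a classic bailout pattern:
    function sumLeaky() {
      return Array.prototype.slice.call(arguments).reduce((a, b) => a + b, 0);
    }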
The latest Node (7.7.1) uses V8 5.5, I believe: does anyone know what the roadmap for updating V8 to the latest version containing all this new stuff looks like?
As of 16 days ago, master uses 5.6, whereas V8 is at 5.8. The pull request took about a month to go through and touched ~2000 files, and there doesn't appear to be an open pull request to update V8 again. So it seems it'll take some time (it looks like a considerable effort).