More

Asm2D · 2025-03-12T16:03:04 1741795384

Introduction of a new high performance PNG decoder provided by Blend2D library, which challenges existing decoders written in C++ and other programming languages.

Asm2D · on Feb 20, 2024

Cairo is in a maintenance-only mode. Nobody develops this library anymore and it only has a maintainer or two. Since nobody really worked on Cairo in the past 15 years it's not optimized for modern hardware.

You can see some existing benchmarks here:

  - https://blend2d.com/performance.html

Both the benchmarking tool and Blend2D are open-source projects so anyone can verify the numbers presented are indeed correct, and anyone can review/improve the backend-specific code that is used by the benchmarking tool.

gigatexal · on Feb 21, 2024

That’s crazy. I once lurked in the IRC of the project. I knew the creator. He was a family friend. I was a silly teen kid toying with Linux he was a dev who worked at redhat and lived in the same town as me.

I wonder what he’s up to these days?

Update/edit:

Ahh he moved on to Ampere: https://www.linkedin.com/in/carl-worth

Also I was a bad fan: the library had a co-founder too I thought it was a bespoke creation of Carl’s own making.

I remember building a bunch of stuff from source back in the day and a lot of Linux applications had Cairo as a dependency.

hgs3 · on Feb 20, 2024

> Cairo is in a maintenance-only mode.

That's too bad. Is their a successor planned or is Skia the recommended alternative?

Asm2D · on Feb 20, 2024

I think that when it comes to 2D rendering libraries there is in general not too many options if you want to target CPU or both CPU+GPU. Targeting GPU-only is bad for users that run on a hardware where GPU doesn't perform well or is not available at all due to driver issues or just not present (like servers).

If you consider libraries that offer CPU rendering there are basically:

  - AGG (CPU only)

  - Blend2D (CPU only, GPU planned, but not now)

  - Cairo (CPU only)

  - Qt's QPainter (CPU only, GPU without anti-aliasing / deprecated)

  - Skia (CPU + GPU)

  - Tiny Skia (CPU only, not focused on performance)

  - GPU only libs (there is many in C++ and Rust)

Nobody develops AGG and Cairo anymore and Qt's QPainter hasn't really improved in the past decade (Qt Company's focus is QtQuick, which doesn't use QPainter, so they don't really care about improving the performance of QPainter). So, only 2 libraries from this list have active development - Blend2D and Skia.

As an author of Blend2D I hope that it will be a go-to replacement for both AGG and Cairo users. Architecturally, Blend2D should be fine after a 1.0 release as the plan is to offer a stable ABI with 1.0 - And since Blend2D only exports C-API it should be a great choice for users who want to use every cycle and who want their code to work instead of making changes every time the dependency is updated (hello Skia).

At the moment Blend2D focuses on AGG users though, because AGG is much more widespread in commercial applications due to its licensing model and extensibility. However, AGG is really slow especially when rendering to large images (like 4K) so switching from AGG to Blend2D can offer a great performance benefits while avoiding other architectural changes of the application itself.

BTW Blend2D is still under active development. It started as an experiment and historically it only offered great performance on X86 platforms, but that is changing with a new JIT backend, which provides both X86 and AArch64 support and is almost ready for merge. This is good news as it will enable great performance on Apple hardware and also other AArch64 devices, basically covering 99% of the market.

a_e_k · on Feb 20, 2024

I'm the author of another CPU-only 2D vector graphics library that might be of interest:

- Canvas Ity (https://github.com/a-e-k/canvas_ity)

It's a tiny single-header C++ library in the style of the STB libraries. My aim was to make it dirt simple to be able to drop into almost any project and get high-quality rendering while providing an API comfortable to those used to <canvas>.

I've been checking out Blend2D every now and then. It seems like a very nice option for the bigger, but faster and more fully-featured end of the spectrum.

(Though for what it's worth, while raw performance isn't my priority, my little library still can hit about 70fps rendering the Postscript Tiger to 733x757 res with a single thread on my 7950x. :-)

Asm2D · on Feb 21, 2024

Nice project, thanks for sharing!

BTW for comparison - Blend2D can render SVG tiger in 1.68ms on the same machine (I also have 7950X) so it can provide almost an order of magnitude better performance in this case, which is great I think. But I understand the purpose of your library, sometimes it's nice to have something small :)

c-smile · on Feb 20, 2024

If I am not mistaken, NanoVG actually can render as by CPU (need external path rasterizer) as by GPU (OpenGL and other options).

NanoVG provided Canvas.Context kind of API in plain C.

cztomsik · on Feb 21, 2024

There is also thorvg by Samsung (or the authors work in there and they are also using it, not 100% sure but it's production ready)

https://github.com/thorvg/thorvg#dependencies

AFAIK they have experimental GPU backend but I'm not sure how far they are with it.

rubymamis · on Feb 20, 2024

Qt Quick should support non-GPU rendering[1]. I don't know how good it is, tho.

[1] https://www.toradex.com/blog/running-qt-without-gpu

mfabbri77 · on Feb 21, 2024

Do not forget: https://www.amanithvg.com (I'm one of the authors, 20+ years of active development). Full OpenVG 1.1 API, CPU only, cross-platform and analytical coverage antialiasing (rendering quality) as main feature. The rasterizer is really fast. I swear ;) At Mazatech we are working to a new GPU backend just these days.

AmanithVG is the library on which our SVG renderer: https://www.amanithsvg.com is based. All closed source as now, but things may change in future.

I will do some benchmarks of the current (and next, when the new GPU backend will be ready) version of our libraries against other libraries. Do you know if there are any standard tests (besides the classic post script Tiger)? Maybe we can all agree on a common test set for all vector graphics libs bechmarks?

Asm2D · on Feb 21, 2024

That's right! I didn't consider closed source libraries when writing the list. There would be more options in that case like Direct2D and CoreGraphics. However, my opinion is that nobody should be using closed source libraries to render 2D graphics in 2024 :)

Regarding benchmarks - I think Tiger is not enough. Tiger is a great benchmark to exercise the rasterizer and stroker, but it doesn't provide enough metrics about anything else. Tt's very important how fast a 2D renderer renders small geometries, be it rectangles or paths. Because when you look at screen most stuff is actually small. That's the main reason why Blend2D benchmarking tool scales the size of geometries from 8x8 to 256x256 pixels to make sure small geometries are rendered fast and covered by benchmarks. When you explore the results you will notice how inefficient other libraries actually are when it comes to this.

Asm2D · on Feb 20, 2024

GPU support was removed from Cairo, because it was slower than CPU rendering and nobody wanted to maintain it.

tetromino_ · on Feb 20, 2024

Cairo's OpenGL support was removed, but I thought Cairo's X11 backend still has GPU acceleration for a few operations through XRender (depending on your video driver).

Asm2D · on Feb 20, 2024

That's true, Cairo still provides XRender backend. Not sure it's that usable though as I think nobody really focuses on improving XRender, so it's probably in the same state as Cairo itself.

Asm2D · on Feb 20, 2024

Thanks! I'm doing what I can to make Blend2D even faster. It's been really exciting project to work on and I have big plans with this library.

Asm2D · on Feb 20, 2024

Text rendering is something that will get improved in the future.

At the moment when you render text Blend2D queries each character from the font and then rasterizes all the edges and runs a pipeline to composite them. All these steps are super optimized (there is even a SIMD accelerated TrueType decoder, which I have successfully ported to AArch64 recently), so when you compare this approach against other libraries you still get like 4-5x performance difference in favor of Blend2D, but if you compare this method against cached glyphs Blend2D loses as it has to do much more work per glyph.

So the plan is to use the existing pipeline for glyphs that are larger (let's say 30px+ vertically) and to use caching for glyphs that are smaller, but how it's gonna be cached is currently in research as I don't consider simple glyph caching in a mask a great solution (it cannot be sub-pixel positioned and it cannot be rotated - and if you want that subpixel positioned the cache would have to store each glyph several times).

There is a demo application in blend2d-apps repository that can be used to compare Blend2D text rendering vs Qt, and the caching Qt does is clearly visible in this demo - when the text is smaller Qt renders it differently and characters can "jump" from one pixel to another when the font size is slightly scaled up and down, so Qt glyph caching has its limits and it's not nice when you render animated text, for example. This is a property that I consider very important so that's why I want to design something better than glyph masks that would be simple to calculate on CPU. One additional interesting property of Qt glyph caching is that once you want to render text having a size that was not cached previously, something in Qt takes 5ms to setup, which is insane...

BTW one nice property of Blend2D text rendering is that when you use the multithreaded rendering context the whole text pipeline would run multithreaded as well (all the outline decoding, GSUB/GPOS processing, rasterization, etc...).

Asm2D · on Nov 8, 2023

Can you elaborate what's "bad" on asmjit?

I wrote it for the purpose of making it easy to write JIT compilers in C++ and it has been adopted even by projects that are written in C (like Erlang). It's also one of the smallest libraries out there and has a great performance.

I understand why people don't want to use projects such as libgccjit or llvm, but asmjit is just super easy to use compared to others.

duped · on Nov 8, 2023

It's in C++. I don't really mind interfacing with C, but C++ is kind of a non-starter.

And back when I last used it, the docs were pretty rough. I understand that's changed a lot and they look much better now.

Asm2D · on Nov 8, 2023

The reason why it's written in C++ is to make it practical. I understand why some people would want it in C, but honestly I have never seen a nice JIT assembler library written in C - I saw few incomplete assemblers in C before as part of other projects, but it was usually ugly code full of macros, so no proper C API to interface with anyway.

I think that asmjit's biggest strength is in its completeness and performance, so it can be used to generate code in a sub-millisecond time, which higher level code generators such as LLVM don't offer.

duped · on Nov 9, 2023

I understand, but I don't think you understand me.

asmjit is great if you're writing a fast JIT and want to write C++, or can tolerate a C++ requirement. If you're writing a compiler in any other language, requiring C++ is a burden that may be insurmountable. What I'm lamenting is the lack of a widely used, decent C library for generating x86 and ARM code, which doesn't seem to exist, nor does asmjit fill that gap.

> I think that asmjit's biggest strength is in its completeness and performance, so it can be used to generate code in a sub-millisecond time, which higher level code generators such as LLVM don't offer.

Not everyone cares about sub-millsecond performance for code generation. It's easier to generate GCC syntax ASM and shell out to `asm` or `cc` than to write bindings to asmjit and interface from a different language, and the performance is fine because lowering your AST into an IR that can be used to generate asm either as text or the input to asmjit is a much bigger bottleneck.

Asm2D · on Nov 10, 2023

Yeah, I usually write compilers in C++ and it seems we have totally different use-cases :)

High performance assembling is literally what asmjit was designed for and that allows it to be used in interesting projects - for example there are multiple databases that use asmjit to JIT compile queries, and low-latency compilation is the key feature here.

BTW if you need a pure C library to encode x86 there is zydis (yeah it used to be a disassembler only, but it has now also an encoder), but it's only for x86.

a-priori · on Nov 8, 2023

The fact that it's written in C++ doesn't make it 'bad'. It just means it's not suited to your use, which is your problem not the library's problem.

Asm2D · on June 10, 2023

They do!

The most important post-SSE2 extensions are SSSE3 (pshufb) and SSE4.1 (rounding, min/max, blending, etc...). Pure SSE2 is a nightmare to use as it's a totally unbalanced SIMD ISA (a lot of missing stuff here are there requires a lot of workarounds and sometimes it's just better to go scalar). In addition, just [V]PSHUFB alone can do wonders and has a lot of application - I would say that almost all interesting problems can take the advantage of PSHUFB.

Asm2D · on June 5, 2023

It's above 4400 instructions actually if you count different encoding of SIMD instructions (like SSE2, VEX, and EVEX variations) and consider instructions using 128-bit, 256-bit, and 512-bit SIMD as separate instructions.

Instructions can be comfortably explored here:

https://asmjit.com/asmgrid/

Asm2D · on June 1, 2023

A C version of the library with a portable implementation and AVX-512 optimizations is planned.

Asm2D · on June 1, 2023

The initial AVX-512 implementation brought a lot of issues with it. The biggest problem was that Intel used 512-bit ALUs from the beginning and I think it was just too much that time (initial 14nm node) - even AMD's Zen4 architecture, which came years after Skylake-X, uses 256-bit ALUs for most of the operations except complex shuffles, which use a dedicated 512-bit unit to make them competitive. And from my experience, AMD's Zen4 AVX-512 implementation is a very competitive one. I just wish it had faster gathers.

Our typical workload at Sneller uses most of the computational power of the machine: we typically execute heavy AVX-512 workloads on all available cores and we compare our processing performance at GB/s per core. This is generally why we needed a faster decompression, because before Iguana almost 50% of the computational power was spent in a zstd decompressor, which is scalar. The rest of the code is written in Go, but it's insignificant compared to how much time we spend executing AVX-512 now.

(I work for Sneller)