A QuestDB engineer here: These are cool benchmarks! The idea to try Robin Hood probing came to me after receiving some feedback on Reddit. I ran initial experiments, and the results were promising, leading to its integration into our codebase. Thank you so much for sharing your repository. Perhaps one day we'll explore bidirectional probing as well!

A snapshot of my happiness after running first experiments with Robin Hood: https://twitter.com/jerrinot/status/1730147245285150743 :)
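
For readers who haven't met it before, here is a minimal illustrative sketch of the Robin Hood idea (a toy open-addressed set of non-zero int keys, with resizing and deletion omitted - this is not QuestDB's implementation):

    final class RobinHoodIntSet {
        // Toy open-addressed set of non-zero int keys; 0 marks an empty slot.
        private final int[] table = new int[1 << 10]; // power-of-two capacity, no resizing
        private final int mask = table.length - 1;

        private int ideal(int key) {
            int h = key * 0x9E3779B1;       // cheap integer mixing
            return (h ^ (h >>> 16)) & mask;
        }

        void insert(int key) {
            int idx = ideal(key);
            int dist = 0;                   // how far this key has probed from its ideal slot
            while (true) {
                int resident = table[idx];
                if (resident == 0) {        // empty slot: claim it
                    table[idx] = key;
                    return;
                }
                if (resident == key) {      // already present
                    return;
                }
                int residentDist = (idx - ideal(resident)) & mask;
                if (residentDist < dist) {  // resident is "richer" (closer to home): rob it
                    table[idx] = key;
                    key = resident;         // continue probing on behalf of the displaced key
                    dist = residentDist;
                }
                idx = (idx + 1) & mask;
                dist++;
            }
        }
    }

The displacement swap is what keeps probe lengths short and uniform, which is also what keeps lookups cheap.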


Hi there!

I made the initial suggestion to look into Robin Hood hashing when it was first posted on Reddit.

Glad to see it make its way into the repo!


indeed! thank you for that :)


What are your requirements? We see some people switching from InfluxDB to QuestDB. Common reasons include SQL support (InfluxDB's strategy here is a bit confusing), performance, and issues with high-cardinality data in InfluxDB.

Full Disclosure: I work for QuestDB and I am obviously biased.


QuestDB engineer here:

It's true that our non-idiomatic Java usage denies us some of the benefits typically associated with Java programming. Automatic memory management and the old "Write Once, Run Anywhere" paradigm are difficult to maintain due to our reliance on native libraries and manual memory management.

I see two classes of reasons for choosing Java:

1. Historical: The QuestDB codebase predates Rust. According to Wikipedia, the initial Rust release was in 2015. The oldest commit in the QuestDB repo is from 2014: https://github.com/questdb/questdb/commit/95b8095427c4e2c781... What were the options back in 2014? C++? Too complicated. C? Too low-level. Pretty much anything else? Either too slow or too exotic.

2. Technical: Java, even without GC or WORA, still offers some advantages.

2a. The tooling is robust, especially when compared to C++. This starts with build systems (don't get me started on CMake!) and extends to aspects like observability. Stack traces in Java are taken for granted; what's the state of stack-trace walking/printing in C++? I think it boils down to either Boost, C++23's std::stacktrace, or some other form of black magic. (I might be wrong here, though.)

2b. It's a simpler language, especially when compared to C++ or even Rust. This makes it easier to hire people and also attracts external contributors: https://github.com/questdb/questdb/graphs/contributors

2c. The HotSpot JIT still gives us solid peak performance without having to mess with PGO, etc.

2d. Concurrency is easier with Java's managed memory, eliminating the need for hazard pointers and the like.


What do you use as a build system for Java? We use ant (old, ugly, reliable) and gradle (new, nice looking and absolutely horrible to maintain)


Maven; see the rust-maven-plugin we wrote for this. It's open source.


QuestDB engineer here: We use jlink to create images for selected platforms. This means not even a JRE is a dependency: you unpack a tarball and you are good to go. See: https://questdb.io/docs/get-started/binaries/


Well, technically it is a dependency in that you can't target platforms there isn't a JRE for. But I take your point about simplifying installation.


I could use DirectByteBuffer instances as ConcurrentHashMap values. But Java deallocates the backing memory of a DirectByteBuffer during object finalization. If there is no on-heap memory pressure, there is no GC and thus no finalization, so it would leak off-heap memory. I could also use Unsafe to hack into DirectByteBuffer and call the Cleaner explicitly - many libraries do that anyway. But then I would still need some kind of reference counting to make sure I won't deallocate a buffer with active readers.
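
For illustration, this is roughly the kind of bookkeeping explicit deallocation would force on readers (a hedged sketch, not QuestDB code; the freeing helper is a placeholder):

    import java.nio.ByteBuffer;
    import java.util.concurrent.atomic.AtomicInteger;

    final class RefCountedDirectBuffer {
        private final ByteBuffer buffer;
        private final AtomicInteger refs = new AtomicInteger(1); // the owner holds one reference

        RefCountedDirectBuffer(int size) {
            this.buffer = ByteBuffer.allocateDirect(size);
        }

        // A reader must successfully bump the count before touching the buffer.
        ByteBuffer acquire() {
            for (;;) {
                int current = refs.get();
                if (current == 0) {
                    throw new IllegalStateException("buffer already released");
                }
                if (refs.compareAndSet(current, current + 1)) {
                    return buffer;
                }
            }
        }

        // Owner and readers call release(); the last one frees the off-heap memory.
        void release() {
            if (refs.decrementAndGet() == 0) {
                freeDirectBuffer(buffer); // hypothetical helper, e.g. the Unsafe/Cleaner hack
            }
        }

        private static void freeDirectBuffer(ByteBuffer b) {
            // placeholder: on JDK 9+ this could call sun.misc.Unsafe#invokeCleaner(b)
        }
    }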


Or you could simply invoke a GC periodically (or every N times a key is removed from the map, or similar schemes).

Another simple way, if we don't like the idea of triggering GCs manually, is to allocate the same buffer both off-heap and on-heap: use the off-heap one for actual key storage, and the on-heap one just to generate heap memory pressure.
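
A minimal sketch of that second idea (illustrative names, not production code): pair every direct buffer with an on-heap array of the same size, so heap pressure tracks off-heap usage and GC - and with it, finalization - runs often enough.

    import java.nio.ByteBuffer;

    final class PressurePairedBuffer {
        final ByteBuffer offHeap;     // the buffer actually used for storage
        @SuppressWarnings("unused")
        private final byte[] ballast; // never read; exists only to create matching heap pressure

        PressurePairedBuffer(int size) {
            this.offHeap = ByteBuffer.allocateDirect(size);
            this.ballast = new byte[size]; // same size on-heap; becomes garbage together with this wrapper
        }
    }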


Is there any regret about choosing Java for DB development, given the need to work around this and likely many other issues?


I only learned about the left-right scheme after finishing my design. I thought I came up with something novel - it turns out it was just my ignorance:)

Still, I think the left-right crate could benefit from one optimization I came up with. This is what the lib says in the docs:

> When WriteHandle::publish is called, the writer, atomically swaps the reader pointer to point to the other T. It then waits for the epochs of all current readers to change, and then replays the operational log to bring the stale copy up to date.

I think the lib could postpone the log replaying until the next publish. Chances are all readers will have the new epoch by then, so the writer thread won't have to wait at all.
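
A rough, single-writer sketch of that deferral idea (assumed structure, not the left-right crate's actual code; the epoch mechanism is reduced to placeholders). Note that the wait-and-replay must still happen before the writer touches the stale copy again, so it runs lazily at the start of the next write or publish, whichever comes first:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    final class DeferredLeftRight<T> {
        private volatile T live;              // the copy readers currently see
        private T stale;                      // the copy the single writer mutates
        private final List<Consumer<T>> opLog = new ArrayList<>();
        private long[] epochsAtLastSwap;      // reader epochs snapshotted at swap time
        private boolean replayPending;

        DeferredLeftRight(T left, T right) {
            this.live = left;
            this.stale = right;
        }

        T read() {
            return live;                      // real left-right readers also publish an epoch
        }

        void write(Consumer<T> op) {
            catchUpIfNeeded();                // deferred work from the previous publish
            op.accept(stale);
            opLog.add(op);                    // remembered for the other copy
        }

        void publish() {
            catchUpIfNeeded();
            T old = live;
            live = stale;                     // readers switch to the fresh copy
            stale = old;
            epochsAtLastSwap = snapshotReaderEpochs();
            replayPending = true;             // note: no waiting here - that's the optimization
        }

        private void catchUpIfNeeded() {
            if (replayPending) {
                waitUntilReadersMovedPast(epochsAtLastSwap); // usually already true by now
                for (Consumer<T> op : opLog) {
                    op.accept(stale);         // bring the stale copy up to date
                }
                opLog.clear();
                replayPending = false;
            }
        }

        // Placeholders for the real epoch mechanism.
        private long[] snapshotReaderEpochs() { return new long[0]; }
        private void waitUntilReadersMovedPast(long[] epochs) { }
    }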


>> I thought I came up with something novel - it turns out it was just my ignorance:)

The first 3 years of my PhD summarized in one sentence.


> I think the lib could postpone the log replaying until the next publish. Chances are all readers will have the new epoch by then, so the writer thread won't have to wait at all.

I think you can't even do a new publish until all readers have finished switching to the new epoch after the previous publish. Otherwise, you risk corrupting reads for readers that are still on the original epoch.


That's actually exactly what I did! I removed this part from the article as I felt it was not relevant for the concurrent protocol design and it was adding mental load for readers not that fluent in Java.


That's indeed a very good question! There are 2 sets of reasons:

1. Technical

The contract is a given - I'm receiving usernames as a CharSequence (in practice, a String). I could change that, but it would likely require changes in the SQL parser - not simple at all. Alternatively, I could pass the whole CharSequence to Rust, but passing objects over the JNI boundary comes with a performance penalty (compared to passing primitives), and we avoid that whenever possible. Or I could encode the CharSequence content into a temporary off-heap buffer and pass just the pointer to Rust (sketched below). But that brings back some of the questions from the article - like who owns the buffer?

2. Other reasons

I realized this was a possibility only when I was nearly done with the design (this whole endeavor took less than one day) and I felt the urge to finish it. Also: This article wouldn't have been created!
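
Here is that buffer-passing sketch from point 1 (hedged: names like NativeFilter and isBlockedUtf8 are made up, and the newer java.lang.foreign API is used only to keep the example self-contained; the same thing could be done with Unsafe):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;
    import java.nio.charset.StandardCharsets;

    final class NativeFilter {
        // Hypothetical JNI entry point implemented in Rust: takes a raw pointer + length.
        // Native library loading is omitted here.
        private static native boolean isBlockedUtf8(long ptr, int len);

        static boolean isBlocked(CharSequence username) {
            // Real code could encode the CharSequence directly without the intermediate String.
            byte[] utf8 = username.toString().getBytes(StandardCharsets.UTF_8);
            try (Arena arena = Arena.ofConfined()) {
                MemorySegment seg = arena.allocate(utf8.length);
                MemorySegment.copy(utf8, 0, seg, ValueLayout.JAVA_BYTE, 0, utf8.length);
                // Only primitives cross the JNI boundary, avoiding the object-passing penalty.
                return isBlockedUtf8(seg.address(), utf8.length);
            } // the off-heap buffer is freed here, after the native call returns - ownership stays in Java
        }
    }

Ownership stays simple here only because the buffer dies with the try-with-resources scope; as soon as the native side wants to keep the pointer around, the ownership questions from the article are back.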


I have the opposite question: why not write everything in Java?


Hello, the author here. It feels great to see my blog on HN!

It was quite a journey. At first I thought I had invented a novel concurrency scheme. However, it turns out that it was simply a mix of my ignorance and hubris! :-)

Still, I had a lot of fun while designing this data structure and I believe it made a nice story. Ask me anything!


If the keys are strings and you can use Rust, have you considered fst [0]? If the keys have prefixes in common, it is very compact. There's a blog post about it [1]

[0] https://github.com/BurntSushi/fst

[1] https://blog.burntsushi.net/transducers/

Its main limitation is that it doesn't support removal, but if removals are infrequent you can work around that with another fst of removed items (and periodically rebuild the whole thing).



Yes, state machine data structures like this are also used in regex engines. The author of the fst crate also created Rust's regex crate and the ripgrep [0] CLI tool.

The regex crate has multiple implementations of regex matching algorithms, and these internals are exposed as a separate library [1]. An implementation is selected at runtime based on which is fastest and correct for a given regex. See also [2].

[0] https://blog.burntsushi.net/ripgrep/

[1] https://blog.burntsushi.net/regex-internals/

[2] https://docs.rs/regex-automata/latest/regex_automata/#availa...


Author of fst and regex crates here.

No, it isn't. The only similarity between them is finite state machines. Cox's article is about regexes. The fst crate is a data structure for storing keys and optional values. Two completely different things.


How do you learn stuff like this? Especially the JNI and off-heap memory parts.


DirectByteBuffer exhibits an intriguing behavior: it deallocates its backing memory during the finalization process, which occurs after garbage collection cycles. This poses an issue if your system is conservative with on-heap allocations, leading to infrequent GC cycles. In such cases, there could be a significant delay between the time the memory becomes unreferenced and when it is actually deallocated. This behavior could, in some respects, mimic a memory leak.

This is why some libraries hack into DirectByteBuffer to deallocate memory explicitly, bypassing the finalizer altogether. For instance, the Netty library has implemented such a workaround: https://github.com/netty/netty/blob/795db4a866401aa172757b95...
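
For reference, a minimal sketch of that kind of workaround on JDK 9+ (the caller must guarantee nobody still reads the buffer; invokeCleaner also rejects non-direct buffers as well as slices and duplicates):

    import java.lang.reflect.Field;
    import java.nio.ByteBuffer;

    final class DirectBuffers {
        private static final sun.misc.Unsafe UNSAFE;

        static {
            try {
                Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                UNSAFE = (sun.misc.Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        // Frees the off-heap memory backing a direct ByteBuffer immediately,
        // without waiting for a GC cycle and finalization.
        static void free(ByteBuffer directBuffer) {
            UNSAFE.invokeCleaner(directBuffer);
        }
    }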

