Hacker News | petethepig's comments

"Hello everybody out there using minix -

I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones. This has been brewing since april, and is starting to get ready. I'd like any feedback on things people like/dislike in minix, as my OS resembles it somewhat (same physical layout of the file-system (due to practical reasons) among other things)."

https://www.cs.cmu.edu/~awb/linux.history.html


That message's 30th birthday is in 2 days


> Please note that this project is meant to be used for fun and learning purposes and not for practical use.

https://github.com/ronami/HypeScript#introduction


Tries (or prefix trees).

We use them a lot at Pyroscope for compressing strings that have common prefixes. They are also used in databases (e.g. indexes in MongoDB) and file formats (e.g. debug symbols in the macOS/iOS Mach-O format are compressed using tries).

We have an article with some animations that illustrate the concept in case anyone's interested [0].

[0] https://github.com/pyroscope-io/pyroscope/blob/main/docs/sto...


I wouldn't consider tries to be obscure tbh. They are the recommended solution for many leetcode-style interview problems involving string searching. I think anyone who has done serious interview prep has encountered them.

https://leetcode.com/discuss/general-discussion/931977/begin...


I got stung by this for a test engineering role. Apparently implementing them from scratch is a pre-req to building out test infrastructure!


The starting example was Bloom filters.


I'm not sure what the point of your post is. While Bloom filters are heavily used in actual production code throughout the industry, it is very rare for anyone to need to write their own or to modify a legacy implementation.

Not all educational programs cover Bloom filters, and for those that do, there's no guarantee that students will retain the information and be able to recall it.

I don't know of any common interview question patterns where a Bloom filter would be the only optimal solution. Even when it shares optimality with other data structures, you'd be better off choosing something else unless you are extremely confident you can implement one and explain it clearly and in depth, since the odds of your interviewer being unfamiliar with them are high compared to other solutions.

I only know what a Bloom filter is because of HN and previous submissions like this one that have made it to the front page over the years. They are nowhere near as commonly known as tries, especially if you are sampling younger cohorts.


I just didn't think it was necessary to be judgemental or 'gate-keep' about what's obscure or interesting 'enough'; better to let anyone share what interested them and why.

To your points: I do fall in a 'younger cohort', I imagine; I've come across Bloom filters on HN far more often, and I dimly recall tries from university, though I knew them as prefix trees. Without submissions like this one I wouldn't have learned what a trie is or made that alias in my head.


I was with you until the last seven words ;)

Trees were a huge part of CS practice and education historically, but they have been replaced by hash-based methods in many cases. For example, in C++ std::map is generally a tree, while in more recent languages the standard Map data structure is a hashmap. My impression is that the instruction time devoted to Bloom filters (and other hash-based methods) vs trees has shifted toward the former over the last ~20 years.


The C++ STL uses trees because the only generic requirement for them to work is a `<` operator. Designing generics around hashing is much more difficult, and most languages struggle with it.

Ironically, due to caches, sorting and then using algorithms that rely on order tends to be superior to most hashing implementations, even though it is theoretically worse due to the log(n) factor. So in a way, C++ algorithms are actually more modern.


> Ironically, due to caches, sorting and then using algorithms that rely on order tends to be superior to most hashing implementations

This doesn't match my experience at all. C++ trees are not cache-friendly; they're pointer-chasing (and there's no arena implementation in the STL). Second, any sort of ordering structure (be it through a tree or through sorting + binary search) is notoriously prone to branch mispredictions. They look great in a microbenchmark if you search for the same element all the time and the trees are small, but in a practical application, it can be incredibly painful. Pretty much by design, every choice is 50–50 and unpredictable.

std::unordered_map from most STLs isn't a fantastic hash table, but in my experience it absolutely destroys std::map (and is also comfortably ahead of something like std::lower_bound on a sorted array). All of this is assuming large-ish N, of course; for small N (e.g. below 30, usually), the best by far is a simple unsorted array and just going through it linearly.


> C++ trees are not cache-friendly

Agreed, they should use a B-tree to get cache locality and easy generics, but there is legacy code there.

I was referring to the performance of algorithms, for example `std::unique`, `std::lower_bound`, etc. Many of these operate on sorted data, whereas most other languages' standard libraries use hashing for the same tasks.

> is also comfortably ahead of something like std::lower_bound on a sorted array

I would be interested to learn more about when that's the case. But, it's also not very flexible. You can put an `int` in it, great. Can you put `std::pair<int, int>` in it? Does it work as well?


B-trees don't work well for CPU caches; 64-byte lines are typically too small to gain much. (OK, the M1 has 128-byte lines, but still.) And you still get the branch mispredicts, plus a significant increase in code complexity, so hashing is still significantly better. (I've seen C++ projects where the main structure was a B+-tree because the lower levels could reside on disk, but where they maintained a separate hash table in memory to skip the first few B-tree levels!)

> I would be interested to learn more about when that's the case. But, it's also not very flexible. You can put an `int` in it, great. Can you put `std::pair<int, int>` in it? Does it work as well?

If by “it” you're talking about std::unordered_map (as key); yes, you can, but you'd have to supply your own hash function, because std::pair<A, B> does not have a std::hash specialization. (It's a bit sad, but the problem is highly specific to std::pair/std::tuple.) Likewise, you can use any arbitrary type you create yourself, as long as it has operator== and you've written a hash function for it.


On a modern Intel processor, it was much faster to scan an array of 10k integers linearly than to sort it first and then do a binary search.

The small N is not that small anymore, and the speed difference between stack and heap is bigger now than ever.


To your point, C++ got hash structures in the standard library (unordered_map and unordered_set) in C++11 (so only about a decade ago).


A lot of this is that, other than heaps, most tree-based algorithms involve O(log n) random-access pointer lookups, which makes them relatively slow as in-memory data structures.


Trie != tree


Indeed, but it took me a while to appreciate tries.

Originally, I thought "nice idea", but I rarely ever encountered a problem where I really needed a prefix tree.

But while reading into HAMTs and Clojure's data structure implementations, it dawned on me that prefix trees aren't only useful for strings (or arbitrary byte sequences): hashes in a dictionary, for example, are sequences of bits, too. And the design of a HAMT is exactly such that it doesn't concern itself with bytes but with blocks of N bits (determined by the branching factor; i.e. a factor of 32 implies 5-bit blocks). And if you know the length of each sequence (hash), you know the maximum trie depth, which also works for you, not against you.

That was rather eye opening for me at the time.


Even though a trie is pretty standard and expected (to be known) these days, I will vote for this. It was my first deep dive into more exotic data structures, after an interview question about implementing T9 stumped me many years ago.

https://github.com/Azeem112/tries-T9-Prediction


Do you have any documentation about how tries are used in mongo indexes? Or could you point to the source?


There's not much about it online. They do mention it in the docs; they call it "prefix compression" [0], so you can search for that. This article describes it [1], although at a somewhat high level.

They only use this for indexes. It works really well for internal Mongo ids (they all start with timestamps, if I remember correctly). It also works well for compound indexes. Pro tip: for compound indexes, always go from the low-cardinality attribute to the high-cardinality attribute to get the best results from prefix compression (it's also good for query speed).

[0] https://www.mongodb.com/docs/manual/reference/glossary/#std-...

[1] https://medium.com/swlh/mongodb-indexes-deep-dive-understand...


* With larger lambdas you get more predictable performance; 2GB RAM lambdas should get you ~90MB/s of read throughput [0]

* Assuming you can parse faster than you read from S3 (true for most workloads?), that read throughput is your bottleneck.

* Set a target query time, e.g. 1s. For queries to finish in 1s, each record on S3 has to be 90MB or smaller.

* Partition your data in such a way that each record on S3 is smaller than 90MB.

* Forgot to mention: you can also do parallel reads from S3; depending on your data format / parsing speed, that might be worth looking into as well.

This is a somewhat simplified guide (e.g. for some workloads merging data takes time, and we're not including that here), but it should be good enough to start with.

[0] - https://bryson3gps.wordpress.com/2021/04/01/a-quick-look-at-...


Pyroscope (YC W21) | REMOTE | Full-time / Contract

Pyroscope is open source performance monitoring software with a focus on Continuous Profiling. Pyroscope was founded at the beginning of 2021, and we've seen rapid growth and community adoption, as well as feedback about which features developers are excited for us to build.

We're looking for Backend Software Engineers (Golang). Our custom storage engine is written in Go (with agents for many other languages as well).

---

Roles:

Senior Software Engineer - Golang: https://apply.workable.com/pyroscope/j/28E1725C98/

Questions?: [email protected]


I set up a similar system in the past with Redis replicas. It was about 200MB of data that I needed quick access to on ~300 servers around the world; it was easy to set up and worked really well.


Pyroscope (YC W21) | REMOTE | Full-time / Contract

Pyroscope is open source performance monitoring software with a focus on Continuous Profiling. Pyroscope was founded at the beginning of 2021, and we've seen rapid growth and community adoption, as well as feedback about which features developers are excited for us to build.

We're looking for Backend Software Engineers (Golang or Java). Our custom storage engine is written in Golang (with agents for many other languages as well). While we support 8 language integrations, one of our most popular is our Java Integration and we're looking for an engineer (either full time or contract) to help us improve this integration.

---

Roles:

Senior Software Engineer - Java: https://apply.workable.com/pyroscope/j/34D132806F/

Senior Software Engineer - Golang: https://apply.workable.com/pyroscope/j/28E1725C98/

Questions?: [email protected]


Pyroscope (YC W21) | REMOTE | Full-time / Contract

Pyroscope is open source performance monitoring software with a focus on Continuous Profiling. Pyroscope was founded at the beginning of 2021, and we've seen rapid growth and community adoption, as well as feedback about which features developers are excited for us to build.

We're looking for Software Engineers for both frontend and backend. Our frontend is React and Redux and our custom storage engine is written in Golang (with agents for many other languages as well). We also have a few components written in Rust and we're currently experimenting with WASM for offloading some logic from the backend to the frontend.

---

Roles:

Junior Software Engineer (Frontend): https://apply.workable.com/pyroscope/j/CCF934937E/

Senior Software Engineer (Backend): https://apply.workable.com/pyroscope/j/28E1725C98/

Questions?: [email protected]


Semantic pull requests [0] + conventional-changelog [1] + squashing PRs, and you get almost the same results (and more), with great automation and self-documentation for the whole process:

* you'll never forget to add something to changelog

* you get links to PRs in your changelog

* it's much faster to make edits to PR messages compared to editing files

* outside contributors get familiar with the practice much faster (imagine getting new contributors to update changelogs)

We recently adopted this practice at Pyroscope and it's been working out pretty well for us [2], I can certainly recommend it.

[0] - https://github.com/apps/semantic-pull-requests

[1] - https://github.com/conventional-changelog/conventional-chang...

[2] - https://github.com/pyroscope-io/pyroscope/blob/main/CHANGELO...


I'm having trouble figuring out what a "semantic pull request" is from that link; it's not a concept I've heard of before, and I'm interested. I assume it's some convention for what goes in a PR message, but what is it? Is there a better link?

In general, as a developer-consumer of them, I have found automatically-generated-from-commits changelogs to be not so useful, but I don't know if I've experienced any using the conventions you are recommending (which I don't understand!).

In the small projects I write, I include links to the PR in the manually created CHANGELOG (the delta of which is part of the PR) simply by making the PR, then making another commit/amend in the PR to add its own URL to the CHANGELOG. These are some extra steps, it's true.


You use the title of the PR as a changelog item instead of every commit message.

It forces you to think for a moment about a good PR title, but that's a small price to pay when it gives you a zero-extra-work changelog that is good enough for semi-internal use.


Instead of using the title, could the changes be put in the description using a custom syntax? Prefix the line with `CHANGELOG:` or something like that.

    git --no-pager log | grep 'CHANGELOG: '
https://git-scm.com/book/en/v2/Customizing-Git-An-Example-Gi...


I really dislike this because, while it's easier for the author, either the changelog or the commit messages become less useful. In fact, I might just copy the last time I ranted about this [2]:

> Ugh. One of my pet peeves is the generation of release notes from commit messages. Commit messages and PR descriptions have a different audience (i.e. contributors) from release notes (i.e. users).

> For example, take a look at ESLint's autogenerated changelog [1]:

    67c0074 Update: Suggest missing rule in flat config (fixes #14027) (#15074) (Nicholas C. Zakas)
    cf34e5c Update: space-before-blocks ignore after switch colons (fixes #15082) (#15093) (Milos Djermanovic)
    c9efb5f Fix: preserve formatting when rules are removed from disable directives (#15081) (Milos Djermanovic)
    14a4739 Update: no-new-func rule catching eval case of MemberExpression (#14860) (Mojtaba Samimi)
    7f2346b Docs: Update release blog post template (#15094) (Nicholas C. Zakas)
    fabdf8a Chore: Remove target.all from Makefile.js (#15088) (Hirotaka Tagawa / wafuwafu13)
    e3cd141 Sponsors: Sync README with website (ESLint Jenkins)
    05d7140 Chore: document target global in Makefile.js (#15084) (Hirotaka Tagawa / wafuwafu13)
    0a1a850 Update: include ruleId in error logs (fixes #15037) (#15053) (Ari Perkkiö)
    47be800 Chore: test Property > .key with { a = 1 } pattern (fixes #14799) (#15072) (Milos Djermanovic)
    a744dfa Docs: Update CLA info (#15058) (Brian Warner)
    9fb0f70 Chore: fix bug report template (#15061) (Milos Djermanovic)
    f87e199 Chore: Cleanup issue templates (#15039) (Nicholas C. Zakas)
> I'm reading release notes to get a feel for how the new release might impact me. This takes so much time to scan, because there's so much useless cruft (to me, as a user) I have to ignore.

> What's worked very well for me is to simply have an "I updated the changelog, if applicable" entry in my PR template checklist. Then when I cut a new release, I simply add the release date above the release notes currently listed under "Unreleased", and they'll list all relevant changes, reviewed during the pull request to verify that it is relevant to users.

> [1] https://github.com/eslint/eslint/blob/master/CHANGELOG.md

[2] https://news.ycombinator.com/item?id=28757317


It always depends on your users. Such a changelog is suitable for stuff other developers use, but not for end users.

But it is a _way_ better starting point for writing a changelog than 'usual' commit messages or PR descriptions.


I disagree. I publish to the end users and they are grateful all the time.

I use a custom scheme though, based on GitLab issue titles and issue labels. I find this more meaningful since one issue can contain multiple PRs and I still want a single changelog line.


How can I, as a user, know if bug x, or issue y is resolved, without a changelog?!


Nobody suggested not having a changelog.


Seconded, I've seen semantic PR plus squash add a lot of value both for the engineering side and the product side.

We do this and it works great, but I've still found value in summarizing things in a digest (i.e. a changelog or release notes, depending on the context)


Thirded? Tertiaried? This is something I always beat myself up for not doing consistently. Automating it at a minimum checks the box, and also can provide real value.


May I ask why you want to squash every branch?

What is the size of the resulting commit (in terms of average modified files and lines of code)?


I've shared this anecdote elsewhere but it's my go-to example. We recently had a new engineer join our team; we gave them a small piece of work related to a bigger feature we had completed months before they arrived. They tracked the feature back to the squashed PR commit, which of course referenced the GitHub PR URL; in there they had the engineer's summary and implementation caveats, the review comments from the rest of the team, and the discussion context. They could reference all the commits made throughout the process from there. The PR itself used "Closes #X" syntax, so they could jump back to the engineering task and up into the product backlog item that spawned it. They nailed the change the first time :-)

> What is the size of the resulting commit (in terms of average modified files, lines of code) ?

Lately I've started advocating squashing PR branches even if they have only a single commit. The real benefit here is that GitHub puts the PR URL into the squashed commit message. It automates the "context links" regardless of whether a PR branch ends up being 1 commit or 50.


I understand the need. But to me, I can't see any real value in squashing other than cleaning local history before sharing the code. Also, and this is important for me (maybe other people don't care about it), I lose the ability to bisect precisely. When I know that my guilty commit is a squashed one, I have to review every modification to understand what went wrong, which takes time.

Also, everything you describe can be achieved with a merge commit. As a side effect, when you look at the history, you may be annoyed by small commits that don't interest you. But Git can help by displaying only merge commits (with the --first-parent option).


Ah bisect is a great use case here.

True, you can achieve it with a merge commit. And I think GitHub includes the PR URL in a merge commit as well, IIRC. I suppose my main argument is really just that the always-rebase strategy that was trendy for a while has some downsides compared to squash/merge commits /shrug


I don't think I understand what "squashing PR branches" means (a github feature I haven't noticed?), but:

> The real benefit here is that Github puts the PR URL into the squashed commit message.

While not in the commit message, GitHub will show you the PR a commit is part of in its commit view regardless, for every commit in a PR, whether it's one commit or 50. Which I do find invaluable for tracing the documentation of a change, agreed! Not everyone knows about this feature in the GitHub UI (I don't think it's been there forever).

E.g. notice the `#279` included in this GitHub commit UI display (the PR was, I'm pretty sure, not "squashed").

https://github.com/traject/traject/commit/10339e774e92e57f44...


The merge commit also includes the link to the PR when merged via the GitHub UI (or locally via `hub merge $PR_URL`).


One typical reason would be easier history following and eventually revert.


Yep, I love this workflow.

We use release-please [0] to automatically generate WIP "release PRs", so we can see the exact changelog (for a candidate release) drafted as merges come in.

[0] https://github.com/googleapis/release-please


Semantic pull requests work well for a slow pace of work, and I use them all the time.

I also recognize that for full-speed delivery they are suboptimal, in that they sit directly on the critical path. After reviewing the code of the PR, there is now an extra context switch to also review the title/semantics.

So the complexity of turning a list of PRs into a changelog does not disappear; it gets hidden within each individual PR.


I am an inexperienced cyclist; I got into it when COVID started, and I'm having the opposite experience: I've loved getting new sensors, geeking out on metrics, and seeing how they all interact with each other. It motivated me to go out more, for sure. Getting a heart rate sensor actually helped me climb more efficiently: when I first got it, I noticed I was going too hard; my heart rate would climb too high and I would have to stop and rest. Now I know that if I just keep my heart rate below a certain number, I can go non-stop for much longer.

RE Garmin: my first computer was a XOSS G+, which was pretty good. I recently switched to a Garmin Edge 530 because I'm planning to get a power meter soon, and I have to say I'm not a big fan: the screen quality and navigation are kinda crappy for the amount of money they ask for it.

