On the topic* of having 24 cores and wanting to put them to work: when I were a lad the promise was that pure functional programming would trivially allow for parallel execution of functions. Has this future ever materialized in a modern language / runtime?
x = 2 + 2
y = 2 * 2
z = f(x, y)
print(z)
…where x and y evaluate in parallel without me having to do anything. Clojure, perhaps?
*And superficially off the topic of this thread, but possibly not.
Superscalar processors (which include all mainstream ones these days) do this within a single core, provided there are no data dependencies between the assignment statements. They have multiple arithmetic logic units, and they can start a second operation while the first is executing.
But yeah, I agree that we were promised a lot more automatic multithreading than we got. History has proven that we should be wary of any promises that depend on a Sufficiently Smart Compiler.
Eh, in this case not splitting them up to compute them in parallel is the smartest thing to do. Locking overhead alone is going to dwarf every other cost involved in that computation.
Yeah, I think the dream was more like, “The compiler looks at a map or filter operation and figures out whether it’s worth the overhead to parallelize it automatically.” And that turns out to be pretty hard, with potentially painful (and nondeterministic!) consequences for failure.
Maybe it would have been easier if CPU performance didn’t end up outstripping memory performance so much, or if cache coherency between cores weren’t so difficult.
I think the reason it has shaken out the way it has is that compile-time optimizations to this extent require knowing runtime constraints/data at compile time. For non-trivial situations that's impossible, as the code will be run with too many different types of input data, on machines with too many different cache sizes, etc.
The CPU has better visibility into the actual runtime situation, so can do runtime optimization better.
In some ways, it’s like a bytecode/JVM type situation.
If we can write code that dispatches to different code paths at runtime (as has been done for decades to support SSE, and later AVX, within a single binary), then we can write code that parallelizes large array operations based on heuristics. It's not much different from busy spins falling back to sleep or other mechanisms when the fast path fails after ca. 100-1000 attempts to secure a lock.
For the trivial example of 2+2 like above, of course, this is a moot discussion. The commenter should've led with a better example.
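For something array-sized, though, a rough sketch of that heuristic dispatch in Python (the threshold, function names, and chunk size here are made up for illustration, not taken from any real library):

import concurrent.futures

PARALLEL_THRESHOLD = 100_000  # hypothetical cutoff; a real implementation would tune this

def expensive(x):
    return x * x  # stand-in for the real per-element work

def map_maybe_parallel(data):
    # Fast path: small inputs stay serial, so there is no coordination overhead.
    if len(data) < PARALLEL_THRESHOLD:
        return [expensive(x) for x in data]
    # Slow path: fan the work out across cores; same result, paid for with process overhead.
    with concurrent.futures.ProcessPoolExecutor() as pool:
        return list(pool.map(expensive, data, chunksize=10_000))

A real runtime would also have to prove that the per-element work has no side effects that make ordering matter, which is exactly the hard part.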
What kind of CPU auto-optimization? Here specifically I envisioned a macro-level optimization that kicks in when an array is detected to have a length on the order of thousands or tens of thousands of elements. I guess some advanced sorting algorithms do extend their operation to multiple threads in such cases.
For CPU machine code it's the compilers doing the hard work of reordering code to allow ILP (instruction-level parallelism), eliminating false dependencies, inlining, and vectorizing; whatever else it takes to keep the pipeline filled and busy.
I'd love for the sentiment "the dev knows" to be true, but I think this is no longer the case. Maybe if you are in a low-level language AND have time to reason about it? Add to this the reserved smile when I see someone "benchmarking" their piece of code in a "for i to 100000" loop, without any other considerations. Next, take a high-level language project: the most straightforward optimization for new code is to apply proper algorithms and fitting data structures. And I think even this is too much to ask nowadays, because it takes time, effort, and knowing that these options exist in the first place in order to remember to reach for them.
Spawning threads or using a thread pool implicitly would be pretty bad - it would be difficult to reason about performance if the compiler was to make these choices for you.
I think you’re fixating on the very specific example. Imagine if instead of 2 + 2 it was multiplying arrays of large matrices. The compiler or runtime would be smart enough to figure out whether it’s worth dispatching the parallelism for you. Basically auto-vectorisation, but for parallelism.
I mean, theoretically it's possible. A super basic example would be if the data is known at compile time, it could be auto-parallelized, e.g.
int buf_size = 10000000;
auto vec = make_large_array(buf_size);
for (const auto& val : vec)
{
    do_expensive_thing(val);
}
This could clearly be parallelised. That auto-parallelising C++ world doesn't exist, but we can see that the transformation would be valid.
If I replace it with
int buf_size = 10000000;
cin >> buf_size;
auto vec = make_large_array(buf_size);
for (const auto& val : vec)
{
    do_expensive_thing(val);
}
the compiler could generate some code that looks like:
if buf_size >= SOME_LARGE_THRESHOLD {
    DO_IN_PARALLEL
} else {
    DO_SERIAL
}
With some background logic for managing threads, etc. In a C++-style world where "control" is important it likely wouldn't fly, but if this were Python...
arr_size = 10000000
buf = [None] * arr_size
for x in buf:
    do_expensive_thing(x)
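...and a compiler or runtime that wanted to could, in principle, lower that loop into something like the following dispatch (entirely hypothetical; the threshold, chunk size, and use of the stdlib pool are illustration only, not an existing feature):

import multiprocessing

SOME_LARGE_THRESHOLD = 1_000_000  # made-up cutoff

def do_expensive_thing(x):
    pass  # placeholder for the real per-element work

def run(arr_size):
    buf = [None] * arr_size
    if arr_size >= SOME_LARGE_THRESHOLD:
        # DO_IN_PARALLEL: fan the loop out across worker processes
        with multiprocessing.Pool() as pool:
            pool.map(do_expensive_thing, buf, chunksize=100_000)
    else:
        # DO_SERIAL: the original loop
        for x in buf:
            do_expensive_thing(x)

if __name__ == "__main__":
    run(10_000_000)

Whether that's ever actually faster is exactly the heuristic problem: shipping ten million elements across process boundaries can easily cost more than the loop itself.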
You’re fixated on the very specific examples in our existing tools and saying that this wouldn’t work. Numpy could have a switch inside an operation that decides whether to auto-parallelise or not, for example. It’s possible but nobody is doing it. Maybe for good reasons, maybe for bad.
I’m doing no such thing. I’m providing an example of why verifiable industry trends and current technical state of the art are the way they are.
You providing examples of why it totally-doesn’t-need-to-be-that-way are rather tangential, aren’t they? Especially when they aren’t addressing the underlying point.
Bend[1] and Vine[2] are two experimental programming languages that take similar approaches to automatically parallelizing programs: interaction nets[3]. IIUC, they basically turn the whole program into one big dependency graph, then the runtime figures out what can run in parallel and distributes the work across however many threads you can throw at it. It's also my understanding that they are currently both quite slow, which makes sense, as until recently the focus has been on making `write embarrassingly parallelizable program -> get highly parallelized execution` work at all. Time will tell if they can manage enough optimizations that the approach gets you reasonably performing parallel functional programs 'for free'.
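For a sense of what that dependency graph buys you, here is the original x/y/z example wired up by hand with Python futures; these runtimes aim to derive this structure automatically from ordinary code rather than making you write it (the thread pool below is just an illustration, not how Bend or Vine are actually implemented):

from concurrent.futures import ThreadPoolExecutor

def f(a, b):
    return a + b

with ThreadPoolExecutor() as pool:
    # x and y have no data dependency on each other, so they may run concurrently
    fx = pool.submit(lambda: 2 + 2)
    fy = pool.submit(lambda: 2 * 2)
    # z depends on both, so it only runs once their results are available
    z = f(fx.result(), fy.result())
print(z)

(With the usual caveat that CPython's GIL means threads here give concurrency rather than true CPU parallelism; it's the dependency structure that matters for the analogy.)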
That looks more like a SIMD problem than a multi-core problem
You want bigger units of work for multiple cores, otherwise the coordination overhead will outweigh the work the application is doing
I think the Erlang runtime is probably the best use of functional programming and multiple cores. Since Erlang processes are shared nothing, I think they will scale to 64 or 128 cores just fine
Whereas the GC will be a bottleneck in most languages with shared memory ... you will stop scaling before using all your cores
But I don't think Erlang is as fine-grained as your example ...
AFAIU Erlang is not that fast an interpreter; I thought the Pony language was doing something similar (shared nothing?) with compiled code, but I haven't heard about it in a while.
There's some sharing used to avoid heavy copies, though GC runs at the process level. The implementation is tilted towards copying between isolated heaps over sharing, but it's also had performance work done over the years. (In fact, if I really want to cause a global GC pause bottleneck in Erlang, I can abuse persistent_term to do this.)
I believe it's not the language preventing it but the nature of parallel computing. The overhead of splitting up things and then reuniting them again is high enough to make trivial cases not worth it. OTOH we now have pretty good compiler autovectorization which does a lot of parallel magic if you set things right. But it's not handled at the language level either.
> …where x and y evaluate in parallel without me having to do anything.
I understand that yours is a very simple example, but a) such things are already parallelized even on a single thread thanks to all the internal CPU parallelism, b) one should always be mindful of Amdahl's law, c) truly parallel solutions to various problems tend to be structurally different from serial ones in unpredictable ways, so there's no single transformation, not even a single family of transformations.
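On point b, the formula is worth spelling out: if a fraction p of the program can be parallelized across n cores, the best possible speedup is 1 / ((1 - p) + p/n). Even with p = 0.9, 24 cores cap you at 1 / (0.1 + 0.9/24), which is roughly 7.3x, and no number of cores gets you past 10x.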
There have been experimental parallel graph reduction machines. Excel has a parallel evaluator these days.
Oddly enough, functional programming seems to be a poor fit for this because the fanout tends to be fairly low: individual operations have few inputs, and singly linked lists and trees are more common than arrays.
There have been Fortran compilers doing auto-parallelization for decades; I think Nvidia released a compiler that will take your code and do its best to run it on a GPU.
This works best for scientific computing workloads that run through very big loops with very little interaction between iterations.