
This seems pretty reasonable and matches my suspicions. It's not hard for me to believe that CUDA has a lot of momentum behind it, not just in users but in optimization and development. And thanks, I'll look more at Octo. As for Modular, aren't they CPU-only right now? I'm not impressed by their results; their edge over PyTorch isn't strong, especially when it comes to scaling. A big reason this is surprising to me is simply how much faster numpy functions are than torch. Just speed-test np.sqrt(np.random.random((256, 1024))) vs torch.sqrt(torch.rand(256, 1024)). Hell, np.sqrt(x) on a Python scalar is also a lot slower than math.sqrt(x). It just seems like there's a lot of room for optimization, but I'm sure there are costs.
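(If anyone wants to reproduce the comparison, here's a rough sketch; exact numbers will obviously vary with hardware, array size, and library versions, and this only measures eager-mode elementwise ops.)

    import math
    import timeit

    import numpy as np
    import torch

    x = np.random.random((256, 1024))   # numpy array, shape (256, 1024)
    t = torch.rand(256, 1024)           # torch tensor, same shape

    # Elementwise sqrt on a mid-sized array vs. tensor.
    print("np.sqrt   :", timeit.timeit(lambda: np.sqrt(x), number=1000))
    print("torch.sqrt:", timeit.timeit(lambda: torch.sqrt(t), number=1000))

    # Scalar case: math.sqrt skips numpy's per-call dispatch overhead.
    s = 2.0
    print("np.sqrt(scalar)  :", timeit.timeit(lambda: np.sqrt(s), number=100_000))
    print("math.sqrt(scalar):", timeit.timeit(lambda: math.sqrt(s), number=100_000))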

When we're presented with problems where the two potential answers are "it's a lot harder than it looks" and "the people working on it are idiots," I tend to lean towards the former. But hey, when it is the latter, there's usually a good market opportunity. I've just found that ___domain expertise is seeing the nuance you miss when looking from 10k ft.




First you have to figure out what problem to attack. Research, training production models, and production inference all have very different needs on the software side. Then you have to work out what the decision tree is for your customers (which depends on who you are in this equation) and how you can solve some important problem for them. In all of this, for, say, training a big transformer, numpy isn't going to help you much, so it doesn't matter if it's faster for some small cases.

If you want to support a lot of model flexibility (for research and maybe training), then you need some combination of hand-writing chip-specific kernels and building a compiler that can do some or most of that automatically. Behind that door is a whole world of hardware-specific scheduling models, polyhedral optimization, horizontal and vertical fusion, sparsity, etc, etc, etc. It's a big and sustained engineering effort, not within the reach of hobby developers, so you come back to the question of who is paying for all this work and why.

Nvidia has clarity there and some answers that are working. Historically AMD has operated on the theory that deep learning is too early/small to matter, and that for big HPC deployments they can hand-craft whatever tools they need for those specific contracts (this is why ROCm seems so broken for normal people). Google built TensorFlow, XLA, Jax, etc. for their own workloads, and the priorities reflect that (e.g. TPU support). For a long time the great majority of inference workloads were on Intel CPUs, so their software then reflected that. Not sure what tiny corp's bet here is going to be.
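(To give a concrete taste of what "vertical fusion" means in practice, here's a minimal sketch assuming PyTorch 2.x: torch.compile hands the function to a compiler backend that can fuse a chain of elementwise ops into fewer kernels instead of materializing every intermediate. Purely illustrative, not what any particular vendor ships.)

    import torch

    def f(x):
        # In eager mode each op launches its own kernel and writes an intermediate.
        a = torch.sin(x)
        b = a * a
        return torch.sqrt(b + 1.0)

    # torch.compile traces f; its backend can fuse the elementwise chain
    # into one (or a few) kernels -- the "vertical fusion" mentioned above.
    f_compiled = torch.compile(f)

    x = torch.randn(4096, 4096, device="cuda" if torch.cuda.is_available() else "cpu")
    _ = f_compiled(x)  # first call pays compile time; later calls reuse the fused kernels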

The change in the landscape I see now is that the models are big enough and useful enough that the commercial appetite for inference is expanding rapidly, hardware supply will continue to be constrained, and so tools that can reduce production inference cost by a percentage are starting to become a straightforward sale (and thus justify the infrastructure investment). This is not based on any inside info, but when I look at companies like Modular and Octo, that's a big part of why I think they probably will have some success.



