I find this type of problem is what current AI is best at: cases where the actual logic isn't very hard, but solving them requires pulling together and assimilating a huge amount of fuzzy, known information from various sources.
Which also fits with how it performs at software engineering (in my experience): great at boilerplate code, tests, simple tutorials, and common puzzles, but bad at novel and complex things.
This is also why I buy the apocalyptic headlines about AI replacing white-collar labor - most white-collar work is creating the same things (a CRUD app, a landing page, a business plan) with a few custom changes.
Not a lot of labor is actually engaged in creating novel things.
The marketing plan for your small business is going to be the same as the marketing plan for every other small business with some changes based on your current situation. There’s no “novel” element in 95% of cases.
I don't know that most software engineers build toy CRUD apps all day. I have found the state-of-the-art models to be almost completely useless in a real, large codebase. I tried the latest Claude and Gemini models, since the company provides them, but they couldn't even write tests that pass after more than a day of trying.
Our current architectures are complex, mostly because of DRY and a natural human tendency to abstract things. But that's a decision, not a fundamental property of code. At its core, most web stuff is "take it out of the database, put it on the screen. Accept it from the user, put it in the database."
If everything were written PHP3-style (add_item.php, delete_item.php, etc.), with minimal includes, a chatbot might be rather good at managing each single page.
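To make that concrete, here's a rough sketch of the "one file per action" idea, written in TypeScript/Express terms rather than literal PHP3 (the route, table, and column names are made up for illustration):

```typescript
// Hypothetical sketch of the "one file per action" style, TypeScript/Express flavor.
// Everything the action needs lives in this one file: no shared service layer,
// no repository abstraction, just "accept it from the user, put it in the database."
import express from "express";
import { Pool } from "pg";

const db = new Pool({ connectionString: process.env.DATABASE_URL });
const app = express();
app.use(express.json());

// The add_item.php equivalent: validate the input, insert the row, return it.
app.post("/add_item", async (req, res) => {
  const { name, price } = req.body;
  if (typeof name !== "string" || typeof price !== "number") {
    res.status(400).json({ error: "name (string) and price (number) are required" });
    return;
  }
  const result = await db.query(
    "INSERT INTO items (name, price) VALUES ($1, $2) RETURNING id, name, price",
    [name, price]
  );
  res.json(result.rows[0]);
});

app.listen(3000);
```

A chatbot that only ever has to reason about one such file at a time has far less context to juggle than one dropped into a layered architecture.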
I'm saying that code architected to take advantage of human skills and code architected to take advantage of chatbot skills might be very different.
LOL, how does the AI keep track of all the places it needs to update once you make a logic change? I have no idea what you're doing, but almost nothing I do is basic CRUD - there's always logic and constraints around data flow, and processes built on top.
People didn't move away from the PHP3 style of code because of a natural human tendency - they moved away because it was impossible to maintain that kind of code at scale. AI does nothing to address that, and is in fact incredibly bad at it because it's super inconsistent. It's not even copy-paste at that point - it's "whatever flavor of solution the LLM chose in this instance."
I don't understand what your background is that makes you think this kind of thing scales.
A long time ago, at one small company, I wrote an accounting system from first principles, and it was then deployed to a large-ish client. It took several months of rearranging their whole workflows and quarrelling with their operators to let the machine do what it is good at and to strip out all the quirky human optimizations and cover-your-ass steps. Humans are good at rough guessing but bad at remembering and repeating the same thing, so the usual manual accounting workflows are heavily optimized for error avoidance.
Seems like the same thing here... another kind of bitter lesson, maybe less bitter :/
This is IMHO where the interesting direction will be: how do we architect code so that it is optimized for chatbot development? In the past, areas of separation were determined by API stability, deployment concerns, or even just internal team politics. In the future, a repo might be split out of a monolith to become an area of responsibility that a chatbot can reason about without getting lost in the complexity.
IMHO we should always architect code to take advantage of human skills.
1°) When there is an issue to debug and fix in a not-so-big codebase, LLMs can give ideas for diagnosing it, but they are pretty bad at fixing it. Where will your god be when you have a critical bug in production?
2°) Code is meant for humans in the first place, not machines. Bytecode and binary formats are meant for machines; those are not human-readable.
As a SWE, I spend more time reading code than writing it, and I want to navigate the codebase in the easiest possible way. I don't want my life to be miserable or more complicated because the code is architected to take advantage of chatbot skills.
And still, IMHO, if you need to architect your code for non-humans, there is a defect in the design. Why force yourself to write code that is not meant to be maintained by a human when you will, in any case, be the one maintaining it?
Agreed in general: the models are getting pretty good at dumping out new code, but maintaining or augmenting existing code produces pretty bad results, except for short local autocomplete.
BUT it's noteworthy that how much context the models get makes a huge difference. Feeding a lot of the existing code into the input improves the results significantly.
This might be an argument in favor of a microservices architecture with the code split across many repos rather than a monolithic application with all the code in a single repo. It's not that microservices are necessarily technically better but they could allow you to get more leverage out of LLMs due to context window limitations.
Most senior SWEs, no. But most technical people in software do a lot of what the parent commenter describes, in my experience. At my last company there was a team of about 5 people whose job was just to make small design changes (HTML/CSS) to the website. Many technical people I've worked with over the years were focused on managing and configuring things in CMSs and CRMs, which often requires a bit of technical and coding experience. At the place I currently work we have a team of people writing simple Python and Node scripts for client integrations.
There's a lot of variety in technical work, with many modern technical jobs involving a bit of code, but not at the same complexity and novelty as the types of problems a senior SWE might be working on. HN is full of very senior SWEs. It's really no surprise people here still find LLMs to be lacking. Outside of HN I find people are far more impressed/worried by the amount of their job an LLM can do.
I agree, but the reason it won't be an apocalypse is the same reason economists get most things wrong: it's not an efficient market.
Relatively speaking, we live in a bubble. There are still broad swaths of the economy that operate with pen and paper, and another broad swath that only migrated off 1980s-era AS/400 systems in the last few years. Even if we had ASI available literally today (and we don't), I'd give it 20-30 years until the guy who operates your corner market or the local auto repair shop has any use in the world for it.
I predicted the same about websites, social media presence, Google Maps presence, etc. 10-15 years ago, but lo and behold, even the small hole-in-the-wall burger place in rural eastern Europe is now on Google Maps with reviews, and even replies from the owner, plus a Facebook page with info on changes to opening hours. I'd have said there's no way that fat 60-year-old guy would get up to date with online stuff.
But gradually they were forced to.
If enough auto repair shops can diagnose and process n times more cars in a day, it will absolutely force the rest to adopt it as well, whether they like the aesthetics or not, whether they feel like learning new things or not. Suddenly they will be super interested in how to use it, regardless of how much they boasted about being old school and hands-on beforehand.
If a technology gives a big enough boost to productivity, there's simply no way for inertia to hold it back, outside of the most strictly regulated fields such as medicine. I do expect those to lag behind by some years, but they will have to catch up once the benefits are clear in lower-stakes industries and demand becomes so immense that politicians are forced to break the doctors' cartel's grip on things.
This doesn't apply to literal ASI, mostly because copy-pasteable intelligence is an absolute gamechanger, particularly if the physical interaction problems that prevent exponential growth (think autonomous robot factory) are solved (which I'd assume a full ASI could do).
People keep comparing to other tools, but a real ASI would be an agent, so the right metaphor is not the effect of the industrial revolution on workers, but the effect of the internal combustion engine on the horse.
Definitely matches my experience as well. I've been working away on a very quirky, non-idiomatic 3D codebase, and LLMs are a mixed bag there. Y is down, there's no perspective distortion or Z buffer, there are no meshes, it's a weird place.
It's still useful for saving me from writing 12 variations of x1 = sin(r2) - cos(r1) while implementing some geometric formula, but absolutely awful at understanding how those fit into a deeply atypical environment. I also have to put blinders on it. Giving it too much context just throws it back into that typical 3D rut and has it trying to slip in perspective distortion again.
Yeah I have the same experience. I’ve done some work on novel realtime text collaboration algorithms. For optimisation, I use some somewhat bespoke data structures. (Eg I’m using an order-statistic tree storing substring lengths with internal run-length encoding in the leaf nodes).
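For anyone unfamiliar with that kind of structure, here's a rough TypeScript sketch of the general shape - not my actual implementation, and the names are made up:

```typescript
// Rough sketch of an order-statistic tree with run-length-encoded leaves.
// Consecutive characters from a single insert are collapsed into one "run",
// and each internal node caches the character count of its subtree so that
// "find the node containing character offset N" is a log-time descent.
type Run = { opId: number; length: number };   // one entry covers `length` consecutive chars

type LeafNode = { kind: "leaf"; runs: Run[] };
type InternalNode = { kind: "internal"; children: TreeNode[]; totalLength: number };
type TreeNode = LeafNode | InternalNode;

function lengthOf(node: TreeNode): number {
  return node.kind === "leaf"
    ? node.runs.reduce((sum, run) => sum + run.length, 0)
    : node.totalLength;
}

// Order-statistic descent: find the leaf (and remaining offset) for a character position.
function findLeaf(node: TreeNode, offset: number): { leaf: LeafNode; offset: number } {
  if (node.kind === "leaf") return { leaf: node, offset };
  for (const child of node.children) {
    const size = lengthOf(child);
    if (offset < size) return findLeaf(child, offset);
    offset -= size;
  }
  throw new Error("offset out of range");
}
```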
ChatGPT is pretty useless with this kind of code. I got it to help translate a run-length-encoded B-tree from Rust to TypeScript. Even with a reference, it still introduced a bunch of new bugs. Some were very subtle.
It's just not there yet, but I think it will handle translation-style tasks quite capably in the next 12 months, especially if asked to translate a single file, or a selection within a file, line by line. Right now it's quite bad, which I find surprising. I have less confidence we'll see whole-codebase or even module-level understanding of novel topics in the next 24 months.
There's also a question of the quality of the source data. At least in TypeScript/JavaScript land, the vast majority of code appears to be low quality, buggy, or ignoring important edge cases, so even when working on "boilerplate" it can produce code that appears to work but will fall over in production for 20% of users (for example, string-handling code that tears Unicode graphemes like emoji).
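As a concrete illustration of that failure mode (my own example, not code from any particular codebase): naive UTF-16 slicing tears an emoji apart, while grapheme-aware segmentation keeps it whole.

```typescript
// Naive .slice() counts UTF-16 code units, so it can cut an emoji in half.
const text = "hi 👍🏽";                  // the emoji is 4 code units (two surrogate pairs)
console.log(text.slice(0, 4));          // "hi " plus a lone surrogate - a torn grapheme

// Intl.Segmenter (built into modern runtimes) iterates whole graphemes instead.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(text)].map(s => s.segment);
console.log(graphemes.slice(0, 4).join(""));   // "hi 👍🏽" - the emoji stays whole
```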
Working on extending the [Zdog](https://zzz.dog) library, adding some new types and tooling, patching bugs I run into on the way.
All the quirks inherit from it being based on (and rendering to) SVG. SVG is Y-down, Zdog only adds Z-forward. SVG only has layering, so Zdog only z-sorts shapes as wholes. Perspective distortion needs more than dead-simple affine transforms to properly render beziers, so Zdog doesn't bother.
The thing that really throws LLMs is the rendering. Parallel projection allows for optical 2D treachery, and Zdog makes heavy use of it. Spheres are rendered as simple 2D circles, a torus can be replicated with a stroked ellipse, a cylinder is just two ellipses and a line with a stroke width of $radius. LLMs struggle to even make small tweaks to existing objects/renderers.
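For readers who haven't used it: the trick is that without a perspective divide, projection is just an affine transform, so depth never changes apparent size. A minimal sketch of the idea (not Zdog's actual code; the names are made up):

```typescript
// Minimal sketch of why parallel projection enables the 2D shortcuts.
// There is no perspective divide, so screen position is an affine function of the
// 3D position and apparent size never depends on depth.
type Vec3 = { x: number; y: number; z: number };

// Rotate around the Y axis, then project by simply dropping z (y stays "down", like SVG).
function project(p: Vec3, rotateY: number): { x: number; y: number } {
  const cos = Math.cos(rotateY), sin = Math.sin(rotateY);
  return { x: p.x * cos + p.z * sin, y: p.y };
}

// A "sphere" is therefore just a filled circle at its projected centre,
// drawn with its true radius - no shrinking with distance, no mesh required.
function sphereToSvg(center: Vec3, radius: number, rotateY: number): string {
  const c = project(center, rotateY);
  return `<circle cx="${c.x}" cy="${c.y}" r="${radius}" fill="currentColor" />`;
}
```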
Yep. But wonderful at aggregating details from twelve different man pages to write a shell script I didn't even know was possible to write using the system utils.
Is it 'only' "aggregating details from twelve different man pages", or has it 'studied' (scraped) all accessible code on GitHub/GitLab/StackExchange etc. and any other publicly available code repositories on the web (and, in Microsoft's case, the GitHub it owns)? Together with descriptions of what is right and what is wrong...
I use it for code, and I only do fine-tuning. When I want something that has clearly never been done before, I 'talk' to it and steer it toward which method to use; to a human brain, some of those suggestions/instructions are clearly obvious (use an Integer and not a Double, or use Color, not Weight). So I do 'teach' it as well when I use it.
Now, I imagine that when a million people use LLMs to write code and fine-tune it (the code), we are inherently training the LLMs to write even better code.
So it's not just "...different man pages..." but "the finest coding brains (excluding mine) tweaking and training it".
A CRUD backend app for a business in a common sector? It's mostly just connecting stuff together (though I would argue that an experienced dev with a good stack can write it directly in less time than it takes to painstakingly explain it to an LLM in an inexact human language).
Some R&D stuff, or even debugging any kind of code? It's almost useless, as it would require deep reasoning, where these models absolutely break down.
Have you tried debugging using the new "reasoning" models yet?
I have been extremely impressed with o1, o3, o4-mini and Gemini 2.5 as debugging aids. The combination of long context input and their chain-of-thought means they can frequently help me figure out bugs that span several different layers of code.
In my experience they're not great with mathy code, for example. I had a function that did subdivision of certain splines and got some of the coefficients wrong. I pasted my function into these reasoning models and asked "does this look right?", and they all had a whole bunch of math formulas in their reasoning and said "this is correct" (which it wasn't).
Wait, I've found it very good at debugging. It iteratively states a hypothesis, tries things, and reacts to what it sees.
It thinks of things that I don’t think of right away. It tries weird approaches that are frequently wrong but almost always yield some information and are sometimes spot on.
And sometimes there's some annoying thing where letting Claude bang its head against it for $1.25 in API calls is slower than doing it myself, but I get to spend my time and emotional bandwidth elsewhere.
I agree with this. I do mostly DevOps stuff for work and it’s great at telling me about errors with different applications/build processes. Just today I used it to help me scrape data from some webpages and it worked very well.
But when I try to do more complicated math it falls short. I do have to say that Gemini Pro 2.5 is starting to get better in this area though.
If you'd like a creative waste of time, make it implement any novel algorithm that mixes the idea of X with Y. It will fail miserably, double down on the failure and hard troll you, run out of context and leave you questioning why you even pay for this thing. And it is not something that can be fixed with more specific training.
I asked a ChatGPT reasoning model to solve chess endgames where white has a king and a queen vs. a king and a rook on a 7x8 chessboard - that is, to compute the value of every position and find the one that is the longest win for white.
Not a creative, novel, or difficult algorithmic task, but one that requires some reasoning, planning, and precision.
I think you need to be more specific about which "chatgpt reasoning model" you used. Even the free version of chatgpt has reasoning/thinking now but there are also o1-mini, o1, o1-pro, o3-mini, o3, and o4-mini and they all have very different capabilities.
My favorite example is implementing NEAT with Keras dense layers instead of graphs. Last time I tried, with Claude 3.7, it wrote code to mutate the output layer (??). I tried to prevent that a few times and gave up.
I've been surprised that so much focus was put on generative uses for LLMs and similar ML tools. It seems to me like they have a way better chance of being useful when tasked with interpreting given information rather than generating something meant to appear new.
> Is what you're doing taking a large amount of text and asking the LLM to convert it into a smaller amount of text? Then it's probably going to be great at it. If you're asking it to convert into a roughly equal amount of text it will be so-so. If you're asking it to create more text than you gave it, forget about it.
This quote sounds clever, but it is very different from my experience.
I have been very pleased with responses to things like: "explain x", "summarize y", "make up a parody song about A to the tune of B", "create a single page app that does abc".
I've had coworkers tell me Copilot works well for refactoring code, which also makes sense in the same vein.
It's like they wouldn't be so controversial if they hadn't decided to market it as "generative" or "AI"... though I assume fundraising valuations would move in line with the level of controversy.
FWIW, I do a lot of talks about AI in the physical security domain, and this is how I often describe AI, at least in terms of what is available today. Compared to humans, AI is not very smart, but it is tireless and able to recall data with essentially perfect accuracy.
It is easy to mistake the speed, accuracy, and scope of training data for "intelligence", but it's really just more like a tireless 5th grader.
Something I have found quite amusing about LLMs is that they are computers that don't have perfect recall - unlike every other computer for the past 60+ years.
That is finally starting to change now that they have reliable(ish) search tools and are getting better at using them.
My guess is that those questions are very typical and follow very normal patterns and use well established processes. Give it something weird and it'll continuously trip over itself.
My current project is nothing too bizarre, it's a 3D renderer. Well-trodden ground. But my project breaks a lot of core assumptions and common conventions, and so any LLM I try to introduce—Gemini 2.5 Pro, Claude 3.7 Thinking, o3—they all tangle themselves up between what's actually in the codebase and the strong pull of what's in the training data.
I tried layering on reminders and guidance in the prompting, but ultimately I just end up narrowing its view, limiting its insight, and removing even the context that this is a 3D renderer and not just pure geometry.
> Give it something weird and it'll continuously trip over itself.
And so will almost all humans. It's weird how people refuse to ascribe any human-level intelligence to it until it starts to compete with the world's top elite.
Yeah, but humans can be made to understand when and how they're wrong and narrow their focus to fixing the mistake.
LLMs apologize and then proudly present the exact same output as before, repeatedly, forever spinning their wheels at the first major obstacle to their reasoning.
> LLMs apologize and then proudly present the exact same output as before, repeatedly, forever spinning their wheels at the first major obstacle to their reasoning.
So basically like a human, at least up to young adult years in a teaching context[0], where the student is subject to the authority of the teacher (parent, tutor, schoolteacher) and can't easily weasel out of the entire exercise. Yes, even young adults will get stuck in a loop, presenting "the exact same output as before, repeatedly, forever spinning their wheels at the first major obstacle to their reasoning", or at least until something clicks, or they give up in shame (or the teacher does).
As someone currently teaching the Adobe suite to high school students, that doesn't track with what I see. When my students get stuck and frustrated, I look at the problem and remind them of the constraints and assumptions the software operates under. Almost always they realize the problem without me spelling it out, and they reinforce the mental model of the software they're building. Often just noticing me lurking and about to offer help is enough for them to pause, re-evaluate, and catch the error in their thinking before I can get out a full sentence.
Reminding LLMs of the constraints they're bumping into doesn't help. They haven't forgotten, after all. The best performance I got out of the LLMs in the project I mentioned upthread was a loop of trying out different functions, pausing, re-evaluating, realizing in its chain of thought that it didn't fit the constraints, and trying a slightly different way of phrasing the exact same approach. Humans will stop slamming their head into a wall eventually. I sat there watching Gemini 2.5 Pro internally spew out maybe 10 variations of the same function before I pulled the tokens it was chewing on out of its mouth.
Yes, sometimes students get frustrated and bail, but they have the capacity to learn and try something new. If you fall into an area that's adjacent to but decidedly not in their training data, the LLMs will feel that pull from the training data too strongly and fall right into that rut, forgetting where they're at.
A human can play tic-tac-toe or any other simple game a few minutes after having the game described to them. AI will do all kinds of interesting things that are either against the rules or extremely poor choices.
Yeah, I tried playing tic-tac-toe with ChatGPT and it did not do well.
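To underline how little the "rules" amount to, here's roughly all the game logic tic-tac-toe needs (my own sketch):

```typescript
// Roughly all of tic-tac-toe's rules: a 3x3 board, eight winning lines,
// and "a move is legal if the cell is empty and nobody has won yet."
type Cell = "X" | "O" | null;
type Board = Cell[];                   // 9 cells, row-major

const LINES = [
  [0, 1, 2], [3, 4, 5], [6, 7, 8],     // rows
  [0, 3, 6], [1, 4, 7], [2, 5, 8],     // columns
  [0, 4, 8], [2, 4, 6],                // diagonals
];

function winner(board: Board): Cell {
  for (const [a, b, c] of LINES) {
    if (board[a] && board[a] === board[b] && board[a] === board[c]) return board[a];
  }
  return null;
}

function isLegalMove(board: Board, index: number): boolean {
  return index >= 0 && index < 9 && board[index] === null && winner(board) === null;
}
```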
LLMs are limited by their context windows, so as long as the problem can be solved within those small windows, they do great.
Human neural networks are constantly being retrained, so their effective context window is huge. The LLM may be better at a complex, well-specified 200-line Python program, but the human brain is better at the 1M-line real-world application. It takes some study, though.
LLMs are like a knowledge aggregator. The reasoning models have the potential to be usefully creative, but I have yet to see evidence of it - like inventing something scientifically novel.
It takes a lot of energy to compress the data, and a lot more to actually extract something sensible, while you could just optimize the single problem you have quite easily.
They are, after all, information digesters.