I share your sentiment. I've written three apps using language models extensively (a different one for each: ChatGPT, Mixtral, and Llama-70B), and while I agree that they were immensely helpful in terms of velocity, there are a bunch of caveats:
- it only works well when you write code from scratch; the context length is too short to be really helpful for working on an existing codebase.
- the output code is pretty much always broken in some way, and you need to be accustomed to doing code reviews to use it effectively. If you trusted the output and had to debug it later, it would be a painfully slow process.
Also, I didn't really notice a significant difference in code quality: even the best model (GPT-4) writes code that doesn't work, and I find it much more efficient to use open models on Groq due to the really fast inference. Watching ChatGPT slowly type is really annoying (I didn't test o1, and I have no interest in doing so because of its very low throughput).
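For context, Groq exposes an OpenAI-compatible chat-completions endpoint, so switching from ChatGPT mostly means pointing the request elsewhere. A minimal sketch of building such a request (the model name and the placeholder API key are assumptions, not something from this thread; the request is constructed but not sent):

```python
import json
import urllib.request

def build_groq_request(prompt, model="llama3-70b-8192", api_key="YOUR_KEY"):
    """Build (but don't send) a chat-completion request against Groq's
    OpenAI-compatible endpoint. Pass a real API key before sending."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://api.groq.com/openai/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Sending it with `urllib.request.urlopen` (or any OpenAI-compatible client pointed at that base URL) returns the usual `choices[0].message.content` response shape.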
> context length is too short to be really helpful for working on an existing codebase.
This is kind of true. My approach is to spend a fairly large amount of time copy-pasting code from the relevant modules back and forth into ChatGPT so it has enough context to make the correct changes. Most changes I need to make don't involve more than 2-3 modules, though.
> the output code is pretty much always broken in some way, and you need to be accustomed to doing code reviews to use it effectively.
I think this really depends on what you're building. Making a CRM is a very well-trodden path, so I think that helps? But even when it came to asking ChatGPT to design and implement a flexible data model, it did a very good job. Most of the code it's written has worked well. I'd say maybe 60-70% of the code it writes I don't have to touch at all.
The slow typing is definitely a hindrance! Sometimes when it's a big change I lose focus and alt-tab away, like I used to do when building large C++ codebases or waiting for big test suites to run. So that aspect saps productivity. Conversely though I don't want to use a faster model that might give me inferior results.
> approach is I spend a fairly large amount of time copy-pasting code from relevant modules back and forth into ChatGPT
It can work, but what a terrible developer experience.
> I'd say maybe 60-70% of the code it writes I don't have to touch at all
I used it to write web apps, so the ratio was even higher I'd say (maybe 80-90% of the code didn't need any modification), but the app itself wouldn't work at all if I hadn't made the remaining 10-20% of changes. And you really need to read 100% of the code, because you won't know upfront where those changes will be.
> The slow typing is definitely a hindrance! Sometimes when it's a big change I lose focus and alt-tab away, like I used to do when building large C++ codebases or waiting for big test suites to run.
Yeah, exactly: it's xkcd 303 but with “AI processing the response” instead of “compiling”. Getting instant responses was a game changer for me in terms of focus, and hence productivity.
> I don't want to use a faster model that might give me inferior results
As I said earlier, I didn't really feel a difference in quality, so the switch came without drawbacks.
> Also, I didn't really notice a significant difference in code quality, even the best model (GPT-4) writes code that doesn't work,
Interesting; personally, I have noticed a difference, mostly in how well the models pick up small details and context. Although I do have to agree that the open Llama models are generally fairly serviceable.
Recently I have tended to lean towards Claude 3.5 Sonnet, as it seems slightly better, although that does differ per language as well.
As far as them being slow goes, I haven't really noticed a difference. I use them mostly through the API with Open WebUI, and the answers come quickly enough.
I use o1 for research rather than coding. If I have a complex question that requires combining multiple ideas or references and checking the result, it's usually pretty good at that.
Sometimes that results in code, but it's the research and cross-referencing that's actually useful with it.