
This is pretty exciting. I'm a copilot user at work, but also have access to Claude. I'm more inclined to use Claude for difficult coding problems or to review my work as I've just grown more confident in its abilities over the last several months.



I use both Claude and ChatGPT/GPT-4o a lot. Claude, the model, definitely is 'better' than GPT-4o. But OpenAI provides a much more capable app in ChatGPT and an easier development platform.

I would absolutely choose to use Claude as my model with ChatGPT if that happened (yes, I know it won't). ChatGPT as an app is just so far ahead: code interpreter, web search/fetch, fluid voice interaction, Custom GPTs, image generation, and memory. It isn't close. But Claude absolutely produces better code, only being beaten by ChatGPT because it can fetch data from the web to RAG enhance its knowledge of things like APIs.
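
For what it's worth, you can approximate that fetch-then-RAG step by hand against the Anthropic API: pull the docs page you care about and paste it into the prompt. A minimal sketch, assuming the `requests` and `anthropic` Python packages and an ANTHROPIC_API_KEY in the environment; the model id and docs URL are placeholders:

    # Sketch: manually "RAG-enhance" a Claude coding prompt with freshly fetched docs.
    import requests
    import anthropic

    docs = requests.get("https://example.com/some-library/api-reference", timeout=30).text

    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Here is the current API reference for the library:\n\n"
                + docs[:20000]  # crude truncation to stay within the context window
                + "\n\nUsing only the API shown above, write the function I described earlier."
            ),
        }],
    )
    print(message.content[0].text)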

Claude's implementation of artifacts is very good though, and I'm sure that is what led OpenAI to push out their buggy canvas feature.


It’s all a dice game with these things, you have to watch them closely or they start running you (with bad outcomes). Disclaimers aside:

Sonnet is better in the small, by a lot. It’s sharply up from idk, three months ago or something when it was still an attractive nuisance. It still tops out at “Best SO Answer”, but it hits that like 90%+. If it involves more than copy paste, sorry folks, it’s still just really fucking good copy paste.

But for sheer “doesn’t stutter every interaction at the worst moment”? You’ve got to hand it to the ops people: 4o can give you second best in industrial quantity on demand. I’m finding that if AI is good enough, then OpenAI is good enough.


>If it involves more than copy paste, sorry folks, it’s still just really fucking good copy paste.

Are you sure you're using Claude 3.5 Sonnet? In my experience it's absolutely capable of writing entire small applications based off a detailed spec I give it, which don't exist on GitHub or Stack Overflow. It makes some mistakes, especially for underspecified things, but generally it can fix them with further prompting.


I’m quite sure of what model revision their API quotes, though serious users rapidly discover that, like any distributed system, it has a rhythm to it.

And I’m not sure we disagree.

Vercel demo but Pets is copy paste.


We have entered the era of generic fashionable CRUD framework demo Too Cheap To Hawk.


Are there any good 3rd-party native frontend apps for Claude (on macOS)? I mean something like ChatGPT's app, not an editor. I guess one option would be to just run the Claude iPad app on macOS.


Jan [0] is MacOS native, open source, similar feel to the ChatGPT frontend, very polished, and offers Anthropic integration (all Claude models).

It also features one-click installation, OpenAI integration, a hub for downloading and running local models, a spec-compatible API server, global "quick answer" shortcut, and more. Really can't recommend it enough!

[0] https://github.com/janhq/jan


You can use https://recurse.chat/ if you have an Apple silicon Mac.


Msty [0] is a really good app - you can use both local and online models, and it has web search, attachments, RAG, split chats, etc., built in.

[0] https://msty.app


If you're willing to settle for a client-side-only web frontend (i.e. one that talks directly to the APIs of the models you use), TypingMind would work. It's paid, but it's good (see [0]), and I guess you could always go for the self-hosted version and wrap it in an Electron app - it's what most "native" apps are these days anyway (and LLM frontends in particular).

--

[0] - https://news.ycombinator.com/item?id=41988306


I like msty.app. Parallel prompting across multiple commercial and local models plus branching dialogs. Doesn’t do artifacts, etc, though.


It's not native, but I've been pretty happy with big-AGI. It's just an `npm run` away. I don't use it for coding tasks, though.

Its most unique feature is its "beam" facility, which allows you to send a query to multiple APIs simultaneously (if you want to cross-check) and even combine the answers.
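
(To be clear, this is not big-AGI's actual implementation, just a sketch of the general fan-out pattern "beam" describes: send one prompt to several models concurrently, then collect the answers to compare or merge. Model names are illustrative.)

    # Sketch: "beam"-style fan-out of one prompt to several models in parallel.
    import asyncio
    from openai import AsyncOpenAI

    MODELS = ["gpt-4o", "gpt-4o-mini"]  # illustrative; swap in whatever backends you use

    async def ask(client: AsyncOpenAI, model: str, prompt: str) -> tuple[str, str]:
        resp = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return model, resp.choices[0].message.content

    async def beam(prompt: str) -> dict[str, str]:
        client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        results = await asyncio.gather(*(ask(client, m, prompt) for m in MODELS))
        return dict(results)

    answers = asyncio.run(beam("Explain CRDTs in two sentences."))
    for model, text in answers.items():
        print(f"--- {model} ---\n{text}\n")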


Open-WebUI doesn't support Claude natively (only through a series of hacks), but it is absolutely "THE" go-to for a ChatGPT Pro-like experience (it is slightly better).

https://github.com/open-webui/open-webui


> But OpenAI provides a much more capable app in ChatGPT and an easier development platform

Which app are you talking about here?


FWIW, I was able to get a decent way into making my own client for ChatGPT by asking the free 3.5 version to do JS for me* before it was made redundant by the real app, so this shouldn't be too hard if you want a specific experience/workflow?

* I'm an iOS dev by experience; my main professional JS experience was something like a year before jQuery came out, so I kinda need an LLM to catch me up for anything HTML.

Also, I wanted HTML rather than native for this.


> ChatGPT as an app is just so far ahead: code interpreter, web search/fetch, fluid voice interaction, Custom GPTs, image generation, and memory. It isn't close.

Funny thing, TypingMind was ahead of them for over a year, implementing those features on top of the API, without trying to mix business model with engineering[0]. It's only recently that ChatGPT webapp got more polished and streamlined, but TypingMind's been giving you all those features for every LLM that can handle it. So, if you're looking for ChatGPT-level frontend to Anthropic models, this is it.

ChatGPT shines on mobile[1] and I still keep my subscription for that reason. On desktop, I stick to TypingMind and being able to run the same plugins on GPT-4o and Claude 3.5 Sonnet, and if I need a new tool, I can make myself one in five minutes with passing knowledge of JavaScript[2]; no need to subscribe to some Gee Pee Tee.

Now, I know I sound like a shill; I'm not. I'm just a satisfied user with no affiliation to the app or the guy that made it. It's just that TypingMind did the blindingly obvious thing to do with the API and tool support (even before the latter was released), and continues to do the obvious things with it, and I'm completely confused as to why others don't, or why people find "GPTs" novel. They're not. They're a simple idea, wrapped in tons of marketing bullshit that makes it less useful and delayed its release by half a year.

--

[0] - "GPTs", seriously. That's not a feature, that's just system prompt and model config, put in an opaque box and distributed on a marketplace for no good reason.

[1] - Voice story has been better for a while, but that's a matter of integration - OpenAI putting together their own LLM and (unreleased) voice model in a mobile app, in a manner hardly possible with the API they offered, vs. TypingMind being a webapp that uses third-party TTS and STT models via a "bring your own API key" approach.

[2] - I made https://docs.typingmind.com/plugins/plugins-examples#db32cc6... long before you could do that stuff with ChatGPT app. It's literally as easy as it can possibly be: https://git.sr.ht/~temporal/typingmind-plugins/tree. In particular, this one is more representative - https://git.sr.ht/~temporal/typingmind-plugins/tree/master/i... - PlantUML one is also less than 10 lines of code, but on top of 1.5k lines of DEFLATE implementation in JS I plain copy-pasted from the interwebz because I cannot into JS modules.


Have you tried using Cursor with Claude embedded? I can't go back to anything else; it's very nice having the AI embedded in the IDE, and it just knows all the files I am working with. Cursor can use GPT-4o too if you want.


I too use Claude more frequently than OpenAI's GPT-4o. I think this is a twofold move for MS, and I like it. Claude being more accurate / efficient for me suggests it's likely they see the same thing; that's win number one. The second is that, with all the OpenAI drama, MS has started to distance itself over a souring relationship (allegedly). If so, this could be a tactful move away.

Either way, Claude is great so this is a net win for everyone.


I'm the same, but I had a lot of issues getting structured output from Anthropic. I ended up always writing response processors. Frustrated by how fragile that was, I decided to try OpenAI structured outputs and it just worked; since they also have prompt caching now, it worked out very well for my use case.
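
For reference, the structured-outputs path described above looks roughly like this (a sketch assuming openai >= 1.40 and pydantic v2; the schema is a made-up example, not my actual one):

    # Sketch: OpenAI structured outputs parsed straight into a pydantic model.
    from openai import OpenAI
    from pydantic import BaseModel

    class Review(BaseModel):  # made-up example schema
        summary: str
        issues: list[str]
        severity: int

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Review the code and report issues."},
            {"role": "user", "content": "def add(a, b): return a - b"},
        ],
        response_format=Review,
    )
    review = completion.choices[0].message.parsed  # a validated Review instance
    print(review.issues)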

Anthropic seems to have addressed the issue using pydantic, but I haven't had a chance to test it yet.

I pretty much use Anthropic for everything else.


>The second is that, with all the OpenAI drama, MS has started to distance itself over a souring relationship (allegedly). If so, this could be a tactful move away.

I agree, this was a tactical move designed to give them leverage over OpenAI.


Yeah, Claude consistently impresses me.

A commenter on another thread mentioned it but it’s very similar to how search felt in the early 2000s. I ask it a question and get my answer.

Sometimes it’s a little (or a lot) wrong or outdated, but at least I get something to tinker with.


I recently tried to ask these tools for help with using a popular library, and both GPT-4o and Claude 3.5 Sonnet gave highly misleading and unusable suggestions. They consistently hallucinated APIs that didn't exist, and would repeat the same wrong answers, ignoring my previous instructions. I spent upwards of 30 minutes repeating "now I get this error" to try to coax them in the right direction, but always ending up in a loop that got me nowhere. Some of the errors were really basic too, like referencing a variable that was never declared, etc. Finally, Claude made a tangential suggestion that made me look into using a different approach, but it was still faster to look into the official documentation than to keep asking it questions. GPT-4o was noticeably worse, and I quickly abandoned it.

If this is the state of the art of coding LLMs, I really don't see why I should waste my time evaluating their confident sounding, but wrong, answers. It doesn't seem like much has improved in the past year or so, and at this point this seems like an inherent limitation of the architecture.


FWIW I almost never ask it to write code for me. I did once to write a matplotlib script and it gave me a similar headache.

I ask it questions mostly about libraries I’m using (usually that have poor documentation) and how to integrate it with other libraries.

I found out about Yjs by asking about different operational transform patterns.

Got some context on the prosemirror plugin by pasting the entire provider class into Claude and asking questions.

It wasn’t always exactly correct, but it was correct enough that it made the process of learning prosemirror, yjs, and how they interact pretty nice.

The “complete” examples it kept spitting out were totally wrong, but the information it gave me was not.


To be clear, I didn't ask it to write something complex. The prompt was "how do I do X with library Y?", with a bit more detail. The library is fairly popular and in a mainstream language.

I had a suspicion that what I was trying to do was simply not possible with that library, but since LLMs are incapable of saying "that's not possible" or "I don't know", they will rephrase your prompt and hallucinate whatever might plausibly make sense. They have no way to gauge whether what they're outputting is actually correct.

So I can imagine that you sometimes might get something useful from this, but if you want a specific answer about something, you will always have to double-check their work. In the specific case of programming, this could be improved with a simple engineering task: integrate the output with a real programming environment, and evaluate the result of actually running the code. I think there are coding assistant services that do this already, but frankly, I was expecting more from simple chat services.
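
To make that concrete, the loop I have in mind is roughly the following; `generate_code` stands in for whatever model call you use (hypothetical helper), and since you're executing arbitrary generated code, it should really run in a sandbox:

    # Sketch: generate, run, and feed the error back until the snippet executes cleanly.
    import subprocess
    import sys
    import tempfile

    def run_snippet(code: str) -> tuple[bool, str]:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
        return proc.returncode == 0, proc.stderr

    def generate_and_verify(task: str, generate_code, max_attempts: int = 3):
        prompt = task
        for _ in range(max_attempts):
            code = generate_code(prompt)  # hypothetical LLM wrapper
            ok, stderr = run_snippet(code)
            if ok:
                return code
            prompt = f"{task}\n\nYour last attempt failed with:\n{stderr}\nFix it."
        return None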


> if you want a specific answer about something

Specific is the specific thing that statistical models are not good at :(

> how do I do X with library Y?

Recent research and anecdotal experience has shown that LLMs perform quite poorly with short prompts. Attention just has more data to work with when there are more tokens. Try extending that question like “I am using this programming language and am trying to do this task with this library. How do I do this thing with this other library”

I realize prompt engineering like this is fuzzy and “magic,” but short prompts have consistently lower performance.

> In the specific case of programming, this could be improved with a simple engineering task: integrate the output with a real programming environment, and evaluate the result of actually running the code.

Not as simple as you’d think. You’re letting something run arbitrary code.

Tho you should give aider.chat a try if you want to test out that workflow. I found it very very slow.


> Recent research and anecdotal experience has shown that LLMs perform quite poorly with short prompts.

I'm aware of that. The actual prompt was more elaborate. I was just mentioning the gist of it here.

Besides, you would think that after 30 minutes of prompting and corrections it would arrive at the correct answer. I'm aware that subsequent output is based on the session history, but I would also expect this to be less of an issue if the human response was negative. It just seems like sloppy engineering otherwise.

> Specific is the specific thing that statistical models are not good at

Some models are good at needle-in-a-haystack problems. If the information exists, they're able to find it. What I don't need is for it to hallucinate wrong answers if the information doesn't exist.

This is a core problem of this tech, but I also expected it to improve over time.

> Tho you should give aider.chat a try

Thanks, I'll do that eventually. If it's slow, it can get faster. I'd rather the tool be slow but give correct answers than have it slow me down by wasting my time correcting its errors.

Thankfully, these approaches can work for programming tasks. There is not much that can be done to verify the output in most other domains.


Well, it is a volume business. The <1% of advanced-skill developers will find an AI helper useless, but for the 99% of IT CRUD peddlers these tools are quite sufficient. All in all, if employers cut 15-20% of net development costs by reducing head count, it will be very worthwhile for companies.


I suspect it will go a different direction.

Codebases are exploding in size. Feature development has slowed down.

What might have been a carefully designed 100kloc codebase in 2018 is now a 500kloc ball of mud in 2024.

Companies need many more developers to complete a decent sized feature than they needed in 2018.


It's worse than that. Now the balls of mud are distributed. We get incredibly complex interactions between services which need a lot of infrastructure to enable them, that requires more observability, which requires more infrastructure...


Yeah. You can fit a lot of business logic into a 100kloc monolith written by skilled developers.

Once you start shifting it to micro services the business logic gets spread out and duplicated.

At the same time each micro-service now has its own code to handle rest, graphql, grpc endpoints.

And each downstream call needs error handling and retry logic.

And of course now you need distributed tracing.

And of course now your auth becomes much more complex.

And of course now each service might be called multiple times for the one request - better make them idempotent.

And each service will drift in terms of underlying libraries.

And so on.

Now we have been adding in LLM solutions so there is no consistency in any of the above services.

Each dev, rather than looking at the existing approaches, instead asks Claude, and it provides a slightly different way each time - often pulling in additional libraries we have to support.

These days I see so much bad code like a single microservice with 3 different approaches to making a http request.


Agree. But we are already in that loop. A properly written 50 KLOC monolith ("hence outdated") app is now 30 microservices: 20 KLOC of surface plus 100 KLOC submerged in convenience libraries, with Kubernetes, Grafana, Datadog, a service mesh, and so on. From what I am seeing, companies are increasingly using off-the-shelf components, so KLOC will keep rising but developer count will not.


Sure, but my specific question was fairly trivial, using a mainstream language and a popular library. Most of my work qualifies as CRUD peddling. And yet these tools are still wasting my time.

Maybe I'll have better luck next time, or maybe I need to improve my prompting skills, or use a different model, etc. I was just expecting more from state of the art LLMs in 2024.


Yeah there is a big disconnect between the devs caught up in the hype and the devs who aren't.

A lot of the devs in my office using Claude/gpt are convinced they are so much more productive but they aren't actually producing features or bug fixes any faster.

I think they are just excited about a novel new way to write code.


Conversely, I feel that the experience of searching has degraded a lot since 2016/17. My thesis is that, around that time, online spam increased by an order of magnitude.


Old style Google search is dead, folks just haven’t closed the casket yet. My index queries are down ~90%. In the future, we’ll look back at LLMs as a major turning point in how people retrieve and consume information.


I still prefer it over using an LLM. And I doubt that LLM search has major benefits over Google search, imo.


Depends what you want it for.

Right now, I find each tool better at different things.

If I can only describe what I want but don't know the key words, LLMs are the only solution.

If I need citations, LLMs suck.


Abstractive vs. extractive search.


I think it was the switch from desktop search traffic being dominant to mobile traffic being dominant; that switch happened around the end of 2016.

Google used to prioritise big, comprehensive articles on subjects for desktop users, but mobile users just wanted quick answers, so that's what Google prioritised as mobile users became the biggest group.

But also, per your point, I think those smaller, simpler, less comprehensive posts are easier to fake/spam than the larger, more comprehensive posts that came before.


Ironically, I almost never see quick answers in the top results; mostly it's dragged-out pages of paragraph after paragraph with ads in between.


Guess who sells the ads…


Winning the war against spam is an arms race. Spam hasn’t spent years targeting AI search yet.


It's getting ridiculous. Half of the time now when I ask AI to search some information for me, it finds and summarizes some very long article obviously written by AI, and lacking any useful information.


Queries were rewritten with BERT starting even before then so it's still the same generative model problem.


I don't think this is necessarily converse to what they said.


The speed with which AI models are improving blows my mind. Humans quickly normalize technological progress, but it's staggering to reflect on our progress over just these two years.


Yes! I'm much more inclined to write one-off scripts for short manual tasks, as I can usually get AI to produce something useful very fast. For example, last week I worked with Claude to write a script to get a sense of how many PRs my company had that included comprehensive testing. This was borderline best done as a manual task previously; now I just ask Claude to write a short bash script that uses the GitHub CLI to do it, and I've got a repeatable, reliable process for pulling this information.
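
A sketch of what such a script can boil down to, using the GitHub CLI's JSON output (assumes `gh` is installed and authenticated; the "touches test files" heuristic is just a stand-in for whatever definition of comprehensive testing you give it):

    # Sketch: count recent merged PRs whose changed files include test files, via the gh CLI.
    import json
    import subprocess

    def gh_json(*args):
        out = subprocess.run(["gh", *args], capture_output=True, text=True, check=True).stdout
        return json.loads(out)

    prs = gh_json("pr", "list", "--state", "merged", "--limit", "200", "--json", "number")
    with_tests = 0
    for pr in prs:
        files = gh_json("pr", "view", str(pr["number"]), "--json", "files")["files"]
        if any("test" in f["path"].lower() for f in files):
            with_tests += 1

    print(f"{with_tests}/{len(prs)} recent merged PRs touch test files")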


I rarely use LLMs for tasks, but I love them for exploring spaces I would otherwise just ignore. Like, writing some random bash script isn't difficult at all, but it's also so fiddly that I just don't care to do it. It's nice to just throw a bot at it and come back later. Loosely speaking.

Still, I find very little use for LLMs on this front, but they do come in handy randomly.


Lots of progress, but I feel like we've been seeing diminishing returns. I can't help but feel like recent improvements are just refinements and not real advances. The interest in AI may drive investment and research in better models that are game-changers, but we aren't there yet.


You're proving GP's point about the normalization of progress. It's been two years. We're still in the first iteration of applications of this new tech; advancements haven't had time yet to start compounding. This is barely getting started.


Neither of your points is proven. Is the slowdown a real effect of hitting technological/technique/data limits? Or is it just a lull in the storm?


I don't know about you, but o1-preview/o1-mini has been able to solve many moderately challenging programming tasks that would've taken me 30 mins to an hour. No other models earlier could've done that.


It's an improvement but...I've asked it to do some really simple tasks and it'll occasionally do them in the most roundabout way you could imagine. Like, let's source a bash file that creates and reads a state file to do something for which the functionality was already built-in. Say I'm a little skeptical of this solution and plug it into a new o1-preview prompt to double check the solution, and it starts by critiquing the bash script and error handling instead of seeing that the functionality is baked in and it's plainly documented. Other errors have been more subtle.

When it works, it's pretty good, and sometimes great. But when failure modes look like the above I'm very wary of accepting its output.


> I've asked it to do some really simple tasks and it'll occasionally do them in the most roundabout way you could imagine.

But it still does the tasks you asked for, so that's the part that really matters.


I wonder how long people will still protest in these threads that "It doesn't know anything! It's just an autocomplete parrot!"

Because.. yea, it is. However.. it keeps expanding, it keeps getting more useful. Yea people and especially companies are using it for things which it has no business being involved in.. and despite that it keeps growing, it keeps progressing.

I do find the "stochastic parrot" comments slowly dwindle in number and volume with each significant release, though.

Still, I find it weirdly interesting to see a bunch of people be both right and "wrong" at the same time. They're completely right, and yet it's like they're also being proven wrong in the ways that matter.

Very weird space we're living in.


You're conflating three different things.

There's the question, "is an LLM just autocomplete?" The answer to that question is obviously no, but the question is also a strawman - people who actually use LLMs regularly do recognize that there is more to their capabilities than randomized pattern matching.

Separately, there's the question of "will LLMs become AGI and/or super intelligent?" Most people recognize that LLMs are not currently super intelligent, and that there currently isn't a clear path toward making them so. Still, many people seem to feel that we're on the verge of progress here, and feel very strongly that anyone who disagrees is an AI "doomer".

Then there's the question of "are we in an AI bubble?" This is more a matter of debate. Some would argue that if LLM reasoning capabilities plateau, people will stop investing in the technology. I actually don't agree with that view - I think there is a lot of economic value still to be realized in AI advancements - and I don't think we're on the verge of some sort of AI winter, even if LLMs never become super intelligent.


> Most people recognize that LLMs are not currently super intelligent,

I think calling it intelligent is being extremely generous. Take a look at the following example which is a spelling and grammar checker that I wrote:

https://app.gitsense.com/?doc=f7419bfb27c89&temperature=0.50...

When the temperature is 0.5, both Claude 3.5 and GPT-4o can't properly recognize that GitHub is capitalized. You can see the responses by clicking in the sentence. Each model was asked to validate the sentence 5 times.

If the temperature is set to 0.0, most models will get it right (most of the time), but Claude 3.5 still can't see the sentence in front of it.

https://app.gitsense.com/?doc=f7419bfb27c89&temperature=0.00...

Right now, LLM is an insanely useful and powerful next word predictor, but I wouldn't call it intelligent.
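
The underlying experiment is easy to reproduce outside my checker: the same question repeated a few times per temperature. A rough sketch against the Anthropic API (the model id is a placeholder, and the prompt and answer check are simplified stand-ins for what my checker actually does):

    # Sketch: repeat one capitalization question N times per temperature and count correct answers.
    import anthropic

    client = anthropic.Anthropic()
    prompt = ('Is the word "GitHub" capitalized correctly in this sentence? '
              'Answer YES or NO only: "I pushed the fix to GitHub yesterday."')

    for temperature in (0.5, 0.0):
        correct = 0
        for _ in range(5):
            msg = client.messages.create(
                model="claude-3-5-sonnet-latest",  # placeholder model id
                max_tokens=64,
                temperature=temperature,
                messages=[{"role": "user", "content": prompt}],
            )
            if "YES" in msg.content[0].text.upper():  # naive check that it saw the correct capitalization
                correct += 1
        print(f"temperature={temperature}: recognized correct capitalization {correct}/5 times")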


> I think calling it intelligent is being extremely generous ... can't properly recognize that GitHub is capitalized.

Wouldn't this make chimpanzees and ravens and dolphins unintelligent too? You're asking it to do a task that's (mostly) easy for humans. It's not a human though. It's an alien intelligence which "thinks" in our language, but not in the same way we do.

If they could, specialized AI might think we're unintelligent based on how often we fail, even with advanced tools, pattern matching tasks that are trivial for them. Would you say they're right to feel that way?


Animals are capable of learning. LLMs cannot. An LLM uses weights that are defined during the training process to decide what to do next. An LLM cannot self-evaluate based on what it has said. You have to create a new message for it to create a new probability path.

Animals have the ability to learn and grow by themselves. LLMs are not intelligent and I don't see how they can be since they just follow the most likely path with randomness (temperature) sprinkled in.


Ok so just to be clear, that's an entirely different and unrelated argument from the one I responded to.

Second, it's wrong. LLMs can learn within their context window. The main issue now is the limited size of their context window; animals have a lifetime of compressed context and LLMs only have approximately one conversation.


> Ok so just to be clear, that's an entirely different and unrelated argument from the one I responded to.

Honestly, what you were saying made no sense to me, so I didn't respond to it directly; I assumed it would be clear from my explanation why animals can be intelligent and LLMs are not.

> LLMs can learn within their context window.

They don't learn from the context window as much as they use what is in the context window to define a probabilistic path. If you put something in the context window that it was never trained on, it would spit out BS or say it doesn't know.


The "statistical parrot" parrots have been demonstrably wrong for years (see e.g. LeCun et al[1]). It's just harder to ignore reality with hundreds of millions of people now using incredible new AI tools. We're approaching "don't believe your lying eyes" territory. Deniers will continue pretending that LLMs are just an NFT-level fad or bubble or whatever. The AI revolution will continue to pass them by. More's the pity.

[1] https://arxiv.org/abs/2110.09485


> Deniers will continue pretending that LLMs are just an NFT-level fad or bubble or whatever. The AI revolution will continue to pass them by. More's the pity.

You should re-read that very slowly and carefully and really think about it. Calling anyone that's skeptical a 'denier' is a red flag.

We have been through these AI cycles before. In every case, the tools were impressive for their time. Their limitations were always brushed aside and we would get a hype cycle. There was nothing wrong with the technology, but humans always like to try to extrapolate their capabilities and we usually get that wrong. When hype caught up to reality, investments dried up and nobody wanted to touch "AI" for a while.

Rinse, repeat.

LLMs are again impressive, for our time. When the dust settles, we'll get some useful tools but I'm pretty sure we will experience another – severe – AI winter.

If we had some optimistic but also realistic discussions of their limitations, I'd be less skeptical. As it is, we are talking about 'revolution', and developers being out of jobs, and superintelligence and whatnot. That's not the level the technology is at today, and it is not clear we are going to do anything other than get stuck in a local maximum.


A trillion dimensional stochastic parrot is still a stochastic parrot.

If these systems showed understanding we would notice.

No one is denying that this form of intelligence is useful.


I don't know how you can say they lack understanding of the world when, in pretty much any standardised test designed to measure human intelligence, they perform better than the average human. The only thing they don't understand is touch, because they're not trained on that, but they can already understand audio and video.


You said it, those tests are designed to measure human intelligence, because we know that there is a correspondence between test results and other, more general tasks - in humans. We do not know that such a correspondence exists with language models. I would actually argue that they demonstrably do not, since even an LLM that passes every IQ test you put in front of it can still trip up on trivial exceptions that wouldn't fool a child.


So they fail in their own way? They're not humans; that's to be expected.


An answer key would outperform the average human, but it isn't intelligent. Tests designed for humans are not appropriate for judging non-humans.


No, you don't understand: if I put a billion billion trillion monkeys on typewriters, they're actually now one super intelligent monkey because they're useful now!

We just need more monkeys and it will be the same as a human brain.


What does the mass of users change about what it is? How many of them check the results for hallucinations, and how many don't because they trust the AI?

More than once these tools fail at tasks a fifth grader could understand.


Are you confusing frequency of use with usefulness?

If these tools boost productivity, where is the output spike at all the companies, the spike in revenue and profits?

How often do we lose the benefit of auto text generation to the loop of "That's wrong" / "Oh yes, of course, here is the correct version" / "Nope, still wrong" / prompt editing?


One service is not really enough -- you need a few to triangulate more often than not, especially when it comes to code using the latest versions of public APIs.

Phind is useful as you can switch between them -- but you only get a handful of o1 and Opus queries a day, which I burn through quickly at the moment on deeper things -- Phind-405b and 3.5 Sonnet are decent for general use.


Switch to Cursor with Claude backend and 5x immediately



