Semantic unit testing: test code without executing it (alexmolas.com)
63 points by alexmolas 10 hours ago | 60 comments





Maybe someone can help me out here:

I always get the feeling that fundamentally our software should be built on a foundation of sound logic and reasoning. That doesn't mean that we cannot use LLMs to build that software, but it does mean that in the end every line of code must be validated to make sure there are no issues injected by the LLM tools, which inherently lack logic and reasoning, or at least such validation must be on par with human-authored code plus review. Because of this, the validation cannot be done by an LLM, as it would just compound the problem.

Unless we get a drastic change in the level of error detection and self-validation that can be done by an LLM, this remains a problem for the foreseeable future.

How is it then that people build tooling where the LLM validates the code they write? Or claim 2x speedups for code written by LLMs? Is there some kind of false positive/negative tradeoff I'm missing that allows people to extract robust software from an inherently not-robust generation process?

I'm not talking about search and documentation, where I'm already seeing a lot of benefit from LLMs today, because between the LLM output and the code is me, sanity checking and filtering everything. What I'm asking about is the: "LLM take the wheel!" type engineering.


It's a common idea, going all the way back to Hoare logic. There was a time when people believed that, in the future, people would write specifications instead of code.

The problem is that it takes several times more effort to verify code than to write it. This makes intuitive sense if you consider that the search space for the properties of code is much larger than the space of the code itself. Rice's theorem states that all non-trivial semantic properties of a program are undecidable.


No, Rice's theorem states that there is no general procedure to take an arbitrary program and decide nontrivial properties of its behaviour. As software engineers, though, we write specific programs which have properties which can be decided, perhaps by reasoning specific to the program. (That's, like, the whole point of software engineering: you can't claim to have solved a problem if you wrote a program such that it's undecidable whether it solved the problem.)

The "several times more effort to verify code" thing: I'm hoping the next few generations of LLMs will be able to do this properly! Imagine if you were writing in a dependently typed language, and you wrote your test as simply a theorem, and used a very competent LLM (perhaps with other program search techniques; who knows) to fill in the proof, which nobody will never read. Seems like a natural end state of the OP: more compute may relax the constraints on writing software whose behaviour is formally verifiable.


> That doesn't mean that we cannot use LLMs to build that software, but it does mean that in the end every line of code must be validated to make sure there are no issues injected by the LLM tools, which inherently (...)

The problem with your assertion is that it fails to understand that today's software, where every single line of code was typed in by real flesh-and-bone humans, already fails to have adequate test coverage, let alone be validated.

The main problem with output from LLMs is that they were trained on code written by humans, and thus they accurately reflect the quality of the code that's found in the wild. Consequently, your line of reasoning actually criticizes LLMs for outputting the same unreliable code that people write.

Counterintuitively, LLMs end up generating a better output because at least they are designed to simplify the task of automatically generating tests.


Right but by your reasoning it would make sense to use LLMs only to augment an incomplete but rigorous testing process, or to otherwise elevate below average code.

My issue is not necessarily with the quality of the code, but rather with the intention of the code, which is much more important: a good design without tests is more durable than a bad design with tests.


> Right but by your reasoning it would make sense to use LLMs only to augment an incomplete but rigorous testing process, or to otherwise elevate below average code.

No. It makes sense to use LLMs to generate tests. Even if their output matches the worst output the average human can write by hand, having any coverage whatsoever already raises the bar from where the average human output is.

> My issue is not necessarily with the quality of the code, but rather with the intention of the code (...)

That's not the LLM's responsibility. Humans specify what they want and LLMs fill in the blanks. If today's LLMs output bad results, that's a reflection of the prompts. Garbage in, garbage out.


> No. It makes sense to use LLMs to generate tests. Even if their output matches the worst output the average human can write by hand, having any coverage whatsoever already raises the bar from where the average human output is.

Although this is true, it disregards the fact that prompting for tests takes time that could also be spent writing tests, and it's not clear whether poor-quality tests are free, in the sense that further development may cause these tests to fail for the wrong reasons, costing time spent debugging. This is why I used the word "augment": these tests are clearly not the same quality as manual tests, and should be considered separately from manual tests. In other words, they may serve to elevate below-average code or augment manual tests, but not more than that. Again, I'm not saying it makes no sense to do this.

> That's not the LLM's responsibility. Humans specify what they want and LLMs fill in the blanks. If today's LLMs output bad results, that's a reflection of the prompts. Garbage in, garbage out.

This is unlikely to be true, for a few reasons: 1. Ambiguity makes it impossible to define "garbage"; see prompt engineering. In fact, all human natural-language output is garbage in the context of programming. 2. As the LLM fills in blanks, it must do so respecting the intention of the code; otherwise the intention of the code erodes and its design is lost. 3. This would imply that LLMs have reached their peak and can only improve by requiring less prompting from the user, which is simply not true, as it is currently trivial to find problems an LLM cannot solve regardless of the amount of prompting.


> Although this is true, it disregards the fact that prompting for tests takes time that could also be spent writing tests (...)

No, not today at least. Some services like Copilot provide plugins that implement actions to automatically generate unit tests. This means that the unit test coverage you're describing is a right-click away.

https://code.visualstudio.com/docs/copilot/copilot-smart-act...

> (...) and it's not clear whether poor-quality tests are free, in the sense that further development may cause these tests to fail for the wrong reasons, costing time spent debugging.

That's not how automated tests work. If you have a green test that turns red when you touch some part of the code, this is the test working as expected, because your code change just introduced unexpected changes that violated an invariant.

Also, today's LLMs are able to recreate all your unit tests from scratch.

> This is unlikely to be true, for a few reasons: 1. Ambiguity makes it impossible to define "garbage"; see prompt engineering.

"Ambiguity" is garbage in this context.

> 2. As the LLM fills in blanks, it must do so respecting the intention of the code; otherwise the intention of the code erodes and its design is lost.

That's the responsibility of the developer, not the LLM. Garbage in, garbage out.

> 3. This would imply that LLMs have reached their peak and can only improve by requiring less prompting from the user, which is simply not true, as it is currently trivial to find problems an LLM cannot solve regardless of the amount of prompting.

I don't think that point is relevant. The goal of a developer is still to meet the definition of done, not to tie their hands behind their back and expect working code to just fall into their lap. Currently the main approach to vibe coding is to set the architecture and lean on the LLM to progressively go from high-level to low-level details. Speaking from personal experience with vibe coding, LLMs are quite capable of delivering fully working apps from a single, detailed prompt. However, you get far more satisfactory results (i.e., the app reflects the same errors in judgement you'd make) if you just draft a skeleton and progressively fill in the blanks.


From my testing, the robots seem to 'understand' the code rather than just learn how to do thing X in code from reading code about doing X. I've thrown research papers at them and they just 'get' what needs to be done to take the idea and implement it as a library or whatever. Or, what has become my favorite activity of late: give them some code and ask them how they would make it better, then take that and split it up into simpler tasks, because they get confused if you ask them to do too much at one time.

As for debugging, they're not so good at that. Some debugging they can figure out, but if they need to do something simple, like counting how far away item A is from item B, I've found you pretty much have to do that for them. Don't get me wrong, they've found some pretty deep bugs I would have spent a bunch of time tracking down in gdb, so they aren't completely worthless, but I have definitely given up on the idea that I can just tell them the problem and they'll get to work fixing it.

And, yeah, they're good at writing tests. I usually work on Python C modules, and my typical testing is playing with them in the REPL, but my current project is getting fully tested at the C level before I've even gotten around to the Python wrapper code.

Overall it's been pretty productive using the robots: code is being written that I wouldn't have spent the time working on, unit testing is being used to make sure they don't break anything as the project progresses, and the codebase is being kept pretty sound because I know enough to see when they're going off the rails, as they often do.


LLM-based coding only really works when wrapped in structured prompts, constrained outputs, external checks etc. The systems that work well aren’t just 'LLM take the wheel' architecture, they’re carefully engineered pipelines. Most success stories are more about that scaffolding than the model itself.

Does anyone provide a good breakdown of how much time/cost goes into the scaffolding vs how much is saved from not writing the code itself?

This particular person seems to be using LLMs for code review, not generation. I agree that the problem is compounded if you use an LLM (esp. the same model) on both sides. However, it seems reasonable and useful to use it as an adjunct to other forms of testing, though not necessarily a replacement for them. Though again, the degree to which it can be a replacement is a function of the level of the technology, and it is currently at the level where it can probably replace some traditional testing methods, though it's hard to know which, ex-ante.

edit: of course, maybe that means we need a meta-suite, that uses a different LLM to tell you which tests you should write yourself and which tests you can safely leave to LLM review.


Indeed, the idea of a meta-LLM, or some sort of clear distinction between manual and automated-but-questionable tests, makes sense. What bothers me is that this does not seem to be the approach most people take: code produced by the LLM is treated the same as code produced by human authors.

If you are working with natural language, it is by definition 'fuzzy' unless you reduce it to simple templates. So to evaluate whether an output is, semantically, a reasonable answer to an input where non-templated natural verbalization is needed, you need something that 'tests' the output, and that something is not going to be purely 'logical'.

Will that test be perfect? No. But what is the alternative?


Are you referring to the process of requirements engineering? Because although I agree it's a fuzzy natural-language interface, behind the interface should be (heavy emphasis on should) a rigorously defined and designed system, where fuzziness is eliminated. The LLMs need to work primarily with the rigorous definition, not the fuzziness.

It depends on the use case. e.g. Music generation like Suno. How do you rigorously and logically check the output? Or an automated copy-writing service?

The tests should match the rigidity of the case. A mismatch in modality will lead to bad outcomes.


Aha! Like that. Yes, that's interesting; the only other alternative would be manual classification of novel data, which is extremely labour-intensive. If an LLM is able to do the same classification automatically, it opens up use cases that are otherwise indeed impossible.

I'm skeptical. Most of us maintaining medium sized codebases or larger are constantly fighting nondeterminism in the form of flaky tests. I can't imagine choosing a design that starts with nondeterminism baked in.

And if you're really dead-set on paying the price of nondeterminism to get more coverage, property-based testing has existed for a long time and has a comparatively solid track record.
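
For the record, a minimal property-based testing sketch with Hypothesis (the multiply function is just an illustration): the randomness is controlled, and any failure shrinks to a small, reproducible counterexample instead of flaking.

    # A minimal property-based testing sketch with Hypothesis (pip install hypothesis).
    # The function under test is illustrative; run with pytest.
    from hypothesis import given, strategies as st

    def multiply(a: int, b: int) -> int:
        return a * b

    @given(st.integers(), st.integers())
    def test_multiply_is_commutative(a: int, b: int) -> None:
        assert multiply(a, b) == multiply(b, a)

    @given(st.integers())
    def test_multiplying_by_one_is_identity(a: int) -> None:
        assert multiply(a, 1) == a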


I agree. I want this as a code review tool to check if people forgot to update comments - "it looks like this now adds instead of multiplies, but the comment says otherwise; did you forget to update it?".

Seems of dubious value as unit tests. LLMs don't seem to be quite smart enough for that in my experience, unless your bugs are really as trivial as adding instead of multiplying, in which case god help you.


Couldn't put it better myself.

I have the toughest time trying to communicate why f(x) should equal f(x) in the general case.


Hm... I think you have a good point.

Maybe the non-determinism can be reduced by caching: Just reevaluate the spec if the code actually changes?

I think there are also other problems (inlining a verbal description makes the codebase verbose, and writing a precise, non-ambiguous verbal description might be more work than writing unit tests).


>Maybe the non-determinism can be reduced by caching: Just reevaluate the spec if the code actually changes?

That would be good anyway to keep the costs reasonable.
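
A rough sketch of that caching idea (hypothetical helper, not part of the article's suite library): key the stored verdict on a hash of the source, so the model is only consulted when the code actually changes.

    # A rough sketch (hypothetical helper, not the article's API): cache LLM review
    # verdicts keyed by a hash of the source file, so unchanged code never re-hits the model.
    import hashlib
    import json
    from pathlib import Path

    CACHE_FILE = Path(".semantic_review_cache.json")

    def source_key(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def cached_review(path: Path, run_llm_review) -> dict:
        cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
        key = source_key(path)
        if key not in cache:
            # The only place the (non-deterministic, costly) model call happens.
            cache[key] = run_llm_review(path.read_text())
            CACHE_FILE.write_text(json.dumps(cache))
        return cache[key]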


Many good and productive approaches are non-deterministic, such as fuzzing or property-based testing.

This is more of "LLM code review" than any kind of testing, and calling it "testing" is just badly misleading.

this. Let’s not confuse meanings. There are multiple ways to improve the quality of code. Testing is one; code review is another. This belongs to the latter.

Yeah this sounds like a good way to detect out of date comments. I would have focused on that.

Agree, it's not testing. The problem is here: "In a typical testing workflow, you write some basic tests to check the core functionality. When a bug inevitably shows up—usually after deployment—you go back and add more tests to cover it. This process is reactive, time-consuming, and frankly, a bit tedious."

This is exactly the problem that TDD solves. One of the most compelling reasons for test-first is because "Running the code in your head" does not actually work well in practice, leading to the above-cited issues. This is just another variant of "Running the code in your head" except an LLM is doing it. Strong TDD practices (don't write any code without a test to support it) will close those gaps. It may feel tedious at first but the safety it creates will leave you never wanting to go back.

Where this could be safe and useful: Find gaps in the test-set. Places where the code was never written because there wasn't a test to drive it out. This is one of the hardest parts of TDD, and where LLMs could really help.
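
For anyone unfamiliar, test-first in pytest terms looks something like the sketch below (text_utils.slugify is a hypothetical function): the tests are written before the implementation exists and act as the executable specification that drives it out.

    # A minimal test-first sketch (pytest; text_utils.slugify is hypothetical).
    # These tests are written before the implementation exists: they fail ("red")
    # until slugify is written to satisfy them, replacing "running the code in your head".
    def test_slugify_lowercases_and_joins_with_dashes():
        from text_utils import slugify
        assert slugify("Hello World") == "hello-world"

    def test_slugify_strips_punctuation():
        from text_utils import slugify
        assert slugify("Hello, World!") == "hello-world"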


If you’re stuck with dynamically typed languages, then tests like this can make a lot of sense.

On statically typed languages this happens for free at compile time.

I’ve often heard proponents of dynamically typed languages say how all the typing and boilerplate required by statically typed languages feels like such a waste of time, and on a small enough system maybe they are right.

But on any significant sized code bases, they pay dividends over and over by saving you from having to make tests like this.

They also allow trivial refactoring that people using dynamically typed languages wouldn’t even consider due to the risk being so high.

So keep this all in mind when you next choose your language for a new project.


> But on any significant sized code bases, they pay dividends over and over by saving you from having to make tests like this.

I firmly believe that the group of people who laud dynamically typed languages as efficient time-savers that help shed drudge work involving typing is tightly correlated with the group of people who fail to establish any form of quality assurance or testing, often using the same arguments to justify their motivation.


The question I find interesting is whether type systems are an efficient way to buy reliability relative to other ways to purchase reliability, such as writing tests, doing code review, or enforcing immutability.

Of course, some programmers just don't care about purchasing reliability. Those are the ones who eschew type systems, and tests, and produce unreliable software, about like you'd expect. But for my purposes, this is beside the point.


I find they are valuable. When you have a small program, say 10k lines of code, you don't really need them. However, when you are at more than 10 million lines of code, types find a lot of little errors that would be hard to write the correct test for.

Most dynamically typed languages (all that I have worked with) cannot catch that you misspelled a function name until that function is called. If that misspelled function is in an error path, it would be very easy to never test it until a customer hit the crash. Just having your function names as strong types checked by static analysis (it need not be a compiler, though that is what everything uses) is a big win. Checking the other arguments as well is similarly helpful.
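
To make the failure mode concrete, a small Python sketch (names are hypothetical): the misspelled call sits in an error path, so nothing blows up until that branch runs, whereas a static checker such as mypy reports the undefined name immediately.

    # A minimal sketch (hypothetical names): the typo lives in an error path, so the
    # program runs fine until that branch is hit; mypy flags the undefined name
    # without executing anything.
    def process(payload: dict) -> int:
        if "value" not in payload:
            raise ValueError("missing value")
        return payload["value"]

    def handle_request(payload: dict) -> int:
        try:
            return process(payload)
        except ValueError as exc:
            log_eror(str(exc))  # misspelling of a hypothetical log_error(); NameError at runtime
            return -1

    print(handle_request({"value": 3}))  # prints 3; the typo goes unnoticed until an error occurs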


Rubbish, in my experience. People who understand dynamic languages know they need to write tests because it's the only thing asserting correctness. I could just as easily say static people don't write tests because they think the type system is enough. A type system is laughably bad at asserting correct behaviour.

Personally I do use type hinting and mypy for much of my Python code. But I'll most certainly omit it for throwaway scripts and trivial stuff. I'm still not convinced it's really worth the effort, though. I've had a few occasions where the type checker has caught something important, but most of the time it's an autist trap where you spend ages making it correct "just because".


> Rubbish, in my experience. People who understand dynamic languages know they need to write tests because it's the only thing asserting correctness.

Tests don't assert correctness. At best they verify specific invariants.

Statically typed languages lean on the compiler to automatically verify some classes of invariants (i.e., can I call this method in this object?)

With dynamically typed languages, you cannot lean on the compiler to verify these invariants. Developers must fill in this void by writing their own tests.

It's true that they "need" to do it to avoid some classes of runtime errors that are only possible in dynamically typed languages. But that's not the point. The point is that those who complain that statically typed languages are too cumbersome because they require boilerplate for things like compile-time type checking are also correlated with the set of developers who fail to invest any time adding or maintaining automated test suites, for the same reasons.

> I could just as easily say static people don't write tests because they think the type system is enough. A type system is laughably bad at asserting correct behaviour.

No, you can't. Developers who use statically typed languages don't even think of type checking as a concern, let alone a quality assurance issue.


> Tests don't assert correctness. At best they verify specific invariants.

Pedantically correct, but in practice those are close enough to the same thing.

Even a formal proof cannot assert correctness - requirements are often wrong. However in practice requirements are close enough to correct that we can call a formal proof also close enough.


Dan Luu looked at the literature and concluded that the evidence for the benefit of types is underwhelming:

https://danluu.com/empirical-pl/

>But on any significant sized code bases, they pay dividends over and over by saving you from having to make tests like this.

OK, but if the alternative to tests is spending more time on a reliability method (type annotations) which buys you less reliability compared to writing tests... it's hardly a win.

It fundamentally seems to me that there are plenty of bugs that types can simply never catch. For example, if I have a "divide" function and I accidentally swap the numerator and divisor arguments, I can't think of any realistic type system which will help me. Other methods for achieving reliability, like writing tests or doing code review, don't seem to have the same limitations.


> swap the numerator and divisor

Even Rust can express this; you don't need to get fancy. Morally speaking, division takes a Num and a std::num::NonZero<Num>.
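
The same idea can be sketched in Python with mypy and NewType (names are illustrative, not from any library): give the divisor its own type so swapping the arguments becomes a static error instead of a silent logic bug.

    # A Python sketch of the same idea (mypy + NewType; names illustrative): the divisor
    # gets a distinct type, so swapping the arguments is caught statically.
    from typing import NewType

    NonZeroFloat = NewType("NonZeroFloat", float)

    def divide(numerator: float, divisor: NonZeroFloat) -> float:
        return numerator / divisor

    d = NonZeroFloat(4.0)
    print(divide(8.0, d))     # OK: 2.0
    # print(divide(d, 8.0))   # mypy error: argument 2 has type "float"; expected "NonZeroFloat"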


I think at least some people who say this think of Java-esque type systems. And there I agree: it is a boilerplate nightmare.

> On statically typed languages this happens for free at compile time.

If only that were true, I wouldn't have become nearly as good at tracking down segfaults as I have over the years...


Treating docstrings as the spec and asking an LLM to flag mismatches feels promising in theory, but personally I'd be wary of overfitting to underspecified docs. Might be useful as a lint-like signal, but it's hard to see it replacing real tests just yet.

If that is the only testing you do, I agree. However, testing that the code works as the docs say is valuable as well. The code will often do more, but it needs to do at least what the docs say.

Did the author do any analysis of the effectiveness of their tool on something beyond multiplication? Did they look to see if it caught any bugs in any codebases? What's the false positive rate? False negative?

As is, it's neat that they wrote some code to generate some prompts for an LLM, but there's no indication of whether it actually works.


> Did the author do any analysis of the effectiveness of their tool on something beyond multiplication? Did they look to see if it caught any bugs in any codebases? What's the false positive rate? False negative?

I would also add the concern on whether the tests are actually deterministic.

The premise is also dubious, as docstring comments typically hold only very high-level descriptions of the implementation and often aren't even maintained. Writing a specification of what a function is expected to do is what writing tests is all about, and with LLMs these are a terse prompt away.


Documentation should not be telling you how it is implemented. It should tell you how and why to use the function. Users who care about how it is implemented should be reading the code, not the comments. Users who need to find/use a helper and get on with their feature shouldn't.

> Beware of bugs in the above code; I have only proved it correct, not tried it.

-- Donald Knuth, Notes on the van Emde Boas construction of priority deques: An instructive use of recursion (1977)

https://www-cs-faculty.stanford.edu/~knuth/faq.html


> But here’s the catch: you’re missing some edge cases. What about negative inputs?

The docstring literally says it only works with positive integers, and the LLM is supposed to follow the docstring (per previous assertions).

> The problem is that traditional tests can only cover a narrow slice of your function’s behavior.

Property tests? Fuzzers? Symbolic execution?

> Just because a high percentage of tests pass doesn’t mean your code is bug-free.

Neither does this thing. If you want your code to be bug-free what you're looking for is a proof assistant not vibe-reviewing.

Also

> One of the reasons to use suite is its seamless integration with pytest.

Exposing a predicate is not "seamless integration with pytest", it's just exposing a predicate.


This is probably better thought of as AI-assisted code review rather than unit testing.

Although you can automate running this test...

1. You may not want to blow up your token budget.

2. You probably want to manually review/use the results.


I was a bit skeptical at first, but I think this is a good idea, although I'm not convinced by the usage of the max_depth parameter. In real life you rarely know what type your dependencies are if they are loaded at run time. This is kind of why we explicitly mock our dependencies.

On a side note: I have wondered whether LLMs are particularly good with functional languages. Imagine if your code consisted entirely of pure functions with no side effects: you pass in all required parameters and use no static methods/variables and no OOP concepts like inheritance. I imagine every program can be converted into such a form, the tradeoff being human readability.


Skepticism aside, I think this would have worked better as a linter rule. 100% coverage out of the box. Or opt-in with linter comments.

I wonder if the random component of the LLM makes every test flaky by definition.

It sounds like it might be a good use case for testing documentation - verifying whether what documentation describes is actually in accordance with the code, and then you can act on it. With that in mind, it's also probably pointless to re-run if relevant code or documentation hasn't changed.

This seems to be your site, @op: your CSS needs attention. On a narrower screen (i.e. portrait) the text is enormous, and worse, zooming out shrinks the quantity of words (increases the font size), which is surely the opposite of what's expected? It's basically unusable.

Your CSS seems to assume all portrait screens (whether 80" or 3") deserve the same treatment.


Interesting experiment. I like that you framed it as “tests that read the docs” rather than “AI will magically find bugs”, because the former is exactly where LLMs shine: cross‑checking natural‑language intent with code.

A couple of thoughts after playing with a similar idea in private repos:

Token pressure is the real ceiling. Even moderately sized modules explode past 32k tokens once you inline dependencies and long docstrings. Chunking by call‑graph depth helps, but at some point you need aggressive summarization or cropping, otherwise you burn GPU time on boilerplate.

False confidence is worse than no test. LLMs love to pass your suite when the code and docstring are both wrong in the same way. I mitigated this by flipping the prompt: ask the model to propose three subtle, realistic bugs first, then check the implementation for each. The adversarial stance lowered the “looks good to me” rate.

Structured outputs let you fuse with traditional tests. If the model says passed: false, emit a property‑based test via Hypothesis that tries to hit the reasoning path it complained about. That way a human can reproduce the failure locally without a model in the loop.

Security review angle. An LLM can spot obvious injection risks or unsafe eval calls even before SAST kicks in. Semantic tests that flag any use of exec, subprocess, or bare SQL are surprisingly helpful.

CI ergonomics. Running suite on pull requests only for files that changed keeps latency and costs sane. We cache model responses keyed by file hash so re‑runs are basically free.

Overall I would not drop my pytest corpus, but I would keep an async “semantic diff” bot around to yell when a quick refactor drifts away from the docstring. That feels like the sweet spot today.

P.S. If you want a local setup, Mistral‑7B‑Instruct via Ollama is plenty smart for doc/code mismatch checks and fits on a MacBook


This is cool! I think that, in general, generating test cases “offline” using an LLM and then running them using regular unit testing also solves this particular issue.

It also might be more transparent and cheaper.


Test driving a car by looking at it

I feel this makes some fundamental conceptual mistakes and is just riding the LLM wave.

"Semantics" is literally behavior under execution. This is syntactical analysis by a stochastic language model. I know the NLP literature uses "semantics" to talk about representations but that is an assertion which is contested [1].

Coming back to testing, this implicitly relies on the strong assumption of the LLM correctly associating the code (syntax) with assertions of properties under execution (semantic properties). This is a very risky assumption considering, once again, these things are stochastic in nature and cannot even guarantee syntactical correctness, let alone semantic. Being generous with the former, there is a track record of the latter often failing and producing subtle bugs [2][3][4][5]. Not to mention the observed effect of LLMs often being biased to "agree" with the premise presented to them.

It also kind of misses the point of testing, which is the engineering (not automation) task of reasoning about code and doing QC (even if said tests are later run automatically, I'm talking about their conception). I feel it's a dangerous, albeit tempting, decision to relegate that to an LLM. Fuzzing, sure. But not assertions about program behavior.

[1] A Primer in BERTology: What we know about how BERT works https://arxiv.org/abs/2002.12327 (Layers encode a mix of syntactic and semantic aspects of natural language, and it's problem-specific.)

[2] Large Language Models of Code Fail at Completing Code with Potential Bugs https://arxiv.org/abs/2306.03438

[3] SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? https://arxiv.org/abs/2502.12115 (best models unable to solve the majority of coding problems)

[4] Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT https://arxiv.org/abs/2304.10778

[5] Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of ChatGPT Answers to Stack Overflow Questions https://arxiv.org/abs/2308.02312v4

EDIT: Added references


Much better solution: don't write useless docstrings.

> Much better solution: don't write useless docstrings.

Actually writing the tests is far more effective, and doesn't require fancy frameworks tightly coupled with external services.


Importantly there's all sorts of tests beyond trivial single-value unit tests. Property testing (via hypothesis, in python) for instance.

Does this buy carbon offsets, too?

I don't think this is particularly terrible.

Broadly speaking, linters are good, and if you have a way of linting implementation errors it's probably helpful.

I would say it's probably more helpful while you're coding than at test/CI time, because it will be, indubitably, flaky.

However, for a local developer workflow I can see a reasonable value in being able to go:

Take every function in my code and scan it to figure out if you think it's implemented correctly, and let me know if you spot anything that looks weird / wrong / broken. Ideally only functions that I've touched in my branch.

So... you know. Cool idea. I think it's overselling how useful it is, but hey, smash your AI into every possible thing and eventually you'll find a few modestly interesting uses for it.

This is probably a modestly interesting use case.

> suite allows you to run the tests asynchronously, and since the main bottleneck is IO (all the computations happen in a GPU in the cloud) it means that you can run your tests very fast. This is a huge advantage in comparison to standard tests, which need to be run sequentially.

uh... that said, saying that it's fast to run your functions through an LLM compared to, you know, just running tests, is a little bit strange.

I'm certain your laptop will melt if you run 500 functions in parallel through ollama gemma-3.

Running it over a network is, obviously, similarly insane.

This would also be enormously time-consuming and expensive to use with a hosted LLM API.

The 'happy path' is probably having a plugin in your IDE that scans the files you touch and then runs this in the background when you make a commit, somehow using a local LLM capable enough to be useful (gemma3 would probably work).

Kind of like having your tests in 'watch mode'; you don't expect instant feedback, but some-time-after you've done something you get a popup saying 'oh hey, are you sure you meant to return a string here..?'

Maybe it would just be annoying. You'd have to build it out properly and see. /shrug

I think it's not implausible though, that you could see something vaguely like this that was generally useful.

Probably what you see in this specific implementation is only a precursor to something actually useful, though. Not really useful on its own, in its current form, imo.



