
This is such a bad idea. Skip the first section and read the "false positives" section.



Aren't false positives acceptable in this situation? I'm assuming a human (paper author, journal editor, peer reviewer, etc.) is reviewing the errors these tools are identifying. If there is a 10% false positive rate, then the only cost is the wasted time of whoever needs to identify that it's a false positive.

I guess this is a bad idea if these tools replace peer reviewers altogether, and papers get published if they can get past the error checker. But I haven't seen that proposed.


> I'm assuming a human (paper author, journal editor, peer reviewer, etc) is reviewing the errors these tools are identifying.

This made me laugh so hard that I was almost crying.

For a specific journal, editor, or reviewer, maybe. For most journals, editors, or reviewers… I would bet money against it.


You'd win that bet. Most journal reviewers don't do more than check that data exists as part of the peer review process—the equivalent of typing `ls` and looking at the directory metadata. They pretty much never run their own analyses to double check the paper. When I say "pretty much never", I mean that when I interviewed reviewers and asked them if they had ever done it, none of them said yes, and when I interviewed journal editors—from significant journals—only one of them said their policy was to even ask reviewers to do it, and that it was still optional. He said he couldn't remember if anyone had ever claimed to do it during his tenure. So yeah, if you get good odds on it, take that bet!


That screams "moral hazard"[1] to me. See also the incident with curl and AI confabulated bug reports[2].

[1]: Maybe not in the strict original sense of the phrase. More like, an incentive to misbehave and cause downstream harm to others. [2]: https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...


Let me tell you about this thing called Turnitin and how it was a purely advisory screening tool…


Note that the section with that heading also discusses several other negative features.

The only false positive rate mentioned in the article is closer to 30%, and the true positives in that sample were mostly trivial mistakes (as in, having no effect on the validity of the message). That sample also consisted of preprints that had not been peer reviewed, so one would expect the false positive rate to be much worse after peer review (the true positives would decrease while the false positives remain).

And every indication both from the rhetoric of the people developing this and from recent history is that it would almost never be applied in good faith, and instead would empower ideologically motivated bad actors to claim that facts they disapprove of are inadequately supported, or that people they disapprove of should be punished. That kind of user does not care if the "errors" are false positives or trivial.

Other comments have made good points about some of the other downsides.


People keep offering this hypothetical 10% acceptable false positive rate, but the article says it’s more like 35%. Imagine if your workplace implemented AI and it created 35% more unfruitful work for you. It might not seem like an “unqualified good” as it’s been referred to elsewhere.


It depends on whether you do stuff that matters or not. If your job is meaningless, then detecting errors with a 35% false positive rate would just be extra work. On the other hand, if the quality of your output matters, 35% seems like an incredibly small price to pay if it also detects real issues.


Lots to unpack here but I'll just say that I think it would probably matter to a lot of people if they were forced to use something that increased their pointless work by 35%, regardless of whether their work mattered to you or not.


> is reviewing the errors these tools are identifying.

Unfortunately, no one has the incentives or the resources to do that kind of doubly or triply thorough fine-tooth combing: no reviewer or editor is getting paid, and tenure-track researchers who need the service-to-the-discipline checkmark in their tenure portfolios also need to churn out research…


I can see its usefulness as a screening tool, though I can also see downsides similar to what maintainers face with AI vulnerability reporting. It's an imperfect tool attempting to tackle a difficult and important problem. I suppose its value will be determined by how well it's used and how well it evolves.


Being able to have a machine double-check your work for problems that you then fix or dismiss as false seems great? If the bad part is "AI knows best", I agree with that! Properly deployed, this would be another tool alongside peer review that helps the scientific community judge the value of new work.


I don't see this as a worse idea than an AI code reviewer. If it spits out irrelevant advice and only gets 1 out of 10 points right, I still consider it a win, since the cost is so low and many humans can't catch subtle issues in code.


> since the cost is so low

As someone who has had to deal with the output of absolutely stupid "AI code reviewers", I can safely say that the cost of being flooded with useless advice is real, and I will simply ignore them unless I want a reminder that my job will not be automated away by anyone who cares about real quality. I don't care if it's right 1 in 10 times; the other 9 times are more than enough to be of negative value.

Ditto for those flooding GitHub with LLM-generated "fix" PRs.

> and many humans can't catch subtle issues in code.

That itself is a problem, but pushing the responsibility onto an unaccountable AI is not a solution. The humans are going to get even worse that way.


You’re missing the bit where humans can be held responsible and improve over time with specific feedback.

AI models only improve through training and good luck convincing any given LLM provider to improve their models for your specific use case unless you have deep pockets…


And people's willingness to outsource their judgement to a computer. If a computer says it, for some people, it's the end of the matter.


There's also a ton of false positives with spellcheck on scientific papers, but it's obviously a useful tool. Humans review the results.


Just consider it an additional mean reviewer who is most likely wrong. There is still value in debunking their false claims.


Deploying this on already published work is probably a bad idea. But what is wrong with working with such tools on submission and review?



