Are there good benchmarks for this type of tool? It seems not? Also, I'd compare...

caseyy · 2025-02-16T04:33:36 1739680416

The best practical benchmark I found is asking LLMs to research or speak on my field of expertise.

ibeff · 2025-02-16T09:21:34 1739697694

That's what I did. It came up with smart-sounding but infeasible recommendations because it took every source it found online at face value, without considering who authored it or why. It also lacked a massive amount of the background knowledge needed to evaluate the claims made in those sources. It took outlandish, utopian demands by some activists in my field and sold them to me as things that might plausibly be implemented in the near future.

Real research needs several more levels of contextual knowledge than the model currently brings to any prompt. There is so much background information that people working in my field know. The model would first have to spend a ton of time taking in everything there is to know about the field and several related fields, and then correlate the sources it found for the specific prompt with all of that.

At the current stage, this is not deep research but research that is remarkably shallow.

rchaud · 2025-02-16T17:00:21 1739725221

> It took outlandish, utopian demands by some activists in my field and sold them to me as things that might plausibly be implemented in the near future.

Reminds me of when Altman went to TSMC and bloviated about chip fabs to subject matter experts: https://www.tomshardware.com/tech-industry/tsmc-execs-allege...

SubiculumCode · 2025-02-16T05:23:01 1739683381

Yeah...and it didn't cite me :)

caseyy · 2025-02-16T05:40:46 1739684446

Yeah, that's a data point as well. I found a model that was good with citations by asking it to recall the topics I've published articles on.

d4rkp4ttern · 2025-02-16T12:42:19 1739709739

I’ve seen at least one deep-research replicator claiming to be the “best open deep research” tool on the GAIA benchmark (https://huggingface.co/papers/2311.12983). It's not a perfect benchmark, but it's the closest I’ve seen.