Hacker News

The primary challenge is not just harnessing AI for search; it's preparing complex documents (varied formats, structures, designs, scans, multi-layout tables, and even poorly captured images) for LLM consumption. This is the crucial issue.

There's a 20-minute read on why parsing PDFs is hell: https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

To parse PDFs for RAG applications, you'll need tools like LLMwhisperer[1] or unstructured.io[2].

Now back to your problem:

This solution might be overkill for your requirement, but you can try the following:

To set things up quickly, try Unstract[3], an open-source document processing tool. You can bring your own LLM models, and it also supports local models. It has a GUI for writing prompts to extract insights from your documents.[4]

[1] https://unstract.com/llmwhisperer/ [2] https://unstructured.io/ [3] https://github.com/Zipstack/unstract [4] https://github.com/Zipstack/unstract/blob/main/docs/assets/p...
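Whatever parser you pick, the output still has to be chunked before it goes into a vector store for RAG. A minimal sketch of that step, assuming you already have plain text out of one of the tools above (chunk sizes here are illustrative, not recommendations):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split extracted text into overlapping character-window chunks.

    Overlap keeps sentences that straddle a boundary retrievable
    from at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks

# Placeholder input standing in for parser output:
extracted = "page text from your PDF parser... " * 50
chunks = chunk_text(extracted)
```

Real pipelines usually chunk on semantic boundaries (headings, paragraphs, table cells) rather than raw character windows, which is exactly where layout-aware parsers earn their keep.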




Apache Tika could help extract the relevant bits of PDFs, couldn't it?

https://tika.apache.org/
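A quick way to try Tika without standing up a server is the standalone tika-app jar, whose `--text` mode prints extracted plain text to stdout. A minimal Python wrapper might look like this (the jar path and PDF name are placeholders, and running it requires Java plus the jar on disk):

```python
import subprocess

def tika_command(pdf_path: str, tika_jar: str = "tika-app.jar") -> list[str]:
    """Build the command line for Tika's standalone app jar.

    --text asks Tika to emit plain extracted text to stdout.
    """
    return ["java", "-jar", tika_jar, "--text", pdf_path]

def extract_text(pdf_path: str) -> str:
    # Requires a Java runtime and the tika-app jar.
    result = subprocess.run(
        tika_command(pdf_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

There is also a tika-python package that talks to a Tika server over HTTP, if you'd rather not shell out.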


Modern LLMs are good enough at treating PDFs as images and grokking the content.

Well, Claude and GPT-4 seem to be.
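For Claude, this means sending the PDF itself as a base64 `document` content block in a Messages API request. A sketch of building that request body (the model name is a placeholder, and the exact block shape should be checked against Anthropic's current docs):

```python
import base64

def pdf_message(pdf_bytes: bytes, question: str) -> dict:
    """Build a Messages-API style request body pairing a PDF with a question.

    The model reads the PDF pages visually, so layout, tables, and
    scans survive better than with naive text extraction.
    """
    return {
        "model": "claude-3-5-sonnet-latest",  # placeholder model name
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": base64.b64encode(pdf_bytes).decode("ascii"),
                    },
                },
                {"type": "text", "text": question},
            ],
        }],
    }

# Placeholder bytes standing in for a real PDF file:
body = pdf_message(b"%PDF-1.4 ...", "Summarize the tables in this document.")
```

The trade-off versus a parsing pipeline is cost and context limits: every page consumes image-scale tokens, so this works for a handful of documents but not for indexing thousands.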



