No public code. This has been a long-running project for me. When I last touched it, in the pre-LLM world, it had turned into a real Rube Goldberg machine. Hard to imagine anyone else putting up with it.
The pipeline is PDF to text (using either a Python or a Java library), which is then turned into a "header" structure, with dates and balances extracted via configuration-driven regexes, and a "body" structure containing the transactions. The transactions themselves go through an EBNF parser to extract the date(s), narration, amount, and balance if reported. The narration text gets run against a custom merchant database for payee and categorization. It is a painful problem! The code is Clojure, so there is not much of it, and there are high-level libraries like Instaparse that make it easy to use grammars as primitives. And the Rube Goldberg machine has yielded balance-validated data for the last several years from half a dozen financial providers.
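To give a feel for the shape of that pipeline, here is a minimal Python sketch. It is not the Clojure code described above: the statement layout, the field names, the merchant table, and every function name are made up for illustration, and a plain regex stands in for the EBNF grammar.

```python
import re
from decimal import Decimal

# Hypothetical statement text, as it might look after PDF-to-text extraction.
STATEMENT = """\
Statement period: 2024-01-01 to 2024-01-31
Opening balance: 1,000.00
Closing balance: 1,170.50
2024-01-03 COFFEE SHOP 123 -4.50 995.50
2024-01-15 PAYROLL ACME 200.00 1,195.50
2024-01-20 GROCERY MART -25.00 1,170.50
"""

# "Configuration": one regex per header field (illustrative patterns only).
HEADER_PATTERNS = {
    "opening": re.compile(r"^Opening balance:\s*([\d,]+\.\d{2})$", re.M),
    "closing": re.compile(r"^Closing balance:\s*([\d,]+\.\d{2})$", re.M),
}

# Regex stand-in for a transaction grammar: date, narration, amount,
# and an optional running balance at the end of the line.
TXN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})[ \t]+"
    r"(?P<narration>.+?)[ \t]+"
    r"(?P<amount>-?[\d,]+\.\d{2})"
    r"(?:[ \t]+(?P<balance>-?[\d,]+\.\d{2}))?"
)

# Hypothetical merchant table: narration substring -> (payee, category).
MERCHANTS = {
    "COFFEE SHOP": ("Coffee Shop", "Dining"),
    "PAYROLL": ("Acme", "Income"),
}


def money(text):
    """Parse '1,170.50' into an exact Decimal."""
    return Decimal(text.replace(",", ""))


def lookup_merchant(narration):
    """Return (payee, category) for the first matching substring."""
    for needle, result in MERCHANTS.items():
        if needle in narration:
            return result
    return (None, "Uncategorized")


def parse_statement(text):
    """Split statement text into a header dict and a list of transactions."""
    header = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = pattern.search(text)
        if match:
            header[name] = money(match.group(1))
    txns = []
    for line in text.splitlines():
        match = TXN.fullmatch(line)
        if not match:
            continue  # header lines and noise are skipped
        payee, category = lookup_merchant(match["narration"])
        txns.append({
            "date": match["date"],
            "narration": match["narration"],
            "amount": money(match["amount"]),
            "balance": money(match["balance"]) if match["balance"] else None,
            "payee": payee,
            "category": category,
        })
    return header, txns


def balances_check(header, txns):
    """Replay the amounts from the opening balance and compare against
    every reported running balance and the closing balance."""
    running = header["opening"]
    for txn in txns:
        running += txn["amount"]
        if txn["balance"] is not None and txn["balance"] != running:
            return False
    return running == header["closing"]
```

Calling `parse_statement(STATEMENT)` and passing the result to `balances_check` is the whole flow. The balance replay is what makes the output trustworthy: a misparsed amount almost always breaks the running total, so extraction errors surface immediately instead of hiding in the data.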
I have been incorporating local LLMs, running on an RTX 3090, into some of my other workflows, and I hope over the summer to see whether they can help simplify parts of this one.
Or at least the tool(s) you use?
I have the same need, but it's surprisingly difficult to get right, at least with the `camelot` or `fitz` Python packages.