
Question for ya if you don't mind: I had to do some PDF scraping a while back as part of a side project collecting alternative social/economic data sources.

Even within a single site, there were often errors at the fringes, especially when things like layout or styling changed. My concern about giving bad data to users (or about constantly having to check data quality and adjust custom parameters for each target site) held me back from ever feeling confident enough to convert it into a paid product.

I don't mean for you to give up your secret sauce here, but I'm wondering if you ran into this same issue, and what your approach was from a business/customer-expectations perspective?




Oh yes, I ran into this issue many, many times. The way I dealt with it is a bit insane: I classify bank statements using images or text on the first page, then run custom code for that document type.

I also have a "pretty good" fallback algorithm if the statement cannot be classified.
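For illustration, here's a minimal sketch of that classify-then-dispatch shape in Python, assuming pdfplumber for text extraction. The bank markers and parser bodies are hypothetical placeholders, not the commenter's actual implementation:

    import pdfplumber

    def parse_rbc(text):
        # Hypothetical per-bank parser, tuned to one statement template.
        return {"bank": "RBC", "transactions": []}

    def parse_generic(text):
        # "Pretty good" fallback for statements that can't be classified.
        return {"bank": "unknown", "transactions": []}

    # First-page markers identify the document type; one entry per template.
    CLASSIFIERS = {
        "Royal Bank of Canada": parse_rbc,
    }

    def parse_statement(path):
        with pdfplumber.open(path) as pdf:
            first_page = pdf.pages[0].extract_text() or ""
            full_text = "\n".join(p.extract_text() or "" for p in pdf.pages)
        for marker, parser in CLASSIFIERS.items():
            if marker in first_page:
                return parser(full_text)  # custom code for this document type
        return parse_generic(full_text)   # fallback algorithm

The appeal of this shape is that each new template costs only one parser plus a marker, and the fallback keeps unclassified statements from failing outright.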


Usually banks have a template, so the edge cases aren't so edgy. I had to do this with Canadian banks, and each one had its own template, but once you parsed it, it generally worked until they updated their template again.


True, Canadian banks are quite nice to work with. US, Indian and South African banks are hell!



