New ask Hacker News story: Seeking Advice on Improving OCR for Watermarked PDFs in My RAG Pipeline
2 by hundredtrillion | 0 comments on Hacker News.
I’ve been developing a small RAG pipeline and ran into a specific technical issue involving OCR. I’m using PyMuPDF for extraction, and whenever a PDF contains a centered watermark on each page, the OCR becomes noisy: text breaks, artifacts show up, and the output degrades enough that it affects chunking and retrieval accuracy downstream. The document is otherwise clean, so I’m trying to understand whether this is a known limitation of PyMuPDF or if there are better approaches for handling watermarked PDFs before OCR.

I’m working with an RTX 4000 (8 GB VRAM), so I’m also trying to stay within reasonable GPU constraints.

I’d really appreciate any ideas on:

- more robust OCR libraries or models that handle watermarks well
- preprocessing strategies to suppress watermark text
- better extraction pipelines for RAG use cases
- or any general advice on improving this part of the system

The project is open-source, and if anyone is interested in digging deeper, finding issues, or contributing improvements, here’s the repository:

GitHub: https://ift.tt/QnF4638

If you find it useful, starring the repo helps increase visibility so more people with domain expertise might notice it.

Thanks in advance for any insights.
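To make the preprocessing question concrete, here is a minimal sketch of one possible approach. It is not a tested fix and not taken from the repository. It assumes the watermark is rendered in a lighter gray than the body text, and that each page has already been rasterized to 8-bit grayscale bytes (with PyMuPDF this could be something like `page.get_pixmap(colorspace=fitz.csGRAY).samples`, though I have not verified that path end to end). The function name and the default threshold of 160 are hypothetical and would need tuning per document.

```python
def suppress_light_pixels(gray: bytes, threshold: int = 160) -> bytes:
    """Push every pixel brighter than `threshold` to pure white (255).

    Assumption: body text is dark (low values) and the watermark is a
    lighter gray, so a single global threshold can separate the two.
    A watermark as dark as the text would not be removed by this.
    """
    if not 0 <= threshold <= 255:
        raise ValueError("threshold must be in 0..255")
    return bytes(255 if p > threshold else p for p in gray)


# Toy 1x6 "page": dark text (20, 40), light watermark (190, 200),
# and white background (255).
row = bytes([20, 190, 40, 200, 255, 255])
cleaned = suppress_light_pixels(row)
```

The cleaned bytes would then be handed to whatever OCR engine is in use. A global threshold is the simplest option; if the watermark overlaps text or is close to it in darkness, this will either leave residue or erase thin strokes, which is part of why I’m asking for better strategies.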