PaperExtractor: AI-Powered PDF Data Extraction

Overview

PaperExtractor is an AI-powered tool that extracts specific information from PDF documents. Upload a PDF, specify what information you need, and the application automatically finds it with source citations.

View on GitHub

Demo

[Image: PaperExtractor demo]

I also shared this project, and my thoughts on building it, in a LinkedIn post.

Learning Goals

  • Visual Derendering: Using Gemini 3 Pro’s visual derendering capability to reverse-engineer a PDF’s rendered layout back into structured code.
  • Two-Stage Processing: Converting PDFs to Markdown first (preserving structural hierarchy), then extracting requested fields with proof snippets.
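  • Overcoming PDF Limitations: Traditional PDF text extraction is often unreliable because PDFs are designed for human readers, not for programs: the format records where glyphs sit on the page rather than their logical reading order. The result is misordered text, broken tables, and lost formatting. The derendering approach addresses these problems by working from the rendered page image instead.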
  • Visual Interface: Building a three-panel display showing original PDF pages, OCR text, and extracted results.
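
The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: every type and function name here (`PageImage`, `FieldRequest`, `ModelCall`, `pagesToMarkdown`, `extractFields`) is hypothetical, the prompts are invented for the example, and the model call is passed in as a function so the sketch does not assume any particular SDK.

```typescript
// All names below are hypothetical; they are not taken from the PaperExtractor codebase.
type PageImage = { pageNumber: number; pngBase64: string };
type FieldRequest = { name: string; context?: string };
type ExtractedField = { name: string; value: string | null; proof: string | null };

// The model call is injected so the sketch does not depend on a specific SDK.
type ModelCall = (prompt: string, images: PageImage[]) => Promise<string>;

const DERENDER_PROMPT =
  "Transcribe this page into Markdown. Preserve headings, lists, and tables in reading order.";

// Join per-page Markdown with page markers so stage 2 can refer back to a page.
function joinPages(markdownPages: string[]): string {
  return markdownPages
    .map((markdown, index) => `<!-- page ${index + 1} -->\n${markdown.trim()}`)
    .join("\n\n");
}

// Build the stage 2 prompt: instructions, requested fields with optional context, then the document.
function buildExtractionPrompt(fields: FieldRequest[], markdown: string): string {
  const fieldLines = fields.map((field) =>
    field.context ? `- ${field.name}: ${field.context}` : `- ${field.name}`
  );
  return [
    "Extract the fields listed below from the document.",
    "Reply with a JSON array of objects with keys name, value, and proof.",
    "proof must be a verbatim quote from the document; use null when a field is absent.",
    "",
    "Fields:",
    ...fieldLines,
    "",
    "Document:",
    markdown,
  ].join("\n");
}

// Stage 1: one model call per page image, each returning that page as Markdown.
async function pagesToMarkdown(pages: PageImage[], callModel: ModelCall): Promise<string[]> {
  return Promise.all(pages.map((page) => callModel(DERENDER_PROMPT, [page])));
}

// Stage 2: a text-only call over the joined Markdown, parsed as JSON.
async function extractFields(
  markdownPages: string[],
  fields: FieldRequest[],
  callModel: ModelCall
): Promise<ExtractedField[]> {
  const prompt = buildExtractionPrompt(fields, joinPages(markdownPages));
  const reply = await callModel(prompt, []);
  return JSON.parse(reply) as ExtractedField[];
}
```

Keeping the stages separate means the Markdown from stage 1 can be shown to the user as-is and reused for several extraction requests without re-reading the page images.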

Tech Stack

  • Frontend: React + Tailwind CSS
  • Backend: Node.js + Express
  • AI API: Google Gemini 3 Pro for intelligent extraction
  • PDF Processing: pdf-to-img for document handling

Takeaways

  • Two-stage processing (PDF to Markdown, then extraction) improves accuracy and allows for better source citations.
  • JSON schema mode in Gemini makes structured data extraction much more reliable.
  • Providing field context instructions significantly improves extraction quality for domain-specific documents.
  • Building a visual interface that shows the extraction proof alongside results helps users trust and verify the output.
  • Markdown is widely recommended by model providers for use with Large Language Models, making it an ideal intermediate format.
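
As a rough illustration of the schema, field-context, and proof points above, the sketch below builds a JSON-Schema-style object from the requested fields and checks which Markdown page a returned proof snippet comes from. The names `FieldRequest`, `buildResponseSchema`, and `findProofPage` are hypothetical, and the schema shape (including the OpenAPI-style `nullable` flag) is an assumption; the post does not show the schema the project actually sends to Gemini.

```typescript
// Hypothetical names throughout; the post does not show the project's actual schema.
type FieldRequest = { name: string; context?: string };

// Build a JSON-Schema-style object with one entry per requested field.
// The per-field context goes into `description` so the model sees it next to the field.
function buildResponseSchema(fields: FieldRequest[]) {
  const properties: Record<string, object> = {};
  for (const field of fields) {
    properties[field.name] = {
      type: "object",
      description: field.context ?? `The value of "${field.name}" as stated in the document.`,
      properties: {
        value: { type: "string", nullable: true }, // assumed OpenAPI-style nullability
        proof: { type: "string", nullable: true },
      },
      required: ["value", "proof"],
    };
  }
  return { type: "object", properties, required: fields.map((field) => field.name) };
}

// Collapse whitespace and case so line wrapping in the Markdown does not defeat the match.
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

// Return the 1-based number of the first Markdown page containing the proof, or null.
function findProofPage(proof: string | null, markdownPages: string[]): number | null {
  if (proof === null) return null;
  const needle = normalize(proof);
  if (needle === "") return null;
  const index = markdownPages.findIndex((page) => normalize(page).includes(needle));
  return index === -1 ? null : index + 1;
}
```

A result whose proof cannot be located on any page can be flagged in the interface instead of being shown as verified, which is one way to support the trust-and-verify goal.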

This post is licensed under CC BY 4.0 by the author.