PaperExtractor: AI-Powered PDF Data Extraction

Posted Jan 18, 2025 Updated Jan 23, 2026

By Sinan Koparan

Overview

PaperExtractor is an AI-powered tool that extracts specific information from PDF documents. Upload a PDF, specify what information you need, and the application automatically finds it with source citations.

View on GitHub

Demo

I also shared this project, and my thoughts on building it, in a LinkedIn post.

Learning Goals

Visual Derendering: Using Gemini 3 Pro’s visual derendering capability to reverse-engineer the visual layout of a PDF page back into structured code.
Two-Stage Processing: Converting PDFs to Markdown first (preserving structural hierarchy), then extracting requested fields with proof snippets.
Overcoming PDF Limitations: Traditional PDF extraction is unreliable because PDFs are designed for human reading, not machine parsing: text comes out misordered, tables break, and formatting is lost. The derendering approach addresses these problems by reading the page visually, the way a person would.
Visual Interface: Building a three-panel display showing original PDF pages, OCR text, and extracted results.
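
The two-stage flow and the proof snippets can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not the project's actual code: the `Stages` interface and the `findProof`, `verifyResults`, and `runPipeline` names are hypothetical, and the model calls are passed in as plain functions so the sketch runs without any API client. It also assumes the model returns JSON that maps each field name to a `value` and a `proof` string.

```typescript
// Hypothetical stage functions; in a real app each would wrap a model call.
interface Stages {
  // Stage 1: rendered page images in, Markdown out.
  toMarkdown: (pageImages: Uint8Array[]) => Promise<string>;
  // Stage 2: Markdown and requested field names in, JSON text out.
  extract: (markdown: string, fields: string[]) => Promise<string>;
}

interface FieldResult {
  value: string;
  proof: string;     // snippet the model claims supports the value
  verified: boolean; // true if the snippet really occurs in the Markdown
}

// Collapse whitespace and lowercase so line wrapping does not break matching.
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

// A proof counts only if it is non-empty and appears in the Markdown.
function findProof(markdown: string, proof: string): boolean {
  const needle = normalize(proof);
  return needle.length > 0 && normalize(markdown).includes(needle);
}

// Parse the stage 2 reply and flag each proof as verified or not.
function verifyResults(
  markdown: string,
  modelJson: string,
  fields: string[],
): Record<string, FieldResult> {
  const parsed = JSON.parse(modelJson) as Record<
    string,
    { value: string; proof: string } | undefined
  >;
  const results: Record<string, FieldResult> = {};
  for (const field of fields) {
    const hit = parsed[field];
    if (!hit) continue; // the model did not return this field
    results[field] = {
      value: hit.value,
      proof: hit.proof,
      verified: findProof(markdown, hit.proof),
    };
  }
  return results;
}

// Run both stages in order: Markdown first, then field extraction.
async function runPipeline(
  pageImages: Uint8Array[],
  fields: string[],
  stages: Stages,
): Promise<{ markdown: string; results: Record<string, FieldResult> }> {
  const markdown = await stages.toMarkdown(pageImages);
  const modelJson = await stages.extract(markdown, fields);
  return { markdown, results: verifyResults(markdown, modelJson, fields) };
}
```

Checking each proof against the stage 1 Markdown is one cheap way to catch snippets the model paraphrased or invented; an unverified result can still be shown to the user, just marked as such.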

Tech Stack

Frontend: React + Tailwind CSS
Backend: Node.js + Express
AI API: Google Gemini 3 Pro for intelligent extraction
PDF Processing: pdf-to-img for rendering PDF pages to images

Takeaways

Two-stage processing (PDF to Markdown, then extraction) improves accuracy and allows for better source citations.
JSON schema mode in Gemini makes structured data extraction much more reliable.
Providing field context instructions significantly improves extraction quality for domain-specific documents.
Building a visual interface that shows the extraction proof alongside results helps users trust and verify the output.
Markdown is widely recommended by model providers as an input format for Large Language Models, which makes it a good intermediate format.
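
To make the schema and field-context points concrete, here is a small TypeScript sketch that turns a list of requested fields into a JSON-Schema-style object, with each field's context carried in a `description`. The `FieldSpec` type and `buildExtractionSchema` function are hypothetical names, and the schema shape is an assumption: the post does not say which schema dialect was passed to Gemini, so this may need adapting before it is sent to a real API.

```typescript
// A requested field plus optional context telling the model how to read it.
interface FieldSpec {
  name: string;
  context?: string; // domain-specific hint, e.g. units or where to look
}

// Minimal JSON-Schema-like types, just enough for this sketch.
interface StringSchema {
  type: "string";
  description?: string;
}
interface ObjectSchema {
  type: "object";
  properties: Record<string, StringSchema | ObjectSchema>;
  required: string[];
}

// Each field becomes an object with a value and a supporting proof snippet.
function buildExtractionSchema(fields: FieldSpec[]): ObjectSchema {
  const properties: Record<string, ObjectSchema> = {};
  for (const field of fields) {
    properties[field.name] = {
      type: "object",
      properties: {
        value: {
          type: "string",
          // Fall back to a generic description when no context is given.
          description: field.context ?? `Value of "${field.name}"`,
        },
        proof: {
          type: "string",
          description: "Verbatim snippet from the Markdown that supports the value",
        },
      },
      required: ["value", "proof"],
    };
  }
  return { type: "object", properties, required: fields.map((f) => f.name) };
}

// Example: one field with domain context, one without.
const exampleSchema = buildExtractionSchema([
  { name: "dosage", context: "Daily dose in mg, as stated in the methods section" },
  { name: "sample_size" },
]);
console.log(JSON.stringify(exampleSchema, null, 2));
```

Keeping `proof` as a required property next to every `value` means a structured-output mode would have to return a citation for each field, which is what lets the interface show evidence beside each result.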

This post is licensed under CC BY 4.0 by the author.