Introducing AgenticPDF: a PDF library built for AI agents
PDFs are everywhere an AI agent needs to look — contracts, papers, invoices, manuals — and almost none of them were built to be read by a machine. A single document can run to hundreds of pages, the text arrives as a soup of positioned glyphs rather than sentences, and most libraries insist on loading the whole thing into memory before they will tell you anything. For an agent paying by the token and the millisecond, that is a hostile format.
AgenticPDF is our answer: an agentic-native PDF processing and rendering library, written as a single zero-dependency TypeScript file, designed from the first line for the way AI agents actually consume documents.
Streaming first, not memory first
The core idea is simple: an agent should be able to start working on page one without waiting for page five hundred. AgenticPDF processes documents as streams. streamText() yields text page by page, streamSemanticChunks() yields RAG-ready chunks as it finds them, and every long-running call accepts an AbortSignal so an agent can stop the moment it has what it needs. Memory limits are configurable, loading is lazy, and cleanup is automatic.
import AgenticPDF from 'agenticpdf';
const pdf = await AgenticPDF.fromFile(file);
for await (const content of pdf.streamText({ normalizeWhitespace: true })) {
  console.log(`Page ${content.pageNumber}: ${content.text}`);
}
pdf.close();
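The stop-early pattern can be sketched without the library. In this minimal example, fakePages is a hypothetical stand-in for a page-by-page text stream and findFirstPage is an illustrative helper; neither is part of AgenticPDF, and how the library itself accepts an AbortSignal is not shown here.

```typescript
// Hypothetical stand-in for a page-by-page text stream; not AgenticPDF's API.
interface PageText {
  pageNumber: number;
  text: string;
}

async function* fakePages(
  pages: string[],
  signal?: AbortSignal,
): AsyncGenerator<PageText> {
  for (const [index, text] of pages.entries()) {
    // Stop producing pages once the consumer has aborted.
    if (signal?.aborted) return;
    yield { pageNumber: index + 1, text };
  }
}

// Read pages until one contains `needle`, then stop the stream.
async function findFirstPage(
  pages: string[],
  needle: string,
): Promise<number | null> {
  const controller = new AbortController();
  let found: number | null = null;
  for await (const page of fakePages(pages, controller.signal)) {
    if (page.text.includes(needle)) {
      found = page.pageNumber;
      controller.abort(); // tells any upstream work to stop
      break; // leaving the loop also calls the generator's return()
    }
  }
  return found;
}
```

The point of the sketch is that the consumer decides when to stop: pages after the first match are never produced.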
One call to ingest a document
Most agent pipelines want the same four things from a PDF: what it is, how it is structured, the chunks to embed, and some stats. So AgenticPDF gives you all of it in a single call. ingest() returns metadata, structural analysis, semantic chunks, and processing stats together; streamIngest() does the same as NDJSON — a header record, a stream of chunk records, then a footer — for large documents and data pipelines.
const result = await pdf.ingest({ maxChunkSize: 1000 });
result.documentType; // "AcademicPaper"
result.summary; // extractive summary, no external service
result.stats.totalChunks; // semantic chunk count
for (const chunk of result.chunks) {
  await vectorStore.add(chunk.content, {
    pages: chunk.pages,
    importance: chunk.importance,
    keywords: chunk.keywords,
  });
}
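For the streaming variant, a consumer has to tell header, chunk, and footer records apart as lines arrive. The sketch below parses NDJSON text into those three groups. The record fields (type, documentType, content, pages, totalChunks) are assumptions made for illustration, since the record schema is not given here, and parseIngestNdjson is a hypothetical helper.

```typescript
// Assumed record shapes: the post names header, chunk, and footer records
// but not their fields, so every field below is hypothetical.
type IngestRecord =
  | { type: "header"; documentType: string }
  | { type: "chunk"; content: string; pages: number[] }
  | { type: "footer"; totalChunks: number };

type HeaderRecord = Extract<IngestRecord, { type: "header" }>;
type ChunkRecord = Extract<IngestRecord, { type: "chunk" }>;
type FooterRecord = Extract<IngestRecord, { type: "footer" }>;

interface ParsedIngest {
  header?: HeaderRecord;
  chunks: ChunkRecord[];
  footer?: FooterRecord;
}

// Parse NDJSON text: one JSON object per non-empty line.
function parseIngestNdjson(ndjson: string): ParsedIngest {
  const parsed: ParsedIngest = { chunks: [] };
  for (const line of ndjson.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate blank lines and a trailing newline
    const record = JSON.parse(trimmed) as IngestRecord;
    if (record.type === "header") parsed.header = record;
    else if (record.type === "chunk") parsed.chunks.push(record);
    else parsed.footer = record;
  }
  return parsed;
}
```

In a real pipeline the same per-line dispatch would run on each line as it is read, so chunks can be embedded before the footer arrives.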
The chunking is semantic, not a blind character split: it respects paragraphs, detects sections, tables, and figures, and carries page provenance and keywords on every chunk so your retrieval layer can cite and rank.
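To make "respects paragraphs" and "page provenance" concrete, here is a deliberately simplified sketch of paragraph-aware chunking. It is not the library's algorithm: it does no section, table, or figure detection, and the names chunkByParagraph, PageInput, and SimpleChunk are hypothetical.

```typescript
interface PageInput {
  pageNumber: number;
  text: string;
}

interface SimpleChunk {
  content: string;
  pages: number[]; // pages that contributed text to this chunk
}

// Pack whole paragraphs into chunks of at most maxChunkSize characters.
// A paragraph is never split, so one longer than the limit becomes
// its own oversized chunk.
function chunkByParagraph(pages: PageInput[], maxChunkSize: number): SimpleChunk[] {
  const chunks: SimpleChunk[] = [];
  let content = "";
  let pageSet = new Set<number>();

  const flush = (): void => {
    if (content === "") return;
    chunks.push({ content, pages: Array.from(pageSet) });
    content = "";
    pageSet = new Set<number>();
  };

  for (const page of pages) {
    // Treat blank lines as paragraph boundaries.
    for (const raw of page.text.split(/\n\s*\n/)) {
      const paragraph = raw.trim();
      if (paragraph === "") continue;
      // Start a new chunk if this paragraph would push past the limit.
      if (content !== "" && content.length + 2 + paragraph.length > maxChunkSize) {
        flush();
      }
      content = content === "" ? paragraph : content + "\n\n" + paragraph;
      pageSet.add(page.pageNumber);
    }
  }
  flush();
  return chunks;
}
```

Because each chunk records the pages it drew from, a retrieval layer can cite a page range even when a chunk spans a page break.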
It describes itself to your agent
An agent should not need a human to read the docs and hand-write tool definitions. AgenticPDF is introspectable. It can hand an agent its own ontology, its capability map, and ready-made function-calling schemas for whichever platform you are on.
// Full introspection payload: ontology + tools + schemas + guidance
const info = AgenticPDF.describeForAgent('openai');
// Platform-specific tool schemas
const openaiTools = AgenticPDF.getToolSchemas('openai');
const anthropicTools = AgenticPDF.getToolSchemas('anthropic');
// An MCP server manifest for MCP-compatible agents
const manifest = AgenticPDF.getMCPManifest();
Out of the box that is 34 tool definitions, 43 JSON schemas, and 16 pre-built workflow templates for common jobs — plus a JSON-LD ontology from describe() that a code-generating agent can read instead of prose.
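Platform-specific schemas matter because the two major function-calling formats wrap the same JSON Schema differently. The sketch below shows that difference with a single made-up tool: extract_text and its parameters are hypothetical and are not taken from AgenticPDF's actual tool list, and the two converter functions are illustrative rather than the library's implementation.

```typescript
// A platform-neutral tool definition; `parameters` is a JSON Schema object.
interface ToolDef {
  name: string;
  description: string;
  parameters: Record<string, unknown>;
}

// OpenAI Chat Completions tools nest the definition under `function`.
function toOpenAITool(tool: ToolDef) {
  return {
    type: "function" as const,
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.parameters,
    },
  };
}

// Anthropic Messages API tools are flat and call the schema `input_schema`.
function toAnthropicTool(tool: ToolDef) {
  return {
    name: tool.name,
    description: tool.description,
    input_schema: tool.parameters,
  };
}

// Hypothetical tool, for illustration only.
const extractTextTool: ToolDef = {
  name: "extract_text",
  description: "Extract text from a page range of a PDF.",
  parameters: {
    type: "object",
    properties: {
      firstPage: { type: "integer", minimum: 1 },
      lastPage: { type: "integer", minimum: 1 },
    },
    required: ["firstPage"],
  },
};
```

Generating both shapes from one definition keeps the tool's name, description, and parameter schema identical across platforms.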
A complete library, not just a reader
Agentic features sit on top of a full PDF toolkit, so you are never forced out to a second dependency:
- Extraction — text with formatting and tables, images, forms, annotations, and metadata.
- Rendering — full PDF-to-canvas with text, images, vector graphics, and form XObjects, GPU-accelerated through WebGL2, plus dark/light theming for viewer UIs.
- Writing — incremental append-only saves that preserve signatures, page insert/delete/reorder, annotation persistence, and PDF/A validation.
- Export — text, HTML, Markdown, JSON, and the aPDF binary format.
- Layout — PretextLayout, a zero-dependency multiline text engine with grapheme-aware, CJK-correct line breaking via Intl.Segmenter and an LRU measurement cache.
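As a rough illustration of what grapheme-aware breaking with Intl.Segmenter involves, the sketch below measures text in graphemes and wraps greedily at word-segment boundaries. It is not PretextLayout: graphemeLength and wrapText are hypothetical helpers, width is counted in graphemes rather than measured glyphs, there is no measurement cache, and word segments only approximate true line-break opportunities (punctuation can start a line).

```typescript
// Grapheme segmenter: counts user-perceived characters, not UTF-16 code units.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

function graphemeLength(text: string): number {
  return Array.from(graphemeSegmenter.segment(text)).length;
}

// Greedy wrap at word-segment boundaries. For locales such as "ja" or "zh"
// the word segmenter finds boundaries without relying on spaces.
// A single segment wider than maxWidth is left on its own overflowing line.
function wrapText(text: string, maxWidth: number, locale = "en"): string[] {
  const wordSegmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const lines: string[] = [];
  let line = "";
  for (const { segment } of wordSegmenter.segment(text)) {
    if (line !== "" && graphemeLength(line + segment) > maxWidth) {
      lines.push(line.trimEnd());
      line = segment.trimStart(); // do not start a line with whitespace
    } else {
      line += segment;
    }
  }
  if (line.trimEnd() !== "") lines.push(line.trimEnd());
  return lines;
}

// "e" followed by a combining acute accent is two code units but one grapheme.
const accented = "e\u0301";
console.log(accented.length, graphemeLength(accented)); // prints 2 1
console.log(wrapText("the quick brown fox", 10)); // [ 'the quick', 'brown fox' ]
```

Counting graphemes rather than code units is what keeps combining marks and emoji sequences from being split or double-counted when a line is measured.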
It runs the same in modern browsers and in Node 18 and up, offloads heavy work to Web Workers, and ships built-in OpenTelemetry instrumentation that degrades gracefully to no-ops when OTEL is not installed.
The aPDF format
AgenticPDF introduces aPDF (%aPDF-1.1), a compact binary container that bundles a PDF together with a rich, machine-readable metadata envelope using LZ77 compression. One command produces an agent-ready artifact:
apdf generate -i paper.pdf -o paper.apdf
The envelope goes well beyond a filename or a title. It carries structured identifiers (DOI, arXiv ID, ISBN, internal document IDs), provenance metadata (source URL, download timestamp, processing pipeline version), a structural summary (page count, detected document type, section outline), and a cryptographic hash of the original PDF bytes. That combination unlocks several things that plain PDFs cannot do cleanly:
- Deterministic deduplication — the embedded hash lets a vector store or document cache identify the same paper regardless of filename or URL, so re-ingesting a document you already have is a no-op rather than a silent duplicate.
- Provenance-aware retrieval — every chunk returned from a RAG pipeline carries the source URL and download timestamp baked into the artifact, not inferred after the fact from a metadata sidecar that may have drifted.
- Offline archival — the original PDF bytes are preserved inside the container with full round-trip fidelity, so an archived .apdf file is self-contained: re-render, re-extract, or re-chunk at any time without the original URL.
- Cross-tool portability — the envelope is a JSON-LD document, so any tool that understands JSON-LD can read the metadata without knowing anything about the aPDF binary format.
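The deduplication point can be sketched as a cache keyed on a hash of the PDF bytes. This is an illustration under assumptions: the hash algorithm is not named here, so SHA-256 is chosen for the example, and contentHash and DocumentCache are hypothetical names rather than part of AgenticPDF.

```typescript
import { createHash } from "node:crypto";

// Assumption: SHA-256 stands in for whichever hash the envelope carries.
function contentHash(pdfBytes: Uint8Array): string {
  return createHash("sha256").update(pdfBytes).digest("hex");
}

// Cache keyed by content hash, so filename and URL do not affect identity.
class DocumentCache {
  // hash -> source URL of the first copy seen
  private readonly seen = new Map<string, string>();

  add(pdfBytes: Uint8Array, sourceUrl: string): { hash: string; duplicate: boolean } {
    const hash = contentHash(pdfBytes);
    if (this.seen.has(hash)) {
      return { hash, duplicate: true }; // same bytes: skip re-ingestion
    }
    this.seen.set(hash, sourceUrl);
    return { hash, duplicate: false };
  }

  firstSource(hash: string): string | undefined {
    return this.seen.get(hash);
  }
}
```

With the hash embedded in the artifact, a pipeline can run this check by reading the envelope instead of rehashing the full file.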
There is a TypeScript CLI (apdf / agenticpdf) and a native Rust engine, agenticpdf-rs, that compiles the same document-understanding surface to a single binary with no runtime dependencies and to WASM. The next section covers it.
The Rust engine: native binary and WASM
agenticpdf-rs is a self-contained Rust implementation of the AgenticPDF document-understanding engine. It ships as a single ~801 KB static binary with no runtime dependencies — no JVM, no Python interpreter, no model server — and as a WASM module that runs in modern browsers, serverless functions, and edge workers.
The binary exposes the full document-understanding surface as discrete CLI commands:
# Reading-order Markdown for an LLM context window
apdf markdown paper.pdf
# Structured layout: typed blocks (heading/paragraph/list) + bounding boxes
apdf layout paper.pdf --output layout.json
# Table reconstruction (bordered, booktabs, borderless, side-by-side panels)
apdf table report.pdf --format json
# Tagged-PDF logical structure tree (author-provided, no heuristics)
apdf structure tagged.pdf
# Figure detection + caption linking
apdf figures paper.pdf
# Best-effort LaTeX for formulas (symbols, sub/superscripts, fractions)
apdf formula paper.pdf
# Prompt-injection / hidden-text scan + sanitized extraction
apdf scan untrusted.pdf
apdf markdown untrusted.pdf --sanitize
# Semantic chunks for RAG
apdf chunk document.pdf --size 500 --overlap 50 --format json
# Everything in one pass — one JSON, full document understanding
apdf all document.pdf --output full.json
apdf all is the agentic shortcut: metadata, text, outline, semantic chunks, reading-order Markdown, tables, figures, formulas, and a prompt-injection scan in a single call. apdf mcp runs a Model Context Protocol stdio server that exposes every command as a named tool, so MCP clients like Claude Desktop can invoke the engine directly:
{
  "mcpServers": {
    "agenticpdf": { "command": "apdf", "args": ["mcp"] }
  }
}
The WASM build exports the same surface for browser and edge environments — toMarkdown(), toLayout(), extractTables(), generateChunks(), scanInjection(), and more — so the zero-dependency Rust engine runs in a Next.js route handler, a browser extension, or a Cloudflare Worker with no native module. The rendering pipeline uses WebGL2 on the GPU (the same hardware-accelerated approach as the TypeScript renderer).
OCR is entirely local. Building with --features ocr unlocks three backends, all of which run on your own hardware:
- Tesseract — apdf ocr document.pdf shells out to a locally installed tesseract binary. No FFI, no model downloads beyond what Tesseract already has; if Tesseract is on your PATH, it works.
- PaddleOCR / EasyOCR — apdf ocr document.pdf --server http://localhost:8868/ocr POSTs each scanned page to a self-hosted PaddleOCR or EasyOCR server running on your own machine or network. The HTTP contract is the same one liteparse uses, so any compatible server is a drop-in.
- Vision-language model — apdf ocr document.pdf --vlm http://localhost:8000/v1/chat/completions sends page images to a locally served VLM (PaddleOCR-VL-1.6, LLaVA, or any model behind an OpenAI-compatible endpoint). Nothing leaves your network.
We recommend starting with Tesseract for quick results and moving to a self-hosted PaddleOCR or VLM backend when you need higher accuracy on complex layouts or non-Latin scripts. OCR support sits behind a compile-time feature flag, so a production binary that does not need it does not carry it.
Hardened on purpose
A library that fetches URLs, parses untrusted binaries, and writes files is an attack surface, so we treated it like one. AgenticPDF went through three security audit passes — 25-plus fixes in total, the third graded against CVE, MITRE ATT&CK, NIST FIPS 140-3, and CMMC 2.0 Level 2:
- SSRF protection on URL loading, with private-IP blocking and redirect validation.
- Path traversal prevention on every write, in both the library and the CLI.
- Cryptographic randomness — crypto.getRandomValues() everywhere, never Math.random.
- ReDoS-safe regexes, with a 64-character cap and escaping on user-supplied search terms.
- XSS sanitization in HTML export and prototype-pollution protection in object merging.
- Bounded everything — streaming buffers, recursion depth, and aPDF size caps.
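Two of these defenses are small enough to sketch. The functions below illustrate the general techniques, a length cap plus escaping for search terms and a containment check for write paths. They are not AgenticPDF's code, and safeSearchRegex and resolveInside are hypothetical names.

```typescript
import * as path from "node:path";

const MAX_SEARCH_TERM_LENGTH = 64;

// Build a literal, case-insensitive matcher from user input.
// Escaping removes every regex metacharacter, so the pattern cannot contain
// nested quantifiers; the cap bounds the pattern size.
function safeSearchRegex(term: string): RegExp {
  if (term.length > MAX_SEARCH_TERM_LENGTH) {
    throw new RangeError(`search term exceeds ${MAX_SEARCH_TERM_LENGTH} characters`);
  }
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(escaped, "i");
}

// Resolve a user-supplied path and refuse anything outside baseDir.
function resolveInside(baseDir: string, userPath: string): string {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, userPath);
  const relative = path.relative(base, target);
  const escapes =
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    path.isAbsolute(relative); // e.g. a different drive on Windows
  if (escapes) {
    throw new Error(`refusing to write outside ${base}`);
  }
  return target;
}
```

Checking the resolved path instead of scanning the input for ".." catches absolute paths and mixed separators as well, though it does not by itself account for symlinks.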
Tested and zero-dependency
The whole library is one TypeScript file, agenticpdf.ts, with no runtime dependencies — copy it into a project, install it from npm, or load it from a CDN. It is covered by 950 tests across 25 suites, all passing, with zero TypeScript compilation errors and CI on Node 18, 20, and 22.
# npm
npm install agenticpdf
# or just take the file
curl -O https://raw.githubusercontent.com/nervosys/AgenticPDF/master/agenticpdf.ts
Try it
AgenticPDF 1.0 is open source under AGPL-3.0. If your agents read documents — and these days, most do — we would love your feedback and your hardest PDFs.
Check out AgenticPDF on GitHub.
Per aspera ad astra.
