Library Usage
Use LiteParse programmatically from TypeScript or Python.
LiteParse can be used as a library in your own code, not just from the CLI. There are packages for both TypeScript and Python.
TypeScript
Section titled “TypeScript”Install as a project dependency:
npm install @llamaindex/liteparse# orpnpm add @llamaindex/liteparseParsing a document
Section titled “Parsing a document”import { LiteParse } from "@llamaindex/liteparse";
const parser = new LiteParse({ ocrEnabled: true });const result = await parser.parse("document.pdf");
// Full document text with layout preservedconsole.log(result.text);
// Per-page datafor (const page of result.pages) { console.log(`Page ${page.pageNum}: ${page.textItems.length} text items`);}JSON output with bounding boxes
Section titled “JSON output with bounding boxes”const parser = new LiteParse({ outputFormat: "json" });const result = await parser.parse("document.pdf");
for (const page of result.json?.pages || []) { for (const item of page.textItems) { console.log(`[${item.x}, ${item.y}] → [${item.x + item.width}, ${item.y + item.height}] ${item.text}`); }}Configuration
Section titled “Configuration”Pass any config options to the constructor. You only need to specify what you want to override:
const parser = new LiteParse({ ocrEnabled: true, ocrServerUrl: "http://localhost:8828/ocr", ocrLanguage: "fra", dpi: 300, outputFormat: "json", targetPages: "1-10", password: "secret", // for encrypted/protected documents});Buffer / Uint8Array input
Section titled “Buffer / Uint8Array input”You can pass raw bytes directly instead of a file path. PDF buffers are parsed with zero disk I/O — no temp files are written:
import { readFile } from "fs/promises";
const parser = new LiteParse();
// From a file readconst pdfBytes = await readFile("document.pdf");const result = await parser.parse(pdfBytes);
// From an HTTP responseconst response = await fetch("https://example.com/document.pdf");const buffer = Buffer.from(await response.arrayBuffer());const result2 = await parser.parse(buffer);Non-PDF buffers (images, Office documents) are written to a temp directory for format conversion. You can control the temp directory with the LITEPARSE_TMPDIR environment variable.
Screenshots
Section titled “Screenshots”Generate page images as buffers — useful for sending to LLMs or saving to disk. Accepts file paths, Buffer, or Uint8Array:
const parser = new LiteParse();const screenshots = await parser.screenshot("document.pdf");
for (const shot of screenshots) { console.log(`Page ${shot.pageNum}: ${shot.width}x${shot.height}`); // shot.imageBuffer contains the raw PNG/JPG data}
// Also works with buffer inputconst pdfBytes = await readFile("document.pdf");const shots = await parser.screenshot(pdfBytes, [1, 2, 3]);Environment variables
Section titled “Environment variables”| Variable | Description |
|---|---|
TESSDATA_PREFIX | Path to a directory containing Tesseract .traineddata files. For offline/air-gapped environments. Also available as the tessdataPath config option. |
LITEPARSE_TMPDIR | Override the temp directory for format conversion. Defaults to os.tmpdir(). |
See the API reference for full type details.
Python
Section titled “Python”The Python package is a wrapper around the LiteParse Node.js CLI. Node.js (>= 18) must be installed on your system.
Installation
Section titled “Installation”First, install the LiteParse CLI:
npm install -g @llamaindex/liteparseThen install the Python package:
pip install liteparseParsing a document
Section titled “Parsing a document”from liteparse import LiteParse
parser = LiteParse()result = parser.parse("document.pdf")
# Full document textprint(result.text)
# Per-page datafor page in result.pages: print(f"Page {page.pageNum}: {len(page.textItems)} text items")Configuration
Section titled “Configuration”The LiteParse constructor accepts cli_path (to override CLI auto-detection) and install_if_not_available (to control auto-install behavior). All parsing options are passed per-call:
parser = LiteParse()
result = parser.parse( "document.pdf", ocr_enabled=True, ocr_server_url="http://localhost:8828/ocr", ocr_language="fra", dpi=300, target_pages="1-5", password="secret", # for encrypted/protected documents)Parsing from bytes
Section titled “Parsing from bytes”If you already have file contents in memory (e.g. from a web upload), pass them directly to parse():
with open("document.pdf", "rb") as f: pdf_bytes = f.read()
result = parser.parse(pdf_bytes)print(result.text)Batch parsing
Section titled “Batch parsing”For multiple files, batch mode reuses the PDF engine and is significantly faster:
result = parser.batch_parse( input_dir="./documents", output_dir="./output", recursive=True, extension_filter=".pdf",)
print(f"Output written to: {result.output_dir}")