LiteParse

API Reference

API reference for the @llamaindex/liteparse TypeScript library.

LiteParse — open-source PDF parsing with spatial text extraction, OCR, and bounding boxes.

import { LiteParse } from "@llamaindex/liteparse";
const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse("document.pdf");
console.log(result.text);

LiteParse

Defined in: core/parser.ts:58

Main document parser class. Handles PDF parsing, OCR, format conversion, and screenshot generation.

import { LiteParse } from "@llamaindex/liteparse";

// Basic usage: parse with the default configuration.
const parser = new LiteParse();
const result = await parser.parse("document.pdf");
console.log(result.text);

// JSON output, rendered at a higher DPI.
const jsonParser = new LiteParse({ outputFormat: "json", dpi: 300 });
const jsonResult = await jsonParser.parse("document.pdf");
// `json` is optional on ParseResult, so guard against it being absent.
for (const page of jsonResult.json?.pages ?? []) {
  console.log(`Page ${page.page}: ${page.boundingBoxes.length} bounding boxes`);
}

// OCR through an external HTTP OCR server.
const ocrParser = new LiteParse({
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
});
const ocrResult = await ocrParser.parse("scanned-document.pdf");

new LiteParse(userConfig?): LiteParse

Defined in: core/parser.ts:68

Create a new LiteParse instance.

userConfig: Partial<LiteParseConfig> = {}

Partial configuration to override defaults. See LiteParseConfig for all options.

Returns: LiteParse

getConfig(): LiteParseConfig

Defined in: core/parser.ts:444

Get a copy of the current configuration, including defaults merged with user overrides.

Returns: LiteParseConfig

A shallow copy of the active LiteParseConfig.
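
The word "shallow" matters when a configuration value is an array or object. The sketch below is not LiteParse code; it uses a hypothetical `DemoConfig` type and plain object spread to show what a shallow copy generally implies: top-level fields are independent, while nested values such as a language array are shared between the copy and the original.

```typescript
// Illustrative config shape; only two of the documented fields are used here.
interface DemoConfig {
  dpi: number;
  ocrLanguage: string | string[];
}

const activeConfig: DemoConfig = { dpi: 150, ocrLanguage: ["eng", "fra"] };

// A shallow copy duplicates top-level fields but shares nested arrays and objects.
const configCopy: DemoConfig = { ...activeConfig };

configCopy.dpi = 300; // changes only the copy
console.log(activeConfig.dpi); // 150

if (Array.isArray(configCopy.ocrLanguage)) {
  configCopy.ocrLanguage.push("deu"); // same array instance as activeConfig.ocrLanguage
}
console.log(activeConfig.ocrLanguage); // [ 'eng', 'fra', 'deu' ]
```

If the returned configuration needs to be modified freely, copying nested values first (for example with `structuredClone`) avoids this kind of sharing.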

parse(input, quiet?): Promise<ParseResult>

Defined in: core/parser.ts:100

Parse a document and return the extracted text, page data, and optionally structured JSON.

Supports PDFs natively. Non-PDF formats (DOCX, XLSX, images, etc.) are automatically converted to PDF before parsing if the required system tools are installed.

input: LiteParseInput

A file path, Buffer, or Uint8Array containing document bytes. When given raw bytes, PDF data is parsed directly with zero disk I/O. Non-PDF bytes are written to a temp file for format conversion.

quiet: boolean = false

If true, suppresses progress logging to stderr.

Returns: Promise<ParseResult>

Parsed document data including text, per-page info, and optional JSON.

Throws: Error if the file cannot be found, converted, or parsed.
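
Since parse rejects on missing, unconvertible, or unparseable files, callers may want to turn that rejection into a value. The following is a minimal sketch under stated assumptions: `ParserLike`, `ParseResultLike`, `ParseOutcome`, and `tryParse` are hypothetical names defined here so the example stands alone, and only the documented `parse(input, quiet?)` shape is relied on.

```typescript
// Minimal structural types for this sketch; the real ones come from the library.
interface ParseResultLike {
  text: string;
}
interface ParserLike {
  parse(input: string | Uint8Array, quiet?: boolean): Promise<ParseResultLike>;
}

// Hypothetical result wrapper: either the extracted text or an error message.
type ParseOutcome =
  | { ok: true; text: string }
  | { ok: false; message: string };

async function tryParse(
  parser: ParserLike,
  input: string | Uint8Array,
): Promise<ParseOutcome> {
  try {
    // quiet = true suppresses progress logging on stderr.
    const result = await parser.parse(input, true);
    return { ok: true, text: result.text };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, message };
  }
}
```

A LiteParse instance should satisfy `ParserLike` structurally, so a call such as `await tryParse(new LiteParse(), "document.pdf")` is the intended usage.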

screenshot(input, pageNumbers?, quiet?): Promise<ScreenshotResult[]>

Defined in: core/parser.ts:227

Generate screenshots of PDF pages as image buffers.

Uses PDFium for high-quality rendering. Each page is returned as a ScreenshotResult with the raw image buffer and dimensions.

input: LiteParseInput

A file path, Buffer, or Uint8Array containing PDF bytes.

pageNumbers?: number[]

1-indexed page numbers to screenshot. If omitted, all pages are rendered.

quiet: boolean = false

If true, suppresses progress logging to stderr.

Returns: Promise<ScreenshotResult[]>

Array of screenshot results, one per rendered page.

BoundingBox

Defined in: core/types.ts:259

An axis-aligned bounding box defined by its top-left and bottom-right corners.

All coordinates are in PDF points.

Deprecated: use TextItem coordinates (x, y, width, height) instead. Will be removed in v2.0.

x1: number

Defined in: core/types.ts:261

X coordinate of the top-left corner.

x2: number

Defined in: core/types.ts:265

X coordinate of the bottom-right corner.

y1: number

Defined in: core/types.ts:263

Y coordinate of the top-left corner.

y2: number

Defined in: core/types.ts:267

Y coordinate of the bottom-right corner.
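
For code that still consumes the deprecated corner-based boxes, the conversion to the x, y, width, height form is simple arithmetic. The sketch below assumes nothing beyond the four documented fields; `Rect` and `boundingBoxToRect` are hypothetical names, not part of the library.

```typescript
// Shape of the deprecated BoundingBox, as documented above.
interface BoundingBox {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Hypothetical helper type mirroring the x/y/width/height fields of a text item.
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Convert corner coordinates to a top-left origin plus size.
function boundingBoxToRect(box: BoundingBox): Rect {
  return {
    x: box.x1,
    y: box.y1,
    width: box.x2 - box.x1,
    height: box.y2 - box.y1,
  };
}

const convertedRect = boundingBoxToRect({ x1: 72, y1: 100, x2: 172, y2: 112 });
console.log(convertedRect); // { x: 72, y: 100, width: 100, height: 12 }
```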


JsonTextItem

Defined in: core/types.ts:294

A text element from the JSON output with position, size, and font metadata.

optional confidence: number

Defined in: core/types.ts:310

OCR confidence score (null if OCR was not used).

optional fontName: string

Defined in: core/types.ts:306

Font name.

optional fontSize: number

Defined in: core/types.ts:308

Font size in PDF points.

height: number

Defined in: core/types.ts:304

Height of the text item in PDF points.

text: string

Defined in: core/types.ts:296

The text content of this item.

width: number

Defined in: core/types.ts:302

Width of the text item in PDF points.

x: number

Defined in: core/types.ts:298

X coordinate of the top-left corner, in PDF points.

y: number

Defined in: core/types.ts:300

Y coordinate of the top-left corner, in PDF points.
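
Because every item carries its own position and size, selecting the text inside an area of the page is a rectangle-containment test. This is a sketch rather than a library feature: `PositionedItem`, `PageRegion`, and `itemsInRegion` are hypothetical names, and the sample coordinates are invented for illustration.

```typescript
// Only the positional fields documented for JSON text items are needed here.
interface PositionedItem {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical rectangle in the same units (PDF points, top-left origin).
interface PageRegion {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Keep items that lie entirely inside the region.
function itemsInRegion<T extends PositionedItem>(items: T[], region: PageRegion): T[] {
  return items.filter(
    (item) =>
      item.x >= region.x &&
      item.y >= region.y &&
      item.x + item.width <= region.x + region.width &&
      item.y + item.height <= region.y + region.height,
  );
}

// Example: items within the top 72 points (one inch) of a 612-point-wide page.
const headerItems = itemsInRegion(
  [
    { text: "Quarterly Report", x: 72, y: 36, width: 120, height: 14 },
    { text: "Body paragraph", x: 72, y: 120, width: 300, height: 12 },
  ],
  { x: 0, y: 0, width: 612, height: 72 },
);
console.log(headerItems.map((item) => item.text)); // [ 'Quarterly Report' ]
```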


LiteParseConfig

Defined in: core/types.ts:34

Configuration options for the LiteParse parser.

All fields have sensible defaults. Pass a Partial<LiteParseConfig> to the constructor to override only the options you need.

const parser = new LiteParse({
  ocrEnabled: true,
  ocrLanguage: "fra",
  dpi: 300,
  outputFormat: "json",
});

dpi: number

Defined in: core/types.ts:98

DPI (dots per inch) for rendering pages to images. Higher values improve OCR accuracy but increase processing time and memory usage.

Default: 150
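
To get a feel for what a DPI value costs, recall that one PDF point is 1/72 of an inch, so a length in points maps to roughly `points * dpi / 72` pixels. The helper below is a hypothetical estimate only; how LiteParse rounds or clamps rendered dimensions is not stated in this reference.

```typescript
// One PDF point is 1/72 of an inch, so pixels = points * dpi / 72.
const POINTS_PER_INCH = 72;

// Hypothetical helper: estimate the rendered pixel size of a length in PDF points.
function pointsToPixels(points: number, dpi: number): number {
  return Math.round((points * dpi) / POINTS_PER_INCH);
}

// US Letter is 612 x 792 points.
console.log(pointsToPixels(612, 150), pointsToPixels(792, 150)); // 1275 1650
console.log(pointsToPixels(612, 300), pointsToPixels(792, 300)); // 2550 3300
```

Doubling the DPI quadruples the pixel count per page, which is where the extra processing time and memory come from.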

maxPages: number

Defined in: core/types.ts:83

Maximum number of pages to parse from the document.

Default: 1000

numWorkers: number

Defined in: core/types.ts:76

Number of pages to OCR in parallel. Higher values use more memory but process faster on multi-core machines.

Default: CPU cores - 1 (minimum 1)
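
The documented default can be written as a one-line rule. This sketch takes the core count as a parameter because the reference does not say how LiteParse counts cores; `defaultNumWorkers` is a hypothetical name.

```typescript
// Mirrors the documented default: CPU cores - 1, but never fewer than 1.
function defaultNumWorkers(cpuCount: number): number {
  return Math.max(1, cpuCount - 1);
}

console.log(defaultNumWorkers(8)); // 7
console.log(defaultNumWorkers(1)); // 1
```

In Node.js the core count could come from `os.cpus().length`, for example.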

ocrEnabled: boolean

Defined in: core/types.ts:49

Whether to run OCR on pages with little or no native text. When enabled, LiteParse selectively OCRs only images and text-sparse regions.

Default: true

ocrLanguage: string | string[]

Defined in: core/types.ts:41

OCR language code(s). Uses ISO 639-3 codes for Tesseract (e.g., "eng", "fra") or ISO 639-1 for HTTP OCR servers (e.g., "en", "fr").

Default: "en"

optional ocrServerUrl: string

Defined in: core/types.ts:57

URL of an HTTP OCR server implementing the LiteParse OCR API. If not provided, the built-in Tesseract.js engine is used.

See: OCR API Specification
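
Switching between the built-in engine and an HTTP OCR server also means switching language-code conventions (ISO 639-3 versus ISO 639-1, as described under ocrLanguage). The mapping below is a small illustrative subset and `toHttpOcrLanguage` is a hypothetical helper; which codes a given OCR server accepts depends on that server.

```typescript
// Small illustrative subset of ISO 639-3 to ISO 639-1 mappings.
const ISO_639_3_TO_1: Record<string, string> = {
  eng: "en",
  fra: "fr",
  deu: "de",
  spa: "es",
  ita: "it",
};

// Hypothetical helper: convert Tesseract-style codes for use with an HTTP OCR server.
// Codes that are not in the table are passed through unchanged.
function toHttpOcrLanguage(lang: string | string[]): string | string[] {
  const convert = (code: string): string => ISO_639_3_TO_1[code] ?? code;
  return Array.isArray(lang) ? lang.map(convert) : convert(lang);
}

console.log(toHttpOcrLanguage("fra")); // fr
console.log(toHttpOcrLanguage(["eng", "deu"])); // [ 'en', 'de' ]
```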

outputFormat: OutputFormat

Defined in: core/types.ts:105

Output format for parsed results.

Default: "json"

optional password: string

Defined in: core/types.ts:138

Password for opening encrypted/protected documents. Used for password-protected PDFs and office documents.

Default: undefined

preciseBoundingBox: boolean

Defined in: core/types.ts:116

Calculate precise bounding boxes for each text line. Disable for faster parsing when bounding boxes aren’t needed.

Deprecated: this option controls the deprecated boundingBoxes output and will be removed in v2.0. Text item coordinates (x, y, width, height) are always present regardless of this setting.

Default: true

preserveLayoutAlignmentAcrossPages: boolean

Defined in: core/types.ts:130

Maintain consistent text alignment across page boundaries.

Default: false

preserveVerySmallText: boolean

Defined in: core/types.ts:123

Preserve very small text that would normally be filtered out.

Default: false

optional targetPages: string

Defined in: core/types.ts:90

Specific pages to parse, as a comma-separated string of page numbers and ranges.

Example: "1-5,10,15-20"
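
To make the page-range syntax concrete, here is one way such a string could be expanded into individual page numbers. This is a sketch of the format, not LiteParse's own parser: `expandTargetPages` is a hypothetical name, and the handling of whitespace and invalid input is an assumption.

```typescript
// Expand a string such as "1-5,10,15-20" into a sorted list of unique page numbers.
function expandTargetPages(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const trimmed = part.trim();
    if (trimmed === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:-(\d+))?$/.exec(trimmed);
    if (!match) throw new Error(`Invalid page range: "${trimmed}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    // Pages are 1-indexed and ranges must not run backwards.
    if (start < 1 || end < start) throw new Error(`Invalid page range: "${trimmed}"`);
    for (let p = start; p <= end; p++) pages.add(p);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(expandTargetPages("1-3,10,15-16")); // [ 1, 2, 3, 10, 15, 16 ]
```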

optional tessdataPath: string

Defined in: core/types.ts:68

Path to a directory containing Tesseract .traineddata files. Used as both the language data source and cache directory for Tesseract.js.

If not set, falls back to the TESSDATA_PREFIX environment variable. If neither is set, Tesseract.js downloads data from cdn.jsdelivr.net.

Example: /opt/tessdata

MarkupData

Defined in: core/types.ts:185

Markup annotation data associated with a text item.

optional highlight: string

Defined in: core/types.ts:187

Highlight color (e.g., "yellow", "#FFFF00"), or undefined if not highlighted.

optional squiggly: boolean

Defined in: core/types.ts:191

Whether the text has a squiggly underline.

optional strikeout: boolean

Defined in: core/types.ts:193

Whether the text is struck out.

optional underline: boolean

Defined in: core/types.ts:189

Whether the text is underlined.
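
One use for markup data is pulling out everything a reviewer highlighted. The sketch below groups item text by highlight color using only the documented `highlight` field and the `str` and `markup` fields of a text item; `MarkedItem` and `highlightedTextByColor` are hypothetical names.

```typescript
// Local mirrors of the documented shapes, reduced to what this sketch needs.
interface MarkupLike {
  highlight?: string;
  underline?: boolean;
  squiggly?: boolean;
  strikeout?: boolean;
}
interface MarkedItem {
  str: string;
  markup?: MarkupLike;
}

// Collect the text of highlighted items, grouped by highlight color.
function highlightedTextByColor(items: MarkedItem[]): Record<string, string[]> {
  const byColor: Record<string, string[]> = {};
  for (const item of items) {
    const color = item.markup?.highlight;
    if (color === undefined) continue; // not highlighted
    const bucket = byColor[color] ?? [];
    bucket.push(item.str);
    byColor[color] = bucket;
  }
  return byColor;
}

console.log(
  highlightedTextByColor([
    { str: "Net revenue", markup: { highlight: "yellow" } },
    { str: "unchanged" },
    { str: "Q3", markup: { highlight: "yellow", underline: true } },
  ]),
); // { yellow: [ 'Net revenue', 'Q3' ] }
```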


ParsedPage

Defined in: core/types.ts:273

Parsed data for a single page of a document.

optional boundingBoxes: BoundingBox[]

Defined in: core/types.ts:288

Deprecated: use TextItem coordinates instead. Will be removed in v2.0. Present when LiteParseConfig.preciseBoundingBox is enabled.

height: number

Defined in: core/types.ts:279

Page height in PDF points.

pageNum: number

Defined in: core/types.ts:275

1-indexed page number.

text: string

Defined in: core/types.ts:281

Full text content of the page with spatial layout preserved.

textItems: TextItem[]

Defined in: core/types.ts:283

Individual text elements extracted from the page.

width: number

Defined in: core/types.ts:277

Page width in PDF points.


ParseResult

Defined in: core/types.ts:354

The result of parsing a document with LiteParse.parse.

optional json: ParseResultJson

Defined in: core/types.ts:360

Structured JSON data. Present when LiteParseConfig.outputFormat is "json".

pages: ParsedPage[]

Defined in: core/types.ts:356

Per-page parsed data.

text: string

Defined in: core/types.ts:358

Full document text, concatenated from all pages.


ParseResultJson

Defined in: core/types.ts:331

Structured JSON representation of parsed document data. Returned when LiteParseConfig.outputFormat is "json".

pages: object[]

Defined in: core/types.ts:333

Array of page data.

boundingBoxes: BoundingBox[]

Deprecated: use textItems coordinates instead. Will be removed in v2.0.

height: number

Page height in PDF points.

page: number

1-indexed page number.

text: string

Full text content of the page.

textItems: JsonTextItem[]

Individual text elements with position and font metadata.

width: number

Page width in PDF points.


ScreenshotResult

Defined in: core/types.ts:366

The result of generating a screenshot with LiteParse.screenshot.

height: number

Defined in: core/types.ts:372

Image height in pixels.

imageBuffer: Buffer

Defined in: core/types.ts:374

Raw image data as a Node.js Buffer (PNG or JPG).

optional imagePath: string

Defined in: core/types.ts:376

File path if the screenshot was saved to disk.

pageNum: number

Defined in: core/types.ts:368

1-indexed page number.

width: number

Defined in: core/types.ts:370

Image width in pixels.
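
Since imageBuffer may hold either PNG or JPG data, code that saves screenshots needs a way to pick the file extension. One option is to inspect the leading signature bytes, which are fixed by the PNG and JPEG formats. The sketch below is an assumption-laden illustration: `imageExtension` and `screenshotFileName` are hypothetical helpers, and the naming scheme is invented.

```typescript
// Detect the image format from the leading signature bytes.
function imageExtension(bytes: Uint8Array): "png" | "jpg" | "bin" {
  const startsWith = (sig: number[]): boolean => sig.every((b, i) => bytes[i] === b);
  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return "png"; // \x89 P N G
  if (startsWith([0xff, 0xd8, 0xff])) return "jpg"; // JPEG start-of-image marker
  return "bin"; // unknown format
}

// Hypothetical naming scheme for saving one file per rendered page.
function screenshotFileName(pageNum: number, bytes: Uint8Array): string {
  return `page-${pageNum}.${imageExtension(bytes)}`;
}

const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
console.log(screenshotFileName(3, pngHeader)); // page-3.png
```

A Node.js Buffer is a Uint8Array, so a `ScreenshotResult.imageBuffer` can be passed to these helpers directly.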


SearchItemsOptions

Defined in: core/types.ts:316

Options for searchItems.

optional caseSensitive: boolean

Defined in: core/types.ts:324

Whether the search should be case-sensitive.

Default: false

phrase: string

Defined in: core/types.ts:318

Find text items containing this phrase. Matches can span multiple adjacent items.


TextItem

Defined in: core/types.ts:147

An individual text element extracted from a page, with position, size, and font metadata.

Coordinates are in PDF points, with the origin at the top-left of the page: x increases to the right and y increases downward.

optional confidence: number

Defined in: core/types.ts:179

Confidence score from 0.0 to 1.0. Native PDF text defaults to 1.0; OCR text reflects the engine's confidence.

optional fontName: string

Defined in: core/types.ts:163

Font name (e.g., "Helvetica", "Times-Roman", "OCR" for OCR-detected text).

optional fontSize: number

Defined in: core/types.ts:165

Font size in PDF points.

h: number

Defined in: core/types.ts:161

Alias for height.

height: number

Defined in: core/types.ts:157

Height of the text item in PDF points.

optional markup: MarkupData

Defined in: core/types.ts:173

Markup annotations (highlights, underlines, etc.) applied to this text.

optional r: number

Defined in: core/types.ts:167

Rotation angle in degrees. One of 0, 90, 180, or 270.

optional rx: number

Defined in: core/types.ts:169

X coordinate after rotation transformation.

optional ry: number

Defined in: core/types.ts:171

Y coordinate after rotation transformation.

str: string

Defined in: core/types.ts:149

The text content of this item.

w: number

Defined in: core/types.ts:159

Alias for width.

width: number

Defined in: core/types.ts:155

Width of the text item in PDF points.

x: number

Defined in: core/types.ts:151

X coordinate of the top-left corner, in PDF points.

y: number

Defined in: core/types.ts:153

Y coordinate of the top-left corner, in PDF points.
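
The confidence field makes it possible to drop unreliable OCR output before further processing. In this sketch, `TextItemLike` and `filterByConfidence` are hypothetical names, and treating a missing confidence as 1.0 is an assumption chosen to match the documented default for native PDF text.

```typescript
// Reduced mirror of a text item: just the text and the optional confidence.
interface TextItemLike {
  str: string;
  confidence?: number;
}

// Keep items at or above a threshold.
// Assumption: items without a confidence value are treated as 1.0.
function filterByConfidence<T extends TextItemLike>(items: T[], minConfidence: number): T[] {
  return items.filter((item) => (item.confidence ?? 1.0) >= minConfidence);
}

const reliableItems = filterByConfidence(
  [
    { str: "Invoice", confidence: 1.0 },
    { str: "lnv0ice", confidence: 0.42 },
    { str: "Total" },
  ],
  0.8,
);
console.log(reliableItems.map((item) => item.str)); // [ 'Invoice', 'Total' ]
```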

LiteParseInput = string | Buffer | Uint8Array

Defined in: core/types.ts:16

Accepted input types for LiteParse.parse and LiteParse.screenshot.

  • string — A file path to a document on disk.
  • Buffer | Uint8Array — Raw file bytes (PDF bytes go straight to the parser with zero disk I/O; non-PDF bytes are written to a temp file for format conversion).
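
When passing raw bytes, it can be useful to know in advance whether they will take the direct PDF path. PDF files begin with the ASCII signature `%PDF-`, so a quick check is possible. This is a sketch only: `looksLikePdf` is a hypothetical helper, it assumes the signature sits at offset 0, and the reference does not describe how LiteParse itself tells PDF bytes from other formats.

```typescript
// ASCII bytes for "%PDF-", the signature at the start of a PDF file.
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical check: do the bytes start with the PDF signature?
function looksLikePdf(bytes: Uint8Array): boolean {
  return PDF_SIGNATURE.every((b, i) => bytes[i] === b);
}

// "%PDF-1.7" as bytes.
const pdfStart = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]);
console.log(looksLikePdf(pdfStart)); // true

// A ZIP-based file such as DOCX starts with "PK".
console.log(looksLikePdf(new Uint8Array([0x50, 0x4b, 0x03, 0x04]))); // false
```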

OutputFormat = "json" | "text"

Defined in: core/types.ts:7

Supported output formats for parsed documents.

  • "json" — Structured JSON with per-page text items, bounding boxes, and metadata.
  • "text" — Plain text with spatial layout preserved.

searchItems(items, options): JsonTextItem[]

Defined in: processing/searchItems.ts:26

Search text items for matches, returning synthetic merged items for each match.

For phrase searches, consecutive text items are concatenated and searched. When a phrase spans multiple items, the result is a single merged item with a combined bounding box and the matched text. Font metadata is taken from the first matched item.

items: JsonTextItem[]

options: SearchItemsOptions

Returns: JsonTextItem[]

import { LiteParse, searchItems } from "@llamaindex/liteparse";

const parser = new LiteParse({ outputFormat: "json" });
const result = await parser.parse("report.pdf");

// `json` is optional on ParseResult, so guard against it being absent.
for (const page of result.json?.pages ?? []) {
  const matches = searchItems(page.textItems, { phrase: "0°C to 70°C" });
  for (const match of matches) {
    console.log(`Found "${match.text}" at (${match.x}, ${match.y})`);
  }
}
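
The "combined bounding box" of a merged match can be pictured as the smallest rectangle that covers every matched item. The sketch below shows that geometry only; `Box` and `unionBox` are hypothetical names, and the reference does not specify how searchItems computes the merged box internally.

```typescript
// Position and size in PDF points, top-left origin, y growing downward.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Smallest rectangle that contains every input box.
function unionBox(boxes: Box[]): Box {
  if (boxes.length === 0) throw new Error("unionBox needs at least one box");
  const left = Math.min(...boxes.map((b) => b.x));
  const top = Math.min(...boxes.map((b) => b.y));
  const right = Math.max(...boxes.map((b) => b.x + b.width));
  const bottom = Math.max(...boxes.map((b) => b.y + b.height));
  return { x: left, y: top, width: right - left, height: bottom - top };
}

// Two adjacent items on the same line, e.g. "0°C to" followed by "70°C".
const mergedBox = unionBox([
  { x: 72, y: 100, width: 40, height: 12 },
  { x: 114, y: 100, width: 28, height: 12 },
]);
console.log(mergedBox); // { x: 72, y: 100, width: 70, height: 12 }
```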