API Reference

LiteParse

API reference for the @llamaindex/liteparse TypeScript library.

LiteParse — open-source PDF parsing with spatial text extraction, OCR, and bounding boxes.

Example

import { LiteParse } from "@llamaindex/liteparse";

const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse("document.pdf");
console.log(result.text);

Classes

LiteParse

Defined in: core/parser.ts:58

Main document parser class. Handles PDF parsing, OCR, format conversion, and screenshot generation.

Examples

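import { LiteParse } from "@llamaindex/liteparse";

// Basic usage: parse a PDF with the default configuration and print its text.
const parser = new LiteParse();
const result = await parser.parse("document.pdf");
console.log(result.text);

// JSON output at a higher rendering DPI. result.json is optional in ParseResult,
// so fall back to an empty page list.
const parser = new LiteParse({ outputFormat: "json", dpi: 300 });
const result = await parser.parse("document.pdf");
for (const page of result.json?.pages ?? []) {
  console.log(`Page ${page.page}: ${page.boundingBoxes.length} bounding boxes`);
}

// OCR through an external HTTP OCR server instead of the built-in engine.
const parser = new LiteParse({
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
});
const result = await parser.parse("scanned-document.pdf");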

Constructors

Constructor

new LiteParse(userConfig?): LiteParse

Defined in: core/parser.ts:68

Create a new LiteParse instance.

Parameters

userConfig?

Partial<LiteParseConfig> = {}

Partial configuration to override defaults. See LiteParseConfig for all options.

Returns

LiteParse

Methods

getConfig()

getConfig(): LiteParseConfig

Defined in: core/parser.ts:444

Get a copy of the current configuration, including defaults merged with user overrides.

Returns

LiteParseConfig

A shallow copy of the active LiteParseConfig.

parse()

parse(input, quiet?): Promise<ParseResult>

Defined in: core/parser.ts:100

Parse a document and return the extracted text, page data, and optionally structured JSON.

Supports PDFs natively. Non-PDF formats (DOCX, XLSX, images, etc.) are automatically converted to PDF before parsing if the required system tools are installed.

Parameters

input

LiteParseInput

A file path, Buffer, or Uint8Array containing document bytes. When given raw bytes, PDF data is parsed directly with zero disk I/O. Non-PDF bytes are written to a temp file for format conversion.

quiet?

boolean = false

If true, suppresses progress logging to stderr.

Returns

Promise<ParseResult>

Parsed document data including text, per-page info, and optional JSON.

Throws

Error if the file cannot be found, converted, or parsed.

screenshot()

screenshot(input, pageNumbers?, quiet?): Promise<ScreenshotResult[]>

Defined in: core/parser.ts:227

Generate screenshots of PDF pages as image buffers.

Uses PDFium for high-quality rendering. Each page is returned as a ScreenshotResult with the raw image buffer and dimensions.

Parameters

input

LiteParseInput

A file path, Buffer, or Uint8Array containing PDF bytes.

pageNumbers?

number[]

1-indexed page numbers to screenshot. If omitted, all pages are rendered.

quiet?

boolean = false

If true, suppresses progress logging to stderr.

Returns

Promise<ScreenshotResult[]>

Array of screenshot results, one per rendered page.

Interfaces

BoundingBox

Defined in: core/types.ts:259

An axis-aligned bounding box defined by its top-left and bottom-right corners.

All coordinates are in PDF points.

Deprecated

Use TextItem coordinates (x, y, width, height) instead. Will be removed in v2.0.

Properties

x1

x1: number

Defined in: core/types.ts:261

X coordinate of the top-left corner.

x2

x2: number

Defined in: core/types.ts:265

X coordinate of the bottom-right corner.

y1

y1: number

Defined in: core/types.ts:263

Y coordinate of the top-left corner.

y2

y2: number

Defined in: core/types.ts:267

Y coordinate of the bottom-right corner.
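
Because BoundingBox is deprecated in favor of the x, y, width, height form used by TextItem, the following is a minimal sketch of converting between the two. The interface is a local mirror of the shape documented above, and Rect and boundingBoxToRect are hypothetical names that are not part of the library.

```typescript
// Local mirror of the documented BoundingBox shape (all values in PDF points).
interface BoundingBox {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Hypothetical rectangle type in the x/y/width/height form used by TextItem.
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Convert corner form (top-left, bottom-right) to origin-plus-size form.
function boundingBoxToRect(box: BoundingBox): Rect {
  return {
    x: box.x1,
    y: box.y1,
    width: box.x2 - box.x1,
    height: box.y2 - box.y1,
  };
}

const rect = boundingBoxToRect({ x1: 72, y1: 100, x2: 172, y2: 112 });
console.log(rect); // x: 72, y: 100, width: 100, height: 12
```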

JsonTextItem

Defined in: core/types.ts:294

A text element from the JSON output with position, size, and font metadata.

Properties

confidence?

optional confidence: number

Defined in: core/types.ts:310

OCR confidence score. Absent if OCR was not used.

fontName?

optional fontName: string

Defined in: core/types.ts:306

Font name.

fontSize?

optional fontSize: number

Defined in: core/types.ts:308

Font size in PDF points.

height

height: number

Defined in: core/types.ts:304

Height of the text item in PDF points.

text

text: string

Defined in: core/types.ts:296

The text content of this item.

width

width: number

Defined in: core/types.ts:302

Width of the text item in PDF points.

x

x: number

Defined in: core/types.ts:298

X coordinate of the top-left corner, in PDF points.

y

y: number

Defined in: core/types.ts:300

Y coordinate of the top-left corner, in PDF points.

LiteParseConfig

Defined in: core/types.ts:34

Configuration options for the LiteParse parser.

All fields have sensible defaults. Pass a Partial<LiteParseConfig> to the constructor to override only the options you need.

Example

const parser = new LiteParse({
  ocrEnabled: true,
  ocrLanguage: "fra",
  dpi: 300,
  outputFormat: "json",
});

Properties

dpi

dpi: number

Defined in: core/types.ts:98

DPI (dots per inch) for rendering pages to images. Higher values improve OCR accuracy but increase processing time and memory usage.

Default Value

150
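
To get a feel for the memory cost of a higher DPI, the sketch below estimates the rendered pixel size of a page. It relies only on the fact that one PDF point is 1/72 inch. The helper name renderedSizePx and the use of Math.round are assumptions for illustration, and the library's own rounding may differ.

```typescript
// One PDF point is 1/72 inch.
const POINTS_PER_INCH = 72;

// Hypothetical helper: estimate the pixel dimensions of a page rendered at a given DPI.
function renderedSizePx(
  widthPt: number,
  heightPt: number,
  dpi: number,
): { width: number; height: number } {
  return {
    width: Math.round((widthPt * dpi) / POINTS_PER_INCH),
    height: Math.round((heightPt * dpi) / POINTS_PER_INCH),
  };
}

// US Letter is 612 x 792 points.
console.log(renderedSizePx(612, 792, 150)); // width: 1275, height: 1650
console.log(renderedSizePx(612, 792, 300)); // width: 2550, height: 3300
```

Doubling the DPI quadruples the pixel count per page, which is why memory usage grows quickly at higher settings.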

maxPages

maxPages: number

Defined in: core/types.ts:83

Maximum number of pages to parse from the document.

Default Value

1000

numWorkers

numWorkers: number

Defined in: core/types.ts:76

Number of pages to OCR in parallel. Higher values use more memory but process faster on multi-core machines.

Default Value

CPU cores - 1 (minimum 1)
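
The default above can be written as a one-line formula. This is a sketch of the stated rule only. The function name defaultNumWorkers is hypothetical, and the core count is passed in as a parameter (in Node.js it could come from os.cpus().length).

```typescript
// Hypothetical helper mirroring the documented default: CPU cores - 1, but never below 1.
function defaultNumWorkers(cpuCores: number): number {
  return Math.max(1, cpuCores - 1);
}

console.log(defaultNumWorkers(8)); // 7
console.log(defaultNumWorkers(1)); // 1
```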

ocrEnabled

ocrEnabled: boolean

Defined in: core/types.ts:49

Whether to run OCR on pages with little or no native text. When enabled, LiteParse selectively OCRs only images and text-sparse regions.

Default Value

true

ocrLanguage

ocrLanguage: string | string[]

Defined in: core/types.ts:41

OCR language code(s). Uses ISO 639-3 codes for Tesseract (e.g., "eng", "fra") or ISO 639-1 for HTTP OCR servers (e.g., "en", "fr").

Default Value

"en"

ocrServerUrl?

optional ocrServerUrl: string

Defined in: core/types.ts:57

URL of an HTTP OCR server implementing the LiteParse OCR API. If not provided, the built-in Tesseract.js engine is used.

See

OCR API Specification

outputFormat

outputFormat: OutputFormat

Defined in: core/types.ts:105

Output format for parsed results.

Default Value

"json"

password?

optional password: string

Defined in: core/types.ts:138

Password for opening encrypted documents, such as password-protected PDFs and office documents.

Default Value

undefined

preciseBoundingBox

preciseBoundingBox: boolean

Defined in: core/types.ts:116

Calculate precise bounding boxes for each text line. Disable for faster parsing when bounding boxes aren’t needed.

Deprecated

Controls the deprecated boundingBoxes output. Will be removed in v2.0. Text item coordinates (x, y, width, height) are always present regardless.

Default Value

true

preserveLayoutAlignmentAcrossPages

preserveLayoutAlignmentAcrossPages: boolean

Defined in: core/types.ts:130

Maintain consistent text alignment across page boundaries.

Default Value

false

preserveVerySmallText

preserveVerySmallText: boolean

Defined in: core/types.ts:123

Preserve very small text that would normally be filtered out.

Default Value

false

targetPages?

optional targetPages: string

Defined in: core/types.ts:90

Specific pages to parse, as a comma-separated string of page numbers and ranges.

Example

`"1-5,10,15-20"`
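
To illustrate what a string in this format denotes, the sketch below expands it into a sorted list of 1-indexed page numbers. It is not the library's parser: expandPageRanges is a hypothetical helper, and its handling of whitespace, duplicates, and invalid input is an assumption.

```typescript
// Hypothetical helper: expand a spec such as "1-5,10,15-20" into page numbers.
function expandPageRanges(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page range: "${token}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) throw new Error(`Invalid page range: "${token}"`);
    for (let page = start; page <= end; page++) pages.add(page);
  }
  // Deduplicated and sorted ascending.
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(expandPageRanges("1-5,10,15-20"));
// [1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]
```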

tessdataPath?

optional tessdataPath: string

Defined in: core/types.ts:68

Path to a directory containing Tesseract .traineddata files. Used as both the language data source and cache directory for Tesseract.js.

If not set, falls back to the TESSDATA_PREFIX environment variable. If neither is set, Tesseract.js downloads data from cdn.jsdelivr.net.

Example

`/opt/tessdata`

MarkupData

Defined in: core/types.ts:185

Markup annotation data associated with a text item.

Properties

highlight?

optional highlight: string

Defined in: core/types.ts:187

Highlight color (e.g., "yellow", "#FFFF00"), or undefined if not highlighted.

squiggly?

optional squiggly: boolean

Defined in: core/types.ts:191

Whether the text has a squiggly underline.

strikeout?

optional strikeout: boolean

Defined in: core/types.ts:193

Whether the text is struck out.

underline?

optional underline: boolean

Defined in: core/types.ts:189

Whether the text is underlined.

ParsedPage

Defined in: core/types.ts:273

Parsed data for a single page of a document.

Properties

boundingBoxes?

optional boundingBoxes: BoundingBox[]

Defined in: core/types.ts:288

Deprecated

Use TextItem coordinates instead. Will be removed in v2.0. Present when LiteParseConfig.preciseBoundingBox is enabled.

height

height: number

Defined in: core/types.ts:279

Page height in PDF points.

pageNum

pageNum: number

Defined in: core/types.ts:275

1-indexed page number.

text

text: string

Defined in: core/types.ts:281

Full text content of the page with spatial layout preserved.

textItems

textItems: TextItem[]

Defined in: core/types.ts:283

Individual text elements extracted from the page.

width

width: number

Defined in: core/types.ts:277

Page width in PDF points.

ParseResult

Defined in: core/types.ts:354

The result of parsing a document with LiteParse.parse.

Properties

json?

optional json: ParseResultJson

Defined in: core/types.ts:360

Structured JSON data. Present when LiteParseConfig.outputFormat is "json".

pages

pages: ParsedPage[]

Defined in: core/types.ts:356

Per-page parsed data.

text

text: string

Defined in: core/types.ts:358

Full document text, concatenated from all pages.

ParseResultJson

Defined in: core/types.ts:331

Structured JSON representation of parsed document data. Returned when LiteParseConfig.outputFormat is "json".

Properties

ScreenshotResult

Defined in: core/types.ts:366

The result of generating a screenshot with LiteParse.screenshot.

Properties

height

height: number

Defined in: core/types.ts:372

Image height in pixels.

imageBuffer

imageBuffer: Buffer

Defined in: core/types.ts:374

Raw image data as a Node.js Buffer (PNG or JPG).

imagePath?

optional imagePath: string

Defined in: core/types.ts:376

File path if the screenshot was saved to disk.

pageNum

pageNum: number

Defined in: core/types.ts:368

1-indexed page number.

width

width: number

Defined in: core/types.ts:370

Image width in pixels.
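
Since imageBuffer may hold either PNG or JPG data, a caller saving it to disk needs to pick a file extension. One option, sketched below, is to inspect the standard file signatures: PNG files begin with the bytes 89 50 4E 47 0D 0A 1A 0A, and JPEG files begin with FF D8 FF. The helper detectImageFormat is hypothetical and not part of the library.

```typescript
// Hypothetical helper: identify PNG or JPEG data from its leading signature bytes.
function detectImageFormat(bytes: Uint8Array): "png" | "jpg" | undefined {
  const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
  const JPG_SIGNATURE = [0xff, 0xd8, 0xff];

  const startsWith = (signature: number[]): boolean =>
    bytes.length >= signature.length &&
    signature.every((byte, index) => bytes[index] === byte);

  if (startsWith(PNG_SIGNATURE)) return "png";
  if (startsWith(JPG_SIGNATURE)) return "jpg";
  return undefined; // unknown or truncated data
}

const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
console.log(detectImageFormat(pngHeader)); // "png"
```

A Node.js Buffer is a Uint8Array, so an imageBuffer value can be passed to this function directly.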

SearchItemsOptions

Defined in: core/types.ts:316

Options for searchItems.

Properties

caseSensitive?

optional caseSensitive: boolean

Defined in: core/types.ts:324

Whether the search should be case-sensitive.

Default Value

false

phrase

phrase: string

Defined in: core/types.ts:318

Find text items containing this phrase. Matches can span multiple adjacent items.

TextItem

Defined in: core/types.ts:147

An individual text element extracted from a page, with position, size, and font metadata.

Coordinates are in PDF points, measured from the top-left corner of the page: x increases to the right and y increases downward.

Properties

confidence?

optional confidence: number

Defined in: core/types.ts:179

Confidence score from 0.0 to 1.0. Native PDF text defaults to 1.0; OCR text reflects the engine's confidence.

fontName?

optional fontName: string

Defined in: core/types.ts:163

Font name (e.g., "Helvetica", "Times-Roman", "OCR" for OCR-detected text).

fontSize?

optional fontSize: number

Defined in: core/types.ts:165

Font size in PDF points.

h

h: number

Defined in: core/types.ts:161

Alias for height.

height

height: number

Defined in: core/types.ts:157

Height of the text item in PDF points.

markup?

optional markup: MarkupData

Defined in: core/types.ts:173

Markup annotations (highlights, underlines, etc.) applied to this text.

r?

optional r: number

Defined in: core/types.ts:167

Rotation angle in degrees. One of 0, 90, 180, or 270.

rx?

optional rx: number

Defined in: core/types.ts:169

X coordinate after rotation transformation.

ry?

optional ry: number

Defined in: core/types.ts:171

Y coordinate after rotation transformation.

str

str: string

Defined in: core/types.ts:149

The text content of this item.

w

w: number

Defined in: core/types.ts:159

Alias for width.

width

width: number

Defined in: core/types.ts:155

Width of the text item in PDF points.

x

x: number

Defined in: core/types.ts:151

X coordinate of the top-left corner, in PDF points.

y

y: number

Defined in: core/types.ts:153

Y coordinate of the top-left corner, in PDF points.
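
Some PDF tools work in the default PDF user space, where the origin is at the bottom-left of the page and y increases upward. The sketch below converts a top-left-origin item, as described for TextItem, into that convention using the page height from ParsedPage. The type TopLeftBox and the function toBottomLeftOrigin are hypothetical names used only for illustration.

```typescript
// Local subset of the documented TextItem geometry (PDF points, top-left origin).
interface TopLeftBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical helper: return the same box with y measured upward from the
// bottom of the page. The resulting y is the box's lower edge.
function toBottomLeftOrigin(item: TopLeftBox, pageHeight: number): TopLeftBox {
  return { ...item, y: pageHeight - item.y - item.height };
}

// A 12 pt tall item placed 100 pt below the top of a 792 pt tall page.
const flipped = toBottomLeftOrigin({ x: 72, y: 100, width: 50, height: 12 }, 792);
console.log(flipped.y); // 680
```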

Type Aliases

LiteParseInput

LiteParseInput = string | Buffer | Uint8Array

Defined in: core/types.ts:16

Accepted input types for LiteParse.parse and LiteParse.screenshot.

string — A file path to a document on disk.
Buffer | Uint8Array — Raw file bytes (PDF bytes go straight to the parser with zero disk I/O; non-PDF bytes are written to a temp file for format conversion).
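
To predict which path raw bytes will take, a caller can check for the PDF file header. This reference does not describe how LiteParse itself distinguishes PDF from non-PDF bytes, so the sketch below is only an assumption: it uses the standard "%PDF-" header that PDF files begin with, and looksLikePdf is a hypothetical helper name.

```typescript
// Hypothetical helper: check whether the bytes begin with the "%PDF-" header.
function looksLikePdf(bytes: Uint8Array): boolean {
  const header = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return (
    bytes.length >= header.length &&
    header.every((byte, index) => bytes[index] === byte)
  );
}

// "%PDF-1" as raw bytes.
console.log(looksLikePdf(new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31]))); // true
// DOCX files are ZIP archives and begin with "PK".
console.log(looksLikePdf(new Uint8Array([0x50, 0x4b, 0x03, 0x04]))); // false
```

Some PDFs carry a few bytes of padding before the header, so a strict prefix check like this one can reject files that lenient readers accept.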

OutputFormat

OutputFormat = "json" | "text"

Defined in: core/types.ts:7

Supported output formats for parsed documents.

"json" — Structured JSON with per-page text items, bounding boxes, and metadata.
"text" — Plain text with spatial layout preserved.

Functions

searchItems()

searchItems(items, options): JsonTextItem[]

Defined in: processing/searchItems.ts:26

Search text items for matches, returning synthetic merged items for each match.

For phrase searches, consecutive text items are concatenated and searched. When a phrase spans multiple items, the result is a single merged item with a combined bounding box and the matched text. Font metadata is taken from the first matched item.

Parameters

Example

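import { LiteParse, searchItems } from "@llamaindex/liteparse";

// Request JSON output so that result.json is populated.
const parser = new LiteParse({ outputFormat: "json" });
const result = await parser.parse("report.pdf");

// result.json is optional in ParseResult, so fall back to an empty page list.
for (const page of result.json?.pages ?? []) {
  // A phrase may span several adjacent text items; each match is one merged item.
  const matches = searchItems(page.textItems, { phrase: "0°C to 70°C" });
  for (const match of matches) {
    console.log(`Found "${match.text}" at (${match.x}, ${match.y})`);
  }
}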
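
The "combined bounding box" of a match that spans several items can be pictured as the smallest rectangle enclosing all of them. The sketch below shows that union for top-left-origin boxes. It is an illustration of the idea rather than the library's implementation, and ItemBox and mergeBoxes are hypothetical names.

```typescript
// Local subset of the documented item geometry (PDF points, top-left origin).
interface ItemBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical helper: smallest rectangle that encloses every input box.
function mergeBoxes(items: ItemBox[]): ItemBox {
  if (items.length === 0) {
    throw new Error("mergeBoxes requires at least one item");
  }
  let left = Infinity;
  let top = Infinity;
  let right = -Infinity;
  let bottom = -Infinity;
  for (const item of items) {
    left = Math.min(left, item.x);
    top = Math.min(top, item.y);
    right = Math.max(right, item.x + item.width);
    bottom = Math.max(bottom, item.y + item.height);
  }
  return { x: left, y: top, width: right - left, height: bottom - top };
}

// Two adjacent items on the same line, e.g. the two halves of a matched phrase.
const merged = mergeBoxes([
  { x: 10, y: 20, width: 30, height: 12 },
  { x: 42, y: 20, width: 25, height: 12 },
]);
console.log(merged); // x: 10, y: 20, width: 57, height: 12
```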