API Reference
API reference for the @llamaindex/liteparse TypeScript library.
LiteParse — open-source PDF parsing with spatial text extraction, OCR, and bounding boxes.
Example
import { LiteParse } from "@llamaindex/liteparse";

const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse("document.pdf");
console.log(result.text);

Classes
LiteParse
Defined in: core/parser.ts:58
Main document parser class. Handles PDF parsing, OCR, format conversion, and screenshot generation.
Examples
import { LiteParse } from "@llamaindex/liteparse";

// Example 1: plain text extraction with the default configuration
const parser = new LiteParse();
const result = await parser.parse("document.pdf");
console.log(result.text);

// Example 2: JSON output rendered at 300 DPI
const parser = new LiteParse({ outputFormat: "json", dpi: 300 });
const result = await parser.parse("document.pdf");
// result.json is optional, so guard against it being undefined
for (const page of result.json?.pages ?? []) {
  console.log(`Page ${page.page}: ${page.boundingBoxes.length} bounding boxes`);
}

// Example 3: OCR through an HTTP OCR server
const parser = new LiteParse({
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
});
const result = await parser.parse("scanned-document.pdf");

Constructors
Constructor
new LiteParse(userConfig?): LiteParse
Defined in: core/parser.ts:68
Create a new LiteParse instance.
Parameters
userConfig?
Partial<LiteParseConfig> = {}
Partial configuration to override defaults. See LiteParseConfig for all options.
Returns
LiteParse
Methods
getConfig()
getConfig(): LiteParseConfig
Defined in: core/parser.ts:444
Get a copy of the current configuration, including defaults merged with user overrides.
Returns
A shallow copy of the active LiteParseConfig.
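The merge-then-copy behavior described above can be pictured with a small sketch. ConfigSketch, SKETCH_DEFAULTS, and mergeConfig are hypothetical stand-ins that mirror only a few documented fields; the library's internal merge logic is not shown in this reference and may differ.

```typescript
// Local stand-in for a few LiteParseConfig fields (not the real interface).
interface ConfigSketch {
  ocrEnabled: boolean;
  ocrLanguage: string | string[];
  dpi: number;
  maxPages: number;
}

// Values taken from the "Default Value" entries in this reference.
const SKETCH_DEFAULTS: ConfigSketch = {
  ocrEnabled: true,
  ocrLanguage: "en",
  dpi: 150,
  maxPages: 1000,
};

// Hypothetical helper: user overrides win over defaults, key by key.
function mergeConfig(userConfig: Partial<ConfigSketch> = {}): ConfigSketch {
  return { ...SKETCH_DEFAULTS, ...userConfig };
}

const mergedConfig = mergeConfig({ dpi: 300 });

// A shallow copy: changing a top-level field on the copy leaves the original untouched.
const configCopy = { ...mergedConfig };
configCopy.dpi = 72;

console.log(mergedConfig.dpi, configCopy.dpi, mergedConfig.ocrEnabled); // 300 72 true
```

Because the copy is shallow, nested values such as an ocrLanguage array would still be shared between the copy and the original.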
parse()
parse(input, quiet?): Promise<ParseResult>
Defined in: core/parser.ts:100
Parse a document and return the extracted text, page data, and optionally structured JSON.
Supports PDFs natively. Non-PDF formats (DOCX, XLSX, images, etc.) are automatically converted to PDF before parsing if the required system tools are installed.
Parameters
input
LiteParseInput
A file path, Buffer, or Uint8Array containing document bytes.
When given raw bytes, PDF data is parsed directly with zero disk I/O.
Non-PDF bytes are written to a temp file for format conversion.
quiet?
boolean = false
If true, suppresses progress logging to stderr.
Returns
Promise<ParseResult>
Parsed document data including text, per-page info, and optional JSON.
Throws
Error if the file cannot be found, converted, or parsed.
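Since raw PDF bytes and non-PDF bytes take different paths, it can be useful to know in advance which path a buffer is likely to take. The sketch below checks for the standard "%PDF-" file signature. looksLikePdf is a hypothetical helper, not part of the LiteParse API, and this reference does not say how LiteParse itself detects PDF data.

```typescript
// Hypothetical helper: true when the bytes start with the "%PDF-" signature.
// Only offset 0 is checked, so files with leading junk bytes are not recognized.
function looksLikePdf(bytes: Uint8Array): boolean {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (bytes.length < signature.length) return false;
  return signature.every((byte, i) => bytes[i] === byte);
}

// "%PDF-1.7" as raw bytes.
const pdfBytes = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]);
// ZIP header ("PK\x03\x04"), the container format used by DOCX and XLSX.
const zipBytes = new Uint8Array([0x50, 0x4b, 0x03, 0x04]);

console.log(looksLikePdf(pdfBytes), looksLikePdf(zipBytes)); // true false
```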
screenshot()
screenshot(input, pageNumbers?, quiet?): Promise<ScreenshotResult[]>
Defined in: core/parser.ts:227
Generate screenshots of PDF pages as image buffers.
Uses PDFium for high-quality rendering. Each page is returned as a ScreenshotResult with the raw image buffer and dimensions.
Parameters
input
LiteParseInput
A file path, Buffer, or Uint8Array containing PDF bytes.
pageNumbers?
number[]
1-indexed page numbers to screenshot. If omitted, all pages are rendered.
quiet?
boolean = false
If true, suppresses progress logging to stderr.
Returns
Promise<ScreenshotResult[]>
Array of screenshot results, one per rendered page.
Interfaces
BoundingBox
Defined in: core/types.ts:259
An axis-aligned bounding box defined by its top-left and bottom-right corners.
All coordinates are in PDF points.
Deprecated
Use TextItem coordinates (x, y, width, height) instead. Will be removed in v2.0.
Properties
x1
x1: number
Defined in: core/types.ts:261
X coordinate of the top-left corner.
x2
x2: number
Defined in: core/types.ts:265
X coordinate of the bottom-right corner.
y1
y1: number
Defined in: core/types.ts:263
Y coordinate of the top-left corner.
y2
y2: number
Defined in: core/types.ts:267
Y coordinate of the bottom-right corner.
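For code that is migrating away from the deprecated corner-based form, the two representations convert into each other with simple arithmetic. The sketch below uses local stand-in types; RectSketch, toRect, and toCorners are hypothetical names and not part of the library.

```typescript
// Local stand-ins mirroring the documented shapes.
interface BoundingBoxSketch { x1: number; y1: number; x2: number; y2: number; }
interface RectSketch { x: number; y: number; width: number; height: number; }

// Corner form -> origin-plus-size form (the style used by TextItem).
function toRect(box: BoundingBoxSketch): RectSketch {
  return { x: box.x1, y: box.y1, width: box.x2 - box.x1, height: box.y2 - box.y1 };
}

// Origin-plus-size form -> corner form.
function toCorners(rect: RectSketch): BoundingBoxSketch {
  return { x1: rect.x, y1: rect.y, x2: rect.x + rect.width, y2: rect.y + rect.height };
}

const sampleRect = toRect({ x1: 72, y1: 100, x2: 172, y2: 112 });
console.log(sampleRect); // { x: 72, y: 100, width: 100, height: 12 }
```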
JsonTextItem
Defined in: core/types.ts:294
A text element from the JSON output with position, size, and font metadata.
Properties
confidence?
optional confidence: number
Defined in: core/types.ts:310
OCR confidence score (null if OCR was not used).
fontName?
optional fontName: string
Defined in: core/types.ts:306
Font name.
fontSize?
optional fontSize: number
Defined in: core/types.ts:308
Font size in PDF points.
height
height: number
Defined in: core/types.ts:304
Height of the text item in PDF points.
text
text: string
Defined in: core/types.ts:296
The text content of this item.
width
width: number
Defined in: core/types.ts:302
Width of the text item in PDF points.
x
x: number
Defined in: core/types.ts:298
X coordinate of the top-left corner, in PDF points.
y
y: number
Defined in: core/types.ts:300
Y coordinate of the top-left corner, in PDF points.
LiteParseConfig
Defined in: core/types.ts:34
Configuration options for the LiteParse parser.
All fields have sensible defaults. Pass a Partial<LiteParseConfig> to the constructor to override only the options you need.
Example
const parser = new LiteParse({
  ocrEnabled: true,
  ocrLanguage: "fra",
  dpi: 300,
  outputFormat: "json",
});

Properties
dpi
dpi: number
Defined in: core/types.ts:98
DPI (dots per inch) for rendering pages to images. Higher values improve OCR accuracy but increase processing time and memory usage.
Default Value
150
maxPages
maxPages: number
Defined in: core/types.ts:83
Maximum number of pages to parse from the document.
Default Value
1000
numWorkers
numWorkers: number
Defined in: core/types.ts:76
Number of pages to OCR in parallel. Higher values use more memory but process faster on multi-core machines.
Default Value
CPU cores - 1 (minimum 1)
ocrEnabled
ocrEnabled: boolean
Defined in: core/types.ts:49
Whether to run OCR on pages with little or no native text. When enabled, LiteParse selectively OCRs only images and text-sparse regions.
Default Value
true
ocrLanguage
ocrLanguage: string | string[]
Defined in: core/types.ts:41
OCR language code(s). Uses ISO 639-3 codes for Tesseract (e.g., "eng", "fra") or ISO 639-1 codes for HTTP OCR servers (e.g., "en", "fr").
Default Value
"en"
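Because the expected code style depends on which OCR backend is in use, a caller may want to normalize codes before building a configuration. The sketch below covers four common languages. ocrLanguageFor and the lookup table are hypothetical and not part of LiteParse, and this reference does not state whether the library performs any such normalization itself.

```typescript
// Small ISO 639-1 -> ISO 639-3 lookup for illustration (not exhaustive).
const ISO_639_1_TO_3: Record<string, string> = {
  en: "eng",
  fr: "fra",
  de: "deu",
  es: "spa",
};

// Two-letter code -> three-letter code; unknown codes pass through unchanged.
function toThreeLetter(code: string): string {
  return ISO_639_1_TO_3[code] ?? code;
}

// Three-letter code -> two-letter code; unknown codes pass through unchanged.
function toTwoLetter(code: string): string {
  const entry = Object.entries(ISO_639_1_TO_3).find(([, three]) => three === code);
  return entry ? entry[0] : code;
}

// Hypothetical helper: pick the code style for the backend implied by ocrServerUrl.
function ocrLanguageFor(code: string, ocrServerUrl?: string): string {
  return ocrServerUrl ? toTwoLetter(code) : toThreeLetter(code);
}

console.log(ocrLanguageFor("fr")); // "fra"
console.log(ocrLanguageFor("fra", "http://localhost:8828/ocr")); // "fr"
```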
ocrServerUrl?
optional ocrServerUrl: string
Defined in: core/types.ts:57
URL of an HTTP OCR server implementing the LiteParse OCR API. If not provided, the built-in Tesseract.js engine is used.
outputFormat
outputFormat: OutputFormat
Defined in: core/types.ts:105
Output format for parsed results.
Default Value
"json"
password?
optional password: string
Defined in: core/types.ts:138
Password for opening encrypted/protected documents. Used for password-protected PDFs and office documents.
Default Value
undefined
preciseBoundingBox
preciseBoundingBox: boolean
Defined in: core/types.ts:116
Calculate precise bounding boxes for each text line. Disable for faster parsing when bounding boxes aren’t needed.
Deprecated
Controls the deprecated boundingBoxes output. Will be removed in v2.0.
Text item coordinates (x, y, width, height) are always present regardless of this setting.
Default Value
true
preserveLayoutAlignmentAcrossPages
preserveLayoutAlignmentAcrossPages: boolean
Defined in: core/types.ts:130
Maintain consistent text alignment across page boundaries.
Default Value
false
preserveVerySmallText
preserveVerySmallText: boolean
Defined in: core/types.ts:123
Preserve very small text that would normally be filtered out.
Default Value
false
targetPages?
optional targetPages: string
Defined in: core/types.ts:90
Specific pages to parse, as a comma-separated string of page numbers and ranges.
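As an illustration of the format, a specification such as "1-3,7" could be expanded into individual page numbers roughly as follows. expandTargetPages is a hypothetical helper, not part of the LiteParse API, and the library's own parsing and validation rules may differ.

```typescript
// Hypothetical helper: expand "1-3,7" into [1, 2, 3, 7] (sorted, de-duplicated).
function expandTargetPages(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const trimmed = part.trim();
    if (trimmed === "") continue; // tolerate stray commas
    // Either a single page ("10") or an inclusive range ("15-20").
    const match = /^(\d+)(?:-(\d+))?$/.exec(trimmed);
    if (!match) throw new Error(`Invalid page specification: "${trimmed}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${trimmed}"`);
    }
    for (let page = start; page <= end; page++) pages.add(page);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(expandTargetPages("1-3,7")); // [ 1, 2, 3, 7 ]
```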
Example
"1-5,10,15-20"
tessdataPath?
optional tessdataPath: string
Defined in: core/types.ts:68
Path to a directory containing Tesseract .traineddata files.
Used as both the language data source and cache directory for Tesseract.js.
If not set, falls back to the TESSDATA_PREFIX environment variable.
If neither is set, Tesseract.js downloads data from cdn.jsdelivr.net.
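The fallback order described above can be summarized as a small decision function. This is a sketch of the documented precedence only: resolveTessdataSource and TessdataSource are hypothetical names, and the environment is passed in as a plain object so the logic stays easy to test.

```typescript
// Hypothetical result type: either a local directory or the CDN fallback.
type TessdataSource =
  | { kind: "directory"; path: string }
  | { kind: "cdn"; host: string };

// Precedence: explicit tessdataPath, then TESSDATA_PREFIX, then the CDN.
function resolveTessdataSource(
  tessdataPath: string | undefined,
  env: Record<string, string | undefined>,
): TessdataSource {
  if (tessdataPath) return { kind: "directory", path: tessdataPath };
  const fromEnv = env["TESSDATA_PREFIX"];
  if (fromEnv) return { kind: "directory", path: fromEnv };
  return { kind: "cdn", host: "cdn.jsdelivr.net" };
}

console.log(resolveTessdataSource("/opt/tessdata", {}));
console.log(resolveTessdataSource(undefined, { TESSDATA_PREFIX: "/usr/share/tessdata" }));
console.log(resolveTessdataSource(undefined, {}));
```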
Example
/opt/tessdata
MarkupData
Defined in: core/types.ts:185
Markup annotation data associated with a text item.
Properties
highlight?
optional highlight: string
Defined in: core/types.ts:187
Highlight color (e.g., "yellow", "#FFFF00"), or undefined if not highlighted.
squiggly?
optional squiggly: boolean
Defined in: core/types.ts:191
Whether the text has a squiggly underline.
strikeout?
optional strikeout: boolean
Defined in: core/types.ts:193
Whether the text is struck out.
underline?
optional underline: boolean
Defined in: core/types.ts:189
Whether the text is underlined.
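Since every MarkupData field is optional, consumers need to handle missing values. A minimal sketch that turns the annotations into readable labels follows. MarkupSketch and describeMarkup are hypothetical names that only mirror the fields documented above.

```typescript
// Local stand-in mirroring the documented MarkupData fields.
interface MarkupSketch {
  highlight?: string;
  underline?: boolean;
  squiggly?: boolean;
  strikeout?: boolean;
}

// Hypothetical helper: list the annotations that are actually set.
function describeMarkup(markup?: MarkupSketch): string[] {
  if (!markup) return [];
  const labels: string[] = [];
  if (markup.highlight) labels.push(`highlight:${markup.highlight}`);
  if (markup.underline) labels.push("underline");
  if (markup.squiggly) labels.push("squiggly");
  if (markup.strikeout) labels.push("strikeout");
  return labels;
}

console.log(describeMarkup({ highlight: "yellow", strikeout: true }));
// [ 'highlight:yellow', 'strikeout' ]
```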
ParsedPage
Defined in: core/types.ts:273
Parsed data for a single page of a document.
Properties
boundingBoxes?
optional boundingBoxes: BoundingBox[]
Defined in: core/types.ts:288
Deprecated
Use TextItem coordinates instead. Will be removed in v2.0. Present when LiteParseConfig.preciseBoundingBox is enabled.
height
height: number
Defined in: core/types.ts:279
Page height in PDF points.
pageNum
pageNum: number
Defined in: core/types.ts:275
1-indexed page number.
text
text: string
Defined in: core/types.ts:281
Full text content of the page with spatial layout preserved.
textItems
textItems: TextItem[]
Defined in: core/types.ts:283
Individual text elements extracted from the page.
width
width: number
Defined in: core/types.ts:277
Page width in PDF points.
ParseResult
Defined in: core/types.ts:354
The result of parsing a document with LiteParse.parse.
Properties
json?
optional json: ParseResultJson
Defined in: core/types.ts:360
Structured JSON data. Present when LiteParseConfig.outputFormat is "json".
pages
pages: ParsedPage[]
Defined in: core/types.ts:356
Per-page parsed data.
text
text: string
Defined in: core/types.ts:358
Full document text, concatenated from all pages.
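Because json is optional, code that reads it should handle the case where it is absent, for example when the output format is "text". A minimal sketch with local stand-in types follows. jsonPagesOf and the Sketch types are hypothetical and cover only the fields used here.

```typescript
// Local stand-ins covering only the fields used in this sketch.
interface JsonPageSketch { page: number; text: string; }
interface ParseResultSketch {
  text: string;
  json?: { pages: JsonPageSketch[] };
}

// Hypothetical helper: JSON pages if present, otherwise an empty array.
function jsonPagesOf(result: ParseResultSketch): JsonPageSketch[] {
  return result.json?.pages ?? [];
}

const textOnlyResult: ParseResultSketch = { text: "hello" };
const jsonResult: ParseResultSketch = {
  text: "hello",
  json: { pages: [{ page: 1, text: "hello" }] },
};

console.log(jsonPagesOf(textOnlyResult).length, jsonPagesOf(jsonResult).length); // 0 1
```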
ParseResultJson
Defined in: core/types.ts:331
Structured JSON representation of parsed document data.
Returned when LiteParseConfig.outputFormat is "json".
Properties
pages
pages: object[]
Defined in: core/types.ts:333
Array of page data. Each page object has the following properties:
boundingBoxes
boundingBoxes: BoundingBox[]
Deprecated
Use textItems coordinates instead. Will be removed in v2.0.
height
height: number
Page height in PDF points.
page
page: number
1-indexed page number.
text
text: string
Full text content of the page.
textItems
textItems: JsonTextItem[]
Individual text elements with position and font metadata.
width
width: number
Page width in PDF points.
ScreenshotResult
Defined in: core/types.ts:366
The result of generating a screenshot with LiteParse.screenshot.
Properties
height
height: number
Defined in: core/types.ts:372
Image height in pixels.
imageBuffer
imageBuffer: Buffer
Defined in: core/types.ts:374
Raw image data as a Node.js Buffer (PNG or JPG).
imagePath?
optional imagePath: string
Defined in: core/types.ts:376
File path if the screenshot was saved to disk.
pageNum
pageNum: number
Defined in: core/types.ts:368
1-indexed page number.
width
width: number
Defined in: core/types.ts:370
Image width in pixels.
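Screenshot dimensions are in pixels while page and text item dimensions are in PDF points (1 point = 1/72 inch). The sketch below converts between the two under the assumption that pages are rendered at the configured dpi and that sizes are rounded to whole pixels; neither detail is confirmed by this reference. pointsToPixels is a hypothetical helper.

```typescript
const POINTS_PER_INCH = 72;

// Hypothetical helper: convert a length in PDF points to pixels at a given DPI.
function pointsToPixels(points: number, dpi: number): number {
  return Math.round((points * dpi) / POINTS_PER_INCH);
}

// US Letter is 612 x 792 points; at the default 150 DPI this gives 1275 x 1650 pixels.
const letterWidthPx = pointsToPixels(612, 150);
const letterHeightPx = pointsToPixels(792, 150);
console.log(letterWidthPx, letterHeightPx); // 1275 1650
```

The same factor (dpi / 72) can be applied to TextItem coordinates to overlay them on a rendered page image.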
SearchItemsOptions
Defined in: core/types.ts:316
Options for searchItems.
Properties
caseSensitive?
optional caseSensitive: boolean
Defined in: core/types.ts:324
Whether the search should be case-sensitive.
Default Value
false
phrase
phrase: string
Defined in: core/types.ts:318
Find text items containing this phrase. Matches can span multiple adjacent items.
TextItem
Defined in: core/types.ts:147
An individual text element extracted from a page, with position, size, and font metadata.
Coordinates are in PDF points with the origin at the top-left of the page; x increases to the right and y increases downward.
Properties
confidence?
optional confidence: number
Defined in: core/types.ts:179
Confidence score from 0.0 to 1.0. Native PDF text defaults to 1.0; OCR text reflects the engine's confidence.
fontName?
optional fontName: string
Defined in: core/types.ts:163
Font name (e.g., "Helvetica", "Times-Roman", "OCR" for OCR-detected text).
fontSize?
optional fontSize: number
Defined in: core/types.ts:165
Font size in PDF points.
h
h: number
Defined in: core/types.ts:161
Alias for height.
height
height: number
Defined in: core/types.ts:157
Height of the text item in PDF points.
markup?
optional markup: MarkupData
Defined in: core/types.ts:173
Markup annotations (highlights, underlines, etc.) applied to this text.
r?
optional r: number
Defined in: core/types.ts:167
Rotation angle in degrees. One of 0, 90, 180, or 270.
rx?
optional rx: number
Defined in: core/types.ts:169
X coordinate after rotation transformation.
ry?
optional ry: number
Defined in: core/types.ts:171
Y coordinate after rotation transformation.
str
str: string
Defined in: core/types.ts:149
The text content of this item.
w
w: number
Defined in: core/types.ts:159
Alias for width.
width
width: number
Defined in: core/types.ts:155
Width of the text item in PDF points.
x
x: number
Defined in: core/types.ts:151
X coordinate of the top-left corner, in PDF points.
y
y: number
Defined in: core/types.ts:153
Y coordinate of the top-left corner, in PDF points.
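One common use of the confidence field is dropping low-confidence OCR text. The sketch below uses a local stand-in type and treats a missing confidence as 1.0, which is an assumption based on the documented default for native PDF text. confidentItems is a hypothetical helper.

```typescript
// Local stand-in with only the fields this sketch needs.
interface TextItemSketch {
  str: string;
  confidence?: number;
}

// Hypothetical helper: keep items whose confidence meets the threshold.
// Assumption: a missing confidence is treated as 1.0 (native PDF text).
function confidentItems(items: TextItemSketch[], minConfidence: number): TextItemSketch[] {
  return items.filter((item) => (item.confidence ?? 1.0) >= minConfidence);
}

const sampleItems: TextItemSketch[] = [
  { str: "Invoice", confidence: 1.0 },
  { str: "lnv0ice", confidence: 0.42 },
  { str: "Total" },
];

console.log(confidentItems(sampleItems, 0.8).map((item) => item.str)); // [ 'Invoice', 'Total' ]
```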
Type Aliases
LiteParseInput
LiteParseInput = string | Buffer | Uint8Array
Defined in: core/types.ts:16
Accepted input types for LiteParse.parse and LiteParse.screenshot.
string: A file path to a document on disk.
Buffer | Uint8Array: Raw file bytes (PDF bytes go straight to the parser with zero disk I/O; non-PDF bytes are written to a temp file for format conversion).
OutputFormat
OutputFormat = "json" | "text"
Defined in: core/types.ts:7
Supported output formats for parsed documents.
"json": Structured JSON with per-page text items, bounding boxes, and metadata.
"text": Plain text with spatial layout preserved.
Functions
searchItems()
searchItems(items, options): JsonTextItem[]
Defined in: processing/searchItems.ts:26
Search text items for matches, returning synthetic merged items for each match.
For phrase searches, consecutive text items are concatenated and searched. When a phrase spans multiple items, the result is a single merged item with combined bounding box and the matched text. Font metadata is taken from the first matched item.
Parameters
items
The text items to search.
options
SearchItemsOptions
Returns
JsonTextItem[]
Example
import { LiteParse, searchItems } from "@llamaindex/liteparse";

const parser = new LiteParse({ outputFormat: "json" });
const result = await parser.parse("report.pdf");

// result.json is optional, so guard against it being undefined
for (const page of result.json?.pages ?? []) {
  const matches = searchItems(page.textItems, { phrase: "0°C to 70°C" });
  for (const match of matches) {
    console.log(`Found "${match.text}" at (${match.x}, ${match.y})`);
  }
}
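The "combined bounding box" of a match that spans several items can be understood as the smallest rectangle enclosing all of them. The sketch below shows that union for items in the x, y, width, height form. unionBox is a hypothetical helper written for illustration; it is not the library's implementation, which may handle edge cases differently.

```typescript
// Local stand-in for the positional fields of a text item.
interface BoxSketch { x: number; y: number; width: number; height: number; }

// Hypothetical helper: smallest rectangle that encloses every input box.
function unionBox(boxes: BoxSketch[]): BoxSketch {
  if (boxes.length === 0) throw new Error("unionBox needs at least one box");
  const left = Math.min(...boxes.map((b) => b.x));
  const topEdge = Math.min(...boxes.map((b) => b.y));
  const right = Math.max(...boxes.map((b) => b.x + b.width));
  const bottom = Math.max(...boxes.map((b) => b.y + b.height));
  return { x: left, y: topEdge, width: right - left, height: bottom - topEdge };
}

// Two adjacent items on the same line, e.g. "0°C to" followed by "70°C".
const mergedBox = unionBox([
  { x: 10, y: 20, width: 30, height: 12 },
  { x: 45, y: 20, width: 25, height: 12 },
]);
console.log(mergedBox); // { x: 10, y: 20, width: 60, height: 12 }
```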