Generate Extraction Schema
Generate a JSON schema and return a product configuration request.
Parameters
params: ExtractGenerateSchemaParams { organization_id, project_id, data_schema, 3 more }
organization_id?: string | null
Query param
project_id?: string | null
Query param
data_schema?: Record<string, Record<string, unknown> | Array<unknown> | string | 2 more | null> | null
Body param: Optional schema to validate, refine, or extend
file_id?: string | null
Body param: Optional file ID to analyze for schema generation
name?: string | null
Body param: Name for the generated configuration (auto-generated if omitted)
prompt?: string | null
Body param: Natural language description of the data structure to extract
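As a rough sketch of how these parameters could be assembled, the TypeScript below mirrors the documented fields in a local interface and separates the query parameters from the body parameters. The interface `GenerateSchemaParamsSketch`, the helper `splitParams`, and the placeholder ID are illustrative names defined here, not part of the SDK, and `data_schema` is simplified to `Record<string, unknown>`.

```typescript
// Local mirror of the documented parameters (illustrative, not the SDK's own type).
interface GenerateSchemaParamsSketch {
  organization_id?: string | null; // query param
  project_id?: string | null; // query param
  data_schema?: Record<string, unknown> | null; // body param (type simplified here)
  file_id?: string | null; // body param
  name?: string | null; // body param
  prompt?: string | null; // body param
}

// Hypothetical helper: split params into query and body parts,
// dropping fields that are null or undefined.
function splitParams(params: GenerateSchemaParamsSketch): {
  query: Record<string, string>;
  body: Record<string, unknown>;
} {
  const query: Record<string, string> = {};
  const body: Record<string, unknown> = {};
  if (params.organization_id != null) query.organization_id = params.organization_id;
  if (params.project_id != null) query.project_id = params.project_id;
  if (params.data_schema != null) body.data_schema = params.data_schema;
  if (params.file_id != null) body.file_id = params.file_id;
  if (params.name != null) body.name = params.name;
  if (params.prompt != null) body.prompt = params.prompt;
  return { query, body };
}

const exampleRequest = splitParams({
  project_id: "proj_123", // placeholder ID, assumed for illustration
  prompt: "Extract the invoice number, issue date, and total amount",
  name: null, // omitted from the body, so the name would be auto-generated
});
console.log(JSON.stringify(exampleRequest));
```

Every parameter is optional, so a prompt alone, a file ID alone, or an existing schema to refine are all plausible starting points.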
Returns
ExtractGenerateSchemaResponse { name, parameters }
Request body for creating a product configuration.
name: string
Human-readable name for this configuration.
parameters: SplitV1Parameters { categories, product_type, splitting_strategy } | ExtractV2Parameters { data_schema, product_type, cite_sources, 10 more } | ClassifyV2Parameters { product_type, rules, mode, parsing_configuration } | 2 more
Product-specific configuration parameters.
SplitV1Parameters { categories, product_type, splitting_strategy }
Typed parameters for a split v1 product configuration.
Categories to split documents into.
name: string
Name of the category.
description?: string | null
Optional description of what content belongs in this category.
product_type: "split_v1"
Product type.
splitting_strategy?: SplittingStrategy { allow_uncategorized }
Strategy for splitting documents.
allow_uncategorized?: "include" | "forbid" | "omit"
Controls handling of pages that don't match any category. 'include': pages can be grouped as 'uncategorized' and included in results. 'forbid': all pages must be assigned to a defined category. 'omit': pages can be classified as 'uncategorized' but are excluded from results.
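To make the three `allow_uncategorized` modes concrete, the sketch below simulates them locally on a list of page assignments. `PageAssignment` and `applyUncategorizedPolicy` are hypothetical names, and throwing an error under 'forbid' is an assumption made for illustration; the service itself may instead force every page into a defined category.

```typescript
type AllowUncategorized = "include" | "forbid" | "omit";

interface PageAssignment {
  page: number;
  category: string | null; // null means no defined category matched
}

// Hypothetical local simulation of the documented policy semantics.
function applyUncategorizedPolicy(
  pages: PageAssignment[],
  policy: AllowUncategorized
): PageAssignment[] {
  if (policy === "include") {
    // Unmatched pages are kept and grouped as "uncategorized".
    return pages.map((p) => ({ page: p.page, category: p.category ?? "uncategorized" }));
  }
  if (policy === "omit") {
    // Unmatched pages are excluded from the results.
    return pages.filter((p) => p.category !== null);
  }
  // "forbid": every page must have a defined category (error handling assumed here).
  const unmatched = pages.filter((p) => p.category === null);
  if (unmatched.length > 0) {
    throw new Error("Pages without a category: " + unmatched.map((p) => p.page).join(", "));
  }
  return pages;
}

const samplePages: PageAssignment[] = [
  { page: 1, category: "invoice" },
  { page: 2, category: null },
];
console.log(applyUncategorizedPolicy(samplePages, "include").length); // 2
console.log(applyUncategorizedPolicy(samplePages, "omit").length); // 1
```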
ExtractV2Parameters { data_schema, product_type, cite_sources, 10 more }
Typed parameters for an extract v2 product configuration.
data_schema: Record<string, Record<string, unknown> | Array<unknown> | string | 2 more | null>
JSON Schema defining the fields to extract. Validate with the /schema/validate endpoint first.
product_type: "extract_v2"
Product type.
cite_sources?: boolean
Include citations in results
confidence_scores?: boolean
Include confidence scores in results
extract_version?: string
Extract algorithm version. Use 'latest' or a date string.
extraction_target?: "per_doc" | "per_page" | "per_table_row"
Granularity of extraction: per_doc returns one object per document, per_page returns one object per page, per_table_row returns one object per table row
lang?: string
ISO 639-1 language code for the document
max_pages?: number | null
Maximum number of pages to process. Omit for no limit.
parse_config_id?: string | null
Saved parse configuration ID to control how the document is parsed before extraction
parse_tier?: string | null
Parse tier to use before extraction (fast, cost_effective, or agentic)
system_prompt?: string | null
Custom system prompt to guide extraction behavior
target_pages?: string | null
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
tier?: "cost_effective" | "agentic"
Extract tier: cost_effective (5 credits/page) or agentic (15 credits/page)
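The per-page credit figures above lend themselves to a quick estimate. The following sketch assumes that `max_pages` caps the number of billed pages and that only the extract tier contributes to the cost (any parse-tier cost is ignored); `estimateExtractCredits` and the sample configuration values are illustrative, not taken from the SDK.

```typescript
type ExtractTier = "cost_effective" | "agentic";

// Credits per page as stated in the tier description.
const CREDITS_PER_PAGE: Record<ExtractTier, number> = {
  cost_effective: 5,
  agentic: 15,
};

// Hypothetical estimator: pages processed times credits per page.
function estimateExtractCredits(
  totalPages: number,
  tier: ExtractTier,
  maxPages?: number | null
): number {
  // Assumption: when max_pages is set, at most that many pages are processed.
  const pages = maxPages == null ? totalPages : Math.min(totalPages, maxPages);
  return pages * CREDITS_PER_PAGE[tier];
}

// A sample extract v2 parameter object using documented field names.
const extractParameters = {
  product_type: "extract_v2" as const,
  data_schema: {
    type: "object",
    properties: { invoice_number: { type: "string" } },
  },
  tier: "agentic" as ExtractTier,
  max_pages: 10,
  cite_sources: true,
};

// 40-page document capped at 10 pages on the agentic tier: 10 * 15 = 150
console.log(estimateExtractCredits(40, extractParameters.tier, extractParameters.max_pages));
```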
ClassifyV2Parameters { product_type, rules, mode, parsing_configuration }
Typed parameters for a classify v2 product configuration.
product_type: "classify_v2"
Product type.
rules: Array<Rule>
Classify rules to evaluate against the document (at least one required)
description: string
Natural language criteria for matching this rule
type: string
Document type to assign when rule matches
mode?: "FAST"
Classify execution mode
parsing_configuration?: ParsingConfiguration | null
Parsing configuration for classify jobs.
lang?: string
ISO 639-1 language code for the document
max_pages?: number | null
Maximum number of pages to process. Omit for no limit.
target_pages?: string | null
Comma-separated page numbers or ranges to process (1-based). Omit to process all pages.
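For readers who want to validate a `target_pages` string before sending it, here is a minimal parser sketch. It assumes ranges are written with a hyphen (for example `3-5`), which the description above does not spell out; `parseTargetPages` is a hypothetical helper, not an SDK function.

```typescript
// Hypothetical helper: expand a 1-based, comma-separated page spec
// such as "1, 3-5" into a sorted list of unique page numbers.
function parseTargetPages(spec: string): number[] {
  const pages: number[] = [];
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error('Invalid page token: "' + token + '"');
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    // Pages are 1-based, and a range must not run backwards.
    if (start < 1 || end < start) throw new Error('Invalid page range: "' + token + '"');
    for (let p = start; p <= end; p++) {
      if (pages.indexOf(p) === -1) pages.push(p);
    }
  }
  return pages.sort((a, b) => a - b);
}

console.log(parseTargetPages("1, 3-5, 4")); // [ 1, 3, 4, 5 ]
```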
ParseV2Parameters { product_type, tier, version, 11 more }
Configuration for LlamaParse v2 document parsing.
Includes tier selection, processing options, output formatting, page targeting, and webhook delivery. Refer to the LlamaParse documentation for details on each field.
product_type: "parse_v2"
Product type.
tier: "fast" | "cost_effective" | "agentic" | "agentic_plus"
Parsing tier: 'fast' (rule-based, cheapest), 'cost_effective' (balanced), 'agentic' (AI-powered with custom prompts), or 'agentic_plus' (premium AI with highest accuracy)
version: "2025-12-11" | "2025-12-18" | "2025-12-31" | 31 more | (string & {})
Tier version. Use 'latest' for the current stable version, or pin a specific version (e.g., '1.0', '2.0') for reproducible results
agentic_options?: AgenticOptions | null
Options for AI-powered parsing tiers (cost_effective, agentic, agentic_plus).
These options customize how the AI processes and interprets document content. Only applicable when using non-fast tiers.
custom_prompt?: string | null
Custom instructions for the AI parser. Use to guide extraction behavior, specify output formatting, or provide domain-specific context. Example: 'Extract financial tables with currency symbols. Format dates as YYYY-MM-DD.'
client_name?: string | null
Identifier for the client/application making the request. Used for analytics and debugging. Example: 'my-app-v2'
crop_box?: CropBox { bottom, left, right, top }
Crop boundaries to process only a portion of each page. Values are ratios 0-1 from page edges
bottom?: number | null
Bottom boundary as ratio (0-1). 0=top edge, 1=bottom edge. Content below this line is excluded
left?: number | null
Left boundary as ratio (0-1). 0=left edge, 1=right edge. Content left of this line is excluded
right?: number | null
Right boundary as ratio (0-1). 0=left edge, 1=right edge. Content right of this line is excluded
top?: number | null
Top boundary as ratio (0-1). 0=top edge, 1=bottom edge. Content above this line is excluded
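The four ratios can be read as positions along each axis, following the per-field descriptions above. The sketch below converts a crop box into a pixel rectangle under that reading; it assumes an unset boundary defaults to the corresponding page edge, and `cropBoxToPixels` and `PixelRect` are illustrative names.

```typescript
interface CropBoxSketch {
  bottom?: number | null;
  left?: number | null;
  right?: number | null;
  top?: number | null;
}

interface PixelRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical helper: map ratio boundaries to the kept region in pixels.
function cropBoxToPixels(box: CropBoxSketch, pageWidth: number, pageHeight: number): PixelRect {
  // Assumption: missing boundaries fall back to the page edges.
  const top = box.top ?? 0;
  const left = box.left ?? 0;
  const bottom = box.bottom ?? 1;
  const right = box.right ?? 1;
  for (const v of [top, left, bottom, right]) {
    if (v < 0 || v > 1) throw new Error("Crop ratios must be between 0 and 1");
  }
  if (right <= left || bottom <= top) throw new Error("Crop box is empty");
  return {
    x: left * pageWidth,
    y: top * pageHeight,
    width: (right - left) * pageWidth,
    height: (bottom - top) * pageHeight,
  };
}

// Keep the middle half of an 800 x 1000 page vertically.
console.log(cropBoxToPixels({ top: 0.25, bottom: 0.75 }, 800, 1000));
// { x: 0, y: 250, width: 800, height: 500 }
```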
disable_cache?: boolean | null
Bypass result caching and force re-parsing. Use when document content may have changed or you need fresh results
fast_options?: unknown
Options for fast tier parsing (rule-based, no AI).
Fast tier uses deterministic algorithms for text extraction without AI enhancement. It's the fastest and most cost-effective option, best suited for simple documents with standard layouts. Currently has no configurable options but reserved for future expansion.
input_options?: InputOptions { html, pdf, presentation, spreadsheet }
Format-specific options (HTML, PDF, spreadsheet, presentation). Applied based on detected input file type
html?: HTML { make_all_elements_visible, remove_fixed_elements, remove_navigation_elements }
HTML/web page parsing options (applies to .html, .htm files)
make_all_elements_visible?: boolean | null
Force all HTML elements to be visible by overriding CSS display/visibility properties. Useful for parsing pages with hidden content or collapsed sections
remove_fixed_elements?: boolean | null
Remove fixed-position elements (headers, footers, floating buttons) that appear on every page render
remove_navigation_elements?: boolean | null
Remove navigation elements (nav bars, sidebars, menus) to focus on main content
pdf?: unknown
PDF-specific parsing options (applies to .pdf files)
presentation?: Presentation { out_of_bounds_content, skip_embedded_data }
Presentation parsing options (applies to .pptx, .ppt, .odp, .key files)
out_of_bounds_content?: boolean | null
Extract content positioned outside the visible slide area. Some presentations have hidden notes or content that extends beyond slide boundaries
skip_embedded_data?: boolean | null
Skip extraction of embedded chart data tables. When true, only the visual representation of charts is captured, not the underlying data
spreadsheet?: Spreadsheet { detect_sub_tables_in_sheets, force_formula_computation_in_sheets, include_hidden_sheets }
Spreadsheet parsing options (applies to .xlsx, .xls, .csv, .ods files)
detect_sub_tables_in_sheets?: boolean | null
Detect and extract multiple tables within a single sheet. Useful when spreadsheets contain several data regions separated by blank rows/columns
force_formula_computation_in_sheets?: boolean | null
Compute formula results instead of extracting formula text. Use when you need calculated values rather than formula definitions
include_hidden_sheets?: boolean | null
Parse hidden sheets in addition to visible ones. By default, hidden sheets are skipped
output_options?: OutputOptions { extract_printed_page_number, images_to_save, markdown, 2 more }
Output formatting options for markdown, text, and extracted images
extract_printed_page_number?: boolean | null
Extract the printed page number as it appears in the document (e.g., 'Page 5 of 10', 'v', 'A-3'). Useful for referencing original page numbers
images_to_save?: Array<"screenshot" | "embedded" | "layout">
Image categories to extract and save. Options: 'screenshot' (full page renders useful for visual QA), 'embedded' (images found within the document), 'layout' (cropped regions from layout detection like figures and diagrams). Empty list saves no images
markdown?: Markdown { annotate_links, inline_images, tables }
Markdown formatting options including table styles and link annotations
annotate_links?: boolean | null
Add link annotations to markdown output in the format [text](url). When false, only the link text is included
inline_images?: boolean | null
Embed images directly in markdown as base64 data URIs instead of extracting them as separate files. Useful for self-contained markdown output
tables?: Tables { compact_markdown_tables, markdown_table_multiline_separator, merge_continued_tables, output_tables_as_markdown }
Table formatting options including markdown vs HTML format and merging behavior
compact_markdown_tables?: boolean | null
Remove extra whitespace padding in markdown table cells for more compact output
markdown_table_multiline_separator?: string | null
Separator string for multiline cell content in markdown tables. Example: '\n' to preserve line breaks, ' ' to join with spaces
merge_continued_tables?: boolean | null
Automatically merge tables that span multiple pages into a single table. The merged table appears on the first page with merged_from_pages metadata
output_tables_as_markdown?: boolean | null
Output tables as markdown pipe tables instead of HTML
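Pulling the ParseV2Parameters fields together, the following sketch shows one possible configuration object and a small consistency check based on the note that agentic options only apply to non-fast tiers. The `ParseV2ParametersSketch` interface covers only a subset of the documented fields, and `findConfigWarnings` is a hypothetical local helper, not something the SDK provides.

```typescript
type ParseTier = "fast" | "cost_effective" | "agentic" | "agentic_plus";

// Partial local mirror of the documented fields (illustrative only).
interface ParseV2ParametersSketch {
  product_type: "parse_v2";
  tier: ParseTier;
  version: string;
  agentic_options?: { custom_prompt?: string | null } | null;
  disable_cache?: boolean | null;
  output_options?: {
    images_to_save?: Array<"screenshot" | "embedded" | "layout">;
    markdown?: {
      tables?: {
        merge_continued_tables?: boolean | null;
        output_tables_as_markdown?: boolean | null;
      };
    };
  };
}

// Hypothetical check: flag options that the selected tier would not use.
function findConfigWarnings(config: ParseV2ParametersSketch): string[] {
  const warnings: string[] = [];
  if (config.tier === "fast" && config.agentic_options?.custom_prompt) {
    warnings.push("agentic_options.custom_prompt is set, but the 'fast' tier does not use agentic options");
  }
  return warnings;
}

const parseConfig: ParseV2ParametersSketch = {
  product_type: "parse_v2",
  tier: "agentic",
  version: "latest",
  agentic_options: {
    custom_prompt: "Extract financial tables with currency symbols. Format dates as YYYY-MM-DD.",
  },
  output_options: {
    images_to_save: ["embedded"],
    markdown: { tables: { merge_continued_tables: true, output_tables_as_markdown: true } },
  },
};

console.log(findConfigWarnings(parseConfig).length); // 0
console.log(findConfigWarnings({ ...parseConfig, tier: "fast" }).length); // 1
```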