Parse File
Parse a file by file ID or URL.
Provide either file_id (a previously uploaded file) or
source_url (a publicly accessible URL). Configure parsing
with options like tier, target_pages, and lang.
Tiers
fast: rule-based, cheapest, no AI
cost_effective: balanced speed and quality
agentic: full AI-powered parsing
agentic_plus: premium AI with specialized features
The job runs asynchronously. Poll GET /parse/{job_id} with
expand=text or expand=markdown to retrieve results.
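The asynchronous flow described above can be sketched as a small polling helper. This is a minimal sketch, not an official client. The `status` field and its `"completed"` and `"failed"` values are assumptions, since this section does not describe the job response. `fetch_job` is a hypothetical callable standing in for an HTTP GET to /parse/{job_id} with the expand query parameter.

```python
import time
from typing import Any, Callable, Dict


def poll_parse_job(
    fetch_job: Callable[[str, str], Dict[str, Any]],
    job_id: str,
    expand: str = "markdown",
    interval_s: float = 2.0,
    max_attempts: int = 30,
) -> Dict[str, Any]:
    """Call fetch_job(job_id, expand) until the job reaches a terminal state.

    The "status" key and its values are assumed for illustration only.
    """
    for attempt in range(max_attempts):
        job = fetch_job(job_id, expand)
        status = job.get("status")  # assumed field name
        if status == "completed":   # assumed terminal value
            return job
        if status == "failed":      # assumed terminal value
            raise RuntimeError(f"parse job {job_id} failed")
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    raise TimeoutError(f"parse job {job_id} did not finish in {max_attempts} attempts")
```

Passing the fetch function in as an argument keeps the loop independent of any particular HTTP library and makes it easy to exercise without network access.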
Body Parameters (JSON)
tier: "fast" or "cost_effective" or "agentic" or "agentic_plus"
Parsing tier: 'fast' (rule-based, cheapest), 'cost_effective' (balanced), 'agentic' (AI-powered with custom prompts), or 'agentic_plus' (premium AI with highest accuracy)
version: "2025-12-11" or "2025-12-18" or "2025-12-31" or 31 more or string
Tier version. Use 'latest' for the current stable version, or pin a specific version (e.g., '2025-12-11') for reproducible results
agentic_options: optional object { custom_prompt }
Options for AI-powered parsing tiers (cost_effective, agentic, agentic_plus).
These options customize how the AI processes and interprets document content. Only applicable when using non-fast tiers.
custom_prompt: optional string
Custom instructions for the AI parser. Use to guide extraction behavior, specify output formatting, or provide domain-specific context. Example: 'Extract financial tables with currency symbols. Format dates as YYYY-MM-DD.'
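As an illustration of how tier, version, and agentic_options fit together, here is a minimal sketch that builds the JSON body as a Python dict. The helper name `build_agentic_body` is hypothetical. Rejecting a custom prompt on the fast tier is a client-side choice made for this sketch; the section only states that the options are not applicable to that tier, not how the service reacts.

```python
from typing import Any, Dict, Optional

# Tiers that the section lists as AI-powered (non-fast).
AI_TIERS = {"cost_effective", "agentic", "agentic_plus"}


def build_agentic_body(
    tier: str, version: str, custom_prompt: Optional[str] = None
) -> Dict[str, Any]:
    """Return a request body with agentic_options set only for AI tiers."""
    body: Dict[str, Any] = {"tier": tier, "version": version}
    if custom_prompt is not None:
        if tier not in AI_TIERS:
            # Client-side guard (an assumption of this sketch).
            raise ValueError("agentic_options applies only to non-fast tiers")
        body["agentic_options"] = {"custom_prompt": custom_prompt}
    return body


body = build_agentic_body(
    "agentic",
    "latest",
    "Extract financial tables with currency symbols. Format dates as YYYY-MM-DD.",
)
```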
client_name: optional string
Identifier for the client/application making the request. Used for analytics and debugging. Example: 'my-app-v2'
crop_box: optional object { bottom, left, right, top }
Crop boundaries for processing only a portion of each page. Values are ratios from 0 to 1, measured from the left edge (left, right) or the top edge (top, bottom)
bottom: optional number
Bottom boundary as ratio (0-1). 0=top edge, 1=bottom edge. Content below this line is excluded
left: optional number
Left boundary as ratio (0-1). 0=left edge, 1=right edge. Content left of this line is excluded
right: optional number
Right boundary as ratio (0-1). 0=left edge, 1=right edge. Content right of this line is excluded
top: optional number
Top boundary as ratio (0-1). 0=top edge, 1=bottom edge. Content above this line is excluded
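To make the ratio semantics concrete, the sketch below converts a crop_box into the rectangle that would be kept on a page of a given size. It follows the per-field descriptions above (all four values measured from the top-left corner). The defaults used for omitted fields (0 for left and top, 1 for right and bottom) are an assumption of this sketch, and `crop_box_to_rect` is a hypothetical helper, not part of the API.

```python
from typing import Dict, Tuple


def crop_box_to_rect(
    crop_box: Dict[str, float], page_width: int, page_height: int
) -> Tuple[int, int, int, int]:
    """Return (x0, y0, x1, y1) of the region kept after cropping."""
    # Assumed defaults: an omitted boundary means "no cropping on that side".
    left = crop_box.get("left", 0.0)
    top = crop_box.get("top", 0.0)
    right = crop_box.get("right", 1.0)
    bottom = crop_box.get("bottom", 1.0)
    for name, value in (("left", left), ("top", top), ("right", right), ("bottom", bottom)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if left >= right or top >= bottom:
        raise ValueError("crop_box must enclose a non-empty region")
    return (
        round(left * page_width),
        round(top * page_height),
        round(right * page_width),
        round(bottom * page_height),
    )


# Drop the top and bottom 10% of a 1000 x 2000 page.
rect = crop_box_to_rect({"top": 0.1, "bottom": 0.9}, 1000, 2000)
# rect == (0, 200, 1000, 1800)
```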
disable_cache: optional boolean
Bypass result caching and force re-parsing. Use when document content may have changed or you need fresh results
fast_options: optional unknown
Options for fast tier parsing (rule-based, no AI).
The fast tier uses deterministic algorithms for text extraction without AI enhancement. It is the fastest and most cost-effective option, best suited for simple documents with standard layouts. It currently has no configurable options; the field is reserved for future expansion.
file_id: optional string
ID of an existing file in the project to parse. Mutually exclusive with source_url
http_proxy: optional string
HTTP/HTTPS proxy for fetching source_url. Ignored if using file_id
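The mutual exclusivity of file_id and source_url can be checked before a request is sent. The following is a minimal client-side sketch; `input_source_fields` is a hypothetical helper. Dropping http_proxy when file_id is used mirrors the note that the proxy is ignored in that case.

```python
from typing import Dict, Optional


def input_source_fields(
    file_id: Optional[str] = None,
    source_url: Optional[str] = None,
    http_proxy: Optional[str] = None,
) -> Dict[str, str]:
    """Return the body fields that identify the document to parse."""
    # Exactly one of file_id / source_url must be provided.
    if (file_id is None) == (source_url is None):
        raise ValueError("provide exactly one of file_id or source_url")
    if file_id is not None:
        # http_proxy is only relevant when fetching source_url, so omit it.
        return {"file_id": file_id}
    fields = {"source_url": source_url}
    if http_proxy is not None:
        fields["http_proxy"] = http_proxy
    return fields
```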
input_options: optional object { html, pdf, presentation, spreadsheet }
Format-specific options (HTML, PDF, spreadsheet, presentation). Applied based on detected input file type
html: optional object { make_all_elements_visible, remove_fixed_elements, remove_navigation_elements }
HTML/web page parsing options (applies to .html, .htm files)
make_all_elements_visible: optional boolean
Force all HTML elements to be visible by overriding CSS display/visibility properties. Useful for parsing pages with hidden content or collapsed sections
remove_fixed_elements: optional boolean
Remove fixed-position elements (headers, footers, floating buttons) that appear on every page render
remove_navigation_elements: optional boolean
Remove navigation elements (nav bars, sidebars, menus) to focus on main content
pdf: optional unknown
PDF-specific parsing options (applies to .pdf files)
presentation: optional object { out_of_bounds_content, skip_embedded_data }
Presentation parsing options (applies to .pptx, .ppt, .odp, .key files)
out_of_bounds_content: optional boolean
Extract content positioned outside the visible slide area. Some presentations have hidden notes or content that extends beyond slide boundaries
skip_embedded_data: optional boolean
Skip extraction of embedded chart data tables. When true, only the visual representation of charts is captured, not the underlying data
spreadsheet: optional object { detect_sub_tables_in_sheets, force_formula_computation_in_sheets, include_hidden_sheets }
Spreadsheet parsing options (applies to .xlsx, .xls, .csv, .ods files)
detect_sub_tables_in_sheets: optional boolean
Detect and extract multiple tables within a single sheet. Useful when spreadsheets contain several data regions separated by blank rows/columns
force_formula_computation_in_sheets: optional boolean
Compute formula results instead of extracting formula text. Use when you need calculated values rather than formula definitions
include_hidden_sheets: optional boolean
Parse hidden sheets in addition to visible ones. By default, hidden sheets are skipped
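The nesting of input_options can be illustrated with a small sketch that assembles the object and strips unset values before serialization. `prune_unset` is a hypothetical helper. Omitting unset fields rather than sending null is a choice made here for illustration; the section does not say how the service treats nulls.

```python
import json
from typing import Any


def prune_unset(value: Any) -> Any:
    """Recursively drop None values and any dicts left empty by that."""
    if isinstance(value, dict):
        pruned = {k: prune_unset(v) for k, v in value.items() if v is not None}
        return {k: v for k, v in pruned.items() if v != {}}
    return value


input_options = prune_unset(
    {
        "html": {
            "make_all_elements_visible": None,
            "remove_fixed_elements": True,
            "remove_navigation_elements": True,
        },
        "presentation": {"out_of_bounds_content": None, "skip_embedded_data": None},
        "spreadsheet": {
            "detect_sub_tables_in_sheets": True,
            "force_formula_computation_in_sheets": None,
            "include_hidden_sheets": False,
        },
    }
)

# Serialized fragment to place under the "input_options" key of the body.
payload_fragment = json.dumps({"input_options": input_options}, sort_keys=True)
```

Explicit False values are kept, so a caller can still state a preference even when it matches the presumed default.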
output_options: optional object { extract_printed_page_number, images_to_save, markdown, 2 more }
Output formatting options for markdown, text, and extracted images
extract_printed_page_number: optional boolean
Extract the printed page number as it appears in the document (e.g., 'Page 5 of 10', 'v', 'A-3'). Useful for referencing original page numbers
images_to_save: optional array of "screenshot" or "embedded" or "layout"
Image categories to extract and save. Options: 'screenshot' (full page renders useful for visual QA), 'embedded' (images found within the document), 'layout' (cropped regions from layout detection like figures and diagrams). Empty list saves no images
markdown: optional object { annotate_links, inline_images, tables }
Markdown formatting options including table styles and link annotations
annotate_links: optional boolean
Add link annotations to markdown output in the format [text](url). When false, only the link text is included
inline_images: optional boolean
Embed images directly in markdown as base64 data URIs instead of extracting them as separate files. Useful for self-contained markdown output
tables: optional object { compact_markdown_tables, markdown_table_multiline_separator, merge_continued_tables, output_tables_as_markdown }
Table formatting options including markdown vs HTML format and merging behavior
compact_markdown_tables: optional boolean
Remove extra whitespace padding in markdown table cells for more compact output
markdown_table_multiline_separator: optional string
Separator string for multiline cell content in markdown tables. Example: '<br>' to preserve line breaks, ' ' to join with spaces
merge_continued_tables: optional boolean
Automatically merge tables that span multiple pages into a single table. The merged table appears on the first page with merged_from_pages metadata
output_tables_as_markdown: optional boolean
Output tables as markdown pipe tables instead of HTML
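A sketch of how the output_options object might be assembled, covering images_to_save and the nested markdown.tables settings described above. `build_output_options` is a hypothetical helper, and validating image categories on the client is a choice made for this sketch. Only fields named in this section are used.

```python
from typing import Any, Dict, Iterable, Optional

# Image categories listed in the images_to_save description.
ALLOWED_IMAGE_CATEGORIES = {"screenshot", "embedded", "layout"}


def build_output_options(
    images_to_save: Iterable[str] = (),
    tables_as_markdown: bool = True,
    merge_continued_tables: bool = False,
    multiline_separator: Optional[str] = None,
) -> Dict[str, Any]:
    """Return an output_options dict with table and image settings."""
    images = list(images_to_save)
    unknown = [c for c in images if c not in ALLOWED_IMAGE_CATEGORIES]
    if unknown:
        raise ValueError(f"unknown image categories: {unknown}")
    tables: Dict[str, Any] = {
        "output_tables_as_markdown": tables_as_markdown,
        "merge_continued_tables": merge_continued_tables,
    }
    if multiline_separator is not None:
        tables["markdown_table_multiline_separator"] = multiline_separator
    # An empty images list means no images are saved.
    return {"images_to_save": images, "markdown": {"tables": tables}}


output_options = build_output_options(
    images_to_save=["embedded"],
    merge_continued_tables=True,
    multiline_separator=" ",
)
```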