Configuring Parse
Every option for a Parse job in one place — request anatomy, input controls, output formats, and processing features.
Parse has a lot of knobs. This page is the map: it shows how a parse request is structured and documents every option you can set.
For the full field-by-field reference, see the Parse API Reference. For controlling what comes back from a parse job, see Retrieving Results.
Request anatomy
Section titled “Request anatomy”Every Parse request is a JSON object. Only three fields are required — everything else is optional:
{ // --- Required --- "file_id": "<file_id>", // or "source_url" — exactly one is required "tier": "agentic", // fast | cost_effective | agentic | agentic_plus "version": "latest", // dated version string or "latest"
// --- Optional --- "input_options": { /* file-type-specific hints */ }, "processing_options": { /* how Parse processes the document */ }, "agentic_options": { /* configures the agentic models */ }, "output_options": { /* shape of what Parse returns */ },
"crop_box": { /* page-level crop */ }, "page_ranges": { /* parse specific pages only */ }, "disable_cache": false, "processing_control": { /* timeouts and fail modes */ }, "webhook_configurations": [ /* push results to a URL */ ]}The simplest valid request is just these three fields — everything below them can be added when you need it.
Note:
expandis not part of the parse request body. It’s a query parameter on the GET result endpoint that controls which fields come back. The SDKs handle this for you — when you passexpand=["markdown"]toclient.parsing.parse(), the SDK submits the job, polls until complete, then retrieves the result with the right expand values. If you’re using the REST API directly, see Retrieving Results.
Where do I configure X?
Section titled “Where do I configure X?”Input options
Section titled “Input options”Control how Parse reads your document — page ranges, crop boxes, file-type-specific controls, and cache behavior.
Page ranges
Section titled “Page ranges”Parse only the pages you need. Every page you skip is a page you don’t pay for.
API key: page_ranges — top-level.
max_pages(integer) — cap total pages parsed, starting from page 1target_pages(string) — comma-separated 1-based pages and ranges, e.g."1,3,5-10"
{ "page_ranges": { "max_pages": 5 } }{ "page_ranges": { "target_pages": "1,3,7-12" } }Crop box
Section titled “Crop box”Strip repeating headers, footers, and margin chrome from every page. Four numbers (0.0–1.0), each the fraction to strip from that edge.
API key: crop_box — top-level.
{ "crop_box": { "top": 0.1, "bottom": 0.15 } }This is a geometric crop, not a content filter. If the chrome moves between pages, use content-based ignore rules instead.
Cache control
Section titled “Cache control”Parse caches identical requests by default. Any change to parse options busts the cache automatically.
API key: disable_cache — top-level boolean.
{ "disable_cache": true }Only disable for benchmarking, debugging, or verifying version pins.
HTML pages
Section titled “HTML pages”API key: input_options.html.
Parse walks the DOM, extracts visible content, and produces clean markdown. These controls strip noise that doesn’t belong in the output:
make_all_elements_visible— force hidden CSS content visible. Useful when parts of the document are behinddisplay: none,visibility: hidden, or JavaScript-driven UI states.remove_navigation_elements— strip menus, breadcrumbs, sidebar nav, and non-content chrome. Most useful when parsing a real web page rather than a hand-built HTML document.remove_fixed_elements— strip sticky headers, floating sidebars, and other fixed-position UI.
{ "input_options": { "html": { "make_all_elements_visible": true, "remove_navigation_elements": true, "remove_fixed_elements": true } }}Spreadsheets (XLSX, CSV)
Section titled “Spreadsheets (XLSX, CSV)”API key: input_options.spreadsheet.
Parse handles spreadsheets where layouts aren’t clean rectangular tables — multiple logical tables stacked in one sheet, formulas with stale cached values, etc.
detect_sub_tables_in_sheets— find and extract sub-tables within a single sheet. If your spreadsheet has three small tables stacked vertically with empty rows between them, Parse detects each as its own table instead of merging them.force_formula_computation_in_sheets— re-compute formula cells instead of using cached values. Enable when the file was edited but never recalculated, or you’re parsing a template with placeholder values. Can slow parsing on formula-heavy sheets.
{ "input_options": { "spreadsheet": { "detect_sub_tables_in_sheets": true, "force_formula_computation_in_sheets": true } }}Presentations (PPTX, Keynote)
Section titled “Presentations (PPTX, Keynote)”API key: input_options.presentation.
Speaker notes are extracted by default — request expand=["metadata"] to retrieve them on per-slide metadata.
out_of_bounds_content— extract content positioned beyond the visible slide boundaries. Presenters sometimes park notes, draft text, or reference images outside the visible area.skip_embedded_data— skip extraction of embedded chart data. Set totrueif you only need slide text and the chart-data extraction is slowing you down.
{ "input_options": { "presentation": { "out_of_bounds_content": true, "skip_embedded_data": false } }}API key: input_options.pdf — see the API reference for available PDF-specific options.
Output options
Section titled “Output options”Shape what Parse returns. Parse can emit several formats from the same job.
| I want… | Use |
|---|---|
| Clean text for an LLM | Markdown (default) |
| Whitespace-preserving layout | Spatial text |
| Tables as downloadable XLSX | Tables as spreadsheet |
| Embedded images, screenshots, layout crops | Image assets |
| Printed page numbers for citations | Printed page numbers |
| PDF copy of any parsed document | Exported PDF |
| Results pushed to my server | Webhooks |
Markdown output options
Section titled “Markdown output options”API key: output_options.markdown.
| Symptom | Knob |
|---|---|
| Downstream needs HTML tables | output_tables_as_markdown: false |
| Table spans multiple pages | merge_continued_tables: true |
| Images transcribed instead of referenced | inline_images: true |
| Want link destinations in markdown | annotate_links: true |
| Whitespace in table cells | compact_markdown_tables: true |
| Multi-line cell content | `markdown_table_multiline_separator: " |
{ "output_options": { "markdown": { "annotate_links": true, "inline_images": true, "tables": { "merge_continued_tables": true, "output_tables_as_markdown": false } } }}Spatial text
Section titled “Spatial text”Preserves visual positioning using whitespace. Use for forms, CAD drawings, multi-column layouts, receipts.
API key: output_options.spatial_text. Retrieve via expand=["text"].
Flags: preserve_layout_alignment_across_pages, preserve_very_small_text, do_not_unroll_columns.
Tables as spreadsheet
Section titled “Tables as spreadsheet”Generates an XLSX file — one sheet per table.
API key: output_options.tables_as_spreadsheet. Retrieve via expand=["xlsx_content_metadata"].
{ "output_options": { "tables_as_spreadsheet": { "enable": true } } }Image assets
Section titled “Image assets”API key: output_options.images_to_save — enum array: "screenshot", "embedded", "layout". Retrieve via expand=["images_content_metadata"].
{ "output_options": { "images_to_save": ["screenshot", "embedded"] } }Printed page numbers
Section titled “Printed page numbers”API key: output_options.extract_printed_page_number (singular). Retrieve via expand=["metadata"].
Exported PDF
Section titled “Exported PDF”Always generated — no input config. Retrieve via expand=["output_pdf_content_metadata"].
Webhook configurations
Section titled “Webhook configurations”Push results instead of polling. For large PDFs, batch pipelines, or when the result goes to a different service.
API key: webhook_configurations — top-level array.
{ "webhook_configurations": [{ "webhook_url": "https://example.com/webhook", "webhook_events": ["parse.success"], "webhook_headers": { "Authorization": "Bearer your-token" } }]}Security: use HTTPS, put auth tokens in webhook_headers (not query params), verify the caller in your handler, never let untrusted users control webhook_url.
Processing options
Section titled “Processing options”Control how Parse processes your document — flagship features, tuning knobs, and production controls.
Cost Optimizer
Section titled “Cost Optimizer”Pay premium prices only on pages that need it. Routes each page to the right tier automatically: simple pages → cost_effective, complex pages → your selected tier. Both groups run in parallel.
API key: processing_options.cost_optimizer. Available on agentic and agentic_plus only.
{ "processing_options": { "cost_optimizer": { "enable": true } } }When to use: mixed-complexity documents where most pages are prose. When to skip: every page is table-heavy, you’re already on cost_effective/fast, or you need exact reproducibility.
Request expand=["metadata"] to see which pages were cost-optimized (cost_optimized: true/false per page).
Specialized Chart Parsing
Section titled “Specialized Chart Parsing”Extract chart/graph data as structured tables — bar heights, line values, pie percentages.
API key: processing_options.specialized_chart_parsing — enum: "efficient", "agentic", "agentic_plus". Default-on for Agentic Plus. Not available on fast.
{ "processing_options": { "specialized_chart_parsing": "agentic_plus" } }Retrieve chart data via expand=["items"]. See the chart parsing tutorial.
Custom Prompt
Section titled “Custom Prompt”Steer the parser with natural-language instructions — focus on specific content, preserve formats, give document context.
API key: agentic_options.custom_prompt (not processing_options). Available on cost_effective, agentic, agentic_plus. Not on fast.
{ "agentic_options": { "custom_prompt": "This is a financial report. Preserve all currency symbols." } }Tips: be specific, name the document type, say what to skip, specify output format, keep it short (2–3 sentences). For guaranteed structured extraction with a schema, use LlamaExtract instead.
Skipping content patterns
Section titled “Skipping content patterns”API key: processing_options.ignore.
| Flag | Skips |
|---|---|
ignore_diagonal_text | Watermarks (CONFIDENTIAL, DRAFT) |
ignore_text_in_image | Low-quality OCR text in embedded images |
ignore_hidden_text | White-on-white or CSS-hidden text |
OCR languages
Section titled “OCR languages”API key: processing_options.ocr_parameters.languages. Only affects text from images — native text in born-digital PDFs is read directly.
{ "processing_options": { "ocr_parameters": { "languages": ["en", "fr", "de"] } } }Table extraction tuning
Section titled “Table extraction tuning”| Flag | What it does |
|---|---|
aggressive_table_extraction | Try harder to find tables (may add false positives) |
disable_heuristics | Turn off outlined-table extraction and adaptive long-table handling |
Timeouts and failure conditions
Section titled “Timeouts and failure conditions”API key: processing_control — top-level (not under processing_options).
Timeouts: base_in_seconds + (extra_time_per_page_in_seconds × page count).
Failure conditions: allowed_page_failure_ratio, fail_on_image_extraction_error, fail_on_image_ocr_error, fail_on_markdown_reconstruction_error, fail_on_buggy_font.
{ "processing_control": { "timeouts": { "base_in_seconds": 300, "extra_time_per_page_in_seconds": 30 }, "job_failure_conditions": { "allowed_page_failure_ratio": 0.05 } }}See also
Section titled “See also”- Tiers — credit costs, tier comparison, version pinning
- Retrieving Results — the
expandparameter and what comes back - Recipes — copy-pasteable snippets
- API reference — full field-by-field listing