LlamaExtract Core Concepts

LlamaExtract is designed to be a flexible and scalable extraction platform. At the core of the platform are the following concepts:

  • Extraction Agents: Reusable extractors configured with a specific schema and extraction settings.
  • Data Schema: A structured definition of the data you want to extract, written as a JSON Schema or a Pydantic model. See detailed explanation below.
  • Extraction Target: Defines the scope of extraction and how your schema is applied to documents. See detailed explanation below.
  • Extraction Jobs: Asynchronous extraction tasks that involve running an extraction agent on a set of files.
  • Extraction Runs: The results of an extraction job including the extracted data and other metadata.
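One way to picture how these concepts relate is as a small set of data classes. This is only a minimal sketch for illustration: the class names, field names, and the "PENDING" status below are hypothetical and are not the LlamaExtract SDK's actual types.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExtractionAgent:
    """Reusable extractor: a schema plus extraction settings (hypothetical shape)."""
    name: str
    data_schema: dict[str, Any]
    settings: dict[str, Any] = field(default_factory=dict)


@dataclass
class ExtractionJob:
    """Asynchronous task: one agent applied to a set of files (hypothetical shape)."""
    agent: ExtractionAgent
    files: list[str]
    status: str = "PENDING"  # illustrative status value, not an SDK constant


@dataclass
class ExtractionRun:
    """Result of a job: the extracted data plus metadata (hypothetical shape)."""
    job: ExtractionJob
    data: Any
    metadata: dict[str, Any] = field(default_factory=dict)


agent = ExtractionAgent(
    name="invoice-agent",
    data_schema={"type": "object", "properties": {"total": {"type": "number"}}},
    settings={"extraction_target": "document"},
)
job = ExtractionJob(agent=agent, files=["invoice_001.pdf", "invoice_002.pdf"])
run = ExtractionRun(job=job, data={"total": 120.5}, metadata={"num_files": len(job.files)})
```

The point of the sketch is the direction of the references: a run points back to its job, and a job points back to the agent (and therefore the schema and settings) that produced it.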

The Data Schema defines the structure of the data you want to extract from your documents. It is a JSON Schema that specifies the fields, types, and descriptions for the information you need.

While the schema is fundamentally a JSON Schema (supporting a subset of the full JSON Schema specification), our Python SDK allows you to use Pydantic models for a more Pythonic experience with type validation and IDE support.
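As a rough illustration, a schema with fields, types, and descriptions might look like the following, written here as a plain Python dictionary. The invoice fields are made up for this example. In Python, the same structure could be expressed as a Pydantic model, and Pydantic v2 can generate a JSON Schema from a model via model_json_schema().

```python
import json

# Hypothetical schema for illustration; the field names are assumptions.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor_name": {
            "type": "string",
            "description": "Name of the company issuing the invoice",
        },
        "invoice_date": {
            "type": "string",
            "description": "Invoice date in YYYY-MM-DD format",
        },
        "total_amount": {
            "type": "number",
            "description": "Total amount due, including tax",
        },
    },
    "required": ["vendor_name", "total_amount"],
}

# The dictionary serializes directly to JSON Schema text.
print(json.dumps(invoice_schema, indent=2))
```

Each property carries a type and a description, which matches the three things the schema is said to specify: fields, types, and descriptions.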

The Extraction Target determines how your schema is applied to the document and what granularity of results you receive. This is an important configuration option as it fundamentally changes how data is extracted.

[Figure: Extraction Target Visualization]

Document-Level Extraction (Default)

When to use: This is the default mode. Use it when you want to extract data from the full document according to your JSON schema, which covers most cases.

How it works: The schema is applied to the entire document as a single unit.

Returns: A single JSON object matching your schema.

Example use case: Extracting summary information from a contract, annual report, or research paper.
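The shape of a document-level result can be sketched with a toy stand-in for the extractor. The function below is not how LlamaExtract works internally; it uses regular expressions on made-up contract text purely to show that all pages are treated as one unit and a single object comes back.

```python
import re

# Made-up two-page contract, for illustration only.
document_pages = [
    "Service Agreement between Acme Corp and Globex Ltd.",
    "Term: 24 months. Effective date: 2024-01-15.",
]


def extract_from_document(pages: list[str]) -> dict:
    """Toy stand-in: join all pages and return ONE object for the whole document."""
    full_text = "\n".join(pages)
    parties = re.search(r"between (.+?) and (.+?)\.", full_text)
    term = re.search(r"Term: (\d+) months", full_text)
    return {
        "parties": [parties.group(1), parties.group(2)] if parties else [],
        "term_months": int(term.group(1)) if term else None,
    }


result = extract_from_document(document_pages)
print(result)
```

Note that the parties come from the first page and the term from the second, yet both land in the same object, because the schema is applied to the document as a whole.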

Page-Level Extraction

When to use: Each page independently contains information about a single entity. For example, each page covers the financials of a different portfolio company, and you want to extract the same set of metrics for each company.

How it works: The schema is applied independently to each page of the document.

Returns: An array of JSON objects, one per page, each matching your schema.

Example use case: Multi-page forms where each page represents a different entity, or a document with one record per page.

Important: Your schema should describe a single entity (one page), not a list. Don’t wrap it as extracted_result: list[template]; provide the template directly, and it will be applied at the page level.
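A small sketch of the page-level behavior, again using a toy regex-based stand-in rather than the real extractor. The page contents and field names are invented. The per-page function returns one object, and the list only appears when it is applied across pages, which is why the schema itself should not be a list.

```python
import re

# Made-up document: one portfolio company per page.
pages = [
    "Company: Alpha Robotics\nRevenue: 12.5\nEmployees: 80",
    "Company: Beta Foods\nRevenue: 7.0\nEmployees: 45",
]


def extract_page(page_text: str) -> dict:
    """Toy stand-in: apply the single-entity template to one page."""
    fields = dict(re.findall(r"^(\w+): (.+)$", page_text, flags=re.MULTILINE))
    return {
        "company": fields.get("Company"),
        "revenue_musd": float(fields["Revenue"]),
        "employees": int(fields["Employees"]),
    }


# The template is applied independently to each page: one object per page.
results = [extract_page(page) for page in pages]
print(results)
```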

Entity-Level Extraction

When to use: The document contains an ordered list of entities (in tables, bulleted/numbered lists, or separated by headers) and you want to extract the same information for each entity.

How it works: The schema is applied to each identified entity in the document. LlamaExtract automatically detects formatting patterns that distinguish entities (table rows, list items, section headers, etc.).

Returns: An array of JSON objects, one per entity/row, each matching your schema.

Example use cases:

  • Invoice line items (each row is a product/service)
  • Employee lists or directories
  • Purchase orders with multiple items
  • Any document with repeating structured entities

Important:

  • Your schema should describe a single entity, not a list. Don’t wrap it as extracted_result: list[template]; provide the template directly, and it will be applied at the entity level.
  • The document must have some formatting or structure that distinguishes the different entities (table formatting, bullets, numbering, headers, etc.).
  • Entities should appear in an ordered manner in the document.
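The entity-level result shape can be illustrated with a toy parser over a pipe-delimited table. This is a simplified sketch under stated assumptions: the real entity detection is automatic and handles far more formats, and the invoice text and field names here are invented.

```python
# Made-up invoice table; "|" separators are the formatting that distinguishes entities.
invoice_text = """\
Item | Qty | Unit price
Widget | 4 | 2.50
Gadget | 1 | 19.99
Cable | 10 | 0.75
"""


def extract_rows(text: str) -> list[dict]:
    """Toy stand-in: apply the single-entity template to each table row, in order."""
    lines = [line for line in text.splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:  # skip the header row
        description, qty, price = [cell.strip() for cell in line.split("|")]
        rows.append(
            {
                "description": description,
                "quantity": int(qty),
                "unit_price": float(price),
            }
        )
    return rows


line_items = extract_rows(invoice_text)
print(line_items)
```

The output is a list with one object per row, in document order, and each object has the shape of the single-entity template.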