Schema Design and Restrictions
At the core of LlamaExtract is the schema, which defines the structure of the data you want to extract from your documents.
Schema Restrictions
LlamaExtract only supports a subset of the JSON Schema specification. While limited, it should be sufficient for a wide variety of use cases.
- If you are specifying the schema as JSON, there are two ways to mark optional fields:
  - not including them in the containing object's `required` array
  - explicitly marking them as nullable using `anyOf` with a `null` type. See the `"start_date"` field in the example schema.
- If you are using Pydantic to specify the schema in the Python SDK, you can use the `Optional` annotation to mark optional fields.
- The root node must be of type `object`.
- Schema nesting is limited to 7 levels.
- The important fields are key names/titles, types, and descriptions. Fields for formatting, default values, etc. are not supported. If you need these, you can add the restrictions to your field description and/or use a post-processing step, e.g. default values can be supported by making a field optional and then mapping `null` values in the extraction result to the default value.
- There are other restrictions on the number of keys, the size of the schema, etc. that you may hit for complex extraction use cases. In such cases, it is worth thinking about how to restructure your extraction workflow to fit within these constraints, e.g. by extracting subsets of fields and later merging them together.
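The default-value workaround above can be sketched as a small post-processing step. The field names and defaults below are illustrative only, not part of the LlamaExtract API:

```python
# Apply default values after extraction: fields the schema marks optional
# come back as None when absent from the document, and we map them to
# defaults in a post-processing step. Field names here are hypothetical.

DEFAULTS = {"currency": "USD", "country": "Unknown"}

def apply_defaults(result: dict, defaults: dict) -> dict:
    """Replace null (None) values in an extraction result with defaults."""
    return {
        key: defaults.get(key, value) if value is None else value
        for key, value in result.items()
    }

extracted = {"invoice_total": 120.5, "currency": None, "country": None}
print(apply_defaults(extracted, DEFAULTS))
# {'invoice_total': 120.5, 'currency': 'USD', 'country': 'Unknown'}
```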
Tips & Best Practices
- Try to limit schema nesting to 3-4 levels.
- Make fields optional when data might not always be present. Having required fields may force the model to hallucinate when these fields are not present in the documents.
- When you want to extract a variable number of entities, use an `array` type. However, note that you cannot use an `array` type for the root node.
- Use descriptive field names and detailed descriptions. Use descriptions to pass formatting instructions or few-shot examples.
- Above all, start simple and iteratively build your schema to incorporate requirements.
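As an illustration of the description tip, a field description can embed a formatting instruction and a one-shot example. The field below is hypothetical:

```python
# A hypothetical schema field whose description carries a format
# instruction plus a one-shot example for the model to follow.
invoice_date_field = {
    "type": "string",
    "description": (
        "Invoice date, normalized to YYYY-MM-DD. "
        "Example: 'March 5, 2024' -> '2024-03-05'."
    ),
}

schema = {
    "type": "object",
    "properties": {"invoice_date": invoice_date_field},
}
print(schema["properties"]["invoice_date"]["description"])
```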
Hitting “The response was too long to be processed” Error
This error means the extraction response is hitting the output token limits of the LLM. In such cases, it is worth rethinking the design of your schema to enable more efficient/scalable extraction, e.g.:
- Instead of one field that extracts a complex object, you can use multiple fields to distribute the extraction logic.
- You can also use multiple schemas to extract different subsets of fields from the same document and merge them later.
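The multiple-schema approach can be sketched as follows; `extract_with_schema` is a stub standing in for a real extraction call (e.g. via a LlamaExtract agent) and is not part of the SDK:

```python
# Sketch: run two smaller schemas over the same document, then merge
# the result dicts. extract_with_schema is a hypothetical stand-in
# for a real extraction call.

def extract_with_schema(document: str, schema_name: str) -> dict:
    # Stub: a real implementation would call the extraction service.
    stub_results = {
        "header_fields": {"name": "Jane Doe", "email": "jane@example.com"},
        "work_history": {"experience": [{"company": "Acme", "title": "Engineer"}]},
    }
    return stub_results[schema_name]

def merge_results(*results: dict) -> dict:
    merged: dict = {}
    for result in results:
        merged.update(result)  # later schemas win on key collisions
    return merged

doc = "resume text ..."
combined = merge_results(
    extract_with_schema(doc, "header_fields"),
    extract_with_schema(doc, "work_history"),
)
print(sorted(combined))  # ['email', 'experience', 'name']
```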
Another option (orthogonal to the above) is to break the document into smaller sections and extract from each section individually, when possible. LlamaExtract will in most cases be able to handle both document and schema chunking automatically, but there are cases where you may need to do this manually.
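A minimal sketch of manual section-wise extraction, assuming a hypothetical `extract_section` in place of a real extraction call:

```python
# Sketch: split a long document into sections, extract from each
# section separately, and concatenate list-valued fields afterwards.
# extract_section is a stub standing in for a real extraction call.

def split_into_sections(text: str, delimiter: str = "\n\n") -> list[str]:
    return [part for part in text.split(delimiter) if part.strip()]

def extract_section(section: str) -> dict:
    # Stub: pretend each non-empty line of a section is one line item.
    return {"line_items": [line for line in section.splitlines() if line]}

def extract_document(text: str) -> dict:
    merged: dict = {"line_items": []}
    for section in split_into_sections(text):
        merged["line_items"].extend(extract_section(section)["line_items"])
    return merged

doc = "item A\nitem B\n\nitem C"
print(extract_document(doc))  # {'line_items': ['item A', 'item B', 'item C']}
```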
Defining Schemas (Python SDK)
The Python SDK can be installed using:

```shell
pip install llama-cloud-services
```

Schemas can be defined using either Pydantic models or JSON Schema:
Using Pydantic (Recommended)
```python
from pydantic import BaseModel, Field
from typing import List, Optional

from llama_cloud_services import LlamaExtract


class Experience(BaseModel):
    company: str = Field(description="Company name")
    title: str = Field(description="Job title")
    start_date: Optional[str] = Field(description="Start date of employment")
    end_date: Optional[str] = Field(description="End date of employment")


class Resume(BaseModel):
    name: str = Field(description="Candidate name")
    experience: List[Experience] = Field(description="Work history")
```

Using JSON Schema
```python
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Candidate name"},
        "experience": {
            "type": "array",
            "description": "Work history",
            "items": {
                "type": "object",
                "properties": {
                    "company": {"type": "string", "description": "Company name"},
                    "title": {"type": "string", "description": "Job title"},
                    "start_date": {
                        "anyOf": [{"type": "string"}, {"type": "null"}],
                        "description": "Start date of employment",
                    },
                    "end_date": {
                        "anyOf": [{"type": "string"}, {"type": "null"}],
                        "description": "End date of employment",
                    },
                },
            },
        },
    },
}

extractor = LlamaExtract(api_key="YOUR_API_KEY")
agent = extractor.create_agent(name="resume-parser", data_schema=schema)
```