LlamaParse Configuration
Configuration and scaling recommendations for LlamaParse OCR services and workers.
Overview
LlamaParse components:
- OCR Service: Text extraction from document images
- LlamaParse Workers: Document processing (fast, balanced, agentic modes)
OCR Service Configuration
OCR service runs on GPU or CPU infrastructure.
Hardware Recommendations
CPU deployments: Use x86 architecture (50% better throughput than ARM).
Resource Requirements
| Configuration | GPU | CPU |
|---|---|---|
| Minimum instances | 2 | 12 |
| Pages per minute per pod | 100 | ~2 per worker |
| Recommended workers per pod | 4 | Core count ÷ 2 |
Scaling Ratios
- CPU: 2 CPU OCR workers (2 cores each) per LlamaParse worker
- GPU: 1 GPU OCR worker per 8 LlamaParse workers
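These ratios, together with the minimum instance counts from the table above, reduce to simple arithmetic. The sketch below is illustrative only; the helper name is ours, and the constants merely restate the figures on this page.

```python
import math

def ocr_capacity(llamaparse_workers: int) -> dict:
    """Apply the documented OCR scaling ratios (illustrative only)."""
    # CPU: 2 CPU OCR workers (2 cores each) per LlamaParse worker, minimum 12 instances.
    cpu_ocr_workers = max(12, 2 * llamaparse_workers)
    cpu_cores_for_ocr = 2 * cpu_ocr_workers
    # GPU: 1 GPU OCR worker per 8 LlamaParse workers, minimum 2 instances.
    gpu_ocr_workers = max(2, math.ceil(llamaparse_workers / 8))
    return {
        "cpu_ocr_workers": cpu_ocr_workers,
        "cpu_cores_for_ocr": cpu_cores_for_ocr,
        "gpu_ocr_workers": gpu_ocr_workers,
    }

# 8 LlamaParse workers -> 16 CPU OCR workers (32 cores) or 2 GPU OCR workers.
print(ocr_capacity(8))
```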
LlamaParse Worker Configuration
Workers process documents in three modes:
Performance by Mode
| Mode | Pages per Minute | Use Case |
|---|---|---|
| Fast | ~10,000 | High-volume, basic text extraction |
| Balanced | ~250 | Standard parsing with good accuracy |
| Agentic | ~100 | Complex documents requiring AI analysis |
Resource Requirements
Compute:
- CPU: 2 vCPUs per worker
- Memory: 2-16 GB RAM per worker
Deployment:
- Multiple workers per Kubernetes node
- ~6 workers per node (production)
Scaling Examples
| Target Throughput | LlamaParse Workers | CPU OCR Pods | GPU OCR Pods |
|---|---|---|---|
| 1,000 pages/min | 8 | 16 | 2 |
| 10,000 pages/min | 64 | 128 | 12 |
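A rough way to reproduce rows like these is to divide the target throughput by an assumed per-worker rate and then apply the node density above. The sketch below is illustrative; the ~125 pages/min default is simply what the first example row implies (1,000 pages/min over 8 workers), and the OCR pod columns follow from the scaling ratios in the OCR section.

```python
import math

def size_workers(target_pages_per_min: float,
                 pages_per_min_per_worker: float = 125.0,
                 workers_per_node: int = 6) -> dict:
    """Estimate LlamaParse worker and node counts for a target throughput.

    Illustrative only: the default per-worker rate is inferred from the
    1,000 pages/min example row (1,000 / 8 workers); measure the rate for
    the mode you actually run and substitute it here.
    """
    workers = math.ceil(target_pages_per_min / pages_per_min_per_worker)
    nodes = math.ceil(workers / workers_per_node)   # ~6 workers per node in production
    return {"llamaparse_workers": workers, "nodes": nodes}

print(size_workers(1_000))    # -> {'llamaparse_workers': 8, 'nodes': 2}
```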
GenAI Providers
LlamaParse uses GenAI providers for parsing:
- parse_page_with_llm: LLM parsing (supports gpt-4o-mini, haiku-3.5)
- parse_page_with_lvm: Vision model parsing (supports gemini, openai, claude sonnet)
- parse_page_with_agent: Agentic parsing (supports claude, gemini, openai)
Provider fallback: when multiple providers are configured, LlamaParse automatically falls back if one becomes unavailable.
Supported providers:
- Claude/Haiku: Anthropic (US), AWS Bedrock, Google VertexAI
- OpenAI: OpenAI (US), OpenAI EU (parse_page_with_llm only), AzureAI
- Gemini: Google Vertex AI, Google GenAI
Advanced Configuration
OCR Worker Tuning
OCR_WORKER=<value>   # Recommended: pod_core_count ÷ 2
OCR Concurrency Control
OCR_CONCURRENCY=8   # Default
- Lower: Fewer OCR pods, slower processing
- Higher: More OCR pods, faster processing
Image Processing Limits
MAX_EXTRACTED_IMAGES_PER_PAGES=30   # Default
Job Queue Concurrency
PDF_JOB_QUEUE_CONCURRENCY=1   # Default (recommended)
Do not change PDF_JOB_QUEUE_CONCURRENCY without understanding performance implications.
GenAI Throughput Tuning
Limit throughput per mode to match TPM/RPM quotas:
ACCURATE_MODE_LLM_CONCURRENCY=250   # parse_page_with_llm (default)
MULTIMODAL_MODEL_CONCURRENCY=50     # parse_page_with_lvm (default)
PREMIUM_MODE_MODEL_CONCURRENCY=25   # parse_page_with_agent (default)
Token usage per 1k pages:
| Mode | Requests | Input Tokens | Output Tokens |
|---|---|---|---|
| parse_page_with_llm | 1,010 | 1.2M | 1.5M |
| parse_page_with_agent | 2,000 | 4M | 2M |
| parse_page_with_lvm | 1,200 | 3M | 1.5M |
Providers like AWS Bedrock have low default quotas. Verify that your quotas accommodate the desired parsing volume.
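Because the concurrency settings above exist to keep LlamaParse within provider TPM/RPM quotas, one way to sanity-check a quota is to scale the per-1,000-pages figures to your target throughput. A minimal sketch, using only the table values above (illustrative; it assumes usage scales linearly with page count):

```python
# Approximate usage per 1,000 pages, from the table above.
USAGE_PER_1K_PAGES = {
    # mode: (requests, input tokens, output tokens)
    "parse_page_with_llm":   (1_010, 1_200_000, 1_500_000),
    "parse_page_with_agent": (2_000, 4_000_000, 2_000_000),
    "parse_page_with_lvm":   (1_200, 3_000_000, 1_500_000),
}

def required_quota(mode: str, pages_per_minute: float) -> dict:
    """Estimate the RPM and TPM a provider quota must sustain for a target
    parsing rate (rough linear scaling of the per-1k-pages figures)."""
    requests, tokens_in, tokens_out = USAGE_PER_1K_PAGES[mode]
    scale = pages_per_minute / 1_000
    return {
        "requests_per_minute": requests * scale,
        "input_tokens_per_minute": tokens_in * scale,
        "output_tokens_per_minute": tokens_out * scale,
    }

# Example: 1,000 pages/min in agentic mode needs roughly 2,000 RPM and 4M input TPM.
print(required_quota("parse_page_with_agent", 1_000))
```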
Autoscaling
LlamaParse supports KEDA-based autoscaling to automatically adjust worker pods based on queue depth. This ensures optimal resource utilization during varying workloads.
Queue-Based Scaling
Autoscaling uses the LlamaCloud queue status API to monitor parse job queues:
- Queue monitoring: /api/queue-statusz?queue_prefix=parse_raw_file_job
- Scaling metric: Total messages across healthy queues
- Target queue depth: Total messages in the queue that KEDA tries to maintain (typically 20-100)
Scaling Recommendations
| Environment | Min Pods | Max Pods | Target Queue Depth | Characteristics |
|---|---|---|---|---|
| Development | 3 | 12 | 20 | Fast scaling, testing |
| Staging | 12 | 120 | 20 | Moderate scaling, validation |
| Production | 96 | 600 | 100 | Conservative, high availability |
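Queue-based scaling behaves roughly like a standard HPA external metric: the desired replica count is the total queued messages divided by the target queue depth, clamped to the min/max pod bounds. A minimal sketch of that calculation using the production row above (illustrative only, not the actual KEDA implementation):

```python
import math

def desired_replicas(queued_messages: int,
                     target_queue_depth: int,
                     min_pods: int,
                     max_pods: int) -> int:
    """Approximate KEDA/HPA behaviour: scale so each pod handles roughly
    `target_queue_depth` queued messages, clamped to the configured bounds."""
    wanted = math.ceil(queued_messages / target_queue_depth)
    return max(min_pods, min(max_pods, wanted))

# Production settings from the table above: min 96, max 600, target depth 100.
print(desired_replicas(queued_messages=25_000, target_queue_depth=100,
                       min_pods=96, max_pods=600))   # -> 250 worker pods
```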
Integration with OCR Services
When scaling LlamaParse workers, consider OCR service capacity:
- CPU OCR: Scale 2 OCR workers per LlamaParse worker
- GPU OCR: Scale 1 OCR worker per 8 LlamaParse workers
For detailed autoscaling configuration, see the Autoscaling Configuration guide.
Monitoring and Optimization
Key Metrics
- OCR throughput: Pages/minute
- Worker utilization: CPU/memory usage
- Queue depth: Pending jobs
- Error rates: Failed operations
- Scaling events: Autoscaling frequency and effectiveness
Optimization
- Node placement: Co-locate workloads with complementary resource usage patterns
- Horizontal scaling: Add workers before increasing per-worker resources
- OCR scaling: Scale OCR services independently
- Memory management: Use restart policies for long-running deployments
- Autoscaling tuning: Monitor queue depth and adjust scaling parameters