Configuration Guide
Overview
Soliplex Ingester is configured via environment variables using Pydantic Settings. All configuration is defined in src/soliplex/ingester/lib/config.py:15.
Environment Variables
Database Configuration
DOC_DB_URL (required)
Database connection URL.
SQLite Example:
PostgreSQL Example:
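Both URL forms, with illustrative paths and credentials:

```shell
# SQLite (async driver required)
DOC_DB_URL=sqlite+aiosqlite:///./db/documents.db

# PostgreSQL (user, password, host, and database name are placeholders)
DOC_DB_URL=postgresql+psycopg://user:pass@db-host:5432/soliplex
```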
Notes:
- Must use async drivers (aiosqlite for SQLite or psycopg for PostgreSQL)
- SQLite uses relative or absolute file paths
- PostgreSQL requires credentials and network access
External Services
DOCLING_SERVER_URL
Docling document parsing service endpoint.
Default: http://localhost:5001/v1
Example:
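An illustrative value (hostname depends on your deployment):

```shell
DOCLING_SERVER_URL=http://docling:5001/v1
```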
Notes:
- Used for document parsing (PDF, DOCX, etc.)
- Must be accessible from worker nodes
- Health check: GET {url}/health
DOCLING_CHUNK_SERVER_URL
Docling service endpoint for chunking operations.
Default: http://localhost:5001/v1
Example:
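An illustrative value pointing chunking at a separate instance (hostname is a placeholder):

```shell
DOCLING_CHUNK_SERVER_URL=http://docling-chunk:5001/v1
```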
Notes:
- Used by haiku.rag for document chunking via docling-serve
- Can point to a different Docling instance than DOCLING_SERVER_URL for load distribution
- If not set, defaults to the same endpoint as parsing
- Useful when running separate Docling instances for parsing vs chunking
DOCLING_HTTP_TIMEOUT
HTTP timeout for Docling requests in seconds.
Default: 600 (10 minutes)
Example:
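For example, allowing up to 30 minutes for very large documents:

```shell
DOCLING_HTTP_TIMEOUT=1800
```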
Notes:
- Large documents may require longer timeouts
- Adjust based on document size and complexity
OLLAMA_BASE_URL
Ollama server endpoint for embedding generation.
Default: http://ollama:11434
Example:
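An illustrative value (hostname is a placeholder for your Ollama deployment):

```shell
OLLAMA_BASE_URL=http://ollama.internal:11434
```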
Notes:
- Used for generating document embeddings during the embed step
- Must be accessible from worker nodes
- The Ollama server should have the required embedding models loaded
OLLAMA_BASE_URL_DOCLING
Ollama server endpoint for Docling chunking operations.
Default: http://ollama:11434
Example:
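An illustrative value pointing chunking at a second Ollama instance (hostname is a placeholder):

```shell
OLLAMA_BASE_URL_DOCLING=http://ollama-docling:11434
```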
Notes:
- Used by docling-serve for document chunking operations
- Can point to a different Ollama instance than OLLAMA_BASE_URL for load distribution
- If not set, defaults to the same URL as OLLAMA_BASE_URL
- Useful when running separate Ollama instances to distribute model loading across servers
Logging
LOG_LEVEL
Python logging level.
Default: INFO
Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
Example:
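For example, to enable verbose logging during development:

```shell
LOG_LEVEL=DEBUG
```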
File Storage
FILE_STORE_TARGET
Storage backend type.
Default: fs (filesystem)
Options:
- fs - Local filesystem
- s3 - S3-compatible storage (requires OpenDAL config)
Example:
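For example, to switch to S3-compatible storage (see the S3 Configuration Overview below for the accompanying `ARTIFACT_S3__*` settings):

```shell
FILE_STORE_TARGET=s3
```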
FILE_STORE_DIR
Base directory for file storage.
Default: file_store
Example:
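For example, using an absolute path (path is illustrative):

```shell
FILE_STORE_DIR=/data/files
```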
Notes:
- Used when FILE_STORE_TARGET=fs
- Must be writable by the application user
- Consider disk space requirements
DOCUMENT_STORE_DIR
Subdirectory for raw documents.
Default: raw
Full Path: {FILE_STORE_DIR}/{DOCUMENT_STORE_DIR}
PARSED_MARKDOWN_STORE_DIR
Subdirectory for parsed markdown.
Default: markdown
PARSED_JSON_STORE_DIR
Subdirectory for parsed JSON.
Default: json
CHUNKS_STORE_DIR
Subdirectory for text chunks.
Default: chunks
EMBEDDINGS_STORE_DIR
Subdirectory for embedding vectors.
Default: embeddings
Vector Database
LANCEDB_DIR
Directory for LanceDB vector storage.
Default: lancedb
Filesystem Example:
S3 Example:
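Both forms, with illustrative paths and bucket names:

```shell
# Filesystem
LANCEDB_DIR=/data/lancedb

# S3 (requires the AWS environment variables described below)
LANCEDB_DIR=s3://my-vector-bucket/lancedb
```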
Notes:
- Local filesystem paths or S3 URIs are supported
- When using S3, LanceDB requires standard AWS environment variables (see below)
- Stores vector embeddings for RAG retrieval
- Requires sufficient disk space (filesystem) or S3 bucket access
- Periodically compact for performance
Required AWS Environment Variables for S3:
AWS_ACCESS_KEY_ID=your_key_id
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
Optional AWS Environment Variables:
# For S3-compatible providers (MinIO, SeaweedFS, etc.)
AWS_ENDPOINT=http://127.0.0.1:8333
# Required when using HTTP to connect to endpoint
AWS_ALLOW_HTTP=1
Worker Configuration
INGEST_QUEUE_CONCURRENCY
Maximum concurrent queue operations.
Default: 20
Example:
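For example, doubling the default for higher throughput (value illustrative):

```shell
INGEST_QUEUE_CONCURRENCY=40
```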
Notes:
- Controls internal queue processing
- Higher values increase throughput but use more memory
INGEST_WORKER_CONCURRENCY
Maximum concurrent workflow steps per worker.
Default: 10
Example:
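For example, raising the per-worker step limit (value illustrative):

```shell
INGEST_WORKER_CONCURRENCY=15
```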
Notes:
- Primary throughput control
- Balance against CPU and external service limits
- Monitor resource usage when tuning
DOCLING_CONCURRENCY
Maximum concurrent Docling requests.
Default: 3
Example:
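For example, when the Docling server has spare capacity (value illustrative):

```shell
DOCLING_CONCURRENCY=5
```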
Notes:
- Prevents overwhelming the Docling service
- Coordinate with Docling server capacity
- Increase if Docling can handle the load
WORKER_TASK_COUNT
Number of workflow steps to fetch per query.
Default: 5
Example:
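For example, fetching larger batches to reduce database round-trips (value illustrative):

```shell
WORKER_TASK_COUNT=10
```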
Notes:
- Batch size for step queries
- Higher values reduce database round-trips
- Lower values improve fairness across workers
WORKER_CHECKIN_INTERVAL
Worker heartbeat interval in seconds.
Default: 120 (2 minutes)
Example:
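For example, checking in every minute for tighter liveness monitoring (value illustrative):

```shell
WORKER_CHECKIN_INTERVAL=60
```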
Notes:
- How often workers update their health status
- Lower values increase database load slightly
- Used for monitoring worker liveness
WORKER_CHECKIN_TIMEOUT
Worker timeout threshold in seconds.
Default: 600 (10 minutes)
Example:
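For example, marking workers stale after five minutes (value illustrative; keep well above WORKER_CHECKIN_INTERVAL):

```shell
WORKER_CHECKIN_TIMEOUT=300
```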
Notes:
- When to consider a worker stale
- Should be significantly larger than WORKER_CHECKIN_INTERVAL
- Used for detecting crashed workers
EMBED_BATCH_SIZE
Batch size for embedding operations.
Default: 1000
Example:
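For example, a smaller batch for a memory-constrained embedding service (value illustrative):

```shell
EMBED_BATCH_SIZE=500
```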
Notes:
- Number of chunks to embed at once
- Higher values improve throughput
- Limited by embedding service capacity and memory
Workflow Configuration
WORKFLOW_DIR
Directory containing workflow YAML definitions.
Default: config/workflows
Example:
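For example, using a container-mounted path (path illustrative):

```shell
WORKFLOW_DIR=/app/config/workflows
```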
Notes:
- Scanned for *.yaml files at startup
- Hot-reloaded when the --reload flag is used
DEFAULT_WORKFLOW_ID
Default workflow to use when not specified.
Default: batch_split
Example:
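For example, selecting a different workflow id defined in your YAML files:

```shell
DEFAULT_WORKFLOW_ID=batch
```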
Notes:
- Must match an id in workflow YAML files
- Used when API requests omit workflow_definition_id
PARAM_DIR
Directory containing parameter set YAML files.
Default: config/params
Example:
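For example, using a container-mounted path (path illustrative):

```shell
PARAM_DIR=/app/config/params
```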
DEFAULT_PARAM_ID
Default parameter set to use when not specified.
Default: default
Example:
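For example, selecting a parameter set id defined in your YAML files:

```shell
DEFAULT_PARAM_ID=default
```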
Notes:
- Must match an id in parameter YAML files
Feature Flags
DO_RAG
Enable/disable HaikuRAG integration.
Default: True
Example:
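For example, disabling RAG in a CI pipeline:

```shell
DO_RAG=false
```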
Notes:
- Set to false for testing without RAG backend
- When disabled, store step becomes a no-op
- Useful for CI/CD testing
Authentication
API_KEY
Static API key for programmatic access.
Default: None (disabled)
Example:
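An illustrative value (the key shown is a placeholder; generate your own as described in the notes):

```shell
# Generate with: openssl rand -hex 32
API_KEY=0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
```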
Notes:
- Generate with: openssl rand -hex 32
- Must also set API_KEY_ENABLED=true to enforce
- Clients pass via Authorization: Bearer <token> header
- Keep this value secure; do not commit it to version control
API_KEY_ENABLED
Enable API key authentication.
Default: False
Example:
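For example, turning enforcement on:

```shell
API_KEY_ENABLED=true
```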
Notes:
- When true, all API requests require valid Authorization: Bearer header
- When false, API is open (or protected by OAuth2 Proxy)
- Can be combined with AUTH_TRUST_PROXY_HEADERS for hybrid auth
AUTH_TRUST_PROXY_HEADERS
Trust user identity headers from OAuth2 Proxy.
Default: False
Example:
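For example, when deployed behind OAuth2 Proxy:

```shell
AUTH_TRUST_PROXY_HEADERS=true
```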
Notes:
- Enable when running behind OAuth2 Proxy
- Reads user identity from X-Auth-Request-User, X-Auth-Request-Email headers
- Security: Only enable when behind a trusted reverse proxy
- See AUTHENTICATION.md for OAuth2 Proxy setup
Configuration File
While the system uses environment variables, you can organize them in a .env file:
.env Example:
# Database
DOC_DB_URL=sqlite+aiosqlite:///./db/documents.db
# External Services
DOCLING_SERVER_URL=http://localhost:5001/v1
DOCLING_HTTP_TIMEOUT=600
# Logging
LOG_LEVEL=INFO
# Storage
FILE_STORE_TARGET=fs
FILE_STORE_DIR=file_store
LANCEDB_DIR=lancedb
# Worker Settings
INGEST_WORKER_CONCURRENCY=10
DOCLING_CONCURRENCY=3
WORKER_TASK_COUNT=5
WORKER_CHECKIN_INTERVAL=120
WORKER_CHECKIN_TIMEOUT=600
EMBED_BATCH_SIZE=1000
# Workflow Configuration
WORKFLOW_DIR=config/workflows
DEFAULT_WORKFLOW_ID=batch
PARAM_DIR=config/params
DEFAULT_PARAM_ID=default
# Features
DO_RAG=true
Load with:
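One way to load the file, assuming a POSIX shell (Pydantic Settings can typically also read a .env file directly without this step):

```shell
set -a        # export every variable defined while sourcing
source .env
set +a
```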
S3 Configuration Overview
This project supports S3 storage in two different contexts with different configuration methods:
1. LanceDB Vector Storage (LANCEDB_DIR)
Purpose: Stores vector embeddings for RAG retrieval Configuration Method: Standard AWS environment variables Example:
LANCEDB_DIR=s3://my-vector-bucket/lancedb
AWS_ACCESS_KEY_ID=your_key_id
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
2. Artifact Storage (FILE_STORE_TARGET=s3)
Purpose: Stores intermediate processing artifacts (documents, markdown, chunks, embeddings)
Configuration Method: Nested Pydantic settings with __ delimiter
Example:
FILE_STORE_TARGET=s3
ARTIFACT_S3__BUCKET=my-artifact-bucket
ARTIFACT_S3__ACCESS_KEY_ID=your_key_id
ARTIFACT_S3__ACCESS_SECRET=your_secret_key
ARTIFACT_S3__REGION=us-east-1
ARTIFACT_S3__ENDPOINT_URL=http://127.0.0.1:8333
Important Notes:
- These are independent systems and can use different S3 buckets or providers
- LanceDB uses standard AWS SDK naming (AWS_SECRET_ACCESS_KEY)
- Artifact/Input S3 uses Pydantic nested naming (ARTIFACT_S3__ACCESS_SECRET)
- The field name difference is intentional to support Pydantic's nested configuration
Nested Configuration (Advanced)
Pydantic Settings supports nested configuration using __ delimiter for structured settings.
Artifact S3 Configuration:
ARTIFACT_S3__BUCKET=soliplex-artifacts
ARTIFACT_S3__ACCESS_KEY_ID=soliplex
ARTIFACT_S3__ACCESS_SECRET=soliplex
ARTIFACT_S3__REGION=xx
ARTIFACT_S3__ENDPOINT_URL=http://127.0.0.1:8333
Input S3 Configuration:
INPUT_S3__BUCKET=soliplex-inputs
INPUT_S3__ACCESS_KEY_ID=soliplex
INPUT_S3__ACCESS_SECRET=soliplex
INPUT_S3__REGION=xx
INPUT_S3__ENDPOINT_URL=http://127.0.0.1:8333
Notes:
- ACCESS_SECRET (not SECRET_ACCESS_KEY) is used for nested config fields
- Nested delimiter is __ (double underscore)
- Both INPUT_S3 and ARTIFACT_S3 can point to different buckets/providers
See src/soliplex/ingester/lib/config.py:7-16 for nested model definitions.
Configuration Validation
Validate Settings
Check configuration without starting services:
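Using the project CLI (the same command referenced in the Troubleshooting section):

```shell
si-cli validate-settings
```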
Output:
doc_db_url='sqlite+aiosqlite:///./db/documents.db'
docling_server_url='http://localhost:5001/v1'
log_level='INFO'
...
Validation Errors:
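If a required field is missing, Pydantic raises a validation error; the shape is roughly as follows (settings class name illustrative):

```text
1 validation error for Settings
doc_db_url
  Field required [type=missing, input_value={}, input_type=dict]
```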
Environment-Specific Configuration
Development
dev.env:
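An illustrative development profile (values are suggestions, not requirements):

```shell
DOC_DB_URL=sqlite+aiosqlite:///./db/documents.db
LOG_LEVEL=DEBUG
INGEST_WORKER_CONCURRENCY=2
DOCLING_SERVER_URL=http://localhost:5001/v1
DO_RAG=false
```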
Staging
staging.env:
DOC_DB_URL=postgresql+psycopg://user:pass@staging-db:5432/soliplex
LOG_LEVEL=INFO
INGEST_WORKER_CONCURRENCY=10
DOCLING_SERVER_URL=http://docling-staging:5001/v1
DO_RAG=true
Production
production.env:
DOC_DB_URL=postgresql+psycopg://user:pass@prod-db:5432/soliplex
LOG_LEVEL=WARNING
INGEST_WORKER_CONCURRENCY=20
DOCLING_CONCURRENCY=5
DOCLING_SERVER_URL=http://docling-prod:5001/v1
FILE_STORE_TARGET=s3
DO_RAG=true
WORKER_CHECKIN_INTERVAL=60
Docker Configuration
Docker Compose Example
docker-compose.yml:
version: '3.8'
services:
ingester:
image: soliplex-ingester:latest
environment:
DOC_DB_URL: postgresql+psycopg://postgres:password@db:5432/soliplex
DOCLING_SERVER_URL: http://docling:5001/v1
LOG_LEVEL: INFO
FILE_STORE_DIR: /data/files
LANCEDB_DIR: /data/lancedb
INGEST_WORKER_CONCURRENCY: 15
volumes:
- ./config/workflows:/app/config/workflows
- ./config/params:/app/config/params
- data-volume:/data
depends_on:
- db
- docling
db:
image: postgres:16
environment:
POSTGRES_DB: soliplex
POSTGRES_USER: postgres
POSTGRES_PASSWORD: password
volumes:
- db-volume:/var/lib/postgresql/data
docling:
image: docling-server:latest
ports:
- "5001:5001"
volumes:
db-volume:
data-volume:
Kubernetes ConfigMap
configmap.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
name: soliplex-config
data:
DOC_DB_URL: "postgresql+psycopg://user:pass@postgres-service:5432/soliplex"
DOCLING_SERVER_URL: "http://docling-service:5001/v1"
LOG_LEVEL: "INFO"
INGEST_WORKER_CONCURRENCY: "20"
WORKFLOW_DIR: "/config/workflows"
PARAM_DIR: "/config/params"
Performance Tuning
High Throughput
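Illustrative settings for maximizing throughput (scale to your hardware):

```shell
INGEST_WORKER_CONCURRENCY=30
INGEST_QUEUE_CONCURRENCY=50
DOCLING_CONCURRENCY=8
EMBED_BATCH_SIZE=2000
```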
Notes:
- Requires powerful hardware
- Monitor CPU, memory, and I/O
- Coordinate with external service capacity
Resource Constrained
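Illustrative settings for limited hardware (values are suggestions):

```shell
INGEST_WORKER_CONCURRENCY=3
INGEST_QUEUE_CONCURRENCY=5
DOCLING_CONCURRENCY=1
EMBED_BATCH_SIZE=200
```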
Notes:
- Reduces memory and CPU usage
- Lower throughput but more stable
- Good for shared environments
Batch Processing
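Illustrative settings for long-running batch workers (values are suggestions):

```shell
WORKER_TASK_COUNT=20
WORKER_CHECKIN_INTERVAL=300
WORKER_CHECKIN_TIMEOUT=1800
EMBED_BATCH_SIZE=2000
```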
Notes:
- Optimized for processing large batches
- Reduces monitoring overhead
- Assumes long-running workers
Secrets Management
Using Environment Files
Keep secrets out of version control:
# .env.secrets (add to .gitignore)
DOC_DB_URL=postgresql+psycopg://user:$(cat /run/secrets/db_password)@db/soliplex
# Note: the $(...) substitution is only expanded when this file is sourced by a shell;
# plain .env parsers read the value literally
Using Secret Management Tools
AWS Secrets Manager:
export DOC_DB_URL=$(aws secretsmanager get-secret-value --secret-id db-url --query SecretString --output text)
HashiCorp Vault:
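A sketch using the Vault CLI's KV engine (secret path and field name are illustrative):

```shell
export DOC_DB_URL=$(vault kv get -field=url secret/soliplex/db)
```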
Kubernetes Secrets:
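A sketch using kubectl (secret name and value are illustrative); reference the secret from the deployment via `envFrom` or `secretKeyRef`:

```shell
kubectl create secret generic soliplex-secrets \
  --from-literal=DOC_DB_URL='postgresql+psycopg://user:pass@postgres-service:5432/soliplex'
```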
Troubleshooting
Configuration Not Loading
Symptom: Application uses default values
Solutions:
1. Verify environment variables are set: env | grep DOC_
2. Check for typos in variable names
3. Ensure .env file is in correct directory
4. Verify .env file is being loaded
Validation Errors
Symptom: Application fails to start with validation error
Solutions:
1. Run si-cli validate-settings to see errors
2. Check required fields are set
3. Verify value types (e.g., integers for ports)
4. Check URL formats
Connection Errors
Symptom: Cannot connect to database or Docling
Solutions:
1. Verify URLs are correct
2. Test connectivity: curl http://docling-url/health
3. Check network policies/firewall
4. Verify credentials
File Permissions
Symptom: Cannot write to storage directories
Solutions:
1. Check directory exists and is writable
2. Verify application user permissions
3. Create directories if needed: mkdir -p file_store lancedb
4. Set ownership: chown -R app-user file_store lancedb
Configuration Reference
| Variable | Type | Required | Default | Description |
|---|---|---|---|---|
| DOC_DB_URL | str | Yes | - | Database connection URL |
| DOCLING_SERVER_URL | str | No | http://localhost:5001/v1 | Docling parsing service URL |
| DOCLING_CHUNK_SERVER_URL | str | No | http://localhost:5001/v1 | Docling chunking service URL |
| DOCLING_HTTP_TIMEOUT | int | No | 600 | Docling timeout (seconds) |
| LOG_LEVEL | str | No | INFO | Logging level |
| FILE_STORE_TARGET | str | No | fs | Storage backend type |
| FILE_STORE_DIR | str | No | file_store | Base storage directory |
| LANCEDB_DIR | str | No | lancedb | LanceDB directory (supports S3 URIs) |
| DOCUMENT_STORE_DIR | str | No | raw | Raw documents subdir |
| PARSED_MARKDOWN_STORE_DIR | str | No | markdown | Markdown subdir |
| PARSED_JSON_STORE_DIR | str | No | json | JSON subdir |
| CHUNKS_STORE_DIR | str | No | chunks | Chunks subdir |
| EMBEDDINGS_STORE_DIR | str | No | embeddings | Embeddings subdir |
| INGEST_QUEUE_CONCURRENCY | int | No | 20 | Queue concurrency |
| INGEST_WORKER_CONCURRENCY | int | No | 10 | Worker concurrency |
| DOCLING_CONCURRENCY | int | No | 3 | Docling concurrency |
| WORKER_TASK_COUNT | int | No | 5 | Steps per query |
| WORKER_CHECKIN_INTERVAL | int | No | 120 | Heartbeat interval (sec) |
| WORKER_CHECKIN_TIMEOUT | int | No | 600 | Worker timeout (sec) |
| EMBED_BATCH_SIZE | int | No | 1000 | Embedding batch size |
| OLLAMA_BASE_URL | str | No | http://ollama:11434 | Ollama server URL for embeddings |
| OLLAMA_BASE_URL_DOCLING | str | No | http://ollama:11434 | Ollama server URL for Docling chunking (can differ for load distribution) |
| WORKFLOW_DIR | str | No | config/workflows | Workflow definitions dir |
| DEFAULT_WORKFLOW_ID | str | No | batch_split | Default workflow |
| PARAM_DIR | str | No | config/params | Parameter sets dir |
| DEFAULT_PARAM_ID | str | No | default | Default parameter set |
| AWS_ACCESS_KEY_ID | str | Conditional | - | AWS access key (required for S3 LanceDB) |
| AWS_SECRET_ACCESS_KEY | str | Conditional | - | AWS secret key (required for S3 LanceDB) |
| AWS_REGION | str | Conditional | - | AWS region (required for S3 LanceDB) |
| AWS_ENDPOINT | str | No | - | S3 endpoint (for non-AWS providers) |
| AWS_ALLOW_HTTP | int | No | - | Allow HTTP for S3 (set to 1 for HTTP) |
| ARTIFACT_S3__* | nested | Conditional | - | Artifact S3 config (BUCKET, ACCESS_SECRET, etc.) |
| INPUT_S3__* | nested | Conditional | - | Input S3 config (BUCKET, ACCESS_SECRET, etc.) |
| DO_RAG | bool | No | True | Enable RAG integration |
| API_KEY | str | Conditional | - | Static API key (required if API_KEY_ENABLED) |
| API_KEY_ENABLED | bool | No | False | Enable API key authentication |
| AUTH_TRUST_PROXY_HEADERS | bool | No | False | Trust OAuth2 Proxy headers |