Configuration Guide
Overview
Soliplex Ingester is configured via environment variables using Pydantic Settings. All configuration is defined in `src/soliplex/ingester/lib/config.py:15`.
Environment Variables
Database Configuration
DOC_DB_URL (required)
Database connection URL.
AUTO_CREATE_DATABASE
Automatically create database tables on initialization.
Default: True
Example:
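For instance, disabling table auto-creation when a migration tool manages the schema:

```shell
AUTO_CREATE_DATABASE=false
```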
Notes:
- When `true`, all SQLModel tables are created via `CREATE TABLE IF NOT EXISTS` during `Database.initialize()`
- Set to `false` when using a migration tool (e.g., Alembic) to manage schema changes
- Useful in production where schema should be controlled by migrations rather than auto-created
SQLite Example:
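A local SQLite database using the async driver (matching the `.env` example later in this guide):

```shell
DOC_DB_URL=sqlite+aiosqlite:///./db/documents.db
```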
PostgreSQL Example:
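A PostgreSQL connection via the async `psycopg` driver (host and credentials are placeholders):

```shell
DOC_DB_URL=postgresql+psycopg://user:pass@db-host:5432/soliplex
```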
Notes:
- Must use async drivers (`aiosqlite` for SQLite or `psycopg` for PostgreSQL)
- SQLite uses relative or absolute file paths
- PostgreSQL requires credentials and network access
External Services
DOCLING_SERVER_URL
Docling document parsing service endpoint.
Default: http://localhost:5001/v1
Example:
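Pointing at a Docling container on the internal network (hostname is illustrative):

```shell
DOCLING_SERVER_URL=http://docling:5001/v1
```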
Notes:
- Used for document parsing (PDF, DOCX, etc.)
- Must be accessible from worker nodes
- Health check: `GET {url}/health`
DOCLING_CHUNK_SERVER_URL
Docling service endpoint for chunking operations.
Default: http://localhost:5001/v1
Example:
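Directing chunking to a second Docling instance (hostname is illustrative):

```shell
DOCLING_CHUNK_SERVER_URL=http://docling-chunk:5001/v1
```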
Notes:
- Used by haiku.rag for document chunking via docling-serve
- Can point to a different Docling instance than `DOCLING_SERVER_URL` for load distribution
- If not set, defaults to the same endpoint as parsing
- Useful when running separate Docling instances for parsing vs chunking
DOCLING_HTTP_TIMEOUT
HTTP timeout for Docling requests in seconds.
Default: 600 (10 minutes)
Example:
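Raising the timeout to 20 minutes for very large documents (the value is illustrative):

```shell
DOCLING_HTTP_TIMEOUT=1200
```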
Notes:
- Large documents may require longer timeouts
- Adjust based on document size and complexity
OLLAMA_BASE_URL
Ollama server endpoint for embedding generation.
Default: http://ollama:11434
Example:
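Pointing at a dedicated Ollama host (hostname is illustrative):

```shell
OLLAMA_BASE_URL=http://ollama-embeddings:11434
```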
Notes:
- Used for generating document embeddings during the embed step
- Must be accessible from worker nodes
- The Ollama server should have the required embedding models loaded
OLLAMA_BASE_URL_DOCLING
Ollama server endpoint for Docling chunking operations.
Default: http://ollama:11434
Example:
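Sending chunking traffic to a second Ollama instance (hostname is illustrative):

```shell
OLLAMA_BASE_URL_DOCLING=http://ollama-chunking:11434
```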
Notes:
- Used by docling-serve for document chunking operations
- Can point to a different Ollama instance than `OLLAMA_BASE_URL` for load distribution
- If not set, defaults to the same URL as `OLLAMA_BASE_URL`
- Useful when running separate Ollama instances to distribute model loading across servers
Logging
LOG_LEVEL
Python logging level.
Default: INFO
Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
Example:
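Enabling verbose logging during development:

```shell
LOG_LEVEL=DEBUG
```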
LOG_FORMAT
Log output format.
Default: text
Options:
- `text` - Human-readable log lines (default Python formatting)
- `json` - Structured JSON, one object per line
Example:
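Switching to structured output:

```shell
LOG_FORMAT=json
```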
Notes:
- JSON format emits one JSON object per log line with fields: `timestamp`, `level`, `name`, `message`
- Any `extra` fields passed to logger calls (e.g., `log.info("msg", extra={"doc_id": 123})`) are included as top-level keys in the JSON output
- If an exception is attached, it appears in the `exception` field
- Recommended for production and container environments where logs are consumed by aggregation tools (ELK, Datadog, CloudWatch, etc.)
JSON output example:
{"timestamp": "2026-03-23T14:05:12", "level": "INFO", "name": "soliplex.ingester", "message": "Processing document", "doc_id": 123, "batch_id": "abc"}
File Storage
FILE_STORE_TARGET
Storage backend type.
Default: fs (filesystem)
Options:
- `fs` - Local filesystem
- `s3` - S3-compatible storage (requires OpenDAL config)
Example:
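Selecting S3-compatible storage:

```shell
FILE_STORE_TARGET=s3
```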
FILE_STORE_DIR
Base directory for file storage.
Default: file_store
Example:
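Storing files on a mounted data volume (the path is illustrative):

```shell
FILE_STORE_DIR=/data/files
```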
Notes:
- Used when `FILE_STORE_TARGET=fs`
- Must be writable by the application user
- Consider disk space requirements
DOCUMENT_STORE_DIR
Subdirectory for raw documents.
Default: raw
Full Path: {FILE_STORE_DIR}/{DOCUMENT_STORE_DIR}
PARSED_MARKDOWN_STORE_DIR
Subdirectory for parsed markdown.
Default: markdown
PARSED_JSON_STORE_DIR
Subdirectory for parsed JSON.
Default: json
CHUNKS_STORE_DIR
Subdirectory for text chunks.
Default: chunks
EMBEDDINGS_STORE_DIR
Subdirectory for embedding vectors.
Default: embeddings
FILE_PROTECTION_LEVEL
Protection level for file-stored artifacts. Controls integrity checking and encryption.
Default: none
Options:
- `none` - No protection (current default behavior)
- `hash` - SHA-512 hash verification: writes a `.hash` sidecar file alongside each artifact
- `hmac` - HMAC-SHA-512 verification: writes a `.hmac` sidecar file using a keyed hash (requires `FILE_SECRET`)
- `encrypt` - Fernet encryption: encrypts artifact bytes at rest using authenticated encryption (requires `FILE_SECRET`)
Example:
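Enabling encryption at rest; the secret value shown is a placeholder and must be set alongside it:

```shell
FILE_PROTECTION_LEVEL=encrypt
FILE_SECRET=0f1e2d...  # placeholder; generate with: openssl rand -hex 64
```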
Notes:
- Only applies when `FILE_STORE_TARGET=fs` (filesystem storage)
- `hmac` and `encrypt` modes require `FILE_SECRET` to be set
- A single `FILE_SECRET` is used for both HMAC and encryption; purpose-specific keys are derived internally via HKDF-SHA256
- Changing protection level does not retroactively modify existing files
- `hash` mode detects accidental corruption; `hmac` mode detects tampering; `encrypt` mode provides confidentiality
FILE_SECRET
Master secret used for HMAC and encryption operations.
Default: None
Example:
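Generating and exporting a suitable secret in one step, using the `openssl` command recommended in the notes below:

```shell
export FILE_SECRET=$(openssl rand -hex 64)
```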
Notes:
- Required when `FILE_PROTECTION_LEVEL` is `hmac` or `encrypt`
- Treated as a secret (stored via `SecretStr`, supports the `/run/secrets` directory)
- Recommended minimum length: 64 bytes (the HMAC-SHA-512 key size). Generate with: `openssl rand -hex 64`
- Purpose-specific keys are derived internally via HKDF, so any sufficiently long secret works for both HMAC and encryption
- Keep this value secure - do not commit to version control
- Changing the secret will make previously protected files unreadable
Vector Database
LANCEDB_DIR
Directory for LanceDB vector storage.
Default: lancedb
Filesystem Example:
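A directory on a mounted data volume (the path is illustrative):

```shell
LANCEDB_DIR=/data/lancedb
```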
S3 Example:
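An S3 URI (bucket name is illustrative):

```shell
LANCEDB_DIR=s3://my-vector-bucket/lancedb
```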
Notes:
- Local filesystem paths or S3 URIs are supported
- When using S3, LanceDB requires standard AWS environment variables (see below)
- Stores vector embeddings for RAG retrieval
- Requires sufficient disk space (filesystem) or S3 bucket access
- Periodically compact for performance
Required AWS Environment Variables for S3:
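The three standard credential variables (values are placeholders):

```shell
AWS_ACCESS_KEY_ID=your_key_id
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
```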
Optional AWS Environment Variables:
# For S3-compatible providers (MinIO, SeaweedFS, etc.)
AWS_ENDPOINT=http://127.0.0.1:8333
# Required when using HTTP to connect to endpoint
AWS_ALLOW_HTTP=1
Worker Configuration
INGEST_QUEUE_CONCURRENCY
Maximum concurrent queue operations.
Default: 20
Example:
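Raising queue concurrency on a machine with memory to spare (the value is illustrative):

```shell
INGEST_QUEUE_CONCURRENCY=40
```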
Notes:
- Controls internal queue processing
- Higher values increase throughput but use more memory
INGEST_WORKER_CONCURRENCY
Maximum concurrent workflow steps per worker.
Default: 10
Example:
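Doubling step concurrency on a larger worker:

```shell
INGEST_WORKER_CONCURRENCY=20
```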
Notes:
- Primary throughput control
- Balance against CPU and external service limits
- Monitor resource usage when tuning
DOCLING_CONCURRENCY
Maximum concurrent Docling requests.
Default: 3
Example:
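Allowing more parallel Docling requests when the server can handle the load:

```shell
DOCLING_CONCURRENCY=5
```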
Notes:
- Prevents overwhelming Docling service
- Coordinate with Docling server capacity
- Increase if Docling can handle load
WORKER_TASK_COUNT
Number of workflow steps to fetch per query.
Default: 5
Example:
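Fetching larger batches to reduce database round-trips (the value is illustrative):

```shell
WORKER_TASK_COUNT=10
```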
Notes:
- Batch size for step queries
- Higher values reduce database round-trips
- Lower values improve fairness across workers
WORKER_CHECKIN_INTERVAL
Worker heartbeat interval in seconds.
Default: 120 (2 minutes)
Example:
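Checking in every minute for finer-grained liveness monitoring:

```shell
WORKER_CHECKIN_INTERVAL=60
```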
Notes:
- How often workers update health status
- Lower values increase database load slightly
- Used for monitoring worker liveness
WORKER_CHECKIN_TIMEOUT
Worker timeout threshold in seconds.
Default: 600 (10 minutes)
Example:
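Marking workers stale after five minutes (the value is illustrative):

```shell
WORKER_CHECKIN_TIMEOUT=300
```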
Notes:
- When to consider a worker stale
- Should be significantly larger than `WORKER_CHECKIN_INTERVAL`
- Used for detecting crashed workers
EMBED_BATCH_SIZE
Batch size for embedding operations.
Default: 1000
Example:
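Reducing the batch size for a memory-constrained embedding service (the value is illustrative):

```shell
EMBED_BATCH_SIZE=500
```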
Notes:
- Number of chunks to embed at once
- Higher values improve throughput
- Limited by embedding service capacity and memory
Workflow Configuration
WORKFLOW_DIR
Directory containing workflow YAML definitions.
Default: config/workflows
Example:
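Pointing at a mounted config volume, as in the Kubernetes ConfigMap later in this guide:

```shell
WORKFLOW_DIR=/config/workflows
```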
Notes:
- Scanned for `*.yaml` files at startup
- Hot-reloaded if the `--reload` flag is used
DEFAULT_WORKFLOW_ID
Default workflow to use when not specified.
Default: batch_split
Example:
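Setting the default explicitly:

```shell
DEFAULT_WORKFLOW_ID=batch_split
```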
Notes:
- Must match an `id` in workflow YAML files
- Used when API requests omit `workflow_definition_id`
PARAM_DIR
Directory containing system parameter set YAML files.
Default: config/params
Example:
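Pointing at a mounted config volume:

```shell
PARAM_DIR=/config/params
```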
USER_PARAM_DIR
Directory containing user-uploaded parameter set YAML files. Must be different from PARAM_DIR. Both directories are merged transparently at load time; parameter set IDs must be unique across both.
Default: config/user_params
Example:
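Keeping user uploads in a separate directory from system parameter sets:

```shell
USER_PARAM_DIR=/config/user_params
```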
DEFAULT_PARAM_ID
Default parameter set to use when not specified.
Default: default
Example:
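Setting the default explicitly:

```shell
DEFAULT_PARAM_ID=default
```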
Notes:
- Must match an `id` in parameter YAML files
Feature Flags
DO_RAG
Enable/disable HaikuRAG integration.
Default: True
Example:
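Disabling RAG for CI runs:

```shell
DO_RAG=false
```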
Notes:
- Set to `false` for testing without a RAG backend
- When disabled, the `store` step becomes a no-op
- Useful for CI/CD testing
Authentication
API_KEY
Static API key for programmatic access.
Default: None (disabled)
Example:
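Generating a key and enabling enforcement together, using the `openssl` command from the notes below:

```shell
export API_KEY=$(openssl rand -hex 32)
export API_KEY_ENABLED=true
```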
Notes:
- Generate with: `openssl rand -hex 32`
- Must also set `API_KEY_ENABLED=true` to enforce
- Clients pass the key via the `Authorization: Bearer <token>` header
- Keep this value secure - do not commit to version control
API_KEY_ENABLED
Enable API key authentication.
Default: False
Example:
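Turning enforcement on:

```shell
API_KEY_ENABLED=true
```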
Notes:
- When `true`, all API requests require a valid `Authorization: Bearer` header
- When `false`, the API is open (or protected by OAuth2 Proxy)
- Can be combined with `AUTH_TRUST_PROXY_HEADERS` for hybrid auth
AUTH_TRUST_PROXY_HEADERS
Trust user identity headers from OAuth2 Proxy.
Default: False
Example:
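Enabling proxy-header trust when running behind OAuth2 Proxy:

```shell
AUTH_TRUST_PROXY_HEADERS=true
```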
Notes:
- Enable when running behind OAuth2 Proxy
- Reads user identity from the `X-Auth-Request-User` and `X-Auth-Request-Email` headers
- Security: only enable when behind a trusted reverse proxy
- See AUTHENTICATION.md for OAuth2 Proxy setup
Configuration File
While the system uses environment variables, you can organize them in a .env file:
.env Example:
# Database
DOC_DB_URL=sqlite+aiosqlite:///./db/documents.db
AUTO_CREATE_DATABASE=true
# External Services
DOCLING_SERVER_URL=http://localhost:5001/v1
DOCLING_HTTP_TIMEOUT=600
# Logging
LOG_LEVEL=INFO
LOG_FORMAT={name}|{asctime}|{levelname}|{message}
# Storage
FILE_STORE_TARGET=fs
FILE_STORE_DIR=file_store
LANCEDB_DIR=lancedb
# Worker Settings
INGEST_WORKER_CONCURRENCY=10
DOCLING_CONCURRENCY=3
WORKER_TASK_COUNT=5
WORKER_CHECKIN_INTERVAL=120
WORKER_CHECKIN_TIMEOUT=600
EMBED_BATCH_SIZE=1000
# Workflow Configuration
WORKFLOW_DIR=config/workflows
DEFAULT_WORKFLOW_ID=batch_split
PARAM_DIR=config/params
USER_PARAM_DIR=config/user_params
DEFAULT_PARAM_ID=default
# Features
DO_RAG=true
Load with:
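Two common ways to load a `.env` file, depending on how you run the service (these are generic shell/Docker patterns, not project-specific commands):

```shell
# In a shell session: export every variable defined in .env
set -a && source .env && set +a

# When running under Docker
docker run --env-file .env soliplex-ingester:latest
```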
S3 Configuration Overview
This project supports S3 storage in two different contexts with different configuration methods:
1. LanceDB Vector Storage (LANCEDB_DIR)
Purpose: Stores vector embeddings for RAG retrieval Configuration Method: Standard AWS environment variables Example:
LANCEDB_DIR=s3://my-vector-bucket/lancedb
AWS_ACCESS_KEY_ID=your_key_id
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
2. Artifact Storage (FILE_STORE_TARGET=s3)
Purpose: Stores intermediate processing artifacts (documents, markdown, chunks, embeddings)
Configuration Method: Nested Pydantic settings with __ delimiter
Example:
FILE_STORE_TARGET=s3
ARTIFACT_S3__BUCKET=my-artifact-bucket
ARTIFACT_S3__ACCESS_KEY_ID=your_key_id
ARTIFACT_S3__ACCESS_SECRET=your_secret_key
ARTIFACT_S3__REGION=us-east-1
ARTIFACT_S3__ENDPOINT_URL=http://127.0.0.1:8333
Important Notes:
- These are independent systems and can use different S3 buckets or providers
- LanceDB uses standard AWS SDK naming (`AWS_SECRET_ACCESS_KEY`)
- Artifact/Input S3 uses Pydantic nested naming (`ARTIFACT_S3__ACCESS_SECRET`)
- The field name difference is intentional to support Pydantic's nested configuration
Nested Configuration (Advanced)
Pydantic Settings supports nested configuration using __ delimiter for structured settings.
Artifact S3 Configuration:
ARTIFACT_S3__BUCKET=soliplex-artifacts
ARTIFACT_S3__ACCESS_KEY_ID=soliplex
ARTIFACT_S3__ACCESS_SECRET=soliplex
ARTIFACT_S3__REGION=xx
ARTIFACT_S3__ENDPOINT_URL=http://127.0.0.1:8333
Input S3 Configuration:
INPUT_S3__BUCKET=soliplex-inputs
INPUT_S3__ACCESS_KEY_ID=soliplex
INPUT_S3__ACCESS_SECRET=soliplex
INPUT_S3__REGION=xx
INPUT_S3__ENDPOINT_URL=http://127.0.0.1:8333
Notes:
- `ACCESS_SECRET` (not `SECRET_ACCESS_KEY`) is used for nested config fields
- The nested delimiter is `__` (double underscore)
- Both `INPUT_S3` and `ARTIFACT_S3` can point to different buckets/providers
See `src/soliplex/ingester/lib/config.py:7-16` for nested model definitions.
Configuration Validation
Validate Settings
Check configuration without starting services:
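The validation subcommand referenced in the Troubleshooting section:

```shell
si-cli validate-settings
```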
Output:
doc_db_url='sqlite+aiosqlite:///./db/documents.db'
docling_server_url='http://localhost:5001/v1'
log_level='INFO'
...
Validation Errors:
Environment-Specific Configuration
Development
dev.env:
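An illustrative development setup; the values follow the patterns used elsewhere in this guide and are not prescriptive:

```shell
DOC_DB_URL=sqlite+aiosqlite:///./db/documents.db
LOG_LEVEL=DEBUG
INGEST_WORKER_CONCURRENCY=2
DOCLING_SERVER_URL=http://localhost:5001/v1
DO_RAG=false
```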
Staging
staging.env:
DOC_DB_URL=postgresql+psycopg://user:pass@staging-db:5432/soliplex
LOG_LEVEL=INFO
INGEST_WORKER_CONCURRENCY=10
DOCLING_SERVER_URL=http://docling-staging:5001/v1
DO_RAG=true
Production
production.env:
DOC_DB_URL=postgresql+psycopg://user:pass@prod-db:5432/soliplex
LOG_LEVEL=WARNING
LOG_FORMAT=json
INGEST_WORKER_CONCURRENCY=20
DOCLING_CONCURRENCY=5
DOCLING_SERVER_URL=http://docling-prod:5001/v1
FILE_STORE_TARGET=s3
DO_RAG=true
WORKER_CHECKIN_INTERVAL=60
Docker Configuration
Docker Compose Example
docker-compose.yml:
version: '3.8'
services:
ingester:
image: soliplex-ingester:latest
environment:
DOC_DB_URL: postgresql+psycopg://postgres:password@db:5432/soliplex
DOCLING_SERVER_URL: http://docling:5001/v1
LOG_LEVEL: INFO
FILE_STORE_DIR: /data/files
LANCEDB_DIR: /data/lancedb
INGEST_WORKER_CONCURRENCY: 15
volumes:
- ./config/workflows:/app/config/workflows
- ./config/params:/app/config/params
- data-volume:/data
depends_on:
- db
- docling
db:
image: postgres:16
environment:
POSTGRES_DB: soliplex
POSTGRES_USER: postgres
POSTGRES_PASSWORD: password
volumes:
- db-volume:/var/lib/postgresql/data
docling:
image: docling-server:latest
ports:
- "5001:5001"
volumes:
db-volume:
data-volume:
Kubernetes ConfigMap
configmap.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
name: soliplex-config
data:
DOC_DB_URL: "postgresql+psycopg://user:pass@postgres-service:5432/soliplex"
DOCLING_SERVER_URL: "http://docling-service:5001/v1"
LOG_LEVEL: "INFO"
INGEST_WORKER_CONCURRENCY: "20"
WORKFLOW_DIR: "/config/workflows"
PARAM_DIR: "/config/params"
USER_PARAM_DIR: "/config/user_params"
Performance Tuning
High Throughput
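A sketch of settings biased toward throughput; the values are illustrative starting points, not benchmarked recommendations:

```shell
INGEST_WORKER_CONCURRENCY=30
INGEST_QUEUE_CONCURRENCY=50
DOCLING_CONCURRENCY=8
EMBED_BATCH_SIZE=2000
```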
Notes:
- Requires powerful hardware
- Monitor CPU, memory, and I/O
- Coordinate with external service capacity
Resource Constrained
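A conservative sketch for small hosts (values are illustrative):

```shell
INGEST_WORKER_CONCURRENCY=3
INGEST_QUEUE_CONCURRENCY=5
DOCLING_CONCURRENCY=1
EMBED_BATCH_SIZE=200
```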
Notes:
- Reduces memory and CPU usage
- Lower throughput but more stable
- Good for shared environments
Batch Processing
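A sketch tuned for long-running batch jobs (values are illustrative):

```shell
WORKER_TASK_COUNT=20
WORKER_CHECKIN_INTERVAL=300
WORKER_CHECKIN_TIMEOUT=1800
EMBED_BATCH_SIZE=2000
```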
Notes:
- Optimized for processing large batches
- Reduces monitoring overhead
- Assumes long-running workers
Secrets Management
Using Environment Files
Keep secrets out of version control:
# .env.secrets (add to .gitignore)
DOC_DB_URL=postgresql+psycopg://user:$(cat /run/secrets/db_password)@db/soliplex
Using Secret Management Tools
AWS Secrets Manager:
export DOC_DB_URL=$(aws secretsmanager get-secret-value --secret-id db-url --query SecretString --output text)
HashiCorp Vault:
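A typical pattern using the Vault KV engine; the secret path and field name are assumptions for illustration:

```shell
export DOC_DB_URL=$(vault kv get -field=db_url secret/soliplex)
```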
Kubernetes Secrets:
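A minimal sketch: store the URL in a Secret and inject it into the pod (resource names are illustrative):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: soliplex-secrets
type: Opaque
stringData:
  DOC_DB_URL: postgresql+psycopg://user:pass@postgres-service:5432/soliplex
---
# In the Deployment's container spec, reference it with:
# envFrom:
#   - secretRef:
#       name: soliplex-secrets
```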
Troubleshooting
Configuration Not Loading
Symptom: Application uses default values
Solutions:
- Verify environment variables are set: `env | grep DOC_`
- Check for typos in variable names
- Ensure the `.env` file is in the correct directory
- Verify the `.env` file is being loaded
Validation Errors
Symptom: Application fails to start with validation error
Solutions:
- Run `si-cli validate-settings` to see errors
- Check that required fields are set
- Verify value types (e.g., integers for ports)
- Check URL formats
Connection Errors
Symptom: Cannot connect to database or Docling
Solutions:
- Verify URLs are correct
- Test connectivity: `curl http://docling-url/health`
- Verify credentials
File Permissions
Symptom: Cannot write to storage directories
Solutions:
- Check directory exists and is writable
- Verify application user permissions
- Create directories if needed: `mkdir -p file_store lancedb`
- Set ownership: `chown -R app-user file_store lancedb`
Configuration Reference
| Variable | Type | Required | Default | Description |
|---|---|---|---|---|
| `DOC_DB_URL` | str | Yes | - | Database connection URL |
| `AUTO_CREATE_DATABASE` | bool | No | `True` | Auto-create database tables on init |
| `DOCLING_SERVER_URL` | str | No | `http://localhost:5001/v1` | Docling parsing service URL |
| `DOCLING_CHUNK_SERVER_URL` | str | No | `http://localhost:5001/v1` | Docling chunking service URL |
| `DOCLING_HTTP_TIMEOUT` | int | No | `600` | Docling timeout (seconds) |
| `LOG_LEVEL` | str | No | `INFO` | Logging level |
| `LOG_FORMAT` | str | No | `{name}\|{asctime}\|{levelname}\|{message}` | Log format string (or `json`) |
| `FILE_STORE_TARGET` | str | No | `fs` | Storage backend type |
| `FILE_STORE_DIR` | str | No | `file_store` | Base storage directory |
| `LANCEDB_DIR` | str | No | `lancedb` | LanceDB directory (supports S3 URIs) |
| `DOCUMENT_STORE_DIR` | str | No | `raw` | Raw documents subdir |
| `PARSED_MARKDOWN_STORE_DIR` | str | No | `markdown` | Markdown subdir |
| `PARSED_JSON_STORE_DIR` | str | No | `json` | JSON subdir |
| `CHUNKS_STORE_DIR` | str | No | `chunks` | Chunks subdir |
| `EMBEDDINGS_STORE_DIR` | str | No | `embeddings` | Embeddings subdir |
| `FILE_PROTECTION_LEVEL` | str | No | `none` | File protection level (none/hash/hmac/encrypt) |
| `FILE_SECRET` | str | Conditional | - | Master secret for HMAC/encryption (required if protection is `hmac` or `encrypt`) |
| `INGEST_QUEUE_CONCURRENCY` | int | No | `20` | Queue concurrency |
| `INGEST_WORKER_CONCURRENCY` | int | No | `10` | Worker concurrency |
| `DOCLING_CONCURRENCY` | int | No | `3` | Docling concurrency |
| `WORKER_TASK_COUNT` | int | No | `5` | Steps per query |
| `WORKER_CHECKIN_INTERVAL` | int | No | `120` | Heartbeat interval (sec) |
| `WORKER_CHECKIN_TIMEOUT` | int | No | `600` | Worker timeout (sec) |
| `EMBED_BATCH_SIZE` | int | No | `1000` | Embedding batch size |
| `OLLAMA_BASE_URL` | str | No | `http://ollama:11434` | Ollama server URL for embeddings |
| `OLLAMA_BASE_URL_DOCLING` | str | No | `http://ollama:11434` | Ollama server URL for Docling chunking (can differ for load distribution) |
| `WORKFLOW_DIR` | str | No | `config/workflows` | Workflow definitions dir |
| `DEFAULT_WORKFLOW_ID` | str | No | `batch_split` | Default workflow |
| `PARAM_DIR` | str | No | `config/params` | System parameter sets dir |
| `USER_PARAM_DIR` | str | No | `config/user_params` | User parameter sets dir |
| `DEFAULT_PARAM_ID` | str | No | `default` | Default parameter set |
| `AWS_ACCESS_KEY_ID` | str | Conditional | - | AWS access key (required for S3 LanceDB) |
| `AWS_SECRET_ACCESS_KEY` | str | Conditional | - | AWS secret key (required for S3 LanceDB) |
| `AWS_REGION` | str | Conditional | - | AWS region (required for S3 LanceDB) |
| `AWS_ENDPOINT` | str | No | - | S3 endpoint (for non-AWS providers) |
| `AWS_ALLOW_HTTP` | int | No | - | Allow HTTP for S3 (set to `1` for HTTP) |
| `ARTIFACT_S3__*` | nested | Conditional | - | Artifact S3 config (BUCKET, ACCESS_SECRET, etc.) |
| `INPUT_S3__*` | nested | Conditional | - | Input S3 config (BUCKET, ACCESS_SECRET, etc.) |
| `DO_RAG` | bool | No | `True` | Enable RAG integration |
| `API_KEY` | str | Conditional | - | Static API key (required if `API_KEY_ENABLED`) |
| `API_KEY_ENABLED` | bool | No | `False` | Enable API key authentication |
| `AUTH_TRUST_PROXY_HEADERS` | bool | No | `False` | Trust OAuth2 Proxy headers |