Configuration Guide
Overview
Soliplex Ingester is configured via environment variables using Pydantic Settings. All configuration is defined in src/soliplex/ingester/lib/config.py:15.
Environment Variables
Database Configuration
DOC_DB_URL (required)
Database connection URL.
SQLite Example:
PostgreSQL Example:
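Both URL forms, with illustrative paths and credentials:

```shell
# SQLite (async driver required)
DOC_DB_URL=sqlite+aiosqlite:///./db/documents.db

# PostgreSQL (user, password, host, and database name are placeholders)
DOC_DB_URL=postgresql+psycopg://user:pass@db-host:5432/soliplex
```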
Notes:
- Must use async drivers (aiosqlite for SQLite or psycopg for PostgreSQL)
- SQLite uses relative or absolute file paths
- PostgreSQL requires credentials and network access
External Services
DOCLING_SERVER_URL
Docling document parsing service endpoint.
Default: http://localhost:5001/v1
Example:
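An illustrative value (hostname depends on your deployment):

```shell
DOCLING_SERVER_URL=http://docling:5001/v1
```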
Notes:
- Used for document parsing (PDF, DOCX, etc.)
- Must be accessible from worker nodes
- Health check: GET {url}/health
DOCLING_CHUNK_SERVER_URL
Docling service endpoint for chunking operations.
Default: http://localhost:5001/v1
Example:
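An illustrative value pointing chunking at a separate instance (hostname is a placeholder):

```shell
DOCLING_CHUNK_SERVER_URL=http://docling-chunk:5001/v1
```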
Notes:
- Used by haiku.rag for document chunking via docling-serve
- Can point to a different Docling instance than DOCLING_SERVER_URL for load distribution
- If not set, defaults to the same endpoint as parsing
- Useful when running separate Docling instances for parsing vs chunking
DOCLING_HTTP_TIMEOUT
HTTP timeout for Docling requests in seconds.
Default: 600 (10 minutes)
Example:
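For example, allowing up to 30 minutes for very large documents:

```shell
DOCLING_HTTP_TIMEOUT=1800
```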
Notes:
- Large documents may require longer timeouts
- Adjust based on document size and complexity
OLLAMA_BASE_URL
Ollama server endpoint for embedding generation.
Default: http://ollama:11434
Example:
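An illustrative value (hostname is a placeholder for your Ollama deployment):

```shell
OLLAMA_BASE_URL=http://ollama.internal:11434
```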
Notes:
- Used for generating document embeddings during the embed step
- Must be accessible from worker nodes
- The Ollama server should have the required embedding models loaded
OLLAMA_BASE_URL_DOCLING
Ollama server endpoint for Docling chunking operations.
Default: http://ollama:11434
Example:
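An illustrative value pointing chunking at a second Ollama instance (hostname is a placeholder):

```shell
OLLAMA_BASE_URL_DOCLING=http://ollama-docling:11434
```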
Notes:
- Used by docling-serve for document chunking operations
- Can point to a different Ollama instance than OLLAMA_BASE_URL for load distribution
- If not set, defaults to the same URL as OLLAMA_BASE_URL
- Useful when running separate Ollama instances to distribute model loading across servers
Logging
LOG_LEVEL
Python logging level.
Default: INFO
Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
Example:
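For example, to enable verbose logging during development:

```shell
LOG_LEVEL=DEBUG
```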
File Storage
FILE_STORE_TARGET
Storage backend type.
Default: fs (filesystem)
Options:
- fs - Local filesystem
- s3 - S3-compatible storage (requires OpenDAL config)
Example:
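For example, to switch to S3-compatible storage (see the S3 Configuration Overview below for the accompanying `ARTIFACT_S3__*` settings):

```shell
FILE_STORE_TARGET=s3
```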
FILE_STORE_DIR
Base directory for file storage.
Default: file_store
Example:
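For example, using an absolute path (path is illustrative):

```shell
FILE_STORE_DIR=/data/files
```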
Notes:
- Used when FILE_STORE_TARGET=fs
- Must be writable by the application user
- Consider disk space requirements
DOCUMENT_STORE_DIR
Subdirectory for raw documents.
Default: raw
Full Path: {FILE_STORE_DIR}/{DOCUMENT_STORE_DIR}
PARSED_MARKDOWN_STORE_DIR
Subdirectory for parsed markdown.
Default: markdown
PARSED_JSON_STORE_DIR
Subdirectory for parsed JSON.
Default: json
CHUNKS_STORE_DIR
Subdirectory for text chunks.
Default: chunks
EMBEDDINGS_STORE_DIR
Subdirectory for embedding vectors.
Default: embeddings
Vector Database
LANCEDB_DIR
Directory for LanceDB vector storage.
Default: lancedb
Filesystem Example:
S3 Example:
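Both forms, with illustrative paths and bucket names:

```shell
# Filesystem
LANCEDB_DIR=/data/lancedb

# S3 (requires the AWS environment variables described below)
LANCEDB_DIR=s3://my-vector-bucket/lancedb
```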
Notes:
- Local filesystem paths or S3 URIs are supported
- When using S3, LanceDB requires standard AWS environment variables (see below)
- Stores vector embeddings for RAG retrieval
- Requires sufficient disk space (filesystem) or S3 bucket access
- Periodically compact for performance
Required AWS Environment Variables for S3:
AWS_ACCESS_KEY_ID=your_key_id
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
Optional AWS Environment Variables:
# For S3-compatible providers (MinIO, SeaweedFS, etc.)
AWS_ENDPOINT=http://127.0.0.1:8333
# Required when using HTTP to connect to endpoint
AWS_ALLOW_HTTP=1
Worker Configuration
INGEST_QUEUE_CONCURRENCY
Maximum concurrent queue operations.
Default: 20
Example:
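For example, doubling the default for higher throughput (value illustrative):

```shell
INGEST_QUEUE_CONCURRENCY=40
```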
Notes:
- Controls internal queue processing
- Higher values increase throughput but use more memory
INGEST_WORKER_CONCURRENCY
Maximum concurrent workflow steps per worker.
Default: 10
Example:
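For example, raising the per-worker step limit (value illustrative):

```shell
INGEST_WORKER_CONCURRENCY=15
```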
Notes:
- Primary throughput control
- Balance against CPU and external service limits
- Monitor resource usage when tuning
DOCLING_CONCURRENCY
Maximum concurrent Docling requests.
Default: 3
Example:
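For example, when the Docling server has spare capacity (value illustrative):

```shell
DOCLING_CONCURRENCY=5
```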
Notes:
- Prevents overwhelming the Docling service
- Coordinate with Docling server capacity
- Increase if Docling can handle the load
WORKER_TASK_COUNT
Number of workflow steps to fetch per query.
Default: 5
Example:
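For example, fetching larger batches to reduce database round-trips (value illustrative):

```shell
WORKER_TASK_COUNT=10
```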
Notes:
- Batch size for step queries
- Higher values reduce database round-trips
- Lower values improve fairness across workers
WORKER_CHECKIN_INTERVAL
Worker heartbeat interval in seconds.
Default: 120 (2 minutes)
Example:
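For example, checking in every minute for tighter liveness monitoring (value illustrative):

```shell
WORKER_CHECKIN_INTERVAL=60
```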
Notes:
- How often workers update their health status
- Lower values increase database load slightly
- Used for monitoring worker liveness
WORKER_CHECKIN_TIMEOUT
Worker timeout threshold in seconds.
Default: 600 (10 minutes)
Example:
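For example, marking workers stale after five minutes (value illustrative; keep well above WORKER_CHECKIN_INTERVAL):

```shell
WORKER_CHECKIN_TIMEOUT=300
```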
Notes:
- When to consider a worker stale
- Should be significantly larger than WORKER_CHECKIN_INTERVAL
- Used for detecting crashed workers
EMBED_BATCH_SIZE
Batch size for embedding operations.
Default: 1000
Example:
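For example, a smaller batch for a memory-constrained embedding service (value illustrative):

```shell
EMBED_BATCH_SIZE=500
```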
Notes:
- Number of chunks to embed at once
- Higher values improve throughput
- Limited by embedding service capacity and memory
Workflow Configuration
WORKFLOW_DIR
Directory containing workflow YAML definitions.
Default: config/workflows
Example:
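For example, using a container-mounted path (path illustrative):

```shell
WORKFLOW_DIR=/app/config/workflows
```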
Notes:
- Scanned for *.yaml files at startup
- Hot-reloaded when the --reload flag is used
DEFAULT_WORKFLOW_ID
Default workflow to use when not specified.
Default: batch_split
Example:
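For example, selecting a different workflow id defined in your YAML files:

```shell
DEFAULT_WORKFLOW_ID=batch
```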
Notes:
- Must match an id in workflow YAML files
- Used when API requests omit workflow_definition_id
PARAM_DIR
Directory containing parameter set YAML files.
Default: config/params
Example:
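For example, using a container-mounted path (path illustrative):

```shell
PARAM_DIR=/app/config/params
```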
DEFAULT_PARAM_ID
Default parameter set to use when not specified.
Default: default
Example:
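For example, selecting a parameter set id defined in your YAML files:

```shell
DEFAULT_PARAM_ID=default
```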
Notes:
- Must match an id in parameter YAML files
Feature Flags
DO_RAG
Enable/disable HaikuRAG integration.
Default: True
Example:
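For example, disabling RAG in a CI pipeline:

```shell
DO_RAG=false
```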
Notes:
- Set to false for testing without RAG backend
- When disabled, store step becomes a no-op
- Useful for CI/CD testing
Authentication
API_KEY
Static API key for programmatic access.
Default: None (disabled)
Example:
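An illustrative value (the key shown is a placeholder; generate your own as described in the notes):

```shell
# Generate with: openssl rand -hex 32
API_KEY=0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
```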
Notes:
- Generate with: openssl rand -hex 32
- Must also set API_KEY_ENABLED=true to enforce
- Clients pass via Authorization: Bearer <token> header
- Keep this value secure; do not commit it to version control
API_KEY_ENABLED
Enable API key authentication.
Default: False
Example:
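For example, turning enforcement on:

```shell
API_KEY_ENABLED=true
```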
Notes:
- When true, all API requests require valid Authorization: Bearer header
- When false, API is open (or protected by OAuth2 Proxy)
- Can be combined with AUTH_TRUST_PROXY_HEADERS for hybrid auth
AUTH_TRUST_PROXY_HEADERS
Trust user identity headers from OAuth2 Proxy.
Default: False
Example:
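For example, when deployed behind OAuth2 Proxy:

```shell
AUTH_TRUST_PROXY_HEADERS=true
```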
Notes:
- Enable when running behind OAuth2 Proxy
- Reads user identity from X-Auth-Request-User, X-Auth-Request-Email headers
- Security: Only enable when behind a trusted reverse proxy
- See AUTHENTICATION.md for OAuth2 Proxy setup
Configuration File
While the system uses environment variables, you can organize them in a .env file:
.env Example:
# Database
DOC_DB_URL=sqlite+aiosqlite:///./db/documents.db
# External Services
DOCLING_SERVER_URL=http://localhost:5001/v1
DOCLING_HTTP_TIMEOUT=600
# Logging
LOG_LEVEL=INFO
# Storage
FILE_STORE_TARGET=fs
FILE_STORE_DIR=file_store
LANCEDB_DIR=lancedb
# Worker Settings
INGEST_WORKER_CONCURRENCY=10
DOCLING_CONCURRENCY=3
WORKER_TASK_COUNT=5
WORKER_CHECKIN_INTERVAL=120
WORKER_CHECKIN_TIMEOUT=600
EMBED_BATCH_SIZE=1000
# Workflow Configuration
WORKFLOW_DIR=config/workflows
DEFAULT_WORKFLOW_ID=batch
PARAM_DIR=config/params
DEFAULT_PARAM_ID=default
# Features
DO_RAG=true
Load with:
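One way to load the file, assuming a POSIX shell (Pydantic Settings can typically also read a .env file directly without this step):

```shell
set -a        # export every variable defined while sourcing
source .env
set +a
```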
S3 Configuration Overview
This project supports S3 storage in two different contexts with different configuration methods:
1. LanceDB Vector Storage (LANCEDB_DIR)
Purpose: Stores vector embeddings for RAG retrieval Configuration Method: Standard AWS environment variables Example:
LANCEDB_DIR=s3://my-vector-bucket/lancedb
AWS_ACCESS_KEY_ID=your_key_id
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
2. Artifact Storage (FILE_STORE_TARGET=s3)
Purpose: Stores intermediate processing artifacts (documents, markdown, chunks, embeddings)
Configuration Method: Nested Pydantic settings with __ delimiter
Example:
FILE_STORE_TARGET=s3
ARTIFACT_S3__BUCKET=my-artifact-bucket
ARTIFACT_S3__ACCESS_KEY_ID=your_key_id
ARTIFACT_S3__ACCESS_SECRET=your_secret_key
ARTIFACT_S3__REGION=us-east-1
ARTIFACT_S3__ENDPOINT_URL=http://127.0.0.1:8333
Important Notes:
- These are independent systems and can use different S3 buckets or providers
- LanceDB uses standard AWS SDK naming (AWS_SECRET_ACCESS_KEY)
- Artifact/Input S3 uses Pydantic nested naming (ARTIFACT_S3__ACCESS_SECRET)
- The field name difference is intentional to support Pydantic's nested configuration
Nested Configuration (Advanced)
Pydantic Settings supports nested configuration using __ delimiter for structured settings.
Artifact S3 Configuration:
ARTIFACT_S3__BUCKET=soliplex-artifacts
ARTIFACT_S3__ACCESS_KEY_ID=soliplex
ARTIFACT_S3__ACCESS_SECRET=soliplex
ARTIFACT_S3__REGION=xx
ARTIFACT_S3__ENDPOINT_URL=http://127.0.0.1:8333
Input S3 Configuration:
INPUT_S3__BUCKET=soliplex-inputs
INPUT_S3__ACCESS_KEY_ID=soliplex
INPUT_S3__ACCESS_SECRET=soliplex
INPUT_S3__REGION=xx
INPUT_S3__ENDPOINT_URL=http://127.0.0.1:8333
Notes:
- ACCESS_SECRET (not SECRET_ACCESS_KEY) is used for nested config fields
- Nested delimiter is __ (double underscore)
- Both INPUT_S3 and ARTIFACT_S3 can point to different buckets/providers
See src/soliplex/ingester/lib/config.py:7-16 for nested model definitions.
Configuration Validation
Validate Settings
Check configuration without starting services:
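Using the project CLI (the same command referenced in the Troubleshooting section):

```shell
si-cli validate-settings
```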
Output:
doc_db_url='sqlite+aiosqlite:///./db/documents.db'
docling_server_url='http://localhost:5001/v1'
log_level='INFO'
...
Validation Errors:
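If a required field is missing, Pydantic raises a validation error; the shape is roughly as follows (settings class name illustrative):

```text
1 validation error for Settings
doc_db_url
  Field required [type=missing, input_value={}, input_type=dict]
```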
Environment-Specific Configuration
Development
dev.env:
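An illustrative development profile (values are suggestions, not requirements):

```shell
DOC_DB_URL=sqlite+aiosqlite:///./db/documents.db
LOG_LEVEL=DEBUG
INGEST_WORKER_CONCURRENCY=2
DOCLING_SERVER_URL=http://localhost:5001/v1
DO_RAG=false
```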
Staging
staging.env:
DOC_DB_URL=postgresql+psycopg://user:pass@staging-db:5432/soliplex
LOG_LEVEL=INFO
INGEST_WORKER_CONCURRENCY=10
DOCLING_SERVER_URL=http://docling-staging:5001/v1
DO_RAG=true
Production
production.env:
DOC_DB_URL=postgresql+psycopg://user:pass@prod-db:5432/soliplex
LOG_LEVEL=WARNING
INGEST_WORKER_CONCURRENCY=20
DOCLING_CONCURRENCY=5
DOCLING_SERVER_URL=http://docling-prod:5001/v1
FILE_STORE_TARGET=s3
DO_RAG=true
WORKER_CHECKIN_INTERVAL=60
Docker Configuration
Docker Compose Example
docker-compose.yml:
version: '3.8'
services:
ingester:
image: soliplex-ingester:latest
environment:
DOC_DB_URL: postgresql+psycopg://postgres:password@db:5432/soliplex
DOCLING_SERVER_URL: http://docling:5001/v1
LOG_LEVEL: INFO
FILE_STORE_DIR: /data/files
LANCEDB_DIR: /data/lancedb
INGEST_WORKER_CONCURRENCY: 15
volumes:
- ./config/workflows:/app/config/workflows
- ./config/params:/app/config/params
- data-volume:/data
depends_on:
- db
- docling
db:
image: postgres:16
environment:
POSTGRES_DB: soliplex
POSTGRES_USER: postgres
POSTGRES_PASSWORD: password
volumes:
- db-volume:/var/lib/postgresql/data
docling:
image: docling-server:latest
ports:
- "5001:5001"
volumes:
db-volume:
data-volume:
Kubernetes ConfigMap
configmap.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
name: soliplex-config
data:
DOC_DB_URL: "postgresql+psycopg://user:pass@postgres-service:5432/soliplex"
DOCLING_SERVER_URL: "http://docling-service:5001/v1"
LOG_LEVEL: "INFO"
INGEST_WORKER_CONCURRENCY: "20"
WORKFLOW_DIR: "/config/workflows"
PARAM_DIR: "/config/params"
Performance Tuning
High Throughput
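Illustrative settings for maximizing throughput (scale to your hardware):

```shell
INGEST_WORKER_CONCURRENCY=30
INGEST_QUEUE_CONCURRENCY=50
DOCLING_CONCURRENCY=8
EMBED_BATCH_SIZE=2000
```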
Notes:
- Requires powerful hardware
- Monitor CPU, memory, and I/O
- Coordinate with external service capacity
Resource Constrained
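Illustrative settings for limited hardware (values are suggestions):

```shell
INGEST_WORKER_CONCURRENCY=3
INGEST_QUEUE_CONCURRENCY=5
DOCLING_CONCURRENCY=1
EMBED_BATCH_SIZE=200
```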
Notes:
- Reduces memory and CPU usage
- Lower throughput but more stable
- Good for shared environments
Batch Processing
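Illustrative settings for long-running batch workers (values are suggestions):

```shell
WORKER_TASK_COUNT=20
WORKER_CHECKIN_INTERVAL=300
WORKER_CHECKIN_TIMEOUT=1800
EMBED_BATCH_SIZE=2000
```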
Notes:
- Optimized for processing large batches
- Reduces monitoring overhead
- Assumes long-running workers
Secrets Management
Using Environment Files
Keep secrets out of version control:
# .env.secrets (add to .gitignore)
DOC_DB_URL=postgresql+psycopg://user:$(cat /run/secrets/db_password)@db/soliplex
# Note: the $(...) substitution is only expanded when this file is sourced by a shell;
# plain .env parsers read the value literally
Using Secret Management Tools
AWS Secrets Manager:
export DOC_DB_URL=$(aws secretsmanager get-secret-value --secret-id db-url --query SecretString --output text)
HashiCorp Vault:
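A sketch using the Vault CLI's KV engine (secret path and field name are illustrative):

```shell
export DOC_DB_URL=$(vault kv get -field=url secret/soliplex/db)
```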
Kubernetes Secrets:
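A sketch using kubectl (secret name and value are illustrative); reference the secret from the deployment via `envFrom` or `secretKeyRef`:

```shell
kubectl create secret generic soliplex-secrets \
  --from-literal=DOC_DB_URL='postgresql+psycopg://user:pass@postgres-service:5432/soliplex'
```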
Troubleshooting
Configuration Not Loading
Symptom: Application uses default values
Solutions:
1. Verify environment variables are set: env | grep DOC_
2. Check for typos in variable names
3. Ensure .env file is in correct directory
4. Verify .env file is being loaded
Validation Errors
Symptom: Application fails to start with validation error
Solutions:
1. Run si-cli validate-settings to see errors
2. Check required fields are set
3. Verify value types (e.g., integers for ports)
4. Check URL formats
Connection Errors
Symptom: Cannot connect to database or Docling
Solutions:
1. Verify URLs are correct
2. Test connectivity: curl http://docling-url/health
3. Check network policies/firewall
4. Verify credentials
File Permissions
Symptom: Cannot write to storage directories
Solutions:
1. Check directory exists and is writable
2. Verify application user permissions
3. Create directories if needed: mkdir -p file_store lancedb
4. Set ownership: chown -R app-user file_store lancedb
Configuration Reference
| Variable | Type | Required | Default | Description |
|---|---|---|---|---|
| DOC_DB_URL | str | Yes | - | Database connection URL |
| DOCLING_SERVER_URL | str | No | http://localhost:5001/v1 | Docling parsing service URL |
| DOCLING_CHUNK_SERVER_URL | str | No | http://localhost:5001/v1 | Docling chunking service URL |
| DOCLING_HTTP_TIMEOUT | int | No | 600 | Docling timeout (seconds) |
| LOG_LEVEL | str | No | INFO | Logging level |
| FILE_STORE_TARGET | str | No | fs | Storage backend type |
| FILE_STORE_DIR | str | No | file_store | Base storage directory |
| LANCEDB_DIR | str | No | lancedb | LanceDB directory (supports S3 URIs) |
| DOCUMENT_STORE_DIR | str | No | raw | Raw documents subdir |
| PARSED_MARKDOWN_STORE_DIR | str | No | markdown | Markdown subdir |
| PARSED_JSON_STORE_DIR | str | No | json | JSON subdir |
| CHUNKS_STORE_DIR | str | No | chunks | Chunks subdir |
| EMBEDDINGS_STORE_DIR | str | No | embeddings | Embeddings subdir |
| INGEST_QUEUE_CONCURRENCY | int | No | 20 | Queue concurrency |
| INGEST_WORKER_CONCURRENCY | int | No | 10 | Worker concurrency |
| DOCLING_CONCURRENCY | int | No | 3 | Docling concurrency |
| WORKER_TASK_COUNT | int | No | 5 | Steps per query |
| WORKER_CHECKIN_INTERVAL | int | No | 120 | Heartbeat interval (sec) |
| WORKER_CHECKIN_TIMEOUT | int | No | 600 | Worker timeout (sec) |
| EMBED_BATCH_SIZE | int | No | 1000 | Embedding batch size |
| OLLAMA_BASE_URL | str | No | http://ollama:11434 | Ollama server URL for embeddings |
| OLLAMA_BASE_URL_DOCLING | str | No | http://ollama:11434 | Ollama server URL for Docling chunking (can differ for load distribution) |
| WORKFLOW_DIR | str | No | config/workflows | Workflow definitions dir |
| DEFAULT_WORKFLOW_ID | str | No | batch_split | Default workflow |
| PARAM_DIR | str | No | config/params | Parameter sets dir |
| DEFAULT_PARAM_ID | str | No | default | Default parameter set |
| AWS_ACCESS_KEY_ID | str | Conditional | - | AWS access key (required for S3 LanceDB) |
| AWS_SECRET_ACCESS_KEY | str | Conditional | - | AWS secret key (required for S3 LanceDB) |
| AWS_REGION | str | Conditional | - | AWS region (required for S3 LanceDB) |
| AWS_ENDPOINT | str | No | - | S3 endpoint (for non-AWS providers) |
| AWS_ALLOW_HTTP | int | No | - | Allow HTTP for S3 (set to 1 for HTTP) |
| ARTIFACT_S3__* | nested | Conditional | - | Artifact S3 config (BUCKET, ACCESS_SECRET, etc.) |
| INPUT_S3__* | nested | Conditional | - | Input S3 config (BUCKET, ACCESS_SECRET, etc.) |
| DO_RAG | bool | No | True | Enable RAG integration |
| API_KEY | str | Conditional | - | Static API key (required if API_KEY_ENABLED) |
| API_KEY_ENABLED | bool | No | False | Enable API key authentication |
| AUTH_TRUST_PROXY_HEADERS | bool | No | False | Trust OAuth2 Proxy headers |