Parameter Sets
Parameter sets define the configuration for document processing workflows in Soliplex Ingester. Each parameter set specifies how documents should be parsed, chunked, embedded, and stored.
Table of Contents
- Overview
- YAML Schema
- Configuration Sections
- Creating Parameter Sets
- Managing Parameter Sets
- Examples
- Best Practices
Overview
Parameter sets control every aspect of document processing:
- Parsing: How to extract text and structure from documents
- Chunking: How to split documents into semantic chunks
- Embedding: Which model to use for vector embeddings
- Storage: Where to store the resulting vector database
Parameter sets are stored as YAML files in two separate directories:
- System parameter sets (config/params/ by default) — built-in, source: app, cannot be deleted via API
- User parameter sets (config/user_params/ by default) — user-uploaded, source: user, can be deleted via API
Both directories are merged transparently at load time. Parameter set IDs must be unique across both directories.
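The merge rule above can be pictured with a short sketch. It assumes each directory has already been parsed into a mapping from parameter set ID to definition; the function name and the choice to raise on a duplicate ID are illustrative assumptions, not the ingester's actual loader.

```python
def merge_param_sets(system_sets, user_sets):
    """Merge two {id: definition} mappings and tag each entry with its source.

    Hypothetical helper: raising on a duplicate ID is an assumption about
    how a uniqueness violation might be reported.
    """
    duplicates = sorted(set(system_sets) & set(user_sets))
    if duplicates:
        raise ValueError(f"duplicate parameter set IDs: {', '.join(duplicates)}")

    merged = {}
    for set_id, definition in system_sets.items():
        merged[set_id] = {**definition, "source": "app"}   # built-in
    for set_id, definition in user_sets.items():
        merged[set_id] = {**definition, "source": "user"}  # uploaded
    return merged


merged = merge_param_sets(
    {"default": {"name": "Default Parameters"}},
    {"api_created_params": {"name": "API Created Parameters"}},
)
print(sorted(merged))
```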
Parameter sets can be created via:
- Configuration files in config/params/ (built-in)
- REST API uploads (stored in config/user_params/)
- Web UI (if available)
YAML Schema
Complete Structure
id: <string> # Required: Unique identifier for this parameter set
name: <string> # Optional: Human-readable name
config: # Required: Configuration sections
  parse: # Optional: Document parsing configuration
    do_ocr: <boolean>
    force_ocr: <boolean>
    ocr_engine: <string>
    ocr_lang: <string>
    pdf_backend: <string>
    table_mode: <string>
  chunk: # Optional: Document chunking configuration
    chunker: <string>
    chunk_size: <integer>
    text_context_radius: <integer>
    chunker_type: <string>
  embed: # Required: Embedding configuration
    provider: <string>
    model: <string>
    vector_dim: <integer>
  store: # Required: Storage configuration
    data_dir: <string>
Field Reference
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique identifier (alphanumeric, hyphens, underscores) |
| name | string | No | Display name for UI |
| config | object | Yes | Configuration sections |
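A quick structural check before uploading can catch missing required fields. The sketch below works on an already-parsed parameter set (a plain dict). The ID pattern is inferred from the "alphanumeric, hyphens, underscores" description, and the function name and error messages are hypothetical; the ingester's own validation may differ.

```python
import re

# Assumed from the field description: letters, digits, hyphens, underscores.
ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")

# Sections and fields marked as required in the schema.
REQUIRED_SECTIONS = {
    "embed": ("provider", "model", "vector_dim"),
    "store": ("data_dir",),
}


def validate_param_set(doc):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    set_id = doc.get("id")
    if not isinstance(set_id, str) or not ID_PATTERN.match(set_id):
        errors.append("id must contain only letters, digits, hyphens, underscores")

    config = doc.get("config")
    if not isinstance(config, dict):
        errors.append("config is required and must be a mapping")
        return errors

    for section, fields in REQUIRED_SECTIONS.items():
        values = config.get(section)
        if not isinstance(values, dict):
            errors.append(f"config.{section} is required")
            continue
        for field in fields:
            if field not in values:
                errors.append(f"config.{section}.{field} is required")
    return errors
```

Calling `validate_param_set` on a document without a `store` section returns a single entry naming `config.store`.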
Configuration Sections
Parse Configuration
Controls how documents are parsed and text is extracted.
parse:
  do_ocr: false # Enable OCR for images and scanned PDFs
  force_ocr: false # Force OCR even if text layer exists
  ocr_engine: easyocr # OCR engine: easyocr, tesseract
  ocr_lang: en # OCR language code
  pdf_backend: pypdfium2 # PDF parsing backend: pypdfium2, pdfplumber
  table_mode: accurate # Table extraction: accurate, fast
Field Details:
| Field | Type | Default | Description |
|---|---|---|---|
| do_ocr | boolean | false | Enable OCR for image-based PDFs |
| force_ocr | boolean | false | Force OCR even when text layer exists |
| ocr_engine | string | easyocr | OCR engine to use |
| ocr_lang | string | en | Language code for OCR (ISO 639-1) |
| pdf_backend | string | pypdfium2 | PDF parsing library |
| table_mode | string | accurate | Table extraction mode |
OCR Engines:
- easyocr - Neural network-based OCR (recommended)
- tesseract - Traditional OCR engine
PDF Backends:
- pypdfium2 - Fast, modern PDF parsing (recommended)
- pdfplumber - Feature-rich parsing with table support
Table Modes:
- accurate - Slower but more precise table extraction
- fast - Faster processing with acceptable accuracy
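Because every parse field has a default, a parameter set only needs to list the values it changes. The following sketch shows how defaults and overrides might combine, using the defaults and allowed values listed in this section. The helper name and the decision to reject unknown keys are assumptions for illustration.

```python
# Defaults and allowed values as documented in the parse field tables.
PARSE_DEFAULTS = {
    "do_ocr": False,
    "force_ocr": False,
    "ocr_engine": "easyocr",
    "ocr_lang": "en",
    "pdf_backend": "pypdfium2",
    "table_mode": "accurate",
}
PARSE_CHOICES = {
    "ocr_engine": {"easyocr", "tesseract"},
    "pdf_backend": {"pypdfium2", "pdfplumber"},
    "table_mode": {"accurate", "fast"},
}


def resolve_parse_config(overrides=None):
    """Overlay user-supplied parse settings on the documented defaults."""
    resolved = {**PARSE_DEFAULTS, **(overrides or {})}
    unknown = sorted(set(resolved) - set(PARSE_DEFAULTS))
    if unknown:
        raise ValueError(f"unknown parse fields: {', '.join(unknown)}")
    for field, allowed in PARSE_CHOICES.items():
        if resolved[field] not in allowed:
            raise ValueError(f"{field} must be one of {sorted(allowed)}")
    return resolved


print(resolve_parse_config({"do_ocr": True, "table_mode": "fast"}))
```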
Chunk Configuration
Controls how documents are split into chunks for vector search.
chunk:
  chunker: docling-serve # Chunking service: docling-serve, local
  chunk_size: 256 # Target chunk size in tokens
  text_context_radius: 0 # Context overlap in characters
  chunker_type: hybrid # Chunking strategy: hybrid, hierarchical, token
Field Details:
| Field | Type | Default | Description |
|---|---|---|---|
| chunker | string | docling-serve | Chunking service to use |
| chunk_size | integer | 256 | Target chunk size in tokens |
| text_context_radius | integer | 0 | Character overlap between chunks |
| chunker_type | string | hybrid | Chunking strategy |
Chunker Options:
- docling-serve - Uses Docling service for semantic chunking (recommended)
- local - Local chunking without external service
Chunker Types:
- hybrid - Combines semantic and token-based splitting (recommended)
- hierarchical - Respects document structure (sections, paragraphs)
- token - Simple token-based splitting
Chunk Size Guidelines:
- 128-256 tokens: Good for precise search, higher storage cost
- 256-512 tokens: Balanced performance (recommended)
- 512-1024 tokens: Better context, less precise search
Text Context Radius:
- 0: No overlap (faster indexing, less redundancy)
- 50-100 chars: Small overlap for continuity
- 100-200 chars: Medium overlap (recommended for narratives)
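To see why smaller chunks cost more storage, a back-of-the-envelope estimate helps. The sketch below assumes 4-byte (float32) vector components and counts raw vectors only; it ignores overlap, stored text, and index overhead, so treat the numbers as a lower bound rather than what LanceDB will actually use.

```python
import math


def estimate_index_size(total_tokens, chunk_size, vector_dim, bytes_per_value=4):
    """Return (chunk_count, raw_vector_bytes) for a corpus.

    Assumes one vector per chunk and float32 components (4 bytes each).
    """
    chunks = math.ceil(total_tokens / chunk_size)
    return chunks, chunks * vector_dim * bytes_per_value


# Compare chunk sizes for a hypothetical 1M-token corpus and a 768-dim model.
for size in (128, 256, 512, 1024):
    chunks, vector_bytes = estimate_index_size(1_000_000, size, 768)
    print(f"chunk_size={size}: {chunks} chunks, {vector_bytes / 1e6:.1f} MB")
```

Halving `chunk_size` roughly doubles both the number of chunks and the vector storage, which is the trade-off the guidelines describe.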
Embed Configuration
Controls vector embedding generation.
embed:
  provider: ollama # Embedding provider: ollama, openai
  model: qwen3-embedding:4b # Model identifier
  vector_dim: 2560 # Vector dimension
Field Details:
| Field | Type | Required | Description |
|---|---|---|---|
| provider | string | Yes | Embedding service provider |
| model | string | Yes | Model name or identifier |
| vector_dim | integer | Yes | Embedding vector dimension |
Providers:
Soliplex Ingester ships with two embedding providers. The provider is selected
per parameter set via embed.provider, but connection details (base URL, auth)
come from environment variables, not the YAML. This keeps secrets and
deployment-specific endpoints out of the param set definitions.
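As a rough illustration of this split, the sketch below resolves connection details from an environment mapping based on the provider named in the parameter set. It uses the variable names documented for each provider in the subsections that follow (OLLAMA_BASE_URL, EMBED_LLM_URL, OPENAI_API_KEY); the function itself and its return shape are hypothetical, not the ingester's internals.

```python
def resolve_embed_connection(provider, env):
    """Map embed.provider to connection settings taken from the environment.

    `env` is any mapping of environment variables, e.g. os.environ.
    """
    if provider == "ollama":
        base_url = env.get("OLLAMA_BASE_URL")
        if not base_url:
            raise ValueError("OLLAMA_BASE_URL is not set")
        return {"base_url": base_url.rstrip("/"), "api_key": None}

    if provider == "openai":
        base_url = env.get("EMBED_LLM_URL")
        api_key = env.get("OPENAI_API_KEY")
        if not base_url:
            raise ValueError("EMBED_LLM_URL is not set")
        if not api_key:
            raise ValueError("OPENAI_API_KEY is not set")
        return {"base_url": base_url.rstrip("/"), "api_key": api_key}

    raise ValueError(f"unknown embedding provider: {provider!r}")


# Example with an explicit mapping instead of the real environment.
print(resolve_embed_connection("ollama", {"OLLAMA_BASE_URL": "http://ollama:11434"}))
```

The parameter set itself stays free of URLs and keys, so the same YAML can be reused across deployments.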
Ollama
Use Ollama when you want local / self-hosted embeddings with no per-token
cost. All ingester workers share the Ollama server configured via
OLLAMA_BASE_URL.
Required environment variables:
| Variable | Required | Purpose |
|---|---|---|
| OLLAMA_BASE_URL | Yes | Ollama server URL, e.g., http://ollama:11434 (no /v1 suffix) |
Setup checklist:
- Set OLLAMA_BASE_URL in .env (or container env); see CONFIGURATION.md.
- Pull the embedding model on the Ollama server: ollama pull qwen3-embedding:4b.
- Confirm vector_dim matches the model output (see dimension table below).
Popular models:
- nomic-embed-text - 768 dimensions, excellent performance
- mxbai-embed-large - 1024 dimensions, high quality
- qwen3-embedding:4b - 2560 dimensions, multilingual
OpenAI
Use the OpenAI provider for either the real OpenAI API or any
OpenAI-compatible endpoint (vLLM, text-embeddings-inference, LiteLLM,
Azure OpenAI proxy, etc.). The ingester does not assume api.openai.com —
the endpoint is always taken from EMBED_LLM_URL.
Required environment variables:
| Variable | Required | Purpose |
|---|---|---|
| OPENAI_API_KEY | Yes | API key; may be loaded from /run/secrets/openai_api_key. For self-hosted endpoints that don't authenticate, set it to any non-empty value. |
| EMBED_LLM_URL | Yes | Base URL including /v1 suffix, e.g., https://api.openai.com/v1 or http://vllm:8000/v1 |
Setup checklist:
- Set OPENAI_API_KEY, either as an env var or by mounting /run/secrets/openai_api_key. See CONFIGURATION.md.
- Set EMBED_LLM_URL to the full /v1 endpoint. See CONFIGURATION.md.
- Confirm vector_dim matches the model output.
Real OpenAI example:
Self-hosted vLLM example:
Models:
- text-embedding-3-small - 1536 dimensions, cost-effective
- text-embedding-3-large - 3072 dimensions, highest quality
- text-embedding-ada-002 - 1536 dimensions (legacy)
- Any embedding model served by your OpenAI-compatible backend (when using vLLM, etc.)
Troubleshooting:
- "embed_llm_url is not set" at embed time → EMBED_LLM_URL is missing; export it and restart workers.
- 401 / 403 from your endpoint → OPENAI_API_KEY either wasn't set or is incorrect. If you mounted /run/secrets/openai_api_key, confirm no pre-existing OPENAI_API_KEY env var is overriding it (env wins over the secret file).
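The precedence described in the last item (environment variable first, secret file second) can be sketched as follows. This is an illustrative helper rather than the ingester's loader: the function name is hypothetical, and treating an empty variable as unset and stripping whitespace from the file are assumptions.

```python
from pathlib import Path


def load_openai_api_key(env, secret_file="/run/secrets/openai_api_key"):
    """Return the API key, preferring the env var over the secret file."""
    value = env.get("OPENAI_API_KEY")
    if value:
        # A pre-existing env var wins, even if a secret file is mounted.
        return value
    path = Path(secret_file)
    if path.is_file():
        # Assumption: trailing newlines in the secret file are not part of the key.
        return path.read_text(encoding="utf-8").strip() or None
    return None
```

When debugging a 401, comparing the value this returns for `os.environ` against the mounted file shows quickly whether a stale variable is shadowing the secret.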
Important: The vector_dim must match the actual dimension output by the model. Incorrect values will cause embedding errors.
Store Configuration
Controls where the vector database is stored.
Field Details:
| Field | Type | Required | Description |
|---|---|---|---|
| data_dir | string | Yes | Subdirectory name for this database |
Storage Location:
The full path is: {LANCEDB_DIR}/{data_dir}/
Example:
- LANCEDB_DIR=/var/soliplex/lancedb
- data_dir=default_db
- Result: /var/soliplex/lancedb/default_db/
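The path rule is simple enough to express in a few lines. In this sketch the function name is hypothetical, and rejecting values that contain a slash is an assumption based on data_dir being described as a subdirectory name.

```python
def resolve_store_path(lancedb_dir, data_dir):
    """Build {LANCEDB_DIR}/{data_dir}/ from its two parts."""
    # Assumption: data_dir is a single directory name, not a nested path.
    if not data_dir or "/" in data_dir or data_dir in (".", ".."):
        raise ValueError(f"data_dir must be a single directory name, got {data_dir!r}")
    return f"{lancedb_dir.rstrip('/')}/{data_dir}/"


print(resolve_store_path("/var/soliplex/lancedb", "default_db"))
# prints: /var/soliplex/lancedb/default_db/
```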
Best Practices:
- Use descriptive names: financial_reports, legal_docs, product_manuals
- Separate databases by embedding model or chunk configuration
- Include version in name for model upgrades: reports_v2, docs_ada002
Creating Parameter Sets
Via Configuration File (System)
- Create a YAML file in config/params/ (system parameter sets):
- Define parameter set:
id: my_custom_params
name: My Custom Configuration
config:
  parse:
    do_ocr: true
    ocr_engine: easyocr
    table_mode: accurate
  chunk:
    chunker: docling-serve
    chunk_size: 512
    chunker_type: hybrid
  embed:
    provider: openai
    model: text-embedding-3-small
    vector_dim: 1536
  store:
    data_dir: custom_db
- Validate:
Via REST API
Create parameter set:
curl -X POST "http://localhost:8000/api/v1/workflow/param-sets" \
--data-urlencode "yaml_content=$(cat <<'EOF'
id: api_created_params
name: API Created Parameters
config:
  chunk:
    chunker: docling-serve
    chunk_size: 384
  embed:
    provider: ollama
    model: nomic-embed-text
    vector_dim: 768
  store:
    data_dir: api_db
EOF
)"
Response:
{
  "message": "Parameter set created successfully",
  "id": "api_created_params",
  "file_path": "/path/to/params/api_created_params.yaml"
}
Via Python
import asyncio

import httpx
import yaml

# Define parameter set
params = {
    "id": "python_params",
    "name": "Python Created Parameters",
    "config": {
        "chunk": {
            "chunker": "docling-serve",
            "chunk_size": 256,
        },
        "embed": {
            "provider": "ollama",
            "model": "mxbai-embed-large",
            "vector_dim": 1024,
        },
        "store": {
            "data_dir": "python_db",
        },
    },
}


async def main() -> None:
    # Convert to YAML, keeping the key order defined above
    yaml_content = yaml.dump(params, sort_keys=False)

    # Upload via API as form data
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/api/v1/workflow/param-sets",
            data={"yaml_content": yaml_content},
        )
        # Fail loudly on 4xx/5xx instead of printing an error body
        response.raise_for_status()
        print(response.json())


# "async with" is only valid inside a coroutine, so run it via asyncio
asyncio.run(main())
Managing Parameter Sets
List All Parameter Sets
CLI:
REST API:
Response:
[
  {
    "id": "default",
    "name": "Default Parameters",
    "source": "app"
  },
  {
    "id": "api_created_params",
    "name": "API Created Parameters",
    "source": "user"
  }
]
Source Types:
- app - Built-in parameter sets (cannot be deleted)
- user - User-uploaded parameter sets (can be deleted)
View Parameter Set
CLI:
REST API:
Returns the raw YAML content.
Delete Parameter Set
Only user-uploaded parameter sets can be deleted.
REST API:
Response:
Error if built-in:
Query by Target Database
Find parameter sets that use a specific LanceDB directory:
Examples
High-Quality Processing
For important documents where quality matters more than speed:
id: high_quality
name: High Quality Processing
config:
  parse:
    do_ocr: true
    force_ocr: false
    ocr_engine: easyocr
    pdf_backend: pypdfium2
    table_mode: accurate
  chunk:
    chunker: docling-serve
    chunk_size: 384
    text_context_radius: 100
    chunker_type: hybrid
  embed:
    provider: openai
    model: text-embedding-3-large
    vector_dim: 3072
  store:
    data_dir: high_quality_db
Fast Batch Processing
For large volumes where speed is critical:
id: fast_batch
name: Fast Batch Processing
config:
  parse:
    do_ocr: false
    pdf_backend: pypdfium2
    table_mode: fast
  chunk:
    chunker: local
    chunk_size: 512
    text_context_radius: 0
    chunker_type: token
  embed:
    provider: ollama
    model: nomic-embed-text
    vector_dim: 768
  store:
    data_dir: fast_batch_db
OCR-Heavy Documents
For scanned PDFs and images:
id: ocr_focused
name: OCR-Focused Processing
config:
  parse:
    do_ocr: true
    force_ocr: true
    ocr_engine: easyocr
    ocr_lang: en
    pdf_backend: pypdfium2
    table_mode: accurate
  chunk:
    chunker: docling-serve
    chunk_size: 256
    text_context_radius: 50
    chunker_type: hierarchical
  embed:
    provider: ollama
    model: nomic-embed-text
    vector_dim: 768
  store:
    data_dir: ocr_docs_db
S3 Storage
For using S3-compatible storage:
id: s3_default
name: S3 Storage Configuration
config:
  chunk:
    chunker: docling-serve
    chunk_size: 256
  embed:
    provider: ollama
    model: qwen3-embedding:4b
    vector_dim: 2560
  store:
    data_dir: s3_lancedb
Environment configuration:
Multilingual Documents
For documents in multiple languages:
id: multilingual
name: Multilingual Processing
config:
  parse:
    do_ocr: true
    ocr_engine: easyocr
    ocr_lang: en+es+fr # Multiple languages
    table_mode: accurate
  chunk:
    chunker: docling-serve
    chunk_size: 384
    chunker_type: hybrid
  embed:
    provider: ollama
    model: qwen3-embedding:4b # Multilingual model
    vector_dim: 2560
  store:
    data_dir: multilingual_db
Best Practices
Choosing Chunk Size
Small chunks (128-256 tokens):
- ✅ More precise search results
- ✅ Better for factual lookup
- ❌ Higher storage costs
- ❌ Less context per chunk
Medium chunks (256-512 tokens):
- ✅ Balanced performance (recommended)
- ✅ Good context and precision
- ✅ Reasonable storage costs
Large chunks (512-1024 tokens):
- ✅ More context per result
- ✅ Better for document understanding
- ❌ Less precise search
- ❌ May exceed embedding limits
Selecting Embedding Models
Consider:
1. Vector dimension: Higher isn't always better
   - More dimensions = more storage, slower search
   - Quality matters more than size
2. Cost: OpenAI charges per token
   - Use text-embedding-3-small for cost-effectiveness
   - Use Ollama for free local embeddings
3. Performance: Test with your documents
   - Different models work better for different content
   - Domain-specific models may outperform general models
4. Multilingual: If you need multiple languages
   - qwen3-embedding - Excellent multilingual support
   - OpenAI models - Good multilingual coverage
Parameter Set Versioning
When upgrading models or changing configurations:
- Create new parameter set with version suffix:
- Keep old parameter sets:
  - Allows comparison between versions
  - Old databases remain accessible
- Document changes:
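One way to follow this versioning routine is to derive the new parameter set from the old one programmatically, so the only differences are the ones you intend. The sketch below is a hypothetical helper operating on parsed parameter sets (plain dicts). Giving the new version its own data_dir is an assumption that follows from keeping the old database accessible.

```python
import copy


def derive_versioned_set(base, suffix, embed_overrides):
    """Copy a parameter set, apply embed changes, and suffix id and data_dir."""
    new_set = copy.deepcopy(base)  # leave the original definition untouched
    new_set["id"] = f"{base['id']}_{suffix}"
    new_set["config"]["embed"].update(embed_overrides)
    # A separate directory keeps vectors from different models apart.
    old_dir = base["config"]["store"]["data_dir"]
    new_set["config"]["store"]["data_dir"] = f"{old_dir}_{suffix}"
    return new_set


reports = {
    "id": "reports",
    "config": {
        "embed": {"provider": "openai", "model": "text-embedding-3-small", "vector_dim": 1536},
        "store": {"data_dir": "reports"},
    },
}
reports_v2 = derive_versioned_set(
    reports, "v2", {"model": "text-embedding-3-large", "vector_dim": 3072}
)
print(reports_v2["id"], reports_v2["config"]["store"]["data_dir"])
```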
Testing Parameter Sets
Before processing large batches:
- Create test parameter set:
- Process small sample:
# Create test batch with 5-10 documents
curl -X POST "http://localhost:8000/api/v1/batch/" \
-d "source=test" -d "name=Parameter Test"
# Start workflow with test parameters
curl -X POST "http://localhost:8000/api/v1/batch/start-workflows" \
-d "batch_id=1" -d "param_id=test_params"
- Evaluate results:
  - Query precision
  - Processing time
  - Storage size
  - Embedding quality
- Iterate and deploy:
Storage Organization
Organize databases by use case:
# Financial documents
id: financial_docs
store:
  data_dir: financial_db
# Technical manuals
id: tech_manuals
store:
  data_dir: manuals_db
# Customer support
id: support_docs
store:
  data_dir: support_db
Benefits:
- Separate vector spaces
- Independent scaling
- Easier maintenance
- Better search relevance
Troubleshooting
Parameter Set Not Found
Error: 404 Not Found when accessing parameter set
Solutions:
1. List available parameter sets:
2. Check that the parameter set ID matches exactly (case-sensitive)
3. Verify that the file exists in config/params/ (system) or config/user_params/ (user uploads)
Invalid YAML Syntax
Error: 400 Bad Request - Invalid YAML syntax
Solutions:
1. Validate YAML syntax:
2. Check indentation (use spaces, not tabs)
3. Ensure all strings with special characters are quoted
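Tab characters in indentation are a common cause of this error and are hard to spot by eye. A small standard-library check, shown here as a sketch with a hypothetical function name, reports the offending line numbers before you upload.

```python
def find_tab_indented_lines(yaml_text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad_lines = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        # Leading whitespace is whatever lstrip removes from the front.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad_lines.append(number)
    return bad_lines


sample = "id: demo\nconfig:\n\tembed:\n  store:\n"
print(find_tab_indented_lines(sample))  # prints: [3]
```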
Embedding Dimension Mismatch
Error: Embedding dimension mismatch
Solutions:
1. Verify vector_dim matches model output:
   - Query model documentation
   - Test embedding generation
   - Common dimensions:
     - text-embedding-3-small: 1536
     - text-embedding-3-large: 3072
     - nomic-embed-text: 768
     - qwen3-embedding:4b: 2560
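For the models named in this document, the check can be automated with a lookup table. The sketch below uses only the dimensions listed in this guide; the function name is hypothetical, and unknown models are reported as unverified rather than rejected, since any OpenAI-compatible backend may serve other models.

```python
# Dimensions as listed in this guide; extend for other models you deploy.
KNOWN_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
    "nomic-embed-text": 768,
    "mxbai-embed-large": 1024,
    "qwen3-embedding:4b": 2560,
}


def check_vector_dim(model, vector_dim):
    """Return True if verified, False if the model is unknown.

    Raises ValueError when the configured dimension contradicts the table.
    """
    expected = KNOWN_DIMENSIONS.get(model)
    if expected is None:
        return False  # cannot verify; consult the model documentation
    if expected != vector_dim:
        raise ValueError(
            f"{model} produces {expected}-dimensional vectors, "
            f"but vector_dim is {vector_dim}"
        )
    return True


print(check_vector_dim("nomic-embed-text", 768))  # prints: True
```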
Cannot Delete Parameter Set
Error: 403 Forbidden - Cannot delete built-in parameter sets
Solution:
- Only parameter sets with source: user can be deleted
- Built-in parameter sets are protected
- Create a copy if you need to modify:
# Copy built-in set
curl "http://localhost:8000/api/v1/workflow/param-sets/default" > my_params.yaml
# Modify and upload
# Edit my_params.yaml, change id
curl -X POST "http://localhost:8000/api/v1/workflow/param-sets" \
--data-urlencode "yaml_content@my_params.yaml"
Related Documentation
- API Reference - REST API endpoints for parameter management
- Workflows - Using parameter sets in workflows
- Configuration - Environment variables for embedding providers
- Getting Started - Quick start with parameter sets
Additional Resources
- HaikuRAG Documentation: https://github.com/ggozad/haiku.rag
- Docling Documentation: https://docling-project.github.io/docling/
- Ollama Models: https://ollama.com/library
- OpenAI Embeddings: https://platform.openai.com/docs/guides/embeddings