Getting Started with Soliplex Ingester
Quick Start
This guide will help you get Soliplex Ingester up and running in minutes.
Prerequisites
- Python 3.12 or higher
- pip or uv package manager
- SQLite (included with Python) or PostgreSQL
- Docling server for document parsing (optional)
- S3 backend (optional)
Docker Deployment
For production deployment using Docker Compose, see the comprehensive Docker Deployment Guide.
The docker-compose configuration provides all necessary services:
- PostgreSQL database with initialization scripts
- Docling document parsing services with GPU support and load balancing
- SeaweedFS for S3-compatible object storage
- HAProxy load balancer for high availability
Quick Start with Docker:
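A minimal sketch, assuming Docker Compose v2 and the repository's compose file:

```shell
# Published application port, per this guide
APP_URL="http://localhost:8002"

# Build and start every service in the background
docker compose up -d

# Check that all services came up
docker compose ps
```

Once the services report healthy, open http://localhost:8002 in a browser.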
Access the application at http://localhost:8002
For detailed instructions, including:
- Service configuration and scaling
- GPU setup and optimization
- Authentication with OAuth2 Proxy
- Production deployment best practices
- Troubleshooting guide

see DOCKER.md.
1. Install Package
Installation from source
Using pip:
Using uv:
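Both paths, sketched from a local clone (the repository URL is taken from the uv example later in this section; adjust it for a fork or branch):

```shell
REPO="https://github.com/soliplex/ingester.git"
git clone "$REPO" ingester
cd ingester

# Using pip: editable install into the active environment
pip install -e .

# Using uv: create and sync the project's own virtual environment
uv sync
```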
Running Your Own Install
You can integrate Soliplex Ingester into another Python project by installing it like any other package. This lets you substitute custom methods for any part of the workflow if desired.
```shell
uv init --lib <my-project-name>
uv add git+https://github.com/soliplex/ingester.git
uv run si-cli bootstrap
```
This installs the package and makes the si-cli command available.
2. Verify Installation
You should see the CLI help menu.
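For example, assuming the conventional help flag:

```shell
si-cli --help
```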
Configuration
3. Set Environment Variables
Automatically configure:
Manually create a .env file in the project root:
```shell
# Minimum required configuration
DOC_DB_URL=sqlite+aiosqlite:///./db/documents.db

# Optional: Docling service (for document parsing)
DOCLING_SERVER_URL=http://localhost:5001/v1

# Optional: Adjust logging
LOG_LEVEL=INFO
```
Load the environment:
Or on Windows (PowerShell; splitting on the first `=` only, so values may themselves contain `=`):

```powershell
Get-Content .env | Where-Object { $_ -match '=' -and $_ -notmatch '^\s*#' } | ForEach-Object {
    $name, $value = $_ -split '=', 2
    [Environment]::SetEnvironmentVariable($name, $value)
}
```
4. Validate Configuration
This should display your configuration without errors.
Database Setup
5. Initialize Database
This:
- Creates the SQLite database file at db/documents.db
- Creates all necessary tables
- Runs migrations
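Initialization is a single command, the same one the Troubleshooting section suggests for reinitializing:

```shell
# Ensure the database directory exists, then initialize
mkdir -p db
si-cli db-init
```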
Start the Server
6. Run the Development Server
The server starts on http://127.0.0.1:8000 with:
- Auto-reload on code changes
- Integrated worker for processing
- Web UI at /
- OpenAPI docs at /docs
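The serve subcommand appears in this guide's Dockerfile; in development it can be run bare (reload and port flags, if any, are not shown in this guide, so check `si-cli serve --help`):

```shell
si-cli serve
```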
7. Access the Application
Web UI (Main Application):
Open your browser and navigate to http://127.0.0.1:8000/
The web UI provides:
- Dashboard - Monitor workflow status and batch processing
- Batches - View and manage document batches
- Workflows - Inspect workflow definitions and runs
- Parameters - View and create parameter sets
- LanceDB - Manage vector databases
- Statistics - View processing metrics and performance data
API Documentation (Swagger UI):
For API testing and documentation, open http://127.0.0.1:8000/docs
Alternative API Documentation (ReDoc):
Test the server:
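For example, from a terminal (assuming the default development host and port):

```shell
DOCS_URL="http://127.0.0.1:8000/docs"
# Fetch the docs page; the first lines of the Swagger UI HTML should appear
curl -s "$DOCS_URL" | head -n 5
```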
You should see the Swagger UI.
Your First Batch
8. Create a Batch
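The exact create endpoint is not shown in this excerpt; the sketch below assumes a conventional route and field name, so verify the real path in the OpenAPI docs at /docs before use:

```shell
BASE="http://localhost:8000/api/v1"

# Hypothetical route and field name; confirm against /docs
curl -X POST "$BASE/batch/create" \
    -d "name=my-first-batch"
```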
Response:
9. Ingest a Document
Option A: Upload a file
```shell
curl -X POST "http://localhost:8000/api/v1/document/ingest-document" \
    -F "file=@sample.pdf" \
    -F "source_uri=/documents/sample.pdf" \
    -F "source=test" \
    -F "batch_id=1"
```
Option B: Provide a URI (requires Docling server)
```shell
curl -X POST "http://localhost:8000/api/v1/document/ingest-document" \
    -F "input_uri=https://example.com/document.pdf" \
    -F "source_uri=/remote/document.pdf" \
    -F "source=test" \
    -F "batch_id=1"
```
Response:
```json
{
  "batch_id": 1,
  "document_uri": "/documents/sample.pdf",
  "document_hash": "sha256-abc123...",
  "source": "test",
  "uri_id": 1
}
```
10. Start Workflow Processing
```shell
curl -X POST "http://localhost:8000/api/v1/batch/start-workflows" \
    -d "batch_id=1" \
    -d "workflow_definition_id=batch"
```
Response:
11. Monitor Progress
Check batch status:
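Using the same status endpoint the monitoring script later in this guide polls:

```shell
STATUS_URL="http://localhost:8000/api/v1/batch/status?batch_id=1"
curl -s "$STATUS_URL"
```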
Response:
```json
{
  "batch": { ... },
  "document_count": 1,
  "workflow_count": {
    "COMPLETED": 0,
    "RUNNING": 1,
    "PENDING": 0
  },
  "workflows": [ ... ],
  "parsed": 0,
  "remaining": 1
}
```
Watch workflow runs:
12. View Results
Once processing completes, check the document:
Next Steps
Explore Workflows
List available workflows:
Inspect a workflow:
View workflow runs:
Configure Parameters
List parameter sets:
View parameters:
Create custom parameters:
1. Copy config/params/default.yaml to config/params/custom.yaml
2. Modify settings as needed
3. Use in API: -d "param_id=custom"
Scale Workers
Run additional workers:
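Each additional worker is just another process running the worker subcommand (the same command the compose file later in this guide gives its worker service):

```shell
# Run in its own terminal; repeat for more workers
si-cli worker
```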
Each worker processes steps independently, increasing throughput.
Monitor System
API Documentation:
Browse to http://localhost:8000/docs for interactive API docs.
Database Inspection:
```
sqlite3 db/documents.db
sqlite> .tables
sqlite> SELECT * FROM documentbatch;
sqlite> SELECT * FROM workflowrun WHERE batch_id = 1;
```
Troubleshooting
Server Won't Start
Problem: Configuration validation fails
Solution:
Fix any reported errors in your .env file.
Problem: Port already in use
Solution:
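A common fix, sketched with standard tools (whether the server takes a flag to change its listen port is not shown in this guide; check `si-cli serve --help`):

```shell
# See which process owns port 8000, then stop it
lsof -i :8000
```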
Workflows Stuck
Problem: Workflows remain in PENDING status
Solution: Ensure a worker is running (start one with `si-cli worker` in a separate terminal):
Check worker logs for errors.
Document Parsing Fails
Problem: Parse step fails with connection error
Solution:
1. Verify Docling server is running
2. Check DOCLING_SERVER_URL is correct
3. Test connectivity:
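For step 3, a plain request to the configured URL (the exact health-check path depends on your Docling deployment):

```shell
DOCLING_URL="${DOCLING_SERVER_URL:-http://localhost:5001/v1}"
curl -fsS "$DOCLING_URL" || echo "Docling not reachable at $DOCLING_URL"
```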
Database Errors
Problem: Database connection fails
Solution:
1. Check DOC_DB_URL format
2. Ensure directory exists: mkdir -p db
3. Check permissions: chmod 755 db
4. Reinitialize: si-cli db-init
Development Mode
For active development:
1. Enable auto-reload:
2. Set debug logging:
3. Watch logs:
4. Monitor database:
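Steps 2 and 4 as commands (the auto-reload and log-watching invocations depend on flags not shown in this excerpt):

```shell
# 2. Debug logging for the next server start
export LOG_LEVEL=DEBUG

# 4. Inspect recent workflow runs while developing
sqlite3 db/documents.db 'SELECT * FROM workflowrun WHERE batch_id = 1;'
```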
Production Deployment
Configuration
Create production .env:
```shell
# Database
DOC_DB_URL=postgresql+asyncpg://user:password@db-host:5432/soliplex

# Services
DOCLING_SERVER_URL=http://docling-prod:5001/v1

# Logging
LOG_LEVEL=WARNING

# Performance
INGEST_WORKER_CONCURRENCY=20
DOCLING_CONCURRENCY=5
WORKER_TASK_COUNT=10

# Storage
FILE_STORE_DIR=/var/lib/soliplex/files
LANCEDB_DIR=/var/lib/soliplex/lancedb
```
Run Services
Server:
Workers: (in separate processes)
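Both commands as used by the Docker section below:

```shell
# Server: bind all interfaces so the reverse proxy can reach it
si-cli serve --host 0.0.0.0

# Workers: one per process, under your supervisor of choice
si-cli worker
```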
Behind Nginx:
```nginx
upstream soliplex {
    server 127.0.0.1:8000;
}

server {
    listen 80;
    server_name soliplex.example.com;

    location / {
        proxy_pass http://soliplex;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
Docker Deployment
Dockerfile:
```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install -e .
CMD ["si-cli", "serve", "--host", "0.0.0.0"]
```
docker-compose.yml:
```yaml
version: '3.8'

services:
  server:
    build: .
    ports:
      - "8000:8000"
    environment:
      DOC_DB_URL: postgresql+asyncpg://postgres:password@db/soliplex
      DOCLING_SERVER_URL: http://docling:5001/v1
    depends_on:
      - db

  worker:
    build: .
    command: si-cli worker
    environment:
      DOC_DB_URL: postgresql+asyncpg://postgres:password@db/soliplex
      DOCLING_SERVER_URL: http://docling:5001/v1
    depends_on:
      - db
    deploy:
      replicas: 3

  db:
    image: postgres:16
    environment:
      POSTGRES_DB: soliplex
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: password
    volumes:
      - db-data:/var/lib/postgresql/data

volumes:
  db-data:
```
Run:
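With the two files above in place (assuming Docker Compose v2):

```shell
docker compose up -d --build

# Optionally override the worker replica count at runtime
docker compose up -d --scale worker=5
```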
Learning More
Documentation
- Architecture Overview - System design and components
- API Reference - Complete REST API documentation
- Workflow System - Workflow concepts and configuration
- Database Schema - Data models and relationships
- Configuration - Environment variables and settings
- CLI Reference - Command-line interface guide
Examples
Check the examples/ directory (if available) for:
- Sample workflows
- Integration scripts
- Custom step handlers
- Batch processing examples
Community
- Issues: Report bugs and request features
- Discussions: Ask questions and share ideas
- Contributing: See CONTRIBUTING.md (if available)
Common Patterns
Batch Document Ingestion
```python
import asyncio
from pathlib import Path

import httpx


async def ingest_directory(directory: Path, batch_id: int, source: str):
    """Ingest all documents in a directory."""
    async with httpx.AsyncClient() as client:
        for file_path in directory.glob("**/*.pdf"):
            with open(file_path, "rb") as f:
                files = {"file": f}
                data = {
                    "source_uri": str(file_path),
                    "source": source,
                    "batch_id": batch_id,
                }
                response = await client.post(
                    "http://localhost:8000/api/v1/document/ingest-document",
                    files=files,
                    data=data,
                )
            print(f"Ingested {file_path}: {response.status_code}")


# Usage
asyncio.run(ingest_directory(Path("/documents"), batch_id=1, source="filesystem"))
```
Monitor Batch Progress
```python
import asyncio

import httpx


async def wait_for_batch(batch_id: int, poll_interval: int = 5):
    """Wait for batch processing to complete."""
    async with httpx.AsyncClient() as client:
        while True:
            response = await client.get(
                "http://localhost:8000/api/v1/batch/status",
                params={"batch_id": batch_id},
            )
            data = response.json()
            counts = data["workflow_count"]
            print(f"Completed: {counts.get('COMPLETED', 0)}, "
                  f"Running: {counts.get('RUNNING', 0)}, "
                  f"Failed: {counts.get('FAILED', 0)}")
            if counts.get("RUNNING", 0) == 0 and counts.get("PENDING", 0) == 0:
                print("Batch complete!")
                break
            await asyncio.sleep(poll_interval)


# Usage
asyncio.run(wait_for_batch(1))
```
Retry Failed Workflows
```bash
#!/bin/bash
# retry_failed.sh
BATCH_ID=$1

RUN_GROUP=$(curl -s "http://localhost:8000/api/v1/workflow/run-groups?batch_id=${BATCH_ID}" | jq -r '.[0].id')

# jq -r prints the literal string "null" when the field is missing,
# so guard against that as well as the empty string
if [ -n "$RUN_GROUP" ] && [ "$RUN_GROUP" != "null" ]; then
    curl -X POST "http://localhost:8000/api/v1/workflow/retry" \
        -d "run_group_id=${RUN_GROUP}"
    echo "Retried run group ${RUN_GROUP}"
else
    echo "No run group found for batch ${BATCH_ID}"
fi
```
What's Next?
Now that you have Soliplex Ingester running:
- Customize workflows - Create workflows for your specific needs
- Integrate services - Connect your data sources and RAG backends
- Scale processing - Add more workers and optimize configuration
- Monitor production - Set up logging, metrics, and alerting
- Build applications - Use the API to build document processing apps
Welcome to Soliplex Ingester! 🚀