CLI Reference
Overview
Soliplex Ingester provides two command-line interfaces built with Typer:
- si-cli - Main CLI for server management, worker processes, and database operations
- si-diag - Read-only diagnostic CLI for inspecting batches, documents, workflows, and system status
Installation:
After installing the package, both si-cli and si-diag are available as commands.
Entry Points:
- si-cli → src/soliplex/ingester/cli.py
- si-diag → src/soliplex/ingester/diag_cli.py
Global Options
All commands support these options:
Initialization:
Before running any command, the CLI automatically:
- Validates settings
- Sets up logging based on LOG_LEVEL
Commands
validate-settings
Validate and display application settings.
Usage: si-cli validate-settings
Description:
- Validates all environment variables
- Displays current configuration
- Exits with error code if validation fails
Example Output:
doc_db_url='sqlite+aiosqlite:///./db/documents.db'
docling_server_url='http://localhost:5001/v1'
docling_http_timeout=600
log_level='INFO'
file_store_target='fs'
file_store_dir='file_store'
...
Error Output:
Exit Codes:
- 0 - Settings valid
- 1 - Validation failed
Implementation: src/soliplex/ingester/cli.py:38
db-init
Initialize database tables and run migrations.
Usage: si-cli db-init
Description:
- Creates all database tables using SQLModel metadata
- Runs Alembic migrations to latest version
- Idempotent (safe to run multiple times)
Prerequisites:
- DOC_DB_URL environment variable must be set
- Database server must be accessible
- User must have CREATE TABLE permissions
Example:
Notes:
- For SQLite, creates database file if it doesn't exist
- For PostgreSQL, database must already exist
- Uses synchronous SQLAlchemy engine (not async)
Implementation: src/soliplex/ingester/cli.py:68
serve
Start the FastAPI web server.
Usage: si-cli serve [OPTIONS]
Options:
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --host | -h | str | 127.0.0.1 | Bind address |
| --port | -p | int | 8000 | Port number |
| --uds | - | str | None | Unix domain socket path |
| --fd | - | int | None | File descriptor to bind |
| --reload | -r | bool | False | Auto-reload on file changes |
| --workers | - | int | None | Number of worker processes |
| --access-log | - | bool | None | Enable/disable access log |
| --proxy-headers | - | bool | None | Trust proxy headers |
| --forwarded-allow-ips | - | str | None | IPs to trust for proxy headers |
Examples:
Basic server:
Custom host and port:
Development mode with auto-reload:
Production with multiple workers:
Unix socket:
Behind proxy:
Reload Configuration:
When --reload is enabled:
- Watches Python files in the soliplex.ingester package
- Watches *.yaml, *.yml, *.txt files
- Automatically restarts on changes
Worker Note:
The server automatically starts a background worker on startup. The worker processes workflow steps concurrently with serving API requests.
Environment Variables:
- WEB_CONCURRENCY - Default number of workers if not specified
Implementation: src/soliplex/ingester/cli.py:207
worker
Run a standalone workflow processing worker.
Usage: si-cli worker
Description:
- Starts a worker that polls for pending workflow steps
- Executes steps according to workflow definitions
- Runs indefinitely until interrupted (Ctrl+C)
Behavior:
- Registers worker with unique ID in database
- Sends heartbeat every WORKER_CHECKIN_INTERVAL seconds
- Processes steps based on priority and availability
- Handles retries according to step configuration
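The behavior listed above can be pictured with a small in-memory sketch. It is not the ingester's worker: the WorkerSketch and Step names are hypothetical, the queue is a plain heap rather than the database, lower priority numbers are assumed to run first, and the loop drains a fixed list instead of polling indefinitely.

```python
import asyncio
import heapq
import uuid
from dataclasses import dataclass, field
from typing import Callable


@dataclass(order=True)
class Step:
    """Hypothetical step record; only `priority` takes part in ordering."""
    priority: int
    name: str = field(compare=False)
    action: Callable[[], None] = field(compare=False)
    retries: int = field(default=0, compare=False)


class WorkerSketch:
    def __init__(self, checkin_interval: float) -> None:
        # Unique ID, standing in for the worker's registration in the database.
        self.worker_id = f"worker-{uuid.uuid4().hex[:8]}"
        self.checkin_interval = checkin_interval
        self.heartbeats = 0
        self.done: list[str] = []
        self.failed: list[str] = []

    async def _heartbeat(self) -> None:
        while True:
            self.heartbeats += 1  # a real worker would record a check-in here
            await asyncio.sleep(self.checkin_interval)

    async def run(self, steps: list[Step]) -> None:
        queue = list(steps)
        heapq.heapify(queue)  # assumption: lowest priority number first
        beat = asyncio.create_task(self._heartbeat())
        try:
            while queue:
                step = heapq.heappop(queue)
                try:
                    step.action()
                    self.done.append(step.name)
                except Exception:
                    if step.retries > 0:
                        step.retries -= 1  # requeue until retries run out
                        heapq.heappush(queue, step)
                    else:
                        self.failed.append(step.name)
                await asyncio.sleep(0)  # yield so the heartbeat task can run
        finally:
            beat.cancel()
```

A run is started with `asyncio.run(WorkerSketch(checkin_interval=5.0).run(steps))`; afterwards `done` and `failed` show which steps succeeded and which exhausted their retries.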
Example:
Multiple Workers:
Run multiple instances for increased throughput:
Graceful Shutdown:
- Press Ctrl+C to stop worker
- Worker will finish current step before exiting
- Pending steps remain in database for other workers
Monitoring:
Check worker status via API:
Implementation: src/soliplex/ingester/cli.py
list-workflows
List all available workflow definitions.
Usage: si-cli list-workflows
Description:
- Scans WORKFLOW_DIR for YAML files
- Displays workflow IDs
Example:
Output:
Implementation: src/soliplex/ingester/cli.py:189
dump-workflow
Display complete workflow definition.
Usage: si-cli dump-workflow WORKFLOW_ID
Arguments:
- WORKFLOW_ID (str, required) - Workflow definition ID
Description:
- Loads workflow from YAML
- Displays as formatted JSON
Example:
Output:
{
  "id": "batch",
  "name": "Batch Workflow",
  "meta": {},
  "item_steps": {
    "validate": {
      "name": "docling validate",
      "retries": 3,
      "method": "soliplex.ingester.lib.workflow.validate_document",
      "parameters": {}
    },
    ...
  },
  "lifecycle_events": null
}
Implementation: src/soliplex/ingester/cli.py:162
list-param-sets
List all available parameter sets.
Usage: si-cli list-param-sets
Description:
- Scans PARAM_DIR for YAML files
- Displays parameter set IDs
Example:
Output:
Implementation: src/soliplex/ingester/cli.py:201
dump-param-set
Display complete parameter set configuration.
Usage: si-cli dump-param-set [PARAM_ID]
Arguments:
- PARAM_ID (str, optional) - Parameter set ID (default: "default")
Description:
- Loads parameter set from YAML
- Displays as formatted JSON
Example:
Output:
{
  "id": "default",
  "name": "Default Parameters",
  "meta": {},
  "config": {
    "parse": {
      "format": "markdown",
      "ocr_enabled": true
    },
    "chunk": {
      "chunk_size": 512,
      "chunk_overlap": 50
    },
    ...
  }
}
Implementation: src/soliplex/ingester/cli.py:175
vacuum
Vacuum a LanceDB database to reclaim space from deleted rows.
Usage: si-cli vacuum [--sign] DB_NAME
Arguments:
- DB_NAME (str, required) - Name of the database directory under LANCEDB_DIR
Options:
| Option | Type | Default | Description |
|---|---|---|---|
| --sign | bool | False | Write an HMAC-SHA512 signature after vacuuming (requires LANCEDB_HMAC_KEY) |
Description:
- Resolves the database path under the configured LANCEDB_DIR
- If the directory contains a haiku.rag.lancedb subfolder, vacuums that instead
- Automatically runs pending migrations before vacuuming if required
- Sets vacuum_retention_seconds to 0 to ensure all deleted data is reclaimed
- Will not create a new database; errors if the path does not exist
Examples:
Exit Codes:
- 0 - Vacuum completed successfully
- 1 - Database not found or error during vacuum
Implementation: src/soliplex/ingester/cli.py → src/soliplex/ingester/lib/rag.py:vacuum_db
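The path resolution rules in the description can be sketched with pathlib. This is an illustration only: resolve_db_path is a hypothetical helper, not the function in rag.py, and it covers just the lookup, not the vacuum itself.

```python
from pathlib import Path

# Subfolder name taken from the description above.
NESTED_DB_DIR = "haiku.rag.lancedb"


def resolve_db_path(lancedb_dir: Path, db_name: str) -> Path:
    """Hypothetical helper: locate an existing database directory to vacuum."""
    db_path = lancedb_dir / db_name
    if not db_path.is_dir():
        # Never create a missing database; report the problem instead.
        raise FileNotFoundError(f"database not found: {db_path}")
    nested = db_path / NESTED_DB_DIR
    # Prefer the nested folder when it is present.
    return nested if nested.is_dir() else db_path
```

Raising on a missing path, rather than creating it, mirrors the rule that the command errors instead of producing a new empty database.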
vacuum-all
Vacuum every LanceDB database under the configured LANCEDB_DIR.
Usage: si-cli vacuum-all [--sign]
Options:
| Option | Type | Default | Description |
|---|---|---|---|
| --sign | bool | False | Write an HMAC-SHA512 signature after vacuuming each database (requires LANCEDB_HMAC_KEY) |
Description:
- Scans LANCEDB_DIR for all database directories
- Vacuums each one sequentially
- Logs and continues on failure so one bad database does not block the rest
Examples:
Implementation: src/soliplex/ingester/cli.py → src/soliplex/ingester/lib/rag.py:vacuum_all
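The "log and continue" behavior can be sketched as a loop that isolates each database in its own try block. The names here are hypothetical: vacuum_each is not the vacuum_all function in rag.py, and vacuum_one stands in for whatever performs the real vacuum.

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger("vacuum_sketch")


def vacuum_each(lancedb_dir: Path, vacuum_one: Callable[[Path], None]) -> dict[str, bool]:
    """Hypothetical helper: vacuum every subdirectory and record the outcome."""
    results: dict[str, bool] = {}
    for db_path in sorted(p for p in lancedb_dir.iterdir() if p.is_dir()):
        try:
            vacuum_one(db_path)
            results[db_path.name] = True
        except Exception:
            # Log the traceback, then move on to the next database.
            logger.exception("vacuum failed for %s", db_path.name)
            results[db_path.name] = False
    return results
```

Returning a per-database result map is a design choice of this sketch; it lets a caller report which databases need attention after the full pass.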
verify-db
Verify the HMAC-SHA512 signature of a LanceDB database.
Usage: si-cli verify-db DB_NAME
Arguments:
- DB_NAME (str, required) - Name of the database directory under LANCEDB_DIR
Description:
- Reads the .hmac sidecar file next to the database directory
- Recomputes the HMAC-SHA512 over all files in the database
- Uses constant-time comparison to verify the signature
- Requires LANCEDB_HMAC_KEY environment variable (must be 64 bytes)
Examples:
Exit Codes:
- 0 - HMAC verification passed
- 1 - Verification failed, HMAC file not found, or key misconfigured
Implementation: src/soliplex/ingester/cli.py → src/soliplex/ingester/lib/rag.py:verify_db
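The signing and verification flow can be sketched with Python's hmac module. Several details below are assumptions, since this page does not specify them: the sidecar is taken to be named `<db_name>.hmac` and to hold a hex digest, and files are fed into the MAC in sorted relative-path order with their paths included. The function names are hypothetical and do not correspond to rag.py.

```python
import hashlib
import hmac
from pathlib import Path


def compute_db_hmac(db_dir: Path, key: bytes) -> str:
    """Assumed scheme: HMAC-SHA512 over relative paths and file contents."""
    if len(key) != 64:
        raise ValueError("key must be exactly 64 bytes")
    mac = hmac.new(key, digestmod=hashlib.sha512)
    files = (p for p in db_dir.rglob("*") if p.is_file())
    # Sort by relative POSIX path so the digest is deterministic.
    for path in sorted(files, key=lambda p: p.relative_to(db_dir).as_posix()):
        mac.update(path.relative_to(db_dir).as_posix().encode("utf-8"))
        mac.update(path.read_bytes())
    return mac.hexdigest()


def sign_db(db_dir: Path, key: bytes) -> Path:
    """Write the digest to a sidecar file next to the database directory."""
    sidecar = db_dir.with_name(db_dir.name + ".hmac")  # assumed naming
    sidecar.write_text(compute_db_hmac(db_dir, key))
    return sidecar


def verify_db_sketch(db_dir: Path, key: bytes) -> bool:
    sidecar = db_dir.with_name(db_dir.name + ".hmac")
    if not sidecar.is_file():
        return False  # a missing sidecar counts as a failed verification
    expected = sidecar.read_text().strip().encode("utf-8")
    actual = compute_db_hmac(db_dir, key).encode("utf-8")
    # compare_digest runs in constant time, avoiding timing side channels.
    return hmac.compare_digest(expected, actual)
```

The comparison uses hmac.compare_digest rather than `==` because an ordinary string comparison can leak how many leading characters matched through its running time.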
Usage Patterns
Development Workflow
1. Validate configuration: si-cli validate-settings
2. Initialize database: si-cli db-init
3. Start server with reload: si-cli serve --reload
4. (Optional) Start additional workers: si-cli worker
Production Deployment
1. Validate configuration: si-cli validate-settings
2. Run migrations: si-cli db-init
3. Start server with multiple workers:
4. Start dedicated worker processes:
# In separate terminals/services
si-cli worker # Process 1
si-cli worker # Process 2
si-cli worker # Process 3
Batch Processing
1. Create batch and ingest documents:
# Use API or client library
curl -X POST http://localhost:8000/api/v1/batch/ \
-d "source=filesystem" \
-d "name=Test Batch"
2. Start workflows:
curl -X POST http://localhost:8000/api/v1/batch/start-workflows \
-d "batch_id=1" \
-d "workflow_definition_id=batch"
3. Start workers to process:
4. Monitor progress:
Configuration Management
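List available workflows: si-cli list-workflows
Inspect workflow: si-cli dump-workflow batch
List parameter sets: si-cli list-param-sets
Inspect parameters: si-cli dump-param-set default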
Troubleshooting
Check configuration:
Verify database connection:
Test server startup:
Check worker connectivity:
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Configuration error / validation failed |
| 130 | Interrupted by user (Ctrl+C) |
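A wrapper script can map these codes back to their meanings. The following is a minimal sketch, assuming the command is invoked through subprocess; run_and_explain and EXIT_MEANINGS are hypothetical names, and any code outside the table is reported as undocumented rather than guessed at.

```python
import subprocess

# Meanings copied from the exit code table above.
EXIT_MEANINGS = {
    0: "success",
    1: "configuration error / validation failed",
    130: "interrupted by user (Ctrl+C)",
}


def run_and_explain(cmd: list[str]) -> tuple[int, str]:
    """Run a command and pair its exit code with the documented meaning."""
    completed = subprocess.run(cmd, check=False)
    meaning = EXIT_MEANINGS.get(completed.returncode, "undocumented exit code")
    return completed.returncode, meaning
```

For example, `run_and_explain(["si-cli", "validate-settings"])` could gate a deployment step on the settings being valid.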
Environment Variables
The CLI respects all configuration environment variables. Key ones for CLI usage:
- DOC_DB_URL - Database connection (required)
- LOG_LEVEL - Logging verbosity (DEBUG, INFO, WARNING, ERROR)
- WORKFLOW_DIR - Workflow definitions directory
- PARAM_DIR - Parameter sets directory
- DOCLING_SERVER_URL - Docling service endpoint
- LANCEDB_DIR - Directory containing LanceDB databases (used by vacuum/verify commands)
- LANCEDB_HMAC_KEY - 64-byte key for HMAC-SHA512 database signing (used by --sign and verify-db)
See CONFIGURATION.md for complete list.
Logging
Log Levels
Set via LOG_LEVEL environment variable:
Levels:
- DEBUG - Detailed diagnostic information
- INFO - General informational messages (default)
- WARNING - Warning messages
- ERROR - Error messages
- CRITICAL - Critical error messages
Log Output
Logs are written to stderr. Redirect as needed:
Log Format
Default format includes:
- Timestamp
- Log level
- Logger name
- Message
Example:
2025-01-15 10:00:00,123 INFO soliplex.ingester.cli Starting server
2025-01-15 10:00:01,456 INFO soliplex.ingester.server Starting worker
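Lines of this shape can be produced with the standard logging module. The exact format string the CLI uses is not shown on this page, so LOG_FORMAT below is an assumption chosen to match the example output, and build_logger is a hypothetical helper.

```python
import logging
import sys

# Assumed format: timestamp, level, logger name, message.
LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s %(message)s"


def build_logger(name: str, level: str = "INFO", stream=sys.stderr) -> logging.Logger:
    """Hypothetical helper: a logger writing the four fields listed above."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger = logging.getLogger(name)
    logger.handlers.clear()  # avoid duplicate lines on repeated setup
    logger.addHandler(handler)
    logger.setLevel(level.upper())  # accepts names such as "DEBUG" or "INFO"
    logger.propagate = False
    return logger
```

With the default asctime rendering, the timestamp includes comma-separated milliseconds, as in the example lines.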
Signal Handling
Graceful Shutdown
The CLI handles signals for graceful shutdown:
Signals:
- SIGINT (Ctrl+C) - Graceful shutdown
- SIGTERM - Graceful shutdown
Behavior:
- Stop accepting new work
- Complete current operations
- Clean up resources
- Exit with code 0
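The four behaviors above follow a common pattern: the signal handler only sets a flag, and the main loop checks it between units of work. This is a minimal sketch of that pattern, not the CLI's own handler; GracefulRunner and its attributes are hypothetical names.

```python
import signal
import threading


class GracefulRunner:
    def __init__(self) -> None:
        self.stop_requested = threading.Event()
        self.completed: list[str] = []
        self.cleaned_up = False

    def install_handlers(self) -> None:
        # signal.signal must be called from the main thread.
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self._request_stop)

    def _request_stop(self, signum, frame) -> None:
        # Only record the request; the loop decides when to stop.
        self.stop_requested.set()

    def run(self, work_items, handle) -> int:
        try:
            for item in work_items:
                if self.stop_requested.is_set():
                    break  # stop accepting new work
                handle(item)  # the current operation runs to completion
                self.completed.append(item)
        finally:
            self.cleaned_up = True  # stand-in for closing connections, files, etc.
        return 0  # exit code after a graceful shutdown
```

Keeping the handler this small avoids doing real work in signal context; the item being processed when the signal arrives still finishes before the loop exits.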
Example:
Platform Notes
Windows
On Windows, the CLI automatically sets the event loop policy for compatibility:
import asyncio
import platform

if platform.system() == "Windows":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
This is handled automatically; no user action required.
Unix/Linux
No special configuration needed.
Running with Python
If si-cli is not in PATH, run directly:
Docker Usage
Dockerfile CMD examples:
Server:
Worker:
Init container:
Systemd Service
Example service file:
/etc/systemd/system/soliplex-ingester.service:
[Unit]
Description=Soliplex Ingester API Server
After=network.target postgresql.service
[Service]
Type=simple
User=soliplex
Group=soliplex
WorkingDirectory=/opt/soliplex-ingester
EnvironmentFile=/etc/soliplex/config.env
ExecStart=/opt/soliplex-ingester/.venv/bin/si-cli serve --host 0.0.0.0 --port 8000
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
/etc/systemd/system/soliplex-worker@.service:
[Unit]
Description=Soliplex Ingester Worker %i
After=network.target postgresql.service
[Service]
Type=simple
User=soliplex
Group=soliplex
WorkingDirectory=/opt/soliplex-ingester
EnvironmentFile=/etc/soliplex/config.env
Environment="WORKER_ID=worker-%i"
ExecStart=/opt/soliplex-ingester/.venv/bin/si-cli worker
Restart=on-failure
RestartSec=10s
[Install]
WantedBy=multi-user.target
Start services:
sudo systemctl daemon-reload
sudo systemctl enable soliplex-ingester
sudo systemctl start soliplex-ingester
sudo systemctl enable soliplex-worker@{1..3}
sudo systemctl start soliplex-worker@{1..3}
si-diag: Diagnostic CLI
The si-diag CLI provides read-only access to system state for debugging and monitoring. It uses the same database connection and configuration as si-cli.
Entry Point: src/soliplex/ingester/diag_cli.py
Command Groups
| Group | Description |
|---|---|
| batch | List batches |
| document | List, find, inspect, and view history of documents |
| config | List and inspect workflow definitions and parameter sets |
| run-group | List and inspect run groups |
| workflow | List and inspect workflow runs and steps |
| status | View running steps, recent activity, and aggregated details |
batch list
List all batches in the database.
Output columns: id, name, source, start_date, completed_date, duration
document list
List document URIs filtered by source or batch ID.
Options:
| Option | Type | Description |
|---|---|---|
| --source | str | Filter by source |
| --batch-id | int | Filter by batch ID |
One of --source or --batch-id is required.
document find
Search for document URIs by pattern (case-insensitive substring match).
Arguments:
- PATTERN (str, required) - Search pattern for URI
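The matching rule can be illustrated in a few lines of Python. This is only a sketch of case-insensitive substring matching; find_uris is a hypothetical helper, and the actual command may perform the match inside the database query instead.

```python
def find_uris(uris: list[str], pattern: str) -> list[str]:
    """Hypothetical helper: keep URIs that contain the pattern, ignoring case."""
    needle = pattern.casefold()
    return [uri for uri in uris if needle in uri.casefold()]


uris = [
    "file:///docs/Report-2024.PDF",
    "file:///docs/notes.txt",
    "s3://bucket/REPORT.md",
]
matches = find_uris(uris, "report")  # matches regardless of letter case
```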
document info
Display detailed information about a document by its hash.
Arguments:
- DOC_HASH (str, required) - Document hash
Shows: mime_type, file_size, doc_meta (as JSON), and associated URIs.
document history
Show DocumentURIHistory records for a document hash.
Arguments:
- DOC_HASH (str, required) - Document hash
config workflows
List all available workflow definitions from the workflow registry.
config params
List all available parameter sets from the parameter registry.
config workflow-def
Display a workflow definition as YAML.
Arguments:
- WF_ID (str, required) - Workflow definition ID
config param-def
Display a parameter set definition as YAML.
Arguments:
- PARAM_ID (str, required) - Parameter set ID
run-group list
List run groups, optionally filtered by batch ID.
Options:
| Option | Type | Description |
|---|---|---|
| --batch-id | int | Filter by batch ID |
run-group info
Display run group details including a status breakdown of workflow runs.
Arguments:
- RUN_GROUP_ID (int, required) - Run group ID
workflow list
List workflow runs for a run group.
Arguments:
- RUN_GROUP_ID (int, required) - Run group ID
Options:
| Option | Type | Description |
|---|---|---|
| --status | str | Filter by status (PENDING, RUNNING, COMPLETED, FAILED) |
workflow info
Display workflow run info and its steps.
Arguments:
- WORKFLOW_RUN_ID (int, required) - Workflow run ID
Options:
| Option | Type | Description |
|---|---|---|
| --status | str | Filter steps by status |
status running
List all run steps currently in RUNNING status with enriched context.
Output columns: workflow_id, doc_hash, doc_uri, run_group, param_def_id, step_type, started, elapsed
status recent
List steps with status updates within a time interval.
si-diag status recent minute
si-diag status recent hour --status COMPLETED
si-diag status recent day --status FAILED
Arguments:
- INTERVAL (str, default: "minute") - Time interval: minute, hour, day, week
Options:
| Option | Type | Description |
|---|---|---|
| --status | str | Filter by status |
status details
Display aggregated step details for a run group. PostgreSQL only.
Arguments:
- RUN_GROUP_ID (int, required) - Run group ID
Output columns: batch_name, param_def_id, step_type, status, count, pages
Note: This command uses PostgreSQL-specific JSONB functions and will return an error on SQLite.
Future Commands
Commands planned for future releases:
- si-cli batch create - Create batch via CLI
- si-cli batch ingest - Ingest documents from directory
- si-cli batch status - Check batch status
- si-cli workflow retry - Retry failed workflows
- si-cli stats - Display system statistics
- si-cli clean - Clean up old data
Getting Help
Command help: si-cli --help (or si-cli COMMAND --help for a specific command)
Report issues:
- GitHub: https://github.com/your-repo/soliplex-ingester/issues
- Include si-cli validate-settings output
- Include relevant logs with LOG_LEVEL=DEBUG