Soliplex Ingester Architecture

Overview

Soliplex Ingester is a document processing and RAG (Retrieval-Augmented Generation) ingestion system designed to handle large-scale document workflows. It provides a FastAPI-based REST API, workflow orchestration, and integration with document parsing and embedding services.

System Components

1. FastAPI Server

The server provides REST API endpoints for document and workflow management:

Document Routes (/api/v1/document/*) - Document upload, retrieval, and management
Batch Routes (/api/v1/batch/*) - Batch processing operations
Workflow Routes (/api/v1/workflow/*) - Workflow execution and monitoring
Stats Routes (/api/v1/stats/*) - System statistics and metrics

Server entry point: src/soliplex/ingester/server/__init__.py:30

2. Workflow System

The workflow system orchestrates multi-step document processing pipelines:

Document → Validate → Parse → Chunk → Embed → Store

Workflow Components:

WorkflowDefinition - Defines the steps and lifecycle events for a workflow
WorkflowRun - Represents a single execution instance for one document
RunGroup - Groups multiple workflow runs together
RunStep - Individual step execution within a workflow run

Step Types:

INGEST - Load document into system
VALIDATE - Validate document format and content
PARSE - Extract text and structure from document
CHUNK - Split document into semantic chunks
EMBED - Generate vector embeddings
STORE - Save to RAG system (LanceDB + HaikuRAG)
ENRICH - Add metadata or additional processing
ROUTE - Conditional routing logic

Implementation: src/soliplex/ingester/lib/wf/

3. Worker System

Async workers process workflow steps concurrently:

Workers poll for pending workflow steps
Configurable concurrency levels for different operations
Automatic retry logic with configurable retry counts
Health check/heartbeat system via WorkerCheckin

Worker implementation: src/soliplex/ingester/lib/wf/runner.py

4. Storage Layer

Database:

SQLModel + SQLAlchemy with async support
Supports SQLite (dev) and PostgreSQL (production)
Alembic for migrations

File Storage:

Configurable backends (filesystem, S3-compatible via OpenDAL)
Separate storage locations for different artifact types:
Raw documents
Parsed markdown
Parsed JSON
Chunks
Embeddings

Vector Storage:

LanceDB for vector embeddings
HaikuRAG client for retrieval operations

5. Document Processing Pipeline

graph LR
    A[Upload Document] --> B[Create DocumentURI]
    B --> C[Hash & Store as Document]
    C --> D[Queue Workflow Run]
    D --> E[Validate Step]
    E --> F[Parse with Docling]
    F --> G[Chunk Text]
    G --> H[Generate Embeddings]
    H --> I[Store in LanceDB]
    I --> J[Update RAG Index]

6. External Services

Docling Server:

Document parsing service
Extracts text, structure, and metadata
Configurable via DOCLING_SERVER_URL

HaikuRAG:

RAG backend for document retrieval
Vector search and document management
Optional (controlled by DO_RAG setting)

Data Flow

Document Ingestion Flow

Upload - Client uploads document via /api/v1/document/upload
Hash & Dedupe - System computes SHA256 hash, checks for duplicates
Create URI - Maps source URI to document hash
Batch Assignment - Associates document with processing batch
Workflow Creation - Creates WorkflowRun and RunSteps
Worker Processing - Workers pick up and execute steps
Status Updates - Database tracks step and run status
Completion - Document marked complete when all steps succeed

Workflow Execution Flow

Worker Startup - Worker registers and starts polling
Step Selection - Worker queries for PENDING steps with FOR UPDATE lock
Status Transition - PENDING → RUNNING → COMPLETED/ERROR/FAILED
Step Execution - Calls registered handler method
Artifact Storage - Saves intermediate results
Retry Logic - Automatic retry on ERROR status
Run Completion - Aggregates step status to run status
Group Completion - Aggregates run status to group status

Configuration

Configuration via environment variables with pydantic-settings:

Database connection
File storage paths
Worker concurrency settings
External service URLs
Workflow and parameter directories

See src/soliplex/ingester/lib/config.py:15 for full configuration schema.

Scalability

Horizontal Scaling:

Multiple workers can run concurrently
Database row-level locking prevents duplicate processing
Stateless API servers can be load balanced

Vertical Scaling:

Configurable concurrency per worker
Batch size controls for embedding operations
Connection pooling for database access

Workflow Parallelism:

Multiple workflows can process simultaneously
Steps within a workflow run sequentially
Different documents process independently

Technology Stack

Web Framework: FastAPI 0.120+
Database ORM: SQLModel 0.0.27+
Async Runtime: asyncio
CLI: Typer
Document Parsing: Docling
Vector DB: LanceDB 0.25+
RAG: HaikuRAG
Storage: OpenDAL (multi-backend support)

Extension Points

Custom Workflow Steps: Define custom step handlers by:

Creating a new async function matching the EventHandler signature
Registering in workflow YAML configuration
Implementing retry logic and error handling

Custom Storage Backends: Configure via FILE_STORE_TARGET environment variable and OpenDAL configuration.

Custom Lifecycle Events: Add event handlers in workflow configuration to respond to:

GROUP_START / GROUP_END
ITEM_START / ITEM_END
STEP_START / STEP_END
ITEM_FAILED / STEP_FAILED

Monitoring

Database Tables:

workflowrun - Track run status and duration
runstep - Monitor individual step execution
workcheckin - Worker health and activity
lifecyclehistory - Audit trail of events

Metrics Available:

Document processing throughput
Step success/failure rates
Worker utilization
Processing durations
Batch completion times

Access via /api/v1/stats/* endpoints.