The Challenge
The original architecture followed a common pattern:
Upload → Parse → Return text.
That works — until:
- Parsing blocks API responsiveness
- Workers crash mid-processing
- Multiple workers compete for the same document
- Tenants require strict isolation
- Users need to trust extracted outputs
The objective was clear:
Move from a functional parser to a production-grade platform.
The Approach
1. Asynchronous Architecture by Design
Parsing was removed from the API request path.
Uploads are acknowledged immediately; workers process documents independently.
Impact: Fast ingestion without blocking under CPU-heavy load.
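As a rough illustration of the split described above, the sketch below separates an acknowledge-only upload path from a worker path. It is not the platform's actual code: the in-memory `STORE`, the status names (`queued`, `processing`, `done`), and the functions `accept_upload` and `run_worker_once` are all hypothetical, and UTF-8 decoding stands in for real parsing.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Document:
    doc_id: str
    payload: bytes
    status: str = "queued"  # hypothetical status names: queued -> processing -> done
    text: Optional[str] = None


# Stand-in for a durable store; a real system would use a database.
STORE: Dict[str, Document] = {}


def accept_upload(payload: bytes) -> dict:
    """API path: persist the upload and acknowledge it. No parsing happens here."""
    doc = Document(doc_id=uuid.uuid4().hex, payload=payload)
    STORE[doc.doc_id] = doc
    return {"doc_id": doc.doc_id, "status": doc.status}


def run_worker_once() -> Optional[str]:
    """Worker path: pick one queued document, parse it, and record the result."""
    for doc in STORE.values():
        if doc.status == "queued":
            doc.status = "processing"
            # Placeholder for the CPU-heavy parsing step.
            doc.text = doc.payload.decode("utf-8", errors="replace")
            doc.status = "done"
            return doc.doc_id
    return None
```

The caller gets a document ID back before any parsing runs and can poll the status afterwards, which is what keeps the request path cheap.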
2. Deterministic Worker Claiming
Atomic state transitions ensure only one worker processes a document at a time.
Impact: Safe concurrency without distributed lock complexity.
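One common way to get this kind of atomic claim is a conditional `UPDATE` that only succeeds while the row is still in the expected state. The sketch below shows that technique using SQLite from the Python standard library. The table layout, column names, and status values are assumptions made for illustration and are not taken from the platform's schema.

```python
import sqlite3


def make_claim_db() -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical schema for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id TEXT PRIMARY KEY, status TEXT NOT NULL, claimed_by TEXT)"
    )
    return conn


def try_claim(conn: sqlite3.Connection, doc_id: str, worker_id: str) -> bool:
    """Attempt the queued -> processing transition for one document.

    The WHERE clause makes the transition conditional, so when two workers
    race, the database applies the update for exactly one of them.
    """
    cur = conn.execute(
        "UPDATE documents SET status = 'processing', claimed_by = ? "
        "WHERE id = ? AND status = 'queued'",
        (worker_id, doc_id),
    )
    conn.commit()
    # rowcount is 1 for the winner and 0 for every other worker.
    return cur.rowcount == 1
```

A worker that gets `False` back simply moves on to the next candidate. No separate lock service is involved, because the row's status column acts as the lock.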
3. Event-Driven Processing with Durable Fallback
Queue notifications enable low-latency pickup.
If queue publication fails, documents remain durably queued in the database and are recovered automatically.
Impact: Forward progress even under infrastructure degradation.
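The fallback behavior can be sketched as "write durably first, notify second". In the hypothetical example below, `enqueue` commits the database row before attempting to publish, and `sweep_queued` is the recovery scan that finds documents whose notification never arrived. The schema, the function names, and the `publish` callable are illustrative assumptions and not the platform's real interfaces.

```python
import sqlite3
from typing import Callable, List


def make_fallback_db() -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical schema for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    return conn


def enqueue(conn: sqlite3.Connection, doc_id: str, publish: Callable[[str], None]) -> bool:
    """Record the document as queued, then try to notify workers.

    Returns True if the notification was published. A False result is not
    an error for the caller: the row is already committed and can be swept.
    """
    conn.execute("INSERT INTO documents (id, status) VALUES (?, 'queued')", (doc_id,))
    conn.commit()
    try:
        publish(doc_id)
        return True
    except Exception:
        # Queue is degraded; rely on the database record instead.
        return False


def sweep_queued(conn: sqlite3.Connection) -> List[str]:
    """Fallback path: list documents still waiting, regardless of the queue."""
    rows = conn.execute(
        "SELECT id FROM documents WHERE status = 'queued' ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


def failing_publish(doc_id: str) -> None:
    """Simulates a queue outage for the demo."""
    raise ConnectionError("queue unavailable")
```

Because the queue only carries a hint, losing a message delays pickup until the next sweep but does not lose the document.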
4. Two-Tier Failure Recovery
- Queue-level stale message handling
- Database-level time-to-live (TTL) recovery for documents stuck in a processing state
Impact: Worker crashes and queue failures no longer require user resubmission.
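The database-level tier can be illustrated with a periodic query that returns stale `processing` rows to the queue. This is a minimal sketch under stated assumptions: the `claimed_at` column (epoch seconds), the status names, and the `recover_stuck` function are hypothetical, and the TTL value is chosen only for the example.

```python
import sqlite3


def make_recovery_db() -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical schema for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id TEXT PRIMARY KEY, status TEXT NOT NULL, claimed_at REAL)"
    )
    return conn


def recover_stuck(conn: sqlite3.Connection, now: float, ttl_seconds: float) -> int:
    """Requeue documents whose claim is older than the TTL.

    A worker that crashed mid-processing never finishes its claim, so its
    row stays in 'processing' until this sweep resets it.
    """
    cutoff = now - ttl_seconds
    cur = conn.execute(
        "UPDATE documents SET status = 'queued', claimed_at = NULL "
        "WHERE status = 'processing' AND claimed_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount  # number of documents returned to the queue
```

The TTL has to be longer than the slowest legitimate parse, otherwise healthy work would be requeued and processed twice.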
5. SaaS-Grade Tenant Isolation
Tenant context is enforced at both the API and backend-for-frontend (BFF) layers.
Cross-tenant access is denied with minimal information exposure while maintaining audit trails.
Impact: Security boundaries suitable for multi-tenant production environments.
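One way to deny cross-tenant access with minimal information exposure is to give the caller the same response for "does not exist" and "belongs to another tenant", while recording the difference internally. The sketch below assumes that approach. The 404 status code, the `Doc` type, the audit event shape, and `fetch_document` are illustrative choices, not details confirmed by the platform.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant_id: str
    text: str


def fetch_document(
    docs: Dict[str, Doc],
    audit: List[dict],
    caller_tenant: str,
    doc_id: str,
) -> Tuple[int, Optional[Doc]]:
    """Return (status_code, document) for a tenant-scoped lookup."""
    doc = docs.get(doc_id)
    if doc is None:
        return 404, None
    if doc.tenant_id != caller_tenant:
        # Same outward response as a missing document, so the caller
        # cannot learn that the ID exists under another tenant.
        audit.append(
            {"event": "cross_tenant_denied", "caller": caller_tenant, "doc_id": doc_id}
        )
        return 404, None
    return 200, doc
```

The audit entry keeps the attempt visible to operators even though the caller sees nothing that distinguishes it from a missing document.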
6. Visual Trust Through PDF Highlighting
Extracted chunks include bounding box metadata.
The web console overlays highlights directly on the source PDF.
Impact: Users can verify extracted text in context, which builds confidence and reduces review time.
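To make the overlay step concrete, the sketch below converts a chunk's bounding box into a rectangle a viewer could draw over the rendered page. It assumes boxes are stored in PDF points with the origin at the bottom-left of the page (the default PDF user space) and that the viewer positions overlays from the top-left. The `BBox` type, its field names, and `to_overlay_rect` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BBox:
    """Bounding box in PDF points, origin at the bottom-left (assumed)."""
    x0: float
    y0: float
    x1: float
    y1: float


def to_overlay_rect(bbox: BBox, page_height_pt: float, scale: float) -> Dict[str, float]:
    """Map a PDF-space box to a top-left-origin rectangle in rendered pixels.

    `scale` is rendered pixels per PDF point for the current zoom level.
    """
    return {
        "left": bbox.x0 * scale,
        # Flip the vertical axis: the box's top edge is y1 in PDF space.
        "top": (page_height_pt - bbox.y1) * scale,
        "width": (bbox.x1 - bbox.x0) * scale,
        "height": (bbox.y1 - bbox.y0) * scale,
    }
```

For a US Letter page (792 points tall) rendered at 2x, a box spanning y = 700 to 720 lands 144 pixels from the top and is 40 pixels high.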
7. End-to-End Observability
Traces propagate across Web → BFF → API → Worker boundaries.
Metrics, logs, dashboards, and correlation IDs are built in.
Impact: A single request can be followed across every service boundary, which makes distributed debugging practical.
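The propagation idea can be shown with a correlation ID carried in a request header and stamped on every log line. This is a simplified sketch using only the standard library. The header name `X-Correlation-ID`, the log format, and the function names are assumptions, and a production system would more likely rely on a tracing library than on hand-rolled helpers like these.

```python
import uuid
from contextvars import ContextVar
from typing import Dict

HEADER = "X-Correlation-ID"  # hypothetical header name

# Holds the ID for the request currently being handled in this context.
_correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")


def begin_request(headers: Dict[str, str]) -> str:
    """Adopt the caller's correlation ID, or mint one at the edge."""
    cid = headers.get(HEADER) or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid


def outgoing_headers() -> Dict[str, str]:
    """Headers to attach when calling the next service or enqueueing work."""
    return {HEADER: _correlation_id.get()}


def log_line(service: str, message: str) -> str:
    """Format a log entry that can be joined across services by ID."""
    return f"correlation_id={_correlation_id.get()} service={service} msg={message}"
```

Each hop calls `begin_request` with the headers it received and passes `outgoing_headers()` downstream, so filtering logs by one ID reconstructs the path from the web console to the worker.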
Outcomes
Platform Capabilities
- Asynchronous document lifecycle with explicit status model
- Independent worker scaling
- Durable queue + database safety nets
- Automated retry and recovery pathways
- Tenant-aware access controls and audit logging
- PDF-bound chunk visualization
- Full observability surface (metrics, traces, logs)
The system now behaves like a managed parsing service — not a utility endpoint.
Engineering Validation
- Backend tests across unit, integration, contract, e2e, and performance layers
- Worker crash recovery validation
- Restart and replay safety testing
- Tenant isolation enforcement testing
- Queue durability scenario coverage
Reliability was validated — not assumed.
The Result
What began as a document parsing endpoint is now a secure, observable, multi-tenant platform foundation.
Parsing works.
Scaling works.
Recovery works.
Trust is visible.
That’s the difference between a working tool and an adoptable system.
