The Challenge

The original architecture followed a common pattern:

Upload → Parse → Return text.

That works until:

  • Parsing blocks API responsiveness
  • Workers crash mid-processing
  • Multiple workers compete for the same document
  • Tenants require strict isolation
  • Users need to trust extracted outputs

The objective was clear:

Move from a functional parser to a production-grade platform.

The Approach

1. Asynchronous Architecture by Design

Parsing was removed from the API request path.
Uploads are acknowledged immediately; workers process documents independently.

Impact: Fast ingestion without blocking under CPU-heavy load.
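
As a rough illustration of taking parsing off the request path, here is a minimal in-process sketch. The names (handle_upload, worker_loop, the "queued"/"processing"/"completed" statuses) and the in-memory store and queue are assumptions made for the example, not the platform's actual components; a real deployment would use a database and a message broker.

```python
import queue
import threading
import uuid
from typing import Dict

# Hypothetical in-memory stand-ins for the document store and the work queue.
documents: Dict[str, dict] = {}
work_queue: "queue.Queue[str]" = queue.Queue()


def handle_upload(content: bytes) -> dict:
    """API path: record the document, enqueue it, and return without parsing."""
    doc_id = str(uuid.uuid4())
    documents[doc_id] = {"status": "queued", "content": content, "text": None}
    work_queue.put(doc_id)
    return {"id": doc_id, "status": "queued"}


def worker_loop() -> None:
    """Worker path: process documents independently of the request cycle."""
    while True:
        doc_id = work_queue.get()
        doc = documents[doc_id]
        doc["status"] = "processing"
        # Placeholder for real (CPU-heavy) parsing.
        doc["text"] = doc["content"].decode("utf-8", errors="replace")
        doc["status"] = "completed"
        work_queue.task_done()


# The worker runs separately from the caller of handle_upload.
threading.Thread(target=worker_loop, daemon=True).start()
```

The caller gets an identifier and a status back right away and polls (or is notified) for completion later; the cost of parsing never appears in the upload latency.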

2. Deterministic Worker Claiming

Atomic state transitions ensure only one worker processes a document at a time.

Impact: Safe concurrency without distributed lock complexity.
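
One common way to get an atomic claim is a conditional UPDATE whose WHERE clause names the expected current state. The sketch below uses SQLite and a hypothetical documents table with status and claimed_by columns; the schema and function names are assumptions for illustration, not the platform's actual ones.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical documents table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id TEXT PRIMARY KEY, status TEXT NOT NULL, claimed_by TEXT)"
    )
    return conn


def try_claim(conn: sqlite3.Connection, doc_id: str, worker_id: str) -> bool:
    """Attempt the queued -> processing transition for one document.

    The row changes only if it is still 'queued', so at most one caller
    observes rowcount == 1 for a given document.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE documents SET status = 'processing', claimed_by = ? "
            "WHERE id = ? AND status = 'queued'",
            (worker_id, doc_id),
        )
    return cur.rowcount == 1
```

A worker that gets False simply moves on to the next document; no separate lock service is involved because the database performs the compare-and-set.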

3. Event-Driven Processing with Durable Fallback

Queue notifications enable low-latency pickup.
If queue publication fails, documents remain durably queued in the database and are recovered automatically.

Impact: Forward progress even under infrastructure degradation.
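
The ordering that makes this work is "persist first, notify second". The following sketch assumes a hypothetical documents table and a publish callable standing in for the queue client; a periodic sweep is one plausible recovery mechanism, though the source does not say exactly how recovery is triggered.

```python
import sqlite3
from typing import Callable, List

# Hypothetical schema used only for this example.
SCHEMA = "CREATE TABLE documents (id TEXT PRIMARY KEY, status TEXT NOT NULL)"


def submit(conn: sqlite3.Connection, doc_id: str,
           publish: Callable[[str], None]) -> bool:
    """Commit the queued row, then try to notify. Returns True if notified."""
    with conn:
        conn.execute(
            "INSERT INTO documents (id, status) VALUES (?, 'queued')", (doc_id,)
        )
    try:
        publish(doc_id)  # low-latency path
        return True
    except Exception:
        # The row is already committed, so the document is not lost;
        # a later sweep will find it.
        return False


def sweep_queued(conn: sqlite3.Connection) -> List[str]:
    """Fallback path: list documents still waiting, whether or not a
    notification was ever delivered for them."""
    rows = conn.execute(
        "SELECT id FROM documents WHERE status = 'queued' ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]
```

Because the sweep reads from the database rather than the queue, a failed publish delays pickup but does not drop the document.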

4. Two-Tier Failure Recovery

  • Queue-level stale message handling
  • Database-level TTL recovery for stuck processing states

Impact: Worker crashes and queue failures no longer require user resubmission.
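
The database tier of this recovery can be sketched as a single statement that returns expired claims to the queue. The claimed_at column, the TTL value, and the function name are assumptions for the example; the queue-level tier depends on the broker in use and is not shown.

```python
import sqlite3
import time
from typing import Optional

# Hypothetical schema: claimed_at is a Unix timestamp set when a worker claims.
SCHEMA = (
    "CREATE TABLE documents ("
    "id TEXT PRIMARY KEY, status TEXT NOT NULL, claimed_at REAL)"
)


def requeue_stale(conn: sqlite3.Connection, ttl_seconds: float,
                  now: Optional[float] = None) -> int:
    """Return documents stuck in 'processing' past the TTL to 'queued'.

    Returns the number of documents requeued.
    """
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE documents SET status = 'queued', claimed_at = NULL "
            "WHERE status = 'processing' AND claimed_at < ?",
            (now - ttl_seconds,),
        )
    return cur.rowcount
```

Run on a schedule, this means a worker that dies mid-document leaves behind only a claim that eventually expires. The TTL should comfortably exceed the longest expected parse time, otherwise healthy work gets requeued.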

5. SaaS-Grade Tenant Isolation

Tenant context is enforced across API and BFF layers.
Cross-tenant access is denied with minimal information exposure while maintaining audit trails.

Impact: Security boundaries suitable for multi-tenant production environments.
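
One way to deny cross-tenant access with minimal information exposure is to answer exactly as if the document did not exist, while recording the attempt internally. The sketch below assumes that approach; the 404 response, the Document fields, and the audit record shape are illustrative choices, not details confirmed by the platform.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Document:
    id: str
    tenant_id: str
    text: str


# Hypothetical in-memory audit sink; a real system would write to durable storage.
audit_log: List[dict] = []


def get_document(store: Dict[str, Document], tenant_id: str,
                 doc_id: str) -> Tuple[int, dict]:
    """Return (status_code, body) for a tenant-scoped document read."""
    not_found = (404, {"error": "not found"})
    doc = store.get(doc_id)
    if doc is None:
        return not_found
    if doc.tenant_id != tenant_id:
        # Same response as a missing document, so the caller cannot learn
        # that the ID exists under another tenant. The attempt is still audited.
        audit_log.append(
            {"event": "cross_tenant_denied", "tenant": tenant_id, "doc": doc_id}
        )
        return not_found
    return 200, {"id": doc.id, "text": doc.text}
```

The external response is identical for "missing" and "belongs to someone else", while the audit trail keeps the distinction for operators.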

6. Visual Trust Through PDF Highlighting

Extracted chunks include bounding box metadata.
The web console overlays highlights directly on the source PDF.

Impact: Users can verify extracted text in context, which builds confidence and reduces review time.
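
To show how bounding box metadata can drive an overlay, the sketch below converts a box into percentage offsets suitable for an absolutely positioned element on top of a rendered page. It assumes boxes are stored as (x0, y0, x1, y1) in PDF points with the origin at the bottom-left of the page, which is the usual PDF convention; the platform's actual metadata format is not specified in this write-up.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BBox:
    """Assumed format: PDF points, origin at the bottom-left of the page."""
    x0: float
    y0: float
    x1: float
    y1: float


def to_overlay(box: BBox, page_width: float, page_height: float) -> Dict[str, float]:
    """Convert a PDF-space box to top-left-origin percentages of the page."""
    return {
        "left": 100 * box.x0 / page_width,
        # Flip the y-axis: the top edge of the box is y1 in PDF space.
        "top": 100 * (page_height - box.y1) / page_height,
        "width": 100 * (box.x1 - box.x0) / page_width,
        "height": 100 * (box.y1 - box.y0) / page_height,
    }
```

Percentages keep the highlight aligned when the page is zoomed or resized, since they are independent of the rendered pixel size.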

7. End-to-End Observability

Traces propagate across Web → BFF → API → Worker boundaries.
Metrics, logs, dashboards, and correlation IDs are built in.

Impact: A single request can be followed across service boundaries, which makes distributed debugging practical.
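
The correlation ID part of this can be sketched with the standard library alone. This is a simplified stand-in, not a tracing implementation: the header name X-Correlation-ID and the function names are assumptions, and a production setup would more likely rely on a dedicated tracing library.

```python
import contextvars
import logging
import uuid
from typing import Dict

HEADER = "X-Correlation-ID"  # hypothetical header name

# Holds the ID for the request currently being handled.
correlation_id: contextvars.ContextVar = contextvars.ContextVar(
    "correlation_id", default="-"
)


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def accept_request(headers: Dict[str, str]) -> str:
    """Reuse the caller's ID if present, otherwise start a new one."""
    cid = headers.get(HEADER) or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid


def outgoing_headers() -> Dict[str, str]:
    """Headers to add to downstream calls so the next hop sees the same ID."""
    return {HEADER: correlation_id.get()}
```

Each hop calls accept_request on the way in and outgoing_headers on the way out; for the worker boundary, the same ID would travel in the queue message instead of an HTTP header. With the filter installed and %(correlation_id)s in the log format, one ID ties together log lines from every service.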

Outcomes

Platform Capabilities

  • Asynchronous document lifecycle with an explicit status model
  • Independent worker scaling
  • Durable queue + database safety nets
  • Automated retry and recovery pathways
  • Tenant-aware access controls and audit logging
  • PDF-bound chunk visualization
  • Full observability surface (metrics, traces, logs)

The system now behaves like a managed parsing service, not a utility endpoint.

Engineering Validation

  • Backend tests across unit, integration, contract, e2e, and performance layers
  • Worker crash recovery validation
  • Restart and replay safety testing
  • Tenant isolation enforcement testing
  • Queue durability scenario coverage

Reliability was validated, not assumed.

The Result

What began as a document parsing endpoint is now a secure, observable, multi-tenant platform foundation.

Parsing works.
Scaling works.
Recovery works.
Trust is visible.

That’s the difference between a working tool and an adoptable system.