Zitlac is a document management platform. I built the external API integration layer, the caching infrastructure, and the observability foundation. Each began as a gap I identified and proposed filling, not as a ticket assigned to me. The goal throughout was a system reliable enough to trust in production: predictable latency, debuggable failures, and a test suite that catches regressions before they reach staging.
External API Integration Pipeline
The system needed to call an external API for processing while staying responsive under load and degrading gracefully when the upstream was slow or unavailable. A synchronous, tightly-coupled integration would block the request path and take the feature down whenever the external service had issues. I flagged this as a reliability risk early and proposed a redesign before it became a production problem.
I built the integration as a pluggable handler chain: each step is independent and the pipeline can be reconfigured without touching the core flow. Processing is offloaded to an async queue so the HTTP response returns immediately and the caller polls for results — the request path is never held waiting on a third party.
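The shape described above can be sketched roughly as follows. This is a minimal illustration, not the Zitlac code: the class name AsyncPipeline, the use of string payloads, and the in-memory job map are all assumptions made to keep the example self-contained.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.UnaryOperator;

// Hypothetical sketch: handlers are plain functions applied in order on a worker pool.
final class AsyncPipeline {
    private final List<UnaryOperator<String>> handlers;
    private final ExecutorService workers = Executors.newFixedThreadPool(2);
    private final Map<String, CompletableFuture<String>> jobs = new ConcurrentHashMap<>();

    AsyncPipeline(List<UnaryOperator<String>> handlers) {
        // Copying the list means the chain is fixed per pipeline instance;
        // reconfiguring means building a new pipeline with a different list.
        this.handlers = List.copyOf(handlers);
    }

    /** Enqueues the work and returns a job ID immediately, without waiting for the result. */
    String submit(String payload) {
        String jobId = UUID.randomUUID().toString();
        jobs.put(jobId, CompletableFuture.supplyAsync(() -> runChain(payload), workers));
        return jobId;
    }

    /** Non-blocking poll: empty while the job is still running. Assumes handlers never return null. */
    Optional<String> poll(String jobId) {
        CompletableFuture<String> job = jobs.get(jobId);
        if (job == null) {
            throw new IllegalArgumentException("unknown job: " + jobId);
        }
        // join() rethrows a handler failure as an unchecked CompletionException.
        return job.isDone() ? Optional.of(job.join()) : Optional.empty();
    }

    private String runChain(String payload) {
        String current = payload;
        for (UnaryOperator<String> handler : handlers) {
            current = handler.apply(current);
        }
        return current;
    }

    void shutdown() {
        workers.shutdown();
    }
}
```

In this sketch a caller would invoke submit from the HTTP handler, return the job ID in the response, and serve later status requests from poll. A production version would likely use a durable queue instead of an in-process executor so that jobs survive a restart.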
Circuit breakers wrap each external call. If the upstream starts failing, the circuit opens and requests fail fast rather than stacking up. A retry policy with exponential backoff handles transient errors. The pipeline is extensible by design: new providers or processing steps are added as handlers without modifying the existing chain.
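For illustration, the two mechanisms can be hand-rolled in a few dozen lines. The sketch below is an assumption-laden simplification: the class names, the consecutive-failure threshold, and the doubling delay are hypothetical choices, and a production system would more likely rely on an established resilience library than on code like this.

```java
import java.util.function.Supplier;

/** Thrown when the breaker rejects a call without contacting the upstream. */
final class CircuitOpenException extends RuntimeException {
    CircuitOpenException() { super("circuit open: failing fast"); }
}

/** Opens after N consecutive failures; allows a trial call once the cool-down has elapsed. */
final class CircuitBreaker {
    private final int failureThreshold;
    private final long coolDownMillis;
    private int consecutiveFailures;
    private boolean open;
    private long openedAtMillis;

    CircuitBreaker(int failureThreshold, long coolDownMillis) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
    }

    // Simplification: synchronizing the whole method serializes upstream calls.
    synchronized <T> T call(Supplier<T> upstream) {
        if (open && System.currentTimeMillis() - openedAtMillis < coolDownMillis) {
            throw new CircuitOpenException();
        }
        try {
            T result = upstream.get(); // circuit closed, or a trial call after the cool-down
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                open = true; // open, or re-open after a failed trial call
                openedAtMillis = System.currentTimeMillis();
            }
            throw e;
        }
    }

    synchronized boolean isOpen() { return open; }
}

final class Retry {
    /** Retries on failure, doubling the delay each time: base, 2*base, 4*base, ... */
    static <T> T withBackoff(Supplier<T> action, int maxAttempts, long baseDelayMillis) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (CircuitOpenException e) {
                throw e; // the breaker is open, so retrying would only add load
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts - 1) sleep(baseDelayMillis << attempt);
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", e);
        }
    }
}
```

The two compose by nesting, for example Retry.withBackoff(() -> breaker.call(upstream), 3, 100). With the retry on the outside, every failed attempt counts toward the breaker's threshold, and once the circuit opens the retry loop stops immediately. Real deployments usually also add jitter and a maximum delay to the backoff.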
Caching Layer
Read traffic was the dominant source of database load. Most records are read far more frequently than they are written — serving every read from the database was unnecessary and expensive. I identified this pattern from query metrics and proposed the caching layer as a targeted fix rather than a general infrastructure change.
I introduced a read-through cache with TTLs tuned to the update frequency of each data type. Frequently accessed records are served from cache without touching the database. The connection pool shrank, and p99 read latency dropped significantly. Database read load fell 35% in the week after rollout.
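A read-through cache with a TTL can be sketched as below. The names (ReadThroughCache, loader, clock) and the in-process map are assumptions for illustration only; the text does not say which cache technology was used, and a shared cache server would look different in detail while following the same read path.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical sketch: one instance per data type, each with its own TTL.
final class ReadThroughCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;

        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;  // stands in for the database read
    private final long ttlMillis;
    private final LongSupplier clock;     // injectable so expiry can be tested

    ReadThroughCache(Function<K, V> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Serves a fresh entry from memory; on a miss or an expired entry, loads and stores it. */
    V get(K key) {
        long now = clock.getAsLong();
        // compute() runs atomically per key, so concurrent misses trigger a single load.
        // The loader must not call back into this cache from inside compute().
        Entry<V> entry = entries.compute(key, (k, existing) ->
                existing != null && existing.expiresAtMillis > now
                        ? existing
                        : new Entry<V>(loader.apply(k), now + ttlMillis));
        return entry.value;
    }

    /** Drops an entry early, for example right after a write to the underlying record. */
    void invalidate(K key) {
        entries.remove(key);
    }
}
```

Giving each data type its own instance is one way to express per-type TTLs: a rarely updated type gets a long TTL, a frequently updated one a short TTL or explicit invalidation on write.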
Observability
Debugging production issues was slow because log output was inconsistent across services, fields were named differently, and there was no way to trace a single request across service boundaries. Every incident meant grepping through multiple log files and reconstructing a sequence of events by hand. I recognized this as a systemic problem — not just a tooling gap — and proposed a shared logging standard as the fix.
I replaced all logging with structured JSON output using a consistent schema across every service. Every log line carries a correlation ID threaded from the incoming request through the full call chain. Searching for a failure in production became a single query. Debugging time dropped 60%.
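As a rough sketch of the idea, the helper below emits one JSON object per log line and carries a correlation ID in a thread-local. The field names (timestamp, level, service, correlation_id, message) and the class StructuredLog are assumptions, not the actual Zitlac schema, and a real service would normally get the same effect from its logging framework rather than hand-built strings.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of a shared log schema with a per-request correlation ID.
final class StructuredLog {
    private static final ThreadLocal<String> CORRELATION_ID = new ThreadLocal<>();

    /** Reuses the caller's ID when one arrives with the request, otherwise starts a new one. */
    static String beginRequest(String incomingId) {
        String id = (incomingId == null || incomingId.isBlank())
                ? UUID.randomUUID().toString()
                : incomingId;
        CORRELATION_ID.set(id);
        return id;
    }

    /** Clears the ID so a pooled thread does not leak it into the next request. */
    static void endRequest() {
        CORRELATION_ID.remove();
    }

    /** Builds one JSON log line; field values are assumed to be non-null strings. */
    static String format(String level, String service, String message, Map<String, String> fields) {
        Map<String, String> line = new LinkedHashMap<>();
        line.put("timestamp", Instant.now().toString());
        line.put("level", level);
        line.put("service", service);
        line.put("correlation_id", String.valueOf(CORRELATION_ID.get()));
        line.put("message", message);
        line.putAll(fields);

        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, String> e : line.entrySet()) {
            if (json.length() > 1) json.append(',');
            json.append('"').append(escape(e.getKey())).append("\":\"")
                .append(escape(e.getValue())).append('"');
        }
        return json.append('}').toString();
    }

    private static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }
}
```

A request filter would call beginRequest with the ID from an incoming header, forward the same ID on outbound calls, and call endRequest in a finally block. Searching the log store for one correlation_id value then returns every line for that request across services.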
Test Coverage
I wrote the service layer test suite from scratch using TDD with JUnit 5 and Mockito. Each unit test covers a single behavior and mocks at the boundary. Integration tests run against a real database and cache rather than in-memory fakes that diverge from production behavior.
The suite reached 85% coverage across the service layer. During a later refactor of the processing pipeline, it caught three regressions before they reached staging. The tests paid for themselves on the first significant change.