How AiSensy Handles 200M+ API Requests Every Day

published on 24 April 2026

From a free Heroku dyno to the distributed infrastructure powering WhatsApp for 210,000 businesses across 68 countries — the complete story, unfiltered.

In India, WhatsApp is opened an average of 30 times a day. That's nearly twice every waking hour.

  • Delhi Transport Corporation: 1.4 million ticket bookings enabled through WhatsApp
  • PhysicsWallah: 3x more student leads via Click-to-WhatsApp campaigns
  • Skullcandy: 25-40% of abandoned carts recovered, 150x ROI

Each of those numbers depends on a system that doesn't blink under burst load. This post is the unfiltered engineering story behind that system — not a single architectural breakthrough, but a sequence of decisions made under real production pressure, where every choice had immediate consequences.

If you've ever debugged a cascading failure at 3 a.m., or watched a well-intentioned retry mechanism take down a platform — the lessons here will feel familiar.

Table of Contents:

How It All Started: One Server, One Codebase

The Day More Hardware Made Things Worse

The Database Bottleneck That Became Our 3 a.m. Wake-Up Call

When Our Safety Net Became the Fire

The Engine Running 1.5 Billion Messages That No One Sees

Why We Were Always Late to Our Own Emergencies

The Tax You Don't Notice Until It's Costing Millions

Three Failures That Only Exist at Scale

210,000 Businesses Sharing One Platform Without Knowing It

Principles That Emerged

What We're Building Next

How It All Started: One Server, One Codebase


AiSensy started in 2020 as a chatbot product for education. When Covid arrived, the market shifted fast, and businesses needed WhatsApp at a scale nobody had anticipated. By late 2020, we had a beta with 20–30 clients. Today that number is 210,000.

The infrastructure we started with was a single free Heroku dyno — one monolith handling authentication, message dispatching, campaign management, and billing in a single process. It carried us to our first paying customers, and that was exactly the right call.

Monoliths were never the problem. Breaking them too early would have been.

1. Heroku free dyno

One monolith, one codebase, ~10,000 requests a day.

2. EC2 + vertical scaling

We moved to infrastructure we owned and bought time through bigger instances. That time eventually ran out.

3. Microservices, single MongoDB

We distributed the application layer but centralised the bottleneck. The DB became the new ceiling.

4. Async by default

Queues, background workers, Lambda for spikes, Dead Letter Queues for anything that couldn't be dropped.

5. Today

Reads and writes separated. Database changes propagate as events. Every service runs in a container sized for what it actually does. Autoscaling responds to queue depth, not CPU, because by the time CPU spikes, you've already kept someone waiting.


The Day More Hardware Made Things Worse

Our platform didn't fail abruptly. It degraded gradually, which is significantly harder to debug. Nothing was completely broken, yet nothing was reliably predictable.

At peak load, broadcast campaigns, real-time ticketing, and lead-generation pipelines collided. Latency spiked. Database contention rose. Retries amplified the load. Failures cascaded.

Adding hardware solved nothing because the constraint was architectural, not infrastructural. A slowdown in one service propagated across every dependent system.

So we stopped treating scale as a throughput problem and started treating it as a failure-behaviour problem. This meant decoupling services so failures stay localised, making async processing the default, applying retry strategies with exponential backoff and jitter, and designing systems to degrade gracefully under pressure.

After a sequence of these fixes, the platform became predictable under load. Failures became contained instead of cascading.


The Database Bottleneck That Became Our 3 a.m. Wake-Up Call

For years, a single MongoDB instance was the source of truth for everything. Eventually it became the system's most expensive bottleneck — the thing that woke engineers at 3 a.m.

From an engineering perspective, this felt like threads waiting indefinitely for resources. The system appeared operational but was effectively stalled. We fixed it in six layers, but three changes created the largest impact:

  • Query optimisation and index hygiene aligned the database with a write-heavy workload, eliminating N+1 queries and redundant indexes.

  • Connection pooling prevented exhaustion by multiplexing application threads over a controlled set of database connections.

  • CQRS separated reads from writes, ensuring analytical queries no longer competed with transactional workloads.

As scale increased, we layered on more:

  • Change Data Capture enabled async event propagation without impacting the hot path.

  • Bulk operations using insertMany and bulkWrite reduced database round-trips by orders of magnitude.

  • Partitioning and tiered storage ensured each workload ran on the system best suited for it — hot data in DB, cold data in S3 using columnar formats.

Results: 95% faster database queries. The database evolved from a bottleneck into a scalable backbone.
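
To make the pooling and batching changes concrete, here is a minimal sketch using the MongoDB Node.js driver. The collection name, document shape, and pool sizes are illustrative assumptions, not our production values.

```typescript
import { MongoClient, AnyBulkWriteOperation } from "mongodb";

// Hypothetical message-status document; field names are illustrative.
interface MessageStatus {
  messageId: string;
  tenantId: string;
  status: "queued" | "sent" | "delivered" | "failed";
  updatedAt: Date;
}

// One shared client. maxPoolSize multiplexes all application
// concurrency over a bounded set of database connections, so a
// burst of traffic can't exhaust the server's connection slots.
const client = new MongoClient(process.env.MONGO_URI!, {
  maxPoolSize: 50,
  minPoolSize: 5,
});

// Flush accumulated status updates in one round-trip instead of N
// individual writes.
async function flushStatusBatch(updates: MessageStatus[]): Promise<void> {
  const ops: AnyBulkWriteOperation<MessageStatus>[] = updates.map((u) => ({
    updateOne: {
      filter: { messageId: u.messageId },
      update: { $set: { status: u.status, updatedAt: u.updatedAt } },
      upsert: true,
    },
  }));
  // ordered: false lets the server apply independent writes in
  // parallel and keeps one bad document from aborting the batch.
  await client
    .db("messaging")
    .collection<MessageStatus>("message_status")
    .bulkWrite(ops, { ordered: false });
}
```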

When Our Safety Net Became the Fire


This chapter is about the moment your resilience strategy becomes the source of failure.

In our system, retries created a feedback loop. A slow dependency triggered retries. Retries increased load. Increased load further slowed the system. The system began amplifying its own failure.

During these incidents, dashboards often showed low error rates while latency and queue depth quietly climbed.

The Incident That Forced the Rewrite

Early 2025. A large customer fired a multi-million-message campaign at peak hours. The first signal wasn't a CPU alert — it was support tickets about delayed delivery from other customers.

Our hypothesis was DB load. The real cause: synchronous Meta API calls had backed up the request thread pool, and retries were amplifying the original load by 3–4×. A "safety" mechanism had become the fire.

We patched it that night by capping retry budgets per pipeline. The permanent fix — moving to async-first with idempotency keys and circuit breakers — shipped over the following weeks. That single incident is why nearly every operation in AiSensy today is async, not request-backed.

The permanent toolkit:

  • Exponential backoff with jitter spreads retries over time and prevents thundering-herd conditions.

  • Circuit breakers stop calling failing dependencies and allow them to recover.

  • Idempotency keys on every request ensure retries never create duplicate processing or inconsistent state.
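
A minimal sketch of how these pieces combine. The retry budget, base delay, and send callback are illustrative placeholders, not our production values.

```typescript
import { randomUUID } from "crypto";

// Full jitter: sleep a random duration in [0, min(cap, base * 2^attempt)).
// Randomising the delay spreads retries out and prevents thundering herds.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 30_000): number {
  return Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Retry a send with a hard attempt budget. The idempotency key is
// minted once and reused on every attempt, so the downstream service
// can deduplicate and a retry can never double-deliver.
async function sendWithRetry<T>(
  send: (idempotencyKey: string) => Promise<T>,
  maxAttempts = 4, // capped budget: retries must not amplify load unboundedly
): Promise<T> {
  const idempotencyKey = randomUUID();
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await send(idempotencyKey);
    } catch (err) {
      lastError = err;
      await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError; // budget exhausted: surface the failure, don't loop
}
```

A circuit breaker would wrap the send callback itself, failing fast while the dependency recovers so retries stop hammering it.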

Failures became controlled and isolated instead of cascading. Retry logic became a safety net again, not a source of load.


The Engine Running 1.5 Billion Messages That No One Sees

The most critical service in AiSensy's architecture is one most users never see.

Meta's WhatsApp Business API enforces per-WABA sending rate caps based on account quality tier and recent activity. At our scale, this becomes a distributed rate-control problem: 210,000 tenants, each with different limits, and the system must maximise throughput for all of them simultaneously without crossing any boundary. Breach the limits and you get throttled. Sustained breaches get accounts flagged.

So we built a feedback-driven traffic-shaping engine that:

  • Tracks per-account rate limits in real time using signals from Meta APIs.
  • Dispatches messages at the maximum compliant throughput and no faster.
  • Buffers upstream bursts and smooths them into permitted downstream traffic.
  • Batches requests through coalescing to reduce per-message API overhead.
  • Enforces idempotency so retries never create duplicate delivery.
  • Isolates pipelines per message category so cross-tenant failures don't propagate.
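
Conceptually, the dispatch side reduces to a per-account token bucket whose rate and capacity are refreshed from live tier signals. The sketch below is a simplified single-process illustration; the names and numbers are assumptions, and the real engine coordinates this state across workers.

```typescript
// Token bucket per WhatsApp Business Account (WABA). Rate and capacity
// reflect the account's current tier, updated from Meta API signals.
interface Bucket {
  tokens: number;
  lastRefill: number; // epoch ms
  ratePerSec: number; // maximum compliant sending rate
  capacity: number;   // burst allowance
}

const buckets = new Map<string, Bucket>();

// Feedback loop: tier changes adjust the bucket in place.
function updateTier(wabaId: string, ratePerSec: number, capacity: number): void {
  const b =
    buckets.get(wabaId) ??
    { tokens: capacity, lastRefill: Date.now(), ratePerSec, capacity };
  b.ratePerSec = ratePerSec;
  b.capacity = capacity;
  buckets.set(wabaId, b);
}

// Returns true if this account may dispatch one message right now.
// Callers that get false buffer the message instead of sending it.
function tryAcquire(wabaId: string, now = Date.now()): boolean {
  const b = buckets.get(wabaId);
  if (!b) return false; // unknown account: fail closed
  // Continuous refill based on elapsed time, capped at burst capacity.
  b.tokens = Math.min(
    b.capacity,
    b.tokens + ((now - b.lastRefill) / 1000) * b.ratePerSec,
  );
  b.lastRefill = now;
  if (b.tokens < 1) return false; // at the limit: smooth, don't breach
  b.tokens -= 1;
  return true;
}
```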

Reliability at scale isn't about preventing failures. It's about designing so that when failures happen — and they will — they don't matter.

Results: No platform-induced account suspensions across 210,000 businesses, while delivering 1.5 billion messages.


Why We Were Always Late to Our Own Emergencies


WhatsApp broadcast campaigns generate near-instantaneous traffic surges. There is no gradual ramp.

It took us a while to realise that CPU-based autoscaling is too slow for this traffic profile. By the time a CPU spike registers, breaches a threshold, triggers scale-out, and the new instance warms up — the burst has already caused degradation. We were scaling in reaction to damage, not in anticipation of load.

To fix it, we had to move away from infrastructure-triggered scaling and build workload-aware mechanisms.

All services are now containerised — so a stateless webhook handler doesn't need the same instance family as a CPU-intensive message formatter.

For autoscaling itself, we moved to an event-driven sidecar pattern. A lightweight agent runs alongside each service and monitors queue depth, message arrival rate, and consumer lag. When queue depth starts climbing — before CPU ever spikes — the sidecar triggers scale-out. For scheduled campaigns, capacity is pre-warmed before the broadcast fires.
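
A simplified version of the scaling decision the sidecar makes. The backlog target and per-replica drain rate are placeholder assumptions.

```typescript
// Scale on queue signals, not CPU. By the time CPU climbs, the queue
// has already been deep for a while.
interface QueueMetrics {
  depth: number;             // messages waiting, including consumer lag
  arrivalRatePerSec: number; // observed inbound rate
}

// perReplicaDrainRate: messages one replica processes per second.
function desiredReplicas(m: QueueMetrics, perReplicaDrainRate: number): number {
  const backlogBurnSec = 60; // target: clear any backlog within a minute
  // Provision for the incoming rate plus enough headroom to burn
  // down the existing backlog before it turns into user-visible delay.
  const neededRatePerSec = m.arrivalRatePerSec + m.depth / backlogBurnSec;
  return Math.max(1, Math.ceil(neededRatePerSec / perReplicaDrainRate));
}

// Example: 12,000 queued messages arriving at 800/sec, with each
// replica draining 200/sec -> ceil((800 + 200) / 200) = 5 replicas.
```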

Scaling became proactive, not reactive. The platform absorbs sudden surges without degradation.


The Tax You Don't Notice Until It's Costing Millions


We initially focused on optimising application logic, assuming it was the primary source of latency. Deeper profiling told a different story: a significant portion of latency came from repeated TLS handshakes, connection setup overhead, and payload inefficiencies.

At a small scale these costs were negligible. At 200 million requests per day, they were dominant.

So instead of patching application logic, we eliminated unnecessary overhead first:

  • HTTP keep-alive cut TLS handshake cost from hundreds of milliseconds to near-zero for steady-state traffic.

  • DNS caching removed repeated resolution overhead.

  • In-memory LRU caching reduced database lookups for frequently accessed data — microsecond hits instead of millisecond round-trips.

  • GZIP compression shrank payload sizes, cutting network latency on large template payloads.

  • A CDN handled static assets, improving global response times across 68 countries while reducing origin load.
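
As an illustration of the keep-alive change, here is what connection reuse looks like with Node's built-in https agent. The socket limits are placeholders; a production HTTP client such as undici or axios carries the same configuration.

```typescript
import https from "https";

// keepAlive: true reuses sockets across requests, so one TLS
// handshake is amortised over thousands of calls instead of being
// paid on every request.
const agent = new https.Agent({
  keepAlive: true,
  maxSockets: 100,    // bound concurrent connections per host
  maxFreeSockets: 20, // keep warm sockets available between bursts
});

function get(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    https
      .get(url, { agent }, (res) => {
        let body = "";
        res.on("data", (chunk) => (body += chunk));
        res.on("end", () => resolve(body));
      })
      .on("error", reject);
  });
}
```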

The Principle That Emerged

Always eliminate work that doesn't need to happen before optimising the work that does. A cache hit beats any tuned query. A reused connection beats any tuned TLS stack.


Three Failures That Only Exist at Scale


At 200M+ API requests per day, low-probability failures become frequent. Three kept finding us:

1. Garbage collection storms

Services handling high concurrent message volumes hit GC pressure during burst periods. Pause storms caused latency spikes that cascaded into upstream timeouts. We profiled heap allocation patterns, tuned GC parameters, and introduced concurrency limits per worker. For select high-throughput workloads, we're evaluating lower-level runtimes with more predictable memory behaviour under sustained load.
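
The concurrency limit is conceptually a counting semaphore around each worker's handler: bounding in-flight work bounds heap growth, which is what keeps GC pauses predictable during bursts. A minimal sketch with an illustrative cap:

```typescript
// Counting semaphore: cap in-flight jobs per worker so burst traffic
// can't balloon heap allocation and trigger GC pause storms.
class Semaphore {
  private waiters: Array<() => void> = [];
  constructor(private permits: number) {}

  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the permit directly to a waiter
    else this.permits++;
  }
}

const inFlight = new Semaphore(64); // illustrative per-worker cap

async function handle(job: () => Promise<void>): Promise<void> {
  await inFlight.acquire();
  try {
    await job();
  } finally {
    inFlight.release();
  }
}
```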

2. Network bandwidth saturation

Moving hundreds of millions of messages means moving a lot of bytes. Network bandwidth became a real constraint — often before CPU did, which surprised us the first time. We added client-side rate limiting to shed load gracefully with backpressure rather than silently hitting a hard wall. Alerting was tuned to give actionable warning well before customer impact.

3. Debugging distributed failures

Reconstructing timelines from fragmented logs across dozens of services became impossible by hand. We instrumented every service with OpenTelemetry distributed traces, set CloudWatch alarms below hard limits, and introduced correlation IDs on every request — so we can reconstruct the full execution path of any failed request from a single log line, even on a day with hundreds of millions of them.
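
Correlation IDs are cheap to add at the edge. A sketch as Express-style middleware; the header name and log shape are illustrative assumptions:

```typescript
import { randomUUID } from "crypto";
import type { Request, Response, NextFunction } from "express";

const HEADER = "x-correlation-id"; // illustrative header name

// Reuse the caller's ID if one arrived, otherwise mint one at the
// edge. Downstream calls forward the same header, and every log line
// carries the ID, so one grep reconstructs a request's full path.
export function correlationId(req: Request, res: Response, next: NextFunction): void {
  const id = (req.headers[HEADER] as string | undefined) ?? randomUUID();
  res.setHeader(HEADER, id);
  res.locals.correlationId = id;
  console.log(JSON.stringify({ correlationId: id, path: req.path, event: "request.start" }));
  next();
}
```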


210,000 Businesses Sharing One Platform Without Knowing It


Without isolation, high traffic from one tenant degrades others. This is the classic noisy-neighbour problem — and with 210,000 businesses launching broadcast campaigns on different schedules, it's a daily operational reality, not a theoretical concern.

To address it, we layered multiple isolation strategies:

  • Tenant-level rate limiting prevents any single business from consuming disproportionate platform capacity.

  • Logical data isolation segregates tenant data at the data layer, not just at the application layer.

  • Strict access control is enforced at every service boundary.

  • Pipeline isolation in the messaging engine ensures no shared mutable state between tenant workflows.
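
One way to enforce logical isolation at the data layer is a tenant-scoped wrapper that makes the tenant predicate impossible to omit. A minimal sketch with illustrative types:

```typescript
// Generic store interface; the concrete implementation would sit on
// top of the database driver.
interface Store<T> {
  find(filter: Record<string, unknown>): Promise<T[]>;
  insert(doc: T): Promise<void>;
}

// Every read and write goes through a mandatory tenantId predicate,
// so no code path can touch another tenant's data even by accident.
class TenantScopedStore<T extends { tenantId: string }> implements Store<T> {
  constructor(private inner: Store<T>, private tenantId: string) {}

  find(filter: Record<string, unknown>): Promise<T[]> {
    // Merge the tenant predicate last so callers cannot override it.
    return this.inner.find({ ...filter, tenantId: this.tenantId });
  }

  insert(doc: T): Promise<void> {
    if (doc.tenantId !== this.tenantId) {
      return Promise.reject(new Error("cross-tenant write rejected"));
    }
    return this.inner.insert(doc);
  }
}
```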

The outcome: a business firing a 1-million-message campaign does not degrade delivery performance for the 209,999 businesses running alongside it.


Principles That Emerged


We've made hundreds of architecture decisions over the years. Most of the good ones trace back to a small set of principles we return to whenever we design something new.

  1. Design for failure: Assume dependencies will be slow, unavailable, or wrong.

  2. Async by default: If the user doesn't need the result in the HTTP response, it belongs in a queue.

  3. Single-responsibility pipelines: One job each, no shared mutable state. Failures stay localised.

  4. Idempotency everywhere: Any operation that can be retried must be safe to retry.

  5. Feedback-driven systems: Rate control, autoscaling, and retries use real-time signals — not fixed thresholds.

  6. Right tool for the job: MongoDB for transactional, columnar engines for analytics, Lambda for burst, S3/Parquet for archival.

What We're Building Next


We began in 2020 with 20–30 education clients during Covid.

Today we power WhatsApp marketing and engagement for 210,000 businesses across 68+ countries, processing 200M+ API requests a day and 6.5B+ messages a year.

The next horizon: 500 million API requests per day, AI-powered conversational agents, video message support, and multi-channel orchestration. The challenges ahead will be genuinely different. The stakes will be higher.

The principles in this post don't change. The specific implementations always will. The system we have today is the product of hundreds of iterations — most of them triggered by something that broke under real load, not by anything we read in a paper or saw in an architecture diagram.

If you've ever debugged a cascading failure at 3 a.m., optimised a full-table-scan query, or watched a well-intentioned retry mechanism take down a platform — and you came back the next day wanting to do it better — we'd like to talk.
