From a free Heroku dyno to the distributed infrastructure powering WhatsApp for 210,000 businesses across 68 countries
— the complete story, unfiltered.
In India, WhatsApp is opened an average of 30 times a day. That's more than twice every waking hour.
But how are businesses around the world utilizing it to make revenue?
This is where AiSensy, an official Meta Business Solution Provider (BSP), comes in.
When our WhatsApp Marketing and Engagement platform carries messages between businesses and their users, a 10-minute outage isn't just a metric on a dashboard.
It's immediately visible to every user in the app they open most often. That's the challenge we engineer against every day.
- Delhi Transport Corporation: 1.4 million ticket bookings enabled through WhatsApp
- PhysicsWallah: 3x more student leads via Click-to-WhatsApp campaigns
- Skullcandy: 25-40% of abandoned carts recovered, 150x ROI
At this scale, even small inefficiencies ripple outward. Systems that once worked reliably begin to fail under pressure.
This isn't just a story about a single architectural breakthrough. Rather, it is a sequence of engineering decisions made under real production constraints where every choice had immediate consequences.
Table of Contents:
How It All Started: One Server, One Codebase
The Day More Hardware Made Things Worse
The Database Bottleneck That Became Our 3 am Wake-Up Call
When Our Safety Net Became the Fire
The Engine Running 1.5 Billion Messages That No One Sees
Why We Were Always Late to Our Own Emergencies
The Tax You Don't Notice Until It's Costing Millions
Three Failures That Only Exist at Scale
210,000 Businesses Sharing One Platform Without Knowing It
How It All Started: One Server, One Codebase
AiSensy started in 2020 as a chatbot product for education. When Covid arrived, the market shifted fast, and businesses needed WhatsApp at a scale nobody had anticipated.
By late 2020, we had a beta with 20–30 clients, and today that number has grown to 210,000.
The infrastructure we started with was a single free Heroku dyno. It was one Ruby monolith handling authentication, message dispatching, campaign management, and billing; all in a single process. It carried us to our first paying customers, and that was exactly the right call.
For our engineers, the monolith itself was never the problem; breaking it apart too early would have been.
Here’s how our architecture has evolved with time:
- We started on a Heroku free dyno. One monolith, one codebase, 10,000 requests a day.
- Then, EC2 came next. We moved to the infrastructure we owned and bought time through vertical scaling. That time eventually ran out.
- Breaking the monolith into microservices helped, but we kept a single MongoDB cluster underneath everything. We'd distributed the application layer and centralised the bottleneck.
- Broadcast campaigns forced the async question. We introduced queues, background workers, Lambda for spikes, and Dead Letter Queues for anything that couldn't be dropped. Async became the default, not the exception.
- Today, in the AiSensy platform, reads and writes are separated. Database changes propagate as events. Every service runs in a container sized for what it actually does. Autoscaling responds to queue depth, not CPU, because by the time CPU spikes, you've already kept someone waiting.
The Day More Hardware Made Things Worse
With time, our platform didn't fail abruptly. It degraded gradually, which is significantly harder to debug: nothing completely broken, yet nothing reliably predictable.
At peak load, broadcast campaigns, real-time ticketing, and lead-generation pipelines collided simultaneously, producing latency spikes, database contention, retry amplification, and cascading failures.
During one production incident, retry amplification caused queue congestion that delayed message delivery across multiple workflows.
Increasing hardware solved nothing because the constraint was architectural, not infrastructural. A slowdown in one service propagated across every dependent system.
So we stepped back, stopped treating scale as a throughput problem, and started treating it as a failure-behavior problem.
This required decoupling services so failures stay localised, making asynchronous processing the default (if the user doesn't need the result in the HTTP response, it belongs in a queue), applying controlled retry strategies with exponential backoff and jitter, and designing systems to degrade gracefully under pressure.
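The backoff-with-jitter piece above can be sketched in a few lines. This is a minimal illustration of the full-jitter strategy, not our production retry client:

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter backoff: sleep a random duration in
    [0, min(cap, base * 2**attempt)] so that synchronized
    retries spread out instead of arriving as a thundering herd."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# The ceiling doubles each attempt, but actual delays are randomized:
for attempt in range(5):
    print(f"attempt {attempt}: sleep up to {min(30.0, 0.5 * 2 ** attempt):.1f}s")
```

The cap matters as much as the growth: without it, a few attempts in, clients sleep for minutes and recovery starts to look like an outage of its own.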
As a result, after a series of fixes, the platform became predictable under load. Failures were contained instead of cascading across services.
The Database Bottleneck That Became Our 3 am Wake-Up Call
For years, a single MongoDB instance was the source of truth for everything. Eventually, it became the system's most expensive bottleneck, the thing that woke engineers at 3 am. We fixed it in six layers.
At scale, the problem emerged when MongoDB became the primary bottleneck due to connection pool exhaustion, inefficient write patterns, and analytical queries competing directly with transactional writes.
From an engineering perspective, this felt like threads waiting indefinitely for resources while the system appeared operational but was effectively stalled.
We addressed this in layers, but three changes created the largest impact:
- Query optimization and index hygiene aligned the database with a write-heavy workload, eliminating inefficient patterns such as N+1 queries and redundant indexes
- Connection pooling prevented exhaustion by multiplexing application threads over a controlled set of database connections
- CQRS separated reads from writes, ensuring analytical queries no longer competed with transactional workloads
As the scale increased further, we introduced additional improvements that strengthened the system:
- Change Data Capture enabled asynchronous event propagation without impacting the hot path
- Bulk operations using insertMany and bulkWrite reduced database round-trips significantly
- Partitioning and tiered storage ensured each workload ran on the system best suited for it, with hot data in DocumentDB and cold data in S3 using columnar formats
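The batching idea behind insertMany and bulkWrite is simple to sketch. The chunking helper below is illustrative; in a Python service you would hand each batch to PyMongo's `collection.insert_many(batch, ordered=False)` or the equivalent in your driver:

```python
def chunked(docs, size=1000):
    """Split a stream of documents into fixed-size batches so each
    bulk insert replaces `size` individual round-trips with one."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# 2,500 documents become 3 round-trips instead of 2,500:
batches = list(chunked(({"_id": i} for i in range(2500)), size=1000))
print(len(batches))  # 3
```

With `ordered=False`, the server keeps processing a batch past individual failures, which is usually what you want for idempotent message-log writes.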
Ultimately, we managed to achieve 95% faster database queries along with 40% lower database costs. And that’s how the database evolved from a bottleneck into a scalable backbone.
When Our Safety Net Became the Fire
This chapter is about the moment when your resilience strategy becomes the source of failure.
We had barely finished celebrating 95% faster queries and 40% lower database costs when the next challenge emerged.
In our system architecture, retries created a feedback loop: a slow dependency triggered retries, retries increased load, and increased load further slowed the system. Hence, the system began amplifying its own failure.
During these incidents, dashboards often showed low error rates while latency and queue depth increased.
After some late-night sprints, we decided to apply exponential backoff with jitter to spread retries over time and prevent thundering-herd conditions, circuit breakers to stop calling failing dependencies and allow recovery, and idempotency keys on every request so that retries never create duplicate processing or inconsistent state.
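The circuit-breaker piece can be sketched in a few dozen lines. This is a simplified illustration, assuming a consecutive-failure threshold and a fixed cooldown; a production breaker would track more signals:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after `threshold` consecutive
    failures, reject calls for `cooldown` seconds, then allow a probe."""
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True  # half-open: let one probe through
        return False

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

cb = CircuitBreaker(threshold=3, cooldown=60.0)
for _ in range(3):
    cb.record(ok=False)
print(cb.allow())  # False: circuit is open, the dependency gets room to recover
```

The key property is that an open circuit converts retry pressure into fast failures, breaking the feedback loop described above instead of feeding it.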
And as a result, failures became controlled and isolated instead of cascading. Retry logic became a safety net, not a source of load.
The Engine Running 1.5 Billion Messages That No One Sees
The most critical service in AiSensy's architecture is one most users never see.
Meta's WhatsApp Business API enforces per-WABA sending rate caps based on account quality tier and recent activity.
At scale, this becomes a distributed rate-control problem: 210,000 tenants, each with different limits, and the system must maximize throughput for all of them simultaneously without crossing any boundary. Breach the limits and you get throttled; sustained breaches get accounts flagged.
We built a feedback-driven traffic shaping engine that:
- Tracks per-account rate limits in real time using signals from Meta APIs
- Dispatches messages at the maximum compliant throughput without triggering throttling
- Buffers upstream bursts and smooths them into permitted downstream traffic
- Batches requests through coalescing to reduce API overhead
- Enforces idempotency so retries never create duplicate delivery
- Isolates pipelines to prevent cross-tenant failure propagation
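The per-account rate control at the heart of this engine is, in essence, a token bucket per WABA. Here is a deterministic sketch; the rate and capacity numbers are hypothetical, since real limits come from Meta's quality-tier signals:

```python
class TokenBucket:
    """Per-tenant rate limiter sketch: tokens refill continuously at
    `rate` per second up to `capacity`; a message is dispatched only
    when a token is available, otherwise it stays buffered upstream."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), 0.0

    def try_send(self, now):
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # hold the message; retry after the next refill

bucket = TokenBucket(rate=10, capacity=10)  # hypothetical 10 msg/s tier
sent = sum(bucket.try_send(now=0.0) for _ in range(15))
print(sent)  # 10: the burst above the cap is smoothed, not dropped
```

Messages that don't get a token aren't dropped; they wait in the upstream buffer, which is exactly the "smooth bursts into permitted traffic" behavior the list above describes.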
This is how our engineers ensured zero WhatsApp account suspensions across 210,000 businesses while reliably delivering 1.5 billion messages every month.
Why We Were Always Late to Our Own Emergencies
WhatsApp broadcast campaigns generate near-instantaneous traffic surges. There is no gradual ramp.
It took us longer than it should have to realize that CPU-based autoscaling is too slow for this traffic profile. By the time a CPU spike registers, breaches a threshold, triggers a scale-out event, and the new instance warms up, the burst has already caused degradation. We were scaling in reaction to damage, not in anticipation of load.
To solve this, we had to move away from infrastructure-triggered scaling and build workload-aware scaling mechanisms.
All services are containerised with Docker and orchestrated with Kubernetes, enabling right-sized instance selection per service. A stateless webhook handler doesn't need the same instance family as a CPU-intensive message formatter.
We moved to event-driven autoscaling using the sidecar pattern: a lightweight agent runs alongside each service and monitors queue depth, message arrival rate, and consumer lag.
When queue depth starts climbing, before CPU spikes, the sidecar triggers a scale-out event. For scheduled campaigns, capacity is pre-warmed before the broadcast fires.
Scaling became proactive, not reactive. The platform absorbs sudden surges without degradation.
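The scaling decision itself reduces to arithmetic on queue signals. A sketch of queue-driven sizing; the drain target and replica clamps here are illustrative, not our actual configuration:

```python
import math

def desired_replicas(queue_depth, arrival_rate, per_worker_rate,
                     min_r=2, max_r=50, drain_target_s=30):
    """Size the worker pool to absorb incoming traffic AND drain
    the existing backlog within `drain_target_s` seconds."""
    needed = (queue_depth / drain_target_s + arrival_rate) / per_worker_rate
    return max(min_r, min(max_r, math.ceil(needed)))

# 12,000 queued messages, 400 msg/s arriving, 50 msg/s per worker:
print(desired_replicas(12_000, 400, 50))  # 16
```

The important difference from CPU-based scaling is the input: queue depth and arrival rate are leading indicators of load, while CPU is a lagging one.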
The Tax You Don't Notice Until It's Costing Millions
At a small scale, these costs were negligible. At 200 million requests per day, they were dominant.
We initially focused on optimizing application logic, assuming it was the primary source of latency.
However, deeper profiling revealed that a significant portion of latency came from repeated TLS handshakes, connection setup overhead, and payload inefficiencies.
We eliminated unnecessary overhead before optimizing core logic:
- HTTP keep-alive reduced TLS handshake cost from hundreds of milliseconds to near zero for steady-state traffic
- DNS caching removed repeated resolution overhead
- In-memory LRU caching reduced database lookups for frequently accessed data
- GZIP compression reduced payload size and network latency
- A CDN handled static assets, improving global response times while reducing origin load
This resulted in significant latency reduction without additional infrastructure cost.
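Two of these changes are nearly one-liners in most stacks. In Python, for example, a shared `requests.Session()` gives you connection reuse (keep-alive) across calls, and `functools.lru_cache` gives an in-memory LRU in front of hot lookups. A sketch of the caching half, where the tenant-config lookup is a stand-in for a real database read:

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=10_000)
def tenant_config(tenant_id):
    """Stand-in for a database read of frequently accessed tenant config."""
    calls["count"] += 1
    return {"tenant": tenant_id, "rate_limit": 10}

for _ in range(1_000):
    tenant_config("tenant-42")  # 999 of these never touch the database
print(calls["count"])  # 1
```

The caveat with any in-memory cache is invalidation: it only works for data that tolerates brief staleness or is updated through a path that can evict the entry.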
Therefore, always eliminate work that doesn't need to happen before optimizing work that does.
Three Failures That Only Exist at Scale
At 200M+ API calls per day, low-probability failures became frequent.
We observed three recurring issues:
- GC pauses during burst traffic introduced latency spikes
- Network bandwidth saturated before CPU limits were reached
- Debugging distributed failures required reconstructing timelines across fragmented logs
And, we addressed this by:
- Tuning memory allocation and GC behavior to reduce pause times
- Introducing concurrency limits and backpressure to prevent overload
- Applying client-side rate limiting to avoid bandwidth saturation
- Implementing full OpenTelemetry tracing with correlation IDs across services
As a result, latency spikes caused by GC were eliminated, bandwidth issues became observable before impact, and failures became traceable end-to-end within minutes.
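The concurrency-limit-plus-backpressure piece can be sketched with a bounded semaphore. This is a fail-fast illustration; a production system would usually shed the rejected work into a retry queue rather than returning None:

```python
import threading

class ConcurrencyLimiter:
    """Fail-fast backpressure: when every slot is busy, new work is
    rejected immediately instead of queueing without bound and
    saturating memory or network bandwidth."""
    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def submit(self, fn, *args):
        if not self._slots.acquire(blocking=False):
            return None  # shed load; the caller retries with backoff
        try:
            return fn(*args)
        finally:
            self._slots.release()

limiter = ConcurrencyLimiter(limit=2)
print(limiter.submit(lambda: "ok"))  # a free slot existed, so the work ran
```

Rejecting early is the point: an explicit, observable failure at admission time is far cheaper than an implicit failure later as saturated bandwidth or a GC death spiral.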
210,000 Businesses Sharing One Platform Without Knowing It
210,000 businesses on one platform, each with different traffic patterns.
The problem is that, without isolation, high traffic from one tenant degrades others: the classic noisy-neighbour problem. With 210,000 businesses firing broadcast campaigns on different schedules, this isn't a theoretical concern. It's a daily operational reality.
To address this, we implemented tenant-level rate limiting to prevent any single business from consuming disproportionate platform capacity. Logical data isolation segregates tenant data at the data layer, not just at the application layer. Strict access control and data-protection policies are enforced at every service boundary. Pipeline isolation in the messaging engine ensures no shared mutable state between tenant workflows.
This is how we ensured that workloads remain isolated. A business firing a 1-million-message campaign does not degrade delivery performance for the 209,999 businesses running alongside it.
Principles That Emerged
Over time, a few principles consistently guided our decisions:
- Design for failure: Assume dependencies will be slow, unavailable, or wrong
- Async by default: If the user doesn't need it in the HTTP response, it belongs in a queue.
- Single-responsibility pipelines: One thing, no shared mutable state, failures stay localised.
- Idempotency everywhere: Any operation that can be retried must be safe to retry.
- Feedback-driven systems: Rate control, autoscaling, and retry strategies use real-time signals, not fixed thresholds.
- Right tool for the job: DocumentDB for transactional workloads, columnar engines for analytics, Lambda for burst, S3/Parquet for archival.
- Observe before you optimise: Every meaningful improvement started with a distributed trace, a slow-query log, or a heap profile.
What We're Building Next
We started in 2020 with 20-30 clients, helping educators attract, engage, and enroll students easily.
Today, the AiSensy platform handles WhatsApp marketing and engagement for 210,000 businesses across 68+ countries, from global brands to small businesses running entirely on WhatsApp without a website.
Furthermore, we’re aiming for 500 million API requests per day while expanding into AI‑powered conversational agents, video message support, and multi‑channel orchestration. The challenges will be genuinely different, and the stakes will undoubtedly be even higher.
If you're an engineer who has debugged a cascading failure at 3 a.m., optimized a full-table-scan query, or watched a well‑intentioned retry mechanism take down an entire platform, then you know these lessons aren’t learned from architecture diagrams.
They're learned from production incidents and the humbling experience of watching your assumptions break under real load.
That is the kind of spirit every engineer at AiSensy lives by.
We value engineers who learn continuously, question assumptions, and prefer solving real problems over following playbooks.
If you're willing to take on these problems across backend systems, distributed infrastructure, machine learning, and data platforms, then this is the right place. We're hiring across roles in software engineering!




