How would you design a system to send 100,000 notifications?
Designing a Notification System That Actually Scales
Every modern product, whether it’s a banking app, a ride-sharing platform, or a social network, depends on one invisible yet powerful system: notifications.
They’re the tiny nudges that keep your product alive and your users informed.
A push when your cab is 2 minutes away
An SMS with your OTP
An email confirming your order
Seems simple enough, right? Just send a message.
But here’s where the trap lies 👇
At small scale, everything feels easy. At scale, everything breaks.
For a prototype, you can literally loop over all users and call your provider’s API. Works fine for a few hundred users.
But when you need to send 100,000+ messages, “just send” quickly turns into “why is nothing working?”
Let’s see why.
Why This Problem Is Deceptively Hard
Imagine a banking app where 100,000 transactions fail due to a temporary outage.
You need to alert all affected users, and fast.
What looks like a simple “send notifications” job now becomes a distributed system problem.
You’re suddenly fighting on multiple fronts:
Throughput - Can you send thousands per second, not per hour?
Reliability - What if FCM or SMTP is slow or returns 5xx errors?
Idempotency - If your service crashes mid-way, can you retry safely without duplicates?
Scalability - Will it still hold if you jump from 100k → 1M tomorrow?
Observability - How do you know what succeeded, failed, or got delayed?
These are the five pillars every resilient notification system must address.
First Attempts (and Why They Fail)
Approach 1: Loop and Send Directly
Idea: Just iterate through all users and send via Email/SMS/Push provider.
    # naive approach: one blocking provider call per user
    for user in users:
        sendNotification(user)   # each call can take ~300 ms
Why it feels right:
Simple. Easy to code. Works perfectly in dev.
Why it fails:
Each call might take ~300ms.
100k × 300ms = 30,000 seconds → 8+ hours!
You can try threading, but you’ll hit:
Provider rate limits
Connection errors
Duplicate sends on retries
If your process crashes mid-way, you’ve no clue who got what.
Lesson: Code simplicity ≠ system simplicity. Throughput and fault tolerance don’t come free.
Approach 2: Use Database as a Queue
Idea: Insert all notifications in a table:
    INSERT INTO notifications (user_id, status)
    VALUES (user_id, 'PENDING');

A worker picks up a PENDING row, sends it, then marks it SENT.
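To make that concrete, here is a minimal sketch of such a worker, assuming PostgreSQL via psycopg2, an id primary key on the table, and the same sendNotification stub as above:

    import time
    import psycopg2

    CLAIM_SQL = """
        UPDATE notifications
        SET status = 'SENDING'
        WHERE id IN (
            SELECT id FROM notifications
            WHERE status = 'PENDING'
            LIMIT 100
            FOR UPDATE SKIP LOCKED  -- keep two workers from claiming the same rows
        )
        RETURNING id, user_id;
    """

    def run_worker(conn):
        while True:
            with conn, conn.cursor() as cur:   # commits the claim transaction
                cur.execute(CLAIM_SQL)
                rows = cur.fetchall()
            if not rows:
                time.sleep(1)                  # nothing pending; poll again
                continue
            for notification_id, user_id in rows:
                sendNotification(user_id)      # provider call, ~300 ms each
                with conn, conn.cursor() as cur:
                    cur.execute(
                        "UPDATE notifications SET status = 'SENT' WHERE id = %s",
                        (notification_id,),
                    )

Every worker is now hammering the same hot table, which is exactly where the contention below comes from.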
Why it feels right:
SQL is reliable. Transactions give confidence. You can count(*) to see progress.
Why it fails:
At 100k+, your DB becomes the bottleneck:
Hot rows & lock contention
Slow vacuuming
Replication lag
DBs are great at state, not streaming events.
You’re forcing it to do something it wasn’t built for.
Lesson: Don’t turn your source of truth into a message queue.
Approach 3: Cron Job Batches
Idea: Run a job every minute that sends pending messages.
Why it feels right:
Scheduled jobs = easy mental model. “Send everything due now.”
Why it fails:
At the top of the minute → a burst of 100k requests.
Providers respond with 429 (rate-limit) or block your IP.
Your “real-time” notification now has up to 59s latency.
Lesson: Batching helps organization, not elasticity.
The Scalable Approach
To scale gracefully, separate concerns:
“Deciding what to send” ≠ “Actually sending it”
Think of it as an assembly line:
A producer decides what needs sending
A queue manages the flow
Workers deliver at safe speed
Metrics keep you informed
Let’s break it down.
1. Producer Service (Brain)
Figures out who to notify (recipients)
Prepares personalized payloads (“Hello, Tom 👋”)
Publishes one message per user into a queue
It’s stateless: you can run multiple producers in parallel.
It doesn’t care when it’s delivered — just that it’s enqueued.
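For illustration, a minimal producer sketch, assuming the queue is SQS accessed through boto3; the queue URL, payload fields, and channel value are placeholders rather than a fixed schema:

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/notifications"  # placeholder

    def enqueue_campaign(campaign_id, users):
        """Expand recipients and publish one message per user, in batches of 10."""
        batch = []
        for user in users:
            batch.append({
                "Id": str(user["id"]),
                "MessageBody": json.dumps({
                    "campaign_id": campaign_id,
                    "user_id": user["id"],
                    "channel": "push",
                    "payload": f"Hello, {user['name']} 👋",
                }),
            })
            if len(batch) == 10:               # SQS caps a batch at 10 messages
                sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)
                batch = []
        if batch:
            sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)

Because it only enqueues, you can shard the user list and run several producers side by side.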
2. Queue or Stream (Buffer)
Use something like SQS, RabbitMQ, Kafka, Pub/Sub.
Why?
Absorbs bursts — workers can pull at their own pace
Handles retries gracefully
Offers DLQ (Dead Letter Queue) for poison messages
Enables backpressure — if providers slow down, queue length grows, not your CPU usage
Think of it as a shock absorber in your system.
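As a concrete example of that wiring, here is a sketch of provisioning the queue pair on SQS with boto3; the names, timeout, and receive count are illustrative:

    import json
    import boto3

    sqs = boto3.client("sqs")

    # Create the dead-letter queue first so the main queue can point at it
    dlq_url = sqs.create_queue(QueueName="notifications-dlq")["QueueUrl"]
    dlq_arn = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # Main queue: after 5 failed receives, SQS moves the message to the DLQ
    sqs.create_queue(
        QueueName="notifications",
        Attributes={
            "VisibilityTimeout": "60",  # a worker gets 60s before redelivery
            "RedrivePolicy": json.dumps({
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": "5",
            }),
        },
    )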
3. Delivery Workers (Hands)
Each channel (push, email, SMS) has its own workers.
They:
Pull messages at a controlled rate
Enforce per-provider rate limits (avoid bans)
Use idempotency keys (campaign_id + user_id + channel)
Retry transient failures (e.g. network blips) with exponential backoff
Send unrecoverable ones to DLQ for manual retry/inspection
Parallelize with more workers. They can scale horizontally.
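Putting those responsibilities together, a minimal worker sketch, assuming SQS for the queue and Redis for idempotency keys; the rate limit, retry count, key TTL, and provider call are placeholders:

    import json
    import time
    import boto3
    import redis

    sqs = boto3.client("sqs")
    dedupe = redis.Redis()
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/notifications"  # placeholder
    MAX_RATE_PER_SEC = 35                      # stay under the provider's limit

    def send_to_provider(body):
        """Placeholder for the actual FCM/SMTP/SMS call."""
        ...

    def run_worker():
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
            )
            for msg in resp.get("Messages", []):
                body = json.loads(msg["Body"])
                key = f'{body["campaign_id"]}:{body["user_id"]}:{body["channel"]}'

                if not dedupe.get(key):            # idempotency: skip if already sent
                    for attempt in range(3):
                        try:
                            send_to_provider(body)
                            dedupe.set(key, 1, ex=86400)   # remember the send for 24h
                            break
                        except Exception:
                            time.sleep(2 ** attempt)       # exponential backoff
                    else:
                        continue   # retries exhausted; leave it for redelivery / DLQ

                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
                time.sleep(1 / MAX_RATE_PER_SEC)   # crude per-worker rate limit

Messages that keep failing reappear after the visibility timeout and, once maxReceiveCount is exceeded, land in the DLQ configured earlier.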
4. Observability (Eyes)
You can’t fix what you can’t see.
Track:
Metrics: enqueue/sent/failed/sec, retries, p95 latency
Logs: idempotency key + provider response codes
Alerts: spikes in DLQ, latency breaches, 5xx rates
Use tools like Prometheus + Grafana or Datadog.
Even a simple dashboard helps detect silent failures early.
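One lightweight way to get those numbers, sketched with prometheus_client; the metric names and port are illustrative:

    from prometheus_client import Counter, Histogram, start_http_server

    SENT = Counter("notifications_sent_total", "Delivered notifications", ["channel"])
    FAILED = Counter("notifications_failed_total", "Notifications that exhausted retries", ["channel"])
    LATENCY = Histogram("notification_send_seconds", "Provider call latency in seconds")

    def instrumented_send(body, send_to_provider):
        """Wrap the provider call so every attempt is timed and counted."""
        try:
            with LATENCY.time():
                send_to_provider(body)
            SENT.labels(channel=body["channel"]).inc()
        except Exception:
            FAILED.labels(channel=body["channel"]).inc()
            raise

    start_http_server(9100)   # expose /metrics for Prometheus to scrape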
Why This Architecture Works
Throughput:
50 workers × 35 req/s = ~1,750 req/s → 100k in ~1 minute
Reliability:
Queue buffers bursts, retries are idempotent.
Scalability:
Add workers, partition queues, scale independently.
Correctness:
Idempotency prevents duplicates, DLQ captures bad events.
Visibility:
Metrics & alerts show real-time health.
This is first-principles engineering:
Decouple producers & consumers
Isolate failures
Scale horizontally
Design for retries & observability
Interview Pivot (How to Explain It)
If asked in an interview 👇
“How would you design a system to send 100,000 notifications?”
You say:
“Naively, I’d loop through all users and call the provider directly, but that couples expansion and delivery, and it breaks down under retries or rate limits.
Instead, I’d split concerns:
A producer expands recipients and enqueues messages.
Workers consume from the queue at a safe pace, enforce per-provider rate limits, use idempotency keys to prevent duplicates, and push failed messages to a DLQ.
That gives me high throughput, safety, and visibility: 100k notifications in minutes, and it scales to millions.”
That’s not just an answer; it’s a design mindset.
What’s Next?
We’ve laid the foundation - a robust, scalable notification system.
But there’s more to explore if you really want to think like an architect:
How do we ensure each notification is sent exactly once?
How do we handle sudden spikes, like 1 million events per second?
What if iOS consumers stop working and fail to receive notifications?
How do we alert engineers automatically when something breaks?
We’ll tackle these real-world challenges one by one in upcoming articles. Stay tuned.