How would you design a system to send 100,000 notifications?
Designing a Notification System That Actually Scales
Every modern product, whether it's a banking app, a ride-sharing platform, or a social network, depends on one invisible yet powerful system: notifications.
They’re the tiny nudges that keep your product alive and your users informed.
A push when your cab is 2 minutes away
An SMS with your OTP
An email confirming your order
Seems simple enough, right? Just send a message.
But here’s where the trap lies 👇
At small scale, everything feels easy. At scale, everything breaks.
For a prototype, you can literally loop over all users and call your provider’s API. Works fine for a few hundred users.
But when you need to send 100,000+ messages, “just send” quickly turns into “why is nothing working?”
Let’s see why.
Why This Problem Is Deceptively Hard
Imagine a banking app where 100,000 transactions fail due to a temporary outage.
You need to alert all affected users, and fast.
What looks like a simple “send notifications” job now becomes a distributed system problem.
You’re suddenly fighting on multiple fronts:
Throughput - Can you send thousands per second, not per hour?
Reliability - What if FCM or SMTP is slow or returns 5xx errors?
Idempotency - If your service crashes mid-way, can you retry safely without duplicates?
Scalability - Will it still hold if you jump from 100k → 1M tomorrow?
Observability - How do you know what succeeded, failed, or got delayed?
These five are the pillars every resilient notification system must solve.
First Attempts (and Why They Fail)
Approach 1: Loop and Send Directly
Idea: Just iterate through all users and send via Email/SMS/Push provider.
for user in users:
    sendNotification(user)

Why it feels right:
Simple. Easy to code. Works perfectly in dev.
Why it fails:
Each call might take ~300ms.
100k × 300ms = 30,000 seconds → 8+ hours!
You can try threading, but you’ll hit:
Provider rate limits
Connection errors
Duplicate sends on retries
If your process crashes midway, you have no record of who received what.
Lesson: Code simplicity ≠ system simplicity. Throughput and fault tolerance don’t come free.
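The back-of-the-envelope math above is easy to verify (assuming the ~300 ms per call figure from the text):

```python
# Back-of-envelope: why the sequential loop can't keep up.
# Assumptions: 100k users, ~300 ms per provider call (figures from the text).
users = 100_000
latency_s = 0.3

total_seconds = users * latency_s        # 30,000 s
total_hours = total_seconds / 3600       # ≈ 8.3 hours
print(f"{total_seconds:,.0f} s ≈ {total_hours:.1f} hours")
```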
Approach 2: Use Database as a Queue
Idea: Insert all notifications in a table:
INSERT INTO notifications (user_id, status)
VALUES (user_id, 'PENDING');

A worker picks up PENDING, sends it, then marks SENT.
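A minimal sketch of that worker loop, using sqlite3 purely for illustration. On a real Postgres setup you'd also need SELECT ... FOR UPDATE SKIP LOCKED so two workers don't claim the same rows, which is exactly where the lock contention comes from:

```python
import sqlite3

# Toy DB-as-queue worker (sqlite3 stands in for a real database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notifications (user_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO notifications VALUES (?, 'PENDING')", [(i,) for i in range(3)]
)

def claim_and_send(batch: int = 100) -> int:
    # Claim a batch of pending rows, "send", then mark them SENT.
    rows = conn.execute(
        "SELECT rowid, user_id FROM notifications "
        "WHERE status = 'PENDING' LIMIT ?", (batch,)
    ).fetchall()
    for rowid, user_id in rows:
        # sendNotification(user_id) would go here
        conn.execute(
            "UPDATE notifications SET status = 'SENT' WHERE rowid = ?", (rowid,)
        )
    return len(rows)
```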
Why it feels right:
SQL is reliable. Transactions give confidence. You can count(*) to see progress.
Why it fails:
At 100k+, your DB becomes the bottleneck:
Hot rows & lock contention
Slow vacuuming
Replication lag
DBs are great at state, not streaming events.
You’re forcing it to do something it wasn’t built for.
Lesson: Don’t turn your source of truth into a message queue.
Approach 3: Cron Job Batches
Idea: Run a job every minute that sends pending messages.
Why it feels right:
Scheduled jobs = easy mental model. “Send everything due now.”
Why it fails:
At the top of the minute → a burst of 100k requests.
Providers respond with 429 (rate-limit) or block your IP.
Your “real-time” notification now has up to 59s latency.
Lesson: Batching helps organization, not elasticity.
The Scalable Approach
To scale gracefully, separate concerns:
“Deciding what to send” ≠ “Actually sending it”
Think of it as an assembly line:
A producer decides what needs sending
A queue manages the flow
Workers deliver at safe speed
Metrics keep you informed
Let’s break it down.
1. Producer Service (Brain)
Figures out who to notify (recipients)
Prepares personalized payloads (Hello, Tom 👋)
Publishes one message per user into a queue
It’s stateless: you can run multiple producers in parallel.
It doesn’t care when it’s delivered — just that it’s enqueued.
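A minimal producer sketch along these lines; publish and outbox are stand-ins for a real SQS/Kafka/RabbitMQ client, and the field names are illustrative:

```python
import json
import uuid

outbox: list[str] = []            # pretend queue for the example

def publish(message: str) -> None:
    # Stand-in for sqs.send_message / kafka producer.send etc.
    outbox.append(message)

def produce(campaign_id: str, users: list[dict]) -> int:
    # Expand recipients: one personalized message per user, then enqueue.
    for user in users:
        payload = {
            "message_id": str(uuid.uuid4()),   # unique per enqueue
            "campaign_id": campaign_id,
            "user_id": user["id"],
            "channel": "push",
            "body": f"Hello, {user['name']} 👋",
        }
        publish(json.dumps(payload))
    return len(outbox)
```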
2. Queue or Stream (Buffer)
Use something like SQS, RabbitMQ, Kafka, Pub/Sub.
Why?
Absorbs bursts — workers can pull at their own pace
Handles retries gracefully
Offers DLQ (Dead Letter Queue) for poison messages
Enables backpressure — if providers slow down, queue length grows, not your CPU usage
Think of it as a shock absorber in your system.
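One way to picture the shock-absorber idea, using Python's bounded in-memory queue as a crude stand-in for SQS/Kafka semantics: bursts pile up as queue depth, and a full queue is the backpressure signal:

```python
import queue

# Bounded buffer: producers never spin the CPU waiting on slow consumers;
# overload shows up as queue depth (or a Full rejection) instead.
buf: "queue.Queue[str]" = queue.Queue(maxsize=1000)

def enqueue(msg: str) -> bool:
    try:
        buf.put_nowait(msg)        # burst absorbed without blocking
        return True
    except queue.Full:             # backpressure: signal the caller to slow down
        return False
```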
3. Delivery Workers (Hands)
Each channel (push, email, SMS) has its own workers.
They:
Pull messages at a controlled rate
Enforce per-provider rate limits (avoid bans)
Use idempotency keys (campaign_id + user_id + channel)
Retry transient failures (e.g. network blips) with exponential backoff
Send unrecoverable ones to DLQ for manual retry/inspection
Parallelize with more workers. They can scale horizontally.
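A simplified worker sketch under stated assumptions: an in-memory queue, a fake provider call, and a local set standing in for a shared idempotency store (in production that would be Redis or a database):

```python
import hashlib
import queue
import random
import time

RATE_LIMIT_PER_SEC = 35          # assumed per-worker provider budget
MAX_RETRIES = 3

work_q: "queue.Queue[dict]" = queue.Queue()
dlq: list[dict] = []             # dead-letter queue for unrecoverable messages
sent_keys: set[str] = set()      # stand-in for a shared idempotency store

def idempotency_key(msg: dict) -> str:
    # campaign_id + user_id + channel, as described above.
    raw = f"{msg['campaign_id']}:{msg['user_id']}:{msg['channel']}"
    return hashlib.sha256(raw.encode()).hexdigest()

def send_via_provider(msg: dict) -> None:
    # Placeholder for the real FCM/SMTP/SMS call; fails transiently ~10% of the time.
    if random.random() < 0.1:
        raise ConnectionError("transient provider error")

def run_worker() -> None:
    while not work_q.empty():
        msg = work_q.get()
        key = idempotency_key(msg)
        if key in sent_keys:                 # already delivered: skip safely
            continue
        for attempt in range(MAX_RETRIES):
            try:
                send_via_provider(msg)
                sent_keys.add(key)
                break
            except ConnectionError:
                time.sleep((2 ** attempt) * 0.01)   # exponential backoff
        else:
            dlq.append(msg)                  # retries exhausted -> DLQ
        time.sleep(1 / RATE_LIMIT_PER_SEC)   # crude per-worker rate limiting
```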
4. Observability (Eyes)
You can’t fix what you can’t see.
Track:
Metrics: enqueue/sent/failed/sec, retries, p95 latency
Logs: idempotency key + provider response codes
Alerts: spikes in DLQ, latency breaches, 5xx rates
Use tools like Prometheus + Grafana or Datadog.
Even a simple dashboard helps detect silent failures early.
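A toy sketch of what that instrumentation might look like: counters a Prometheus-style scraper would export, plus a structured log line carrying the idempotency key and provider response code (all names here are illustrative):

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("delivery")

metrics = {"sent": 0, "failed": 0}   # counters an exporter would expose

def record_delivery(idempotency_key: str, provider_status: int) -> None:
    outcome = "sent" if provider_status < 400 else "failed"
    metrics[outcome] += 1
    # Structured log: idempotency key + provider response code, so any
    # single message's fate can be traced later.
    log.info(json.dumps({
        "ts": time.time(),
        "key": idempotency_key,
        "status": provider_status,
        "outcome": outcome,
    }))
```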
Why This Architecture Works
Throughput:
50 workers × 35 req/s = ~1,750 req/s → 100k in ~1 minute
Reliability:
Queue buffers bursts, retries are idempotent.
Scalability:
Add workers; partition queues, scale independently.
Correctness:
Idempotency prevents duplicates, DLQ captures bad events.
Visibility:
Metrics & alerts show real-time health.
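The capacity math above, spelled out (assumed figures from the text, not benchmarks):

```python
# 50 workers, each pacing at ~35 req/s under the provider's rate limit.
workers, per_worker_rps = 50, 35
total_rps = workers * per_worker_rps            # 1,750 req/s
seconds_for_100k = 100_000 / total_rps          # ≈ 57 s, i.e. about a minute
print(total_rps, round(seconds_for_100k))
```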
This is first-principles engineering:
Decouple producers & consumers
Isolate failures
Scale horizontally
Design for retries & observability
Interview Pivot (How to Explain It)
If asked in an interview 👇
“How would you design a system to send 100,000 notifications?”
You say:
“Naively, I’d loop through all users and call the provider directly, but that couples expansion and delivery, and breaks under retries or rate limits.
Instead, I’d split concerns:
A producer expands recipients and enqueues messages.
Workers consume from the queue at a safe pace, enforce per-provider rate limits, use idempotency keys to prevent duplicates, and push failed messages to a DLQ.
That gives me high throughput, safety, and visibility: 100k notifications in minutes, and a design that scales to millions.”
That’s not just an answer; that’s a design mindset.
What’s Next?
We’ve laid the foundation - a robust, scalable notification system.
But there’s more to explore if you really want to think like an architect:
How do we ensure each notification is sent exactly once?
How do we handle sudden spikes, like 1 million events per second?
What if iOS consumers stop working and fail to receive notifications?
How do we alert engineers automatically when something breaks?
We’ll tackle these real-world challenges one by one in upcoming articles. Stay tuned.
Related
This article scratched the surface. The real interview goes 10 levels deeper.
How do you handle hot partitions?
What if the cache goes down during a spike?
How do you avoid counting the same view twice?
I’ve written an ebook that prepares you for all of it.
35 real problems. The patterns that solve them. The follow-ups you’ll actually face. The principles behind solving problems at scale, not just the final answers.
Thanks for reading. See you in the next post.

