Synero for Developers

Debug, architect, and build with four AI minds

Technical decisions are too important for a single AI opinion. Synero consults four specialized AI advisors — each analyzing your code, architecture, or design question from a different angle — then synthesizes their insights into one comprehensive answer.

Real Example

See the Council in action

This is a real response from Synero's AI Council — four advisors analyzed the question simultaneously, then their insights were synthesized into one answer.

I need to design a real-time notification system that handles 100K concurrent WebSocket connections. Compare architectural approaches and their trade-offs.
The Architect

Certainly. Designing a real-time notification system capable of supporting 100,000 concurrent WebSocket connections is a significant engineering challenge. Let's systematically analyze the architectural approaches and trade-offs.


1. Key Requirements

  • High Concurrency: 100,000+ persistent WebSocket connections
  • Scalability: Ability to scale horizontally as demand grows
  • Low Latency: Real-time message delivery
  • Reliability: Fault tolerance, connection management
  • Maintainability: Clean codebase, ease of development

2. Architectural Approaches

Appr...

The Philosopher

Real-Time Notification System: Architectural Analysis

Let me work through this systematically, because the right answer depends heavily on constraints you haven't specified — and acknowledging that uncertainty is important.

First, Clarify What "100K Concurrent" Actually Means

Before choosing architecture, these distinctions matter enormously:

100K concurrent connections ≠ 100K messages/second

Questions that change the answer:
├── Message fanout ratio (1:1, 1:many, broadcast?)
├── Message size and frequency
├── Delivery guarantees (at-most-once vs. at-least-once vs. exactly-once)
...
The Explorer

To design a system for 100,000 concurrent whispers in a digital stadium, we must think less like civil engineers and more like mycologists or air traffic controllers. Handling 100K WebSockets isn't just about raw power; it’s about managing the "state of presence" and the "cost of silence."

Here are three architectural archetypes, viewed through the lens of different organizational philosophies.


1. The "Mycelial Network" (The Actor Model approach)

Core Tech: Elixir/Phoenix (The BEAM VM), Akka (JVM), or Proto.actor.

In a forest, mushrooms are just the fruit of a massive, ...

The Maverick

Assumptions First: Scale Isn't Just Connections

100K concurrent WebSockets sounds impressive, but it's meaningless without RPS (notifications/sec), fan-out ratio (users per room/topic), or retention needs. If it's mostly idle connections with bursts, you're golden. If it's Twitter-scale firehose? Rethink. WebSockets aren't magic—SSE or push via service workers often beat them for one-way notifications (lower overhead, HTTP/2 multiplexing). But you asked for WebSockets, so let's dissect architectures. I'll compare four battle-tested ones, ranked by increasing complexity/maturity for your sc...

Synthesis

Council Synthesis

Real-Time Notification System: 100K Concurrent WebSocket Connections

Start With the Right Questions

Before choosing an architecture, three questions fundamentally change your answer:

1. What's your message throughput and fan-out pattern? 100K connections with 100 notifications/sec is a solved problem. 100K connections with 500K notifications/sec broadcasting to large groups is a different beast. The connection count is often not your bottleneck.

2. What are your delivery guarantees?

  • At-most-once (dropped notifications acceptable) → Redis Pub/Sub, simple
  • At-least-once (user must eventually see it) → persistence layer required
  • Exactly-once (financial, critical events) → Kafka + idempotency keys + transactional consumers

3. Is WebSocket actually the right transport? For one-way notifications, Server-Sent Events (SSE) impose less overhead and work over HTTP/2 multiplexing. Reserve WebSockets for bidirectional interaction, and consider a hybrid: SSE for the 95% of users passively receiving, WebSockets only for the 5% actively interacting.
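The at-least-once and exactly-once cases above reduce in practice to redelivery plus idempotent consumption: the producer attaches an idempotency key, and the consumer deduplicates on it. A minimal sketch of that pattern (the `store` interface and `addIfAbsent` name are illustrative, not a specific library API — in production this would be backed by something like Redis `SET NX` with a TTL):

```javascript
// Sketch: at-least-once delivery made effectively exactly-once by
// deduplicating on a producer-assigned idempotency key at the consumer.
class IdempotentConsumer {
  constructor(store, handler) {
    this.store = store;     // shared seen-key store (injected, assumed)
    this.handler = handler; // the actual delivery logic
  }

  async consume(message) {
    // message.id is the idempotency key set by the producer
    const firstTime = await this.store.addIfAbsent(message.id);
    if (!firstTime) return false; // duplicate redelivery — skip
    await this.handler(message);
    return true;
  }
}
```

For a single process, an in-memory `Set` is enough to back `addIfAbsent`; across a fleet, the store must be shared and expire old keys so it doesn't grow without bound.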


The Four Architectural Approaches

Approach 1: Horizontally Scaled WebSocket Servers + Message Broker

This is your baseline recommendation for 100K connections. Deploy a fleet of stateless WebSocket gateway servers behind a load balancer, coordinated via a shared pub/sub layer.

[Clients]
    ↓
[Load Balancer — sticky sessions optional]
    ↓              ↓              ↓
[WS Server 1]  [WS Server 2]  [WS Server 3]
    \               |              /
         [Pub/Sub Broker]
               ↓
    [Backend Services / Message Sources]
         ↓
    [Persistence Store — for missed messages]

Why sticky sessions are a preference, not a requirement: With pub/sub routing, any server can receive a message and publish it; the server holding the target connection delivers it. Sticky sessions reduce broker traffic by increasing local-delivery hits, but they're not architecturally necessary.

The message flow:

  1. A backend service emits a notification
  2. It publishes to a channel (e.g., notify:user:42)
  3. All WebSocket servers subscribing to that channel receive it
  4. The server holding user 42's connection delivers it; others discard
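Steps 3–4 are simple enough to sketch. Here the broker client is injected and `subscribe(pattern, callback)` is an assumed shape — adapt it to whatever your Redis/NATS client actually exposes:

```javascript
// Sketch of pub/sub fan-out: every gateway subscribes to user channels,
// but only the gateway holding the target socket actually delivers.
class Gateway {
  constructor(broker) {
    this.local = new Map(); // userId -> WebSocket held by THIS server
    broker.subscribe('notify:user:*', (channel, payload) => {
      const userId = channel.split(':')[2]; // "notify:user:42" -> "42"
      const ws = this.local.get(userId);
      if (ws) ws.send(payload); // local hit — deliver
      // else: another gateway holds this user's connection; discard silently
    });
  }

  register(userId, ws) { this.local.set(userId, ws); }
  unregister(userId)   { this.local.delete(userId); }
}
```

This is why sticky sessions are optional: correctness only requires that *some* gateway holds the connection and that all gateways see the publish.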

Broker choice matters:

| | Redis Pub/Sub | Redis Streams | NATS/JetStream | Kafka |
| --- | --- | --- | --- | --- |
| Latency | ~1ms | ~2ms | <1ms | 5–15ms |
| Persistence | No | Yes | Yes (JetStream) | Yes |
| Throughput ceiling | ~100K msg/sec | High | Very high | Millions/sec |
| Operational complexity | Low | Low–Medium | Medium | High |
| Right scale | Up to ~10M users | Same + audit | Same + reliability | 10M+ or replay needs |

Concrete recommendation: Start with Redis Pub/Sub. It handles 100K connections comfortably. If you need persistence (missed-message recovery), add a PostgreSQL/DynamoDB store as a separate concern — don't conflate delivery with durability. If throughput exceeds ~100K msg/sec, migrate the broker to NATS JetStream before considering Kafka. Kafka's operational overhead is frequently not worth it unless you specifically need log replay or have a dedicated platform team.


Approach 2: Actor-Based / BEAM Architecture

If your use case involves complex per-connection state — collaborative editing, multiplayer games, rooms with rich logic — the actor model deserves serious consideration.

Elixir/Phoenix Channels run on the BEAM VM, where each WebSocket connection is a lightweight Erlang process (~2KB overhead). A single node can comfortably hold 1M+ connections, and supervision trees provide automatic crash recovery. WhatsApp famously ran 2 million concurrent connections on a single Erlang node.

Trade-offs:

  • ✅ Extraordinary resource efficiency per connection
  • ✅ Fault isolation — one crashing process doesn't affect neighbors
  • ✅ Natural state management for complex per-user logic
  • ❌ Requires Elixir/Erlang expertise — hiring is harder
  • ❌ Still needs a pub/sub layer for cross-cluster global events
  • ❌ Overkill for simple broadcast notifications

When to choose it: You have teams with BEAM expertise, complex stateful interactions per connection, or you're building something like a game server or collaborative editor. For a notification system that's primarily push-and-forget, this is probably over-engineered.


Approach 3: Managed / Cloud-Native Services

Offload the infrastructure entirely to providers like Ably, Pusher, Azure Web PubSub, or AWS API Gateway WebSocket API.

Genuine strengths:

  • Zero operational overhead for connection management and scaling
  • Global edge networks provide low latency internationally
  • Automatic handling of reconnection, fan-out, and failover

Genuine weaknesses:

  • Cost scales badly. AWS API Gateway WebSocket APIs bill per connection-minute plus per-message fees; 100K always-on connections is roughly 4.3 billion connection-minutes per month, so connection charges alone typically reach four figures monthly before a single message is sent. Check current regional pricing — it dominates this decision.
  • Vendor lock-in is real. Custom routing logic, protocol extensions, and unusual delivery semantics become difficult or impossible.
  • Lambda cold starts behind API Gateway introduce 100–500ms latency spikes.

When to choose it: Time-to-market is your primary constraint, you have unpredictable spiky load that would require significant autoscaling engineering, or your team has no infrastructure expertise. Ably's free tier is also excellent for benchmarking and load testing your expected traffic patterns before committing to a self-hosted design.


Approach 4: Monolithic Single-Server

A single WebSocket server — Node.js with ws, Go with gorilla/websocket — can technically handle 100K connections on a tuned machine. This is the right choice for prototyping only.

In production: one process failure drops all 100K connections simultaneously. This is architecturally unacceptable for any real service.


The Problems Everyone Underestimates

These failure modes cause more production incidents than architectural choice.

1. The Thundering Herd on Reconnect

If your server restarts, 100K clients will attempt to reconnect within seconds. This self-inflicted DDoS will take down whatever they're reconnecting to.

Solution — always implement exponential backoff with jitter:

class ReconnectingWebSocket {
  constructor(url) {
    this.url = url;
    this.baseDelay = 1000;
    this.maxDelay = 30000;
    this.attempt = 0;
  }

  nextDelay() {
    // Exponential backoff
    const exp = Math.min(this.baseDelay * 2 ** this.attempt, this.maxDelay);
    // Full jitter — spreads 100K reconnects over the backoff window
    return Math.random() * exp;
  }

  connect() {
    this.ws = new WebSocket(this.url);
    this.ws.onopen = () => { this.attempt = 0; this.resubscribe(); };
    this.ws.onclose = (event) => {
      // 1001 = server going away (graceful close) — reconnect promptly, but
      // still jittered so a rolling restart doesn't dogpile surviving servers
      if (event.code === 1001) { setTimeout(() => this.connect(), Math.random() * this.baseDelay); return; }
      setTimeout(() => { this.attempt++; this.connect(); }, this.nextDelay());
    };
  }
}

Without jitter, even a staggered restart causes a spike. With full jitter, 100K reconnects spread across the backoff window.
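That claim is easy to sanity-check numerically: simulate every client picking a full-jitter delay and measure the busiest one-second bucket. (This is a standalone simulation, not part of the client above.)

```javascript
// Simulate a reconnect storm: nClients all retry with full jitter spread
// uniformly over a windowMs backoff window. Returns the worst 1s bucket.
function peakReconnectsPerSecond(nClients, windowMs) {
  const buckets = new Map(); // second index -> reconnect count
  for (let i = 0; i < nClients; i++) {
    const sec = Math.floor((Math.random() * windowMs) / 1000);
    buckets.set(sec, (buckets.get(sec) || 0) + 1);
  }
  return Math.max(...buckets.values());
}
```

With 100K clients over a 30-second window, the peak second sees on the order of 100000/30 ≈ 3,300 reconnects instead of all 100,000 arriving at once.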

2. Zombie Connections and Heartbeats

A mobile client enters a tunnel. The OS-level TCP connection is gone, but your server still holds the socket open, accumulating phantom connections that consume memory and file descriptors.

Solution: Send a ping every 30 seconds. If no pong returns within 10 seconds, close the connection and clean up the registry entry. Most WebSocket libraries support this natively — enable it.
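The server-side bookkeeping behind that sweep is worth seeing explicitly. A minimal sketch (connection IDs and timings are illustrative; in practice the `ws` library's built-in ping/pong events would drive `onPong`):

```javascript
// Sketch: zombie detection. A connection is dead if it hasn't ponged
// since the last ping broadcast and the pong timeout has elapsed.
class HeartbeatMonitor {
  constructor(timeoutMs = 10_000) {
    this.timeoutMs = timeoutMs;
    this.lastPong = new Map(); // connId -> timestamp of most recent pong
  }

  onPong(connId, now = Date.now()) { this.lastPong.set(connId, now); }

  // Call after each ping broadcast: returns IDs to close and evicts them.
  sweep(pingSentAt, now = Date.now()) {
    const zombies = [];
    for (const [connId, t] of this.lastPong) {
      if (t < pingSentAt && now - pingSentAt >= this.timeoutMs) {
        zombies.push(connId);
        this.lastPong.delete(connId); // clean up the registry entry
      }
    }
    return zombies;
  }
}
```

The caller closes each returned connection and removes it from the pub/sub registry, keeping memory and file-descriptor usage proportional to live clients.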

3. OS File Descriptor Limits

Linux defaults to 1,024 open file descriptors per process. 100K connections require 100K file descriptors. Your system will refuse connections long before the CPU is warm.

# /etc/sysctl.conf — apply with `sysctl -p`
fs.file-max = 500000
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535

# Per-process limit — for the current shell; make it persistent
# via /etc/security/limits.conf or the service unit's LimitNOFILE
ulimit -n 200000

4. Memory Pressure Is Real

Per-connection overhead breakdown:

  • Kernel socket buffer: 4–8KB
  • TLS session state: 10–20KB
  • Application state: 20–50KB depending on your code

At 100K connections: 5–8GB RAM dedicated to connection state alone. WebSocket servers should not colocate with databases or other memory-intensive services. Dedicated pods/VMs with monitored heap usage are not optional.

5. Graceful Shutdown

process.on('SIGTERM', async () => {
  server.close(); // Stop accepting new connections
  
  for (const [userId, ws] of connections) {
    ws.send(JSON.stringify({ type: 'reconnect', backoffMs: 1000 }));
    ws.close(1001, 'Server shutting down');
  }
  
  await registry.deregisterAll(); // Clean up Redis entries
  await drain();                  // Wait for in-flight messages
  process.exit(0);
});

Without this, clients don't know to reconnect, and your registry accumulates stale entries that cause missed message deliveries.
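The `drain()` call above is left abstract. One plausible shape — purely illustrative, assuming sends are wrapped in an in-flight counter — polls until the counter hits zero or a hard deadline passes, so shutdown can't hang forever:

```javascript
// Sketch: wait for in-flight deliveries to finish, with a hard deadline.
// getInFlight is assumed to return the current count of pending sends.
function drain(getInFlight, { pollMs = 50, deadlineMs = 5000 } = {}) {
  const start = Date.now();
  return new Promise((resolve) => {
    const tick = () => {
      if (getInFlight() === 0 || Date.now() - start >= deadlineMs) return resolve();
      setTimeout(tick, pollMs);
    };
    tick();
  });
}
```

The deadline matters: orchestrators like Kubernetes send SIGKILL after a grace period, so a drain that waits indefinitely just trades a clean shutdown for a hard kill.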


Recommended Architecture for 100K Connections

[Load Balancer]
      ↓
[4–8 WebSocket Servers, ~15–25K connections each]
      ↓
[Redis Cluster — Pub/Sub + Connection Registry with TTL]
      ↓                              ↓
[Message Queue — SQS/RabbitMQ]   [PostgreSQL/DynamoDB — persistent notifications]
      ↓
[Backend Services — notification producers]

Stack rationale:

  • Redis for pub/sub and registry: connection registry entries use TTLs (e.g., 5-minute expiry, refreshed by heartbeats) so server crashes don't leave ghost registrations
  • Separate persistence store: Redis pub/sub doesn't persist — if a user is offline, the message is lost. Solve "missed while offline" in the database layer, not the delivery layer; these are separate concerns
  • Message queue for ingestion: decouples notification producers from delivery speed, provides backpressure
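The TTL-refresh pattern for the registry is small enough to sketch. The `set`/`expire`/`get` calls follow common Redis client conventions (node-redis v4 style), but treat the exact signatures as assumptions and adapt to your client:

```javascript
// Sketch: connection registry entries that expire unless refreshed.
// A crashed gateway simply stops refreshing, and its entries age out.
const REGISTRY_TTL_SEC = 300; // 5 minutes, refreshed by client heartbeats

async function registerConnection(redis, userId, serverId) {
  // SET key value EX ttl — records which gateway holds this user
  await redis.set(`conn:${userId}`, serverId, { EX: REGISTRY_TTL_SEC });
}

async function refreshOnHeartbeat(redis, userId) {
  await redis.expire(`conn:${userId}`, REGISTRY_TTL_SEC);
}

async function lookup(redis, userId) {
  return redis.get(`conn:${userId}`); // null => offline, or entry expired
}
```

A `null` lookup is the signal to route the notification to the persistence store instead of the delivery layer — which is exactly the separation of concerns argued for above.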

Scaling path:

  • 100K connections: 4–6 nodes, single Redis instance
  • 500K connections: 10–15 nodes, Redis Cluster
  • 1M+ connections: Consider NATS JetStream or BEAM-based architecture, or evaluate managed services against operational cost

Decision Framework

Are delivery guarantees critical (financial, legal)?
├── YES → Kafka + idempotency keys + transactional consumers
└── NO → Continue

Expected messages/second?
├── <50K/sec → Redis Pub/Sub
├── 50K–500K/sec → Redis Cluster or NATS
└── >500K/sec → Kafka or NATS JetStream

Team infrastructure expertise?
├── Low → Managed service (Ably, Azure Web PubSub)
└── Medium/High → Self-hosted with Redis

Per-connection state complexity?
├── Simple (push notifications) → Horizontally scaled servers + Redis
└── Complex (games, collaborative) → Elixir/Phoenix or Akka

Cost sensitivity?
├── High → Self-hosted multi-process + Redis (~$500–2K/month)
└── Low → Managed services (offload ops, pay premium)

What to Benchmark Before Committing

Use Artillery.io or k6 to simulate your actual load pattern — not just connection count. The benchmark that matters is:

  1. Sustained connections at your target count
  2. Burst message delivery at your peak notification rate
  3. Reconnection storm — simulate a full server restart with all clients reconnecting
  4. Memory growth over time — zombie connections accumulate slowly and aren't visible in short tests

The architecture that looks correct on paper frequently breaks on reconnection storms and zombie connection accumulation. Build and test those failure scenarios before you're in production.

Benefits

Why developers choose Synero

Architecture Review

Get your system designs reviewed from four analytical angles — scalability, maintainability, innovation, and risk — before writing a single line of code.

Code Debugging

Paste error logs or problematic code and get four independent diagnostic analyses. The synthesized answer identifies the most likely root cause and recommended fix.

Technology Selection

Evaluating frameworks, databases, or cloud services? Four AI advisors compare options from different angles — performance, DX, ecosystem, and contrarian picks — helping you avoid hype-driven decisions.

Documentation & Communication

Generate clear technical documentation, RFC drafts, and architecture decision records that explain complex decisions from multiple stakeholder perspectives.

FAQ

Common questions from developers

Ready to build better software?

Get architectural insights from four AI perspectives before making your next technical decision.

Get Started