R.S. · 7 min read

Real-Time Communication for DAOs: Architecture and Scaling

Real-time messaging is not a solved problem. At least not when your requirements go beyond a simple chat widget. While building the messaging layer for a DAO community platform, we had to design a system that could model decentralized governance structures, support token-gated access control, and still perform well under thousands of concurrent connections.

This post documents the architecture decisions we made, including benchmarks and code.

WebSocket vs. Server-Sent Events

The first decision is the transport protocol. WebSockets provide bidirectional communication over a persistent TCP connection. Server-Sent Events (SSE) are unidirectional (server-to-client) and run over plain HTTP; we tested them over HTTP/2, which multiplexes streams and sidesteps the browser's per-host connection limit.

Our benchmarks on a 4-core server with 16 GB RAM:

Protocol        | Connections | Latency p50 | Latency p99 | RAM/connection
----------------|-------------|-------------|-------------|---------------
WebSocket       | 10,000      | 2.1 ms      | 8.4 ms      | ~4.2 KB
SSE (HTTP/2)    | 10,000      | 3.8 ms      | 14.1 ms     | ~6.8 KB
WebSocket       | 50,000      | 3.4 ms      | 22.7 ms     | ~4.3 KB
SSE (HTTP/2)    | 50,000      | 9.2 ms      | 61.3 ms     | ~7.1 KB

WebSockets win on latency and memory consumption. SSE has the advantage of causing fewer issues with firewalls and proxies, plus built-in automatic reconnection. For a community platform with bidirectional messaging, WebSocket was the clear choice. We additionally use SSE for notifications and status updates, where no return channel is needed.
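For the SSE side, the server just keeps an HTTP response open and writes framed events. A minimal sketch of such a notification endpoint (the `/events` route and `formatSSE` helper are illustrative names, not from our codebase):

```typescript
import http from "node:http";

// Serialize one SSE frame: an optional event name line, one "data:" line
// per payload line, then the blank line that terminates the frame.
function formatSSE(data: string, event?: string): string {
  const eventLine = event ? `event: ${event}\n` : "";
  const dataLines = data
    .split("\n")
    .map((line) => `data: ${line}`)
    .join("\n");
  return `${eventLine}${dataLines}\n\n`;
}

// Minimal notification endpoint: keep the response open, stream frames.
const server = http.createServer((req, res) => {
  if (req.url !== "/events") {
    res.writeHead(404).end();
    return;
  }
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });
  // In the real system the frames come from the message broker;
  // here a periodic status tick stands in as a placeholder.
  const timer = setInterval(() => {
    res.write(formatSSE(JSON.stringify({ ts: Date.now() }), "status"));
  }, 15_000);
  req.on("close", () => clearInterval(timer));
});
```

The browser's EventSource API handles reconnection and `Last-Event-ID` bookkeeping on its own, which is exactly why SSE is attractive for fire-and-forget notifications.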

Connection Management

The WebSocket server runs on Node.js using the ws library. Each connection goes through an authentication phase after the handshake:

import http from "node:http";
import { WebSocketServer } from "ws";
import { verifyToken } from "./auth";

const server = http.createServer();
const wss = new WebSocketServer({ noServer: true });

server.on("upgrade", async (req, socket, head) => {
  try {
    // The token travels as a query parameter (see note below)
    const token = new URL(req.url!, `http://${req.headers.host}`)
      .searchParams.get("token");

    if (!token) {
      socket.write("HTTP/1.1 401 Unauthorized\r\n\r\n");
      socket.destroy();
      return;
    }

    const user = await verifyToken(token);

    wss.handleUpgrade(req, socket, head, (ws) => {
      wss.emit("connection", ws, req, user);
    });
  } catch {
    socket.write("HTTP/1.1 403 Forbidden\r\n\r\n");
    socket.destroy();
  }
});

Note: the token is passed as a query parameter, not a header, because browser WebSocket clients cannot set custom headers during the initial handshake. The token is short-lived (30 seconds) and obtained via a separate REST endpoint.

Message Ordering and Consistency

In a distributed system with multiple server instances, message ordering is non-trivial. We use Hybrid Logical Clocks (HLC), which combine physical timestamps with a logical counter:

interface HLC {
  ts: number;   // Physical time in milliseconds
  counter: number;
  nodeId: string;
}

function tick(local: HLC, received?: HLC): HLC {
  const now = Date.now();

  if (!received) {
    // Send or local event: advance past both wall clock and local timestamp
    const ts = Math.max(local.ts, now);
    return {
      ts,
      counter: ts === local.ts ? local.counter + 1 : 0,
      nodeId: local.nodeId,
    };
  }

  const maxTs = Math.max(local.ts, received.ts, now);
  let counter = 0;
  if (maxTs === local.ts && maxTs === received.ts)
    counter = Math.max(local.counter, received.counter) + 1;
  else if (maxTs === local.ts)
    counter = local.counter + 1;
  else if (maxTs === received.ts)
    counter = received.counter + 1;

  return { ts: maxTs, counter, nodeId: local.nodeId };
}

HLCs guarantee causal ordering without depending on synchronized clocks. Two messages from the same sender always maintain correct order. For concurrent messages from different senders, the nodeId serves as a tiebreaker.
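That tiebreaker can be made explicit with a comparator so every node sorts messages identically (a sketch; `compareHLC` is our name for it, not a library function):

```typescript
// Same shape as the HLC interface above
interface HLC {
  ts: number;
  counter: number;
  nodeId: string;
}

// Total order over HLC timestamps: physical time first, then the logical
// counter, then nodeId as a deterministic tiebreaker for concurrent events.
function compareHLC(a: HLC, b: HLC): number {
  if (a.ts !== b.ts) return a.ts - b.ts;
  if (a.counter !== b.counter) return a.counter - b.counter;
  return a.nodeId < b.nodeId ? -1 : a.nodeId > b.nodeId ? 1 : 0;
}

// Sorting a message buffer by HLC yields a causally consistent display order
const messages = [
  { hlc: { ts: 100, counter: 2, nodeId: "b" }, text: "second" },
  { hlc: { ts: 100, counter: 1, nodeId: "a" }, text: "first" },
];
messages.sort((m, n) => compareHLC(m.hlc, n.hlc));
```

Because every field of the comparison is carried inside the message itself, a client that receives messages out of order during a reconnect can still merge them into the same order every other client sees.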

Presence Detection

Presence (online/offline/away) is surprisingly expensive. Naive heartbeat implementations with 10,000 users at 30-second intervals generate over 300 messages per second just for presence. Our approach:

// Refresh the user's score on every heartbeat; broadcasts to other
// clients happen only on actual status transitions (see below)
// Redis Sorted Set: score = timestamp, member = userId
await redis.zadd("presence:channel:42", Date.now(), userId);

// Cleanup: remove entries older than 90 seconds
await redis.zremrangebyscore(
  "presence:channel:42", 0, Date.now() - 90_000
);

// Query online users for a channel
const online = await redis.zrangebyscore(
  "presence:channel:42",
  Date.now() - 90_000, "+inf"
);

Clients send a silent heartbeat every 60 seconds. The presence list is only broadcast to other clients on actual transitions (online → offline), not on every heartbeat.
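The transition detection itself is a set difference between the previous and current online lists; a minimal sketch (`diffPresence` is an illustrative helper, not part of our API):

```typescript
// Compare the previous and current online sets for a channel and return
// only the transitions — these, not the raw heartbeats, get broadcast.
function diffPresence(
  prev: Set<string>,
  curr: Set<string>
): { joined: string[]; left: string[] } {
  const joined = [...curr].filter((id) => !prev.has(id));
  const left = [...prev].filter((id) => !curr.has(id));
  return { joined, left };
}

const before = new Set(["alice", "bob"]);
const after = new Set(["bob", "carol"]);
const delta = diffPresence(before, after);
// delta.joined === ["carol"], delta.left === ["alice"]
```

With 10,000 users this turns a steady 160+ heartbeat broadcasts per second into a handful of transition events per second, since most heartbeats confirm a state that has not changed.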

Scaling with Redis Pub/Sub

A single Node.js instance hits its limits at around 50,000 concurrent WebSocket connections. For horizontal scaling, we use Redis Pub/Sub as a message broker between instances:

import Redis from "ioredis";
import { WebSocket } from "ws";

const pub = new Redis(process.env.REDIS_URL!);
const sub = new Redis(process.env.REDIS_URL!);

// Broadcast message to all instances
async function broadcastToChannel(
  channelId: string,
  message: Message
) {
  await pub.publish(
    `ch:${channelId}`,
    JSON.stringify(message)
  );
}

// Each instance subscribes to the channels its local clients have joined
async function onChannelJoin(channelId: string) {
  await sub.subscribe(`ch:${channelId}`);
}

sub.on("message", (redisChannel, data) => {
  const channelId = redisChannel.replace("ch:", "");
  const message = JSON.parse(data);

  // Forward to local WebSocket clients
  const locals = connectionsByChannel.get(channelId);
  if (locals) {
    const payload = JSON.stringify(message);
    for (const ws of locals) {
      if (ws.readyState === WebSocket.OPEN) {
        ws.send(payload);
      }
    }
  }
});

Benchmark with 3 instances behind a load balancer:

Instances | Connections | Messages/s | Latency p50 | Latency p99
----------|-------------|------------|-------------|----------
1         | 50,000      | 12,400     | 3.4 ms      | 22.7 ms
3         | 150,000     | 34,800     | 4.1 ms      | 28.3 ms
3         | 150,000     | 82,000     | 6.7 ms      | 41.2 ms

Scaling is nearly linear. Redis Pub/Sub becomes the bottleneck: above roughly 100,000 messages per second, you should consider Redis Cluster or a dedicated broker such as NATS.

Takeaways

The architecture boils down to: WebSockets for bidirectional communication, HLC for causal ordering, Redis Sorted Sets for presence, and Redis Pub/Sub for horizontal scaling. This combination carries us to approximately 150,000 concurrent connections across three instances. For a typical DAO deployment with a few thousand active users, that is more than sufficient.

The largest time investment was not the messaging logic itself, but edge cases: connection drops during token-gating checks, race conditions on concurrent channel joins, and correct delivery of messages that arrived during a reconnect window. More on that in a future post.