Prelude
Building a first MCP server is a satisfying Sunday afternoon project. It runs on a laptop, communicates over stdio, and gives Claude access to a handful of internal tools. It feels like plugging a new limb into the AI. You can ask Claude to query a database, check deployment status, or read from an internal wiki, all through natural language.
Then Monday comes and a colleague asks to use it too.
That question breaks everything. The server is a process that Claude Code spawns as a child, reading from stdin and writing to stdout. It is bound to one machine, one terminal, one session. There is no URL to share, no endpoint to point at, no way for a second person to connect.
Moving an MCP server from a developer's laptop into a production environment changes everything. The transport changes. The error handling changes. The security requirements change entirely. What worked as a local prototype needs authentication, monitoring, rate limiting, container packaging, and a deployment pipeline before it can serve a team.
This guide covers everything needed for that transition. If you have already built your first MCP server and want to move beyond localhost, this is the path forward.
The Problem
MCP servers in development are simple. The Model Context Protocol specification defines stdio as the default transport. Claude Code spawns your server as a child process, sends JSON-RPC messages through stdin, and reads responses from stdout. No networking, no ports, no configuration beyond a command path.
This simplicity is also a ceiling. Stdio transport means the server lives and dies with the client process. It runs on the same machine. It serves exactly one client. It cannot be load-balanced, health-checked, or monitored by external systems. If it crashes, the client has to restart it. If it leaks memory, there is no external watchdog to catch it.
Production workloads need fundamentally different properties. Multiple developers connecting to the same server. Centralised logging and monitoring. Authentication so that only authorised users can call tools. Rate limiting to prevent a runaway AI session from hammering your backend. Health checks so your orchestrator can restart failed instances. Horizontal scaling when one instance is not enough.
The MCP specification anticipated this. It defines multiple transport types, and the one designed for production is Streamable HTTP. But the specification gives you the protocol. It does not give you the deployment patterns, the operational practices, or the hard lessons from running these systems under real load.
That is what this guide provides.
The Journey
Understanding MCP Transport Types
The Model Context Protocol defines three transport mechanisms, each suited to different deployment scenarios.
Stdio is the simplest. The client spawns the server as a child process. Messages flow through stdin and stdout. This is what you use during development and what Claude Code defaults to when you configure an MCP server with a command. It is fast, requires no network configuration, and works everywhere. But it is inherently local. One client, one server, one machine.
Streamable HTTP is the production transport. The server exposes an HTTP endpoint (typically /mcp), and the client sends JSON-RPC requests as POST bodies. The server can respond with a simple JSON response for request-response patterns, or it can upgrade the connection to Server-Sent Events (SSE) for streaming responses and server-initiated notifications. Session management happens through the Mcp-Session-Id header.
SSE is the legacy transport from the earlier MCP specification. It uses a dedicated SSE endpoint for server-to-client messages and a separate POST endpoint for client-to-server messages. It still works, but the specification now recommends Streamable HTTP for all new implementations. If you are building something new, skip SSE entirely.
The transition from stdio to Streamable HTTP is not just a transport swap. It changes how you think about the server's lifecycle. A stdio server is ephemeral. It exists for the duration of a single client session. A Streamable HTTP server is a long-running service that manages multiple concurrent sessions, each with its own state.
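To make the difference concrete, here is roughly what the first exchange looks like with Streamable HTTP, sketched as plain TypeScript objects. The protocolVersion and clientInfo values are illustrative, not prescriptive.

```typescript
// The client opens a session by POSTing a JSON-RPC initialize request
// to the /mcp endpoint. The server's response carries the session ID in
// the Mcp-Session-Id header, which the client echoes on every later call.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number | string;
  method: string;
  params?: Record<string, unknown>;
}

function buildInitializeRequest(id: number): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "initialize",
    params: {
      protocolVersion: "2025-03-26", // illustrative; client and server negotiate this
      capabilities: {},
      clientInfo: { name: "example-client", version: "0.1.0" },
    },
  };
}

// Subsequent requests carry the session header the server handed back:
// POST /mcp with header Mcp-Session-Id: <id from the initialize response>
```

A stdio server never sees any of this; the client simply spawns it and writes the same JSON-RPC payloads to its stdin.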
Setting Up Streamable HTTP Transport
The following walkthrough covers building a production-ready MCP server with Streamable HTTP transport. The examples use TypeScript with the official MCP SDK because it has the most mature HTTP transport support.
First, the basic server structure.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";
import express from "express";
import { z } from "zod";
const app = express();
app.use(express.json());
// Each session gets its own server instance. A connected McpServer is
// bound to a single transport, so sharing one across sessions would
// misroute responses between concurrent clients.
function createServer(): McpServer {
  const server = new McpServer({
    name: "production-tools",
    version: "1.0.0",
  });
  // Register your tools. The TypeScript SDK takes Zod schemas for tool parameters.
  server.tool(
    "get_deployment_status",
    "Check the deployment status of a service",
    { service: z.string().describe("Service name") },
    async ({ service }) => {
      const status = await checkDeployment(service);
      return {
        content: [{ type: "text", text: JSON.stringify(status, null, 2) }],
      };
    }
  );
  return server;
}
// Session management
const sessions = new Map<string, StreamableHTTPServerTransport>();
app.post("/mcp", async (req, res) => {
  const sessionId = req.headers["mcp-session-id"] as string | undefined;
  if (sessionId && sessions.has(sessionId)) {
    const transport = sessions.get(sessionId)!;
    // express.json() has already consumed the stream, so pass the parsed body
    await transport.handleRequest(req, res, req.body);
    return;
  }
  // Only an initialize request may open a new session (the SDK also
  // exports isInitializeRequest for this check)
  if (req.body?.method !== "initialize") {
    res.status(400).json({
      jsonrpc: "2.0",
      error: { code: -32000, message: "Bad Request: no valid session" },
      id: null,
    });
    return;
  }
  const transport = new StreamableHTTPServerTransport({
    sessionIdGenerator: () => crypto.randomUUID(),
    onsessioninitialized: (id) => {
      sessions.set(id, transport);
    },
  });
  transport.onclose = () => {
    if (transport.sessionId) {
      sessions.delete(transport.sessionId);
    }
  };
  await createServer().connect(transport);
  await transport.handleRequest(req, res, req.body);
});
app.listen(3001, () => {
  console.error("MCP server listening on port 3001");
});
Notice that console.error is used for the startup message, not console.log. This matters. MCP servers must never write non-protocol data to stdout. In stdio mode, stdout is the protocol channel. Even with HTTP transport, maintaining this discipline prevents subtle bugs if you ever need to support both transports.
The session management map tracks active sessions by their Mcp-Session-Id. When a client sends its first request (the initialize message), the server creates a new transport and assigns a session ID. Subsequent requests from the same client include that session ID in the header, routing them to the correct transport instance.
Adding Authentication
A production MCP server without authentication is an open door to your internal systems. Every tool you expose becomes callable by anyone who knows the endpoint. If your MCP server can query a database, an unauthenticated server lets anyone query that database.
The simplest authentication approach is Bearer token validation. Add middleware that checks every request before it reaches the MCP handler.
import { Request, Response, NextFunction } from "express";
const API_KEYS = new Set(
(process.env.MCP_API_KEYS || "").split(",").filter(Boolean)
);
function authenticate(req: Request, res: Response, next: NextFunction) {
const auth = req.headers.authorization;
if (!auth || !auth.startsWith("Bearer ")) {
res.status(401).json({
jsonrpc: "2.0",
error: { code: -32001, message: "Authentication required" },
id: null,
});
return;
}
const token = auth.slice(7);
if (!API_KEYS.has(token)) {
res.status(403).json({
jsonrpc: "2.0",
error: { code: -32002, message: "Invalid credentials" },
id: null,
});
return;
}
next();
}
app.post("/mcp", authenticate, async (req, res) => {
// ... MCP handling
});
This is the minimum. For a deeper treatment of authentication patterns including OAuth 2.1, token rotation, and per-user tool scoping, see the companion guide on MCP server authentication and security.
Bearer tokens work well for service-to-service communication where both sides are systems you control. For user-facing deployments where individual developers authenticate with their own credentials, OAuth 2.1 is the mechanism the MCP specification recommends.
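One hardening detail worth noting: comparing secrets with string equality or Set membership can leak timing information. Node's crypto.timingSafeEqual provides a constant-time comparison. A sketch of how the key check could be hardened (hashing both sides first normalises lengths, since timingSafeEqual requires equal-length buffers):

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Constant-time comparison of a presented token against one expected value.
function tokensMatch(presented: string, expected: string): boolean {
  const a = createHash("sha256").update(presented).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}

// Check against a list of valid keys without short-circuiting on the
// first match, so timing does not reveal which key position matched.
function isValidKey(presented: string, validKeys: string[]): boolean {
  let matched = false;
  for (const key of validKeys) {
    if (tokensMatch(presented, key)) matched = true;
  }
  return matched;
}
```

In the middleware above, `API_KEYS.has(token)` would become `isValidKey(token, [...API_KEYS])`.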
Rate Limiting and Abuse Prevention
AI clients behave differently from human users. A single Claude Code session can generate dozens of tool calls in rapid succession, particularly during agentic workflows where Claude is iterating on a problem. Without rate limiting, one developer's aggressive session can overwhelm your backend services.
A sliding window rate limiter keyed on the client's API key or session ID works well here.
const rateLimits = new Map<string, { count: number; resetAt: number }>();
const RATE_LIMIT = 100; // requests per window
const WINDOW_MS = 60000; // 1 minute window
function rateLimit(req: Request, res: Response, next: NextFunction) {
const token = req.headers.authorization?.slice(7) || "anonymous";
const now = Date.now();
let bucket = rateLimits.get(token);
if (!bucket || now > bucket.resetAt) {
bucket = { count: 0, resetAt: now + WINDOW_MS };
rateLimits.set(token, bucket);
}
bucket.count++;
res.setHeader("X-RateLimit-Limit", RATE_LIMIT);
res.setHeader("X-RateLimit-Remaining", Math.max(0, RATE_LIMIT - bucket.count));
res.setHeader("X-RateLimit-Reset", Math.ceil(bucket.resetAt / 1000));
if (bucket.count > RATE_LIMIT) {
res.status(429).json({
jsonrpc: "2.0",
error: {
code: -32003,
message: "Rate limit exceeded. Try again later.",
},
id: null,
});
return;
}
next();
}
One hundred requests per minute is a reasonable starting point for most internal tools. Adjust based on your backend capacity and the expected call patterns of your tools. Some tools are cheap (reading a config value) and some are expensive (running a database migration). You might want per-tool rate limits in addition to the global limit.
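A per-tool limiter can reuse the same sliding-window shape, keyed on token and tool name. A minimal sketch; the tool names and limits here are hypothetical, and the clock is injectable for testability:

```typescript
// Hypothetical per-tool limits layered on top of the global limiter.
const perToolLimits: Record<string, number> = {
  get_deployment_status: 60, // cheap: generous limit per window
  run_migration: 2,          // expensive: tight limit per window
};

type Bucket = { count: number; resetAt: number };
const toolBuckets = new Map<string, Bucket>();
const TOOL_WINDOW_MS = 60_000;

function allowToolCall(token: string, tool: string, now = Date.now()): boolean {
  const limit = perToolLimits[tool];
  if (limit === undefined) return true; // no per-tool limit configured
  const key = `${token}:${tool}`;
  let bucket = toolBuckets.get(key);
  if (!bucket || now > bucket.resetAt) {
    bucket = { count: 0, resetAt: now + TOOL_WINDOW_MS };
    toolBuckets.set(key, bucket);
  }
  bucket.count++;
  return bucket.count <= limit;
}
```

The check would run inside each tool handler (or a shared wrapper) rather than in Express middleware, since the tool name is only known after the JSON-RPC payload is parsed.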
Error Handling and JSON-RPC Codes
MCP uses JSON-RPC 2.0, which defines a specific error response format. Getting this right matters because the client (Claude Code) uses these error codes to decide what to do next. A well-formed error response lets Claude retry intelligently or explain the failure to the user. A malformed error or a connection drop leaves Claude guessing.
The standard JSON-RPC error codes are your foundation.
const ErrorCodes = {
PARSE_ERROR: -32700, // Invalid JSON
INVALID_REQUEST: -32600, // Not a valid JSON-RPC request
METHOD_NOT_FOUND: -32601, // Tool or method does not exist
INVALID_PARAMS: -32602, // Invalid tool arguments
INTERNAL_ERROR: -32603, // Server-side failure
};
Beyond these, define application-specific codes for your tools. The -32000 to -32099 range that JSON-RPC reserves for implementation-defined errors is ideal for this purpose.
const AppErrorCodes = {
AUTH_REQUIRED: -32001,
AUTH_INVALID: -32002,
RATE_LIMITED: -32003,
SERVICE_UNAVAILABLE: -32004,
UPSTREAM_TIMEOUT: -32005,
};
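Given these codes, a small helper keeps error responses consistent across the auth, rate-limit, and shutdown paths. A sketch; jsonRpcError is a name introduced here, not part of the SDK:

```typescript
// Build a JSON-RPC 2.0 error response body. The id should echo the
// request's id when it is known, and be null otherwise.
function jsonRpcError(
  code: number,
  message: string,
  id: string | number | null = null,
  data?: unknown
) {
  return {
    jsonrpc: "2.0" as const,
    error: data === undefined ? { code, message } : { code, message, data },
    id,
  };
}
```

Usage in middleware then collapses to one line: `res.status(429).json(jsonRpcError(-32003, "Rate limit exceeded"))`.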
Wrap your tool implementations in error handlers that catch exceptions and return structured errors. Never let an unhandled exception crash the server or return a raw stack trace.
server.tool(
  "query_database",
  "Run a read-only SQL query",
  { query: z.string().describe("The SQL query to run") }, // z is zod, imported at the top of the file
  async ({ query }) => {
    try {
      const result = await db.query(query);
      return {
        content: [{ type: "text", text: JSON.stringify(result.rows) }],
      };
    } catch (error) {
      // In TypeScript, a caught value is unknown; narrow before reading .message
      const message = error instanceof Error ? error.message : String(error);
      return {
        content: [
          {
            type: "text",
            text: `Database query failed: ${message}`,
          },
        ],
        isError: true,
      };
    }
  }
);
The isError: true flag in the tool response tells Claude that the tool call failed. Claude will typically report the error to the user rather than trying to interpret the error message as successful output. Without this flag, Claude might treat an error message as a valid query result.
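Rather than repeating the try/catch in every tool, the pattern can be factored into a wrapper so each handler gets it for free. A sketch under the assumption that handlers return the content/isError shape shown above; withErrorHandling is a name introduced here:

```typescript
type ToolResult = {
  content: { type: "text"; text: string }[];
  isError?: boolean;
};

// Wrap a tool handler so any thrown exception becomes a structured
// isError response instead of an unhandled rejection.
function withErrorHandling<A>(
  fn: (args: A) => Promise<ToolResult>
): (args: A) => Promise<ToolResult> {
  return async (args) => {
    try {
      return await fn(args);
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      return {
        content: [{ type: "text", text: `Tool failed: ${message}` }],
        isError: true,
      };
    }
  };
}
```

Registration then becomes `server.tool(name, description, schema, withErrorHandling(handler))`, and no tool can leak a raw stack trace by accident.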
Monitoring and Observability
A production MCP server without monitoring is a server you will be debugging blind at 2am. Consider a scenario where a tool that queries an external API starts timing out intermittently. Without metrics, there is no way to know it is happening until users report that Claude is "being slow."
Start with three layers of observability.
Health checks tell your orchestrator whether the server is alive and ready to accept requests.
app.get("/health", (req, res) => {
const health = {
status: "ok",
uptime: process.uptime(),
activeSessions: sessions.size,
timestamp: new Date().toISOString(),
};
res.json(health);
});
app.get("/ready", async (req, res) => {
try {
await db.query("SELECT 1");
res.json({ status: "ready" });
} catch {
res.status(503).json({ status: "not ready", reason: "database unavailable" });
}
});
Separate liveness (/health) from readiness (/ready). Your orchestrator uses liveness to decide whether to restart the container and readiness to decide whether to route traffic to it. A server can be alive but not ready if its database connection is down.
Request logging captures every tool call with timing, caller identity, and outcome.
function requestLogger(req: Request, res: Response, next: NextFunction) {
const start = Date.now();
const requestId = crypto.randomUUID();
res.on("finish", () => {
const duration = Date.now() - start;
const logEntry = {
requestId,
method: req.method,
path: req.path,
sessionId: req.headers["mcp-session-id"],
status: res.statusCode,
duration,
timestamp: new Date().toISOString(),
};
console.error(JSON.stringify(logEntry));
});
next();
}
app.use(requestLogger);
Metrics feed into your existing monitoring stack. If you use Prometheus, expose a /metrics endpoint with counters for tool calls, histograms for response times, and gauges for active sessions.
import { Registry, Counter, Histogram, Gauge } from "prom-client";
const registry = new Registry();
const toolCallCounter = new Counter({
name: "mcp_tool_calls_total",
help: "Total number of MCP tool calls",
labelNames: ["tool_name", "status"],
registers: [registry],
});
const toolCallDuration = new Histogram({
name: "mcp_tool_call_duration_seconds",
help: "Duration of MCP tool calls",
labelNames: ["tool_name"],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 5, 10],
registers: [registry],
});
const activeSessions = new Gauge({
name: "mcp_active_sessions",
help: "Number of active MCP sessions",
registers: [registry],
});
app.get("/metrics", async (req, res) => {
res.set("Content-Type", registry.contentType);
res.end(await registry.metrics());
});
With these three layers, you can answer the questions that matter. Is the server healthy? How many requests per second is it handling? Which tools are slowest? Which clients are generating the most load?
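If a full Prometheus stack is not yet in place, even a hand-rolled tracker answers the "which tools are slowest" question. A rough dependency-free sketch, not a substitute for real histograms (unbounded arrays would need capping in practice):

```typescript
// Record per-tool durations in memory and report a requested percentile.
const toolDurations = new Map<string, number[]>();

function recordDuration(tool: string, ms: number): void {
  const list = toolDurations.get(tool) ?? [];
  list.push(ms);
  toolDurations.set(tool, list);
}

// Nearest-rank percentile: p is in (0, 100].
function percentile(tool: string, p: number): number | undefined {
  const list = toolDurations.get(tool);
  if (!list || list.length === 0) return undefined;
  const sorted = [...list].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}
```

Wiring recordDuration around each tool call gives the p99-latency signal mentioned later in this guide, even before a metrics backend exists.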
Scaling Patterns
A single MCP server instance can handle a surprising amount of load. JSON-RPC messages are small, tool calls are typically I/O-bound (waiting on databases, APIs, or file systems), and Node.js handles concurrent I/O well. A single instance can comfortably serve 50 or more concurrent sessions.
But eventually you need more than one instance. The key challenge is session state.
If your server is stateless (no session-level caching, no in-memory state beyond the transport), horizontal scaling is straightforward. Run multiple instances behind a load balancer. Any instance can handle any request.
If your server maintains session state (which the session management map in the earlier example does), you need session affinity. The load balancer must route all requests from the same session to the same instance.
# nginx configuration for MCP server with session affinity
upstream mcp_servers {
hash $http_mcp_session_id consistent;
server mcp-server-1:3001;
server mcp-server-2:3001;
server mcp-server-3:3001;
}
server {
listen 443 ssl;
server_name mcp.internal.company.com;
location /mcp {
proxy_pass http://mcp_servers;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# SSE support
proxy_buffering off;
proxy_cache off;
proxy_read_timeout 300s;
}
location /health {
proxy_pass http://mcp_servers;
}
location /metrics {
proxy_pass http://mcp_servers;
# Restrict metrics to internal network
allow 10.0.0.0/8;
deny all;
}
}
The hash $http_mcp_session_id consistent directive routes requests with the same Mcp-Session-Id header to the same backend. The consistent modifier ensures that when a server is added or removed, only a fraction of sessions are remapped rather than all of them.
For production deployments where sessions must survive server restarts, move session state out of the process. Redis is the natural choice. Store the session data in Redis keyed by session ID, and any server instance can pick up any session.
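Moving session state out of process starts with an interface the server codes against, so the in-memory version can be swapped for a Redis-backed one without touching handlers. A sketch; the SessionData shape is an assumption, and note that live transport objects cannot be serialised, so what moves to Redis is session metadata, with the transport reconstructed as needed:

```typescript
type SessionData = { createdAt: number; clientName?: string };

// Minimal session store abstraction. The in-memory version mirrors the
// Map used earlier; a Redis implementation would satisfy the same
// interface with GET/SET/DEL keyed by session ID.
interface SessionStore {
  get(id: string): Promise<SessionData | undefined>;
  set(id: string, data: SessionData): Promise<void>;
  delete(id: string): Promise<void>;
}

class InMemorySessionStore implements SessionStore {
  private sessions = new Map<string, SessionData>();
  async get(id: string) {
    return this.sessions.get(id);
  }
  async set(id: string, data: SessionData) {
    this.sessions.set(id, data);
  }
  async delete(id: string) {
    this.sessions.delete(id);
  }
}
```

Keeping the interface async from day one means the Redis swap is a new class, not a refactor of every call site.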
Container Deployment
Docker is the standard packaging for production MCP servers. Here is a production Dockerfile.
FROM node:22-slim AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --include=dev
COPY . .
RUN npm run build
FROM node:22-slim
WORKDIR /app
RUN addgroup --system mcp && adduser --system --ingroup mcp mcp
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER mcp
EXPOSE 3001
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD node -e "fetch('http://localhost:3001/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"
CMD ["node", "dist/server.js"]
Key details. The multi-stage build keeps the final image small. The non-root user (mcp) follows the principle of least privilege. The HEALTHCHECK directive lets Docker and orchestrators detect failures automatically.
For Kubernetes, a minimal deployment looks like this.
apiVersion: apps/v1
kind: Deployment
metadata:
name: mcp-server
spec:
replicas: 3
selector:
matchLabels:
app: mcp-server
template:
metadata:
labels:
app: mcp-server
spec:
containers:
- name: mcp-server
image: registry.internal/mcp-server:1.0.0
ports:
- containerPort: 3001
env:
- name: MCP_API_KEYS
valueFrom:
secretKeyRef:
name: mcp-secrets
key: api-keys
livenessProbe:
httpGet:
path: /health
port: 3001
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /ready
port: 3001
initialDelaySeconds: 5
periodSeconds: 10
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
Secrets like API keys come from Kubernetes secrets, not environment variables baked into the image. The resource limits prevent a single pod from consuming unbounded memory or CPU. The liveness and readiness probes give Kubernetes the information it needs to manage the server's lifecycle.
Configuration Management
Production servers need different configurations for different environments. Use environment variables for values that change between deployments and configuration files for values that change between releases.
const config = {
port: parseInt(process.env.PORT || "3001"),
logLevel: process.env.LOG_LEVEL || "info",
rateLimitMax: parseInt(process.env.RATE_LIMIT_MAX || "100"),
rateLimitWindowMs: parseInt(process.env.RATE_LIMIT_WINDOW_MS || "60000"),
dbConnectionString: process.env.DATABASE_URL,
corsOrigins: (process.env.CORS_ORIGINS || "").split(",").filter(Boolean),
tlsEnabled: process.env.TLS_ENABLED === "true",
};
// Validate required config at startup
const required = ["DATABASE_URL", "MCP_API_KEYS"];
for (const key of required) {
if (!process.env[key]) {
console.error(`Missing required environment variable: ${key}`);
process.exit(1);
}
}
Fail fast on missing configuration. A server that starts without its database connection string will fail on the first tool call, producing a confusing error. Better to fail immediately at startup with a clear message about what is missing.
Never log secrets. If you log your configuration at startup for debugging (which is recommended), redact sensitive values.
console.error("Configuration loaded:", {
...config,
dbConnectionString: config.dbConnectionString ? "[REDACTED]" : "not set",
});
The Production Architecture
After extensive iteration, the architecture that works best for production MCP deployments looks like this.
Client (Claude Code)
|
| HTTPS + Bearer Token
|
Reverse Proxy (Caddy or nginx)
|
| HTTP (internal network)
|
MCP Server (Node.js / Rust / Python)
|
|--- Backend API (REST/gRPC)
|--- Database (PostgreSQL/Redis)
|--- External Services (APIs, queues)
The reverse proxy handles TLS termination, connection limits, and request buffering. It also provides a natural point for adding IP allowlisting or mutual TLS for internal services.
The MCP server itself is as thin as possible. It validates inputs, calls backend services, and formats results. Business logic lives in the backend services, not in the MCP server. This separation means you can update your backend APIs without redeploying the MCP server, and you can expose the same backend through both MCP and traditional REST APIs.
For SSE support (which Streamable HTTP uses for streaming responses), the reverse proxy must be configured to disable response buffering. Without this, SSE events are buffered and delivered in batches, which defeats the purpose of streaming.
With Caddy, the configuration is simpler.
mcp.internal.company.com {
reverse_proxy mcp-server:3001 {
flush_interval -1
}
}
The flush_interval -1 directive disables response buffering, allowing SSE events to flow through immediately.
Graceful Shutdown
Production servers must handle shutdown signals cleanly. When Kubernetes sends a SIGTERM, or when you deploy a new version, active sessions should complete rather than being killed mid-request.
let isShuttingDown = false;
// This middleware must be registered before the /mcp route. Express runs
// middleware in registration order, so a check added inside the SIGTERM
// handler (after the routes exist) would never intercept /mcp requests.
app.use((req, res, next) => {
  if (isShuttingDown && req.path === "/mcp" && !req.headers["mcp-session-id"]) {
    res.status(503).json({
      jsonrpc: "2.0",
      error: { code: -32004, message: "Server is shutting down" },
      id: null,
    });
    return;
  }
  next();
});
process.on("SIGTERM", async () => {
  console.error("Received SIGTERM, starting graceful shutdown");
  // Stop accepting new sessions; existing sessions keep working
  isShuttingDown = true;
// Wait for active sessions to complete (max 30 seconds)
const deadline = Date.now() + 30000;
while (sessions.size > 0 && Date.now() < deadline) {
await new Promise((resolve) => setTimeout(resolve, 1000));
}
// Close remaining sessions
for (const [id, transport] of sessions) {
await transport.close();
sessions.delete(id);
}
process.exit(0);
});
The pattern is to stop accepting new sessions, wait for existing sessions to finish (with a deadline), then force-close anything still open. The 30-second deadline matches Kubernetes' default terminationGracePeriodSeconds.
The Lesson
Moving an MCP server to production is not primarily a coding challenge. The protocol is the same. The tools are the same. The messages are the same JSON-RPC payloads.
The real challenge is operational. It is deciding how to authenticate clients and what happens when authentication fails. It is knowing that your server handles 50 concurrent sessions but understanding what happens at 500. It is having metrics that tell you a tool's 99th percentile latency jumped from 200ms to 2 seconds before your users notice.
Every production system, MCP or otherwise, follows the same pattern. The protocol is the easy part. The deployment, monitoring, security, and operational practices around the protocol are what make it production-ready.
If you are building MCP servers that will serve a team, start with the patterns in this guide. Streamable HTTP transport for remote access. Authentication middleware from day one. Health checks and metrics before you deploy. Rate limiting before you need it.
And read the companion guide on MCP server authentication and security before you expose any tools that touch sensitive data. Authentication is not optional. It is the first thing to get right.
Conclusion
This journey starts with a server that runs on a laptop and serves one person. It ends with a containerised service behind a reverse proxy, authenticated with Bearer tokens, monitored with Prometheus, and scaled across three replicas in Kubernetes.
The path from localhost to production is well-worn. HTTP transport gives you the network layer. Authentication gives you access control. Rate limiting gives you safety margins. Health checks give you automated recovery. Metrics give you visibility. Container packaging gives you reproducible deployments.
None of these concepts are new to anyone who has deployed web services. The insight is that MCP servers are web services. They speak JSON-RPC instead of REST, and they serve AI clients instead of browsers, but the operational requirements are identical.
Build your MCP server with the same rigour you would apply to any production API. Then go further. Read about building MCP servers in Rust for the performance characteristics of a systems language. Read about authentication and security patterns if your tools access sensitive data.
The protocol is powerful. The tools are capable. The missing piece for most teams is the operational maturity to run them reliably. This guide aims to fill that gap.