MODULE 8

Deployment & Scaling

How to ship AI agents to production, keep them running under load, and control costs—with real examples from The Website's infrastructure.

What You'll Learn

  • ✓ Choose the right deployment platform for your agent (Vercel, Railway, fly.io)
  • ✓ Manage environment variables and secrets safely across environments
  • ✓ Scale your database with Turso replication for global latency
  • ✓ Add observability: structured logging, error tracking, and usage metrics
  • ✓ Control LLM costs with caching, batching, and model routing
  • ✓ Implement rate limiting to protect your agent from abuse
  • ✓ Scale horizontally when one instance isn't enough

From Prototype to Production

There's a gap between an agent that runs on your laptop and one that serves real users. In the first few days of running The Website, I learned this the hard way: my agent would work perfectly in local testing, then fail silently in production because of missing environment variables, cold-start latency, or an unhandled API error that nobody noticed for two hours.

Deployment is not an afterthought. It's the difference between a demo and a business. This module covers everything you need to get your agent running reliably at scale—from the first deploy to handling thousands of concurrent users.

The Website's production stack

Next.js on Vercel + Turso (distributed SQLite) + GitHub Actions for the agent pipeline. Total infrastructure cost: ~$20/month. Handles the current traffic comfortably with room to 100x.

1. Choosing a Deployment Platform

The right platform depends on what your agent does. Web-facing agents (handling HTTP requests) have different needs than long-running background agents. Here are the three platforms worth knowing:

Vercel — Best for Web Agents

Serverless functions that auto-scale to zero. Git-push deploys. Built-in CDN and edge network. This is what powers The Website.

  • ✅ Zero-config deploys from GitHub
  • ✅ Automatic preview deployments per branch
  • ✅ Edge functions for ultra-low latency
  • ✅ Generous free tier (100GB bandwidth, 100k function invocations)
  • ❌ 10-second default timeout (configurable to 300s on Pro)
  • ❌ Cold starts on serverless functions (~200ms)
  • ❌ Not suited for long-running background agents

Use when: Your agent responds to HTTP requests, you want zero ops overhead, you're running Next.js or similar.
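That 10-second default bites agents quickly, since a single LLM call can take longer than that on its own. On Vercel with Next.js, the timeout can be raised per route with a segment config export; a minimal sketch (the /api/run path is illustrative, and values above the free-tier ceiling assume a Pro plan):

```typescript
// app/api/run/route.ts (illustrative path)
// Raise this route's function timeout; Vercel caps it at your plan's limit
export const maxDuration = 300; // seconds

export async function POST(req: Request) {
  // Long-running agent work now fits inside the extended window
  return Response.json({ ok: true });
}
```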

Railway — Best for Always-On Agents

Persistent containers that stay warm. No cold starts. Better for agents that need to maintain state, long-running tasks, or WebSocket connections.

  • ✅ Always-on containers, no cold starts
  • ✅ Built-in Postgres, Redis, MongoDB add-ons
  • ✅ Simple pricing: pay for what you use (~$5/month baseline)
  • ✅ Git-push deploys with automatic rollbacks
  • ❌ More expensive than serverless at low traffic
  • ❌ Manual scaling (vs. auto-scale to zero)

Use when: Your agent runs background jobs, maintains persistent connections, or needs more than 300s execution time.

fly.io — Best for Global Multi-Region

Run containers in 30+ regions. Requests route to nearest instance. Ideal when your users are globally distributed and latency matters.

  • ✅ Deploy to 30+ regions with one command
  • ✅ Machines scale to zero when idle
  • ✅ Persistent volumes per region
  • ✅ WireGuard VPN between machines (private networking)
  • ❌ More complex than Vercel/Railway to configure
  • ❌ Steeper learning curve (Dockerfile required)

Use when: You need sub-100ms latency globally, want to run SQLite at the edge, or need fine-grained control over regional placement.

Quick decision guide:

Building a Next.js app? → Vercel

Long-running background agent? → Railway

Users in 10+ countries? → fly.io

Not sure? → Vercel (start here, migrate later)

2. Environment Management

AI agents use a lot of secrets: API keys for Claude, OpenAI, GitHub, Stripe, Resend. Mismanaging these is one of the most common production failures I see. Here's the system that works.

The Three Environments

Development

Local machine, your .env.local file

  • Test API keys (low rate limits OK)
  • Local SQLite database
  • Verbose logging enabled
  • No real emails/payments

Preview / Staging

Per-branch deploys on Vercel

  • Separate Turso database branch
  • Stripe test mode keys
  • Error tracking enabled
  • Integration tests run here

Production

main branch, live traffic

  • Production API keys, full quotas
  • Production database
  • Minimal logging (cost)
  • Alerts on errors

The .env Pattern

Never commit secrets to git. The Website uses this pattern:

# .env.local (never committed, in .gitignore)
ANTHROPIC_API_KEY=sk-ant-...
TURSO_DATABASE_URL=libsql://...
TURSO_AUTH_TOKEN=...
GITHUB_APP_ID=...
GITHUB_PRIVATE_KEY=...
STRIPE_SECRET_KEY=sk_live_...
RESEND_API_KEY=re_...
NEXTAUTH_SECRET=...

# .env.example (committed — tells teammates what vars are needed)
ANTHROPIC_API_KEY=
TURSO_DATABASE_URL=
TURSO_AUTH_TOKEN=
GITHUB_APP_ID=
GITHUB_PRIVATE_KEY=
STRIPE_SECRET_KEY=
RESEND_API_KEY=
NEXTAUTH_SECRET=

Validating Environment Variables at Startup

Silent failures from missing env vars are the worst. Add a validation check that runs at startup and fails loudly:

// lib/env.ts
const required = [
  "ANTHROPIC_API_KEY",
  "TURSO_DATABASE_URL",
  "TURSO_AUTH_TOKEN",
  "NEXTAUTH_SECRET",
] as const;

export function validateEnv() {
  const missing = required.filter((key) => !process.env[key]);
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}`
    );
  }
}

// Export typed env for safe access
export const env = {
  anthropicApiKey: process.env.ANTHROPIC_API_KEY!,
  tursoUrl: process.env.TURSO_DATABASE_URL!,
  tursoToken: process.env.TURSO_AUTH_TOKEN!,
  nextAuthSecret: process.env.NEXTAUTH_SECRET!,
} as const;

Production incident I had

Deployed with GITHUB_PRIVATE_KEY missing. The agent ran fine for 2 hours (no GitHub operations needed) then silently failed when trying to label an issue. No error appeared in logs because the failure was swallowed in a try/catch. Now I validate all vars on startup.
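One way to make "validate on startup" stick is to call the check from Next.js's instrumentation.ts hook, which runs once when the server boots. A minimal sketch, with the required list abbreviated here:

```typescript
// instrumentation.ts (Next.js runs register() once at server startup)
const required = ["ANTHROPIC_API_KEY", "GITHUB_PRIVATE_KEY"] as const;

export function validateEnv(env: Record<string, string | undefined> = process.env) {
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    // Fail the boot loudly instead of failing a GitHub call two hours in
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
}

export async function register() {
  validateEnv();
}
```

This turns a missing secret into a failed deploy instead of a silent runtime failure.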

3. Database Scaling with Turso

The Website uses Turso—a distributed SQLite database. Most people assume SQLite can't scale, but Turso proves otherwise. With replication, you get read replicas in every region, sub-10ms reads globally, and the simplicity of SQLite.

How Turso Replication Works

Primary database (write operations)

├── Replica: us-east (reads routed here for US users)

├── Replica: eu-west (reads routed here for EU users)

└── Replica: ap-southeast (reads routed here for APAC users)

Writes go to primary → replicate to all replicas in <100ms

// lib/db.ts — connect to nearest replica automatically
import { createClient } from "@libsql/client";
import { drizzle } from "drizzle-orm/libsql";

const client = createClient({
  url: process.env.TURSO_DATABASE_URL!,
  authToken: process.env.TURSO_AUTH_TOKEN!,
});

export const db = drizzle(client);

// For write-heavy operations, connect to primary explicitly:
export const primaryClient = createClient({
  url: process.env.TURSO_PRIMARY_URL!,   // primary URL
  authToken: process.env.TURSO_AUTH_TOKEN!,
});
export const primaryDb = drizzle(primaryClient);

Read vs. Write Routing Pattern

The key insight: most agent operations are reads (checking cache, loading context, fetching issues). Route reads to replicas, writes to primary:

// Read from nearest replica (fast, globally distributed)
const issues = await db
  .select()
  .from(issueCache)
  .where(eq(issueCache.status, "open"));

// Write to primary (consistent, single source of truth)
await primaryDb
  .insert(issueCache)
  .values({ id, title, status: "open", votes: 0 })
  .onConflictDoUpdate({
    target: issueCache.id,
    set: { title, status, votes },
  });

Database Connection Pooling

On serverless (Vercel), each function invocation creates a new database connection by default. At scale this exhausts connection limits fast. Fix it with a module-level singleton:

// lib/db.ts — module-level singleton (reused across warm invocations)
import { createClient } from "@libsql/client";
import { drizzle } from "drizzle-orm/libsql";

// This module is cached between serverless invocations in the same container
let _db: ReturnType<typeof drizzle> | null = null;

export function getDb() {
  if (!_db) {
    const client = createClient({
      url: process.env.TURSO_DATABASE_URL!,
      authToken: process.env.TURSO_AUTH_TOKEN!,
    });
    _db = drizzle(client);
  }
  return _db;
}

4. Monitoring: Logging, Errors, and Observability

You can't fix what you can't see. AI agents fail in subtle ways—wrong outputs, hallucinated tool calls, unexpected costs. Good monitoring catches these before your users do.

Structured Logging

Don't use console.log. Use structured logs with consistent fields so you can filter and query them:

// lib/logger.ts
type LogLevel = "info" | "warn" | "error" | "debug";

interface LogEvent {
  level: LogLevel;
  message: string;
  agentId?: string;
  taskId?: string;
  durationMs?: number;
  tokensUsed?: number;
  error?: string;
  [key: string]: unknown;
}

export function log(event: LogEvent) {
  const entry = {
    timestamp: new Date().toISOString(),
    env: process.env.NODE_ENV,
    ...event,
  };

  // In production, output JSON for log aggregators (Datadog, Logtail, etc.)
  if (process.env.NODE_ENV === "production") {
    console.log(JSON.stringify(entry));
  } else {
    // In dev, pretty-print for readability
    const { level, message, ...rest } = entry;
    console.log(`[${level.toUpperCase()}] ${message}`, rest);
  }
}

// Usage in agent code:
log({
  level: "info",
  message: "Task completed",
  taskId: "task-123",
  agentId: "developer-agent",
  durationMs: 4200,
  tokensUsed: 3847,
});

Error Tracking with Sentry

The Website uses Sentry for error tracking. When an agent throws an unhandled error, Sentry captures the full context: request headers, user session, recent breadcrumbs. Setup takes 5 minutes:

// sentry.server.config.ts
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1,   // Sample 10% of requests for performance
  beforeSend(event) {
    // Don't send errors for expected cases
    if (event.exception?.values?.[0]?.value?.includes("Rate limit")) {
      return null;
    }
    return event;
  },
});

// Wrap agent execution to capture errors with context:
export async function runAgentTask(taskId: string, fn: () => Promise<void>) {
  return Sentry.withScope(async (scope) => {
    scope.setTag("taskId", taskId);
    scope.setContext("agent", { taskId, startTime: Date.now() });
    try {
      await fn();
    } catch (error) {
      Sentry.captureException(error);
      throw error;
    }
  });
}

What to Monitor for AI Agents

Infrastructure metrics

  • Function invocation count & errors
  • Cold start frequency & duration
  • Database query latency (p50/p95/p99)
  • API endpoint response times

Agent-specific metrics

  • Tokens used per task (cost proxy)
  • Task success rate vs. failure rate
  • Tool call error frequency by tool
  • Agent turnaround time per task type

The metric I watch most closely

Tokens per completed task. If this number starts creeping up, an agent is probably stuck in a loop, getting confused by context bloat, or making unnecessary tool calls. It's the earliest signal of agent degradation.
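A cheap way to automate that watch: compare average tokens per completed task over a recent window against a baseline, and alert on drift. A sketch, with log fields assumed to match the structured-logging shape above:

```typescript
interface TaskLog {
  taskId: string;
  tokensUsed: number;
}

export function avgTokensPerTask(logs: TaskLog[]): number {
  if (logs.length === 0) return 0;
  return logs.reduce((sum, l) => sum + l.tokensUsed, 0) / logs.length;
}

// Flag degradation when recent tasks use, say, 1.5x the baseline tokens
export function isDegrading(
  baseline: TaskLog[],
  recent: TaskLog[],
  threshold = 1.5
): boolean {
  const base = avgTokensPerTask(baseline);
  if (base === 0) return false;
  return avgTokensPerTask(recent) / base >= threshold;
}
```

Run this over, say, yesterday's logs vs. today's, and page yourself before the bill does.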

5. Cost Optimization

LLM API costs can spiral fast. A single Claude Sonnet call processing a large context costs ~$0.01–$0.05. At 1000 agent tasks/day, that's $10–$50 daily just in tokens. Here's how I keep costs under control.
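A back-of-envelope estimator makes those numbers concrete for your own workload. The prices below are placeholder Sonnet-class figures; check the current pricing page before relying on them:

```typescript
// Assumed prices in USD per million tokens (verify against the live pricing page)
const INPUT_PER_MTOK = 3;
const OUTPUT_PER_MTOK = 15;

export function estimateDailyCost(
  tasksPerDay: number,
  inputTokensPerTask: number,
  outputTokensPerTask: number
): number {
  const perTask =
    (inputTokensPerTask / 1_000_000) * INPUT_PER_MTOK +
    (outputTokensPerTask / 1_000_000) * OUTPUT_PER_MTOK;
  return perTask * tasksPerDay;
}
```

At 1000 tasks/day with 5k input and 1k output tokens each, this comes out to about $30/day, in line with the range above.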

Strategy 1: Prompt Caching

Anthropic supports prompt caching for repeated system prompts and large context blocks. If your agent has a large system prompt that doesn't change, cache it—saves 90% on input token cost for cached portions:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Mark stable system prompt content for caching
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LARGE_CODEBASE_CONTEXT,  // Your large, stable context
      cache_control: { type: "ephemeral" },  // Cache this block
    },
  ],
  messages: [
    { role: "user", content: "Fix the bug in auth.ts" }
  ],
});

// Subsequent calls with the same cached block cost ~10x less
// Cache persists for ~5 minutes by default

Strategy 2: Model Routing

Not every task needs the most powerful (expensive) model. Route tasks to the cheapest model that can handle them:

// lib/model-router.ts
type TaskType = "classify" | "summarize" | "code" | "reason";

const MODEL_MAP: Record<TaskType, string> = {
  // Simple classification → cheapest model ($0.80/MTok input)
  classify: "claude-haiku-4-5-20251001",
  // Summarization → Haiku is usually enough ($0.80/MTok input)
  summarize: "claude-haiku-4-5-20251001",
  // Code generation → capable model ($3/MTok input)
  code: "claude-sonnet-4-6",
  // Complex reasoning → most capable ($15/MTok input)
  reason: "claude-opus-4-6",
};

export function selectModel(taskType: TaskType): string {
  return MODEL_MAP[taskType];
}

// Usage:
const model = selectModel("classify");  // haiku for cheap classification
const response = await client.messages.create({
  model,
  max_tokens: 100,
  messages: [{ role: "user", content: "Classify this issue: is it a bug or feature?" }],
});

Strategy 3: Response Caching

For deterministic queries—same input, same output—cache the LLM response. Issue classification, sentiment analysis, and label suggestions are all good candidates:

// lib/llm-cache.ts
import { db } from "./db";
import { llmCache } from "./schema";
import { eq } from "drizzle-orm";
import crypto from "crypto";

export async function cachedCompletion(
  prompt: string,
  fn: () => Promise<string>
): Promise<string> {
  const key = crypto.createHash("sha256").update(prompt).digest("hex");

  // Check cache first
  const cached = await db
    .select()
    .from(llmCache)
    .where(eq(llmCache.key, key))
    .get();

  if (cached) {
    log({ level: "debug", message: "LLM cache hit", key });
    return cached.value;
  }

  // Cache miss — call the LLM
  const result = await fn();

  // Store result (TTL: 24 hours)
  await db.insert(llmCache).values({
    key,
    value: result,
    expiresAt: new Date(Date.now() + 86400 * 1000),
  });

  return result;
}

// Usage:
const label = await cachedCompletion(
  `Classify issue: "${issue.title}"`,
  () => classifyWithClaude(issue)
);

6. Rate Limiting

Without rate limiting, a single bad actor or runaway script can exhaust your API quotas in minutes. I learned this when a test loop accidentally hammered The Website's /api/requests endpoint 400 times in 30 seconds.

Simple In-Memory Rate Limiter

For a single-instance deployment (or when eventual consistency is fine), an in-memory sliding window limiter is enough:

// lib/rate-limiter.ts
const requests = new Map<string, number[]>();

export function checkRateLimit(
  key: string,
  limit: number,
  windowMs: number
): { allowed: boolean; retryAfterMs?: number } {
  const now = Date.now();
  const windowStart = now - windowMs;

  // Get existing timestamps for this key, filter to current window
  const timestamps = (requests.get(key) ?? []).filter(
    (t) => t > windowStart
  );

  if (timestamps.length >= limit) {
    const oldestInWindow = timestamps[0];
    const retryAfterMs = oldestInWindow + windowMs - now;
    return { allowed: false, retryAfterMs };
  }

  // Record this request
  timestamps.push(now);
  requests.set(key, timestamps);

  return { allowed: true };
}

// Usage in API route:
export async function POST(req: Request) {
  const ip = req.headers.get("x-forwarded-for") ?? "unknown";

  const { allowed, retryAfterMs } = checkRateLimit(
    `create-issue:${ip}`,
    5,       // 5 requests
    60_000,  // per 60 seconds
  );

  if (!allowed) {
    return Response.json(
      { error: "Too many requests" },
      {
        status: 429,
        headers: { "Retry-After": String(Math.ceil((retryAfterMs ?? 0) / 1000)) },
      }
    );
  }

  // ... handle request
}

Distributed Rate Limiting with Upstash Redis

When you have multiple serverless instances, in-memory state isn't shared. Use Upstash Redis (serverless-friendly) for consistent limits:

import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, "60 s"),
  analytics: true,  // Track usage in Upstash console
});

export async function POST(req: Request) {
  const ip = req.headers.get("x-forwarded-for") ?? "anonymous";
  const { success, limit, remaining, reset } = await ratelimit.limit(ip);

  if (!success) {
    return Response.json(
      { error: "Rate limit exceeded", reset },
      {
        status: 429,
        headers: {
          "X-RateLimit-Limit": String(limit),
          "X-RateLimit-Remaining": String(remaining),
          "X-RateLimit-Reset": String(reset),
        },
      }
    );
  }

  // ... handle request
}

7. Caching Strategies

Caching is how you make an agent feel instantaneous without paying for instantaneous infrastructure. The Website caches GitHub issue data in Turso to avoid hitting GitHub's API on every page load.

The Three Levels of Caching

L1: In-Process Memory Cache

Fastest. Lives in the Node.js process. Lost on restart. Use for hot data that changes rarely: config, feature flags, model outputs.

const cache = new Map<string, { value: unknown; expiresAt: number }>();

export function memCache<T>(key: string, ttlMs: number, fn: () => T): T {
  const entry = cache.get(key);
  if (entry && entry.expiresAt > Date.now()) return entry.value as T;
  const value = fn();
  cache.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}
L2: Database Cache (Turso)

Medium speed. Persistent across restarts. The Website stores GitHub issues here. Shared across all instances.

// Store expensive API results in Turso
await db.insert(issueCache).values({
  id: issue.number,
  title: issue.title,
  body: issue.body,
  votes: issue.reactions["+1"],
  cachedAt: new Date(),
  expiresAt: new Date(Date.now() + 5 * 60 * 1000), // 5 min TTL
}).onConflictDoUpdate({ target: issueCache.id, set: { ... } });
L3: HTTP / CDN Cache

Fastest at scale. Responses cached at the CDN edge—Vercel does this automatically for static routes. Add cache headers for dynamic routes.

// Cache API response at CDN for 60 seconds
return Response.json(data, {
  headers: {
    "Cache-Control": "public, s-maxage=60, stale-while-revalidate=300",
  },
});

// Or use Next.js route config:
export const revalidate = 60; // revalidate every 60 seconds

Cache Invalidation: The Hard Part

The classic problem: when data changes, cached copies become stale. For The Website's issue cache, I use a simple TTL + event-based invalidation:

// When a user votes, immediately invalidate the specific issue cache
export async function POST(req: Request, { params }: { params: { id: string } }) {
  const issueId = Number(params.id);

  // 1. Update the vote via GitHub API
  await addReaction(issueId, "+1");

  // 2. Immediately sync the cache for this issue (don't wait for TTL)
  const freshData = await getIssue(issueId);
  await db
    .update(issueCache)
    .set({ votes: freshData.reactions["+1"], cachedAt: new Date() })
    .where(eq(issueCache.id, issueId));

  // 3. Revalidate the Next.js page cache so CDN serves fresh data
  revalidatePath("/");
  revalidatePath(`/requests/${issueId}`);

  return Response.json({ success: true });
}

8. Horizontal Scaling

Horizontal scaling means running multiple copies of your agent in parallel rather than making one instance bigger. This is how you handle traffic spikes, reduce per-task latency, and build fault tolerance.

Stateless Agents Scale Easily

The golden rule: make your agent stateless. Store all state in the database, not in memory. Then any instance can handle any request:

Stateful (hard to scale)

// State lives in memory
let taskQueue: Task[] = [];
let currentTask: Task | null = null;

// Instance A and Instance B have
// different queues → race conditions,
// duplicate work, inconsistency

Stateless (scales horizontally)

// State lives in database
const task = await db
  .update(tasks)
  .set({ status: "in_progress", workerId: MY_ID })
  .where(eq(tasks.status, "pending"))
  .returning()
  .get();
// Conditional update — any instance can claim work
// (single-row claim shown in the work-queue pattern below)

Work Queue Pattern

For background agents that process tasks, use a database-backed work queue. Multiple worker instances poll the queue; atomic claims prevent duplicates:

// lib/work-queue.ts — used by The Website's agent pipeline
import { randomUUID } from "crypto";
import { db } from "./db";
import { tasks } from "./schema";
import { eq, and, isNull, inArray } from "drizzle-orm";

const WORKER_ID = process.env.WORKER_ID ?? randomUUID();

export async function claimNextTask() {
  // Atomic claim: the subquery picks ONE pending task, and the update
  // only touches rows that are still unclaimed — so each task goes to
  // exactly one worker even with N instances polling
  const next = db
    .select({ id: tasks.id })
    .from(tasks)
    .where(and(eq(tasks.status, "pending"), isNull(tasks.workerId)))
    .limit(1);

  const task = await db
    .update(tasks)
    .set({
      status: "in_progress",
      workerId: WORKER_ID,
      startedAt: new Date(),
    })
    .where(and(inArray(tasks.id, next), isNull(tasks.workerId)))
    .returning()
    .get();

  return task ?? null;
}

export async function completeTask(taskId: string, result: unknown) {
  await db
    .update(tasks)
    .set({
      status: "completed",
      result: JSON.stringify(result),
      completedAt: new Date(),
    })
    .where(eq(tasks.id, taskId));
}

// Worker loop — run N instances in parallel for horizontal scale
async function workerLoop() {
  while (true) {
    const task = await claimNextTask();
    if (!task) {
      await sleep(5000);  // No work — poll again in 5s
      continue;
    }
    try {
      const result = await executeTask(task);
      await completeTask(task.id, result);
    } catch (error) {
      await failTask(task.id, error);
    }
  }
}

Handling Concurrency with GitHub Actions

The Website runs agent workers as GitHub Actions jobs. Each issue triggers a separate job, so multiple issues get processed concurrently:

# .github/workflows/agent.yml (simplified)
on:
  issues:
    types: [labeled]

jobs:
  process-issue:
    # Scope concurrency per issue: duplicate events for the same issue
    # queue up, while different issues run as parallel jobs
    concurrency:
      group: agent-worker-${{ github.event.issue.number }}
      cancel-in-progress: false

    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          # Each job is an isolated worker — no shared state
          node scripts/process-issue.js ${{ github.event.issue.number }}

How The Website currently scales

The Website's backend is stateless Next.js on Vercel — scales to zero automatically, handles traffic spikes with no config. The agent pipeline uses GitHub Actions with up to 20 concurrent jobs. Turso handles database reads globally via replicas. Total: $0 incremental cost until ~50k monthly active users.

9. Production Checklist

Before shipping an AI agent to production, run through this checklist. Every item represents something I either got wrong myself or have seen fail in the wild.

Environment

  • All secrets stored in env vars, not hardcoded
  • Startup validation rejects missing env vars immediately
  • .env.example committed with all required keys
  • Separate API keys for dev, staging, production

Database

  • Connection singleton prevents connection pool exhaustion
  • Migrations run before deployment (not at startup)
  • Read replicas used for read-heavy operations
  • Schema validated against production before deploy

Reliability

  • Unhandled promise rejections are caught and logged
  • External API calls have timeouts (never leave open-ended)
  • Retry logic for transient failures (exponential backoff)
  • Circuit breaker for repeatedly failing dependencies
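The timeout, retry, and backoff items can be covered by two small utilities. A sketch with illustrative names and defaults:

```typescript
// Reject a promise that takes longer than ms (never leave calls open-ended)
export async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer!);
  }
}

// Retry transient failures with exponential backoff: 200ms, 400ms, 800ms, ...
export async function retry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Wrap external calls as `retry(() => withTimeout(callClaude(), 30_000))` so a hung API can't stall a task forever.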

Observability

  • Structured JSON logging in production
  • Error tracking (Sentry) configured with environment tags
  • Token usage logged per task for cost monitoring
  • Alerts set for error rate > 1% or P95 latency > 5s

Cost & Security

  • Rate limiting on all public-facing endpoints
  • LLM response caching for deterministic queries
  • Model routing: cheap models for simple tasks
  • Max token limits set on all LLM calls (no runaway spend)

Exercises

1. Deploy your agent to Vercel

Take the agent you built in Module 2 and deploy it. Add an /api/run endpoint that accepts a task via POST and runs the agent. Verify it works via curl after deploy.

2. Add environment validation

Implement the validateEnv() pattern from Section 2. Call it at the start of your instrumentation.ts (Next.js startup hook) so a missing secret fails the deploy rather than silently breaking in production.

3. Instrument with structured logging

Add the log() utility from Section 4 to your agent. Log every task start, completion, failure, tokens used, and duration. Then query your Vercel function logs to find the slowest task type.

4. Add rate limiting to an API route

Protect your agent's public endpoint with rate limiting. Allow 10 requests per minute per IP. Return a proper 429 response with a Retry-After header. Test it by writing a quick script that fires 15 requests rapidly and confirm it gets rate limited.
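That quick script can be as simple as firing parallel fetches and counting 429s; a sketch, with the endpoint URL as a placeholder:

```typescript
// Count how many of the rapid-fire requests were rejected with 429
export function countRateLimited(statuses: number[]): number {
  return statuses.filter((s) => s === 429).length;
}

async function main() {
  const url = "http://localhost:3000/api/run"; // placeholder endpoint
  const statuses = await Promise.all(
    Array.from({ length: 15 }, () =>
      fetch(url, { method: "POST" }).then((r) => r.status)
    )
  );
  // With a 10 req/min limit, roughly 5 of the 15 should be limited
  console.log(`${countRateLimited(statuses)}/15 requests hit the rate limit`);
}
```

Call main() against your running endpoint and check that the count is nonzero.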

Stretch: Implement model routing

Audit every LLM call in your agent and classify each as: classify, summarize, code, or reason. Implement the model router from Section 5 so that only reasoning tasks use Opus, and everything else uses Haiku or Sonnet. Measure the cost reduction over 100 test runs.

Key Takeaways

  1. Start with Vercel for web-facing agents. Railway or fly.io for long-running background agents. You can always migrate later.

  2. Validate env vars at startup. Silent failures from missing secrets are the hardest bugs to debug in production.

  3. Turso replication gives you global read latency under 10ms with zero schema changes. Add replicas before you need them.

  4. Token usage per task is your most important cost and quality metric. If it rises, the agent is degrading.

  5. LLM costs compound. Add prompt caching, model routing, and response caching before you hit scale, not after.

  6. Make agents stateless. All state in the database means any instance can handle any request—horizontal scaling becomes trivial.

  7. Rate limit everything public-facing. You will get hammered—whether by bots, a buggy client, or your own test scripts.