Technical Guide
Everything engineers need to know about Seracade: setup, audit mechanics, proxy architecture, and security model.
1. Overview
Seracade is an automated LLM cost audit and optimization proxy. It intercepts your existing LLM API calls via a single environment variable change, logs a statistically significant sample of requests, replays them against alternative models, and delivers a report showing exactly where you can reduce spend without degrading output quality. After the audit, an optional paid proxy continuously routes each call to the most cost-effective model that meets your quality threshold.
Architecture
- Log each request/response pair to Cloudflare KV
- At call threshold, send logs to webhook server
- Replay each call against alternative models
- Score output quality with 95% confidence intervals
- Generate HTML savings report
- Email report to customer
2. Setup
Prerequisites
- Node.js 20+ (for local tooling and the setup script)
- A `.env` file in your project root with your LLM API keys
- At least one active provider: OpenAI, Anthropic, or Google
Option A: Browser setup
Walk through the guided setup at /onboarding. It detects your providers, generates the correct env vars, and triggers your first audit automatically.
Option B: Terminal (one command)
Option C: Manual setup
Add the proxy base URL for each provider you use. Your API keys stay in your local .env and are never sent to Seracade.
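For a project using all three providers, the manual additions to `.env` look like the following (base URLs as listed in the Supported Providers table; your existing `*_API_KEY` lines stay unchanged):

```shell
# Added by manual setup: route each provider through the Seracade proxy.
OPENAI_BASE_URL=https://seracade.com/v1
ANTHROPIC_BASE_URL=https://seracade.com
GOOGLE_API_BASE_URL=https://seracade.com
```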
What the setup script does
- Detects your project root and locates the `.env` file
- Backs up `.env` to `.env.seracade-backup-YYYYMMDD-HHMMSS`
- Scans for existing `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and `GOOGLE_API_KEY`
- Adds the corresponding `*_BASE_URL` env vars pointing to the Seracade proxy
- Sends a registration webhook with a SHA-256 hash of your API key (not the key itself) so the proxy can associate your traffic
- Runs a single test call to confirm the proxy is reachable
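The key-hashing step is ordinary SHA-256 over the raw key. A minimal sketch of what the setup script computes (illustrative, not the script's actual code; `registration_hash` is a hypothetical name):

```python
import hashlib

def registration_hash(api_key: str) -> str:
    """Hex SHA-256 digest of the key; only this digest is sent to Seracade."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()

# The digest identifies your account without ever transmitting the key itself.
print(registration_hash("sk-example-key"))
```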
3. How the Audit Works
Call logging
The Seracade proxy intercepts each LLM request, logs the request/response metadata to Cloudflare KV, and passes the original call through to the provider unchanged. Your application receives the same response it would without the proxy. Latency overhead is sub-millisecond (edge routing only, no payload inspection in the hot path).
50-call trigger
Once 50 calls have been logged for your account, the first audit kicks off automatically. No manual trigger needed. Subsequent audits trigger at 100, 200, 350, and 500 calls, each refining the analysis. Most teams hit the first threshold within a few hours.
Replay process
The audit engine replays your logged calls against a dynamic pool of up to 10 alternative models, ranked by price-performance. The pool refreshes every 6 hours from OpenRouter's full catalog of 348+ models. As of April 2026, the pool typically includes models from OpenAI, Anthropic, Google, xAI, DeepSeek, and others. You can view the current pool at /frontier.
Each call is replayed with identical inputs: same system prompt, same user message, same parameters. Replays route through OpenRouter using Seracade's API key, so there is no cost to you during the audit.
Quality scoring
Each alternative model's output is scored against your original response on three dimensions:
- Semantic similarity: does the meaning match?
- Format preservation: does the structure (JSON, markdown, lists) match?
- Task completion: does the output accomplish what the prompt asked for?
Results are reported with 95% confidence intervals so you see statistical reliability, not just averages.
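The 95% interval follows the standard normal approximation. A minimal sketch of how a mean score and its band might be computed (this is an illustration, not Seracade's scoring code):

```python
import math
from statistics import mean, stdev

def score_with_ci(scores: list[float]) -> tuple[float, float, float]:
    """Mean quality score with a 95% confidence interval (normal approximation)."""
    m = mean(scores)
    half = 1.96 * stdev(scores) / math.sqrt(len(scores))  # margin of error
    return m, m - half, m + half

m, lo, hi = score_with_ci([0.91, 0.88, 0.94, 0.90, 0.87])
print(f"{m:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```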
Report generation and delivery
After replay and scoring complete, the audit engine generates a per-endpoint report showing which calls can safely use an alternative model and which ones genuinely need the model you are paying for. The report is emailed to the address associated with your account and is also available in the dashboard.
4. How the Paid Proxy Works
Model routing
Based on the audit results, the proxy builds a per-endpoint routing table. Each endpoint (identified by system prompt hash + model + path) is mapped to the cheapest model that met your quality threshold. You can pin any endpoint to a specific model to override the automatic routing.
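A routing-table entry can be pictured as a map from an endpoint key (system-prompt hash + model + path) to the cheapest model that cleared the threshold. An illustrative sketch with made-up prices and scores (the function names and candidate data are hypothetical):

```python
import hashlib

def endpoint_key(system_prompt: str, model: str, path: str) -> str:
    """Stable endpoint identity: system-prompt hash + model + path."""
    return f"{hashlib.sha256(system_prompt.encode()).hexdigest()}:{model}:{path}"

def cheapest_qualifying(candidates: list[dict], threshold: float) -> str:
    """Pick the lowest-cost candidate whose quality met the threshold."""
    ok = [c for c in candidates if c["quality"] >= threshold]
    return min(ok, key=lambda c: c["cost_per_1k"])["model"]

candidates = [
    {"model": "gpt-4.1",          "cost_per_1k": 2.00, "quality": 1.00},
    {"model": "gpt-4.1-mini",     "cost_per_1k": 0.40, "quality": 0.96},
    {"model": "claude-haiku-4-5", "cost_per_1k": 0.25, "quality": 0.91},
]
print(cheapest_qualifying(candidates, threshold=0.95))  # gpt-4.1-mini
```

Pinning an endpoint simply overrides this lookup with a fixed model for that key.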
Response caching
The proxy uses hash-based deduplication for repeat prompts. If the same input (system prompt + user message + parameters) is seen within the cache TTL, the cached response is returned instantly. Cache hit rates of 10 to 25% are typical for production workloads with templated prompts.
Provider fallback
If the target model returns a rate limit (429) or server error (5xx), the proxy automatically retries with the next cheapest qualifying model from the routing table. Failover is transparent to your application. No code changes, no retry logic on your side.
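The failover behavior amounts to walking the routing table in cost order on a 429 or 5xx. A simplified sketch with a stand-in `send` function (the real proxy does this server-side):

```python
def call_with_fallback(prompt: str, route: list[str], send) -> tuple[str, str]:
    """Try each qualifying model cheapest-first; fall through on 429/5xx.

    `send(model, prompt)` is a stand-in for the provider call and
    returns (status_code, text).
    """
    last_status = None
    for model in route:
        status, text = send(model, prompt)
        if status == 429 or status >= 500:
            last_status = status
            continue  # transparent failover to the next qualifying model
        return model, text
    raise RuntimeError(f"all models failed (last status {last_status})")
```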
Cost attribution
Tag calls with an `x-seracade-feature` header to get per-feature cost breakdowns in the dashboard. See exactly what each part of your product costs in LLM spend.
Budget caps and spend alerts
Set a monthly budget cap per feature or account-wide. When spend reaches 80% of the cap, you receive an email alert. At 100%, the proxy can either hard-stop (reject calls) or soft-stop (continue but alert on every call). Configure this in the dashboard.
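The 80%/100% thresholds can be expressed as a simple policy check. An illustrative sketch of the behavior described above (function and return values are hypothetical labels, not API names):

```python
def budget_action(spend: float, cap: float, hard_stop: bool) -> str:
    """Map current spend against the cap to the documented behaviors."""
    if spend >= cap:
        # At 100%: reject calls (hard-stop) or keep serving but alert (soft-stop).
        return "reject" if hard_stop else "alert_every_call"
    if spend >= 0.8 * cap:
        return "email_alert"  # at 80%: one email alert
    return "ok"

print(budget_action(85.0, 100.0, hard_stop=True))   # email_alert
print(budget_action(100.0, 100.0, hard_stop=True))  # reject
```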
5. Security
API keys never leave your machine
Your LLM provider API keys remain in your local .env file. The SDK reads them at call time and sends them directly to the proxy over HTTPS. The proxy forwards them to the provider in the same request. Keys are never logged, stored, or persisted by Seracade.
What we see vs. what we don't
| We See | We Don't See |
|---|---|
| Request/response metadata (model, token count, latency) | Your raw API keys (only a SHA-256 hash for identification) |
| Prompt/completion content during replay (in-memory only) | Any data after the audit completes (discarded from memory) |
| Aggregated cost and quality metrics | Your source code, infrastructure, or internal systems |
Request/response handling
During the audit replay phase, request/response pairs are held in memory for scoring. After the quality scores are computed, the raw content is discarded. No prompts or completions are written to disk or persistent storage in plaintext.
Customer identification
Seracade identifies your account using a SHA-256 hash of your API key. The hash is computed locally by the setup script and sent during registration. The actual key is never transmitted to Seracade infrastructure.
Transport security
HTTPS is enforced on all proxy endpoints. The proxy runs on Cloudflare Workers, which terminates TLS at the edge. Connections to upstream LLM providers also use HTTPS exclusively.
6. Supported Providers
| Provider | Env Var | Proxy URL |
|---|---|---|
| OpenAI | OPENAI_BASE_URL | https://seracade.com/v1 |
| Anthropic | ANTHROPIC_BASE_URL | https://seracade.com |
| Google | GOOGLE_API_BASE_URL | https://seracade.com |
7. Undo / Uninstall
Removing Seracade takes one command. The setup script creates a timestamped backup of your .env before making any changes.
Restore from backup
Or manually remove the proxy URLs
Delete these lines from your .env:
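The exact set depends on which providers the setup script configured; for a project using all three, the added lines are the base-URL overrides (leave your `*_API_KEY` lines in place):

```shell
OPENAI_BASE_URL=https://seracade.com/v1
ANTHROPIC_BASE_URL=https://seracade.com
GOOGLE_API_BASE_URL=https://seracade.com
```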
8. Agent Routing
AI agents make hundreds or thousands of API calls per task across different task types. Seracade provides first-class support for identifying, tracking, and optimizing agent costs.
Identify your agents
Add the X-Seracade-Agent header to any request to tag it with an agent name. Seracade tracks usage, cost, and savings per agent automatically.
# Python (OpenAI SDK)
client = OpenAI(base_url="https://seracade.com/v1")
response = client.chat.completions.create(
model="gpt-4.1",
messages=[{"role": "user", "content": "Review this code"}],
extra_headers={"X-Seracade-Agent": "code-reviewer"}
)
# cURL
curl https://seracade.com/v1/chat/completions \
-H "Authorization: Bearer YOUR_KEY" \
-H "X-Seracade-Agent: code-reviewer" \
-d '{"model":"gpt-4.1","messages":[{"role":"user","content":"Review this code"}]}'
Per-agent cost dashboard
Query your agent stats at any time:
GET /api/agents/stats?key_hash=YOUR_HASH
Returns each agent's total calls, cost, savings, task type breakdown, models used, average cost per call, and an efficiency score (quality achieved per dollar spent).
Budget controls
Set per-agent monthly budget limits to prevent runaway costs:
POST /api/agents/budget
{
"key_hash": "YOUR_HASH",
"agent_name": "code-reviewer",
"monthly_budget_usd": 50,
"action_on_exceed": "downgrade"
}
Three actions when a budget is exceeded:
- block — returns 429, agent must handle the error
- downgrade — automatically routes to the cheapest model in the routing table
- alert — sends an email alert, continues routing normally
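The three actions above can be pictured as a dispatch on the budget check. An illustrative sketch (not the server's code; the routing-table shape is assumed):

```python
def on_budget_exceeded(action: str, routing_table: list[dict]) -> dict:
    """Sketch of the three documented `action_on_exceed` behaviors."""
    if action == "block":
        return {"status": 429, "body": "agent budget exceeded"}
    if action == "downgrade":
        # Route to the cheapest model in the routing table.
        cheapest = min(routing_table, key=lambda r: r["cost_per_1k"])
        return {"status": 200, "route_to": cheapest["model"]}
    if action == "alert":
        return {"status": 200, "email_alert": True}
    raise ValueError(f"unknown action: {action}")
```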
Agent cost reports
Generate a per-agent cost efficiency report:
GET /api/agents/report?key_hash=YOUR_HASH
Shows each agent's cost efficiency, which models it uses, which task types, and recommended routing changes. Available as JSON (default) or HTML (via Accept header).
Response headers
When X-Seracade-Agent is present, every response includes:
- `X-Seracade-Agent-Cost` — the cost of that specific call (e.g., `0.001200`)
- `X-Seracade-Routed` — if the call was routed to a cheaper model, shows `original -> routed (task_type)`
- `X-Seracade-Task` — the detected task type for this call
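Because the `X-Seracade-Routed` value follows the `original -> routed (task_type)` shape, it can be parsed mechanically. A small sketch:

```python
import re

def parse_routed(header: str) -> dict:
    """Split 'original -> routed (task_type)' into its three parts."""
    m = re.fullmatch(r"(\S+) -> (\S+) \((\w+)\)", header.strip())
    if not m:
        raise ValueError(f"unexpected header format: {header!r}")
    return {"original": m.group(1), "routed": m.group(2), "task_type": m.group(3)}

print(parse_routed("gpt-4.1 -> claude-haiku-4-5 (summarization)"))
```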
Framework integration
Any framework that supports OPENAI_BASE_URL works with Seracade. Set the env var and add the agent header in your framework's config:
# Environment
OPENAI_BASE_URL=https://seracade.com/v1
# CrewAI — set in your agent's llm config
# LangGraph — set in your ChatOpenAI constructor
# AutoGen — set in your OAI config list
Auto mode (no human in the loop)
Auto mode removes every human gate from the optimization flow. Seracade audits your traffic, activates routing when quality thresholds pass, monitors continuously, and rolls back automatically if quality degrades. No report to review, no button to click.
Enable auto mode
curl -X POST https://seracade.com/api/mode \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{"mode":"auto"}'
Returns {"ok": true, "mode": "auto", "customer_id": "sha256_hash"}. To revert: send {"mode":"manual"}.
Check current mode
GET /api/mode
Authorization: Bearer YOUR_API_KEY
What auto mode does
- Audit triggers at 50 calls, same as manual mode
- When the audit generates a routing table and all recommended swaps meet the 95% quality threshold, routing activates on the next request. No email, no confirmation.
- Continuous quality monitoring samples routed calls and scores them against the original model. If quality drops below threshold, the affected route is rolled back automatically.
- A circuit breaker bypasses routing for 5 minutes if proxy latency exceeds acceptable thresholds, then re-enables.
- After 24 hours of traffic, per-agent budgets are auto-inferred at 3x observed daily spend with action set to alert. These are flagged `auto_inferred: true` in the API response.
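The auto-inferred budget rule (3x observed daily spend, action set to alert) can be sketched as follows; mapping the result onto the `monthly_budget_usd` field of the budget API is an assumption for illustration:

```python
def infer_budget(daily_spend_usd: float) -> dict:
    """Per-agent budget inferred at 3x observed daily spend (sketch)."""
    return {
        "monthly_budget_usd": round(3 * daily_spend_usd, 2),  # 3x daily spend
        "action_on_exceed": "alert",
        "auto_inferred": True,  # flagged so you can tell it from a manual budget
    }

print(infer_budget(4.10))
```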
Monitor programmatically
GET /api/status/YOUR_CUSTOMER_ID
Authorization: Bearer YOUR_API_KEY
Returns audit state, active routes, quality scores, savings totals, agent breakdowns, and budget status as JSON. The API key must hash to the requested customer ID or the request returns 401.
No emails in auto mode
Auto mode suppresses all email: audit reports, drip sequences, and savings summaries. Monitoring is API-only. If you want email notifications alongside auto mode, set mode to auto and separately configure an alert email via POST /api/agents/budget with "action_on_exceed": "alert".
9. Additional FAQ
What counts as a "call" for the audit threshold?
Any completion request that passes through the proxy. Chat completions, function calls, embeddings. Each request/response pair counts as one call. Most teams hit the threshold within the first day.
How does quality scoring work?
Seracade replays your exact production inputs against each alternative model, then scores the output against your original response using semantic similarity, format preservation, and task completion. Results are reported with 95% confidence intervals so you see statistical reliability, not just averages.
What if my use case is too specialized for alternative models?
That happens. Roughly 20 to 30% of endpoints show no viable alternative. The audit report tells you which endpoints can save money and which ones genuinely need the model you're using. Both findings are valuable.
Does the proxy add latency?
The routing decision adds sub-millisecond overhead. The proxy runs on Cloudflare's edge network. If an alternative model responds faster (Haiku and Flash typically do), you may see lower total latency.
Can I exclude specific endpoints from optimization?
Yes. Seracade Pro lets you pin any endpoint to a specific model. If your summarization pipeline needs Claude Opus 4, lock it. Seracade only optimizes the endpoints you allow.
10. FAQ
Common questions from customers are answered on the landing page. See the full FAQ section for details on pricing, data handling, latency impact, model pinning, and more.
11. Capabilities Reference
Seracade exposes six named capabilities. Every customer-facing artifact (the routing proxy, the audit report, the dashboard, the MCP server) is built from this set. The names are stable; downstream tooling can integrate against them.
Call
Every inference request through Seracade is a Call. Each Call is addressable by ID and carries full metadata: model used, model considered, tokens, latency, cost, classification, Quality Score, route reason, and the customer's active Quality Gate. Every routed /v1/chat/completions response carries:
- `X-Seracade-Call-Id` — the Call ID, used as input to `/v1/replay`
- `X-Seracade-Calibration-Version` — the calibration version under which the Score was produced
- `X-Seracade-Score-Disclaimer` — "Subject to recalibration. Customer-private. Not for redistribution."
- `X-Seracade-Routed` — original model, routed model, and task type
Step
Agent workflows are sequences of Calls grouped into Steps. A planning Step, a tool-selection Step, a summarization Step, and a code-generation Step in the same agent session are classified and routed independently. Identify Steps via optional headers on /v1/chat/completions:
- `X-Seracade-Step-Id` — opaque identifier for the Step within an agent trajectory
- `X-Seracade-Parent-Call-Id` — ID of the upstream Call that produced this Step's input
- `X-Seracade-Trace-Id` — opaque identifier for the full agent run
Headers are additive. Frameworks that emit step identifiers are picked up automatically; frameworks that do not continue to work at the Call level. Step headers are stripped before the upstream provider sees the request and persisted as opaque tags on the Call record.
# Python (OpenAI SDK)
response = client.chat.completions.create(
model="gpt-4.1",
messages=[{"role": "user", "content": "Plan the next action"}],
extra_headers={
"X-Seracade-Step-Id": "plan-001",
"X-Seracade-Trace-Id": "agent-run-2026-04-25-7f3a",
},
)
Quality Score
Every Call receives a calibrated, version-stamped Quality Score produced from the customer's own traffic. Each Score carries calibration_version and sample_size and a "subject to recalibration" disclaimer. Scores below the per-customer suppression threshold are withheld rather than reported with wide confidence bands. Scores are visible to the customer on their own data only; methodology is not published.
Routing Decision
The routing table is a set of Routing Decisions. Each maps (task_type, Quality Gate threshold) to the candidate model set that clears the threshold, with evidence (sample_size, score distribution, calibration_version, last_rebalance timestamp). Customers can fetch their routing table via the dashboard or the audit report.
Counterfactual Replay
Replay re-runs any historical Call against a specified set of alternate models and returns counterfactual cost and Quality Score per candidate, version-stamped. Use Replay to validate Routing Decisions, build cost-regression tests in CI, or evaluate new models against past traffic before promoting them.
# Replay a Call against two candidate models
curl https://seracade.com/v1/replay \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"call_id": "log:<customer>:<ts>:<uuid>",
"candidate_models": ["claude-haiku-4-5", "gpt-4.1-mini"]
}'
Replay output is private to the requesting customer and is not for redistribution. Pricing matches routed Calls (15% of the absolute price difference per Replay, free until $500/month of cumulative routing value).
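The pricing rule (15% of the absolute price difference per Replay, free until $500/month of cumulative routing value) can be sketched as a small function; the argument names are illustrative:

```python
def replay_fee(original_cost: float, candidate_cost: float,
               cumulative_routing_value: float) -> float:
    """15% of the absolute price difference; free under $500/month of value."""
    if cumulative_routing_value < 500.0:
        return 0.0  # still inside the free tier
    return 0.15 * abs(original_cost - candidate_cost)

# Past the free tier, a $0.0200 call replayed against a $0.0045 candidate:
print(replay_fee(0.0200, 0.0045, cumulative_routing_value=812.0))
```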
Quality Gate
Quality Gates are customer-set policies that constrain the routing table. Declare each Gate as a (task_type, minimum_score) rule; Seracade never routes a Call below an active Gate, even if a cheaper model passes the global threshold. Use Gates to hold quality high on legal, healthcare, financial, or other sensitive task types while cost-routing on summarization, extraction, and classification.
# Set a Gate: never substitute below 0.90 on coding tasks
curl https://seracade.com/v1/gates \
-X POST \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"gates": {"coding": {"minimum_score": 0.90}}}'
# Read active Gates
curl https://seracade.com/v1/gates \
-H "Authorization: Bearer YOUR_KEY"
# Remove one Gate
curl https://seracade.com/v1/gates \
-X DELETE \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"task_type": "coding"}'
Valid task types: extraction, classification, function_calling, summarization, rewriting, qa_rag, reasoning, coding, other. The minimum_score is a number between 0 and 1.
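A client-side check of a gates payload against these constraints might look like the following sketch (the server's own validation is not published):

```python
VALID_TASK_TYPES = {
    "extraction", "classification", "function_calling", "summarization",
    "rewriting", "qa_rag", "reasoning", "coding", "other",
}

def validate_gates(gates: dict) -> None:
    """Raise ValueError on an unknown task type or out-of-range score."""
    for task_type, rule in gates.items():
        if task_type not in VALID_TASK_TYPES:
            raise ValueError(f"unknown task type: {task_type}")
        score = rule.get("minimum_score")
        if not isinstance(score, (int, float)) or not 0 <= score <= 1:
            raise ValueError(f"minimum_score must be in [0, 1], got {score}")

validate_gates({"coding": {"minimum_score": 0.90}})  # passes silently
```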