Helicone Features Deep Dive — From Tracing to Prompt Management

Master session tracing, semantic caching, rate limiting, and prompt versioning. Transform basic logging into a production control plane with header-based configuration.

February 12, 2026 · 16 min read · Zubair Ashfaque
Tags: Helicone, LLM Observability, Caching, Rate Limiting, Prompt Engineering

The Motivation

In Part 1, we transformed LLM applications from black boxes into observable systems. With two lines of code, you gained visibility into every request—costs, latency, tokens, full request/response bodies. You can now answer "How much did we spend last week?" and "Which users consume the most tokens?"

But visibility alone doesn't give you control.

Your dashboard shows a 47-step multi-agent workflow, but you can't tell which agent is the bottleneck. You're making the same GPT-4 call hundreds of times per day with identical prompts, wasting $200/month on redundant requests. Users regularly exceed their intended budgets because there are no guardrails. Your prompt evolved through six iterations in production, and nobody documented which version is currently deployed. When OpenAI rate limits your API key, your entire application crashes instead of gracefully falling back to Claude.

Production LLM systems need more than logs—they need session tracing to understand agent workflows, caching to eliminate redundant costs, rate limiting to enforce budgets, prompt versioning to track changes, and provider fallbacks to maintain reliability.

This guide answers those questions with a complete blueprint for transforming basic observability into a production control plane. By the end, you'll implement session tracing for multi-agent systems, configure semantic caching with TTL policies, set up granular rate limiting, manage prompt versions through code, and establish retry+fallback chains, all using Helicone's header-based approach.

Key Pattern: Every advanced Helicone feature uses headers. No SDK changes, no middleware installation, no code refactoring. Add Helicone-Cache-Enabled: true to enable caching. Add Helicone-Session-Id to trace workflows. Add Helicone-RateLimit-Policy to enforce quotas. Control everything through HTTP headers.
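To make the pattern concrete, here is a tiny sketch that assembles feature headers for a request. The `helicone_headers` helper is illustrative, my own invention rather than part of any SDK, but the header names themselves are the documented Helicone ones:

```python
def helicone_headers(session_id=None, session_path=None, cache_ttl=None, rate_policy=None):
    """Build an extra_headers dict; each feature is opt-in via its header."""
    headers = {}
    if session_id is not None:
        headers["Helicone-Session-Id"] = session_id       # workflow tracing
    if session_path is not None:
        headers["Helicone-Session-Path"] = session_path   # position in the tree
    if cache_ttl is not None:
        headers["Helicone-Cache-Enabled"] = "true"        # response caching
        headers["Cache-Control"] = f"max-age={cache_ttl}"
    if rate_policy is not None:
        headers["Helicone-RateLimit-Policy"] = rate_policy  # quota enforcement
    return headers

# Caching plus a per-user quota, nothing else:
print(helicone_headers(cache_ttl=86400, rate_policy="100;w=3600;s=user"))
```

Pass the resulting dict as `extra_headers` to any OpenAI-compatible client call; features you omit simply stay off.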

The Challenge

The challenge is that visibility without control creates a false sense of security. You can see your LLM requests in a dashboard, but that doesn't prevent cost overruns, enable workflow tracing, or let you roll back problematic prompt changes. Observability showed you the problem—now you need production-grade controls to actually manage it.

Consider a healthcare AI triage system that uses multiple agents: one agent handles initial symptom intake, another performs diagnostic analysis, a third reviews lab results, and a fourth generates the final report. With basic logging from Part 1, you can see that this workflow cost $1.20. But you can't see which agent consumed the most budget, you can't trace the conversation thread through all four steps, and you can't identify bottlenecks in the multi-agent pipeline.

Or imagine your application makes the same diabetes diagnosis query hundreds of times per day: "What is diabetes?" Basic logging shows you made 400 identical requests costing $0.03 each—that's $12/day or $360/month wasted on redundant work. GPT-4's response never changes for this deterministic question, yet you're paying full price every time because you have no caching layer.

Then there's the budget problem. A user accidentally triggers an infinite loop that makes 10,000 API calls in an hour, costing $450 before you notice. Your dashboard shows the spike, but it happened in the past—you have no rate limits or quotas to prevent it. Another team member tweaks a prompt in production, response quality tanks, but nobody documented which prompt version is deployed, so rolling back requires guesswork.

When OpenAI hits rate limits at 3PM on a Tuesday, your entire application crashes with 429 errors. You're paying for Claude and Gemini subscriptions as backups, but you have no automatic fallback logic to switch providers when primary models fail.

The production reality is clear: visibility shows you problems after they happen, but only controls can prevent them from happening again.

Helicone transforms observability into a control plane by exposing all these features through HTTP headers. No SDKs, no middleware, no code changes beyond adding headers to requests. This is production-grade LLM infrastructure delivered with the same simplicity that made Part 1's basic logging possible.

Lucifying the Problem

Let's lucify this concept with an everyday analogy.

Imagine running a busy restaurant kitchen during dinner rush. Orders come in from the dining room, and your kitchen staff needs to coordinate to fulfill them. One cook handles appetizers, another manages main courses, a third works the grill, and a fourth plates desserts. Each dish for a single table travels through multiple stations before going out together.

Now imagine you added a kitchen display system (KDS) that tracks every order. You can see the full order ticket follow each dish through the prep stations—the scallops go from prep to grill to plating, and you can trace the entire path. That's session tracing: following a single conversation through multiple agent steps with hierarchical paths like /triage/analysis/lab-review.

Next, you realize you're making the same hollandaise sauce from scratch 50 times a night. It's always the same recipe, takes 8 minutes each time, and uses expensive ingredients. So you prep a large batch in advance and portion it out when orders arrive. That's caching: making identical LLM calls once, storing the response with a TTL, and reusing it instead of paying full price repeatedly.

During peak hours, you can only handle 120 orders per hour before quality degrades. So you implement table reservations to limit seating and prevent kitchen overload. That's rate limiting: enforcing quotas like "100 requests per user per hour" or "$5 per user per day" to prevent cost overruns.

When your fish supplier runs out of salmon at 7PM, you don't close the kitchen—you automatically call your backup supplier and substitute with tuna for remaining orders. That's provider fallback: when OpenAI hits rate limits, Helicone automatically retries with Claude or Gemini instead of failing the request.

Limitation of this analogy: Restaurant kitchens have physical space constraints and ingredient spoilage, while LLM requests are stateless and cached responses don't "go bad" (though they do have TTLs). Also, kitchen stations work in serial (one step after another), while LLM multi-agent workflows can be parallel or conditional. But the core pattern holds: complex operations need tracing, repeated work needs caching, capacity needs limiting, and failures need fallback plans.

Lucifying the Tech Terms

To solve this, we first need to lucify the key technical terms that enable production control. Understanding these five concepts will clarify how Helicone transforms basic logging into a full observability and control plane.

Session Tracing

Definition: Session tracing groups related LLM requests into hierarchical trees using session IDs and paths, allowing you to follow a single conversation or workflow through multiple agent steps and understand parent-child relationships between requests.

Simple Example: A patient query flows through 4 agents with paths /triage, /triage/analysis, /triage/analysis/lab-review, and /triage/report. In the dashboard, you see a collapsible tree showing which agent cost the most, which was slowest, and the full conversation thread.

Analogy: Session tracing is like tracking a package through the postal system. Each step (sorting facility, delivery truck, local post office) gets its own scan, but they're all linked by the same tracking number so you can see the complete journey from origin to destination.

Semantic Caching

Definition: Semantic caching stores LLM responses by request hash and reuses them for identical prompts within a TTL window, eliminating redundant API calls. A cache hit returns the stored response instantly at near-zero cost instead of making a new $0.03 call.

Simple Example: Your app asks "What is diabetes?" 400 times/day. With Helicone-Cache-Enabled: true and max-age=86400 (24 hours), the first call costs $0.03 and goes to GPT-4. The next 399 calls hit cache and cost $0, saving $11.97/day ($359/month).

Analogy: Semantic caching is like photocopying a document instead of retyping it every time. The first copy takes 5 minutes to create from scratch, but subsequent copies take 10 seconds and cost pennies. Why recreate identical work?

Bucket Caching

Definition: Bucket caching caches non-deterministic prompts (temperature > 0) by grouping similar requests into "buckets" of size N and returning any response from that bucket randomly. This enables approximate caching when exact matching is impossible due to randomness.

Simple Example: You generate 5 creative product descriptions for "wireless headphones" with temperature=0.7. With Helicone-Cache-Bucket-Max-Size: 5, the first 5 requests populate the bucket with different responses. Request #6 randomly returns one of those 5 instead of calling the API, saving cost while maintaining variety.

Analogy: Bucket caching is like a restaurant's "chef's special" where the dish varies slightly each night. Instead of creating a new special from scratch every time, the chef rotates through 5 pre-made variations. Diners still get variety, but the kitchen saves prep time.
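The bucket behavior can be simulated locally in a few lines. This is an illustrative model of the concept, not Helicone's internals: the first N distinct responses fill the bucket, and later identical requests are served randomly from it instead of paying for a new call.

```python
import random

class BucketCache:
    """Toy model of bucket caching: up to max_size responses per prompt."""
    def __init__(self, max_size):
        self.max_size = max_size
        self.buckets = {}  # prompt -> list of cached responses

    def get_or_call(self, prompt, call_api):
        bucket = self.buckets.setdefault(prompt, [])
        if len(bucket) < self.max_size:
            bucket.append(call_api(prompt))   # cache miss: pay for the call
            return bucket[-1]
        return random.choice(bucket)          # cache hit: free, still varied

calls = 0

def fake_api(prompt):
    """Stand-in for a temperature=0.7 LLM call; counts how often it runs."""
    global calls
    calls += 1
    return f"description #{calls}"

cache = BucketCache(max_size=5)
for _ in range(20):
    cache.get_or_call("wireless headphones", fake_api)

print(calls)  # 5 — only the first 5 of 20 requests reached the "API"
```

Requests 6 through 20 cost nothing while still returning one of five distinct variations.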

Rate Limiting Policy

Definition: A rate limiting policy defines a quota (max requests or cost), a time window, a unit (requests or cents), and an optional segment (global, per-user, per-property) using the syntax quota;w=window;u=unit;s=segment. Helicone enforces this and returns 429 errors when limits are exceeded.

Simple Example: Policy "100;w=3600;s=user" means "100 requests per user per hour." User Alice makes 100 requests in 30 minutes; her 101st request gets rejected with 429 status. Bob's requests are unaffected because the limit is s=user (per-user), not global.

Analogy: Rate limiting is like a data plan with caps. You get 5GB/month per line (per-user quota). If you exceed 5GB, your speed gets throttled (429 errors). Your family members' lines aren't affected by your overuse because each line has its own quota.
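The policy string is compact enough that a small parser makes its grammar explicit. `parse_rate_limit_policy` is an illustrative helper of my own; the `quota;w=window;u=unit;s=segment` syntax it decodes is the documented one:

```python
def parse_rate_limit_policy(policy):
    """Decode Helicone's quota;w=window;u=unit;s=segment policy string."""
    parts = policy.split(";")
    parsed = {
        "quota": int(parts[0]),
        "window_s": None,
        "unit": "request",    # default when u= is omitted
        "segment": "global",  # default when s= is omitted
    }
    for part in parts[1:]:
        key, _, value = part.partition("=")
        if key == "w":
            parsed["window_s"] = int(value)
        elif key == "u":
            parsed["unit"] = value
        elif key == "s":
            parsed["segment"] = value
    return parsed

print(parse_rate_limit_policy("500;w=86400;u=cents;s=user"))
# {'quota': 500, 'window_s': 86400, 'unit': 'cents', 'segment': 'user'}
```

Reading the $5/day policy this way makes it obvious that 500 is cents, not requests.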

Provider Fallback

Definition: Provider fallback automatically retries failed LLM requests with exponential backoff and can switch to backup providers when the primary fails. You specify models like "gpt-4o/claude-sonnet-4" to try GPT-4o first, then fall back to Claude if OpenAI returns errors or hits rate limits.

Simple Example: You set model="gpt-4o/claude-sonnet-4" and Helicone-Retry-Enabled: true. OpenAI hits rate limits at 3PM (429 error). Helicone automatically retries with Claude Sonnet 4 after 1 second, and your request succeeds instead of failing.

Analogy: Provider fallback is like having backup suppliers for a critical ingredient. If your primary flour supplier runs out, you automatically call supplier #2, then supplier #3 if needed. Your bakery never stops producing bread because you have redundancy built in.

Making the Blueprint

Now, let's make the blueprint for activating Helicone's advanced features. This eight-step plan shows you exactly how to enable session tracing, caching, rate limiting, prompt versioning, and retries—all through HTTP headers without code refactoring.

Step 1: Add Session Headers for Tracing

For multi-agent workflows, add Helicone-Session-Id and Helicone-Session-Path headers to group related requests. Use hierarchical paths like /triage, /triage/analysis, /triage/analysis/lab-review to show parent-child relationships. Every agent in the workflow shares the same session ID but has a unique path.

Why this matters: Trace entire conversation threads through multiple agents and attribute costs to specific workflow steps in the dashboard's session tree view.

Step 2: Enable Caching with TTL

For deterministic queries (temperature=0), add Helicone-Cache-Enabled: true and Cache-Control: max-age=86400 to cache responses for 24 hours. For creative prompts, use bucket caching with Helicone-Cache-Bucket-Max-Size: 5 to return one of 5 pre-generated variations. Per-user caching uses Helicone-Cache-Seed to namespace responses.

Why this matters: Reduce costs by 30-50% for frequently repeated queries without building custom caching infrastructure.

Step 3: Configure Rate Limit Policies

Enforce quotas with Helicone-RateLimit-Policy header using the syntax quota;w=window;u=unit;s=segment. Example: "100;w=3600;s=user" (100 requests/hour per user) or "500;w=86400;u=cents;s=user" ($5/day per user). Segment can be user, property, or omitted for global limits.

Why this matters: Prevent runaway costs from infinite loops or malicious users exceeding intended budgets. Returns 429 errors when limits are exceeded.

Step 4: Set Up Retry + Fallback Chains

Handle provider outages with automatic retries and fallbacks. Use Helicone-Retry-Enabled: true, Helicone-Retry-Num: 3, and Helicone-Retry-Factor: 2 for exponential backoff (1s, 2s, 4s delays). Specify fallback models like model="gpt-4o/claude-sonnet-4" to try GPT-4o first, then Claude if OpenAI fails.

Why this matters: Maintain high availability by automatically switching providers when one hits rate limits or has outages. Critical for production reliability.
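The backoff schedule those headers describe is simple to compute. A minimal sketch, assuming the delay for attempt n is base × factor^n (which matches the 1s, 2s, 4s sequence quoted above):

```python
def backoff_delays(num_retries, factor, base=1.0):
    """Exponential backoff schedule: base * factor**attempt for each retry."""
    return [base * factor ** attempt for attempt in range(num_retries)]

print(backoff_delays(3, 2))  # [1.0, 2.0, 4.0]
```

With Helicone-Retry-Num: 3 and Helicone-Retry-Factor: 2, a failing request waits at most 1 + 2 + 4 = 7 seconds across all retries before the final error surfaces.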

Step 5: Create Prompts in Playground

Use the Helicone Playground to create prompt templates with variables. Define prompts like "Summarize {document} in {style} tone" and test different variable combinations. Export the prompt with a version ID. Prompts support Jinja2 templating for complex logic.

Why this matters: Centralize prompt management outside code. Enables A/B testing, version control, and instant rollbacks without deploying new code.

Step 6: Deploy Prompt Versions

Reference prompt versions in your code with Helicone-Prompt-Id: diabetes-query-v2. Helicone substitutes the versioned prompt template and logs which version was used. Update prompts in the dashboard and redeploy by changing the version number in code—no prompt text lives in your repository.

Why this matters: Track exactly which prompt version generated each response. Roll back to v1 instantly if v2 degrades quality.

Step 7: Post Evaluation Scores

After receiving LLM responses, post human or AI evaluation scores to Helicone using the Feedback API. Track metrics like "accuracy", "helpfulness", "tone" with numeric scores or boolean values. Associate scores with the original request ID to analyze quality trends over time.

Why this matters: Measure response quality at scale. Identify which prompts, models, or agent steps produce the best outcomes and optimize based on data.
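A hedged sketch of what a scores call looks like. The endpoint path below follows Helicone's Scores API as documented at the time of writing (`POST /v1/request/{requestId}/score`), and the request ID comes from the `helicone-id` header on the original response; verify both against the current docs before relying on them. `build_score_request` is my own helper:

```python
import json

def build_score_request(request_id, scores, api_key):
    """Assemble url, headers, and body for a Helicone Scores API call."""
    return {
        "url": f"https://api.helicone.ai/v1/request/{request_id}/score",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"scores": scores}),
    }

# request_id "req-abc123" is a placeholder for the helicone-id of a real request
req = build_score_request("req-abc123", {"accuracy": 92, "helpful": True}, "sk-helicone-...")
print(req["url"])
# POST it with any HTTP client, e.g.:
# requests.post(req["url"], headers=req["headers"], data=req["body"])
```

Numeric and boolean values both work as scores, so "accuracy: 92" and "helpful: true" can live side by side on the same request.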

Step 8: Track Per-User Analytics

Add Helicone-User-Id headers to every request to segment analytics by user. The dashboard's Users page shows per-user request counts, costs, average latency, and error rates. Combine with rate limiting (s=user) to enforce per-user budgets.

Why this matters: Identify power users consuming budget, detect unusual usage patterns, and provide transparent cost attribution for multi-tenant applications.
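The per-user rollup the Users page gives you is, conceptually, a group-by over the request log. An illustrative sketch over fake records (the record fields here are invented for the example, not Helicone's export schema):

```python
from collections import defaultdict

# Fake request records, each tagged with the Helicone-User-Id it was sent with
requests_log = [
    {"user": "user-7829", "cost": 0.004, "latency_ms": 1230},
    {"user": "user-7829", "cost": 0.031, "latency_ms": 980},
    {"user": "user-4521", "cost": 0.012, "latency_ms": 1500},
]

per_user = defaultdict(lambda: {"requests": 0, "cost": 0.0})
for r in requests_log:
    per_user[r["user"]]["requests"] += 1
    per_user[r["user"]]["cost"] += r["cost"]

print(dict(per_user))
```

This is exactly the attribution that makes per-user rate limits (s=user) enforceable: the quota tracker and the analytics key off the same header.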

Trade-offs to consider:

Helicone offers three integration methods. The AI Gateway (recommended) adds ~1-5ms latency but provides unified multi-provider routing. Provider-specific proxies add ~50-80ms latency but let you keep provider keys local. Async logging adds zero latency but sacrifices proxy features like caching and rate limiting. For most production applications, the AI Gateway's minimal latency overhead is worthwhile for the operational simplicity.

6-Step Integration Flow

1. Create Account: sign up at helicone.ai and generate an API key.
2. Configure Keys: add your provider keys to the dashboard.
3. Change Base URL: point your client to ai-gateway.helicone.ai.
4. Add Auth: use your Helicone API key.
5. Make API Call: run your app as normal.
6. View Dashboard: see costs, tokens, and latency metrics.

Executing the Blueprint

Let's carry out the blueprint plan with real, working code demonstrating session tracing, caching, rate limiting, retries, and the kitchen sink example combining all features.

Complete Code Examples

All Part 2 examples are in the part2-features/ directory. Includes multi-agent session tracing, 3 caching strategies, 4 rate limiting policies, retry+fallback chains, and a kitchen sink example combining all headers. Total: 5 files, 467 lines of tested code.

Example 1: Multi-Agent Session Tracing

Track a 4-agent healthcare workflow using hierarchical session paths. Each agent shares the same session ID but has a unique path showing parent-child relationships:

# part2-features/session_tracing.py
from openai import OpenAI
import os
import uuid

client = OpenAI(
    base_url="https://ai-gateway.helicone.ai",
    api_key=os.getenv("HELICONE_API_KEY"),
)

session_id = str(uuid.uuid4())
patient_id = "patient_12345"

# Agent 1: Triage (Parent)
triage_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Patient reports: fatigue, increased thirst..."}],
    max_tokens=150,
    extra_headers={
        "Helicone-Session-Id": session_id,
        "Helicone-Session-Path": "/triage",              # Root level
        "Helicone-Property-Agent": "triage",
        "Helicone-User-Id": patient_id,
    }
)

# Agent 2: Analysis (Child of triage)
analysis_response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Analyze: {triage_response.choices[0].message.content}"}],
    max_tokens=300,
    extra_headers={
        "Helicone-Session-Id": session_id,
        "Helicone-Session-Path": "/triage/analysis",     # Child level
        "Helicone-Property-Agent": "analysis",
        "Helicone-User-Id": patient_id,
    }
)

# Agent 3: Lab Review (Grandchild)
lab_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Review labs for: {analysis_response.choices[0].message.content}"}],
    max_tokens=200,
    extra_headers={
        "Helicone-Session-Id": session_id,
        "Helicone-Session-Path": "/triage/analysis/lab-review",  # Grandchild level
        "Helicone-Property-Agent": "lab-review",
        "Helicone-User-Id": patient_id,
    }
)

# Dashboard shows hierarchical tree: /triage → /triage/analysis → /triage/analysis/lab-review
# Click on session to see total cost, per-agent costs, conversation thread

Key observation: All three agents share session_id but use hierarchical paths. The dashboard renders this as a collapsible tree where you can expand /triage to see its children, drill into costs per agent, and replay the entire conversation thread.

Example 2: Semantic Caching with TTL

Save 30-50% on costs by caching deterministic queries. Full implementation in caching_examples.py:

# Basic caching: cache for 24 hours
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is diabetes?"}],
    max_tokens=200,
    temperature=0,  # Deterministic responses
    extra_headers={
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=86400",  # 24 hours in seconds
    }
)

# First request: costs $0.03, goes to GPT-4
# Next 399 requests in 24h: cost $0, instant cache hits
# Savings: $11.97/day = $359/month
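The savings figures in those comments follow from simple arithmetic, spelled out here so the assumptions are visible (one paid call per 24-hour TTL window, a flat $0.03 per call, and a 30-day month):

```python
requests_per_day = 400
cost_per_call = 0.03
paid_calls = 1                        # only the first call per TTL window reaches GPT-4
cache_hits = requests_per_day - paid_calls
daily_savings = cache_hits * cost_per_call

print(round(daily_savings, 2), round(daily_savings * 30, 2))  # 11.97 359.1
```

Note the monthly figure assumes the prompt text stays byte-identical; any change to the prompt produces a new cache key and a fresh paid call.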

Example 3: Rate Limiting Policies

Enforce per-user quotas to prevent cost overruns. Full implementation in rate_limiting.py:

# Example 1: Per-user request limit (100 requests/hour)
extra_headers={
    "Helicone-User-Id": user_id,
    "Helicone-RateLimit-Policy": "100;w=3600;s=user"
}

# Example 2: Per-user cost limit ($5/day)
extra_headers={
    "Helicone-User-Id": user_id,
    "Helicone-RateLimit-Policy": "500;w=86400;u=cents;s=user"  # 500 cents = $5
}

# When limit exceeded: 429 error returned, user must wait for window reset
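To see why Alice's 101st request is rejected while Bob's are untouched, here is an illustrative sliding-window simulation of the "100;w=3600;s=user" policy. This models the observable behavior, not Helicone's actual implementation:

```python
from collections import defaultdict, deque

class RateLimiter:
    """Toy sliding-window limiter: quota requests per window_s per user."""
    def __init__(self, quota, window_s):
        self.quota, self.window_s = quota, window_s
        self.hits = defaultdict(deque)  # user -> timestamps of counted requests

    def allow(self, user, now):
        q = self.hits[user]
        while q and now - q[0] >= self.window_s:
            q.popleft()                  # drop hits that aged out of the window
        if len(q) >= self.quota:
            return False                 # this is where Helicone would return 429
        q.append(now)
        return True

rl = RateLimiter(quota=100, window_s=3600)
alice = [rl.allow("alice", t) for t in range(101)]  # 101 requests in 101 seconds
print(alice.count(True), rl.allow("bob", 101))      # 100 True
```

Alice gets exactly 100 allowed requests and one rejection; Bob's first request at the same moment sails through because the s=user segment gives him his own window.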

Example 4: Retry + Provider Fallback

Maintain high availability with automatic retries and cross-provider failover. Full implementation in retry_fallback.py:

# Try GPT-4o first, fallback to Claude if OpenAI fails
response = client.chat.completions.create(
    model="gpt-4o/claude-sonnet-4",  # Primary / Fallback
    messages=[{"role": "user", "content": "Critical query with fallback"}],
    max_tokens=100,
    extra_headers={
        "Helicone-Retry-Enabled": "true",
        "Helicone-Retry-Num": "3",         # Max 3 retries
        "Helicone-Retry-Factor": "2",      # Exponential backoff: 1s, 2s, 4s
        "Helicone-Fallback-Enabled": "true",
    }
)
# If OpenAI hits rate limits: automatic retry with Claude after 1s delay

Example 5: Kitchen Sink (All Features Combined)

Combine session tracing, caching, rate limiting, and prompt versioning in a single request. Full implementation in kitchen_sink.py:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is diabetes?"}],
    max_tokens=200,
    temperature=0,
    extra_headers={
        # Session tracing
        "Helicone-Session-Id": "kitchen-sink-001",
        "Helicone-Session-Path": "/demo",

        # User tracking
        "Helicone-User-Id": "demo-user",

        # Custom properties
        "Helicone-Property-Environment": "demo",
        "Helicone-Property-Feature": "kitchen-sink",

        # Caching
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600",

        # Rate limiting
        "Helicone-RateLimit-Policy": "100;w=3600;s=user",

        # Prompt versioning
        "Helicone-Prompt-Id": "diabetes-query-v1",
    }
)
# Single request using 6 advanced features simultaneously!

Complete Examples on GitHub:

All 5 Part 2 examples (session tracing, caching, rate limiting, retry+fallback, kitchen sink) are available in the part2-features/ directory with full documentation. Total: 467 lines of tested, production-ready code. View at github.com/zubairashfaque/helicone-examples

What just happened: By changing the base_url from OpenAI's default to ai-gateway.helicone.ai and swapping your API key, every request now flows through Helicone. The AI Gateway looks up the OpenAI key you added to the dashboard (step 2 of the integration flow), forwards the request, logs the round-trip, and returns the response. Your application code is unchanged—same parameters, same response format, same error handling.

TypeScript: AI Gateway Integration

The pattern is identical in TypeScript:

import OpenAI from "openai";

// BEFORE Helicone
// const client = new OpenAI();

// AFTER Helicone
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: "You are a helpful medical assistant." },
    { role: "user", content: "What are the symptoms of Type 2 diabetes?" }
  ],
  max_tokens: 500,
});

console.log(response.choices[0].message.content);
// Automatically logged: tokens, cost ($), latency (ms), TTFT, model, status

Alternative: Provider-Specific Proxy

If you prefer managing provider keys locally (never uploading them to Helicone's dashboard), use the provider-specific proxy approach:

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),          # Your key stays local
    base_url="https://oai.helicone.ai/v1",         # Helicone's OpenAI proxy
    default_headers={
        "Helicone-Auth": f"Bearer {os.getenv('HELICONE_API_KEY')}"
    }
)

# Use exactly as before—all calls are logged
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, world!"}]
)

Key difference: Here you're passing both your OpenAI key (in api_key) and your Helicone key (in headers). Helicone never sees your OpenAI key—it's sent directly to OpenAI's API through the proxy. This adds ~50-80ms latency vs. the AI Gateway's ~1-5ms, but some organizations prefer this model for security/compliance reasons.

Three Integration Methods Compared

Method | Latency | Key Management | Features | Best For
AI Gateway | ~1-5ms overhead | Provider keys in Helicone dashboard | All features available | New projects, multi-provider routing
Provider Proxy | ~50-80ms overhead | Provider keys stay local | All features available | Security/compliance requirements
Async Logging | 0ms (zero overhead) | Provider keys stay local | Observability only (no caching/rate limiting) | Latency-critical applications

Multi-Provider Routing with AI Gateway

The AI Gateway's killer feature is provider-agnostic routing. Switch between OpenAI, Claude, and Gemini by changing one string:

client = OpenAI(
    base_url="https://ai-gateway.helicone.ai",
    api_key=os.getenv("HELICONE_API_KEY"),
)

# OpenAI GPT-4o
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain HIPAA compliance"}]
)

# Anthropic Claude—same client, same format
response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Explain HIPAA compliance"}]
)

# Google Gemini—same client, same format
response = client.chat.completions.create(
    model="gemini-2.0-flash",
    messages=[{"role": "user", "content": "Explain HIPAA compliance"}]
)

All three requests use the same OpenAI-compatible Python client. Helicone's AI Gateway translates the request to each provider's format, handles authentication, logs everything uniformly, and returns results in OpenAI's response schema. No provider-specific SDKs, no format conversions, no switching between clients.

Cost tracking benefit: Because Helicone logs all three requests in a unified format, you can compare costs across providers directly in the dashboard. See instantly that Gemini Flash costs 10× less than GPT-4o for the same task.

Why Multi-Provider Routing Matters

- Cost Optimization: Route to the cheapest provider for each task. Gemini Flash: 10× cheaper than GPT-4o for summaries.
- Vendor Independence: Never locked into one provider. Switch models without rewriting code or learning new SDKs.
- Automatic Fallbacks: Build fallback chains, as covered above: try GPT-4o → if it fails, fall back to Claude → if that fails, use Gemini.
- Easy A/B Testing: Compare GPT-4o vs Claude Sonnet on the same prompts. See quality + cost differences side-by-side.

Provider | Helicone Proxy URL | Notes
OpenAI | oai.helicone.ai/v1 | Dedicated subdomain
Anthropic | anthropic.helicone.ai | Dedicated subdomain
Azure OpenAI | gateway.helicone.ai | Uses Helicone-Target-Url header
Google Gemini | gateway.helicone.ai | Uses Helicone-Target-Url header
Together AI | together.helicone.ai | Dedicated subdomain
Groq | groq.helicone.ai | Dedicated subdomain
DeepSeek | deepseek.helicone.ai | Dedicated subdomain
AWS Bedrock | bedrock.helicone.ai | Dedicated subdomain
Any other | gateway.helicone.ai | Universal gateway with Helicone-Target-Url

Real-World Example: Healthcare AI Triage Assistant

Here's a complete example that demonstrates Helicone's value in a production scenario. This healthcare triage assistant classifies patient symptoms and uses Helicone headers to enable department-level cost analytics, per-patient tracking, and prompt versioning:

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://ai-gateway.helicone.ai",
    api_key=os.getenv("HELICONE_API_KEY"),
)

def triage_patient(patient_id: str, symptoms: str, department: str) -> str:
    """Classify patient symptoms with full Helicone observability."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a medical triage assistant. Classify urgency as: "
                    "EMERGENCY, URGENT, STANDARD, or LOW-PRIORITY. Give rationale."
                ),
            },
            {"role": "user", "content": f"Patient symptoms: {symptoms}"},
        ],
        max_tokens=200,
        temperature=0.1,  # Low temp for consistency
        extra_headers={
            "Helicone-User-Id": patient_id,                    # Per-patient analytics
            "Helicone-Property-Department": department,          # Filter by department
            "Helicone-Property-App": "triage-assistant",        # App-wide tagging
            "Helicone-Property-Environment": "production",      # Track by environment
            "Helicone-Prompt-Id": "triage-classifier-v1",       # Prompt versioning
        },
    )

    return response.choices[0].message.content

# Usage
result = triage_patient(
    patient_id="patient-7829",
    symptoms="Severe chest pain, shortness of breath, radiating to left arm",
    department="cardiology"
)
print(result)
# Output: "EMERGENCY — Symptoms consistent with acute coronary syndrome..."

What this unlocks in the Helicone dashboard: per-patient cost and usage analytics via Helicone-User-Id, department-level filtering through the custom property headers, environment segmentation, and a record of exactly which prompt version produced each triage decision.

This example uses just five Helicone headers to transform a basic LLM call into a fully instrumented, production-ready operation. Check out the complete code with error handling and additional examples in the GitHub repository.

Your Helicone Dashboard at a Glance

- Requests: every API call logged with full context (e.g. model gpt-4o-mini, 150 tokens in / 89 out, $0.004 cost, 1,230ms latency, 340ms TTFT).
- Cost Analytics: track spending across models and users (e.g. $247 this month across 12.4K requests).
- User Analytics: per-user costs and consumption patterns (user-7829: $12.40, user-4521: $8.20, user-9103: $5.60).
- Powerful Filters: query by model, status, date, or custom properties, with HQL for complex queries.
- Session Tracing: visualize multi-step agent workflows as a tree (/triage with children /triage/intake, /triage/analysis, /triage/report).
- Alerts: get notified before problems escalate (cost threshold $500/day, error rate >5%, latency spike >3s avg).

All this data is automatically captured from your 2-line code change.

What's Next in Part 3

With observability and control features enabled, Part 3 will take you to production readiness with production best practices.

The complete series takes you from zero to production-ready LLM operations in 3 parts. See you in Part 3!

Start Building Today

Clone the repository, add your Helicone API key, and run any example in under 60 seconds.
