Traditional application monitoring tells you when your service is down and how long requests take. It tells you nothing useful about an LLM application. A model responding in 200ms with 200 tokens looks perfectly healthy from an infrastructure perspective while producing confidently wrong answers from an application perspective. You need a completely different observability layer, and most teams build it too late.
Analysis Briefing
- Topic: LLM observability, monitoring tools, and production debugging for AI applications
- Analyst: Mike D (@MrComputerScience)
- Context: A research sprint initiated by Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: What does a mature LLM observability stack look like and what does each layer actually tell you?
Why Standard APM Tools Are Blind to LLM Failures
Datadog, New Relic, and Prometheus tell you about system behavior. They measure latency, error rates, CPU usage, and queue depth. These metrics are necessary for LLM applications. They are not sufficient.
An LLM application fails in ways that produce no HTTP errors and no latency spikes. A RAG pipeline that retrieves the wrong documents and generates a plausible-sounding but incorrect answer returns HTTP 200 in 800ms. A prompt that worked correctly for three months starts producing outputs that miss the point after a model update. A system prompt that was carefully tuned starts being undermined by user messages that shift the model’s behavior mid-conversation.
None of these failures appear in your existing observability stack. They require tracking the content of requests and responses, evaluating that content against quality criteria, and correlating output quality with everything that might have caused it: prompt version, model version, retrieval results, user input characteristics, and time.
The Four Layers of a Production LLM Observability Stack
Layer 1: Trace logging. Every LLM call is a trace: the input prompt, the model and version, the parameters (temperature, max tokens), the output, the latency, and the cost. This is the foundation. Without it, you cannot debug anything. LangSmith captures this automatically for LangChain applications. For custom stacks, the OpenTelemetry LLM semantic conventions provide a standard schema for logging LLM calls alongside your existing traces.
```python
import hashlib
import time

import anthropic
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("llm-app")

def traced_llm_call(prompt: str, system: str) -> str:
    with tracer.start_as_current_span("llm.completion", kind=SpanKind.CLIENT) as span:
        span.set_attribute("llm.model", "claude-sonnet-4-20250514")
        # Rough token estimate; swap in a real tokenizer if you need accuracy.
        span.set_attribute("llm.input.tokens_estimate", len(prompt.split()))
        # Use a stable digest so prompt versions are comparable across processes
        # (Python's built-in hash() is randomized per interpreter run).
        span.set_attribute("llm.prompt_hash", hashlib.sha256(system.encode()).hexdigest()[:16])
        start = time.time()
        client = anthropic.Anthropic()
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": prompt}],
        )
        latency_ms = (time.time() - start) * 1000
        output = response.content[0].text
        span.set_attribute("llm.output.tokens", response.usage.output_tokens)
        span.set_attribute("llm.latency_ms", latency_ms)
        # $3 per million input tokens, $15 per million output tokens.
        span.set_attribute(
            "llm.cost_usd",
            (response.usage.input_tokens * 3 + response.usage.output_tokens * 15) / 1_000_000,
        )
        return output
```
Layer 2: Quality evaluation. Latency and cost tell you about efficiency. They tell you nothing about whether the output was correct. Quality evaluation runs automated checks against every LLM response: groundedness (did the answer use information from the retrieved context?), faithfulness (did the answer contradict the context?), relevance (did the answer address the question?), and task-specific checks (did the JSON parse? did the output match the expected schema?).
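The structural side of these checks is straightforward to implement yourself. As a minimal sketch (function names are mine, and the lexical overlap measure is only a crude proxy for groundedness; semantic checks need an LLM-as-judge):

```python
import json

def check_json_schema(output: str, required_keys: set[str]) -> bool:
    """Structural check: does the output parse as JSON with the expected keys?"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def groundedness_proxy(answer: str, context: str) -> float:
    """Crude lexical proxy: fraction of answer words that appear in the
    retrieved context. Only catches gross failures; a real pipeline uses
    an LLM-as-judge for semantic groundedness."""
    answer_words = set(answer.lower().split())
    context_words = set(context.lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)
```

Programmatic checks like these run on every trace for free; reserve the expensive LLM-as-judge calls for a sample.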
Arize Phoenix and Braintrust both run evaluation pipelines on logged traces. They use LLM-as-judge for semantic quality checks and programmatic checks for structural requirements. The output is a quality score per trace that you can alert on when it drops below threshold.
Layer 3: Retrieval monitoring (for RAG applications). Retrieval quality is the most common root cause of bad RAG outputs and the least commonly monitored metric. You need to track: what queries hit the vector store, what documents were retrieved, what similarity scores those documents had, and whether the retrieved documents were actually used in the final answer. A retrieval hit rate that is high but decreasing over time signals index staleness. Low similarity scores on successful retrievals signal that your embedding model and your document corpus have diverged.
Layer 4: Drift detection. Model behavior changes when the provider updates the model, even when you did not change your prompts. Helicone and Weights & Biases both support tracking output distributions over time so that a model update that changes behavior in your specific use case triggers an alert rather than silently degrading user experience.
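The core of drift detection is comparing an output metric's current window against a baseline window. A minimal sketch using a crude z-test on the mean (a production system would compare full distributions, e.g. with a population stability index or KS test; the threshold is illustrative):

```python
from statistics import mean, pstdev

def drift_alert(baseline: list[float], current: list[float],
                z_threshold: float = 3.0) -> bool:
    """Flag a shift in a scalar output metric (response length, refusal
    rate, quality score) between a baseline window and the current window,
    using a z-test on the current window's mean."""
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        return mean(current) != mu
    z = abs(mean(current) - mu) / (sigma / len(current) ** 0.5)
    return z > z_threshold
```

Response length is a surprisingly effective drift metric: model updates that change verbosity or refusal behavior show up in it immediately.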
The Minimal Viable Observability Stack
Full observability is expensive to build and maintain. Here is the minimal stack that catches the most important failure modes:
Start with Helicone as a proxy layer. It sits between your application and any LLM API, logs every request and response with full metadata, tracks costs, and provides a dashboard with zero code changes beyond updating your base URL. The free tier handles most early-stage applications.
Add Braintrust when you have enough traffic to run meaningful evals. Define a dataset of representative inputs with expected outputs, write a scorer (or use their built-in LLM-as-judge scorers), and run evals on every deployment. Braintrust’s dataset versioning and experiment tracking give you the same regression detection as a custom eval pipeline with a fraction of the setup time.
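The deploy-gate pattern itself is tool-agnostic. A generic sketch (this is not Braintrust's API; `generate` stands in for your application under test, and the baseline and tolerance values are illustrative):

```python
def run_eval(dataset, generate, scorer,
             baseline: float, tolerance: float = 0.05) -> tuple[float, bool]:
    """Score the application over a dataset of (input, expected) pairs and
    gate the deployment: pass only if the average score stays within
    `tolerance` of the baseline from the previous release."""
    scores = [scorer(generate(x), expected) for x, expected in dataset]
    avg = sum(scores) / len(scores)
    return avg, avg >= baseline - tolerance
```

Wire the boolean into CI so a quality regression blocks the deploy the same way a failing unit test does.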
Add retrieval logging with a simple custom middleware layer if you are running RAG. Log the query, top-k results, similarity scores, and whether each result was cited in the final answer. This does not require a third-party tool. It requires 40 lines of Python and a log aggregation system you probably already have.
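That middleware is roughly this shape. A sketch, assuming each retrieval result carries `id` and `score` fields and that citation is detectable by the document id appearing in the answer (both the schema and the naive substring citation check are illustrative; real citation detection is application-specific):

```python
import json
import logging
import time

log = logging.getLogger("rag.retrieval")

def log_retrieval(query: str, results: list[dict], answer: str) -> dict:
    """Emit one structured record per retrieval: the query, each retrieved
    document's id and similarity score, and whether that document was
    cited in the final answer."""
    record = {
        "ts": time.time(),
        "query": query,
        "top_k": [
            {"id": r["id"], "score": r["score"], "cited": r["id"] in answer}
            for r in results
        ],
    }
    log.info(json.dumps(record))
    return record
```

Pipe these records into whatever log aggregation you already run; the layer-3 signals (hit rate, similarity trends) all fall out of them.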
The full commercial stack (LangSmith for traces, Arize for quality monitoring, W&B for experiment tracking, Datadog for infrastructure) is appropriate for production applications at meaningful scale. Start with Helicone plus Braintrust, add layers as the application grows, and resist the urge to build a custom observability platform when you have fewer than 10,000 LLM calls per day.
What This Means For You
- Instrument every LLM call with cost tracking from day one, because LLM cost grows faster than almost any other infrastructure cost and surprises you if you do not monitor it, and retroactive cost attribution across thousands of undifferentiated calls is extremely painful.
- Log full request and response content, not just metadata, because the only way to debug a quality regression is to look at actual prompts and outputs, and reconstructing what the model saw from logs that contain only token counts is impossible.
- Set up retrieval quality monitoring before you ship any RAG application to production, because retrieval failures are the most common root cause of bad RAG outputs and they are completely invisible without explicit monitoring of what documents are being retrieved and with what confidence.
- Treat a drop in your LLM-as-judge quality score the same way you treat a spike in error rate, because a quality regression that produces plausible wrong answers is more damaging to user trust than an outage that produces obvious errors.
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
