Agent Observability: Langfuse, Phoenix, Opik

> Three open-source or source-available agent observability platforms dominate in 2026. Langfuse (MIT): 6M+ installs/month, with tracing, prompt management, evals, and session replay. Arize Phoenix (Elastic License 2.0): deep agent-specific evals, RAG relevancy, and OpenInference auto-instrumentation. Comet Opik (Apache 2.0): automated prompt optimization, guardrails, and LLM-judge hallucination detection.

Type: Learn

Languages: Python (stdlib)

Prerequisites: Phase 14 · 23 (OTel GenAI)

Time: ~45 minutes

Learning Objectives

The Problem

OTel GenAI (Lesson 23) gives you the schema. You still need the platform that ingests spans, runs evaluations, stores prompt versions, and surfaces regressions. The three contenders each emphasize different parts of the lifecycle.
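
To make those four jobs concrete, here is a minimal stdlib sketch. It is not any platform's API: the `Span` fields (`session_id`, `prompt_version`, `score`), the `ToyPlatform` class, and the `tolerance` threshold are hypothetical names chosen for illustration, and real platforms attach this data to OTel spans in their own ways.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span; real OTel spans carry many more fields.
    session_id: str
    name: str
    prompt_version: str
    output: str
    score: Optional[float] = None  # filled in by an evaluation step


@dataclass
class ToyPlatform:
    spans_by_session: dict = field(default_factory=lambda: defaultdict(list))

    def ingest(self, span):
        # Job 1: ingest spans and index them by session.
        self.spans_by_session[span.session_id].append(span)

    def evaluate(self, scorer):
        # Job 2: run an evaluation over every stored span.
        for spans in self.spans_by_session.values():
            for span in spans:
                span.score = scorer(span)

    def scores_by_prompt_version(self):
        # Job 3: tie scores back to the prompt version that produced them.
        buckets = defaultdict(list)
        for spans in self.spans_by_session.values():
            for span in spans:
                if span.score is not None:
                    buckets[span.prompt_version].append(span.score)
        return {version: mean(scores) for version, scores in buckets.items()}

    def has_regressed(self, baseline, candidate, tolerance=0.05):
        # Job 4: flag a regression when the candidate's mean score falls
        # below the baseline's by more than the tolerance.
        means = self.scores_by_prompt_version()
        return means[candidate] < means[baseline] - tolerance


platform = ToyPlatform()
platform.ingest(Span("s1", "answer", "v1", "Paris is the capital of France."))
platform.ingest(Span("s2", "answer", "v2", ""))
# Trivial scorer for the demo: non-empty output passes, empty output fails.
platform.evaluate(lambda span: 1.0 if span.output.strip() else 0.0)
print(platform.scores_by_prompt_version())
print(platform.has_regressed("v1", "v2"))
```

The point of the sketch is the data flow: once scores are keyed by prompt version, a regression check reduces to comparing two aggregates.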

The Concept

Langfuse (MIT)

Arize Phoenix (Elastic License 2.0)

Comet Opik (Apache 2.0)

Industry data

Per Maxim (2026 field analysis): 89% of organizations have agent observability in place; quality issues are the top production barrier (32% of respondents cite them).

Picking one

| Need | Pick |
| --- | --- |
| All-in-one with prompt management | Langfuse |
| Deep RAG evaluation + drift | Phoenix |
| Automated optimization + guardrails | Opik |
| Open licensing, no ELv2 | Langfuse (MIT) or Opik (Apache 2.0) |
| Datadog / New Relic integration | Any; they all export OTel |

Where this pattern goes wrong

Build It

code/main.py implements a stdlib trace collector + LLM-judge evaluator:

Run it:

python3 code/main.py

Output: per-session eval scores and a failure categorization, comparable in shape to what Langfuse, Phoenix, or Opik would show.
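
The contents of `code/main.py` are not reproduced here. As a rough sketch of the same idea under stated assumptions: `TraceCollector`, `stub_judge`, `evaluate_sessions`, and the failure categories below are hypothetical, and the judge is a deterministic stand-in where a real setup would call a separate LLM with a rubric.

```python
from collections import defaultdict


class TraceCollector:
    """Collects span dicts in memory and groups them by session (illustrative only)."""

    def __init__(self):
        self.sessions = defaultdict(list)

    def record(self, session_id, name, output, error=None):
        self.sessions[session_id].append(
            {"name": name, "output": output, "error": error}
        )


def stub_judge(span):
    """Stand-in for an LLM judge: returns (score, failure_category or None).

    A real judge would send the span output plus a rubric to a separate model.
    """
    if span["error"]:
        return 0.0, "tool_error"
    if not span["output"].strip():
        return 0.0, "empty_output"
    if "i don't know" in span["output"].lower():
        return 0.5, "refusal"
    return 1.0, None


def evaluate_sessions(collector, judge=stub_judge):
    """Return the mean score and failure counts by category for each session."""
    report = {}
    for session_id, spans in collector.sessions.items():
        scores, failures = [], defaultdict(int)
        for span in spans:
            score, category = judge(span)
            scores.append(score)
            if category:
                failures[category] += 1
        report[session_id] = {
            "score": sum(scores) / len(scores),
            "failures": dict(failures),
        }
    return report


collector = TraceCollector()
collector.record("s1", "plan", "Look up the order, then reply.")
collector.record("s1", "tool:lookup", "", error="timeout")
collector.record("s2", "answer", "I don't know.")
for session_id, result in evaluate_sessions(collector).items():
    print(session_id, result)
```

Keeping the judge as a plain callable means the stub can be swapped for a model-backed judge without touching the collector or the report logic.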

Use It

Ship It

outputs/skill-obs-platform-wiring.md picks a platform and wires traces + evals + prompt versions into an existing agent.

Exercises

  1. Export a week of OTel traces to Langfuse cloud (free tier). Which sessions failed? Why?
  2. Write an LLM-judge rubric for your domain (factual correctness, tone, scope adherence). Test on 50 traces.
  3. Compare Langfuse prompt versioning against Phoenix's trace clustering. Which tells you what broke faster?
  4. Read Opik's guardrail docs. Wire a PII redaction guardrail to one of your agent runs.
  5. Benchmark the three on your corpus. Ignore vendor-published numbers; measure your own.
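
For exercise 2, one way to structure a rubric is as weighted criteria that are rendered into a judge prompt and then used to aggregate the judge's per-criterion scores. The sketch below is an assumption about shape, not any platform's rubric format: the criteria descriptions, weights, 1 to 5 scale, and JSON reply format are all hypothetical, and the judge's reply is hard-coded rather than fetched from a model.

```python
import json

# Hypothetical rubric: criterion -> (weight, description). Weights sum to 1.
RUBRIC = {
    "factual_correctness": (0.5, "Claims are supported by the provided context."),
    "tone": (0.2, "Reply is professional and concise."),
    "scope_adherence": (0.3, "Reply stays within the agent's stated task."),
}


def build_judge_prompt(agent_output, rubric=RUBRIC):
    """Render the rubric into a prompt asking for 1-5 scores as JSON."""
    lines = ["Score the reply on each criterion from 1 (poor) to 5 (excellent)."]
    for name, (_, description) in rubric.items():
        lines.append(f"- {name}: {description}")
    lines.append("Respond with a JSON object mapping criterion name to score.")
    lines.append(f"Reply to score:\n{agent_output}")
    return "\n".join(lines)


def weighted_score(judge_reply, rubric=RUBRIC):
    """Parse the judge's JSON reply and return a weighted score in [0, 1]."""
    scores = json.loads(judge_reply)
    missing = set(rubric) - set(scores)
    if missing:
        raise ValueError(f"judge omitted criteria: {sorted(missing)}")
    total = 0.0
    for name, (weight, _) in rubric.items():
        raw = scores[name]
        if not 1 <= raw <= 5:
            raise ValueError(f"{name} score out of range: {raw}")
        total += weight * (raw - 1) / 4  # map the 1..5 scale onto 0..1
    return total


print(build_judge_prompt("Your refund was issued on Monday."))
# Hard-coded stand-in for what a judge model might return.
fake_reply = '{"factual_correctness": 5, "tone": 3, "scope_adherence": 4}'
print(round(weighted_score(fake_reply), 3))
```

Validating the reply before aggregating matters in practice, since a judge that silently omits a criterion would otherwise inflate or deflate the score.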

Key Terms

| Term | What people say | What it actually means |
| --- | --- | --- |
| Tracing | "Spans collector" | Ingest OTel / SDK spans; index by session |
| Prompt management | "Prompt CMS" | Versioned prompts tied to traces |
| LLM-as-judge | "Automated eval" | Separate LLM scores agent output against a rubric |
| Session replay | "Trace playback" | Step through past runs for debugging |
| RAG relevancy | "Retrieval quality" | Does the retrieved context match the query |
| Trace clustering | "Behavioral grouping" | Cluster similar runs for drift detection |
| Guardrail enforcement | "Policy at log time" | PII/toxicity/scope checks on logged content |
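
The "policy at log time" idea in the guardrail enforcement entry can be sketched as a redaction pass applied to content before it is stored. This is a stdlib illustration, not Opik's guardrail API: the function name `redact_pii` and the `PII_PATTERNS` table are hypothetical, and the two regexes are deliberately simple and will miss many real-world formats.

```python
import re

# Deliberately simple patterns for illustration; production PII detection
# needs far broader coverage than these two regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text):
    """Replace matches with a placeholder and report which kinds were found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found


logged, kinds = redact_pii("Contact jane@example.com or 555-123-4567.")
print(logged)  # Contact [EMAIL] or [PHONE].
print(kinds)   # ['EMAIL', 'PHONE']
```

Returning the list of detected kinds alongside the redacted text lets the caller record that a guardrail fired without storing the sensitive value itself.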

Further Reading