How to Use Langfuse

Monitor AI apps by tracing prompts, scoring outputs, and running evals via the Langfuse SDK. Deploy the open-source dashboard for free in your environment.

By FreeToolArena Staff · Updated June 2026 · 6 min read

Langfuse is the observability layer your LLM app should have had from day one — traces, scores, prompts, and evals in one OSS stack.

Langfuse solves the “why did my chatbot say that?” problem. It captures every LLM call, tool invocation, and user interaction as a nested trace, adds latency and cost math, and lets you score outputs manually or with LLM-as-judge. It’s open-source, self-hostable, and drops into Python or JS apps with a few lines of code.

What it is

Langfuse is a Next.js application backed by Postgres and ClickHouse, with Redis for queueing and caching and S3-compatible blob storage for objects. SDKs for Python and TypeScript send events to the ingestion API, which assembles them into traces made of spans, generations, and events. The UI renders traces, aggregates metrics, and runs dataset-based evals against prompt versions.

Install / set up

# self-host the server
git clone https://github.com/langfuse/langfuse
cd langfuse
docker compose up -d
# or skip self-hosting and use Langfuse Cloud
# either way, install the Python SDK in your app
pip install langfuse

First run

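Sign up at cloud.langfuse.com or open your self-hosted URL, create a project, and copy its public and secret keys. Expose them to your app as the LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY environment variables, plus LANGFUSE_HOST if you self-host. Then instrument your app: a single @observe() decorator on the entry function is enough to get traces into the dashboard, and decorating the functions it calls nests them under that trace.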

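$ python
from langfuse import observe  # SDK v3+; on v2 use: from langfuse.decorators import observe

@observe()  # records every call to ask() as a trace
def ask(q):
    return llm.invoke(q)  # llm is assumed to be your own client, e.g. a LangChain chat model

ask("hello")
# trace appears in UI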
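
To see how nesting works, the sketch below decorates three functions that call each other, so the inner calls should show up as child observations of the outer one. The retrieve and generate functions are hypothetical placeholders rather than part of Langfuse, and the no-op fallback decorator is an assumption added only so the sketch runs on a machine without the SDK installed. The import path differs between SDK major versions, so both are tried.

```python
import os

try:
    from langfuse import observe  # import path in newer SDK releases
except ImportError:
    try:
        from langfuse.decorators import observe  # import path in older SDK releases
    except ImportError:
        def observe(*_args, **_kwargs):
            """No-op stand-in (assumption) so the sketch runs without the SDK."""
            def wrap(fn):
                return fn
            return wrap

# Keys copied from the project settings page.
REQUIRED_ENV = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_langfuse_env(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


@observe()
def retrieve(question):
    # Hypothetical retrieval step; replace with your own search or vector lookup.
    return ["Langfuse records nested traces."]


@observe()
def generate(question, context):
    # Hypothetical placeholder for the model call.
    return f"Q: {question} | context: {context[0]}"


@observe()
def ask(question):
    # Entry point: the decorated calls below nest under this function's trace.
    context = retrieve(question)
    return generate(question, context)


if __name__ == "__main__":
    missing = missing_langfuse_env()
    if missing:
        print("Set these before expecting traces:", ", ".join(missing))
    print(ask("hello"))
```

Checking the environment first is a cheap way to catch the most common first-run problem, which is an app that runs fine but never reports anything because the keys were not set.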

Everyday workflows

Use Prompts to version system prompts in Langfuse and pull them at runtime with langfuse.get_prompt().
Create Datasets from production traces, then run Evals against new prompt versions before promoting them.
Score traces in the UI (thumbs up/down) or automatically via LLM-as-judge templates to track quality over time.
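
The dataset-then-eval workflow above boils down to a simple gate: run the baseline and the candidate prompt over the same items, score each output, and promote only if the candidate wins. The following is a local sketch of that logic in plain Python, not the Langfuse Datasets or Evals API. The dataset, the fake_model stand-in, the exact-match scorer, and the min_gain threshold are all hypothetical choices made for illustration.

```python
from statistics import mean

# Hypothetical dataset; in practice these items would come from production traces.
DATASET = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Opposite of hot?", "expected": "cold"},
]

BASELINE_PROMPT = "Answer the question."
CANDIDATE_PROMPT = "Answer with a single word or number."

_ANSWERS = {item["input"]: item["expected"] for item in DATASET}


def fake_model(system_prompt, question):
    """Hypothetical stand-in for an LLM call; swap in your real client."""
    answer = _ANSWERS[question]
    if "single word" in system_prompt:
        return answer
    return f"I think the answer is {answer}."


def exact_match(output, expected):
    """Score 1.0 when the output equals the expected answer, else 0.0."""
    return 1.0 if output.strip().lower() == expected.lower() else 0.0


def run_eval(system_prompt, dataset, model, scorer=exact_match):
    """Average score of one prompt version over every dataset item."""
    scores = [scorer(model(system_prompt, item["input"]), item["expected"])
              for item in dataset]
    return mean(scores)


def should_promote(candidate_score, baseline_score, min_gain=0.05):
    """Promote only when the candidate beats the baseline by at least min_gain."""
    return candidate_score >= baseline_score + min_gain


if __name__ == "__main__":
    baseline = run_eval(BASELINE_PROMPT, DATASET, fake_model)
    candidate = run_eval(CANDIDATE_PROMPT, DATASET, fake_model)
    print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
    print("promote" if should_promote(candidate, baseline) else "keep baseline")
```

Requiring a minimum gain rather than any improvement is a design choice: on small datasets a difference of one item can be noise, so a threshold keeps prompt versions from flip-flopping.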

Gotchas and tips

ClickHouse is required as of v3 and it’s heavier than the old Postgres-only stack. If you self-host on a small VM, the ingestion worker can fall behind and traces arrive minutes late. Size your instance for ClickHouse, not for Next.js.

Cost accuracy depends on token counts. If your model provider doesn’t return usage, Langfuse estimates from the text — fine for ballpark, bad for billing. Always pass usage explicitly when you call langfuse.generation() manually so your cost dashboards aren’t fiction.
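
The arithmetic behind that warning is worth seeing once. The sketch below computes cost from token counts and compares provider-reported usage against a text-based guess. The prices are hypothetical, and the four-characters-per-token heuristic is a deliberately crude assumption for illustration, not the estimator Langfuse uses.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; substitute your provider's real rates.
PRICE_PER_MILLION = {"input": 2.50, "output": 10.00}


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    output_tokens: int


def cost_usd(usage, prices=PRICE_PER_MILLION):
    """Cost of one generation given token counts and per-million prices."""
    total = (usage.input_tokens * prices["input"]
             + usage.output_tokens * prices["output"])
    return total / 1_000_000


def rough_token_count(text):
    """Crude guess of about 4 characters per token (illustration only)."""
    return max(1, len(text) // 4)


def estimated_usage(prompt, completion):
    """Usage inferred from text length when the provider reports nothing."""
    return Usage(rough_token_count(prompt), rough_token_count(completion))


if __name__ == "__main__":
    reported = Usage(input_tokens=1200, output_tokens=350)  # from the provider
    guessed = estimated_usage("x" * 4000, "y" * 1000)       # from text length
    print(f"reported usage cost: ${cost_usd(reported):.4f}")
    print(f"guessed usage cost:  ${cost_usd(guessed):.4f}")
```

Because output tokens are usually priced higher than input tokens, a small error in the output count moves the total more than the same error on the input side, which is why reported usage matters most for long completions.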

Who it’s for

Any team shipping an LLM feature to real users. The moment you have more than 10 daily conversations and someone asks “is it getting better or worse?”, you need Langfuse or something like it.
