How to Use Langfuse
Installing Langfuse self-hosted or cloud, tracing prompts, scores, evals, SDK for Python/JS, dashboards.
Langfuse is the observability layer your LLM app should have had from day one — traces, scores, prompts, and evals in one OSS stack.
Langfuse solves the “why did my chatbot say that?” problem. It captures every LLM call, tool invocation, and user interaction as a nested trace, adds latency and cost math, and lets you score outputs manually or with LLM-as-judge. It’s open-source, self-hostable, and drops into Python or JS apps with a few lines of code.
What it is
A Next.js + Postgres + ClickHouse stack (plus Redis for queuing and caching, and S3-compatible object storage). SDKs for Python and TypeScript send events to the ingestion API, which populates traces made of spans, generations, and events. The UI renders traces, aggregates metrics, and runs dataset-based evals against prompt versions.
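The trace model can be pictured as plain nested data. This is a simplified, hypothetical sketch of the shape — not the SDK's actual classes — showing how latency rolls up across a tree of observations:

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    # A span, generation, or event inside a trace (simplified sketch).
    type: str                 # "SPAN" | "GENERATION" | "EVENT"
    name: str
    latency_ms: float = 0.0
    children: list = field(default_factory=list)

@dataclass
class Trace:
    name: str
    observations: list = field(default_factory=list)

    def total_latency_ms(self):
        # Roll up latency across the nested tree, as the UI does per trace.
        def walk(obs):
            return obs.latency_ms + sum(walk(c) for c in obs.children)
        return sum(walk(o) for o in self.observations)

trace = Trace("chat-request", [
    Observation("SPAN", "retrieval", 120.0, [
        Observation("EVENT", "cache-miss"),
    ]),
    Observation("GENERATION", "gpt-answer", 850.0),
])
print(trace.total_latency_ms())  # 970.0
```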
Install / set up
# self-host
git clone https://github.com/langfuse/langfuse
cd langfuse
docker compose up -d

# or use cloud
pip install langfuse
First run
Sign up at cloud.langfuse.com or your self-hosted URL, create a project, and copy the public and secret keys. Instrument your app: one decorator or one @observe() on the entry function is enough to start seeing nested traces in the dashboard.
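The SDK reads the keys from environment variables, so export them before running your app (placeholder values shown — substitute your own keys and host):

```shell
# Keys from your Langfuse project settings
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
# Point at your self-hosted URL; omit for cloud.langfuse.com
export LANGFUSE_HOST="https://your-langfuse.example.com"
```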
$ python
from langfuse.decorators import observe
@observe()
def ask(q): return llm.invoke(q)
ask("hello")
# trace appears in UI
Everyday workflows
- Use Prompts to version system prompts in Langfuse and pull them at runtime with langfuse.get_prompt().
- Create Datasets from production traces, then run Evals against new prompt versions before promoting them.
- Score traces in the UI (thumbs up/down) or automatically via LLM-as-judge templates to track quality over time.
Gotchas and tips
ClickHouse is required as of v3 and it’s heavier than the old Postgres-only stack. If you self-host on a small VM, the ingestion worker can fall behind and traces arrive minutes late. Size your instance for ClickHouse, not for Next.js.
Cost accuracy depends on token counts. If your model provider doesn’t return usage, Langfuse estimates from the text — fine for ballpark, bad for billing. Always pass usage explicitly when you call langfuse.generation() manually so your cost dashboards aren’t fiction.
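The difference shows up in the numbers. A sketch of the two paths with made-up per-token prices (substitute your provider's real rates; the ~4-characters-per-token heuristic is a common rough estimate, not what Langfuse itself uses):

```python
# Hypothetical prices in USD per 1M tokens -- substitute real rates.
PRICE = {"input": 3.00, "output": 15.00}

def cost_from_usage(usage):
    # Exact cost when the provider returns token counts.
    return (usage["input_tokens"] * PRICE["input"]
            + usage["output_tokens"] * PRICE["output"]) / 1_000_000

def cost_from_text(prompt, completion):
    # Rough fallback: ~4 characters per token. Ballpark only, never billing.
    est_in = len(prompt) / 4
    est_out = len(completion) / 4
    return (est_in * PRICE["input"] + est_out * PRICE["output"]) / 1_000_000

exact = cost_from_usage({"input_tokens": 1200, "output_tokens": 400})
print(round(exact, 6))  # 0.0096
```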
Who it’s for
Any team shipping an LLM feature to real users. The moment you have more than 10 daily conversations and someone asks “is it getting better or worse?”, you need Langfuse or something like it.