How to Use Langfuse
Monitor AI apps by tracing prompts, scoring outputs, and running evals via the Langfuse SDK. Deploy the open-source dashboard for free in your environment.
Langfuse is the observability layer your LLM app should have had from day one — traces, scores, prompts, and evals in one OSS stack.
Langfuse solves the “why did my chatbot say that?” problem. It captures every LLM call, tool invocation, and user interaction as a nested trace, adds latency and cost math, and lets you score outputs manually or with LLM-as-judge. It’s open-source, self-hostable, and drops into Python or JS apps with a few lines of code.
What it is
Langfuse is a Next.js web app backed by Postgres and ClickHouse, with Redis for queueing and caching and S3-compatible storage for raw event blobs. SDKs for Python and TypeScript send events to the ingestion API, which assembles them into traces made of spans, generations, and events. The UI renders traces, aggregates metrics, and runs dataset-based evals against prompt versions.
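To make the trace structure concrete, here is a minimal sketch of a nested trace as a plain Python data structure. The `Observation` class, its fields, and the sample values are illustrative assumptions, not Langfuse's actual schema; the point is only that spans, generations, and events nest, and that cost can be rolled up from the leaves.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Observation:
    """Hypothetical stand-in for one node in a trace (not the Langfuse schema)."""
    name: str
    kind: str  # "span", "generation", or "event"
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    children: List["Observation"] = field(default_factory=list)

    def total_cost(self) -> float:
        """Sum this node's cost and the cost of everything nested under it."""
        return self.cost_usd + sum(child.total_cost() for child in self.children)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Observation"]]:
        """Yield (depth, node) pairs in display order, parent before children."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Made-up example: one request that retrieves documents, calls a model, logs feedback.
trace = Observation("ask", "span", latency_ms=820.0, children=[
    Observation("retrieve-docs", "span", latency_ms=120.0),
    Observation("llm-call", "generation", latency_ms=650.0, cost_usd=0.0105),
    Observation("user-feedback", "event"),
])

for depth, obs in trace.walk():
    print("  " * depth + f"{obs.kind}: {obs.name}")
```

Only the generation carries a cost here, which mirrors the idea that model calls are where token usage and pricing attach.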
Install / set up
# self-host
git clone https://github.com/langfuse/langfuse
cd langfuse
docker compose up -d
# or use cloud
pip install langfuse
First run
Sign up at cloud.langfuse.com or open your self-hosted URL, create a project, and copy the public and secret keys. Then instrument your app: a single @observe() decorator on the entry function is enough to start seeing nested traces in the dashboard.
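The SDK reads the copied keys from the `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and `LANGFUSE_HOST` environment variables. A small pre-flight check like the sketch below can catch a missing key before you wonder why no traces show up. The helper names are hypothetical, and the default host is an assumption that applies to Langfuse Cloud; a self-hosted setup would point `LANGFUSE_HOST` at its own URL instead.

```python
import os

REQUIRED = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")
DEFAULT_HOST = "https://cloud.langfuse.com"  # assumed default; override for self-hosting


def missing_langfuse_settings(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


def langfuse_host(env=os.environ):
    """Return the configured host, falling back to the assumed cloud default."""
    return env.get("LANGFUSE_HOST", DEFAULT_HOST)


missing = missing_langfuse_settings()
if missing:
    print("Set these before instrumenting:", ", ".join(missing))
else:
    print("Langfuse keys found; sending traces to", langfuse_host())
```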
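$ python
# SDK v2 import path; newer SDK versions expose it as: from langfuse import observe
from langfuse.decorators import observe

@observe()
def ask(q):
    # llm stands for your own model client, created elsewhere in the app
    return llm.invoke(q)

ask("hello")
# trace appears in UI
Everyday workflows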
- Use Prompts to version system prompts in Langfuse and pull them at runtime with langfuse.get_prompt().
- Create Datasets from production traces, then run Evals against new prompt versions before promoting them.
- Score traces in the UI (thumbs up/down) or automatically via LLM-as-judge templates to track quality over time.
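As a rough sketch of the runtime prompt pull from the first workflow, the snippet below tries a Langfuse-managed prompt and falls back to a local template when keys are missing or the fetch fails. The prompt name `support-system`, the fallback text, and the helper functions are hypothetical. The Langfuse branch assumes a text prompt whose `compile()` returns a string, so treat it as an outline to check against your SDK version.

```python
import os
import re

# Local fallback template, used when Langfuse is not configured or unreachable.
FALLBACK_TEMPLATE = "You are a support assistant for {{product}}. Answer briefly."


def render(template: str, **variables: str) -> str:
    """Fill {{name}} placeholders; unknown placeholders are left untouched."""
    def substitute(match):
        key = match.group(1)
        return str(variables.get(key, match.group(0)))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def load_system_prompt(name: str, **variables: str) -> str:
    """Prefer the Langfuse-managed prompt, otherwise use the local template."""
    if os.environ.get("LANGFUSE_PUBLIC_KEY") and os.environ.get("LANGFUSE_SECRET_KEY"):
        try:
            from langfuse import Langfuse  # only imported when keys are present
            prompt = Langfuse().get_prompt(name)
            return prompt.compile(**variables)
        except Exception:
            pass  # any import, network, or lookup problem falls through to the fallback
    return render(FALLBACK_TEMPLATE, **variables)


print(load_system_prompt("support-system", product="Acme"))
```

Keeping a local fallback means a Langfuse outage degrades prompt freshness rather than taking the feature down.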
Gotchas and tips
ClickHouse is required as of v3 and it’s heavier than the old Postgres-only stack. If you self-host on a small VM, the ingestion worker can fall behind and traces arrive minutes late. Size your instance for ClickHouse, not for Next.js.
Cost accuracy depends on token counts. If your model provider doesn’t return usage, Langfuse estimates from the text — fine for ballpark, bad for billing. Always pass usage explicitly when you call langfuse.generation() manually so your cost dashboards aren’t fiction.
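To illustrate why explicit usage matters, here is a minimal sketch that turns provider-reported token counts into a usage record and a cost figure. The price table, the model name `example-model`, and the record's field names are assumptions for illustration; real prices and the exact usage format come from your provider and your SDK version.

```python
# Hypothetical prices in USD per 1,000 tokens; replace with your provider's real rates.
PRICES_PER_1K = {
    "example-model": {"input": 0.005, "output": 0.015},
}


def build_usage(prompt_tokens: int, completion_tokens: int) -> dict:
    """Package provider-reported token counts as an explicit usage record."""
    return {
        "input": prompt_tokens,
        "output": completion_tokens,
        "total": prompt_tokens + completion_tokens,
        "unit": "TOKENS",
    }


def cost_usd(model: str, usage: dict) -> float:
    """Compute cost from exact token counts and the price table above."""
    price = PRICES_PER_1K[model]
    return (usage["input"] * price["input"] + usage["output"] * price["output"]) / 1000


usage = build_usage(prompt_tokens=1200, completion_tokens=300)
print(usage["total"], "tokens,", round(cost_usd("example-model", usage), 4), "USD")

# With a configured client, this record would be attached to a manual generation,
# for example: langfuse.generation(name="answer", model="example-model", usage=usage)
# The keyword name is an assumption that varies by SDK version; check the SDK reference.
```

Because the counts come straight from the provider response, the resulting cost does not depend on any text-based estimate.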
Who it’s for
Any team shipping an LLM feature to real users. The moment you have more than 10 daily conversations and someone asks “is it getting better or worse?”, you need Langfuse or something like it.