
How to Use LM Studio

Downloading models from the catalog, chatting in the UI, exposing an OpenAI-compatible server on port 1234, GPU offload.

Updated April 2026 · 6 min read

LM Studio is a desktop GUI for running local LLMs — download weights from a built-in Hugging Face browser, chat with them in a clean UI, and expose an OpenAI-compatible server on localhost. This guide covers a working setup on a typical developer laptop.


What LM Studio is

LM Studio is an Electron app that wraps llama.cpp (and optionally MLX on Apple Silicon) with a polished UI. It handles model discovery, downloads, GPU offload config, chat templates, and serving through a single window. If Ollama is the CLI/server experience, LM Studio is the desktop-client experience — and the two coexist fine on the same machine.

It is free for personal use. Commercial use requires filling out a form on their site; read the latest terms before shipping it to coworkers.

Install and first launch

Download the installer for macOS, Windows, or Linux from lmstudio.ai. On first launch it will ask which runtime to use — pick the CUDA build on NVIDIA, Metal on Apple Silicon, or the Vulkan/ROCm build on AMD. The app self-updates the runtime from within Settings.

Check the Hardware tab under Settings. It should detect your GPU and show available VRAM. If it does not, your drivers are likely out of date — fix that before loading a model.

Downloading and loading a model

Hit the magnifying-glass icon to open the model search. Type something like llama-3.1-8b-instruct and LM Studio surfaces GGUF quantizations from Hugging Face. Each result shows the download size and a green/yellow/red badge indicating whether it will fit in your combined RAM + VRAM.

For a 16GB MacBook, Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf is a good first pick. Download it, then click the Chat tab and select it from the top dropdown. The first load takes a few seconds while weights stream into GPU memory.
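You can sanity-check the fit badge with back-of-envelope arithmetic: quantized weight size is roughly parameter count times bits per weight. A rough sketch (the ~4.5 effective bits per weight for Q4_K_M is an approximation, and this ignores KV-cache and runtime overhead):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights alone, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Llama 3.1 8B at Q4_K_M, ~4.5 effective bits per weight
print(f"{quantized_size_gb(8.0, 4.5):.1f} GB")  # ~4.5 GB of weights
```

That leaves comfortable headroom on a 16GB machine once the OS and KV cache take their share.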

Using the local server

Click the green Developer tab on the left sidebar and toggle Status: Running. LM Studio now exposes an OpenAI-compatible API at http://localhost:1234/v1:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "ping"}]
  }'

From Python, use the OpenAI SDK with base_url="http://localhost:1234/v1" and any non-empty API key. Structured outputs and tool-calling work for models that were fine-tuned for them.
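If you would rather avoid installing the SDK, the same call works with the standard library alone. A minimal sketch — the model name must match whatever you have loaded in LM Studio, and `build_payload` is just a helper name for this example:

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"

def build_payload(prompt: str, model: str = "llama-3.1-8b-instruct") -> dict:
    """OpenAI-style chat-completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running:  print(chat("ping"))
```

No API key header is needed here; LM Studio accepts requests without one, though the OpenAI SDK insists you pass some non-empty string.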

GPU offload and context length

In the right-side configuration panel, the GPU Offload slider controls how many transformer layers run on the GPU. Push it to max if VRAM allows; if you hit an out-of-memory error at load time, back off a few layers. The Context Length field sets the KV-cache window — KV-cache memory grows linearly with context, and naive attention kernels can allocate additional quadratic scratch space on top, so start at 4096 and raise it only if you actually need more.
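The linear KV-cache cost is easy to estimate: two tensors (K and V) per layer, each context-length by KV-heads by head-dim. A rough sketch using illustrative Llama-3.1-8B-style dimensions (32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16 — these defaults are assumptions for the example, not values reported by LM Studio):

```python
def kv_cache_bytes(ctx_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Memory for the KV cache: K and V tensors for every layer."""
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem

for ctx in (4096, 32768):
    print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
```

At these dimensions a 4096-token window costs about 0.5 GiB, while 32K tokens costs about 4 GiB — which is why raising the context length "just in case" is an expensive default.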

Enable Flash Attention when available — it cuts memory and speeds up long contexts. On Apple Silicon, try the MLX runtime variants of models for measurably faster token throughput than GGUF.

When LM Studio is the wrong choice

LM Studio is great on a workstation but a bad fit for headless servers (it is a GUI app) and for automation pipelines where you want models defined in code. It is also closed-source, which matters if you need to audit the stack. For servers, use Ollama or llama.cpp directly. For desktop use and quickly A/B-testing models, LM Studio is the fastest path from zero to a running local LLM.

