How to Use Hermes Models
Running Hermes 3, system prompt tricks, function calling JSON format, getting the most out of the Llama-based tune.
Hermes is Nous Research’s family of open-weight fine-tunes built on top of Meta’s Llama base models. This guide covers what Hermes 3 is actually good at, how to pick a size, and how to run it locally alongside your existing LLM stack.
What Hermes models are
Hermes 3 is Nous Research’s flagship fine-tune series, released in sizes matching the Llama 3.1 base models (8B, 70B, and 405B). The tuning emphasizes instruction following, function calling, structured outputs, long-context reliability, and steerability: Hermes models tend to refuse less than stock Llama-Instruct and follow system prompts more literally.
The weights inherit Meta’s Llama 3.1 license, so you can use them commercially under the usual Llama terms. Nous publishes them on Hugging Face under NousResearch/Hermes-3-Llama-3.1-*.
Picking the right size
Choose based on your hardware and task:
- Hermes 3 8B — runs on a 16GB laptop at Q4. Good agent/assistant quality, better function-calling than stock Llama 3.1 Instruct.
- Hermes 3 70B — needs serious hardware (48GB+ VRAM at Q4, or a Mac Studio with sufficient unified memory). Competitive with frontier open models on reasoning.
- Hermes 3 405B — datacenter-only. Expect a multi-GPU node, or heavy quantization across an H100 cluster.
For most local use cases, start with the 8B. It is the pragmatic sweet spot and ships with the same function-calling and structured-output training as its larger siblings.
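Before downloading, you can sanity-check whether a size fits your hardware by estimating its quantized memory footprint from the parameter count. The sketch below is a rough rule of thumb, assuming ~4.5 bits per weight for Q4_K_M plus a flat allowance for the KV cache and runtime buffers; both constants are ballpark figures, not exact values.

```python
def est_gguf_memory_gb(params_billion: float,
                       bits_per_weight: float = 4.5,
                       overhead_gb: float = 1.5) -> float:
    """Ballpark RAM/VRAM for a quantized GGUF model.

    bits_per_weight ~4.5 approximates Q4_K_M; overhead_gb is a rough
    allowance for KV cache and runtime buffers, not a measured figure.
    """
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

for size in (8, 70, 405):
    print(f"Hermes 3 {size}B @ Q4_K_M: ~{est_gguf_memory_gb(size):.0f} GB")
```

The 8B lands around 6 GB, which is why it fits comfortably on a 16GB laptop, while the 70B estimate is consistent with the 48GB+ VRAM guidance above.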
Running Hermes locally
With Ollama, pull a community GGUF port (or roll your own via llama.cpp’s converter):
ollama pull hermes3:8b
ollama run hermes3:8b "You are a terse code reviewer. Review this function: ..."
With llama.cpp directly, download a GGUF and serve it:
huggingface-cli download bartowski/Hermes-3-Llama-3.1-8B-GGUF \
  Hermes-3-Llama-3.1-8B-Q4_K_M.gguf --local-dir ./models
./build/bin/llama-server -m ./models/Hermes-3-Llama-3.1-8B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -c 8192 -ngl 99
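Once llama-server is up, it exposes an OpenAI-compatible /v1/chat/completions endpoint. A minimal stdlib-only client sketch, assuming the host, port, and system prompt from the example above (the model field is required by the API shape, but llama-server serves whichever model it loaded):

```python
import json
import urllib.request

def build_chat_request(base_url: str, system: str, user: str,
                       temperature: float = 0.4) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for llama-server."""
    payload = {
        "model": "hermes-3-8b",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8080",
                         "You are a terse code reviewer.",
                         "Review this function: ...")
# With the server running, send it and read the reply:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```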
Using function calling and structured outputs
Hermes 3 uses a specific tool-call format that it was trained on. It emits calls wrapped in <tool_call>...</tool_call> XML tags with JSON payloads. The model card spells out the exact system prompt template — read it before building an agent on top.
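If you build your own agent loop, you will need to pull those calls back out of the raw completion. A small parser sketch, assuming the documented <tool_call> wrapper with a JSON body (the "name"/"arguments" field names follow the Hermes model card; verify against the version you deploy):

```python
import json
import re

# Non-greedy match over a JSON object between the Hermes tool-call tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(completion: str) -> list[dict]:
    """Return every JSON payload wrapped in <tool_call> tags."""
    calls = []
    for match in TOOL_CALL_RE.finditer(completion):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # skip malformed payloads rather than crash the agent loop
    return calls

sample = (
    "Let me check the weather.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Berlin"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(sample))
# → [{'name': 'get_weather', 'arguments': {'city': 'Berlin'}}]
```

Silently skipping malformed payloads is a design choice; in production you may prefer to re-prompt the model with the parse error instead.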
For strict JSON output, combine a clear system prompt with a GBNF grammar passed via llama.cpp’s --grammar or --grammar-file flag to constrain decoding. You will get dramatically more reliable structured outputs than relying on the model alone:
./build/bin/llama-cli -m ./models/hermes-3-8b.gguf \
  --grammar-file json.gbnf \
  -p "Extract name and age as JSON from: 'Sam is 34.'"
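For this particular extraction task, json.gbnf could be as narrow as the following sketch (GBNF syntax per llama.cpp’s grammars documentation; it constrains decoding to a single object with exactly a "name" string and an "age" integer):

```gbnf
root   ::= "{" ws "\"name\":" ws string "," ws "\"age\":" ws number ws "}"
string ::= "\"" [a-zA-Z .'-]* "\""
number ::= [0-9]+
ws     ::= [ \t\n]*
```

llama.cpp also ships a general-purpose json.gbnf in its grammars/ directory if you want any valid JSON rather than a fixed schema.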
Sampling settings that matter
Hermes benefits from slightly lower temperatures than stock Llama for agentic work. Try temperature=0.4, top_p=0.9, and a mild repeat penalty of 1.05 as a starting point. For creative writing, push temperature up to 0.8–1.0. Context length is inherited from Llama 3.1, so 128k is supported on paper, but quality degrades past ~32k unless your hardware can fit the full KV cache.
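Those starting points map directly onto llama-cli’s --temp, --top-p, and --repeat-penalty flags, or onto request fields for the server API. A tiny helper encoding the suggestions above; the task names and exact values are this guide’s recommendations, not anything Hermes enforces (repeat_penalty is llama.cpp’s native parameter name, not part of the OpenAI spec):

```python
def hermes_sampling(task: str) -> dict:
    """Suggested sampling parameters per task type (starting points only)."""
    presets = {
        # lower temperature keeps tool calls and JSON output on-rails
        "agentic":  {"temperature": 0.4, "top_p": 0.9, "repeat_penalty": 1.05},
        # creative writing tolerates, and benefits from, more randomness
        "creative": {"temperature": 0.9, "top_p": 0.9, "repeat_penalty": 1.05},
    }
    return presets[task]

print(hermes_sampling("agentic"))
```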
When Hermes is the wrong choice
If you are doing code-specific work, Qwen 2.5 Coder or DeepSeek-Coder V2 usually beat Hermes at the same size. If you want the absolute most refusal-free chat model, there are more specialized fine-tunes — though they come with their own risks. For general-purpose assistants, agents, and function-calling workloads on open weights, Hermes 3 is a strong, well-supported default.