How to Use Hermes Models
Running Hermes 3, system prompt tricks, function calling JSON format, getting the most out of the Llama-based tune.
Hermes is Nous Research’s family of open-weight fine-tunes built on top of Meta’s Llama base models. This guide covers what Hermes 3 is actually good at, how to pick a size, and how to run it locally alongside your existing LLM stack.
What Hermes models are
Hermes 3 is Nous Research’s flagship fine-tune series, released in sizes matching the Llama 3.1 base models (8B, 70B, and 405B). The tuning emphasizes instruction following, function calling, structured outputs, long-context reliability, and steerability: Hermes models tend to refuse less than stock Llama-Instruct and follow system prompts more literally.
The weights inherit Meta’s Llama 3.1 license, so you can use them commercially under the usual Llama terms. Nous publishes them on Hugging Face under NousResearch/Hermes-3-Llama-3.1-*.
Picking the right size
Choose based on your hardware and task:
- Hermes 3 8B — runs on a 16GB laptop at Q4. Good agent/assistant quality, better function-calling than stock Llama 3.1 Instruct.
- Hermes 3 70B — needs serious hardware (48GB+ VRAM at Q4, or a Mac Studio with sufficient unified memory). Competitive with frontier open models on reasoning.
- Hermes 3 405B — datacenter-only. Expect a multi-GPU node, or heavy quantization across an H100 cluster.
For most local use cases, start with the 8B. It is the pragmatic sweet spot and ships with the same function-calling and structured-output training as its larger siblings.
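Before downloading, you can sanity-check whether a size fits your hardware by estimating its quantized memory footprint from the parameter count. The sketch below is a rough rule of thumb, assuming ~4.5 bits per weight for Q4_K_M plus a flat allowance for the KV cache and runtime buffers; both constants are ballpark figures, not exact values.

```python
def est_gguf_memory_gb(params_billion: float,
                       bits_per_weight: float = 4.5,
                       overhead_gb: float = 1.5) -> float:
    """Ballpark RAM/VRAM for a quantized GGUF model.

    bits_per_weight ~4.5 approximates Q4_K_M; overhead_gb is a rough
    allowance for KV cache and runtime buffers, not a measured figure.
    """
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

for size in (8, 70, 405):
    print(f"Hermes 3 {size}B @ Q4_K_M: ~{est_gguf_memory_gb(size):.0f} GB")
```

The 8B lands around 6 GB, which is why it fits comfortably on a 16GB laptop, while the 70B estimate is consistent with the 48GB+ VRAM guidance above.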
Running Hermes locally
With Ollama, pull a community GGUF port (or roll your own via llama.cpp’s converter):
ollama pull hermes3:8b
ollama run hermes3:8b "You are a terse code reviewer. Review this function: ..."
With llama.cpp directly, download a GGUF and serve it:
huggingface-cli download bartowski/Hermes-3-Llama-3.1-8B-GGUF \
  Hermes-3-Llama-3.1-8B-Q4_K_M.gguf --local-dir ./models
./build/bin/llama-server -m ./models/Hermes-3-Llama-3.1-8B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -c 8192 -ngl 99
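Once llama-server is up, it exposes an OpenAI-compatible /v1/chat/completions endpoint. A minimal stdlib-only client sketch, assuming the host, port, and system prompt from the example above (the model field is required by the API shape, but llama-server serves whichever model it loaded):

```python
import json
import urllib.request

def build_chat_request(base_url: str, system: str, user: str,
                       temperature: float = 0.4) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for llama-server."""
    payload = {
        "model": "hermes-3-8b",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8080",
                         "You are a terse code reviewer.",
                         "Review this function: ...")
# With the server running, send it and read the reply:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```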
Using function calling and structured outputs
Hermes 3 uses a specific tool-call format that it was trained on. It emits calls wrapped in <tool_call>...</tool_call> XML tags with JSON payloads. The model card spells out the exact system prompt template — read it before building an agent on top.
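If you build your own agent loop, you will need to pull those calls back out of the raw completion. A small parser sketch, assuming the documented <tool_call> wrapper with a JSON body (the "name"/"arguments" field names follow the Hermes model card; verify against the version you deploy):

```python
import json
import re

# Non-greedy match over a JSON object between the Hermes tool-call tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(completion: str) -> list[dict]:
    """Return every JSON payload wrapped in <tool_call> tags."""
    calls = []
    for match in TOOL_CALL_RE.finditer(completion):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # skip malformed payloads rather than crash the agent loop
    return calls

sample = (
    "Let me check the weather.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Berlin"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(sample))
# → [{'name': 'get_weather', 'arguments': {'city': 'Berlin'}}]
```

Silently skipping malformed payloads is a design choice; in production you may prefer to re-prompt the model with the parse error instead.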
For strict JSON output, combine a clear system prompt with a GBNF grammar passed via llama.cpp’s --grammar or --grammar-file flag to constrain decoding. You will get dramatically more reliable structured outputs than relying on the model alone:
./build/bin/llama-cli -m ./models/hermes-3-8b.gguf \
  --grammar-file json.gbnf \
  -p "Extract name and age as JSON from: 'Sam is 34.'"
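For this particular extraction task, json.gbnf could be as narrow as the following sketch (GBNF syntax per llama.cpp’s grammars documentation; it constrains decoding to a single object with exactly a "name" string and an "age" integer):

```gbnf
root   ::= "{" ws "\"name\":" ws string "," ws "\"age\":" ws number ws "}"
string ::= "\"" [a-zA-Z .'-]* "\""
number ::= [0-9]+
ws     ::= [ \t\n]*
```

llama.cpp also ships a general-purpose json.gbnf in its grammars/ directory if you want any valid JSON rather than a fixed schema.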
Sampling settings that matter
Hermes benefits from slightly lower temperatures than stock Llama for agentic work. Try temperature=0.4, top_p=0.9, and a mild repeat penalty of 1.05 as a starting point. For creative writing, push temperature up to 0.8–1.0. Context length is inherited from Llama 3.1, so 128k is supported on paper, but quality degrades past ~32k unless your hardware can fit the full KV cache.
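Those starting points map directly onto llama-cli’s --temp, --top-p, and --repeat-penalty flags, or onto request fields for the server API. A tiny helper encoding the suggestions above; the task names and exact values are this guide’s recommendations, not anything Hermes enforces (repeat_penalty is llama.cpp’s native parameter name, not part of the OpenAI spec):

```python
def hermes_sampling(task: str) -> dict:
    """Suggested sampling parameters per task type (starting points only)."""
    presets = {
        # lower temperature keeps tool calls and JSON output on-rails
        "agentic":  {"temperature": 0.4, "top_p": 0.9, "repeat_penalty": 1.05},
        # creative writing tolerates, and benefits from, more randomness
        "creative": {"temperature": 0.9, "top_p": 0.9, "repeat_penalty": 1.05},
    }
    return presets[task]

print(hermes_sampling("agentic"))
```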
When Hermes is the wrong choice
If you are doing code-specific work, Qwen 2.5 Coder or DeepSeek-Coder V2 usually beat Hermes at the same size. If you want the absolute most refusal-free chat model, there are more specialized fine-tunes — though they come with their own risks. For general-purpose assistants, agents, and function-calling workloads on open weights, Hermes 3 is a strong, well-supported default.