How to Set Up a Hyperspace Pod
Pool the laptops you already own into one peer-to-peer AI cluster. Automatic model sharding, an OpenAI-compatible endpoint, and free local inference — set up in 4 commands.
Hyperspace Pods lets a small group of people — a family, a startup, a few friends — pool the laptops and desktops they already own into one shared AI cluster. Everyone installs the CLI, someone creates a pod, shares an invite link, and the machines form a peer-to-peer mesh. Models that need more memory than any single laptop has — Qwen 3.5 32B, GLM-5 Turbo — get sharded across the group automatically.
From the outside it looks like one OpenAI-compatible API endpoint with a pk_* key that drops straight into your IDE, your agent, or any tool that already speaks the OpenAI protocol. No configuration beyond pasting the key and changing the base URL. This guide walks you through what Pods are, how the sharding actually works, and the four commands to set one up. Pods ship today in Hyperspace v5.19.
What a Pod actually is
A pod is a group of machines running the Hyperspace CLI that have agreed to share inference work. Membership, API keys, and the shared treasury are replicated across every member using Raft consensus, so there's no central server to depend on. If your internet drops, the pod keeps running on your local network. If a member's laptop closes, the pod reshards around it.
This is the part most people don't believe at first: there really is no middleman. A prompt leaves your machine, hops between your pod members' machines, and the response comes back the same way. The coordinator picks the routing, but the data plane is direct.
How automatic sharding works
You don't configure layer ranges or budget VRAM by hand. You tell the pod which model you want — say qwen-3.5-32b — and the coordinator inspects every member's free memory, splits the model's layers in proportion, and pipelines tokens through the resulting ring.
Two things are worth noticing. First, machines with more VRAM carry more layers, so a beefy desktop and a thin laptop coexist gracefully — the desktop just pulls more weight. Second, because inference is pipelined, each machine works on a different token at the same time; throughput is gated by the slowest machine in the ring, not the fastest.
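As a mental model, here is a minimal sketch of that proportional split. The coordinator's real algorithm is internal to Hyperspace; the function, member names, and memory figures below are illustrative assumptions, not the actual implementation.
# Illustrative sketch only: how a coordinator might split a model's layers
# in proportion to each member's free memory. Member names and sizes are made up.
def shard_layers(total_layers: int, free_gb: dict[str, float]) -> dict[str, int]:
    """Assign each member a number of layers proportional to its free memory."""
    total_free = sum(free_gb.values())
    shares = {m: total_layers * gb / total_free for m, gb in free_gb.items()}
    assignment = {m: int(s) for m, s in shares.items()}
    # Hand any leftover layers to the members with the largest remainders.
    leftover = total_layers - sum(assignment.values())
    for m in sorted(shares, key=lambda m: shares[m] - assignment[m], reverse=True)[:leftover]:
        assignment[m] += 1
    return assignment

# A hypothetical five-member pod loading a 64-layer model:
print(shard_layers(64, {"desktop": 24, "laptop-a": 16, "laptop-b": 16, "laptop-c": 8, "mini": 8}))
# {'desktop': 22, 'laptop-a': 14, 'laptop-b': 14, 'laptop-c': 7, 'mini': 7}
The 24 GB desktop ends up carrying roughly a third of the layers while the 8 GB machines carry the least, which is the "beefy desktop pulls more weight" behaviour described above.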
Setting up a pod (4 commands)
The same install works on macOS, Linux, and Windows under WSL2. Bring a machine with at least 16 GB of unified memory or 8 GB of dedicated VRAM to the table; smaller machines can still join, but they'll get fewer layers.
Step 1 — Install the CLI
# macOS / Linux
curl -fsSL https://hyperspace.ai/install | sh
# Windows (PowerShell, in WSL2)
irm https://hyperspace.ai/install.ps1 | iex
Step 2 — Create the pod
The first member runs pod create on the machine that will host the initial Raft leader. Any member can become the leader later; this is just where the pod is born.
$ hyperspace pod create --name "team-pod" --models qwen-3.5-32b,glm-5-turbo,gemma-4
Pod created.
Members: 1 (you)
Treasury: $0.00
API base: http://pod.local:5891/v1
API key: pk_live_4f9c…7a32 (shown once — copy it now)
Step 3 — Invite the others
The invite link is short-lived and bound to the pod's identity, not to a public URL. Members behind home routers don't need port forwarding; the network handles NAT traversal automatically.
$ hyperspace pod invite --uses 4 --expires 24h
https://hyperspace.ai/i/k7Q8e3-r2tL (4 uses · expires in 24h)
# On each invitee's machine:
$ hyperspace pod join https://hyperspace.ai/i/k7Q8e3-r2tL
Joined "team-pod". Detected 24 GB free VRAM.
Coordinator is resharding qwen-3.5-32b… done.
Step 4 — Point your tools at it
The pod exposes an OpenAI-compatible endpoint. Anything that speaks that protocol — the OpenAI Python SDK, your IDE, an agent SDK, a code-completion plugin, a custom script — works without a code change.
from openai import OpenAI
client = OpenAI(
    base_url="http://pod.local:5891/v1",
    api_key="pk_live_4f9c…7a32",
)
resp = client.chat.completions.create(
    model="qwen-3.5-32b",
    messages=[{"role": "user", "content": "Refactor this function for clarity…"}],
)
print(resp.choices[0].message.content)
Why this changes the cost math
A team of five paying for cloud AI typically burns $500–$2,000 a month on API calls. The same team's existing machines can serve Qwen 3.5 (competitive on SWE-bench) and GLM-5 Turbo (#1 on BrowseComp for tool-calling and web research) at the marginal cost of electricity. For day-to-day work — code review, refactors, research, drafting — local models handle it and nobody gets billed.
When a query genuinely needs a frontier model nobody has locally, the pod falls back to the cloud at wholesale rates from the shared treasury. So you don't have to choose between "all local" and "all cloud" — you get a sensible default of local, with a wholesale escape hatch for the 5% of queries that actually need it. If you want a reference point for which models punch above their weight, our AI model compare tool lays out the current benchmarks side by side.
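To sanity-check the cost claim for your own team, the arithmetic is short. Every number below (monthly token volume, blended cloud price, power draw, hours under load, electricity rate) is an assumption to swap for your own figures:
# Back-of-envelope comparison; every constant here is an assumption to replace.
monthly_tokens = 150_000_000       # tokens a five-person team might push per month
cloud_price_per_m = 8.00           # blended $ per million tokens on a cloud API
cloud_bill = monthly_tokens / 1_000_000 * cloud_price_per_m

pod_watts = 5 * 60                 # five machines drawing ~60 W extra under load
hours_busy = 160                   # hours of actual inference per month
kwh_rate = 0.15                    # $ per kWh
electricity = pod_watts / 1000 * hours_busy * kwh_rate

print(f"cloud: ${cloud_bill:,.0f}/mo  vs  pod electricity: ${electricity:,.2f}/mo")
# cloud: $1,200/mo  vs  pod electricity: $7.20/mo
Even if your real numbers are several times higher on the electricity side, the gap stays two orders of magnitude wide.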
A practical three-model setup
Most pods we see in the wild settle on the same three models, each doing the job it's best at:
- Qwen 3.5 32B for code and reasoning — the workhorse for refactors, code review, and long-form thinking.
- GLM-5 Turbo for browsing and research — the current leader on BrowseComp for tool-calling and web research.
- Gemma 4 for fast lightweight tasks — autocomplete, small classifications, anything you'd hate waiting two seconds for.
All three load simultaneously on a five-machine pod with mixed hardware. The coordinator routes each request to the model the caller asked for; no juggling.
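In code, that routing is just a matter of choosing the model name per request. A small sketch using the same pod endpoint as Step 4; the task labels and the mapping are a convention for your own code, not something Hyperspace defines or enforces:
from openai import OpenAI

# Same pod endpoint and key as Step 4.
client = OpenAI(base_url="http://pod.local:5891/v1", api_key="pk_live_4f9c…7a32")

# Route each kind of task to the model it's best at.
MODEL_FOR = {
    "code": "qwen-3.5-32b",     # refactors, reviews, long-form reasoning
    "research": "glm-5-turbo",  # browsing and tool-calling work
    "quick": "gemma-4",         # autocomplete, small classifications
}

def ask(task: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL_FOR[task],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("quick", "Classify this commit message as feat, fix, or chore: 'bump deps'"))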
What makes this different
- No middleman. Prompts travel from your IDE to your pod members' hardware and back. There is no server in between reading your data.
- No vendor lock-in. Pod membership, API keys, and treasury are replicated across your own machines using Raft. If the internet goes down, the local network keeps working. There is no database in someone else's cloud that your pod depends on.
- Automatic sharding. You don't configure layer ranges or calculate VRAM budgets. Tell the pod which model you want; it figures out how to split it across whatever hardware is online.
- Real NAT traversal. Your friend behind a home router with a dynamic IP? Works. No VPN, no Tailscale, no port forwarding.
- Free when local. Cloud bills scale with usage. Pod inference on local hardware scales with nothing. The marginal cost of your 10,000th prompt is the electricity your laptop was already using.
Treasury and the compute marketplace
The treasury is a shared balance that funds the rare cloud-fallback query when no local model is good enough. Any member can top it up; every spend is replicated to every member's ledger, so there are no surprise bills. When the pod is idle — overnight, weekends, while everyone's at lunch — you can rent its compute out on the Hyperspace marketplace and credit the treasury, with fine-grained permissions controlling who can use what.
Common mistakes
- Loading a model bigger than the pod's combined VRAM. The CLI will warn you, but if you bypass the warning the pod falls back to disk-swapped layers and throughput collapses. Pick a model that fits.
- Putting the leader on the laptop that closes most often. Raft re-elects fine, but you'll see a 2–3 second hiccup on every leader change. Promote the desktop that stays on.
- Forgetting the cloud fallback is opt-in per call. Pass fallback="auto" in your request only when you actually want it (see the sketch after this list). Otherwise the pod returns the local model's answer and the treasury stays put.
- Counting tokens by eyeballing prompt length. Local is free but latency isn't — long prompts still slow the pipeline. Run your prompts through our AI token counter first if you're tuning for speed.
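For that fallback flag, the OpenAI Python SDK can forward extra request fields through its extra_body parameter. A minimal sketch: the guide names the fallback parameter, but whether the pod reads it from the request body (rather than, say, a header) is an assumption worth checking against the CLI docs.
from openai import OpenAI

client = OpenAI(base_url="http://pod.local:5891/v1", api_key="pk_live_4f9c…7a32")

# Opting into cloud fallback for a single call. Assumption: the pod reads a
# "fallback" field from the JSON request body; the OpenAI SDK forwards
# unknown fields via extra_body.
resp = client.chat.completions.create(
    model="qwen-3.5-32b",
    messages=[{"role": "user", "content": "Survey recent work on speculative decoding."}],
    extra_body={"fallback": "auto"},  # omit this line and the request stays strictly local
)
print(resp.choices[0].message.content)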
When a Pod isn't the right answer
If you're a solo developer with one laptop and no friends to pool with, a single-node setup like Ollama or LM Studio is simpler and gives you the same local-first benefits without the coordination overhead. Pods earn their keep when there are at least three machines pooling — that's where automatic sharding and the shared treasury start paying for themselves.
Frequently asked questions
Does my prompt go through Hyperspace's servers?
No. The data plane is fully peer-to-peer — your prompt travels directly from your machine to your pod members' machines and back. Hyperspace's infrastructure helps with NAT traversal and the initial handshake but never sees the prompt or response.
What happens if a pod member's laptop goes to sleep mid-request?
The coordinator detects the dropped node, reshards the model across the remaining members, and retries the in-flight request. You typically see a 1–3 second hiccup; the request completes successfully.
Can I use a Hyperspace Pod with Cursor, Claude Code, or my OpenAI-SDK script?
Yes. The pod exposes an OpenAI-compatible endpoint, so any tool that lets you set a base URL and API key works without modification — paste the pod's URL and pk_* key and you're done.
How big a model can a pod actually run?
Add up the free VRAM across your members. A pod of five mid-range machines can comfortably run 32B-parameter models like Qwen 3.5 32B; larger setups have run 70B-class models. The CLI tells you what fits before you load it.
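If you want to ballpark it yourself before the CLI weighs in, a rough rule is parameters times bytes per weight, plus some headroom. A sketch with assumed numbers; the 4-bit quantization and the 20% margin for KV cache and activations are assumptions, not Hyperspace specifics:
# Rough fit check: does a quantized model fit in the pod's combined free VRAM?
def fits(params_b: float, free_vram_gb: list[float], bits: int = 4, overhead: float = 1.2) -> bool:
    model_gb = params_b * bits / 8 * overhead   # e.g. 32B at 4-bit ≈ 16 GB plus headroom
    return model_gb <= sum(free_vram_gb)

print(fits(32, [8, 8, 6, 6, 4]))   # five mid-range machines, 32 GB free total -> True
print(fits(70, [8, 8, 6, 6, 4]))   # a 70B-class model needs a bigger pod -> False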
Is the pod actually free, or is there a Hyperspace subscription?
Pods themselves are free — the CLI is open and the inference happens on hardware you already own. The shared treasury only spends money when you opt into a cloud-fallback call to a frontier model your pod can't run locally, and those are billed at wholesale rates.