
How to Combine Laptops to Run Large LLMs

Pool laptops into a cluster that runs Llama 70B or Qwen 72B. Compare Hyperspace, exo, llama.cpp RPC, and Petals — pick the right one for your network.

Updated May 2026 · 6 min read

A single laptop with 16 GB of RAM can run a 7B model and feel snappy. It cannot run Llama 3.3 70B or Qwen 3.5 72B. The fix isn’t a $5,000 GPU upgrade — it’s pooling the machines you already own. With the right runtime, three or four laptops can cooperatively load a model that none of them could hold alone, and serve it at the speed of the slowest one in the ring.


What “combine laptops” actually means

Modern open-weight LLMs are stored as a stack of transformer layers, and adjacent layers only talk to each other through activations: layer 12 doesn’t need to live on the same machine as layer 11; it just needs to receive layer 11’s output and pass its own along. Distributed inference runtimes exploit that: they slice the model into shards, hand each shard to a different machine, and pipeline tokens through the resulting ring. The technique is called pipeline parallelism, and it works on any commodity network (Wi-Fi, gigabit Ethernet, a Thunderbolt bridge).

You don’t need identical machines. A 64-GB Mac Studio, a 32-GB ThinkPad, and a 16-GB MacBook Air can all join the same pod — the bigger machines just carry more layers. Your bottleneck becomes the slowest member, not the smallest.

The four runtimes worth knowing in 2026

1. Hyperspace pods (easiest, OpenAI-compatible)

Hyperspace is a peer-to-peer pod manager: one machine runs pod create, everyone else joins with an invite link, and the resulting cluster exposes an OpenAI-compatible HTTP endpoint. Tools that already speak the OpenAI protocol — Cursor, Continue.dev, the OpenAI Python SDK, custom agents — work without code changes. NAT traversal is automatic, so members behind home routers don’t need port forwarding. See how to set up a hyperspace pod for the full walkthrough.
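
In practice, “no code changes” mostly means swapping the base URL that OpenAI clients already read. A minimal sketch for anything built on the OpenAI Python SDK, assuming the pod is reachable at the http://pod.local:5891/v1 address used later in this guide and that it doesn’t enforce API keys:

# point OpenAI-SDK scripts and tools at the pod instead of api.openai.com
export OPENAI_BASE_URL="http://pod.local:5891/v1"
export OPENAI_API_KEY="local-pod"   # placeholder; most local endpoints ignore the key

Editors like Cursor and Continue.dev typically take the same base URL and key through their own model settings instead of environment variables.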

2. exo (terminal-first, Apple Silicon shines)

exo (from exo Labs) is an open-source distributed inference engine that auto-discovers machines on your local network and shards models across them by available memory. It runs on macOS, Linux, iPhone, iPad, and Android, and it’s especially fast on unified-memory Apple Silicon because there’s no copy across PCIe. Single command to start a node:

# install
pip install exo

# on every machine on the same Wi-Fi network:
exo

The first node prints a localhost API endpoint that speaks OpenAI’s wire format. Pull a model with exo run llama-3.1-70b and exo decides which layers go where based on the cluster topology it discovered.

3. llama.cpp RPC (most control, lowest dependencies)

llama.cpp ships a built-in rpc-server mode that turns any machine into a shard worker. The pattern is: start rpc-server on each helper machine, then start llama-server on the “leader” with --rpc 192.168.1.10:50052,192.168.1.11:50052 pointing at the helpers. The leader stripes layers across helpers automatically. No central registry, no daemon to run — just two binaries and a list of IP addresses. Works on every platform llama.cpp supports. Worth reading how to use llama.cpp first if you haven’t.
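
A minimal sketch of that pattern, assuming a llama.cpp build with the RPC backend enabled and the two helper addresses above; the model path and layer count are placeholders, and flag spellings can drift between releases, so check --help on your build:

# on each helper machine: turn it into a shard worker (50052 is the usual default port)
./rpc-server -H 0.0.0.0 -p 50052

# on the leader: load the model, point at the helpers, and push layers out to them
./llama-server -m ./llama-3.3-70b-q4_k_m.gguf \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  -ngl 99 --port 8080

llama-server then exposes its usual OpenAI-compatible endpoint on the leader, so clients talk to port 8080 and never know the layers live elsewhere.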

4. Petals (truly distributed across the internet)

Petals is a BitTorrent-style network for LLMs: anyone can contribute spare compute, anyone can join and run inference against a model that’s currently loaded across the swarm. It’s the right choice if you want to run a 405B model and you’re OK with multi-second per-token latency from public-network hops. Not the right choice for low-latency local pods on the same LAN.
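
Contributing spare compute is the shell-level part; a sketch assuming a recent Petals release, with the model ID left as a placeholder for whatever the public swarm currently hosts:

# join the public swarm as a compute contributor
pip install petals
python -m petals.cli.run_server <model-id>

Running inference against the swarm typically goes through Petals’ transformers-style Python API rather than an OpenAI-compatible endpoint, which is another reason it suits batch experiments better than latency-sensitive pods.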

Choosing between them

  • Team of 3–8 sharing a pod that needs an OpenAI API surface: Hyperspace
  • Apple-Silicon-heavy household, casual use: exo
  • Maximum control, custom quantization, mixed Linux/Windows: llama.cpp RPC
  • Something larger than any private pool can hold: Petals
  • Single machine with 64+ GB and a good GPU: no clustering, just use Ollama

How big a model can you actually fit?

Total available memory across the cluster has to exceed the model’s on-disk size at your chosen quantization, plus context-window overhead. Rough rules:

  • Q4_K_M (4-bit): model size in GB ≈ parameters in billions × 0.6. Llama 3.3 70B at Q4 is ~42 GB.
  • Q5_K_M (5-bit): ~0.75× parameters. 70B is ~52 GB.
  • Q8_0 (8-bit): ~1× parameters. 70B is ~70 GB.
  • FP16 (16-bit): ~2× parameters. 70B is ~140 GB.
  • Context overhead: add ~0.5 GB per 1k tokens of context window for KV cache.

Use the LLM context window calculator to size the KV cache for a specific config, and the AI cost estimator to compare a self-hosted pod against equivalent cloud-API spend. A 5-person team with three 32-GB laptops and one 64-GB desktop can comfortably host a 70B model at Q4 for the cost of electricity.
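
Applying the rules above by hand is easy; a back-of-the-envelope sketch with the parameter count and context length as the only inputs:

# rough sizing for a 70B model at Q4_K_M with a 16k-token context
PARAMS_B=70                          # parameters, in billions
CTX_K=16                             # context window, in thousands of tokens
MODEL_GB=$(( PARAMS_B * 6 / 10 ))    # Q4_K_M: ~0.6 GB per billion parameters
KV_GB=$(( CTX_K / 2 ))               # KV cache: ~0.5 GB per 1k tokens
echo "need roughly $(( MODEL_GB + KV_GB )) GB free across the cluster"

That lands at roughly 50 GB, comfortably inside the three-laptop-plus-desktop cluster above once each machine keeps a few gigabytes back for the OS.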

Network: the part most guides skip

Pipeline parallelism shuttles activations between layers across the network on every token. The tensor sizes are small (typically 4–16 KB per token at 8B–70B scales), so latency hurts more than bandwidth. In rough order of best-to-worst:

  • Thunderbolt 4 / USB-C bridge (40 Gbps): two laptops cabled directly feel like one machine. Best for two-node pods.
  • 2.5 GbE / 10 GbE wired: ideal for 3-8 node home setups; zero token-rate hit.
  • 1 GbE wired: fine for any current open-weight model up to 70B.
  • Wi-Fi 6 (5 GHz, line-of-sight): usable for 7B-13B models; 70B is slow but works.
  • Wi-Fi over multiple walls: expect 2–3 tokens/sec at best.
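
If you’re not sure where your links fall on that list, measure them before touching the runtime; ping covers latency and iperf3 covers throughput. The address below is the helper IP used earlier and is only an example, and iperf3 has to be installed on both machines:

# latency from the leader to a helper (lower and steadier is better)
ping -c 10 192.168.1.10

# throughput: start `iperf3 -s` on the helper first, then from the leader:
iperf3 -c 192.168.1.10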

Quick troubleshooting

  • One machine’s fans scream. Layers are unevenly assigned. Most runtimes expose a per-node memory cap or split-weighting option (llama.cpp, for example, has --tensor-split) that lets you bias the assignment.
  • Tokens stutter mid-generation. Usually network jitter on Wi-Fi or a background process pegging one node’s CPU. Wire that node, kill the process.
  • Cluster forms but inference fails. Usually a quantization mismatch: not every machine is holding the exact same weights. Pull the identical model file (same SHA) on every machine; a quick check follows this list.
  • A MacBook or Mac mini is the bottleneck despite extra RAM. Power and thermal caps kick in on small enclosures during sustained load. Run laptops on wall power rather than battery, and raise the machine for airflow.
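
For that quantization-mismatch case, the quickest sanity check is hashing the model file on every node and comparing the output; the path is a placeholder for wherever your runtime stores weights:

# run on every machine; the hashes must match exactly
sha256sum ~/models/llama-3.3-70b-q4_k_m.gguf
# on macOS, use: shasum -a 256 ~/models/llama-3.3-70b-q4_k_m.gguf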

What the workflow actually looks like

Once a pod is up, you stop thinking about it. Cursor, Continue.dev, your custom agent, or a script all point at http://pod.local:5891/v1 with the same OpenAI client they’d use against api.openai.com. The pod handles failover when a member sleeps, reshards on join/leave, and falls back to the cloud (at wholesale rates) for the 5% of requests that genuinely need a frontier model.
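
A quick smoke test from any pod member, using the standard OpenAI chat-completions request shape; the model name is a placeholder for whatever the pod has loaded:

curl http://pod.local:5891/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Say hello from the pod."}]
  }'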

For the deeper architecture — how Raft picks leaders, what the treasury does, how local-vs-cloud routing works — see the full hyperspace pod guide. To compare the open-weight options worth running on a pod, the AI model compare tool tracks current benchmarks.

Frequently asked questions

Can you really run a 70B model on regular laptops?

Yes. With pipeline parallelism (Hyperspace, exo, llama.cpp RPC), a 70B model at Q4 quantization (~42 GB on disk) can be sharded across three or four laptops with 16–32 GB of memory each. Each machine carries some layers; tokens flow through the ring. Throughput is bound by the slowest member, but quality matches a single big-machine run.

What's the difference between Hyperspace, exo, llama.cpp RPC, and Petals?

Hyperspace is the easiest team option — invite link, OpenAI-compatible endpoint, automatic NAT traversal. exo is open-source, terminal-first, and tuned for Apple Silicon. llama.cpp RPC gives maximum control with minimal dependencies. Petals is for running models too large to fit any private pool, using a public BitTorrent-style swarm.

Does pooling laptops require fast networking?

Less than you'd think. Pipeline parallelism shuttles small (~4–16 KB) tensors per token, so latency matters more than bandwidth. Wired 1 GbE handles 70B comfortably. Wi-Fi 6 works for casual use; Wi-Fi over multiple walls becomes the bottleneck at 2–3 tokens per second.

Do all the laptops need the same OS or hardware?

No. A pod can mix macOS, Linux, and Windows under WSL2; Apple Silicon, AMD, and Intel; 16 GB and 64 GB machines. The runtime auto-shards based on available memory — bigger machines just carry more layers.

