
How to Share a GPU Across Machines

Expose one GPU host to your whole LAN — Ollama, vLLM, LM Studio. Tensor parallel vs pipeline parallel. Auth, throughput math, and 30-minute starter.

Updated May 2026 · 6 min read

A 24 GB RTX 4090 sitting idle in a desktop is wasted potential when there are three other machines on your LAN waiting on inference. Sharing GPUs across machines used to mean expensive InfiniBand fabric and DGX-class hardware. In 2026 the same outcome — multiple clients, one shared accelerator, sub-100 ms first-token latency — is a weekend project on commodity gear.


Three ways to share, ordered by what you’re trying to do

1. One GPU serves many clients (the 90% case)

You don’t want to combine GPUs — you want one fast GPU to serve a whole household or team. The right pattern is a model-serving daemon on the GPU host that exposes an HTTP endpoint everyone else points their tools at. Pick one of:

  • Ollama (easiest): single binary, OpenAI-compatible endpoint, one-line model pulls. OLLAMA_HOST=0.0.0.0:11434 ollama serve exposes it on the LAN. See how to use Ollama for setup.
  • vLLM (highest throughput): the right call for 5+ concurrent users. Continuous batching plus PagedAttention can serve 5–20× the tokens/sec of a naive single-stream server on the same hardware.
  • LM Studio server mode: if you want a GUI for the GPU host and an API for clients. Same OpenAI surface as Ollama.

All three speak the OpenAI HTTP wire format, so clients (Cursor, Continue.dev, custom scripts, agents) need only a base URL change to start using the shared GPU.
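
For instance, from any other machine on the LAN you can hit the shared endpoint with nothing but curl. The IP address and model tag below are placeholders; substitute your GPU host's address and whatever model it has pulled:

# Same OpenAI wire format, different base URL (192.168.1.10 and qwen2.5:7b are placeholders)
curl http://192.168.1.10:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen2.5:7b", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'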

2. One model split across multiple GPUs on the same machine

If you have two 24-GB cards in one box and want to run a 70B model in FP8 (~70 GB), that’s tensor parallelism. Both vLLM and TGI handle it natively:

# vLLM: split the model across both cards on this host
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 32768

The two GPUs need to be in the same chassis with NVLink or at least PCIe 4.0 x16 each. Splitting one model across PCIe between machines is technically possible but latency-prohibitive — do that with pipeline parallelism (next section) instead of tensor parallelism.
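
If you're not sure how the cards in a box are connected, nvidia-smi will tell you before you commit to a tensor-parallel launch:

# Print the GPU interconnect matrix: NV# means NVLink, PIX/PXB/PHB are PCIe paths,
# SYS means traffic crosses the CPU interconnect (the slowest option)
nvidia-smi topo -m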

3. One model split across multiple machines (pipeline parallelism)

Different machines on the same LAN can each hold a slice of a model and pipeline tokens through the resulting ring. This is what Hyperspace, exo, llama.cpp RPC, and Petals all do under different brand names. The deep dive lives in how to combine laptops to run large LLMs — the GPU-specific concern here is matching compute to memory on each shard so no node becomes the bottleneck.
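
To make the idea concrete, here is a minimal llama.cpp RPC sketch: one rpc-server per worker machine, and a llama-server on the head node that spreads layers across them. It assumes a llama.cpp build with the RPC backend enabled; exact flags vary by version, and the IPs, ports, and model file are placeholders.

# On each worker machine (llama.cpp built with -DGGML_RPC=ON)
./rpc-server -H 0.0.0.0 -p 50052

# On the head node: load the GGUF and push layers out to the workers
./llama-server -m llama-3.3-70b-q4_k_m.gguf \
    --rpc 192.168.1.11:50052,192.168.1.12:50052 \
    -ngl 99 --host 0.0.0.0 --port 8080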

Throughput math you should run before buying anything

The useful throughput of a serving stack is tokens per second per user multiplied by the number of concurrent users. Both numbers move with model size and request mix. Rough single-GPU ballpark on an RTX 4090 (24 GB), measured with vLLM:

Model                           1 user (tok/s)    4 users (tok/s each)    16 users (tok/s each)
Qwen 2.5 7B Q5                  120               ~95                     ~55
Llama 3.1 8B FP16               90                ~70                     ~30
Mixtral 8x7B Q4                 55                ~45                     ~25
Llama 3.3 70B Q4 (offloaded)    12                ~9                      stalls

The pattern: small models with continuous batching deliver near-linear scaling up to 4–8 simultaneous users. Past that the math depends on cache pressure and prompt length.
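
To turn the table into a buying decision, run the multiplication against your own peak load. A quick sketch with placeholder numbers; swap in the row that matches your model and your real user count:

# Back-of-envelope capacity check (all three values are placeholders)
USERS=8               # peak simultaneous requests
PER_USER_TPS=70       # tok/s each user gets at that concurrency, from the table
REPLY_TOKENS=350      # typical response length in tokens

echo "Aggregate throughput: $((USERS * PER_USER_TPS)) tok/s"
echo "Seconds per reply:    $((REPLY_TOKENS / PER_USER_TPS))"

If the seconds-per-reply figure is longer than your users will tolerate, drop to a smaller model, tighten the quantization, or add hardware.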

Network requirements (less than you think for case 1)

For the “one GPU, many clients” pattern, the network only carries compact request and response payloads — usually 1–10 KB per round trip. A standard 1 GbE LAN handles 50+ concurrent users without breaking a sweat. The sensitive number is latency, not bandwidth: ping the GPU host from each client, and if it’s under 5 ms you’re fine. Wi-Fi is usable but adds a noticeable first-token delay versus wired.
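
Two quick checks from a client machine settle it. The IP and model tag are placeholders, and the second command assumes the Ollama endpoint from pattern 1; with streaming enabled, time-to-first-byte is a rough proxy for first-token latency:

# Round-trip latency to the GPU host (under 5 ms on wired LAN is the target)
ping -c 5 192.168.1.10

# Time to first byte of a streamed completion
curl -s -o /dev/null -w "first byte after %{time_starttransfer}s\n" \
    http://192.168.1.10:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen2.5:7b", "stream": true, "messages": [{"role": "user", "content": "hi"}]}'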

For tensor-parallel splitting across machines (rare, hard, and slow over PCIe-class hardware), you’re in 25 GbE+ territory. Skip it for home labs.

What about Apple Silicon, ROCm, and CPU-only hosts?

  • Apple Silicon (M1–M4, M-Ultra): unified memory makes a Mac Studio with 192 GB the cheapest 70B host on the market. Ollama, LM Studio, and exo all use Metal natively — no CUDA, no ROCm, just works.
  • AMD ROCm (RX 7900 XTX, MI300X): 2026 ROCm support in vLLM and llama.cpp is solid. Performance is within 10–25% of CUDA on equivalent silicon for most workloads.
  • Intel Arc + iGPU: usable via SYCL backends in llama.cpp; performance depends heavily on memory bandwidth more than the GPU itself.
  • CPU-only: realistic for 8B-class models at Q4–Q5. AVX-512 and Apple AMX speed it up; commodity x86 without AVX-512 caps out at ~5 tokens/sec for 8B.

Auth and access control

Once a GPU is exposed on the LAN, you’ve created an open AI endpoint. Anyone who can reach 192.168.1.10:11434 can query your models. Don’t expose this to the public internet without a reverse proxy that does at least:

  • API-key header check (Authorization: Bearer ...).
  • Rate limiting per key (Caddy, Traefik, or nginx all handle this with one config block).
  • HTTPS — modern OpenAI clients refuse plain HTTP for non-localhost endpoints.
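
A minimal sketch of that front door using Caddy, which also handles HTTPS automatically. The domain, bearer token, and upstream IP are placeholders, and rate limiting is left out here because stock Caddy typically needs an add-on module for it (nginx or Traefik work just as well):

# Write a minimal Caddyfile: API-key check + reverse proxy to the Ollama port
cat > Caddyfile <<'EOF'
llm.example.com {
    @noauth not header Authorization "Bearer replace-with-a-long-random-string"
    respond @noauth 401
    reverse_proxy 192.168.1.10:11434
}
EOF
caddy run --config Caddyfile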

For team setups, a Hyperspace pod handles the auth and treasury layers natively — see the pod guide.

The 30-minute starter

  1. Pick the GPU host. Plug it into wired ethernet.
  2. Install Ollama. Pull two models: a 7B for speed (qwen2.5:7b) and a 70B Q4 for quality (llama3.3:70b-q4_K_M).
  3. OLLAMA_HOST=0.0.0.0:11434 ollama serve
  4. On every client machine, point the OpenAI base URL at http://<gpu-host-ip>:11434/v1. Cursor, Continue.dev, your custom scripts — same code, different base URL (a smoke-test snippet follows this list).
  5. Use the AI cost estimator to compare what you’d save vs paid APIs at your usage level.
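
To smoke-test step 4 from any client, you don't have to touch each tool's settings individually: the official OpenAI SDKs, and many tools built on them, read two environment variables. The IP is a placeholder, and the key only needs to be non-empty because stock Ollama doesn't check it:

# Point OpenAI-compatible clients at the shared GPU host (IP is a placeholder)
export OPENAI_BASE_URL="http://192.168.1.10:11434/v1"
export OPENAI_API_KEY="ollama"

# Smoke test: list the models the GPU host is serving
curl -s "$OPENAI_BASE_URL/models" -H "Authorization: Bearer $OPENAI_API_KEY"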

That’s the foundation. Add vLLM later if you outgrow Ollama’s throughput, add a pod (Hyperspace / exo) when you outgrow a single GPU’s memory budget.

Frequently asked questions

How do I share a single GPU with multiple machines on my LAN?

Run a model-serving daemon on the GPU host that exposes an OpenAI-compatible HTTP endpoint. Ollama (easiest), vLLM (highest throughput), and LM Studio server mode all work. Bind to 0.0.0.0:port, then point each client's OpenAI base URL at http://gpu-host-ip:port/v1.

What's the difference between tensor parallelism and pipeline parallelism?

Tensor parallelism splits each layer's matrices across GPUs in the same machine via NVLink — used to fit models too big for one card. Pipeline parallelism splits whole layers across machines on a network — used to pool memory across machines. Tensor parallel needs 25 GbE+ if cross-machine, so it's almost always single-host. Pipeline parallel works fine on standard LAN.

How many users can a single 4090 handle?

With vLLM and continuous batching, a 24 GB RTX 4090 serves a 7B model at ~95 tokens/sec for 4 simultaneous users, ~55 tokens/sec for 16. For 70B Q4 with offload, you're looking at ~9 tokens/sec for 4 users — usable but slower.

Can I expose a shared GPU endpoint to the public internet?

Don't, unless you've added auth, rate limiting, and HTTPS. A LAN endpoint with no protection is fine for a household; a public endpoint without protection is a free open AI service. Caddy or Traefik handle reverse-proxy auth in one config block. For team setups, a Hyperspace pod handles this natively.
