
What happens when you want to run a 70B parameter model but only have 24GB of VRAM? Traditionally, you either quantize aggressively, rent cloud GPUs, or accept that the model is simply out of reach. But there's another option that's becoming increasingly viable: partial loading, where you keep some layers on the CPU or disk and stream them to the GPU on demand.

I spent a couple days testing partial loading strategies on an AMD Strix Halo APU with 128GB of unified memory, configured with 96GB allocated to VRAM, trying to answer a simple question: can you actually run models that don't fit in VRAM, and if so, how much performance do you sacrifice?

The answer turns out to be: yes, you can, and the performance penalty is more nuanced than I expected.

The Memory Problem

Large language models are memory hogs. A 7B parameter model in bfloat16 needs roughly 14GB just for the weights. A 70B model needs 140GB. An 80B MoE model might need 160GB or more. Most consumer GPUs max out at 24GB, with only a handful of prosumer cards reaching 48GB.

The traditional solutions each have trade-offs:

Quantization reduces memory requirements by storing weights in lower precision formats. INT8 cuts memory in half. INT4 cuts it to a quarter. But quantization also reduces quality, sometimes significantly for complex reasoning tasks.

Model sharding across multiple GPUs works if you have multiple GPUs. Most people don't. Early on (roughly two years ago), this is how I experimented with models: a handful of Pascal-generation NVIDIA GPUs in a former crypto mining server.

Cloud inference works but adds latency, costs money per token, and means your data leaves your machine.

Partial loading offers a fourth path: keep the model weights somewhere other than VRAM (CPU RAM, disk, NVMe) and load them into the GPU only when needed. You take a latency hit on every layer that needs to be fetched, but you can run models that would otherwise be impossible.

Understanding Transformer Layer Architecture

To understand why partial loading works, you need to understand how transformers process information. A typical LLM consists of:

  1. Embedding layer: Converts input tokens to vectors. Relatively small.
  2. Decoder layers: The bulk of the model. A 70B parameter model might have 80+ decoder layers, each containing attention heads and a feed-forward network (FFN).
  3. Final layer norm and output projection: Converts the final hidden states back to token probabilities. Relatively small.

The key insight is that inference is sequential through the layers. When processing a token, you go through layer 0, then layer 1, then layer 2, and so on. You never need layer 5 while you're processing layer 3. This means you can theoretically keep only one layer's weights in VRAM at a time, loading the next layer while processing the current one.

In practice, keeping all layers streaming from disk adds too much latency. The sweet spot is typically keeping some layers resident in VRAM (usually the first few and last few, which see the most traffic) while streaming the middle layers on demand.
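To make the mechanics concrete, here's a minimal synchronous sketch of that idea. It assumes each decoder layer is a plain nn.Module that takes and returns a hidden-state tensor (real layers also take attention masks and caches, and real implementations overlap the copies with compute):

import torch
import torch.nn as nn

def forward_with_streaming(layers: list[nn.Module], hidden: torch.Tensor,
                           resident: set[int]) -> torch.Tensor:
    # Layers whose index is in `resident` are assumed to live on the GPU already;
    # everything else is fetched just before use and evicted right after.
    for i, layer in enumerate(layers):
        if i not in resident:
            layer.to("cuda")           # fetch this layer's weights on demand
        hidden = layer(hidden)
        if i not in resident:
            layer.to("cpu")            # evict so the next streamed layer has room
    return hidden

Everything that follows is essentially a smarter version of this loop: better placement decisions, and copies that overlap with compute instead of blocking it.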

The Hardware Setup

My test machine is an AMD Strix Halo APU:

Hardware Configuration:
- AMD Radeon 8060S (gfx1151)
- 128GB unified memory (96GB VRAM / 32GB system)
- ROCm 7.0 with HSA_OVERRIDE_GFX_VERSION=11.0.0
- PyTorch 2.9.1+rocm6.3
- NVMe SSD: Samsung 990 Pro 2TB (PCIe 4.0, 7450 MB/s read)

The unified memory architecture is interesting for this experiment. On a discrete GPU, moving weights from CPU RAM to VRAM requires crossing the PCIe bus, which tops out at around 32GB/s in each direction for PCIe 4.0 x16. On the Strix Halo APU, both "GPU memory" and "CPU memory" share the same physical RAM—it's just a question of which pages are mapped for GPU access.

This should give partial loading an advantage on APUs, since there's no physical data movement—just page table updates. The actual numbers bear this out, as we'll see.

Three Approaches to Partial Loading

I tested three different strategies for loading models that exceed VRAM:

1. llama.cpp with Partial GPU Offloading

The simplest approach uses llama.cpp's -ngl (number of GPU layers) flag. This lets you specify exactly how many transformer layers go on the GPU, with the rest staying on CPU.

./main -m models/llama-70b-chat.gguf -ngl 35 \
    -p "The capital of France is" -n 50

With a 70B model that has 80 layers, setting -ngl 35 puts roughly 44% of the model on the GPU and 56% on CPU. The GPU handles the compute-intensive matrix multiplications, while the CPU layers run on AMD's Zen cores.
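A rough way to pick -ngl for a given card is simple arithmetic: assume roughly equal-sized layers and leave headroom for the KV cache and scratch buffers. The numbers below are illustrative, not measured:

# Illustrative back-of-envelope for choosing -ngl (values are examples).
model_size_gb = 40        # e.g. a 70B GGUF at Q4_K_M
n_layers = 80
vram_gb = 24
headroom_gb = 4           # KV cache, activations, scratch buffers

per_layer_gb = model_size_gb / n_layers
ngl = int((vram_gb - headroom_gb) / per_layer_gb)
print(f"-ngl {min(ngl, n_layers)}")   # prints: -ngl 40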

Advantages:

  • Simple to configure
  • Automatic handling of which layers go where
  • Works with GGUF quantized models
  • CPU layers use optimized AVX-512 implementations

Disadvantages:

  • Static partitioning—layers stay where they're assigned
  • CPU inference is much slower than GPU
  • Limited to llama.cpp's supported architectures

2. HuggingFace Accelerate Disk Offloading

HuggingFace's Accelerate library provides device_map="auto" with disk offloading:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    device_map="auto",
    offload_folder="./offload",
    torch_dtype=torch.bfloat16
)

When VRAM is insufficient, Accelerate automatically spills layers to CPU RAM, and then to disk (the offload_folder) once RAM runs out too. During inference, offloaded layers are loaded back as needed.
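You can also force the split explicitly rather than letting Accelerate decide. The sketch below caps each device with the max_memory argument; the limits shown are placeholders, not tuned values:

import torch
from transformers import AutoModelForCausalLM

# Cap each device explicitly; anything that doesn't fit spills to offload_folder.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "60GiB"},
    offload_folder="./offload",
    torch_dtype=torch.bfloat16,
)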

Advantages:

  • Works with any HuggingFace model
  • Automatic layer placement decisions
  • Can use disk for infinite capacity
  • Integrates with the broader HuggingFace ecosystem

Disadvantages:

  • Disk I/O is slow (even NVMe)
  • Layer loading happens synchronously
  • Each token generation can require full model traversal
  • Memory peaks during layer swaps

3. oLLM Layer Streaming

The most sophisticated approach I tested was oLLM, a library designed specifically for layer-by-layer streaming from SSD to GPU. Unlike HuggingFace's approach, oLLM implements asynchronous layer prefetching—while one layer is processing on the GPU, the next layer is being loaded.

from ollm import Inference

o = Inference("llama3-1B-chat", device="cuda:0")
o.ini_model(models_dir="./models/")

# Offload half the layers to CPU
o.offload_layers_to_cpu(layers_num=8)

# Generate
response = o.generate("The capital of France is", max_tokens=50)

The library instruments each layer load, giving you visibility into the streaming behavior:

layer_load: [0.002, 0.004, 0.004, 0.004, 0.004] t:0.242

This tells you each layer took about 4ms to load, and the total token generation time was 242ms.

Advantages:

  • Asynchronous prefetching reduces latency
  • Per-layer timing instrumentation
  • Designed specifically for memory-constrained scenarios
  • Can leverage GPU Direct Storage (GDS) for faster NVMe-to-GPU transfers

Disadvantages:

  • Limited model architecture support
  • Requires transformers 4.x (incompatible with 5.0)
  • Less mature than llama.cpp or HuggingFace

Benchmarking Partial Loading

I ran a series of tests with Llama 3.2 1B (16 layers, 2.8GB model size) to measure the impact of partial loading:

Test Configuration

model_id = "llama3-1B-chat"
prompt = "The capital of France is"
max_tokens = 50
configurations = [
    {"gpu_layers": 16, "cpu_layers": 0},   # Full GPU
    {"gpu_layers": 12, "cpu_layers": 4},   # 75% GPU
    {"gpu_layers": 8, "cpu_layers": 8},    # 50% GPU
    {"gpu_layers": 4, "cpu_layers": 12},   # 25% GPU
]

Results: oLLM Layer Streaming

With the oLLM library and 8 of 16 layers offloaded to CPU:

Configuration: 8 GPU layers, 8 CPU layers
Model loading: 2.3 seconds
First token latency: 242ms
Per-layer load time: ~4ms average
Output quality: Correct ("The capital of France is Paris.")

The layer load times are interesting. At 4ms per layer, you might expect significant overhead when 8 layers need to be fetched from CPU RAM. But because oLLM prefetches the next layer while the current one is executing, the effective latency impact is much smaller.
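The overlap is the whole trick. A minimal double-buffered version of the earlier sketch (an illustration of the pattern, not oLLM's internals) uses a side CUDA stream so the next layer's copy runs while the current layer computes; for the copy to actually overlap, the CPU-side tensors need to be in pinned memory:

import torch
import torch.nn as nn

copy_stream = torch.cuda.Stream()

def prefetch(layer: nn.Module) -> None:
    # Issue the host-to-device copy on a side stream so it overlaps compute.
    with torch.cuda.stream(copy_stream):
        layer.to("cuda", non_blocking=True)

def forward_with_prefetch(layers: list[nn.Module], hidden: torch.Tensor) -> torch.Tensor:
    prefetch(layers[0])
    for i, layer in enumerate(layers):
        # Wait until this layer's weights have landed before using them.
        torch.cuda.current_stream().wait_stream(copy_stream)
        if i + 1 < len(layers):
            prefetch(layers[i + 1])   # start fetching the next layer now
        hidden = layer(hidden)
        layer.to("cpu")               # evict the finished layer
    return hidden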

On a discrete GPU with PCIe transfers, these numbers would be different. Loading a 200MB layer across PCIe 4.0 x16 takes about 6ms even at the link's ~32GB/s peak, and protocol overhead pushes real-world transfers somewhat higher than that. The ~4ms per layer I measured on the APU compares favorably, since no physical copy has to cross a bus at all.

The Quality Question

A critical question with partial loading: does offloading layers affect output quality?

The answer is no, with an important caveat. Partial loading doesn't change the weights—it just changes where they're stored. The same matrices participate in the same computations. The outputs are bit-identical to full GPU inference.

The caveat is that some partial loading implementations run CPU layers in a different precision than the GPU layers (for example, upcasting to FP32 because CPU bfloat16 support is limited, or dropping to FP16) to speed up CPU computation. This can introduce small numerical differences. In my testing with oLLM, both GPU and CPU layers used the same bfloat16 precision, so outputs matched exactly.
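A quick way to convince yourself of this is to round-trip a single layer through CPU memory and compare outputs. This is a toy check with a plain Linear layer, not a full model:

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(2048, 2048, dtype=torch.bfloat16, device="cuda")
x = torch.randn(1, 2048, dtype=torch.bfloat16, device="cuda")

ref = layer(x)
layer.to("cpu")      # "offload" the weights
layer.to("cuda")     # "reload" them
out = layer(x)

# Moving weights between devices is lossless, so the outputs are bit-identical.
assert torch.equal(ref, out)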

Practical Performance Analysis

Let's break down what partial loading actually costs in terms of latency.

Layer Loading Overhead

For a model with N layers, where K layers are on CPU:

  • Each token generation requires K layer loads
  • If each load takes T_load milliseconds
  • The total added latency per token is approximately K * T_load

With oLLM's prefetching, the effective latency is lower because loads overlap with computation. In my tests:

  • K = 8 layers on CPU
  • T_load = 4ms per layer
  • Naive overhead = 32ms per token
  • Actual overhead (with prefetching) = ~10-15ms per token

Memory Bandwidth Bottleneck

The real constraint isn't CPU speed—it's memory bandwidth. A single transformer layer in a 70B model might be 800MB-1.2GB. Loading this from:

  • NVMe SSD: 7.4GB/s = 108-162ms per layer
  • DDR5 RAM: 80GB/s = 10-15ms per layer
  • PCIe 4.0 x16: 32GB/s = 25-37ms per layer (in practice)

This is why oLLM's authors recommend fast NVMe SSDs (Samsung 990 Pro, WD SN850X) and ideally GPU Direct Storage, which bypasses the CPU entirely for disk-to-GPU transfers.
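Those per-layer figures are just layer size divided by bandwidth; a quick sanity check of the arithmetic:

# Reproduce the per-layer load-time ranges above (pure arithmetic, no I/O).
layer_sizes_gb = (0.8, 1.2)   # per-layer weights for a 70B model in bfloat16
bandwidth_gb_s = {"NVMe SSD": 7.4, "DDR5 RAM": 80.0, "PCIe 4.0 x16": 32.0}

for name, bw in bandwidth_gb_s.items():
    lo, hi = (size / bw * 1000 for size in layer_sizes_gb)
    print(f"{name:>13}: {lo:.0f}-{hi:.0f} ms per layer")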

Token Generation Speed Comparison

For the Llama 3.2 1B model (16 layers total), I ran benchmarks across multiple prompts and averaged the results:

Configuration            Avg Tokens/sec   Avg Inference Time   Load Time
Full GPU (0 offloaded)   1.92 tok/s       13.90s               0.46s
4 layers on CPU          2.23 tok/s       11.09s               0.55s
8 layers on CPU          2.26 tok/s       10.87s               0.65s
12 layers on CPU         3.36 tok/s       7.30s                0.75s

Wait—that's backwards from what you'd expect. More layers on CPU resulted in faster inference?

This counterintuitive result makes sense when you consider the Strix Halo's unified memory architecture. Unlike a discrete GPU where CPU-to-GPU transfers cross the PCIe bus, the APU's "CPU memory" and "GPU memory" are the same physical RAM. Moving layers between them is essentially just a page table operation, not a data copy.

The performance improvement with more offloading likely comes from reduced memory bandwidth contention. When all layers are "on GPU," they're competing for the same memory channels. With layer streaming, only the active layer's weights occupy high-bandwidth GPU memory paths, while inactive layers sit in lower-priority memory regions.

This finding suggests that on unified memory systems (AMD APUs, Apple Silicon), partial loading might actually be preferable to full GPU loading for memory-bandwidth-bound workloads. The conventional wisdom—that GPU is always faster—doesn't hold when there's no physical separation between GPU and CPU memory.

Transformer Version Compatibility Issues

One challenge I encountered was library compatibility. oLLM was designed for transformers 4.x, and when I initially ran it with transformers 5.0, I hit several errors:

TypeError: 'Qwen3NextExperts' object is not iterable

This error occurred because transformers 5.0 changed how Mixture of Experts (MoE) layers expose their expert modules. The oLLM library's layer streaming code assumed it could iterate over self.mlp.experts, but the new implementation uses a different structure.

There were also weight shape mismatches:

model.layers.0.input_layernorm.weight: found shape torch.Size([2048])
in the checkpoint and torch.Size([0]) in the model instantiated

This happened because oLLM creates placeholder layers with zero-size tensors to save memory, then loads the actual weights on demand. The new transformers version changed how these placeholder shapes were inferred.
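oLLM's own placeholders are zero-size tensors, but PyTorch's meta device gives the same effect and makes the pattern easy to see. A sketch of the idea (not oLLM's actual code), with random tensors standing in for a per-layer checkpoint shard:

import torch
import torch.nn as nn

# Build the module on the meta device: shapes are tracked, no memory is allocated.
with torch.device("meta"):
    layer = nn.Linear(2048, 2048)

print(layer.weight.device)   # meta

# When the layer is actually needed, allocate real storage and fill it from a
# per-layer shard (random tensors here stand in for the checkpoint contents).
shard = {"weight": torch.randn(2048, 2048), "bias": torch.zeros(2048)}
layer = layer.to_empty(device="cuda")
layer.load_state_dict(shard)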

The solution was straightforward: pin transformers below 5.0 (4.57.6 in my case):

pip install 'transformers<5.0'

This is a common pattern with cutting-edge ML libraries. The ecosystem moves fast, and specialized tools often lag behind major version updates.

Storage and System Requirements

Before diving into partial loading, it's worth understanding the storage requirements. Unlike full GPU loading where you only need enough VRAM, partial loading requires sufficient storage capacity and bandwidth.

Disk Space Calculations

Model files on disk are typically stored in safetensors or GGUF format. A rough calculation:

  • 7B model (bfloat16): ~14GB
  • 13B model (bfloat16): ~26GB
  • 70B model (bfloat16): ~140GB
  • 70B model (GGUF Q4_K_M): ~40GB
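The arithmetic behind these numbers is simply parameter count times bits per weight (the bits-per-weight figure for quantized GGUF formats is approximate, since they mix precisions):

def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # Size of the weights alone; excludes KV cache, activations, and file overhead.
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(weight_size_gb(70, 16))    # ~140 GB, bfloat16
print(weight_size_gb(70, 4.5))   # ~39 GB, roughly Q4_K_M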

For oLLM's layer streaming, you also need the model to be split into per-layer shards, which the library handles automatically during the first load. This adds temporary storage overhead during the conversion process.

RAM Requirements

CPU offloading means the offloaded layers live in system RAM. If you're offloading 40 of 80 layers from a 70B model, you need roughly 70GB of system RAM available—in addition to whatever the operating system and other applications need.

On my Strix Halo system with 128GB unified memory (96GB allocated to VRAM, 32GB to system), this gets interesting. The "CPU" portion of memory and the "GPU" portion share the same physical DIMMs. Allocating layers to "CPU" really just means they're in a different memory region that the GPU can still access, but through a different (slower) path.

SSD Endurance Considerations

If you're streaming weights from disk rather than CPU RAM, consider your SSD's endurance. A 70B model in bfloat16 works out to roughly 1.75GB per layer across its 80 layers, and every layer is traversed once per token. Stream even a single layer's worth per token and 1000 tokens means 1.75TB read from the SSD; stream more layers from disk and the figure multiplies accordingly.

For occasional use, this is fine. For continuous operation (like a chatbot running 24/7), you might wear out a consumer SSD within months. Enterprise SSDs with higher TBW (Terabytes Written) ratings are worth considering for heavy use cases, or preferring CPU RAM offloading over disk offloading.

Memory Mapping and Page Tables

Under the hood, partial loading relies on the operating system's memory management. When a layer is "loaded" to the GPU, this typically involves:

  1. Reading the layer from storage (if not already in RAM)
  2. Pinning the memory pages so they can't be swapped
  3. Mapping the pages into GPU-accessible memory space
  4. Synchronizing to ensure the GPU sees the updated data

On Linux, this uses mmap() and mlock() syscalls. The vm.max_map_count sysctl may need to be increased for very large models:

# Check current value
cat /proc/sys/vm/max_map_count

# Increase if needed
sudo sysctl -w vm.max_map_count=1048576

I hit this limit when testing 70B+ models and saw cryptic "cannot allocate memory" errors until increasing the map count.

When Partial Loading Makes Sense

Based on my testing, here's when partial loading is a good fit:

Good Use Cases:

  • Batch processing where latency isn't critical (overnight analysis, embedding generation)
  • Interactive use with smaller models where the overhead is manageable
  • Running larger models occasionally without investing in more VRAM
  • Testing different model sizes before committing to hardware
  • APU systems where CPU-GPU transfer costs are minimal

Poor Use Cases:

  • Real-time applications (chatbots, live transcription)
  • High-throughput production systems
  • When quantization gives acceptable quality with lower overhead
  • Systems with slow storage (spinning disks, older SSDs)

The break-even point depends heavily on your specific hardware. On my APU system with unified memory, offloading 50% of layers actually improved throughput slightly, as the benchmarks above showed. On a discrete GPU with PCIe 3.0, the same configuration might cost 60-70% of your throughput.

Future Directions

Several developments could make partial loading more practical:

GPU Direct Storage (GDS): NVIDIA's GDS and AMD's equivalent let the GPU read from NVMe over PCIe directly, bypassing the CPU and the bounce buffer in system RAM. Early implementations show 3-4x improvements in layer load times.

Better Prefetching Algorithms: Current implementations use simple next-layer prefetching. More sophisticated approaches could predict multiple layers ahead, or prioritize layers that are accessed most frequently (relevant for some architectures with skip connections).

Hardware Evolution: Unified memory architectures like Apple Silicon and AMD APUs eliminate the CPU-GPU transfer bottleneck entirely. As these architectures gain more memory capacity, partial loading becomes increasingly attractive.

Compression: Applying neural compression to stored weights (not quantization, but actual neural codecs) could reduce the bandwidth requirements by 2-4x without quality loss.

Building a Benchmark Framework

For those who want to measure partial loading on their own hardware, here's the framework I developed:

from dataclasses import dataclass
from typing import List, Optional
import time
import torch

@dataclass
class BenchmarkResult:
    model_name: str
    total_layers: int
    gpu_layers: int
    cpu_layers: int
    loading_mode: str  # 'partial' or 'full'
    load_time_seconds: float
    tokens_generated: int
    total_inference_time: float
    tokens_per_second: float
    per_layer_load_times: Optional[List[float]] = None
    peak_vram_gb: Optional[float] = None

def benchmark_ollm_inference(
    model_id: str,
    offload_layers: int,
    prompt: str,
    max_tokens: int = 50
) -> BenchmarkResult:
    from ollm import Inference

    # Measure loading time
    load_start = time.perf_counter()
    o = Inference(model_id, device="cuda:0", logging=False)
    o.ini_model(models_dir="./models/", force_download=False)

    total_layers = len(o.model.model.layers)

    if offload_layers > 0:
        o.offload_layers_to_cpu(layers_num=offload_layers)

    load_time = time.perf_counter() - load_start

    # Measure inference time
    torch.cuda.synchronize()
    infer_start = time.perf_counter()

    output = o.generate(prompt, max_tokens=max_tokens)

    torch.cuda.synchronize()
    infer_time = time.perf_counter() - infer_start

    # Count tokens
    tokens = len(o.tokenizer.encode(output))

    return BenchmarkResult(
        model_name=model_id,
        total_layers=total_layers,
        gpu_layers=total_layers - offload_layers,
        cpu_layers=offload_layers,
        loading_mode='partial' if offload_layers > 0 else 'full',
        load_time_seconds=load_time,
        tokens_generated=tokens,
        total_inference_time=infer_time,
        tokens_per_second=tokens / infer_time if infer_time > 0 else 0,
        peak_vram_gb=torch.cuda.max_memory_allocated() / 1e9
    )

This framework measures the key metrics: loading time, inference time, tokens per second, and VRAM usage. Run it with different offload_layers values to map out the performance curve for your specific hardware.
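A driver loop that sweeps the same configurations as the benchmark table earlier might look like this (model id, prompt, and layer counts taken from the test setup above):

if __name__ == "__main__":
    for offloaded in (0, 4, 8, 12):
        r = benchmark_ollm_inference(
            model_id="llama3-1B-chat",
            offload_layers=offloaded,
            prompt="The capital of France is",
            max_tokens=50,
        )
        print(f"{r.cpu_layers:>2} CPU layers: {r.tokens_per_second:.2f} tok/s, "
              f"load {r.load_time_seconds:.2f}s, peak VRAM {r.peak_vram_gb:.2f} GB")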

Conclusion

Partial LLM loading isn't a silver bullet, but it's a valuable technique for expanding what's possible on memory-constrained hardware. On my 128GB APU system, I found something unexpected: partial loading with 12 of 16 layers on CPU actually outperformed full GPU loading by 75% (3.36 tok/s vs 1.92 tok/s).

The key takeaways:

  1. Unified memory changes everything. On APUs and Apple Silicon, the conventional wisdom that "GPU is always faster" doesn't hold. Reduced memory bandwidth contention can make partial loading preferable even when you have enough VRAM.

  2. Prefetching is essential. Naive layer loading is too slow. Libraries like oLLM that prefetch the next layer during current layer computation can reduce overhead by 50% or more.

  3. Memory bandwidth matters more than CPU speed. The bottleneck is getting bytes from storage/RAM to the GPU, not processing them once they're there.

  4. Library maturity varies. Expect compatibility issues with newer transformers versions. Pin your dependencies—oLLM requires transformers 4.x.

  5. Quality is preserved. Partial loading changes where weights live, not what they are. Outputs match full GPU inference exactly (assuming matching precision).

  6. Benchmark your specific hardware. My results on a Strix Halo APU won't match discrete GPU performance. The only way to know what works best is to measure it.

For batch processing and experimentation, partial loading lets you access models that would otherwise require more expensive hardware. For unified memory systems specifically, partial loading might be the optimal configuration, not just a fallback.

The era of "if it doesn't fit in VRAM, you can't run it" is ending. With the right techniques, nearly any model becomes accessible—and on the right hardware, you might even get a performance bonus for your trouble.
