
LTX-Video 2.3 is a 22 billion parameter model that generates video from text prompts. It was designed for modern hardware: GPUs with bfloat16 support, high-bandwidth memory, and enough VRAM to hold the full model on one or two cards. The Tesla P40 has none of these things. It is a Pascal-generation GPU from 2016, with 24GB of GDDR5 per card, no native bfloat16, no Tensor Cores, and a PCIe 3.0 bus. It was built for data center inference workloads that no longer exist.

I have four of them in a rack-mount server in an unheated shop building in Minnesota. Together they provide 96GB of VRAM. The question was whether that 96GB, spread across four old cards, could run a model that was never meant to run on any of them.

The answer is yes, with significant caveats and a substantial amount of code to work around hardware limitations that the model's authors never anticipated.

The Problem

LTX-Video 2.3's transformer has 48 blocks. At fp16 precision, the model weights alone consume roughly 44GB. With the Gemma text encoder, the video VAE encoder/decoder, the spatial upsampler, and the audio components, the full pipeline needs more memory than any single P40 can provide. The model doesn't fit on one card. It doesn't fit on two. It barely fits on three, with no room for activations during inference.
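The arithmetic behind "doesn't fit" is straightforward; a quick check using the figures above:

```python
params = 22e9               # 22 billion parameters
fp16_bytes = 2              # fp16 stores 2 bytes per parameter

weights_gb = params * fp16_bytes / 1e9
print(weights_gb)           # 44.0 -- transformer weights alone

print(weights_gb > 24)      # True: more than one P40 holds
print(weights_gb > 2 * 24)  # False -- but the text encoder, VAEs, and
                            # upsampler push the full pipeline past two cards
```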

Four cards at 24GB each give 96GB total, which is enough for the weights with room for intermediate activations. But CUDA doesn't automatically spread a model across multiple GPUs. You have to tell it how.

The standard approach for multi-GPU inference is accelerate's dispatch_model, which automatically distributes model layers across available GPUs based on memory constraints. This works for the Gemma text encoder, which is a straightforward transformer. For the LTX transformer, it doesn't work, because the model has a custom forward pass with audio-video cross-attention that accelerate's automatic dispatch can't handle correctly. The model needs to move data between GPUs at specific points in the forward pass, and accelerate doesn't know where those points are.

The solution was manual pipeline parallelism: split the 48 transformer blocks evenly across four GPUs (12 blocks per card), keep the shared components (patchify projections, normalization, output projections) on GPU 0, and write a custom forward pass that moves tensors between devices at block boundaries.
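The assignment itself is a one-line mapping. A minimal sketch, with the `block_devices` name chosen to match the lookup used in the forward pass (the constants are from the figures above):

```python
NUM_BLOCKS = 48
NUM_GPUS = 4
BLOCKS_PER_GPU = NUM_BLOCKS // NUM_GPUS          # 12

# Block i lives on GPU i // 12: blocks 0-11 on cuda:0, 12-23 on cuda:1, ...
block_devices = {i: f"cuda:{i // BLOCKS_PER_GPU}" for i in range(NUM_BLOCKS)}

print(block_devices[0], block_devices[11], block_devices[12], block_devices[47])
# cuda:0 cuda:0 cuda:1 cuda:3
```

At load time each transformer block moves to its assigned device; the shared components stay on `cuda:0`.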

The Precision Problem

Even with the model split across four cards, nothing worked on the first attempt. Or the fifth. Getting LTX-Video running on Pascal hardware was an iterative process, with Claude Code generating solutions and me testing them against the actual hardware. Each failure revealed another assumption the model made about the GPU it would run on. The feedback loop was brutal: load a 22B model across four GPUs, wait eight minutes for a test generation, get a black frame or a NaN error, diagnose which precision boundary caused it, generate a fix, and try again.

The first problem was bfloat16. The model weights are stored in bf16 format. Pascal GPUs cannot compute in bf16. PyTorch handles this silently for some operations by promoting to fp32, but other operations fail or produce garbage. The initial approach was the obvious one: monkey-patch torch.bfloat16 to redirect to torch.float16. This seemed to work at load time. The model loaded, the weights populated, no errors. Then the first forward pass produced NaN everywhere. The monkey-patch had corrupted the safetensors weight loading. The weights loaded as fp16 bit patterns interpreted as bf16 values, which is not the same thing. A bf16 value of 1.0 has a different bit pattern than an fp16 value of 1.0. Reinterpret one as the other and you get a number that's either wildly wrong or NaN.
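The mismatch is easy to demonstrate with nothing but the standard library: bf16 is the top 16 bits of an fp32 value, and `struct`'s `"e"` format decodes IEEE half precision.

```python
import struct

# bf16 is the top half of the fp32 bit pattern
fp32_bits = struct.unpack(">I", struct.pack(">f", 1.0))[0]   # 0x3F800000
bf16_bits = fp32_bits >> 16                                   # 0x3F80

# Reinterpret (not convert) those 16 bits as fp16
misread = struct.unpack(">e", bf16_bits.to_bytes(2, "big"))[0]

# A genuine fp16 1.0 uses a different bit pattern entirely
fp16_bits = int.from_bytes(struct.pack(">e", 1.0), "big")     # 0x3C00

print(hex(bf16_bits), hex(fp16_bits), misread)  # 0x3f80 0x3c00 1.875
```

Every weight in the model was off by that kind of factor, and a model whose weights are all wrong by 87% doesn't degrade gracefully; it produces NaN.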

The second attempt tried running everything in fp16 natively, converting weights properly during load. This got further: the model produced output that wasn't NaN. But the output was a solid green frame. The intermediate activations in the transformer blocks were overflowing fp16 range. Values above 65,504 become infinity in fp16, and the model's internal representations regularly exceed that during the attention and feedforward passes. The green frame was the model's attempt to decode latents that had been clipped to infinity at some point in the pipeline.

The working solution was to let the model builder properly convert weights from bf16 to fp16 on load, then run the entire computation pipeline in float32. The weights sit in memory as fp16 (saving space), but every computation promotes to fp32 before executing. This required patching F.linear to handle mixed dtype inputs:

import torch.nn.functional as F

_orig_linear = F.linear

def _mixed_linear(input, weight, bias=None):
    # Promote the fp16 weights to match the fp32 activations before computing
    if input.dtype != weight.dtype:
        weight = weight.to(input.dtype)
        if bias is not None:
            bias = bias.to(input.dtype)
    return _orig_linear(input, weight, bias)

F.linear = _mixed_linear

The same pattern extends to every normalization function and every convolution operation. Layer norm, group norm, RMS norm, conv1d through conv_transpose3d: all patched to handle mixed dtypes and accumulate in float32. Without these patches, intermediate values overflow fp16 range and the output collapses into a solid black or green frame.

The Gemma Problem

The text encoder is Google's Gemma 3, a separate model that converts text prompts into embeddings the video transformer can condition on. Gemma's attention mechanism overflows when run in fp16 on Pascal hardware. The attention scores grow large enough to exceed fp16 range, producing NaN values that propagate through the rest of the pipeline.

The fix was running the entire Gemma encoder in float32. This uses more memory, but the text encoder only runs once per generation (to encode the prompt), and its weights can be freed from GPU memory before the transformer starts. The sequence is: load Gemma across all four GPUs using accelerate, encode the prompt in float32, delete the encoder, free the memory, then load the video transformer.

import gc
import torch
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

def encode_prompt_float32(prompt, model_ledger, device):
    model_ledger.dtype = torch.float32
    te = model_ledger.text_encoder()
    # Dispatch across all 4 GPUs for memory
    max_memory = get_balanced_memory(
        te, max_memory={i: "22GiB" for i in range(4)},
        no_split_module_classes=["Gemma3DecoderLayer"],
    )
    device_map = infer_auto_device_map(
        te, max_memory=max_memory,
        no_split_module_classes=["Gemma3DecoderLayer"],
    )
    te = dispatch_model(te, device_map=device_map)
    hidden_states, attention_mask = te.encode(prompt)
    # Free GPU memory before the transformer loads
    del te
    gc.collect()
    torch.cuda.empty_cache()
    return hidden_states, attention_mask

This load-encode-delete cycle is ugly but necessary. There isn't enough total memory to hold both Gemma and the video transformer simultaneously, even across four cards. The sequential approach works because each component only needs to exist during its phase of the pipeline.

The Pipeline

The generation runs in two stages, matching LTX-Video's distilled inference schedule.

Stage 1 generates a half-resolution latent video (e.g., 256x384) through 8 denoising steps. Each step runs the full 48-block transformer, with data moving across all four GPUs:

def patched_process(video, audio, perturbations):
    for i, block in enumerate(ltx.transformer_blocks):
        # block_devices maps block index to GPU:
        # blocks 0-11 on cuda:0, 12-23 on cuda:1, and so on
        dev = block_devices[i]
        video = move_args_to_device(video, dev)
        audio = move_args_to_device(audio, dev)
        video, audio = block(video=video, audio=audio,
                             perturbations=perturbations)
    # The shared output projections live on GPU 0
    video = move_args_to_device(video, device0)
    audio = move_args_to_device(audio, device0)
    return video, audio

Every GPU boundary involves a tensor transfer across PCIe 3.0. With 12 blocks per GPU, there are 3 boundary crossings per denoising step (GPU 0 to 1, 1 to 2, 2 to 3), plus a final transfer back to GPU 0. With 8 denoising steps, that's 32 cross-device transfers per stage, each moving both video and audio state tensors. PCIe 3.0 x16 has a theoretical bandwidth of ~16 GB/s. The tensors being transferred are small relative to the bandwidth (attention states and activations, not full weight matrices), so the overhead is manageable. But it adds up.
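The overhead is easy to bound on the back of an envelope. Assuming, hypothetically, around 50 MB of video-plus-audio state per crossing:

```python
PCIE3_X16_GB_S = 16.0         # theoretical PCIe 3.0 x16 bandwidth, GB/s
crossings_per_step = 4        # 0->1, 1->2, 2->3, then back to GPU 0
steps = 8                     # stage-1 denoising steps

transfers = crossings_per_step * steps            # 32 per stage
state_gb = 50 / 1024                              # assumed ~50 MB per crossing
bus_seconds = transfers * state_gb / PCIE3_X16_GB_S

print(transfers, round(bus_seconds, 2))           # 32 0.1
```

Even if the real state tensors are several times larger, the bus cost stays well under the minutes the denoising itself takes; the dominant cost is fp32 arithmetic on Pascal cores, not the transfers.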

Stage 1 takes roughly 4 minutes for 241 frames at 24 fps (a 10-second clip). The spatial upsampler then doubles the resolution. Stage 2 runs 3 more denoising steps at full resolution (512x768), taking roughly 6.5 minutes. The VAE decoder converts latents to pixels and generates the audio track in another 40 seconds.

Total generation time for a 10-second, 512x768 video with audio: approximately 18.5 minutes. For a 1-second clip (25 frames): about 8 minutes. For a 4-second clip (97 frames): about 10.5 minutes.

The Memory Layout

During inference, the four GPUs aren't loaded equally. GPU 0 carries extra weight because it hosts all the shared components (patchify projections, normalization layers, output projections) plus its 12 transformer blocks. The actual memory distribution:

GPU   VRAM Used   Role
0     10.8 GB     Shared components + blocks 0-11
1     9.3 GB      Blocks 12-23
2     9.3 GB      Blocks 24-35
3     9.3 GB      Blocks 36-47

That's 38.7 GB of the available 96 GB. The remaining 57 GB provides headroom for activations, KV cache growth, and the VAE decoder. There's enough margin that generation never OOMs, even at 241 frames.

The API

Running inference from the command line is fine for testing, but generating videos for blog content requires something more practical. I wrapped the generation script in a FastAPI server with an async job queue:

# Submit a text-to-video job
curl -X POST http://10.1.1.24:8585/jobs \
  -F "prompt=A cinematic flyover of a Zilog Z80 processor on a PCB" \
  -F "duration=10" -F "seed=42"

# Submit an image-to-video job
curl -X POST http://10.1.1.24:8585/jobs \
  -F "prompt=A fluffy orange cat dancing" \
  -F "duration=4" -F "image=@cat.jpg"

# Check status
curl http://10.1.1.24:8585/jobs/07420abb6d82

# Download result
curl http://10.1.1.24:8585/jobs/07420abb6d82/video -o output.mp4

Jobs queue and execute sequentially. The server can only handle one generation at a time, and the load-encode-delete cycle for Gemma means there's significant setup overhead per job. The API spawns each job as a subprocess, which gives clean GPU memory cleanup between runs. If a generation crashes (which happened frequently during development), the next job starts fresh.

The server supports both text-to-video and image-to-video. Image conditioning locks the first frame to a provided image and generates subsequent frames from it, which produces more controllable results for specific visual subjects. In practice, image-to-video is the more useful mode. Text-to-video gives the model complete creative freedom, which means the output is unpredictable. You might ask for a Z80 processor and get something that looks like a generic IC, or something that looks like a Z80, depending on the seed. Image-to-video lets you provide the exact first frame you want and the model animates from there. For blog content where visual accuracy matters, starting from a real photograph or a specific reference image gives consistently better results.

What the Output Looks Like

The video quality is genuinely good. LTX-Video 2.3 produces coherent motion, reasonable physics, and detailed textures. Here are three examples, generated entirely on the P40 server:

Text-to-video: "A cinematic flyover of a Zilog Z80 processor on a printed circuit board" (10 seconds, 18.5 minutes to generate)

Image-to-video: "A fluffy orange cat with a hat dancing" (4 seconds, 10.5 minutes to generate)

Text-to-video: "A cat sitting on a windowsill, sunlight streaming in" (1 second, 8 minutes to generate)

The model understands object permanence, lighting consistency, and basic spatial relationships. The Z80 flyover produces a recognizable IC package with surrounding components, proper lighting, and smooth camera movement.

The audio is a different story. LTX-Video 2.3 generates an audio track alongside the video, but the results are inconsistent. Prompts describing characters speaking produce odd ambient music instead of voices. Prompts describing environments produce vaguely appropriate soundscapes. The audio pipeline works mechanically (it generates real audio waveforms via a separate VAE decoder and vocoder), but the semantic connection between prompt and audio output is weak. For blog content, I'd likely strip the generated audio and add narration or music separately.

The 512x768 resolution at 24fps is usable for web content. It's not 4K. It's not going to replace stock footage for production video. But for blog hero images in motion, visual demonstrations, or supplementary content alongside text, it works.

What This Cost

The hardware cost is zero incremental. The four P40s and the server already existed for LLM inference. LTX-Video is an additional workload on the same hardware.

The electricity cost is modest. The server draws roughly 500W under full GPU load. An 18.5-minute generation (10-second video at full resolution) consumes about 0.15 kWh, roughly $0.024 at Minnesota residential rates. You could generate forty 10-second clips for a dollar.
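The arithmetic, with the rate assumed at about $0.156/kWh to match the figures above:

```python
watts = 500                   # full-load draw for the whole server
minutes = 18.5                # one 10-second, full-resolution generation
rate = 0.156                  # assumed MN residential rate, $/kWh

kwh = watts / 1000 * minutes / 60     # ~0.154 kWh per clip
cost = kwh * rate                     # ~$0.024 per clip
clips_per_dollar = int(1 / cost)

print(round(kwh, 3), clips_per_dollar)   # 0.154 41
```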

The real cost was development time. Getting from "model downloaded" to "working generation pipeline" took many iterations across multiple sessions with Claude Code. Each precision-related failure mode (bf16 corruption, fp16 overflow, mixed-dtype kernel errors, NaN propagation through attention) required diagnosis, a hypothesis, a code change, and a test cycle that involved loading a 22B model across four GPUs. The feedback loop was slow. A single test takes 8 to 18 minutes to confirm whether a change worked. Many didn't.

The Broader Point

A 22 billion parameter video generation model was not designed to run on 2016 hardware. The authors assumed bf16, assumed modern attention kernels, assumed enough memory on one or two cards. None of those assumptions hold on the P40.

But the model runs anyway, because the underlying math doesn't actually require any of those features. Bfloat16 is a convenience, not a requirement; float32 computes the same function. Flash attention is an optimization, not a necessity; standard attention produces identical results. And 96GB across four cards is 96GB, regardless of whether it's cutting-edge HBM3 or decade-old GDDR5X.

The generation is slow. Eighteen minutes for ten seconds of video is not competitive with a single A100, which would finish the same job in under two minutes. The float32 computation pipeline roughly doubles the FLOPS required compared to the bf16 path the model was designed for, and the PCIe 3.0 transfers between four separate memory pools add latency that a single modern GPU with unified HBM would never incur. But competitive wasn't the point. The point was that four GPUs I bought on eBay for a thousand dollars total, sitting in a server in a shop building, can run a model that was released this month. The gap between "latest model" and "latest hardware" is not as wide as the spec sheets suggest, as long as you're willing to write the code that bridges it.

The P40 server was already paying for itself on LLM inference and TTS generation. Video generation is one more workload on a machine that I own, running models that I choose, on a schedule that I control. The 18-minute wait is the price of not asking anyone's permission.