I shoved 2 GPUs in an Old PC to Make a Killer Local AI Rig!

I spent the weekend experimenting with a six-year-old gaming PC to see if a dual-GPU configuration could transform it into a powerful local AI workstation. The objective was to determine if older hardware could handle large language models without relying on cloud-based processing. And it worked!

Check it out in my latest Local AI video.

The system is built around an Asus Maximus XI motherboard and an Intel i9-9900K processor, both dating back to 2019. It contains 32GB of DDR4 RAM.

For the graphics setup, I used an Nvidia 3080 with 10GB of VRAM (circa 2020) and added the affordable 5060ti we reviewed recently with 16GB of VRAM. Combined, these cards provide 26GB of available video memory. One notable aspect of this configuration is that it does not require specialized cabling or SLI configurations. Software such as LM Studio can detect both cards and distribute the model across the available memory just over the system bus. To maximize efficiency, I routed the display output through the Intel processor’s onboard video, ensuring that none of the GPU memory was diverted to drive the display.

During testing, I evaluated several models, including Qwen’s 35-billion parameter Mixture of Experts (MoE) model and a Google Gemma 31-billion parameter dense model. The MoE model proved to be significantly faster, generating text at approximately 52 tokens per second. Because an MoE model only activates a fraction of its parameters for any given task, it offers a higher speed of output but still needs enough VRAM to house its entire parameter space. In contrast, the dense model processed information through all its parameters simultaneously, which resulted in a slower rate of about 9 tokens per second.

I applied these models to practical tasks, such as generating code for a Space Invaders-style game and summarizing long video transcripts. The results varied by model; for instance, one model produced a visually cohesive game that required minor debugging, while another produced a stable but less detailed version. In both cases, the local setup handled the coding and iteration process in a manner similar to cloud-based AI models. I also integrated the local AI with N8N, a self-hosted automation platform, to summarize RSS news feeds. The system successfully filtered large amounts of data and generated HTML summaries with functional links in under a minute.

You can see some output samples of each these tasks here.

Finding the optimal context length was a primary technical challenge. Context refers to the amount of information the AI can retain during a conversation. While I initially used lower defaults, I eventually identified 28,000 tokens as the functional limit for this hardware. Pushing beyond this number caused the system to crash or significantly slowed performance as data spilled over into the system RAM. To maintain stability, I enabled settings to limit offloading to dedicated GPU memory and used a compression technique for the cache.

The project confirms that a multi-GPU approach can extend the utility of aging hardware for modern AI tasks. Rather than investing in a new, high-cost workstation, I was able to pool the resources of existing cards to run complex models locally. My next steps involve exploring how these models can be further integrated into daily workflows and testing their limits with more specialized data sets.

See more of my Local AI series here!