I recently began exploring the practical applications of local AI by adding a new GPU to my hardware setup. I purchased an Asus card with 16 gigabytes of video memory (compensated affiliate link) at what counts as a fairly “budget” price by today’s standards.
Check out what it can do in my latest video!
16 gigabytes of VRAM is enough to hold the Gemma 4 26-billion-parameter mixture-of-experts model. This model offers a balance of performance and efficiency comparable to what many users expect from cloud-based subscriptions, but without external data processing or subscription costs.
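As a rough back-of-the-envelope check on why a 26-billion-parameter model can fit on a 16 GB card, the sketch below estimates the size of the weights alone at a few quantization levels. The bits-per-weight values and the function name are illustrative assumptions, not figures from my testing; the real footprint depends on the specific model file and on extra memory needed for the context cache.

```python
def weight_gigabytes(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


VRAM_GB = 16  # capacity of the card described above

# Assumed quantization levels, chosen only for illustration.
for bits in (16, 8, 4):
    size = weight_gigabytes(26, bits)
    verdict = "fits" if size < VRAM_GB else "does not fit"
    print(f"{bits}-bit weights: ~{size:.0f} GB ({verdict} in {VRAM_GB} GB, before cache overhead)")
```

Under these assumptions, only an aggressively quantized copy of the model leaves headroom on a 16 GB card, which is why the quantization level of the downloaded file matters.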
My current setup involves an external GPU configuration using an Oculink Thunderbolt dock connected to a GMKtec mini PC. While the computer itself contains 64 gigabytes of system RAM, running these models entirely on the GPU is necessary for maintaining acceptable speeds. For the software interface, I am using LM Studio, a free cross-platform application that lets me adjust performance settings in detail. One useful feature of this software is the ability to offload portions of a model to system memory if the GPU’s capacity is exceeded, though this significantly reduces processing speed.
During initial testing, the Gemma model generated text at approximately 45 tokens per second while writing a short fiction story. The system draws about 200 watts under full load and idles at 36 watts. Beyond simple text generation, the setup is capable of visual analysis. In one test, I provided the model with a photograph of some friends and me in front of Space Shuttle Atlantis. It accurately identified the shuttle and the attributes of the individuals in the frame, correctly processing the visual data without external assistance.
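To put the speed and power figures in perspective, here is a small arithmetic sketch that converts them into time per response and energy per token. It assumes the 200-watt full-load figure applies for the whole generation, and the 1,000-token story length is a hypothetical example rather than something I measured.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time needed to produce num_tokens at a steady generation rate."""
    return num_tokens / tokens_per_second


def joules_per_token(watts: float, tokens_per_second: float) -> float:
    """Energy per generated token; one watt is one joule per second."""
    return watts / tokens_per_second


LOAD_WATTS = 200      # whole-system draw under full load (from the test above)
RATE = 45             # tokens per second observed with the Gemma model
STORY_TOKENS = 1000   # hypothetical response length, for illustration only

seconds = seconds_to_generate(STORY_TOKENS, RATE)
energy_wh = LOAD_WATTS * seconds / 3600  # watt-seconds to watt-hours

print(f"Time for {STORY_TOKENS} tokens: {seconds:.1f} s")
print(f"Energy per token: {joules_per_token(LOAD_WATTS, RATE):.2f} J")
print(f"Energy for the whole response: {energy_wh:.2f} Wh")
```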
I also tested the model’s ability to handle complex document analysis. I combined a 24-page FCC proposal regarding prepaid smartphones with a transcript of a video I had previously recorded on the topic. Because LM Studio does not currently support PDF files, I converted the information into a single text document. After increasing the context length to its maximum setting to ensure the AI could “see” the entire file, I asked it to draft testimony for the FCC based on the specific concerns I raised in the video and to attribute each concern to the relevant portion of the FCC’s draft proposal. The model successfully identified my points regarding the privacy of whistleblowers and reporters and was mostly correct in its attributions.
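Whether a combined document fits in the context window can be estimated before pasting it in. The sketch below uses the common rule of thumb of roughly four characters per token for English text; that ratio, the reply budget, and the context sizes in the example are all assumptions, and the model's real tokenizer will give different counts.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (rule of thumb, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_length: int, reply_budget: int = 2048) -> bool:
    """True if the document plus room reserved for the reply fits in the window."""
    return estimate_tokens(text) + reply_budget <= context_length


# Placeholder standing in for the proposal and transcript merged into one file.
combined = "word " * 12000  # about 60,000 characters

print("Estimated tokens:", estimate_tokens(combined))
for window in (8192, 32768):  # hypothetical context-length settings
    print(f"Fits in a {window}-token window:", fits_in_context(combined, window))
```

An estimate like this only indicates whether the context length needs raising; it does not replace checking the token count the application itself reports.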
The model demonstrated further utility in coding tasks. It managed to write a functional browser-based Space Invaders clone in a single attempt, including basic game logic and sound effects. Later, I used it to generate a Python script designed to scrape education statutes from a state website. When the first version of the script returned an error, the AI analyzed the problem and provided a corrected version that successfully consolidated numerous legal chapters into one searchable document.
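The script the model wrote is not reproduced here, but the general shape of such a scraper can be sketched with Python's standard library alone. Everything below is an illustrative assumption: the function names, the section-header format, and the sample HTML are made up, and a real state website would need its own URL list and page-specific parsing.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce one HTML page to its visible text, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def consolidate(chapters: dict[str, str]) -> str:
    """Join {chapter title: HTML} into one searchable plain-text document."""
    sections = [f"=== {title} ===\n{html_to_text(html)}" for title, html in chapters.items()]
    return "\n\n".join(sections)


def fetch(url: str) -> str:
    """Download one page; defined for completeness and not called in this demo."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Demo with invented inline HTML instead of a live site.
sample = {
    "Chapter 1": "<h1>General Provisions</h1><script>track()</script><p>Sample text.</p>",
    "Chapter 2": "<h1>Funding</h1><p>More sample text.</p>",
}
print(consolidate(sample))
```

Keeping the download step separate from the parsing step makes it easier to rerun only the part that failed, which is useful when a first attempt returns an error.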
For datasets too large for a standard chat interface, I used an application called AnythingLLM to perform retrieval-augmented generation, or RAG. This process involves indexing and embedding documents locally so the AI can query them efficiently. I uploaded five megabytes of state statutes and asked the model to calculate a specific school grant amount based on a complex formula. The model performed the necessary math and returned the correct figure of $2.2 million. I found that while the local model sometimes requires more specific prompting than cloud-based alternatives like Google’s NotebookLM, it is capable of providing high-fidelity results.
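The retrieval step in RAG can be illustrated with a toy sketch: split the documents into chunks, score each chunk against the question, and hand only the best matches to the model. This is not how AnythingLLM is implemented. It swaps the learned embedding model and vector database for simple word counts and cosine similarity, and the sample chunks are invented text, not real statutes.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, words_per_chunk: int = 50) -> list[str]:
    """Split a long document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk]) for i in range(0, len(words), words_per_chunk)]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(query, embed(chunk)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[str], top_k: int = 2) -> str:
    """Assemble the text that would be sent to the local model."""
    context = "\n---\n".join(retrieve(question, chunks, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Invented example chunks, for illustration only.
chunks = [
    "The school grant formula multiplies enrollment by the base rate.",
    "Bus routes are set by the district transportation office.",
    "Teachers must hold a valid certificate.",
]
print(build_prompt("How is the school grant calculated?", chunks, top_k=1))
```

Because only the retrieved chunks reach the model, the wording of the question affects which passages are found, which may be one reason a local setup benefits from more specific prompting.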
Running larger models, such as those with 31 billion parameters, reduces performance to about five tokens per second when the GPU memory is exceeded. However, the flexibility to swap between model families such as Google’s Gemma, Alibaba’s Qwen, or those from other developers allows for a customized approach to different tasks. These tools have reached a point where they are functional for data analysis and automated workflows while keeping all information on a local machine. For anyone with a modern video card and a sufficient amount of VRAM, these local models offer a viable way to experiment with AI without relying on the cloud.

















