For the past few years, I have been downloading local AI models to determine if they could handle practical automation tasks or summarize long-form content. Historically, these experiments have been unsuccessful, with the models typically failing to provide comprehensive results or losing track of the original context. However, recent developments in model optimization, specifically on Apple hardware, have changed the baseline for what is possible on a personal computer.
In my latest video, I demo running Google’s latest Gemma 4 model, a 26-billion parameter “mixture of experts” model optimized for the Mac using the MLX framework on my 2021 Macbook Pro with an M1 Max and 32GB of RAM.
I observed the model generating approximately 50 tokens per second. While this is slower than a high-end cloud-based system, it represents a very usuable generation speed for a local setup. The unified memory architecture of the Mac allows the GPU to access data efficiently, which is why these older machines remain relevant for AI tasks that would otherwise require significant cloud computing resources.
During my testing, I provided the model with a transcript from a recent video to see if it could produce a coherent summary. Unlike previous local models that often provided incomplete or erratic responses, this model maintained a consistent narrative and adhered strictly to the provided text. I also tested it with a dense legal document from an FCC docket. After processing a large amount of extracted text, the model was able to delineate the key arguments of the filing and, upon further prompting, condensed the information into a concise executive summary.
I also examined the model’s vision capabilities using a tool called MLX studio, which supports image analysis. I uploaded a photograph with some friends and I in front of a space shuttle and asked the model to describe the scene. While it misidentified the vehicle as a Dreamchaser—a different type of spacecraft—the level of detail was a step forward from earlier local models that often provided much less accurate descriptions. This functionality is particularly useful for my ongoing project to index a large archive of digital photos dating back to 1997. Using a local model for this type of organization could potentially eliminate the costs and privacy issues associated with the thousands of API calls required for cloud-based indexing.
To test the model’s utility in a production environment, I integrated it into my N8N automation server. I currently use a cloud-based AI with my N8N server to scan news feeds and identify relevant stories for my daily work. I ran a portion of this same workflow using the local Gemma model to see if it could replicate the results. It took approximately three minutes to process the news briefing. Although the results were not quite as polished as those from the cloud, the model successfully identified unique stories and avoided duplicated stories about Apple’s WWDC event that were being published at the time.
Google appears to be prioritizing the development of effective local models more than some of its competitors, providing a way for users to utilize AI without incurring expenses beyond the electricity required to run their own hardware. Seeing a 26-billion parameter model function with this level of stability on a five-year-old laptop has caused me to rethink my existing workflows. I am now looking at which of my daily tasks can be moved away from the cloud and managed entirely on my own hardware.
