Run LLMs on your local computer—for free
This post is brought to you by the A.I. Collaborative.
It’s April 2024. Chatbots are, as my mother says, “a dime a dozen.” You can use perfectly capable models for free from just about any service provider—Gemini from Google, Claude from Anthropic, ChatGPT from OpenAI. These proprietary models are fast, accurate, and often have helpful integrations like custom GPTs, the ability to run code, or connections to services like your email and calendar.
So why would anyone want to run a language model locally, on their own computer?
There are a few reasons. First, if you’re a tinkerer, it’s cool. But more practically, running local language models:
- Keeps your data safe, as your data never leaves your local machine.
- Is customizable, giving you the ability to choose between a growing library of foundational models and fine-tuned variants.
- Is extensible, with features like retrieval-augmented generation to access your docs and data, agents that work together to accomplish tasks, access to your local file system, and the ability to run code on your local machine.
If any of these use cases sound intriguing to you, let’s dive into how you can set up a local LLM on your own hardware and get started experimenting.
Hardware for running local LLMs
Let’s start by addressing hardware—the physical computer you’re going to use to run your LLM.
As a general rule in the world of generative A.I., your graphics card (GPU) is king, as most A.I. models are currently designed to run on powerful graphics cards. Specifically, the more video memory (VRAM) you have, the larger and more sophisticated the models you’ll be able to run. Many A.I. models can technically run on your CPU, but it’s slooooooow and your mileage may vary.
Personally, I have a Windows machine with an RTX 3080 graphics card, which has 10 GB of VRAM. This is on the low side, and while I can run medium-sized models at a slower speed, I can’t run the largest and most sophisticated models.
I also use a MacBook Pro M1 with 64GB of unified memory. Because the MBP’s unified memory is shared between CPU and GPU, I’ve found that I can in fact run much larger models.
What LLMs can I run locally?
Let’s take a second and explore models. Just like A.I. image models or hosted LLMs, there are a gazillion options to choose from, each with their own strengths and weaknesses, and it may be overwhelming to parse through them. Let’s simplify.
You’ll often see models listed in a format something like this: llama2:7b or mixtral:8x22b.
The first part of that string is the model’s name. There are lots of variants, but some of the most common open-weight LLMs at the time of this writing include:
- llama2 (from Meta)
- gemma (from Google)
- mistral (from Mistral)
- mixtral (a “mixture of experts” model from Mistral)
- WizardLM (from Microsoft)
- command-r (from Cohere)
- phi (a lightweight model from Microsoft)
The second part of the string (e.g. 7b or 8x7b) represents the size of the model, expressed as the number of billions (b) of parameters it contains. Generally speaking, models with more parameters will be more sophisticated in terms of knowledge and accuracy, at the cost of requiring much more storage space and more VRAM to run.
Common model sizes and their approximate memory (RAM or VRAM) requirements include:
- 7b models (at least 8GB of RAM)
- 13b models (at least 16GB of RAM)
- 30b models (at least 32GB of RAM)
- 70b models (at least 64GB of RAM)
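The figures above are comfortable system-memory guidelines; if you want a rough sense of where they come from, the weights themselves take up roughly (parameters × bits per parameter ÷ 8) bytes, plus some runtime overhead for context and buffers. Here’s a minimal back-of-the-envelope sketch of that arithmetic, assuming 4-bit quantization (common for local tools) and an illustrative overhead figure—not an exact requirement for any particular model.

```python
# Rough, illustrative estimate of the memory a quantized model occupies.
# Real requirements vary with quantization format, context length, and runtime overhead.

def estimate_memory_gb(params_billion: float, bits_per_param: int = 4, overhead_gb: float = 1.5) -> float:
    """Approximate GB needed to hold the model weights, plus a little runtime overhead."""
    weight_gb = params_billion * 1e9 * bits_per_param / 8 / 1e9  # params -> bits -> bytes -> GB
    return weight_gb + overhead_gb

for size in (7, 13, 30, 70):
    print(f"{size}b model at 4-bit quantization: ~{estimate_memory_gb(size):.1f} GB of weights")
```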
Personally, I have found Cohere’s Command R model to be the most interesting model I’ve used recently. It was just recently released and performs very, very well—outperforming many much larger open models, and on par with or better than some proprietary models such as GPT-3.5 Turbo.
Another favorite is LLaVa, a vision-equipped language model that can interact with images. This is a really cool LLM to integrate into image generation workflows to write image prompts based on an input image, or to caption datasets for training.
Other capable models that I’ve used include:
- Mixtral 8x7b, a mixture of several smaller models that work together for “big model” results with much lower VRAM requirements.
- WizardLM 2, a brand new model, briefly released by Microsoft before being taken down—but still accessible from sources that downloaded it before it was removed.
- Wizard Vicuna, an “uncensored” model that can be used for good or for evil. Do the right thing.
- Llama 2, Meta’s classic open LLM. I can run the 70b variant on my 64GB M1 MacBook Pro. As of the time of this writing, its successor, Llama 3, is expected to be released in May 2024.
If you have the hardware to run them (I don’t), there are some very good, very large models on the horizon. These include WizardLM 2 8x22b and Mixtral 8x22b, both massive models that are expected to perform extremely well.
What software do I need to run LLMs on my computer?
Just like in the world of models, there are plenty of options that you can use to run local models. I’m going to direct your attention to a few of them that I have used with success.
Again, each of these comes with pros and cons, and which one you choose will depend largely on your use cases and personal preferences.
LM Studio
LM Studio is a popular and easy-to-use suite for downloading and running local LLMs. It includes a built-in model downloader so you can easily browse and download models, and each model is listed with its compatibility with your hardware.
LM Studio also includes an OpenAI-compatible server for communicating with your LLMs through external apps, and a playground to compare multiple models alongside one another.
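As a minimal sketch of what that looks like in practice: once the local server is running (LM Studio shows the exact address; http://localhost:1234 is the usual default) and a model is loaded, you can point the standard OpenAI Python client at it. The model name below is a placeholder, and the API key is ignored by the local server.

```python
# Minimal sketch: chatting with a model served by LM Studio's OpenAI-compatible local server.
# Assumes the server is running and a model is loaded; check LM Studio for your actual address/port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder; the server answers with whichever model you have loaded
    messages=[{"role": "user", "content": "Explain VRAM in one sentence."}],
)
print(response.choices[0].message.content)
```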
One downside to LM Studio is that models sometimes take a bit of work to configure properly, which can require some technical knowledge.
Ollama
Ollama is my preferred application for running LLMs on my Mac, and it also has a preview version available for Windows.
The reason I like Ollama is that, like LM Studio, it works as an API that other applications can connect to, but without the overhead of running a front end application. So I can leave Ollama running on one machine and connect to it through a client on any of my others, including my phone.
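To give a sense of how lightweight that API is, here’s a minimal sketch of a chat request to a running Ollama instance. It assumes Ollama is running (it listens on http://localhost:11434 by default) and that the model—llama2 here—has already been pulled.

```python
# Minimal sketch: sending a chat request to a locally running Ollama server.
# Assumes Ollama is running and the llama2 model has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",  # swap localhost for another machine's address to use it remotely
    json={
        "model": "llama2",
        "messages": [{"role": "user", "content": "Give me one tip for running local LLMs."}],
        "stream": False,  # return a single JSON response instead of a token stream
    },
)
print(resp.json()["message"]["content"])
```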
Ollama’s GitHub page has a long list of clients and integrations. I’ve used Enchanted (a no-frills Mac/iOS native client) and Anything LLM (which can also interact with your documents natively). Mac users may also be interested in MindMac, which integrates LLM features into other applications as well.
Ollama also has its own model library with pre-configured templates for many models. It doesn’t always have the latest and greatest (though you can always import them manually) and pulling models is done through the command line, which can be intimidating for some users.
Other options for running local LLMs
There is no shortage of options when it comes to running local LLMs, and there’s a good chance if you can dream of a feature, it’s out there somewhere.
If the options above don’t strike your fancy, you may want to check out Text Generation WebUI. This is the LLM equivalent of the Stable Diffusion WebUI for image generation. It’s meant to be a flexible and extensible front end for running language models. It takes a bit more setup than the options above, but gives you more flexibility.
Advanced use cases for local LLMs
Once you dip your toes into the waters of local LLMs, there are tons of interesting applications for the technology. Here are a few ways that you may consider putting your LLM to work for you.
- Try LangChain. LangChain is a framework for developing applications powered by LLMs. It connects models and features (such as document interaction, text-to-speech, embeddings, and custom functions) to help you develop sophisticated workflows beyond the simple chatbot.
- Create API workflows. I’ve created simple Python applications that use Ollama’s chat API with LLaVa vision models to caption image datasets, process audio with Whisper to create summaries of meetings and workshops, and more. (A minimal captioning sketch follows this list.)
- Create a model fine-tuned on your data. Because many models have open weights and generous licenses, you can fine-tune them on your own data. This could mean training the model to answer questions about your organization, write in your style, or accomplish a specific task.
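As promised above, here’s a minimal sketch of the image-captioning idea: asking a LLaVa vision model served by Ollama to describe a local image. It assumes Ollama is running locally and the llava model has been pulled; the image path and prompt are placeholders.

```python
# Minimal sketch: asking a LLaVa vision model (served by Ollama) to caption a local image.
# Assumes Ollama is running locally and the llava model has been pulled; the image path is a placeholder.
import base64
import requests

with open("example.jpg", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llava",
        "messages": [
            {
                "role": "user",
                "content": "Write a short, descriptive caption for this image.",
                "images": [image_b64],  # Ollama accepts base64-encoded images for vision models
            }
        ],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```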
Where will you start?
The most important thing to do is to start. If running a local LLM isn’t your thing right now, stick to hosted models. If you want to dabble, pick one model and one application to run it, and give it a try.
If you want to engage with how to use LLMs for work, creativity, and personal growth, check out my A.I. learning community, the A.I. Collaborative, or join us on Discord. The Collaborative is a community of like-minded A.I. practitioners, with optional access to resources, workshops, and support.