A Crash Course on Local Image Generation with Stable Diffusion

Rob Laughter
10 min read · Mar 4, 2024


This post is brought to you by the A.I. Collaborative.

In my last post, An Overview of A.I. Image Generators, I covered a few of my favorite online image generation services.

As I mention in that post, however, my absolute, number one favorite image generator is one you can't purchase. You can't access it online, and there's no subscription to buy. It runs locally, on your own machine. No internet access required.

That’s Stable Diffusion.

Stable Diffusion is a family of image models from Stability AI that is free to use non-commercially. The models are designed to be easy to fine-tune with unique styles and concepts, and there are some really cool extensions that can create incredible images. Backing all of it is a vibrant, active community dedicated to both the models and the software used to generate images with them, with new features arriving almost every day.

As of the time of this writing, SDXL is the reigning king of open source image models — though that will almost certainly be upset in weeks, if not days. Older versions include SD 1.5 (which is still popular) and SD 2.1 (which never quite caught on). Upcoming models include Stable Diffusion 3 and Stable Cascade.

You need two things to get started generating images with Stable Diffusion: a Web UI and a checkpoint.

Step One: Download and Install a Web UI

“an eagle in flight taking a selfie, fisheye lens” — Rob’s Mix Ultimate

A web UI is a front end interface that lets you generate images with Stable Diffusion and other models in your web browser. Basically, a web UI handles all of the back end wizardry, letting you focus on creating cool images.

There are several web UI options to choose from, and each has its strengths and weaknesses. I'll briefly list a few below, in order from easiest to use to most powerful, with their pros and cons.

Fooocus (link)
Pros: Generally regarded as the easiest to use. Enhances prompts similar to Midjourney and DALL-E 3.
Cons: Fewer features than other web UIs.

InvokeAI (link)
Pros: Easy for beginners to use, but with advanced features for power users
Cons: Slower to get new features

Stable Diffusion WebUI (link)
Pros: The O.G. SD WebUI. Lots of extensions, pretty stable.
Cons: Takes a bit more work to set up, and options can be overwhelming for a first time user.

SD.Next (link)
A fork of the original A1111 web UI.
Pros: Tends to get more cutting edge features faster. Similar interface to A1111.
Cons: More advanced setup and usage.

ComfyUI (link)
Pros: The most flexible and extensible web UI, designed to build custom node-based workflows.
Cons: Crazy overwhelming if you’ve never used something like it before.

StableSwarm UI (link)
My current favorite web UI. Uses ComfyUI as a backend, but with a streamlined front end interface.
Pros: Extensibility of ComfyUI with the ease of use of A1111. Can link multiple computers together to generate faster.
Cons: Also somewhat tricky to set up. UI is still clunky while it’s in alpha.

Each one will require a bit of technical know-how to set up, but generally speaking, their installation processes are well documented on their GitHub pages.

Before you get started…

Stable Diffusion models can run on a wide range of consumer hardware, but the speed and quality of your generations will vary based on your system specifications. Larger models — such as SDXL models — require more resources, specifically your GPU’s VRAM. If you don’t have a dedicated GPU or you don’t have sufficient VRAM, they can run on your CPU, but images will generate very slowly.

I'd recommend a minimum of 6 GB of VRAM for running SDXL models, and preferably 10 GB or more. I have an RTX 3080 in my desktop, and it is sufficient for most things.

And while most image generation options run best on Windows machines with NVIDIA GPUs, they can run on Macs. My MacBook Pro M1 with 64 GB of unified memory runs most models just fine, albeit more slowly than my desktop's GPU.

If you have a “potato” computer that just can’t run A.I. models, you can rent GPU time with a number of cloud services such as Runpod, or you can run models in the cloud with services such as Replicate.
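If you're not sure how much VRAM you're working with, a quick PyTorch check will tell you. This is a minimal sketch and assumes you have PyTorch installed (every web UI on the list above uses it under the hood and will also report your hardware when it launches):

```python
import torch

# Report available GPU hardware and VRAM. A rough check only; the web UIs
# print similar information at startup.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"NVIDIA GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
elif torch.backends.mps.is_available():
    print("Apple Silicon GPU (MPS) detected; VRAM is shared unified memory")
else:
    print("No GPU detected; generation will fall back to the (slow) CPU")
```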

Step Two: Find some Checkpoints

“A serene landscape photograph of a tranquil lake reflecting the rugged peaks of the Rockies, surrounded by dense pine forests. Early morning, with mist rising off the water, natural light, wide angle shot, shot on Canon EOS R5 with a 24mm f/11 lens” — Rob’s Mix Ultimate

The SDXL 1.0 base model is generally downloaded automatically by your web UI so you can get started right away, but some of the real fun lies in experimenting with user-created checkpoints.

Checkpoints are fine-tuned variants of a base model. Anyone can fine-tune a model by curating a dataset of images and running a training script, but most users will download checkpoints trained by the Stable Diffusion community.

The most common place to download community checkpoints is Civitai. Just keep in mind that Civitai doesn't censor the models or images that users post to the site, so be sure to set your filters to weed out prurient content.

Checkpoints generally fall into one of two categories: photorealistic or animated/illustrated. There are thousands to choose from, but by sorting the list by all-time downloads, you can see which ones the community finds most popular and start there.

I’m kind of proud of my own checkpoint, Rob’s Mix Ultimate, which I used for all of the images on this page. Check it out here.


SDXL has a higher base resolution than SD 1.5 (1024x1024 vs. 512x512), but takes more system resources to generate. I strongly prefer SDXL, but there are great SD 1.5 models out there. Be sure to download a checkpoint for the base model of your choice.
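Your web UI handles loading checkpoints for you, but if you're curious what that looks like under the hood, here's a rough sketch using Hugging Face's diffusers library to load a single-file SDXL checkpoint you've downloaded. The file path and prompt are placeholders, not specific recommendations:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load a community SDXL checkpoint saved as a single .safetensors file
# (placeholder path; point it at whatever you downloaded from Civitai).
pipe = StableDiffusionXLPipeline.from_single_file(
    "checkpoints/my-favorite-sdxl-checkpoint.safetensors",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("an eagle in flight taking a selfie, fisheye lens").images[0]
image.save("eagle.png")
```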

Other variants to look into after you get your feet wet are SDXL Lightning models, which are tuned to generate images extremely fast but at lower quality, and Stable Cascade, which is a more complex but up-and-coming model. Playground 2.5 is a closed source (but free for non-commercial use) model that achieves great results, though it tends to be more "opinionated."

Stable Diffusion 3 was recently announced. As of the time of this writing, it has not been released, but early previews look very, very promising.

Support for each model varies among web UIs, so be sure to check which models and architectures are supported by your UI of choice.

Step Three: Generate!

“A worship team leading praise on stage a contemporary megachurch, stage lights cutting through the haze. In the foreground, silhouetted hands are raised. Shot with an ultrawide fisheye lens” — Rob’s Mix Ultimate

Now that you have your UI installed and you've downloaded your first checkpoint, it's time to warm up that GPU. Fire up your web UI using the instructions from its GitHub page, load up a checkpoint, and start generating images!

Part of the fun of generating images locally is finding settings and workflows that help you dial in the images to achieve your creative vision. If you just want to crank out images, use Midjourney. If you want to refine your skills, Stable Diffusion is the way to go.

As you're generating images, start with the default settings, but you'll also want to experiment with some of the basic parameters below.

  • Resolution. Don't go crazy with your resolution. The SDXL base resolution is 1024x1024. You can technically generate larger images, but the model was trained on specific image sizes, and things can get a little weird if you stray too far outside those bounds. The higher the resolution, the more resources and time the image will take to generate, too. You can see a list of “supported” SDXL resolutions here.
  • Samples. Samples (often labeled “steps”) are the number of denoising steps the model takes to generate the image. Generally speaking, more samples means higher quality, but you hit a point of diminishing returns pretty quickly, and extremely high sample counts can have a negative impact. Most checkpoints work best around 30 to 40 samples. The more samples you choose, the longer the image will take to generate. I generally stick with 40.
  • CFG. CFG stands for “classifier free guidance.” At the most basic level, CFG controls how closely the model will try to stick to your prompt. The higher the CFG, the more closely it will follow your prompt and the more detail the image will contain. If you set your CFG too high, however, you’ll notice that your images start to get a “burned” look, with extreme contrast and oversaturated colors. A good starting CFG is 7. I go as low as 4, and as high as 10.
  • Sampler. You don’t have to change this, but it’s worth knowing what it does. The sampler is an algorithm that guides the image generation process. They do have an impact on the final image, so feel free to experiment. Popular samplers are Euler (the “O.G.” sampler) and DPM++ 3M SDE Karras (my fav).
  • Second Pass, Refiner, or “Hi Res Fix.” As a solution to generating images at a higher resolution than the model was trained to produce, many image generation workflows include a second pass. In the second pass, the generated image is upscaled (typically 1.5x or 2x) and then generation continues. This process not only increases the resolution of the final image, but it also helps to refine the fine details in the result. Each web UI handles this a bit differently, so check the documentation for details on this process. (For what it’s worth, I always use a second pass with a 1.5x to 2x upscale. The difference is incredible.)
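To make those knobs concrete, here's a minimal sketch of how they map onto generation parameters if you script SDXL directly with the diffusers library; your web UI exposes the same settings through its interface. The model ID, prompt, and values here are examples, not gospel:

```python
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

# Load the SDXL base model (swap in any SDXL checkpoint you prefer).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Sampler: DPM++ multistep with Karras sigmas, roughly the web UIs'
# "DPM++ ... Karras" option. Euler is the default if you skip this.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

image = pipe(
    prompt="a tranquil mountain lake at sunrise, wide angle photograph",
    width=1024,               # resolution: stick close to SDXL's 1024x1024 base
    height=1024,
    num_inference_steps=40,   # "samples"/steps: 30 to 40 is the sweet spot
    guidance_scale=7.0,       # CFG: 7 is a sensible default; roughly 4 to 10 is usable
).images[0]
image.save("lake.png")
```

A second pass or hi-res fix would take this output, upscale it by 1.5x to 2x, and run it back through an image-to-image pass; each web UI wires that up a little differently, so check its documentation.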

Step Four: Advanced Features

“A close up fisheye ultrawide cinematic slow shutter action shot of kid driving a power wheels car on a race track, motion blur, drifting around a corner with tires screeching, smoke, dark contrast, underexposed, dutch angle” — Rob’s Mix Ultimate

The fun doesn’t stop at simply generating an image from a text prompt. The real power and advantage to running image models locally comes with the advanced features that web UIs offer. Some of the workflows that you can build are absolutely mind-blowing.

I’m not going to get super deep here, but I want to draw your attention to a few features that you may find particularly interesting.

  • Image to Image. Image to Image uses an input image as the starting point for your generation. The model first adds noise to the input image, then denoises it into a new image guided by your prompt. Note that Image to Image doesn't accept instructions like "give him a hat" or "change the color of her shirt." It's a rather imprecise way to use an image as an input, but simple.
  • LoRAs. LoRAs (Low Rank Adaptation models) are smaller, specialized models for fine-tuning your images. They can be used to create images with a specific style (e.g. sketches, contrasty photo, space nebulas, or cardboard), a specific character (e.g. Super Mario or Pokemon), or a specific concept (e.g. an outfit or vibe). Check your web UI documentation for how to use LoRAs.
  • ControlNets. ControlNets are models that can be used to guide the image generation process from an input image. Some examples include the OpenPose ControlNet (which detects a pose from a source image and generates a new image with a similar pose), the depth ControlNet (which creates a depth map from a source image and creates a new image to match), or the canny ControlNet (which creates an outline from a source image and matches the output to it). You can use multiple ControlNets together to craft the specific image you have in mind. Other ControlNets include Reference (which uses an input image as a style reference) and recolor (which is used to colorize greyscale images).
  • IP-Adapters. IP-Adapters can be used to transfer a character, object, or style from a source image to the generated image, without the need to prompt for that thing. They're particularly useful for creating consistent characters between generations. They're trickier to use than other tools, but super powerful.
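For a taste of what these features look like outside a web UI, here's a rough diffusers-based sketch of Image to Image with a LoRA layered on top. The file paths, LoRA file, and strength value are placeholders for whatever you've actually downloaded, not recommendations:

```python
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

# Image to Image: start from an existing picture instead of pure noise.
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# LoRA: layer a small style adapter on top of the checkpoint
# (placeholder path for a LoRA downloaded from Civitai or elsewhere).
pipe.load_lora_weights("loras/pencil-sketch-style.safetensors")

init_image = load_image("inputs/reference-photo.png")  # placeholder input image

image = pipe(
    prompt="a pencil sketch of the same scene",
    image=init_image,
    strength=0.6,             # how much noise is added; lower keeps more of the input
    guidance_scale=7.0,
    num_inference_steps=40,
).images[0]
image.save("img2img-with-lora.png")
```

ControlNets and IP-Adapters follow the same pattern: you load an extra model alongside the checkpoint and pass a conditioning image into the pipeline, which is exactly what the web UIs do behind their sliders and dropdowns.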

There are hundreds of tiny features that web UIs include to help you customize your images, and these only scratch the surface. If you decide to dive in, be warned: it’s highly addictive! I’ll often crank out a few hundred images in one session tweaking values and trying new things. The result, though, is that I can use the tool more effectively to execute on my creative vision.

Wrapping Up

“A national geographic photo of a chinese cormorant fisherman, a solitary lantern at the bow of his bamboo raft casting a warm glow on his nets, blue hour, in a karst landscape in yangshuo china, underexposed” — Rob’s Mix Ultimate

At the end of the day, running image models like Stable Diffusion locally takes more work than jumping into DALL-E or Midjourney, typing a prompt, and getting a great image. But if you’re serious about generating images with A.I. and working toward an artistic vision, there is no better set of tools available to achieve that goal.

If you want to learn more about A.I. image generation, and you’d like to hone your skills in a community of like-minded peers, check out my A.I. Collaborative.

The A.I. Collaborative is a free community with some incredible paid resources and support — a space that I created to share what I’ve learned about generative A.I. through thousands of hours of experimentation.


Rob Laughter

Rob is a creative professional exploring the intersection of technology and creativity. His current muse is generative A.I.