A First Timer’s Guide to Cooking Up a Fine Tuned SDXL Model

Rob Laughter
Apr 4, 2024

--

Let’s cook up a personalized Stable Diffusion model. Image generated with the mix I created in this article, RobMix Evolution.

This post is brought to you by the A.I. Collaborative.

Stable Diffusion is one of my favorite models for generating images with A.I. because of the flexibility that it offers in creating images that I’ve envisioned. A.I. images get a lot of criticism for being “created by a robot,” or not requiring artistic talent, but I find that using the suite of tools out there to craft an image and execute on a creative vision is an art form of its own.

Part of the fun is in experimenting with different checkpoints and LoRA models that the community has created and shared on sites like Civitai. I’m ashamed to admit that I currently have more than 400 GB of SDXL checkpoints on my hard drive, and that’s clearly not enough, because I’ve recently found myself interested in fine tuning my own checkpoint.

I didn’t really know where to start or what I was doing (in fact, I still don’t), so in this post, I’m going to lay out a “blind leading the blind” approach to how I fine tuned my first couple of checkpoints so maybe you can, too. I’m sure this isn’t the best way to do it, and I’m sure that I may get some technical details wrong, so keep that in mind as you go.

Model Merging: More of an art than a science

You have a couple of options for creating a fine tuned version of a Stable Diffusion model. The first way — the hard way — is to train a checkpoint with new image data. This gives you control over the inputs and lets you introduce new concepts to the model, but I wasn’t ready for that.

I opted for the model merge. This approach blends one or more existing checkpoints into a base model, altering the weights and combining properties from each. It can't introduce new concepts or fine tune the behavior of specific weights in the model, but it can blend existing qualities, such as style, tone, and subject matter, into a new checkpoint.

Think of it like a crossover between baking a cake from scratch without a recipe and the “which is better, one or two?” game from the optometrist.

In this case, my fine tune was a model merge: blending two or more checkpoints into a base model to combine properties of each. (You could also add genuinely new data by training the model further on a fresh dataset. I didn't.)
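
To make the idea concrete, here's a rough sketch of what a classic merge boils down to under the hood: a weighted average of every matching tensor in two checkpoints. This isn't the workflow I actually used (ComfyUI handles all of this for you), and the file paths are just placeholders, but it shows how little magic is involved.

```python
# A rough sketch of a classic weighted merge, assuming two SDXL checkpoints
# saved as .safetensors with matching key names. File paths are placeholders.
from safetensors.torch import load_file, save_file

def simple_merge(path_a: str, path_b: str, ratio: float = 0.5) -> dict:
    """Blend model B into model A: merged = (1 - ratio) * A + ratio * B."""
    a = load_file(path_a)
    b = load_file(path_b)
    merged = {}
    for key, tensor_a in a.items():
        if key in b and b[key].shape == tensor_a.shape:
            merged[key] = (1.0 - ratio) * tensor_a + ratio * b[key].to(tensor_a.dtype)
        else:
            # Keep model A's weights for anything the two models don't share.
            merged[key] = tensor_a.clone()
    return merged

save_file(simple_merge("base.safetensors", "candidate.safetensors", 0.5),
          "merged.safetensors")
```

A flat 50/50 ratio is the "which is better, one or two?" version of merging; the rest of this post is about getting more deliberate than that.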

Step one: choosing model candidates

Your model merge project starts with a base model — in this case, I wanted to more deliberately refine RobMix Ultimate, a Stable Diffusion XL checkpoint that I created a couple of weeks ago in a sort of “happy accident” as I was first playing around with model merging.

As the name suggests, model merging involves merging one model into another, and to do so, you need to choose which models to blend. In my case, I had some specific objectives that I wanted to accomplish with this merge. I wanted to:

  1. Improve the overall tone and contrast
  2. Add detail
  3. Improve prompt following and coherence
  4. Make images of people feel more candid and natural and less posed

I selected models that had elements of these specific qualities. I’ve been collecting checkpoints forever, so I had a good selection to choose from, but you can always browse Civitai to find models that fit your style.

Model choice matters. If you’re going for a photographic style, mixing an anime model may not be the best choice, but there may be elements of that model that you can merge in to produce the result you’re envisioning. It all comes down to experimentation.

Step two: narrowing down the selection

As I’m writing this, we’re in the middle of March Madness and, while I may not know the first thing about basketball, I still filled out a bracket.

In much the same way, I started with a selection of 17 initial candidates from my library and narrowed down my selection by pitting model against model. I ranked my preference, “seeded” my bracket, and generated an initial set of images with each model. I kept all of the settings the same — prompt, CFG, samples, sampler, etc. — and only changed the model so I would have an apples-to-apples comparison of what images produced by each model looked like.

From there, I chose my preferred model for each matchup. I was looking for which model best accomplished the specific goals that I had for my merge. Then I moved to the next round, changed the settings and subject matter, and continued on until I was left with my Final Four.

Step three: understand the model

As I began tinkering, I wanted to get an idea of how the model worked before I started adjusting dials willy-nilly without any real sense of what they were doing.

A Stable Diffusion checkpoint consists of two parts: the model and the text encoder. The model (the UNet) guides the image generation process, while the text encoder affects the way your prompt is understood by the model. It's undoubtedly more complicated than that, but that's the gist. Both have a big impact on the final image.

I noticed a ModelMergeBlocks node in ComfyUI and did some digging, which led me to this in-depth explanation of what block merging was and how it worked. It’s super technical, but the illustrations are helpful.

Explanations of block merging from this article.

These illustrations, specifically, really drove the concept home for me. They show how a model contains many layers with varying levels of detail. Those layers are collected into blocks, and sorted into input, middle, and output blocks. As we’ll explore in a moment, each of these different areas has a unique impact on the final image.

A classic model merge is a crude blend of two models with a simple ratio. In a block merge, you can blend different layers from each model at different ratios to really fine tune the result.
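
If you want to see roughly what that means in code, here's a hedged sketch that extends the simple merge above: instead of one global ratio, each weight gets a ratio based on which part of the UNet it lives in. The key prefixes assume the usual model.diffusion_model.* layout of Stable Diffusion checkpoints, so check the keys in your own files before trusting it.

```python
# Sketch of a block merge: the same weighted blend, but the ratio depends on
# whether a weight belongs to the UNet's input, middle, or output blocks.
# Key prefixes assume the standard "model.diffusion_model.*" checkpoint
# layout; everything outside the UNet stays 100% model A here.
from safetensors.torch import load_file, save_file

def block_ratio(key: str, ratios: dict) -> float:
    if "input_blocks" in key:
        return ratios["input"]
    if "middle_block" in key:
        return ratios["middle"]
    if "output_blocks" in key:
        return ratios["output"]
    return 0.0  # text encoder, VAE, embeddings: keep model A

def block_merge(path_a: str, path_b: str, ratios: dict) -> dict:
    a, b = load_file(path_a), load_file(path_b)
    merged = {}
    for key, tensor_a in a.items():
        if key in b and b[key].shape == tensor_a.shape:
            r = block_ratio(key, ratios)
            merged[key] = (1.0 - r) * tensor_a + r * b[key].to(tensor_a.dtype)
        else:
            merged[key] = tensor_a.clone()
    return merged

# Example: leave the middle block alone, pull a little structure from B,
# and lean on B's output blocks for tone and detail.
merged = block_merge("model_a.safetensors", "model_b.safetensors",
                     {"input": 0.2, "middle": 0.0, "output": 0.4})
save_file(merged, "block_merged.safetensors")
```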

This left me with a question — what exactly is each layer of the model actually doing? I needed to test to see which layers affected which parts of the image — overall structure and composition vs. fine details, subject vs. background, etc.

If that seems like a lot to you, don’t worry. You can get started with a classic merge using the concepts I’ll outline below. Let’s look at how to actually put this into practice.

Step four: build the test kitchen

Before I could start experimenting with different recipes, I had to build the test kitchen — the environment where I would experiment and iterate on my merges.

I chose ComfyUI for the task because it offers the most flexibility for creating a workflow to accomplish the task at hand. It’s not the easiest tool, especially if you’ve never used a node-based interface, but once you get used to how things work, you’ll never go back.

ComfyUI offers several nodes for merging models, each with varying levels of control.

  1. ModelMergeSimple. It does what it says: a straightforward merge between two models, with a single ratio for mixing the weights. This is what I started with.
  2. ModelMergeBlocks. This gives you some more control over how different parts of the model — the input, middle, and output blocks — are blended together.
  3. ModelMergeBlockNumber. Even finer control over your merge, letting you adjust the blend layer by layer.

There are a few others that get even more complex, up to the ModelMergeSDXLDetailedTransformers node, which gives you hundreds of parameters to fine tune. Unless you're a mad scientist, you probably don't need that one. I stuck with the ModelMergeBlocks node, but a simple merge is a perfectly good place to start.

As we discussed above, checkpoints also include a text encoder, which instructs the model how to interpret your prompt. At a lay level, one checkpoint could associate “a cute puppy dog” with a cocker spaniel, while another could associate it with a golden retriever. That’s way oversimplified, but it gets the point across that the text encoder matters.

ComfyUI happens to have a CLIPMergeSimple node to blend text encoders. I found that I usually got the best results by using a single model's text encoder, but there were some instances where I would blend.
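
ComfyUI does this for you, but if you're curious what "merging text encoders" means at the weight level, here's a rough sketch. It assumes the SDXL text encoder weights live under keys starting with "conditioner." in a single-file checkpoint; verify the key names against your own files before relying on it.

```python
# Rough sketch of blending only the text encoder between two checkpoints.
# Assumes SDXL single-file checkpoints where text encoder weights sit under
# keys starting with "conditioner." (verify against your own files);
# the UNet and everything else stays 100% model A.
from safetensors.torch import load_file, save_file

TE_PREFIX = "conditioner."  # assumption: SDXL text encoder key prefix

def clip_merge(path_a: str, path_b: str, te_ratio: float) -> dict:
    a, b = load_file(path_a), load_file(path_b)
    merged = {}
    for key, tensor_a in a.items():
        if key.startswith(TE_PREFIX) and key in b and b[key].shape == tensor_a.shape:
            merged[key] = (1.0 - te_ratio) * tensor_a + te_ratio * b[key].to(tensor_a.dtype)
        else:
            merged[key] = tensor_a.clone()
    return merged

# 25% of model B's text encoder blended into model A
save_file(clip_merge("model_a.safetensors", "model_b.safetensors", 0.25),
          "clip_merged.safetensors")
```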

In my final workflow, I set up a merge between three models. The first merge blends my base model with a second model, then the result is merged again with the third. I used the ModelMergeBlocks node to give myself some room for experimentation without getting too complicated.

I also added some LoRA loaders to optionally add some LoRA models to the mix before saving the checkpoint. I didn’t use them here, but I used them to tweak my initial base model.

Step five: lay out the ingredients

Because we’re playing jazz rather than a rehearsed classical piece, we need to figure out what ingredients we’re cooking with (I know, I’m mixing metaphors).

What I mean is that we need to understand what each model looks like, what qualities they have, and with our understanding of how blocks work generally, how different blocks within the model affect the image.

Because I could adjust how much of each model, both UNet and text encoder, was mixed in, I created some crude comparisons in which I mixed 100% of Model A into Model B at a specific block (input, middle, or output) to see how it affected the base image.

For example, I would merge Model A input, Model B middle, Model B output, all at 100%. I repeated this for all of the possible permutations (there were eight combinations per pair of models), and then compared the results.
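
If you'd rather script that comparison than click through it, here's a sketch that reuses the block_merge helper from earlier to spit out all eight all-or-nothing combinations for a pair of models. File names are placeholders.

```python
# Sketch: generate all 2^3 = 8 "all-or-nothing" block combinations for a
# pair of models, using the block_merge helper sketched earlier.
# A ratio of 1.0 means "take that block entirely from model B."
from itertools import product
from safetensors.torch import save_file

for combo in product("AB", repeat=3):
    ratios = {block: (1.0 if source == "B" else 0.0)
              for block, source in zip(("input", "middle", "output"), combo)}
    merged = block_merge("model_a.safetensors", "model_b.safetensors", ratios)
    save_file(merged, f"blocks_{''.join(combo)}.safetensors")  # e.g. blocks_ABB
```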

I compiled all of my outputs into a canvas in PureRef to arrange and compare results.

As I looked at the comparisons, I realized that the middle block had only a subtle effect, mostly on the finest details: useful for the nitty-gritty of a fine tune, but not for making big changes. The outer blocks had a much bigger impact. Input blocks largely shaped the overall structure and composition of the image, while output blocks played a significant role in defining the overall tone, the subject, the pose, and so on.

I discovered that the text encoder also played a big role in the final image, affecting everything from the style and tone to the subject's appearance and the composition.

Step six: mix up the first batch

Let’s recap where we are in the process.

  • I had a base model, my initial simple merge, that I wanted to refine.
  • I had a goal: improve contrast and tone, improve prompt adherence, and make people feel less posed.
  • I had candidate models to merge in, with sample images from each.
  • I had an understanding of how each layer within the model affected the final image.

Armed with this info, I could start experimenting with different ratios of those ingredients to get the result that best matched my creative vision.

A good chef knows generally that some ingredients don’t mix — think chocolate and anchovies — but also knows that some unlikely ingredients pair well together — like strawberries, ice cream, and balsamic vinegar.

In the same way, I had no way of knowing for sure how different models would blend with my base model, but I had a pretty good idea of what ingredients to combine, with the expectation that I might find some unlikely pairs. Just to be sure I didn’t end up with any chocolate-and-anchovy disasters, I did a quick simple 50/50 merge of each model with my base.

As I started mixing, my reasoning went something like this:

  • For composition, I would experiment with input blocks and text encoder.
  • For style and tone, I would experiment with output blocks.
  • For pose and subject, I would experiment with text encoder and output blocks.
  • For background details, I would experiment with output blocks.
  • I would ignore the middle block for now and leave it at 100% base model.

I could probably get way more granular here and really dial in the details by going layer by layer, but I wasn’t ready for that yet.

I started with the text encoder, as it’s the first step in the chain, and ran several comparisons to see what balance got me closest to my desired look. I made an X/Y plot of each model’s text encoder merged into the result at 0/25/50/75/100% ratios and narrowed it down to the few that I preferred, noting the values in my PureRef canvas so I could keep track.
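
Outside of ComfyUI, that kind of sweep is easy to script. Here's a sketch that reuses the clip_merge helper from earlier to save one checkpoint per ratio, so each can be run through an identical generation setup; the paths are placeholders.

```python
# Sketch: save one merged checkpoint per text encoder ratio so the results
# can be compared in an X/Y grid with identical prompt, seed, CFG, and
# sampler. Reuses the clip_merge helper sketched earlier.
from safetensors.torch import save_file

for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
    merged = clip_merge("base.safetensors", "candidate.safetensors", ratio)
    save_file(merged, f"te_sweep_{int(ratio * 100):03d}.safetensors")
```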

(As a side note, one handy feature of ComfyUI is that you can drag an image into the canvas to restore the workflow that it was created with. This is really handy if you forget things like prompts, etc.)

With my text encoder candidates chosen, I did the same thing with the UNets for each of my checkpoints. I ran a few dozen generations, tweaking the balance of each model at the input and output stages in the same ratios as my text encoder tests, to "rough out" the overall mix.

After all of that, I landed on the best model for the first merge into my base. I ran a few more tests, more subtly tweaking each value in the input block, output block, and text encoder to get what I thought most closely aligned with my vision.

With that locked in, I added a third model to the mix, introducing it more subtly and deliberately to refine the result. I ended up using Model C’s text encoder and output block more heavily because it created the best candid images and fine detail — things my base model was lacking.

At the end of the process, I merged Model B into Model A at 20% in the input block and 25% in the text encoder, then merged Model C into the result at 20% in the input, 40% in the output, and 70% in the text encoder.
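
If you like keeping a paper trail, that recipe is simple enough to write down as data. A sketch follows; any block I didn't mention above is assumed to stay at 0%, meaning it keeps the model being merged into.

```python
# The final recipe as data. Ratios are "how much of the incoming model to
# blend in"; anything not mentioned in the write-up above is assumed to be 0.0.
MERGE_RECIPE = [
    {"into": "RobMix Ultimate (base)", "merge_in": "Model B",
     "input": 0.20, "middle": 0.00, "output": 0.00, "text_encoder": 0.25},
    {"into": "stage 1 result", "merge_in": "Model C",
     "input": 0.20, "middle": 0.00, "output": 0.40, "text_encoder": 0.70},
]
```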

When all was said and done, I added a CheckpointSave node and saved the resulting merged model.

Step seven: test and tweak

Throughout this process, I had been testing the models with a narrow range of prompts — mostly photographic scenes with a single subject. But my base model was a really great, versatile model, and I wanted to make sure I hadn’t inadvertently over-emphasized portraits in the merge.

To quickly compare models, I set up a workflow to generate and upscale images with each of my three model candidates so I could see differences at a glance.
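
My comparison rig lived in ComfyUI, but the same idea works anywhere you can lock the seed. Here's a sketch using diffusers; the checkpoint paths, prompt, and settings are placeholders, not my actual test setup.

```python
# Sketch of an apples-to-apples comparison: same prompt, seed, CFG, steps,
# and resolution for every candidate checkpoint; only the model changes.
import torch
from diffusers import StableDiffusionXLPipeline

candidates = {
    "ultimate": "robmix_ultimate.safetensors",
    "evolution_a": "merge_a.safetensors",
    "evolution_b": "merge_b.safetensors",
}

prompt = "candid photo of a scuba diver surfacing at golden hour"

for name, path in candidates.items():
    pipe = StableDiffusionXLPipeline.from_single_file(path, torch_dtype=torch.float16)
    pipe.to("cuda")
    image = pipe(
        prompt,
        num_inference_steps=30,
        guidance_scale=6.0,
        width=896, height=1152,  # portrait orientation
        generator=torch.Generator("cuda").manual_seed(42),  # same seed every time
    ).images[0]
    image.save(f"compare_{name}.png")
    del pipe
    torch.cuda.empty_cache()
```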

I had also only used a couple of aspect ratios while building the merge (3:4 portrait and 16:9 widescreen), so I wanted to test the model across more extreme aspect ratios to make sure it wouldn't run into issues like doubling up the subject or creating incoherent scenes.

When I tested my initial merge, I created a ton of images with the most random subjects I could think of, including running some prompts from Midjourney’s explore page that matched the kinds of images I hoped my model would be good at. I also tried a wide range of aspect ratios, from 21:9 to 9:21.

Some of the subject/style points that I wanted to hit included:

  • Macro/still life photography
  • Nature photography
  • Complex subject/style combinations (e.g. “claymation Dolly Parton”)
  • Images with text
  • Camera angles/focal lengths
  • Known figures (e.g. “Abraham Lincoln”) vs. generic subjects (e.g. “a scuba diver”)

I dumped these into my PureRef board to compare my base model to two contenders for the final merge — Merge A and Merge B.

The final decision came down to the same qualities I set out to improve: subject, tone and contrast, detail, and coherence.

As you can see above, some of the details were really subtle. But if you look closely at things like the snorkel and the squares on the chess board above, you’ll see how Evolution B won out over my base model (Ultimate) and the other merge candidate.

The prompt for this image mentioned jumping in the air and pyrotechnics. Evolution A didn’t include the pyrotechnics, and I thought Evolution B was a more expressive pose.

A lot of this comes down to preference, too. It’s my merge, and I’m the one that is going to use it most, so I can make it look however I want to.

Final Tips

As I went through this process, I learned a lot about how merges work and noted some tips that would help next time.

  1. Consistency matters. Be sure to lock in every variable except the one you're testing: seed, CFG, prompt, and so on. Don't be tempted to adjust a prompt to suit a model. You want to see how the models compare.
  2. Think systematically. Don’t just charge in and adjust things randomly (unless that’s how you roll).
  3. Curate your images. Using PureRef was clutch for this process because I could copy and paste generated images straight into a canvas where I could organize and arrange them.
  4. Document everything. You won’t remember every detail of every setting of every image you create.
  5. Be patient. Creating a good merge takes a while. Enjoy the process.

Conclusion: go forth and experiment

All in, this project probably took me about ten hours from start to finish. I generated a ton of images (I didn’t save them, so I can’t give you a total count, but it was probably close to a thousand).

Even if you never do anything with this, it should help you become a more deliberate creator when it comes to A.I. imagery, and it should give you a sense of appreciation for the hard work that goes into creating the great models that you get to use.

If you want to learn more about A.I. image generation, and you’d like to hone your skills in a community of like-minded peers, check out my A.I. Collaborative and join me on Discord.

The A.I. Collaborative is a free community with some incredible paid resources and support — a space that I created to share what I’ve learned about generative A.I. through thousands of hours of experimentation.

--


Written by Rob Laughter

Rob is a creative professional exploring the intersection of technology and creativity. His current muse is generative A.I.
