habib's rabbit hole

SDXL Architecture Notes

Latent Space — What It Is & How Pixels Become Latents

The autoencoder converts images into a compressed, abstract representation called latent space.

Think of it like:

Pixels = raw brushstrokes

Latent space = the “idea” of the image (style, shapes, semantics)

Steps:

Encoder extracts features via convolutions.

Downsamples → compresses spatial dimensions.

Produces a dense, meaningful latent representation (e.g., 64×64×4).

The decoder later reverses this — turning the latent back into pixels.

The diffusion model works entirely in this latent space because it’s smaller, structured, and far more semantically meaningful than raw pixel space.
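The shape arithmetic above can be sketched in a few lines; the 8× spatial downsampling factor and 4 latent channels match the standard SD-family autoencoder:

```python
# A minimal shape-arithmetic sketch (not a real encoder): SD-style VAEs
# downsample each spatial dimension by 8 and emit 4 latent channels.

def latent_shape(h, w, down_factor=8, latent_channels=4):
    """Latent spatial size after the encoder's downsampling stages."""
    return (h // down_factor, w // down_factor, latent_channels)

def compression_ratio(h, w, channels=3, down_factor=8, latent_channels=4):
    lh, lw, lc = latent_shape(h, w, down_factor, latent_channels)
    return (h * w * channels) / (lh * lw * lc)

print(latent_shape(512, 512))       # (64, 64, 4), as in the example above
print(compression_ratio(512, 512))  # 48.0 -> the UNet sees 48x fewer values
```

So the denoising network never touches 512×512×3 pixels; it works on a tensor ~48× smaller.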

UNet Changes — More Parameters & Why That Matters

SDXL bumps the UNet backbone from ~860M to 2.6B parameters.

Why is this good? More parameters → more capacity → more ability to understand complex visual concepts → better fidelity, richer details, more nuanced alignment with text prompts.

Basically: the UNet “thinks” in a higher-dimensional way now.

Transformer Blocks — Why SDXL Doesn’t Use Uniform Distribution

Previous Models:

Used 1,1,1,1 transformer blocks uniformly at the four feature levels (corresponding to progressively downsampled feature maps).

SDXL:

Uses 0 → 2 → 10 transformer blocks at the successively lower-resolution feature levels.

Why does this work better?

High-resolution level (0 blocks): Transformers here are too expensive and don’t add much. Convs handle fine spatial details anyway.

Lowest-resolution level (10 blocks): This is where the model holds the most abstract, global understanding. Transformers shine at semantic relationships → so SDXL pushes heavy attention here.

Mid-level (2 blocks): A balance between local + global understanding.

So SDXL doesn’t “just distribute evenly.” It places attention where it actually matters for global semantic composition.

Efficiency + Correct Allocation = better results.
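The cost argument is easy to see numerically. Self-attention cost grows with the square of the token count; the token counts below assume a 128×128 latent (a 1024×1024 image) and SDXL's three feature levels, with the block counts from the section above:

```python
# Rough cost intuition: per-block self-attention cost ~ (token count)^2.
levels = [
    ("level 0", 128 * 128, 0),   # highest resolution: no transformer blocks
    ("level 1", 64 * 64, 2),
    ("level 2", 32 * 32, 10),    # lowest resolution: heavy attention is cheap
]

for name, tokens, blocks in levels:
    print(f"{name}: {tokens:6d} tokens, {blocks:2d} blocks, "
          f"per-block attention cost ~ {tokens ** 2:,}")
```

One attention block at the full-resolution level would cost roughly 256× one at the lowest level, which is why SDXL spends its blocks at the bottom.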

Channel Multipliers — Why Only 1, 2, 4?

Yes, channel multipliers compensate for loss of spatial resolution by increasing channel depth. But SDXL uses only 3 multipliers because:

SDXL removed the lowest (8×) downsampling stage from the UNet.

One fewer stage → one fewer multiplier → simpler, more efficient UNet.

Still captures the necessary abstractions while avoiding the extra computation of another downsampling level.
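As a quick sanity check, assuming the standard base channel width of 320 used in SD-family UNet configs (an assumption worth verifying against the actual config file), the three multipliers give:

```python
# Channel width per stage = base width x multiplier (base of 320 is assumed).
base_channels = 320
multipliers = [1, 2, 4]          # SDXL: three stages, no 8x-downsampled stage
widths = [base_channels * m for m in multipliers]
print(widths)  # [320, 640, 1280]
```

Each halving of spatial resolution is paid back with a doubling (then quadrupling) of channel depth.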

Two Text Encoders + Pooled Embedding — What’s the Point?

SDXL uses:

Two text encoders: CLIP ViT-L and OpenCLIP ViT-bigG, with their per-token outputs concatenated for cross-attention.

A pooled text embedding (from the OpenCLIP model) as an additional global conditioning signal.

Why?

Cross-attention handles token-level details, while the pooled embedding carries the big picture (style, composition, overall intent).

Together: better semantic grounding for the image.
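A shape-level sketch of how the two encoders' outputs get combined (the 77-token length and the 768/1280 feature widths match CLIP ViT-L and OpenCLIP ViT-bigG; the random values and the mean-pooling stand-in are purely illustrative):

```python
import numpy as np

tokens = 77
clip_l = np.random.randn(tokens, 768)    # per-token features, CLIP ViT-L
clip_g = np.random.randn(tokens, 1280)   # per-token features, OpenCLIP bigG

# Cross-attention context: concatenate along the feature axis
context = np.concatenate([clip_l, clip_g], axis=-1)   # (77, 2048)

# Pooled embedding: one vector summarizing the whole prompt (real SDXL takes
# the OpenCLIP pooled output; a mean over tokens is just a stand-in here)
pooled = clip_g.mean(axis=0)                          # (1280,)

print(context.shape, pooled.shape)
```

The UNet cross-attends to `context` at every block, while `pooled` is folded into the global conditioning alongside the timestep embedding.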

Size Conditioning — Why SDXL Tells the UNet the Original Image Size

Older models dealt with mismatched image sizes by either discarding training images below a minimum resolution (throwing away data) or upscaling them (baking upscaling artifacts into the model).

SDXL’s fix:

For every training image, it feeds the original height and width into the UNet as conditioning (c_size = (h_original, w_original)). These values are Fourier-encoded → concatenated → added to the timestep embedding.

What does this solve?

The model no longer has to throw away small training images or learn from upscaling artifacts; instead, it learns how apparent resolution affects the look of an image. At inference time you can then condition on a desired size to steer the output toward a sharp, high-resolution appearance.

In short: this metadata gives the model context it never had before.
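A minimal sketch of such a sinusoidal ("Fourier") encoding, in the style of timestep embeddings; the embedding dimension and frequency schedule here are assumptions, not SDXL's exact numbers:

```python
import math

def fourier_embed(x, dim=8, max_period=10000.0):
    """Encode a scalar as interleaved sin/cos features at log-spaced freqs."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [fn(x * f) for f in freqs for fn in (math.sin, math.cos)]

h_orig, w_orig = 768, 1024                              # original image size
c_size = fourier_embed(h_orig) + fourier_embed(w_orig)  # concatenated
print(len(c_size))  # 16 values, to be added onto the timestep embedding
```

The point of the sinusoidal form is that nearby sizes get nearby embeddings, so the conditioning is smooth rather than a lookup of discrete buckets.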

Improved Autoencoder — Same Architecture, Better Training

Even though the architecture is the same, SDXL trains the autoencoder with:

A much larger batch size (256 vs. 9).

An exponential moving average (EMA) of the weights.

Why is this better?

A large batch gives more stable gradients and more robust learning, while EMA produces a smoother, more stable weight trajectory, which results in better generalization.

Result:

Autoencoder outperforms previous SD models in all reconstruction metrics. This is crucial because LDMs rely on the autoencoder’s latent space for local, high-frequency details.
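The EMA trick itself is simple; here is a generic sketch of the update rule (the general training technique the section refers to, not SDXL's actual training code):

```python
# EMA: shadow = decay * shadow + (1 - decay) * current, applied each step.
def ema_update(shadow, current, decay=0.9999):
    return [decay * s + (1 - decay) * c for s, c in zip(shadow, current)]

shadow = [0.0, 0.0]
current = [1.0, 2.0]             # pretend these are freshly trained weights
for _ in range(1000):
    shadow = ema_update(shadow, current)
print(shadow)  # drifts slowly and smoothly toward the current weights
```

Because each step moves the shadow weights only a tiny fraction of the way, noisy per-step updates average out, and the EMA copy is what gets shipped for inference.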

SDXL Workflow — From Text to Image (First Principles)

Here’s the whole pipeline:

  1. Text Prompt → Text Embeddings:

This becomes the “instruction manual” for the UNet.

  2. Start with Random Latent Noise:

A random Gaussian latent (e.g., 128×128×4 for a 1024×1024 output) is sampled as the starting point for denoising.

  3. Denoising Loop (The Core Generation Process):

At every step the UNet receives the noisy latent plus the text embeddings, the timestep, and the size conditioning. Its job is to predict the noise to remove, subtract it, and gradually make sense out of the chaos.

  4. Decode:

The clean latent is passed through the improved autoencoder's decoder, which turns it back into pixels.

NOTE: SDXL also offers an optional refiner model that runs additional denoising on the output to sharpen fine details and edges.
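The four steps above can be sketched as a toy, pure-Python loop; `toy_unet` is a hypothetical stand-in that "predicts" a fixed fraction of the latent as noise, just to show the loop's structure:

```python
import random

def toy_unet(latent, t, context):
    # Stand-in for the real UNet: "predicts" 10% of the latent as noise.
    return [x * 0.1 for x in latent]

def denoise(latent, steps, context):
    for t in reversed(range(steps)):
        noise = toy_unet(latent, t, context)              # 3) predict noise
        latent = [x - n for x, n in zip(latent, noise)]   #    and remove it
    return latent

random.seed(0)
latent = [random.gauss(0, 1) for _ in range(4)]           # 2) random start
clean = denoise(latent, steps=10, context="text embeddings")  # 1) conditioning
print(clean)  # 4) in real SDXL this latent would go to the VAE decoder
```

The real loop differs in every detail (the UNet, the noise schedule, the sampler), but the shape is the same: conditioning in, noise predicted and removed step by step, latent decoded at the end.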