SDXL — Architecture Notes
Latent Space — What It Is & How Pixels Become Latents
The autoencoder converts images into a compressed, abstract representation called latent space.
Think of it like:
Pixels = raw brushstrokes
Latent space = the “idea” of the image (style, shapes, semantics)
Steps:
Encoder extracts features via convolutions.
Downsamples → compresses spatial dimensions.
Produces a dense, meaningful latent representation (e.g., 64×64×4 for a 512×512 input, or 128×128×4 at SDXL's native 1024×1024 — an 8× spatial downsampling into 4 channels).
The decoder later reverses this — turning the latent back into pixels.
The diffusion model works entirely in this latent space because it’s smaller, structured, and far more semantically meaningful than raw pixel space.
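The shape arithmetic above can be sketched directly. This is a minimal illustration (plain Python, no real VAE) assuming the standard Stable Diffusion factor-8 downsampling and 4 latent channels:

```python
# Sketch: how the VAE's 8x downsampling maps pixel shapes to latent shapes.
# The factor-8 downsampling and 4 latent channels match Stable Diffusion's VAE.

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (c, h, w) of the latent for an RGB image of the given size."""
    assert height % downsample == 0 and width % downsample == 0
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, downsample=8, latent_channels=4):
    pixels = 3 * height * width  # raw RGB values
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return pixels / (c * h * w)

print(latent_shape(1024, 1024))       # (4, 128, 128) at SDXL's native resolution
print(compression_ratio(1024, 1024))  # 48.0 -> 48x fewer values for the UNet
```

That 48× compression is the whole point of latent diffusion: every denoising step runs on ~2% of the data a pixel-space model would have to process.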
UNet Changes — More Parameters & Why That Matters
SDXL bumps the UNet backbone from ~860M to 2.6B parameters.
Why is this good? More parameters → more capacity → more ability to understand complex visual concepts → better fidelity, richer details, more nuanced alignment with text prompts.
Basically: the UNet “thinks” in a higher-dimensional way now.
Transformer Blocks — Why SDXL Doesn’t Use Uniform Distribution
Previous Models:
Used 1,1,1,1 transformer blocks uniformly at the four feature levels (corresponding to progressively downsampled feature maps).
SDXL:
Uses 0 → 2 → 10 transformer blocks across its three feature levels, adding more attention as spatial resolution decreases.
Why does this work better?
High-resolution level (0 blocks): Transformers here are too expensive and don’t add much. Convs handle fine spatial details anyway.
Lowest-resolution level (10 blocks): This is where the model holds the most abstract, global understanding. Transformers shine at semantic relationships → so SDXL pushes heavy attention here.
Mid-level (2 blocks): A balance between local + global understanding.
So SDXL doesn’t “just distribute evenly.” It places attention where it actually matters for global semantic composition.
Efficiency + Correct Allocation = better results.
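The cost argument behind this allocation is easy to verify: self-attention cost grows roughly with the square of the token count, and the token count at each level is just h×w of that level's feature map. A small sketch (level resolutions assume a 128×128 latent halved at each stage):

```python
# Sketch: why SDXL skips attention at the highest-resolution UNet level.
# Self-attention cost scales roughly with (number of tokens)^2, where the
# token count is h*w of the feature map at that level.

def attention_cost(h, w):
    tokens = h * w
    return tokens ** 2  # relative cost of one self-attention block

levels = {0: (128, 128), 1: (64, 64), 2: (32, 32)}
base = attention_cost(*levels[2])
for lvl, (h, w) in levels.items():
    ratio = attention_cost(h, w) // base
    print(f"level {lvl}: {h}x{w} -> {ratio}x the cost of one block at 32x32")
```

One attention block at the full 128×128 latent costs as much as 256 blocks at 32×32 — so spending 10 blocks at the coarsest level is far cheaper than even a single block at the finest one.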
Channel Multipliers — Why Only 1, 2, 4?
Yes, channel multipliers compensate for loss of spatial resolution by increasing channel depth. But SDXL uses only 3 multipliers because:
SDXL removed the lowest (8×) downsampling stage from the UNet.
One fewer stage → one fewer multiplier → simpler, more efficient UNet.
Still captures the necessary abstractions while avoiding the extra computation of another downsampling level.
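Concretely, the multipliers translate to channel widths like this. A minimal sketch, assuming the base width of 320 that earlier Stable Diffusion UNets also use:

```python
# Sketch: channel widths implied by the multipliers. Base width 320 matches
# earlier Stable Diffusion UNets; SDXL keeps it but drops the fourth
# (8x-downsampled) stage, so only three multipliers remain.

BASE_CHANNELS = 320
SDXL_MULTS = (1, 2, 4)     # three stages
SD15_MULTS = (1, 2, 4, 4)  # four stages, including the extra 8x level

def stage_channels(base, mults):
    return [base * m for m in mults]

print(stage_channels(BASE_CHANNELS, SDXL_MULTS))  # [320, 640, 1280]
print(stage_channels(BASE_CHANNELS, SD15_MULTS))  # [320, 640, 1280, 1280]
```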
Two Text Encoders + Pooled Embedding — What’s the Point?
SDXL uses:
CLIP ViT-L
OpenCLIP ViT-bigG
Why?
Each encoder captures different nuances. Concatenating them gives a richer, more robust understanding of the text.
It also uses a pooled text embedding (from ViT-bigG):
This is a single global summary of the entire prompt.
Cross-attention handles token-level details and pooled embedding handles the big picture (style, composition, overall intent).
Together: better semantic grounding for the image.
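The combination works by simple concatenation along the feature axis. A sketch using the dimensions reported for SDXL (CLIP ViT-L: 768-dim tokens; OpenCLIP ViT-bigG: 1280-dim tokens and pooled vector) — plain lists stand in for the real tensors:

```python
# Sketch: combining the two encoders' outputs. Real embeddings come from the
# actual text encoders; zero-filled lists stand in here to show the shapes.

TOKENS = 77  # both encoders pad/truncate prompts to 77 tokens

def fake_tokens(dim):
    return [[0.0] * dim for _ in range(TOKENS)]

clip_l = fake_tokens(768)    # token-level output of CLIP ViT-L
open_g = fake_tokens(1280)   # token-level output of OpenCLIP ViT-bigG
pooled = [0.0] * 1280        # pooled (whole-prompt) vector from ViT-bigG

# Per-token concatenation along the feature axis -> 77 x 2048 context for
# cross-attention; the pooled vector is routed to the timestep embedding.
context = [a + b for a, b in zip(clip_l, open_g)]
print(len(context), len(context[0]))  # 77 2048
```

So cross-attention sees the 2048-dim per-token context, while the pooled 1280-dim vector enters through the same pathway as the timestep — exactly the token-level vs. big-picture split described above.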
Size Conditioning — Why SDXL Tells the UNet the Original Image Size
Older models dealt with mismatched image sizes by:
Discarding small images → losing valuable training data
Upscaling small images → introducing blur artifacts the model ends up learning
SDXL’s fix:
For every training image, it feeds the original height and width into the UNet as conditioning (c_size = (h_original, w_original)). These values are Fourier-encoded → concatenated → added to the timestep embedding.
What does this solve?
Model no longer has to discard small images.
Model knows when an image was originally tiny and doesn’t learn upscaling artifacts as “truth.”
Model can learn different aspect ratios naturally.
In short: this metadata gives the model context it never had before.
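The Fourier encoding step can be sketched in the same spirit as the standard sinusoidal timestep embedding. The per-value dimension of 256 here is an illustrative assumption, not SDXL's exact configuration:

```python
# Sketch: sinusoidal ("Fourier") encoding of the original height and width,
# analogous to a timestep embedding. dim=256 per scalar is an assumption
# chosen for illustration.
import math

def fourier_encode(value, dim=256, max_period=10000.0):
    """Map a scalar to a dim-dimensional vector of sines and cosines."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [value * f for f in freqs]
    return [math.sin(a) for a in args] + [math.cos(a) for a in args]

h_orig, w_orig = 512, 768  # original size of one training image
c_size = fourier_encode(h_orig) + fourier_encode(w_orig)  # concatenated
print(len(c_size))  # 512 values, added to the timestep embedding downstream
```

Because the encoding is a smooth function of (h, w), the model can also interpolate at inference time: asking for a "large original size" steers it toward sharp, detail-rich outputs.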
Improved Autoencoder — Same Architecture, Better Training
Even though the architecture is the same, SDXL trains the autoencoder with:
Batch size 256 (vs 9 previously)
EMA (Exponential Moving Average) weight tracking
Why is this better?
A large batch yields more stable gradients and more robust learning; EMA produces a smoother, more stable weight trajectory, which in turn improves generalization.
Result:
Autoencoder outperforms previous SD models in all reconstruction metrics. This is crucial because LDMs rely on the autoencoder’s latent space for local, high-frequency details.
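The EMA update itself is one line of math: the shipped weights are a running average of the training weights. A minimal sketch with plain floats standing in for parameter tensors:

```python
# Sketch: exponential-moving-average (EMA) weight tracking. Weights are
# plain floats here; in practice this runs over every parameter tensor.

def ema_update(ema_weights, weights, decay=0.9999):
    """Blend the current weights into the EMA copy after each optimizer step."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]

ema = [0.0, 0.0]
for step in range(3):                          # pretend optimizer steps
    current = [1.0, 2.0]                       # weights after this step
    ema = ema_update(ema, current, decay=0.5)  # small decay only for the demo
print(ema)  # smoothly approaches the current weights: [0.875, 1.75]
```

With a realistic decay like 0.9999, the EMA copy lags the noisy step-to-step weights and averages out their jitter — that averaged copy is what gets evaluated and released.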
SDXL Workflow — From Text to Image (First Principles)
Here’s the whole pipeline:
- Text Prompt → Text Embeddings:
Two text encoders process the prompt.
Produce token-level embeddings + a pooled embedding.
This becomes the “instruction manual” for the UNet.
- Start with Random Latent Noise:
Generation begins from pure Gaussian noise in latent space (e.g., a 128×128×4 latent for a 1024×1024 output).
- Denoising Loop (The Core Generation Process):
At every step, the UNet receives the noisy latent along with the text embeddings, the timestep, and the size conditioning. Its job is to predict which noise to remove at that step, subtract it, and gradually make sense out of the chaos.
- Decode:
The denoised latent is passed through the improved autoencoder's decoder, which turns it back into a pixel-space image.
NOTE: SDXL also provides an optional refiner model, applied after the base model, to sharpen fine details and edges.
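The whole pipeline above can be sketched as runnable pseudocode. Every function here is a stub standing in for a real component (text encoders, UNet, noise scheduler, VAE decoder); all names, signatures, and shapes are illustrative, not an actual API:

```python
# Sketch of the SDXL text-to-image loop with stub components. Shapes are toy
# sizes; a real run would use 77x2048 context and 4x128x128 latents.
import random

def encode_prompt(prompt):             # two text encoders -> (context, pooled)
    return ([0.0] * 2048, [0.0] * 1280)

def unet_predict_noise(latent, context, pooled, t, size_cond):
    return [x * 0.1 for x in latent]   # stand-in for the noise prediction

def scheduler_step(latent, noise_pred, t):
    # remove the predicted noise (a real scheduler also rescales)
    return [x - n for x, n in zip(latent, noise_pred)]

def vae_decode(latent):
    return latent                      # stand-in for latent -> pixels

def generate(prompt, steps=4, size_cond=(1024, 1024)):
    context, pooled = encode_prompt(prompt)
    latent = [random.gauss(0, 1) for _ in range(8)]  # toy "random latent"
    for t in reversed(range(steps)):                 # denoising loop
        noise_pred = unet_predict_noise(latent, context, pooled, t, size_cond)
        latent = scheduler_step(latent, noise_pred, t)
    return vae_decode(latent)

image = generate("a photo of an astronaut riding a horse")
print(len(image))  # 8 toy values standing in for real pixels
```

The control flow — encode once, loop the UNet, decode once — is the part that carries over directly to the real pipeline.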