This document outlines the technical implementation of fine-tuning Stable Diffusion 1.4 using Low-Rank Adaptation (LoRA). It provides a detailed guide for beginners to understand and contribute to the project.
```
sd-lora-finetuning/
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── LICENSE
├── README.md
├── requirements.txt
├── src/ (Example implementation)
│   ├── Dataset/
│   │   ├── ImageCaptions/
│   │   │   └── example1.txt
│   │   └── Images/
│   │       └── example1.png
│   ├── dataset.py
│   ├── generate.py
│   ├── lora.py
│   ├── main.py
│   ├── scraping.py
│   ├── train.py
│   └── utils.py
└── CONTRIBUTIONS/
    └── Example1/
        ├── Dataset/
        │   ├── ImageCaptions/
        │   │   └── example1.txt
        │   └── Images/
        │       └── example1.png
        └── src/
            ├── dataset.py
            ├── generate.py
            ├── lora.py
            ├── main.py
            ├── scraping.py
            ├── train.py
            └── utils.py
```
- `src/`: Contains the example implementation (refer to this for your contribution)
- `CONTRIBUTIONS/`: Directory where participants should add their implementations
- `CONTRIBUTING.md` and `CODE_OF_CONDUCT.md`: Guidelines and help regarding contributing (MUST READ!)
- Other files in the root directory are for project documentation and setup
LoRA is implemented as follows:

a) `LoRALayer` class:

```python
import math

import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=1):
        super().__init__()
        # Low-rank factors: A projects down to `rank`, B projects back up
        self.lora_A = nn.Parameter(torch.zeros((rank, in_features)))
        self.lora_B = nn.Parameter(torch.zeros((out_features, rank)))
        self.scale = alpha / rank
        # A gets a random init; B starts at zero, so the LoRA delta is zero initially
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x):
        return (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```

b) `apply_lora_to_model` function:
```python
def apply_lora_to_model(model, rank=4, alpha=1):
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            lora_layer = LoRALayer(module.in_features, module.out_features, rank, alpha)
            # Attach the LoRA weights to the base layer; the forward pass must
            # add `module.lora(x)` to the base output for this to take effect
            setattr(module, 'lora', lora_layer)
    return model
```

Key concept: LoRA adds trainable low-rank matrices to existing layers, allowing for efficient fine-tuning.
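Note that `apply_lora_to_model` only attaches the low-rank weights; something still has to add the LoRA output to the base layer's output at call time. Below is a minimal sketch of one way to wire that up with forward hooks. This is an illustration, not necessarily how the repo's `main.py` does it:

```python
def add_lora_forward_hooks(model):
    for module in model.modules():
        if isinstance(module, nn.Linear) and hasattr(module, 'lora'):
            # Compute h = Wx + b + scale * B(Ax) by adding the LoRA delta
            # after the base layer has produced its output
            module.register_forward_hook(
                lambda mod, inp, out: out + mod.lora(inp[0])
            )
    return model

# Toy example: one linear layer; at init the LoRA delta is zero (B starts at 0)
model = add_lora_forward_hooks(apply_lora_to_model(nn.Sequential(nn.Linear(16, 8))))
y = model(torch.randn(2, 16))
```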
The `CustomDataset` class:

```python
import os

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class CustomDataset(Dataset):
    def __init__(self, img_dir, caption_dir=None, transform=None):
        self.img_dir = img_dir
        self.caption_dir = caption_dir
        # Default preprocessing: resize to SD's 512x512 and normalize to [-1, 1]
        self.transform = transform or transforms.Compose([
            transforms.Resize((512, 512)),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5])
        ])
        self.images = [f for f in os.listdir(img_dir) if f.endswith(('.png', '.jpg', '.jpeg', '.webp'))]

    def __len__(self):
        # Required by DataLoader to know the dataset size
        return len(self.images)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.images[idx])
        image = Image.open(img_path).convert('RGB')
        image = self.transform(image)
        if self.caption_dir:
            # The caption file shares the image's base name, with a .txt extension
            caption_path = os.path.join(self.caption_dir, self.images[idx].rsplit('.', 1)[0] + '.txt')
            with open(caption_path, 'r') as f:
                caption = f.read().strip()
        else:
            caption = ""
        return image, caption
```
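A quick usage sketch, assuming the repo's default dataset layout (paths are illustrative):

```python
from torch.utils.data import DataLoader

dataset = CustomDataset(img_dir='src/Dataset/Images',
                        caption_dir='src/Dataset/ImageCaptions')
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
images, captions = next(iter(dataloader))  # images: (4, 3, 512, 512), captions: tuple of str
```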
The `train_loop` function implements the core training logic:

```python
import torch
import torch.nn.functional as F

def train_loop(dataloader, unet, text_encoder, vae, noise_scheduler, optimizer, device, num_epochs):
    for epoch in range(num_epochs):
        for batch in dataloader:
            images, captions = batch
            # Encode images into the VAE's latent space (no gradients needed)
            latents = vae.encode(images.to(device)).latent_dist.sample().detach()
            # Sample random noise and a random timestep per example
            noise = torch.randn_like(latents)
            timesteps = torch.randint(0, noise_scheduler.num_train_timesteps, (latents.shape[0],), device=device)
            noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
            # Text conditioning (captions must be tokenized first; see the note below)
            text_embeddings = text_encoder(captions)[0]
            # Predict the noise and regress it against the true noise
            noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
            loss = F.mse_loss(noise_pred, noise)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}")
```

Key concept: We're training the model to denoise latent representations, conditioned on text embeddings.
The `generate_image` function is a thin wrapper around the diffusion pipeline:

```python
import torch

def generate_image(prompt, pipeline, num_inference_steps=50):
    # Run the full diffusion pipeline without tracking gradients
    with torch.no_grad():
        image = pipeline(prompt, num_inference_steps=num_inference_steps).images[0]
    return image
```
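A usage sketch, assuming the stock SD 1.4 weights from the Hugging Face Hub (after fine-tuning, you would load your own LoRA-augmented components instead):

```python
import torch
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
image = generate_image("a watercolor painting of a fox", pipeline)
image.save("fox.png")
```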
To set up the project:

1. Clone the repository:
   ```bash
   git clone https://github.com/your-username/sd-lora-finetuning.git
   cd sd-lora-finetuning
   ```
2. Create and activate a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```
3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
If you want to scrape images and captions from the web, you can use the `scraping.py` script, which uses the Bing Image Search API to download images and captions for training. A `Dataset` folder should be created in the `src` directory to store the images and captions.
- How To Use:
  - Create a `.env` file in the root directory and add the following: `API_KEY="YOUR_BING_API_KEY"`. You can get the API key from the Azure portal.
  - Replace `"YOUR_BING_API_KEY"` with your API key.
  - Change the `query` in the `scraping.py` file to the desired search query.
  - Change the path in the `scraping.py` file to the desired path where you want to store the images and captions.
  - Run the `scraping.py` file to download the images and captions.
Note: The image caption will be the same as the search query, so make sure the search query is the caption you want the image to have during training.
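The internals of `scraping.py` aren't reproduced here, but a minimal sketch of what such a downloader can look like follows. The endpoint URL and `Ocp-Apim-Subscription-Key` header are the documented Bing Image Search v7 ones; everything else (function name, file naming) is illustrative:

```python
import os
import requests

API_KEY = os.environ["API_KEY"]  # read from the .env described above
ENDPOINT = "https://api.bing.microsoft.com/v7.0/images/search"

def scrape(query, out_dir, count=10):
    os.makedirs(os.path.join(out_dir, "Images"), exist_ok=True)
    os.makedirs(os.path.join(out_dir, "ImageCaptions"), exist_ok=True)
    resp = requests.get(ENDPOINT,
                        headers={"Ocp-Apim-Subscription-Key": API_KEY},
                        params={"q": query, "count": count})
    resp.raise_for_status()
    for i, item in enumerate(resp.json()["value"]):
        img = requests.get(item["contentUrl"], timeout=10)
        with open(os.path.join(out_dir, "Images", f"{query}_{i}.jpg"), "wb") as f:
            f.write(img.content)
        # The caption is the search query itself, as noted above
        with open(os.path.join(out_dir, "ImageCaptions", f"{query}_{i}.txt"), "w") as f:
            f.write(query)
```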
- Fork the repository and clone your fork.
- Create a new folder in the `CONTRIBUTIONS` directory with your username or project name.
- Implement your version of the LoRA fine-tuning following the structure in the `src` directory.
- Ensure you include a `Dataset` folder with example images and captions.
- Create a pull request with your contribution.
Refer to the src directory for an example of how to structure your contribution.
Refer to CONTRIBUTING.md for a detailed overview, if you're a beginner!
LoRA adapts the model by injecting trainable rank decomposition matrices into existing layers:

- For a layer with weight W ∈ R^(d×k), LoRA adds the product BA, where B ∈ R^(d×r) and A ∈ R^(r×k) with rank r ≪ min(d, k)
- The output is computed as: h = Wx + (α/r)·BAx, where α/r is the `scale` factor in `LoRALayer`
- Only A and B are trained, keeping the original weights W frozen (see the worked parameter count below)
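To make the savings concrete, consider a hypothetical 768×768 linear layer (a size typical of SD 1.4's attention projections) with rank r = 4: full fine-tuning would update 768 × 768 = 589,824 weights, while LoRA trains only r·(d + k) = 4 × (768 + 768) = 6,144 parameters, roughly 1% of the original.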
This is implemented in the `LoRALayer` class:

```python
def forward(self, x):
    # x @ A^T maps into the rank-r space; @ B^T maps back to the output dimension
    return (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```

The model is trained to predict the noise added to the latent representation:
- Images are encoded to latent space: z = Encode(x)
- Noise is added according to the diffusion schedule: z_noisy = √(ᾱ_t)·z + √(1−ᾱ_t)·ε
- The model predicts the noise: ε_pred = Model(z_noisy, t, text_embedding)
- Loss is computed: L = MSE(ε_pred, ε)
This is implemented in the training loop:

```python
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
loss = F.mse_loss(noise_pred, noise)
```

By learning to denoise, the model implicitly learns to generate images conditioned on text.
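For reference, the noising step above is exactly what a diffusers scheduler computes; a minimal self-contained check, assuming `DDPMScheduler` (the repo's `noise_scheduler` may be a different class with the same interface):

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
z = torch.randn(1, 4, 64, 64)   # stand-in for a VAE latent
eps = torch.randn_like(z)
t = torch.tensor([500])
# add_noise computes sqrt(alpha_bar_t) * z + sqrt(1 - alpha_bar_t) * eps
z_noisy = scheduler.add_noise(z, eps, t)
```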
Customization and uniqueness are expected from each contributor.
- Feel free to modify `LoRALayer` in `lora.py` to experiment with different LoRA architectures
- Adjust the U-Net adaptation in `main.py` by modifying which layers receive LoRA
- Implement additional training techniques in `train.py` (e.g., gradient clipping, learning rate scheduling); see the sketch after this list
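A minimal, self-contained sketch of the last suggestion, with toy stand-ins for the LoRA parameters and loop sizes (in the project these would come from the real model and dataloader):

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

# Toy stand-ins so the sketch runs on its own
lora_params = [nn.Parameter(torch.randn(4, 16))]
optimizer = torch.optim.AdamW(lora_params, lr=1e-4)
num_epochs, steps_per_epoch = 10, 100
lr_scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs * steps_per_epoch)

for step in range(num_epochs * steps_per_epoch):
    loss = (lora_params[0] ** 2).sum()  # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(lora_params, max_norm=1.0)  # gradient clipping
    optimizer.step()
    lr_scheduler.step()                 # cosine learning-rate decay each step
    optimizer.zero_grad()
```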
- LoRA paper: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
- Stable Diffusion: [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)
- Hugging Face Diffusers: [Documentation](https://huggingface.co/docs/diffusers)