Commit db715e2 (1 parent 754fe85)

feat: add multiple input image support in Flux Kontext (#11880)

* feat: add multiple input image support in Flux Kontext
* move model to community
* fix linter

2 files changed (+1257, −1)

examples/community/README.md

Lines changed: 46 additions & 1 deletion
@@ -87,6 +87,7 @@
| CogVideoX DDIM Inversion Pipeline | Implementation of DDIM inversion and guided attention-based editing denoising process on CogVideoX. | [CogVideoX DDIM Inversion Pipeline](#cogvideox-ddim-inversion-pipeline) | - | [LittleNyima](https://github.com/LittleNyima) |
| FaithDiff Stable Diffusion XL Pipeline | Implementation of [(CVPR 2025) FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution](https://huggingface.co/papers/2411.18824) - FaithDiff is a faithful image super-resolution method that leverages latent diffusion models by actively adapting the diffusion prior and jointly fine-tuning its components (encoder and diffusion model) with an alignment module to ensure high fidelity and structural consistency. | [FaithDiff Stable Diffusion XL Pipeline](#faithdiff-stable-diffusion-xl-pipeline) | [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/jychen9811/FaithDiff) | [Junyang Chen, Jinshan Pan, Jiangxin Dong, IMAG Lab (adapted by Eliseu Silva)](https://github.com/JyChen9811/FaithDiff) |
| Stable Diffusion 3 InstructPix2Pix Pipeline | Implementation of Stable Diffusion 3 InstructPix2Pix Pipeline | [Stable Diffusion 3 InstructPix2Pix Pipeline](#stable-diffusion-3-instructpix2pix-pipeline) | [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/BleachNick/SD3_UltraEdit_freeform) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/CaptainZZZ/sd3-instructpix2pix) | [Jiayu Zhang](https://github.com/xduzhangjiayu) and [Haozhe Zhao](https://github.com/HaozheZhao) |
| Flux Kontext multiple images | A modified version of the `FluxKontextPipeline` that supports calling Flux Kontext with multiple reference images. | [Flux Kontext multiple images](#flux-kontext-multiple-images) | - | [Net-Mist](https://github.com/Net-Mist) |

To load a custom pipeline you just need to pass the `custom_pipeline` argument to `DiffusionPipeline`, as one of the files in `diffusers/examples/community`. Feel free to send a PR with your own pipelines; we will merge them quickly.
@@ -5479,4 +5480,48 @@ edited_image.save("edited_image.png")
### Note

This model was trained on 512x512 images, so 512x512 inputs give the best results.

For better editing performance, please refer to the powerful model https://huggingface.co/BleachNick/SD3_UltraEdit_freeform and the paper "UltraEdit: Instruction-based Fine-Grained Image Editing at Scale". Many thanks for their contribution!

# Flux Kontext multiple images

This implementation of Flux Kontext allows users to pass multiple reference images. Each image is encoded separately, and the resulting latent vectors are concatenated.

As explained in Section 3 of [the paper](https://arxiv.org/pdf/2506.15742), the model's sequence concatenation mechanism can extend its capabilities to handle multiple reference images. Note, however, that the current version of Flux Kontext was not trained for this use case: in practice, stacking along the first axis does not yield correct results, while stacking along the other two axes appears to work.
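
To make the axis remark concrete, here is a minimal sketch of the sequence-concatenation idea, assuming Flux-style packed latents of shape `(batch, tokens, channels)` and per-token `(index, y, x)` position ids. The shapes, names, and id layout are illustrative assumptions, not the pipeline's actual internals.

```python
import torch

# Hypothetical packed latents: a 64x64 latent grid packed into 32 * 32 = 1024
# tokens of 64 channels each, i.e. (batch, tokens, channels).
latents_a = torch.randn(1, 1024, 64)
latents_b = torch.randn(1, 1024, 64)

# Flux-style 3D position ids, one (index, y, x) triple per token.
ids_a = torch.zeros(1024, 3)
ids_a[:, 1] = torch.arange(32).repeat_interleave(32)  # y coordinate
ids_a[:, 2] = torch.arange(32).repeat(32)             # x coordinate

# Offsetting the second image's ids along y or x ("the other two axes")
# appears to work; offsetting ids_b[:, 0] (the first axis) does not.
ids_b = ids_a.clone()
ids_b[:, 2] += 32  # place the second image to the right of the first

# The sequences are concatenated along the token axis, not the batch axis.
latents = torch.cat([latents_a, latents_b], dim=1)
ids = torch.cat([ids_a, ids_b], dim=0)
print(latents.shape, ids.shape)  # torch.Size([1, 2048, 64]) torch.Size([2048, 3])
```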
## Example Usage

This pipeline loads two reference images and generates a new image based on them.

```python
import torch

from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

# Load the community pipeline that accepts multiple reference images.
pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev",
    torch_dtype=torch.bfloat16,
    custom_pipeline="pipeline_flux_kontext_multiple_images",
)
pipe.to("cuda")

# Two reference images to condition the generation on.
pikachu_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yarn-art-pikachu.png"
).convert("RGB")
cat_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
).convert("RGB")

prompts = [
    "Pikachu and the cat are sitting together at a pizzeria table, enjoying a delicious pizza.",
]
# One tuple of reference images per prompt.
images = pipe(
    multiple_images=[(pikachu_image, cat_image)],
    prompt=prompts,
    guidance_scale=2.5,
    generator=torch.Generator().manual_seed(42),
).images
images[0].save("pizzeria.png")
```
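
The `multiple_images` argument takes a list of tuples of reference images. Presumably the list aligns one-to-one with `prompt` (an assumption inferred from the call above, not verified against the community pipeline's source), so a batched call continuing the example might look like this:

```python
# Hypothetical batched call: one tuple of reference images per prompt.
# This pairing is an assumption; check pipeline_flux_kontext_multiple_images
# in examples/community before relying on it.
images = pipe(
    multiple_images=[
        (pikachu_image, cat_image),
        (cat_image, pikachu_image),
    ],
    prompt=[
        "Pikachu and the cat are sitting together at a pizzeria table, enjoying a delicious pizza.",
        "Pikachu and the cat ride a tandem bicycle through a park.",
    ],
    guidance_scale=2.5,
    generator=torch.Generator().manual_seed(42),
).images
```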
