StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation

Yi Wu · Lingting Zhu · Shengju Qian · Lei Liu · Wandi Qiao · Lequan Yu · Bin Li

StyleAR is a framework that enables multimodal autoregressive models to perform style-aligned text-to-image generation.

🔥 Release

  • [2025/06/06] 🎉 We release the inference code and checkpoints of StyleAR integrated with depth control.
  • [2025/05/27] 🎉 We release the inference code and checkpoints.
  • [2025/05/27] 🎉 We release the technical report.

🧰 Models

| Base Model | Task Type | Resolution | Checkpoint |
| --- | --- | --- | --- |
| Lumina-mGPT | Reference Style | 768x768 | Hugging Face |
| Lumina-mGPT-Omni | Reference Style with Depth Condition | 768x768 | Hugging Face |

⚙️ Setup

The code builds on the Lumina-mGPT implementation, so the setup procedure is the same.

1. Basic Setup

# Create a new conda environment named 'stylear' with Python 3.10
conda create -n stylear python=3.10 -y
# Activate the 'stylear' environment
conda activate stylear
# Install required packages from 'requirements.txt'
pip install -r requirements.txt

2. Install Flash-Attention

pip install flash-attn --no-build-isolation

3. Install xllmx as Python Package

The xllmx module is a lightweight engine designed to support the training and inference of LLM-centered Any2Any models. It evolved from LLaMA2-Accessory and has undergone comprehensive improvements for higher efficiency and broader functionality, including support for flexible arrangement and processing of interleaved media and text.

The Lumina-mGPT implementation relies heavily on xllmx and requires it to be installed as a Python package (so that import xllmx works anywhere on your machine, regardless of the working directory). The installation process is as follows:

# go to the root path of the project and install as package
pip install -e .
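
A quick way to confirm the package installation is to import xllmx from outside the repository. This is only a sanity check, not part of the official setup:

# run from any directory outside the StyleAR repo; should print the installed package location
python -c "import xllmx; print(xllmx.__file__)"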

4. Model Preparation

Since the current Chameleon implementation in transformers does not include the VQ-VAE decoder, please manually download the original VQ-VAE weights provided by Meta and place them in the following directory (a quick file check is sketched after the layout below):

StyleAR
- lumina_mgpt/
    - ckpts/
        - chameleon/
            - tokenizer/
                - text_tokenizer.json
                - vqgan.yaml
                - vqgan.ckpt
- xllmx/
- ...
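
As a sanity check (assuming you run it from the project root), you can verify that the three tokenizer files listed above are in place:

# list the expected Chameleon tokenizer files; an error here means a file is missing
ls lumina_mgpt/ckpts/chameleon/tokenizer/text_tokenizer.json \
   lumina_mgpt/ckpts/chameleon/tokenizer/vqgan.yaml \
   lumina_mgpt/ckpts/chameleon/tokenizer/vqgan.ckpt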

Then download the StyleAR models from Hugging Face and place them in the 'stylear_models' directory (a download sketch follows the layout below):

StyleAR
- lumina_mgpt/
    - ckpts/
    - stylear_models/
        - image_proj.pth
        - model.safetensors
- xllmx/
- ...
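
One way to fetch the checkpoints is with the Hugging Face CLI (installed via pip install -U huggingface_hub). The repository ID below is a placeholder, not the actual repo name; substitute the StyleAR repo linked in the Models table, and run the command from the project root:

# download the StyleAR checkpoints into lumina_mgpt/stylear_models
# <stylear_repo_id> is a placeholder -- use the Hugging Face repo from the Models table above
huggingface-cli download <stylear_repo_id> --local-dir lumina_mgpt/stylear_models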

💫 Inference

python inference.py --params_path "{stylear_params_path}" --style "{reference_style_image}" --prompt "{prompt}" --save_path "{save_path}"
# samples
python inference.py --params_path stylear_models --style ../test_images/doll.png --prompt "a ship" --noise_strength 0.3 --save_path output
python inference.py --params_path stylear_models --style ../test_images/airplane.png --prompt "a train" --noise_strength 0.3 --save_path output
python inference.py --params_path stylear_models --style ../test_images/owl.png --prompt "a dog" --noise_strength 0.1 --save_path output
# samples with depth condition
python inference_MultiCondition.py --params_path stylear_models/Depth_condition --style ../test_images/flower.png --prompt "a shoes" --noise_strength 0.1 --multi_condition ../test_images/shoes_depth.png
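
To sweep several reference styles with the same prompt, a small shell loop over the documented flags works. This is only a sketch reusing the sample arguments above, not an official script; run it from the same directory as the commands above, and replace the prompt with your own:

# hypothetical batch run over the provided test style images
for style in ../test_images/doll.png ../test_images/airplane.png ../test_images/owl.png; do
    python inference.py --params_path stylear_models --style "$style" --prompt "a cat" --noise_strength 0.3 --save_path output
done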

🔆 Demos

1. Comparison with Previous Works

2. Integration with Depth Condition

3. Integration with Segmentation Map Condition

📝 To-Do List

  • Style-driven Text-to-Image Generation Inference Code & Checkpoints
  • Technical Report
  • StyleAR Integration with Depth Control Inference Code & Checkpoints
  • StyleAR Integration with Segmentation Map Control Inference Code & Checkpoints
  • StyleAR Based on Lumina-mGPT 2.0

Citation

If you find our work useful, please kindly cite as:

@article{wu2025stylear,
  title={StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation},
  author={Wu, Yi and Zhu, Lingting and Qian, Shengju and Liu, Lei and Qiao, Wandi and Yu, Lequan and Li, Bin},
  journal={arXiv preprint arXiv:2505.19874},
  year={2025}
}

Related Links

If you are interested in Personalized Image Generation with AR Models, we also recommend checking out our related work:
