Yi Wu · Lingting Zhu · Shengju Qian · Lei Liu · Wandi Qiao · Lequan Yu · Bin Li
StyleAR is a framework that enables multimodal autoregressive models to perform style-aligned text-to-image generation.
- [2025/06/06] 🎉 We release the inference code and checkpoints of StyleAR integrated with depth control.
- [2025/05/27] 🎉 We release the inference code and checkpoints.
- [2025/05/27] 🎉 We release the technical report.
| Base Model | Task Type | Resolution | Checkpoint |
|---|---|---|---|
| Lumina-mGPT | Reference Style | 768x768 | Hugging Face |
| Lumina-mGPT-Omni | Reference Style with Depth Condition | 768x768 | Hugging Face |
The code builds on the Lumina-mGPT implementation, so the setup procedure is the same.
```bash
# Create a new conda environment named 'stylear' with Python 3.10
conda create -n stylear python=3.10 -y

# Activate the 'stylear' environment
conda activate stylear

# Install required packages from 'requirements.txt'
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
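To confirm the environment was built correctly, a quick sanity check such as the following can help (a minimal sketch; it only assumes the packages installed above):

```python
# sanity_check.py -- minimal sketch to verify the core dependencies import cleanly
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")

try:
    import flash_attn  # installed via `pip install flash-attn --no-build-isolation`
    print(f"flash-attn {flash_attn.__version__} OK")
except ImportError as err:
    print(f"flash-attn not importable: {err}")
```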
The xllmx module is a lightweight engine designed to support the training and inference of LLM-centered Any2Any models. It evolved from LLaMA2-Accessory, undergoing comprehensive improvements for higher efficiency and wider functionality, including support for flexible arrangement and processing of interleaved media and text.
The Lumina-mGPT implementation relies heavily on xllmx and requires it to be installed as a Python package (so that `import xllmx` works anywhere on your machine, regardless of the working directory).
The installation process is as follows:
```bash
# go to the root path of the project and install as a package
pip install -e .
```
Since the Chameleon implementation in `transformers` currently does not include the VQ-VAE decoder, please manually download the original VQ-VAE weights provided by Meta and put them in the following directory:
```
StyleAR
- lumina_mgpt/
    - ckpts/
        - chameleon/
            - tokenizer/
                - text_tokenizer.json
                - vqgan.yaml
                - vqgan.ckpt
- xllmx/
- ...
```
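Once the weights are in place, a quick check like the following verifies both the file layout and that xllmx is importable (a minimal sketch; it assumes you run it from the project root):

```python
# check_weights.py -- minimal sketch: verify the Chameleon tokenizer files are in place
from pathlib import Path

import xllmx  # should import from anywhere once `pip install -e .` has run

tokenizer_dir = Path("lumina_mgpt/ckpts/chameleon/tokenizer")
for name in ("text_tokenizer.json", "vqgan.yaml", "vqgan.ckpt"):
    path = tokenizer_dir / name
    print(f"{path}: {'found' if path.exists() else 'MISSING'}")
```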
Then download the StyleAR models from Hugging Face and put them in the 'stylear_models' directory:
```
StyleAR
- lumina_mgpt/
    - ckpts/
        - stylear_models/
            - image_proj.pth
            - model.safetensors
- xllmx/
- ...
```
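If you prefer scripting the download, something along these lines works with huggingface_hub (a sketch; the repo id below is a placeholder for the actual StyleAR repository on Hugging Face):

```python
# download_stylear.py -- sketch using huggingface_hub; the repo id is a placeholder
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<stylear-hf-repo>",  # placeholder: substitute the actual StyleAR repo id
    local_dir="lumina_mgpt/ckpts/stylear_models",
)
```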
```bash
python inference.py --params_path "{stylear_params_path}" --style "{reference_style_image}" --prompt "{prompt}" --save_path "{save_path}"
```

```bash
# samples
python inference.py --params_path stylear_models --style ../test_images/doll.png --prompt "a ship" --noise_strength 0.3 --save_path output
python inference.py --params_path stylear_models --style ../test_images/airplane.png --prompt "a train" --noise_strength 0.3 --save_path output
python inference.py --params_path stylear_models --style ../test_images/owl.png --prompt "a dog" --noise_strength 0.1 --save_path output
```
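To run several style/prompt pairs in one go, a thin wrapper over inference.py can be handy (a minimal sketch; the paths and noise strengths mirror the samples above):

```python
# batch_inference.py -- sketch: run inference.py over several style/prompt pairs
import subprocess

JOBS = [
    ("../test_images/doll.png", "a ship", 0.3),
    ("../test_images/airplane.png", "a train", 0.3),
    ("../test_images/owl.png", "a dog", 0.1),
]

for style, prompt, noise in JOBS:
    subprocess.run(
        ["python", "inference.py",
         "--params_path", "stylear_models",
         "--style", style,
         "--prompt", prompt,
         "--noise_strength", str(noise),
         "--save_path", "output"],
        check=True,  # stop at the first failing run
    )
```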
```bash
# sample with depth condition
python inference_MultiCondition.py --params_path stylear_models/Depth_condition --style ../test_images/flower.png --prompt "a shoe" --noise_strength 0.1 --multi_condition ../test_images/shoes_depth.png
```
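If you do not already have a depth map to pass as --multi_condition, one can be estimated from any reference image, for example with the transformers depth-estimation pipeline (a sketch; the model id and source image below are illustrative choices, not the ones used in our experiments):

```python
# make_depth.py -- sketch: estimate a depth map to use as the --multi_condition input
from PIL import Image
from transformers import pipeline

estimator = pipeline("depth-estimation", model="Intel/dpt-large")  # assumed model choice
image = Image.open("../test_images/shoes.png")  # hypothetical source image
depth = estimator(image)["depth"]  # the pipeline returns a PIL image of per-pixel depth
depth.save("../test_images/shoes_depth.png")
```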
- Style-driven Text-to-Image Generation Inference Code & Checkpoints
- Technical Report
- StyleAR Integration with Depth Control Inference Code & Checkpoints
- StyleAR Integration with Segmentation Map Control Inference Code & Checkpoints
- StyleAR Based on Lumina-mGPT 2.0
If you find our work useful, please cite:
```bibtex
@article{wu2025stylear,
  title={StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation},
  author={Wu, Yi and Zhu, Lingting and Qian, Shengju and Liu, Lei and Qiao, Wandi and Yu, Lequan and Li, Bin},
  journal={arXiv preprint arXiv:2505.19874},
  year={2025}
}
```
If you are interested in Personalized Image Generation with AR Models, we also recommend checking out our related work: