StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation

Yi Wu · Lingting Zhu · Shengju Qian · Lei Liu · Wandi Qiao · Lequan Yu · Bin Li

StyleAR is a framework that enables multimodal autoregressive models to perform style-aligned text-to-image generation.

🔥 Release

  • [2025/06/06] 🎉 We release the inference code and checkpoints of StyleAR integrated with depth control.
  • [2025/05/27] 🎉 We release the inference code and checkpoints.
  • [2025/05/27] 🎉 We release the technical report.

🧰 Models

| Base Model | Task Type | Resolution | Checkpoint |
| --- | --- | --- | --- |
| Lumina-mGPT | Reference Style | 768x768 | Hugging Face |
| Lumina-mGPT-Omni | Reference Style with Depth Condition | 768x768 | Hugging Face |

⚙️ Setup

The code builds on the Lumina-mGPT implementation, so the setup procedure is the same.

1. Basic Setup

# Create a new conda environment named 'stylear' with Python 3.10
conda create -n stylear python=3.10 -y
# Activate the 'stylear' environment
conda activate stylear
# Install required packages from 'requirements.txt'
pip install -r requirements.txt

2. Install Flash-Attention

pip install flash-attn --no-build-isolation

3. Install xllmx as Python Package

The xllmx module is a lightweight engine designed to support the training and inference of LLM-centered Any2Any models. It evolved from LLaMA2-Accessory and has undergone comprehensive improvements for higher efficiency and broader functionality, including support for flexible arrangement and processing of interleaved media and text.

The Lumina-mGPT implementation relies heavily on xllmx and requires it to be installed as a Python package (so that import xllmx works anywhere on your machine, regardless of the working directory). The installation process is as follows:

# go to the root path of the project and install as package
pip install -e .
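
A quick way to confirm the package installation is to import xllmx from outside the repository. This is only a sanity check, not part of the official setup:

# run from any directory outside the StyleAR repo; should print the installed package location
python -c "import xllmx; print(xllmx.__file__)"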

4. Model Preparation

Since the current Chameleon implementation in transformers does not include the VQ-VAE decoder, please manually download the original VQ-VAE weights provided by Meta and place them in the following directory (a quick file check is sketched after the layout below):

StyleAR
- lumina_mgpt/
    - ckpts/
        - chameleon/
            - tokenizer/
                - text_tokenizer.json
                - vqgan.yaml
                - vqgan.ckpt
- xllmx/
- ...
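
As a sanity check (assuming you run it from the project root), you can verify that the three tokenizer files listed above are in place:

# list the expected Chameleon tokenizer files; an error here means a file is missing
ls lumina_mgpt/ckpts/chameleon/tokenizer/text_tokenizer.json \
   lumina_mgpt/ckpts/chameleon/tokenizer/vqgan.yaml \
   lumina_mgpt/ckpts/chameleon/tokenizer/vqgan.ckpt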

Then download the StyleAR models from Hugging Face and place them in the 'stylear_models' directory (a download sketch follows the layout below):

StyleAR
- lumina_mgpt/
    - ckpts/
    - stylear_models/
        - image_proj.pth
        - model.safetensors
- xllmx/
- ...
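
One way to fetch the checkpoints is with the Hugging Face CLI (installed via pip install -U huggingface_hub). The repository ID below is a placeholder, not the actual repo name; substitute the StyleAR repo linked in the Models table, and run the command from the project root:

# download the StyleAR checkpoints into lumina_mgpt/stylear_models
# <stylear_repo_id> is a placeholder -- use the Hugging Face repo from the Models table above
huggingface-cli download <stylear_repo_id> --local-dir lumina_mgpt/stylear_models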

💫 Inference

python inference.py --params_path "{stylear_params_path}" --style "{reference_style_image}" --prompt "{prompt}" --save_path "{save_path}"
# samples
python inference.py --params_path stylear_models --style ../test_images/doll.png --prompt "a ship" --noise_strength 0.3 --save_path output
python inference.py --params_path stylear_models --style ../test_images/airplane.png --prompt "a train" --noise_strength 0.3 --save_path output
python inference.py --params_path stylear_models --style ../test_images/owl.png --prompt "a dog" --noise_strength 0.1 --save_path output
# samples with depth condition
python inference_MultiCondition.py --params_path stylear_models/Depth_condition --style ../test_images/flower.png --prompt "a shoes" --noise_strength 0.1 --multi_condition ../test_images/shoes_depth.png
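
To sweep several reference styles with the same prompt, a small shell loop over the documented flags works. This is only a sketch reusing the sample arguments above, not an official script; run it from the same directory as the commands above, and replace the prompt with your own:

# hypothetical batch run over the provided test style images
for style in ../test_images/doll.png ../test_images/airplane.png ../test_images/owl.png; do
    python inference.py --params_path stylear_models --style "$style" --prompt "a cat" --noise_strength 0.3 --save_path output
done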

🔆 Demos

1. Comparison with Previous Works

2. Integration with Depth Condition

3. Integration with Segmentation Map Condition

📝 To-Do List

  • Style-driven Text-to-Image Generation Inference Code & Checkpoints
  • Technical Report
  • StyleAR Integration with Depth Control Inference Code & Checkpoints
  • StyleAR Integration with Segmentation Map Control Inference Code & Checkpoints
  • StyleAR Based on Lumina-mGPT 2.0

Citation

If you find our work useful, please kindly cite as:

@article{wu2025stylear,
  title={StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation},
  author={Wu, Yi and Zhu, Lingting and Qian, Shengju and Liu, Lei and Qiao, Wandi and Yu, Lequan and Li, Bin},
  journal={arXiv preprint arXiv:2505.19874},
  year={2025}
}

Related Links

If you are interested in Personalized Image Generation with AR Models, we also recommend checking out our related work:
