This is an official PyTorch implementation of MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models.
- Inference code and pretrained weights
- Training scripts
We recommend Python 3.9.21 and CUDA 12.2. Build the environment as follows:
git clone https://github.com/kkakkkka/MambaTalk -b main
# [Optional] Create a virtual env
conda create -n mambatalk python==3.9.21
conda activate mambatalk
# Install ffmpeg for media processing and libstdcxx-ng for rendering
conda install -c conda-forge libstdcxx-ng ffmpeg
# Install with pip:
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install 'git+https://github.com/facebookresearch/pytorch3d.git@stable'
pip install https://github.com/Dao-AILab/causal-conv1d/releases/download/v1.4.0/causal_conv1d-1.4.0+cu122torch2.1cxx11abiTRUE-cp39-cp39-linux_x86_64.whl
pip install https://github.com/state-spaces/mamba/releases/download/v2.2.4/mamba_ssm-2.2.4+cu11torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
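To confirm that the CUDA-dependent packages were built correctly, you can run a quick check like the sketch below (not part of the repository; the layer size and tensor shape are arbitrary):

```python
# Optional sanity check: verify torch, causal-conv1d and mamba-ssm are usable.
# The sizes below are arbitrary and only serve to exercise the CUDA kernels.
import torch
from mamba_ssm import Mamba

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())

layer = Mamba(d_model=64).cuda()            # small selective SSM block
x = torch.randn(2, 32, 64, device="cuda")   # (batch, sequence length, d_model)
print("Mamba forward output shape:", layer(x).shape)
```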
pip3 install "huggingface_hub[cli]"
huggingface-cli download --resume-download kkakkkka/MambaTalk --local-dir pretrained
These weights should be organized as follows:
./pretrained/
|-- pretrained_vq
| |-- face.bin
| |-- foot.bin
| |-- hands.bin
| |-- lower_foot.bin
| |-- upper.bin
|-- smplx_models
| |-- smplx/SMPLX_NEUTRAL_2020.npz
|-- test_sequences
|-- mambatalk_100.bin
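A quick way to verify the download is complete is to check that every file listed above exists, for example with a small sketch like this (paths follow the layout above):

```python
# Minimal check that all pretrained weights listed above are in place.
from pathlib import Path

expected = [
    "pretrained_vq/face.bin",
    "pretrained_vq/foot.bin",
    "pretrained_vq/hands.bin",
    "pretrained_vq/lower_foot.bin",
    "pretrained_vq/upper.bin",
    "smplx_models/smplx/SMPLX_NEUTRAL_2020.npz",
    "mambatalk_100.bin",
]
root = Path("./pretrained")
missing = [p for p in expected if not (root / p).exists()]
print("all weights found" if not missing else f"missing: {missing}")
```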
Download the unzipped version of BEAT2 from Hugging Face into <your root>:
git lfs install
git clone https://huggingface.co/datasets/H-Liu1997/BEAT2
After you have downloaded the BEAT2 dataset, run:
bash run_scripts/test.sh
For your own data, organize it as follows:
.
├── smplxflame_30
│ ├── 2_scott_0_1_1.npz
│ ├── 2_scott_0_2_2.npz
├── test.csv
├── textgrid
│ ├── 2_scott_0_1_1.TextGrid
│ ├── 2_scott_0_2_2.TextGrid
├── wave16k
│ ├── 2_scott_0_1_1.wav
│ ├── 2_scott_0_2_2.wav
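The wave16k folder is expected to hold 16 kHz audio, matching BEAT2. If your recordings use a different sample rate, you can convert them with a sketch along these lines (librosa and soundfile are assumed to be available; the file names are placeholders):

```python
# Resample an arbitrary recording to 16 kHz mono for the wave16k folder.
# "my_recording.wav" and the output name are placeholders.
import librosa
import soundfile as sf

audio, sr = librosa.load("my_recording.wav", sr=16000, mono=True)  # resamples on load
sf.write("wave16k/my_recording.wav", audio, 16000)
```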
In test.csv, please list your files as shown below:
id,type
2_scott_0_1_1,test
2_scott_0_2_2,test
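If you have many recordings, you can generate test.csv from the motion files instead of writing it by hand; a minimal sketch (the data root is a placeholder, not an official script):

```python
# Write test.csv by listing every .npz file found in <data_root>/smplxflame_30.
import csv
from pathlib import Path

data_root = Path("./my_data")          # placeholder: wherever your data lives
motion_dir = data_root / "smplxflame_30"

with open(data_root / "test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "type"])
    for npz_file in sorted(motion_dir.glob("*.npz")):
        writer.writerow([npz_file.stem, "test"])
```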
If you want to generate the corresponding TextGrid files from your own speech recordings, we recommend installing the Montreal Forced Aligner (MFA). The aligned TextGrid files are then used as input alongside your audio files.
pip install git+https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner
conda install -c conda-forge kalpy
pip install pgvector
pip install Bio
mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa
# Please put your speech recordings under ./data first
mfa align ./data english_us_arpa english_us_arpa ./data/result
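To spot-check the alignment output, you can open one of the generated TextGrid files, for example with the third-party textgrid package (pip install textgrid; the file name below is a placeholder):

```python
# Inspect an aligned TextGrid produced by MFA (written to ./data/result).
# Assumes the third-party "textgrid" package: pip install textgrid
import textgrid

tg = textgrid.TextGrid.fromFile("./data/result/my_recording.TextGrid")
for tier in tg.tiers:                  # MFA writes "words" and "phones" interval tiers
    print(tier.name, "->", len(tier.intervals), "intervals")
    for iv in tier.intervals[:5]:      # first few (start, end, label) entries
        print(f"  {iv.minTime:.2f}-{iv.maxTime:.2f}  {iv.mark!r}")
```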
With the generated .npz result files, you can visualize the results using the command below:
npy_path="./res_2_scott_0_1_1.npz"
wav_path="./BEAT2/beat_english_v2.0.0/wave16k/2_scott_0_1_1.wav"
save_dir="outputs/render"
xvfb-run -a python render.py --npy_path $npy_path --wav_path $wav_path --save_dir $save_dir
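If you want to see what a generated result file contains before rendering, a minimal sketch that lists the stored arrays (the exact keys depend on the model output):

```python
# List the arrays stored in a generated result file before rendering it.
import numpy as np

result = np.load("./res_2_scott_0_1_1.npz", allow_pickle=True)
for key in result.files:
    print(key, result[key].shape, result[key].dtype)
```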
To train MambaTalk, run:
bash run_scripts/train.sh
To train the VQ-VAEs for the different body parts, run:
python train.py --config ./configs/cnn_vqvae_face_30.yaml
python train.py --config configs/cnn_vqvae_hands_30.yaml
python train.py --config configs/cnn_vqvae_lower_30.yaml
python train.py --config configs/cnn_vqvae_lower_foot_30.yaml
python train.py --config configs/cnn_vqvae_upper_30.yaml
The code is based on EMAGE. We thank the authors for their open-source code and encourage users to cite their work when applicable.
If MambaTalk is useful for your research, please consider citing:
@article{xu2024mambatalk,
title={Mambatalk: Efficient holistic gesture synthesis with selective state space models},
author={Xu, Zunnan and Lin, Yukang and Han, Haonan and Yang, Sicheng and Li, Ronghui and Zhang, Yachao and Li, Xiu},
journal={Advances in Neural Information Processing Systems},
volume={37},
pages={20055--20080},
year={2024}
}