pip install -r requirements.txt
pip install -Ue .
python setup.py install
For RISC:
Download and unzip verification_image.zip
Run bash build.sh
Update the file-sharing locations in Docker's settings.json file (or just use the Docker Desktop UI) to include the path to this folder.
For ARM:
docker run -ti ubuntu /bin/bash
apt-get update
apt-get install environment-modules
apt-get install libc6-dev
apt-get install libgmp3-dev
apt-get install wget
wget -O arm-compiler-for-linux_23.04.1_Ubuntu-22.04_aarch64.tar 'https://developer.arm.com/-/media/Files/downloads/hpc/arm-compiler-for-linux/23-04-1/arm-compiler-for-linux_23.04.1_Ubuntu-22.04_aarch64.tar?rev=73781ef890034505a60cd9f776298289&revision=73781ef8-9003-4505-a60c-d9f776298289'
tar -xvf arm-compiler-for-linux_23.04.1_Ubuntu-22.04_aarch64.tar
cd arm-compiler-for-linux_23.04.1_Ubuntu-22.04
./arm-compiler-for-linux_23.04.1_Ubuntu-22.04.sh
export MODULEPATH=$MODULEPATH:/opt/arm/modulefiles
export PATH=$PATH:/opt/arm/arm-linux-compiler-23.04.1_Ubuntu-22.04/bin/
apt-get install gcc-aarch64-linux-gnu
apt-get install -y gcc-riscv64-linux-gnu  # optional; we shouldn't need to emulate RISC on the ARM machine.
Set up QEMU
apt-get install qemu-user
apt-get install xz-utils
apt-get install python3-pip
apt-get install python3-sphinx
apt-get install ninja-build
apt-get install libglib2.0-dev
apt-get install libpixman-1-dev
apt-get install flex bison
wget https://download.qemu.org/qemu-8.1.2.tar.xz
tar xvJf qemu-8.1.2.tar.xz
cd qemu-8.1.2
./configure --enable-plugins
make
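Before moving on, it can be worth smoke-testing the setup. Below is a minimal sketch (not part of the repo) that cross-compiles a trivial C program and runs it under the freshly built qemu-user binary. It assumes aarch64-linux-gnu-gcc is on PATH and that the QEMU binaries landed in qemu-8.1.2/build/; adjust the paths if your layout differs.

# smoke_test_toolchain.py -- hypothetical helper, not part of the repo.
import os
import subprocess
import tempfile

C_SOURCE = "int main(void) { return 42; }\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "hello.c")
    binary = os.path.join(tmp, "hello")
    with open(src, "w") as f:
        f.write(C_SOURCE)
    # -static avoids needing the aarch64 sysroot at emulation time.
    subprocess.run(["aarch64-linux-gnu-gcc", "-static", "-o", binary, src], check=True)
    result = subprocess.run(["qemu-8.1.2/build/qemu-aarch64", binary])
    assert result.returncode == 42, "emulated binary did not run as expected"
    print("cross-compile + qemu-user OK")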
cd data_creation
python get_data.py all_programs
cp Makefile all_programs
cd all_programs
make -i -j 16
cd ..
python parse.py all_programs ../data/data.json
cd ..
Separate into train / dev split as desired.
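For example, a minimal split script might look like the sketch below. It assumes data/data.json holds one JSON record per line (adjust the I/O if parse.py emits a single JSON array instead); the 95/5 ratio and seed are arbitrary.

# split_data.py -- a minimal sketch, not part of the repo.
import random

random.seed(0)
with open("data/data.json") as f:
    records = f.readlines()  # one JSON record per line (assumed)

random.shuffle(records)
n_dev = max(1, int(0.05 * len(records)))

with open("data/data_dev.json", "w") as f:
    f.writelines(records[:n_dev])
with open("data/data_train.json", "w") as f:
    f.writelines(records[n_dev:])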
ORIG_MODEL_NAME_OR_PATH=facebook/bart-large
MAX_LENGTH=2048
CHECKPOINTING_STEPS=500
TRAIN_BATCH_SIZE=8
python training/train.py \
--config_name $ORIG_MODEL_NAME_OR_PATH \
--tokenizer_name $ORIG_MODEL_NAME_OR_PATH \
--source_lang SOURCE_LANG --target_lang TARGET_LANG \
--train \
--train_file data/data_train.json \
--num_beams 3 \
--window_overlap 150 \
--validation_file data/data_dev.json \
--max_length $MAX_LENGTH --report_to wandb --with_tracking --checkpointing_steps $CHECKPOINTING_STEPS --clip_gradients \
--per_device_train_batch_size $TRAIN_BATCH_SIZE \
--learning_rate 3e-5 \
--num_train_epochs 3 \
--output_dir OUTPUT_DIR
(Note: this currently does not work; do not use it.)
Note that you will first need to enable passing a custom position_ids parameter in, and to modify the 2D attention mask according to the position ids. This is all to allow the multi-example chunks used in training. Do so as follows:
In .../site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py, around line 609, where self_attention_mask is set, add:
# wherever position_ids restarts at 0, a new packed example begins
recent_posid_restart = (position_ids == 0).nonzero()
for batch_idx, seq_loc in recent_posid_restart:
    # block attention from this example back into earlier ones
    self_attention_mask[batch_idx][seq_loc:, :seq_loc] = 0
In .../site-packages/transformers/trainer.py, around line 766, where self._signature_columns is set, add:
if "position_ids" not in self._signature_columns: self._signature_columns.append("position_ids")
And run as follows:
python training/ft_model.py --model_path ORIG_MODEL_NAME_OR_PATH \
--source_lang SOURCE_LANG --target_lang TARGET_LANG \
--run_name RUN_NAME \
--output_dir OUTPUT_DIR
- Have a trained model (MODEL_NAME_OR_PATH).
- Put the *.c, *.risc.s and *.arm.s files in TEST_DATA_FOLDER.
- Chunk the assembly files according to the same preprocessing procedure as in data_creation/parse.py:
>> python data_creation/parse.py TEST_DATA_FOLDER data/YOUR_INPUT_FILE
- Run Guess & Sketch
python main.py --guess \
--source_lang SOURCE_LANG --target_lang TARGET_LANG \
--data_file data/YOUR_INPUT_FILE \
--predictions_folder TESTSET_FOLDER \
--k 15 \
--model_name_or_path MODEL_NAME_OR_PATH \
--max_length MAX_LENGTH
python main.py --sketch \
--source_lang SOURCE_LANG --target_lang TARGET_LANG \
--predictions_folder TESTSET_FOLDER \
--executions_folder EXECUTION_FOLDER
- Run GPT baselines
python main.py --few_shot \
--source_lang SOURCE_LANG --target_lang TARGET_LANG \
--executions_folder EXECUTION_FOLDER
- Display Guess & Sketch
cd ..
streamlit run streamlit/Guess_and_Sketch.py
main.py: entry point
data_creation/: code for generating and processing data
data/: folder holding train and inference data
training/: code for training model for Guess phase
guess_and_sketch/: code for running guess and sketch system
solver/: code for solver used in Sketch phase
baselines/: code for baseline experiments
streamlit/: code for streamlit visualization of results
Identify alignment with align/get_alignment.py.
If using an enc-dec model:
python align/get_alignment.py --orig_model_path ORIG_MODEL_NAME_OR_PATH \
--model_path MODEL_NAME_OR_PATH \
--is_enc_dec \
--tokenizer_name ORIG_MODEL_NAME_OR_PATH \
--save_folder ALIGNMENT_OUTPUT_DIR \
--max_length MAX_LENGTH \
--source_lang SOURCE_LANG --target_lang TARGET_LANG \
--find_best_reduction
Add the --interactive flag to see identified aligned blocks as you go.
To extract alignment using the maxnorm described in the paper, use the --get_maximal_norm flag.
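As a reference for the numbers below, the "idxmax" readout can be sketched as follows (assumed interface, not the repo's exact code): for each target token, take the source token with the largest cross-attention weight, then measure precision against a gold alignment.

import torch

def idxmax_alignment(attention: torch.Tensor) -> torch.Tensor:
    # attention: (target_len, source_len) cross-attention weights for one
    # layer/head (or an average over heads).
    return attention.argmax(dim=-1)  # aligned source index per target token

def idxmax_precision(attention: torch.Tensor, gold: dict) -> float:
    # gold: target index -> set of acceptable source indices (hypothetical format).
    pred = idxmax_alignment(attention)
    hits = sum(int(pred[t].item() in srcs) for t, srcs in gold.items())
    return hits / max(1, len(gold))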
For enc-dec A -> R, Project Euler set, averaged across all heads of the 10th layer: F1: 15.47; idxmax precision: 82.9.
For enc-dec R -> A, Project Euler set, averaged across all heads of the 10th layer: F1: 8.5; idxmax precision: 49.2.
For CodeLlama A -> R, Project Euler set, averaged across layer 17 heads: idxmax precision: 14.5; layer 17 head 19: 68.4.