pip install -r requirements.txt
pip install -Ue .
python setup.py install
For RISC:
Download and unzip verification_image.zip
Run bash build.sh
Update the file-sharing locations in Docker's settings.json file (or just use the Docker Desktop UI) to include the path to this folder.
For ARM:
docker run -ti ubuntu /bin/bash
apt-get update
apt-get install environment-modules
apt-get install libc6-dev
apt-get install libgmp3-dev
apt-get install wget
wget -O arm-compiler-for-linux_23.04.1_Ubuntu-22.04_aarch64.tar 'https://developer.arm.com/-/media/Files/downloads/hpc/arm-compiler-for-linux/23-04-1/arm-compiler-for-linux_23.04.1_Ubuntu-22.04_aarch64.tar?rev=73781ef890034505a60cd9f776298289&revision=73781ef8-9003-4505-a60c-d9f776298289'
tar -xvf arm-compiler-for-linux_23.04.1_Ubuntu-22.04_aarch64.tar
cd arm-compiler-for-linux_23.04.1_Ubuntu-22.04
./arm-compiler-for-linux_23.04.1_Ubuntu-22.04.sh
export MODULEPATH=$MODULEPATH:/opt/arm/modulefiles
export PATH=$PATH:/opt/arm/arm-linux-compiler-23.04.1_Ubuntu-22.04/bin/
apt-get install gcc-aarch64-linux-gnu
apt-get install -y gcc-riscv64-linux-gnu  # optional; we shouldn't need to emulate RISC on the ARM machine.
Set up QEMU
apt-get install qemu-user
apt-get install xz-utils
apt-get install python3-pip
apt-get install python3-sphinx
apt-get install ninja-build
apt-get install libglib2.0-dev
apt-get install libpixman-1-dev
apt-get install flex bison
wget https://download.qemu.org/qemu-8.1.2.tar.xz
tar xvJf qemu-8.1.2.tar.xz
cd qemu-8.1.2
./configure --enable-plugins
make
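Before moving on, it can be worth smoke-testing the setup. Below is a minimal sketch (not part of the repo) that cross-compiles a trivial C program and runs it under the freshly built qemu-user binary. It assumes aarch64-linux-gnu-gcc is on PATH and that the QEMU binaries landed in qemu-8.1.2/build/; adjust the paths if your layout differs.

# smoke_test_toolchain.py -- hypothetical helper, not part of the repo.
import os
import subprocess
import tempfile

C_SOURCE = "int main(void) { return 42; }\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "hello.c")
    binary = os.path.join(tmp, "hello")
    with open(src, "w") as f:
        f.write(C_SOURCE)
    # -static avoids needing the aarch64 sysroot at emulation time.
    subprocess.run(["aarch64-linux-gnu-gcc", "-static", "-o", binary, src], check=True)
    result = subprocess.run(["qemu-8.1.2/build/qemu-aarch64", binary])
    assert result.returncode == 42, "emulated binary did not run as expected"
    print("cross-compile + qemu-user OK")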
cd data_creation
python get_data.py all_programs
cp Makefile all_programs
cd all_programs
make -i -j 16
cd ..
python parse.py all_programs ../data/data.json
cd ..
Separate into train / dev split as desired.
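For example, a minimal split script might look like the sketch below. It assumes data/data.json holds one JSON record per line (adjust the I/O if parse.py emits a single JSON array instead); the 95/5 ratio and seed are arbitrary.

# split_data.py -- a minimal sketch, not part of the repo.
import random

random.seed(0)
with open("data/data.json") as f:
    records = f.readlines()  # one JSON record per line (assumed)

random.shuffle(records)
n_dev = max(1, int(0.05 * len(records)))

with open("data/data_dev.json", "w") as f:
    f.writelines(records[:n_dev])
with open("data/data_train.json", "w") as f:
    f.writelines(records[n_dev:])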
ORIG_MODEL_NAME_OR_PATH=facebook/bart-large
MAX_LENGTH=2048
CHECKPOINTING_STEPS=500
TRAIN_BATCH_SIZE=8
python training/train.py \
--config_name $ORIG_MODEL_NAME_OR_PATH \
--tokenizer_name $ORIG_MODEL_NAME_OR_PATH \
--source_lang SOURCE_LANG --target_lang TARGET_LANG \
--train \
--train_file data/data_train.json \
--num_beams 3 \
--window_overlap 150 \
--validation_file data/data_dev.json \
--max_length $MAX_LENGTH --report_to wandb --with_tracking --checkpointing_steps $CHECKPOINTING_STEPS --clip_gradients \
--per_device_train_batch_size $TRAIN_BATCH_SIZE \
--learning_rate 3e-5 \
--num_train_epochs 3 \
--output_dir OUTPUT_DIR
(Note: this currently does not work; do not use it.)
Note that you will first need to enable passing a custom position_ids parameter in, and to modify the 2D attention mask according to the position ids. This is all to allow the multi-example chunks used in training. Do so as follows:
In .../site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py, around line 609, where self_attention_mask is set, add:
# wherever position_ids restarts at 0, a new packed example begins
recent_posid_restart = (position_ids == 0).nonzero()
for batch_idx, seq_loc in recent_posid_restart:
    # block attention from this example back into earlier ones
    self_attention_mask[batch_idx][seq_loc:, :seq_loc] = 0
In .../site-packages/transformers/trainer.py, around line 766, where self._signature_columns is set, add:
if "position_ids" not in self._signature_columns: self._signature_columns.append("position_ids")
And run as follows:
python training/ft_model.py --model_path ORIG_MODEL_NAME_OR_PATH \
--source_lang SOURCE_LANG --target_lang TARGET_LANG \
--run_name RUN_NAME \
--output_dir OUTPUT_DIR
- Have a trained model (MODEL_NAME_OR_PATH).
- Put the *.c, *.risc.s and *.arm.s files in TEST_DATA_FOLDER.
- Chunk the assembly files according to the same preprocessing procedure as in data_creation/parse.py:
>> python data_creation/parse.py TEST_DATA_FOLDER data/YOUR_INPUT_FILE
- Run Guess & Sketch
python main.py --guess \
--source_lang SOURCE_LANG --target_lang TARGET_LANG \
--data_file data/YOUR_INPUT_FILE \
--predictions_folder TESTSET_FOLDER \
--k 15 \
--model_name_or_path MODEL_NAME_OR_PATH \
--max_length MAX_LENGTH
python main.py --sketch \
--source_lang SOURCE_LANG --target_lang TARGET_LANG \
--predictions_folder TESTSET_FOLDER \
--executions_folder EXECUTION_FOLDER
- Run GPT baselines
python main.py --few_shot \
--source_lang SOURCE_LANG --target_lang TARGET_LANG \
--executions_folder EXECUTION_FOLDER
- Display Guess & Sketch
cd ..
streamlit run streamlit/Guess_and_Sketch.py
main.py: entry point
data_creation/: code for generating and processing data
data/: folder holding train and inference data
training/: code for training model for Guess phase
guess_and_sketch/: code for running guess and sketch system
solver/: code for solver used in Sketch phase
baselines/: code for baseline experiments
streamlit/: code for streamlit visualization of results
Identify alignment with align/get_alignment.py.
If using an enc-dec model:
python align/get_alignment.py --orig_model_path ORIG_MODEL_NAME_OR_PATH \
--model_path MODEL_NAME_OR_PATH \
--is_enc_dec \
--tokenizer_name ORIG_MODEL_NAME_OR_PATH \
--save_folder ALIGNMENT_OUTPUT_DIR \
--max_length MAX_LENGTH \
--source_lang SOURCE_LANG --target_lang TARGET_LANG \
--find_best_reduction
Add the --interactive flag to see identified aligned blocks as you go.
To extract alignment using the maxnorm described in the paper, use the --get_maximal_norm flag.
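As a reference for the numbers below, the "idxmax" readout can be sketched as follows (assumed interface, not the repo's exact code): for each target token, take the source token with the largest cross-attention weight, then measure precision against a gold alignment.

import torch

def idxmax_alignment(attention: torch.Tensor) -> torch.Tensor:
    # attention: (target_len, source_len) cross-attention weights for one
    # layer/head (or an average over heads).
    return attention.argmax(dim=-1)  # aligned source index per target token

def idxmax_precision(attention: torch.Tensor, gold: dict) -> float:
    # gold: target index -> set of acceptable source indices (hypothetical format).
    pred = idxmax_alignment(attention)
    hits = sum(int(pred[t].item() in srcs) for t, srcs in gold.items())
    return hits / max(1, len(gold))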
For enc-dec A -> R, Project Euler set, averaged across all heads of the 10th layer: F1: 15.47; idxmax precision: 82.9.
For enc-dec R -> A, Project Euler set, averaged across all heads of the 10th layer: F1: 8.5; idxmax precision: 49.2.
For CodeLlama A -> R, Project Euler set, averaged across layer 17 heads: idxmax precision: 14.5; layer 17 head 19: 68.4.