metaclassing
Batching the tensor load logic netted some performance savings. For before/after results I am using:

python -m cProfile -o stloader_profile4.prof test_inference.py -m /models/exl2/Mistral-Small-22B-8bpw-exl2 -p "Once upon a time," --gpu_split auto

which writes out the .prof file, which can then be inspected with

python3 -c "import pstats; pstats.Stats('stloader_profile4.prof').sort_stats('cumulative').print_stats(30)"

Depending on the underlying hardware and storage, I saw anywhere between 30% and 50% time savings on my inference boxes.

I can't say for certain that this is the right solution, but it worked in my admittedly limited tests. I would appreciate it if someone more familiar with the code could compare results and see whether this is something that might be useful to others.
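As a rough illustration of the kind of batching I'm talking about (a simplified sketch, not the literal code in this PR; the function names and the 64 MB threshold are invented), the idea is to sort tensors by their byte offset in the file and coalesce adjacent ranges into one larger read, instead of issuing a separate seek/read per tensor:

```python
import numpy as np

MAX_BATCH_BYTES = 64 * 1024 * 1024  # assumed coalescing threshold

def load_batched(f, tensor_ranges):
    """tensor_ranges: list of (name, start, end) byte offsets in an open binary file f."""
    out = {}
    ranges = sorted(tensor_ranges, key=lambda r: r[1])
    i = 0
    while i < len(ranges):
        # Grow the batch while the next tensor starts where the last one ended
        # and the combined read stays under the size threshold.
        batch = [ranges[i]]
        j = i + 1
        while (j < len(ranges)
               and ranges[j][1] == batch[-1][2]
               and ranges[j][2] - batch[0][1] <= MAX_BATCH_BYTES):
            batch.append(ranges[j])
            j += 1
        # One seek + one read covers every tensor in the batch.
        f.seek(batch[0][1])
        buf = f.read(batch[-1][2] - batch[0][1])
        base = batch[0][1]
        for name, start, end in batch:
            out[name] = np.frombuffer(buf, dtype=np.uint8,
                                      count=end - start, offset=start - base)
        i = j
    return out
```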
