[Compression] Remove legacy compression and decompression pathways #465
base: main
Conversation
Signed-off-by: Kyle Sayers <[email protected]>
I don’t think we want to remove these until the new methods can compress / decompress when starting with a checkpoint that isn’t already in memory.
Signed-off-by: Kyle Sayers <[email protected]>
@dsikka To be clear, are you referring to compressing / decompressing from disk? Is there a remaining use case for this?
Signed-off-by: Kyle Sayers <[email protected]>
Yeah, for anyone who wants to use compressed-tensors independent of the transformers pathway / is using ct as a standalone. I certainly think we can improve these functions, but "from disk" is something we should have some way to support.
@dsikka There has never been a "from disk" compression pathway. In order to load any compressed model, you must use transformers. In terms of the from-disk decompression pathway, there has also never been a "from disk" pathway that doesn't also rely on transformers.
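(For reference, the transformers-dependent loading pathway referred to here looks roughly like the sketch below; the checkpoint id is reused from the example in the next comment, and this snippet is illustrative rather than part of the original discussion.)

from transformers import AutoModelForCausalLM

MODEL_ID = "nm-testing/TinyLlama-1.1B-Chat-v1.0-W8A8_tensor_weight_static_per_tensor_act-e2e"

# transformers reads the compression config from the checkpoint and
# decompresses the weights while loading the model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")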
This is not true:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.utils import dispatch_for_generation
MODEL_ID = "nm-testing/TinyLlama-1.1B-Chat-v1.0-W8A8_tensor_weight_static_per_tensor_act-e2e"
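# from_pretrained only builds the model skeleton; the compressed weights
# are never read through the from_pretrained pathway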
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
from compressed_tensors.compressors import ModelCompressor
compressor = ModelCompressor.from_pretrained(MODEL_ID)
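# decompress reads the compressed weights from the checkpoint on disk and
# loads them into the model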
compressor.decompress(MODEL_ID, model)
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

The compressed weights are decompressed after being read from disk. from_pretrained loads the skeleton for the model; however, the compressed weights are never read through the from_pretrained pathway. This enables decompression without relying on our transformers integration. This also gives us independent compression functionality, such as
While it makes sense for the default pathway to be in-memory compression / decompression, these are useful tools we should still maintain.
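(Illustrative sketch only, not part of the original comment: a standalone compression flow along these lines, assuming ModelCompressor.from_pretrained_model, compress, and update_config keep their current behavior; the save path is a placeholder and "model" is a quantized model already in memory.)

import os
from safetensors.torch import save_file
from compressed_tensors.compressors import ModelCompressor

# Build a compressor from the model's own quantization/sparsity config
compressor = ModelCompressor.from_pretrained_model(model)
# Compress the weights into a state dict without going through transformers
compressed_state_dict = compressor.compress(model)

save_dir = "./compressed-checkpoint"  # placeholder path
os.makedirs(save_dir, exist_ok=True)
save_file(compressed_state_dict, os.path.join(save_dir, "model.safetensors"))
compressor.update_config(save_dir)  # record the compression config in config.json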
I think part of the motivation for this is that
No description provided.