Strip for LoRA modules #3331
base: develop
Conversation
tests/torch/ptq/test_fq_lora.py (outdated)
    ),
    ids=["asym", "sym"],
)
def test_fq_lora_tuning(mode, backup_mode, compression_kwargs, _seed):
Suggest extending this test by calling strip, exporting to OV, and checking similarity for the OV model (you can pre-compute the similarity for the "stripped to float" model or compute it in the test).
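A rough sketch of what that extension could look like; the helper name, example input, and the 0.99 threshold are illustrative assumptions, not part of the PR:

import numpy as np
import openvino as ov
import torch

import nncf


def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.flatten(), b.flatten()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def check_stripped_ov_similarity(tuned_model: torch.nn.Module, example_input: torch.Tensor) -> None:
    # Reference output from the tuned Torch model with FQ + LoRA still in place.
    with torch.no_grad():
        ref_output = tuned_model(example_input).numpy()

    # Strip NNCF auxiliary modules and export the plain model to OpenVINO.
    stripped_model = nncf.strip(tuned_model)
    ov_model = ov.convert_model(stripped_model, example_input=example_input)
    compiled = ov.compile_model(ov_model, device_name="CPU")
    ov_output = compiled(example_input.numpy())[0]

    # Threshold is a placeholder; pre-compute it for the "stripped to float" model.
    assert _cosine_similarity(ref_output, ov_output) > 0.99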
nncf/torch/quantization/strip.py (outdated)
    result_dtype=original_dtype,
)

elif isinstance(quantizer, SymmetricLoraQuantizer):
Ordinary FQ should also be supported here, since some layers can be selected for INT8 (first/last or by mixed precision), and those will be represented by ordinary FQ without LoRA.
You can check the number of u8/u4 constants after export to OV.
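A rough sketch of that check, walking the exported OV graph (the helper name is an assumption):

import openvino as ov


def count_low_precision_constants(ov_model: ov.Model) -> dict[str, int]:
    # Count u8/u4 Constant nodes to see which weights ended up INT8 vs INT4.
    counts = {"u8": 0, "u4": 0}
    for op in ov_model.get_ops():
        if op.get_type_name() != "Constant":
            continue
        if op.get_element_type() == ov.Type.u8:
            counts["u8"] += 1
        elif op.get_element_type() == ov.Type.u4:
            counts["u4"] += 1
    return counts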
nncf/torch/quantization/strip.py (outdated)
original_shape = original_weight.shape
original_eps = torch.finfo(original_dtype).eps

# Quantize-dequantize using universal quantization formula
I'd reference the markdown that defines this "universal" formula; otherwise it may not be clear what you mean here.
It would be helpful to include a note explaining why the weights are not directly quantized. Please mention that this approach is necessary to prevent floating-point errors that can occur due to the different order of operations during quantization when using Torch for tuning and OpenVINO (OV) for inference.
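For context, a generic fake-quantize (quantize-dequantize) step looks roughly like the sketch below; this is a sketch of the general formula, not a copy of the code in strip.py, and the parameter names are assumptions:

import torch


def quant_dequant(weight: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor,
                  level_low: int, level_high: int) -> torch.Tensor:
    # Emulate integer quantization followed by dequantization so that the stored
    # float weight matches what the inference runtime would reconstruct, keeping
    # the order of operations consistent between Torch tuning and OV inference.
    q = torch.clamp(torch.round(weight / scale + zero_point), level_low, level_high)
    return (q - zero_point) * scale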
Force-pushed from 00ba62e to ba00566.
Changes
Reason for changes
Related tickets
Tests
On top of #3322