🙏 How to Convert a FLUX-Dev Checkpoint to an NF4 Model? #1224
-
Hi everyone, I've recently trained a model using the FLUX-Dev checkpoint. A lot of users in the comments are requesting an NF4 version, as it would allow them to generate images on GPUs with lower VRAM. I'm looking for a step-by-step guide on how to convert my FLUX-Dev checkpoint into an NF4 model.
I'm trying to make this model more accessible for users with less powerful GPUs, so any help would be greatly appreciated! Thank you!
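For reference, here is a minimal sketch of what such a conversion can look like with diffusers + bitsandbytes, assuming your checkpoint is available in diffusers format; the paths are placeholders, and UIs like Forge/ComfyUI may expect a different single-file layout.

```python
# Minimal sketch: load the FLUX transformer with bitsandbytes NF4 quantization
# via diffusers and save the quantized weights. Paths are placeholders.
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,  # see the discussion about double quant below
)

transformer = FluxTransformer2DModel.from_pretrained(
    "path/to/your/flux-dev-checkpoint",  # placeholder: a diffusers-format checkpoint
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)

transformer.save_pretrained("path/to/flux-dev-nf4")  # writes the NF4-quantized weights
```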
-
I would say look more into GGUF instead. It produces better quality than NF4 and runs just fine on a 12 GB card. NF4 is kind of like the craiyon version of Flux: while it can produce some pretty good-looking images, the precision issues do crop up here and there and the keeper ratio drops drastically. Besides, NF4 is actually built into the UI; you can run any model at NF4 precision on the fly. There's no convert-and-save tool for it, though. But again, GGUF Q4 is probably the better format to convert it into, because it comes out to about the same download size as an SDXL checkpoint.
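For anyone who wants to try the GGUF route, here is a minimal sketch of running an already-converted Q4_0 file through diffusers, assuming a recent diffusers build with GGUF support; the .gguf path is a placeholder for a file you produced yourself (e.g. with the ComfyUI-GGUF conversion tools).

```python
# Minimal sketch: run a GGUF Q4_0 Flux transformer through diffusers.
# The .gguf path is a placeholder for your own converted checkpoint.
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

transformer = FluxTransformer2DModel.from_single_file(
    "path/to/flux1-dev-Q4_0.gguf",  # placeholder: your GGUF-quantized checkpoint
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # supplies the text encoders, VAE and scheduler
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # helps fit the pipeline on ~12 GB cards

image = pipe("a photo of a cat", num_inference_steps=28).images[0]
image.save("cat.png")
```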
-
I will probably give some convert code later ...

Also, people need to notice that GGUF is a pure compression technique, which means it is smaller but also slower, because it has extra steps to decompress tensors and the computation is still PyTorch (unless someone is crazy enough to port the llama.cpp kernels). BNB (NF4) is a computational acceleration library that makes things faster by replacing PyTorch ops with native low-bit CUDA kernels, so the computation itself is faster.

NF4 and Q4_0 should be very similar; the difference is that Q4_0 has a smaller chunk size and NF4 has more Gaussian-distributed quant values. I do not recommend trusting comparisons of one or two images. I would also like a smaller chunk size in NF4, but it seems that bnb hard-coded some thread numbers, and changing that is non-trivial.

However, Q4_1 and Q4_K are technically guaranteed to be more precise than NF4, but with even more computation overhead, and such overhead may be more costly than simply moving higher-precision weights from CPU to GPU. If that happens, the quant loses its point.
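As a rough illustration of the block-size / quant-value point above (not a statement about end-to-end image quality), here is a small bitsandbytes experiment comparing NF4 against FP4 at the same block size; bitsandbytes does not implement Q4_0, so FP4 merely stands in as a non-Gaussian 4-bit codebook. Requires a CUDA GPU.

```python
# Compare reconstruction error of NF4 vs FP4 on roughly Gaussian "weights".
# NF4's non-uniform quant values are tuned for normally distributed data,
# so it usually shows a lower error here than the FP4 grid at the same block size.
import torch
import bitsandbytes.functional as bnbf

torch.manual_seed(0)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

for quant_type in ("nf4", "fp4"):
    packed, state = bnbf.quantize_4bit(w, blocksize=64, quant_type=quant_type)
    w_hat = bnbf.dequantize_4bit(packed, state)
    err = (w - w_hat).float().pow(2).mean().sqrt().item()
    print(f"{quant_type}: RMS reconstruction error = {err:.5f}")
```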
-
You can save like this now (the red-marked areas in the screenshot can be changed). It will always save without double quant (the method used for nf4-v2); double quant will be removed soon because of quality and performance problems.
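For readers who want to do something similar outside the UI, here is a rough sketch of the same idea at the state-dict level with bitsandbytes, with double quant disabled. This is an assumption-laden illustration, not the exact format the UI writes, and the file names are placeholders; a real loader also needs the full quant-state metadata (original shape, quant map, block size), which is omitted here.

```python
# Rough sketch (NOT the exact on-disk format the UI writes): quantize every 2-D
# weight of a checkpoint to NF4 with bitsandbytes, double quant disabled, and
# save the result to safetensors. Requires a CUDA GPU.
import torch
import bitsandbytes.functional as bnbf
from safetensors.torch import load_file, save_file

state_dict = load_file("flux-dev.safetensors")  # placeholder input path
out = {}
for name, tensor in state_dict.items():
    if tensor.ndim == 2 and tensor.dtype in (torch.float16, torch.bfloat16, torch.float32):
        packed, qstate = bnbf.quantize_4bit(
            tensor.cuda(),
            blocksize=64,
            quant_type="nf4",
            compress_statistics=False,  # "always save without double quant"
        )
        out[name] = packed.cpu()                     # packed 4-bit payload
        out[name + ".absmax"] = qstate.absmax.cpu()  # per-block scales
    else:
        out[name] = tensor  # norms, biases, embeddings stay in high precision

save_file(out, "flux-dev-nf4.safetensors")  # placeholder output path
```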
-
I have converted my dev safetensors file, including the VAE, CLIP and T5 (fp8), to bnb-nf4 (FP16 LoRA); however, I think there's a memory leak bug. During inference, RAM usage jumps to 26 GB, which is more than FP8 and on par with FP16. When using it in Comfy with the BnB node I'm even getting an OOM, which I never got with the FP16 and FP8 versions, so it's not only Forge. The file size is just 12 GB, which appears to be correct. Why does it need 26 GB of RAM then?