add flux support #356

Merged · leejet merged 8 commits into master · Aug 24, 2024
Conversation

@leejet (Owner) commented Aug 21, 2024

Although the architecture is similar to sd3, flux actually has a lot of additional things to implement, so adding flux support took me a bit longer. After merging this PR, I will take some time to merge the PRs of other contributors.

How to Use

Download weights

Convert flux weights

Using fp16 will lead to overflow, but ggml's support for bf16 is not yet fully developed. Therefore, we need to convert the flux weights to gguf format here, which also saves VRAM. For example:

.\bin\Release\sd.exe -M convert -m ..\..\ComfyUI\models\unet\flux1-dev.sft -o ..\models\flux1-dev-q8_0.gguf -v --type q8_0
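The other quant levels used in the examples below can be produced the same way, by changing the output name and --type (paths are illustrative):

.\bin\Release\sd.exe -M convert -m ..\..\ComfyUI\models\unet\flux1-dev.sft -o ..\models\flux1-dev-q4_0.gguf -v --type q4_0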

Run

  • --cfg-scale is recommended to be set to 1.

Flux-dev q8_0

 .\bin\Release\sd.exe --diffusion-model  ..\models\flux1-dev-q8_0.gguf --vae ..\..\ComfyUI\models\vae\ae.sft --clip_l ..\..\ComfyUI\models\clip\clip_l.safetensors --t5xxl ..\..\ComfyUI\models\clip\t5xxl_fp16.safetensors  -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v

output

Flux-dev q4_0

.\bin\Release\sd.exe --diffusion-model  ..\models\flux1-dev-q4_0.gguf --vae ..\..\ComfyUI\models\vae\ae.sft --clip_l ..\..\ComfyUI\models\clip\clip_l.safetensors --t5xxl ..\..\ComfyUI\models\clip\t5xxl_fp16.safetensors  -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v

output

Flux-dev q3_k

.\bin\Release\sd.exe --diffusion-model  ..\models\flux1-dev-q3_k.gguf --vae ..\..\ComfyUI\models\vae\ae.sft --clip_l ..\..\ComfyUI\models\clip\clip_l.safetensors --t5xxl ..\..\ComfyUI\models\clip\t5xxl_fp16.safetensors  -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v

output

Flux-dev q2_k

.\bin\Release\sd.exe --diffusion-model  ..\models\flux1-dev-q2_k.gguf --vae ..\..\ComfyUI\models\vae\ae.sft --clip_l ..\..\ComfyUI\models\clip\clip_l.safetensors --t5xxl ..\..\ComfyUI\models\clip\t5xxl_fp16.safetensors  -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v

output

Flux-schnell q8_0

 .\bin\Release\sd.exe --diffusion-model  ..\models\flux1-schnell-q8_0.gguf --vae ..\..\ComfyUI\models\vae\ae.sft --clip_l ..\..\ComfyUI\models\clip\clip_l.safetensors --t5xxl ..\..\ComfyUI\models\clip\t5xxl_fp16.safetensors  -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --steps 4

output

Run with LoRA

Since flux LoRA training libraries use a variety of naming formats, not all of them may be supported. It is recommended to use LoRAs whose naming format is compatible with ComfyUI.
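In the example below, the LoRA is referenced directly in the prompt with <lora:name:strength>, and --lora-model-dir points to the directory containing the LoRA file.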

Flux dev q8_0 with LoRA

.\bin\Release\sd.exe --diffusion-model  ..\..\flux-gguf\flux1-dev-q8_0.gguf --vae ..\..\ComfyUI\models\vae\ae.sft --clip_l ..\..\ComfyUI\models\clip\clip_l.safetensors --t5xxl ..\..\ComfyUI\models\clip\t5xxl_fp16.safetensors  -p "a lovely cat holding a sign says 'flux.cpp'<lora:realism_lora_comfy_converted:1>" --cfg-scale 1.0 --sampling-method euler -v --lora-model-dir ../models

output

@leejet mentioned this pull request Aug 21, 2024
@stduhpf (Contributor) commented Aug 21, 2024

Doesn't compile for me. Something about ggml_group_norm expecting 4 arguments in ggml_extend.hpp, but only 3 are passed. If I set the 4th argument to some float it does compile, but I'm not sure if it will work.

EDIT: I set the 4th argument to EPS; the compilation goes fine, but it consistently crashes when loading the flux model.

EDIT 2: I was just being stupid, never mind.

@phudtran (Contributor)

Doesn't compile for me. Something about ggml_group_norm expecting 4 arguments in ggml_extend.hpp, but only 3 are passed. If I set the 4th argument to some float it does compile, but I'm not sure if it will work.

EDIT: I set the 4th argument to EPS; the compilation goes fine, but it consistently crashes when loading the flux model.

Which model did you use and how much VRAM does your GPU have? Could be a memory issue since these models are pretty large.

@stduhpf (Contributor) commented Aug 21, 2024

Doesn't compile for me. Something about ggml_group_norm expecting 4 arguments in ggml_extend.hpp, but only 3 are passed. If I set the 4th argument to some float it does compile, but I'm not sure if it will work.
EDIT: I set the 4th argument to EPS; the compilation goes fine, but it consistently crashes when loading the flux model.

Which model did you use and how much VRAM does your GPU have? Could be a memory issue since these models are pretty large.

I was trying to run it on CPU, with 32 GB of RAM plus lots of swap.

@stduhpf (Contributor) commented Aug 21, 2024

Anyways, I figured out I was just on the wrong commit for the ggml submodule, checking out the correct one fixed the compilation and now it works!

@stduhpf (Contributor) commented Aug 21, 2024

It's almost twice as fast as ComfyUI's implementation of GGUF support for Flux.

On my Ryzen 9 5900X (no GPU) with a q4_1 model (512² resolution):

  • sd.cpp: 52.83 s/it
  • ComfyUI: 97.16 s/it

Great job @leejet !

@SkutteOleg mentioned this pull request Aug 21, 2024
@Green-Sky (Contributor) commented Aug 22, 2024

Looks like conversion on CPU with 32 gigs of RAM + swap is not enough.

[1171961.971637] Out of memory: Killed process 1686847 (sd) total-vm:25179064kB, anon-rss:23787920kB, file-rss:768kB, shmem-rss:0kB, UID:1000 pgtables:46624kB oom_score_adj:0
[INFO ] model.cpp:737  - load models/flux1-schnell.safetensors using safetensors format
[DEBUG] model.cpp:803  - init from 'models/flux1-schnell.safetensors'
[INFO ] model.cpp:1665 - model tensors mem size: 12050.42MB
[DEBUG] model.cpp:1459 - loading tensors from models/flux1-schnell.safetensors
[INFO ] model.cpp:1704 - load tensors done
[INFO ] model.cpp:1705 - trying to save tensors to models/flux1-schnell-q8_0.gguf
Killed

update: added a 32 GB swapfile, conversion works now

@Green-Sky (Contributor) commented Aug 22, 2024

OK, took a stab at it, on CPU only for now.

here q8_0:
flux1-schnell-q8_0-orig

and here q2_k:
flux1-schnell-q2_k-orig

While I am amazed that q2_k works this well, it's obviously not good right now. It's also way slower on CPU, but it only uses 4.5 GB of RAM!

edit: redid q2_k with the other prompt and SIMD (it was running pure scalar before, which was painfully slow)
output

On my system with AVX2, q2_k gives 84.02 s/it and q8_0 gives 84.69 s/it, so it's memory-bottlenecked.

edit 2: q2_k with CUDA on GPU is somehow slightly better
output

@cheeseng

Worked with a CPU-only build, crashed with a core dump when built with CUDA, probably due to the low 4 GB of VRAM on my GTX GPU.

Thanks for the nice work!

@Green-Sky (Contributor)

q3_k looks a lot better than q2_k while still being reasonably sized for my 8 GB of VRAM.

output

@Green-Sky (Contributor)

@leejet it would be nice if sd.cpp supported llama.cpp tensor naming conventions. Since text encoders have exploded in size and now consume a substantial amount of resources, being able to use ggml quantizations would be very useful. So I went and tried loading the q8_0 t5xxl from here https://huggingface.co/city96/t5-v1_1-xxl-encoder-gguf/tree/main, but it does not load.

Looking at the log, the problem becomes obvious quickly:

[DEBUG] model.cpp:1459 - loading tensors from models/flux-extra/t5-v1_1-xxl-encoder-Q8_0.gguf
[INFO ] model.cpp:1605 - unknown tensor 'text_encoders.t5xxl.enc.blk.0.attn_k.weight | q8_0 | 2 [4096, 4096, 1, 1, 1]' in model file
[INFO ] model.cpp:1605 - unknown tensor 'text_encoders.t5xxl.enc.blk.0.attn_o.weight | q8_0 | 2 [4096, 4096, 1, 1, 1]' in model file
[INFO ] model.cpp:1605 - unknown tensor 'text_encoders.t5xxl.enc.blk.0.attn_q.weight | q8_0 | 2 [4096, 4096, 1, 1, 1]' in model file
[INFO ] model.cpp:1605 - unknown tensor 'text_encoders.t5xxl.enc.blk.0.attn_rel_b.weight | f32 | 2 [64, 32, 1, 1, 1]' in model file
[INFO ] model.cpp:1605 - unknown tensor 'text_encoders.t5xxl.enc.blk.0.attn_v.weight | q8_0 | 2 [4096, 4096, 1, 1, 1]' in model file
[INFO ] model.cpp:1605 - unknown tensor 'text_encoders.t5xxl.enc.blk.0.attn_norm.weight | f32 | 1 [4096, 1, 1, 1, 1]' in model file
[INFO ] model.cpp:1605 - unknown tensor 'text_encoders.t5xxl.enc.blk.0.ffn_gate.weight | q8_0 | 2 [4096, 10240, 1, 1, 1]' in model file
[INFO ] model.cpp:1605 - unknown tensor 'text_encoders.t5xxl.enc.blk.0.ffn_up.weight | q8_0 | 2 [4096, 10240, 1, 1, 1]' in model file
[INFO ] model.cpp:1605 - unknown tensor 'text_encoders.t5xxl.enc.blk.0.ffn_down.weight | q8_0 | 2 [10240, 4096, 1, 1, 1]' in model file
[INFO ] model.cpp:1605 - unknown tensor 'text_encoders.t5xxl.enc.blk.0.ffn_norm.weight | f32 | 1 [4096, 1, 1, 1, 1]' in model file
...
[ERROR] model.cpp:1649 - tensor 'text_encoders.t5xxl.encoder.block.0.layer.0.SelfAttention.k.weight' not in model file
[ERROR] model.cpp:1649 - tensor 'text_encoders.t5xxl.encoder.block.0.layer.0.SelfAttention.o.weight' not in model file
[ERROR] model.cpp:1649 - tensor 'text_encoders.t5xxl.encoder.block.0.layer.0.SelfAttention.q.weight' not in model file
[ERROR] model.cpp:1649 - tensor 'text_encoders.t5xxl.encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight' not in model file
[ERROR] model.cpp:1649 - tensor 'text_encoders.t5xxl.encoder.block.0.layer.0.SelfAttention.v.weight' not in model file
[ERROR] model.cpp:1649 - tensor 'text_encoders.t5xxl.encoder.block.0.layer.0.layer_norm.weight' not in model file
[ERROR] model.cpp:1649 - tensor 'text_encoders.t5xxl.encoder.block.0.layer.1.DenseReluDense.wi_0.weight' not in model file
[ERROR] model.cpp:1649 - tensor 'text_encoders.t5xxl.encoder.block.0.layer.1.DenseReluDense.wi_1.weight' not in model file
[ERROR] model.cpp:1649 - tensor 'text_encoders.t5xxl.encoder.block.0.layer.1.DenseReluDense.wo.weight' not in model file
[ERROR] model.cpp:1649 - tensor 'text_encoders.t5xxl.encoder.block.0.layer.1.layer_norm.weight' not in model file
...

I think city96's conversion is using llama.cpp's tensor naming convention.

@stduhpf (Contributor) commented Aug 22, 2024

(quoting Green-Sky's comment above about llama.cpp tensor naming conventions and the t5xxl GGUF that fails to load)

I agree that being able to use llama.cpp quants would be great, though you can always quantize the T5 encoder with stable-diffusion.cpp yourself and get a working GGUF.

@MGTRIDER commented Aug 23, 2024

Something seems to be really wrong with flux rendering on the stable-diffusion.cpp backend. With q4_0 quantization I run out of VRAM and the program crashes. I have 8 GB of VRAM, and in ComfyUI I can render resolutions of 1152x896 without problems at 6 s/it and without crashes; I can even use a q5_0 quantized flux model without crashing. The weird thing is that SD models, including SDXL, render fine and fast with good memory efficiency on stable-diffusion.cpp, so I really wonder why flux acts this way on this backend. Another thing I have noticed is that when the NVIDIA driver tries to send part of the model to shared VRAM, it looks like the clip models get unloaded from RAM, causing the program to just hang at the sampling stage.

@leejet (Owner, Author) commented Aug 23, 2024

(quoting the same Green-Sky comment above about llama.cpp tensor naming and the t5xxl GGUF that fails to load)

You can perform the quantization yourself.
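(For example, the same convert mode shown in the PR description should also work on the text encoder; output name and quant type here are chosen for illustration:

.\bin\Release\sd.exe -M convert -m ..\..\ComfyUI\models\clip\t5xxl_fp16.safetensors -o ..\models\t5xxl_q8_0.gguf -v --type q8_0)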

@Green-Sky (Contributor) commented Aug 23, 2024

You can perform the quantization yourself.

You are right, I tried it the wrong way first.

Here is q3_k flux with q8_0 t5xxl:

output

I cannot spot a difference from the f16 t5xxl, so I recommend this over f16 in any case.

However, it does look like it is not using less memory.

[DEBUG] ggml_extend.hpp:1019 - t5 params backend buffer size = 9083.77 MB(RAM) (219 tensors)
vs f16:
[DEBUG] ggml_extend.hpp:1019 - t5 params backend buffer size = 9083.77 MB(RAM) (219 tensors)

edit: using q4_k for t5xxl does indeed change the result a bit

output

but it is still acceptable.
But also:
[DEBUG] ggml_extend.hpp:1019 - t5 params backend buffer size = 9083.77 MB(RAM) (219 tensors)

@Green-Sky (Contributor) commented Aug 23, 2024

I wanted to test a fine-tune and merge(?) of flux.1 schnell and dev (??), but it contained f8_e4m3 tensors. So I went ahead and wrote an in-place up-convert, similar to how it is done for bf16.

output

Will make a PR in a bit.

edit: PR here: #359. Just as bf16 converts losslessly to fp32, f8_e4m3 converts losslessly to fp16.
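(The up-convert in that PR is implemented in C++ inside sd.cpp; purely as an illustration of the idea, not the PR's code, here is a small PyTorch sketch, assuming a PyTorch build with float8 support and that the tensor data uses the standard e4m3 layout that torch.float8_e4m3fn implements:

import torch

def f8_e4m3_to_f16(raw: bytes, shape):
    # Reinterpret the raw byte buffer as float8_e4m3fn, then widen to fp16.
    # Every finite e4m3 value is exactly representable in fp16, so the
    # up-conversion is lossless, just like bf16 -> fp32.
    t8 = torch.frombuffer(bytearray(raw), dtype=torch.uint8).view(torch.float8_e4m3fn)
    return t8.to(torch.float16).reshape(shape)
)
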
@leejet (Owner, Author) commented Aug 24, 2024

I tried uploading some quantized models to Hugging Face, but no matter which network I use, the upload speed is limited to 3 Mbps.

@leejet (Owner, Author) commented Aug 24, 2024

LoRA support has been added!

@Green-Sky (Contributor)

I tried uploading some quantized models to Hugging Face, but no matter which network I use, the upload speed is limited to 3 Mbps.

Yeah, I also had issues with uploads being canceled all the time... no idea why.
Anyway, some files got through and live here: https://huggingface.co/Green-Sky/flux.1-schnell-GGUF/tree/main

@leejet merged commit 64d231f into master Aug 24, 2024
8 checks passed
@Green-Sky (Contributor)

I also uploaded an f16 conversion of the VAE; it looks almost lossless to me.

@stduhpf (Contributor) commented Aug 24, 2024

I also uploaded an f16 conversion of the VAE; it looks almost lossless to me.

Even a q2_k vae looks good enough.

@grigio commented Aug 24, 2024

Does this also work with AMD ROCm?

@Green-Sky (Contributor)

I also uploaded an f16 conversion of the VAE; it looks almost lossless to me.

Even a q2_k vae looks good enough.

If you look at the file sizes, it blocks anything lower than f16, so you are looking at f16.

@Green-Sky (Contributor)

Does this also work with AMD ROCm?

Not sure if anyone has tried yet, but you can grab a build from here (if you run Windows): https://github.com/leejet/stable-diffusion.cpp/releases/tag/master-64d231f

@JohnClaw

I tried uploading some quantized models to Hugging Face, but no matter which network I use, the upload speed is limited to 3 Mbps.

Yeah, I also had issues with uploads being canceled all the time... no idea why. Anyway, some files got through and live here https://huggingface.co/Green-Sky/flux.1-schnell-GGUF/tree/main

Thank you very much for uploading the flux schnell GGUF. Could you upload clip_l.safetensors or clip_l.gguf for this model, please?

@Green-Sky (Contributor) commented Aug 24, 2024

I tried uploading some quantized models to Hugging Face, but no matter which network I use, the upload speed is limited to 3 Mbps.

Yeah, I also had issues with uploads being canceled all the time... no idea why. Anyway, some files got through and live here https://huggingface.co/Green-Sky/flux.1-schnell-GGUF/tree/main

Thank you very much for uploading the flux schnell GGUF. Could you upload clip_l.safetensors or clip_l.gguf for this model, please?

Sure, I uploaded GGUF f16 (same as the source safetensors) and q8_0.

If you want the safetensors, check the OP for a link.

edit: I am not seeing much of a difference between f16 and q8_0 either.

@JohnClaw commented Aug 24, 2024

(quoting Green-Sky's reply above about the uploaded clip_l GGUF files)

Thanks. Tested it with these command-line parameters: sd.exe --diffusion-model ./models/flux1-schnell-q2_k.gguf --vae ./models/ae-f16.gguf --clip_l ./models/clip_l-f16.gguf --t5xxl ./models/t5xxl_q2_k.gguf -p "a lovely cat holding a sign says 'flux.cpp'" -t 8 --steps 4 --cfg-scale 1.0 --sampling-method euler -v

My system configuration: Ryzen 7 4700U, Vega 7 iGPU, 16 GB RAM, SSD, Windows 11. Image generation took around 520 seconds; each step took 110 seconds. I hope that kobold.cpp will upgrade its stable-diffusion plugin to support flux, because kobold.cpp uses Vulkan acceleration, which makes generation much faster. Are there any plans to add a Vulkan build to the next releases of stable-diffusion.cpp? By the way, I recently downloaded the Amuse Windows app (https://www.amuse-ai.com/) and it generates images very fast because it uses DirectML acceleration, ONNX, and SD-Turbo. A 512x512, 4-step generation takes only 7 seconds! I'm very sad that there is no DirectML acceleration in stable-diffusion.cpp and llama.cpp. Another thing which makes me cry is the fact that the flux ONNX model can't be quantized to fit into my 16 GB of RAM. Or is there something I don't know, and it can be done?
image

@stduhpf (Contributor) commented Aug 24, 2024

(quoting JohnClaw's comment above, asking about Vulkan support)

If you want Vulkan support, take a look at the discussion here: #291
(Especially this part: #291 (comment))

@stduhpf (Contributor) commented Aug 24, 2024

I also uploaded an f16 conversion of the VAE; it looks almost lossless to me.

Even a q2_k vae looks good enough.

If you look at the file sizes, it blocks anything lower than f16, so you are looking at f16.

Interestingly, the file sizes are very close, but still slightly different.

  • f16: 163728 kB
  • q8_0: 163669 kB
  • q2_k: 163630 kB

There's also some very slight artifacting (a bit like JPEG) with the q2_k autoencoder that isn't noticeable with the other quants I tested (q8 and f16):

Comparison

q2_k ae:
q2
q8_0 ae:
q8
f16 ae:
f16
original floats:
full

I'm not sure if saving only a few kilobytes is worth a barely noticeable difference in output. That's a strange dilemma. Quantized is definitely worth it compared to full size, though.

@leejet (Owner, Author) commented Aug 24, 2024

Only a very small number of the ae tensors will be quantized.

@phudtran (Contributor)

Would it be possible to package the flux unet, clip, ae, etc. into a single file and use it, like with the SD models?

@city96 commented Aug 24, 2024

Figured I'd chime in. I've been doing some work over at ComfyUI-GGUF to support flux quantization for image generation.

I've noticed some differences between my version and this version by @Green-Sky higher up in the thread.

The most obvious thing is that the bias weights are quantized. These can be kept in FP32 without adding more than about 40 MB to the final model file, and doing this should increase both quality and speed (since fewer tensors have to be dequantized overall, though this should be relatively fast on small tensors like these).

The second issue I noticed is that there's no logic for keeping the more vital tensors in higher precision, the way llama.cpp does with LLMs. From my short tests, these benefit the most from doing so while only adding ~100 MB:

image

For the text encoder, I've used the default llama.cpp binary to create them, as both the full encoder/decoder and the encoder-only model are supported natively now. Assuming your code can handle mixed quantization, I recommend using this method, since keeping the token_embd and the norms/biases in higher precision makes the effects of quantization a lot less severe.

Mapping the keys back to the original names is fairly straightforward. This is the mapping I ended up with for the replacement:

# llama.cpp (GGUF) name fragment -> original T5 safetensors name fragment
clip_sd_map = {
    "enc.": "encoder.",
    ".blk.": ".block.",
    "token_embd": "shared",
    "output_norm": "final_layer_norm",
    "attn_q": "layer.0.SelfAttention.q",
    "attn_k": "layer.0.SelfAttention.k",
    "attn_v": "layer.0.SelfAttention.v",
    "attn_o": "layer.0.SelfAttention.o",
    "attn_norm": "layer.0.layer_norm",
    "attn_rel_b": "layer.0.SelfAttention.relative_attention_bias",
    "ffn_up": "layer.1.DenseReluDense.wi_1",
    "ffn_down": "layer.1.DenseReluDense.wo",
    "ffn_gate": "layer.1.DenseReluDense.wi_0",
    "ffn_norm": "layer.1.layer_norm",
}

# Rename each key of the loaded GGUF state dict back to the original layout.
for k, v in state_dict.items():
    for s, d in clip_sd_map.items():
        k = k.replace(s, d)
    ...
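(For example, this mapping turns 'enc.blk.0.attn_q.weight' into 'encoder.block.0.layer.0.SelfAttention.q.weight', matching the names sd.cpp expects in the error log further up.)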

Hope this helps!

@leejet (Owner, Author) commented Aug 25, 2024

In my tests, not converting the biases didn't make anything better. Moreover, if I convert the txt_in/img_in layers, I even get worse results.

@Green-Sky (Contributor) commented Aug 25, 2024

Like for sd3, there exists a tiny autoencoder that does not work with sd.cpp yet:
https://huggingface.co/madebyollin/taef1

edit: this is not a priority, since VAE speed has improved since taesd was first implemented in sd.cpp, and the VAE uses little compute compared to the flux diffusion anyway.

@Green-Sky (Contributor)

q2_k: flux1-schnell unet-q2_k ae-f16 clip_l-q8_0 t5xxl-q8_0 1024p
q3_k: flux1-schnell unet-q3_k ae-f16 clip_l-q8_0 t5xxl-q8_0 1024p

flux.1-schnell 1024x1024, 4 steps, using the new q2_k and q3_k variants converted using 5c561ea

also using quants for:

  • ae f16
  • clip_l q8_0
  • t5xxl q8_0

q3_k looks like a real winner here. It looks OK, with small imperfections, while still being very small.

(also, I hate Comic Sans 🙈)

@bssrdf (Contributor) commented Sep 1, 2024

I am seeing "unknown tensor" warnings using 58d5473:

[INFO ] model.cpp:829  - load ..\models\ae.safetensors using safetensors format
[DEBUG] model.cpp:897  - init from '..\models\ae.safetensors'
[INFO ] stable-diffusion.cpp:237  - Version: Flux Dev
[INFO ] stable-diffusion.cpp:268  - Weight type:                 f16
[INFO ] stable-diffusion.cpp:269  - Conditioner weight type:     f16
[INFO ] stable-diffusion.cpp:270  - Diffusion model weight type: q8_0
[INFO ] stable-diffusion.cpp:271  - VAE weight type:             f32
[DEBUG] stable-diffusion.cpp:273  - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:312  - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:315  - CLIP: Using CPU backend
[DEBUG] clip.hpp:171  - vocab size: 49408
[DEBUG] clip.hpp:182  -  trigger word img already in vocab
[DEBUG] ggml_extend.hpp:1050 - clip params backend buffer size =  235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1050 - t5 params backend buffer size =  9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1050 - flux params backend buffer size =  12068.09 MB(VRAM) (780 tensors)
[DEBUG] ggml_extend.hpp:1050 - vae params backend buffer size =  94.57 MB(VRAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:414  - loading weights
[DEBUG] model.cpp:1568 - loading tensors from ..\models\clip_l.safetensors
[DEBUG] model.cpp:1568 - loading tensors from ..\models\t5xxl_fp16.safetensors
[INFO ] model.cpp:1723 - unknown tensor 'text_encoders.t5xxl.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
[DEBUG] model.cpp:1568 - loading tensors from ..\models\flux1-dev-q8_0.gguf
[DEBUG] model.cpp:1568 - loading tensors from ..\models\ae.safetensors
[INFO ] stable-diffusion.cpp:513  - total params memory size = 21481.50MB (VRAM 12162.66MB, RAM 9318.83MB): clip 9318.83MB(RAM), unet 12068.09MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:517  - loading model from '' completed, taking 36.52s
[INFO ] stable-diffusion.cpp:534  - running in Flux FLOW mode
[DEBUG] stable-diffusion.cpp:588  - finished loaded file
[DEBUG] stable-diffusion.cpp:1405 - txt2img 832x1216
[DEBUG] stable-diffusion.cpp:1146 - prompt after extract and remove lora: "model as a navy officer on a ship, colorful, perfect face, natural skin, hard shadows, highly detail"

I am using t5xxl from https://huggingface.co/comfyanonymous/flux_text_encoders/blob/main/t5xxl_fp16.safetensors

@ayttop commented Sep 29, 2024

.\bin\Release\sd.exe
Where is the path?

@CrushDemo01

Hello, I want the quantized versions of T5 and CLIP to also use video memory. Is there any way to achieve this?

@stduhpf (Contributor) commented Oct 28, 2024

Hello, I want the quantized versions of T5 and CLIP to also use video memory. Is there any way to achieve this?

Yeah, sure. Just remove those lines and compile it again.
https://github.com/leejet/stable-diffusion.cpp/blob/master/stable-diffusion.cpp#L317C1-L320C17

@CrushDemo01

Hello, I want the quantized versions of T5 and CLIP to also use video memory. Is there any way to achieve this?

Yeah, sure. Just remove those lines and compile it again. https://github.com/leejet/stable-diffusion.cpp/blob/master/stable-diffusion.cpp#L317C1-L320C17
I commented out the code and recompiled it, but it produces the following errors:

root@ucloud-wlcb-gpu-010:/text2img/stable-diffusion.cpp/build#  ./bin/sd --diffusion-model /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/flux1-schnell-q2_k.gguf --vae /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/ae-f16.gguf --clip_l /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/clip_l-q8_0.gguf --t5xxl /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/t5xxl_q2_k.gguf  -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --steps 4 -o flux_schenll.png
Option: 
    n_threads:         64
    mode:              txt2img
    model_path:        
    wtype:             unspecified
    clip_l_path:       /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/clip_l-q8_0.gguf
    clip_g_path:       
    t5xxl_path:        /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/t5xxl_q2_k.gguf
    diffusion_model_path:   /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/flux1-schnell-q2_k.gguf
    vae_path:          /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/ae-f16.gguf
    taesd_path:        
    esrgan_path:       
    controlnet_path:   
    embeddings_path:   
    stacked_id_embeddings_path:   
    input_id_images_path:   
    style ratio:       20.00
    normalize input image :  false
    output_path:       flux_schenll.png
    init_img:          
    control_image:     
    clip on cpu:       false
    controlnet cpu:    false
    vae decoder on cpu:false
    strength(control): 0.90
    prompt:            a lovely cat holding a sign says 'flux.cpp'
    negative_prompt:   
    min_cfg:           1.00
    cfg_scale:         1.00
    guidance:          3.50
    clip_skip:         -1
    width:             512
    height:            512
    sample_method:     euler
    schedule:          default
    sample_steps:      4
    strength(img2img): 0.75
    rng:               cuda
    seed:              42
    batch_count:       1
    vae_tiling:        false
    upscale_repeats:   1
System Info: 
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 1
    AVX512_VBMI = 1
    AVX512_VNNI = 1
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:159  - Using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
[INFO ] stable-diffusion.cpp:204  - loading clip_l from '/mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/clip_l-q8_0.gguf'
[INFO ] model.cpp:801  - load /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/clip_l-q8_0.gguf using gguf format
[DEBUG] model.cpp:818  - init from '/mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/clip_l-q8_0.gguf'
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_calloc!
[INFO ] stable-diffusion.cpp:218  - loading t5xxl from '/mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/t5xxl_q2_k.gguf'
[INFO ] model.cpp:801  - load /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/t5xxl_q2_k.gguf using gguf format
[DEBUG] model.cpp:818  - init from '/mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/t5xxl_q2_k.gguf'
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_calloc!
[INFO ] stable-diffusion.cpp:225  - loading diffusion model from '/mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/flux1-schnell-q2_k.gguf'
[INFO ] model.cpp:801  - load /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/flux1-schnell-q2_k.gguf using gguf format
[DEBUG] model.cpp:818  - init from '/mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/flux1-schnell-q2_k.gguf'
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_calloc!
[INFO ] stable-diffusion.cpp:232  - loading vae from '/mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/ae-f16.gguf'
[INFO ] model.cpp:801  - load /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/ae-f16.gguf using gguf format
[DEBUG] model.cpp:818  - init from '/mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/ae-f16.gguf'
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_calloc!
[INFO ] stable-diffusion.cpp:244  - Version: Flux Schnell 
[INFO ] stable-diffusion.cpp:275  - Weight type:                 q8_0
[INFO ] stable-diffusion.cpp:276  - Conditioner weight type:     q8_0
[INFO ] stable-diffusion.cpp:277  - Diffusion model weight type: q2_K
[INFO ] stable-diffusion.cpp:278  - VAE weight type:             f16
[DEBUG] stable-diffusion.cpp:280  - ggml tensor size = 400 bytes
[DEBUG] clip.hpp:171  - vocab size: 49408
[DEBUG] clip.hpp:182  -  trigger word img already in vocab
[DEBUG] ggml_extend.hpp:1045 - clip params backend buffer size =  125.22 MB(VRAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1045 - t5 params backend buffer size =  4826.11 MB(VRAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1045 - flux params backend buffer size =  3732.51 MB(VRAM) (776 tensors)
[DEBUG] ggml_extend.hpp:1045 - vae params backend buffer size =  94.57 MB(VRAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:407  - loading weights
[DEBUG] model.cpp:1548 - loading tensors from /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/clip_l-q8_0.gguf
[DEBUG] model.cpp:1548 - loading tensors from /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/t5xxl_q2_k.gguf
[INFO ] model.cpp:1703 - unknown tensor 'text_encoders.t5xxl.transformer.encoder.embed_tokens.weight | q2_K | 2 [4096, 32128, 1, 1, 1]' in model file
[DEBUG] model.cpp:1548 - loading tensors from /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/flux1-schnell-q2_k.gguf
[DEBUG] model.cpp:1548 - loading tensors from /mnt/data/xxxyyyzzz/flux.1-schnell-GGUF/ae-f16.gguf
[INFO ] stable-diffusion.cpp:491  - total params memory size = 8778.42MB (VRAM 8778.42MB, RAM 0.00MB): clip 4951.33MB(VRAM), unet 3732.51MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:510  - loading model from '' completed, taking 45.05s
[INFO ] stable-diffusion.cpp:527  - running in Flux FLOW mode
[DEBUG] stable-diffusion.cpp:581  - finished loaded file
[DEBUG] stable-diffusion.cpp:1390 - txt2img 512x512
[DEBUG] stable-diffusion.cpp:1139 - prompt after extract and remove lora: "a lovely cat holding a sign says 'flux.cpp'"
[INFO ] stable-diffusion.cpp:664  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1144 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:1036 - parse 'a lovely cat holding a sign says 'flux.cpp'' to [['a lovely cat holding a sign says 'flux.cpp'', 1], ]
[DEBUG] clip.hpp:311  - token length: 77
[DEBUG] t5.hpp:397  - token length: 256
[DEBUG] ggml_extend.hpp:997  - t5 compute buffer size: 68.25 MB(VRAM)
ggml_cuda_compute_forward: GET_ROWS failed
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_compute_forward at /text2img/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:2326
  err
/text2img/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:102: CUDA error
Aborted (core dumped)

@stduhpf (Contributor) commented Oct 28, 2024

Ah, I thought it would work; I guess I was wrong.

Glancing at the code of the CUDA backend, it looks like the GET_ROWS operation isn't supported for k-quants? Do you have enough VRAM to test with a q4_0 quant instead?

@CrushDemo01

Thanks. I've tried them all; many quant types give the same error.
