Training questions #3

Closed · Phhofm opened this issue Dec 24, 2023 · 4 comments

Phhofm commented Dec 24, 2023

Thank you, it's working ;)

[Screenshot: vcisr_training_grlgan]

Just a few questions. When using GRLGAN it prints some messages, none of which I get when using just GRL in the config. One says "We didn't yet handle discriminator, but we think that it should be necessary" - should I wait with GRLGAN training until discriminator handling is implemented? (It currently saves discriminator and generator checkpoints successfully.)

Is there resumability (i.e. picking up from the last training state, loading the latest generator and discriminator and knowing which epoch it left off at), or should I just never interrupt training? (Using --use_pretrained with only a generator file would reset the discriminator, I believe.)

Also, would validation images during training be a possibility, so I could visually check progression with checkpoints? If not, it's no big deal, I was just wondering. (I could also stop training to free up the VRAM and run inference with checkpoints to see visual results, but that's only an option if training can be resumed. If not, I will just let it run for a while, then stop training once and run inference to see the result.)

A side note: when using GRL I could train with batch size 8, but I needed to lower it to 6 for GRLGAN training to not run out of VRAM (RTX 3060). GRLGAN seems to use considerably more VRAM than GRL; maybe that is expected, just something I noticed.

Thank you for your work :)

Kiteretsu77 (Owner) commented Dec 26, 2023

"We didn't yet handle discriminator, but we think that it should be necessary" means that the learning rate decay only considers the generator instead of the discriminator.

If you want to use the latest or the best weights, you can run "python train_code/train.py --auto_resume_closest" to resume from the closest (most recent) weights, or "python train_code/train.py --auto_resume_best" to load the weights with the best loss. All weights are stored inside the "saved_models" folder by default.
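
For reference, a resumable training state typically bundles the generator, discriminator, both optimizers and the iteration counter in one file, so resuming does not reset the discriminator. A generic sketch, assuming hypothetical key names rather than the repo's exact checkpoint format:

```python
import torch

def save_state(path, netG, netD, optim_G, optim_D, epoch, iteration):
    """Save everything needed to resume GAN training from this exact point."""
    torch.save({
        "epoch": epoch,
        "iteration": iteration,
        "netG": netG.state_dict(),
        "netD": netD.state_dict(),
        "optim_G": optim_G.state_dict(),
        "optim_D": optim_D.state_dict(),
    }, path)

def load_state(path, netG, netD, optim_G, optim_D, device="cpu"):
    """Restore generator, discriminator and optimizer states; return progress."""
    ckpt = torch.load(path, map_location=device)
    netG.load_state_dict(ckpt["netG"])
    netD.load_state_dict(ckpt["netD"])
    optim_G.load_state_dict(ckpt["optim_G"])
    optim_D.load_state_dict(ckpt["optim_D"])
    return ckpt["epoch"], ckpt["iteration"]
```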

Currently, I haven't written any code to check validation images in the middle of training. Personally, I prefer to start the evaluation after the training process is finished. Multiple checkpoints will be saved in "saved_models"; you can check "checkpoints_freq" in opt.py to set the checkpoint frequency.
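
If you still want to spot-check progression mid-training, one option is to load a saved checkpoint in a separate script and run it over a small validation folder. A generic PyTorch sketch (the checkpoint keys and model interface are assumptions, not the repo's inference code):

```python
import os
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor, to_pil_image

def spot_check(ckpt_path, model, val_dir, out_dir, device="cuda"):
    """Run one saved checkpoint over a folder of LR validation images."""
    state = torch.load(ckpt_path, map_location=device)
    # Some checkpoints wrap the weights in a key; adjust to the actual format.
    model.load_state_dict(state.get("model_state_dict", state))
    model.eval().to(device)
    os.makedirs(out_dir, exist_ok=True)
    with torch.no_grad():
        for name in sorted(os.listdir(val_dir)):
            lr = to_tensor(Image.open(os.path.join(val_dir, name)).convert("RGB"))
            sr = model(lr.unsqueeze(0).to(device)).squeeze(0).clamp(0, 1).cpu()
            to_pil_image(sr).save(os.path.join(out_dir, name))
    # Use device="cpu" if the GPU is busy training.
```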

Yes, GAN training requires more memory due to the perceptual loss and the discriminator. You can use the RRDB network structure, which needs less memory and less inference time. I used GRL because it is the SOTA SR paper at CVPR 2023. They claimed an efficient structure and the smallest model parameter size. However, their model inference is comparatively slow, and I would personally recommend other network structures for general SR purposes. Thus, GRL may be good for research purposes, but it is not a good choice for general inference.

Phhofm (Author) commented Dec 27, 2023

Wow, thank you for all your answers :) I appreciate it :)

Great to know about the --auto_resume_best flag, thanks.

And yeah, that's fine. I simply asked because during model training I normally like to check metrics (TensorBoard) and also validation output. For example, there are brightness shifts that happen when I train SPAN, OmniSR or Real-CUGAN models (just those, though; other archs like DAT, SRFormer, SRVGGNetCompact, ESRGAN etc. train fine on my system without brightness change) which I can't catch in the metrics (like color loss) but can easily see in validation images. If output is only visually checked at the very end of training, a lot of training time can be lost before adjusting the GAN loss weight in the config file, for example, or making any other needed adjustments, like a bigger batch size to stabilize training. Anyway, no biggie if it's not present.

Ah yeah, I agree, GRL for sure had some of the best metrics I had seen on the test sets. I trained one GRL Small model once (on the HFA2k_LUDVAE dataset, so it was meant for anime with realistic degradations). But what I found a bit sad is that since its release in March, the GitHub repo has never received any updates (no readme updates) and issues have been left unanswered; the repo has felt abandoned for a while now. This is why I trained a lot of DAT models instead, which I think is still one of the most capable SISR archs we currently have (and the dev answers issues on GitHub to this day). But yeah, it's a heavier arch that's also slow. In contrast, SPAN (which is basically a transformer with convnet speeds), SRVGGNetCompact, DITN, or OmniSR are all way faster, but I like how well the heavier archs can handle multiple degradations.

Anyway thanks for your answer :) This issue can be closed, all my questions have been answered, thanks :)

Phhofm closed this as completed Dec 27, 2023

Phhofm (Author) commented Dec 27, 2023

PS It works, thanks, just wanted to show :)

Started working on a 4x anime VCISR GRLGAN model (although I will be in the mountains for a week now, so I will maybe continue training next week). This is around epoch 73, 31400 iterations, learning rate 5e-05, lowest generator loss 6.489010810852051. Input (h264_crf28 compressed) and 4x output:

[Image: h264_crf28 (input)]
[Image: 4x_HFA2k_VCISR_GRLGAN_best_h264_crf28 (4x output)]

Phhofm (Author) commented Jan 4, 2024

I trained my model for 200 epochs :)

I think the OTF video compression degradation pipeline is great for handling video compression, thanks :) It would be great if we could train faster archs for video inference with this code repo / degradation pipeline as well :)

Here is the info to my model:

Name: 4xHFA2k_VCISR_GRLGAN_ep200
Download Folder
License: CC BY 4.0
Network: GRL
Scale: 4
Purpose: 4x anime upscaler handling video compression artifacts, trained for 200 epochs
Iterations: 85959
Epoch: 200
batch_size: 6
HR_size: 128
Dataset: hfa2k
Number of train images: 2568
OTF Training: Yes
Pretrained_Model_G: None

Description:
4x anime upscaler handling video compression artifacts, since it was trained with OTF degradations for "mpeg2video", "libxvid", "libx264" and "libx265" with CRF 20-32 and MPEG bitrate 3800-5800 (together with the standard Real-ESRGAN OTF pipeline). A faster arch using this OTF degradation pipeline would be great for handling video compression artifacts. Since this one is a GRL model and therefore slow, it is maybe more for research purposes (or for single images/screenshots). Trained using VCISR for 200 epochs.

"This is epoch 200 and the start iteration is 85959 with learning rate 2.5e-05"

Slow Pics examples:
h264_crf28
ludvae1
ludvae2
