This is a TensorFlow implementation of the training code for "Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos" (CVPR 2021, Oral Presentation).
Project Page | TikTok Dataset
(This code was tested with tensorflow-gpu 1.14.0, Python 3.7.4, CUDA 10.0.130, and cuDNN 7.4.2.) The required pip packages are:
- numpy
- imageio
- matplotlib
- scikit-image
- scipy==1.1.0
- tensorflow-gpu==1.14.0
- gast==0.2.2
- Pillow
Run the following command to install all pip packages:
pip install -r requirements.txt
If you run into environment problems, you can instead use the NVIDIA TensorFlow Docker container (tensorflow:19.02-py3):
sudo docker run --gpus all -it --rm -v local_dir:container_dir nvcr.io/nvidia/tensorflow:19.02-py3
Then you can install the requirements:
pip install -r requirements.txt
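Either way, before training it can help to verify the environment. The following is a minimal sanity check (not part of the repository) that confirms the TensorFlow version and GPU visibility:

```python
# Quick environment sanity check (illustrative, not part of the repository).
import tensorflow as tf

print("TensorFlow version:", tf.__version__)          # expected: 1.14.0
print("GPU available:", tf.test.is_gpu_available())   # should be True with CUDA 10 / cuDNN 7
```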
This project was trained on the "RenderPeople dataset". However, since that data is commercial, we cannot share it. Instead, a sample of the public data from the "Tang et al. dataset" is provided here for a training trial. A small sample of the TikTok data is also provided to train the semi-supervised framework. The complete TikTok dataset can be downloaded from "here".
- Download the sample training data from here and extract it in this folder, or run:
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1uJ_yQ0XQwNhmHI_irsx8H4f6kQ-yhp5P' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1uJ_yQ0XQwNhmHI_irsx8H4f6kQ-yhp5P" -O training_data.zip && rm -rf /tmp/cookies.txt
unzip training_data.zip
- Download the pretrained model from here and extract it in this folder, or run:
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UOHkmwcWpwt9r11VzOCa_CVamwHVaobV' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UOHkmwcWpwt9r11VzOCa_CVamwHVaobV" -O model.zip && rm -rf /tmp/cookies.txt
unzip model.zip
In the end you should have three folders in this directory: training_code, training_data, and model.
You can organize your own data in the same manner and run the training. Note that your images must be 256x256.
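If your own images are not already 256x256, you can resize them before arranging them into the data layout described below. Here is a minimal sketch using Pillow (already in the requirements); the folder names are only illustrative:

```python
# Illustrative resizing helper; adjust the paths to your own data.
import os
from PIL import Image

def resize_folder(src_dir, dst_dir, size=(256, 256), resample=Image.BILINEAR):
    os.makedirs(dst_dir, exist_ok=True)
    for name in sorted(os.listdir(src_dir)):
        if name.endswith(".png"):
            img = Image.open(os.path.join(src_dir, name))
            img.resize(size, resample).save(os.path.join(dst_dir, name))

resize_folder("my_data/color", "training_data/my_data/color")
# Use nearest-neighbour for label-like images (masks, DensePose IUV) to avoid interpolated values.
resize_folder("my_data/mask", "training_data/my_data/mask", resample=Image.NEAREST)
```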
Here we present the training code for (1) the Normal Estimator, (2) the Depth Estimator, and (3) HDNet.
First, go to the training_code directory:
cd training_code
1. Normal Estimator
To train the Normal Estimator, run training_NormalEstimator.py. If you like, you can change the variables on lines 27 to 34 of that script.
python training_NormalEstimator.py
This estimator pretrains the network on ground-truth data (here, the Tang et al. data).
Every 100 steps the training results will be stored in "training_progress/visualization/NormalEstimator/Tang/"
Every 10000 steps the checkpoints will be stored in "/training_progress/model/NormalEstimator/"
2. Depth Estimator
To train the Depth Estimator, run training_DepthEstimator.py. If you like, you can change the variables on lines 27 to 34 of that script.
python training_DepthEstimator.py
This estimator pretrains the network on ground-truth data (here, the Tang et al. data).
Every 100 steps the training results will be stored in "training_progress/visualization/DepthEstimator/Tang/"
Every 10000 steps the checkpoints will be stored in "/training_progress/model/DepthEstimator/"
3. HDNet
To train HDNet, run training_HDNet.py. If you like, you can change the variables on lines 27 to 34 of that script. Note that HDNet can only be trained with batch size 1.
python training_HDNet.py
HDNet is a semi-supervised framework that is trained on both labeled data (Tang et al. data) and unlabeled data (TikTok data).
You can choose whether or not the network starts from the pretrained models by commenting or uncommenting lines 128 and 129.
Every 100 steps the training results on Tang et al. data will be stored in "training_progress/visualization/HDNet/Tang/"
Every 100 steps the training results on TikTok data will be stored in "training_progress/visualization/HDNet/tiktok/"
Every 10000 steps the checkpoints will be stored in "/training_progress/model/HDNet/"
You can visualize the training convergence by running plot_convergence.py:
python plot_convergence.py
- The training data with ground truth (e.g. Tang_data; a loading sketch follows this listing):
|-color (Images)
| |- 0000000.png
| |- 0000001.png
| |- 0000002.png
| |- ...
|-color_WO_bg (Images with white background)
| |- 0000000.png
| |- 0000001.png
| |- 0000002.png
| |- ...
|-densepose (DensePose IUV maps)
| |- 0000000.png
| |- 0000001.png
| |- 0000002.png
| |- ...
|-depth (Ground truth depth values (256x256))
| |- 0000000.txt
| |- 0000001.txt
| |- 0000002.txt
| |- ...
|-mask (human segmentation)
| |- 0000000.png
| |- 0000001.png
| |- 0000002.png
| |- ...
|-normal (Ground truth normal)
| |- 0000000_1.txt (first channel)
| |- 0000000_2.txt (second channel)
| |- 0000000_3.txt (third channel)
| |- ...
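As a concrete illustration of this layout, a minimal loading sketch follows. It assumes each depth .txt stores a 256x256 array of whitespace-separated values and each normal channel is stored the same way; adjust the parsing if the actual files differ:

```python
# Illustrative loader for one labeled (ground-truth) sample; paths and parsing are assumptions.
import numpy as np
import imageio

def load_labeled_sample(root, frame_id):
    color = imageio.imread(f"{root}/color/{frame_id}.png")        # RGB image
    mask = imageio.imread(f"{root}/mask/{frame_id}.png")          # human segmentation
    iuv = imageio.imread(f"{root}/densepose/{frame_id}.png")      # DensePose IUV map
    depth = np.loadtxt(f"{root}/depth/{frame_id}.txt")            # assumed 256x256 array
    # Ground-truth normals are stored as one .txt per channel; stack them into HxWx3.
    normal = np.stack(
        [np.loadtxt(f"{root}/normal/{frame_id}_{c}.txt") for c in (1, 2, 3)], axis=-1
    )
    return color, mask, iuv, depth, normal

color, mask, iuv, depth, normal = load_labeled_sample("training_data/Tang_data", "0000000")
```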
- The training data without ground truth: (e.g. tiktok_data)
|-color (Images)
| |- 0017634.png
| |- 0017635.png
| |- 0017636.png
| |- ...
|-color_WO_bg (Images with white background)
| |- 0017634.png
| |- 0017635.png
| |- 0017636.png
| |- ...
|-densepose (DensePose IUV maps)
| |- 0017634.png
| |- 0017635.png
| |- 0017636.png
| |- ...
|-correspondences (The correspondences between two frames) (see more explanation below)
| |- corrs
| | |- 0017634_0017637_i_limit.txt
| | |- 0017634_0017637_i_r1_c1_r2_c2.txt
| | |- 0017634_0017643_i_limit.txt
| | |- 0017634_0017643_i_r1_c1_r2_c2.txt
| | |- ...
| |- corr_mat.txt
|-mask (human segmentation)
| |- 0017634.png
| |- 0017635.png
| |- 0017636.png
| |- ...
|-pred_normals (Predicted normals)
| |- 0017634_1.txt (first channel)
| |- 0017634_2.txt (second channel)
| |- 0017634_3.txt (third channel)
| |- ...
You need to do some pre-processing to use unlabeled data for training:
- "corr_mat.txt" is a Nx5 matrix that in each row, the first value is the frame you want to use for self supervision (frame i) and the rest of 4 values are the frame numbres that the correspondences have been computed (frame j) and stored in corrs folder. (For each frame we chose 4 different frame to use for our self-supervised framework.)
- 0017634_0017637_i_r1_c1_r2_c2.txt is an example of a correspondences file. Each row has 5 values describing one 2D correspondence between frame 0017634 and frame 0017637: (1) i, the body part number from DensePose; (2) r1, the row of the correspondence in frame 0017634; (3) c1, the column in frame 0017634; (4) r2, the row in frame 0017637; (5) c2, the column in frame 0017637.
- 0017634_0017637_i_limit.txt is a 24x3 matrix whose first column holds the 24 DensePose body-part numbers; the second and third columns give the range of rows in 0017634_0017637_i_r1_c1_r2_c2.txt that belong to that body part. If the second and third columns are -1, the corresponding body part is not visible or has no correspondences.
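Putting these three descriptions together, the sketch below shows one way to read the correspondence files for a single frame pair and pull out the matches of one body part. File names follow the format above; the row-range convention (here assumed inclusive) and the exact parsing used by the training code may differ:

```python
# Illustrative reader for the correspondence files; conventions are assumptions.
import numpy as np

corr_dir = "tiktok_data/correspondences"                     # illustrative path

# corr_mat.txt: each row is [frame_i, frame_j1, frame_j2, frame_j3, frame_j4]
corr_mat = np.loadtxt(f"{corr_dir}/corr_mat.txt").astype(int)
frame_i, frame_j = corr_mat[0, 0], corr_mat[0, 1]            # first pair as an example

pair = f"{frame_i:07d}_{frame_j:07d}"                        # file names are 7-digit zero-padded
corrs = np.loadtxt(f"{corr_dir}/corrs/{pair}_i_r1_c1_r2_c2.txt").astype(int)  # [i, r1, c1, r2, c2]
limits = np.loadtxt(f"{corr_dir}/corrs/{pair}_i_limit.txt").astype(int)       # [part, start, end]

part = 2                                                     # e.g. DensePose body part 2
start, end = limits[part - 1, 1], limits[part - 1, 2]
if start == -1:
    print(f"body part {part}: not visible / no correspondences")
else:
    part_corrs = corrs[start:end + 1]                        # assumes an inclusive row range
    print(f"body part {part}: {len(part_corrs)} correspondences")
```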
If you find the code or our dataset useful in your research, please consider citing the paper.
@InProceedings{jafarian2021tiktok,
  author    = {Yasamin Jafarian and Hyun Soo Park},
  title     = {Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2021}
}

@misc{jafarian2021selfsupervised,
  title         = {Self-supervised 3D Representation Learning of Dressed Humans from Social Media Videos},
  author        = {Yasamin Jafarian and Hyun Soo Park},
  year          = {2021},
  eprint        = {2103.03319},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}