
Add LC_ALL, LANG, and TF_FORCE_GPU_ALLOW_GROWTH ENVs to Dockerfile #212

Merged (1 commit, Oct 20, 2020)

Conversation

@Witiko (Contributor) commented on Oct 19, 2020

As discussed in OCR-D/ocrd_calamari#46 (comment), the Dockerfile currently omits a number of important environment variables: LC_ALL, LANG, and TF_FORCE_GPU_ALLOW_GROWTH.

The LC_ALL and LANG environment variables

Unless the LC_ALL and LANG environment variables are configured on the host device to use Unicode, running Python in the produced Docker image fails with the following error:

RuntimeError: Click will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. Consult https://click.palletsprojects.com/en/7.x/python3/ for mitigation steps.

This system supports the C.UTF-8 locale which is recommended.
You might be able to resolve your issue by exporting the
following environment variables:

    export LC_ALL=C.UTF-8
    export LANG=C.UTF-8

This pull request overrides the values of the LC_ALL and LANG environment variables to make the Dockerfile portable.
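
Concretely, the change boils down to ENV instructions along these lines (a sketch based on the error message above; see the commit for the exact form):

    ENV LC_ALL=C.UTF-8
    ENV LANG=C.UTF-8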

The TF_FORCE_GPU_ALLOW_GROWTH environment variable

Unless the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set to true, a single calamari-recognize process will consume all available VRAM, although it only needs ca. 4 GB:

Sat Oct 17 19:35:37 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3F:00.0 Off |                    0 |
| N/A   50C    P0    28W /  70W |  14850MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   48C    P0    27W /  70W |    106MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4697      C   .../headless-tf1/bin/python3    14847MiB |
|    1   N/A  N/A      4697      C   .../headless-tf1/bin/python3      103MiB |
+-----------------------------------------------------------------------------+

This pull request sets the TF_FORCE_GPU_ALLOW_GROWTH environment variable to true so that TensorFlow only allocates GPU memory as needed.
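
Again as a sketch (see the commit for the exact form), this is a single ENV instruction:

    ENV TF_FORCE_GPU_ALLOW_GROWTH=true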

@kba (Member) left a comment:

LGTM

@bertsky (Collaborator) commented on Oct 19, 2020

Agreeing on the l10n variables, but surprised to see TF_FORCE_GPU_ALLOW_GROWTH here. This is an implementation detail for just one package of one module. The TF processors should instead use config.gpu_options.allow_growth = True IMHO.
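
For illustration, a minimal sketch of this per-processor alternative using the TF 1.x session API (added here for reference; the placement inside a processor is hypothetical and not part of the original comment):

    import tensorflow as tf

    # Allocate GPU memory on demand instead of reserving all VRAM up front.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    session = tf.Session(config=config)
    tf.keras.backend.set_session(session)  # only needed if the model runs through Keras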

@bertsky (Collaborator) commented on Oct 19, 2020

but surprised to see TF_FORCE_GPU_ALLOW_GROWTH here. This is an implementation detail for just one package of one module. The TF processors should instead use config.gpu_options.allow_growth = True IMHO.

Or is this something that we do in fact need to treat as a deployment detail (deciding to prevent multiple use of the GPU in some cases and allowing it in others)?

@Witiko (Contributor, Author) commented on Oct 19, 2020

@bertsky TF_FORCE_GPU_ALLOW_GROWTH could certainly be replaced with patches to the individual modules (although I have only encountered issues with calamari, this likely affects other modules as well). However, fixing the allocation issue with an environment variable seems to give the same result with less work and less deviation from upstream. Do you suppose allowing TF to allocate the whole VRAM even though it only needs a fraction could be useful for some users?

@bertsky (Collaborator) commented on Oct 19, 2020

Do you suppose allowing TF to allocate the whole VRAM even though it only needs a fraction could be useful for some users?

Yes, that was my previous question. I have no problem imagining this to be useful in some circumstances. For example, forcing one processor per GPU makes runtime races for GPU resources more controllable (and allows an early CPU fallback). But if the processors may grow the allocated GPU RAM, then whether or not OOM occurs might depend on the input data (image sizes) and other random factors (like which combination of workflow steps happens to run at the same time).

However, fixing the allocation issue with an environment variable seems to give the same result with less work and less deviation from upstream.

That's a good argument, but you could also set os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' in ocrd_calamari for that.

Anyway, I think it's enough to remember that this might become an issue and to adopt the envvar solution for now.
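
For illustration, a minimal sketch of that in-module variant (the placement inside ocrd_calamari is hypothetical; the variable has to be set before TensorFlow initializes the GPU):

    import os

    # Must be set before TensorFlow is imported and the GPU is initialized.
    os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

    import tensorflow as tf  # TensorFlow will now allocate GPU memory on demand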

@bertsky self-requested a review on Oct 19, 2020, 19:29
@Witiko (Contributor, Author) commented on Oct 19, 2020

@bertsky In the case you described, a user could also force a single GPU per process by docker run --env TF_FORCE_GPU_ALLOW_GROWTH=false. However, TF_FORCE_GPU_ALLOW_GROWTH=true seems like the saner default setting, since GPUs are costly and dedicating a whole GPU to each process is usually wasteful, especially on shared computing nodes.
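
For illustration (the image name and command are placeholders, not taken from this PR):

    # Override the Dockerfile default at run time.
    docker run --env TF_FORCE_GPU_ALLOW_GROWTH=false <image> <command>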

@bertsky (Collaborator) commented on Oct 20, 2020

In the case you described, a user could also force a single GPU per process by docker run --env TF_FORCE_GPU_ALLOW_GROWTH=false.

Yep, good idea!

However, TF_FORCE_GPU_ALLOW_GROWTH=true seems like the saner default setting, since GPUs are costly and dedicating a whole GPU to each process is usually wasteful, especially on shared computing nodes.

It depends on the kind of GPU (how much RAM it has) and the kind of compute task (its RAM requirements). In the case where one job already takes more than half of the memory, exclusive allocation is okay, and an early CPU fallback is better than a late OOM failure.

@stweil (Collaborator) left a comment:
The modification is fine, thank you, although more comments would have been nice.

@stweil merged commit 67bd9dd into OCR-D:master on Oct 20, 2020