This repository was archived by the owner on Jul 2, 2021. It is now read-only.

Add dilate option and MultiNodeBatchNormalization to Conv2DActiv and Conv2DBNActiv #494

Merged 37 commits on Mar 25, 2018


Member

@mitmul mitmul commented Dec 10, 2017

#388 became too large, so I split it into several smaller PRs.
First, Conv2DActiv and Conv2DBNActiv should accept a dilate option so that DilatedConvolution2D is used instead of Convolution2D.
Next, Conv2DBNActiv should be able to use chainermn.links.MultiNodeBatchNormalization so that batch normalization layers work correctly with multiple GPUs.
This PR introduces these two features.
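
As a rough illustration of what the dilate option changes, the sketch below computes the spatial output size of a convolution with dilation. It is not code from this PR: the helper name conv_out_size is hypothetical, and it assumes the usual floor-based output size rule with an effective kernel size of dilate * (ksize - 1) + 1.

```python
def conv_out_size(size, ksize, stride=1, pad=0, dilate=1):
    """Output length along one axis of a (dilated) convolution.

    Hypothetical helper for illustration only; it is not part of
    ChainerCV. A dilated kernel covers dilate * (ksize - 1) + 1 input
    positions, so the same padding no longer preserves the input size
    once dilate > 1.
    """
    effective_ksize = dilate * (ksize - 1) + 1
    return (size + 2 * pad - effective_ksize) // stride + 1


if __name__ == "__main__":
    # Same 3x3 kernel and padding, with and without dilation.
    print(conv_out_size(32, ksize=3, pad=1))
    print(conv_out_size(32, ksize=3, pad=1, dilate=2))
    # Padding must grow with the dilation to keep the size unchanged.
    print(conv_out_size(32, ksize=3, pad=2, dilate=2))
```

With dilate=2, a 3x3 kernel spans 5 input positions, so tests that assume the undilated output shape need to account for this.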

@mitmul mitmul changed the title from "[WIP] Add dilate option and MultiNodeBatchNormalization to Conv2DActiv and Conv2DBNActiv" to "Add dilate option and MultiNodeBatchNormalization to Conv2DActiv and Conv2DBNActiv" Dec 10, 2017
@mitmul mitmul mentioned this pull request Dec 10, 2017
except for :obj:`activ` and :obj:`bn_kwargs`.
except for :obj:`activ`, :obj:`bn_kwargs`, and :obj:`comm`.
:obj:`comm` is a communicator of ChainerMN which is used for
:obj:`MultiNodeBatchNormalization`. If :obj:`None` is given to the argument
Member

:class:`chainermn.links.MultiNodeBatchNormalization`

Member Author

Fixed

If a ChainerMN communicator is given,
:obj:`~chainermn.links.MultiNodeBatchNormalization` will be used
for the batch normalization. If :obj:`None`,
:obj:`~chainer.links.BatchNormalization` will be used.
Member

I would prefer to pass comm as an element of bn_kwargs, because comm is used only for batch normalization.
If 'comm' is in bn_kwargs, MultiNodeBatchNormalization is used.

Member Author

OK, I changed it
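
A minimal sketch of the behavior agreed on in this thread, assuming comm is passed inside bn_kwargs. The two classes below are stand-ins so that the example runs without Chainer or ChainerMN installed, and make_bn is a hypothetical helper name, not the function used in the PR. Only the call shape MultiNodeBatchNormalization(out_channels, comm, **bn_kwargs) is taken from the diff.

```python
class BatchNormalization:
    """Stand-in for chainer.links.BatchNormalization (illustration only)."""

    def __init__(self, size, **kwargs):
        self.size = size
        self.kwargs = kwargs


class MultiNodeBatchNormalization:
    """Stand-in for chainermn.links.MultiNodeBatchNormalization."""

    def __init__(self, size, comm, **kwargs):
        self.size = size
        self.comm = comm
        self.kwargs = kwargs


def make_bn(out_channels, bn_kwargs=None):
    """Pick the batch normalization link based on the 'comm' key.

    Hypothetical helper: copies bn_kwargs so the caller's dict is not
    modified when 'comm' is popped.
    """
    bn_kwargs = dict(bn_kwargs or {})
    if 'comm' in bn_kwargs:
        comm = bn_kwargs.pop('comm')
        return MultiNodeBatchNormalization(out_channels, comm, **bn_kwargs)
    return BatchNormalization(out_channels, **bn_kwargs)


if __name__ == "__main__":
    print(type(make_bn(16, {'eps': 1e-5})).__name__)
    print(type(make_bn(16, {'comm': object(), 'eps': 1e-5})).__name__)
```

Keeping comm inside bn_kwargs means the constructor signature of the convolution block does not need a separate argument that only the batch normalization layer consumes.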

:obj:`MultiNodeBatchNormalization`. If :obj:`None` is given to the argument
:obj:`comm`, :obj:`BatchNormalization` link from Chainer is used.
:class:`chainermn.links.MultiNodeBatchNormalization`. If
:obj:`None` is given to the argument :obj:`comm`, :obj:`BatchNormalization` link from Chainer is used.
Member

:class:`chainer.links.BatchNormalization`

self.bn = MultiNodeBatchNormalization(
out_channels, comm, **bn_kwargs)
out_channels, [**bn_kwargs)
Member

[ ?

If a ChainerMN communicator is given,
:class:`chainer.links.BatchNormalization`. If a ChainerMN
communicator (:class:`~chainermn.communicators.CommunicatorBase)
is given with the key :obj:`comm`,
:obj:`~chainermn.links.MultiNodeBatchNormalization` will be used
Member

:class:`chainermn.links.MultiNodeBatchNormalization`

try:
from chainermn.links import MultiNodeBatchNormalization
_chainermn_available = True
except (ImportError, TypeError):
Member

Does TypeError occur?
If not, we can remove it.

stride=1, pad=0, nobias=True, initialW=None,
initial_bias=None, activ=relu, bn_kwargs=dict()):
stride=1, pad=0, dilate=1, nobias=True, initialW=None,
initial_bias=None, activ=relu, bn_kwargs=dict(), comm=None):
Member

Can you remove comm option?

self.in_channels, self.out_channels, self.ksize, self.stride,
self.pad, self.dilate, initialW=initialW,
initial_bias=initial_bias, activ=activ, bn_kwargs=bn_kwargs,
comm=comm)
Member

Include comm in bn_kwargs.

@yuyu2172
Member

yuyu2172 commented Mar 6, 2018

I am very sorry for the late review.
Could you please resolve the conflict with the master branch?

.travis.yml Outdated
conda env create -f environment.yml;
source activate chainercv;
cd $HOME;
wget https://github.com/chainer/chainermn/archive/v1.0.0.tar.gz -O chainermn.tar.gz;
Member

Please update to the latest release.

@yuyu2172
Member

yuyu2172 commented Mar 6, 2018

ChainerMN does not seem to be installed correctly.

The installation fails with: error: option --no-nccl not recognized.

@Hakuyume
Member

Hakuyume commented Mar 7, 2018

Why do you install ChainerMN from source? Can't we install it by modifying chainercv/environment.yml?

@yuyu2172
Member

yuyu2172 commented Mar 7, 2018

Travis is failing because the environment file is incorrect.
Perhaps pip should be used inside the env file to install ChainerMN.

@yuyu2172
Member

yuyu2172 commented Mar 7, 2018

Conda installation is still failing.

@yuyu2172
Member

@mitmul

You need a "-" before "pip".

The env file below worked.

name: chainercv
channels:
- !!python/unicode
  'menpo'
- !!python/unicode
  'mpi4py'
- !!python/unicode
  'defaults'
dependencies:
  - Cython
  - opencv3=3.2.0
  - matplotlib
  - numpy
  - Pillow
  - openmpi
  - pip:
    - chainermn==1.2.0

@mitmul
Member Author

mitmul commented Mar 15, 2018

Thanks. I updated it.

@yuyu2172
Member

Tests are failing when dilate == 2.
I think you forgot to modify test_conv_2d_activ.py. It is fine for test_conv_2d_bn_activ.py.

@mitmul
Member Author

mitmul commented Mar 16, 2018

@yuyu2172 Sorry, I updated the test too.

@mitmul
Member Author

mitmul commented Mar 16, 2018

All tests passed.

@yuyu2172
Member

yuyu2172 commented Mar 19, 2018

I checked the test log and it seems that ChainerMN is not installed properly.

The log says that mpi4py fails to get installed. https://travis-ci.org/chainer/chainercv/jobs/354200398#L751
Could you fix this?

  _configtest.c:2:17: fatal error: mpi.h: No such file or directory
   #include <mpi.h>
                   ^
  compilation terminated.
  failure.
  removing: _configtest.c _configtest.o
  error: Cannot compile MPI programs. Check your configuration!!!
  
  ----------------------------------------
  Failed building wheel for mpi4py
  Running setup.py clean for mpi4py
  Running setup.py bdist_wheel for pycparser ... done
  Stored in directory: /home/travis/.cache/pip/wheels/95/14/9a/5e7b9024459d2a6600aaa64e0ba485325aff7a9ac7489db1b6
  Running setup.py bdist_wheel for filelock ... done
  Stored in directory: /home/travis/.cache/pip/wheels/5f/5e/8a/9f1eb481ffbfff95d5f550570c1dbeff3c1785c8383c12c62b
Successfully built chainermn chainer pycparser filelock
Failed to build mpi4py

As a consequence, the tests related to ChainerMN are skipped.
https://travis-ci.org/chainer/chainercv/jobs/354200398#L1077
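
To illustrate why a broken ChainerMN install can go unnoticed, here is a sketch of a test that is skipped, rather than failed, when the optional dependency is missing. This is an assumption about the general pattern, not the actual ChainerCV test code. The class and function names are hypothetical.

```python
import importlib.util
import unittest

# True only if the chainermn package can be located; nothing is imported.
CHAINERMN_AVAILABLE = importlib.util.find_spec("chainermn") is not None


class TestMultiNodeBN(unittest.TestCase):
    """Hypothetical test case that depends on ChainerMN."""

    @unittest.skipUnless(CHAINERMN_AVAILABLE, "ChainerMN is not installed")
    def test_requires_chainermn(self):
        self.assertTrue(CHAINERMN_AVAILABLE)


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMultiNodeBN)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    result = run_suite()
    print("skipped:", len(result.skipped))
```

A skipped test still counts as a successful run, so the CI job stays green even though the ChainerMN code path was never exercised. Checking the skip count in the log is what reveals the problem.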

@mitmul
Member Author

mitmul commented Mar 19, 2018

I see. Thanks for catching it. I'll fix that.

@mitmul
Member Author

mitmul commented Mar 22, 2018

@yuyu2172 I fixed environment.yml to install mpi4py from conda. I think this fixes the mpi4py installation.

@mitmul mitmul force-pushed the add-conv2dmultinodebnrelu branch from 63a1f02 to 96cd29a Compare March 22, 2018 09:37
@mitmul mitmul force-pushed the add-conv2dmultinodebnrelu branch from 96cd29a to a9e908d Compare March 22, 2018 09:46
Member

@yuyu2172 yuyu2172 left a comment

LGTM

@yuyu2172 yuyu2172 merged commit aee70e6 into chainer:master Mar 25, 2018
@yuyu2172 yuyu2172 added this to the v0.9 milestone Mar 25, 2018
@yuyu2172 yuyu2172 self-assigned this Mar 25, 2018
@mitmul mitmul deleted the add-conv2dmultinodebnrelu branch May 18, 2018 09:22
3 participants