This repository was archived by the owner on Jul 2, 2021. It is now read-only.

add MaskRCNN #781

Merged
merged 90 commits into from
Apr 23, 2019

@yuyu2172 yuyu2172 commented Feb 12, 2019

Merge after #778

@knorth55 knorth55 added this to the 0.13 milestone Feb 12, 2019
@yuyu2172 yuyu2172 force-pushed the mask-rcnn branch 4 times, most recently from 1bd2c8a to f402118 Compare February 13, 2019 07:37
@yuyu2172 yuyu2172 force-pushed the mask-rcnn branch 4 times, most recently from bceb057 to 6b48d9a Compare February 13, 2019 14:07

yuyu2172 commented Feb 14, 2019

Weird segfault during training...
Experimented with n_gpu=1, batchsize=1.
For debugging purposes, I switched from MultithreadIterator to SerialIterator.

 *** Process received signal ***
 Signal: Segmentation fault (11)
 Signal code: Address not mapped (1)
 Failing at address: 0x67ee6000
 [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fbb433f5390]                                                   
 [ 1] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7971a0)[0x7fb9d82c31a0]             
 [ 2] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x61ce58)[0x7fb9d8148e58]             
 [ 3] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7cf482)[0x7fb9d82fb482]             
 [ 4] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7d0879)[0x7fb9d82fc879]             
 [ 5] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x2ef784)[0x7fb9d7e1b784]             
 [ 6] /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x13c)[0x7fbb43733d6c]                              
 [ 7] /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallKeywords+0x4d)[0x7fbb43733bfd]                           
 [ 8] /usr/local/lib/libpython3.6m.so.1.0(+0x18b9fb)[0x7fbb4378c9fb]                                                    
 [ 9] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1363)[0x7fbb43785a83]                              
 [10] /usr/local/lib/libpython3.6m.so.1.0(+0x18be0a)[0x7fbb4378ce0a]                                                    
 [11] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fbb4378cadb]                                                    
 [12] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x34f)[0x7fbb43784a6f]                               
 [13] /usr/local/lib/libpython3.6m.so.1.0(+0x18c0f3)[0x7fbb4378d0f3]                                                    
 [14] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fbb4378cadb]                                                    
 [15] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1363)[0x7fbb43785a83]                              
 [16] /usr/local/lib/libpython3.6m.so.1.0(+0x18be0a)[0x7fbb4378ce0a]                                                    
 [17] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fbb4378cadb]                                                    
 [18] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x34f)[0x7fbb43784a6f]                               
 [19] /usr/local/lib/libpython3.6m.so.1.0(_PyFunction_FastCallDict+0x41b)[0x7fbb4378eacb]                               
 [20] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x10e)[0x7fbb436fa86e]                                 
 [21] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_Call_Prepend+0x61)[0x7fbb436fb1f1]                                  
 [22] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fbb436fac17]                                           
 [23] /usr/local/lib/libpython3.6m.so.1.0(+0x147506)[0x7fbb43748506]                                                    
 [24] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fbb436fac17]                                           
 [25] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1c69)[0x7fbb43786389]                              
 [26] /usr/local/lib/libpython3.6m.so.1.0(_PyFunction_FastCallDict+0x41b)[0x7fbb4378eacb]                               
 [27] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x10e)[0x7fbb436fa86e]                                 
 [28] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_Call_Prepend+0x61)[0x7fbb436fb1f1]                                  
 [29] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fbb436fac17]                                           
 *** End of error message ***                                                                                           

Other configurations tried:

  • failed
    • PIL for read_image
    • PIL for resize
    • Serial Iterator
  • failed
    • PIL for read_image
    • PIL for resize
    • Serial Iterator
    • Use the latest RoiAlign
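
The native backtrace above only lists offsets inside the cv2 shared object, so it does not say which Python call triggered the crash. One way to narrow this down, sketched here as a suggestion rather than something done in this PR, is Python's standard-library faulthandler module, which dumps the Python-level stack when a fatal signal arrives. The helper name enable_crash_traceback and the log file name are hypothetical.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Write the Python traceback of every thread to log_path on a fatal signal.

    Returns the open log file. faulthandler needs the file to stay open,
    so the caller should keep a reference until faulthandler.disable().
    """
    log_file = open(log_path, "w")
    # On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL the interpreter dumps
    # the Python-level stack, which the native backtrace does not show.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    log_file = enable_crash_traceback("crash_traceback.log")
    # ... build the trainer and call trainer.run() here ...
    faulthandler.disable()
    log_file.close()
```

Running the script with `python3 -X faulthandler train_multi.py` should have a similar effect without code changes, with the traceback going to stderr instead of a file.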

@yuyu2172
Member Author

$ mpiexec -n 8 python3 train_multi.py --batchsize 8
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'coco1-worker-0', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
[coco1-worker-0:115997] 7 more processes have sent help message help-mpi-btl-openib.txt / default subnet prefix
[coco1-worker-0:115997] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
0 (1, 3, 1344, 576)
6 (1, 3, 800, 1216)
4 (1, 3, 1088, 800)
7 (1, 3, 800, 1216)
15 (1, 3, 800, 1088)
19 (1, 3, 1088, 800)
19 (1, 3, 704, 1344)
41 (1, 3, 800, 1088)
[coco1-worker-0:116005] Failed to cuMemcpy GPU memory, rc=-1
--------------------------------------------------------------------------
The call to cuMemcpyAsync failed. This is a unrecoverable error and will
cause the program to abort.
  cuMemcpyAsync(0x7f5f2f22ac00, 0x7f5df4dfb000, 8192) returned value 1
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
Exception in main training loop: MPI_ERR_TRUNCATE: message truncated
Traceback (most recent call last):
  File "/home/yuyu2172/chainer/chainer/training/trainer.py", line 316, in run
    update()
  File "/home/yuyu2172/chainer/chainer/training/updaters/standard_updater.py", line 170, in update
    self.update_core()
  File "/home/yuyu2172/chainer/chainer/training/updaters/standard_updater.py", line 182, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "/home/yuyu2172/chainer/chainermn/optimizers.py", line 28, in update
    self.communicator.bcast_data(target)
  File "/home/yuyu2172/chainer/chainermn/communicators/mpi_communicator_base.py", line 615, in bcast_data
    self.mpi_comm.Bcast(buf)
  File "mpi4py/MPI/Comm.pyx", line 579, in mpi4py.MPI.Comm.Bcast
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "train_multi.py", line 249, in <module>
    main()
  File "train_multi.py", line 245, in main
    trainer.run()
  File "/home/yuyu2172/chainer/chainer/training/trainer.py", line 349, in run
    six.reraise(*exc_info)
  File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/yuyu2172/chainer/chainer/training/trainer.py", line 316, in run
    update()
  File "/home/yuyu2172/chainer/chainer/training/updaters/standard_updater.py", line 170, in update
    self.update_core()
  File "/home/yuyu2172/chainer/chainer/training/updaters/standard_updater.py", line 182, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "/home/yuyu2172/chainer/chainermn/optimizers.py", line 28, in update
    self.communicator.bcast_data(target)
  File "/home/yuyu2172/chainer/chainermn/communicators/mpi_communicator_base.py", line 615, in bcast_data
    self.mpi_comm.Bcast(buf)
  File "mpi4py/MPI/Comm.pyx", line 579, in mpi4py.MPI.Comm.Bcast
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated


yuyu2172 commented Feb 15, 2019

It seems to work with the current master:
Chainer: chainer/chainer@afe9033
CuPy: cupy/cupy@155228f

EDIT:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x6af2d000
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fc50f859390]
[ 1] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7971a0)[0x7fc422b991a0]
[ 2] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x61ce58)[0x7fc422a1ee58]
[ 3] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7cf482)[0x7fc422bd1482]
[ 4] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7d0879)[0x7fc422bd2879]
[ 5] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x2ef784)[0x7fc4226f1784]
[ 6] /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x13c)[0x7fc50fb97d6c]
[ 7] /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallKeywords+0x4d)[0x7fc50fb97bfd]
[ 8] /usr/local/lib/libpython3.6m.so.1.0(+0x18b9fb)[0x7fc50fbf09fb]
[ 9] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1363)[0x7fc50fbe9a83]
[10] /usr/local/lib/libpython3.6m.so.1.0(+0x18be0a)[0x7fc50fbf0e0a]
[11] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fc50fbf0adb]
[12] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x34f)[0x7fc50fbe8a6f]
[13] /usr/local/lib/libpython3.6m.so.1.0(+0x18c0f3)[0x7fc50fbf10f3]
[14] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fc50fbf0adb]
[15] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1363)[0x7fc50fbe9a83]
[16] /usr/local/lib/libpython3.6m.so.1.0(+0x18be0a)[0x7fc50fbf0e0a]
[17] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fc50fbf0adb]
[18] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x34f)[0x7fc50fbe8a6f]
[19] /usr/local/lib/libpython3.6m.so.1.0(_PyFunction_FastCallDict+0x41b)[0x7fc50fbf2acb]
[20] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x10e)[0x7fc50fb5e86e]
[21] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_Call_Prepend+0x61)[0x7fc50fb5f1f1]
[22] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fc50fb5ec17]
[23] /usr/local/lib/libpython3.6m.so.1.0(+0x147506)[0x7fc50fbac506]
[24] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fc50fb5ec17]
[25] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1c69)[0x7fc50fbea389]
[26] /usr/local/lib/libpython3.6m.so.1.0(_PyFunction_FastCallDict+0x41b)[0x7fc50fbf2acb]
[27] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x10e)[0x7fc50fb5e86e]
[28] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_Call_Prepend+0x61)[0x7fc50fb5f1f1]
[29] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fc50fb5ec17]
*** End of error message ***

EDIT:
The segfault seems to have stopped after #798.

_canonical_scale = 224
_roi_size = 14
_roi_sample_ratio = 2
segm_size = _roi_size * 2
Member

Why is this member public?

@yuyu2172 yuyu2172 Apr 18, 2019

This is necessary for mask_head_loss_pre.
To avoid exposing it, should we make mask_head_loss_pre a method of MaskHead?

Member

I see. It is ok to make this member public.


:obj:`imgs`, ":math:`[(3, H, W)]`", :obj:`float32`, \
"RGB, :math:`[0, 255]`"
:obj:`rois`, ":math:`[(R', 4)]`", :obj:`float32`, \
Member

Can we add a line between the inputs and the outputs? (Or just make two tables.)

Member Author

OK

# pad with repeated border values), we manually zero-pad the masks by 1
# pixel prior to resizing back to the original image resolution.
# This prevents "top hat" artifacts. We therefore need to expand
# the reference boxes by an appropriate factor.
Member

We have two similar blocks. Can we merge them?

Member Author

OK
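
The code comment under review in this thread describes zero-padding each mask by 1 pixel and expanding the reference box to match. A minimal sketch of that idea in plain Python follows. It assumes the expansion factor is (M + 2) / M for an M x M mask, so that the unpadded region still lines up with the original box, and it assumes boxes in (y_min, x_min, y_max, x_max) order. The function names are hypothetical and this is not the code from the PR.

```python
def pad_mask(mask):
    """Zero-pad a non-empty 2-D mask (list of rows) by one pixel per side."""
    width = len(mask[0])
    zero_row = [0.0] * (width + 2)
    padded = [zero_row[:]]
    for row in mask:
        padded.append([0.0] + list(row) + [0.0])
    padded.append(zero_row[:])
    return padded


def expand_box(box, scale):
    """Scale a (y_min, x_min, y_max, x_max) box about its center."""
    y_min, x_min, y_max, x_max = box
    center_y = (y_min + y_max) / 2.0
    center_x = (x_min + x_max) / 2.0
    half_h = (y_max - y_min) / 2.0 * scale
    half_w = (x_max - x_min) / 2.0 * scale
    return (center_y - half_h, center_x - half_w,
            center_y + half_h, center_x + half_w)


if __name__ == "__main__":
    segm_size = 28  # _roi_size * 2, as in the snippet reviewed earlier
    scale = (segm_size + 2.0) / segm_size  # assumed expansion factor
    print(expand_box((0.0, 0.0, 28.0, 28.0), scale))
    print(pad_mask([[1.0]]))
```

With a 28 x 28 mask and a 28-pixel box, the expanded box grows by one pixel on every side, which matches the one-pixel border added to the mask.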

@yuyu2172
Member Author

  1. Define min_size and max_size in train_multi.py.
  2. Use only the cv2 backend in mask_utils.py:
    • Change the behavior of resize so that it raises an error when chainer.config.cv_resize_backend == 'cv2' and cv2 is not installed.
    • Force the cv2 backend in mask_utils.py.
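
The proposed resize behavior can be sketched as a small backend-selection helper. This is only an illustration under stated assumptions, not the code in the PR: the function name select_resize_backend is hypothetical, and the accepted values ('cv2', 'PIL', or None for no preference) are assumed for the setting.

```python
import importlib.util


def select_resize_backend(configured, cv2_available=None):
    """Return the backend name to use, or raise if the request cannot be met.

    configured: 'cv2', 'PIL', or None (no preference).
    cv2_available: override for testing; detected automatically when None.
    """
    if cv2_available is None:
        # find_spec returns None when the module cannot be located.
        cv2_available = importlib.util.find_spec("cv2") is not None
    if configured == "cv2":
        if not cv2_available:
            # Fail loudly instead of silently falling back to PIL.
            raise ValueError(
                "cv_resize_backend is 'cv2' but cv2 is not installed")
        return "cv2"
    if configured == "PIL":
        return "PIL"
    if configured is None:
        return "cv2" if cv2_available else "PIL"
    raise ValueError("unknown resize backend: {!r}".format(configured))
```

Failing when cv2 is explicitly requested but missing means mask_utils.py cannot end up on a different interpolation path without the user noticing.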

@yuyu2172
Member Author

pfnCI, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 6acc48f (2975879):

@Hakuyume
Member

pfnCI, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 6acc48f (2975879):

@yuyu2172
Member Author

pfnCI, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 58ef6c7 (15c5cdc):

@yuyu2172
Member Author

pfnCI, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 6c4dc80 (9fad50d):

@Hakuyume Hakuyume merged commit 938c86f into chainer:master Apr 23, 2019
@yuyu2172 yuyu2172 deleted the mask-rcnn branch May 14, 2019 14:07