Adding -mca comm_method to print table of communication methods #5507

Merged (1 commit) on Nov 14, 2019

Conversation

@markalle (Contributor) commented Aug 1, 2018

This is closely related to Platform-MPI's old -prot feature.

The long-format of the tables it prints could look like this:

>   Host 0 [myhost001] ranks 0 - 1
>   Host 1 [myhost002] ranks 2 - 3
>   Host 2 [myhost003] ranks 4
>   Host 3 [myhost004] ranks 5
>   Host 4 [myhost005] ranks 6
>   Host 5 [myhost006] ranks 7
>   Host 6 [myhost007] ranks 8
>   Host 7 [myhost008] ranks 9
>   Host 8 [myhost009] ranks 10
>
>    host | 0    1    2    3    4    5    6    7    8
>   ======|==============================================
>       0 : sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       1 : tcp  sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       2 : tcp  tcp  self tcp  tcp  tcp  tcp  tcp  tcp
>       3 : tcp  tcp  tcp  self tcp  tcp  tcp  tcp  tcp
>       4 : tcp  tcp  tcp  tcp  self tcp  tcp  tcp  tcp
>       5 : tcp  tcp  tcp  tcp  tcp  self tcp  tcp  tcp
>       6 : tcp  tcp  tcp  tcp  tcp  tcp  self tcp  tcp
>       7 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  self tcp
>       8 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp  self
>
>   Connection summary:
>     on-host:  all connections are sm or self
>     off-host: all connections are tcp

In this example hosts 0 and 1 had multiple ranks, so "sm" was more
meaningful than "self" for identifying how the ranks on those hosts talk
to each other, while hosts 2..8 had one rank per host, so "self" was the
more meaningful BTL to report.

Above a certain number of hosts (12 by default) the table above gets too big,
so we shrink it to a more abbreviated table that carries the same data:

>    host | 0 1 2 3 4       8
>   ======|====================
>       0 : A C C C C C C C C
>       1 : C A C C C C C C C
>       2 : C C B C C C C C C
>       3 : C C C B C C C C C
>       4 : C C C C B C C C C
>       5 : C C C C C B C C C
>       6 : C C C C C C B C C
>       7 : C C C C C C C B C
>       8 : C C C C C C C C B
>   key: A == sm
>   key: B == self
>   key: C == tcp

Then above 36 hosts we stop printing the 2d table entirely and just print the
summary:

>   Connection summary:
>     on-host:  all connections are sm or self
>     off-host: all connections are tcp

The options to control it are

    -mca comm_method 1   :   print the above table at the end of MPI_Init
    -mca comm_method 2   :   print the above table at the beginning of MPI_Finalize

The most important difference between these two is that when printing the table
during MPI_Init(), we send extra messages to make sure all hosts are connected to
each other. So the table ends up working against the idea of on-demand connections
(although it's only forcing the n^2 connections in the number of hosts, not the
total ranks). If printing at MPI_Finalize() we don't create any connections that
aren't already connected, so the table is more likely to have "n/a" entries if
some hosts never connected to each other.

The other tunable is the environment variable MPI_COMM_METHOD_MAX (default 12),
which controls the host count at which the unabbreviated / abbreviated
2d tables get printed (see the sketch after this list):

    1 - n      : full size 2d table
    n+1 - 3n   : shortened 2d table
    3n+1 - inf : summary, no 2d table
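
A minimal sketch of that threshold logic follows; the names here are hypothetical, not the PR's actual identifiers, and n corresponds to the MPI_COMM_METHOD_MAX setting:

    /* Hypothetical illustration of the host-count thresholds described above. */
    enum table_format { TABLE_FULL, TABLE_ABBREVIATED, TABLE_SUMMARY_ONLY };

    static enum table_format pick_table_format(int nhosts, int n)
    {
        if (nhosts <= n)     { return TABLE_FULL;         /* 1 .. n      */ }
        if (nhosts <= 3 * n) { return TABLE_ABBREVIATED;  /* n+1 .. 3n   */ }
        return TABLE_SUMMARY_ONLY;                        /* 3n+1 .. inf */
    }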

The source of the information used in the table is each component's .mca_component_name field.

In the case of BTLs, the module always had a .btl_component field linking back to the
component. This adds similar .pml_component and .mtl_component fields to
those modules. Note that when setting the .pml_component field I noticed nobody was
setting .pml_flags, so I added a 0 for that field as well.

So with the new field linking back to the component, we can then access
the component name with code like this:

    mca_pml.pml_component->pmlm_version.mca_component_name

See the three lookup_{pml,mtl,btl}_name() functions in hook_comm_method_fns.c,
and their use in comm_method() to parse the strings and produce an integer
to represent the connection type being used.
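
As a rough illustration (not the exact code from hook_comm_method_fns.c), a PML lookup along those lines could look like:

    /* Sketch only: follow the new .pml_component back-link to get the
     * selected PML's component name (e.g. "ob1" or "cm"). */
    static const char *lookup_pml_name(void)
    {
        return mca_pml.pml_component->pmlm_version.mca_component_name;
    }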

I think the weakest part is the strings_map[] list of recognized names in hook_comm_method_fns.c that's needed in order to have a decent mapping between the strings and integers. If someone adds a new btl/mtl/pml with a new name that's not in this list, it will map to 0 for unknown and print as "n/a".
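
For illustration, the shape of that mapping is roughly the following; the entries shown are examples, not the full list in the PR:

    #include <string.h>

    /* Sketch of the string <-> integer mapping; index 0 is reserved for
     * unknown names and prints as "n/a". */
    static const char *strings_map[] = {
        "n/a",      /* 0: unrecognized component name */
        "self",
        "vader",
        "tcp",
        /* ... the real list carries every recognized btl/mtl/pml name ... */
    };

    static int method_string_to_id(const char *str)
    {
        if (!str) { return 0; }   /* default to "n/a" for any bad input */
        for (size_t i = 1; i < sizeof(strings_map)/sizeof(strings_map[0]); ++i) {
            if (0 == strcmp(str, strings_map[i])) { return (int)i; }
        }
        return 0;                 /* unrecognized name maps to "n/a" */
    }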

Signed-off-by: Mark Allen markalle@us.ibm.com

@bosilca (Member) left a comment:

This patch is not showing the connectivity but the BTL with the highest priority from the send list of BTLs. In most clusters these two might be equivalent, but there are situations where this is not the case.

if (!str) { return 0; } // default to "n/a" for any bad or unrecognized inputs

suffix_match = -1;
match = 0; // default to "n/a" for anything unreognized
Member:
typo ("unreognized" should be "unrecognized")

@@ -523,6 +524,8 @@ struct mca_pml_base_module_1_0_1_t {
uint32_t pml_max_contextid;
int pml_max_tag;
int pml_flags;

mca_pml_base_component_t *pml_component;
Member:
Isn't this equivalent to using mca_pml_base_selected_component or ompi_mtl_base_selected_component (which are supposed to either be NULL if the corresponding type was not selected or contain the component selected)?

// settings, this is only for testing, eg to make sure the printing comes out
// right.
if (myleaderrank == 0) {
p = getenv("MPI_COMM_METHOD_FAKEFILE");
Member:
undocumented.

Member:
Not only is this undocumented -- why wouldn't you use an MCA parameter for this? (and the other place(s) where you use getenv())

max2D1Cprottable = 3 * max2Dprottable;
}

hostidprotptr = getenv("MPI_COMM_METHOD_BRIEF");
Member:
undocumented.

@markalle (Contributor, Author):
I updated a few things, switched to -mca for the settings, even the "fakefile" one that's really just for internal debugging.

Thanks a bunch for the mca_pml_base_selected_component and ompi_mtl_base_selected_component variables, that helped simplify this checkin a lot.

So what about the other main comment, about the highest-priority item from the send list of BTLs: is there anything else that would be more meaningful there?

I could probably redo it to understand the possibility of multiple BTLs between two hosts. I'd switch the internals to a mask, and for the printout use the second format for the table, using A,B,C,D etc and a key to identify things. I don't think the current code is very close to letting that work but it's not impossible.

I still lean toward thinking it's good enough without supporting printout of multiple BTLs between two hosts, though. It still has decent utility in letting a person make sure they're really running what they think they're running.

I'd still like to have a 1d table at some point with more per-host info like IP addresses, and that 1d table is where I'd be more prone to list multiple items for a host, e.g. listing multiple IP addresses for a host, or listing it as having both an ibverbs BTL and a tcp BTL active there.

// Find the send btl's module:component:name for the incoming comm,rank
static char*
lookup_btl_name_for_send(ompi_communicator_t* comm, int rank) {
ompi_proc_t *dst_proc = ompi_comm_peer_lookup (comm, rank);
Member:
Using this function forces the allocation of the underlying proc structure, but not its initialization (so the endpoint will be NULL). I suggest using ompi_group_peer_lookup_existing(comm->c_remote_group, rank) instead, which will prevent the allocation of the proc if the proc is still the sentinel.
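
For concreteness, the suggested change amounts to something like this sketch (error handling abbreviated):

    /* Look the proc up only if it already exists; if it is still a sentinel
     * no proc structure is allocated and we report nothing for this peer. */
    ompi_proc_t *dst_proc =
        ompi_group_peer_lookup_existing (comm->c_remote_group, rank);
    if (NULL == dst_proc) {
        return NULL;
    }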

return NULL;
}

char *strings_map[] = {
Member:
This approach is extremely static. Why not use a hash table for the names?

@markalle (Contributor, Author):
I guess it could be a hash, but I'd still need a way to convert the hashes back to strings. So I'd either still have a static list of strings to walk, or else have all the ranks include their list of strings in the messages they send to rank 0 so it could build a complete list of strings.

The other part is right now the code can distinguish between "ofi" as an MTL and "ofi" as a BTL. If it was just a general string hash I don't think it would know that both existed. Maybe that's a minor enough feature I shouldn't care.

I'd hate for the table that currently says "tcp" to get unnecessarily bloated to "tcp(btl)" etc.

If we throw out the first long-format table

>    host | 0    1    2    3    4    5    6    7    8
>   ======|==============================================
>       0 : sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       1 : tcp  sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       2 : tcp  tcp  self tcp  tcp  tcp  tcp  tcp  tcp
>       3 : tcp  tcp  tcp  self tcp  tcp  tcp  tcp  tcp
>       4 : tcp  tcp  tcp  tcp  self tcp  tcp  tcp  tcp
>       5 : tcp  tcp  tcp  tcp  tcp  self tcp  tcp  tcp
>       6 : tcp  tcp  tcp  tcp  tcp  tcp  self tcp  tcp
>       7 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  self tcp
>       8 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp  self

and only used the more abbreviated form

>    host | 0 1 2 3 4       8
>   ======|====================
>       0 : A C C C C C C C C
>       1 : C A C C C C C C C
>       2 : C C B C C C C C C
>       3 : C C C B C C C C C
>       4 : C C C C B C C C C
>       5 : C C C C C B C C C
>       6 : C C C C C C B C C
>       7 : C C C C C C C B C
>       8 : C C C C C C C C B
>   key: A == sm
>   key: B == self
>   key: C == tcp

then the extra characters wouldn't give me any heartburn. But I still find the long form more readable than the above, so I prefer it for smaller runs.

Member:
Can you comment why some hosts talk to the same host over sm, and other hosts use self?
Can we infer that the hosts that talk to themselves over self only have 1 rank on them?

@markalle (Contributor, Author):

Yeah, it's identifying the local-host traffic. If there's one rank on the host then it's just 'self'. If there are multiple ranks then both 'self' and 'vader' would be active between ranks on that host, so it shows 'vader', as that's the more useful piece of info.

Member:
I have to say that reading such a table is not trivial, as one might expect to read connectivity between MPI ranks when in fact this table shows connectivity (one of the possible BTLs) between nodes (all ranks accumulated together). Moreover, it works best for regular and symmetric cases in homogeneous environments, exactly the places where the best answer would have been "tcp across nodes, vader between processes on the same node, and self for communication within the same process".

@markalle (Contributor, Author) commented Jan 8, 2019

Repushed just with the change to use ompi_group_peer_lookup_existing() in the BTL lookup, still using the static list in strings_map[] as the mapping that turns the known interconnect strings into numbers and back.

@ibm-ompi commented Jan 8, 2019

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/2daa7e94eb91dd18632a74c859f0916b

@jjhursey (Member) commented Jan 8, 2019

The IBM PGI failure seems unrelated, so it can be disregarded. We updated the compiler from 17.10 to 18.10, and there must be a regression.

@markalle (Contributor, Author):
retest
bot:retest

@gpaulsen (Member) commented Apr 3, 2019

@markalle What's the fate of this PR? Can you please rebase it, and try again for late April?

@markalle (Contributor, Author):
Okay, I fixed the other big sticking point nobody liked: the static list of strings for the recognized interconnects. Now they Allreduce their individual lists into a uniform list so the ranks can map strings to id#'s and all have the same result.

Related to this I decided that my original objection (that the "portals4" name, for example, exists both as a btl and an mtl and is thus ambiguous) isn't a big deal. The pml/mtl tables are always pretty boring, with just a bunch of uniform entries anyway. So I let the strings be used as-is, and I added a little note in the "Connection summary" section saying whether the thing being shown in the table is a btl, mtl, or pml.
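
This is not the PR's actual code, but the merge step is conceptually along these lines, assuming each rank's names are packed into one NUL-separated buffer (deduplication omitted for brevity):

    #include <mpi.h>
    #include <stdlib.h>

    /* Every rank contributes its local, NUL-separated list of component names;
     * after the allgather each rank holds the same concatenated list and can
     * derive an identical string -> id numbering from it. */
    static char *gather_name_lists(const char *local, int local_len, int *total_len)
    {
        int nranks;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        int *lens   = malloc(nranks * sizeof(int));
        int *displs = malloc(nranks * sizeof(int));
        MPI_Allgather(&local_len, 1, MPI_INT, lens, 1, MPI_INT, MPI_COMM_WORLD);

        *total_len = 0;
        for (int i = 0; i < nranks; ++i) { displs[i] = *total_len; *total_len += lens[i]; }

        char *all = malloc(*total_len);
        MPI_Allgatherv(local, local_len, MPI_CHAR,
                       all, lens, displs, MPI_CHAR, MPI_COMM_WORLD);

        free(lens); free(displs);
        return all;   /* caller dedups this in rank order, identically on every rank */
    }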

@gpaulsen (Member) commented Apr 26, 2019

I added the "State: Awaiting user information" label, as it'd be nice to have community buy-in on this approach before merging to master.

@gpaulsen (Member) commented Sep 5, 2019

@markalle Can you please rebase this, and we can try to get this into master?

This is closely related to Platform-MPI's old -prot feature.

The long-format of the tables it prints could look like this:
>   Host 0 [myhost001] ranks 0 - 1
>   Host 1 [myhost002] ranks 2 - 3
>   Host 2 [myhost003] ranks 4
>   Host 3 [myhost004] ranks 5
>   Host 4 [myhost005] ranks 6
>   Host 5 [myhost006] ranks 7
>   Host 6 [myhost007] ranks 8
>   Host 7 [myhost008] ranks 9
>   Host 8 [myhost009] ranks 10
>
>    host | 0    1    2    3    4    5    6    7    8
>   ======|==============================================
>       0 : sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       1 : tcp  sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       2 : tcp  tcp  self tcp  tcp  tcp  tcp  tcp  tcp
>       3 : tcp  tcp  tcp  self tcp  tcp  tcp  tcp  tcp
>       4 : tcp  tcp  tcp  tcp  self tcp  tcp  tcp  tcp
>       5 : tcp  tcp  tcp  tcp  tcp  self tcp  tcp  tcp
>       6 : tcp  tcp  tcp  tcp  tcp  tcp  self tcp  tcp
>       7 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  self tcp
>       8 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp  self
>
>   Connection summary:
>     on-host:  all connections are sm or self
>     off-host: all connections are tcp

In this example hosts 0 and 1 had multiple ranks, so "sm" was more
meaningful than "self" for identifying how the ranks on those hosts talk
to each other, while hosts 2..8 had one rank per host, so "self" was the
more meaningful BTL to report.

Above a certain number of hosts (12 by default) the table above gets too big,
so we shrink it to a more abbreviated table that carries the same data:
>    host | 0 1 2 3 4       8
>   ======|====================
>       0 : A C C C C C C C C
>       1 : C A C C C C C C C
>       2 : C C B C C C C C C
>       3 : C C C B C C C C C
>       4 : C C C C B C C C C
>       5 : C C C C C B C C C
>       6 : C C C C C C B C C
>       7 : C C C C C C C B C
>       8 : C C C C C C C C B
>   key: A == sm
>   key: B == self
>   key: C == tcp

Then above 36 hosts we stop printing the 2d table entirely and just print the
summary:
>   Connection summary:
>     on-host:  all connections are sm or self
>     off-host: all connections are tcp

The options to control it are
    -mca comm_method 1   :   print the above table at the end of MPI_Init
    -mca comm_method 2   :   print the above table at the beginning of MPI_Finalize
    -mca comm_method_max <n> :  number of hosts <n> for which to print a full size 2d table
    -mca comm_method_brief 1 :  only print summary output, no 2d table
    -mca comm_method_fakefile <filename> :  for debugging only

* printing at init vs finalize:

The most important difference between these two is that when printing the table
during MPI_Init(), we send extra messages to make sure all hosts are connected to
each other. So the table ends up working against the idea of on-demand connections
(although it's only forcing the n^2 connections in the number of hosts, not the
total ranks).  If printing at MPI_Finalize() we don't create any connections that
aren't already connected, so the table is more likely to have "n/a" entries if
some hosts never connected to each other.

* how many hosts <n> for which to print a full size 2d table

The option -mca comm_method_max <n> can be used to specify a number of hosts <n>
(default 12) that controls at what host-count the unabbreviated / abbreviated
2d tables get printed:
    1 - n      : full size 2d table
    n+1 - 3n   : shortened 2d table
    3n+1 - inf : summary only, no 2d table

* brief

The option -mca comm_method_brief 1 can be used to skip the printing of the 2d
table and only show the short summary.

* fakefile

This is a debugging option that allows easier testing of all the printout
routines by letting all the detected communication methods between the hosts
be overridden by fake data from a file.

The source of the information used in the table is each component's .mca_component_name field.

In the case of BTLs, the module always had a .btl_component linking back to the
component. The vars mca_pml_base_selected_component and ompi_mtl_base_selected_component
offer similar functionality for pml/mtl.

So with the ability to identify the component, we can then access
the component name with code like this:
    mca_pml_base_selected_component.pmlm_version.mca_component_name
See the three lookup_{pml,mtl,btl}_name() functions in hook_comm_method_fns.c,
and their use in comm_method() to parse the strings and produce an integer
to represent the connection type being used.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
@markalle (Contributor, Author) commented Nov 1, 2019

I think I've addressed all the concerns above, with maybe an exception being the fact that if multiple BTLs are used this only reports on btl[0]. I still think it's producing meaningful information though.

@gpaulsen self-requested a review on November 14, 2019 at 20:10