Adding -mca comm_method to print table of communication methods #5507

Merged (1 commit) on Nov 14, 2019

Conversation

@markalle (Contributor) commented Aug 1, 2018

This is closely related to Platform-MPI's old -prot feature.

The long-format of the tables it prints could look like this:

>   Host 0 [myhost001] ranks 0 - 1
>   Host 1 [myhost002] ranks 2 - 3
>   Host 2 [myhost003] ranks 4
>   Host 3 [myhost004] ranks 5
>   Host 4 [myhost005] ranks 6
>   Host 5 [myhost006] ranks 7
>   Host 6 [myhost007] ranks 8
>   Host 7 [myhost008] ranks 9
>   Host 8 [myhost009] ranks 10
>
>    host | 0    1    2    3    4    5    6    7    8
>   ======|==============================================
>       0 : sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       1 : tcp  sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       2 : tcp  tcp  self tcp  tcp  tcp  tcp  tcp  tcp
>       3 : tcp  tcp  tcp  self tcp  tcp  tcp  tcp  tcp
>       4 : tcp  tcp  tcp  tcp  self tcp  tcp  tcp  tcp
>       5 : tcp  tcp  tcp  tcp  tcp  self tcp  tcp  tcp
>       6 : tcp  tcp  tcp  tcp  tcp  tcp  self tcp  tcp
>       7 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  self tcp
>       8 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp  self
>
>   Connection summary:
>     on-host:  all connections are sm or self
>     off-host: all connections are tcp

In this example hosts 0 and 1 had multiple ranks, so "sm" was more
meaningful than "self" for identifying how the ranks on those hosts talk
to each other, while hosts 2..8 had one rank per host, so "self" was the
more meaningful BTL to report.

Above a certain number of hosts (12 by default) the table above gets too big,
so we shrink it to a more abbreviated table that carries the same data:

>    host | 0 1 2 3 4       8
>   ======|====================
>       0 : A C C C C C C C C
>       1 : C A C C C C C C C
>       2 : C C B C C C C C C
>       3 : C C C B C C C C C
>       4 : C C C C B C C C C
>       5 : C C C C C B C C C
>       6 : C C C C C C B C C
>       7 : C C C C C C C B C
>       8 : C C C C C C C C B
>   key: A == sm
>   key: B == self
>   key: C == tcp

Then above 36 hosts we stop printing the 2d table entirely and just print the
summary:

>   Connection summary:
>     on-host:  all connections are sm or self
>     off-host: all connections are tcp

The options to control it are

    -mca comm_method 1   :   print the above table at the end of MPI_Init
    -mca comm_method 2   :   print the above table at the beginning of MPI_Finalize

The most important difference between these two is that when printing the table
during MPI_Init(), we send extra messages to make sure all hosts are connected to
each other. So the table ends up working against the idea of on-demand connections
(although it's only forcing the n^2 connections in the number of hosts, not the
total ranks). If printing at MPI_Finalize() we don't create any connections that
aren't already connected, so the table is more likely to have "n/a" entries if
some hosts never connected to each other.

The other tunable is the environment variable MPI_COMM_METHOD_MAX (default 12),
which controls the host count at which the unabbreviated / abbreviated
2d tables get printed (see the sketch after this list):

    1 - n      : full size 2d table
    n+1 - 3n   : shortened 2d table
    3n+1 - inf : summary, no 2d table
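
A minimal sketch of that threshold logic follows; the names here are hypothetical, not the PR's actual identifiers, and n corresponds to the MPI_COMM_METHOD_MAX setting:

    /* Hypothetical illustration of the host-count thresholds described above. */
    enum table_format { TABLE_FULL, TABLE_ABBREVIATED, TABLE_SUMMARY_ONLY };

    static enum table_format pick_table_format(int nhosts, int n)
    {
        if (nhosts <= n)     { return TABLE_FULL;         /* 1 .. n      */ }
        if (nhosts <= 3 * n) { return TABLE_ABBREVIATED;  /* n+1 .. 3n   */ }
        return TABLE_SUMMARY_ONLY;                        /* 3n+1 .. inf */
    }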

The source of the information used in the table is each component's .mca_component_name field.

In the case of BTLs, the module always had a .btl_component field linking back to the
component. This adds similar .pml_component and .mtl_component fields to
those modules. Note that when setting the .pml_component field I noticed nobody was
setting .pml_flags, so I added a 0 for that field as well.

So with the new field linking back to the component, we can then access
the component name with code like this:

    mca_pml.pml_component->pmlm_version.mca_component_name

See the three lookup_{pml,mtl,btl}_name() functions in hook_comm_method_fns.c,
and their use in comm_method() to parse the strings and produce an integer
to represent the connection type being used.
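
As a rough illustration (not the exact code from hook_comm_method_fns.c), a PML lookup along those lines could look like:

    /* Sketch only: follow the new .pml_component back-link to get the
     * selected PML's component name (e.g. "ob1" or "cm"). */
    static const char *lookup_pml_name(void)
    {
        return mca_pml.pml_component->pmlm_version.mca_component_name;
    }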

I think the weakest part is the strings_map[] list of recognized names in hook_comm_method_fns.c that's needed in order to have a decent mapping between the strings and integers. If someone adds a new btl/mtl/pml with a new name that's not in this list, it will map to 0 for unknown and print as "n/a".
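
For illustration, the shape of that mapping is roughly the following; the entries shown are examples, not the full list in the PR:

    #include <string.h>

    /* Sketch of the string <-> integer mapping; index 0 is reserved for
     * unknown names and prints as "n/a". */
    static const char *strings_map[] = {
        "n/a",      /* 0: unrecognized component name */
        "self",
        "vader",
        "tcp",
        /* ... the real list carries every recognized btl/mtl/pml name ... */
    };

    static int method_string_to_id(const char *str)
    {
        if (!str) { return 0; }   /* default to "n/a" for any bad input */
        for (size_t i = 1; i < sizeof(strings_map)/sizeof(strings_map[0]); ++i) {
            if (0 == strcmp(str, strings_map[i])) { return (int)i; }
        }
        return 0;                 /* unrecognized name maps to "n/a" */
    }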

Signed-off-by: Mark Allen markalle@us.ibm.com

@bosilca (Member) left a comment:

This patch is not showing the connectivity but the BTL with the highest priority from the send list of BTLs. In most clusters these two might be equivalent, but there are situations where this is not the case.

if (!str) { return 0; } // default to "n/a" for any bad or unrecognized inputs

suffix_match = -1;
match = 0; // default to "n/a" for anything unreognized
Member:
typo ("unreognized" should be "unrecognized")

@@ -523,6 +524,8 @@ struct mca_pml_base_module_1_0_1_t {
uint32_t pml_max_contextid;
int pml_max_tag;
int pml_flags;

mca_pml_base_component_t *pml_component;
Member:
Isn't this equivalent to using mca_pml_base_selected_component or ompi_mtl_base_selected_component (which are supposed to either be NULL if the corresponding type was not selected or contain the component selected)?

// settings, this is only for testing, eg to make sure the printing comes out
// right.
if (myleaderrank == 0) {
p = getenv("MPI_COMM_METHOD_FAKEFILE");
Member:
undocumented.

Member:
Not only is this undocumented -- why wouldn't you use an MCA parameter for this? (and the other place(s) where you use getenv())

max2D1Cprottable = 3 * max2Dprottable;
}

hostidprotptr = getenv("MPI_COMM_METHOD_BRIEF");
Member:
undocumented.

@markalle (Contributor, Author):
I updated a few things, switched to -mca for the settings, even the "fakefile" one that's really just for internal debugging.

Thanks a bunch for the mca_pml_base_selected_component and ompi_mtl_base_selected_component variables, that helped simplify this checkin a lot.

So what about the other main comment, about the highest-priority item from the send list of BTLs: is there anything else that would be more meaningful there?

I could probably redo it to understand the possibility of multiple BTLs between two hosts. I'd switch the internals to a mask, and for the printout use the second format for the table, using A,B,C,D etc and a key to identify things. I don't think the current code is very close to letting that work but it's not impossible.

I still lean toward thinking it's good enough without supporting printout of multiple BTLs between two hosts, though. It still has decent utility in letting a person make sure they're really running what they think they're running.

I'd still like to have a 1d table at some point with more per-host info like IP addresses, and that 1d table is where I'd be more prone to list multiple items for a host, e.g. listing multiple IP addresses for a host, or listing it as having both an ibverbs BTL and a tcp BTL active there.

// Find the send btl's module:component:name for the incoming comm,rank
static char*
lookup_btl_name_for_send(ompi_communicator_t* comm, int rank) {
ompi_proc_t *dst_proc = ompi_comm_peer_lookup (comm, rank);
Member:
Using this function forces the allocation of the underlying proc structure, but not its initialization (so the endpoint will be NULL). I suggest using ompi_group_peer_lookup_existing(comm->c_remote_group, rank) instead, which will prevent the allocation of the proc if the proc is still the sentinel.
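
For concreteness, the suggested change amounts to something like this sketch (error handling abbreviated):

    /* Look the proc up only if it already exists; if it is still a sentinel
     * no proc structure is allocated and we report nothing for this peer. */
    ompi_proc_t *dst_proc =
        ompi_group_peer_lookup_existing (comm->c_remote_group, rank);
    if (NULL == dst_proc) {
        return NULL;
    }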

return NULL;
}

char *strings_map[] = {
Member:
This approach is extremely static. Why not use a hash table for the names?

@markalle (Contributor, Author):
I guess it could be a hash, but I'd still need a way to convert the hashes back to strings. So I'd either still have a static list of strings to walk, or else have all the ranks include their list of strings in the messages they send to rank 0 so it could build a complete list of strings.

The other part is right now the code can distinguish between "ofi" as an MTL and "ofi" as a BTL. If it was just a general string hash I don't think it would know that both existed. Maybe that's a minor enough feature I shouldn't care.

I'd hate for the table that currently says "tcp" to get unnecessarily bloated to "tcp(btl)" etc.

If we throw out the first long-format table

>    host | 0    1    2    3    4    5    6    7    8
>   ======|==============================================
>       0 : sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       1 : tcp  sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       2 : tcp  tcp  self tcp  tcp  tcp  tcp  tcp  tcp
>       3 : tcp  tcp  tcp  self tcp  tcp  tcp  tcp  tcp
>       4 : tcp  tcp  tcp  tcp  self tcp  tcp  tcp  tcp
>       5 : tcp  tcp  tcp  tcp  tcp  self tcp  tcp  tcp
>       6 : tcp  tcp  tcp  tcp  tcp  tcp  self tcp  tcp
>       7 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  self tcp
>       8 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp  self

and only used the more abbreviated form

>    host | 0 1 2 3 4       8
>   ======|====================
>       0 : A C C C C C C C C
>       1 : C A C C C C C C C
>       2 : C C B C C C C C C
>       3 : C C C B C C C C C
>       4 : C C C C B C C C C
>       5 : C C C C C B C C C
>       6 : C C C C C C B C C
>       7 : C C C C C C C B C
>       8 : C C C C C C C C B
>   key: A == sm
>   key: B == self
>   key: C == tcp

then the extra characters wouldn't give me any heartburn. But I still find the long form more readable than the above, so I prefer it for smaller runs.

Member:
Can you comment why some hosts talk to the same host over sm, and other hosts use self?
Can we infer that the hosts that talk to themselves over self only have 1 rank on them?

@markalle (Contributor, Author):

Yeah, it's identifying the local-host traffic. If there's one rank on the host then it's just 'self'. If there are multiple ranks then both 'self' and 'vader' would be active between ranks on that host, so it shows 'vader', as that's the more useful piece of info.

Member:
I have to say that reading such a table is not trivial, as one might expect to read connectivity between MPI ranks when in fact this table shows connectivity (one of the possible BTLs) between nodes (all ranks accumulated together). Moreover, it works best for regular and symmetric cases in homogeneous environments, exactly the places where the best answer would have been "tcp across nodes, vader between processes on the same node, and self for communication within the same process".

@markalle (Contributor, Author) commented Jan 8, 2019

Repushed just with the change to use ompi_group_peer_lookup_existing() in the BTL lookup, still using the static list in strings_map[] as the mapping that turns the known interconnect strings into numbers and back.

@ibm-ompi commented Jan 8, 2019

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/2daa7e94eb91dd18632a74c859f0916b

@jjhursey (Member) commented Jan 8, 2019

The IBM PGI failure seems unrelated, so it can be disregarded. We updated the compiler from 17.10 to 18.10, and there must be a regression.

@markalle (Contributor, Author):
retest
bot:retest

@gpaulsen (Member) commented Apr 3, 2019

@markalle What's the fate of this PR? Can you please rebase it, and try again for late April?

@markalle (Contributor, Author):
Okay, I fixed the other big sticking point nobody liked: the static list of strings for the recognized interconnects. Now they Allreduce their individual lists into a uniform list so the ranks can map strings to id#'s and all have the same result.

Related to this I decided that my original objection (that the "portals4" name, for example, exists both as a btl and an mtl and is thus ambiguous) isn't a big deal. The pml/mtl tables are always pretty boring, with just a bunch of uniform entries anyway. So I let the strings be used as-is, and I added a little note in the "Connection summary" section saying whether the thing being shown in the table is a btl, mtl, or pml.
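
This is not the PR's actual code, but the merge step is conceptually along these lines, assuming each rank's names are packed into one NUL-separated buffer (deduplication omitted for brevity):

    #include <mpi.h>
    #include <stdlib.h>

    /* Every rank contributes its local, NUL-separated list of component names;
     * after the allgather each rank holds the same concatenated list and can
     * derive an identical string -> id numbering from it. */
    static char *gather_name_lists(const char *local, int local_len, int *total_len)
    {
        int nranks;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        int *lens   = malloc(nranks * sizeof(int));
        int *displs = malloc(nranks * sizeof(int));
        MPI_Allgather(&local_len, 1, MPI_INT, lens, 1, MPI_INT, MPI_COMM_WORLD);

        *total_len = 0;
        for (int i = 0; i < nranks; ++i) { displs[i] = *total_len; *total_len += lens[i]; }

        char *all = malloc(*total_len);
        MPI_Allgatherv(local, local_len, MPI_CHAR,
                       all, lens, displs, MPI_CHAR, MPI_COMM_WORLD);

        free(lens); free(displs);
        return all;   /* caller dedups this in rank order, identically on every rank */
    }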

@gpaulsen (Member) commented Apr 26, 2019

I added the "State: Awaiting user information" label, as it'd be nice to have community buy-in on this approach before merging to master.

@gpaulsen (Member) commented Sep 5, 2019

@markalle Can you please rebase this, and we can try to get this into master?

This is closely related to Platform-MPI's old -prot feature.

The long-format of the tables it prints could look like this:
>   Host 0 [myhost001] ranks 0 - 1
>   Host 1 [myhost002] ranks 2 - 3
>   Host 2 [myhost003] ranks 4
>   Host 3 [myhost004] ranks 5
>   Host 4 [myhost005] ranks 6
>   Host 5 [myhost006] ranks 7
>   Host 6 [myhost007] ranks 8
>   Host 7 [myhost008] ranks 9
>   Host 8 [myhost009] ranks 10
>
>    host | 0    1    2    3    4    5    6    7    8
>   ======|==============================================
>       0 : sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       1 : tcp  sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       2 : tcp  tcp  self tcp  tcp  tcp  tcp  tcp  tcp
>       3 : tcp  tcp  tcp  self tcp  tcp  tcp  tcp  tcp
>       4 : tcp  tcp  tcp  tcp  self tcp  tcp  tcp  tcp
>       5 : tcp  tcp  tcp  tcp  tcp  self tcp  tcp  tcp
>       6 : tcp  tcp  tcp  tcp  tcp  tcp  self tcp  tcp
>       7 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  self tcp
>       8 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp  self
>
>   Connection summary:
>     on-host:  all connections are sm or self
>     off-host: all connections are tcp

In this example hosts 0 and 1 had multiple ranks, so "sm" was more
meaningful than "self" for identifying how the ranks on those hosts talk
to each other, while hosts 2..8 had one rank per host, so "self" was the
more meaningful BTL to report.

Above a certain number of hosts (12 by default) the table above gets too big,
so we shrink it to a more abbreviated table that carries the same data:
>    host | 0 1 2 3 4       8
>   ======|====================
>       0 : A C C C C C C C C
>       1 : C A C C C C C C C
>       2 : C C B C C C C C C
>       3 : C C C B C C C C C
>       4 : C C C C B C C C C
>       5 : C C C C C B C C C
>       6 : C C C C C C B C C
>       7 : C C C C C C C B C
>       8 : C C C C C C C C B
>   key: A == sm
>   key: B == self
>   key: C == tcp

Then above 36 hosts we stop printing the 2d table entirely and just print the
summary:
>   Connection summary:
>     on-host:  all connections are sm or self
>     off-host: all connections are tcp

The options to control it are
    -mca comm_method 1   :   print the above table at the end of MPI_Init
    -mca comm_method 2   :   print the above table at the beginning of MPI_Finalize
    -mca comm_method_max <n> :  number of hosts <n> for which to print a full size 2d table
    -mca comm_method_brief 1 :  only print summary output, no 2d table
    -mca comm_method_fakefile <filename> :  for debugging only

* printing at init vs finalize:

The most important difference between these two is that when printing the table
during MPI_Init(), we send extra messages to make sure all hosts are connected to
each other. So the table ends up working against the idea of on-demand connections
(although it's only forcing the n^2 connections in the number of hosts, not the
total ranks).  If printing at MPI_Finalize() we don't create any connections that
aren't already connected, so the table is more likely to have "n/a" entries if
some hosts never connected to each other.

* how many hosts <n> for which to print a full size 2d table

The option -mca comm_method_max <n> can be used to specify a number of hosts <n>
(default 12) that controls at what host-count the unabbreviated / abbreviated
2d tables get printed:
    1 - n      : full size 2d table
    n+1 - 3n   : shortened 2d table
    3n+1 - inf : summary only, no 2d table

* brief

The option -mca comm_method_brief 1 can be used to skip the printing of the 2d
table and only show the short summary.

* fakefile

This is a debugging option that allows easier testing of all the printout
routines by letting all the detected communication methods between the hosts
be overridden by fake data from a file.

The source of the information used in the table is each component's .mca_component_name field.

In the case of BTLs, the module always had a .btl_component linking back to the
component. The vars mca_pml_base_selected_component and ompi_mtl_base_selected_component
offer similar functionality for pml/mtl.

So with the ability to identify the component, we can then access
the component name with code like this:
    mca_pml_base_selected_component.pmlm_version.mca_component_name
See the three lookup_{pml,mtl,btl}_name() functions in hook_comm_method_fns.c,
and their use in comm_method() to parse the strings and produce an integer
to represent the connection type being used.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
@markalle (Contributor, Author) commented Nov 1, 2019

I think I've addressed all the concerns above, with maybe an exception being the fact that if multiple BTLs are used this only reports on btl[0]. I still think it's producing meaningful information though.

@gpaulsen self-requested a review on November 14, 2019 at 20:10