auto-detect multi-node based on env vars #832

Merged
merged 2 commits into from
Sep 7, 2023


Contributor

@bryevdv bryevdv commented Sep 6, 2023

fixes: #829

This PR adds support for adjusting the default nodes and ranks_per_node values when common, relevant environment variables are present. This is most useful when legate is "externally launched". When auto-detection occurs, the --help string is also updated with the corresponding information:

  --nodes NODES         Number of nodes to use [default auto-detected from OMPI] (default: 4)
  --ranks-per-node RANKS_PER_NODE
                        Number of ranks (processes running copies of the program) to launch per node. [default auto-detected from OMPI] (default: 2)
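To illustrate the idea, here is a minimal sketch of how such defaults might be derived from the Open MPI variables mentioned in this thread. It is not the code from this PR: the function name `detect_ompi_defaults` is hypothetical, and the rule that nodes equals world size divided by local size is an assumption that only holds when every node runs the same number of ranks.

```python
import os
from typing import Mapping, Optional, Tuple


def detect_ompi_defaults(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[int, int]]:
    """Guess (nodes, ranks_per_node) from Open MPI variables, or return None.

    Hypothetical helper, not part of legate. Assumes an even distribution
    of ranks across nodes.
    """
    try:
        world_size = int(env["OMPI_COMM_WORLD_SIZE"])
        local_size = int(env["OMPI_COMM_WORLD_LOCAL_SIZE"])
    except (KeyError, ValueError):
        # Variables absent or not integers: nothing to auto-detect.
        return None
    if world_size <= 0 or local_size <= 0 or world_size % local_size != 0:
        # Inconsistent values: fall back to the regular defaults.
        return None
    return world_size // local_size, local_size
```

Under these assumptions, `OMPI_COMM_WORLD_SIZE=8` with `OMPI_COMM_WORLD_LOCAL_SIZE=2` yields `(4, 2)`, which is consistent with the defaults shown in the help text above.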

@bryevdv bryevdv added the category:improvement PR introduces an improvement and will be classified as such in release notes label Sep 6, 2023
@bryevdv bryevdv requested a review from manopapad September 6, 2023 20:58
@bryevdv
Contributor Author

bryevdv commented Sep 6, 2023

@manopapad note that I do also see:

dev310 ❯ OMPI_COMM_WORLD_SIZE=4 OMPI_COMM_WORLD_LOCAL_SIZE=2 OMPI_COMM_WORLD_RANK=1 legate 
WARNING: Disabling control replication for interactive run
Error: Could not determine node-local rank

but I am not sure that is a real issue (e.g. are other env vars necessary for "real" usage?)

@manopapad
Contributor

> but I am not sure that is a real issue (e.g. are other env vars necessary for "real" usage?)

That error message is coming from bind.sh I believe. Try also specifying OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_LOCAL_RANK.

@bryevdv
Contributor Author

bryevdv commented Sep 7, 2023

👍 adding OMPI_COMM_WORLD_LOCAL_RANK resolved the bind.sh issue
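When simulating an external launch by hand, a quick check like the following sketch can flag which of the variables used in this thread are unset before running legate. The helper name `missing_launch_vars` is hypothetical, and the list reflects only the variables discussed here, not a verified list of what bind.sh requires.

```python
import os
from typing import List, Mapping

# Variables that came up in this thread (an assumption, not a complete
# or authoritative list of what legate or bind.sh consults).
THREAD_VARS = (
    "OMPI_COMM_WORLD_SIZE",
    "OMPI_COMM_WORLD_LOCAL_SIZE",
    "OMPI_COMM_WORLD_RANK",
    "OMPI_COMM_WORLD_LOCAL_RANK",
)


def missing_launch_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names from THREAD_VARS that are not set in env."""
    return [name for name in THREAD_VARS if name not in env]
```

For the environment in the failing command above, this would report only `OMPI_COMM_WORLD_LOCAL_RANK` as missing.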

@bryevdv bryevdv requested a review from manopapad September 7, 2023 00:09
@bryevdv bryevdv merged commit 3208b58 into nv-legate:branch-23.09 Sep 7, 2023
@bryevdv bryevdv deleted the bv/auto-dectect-multi-node branch September 7, 2023 17:07
Development

Successfully merging this pull request may close these issues.

Launcher: detect when externally launched
2 participants