diff --git a/BUILD.md b/BUILD.md
index 2734f5152..41511ff81 100644
--- a/BUILD.md
+++ b/BUILD.md
@@ -10,14 +10,16 @@ without an express license agreement from NVIDIA CORPORATION or its
 affiliates is strictly prohibited.
 -->
 
-# Basic build
+# Building Legate from Source
+
+## Basic build
 
 If you are building on a cluster, first check if there are specialized scripts
 available for your cluster at
 [nv-legate/quickstart](https://github.com/nv-legate/quickstart). Even if your
 specific cluster is not covered, you may be able to adapt an existing workflow.
 
-## Getting dependencies through conda
+### Getting dependencies through conda
 
 The primary method of retrieving dependencies for Legate Core and downstream
 libraries is through [conda](https://docs.conda.io/en/latest/). You will need
@@ -58,7 +60,7 @@ doing this in your shell startup script):
 conda activate legate
 ```
 
-## Building from source
+### Building from source
 
 Build and install basic C++ core:
 
@@ -102,15 +104,14 @@ of machines. For example, to configure a debug build on a
 [DGX SuperPOD](https://www.nvidia.com/en-us/data-center/dgx-superpod/) you may
 use `config/examples/arch-dgx-superpod-debug.py`.
 
-For multi-node execution Legate can use [UCX](https://openucx.org) (use `--with-ucx`, see
-[below](#ucx-optional) for more details) or [GASNet](https://gasnet.lbl.gov/) (use
-`--with-gasnet` see [below](#gasnet-optional) for more details). .
+For multi-node execution, Legate can use [UCX](https://openucx.org) (use `--with-ucx`) or
+[GASNet](https://gasnet.lbl.gov/) (use `--with-gasnet`); see [below](#dependency-listing) for more details.
 
 Compiling with networking support requires MPI.
 
-# Advanced topics
+## Advanced build topics
 
-## Support matrix
+### Support matrix
 
 The following table lists Legate's minimum supported versions of major
 dependencies.
@@ -133,14 +134,14 @@ contributions that fix such incompatibilities.
 | Python | 3.10 | |
 | NumPy | 1.22 | |
 
-## Dependency listing
+### Dependency listing
 
 In this section we comment further on our major dependencies. Please consult
 an environment file created by `generate-conda-envs.py` for a full listing of
 dependencies, e.g. building and testing tools, and for exact version
 requirements.
 
-### Operating system
+#### Operating system
 
 Legate has been tested on Linux and MacOS, although only a few flavors of
 Linux such as Ubuntu have been thoroughly tested. There is currently no support for
 Windows.
 
@@ -149,7 +150,7 @@ Windows.
 Specify your OS when creating a conda environment file through the
 `--os` flag of `generate-conda-envs.py`.
 
-### Python
+#### Python
 
 In terms of Python compatibility, Legate *roughly* follows the timeline outlined
 in [NEP 29](https://numpy.org/neps/nep-0029-deprecation_policy.html).
 
@@ -157,7 +158,7 @@ in [NEP 29](https://numpy.org/neps/nep-0029-deprecation_policy.html).
 Specify your desired Python version when creating a conda environment file
 through the `--python` flag of `generate-conda-envs.py`.
 
-### C++ compiler
+#### C++ compiler
 
 We suggest that you avoid using the compiler packages available on
 conda-forge. These compilers are configured with the specific goal of building
 
@@ -170,7 +171,7 @@ If you want to pull the compilers from conda, use an environment file created
 by `generate-conda-envs.py` using the `--compilers` flag. An appropriate
 compiler for the target OS will be chosen automatically.
 
-### CUDA (optional)
+#### CUDA (optional)
 
 Only necessary if you wish to run with Nvidia GPUs.
 
@@ -191,7 +192,7 @@ architectures.
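+Before settling on a CUDA toolkit version, you can check what your driver and
+GPUs support using standard NVIDIA utilities (shown here as a convenience; the
+`compute_cap` query requires a reasonably recent driver, and Volta corresponds
+to compute capability 7.0):
+
+```
+nvidia-smi --query-gpu=driver_version,compute_cap --format=csv
+nvcc --version  # reports the installed toolkit, if any
+```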
You can use Legate with Pascal GPUs as well, but there could be issues due to lack of independent thread scheduling. Please report any such issues on GitHub. -### CUDA libraries (optional) +#### CUDA libraries (optional) Only necessary if you wish to run with Nvidia GPUs. @@ -215,7 +216,7 @@ them from the environment file (or invoke `generate-conda-envs.py` with `--ctk none`, which will skip them all), and pass the corresponding `--with-` flag to `configure` (or let the build process attempt to locate them automatically). -### OpenBLAS +#### OpenBLAS Used by cuNumeric for implementing linear algebra routines on CPUs. @@ -239,7 +240,7 @@ OpenBLAS configured with the following options: BLAS work. If `NUM_PARALLEL` is not high enough, some of this parallel work will be serialized. -### TBLIS +#### TBLIS Used by cuNumeric for implementing tensor contraction routines on CPUs. @@ -256,14 +257,14 @@ cuNumeric requires a build of TBLIS configured as follows: and additionally `--enable-thread-model=openmp` if cuNumeric is compiled with OpenMP support. -### Numactl (optional) +#### Numactl (optional) Required to support CPU and memory binding in the Legate launcher. Not available on conda; typically available through the system-level package manager. -### MPI (optional) +#### MPI (optional) Only necessary if you wish to run on multiple nodes. @@ -279,7 +280,7 @@ file created by `generate-conda-envs.py` using the `--openmpi` flag. Legate requires a build of MPI that supports `MPI_THREAD_MULTIPLE`. -### RDMA/networking libraries (e.g. Infiniband, RoCE, Slingshot) (optional) +#### RDMA/networking libraries (e.g. Infiniband, RoCE, Slingshot) (optional) Only necessary if you wish to run on multiple nodes, using the corresponding networking hardware. @@ -291,7 +292,7 @@ Depending on your hardware, you may need to use a particular Realm networking backend, e.g. as of October 2023 HPE Slingshot is only compatible with GASNet. -### GASNet (optional) +#### GASNet (optional) Only necessary if you wish to run on multiple nodes, using the GASNet1 or GASNetEx Realm networking backend. @@ -303,7 +304,7 @@ installation. If you wish to provide an alternative installation, pass When using GASNet, you also need to specify the interconnect network of the target machine using the `--gasnet-conduit` flag. -### UCX (optional) +#### UCX (optional) Only necessary if you wish to run on multiple nodes, using the UCX Realm networking backend. @@ -319,7 +320,7 @@ of your UCX installation to `configure` (if necessary) using `--with-ucx-dir`. Legate requires a build of UCX configured with `--enable-mt`. -## Alternative sources for dependencies +### Alternative sources for dependencies If you do not wish to use conda for some (or all) of the dependencies, you can remove the corresponding entries from the environment file before passing it to diff --git a/README.md b/README.md index 9c091b9f7..736289166 100644 --- a/README.md +++ b/README.md @@ -12,214 +12,54 @@ its affiliates is strictly prohibited. # Legate -The Legate project endeavors to democratize computing by making it possible -for all programmers to leverage the power of large clusters of CPUs and GPUs -by running the same code that runs on a desktop or a laptop at scale. -Using this technology, computational and data scientists can develop and test -programs on moderately sized data sets on local machines and then immediately -scale up to larger data sets deployed on many nodes in the cloud or on a -supercomputer without any code modifications. 
+The Legate project makes it easier for programmers to leverage the +power of large clusters of CPUs and GPUs. Using Legate, programs can be +developed and tested on moderately sized data sets on local machines and +then immediately scaled up to larger data sets deployed on many nodes in +the cloud or on a supercomputer, *without any code modifications*. -The Legate project is built upon two foundational principles: +--- -1. For end users, such as computational and data scientists, the programming - model must be identical to programming a single sequential CPU on their - laptop or desktop. All concerns relating to parallelism, data - distribution, and synchronization must be implicit. The cloud or a - supercomputer should appear as nothing more than a super-powerful CPU core. -2. Software must be compositional and not just interoperable (i.e. functionally - correct). Libraries developed in the Legate ecosystem must be able to exchange - partitioned and distributed data without requiring "shuffles" or unnecessary - blocking synchronization. Computations from different libraries should be - able to use arbitrary data and still be reordered across abstraction boundaries - to hide communication and synchronization latencies where the original sequential - semantics of the program allow. This is essential for achieving speed-of-light - performance on large scale machines. +The Legate API is implemented on top of the [Legion](https://legion.stanford.edu/) +programming model and runtime system, which was originally designed for large +HPC applications that target supercomputers. -The Legate project is still in its nascent stages of development, but much of -the fundamental architecture is in place. We encourage development and contributions -to existing Legate libraries, as well as the development of new Legate libraries. -Pull requests are welcomed. +The Legate project is built from two foundational principles: -If you have questions, please contact us at legate(at)nvidia.com. - -- [Legate](#legate) - - [Why Legate?](#why-legate) - - [What is the Legate Core?](#what-is-the-legate-core) - - [How Does Legate Work?](#how-does-legate-work) - - [How Do I Install Legate?](#how-do-i-install-legate) - - [How Do I Use Legate?](#how-do-i-use-legate) - - [Distributed Launch](#distributed-launch) - - [Debugging and Profiling](#debugging-and-profiling) - - [Running Legate programs with Jupyter Notebook](#running-legate-programs-with-jupyter-notebook) - - [Installation of the Legate IPython Kernel](#installation-of-the-legate-ipython-kernel) - - [Running with Jupyter Notebook](#running-with-jupyter-notebook) - - [Configuring the Jupyter Notebook](#configuring-the-jupyter-notebook) - - [Magic Command](#magic-command) - - [Other FAQs](#other-faqs) - - [Documentation](#documentation) - - [Next Steps](#next-steps) - -## Why Legate? - -Computational problems today continue to grow both in their complexity as well -as the scale of the data that they consume and generate. This is true both in -traditional HPC domains as well as enterprise data analytics cases. Consequently, -more and more users truly need the power of large clusters of both CPUs and -GPUs to address their computational problems. Not everyone has the time or -resources required to learn and deploy the advanced programming models and tools -needed to target this class of hardware today. 
Legate aims to bridge this gap -so that any programmer can run code on any scale machine without needing to be -an expert in parallel programming and distributed systems, thereby allowing -developers to bring the problem-solving power of large machines to bear on -more kinds of challenging problems than ever before. - -## What is the Legate Core? - -The Legate Core is our version of [Apache Arrow](https://arrow.apache.org/). Apache -Arrow has significantly improved composability of software libraries by making it -possible for different libraries to share in-memory buffers of data without -unnecessary copying. However, it falls short when it comes to meeting two -of our primary requirements for Legate: - -1. Arrow only provides an API for describing a physical representation - of data as a single memory allocation. There is no interface for describing - cases where data has been partitioned and then capturing the logical - relationships of those partitioned subsets of data. -2. Arrow is mute on the subject of synchronization. Accelerators such as GPUs - achieve significantly higher performance when computations are performed - asynchronously with respect to other components of the system. When data is - passed between libraries today, accelerators must be pessimistically - synchronized to ensure that data dependences are satisfied across abstraction - boundaries. This might result in tolerable overheads for single GPU systems, - but can result in catastrophically poor performance when hundreds of GPUs are involved. - -The Legate Core provides an API very similar to Arrow's interface with several -important distinctions that provide stronger guarantees about data coherence and -synchronization to aid library developers when building Legate libraries. These -guarantees are the crux of how libraries in the Legate ecosystem are able to -provide excellent composability. - -The Legate Core API imports several important concepts from Arrow such that -users that are familiar with Arrow already will find it unsurprising. We use -the same type system representation as Arrow so libraries that have already -adopted it do not need to learn or adapt to a new type system. We also reuse -the concept of an [Array](https://arrow.apache.org/docs/cpp/api/array.html) -from Arrow. The `LegateArray` class supports many of the same methods as -the Arrow Array interface (we'll continue to add methods to improve -compatibility). The main difference is that instead of obtaining -[Buffer](https://arrow.apache.org/docs/cpp/api/memory.html#buffers) -objects from arrays to describe allocations of data that back the array, the -Legate Core API introduces a new primitive called a `LegateStore` which -provides a new interface for reasoning about partitioned and distributed -data in asynchronous execution environments. +**Implicit parallelism** -Any implementation of a `LegateStore` must maintain the following guarantees -to clients of the Legate Core API (i.e. Legate library developers): +For end users, the programming model must be identical to programming a +single sequential CPU on their laptop or desktop. Parallelism, data +distribution, and synchronization must be implicit. The cloud or a +supercomputer should appear as nothing more than a super-powerful CPU core. -1. The coherence of data contained in a `LegateStore` must be implicitly - managed by the implementation of the Legate Core API. 
This means that
- no matter where data is requested to perform a computation in a machine,
- the most recent modifications to that data in program order must be
- reflected. It should never be clients responsibility to maintain this
- coherence.
-2. It should be possible to create arbitrary views onto `LegateStore` objects
- such that library developers can precisely describe the working sets of
- their computations. Modifications to views must be reflected onto all
- aliasing views data. This property must be maintained by the Legate Core
- API implementation such that it is never the concern of clients.
-3. Dependence management between uses of the `LegateStore` objects and their
- views is the responsibility of Legate Core API regardless of what
- (asynchronous) computations are performed on `LegateStore` objects or their
- views. This dependence analysis must be both sound and precise. It is
- illegal to over-approximate dependences. This dependence analysis must also
- be performed globally in scope. Any use of the `LegateStore` on any
- processor/node in the system must abide by the original sequential
- semantics of the program
+**Composability**
-Note that we do not specify exactly what the abstractions are that are needed
-for implementing `LegateStore` objects. Our goal is not prescribe what these
-abstractions are as they may be implementation dependent. Our only requirements
-are that they have these properties to ensure that incentives are aligned in
-such a way for Legate libraries to achieve a high degree of composability
-at any scale of machine. Indeed, these requirements shift many of the burdens
-that make implementing distributed and accelerated libraries hard off of the
-library developers and onto the implementation of the Legate Core API. This
-is by design as it allows the costs to be amortized across all libraries in
-the ecosystem and ensures that Legate library developers are more productive.
+Software must be compositional and not merely interoperable. Libraries
+developed in the Legate ecosystem must be able to exchange partitioned
+and distributed data without requiring "shuffles" or unnecessary blocking
+synchronization. Computations from different libraries should be able to
+use arbitrary data and still be reordered across abstraction boundaries
+to hide communication and synchronization latencies (where the original
+sequential semantics of the program allow). This is essential to achieve
+optimal performance on large-scale machines.
-## How Does Legate Work?
+For more information about Legate's goals, architecture, and functioning,
+see the [Legate overview](https://nv-legate.github.io/legate.core/overview).
-Our implementation of the Legate Core API is built on top of the
-[Legion](https://legion.stanford.edu/) programming model and runtime system.
-Legion was originally designed for large HPC applications that target
-supercomputers and consequently applications written in the Legion programming
-model tend to both perform and scale well on large clusters of both CPUs and
-GPUs. Legion programs are also easy to port to new machines as they inherently
-decouple the machine-independent specification of computations from decisions
-about how that application is mapped to the target machine. Due to this
-abstract nature, many programmers find writing Legion programs challenging.
-By implementing the Legate Core API on top of Legion, we've made it easier -to use Legion such that developers can still get access to the benefits of -Legion without needing to learn all of the lowest-level interfaces. - -The [Legion programming model](https://legion.stanford.edu/pdfs/sc2012.pdf) -greatly aids in implementing the Legate Core API. Data types from libraries, -such as arrays in cuNumeric are mapped down onto `LegateStore` objects -that wrap Legion data types such as logical regions or futures. -In the case of regions, Legate application libraries rely heavily on -Legion's [support for partitioning of logical regions into arbitrary -subregion views](https://legion.stanford.edu/pdfs/oopsla2013.pdf). -Each library has its own heuristics for computing such partitions that -take into consideration the computations that will access the data, the -ideal sizes of data to be consumed by different processor kinds, and -the available number of processors. Legion automatically manages the coherence -of subregion views regardless of the scale of the machine. - -Computations in Legate application libraries are described by Legion tasks. -Tasks describe their data usage in terms of `LegateStore` objects, thereby -allowing Legion to infer where dependences exist. Legion uses distributed -bounding volume hierarchies, similar to a high performance ray-tracer, -to soundly and precisely perform dependence analysis on logical regions -and insert the necessary synchronization between tasks to maintain the -original sequential semantics of a Legate program. - -Each Legate application library also comes with its own custom Legion -mapper that uses heuristics to determine the best choice of mapping for -tasks (e.g. are they best run on a CPU or a GPU). All -Legate tasks are currently implemented in native C or CUDA in order to -achieve excellent performance on the target processor kind, but Legion -has bindings in other languages such as Python, Fortran, and Lua for -users that would prefer to use them. Importantly, by using Legion, -Legate is able to control the placement of data in order to leave it -in-place in fast memories like GPU framebuffers across tasks. - -When running on large clusters, Legate leverages a novel technology provided -by Legion called "[control replication](https://research.nvidia.com/sites/default/files/pubs/2021-02_Scaling-Implicit-Parallelism//ppopp.pdf)" to avoid the sequential bottleneck -of having one node farm out work to all the nodes in the cluster. With -control replication, Legate will actually replicate the Legate program and -run it across all the nodes of the machine at the same time. These copies -of the program all cooperate logically to appear to execute as one -program. When communication is necessary between -different computations, the Legion runtime's program analysis will automatically -detect it and insert the necessary data movement and synchronization -across nodes (or GPU framebuffers). This is the transformation that allows -sequential programs to run efficiently at scale across large clusters -as though they are running on a single processor. - -## How Do I Install Legate? +## Installation Legate Core is available [on conda](https://anaconda.org/legate/legate-core). 
Create a new environment containing Legate Core: ``` -mamba create -n myenv -c nvidia -c conda-forge -c legate legate-core +conda create -n myenv -c nvidia -c conda-forge -c legate legate-core ``` or install it into an existing environment: ``` -mamba install -c nvidia -c conda-forge -c legate legate-core +conda install -c nvidia -c conda-forge -c legate legate-core ``` Only linux-64 packages are available at the moment. @@ -231,250 +71,27 @@ installing on a machine without GPUs. You can force installation of a CPU-only package by requesting it as follows: ``` -mamba ... legate-core=*=*_cpu +conda ... legate-core=*=*_cpu ``` See [BUILD.md](BUILD.md) for instructions on building Legate Core from source. -## How Do I Use Legate? - -After installing the Legate Core library, the next step is to install a Legate -application library such as cuNumeric. The installation process for a -Legate application library will require you to provide a pointer to the location -of your Legate Core library installation as this will be used to configure the -installation of the Legate application library. After you finish installing any -Legate application libraries, you can then simply replace their `import` statements -with the equivalent ones from any Legate application libraries you have installed. -For example, you can change this: -```python -import numpy as np -``` -to this: -```python -import cunumeric as np -``` -After this, you can use the `legate` driver script in the `bin` directory of -your installation to run any Python program. - -You can also use the standard Python interpreter, but in that case configuration -options can only be passed through the environment (see below), and some options -are not available (check the output of legate --help for more details). - -For example, to run your script in the default configuration (4 CPUs cores and -4 GB of memory) just run: -``` -$ legate my_python_program.py [other args] -``` -The `legate` script also allows you to control the amount of resources that -Legate consumes when running on the machine. The `--cpus` and `--gpus` flags -are used to specify how many CPU and GPU processors should be used on a node. -The `--sysmem` flag can be used to specify how many MBs of DRAM Legate is allowed -to use per node, while the `--fbmem` flag controls how many MBs of framebuffer -memory Legate is allowed to use per GPU. For example, when running on a DGX -station, you might run your application as follows: -``` -$ legate --cpus 16 --gpus 4 --sysmem 100000 --fbmem 15000 my_python_program.py -``` -This will make 16 CPU processors and all 4 GPUs available for use by Legate. -It will also allow Legate to consume up to 100 GB of DRAM memory and 15 GB of -framebuffer memory per GPU for a total of 60 GB of GPU framebuffer memory. Note -that you probably will not be able to make all the resources of the machine -available for Legate as some will be used by the system or Legate itself for -meta-work. Currently if you try to exceed these resources during execution then -Legate will inform you that it had insufficient resources to complete the job -given its current mapping heuristics. If you believe the job should fit within -the assigned resources please let us know so we can improve our mapping heuristics. -There are many other flags available for use in the `legate` driver script that you -can use to communicate how Legate should view the available machine resources. 
-You can see a list of them by running: -``` -$ legate --help -``` -In addition to running Legate programs, you can also use Legate in an interactive -mode by simply not passing any `*py` files on the command line. You can still -request resources just as you would though with a normal file. Legate will -still use all the resources available to it, including doing multi-node execution. -``` -$ legate --cpus 16 --gpus 4 --sysmem 100000 --fbmem 15000 -Welcome to Legion Python interactive console ->>> -``` -Note that Legate does not currently support multi-tenancy cases where different -users are attempting to use the same hardware concurrently. - -As a convenience, several command-line options can have their default values set -via environment variables. These environment variables, their corresponding command- -line options, and their default values are as follows. - -| CLI Option | Env. Variable | Default Value | -|--------------------------|----------------------------------|---------------| -| --omps | LEGATE\_OMP\_PROCS | 0 | -| --ompthreads | LEGATE\_OMP\_THREADS | 4 | -| --utility | LEGATE\_UTILITY\_CORES | 2 | -| --sysmem | LEGATE\_SYSMEM | 4000 | -| --numamem | LEGATE\_NUMAMEM | 0 | -| --fbmem | LEGATE\_FBMEM | 4000 | -| --zcmem | LEGATE\_ZCMEM | 32 | -| --regmem | LEGATE\_REGMEM | 0 | -| --eager-alloc-percentage | LEGATE\_EAGER\_ALLOC\_PERCENTAGE | 50 | - -### Distributed Launch - -If Legate is compiled with networking support (see the -[installation section](#how-do-i-install-legate)), -it can be run in parallel by using the `--nodes` option followed by the number of nodes -to be used. Whenever the `--nodes` option is used, Legate will be launched -using `mpirun`, even with `--nodes 1`. Without the `--nodes` option, no launcher will -be used. Legate currently supports `mpirun`, `srun`, and `jsrun` as launchers and we -are open to adding additional launcher kinds. You can select the -target kind of launcher with `--launcher`. - -### Debugging and Profiling - -Legate also comes with several tools that you can use to better understand -your program both from a correctness and a performance standpoint. For -correctness, Legate has facilities for constructing both dataflow -and event graphs for the actual run of an application. These graphs require -that you have an installation of [GraphViz](https://www.graphviz.org/) -available on your machine. To generate a dataflow graph for your Legate -program simply pass the `--dataflow` flag to the `legate.py` script and after -your run is complete we will generate a `dataflow_legate.pdf` file containing -the dataflow graph of your program. To generate a corresponding event graph -you simply need to pass the `--event` flag to the `legate.py` script to generate -a `event_graph_legate.pdf` file. These files can grow to be fairly large for non-trivial -programs so we encourage you to keep your programs small when using these -visualizations or invest in a [robust PDF viewer](https://get.adobe.com/reader/). - -For profiling all you need to do is pass the `--profile` flag to Legate and -afterwards you will have a `legate_prof` directory containing a web page that -can be viewed in any web browser that displays a timeline of your program's -execution. You simply need to load the `index.html` page from a browser. You -may have to enable local JavaScript execution if you are viewing the page from -your local machine (depending on your browser). 
- -We recommend that you do not mix debugging and profiling in the same run as -some of the logging for the debugging features requires significant file I/O -that can adversely effect the performance of the application. - -## Running Legate programs with Jupyter Notebook - -Same as normal Python programs, Legate programs can be run -using Jupyter Notebook. Currently we support single node execution with -multiple CPUs and GPUs, and plan to support multi-node execution in the future. -We leverage Legion's Jupyter support, so you may want to refer to the -[relevant section in Legion's README](https://github.com/StanfordLegion/legion/blob/master/jupyter_notebook/README.md). -To simplify the installation, we provide a script specifically for Legate libraries. - -### Installation of the Legate IPython Kernel - -Please install Legate, then run the following command to install a default -Jupyter kernel: -``` -legate-jupyter -``` -If installation is successful, you will see some output like the following: -``` -Jupyter kernel spec Legate_SM_GPU (Legate_SM_GPU) has been installed -``` -`Legate_SM_GPU` is the default kernel name. - -### Running with Jupyter Notebook - -You will need to start a Jupyter server, then you can use a Jupyter notebook -from any browser. Please refer to the following two sections from the README of -the [Legion Jupyter Notebook extension](https://github.com/StanfordLegion/legion/tree/master/jupyter_notebook) - -* Start the Jupyter Notebook server -* Use the Jupyter Notebook in the browser - -### Configuring the Jupyter Notebook - -The Legate Jupyter kernel is configured according to the command line arguments -provided at install time. Standard `legate` options for Core, Memory, and -Mult-node configuration may be provided, as well as a name for the kernel: -``` -legate-jupyter --name legate_cpus_2 --cpus 2 -``` -Other configuration options can be seen by using the `--help` command line option. - -### Magic Command - -We provide a Jupyter magic command to display the IPython kernel configuration. -``` -%load_ext legate.jupyter -%legate_info -``` -results in output: -``` -Kernel 'Legate_SM_GPU' configured for 1 node(s) - -Cores: - CPUs to use per rank : 4 - GPUs to use per rank : 0 - OpenMP groups to use per rank : 0 - Threads per OpenMP group : 4 - Utility processors per rank : 2 - -Memory: - DRAM memory per rank (in MBs) : 4000 - DRAM memory per NUMA domain per rank (in MBs) : 0 - Framebuffer memory per GPU (in MBs) : 4000 - Zero-copy memory per rank (in MBs) : 32 - Registered CPU-side pinned memory per rank (in MBs) : 0 -``` - -## Other FAQs - -* *Does Legate only work on NVIDIA hardware?* - No, Legate will run on any processor supported by Legion (e.g. x86, ARM, and - PowerPC CPUs), and any network supported by GASNet or UCX (e.g. Infiniband, - Cray, Omnipath, and (ROC-)Ethernet based interconnects). - -* *What languages does the Legate Core API have bindings for?* - Currently the Legate Core bindings are only available in Python. Watch - this space for new language bindings soon or make a pull request to - contribute your own. Legion has a C API which should make it easy to - develop bindings in any language with a foreign function interface. - -* *Do I have to build drop-in replacement libraries?* - No! While we've chosen to provide drop-in replacement libraries for - popular Python libraries to illustrate the benefits of Legate, you - are both welcomed and encouraged to develop your own libraries on top - of Legate. 
We promise that they will compose well with other existing - Legate libraries. - -* *What other libraries are you planning to release for the Legate ecosystem?* - We're still working on that. If you have thoughts about what is important - please let us know so that we can get a feel for where best to put our time. - -* *Can I use Legate with other Legion libraries?* - Yes! If you're willing to extract the Legion primitives from the `LegateStore` - objects you should be able to pass them into other Legion libraries such as - [FlexFlow](https://flexflow.ai/). - -* *Does Legate interoperate with X?* - Yes, probably, but we don't recommend it. Our motivation for building - Legate is to provide the bare minimum subset of functionality that - we believe is essential for building truly composable software that can still - run at scale. No other systems out there met our needs. Providing - interoperability with those other systems will destroy the very essence - of what Legate is and significantly dilute its benefits. All that being - said, Legion does provide some means of doing stop-the-world exchanges - with other runtime system running concurrently in the same processes. - If you are interested in pursuing this approach please open an issue - on the [Legion github issue tracker](https://github.com/StanfordLegion/legion/issues) - as it will be almost entirely orthogonal to how you use Legate. - ## Documentation -A complete list of available features can is found in the [Legate Core +A complete list of available features and APIs can be found in the [Legate Core documentation](https://nv-legate.github.io/legate.core). + ## Next Steps We recommend starting by experimenting with at least one Legate application library to test out performance and see how Legate works. If you are interested -in building your own Legate application library, we recommend that you -investigate our [Legate Hello World application library](https://github.com/nv-legate/legate.core/tree/HEAD/examples/hello) that -provides a small example of how to get started developing your own drop-in -replacement library on top of Legion using the Legate Core library. +in building your own Legate application library, then the +[Legate Hello World application library](https://github.com/nv-legate/legate.core/tree/HEAD/examples/hello) +provides a getting-started example of developing your own library +on top of Legion, using the Legate Core library. + +We also encourage development and contributions to existing Legate libraries, as +well as the development of new Legate libraries. Pull requests are welcomed. + +If you have questions, please contact us at legate(at)nvidia.com. diff --git a/docs/legate/core/source/README.md b/docs/legate/core/source/README.md deleted file mode 120000 index ff5c79602..000000000 --- a/docs/legate/core/source/README.md +++ /dev/null @@ -1 +0,0 @@ -../../../../README.md \ No newline at end of file diff --git a/docs/legate/core/source/api.rst b/docs/legate/core/source/api.rst new file mode 100644 index 000000000..58b1fb2d0 --- /dev/null +++ b/docs/legate/core/source/api.rst @@ -0,0 +1,9 @@ +API Reference +============= + +.. toctree:: + :maxdepth: 2 + :caption: Contents: + + Python API Reference + C++ API Reference diff --git a/docs/legate/core/source/faq.rst b/docs/legate/core/source/faq.rst new file mode 100644 index 000000000..d9fd3bf7b --- /dev/null +++ b/docs/legate/core/source/faq.rst @@ -0,0 +1,43 @@ +Frequently Asked Questions +========================== + +Does Legate only work on NVIDIA hardware? 
+  No, Legate will run on any processor supported by Legion (e.g. x86, ARM, and
+  PowerPC CPUs), and any network supported by GASNet or UCX (e.g. Infiniband,
+  Cray, Omnipath, and (ROC-)Ethernet based interconnects).
+
+What languages does the Legate Core API have bindings for?
+  Currently the Legate Core bindings are only available in Python. Watch
+  this space for new language bindings soon or make a pull request to
+  contribute your own. Legion has a C API which should make it easy to
+  develop bindings in any language with a foreign function interface.
+
+Do I have to build drop-in replacement libraries?
+  No! While we've chosen to provide drop-in replacement libraries for
+  popular Python libraries to illustrate the benefits of Legate, you
+  are both welcomed and encouraged to develop your own libraries on top
+  of Legate. We promise that they will compose well with other existing
+  Legate libraries.
+
+What other libraries are you planning to release for the Legate ecosystem?
+  We're still working on that. If you have thoughts about what is important
+  please let us know at *legate(at)nvidia.com* so that we can get a feel for
+  where best to put our time.
+
+Can I use Legate with other Legion libraries?
+  Yes! If you're willing to extract the Legion primitives from the ``LegateStore``
+  objects you should be able to pass them into other Legion libraries such as
+  `FlexFlow <https://flexflow.ai/>`_.
+
+Does Legate interoperate with X?
+  Yes, probably, but we don't recommend it. Our motivation for building
+  Legate is to provide the bare minimum subset of functionality that
+  we believe is essential for building truly composable software that can still
+  run at scale. No other systems out there met our needs. Providing
+  interoperability with those other systems will destroy the very essence
+  of what Legate is and significantly dilute its benefits. All that being
+  said, Legion does provide some means of doing stop-the-world exchanges
+  with other runtime systems running concurrently in the same processes.
+  If you are interested in pursuing this approach please open an issue
+  on the `Legion github issue tracker <https://github.com/StanfordLegion/legion/issues>`_
+  as it will be almost entirely orthogonal to how you use Legate.
\ No newline at end of file
diff --git a/docs/legate/core/source/index.rst b/docs/legate/core/source/index.rst
index 6d17e1309..dcd3ee330 100644
--- a/docs/legate/core/source/index.rst
+++ b/docs/legate/core/source/index.rst
@@ -15,10 +15,10 @@ supercomputer without any code modifications.
   :maxdepth: 2
   :caption: Contents:
 
-  Overview
-  Build instructions
-  Python API Reference
-  C++ API Reference
+  Overview
+  User Guide
+  Frequently Asked Questions
+  API Reference
 
 .. toctree::
   :maxdepth: 1
diff --git a/docs/legate/core/source/installation.rst b/docs/legate/core/source/installation.rst
new file mode 100644
index 000000000..bcea6ddae
--- /dev/null
+++ b/docs/legate/core/source/installation.rst
@@ -0,0 +1,50 @@
+Installation
+============
+
+How Do I Install Legate?
+------------------------
+
+Legate Core is available `on conda <https://anaconda.org/legate/legate-core>`_.
+Create a new environment containing Legate Core:
+
+.. code-block:: sh
+
+   conda create -n myenv -c nvidia -c conda-forge -c legate legate-core
+
+or install it into an existing environment:
+
+.. code-block:: sh
+
+   conda install -c nvidia -c conda-forge -c legate legate-core
+
+Only linux-64 packages are available at the moment.
+
+The default package contains GPU support, and is compatible with CUDA >= 12.0
+(CUDA driver version >= r520), and Volta or later GPU architectures.
There are
+also CPU-only packages available, which will be automatically selected when
+installing on a machine without GPUs. You can force installation of a CPU-only
+package by requesting it as follows:
+
+.. code-block:: sh
+
+   conda ... legate-core=*=*_cpu
+
+See ``BUILD.md`` for instructions on building Legate Core from source.
+
+Installation of the Legate IPython Kernel
+-----------------------------------------
+
+Please install Legate, then run the following command to install a default
+Jupyter kernel:
+
+.. code-block:: sh
+
+   legate-jupyter
+
+If installation is successful, you will see some output like the following:
+
+.. code-block::
+
+   Jupyter kernel spec Legate_SM_GPU (Legate_SM_GPU) has been installed
+
+``Legate_SM_GPU`` is the default kernel name.
\ No newline at end of file
diff --git a/docs/legate/core/source/jupyter.rst b/docs/legate/core/source/jupyter.rst
new file mode 100644
index 000000000..7dd9857d5
--- /dev/null
+++ b/docs/legate/core/source/jupyter.rst
@@ -0,0 +1,66 @@
+
+
+Running Legate programs with Jupyter Notebook
+=============================================
+
+Like normal Python programs, Legate programs can be run
+using Jupyter Notebook. Currently we support single node execution with
+multiple CPUs and GPUs, and plan to support multi-node execution in the future.
+We leverage Legion's Jupyter support, so you may want to refer to the
+`relevant section in Legion's README <https://github.com/StanfordLegion/legion/blob/master/jupyter_notebook/README.md>`_.
+To simplify the installation, we provide a script specifically for Legate libraries.
+
+Running with Jupyter Notebook
+-----------------------------
+
+You will need to start a Jupyter server, then you can use a Jupyter notebook
+from any browser. Please refer to the following two sections from the README of
+the `Legion Jupyter Notebook extension <https://github.com/StanfordLegion/legion/tree/master/jupyter_notebook>`_.
+
+* Start the Jupyter Notebook server
+* Use the Jupyter Notebook in the browser
+
+Configuring the Jupyter Notebook
+--------------------------------
+
+The Legate Jupyter kernel is configured according to the command line
+arguments provided at install time. Standard ``legate`` options for Core,
+Memory, and Multi-node configuration may be provided, as well as a name for
+the kernel:
+
+.. code-block:: sh
+
+   legate-jupyter --name legate_cpus_2 --cpus 2
+
+Other configuration options can be seen by using the ``--help`` command line
+option.
+
+Magic Command
+-------------
+
+We provide a Jupyter magic command to display the IPython kernel configuration.
+
+.. code-block::
+
+   %load_ext legate.jupyter
+   %legate_info
+
+results in output:
+
+.. code-block::
+
+   Kernel 'Legate_SM_GPU' configured for 1 node(s)
+
+   Cores:
+     CPUs to use per rank : 4
+     GPUs to use per rank : 0
+     OpenMP groups to use per rank : 0
+     Threads per OpenMP group : 4
+     Utility processors per rank : 2
+
+   Memory:
+     DRAM memory per rank (in MBs) : 4000
+     DRAM memory per NUMA domain per rank (in MBs) : 0
+     Framebuffer memory per GPU (in MBs) : 4000
+     Zero-copy memory per rank (in MBs) : 32
+     Registered CPU-side pinned memory per rank (in MBs) : 0
\ No newline at end of file
diff --git a/docs/legate/core/source/overview.rst b/docs/legate/core/source/overview.rst
new file mode 100644
index 000000000..558e955ba
--- /dev/null
+++ b/docs/legate/core/source/overview.rst
@@ -0,0 +1,323 @@
+Overview
+========
+
+
+The Legate project makes it easier for programmers to leverage the
+power of large clusters of CPUs and GPUs.
Using Legate, programs can be
+developed and tested on moderately sized data sets on local machines and
+then immediately scaled up to larger data sets deployed on many nodes in
+the cloud or on a supercomputer, *without any code modifications*.
+
+The Legate API is implemented on top of the `Legion <https://legion.stanford.edu/>`_
+programming model and runtime system, which was originally designed for large
+HPC applications that target supercomputers.
+
+The Legate project is built from two foundational principles:
+
+**Implicit parallelism**
+
+For end users, the programming model must be identical to programming a
+single sequential CPU on their laptop or desktop. Parallelism, data
+distribution, and synchronization must be implicit. The cloud or a
+supercomputer should appear as nothing more than a super-powerful CPU core.
+
+**Composability**
+
+Software must be compositional and not merely interoperable. Libraries
+developed in the Legate ecosystem must be able to exchange partitioned
+and distributed data without requiring "shuffles" or unnecessary blocking
+synchronization. Computations from different libraries should be able to
+use arbitrary data and still be reordered across abstraction boundaries
+to hide communication and synchronization latencies (where the original
+sequential semantics of the program allow). This is essential to achieve
+optimal performance on large-scale machines.
+
+
+
+Why Legate
+----------
+
+Computational problems today continue to grow both in their complexity as well
+as the scale of the data that they consume and generate. This is true both in
+traditional HPC domains as well as enterprise data analytics cases. Consequently,
+more and more users truly need the power of large clusters of both CPUs and
+GPUs to address their computational problems. Not everyone has the time or
+resources required to learn and deploy the advanced programming models and tools
+needed to target this class of hardware today. Legate aims to bridge this gap
+so that any programmer can run code on machines of any scale without needing to be
+an expert in parallel programming and distributed systems, thereby allowing
+developers to bring the problem-solving power of large machines to bear on
+more kinds of challenging problems than ever before.
+
+What is the Legate Core
+-----------------------
+
+The Legate Core is our version of `Apache Arrow <https://arrow.apache.org/>`_. Apache
+Arrow has significantly improved composability of software libraries by making it
+possible for different libraries to share in-memory buffers of data without
+unnecessary copying. However, it falls short when it comes to meeting two
+of our primary requirements for Legate:
+
+1. Arrow only provides an API for describing a physical representation
+   of data as a single memory allocation. There is no interface for describing
+   cases where data has been partitioned and then capturing the logical
+   relationships of those partitioned subsets of data.
+2. Arrow is mute on the subject of synchronization. Accelerators such as GPUs
+   achieve significantly higher performance when computations are performed
+   asynchronously with respect to other components of the system. When data is
+   passed between libraries today, accelerators must be pessimistically
+   synchronized to ensure that data dependences are satisfied across abstraction
+   boundaries. This might result in tolerable overheads for single GPU systems,
+   but can result in catastrophically poor performance when hundreds of GPUs are involved.
+
+The Legate Core provides an API very similar to Arrow's interface with several
+important distinctions that provide stronger guarantees about data coherence and
+synchronization to aid library developers when building Legate libraries. These
+guarantees are the crux of how libraries in the Legate ecosystem are able to
+provide excellent composability.
+
+The Legate Core API imports several important concepts from Arrow, such that
+users already familiar with Arrow will find it unsurprising. We use
+the same type system representation as Arrow so libraries that have already
+adopted it do not need to learn or adapt to a new type system. We also reuse
+the concept of an `Array <https://arrow.apache.org/docs/cpp/api/array.html>`_
+from Arrow. The ``LegateArray`` class supports many of the same methods as
+the Arrow Array interface (we'll continue to add methods to improve
+compatibility). The main difference is that instead of obtaining
+`Buffer <https://arrow.apache.org/docs/cpp/api/memory.html#buffers>`_
+objects from arrays to describe allocations of data that back the array, the
+Legate Core API introduces a new primitive called a ``LegateStore`` which
+provides a new interface for reasoning about partitioned and distributed
+data in asynchronous execution environments.
+
+Any implementation of a ``LegateStore`` must maintain the following guarantees
+to clients of the Legate Core API (i.e. Legate library developers):
+
+1. The coherence of data contained in a ``LegateStore`` must be implicitly
+   managed by the implementation of the Legate Core API. This means that
+   no matter where data is requested to perform a computation in a machine,
+   the most recent modifications to that data in program order must be
+   reflected. It should never be the client's responsibility to maintain this
+   coherence.
+2. It should be possible to create arbitrary views onto ``LegateStore`` objects
+   such that library developers can precisely describe the working sets of
+   their computations. Modifications to views must be reflected onto all
+   aliasing views' data. This property must be maintained by the Legate Core
+   API implementation such that it is never the concern of clients.
+3. Dependence management between uses of the ``LegateStore`` objects and their
+   views is the responsibility of the Legate Core API regardless of what
+   (asynchronous) computations are performed on ``LegateStore`` objects or their
+   views. This dependence analysis must be both sound and precise. It is
+   illegal to over-approximate dependences. This dependence analysis must also
+   be performed globally in scope. Any use of the ``LegateStore`` on any
+   processor/node in the system must abide by the original sequential
+   semantics of the program.
+
+Note that we do not specify exactly what the abstractions are that are needed
+for implementing ``LegateStore`` objects. Our goal is not to prescribe what these
+abstractions are, as they may be implementation dependent. Our only requirements
+are that they have these properties, so that incentives are aligned in
+such a way that Legate libraries can achieve a high degree of composability
+at any scale of machine. Indeed, these requirements shift many of the burdens
+that make implementing distributed and accelerated libraries hard off of the
+library developers and onto the implementation of the Legate Core API. This
+is by design as it allows the costs to be amortized across all libraries in
+the ecosystem and ensures that Legate library developers are more productive.
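+To make these guarantees concrete, the following pseudocode sketches the
+usage pattern they enable for library developers. Every name here
+(``create_store``, ``slice``, ``launch``) is hypothetical, chosen only to
+illustrate the contract above; it is not the actual Legate Core API.
+
+.. code-block:: python
+
+   # Hypothetical names -- an illustration of the guarantees, not the real API.
+   store = runtime.create_store(shape=(1_000_000,), dtype="float64")
+
+   # Guarantee 2: arbitrary aliasing views onto the same store.
+   lo = store.slice(0, 500_000)
+   hi = store.slice(500_000, 1_000_000)
+
+   # Guarantee 3: tasks run asynchronously; the runtime infers the
+   # dependence chain (fill -> {scale, negate} -> reduce_sum) from
+   # each task's declared usage.
+   runtime.launch(fill, outputs=[store])
+   runtime.launch(scale, inputs=[lo], outputs=[lo])
+   runtime.launch(negate, inputs=[hi], outputs=[hi])
+   total = runtime.launch(reduce_sum, inputs=[store])
+
+   # Guarantee 1: no explicit copies or synchronization appear anywhere,
+   # even when these tasks run on different GPUs or nodes.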
+
+How Does Legate Work
+--------------------
+
+Our implementation of the Legate Core API is built on top of the
+`Legion <https://legion.stanford.edu/>`_ programming model and runtime system.
+Legion was originally designed for large HPC applications that target
+supercomputers and consequently applications written in the Legion programming
+model tend to both perform and scale well on large clusters of both CPUs and
+GPUs. Legion programs are also easy to port to new machines as they inherently
+decouple the machine-independent specification of computations from decisions
+about how that application is mapped to the target machine. Due to this
+abstract nature, many programmers find writing Legion programs challenging.
+By implementing the Legate Core API on top of Legion, we've made it easier
+to use Legion such that developers can still get access to the benefits of
+Legion without needing to learn all of the lowest-level interfaces.
+
+The `Legion programming model <https://legion.stanford.edu/pdfs/sc2012.pdf>`_
+greatly aids in implementing the Legate Core API. Data types from libraries,
+such as arrays in cuNumeric, are mapped down onto ``LegateStore`` objects
+that wrap Legion data types such as logical regions or futures.
+In the case of regions, Legate application libraries rely heavily on
+Legion's `support for partitioning of logical regions into arbitrary subregion views <https://legion.stanford.edu/pdfs/oopsla2013.pdf>`_.
+Each library has its own heuristics for computing such partitions that
+take into consideration the computations that will access the data, the
+ideal sizes of data to be consumed by different processor kinds, and
+the available number of processors. Legion automatically manages the coherence
+of subregion views regardless of the scale of the machine.
+
+Computations in Legate application libraries are described by Legion tasks.
+Tasks describe their data usage in terms of ``LegateStore`` objects, thereby
+allowing Legion to infer where dependences exist. Legion uses distributed
+bounding volume hierarchies, similar to a high performance ray-tracer,
+to soundly and precisely perform dependence analysis on logical regions
+and insert the necessary synchronization between tasks to maintain the
+original sequential semantics of a Legate program.
+
+Each Legate application library also comes with its own custom Legion
+mapper that uses heuristics to determine the best choice of mapping for
+tasks (e.g. are they best run on a CPU or a GPU). All
+Legate tasks are currently implemented in native C or CUDA in order to
+achieve excellent performance on the target processor kind, but Legion
+has bindings in other languages such as Python, Fortran, and Lua for
+users that would prefer to use them. Importantly, by using Legion,
+Legate is able to control the placement of data in order to leave it
+in-place in fast memories like GPU framebuffers across tasks.
+
+When running on large clusters, Legate leverages a novel technology provided
+by Legion called "`control replication <https://research.nvidia.com/sites/default/files/pubs/2021-02_Scaling-Implicit-Parallelism//ppopp.pdf>`_"
+to avoid the sequential bottleneck
+of having one node farm out work to all the nodes in the cluster. With
+control replication, Legate will actually replicate the Legate program and
+run it across all the nodes of the machine at the same time. These copies
+of the program all cooperate logically to appear to execute as one
+program. When communication is necessary between
+different computations, the Legion runtime's program analysis will automatically
+detect it and insert the necessary data movement and synchronization
+across nodes (or GPU framebuffers).
This is the transformation that allows
+sequential programs to run efficiently at scale across large clusters
+as though they are running on a single processor.
+
+How Do I Use Legate
+-------------------
+
+After installing the Legate Core library, the next step is to install a Legate
+application library such as cuNumeric. The installation process for a
+Legate application library will require you to provide a pointer to the location
+of your Legate Core library installation as this will be used to configure the
+installation of the Legate application library. After you finish installing any
+Legate application libraries, you can then simply replace their ``import`` statements
+with the equivalent ones from any Legate application libraries you have installed.
+For example, you can change this:
+
+.. code-block:: python
+
+   import numpy as np
+
+to this:
+
+.. code-block:: python
+
+   import cunumeric as np
+
+After this, you can use the ``legate`` driver script in the ``bin`` directory
+of your installation to run any Python program.
+
+You can also use the standard Python interpreter, but in that case configuration
+options can only be passed through the environment (see below), and some options
+are not available (check the output of ``legate --help`` for more details).
+
+For example, to run your script in the default configuration (4 CPU cores and
+4 GB of memory) just run:
+
+.. code-block:: sh
+
+   $ legate my_python_program.py [other args]
+
+The ``legate`` script also allows you to control the amount of resources that
+Legate consumes when running on the machine. The ``--cpus`` and ``--gpus``
+flags are used to specify how many CPU and GPU processors should be used on a
+node. The ``--sysmem`` flag can be used to specify how many MBs of DRAM Legate
+is allowed to use per node, while the ``--fbmem`` flag controls how many MBs
+of framebuffer memory Legate is allowed to use per GPU. For example, when
+running on a DGX station, you might run your application as follows:
+
+.. code-block:: sh
+
+   $ legate --cpus 16 --gpus 4 --sysmem 100000 --fbmem 15000 my_python_program.py
+
+This will make 16 CPU processors and all 4 GPUs available for use by Legate.
+It will also allow Legate to consume up to 100 GB of DRAM memory and 15 GB of
+framebuffer memory per GPU for a total of 60 GB of GPU framebuffer memory. Note
+that you probably will not be able to make all the resources of the machine
+available for Legate as some will be used by the system or Legate itself for
+meta-work. Currently if you try to exceed these resources during execution then
+Legate will inform you that it had insufficient resources to complete the job
+given its current mapping heuristics. If you believe the job should fit within
+the assigned resources please let us know so we can improve our mapping heuristics.
+There are many other flags available for use in the ``legate`` driver script
+that you can use to communicate how Legate should view the available machine
+resources. You can see a list of them by running:
+
+.. code-block:: sh
+
+   $ legate --help
+
+In addition to running Legate programs, you can also use Legate in an interactive
+mode by simply not passing any ``*.py`` files on the command line. You can still
+request resources just as you would with a normal file. Legate will
+still use all the resources available to it, including doing multi-node execution.
+
+.. code-block:: sh
+
+   $ legate --cpus 16 --gpus 4 --sysmem 100000 --fbmem 15000
+   Welcome to Legion Python interactive console
+   >>>
+
+Note that Legate does not currently support multi-tenancy cases where different
+users are attempting to use the same hardware concurrently.
+
+As a convenience, several command-line options can have their default values set
+via environment variables. These environment variables, their corresponding command-
+line options, and their default values are as follows.
+
+============================ ================================ =============
+CLI Option                   Env. Variable                    Default Value
+============================ ================================ =============
+``--omps``                   LEGATE_OMP_PROCS                 0
+``--ompthreads``             LEGATE_OMP_THREADS               4
+``--utility``                LEGATE_UTILITY_CORES             2
+``--sysmem``                 LEGATE_SYSMEM                    4000
+``--numamem``                LEGATE_NUMAMEM                   0
+``--fbmem``                  LEGATE_FBMEM                     4000
+``--zcmem``                  LEGATE_ZCMEM                     32
+``--regmem``                 LEGATE_REGMEM                    0
+``--eager-alloc-percentage`` LEGATE_EAGER_ALLOC_PERCENTAGE    50
+============================ ================================ =============
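+For example, given the table above, the following two invocations should be
+equivalent ways to give Legate 8 GB of system memory and one OpenMP group
+(the values are illustrative):
+
+.. code-block:: sh
+
+   $ LEGATE_SYSMEM=8000 LEGATE_OMP_PROCS=1 legate my_python_program.py
+   $ legate --sysmem 8000 --omps 1 my_python_program.py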
+
+Distributed Launch
+~~~~~~~~~~~~~~~~~~
+
+If Legate is compiled with networking support (see :doc:`installation`) then
+it can be run in parallel by using the ``--nodes`` option followed by the
+number of nodes to be used. Whenever the ``--nodes`` option is used, Legate
+will be launched using ``mpirun``, even with ``--nodes 1``. Without the
+``--nodes`` option, no launcher will be used. Legate currently supports
+``mpirun``, ``srun``, and ``jsrun`` as launchers and we are open to adding
+additional launcher kinds. You can select the target kind of launcher with
+``--launcher``.
+
+Debugging and Profiling
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Legate also comes with several tools that you can use to better understand
+your program both from a correctness and a performance standpoint. For
+correctness, Legate has facilities for constructing both dataflow
+and event graphs for the actual run of an application. These graphs require
+that you have an installation of `GraphViz <https://www.graphviz.org/>`_
+available on your machine. To generate a dataflow graph for your Legate
+program simply pass the ``--dataflow`` flag to the ``legate.py`` script and after
+your run is complete we will generate a ``dataflow_legate.pdf`` file containing
+the dataflow graph of your program. To generate a corresponding event graph
+you simply need to pass the ``--event`` flag to the ``legate.py`` script to generate
+an ``event_graph_legate.pdf`` file. These files can grow to be fairly large for non-trivial
+programs so we encourage you to keep your programs small when using these
+visualizations or invest in a `robust PDF viewer <https://get.adobe.com/reader/>`_.
+
+For profiling all you need to do is pass the ``--profile`` flag to Legate and
+afterwards you will have a ``legate_prof`` directory containing a web page that
+can be viewed in any web browser that displays a timeline of your program's
+execution. You simply need to load the ``index.html`` page from a browser. You
+may have to enable local JavaScript execution if you are viewing the page from
+your local machine (depending on your browser).
+
+We recommend that you do not mix debugging and profiling in the same run as
+some of the logging for the debugging features requires significant file I/O
+that can adversely affect the performance of the application.
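+As a concrete end-to-end example, the following profiling session uses
+Python's built-in ``http.server`` to serve the profiler page (just one
+convenient way to sidestep local-JavaScript restrictions; opening
+``index.html`` directly also works in most browsers):
+
+.. code-block:: sh
+
+   $ legate --profile my_python_program.py
+   $ cd legate_prof
+   $ python -m http.server 8000
+   # now open http://localhost:8000/index.html in a browser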
+ diff --git a/docs/legate/core/source/user.rst b/docs/legate/core/source/user.rst new file mode 100644 index 000000000..fd464c8b8 --- /dev/null +++ b/docs/legate/core/source/user.rst @@ -0,0 +1,10 @@ +User Guide +========== + +.. toctree:: + :maxdepth: 2 + :caption: Contents: + + Installation + Build instructions + Using with Jupyter