Document HIP support

lindstro · lindstro · commit e06901a01b47 · 2024-12-27T23:36:15.000-08:00
diff --git a/docs/source/defs.rst b/docs/source/defs.rst
@@ -41,3 +41,4 @@
 .. |cpprelease| replace:: 1.0.0
 .. |verrelease| replace:: 1.0.0
 .. |vrdecrelease| replace:: 1.1.0
+.. |hiprelease| replace:: 1.1.0
diff --git a/docs/source/execution.rst b/docs/source/execution.rst
@@ -7,11 +7,14 @@
 Parallel Execution
 ==================
 
-As of |zfp| |omprelease|, parallel compression (but not decompression) is
-supported on multicore processors via `OpenMP <http://www.openmp.org>`_
-threads.
+As of |zfp| |omprelease|, parallel compression is supported on multicore
+processors via `OpenMP <http://www.openmp.org>`_ threads.
 |zfp| |cudarelease| adds `CUDA <https://developer.nvidia.com/about-cuda>`_
 support for fixed-rate compression and decompression on the GPU.
+|zfp| |hiprelease| further adds support for
+`HIP <https://rocm.docs.amd.com/projects/HIP/en/latest/>`_
+as well as fixed- and variable-rate parallel compression and decompression
+on all three back-ends (OpenMP, CUDA, and HIP).
 
 Since |zfp| partitions arrays into small independent blocks, a
 large amount of data parallelism is inherent in the compression scheme that
@@ -40,10 +43,10 @@ Execution Policies
 
 |zfp| supports multiple *execution policies*, which dictate how (e.g.,
 sequentially, in parallel) and where (e.g., on the CPU or GPU) arrays are
-compressed.  Currently three execution policies are available:
-``serial``, ``omp``, and ``cuda``.  The default mode is
+compressed.  Currently four execution policies are available:
+``serial``, ``omp``, ``cuda``, and ``hip``.  The default policy is
 ``serial``, which ensures sequential compression on a single thread.
-The ``omp`` and ``cuda`` execution policies allow for data-parallel
+The ``omp``, ``cuda``, and ``hip`` execution policies allow for data-parallel
 compression on multiple threads.
 
 The execution policy is set by :c:func:`zfp_stream_set_execution` and
@@ -62,7 +65,7 @@ Execution Parameters
 
 Each execution policy allows tailoring the execution via its associated
 *execution parameters*.  Examples include number of threads, chunk size,
-scheduling, etc.  The ``serial`` and ``cuda`` policies have no
+scheduling, etc.  The ``serial``, ``cuda``, and ``hip`` policies have no
 parameters.  The subsections below discuss the ``omp`` parameters.
 
 Whenever the execution policy is changed via
@@ -216,6 +219,18 @@ The CUDA implementation has a number of limitations:
 We expect to address these limitations over time.
 
 
+Using HIP
+---------
+
+Support for HIP is available as of |zfp| |hiprelease|, allowing |zfp| to be
+run in parallel on AMD GPUs.  To enable support, the
+:c:macro:`ZFP_WITH_HIP` macro must be set and |zfp| must be built with CMake.
+See :c:macro:`ZFP_WITH_HIP` for further details.
+
+The HIP implementation is based on the CUDA implementation, and therefore
+the same :ref:`limitations <cuda-limitations>` apply.
+
+
 Setting the Execution Policy
 ----------------------------
 
@@ -230,9 +245,10 @@ calling :c:func:`zfp_stream_set_execution`
     }
 
 before calling :c:func:`zfp_compress`.  Replacing :code:`zfp_exec_omp`
-with :code:`zfp_exec_cuda` enables CUDA execution.  If OpenMP or CUDA is
-disabled or not supported, then the return value of functions setting these
-execution policies and parameters will indicate failure.  Execution
+with :code:`zfp_exec_cuda` enables CUDA execution.  Similarly,
+:code:`zfp_exec_hip` enables HIP execution.  If the corresponding execution
+policy is disabled or not supported, then the return value of functions
+setting these policies and parameters will indicate failure.  Execution
 parameters are optional and may be set using the functions discussed above.
 
 The source code for the |zfpcmd| command-line tool includes further examples
@@ -241,39 +257,42 @@ decompression in this tool, see the :option:`-x` command-line option.
 
 .. note::
   As of |zfp| |cudarelease|, the execution policy refers to both
-  compression and decompression.  The OpenMP implementation does not
-  yet support decompression, and hence :c:func:`zfp_decompress` will
-  fail if the execution policy is not reset to :code:`zfp_exec_serial`
-  before calling the decompressor.  Similarly, the CUDA implementation
-  supports only fixed-rate mode and will fail if other compression modes
-  are specified.
+  compression and decompression.
+
+.. note::
+  As of |zfp| |vrdecrelease|, variable-rate compression modes are supported
+  for both compression and decompression under all execution policies,
+  except where the table below indicates otherwise.  Variable-rate parallel
+  decompression additionally requires a block index that encodes where in
+  the compressed stream each block resides.  See the section on
+  :ref:`parallel decompression <parallel-decompression>` for further details.
 
 The following table summarizes which execution policies are supported
 with which :ref:`compression modes <modes>`:
 
-  +---------------------------------+---------+---------+---------+
-  | (de)compression mode            | serial  | OpenMP  | CUDA    |
-  +===============+=================+=========+=========+=========+
-  |               | expert          | |check| | |check| |         |
-  |               +-----------------+---------+---------+---------+
-  |               | fixed rate      | |check| | |check| | |check| |
-  |               +-----------------+---------+---------+---------+
-  | compression   | fixed precision | |check| | |check| |         |
-  |               +-----------------+---------+---------+---------+
-  |               | fixed accuracy  | |check| | |check| |         |
-  |               +-----------------+---------+---------+---------+
-  |               | reversible      | |check| | |check| |         |
-  +---------------+-----------------+---------+---------+---------+
-  |               | expert          | |check| | |check| |         |
-  |               +-----------------+---------+---------+---------+
-  |               | fixed rate      | |check| | |check| | |check| |
-  |               +-----------------+---------+---------+---------+
-  | decompression | fixed precision | |check| | |check| | |check| |
-  |               +-----------------+---------+---------+---------+
-  |               | fixed accuracy  | |check| | |check| | |check| |
-  |               +-----------------+---------+---------+---------+
-  |               | reversible      | |check| | |check| |         |
-  +---------------+-----------------+---------+---------+---------+
+  +---------------------------------+---------+---------+---------+---------+
+  | (de)compression mode            | serial  | OpenMP  | CUDA    | HIP     |
+  +===============+=================+=========+=========+=========+=========+
+  |               | expert          | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  |               | fixed rate      | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  | compression   | fixed precision | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  |               | fixed accuracy  | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  |               | reversible      | |check| | |check| |         |         |
+  +---------------+-----------------+---------+---------+---------+---------+
+  |               | expert          | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  |               | fixed rate      | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  | decompression | fixed precision | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  |               | fixed accuracy  | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  |               | reversible      | |check| | |check| |         |         |
+  +---------------+-----------------+---------+---------+---------+---------+
 
 :c:func:`zfp_compress` and :c:func:`zfp_decompress` both return zero if the
 current execution policy is not supported for the requested compression
@@ -290,6 +309,8 @@ function in turn inspects the execution policy given by the
 for executing compression.
 
 
+.. _parallel-decompression:
+
 Parallel Decompression
 ----------------------
 
diff --git a/docs/source/high-level-api.rst b/docs/source/high-level-api.rst
@@ -272,7 +272,7 @@ Types
   ::
 
     typedef struct {
-      zfp_exec_policy policy; // execution policy (serial, omp, cuda, ...)
+      zfp_exec_policy policy; // execution policy (serial, omp, cuda, hip, ...)
       void* params;           // execution parameters
     } zfp_execution;
 
@@ -287,14 +287,15 @@ Types
 
 .. c:type:: zfp_exec_policy
 
-  Currently three execution policies are available: serial, OpenMP parallel,
-  and CUDA parallel.
+  Currently four execution policies are available: serial, OpenMP, CUDA, and
+  HIP.
   ::
 
     typedef enum {
       zfp_exec_serial = 0, // serial execution (default)
       zfp_exec_omp    = 1, // OpenMP multi-threaded execution
-      zfp_exec_cuda   = 2  // CUDA parallel execution
+      zfp_exec_cuda   = 2, // CUDA parallel execution
+      zfp_exec_hip    = 3  // HIP parallel execution
     } zfp_exec_policy;
 
 ----
diff --git a/docs/source/installation.rst b/docs/source/installation.rst
@@ -241,17 +241,57 @@ in the same manner that :ref:`build targets <targets>` are specified, e.g.,
 
 .. c:macro:: ZFP_WITH_CUDA
 
-  CMake macro for enabling or disabling CUDA support for
-  GPU compression and decompression.  When enabled, CUDA and a compatible
-  host compiler must be installed.  For a full list of compatible compilers,
+  CMake macro for enabling or disabling CUDA support for GPU compression and
+  decompression.  When enabled, CUDA 11.0 or later and a compatible host
+  compiler must be installed.  For a full list of compatible compilers,
   please consult the
   `NVIDIA documentation <https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/>`__.
-  If a CUDA installation is in the user's path, it will be
-  automatically found by CMake.  Alternatively, the CUDA binary directory 
-  can be specified using the :envvar:`CUDA_BIN_DIR` environment variable.
+  If a CUDA installation is in the user's path, it will be automatically found
+  by CMake.  See also :c:macro:`CMAKE_CUDA_ARCHITECTURES`.
   CMake default: off.
   GNU make default: off and ignored.
 
+
+.. c:macro:: CMAKE_CUDA_ARCHITECTURES
+
+  `CMake macro <https://cmake.org/cmake/help/latest/variable/CMAKE_CUDA_ARCHITECTURES.html>`__
+  for optionally specifying which
+  `NVIDIA GPU architectures <https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/>`__
+  to build for.  Use a semicolon-separated list of architectures to override
+  the default, e.g., ``35;50;72`` generates code for compute capabilities
+  3.5, 5.0, and 7.2.  Set to ``all`` to build for all supported architectures.
+  CMake default: compiler specific.
+  GNU make default: ignored.
+
+.. note::
+  Setting ``CMAKE_CUDA_ARCHITECTURES=all`` makes it possible to use a single
+  binary across multiple architectures.  However, this option can significantly
+  increase the build time and size of ``libzfp``.
+
+
+.. c:macro:: ZFP_WITH_HIP
+
+  CMake macro for enabling or disabling HIP support for GPU compression and
+  decompression.  If a HIP installation is in the user's path, it will be
+  automatically found by CMake.  Alternatively, one may set the environment
+  variable :envvar:`HIP_PATH` to point to the HIP installation.  Some
+  platforms further require setting ``CMAKE_C_COMPILER=hipcc`` and
+  ``CMAKE_CXX_COMPILER=hipcc``.  See also :c:macro:`CMAKE_HIP_ARCHITECTURES`.
+  CMake default: off.
+  GNU make default: off and ignored.
+
+
+.. c:macro:: CMAKE_HIP_ARCHITECTURES
+
+  `CMake macro <https://cmake.org/cmake/help/latest/variable/CMAKE_HIP_ARCHITECTURES.html>`__
+  for optionally specifying which
+  `AMD GPU architectures <https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html>`__
+  to build for.  Use a semicolon-separated list of architectures to override
+  the default, e.g., ``gfx900;gfx908``.
+  CMake default: compiler specific.
+  GNU make default: ignored.
+
+
 .. _rounding:
 .. c:macro:: ZFP_ROUNDING_MODE
 
@@ -279,6 +319,7 @@ in the same manner that :ref:`build targets <targets>` are specified, e.g.,
   :code:`serial` and :code:`omp` :ref:`execution policies <execution>`.
   Default: :code:`ZFP_ROUND_NEVER`.
 
+
 .. c:macro:: ZFP_WITH_TIGHT_ERROR
 
   **Experimental feature**.  When enabled, this feature takes advantage of the
@@ -293,6 +334,7 @@ in the same manner that :ref:`build targets <targets>` are specified, e.g.,
   :ref:`execution policies <execution>`.
   Default: undefined/off.
 
+
 .. c:macro:: ZFP_WITH_DAZ
 
   When enabled, blocks consisting solely of subnormal floating-point numbers
@@ -312,6 +354,7 @@ in the same manner that :ref:`build targets <targets>` are specified, e.g.,
   :code:`omp`.
   Default: undefined/off.
 
+
 .. c:macro:: ZFP_WITH_ALIGNED_ALLOC
 
   Use aligned memory allocation in an attempt to align compressed blocks
@@ -398,8 +441,8 @@ in the sections below.
 CMake
 ^^^^^
 
-CMake builds require version 3.9 or later.  CMake is available
-`here <https://cmake.org>`__.
+CPU-only CMake builds require CMake 3.9 or later; see below for GPU build
+requirements.  CMake is available `here <https://cmake.org>`__.
 
 OpenMP
 ^^^^^^
@@ -409,8 +452,14 @@ OpenMP support requires OpenMP 2.0 or later.
 CUDA
 ^^^^
 
-CUDA support requires CUDA 7.0 or later, CMake, and a compatible host
-compiler (see :c:macro:`ZFP_WITH_CUDA`).
+CUDA support requires CUDA 11.0 or later, CMake 3.23 or later, and a
+compatible host compiler (see :c:macro:`ZFP_WITH_CUDA`).
+
+HIP
+^^^
+
+HIP support requires ROCm 4.0 or later, CMake 3.21 or later, and a
+compatible host compiler (see :c:macro:`ZFP_WITH_HIP`).
 
 C/C++
 ^^^^^
diff --git a/docs/source/zfpcmd.rst b/docs/source/zfpcmd.rst
@@ -232,7 +232,8 @@ Execution parameters
   :code:`-x omp=threads,chunk_size` to specify the chunk size in number
   of blocks (see also :c:func:`zfp_stream_set_omp_chunk_size`).  A
   chunk size of zero is ignored and results in the default size.
-  Use :code:`-x cuda` to for parallel CUDA compression and decompression.
+  Use :code:`-x cuda` or :code:`-x hip` for parallel CUDA or HIP
+  compression and decompression, respectively.
 
 As of |cudarelease|, the execution policy applies to both compression
 and decompression.  If the execution policy is not supported for
@@ -245,9 +246,9 @@ Block Index
 ^^^^^^^^^^^
 
 A block index is needed to support variable-rate decompression using any
-of the parallel execution policies (OpenMP and CUDA).  This index must be
-captured and stored to file during compression and later accessed prior to
-decompression.
+of the parallel execution policies (OpenMP, CUDA, and HIP).  This index
+must be captured and stored to file during compression and later accessed
+prior to decompression.
 
 .. option:: -m <path>
 
@@ -258,8 +259,8 @@ decompression.
 
   Block index type ("offset" or "hybrid") and granularity in number of blocks
   per index entry.  A granularity of one provides the highest flexibility and
-  performance potential (especially for CUDA) but also the highest storage
-  cost.
+  performance potential (especially for CUDA and HIP) but also the highest
+  storage cost.
 
 See the :ref:`hl-func-index` section for further details.
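
To try the HIP path documented in this change, a build and run might look like the sketch below. The ``ZFP_WITH_HIP`` and ``CMAKE_HIP_ARCHITECTURES`` options and the ``-x hip`` flag come from the documentation in this diff. The ``gfx908`` target, the ``build`` directory, the ``build/bin/zfp`` location, the file names, and the array dimensions are placeholder assumptions to adapt to your system.

```shell
# Configure zfp with the HIP back-end enabled (the docs above call for
# CMake 3.21 or later and ROCm 4.0 or later).
# gfx908 is only an example architecture; substitute your GPU's target.
cmake -S . -B build -DZFP_WITH_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx908
cmake --build build

# Compress a hypothetical 64x64x64 double-precision array in fixed-rate mode
# (8 bits per value) using the HIP execution policy.
# input.raw and output.zfp are placeholder file names.
build/bin/zfp -d -3 64 64 64 -r 8 -x hip -i input.raw -z output.zfp
```

On platforms where CMake does not pick up the HIP toolchain on its own, the documentation above suggests also passing ``-DCMAKE_C_COMPILER=hipcc -DCMAKE_CXX_COMPILER=hipcc`` or setting ``HIP_PATH``.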