Commit e06901a

Document HIP support
1 parent f3d31d7 commit e06901a

5 files changed

+132
-59
lines changed

docs/source/defs.rst

+1
Original file line numberDiff line numberDiff line change
@@ -41,3 +41,4 @@
4141
.. |cpprelease| replace:: 1.0.0
4242
.. |verrelease| replace:: 1.0.0
4343
.. |vrdecrelease| replace:: 1.1.0
44+
.. |hiprelease| replace:: 1.1.0

docs/source/execution.rst

+60-39
Original file line numberDiff line numberDiff line change
@@ -7,11 +7,14 @@
77
Parallel Execution
88
==================
99

10-
As of |zfp| |omprelease|, parallel compression (but not decompression) is
11-
supported on multicore processors via `OpenMP <http://www.openmp.org>`_
12-
threads.
10+
As of |zfp| |omprelease|, parallel compression is supported on multicore
11+
processors via `OpenMP <http://www.openmp.org>`_ threads.
1312
|zfp| |cudarelease| adds `CUDA <https://developer.nvidia.com/about-cuda>`_
1413
support for fixed-rate compression and decompression on the GPU.
14+
|zfp| |hiprelease| further adds support for
15+
`HIP <https://rocm.docs.amd.com/projects/HIP/en/latest/>`_
16+
and for fixed- and variable-rate parallel compression and decompression
17+
for all three back-ends (OpenMP, CUDA, and HIP).
1518

1619
Since |zfp| partitions arrays into small independent blocks, a
1720
large amount of data parallelism is inherent in the compression scheme that
@@ -40,10 +43,10 @@ Execution Policies
4043

4144
|zfp| supports multiple *execution policies*, which dictate how (e.g.,
4245
sequentially, in parallel) and where (e.g., on the CPU or GPU) arrays are
43-
compressed. Currently three execution policies are available:
44-
``serial``, ``omp``, and ``cuda``. The default mode is
46+
compressed. Currently four execution policies are available:
47+
``serial``, ``omp``, ``cuda``, and ``hip``. The default mode is
4548
``serial``, which ensures sequential compression on a single thread.
46-
The ``omp`` and ``cuda`` execution policies allow for data-parallel
49+
The ``omp``, ``cuda``, and ``hip`` execution policies allow for data-parallel
4750
compression on multiple threads.
4851

4952
The execution policy is set by :c:func:`zfp_stream_set_execution` and
@@ -62,7 +65,7 @@ Execution Parameters
6265

6366
Each execution policy allows tailoring the execution via its associated
6467
*execution parameters*. Examples include number of threads, chunk size,
65-
scheduling, etc. The ``serial`` and ``cuda`` policies have no
68+
scheduling, etc. The ``serial``, ``cuda``, and ``hip`` policies have no
6669
parameters. The subsections below discuss the ``omp`` parameters.
6770

6871
Whenever the execution policy is changed via
@@ -216,6 +219,18 @@ The CUDA implementation has a number of limitations:
216219
We expect to address these limitations over time.
217220

218221

222+
Using HIP
223+
---------
224+
225+
Support for HIP is available as of |zfp| |hiprelease|, allowing |zfp| to be
226+
run in parallel on AMD GPUs. To enable support, the
227+
:c:macro:`ZFP_WITH_HIP` macro must be set and |zfp| must be built with CMake.
228+
See :c:macro:`ZFP_WITH_HIP` for further details.
229+
230+
The HIP implementation is based on the CUDA implementation, and therefore
231+
the same :ref:`limitations <cuda-limitations>` apply.
232+
233+
219234
Setting the Execution Policy
220235
----------------------------
221236

@@ -230,9 +245,10 @@ calling :c:func:`zfp_stream_set_execution`
230245
}
231246

232247
before calling :c:func:`zfp_compress`. Replacing :code:`zfp_exec_omp`
233-
with :code:`zfp_exec_cuda` enables CUDA execution. If OpenMP or CUDA is
234-
disabled or not supported, then the return value of functions setting these
235-
execution policies and parameters will indicate failure. Execution
248+
with :code:`zfp_exec_cuda` enables CUDA execution. Similarly,
249+
:code:`zfp_exec_hip` enables HIP execution. If the corresponding execution
250+
policy is disabled or not supported, then the return value of functions
251+
setting these policies and parameters will indicate failure. Execution
236252
parameters are optional and may be set using the functions discussed above.
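Because a request for an unsupported policy is reported through the return value, an application can probe policies in order of preference. The following is a minimal sketch of that fallback logic, not part of |zfp| itself. The enum is copied from the ``zfp_exec_policy`` listing so that the code compiles on its own. The names ``try_policy_fn``, ``select_policy``, and ``try_policy_from_mask`` are hypothetical; in a real program the callback would presumably wrap :c:func:`zfp_stream_set_execution`.

```c
#include <stddef.h>

/* Copied from the zfp_exec_policy listing so this sketch compiles on its
   own; real code would include zfp.h instead. */
typedef enum {
  zfp_exec_serial = 0, /* serial execution (default) */
  zfp_exec_omp    = 1, /* OpenMP multi-threaded execution */
  zfp_exec_cuda   = 2, /* CUDA parallel execution */
  zfp_exec_hip    = 3  /* HIP parallel execution */
} zfp_exec_policy;

/* Hypothetical callback type: returns nonzero if the policy was accepted.
   With zfp, this would wrap zfp_stream_set_execution(stream, policy). */
typedef int (*try_policy_fn)(void* context, zfp_exec_policy policy);

/* Request each policy in order of preference and return the first one that
   is accepted. If every request fails, explicitly request the serial policy
   (assumed here to be always available) and return it. */
static zfp_exec_policy
select_policy(const zfp_exec_policy* preferred, size_t count,
              try_policy_fn try_policy, void* context)
{
  size_t i;
  for (i = 0; i < count; i++)
    if (try_policy(context, preferred[i]))
      return preferred[i];
  (void)try_policy(context, zfp_exec_serial);
  return zfp_exec_serial;
}

/* Hypothetical stand-in for a zfp build, used only for illustration:
   context points to a bit mask with bit p set if policy p is enabled. */
static int
try_policy_from_mask(void* context, zfp_exec_policy policy)
{
  const unsigned mask = *(const unsigned*)context;
  return (int)((mask >> policy) & 1u);
}
```

Passing the preference list ``{ zfp_exec_hip, zfp_exec_cuda, zfp_exec_omp }`` with a mask that enables only the serial and OpenMP policies selects ``zfp_exec_omp``. Keeping the probe behind a callback lets the same selection logic be exercised without a GPU build.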
237253

238254
The source code for the |zfpcmd| command-line tool includes further examples
@@ -241,39 +257,42 @@ decompression in this tool, see the :option:`-x` command-line option.
241257

242258
.. note::
243259
As of |zfp| |cudarelease|, the execution policy refers to both
244-
compression and decompression. The OpenMP implementation does not
245-
yet support decompression, and hence :c:func:`zfp_decompress` will
246-
fail if the execution policy is not reset to :code:`zfp_exec_serial`
247-
before calling the decompressor. Similarly, the CUDA implementation
248-
supports only fixed-rate mode and will fail if other compression modes
249-
are specified.
260+
compression and decompression.
261+
262+
.. note::
263+
As of |zfp| |vrdecrelease|, variable-rate compression modes are supported
264+
for all execution policies, both for compression and decompression.
265+
However, for parallel decompression, a block index must be provided that
266+
encodes where in the compressed stream each block resides. See the section
267+
on :ref:`parallel decompression <parallel-decompression>` for further
268+
details.
250269

251270
The following table summarizes which execution policies are supported
252271
with which :ref:`compression modes <modes>`:
253272

254-
+---------------------------------+---------+---------+---------+
255-
| (de)compression mode | serial | OpenMP | CUDA |
256-
+===============+=================+=========+=========+=========+
257-
| | expert | |check| | |check| | |
258-
| +-----------------+---------+---------+---------+
259-
| | fixed rate | |check| | |check| | |check| |
260-
| +-----------------+---------+---------+---------+
261-
| compression | fixed precision | |check| | |check| | |
262-
| +-----------------+---------+---------+---------+
263-
| | fixed accuracy | |check| | |check| | |
264-
| +-----------------+---------+---------+---------+
265-
| | reversible | |check| | |check| | |
266-
+---------------+-----------------+---------+---------+---------+
267-
| | expert | |check| | |check| | |
268-
| +-----------------+---------+---------+---------+
269-
| | fixed rate | |check| | |check| | |check| |
270-
| +-----------------+---------+---------+---------+
271-
| decompression | fixed precision | |check| | |check| | |check| |
272-
| +-----------------+---------+---------+---------+
273-
| | fixed accuracy | |check| | |check| | |check| |
274-
| +-----------------+---------+---------+---------+
275-
| | reversible | |check| | |check| | |
276-
+---------------+-----------------+---------+---------+---------+
273+
+---------------------------------+---------+---------+---------+---------+
274+
| (de)compression mode            | serial  | OpenMP  | CUDA    | HIP     |
275+
+===============+=================+=========+=========+=========+=========+
276+
|               | expert          | |check| | |check| | |check| | |check| |
277+
|               +-----------------+---------+---------+---------+---------+
278+
|               | fixed rate      | |check| | |check| | |check| | |check| |
279+
|               +-----------------+---------+---------+---------+---------+
280+
| compression   | fixed precision | |check| | |check| | |check| | |check| |
281+
|               +-----------------+---------+---------+---------+---------+
282+
|               | fixed accuracy  | |check| | |check| | |check| | |check| |
283+
|               +-----------------+---------+---------+---------+---------+
284+
|               | reversible      | |check| | |check| |         |         |
285+
+---------------+-----------------+---------+---------+---------+---------+
286+
|               | expert          | |check| | |check| | |check| | |check| |
287+
|               +-----------------+---------+---------+---------+---------+
288+
|               | fixed rate      | |check| | |check| | |check| | |check| |
289+
|               +-----------------+---------+---------+---------+---------+
290+
| decompression | fixed precision | |check| | |check| | |check| | |check| |
291+
|               +-----------------+---------+---------+---------+---------+
292+
|               | fixed accuracy  | |check| | |check| | |check| | |check| |
293+
|               +-----------------+---------+---------+---------+---------+
294+
|               | reversible      | |check| | |check| |         |         |
295+
+---------------+-----------------+---------+---------+---------+---------+
277296

278297
:c:func:`zfp_compress` and :c:func:`zfp_decompress` both return zero if the
279298
current execution policy is not supported for the requested compression
@@ -290,6 +309,8 @@ function in turn inspects the execution policy given by the
290309
for executing compression.
291310

292311

312+
.. _parallel-decompression:
313+
293314
Parallel Decompression
294315
----------------------
295316

docs/source/high-level-api.rst

+5-4
Original file line numberDiff line numberDiff line change
@@ -272,7 +272,7 @@ Types
272272
::
273273

274274
typedef struct {
275-
zfp_exec_policy policy; // execution policy (serial, omp, cuda, ...)
275+
zfp_exec_policy policy; // execution policy (serial, omp, cuda, hip, ...)
276276
void* params; // execution parameters
277277
} zfp_execution;
278278

@@ -287,14 +287,15 @@ Types
287287

288288
.. c:type:: zfp_exec_policy
289289
290-
Currently three execution policies are available: serial, OpenMP parallel,
291-
and CUDA parallel.
290+
Currently four execution policies are available: serial, OpenMP, CUDA, and
291+
HIP.
292292
::
293293

294294
typedef enum {
295295
zfp_exec_serial = 0, // serial execution (default)
296296
zfp_exec_omp = 1, // OpenMP multi-threaded execution
297-
zfp_exec_cuda = 2 // CUDA parallel execution
297+
zfp_exec_cuda = 2, // CUDA parallel execution
298+
zfp_exec_hip = 3 // HIP parallel execution
298299
} zfp_exec_policy;
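As an illustration of how the four enumerators might be used, the sketch below maps a policy name (``serial``, ``omp``, ``cuda``, ``hip``, as listed in the execution policy documentation) to its enum value. The function ``parse_policy`` is hypothetical and not part of the |zfp| API. The enum is repeated only so that the snippet compiles without ``zfp.h``.

```c
#include <string.h>

/* Copied from the listing above so this sketch compiles on its own;
   real code would include zfp.h instead. */
typedef enum {
  zfp_exec_serial = 0, /* serial execution (default) */
  zfp_exec_omp    = 1, /* OpenMP multi-threaded execution */
  zfp_exec_cuda   = 2, /* CUDA parallel execution */
  zfp_exec_hip    = 3  /* HIP parallel execution */
} zfp_exec_policy;

/* Hypothetical helper: look up a policy by name. Returns 1 and stores the
   policy on success; returns 0 and leaves *policy untouched otherwise.
   Execution parameters such as an OpenMP thread count are not handled. */
static int
parse_policy(const char* name, zfp_exec_policy* policy)
{
  static const struct {
    const char* name;
    zfp_exec_policy policy;
  } table[] = {
    { "serial", zfp_exec_serial },
    { "omp",    zfp_exec_omp    },
    { "cuda",   zfp_exec_cuda   },
    { "hip",    zfp_exec_hip    }
  };
  size_t i;
  for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
    if (!strcmp(name, table[i].name)) {
      *policy = table[i].policy;
      return 1;
    }
  return 0;
}
```

A table-driven lookup keeps the name list in one place, so adding a future policy would require only one new row.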
299300

300301
----

docs/source/installation.rst

+59-10
Original file line numberDiff line numberDiff line change
@@ -241,17 +241,57 @@ in the same manner that :ref:`build targets <targets>` are specified, e.g.,
241241

242242
.. c:macro:: ZFP_WITH_CUDA
243243
244-
CMake macro for enabling or disabling CUDA support for
245-
GPU compression and decompression. When enabled, CUDA and a compatible
246-
host compiler must be installed. For a full list of compatible compilers,
244+
CMake macro for enabling or disabling CUDA support for GPU compression and
245+
decompression. When enabled, CUDA 11.0 or later and a compatible host
246+
compiler must be installed. For a full list of compatible compilers,
247247
please consult the
248248
`NVIDIA documentation <https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/>`__.
249-
If a CUDA installation is in the user's path, it will be
250-
automatically found by CMake. Alternatively, the CUDA binary directory
251-
can be specified using the :envvar:`CUDA_BIN_DIR` environment variable.
249+
If a CUDA installation is in the user's path, it will be automatically found
250+
by CMake. See also :c:macro:`CMAKE_CUDA_ARCHITECTURES`.
252251
CMake default: off.
253252
GNU make default: off and ignored.
254253

254+
255+
.. c:macro:: CMAKE_CUDA_ARCHITECTURES
256+
257+
`CMake macro <https://cmake.org/cmake/help/latest/variable/CMAKE_CUDA_ARCHITECTURES.html>`__
258+
for optionally specifying which
259+
`NVIDIA GPU architectures <https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/>`__
260+
to build for. Use a semicolon-separated list of architectures to override
261+
the default, e.g., ``35;50;72`` generates code for compute capabilities
262+
3.5, 5.0, and 7.2. Set to ``all`` to build for all supported architectures.
263+
CMake default: compiler specific.
264+
GNU make default: ignored.
265+
266+
.. note::
267+
Setting ``CMAKE_CUDA_ARCHITECTURES=all`` makes it possible to use a single
268+
binary across multiple architectures. However, this option can significantly
269+
increase the build time and size of ``libzfp``.
270+
271+
272+
.. c:macro:: ZFP_WITH_HIP
273+
274+
CMake macro for enabling or disabling HIP support for GPU compression and
275+
decompression. If a HIP installation is in the user's path, it will be
276+
automatically found by CMake. Alternatively, one may set the environment
277+
variable :envvar:`HIP_PATH` to point to the HIP installation. Some
278+
platforms further require setting ``CMAKE_C_COMPILER=hipcc`` and
279+
``CMAKE_CXX_COMPILER=hipcc``. See also :c:macro:`CMAKE_HIP_ARCHITECTURES`.
280+
CMake default: off.
281+
GNU make default: off and ignored.
282+
283+
284+
.. c:macro:: CMAKE_HIP_ARCHITECTURES
285+
286+
`CMake macro <https://cmake.org/cmake/help/latest/variable/CMAKE_HIP_ARCHITECTURES.html>`__
287+
for optionally specifying which
288+
`AMD GPU architectures <https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html>`__
289+
to build for. Use a semicolon-separated list of architectures to override
290+
the default, e.g., ``gfx900;gfx908``.
291+
CMake default: compiler specific.
292+
GNU make default: ignored.
293+
294+
255295
.. _rounding:
256296
.. c:macro:: ZFP_ROUNDING_MODE
257297
@@ -279,6 +319,7 @@ in the same manner that :ref:`build targets <targets>` are specified, e.g.,
279319
:code:`serial` and :code:`omp` :ref:`execution policies <execution>`.
280320
Default: :code:`ZFP_ROUND_NEVER`.
281321

322+
282323
.. c:macro:: ZFP_WITH_TIGHT_ERROR
283324
284325
**Experimental feature**. When enabled, this feature takes advantage of the
@@ -293,6 +334,7 @@ in the same manner that :ref:`build targets <targets>` are specified, e.g.,
293334
:ref:`execution policies <execution>`.
294335
Default: undefined/off.
295336

337+
296338
.. c:macro:: ZFP_WITH_DAZ
297339
298340
When enabled, blocks consisting solely of subnormal floating-point numbers
@@ -312,6 +354,7 @@ in the same manner that :ref:`build targets <targets>` are specified, e.g.,
312354
:code:`omp`.
313355
Default: undefined/off.
314356

357+
315358
.. c:macro:: ZFP_WITH_ALIGNED_ALLOC
316359
317360
Use aligned memory allocation in an attempt to align compressed blocks
@@ -398,8 +441,8 @@ in the sections below.
398441
CMake
399442
^^^^^
400443

401-
CMake builds require version 3.9 or later. CMake is available
402-
`here <https://cmake.org>`__.
444+
CPU-only CMake builds require version 3.9 or later; see below for GPU build
445+
requirements. CMake is available `here <https://cmake.org>`__.
403446

404447
OpenMP
405448
^^^^^^
@@ -409,8 +452,14 @@ OpenMP support requires OpenMP 2.0 or later.
409452
CUDA
410453
^^^^
411454

412-
CUDA support requires CUDA 7.0 or later, CMake, and a compatible host
413-
compiler (see :c:macro:`ZFP_WITH_CUDA`).
455+
CUDA support requires CUDA 11.0 or later, CMake 3.23 or later, and a
456+
compatible host compiler (see :c:macro:`ZFP_WITH_CUDA`).
457+
458+
HIP
459+
^^^
460+
461+
HIP support requires ROCm 4.0 or later, CMake 3.21 or later, and a
462+
compatible host compiler (see :c:macro:`ZFP_WITH_HIP`).
414463

415464
C/C++
416465
^^^^^

docs/source/zfpcmd.rst

+7-6
Original file line numberDiff line numberDiff line change
@@ -232,7 +232,8 @@ Execution parameters
232232
:code:`-x omp=threads,chunk_size` to specify the chunk size in number
233233
of blocks (see also :c:func:`zfp_stream_set_omp_chunk_size`). A
234234
chunk size of zero is ignored and results in the default size.
235-
Use :code:`-x cuda` to for parallel CUDA compression and decompression.
235+
Use :code:`-x cuda` or :code:`-x hip` for parallel CUDA or HIP
236+
compression and decompression, respectively.
236237

237238
As of |cudarelease|, the execution policy applies to both compression
238239
and decompression. If the execution policy is not supported for
@@ -245,9 +246,9 @@ Block Index
245246
^^^^^^^^^^^
246247

247248
A block index is needed to support variable-rate decompression using any
248-
of the parallel execution policies (OpenMP and CUDA). This index must be
249-
captured and stored to file during compression and later accessed prior to
250-
decompression.
249+
of the parallel execution policies (OpenMP, CUDA, and HIP). This index
250+
must be captured and stored to file during compression and later accessed
251+
prior to decompression.
251252

252253
.. option:: -m <path>
253254

@@ -258,8 +259,8 @@ decompression.
258259

259260
Block index type ("offset" or "hybrid") and granularity in number of blocks
260261
per index entry. A granularity of one provides the highest flexibility and
261-
performance potential (especially for CUDA) but also the highest storage
262-
cost.
262+
performance potential (especially for CUDA and HIP) but also the highest
263+
storage cost.
263264

264265
See the :ref:`hl-func-index` section for further details.
265266
