
Supporting ARM SVE, the newer extended vector instruction set for aarch64 #2884

Open
2 of 4 tasks
vorj opened this issue May 31, 2023 · 4 comments

vorj commented May 31, 2023

Summary

Dear @mdouze and all,

ARM SVE is a newer vector instruction set than NEON, supported on CPUs such as AWS Graviton3 and the Fujitsu A64FX.
I've added SVE support and some functions implemented with SVE to faiss, then compared their execution times.
My implementation appears to improve performance in some environments.
This is just a first implementation to show what SVE can do; I plan to implement SVE versions of other functions that have not yet been ported.

It might not be possible to test this on Circle CI currently; even so, would you mind if I submit this as a PR?

Platform

OS: Ubuntu 22.04

Faiss version: a3296f4, and mine

Installed from: compiled by myself

Faiss compilation options: cmake -B build -DFAISS_ENABLE_GPU=OFF -DPython_EXECUTABLE=$(which python3) -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON -DFAISS_OPT_LEVEL=sve (-DFAISS_OPT_LEVEL=sve is a new opt level introduced by my changes)

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Reproduction instructions

I am only posting the results for searching SIFT1M. If you need more detailed information, please let me know.

benchmark result

  • Evaluated by running faiss on an AWS EC2 c7g.large instance
  • original is the current (a3296f4) implementation
  • SVE is the result of my implementation supporting ARM SVE

(image: benchmark chart of speedup ratio, SVE vs. original)

The above image illustrates the speedup ratio.

  • In the best case, SVE is approx. 2.26x faster than the original (IndexIVFPQ + IndexHNSWFlat, M: 32, nprobe: 16)
    • original : 0.618 ms
    • SVE : 0.274 ms

mdouze commented May 31, 2023

Thanks for looking into this!
Do I understand correctly that this is with a 512-bit SIMD width?
Indeed we should have a way to integrate code for hardware that is not supported by CircleCI (AVX512 being the other example).
So we welcome a PR for this functionality.


vorj commented May 31, 2023

@mdouze

Do I understand correctly that this is with a 512-bit SIMD width?

SVE is an abbreviation of Scalable Vector Extension.
In this context, scalable means that the vector length is not fixed by the instruction set.
The vector length is chosen by each CPU; for example, the A64FX has 512-bit SVE registers, while Graviton3 has 256-bit SVE registers.
So the programmer writes length-independent code, and the same binary works on each CPU by detecting the actual vector length at run time.
The length of an SVE register is 128*n bits, in the range [128, 2048] bits.
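This length-agnostic style can be sketched with a small simulation (plain Python standing in for SVE intrinsics, so this is only an illustration): `vl` plays the role of the runtime vector length in elements, and the bounded lane loop plays the role of a predicate such as the one `svwhilelt` produces. The same code runs unchanged whether `vl` is 4 (128-bit registers) or 16 (512-bit registers).

```python
def dot_vla(a, b, vl):
    """Vector-length-agnostic dot product (simulation of SVE-style code).

    `vl` is the runtime vector length in 32-bit elements; the loop never
    hardcodes a lane count, mirroring how one SVE binary adapts to any CPU.
    """
    n = len(a)
    acc = 0.0
    i = 0
    while i < n:
        # Only in-bounds lanes are active, like an svwhilelt predicate;
        # no separate peel loop is needed for the tail.
        lanes = min(vl, n - i)
        for j in range(lanes):
            acc += a[i + j] * b[i + j]
        i += vl
    return acc
```

For example, `dot_vla(x, y, 4)` and `dot_vla(x, y, 16)` give the same result, just as the same SVE binary would on Graviton3 and A64FX.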

So we welcome a PR for this functionality.

I'm glad to hear that! 😄 I will make the PRs later.

@vorj vorj mentioned this issue May 31, 2023
@alexanderguzhva

@vorj thanks for your PR! I have a couple of questions, just to get some understanding of SVE.

  1. Do I understand correctly that if, say, the SVE vector length is 512 bits, it is still possible to operate on 256 bits and 128 bits, just as AVX-512 extends AVX2, which extends AVX?
  2. Do I understand correctly that most of the speedup comes from faster distance computations (the fvec_L2sqr_* and fvec_inner_product_* functions)?

Also, I'll be happy to assist and point you to the bottlenecks that would benefit from custom ARM code, if needed.


vorj commented May 31, 2023

@alexanderguzhva To answer your question,

  1. Suppose the vector length is 512 bits and the element type is 32-bit (so the vector holds 16 × 32-bit lanes). Below, I represent the predicate mask as {mask0, mask1, ..., mask15}. If you pass the mask {1, 1, 1, 1, 0, 0, 0, 0, ..., 0}, you load/compute/store 4 × 32-bit (= 128 bits) of data; {1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ..., 0} gives 256 bits. Of course, this is slower than using the full length. Alternatively, you can still use Advanced SIMD (NEON) as a 128/64-bit SIMD instruction set; in that case you have to write a peel loop (or something like it) for data whose length is not a multiple of 4, in the same manner as before.
  2. At least for this PR, mostly yes. I plan to make another PR containing SVE implementations of `code_distance` and `exhaustive_L2sqr_blas`.
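The predicated access described in point 1 can be sketched in plain Python (a simulation of a hypothetical 16-lane vector, not real intrinsics): lanes where the mask is 0 contribute nothing, which is how a 512-bit register can be used for 128 or 256 bits of data.

```python
def masked_load(memory, base, mask, fill=0.0):
    """Simulated predicated load: active lanes read from memory,
    inactive lanes receive `fill` (zeroing-style predication)."""
    return [memory[base + j] if m else fill for j, m in enumerate(mask)]

# A 512-bit vector of 32-bit elements has 16 lanes; this mask activates
# only the first 4 lanes, i.e. 128 bits of data out of the register.
mask_128 = [1] * 4 + [0] * 12
```

`masked_load(data, 0, mask_128)` then touches only `data[0..3]`, leaving the remaining 12 lanes filled, which is the behavior described for the {1, 1, 1, 1, 0, ..., 0} mask above.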

Also, I'll be happy to assist and point you to the bottlenecks that would benefit from custom ARM code, if needed.

Thank you! 😄

facebook-github-bot pushed a commit that referenced this issue Jul 29, 2024
Summary:
related: #2884

This PR contains the changes below:

- Add new optlevel `sve`
    - ARM SVE is an _extension_ of ARMv8, so IMO it should be treated similarly to AVX2
- Add targets for ARM SVE, `faiss_sve` and `swigfaiss_sve`
    - These targets are built when you pass `-DFAISS_OPT_LEVEL=sve` at build time
    - Design decision: don't fix the SVE register length.
        - The faiss python package is a "fat binary" (for example, the package for avx2 contains `_swigfaiss_avx2.so` and `_swigfaiss.so`)
        - SVE is a scalable instruction set (i.e. it doesn't fix the vector length), but we can actually specify the vector length at compile time
            - [with the `-msve-vector-length=` option](https://developer.arm.com/documentation/101726/4-0/Coding-for-Scalable-Vector-Extension--SVE-/SVE-Vector-Length-Specific--VLS--programming)
            - When this option is specified, the binary won't work correctly on a CPU whose vector length differs from the one specified at compile time
        - With fixed vector lengths, an SVE-supported faiss python package would contain 7 shared libraries: `_swigfaiss.so`, `_swigfaiss_sve.so`, `_swigfaiss_sve128.so`, `_swigfaiss_sve256.so`, `_swigfaiss_sve512.so`, `_swigfaiss_sve1024.so`, and `_swigfaiss_sve2048.so`. The package size would explode.
        - For these reasons, I don't specify the vector length at compile time, and `faiss_sve` detects the vector length at run time.
- Add a mechanism for detecting ARM SVE in the runtime environment and importing `swigfaiss_sve` dynamically
    - Currently this only supports Linux, but as far as I know there is no SVE environment running a non-Linux OS today
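The detect-and-import mechanism can be sketched as follows (an illustrative sketch, not the actual loader code: the `AT_HWCAP` constant and the SVE hwcap bit come from Linux's auxiliary vector for aarch64, and the module names follow the naming scheme mentioned above):

```python
import ctypes
import platform

AT_HWCAP = 16        # from <sys/auxv.h>
HWCAP_SVE = 1 << 22  # aarch64 hwcap bit advertising SVE support (Linux)

def cpu_has_sve() -> bool:
    """Return True if the running CPU advertises SVE (Linux/aarch64 only)."""
    if platform.system() != "Linux" or platform.machine() != "aarch64":
        return False
    try:
        libc = ctypes.CDLL(None)
        libc.getauxval.restype = ctypes.c_ulong
        libc.getauxval.argtypes = [ctypes.c_ulong]
        return bool(libc.getauxval(AT_HWCAP) & HWCAP_SVE)
    except (OSError, AttributeError):
        return False

def pick_module(has_sve: bool) -> str:
    # Prefer the SVE build when available; fall back to the portable one.
    return "_swigfaiss_sve" if has_sve else "_swigfaiss"
```

The loader would then import `pick_module(cpu_has_sve())` dynamically, so a single wheel serves both SVE and non-SVE machines.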

NOTE: I plan to make one more PR adding some SVE implementations after this PR is merged. This PR only adds the sve target.

Pull Request resolved: #2886

Reviewed By: ramilbakhshyiev

Differential Revision: D60386983

Pulled By: mengdilin

fbshipit-source-id: 7e66162ee53ce88fbfb6636e7bf705b44e6c3282
ketor pushed a commit to dingodb/faiss that referenced this issue Aug 20, 2024
facebook-github-bot pushed a commit that referenced this issue Oct 15, 2024
Summary:
related: #2884

I added some SVE implementations of:

- `code_distance`
    - `distance_single_code`
    - `distance_four_codes`
- `exhaustive_L2sqr_blas_cmax_sve`
- `fvec_inner_products_ny`
- `fvec_madd`
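For reference, the scalar semantics of two of these kernels can be sketched in pure Python (my reading of the faiss function contracts, not the actual SVE implementations; the `_ref` names are illustrative):

```python
def fvec_madd_ref(a, bf, b):
    """c[i] = a[i] + bf * b[i]: the multiply-add that fvec_madd vectorizes."""
    return [ai + bf * bi for ai, bi in zip(a, b)]

def fvec_inner_products_ny_ref(x, ys):
    """ip[j] = <x, ys[j]>: inner products of one query x against ny vectors."""
    return [sum(xi * yi for xi, yi in zip(x, y)) for y in ys]
```

The SVE versions compute the same results but process a full hardware vector of lanes per iteration, with the predicate handling the tail.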

## Evaluation result

I evaluated the search for SIFT1M dataset on AWS EC2 c7g.large and r8g.large instances.
`main` is the current (2e6551f) implementation.

### c7g.large (Graviton 3)

![g3_sift1m](https://github.com/user-attachments/assets/9c03cffa-72d1-4c77-9ae8-0ec0a5f5a6a5)

![g3_ivfpq](https://github.com/user-attachments/assets/4a8dfcc8-823c-4c31-ae79-3f4af9be28c8)

On Graviton 3, `IndexIVFPQ` improved particularly. In the best case (IndexIVFPQ + IndexFlatL2, M: 32), this PR is approx. 2.38-~~2.50~~**2.44**x faster than `main`.

- nprobe: 1, 0.069ms/query → 0.029ms/query
- nprobe: 4, 0.181ms/query → ~~0.074~~**0.075**ms/query
- nprobe: 16, 0.613ms/query → ~~0.245~~**0.251**ms/query

### r8g.large (Graviton 4)

![g4_sift1m](https://github.com/user-attachments/assets/e8510163-49d2-4143-babe-d406e2e40398)

![g4_ivfpq](https://github.com/user-attachments/assets/dc9a3ae0-a6b5-4a07-9898-c6aff372025c)

On Graviton 4, `IndexIVFPQ` with small `nprobe` improved especially. In the best case (IndexIVFPQ + IndexFlatL2, M: 8, nprobe: 1), this PR is approx. 1.33x faster than `main` (0.016 ms/query → 0.012 ms/query).

Pull Request resolved: #3933

Reviewed By: mengdilin

Differential Revision: D64249808

Pulled By: asadoughi

fbshipit-source-id: 8a625f0ab37732d330192599c851f864350885c4
aalekhpatel07 pushed a commit to aalekhpatel07/faiss that referenced this issue Oct 17, 2024
aalekhpatel07 pushed a commit to aalekhpatel07/faiss that referenced this issue Oct 17, 2024