Supporting ARM SVE, the newer extended vector instruction set for aarch64 #2884
Comments
Thanks for looking into this!
SVE is an abbreviation of Scalable Vector Extension.
I'm glad to hear that! 😄 I will make the PRs later.
@vorj thanks for your PR! I have a couple of questions, just to get some knowledge of SVE.
Also, I'll be happy to assist and point you to the bottlenecks that would benefit from custom ARM code, if needed.
@alexanderguzhva To answer your question,
Thank you! 😄
Summary: related: #2884

This PR contains the changes below:

- Add a new optlevel `sve`
  - ARM SVE is an _extension_ of ARMv8, so it should be treated similarly to AVX2 IMO
- Add targets for ARM SVE, `faiss_sve` and `swigfaiss_sve`
  - These targets are built when you pass `-DFAISS_OPT_LEVEL=sve` at build time
  - Design decision: don't fix the SVE register length.
    - The python package of faiss is a "fat binary" (for example, the package for avx2 contains `_swigfaiss_avx2.so` and `_swigfaiss.so`)
    - SVE is a scalable instruction set (= it doesn't fix the vector length), but we can actually specify the vector length at compile time.
      - [with the `-msve-vector-length=` option](https://developer.arm.com/documentation/101726/4-0/Coding-for-Scalable-Vector-Extension--SVE-/SVE-Vector-Length-Specific--VLS--programming)
      - When this option is specified, the binary can't work correctly on a CPU whose vector length differs from the one specified at compile time
    - If we used fixed vector lengths, the SVE-supported faiss python package would contain 7 shared libraries: `_swigfaiss.so`, `_swigfaiss_sve.so`, `_swigfaiss_sve128.so`, `_swigfaiss_sve256.so`, `_swigfaiss_sve512.so`, `_swigfaiss_sve1024.so`, and `_swigfaiss_sve2048.so`. The package size would explode.
    - For these reasons, I don't specify the vector length at compile time, and `faiss_sve` detects the vector length at run time.
- Add a mechanism for detecting ARM SVE in the runtime environment and importing `swigfaiss_sve` dynamically
  - Currently it only supports Linux, but as far as I know there is no SVE environment on a non-Linux OS at the moment

NOTE: I plan to make one more PR that adds some SVE implementations after this PR is merged. This PR only adds the sve target.

Pull Request resolved: #2886
Reviewed By: ramilbakhshyiev
Differential Revision: D60386983
Pulled By: mengdilin
fbshipit-source-id: 7e66162ee53ce88fbfb6636e7bf705b44e6c3282
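To illustrate the runtime-detection step described in this PR summary, here is a minimal sketch, not the loader that was merged. It assumes that on Linux the kernel lists `sve` on a `Features` line of `/proc/cpuinfo`; the function names `cpuinfo_has_sve`, `sve_available`, and `pick_module_name` are hypothetical and only serve the example.

```python
import platform


def cpuinfo_has_sve(cpuinfo_text):
    """Return True if a 'Features' line of /proc/cpuinfo-style text lists 'sve'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Compare whole tokens so that e.g. 'sve2' alone does not count as 'sve'.
        if sep and key.strip() == "Features" and "sve" in value.split():
            return True
    return False


def sve_available(cpuinfo_path="/proc/cpuinfo"):
    """Best-effort check; only Linux on aarch64 is considered (assumption)."""
    if platform.system() != "Linux" or platform.machine() != "aarch64":
        return False
    try:
        with open(cpuinfo_path) as f:
            return cpuinfo_has_sve(f.read())
    except OSError:
        return False


def pick_module_name(has_sve):
    """Choose which extension module to import (names taken from the PR text)."""
    return "swigfaiss_sve" if has_sve else "swigfaiss"


if __name__ == "__main__":
    print(pick_module_name(sve_available()))
```

A real loader would probably also fall back to the generic module if importing the SVE build fails, since detection and packaging can disagree.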
Summary: related: #2884

I added SVE implementations of:

- `code_distance`
  - `distance_single_code`
  - `distance_four_codes`
- `exhaustive_L2sqr_blas_cmax_sve`
- `fvec_inner_products_ny`
- `fvec_madd`

## Evaluation result

I evaluated search on the SIFT1M dataset on AWS EC2 c7g.large and r8g.large instances. `main` is the current (2e6551f) implementation.

### c7g.large (Graviton 3)

On Graviton 3, `IndexIVFPQ` in particular has improved. In the best case (IndexIVFPQ + IndexFlatL2, M: 32), this PR is approx. 2.38-~~2.50~~**2.44**x faster than `main`.

- nprobe: 1, 0.069ms/query → 0.029ms/query
- nprobe: 4, 0.181ms/query → ~~0.074~~**0.075**ms/query
- nprobe: 16, 0.613ms/query → ~~0.245~~**0.251**ms/query

### r8g.large (Graviton 4)

On Graviton 4, `IndexIVFPQ` has improved especially for tiny `nprobe`. In the best case (IndexIVFPQ + IndexFlatL2, M: 8, nprobe: 1), this PR is approx. 1.33x faster than `main` (0.016ms/query → 0.012ms/query).

Pull Request resolved: #3933
Reviewed By: mengdilin
Differential Revision: D64249808
Pulled By: asadoughi
fbshipit-source-id: 8a625f0ab37732d330192599c851f864350885c4
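The speed-up factors quoted in this evaluation are the ratio of the `main` latency to the SVE latency. The short sketch below recomputes them from the corrected ms/query figures reported above; the helper names `speedup` and `speedups` are hypothetical, and nothing here is measured anew.

```python
def speedup(before_ms, after_ms):
    """Ratio of the baseline latency to the new latency (higher is faster)."""
    return before_ms / after_ms


def speedups(table):
    """Map each nprobe to its speed-up, given {nprobe: (before_ms, after_ms)}."""
    return {nprobe: speedup(b, a) for nprobe, (b, a) in table.items()}


# ms/query figures copied from the evaluation text (IndexIVFPQ + IndexFlatL2, M: 32).
graviton3 = {1: (0.069, 0.029), 4: (0.181, 0.075), 16: (0.613, 0.251)}
# Best case on Graviton 4 (IndexIVFPQ + IndexFlatL2, M: 8, nprobe: 1).
graviton4 = {1: (0.016, 0.012)}

if __name__ == "__main__":
    for name, table in (("Graviton 3", graviton3), ("Graviton 4", graviton4)):
        for nprobe, ratio in speedups(table).items():
            print(f"{name}, nprobe {nprobe}: {ratio:.2f}x")
```

With the corrected figures the Graviton 3 ratios fall between roughly 2.38x and 2.44x, which is consistent with the range stated in the summary.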
Summary
Dear @mdouze and all,
ARM SVE is a newer extended vector instruction set than NEON and is supported on CPUs such as AWS Graviton3 and Fujitsu A64FX.
I've added SVE support to faiss, implemented some functions with SVE, and compared their execution times.
My implementation appears to improve performance in some environments.
This is just a first implementation to show what SVE can do, and I plan to implement SVE versions of other functions that have not yet been ported.
It may not be possible to test this on CircleCI at the moment, but would you mind if I submit it as a PR?
Platform
OS: Ubuntu 22.04
Faiss version: a3296f4, and mine
Installed from: compiled by myself
Faiss compilation options:
cmake -B build -DFAISS_ENABLE_GPU=OFF -DPython_EXECUTABLE=$(which python3) -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON -DFAISS_OPT_LEVEL=sve
(`-DFAISS_OPT_LEVEL=sve` is a new optlevel introduced by my changes)
Running on:
Interface:
Reproduction instructions
I am posting only the SIFT1M search results. If you need more detailed information, please let me know.
- `original` is the current (a3296f4) implementation
- `SVE` is the result of my implementation supporting ARM SVE

The above image illustrates the speed-up ratio.
`SVE` is approx. 2.26x faster than `original` (IndexIVFPQ + IndexHNSWFlat, M: 32, nprobe: 16)

- `original`: 0.618 ms
- `SVE`: 0.274 ms