
Supporting ARM SVE, the newer extended vector instruction set for aarch64 #2884

Open
2 of 4 tasks
vorj opened this issue May 31, 2023 · 4 comments

vorj commented May 31, 2023

Summary

Dear @mdouze and all,

ARM SVE is a newer vector instruction set than NEON, supported on CPUs such as AWS Graviton3 and the Fujitsu A64FX.
I've added SVE support and some functions implemented with SVE to faiss, then compared their execution times.
My implementation appears to improve performance in some environments.
This is just a first implementation to show what SVE can do; I plan to implement SVE versions of other functions that have not yet been ported.

It might not be possible to test this on Circle CI currently; even so, would you mind if I submit this as a PR?

Platform

OS: Ubuntu 22.04

Faiss version: a3296f4, and mine

Installed from: compiled by myself

Faiss compilation options: cmake -B build -DFAISS_ENABLE_GPU=OFF -DPython_EXECUTABLE=$(which python3) -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON -DFAISS_OPT_LEVEL=sve (-DFAISS_OPT_LEVEL=sve is a new opt level introduced by my changes)

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Reproduction instructions

I am only posting the results for searching SIFT1M. If you need more detailed information, please let me know.

benchmark result

  • Evaluated by running faiss on an AWS EC2 c7g.large instance
  • original is the current (a3296f4) implementation
  • SVE is the result of my implementation supporting ARM SVE

(image: benchmark chart of speedup ratio, SVE vs. original)

The above image illustrates the speedup ratio.

  • In the best case, SVE is approx. 2.26x faster than the original (IndexIVFPQ + IndexHNSWFlat, M: 32, nprobe: 16)
    • original : 0.618 ms
    • SVE : 0.274 ms

mdouze commented May 31, 2023

Thanks for looking into this!
Do I understand correctly that this is with a 512-bit SIMD width?
Indeed we should have a way to integrate code for hardware that is not supported by CircleCI (AVX512 being the other example).
So we welcome a PR for this functionality.


vorj commented May 31, 2023

@mdouze

Do I understand correctly that this is with a 512-bit SIMD width?

SVE is an abbreviation of Scalable Vector Extension.
In this context, scalable means that the vector length is not fixed by the instruction set.
The vector length is chosen by each CPU; for example, the A64FX has 512-bit SVE registers, while Graviton3 has 256-bit SVE registers.
So the programmer writes length-independent code, and the same binary works on each CPU by detecting the actual vector length at run time.
The length of an SVE register is 128*n bits, in the range [128, 2048] bits.
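This length-agnostic style can be sketched with a small simulation (plain Python standing in for SVE intrinsics, so this is only an illustration): `vl` plays the role of the runtime vector length in elements, and the bounded lane loop plays the role of a predicate such as the one `svwhilelt` produces. The same code runs unchanged whether `vl` is 4 (128-bit registers) or 16 (512-bit registers).

```python
def dot_vla(a, b, vl):
    """Vector-length-agnostic dot product (simulation of SVE-style code).

    `vl` is the runtime vector length in 32-bit elements; the loop never
    hardcodes a lane count, mirroring how one SVE binary adapts to any CPU.
    """
    n = len(a)
    acc = 0.0
    i = 0
    while i < n:
        # Only in-bounds lanes are active, like an svwhilelt predicate;
        # no separate peel loop is needed for the tail.
        lanes = min(vl, n - i)
        for j in range(lanes):
            acc += a[i + j] * b[i + j]
        i += vl
    return acc
```

For example, `dot_vla(x, y, 4)` and `dot_vla(x, y, 16)` give the same result, just as the same SVE binary would on Graviton3 and A64FX.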

So we welcome a PR for this functionality.

I'm glad to hear that! 😄 I will make the PRs later.

@vorj vorj mentioned this issue May 31, 2023
@alexanderguzhva

@vorj thanks for your PR! I have a couple of questions, just to get some understanding of SVE.

  1. Do I understand correctly that if, say, the SVE vector length is 512 bits, it is still possible to operate on 256 bits and 128 bits, just as AVX-512 extends AVX2, which extends AVX?
  2. Do I understand correctly that most of the speedup comes from faster distance computations (the fvec_L2sqr_* and fvec_inner_product_* functions)?

Also, I'll be happy to assist and point you to the bottlenecks that would benefit from custom ARM code, if needed.


vorj commented May 31, 2023

@alexanderguzhva To answer your question,

  1. Suppose the vector length is 512 bits and the element type is 32-bit (so the vector holds 16 × 32-bit lanes). Below, I represent the predicate mask as {mask0, mask1, ..., mask15}. If you pass the mask {1, 1, 1, 1, 0, 0, 0, 0, ..., 0}, you load/compute/store 4 × 32-bit (= 128 bits) of data; {1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ..., 0} gives 256 bits. Of course, this is slower than using the full length. Alternatively, you can still use Advanced SIMD (NEON) as a 128/64-bit SIMD instruction set; in that case you have to write a peel loop (or something like it) for data whose length is not a multiple of 4, in the same manner as before.
  2. At least for this PR, mostly yes. I plan to make another PR containing SVE implementations of `code_distance` and `exhaustive_L2sqr_blas`.
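The predicated access described in point 1 can be sketched in plain Python (a simulation of a hypothetical 16-lane vector, not real intrinsics): lanes where the mask is 0 contribute nothing, which is how a 512-bit register can be used for 128 or 256 bits of data.

```python
def masked_load(memory, base, mask, fill=0.0):
    """Simulated predicated load: active lanes read from memory,
    inactive lanes receive `fill` (zeroing-style predication)."""
    return [memory[base + j] if m else fill for j, m in enumerate(mask)]

# A 512-bit vector of 32-bit elements has 16 lanes; this mask activates
# only the first 4 lanes, i.e. 128 bits of data out of the register.
mask_128 = [1] * 4 + [0] * 12
```

`masked_load(data, 0, mask_128)` then touches only `data[0..3]`, leaving the remaining 12 lanes filled, which is the behavior described for the {1, 1, 1, 1, 0, ..., 0} mask above.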

Also, I'll be happy to assist and point you to the bottlenecks that would benefit from custom ARM code, if needed.

Thank you! 😄

facebook-github-bot pushed a commit that referenced this issue Jul 29, 2024
Summary:
related: #2884

This PR contains the changes below:

- Add new optlevel `sve`
    - ARM SVE is an _extension_ of ARMv8, so IMO it should be treated similarly to AVX2
- Add targets for ARM SVE, `faiss_sve` and `swigfaiss_sve`
    - These targets are built when you pass `-DFAISS_OPT_LEVEL=sve` at build time
    - Design decision: don't fix the SVE register length.
        - The faiss python package is a "fat binary" (for example, the package for avx2 contains `_swigfaiss_avx2.so` and `_swigfaiss.so`)
        - SVE is a scalable instruction set (i.e. it doesn't fix the vector length), but we can actually specify the vector length at compile time
            - [with the `-msve-vector-length=` option](https://developer.arm.com/documentation/101726/4-0/Coding-for-Scalable-Vector-Extension--SVE-/SVE-Vector-Length-Specific--VLS--programming)
            - When this option is specified, the binary won't work correctly on a CPU whose vector length differs from the one specified at compile time
        - With fixed vector lengths, an SVE-supported faiss python package would contain 7 shared libraries: `_swigfaiss.so`, `_swigfaiss_sve.so`, `_swigfaiss_sve128.so`, `_swigfaiss_sve256.so`, `_swigfaiss_sve512.so`, `_swigfaiss_sve1024.so`, and `_swigfaiss_sve2048.so`. The package size would explode.
        - For these reasons, I don't specify the vector length at compile time, and `faiss_sve` detects the vector length at run time.
- Add a mechanism for detecting ARM SVE in the runtime environment and importing `swigfaiss_sve` dynamically
    - Currently this only supports Linux, but as far as I know there is no SVE environment running a non-Linux OS today
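The detect-and-import mechanism can be sketched as follows (an illustrative sketch, not the actual loader code: the `AT_HWCAP` constant and the SVE hwcap bit come from Linux's auxiliary vector for aarch64, and the module names follow the naming scheme mentioned above):

```python
import ctypes
import platform

AT_HWCAP = 16        # from <sys/auxv.h>
HWCAP_SVE = 1 << 22  # aarch64 hwcap bit advertising SVE support (Linux)

def cpu_has_sve() -> bool:
    """Return True if the running CPU advertises SVE (Linux/aarch64 only)."""
    if platform.system() != "Linux" or platform.machine() != "aarch64":
        return False
    try:
        libc = ctypes.CDLL(None)
        libc.getauxval.restype = ctypes.c_ulong
        libc.getauxval.argtypes = [ctypes.c_ulong]
        return bool(libc.getauxval(AT_HWCAP) & HWCAP_SVE)
    except (OSError, AttributeError):
        return False

def pick_module(has_sve: bool) -> str:
    # Prefer the SVE build when available; fall back to the portable one.
    return "_swigfaiss_sve" if has_sve else "_swigfaiss"
```

The loader would then import `pick_module(cpu_has_sve())` dynamically, so a single wheel serves both SVE and non-SVE machines.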

NOTE: I plan to make one more PR adding some SVE implementations after this PR is merged. This PR only adds the sve target.

Pull Request resolved: #2886

Reviewed By: ramilbakhshyiev

Differential Revision: D60386983

Pulled By: mengdilin

fbshipit-source-id: 7e66162ee53ce88fbfb6636e7bf705b44e6c3282
ketor pushed a commit to dingodb/faiss that referenced this issue Aug 20, 2024
facebook-github-bot pushed a commit that referenced this issue Oct 15, 2024
Summary:
related: #2884

I added some SVE implementations of:

- `code_distance`
    - `distance_single_code`
    - `distance_four_codes`
- `exhaustive_L2sqr_blas_cmax_sve`
- `fvec_inner_products_ny`
- `fvec_madd`
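For reference, the scalar semantics of two of these kernels can be sketched in pure Python (my reading of the faiss function contracts, not the actual SVE implementations; the `_ref` names are illustrative):

```python
def fvec_madd_ref(a, bf, b):
    """c[i] = a[i] + bf * b[i]: the multiply-add that fvec_madd vectorizes."""
    return [ai + bf * bi for ai, bi in zip(a, b)]

def fvec_inner_products_ny_ref(x, ys):
    """ip[j] = <x, ys[j]>: inner products of one query x against ny vectors."""
    return [sum(xi * yi for xi, yi in zip(x, y)) for y in ys]
```

The SVE versions compute the same results but process a full hardware vector of lanes per iteration, with the predicate handling the tail.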

## Evaluation result

I evaluated the search for SIFT1M dataset on AWS EC2 c7g.large and r8g.large instances.
`main` is the current (2e6551f) implementation.

### c7g.large (Graviton 3)

![g3_sift1m](https://github.com/user-attachments/assets/9c03cffa-72d1-4c77-9ae8-0ec0a5f5a6a5)

![g3_ivfpq](https://github.com/user-attachments/assets/4a8dfcc8-823c-4c31-ae79-3f4af9be28c8)

On Graviton 3, `IndexIVFPQ` improved particularly. In the best case (IndexIVFPQ + IndexFlatL2, M: 32), this PR is approx. 2.38-~~2.50~~**2.44**x faster than `main`.

- nprobe: 1, 0.069ms/query → 0.029ms/query
- nprobe: 4, 0.181ms/query → ~~0.074~~**0.075**ms/query
- nprobe: 16, 0.613ms/query → ~~0.245~~**0.251**ms/query

### r8g.large (Graviton 4)

![g4_sift1m](https://github.com/user-attachments/assets/e8510163-49d2-4143-babe-d406e2e40398)

![g4_ivfpq](https://github.com/user-attachments/assets/dc9a3ae0-a6b5-4a07-9898-c6aff372025c)

On Graviton 4, `IndexIVFPQ` with small `nprobe` improved especially. In the best case (IndexIVFPQ + IndexFlatL2, M: 8, nprobe: 1), this PR is approx. 1.33x faster than `main` (0.016 ms/query → 0.012 ms/query).

Pull Request resolved: #3933

Reviewed By: mengdilin

Differential Revision: D64249808

Pulled By: asadoughi

fbshipit-source-id: 8a625f0ab37732d330192599c851f864350885c4
aalekhpatel07 pushed a commit to aalekhpatel07/faiss that referenced this issue Oct 17, 2024
aalekhpatel07 pushed a commit to aalekhpatel07/faiss that referenced this issue Oct 17, 2024