-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DYNAMIC_ARCH support for ARM64 #1829
Add DYNAMIC_ARCH support for ARM64 #1829
Conversation
Remove XGENE1 target as the implementation for the same is incomplete. Moreover whoever wishes to use on XGENE1 can use the generic ARMV8 target as there are no XGENE1 specific optimizations in OpenBLAS.
Remove the runtime setting of P, Q, R parameters for targets ARMV8, THUNDERX2T99. Instead set them as constants in param.h at compile time.
Enable DYNAMIC_ARCH feature on ARM64. This patch uses the cpuid feature in linux kernel to detect the core type at runtime (https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt). If this feature is missing in kernel, then the user should use the OPENBLAS_CORETYPE env variable to select the desired core type.
Thanks a lot for this. Github status reports problem with a storage system today, which is probably why no CI tests have been run yet. (Not that those could do much as we have no arm64 builds set up) |
The commit does not touch any x86 path. So I dont expect it cause any issues with CI tests (unless I have completely missed something). |
Is it possible to take cache sizes, too? I know lscpu does that, but I'm not sure you can do it from here. I like the idea of defaulting to ARMv8 and making it robust enough, but I'd like it to be even more generic (ex. assume FP/SIMD are always available, regardless of the core). In time, I'd like to see us moving away from core-name check and more towards features (cache sizes, DTB, features available) than assuming a core will always have a specific functionality. Custom cores (like Qualcomm or Huawei) would have to add support for their cores (like you did) if they want to get anything more performing that base ARMv8. |
Not sure whats the correct way to determine the cache sizes at runtime on aarch64. Even if we do determine the cache sizes at runtime, it will be difficult for me to come up with a good generic formula for determining the blocking sizes that suits all armv8 implementations. As of now, the aarch64 implementations in OpenBLAS are mainly targeted at ThunderX2 (though it also benefits other cores). Hence most of the code written is assuming a custom core rather than a generic core like CortexA57. |
@ashwinyes my point was just that I intend to hold off on the merge until github stops playing hide and seek with PRs and issues again. @rengolin x86 code does query cache sizes via cpuid(), seems for arm nobody figured out how to do more than parse /proc/cpuinfo (if available) |
@martin-frbg I was just answering @rengolin questions. github has been playing hide and seek with me too :). No hurries to merge the PR. The way to determine cpuid on ARM64 without /proc/cpuinfo can be seen within this patch. It requires kernel cpuid feature support (https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt). The discussion was rather on how to determine the cache sizes on ARM64. |
You mean compile-time, right? IIUC, those are defines that get used in the code, rather than dynamically changing the runtime library behaviour.
I don't think there is, that's why I said, even for A57, we shouldn't play the guessing game. A57 is a template that vendors can implement in different ways, so there is no guarantee that cache sizes will be anything. This can be done for TX1/2 because the vendor controls the implementation (same for Huawei, Qualcomm and others). The best thing, though, would make all of that irrelevant by implementing cpuid for your cores in the kernel and upstream it. :)
I have to make it work well with, at the very least, Huawei and Qualcomm server hardware, too. We should not assume they'll be fine by passing "thunderx2" in the build system. The two points I made in PR #1821 are pertinent here:
My issue is not with the ThunderX/2 implementation, but with the A57 distinction from ARMv8, which apart from what's not right (fixed cache sizes, DTB), they're mandatory for ARMv8 as a whole. Removing A57 altogether would probably make more sense right now, and make ARMv8 behave similar for all ARMv8. I have done comparisons for "make TARGET=ARMV8" and "make TARGET=CORTEXA57" on A53, A57, Huawei, Qualcomm and ThunderX2 and there is no functionality difference at all in OpenBLAS's test suite. A53 performance, however, is consistently worse if building ARMv8 then A57. On the test I've done, of the 76 benchmarks in OpenBLAS, 46 are either identical or comparable, 24 are either better or much better (from 20% to 10x) and 6 are better with ARMv8. |
Runtime detection of cache sizes would be necessary (or at least desirable) for the "dynamic arch" case - and I guess even for the case where a library compiled for just "ARMV8" is copied to a system with "similar" but not identical cpu. (And is it just me or is there a contradiction between your suggestion to remove A57 target and your observation that A53 performs worse without it ?) |
Sorry, 24 are better with A57, 6 with ARMv8. I'm proposing to remove A57 once we make ARMv8 equal or better with the code changes we are proposing in these pull requests (and comments). |
If DYNAMIC_ARCH is not used, then the cpu core detection logic and the defines in the cpuid_*.c files in the root directory is used at compile time to detect cores and compile in specific features. But some of these defines are totally not used in the code anywhere (eg: L1_DATA_SIZE) while some are used only for certain architectures (eg: L2_SIZE). While cleaning up, we should remove all defines which are useless for ARM64 targets. Then detect the cache sizes at compile time and use those cache sizes to set the blocking parameters in param.h or driver/others/parameter.c . If DYNAMIC_ARCH is used, then core detection is not done at compile time. Instead it compiles for all cores defined in the DYNAMIC_CORES variable in the Makefile.system. Then at runtime, it uses the logic in driver/others/dynamic*.c to detect the core at run time. The cache size detection and setting of blocking parameters is done in kernel/setparam-ref.c.
Before PR #1821, the ARMV8 target was using the C files in kernel/arm/ *.c for most APIs. Using NEON versions in kernel/arm64/ *.S will be always better. The CortexA57 target was using the NEON kernels, hence you see the performance difference. Talking more, for a generic armv8 target, the different neon kernels in kernel/arm64 should not make much of a difference. For. Eg. If we use dgemm_kernel_8x4_thunderx2t99.S or dgemm_kernel_8x4.S on CortexA57, I dont think it will make much difference.
If we are doing this, then I would suggest to remove CORTEXA57 target and keep only ARMV8, THUNDERX and THUNDERX2 targets. Let ARMV8 use the neon kernels of THUNDERX2 (no change required in current kernels for ARMV8 target). But the cache sizes and blocking parameters should be detected and set at compile time for DYNAMIC_ARCH=0 and at runtime for DYNAMIC_ARCH=1 as explained in the first part of this post. Also other feature detection like SVE (may be in future), AdvSIMD etc. can be done as part of this.
Implementing cpuid only in the compute kernels may not be enough as we also need to set the blocking parameters based on the cache sizes as explained above. |
Awesome! Agree with everything you said. :-) I will do some benchmarks with the new patches (previous was last Monday) but that should not hold this PR. Let's sync about the cleanup. Thanks! |
Android does not provide cpuid module, only /proc/cpuinfo |
@brada4 Whats the kernel version? The cpuid feature will be available in recent linux kernel versions only. What does Features under /proc/cpuinfo show ? Does it show cpuid as below? processor : 255 If the feature is not available, the user can manually select the core type using the OPENBLAS_CORETYPE environment variable. |
It is android one 8.1
|
The kernel version is old. I tried on a Oneplus 5T with android oreo 8.1 with kernel version 4.4.78 and even that does not have it. The system on which it was available was a ThunderX2 machine with kernel 4.14.52. I think the feature was first available in kernel 4.10 (https://lkml.org/lkml/2017/1/9/536), but not sure. So unless android moves to 4.10+ kernel, it won't be available on android (unless someone backports it). |
These are but a few examples and are not only related to Android vs Linux. CPUs are not required to provide any information, and they generally provide different parts of the information, if at all. If Android exposes cpuid or not, that's largely irrelevant. You should use the tools you have at your disposal and default to some sane values if that info is not available. If you want to try multiple ways before failing, that's also up to you, but you should never assume the information is there to begin with. Modern Android uses 4.4 kernel which is still too old, but most old ones are still 3.x. I agree with Ashwin that if we have the info, we use it, if not, we tell the user they better select by hand. In the future, if we want to try "harder", one can parse the output of |
Actually its 4.11 kernel (https://lkml.org/lkml/2017/1/10/816) |
The link yuyichao dug up in #989 mentions another method that uses an NDK function, but I guess this is just another API to access the (unimplemented) kernel function. |
Android has API to /proc/cpuinfo parser |
No description provided.