Add DYNAMIC_ARCH support for ARM64 #1829

ashwinyes · 2018-10-22T09:51:05Z

No description provided.

Remove XGENE1 target as the implementation for the same is incomplete. Moreover whoever wishes to use on XGENE1 can use the generic ARMV8 target as there are no XGENE1 specific optimizations in OpenBLAS.

Remove the runtime setting of P, Q, R parameters for targets ARMV8, THUNDERX2T99. Instead set them as constants in param.h at compile time.

Enable DYNAMIC_ARCH feature on ARM64. This patch uses the cpuid feature in linux kernel to detect the core type at runtime (https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt). If this feature is missing in kernel, then the user should use the OPENBLAS_CORETYPE env variable to select the desired core type.

martin-frbg · 2018-10-22T14:12:32Z

Thanks a lot for this. Github status reports problem with a storage system today, which is probably why no CI tests have been run yet. (Not that those could do much as we have no arm64 builds set up)

ashwinyes · 2018-10-22T14:17:14Z

The commit does not touch any x86 path. So I dont expect it cause any issues with CI tests (unless I have completely missed something).

rengolin · 2018-10-22T14:21:25Z

Is it possible to take cache sizes, too? I know lscpu does that, but I'm not sure you can do it from here.

I like the idea of defaulting to ARMv8 and making it robust enough, but I'd like it to be even more generic (ex. assume FP/SIMD are always available, regardless of the core).

In time, I'd like to see us moving away from core-name check and more towards features (cache sizes, DTB, features available) than assuming a core will always have a specific functionality.

Custom cores (like Qualcomm or Huawei) would have to add support for their cores (like you did) if they want to get anything more performing that base ARMv8.

ashwinyes · 2018-10-22T15:07:20Z

Not sure whats the correct way to determine the cache sizes at runtime on aarch64. Even if we do determine the cache sizes at runtime, it will be difficult for me to come up with a good generic formula for determining the blocking sizes that suits all armv8 implementations.

As of now, the aarch64 implementations in OpenBLAS are mainly targeted at ThunderX2 (though it also benefits other cores). Hence most of the code written is assuming a custom core rather than a generic core like CortexA57.

martin-frbg · 2018-10-22T15:10:22Z

@ashwinyes my point was just that I intend to hold off on the merge until github stops playing hide and seek with PRs and issues again. @rengolin x86 code does query cache sizes via cpuid(), seems for arm nobody figured out how to do more than parse /proc/cpuinfo (if available)

ashwinyes · 2018-10-22T15:16:42Z

@martin-frbg I was just answering @rengolin questions. github has been playing hide and seek with me too :). No hurries to merge the PR.

The way to determine cpuid on ARM64 without /proc/cpuinfo can be seen within this patch. It requires kernel cpuid feature support (https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt). The discussion was rather on how to determine the cache sizes on ARM64.

rengolin · 2018-10-22T16:10:42Z

Not sure whats the correct way to determine the cache sizes at runtime on aarch64.

You mean compile-time, right? IIUC, those are defines that get used in the code, rather than dynamically changing the runtime library behaviour.

Even if we do determine the cache sizes at runtime, it will be difficult for me to come up with a good generic formula for determining the blocking sizes that suits all armv8 implementations.

I don't think there is, that's why I said, even for A57, we shouldn't play the guessing game. A57 is a template that vendors can implement in different ways, so there is no guarantee that cache sizes will be anything.

This can be done for TX1/2 because the vendor controls the implementation (same for Huawei, Qualcomm and others).

The best thing, though, would make all of that irrelevant by implementing cpuid for your cores in the kernel and upstream it. :)

As of now, the aarch64 implementations in OpenBLAS are mainly targeted at ThunderX2 (though it also benefits other cores).

I have to make it work well with, at the very least, Huawei and Qualcomm server hardware, too. We should not assume they'll be fine by passing "thunderx2" in the build system.

The two points I made in PR #1821 are pertinent here:

If the changes benefit others (and are truly generic): move them to "armv8"
If a feature is common to two or more sub-arches: check feature, not core name

Hence most of the code written is assuming a custom core rather than a generic core like CortexA57.

My issue is not with the ThunderX/2 implementation, but with the A57 distinction from ARMv8, which apart from what's not right (fixed cache sizes, DTB), they're mandatory for ARMv8 as a whole.

Removing A57 altogether would probably make more sense right now, and make ARMv8 behave similar for all ARMv8.

I have done comparisons for "make TARGET=ARMV8" and "make TARGET=CORTEXA57" on A53, A57, Huawei, Qualcomm and ThunderX2 and there is no functionality difference at all in OpenBLAS's test suite.

A53 performance, however, is consistently worse if building ARMv8 then A57. On the test I've done, of the 76 benchmarks in OpenBLAS, 46 are either identical or comparable, 24 are either better or much better (from 20% to 10x) and 6 are better with ARMv8.

martin-frbg · 2018-10-22T16:21:39Z

Runtime detection of cache sizes would be necessary (or at least desirable) for the "dynamic arch" case - and I guess even for the case where a library compiled for just "ARMV8" is copied to a system with "similar" but not identical cpu. (And is it just me or is there a contradiction between your suggestion to remove A57 target and your observation that A53 performs worse without it ?)

rengolin · 2018-10-22T17:35:09Z

Sorry, 24 are better with A57, 6 with ARMv8. I'm proposing to remove A57 once we make ARMv8 equal or better with the code changes we are proposing in these pull requests (and comments).

ashwinyes · 2018-10-22T17:43:09Z

You mean compile-time, right? IIUC, those are defines that get used in the code, rather than dynamically changing the runtime library behaviour.
Its in fact both.

If DYNAMIC_ARCH is not used, then the cpu core detection logic and the defines in the cpuid_*.c files in the root directory is used at compile time to detect cores and compile in specific features. But some of these defines are totally not used in the code anywhere (eg: L1_DATA_SIZE) while some are used only for certain architectures (eg: L2_SIZE). While cleaning up, we should remove all defines which are useless for ARM64 targets. Then detect the cache sizes at compile time and use those cache sizes to set the blocking parameters in param.h or driver/others/parameter.c .

If DYNAMIC_ARCH is used, then core detection is not done at compile time. Instead it compiles for all cores defined in the DYNAMIC_CORES variable in the Makefile.system. Then at runtime, it uses the logic in driver/others/dynamic*.c to detect the core at run time. The cache size detection and setting of blocking parameters is done in kernel/setparam-ref.c.

A53 performance, however, is consistently worse if building ARMv8 then A57. On the test I've done, of the 76 benchmarks in OpenBLAS, 46 are either identical or comparable, 24 are either better or much better (from 20% to 10x) and 6 are better with ARMv8.

Before PR #1821, the ARMV8 target was using the C files in kernel/arm/ *.c for most APIs. Using NEON versions in kernel/arm64/ *.S will be always better. The CortexA57 target was using the NEON kernels, hence you see the performance difference.

Talking more, for a generic armv8 target, the different neon kernels in kernel/arm64 should not make much of a difference. For. Eg. If we use dgemm_kernel_8x4_thunderx2t99.S or dgemm_kernel_8x4.S on CortexA57, I dont think it will make much difference.

Removing A57 altogether would probably make more sense right now, and make ARMv8 behave similar for all ARMv8.

If we are doing this, then I would suggest to remove CORTEXA57 target and keep only ARMV8, THUNDERX and THUNDERX2 targets. Let ARMV8 use the neon kernels of THUNDERX2 (no change required in current kernels for ARMV8 target).

But the cache sizes and blocking parameters should be detected and set at compile time for DYNAMIC_ARCH=0 and at runtime for DYNAMIC_ARCH=1 as explained in the first part of this post. Also other feature detection like SVE (may be in future), AdvSIMD etc. can be done as part of this.

The best thing, though, would make all of that irrelevant by implementing cpuid for your cores in the kernel and upstream it. :)

Implementing cpuid only in the compute kernels may not be enough as we also need to set the blocking parameters based on the cache sizes as explained above.

rengolin · 2018-10-22T17:47:54Z

Awesome! Agree with everything you said. :-) I will do some benchmarks with the new patches (previous was last Monday) but that should not hold this PR. Let's sync about the cleanup. Thanks!

brada4 · 2018-10-25T07:28:00Z

Android does not provide cpuid module, only /proc/cpuinfo

ashwinyes · 2018-10-25T07:52:47Z

@brada4 Whats the kernel version? The cpuid feature will be available in recent linux kernel versions only.

What does Features under /proc/cpuinfo show ? Does it show cpuid as below?

processor : 255
BogoMIPS : 400.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part : 0x0af
CPU revision : 1

If the feature is not available, the user can manually select the core type using the OPENBLAS_CORETYPE environment variable.

brada4 · 2018-10-25T10:30:36Z

It is android one 8.1

cat /proc/cpuinfo
...
processor       : 7
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt lpae evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4
 
Hardware        : Qualcomm Technologies, Inc MSM8953
tissot_sprout:/ $ uname -a
Linux localhost 3.18.71-perf-g022146d #1 SMP PREEMPT Mon Oct 15 19:37:49 WIB 2018 armv8l
tissot_sprout:/ $

ashwinyes · 2018-10-25T10:58:42Z

Linux localhost 3.18.71-perf-g022146d #1 SMP PREEMPT Mon Oct 15 19:37:49 WIB 2018 armv8l

The kernel version is old.

I tried on a Oneplus 5T with android oreo 8.1 with kernel version 4.4.78 and even that does not have it. The system on which it was available was a ThunderX2 machine with kernel 4.14.52. I think the feature was first available in kernel 4.10 (https://lkml.org/lkml/2017/1/9/536), but not sure.

So unless android moves to 4.10+ kernel, it won't be available on android (unless someone backports it).

rengolin · 2018-10-25T11:00:07Z

These are but a few examples and are not only related to Android vs Linux. CPUs are not required to provide any information, and they generally provide different parts of the information, if at all.

If Android exposes cpuid or not, that's largely irrelevant. You should use the tools you have at your disposal and default to some sane values if that info is not available. If you want to try multiple ways before failing, that's also up to you, but you should never assume the information is there to begin with.

Modern Android uses 4.4 kernel which is still too old, but most old ones are still 3.x. I agree with Ashwin that if we have the info, we use it, if not, we tell the user they better select by hand.

In the future, if we want to try "harder", one can parse the output of cpuinfo and look for more clues. But that doesn't mean the info will be there at all. Also, you'll more easily find part/variant/revision on standard Arm cores (A53, A57, etc) than custom ones (Apple, Qualcomm, Samsung), but YMMV by a lot.

ashwinyes · 2018-10-25T11:08:27Z

So unless android moves to 4.10+ kernel, it won't be available on android (unless someone backports it).

Actually its 4.11 kernel (https://lkml.org/lkml/2017/1/10/816)

martin-frbg · 2018-10-25T11:41:40Z

The link yuyichao dug up in #989 mentions another method that uses an NDK function, but I guess this is just another API to access the (unimplemented) kernel function.
(And I concur that the fallback to a sane default is no worse than what we already do on x86 hardware - and in practice the performance impact from using a non-ideal kernel on aarch64 is probably much smaller than on x86_64)

brada4 · 2018-10-25T12:56:29Z

Android has API to /proc/cpuinfo parser
https://android.googlesource.com/platform/ndk/+/froyo/sources/cpufeatures/cpu-features.c

Ashwin Sekhar T K added 5 commits October 22, 2018 01:45

ARM64: Remove XGENE1 references

351a0c7

Remove XGENE1 target as the implementation for the same is incomplete. Moreover whoever wishes to use on XGENE1 can use the generic ARMV8 target as there are no XGENE1 specific optimizations in OpenBLAS.

ARM64: Move parameters from parameter.c to param.h

d50abc8

Remove the runtime setting of P, Q, R parameters for targets ARMV8, THUNDERX2T99. Instead set them as constants in param.h at compile time.

ARM64: Fix DYNAMIC_ARCH compilation for cores which dont use GEMM3M

e7b66cd

ARM64: Remove #define ARMV8 for THUNDERX

af2837c

ashwinyes mentioned this pull request Oct 22, 2018

Add support for DYNAMIC_ARCH compilation on ARM64 #1828

Closed

ashwinyes mentioned this pull request Oct 22, 2018

Use ThunderX2 Neon Kernels for ARMV8 Target #1821

Merged

martin-frbg merged commit 76a66ea into OpenMathLib:develop Oct 23, 2018

martin-frbg mentioned this pull request Oct 24, 2018

Use getauxval to implement dynamic arch on ARM and AArch64 #989

Closed

martin-frbg added this to the 0.3.4 milestone Nov 3, 2018

martin-frbg mentioned this pull request Dec 7, 2018

[cmake][arm64] compilation is broken #1908

Closed

martin-frbg mentioned this pull request Dec 29, 2018

Use OpenBLAS in C program #1925

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add DYNAMIC_ARCH support for ARM64 #1829

Add DYNAMIC_ARCH support for ARM64 #1829

ashwinyes commented Oct 22, 2018

martin-frbg commented Oct 22, 2018

ashwinyes commented Oct 22, 2018

rengolin commented Oct 22, 2018

ashwinyes commented Oct 22, 2018

martin-frbg commented Oct 22, 2018

ashwinyes commented Oct 22, 2018

rengolin commented Oct 22, 2018

martin-frbg commented Oct 22, 2018

rengolin commented Oct 22, 2018

ashwinyes commented Oct 22, 2018

rengolin commented Oct 22, 2018

brada4 commented Oct 25, 2018

ashwinyes commented Oct 25, 2018 •

edited

Loading

brada4 commented Oct 25, 2018

ashwinyes commented Oct 25, 2018

rengolin commented Oct 25, 2018

ashwinyes commented Oct 25, 2018

martin-frbg commented Oct 25, 2018 •

edited

Loading

brada4 commented Oct 25, 2018

Add DYNAMIC_ARCH support for ARM64 #1829

Add DYNAMIC_ARCH support for ARM64 #1829

Conversation

ashwinyes commented Oct 22, 2018

martin-frbg commented Oct 22, 2018

ashwinyes commented Oct 22, 2018

rengolin commented Oct 22, 2018

ashwinyes commented Oct 22, 2018

martin-frbg commented Oct 22, 2018

ashwinyes commented Oct 22, 2018

rengolin commented Oct 22, 2018

martin-frbg commented Oct 22, 2018

rengolin commented Oct 22, 2018

ashwinyes commented Oct 22, 2018

rengolin commented Oct 22, 2018

brada4 commented Oct 25, 2018

ashwinyes commented Oct 25, 2018 • edited Loading

brada4 commented Oct 25, 2018

ashwinyes commented Oct 25, 2018

rengolin commented Oct 25, 2018

ashwinyes commented Oct 25, 2018

martin-frbg commented Oct 25, 2018 • edited Loading

brada4 commented Oct 25, 2018

ashwinyes commented Oct 25, 2018 •

edited

Loading

martin-frbg commented Oct 25, 2018 •

edited

Loading