% paper.tex
% Author: L. K. Melhus and R. E. Jensen
\documentclass[10pt, conference, compsocconf]{IEEEtran}
\usepackage{graphicx}
\usepackage{pgfplots}
\usepackage{pgfplotstable}
\usepackage{filecontents}
\usepackage{tikz}
\usepackage{listings}
\usepackage{mathtools}
\usepackage{underscore}
\usepackage{url} % Better line breaking of urls
\usepackage{lmodern} % Much clearer text
\usepackage[utf8]{inputenc}
\usepackage[font=small]{caption}
\usetikzlibrary{patterns,shapes,positioning,calc}
\lstset{language=C, basicstyle=\ttfamily\small, columns=flexible, belowskip=0pt}
\pgfplotsset{compat=1.8}
\newcommand{\perfctr}[1] {
{\lowercase{#1}}
}
\begin{document}
\title{Measurement Bias from Address Aliasing}
\author{\IEEEauthorblockN{Lars Kirkholt Melhus}
\IEEEauthorblockA{Dept. of Computer and Information Science\\
Norwegian University of Science and Technology\\
Trondheim, Norway\\
larskirk@alumni.ntnu.no}
\and
\IEEEauthorblockN{Rune Erlend Jensen}
\IEEEauthorblockA{Dept. of Computer and Information Science\\
Norwegian University of Science and Technology\\
Trondheim, Norway\\
runeerle@idi.ntnu.no}
}
\maketitle
% Motivation -- Problem statement -- Approach -- Results -- Conclusions
\begin{abstract} % Should be 150 -- 200 words max
%Understanding program behavior and obtaining accurate measurements is important in performance analysis.
%However, recent research has shown that performance measurements can be \emph{biased} by external factors in unpredictable ways.
%Seemingly irrelevant properties---such as the length of your user name---can impact program performance.
In this paper, we identify \emph{address aliasing} as an important underlying mechanism causing measurement bias on modern Intel microarchitectures.
By analyzing hardware performance counters, we show how bias arises from two external factors: size of environment variables, and characteristics of dynamically linked heap allocators.
%Our approach is to use a large set of hardware performance counter measurements to reveal the inner workings of the CPU, and search for correlations over a series of execution contexts.
%We find that \emph{address aliasing} can explain bias from two external factors; size of environment variables, and characteristics of dynamically linked heap allocators.
%This result not only makes accounting for and avoiding measurement bias much simpler, but also enables explicit optimizations for it.
We demonstrate ways to deal with this type of bias through runtime detection and correction of aliasing conditions.
%This is especially important for heap allocations, by default most allocators tend to produce worst case alignment, favoring page alignment.
For heap allocators, we show that common implementations tend to produce worst-case alignment by default, favoring page alignment for large allocations.
The performance impact of aliased memory accesses can be significant, causing up to a $2\times$ performance degradation in already optimized code.
%Finally, we show that even highly optimized BLAS libraries can be impacted by address aliasing.
Our results can enable new optimization strategies to account for this phenomenon.
\end{abstract}
% Free to choose keywords, but must be in alphabetical order
\begin{IEEEkeywords}
4K address aliasing; BLAS; heap allocators; measurement bias; memory disambiguation; performance counters;
\end{IEEEkeywords}
\section{Introduction}
%Accurately measuring the performance of computer programs is important in order to evaluate new algorithms, or compare different implementations.
A key challenge for performance analysts and systems researchers is handling \emph{measurement bias}.
Seemingly harmless external factors---such as link ordering or contents of environment variables---have been shown to have significant impact on performance in real applications~\cite{Mytkowicz:2009:WrongData}.
The underlying explanation is that these factors affect the memory addresses of data at runtime, which in turn interact with various low-level hardware buffers.
Because of the complexity of modern processors, it is often hard, or even impossible, to predict exactly how changes to things like environment variables ultimately affect program execution.
%A literature study of 88 papers by Mytkowicz et al. showed that the median reported speedup was within the margin of bias, possibly invalidating the claimed speedups~\cite{Mytkowicz:2008:OE&MB}.
To alleviate the impact of measurement bias in performance analysis, researchers have proposed techniques like causal analysis, and randomization of execution contexts~\cite{Mytkowicz:2008:OE&MB}.
Others treat the interaction complexity as a potential optimization problem, using search over variant spaces to find optimal environments~\cite{Knights:2009:BlindOpt}.
Our approach is to analyze isolated programs in detail using hardware performance counters, in an attempt to identify which underlying mechanisms are causing bias.
%In Figure~\ref{fig:motivation}, we have reproduced a result originally presented in a paper by Mytkowicz et al.\cite{Mytkowicz:2009:WrongData}, showing how the number of cycles executed for a simple program (Figure~\ref{fig:microkernel}) depends on environment variables.
%This example provides the basis for our analysis.
We show how some instances of measurement bias can be explained by \emph{address aliasing}, an artifact of a presumably Intel specific optimization on the out-of-order execution pipeline.
The CPU uses a heuristic for determining whether loads are dependent on previous stores, comparing only the 12 least significant virtual address bits.
Aliased memory accesses can produce false dependencies, and incur performance penalties.
We look at how address aliasing can trigger bias from two external factors: size of system environment variables, and configuration of dynamic heap allocation libraries.
Our results show that the cost of aliasing can be significant, exemplified by a simple program with up to $2\times$ speedup based on heap address alignment alone.
%We also find that many heap allocation libraries tend to produce worst case behavior by default with respect to aliasing, by favoring page alignment for large allocations.
With an understanding of how false dependencies in the CPU can impact performance, the resulting measurement bias can to some extent be predicted, and even accounted for in software.
We show how techniques like manual padding of allocations and dynamic detection of aliasing conditions can be used to improve performance.
The rest of this paper is structured as follows:
Section~\ref{sec:methodology} starts by describing methodology and experimental setup.
Section~\ref{sec:aliasing} introduces ``4K aliasing'', providing necessary background for interpreting the later results.
Section~\ref{sec:environment} presents an analysis of how the size of environment variables causes bias, and in Section~\ref{sec:heap} we go through a similar analysis for bias to heap address alignment.
Section~\ref{sec:blas} looks at aliasing in BLAS libraries.
Finally, in Section~\ref{sec:mitigation} we discuss some approaches to handle aliasing effects.
Related work is discussed in Section~\ref{sec:related}, before we summarize and conclude in Section~\ref{sec:conclusions}.
\section{Experimental Methodology}
\label{sec:methodology}
In order to understand biased behavior, it is important to be able to make precise and detailed measurements without introducing \emph{observer effects}~\cite{Mytkowicz:2008:OE&MB}.
This can be achieved by using performance counters: instrumentation support in hardware that can be used to count various events, such as cycles executed, branch prediction misses, instructions fetched, and so on.
Recent Intel architectures have several hundred available events, providing a detailed view of what is happening inside the CPU.
Performance counters are supported in the Linux kernel, and can be accessed via a tool called \emph{perf}.
This utility instructs the kernel to enable the processor's Performance Monitoring Unit (PMU), before executing a specified program.
Using this technique avoids modifications to the executed program, unlike tools such as PAPI~\cite{PAPI:PortableInterface}.
We use the perf-stat command in all experiments, which accepts raw PMU event codes listed in the reference manual~\cite{Volume3B}.
Reliability of PMU measurements has been evaluated by Weaver and Mckee~\cite{Weaver:2008:PCTrusted}.
The issues they find are either handled explicitly, or are irrelevant to our setup.
%They find intra-machine variations of up to 1.07\%,
A small Python script is used to iterate over an exhaustive set of all known counters, which amounts to about 200 in our setups.
Only a small set of events is collected at a time, to ensure events are actually counted continuously.
We do not use multiplexing between a limited set of counter registers, as it would introduce significant variations~\cite{Wiplove:ImproveEstimation}. % Paper is rename of "Toward Accurate Performance Evaluation using Hardware Counters".
Controlled variations in environment size are achieved by setting a dummy environment variable.\footnote{Because perf-stat itself adds a few variables, the environment will never be completely empty.}
% to $n$ number of zero characters.
Interesting events are identified by computing linear correlation to cycle count, measuring all counters over a series of execution contexts.
% Cache related metrics are monitored in order to rule out cache as the underlying cause of bias, such as hit rates of micro-ops for each level of cache~\cite{OptimizationManual}.
Results are also averaged over multiple runs to reduce potential random error.%, using built-in perf rerun functionality.
%Because bias can be hardware dependent, and to keep the scope of this work manageable, we initially chose to focus our analysis on the 4th generation Intel ``Haswell'' microarchitecture.
%We later reproduced similar results for the ``Core'', ``Nehalem'', and ``Ivy Bridge'' microarchitectures as well.
%Each setup is described in Table~\ref{tab:machines}, and referenced throughout the paper.
\begin{table}
\centering
\caption{Experimental setups used.\label{tab:machines}}{
\begin{tabular}{|l|l|}
\hline
Core & Intel\textsuperscript{\textregistered{}} Core\texttrademark{}2 Duo P8600 (laptop) \\
& 64-bit Ubuntu 14.04 LTS (Linux 3.13.0),\\
& GCC 4.8.2-19ubuntu1 \\ \hline
Nehalem & Intel\textsuperscript{\textregistered{}} Core\texttrademark{} i7-950 \\
& 64-bit Ubuntu 12.04 LTS (Linux 3.2.0),\\
& GCC 4.8.2-19ubuntu1 \\ \hline
% Sandy Bridge & Intel\textsuperscript{\textregistered{}} Core\texttrademark{} i7-2670QM (laptop) \\
% & 64-bit Ubuntu 11.04 (Linux 3.0.0) \\ \hline
Ivy Bridge & Intel\textsuperscript{\textregistered{}} Core\texttrademark{} i5-3470 \\
& 64-bit Ubuntu 12.04 LTS (Linux 3.8.0),\\
& GCC 4.8.2-19ubuntu1 \\ \hline
Haswell & Intel\textsuperscript{\textregistered{}} Core\texttrademark{} i7-4770K \\
& 64-bit Ubuntu 14.04 LTS (Linux 3.13.0),\\
& GCC 4.8.2-19ubuntu1 \\ \hline
\end{tabular}}
\end{table}
Best practices for controlling the execution context were applied, ensuring that we are not affected by any unwanted bias~\cite{Mytkowicz:2009:WrongData}.
Most importantly, this means keeping the memory address space under control.
For security reasons, addresses of the stack, heap and dynamic libraries are often randomized at load time, a technique known as \emph{Address Space Layout Randomization}~\cite{Pax:ASLR,Bhatkar:AddressObfuscation}.
By disabling ASLR, we are able to execute the same program multiple times with identical virtual address spaces.
All experiments are conducted under minimal load, disabling \emph{frequency scaling} to keep the CPU's clock speed fixed, and \emph{Hyper-threading} (HT) to reduce the possibility for resource contention between threads.\footnote{Some PMU events also require HT to be turned off, or produce inaccurate results with HT enabled.}
% Virtual memory/randomization illustration
\begin{figure}
\centering
\begin{tikzpicture}[font=\footnotesize]
% See page 453 in pgf manual
\node [
rectangle split, rectangle split parts=10,
rectangle split part fill={white, white, lightgray, white, lightgray, white, lightgray, white},
draw, anchor=center, text width=2cm
] (m) {
\nodepart{one}
environment
\nodepart{two}
stack
\nodepart{four}
mmap area
\nodepart{six}
heap
\nodepart{eight}
bss
\nodepart{nine}
data
\nodepart{ten}
text
};
\draw [decorate, decoration={brace, amplitude=5pt}] (m.south west) -- (m.seven split west)
node [black, midway, xshift=-1.5cm, text width=2.0cm]
{ Program code and static data } ;
\node[right] at (m.north east)
{ \scriptsize{0x7fff'ffffffff} } ;
\node[right] at (m.south east)
{ \scriptsize{0x400000} } ;
\end{tikzpicture}
\caption{Memory execution context, assuming a 64-bit process running on a Linux system.}
%Initial addresses of stack, heap and memory mapped files are often randomized for security reasons.
%The stack is also offset by environment variables and program arguments.
%Addresses of code and statically allocated data are allocated at compile time by the linker, and can be determined by inspecting the executable.}
\label{fig:virtualmemory}
\end{figure}
\section{4K Address Aliasing}
\label{sec:aliasing}
Modern processors are \emph{superscalar}, and achieve parallelism by issuing multiple instructions simultaneously and out of order.
One of the issues that can limit throughput is dependencies between a load and previous stores.
To increase parallelism, modern architectures use a technique called \emph{memory disambiguation} to execute memory operations out of order~\cite{Intel:2006:InsideICM:SmartMemoryAccess}.
%More often than not, loads can safely be issued before a previous store has completed and written its value to L1 cache.
Loads are issued \emph{speculatively}, based on a prediction of whether they will conflict with a previous store that is not yet retired.
The prediction is later verified, replaying any instructions that were wrongly assumed to have no dependencies.
Similarly, if the load and store locations are the same, the value can be \emph{forwarded} from the store before it retires.
While optimizations such as these are beneficial in the average case, there are corner cases.
In particular, an event known as ``4K aliasing'' can occur when the memory addresses of a store followed by a load differ by a multiple of 4096 bytes.
%A store to address 0x601020 followed by a load to address 0x821020 is an aliasing pair, because the 12-bit address suffix of 0x020 is the same in both.
%Despite being independent,
In these cases, the memory order subsystem generates \emph{false} dependencies, causing the load to be reissued.
The number of times this happens can be counted by the following event:
\begin{description}
\item[{\small LD\_BLOCKS\_PARTIAL.ADDRESS\_ALIAS.}]
\hfill \\ \emph{``Counts the number of loads that have partial address match with preceding stores, causing the load to be reissued.''}
\cite[B.3.4.4]{OptimizationManual}
\end{description}
This event is listed in the manual for microarchitectures going back to ``Nehalem'', including ``Ivy Bridge'' and ``Haswell'' CPUs~\cite{Volume3B}.
Older architectures such as Core do not have this particular event listed, but 4K aliasing is covered by a more general event:
\begin{description}
\item[{\small LOAD\_BLOCKS.OVERLAP\_STORE.}]
\hfill \\ \emph{``Loads that partially overlap an earlier store, or 4-Kbyte aliased with a previous store.''}
\cite[Table 19-17]{Volume3B}
\end{description}
As address aliasing depends on the memory addresses of loads and stores, any environmental factor that affects memory layout has the potential to induce aliasing conditions.
In the following sections, we show how performance penalties from address aliasing can be the root cause of measurement bias.
\section{Bias from Environment Size}
\label{sec:environment}
System environment variables contain strings like the name of the logged in user, home directory, shell state, and various other settings.
As a source of bias, it is not the environment variables themselves that are important, but rather the effect their total \emph{size} has on the position and alignment of the stack.
Environment variables and program arguments are allocated in the stack section of virtual memory, close to the upper address 0x7fff'ffffffff\footnote{Modern processors do not actually use the full 64-bit space; only the low-order 47 bits are used for addressing memory.} and before the first call frame. Figure~\ref{fig:virtualmemory} shows this layout.
Changing environment variables will offset the location of the stack, and consequently all stack allocated variables.
After this offset, the stack is aligned to a 16-byte boundary on x86_64 Linux systems.
Within a span of 4096 bytes there are thus 256 possible initial addresses, each representing a different execution context with respect to address aliasing.
%Alignment is restricted to 16 byte boundaries accoring to the ABI, creating 256 possible initial addresses for every period of 4096 bytes.
%Each one representing a different execution context with respect to address aliasing.
%Note that there is no clear relationship between environment size and stack location with ASLR enabled.
%However, there will still be as many execution contexts with respect to aliasing (considering the stack only), making any occurrences of measurement bias indeed random.
\subsection{Microkernel Analysis}
\label{sec:microkernel}
To illustrate how address aliasing can cause bias, we revisit the example first presented by Mytkowicz \emph{et al.}~\cite{Mytkowicz:2009:WrongData}, a small C program reproduced here in Figure~\ref{fig:microkernel}.
% in ``Producing Wrong Data Without Doing Anything Obviously Wrong!''
This example is interesting for several reasons: the bias effects are significant and easily reproducible, while the code itself is simple and straightforward to analyze.
Still, no satisfactory explanation as to what actually causes bias was given in the original paper.
The following data and analysis of this program are from the ``Haswell'' configuration in Table~\ref{tab:machines}.
%Similar experiments were conducted on all other experimental setups listed, arriving at the same conclusion.
% Original code presented my Mytkowicz et al.
\begin{figure}[t]
\begin{lstlisting}[frame=single, xleftmargin=.01\textwidth, xrightmargin=.01\textwidth]
static int i, j, k;
int main() {
int g = 0, inc = 1;
for (; g < 65536; g++) {
i += inc;
j += inc;
k += inc;
}
return 0;
}
\end{lstlisting}
\caption{\label{fig:microkernel}Microkernel first presented by Mytkowicz \emph{et al.}\cite{Mytkowicz:2009:WrongData}, showing bias to environment size.}
\end{figure}
% Large plot of cycle count for two 4K stack sections.
\pgfplotstableread{bin/microkernel-cycles-haswell.dat}{\stackoffsettable}
\begin{figure*}[t]
\begin{tikzpicture}
\begin{axis}[
title = Haswell,
width = \textwidth,
height = 5.5cm,
font = \footnotesize,
xlabel=Bytes added to environment,
ylabel=Cycles,
domain = 0:8192,
xtick = {0,1024,...,8192},
xmin = 0,
xmax = 8192,
cycle list name = exotic
]
\addplot[ycomb] table[x expr = \thisrowno{0}*16, y = cycles:u] \stackoffsettable ;
\end{axis}
\end{tikzpicture}
\caption{Bias from environment size for the microkernel. Measured average of 10 cycle count samples for 512 different environments. Spikes show the aliasing case, occurring once per 4K period.}
\label{fig:envbias}
\end{figure*}
Performance counter measurements of cycle counts over 512 different environment sizes are shown in Figure~\ref{fig:envbias}.
Every 16 byte increment of environment size is measured, covering two 4K periods of initial stack addresses.
%A finer sampling is not necessary, because the stack is by default aligned to 16 bytes.
The code is compiled with {\small{GCC}} using no optimization; any optimization would likely disregard most of the function as redundant, reducing it to return zero immediately.
There are clearly two worst cases, indicated by significant spikes at the 3184 and 7280 byte offsets.
We sampled an extensive set of performance counters in addition to cycle count for each execution context.
To analyze the bias effects at these points, we calculate the median value of each counter over all contexts, and compare that to the two extreme cases.
Going through the list of events, we selected the set that had the highest correlation with cycle count\footnote{Some correlating performance events are omitted; for example bus-cycles, which will naturally vary with total cycle count.}.
Table \ref{tab:loopcorrelation} shows numerical counter values for the first spike, compared to the median values. Event counts on the second spike are almost identical to the first, and omitted for brevity.
% Detailed comparison between median and worst cases.
\begin{table}
\centering
\caption{Events with significant correlation to cycle count.\label{tab:loopcorrelation}}
\pgfplotstabletypeset[
int detect, % Output whole numbers for counter values
col sep=comma,
columns={Performance counter, Median, [index]3}, %, [index]4
column type=r,
columns/Performance counter/.style={
string type,
column type=l,
column type/.add={|}{},
postproc cell content/.append code={
\pgfkeysalso{@cell content=\perfctr{##1}}
}
},
every head row/.style={
output empty row,
before row={\hline
Performance counter & Median & Spike \\ % & Spike 2
},
after row=\hline\hline
},
every last row/.style={after row=\hline},
every last column/.style={column type/.add={}{|}}
]{bin/microkernel-comparison-haswell.csv}
\end{table}
The most distinctive change from the median is the number of alias events.
This counter reports close to zero except exactly at the two spikes, where it reaches upwards of 320,000.
%If we plot the graph for address aliasing, we see that it is near zero everywhere and spikes at exactly the points we observe bias.
%We found that there are near zero address aliasing events everywhere except exactly the points we observe bias.
The results show a high number of resource stalls and pending memory loads when the spikes occur, which is consistent with address aliasing issues.
On the other hand, we see a much lower number of reservation station (RS) stalls in the aliasing case, with a reduction from around 272,000 to 136,000.
The reservation station buffers micro-ops for scheduling to the execution units, and a stall event means that there are no free slots available~\cite[Table 19-2]{Volume3B}.
We attribute the fewer stalls in the aliasing case to less contention on the reservation station,
consistent with the overall \emph{decrease} in the number of micro-ops executed per port: all but one of the 8 execution ports perform fewer operations in parallel.
Note that the number of micro-ops \emph{retired} overall does not change.
The memory disambiguation system automatically assumes dependency between a load and a previous store that are aliased, which can limit the potential for issuing many instructions simultaneously and out of order~\cite[Page 2-20]{OptimizationManual}.
In our case, lower occupation of execution ports, and less pressure on the reservation station, can be explained by the CPU not being able to issue more operations while waiting for an aliased memory access to be resolved.
The cycles\_ldm\_pending and stalls\_l2\_pending numbers are interesting, as they are related to the memory and cache system.
However, this program uses only 2 cache-lines, except for the initial startup.
The L1 hit and miss counts also revealed that the L1 load hit rate is 99.3\% and the store hit rate is {\raise.17ex\hbox{$\scriptstyle\mathtt{\sim}$}}100\%.
These events do not indicate a cache hit rate issue, but are probably due to having to wait for aliased writes to be retired and committed to cache.
Performance counter data clearly points to address aliasing as the underlying explanation, so the next step is to see exactly which memory accesses are aliasing.
%For that we need to know the addresses of each variable at runtime.
%The program contains five variables; \texttt{g} and \texttt{inc} which are stack allocated, and \texttt{i}, \texttt{j} and \texttt{k} which are statically allocated.
Memory layout of static data is decided at compile time, and we find the addresses to be \texttt{\&i} = 0x60103c, \texttt{\&j} = 0x601040, and \texttt{\&k} = 0x601044 by reading the ELF symbol table.
%\footnote{ELF symbol tables can be read using \texttt{readelf -s}}
%Observing addresses of stack allocated data at runtime is more challenging, as we have to make sure to not introduce any observer effects that alters the addresses as we are observing them.
Carefully crafted assembly code was used to capture the runtime addresses of \texttt{g} and \texttt{inc}.
%We made no change to the stack allocation instructions, and the code offset did not affect the addresses of static variables.
%The same experiment was run again, showing that the modified program had the exact same bias to environment size.
From the output at each spike, we found the addresses to be \texttt{\&g} = 0x7fffffffe038, \texttt{\&inc} = 0x7fffffffe03c, and \texttt{\&g} = 0x7fffffffd038, \texttt{\&inc} = 0x7fffffffd03c, respectively.
Notice the common suffix 0x03c between \texttt{inc} and \texttt{i}, which is the aliasing pair.
% The assembly output of {\small GCC} for the inner loop is shown below, indicating the relevant load and store instructions.
% There is only one store to variable \texttt{i}, which will alias with loads of \texttt{inc}.
% Refer to relevant section of the generated assembly file (in gas syntax)
% \lstinputlisting[
% basicstyle=\ttfamily\small,
% linerange={22-38},
% numbers=left,
% firstnumber=22
% ]{bin/micro-kernel-annotated.s}
% Because the stack is aligned to 16 bytes, or 4 \emph{words}, there are a couple of different scenarios that could have been observed here.
% Static variables are fixed and cover 12 contiguous bytes (3 words), in our case the addresses end in 0x0, 0x4 and 0xc, leaving the 0x8 slot free.
% The two automatic variables \texttt{g} and \texttt{inc} occupies 8 contiguous bytes on the stack (2 words), in our case the addresses will always fit in the 0x8 and 0xc slots.
% In this scenario, \texttt{g} will never alias with any of the static variables -- as it always covers the 0x8 slot not occupied by either of \texttt{i}, \texttt{j} or \texttt{k}.
% A less fortunate scenario with respect to the number of alias events occurs when there can be collisions with both stack allocated variables, which can be achieved for example by reserving an extra 8 bytes to offset \texttt{i}, \texttt{j} into the 0x8, 0xc slots.
% While this will give significantly more alias counts, we found that it had little effect on the total number of cycles executed.
% In conclusion, we identified that address aliasing is the root cause of measurement bias from environment size for this program.
% Worst case occurs for precisely one out of 256 possible initial stack addresses in every 4K segment, where resource stalls are generated because of false dependencies between the stack and static data.
% The program is \emph{biased} towards environment sizes that avoid this specific stack alignment, which in principle could be triggered just by changing user name.
% This type of bias can occur in a number of memory configurations.
% Here, the aliasing was caused by interactions between the stack and static variables, but we can imagine similar conflicts between the stack and the heap, depending also on how dynamic memory allocators behave.
% Because the stack is used for local variables, complex code is likely to contain many potential conflicts with other areas of memory.
%\pgfplotstableread{bin/microkernel-cycles-core2.dat}{\microkernelcoretwotable}
\pgfplotstableread{bin/microkernel-core.dat}{\tblmicrokernelcore}
\pgfplotstableread{bin/microkernel-nehalem.dat}{\tblmicrokernelnehalem}
\pgfplotstableread{bin/microkernel-ivybridge.dat}{\tblmicrokernelivybridge}
\begin{figure}[t]
\begin{tikzpicture}
\begin{axis}[
title = Core,
width = \textwidth/2.1,
height = 4.5cm,
font = \footnotesize,
xlabel=Bytes added to environment,
ylabel=Cycles,
domain = 0:4096,
xtick = {0,1024,...,4096},
xmin = 0,
xmax = 4096,
cycle list name = exotic
]
\addplot[ycomb] table[x expr = \thisrowno{0}*16, y = cycles:u] \tblmicrokernelcore ;
\end{axis}
\end{tikzpicture}
%\caption{\label{fig:motivation}Performance counter data running the micro-kernel from Figure~\ref{fig:microkernel} under different environment sizes. Notice the clear spike at around 3200 bytes, where the number of cycles executed nearly doubles. Measuring performance in this configuration would give a biased result.}
\begin{tikzpicture}
\begin{axis}[
title = Nehalem (HT),
y tick label style={
/pgf/number format/.cd,
fixed,
fixed zerofill,
precision=1,
/tikz/.cd
},
width = \textwidth/2.1,
height = 4.5cm,
xlabel = Bytes added to environment,
ylabel = Cycles,
font = \footnotesize,
domain = 0:4096,
xtick = {0,1024,...,4096},
xmin = 0,
xmax = 4096,
cycle list name = exotic
]
\addplot[ycomb] table[x expr = \thisrowno{0}*16, y = cycles:u] \tblmicrokernelnehalem ;
\end{axis}
\end{tikzpicture}
\begin{tikzpicture}
\begin{axis}[
title = Ivy Bridge,
y tick label style={
/pgf/number format/.cd,
fixed,
fixed zerofill,
precision=1,
/tikz/.cd
},
width = \textwidth/2.1,
height = 4.5cm,
xlabel = Bytes added to environment,
ylabel = Cycles,
font = \footnotesize,
domain = 0:4096,
xtick = {0,1024,...,4096},
xmin = 0,
xmax = 4096,
cycle list name = exotic
]
\addplot[ycomb] table[x expr = \thisrowno{0}*16, y = cycles:u] \tblmicrokernelivybridge ;
\end{axis}
\end{tikzpicture}
\caption{\label{fig:microkernel-all-archs}Microkernel experiment on the ``Core'', ``Nehalem'' and ``Ivy Bridge'' microarchitectures.}
\end{figure}
\subsection{Other Architectures}
While we show the analysis only for ``Haswell'', the same approach gave similar results for all the CPUs we tested.
Figure~\ref{fig:microkernel-all-archs} shows cycle count plots for ``Core'', ``Nehalem'' and ``Ivy Bridge'', revealing the same alias behavior.
%\footnote{For these results we also pinned the process to a single core during measurements using \texttt{taskset}, which helps remove some of the random noise.}
%The experiment was also done for the older ``Core'' setup, shown in Figure~\ref{fig:motivation}.
On the older ``Core'' architecture, \emph{load\_blocks.overlap\_store} is used to indicate address aliasing.
Notably, this event was also found to be significant in the original paper, where the authors used a similar setup~\cite{Mytkowicz:2009:WrongData}.
We were able to verify that partial address overlap was occurring at the same time this event was spiking, concluding that address aliasing must be the underlying explanation.
Still, this only explains part of what is shown in Figure~\ref{fig:microkernel-all-archs}, as there seems to be yet another bias effect at every fourth environment increment.
These smaller spikes correlate with the performance event \emph{load\_blocks.std}, which is described as ``loads blocked by a preceding store with unknown data''~\cite[Table 19-17]{Volume3B}.
%Our assumption is that this is a different phenomenon, not related to 4K aliasing.
We were only able to observe this effect on this particular architecture, and did not investigate it further.% is related to 4K aliasing.
Results for the ``Nehalem'' and ``Ivy Bridge'' architectures are more uniform, with large spikes only at the aliasing environment configuration.
It is clear that this issue exists for a wide range of Intel microarchitectures.
%We believe that address alias issue exist for a wider range of Intel microarchitectures.
% \subsubsection{Note on Hyper-Threading}
% As described in Section~\ref{sec:methodology}, Hyper-Threading was disabled by default in order to avoid any potential interference while collecting data.
% However, we found that configuration of Hyper-Threading either way did not change the alias behavior.
% As an example of this, we show the results for ``Nehalem'' with HT enabled in Figure~\ref{fig:microkernel-all-archs}.
\section{Bias from Heap Allocation}
\label{sec:heap}
Address aliasing can be caused by conflicting pairs of load and store operations to any part of memory.
In the previous section, we saw collisions between static data and the stack, observing bias from external conditions that affected addresses of automatic variables.
Most dynamic memory is allocated on the \emph{heap}, which is managed by an \emph{allocator}.
The allocator ultimately assigns the actual addresses of heap allocated variables at runtime.
Heap allocation routines such as \texttt{malloc} and \texttt{free} are typically dynamically linked, for example as part of glibc.
%If this assignment is independent of the initial stack offset all heap memory accesses can be biased.
The particular library used therefore constitutes an important part of the execution context, as linking to a different library, or a library with some alternative configuration, can impact heap addresses at runtime.
\subsection{Memory Allocators and Alias}
Acquiring dynamic memory at runtime is usually handled by a variant of \texttt{malloc}.
Depending on the particular request and allocator used, the returned value will either point to the ``regular'' heap, or to a memory mapped area.
\begin{itemize}
\item The \emph{heap} is marked by a break point representing the end of uninitialized data in virtual memory, and more space is requested by the \texttt{brk} or \texttt{sbrk} system calls.
\item The \texttt{mmap} system call is used to map file descriptors to virtual memory. \emph{Anonymous} mappings, \emph{i.e.} buffers not backed by a file, can be used for general purpose allocations.
\end{itemize}
The allocator included as part of glibc uses both mechanisms, choosing which based on the size of the request~\cite{GlibcManual}.
Small sizes are generally allocated on the heap, while larger allocations tend to be memory mapped.
The heap section starts at a low address right above static code and data, while memory mapped chunks are placed towards the upper end of the virtual address space, closer to the stack (see Figure \ref{fig:virtualmemory}).
Whether a request is served by the heap or by memory mapping can be determined by looking at the pointer values returned;
%Addresses in the regular heap can look something like 0x16e30a0 or 0x1723020, while pointers returned by \texttt{mmap} are numerically much larger, for example 0x7f0318a8f010 or 0x7f03105d2010.
addresses in the regular heap are several orders of magnitude smaller than pointers returned by \texttt{mmap}.
This distinction is unimportant for application developers, as everything is conceptually the same ``heap''.
However, \texttt{mmap} has an interesting property in that allocations will always be page aligned.
The page size is usually 4096 bytes, meaning addresses returned by \texttt{mmap} will \emph{always} alias.\footnote{glibc's version of malloc adds 16 bytes of metadata at the beginning; therefore every memory mapped address ends with 0x010.}
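The 12-bit comparison underlying these collisions can be sketched as a small helper (\texttt{alias4k} is our name for this illustration, not a standard API):

```c
#include <stdint.h>

/* Two addresses can falsely conflict when their low 12 bits --
   the offset within a 4 KiB page -- are equal. Under glibc, two
   large (mmap-served) malloc requests both end in 0x010 and will
   therefore always satisfy this predicate. */
static int alias4k(const void *a, const void *b)
{
    return ((uintptr_t)a & 0xfff) == ((uintptr_t)b & 0xfff);
}
```

Applied to the example addresses above, a heap pointer such as 0x16e3010 and a memory mapped pointer such as 0x7f0318a8f010 share the suffix 0x010 and would be flagged as a potential aliasing pair.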
%This behavior is often the worst case for functions that operate on two or more independent buffers.
\begin{table}[t]
\centering
\caption{Addresses returned by different heap allocators when allocating pairs of equally sized buffers. Equal three-digit suffixes indicate an aliasing pair.\label{tab:mallocompare}}
\pgfplotstabletypeset[
col sep=comma,
string type,
columns={Allocation, 5120, 1048576},
column type=r,
columns/Allocation/.style={
string type,
column type=l,
column type/.add={|}{}
},
every head row/.style={
output empty row,
before row={\hline
& 5,120 B & 1,048,576 B \\
},
after row=\hline\hline,
},
every last row/.style={after row=\hline},
every last column/.style={column type/.add={}{|}},
every odd row/.style={after row=\hline},
]{bin/malloc-comparison.csv}
\end{table}
Table \ref{tab:mallocompare} illustrates how using a different allocator can affect potential aliasing between heap pointers.\footnote{Switching allocator was done by setting the LD\_PRELOAD environment variable.}\footnote{Memory mapped addresses starting with the 0x2aa... prefix are an artifact of using \emph{make} to generate results. Executing the test program from \emph{bash} directly results in 0x7fff... prefixes. This difference is not important to our discussion.}
We observe the addresses of two equally sized buffers allocated with \texttt{malloc}, for different size parameters.
%Equal three digit address suffix indicate an aliasing pair.
In addition to glibc, where the heap allocator is called \emph{ptmalloc}, we look at the following alternatives:
\begin{enumerate}
\item Thread-Caching Malloc (tcmalloc) by Google~\cite{TCMalloc}.
\item jemalloc, originally developed for FreeBSD~\cite{JEMalloc}.
\item Hoard~\cite{Berger:2000:Hoard}.
\end{enumerate}
All of the above focus heavily on performance in multithreaded environments.
Heap allocation is inefficient in that all threads share the same address space, leading to high potential for lock contention on memory accesses.
Here, we only look at behavior for a single thread, and whether the addresses returned alias or not.
We see that glibc and tcmalloc utilize the normal heap area for smaller allocations, returning numerically low addresses. % Verified by looking at strace that tcmalloc actually uses brk
Interestingly, jemalloc and Hoard appear to never use the heap, but allocate to memory mapped areas even for smaller requests.
Conversely, tcmalloc seems to manage only the heap.
We also find an example where one allocator yields aliasing buffers while another does not.
Allocating $2 \times 5120$ bytes returns aliasing pointers for jemalloc and Hoard, but not with glibc or tcmalloc.
Given that these results are deterministic (with ASLR disabled), it is possible to construct programs with significant bias towards one or the other allocator.
Even with ASLR enabled, pointers returned by \texttt{mmap} must still be page aligned.
This means that addresses returned by allocators using this mechanism directly will \emph{always} alias, giving a deterministic execution context considering only the address suffix.
%From our limited analysis, this seems to be the case for glibc, jemalloc and Hoard.
\subsection{Aligned Sequential Access}
Many functions operate in a ``sliding window'' fashion; reading and writing to different buffers in some loop construction.
This type of access pattern is potentially vulnerable to 4K aliasing, where the worst case will be when the read and write pointer addresses are aliased, continuously generating false conflicts.
As an example of this, consider a simple implementation of convolution shown in Figure \ref{lst:conv}.
The performance of this program greatly depends on the address alignment of each buffer, favoring memory addresses that are not closely aligned on the last 12-bits.
\begin{figure}[t]
\centering
\lstinputlisting[
language=C,
frame=single,
xleftmargin=.00\textwidth,
xrightmargin=.00\textwidth
]{bin/convolution-kernel.c}
\caption{Basic implementation of convolution with a fixed kernel, ignoring endpoints for simplicity. This program is highly sensitive to aliasing between input and output arrays.}
\label{lst:conv}
\end{figure}
We use an input size of $n=2^{20}$ (4 MiB in memory for each array), which results in glibc's heap allocator always choosing \texttt{mmap} to serve requests.
By default, even with address randomization enabled, both \texttt{input} and \texttt{output} will have start addresses with the same address suffix of 0x010.
To analyze performance for different addresses, we manually insert padding to control the address of one of the arrays.
This is accomplished by requesting a bit more memory, and using pointer arithmetic to offset the pointer past the aligned baseline.
Controlling the offset parameter, we can create environments where the address suffixes differ by any desired number of bytes.
Allocating and managing these buffers at startup takes a non-negligible amount of work, but the overhead can be masked by repeatedly invoking the convolution kernel in a loop.
Repeated invocation will also even out performance by warming up the cache.
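The padding trick can be sketched as follows (\texttt{alloc\_offset} is a hypothetical helper for illustration, not part of the measured code):

```c
#include <stdlib.h>

/* Over-allocate by one page, then shift the returned pointer by
   `offset` bytes so its low 12 address bits can be chosen freely.
   The unshifted base pointer must be remembered for free(). */
static float *alloc_offset(size_t n, size_t offset, void **base)
{
    *base = malloc(n * sizeof(float) + 4096);
    if (*base == NULL)
        return NULL;
    return (float *)((char *)*base + offset);
}
```

Varying \texttt{offset} between 0 and 4095 then sweeps the relative 12-bit address suffix of one buffer against the other.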
%\begin{lstlisting}
% for (i = 0; i < k; ++i)
% conv(n, input, output + offset);
%\end{lstlisting}
An estimate of the actual cost of a single invocation can be calculated by the following formula:
%An estimate of the actual cost a single invocation can be calculated by averaging over $ k $ repeated function calls, after subtracting time $ t_{1} $ which includes allocation overhead.
\[
t_{\text{estimate}} = \frac{t_{k} - t_{1}}{k - 1}
\]
Time spent on a single invocation is subtracted from the estimate to mask initialization time.
We choose $k=11$, effectively getting the average over 10 iterations.
In addition, performance counter measurements are averaged over 10 samples using the repeat mechanism of perf.
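The estimate reduces to a one-line computation; a minimal sketch, matching the formula above:

```c
/* t_estimate = (t_k - t_1) / (k - 1): the first timing t_1,
   which includes allocation and warm-up overhead, is subtracted
   from the total over k invocations, and the remainder is
   averaged over the k - 1 extra calls. */
static double estimate_cost(double t_k, double t_1, int k)
{
    return (t_k - t_1) / (double)(k - 1);
}
```

With $k=11$, this averages over the 10 repeated invocations after the first.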
\pgfplotstableread{bin/conv-default-o2-haswell.estimate.dat}{\convtabletwohaswell}
\pgfplotstableread{bin/conv-default-o3-haswell.estimate.dat}{\convtablethreehaswell}
\begin{figure*}[t]
\centering
\begin{tikzpicture}
\begin{axis}[
title=Haswell -O2,
font=\footnotesize,
xlabel=Relative offset in \texttt{sizeof(float)} bytes,
ylabel=Event count,
cycle list name=black white,
width=\textwidth/2.05, % Make it fit to text width side by side
height=5.6cm,
skip coords between index={20}{32} % Limit level of detail to fit page width nicely
]
\addplot table[x expr = \thisrowno{0}, y = cycles:u] \convtabletwohaswell ;
\addplot table[x expr = \thisrowno{0}, y = r0107:u ] \convtabletwohaswell ;
\addlegendentry{Cycles} ;
\addlegendentry{Alias} ;
\end{axis}
\end{tikzpicture}
\hspace{0.5cm}
\begin{tikzpicture}
\begin{axis}[
title=Haswell -O3,
font=\footnotesize,
xlabel=Relative offset in \texttt{sizeof(float)} bytes,
ylabel=Event count,
cycle list name=black white,
width=\textwidth/2.05,
height=5.6cm,
skip coords between index={20}{32}
]
\addplot table[x expr = \thisrowno{0}, y = cycles:u] \convtablethreehaswell ;
\addplot table[x expr = \thisrowno{0}, y = r0107:u ] \convtablethreehaswell ;
\addlegendentry{Cycles} ;
\addlegendentry{Alias} ;
\end{axis}
\end{tikzpicture}
\hspace{0.5cm}
\caption{\label{fig:conv-default}Cycle- and alias counts for different offsets between input and output arrays in convolution kernel from Figure~\ref{lst:conv}, input size $n=2^{20}$.}
%Offset 0 means equal 12-bit address suffix, which is default behavior for \texttt{mmap} allocations. Showing results for optimization levels O2 and O3 compiled with GCC, input size $n=2^{20}$.}
\end{figure*}
Figure~\ref{fig:conv-default} illustrates how the convolution kernel behaves for increasing offsets between the heap addresses (modulo 4096), clearly indicating a relationship between address aliasing events and cycles executed.
%The effects are most distinct on optimization level 2 and 3, where the ratio of cycles to alias events is most significant.
%Numbers on the x axis represent the amount of offset, measured in number of \texttt{sizeof(float)} bytes.
An offset of zero is the default behavior for this program when using \texttt{malloc} and moderately large inputs.
For both optimization levels 2 and 3, this is close to worst case performance.
The difference in cycles executed is significant, with around $1.7\times$ speedup for O2 and as much as $2\times$ speedup for O3 as the relative offset increases.
%This effect is only observed for address separations close to zero, thus we only show the first 20 data points.
%If extended to cover the full width of possible offsets within a 4K segment, we see that the performance is uniform everywhere else.
More detailed performance counter data is presented in Table~\ref{tab:convstats}, where we show the subset that correlates best with the cycle count.
The numbers are for the level 2 optimization case, which we choose to study in more detail.
However, the performance data are similar between the two data sets, with a couple of events that stand out:
\begin{itemize}
\item A high number of resource stalls for the default alignment, which is reduced substantially together with increasing offsets.
\item A high number of cycles with memory loads pending, indicating that the pipeline is stalled waiting for load operations to resolve.
\item Changes to the number of micro-ops executed for certain ports.
\end{itemize}
% Table with detailed performance data, use estimated for O2 because it does not need as many columns (!)
\begin{table*}[t]
\centering
\caption{Relevant performance counters and correlation ($r$) with cycle count for optimization O2. Estimated cost accounting for constant overhead.\label{tab:convstats}}{
\pgfplotstabletypeset[
int detect, % Output whole numbers for counter values
col sep=comma,
columns={Performance counter, Correlation, 0, 2, 4, 8},
column type=r,
columns/Performance counter/.style={
string type,
column type=l,
column type/.add={|}{},
postproc cell content/.append code={
\pgfkeysalso{@cell content=\perfctr{##1}}
}
},
columns/Correlation/.style={
fixed,
fixed zerofill,
precision=2
},
every head row/.style={
output empty row,
before row={\hline
Performance counter & $r$ & 0 & 2 & 4 & 8 \\
},
after row=\hline\hline
},
every last row/.style={after row=\hline},
every last column/.style={column type/.add={}{|}}
]{bin/conv-default-o2-haswell.estimate.csv}
}
\end{table*}
Interestingly, small address offsets incur a substantial increase in operations issued on port 0 for the O2 case.
On ``Haswell'', this port handles various ALU operations, and branching together with port 6~\cite[Figure 2.1]{OptimizationManual}.
As there are also variations in the number of branch instructions executed, it appears that these counters together show that certain branches are being re-issued.
Unlike the microkernel in Section~\ref{sec:microkernel}, the micro-ops executed and RS stall counts drop together with aliasing events.
Speculation is possible in this kernel, as there are many independent store addresses.
However, false dependencies will limit the amount of speculation, making the CPU incorrectly discard executed instructions.
This can explain the inverted behavior compared to the microkernel, where the store addresses were fixed between loop iterations.
It is worth noting that most cache related metrics do \emph{not} stand out in this experiment.
For cycles\_ldm\_pending and stalls\_l2\_pending, we see similar behavior to the microkernel analysis in Section~\ref{sec:microkernel}.
These events can be explained by pipeline stalls waiting for aliasing store operations to be retired.
The L1 hit rate also remains stable across all offsets, and only a negligible number of memory accesses actually miss L1.
On the other hand, we see a fairly strong correlation between cycle count and outstanding offcore requests.
These events count the number of outstanding loads to memory outside the processor core each cycle.
Since we do not see any significant number of L1 misses, the correlation to outstanding loads is probably a result of stalling and more cycles executed.
Overall, we conclude that the resource stalling is causing the slowdown, ultimately generated by false dependencies from address aliasing.
We do not attempt to pinpoint exactly which access generates these conflicts, but at a high level the CPU falsely assumes dependencies between \texttt{in[i]} and \texttt{out[i]}.
By manually adjusting the address alignment of one of these buffers, cycle count can be reduced by as much as 50\%.
%This is a speedup on top of already aggressive compiler optimization.
% Variations in execution port activity, and in this case increased number of branches executed, is consistent with micro-ops being reissued after the potential conflict is detected.
% We do not attempt to pinpoint exactly which assembly instructions might alias, but one can assume that the processor thinks memory accesses to \texttt{input[i]} potentially conflicts with \texttt{output[i]}.
\section{Aliasing in Real Applications}
\label{sec:blas}
To investigate how real world applications can be impacted by aliasing, we look at low level numerical applications.
Basic Linear Algebra Subroutines (BLAS) is the de facto standard API for high performance dense linear algebra routines, with several heavily optimized implementations available~\cite{Blackford:2002:UpdatedBLAS}.
\pgfplotstableread{bin/libblas.dat}{\libblastable}
\pgfplotstableread{bin/libatlas.dat}{\libatlastable}
\pgfplotstableread{bin/libopenblas.dat}{\libopenblastable}
\begin{figure*}
\centering
\begin{tikzpicture}
\begin{axis}[
title=libblas,
height=5.0cm,
font=\footnotesize,
xlabel=Address offset of $\boldsymbol{y}$,
ylabel=Event count,
cycle list name=black white,
width=\textwidth/2.8, % Make it fit to text width side by side
legend style={at={(0.4,0.55)},anchor=west,draw=none}
]
\addplot table[x expr = \thisrowno{0}, y = cycles:u] \libblastable ;
\addplot table[x expr = \thisrowno{0}, y = r0107:u ] \libblastable ;
\addlegendentry{Cycles} ;
\addlegendentry{Alias} ;
\end{axis}
\end{tikzpicture}
\begin{tikzpicture}
\begin{axis}[
title=libatlas,
height=5.0cm,
font=\footnotesize,
xlabel=Address offset of $\boldsymbol{y}$,
ylabel=Event count,
cycle list name=black white,
width=\textwidth/2.8,
legend style={at={(0.4,0.55)},anchor=west,draw=none}
]
\addplot table[x expr = \thisrowno{0}, y = cycles:u] \libatlastable ;
\addplot table[x expr = \thisrowno{0}, y = r0107:u ] \libatlastable ;
\addlegendentry{Cycles} ;
\addlegendentry{Alias} ;
\end{axis}
\end{tikzpicture}
\begin{tikzpicture}
\begin{axis}[
title=libopenblas,
height=5.0cm,
font=\footnotesize,
xlabel=Address offset of $\boldsymbol{y}$,
ylabel=Event count,
cycle list name=black white,
width=\textwidth/2.8,
legend style={at={(0.4,0.55)},anchor=west,draw=none}
]
\addplot table[x expr = \thisrowno{0}, y = cycles:u] \libopenblastable ;
\addplot table[x expr = \thisrowno{0}, y = r0107:u ] \libopenblastable ;
\addlegendentry{Cycles} ;
\addlegendentry{Alias} ;
\end{axis}
\end{tikzpicture}
\caption{\label{fig:blas}Performance of \texttt{cblas\_dgemv} for different BLAS packages, measuring cycles executed and address alias events for varying relative address offset between matrix $A$ and vector $\boldsymbol{y}$. Each sample point represents an increment of 16 bytes, or 0x10.}
\end{figure*}
%The level 1 and 2 routines involving vectors intuitively appears to be more likely to have potential for aliasing.
We look at three different publicly available libraries; \emph{libblas}, \emph{libatlas}, and \emph{libopenblas}~\cite{Whaley:1998:ATLAS,Wang:2013:OpenBLAS}.
We ran our bias analysis on the level 2 procedure \texttt{cblas\_dgemv}, which computes the matrix-vector product $\boldsymbol{y} = \alpha\text{op}\left(A\right)\boldsymbol{x} + \beta\boldsymbol{y}$ for \texttt{double} sized numbers.
In this equation, $\alpha$ and $\beta$ are constants, and $\text{op}\left(A\right)$ is an optional transpose or complex conjugate of the matrix.
We set $\alpha = 1$, $\beta = 0$ and $\text{op} \left(A\right) = A$ to reduce the formula to $\boldsymbol{y}=A\boldsymbol{x}$.
Denote the matrix size by $M \times N$, so that $\boldsymbol{x}$ has length $N$ and $\boldsymbol{y}$ has length $M$.
%In code, the function is invoked with the following parameters:
%\begin{lstlisting}[breaklines=true]
% cblas_dgemv(CblasColMajor, CblasNoTrans, M, N, alpha, A, M, x, 1, beta, y, 1);
%\end{lstlisting}
Each of the structures \texttt{A}, \texttt{x} and \texttt{y} represents a different location in memory, and their addresses are part of the execution context.
We use \texttt{mmap} with manually added offsets to control the location of each of these buffers.
From our experiments, we found that the memory address offset of vector $\boldsymbol{y}$ compared to matrix $A$ is the most important with respect to the number of alias events.
For the results presented here, we set $A = I$ (identity matrix) and $\boldsymbol{x} = [0, 1, 2, ..., N-1]$, with parameters $M = N = 1024$.
Address suffixes of $A$ and $\boldsymbol{x}$ are fixed to 0x000, while performance events are sampled for increasing address offset of $\boldsymbol{y}$, starting at a baseline of 0x000.
The experiment was run on the ``Haswell'' setup, with Hyper-Threading enabled.
In an attempt to isolate the cost of a single \texttt{cblas\_dgemv} invocation, we use the estimation approach described in Section~\ref{sec:heap} to subtract the constant overhead from heap allocation and initialization.
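For reference, the reduced computation $\boldsymbol{y} = A\boldsymbol{x}$ in column-major layout can be written out directly (a plain C sanity check of the setup, not one of the measured libraries):

```c
#include <stddef.h>

/* y = A * x, column-major, for the reduced case alpha = 1,
   beta = 0, op(A) = A. With A = I this must return y = x. */
static void dgemv_ref(int m, int n, const double *A,
                      const double *x, double *y)
{
    for (int i = 0; i < m; i++)
        y[i] = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            y[i] += A[(size_t)j * m + i] * x[j];
}
```

With $A = I$ and $\boldsymbol{x} = [0, 1, 2, ..., N-1]$ as above, the expected output is simply $\boldsymbol{y} = \boldsymbol{x}$, making correctness of each library trivially verifiable.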
Cycles executed and alias events for each library are shown in Figure~\ref{fig:blas}.
Each implementation behaves differently, but in all cases we identify a significant amount of aliasing.
The diagrams can be read the same way as for the convolution example; partial address overlap occurs when the address suffix delta of $A$ and $\boldsymbol{y}$ is close to zero.
Libblas generates the most distinct pattern, with a clear spike for address deltas between 0x10 and 0x60.
% NB: Taken from analysis directory, update manually on new results
Correlation between cycles and address aliasing was 95\% for libblas, 81\% for libatlas and 45\% for libopenblas.
We do not go into a more detailed analysis of the raw performance data here, but note that we see many of the previously discussed events stand out, such as execution port occupancy, resource stalls, branch instructions executed, offcore requests, and ldm pending.
Especially for the libblas case, results are similar to what we saw for the convolution kernel.
Our results show that address aliasing can indeed occur in real, and already highly optimized applications.
The amount of \emph{bias} this creates with respect to number of cycles executed is not as clear in all cases, but at least for libblas we are confident in claiming that aliasing is indeed causing a performance degradation.
\section{Address Aliasing Mitigation Techniques}
\label{sec:mitigation}
Performance penalties caused by address aliasing can be significant, creating bias towards certain memory layouts.
However, with an understanding of the underlying mechanism that causes bias, the effects can be predicted, and to some extent eliminated in software.
%The most relevant scenario to consider is probably code that can hit the aliasing \texttt{mmap} scenario, where performance impact can be consistently bad over all environments.
In addition to controlling bias in experimental setups, avoiding alias issues is also necessary for achieving optimal performance.
This opens up new avenues to explore, for example in making more intelligent architecture aware compiler optimizations, or use of automatic performance tuning to avoid suboptimal memory layouts.
Some principles are also more directly applicable, and we identify the following relevant optimization techniques:
\subsection{Mark buffers with \texttt{restrict}}
In our convolution kernel implementation, the compiler has to account for the fact that input and output pointers might alias, or that the buffers partially overlap.
This limits the extent to which generated code can keep data in registers without updating the values from memory, as a write to one buffer could potentially invalidate a cached value from the other.
The C99 keyword \emph{restrict} can be used to explicitly tell the compiler that accesses through a pointer do not alias with any others, allowing for more efficient code generation with fewer memory accesses.
% \begin{lstlisting}[breaklines=true]
% void conv(int n, const float * restrict in, float * restrict out);
% \end{lstlisting}
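As a sketch, a restrict-qualified 3-point stencil in the same spirit as Figure~\ref{lst:conv} (the coefficients here are illustrative, not the paper's fixed kernel):

```c
/* The restrict qualifiers promise the compiler that in and out
   never overlap, so values loaded from in need not be re-read
   after each store to out. */
static void conv3(int n, const float *restrict in,
                  float *restrict out)
{
    for (int i = 1; i < n - 1; i++)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i]
               + 0.25f * in[i + 1];
}
```
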
Applied to our example, the \emph{restrict} qualifier allows the compiler to move a store instruction out of the inner loop, reducing the number of stores by more than 80\% and alias events by approximately 10 million at optimization level O2.
%Alias events are reduced by about 10 million for the default alignment, with a corresponding improvement in cycle count. % File backing this in results
\subsection{Use a special purpose allocator}
The Intel optimization manual suggests that a special purpose memory allocator could be used to avoid aliasing~\cite{OptimizationManual}. % in User/Source Coding Rule 8
However, to our knowledge there are no allocators that specifically try to mitigate aliasing in heap allocated memory.
%Similarly none offset their allocations based on the initial stack frame.
%We show that heap allocators are prone to generate pairwise aliasing buffers, the worst case for many functions.
One approach could be to apply a heuristic that randomizes addresses further, and in particular avoids always returning the same 12-bit suffix for large allocations.
The allocator could also accept hints to which buffers are going to be accessed in parallel, for example as an optional argument to a custom version of \texttt{malloc}.
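As an illustration of what such an allocator might do, the sketch below rotates the low address bits of successive allocations (\texttt{malloc\_noalias} and \texttt{free\_noalias} are hypothetical names, not an existing library):

```c
#include <stdlib.h>

/* Each allocation is shifted by a rotating multiple of 16 bytes,
   so large, page-aligned allocations handed out back-to-back no
   longer share the same 12-bit address suffix. The unshifted
   base is stashed just below the returned pointer for free(). */
static size_t slot = 0;

static void *malloc_noalias(size_t n)
{
    size_t offset = 16 + (slot++ % 255) * 16;  /* 16..4080 bytes */
    char *base = malloc(n + offset);
    if (base == NULL)
        return NULL;
    char *p = base + offset;
    ((void **)p)[-1] = base;   /* offset >= 16 leaves room */
    return p;
}

static void free_noalias(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

For requests served by \texttt{mmap}, where the base is page aligned, the rotating offset deterministically varies the returned suffix; for small heap allocations the effect depends on the underlying allocator.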
\subsection{Adjusting address offsets}
In some cases it might help to explicitly control the memory addresses used, for example by requiring fixed relative offset between input and output pointers.
For heap allocations, one can exploit that \texttt{mmap} is guaranteed to be page aligned, and use that directly to create specific alignments.
% The following approach can be used to make an anonymous memory mapping with offset \texttt{d} bytes away from page alignment.
% \begin{lstlisting}[breaklines=true]
% mmap(NULL, (n+d), PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) +d;
% \end{lstlisting}
% The same pointer difference must of course by subtracted before unmapping the memory.
% See /analysis for results showing this does not alias
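Expanding on this, an anonymous mapping plus a manual shift yields any desired 12-bit suffix (a sketch; the same offset must be subtracted again before \texttt{munmap}):

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* An anonymous mapping is always page aligned, so adding d bytes
   (d < 4096) produces a pointer whose low 12 bits equal d. */
static void *mmap_offset(size_t n, size_t d)
{
    void *base = mmap(NULL, n + d, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    return (char *)base + d;
}
```
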
\begin{figure}
\begin{lstlisting}[frame=single, xleftmargin=.005\textwidth, xrightmargin=.005\textwidth]
#define ALIAS(a, b) \
((((long)&a) & 0xfff) == (((long)&b) & 0xfff))
static int i, j, k;
int main() {
int g = 0, inc = 1;
if (ALIAS(inc, i) || ALIAS(g, i))
return main();
for (; g < 65536; g++) {
i += inc;
j += inc;
k += inc;
}
return 0;
}
\end{lstlisting}
\caption{Modified microkernel that can dynamically detect aliasing case, and avoid it by pushing another stack frame.}
\label{lst:loopfixed}
\end{figure}
% Addresses of automatic variables can not be determined statically, because the location of the stack at runtime is generally unknown.
% In addition to being offset by environment variables, the initial stack address can also be perturbed by for example program arguments or address randomization.
% Although one can not easily know if a collision is going to happen for a given environment a priori, it is possible to change the program to account for possible alias conditions.
It is also possible to manipulate addresses of automatic (stack allocated) variables.
A proof of concept of how alias-free code could be generated for the program discussed in Section~\ref{sec:microkernel} is shown in Figure \ref{lst:loopfixed}.
If the addresses do alias, a branch to an alternative but semantically equivalent code path is performed.
Calling the function recursively will effectively allocate a new set of variables further down the stack, and the alias condition is avoided.
Even though this particular solution is impractical, it demonstrates that compilers, as well as programmers, can take measures to account for address aliasing.
% A complier can generate the same solution automatically by using data access relevant pointer identification
%While this example uses C code to address the problem it is possible to perform the same directly by a compiler extension.
%In general we need automated searches for both relevant pointer addresses, offset rules and good offsets.
%For general code one need to use automated searches for both relevant pointer identification and offset generator rules.
%By utilizing hardware performance counter information this issue can be checked
\section{Related Work}
\label{sec:related}
Measurement bias and observer effects in performance analysis have been studied in detail by Mytkowicz \emph{et al.}~\cite{Mytkowicz:2008:OE&MB,Mytkowicz:2008:Easy}.
The authors introduce environment variables and link ordering as examples of external bias triggers, showing how standardized {\small SPECint} benchmarks suffer from measurement bias when trying to evaluate the effectiveness of O3 over O2.
Strategies for detecting bias through causal analysis and randomization of experimental setups is introduced.
The microkernel we analyzed is taken from a followup paper by the same authors, where it is used as an example of how environment size can cause bias~\cite{Mytkowicz:2009:WrongData}.
Many different approaches have been explored in order to take performance counter data and processor intrinsics into account for optimization.
Knights \emph{et al.} introduce ``blind'' optimization, which treats the underlying hardware as a black box and searches over \emph{variant spaces}~\cite{Knights:2009:BlindOpt}.
Their space of variants consists of different positions and alignments for each function and global variable, in principle exploring the configurations reachable by altering link order.
Others have used machine learning to classify performance counter data, recognizing patterns or \emph{pathologies} in measurements to identify optimization opportunities~\cite{Yoo:ADD}.
The authors of MAO present a framework that extends compilers with very low-level microarchitectural assembly optimizations, using rules, pattern matching, and random insertion of \texttt{nop} instructions to discover optimal assembly output~\cite{Hundt:2011:MAO}.
\section{Conclusion}
\label{sec:conclusions}
We have shown how address aliasing affects program performance under different memory contexts, and how it can explain certain cases of measurement bias.
%The effect is caused by how speculative and out-of-order memory operations are handled by the CPU, which only considers the last 12 address bits to resolve conflicts between a load and previous store operations.
False conflicts occur when memory operations are executed speculatively and out of order, because the processor considers only the last 12 address bits when resolving conflicts between a load and earlier stores.
We observe this effect on Intel platforms ranging from the ``Core'' to the ``Haswell'' microarchitecture.
In general, any change to how data is laid out in memory can potentially introduce bias effects from address aliasing.
Analyzing an example biased by environment size, we determined that collisions between automatic variables and static data caused aliasing at certain stack positions: variations in environment size offset the virtual addresses of stack-allocated variables, triggering the alias condition.
Dynamically allocated data is controlled by heap allocators, which we show are another source of bias.
Comparing four different libraries, we show that most implementations tend to favor page alignment for larger requests.
This means that structures such as vectors and matrices are likely to alias by default, which can be a worst-case scenario for many algorithms.
%Address aliasing can have significant performance impact for algorithms operating on heap allocated buffers, which
We analyze an example with up to $2\times$ variation in cycle count between different heap alignments, showing that address aliasing can have a significant performance impact.
We identify instances of alias issues in matrix-vector multiplication routines for several BLAS libraries, showing that this effect also appears in real, performance-critical code.
%We have shown a method of performance search with hardware performance counters that can identify unknown performance issues inside modern CPUs, and how to deal with the issue.
%This shows that other issues with address sensitive memory interactions may also be detected and addressed automatically.
%We have shown new possibilities in the area of auto-tuning, both how to detect general issues, and outlined how to deal with a specific one.
We show how techniques like padding of variables and alternative alias-free code paths can be used to avoid aliasing at runtime.
For heap address aliasing in particular, we find that manual intervention may be required to achieve optimal performance.
Our results can inspire optimizations and heuristics that take address aliasing effects into account, as the potential performance improvements can be substantial.
\section*{Acknowledgment}
This work is derived from a Master's thesis project supervised by Anne Cathrine Elster and Rune Erlend Jensen, looking at sources of measurement bias on the Intel ``Ivy Bridge'' microarchitecture~\cite{MasterThesis}.
% Bibliography
\bibliographystyle{IEEEtran}
\bibliography{references}
\end{document}