\chapter{Mapping}
\label{chap:mapping}
The Legion mapper interface is a key part
of the Legion programming system. Through the mapping interface applications
can control most decisions that impact application performance.
The philosophy is that these choices are better left to applications
rather than using hard-wired heuristics in Legion that attempt to ``do the right thing'' in
every situation. The few performance heuristics
that are included in Legion are associated with low levels of the system
where there is no good way to expose those choices to the application.
For everything else applications can set the policies.
This design resulted from our own past experience with systems
where built-in performance heuristics did not behave as we desired and there was no recourse
to override those decisions. While Legion does allow
experts to squeeze every last bit of performance from a system, it is important
to realize that doing so potentially requires understanding and setting a wide
variety of parameters exposed in the mapping interface.
This level of control can be overwhelming at first to users who are not used to
considering all the possible dimensions that influence performance in complex,
distributed and heterogeneous systems.
To help users write initial versions of their applications without needing
to concern themselves with tuning the performance knobs exposed by the mapper
interface, Legion provides a {\em default mapper}. The default mapper
implements the Legion mapper API (like any other mapper) and provides a number
of heuristics that can provide reasonably performant, or at least correct, initial
settings. A good way to think about the default mapper is that it is the version
of Legion with built-in heuristics that allows casual users to write Legion
applications and allows experts to start quickly on a new application.
It is, however, unreasonable to expect the default mapper to provide excellent performance, and in
particular assuming that the performance of an application using the default
mapper is even an approximation of the performance that could be
achieved with a custom mapper is a mistake.
We will use several examples from the default mapper
to illustrate how mappers are constructed. We will also describe where
possible the heuristics that the default mapper employs to achieve
reasonable performance. Because the default mapper uses generic heuristics
with no specific knowledge of the application, it is almost certain to make
poor decisions at least some of the time.
Performance benchmarking using only the default mapper is strongly
discouraged, while using custom application-specific mappers is
encouraged.
It is likely that the moment when you are dissatisfied with the
heuristics in the default mapper will come sooner rather than later.
At that point the information in this chapter will be necessary for you
to write your own custom mapper. In practice, our experience has been that in
many cases all that is necessary is to replace a small number of policies in the
default mapper that are a poor fit for the application.
\section{Mapper Organization}
\label{sec:mapping:org}
The Legion mapper interface is an abstract C++ class that defines a set of
pure virtual functions that the Legion runtime invokes as {\em callbacks}
for making performance-related decisions. A Legion mapper is
a class that inherits from the base abstract class and provides
implementations of the associated pure virtual methods.
A callback is just a function pointer---when the runtime system calls a mapper
function, it is said to have ``invoked the callback''. Callbacks are a commonly used
mechanism in software systems for parameterizing some specific functionality; in our case
mappers parameterize the performance heuristics of the Legion runtime system.
There are a few general things to keep in mind about mappers and callbacks:
\begin{itemize}
\item The runtime may invoke callbacks in an unpredictable order. While multiple callbacks associated with a
single instance of a Legion object, such as a task, will happen in a specific order for that task,
other callbacks for other operations may be interleaved.
\item Depending on the synchronization model selected (see Section~\ref{subsec:mapping:sync}), mappers
may have a degree of concurrency between mapper callbacks.
\item Since mappers are C++ objects, they can have arbitrary internal state. For example, it may be useful
to maintain performance or load-balancing statistics that inform mapping decisions.
However, state updates done by a mapper must take into account the unpredictable order in
which callbacks are invoked, as well as any issues of concurrent access to mapper data structures.
\end{itemize}
\subsection{Mapper Registration}
\label{subsec:mapping:registration}
After the Legion runtime is created, but before the application
begins, mapper objects can be registered
with the runtime. Figure~\ref{fig:mapper_registration} gives a small
example registering a custom mapper.
\begin{figure}
{ \small
\lstinputlisting[linerange={14,78}]{Examples/Mapping/registration/registration.cc}
}
\caption{\legionbook{Mapping/registration/registration.cc}}
\label{fig:mapper_registration}
\end{figure}
To register {\tt CustomMapper} objects, the
application adds the mapper callback function by invoking the
{\tt Runtime::add\_registration\_callback} method, which takes as an
argument a function pointer to be invoked. The function pointer must
have a specific type, taking as arguments a {\tt Machine} object,
a {\tt Runtime} pointer, and a reference to an STL set of {\tt Processor}
objects. The call can be invoked multiple times to record multiple
callback functions (e.g., to register multiple custom mappers, perhaps for different libraries). All
callback functions must be added prior to the invocation of the
{\tt Runtime::start} method. We recommend that applications include the registration
method as a static method on the mapper class (as in Figure~\ref{fig:mapper_registration})
so that it is closely coupled to the custom mapper itself.
Before invoking any of the registration callback functions, the runtime
creates an instance of the default mapper for each processor of
the system. The runtime then invokes the callback functions in the order
they were added. Each callback function is invoked once on each
instance of the Legion runtime. For multi-process jobs, there will be
one copy of the Legion runtime per process and therefore one invocation
of each callback per process. The set of processors passed into each
registration callback function will be the set of application processors
that are local to the process\footnote{Mappers cannot be associated with
utility processors, and therefore utility processors are not included
in the set.}, thereby providing a registration callback
function with the necessary context to know which processors it
will create new custom mappers for.
If no callback functions are registered then the only mappers
that will be available are instances of the default mapper associated
with each application processor.
Upon invocation, the registration callbacks should create instances
of custom mappers and associate them with application processors.
This step can be done through one of two runtime calls. The callback
can replace the default mappers (always registered with {\tt MapperID}
0) by calling {\tt Runtime::replace\_default\_mapper}, which is the
only way to replace the default mappers. Alternatively, the registration
callback can use {\tt Runtime::add\_mapper} to register a mapper with a
new {\tt MapperID}. Both the {\tt Runtime::replace\_default\_mapper} and
the {\tt Runtime::add\_mapper} methods support an optional processor
argument, which tells the runtime to associate the mapper with a specific
processor. If no processor is specified, the mapper is associated
with all processors on the local node. Whether one mapper object should
handle the mapping decisions for a single application processor or for
all application processors on a node is a choice left to the mapper;
Legion supports both use cases. From a performance
perspective, the best choice is likely to depend on the mapper synchronization
model (see Section~\ref{subsec:mapping:sync}).
Note that the mapper calls require a pointer to the {\tt MapperRuntime}, such as on
lines 27 and 49 of Figure~\ref{fig:mapper_registration}.
The mapper runtime provides the interface for mapper calls to call back
into the runtime to acquire access to different physical resources. We
will see examples of the use of the mapper runtime throughout
this chapter.
\subsection{Synchronization Model}
\label{subsec:mapping:sync}
Within an instance of the Legion runtime there are often several threads
performing the analysis necessary to advance the execution of an
application. If some threads are performing work for operations
owned by the same mapper, it is possible that they will attempt to
invoke mapper calls for the same mapper object concurrently. For both
productivity and correctness reasons, we do not want users to be
responsible for making their mappers thread-safe. Therefore we allow
mappers to specify a {\em synchronization model} that the runtime
follows when concurrent mapper calls are made.
Each mapper object can specify its synchronization model via the
{\tt get\_mapper\_sync\_model} mapper call. The runtime invokes this
method exactly once per mapper object immediately after the mapper is
registered with the runtime. Once the synchronization model has been set
for a mapper object it cannot be changed. Currently three
synchronization models are supported:
\begin{itemize}
\item {\em Serialized Non-Reentrant}. Calls to the
mapper object are serialized and execute atomically. If the mapper
calls out to the runtime and the mapper call is preempted,
no other mapper calls can be invoked by the runtime.
This synchronization model conforms to the original version of
the Legion mapper interface.
\item {\em Serialized Reentrant}. At most one mapper call
executes at a time. However, if a mapper call invokes a runtime
method that preempts the mapper call, the runtime may
execute another mapper call or resume a previously blocked
mapper call. It is up to the user to handle any changes in internal mapper
state that might occur while a mapper call is preempted (e.g., the
invalidation of STL iterators to internal mapper data structures).
\item {\em Concurrent}. Mapper calls to the same mapper object can
proceed concurrently. Users can invoke the {\tt lock\_mapper} and
{\tt unlock\_mapper} calls to perform their own synchronization
of the mapper. This synchronization model is particularly useful for
mappers that simply return static mapping decisions
without changing internal mapper state.
\end{itemize}
The synchronization model offers a tradeoff between mapper complexity
and performance. The default mapper uses the serialized reentrant model,
which provides a good balance between programmability and performance.
\subsection{Machine Interface}
\label{subsec:mapping:machine}
All mappers are given a {\tt Machine} object to enable
introspection of the hardware on which the application is executing. The
{\tt Machine} object is defined by Realm, Legion's low-level portability layer (see {\tt realm/machine.h}).
There are two interfaces for querying the machine
object. The old interface contains methods such as {\tt get\_all\_processors}
and {\tt get\_all\_memories}. These methods populate STL data structures
with the appropriate names of processors and memories. We strongly
discourage using these methods as they are not scalable on large
architectures with tens to hundreds of thousands of processors or memories.
The recommended, and more efficient and scalable, interface is based
on {\em queries}, which come in two types: {\tt ProcessorQuery} and
{\tt MemoryQuery}. Each query is initially given a reference to the machine
object. After initialization the query lazily materializes the (entire) set of
either processors or memories of the machine.
The mapper applies {\em filters} to the query to reduce the
set to processors or memories of interest. These filters can include specializing
the query to the local node using {\tt local\_address\_space}, to one kind of processors with the {\tt only\_kind} method, or by
requesting that the processor or memory have a specific affinity to another
processor or memory with the {\tt has\_affinity\_to} method. Affinity can either be
specified as a maximum bandwidth or a minimum latency. Figure~\ref{fig:mapper_machine}
shows how to create a custom mapper that uses queries to find the local
processors of the same kind as the local mapper processor, as well as the
memories with affinity to it. In some cases these queries are still expensive, so we
encourage the creation of mappers that memoize the results of their most
commonly invoked queries to avoid duplicated work.
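As a sketch of such memoization (assuming, as in the default mapper, that the mapper stores the {\tt Machine} object and its local processor in members named {\tt machine} and {\tt local\_proc}; the vector member {\tt local\_procs} is illustrative):
\begin{lstlisting}
// Sketch: compute the local processors of our own kind once and
// cache them for reuse across later mapper calls.
void CustomMapper::fill_local_procs(void)
{
  if (!local_procs.empty()) return;  // already memoized
  Machine::ProcessorQuery query(machine);
  query.local_address_space();          // restrict to this node
  query.only_kind(local_proc.kind());   // same kind as our processor
  for (Machine::ProcessorQuery::iterator it = query.begin();
        it != query.end(); it++)
    local_procs.push_back(*it);
}
\end{lstlisting}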
\begin{figure}
{\small
\lstinputlisting[linerange={22,80}]{Examples/Mapping/machine/machine.cc}}
\caption{\legionbook{Mapping/machine/machine.cc}}
\label{fig:mapper_machine}
\end{figure}
\section{Mapping Tasks}
\label{sec:mapping:tasks}
There are a number of different kinds of operations with mapping callbacks, but the core of the mapping interface, and the parts
of mappers that users will most commonly customize, are the callbacks for mapping tasks.
When a task is launched it proceeds through a pipeline of mapping callbacks. The most important pipeline stages are:
\begin{enumerate}
\item {\tt select\_task\_options }
\item {\tt select\_sharding\_functor} (for control-replicated tasks)
\item {\tt slice\_task } (for index launches)
\item {\tt select\_tasks\_to\_map} (tasks remain in this stage until selected for mapping)
\item {\tt map\_task}
\end{enumerate}
Stages 2 and 3 do not apply to every task, and tasks may repeat stage 4 any number of times depending on the implementation of {\tt select\_tasks\_to\_map}.
After discussing these five components of the task mapping pipeline, we discuss a few other topics relevant to task mapping: allocating new physical instances, postmapping of tasks, virtual mappings, and profiling requests.
\subsection{Controlling Task Mapping}
{\tt select\_task\_options} is the first callback for mapping tasks. It is invoked for every task $t$ exactly once in the Legion process where $t$ is launched.
The signature of the function is:
\begin{lstlisting}
virtual void select_task_options(const MapperContext ctx,
const Task& task,
TaskOptions& output) = 0;
\end{lstlisting}
The purpose of the callback is to set fields of the {\tt output} object. All of the fields have defaults, so none are required to be set by the callback implementation.
This callback comes first because the fields of {\tt TaskOptions} control the rest of the mapping process for the task.
\begin{itemize}
\item For a single task $t$ (not an index launch), {\tt output.initial\_proc} is the processor that will execute $t$; the default is the current processor.
The processor does not need to be local---the mapper can select any processor in the machine model for which a variant of $t$ exists. As we will see, $t$'s target processor can be changed by subsequent stages. The reason for choosing a target processor
here is that by default $t$ is sent to the Legion process that manages the target processor to be mapped.
\item If {\tt output.inline\_task} is true (the default is false) the task will be inlined into the parent task and use the parent task's regions. Any needed regions that are unmapped will be remapped. Inline tasks do not go through the rest of the task pipeline, except for the selection of
a task variant.
\item If {\tt output.stealable} is true then the task can be stolen for load balancing; the default is false. A stealable task $t$ can be stolen by another mapper until $t$ is chosen by {\tt select\_tasks\_to\_map}.
\item As mentioned above, by default the {\tt map\_task} stage of the mapping pipeline is done by the Legion process that manages the processor where the task will execute. If {\tt output.map\_locally} is true (the default is false) then {\tt map\_task} will be run by the current mapper.
Just to emphasize: {\tt map\_locally} controls where a mapping callback for the task is run, not where the task executes. This option is mostly useful for leaf tasks that will be sent to remote processors. In this case, making the mapping decisions locally saves transmitting
task metadata to the remote Legion runtime.
\item If {\tt valid\_instances} is set to false, then the task will not receive a list of the currently valid instances of regions in subsequent calls to {\tt request\_valid\_instances}, which saves some runtime overhead. This setting is useful if the task will never use
a currently valid region instance, such as when all the regions of an inner task will be virtually mapped.
\item Setting {\tt replicate\_default} to true turns on replication of single tasks in a control-replication context, which means that the task will be executed separately in every Legion process participating in the replication of the parent task. The default setting
is false; in this case only one instance of a single task with a control-replicated parent is executed on one processor and then the results are broadcast to the other Legion processes. Replicating single tasks avoids the broadcast communication. There are some restrictions on replicated single tasks to ensure the
replicated versions all have identical behavior: the tasks cannot have reduction-only privileges on any field, and any fields with write privileges must use a separate instance for each replicated task.
\item A task can set the priority of the parent task by modifying {\tt output.parent\_priority}, if that is permitted by the mapper. The default is the parent's current priority. When tasks are ready to execute, tasks with higher priority are moved to the front of the ready queue.
\end{itemize}
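Putting these fields together, a minimal implementation that keeps the defaults except for routing a hypothetical task named {\tt stencil} to a previously chosen processor might look like the following sketch (the member {\tt preferred\_proc} is illustrative):
\begin{lstlisting}
void CustomMapper::select_task_options(const MapperContext ctx,
                                       const Task& task,
                                       TaskOptions& output)
{
  // All fields of output arrive pre-set to their defaults, so we
  // only override the ones we care about.
  if (strcmp(task.get_task_name(), "stencil") == 0)
    output.initial_proc = preferred_proc;  // chosen earlier by a query
}
\end{lstlisting}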
\subsection{Sharding}
As the name suggests, {\tt select\_sharding\_functor} is used to select the functor for {\em sharding} index task launches in control-replicated contexts. Sharding divides the index space of the task launch into subspaces and associates each shard with a mapper (a processor)
where those tasks will be mapped. This callback is invoked once per replicated task index launch in each replicated context:
\begin{lstlisting}
virtual void select_sharding_functor(
const MapperContext ctx,
const Task& task,
const SelectShardingFunctorInput& input,
SelectShardingFunctorOutput& output) = 0;
struct SelectShardingFunctorInput {
std::vector<Processor> shard_mapping;
};
struct SelectShardingFunctorOutput {
ShardingID chosen_functor;
bool slice_recurse;
};
\end{lstlisting}
The {\tt shard\_mapping} of the input structure provides a vector of the processors where the replicated task is running. The callback must fill in the {\tt chosen\_functor} field of the output structure with the id of a sharding function registered with the mapper at
startup. The callback can set {\tt slice\_recurse} to indicate whether or not the index subspaces chosen by the sharding functor should be recursively sharded on the destination processor. The same sharding functor must be selected in every control-replicated context, which
will be checked by the runtime when in debug mode.
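A sketch of this callback that always selects a single application-registered functor (with a hypothetical id {\tt MY\_SHARDING\_ID}) could be:
\begin{lstlisting}
void CustomMapper::select_sharding_functor(const MapperContext ctx,
    const Task& task,
    const SelectShardingFunctorInput& input,
    SelectShardingFunctorOutput& output)
{
  // Every control-replicated context must make the same choice,
  // so the decision here must not depend on local mapper state.
  output.chosen_functor = MY_SHARDING_ID;  // registered at startup
  output.slice_recurse = true;  // allow further slicing at the shards
}
\end{lstlisting}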
\subsection{Slicing}
{\tt slice\_task} is called for every index launch. To make index launches efficient, the index space of tasks is first sliced into smaller sets of tasks and each set is sent to a destination mapper as a single object rather than sending
multiple individual tasks. The signature of {\tt slice\_task} is:
\begin{lstlisting}
virtual void slice_task(const MapperContext ctx,
const Task& task,
const SliceTaskInput& input,
SliceTaskOutput& output) = 0;
\end{lstlisting}
The {\tt SliceTaskInput} includes the index space of the task launch (field {\tt domain\_is}). The index space of the shard is also included for control-replicated tasks.
\begin{lstlisting}
struct SliceTaskInput {
IndexSpace domain_is;
Domain domain;
IndexSpace sharding_is;
};
\end{lstlisting}
The implementation of {\tt slice\_task} should set the fields of {\tt SliceTaskOutput}:
\begin{lstlisting}
struct SliceTaskOutput {
std::vector<TaskSlice> slices;
  bool verify_correctness; // = false
};
struct TaskSlice {
public:
TaskSlice(void) : domain_is(IndexSpace::NO_SPACE),
domain(Domain::NO_DOMAIN), proc(Processor::NO_PROC),
recurse(false), stealable(false) { }
TaskSlice(const Domain &d, Processor p, bool r, bool s)
: domain_is(IndexSpace::NO_SPACE), domain(d),
proc(p), recurse(r), stealable(s) { }
TaskSlice(IndexSpace is, Processor p, bool r, bool s)
: domain_is(is), domain(Domain::NO_DOMAIN),
proc(p), recurse(r), stealable(s) { }
public:
IndexSpace domain_is;
Domain domain;
Processor proc;
bool recurse;
bool stealable;
};
\end{lstlisting}
The {\tt slices} field is a vector of {\tt TaskSlice}, each of which names a subspace of the index space in {\tt domain\_is} and a destination processor {\tt proc} for the slice of tasks. The tasks of the slice can be marked as stealable, and setting the {\tt recurse} field
means that {\tt slice\_task} will be called again by the mapper associated with the destination processor to allow the slice to be further subdivided before processing individual tasks.
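As an illustration, a {\tt slice\_task} that creates one single-point slice per task, round-robin over a cached vector of processors (the member {\tt local\_procs} is illustrative), might be sketched as:
\begin{lstlisting}
void CustomMapper::slice_task(const MapperContext ctx,
                              const Task& task,
                              const SliceTaskInput& input,
                              SliceTaskOutput& output)
{
  // One point per slice is simple but inefficient; a real mapper
  // would create fewer, larger slices to amortize overheads.
  unsigned idx = 0;
  for (Domain::DomainPointIterator itr(input.domain); itr; itr++, idx++)
  {
    Processor target = local_procs[idx % local_procs.size()];
    output.slices.push_back(TaskSlice(Domain(itr.p, itr.p), target,
                              false/*recurse*/, false/*stealable*/));
  }
}
\end{lstlisting}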
\subsection{Selecting Tasks to Map}
{\tt select\_tasks\_to\_map} gives the mapper control over which tasks should be mapped and which should be sent to other processors---the initial processor assignment set in {\tt select\_task\_options} can be changed if desired. At this point
in the task mapping pipeline all index tasks have been expanded into single tasks, and {\tt select\_tasks\_to\_map} is called by the mapper associated with the destination process, unless {\tt map\_locally} was chosen in {\tt select\_task\_options}.
The signature of the callback is:
\begin{lstlisting}
virtual void select_tasks_to_map(const MapperContext ctx,
const SelectMappingInput& input,
SelectMappingOutput& output) = 0;
struct SelectMappingInput {
std::list<const Task*> ready_tasks;
};
struct SelectMappingOutput {
std::set<const Task*> map_tasks;
std::map<const Task*,Processor> relocate_tasks;
MapperEvent deferral_event;
};
\end{lstlisting}
For each task in {\tt ready\_tasks} of the {\tt SelectMappingInput} structure, the callback implementation can do one of three things:
\begin{itemize}
\item Add the task to {\tt map\_tasks}, in which case the task will proceed with mapping on the assigned local processor.
\item Add the task to {\tt relocate\_tasks} along with a new destination processor to which the task will be transferred.
\item Nothing, in which case the task will remain in the {\tt ready\_tasks} list for the next call to {\tt select\_tasks\_to\_map}.
\end{itemize}
If the call does not select at least one task to map or transfer, then it must provide a {\tt MapperEvent} in the field {\tt deferral\_event}---another call to {\tt select\_tasks\_to\_map} will not be made until that event is triggered.
Of course, it is up to the mapper to guarantee that the event is eventually triggered.
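A simple implementation that maps every ready task on its currently assigned processor, and therefore never needs a deferral event, might be:
\begin{lstlisting}
void CustomMapper::select_tasks_to_map(const MapperContext ctx,
                                       const SelectMappingInput& input,
                                       SelectMappingOutput& output)
{
  // A load-balancing mapper could instead move some of these tasks
  // to other processors via output.relocate_tasks.
  for (std::list<const Task*>::const_iterator it =
        input.ready_tasks.begin(); it != input.ready_tasks.end(); it++)
    output.map_tasks.insert(*it);
  // At least one task was selected, so no deferral_event is required.
}
\end{lstlisting}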
\subsection{Map\_Task}
\label{subsec:maptask}
{\tt map\_task} is normally the final stage of the task mapping pipeline. This callback selects a processor or processors for the task, maps the task's region arguments, and selects the task variant to use, after which the task will run on one of the selected processors.
\begin{lstlisting}
virtual void map_task(
const MapperContext ctx,
const Task& task,
const MapTaskInput& input,
MapTaskOutput& output) = 0;
struct MapTaskInput {
std::vector<std::vector<PhysicalInstance> > valid_instances;
std::vector<unsigned> premapped_regions;
};
struct MapTaskOutput {
std::vector<std::vector<PhysicalInstance> > chosen_instances;
std::vector<std::vector<PhysicalInstance> > source_instances;
std::vector<Memory> output_targets;
std::vector<LayoutConstraintSet> output_constraints;
std::set<unsigned> untracked_valid_regions;
std::vector<Memory> future_locations;
std::vector<Processor> target_procs;
VariantID chosen_variant; // = 0
TaskPriority task_priority; // = 0
TaskPriority profiling_priority;
ProfilingRequest task_prof_requests;
ProfilingRequest copy_prof_requests;
bool postmap_task; // = false
};
\end{lstlisting}
The input structure contains a vector of vectors of valid instances: each element of the outer
vector is a vector of instances that hold valid data for the corresponding region requirement.
The {\tt premapped\_regions} field is a vector of indices of region
requirements that are already satisfied and do not need to be mapped by the callback.
The callback must fill in the following fields of the {\tt output} structure:
\begin{itemize}
\item {\tt target\_procs} is a vector of processors. All processors must be on the same node and of the same kind (e.g., all LOCs or all TOCs). The runtime will execute the task on the first processor in the vector that becomes available.
\item {\tt chosen\_variant} is the {\tt VariantID} of a variant of the task. The chosen variant must be compatible with the chosen processor kind.
\item For each region requirement, the {\tt input} structure has a vector of valid instances of the region in the same order
as region requirements are added to the task launcher. The entry of the {\tt chosen\_instances} field should be filled either with one or more
instances from the corresponding entry of {\tt valid\_instances}, or the mapper can add newly created instances. A new instance is created by the runtime call
{\tt create\_physical\_instance}, which, in addition to other arguments, takes a target memory in which the instance should be created and a vector of logical regions---physical instances can be created that hold the data of multiple logical regions.
If new physical regions are created, the mapper calls {\tt select\_task\_sources} to choose existing instances to be the source of data to fill those new instances (see below).
\item For any regions that are strictly output regions (e.g., with {\tt WRITE\_DISCARD} privileges) where no input data will be loaded, the callback must fill in {\tt output\_targets} with a memory for the corresponding
region requirement. These memories must be visible to the selected processor(s).
\item For each future produced by the task, the callback should set a memory in {\tt future\_locations} where the future's value will be stored.
\item Normally the runtime system retains instances with valid data even if no tasks are known that will use them at the time the task finishes. This policy can lead to an accumulation of read-only instances that are never garbage collected (since read-only instances
are not invalidated by any write operations). The behavior can be overridden by specifying a set of indices of read-only region requirements
in {\tt untracked\_valid\_regions}---the corresponding instances will be marked for garbage collection after the task is complete.
\item Optionally the mapper may request that {\tt postmap\_task} be invoked for this task once mapping is complete; see Section~\ref{subsec:postmap}.
\end{itemize}
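A heavily simplified {\tt map\_task} that reuses existing valid instances when they are available is sketched below; the helpers {\tt pick\_variant} and {\tt pick\_procs} are hypothetical stand-ins for variant and processor selection logic:
\begin{lstlisting}
void CustomMapper::map_task(const MapperContext ctx,
                            const Task& task,
                            const MapTaskInput& input,
                            MapTaskOutput& output)
{
  output.chosen_variant = pick_variant(ctx, task);  // hypothetical
  output.target_procs = pick_procs(ctx, task);      // same node and kind
  for (unsigned idx = 0; idx < task.regions.size(); idx++)
  {
    if (input.valid_instances[idx].empty())
      continue;  // a real mapper would create a new instance here
    // Reuse an existing instance that already holds valid data.
    output.chosen_instances[idx].push_back(
        input.valid_instances[idx].front());
  }
}
\end{lstlisting}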
\subsection{Creating Physical Instances}
\label{subsec:mapping:instances}
New physical instances are created by the runtime call {\tt create\_physical\_instance}:
\begin{lstlisting}
bool MapperRuntime::create_physical_instance(
MapperContext ctx, Memory target_memory,
const LayoutConstraintSet &constraints,
const std::vector<LogicalRegion> &regions,
PhysicalInstance &result,
bool acquire, GCPriority priority,
bool tight_bounds, size_t *footprint,
const LayoutConstraint **unsat) const
\end{lstlisting}
Besides the standard runtime context, the arguments to this function are:
\begin{itemize}
\item The {\tt target\_memory} is the memory where the instance will be created.
\item The {\tt constraints} specify the layout constraints of the region, such as whether it should be laid out in column-major or row-major order for 2D index spaces. Layout constraints are discussed in Section~\ref{sec:layout}.
\item The {\tt regions} argument is a vector of logical regions, all of which should be included in the created instance. The ability to have more than one logical region in an instance allows for colocation of data from multiple regions.
\item The {\tt result} argument holds the newly created instance after the call returns; if successful the function returns true.
\item If {\tt tight\_bounds} is true, then the call will select the most specific (tightest) solution to the constraints, if more than one solution is possible. Otherwise, the runtime is free to pick any valid solution.
\item {\tt footprint}, if non-null, is set to the size of the allocated instance in bytes.
\item {\tt unsat}, if non-null, is set to a constraint that could not be satisfied if the call fails.
\end{itemize}
The runtime function {\tt find\_or\_create\_physical\_instance} provides higher-level functionality that preferentially finds an existing physical instance satisfying some constraints, or creates a new one if necessary. The default mapper also provides
higher-level functions that wrap {\tt create\_physical\_instance}; see {\tt default\_create\_custom\_instances} for an example.
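As a sketch (assuming the mapper stores its {\tt MapperRuntime} pointer in a member named {\tt runtime}, as the default mapper does, and that a {\tt LayoutConstraintSet} has already been built in {\tt constraints}), creating an instance for region requirement {\tt idx} of a task might look like:
\begin{lstlisting}
std::vector<LogicalRegion> regions(1, task.regions[idx].region);
PhysicalInstance result;
size_t footprint = 0;
bool created = runtime->create_physical_instance(ctx,
    target_memory, constraints, regions, result,
    true/*acquire*/, 0/*GC priority*/,
    true/*tight bounds*/, &footprint);
if (!created) {
  // Allocation failed (e.g., the memory is full); a real mapper
  // would retry in another memory or report an error.
}
\end{lstlisting}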
\subsection{Selecting Sources for New Physical Instances}
\label{subsec:selectsources}
When a new physical instance is created and its contents may be read, the mapper callback {\tt select\_task\_sources} is invoked to pick sources of data with which to fill the instance:
\begin{lstlisting}
virtual void select_task_sources(const MapperContext ctx,
const Task& task,
const SelectTaskSrcInput& input,
SelectTaskSrcOutput& output) = 0;
struct SelectTaskSrcInput {
PhysicalInstance target;
std::vector<PhysicalInstance> source_instances;
unsigned region_req_index;
};
struct SelectTaskSrcOutput {
std::deque<PhysicalInstance> chosen_ranking;
};
\end{lstlisting}
An implementation of this callback fills in {\tt chosen\_ranking} with a queue of instances selected from {\tt source\_instances}, most preferred instance first. The default mapper, for example, ranks instances in order of bandwidth between the
source instance and the target memory---see {\tt default\_policy\_select\_sources} in {\tt default\_mapper.cc}.
Despite its name, this callback is also used for other operations that create new physical instances, such as copy operations.
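A minimal implementation might rank sources by the bandwidth Realm reports between each source instance's memory and the target instance's memory. The following sketch assumes the Realm {\tt Machine} affinity query; {\tt MyMapper} is a placeholder name and tie-breaking is elided:
\begin{lstlisting}
void MyMapper::select_task_sources(const MapperContext ctx,
    const Task& task, const SelectTaskSrcInput& input,
    SelectTaskSrcOutput& output)
{
  Memory dst = input.target.get_location();
  Machine machine = Machine::get_machine();
  // Pair each source instance with the reported bandwidth
  // from its memory to the target memory.
  std::vector<std::pair<unsigned,PhysicalInstance> > ranked;
  for (unsigned i = 0; i < input.source_instances.size(); i++) {
    Memory src = input.source_instances[i].get_location();
    std::vector<Machine::MemoryMemoryAffinity> affinity;
    machine.get_mem_mem_affinity(affinity, src, dst);
    unsigned bw = affinity.empty() ? 0 : affinity[0].bandwidth;
    ranked.push_back(std::make_pair(bw, input.source_instances[i]));
  }
  // Highest bandwidth first.
  std::sort(ranked.begin(), ranked.end(),
      [](const std::pair<unsigned,PhysicalInstance>& a,
         const std::pair<unsigned,PhysicalInstance>& b)
      { return a.first > b.first; });
  for (unsigned i = 0; i < ranked.size(); i++)
    output.chosen_ranking.push_back(ranked[i].second);
}
\end{lstlisting}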
\subsection{Postmapping}
\label{subsec:postmap}
The callback {\tt postmap\_task} is invoked only if requested by {\tt map\_task} (see Section~\ref{subsec:maptask}). Its purpose
is to allow additional copies of regions updated by a task to be made once the task has finished. As input
the callback receives, for each region requirement, the mapped instances as well as the valid instances. The callback should fill
in {\tt chosen\_instances} with, for each region requirement, a vector of instances to which additional copies should be made; the mapper
may also suggest sources for these copies in {\tt source\_instances}.
\begin{lstlisting}
virtual void postmap_task(
const MapperContext ctx,
const Task& task,
const PostMapInput& input,
PostMapOutput& output) = 0;
struct PostMapInput {
std::vector<std::vector<PhysicalInstance> > mapped_regions;
std::vector<std::vector<PhysicalInstance> > valid_instances;
};
struct PostMapOutput {
std::vector<std::vector<PhysicalInstance> > chosen_instances;
std::vector<std::vector<PhysicalInstance> > source_instances;
};
\end{lstlisting}
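A sketch of how these pieces fit together, assuming the {\tt postmap\_task} flag of {\tt MapTaskOutput} is the mechanism for requesting the callback ({\tt MyMapper} is a placeholder name):
\begin{lstlisting}
// In map_task: request that postmap_task be invoked for
// this task once it has been mapped (hypothetical sketch).
output.postmap_task = true;

// A trivial postmap_task that requests no extra copies.
void MyMapper::postmap_task(const MapperContext ctx,
    const Task& task, const PostMapInput& input,
    PostMapOutput& output)
{
  // One (empty) vector per region requirement: make no
  // additional copies of any region.
  output.chosen_instances.resize(task.regions.size());
  output.source_instances.resize(task.regions.size());
}
\end{lstlisting}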
\subsection{Using Virtual Mappings}
\label{subsec:mapping:virtual}
A useful optimization is to use {\em virtual mapping} for a logical region argument that a task does not use itself but only passes
as an argument to a subtask. A virtual mapping is just a way of recording that no physical instance will be created for the region
argument, but the name and metadata for the region are still available so that it can be passed as an argument to subtasks.
The function {\tt PhysicalInstance::get\_virtual\_instance()} returns a virtual instance, which can be used as the chosen physical
instance of a region requirement. If a task variant is marked as an {\tt inner} task (meaning that it does not access any of its regions and only passes them on to subtasks), the default mapper uses virtual instances for all of the region arguments, except for fields with reduction privileges, for which the Legion runtime always requires a real physical instance to be mapped. See {\tt map\_task} in {\tt default\_mapper.cc}.
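For example, inside a {\tt map\_task} implementation a region requirement can be virtually mapped as follows (the index {\tt idx} is a hypothetical placeholder):
\begin{lstlisting}
// Sketch: virtually map region requirement idx; no physical
// instance is created, but the region can still be passed
// on to subtasks, which map it themselves.
output.chosen_instances[idx].push_back(
    PhysicalInstance::get_virtual_instance());
\end{lstlisting}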
\section{Other Mapping Features}
\label{sec:mapping:others}
Custom policies for mapping tasks and their region requirements are the most common reasons for users to write their own mappers.
In this section we cover a few other mapping features that can be included in custom mappers. This section is very incomplete; only
a handful of calls relevant to other features covered in this manual are currently included.
\subsection{Profiling Requests}
\label{subsec:mapping:profiling}
Legion has a general interface to profiling through the type {\tt ProfilingRequest}, which has one public method, {\tt add\_measurement()}.
Most Legion operations take an optional profiling request that turns on the gathering of profiling information for that specific operation.
Most profiling is done in the Realm low-level runtime, and running a Legion program with the command-line flag {\tt -lg:prof} will turn on
profiling of many runtime operations; see \url{https://legion.stanford.edu/profiling/index.html#legion-prof} for an introduction to using
the Legion profiler. Most users only use the Legion profiler, but {\tt ProfilingRequest}s are available for users who want more
selective control over profiling.
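As a sketch, a mapper might attach a measurement request while mapping a task; this assumes the {\tt task\_prof\_requests} field of {\tt MapTaskOutput} and the Realm {\tt OperationTimeline} measurement type, with results delivered to the mapper's {\tt report\_profiling} callback:
\begin{lstlisting}
// In map_task: ask Realm to record start/end timestamps
// for this task (hypothetical sketch).
output.task_prof_requests.add_measurement<
    Realm::ProfilingMeasurements::OperationTimeline>();
\end{lstlisting}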
\subsection{Mapping Acquires and Releases}
\label{subsec:mapping:acquires}
The callback {\tt map\_acquire} is called for every {\tt acquire} operation. Other than the possibility of adding a profiling request, {\tt map\_acquire} has no options to set.
For {\tt release} operations there is also a policy decision to make, via the callback {\tt select\_release\_sources}:
\begin{lstlisting}
virtual void select_release_sources(
const MapperContext ctx,
const Release& release,
const SelectReleaseSrcInput& input,
SelectReleaseSrcOutput& output) = 0;
struct SelectReleaseSrcInput {
PhysicalInstance target;
std::vector<PhysicalInstance> source_instances;
};
struct SelectReleaseSrcOutput {
std::deque<PhysicalInstance> chosen_ranking;
};
\end{lstlisting}
Recall that a release operation restores the copy restriction on a region with simultaneous coherence, flushing any updates
to the region to the original {\tt target} instance. This callback allows the mapper to produce a ranking {\tt chosen\_ranking} of
the valid instances of the region ({\tt source\_instances}) to serve as the source of the copy to the {\tt target} at the point of the release.
%\subsection{Mapping Must Epoch Launches}
%\label{subsec:mapping:mustepoch}
\subsection{Controlling Stealing}
\label{subsec:mapping:stealing}
There are two callbacks for controlling how tasks are stolen. A mapper may try to steal tasks from another mapper using {\tt select\_steal\_targets}, and a mapper can control which tasks it allows to be stolen using {\tt permit\_steal\_request}.
Mappers that want to steal tasks should implement {\tt select\_steal\_targets}. This callback sets
{\tt targets} to a set of processors from which tasks can be stolen. A {\tt blacklist} is supplied as input, which records processors
for which a previous steal request failed due to insufficient work. The blacklist is managed automatically by the runtime system, and
processors are removed from the blacklist when they acquire additional work.
\begin{lstlisting}
struct SelectStealingInput {
std::set<Processor> blacklist;
};
struct SelectStealingOutput {
std::set<Processor> targets;
};
virtual void select_steal_targets(
const MapperContext ctx,
const SelectStealingInput& input,
SelectStealingOutput& output) = 0;
\end{lstlisting}
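A simple implementation might target every other processor of the same kind that is not blacklisted. This sketch assumes the Realm {\tt Machine::ProcessorQuery} interface and a hypothetical {\tt local\_proc} member recording the mapper's own processor ({\tt MyMapper} is a placeholder name):
\begin{lstlisting}
void MyMapper::select_steal_targets(const MapperContext ctx,
    const SelectStealingInput& input,
    SelectStealingOutput& output)
{
  // Try to steal from every other CPU processor that is
  // not currently blacklisted.
  Machine::ProcessorQuery procs(Machine::get_machine());
  procs.only_kind(Processor::LOC_PROC);
  for (Machine::ProcessorQuery::iterator it = procs.begin();
       it != procs.end(); it++)
    if ((*it) != local_proc && input.blacklist.count(*it) == 0)
      output.targets.insert(*it);
}
\end{lstlisting}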
When a mapper receives a steal request the {\tt permit\_steal\_request} callback is invoked, notifying the mapper of the requesting
processor (the {\tt thief}) and the tasks the mapper has available to steal, from which the callback selects a set of {\tt stolen\_tasks}.
\begin{lstlisting}
struct StealRequestInput {
Processor thief_proc;
std::vector<const Task*> stealable_tasks;
};
struct StealRequestOutput {
std::set<const Task*> stolen_tasks;
};
virtual void permit_steal_request(const MapperContext ctx,
const StealRequestInput& input,
StealRequestOutput& output) = 0;
\end{lstlisting}
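A minimal policy might surrender a fixed fraction of the stealable tasks, keeping the rest for local execution ({\tt MyMapper} is a placeholder name):
\begin{lstlisting}
void MyMapper::permit_steal_request(const MapperContext ctx,
    const StealRequestInput& input,
    StealRequestOutput& output)
{
  // Let the thief take up to half of the tasks that are
  // currently available to steal.
  unsigned limit = input.stealable_tasks.size() / 2;
  for (unsigned i = 0; i < limit; i++)
    output.stolen_tasks.insert(input.stealable_tasks[i]);
}
\end{lstlisting}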
%\section{Managing Execution}
%\label{sec:mapping:execution}
%\subsection{Context Management}
%\label{subsec:mapping:context}
%\subsection{Mapper Communication}
%\label{subsec:mapping:communication}
%\section{Performance: Tracing}
%
%
%type TraceID
%\begin{lstlisting}
%for(...) {
% runtime->begin_trace(ctx, TRACE_ID);
% ...
% runtime->end_trace(ctx, TRACE_ID);
%}
%\end{lstlisting}
\section{Mappers Included with Legion}
Several useful mappers are included in the Legion repository:
\begin{itemize}
\item The {\em default mapper} has already been discussed. The default mapper is a full implementation of the Legion mapping
API with reasonable heuristics for every mapping callback. The default mapper has grown over time---as users have found cases where
the default mapper did not perform well, improvements have been made. As a result, the default mapper is a non-trivial
mapper, even though it still does not come close to achieving optimal mappings for most complex applications.
\item The {\em null mapper} is a base class that fails an assertion for every mapper API call. The null mapper is a useful starting
point when writing a mapper from scratch, as the mapper will show exactly which API calls need to be implemented to support the application.
\item The {\em replay mapper} can be used to replay mapping decisions recorded in a replay file by Legion Spy. The replay mapper
is used mostly for ensuring that a failed computation can be deterministically replayed to help diagnose the source of
bugs in the Legion runtime itself.
\item The {\em logging wrapper} adds logging of mapping operations (which calls were made and with what arguments) to an existing mapper.
To use the logging wrapper, replace any use of {\tt new MyMapper(\ldots)} in the application
with {\tt new LoggingWrapper(new MyMapper(\ldots))} and run with the command line flag
{\tt -level mapper=2}.
\item The {\em forwarding mapper} is a base class used to build mapper wrappers; the forwarding mapper simply forwards all mapper
calls to another mapper. The logging wrapper is written using the forwarding mapper.
\end{itemize}