\chapter{Mapping}
\label{chap:mapping}
The Legion mapper interface is a key part
of the Legion programming system. Through the mapping interface applications
can control most decisions that impact application performance.
The philosophy is that these choices are better left to applications
rather than using hard-wired heuristics in Legion that attempt to ``do the right thing'' in
every situation. The few performance heuristics
that are included in Legion are associated with low levels of the system
where there is no good way to expose those choices to the application.
For everything else applications can set the policies.
This design resulted from our own past experience with systems
where built-in performance heuristics did not behave as we desired and there was no recourse
to override those decisions. While Legion does allow
experts to squeeze every last bit of performance from a system, it is important
to realize that doing so potentially requires understanding and setting a wide
variety of parameters exposed in the mapping interface.
This level of control can be overwhelming at first to users who are not used to
considering all the possible dimensions that influence performance in complex,
distributed and heterogeneous systems.
To help users write initial versions of their applications without needing
to concern themselves with tuning the performance knobs exposed by the mapper
interface, Legion provides a {\em default mapper}. The default mapper
implements the Legion mapper API (like any other mapper) and provides a number
of heuristics that can provide reasonably performant, or at least correct, initial
settings. A good way to think about the default mapper is that it is the version
of Legion with built-in heuristics that allows casual users to write Legion
applications and allows experts to start quickly on a new application.
It is, however, unreasonable to expect the default mapper to provide excellent performance, and in
particular assuming that the performance of an application using the default
mapper is even an approximation of the performance that could be
achieved with a custom mapper is a mistake.
We will use several examples from the default mapper
to illustrate how mappers are constructed. We will also describe where
possible the heuristics that the default mapper employs to achieve
reasonable performance. Because the default mapper uses generic heuristics
with no specific knowledge of the application, it is almost certain to make
poor decisions at least some of the time.
Performance benchmarking using only the default mapper is strongly
discouraged, while using custom application-specific mappers is
encouraged.
It is likely that the moment when you are dissatisfied with the
heuristics in the default mapper will come sooner rather than later.
At that point the information in this chapter will be necessary for you
to write your own custom mapper. In practice, our experience has been that in
many cases all that is necessary is to replace a small number of policies in the
default mapper that are a poor fit for the application.
\section{Mapper Organization}
\label{sec:mapping:org}
The Legion mapper interface is an abstract C++ class that defines a set of
pure virtual functions that the Legion runtime invokes as {\em callbacks}
for making performance-related decisions. A Legion mapper is
a class that inherits from the base abstract class and provides
implementations of the associated pure virtual methods.
A callback is just a function pointer---when the runtime system calls a mapper
function, it is said to have ``invoked the callback''. Callbacks are a commonly used
mechanism in software systems for parameterizing some specific functionality; in our case
mappers parameterize the performance heuristics of the Legion runtime system.
There are a few general things to keep in mind about mappers and callbacks:
\begin{itemize}
\item The runtime may invoke callbacks in an unpredictable order. While multiple callbacks associated with a
single instance of a Legion object, such as a task, will happen in a specific order for that task,
other callbacks for other operations may be interleaved.
\item Depending on the synchronization model selected (see Section~\ref{subsec:mapping:sync}), mappers
may have a degree of concurrency between mapper callbacks.
\item Since mappers are C++ objects, they can have arbitrary internal state. For example, it may be useful
to maintain performance or load-balancing statistics that inform mapping decisions.
However, state updates done by a mapper must take into account the unpredictable order in
which callbacks are invoked, as well as any issues of concurrent access to mapper data structures.
\end{itemize}
\subsection{Mapper Registration}
\label{subsec:mapping:registration}
After the Legion runtime is created, but before the application
begins, mapper objects can be registered
with the runtime. Figure~\ref{fig:mapper_registration} gives a small
example registering a custom mapper.
\begin{figure}
{ \small
\lstinputlisting[linerange={14,78}]{Examples/Mapping/registration/registration.cc}
}
\caption{\legionbook{Mapping/registration/registration.cc}}
\label{fig:mapper_registration}
\end{figure}
To register {\tt CustomMapper} objects, the
application adds the mapper callback function by invoking the
{\tt Runtime::add\_registration\_callback} method, which takes as an
argument a function pointer to be invoked. The function pointer must
have a specific type, taking as arguments a {\tt Machine} object,
a {\tt Runtime} pointer, and a reference to an STL set of {\tt Processor}
objects. The call can be invoked multiple times to record multiple
callback functions (e.g., to register multiple custom mappers, perhaps for different libraries). All
callback functions must be added prior to the invocation of the
{\tt Runtime::start} method. We recommend that applications include the registration
method as a static method on the mapper class (as in Figure~\ref{fig:mapper_registration})
so that it is closely coupled to the custom mapper itself.
Before invoking any of the registration callback functions, the runtime
creates an instance of the default mapper for each processor of
the system. The runtime then invokes the callback functions in the order
they were added. Each callback function is invoked once on each
instance of the Legion runtime. For multi-process jobs, there will be
one copy of the Legion runtime per process and therefore one invocation
of each callback per process. The set of processors passed into each
registration callback function will be the set of application processors
that are local to the process\footnote{Mappers cannot be associated with
utility processors, and therefore utility processors are not included
in the set.}, thereby providing a registration callback
function with the necessary context to know which processors it
will create new custom mappers for.
If no callback functions are registered then the only mappers
that will be available are instances of the default mapper associated
with each application processor.
Upon invocation, the registration callbacks should create instances
of custom mappers and associate them with application processors.
This step can be done through one of two runtime calls. The callback
can replace the default mappers (always registered with {\tt MapperID}
0) by calling {\tt Runtime::replace\_default\_mapper}, which is the
only way to replace the default mappers. Alternatively, the registration
callback can use {\tt Runtime::add\_mapper} to register a mapper with a
new {\tt MapperID}. Both the {\tt Runtime::replace\_default\_mapper} and
the {\tt Runtime::add\_mapper} methods support an optional processor
argument, which tells the runtime to associate the mapper with a specific
processor. If no processor is specified, the mapper is associated
with all processors on the local node. Whether one mapper object should
handle the mapping decisions for a single application processor or for
all application processors on a node is a choice left to the mapper;
Legion supports both use cases. From a performance
perspective, the best choice is likely to depend on the mapper synchronization
model (see Section~\ref{subsec:mapping:sync}).
Note that the mapper calls require a pointer to the {\tt MapperRuntime}, such as on
lines 27 and 49 of Figure~\ref{fig:mapper_registration}.
The mapper runtime provides the interface for mapper calls to call back
into the runtime to acquire access to different physical resources. We
will see examples of the use of the mapper runtime throughout
this chapter.
\subsection{Synchronization Model}
\label{subsec:mapping:sync}
Within an instance of the Legion runtime there are often several threads
performing the analysis necessary to advance the execution of an
application. If some threads are performing work for operations
owned by the same mapper, it is possible that they will attempt to
invoke mapper calls for the same mapper object concurrently. For both
productivity and correctness reasons, we do not want users to be
responsible for making their mappers thread-safe. Therefore we allow
mappers to specify a {\em synchronization model} that the runtime
follows when concurrent mapper calls are made.
Each mapper object can specify its synchronization model via the
{\tt get\_mapper\_sync\_model} mapper call. The runtime invokes this
method exactly once per mapper object immediately after the mapper is
registered with the runtime. Once the synchronization model has been set
for a mapper object it cannot be changed. Currently three
synchronization models are supported:
\begin{itemize}
\item {\em Serialized Non-Reentrant}. Calls to the
mapper object are serialized and execute atomically. If the mapper
calls out to the runtime and the mapper call is preempted,
no other mapper calls can be invoked by the runtime.
This synchronization model conforms to the original version of
the Legion mapper interface.
\item {\em Serialized Reentrant}. At most one mapper call
executes at a time. However, if a mapper call invokes a runtime
method that preempts the mapper call, the runtime may
execute another mapper call or resume a previously blocked
mapper call. It is up to the user to handle any changes in internal mapper
state that might occur while a mapper call is preempted (e.g., the
invalidation of STL iterators to internal mapper data structures).
\item {\em Concurrent}. Mapper calls to the same mapper object can
proceed concurrently. Users can invoke the {\tt lock\_mapper} and
{\tt unlock\_mapper} calls to perform their own synchronization
of the mapper. This synchronization model is particularly useful for
mappers that simply return static mapping decisions
without changing internal mapper state.
\end{itemize}
The synchronization model offers a tradeoff between mapper complexity
and performance. The default mapper uses the serialized reentrant model,
which provides a good balance between programmability and performance.
\subsection{Machine Interface}
\label{subsec:mapping:machine}
All mappers are given a {\tt Machine} object to enable
introspection of the hardware on which the application is executing. The
{\tt Machine} object is defined by Realm, Legion's low-level portability layer (see {\tt realm/machine.h}).
There are two interfaces for querying the machine
object. The old interface contains methods such as {\tt get\_all\_processors}
and {\tt get\_all\_memories}. These methods populate STL data structures
with the appropriate names of processors and memories. We strongly
discourage using these methods as they are not scalable on large
architectures with tens to hundreds of thousands of processors or memories.
The recommended, and more efficient and scalable, interface is based
on {\em queries}, which come in two types: {\tt ProcessorQuery} and
{\tt MemoryQuery}. Each query is initially given a reference to the machine
object. After initialization the query lazily materializes the (entire) set of
either processors or memories of the machine.
The mapper applies {\em filters} to the query to reduce the
set to processors or memories of interest. These filters can include specializing
the query to the local node using {\tt local\_address\_space}, to one kind of processors with the {\tt only\_kind} method, or by
requesting that the processor or memory have a specific affinity to another
processor or memory with the {\tt has\_affinity\_to} method. Affinity can either be
specified as a maximum bandwidth or a minimum latency. Figure~\ref{fig:mapper_machine}
shows how to create a custom mapper that uses queries to find the local
processors of the same kind as the local mapper processor, as well as the
memories with affinity to it. In some cases these queries are still expensive, so we
encourage the creation of mappers that memoize the results of their most
commonly invoked queries to avoid duplicated work.
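As a sketch of such memoization (assuming, as in the default mapper, that the mapper stores the {\tt Machine} object and its local processor in members named {\tt machine} and {\tt local\_proc}; the vector member {\tt local\_procs} is illustrative):
\begin{lstlisting}
// Sketch: compute the local processors of our own kind once and
// cache them for reuse across later mapper calls.
void CustomMapper::fill_local_procs(void)
{
  if (!local_procs.empty()) return;  // already memoized
  Machine::ProcessorQuery query(machine);
  query.local_address_space();          // restrict to this node
  query.only_kind(local_proc.kind());   // same kind as our processor
  for (Machine::ProcessorQuery::iterator it = query.begin();
        it != query.end(); it++)
    local_procs.push_back(*it);
}
\end{lstlisting}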
\begin{figure}
{\small
\lstinputlisting[linerange={22,80}]{Examples/Mapping/machine/machine.cc}}
\caption{\legionbook{Mapping/machine/machine.cc}}
\label{fig:mapper_machine}
\end{figure}
\section{Mapping Tasks}
\label{sec:mapping:tasks}
There are a number of different kinds of operations with mapping callbacks, but the core of the mapping interface, and the parts
of mappers that users will most commonly customize, are the callbacks for mapping tasks.
When a task is launched it proceeds through a pipeline of mapping callbacks. The most important pipeline stages are:
\begin{enumerate}
\item {\tt select\_task\_options }
\item {\tt select\_sharding\_functor} (for control-replicated tasks)
\item {\tt slice\_task } (for index launches)
\item {\tt select\_tasks\_to\_map} (tasks remain in this stage until selected for mapping)
\item {\tt map\_task}
\end{enumerate}
Stages 2 and 3 do not apply to every task, and tasks may repeat stage 4 any number of times depending on the implementation of {\tt select\_tasks\_to\_map}.
After discussing these five components of the task mapping pipeline, we discuss a few other topics relevant to task mapping: allocating new physical instances, postmapping of tasks, virtual mappings, and profiling requests.
\subsection{Controlling Task Mapping}
{\tt select\_task\_options} is the first callback for mapping tasks. It is invoked for every task $t$ exactly once in the Legion process where $t$ is launched.
The signature of the function is:
\begin{lstlisting}
virtual void select_task_options(const MapperContext ctx,
const Task& task,
TaskOptions& output) = 0;
\end{lstlisting}
The purpose of the callback is to set fields of the {\tt output} object. All of the fields have defaults, so none are required to be set by the callback implementation.
This callback comes first because the fields of {\tt TaskOptions} control the rest of the mapping process for the task.
\begin{itemize}
\item For a single task $t$ (not an index launch), {\tt output.initial\_proc} is the processor that will execute $t$; the default is the current processor.
The processor does not need to be local---the mapper can select any processor in the machine model for which a variant of $t$ exists. As we will see, $t$'s target processor can be changed by subsequent stages. The reason for choosing a target processor
here is that by default $t$ is sent to the Legion process that manages the target processor to be mapped.
\item If {\tt output.inline\_task} is true (the default is false) the task will be inlined into the parent task and use the parent task's regions. Any needed regions that are unmapped will be remapped. Inline tasks do not go through the rest of the task pipeline, except for the selection of
a task variant.
\item If {\tt output.stealable} is true then the task can be stolen for load balancing; the default is false. A stealable task $t$ can be stolen by another mapper until $t$ is chosen by {\tt select\_tasks\_to\_map}.
\item As mentioned above, by default the {\tt map\_task} stage of the mapping pipeline is done by the Legion process that manages the processor where the task will execute. If {\tt output.map\_locally} is true (the default is false) then {\tt map\_task} will be run by the current mapper.
Just to emphasize: {\tt map\_locally} controls where a mapping callback for the task is run, not where the task executes. This option is mostly useful for leaf tasks that will be sent to remote processors. In this case, making the mapping decisions locally saves transmitting
task metadata to the remote Legion runtime.
\item If {\tt valid\_instances} is set to false, then the task will not receive a list of the currently valid instances of regions in subsequent calls to {\tt request\_valid\_instances}, which saves some runtime overhead. This setting is useful if the task will never use
a currently valid region instance, such as when all the regions of an inner task will be virtually mapped.
\item Setting {\tt replicate\_default} to true turns on replication of single tasks in a control-replication context, which means that the task will be executed separately in every Legion process participating in the replication of the parent task. The default setting
is false; in this case only one instance of a single task with a control-replicated parent is executed on one processor and then the results are broadcast to the other Legion processes. Replicating single tasks avoids the broadcast communication. There are some restrictions on replicated single tasks to ensure the
replicated versions all have identical behavior: the tasks cannot have reduction-only privileges on any field, and any fields with write privileges must use a separate instance for each replicated task.
\item A task can set the priority of the parent task by modifying {\tt output.parent\_priority}, if that is permitted by the mapper. The default is the parent's current priority. When tasks are ready to execute, tasks with higher priority are moved to the front of the ready queue.
\end{itemize}
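Putting these fields together, a minimal implementation that keeps the defaults except for routing a hypothetical task named {\tt stencil} to a previously chosen processor might look like the following sketch (the member {\tt preferred\_proc} is illustrative):
\begin{lstlisting}
void CustomMapper::select_task_options(const MapperContext ctx,
                                       const Task& task,
                                       TaskOptions& output)
{
  // All fields of output arrive pre-set to their defaults, so we
  // only override the ones we care about.
  if (strcmp(task.get_task_name(), "stencil") == 0)
    output.initial_proc = preferred_proc;  // chosen earlier by a query
}
\end{lstlisting}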
\subsection{Sharding}
As the name suggests, {\tt select\_sharding\_functor} is used to select the functor for {\em sharding} index task launches in control-replicated contexts. Sharding divides the index space of the task launch into subspaces and associates each shard with a mapper (a processor)
where those tasks will be mapped. This callback is invoked once per replicated task index launch in each replicated context:
\begin{lstlisting}
virtual void select_sharding_functor(
const MapperContext ctx,
const Task& task,
const SelectShardingFunctorInput& input,
SelectShardingFunctorOutput& output) = 0;
struct SelectShardingFunctorInput {
std::vector<Processor> shard_mapping;
};
struct SelectShardingFunctorOutput {
ShardingID chosen_functor;
bool slice_recurse;
};
\end{lstlisting}
The {\tt shard\_mapping} of the input structure provides a vector of the processors where the replicated task is running. The callback must fill in the {\tt chosen\_functor} field of the output structure with the id of a sharding function registered with the mapper at
startup. The callback can set {\tt slice\_recurse} to indicate whether or not the index subspaces chosen by the sharding functor should be recursively sharded on the destination processor. The same sharding functor must be selected in every control-replicated context, which
will be checked by the runtime when in debug mode.
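A sketch of this callback that always selects a single application-registered functor (with a hypothetical id {\tt MY\_SHARDING\_ID}) could be:
\begin{lstlisting}
void CustomMapper::select_sharding_functor(const MapperContext ctx,
    const Task& task,
    const SelectShardingFunctorInput& input,
    SelectShardingFunctorOutput& output)
{
  // Every control-replicated context must make the same choice,
  // so the decision here must not depend on local mapper state.
  output.chosen_functor = MY_SHARDING_ID;  // registered at startup
  output.slice_recurse = true;  // allow further slicing at the shards
}
\end{lstlisting}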
\subsection{Slicing}
{\tt slice\_task} is called for every index launch. To make index launches efficient, the index space of tasks is first sliced into smaller sets of tasks and each set is sent to a destination mapper as a single object rather than sending
multiple individual tasks. The signature of {\tt slice\_task} is:
\begin{lstlisting}
virtual void slice_task(const MapperContext ctx,
const Task& task,
const SliceTaskInput& input,
SliceTaskOutput& output) = 0;
\end{lstlisting}
The {\tt SliceTaskInput} includes the index space of the task launch (field {\tt domain\_is}). The index space of the shard is also included for control-replicated tasks.
\begin{lstlisting}
struct SliceTaskInput {
IndexSpace domain_is;
Domain domain;
IndexSpace sharding_is;
};
\end{lstlisting}
The implementation of {\tt slice\_task} should set the fields of {\tt SliceTaskOutput}:
\begin{lstlisting}
struct SliceTaskOutput {
std::vector<TaskSlice> slices;
  bool verify_correctness; // = false
};
struct TaskSlice {
public:
TaskSlice(void) : domain_is(IndexSpace::NO_SPACE),
domain(Domain::NO_DOMAIN), proc(Processor::NO_PROC),
recurse(false), stealable(false) { }
TaskSlice(const Domain &d, Processor p, bool r, bool s)
: domain_is(IndexSpace::NO_SPACE), domain(d),
proc(p), recurse(r), stealable(s) { }
TaskSlice(IndexSpace is, Processor p, bool r, bool s)
: domain_is(is), domain(Domain::NO_DOMAIN),
proc(p), recurse(r), stealable(s) { }
public:
IndexSpace domain_is;
Domain domain;
Processor proc;
bool recurse;
bool stealable;
};
\end{lstlisting}
The {\tt slices} field is a vector of {\tt TaskSlice}, each of which names a subspace of the index space in {\tt domain\_is} and a destination processor {\tt proc} for the slice of tasks. The tasks of the slice can be marked as stealable, and setting the {\tt recurse} field
means that {\tt slice\_task} will be called again by the mapper associated with the destination processor to allow the slice to be further subdivided before processing individual tasks.
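As an illustration, a {\tt slice\_task} that creates one single-point slice per task, round-robin over a cached vector of processors (the member {\tt local\_procs} is illustrative), might be sketched as:
\begin{lstlisting}
void CustomMapper::slice_task(const MapperContext ctx,
                              const Task& task,
                              const SliceTaskInput& input,
                              SliceTaskOutput& output)
{
  // One point per slice is simple but inefficient; a real mapper
  // would create fewer, larger slices to amortize overheads.
  unsigned idx = 0;
  for (Domain::DomainPointIterator itr(input.domain); itr; itr++, idx++)
  {
    Processor target = local_procs[idx % local_procs.size()];
    output.slices.push_back(TaskSlice(Domain(itr.p, itr.p), target,
                              false/*recurse*/, false/*stealable*/));
  }
}
\end{lstlisting}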
\subsection{Selecting Tasks to Map}
{\tt select\_tasks\_to\_map} gives the mapper control over which tasks should be mapped and which should be sent to other processors---the initial processor assignment set in {\tt select\_task\_options} can be changed if desired. At this point
in the task mapping pipeline all index tasks have been expanded into single tasks, and {\tt select\_tasks\_to\_map} is called by the mapper associated with the destination process, unless {\tt map\_locally} was chosen in {\tt select\_task\_options}.
The signature of the callback is:
\begin{lstlisting}
virtual void select_tasks_to_map(const MapperContext ctx,
const SelectMappingInput& input,
SelectMappingOutput& output) = 0;
struct SelectMappingInput {
std::list<const Task*> ready_tasks;
};
struct SelectMappingOutput {
std::set<const Task*> map_tasks;
std::map<const Task*,Processor> relocate_tasks;
MapperEvent deferral_event;
};
\end{lstlisting}
For each task in {\tt ready\_tasks} of the {\tt SelectMappingInput} structure, the callback implementation can do one of three things:
\begin{itemize}
\item Add the task to {\tt map\_tasks}, in which case the task will proceed with mapping on the assigned local processor.
\item Add the task to {\tt relocate\_tasks} along with a new destination processor to which the task will be transferred.
\item Nothing, in which case the task will remain in the {\tt ready\_tasks} list for the next call to {\tt select\_tasks\_to\_map}.
\end{itemize}
If the call does not select at least one task to map or transfer, then it must provide a {\tt MapperEvent} in the field {\tt deferral\_event}---another call to {\tt select\_tasks\_to\_map} will not be made until that event is triggered.
Of course, it is up to the mapper to guarantee that the event is eventually triggered.
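A simple implementation that maps every ready task on its currently assigned processor, and therefore never needs a deferral event, might be:
\begin{lstlisting}
void CustomMapper::select_tasks_to_map(const MapperContext ctx,
                                       const SelectMappingInput& input,
                                       SelectMappingOutput& output)
{
  // A load-balancing mapper could instead move some of these tasks
  // to other processors via output.relocate_tasks.
  for (std::list<const Task*>::const_iterator it =
        input.ready_tasks.begin(); it != input.ready_tasks.end(); it++)
    output.map_tasks.insert(*it);
  // At least one task was selected, so no deferral_event is required.
}
\end{lstlisting}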
\subsection{Map\_Task}
\label{subsec:maptask}
{\tt map\_task} is normally the final stage of the task mapping pipeline. This callback selects a processor or processors for the task, maps the task's region arguments, and selects the task variant to use, after which the task will run on one of the selected processors.
\begin{lstlisting}
virtual void map_task(
const MapperContext ctx,
const Task& task,
const MapTaskInput& input,
MapTaskOutput& output) = 0;
struct MapTaskInput {
std::vector<std::vector<PhysicalInstance> > valid_instances;
std::vector<unsigned> premapped_regions;
};
struct MapTaskOutput {
std::vector<std::vector<PhysicalInstance> > chosen_instances;
std::vector<std::vector<PhysicalInstance> > source_instances;
std::vector<Memory> output_targets;
std::vector<LayoutConstraintSet> output_constraints;
std::set<unsigned> untracked_valid_regions;
std::vector<Memory> future_locations;
std::vector<Processor> target_procs;
VariantID chosen_variant; // = 0
TaskPriority task_priority; // = 0
TaskPriority profiling_priority;
ProfilingRequest task_prof_requests;
ProfilingRequest copy_prof_requests;
bool postmap_task; // = false
};
\end{lstlisting}
The input structure contains a vector of vectors of valid instances: each element of the outer
vector is a vector of instances that hold valid data for the corresponding region requirement.
The {\tt premapped\_regions} field is a vector of indices of region
requirements that are already satisfied and do not need to be mapped by the callback.
The callback must fill in the following fields of the {\tt output} structure:
\begin{itemize}
\item {\tt target\_procs} is a vector of processors. All processors must be on the same node and of the same kind (e.g., all LOCs or all TOCs). The runtime will execute the task on the first processor in the vector that becomes available.
\item {\tt chosen\_variant} is the {\tt VariantID} of a variant of the task. The chosen variant must be compatible with the chosen processor kind.
\item For each region requirement, the {\tt input} structure has a vector of valid instances of the region in the same order
as region requirements are added to the task launcher. The entry of the {\tt chosen\_instances} field should be filled either with one or more
instances from the corresponding entry of {\tt valid\_instances}, or the mapper can add newly created instances. A new instance is created by the runtime call
{\tt create\_physical\_instance}, which, in addition to other arguments, takes a target memory in which the instance should be created and a vector of logical regions---physical instances can be created that hold the data of multiple logical regions.
If new physical regions are created, the mapper calls {\tt select\_task\_sources} to choose existing instances to be the source of data to fill those new instances (see below).
\item For any regions that are strictly output regions (e.g., with {\tt WRITE\_DISCARD} privileges) where no input data will be loaded, the callback must fill in {\tt output\_targets} with a memory for the corresponding
region requirement. These memories must be visible to the selected processor(s).
\item For each future produced by the task, the callback should set a memory in {\tt future\_locations} where the future's value will be stored.
\item Normally the runtime system retains instances with valid data even if no tasks are known that will use them at the time the task finishes. This policy can lead to an accumulation of read-only instances that are never garbage collected (since read-only instances
are not invalidated by any write operations). The behavior can be overridden by specifying a set of indices of read-only region requirements
in {\tt untracked\_valid\_regions}---the corresponding instances will be marked for garbage collection after the task is complete.
\item Optionally the mapper may request that {\tt postmap\_task} be invoked for this task once mapping is complete; see Section~\ref{subsec:postmap}.
\end{itemize}
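A heavily simplified {\tt map\_task} that reuses existing valid instances when they are available is sketched below; the helpers {\tt pick\_variant} and {\tt pick\_procs} are hypothetical stand-ins for variant and processor selection logic:
\begin{lstlisting}
void CustomMapper::map_task(const MapperContext ctx,
                            const Task& task,
                            const MapTaskInput& input,
                            MapTaskOutput& output)
{
  output.chosen_variant = pick_variant(ctx, task);  // hypothetical
  output.target_procs = pick_procs(ctx, task);      // same node and kind
  for (unsigned idx = 0; idx < task.regions.size(); idx++)
  {
    if (input.valid_instances[idx].empty())
      continue;  // a real mapper would create a new instance here
    // Reuse an existing instance that already holds valid data.
    output.chosen_instances[idx].push_back(
        input.valid_instances[idx].front());
  }
}
\end{lstlisting}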
\subsection{Creating Physical Instances}
\label{subsec:mapping:instances}
New physical instances are created by the runtime call {\tt create\_physical\_instance}:
\begin{lstlisting}
bool MapperRuntime::create_physical_instance(
MapperContext ctx, Memory target_memory,
const LayoutConstraintSet &constraints,
const std::vector<LogicalRegion> &regions,
PhysicalInstance &result,
bool acquire, GCPriority priority,
bool tight_bounds, size_t *footprint,
const LayoutConstraint **unsat) const
\end{lstlisting}
Besides the standard runtime context, the arguments to this function are:
\begin{itemize}
\item The {\tt target\_memory} is the memory where the instance will be created.
\item The {\tt constraints} specify the layout constraints of the region, such as whether it should be laid out in column-major or row-major order for 2D index spaces. Layout constraints are discussed in Section~\ref{sec:layout}.
\item The {\tt regions} argument is a vector of logical regions, all of which should be included in the created instance. The ability to have more than one logical region in an instance allows for colocation of data from multiple regions.
\item The {\tt result} argument holds the newly created instance after the call returns; if successful the function returns true.
\item If {\tt tight\_bounds} is true, then the call will select the most specific (tightest) solution to the constraints, if more than one solution is possible. Otherwise, the runtime is free to pick any valid solution.
\item {\tt footprint}, if non-null, is set to the size of the allocated instance in bytes.
\item {\tt unsat}, if non-null, is set to a constraint that could not be satisfied if the call fails.
\end{itemize}
The runtime function {\tt find\_or\_create\_physical\_instance} provides higher-level functionality that preferentially finds an existing physical instance satisfying some constraints, or creates a new one if necessary. The default mapper also provides
higher-level functions that wrap {\tt create\_physical\_instance}; see {\tt default\_create\_custom\_instances} for an example.
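As a sketch (assuming the mapper stores its {\tt MapperRuntime} pointer in a member named {\tt runtime}, as the default mapper does, and that a {\tt LayoutConstraintSet} has already been built in {\tt constraints}), creating an instance for region requirement {\tt idx} of a task might look like:
\begin{lstlisting}
std::vector<LogicalRegion> regions(1, task.regions[idx].region);
PhysicalInstance result;
size_t footprint = 0;
bool created = runtime->create_physical_instance(ctx,
    target_memory, constraints, regions, result,
    true/*acquire*/, 0/*GC priority*/,
    true/*tight bounds*/, &footprint);
if (!created) {
  // Allocation failed (e.g., the memory is full); a real mapper
  // would retry in another memory or report an error.
}
\end{lstlisting}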
\subsection{Selecting Sources for New Physical Instances}
\label{subsec:selectsources}
When a new physical instance is created and its contents may be read, the mapper callback {\tt select\_task\_sources} is invoked to pick sources of data with which to fill the instance:
\begin{lstlisting}
virtual void select_task_sources(const MapperContext ctx,
const Task& task,
const SelectTaskSrcInput& input,
SelectTaskSrcOutput& output) = 0;
struct SelectTaskSrcInput {
PhysicalInstance target;
std::vector<PhysicalInstance> source_instances;
unsigned region_req_index;
};
struct SelectTaskSrcOutput {
std::deque<PhysicalInstance> chosen_ranking;
};
\end{lstlisting}
An implementation of this callback fills in {\tt chosen\_ranking} with a queue of instances selected from {\tt source\_instances}, most preferred instance first. The default mapper, for example, ranks instances in order of bandwidth between the
source instance and the target memory---see {\tt default\_policy\_select\_sources} in {\tt default\_mapper.cc}.
Despite its name, this callback is also used for other operations that create new physical instances, such as copy operations.
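A minimal implementation might rank sources by the bandwidth Realm reports between each source instance's memory and the target instance's memory. The following sketch assumes the Realm {\tt Machine} affinity query; {\tt MyMapper} is a placeholder name and tie-breaking is elided:
\begin{lstlisting}
void MyMapper::select_task_sources(const MapperContext ctx,
    const Task& task, const SelectTaskSrcInput& input,
    SelectTaskSrcOutput& output)
{
  Memory dst = input.target.get_location();
  Machine machine = Machine::get_machine();
  // Pair each source instance with the reported bandwidth
  // from its memory to the target memory.
  std::vector<std::pair<unsigned,PhysicalInstance> > ranked;
  for (unsigned i = 0; i < input.source_instances.size(); i++) {
    Memory src = input.source_instances[i].get_location();
    std::vector<Machine::MemoryMemoryAffinity> affinity;
    machine.get_mem_mem_affinity(affinity, src, dst);
    unsigned bw = affinity.empty() ? 0 : affinity[0].bandwidth;
    ranked.push_back(std::make_pair(bw, input.source_instances[i]));
  }
  // Highest bandwidth first.
  std::sort(ranked.begin(), ranked.end(),
      [](const std::pair<unsigned,PhysicalInstance>& a,
         const std::pair<unsigned,PhysicalInstance>& b)
      { return a.first > b.first; });
  for (unsigned i = 0; i < ranked.size(); i++)
    output.chosen_ranking.push_back(ranked[i].second);
}
\end{lstlisting}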
\subsection{Postmapping}
\label{subsec:postmap}
The callback {\tt postmap\_task} is invoked only if requested by {\tt map\_task} (see Section~\ref{subsec:maptask}). Its purpose
is to allow additional copies of regions updated by a task to be made once the task has finished. As input
the callback receives, for each region requirement, the mapped instances as well as the valid instances. The callback should fill
in {\tt chosen\_instances} with, for each region requirement, a vector of instances to which additional copies should be made; the mapper
may also suggest sources for these copies in {\tt source\_instances}.
\begin{lstlisting}
virtual void postmap_task(
const MapperContext ctx,
const Task& task,
const PostMapInput& input,
PostMapOutput& output) = 0;
struct PostMapInput {
std::vector<std::vector<PhysicalInstance> > mapped_regions;
std::vector<std::vector<PhysicalInstance> > valid_instances;
};
struct PostMapOutput {
std::vector<std::vector<PhysicalInstance> > chosen_instances;
std::vector<std::vector<PhysicalInstance> > source_instances;
};
\end{lstlisting}
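A sketch of how these pieces fit together, assuming the {\tt postmap\_task} flag of {\tt MapTaskOutput} is the mechanism for requesting the callback ({\tt MyMapper} is a placeholder name):
\begin{lstlisting}
// In map_task: request that postmap_task be invoked for
// this task once it has been mapped (hypothetical sketch).
output.postmap_task = true;

// A trivial postmap_task that requests no extra copies.
void MyMapper::postmap_task(const MapperContext ctx,
    const Task& task, const PostMapInput& input,
    PostMapOutput& output)
{
  // One (empty) vector per region requirement: make no
  // additional copies of any region.
  output.chosen_instances.resize(task.regions.size());
  output.source_instances.resize(task.regions.size());
}
\end{lstlisting}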
\subsection{Using Virtual Mappings}
\label{subsec:mapping:virtual}
A useful optimization is to use {\em virtual mapping} for a logical region argument that a task does not use itself but only passes
as an argument to a subtask. A virtual mapping is just a way of recording that no physical instance will be created for the region
argument, but the name and metadata for the region are still available so that it can be passed as an argument to subtasks.
The function {\tt PhysicalInstance::get\_virtual\_instance()} returns a virtual instance, which can be used as the chosen physical
instance of a region requirement. If a task variant is marked as an {\tt inner} task (meaning that it does not access any of its regions and only passes them on to subtasks), the default mapper uses virtual instances for all of the region arguments, except for fields with reduction privileges, for which the Legion runtime always requires a real physical instance to be mapped. See {\tt map\_task} in {\tt default\_mapper.cc}.
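For example, inside a {\tt map\_task} implementation a region requirement can be virtually mapped as follows (the index {\tt idx} is a hypothetical placeholder):
\begin{lstlisting}
// Sketch: virtually map region requirement idx; no physical
// instance is created, but the region can still be passed
// on to subtasks, which map it themselves.
output.chosen_instances[idx].push_back(
    PhysicalInstance::get_virtual_instance());
\end{lstlisting}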
\section{Other Mapping Features}
\label{sec:mapping:others}
Custom policies for mapping tasks and their region requirements are the most common reasons for users to write their own mappers.
In this section we cover a few other mapping features that can be included in custom mappers. This section is very incomplete; only
a handful of calls relevant to other features covered in this manual are currently included.
\subsection{Profiling Requests}
\label{subsec:mapping:profiling}
Legion has a general interface to profiling through the type {\tt ProfilingRequest}, which has one public method, {\tt add\_measurement()}.
Most Legion operations take an optional profiling request that turns on the gathering of profiling information for that specific operation.
Most profiling is done in the Realm low-level runtime, and running a Legion program with the command-line flag {\tt -lg:prof} will turn on
profiling of many runtime operations; see \url{https://legion.stanford.edu/profiling/index.html#legion-prof} for an introduction to using
the Legion profiler. Most users only use the Legion profiler, but {\tt ProfilingRequest}s are available for users who want more
selective control over profiling.
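As a sketch, a mapper might attach a measurement request while mapping a task; this assumes the {\tt task\_prof\_requests} field of {\tt MapTaskOutput} and the Realm {\tt OperationTimeline} measurement type, with results delivered to the mapper's {\tt report\_profiling} callback:
\begin{lstlisting}
// In map_task: ask Realm to record start/end timestamps
// for this task (hypothetical sketch).
output.task_prof_requests.add_measurement<
    Realm::ProfilingMeasurements::OperationTimeline>();
\end{lstlisting}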
\subsection{Mapping Acquires and Releases}
\label{subsec:mapping:acquires}
The callback {\tt map\_acquire} is called for every {\tt acquire} operation. Other than the possibility of adding a profiling request, {\tt map\_acquire} has no options to set.
For {\tt release} operations there is also a policy decision to make, via the callback {\tt select\_release\_sources}:
\begin{lstlisting}
virtual void select_release_sources(
const MapperContext ctx,
const Release& release,
const SelectReleaseSrcInput& input,
SelectReleaseSrcOutput& output) = 0;
struct SelectReleaseSrcInput {
PhysicalInstance target;
std::vector<PhysicalInstance> source_instances;
};
struct SelectReleaseSrcOutput {
std::deque<PhysicalInstance> chosen_ranking;
};
\end{lstlisting}
Recall that a release operation restores the copy restriction on a region with simultaneous coherence, flushing any updates
to the region to the original {\tt target} instance. This callback allows the mapper to produce a ranking {\tt chosen\_ranking} of
the valid instances of the region ({\tt source\_instances}) to serve as the source of the copy to the {\tt target} at the point of the release.
%\subsection{Mapping Must Epoch Launches}
%\label{subsec:mapping:mustepoch}
\subsection{Controlling Stealing}
\label{subsec:mapping:stealing}
There are two callbacks for controlling how tasks are stolen. A mapper may try to steal tasks from another mapper using {\tt select\_steal\_targets}, and a mapper can control which tasks it allows to be stolen using {\tt permit\_steal\_request}.
Mappers that want to steal tasks should implement {\tt select\_steal\_targets}. This callback sets
{\tt targets} to a set of processors from which tasks can be stolen. A {\tt blacklist} is supplied as input, which records processors
for which a previous steal request failed due to insufficient work. The blacklist is managed automatically by the runtime system, and
processors are removed from the blacklist when they acquire additional work.
\begin{lstlisting}
struct SelectStealingInput {
std::set<Processor> blacklist;
};
struct SelectStealingOutput {
std::set<Processor> targets;
};
virtual void select_steal_targets(
const MapperContext ctx,
const SelectStealingInput& input,
SelectStealingOutput& output) = 0;
\end{lstlisting}
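A simple implementation might target every other processor of the same kind that is not blacklisted. This sketch assumes the Realm {\tt Machine::ProcessorQuery} interface and a hypothetical {\tt local\_proc} member recording the mapper's own processor ({\tt MyMapper} is a placeholder name):
\begin{lstlisting}
void MyMapper::select_steal_targets(const MapperContext ctx,
    const SelectStealingInput& input,
    SelectStealingOutput& output)
{
  // Try to steal from every other CPU processor that is
  // not currently blacklisted.
  Machine::ProcessorQuery procs(Machine::get_machine());
  procs.only_kind(Processor::LOC_PROC);
  for (Machine::ProcessorQuery::iterator it = procs.begin();
       it != procs.end(); it++)
    if ((*it) != local_proc && input.blacklist.count(*it) == 0)
      output.targets.insert(*it);
}
\end{lstlisting}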
When a mapper receives a steal request the {\tt permit\_steal\_request} callback is invoked, notifying the mapper of the requesting
processor (the {\tt thief}) and the tasks the mapper has available to steal, from which the callback selects a set of {\tt stolen\_tasks}.
\begin{lstlisting}
struct StealRequestInput {
Processor thief_proc;
std::vector<const Task*> stealable_tasks;
};
struct StealRequestOutput {
std::set<const Task*> stolen_tasks;
};
virtual void permit_steal_request(const MapperContext ctx,
const StealRequestInput& input,
StealRequestOutput& output) = 0;
\end{lstlisting}
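A minimal policy might surrender a fixed fraction of the stealable tasks, keeping the rest for local execution ({\tt MyMapper} is a placeholder name):
\begin{lstlisting}
void MyMapper::permit_steal_request(const MapperContext ctx,
    const StealRequestInput& input,
    StealRequestOutput& output)
{
  // Let the thief take up to half of the tasks that are
  // currently available to steal.
  unsigned limit = input.stealable_tasks.size() / 2;
  for (unsigned i = 0; i < limit; i++)
    output.stolen_tasks.insert(input.stealable_tasks[i]);
}
\end{lstlisting}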
%\section{Managing Execution}
%\label{sec:mapping:execution}
%\subsection{Context Management}
%\label{subsec:mapping:context}
%\subsection{Mapper Communication}
%\label{subsec:mapping:communication}
%\section{Performance: Tracing}
%
%
%type TraceID
%\begin{lstlisting}
%for(...) {
% runtime->begin_trace(ctx, TRACE_ID);
% ...
% runtime->end_trace(ctx, TRACE_ID);
%}
%\end{lstlisting}
\section{Mappers Included with Legion}
Several useful mappers are included in the Legion repository:
\begin{itemize}
\item The {\em default mapper} has already been discussed. The default mapper is a full implementation of the Legion mapping
API with reasonable heuristics for every mapping callback. The default mapper has grown over time---as users have found cases where
the default mapper did not perform well, improvements have been made. As a result, the default mapper is a non-trivial
mapper, even though it still does not come close to achieving optimal mappings for most complex applications.
\item The {\em null mapper} is a base class that fails an assertion for every mapper API call. The null mapper is a useful starting
point when writing a mapper from scratch, as the mapper will show exactly which API calls need to be implemented to support the application.
\item The {\em replay mapper} can be used to replay mapping decisions recorded in a replay file by Legion Spy. The replay mapper
is used mostly for ensuring that a failed computation can be deterministically replayed to help diagnose the source of
bugs in the Legion runtime itself.
\item The {\em logging wrapper} adds logging of mapping operations (which calls were made and with what arguments) to an existing mapper.
To use the logging wrapper, replace any use of {\tt new MyMapper(\ldots)} in the application
with {\tt new LoggingWrapper(new MyMapper(\ldots))} and run with the command line flag
{\tt -level mapper=2}.
\item The {\em forwarding mapper} is a base class used to build mapper wrappers; the forwarding mapper simply forwards all mapper
calls to another mapper. The logging wrapper is written using the forwarding mapper.
\end{itemize}