A Stencil Compiler for Short-Vector SIMD Architectures

Tom Henretty
The Ohio State University
henretty@cse.ohio-state.edu

Richard Veras
Carnegie Mellon University
rveras@cmu.edu

Franz Franchetti
Carnegie Mellon University
franzf@ece.cmu.edu

Louis-Noël Pouchet
Univ. California, Los Angeles
pouchet@cs.ucla.edu

J. Ramanujam
Louisiana State University
jxr@ece.lsu.edu

P. Sadayappan
The Ohio State University
saday@cse.ohio-state.edu

ABSTRACT

Stencil computations are an integral component of applications in a number of scientific computing domains. Short-vector SIMD instruction sets are ubiquitous on modern processors and can be used to significantly increase the performance of stencil computations. Traditional approaches to optimizing stencils on these platforms have focused on either short-vector SIMD or data locality optimizations. In this paper, we propose a domain-specific language and compiler for stencil computations that allows specification of stencils in a concise manner and automates both locality and short-vector SIMD optimizations, along with effective utilization of multi-core parallelism. Loop transformations to enhance data locality and enable load-balanced parallelism are combined with a data layout transformation to effectively increase the performance of stencil computations. Performance increases are demonstrated for a number of stencils on several modern SIMD architectures.

Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming—Parallel Programming; D.3.4 [Programming Languages]: Processors—Code Generation, Compilers

Keywords
DSL, Multicore, SIMD, Split Tiling, Stencils

1. INTRODUCTION

There is increasing interest in developing domain-specific frameworks for high-performance scientific computing due to the diversity of current/emerging parallel architectures. In addition to the benefit of a DSL (Domain Specific Language) on user productivity, a significant advantage is that semantic properties derived from the high-level abstractions can be utilized to develop powerful specialized compiler transformations that can be tailored to the characteristics of different architectural platforms. Using the Stencil Domain Specific Language (SDSL) [11], this paper describes a set of compiler transformations that are needed to generate efficient code for multicore processors with short-vector SIMD ISAs such as SSE, AVX, VSX, LRBNi etc.

Stencil computations involve arithmetic operations on physically contiguous data elements, e.g., $c_0 \cdot (A[i-1]+A[i]+A[i+1])$. Since vector operations with ISAs like SSE require the loading of physically contiguous data elements from memory into vector registers and the execution of identical and independent operations on the components of vector registers, stencil computations pose challenges to efficient implementation on these architectures, requiring the use of redundant and unaligned loads of data elements from memory into different slots in different vector registers. We [12] had previously addressed this issue through a dimension-lifting-transpose (DLT) data layout transformation. However, only sequential execution was addressed. Further, the approach was only evaluated on data sets that fit in L1 cache. Tiling over spatial and the time dimensions is essential in conjunction with DLT for high performance on large data sets. However, as elaborated in detail in the next section, standard time-tiling of stencil codes via skewing introduces inter-tile dependences that are incompatible with the form of vector parallelism used by the DLT transformation. In this paper, we develop an integrated approach to perform tiling in conjunction with DLT transformation to generate efficient parallel code for stencil computations over large data sets on shared memory multiprocessors. We compare performance with code generated by the Pochoir stencil compiler [18] and Pluto [2, 4] for several benchmarks on multiple target multicore processors, demonstrating strong performance benefits for 1D and 2D stencils. The paper makes the following contributions:

- It presents a stencil DSL compiler that integrates data layout transformation for short-vector SIMD ISAs with load-balanced tiled parallel execution for multi-statement stencils.
- It demonstrates significant performance improvement on several multi-core platforms for a number of benchmarks, over Intel’s ICC compiler and state-of-the-art research compilers like Pochoir and Pluto.

The paper is organized as follows. In Sec. 2 we use an illustrative example to explain the main problem to be addressed in integrating DLT with tiling. Sec. 3.1 describes the stencil DSL and Sec. 3.4 provides a high-level overview of the compiler algorithms developed. Sections 4 and 5 provide details of the compiler algorithms. Experimental results are presented in Sec. 6 and related work is covered in Sec. 7.

2. PROBLEM DESCRIPTION

In this section we first provide some background on the DLT data layout transformation of Henretty et al. [12] that was developed to overcome the fundamental data access inefficiency on current
short-vector SIMD architectures with stencil computations. We then describe why standard time-tiling is infeasible in conjunction with the DLT transformation and a different form of tiling—split-tiling—can be used effectively in conjunction with DLT.

2.1 Overview of DLT Transformation


![Figure 1: Data layout transformation for SIMD vector length of 4](image)

Figure 1: Data layout transformation for SIMD vector length of 4

2.2 Standard Tiling and DLT Transformation

We next use a Jacobi 1D stencil example to explain the problem with the use of standard time-tiling in conjunction with DLT layout transformation. Although the input to our stencil compiler uses a special DSL language (described in Sec. 3.1), we use standard loop notation in C to motivate the problem since this lower-level view makes it easier to discuss issues pertaining to loop fusion and time-tiling when compiling general multi-statement stencils for hybrid split-tiling. Finally, a high-level overview of the compiler algorithm for code generation.

3. OVERVIEW OF APPROACH

Before delving into the details of the algorithms, in this section we first describe the stencil DSL we translate. Next, we provide overviews of two methods used to tile SDSL codes, nested and hybrid split-tiling. Finally, a high-level overview of the approach to
3.1 SDSL – Stencil Domain Specific Language

A stencil computation can be summarized as one or more functions being identically applied to points on a regular grid, where the values of some groups of neighboring elements are used for each function, and this process is repeated multiple times. The Stencil Domain Specific Language (SDSL) enables the concise description of stencil computations and is briefly described in the following sections.

3.1.1 Program Description

The computation shown in Fig. 4 is a standard Jacobi 2D computation. Three key concepts are defined: (1) grid, (2) stencil(s) over which to apply a stencil function and (3) stencil function(s).

The grid defines a subset of the grid that contains all elements except a single cell border on all four sides. A stencil function is defined after the subdomain definition. This function averages the current point and four of its neighbors in a at the current timestep and places the results in a at the next timestep. References to griddata consist of the offset from the current iteration in brackets, followed by the name of the referenced field, followed by offsets from the current point in each spatial dimension in brackets.

3.1.2 General Form of an SDSL Program

In general, an SDSL program contains one grid, one or more griddata, one iterate, and one or more stencil definitions, where each stencil may define one or more subdomains and the stencil functions that operate upon them.

The abstract form of an SDSL program is represented in Fig. 5. The program is constrained to be a collection of M K-dimensional grid data and N stencils, with each stencil applying some stencil function f on one or more grid data. Each stencil function is executed on a rectangular subdomain Z \in \mathbb{Z}^K | 0 \leq l_b^k, a^k < d_{k-} \forall k \in \{1..K\} | 0 \leq a^k < d_{k-} \forall k \in \{1..K\}.

A stencil function is defined as:

grid g[dimK]...[dim1];
griddata g1,g2...,gM on g;
iterate 100 |
    stencil sN |
        [lb_sN_K:ub_sN_K]...[lb_sN_1:ub_sN_1] : fN(...);
    |
|}

Figure 5: General Form of an SDSL Program

SDSL programs are parallelized and optimized for data locality using nested and hybrid split-tiling, described in the next two sections.

3.2 Nested Split-Tiling

In nested split-tiling, a d-dimensional loop spatial loop nest is recursively split-tiled along each dimension. The outermost spatial loop at level d is split-tiled, producing a loop over uprigh tiles and a loop over inverted tiles. Inside each of these loops, loop level
$d - 1$ is split-tiled, giving four tile loop nests. Split-tiling is performed recursively in each new loop nest until the base loop level 1 is reached and there are $2^d$ total loop nests corresponding to all possible combinations of upright and inverted tiles on each dimension. Nested split-tiling of a 2D code is illustrated in Fig. 6(a).

Fig. 6(a) depicts, on the left, a series of upright (‘A’) and inverted (‘B’) tiles in the $i$ dimension. All upright ‘A’ tiles may be executed concurrently, followed by all inverted ‘B’ tiles. Below these tiles are representative cross-sections of an upright and inverted tile showing the nested split-tiles in the $j$ dimension. These tiles are labeled such that all tiles with the same number, (‘1’, ‘2’, ‘3’, or ‘4’) may be executed concurrently, and tiles with a lower number must be executed before tiles with a higher number.

The pseudocode in Fig. 6(a) shows the loop nests responsible for producing the diagram. Nested inside the sequential $tt$ loop are two parallel $ii$ loops corresponding to the ‘A’ and ‘B’ tiles shown in the diagram. Nested inside the ‘A’ loop are parallel $jj$ upright (‘1’) and inverted (‘2’) tile loops corresponding to the tiles shown in the left cross-section. Similarly, the ‘B’ loop contains nested parallel ‘3’ and ‘4’ loops corresponding to the right cross-section. A barrier follows each $jj$ tile loop to enforce tile execution order and ensure that no dependences are violated.

Nested split-tiling enables parallelization of all spatial dimension nests in a stencil, however (1) it imposes a lower bound on the size of upright tiles for a given time tile size, or equivalently, (2) it imposes an upper bound on the time tile size given an upright tile’s size.

In nested split-tiling, upright tiles must be sized such that they retain their characteristic trapezoidal shape, as in Fig. 3(b). If the base of the upright tile is not large enough for a given time tile size, the sloping lines will eventually form a tip. At this point tile execution cannot extend any further in time.

Given an upright tile with a base size of $T_U^U$, maximum absolute value of slopes in $d$ $s_{\text{max}}^d$, maximum offset of all statements $o_{\text{max}}^d$, and time tile size $T_T$ the following constraint can be stated:

$$T_U^U \geq 2 \cdot T_T \cdot s_{\text{max}}^d + 2 \cdot o_{\text{max}}^d$$

For higher dimensional problems, the lower bound on upright tile size causes tiles to overflow cache for even small time tile sizes. Consider a 3-dimensional stencil with, for all dimensions, maximum slope $s_{\text{max}}^d = 2$, maximum offset $o_{\text{max}}^d = 0$, and $T_T = 8$. This requires $T_U^U \geq 32$. For an upright tile in all dimensions, including the innermost vector dimension, this is at least 32K vector elements, enough to overflow L1 and L2 cache on most modern architectures.

### 3.3 Hybrid Split-Tiling

We overcome the tile size constraints of nested split-tiling with a hybrid of standard tiling on the outermost space loops and split-tiling on the inner loops. Hybrid split-tiling for a 2D stencil is illustrated in Fig. 6(b). The pseudocode contains a single $tt$ loop nested in the $ii$ loop which corresponds to the four traditional $i$ dimension tiles ‘A’, ‘B’, ‘C’, and ‘D’ shown in the diagram. These tiles must be executed in sequence from ‘A’–‘D’. Nested inside the $ii$ loop is the split-tiled $jj$ loop which has the same upright/inverted tile structure as the split-tiled inner loops described in the previous section.

Standard tiling does not impose any constraint on tile sizes along spatial dimensions as a function of the time tile size. Thus, standard tiles may be compacted to a much smaller size to compensate for the larger tile sizes required by split-tiled dimensions. This allows for a substantially reduced multidimensional tile footprint. In the example at the end of Sec. 3.2, we may reduce the tile size of the outermost dimension to 2, thereby reducing the the tile size to 2K elements. Since inner loops are split-tiled, we retain adequate parallelism.

### 3.4 Overview of the Optimization Algorithm

In order to perform combined data layout transformations for SIMD vectorization with parallel tiling for data locality, we use a multi-stage process to integrate dimension-lift-and-transpose (DLT) with multi-level split-tiling. The overall transformation flow is summarized in Fig. 7.
Function performDLT applies DLT on all inner-most vectorizable loops, following the method presented in Henretty et al. [12]. We remark that programs that can be modeled in SDSL all have vectorizable inner-loops, so that DLT can be applied for all stencil functions. Details of this function are provided later in Sec. 5. Function backslice performs backslicing analysis to compute the exact shape of the split-tiles (that is, computing for each stencil function the offsets and slopes of a split-tile, to be translated on the entire spatial domain). This is detailed in Section 4. Function performSplitTiling uses the split-tile shape information computed to generate split-tile code for the d split inner spatial dimensions. This is detailed in Section 5. Finally, function finalizeTiling completes code generation, by applying standard tiling on the remaining dimensions, if any. The integration as well as the complete algorithm is discussed in Section 5.

4. BACKSLICING ANALYSIS

Split-tiling requires the computation of sets of iteration space points that can be executed atomically—that is a valid tiling—while preserving parallelism between tiles of the same category (i.e., upright tiles have to be parallel with each other). In order to compute the shape of the split-tile that satisfies those properties, we first highlight the main ideas using a Jacobi 1D stencil, before describing the general algorithm for higher dimensional stencils with an arbitrary number of stencil functions.

4.1 Split-Tiling Jacobi 1D

We illustrate the main ideas behind split-tiling using a Jacobi 1D example. Fig. 8 shows the corresponding input SDSL program.

```
grid g[1000];
double griddata a on g at 0,1;
iterate 100 {
    stencil f1 {
        [1:998] : [1]a[0] ← 0.33*([0]a[-1]+[0]a[0]+[0]a[1]);
    }
    f2: a0[i] = a1[i];
}
```

Figure 8: A simple Jacobi 1D example in SDSL

In the SDSL intermediate representation, an explicit copy of the field a1 into the field a0 is added after each time iteration, leading to a program equivalent to the C code shown in Fig. 9.

```c
for (t = 0; t < 100; t++) {
    for (i = 1; i <= 998; i++) {
        f1: a1[i] = 0.33*(a0[i-1] + a0[i] + a0[i+1]);
    }
    for (i = 0; i <= 999; i++) {
        f2: a0[i] = a1[i];
    }
}
```

Figure 9: Jacobi-1D example with explicit copy

4.1.1 Examining an Upright Tile

Let us consider the top segment of an upright tile for f2, over a span [P, Q], that corresponds to the iterations of f2 performed at time T. In order to correctly compute those iterations, we need the values [P − 1, Q + 1] of a1 that were computed by executing f1 on the segment [P − 1, Q + 1] time T, which in turn depends on the values of a0 over [P − 2, Q + 2] copied by f1 at time T − 1. Based on data dependences between the statements f1 and f2, we can compute precisely which iterations must have been computed at previous time steps for each of the statements in order to compute the segment [P, Q] of f2 at time step T. This is illustrated in Fig. 10.

![Figure 10: Upright split-tile for Jacobi 1D](image)

Fig. 10 shows the set of preceding iterations, for both f1 and f2, that must be computed in order to obtain the segment [P, Q] at time T. We show here a time tile size of 3, that is, we build a split-tile that computes over three time steps.

The dependencies are analyzed from the SDSL representation. Due to the restriction on stencil shapes to be constant integer offsets (e.g., −1, 2, etc.), the dependences are simple integer relations between time and the access functions. In the next sections we show how data dependences are used to construct a dependence summary graph and formulate validity constraints on the split-tiles for the slopes and statement offsets.

4.1.2 Building the Dependence Summary Graph

We begin by creating the dependence summary graph (DSG), a multigraph with vertices for each stencil function and edges that summarize flow and anti dependence information between stencil functions. In general, a vector of 2*d+1 components is used to model data dependence in an imperfectly nested loop with maximum loop depth d, with d components representing the distances along the loops, and the other d+1 components being used to mark the relative textual ordering within a loop level. However, the structure of an SDSL program always has the form of an outer time loop surrounding a sequence of perfectly nested loops. For generation of valid split-tiled code, the exact textual position of a sequence of statements is not significant, but only whether a dependence is from a textually preceding or succeeding statement within the time loop. Further, when several dependences exist between a pair of statements due to multiple array read references it is only necessary to
identify the maximal spatial extent of dependences along the different directions at each time step. Therefore, instead of using the standard general representation of dependence vectors, we separate out the distance vector component along the time (outermost) dimension and the components along the spatial dimensions.

For the Jacobi 1D example, we have the following dependences:

\[
D_{f1 \rightarrow f2} = \begin{cases} 
  \text{flow: } f1(t, i) &\rightarrow f2(t, i) \\
  \text{anti: } f1(t, i) &\rightarrow f2(t, i - 1) \\
  \text{anti: } f1(t, i) &\rightarrow f2(t, i) \\
  \text{anti: } f1(t, i) &\rightarrow f2(t, i + 1)
\end{cases}
\]

\[
D_{f2 \rightarrow f1} = \begin{cases} 
  \delta_T = 0, \delta_i = < 0 > \\
  \delta_T = 0, \delta_i = < 1 > \\
  \delta_T = 0, \delta_i = < 2 > \\
  \delta_T = 0, \delta_i = < 3 >
\end{cases}
\]

The spatial components of the dependence vectors between dependent statements are computed by subtracting the target iteration from the source iteration, yielding the following vectors:

\[
d_{f1 \rightarrow f2} = \begin{cases} 
  \delta_T = 0, \delta_i = < 0 > \\
  \delta_T = 0, \delta_i = < 1 > \\
  \delta_T = 0, \delta_i = < 2 > \\
  \delta_T = 0, \delta_i = < 3 >
\end{cases}
\]

\[
d_{f2 \rightarrow f1} = \begin{cases} 
  \delta_T = 1, \delta_i = < -1 > \\
  \delta_T = 1, \delta_i = < 0 > \\
  \delta_T = 1, \delta_i = < 1 > \\
  \delta_T = 1, \delta_i = < 2 >
\end{cases}
\]

These tuples are used to label edges in the DSG, along with a separate label for the time distance \( \delta_T \). Assembling the coalesced tuples, time distances, dependences, and statements leads to the DSG shown in Fig. 11 for the Jacobi 1D example.

\[
\begin{align*}
\delta_T, \delta_i &> < 1, 1 > \\
\delta_T &> 0 \\
\delta_T, \delta_i &< 1, 1 > \\
\delta_T &< 1
\end{align*}
\]

Figure 11: Dependence Summary Graph (DSG) for Jacobi 1D

This DSG is subsequently used in Sec. 4.1.3 to build validity constraints for split-tiles and in Sec. 4.1.4 to compute slopes and statement offsets.

4.1.3 Building Validity Constraints

We seek to constrain the legal values of tile slope and statement offsets by assembling a system of linear inequalities based upon the DSG and the loop bounds of the split-tiled code we will generate. Pseudocode for the loop nests of Jacobi 1D upright tile is shown in Fig. 12. Informally, the validity constraints state that, for any pair of dependent statements, given a region over which the target statement is executed, the source statement will be executed over a region that is, at minimum, large enough to satisfy the dependence.

From the DSG, we note that to compute statement \( f2 \) over some range \([A, B]\) at timestep \( T \), we require that statement \( f1 \) be executed over the range \([A - 1, B - 1]\) at timestep \( T \). Similarly, to compute statement \( f1 \) over some range \([C, D]\) at timestep \( T \), we require that statement \( f1 \) be executed over the range \([C - 1, D - 1]\) at timestep \( T - 1 \).

Combining the dependence information from the DSG with the loop bounds of Fig. 12 gives us validity constraints on values the loop bounds may take, and results in the following system of inequalities for lower bounds:

\[
\begin{align*}
ii + o_{f1} + \alpha \ast t &\leq ii + o_{f2} + \alpha \ast t - 1 \\
ii + o_{f2} + \beta \ast (t - 1) &\leq ii + o_{f1} + \alpha \ast t - 1
\end{align*}
\]

The following system constrains the upper bounds:

\[
\begin{align*}
ii + Tu + o_{f1} + \beta \ast t &\geq ii + Tu + o_{f2} + \beta \ast t + 1 \\
ii + Tu + o_{f2} + \beta \ast (t - 1) &\geq ii + Tu + o_{f1} + \beta \ast t + 1
\end{align*}
\]

Simplifying and rearranging these systems of inequalities yields the following system of difference constraints for lower bounds:

\[
\begin{align*}
o_{f1} - o_{f2} &\leq -1 \\
o_{f2} - o_{f1} &\leq \alpha - 1;
\end{align*}
\]

the corresponding constraints for upper bounds are shown below:

\[
\begin{align*}
o_{f2} - o_{f1} &\leq -1 \\
o_{f1} - o_{f2} &\leq -\beta - 1.
\end{align*}
\]

These systems are used in Sec. 4.1.4 to compute valid offsets for all statements.

4.1.4 Computing Slopes and Offsets

To determine legal values for slopes \( \alpha \) and \( \beta \) we compute, respectively, maximum and minimum cycle ratios [1, 7] on the DSG. A cycle ratio \( p_C(C) \) on the DSG is computed by finding a cycle \( C \), summing \( \delta_T \) values over the cycle, and dividing by the sum of \( \delta_T \) values. A cycle ratio \( p_C(C) \) is calculated in a similar fashion with \( \delta_T \) values. We set \( \alpha = \max(p_C(C)) \) and \( \beta = \min(p_C(C)) \) for all cycles \( C \) in the DSG.

We examine the only cycle in our example DSG, \( C_0 \), between \( f1 \) and \( f2 \). The DSG tells us that computing \( f1 \) on some interval \([A, B]\) at time \( T \) allows us to compute \( f2 \) on the interval \([A + o_{f1} - o_{f2}, B + o_{f1} - o_{f2}] \) at time \( T + o_{f2} - o_{f1} \) without violating any dependences. Continuing along the cycle, computing \( f2 \) on the in-
terval \([A + s_{u}^{f_1-f_2}, B + s_{u}^{f_1-f_2}]\) at time \(T + s_{u}^{f_1-f_2}\) allows us to compute \(\varepsilon_1\) on the interval \([A + s_{u}^{f_1-f_2} + s_{u}^{f_2-f_1}, B + s_{u}^{f_1-f_2} + s_{u}^{f_2-f_1}]\) at time \(T + s_{u}^{f_1-f_2} + s_{u}^{f_2-f_1}\).

Substituting in known values for the various \(\delta\) variables shows us that computing \(\varepsilon_1\) on the interval \([A, B]\) at time \(T\) allows us to compute \(\varepsilon_1\) over the interval \([A + 2, B - 2]\) at time \(T + 1\). Thus, we see a slope of 2 on the lower bound and a slope of -2 on the upper bound.

Equivalently, summing \(\delta_T\) values and dividing by the sum of \(\delta_T\) values gives us \(\rho = 2\); a similar calculation gives us \(\rho = -2\). Since there is only one cycle in the DSG, these values are \(\max(\rho_T(C))\) and \(\min(\rho_T(C))\), and we set \(\alpha = \max(\rho_T(C)) = 2\) and \(\beta = \min(\rho_T(C)) = -2\).

These values for \(\alpha\) and \(\beta\) are substituted into the systems of difference constraints, and a solution to each of these systems is obtained using the Bellman-Ford algorithm \[1, 6\]. For the Jacobi 1D example we obtain \(\alpha_1 = -1, \alpha_0 = 1, \text{ and } \alpha_{-1} = 0\).

4.2 General Method

Our general algorithm closely follows the principles we have explained for Jacobi 1D. Since stencils may be multidimensional slopes and offsets are separately computed for each dimension. Further, different stencil functions may apply to different subdomains, which may be disjoint, overlapping, or identical. Because of this we conservatively assume all stencil functions are executed at all points in the problem domain when calculating dependences. This is an over-approximation of data dependences and has no impact on the final correctness of the generated code.

We present in Fig. 13 a general algorithm for computing slopes and offsets for a given input program. This algorithm produces lower/upper slope vectors with one slope per dimension, and lower/upper offset vectors with one offset per stencil statement.

![Figure 13: Algorithm backslice](image)

The algorithm in Fig. 13 takes an arbitrary SDSL program as input. Dependence analysis is performed on this program to determine all flow and anti dependences, and dependence distance vectors are calculated. The dependence distance vectors are used to build the DSG, where each vertex is an SDSL statement and each edge is labeled with a time distance. Edges are also labeled with coalesced distances, computed for each spatial dimension as in Sec. 4.1.2 and placed in vectors \(\delta_E, \delta_T\).

After the DSG is built, slopes and offsets are computed along each dimension \(d\). For each dimension of the DSG we compute \(\alpha\) and \(\beta\) using maximum and minimum cycle ratios. The \(\alpha_d\) and \(\beta_d\) values are then added to \(\alpha\) and \(\beta\) as the slopes for \(d\). Once the slopes have been computed for a dimension, validity constraints are constructed as shown in Sec. 4.1.3. Once the slopes and validity constraints are known for a given dimension, solveForOffsets() rearranges the validity constraints into a system of differences and solves for offset values using the Bellman-Ford algorithm. Offsets for each statement in the current dimension are appended to the offset vectors. This process is repeated until slopes and offsets have been computed for all dimensions.

4.3 Proof of Correctness

Upright Tiles: The validity constraints guarantee that no dependences in an upright tile will be violated. In this section we prove that lower bound slopes calculated with the maximum cycle ratio always lead to a system of constraints that has a solution; the proof for the upper bound and minimum cycle ratio is identical.

A system of difference constraints not solvable by the Bellman-Ford algorithm would contain a negative weight cycle. We show that this is not possible.

We begin by constructing a graph isomorphic to the DSG where each vertex is a statement and each directed edge is labeled with its corresponding validity constraint for the lower bound \(v_L^f\) on some dimension \(d\).

All validity constraints take one of the following forms:

\[
\alpha^f \leq \alpha^f - \delta_s \Rightarrow \alpha^f - \delta_s \leq -\delta_t \tag{1}
\]

\[
\alpha^f \leq \alpha^f - \delta_s + \alpha \Rightarrow \alpha^f - \delta_s \leq -\delta_t + \alpha \tag{2}
\]

In (1) the dependence associated with the constraint has source and target statements at the same timestep; in (2) the statements are 1 timestep apart. The unknown source statement offset is \(\delta^s\), the unknown target statement offset is \(\delta^t\), the dependence distance/edge weight is \(\delta_t\), and \(\alpha\) is the unknown slope.

We next show that any cycle of length \(k\) on this graph restricts the possible values of \(\alpha\) to be of the following form:

\[
\alpha \geq \frac{1}{k} \sum_{i=1}^{k} \delta_i \tag{3}
\]

Note that the maximum cycle ratio simply maximizes \(\alpha\) subject to the constraint in (3). In (3), \(\delta_i\) is the weight of an edge in the cycle and \(t\) is the number of edges where source and target are separated by 1 timestep. Every vertex \(v\) in the cycle is entered through an edge \(f\) and exited through an edge \(g\). On edge \(e\), vertex \(n\) is the target of the dependence; on edge \(g\), vertex \(n\) is the source of the dependence. Summing validity constraints \(v_L^f + v_L^g\) eliminates the offset \(\delta^e\). The accumulated sum \(\sum_{i=1}^{k} v_L^f\) eliminates all offsets and leaves \(0 \geq \sum_{i=1}^{k} \delta_i - t \ast \alpha\), which is equivalent to (3).

We now show that a negative weight cycle in the graph used by the Bellman-Ford algorithm to solve the system of difference constraints requires a value of \(\alpha\) that violates (3).

All constraints in the system of differences we are solving take one of two forms, Equation 1 or Equation 2. The graph used in Bellman-Ford, after removing the start vertex and zero-weight edges emanating from it, is isomorphic to the DSG and the validity constraint graph described above. Each vertex is a statement offset, and each edge is weighted by the difference between the source and the target offset. The weight of any length \(k\) cycle is shown in (4).

\[
\sum_{i=1}^{k} \delta_i - t \ast \alpha \tag{4}
\]
Assume there is a negative weight cycle of length $k$. This requires (5) to be true:

$$\sum_{i=1}^{k} -\delta_i + t \cdot \alpha < 0$$  

(5)

Rearranging (5), we have $\alpha$ constrained by (6):

$$\alpha < \frac{1}{t} \sum_{i=1}^{k} \delta_i$$  

(6)

The same cycle, on the validity graph, shows that $\alpha$ is constrained by (3). This is a contradiction as we know $\alpha$ was constructed by maximized subject to (3) thus our assumption of a negative weight cycle in the graph used to compute offsets was wrong and the system of differences constraints used to solve for offsets will always have a solution.

**Inverted Tiles:** The loop bounds for each statement in an inverted tile are set to fully span the gap between adjacent upright tiles (or a domain boundary). The proof of satisfaction of all dependences during execution of an inverted tile is a direct consequence of this “completeness” property of the span of spatial loops of an inverted tile. Consider all instances at time $t$, of an arbitrary statement $S_p$ in an inverted tile. Any dependence on instances of any statement $S_p$ mean that $S_p$ textually precedes $S_q$ or that the dependence is from instances of $S_q$ at some earlier time step. There are three possibilities for any such statement instance from which there exists a dependence: i) it belongs in some upright tile, ii) it belongs in the same inverted tile, iii) it belongs in some other inverted tile. If we have case (i), the dependence is satisfied since all upright tiles are executed before any inverted tiles. If we have case (ii), again the dependence is satisfied since statements are executed in textual order within the inverted tile for a time step, and in increasing time order across time steps. Finally, case (iii) is impossible since other inverted tiles are separated from the current inverted tile by at least one upright tile, whose slopes are guaranteed to prevent any dependence edge crossing their boundaries.

**Multiple Spatial Dimensions:** In the proof of correctness, so far we have focused on the case of a single spatial dimension. For multi-dimensional stencils, as explained in Sec. 3, tiles may have different characteristics along the different spatial dimensions. For example, with two spatial dimensions and nested split tiling, there are four possibilities: Upright-Upright, Upright-Inverted, Inverted-Upright, Inverted-Inverted. With hybrid split tiling, one or more spatial dimensions may be tiled using standard parallel tiles (using the slope $\beta$ computed for the right boundary in that dimension by the back-slicing analysis). The proof of validity of execution in the multi-dimensional case is a consequence of the fact that the loop bounds along each spatial dimension are independent of iterators of other spatial dimensions. Consider, for example, a stencil with three spatial dimensions $i$, $j$, and $k$, where standard tiling is used along $k$, and split tiling along $i$ and $j$. Let a particular tile be upright along $i$ and inverted along $j$. At some time step $t$, the set of all instances of a statement $S_p$ will necessarily correspond to a cross product of ranges along the three spatial extents: $[kl:ku, jl:ju, il:iu]$. The instances of any statement $S_p$ that $S_q$ depends upon, can be expressed as $[k+\delta_k^i : ku + \delta_k^i, j+\delta_j^i : ju + \delta_j^i, i+\delta_i^j : il + \delta_i^j]$. We can prove that all such dependences are satisfied by proving that the extents along each dimension are satisfied. For upright or inverted tiles along any dimension, we have proved that the required extents will be covered by the generated slopes/offsets. For a standard tiled dimension too, the dependences will be satisfied: on the lower side, the instances will either lie in the same tile or in an ear-liest numbered tile that would have been executed before this tile; on the upper side, the instance will necessarily lie within the same tile because the slope and offset from the back-slicing computation is used.

5. CODE GENERATION

Before detailing the code generation algorithm, the structure of generated code is shown in Figure 14. The example shows code for a single-statement 1D stencil. Upright and inverted inter-tile loops can be found, respectively, at lines 11 and 26. Note that the upright tile loop covers an extent from the lowest bound of any stencil function to the highest bound of any stencil function, and the inverted tile loop extends beyond this on both sides. This is needed to ensure that boundary cells are computed as upright tiles slope away from them. It is enabled by the max() and min() expressions that prevent intra-tile loop bounds from taking values outside of the original loop’s domain (lines 17 and 32). Constraints on tile size values such as $\text{UPR TILE SIZE}$ and $\text{INV TILE SIZE}$ are imposed as needed to ensure correctness of generated code; this is explained below.

![Figure 14: Illustration of structure of generated code for 1D case](image-url)
more details. Functions GenNestedSplitTileLoops and GenTileBodies are described below.

Algorithm 16 provides detail for the code generation of split-tiled code, in the context of a program on which DLT has been performed first. We remark that in order for the code generation of the masked writes required by DLT to be correct, we impose a constraint relating the size of the innermost dimension to the tile sizes:

\[ d_{\text{innermost}} \mod (\text{upr\_tile\_size} + \text{inv\_tile\_size}) = 0 \]

Adding this constraint greatly simplifies the code generation, avoiding the need to handle complex corner cases at the tile boundaries. This limitation is not a problem in practice, since padding can be used to comply with this constraint.

Function GenInvertedFullBndryTileCondition generates a conditional to determine whether to execute a full inverted tile or an boundary inverted tile that includes two loop nests for the start and end boundaries of DL T codes. Function AdjustTimeAndSpaceLoopBounds alters the loop boundaries of the loops copied from the original code to tile loop boundaries using the results of backslicing. Function TrimMaskedWriteLoops removes the masked write loops at the start and end of the inverted boundary tile loops, leaving the masked writes between them.

We conclude by presenting the algorithm for generating split-tiled loop nests shown in Algorithm 17. Hybrid split-tiling code generation requires minimal changes to Algorithms 16 and 17. Algorithm 17 requires a conditional to check if it is generating code for the outermost loop. If so, a standard tile loop is generated in stead of a split-tiled loop. Algorithm 16 requires a change to the AdjustTimeAndSpaceLoopBounds() procedure to correctly traverse the standard tile dimension.

6. EXPERIMENTAL EVALUATION

The effectiveness of the both nested split-tiling and hybrid split-tiling applied in conjunction with the dimension-lifting transformation was experimentally evaluated on several hardware platforms using a variety of stencil kernels. We compare performance to the diamond-tiling system used by Pluto [2], the cache-oblivious tiling system used by Pochoir [18], and the Intel C Compiler v13.0.

6.1 Experimental Setup

Hardware: Experiments were performed on three hardware platforms with DVFS features disabled on all of them. AMD Phenom II X6 1100T (K10 micro-architecture) is a 6-core x86-64 chip, clocked at 3.3GHz; single-precision peak performance of 26.4 GFlop/s/core (158.4 GFlop/s aggregate); double-precision peak performance
of 13.2 GFlop/s/core (79.2 GFlop/s aggregate). Intel Core i7-920 (Nehalem micro-architecture) is a quad-core x86-64 chip running at 2.66 GHz; single-precision peak performance of 21.28 GFlop/s/core (85.12 GFlop/s aggregate); double-precision peak performance of 10.64 GFlop/s/core (42.56 GFlop/s aggregate). Intel Core i7-2600K (Sandy Bridge micro-architecture) is a quad-core core x86-64 chip running at 3.4 GHz; single-precision peak performance of 54.4 GFlop/s/core (217.6 GFlop/s aggregate); double-precision peak performance of 27.2 GFlop/s/core (108.8 GFlop/s aggregate).

Programs were compiled using the Intel C Compiler v13.0 with the `-O3 -ipo -xHost` optimization flags used for split-tiled, Pluto, and Pochoir codes on all machines. Auto-parallelization and auto-vectorization was enabled for ICC results with the `-parallel -o3 -ipo -xHost` optimization flags and appropriate vectorization pragmas.

**Benchmarks:** The following stencil codes were used (with the names used to refer to them in parentheses): Jacobi 1D (jac-1d-3), Jacobi 2D (jac-2d-5), Jacobi 3D (jac-3d-7); Laplacian, Gradient and Poisson 2D; FDTD 2D [16]; Heat 1D/2D/3D distributed with Pochoir [18].

All array dimensions were set to be significantly larger than last level cache on all micro-architectures. For all stencils, the footprint of each array was set to 244.14MB for single precision and 488.28MB for double precision; this was achieved using 1D arrays with 64 × 10^7 scalar elements, 2D arrays with 8000^2 elements, and 3D arrays with 400^3 elements. The number of time steps was set to 100 for all benchmarks.

Tile sizes were autotuned for Pluto with diamond tiling, as well as for our split-tiling work. The autotuning was done over a sampling of the set of tile size combinations that respect the various constraints (i.e., multiple of vector length) of each framework. Autotuning runs were performed for a maximum of 4 hours per SIMD unit / benchmark combination (how to speed up tile size autotuning, for instance using acceleration/search heuristics is beyond the scope of this paper, we simply tested all tile sizes in our subset). For split-tiled codes all threads (one thread per core) were assigned to the outermost parallel loop using OpenMP parallel for pragmas.

### 6.2 Experimental Results

Absolute performance for single and double precision experiments across all platforms and codes are given in Figures 18–20.

In all cases Pochoir, nested split-tiled, and hybrid split-tiled codes are significantly faster than sequential C code auto-parallelized and vectorized by Intel’s ICC compiler. Not all benchmarks could be optimized by Pochoir, which can only generate optimized tiled parallel code for single statement stencil codes. Multi-statement, multi-loop stencil codes like FDTD, with three inter-related stencils updating Ex[t] using Ex[t-1] and Hz[t-1], Ey[t] using Ey[t-1] and Hz[t-1], and Hx[t] using Hz[t-1], Ex[t], and Ey[t]), cannot be expressed in Pochoir.

**1D Benchmarks:** Having only one split-tiled dimension allowed tiles to fit in cache and provided ample parallelism. High performance was expected of these codes and was observed. Nested split-tiling + DLT outperformed ICC, Pochoir, and Pluto on both 1D benchmarks across all platforms. Improvement over Pochoir ranged from a low of 1.27 × for double precision Jacobi-1D on Nehalem to a high of 2.2 × for single precision Heat-1D on Sandy Bridge. Hybrid split-tiling does not apply to ICC code as the inner loop is split-tiled for both fine grain vector parallelism and coarse grain thread-level parallelism.

**2D Benchmarks:** Across all 2D benchmarks hybrid split-tiling outperformed ICC, Pochoir, Pluto, and nested split-tiling on both quad core Intel platforms. For the hexacore AMD, performance of hybrid split-tiling + DLT fell behind both nested split-tiling and Pluto on several benchmarks, both single and double precision.

This performance anomaly is due to load balancing issues. DLT effectively turned the dimension of the arrays involved from 8000 × 8000 float/double scalars to 8000 × 2000 float vectors and 8000 × 4000 double vectors. With hybrid split-tiling, tiles from the smaller dimension (subject to the constraints described in Sec. 3.2) had to be distributed across threads. Smaller tiles with better load balancing characteristics limited reuse, while larger tiles with significant reuse were not plentiful enough to adequately distribute load across cores.

Nested split-tiling exhibited better load balancing because only tiles along the larger dimension were distributed across cores. The outer dimension was large enough to allow both large tiles for significant reuse and a large quantity of tiles for load balancing. Hybrid split-tiling on quad core Intel platforms did not have any load balancing issues because the smallest dimension was divisible by 4, allowing for the same number of tiles to be distributed across all cores.

**Figure 18: AMD Phenom II X6 SSE2 Performance**
3D Benchmarks: Both hybrid and nested split-tiling fell behind Pochoir and Pluto on both 3D benchmarks across all platforms. While hybrid split-tiling was able to significantly reduce the pressure on the memory system for 2D benchmarks, its benefit was not seen when adding a third dimension. In 3D, both inner dimensions are split-tiled, thus spatial tile sizes in both inner dimensions must increase when time tile size is increased.

For both the Heat-3D and Jacobi-3D benchmarks, an increase of one in the time tile size leads to an upright tile size increase of four in each dimension. Tile sizes grow fast enough that by the time significant gains can be achieved from data reuse, the code is bound by memory latency and bandwidth consumption from spatial tiles that have grown significantly larger than L1 cache.

The diamond tiles used by Pluto overcome this limitation by halving the amount of data required for a given time tile size. For a time tile size of 16 and slopes of 2 on either side, an upright tile used in nested and hybrid split-tiling must be at least 64 elements at its base. A diamond tile with a time tile size of 16 begins at a point, expands to 32 elements at its widest and narrows to a point again. Pluto is able to reuse data significantly more than both Pochoir and nested and hybrid split-tiling.

Further exacerbating the problem is the fact that DLT causes tile sizes in the innermost dimension to be multiplied by vector size. Coupled with the constraints on tile size described in Sec. 3.2 this leads to split-tiling + DLT tile sizes (in KB) that are much larger than similar tile sizes (size of each dimension). In Heat-3D both Pluto and Pochoir are able to achieve high performance across all platforms. It is likely that the smaller tile sizes (in KB) enable Pochoir to achieve very high performance on this code.

7. RELATED WORK

A number of recent efforts have targeted the optimization of stencil computations for multicore CPUs and GPUs [2, 5, 8–10, 13, 15, 17–19]. Strzodka et al. [17] used time skewing and cache-size oblivious parallelograms to improve the memory system pressure and parallelism in stencils on CPUs. Micikevicius et al. [15] hand-tuned a 3-D finite difference computation stencil and achieved an order of magnitude performance increase over existing CPU implementations on GT200-based Tesla GPUs. Datta et al. [8] developed an optimization and auto-tuning framework for stencil computations, targeting multi-core systems, NVidia GPUs, and Cell SPUs.
Tiling is a critical transformation for optimizing stencil computations since they typically perform repeated sweeps over large multidimensional arrays that are much too large to fit within cache. In order to benefit from both intra-step reuse as well as data reuse across several successive sweeps over the domains, so called “time tiling” is essential. The standard approach to time-tiling of stencil computations requires skewing of the iteration space and introduces inter-tile dependences along the spatial dimensions and thereby restricts parallelism to front-end parallelism in the tile space. Alternate approaches to tiling using overlapped tiles [13], split-tiles [9] or “diamond” tiles [2, 14] enable a greater degree of inter-tile parallelism. In this paper we have developed compiler algorithms for enabling split-tiling in conjunction with a dimension-lifted-transpose data layout transformation. GROSSER et al. [9] utilize a variant of split-tiling for GPGPU with substantially different tile shaping, analysis, and code generation techniques than this work.

Among the numerous research efforts to optimize stencil computations, a few of them have developed a specialized DSL compiler for stencils. PATUS [5] is a stencil compiler developed by CHRISTEN et al. that uses both a stencil description and a machine mapping description to generate efficient CPU and GPU code for stencil programs. A stencil-DSL compiler for GPUs that uses overlapped tiling for parallel execution was recently reported [13].

TANG et al. [18] have developed and publicly released the Pochoir stencil compiler that uses a DSL embedded in C++ to produce high-performance code for stencil computations using cache-oblivious parallelograms for parallel execution on shared-memory systems. In this paper, we compare performance of several multi-core systems of the code generated using DLT and split-tiling with the code generated by Pochoir, showing that we achieve comparable or better performance. But unlike the DSL compiler we have described in this paper, neither PATUS nor Pochoir can generate optimized time-tilded code for multi-statement stencil computations such as the FDTD (Finite Difference Time Domain) stencil.

The use of Chapel for the description of dense and sparse stencils was investigated by BARRETT et al. [3]. The Chapel work enables automated distributed memory parallelization of stencils but does not address time-tiling or vectorization. Very recent work by Bandishti et al. [2] enhanced the Pluto compiler to incorporate a strategy for “diamond” tiling, that is particularly effective in parallelizing stencil computations. However, our approach to combining data layout transformation in conjunction with split-tiling provides significant performance advantages over Pluto for a range of stencil benchmarks, as seen from our experimental results.

8. CONCLUSIONS

In this paper, we have described compiler algorithms for a stencil DSL incorporating two key transformations – dimension-lifted-transpose data layout transformation to optimize vector loads/stores, and split-tiling for enhanced inter-tile concurrency. Experimental results on a number of stencil benchmarks on multiple target multicore platforms demonstrate significant performance improvements achievable over state-of-the-art production compilers such as Intel’s ICC, but also significant improvements over Pochoir and recent work on diamond tiling in Pluto. Our compiler has greater scope of applicability than Pochoir, effectively optimizing arbitrary multi-statement stencils.

Acknowledgments. This work was supported in part by the U.S. National Science Foundation through awards 0926687 and 0926688, by the Center for Domain-Specific Computing (CDSL) funded by NSF “Expeditons in Computing” award 0926127, and the Department of Energy through award DE-SC0005033.

9. REFERENCES