PageRank Acceleration for Large Graphs with Scalable Hardware and Two-Step SpMV

Fazle Sadi, Joe Sweeney, Scott McMillan, Tze Meng Low, James C. Hoe, Larry Pileggi and Franz Franchetti
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
Email: {fsadi, joesweeney}@cmu.edu, smcmillan@sei.cmu.edu, {lown, jhoe, pileggi, franzf}@cmu.edu

Abstract—PageRank is an important vertex ranking algorithm that suffers from poor performance and efficiency due to its notoriously irregular memory access behavior. Furthermore, when graphs become bigger and sparser, PageRank applications are inhibited because most current solutions rely heavily on large, fast random access memory, which is not easily scalable. In this paper we present a 16nm ASIC based shared memory platform for PageRank implementation that fundamentally accelerates Sparse Matrix dense Vector multiplication (SpMV), the core kernel of PageRank. This accelerator is scalable, guarantees full DRAM streaming and reduces off-chip communication. More importantly, it is capable of handling very large graphs (~2 billion vertices) despite using significantly less fast random access memory than current solutions. Experimental results show that our proposed accelerator yields an order of magnitude improvement in both energy efficiency and performance over state of the art shared memory commercial off-the-shelf (COTS) solutions.

I. INTRODUCTION

PageRank is an iterative algorithm that ranks the vertices of a graph according to their relative importance, defined as the probability of reaching a given vertex. The PageRank vector holds numerical values that represent this importance for a set of vertices within a graph. For example, in its best known application, PageRank is used to rank web pages for a particular keyword search. Here, each vertex of the graph represents a web page and the edges represent hyperlinks among the web pages. The higher the value of a vertex in the PageRank vector, the higher the likelihood that anyone randomly browsing through the World Wide Web will land on that particular web page.


$$x_{(i+1)}^T = \alpha x_i^T A + (1-\alpha) x_i^T \frac{ee^T}{N} \tag{1}$$

Here, $x_{(i+1)}$ is the output PageRank vector at iteration $i$, $A$ is the sparse hyperlink matrix of dimension $(N \times N)$, and $\alpha$ is a constant damping factor. The term $ee^T/N$ is the teleportation matrix, which models the probability that a user jumps to any page chosen uniformly at random. The column vector $e$ has every element equal to 1.

In Equation 1, the second term contributes only a constant addition to each element of the resultant PageRank vector. The first term, on the other hand, is a Sparse Matrix dense Vector multiplication (SpMV) operation. As this is the core kernel of each iteration, PageRank inherits all the challenges of the SpMV kernel.
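The split of Equation 1 into an SpMV term and a constant term can be sketched in a few lines of Python. Because $x_i^T e$ is a scalar (the sum of the entries of $x_i$), the second term adds the same value $(1-\alpha)\,(x_i^T e)/N$ to every element. This is an illustrative sketch only: the function name `pagerank_step`, the `(row, col, value)` edge list and the three-vertex graph are assumptions made for the example, not the implementation described in this paper.

```python
def pagerank_step(x, edges, alpha):
    """One update of Equation 1: x_next^T = alpha * x^T A + (1 - alpha) * x^T (e e^T / N).

    x     : list of N floats, the current PageRank vector
    edges : list of (row, col, value) non-zeros of A (hypothetical COO layout)
    alpha : damping factor
    """
    n = len(x)
    # Second term: x^T e is the scalar sum of x, so every element gets the same constant.
    constant = (1.0 - alpha) * sum(x) / n
    x_next = [constant] * n
    # First term: SpMV with a row vector, (x^T A)[q] = sum over p of x[p] * A[p][q].
    for p, q, a_pq in edges:
        x_next[q] += alpha * x[p] * a_pq
    return x_next


# Tiny 3-vertex example with a row-stochastic A (each row sums to 1).
edges = [(0, 1, 0.5), (0, 2, 0.5), (1, 2, 1.0), (2, 0, 1.0)]
x = [1.0 / 3] * 3
for _ in range(20):
    x = pagerank_step(x, edges, alpha=0.85)
```

With a row-stochastic $A$, the update preserves the sum of the vector, so `x` remains a probability distribution across iterations.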

Challenges. SpMV is a bandwidth bound operation with adverse memory access behavior. It requires random access to either the input vector ($x_i$) or the resultant vector ($x_{i+1}$). For numerous real world graphs, these vectors are much larger than the fast memories that current technologies can affordably provide, such as Static Random Access Memory (SRAM) and Embedded DRAM (eDRAM). Hence, the majority of the memory accesses of PageRank are random accesses to the main memory, i.e., DRAM. This translates to poor utilization of already scarce DRAM bandwidth. Furthermore, it causes redundant off-chip transfers because the accessed data is smaller than a cache line, making this bandwidth bound kernel even more inefficient.

A major implication of these issues is that most state of the art PageRank solutions strongly depend on fast storage to achieve decent performance and efficiency. This dependency prevents these solutions from scaling effectively as the graphs get larger and sparser, mainly because a) fast storage is not easily scalable in a shared memory scenario, and b) distributed systems have huge communication overhead [2]. Custom hardware solutions in the literature have been reported to handle only a few million nodes, despite their significant advantage in design flexibility. For example, the FPGA accelerator in [3] reported a maximum of 2.3M nodes using 8.4MB SRAM, and the Application Specific Integrated Circuit (ASIC) based architecture in [4] reported a maximum of 8M nodes in spite of using a huge 32MB eDRAM scratchpad. On the other hand, commercial off-the-shelf (COTS) solutions with large last level caches (LLCs), such as [5], [6], tend to handle larger graphs, but with low efficiency and poor bandwidth utilization. Nonetheless, the graphs reported have only tens of millions of nodes at maximum. Moreover, costly data pre-processing to extract locality in the data is prevalent in both custom and COTS architectures [6]–[11].

Goals and Contributions. For high performance and efficient PageRank on very large (~billion nodes) and highly sparse (avg. degree <10) graphs, a number of goals need to be achieved: a) streaming DRAM access, b) off-chip traffic reduction, c) full utilization of DRAM bandwidth, d) a low fast memory requirement in order to scale, and e) no dependence on data locality or costly pre-processing. The contributions of this work toward these goals are as follows.

1. We have designed a shared memory accelerator for PageRank, based on a 16nm ASIC (currently under fabrication), that guarantees 100% streaming access to main memory.

2. Our proposed architecture incorporates state of the art High Bandwidth Memory (HBM) [12]. We have developed an optimization technique to reduce off-chip traffic of PageRank and fully utilize the extreme bandwidth delivered by 3D DRAM.

3. This PageRank implementation is able to operate on very large graphs (~2 billion nodes) while using significantly less fast memory (11MB). With significant room for possible expansion of fast memory, the proposed solution is scalable to handle even larger graphs. This ASIC design is also portable to FPGA due to reasonable hardware resource requirements.

4. Our proposed solution is independent of data (nonzero) locality and only requires basic matrix partitioning.

The remainder of the paper is organized as follows. Sec. II details the SpMV algorithm and basic PageRank implementation using SpMV. In Sec. III, we demonstrate an optimization technique to reduce off-chip communication and increase computation’s streaming speed. Sec. IV describes the ASIC developed for PageRank acceleration. In Sec. V, we evaluate the performance and efficiency of our proposed methods against recent benchmarks. Lastly, Sec. VI concludes this work.

II. TWO-STEP SPMV DRIVEN PAGERANK

**SpMV Algorithm.** In this work, we have implemented the Two-Step SpMV algorithm presented in [13]. The main reason for using this algorithm for PageRank is that it guarantees full DRAM streaming. Furthermore, for large graphs with high sparsity, Two-Step produces less off-chip traffic than most conventional SpMV algorithms. This algorithm is depicted in Figure 1. Before computation, the matrix \( A \) is partitioned into 1D column blocks and the source vector \( (x) \) is partitioned into smaller segments. The segment width of \( x \) is dictated by the available fast random access memory. The width of the column blocks of \( A \) is the same as the source vector segment width. The column blocks of the sparse matrix \( A \) are stored in a row-major sparse format [14].

As the name suggests, the operation is conducted in two separate steps. In the first step, a single segment of \( x \) is streamed from DRAM and stored in the fast memory. Afterwards, a single column block of \( A \) is streamed to the computation core from DRAM and a partial SpMV is conducted between that column block and the vector segment. All the required random accesses to \( x \) are confined to the address space held in the fast memory, which has small and fixed latency. Hence, this operation can be easily pipelined and implemented in a fully streaming fashion. A sparse intermediate vector \( (v^k) \) is generated as a result of this operation, which is streamed back to main memory. As the matrix block is stored and accessed in row-major direction, the elements of \( v^k \) are naturally sorted according to their position indices. This partial SpMV is conducted sequentially for all the matrix blocks and, after the first step, we end up with \( n \) intermediate sparse vectors residing in the main memory.

In the second step, the intermediate sparse vectors are streamed back from DRAM and merged to form the dense resultant vector \( y \). To ensure full main memory streaming, page (row buffer in DRAM) size blocks are prefetched whenever an element of \( v^k \) is transferred from DRAM. Step 2 is essentially a large \( n \)-way merge operation on very long, sorted lists. It is difficult to implement such a large Multi-way Merge kernel in COTS architectures, such as CPUs and GPUs. In fact, this is the main reason SpMV algorithms similar to Two-Step have not been proposed in the literature despite their fully streaming DRAM access pattern. However, it has recently been shown in [13] that this large merge network can be implemented efficiently with custom hardware. This work implements such hardware in an ASIC platform that is detailed in Sec. IV.
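The two steps described above can be sketched in software as follows. This is a minimal functional model, not the hardware design: Python lists stand in for DRAM and fast memory, `heapq.merge` stands in for the custom multi-way merge network, and all function names and the small example matrix are assumptions made for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def partition_column_blocks(coo, num_cols, width):
    """Split COO triples (row, col, value) into 1D column blocks, each sorted row-major."""
    num_blocks = (num_cols + width - 1) // width
    blocks = [[] for _ in range(num_blocks)]
    for row, col, value in coo:
        blocks[col // width].append((row, col, value))
    for block in blocks:
        block.sort(key=itemgetter(0, 1))  # row-major order inside each block
    return blocks


def step1_partial_spmv(block, x_segment, col_offset):
    """Step 1: multiply one column block by one segment of x.

    All random accesses touch only x_segment (the stand-in for fast memory).
    Returns a sparse intermediate vector as (row, value) pairs sorted by row.
    """
    v = []
    for row, entries in groupby(block, key=itemgetter(0)):
        acc = 0.0
        for _, col, value in entries:
            acc += value * x_segment[col - col_offset]
        v.append((row, acc))
    return v


def step2_merge(intermediates, num_rows):
    """Step 2: n-way merge of the sorted intermediate vectors into the dense result y."""
    y = [0.0] * num_rows
    # Rows arrive in ascending order, so y is written sequentially (streaming).
    for row, value in heapq.merge(*intermediates, key=itemgetter(0)):
        y[row] += value  # equal row indices from different blocks are reduced here
    return y


def two_step_spmv(coo, x, num_rows, width):
    blocks = partition_column_blocks(coo, len(x), width)
    intermediates = []
    for k, block in enumerate(blocks):
        segment = x[k * width:(k + 1) * width]  # "stream in" one segment of x
        intermediates.append(step1_partial_spmv(block, segment, k * width))
    return step2_merge(intermediates, num_rows)


coo = [(0, 0, 2.0), (0, 3, 1.0), (1, 1, 3.0), (2, 2, 4.0), (3, 0, 5.0), (3, 3, 6.0)]
x = [1.0, 2.0, 3.0, 4.0]
y = two_step_spmv(coo, x, num_rows=4, width=2)  # y = A x
```

In this sketch the segment width plays the role of the fast memory capacity: a smaller `width` means more column blocks and therefore more lists to merge in Step 2.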

**PageRank using SpMV.** Our proposed basic implementation of PageRank using Two-Step SpMV is depicted in Figure 2. As this is an iterative application, the entire Two-Step SpMV operation is conducted once independently for each iteration. The resultant dense PageRank vector \( (y_i = x_{i+1}) \) of iteration \( i \) works as the source vector for SpMV in iteration \( i + 1 \). As the entire resultant vector is too large to be stored in on-chip fast memory, it is streamed out to DRAM at the end of each iteration. Then in the next iteration, segments of \( x_{i+1} \) are streamed back to computation core for Step 1 of the SpMV operation.

**Off-chip Communication.** It is evident that during each iteration and during the transition between iterations, the Two-Step driven PageRank algorithm is guaranteed to require only streaming DRAM accesses, which is imperative for high performance and energy efficiency. Another important factor in this regard is the off-chip traffic that is transferred between the computation core and main memory. As there is no DRAM random access involved in our proposed Two-Step driven PageRank (\( PR_{\text{TS}} \)), we can exactly calculate the off-chip data traffic for this implementation. We compare this with the baseline data traffic of PageRank driven by a dot product based SpMV algorithm (\( PR_{\text{Base}} \)), where each element of the resultant vector is computed directly from the dot product of a matrix row and the source vector [15]. For \( PR_{\text{Base}} \), we assume the traditional LLC eviction policy of a CPU and 64B cache line transfers between DRAM and LLC. We select this baseline for comparison because it is difficult to find in the literature a shared memory PageRank implementation for very large graphs with moderate fast storage. For example, [3] uses moderate on-chip BRAM (8MB), but only handles matrices with a maximum dimension of 2.3M×2.3M. On the other hand, [4] developed an ASIC for SpMV using large fast random access eDRAM (32MB). Nonetheless, this work reported handling matrices of at most 8M×8M. In this work, our goal is to handle matrices with dimensions in the order of hundreds of millions to billions.

Figure 3 depicts the total off-chip traffic comparison for different fast memory sizes between \( PR_{\text{Base}} \) (1st bar) and \( PR_{\text{TS}} \) (2nd bar) for PageRank with 20 iterations. For this comparison, we have used a 1B×1B synthetic and uniformly random sparse matrix with an average degree of 3. It is evident that the biggest contributing factor in the off-chip communication for \( PR_{\text{Base}} \) is the redundant data for the source vector \( x \) (striped gray region). This redundancy is due to the cache line level block transfers for random accesses to \( x \), most of which never take part in actual computation.

Fig. 2: Two-Step SpMV driven PageRank with independent iterations (PR_TS).

Fig. 3: Off-chip communication comparison for PageRank.

The transferred source vector data that actually takes part in computation (deep blue region) is also larger for PR_Base due to regular evictions. The matrix data (red dotted region) is the same for both, as the matrix is streamed only once. However, the overall payload (data that actually takes part in computation) is larger in PR_TS. This is due to the round trip of the intermediate vectors (light blue region, which also includes the resultant vector) in the Two-Step algorithm. This is the cost of streaming for Two-Step SpMV. Nonetheless, the overall off-chip traffic is significantly less for PR_TS. Additionally, the entire off-chip traffic in PR_TS is transferred at DRAM streaming bandwidth, whereas most of the traffic for PR_Base is transferred at DRAM random access speed, which is orders of magnitude slower than streaming. This comparison demonstrates the advantages of fully streaming PR_TS over non-streaming algorithms for highly sparse, non-structured and large graphs.
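The redundancy of cache line transfers for random accesses to \( x \) can be bounded with simple arithmetic. The sketch below assumes the 64B cache line stated for PR_Base, 4-byte single precision vector elements, and the worst case in which no other element of a fetched line is used before the line is evicted. The helper name and the no-reuse assumption are illustrative; a real cache will see some reuse.

```python
def redundant_fraction(line_bytes, element_bytes):
    """Fraction of a fetched cache line that is unused, assuming no reuse before eviction."""
    return (line_bytes - element_bytes) / line_bytes


LINE_BYTES = 64      # cache line transferred between DRAM and LLC
ELEMENT_BYTES = 4    # one single precision element of x (assumed)

fraction = redundant_fraction(LINE_BYTES, ELEMENT_BYTES)  # share of each line that is redundant
amplification = LINE_BYTES // ELEMENT_BYTES               # bytes moved per byte of payload
```

Under these assumptions 60 of every 64 bytes fetched for a random access are redundant, which is consistent with the source vector dominating the PR_Base traffic.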

It is noteworthy that neither algorithm benefits significantly from an increase in fast memory. For PR_Base, even the largest fast memory is unable to hold a sizable portion of the source vector and thus cannot provide any meaningful reuse of data. In the case of PR_TS, the high sparsity of the data makes reduction operations in Step 1 rare. Hence, an increase in fast memory has a negligible effect on off-chip communication for PR_TS. However, a larger fast memory enables PR_TS to handle bigger graphs (more nodes).

III. TRAFFIC OPTIMIZATION BY ITERATION OVERLAP

As shown in Figure 2, PageRank iterations in PR_TS are sequential and completely independent: because the resultant dense vector ($x_{i+1}$) is too big to be stored in the fast memory, it is streamed out to DRAM and streamed back into the computation core as the source vector in the next iteration. However, we can parallelize the SpMV steps of consecutive iterations and eliminate the off-chip communication for the resultant and source vectors, as depicted in Figure 4. The idea is that even though we cannot store the entire $x_{i+1}$, we can store a segment of it in the fast memory. For example, during Step 2 of iteration $i$, instead of sending the computed elements of $x_{i+1}$ to DRAM, they are written to fast memory until a full segment is stored. When the first segment of source vector for iteration $i + 1$ is completely

Pseudocode 1: Two-Step SpMV driven PageRank with off-chip communication optimization by iteration overlap.

$T$ = Total number of iterations
for $i = 0$ to $T - 1$
 | **STEP 1**
 | for $k = 0$ to $n - 1$
 |  | Stream in Matrix Column Block $A^k$
 |  | $u \leftarrow 0$
 |  | for All rows $A^k_{p,:}$ with $\text{nnz} > 0$
 |  |  | for Each non-zero $A^k_{p,q}$ in $A^k_{p,:}$
 |  |  |  | Random access to vector segment element $x^k_q[i+1]$
 |  |  |  | $u_p \leftarrow \alpha \cdot A^k_{p,q} \cdot x^k_q[i+1] + u_p$
 |  |  | end
 |  | end
 |  | Sparsify $u$ to $v^k[i+1]$
 |  | Stream out $v^k[i+1]$ to main memory
 | end
 | **STEP 2**
 | for $p = 0$ to $N - 1$
 |  | for $k = 0$ to $n - 1$
 |  |  | Stream in $v^k[i]$
 |  |  | $x_p[i+1] \leftarrow x_p[i+1] + v^k_p[i]$ [Multiway merge]
 |  | end
 |  | $c[i+1] \leftarrow c[i+1] + x_p[i+1]$ [Constant addition]
 |  | $x_p[i+1] \leftarrow x_p[i+1] + \frac{1}{N} c[i]$ [Constant addition]
 |  | Buffer $x[i+1]$ on chip
 | end
end

Two source vector segment buffers are required in fast memory: 1) one for the computation of Step 1 in iteration $i+1$, and 2) one for storing the output of Step 2 in iteration $i$.

![Fig. 4: Off-chip traffic optimized PageRank with iteration overlap (PR_TS_Opt).](image)


Another important benefit of PR_TS_Opt is that it achieves significantly higher streaming speed than PR_TS. This is because neither the Step 1 nor the Step 2 computation cores remain idle in steady state, whereas for PR_TS the computation logic for either Step 1 or Step 2 is idle at any given moment. Hence, PR_TS_Opt keeps the entire silicon area active for all iterations (except the very first and very last one), which helps in fully utilizing the extreme off-chip bandwidth offered by modern technologies such as 3D stacked DRAM. We will demonstrate a practical example of this in Sec. V.

The cost of achieving off-chip traffic optimization is that we have to buffer two source vector segments in the fast memory instead of the one needed in PR_TS. As a result, for any given amount of fast storage, the maximum matrix dimension that PR_TS_Opt can handle is roughly half of the maximum matrix dimension of PR_TS. Therefore, PR_TS_Opt trades maximum sparse matrix dimension for performance and efficiency.

The communication reduction with PR_TS_Opt is depicted in Figure 5. We generated five uniformly random graphs of dimension 1B×1B with different sparsity, which is labeled on the x-axis. The striped gray and solid blue bars represent the total off-chip traffic for PageRank with 20 iterations using PR_TS and PR_TS_Opt, respectively. We can see that when the graph becomes sparser (i.e., lower average degree per vertex), the ratio of reduction in data transfer with iteration overlap gets larger. For example, with a very sparse graph of average degree 1.2 per vertex, 26% more off-chip DRAM traffic would be incurred if the optimization were not applied. This is because, with a sparser matrix, the data transfer due to intermediate vectors in the Two-Step SpMV operation becomes less significant relative to the data transfer due to source and resultant vectors.
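The trend described above can be reproduced with a simple first-order traffic model. The sketch below is an approximation under stated assumptions, not the accounting used for Figure 5: it assumes a COO matrix entry of two 32-bit indices plus a 32-bit value, an intermediate vector entry of one index plus one value, and no reduction in Step 1 (so the intermediate vectors hold as many entries as the matrix has non-zeros). The function names and the list of average degrees are hypothetical, and the model ignores the first and last iterations, in which the vectors still cross the chip boundary.

```python
def traffic_per_iteration(num_vertices, avg_degree, overlap,
                          index_bytes=4, value_bytes=4):
    """Approximate off-chip bytes moved in one steady-state PageRank iteration."""
    nnz = num_vertices * avg_degree
    matrix = nnz * (2 * index_bytes + value_bytes)        # streamed in once
    intermediate = 2 * nnz * (index_bytes + value_bytes)  # written, then read back
    # Dense source vector in, dense result vector out; removed by iteration overlap.
    vectors = 0 if overlap else 2 * num_vertices * value_bytes
    return matrix + intermediate + vectors


def extra_traffic_without_overlap(avg_degree, num_vertices=1_000_000_000):
    """Relative extra traffic of PR_TS over PR_TS_Opt under the assumed model."""
    optimized = traffic_per_iteration(num_vertices, avg_degree, overlap=True)
    unoptimized = traffic_per_iteration(num_vertices, avg_degree, overlap=False)
    return unoptimized / optimized - 1.0


# Hypothetical average degrees, chosen only to show the trend.
ratios = {d: extra_traffic_without_overlap(d) for d in (1.2, 2, 3, 5, 10)}
```

Under these assumptions the relative saving grows as the average degree falls, because the vector traffic is fixed per vertex while the matrix and intermediate traffic scale with the number of edges. The absolute percentages depend on the assumed entry sizes and are not expected to match Figure 5 exactly.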


IV. ASIC FOR PAGERANK

In this section we present the custom ASIC to accelerate PageRank, which is designed using Verilog and is currently being fabricated in 16nm FinFET technology. The block diagram of the overall accelerator is shown in Figure 6. The ASIC chip implements the computation logic required for the Two-Step SpMV algorithm. To conduct Step 1 of the Two-Step algorithm, sixteen parallel single precision floating point multiplier and adder chains are implemented. For Step 2, sixteen Multi-way Merge cores are designed, which are able to merge 2048 lists (intermediate vectors) in parallel and have an overall throughput of 16 resultant vector elements per cycle. A radix-sort based data parallelization technique is used to distribute load among the merge cores. However, the implementation details of the logic for data distribution, load balancing and synchronization are beyond the scope of this paper and, hence, skipped.

An actual image of the ASIC and key specifications are given in Figure 7. As the chip is currently being fabricated, these specifications are from post physical synthesis (after place and route) layout of the design. Cadence® Innovus™ is used for area and frequency measurement and Cadence® Voltus™ is used for power measurement. One key aspect of this chip is that it uses synthesized SRAM blocks, also known as Logic in Memory (LiM) technology [16]–[18], distributed all over the chip to facilitate fine grain data access during computation.

The two other parts of the accelerator, i.e. HBM main memory and the eDRAM scratchpad, are emulated using CACTI.

TABLE I: Fast memory requirement and largest graph dimension comparison of current and proposed solutions.

<table>
<thead>
<tr>
<th>Solution</th>
<th>Fast memory size (MB)</th>
<th>Max. vertices reported</th>
</tr>
</thead>
<tbody>
<tr>
<td>FPGA [3]</td>
<td>8.4</td>
<td>2.3M</td>
</tr>
<tr>
<td>ASIC [4]</td>
<td>32</td>
<td>8M</td>
</tr>
<tr>
<td>CPU (single socket) [5]</td>
<td>20</td>
<td>95M</td>
</tr>
<tr>
<td>CPU (dual socket) [6]</td>
<td>50</td>
<td>118M</td>
</tr>
<tr>
<td>PR_TS_Opt (proposed)</td>
<td>11</td>
<td>2B</td>
</tr>
<tr>
<td>PR_TS (proposed)</td>
<td>11</td>
<td>4B</td>
</tr>
</tbody>
</table>

![ASIC Specifications](image)

**ASIC specifications**
- Frequency: 1.4 GHz
- Occupied area: 7.5 mm²
- Leakage power: 0.10 W
- Dynamic power: 3.01 W
- Total power: 3.11 W

**Fast Memory Requirement.** For better scalability, one of the key goals of our proposed solution is to handle very large graphs without requiring a large amount of fast storage (such as SRAM or eDRAM based cache, scratchpad, etc.) for random access. In our proposed accelerator, the ASIC’s computation core requires 0.5MB of synthesized SRAM. For source vector segment storage in Step 1, it requires 8MB of eDRAM scratchpad. Additionally, for proper streaming in Step 2, DRAM page (row buffer) size blocks have to be prefetched while accessing the 2048 intermediate vectors. The page size of HBM2 is 1KB and we allocate 1.25KB to store prefetched data for each list (instead of 1KB) to hide loading latency. Hence, we require 2048 × 1.25KB = 2.5MB of eDRAM buffer for prefetched data. Therefore, the fast memory requirement of our proposed solution is (0.5MB + 8MB + 2.5MB) = 11MB. To put this into perspective, Table I lists other shared memory solutions for PageRank and SpMV against our proposed Two-Step SpMV driven PageRank with (PR_TS_Opt) and without (PR_TS) optimization by iteration overlap. We see that our proposed solutions can operate on much larger graphs than these solutions, in most cases with significantly less fast memory. This makes our solution easier to scale, since the need for bulk fast memory hinders scaling of the graph dimension in many current solutions, such as [4]. For example, if we increase the source vector buffer from 8MB to 16MB, we will be able to handle graphs with twice as many vertices with this ASIC, i.e., graphs of 4B and 8B vertices with PR_TS_Opt and PR_TS, respectively. It should be noted that the total number of edges only dictates the requirement for main memory storage and has negligible impact on the computation core design for our developed accelerator.
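The fast memory budget above, and the graph sizes it supports, can be checked with a short calculation. The budget follows the figures given in the text. The capacity model (maximum vertices = number of mergeable lists × elements per source vector segment, with 4-byte elements) is an inference from the 2048-list merge width and the 8MB segment buffer rather than a formula stated in the text, and the names used are hypothetical.

```python
KB = 1024
MB = 1024 * KB

core_sram = MB // 2                # 0.5MB synthesized SRAM in the computation core
segment_buffer = 8 * MB            # eDRAM scratchpad for source vector segments
num_lists = 2048                   # intermediate vectors merged in Step 2
prefetch_per_list = 5 * KB // 4    # 1.25KB per list (1KB HBM2 page plus slack)
prefetch_buffer = num_lists * prefetch_per_list

fast_memory = core_sram + segment_buffer + prefetch_buffer  # total in bytes


def max_vertices(segment_buffer_bytes, segments_buffered,
                 lists=num_lists, element_bytes=4):
    """Assumed capacity model: one column block per merged list, one 4B element per vertex."""
    segment_elements = segment_buffer_bytes // (segments_buffered * element_bytes)
    return lists * segment_elements


pr_ts = max_vertices(segment_buffer, segments_buffered=1)      # one segment held on chip
pr_ts_opt = max_vertices(segment_buffer, segments_buffered=2)  # two segments held on chip
```

Under this model `fast_memory` equals 11MB, and `pr_ts` and `pr_ts_opt` come out at 2^32 and 2^31 vertices, in line with the 4B and 2B entries of Table I. Doubling `segment_buffer` to 16MB doubles both values.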

V. EXPERIMENTAL RESULTS

We ran PageRank with 20 iterations on our proposed accelerator with the graphs listed in Table II. Graphs with the prefix ‘kr’ are generated using the Kronecker graph generator in [1]. The ones with the prefix ‘Sy’ are uniformly distributed random graphs representing the worst case scenario for that dimension and sparsity. The rest are real-world graphs from the cited sources.

**TABLE II: Graph data sets used for experiments.**

<table>
<thead>
<tr>
<th>Graph</th>
<th># Nodes (M)</th>
<th>Avg. Degree</th>
<th># Edges (M)</th>
</tr>
</thead>
<tbody>
<tr>
<td>kr24</td>
<td>16.7</td>
<td>16.1</td>
<td>268</td>
</tr>
<tr>
<td>kr25</td>
<td>33.5</td>
<td>31.3</td>
<td>1047</td>
</tr>
<tr>
<td>Twi-m</td>
<td>52.5</td>
<td>37.4</td>
<td>1963</td>
</tr>
<tr>
<td>PLD</td>
<td>42.9</td>
<td>14.5</td>
<td>623</td>
</tr>
<tr>
<td>Web</td>
<td>118</td>
<td>8.6</td>
<td>1014</td>
</tr>
<tr>
<td>Twi-f</td>
<td>61.6</td>
<td>23.8</td>
<td>1468</td>
</tr>
<tr>
<td>SD1</td>
<td>94.9</td>
<td>20.4</td>
<td>1936</td>
</tr>
<tr>
<td>Sy-.5B</td>
<td>500</td>
<td>3</td>
<td>1500</td>
</tr>
<tr>
<td>Sy-1B</td>
<td>1000</td>
<td>2</td>
<td>2000</td>
</tr>
<tr>
<td>Sy-2B</td>
<td>2000</td>
<td>1.1</td>
<td>2200</td>
</tr>
</tbody>
</table>

The 3D memory sub-system is emulated with the CACTI3D [19] tool assuming 4 channels (each HBM2 has 2 pseudo-channels with 64B I/O width). As the chip is currently in the fabrication facility, we used the specs given in Figure 7 and conducted cycle accurate Verilog simulation. Column blocks of the matrix are stored in row-major COO format, as hyper-sparsity causes CSR to be wasteful. Single precision is used for floating point values and 32 bits are used for all indices. We compared both our unoptimized (PR_TS) and optimized (PR_TS_Opt) implementations against two benchmarks. These recent works are Bmark1 [5] (single socket, 20MB LLC) and Bmark2 [6] (dual socket, 50MB LLC), CPU implementations of PageRank that reported comparably large graphs. Bmark1 also took part in the HPEC Graph Challenge 2017.

Figure 8 depicts the comparison of execution time for PageRank with 20 iterations among the proposed implementations and the benchmarks (for the graphs where data is reported). While being able to handle much larger graphs, our optimized proposed solution (PR_TS_Opt) is 12x and 7x faster than Bmark1 and Bmark2, respectively. It can be noticed that our unoptimized solution (PR_TS) is slower than PR_TS_Opt. Besides more off-chip traffic, the main reason is that the ASIC is provisioned to saturate 100% of the 3D DRAM bandwidth of the system when Step 1 and Step 2 of SpMV are running in parallel. Figure 9 depicts the maximum streaming speed of different parts of the chip for matrix ‘Sy-1B’. For PR_TS, the source vector load, partial SpMV and multi-way merge are conducted sequentially. The streaming speed of the logic cores for these tasks is below what the system can provide. On the other hand, for PR_TS_Opt all the tasks in Step 1, i.e. source vector load and partial SpMV, run in parallel with the merging task in Step 2. Thus, with the same amount of silicon real estate, we can attain much higher streaming speed. As shown in Figure 9, the maximum sustained streaming speed of PR_TS_Opt is well over the system’s 512GB/s and can actually saturate almost three HBM2s (768GB/s).

The bandwidth utilization is given in Figure 10. Due to the fully streaming algorithm, our proposed implementations achieve significantly higher bandwidth utilization than Bmark2. As explained previously, PR_TS_Opt achieves ~97% utilization for all graphs due to the overlap of Step 1 and Step 2 across iterations and a streaming speed exceeding that of two HBM2s.

Another performance metric we used is the number of edges traversed per second. We avoided using the GFLOP/s metric as it is data dependent and not representative of the system’s capability to process sparse data. As shown in Figure 11, PR_TS_Opt provides a 7x higher edge traversal rate than Bmark2. Furthermore, we have compared the energy efficiency in Figure 12. Despite counting the entire system’s energy for PR_TS and PR_TS_Opt against only the DRAM energy for Bmark2, it is evident that our proposed system is up to two orders of magnitude more efficient. This is due to shorter execution time, small fast memory, less off-chip traffic and efficient 3D DRAM.

VI. CONCLUSION

In this work we have developed a custom ASIC hardware accelerator with 3D DRAM for PageRank that can operate on very large graphs (~billion nodes), while providing high performance and energy efficiency. This solution guarantees full DRAM streaming access and proper utilization of off-chip bandwidth. Moreover, it is readily scalable as it requires less fast random access memory than most current architectures in the literature. The key to these achievements is the use of the Two-Step SpMV algorithm. As COTS architectures are not suitable for this algorithm, we have developed a custom ASIC for PageRank implementation. Additionally, we have proposed an optimization technique that reduces off-chip traffic and increases the streaming speed of the computation core. Due to its reasonable hardware resource requirements, this ASIC accelerator design can also be ported to FPGA based platforms.

ACKNOWLEDGMENT

This work was supported in part by Defense Advanced Research Projects Agency (DARPA) contract HR0011-16-C-0038, “Circuit Realization At Faster Timescales (CRAFT)”. This material is also based upon work funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center [DM18-0845]. Additionally, this work was sponsored by DARPA contract HR0011-13-2-0007, “Power Efficiency Revolution for Embedded Computing Technologies (PERFECT)”. The view, opinions, and/or findings contained in this material are those of the author(s) and should not be construed as an official Government position, policy, or decision, unless designated by other documentation.

REFERENCES


