CAREER: A Unified Framework for Designing Efficient Resource-Oblivious Parallel Algorithms
(NSF Award CNS-1553510, 2016 — 2021, PI: Rezaul Chowdhury)

This project will develop fundamental theory and efficient tools to facilitate the design of parallel algorithms that make no mention of any hardware parameters in the code, but still run efficiently across a wide spectrum of parallel computing platforms ranging from small laptop computers to large supercomputers. These resource-oblivious algorithms will enable efficient parallel code regardless of the ever-changing underlying hardware platforms allowing more focus on the correctness of implementations without the need of optimizing for a particular hardware. Porting code from one machine to another will become easier or effortless.

The project will concentrate mainly on algorithms based on recursive divide and conquer — a very powerful and widely used algorithm design technique.

Peer-reviewed Publications (Partially) Supported by the Award

¤ Zafar Ahmad, Rezaul Chowdhury, Rathish Das, Pramod Ganapathi, Aaron Gregory, and Yimin Zhu, "Fast Stencil Computations using Fast Fourier Transforms", Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2021), Virtual Event, pp. 8-21, 2021. (Outstanding Paper Award Recipient)

Abstract: Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods.

In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver which is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform $$\Theta(NT)$$ work, where $$N$$ is the size of the spatial grid and $$T$$ is the number of timesteps, our algorithms perform $$o(NT)$$ work.

To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions.

Initial experimental results show that implementations of our algorithms that evolve grids of roughly $$10^7$$ cells for around $$10^5$$ timesteps run orders of magnitude faster than state-of-the-art implementations for periodic stencil problems, and $$1.3\times$$ to $$8.5\times$$ faster for aperiodic stencil problems.

¤ Zafar Ahmad, Rezaul Chowdhury, Rathish Das, Pramod Ganapathi, Aaron Gregory, and Mohammad Mahdi Javanmard, "Low-Span Parallel Algorithms for the Binary-Forking Model", Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2021), Virtual Event, pp. 22-34, 2021. (Another Outstanding Paper Award Recipient)

Abstract: The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of $$\Theta(\log{n})$$ to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model and the non-constant synchronization cost makes the task challenging.

In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blowup in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT).

Our major results are as follows: $$(1)$$ A randomized comparison-based sorting algorithm with optimal $$O(\log{n})$$ span and $$O(n \log{n})$$ work, both w.h.p. in $$n$$. (2) An optimal $$O(\log{n})$$ span algorithm for Strassen's matrix multiplication (MM) with only a $$\log{\log{n}}$$-factor blow-up in work as well as a near-optimal $$O(\log{n}\log{\log{\log{n}}})$$ span algorithm with no asymptotic blow-up in work. (3) A near-optimal $$O(\log{n}\log{\log{\log{n}}})$$ span Fast Fourier Transform (FFT) algorithm with less than a $$\log{n}$$-factor blow-up in work for all practical values of $$n$$ (i.e., $$n \leq 10^{10,000}$$).

¤ Mohammad Mahdi Javanmard, Zafar Ahmad, Jaroslaw Zola, Louis-Noël Pouchet, Rezaul Chowdhury, and Robert Harrison, "Efficient Execution of Dynamic Programming Algorithms on Apache Spark", Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2020), Virtual Event, pp. 337-348, 2020.

Abstract: One of the most important properties of distributed computing systems (e.g., Apache Spark, Apache Hadoop, etc.) on clusters and computation clouds is the ability to scale out by adding more compute nodes to the cluster. This important feature can lead to performance gain provided the computation (or the algorithm) itself can scale out. In other words, the computation (or the algorithm) should be easily decomposable into smaller units of work to be distributed among the workers based on the hardware/software configuration of the cluster or the cloud. Additionally, on such clusters, there is an important trade-off between communication cost, parallelism, and memory requirement. Due to the scalability need as well as this trade-off, it is crucial to have a well-decomposable, adaptive, tunable, and scalable program. Tunability enables the programmer to find an optimal point in the trade-off spectrum to execute the program efficiently on a specific cluster.

We design and implement well-decomposable and tunable dynamic programming algorithms from the Gaussian Elimination Paradigm (GEP), such as Floyd-Warshall's all-pairs shortest path and Gaussian elimination without pivoting, for execution on Apache Spark. Our implementations are based on parametric multi-way recursive divide-&-conquer algorithms. We explain how to map implementations of those grid-based parallel algorithms to the Spark framework. Finally, we provide experimental results illustrating the performance, scalability, and portability of our Spark programs. We show that offloading the computation to an OpenMP environment (by running parallel recursive kernels) within Spark is at least partially responsible for a $$2 - 5\times$$ speedup of the DP benchmarks.

¤ Michael Bender, Rezaul Chowdhury, Rathish Das, Rob Johnson, William Kuszmaul, Andrea Lincoln, Quanquan Liu, Jayson Lynch, and Helen Xu, "Closing the Gap Between Cache-oblivious and Cache-adaptive Analysis", Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2020), Virtual Event, pp. 63-73, 2020.

Abstract: Cache-adaptive analysis was introduced to analyze the performance of an algorithm when the cache (or internal memory) available to the algorithm dynamically changes size. These memory-size fluctuations are, in fact, the common case in multi-core machines, where threads share cache and RAM. An algorithm is said to be efficiently cache-adaptive if it achieves optimal utilization of the dynamically changing cache.

Cache-adaptive analysis was inspired by cache-oblivious analysis. Many (or even most) optimal cache-oblivious algorithms have an $$(a,b,c)$$-regular recursive structure. Such $$(a,b,c)$$-regular algorithms include Longest Common Subsequence, All Pairs Shortest Paths, Matrix Multiplication, Edit Distance, Gaussian Elimination Paradigm, etc. Bender et al. (2016) showed that some of these optimal cache-oblivious algorithms remain optimal even when cache changes size dynamically, but that in general they can be as much as logarithmic factor away from optimal. However, their analysis depends on constructing a highly structured, worst-case memory profile, or sequences of fluctuations in cache size. These worst-case profiles seem fragile, suggesting that the logarithmic gap may be an artifact of an unrealistically powerful adversary.

We close the gap between cache-oblivious and cache-adaptive analysis by showing how to make a smoothed analysis of cache-adaptive algorithms via random reshuffling of memory fluctuations. Remarkably, we also show the limits of several natural forms of smoothing, including random perturbations of the cache size and randomizing the algorithm's starting time. Nonetheless, we show that if one takes an arbitrary profile and performs a random shuffle on when "significant events" occur within the profile, then the shuffled profile becomes optimally cache-adaptive in expectation, even when the initial profile is adversarially constructed.

These results suggest that cache-obliviousness is a solid foundation for achieving cache-adaptivity when the memory profile is not overly tailored to the algorithm structure.

¤ Rezaul Chowdhury, Francesco Silvestri, and Flavio Vella, "Brief Announcement: A Computational Model for Tensor Core Units", Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2020), Virtual Event, pp. 519-521, 2020.

Abstract: To respond to the need for efficient training and inference of deep neural networks, a plethora of domain-specific architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is the design for efficiently computing a dense matrix product of a given small size. In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for several problems, including dense and sparse matrix multiplication and the Discrete Fourier Transform. We finally highlight a relation between the TCU model and the external memory model

¤ Mohammad Mahdi Javanmard, Zafar Ahmad, Martin Kong, Louis-Noël Pouchet, Rezaul Chowdhury, and Robert Harrison, "Deriving Parametric Multi-way Recursive Divide-and-Conquer Dynamic Programming Algorithms using Polyhedral Compilers", Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2020), San Diego, CA, USA, pp. 317-329, 2020.

Abstract: We present a novel framework to automatically derive highly efficient parametric multi-way recursive divide-&-conquer algorithms for a class of dynamic programming (DP) problems. Standard two-way or any fixed $$R$$-way recursive divide-&-conquer algorithms may not fully exploit many-core processors. To run efficiently on a given machine, the value of $$R$$ may need to be different for every level of recursion based on the number of processors available and the sizes of memory/caches at different levels of the memory hierarchy. The set of $$R$$ values that work well on a given machine may not work efficiently on another machine with a different set of machine parameters. To improve portability and efficiency, Multi-way AutoGen generates parametric multi-way recursive divide-&-conquer algorithms where the value of $$R$$ can be changed on the fly for every level of recursion. We present experimental results demonstrating the performance and scalability of the parallel programs produced by our framework.

¤ Rathish Das, Shih-Yu Tsai, Sharmila Duppala, Jayson Lynch, Esther Arkin, Rezaul Chowdhury, Joseph Mitchell, and Steven Skiena, "Data Races and the Discrete Resource-time Tradeoff Problem with Resource Reuse over Paths", Proceedings of the 31st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2019), Phoenix, AZ, USA, pp. 359-368, 2019.

Abstract: A determinacy race occurs if two or more logically parallel instructions access the same memory location and at least one of them tries to modify its content. Races are often undesirable as they can lead to nondeterministic and incorrect program behavior. A data race is a special case of a determinacy race which can be eliminated by associating a mutual-exclusion lock with the memory location in question or allowing atomic accesses to it. However, such solutions can reduce parallelism by serializing all accesses to that location. For associative and commutative updates to a memory cell, one can instead use a reducer, which allows parallel race-free updates at the expense of using some extra space. More extra space usually leads to more parallel updates, which in turn contributes to potentially lowering the overall execution time of the program.

We start by asking the following question. Given a fixed budget of extra space for mitigating the cost of races in a parallel program, which memory locations should be assigned reducers and how should the space be distributed among those reducers in order to minimize the overall running time? We argue that under reasonable conditions the races of a program can be captured by a directed acyclic graph (DAG), with nodes representing memory cells and arcs representing read-write dependencies between cells. We then formulate our original question as an optimization problem on this DAG. We concentrate on a variation of this problem where space reuse among reducers is allowed by routing every unit of extra space along a (possibly different) source to sink path of the DAG and using it in the construction of multiple (possibly zero) reducers along the path. We consider two different ways of constructing a reducer and the corresponding duration functions (i.e., reduction time as a function of space budget).

We generalize our race-avoiding space-time tradeoff problem to a discrete resource-time tradeoff problem with general non-increasing duration functions and resource reuse over paths of the given DAG.

For general DAGs, we show that even if the entire DAG is available offline the problem is strongly NP-hard under all three duration functions, and we give approximation algorithms for solving the corresponding optimization problems. We also prove hardness of approximation for the general resource-time tradeoff problem and give a pseudo-polynomial time algorithm for series-parallel DAGs.

¤ Mohammad Mahdi Javanmard, Pramod Ganapathi, Rathish Das, Zafar Ahmad, Stephen Tschudi, and Rezaul Chowdhury, "Toward Efficient Architecture-Independent Algorithms for Dynamic Programs", Proceedings of the International Conference on High Performance Computing (ISC-HPC 2019), Frankfurt, Germany, pp. 143-164, 2019.

Abstract: We argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms to run efficiently under both shared-memory (multi- and manycores) and distributed-memory settings. The depth-first recursive decomposition of tasks and data is known to allow computations with potentially high temporal locality, and automatic adaptivity when resource availability (e.g., available space in shared caches) changes during runtime. Higher data locality leads to better intra-node I/O and cache performance and lower inter-node communication complexity, which in turn can reduce running times and energy consumption. Indeed, we show that a class of grid-based parallel recursive divide-and-conquer algorithms (for dynamic programs) can be run with provably optimal or near-optimal performance bounds on fat cores (cache complexity), thin cores (data movements), and purely distributed-memory machines (communication complexity) without changing the algorithm�s basic structure.

Two-way recursive divide-and-conquer algorithms are known for solving dynamic programming (DP) problems on shared-memory multicore machines. In this paper, we show how to extend them to run efficiently also on manycore GPUs and distributed-memory machines.

Our GPU algorithms work efficiently even when the data is too large to fit into the host RAM. These are external-memory algorithms based on recursive $$r$$-way divide and conquer, where $$r$$ ( $$\geq 2$$ ) varies based on the current depth of the recursion. Our distributed-memory algorithms are also based on multi-way recursive divide and conquer that extends naturally inside each shared-memory multicore/manycore compute node. We show that these algorithms are work-optimal and have low latency and bandwidth bounds.

We also report empirical results for our GPU and distributed-memory algorithms.

Preliminary Results
 — Mohammad Mahdi Javanmard, Pramod Ganapathi, Rathish Das, Zafar Ahmad, Stephen Tschudi, and Rezaul Chowdhury, "Toward Efficient Architecture-Independent Algorithms for Dynamic Programs: Poster", Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2019), Washington, DC, USA, pp. 413-414, 2019.
¤ Rezaul Chowdhury, Pramod Ganapathi, Yuan Tang, and Jesmin Jahan Tithi, "Provably Efficient Scheduling of Cache-oblivious Wavefront Algorithms", Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2017), Phoenix, AZ, USA, pp. 339-350, 2017.

Abstract: Iterative wavefront algorithms for evaluating dynamic programming recurrences exploit optimal parallelism but show poor cache performance. Tiled-iterative wavefront algorithms achieve optimal cache complexity and high parallelism but are cache-aware and hence are not portable and not cache-adaptive. On the other hand, standard cache-oblivious recursive divide-and-conquer algorithms have optimal serial cache complexity but often have low parallelism due to artificial dependencies among subtasks. Recently, we introduced cache-oblivious recursive wavefront (COW) algorithms, which do not have any artificial dependencies, but they are too complicated to develop, analyze, implement, and generalize. Though COW algorithms are based on fork-join primitives, they extensively use atomic operations for ensuring correctness, and as a result, performance guarantees (i.e., parallel running time and parallel cache complexity) provided by state-of-the-art schedulers (e.g., the randomized work-stealing scheduler) for programs with fork-join primitives do not apply. Also, extensive use of atomic locks may result in high overhead in implementation.

In this paper, we show how to systematically transform standard cache-oblivious recursive divide-and-conquer algorithms into recursive wavefront algorithms to achieve optimal parallel cache complexity and high parallelism under state-of-the-art schedulers for fork-join programs. Unlike COW algorithms these new algorithms do not use atomic operations. Instead, they use closed-form formulas to compute the time when each divide-and-conquer function must be launched in order to achieve high parallelism without losing cache performance. The resulting implementations are arguably much simpler than implementations of known COW algorithms. We present theoretical analyses and experimental performance and scalability results showing a superiority of these new algorithms over existing algorithms.

Preliminary Results
 — Rezaul Chowdhury, Pramod Ganapathi, Yuan Tang, and Jesmin Jahan Tithi, "Provably Efficient Scheduling of Cache-oblivious Wavefront Algorithms: Poster", Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2017), Austin, TX, USA, pp. 435-436, 2017. — Jesmin Jahan Tithi, Pramod Ganapathi, Rezaul Chowdhury, and Yuan Tang, "Cache-oblivious Wavefront Algorithms for Dynamic Programming Problems: Efficient Scheduling with Optimal Cache Performance and High Parallelism", High Performance Computing, Networking Storage and Analysis (SC 2016), Salt Lake City, UT, 2016.
¤ Rezaul Chowdhury, Pramod Ganapathi, Stephen Tschudi, Jesmin Jahan Tithi, Charles Bachmeier, Charles Leiserson, Armando Solar-Lezama, Bradley Kuszmaul, and Yuan Tang, "Autogen: Automatic Discovery of Efficient Recursive Divide-&-Conquer Algorithms for Solving Dynamic Programming Problems", ACM Transactions on Parallel Computing (Special Issue for Top Papers from PPoPP'16), vol. 4(1), pp. 1-30, 2017.

Abstract: We present AUTOGEN — an algorithm that for a wide class of dynamic programming (DP) problems automatically discovers highly efficient cache-oblivious parallel recursive divide-and-conquer algorithms from inefficient iterative descriptions of DP recurrences. AUTOGEN analyzes the set of DP table locations accessed by the iterative algorithm when run on a DP table of small size and automatically identifies a recursive access pattern and a corresponding provably correct recursive algorithm for solving the DP recurrence. We use AUTOGEN to autodiscover efficient algorithms for several well-known problems. Our experimental results show that several auto-discovered algorithms significantly outperform parallel looping and tiled loop-based algorithms. Also, these algorithms are less sensitive to fluctuations of memory and bandwidth compared with their looping counterparts, and their running times and energy profiles remain relatively more stable. To the best of our knowledge, AUTOGEN is the first algorithm that can automatically discover new nontrivial divide-and-conquer algorithms

Preliminary Results
 — Rezaul Chowdhury, Pramod Ganapathi, Stephen Tschudi, Jesmin Jahan Tithi, Charles Bachmeier, Charles Leiserson, Armando Solar-Lezama, Bradley Kuszmaul, and Yuan Tang, "AutoGen: Automatic Discovery of Cache-oblivious Parallel Recursive Algorithms for Solving Dynamic Programs", Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016), Barcelona, Spain, article 10, 2016. (ACM SIGPLAN Notices, vol. 51(8)).
¤ Shachar Itzhaky, Rohit Singh, Armando Solar-Lezama, Kuat Yessenov, Yongquan Lu, Charles Leiserson, and Rezaul Chowdhury, "Deriving Divide-and-Conquer Dynamic Programming Algorithms Using Solver-Aided Transformations", Proceedings of the ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages & Applications (OOPSLA 2016), Amsterdam, Netherlands, pp. 145-164, 2016.

Abstract: We introduce a framework allowing domain experts to manipulate computational terms in the interest of deriving better, more efficient implementations.It employs deductive reasoning to generate provably correct efficient implementations from a very high-level specification of an algorithm, and inductive constraint-based synthesis to improve automation. Semantic information is encoded into program terms through the use of refinement types.

In this paper, we develop the technique in the context of a system called Bellmania that uses solver-aided tactics to derive parallel divide-and-conquer implementations of dynamic programming algorithms that have better locality and are significantly more efficient than traditional loop-based implementations. Bellmania includes a high-level language for specifying dynamic programming algorithms and a calculus that facilitates gradual transformation of these specifications into efficient implementations. These transformations formalize the divide-and conquer technique; a visualization interface helps users to interactively guide the process, while an SMT-based back-end verifies each step and takes care of low-level reasoning required for parallelism.

We have used the system to generate provably correct implementations of several algorithms, including some important algorithms from computational biology, and show that the performance is comparable to that of the best manually optimized code.

¤ Rezaul Chowdhury, Pramod Ganapathi, Vivek Pradhan, Jesmin Jahan Tithi, and Yunpeng Xiao, "An Efficient Cache-oblivious Parallel Viterbi Algorithm", Proceedings of the 22nd International European Conference on Parallel and Distributed Computing (EuroPar 2016), Grenoble, France, LNCS 9833, pp. 574-587, 2016.

Abstract: The Viterbi algorithm is used to find the most likely path through a hidden Markov model given an observed sequence, and has numerous applications. Due to its importance and high computational complexity, several algorithmic strategies have been developed to parallelize it on different parallel architectures. However, none of the existing Viterbi decoding algorithms designed for modern computers with cache hierarchies is simultaneously cache-efficient and cache-oblivious. Being oblivious of machine resources (e.g., caches and processors) while also being efficient promotes portability. In this paper, we present an efficient cache- and processor-oblivious Viterbi algorithm based on rank convergence. The algorithm builds upon the parallel Viterbi algorithm of Maleki et al. (PPoPP 2014). We provide empirical analysis of our algorithm by comparing it with Maleki et al.'s algorithm.

PhD Theses on Research (Partially) Supported by the Award

¤ Pramod Ganapathi (supervised by Rezaul Chowdhury), "Automatic Discovery of Efficient Divide-&-Conquer Algorithms for Dynamic Programming Problems", PhD Thesis, Department of Computer Science, Stony Brook University, December 2016.

Abstract: This dissertation is an excursion into computer-discovered algorithms and computer-aided algorithm design. In our vision of the not-so-distant future of computing, machines will take the responsibility of designing and implementing most, if not all, machine-specific highly efficient nontrivial algorithms with provable correctness and performance guarantees. Indeed, as the complexity of computing hardware grows and their basic architectures keep changing rapidly, manually redesigning and reimplementing key algorithms for high performance on every new architecture is quickly becoming an impossible task. Automation is needed.

We design algorithms that can design other algorithms. Our focus is on auto-generating algorithms for solving dynamic programming (DP) recurrences efficiently on state-of-the-art parallel computing hardware. While DP recurrences can be solved very easily using nested or even tiled nested loops, for high performance on modern processors/coprocessors with cache hierarchies, portability across machines, and automatic adaptivity to runtime fluctuations in the availability of shared resources (e.g., cache space) highly nontrivial recursive divide-and-conquer algorithms are needed. But, these recursive divide-and-conquer algorithms are difficult to design and notoriously hard to implement for high performance. Furthermore, new DP recurrences are being encountered by scientists every day for solving brand new problems in diverse application areas ranging from economics to computational biology. But, how does an economist or a biologist without any formal training in computer science design an efficient algorithm for evaluating his/her DP recurrence on a computer? Well, we can help!

We present AUTOGEN — an algorithm that given any blackbox implementation of a DP recurrence (e.g., inefficient naive serial loops) from a wide class of DP problems, can automatically discover a fast recursive divide-and-conquer algorithm for solving that problem on a shared-memory multicore machine. We mathematically formalize AUTOGEN, prove its correctness, and provide theoretical performance guarantees for the auto-discovered algorithms. These auto-generated algorithms are shown to be efficient (e.g., highly parallel with highly optimizable kernels, and cache-, energy-, and bandwidth-efficient), portable (i.e., cache- and processor-oblivious), and robust (i.e., cache- and processor-adaptive).

We present AUTOGEN-WAVE — a framework for computer-assisted discovery of fast divide-and-conquer wavefront versions of the algorithms already generated by AUTOGEN. These recursive wavefront algorithms retain all advantages of the AUTOGEN-discovered algorithms on which they are based, but have better and near-optimal parallelism due to the wavefront order of execution. We also show how to schedule these algorithms with provable performance guarantees on multicore machines.

We present VITERBI — an efficient cache-oblivious parallel algorithm to solve the Viterbi recurrence, as a first step toward extending AUTOGEN to handle DP recurrences with irregular data-dependent dependencies. Our algorithm beats the current fastest Viterbi algorithm in both theory and practice.

We present AUTOGEN-FRACTILE — a framework for computer-aided design of high-performing and easily portable recursively tiled/blocked algorithms for a wide class of DP problems having fractal-type dependencies. These recursively tiled algorithms have excellent cache locality and excellent parallel running time on multi-/many-core machines with deep memory hierarchy.

We present AUTOGEN-TRADEOFF — a framework that can be used to design efficient and portable not-in-place algorithms to asymptotically increase the parallelism of some of the AUTOGEN-discovered algorithms without affecting cache-efficiency.

This dissertation takes the first major step on which to build computing systems for automatically designing efficient and portable nontrivial algorithms. We hope that more work follows in this science of algorithm discovery.

¤ Mohammad Mahdi Javanmard (supervised by Rezaul Chowdhury and Robert Harrison), "Parametric Multi-way Recursive Divide-and-Conquer Algorithms for Dynamic Programs", PhD Thesis, Department of Computer Science, Stony Brook University, March 2020.

Abstract: A typical modern supercomputer is a network of distributed-memory hybrid compute nodes where each node usually has both latency-optimized multicores (a.k.a. fat cores) and throughput-optimized manycores (a.k.a. thin cores) connected through a multilevel memory hierarchy. The same supercomputer often hosts compute nodes of various configurations. Many of them are upgraded to the next state-of-the-art within four or five years of becoming operational.

To design and implement efficient algorithms for the heterogeneous and constantly evolving modern supercomputers the need for performance portability across architectures must be taken into account. We show that a class of grid-based parallel multi-way recursive divide-and-conquer algorithms for solving Dynamic Programs (DP) can be executed efficiently with provably optimal or near-optimal performance bounds on fat cores (cache complexity), thin cores (data movements) and purely distributed-memory machines (communication complexity) without any changes in the basic structure of the algorithm. We generate such an algorithm from its standard two-way counterpart by multi-level inlining of the recursion and then making each recursive function call as early as possible without violating the dependency constraints.

We present a novel framework — MULTI-WAY AUTOGEN — that uses polyhedral compiler transformations for systematically deriving a multi-way recursive divide-and-conquer algorithm for solving a DP from an iterative specification of the DP. The framework takes advantage of polyhedral transformations including mono-parametric tiling, and index-set splitting.

We show that our parametric multi-way recursive divide-and-conquer algorithms can be implemented to run efficiently on Apache Spark. To perform well on distributed computing systems, such as Apache Spark, running on clusters and computation clouds, an algorithm must have the ability to scale out. Additionally, such clusters have a trade-off between communication cost and parallelism. Therefore, it is crucial to have well-decomposable, adaptive, tunable, and scalable programs, and we show that our parametric multi-way recursive divide-and-conquer algorithms have those properties. We provide experimental results illustrating the performance, scalability and portability of the Apache Spark programs for various DP algorithms.

Other Peer-reviewed Publications (Partially) Supported by the Award

¤ Rezaul Chowdhury and Vijaya Ramachandran, "Cache-oblivious Buffer Heap and Cache-efficient Computation of Shortest Paths in Graphs", ACM Transactions on Algorithms (TALG), vol. 14(1), pp. 1-33, 2018.

Abstract: We present the buffer heap, a cache-oblivious priority queue that supports Delete-Min, Delete and a hybrid Insert/Decrease-Key operation in $$O\left( \frac{1}{B} \log_{2}{\frac{N}{M}} \right)$$ amortized block transfers from main memory, where $$M$$ and $$B$$ are the (unknown) cache size and block-size, respectively, and $$N$$ is the number of elements in the queue. We introduce the notion of a slim data structure which captures the situation when only a limited portion of the cache which we call a slim cache, is available to the data structure to retain data between data structural operations. We show that a buffer heap automatically adapts to such an environment and supports all operations in $$O\left( \frac{1}{\lambda} + \frac{1}{B} \log_{2}{\frac{N}{\lambda}} \right)$$ amortized block transfers each when the size of the slim cache is $$\lambda$$. Our results provide substantial improvements over known trivial cache performance bounds for cache-obliviouspriority queues with Decrease-Keys.

Using the buffer heap we present cache-oblivious implementations of Dijkstra's algorithm for undirected and directed single-source shortest path (SSSP) problems for graphs with non-negative real edge-weights. On a graph with $$n$$ vertices and $$m$$ edges, our algorithm for the undirected case performs $$O\left( n + \frac{m}{B} \log_{2}{\frac{n}{M}} \right)$$ block transfers and for the directed case performs $$O\left( \left(n + \frac{m}{B}\right) \cdot \log_{2}{\frac{n}{B}} \right)$$ block transfers. These results give the first non-trivial cache-oblivious bounds for the SSSP problem on general graphs. For the all-pairs shortest path (APSP) problem on weighted undirected graphs, we incorporate slim buffer heaps into multi-buffer-buffer-heaps and use these to improve the cache-aware cache complexity. We also present a simple cache-oblivious APSP algorithm for unweighted undirected graphs that performs $$O\left( \frac{mn}{B} \log_{M/B} \frac{n}{B} \right)$$ block transfers. This matches the cache-aware bound and is a substantial improvement over the previous cache oblivious bound for the problem.

¤ Michael Bender, Rezaul Chowdhury, Pramod Ganapathi, Samuel McCauley, and Yuan Tang, "The Range 1 Query (R1Q) Problem", Theoretical Computer Science (Invited Paper from COCOON'14), vol. 743, pp. 130-147, 2018.

Abstract: We define the range 1 query (R1Q) problem as follows. Given a $$d$$-dimenisional $$( d\geq 1)$$ input bit matrix $$A$$ (consisting of 0's and 1's), preprocess $$A$$ so that for any given region $$R$$ of $$A$$, efficiently answer queries asking if $$R$$ contains a 1 or not. We consider both orthogonal and non-orthogonal shapes for $$R$$ including rectangles, axis-parallel right-triangles, certain types of polygons, axis-parallel right simplices and spheres. We provide space-efficient deterministic and randomized algorithms with constant query times (in constant dimensions) for solving the problem in the word-RAM model. The space usage in bits is sublinear, linear, or near linear in the size of $$A$$, depending on the algorithm.

¤ Shih-Yu Tsai, Hao-Tsung Yang, Kin Sum Liu, Shan Lin, Rezaul Chowdhury, and Jie Gao, "Multi-channel Assignment and Link Scheduling for Prioritized Latency-Sensitive Applications", Proceedings of the International Symposium on Algorithms and Experiments for Sensor Systems, Wireless Networks and Distributed Robotics (ALGOSENSORS 2019), Munich, Germany, pp. 137-157, 2019.

Abstract: Current wireless networks mainly focus on delay-tolerant applications while demands for latency-sensitive applications are rising with VR/AR technologies and machine-to-machine IoT applications. In this paper we consider multi-channel, multi-radio scheduling at the MAC layer to optimize for the performance of prioritized, delay-sensitive demands. Our objective is to design an interference-free schedule that minimizes the maximum weighted refresh time among all edges, where the refresh time of an edge is the maximum number of time slots between two successive slots of that edge and the weights reflect given priorities. In the single-antenna unweighted case with $$k$$ channels and $$n$$ transceivers, the scheduling problem reduces to the classical edge coloring problem when $$k \geq \lfloor n/2 \rfloor$$ and to strong edge coloring when $$k=1$$, but it is neither edge coloring nor strong edge coloring for general $$k$$. Further, the priority requirement introduces extra challenges. In this paper we provide a randomized algorithm with an approximation factor of $$\tilde{O}\left( \max \left\{ \sqrt{\Delta_p }, \frac{\Delta_p}{\sqrt{k}} \right\} \log m \right )$$ in expectation, where $$\Delta_p$$ denotes the maximum degree of the unweighted multi-graph, which is formed by duplicating each edge $$e_i$$ for $$w_i$$ times ($$w_i$$ is $$e_i$$'s integral priority value), and $$m$$ is the number of required link communications ($$f(n) \in \tilde{O}(h(n))$$ means that $$f(n) \in O\left( h(n) \log^k(h(n)) \right)$$ for some positive constant $$k$$). The results are generalized to the multi-antenna settings. We evaluate the performance of our methods in different settings using simulations.