This project will develop fundamental theory and efficient tools to facilitate the design of parallel algorithms that make no mention of any hardware parameters in the code, but still run efficiently across a wide spectrum of parallel computing platforms ranging from small laptop computers to large supercomputers. These resourceoblivious algorithms will enable efficient parallel code regardless of the everchanging underlying hardware platforms allowing more focus on the correctness of implementations without the need of optimizing for a particular hardware. Porting code from one machine to another will become easier or effortless. The project will concentrate mainly on algorithms based on recursive divide and conquer — a very powerful and widely used algorithm design technique. 

¤  Zafar Ahmad, Rezaul Chowdhury, Rathish Das, Pramod Ganapathi, Aaron Gregory, and Yimin Zhu, "Fast Stencil Computations using Fast Fourier Transforms", Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2021), Virtual Event, pp. 821, 2021. (Outstanding Paper Award Recipient)  
Show More
Show Less
Abstract: Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The stateoftheart techniques in this area fall into three groups: cacheaware tiled looping algorithms, cacheoblivious divideandconquer trapezoidal algorithms, and Krylov subspace methods. In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver which is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform \(\Theta(NT)\) work, where \(N\) is the size of the spatial grid and \(T\) is the number of timesteps, our algorithms perform \(o(NT)\) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of our algorithms that evolve grids of roughly \(10^7\) cells for around \(10^5\) timesteps run orders of magnitude faster than stateoftheart implementations for periodic stencil problems, and \(1.3\times\) to \(8.5\times\) faster for aperiodic stencil problems. 

¤  Zafar Ahmad, Rezaul Chowdhury, Rathish Das, Pramod Ganapathi, Aaron Gregory, and Mohammad Mahdi Javanmard, "LowSpan Parallel Algorithms for the BinaryForking Model", Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2021), Virtual Event, pp. 2234, 2021. (Another Outstanding Paper Award Recipient)  
Show More
Show Less
Abstract: The binaryforking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of \(\Theta(\log{n})\) to spawn or synchronize n tasks or threads. The binaryforking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore sharedmemory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binaryforking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binaryforking model and the nonconstant synchronization cost makes the task challenging. In this paper, we show that in the binaryforking model we can achieve optimal or nearoptimal span with negligible or no asymptotic blowup in work for comparisonbased sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT). Our major results are as follows: \((1)\) A randomized comparisonbased sorting algorithm with optimal \(O(\log{n})\) span and \(O(n \log{n})\) work, both w.h.p. in \(n\). (2) An optimal \(O(\log{n})\) span algorithm for Strassen's matrix multiplication (MM) with only a \(\log{\log{n}}\)factor blowup in work as well as a nearoptimal \(O(\log{n}\log{\log{\log{n}}})\) span algorithm with no asymptotic blowup in work. (3) A nearoptimal \(O(\log{n}\log{\log{\log{n}}})\) span Fast Fourier Transform (FFT) algorithm with less than a \(\log{n}\)factor blowup in work for all practical values of \(n\) (i.e., \(n \leq 10^{10,000}\)). 

¤  Mohammad Mahdi Javanmard, Zafar Ahmad, Jaroslaw Zola, LouisNoël Pouchet, Rezaul Chowdhury, and Robert Harrison, "Efficient Execution of Dynamic Programming Algorithms on Apache Spark", Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2020), Virtual Event, pp. 337348, 2020.  
Show More
Show Less
Abstract: One of the most important properties of distributed computing systems (e.g., Apache Spark, Apache Hadoop, etc.) on clusters and computation clouds is the ability to scale out by adding more compute nodes to the cluster. This important feature can lead to performance gain provided the computation (or the algorithm) itself can scale out. In other words, the computation (or the algorithm) should be easily decomposable into smaller units of work to be distributed among the workers based on the hardware/software configuration of the cluster or the cloud. Additionally, on such clusters, there is an important tradeoff between communication cost, parallelism, and memory requirement. Due to the scalability need as well as this tradeoff, it is crucial to have a welldecomposable, adaptive, tunable, and scalable program. Tunability enables the programmer to find an optimal point in the tradeoff spectrum to execute the program efficiently on a specific cluster. We design and implement welldecomposable and tunable dynamic programming algorithms from the Gaussian Elimination Paradigm (GEP), such as FloydWarshall's allpairs shortest path and Gaussian elimination without pivoting, for execution on Apache Spark. Our implementations are based on parametric multiway recursive divide&conquer algorithms. We explain how to map implementations of those gridbased parallel algorithms to the Spark framework. Finally, we provide experimental results illustrating the performance, scalability, and portability of our Spark programs. We show that offloading the computation to an OpenMP environment (by running parallel recursive kernels) within Spark is at least partially responsible for a \(2  5\times\) speedup of the DP benchmarks. 

¤  Michael Bender, Rezaul Chowdhury, Rathish Das, Rob Johnson, William Kuszmaul, Andrea Lincoln, Quanquan Liu, Jayson Lynch, and Helen Xu, "Closing the Gap Between Cacheoblivious and Cacheadaptive Analysis", Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2020), Virtual Event, pp. 6373, 2020.  
Show More
Show Less
Abstract: Cacheadaptive analysis was introduced to analyze the performance of an algorithm when the cache (or internal memory) available to the algorithm dynamically changes size. These memorysize fluctuations are, in fact, the common case in multicore machines, where threads share cache and RAM. An algorithm is said to be efficiently cacheadaptive if it achieves optimal utilization of the dynamically changing cache. Cacheadaptive analysis was inspired by cacheoblivious analysis. Many (or even most) optimal cacheoblivious algorithms have an \((a,b,c)\)regular recursive structure. Such \((a,b,c)\)regular algorithms include Longest Common Subsequence, All Pairs Shortest Paths, Matrix Multiplication, Edit Distance, Gaussian Elimination Paradigm, etc. Bender et al. (2016) showed that some of these optimal cacheoblivious algorithms remain optimal even when cache changes size dynamically, but that in general they can be as much as logarithmic factor away from optimal. However, their analysis depends on constructing a highly structured, worstcase memory profile, or sequences of fluctuations in cache size. These worstcase profiles seem fragile, suggesting that the logarithmic gap may be an artifact of an unrealistically powerful adversary. We close the gap between cacheoblivious and cacheadaptive analysis by showing how to make a smoothed analysis of cacheadaptive algorithms via random reshuffling of memory fluctuations. Remarkably, we also show the limits of several natural forms of smoothing, including random perturbations of the cache size and randomizing the algorithm's starting time. Nonetheless, we show that if one takes an arbitrary profile and performs a random shuffle on when "significant events" occur within the profile, then the shuffled profile becomes optimally cacheadaptive in expectation, even when the initial profile is adversarially constructed. These results suggest that cacheobliviousness is a solid foundation for achieving cacheadaptivity when the memory profile is not overly tailored to the algorithm structure. 

¤  Rezaul Chowdhury, Francesco Silvestri, and Flavio Vella, "Brief Announcement: A Computational Model for Tensor Core Units", Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2020), Virtual Event, pp. 519521, 2020.  
Show More
Show Less
Abstract: To respond to the need for efficient training and inference of deep neural networks, a plethora of domainspecific architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is the design for efficiently computing a dense matrix product of a given small size. In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for several problems, including dense and sparse matrix multiplication and the Discrete Fourier Transform. We finally highlight a relation between the TCU model and the external memory model 

¤  Mohammad Mahdi Javanmard, Zafar Ahmad, Martin Kong, LouisNoël Pouchet, Rezaul Chowdhury, and Robert Harrison, "Deriving Parametric Multiway Recursive DivideandConquer Dynamic Programming Algorithms using Polyhedral Compilers", Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2020), San Diego, CA, USA, pp. 317329, 2020.  
Show More
Show Less
Abstract: We present a novel framework to automatically derive highly efficient parametric multiway recursive divide&conquer algorithms for a class of dynamic programming (DP) problems. Standard twoway or any fixed \(R\)way recursive divide&conquer algorithms may not fully exploit manycore processors. To run efficiently on a given machine, the value of \(R\) may need to be different for every level of recursion based on the number of processors available and the sizes of memory/caches at different levels of the memory hierarchy. The set of \(R\) values that work well on a given machine may not work efficiently on another machine with a different set of machine parameters. To improve portability and efficiency, Multiway AutoGen generates parametric multiway recursive divide&conquer algorithms where the value of \(R\) can be changed on the fly for every level of recursion. We present experimental results demonstrating the performance and scalability of the parallel programs produced by our framework. 

¤  Rathish Das, ShihYu Tsai, Sharmila Duppala, Jayson Lynch, Esther Arkin, Rezaul Chowdhury, Joseph Mitchell, and Steven Skiena, "Data Races and the Discrete Resourcetime Tradeoff Problem with Resource Reuse over Paths", Proceedings of the 31st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2019), Phoenix, AZ, USA, pp. 359368, 2019.  
Show More
Show Less
Abstract: A determinacy race occurs if two or more logically parallel instructions access the same memory location and at least one of them tries to modify its content. Races are often undesirable as they can lead to nondeterministic and incorrect program behavior. A data race is a special case of a determinacy race which can be eliminated by associating a mutualexclusion lock with the memory location in question or allowing atomic accesses to it. However, such solutions can reduce parallelism by serializing all accesses to that location. For associative and commutative updates to a memory cell, one can instead use a reducer, which allows parallel racefree updates at the expense of using some extra space. More extra space usually leads to more parallel updates, which in turn contributes to potentially lowering the overall execution time of the program. We start by asking the following question. Given a fixed budget of extra space for mitigating the cost of races in a parallel program, which memory locations should be assigned reducers and how should the space be distributed among those reducers in order to minimize the overall running time? We argue that under reasonable conditions the races of a program can be captured by a directed acyclic graph (DAG), with nodes representing memory cells and arcs representing readwrite dependencies between cells. We then formulate our original question as an optimization problem on this DAG. We concentrate on a variation of this problem where space reuse among reducers is allowed by routing every unit of extra space along a (possibly different) source to sink path of the DAG and using it in the construction of multiple (possibly zero) reducers along the path. We consider two different ways of constructing a reducer and the corresponding duration functions (i.e., reduction time as a function of space budget). We generalize our raceavoiding spacetime tradeoff problem to a discrete resourcetime tradeoff problem with general nonincreasing duration functions and resource reuse over paths of the given DAG. For general DAGs, we show that even if the entire DAG is available offline the problem is strongly NPhard under all three duration functions, and we give approximation algorithms for solving the corresponding optimization problems. We also prove hardness of approximation for the general resourcetime tradeoff problem and give a pseudopolynomial time algorithm for seriesparallel DAGs. 

¤  Mohammad Mahdi Javanmard, Pramod Ganapathi, Rathish Das, Zafar Ahmad, Stephen Tschudi, and Rezaul Chowdhury, "Toward Efficient ArchitectureIndependent Algorithms for Dynamic Programs", Proceedings of the International Conference on High Performance Computing (ISCHPC 2019), Frankfurt, Germany, pp. 143164, 2019.  
Show More
Show Less
Abstract: We argue that the recursive divideandconquer paradigm is highly suited for designing algorithms to run efficiently under both sharedmemory (multi and manycores) and distributedmemory settings. The depthfirst recursive decomposition of tasks and data is known to allow computations with potentially high temporal locality, and automatic adaptivity when resource availability (e.g., available space in shared caches) changes during runtime. Higher data locality leads to better intranode I/O and cache performance and lower internode communication complexity, which in turn can reduce running times and energy consumption. Indeed, we show that a class of gridbased parallel recursive divideandconquer algorithms (for dynamic programs) can be run with provably optimal or nearoptimal performance bounds on fat cores (cache complexity), thin cores (data movements), and purely distributedmemory machines (communication complexity) without changing the algorithm�s basic structure. Twoway recursive divideandconquer algorithms are known for solving dynamic programming (DP) problems on sharedmemory multicore machines. In this paper, we show how to extend them to run efficiently also on manycore GPUs and distributedmemory machines. Our GPU algorithms work efficiently even when the data is too large to fit into the host RAM. These are externalmemory algorithms based on recursive \(r\)way divide and conquer, where \(r\) ( \(\geq 2\) ) varies based on the current depth of the recursion. Our distributedmemory algorithms are also based on multiway recursive divide and conquer that extends naturally inside each sharedmemory multicore/manycore compute node. We show that these algorithms are workoptimal and have low latency and bandwidth bounds. We also report empirical results for our GPU and distributedmemory algorithms. 

Preliminary Results  


¤  Rezaul Chowdhury, Pramod Ganapathi, Yuan Tang, and Jesmin Jahan Tithi, "Provably Efficient Scheduling of Cacheoblivious Wavefront Algorithms", Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2017), Phoenix, AZ, USA, pp. 339350, 2017.  
Show More
Show Less
Abstract: Iterative wavefront algorithms for evaluating dynamic programming recurrences exploit optimal parallelism but show poor cache performance. Tilediterative wavefront algorithms achieve optimal cache complexity and high parallelism but are cacheaware and hence are not portable and not cacheadaptive. On the other hand, standard cacheoblivious recursive divideandconquer algorithms have optimal serial cache complexity but often have low parallelism due to artificial dependencies among subtasks. Recently, we introduced cacheoblivious recursive wavefront (COW) algorithms, which do not have any artificial dependencies, but they are too complicated to develop, analyze, implement, and generalize. Though COW algorithms are based on forkjoin primitives, they extensively use atomic operations for ensuring correctness, and as a result, performance guarantees (i.e., parallel running time and parallel cache complexity) provided by stateoftheart schedulers (e.g., the randomized workstealing scheduler) for programs with forkjoin primitives do not apply. Also, extensive use of atomic locks may result in high overhead in implementation. In this paper, we show how to systematically transform standard cacheoblivious recursive divideandconquer algorithms into recursive wavefront algorithms to achieve optimal parallel cache complexity and high parallelism under stateoftheart schedulers for forkjoin programs. Unlike COW algorithms these new algorithms do not use atomic operations. Instead, they use closedform formulas to compute the time when each divideandconquer function must be launched in order to achieve high parallelism without losing cache performance. The resulting implementations are arguably much simpler than implementations of known COW algorithms. We present theoretical analyses and experimental performance and scalability results showing a superiority of these new algorithms over existing algorithms. 

Preliminary Results  


¤  Rezaul Chowdhury, Pramod Ganapathi, Stephen Tschudi, Jesmin Jahan Tithi, Charles Bachmeier, Charles Leiserson, Armando SolarLezama, Bradley Kuszmaul, and Yuan Tang, "Autogen: Automatic Discovery of Efficient Recursive Divide&Conquer Algorithms for Solving Dynamic Programming Problems", ACM Transactions on Parallel Computing (Special Issue for Top Papers from PPoPP'16), vol. 4(1), pp. 130, 2017.  
Show More
Show Less
Abstract: We present AUTOGEN — an algorithm that for a wide class of dynamic programming (DP) problems automatically discovers highly efficient cacheoblivious parallel recursive divideandconquer algorithms from inefficient iterative descriptions of DP recurrences. AUTOGEN analyzes the set of DP table locations accessed by the iterative algorithm when run on a DP table of small size and automatically identifies a recursive access pattern and a corresponding provably correct recursive algorithm for solving the DP recurrence. We use AUTOGEN to autodiscover efficient algorithms for several wellknown problems. Our experimental results show that several autodiscovered algorithms significantly outperform parallel looping and tiled loopbased algorithms. Also, these algorithms are less sensitive to fluctuations of memory and bandwidth compared with their looping counterparts, and their running times and energy profiles remain relatively more stable. To the best of our knowledge, AUTOGEN is the first algorithm that can automatically discover new nontrivial divideandconquer algorithms 

Preliminary Results  


¤  Shachar Itzhaky, Rohit Singh, Armando SolarLezama, Kuat Yessenov, Yongquan Lu, Charles Leiserson, and Rezaul Chowdhury, "Deriving DivideandConquer Dynamic Programming Algorithms Using SolverAided Transformations", Proceedings of the ACM SIGPLAN International Conference on ObjectOriented Programming, Systems, Languages & Applications (OOPSLA 2016), Amsterdam, Netherlands, pp. 145164, 2016.  
Show More
Show Less
Abstract: We introduce a framework allowing domain experts to manipulate computational terms in the interest of deriving better, more efficient implementations.It employs deductive reasoning to generate provably correct efficient implementations from a very highlevel specification of an algorithm, and inductive constraintbased synthesis to improve automation. Semantic information is encoded into program terms through the use of refinement types. In this paper, we develop the technique in the context of a system called Bellmania that uses solveraided tactics to derive parallel divideandconquer implementations of dynamic programming algorithms that have better locality and are significantly more efficient than traditional loopbased implementations. Bellmania includes a highlevel language for specifying dynamic programming algorithms and a calculus that facilitates gradual transformation of these specifications into efficient implementations. These transformations formalize the divideand conquer technique; a visualization interface helps users to interactively guide the process, while an SMTbased backend verifies each step and takes care of lowlevel reasoning required for parallelism. We have used the system to generate provably correct implementations of several algorithms, including some important algorithms from computational biology, and show that the performance is comparable to that of the best manually optimized code. 

¤  Rezaul Chowdhury, Pramod Ganapathi, Vivek Pradhan, Jesmin Jahan Tithi, and Yunpeng Xiao, "An Efficient Cacheoblivious Parallel Viterbi Algorithm", Proceedings of the 22nd International European Conference on Parallel and Distributed Computing (EuroPar 2016), Grenoble, France, LNCS 9833, pp. 574587, 2016.  
Show More
Show Less
Abstract: The Viterbi algorithm is used to find the most likely path through a hidden Markov model given an observed sequence, and has numerous applications. Due to its importance and high computational complexity, several algorithmic strategies have been developed to parallelize it on different parallel architectures. However, none of the existing Viterbi decoding algorithms designed for modern computers with cache hierarchies is simultaneously cacheefficient and cacheoblivious. Being oblivious of machine resources (e.g., caches and processors) while also being efficient promotes portability. In this paper, we present an efficient cache and processoroblivious Viterbi algorithm based on rank convergence. The algorithm builds upon the parallel Viterbi algorithm of Maleki et al. (PPoPP 2014). We provide empirical analysis of our algorithm by comparing it with Maleki et al.'s algorithm. 

¤  Pramod Ganapathi (supervised by Rezaul Chowdhury), "Automatic Discovery of Efficient Divide&Conquer Algorithms for Dynamic Programming Problems", PhD Thesis, Department of Computer Science, Stony Brook University, December 2016. 
Show More
Show Less
Abstract: This dissertation is an excursion into computerdiscovered algorithms and computeraided algorithm design. In our vision of the notsodistant future of computing, machines will take the responsibility of designing and implementing most, if not all, machinespecific highly efficient nontrivial algorithms with provable correctness and performance guarantees. Indeed, as the complexity of computing hardware grows and their basic architectures keep changing rapidly, manually redesigning and reimplementing key algorithms for high performance on every new architecture is quickly becoming an impossible task. Automation is needed. We design algorithms that can design other algorithms. Our focus is on autogenerating algorithms for solving dynamic programming (DP) recurrences efficiently on stateoftheart parallel computing hardware. While DP recurrences can be solved very easily using nested or even tiled nested loops, for high performance on modern processors/coprocessors with cache hierarchies, portability across machines, and automatic adaptivity to runtime fluctuations in the availability of shared resources (e.g., cache space) highly nontrivial recursive divideandconquer algorithms are needed. But, these recursive divideandconquer algorithms are difficult to design and notoriously hard to implement for high performance. Furthermore, new DP recurrences are being encountered by scientists every day for solving brand new problems in diverse application areas ranging from economics to computational biology. But, how does an economist or a biologist without any formal training in computer science design an efficient algorithm for evaluating his/her DP recurrence on a computer? Well, we can help! We present AUTOGEN — an algorithm that given any blackbox implementation of a DP recurrence (e.g., inefficient naive serial loops) from a wide class of DP problems, can automatically discover a fast recursive divideandconquer algorithm for solving that problem on a sharedmemory multicore machine. We mathematically formalize AUTOGEN, prove its correctness, and provide theoretical performance guarantees for the autodiscovered algorithms. These autogenerated algorithms are shown to be efficient (e.g., highly parallel with highly optimizable kernels, and cache, energy, and bandwidthefficient), portable (i.e., cache and processoroblivious), and robust (i.e., cache and processoradaptive). We present AUTOGENWAVE — a framework for computerassisted discovery of fast divideandconquer wavefront versions of the algorithms already generated by AUTOGEN. These recursive wavefront algorithms retain all advantages of the AUTOGENdiscovered algorithms on which they are based, but have better and nearoptimal parallelism due to the wavefront order of execution. We also show how to schedule these algorithms with provable performance guarantees on multicore machines. We present VITERBI — an efficient cacheoblivious parallel algorithm to solve the Viterbi recurrence, as a first step toward extending AUTOGEN to handle DP recurrences with irregular datadependent dependencies. Our algorithm beats the current fastest Viterbi algorithm in both theory and practice. We present AUTOGENFRACTILE — a framework for computeraided design of highperforming and easily portable recursively tiled/blocked algorithms for a wide class of DP problems having fractaltype dependencies. These recursively tiled algorithms have excellent cache locality and excellent parallel running time on multi/manycore machines with deep memory hierarchy. We present AUTOGENTRADEOFF — a framework that can be used to design efficient and portable notinplace algorithms to asymptotically increase the parallelism of some of the AUTOGENdiscovered algorithms without affecting cacheefficiency. This dissertation takes the first major step on which to build computing systems for automatically designing efficient and portable nontrivial algorithms. We hope that more work follows in this science of algorithm discovery. 

¤  Mohammad Mahdi Javanmard (supervised by Rezaul Chowdhury and Robert Harrison), "Parametric Multiway Recursive DivideandConquer Algorithms for Dynamic Programs", PhD Thesis, Department of Computer Science, Stony Brook University, March 2020. 
Show More
Show Less
Abstract: A typical modern supercomputer is a network of distributedmemory hybrid compute nodes where each node usually has both latencyoptimized multicores (a.k.a. fat cores) and throughputoptimized manycores (a.k.a. thin cores) connected through a multilevel memory hierarchy. The same supercomputer often hosts compute nodes of various configurations. Many of them are upgraded to the next stateoftheart within four or five years of becoming operational. To design and implement efficient algorithms for the heterogeneous and constantly evolving modern supercomputers the need for performance portability across architectures must be taken into account. We show that a class of gridbased parallel multiway recursive divideandconquer algorithms for solving Dynamic Programs (DP) can be executed efficiently with provably optimal or nearoptimal performance bounds on fat cores (cache complexity), thin cores (data movements) and purely distributedmemory machines (communication complexity) without any changes in the basic structure of the algorithm. We generate such an algorithm from its standard twoway counterpart by multilevel inlining of the recursion and then making each recursive function call as early as possible without violating the dependency constraints. We present a novel framework — MULTIWAY AUTOGEN — that uses polyhedral compiler transformations for systematically deriving a multiway recursive divideandconquer algorithm for solving a DP from an iterative specification of the DP. The framework takes advantage of polyhedral transformations including monoparametric tiling, and indexset splitting. We show that our parametric multiway recursive divideandconquer algorithms can be implemented to run efficiently on Apache Spark. To perform well on distributed computing systems, such as Apache Spark, running on clusters and computation clouds, an algorithm must have the ability to scale out. Additionally, such clusters have a tradeoff between communication cost and parallelism. Therefore, it is crucial to have welldecomposable, adaptive, tunable, and scalable programs, and we show that our parametric multiway recursive divideandconquer algorithms have those properties. We provide experimental results illustrating the performance, scalability and portability of the Apache Spark programs for various DP algorithms. 

¤  Rezaul Chowdhury and Vijaya Ramachandran, "Cacheoblivious Buffer Heap and Cacheefficient Computation of Shortest Paths in Graphs", ACM Transactions on Algorithms (TALG), vol. 14(1), pp. 133, 2018. 
Show More
Show Less
Abstract: We present the buffer heap, a cacheoblivious priority queue that supports DeleteMin, Delete and a hybrid Insert/DecreaseKey operation in \( O\left( \frac{1}{B} \log_{2}{\frac{N}{M}} \right) \) amortized block transfers from main memory, where \(M\) and \(B\) are the (unknown) cache size and blocksize, respectively, and \(N\) is the number of elements in the queue. We introduce the notion of a slim data structure which captures the situation when only a limited portion of the cache which we call a slim cache, is available to the data structure to retain data between data structural operations. We show that a buffer heap automatically adapts to such an environment and supports all operations in \( O\left( \frac{1}{\lambda} + \frac{1}{B} \log_{2}{\frac{N}{\lambda}} \right) \) amortized block transfers each when the size of the slim cache is \(\lambda\). Our results provide substantial improvements over known trivial cache performance bounds for cacheobliviouspriority queues with DecreaseKeys. Using the buffer heap we present cacheoblivious implementations of Dijkstra's algorithm for undirected and directed singlesource shortest path (SSSP) problems for graphs with nonnegative real edgeweights. On a graph with \(n\) vertices and \(m\) edges, our algorithm for the undirected case performs \( O\left( n + \frac{m}{B} \log_{2}{\frac{n}{M}} \right) \) block transfers and for the directed case performs \( O\left( \left(n + \frac{m}{B}\right) \cdot \log_{2}{\frac{n}{B}} \right) \) block transfers. These results give the first nontrivial cacheoblivious bounds for the SSSP problem on general graphs. For the allpairs shortest path (APSP) problem on weighted undirected graphs, we incorporate slim buffer heaps into multibufferbufferheaps and use these to improve the cacheaware cache complexity. We also present a simple cacheoblivious APSP algorithm for unweighted undirected graphs that performs \( O\left( \frac{mn}{B} \log_{M/B} \frac{n}{B} \right) \) block transfers. This matches the cacheaware bound and is a substantial improvement over the previous cache oblivious bound for the problem. 

¤  Michael Bender, Rezaul Chowdhury, Pramod Ganapathi, Samuel McCauley, and Yuan Tang, "The Range 1 Query (R1Q) Problem", Theoretical Computer Science (Invited Paper from COCOON'14), vol. 743, pp. 130147, 2018. 
Show More
Show Less
Abstract: We define the range 1 query (R1Q) problem as follows. Given a \(d\)dimenisional \(( d\geq 1)\) input bit matrix \(A\) (consisting of 0's and 1's), preprocess \(A\) so that for any given region \(R\) of \(A\), efficiently answer queries asking if \(R\) contains a 1 or not. We consider both orthogonal and nonorthogonal shapes for \(R\) including rectangles, axisparallel righttriangles, certain types of polygons, axisparallel right simplices and spheres. We provide spaceefficient deterministic and randomized algorithms with constant query times (in constant dimensions) for solving the problem in the wordRAM model. The space usage in bits is sublinear, linear, or near linear in the size of \(A\), depending on the algorithm. 

¤  ShihYu Tsai, HaoTsung Yang, Kin Sum Liu, Shan Lin, Rezaul Chowdhury, and Jie Gao, "Multichannel Assignment and Link Scheduling for Prioritized LatencySensitive Applications", Proceedings of the International Symposium on Algorithms and Experiments for Sensor Systems, Wireless Networks and Distributed Robotics (ALGOSENSORS 2019), Munich, Germany, pp. 137157, 2019. 
Show More
Show Less
Abstract: Current wireless networks mainly focus on delaytolerant applications while demands for latencysensitive applications are rising with VR/AR technologies and machinetomachine IoT applications. In this paper we consider multichannel, multiradio scheduling at the MAC layer to optimize for the performance of prioritized, delaysensitive demands. Our objective is to design an interferencefree schedule that minimizes the maximum weighted refresh time among all edges, where the refresh time of an edge is the maximum number of time slots between two successive slots of that edge and the weights reflect given priorities. In the singleantenna unweighted case with \(k\) channels and \(n\) transceivers, the scheduling problem reduces to the classical edge coloring problem when \(k \geq \lfloor n/2 \rfloor\) and to strong edge coloring when \(k=1\), but it is neither edge coloring nor strong edge coloring for general \(k\). Further, the priority requirement introduces extra challenges. In this paper we provide a randomized algorithm with an approximation factor of \(\tilde{O}\left( \max \left\{ \sqrt{\Delta_p }, \frac{\Delta_p}{\sqrt{k}} \right\} \log m \right )\) in expectation, where \(\Delta_p\) denotes the maximum degree of the unweighted multigraph, which is formed by duplicating each edge \(e_i\) for \(w_i\) times (\(w_i\) is \(e_i\)'s integral priority value), and \(m\) is the number of required link communications (\(f(n) \in \tilde{O}(h(n))\) means that \(f(n) \in O\left( h(n) \log^k(h(n)) \right)\) for some positive constant \(k\)). The results are generalized to the multiantenna settings. We evaluate the performance of our methods in different settings using simulations. 