Jie Gao

Associate Professor, Computer Science Department, Stony Brook University

Major Research Problems and Results

The theme of my research is to develop geometric algorithms for networking applications. Here geometry is taken as a fairly general term that goes beyond the Euclidean geometry or graphs in the plane. In wireless networks, the wireless communication properties respect distances and proximity. In indoor settings, the placement of sensors for environment sensing and monitoring must consider the geometric layout of the floor plans. Similarly, geometry naturally appears when we look at problems such as sensing, networking and control at a city scale. Often at such large scales, topological features of the environment start to play an important role -- obstacles such as buildings, terrain features, street maps are crucial when we talk about network routes in a large scale wireless network, or physical routes such as human motion trajectories and robot motion planning. Many of my work are related to the vision of the Internet of Things (IoT) but the mathematical problems in complex network analysis, social networks analysis and other networking scenarios (software defined networking SDN, named data networking NDA) share similar insights. In my group we love fundamental problems and we are thrilled to develop elegant solutions that have theoretically proven performance bounds and are simple enough to be practically useful.

Greedy Routing

The problem is to develop simple, greedy algorithms for a distributed wireless network -- each node is given some compact coordinates with which one can route messages between any two nodes with performance guarantees. The problem initially started from the setting when geographical coordinates are used to guide routing towards the destination. But the problem is farily rich with multiple dimensions: static nodes versus mobile nodes; simple domains versus domains of complex topology; and optimizing for various routing metrics such as guaranteed delivery, robustness to failures, load balancing, supporting multi-paths, routing towards data and not IDs, etc. Greedy routing using geographical coordinates faces an immediate challenge when a node does not have neighbors that are closer to the destination in terms of Euclidean distances. This suggested the use of virtual coordinates. As an elegant problem it touches upon many fundamental issues: how to create new (virtual) coordinates for the nodes such that the network structure is encoded to enable greedy routing; how to create routes that are topologically different to maximize route independence and robustness; whether we should use compact structures that summarize the geometric features of the underlying network or the topological features of the network. Below we briefly discuss what results we have obtained for this family of problems.

Extract boundaries and skeletons: Greedy routing using geographical coordinates works well when the wireless nodes are dense and deployed inside a convex domain. The main problem is when the network has holes. Thus a first problem is how to identify holes in a distributed manner. Can the nodes on hole boundaries detect this by themselves? We developed a number of algorithms on this topic, by using either the geometric properties of the communication graph [Fang, Gao, Guibas 2006], or use the cut locus of shortest path map [Wang, Gao, Mitchell 2006]. Once the boundary nodes are identified, we can extract a skeleton of the network to help with routing such as the medial axis [Bruck, Gao, Jiang, 2005] or the flow complex [Zhu, Sarkar, Gao, 2009].

Conformal mapping: One family of routing algorithms we developed is to use global transformation of the domain to create new coordinates such that greedy routing has guaranteed delivery. If the domain has holes, we map the domain to a circular one such that all holes are circular, and therefore greedy routing cannot get stuck [Sarkar et.al. 2009]. Furthermore, circular domains are not unique and one can quickly find a different circular domain by applying a Mobius transformation (which maps circles to circles). Thus this easily creates infinitely many circular domains with which we can run greedy routing to get multiple paths that are different in shape, allowing multipath routing [Jiang et.al. 2011]. Alternatively, we can also cut the holes open along some paths to a simple domain and map it to a convex polygon in the hyperbolic plane [Zeng et. al. 2010]. Again delivery is guaranteed and further we can use greedy routing to generate paths of different homotopy types (which go around the holes in different ways). We considered mobile networks as well, and develop compact schemes such that each node can quickly computes its virtual coordinates when it moves [Li et. al. 2016].

Load balancing: Can we design a simple routing scheme such that the maximum traffic load of any node is minimized? For wireless nodes on a line or in a narrow band, we show a simple online greedy routing algorithm that achieves a constant path stretch (compared to the shortest path), and a constant load balancing factor (compared to the optimal load balanced routing scheme) simultaneously [Gao, Zhang 2016]. For a sensor network densely deployed in a simple domain (with no holes), by using area preserving map we can create virtual coordinates such that greedy routing with the virtual coordinates has both stretch factor and load balancing ratio bounded by the distortion of the map [Goswami, et al. 2014]. But this is impossible if the nodes are in general 2D plane. Instead, there is a tight tradeoff between the stretch factor and the load balancing ratio [Gao, Zhang 2009]. For a non-simple domain, many issues with load balancing appear on the hole boundaries as greedy routing using geographical coordinates or virtual coordinates tend to overload the boundary nodes. There are a number of ways to alleviate it. For example, if we use the circular embedding as virtual coordinates for greedy routing, we can use Mobius transformations to `fill up' the holes, by mapping another copy of the domain inside each hole. Greedy routing when hitting a hole boundary can now enter the interior of the hole, which is essentially bounced away from the boundary [Sarkar et. al. 2010]. In fact, any algorithm that does not suggest routes along the straight line towards the destination will be better in terms of reducing the traffic load on boundaries. This is observed for routing along the medial axis [Bruck, Gao, Jiang, 2005] and routing with coordinates in a hyperbolic plane [Zeng et. al. 2010]. Last, recently we are able to show that greedy routing by using the medial axis of the domain can provide constant bound on the load balancing ratio [Gao, Goswami 2015].

Landmark based schemes: A completely different type of coordinates are based on landmarks. That is, consider a subset of the nodes as landmarks and by flooding every other node records its network distance to them. Thus the position of any node can be inferred from the distance vector to the landmarks. This can be used to create virtual coordinates for greedy routing. Here how to define the distance between two nodes using their landmark distance vector is non-trivial. With careful design one can guarantee delivery [Fang, Gao, Guibas 2005], or even better, achieve a constant upper bound on path stretch[Nguyen et. al. 2007]. The use of landmark distance vector goes beyond the wireless network setting. In a later paper, we tested the landmark based routing on various complex networks arising from social networks and biological networks and surprisingly with a small number of landmarks and standard embedding algorithms one can actually run greedy routing with good delivery rate and path stretch [Ban et. al. 2010].

Network Embedding

The general problem of network embedding is to create geometric coordinates for the vertices of the network in some underlying space to match the given input. This is the fundamental problem of network localization when at most a subset of the nodes (called anchors) have GPS coordinates and one would like to infer the location of other nodes by using measurements to the anchor nodes. The measurements can be distance estimates or measurements of relative angles. Our work on this problem can be classified into two categories. In the first category we look at the embedding of a unit disk graph with angle information between adjacent edges [Bruck, Gao, Jiang 2005] or using both noisy angle and distance measurements [Basu et. al. 2006]. In the second category, we look at the embedding of a unit disk graph when no angle nor distance measurements are given and the network has complicated topology. In this setting, the hop count distance might be a poor estimate of the Euclidean distance between two nodes as the shortest path may bend around holes. We propose to use landmark-based Voronoi diagram and its dual combinatorial Delaunay triangulation for the embedding, which turns out to have substantially improved performance [Lederer, Wang, Gao, 2009], [Wang, Lederer, Gao, 2009]. We also provided theoretical analysis on using the combinatorial Delaunay triangulation to capture the domain topology [Oudot et. al. 2010].

Distributed Information Processing and Query

In the vision of sensor networks or the Internet of Things, data from sensors are more important than the IDs of the sensors from a user's perspective. Therefore routing problems for distributed sensors are naturally aligned with the design of data-centric networking. We call the problem in our work the problem of information brokerage -- a user would like to find a piece of data but yet does not know the information source (the nodes that generated the data). As communication is more energy consuming than computation it is preferable to perform in-network processing than delivering all the raw data to the base station. On the other hand, users are often located in the same physical space as the sensors themselves and user requests could be diverse. In our work we looked at various algorithms that specify how data is stored and processed in a manner that would facilitate users to retrieve. Considerations of such algorithms often involve the total storage cost, the communication cost, the retrieval cost and robustness to failures. The main design questions involve where the data is stored, and how users retrieve data of interest, what types of coding schemes should be used to improve reliability.

In-network storage and retrieval: One of the desirable features is to have a storage schemee that allows distance sensitive retrieval -- the cost of retrieving a piece of data within the network is proportional to the cost of travel from the query node to the information source. That is, the retrieval cost is asymptotically optimal. We developed multiple schemes with such property using several different insights: 1) use stereographic mapping such that sensor data is placed on O(\sqrt{n}) nodes to support distance sensitive retrieval [Sarkar et. al. 2009]; 2) for landmark-based routing schemes, both algorithms in [Fang et. al. 2006] use the so-called "double ruling" schemes, i.e., design two sets of curves of the domain such that each curve in the first set intersects all the curves in the second set. Thus data from a node s is stored along the curve from the first set passing through s; while retrieveal from a query node t travels along a curve in the second set through t. 3) We also use conformal maps to generate the double ruling curves using intrinsic parameterization [Ban et. al. 2013]. 4) And last, when the sensor data is derived from mobile targets, the set of mobility traces naturally spread out the detection data. We show that in this case issuing a query along a random line retrieves all data with high probability [Zhou, Gao 2009].

Sparse aggregation and matching: In scenarios when more than one sensors (possibly geographically apart) have detected interesting events, the challenge of performing efficient aggregation is that the nodes detecting events are unware of the existence of each other nor the locations of the query nodes. One idea is to build information gradients in the network by using harmonic functions, by following which greedy routing arrives at one of the sources for sure [Lin et. al. 2008]. In another idea, if we prefer to aggregate the data from these sources to a base station, we designed a protocol that allows a spanning tree of the sources to evolve in a bottom up manner [Gao et. al 2007] with total communication cost asymptotically optimal (proportional to the minimum spanning tree of the sources, if the locations of these sources are known). When the events move around, we also have a communication efficient protocol to maintain an approximate minimum steiner of the mobile events in the network [Zhou, Gao 2009]. Further, it is possible that multiple users are interested in the event detections (e.g., multiple users looking for empty parking spots) and we would like to provide a one-to-one matching of the users and the detections such that the total cost of the matching is small [Gao et. al. 2009].

Range queries: In this category we would like to answer range queries -- given a geometric range such as a rectangular region, return the aggregated information from all sensors within the range. The main idea in efficiently providing the answer is to precompute appropriate summaries in the network. In the first set of solutions, we use multi-resolutional data summaries. The aggregates of all sensors inside a quad in a quadtree are stored on all sensors inside the quad. Thus a range query can selectively visit a subset of sensors inside the range to reconstruct the answer [Gao et. al. 2004]. The construction of these multi-resolutional aggregates can be improved by using gossip algorithms, which achieves a near linear communication cost in total [Sarkar et. al. 2012]. In the second scheme, we use differential forms as the basic storage scheme for answering counting range queries for mobile targets. The sensors keep certain values (called differential one-form) on the edges of a planar graph such that the number of targets can be obtained by summing up the values of the edges along the boundary of the given range. This achieves the minimum possible query cost and the update cost is proportional to the distance travelled by the targets [Sarkar, Gao 2013]. We also applied differential form, this time a special one, the harmonic one-form, for sensing and classifying the homology groups of target trajectories. The network nodes are processed such that summing up the values along each trajectory can be used to classify trajectories that are homologous to each other in the same bucket [Yin et. al. 2015]. This allows for low cost sensing and clustering of mobile trajectories.

Fault tolerance: In-network storage scheme would benefit from having fault tolerance. When some nodes fail, the data stored on them is lost. Therefore data shall be prepared into codewords and carefully stored in the network. Now the questions are: what coding scheme shall be used? what is the communication cost of perparing for the codewords? what is the cost of recovery? We used a gossip algorithm to show that linear erasure codes can be used to counter possible node failures and furthermore the coding scehem can be constructed in the network with a total of near linear communication cost [Albano, Gao 2010]. In another setting we consider spatial failures when a cluster of nodes fail at the same time. We developed near optimal storage schemes with and without coding [Azimi et. al. 2010].

Complex Network Analysis

In the past few years we start to look at problems and algorithm for complex network analysis. Here complex networks refer to the general family of graphs that appear in social network settings, the Internet settings, and biological network settings. Complex networks, albeit with diverse origins, share similar properties such as the small world property, the power law degree distribution, community structures, among others. Various generative models such as Watts-Strogatz model, Kleinberg's model and preferential attachment model have been proposed. Our work can be summarized in four directions: social contagions and influence, analysis of social dynamics, network curvatures and human mobility patterns.

Social contagions and influence: Contagions such as diseases or rumors have been a subject of study for a long time. In our work we focus on the study of behaviorial changes, for example, lifestyle changes, participation in an online social platform or adoption of new technology. Such behavioral changes are influenced by friends. Sociologists argue that behavioral changes are contagious and the adoption of such changes may require multiple confirmations from neighbors. This leads to the model of complex contagion in which a node is infected if at least k neighbors are infected, for a constant k greater than one. This simple model is interesting because the influence function cannot be described by a submodular function (with diminishing return), due to the initial adoption barrier. Therefore many prior work in analyzing social influence, which predominantly used submodular optimization, cannot be directly applied. In our recent work, we performed rigorous analysis of the behavior of complex contagions on multiple complex network models, such as the Newmann Watts model, the Kleinberg's small world model [Ghasemiesfeh et. al. 2013], [Ebrahimi et. al. 2014], the preferential attachment model [Ebrahimi et. al. 2014] and the more general family of time evolving graphs [Gao et. al, 2016]. We show that the network structures or the selection of initial seeds can be crucial in enabling fast complex contagions.

Analysis of social dynamics: We are generally interested in all forms of social dynamics and interactions. We did some data-driven analysis on the formation and lifecycle of social groups in various data sets such as the World of Warcraft players, DBLP authors [Patil et. al. 2013] or employees in a large corporate company [Patil et. al. 2013]. In another thread, we look at opinion changes in Epinion data sets [Patil et. al. 2013] and social norm evolution modeled by the Naming Game [Gao et. al. 2017].

Network curvatures: Here we use geometric lenses on complex network analysis. We look at a classical geometric notion called curvature in the setting of complex networks. Curvature generally talks about how a geometric object deviates from being flat/straight. In a graph setting one can define the discrete curvature of an edge (e.g., by the Ollivier Ricci curvature) as measuring whether the neighbors of the two endpoints of an edge are well connected. An edge in a well connected neighborhood has positive curvature while an edge connecting two rather disjoint neighborhoods is negatively curved. We discovered that curvature can reveal a lot of interesting findings on real world complex networks [Ni et. al. 2015] and we are working on applications of network curvatures for graph analysis.

Human mobility patterns: A unique aspect of human social behaviors is reflected in mobility patterns. People almost never move randomly and human mobility patterns can reveal a lot of social information such as frequently visited locations and social ties. Mobility can also aid message spreading. By analyzing the mobility traces from taxis in a big city, we show that the physical contacts (i.e., two entities appearing at the same location at the same time) can help to disseminate messages. In fact, messages hopping through physical contacts exhibits small world property yet unlike social networks the hubs (entities involved in physical contacts with many others) do not seem to be crucial in message dissemination [Ding et. al. 2015].

Security and Privacy

My work in security and privacy has been mainly on location data. We looked at how to ensure the location report from a sensor to be faithful [Lederer et. al. 2010], how to detect the source locations when the message delivery is done via random walk [Shi et. al. 2013], and how to perform geometric cloaking to ensure location privacy [Vu et. al. 2012]. In addition, we also looked at attacks to wireless networks by placing wormhole links. The modification of the network topology can attract traffic for the adversary to perform additional traffic analysis. We developed tests that detect such wormhole attacks, e.g., by using geometric packing property [Maheshwari et. al. 2007], or by using a purely combinatorial detection mechanism [Ban et. al. 2011]. Recently we are mainly interested in trajectory analysis that respects user privacy. For example, by using hierarchies with mininum of Hash values we develop a communication efficient scheme to sense and store trajectory data that satisfies the differential privacy property but supports popular paths queries between any two nodes [Ding et. al. 2017a]. We develop distributed near optimal spatial cloakings for mobile nodes, formulated by the r-gather problem, in which each cluster have at least r points and the maximum radius is minimized [Zeng et. al. 2017]. Last, we examine how to remove identifying features in the publication of a trajectory data set by randomly switching node IDs when two nodes are co-located at the same time [Ding et. al. 2017b].

Algorithms for Scheduling and Planning

TSP related problems: We designed algorithms for various TSP related problems, with applications in motion planning for robots and autonomous vehicles. Here are our results. (1) we assume that there are nodes in a metric space and each node is generating data. The goal is to design routes for a data MULE with a speed limit such that the (average) data rate collected by the data MULE (in unit time) is maximized. Or, in order to collect all data, we would like to minimize the maximum number of data MULEs used. We develop constant approximation algorithms for both of them [Citovsky et. al. 2015]. (2) we consider time window TSP problems in which each node is associated with a time window during which it needs to be visited. We designed approximation algorithms when the nodes are in 1D and when the time windows can also be relaxed [Jia et. al., 2016]. This problem abstracts the applications of using autonomous vehicles for package delivery. (3) Given a set of pairs of points in the plane, we would like to find a shortest cycle that encloses exactly one point in each pair. We develop various approximation algorithms for this innocent-looking but notoriously hard problem [Arkin et. al., 2016].

Art gallery and duty cycle scheduling: Many monitoring applications for smart building settings use sensors with sensing range defined by visibility (cameras, infrared sensors). Therefore art gallery problems naturally appear, if all target locations need to be covered. In our recent work we discuss duty cycling of these sensors to ensure that all target locations are covered with (possibly different) frequencies, while minimizing the total resources required (e.g., the number of sensors to turn on at each time slot). We developed various approximation algorithms for minimizing the maximum dark duration (a contiguous time interval during which the target is not covered), or the average dark duration for all targets, or meet the specific coverage frequency requirements [Liu et. al. 2016], [Liu et. al. 2017]. If we would like the cameras to be connected in a wireless network, we may ask what is the minimum number of relay nodes needed to provide the connection. This is formulated as the minimum connected camera network problem and studied in [Huang et. al. 2014].

Scheduling in optical SDN: In software defined networking settings, the control plane is separated from the data plane such that a lot more flexibility and optimization in routing are possible. We considered optical network scenerio when the bandwidth of a link can be reconfigured to adapt to the traffic matrix. Now the question is, how to reconfigure the link capacity and how to schedule traffic requests to optimize the performance metric? In our work we provide both approximation/competitive algorithms that use various greedy rules when the total bandwidth for each node is bounded [Jia et. al. 2017], as well as system evaluations that show superior performance compared to the setting when reconfiguration is not allowed [Jin et. al. 2016].

Kinetic Data Structures

Kinetic data structures are data structures for mobile data. When the geometric objects move around, one would like to build certificates and an event system such that the geometric structures (such as convex hull, Delaunay triangulation) are maintained at all times. In my past work we designed efficient kinetic data structures for discrete clusters of fixed radii [Gao et. al. 2004], linear size geometric spanners [Gao et. al. 2006], medians and kd-trees [Agarwal, Gao, Guibas 2002], centerpoints [Agarwal et. al, 2005], approximate Delaunay graphs [Agarwal et. al. 2015].


Spanners are graphs whose shortest path lengths approximate the distance in a given metric. We have developed multiple results concerning spanners, such as spanners of unit disk graphs [Gao et. al. 2005], spanners of moving points or dynamic points[Gao et. al. 2006], spanners that are constructed by agents under anarchy [Gao, Zhou 2012], spanners that can be quickly fixed with imprecise points [Gao, Zeng 2014], well-separated pair decompositions for unit disk graphs [Gao Zhang 2005].

Coresets and Clustering

Clustering is a fundamental problem in computational geometry. We developed efficiently algorithms for clustering in various settings. For example, cluster a set of lines (or hyperplanes) in high dimensional spaces by using k balls of minimum radius. This construction first identifies a subset of the input lines as a "coreset" such that the clustering results on the coreset is approximately the same as the optimal clustering [Gao, Langberg, Schulman 2008], [Gao, Langberg, Schulman 2010].