Christian Sommer

PhD thesis, University of Tokyo, 2010

Computing shortest paths in graphs is one of the most fundamental and well-studied problems in combinatorial optimization. Numerous real-world applications have stimulated research investigations for more than 50 years. Finding routes in road and public transportation networks is a classical application motivating the study of the shortest path problem. Shortest paths are also sought by routing schemes for computer networks: the transmission time of messages is less when they are sent through a short sequence of routers. The problem is also relevant for social networks: one may more likely obtain a favor from a stranger by establishing contact through personal connections.

This thesis investigates the problem of efficiently computing exact and approximate shortest paths in graphs, with the main focus being on shortest path query processing. Strategies for computing answers to shortest path queries may involve the use of pre-computed data structures (often called *distance oracles*) in order to improve the query time. Designing a shortest path query processing method raises questions such as: How can these data structures be computed efficiently? What amount of storage is necessary? How much improvement of the query time is possible? How good is the approximation quality of the query result? What are the tradeoffs between pre-computation time, storage, query time, and approximation quality?

For distance oracles applicable to general graphs, the quantitative tradeoff between the storage requirement and the approximation quality is known up to constant factors. For distance oracles that take advantage of the properties of certain classes of graphs, however, the tradeoff is less well understood: for some classes of sparse graphs such as planar graphs, there are data structures that enable query algorithms to efficiently compute distance estimates of much higher precision than what the tradeoff for general graphs would predict. The first main contribution of this thesis is a proof that such data structures cannot exist for all sparse graphs. We prove a space lower bound implying that distance oracles with good precision and very low query costs require large amounts of space. A second contribution consists of space- and time-efficient data structures for a large family of complex networks. We prove that exploiting well-connected nodes yields efficient distance oracles for scale-free graphs. A third contribution is a practical method to compute approximate shortest paths. By means of random sampling and graph Voronoi duals, our method successfully accommodates both highly structured graphs stemming from transportation networks and less structured graphs stemming from complex networks such as social networks.

@phdthesis{Som10, author = {Christian Sommer}, title = {Approximate Shortest Path and Distance Queries in Networks}, school = {The University of Tokyo}, year = {2010}, }

Official version

Slides (756.9 KB)

The first chapter (introduction) of this thesis is reproduced below. I recently extended the survey in chapter 3 into its own article. The space lower bound (chapter 4) was announced at FOCS 2009. The distance oracle for power-law graphs (chapter 5) was announced at DISC 2009, journal version in Transactions on Algorithms. Our shortest-path query algorithm using Voronoi duals (chapter 6) was announced at ISVD 2009, journal version in Transactions on Computational Science.

Imagine you wanted to travel from central Tokyo to visit the place where I grew up, which is the village of Ottenbach in Switzerland. What is the fastest way to get there?

Imagine you wanted to get in touch with Nelson Mandela using a sequence of personal introductions through friends and friends of friends. What is the shortest such sequence?

Imagine you wanted to access a webpage on the Internet. Which routers should be used such that the necessary information is downloaded to your computer fastest?

These questions have something in common, in that their (optimal) solution is the shortest path between two points of a network: a transportation network, a social network, and a router network. All roads may lead to Rome, but we wish to arrive as soon as possible. The aim of this thesis is to provide means to efficiently compute shortest paths in networks.

Modeling the bridges by the edges of a graph helped Euler to solve the problem of the Königsberg bridges. Ever since, graphs have been used as an abstraction of structures in the real world:

Graphs are, of course, one of the prime objects of study in Discrete Mathematics. However, graphs are among the most ubiquitous models of both natural and human-made structures. In the natural and social sciences they model relations among species, societies, companies, etc. In computer science, they represent networks of communication, data organization, computational devices as well as the flow of computation, and more. In mathematics, Cayley graphs are useful in Group Theory. Graphs carry a natural metric and are therefore useful in Geometry, and though they are "just" one-dimensional complexes, they are useful in certain parts of Topology, e.g. Knot Theory. In statistical physics, graphs can represent local connections between interacting parts of a system, as well as the dynamics of a physical process on such systems. [Hoory, Linial, and Wigderson]

In a graph representing a road network, intersections and streets can be modeled by nodes and edges, respectively. Two nodes have an edge in between if there is a street connecting the two corresponding intersections. For a computer network, routers and the connecting network cables are mapped to nodes and edges, respectively. In a social network, the connections are not physical. Individuals can be modeled by nodes; two nodes are connected by an edge whenever the corresponding individuals are friends. In other social networks, an edge may also indicate a private or professional relationship other than friendship.

Different streets of a road network have different lengths. This is modeled by assigning each edge a number, called its edge weight. This cost can reflect real-world values such as distance, travel time, transmission time, and latency. In a graph modeling a social network, the edge weight can also reflect the quality of a friendship, although this is arguably difficult to capture appropriately by a single number.

Transportation networks are an integral part of the infrastructure in many countries. For this reason, the study of transportation networks is an important field of research. We give three examples of transportation networks: road, railway, and airline networks. A more realistic model of transportation would ideally integrate networks of all three types.

**Road Networks**

As mentioned before, intersections and streets are represented by graph nodes and edges, respectively. Also, edges may be assigned weights such as expected travel times or distances. All the
information included in the graph model can essentially be found by consulting a road map. The
reverse is not true. We cannot draw the original road map by considering the graph only, since
some information about the original drawing (or embedding) such as coordinates is not included.
The original map is just one possible drawing of the graph. A general graph is completely independent from its embedding.

A graph that can be drawn on a plane such that no two edges cross (except for the endpoints),
is called planar. Since road networks often contain many bridges and tunnels, the corresponding
graphs are in general not planar. However, road networks and planar graphs share some important
properties, which render road networks tractable for many optimization problems. Navigation,
for example, is not as difficult as it could be on a general graph, since road networks have some
geometric and geographical orientation. Another characteristic of road networks is that, at every
intersection, the number of streets (and thus the number of choices when navigating) is quite low.
Still, planning efficient routes through a road network is very challenging since these networks
may be huge (millions of nodes) and dynamic (travel times depend on various factors such as the
current traffic situation and road maintenance).

**Railway Networks**

For railway networks, a straightforward model, analogous to road networks, would map stations
and tracks to the nodes and edges of a graph, respectively. Taking a train is represented by traversing an edge. Again, edges can be weighted with distances or travel times. This model arguably
does not represent a real public transportation system very well. In road networks, one can conceivably use a particular street at any time (except for small delays due to traffic lights). In railway
networks, however, trains are bound to a timetable. This may cause significant waiting time at a
station, which needs to be captured by a realistic model. Indeed, modeling railway networks with
timetable information is more involved than modeling road networks. In the time-expanded model for example, nodes represent both a location and a specific time simultaneously.
One station then corresponds to several nodes in the graph. Edges have different types; traversing
an edge means either taking a train, waiting at a station, or walking to a different track within a
station. Edge weights can be travel times (possibly walking or waiting times) or ticket prices. In
general, the graphs derived from railway networks and timetables are considerably more complex
than the graphs representing road networks.

**Airline Networks**

Air traffic networks can also be represented by graphs. Airports can be modeled by the nodes
of a graph. Two nodes are connected by an edge if there is an aircraft that can start and land on
and cover the distance between the corresponding airports. In this graph, almost all nodes are
connected. While in road networks, each intersection had up to half a dozen of connections, in
air traffic networks, airports can have hundreds of connections. Consequently, the corresponding
graphs are significantly more dense. For the moment, let us restrict the edges of the graph to
the routes that commercial airlines offer. Even when considering the graph with this restricted
edge set, some nodes have hundreds of adjacent edges, due to the fact that airlines often have a
small number of hubs (usually big airports) where all their routes connect. The airline network
is a so-called small-world network: one can connect any two airports in the network by
following a sequence of only one to five connections. From a graph-theoretic point of
view, airline networks are structurally very different from road and railway networks. Networks
without apparent structure are often called complex networks.

Various complex systems are highly interconnected: phenomena that were assumed to be local only are sometimes unexpectedly shown to influence the other end of the system. Researchers from fields such as physics, computer science, biology, and social science analyze these systems to explain how (and sometimes why) everything is connected. The objectives of network scientists are (1) to make connections in real-world systems explicit («to connect the dots»), (2) to analyze and understand the network structures formed by these connections, and, possibly, (3) to exploit the structural properties of these systems. The challenges are manifold. Extracting the connections of complex systems can already be very difficult since these systems tend to be inherently distributed and very large. Once the connections have been made explicit, that is, once a graph (in this context also termed complex network) has been created, the next challenge is awaiting. Due to the often massive size of the graph (millions of nodes and edges are not uncommon), analyzing its structure requires sophisticated methods, techniques, and tools. Researchers often use methods to decompose the graph into different clusters and communities of smaller size, which are easier to understand than the whole graph. Another approach is to focus on "important" nodes within the graph. Important nodes are conjectured to be those that are well connected with most of the other nodes, or those that enable important connections between other nodes.

Although these systems and the corresponding complex networks appear to lack structure, they still have some important commonalities: they show a large variation in the number of links connected to each node; these networks often have a few nodes that are very well connected while a majority of the nodes have only few interactions; the number of interactions appears to follow a power law. These networks are called scale-free networks. Also, many complex networks are small worlds, which means that, although many nodes only have a few direct interactions, they are still connected to all the other nodes through very short chains of interactions.

In the following, we discuss some specific complex networks in more detail.

**Citations among scientific articles**

The graph modeling citations among scientific articles is a directed graph, which means that its
edges are oriented. The node corresponding to article A has a directed edge to the node corresponding to article B if and only if article A cites article B. The citation of a research article by
another one is an explicit relationship indicating influence and dependence. An article that gets
cited by many other articles is likely to have had some influence on many researchers. For this
reason, the impact of an article is often measured by the number of citations an article received.
Researchers and managers evaluate the success of a research article based on its impact. There are
also potential applications other than evaluating and ranking articles and researchers. For example, the citation graph may be of use in a system that automatically suggests reviewers for articles
submitted to a journal.

The network of citations among scientific articles was in fact one of the first complex networks ever analyzed. It has been found that, over time, articles that are cited by many
others, acquire even more citations. This cumulative advantage, (later termed preferential
attachment), appears to be very common in complex networks.

**Web Graph**

The web consists of billions of pages that are connected by hyperlinks. For the web structure, a
directed graph is a suitable model. Web pages and hyperlinks are
mapped to nodes and edges, respectively. The resulting structure is also called web graph. Similar
to the citation between two scientific papers, the hyperlink between two web pages indicates some
dependence and influence. A scientific article is considered important if it is cited by many others;
analogously, a webpage is considered important if it is linked to by many others. Modern search
engines often make use of this network structure when ranking search results.

An accurate snapshot of the web graph is hard to obtain: the number of web pages is massive,
the pages reside on different servers all around the world, and many pages change constantly.
Due to these reasons, researchers analyze subsets and samples of the web. Based on such small
samples, it is conjectured that the web graph is scale free, which means that the number
of incoming links per web page obeys a power law. It is also conjectured that most pages that are
inter-linked, are connected by rather short chains of links. Based on the characteristics
of sample web graphs, researchers also build mathematical models, using which the future structure of the web graph can be predicted.

Weblogs ("blogs") form a very popular part of the web. Since good content on the
web is proportionally rare and since most blog writers (termed "bloggers") also read other blogs,
interesting ideas, rumors, and stories propagate very quickly within the blogosphere. Some blogs
are read by millions of readers and the corresponding writers have the power to influence their
audience to a certain extent. It is thus no surprise that companies try to make use of this "network
of influence" when launching new products or services. The analysis of the blogosphere is indeed
a vibrant research topic.

**Router Connections**

Since the Internet has become an integral network handling a significant percentage of business
transactions both among companies and between companies and individuals, any technical failure
might have drastic consequences in the real world. The Internet is nowadays considered being
critical infrastructure almost at the level of transportation and power networks.

Although the Internet is a physical network consisting of routers and cables (modeled respectively by nodes and edges of a graph), there is no accurate map available. Researchers try
to create such a map by sending messages to random computers while monitoring (using the
tool `traceroute`) which routers the messages went through. These
measurements yield that the Internet (or some of its subgraphs) also has properties of power-law
graphs. However, it is not clear whether these properties were obtained due to measurement bias
in the map creation process.

The "speed of the internet" highly depends on the efficiency of the internet protocols, in particular the routing protocols. The performance of a protocol highly depends on the underlying
network structure. Since speed matters and due to the importance of the Internet as critical infrastructure, the network of router connections is a very important network to understand and analyze.

**Networks in Biology**

Traditionally, proteins are identified based on their actions as building blocks, catalysts, or signaling molecules. The network view identifies a protein by its physical interactions with other proteins. This yields its contextual role in a protein-protein interaction network. Proteins and interactions are represented by nodes and
edges of a graph, respectively.

Recently, many networks have been extracted from biological data. Examples other
than protein interaction networks include metabolic networks, which encode biochemical reactions
between metabolic substrates, and transcriptional regulatory networks, which describe
the regulatory interactions between different genes.

Network science may help to understand the human body and its internal actions, for example by determining the role, function, and essentiality of proteins or genes by
analyzing shared interaction partners in the protein-protein interaction network. This knowledge
may then help to find treatments for diseases such as cancer by identifying drug targets.
Another potential application in medicine lies in the intersection with neuroscience. Abnormal
interaction patterns in the brain could help diagnosing neurological disorders. These applications indicate that the potential of an interplay between biology and network science is enormous.

**Social Networks**

Around 1929, the Hungarian writer Frigyes Karinthy formulated the following challenge:
find a person who you can not be connected to by a friendship chain of at most five people. In
other words, Karinthy conjectured that everybody knows everybody else through a short chain of
personal connections. Milgram, in his famous small-world experiment,
tried to verify this conjecture with an empirical study. Participants in the Milgram study were supposed to deliver a letter addressed to a specific person, not by mailing it directly, but by forwarding
it to somebody they know on a first-name basis, somebody who is more likely to know the addressee. Milgram was interested in the number of forwarding steps, which corresponds directly
to the length of a friendship chain. Although most letters never arrived, many letters reached the
addressee after a very small number of forwarding steps only. «It's a small world!» Milgram's
result exhibits a stronger statement than Karinthy's conjecture. For two individuals, the friendship
chains connecting the two are not only short, it is also possible to "find" such a chain.

The small-world phenomenon has been fascinating people around the world for many years.
For example, the popular Hollywood movie »Six degrees of separation« is devoted entirely to small-world phenomena. As another example, the members of some communities "measure" their interaction distance to distinguished individuals. Actors are interested in their co-starring distance to Kevin Bacon, Mathematicians care about their collaboration distance to Paul Erdős, and players of the popular game Go measure their distance to Honinbo Shusaku.

In social networks, we consider individuals as nodes and relationships as edges
of a graph. If the relationship is friendship, social networks are far more than a scientific playground: online social networking platforms such as Facebook, MySpace, mixi, and others have
conquered the web and their corresponding websites attract millions of users. Privacy aside, these
social networking websites generate massive data sets that are of huge interest for advertising
companies and marketeers. Some of this data is made available to scientists as well (usually after
anonymization). Companies also analyze their internal communication patterns to improve productivity and to identify leader personalities. Social networks can also be extracted from
phone call and instant messaging data. Other data on social relationships may
be harder to collect. If an edge of the graph indicates sexual interaction (not necessarily a subset
of the friendship edges), not everybody may be willing to reveal all connections. In a sample
of 2,810 Swedes, the number of sexual interactions per person showed the structure of a power
law. Although the exact connections were not retrieved, the power-law distribution in
the number of connections is already consequential. Epidemics tend to arise and propagate very
fast in power-law networks. It is also known that the speed of disease propagation can potentially be reduced by prevention campaigns that strategically target those individuals
with a large number of partners. The previous example indicates that knowledge and
understanding about the structure of complex networks may have an impact on the real world.

**Network of networks and techno-social systems**

The spread of a general infectious disease depends on interactions of different types. One network is not enough; several networks have to be considered simultaneously to find effective containment strategies in urban social networks.
Techno-social systems consist of a technological component, often physical infrastructure such
as transportation systems, and a human component, which could be any form of communication,
potentially influenced by a social network. Understanding the interplay of both components is
necessary to predict, and eventually control and influence, techno-social systems.

It now seems possible to imagine the creation of computational forecasting infrastructures that will help us design better energy-distribution systems, plan for traffic-free cities, anticipate the demands of Internet connectivity, or manage the deployment of resources during health emergencies. [Vespignani]

Once we have performed the abstraction by making connections explicit and modeling them by a graph, we can work with this model and compute properties of the graph without considering the exact details of the underlying network. Interesting properties could be graph distances between nodes. Tobler invoked

... the first law of geography: everything is related to everything else, but near things are more related than distant things. [Tobler]This law ought to be generalized to a law holding for many networks. An edge of a graph essentially indicates a relationship between the two corresponding entities: between two interacting proteins, between two friends, between two intersections in a road network, or between two routers connected by a cable. If this relationship is somewhat transitive,4 which means that, for example, if Paul is friends with Linton and Stanley, then Linton and Stanley are probably closer to each other than if they were not friends with Paul. Or, to consider another example, if protein A interacts with and influences protein B, and B influences C, then A has an indirect influence on C. The shorter the chain of interactions, the stronger the influence. This is

In graphs, the distance between two nodes *s* and *t* (source and target, respectively) is defined as
follows. If s and t are connected by an edge, their distance is 1. If they are not directly connected,
the distance is defined by the length of a shortest path between s and t, which is a sequence of
adjacent edges. In a weighted graph, the length of a path is defined by the sum of the weights of
the edges on the path. Consequently, shortest paths are defined with respect to these weights. Note
that, in weighted graphs, even if two nodes are connected by an edge, depending on its weight, the
edge is not necessarily part of any shortest path.

To each of the three questions at the beginning of this chapter the answer is a shortest path in the corresponding network: the fastest way to get to Ottenbach, the shortest sequence of friends to get in touch with Nelson Mandela, and the fastest route to a webserver. Many other questions also have shortest paths in graphs as answers. Whenever we must traverse a network or send someone or something through a network between two points in a fast, cheap, or reliable way, it is likely that solving a shortest path problem will provide the optimal solution. Other applications of shortest path computations include (but are by no means limited to) network optimization, packet routing, image segmentation, computer-assisted surgery, computer games, DNA analysis, injection molding, postman problems, operator scheduling, production planning, re-allocation of resources, approximation of piecewise linear functions, and VLSI physical design. Furthermore, the countless optimization problems that can be solved by deterministic dynamic programming (Knapsack, for example), can also be solved by a shortest path algorithm (on a very large graph) by identifying the stages of the dynamic program with the nodes of an acyclic, directed network. The shortest path problem also has a deep connection to the minimum cost flow problem, which is an abstraction for various shipping and distribution problems, the minimum weight perfect matching, and the minimum mean-cycle problem. More sophisticated algorithms for combinatorial optimization problems such as network flows often need a shortest path subroutine.

Methods to find a shortest path were discovered and analyzed already in the late 50's and early 60's by Bellman, Bock and Cameron, Caldwell, Dantzig, Dijkstra, Floyd, Ford, Fulkerson, Hu, Klee, Leyzorek, Gray, Johnson, Ladew, Meaker, Petry, and Seitz, Minty, Moore, Mori and Nishimura, Parikh, Rapaport and Abramson, Shimbel, Warshall, Whiting and Hillier, and probably others. Despite the exponential growth of computing research, many of their algorithms are still in use today in one form or the other, most prominently Dijkstra's algorithm, which literally every computer science student learns as an undergraduate. Dijkstra's algorithm solves the Single Source Shortest Path (SSSP) problem: the algorithm starts at one point (the source) and explores the whole graph in all directions until the distances to all the other nodes are known. For an illustration of the shortest path tree in a transportation network, see Figure 1.2. If the user is interested in the distance to one target node only, it is possible to stop the search as soon as this specific target has been found. Some of the other classical algorithms solve the All Pairs Shortest Path (APSP) problem: the distance and the shortest path between all pairs of nodes is computed.

The importance of the problem and the numerous applications have stimulated research efforts for more than fifty years. A large fraction of these efforts targets the shortest path problem in transportation networks, since route planning is arguably one of the most important applications of shortest path algorithms. Intelligent navigation systems know the current location using GPS and guide cars along the shortest route to minimize travel time, distance, or fuel and energy consumption. Some of these systems also react to changing traffic conditions and, in the future, cars may communicate with each other and negotiate routes to regulate the overall traffic flow and manage congestion. The end users are often interested in trip planning, which means that they want to know the distance and the shortest, cheapest, or most reliable path between two specific points.

We ask for the distance between two points of a network. Recall that Dijkstra's algorithm solves the Single Source Shortest Path (SSSP) problem by exploring the whole graph starting from the source until the distance to all other nodes is known. Dijkstra's algorithm can be stopped prematurely, as soon as the target has been found. Another improvement to decrease the running time is to start a search from both the source and the target and stop when the two searches meet. Still, Dijkstra's algorithm may explore the whole graph. A complete exploration is too expensive. The objective is to develop a method that is much faster than Dijkstra's algorithm. In return, the method is allowed to pre-compute a data structure that, later, assists shortest path computations, called queries. We thus assume that the graph is known some time before the first query is asked.

Shortest path query processing resembles a typical data base problem: create an index by
materializing information to speed up certain queries. One strategy would be
to precompute the result for every possible query. This is expensive in terms of time and space;
storing all the results may be prohibitive since the number of possible queries is quadratic in the
number of nodes. No precomputation, the other extreme, is too slow at query time. We aim for
a mid-point, a good compromise, between *no precomputation and complete processing* at the
time of query and *complete precomputation and a simple look-up* at the time of query.

Shortest path query processing is an integral part of many applications, in particular Geographic Information Systems (GIS) and intelligent transportation systems. A challenge for traffic information systems or public transportation systems is to process a vast number of customer queries on-line while keeping the space requirements as small as possible. Transportation planning systems and GIS form arguably the most established application scenarios for shortest path algorithms. The following sections list, without being exhaustive, other motivational examples of contemporary applications, where computing short paths is a key component and where a method that allows for efficient shortest path queries may potentially speed up the total computation significantly.

**Traffic Simulations**

Two objectives of traffic simulations are to forecast future traffic patterns and to predict the consequences of certain
changes to the road network. While sometimes the consequences are easily predictable and intuitively clear, there are paradoxical situations where closing a road actually improves the overall
traffic situation. Simulations can help to identify these counterproductive roads. Simulation results may also help in urban planning to reason about the
economic and social impact of building new roads. With realistic estimates of population mobility
and parameterized models for simulating the progress and transmission of a disease, simulations
may improve predictions in public health and epidemiology.

Network-based simulators assume that cars drive along shortest paths. Within the simulator,
at each virtual time step, entities are routed one step further from the source to the destination
along a shortest path. The simulation requires the computation of a large number of shortest
paths. Therefore, various simulators
can potentially reduce the total running time by exploiting a fast query processing method. Since
in the real world drivers rarely use the exact shortest path, even a method returning approximate
shortest paths may be of help.

**Image Segmentation**

Image segmentation is an integral part of image processing applications
such as accident disposal, medical images analysis, and photo editing. The image segmentation
problem is to group together neighboring pixels whose properties are coherent. The grouping
process often relies on shortest path computations. Object surfaces are in some sense continuous. Continuous shortest paths are computed with the fast marching method,
which can be seen as a continuous version of Dijkstra's algorithm. However, since the discrete
lattice is the standard reconstruction of image data, discrete algorithms are also used in
many applications. Graph-based algorithms consider the image as a graph in which each pixel is
a node, connected by an edge to each of the 4 (or 8 or more) neighboring pixels. An important
part is to assign appropriate weights to the edges, based on the image "potential" of the two pixels
and their Euclidean distance.

The graph-based intelligent scissors algorithm heavily relies on shortest paths. The
user provides two endpoints; the algorithm computes the result of the corresponding point-to-point shortest path query and the resulting path is interpreted as a part of the object boundary.
In other words, given two endpoints of a contour, the algorithm determines the maximum likelihood contour connecting them. The edge weights are probabilities and, by taking the negative
logarithm, the algorithm can just sum up the weights on the path to retrieve the optimal contour. For boundary computations in 3 dimensions, the intelligent scissors algorithm must be
adapted. A boundary can be found by combining many shortest paths until
a closed surface is obtained.

In computer vision, shortest path algorithms on weighted graphs have found
numerous applications other than segmentation such as centerline finding, radiation therapy, mesh morphing, video summarization,
and finding roads and trails on satellite images.

**Drug Target Identification**

Shortest paths in interaction graphs are important in systems biology. Signaling paths for example
are routes along which one molecule can affect another one. The average path
length already reveals valuable information about a cell or a body.

In a sense, the average path length in a network is an indicator of how readily "information" can be transmitted through it. Thus, the small world property observed in biological networks suggests that such networks are efficient in the transfer of biological information: only a small number of intermediate reactions are necessary for any one protein/gene/metabolite to influence the characteristics or behavior of another. [Mason and Verwoerd]The average path length is related to a graph's Wiener index, which could potentially be approximated using random sampling and (approximate) point-to-point shortest path computations.

The next potential application leverages information on paths other than their average length. We wish to know which proteins, genes, and metabolites are more important and powerful than others, meaning that they have a strong influence on others. These may be suitable targets for medication. Various graph-theoretic centrality measures somehow correlate with the importance of a node. The degree centrality for example counts the number of neighbors of a node. Intuitively, the more interaction partners a protein has, the higher its potential to influence others. More complex measures distinguish purely local effects (such as the number of interaction partners) from global organizational effects. For example, if a protein is on many reaction pathways or signaling paths, it has some potential for control. It can stop, or at least slow down, the reaction by breaking the chain of interactions, due to the fact that it lies between two indirectly interacting proteins. The number of shortest paths that a node takes part of is called its betweenness centrality. Betweenness accounts for direct and indirect influences of proteins at distant network sites and hence it allows one to relate local network structure to global network topology. Closeness centrality measures how far away a node is from all the other nodes. Degree centrality counts the mere number of interaction partners.

An important application of network centrality in pharmaceutical research could be drug target identification. One of these potential target genes is p53. The loss of p53 function is very damaging: p53 is among the genes most likely to be mutated in cancers. In fact, p53 function loss occurs in nearly all human cancers. It turns out that p53 corresponds to a highly connected node in the interaction graph.

Centrality also allows to predict protein essentiality. Interesting proteins are those with high betweenness-centrality, yet low local connectivity. Their low connectivity would imply that they are unimportant, but their high betweenness suggests that these proteins may have a global impact, acting as important links between modules. The removal of a protein can have different phenotypic effects including lethal or non-lethal effects and a slow-down of growth. There is a positive correlation between lethality and connectivity. The most highly connected proteins in the cell are the most important ones for its survival. The more essential a gene (or its associated protein) is to a pathogen or to a cancerous cell, the more attractive it is as a drug target. So far, however, there is not much evidence that for two nodes with the same number of interactions the node with higher betweenness centrality is significantly more important. Currently, degree centrality serves as a good indicator. Nevertheless:

In most of the studies, [...] the centrality score of a node was found to be indicative of its likelihood to be essential. In particular, this appears to be true for degree centrality, betweenness centrality, and eigenvector centrality measures. [Mason and Verwoerd]The computation of centrality indices has been studied and since exact sequential methods are rather slow, there are efforts to parallelize the computations and to find approximate solutions. Since the biological networks are often sampled and have some errors, centrality measures are not exact anyway. Depending on the application, approximations may be sufficient. Centrality algorithms rely on the computation of functional pathways and they could thus benefit from fast (approximate) path computations.

**Community Detection**

Complex networks are often huge and thus difficult to analyze. One way to obtain an understanding of complex networks is to decompose the network at hand into related components,
communities, and clusters. Several algorithms for clustering and community detection have been
proposed. The algorithm by Girvan and Newman has been applied successfully to a variety of networks, including several
social and collaboration networks, metabolic networks, and gene networks. Their algorithm iteratively removes edges with high betweenness centrality. If a network contains highly-connected
communities that are only loosely connected by a few edges between clusters, then all shortest
paths between different communities must go along one of these few edges, which will therefore
have high edge betweenness. Removing these edges separates the communities from each other.
The method works very well for small graphs but it does not scale due to its high computational
demand. Again, computing the betweenness centrality is the bottleneck. If there are only few edges between different clusters, any approximate shortest path query
method would arguably also detect these edges. Approximating betweenness centrality or just
finding edges with high betweenness centrality can potentially be sped up using fast point-to-point
shortest path queries.

**Social Search**

The sheer size of the web (currently, the web consists of billions of pages), renders the search
for relevant information very challenging. Search engines are expected to find the "needle in
the haystack." The search interface is supposed to be kept simple and the average user is not
entering much information; the search engine must find relevant information without knowing
what the user actually wants (sometimes he may not even know it himself or he may not be able
to express it appropriately). Imagine that a search engine would know basically everything about
the user (this scenario may actually already be reality). The query term combined with what a
user's friends were looking for and which results they liked could enable the search engine to
make a well-educated guess on what the user would want to see. Almost like a recommender
system, the engine can rank the documents matching the search query term just for one
user, based on his interests: a personalized search. It would aggregate the ratings for the
pages retrieved and then assign a higher ranking to those pages that people similar to that user, for
example his friends, liked best. To do so, efficiently retrieving proximity information in the social
graph is essential.

Involving context and social connections in ranking is a hot topic for search engines and computing social distances may soon be a
an important primitive, both for search engines handling keyword queries and for online stores
recommending items for purchase.

**Social Networking**

Shortest paths in social networks seem to be of interest for end users. On the website oracleofbacon.org, for example, users can enter two names of actors and

Such a webpage also
exists for Mathematicians. In professional networking sites such as LinkedIn or XING, users
can add their business contacts in order to get in touch with potential clients or employers through
a short chain of personal introductions. The corresponding webservers compute point-to-point
shortest paths in an online setting. *the database server uses a
breadth-first search (BFS) to find the shortest path between pairs of actors.*

Analogous to the Erdős number project there is a social network based on collaboration on scientific articles in various fields. Two scientists are considered connected if they have co-authored
one or more scientific papers together. One may assign weights to these connections
based on how many papers two authors share. Such weightings potentially capture the strength of
a relationship more appropriately. Shortest paths between scientists may help to
establish new professional contacts by following a sequence of personal referrals. Such systems
have been built years before online social networks became popular.
Scientific collaboration networks are also used to enhance communication at conferences. Some
researchers built a system called DEAI Explorer that visualizes how two scientists
standing in front of a screen may know each other (a common affiliation such as that they wrote
a paper together, they both gave a talk at the same conference, they have a common co-author,
or they cite each other's papers), which is supposed to help them communicating. The DEAI Explorer system computes relationships and connections between persons up to distance 4.

All these systems could potentially benefit from fast shortest path query processing.

**Message Routing**

Forwarding a message from a sender to a receiver through a network is called routing. Many
routing algorithms are variants, in one form or another, of shortest path algorithms that route
packets over a path of minimal cost. The cost of an edge may reflect transmission capacity,
latency, congestion, error rates, and other features. On the chosen path, routers must know where
to forward a packet to. These decisions are made based on the information in the packet and the
routing table at the router. It is integral that this information is kept small (compact) while paths
remain short. Research in compact routing addresses this tradeoff. Compact routing focuses on
distances and ignores other influencing factors such as the quality of service provided. It is assumed that these factors are abstracted out and aggregated
in the edge weight function. Large routing tables are difficult to cope with.

As more and more networks get connected, the memory required for storing the routing tables grows. This memory requirement varies a lot with the routing protocol and with the router's architecture. In fact, the problem may appear in multiple ways. The phrase «routing table explosion» is merely a catchall term for all the problems posed by the manipulation of very large routing tables. [Huitema]The tradeoff between routing table size and route quality is basically the distributed version of shortest path query processing. Each router gets some part of the index, based on which it has to make best-effort decisions. The routing tables should be small and routing decisions quick while paths remain short.

Distance approximations for networks may be of interest for end users as well. If a sender can choose among different destinations, for example before downloading a large file from one of several replicated servers holding the same data, it is beneficial to predict the round trip time for each of the servers prior to actually communicating. This proximity estimate helps choosing the optimal server and connection.

**Conclusion**

The numerous applications of point-to-point shortest path query processing make the following
claim easy to believe. The organizers of the DIMACS implementation challenge on shortest paths,
Demetrescu, Goldberg, and Johnson, state that

... shortest path problems are among the most fundamental combinatorial optimization problems with many applications, both direct and as subroutines in other combinatorial optimization algorithms. Algorithms for these problems have been studied since the 1950's and still remain an active area of research.This thesis investigates the tradeoffs between pre-computation time, storage, query time, and approximation quality, both from a theoretical and a practical point of view.

Home → Publications → Approximate Shortest Path and Distance Queries in Networks