Although the Bulk Synchronous Parallel (BSP) model for scalable parallel processing was invented by Leslie Valiant in the 1980s (and was cited as part of the reason for Valiant’s recent Turing award), it became a popular model for scalable processing of graph data in 2010 when Grzegorz Malewicz et. al. from Google published their seminal paper on Pregel in SIGMOD 2010 (http://dl.acm.org/citation.cfm?id=1807184).
Scaling the processing of graph data (such as social network data, telecom data, and public health disease outbreak data) is notoriously challenging. Unlike set-oriented relational database processing which is fairly straightforward to parallelize, graph processing typically requires traversing through paths in a graph. Given the high branching factors of most real-world graphs, this results in non-local access patterns, and thus a high dependence on vertices and edges being stored or processed by remote processors. The reliance on remote data makes parallel processing much more difficult, as each processor may have to wait for arbitrary lengths of time for remote data.
Pregel’s use of BSP for processing graphs does not eliminate the non-local access pattern problem. However, by eschewing asynchronous communication models that can lead to deadlock and other types of race conditions that make programming scalable algorithms extremely complex, BSP makes it much easier for programmers to reason about the semantics of their graph algorithms. In particular, the programmer of a graph algorithm only needs to focus on what happens during a “super-step”. In each super-step, each vertex, v, in the graph can apply an arbitrary function on its own local state, based on messages sent to v from any other vertex in the graph in the previous super-step. The output of this function is a new state for v (and also potentially new states for its outgoing edges), and a set of messages that can be sent to any arbitrary vertex in the graph that will be processed in the next super-step.
In other words, processing proceeds in a BSP graph system via an alternating series of completely local function applications done in parallel on each vertex in the graph, interspersed with an arbitrary amount of message passing between vertexes. The system proceeds between these two steps “synchronously” — the next round of per-vertex function applications cannot begin until the previous round of message passing is complete. Although in distributed systems “synchronizing” is usually synonymous with “performance slow-downs due to waiting for other nodes to catch up,” these synchronization barriers are actually relatively cheap in graph processing systems since there are typically many more vertexes than there are processors, so a significant amount of work occurs per processor before synchronization, thereby allowing the cost of synchronization to be amortized by the larger amount of useful work.
It turns out that many popular graph algorithms can be elegantly expressed in terms of these “per-vertex” functions that get applied in each super-step of the BSP algorithm. The Pregel paper itself describes the implementations of Page Rank, Shortest Paths, Bipartite Matching, and a Semi-Clustering algorithm using this model. Thus, the combination of programmer friendliness, along with the relatively small cost of the synchronization barriers, has led to the BSP model for graph processing to gain enormous popularity over the four years since the Pregel paper was published. For example, the original Pregel paper has already received close to 1000 citations, and several new graph processing systems that adopt the Pregel BSP model have been recently developed, including Apache Giraph, Apache Hama, GPS, Mizan, and Teradata Aster’s SQL-GR framework.
However, Teradata Aster’s SQL-GR framework differs from the other BSP graph systems mentioned above in that it is implemented inside an analytical database system. This greatly increases both the power and efficiency of the system, which the rest of this blog post will discuss.
For example, let’s say a Telecom company has a data warehouse with a “fact table” consisting of a history of call detail records (CDRs) containing the phone numbers of the originator and receiver of each call, the start time and duration of each call, and various other routing and failure information associated with each call. Furthermore, it contains various dimension tables containing customer address and billing information, router location information, etc.
For most of what the company wants to do with this data (e.g. billing, fault analysis, statistic generation, population of customer-viewable information about their call history, etc.), the above-described relational layout with a CDR fact table and dimension tables with more detailed information is the right way to go. Dimension tables are joined with the fact table as needed, and some may even be pre-joined with the fact table if a column-oriented layout is used (since large, wide tables are not problematic for column-stores).
However CDR data is also a gold-mine for doing customer churn analysis because it contains information about real connections between people — who is calling whom, which numbers are popular, who are the bridges between different communities of customers, etc. For this type of analysis, it is helpful to think of CDR data as a graph, where each customer (or phone number) is a vertex in the graph, and there are edges in the graph for each CDR record corresponding to calls and SMS texting between two customers (phone numbers).
Once a graph out of the CDR data is constructed, graph algorithms that calculate the “eigen centrality” of each vertex and “betweenness” of each vertex are extremely useful to understanding which vertices are key influencers and control information flow — two characteristics that are critical to any kind of customer churn analysis. The eigen centrality of a vertex is a direct measure its relative importance within the graph — if there are many edges that connect to it, especially if those edges are of high “quality” (i.e. they originate from other highly important vertexes), then that vertex will have a high eigen centrality. For example, if many important people call Mr. Graham, the “Mr. Graham” vertex will have a high eigen centrality –and we can thus infer that he is important even if he receives very few other phone calls. Measuring the eigen centrality of a vertex is necessarily an iterative algorithm — in each iteration we get a new estimate of the importance of each vertex and can use this to refine the quality values of each outgoing edge in order to create a new importance estimate for each vertex in the next iteration.
“Betweenness” is a measure of centrality of a vertex in a graph. It calculates the number of times any vertex is included in the shortest path between any two nodes in the graph. If this number is large, this means that the vertex often serves as a bridge between multiple clusters of nodes in the graph, and has a large amount of power in the control of information through the graph. In our example, these clusters of vertices are communities of people.
Centrality of a Vertex Between Clusters of Vertices
Since both betweenness and eigen centrality are complex, iterative algorithms on potentially large graphs, BSP graph engines of the type discussed above are a great fit for performing these calculations. Trying to perform the same operations in a standard relational database system is close to impossible — each iteration would require a self-join of the table containing all the edges in the graph; but without knowing in advance how many iterations will be required, it cannot be determined how many joins to include in the SQL query. Furthermore, even if the required number of iterations could somehow be deduced in advance, trying to write these complicated graph algorithms in SQL would be highly unpleasant and nearly impossible for other humans to decipher.
Therefore, our example telecom company is a little stuck — on one hand a relational database system is a great fit for most of what needs to be done with the data (billing, fault analysis, stat generation, etc.), but a BSP engine is the right fit for doing graph-oriented customer churn analysis. Since traditional database systems do not process graph algorithms well, the telecom company may be tempted to put the entire dataset in the BSP engine. Unfortunately, while graph engines are great for running graph algorithms, they are not optimized for traditional relational analysis, where standard star-schemas and table-oriented (especially column-oriented) analysis are known to perform extremely well. Therefore, they will see major reductions in performance on their non-graph-oriented queries, and will further have decreased options analytical tools to use in conjunction with their data, since many third-party tools are designed to interface only with relational database systems.
Another option is to simply duplicate the same CDR data in two different systems — one relational system for doing standard analysis and one BSP graph system devoted to analyzing customer churn. Unfortunately, this option is also suboptimal. Aside from the problem of duplicating data in two systems (and the additional associated hardware, administrative, and maintenance costs), there are also a class of queries that require both relational and graph processing techniques.
For example, let’s say the telecom company wants to calculate betweenness metrics not on the original graph, but rather on a version of the graph containing all customers in New Jersey who recently received an overage fee and whose contract is about to complete. The segmenting of the original dataset by these dimensions to get the new graph is exactly what relational database systems are designed for, while the calculation of betweenness metrics are what BSP graph systems are designed for. Performing this query in two separate systems would require the relational database system to perform the first part of the query, then export the new graph before it gets parsed and reloaded into the BSP system for the second part of the query. If further relational processing needs to be performed on betweenness metrics, another export and reload back into the relational database system will have to occur.
Performing these “hybrid queries” in a single system is a far better solution. The entire query can be expressed as a single request, and processed entirely by one system. The SQL foundation adds easy data filtering, preparation, sorting, and reporting options to the BSP processing. For example, given the following two tables with the relevant attributes listed below:
customer_connections — table
source_customer_id: varchar, target_customer_id:varchar
customers — table
customer_id varchar, customer_location:varchar, customer_overage_fee: boolean, end_of_contract_date: date
The query described in the previous paragraph can be expressed in Teradata Aster as:
SELECT * FROM BETWEENNESS(
ON (SELECT * from customer where customer_location=’New Jersey’ and
customer_overage_fee=’t’ and (end_of_contract_date – now()::date < 60) ) as vertices
PARTITION BY customer_id ON customer_connections as edges
PARTITION BY source_customer_id TARGETKEY(‘target_customer_id’)
) order by end_of_contract_date;
Not only does performing the query in a single system eliminate the need to export, parse, and reload data in the middle of queries, but it can also be optimized by a single optimizer that knows the statistics of the generated graphs, and can eliminate redundant operations, apply predicates and projections in the right place of query plans, etc. In fact, Teradata Aster actually has an elegant “collaborative planning” optimization technique that allows hybrid relational-graph queries to be planned and optimized in a unified framework. Since this post is already quite long, we will not discuss this optimization technique here, but I will likely describe it in detail in a future post.
In summary, graph analysis is an increasingly important technique that data scientists will need to apply to many real-world datasets. BSP graph engines are extremely well suited to performing a large class of graph analysis queries; however, they do not replace traditional relational database systems. Since modern workloads will contain both relational and graph queries, including some queries that require both relational and graph techniques within the same query, a two system approach will lead to headaches, slowdowns, and a general lack of optimality. Thus, I believe that many relational database systems will eventually incorporate a BSP execution engine to enable hybrid workloads and analysis. Teradata Aster is at the forefront of this trend.
Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and a M.Phil from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.