Skip to content

Latest commit

 

History

History
282 lines (179 loc) · 31.4 KB

2019-09-06.md

File metadata and controls

282 lines (179 loc) · 31.4 KB

Modern Datalog systems

Both the SIGMOD and VLDB conferences saw a few Datalog papers in 2019. I'm thinking specifically about

I got a chance to speak with the authors at each venue, which (with one notable exception) makes it a bit harder to be highly critical of the work. There might be a lighter touch here than in previous posts, but I hope this just means we cut away the sass-talk and get right to the technical claims.

The tl;dr for me is "evaluations that don't distinguish data loading from computation have nothing to say about systems that run faster than data loaders". A few of these papers didn't make that distinction, at least one intentionally, and their conclusions were all the weaker for it.

Datalog background

Datalog is a declarative logic programming language, in which one defines "derivation rules" that produce new facts from existing facts. Coupled with some initial set of facts, the rules repeatedly produce new facts, which may then produce yet newer facts themselves. The limiting set of facts, from which no new facts can be produced using the supplied rules, is the output of the program.

Let's take an common teaching example, called transitive closure. Imagine we have a collection of facts of the form Knows(a,b), which we take to mean "person a knows person b". We might like to determine for each person a the set of all people they might come to know about, drawn from the set of people they know and who each of them know about.

KnowsOf(a,b) :- Knows(a,b)
KnowsOf(a,c) :- Knows(a,b), KnowsOf(b,c)

These rules have a structure where on the left we have a "head", naming the fact we might derive, and on the right a sequence of clauses indicating things that when each true justify marking the fact in the head as true. Above we mark KnowsOf(a,b) true if a knows b directly, or if b knows of c.

Although a set of input facts and a set of rules determine the output facts that should be produced, there are several plausible ways one might go about deriving the outputs. Most of the "big data" systems are interested in deriving all output facts, arguably because that is what they are best at doing (more on this later).

A common algorithm to produce all output facts is "semi-naive bottom up derivation", which initially marks all facts as "new" and repeatedly applies rules involving at least one new fact. Once processed, a fact is unmarked new, but any newly derived facts (those that did not previously exist) are marked as such when created. This process continues for some time but must eventually terminate (as Datalog programs do not create new symbols, the set of output facts is bounded by the symbols present in the input).

For folks wanting a more in-detail explanation, or just a second attempt, I wrote about Datalog systems back in 2016. Back then it turned out folks were writing systems that compared badly to differential dataflow.

We will have to see if that has changed.

Paper breakdowns

Independent of what you think about each of the described systems, it is worth it to develop the ability to critically read research papers. The main rule is to learn to read for the distinction between "what the author wants to be true" and "what is shown to be true". I'm sure there are fields where these two concepts are closely aligned, but mine is not one of them.

For example, each paper is going to claim that they improve on existing systems. It will turn out that each improves on some existing systems, but not all existing systems. Consequently, this "improves on" relationship may contain cycles. Draw your own conclusions about the inevitability of scientific progress.

Paper 1: Optimizing declarative graph queries at large scale.

Their abstract:

This paper presents GraphRex, an efficient, robust, scalable, and easy-to-program framework for graph processing on datacenter infrastructure. To users, GraphRex presents a declarative, Datalog-like interface that is natural and expressive. Underneath, it compiles those queries into efficient implementations. A key technical contribution of GraphRex is the identification and optimization of a set of global operators whose efficiency is crucial to the good performance of datacenter-based, large graph analysis. Our experimental results show that GraphRex significantly outperforms existing frameworks—--both high- and low-level—--in scenarios ranging across a wide variety of graph workloads and network conditions, sometimes by two orders of magnitude.

This paper has the fundamental distinction of actually looking at large scale graph queries. Or rather, very simple queries on large scale graph data. Their experiments are on a multi-rack set-up with 800 cores able to grind away on relatively large datasets. They get up to 42.6 billion edges, which is not quite the 128 billion edge Common Crawl data, but is more than I am able to casually do on my desktop.

At the same time, the queries themselves are borderline trivial. Their PageRank example is the largest at three lines of code (in their Datalog variant), and all other queries are two lines or fewer. Their connected components program is so short, it is only correct on pre-symmetrized input graphs. There are substantially larger Datalog programs out there, and it is unclear if the system here will generalize to them (it might! but, who knows). For sure differential dataflow needed work, and makes tradeoffs, to support large dataflow graphs in addition to large datasets.

Even worse, when you operate in the realm of trivial Datalog queries, you run the risk that there might be simpler algorithms for your problem than your Datalog implementations. The paper presents connected components computations on billion-edge graphs in tens of seconds, which we have seen before in the COST paper:

System cores UK-2007-05
GraphRex 800 30.9s
label prop 1 417s
union find 1 30s

At the same time, the system does much better on problems like PageRank, for which it appears that clever algorithms bring less to bear. Repeating some COST measumerents

System cores UK-2007-05 iterations
GraphRex 800 9.6s 10
timely 128 19.1s 20
vertex 1 651s 20
hilbert 1 256s 20

For PageRank GraphRex brings an order+ of magnitude improvement over single-threaded execution, using three orders of magnitude more cores. That is possibly a great trade-off, depending on how urgently one needs PageRanks computed, but it isn't a no-brainer. It appears to be a not-obvious improvement over timely dataflow, at least at this point; they compare to timely later on and get numbers that are worse than we reported in 2015 for reasons that are not yet clear.


BRIEF INTERRUPTION

I have a guess about why their numbers for timely are worse, and it is that they ran timely on 800 (maybe 1,600) workers, which IS INSANE!!! What a crazy large number of workers; I've never come close to running with that many workers (at least, not with that many hardware threads).

One thing timely still doesn't do as well as its precursor Naiad is to optimize its background coordination traffic. Naively, this traffic is an all-to-all broadcast, and that gets much worse as you scale up the number of workers. At ETH Zürich, Lorenzo Martini studied this in his masters thesis and observed (Figures 11 and 12) that when this traffic is communicated in a demand-driven manner the volume of coordination traffic drops tremendously, specifically for the problem of PageRank whose running time then improves. The improvement is most noticeable at the larger scales, which for Lorenzo was around 64 workers. I imagine at 800 workers it is even more noticeable.

Going out to 800 workers without this optimization is crazy! Of course, they had no idea that they should use it (my bad) and didn't ask (their bad). Fortunately you can flip the switch with an environment variable, which the paper authers now know and which may produce improved measurements (or perhaps there is a next scaling bottleneck; we don't know yet!)

** INTERRUPTION COMPLETE **


Quantitatively, the paper absolutely beats up on the reported competition, systems like BigDatalog, Giraph, and PowerGraph. And there are good reasons, which the paper goes in to some detail about.

Topology aware aggregation

As Datalog facts flow through the system, between workers, GraphRex takes the opportunity to aggregate the facts down. For example, duplicate facts can be suppressed so that each fact emerges from a worker at most once, a machine at most once, and a rack at most once. As we get larger and larger groups of workers there is more opportunity for and benefit of removing such redundancy.

This approach, of hierarchical aggregation is pretty standard. In the timely PageRank example from above you enable process-level aggregation with the process argument. It would not be especially hard to hack in rack-level aggregation. Differential dataflow has a consolidate_stream method that performs opportunistic aggregation in response to the backlog, though it could use some love (it doesn't do things especially well).

What is nice about GraphRex is that it handles all of this for you, from information about your cluster topology. It is a good example where a declarative description of what you want, "data exchange with the opportunity for aggregation", is better than a specific implementation that chooses when to perform the aggregation. If you had to rebuild (or rewrite) your program each time you change the layout of workers, well that would obviously be terrible.

Adaptive join processing

GraphRex has some rules about how to process multi-way joins that allow tuples to flow along multiple paths, that may depend on stats of the data being processed. This seems pretty similar to work on worst-case optimal joins, which also deal with multiway joins by routing tuples based on cardinalities. Seems like a missed opportunity to relate the two. Based on their writing, I'm inclined to believe they (and the reviewers) are just unaware of the prior work.

Paper 2: RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark

Their abstract:

Thanks to a simple SQL extension, Recursive-aggregate-SQL (RaSQL) can express very powerful queries and declarative algorithms, such as classical graph algorithms and data mining algorithms. A novel compiler implementation allows RaSQL to map declarative queries into one basic fixpoint operator supporting aggregates in recursive queries. A fully optimized implementation of this fixpoint operator leads to superior performance, scalability and portability. Thus, our RaSQL system, which extends Spark SQL with the before-mentioned new constructs and implementation techniques, matches and often surpasses the performance of other systems, including Apache Giraph, GraphX and Myria.

I'm not sure what to make of this abstract. It may be accurate, but having read the paper I'm not even close to understanding the sense in which their fixpoint operator, or as associated compiler implementation, is novel.

The paper is largely about this fixpoint operator extension to SQL, which (and perhaps this is the contribution) looks like just about everyone else's recursive constructs except it is in ALL CAPS and so is appropriate to call a SQL extension rather than a Datalog variant.

The resulting performance improvement is slightly better than their paper containing roughly the same stuff from three years before, which was at the time slower than one differential core on most problems. They've improved their codebase since then, but differential has improved too. We'll have to have a look see and figure out where things are now. But, I'm not clear on the essential scientific contribution here.

One thing I am sure of is that this paper has a fair bit of nonsense wrapped up in their evaluation of other work, specifically single-threaded COST style work. The authors compare to COST implementations and to the excellent GAP benchmark suite. They thoughtfully report that at small input scales they are out-performed by single-threaded implementations, but at larger input scales they overtake these implementations.

Horse shit.

What they do a very bad job of explaining is that their reported times involve not only the time to perform computation they are tasked with, but also the deserialization of the input data from plain text to the native format of each tool. While COST can compute the connected components of the twitter_rv graph in about ten seconds, and with multiple cores GAP can do so in half a second, the authors add on the time to deserialize the data from 25GB of plain text. That adds a hundred seconds to COST, and a few hundred to the GAP measurements (whose deserialization is I guess less optimized).

The "MOD" in SIGMOD is short for "management of data". The conceit of this paper is that in the absence of any data management, literally just storing your input as plain text and re-reading it from scratch each time you need, they might have a point that one thread is not going to cut it. But it is serious bad manners to imply that anyone with any sense would be unable to use these other systems well, and achieve substantially better performance, through the actual management of data. That point never comes up in their paper.

They take about 108 seconds to compute connected components on twitter_rv, by the way. If you are willing to give up both 10x in time and 16x in resources, so that you can store your big data as text, this may be the paper for you. I find it a fundamental failure of the community that it is not presented as such.

I brought this issue up with the senior author, Carlo Zaniolo, and ... let's just say he doesn't see an issue with the paper.

Paper 3: Scaling-Up In-Memory Datalog Processing: Observations and Techniques

Their abstract:

Recursive query processing has experienced a recent resurgence, as a result of its use in many modern application domains, including data integration, graph analytics, security, program analysis, networking and decision making. Due to the large volumes of data being processed, several research efforts across multiple communities have explored how to scale up recursive queries, typically expressed in Datalog. Our experience with these tools indicate that their performance does not translate across domains—e.g., a tool designed for large-scale graph analytics does not exhibit the same performance on program-analysis tasks, and vice versa.

Starting from the above observation, we make the following two contributions. First, we perform a detailed experimental evaluation comparing a number of state-of-the-art Datalog systems on a wide spectrum of graph analytics and program-analysis tasks, and summarize the pros and cons of existing techniques. Second, we design and implement our own general-purpose Datalog engine, called RecStep, on top of a parallel single-node relational system. We outline the techniques we applied on RecStep, as well as the contribution of each technique to the overall performance. Using RecStep as a baseline, we demonstrate that it generally out-performs state-of-the-art parallel Datalog engines on complex and large-scale Datalog evaluation, by a 4-6X margin. An additional insight from our work is that it is possible to build a high-performance Datalog system on top of a relational engine, an idea that has been dismissed in past work.

I'm a big fan of the four sentence abstract.

The main observation of the paper is that one can implement a Datalog system atop a relational data processor. In my opinion they should just say this, because it is plenty interesting.

The paper continues through what seem to be totally competent discussion of the benefits of re-using relational infrastructure (query specialization, re-optimization, various engineering). Several of the "optimizations" exist to limit the pain of using a relational engine (e.g., minimizing the number of distinct transactions), but some of them are things that differential dataflow would have a hard time doing (re-optimization as the query execution evolves).

The paper wraps up with an experimental evaluation, compared primarily against Soufflé (there are other comparisons against sillier systems). To my reading, they keep up (and often overtake) Soufflé, though in a large part because Souffé doesn't exploit intra-operator parallelism. For very simple computations (we will see one) Soufflé fails to impress because there is essentially just one operator, and RecStep on 20 cores works much better than Soufflé on 20 cores, where perhaps the opposite would be true on fewer cores.

Differential dataflow on 20 cores works pretty well. I suspect. I only have 8 here and that worked out pretty great. Hold on for those details.

Paper 4: Automatic Index Selection for Large-Scale Datalog Computation

Their abstract:

Datalog has been applied to several use cases that require very high performance on large rulesets and factsets. It is common to create indexes for relations to improve search performance. However, the existing indexing schemes either require manual index selection or result in insufficient performance on very large tasks. In this paper, we propose an automatic scheme to select indexes. We automatically create the minimum number of indexes to speed up all the searches in a given Datalog program. We have integrated our indexing scheme into an open-source Datalog engine SOUFFLE. We obtain performance on a par with what users have accepted from hand-optimized Datalog programs running on state-of-the-art Datalog engines, while we do not require the effort of manual index selection. Extensive experiments on large real Datalog programs demonstrate that our indexing scheme results in considerable speedups (up to 2x) and significantly less memory usage (up to 6x) compared with other automated index selections.

This is a Soufflé paper. It has a very different flavor, and indeed performs a single-threaded evaluation. Unlike the other three papers, which are largely interested in fairly trivial programs on complex data, this paper looks at Datalog queries with hundreds of rules, which strikes me as a much more plausible representative of real use cases. When we write "hello world" SQL programs they are concise and clear, but industrial uses are utterly filthy; I assume the same is true of Datalog.

But, I mostly just wanted to call out this paper and that it is aimed at non-trivial Datalog queries. It isn't trying to go fast on connected components, which is a much more defensible position to be in. Their queries are complicated enough to not be trivially reproducible, and the data not obviously available, so I'm going to punt on the question of whether they do a good job and just assume so because at least they are asking the right questions.

Evaluations

The three big data papers have a few classes of evaluation they perform. I'm going to try and group them for you, so that we can have a sane discussion about the relative performance.

I should stress that absolute performance is only one way to evaluate these systems, many of which have other criteria they aim to optimize. Each of the Datalog systems delight in the relatively few lines of code one must write. Query compilation time is also some thing that real systems should consider, in the case that they target interactive use, however in my opinion most of these systems are not fit for interactive use for other design reasons (e.g. re-reading data from input as text).

Graph Computation

The standard battery of graph computations on real graphs seems to currently be to run a reachability query, a single-source shortest-paths query, and to determine the connected components. Several folks use three datasets that I have access to, a livejournal graph, an orkut graph, and a twitter graph.

I've collected the measurements from the RaSQL and RecStep papers together in one table, and followed by single-threaded (COST) and differential dataflow measurements (K-Pg) for the same problems. The big systems include the amount of time to parse inputs from plain text, which neither the reported COST nor K-Pg measurements do (please don't store your data in plain text). Because I can, I report the amount of time to build forward and reverse indices from the input data, where the time attributed to the problem is the additional time required (add numbers together to get something that can be fairly compared to other systems).

LJ cores forward REACH SSSP reverse CC
RecStep 20 14 19 39
Souffle 20 49 - -
BD 20 82 82 111
BD 120 17 53 27
RaSQL 120 9 14 17
COST 1 0.40 0.40 0.29
K-Pg 1 4.39 8.50 13.14 7.56 23.97
K-Pg 32 0.55 0.51 0.59 0.41 0.90
OK cores forward REACH SSSP CC
RecStep 20 19 25 43
Souffle 20 71 - -
BD 20 130 138 115
BD 120 20 39 33
RaSQL 120 11 16 19
COST 1 0.46 0.46 0.52
K-Pg 1 14.02 20.33 24.65 21.27 47.79
K-Pg 32 1.22 1.11 0.87 1.05 1.75
TW cores forward REACH SSSP reverse CC
RecStep 20 174 243 501
Souffle 20 OOM - -
BD 20 OOM OOM OOM
BD 120 125 260 307
RaSQL 120 45 81 108
COST 1 14.89 14.89 33.99
K-Pg 1 162.41 256.77 310.63 312.31 800.05
K-Pg 32 12.69 11.36 10.97 14.44 27.48

It's hard to know where the limitations are in the existing systems. They each drop some numbers and then walk away, announcing their success in improving over BigDatalog (BD). They've clearly picked a incredibly soft target there. The relatively simple single-threaded implementations (COST) mop the floor with most of the systems, and are improved on only by few of the differential dataflow configurations. I have a pretty good handle on what DD is "missing" vs COST (chiefly, it doesn't allow itself to rely on dense integer identifiers), but wouldn't it be nice to get this out of the other systems too?

Some fairly massive amount of time goes in to parsing the text into integer graph data. How long does that take, and do any of the comparisons change if you are bright enough to do that only once? Are all of these systems papers being published only because one needs scalable text parsers? Why aren't they compared to single-threaded implementations? It's unclear (personal theory: papers that did compare against single-threaded execution were more likely to be rejected; the publication process selects for favorable rather than throrough evaluations).

We saw a bit of GraphRex and its performance on the UK-2007-05 graph, relative to COST measurements and timely dataflow. I tried another dataset (the friendster graph) and the conclusions appeared to hold (connected components was faster with one core for the Hilbert layout, but slower for the vertex layout). It certainly appeared that GraphRex was at or faster than COST implementations, and might be up there with timely implementations, all while presenting a much more pleasant programming interface for simple queries. It does use a lot more cores, but you probably have those laying around after your Spark investment.

Program Analysis

The RecStep paper looks at relatively concise examples of program analysis from the Graspan paper, and determines that they (like just about any other system) are faster than the Graspan system.

Context sensitive dataflow analysis

The first analysis studies the flow of null pointer assignments through a programs dataflow graph (the directed graph of potential assignments). It is a very simple Datalog program that looks like

X(a,b) := N(a,b)
X(a,b) := X(a,c), E(c,b)

where the output relation X is populated initially by N but then also by paths along X and then along an edge E.

linux postgres httpd
RecStep 430 359 74
Soufflé 122 63 26
BigDatalog 194 171 47
Graspan 11281 6971 335

These numbers aren't a home-run for RecStep, but they aren't meant to be. Rather, they keep up with specialized systems like Soufflé, and they understand why they are slower when they are.

However, these numbers aren't especially great. In this case, Soufflé isn't especially great either, though I'm not able diagnose this here.

Let's add some times from both differential dataflow and single-threaded datafrog. These both include the time to load the data, because that was easy enough to do in this case (the data I have are text).

cores linux postgres httpd
RecStep 20 430 359 74
Datafrog 1 32.7 10.7 3.2
Differential 1 45.3 21.6 6.1
Differential 2 30.7 13.6 4.0
Differential 4 23.1 9.6 2.8
Differential 8 20.3 8.4 2.3

In each of these cases, the time to load the data is roughly half of the eight-core running times. In any case, this seems like a much better baseline for other systems to compete against. The datafrog program takes 0.5s to compile (but longer to write) and the differential program takes 11s to compile, but arguably makes up for it by scaling out and being incrementally maintainable (arguably).

Context sensitive points-to analysis

The context-sensitive points-to analysis is a more complicated analysis, apparently to the point that BigDatalog needs to drop out (it is missing mutual recursion; maybe the RaSQL work added that in or maybe it is still missing).

linux postgres httpd
RecStep 61 162 162
Soufflé 123 195 111
Graspan 548 6670 846?

The ? is there because the number doesn't fit in the figure that reported it. I don't have a Datafrog implementation of CSPA, as Datafrog is more of a Rust eDSL for Datalog, and it requires programming to make it happen. One could totally do that, and it might produce good numbers. Instead, we'll just look at differential numbers.

cores linux postgres httpd
RecStep 20 61 162 162
Differential 1 202 106 116
Differential 2 128 64 67
Differential 4 83 37 40
Differential 8 62 25 26

These numbers look much cooler for the other systems! At least on the linux problem, which is exercising differential pretty well. The other input workloads seem to be good cases for differential, relatively speaking.

An important caveat, differential produces a different number of output tuples from the RecStep system (more by about 1.5x). We haven't sorted out what the gap is there. Neither of us agree with the Graspan numbers, who reportedly had de-duplication issues that lead to overcounting outputs in their paper. Until we sort out who has too many or too few records, don't draw terribly serious conclusions from these numbers.

There is a neat optimization you can do to this query if you only need some of the query relations as output (one of them is massive, and massively redundant). If you snip that out, the times improve thusly:

cores linux postgres httpd
Differential 1 109 41 36
Differential 2 77 26 22
Differential 4 58 16 13
Differential 8 47 10 9

These computations produce the same output as above (possibly incorrect; see above), and are a fair bit faster. They skip materializing the large and silly intermediate relation. RecStep could probably do this too; it relies on differential arrangements, which probably looks like multiquery optimization to RecStep.

Conclusions

I 100% enjoyed talking to the authors of these papers. Well 75%, let's say.

Datalog systems are plenty interesting, and pose challenges not previously faced by RDBMS researchers (viz RecStep). At the risk of flogging a dead horse, if researchers want to make progress in this space, they should strive to measure the right things, which probably doesn't include parsing integers from text. Once they get around to realizing that, they can start to think about how to re-use in-memory arrangements, the way differential dataflow has been doing for years.