Something related that I found out about from HN a few months back is another engine called quokka. It's particularly interesting and applicable how quokka schedules distributed queries to outperform Spark https://github.com/marsupialtail/quokka/blob/master/blog/why...
> KQuery does not yet implement the join operator.
Whilst I applaud this book writing initiative, completing it could easily become a lifetime's work! It will be a fascinating journey to follow along with in any case.
Apache Arrow could (and hopefully will) really shake up the database industry in the years ahead. Whatever eventually supplants Postgres is quite likely going to be based on Arrow - polyglot zero-copy vector processing is the future.
Aside: for anyone looking for a more theoretical overview of databases and query languages, this ~free Foundations of Databases book still holds up well http://webdam.inria.fr/Alice/
Apache Arrow isn't going to make a b-tree or LSM faster or more efficient. It's not going to make a point look-up query faster against columnar storage, or a range scan faster against row-based storage. It doesn't make distributed quorum faster, or otherwise impact consistency and fault tolerance.
Removing or reducing SerDe overhead is great, and for analytical workloads where SerDe can be 30-50% of clock time, something like Apache Arrow is pure magic. For the remaining 9X% of use cases it's not adding any more value than you'd see from protobuf.
I agree Arrow by itself doesn't address any novel/fundamental OLTP challenges, but I'm not arguing that the thing which eventually supplants Postgres will succeed because of best-in-class OLTP performance - anyone needing that today is not choosing Postgres anyway (same as ever).
The real proposition is having a modern, general purpose workhorse underpinned by an ecosystem with strong network effects and polyglot APIs. Assuming analytical systems continue to gravitate towards Arrow I believe the OLTP world will be dragged along also.
Isn't Arrow biased towards analytics workloads? Like the sibling commenter brings up, I'm not sure what that brings to the table for OLTP.
High-performance / ergonomic OLTP in general seems to have been somewhat neglected in the last decade or so of DB innovation, in part, I think, because the needs of the industry have been around mass processing of click/log/event-stream data -- a firehose of privacy violation :-).
So "insert quickly then let me analyze later" is becoming a highly polished stone, but I think databases still present a very rough demeanour around transactional workloads and the "hip tools" out there reflect that.
I'm personally all for supplanting SQL but only if what replaces it has a sound relational basis.
As for the Alice book, it's great, but it's quite focused on Datalog. Which I think is awesome and an area of my own interest, but not in the mainstream of what people think about in terms of "databases", though I wish it was. Most practitioners in software development unfortunately think of a database as "persistence" and not "data management"... and hence the heaps of abuse brought on through ORMs etc.
> So "insert quickly then let me analyze later" is becoming a highly polished stone, but I think databases still present a very rough demeanour around transactional workloads and the "hip tools" out there reflect that.
> I'm personally all for supplanting SQL but only if what replaces it has a sound relational basis.
100% agreed. I think the status quo UX is far more rough than the performance. Unfortunately it seems performance is what drives most investment these days unless you are a producer of hip tools.
> As for the Alice book, it's great, but it's quite focused on Datalog
True, although I posted more for the tour of Relational Algebra fundamentals. I guess the SQL section is probably the most out of date part of the book :)
"True, although I posted more for the tour of Relational Algebra fundamentals. I guess the SQL section is probably the most out of date part of the book :)"
I think a better, more accessible, book for the relational foundations is maybe Chris Date's "Database in Depth: Relational Theory for Practitioners". Though it goes out of its way to avoid dirtying itself with SQL and uses some of his "Tutorial D" concepts instead, it gives a good "practical" look at the relational model that I think "practitioners" would find understandable, though it's not without its eccentricities.
"> I'm personally all for supplanting SQL but only if what replaces it has a sound relational basis.
100% agreed. I think the status quo UX is far more rough than the performance. Unfortunately it seems performance is what drives most investment these days unless you are a producer of hip tools."
I applied for and took a job at RelationalAI last fall because I loved what they were doing with building out a system that was really strong on the relational fundamentals. They have a kind of Datalog-ish language (but better/more practical imho) that is quite nice. The talk that Martin Bravenboer did for the CMU lectures really sold me on it. It was what I had been looking for for years: a purely relational alternative to SQL (without going off the rails into wonky, badly-thought-through NoSQL network-hierarchical-graph land).
Couldn't make the job work for personal reasons, but was very promising at first.
Unfortunately their direction seems to have gone elsewhere, basically becoming a graph analytics plugin inside someone else's SQL DB: https://relational.ai/blog/pr-snowflake-summit -- I understand why they took that path, but it's less exciting.
Anyways it's interesting to see how people talk about PostgreSQL and SQL generally on this forum. Like, as if we can't do better. But if someone tries to do better they go off making something without understanding relational foundations and you end up with something unsound like Redis or MongoDB.
> Whatever eventually supplants Postgres is quite likely going to be based on Arrow - polyglot zero-copy vector processing is the future.
Can you elaborate on this? I understand it's a very opinionated statement, but I still don't see how "polyglot" and "vector processing" could be considered the future of OLTP and general-purpose DBMSs.
Polyglot means not having to fight with marshaling overheads when integrating bespoke compute functions into SQL, or when producing input to / consuming the output from queries. This could radically change the way in which non-expert people construct complex queries and efficiently push more logic into the database layer, and open the door to bypassing SQL as the main interface to the DBMS altogether.
Vector processing means improved mechanical sympathy. Even for OLTP the row-at-a-time execution model of Postgres is leaving a decent chunk of performance on the table because it doesn't align with how CPU & memory architectures have evolved.
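To make the "mechanical sympathy" point concrete, here's a toy sketch in plain Scala (made-up names, nothing to do with Postgres internals): summing a column through a row-at-a-time iterator versus over contiguous column batches. The batched loop is the shape that caches, prefetchers and SIMD-friendly JIT code can actually exploit.

// Toy illustration only: row-at-a-time vs. batch-at-a-time execution.
final case class TripRow(id: Long, fareAmount: Double)

// Volcano-style: one call per row, poor cache locality.
def sumRowAtATime(rows: Iterator[TripRow]): Double = {
  var total = 0.0
  while (rows.hasNext) total += rows.next().fareAmount
  total
}

// Vectorized: operate on one contiguous column chunk at a time.
def sumVectorized(batches: Iterator[Array[Double]]): Double = {
  var total = 0.0
  for (batch <- batches) {
    var i = 0
    while (i < batch.length) { total += batch(i); i += 1 }
  }
  total
}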
Honestly, I can't envision a near future where SQL is not the main interface. Happy to see the future proving me wrong here though!
While I can buy the arguments about how having a better data structure for communicating between processes (on the same server) could help, it's a bit difficult to wrap my mind around how Arrow will help in distributed systems (compared to any other performant data structure). Do you have any resources for understanding the value proposition in that area?
Same for vector processing: it would be great to read a bit more about optimizations that would help improve Postgres, leaving pure analytical use cases aside.
> it's a bit difficult to wrap my mind around how Arrow will help in distributed systems
Comparing it with the role of Protobuf is perhaps easiest; there's a good FAQ entry [0] which concludes: "Arrow and Protobuf complement each other well. For example, Arrow Flight uses gRPC and Protobuf to serialize its commands, while data is serialized using the binary Arrow IPC protocol".
This will be increasingly significant due to the hardware trends in network & memory (and ultimately storage too) compared with CPUs. I posted about that in a comment a few days ago [1], but it's worth sharing again:
> here’s a chart comparing the throughputs of typical memory, I/O and networking technologies used in servers in 2020 against those technologies in 2023
> Everything got faster, but the relative ratios also completely flipped
> memory located remotely across a network link can now be accessed with no penalty in throughput
I am no expert on Postgres, but the thread seems to suggest the default out-of-the-box JIT performance is actually more efficient than the custom vectorized executor that was built for the PoC. That probably rules out any low-hanging optimizations based purely on vectorization for OLTP specifically, but there are undoubtedly many wider ideas that could in principle be adopted to bring OLTP performance in line with a state-of-the-art research database like Umbra (memory-first design, low-latency query compilation, adaptive execution, etc.). As usual with databases, though, if the cost estimation is off and your query plan sucks, then worrying about fine-tuning peak performance is ~irrelevant.
Idk, but I'd rather see a new iteration of PostgreSQL, similar to Hydra[0], with a new engine becoming the upstream, instead of a whole new database.
There's a lot of experience about db operation and how to approach MVCC encoded in PostgreSQL that shouldn't be underestimated.
"Postgres ... configurations lead to the worst performance, and the major reason is that the use of append-only storage with O2N ordering severely restricts the scalability of the system.... "
> PG-Strom is an extension module of PostgreSQL designed for version 11 or later. By utilization of GPU (Graphic Processor Unit) device which has thousands cores per chip, it enables to accelerate SQL workloads for data analytics or batch processing to big data set.
> PG-Strom has two storage options. The first one is the heap storage system of PostgreSQL. It is not always optimal for aggregation / analysis workloads because of its row data format, on the other hands, it has an advantage to run aggregation workloads without data transfer from the transactional database. The other one is Apache Arrow files, that have structured columnar format. Even though it is not suitable for update per row basis, it enables to import large amount of data efficiently, and efficiently search / aggregate the data through foreign data wrapper (FDW).
I meant that mostly in jest, but in reality both database research and commercial development are moving faster than ever, so an explanation of "How Query Engines Work" is a moving target. Join algorithms and join planning are especially hot topics, with a lot of excitement based on advances in machine learning.
I'm amused that sometimes when you start learning something, you see resources for learning it everywhere. (And I just learned this is called the Frequency Illusion, or the Baader-Meinhof phenomenon [0].)
Thank you so much for publishing this book for free, I'm eager to dive in! Is there any chance you can also link it as ePub?
YouTube's algorithm really helps in that regard. I searched for some flight training videos one day and then just got flooded with more and more (which is nice!).
At the beginning of the book it says 'The real power of Spark is that this query can be run on a laptop or on a cluster of hundreds of servers with no code changes required.'
Is this accurate? I thought the power of Spark is that data stays in memory on each node in the cluster. Being able to run on a cluster in a fault-tolerant manner was already achieved by the Hadoop ecosystem.
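(To be clear, as I read it the "no code changes" claim boils down to where the session gets its master from -- something like the sketch below, where the cluster master would come from spark-submit rather than the code:)

import org.apache.spark.sql.SparkSession

// Same application code on a laptop or a cluster; only the master differs.
// Locally you might hard-code .master("local[*]"); on a cluster you normally
// omit it and let spark-submit supply it (e.g. --master yarn).
val spark = SparkSession.builder
  .appName("Example")
  .getOrCreate()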
In theory, sure; in practice I beg to differ. You tend to end up writing different levels of annoyingly intricate Spark code as your data volume and processing complexity (and infra constraints) increase.
Anyone who says this is not true hasn't tried to join two 10 TB dataframes on a highly skewed key with a hundred million unique values without spending thousands of dollars and without failing at random instances.
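(For anyone curious, the usual workaround is key salting, roughly like the hedged sketch below -- the column names and helper are made up, and even this gets painful at that scale because one side has to be replicated once per salt:)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Rough sketch of salting a skewed join (illustrative only).
// The skewed side gets a random salt; the other side is replicated once
// per salt value so that every (key, salt) pair still finds its match.
def saltedJoin(skewed: DataFrame, other: DataFrame, key: String, numSalts: Int): DataFrame = {
  val salted = skewed.withColumn("salt", (rand() * numSalts).cast("int"))
  val replicated = other.withColumn("salt", explode(array((0 until numSalts).map(i => lit(i)): _*)))
  salted.join(replicated, Seq(key, "salt")).drop("salt")
}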
Well I mean, that's a problem with all those solutions.
If your dataset is so small it fits in RAM it's simply ridiculous to reach for those tools, and you don't start seeing real advantages until you have truly massive datasets.
Anecdotally using Spark on less than 1TB just means wasting your time, but in Spark's defense, that's not what it's for.
It's anecdotally the threshold where "normal" ETL methods start having problems. Obviously performance is hard and the only correct answer is "it depends", but I have never seen Spark improve things on datasets that small.
The power of Spark is truthfully that it's really good at running sort + aggregate type queries on columnar-stored data. Once you start needing to do complicated joins, it becomes more about how you structure everything.
val spark: SparkSession = SparkSession.builder
.appName("Example")
.master("local[*]")
.getOrCreate()
val df = spark.read.parquet("/mnt/nyctaxi/parquet")
.groupBy("passenger_count")
.sum("fare_amount")
.orderBy("passenger_count")
df.show()
That's the first listing in this book. This is a disaster... A query language implemented in a pug-ugly other language which has to rely on two other query languages to function. This is the worst sales pitch in the history of sales pitches. Maybe ever.
That's the first "dataframe" listing in the book. The first listing is the SQL examples.
Both SQL and dataframes have to be prepared for the query engine. You can read the whole book without ever needing to take into consideration the pug-ugly dataframe content.
I don't disagree that dataframes are often/most times horrible to look at.
Looks like it gets quite quickly to logical plans and the actual business of query engines. The surface representation of a query is almost immaterial to the topic of the book.
I bought this book back in March 2020 and it was a great read then. Since then I've received numerous updates from Leanpub on the additions Andy has written!
> A logical plan represents a relation (a set of tuples) with a known schema. Each logical plan can have zero or more logical plans as inputs. It is convenient for a logical plan to expose its child plans so that a visitor pattern can be used to walk through the plan.
This is not a definition. A definition needs at least a modicum of convincing behind it. Like, why did the author decide that a plan represents a relation? Most plans I've seen in my life were like grocery shopping lists, or maybe decision trees. I can see no way to get from "plan" to "set of tuples".
Also, what does the visitor pattern have to do with anything? Why is this bit of information even relevant?
Isn't the "physical plan" the part that gets the "set of tuples", not the "logical plan"?
And isn't the matter of how that gets done more a concern of which physical plan the query optimizer picks, and as a result of the query engine that powers that particular data layer?
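As I understand the book's framing, "represents a relation" just means that each logical operator can describe the schema of the rows it will eventually produce (the physical plan is what actually produces them), and exposing children is what makes generic tree traversal -- optimizer rules, plan printing -- possible. A toy sketch of that shape (made-up names, not KQuery's actual API):

// Toy sketch: a logical plan node describes the schema of the rows it will
// produce and exposes its children, but doesn't execute anything itself.
final case class Field(name: String, dataType: String)
final case class Schema(fields: Seq[Field])

trait LogicalPlan {
  def schema: Schema               // the "relation" the node represents
  def children: Seq[LogicalPlan]   // inputs, which enable generic tree walks
}

final case class Scan(table: String, tableSchema: Schema) extends LogicalPlan {
  def schema: Schema = tableSchema
  def children: Seq[LogicalPlan] = Nil
}

final case class Projection(input: LogicalPlan, columns: Seq[String]) extends LogicalPlan {
  def schema: Schema = Schema(input.schema.fields.filter(f => columns.contains(f.name)))
  def children: Seq[LogicalPlan] = Seq(input)
}

// A visitor-style walk over the tree, e.g. to pretty-print a plan.
def format(plan: LogicalPlan, indent: Int = 0): String =
  (" " * indent) + plan.getClass.getSimpleName + "\n" +
    plan.children.map(c => format(c, indent + 2)).mkString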
> The first step in building a query engine is to choose a type system to represent the different types of data that the query engine will be processing.
This. Out of nowhere. Why would this be the first step? There's no explanation... Why can't this be the tenth or the hundredth step?
So... what if they wrote a query engine? I want to get some kind of argument for choosing the type system first, but there's none...
Imagine you get IKEA assembly instructions for a chair. There might be a few valid ways to start assembling a chair. There will also be ways that make chair assembly impossible. It makes perfect sense in that context to tell you "what the first step should be".
I don't see any reason for starting with the type system in making a query engine. Things that come to mind as a better first step: choose the data constructors, or choose a paradigm (imperative vs declarative). Maybe choose the level of data consistency you aim for... maybe choose whether querying should be mostly interactive vs mostly non-interactive. There are tons of other decisions to make.
This "let's start with types" feels like that Star Wars moment where Vader randomly points at some spot through the cockpit window and tells the captain to head that way. And once the captain enquirers as to the reason this particular direction was chosen, Vader gives some vague answer like "why would you go anywhere else?"
If you try building a framework on Arrow, you are most likely going to start with thinking about types and how you munge everything you are doing into the Arrow types.
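Concretely, the "types first" argument is that schemas, expressions and the columnar batches you exchange all have to agree on one closed set of value types, so you pin that down before anything else. A hand-rolled sketch of the idea (not the actual Arrow type classes):

// Hand-rolled sketch (not the real Arrow classes): a small closed set of
// value types that schemas, expressions and batch encodings all agree on.
sealed trait DataType
case object Int32Type   extends DataType
case object Int64Type   extends DataType
case object Float64Type extends DataType
case object StringType  extends DataType

final case class ColumnField(name: String, dataType: DataType)

// Everything downstream dispatches on DataType -- e.g. how wide a fixed-size
// column's values are -- which is why it's painful to bolt on later.
def fixedByteWidth(t: DataType): Option[Int] = t match {
  case Int32Type               => Some(4)
  case Int64Type | Float64Type => Some(8)
  case StringType              => None // variable length
}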
Maybe... maybe not. I would expect the author to give some rationale for how they decided to start working on the engine. There are plenty of ways one could start, and types don't strike me as even a worthwhile option to consider. Especially since the author doesn't use any kind of "interesting" type system. It's just a handful of integers of different widths -- that's completely worthless as a start.
I.e. ask yourself: would I not be able to start working if my type system didn't have 16-bit integers? Why would my design care whether the system has 16-bit integers or not?
And, if you are honest, the answer is: no, nobody cares. Plenty of important design decisions don't require this information at all. In fact, the sizes of integers may never even appear in your query system (e.g. all you deal with are byte strings), and it would still work...
Really nice! Anyone knows if there is any list of such website books / gitbooks? I try to bookmark these whenever I come across them, but I believe there must be so many great ones I don't know of.