> If a beginner wanted to do anything "cool-looking" today, this same beginner programmer would need to set up render to texture on an OpenGL surface. Yuck! And so they lose interest, believing programming is the mysterious and complex domain of gods.
Or just use the web platform: JavaScript, HTML5 Canvas API, WebGL, all of that. I think it's pretty close in terms of getting a pretty (motivating) result quickly without much prior experience.
(Unrelated, but I can also remember writing the first lines of code of my life in QBasic on a DOS machine in ~1998; the blue screen definitely evokes childhood memories!)
My preferred term for this is “zero to pixels”. QB’s was very low—basically unmatched by most modern environments, which require considerably more setup; although that’s partly because they’re intended for more “serious” programming. We need more unserious, fun programming environments for beginners. :)
Over the years i made a few languages and environments for fun built around exactly that sort of low "zero to pixels" barrier, mainly inspired by QBasic:
1. SimSAL Workshop [1], made years ago using a scripting language (SimSAL) i wrote at some point in Lazarus [2]. I originally wrote it because up to that point i used QBasic in DOSBox now and then to try out algorithms and i wanted something similar running natively. Although these days i tend to use Lazarus directly.
2. VitriolBASIC [3] (sadly the site i had it on is down and only that image remains from the archive), a simple QBasic-like interpreter written in Java. Written purely for fun; i lost interest when i realized how hard and annoying it was becoming to run Java stuff in the browser (today you cannot do that at all unless you pay the certificate mafia, and even then you get tons of scary messages).
3. WinLIL [4], this is just a playground for my LIL scripting language [5] [6]... i just wanted an excuse to shove it in yet another program. But it is a single self-contained EXE file that includes the interpreter, a simple programming environment, some extra commands for graphics and all the documentation i have available :-).
If you can't (or don't want to!) afford that kind of money for data analytics, please consider giving the FOSS alternative EventQL [0] a try some time.
It's super simple to set up and tries to be efficient on commodity hardware, so you can run large clusters (>100TB scale) for a couple hundred dollars a month.
How fast are group bys? Like say I have 120 billion rows, 25 not-sparse columns, and want to group by between 2-20 columns (5 of which are varchar), aggregating the other 5 columns?
What kind of hardware would I need to do that interactively? Or consistently sub-10 second, with 100s of queries per minute.
I have built a thing on Redshift that can do some of this, but it has been new territory for me and I am not sure I've done it "right". Constantly looking for alternatives.
Have you tried this on BigQuery yet? It's built for this kind of extremely large dataset.
You can also look at MemSQL for a distributed relational database with a columnstore. Run enough nodes and you might be able to hit your performance goals.
BigQuery makes it pretty easy to calculate how much you'll spend based on your bandwidth needs; unfortunately, if you do the math for the above use case, the answer is pretty discouraging... $5 / TB data processed, and the size in this context is explicitly the uncompressed size. Even if we're extremely generous and assume 4 bytes / column (almost certainly an underestimate given how many bytes they reserve for integers and timestamps), that is potentially 12 TB for the database, so you're paying $60 for a single query that hits the entire table. If you have 500 queries per month that hit the entire table, you're already paying as much as you would be for Redshift on 8 nodes, without even looking at any of the other queries.
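For anyone who wants to check the back-of-envelope math, here it is spelled out (these are the assumptions from above, not BigQuery's actual billing of any particular schema):

```python
# Back-of-envelope check of the BigQuery cost estimate above.
# Assumptions (mine, not BigQuery's docs): 4 bytes/column is an optimistic
# average; billing is on uncompressed bytes scanned.
rows = 120e9
columns = 25
bytes_per_column = 4        # generous; ints/timestamps are billed wider
price_per_tb = 5.00         # on-demand pricing, USD per TB processed

table_tb = rows * columns * bytes_per_column / 1e12
cost_full_scan = table_tb * price_per_tb
print(table_tb, cost_full_scan)       # 12.0 TB, $60 per full-table query
print(cost_full_scan * 500)           # $30,000/month at 500 full scans
```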
The flat rate may improve things dramatically, of course, but the documentation's ambiguity about what you get with a "slot" makes it hard to say. And none of this is taking required bandwidth into account, because AFAICT there are no promises made on query response time... so I don't know if any of this would satisfy the 10 second requirement.
Either way, no matter how Google, MemSQL, or anyone else might try to satisfy these requirements, they can't get around the required hardware costs. At best, they can amortize them by buying in bulk and partitioning all their clients across lots of servers.
I don't work for any of these database providers, BTW, or use any of their products; I have no skin in this game. And I'm not even saying that paying $60 for that kind of query is necessarily a bad deal (when you consider what may be required under the hood). I'm just saying if you're looking for a cheap solution here you're not going to find it.
> I'm just saying if you're looking for a cheap solution here you're not going to find it.
That's a given. I think in this context the user was asking how to meet performance goals with cost possibly secondary or not a concern... and in that case BQ has proven to be incredible at churning through large datasets within seconds.
What you're asking for is just really difficult. Even with a fairly good compression ratio (e.g. 57 bits/tuple, which I chose at random but is similar to what's achievable on real data according to http://i.stanford.edu/~adityagp/courses/cs598/papers/constan...) you're talking 83.4 GB/s of read memory bandwidth just to read the tuples at all within 10 seconds for a single query. The state of the art in terms of what you're likely to be able to actually buy (IBM Power8 machines) can do about 91.5 GB/s in Stream Triad benchmarks. That's still nowhere close to supporting 100s of queries a minute (more like 6 or 7) and you haven't even started calculating anything yet (Triad is pretty simplistic). It also assumes you can actually achieve that data rate while keeping everything in memory (which is why GPUs probably won't help much for your use case, despite NVLink; they don't have enough RAM).
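The scan-bandwidth estimate is easy to reproduce from the assumptions above (I get ~85 GB/s with straight decimal units; the exact figure depends on how you round bits/tuple):

```python
# Rough reproduction of the bandwidth estimate: 120 billion tuples at
# ~57 bits/tuple, scanned within a 10-second query budget.
rows = 120e9
bits_per_tuple = 57
query_budget_s = 10

scan_bytes = rows * bits_per_tuple / 8
required_bw = scan_bytes / query_budget_s   # bytes/sec for ONE query
print(required_bw / 1e9)                    # ~85.5 GB/s (decimal GB)

# Against ~91.5 GB/s of Stream Triad bandwidth, the machine supports
# only a handful of full scans per minute, not hundreds:
triad_bw = 91.5e9
scans_per_minute = triad_bw * 60 / scan_bytes
print(scans_per_minute)                     # ~6.4
```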
Your biggest issue with aggregation / group by is going to be the memory bandwidth to/from a hash table to store all the results. Once the hash table no longer fits in L3, its access time will rise dramatically, so your query's performance will depend heavily on how many buckets you need (if you're grouping by 2 columns, maybe not that many; if you're grouping by 20, you'll probably have almost no buckets with more than one entry). Another potential issue is going to be having to look up the values for that column in a hash table if they are highly compressed (since you need to perform aggregation on them, presumably something like sum; if it's count, this doesn't matter).
If you can find a total attribute order such that the columns you group by are always to the left of the columns you aggregate by, you can sort the rows to enable efficient delta encoding; you can also then perform aggregation in the same order as your scan, which eliminates the hash table lookup problem for output (not input, though). You can also pre-materialize the results for common subsets.
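A toy sketch of that scan-order aggregation idea (single group-by column, sum only): with rows sorted by the group key, each group is contiguous, so you need one running accumulator instead of a big hash table of buckets.

```python
def grouped_sum(sorted_rows):
    """sorted_rows: iterable of (group_key, value), sorted by group_key.
    Yields (group_key, sum) in scan order -- no hash table needed."""
    current_key, acc = None, 0
    for key, value in sorted_rows:
        if key != current_key:
            if current_key is not None:
                yield current_key, acc      # emit finished group
            current_key, acc = key, 0
        acc += value
    if current_key is not None:
        yield current_key, acc              # emit last group

rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)]
print(list(grouped_sum(rows)))   # [('a', 3), ('b', 5), ('c', 7)]
```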
To do this with multiple queries at once, you'd probably need to batch query execution (because of the memory bandwidth issue I alluded to earlier). While the data access requirements would remain similar on read, they'd get worse and worse on write (again, depending on how many buckets you had and how cleverly sorted your data were).
Alternately, you could get a bunch of Power8s (or commodity machines, but to maximize bang for your hardware buck you really want stuff with tons of memory bandwidth) and give each of them a slice of the data (but still apply all the above optimizations). The commodity version of this is the Redshift solution. If you went this route, you could also look at specialized solutions like GPUs with NVLink or the KNL Xeon Phis, which have "fast" memory with tons of extra bandwidth which help mitigate the aforementioned hash table / query result access costs.
I still haven't talked about how you're supposed to actually get the results back. If you're trying to do ethernet through anything commodity you're going to be very limited in terms of data rate. Even 100 Gbps Infiniband only gets you 12.5 GB/s out, and even if you hook two of those up to each machine you're still only at 25 GB/s out. So either you get even more machines, your clients process the data on the machine, or you have to limit the output somehow... and if you're thinking "we'll sort it!" guess what that's going to be bound by (unless you can store the input data presorted)? Probably memory bandwidth (assuming radix sort)!
tl;dr One way or another, you're paying out the nose to satisfy the requirements you just outlined. Also worth noting that you're paying for this read performance on write.
Using a column store and doing a bunch of pre-aggregation is the only way we've been able to come close to these requirements, but I keep hoping there is a system that will do what we've done without having to write code. Trading space for computation is the only thing that has worked reliably.
We also have the benefit that most queries just want a small subset of time, and things are sorted to take advantage of that. But occasionally people want to do something over a large range, and if it isn't a set of group-bys we've pre-aggregated then they just have to be patient.
Hadn't heard of EventQL before. Took a look at the documentation, so I have a brief idea of what it offers and how it works. But I couldn't find anything in the way of performance numbers.
While I might try it out myself to see how well it performs, it would be nice if some figures were readily available.
I've been keeping an eye on Postgres-XL and CitusDB for distributed SQL. Would be interesting to compare.
Running an open source stack means you're not completely at the mercy of someone else. I'd cite Windows 10 desktop editions as a paragon of "you buy it, but you don't own it, then you suffer" proprietary products, but it's not a server component. So I have to cite FoundationDB, the proprietary database, which was seriously awesome, but so awesome that the downloadable binaries vanished overnight.
I have at least the assurance that such a thing cannot happen with Postgres (there are a multitude of for-profit companies working on it full-time, in addition to pro-bono community contributors), or with Apache Cassandra; and it's a similar assurance that keeps users of RethinkDB able to continue to operate despite the company being bought and folded.
----
Other advantages of open source: the very freedom to inspect and modify the software we run. This freedom is how OpenResty was born. In fact, Postgres itself comes from another GPLed database.
On a minor note: this freedom is what allowed us at my last job to create a patch for an internally required function in Nginx, in ten minutes. Had Nginx been closed source, we'd have had to file a request with the makers, wait days/weeks/months, and hope they accepted and included it.
>> For example in germany FF is still the widest used desktop browser.
Do you have any references to back that up? I'm asking because I suspect your assumption is completely wrong.
(Based on a bunch of analytics data that I have access to, which might not be representative but still contains some very large German web properties, Chrome has more than twice the market share of FF in Germany).
It's not an entirely new art form. The demoscene [0] has been going strong far longer than most of us have even been coding. So apparently some people _will_ execute interactive art.
Also, you can still listen to their music on soundcloud [1] if you don't want to review the code. It's open source after all, which can't be said for most demos...
Speaking for myself: Berlin has a special kind of poor-is-sexy culture and this would not be too unusual. Lived there for 4yrs, most of them without a smartphone (had a 20eur burner phone for communication and to receive monitoring alerts). Had an absolute blast and didn't feel like I was missing a thing.
I had the opposite experience when I moved to the US (coming from Europe). It struck me how people consumed more of everything. Tech gadgets, food... I think in Europe we are a bit more defiant toward consumerism, esp. in more educated environments (but we're catching up!).
It reminds me of a cool classic sci-fi movie about consumerism. John Carpenter's "They Live". I watched it as a kid but it started to make sense later.
"Nada quickly discovers the sunglasses have unique properties: they reduce the colors of the world around him to black and white and allow him to see that media and advertising hide omnipresent subliminal totalitarian commands to obey, consume, reproduce, and conform. They also make clear that many people in positions of wealth and power are actually humanoid aliens with skull-like faces."
>> "It struck me how people consumed more of everything. Tech gadgets, food... I think in Europe we are a bit more defiant toward consumerism"
Obviously it depends where you are in Europe. In the UK I'd say we're a lot more towards the American end of consumerism but they still take it much further. Food is the biggest jump for me. I still remember my first time in America and I was out for dinner. My friend suggested we split a meal as the portions were large. I still couldn't finish mine (and I eat quite a lot) and she brought home a doggie bag with enough food for two other people. And this was quite a nice restaurant too where I'd have expected smaller portions.
Interesting. I've only spent a few days in Berlin (about a year ago) and I definitely got that anti-consumerism vibe. I quite enjoyed it and people seemed pretty happy just hanging out and having a drink and some conversation in grimy bars that probably wouldn't survive in London. I'm generalising probably as it's a big city but in the area I was in I definitely got that impression.
> I'm still surprised that there's no startups offering comparable (particularly on-prem) products
Please consider giving EventQL [0] a try some time! It's completely open-source and self-hostable. Still a new project though, just released this summer and still in beta.
"EventQL is a distributed, analytical database. It allows you to store massive amounts of structured data and explore it using SQL and other programmatic query facilities."
So it's a completely different class of application than splunk or elasticsearch, and one that you have a commercial interest in. Please don't spam HN.
>> So it's a completely different class of application than splunk or elasticsearch
Sure it takes a somewhat different approach (i.e. it requires an explicit schema), but for the use case discussed in this thread it _is_ completely relevant and a comparable open-source/on-premise alternative which parent was asking about.
>> one that you have a commercial interest in
Yes, I'm involved in the EventQL project, but I thought that was obvious from the way I phrased my posting. I usually include a disclaimer to prevent misunderstandings but didn't consider it necessary in this case.
That's almost 70 Gbit/s (are those cloudflare http logs by any chance?) on 100 nodes vs ~170 Mbit/s on 6 nodes.
Or, in other terms, 700 Mbit/s per host with your kafka setup versus ~30 Mbit/s per host in the benchmark. Although your machines seem to be quite a bit beefier (I wonder if all that RAM is actually used?).
A lot of it is log data from requests passing through CloudFlare. We run them through Kafka and consumers do stuff like attack detection and generate statistics for our customers. We have about 4 million sites on CloudFlare and each customer has access to analytics about their site which are stored in a CitusDB database.
Impressive. Any chance you could tell how much data is stored for the analytics service after pre-aggregation? (In terms of TB/day or so - I guess it can't be the full 70gbit/s?).
Sadly it does not give a figure/order of magnitude of the amount of data that's stored in citus after aggregation, but I guess it's just not public information. [I'm working on a system that is somewhat similar to CitusDB (eventql.io) and am always really interested in these numbers]
EDIT: I can't reply to your other comment for some reason but many thanks for digging that up, it's very interesting info to me!
> Except for inlineable functions and templates in C++—which become a greater and greater fraction of code the more "modern" your C++ gets.
The C++ ABI is currently not portable anyway. So the concern about templates (or any C++ features) forcing you to put the implementation into the header file would not apply in the scenario dllthomas was referring to:
If you want to distribute your library as a binary object and a separate source-form interface/header today, you'll have to use a (wrapper) C API. Regardless of how "modern" the C++ code is.
Of course, it would be nice to have portable C++ objects some time in the future which would change things... :)
> If you want to distribute your library as a binary object and a separate source-form interface/header today, you'll have to use a (wrapper) C API.
This is only true if you want to target different compilers. If you ship your binaries targeting a specific compiler (which is what just about every company does that I've ever worked with), you don't have any ABI issues.
Actually, you can even get incompatibilities with the same compiler and different compile flags. So to reliably build a "semi-portable" C++ object one would have to ensure that all objects are compiled with the exact same compiler (i.e. the same compiler version/source code) and with the exact same flags. I'm sure there are some compiler vendors that offer ABI backwards-compatibility but it's not part of the C++ language per se.
Oh I'm well aware; as my comment indicated, I've been doing this a long time. I didn't want to get in to the full details in a HN comment, but yes, there's definitely a few things you have to coordinate on between vendors if you want to ship C++ libraries that work together.
> If you want to distribute your library as a binary object and a separate source-form interface/header today, you'll have to use a (wrapper) C API. Regardless of how "modern" the C++ code is.
Not really.
Those of us on Windows make use of COM, or since Windows 8, UWP components (formerly known as WinRT).
Yes, but this is only a solution if your library is only targeting windows.
If you want users of any standards-compliant C++ implementation to be able to use your library today, you'll still have to go with C-ABI symbols or ship the sources. All other workarounds are vendor-specific and not part of the standard.
[Of course even objects containing only C symbols are not portable across platforms either, but at least the C ABI/calling convention is more or less strictly defined for any given target platform. Assuming no other platform-specific stuff like glibc is used]
The linked article is an obviously bullshit benchmark that makes influxdb look good and cassandra look bad (by, surprise, the influxdb folks).
I'm far from a cassandra fanboy, but this really is just dishonest marketing. Not sure if that will work if your product is open source and the target audience are developers.
Some thoughts:
- The reason why cassandra uses so much more space to store the same data is that they've set up the cassandra table schema in such a way that cassandra needs to write the series ID string for each sample (while influxdb only needs to write the values). You easily get a 10-100x blowup just from that. There is no superior "compression" technology here but just an apples-to-oranges comparison.
- Then, comparing the queries is even worse, because they are testing a kind of query (aggregation) that cassandra does not support. To still get a benchmark where they're much faster, they just wrote some code that retrieves all the data from cassandra into a process and then executes the query within their own process. If anything, they're benchmarking one query tool they've written against another one of their own tools.
- Also, if I didn't miss anything, the article doesn't say what kind of cluster they actually ran this on, or even whether they ran both tests on the same hardware. There definitely are cassandra clusters handling more than 100k writes/sec in production right now, so I guess they picked a peculiar configuration in which they outperform cassandra in terms of write ops (given a good distribution of keys, cassandra is more or less linearly scalable in this dimension)
- A better target to benchmark against would probably be http://opentsdb.net/ or http://prometheus.io/ - both seem to have somewhat similar semantics to InfluxDB (which cassandra and elasticsearch do not)
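To illustrate the series-ID duplication point with completely made-up numbers:

```python
# Toy illustration of the space blowup from repeating the series ID string
# for every sample (all numbers invented for illustration):
samples = 100_000_000
value_bytes = 8                            # one float64 per sample
series_id = "web42.cpu.user.percent"       # hypothetical series name, 22 chars

compact = samples * value_bytes                        # values only
repeated = samples * (value_bytes + len(series_id))    # ID stored per row
print(repeated / compact)   # 3.75x here; longer IDs plus delta-compressed
                            # values push this into the 10-100x range
```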
DISC: I also work on a distributed database product (https://eventql.io) but it's neither a direct competitor to Cassandra nor InfluxDB nor any of the other products I've mentioned. I hope the comment doesn't come across as too harsh. The article made some very big (and harsh) claims, so I think it's fair to respond in kind.
I don't understand this benchmark at all. It says performance of a 1000 node cluster, but then shows 100k inserts per second in Cassandra. Then later follow up comments say that this test was on a single machine. Without seeing the schema, 100k inserts / sec is reasonable for a single machine. For 1000 machines it would mean there is a pretty massive configuration issue.
If you are going to benchmark a distributed system, you really need to set up more than 1 server.
I think what they meant with "1000 nodes" is that the dataset they're using for the benchmark is synthetic monitoring data (where the thing being monitored are servers).
And the way they generated the synthetic data set is by having 1000 imaginary servers produce one sample per second (i.e. they have a script that writes out 1000 * duration_in_sec fake samples -- I believe this is the code that does it: https://github.com/influxdata/influxdb-comparisons/tree/mast...)
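Roughly this kind of thing (a hedged sketch, not the actual influxdb-comparisons code; all names here are invented):

```python
import random

def generate_samples(n_hosts=1000, duration_s=60, start_ts=1473333629):
    """Yield one fake CPU sample per imaginary host per second."""
    for second in range(duration_s):
        ts = start_ts + second
        for host in range(n_hosts):
            yield {"host": f"host_{host}", "ts": ts,
                   "cpu_user": random.uniform(0, 100)}

# 1000 hosts * 2 seconds = 2000 fake samples
samples = list(generate_samples(n_hosts=1000, duration_s=2))
print(len(samples))   # 2000
```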
The tests were run on the same hardware, a single server. Bare metal, not VMs. InfluxDB writes the series string with everything. We tried to imitate what you'd need to do to get close to similar functionality doing time series like InfluxDB does in Cassandra.
If you're just going to write a bunch of uint64 keys with float64 values, of course Cassandra will get much faster. It would be trivial to make a time series database that outperforms InfluxDB with those limitations as well.
The point of the comparison is that InfluxDB gives you a ton of functionality out of the box and has great performance.
Again, the point is that if you want to do time series on Cassandra, you're going to write a bunch of the code yourself.
> The point of the comparison is that InfluxDB gives you a ton of functionality out of the box and has great performance. [...] if you want to do time series on Cassandra, you're going to write a bunch of the code yourself.
Fair enough. I'm sure InfluxDB is very good/fast at timeseries data (although I have to admit I haven't actually tried it out so far). Still, if that was your point, consider removing these statements from the blog:
> InfluxDB outperformed Cassandra by 4.5x when it came to data ingestion.
> InfluxDB outperformed Cassandra by delivering 10.8x better compression.
> InfluxDB outperformed Cassandra by delivering up to 168x better query performance.
I think it would help make the point and not put the reader in a defensive position (when the statements are clearly not based on a fair comparison of the two products and will not hold under most conditions). Just my two cents.
Maybe, but we get asked all the time about Cassandra vs. us. Both in terms of feature set and performance. And performance only makes sense for our potential users if we're trying to replicate the features on Cassandra.
Hasn't that work already been done? Cyanite and KairosDB both plug in to the broader Graphite ecosystem (more or less) and use Cassandra as a data store.
Time series data has also been a particular focus in the Cassandra community. DTCS was too complicated, so they came up with the easier and faster TWCS. I don't think this is on you, but I'd love to see a comparison with the latest stable 3.x and a multiple node cluster.
Thanks for the analysis of their benchmark, I wanted to view the details by myself but it required creating an account on their page.
> There is no superior "compression" technology
Isn't it feasible to employ special encoding for time series data? For example, to encode a series of timestamps like 1473333629, 1473333630, 1473333631 you could encode it as 1473333629, +1, +2 (where +1, +2 are encoded in one byte). And there are many cases of such metrics with adjacent values, like averages, counters.
Yes, the delta encoding scheme you described (along with other coding schemes such as bitpacking, varints, RLE, or combinations thereof) is frequently employed in columnar storage formats and databases. Columnar storage is basically a generalization that allows one to apply these optimizations to all kinds of data (not just timeseries). One popular open-source implementation of columnar storage that I am not affiliated with is https://parquet.apache.org/.
(On the other hand, columnar storage also has a bunch of tradeoffs/downsides so it's not a superior choice for every db product.)
My point about no "superior compression technology here" was specific to the linked benchmark. I.e. the lack of this potential optimization in cassandra does not appear to be the reason for the space blowup in the benchmark, but rather that they're duplicating the series ID for each sample.
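The timestamp scheme is only a few lines in toy form (no varint/bitpacking of the deltas, which a real implementation would add):

```python
def delta_encode(timestamps):
    """Store the first timestamp, then successive differences."""
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def delta_decode(deltas):
    """Rebuild the original timestamps from base + running sum of deltas."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

ts = [1473333629, 1473333630, 1473333631, 1473333640]
enc = delta_encode(ts)
print(enc)                        # [1473333629, 1, 1, 9]
assert delta_decode(enc) == ts    # round-trips losslessly
# The small deltas can then be varint- or bitpack-encoded into ~1 byte each.
```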
A commercial DB that (also) does this is HP Vertica. They tout a 4:1 to 5:1 compression ratio on average; due to the nature of the data the firm I work for stores in it, we get quite a bit better than that. Delta encoding is just one of maybe 5 different schemes it can use for a given column.
Just so sad that Vertica is proprietary so we can't see how they did it! ;)
On a serious note: Please check out EventQL [0] some time. It's very similar to Vertica in some ways and completely open-source. It's a new project (beta) and not nearly as mature as vertica yet though (still a long way to go).
Facebook does this (and quite a few other tricks) for storing time-series data in Gorilla (their in-memory TSDB; paper: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf), getting down to 1.37 bytes per sample.