Presumably the "order" you mention is a primary key to another table, likely one that references the individual items that make up that order, so the data will be much larger than you estimate.
It will grow larger still if you include web logs from your e-commerce site and event data from your mobile app so that you can correlate these orders with items that customers considered but ultimately didn't buy. How will your laptop and SSD perform when you then build a user-item matrix to generate product recommendations for each of those 1.2 billion customers?
While plenty of organizations unnecessarily use Big Data tools to store and analyze relatively small amounts of data, there are also many customers with enough data to genuinely require them. I've seen such customers firsthand.
There are functionally fewer than 1,000 organizations that currently require distributed compute for data analysis. You can get off-the-shelf AWS instances with 1,000 cores, terabytes of RAM and storage, etc. The cost of compute has decreased faster than the amount of data we have to store and process. What we used to do with Spark jobs we can now do with Python on a single box.
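To make the "Python on a single box" claim concrete, here is a minimal sketch of the kind of job that once justified a Spark cluster: a streaming aggregation over a file too big to load at once. The file layout and column names (`order_date`, `amount`) are made up for illustration; with the stdlib `csv` module this processes data row by row in constant memory.

```python
import csv
import io
from collections import defaultdict

def revenue_by_month(lines):
    """Stream order rows and aggregate revenue per month without
    loading the whole dataset into memory."""
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        month = row["order_date"][:7]          # "2023-04-17" -> "2023-04"
        totals[month] += float(row["amount"])
    return dict(totals)

# Tiny in-memory sample standing in for a multi-GB file on disk.
sample = io.StringIO(
    "order_date,amount\n"
    "2023-04-01,10.50\n"
    "2023-04-15,4.50\n"
    "2023-05-02,20.00\n"
)
print(revenue_by_month(sample))  # {'2023-04': 15.0, '2023-05': 20.0}
```

On a modern machine with fast NVMe, a loop like this (or the same query in Polars/DuckDB) chews through tens of gigabytes in minutes, which is the point being made: no cluster required.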
Let's assume your completely made-up claim of 1,000 organisations is true.
Right now I work for one of them: a global investment bank.
Within that organisation we have at least 100 Spark clusters doing distributed compute. And at least in our teams we have tight SLAs where a simple Python script simply can't deliver results quickly enough. Those jobs underpin tens of billions of dollars in revenue, so for us money is not the issue; performance is.
So 1000 x 100 = 100,000 teams, all of whom I speak for, disagree with you.
Disagree with what? I never said _you_ are a dummy for using distributed compute. There are many good applications for it. I used Spark and Flink at a big tech job. The stack worked well for some things, and for others it was a hammer looking for a nail. What you do not see is that for every team you work with and consider a peer group, there are 100 teams that really do not need distributed compute, because they have an org-wide infra budget of <$3M and a total addressable data lake of less than 1 TB, yet they are implementing very expensive distributed compute solutions recommended by either a Deloitte consultant or a very junior engineer. Should an IB with an infra budget in the $100M+ range use distributed compute solutions? Absolutely. There just aren't that many of those orgs.
This is not true. Any column-store database (BigQuery, Redshift, Snowflake) implements distributed compute behind the scenes. When an analyst's or business intelligence person's query returns in 3 seconds instead of 15, it's actually huge: not just in the aggregate amount of time saved, but in creating a quick feedback loop for testing hypotheses. This is especially true considering that most analyst-type people look at data as aggregates across some dimension (e.g. sales per month, unique visitors per region, etc.)
These types of questions are orders of magnitude faster with a distributed backend.
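For clarity, this is the query shape being described: an aggregate grouped by one or more dimensions. Here's a sketch using the stdlib `sqlite3` as a single-node stand-in; the table and column names are invented, and the point is only that a distributed column store runs this exact shape of query, just with the scan parallelized across many machines behind the scenes.

```python
import sqlite3

# Single-node stand-in for a distributed column store. Table and
# column names are made up for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EU", "2024-01", 100.0), ("EU", "2024-02", 50.0),
     ("US", "2024-01", 200.0)],
)

# "Sales per region per month" -- an aggregate across two dimensions.
rows = con.execute(
    "SELECT region, month, SUM(amount) FROM sales "
    "GROUP BY region, month ORDER BY region, month"
).fetchall()
print(rows)  # [('EU', '2024-01', 100.0), ('EU', '2024-02', 50.0), ('US', '2024-01', 200.0)]
```

Because the `GROUP BY` scan is embarrassingly parallel over row ranges, it is precisely the kind of query that gets orders-of-magnitude faster when fanned out across a distributed backend.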
I was just playing with some data from our manufacturing system, about 30 GB. I pulled the data to my laptop (a very expensive Apple one), and while it fits on my disk just fine, it took about 15 minutes to download.
I imported it into ClickHouse, which took a while because I had to figure out compression, LowCardinality() and so on. I ran a query, and it took ClickHouse about 15 seconds. DuckDB, pointed at the Parquet files on my SSD, took 19 seconds to do the same. Our big data tool took 2 seconds, working with the data directly in cloud storage.
Now of course this is entirely unfair: the big data thingie has over twenty times more CPUs than my laptop, and cloud storage is also quite fast when accessed from many machines at once. If I ran ClickHouse or DuckDB on a 100-CPU machine with a terabyte of RAM, they might still have turned out faster.
But this experiment (I was thinking of using some of the new fancy tech to serve interactive applications with lower latency) made me realize that big data is still a thing. This was just a sample: one building from one site, and we have quite a few of both.
I'd love to understand the shape of this data and some of the types of queries you're performing. It would be very helpful as we build our product here at MotherDuck.
I have no doubt that there are situations where the cloud will be faster, especially when provisioned for max usage [which many companies do not do]. However, there are also a lot of situations where the local machine can supplement cloud resources [think of the decisions a query planner can make].
Feel free to reach out at ryan at motherduck if you want to chat more.
> You can get off the shelf AWS units with 1000 cores, terabytes of ram and storage, etc.
Hold your horses... the beefiest servers in production today, unless you count custom-made stuff, go to somewhere between 128 and 256 cores per board. These are hugely expensive. Also, I don't know if you can rent those from Amazon.
Typical, affordable servers range between 4 and 16 cores. It doesn't matter if you buy them yourself or rent them from Amazon. It's much cheaper to command a fleet of affordable servers than to deal with a few high-end ones. This is both because the one-time purchase price is quite different and because with smaller individual servers you have a fighting chance to scale your application with demand. This is especially true in the case of Amazon, as you could theoretically buy spot instances and, by doing so, share the (financial) load with Amazon's other customers.
Now... storage. Well, you see, in Amazon you can get very expensive storage that's guaranteed to be "directly" attached to the CPU you rent, so-called ephemeral storage. This is the storage that's included with the VM image you use. It's very hard to get a lot of it. I couldn't find the numbers for Amazon; I do know that Azure tops out at 2 TB. In principle, this kind of storage cannot exceed a single disk, so I think Amazon probably offers the same 2 TB, maybe 4. But again, it's cheaper to have a bunch of EBS volumes attached... but then you'll need more of them, since latency will suffer, and you'd try to compensate for that by increasing throughput, perhaps.
Also, consider that, in practice, you'd want a RAID, probably RAID 5, which means you need upwards of 3 disks. If you are running something like a relational database, you'd most likely want to put the OS on a single device, the database data on a RAID, and the database journal on yet another device, and you'd probably want that device to be something like persistent memory / Optane / a higher-tier disk with a dedicated power supply. All this is not due to size, but due to the different contingencies you need to cover in order to prevent major data loss... Now add backups and snapshots, and perhaps replication across 2-3 geographical areas if you are running an international business, and that's quite a bill to foot.
There are similar problems with memory: there can only be so many lanes on the memory bus and only so many memory modules you can attach to a single CPU. And if you also want a lot of storage, then, similarly, there can only be so many individual storage devices attached, and so on.
Bottom line... even to reproduce the performance of your laptop in the cloud you would probably end up with some distributed solution, and you would still struggle with latency.
Azure has the LS series of VMs [1], which can have up to ten 1.92 TB disks attached directly to the CPU via NVMe. We use these for spilling big data to disk during high-performance computation rather than for single-machine persistence, so we also don't bother with RAID replication in the first place.
Though it is a bit disappointing that while Microsoft advertises this as "good for distributed persistent stores", there are no obvious SLAs that I could rely on for actually trusting such a cluster with my data persistence.
Well, attaching 10 PCIe devices is going to give your CPU a very hard time if all of them are in use. The speed of copying from memory or between devices will become a bottleneck. Another problem is that such a machine will also need a huge amount of memory for all that copying to work. And if you want this to work well, you'd need some high-end hardware to actually pull it off. In such a system, your CPU will prevent you from exploiting the possible benefits of parallelization. It seems beefy, but it's entirely possible that a distributed solution built at a fraction of the cost would perform just as well.
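A back-of-envelope calculation shows why the memory bus becomes the choke point. The numbers below are illustrative assumptions, not vendor specs (a fast PCIe 4.0 NVMe drive reads on the order of 7 GB/s; a multi-channel server memory setup moves on the order of a couple hundred GB/s):

```python
# Back-of-envelope sketch; all figures are rough, illustrative assumptions.
nvme_drives = 10
nvme_gbps_each = 7     # ~7 GB/s sequential read per fast PCIe 4.0 NVMe drive
mem_gbps = 200         # rough aggregate bandwidth of a multi-channel DDR setup

aggregate_disk = nvme_drives * nvme_gbps_each   # 70 GB/s of raw disk reads

# Each byte read is typically copied at least once on its way to the
# application (device -> page cache -> user buffer), so the memory bus
# sees a multiple of the raw disk rate:
copies_per_byte = 2
mem_traffic = aggregate_disk * copies_per_byte  # 140 GB/s of memory traffic

print(f"{mem_traffic} GB/s of ~{mem_gbps} GB/s memory bandwidth "
      f"({100 * mem_traffic / mem_gbps:.0f}%) spent just shuttling data")
```

Under these assumptions, most of the memory bandwidth is consumed before the CPU does any actual computation on the data, which is the bottleneck being described.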
This situation may not be reflected in Azure pricing (the calculator gives $7.68/h for L80as_v3), since if MS has such hardware, it would be a waste for it to stand idle. They'd be incentivized to rent it out even at a discount (their main profit is from traffic anyway). So you may not be getting an adequate reading of the situation if you judge it by the price (rather than the cost). And this is only the price of the VM; I'm scared to think how much you'd pay if you actually utilized it to its full potential.
Also, since it claims to have 80 vCPUs, well... it's either a very expensive server, or it's, again, a distributed system where you simply don't see the distributed part. I haven't dealt with such hardware firsthand, but in our DC we have a GigaIO PCIe TOR switch which would, in principle, allow you to have that much memory in a single VM (we don't use it like that). That thing, together with the rest of the hardware setup, costs some six-digit number of dollars. I imagine something similar must exist for CPU sharing/aggregation.
> The high throughput and IOPS of the local disk makes the Lasv3-series VMs ideal for NoSQL stores such as Apache Cassandra and MongoDB.
This is cringe-worthy. Cassandra is a product of abysmal quality when it comes to performing I/O. It cannot saturate the system at all... I mean, for example, if you take old reliable PostgreSQL or MySQL, then with a lot of effort you may get them to dedicate up to 30% of CPU time to I/O. The reason for the relatively low utilization (compared to direct writes to disk) is the need for synchronization, which isn't well aligned with how the disk may want to handle destaging.
Cassandra is in a class of its own when it comes to I/O. You'd be happy to hit 2-3% CPU utilization in the same context where PostgreSQL would hit 30%. I have no idea what it's doing to cause such poor performance, but if I had to guess: some application logic... making expensive calculations serialized with I/O, or just waiting on mutexes...
So, yeah... someone who wanted Cassandra to perform well would probably need that kind of beefy machine :D But whether that's sound advice -- I don't know.
Even if this is off by two orders of magnitude and it's actually 100,000 companies that need distributed compute, that still means almost all companies just need a single large computer.
Looking at the distribution of companies by employee count and assuming that data scales with employee count (dangerous assumption, but probably true enough on average), that means that companies don't need distributed compute until they get several hundred employees. [0]
I/O performance is just one of many characteristics that impact performance and from experience the one you least need to worry about. RAID 0 across multiple high-end NVME drives with OS file caching is going to be more than fast enough for most use cases.
The issue is running out of CPU performance and being able to seamlessly scale up/down compute with live running workloads.
But ... we are ... basically by definition. Vanishingly few projects actually need cloud-scale infrastructure.
And, to address your previous statement: one beefy server is actually pretty scalable. Software threads spin up in microseconds to serve incoming requests, communication between threads is blazing fast, caching is simpler on one machine, etc. You don't even have to worry too much about scaling; the CPU just throttles itself when there is no load.
And every once in a while you just upgrade to the next gen beefy machine.
Don't forget the cool JS library you included to track mouse movements so you can optimize your UI to make sure Important Money Making Things are easily clickable.
That's 8.4 hojillion megabytes per second right there.