I was hired by Sony at one point to help optimize a piece of video editing software that an outsourced team had created but could not figure out how to make a) go faster and b) use less RAM. This was back when 16 GB was on the upper end of what workstation-class machines could handle.
The desktop application, written in an early, 2004-ish version of C# and .NET, regularly brought the workstations to their knees, thrashing memory and pegging the CPU whenever large 1080p images were loaded or moved around.
Each RGBA image, stored as a PNG on the HDD, was loaded in; each 32-bit RGBA pixel was unpacked into its R, G, B, and alpha components, each held in its own 32-bit unsigned integer, which was then boxed into an Int object and appended to an ArrayList allocated dynamically at read time. Deep copies were made of each image any time one of them was resized or manipulated, keeping the original untouched image in RAM in case it was needed. A copy of the image before each transformation was stored on the Undo stack. And the 1080p working surface of the screen was super-sampled at 8x resolution to support defringing when images were layered. All of it stored as 8-bit RGBA components in boxed 32-bit integers in dynamically allocated ArrayLists.
The obvious thing to do would be to store the image in memory as just an array of native 8-bit unsigned integers (RGBARGBARGBARGBA...) and see where that got you. I assume C# has the equivalent of a Java byte[], except that fortunately it is unsigned in C# instead of signed as in Java.
I expect that you would get quite a performance boost with just that.
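Not the C# fix itself, but a rough Python sketch of the same packed-vs-boxed contrast for a hypothetical 1920x1080 RGBA frame (sizes are estimates for 64-bit CPython):

    import sys

    w, h = 1920, 1080
    channels = w * h * 4                            # R, G, B, A per pixel

    # Packed: one flat buffer of 8-bit channel values (RGBARGBA...).
    packed = bytearray(channels)
    print(sys.getsizeof(packed) / 2**20, "MiB")     # ~8 MiB plus a small header

    # Boxed: one object per channel value held in a list,
    # roughly 28 bytes per small int object + 8 bytes per list slot.
    boxed_estimate = channels * (sys.getsizeof(255) + 8)
    print(boxed_estimate / 2**20, "MiB")            # on the order of 280 MiB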
Interesting. Python doesn't use tagged pointers? I would think most dynamic languages would store immediate chars/floats/ints in a single tagged 32-bit/64-bit word. That's some crazy overhead.
Absolutely everything in CPython is a PyObject, and that can’t be changed without breaking the C API. A PyObject contains (among other things) a type pointer, a reference count, and a data field; none of these things can be changed without (again) breaking the C API.
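You can see that header overhead directly from Python; the sizes below are typical for 64-bit CPython and vary a little by version:

    import sys

    # Every value is a full heap object with a type pointer and a refcount.
    print(sys.getsizeof(1.0))   # 24 bytes for a float (16-byte header + 8-byte value)
    print(sys.getsizeof(1))     # 28 bytes for a small int
    print(sys.getsizeof(b""))   # ~33 bytes for an empty bytes object
    print(sys.getsizeof([]))    # ~56 bytes for an empty list, before any elements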
There have definitely been attempts to modernize; the HPy project (https://hpyproject.org/), for instance, moves towards a handle-oriented API that keeps implementation details private and thus enables certain optimizations.
I love Python as a language, and all its packages, and have been using it since the late 90s, but Python's legacy decisions are one step away from causing the language to be found face down in a dirty ditch after an all-night bender.
It takes like 5 minutes, and once you are in the habit it's something you do automatically as you write the code and so it doesn't actually cost you extra time.
Efficient representation should be something you build into your data model, it will save you time in the long run.
(Also, if you have 100s of columns you're hopefully already benefiting from something like NumPy or Arrow or whatever, so you're already doing better than you otherwise would be...)
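Concretely, the habit is often no more than declaring dtypes at load time; a minimal pandas sketch with made-up file and column names:

    import pandas as pd

    # Hypothetical columns; the point is picking compact dtypes up front
    # instead of letting everything default to int64/float64/object.
    dtypes = {
        "user_id": "int32",
        "score": "float32",
        "country": "category",   # repeated strings are stored once
    }
    df = pd.read_csv("events.csv", dtype=dtypes, parse_dates=["created_at"])
    print(df.memory_usage(deep=True))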
> It takes like 5 minutes, and once you are in the habit it's something you do automatically as you write the code and so it doesn't actually cost you extra time.
This is the argument I've been having my whole career with people who claim the better way is "too hard and too slow".
I'm like "gee, funny how the thing you do the most often you're fastest at... could it be that you'd be just as fast at a better thing if you did it more than never?"
As an individual contributor you have an incentive to approach a problem in the way that teaches you the most for your career - then you can pretend it's the approach that's the best effort to risk ratio.
Hah, I'd love to work with the datasets you work with if it takes five minutes to do this. Or maybe you're just suggesting it takes five minutes to write out "TEXT" for each column type?
The data I work with is messy, from hand written notes, multiple sources, millions of rows, etc etc. A single point that's written as "one" instead of 1 makes your whole idea fall on its face.
Tried that in the past, but it's really slow. Pandas is effectively removed from my workflows because of issues like this.
But, I have workarounds for these issues by loading everything into postgres under TEXT columns in a "raw" schema, then do some typecast tests in a descending list of types to get the smallest possible type to transfer to a new table in a "prod" schema. It's read-only data, so it's not a big deal to run it once, and builds out a chain of changes from csv -> sql.
Something like this could be done with pickling to avoid having to re-type every time I run the code (and I've done that for some past projects, but it's... ehhh).
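Not your postgres pipeline, but a toy Python sketch of the same "descending list of typecast tests" idea on one column of raw strings:

    def narrowest_type(values):
        """Return the first candidate type that every non-empty value casts to."""
        def casts(cast):
            def test(s):
                try:
                    cast(s)
                    return True
                except ValueError:
                    return False
            return test

        for name, test in [("integer", casts(int)), ("double precision", casts(float))]:
            if all(test(v) for v in values if v not in ("", None)):
                return name
        return "text"   # nothing narrower fits, fall back to TEXT

    print(narrowest_type(["1", "2", "3"]))    # integer
    print(narrowest_type(["1.5", "2"]))       # double precision
    print(narrowest_type(["one", "2"]))       # text -- the '"one" instead of 1' case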
Perhaps do the data-cleaning step before loading into a data frame? (Dataframes are, after all, for canonicalized+normalized data, just like RDBMS tables are.)
No way. The initial loading into a dataframe takes way too long to make it useful for exploratory work. Loading it into a database is done once, and then you forget about it. In the long run, the time wasted loading things into dataframes over and over and over again just isn't worth it. Keep in mind that we're talking about large datasets that may or may not fit into memory.
"Oops, messed up my import slightly. Gotta run it again and wait ages.... again"
"Oops, loaded the dataframe twice on accident and had OOM. Gotta restart from the beginning... again"
"Oops, forgot to .head(5) my dataframe and jupyter's crashed... again..."
Doing everything in SQL solves so many problems. And OOM is practically a non-issue.
For exploratory work, perhaps you should randomly sample some of the dataset (say 1k rows) and see the effect on that? After getting good results, you then switch to dealing with the whole dataset.
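e.g. something along these lines with pandas (file name made up), before committing to the full load:

    import pandas as pd

    # Peek at the first 1k rows to sort out parsing and cleaning rules cheaply;
    # only afterwards pay the cost of loading (or SQL-importing) the full file.
    sample = pd.read_csv("big_messy_export.csv", nrows=1_000, dtype=str)
    print(sample.head())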
Is enough data generated from handwritten notes that the memory cost is a serious problem? I was under the impression that hundreds of books worth of text fit in a gigabyte.
You'll need to decide on a case-by-case basis. Many datasets I work with are generated by machines, come from network cards, etc. - these are quite consistent. Occasionally I deal with datasets prepared by humans, and these are mediocre at best; in those cases I spend a lot of time cleaning them up. Once that's done, I can clearly see whether some columns can be stored in a more efficient way or not. If the dataset is large, I do it, because it gives me extra freedom if I can fit everything in RAM. If it's small, I don't bother; my time is more expensive than the potential gains.
You're describing operations done on data in memory to save memory. That list of fractions still needs to be in memory at some point. And if you're batching, this whole discussion goes out of the window.
I think the point is that you can use a streaming IO approach to transcode or load data into the compact representation in memory, which is then used by whatever algorithm actually needs the in-memory access. You don't have to naively load the entire serialization from disk into memory.
This is one reason projects like Twitter popularized serializations like json-stream in the past, to make it even easier to incrementally load a large file with basic software. Formats like TSV and CSV are also trivially easy to load with streaming IO.
I think the mark of good data formats and libraries is that they allow for this. They should not force an in-memory all or nothing approach, even if applications may want to put all their data in memory. If for no other reason, the application developer should be allowed to commit most of the system RAM to their actual data, not the temporary buffers needed during the IO process.
If I want to push a machine to its limits on some large data, I do not want to be limited to 1/2, 1/3 or worse of the machine size because some IO library developers have all read an article like this and think "my data fits in RAM"! It's not "your data" nor your RAM when you are writing a library. If a user's actual end data might just barely fit in RAM, it will certainly fail if the deep call-stack of typical data analysis tools is cavalier about allocating additional whole-dataset copies during some synchronous load step...
If you have a csv file that's 10 GB of numbers, and for your purposes float32 is sufficient precision, and there are only 1 billion numbers in the file, then you can read the file into a 4 GB array of float32 with very little overhead beyond the size of that target array. Reading the 10 GB file into memory in its entirety wouldn't help anything; nor would creating an intermediate array of full-precision numbers.
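A minimal sketch of that with pandas/NumPy, assuming a single-column CSV of plain numbers and a known (or over-estimated) row count:

    import numpy as np
    import pandas as pd

    n_rows = 1_000_000_000                       # ~4 GB target array of float32
    out = np.empty(n_rows, dtype=np.float32)

    pos = 0
    # Stream the file in chunks; only one chunk plus the target array is resident.
    for chunk in pd.read_csv("numbers.csv", header=None,
                             dtype=np.float32, chunksize=10_000_000):
        vals = chunk.to_numpy().ravel()
        out[pos:pos + len(vals)] = vals
        pos += len(vals)
    out = out[:pos]                              # trim if the estimate was high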
There's still a difference between terabytes of data spread between RAM sticks multiple inches away from the CPU die and the megabytes of cache data close enough to the compute silicon to experience quantum effects. A modern CPU can burn through hundreds of instructions in the time it takes to resolve a miss in a thrashing cache.
RAM has roughly 5 to 50 times the latency of cache. SSD is ~1000 times slower than RAM. And a spinning HDD is, I think, ~100 times slower than an SSD.
If your database fits in RAM you’re likely in a happy place. If your DB is so massive it needs to spill to disk you wind up with a mountain of complexity. Multiple machines, sharding, hot/cold etc.
The point of the article is “modern servers have a lot of RAM and you might be able to delete a lot of complexity if you throw money at a server with 4 terabytes of RAM. This option is more practical than you might have realized!”
> I also don’t know where you get “Python float is 24 bytes + 8-byte pointer for each item in the list”. Wat.
Not sure how many ways there are to reword that. A CPython float takes 24 bytes of memory, and storing them in a list means 8 bytes per item for the pointer. So in CPython, a list of n floats takes 32n bytes of memory.
>>> sys.getsizeof(1.0)
24
>>> l = list(map(float, range(62_500_000))) # memory use goes up by >2 GB
>>> del l # memory use goes down by >2 GB
(no need to go straight to NumPy to avoid this when relevant, though – array.array is built in.)
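For example (size is approximate; array.array may over-allocate slightly):

>>> import array, sys
>>> a = array.array("d", range(62_500_000))  # raw 8-byte doubles, no per-item objects
>>> sys.getsizeof(a)                         # roughly 0.5 GB instead of >2 GB for the list above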
We went with this approach. Pandas hit GIL limits which made it too slow. Then we moved to Dask and hit GIL limits on the scheduler process. Then we moved to Spark and hit JVM GC slowdowns on the amount of allocated memory. Then we burned it all down and became hermits in the woods.
I have decided that all solutions to questions of scale fall into one of two general categories. Either you can spend all your money on computers, or you can spend all your money on C/C++/Rust/Cython/Fortran/whatever developers.
There's one severely under-appreciated factor that favors the first option: computers are commodities that can be acquired very quickly. Almost instantaneously if you're in the cloud. Skilled lower-level programmers are very definitely not commodities, and growing your pool of them can easily take months or years.
Buying hardware won't give you the same performance benefits as a better implementation/architecture.
And if the problem is big enough, buying hardware will cause operational problems, so you'll need more people. And most likely you're not gonna wanna spend on people, so you get a bunch of people who won't fix the problem, but buy more hardware.
>And if the problem is big enough, buying hardware will cause operational problems, so you'll need more people. And most likely you're not gonna wanna spend on people, so you get a bunch of people who won't fix the problem, but buy more hardware.
And yet, people still regularly choose to go down a path that leads there. Because business decisions are about satisficing, not optimizing. So "I'm 90% sure I will be able to cope with problems of this type but it might cost as much as $10,000,000" is often favored above, "I am 75% sure I might be able to solve problems of this type for no more than $500,000," when the hypothetical downside of not solving it is, "We might go out of business."
On the other hand, you can't argue with a computer. That may explain why some of my coworkers seem to behave as if they wish they were computers...
It's too difficult to renegotiate with computers, too easy to renegotiate with people. When you don't actually know what you need to do, you need people. When you think you know what you need to do, but you're wrong, then you really need people. Most of us are in the latter category, most of the time.
At some point, you just mmap shared system memory, as read only, giving direct global access to your dataset, like the good old days (or in any embedded system).
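In Python that's only a few lines with numpy.memmap (file name and dtype assumed):

    import numpy as np

    # Map a big binary file of float32s read-only. Pages are faulted in on demand
    # and shared between processes by the OS page cache, not copied per process.
    data = np.memmap("dataset.f32", dtype=np.float32, mode="r")
    print(data[:10], data.nbytes / 2**30, "GiB mapped")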
We were trying to keep everything on one machine in (mostly) memory for simplicity. Once you open up the Pandora's box of distributed compute there are a lot of options, including other ways of running Spark. But yes, in retrospect, we should have opened that box first.
I have solved a similar problem in a similar way, and I've found Polars <https://www.pola.rs/> to solve this quite well without needing ClickHouse. It has a Python library but does most processing in Rust, across multiple cores. I've used it for data sets up to about 20 GB no worries, but my computer's RAM became the issue, not Polars itself.
We were using 500+ GB of memory at peak and were expecting that to grow. If I remember correctly, we didn't go with Polars because we needed to run custom apply functions on DataFrames. Polars had them, but the function took a tuple (not a DF or dict), which when you've got 20+ columns makes for really error-prone code. Dask and Spark both supported a batch transform operation, so the function took a Pandas DataFrame as input and output.
Have you tried Vaex? It is a pandas-like Python library that uses C++ underneath, memory mapping, and optimized memory access patterns. It's allegedly very fast at least up to 1 TB; I've used it for 10-15 GB.
The original site made by lukegb inspired me because of the down-to-earth simplicity. Scaling vertically is often so much easier and better in so many dimensions than creating a complex distributed computing setup.
This is why I recreated the site when it went down quite a while ago.
The recent article "Use One Big Server"[0] inspired me to (re)submit this website to HN because it addresses the same topic. I like this new article so much because in this day and age of the cloud, people tend to forget how insanely fast and powerful modern servers have become.
And if you don't have budget for new equipment, the second-hand stuff from a few years back is still beyond amazing, and the prices are very reasonable compared to cloud costs. Sure, running bare metal co-located somewhere has its own cost, but it's not that big of a deal, and many issues can be dealt with using 'remote hands' services.
To be fair, the article admits that in the end it's really about your organisation's specific circumstances and thus your requirements. Physical servers and/or vertical scaling may not (always) be the right answer. That said, do yourself a favour, and do take this option seriously and at least consider it. You can even do an experiment: buy some second-hand gear just to gain some experience with hardware if you don't have it already and do a trial in a co-location.
Now that we are talking, yourdatafitsinram.net runs on a Raspberry Pi 4 which in turn is running on solar power.[1]
(The blog and this site are both running on the same host)
> many issues can be dealt with using 'remote hands' services.
I have a few second-hand HP/Dell/Supermicro systems running colocated. I find that for all software issues, remote management / IPMI / KVM over IP is perfectly sufficient. Remote hands are needed only for actual hardware issues, most of which is "replace this component with an identical one". Usually HDD, if you're running those. Overall, I'm quite happy with the setup and it's very high on the value/$ spectrum.
IPMI is nice, although the older you go, the more particular it gets. I had professional experience with the Supermicro Xeon E5-2600 series v1-v4, and recently started renting a previous-generation server[1], and it's worse than the ones I used before. It's still serviceable though; but I'm not sure if it's using a dedicated LAN, because the KVM and the SOL drop out when the OS starts or ends; it'll come back, but you miss early boot messages.
It's definitely worth the effort to script starting the KVM, and maybe even the SOL. If you've got a bunch of servers, you should script the power management as well; if nothing else, you want to rate-limit power commands across your fleet to prevent accidental mass restarts. Intentional mass restarts can probably happen through the OS, so 1 power command per second across your fleet is probably fine. (You can always hack out the rate limit if you're really sure.)
[1] I don't need a whole server, but for $30/month when I wanted to leave my VPS behind for a few reasons anyway...
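On the rate-limit point above: a minimal Python sketch wrapping ipmitool (hosts and credentials are placeholders, and you'd want proper secret handling):

    import subprocess, time

    HOSTS = ["bmc01.example.net", "bmc02.example.net"]   # hypothetical BMC addresses

    def power_cycle(host):
        # ipmitool over the network; error handling kept minimal for brevity.
        subprocess.run(["ipmitool", "-I", "lanplus", "-H", host,
                        "-U", "admin", "-P", "changeme",
                        "chassis", "power", "cycle"], check=True)

    for host in HOSTS:
        power_cycle(host)
        time.sleep(1.0)   # crude rate limit: at most one power command per second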
Yes, I bet a lot of people aren't even aware of the IPMI/KVM-over-IP capabilities that servers have had for decades, which make hardware management (manual or automated!) much easier.
Remote hands is for the inevitable hardware failure (Disk, PSU, Fan) or human error (you locked yourself out somehow remotely from IPMI).
P.S. I have a HP Proliant DL380 G8 with 128 GB of memory and 20 physical cores as a lab system for playing with many virtual machines. I turn it on and off on demand using IPMI.
This kind of realization that "yes, it probably will" has recently inspired me to hand-build various database engines wherein the entire working set lives in memory. I do realize others have worked on this idea too, but I always wanted to play with it myself.
My most recent prototypes use a hybrid mechanism that dramatically increases the supported working set size. Any property larger than a specific cutoff becomes a separate read operation against the durable log; for these properties, only the log's 64-bit offset is stored in memory. There is an alternative heuristic that lets the developer add attributes signifying whether properties are to be maintained in memory or permitted to be secondary lookups.
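Not my actual implementation, but a toy Python sketch of the "inline below a cutoff, otherwise a 64-bit log offset" idea:

    import json

    CUTOFF = 256  # bytes; values larger than this live only in the log

    class Record:
        def __init__(self):
            self.inline = {}    # property name -> value kept in RAM
            self.spilled = {}   # property name -> 64-bit offset into the durable log

    def put(record, log, name, value):
        blob = json.dumps(value).encode()
        log.seek(0, 2)                                        # append to end of log
        offset = log.tell()
        log.write(len(blob).to_bytes(8, "little") + blob)     # always durable
        if len(blob) <= CUTOFF:
            record.inline[name] = value
        else:
            record.spilled[name] = offset                     # 8 bytes in RAM, not the blob

    def get(record, log, name):
        if name in record.inline:
            return record.inline[name]
        log.seek(record.spilled[name])
        length = int.from_bytes(log.read(8), "little")
        return json.loads(log.read(length))

    # usage: log = open("data.log", "w+b"); r = Record()
    # put(r, log, "thumbnail", "x" * 100_000); get(r, log, "thumbnail")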
As a consequence, that 2TB worth of ram can properly track hundreds or even thousands of TB worth of effective data.
If you are using modern NVMe storage, those reads to disk are stupid-fast even in the worst case. There's still a really good chance you will get a hit in the IO cache if your application isn't ridiculous and has some predictable access patterns.
I don't mean to discourage personal exploration in any way, but when doing this sort of thing it can also be illuminating to consider the null hypothesis... what happens if you let the conventional software use a similarly enlarged RAM budget or fast storage?
SQLite or PostgreSQL can be given some configuration/hints to be more aggressive about using RAM while still having their built-in capability to spill to storage rather than hit a hard limit. Or on Linux (at least), just allowing the OS page cache to sprawl over a large RAM system may make the IO so fast that the database doesn't need to worry about special RAM usage. For PostgreSQL, this can just be hints to the optimizer to adjust the cost model and consider random access to be cheaper when comparing possible query plans.
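For SQLite, for instance, a couple of pragmas go a long way (values here are just illustrative):

    import sqlite3

    con = sqlite3.connect("data.db")
    # Keep far more pages cached in RAM (a negative value means KiB).
    con.execute("PRAGMA cache_size = -8000000")      # ~8 GB of page cache
    # Serve reads through a memory map instead of read() syscalls.
    con.execute("PRAGMA mmap_size = 274877906944")   # allow mapping up to 256 GiB
    # (Rough PostgreSQL analogues: shared_buffers, effective_cache_size,
    #  and lowering random_page_cost so the planner trusts cheap random reads.)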
Once you do some sanity check benchmarks of different systems like that, you might find different bottlenecks than expected, and this might highlight new performance optimization quests you hadn't even considered before. :-)
To add to this thread, some people have done this exploration before, see slides 11 and 20 of [0] or Figure 1 of [1]. Could you just throw more RAM at your problems in a disk-backed system? In practice, probably. But there are distinct advantages to designing upfront for an in-memory scenario.
> what happens if you let the conventional software use a similarly enlarged RAM budget or fast storage?
Oh I absolutely have gone down this road as well.
The biggest thing for me is taking advantage of the other benefits you can get with in-memory working sets, such as arbitrary pointer machine representations.
When working with a traditional SQL engine (even one tuned for memory-only operation), there are many rules you have to play by or things will not go well.
Around 2000, a guy told me he had been asked to help with very significant performance issues on a server running a critical application. He quickly figured out that the server was running out of memory. Option 1 was to rewrite the application to use less memory. He chose option 2: increase the server memory, going from 64 MB to 128 MB (yes, MB).
At that time, 128 MB was an ungodly amount of memory and memory was very expensive. But it was still cheaper to just throw RAM at the problem than to spend many hours rewriting the application.
Several years ago, my job got a dev and a prod server, each with a terabyte of RAM. I liked the dev server because a few times I found myself thinking "this would be easy to debug if I had an insane amount of RAM" and then I would remember I did.
Basically working with code manipulating a couple dozen GB of data and then keeping a couple dozen copies of that to examine it after various stages of manipulation.
Absolutely not. You can purchase a system with 1 TB of RAM and some decent CPUs etc. for ~25k.
My lab just did this. That's far overpriced.
133k is closer to what you would spend if you used a machine with 1 TB "in the cloud".
I still remember the first advertisement I saw for 1TB of disk space. I think it was about 1997 and about the biggest individual drive you could buy was 2GB. The system was the size of a couple of server racks and they put 500 of those disks in it. It cost over $1M for the whole system.
Note that when you start talking about multiple tens of TiB of RAM, you have to buy super-high-density DIMMs, which are very expensive (because not many of them get made and anyone who needs one has lots of money).
Amazing. This has been the solution to postgres issues for me. Just add enough memory that everything, or at least everything that is accessed frequently can fit in RAM. Suddenly everything is cached and fast.
Your data might even fit in the CPU L3 cache... But most likely you want your data to be persistent. But how often do you actually "pull the plug" on your servers!? And what happens when SSDs are fast enough? Will we see a whole new architecture where the working memory is integrated into the CPU and the main memory is persistent?
On one hand it would be cool to have some persistence in the CPU. On the other -- imagine if rebooting a computer didn't make the problems all go away. What a nightmare.
A lot of people were scratching their heads over how to put Optane to use. A fast SSD? Or a slow DRAM? With a proprietary API? And requiring support from the hardware platform? The whole product line was inconsistent, as they used the same name for both storage devices and NVDIMMs.
Don't forget memory bandwidth. You may be able to shoehorn your data in a single machine's memory, but it may take a very long time to process all of it.
Sometimes sharding will yield better results. It will still take a long time to read through 64 TB of memory, even if you are just NOPing on it.
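Rough back-of-the-envelope, assuming ~200 GB/s of memory bandwidth per socket (the exact number is an assumption):

    # A single socket streaming 64 TB of RAM once, at ~200 GB/s:
    print(64e12 / 200e9, "seconds per full pass")   # 320 seconds, i.e. over 5 minutes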
I actually worked on a project that did this. We used a database called "Tile38" [1] which used an R-Tree to make geospatial queries speedy. It was pretty good.
Our dataset was ~150 GiB, I think? All in RAM. Took a while to start the server, as it all came off disk. Could have been faster. (It borrowed Redis's query language, and its storage was just "store the commands that recreate the DB, literally", IIRC. Dead simple, but a lot of slack/wasted space there.)
Overall not a bad database. Latency serving out of RAM was, as one should/would expect, very speedy!
>Gates himself has strenuously denied making the comment. In a newspaper column that he wrote in the mid-1990s, Gates responded to a student's question about the quote: "I've said some stupid things and some wrong things, but not that. No one involved in computers would ever say that a certain amount of memory is enough for all time." Later in the column, he added, "I keep bumping into that silly quotation attributed to me that says 640K of memory is enough. There's never a citation; the quotation just floats like a rumor, repeated again and again."
Anyone have any recommendations for a SQL engine that works on in-memory data and has a simple/monolithic architecture? Our data is about 50-100 GB (uncompressed) and thus easily fits into memory. I am sure we could do our processing using something like Polars or pandas in memory quite quickly, but we prefer a SQL interface. Using Postgres is still quite slow even when it has more than enough memory available, compared to something like DuckDB. DuckDB has other limitations, however. I've been eyeing MemSQL, but that also seems to be targeted more towards multi-machine deployments.
Is the point of this that you can do large-scale data processing without the overhead of distribution if you're willing to pay for the kind of hardware that can give you fast random access to all of it?
Funny, it lets you click to negative amounts of RAM. My -1 PiB fits in RAM, so having it as a unit is not useless. (It also accepts fractions but not octal)
It’s using an HTML <input type=number> element, which requires that the value be a valid floating-point number <https://html.spec.whatwg.org/multipage/common-microsyntaxes....>. You’ll note from this that exponent notation is allowed, something people are unlikely to expect. For restricting negative numbers, it should probably have had the attribute min=0; then negative zero would be the lowest you could go.
The only people who treat data as plural are academics, usually social scientists, and overeducated journalists. Sentences like, "What do these data tell us?" are awkward for everyone else.
Like "information", "data" is correctly used as a singular mass noun.
Am I the only one here using Chrome or is everyone else just ignoring the table being broken? The author used an <object> tag which just results in Chrome displaying "This plugin is not supported". I'm unsure why they didn't just use an iframe instead.
Hilarious to get voted down for this when the equivalent comment is still: please stop wasting storage on data that is never going to be needed. If the LHC can find the Higgs boson before hitting the exa-scale, then it's not needed to identify "not a hotdog".
The technique used means that it always takes a while to load, as it's reloading after every change, which generally takes around two seconds for me. You make it still worse by using 'change' and 'keyup' events, so any key press (e.g. left/right arrow) triggers it. It should use the 'input' event.
Even if it were done as a separate page, it should still be shown or hidden via the CSS display property in order to avoid this reloading.
I was disappointed that the page didn't start offering vintage computers for very small datasets given that it has bytes and kilobytes as options ("your data is too large for a VIC-20, but a Commodore 64 should handle it")
* Python list of N Python floats: 32×N bytes (approximate, the Python float is 24 bytes + 8-byte pointer for each item in the list)
* NumPy array of N double floats: 8×N bytes
* Hey, we don't need that much precision, let's use 32-bit floats in NumPy: 4×N
* Actually, values of 0-100 are good enough, let's just use uint8 in NumPy and divide by 100 if necessary to get the fraction: N bytes
And now we're down to 3% of original memory usage, and quite possibly with no meaningful impact on the application.
(See e.g. https://pythonspeed.com/articles/python-integers-memory/ and https://pythonspeed.com/articles/pandas-reduce-memory-lossy/ for longer prose versions that approximate the above.)
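In NumPy the whole ladder is only a few lines (sizes shown for 10 million values):

    import numpy as np

    n = 10_000_000
    f64 = np.random.default_rng(0).random(n)        # float64: 8*n bytes (~80 MB)
    f32 = f64.astype(np.float32)                    # 4*n bytes (~40 MB)
    u8  = np.round(f64 * 100).astype(np.uint8)      # n bytes (~10 MB), values 0-100
    print(f64.nbytes, f32.nbytes, u8.nbytes)
    fractions = u8 / 100.0                          # recover approximate fractions if needed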