I was hired by Sony at one point to help optimize a piece of video editing software that an outsourced team had created but could not figure out how to make a) go faster and b) use less RAM. This was back when 16 GB was on the upper end of what workstation-class machines could handle.
The desktop application, written in an early, 2004-ish version of C# and .NET, regularly brought the workstations to their knees, thrashing memory and pegging the CPU whenever large 1080p images were loaded or moved around.
Each RGBA image, stored as a PNG on the HDD, was loaded in; each 32-bit RGBA pixel was unpacked into its R, G, B, and alpha components, each held in its own 32-bit unsigned integer, which was then boxed into an Int object and appended to an ArrayList allocated dynamically at read time. Deep copies were made of each image any time one of them was resized or manipulated, keeping the original untouched image in RAM in case it was needed. A copy of the image before each transformation was stored on the Undo stack. And the 1080p working surface of the screen was super-sampled at 8x resolution to support defringing when images were layered. All of it stored as 8-bit RGBA components in boxed 32-bit integers in dynamically allocated ArrayLists.
The obvious thing to do would be to store the image in memory as just an array of native 8-bit unsigned integers (RGBARGBARGBARGBA...) and see where that got you. I assume C# has the equivalent of a Java byte[], except that fortunately it is unsigned in C# instead of signed as in Java.
I expect that you would get quite a performance boost with just that.
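Not the C# fix itself, but a rough Python sketch of the same packed-vs-boxed contrast for a hypothetical 1920x1080 RGBA frame (sizes are estimates for 64-bit CPython):

    import sys

    w, h = 1920, 1080
    channels = w * h * 4                            # R, G, B, A per pixel

    # Packed: one flat buffer of 8-bit channel values (RGBARGBA...).
    packed = bytearray(channels)
    print(sys.getsizeof(packed) / 2**20, "MiB")     # ~8 MiB plus a small header

    # Boxed: one object per channel value held in a list,
    # roughly 28 bytes per small int object + 8 bytes per list slot.
    boxed_estimate = channels * (sys.getsizeof(255) + 8)
    print(boxed_estimate / 2**20, "MiB")            # on the order of 280 MiB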
Interesting. Python doesn't use tagged pointers? I would think most dynamic languages would store immediate chars/floats/ints in a single tagged 32-bit/64-bit word. That's some crazy overhead.
Absolutely everything in CPython is a PyObject, and that can’t be changed without breaking the C API. A PyObject contains (among other things) a type pointer, a reference count, and a data field; none of these things can be changed without (again) breaking the C API.
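You can see that header overhead directly from Python; the sizes below are typical for 64-bit CPython and vary a little by version:

    import sys

    # Every value is a full heap object with a type pointer and a refcount.
    print(sys.getsizeof(1.0))   # 24 bytes for a float (16-byte header + 8-byte value)
    print(sys.getsizeof(1))     # 28 bytes for a small int
    print(sys.getsizeof(b""))   # ~33 bytes for an empty bytes object
    print(sys.getsizeof([]))    # ~56 bytes for an empty list, before any elements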
There have definitely been attempts to modernize; the HPy project (https://hpyproject.org/), for instance, moves towards a handle-oriented API that keeps implementation details private and thus enables certain optimizations.
I love Python as a language, and all its packages, and have been using it since the late 90s, but Python's legacy decisions are one step away from causing the language to be found face down in a dirty ditch after an all-night bender.
It takes like 5 minutes, and once you are in the habit it's something you do automatically as you write the code and so it doesn't actually cost you extra time.
Efficient representation should be something you build into your data model, it will save you time in the long run.
(Also, if you have 100s of columns you're hopefully already benefiting from something like NumPy or Arrow or whatever, so you're already doing better than you otherwise would be...)
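Concretely, the habit is often no more than declaring dtypes at load time; a minimal pandas sketch with made-up file and column names:

    import pandas as pd

    # Hypothetical columns; the point is picking compact dtypes up front
    # instead of letting everything default to int64/float64/object.
    dtypes = {
        "user_id": "int32",
        "score": "float32",
        "country": "category",   # repeated strings are stored once
    }
    df = pd.read_csv("events.csv", dtype=dtypes, parse_dates=["created_at"])
    print(df.memory_usage(deep=True))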
> It takes like 5 minutes, and once you are in the habit it's something you do automatically as you write the code and so it doesn't actually cost you extra time.
This is the argument I've been having my whole career with people who claim the better way is "too hard and too slow".
I'm like "gee, funny how the thing you do the most often you're fastest at... could it be that you'd be just as fast at a better thing if you did it more than never?"
As an individual contributor you have an incentive to approach a problem in the way that teaches you the most for your career - then you can pretend it's the approach that's the best effort to risk ratio.
Hah, I'd love to work with the datasets you work with if it takes five minutes to do this. Or maybe you're just suggesting it takes five minutes to write out "TEXT" for each column type?
The data I work with is messy, from hand written notes, multiple sources, millions of rows, etc etc. A single point that's written as "one" instead of 1 makes your whole idea fall on its face.
Tried that in the past, but it's really slow. Pandas is effectively removed from my workflows because of issues like this.
But, I have workarounds for these issues by loading everything into postgres under TEXT columns in a "raw" schema, then do some typecast tests in a descending list of types to get the smallest possible type to transfer to a new table in a "prod" schema. It's read-only data, so it's not a big deal to run it once, and builds out a chain of changes from csv -> sql.
Something like this could be done with pickling to avoid having to re-type every time I run the code (and I've done that for some past projects, but it's... ehhh).
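Not your postgres pipeline, but a toy Python sketch of the same "descending list of typecast tests" idea on one column of raw strings:

    def narrowest_type(values):
        """Return the first candidate type that every non-empty value casts to."""
        def casts(cast):
            def test(s):
                try:
                    cast(s)
                    return True
                except ValueError:
                    return False
            return test

        for name, test in [("integer", casts(int)), ("double precision", casts(float))]:
            if all(test(v) for v in values if v not in ("", None)):
                return name
        return "text"   # nothing narrower fits, fall back to TEXT

    print(narrowest_type(["1", "2", "3"]))    # integer
    print(narrowest_type(["1.5", "2"]))       # double precision
    print(narrowest_type(["one", "2"]))       # text -- the '"one" instead of 1' case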
Perhaps do the data-cleaning step before loading into a data frame? (Dataframes are, after all, for canonicalized+normalized data, just like RDBMS tables are.)
No way. The initial loading into a dataframe takes way too long to make it useful for exploratory work. Loading it into a database is done once, and then you forget about it. In the long run, the time wasted loading things into dataframes over and over and over again just isn't worth it. Keep in mind that we're talking about large datasets that may or may not fit into memory.
"Oops, messed up my import slightly. Gotta run it again and wait ages.... again"
"Oops, loaded the dataframe twice on accident and had OOM. Gotta restart from the beginning... again"
"Oops, forgot to .head(5) my dataframe and jupyter's crashed... again..."
Doing everything in SQL solves so many problems. And OOM is practically a non-issue.
For exploratory work, perhaps you should randomly sample some of the dataset (say 1k rows) and see the effect on that? After getting good results, you then switch to dealing with the whole dataset.
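e.g. something along these lines with pandas (file name made up), before committing to the full load:

    import pandas as pd

    # Peek at the first 1k rows to sort out parsing and cleaning rules cheaply;
    # only afterwards pay the cost of loading (or SQL-importing) the full file.
    sample = pd.read_csv("big_messy_export.csv", nrows=1_000, dtype=str)
    print(sample.head())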
Is enough data generated from handwritten notes that the memory cost is a serious problem? I was under the impression that hundreds of books worth of text fit in a gigabyte.
You'll need to decide on a case-by-case basis. Many datasets I work with are generated by machines, come from network cards, etc. - these are quite consistent. Occasionally I deal with datasets prepared by humans, and these are mediocre at best; in those cases I spend a lot of time cleaning them up. Once that's done, I can clearly see whether some columns can be stored in a more efficient way or not. If the dataset is large, I do it, because it gives me extra freedom if I can fit everything in RAM. If it's small, I don't bother; my time is more expensive than the potential gains.
You're describing operations done on data in memory to save memory. That list of fractions still needs to be in memory at some point. And if you're batching, this whole discussion goes out of the window.
I think the point is that you can use a streaming IO approach to transcode or load data into the compact representation in memory, which is then used by whatever algorithm actually needs the in-memory access. You don't have to naively load the entire serialization from disk into memory.
This is one reason projects like Twitter popularized serializations like json-stream in the past, to make it even easier to incrementally load a large file with basic software. Formats like TSV and CSV are also trivially easy to load with streaming IO.
I think the mark of good data formats and libraries is that they allow for this. They should not force an in-memory all or nothing approach, even if applications may want to put all their data in memory. If for no other reason, the application developer should be allowed to commit most of the system RAM to their actual data, not the temporary buffers needed during the IO process.
If I want to push a machine to its limits on some large data, I do not want to be limited to 1/2, 1/3 or worse of the machine size because some IO library developers have all read an article like this and think "my data fits in RAM"! It's not "your data" nor your RAM when you are writing a library. If a user's actual end data might just barely fit in RAM, it will certainly fail if the deep call-stack of typical data analysis tools is cavalier about allocating additional whole-dataset copies during some synchronous load step...
If you have a csv file that's 10 GB of numbers, and for your purposes float32 is sufficient precision, and there are only 1 billion numbers in the file, then you can read the file into a 4 GB array of float32 with very little overhead beyond the size of that target array. Reading the 10 GB file into memory in its entirety wouldn't help anything; nor would creating an intermediate array of full-precision numbers.
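A minimal sketch of that with pandas/NumPy, assuming a single-column CSV of plain numbers and a known (or over-estimated) row count:

    import numpy as np
    import pandas as pd

    n_rows = 1_000_000_000                       # ~4 GB target array of float32
    out = np.empty(n_rows, dtype=np.float32)

    pos = 0
    # Stream the file in chunks; only one chunk plus the target array is resident.
    for chunk in pd.read_csv("numbers.csv", header=None,
                             dtype=np.float32, chunksize=10_000_000):
        vals = chunk.to_numpy().ravel()
        out[pos:pos + len(vals)] = vals
        pos += len(vals)
    out = out[:pos]                              # trim if the estimate was high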
There's still a difference between terabytes of data spread between RAM sticks multiple inches away from the CPU die and the megabytes of cache data close enough to the compute silicon to experience quantum effects. A modern CPU can burn through hundreds of instructions in the time it takes to resolve a miss in a thrashing cache.
RAM has roughly 5 to 50 times the latency of cache. SSD is ~1000 times slower than RAM. And a spinning HDD is, I think, ~100 times slower than an SSD.
If your database fits in RAM you’re likely in a happy place. If your DB is so massive it needs to spill to disk you wind up with a mountain of complexity. Multiple machines, sharding, hot/cold etc.
The point of the article is “modern servers have a lot of RAM and you might be able to delete a lot of complexity if you throw money at a server with 4 terabytes of RAM. This option is more practical than you might have realized!”
> I also don’t know where you get “Python float is 24 bytes + 8-byte pointer for each item in the list”. Wat.
Not sure how many ways there are to reword that. A CPython float takes 24 bytes of memory, and storing them in a list means 8 bytes per item for the pointer. So in CPython, a list of n floats takes 32n bytes of memory.
>>> sys.getsizeof(1.0)
24
>>> l = list(map(float, range(62_500_000))) # memory use goes up by >2 GB
>>> del l # memory use goes down by >2 GB
(no need to go straight to NumPy to avoid this when relevant, though – array.array is built in.)
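For example (size is approximate; array.array may over-allocate slightly):

>>> import array, sys
>>> a = array.array("d", range(62_500_000))  # raw 8-byte doubles, no per-item objects
>>> sys.getsizeof(a)                         # roughly 0.5 GB instead of >2 GB for the list above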
We went with this approach. Pandas hit GIL limits which made it too slow. Then we moved to Dask and hit GIL limits on the scheduler process. Then we moved to Spark and hit JVM GC slowdowns on the amount of allocated memory. Then we burned it all down and became hermits in the woods.
I have decided that all solutions to questions of scale fall into one of two general categories. Either you can spend all your money on computers, or you can spend all your money on C/C++/Rust/Cython/Fortran/whatever developers.
There's one severely under-appreciated factor that favors the first option: computers are commodities that can be acquired very quickly. Almost instantaneously if you're in the cloud. Skilled lower-level programmers are very definitely not commodities, and growing your pool of them can easily take months or years.
Buying hardware won't give you the same performance benefits as a better implementation/architecture.
And if the problem is big enough, buying hardware will cause operational problems, so you'll need more people. And most likely you're not gonna wanna spend on people, so you get a bunch of people who won't fix the problem, but buy more hardware.
>And if the problem is big enough, buying hardware will cause operational problems, so you'll need more people. And most likely you're not gonna wanna spend on people, so you get a bunch of people who won't fix the problem, but buy more hardware.
And yet, people still regularly choose to go down a path that leads there. Because business decisions are about satisficing, not optimizing. So "I'm 90% sure I will be able to cope with problems of this type but it might cost as much as $10,000,000" is often favored above, "I am 75% sure I might be able to solve problems of this type for no more than $500,000," when the hypothetical downside of not solving it is, "We might go out of business."
On the other hand, you can't argue with a computer. That may explain why some of my coworkers seem to behave as if they wish they were computers...
It's too difficult to renegotiate with computers, too easy to renegotiate with people. When you don't actually know what you need to do, you need people. When you think you know what you need to do, but you're wrong, then you really need people. Most of us are in the latter category, most of the time.
At some point, you just mmap shared system memory, as read only, giving direct global access to your dataset, like the good old days (or in any embedded system).
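In Python that's only a few lines with numpy.memmap (file name and dtype assumed):

    import numpy as np

    # Map a big binary file of float32s read-only. Pages are faulted in on demand
    # and shared between processes by the OS page cache, not copied per process.
    data = np.memmap("dataset.f32", dtype=np.float32, mode="r")
    print(data[:10], data.nbytes / 2**30, "GiB mapped")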
We were trying to keep everything on one machine in (mostly) memory for simplicity. Once you open up the Pandora's box of distributed compute there are a lot of options, including other ways of running Spark. But yes, in retrospect, we should have opened that box first.
I have solved a similar problem in a similar way, and I've found Polars <https://www.pola.rs/> to solve this quite well without needing ClickHouse. It has a Python library but does most processing in Rust, across multiple cores. I've used it for data sets up to about 20 GB no worries, but my computer's RAM became the issue, not Polars itself.
We were using 500+ GB of memory at peak and were expecting that to grow. If I remember correctly, we didn't go with Polars because we needed to run custom apply functions on DataFrames. Polars had them, but the function took a tuple (not a DF or dict), which when you've got 20+ columns makes for really error-prone code. Dask and Spark both supported a batch transform operation, so the function took a Pandas DataFrame as input and output.
Have you tried Vaex? It is a pandas-like Python library that uses C++ underneath, memory mapping, and optimized memory access patterns. It's allegedly very fast at least up to 1 TB; I've used it for 10-15 GB.
The original site made by lukegb inspired me because of the down-to-earth simplicity. Scaling vertically is often so much easier and better in so many dimensions than creating a complex distributed computing setup.
This is why I recreated the site when it went down quite a while ago.
The recent article "Use One Big Server"[0] inspired me to (re)submit this website to HN because it addresses the same topic. I like this new article so much because in this day and age of the cloud, people tend to forget how insanely fast and powerful modern servers have become.
And if you don't have budget for new equipment, the second-hand stuff from a few years back is still beyond amazing, and the prices are very reasonable compared to cloud costs. Sure, running bare metal co-located somewhere has its own cost, but it's not that big of a deal, and many issues can be dealt with using 'remote hands' services.
To be fair, the article admits that in the end it's really about your organisation's specific circumstances and thus your requirements. Physical servers and/or vertical scaling may not (always) be the right answer. That said, do yourself a favour, and do take this option seriously and at least consider it. You can even do an experiment: buy some second-hand gear just to gain some experience with hardware if you don't have it already and do a trial in a co-location.
Now that we are talking, yourdatafitsinram.net runs on a Raspberry Pi 4 which in turn is running on solar power.[1]
(The blog and this site are both running on the same host)
> many issues can be dealt with using 'remote hands' services.
I have a few second-hand HP/Dell/Supermicro systems running colocated. I find that for all software issues, remote management / IPMI / KVM over IP is perfectly sufficient. Remote hands are needed only for actual hardware issues, most of which is "replace this component with an identical one". Usually HDD, if you're running those. Overall, I'm quite happy with the setup and it's very high on the value/$ spectrum.
IPMI is nice, although the older you go, the more particular it gets. I had professional experience with the Supermicro Xeon E5-2600 series v1-v4, and recently started renting a previous-generation server[1], and it's worse than the ones I used before. It's still serviceable though; but I'm not sure if it's using a dedicated LAN, because the KVM and the SOL drop out when the OS starts or ends; it'll come back, but you miss early boot messages.
It's definitely worth the effort to script starting the KVM, and maybe even the SOL. If you've got a bunch of servers, you should script the power management as well; if nothing else, you want to rate-limit power commands across your fleet to prevent accidental mass restarts. Intentional mass restarts can probably happen through the OS, so 1 power command per second across your fleet is probably fine. (You can always hack out the rate limit if you're really sure.)
[1] I don't need a whole server, but for $30/month when I wanted to leave my VPS behind for a few reasons anyway...
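On the rate-limit point above: a minimal Python sketch wrapping ipmitool (hosts and credentials are placeholders, and you'd want proper secret handling):

    import subprocess, time

    HOSTS = ["bmc01.example.net", "bmc02.example.net"]   # hypothetical BMC addresses

    def power_cycle(host):
        # ipmitool over the network; error handling kept minimal for brevity.
        subprocess.run(["ipmitool", "-I", "lanplus", "-H", host,
                        "-U", "admin", "-P", "changeme",
                        "chassis", "power", "cycle"], check=True)

    for host in HOSTS:
        power_cycle(host)
        time.sleep(1.0)   # crude rate limit: at most one power command per second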
Yes, I bet a lot of people aren't even aware of the IPMI/KVM-over-IP capabilities that servers have had for decades, which make hardware management (manual or automated!) much easier.
Remote hands is for the inevitable hardware failure (Disk, PSU, Fan) or human error (you locked yourself out somehow remotely from IPMI).
P.S. I have a HP Proliant DL380 G8 with 128 GB of memory and 20 physical cores as a lab system for playing with many virtual machines. I turn it on and off on demand using IPMI.
This kind of realization that "yes, it probably will" has recently inspired me to hand-build various database engines wherein the entire working set lives in memory. I do realize others have worked on this idea too, but I always wanted to play with it myself.
My most recent prototypes use a hybrid mechanism that dramatically increases the supported working set size. Any property larger than a specific cutoff becomes a separate read operation against the durable log; for these properties, only the log's 64-bit offset is stored in memory. There is an alternative heuristic that lets the developer add attributes signifying whether properties are to be maintained in memory or permitted to be secondary lookups.
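Not my actual implementation, but a toy Python sketch of the "inline below a cutoff, otherwise a 64-bit log offset" idea:

    import json

    CUTOFF = 256  # bytes; values larger than this live only in the log

    class Record:
        def __init__(self):
            self.inline = {}    # property name -> value kept in RAM
            self.spilled = {}   # property name -> 64-bit offset into the durable log

    def put(record, log, name, value):
        blob = json.dumps(value).encode()
        log.seek(0, 2)                                        # append to end of log
        offset = log.tell()
        log.write(len(blob).to_bytes(8, "little") + blob)     # always durable
        if len(blob) <= CUTOFF:
            record.inline[name] = value
        else:
            record.spilled[name] = offset                     # 8 bytes in RAM, not the blob

    def get(record, log, name):
        if name in record.inline:
            return record.inline[name]
        log.seek(record.spilled[name])
        length = int.from_bytes(log.read(8), "little")
        return json.loads(log.read(length))

    # usage: log = open("data.log", "w+b"); r = Record()
    # put(r, log, "thumbnail", "x" * 100_000); get(r, log, "thumbnail")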
As a consequence, that 2TB worth of ram can properly track hundreds or even thousands of TB worth of effective data.
If you are using modern NVMe storage, those reads to disk are stupid-fast even in the worst case. There's still a really good chance you will get a hit in the IO cache if your application isn't ridiculous and has some predictable access patterns.
I don't mean to discourage personal exploration in any way, but when doing this sort of thing it can also be illuminating to consider the null hypothesis... what happens if you let the conventional software use a similarly enlarged RAM budget or fast storage?
SQLite or PostgreSQL can be given some configuration/hints to be more aggressive about using RAM while still having their built-in capability to spill to storage rather than hit a hard limit. Or on Linux (at least), just allowing the OS page cache to sprawl over a large RAM system may make the IO so fast that the database doesn't need to worry about special RAM usage. For PostgreSQL, this can just be hints to the optimizer to adjust the cost model and consider random access to be cheaper when comparing possible query plans.
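For SQLite, for instance, a couple of pragmas go a long way (values here are just illustrative):

    import sqlite3

    con = sqlite3.connect("data.db")
    # Keep far more pages cached in RAM (a negative value means KiB).
    con.execute("PRAGMA cache_size = -8000000")      # ~8 GB of page cache
    # Serve reads through a memory map instead of read() syscalls.
    con.execute("PRAGMA mmap_size = 274877906944")   # allow mapping up to 256 GiB
    # (Rough PostgreSQL analogues: shared_buffers, effective_cache_size,
    #  and lowering random_page_cost so the planner trusts cheap random reads.)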
Once you do some sanity check benchmarks of different systems like that, you might find different bottlenecks than expected, and this might highlight new performance optimization quests you hadn't even considered before. :-)
To add to this thread, some people have done this exploration before, see slides 11 and 20 of [0] or Figure 1 of [1]. Could you just throw more RAM at your problems in a disk-backed system? In practice, probably. But there are distinct advantages to designing upfront for an in-memory scenario.
> what happens if you let the conventional software use a similarly enlarged RAM budget or fast storage?
Oh I absolutely have gone down this road as well.
The biggest thing for me is taking advantage of the other benefits you can get with in-memory working sets, such as arbitrary pointer machine representations.
When working with a traditional SQL engine (even one tuned for memory-only operation), there are many rules you have to play by or things will not go well.
Around 2000, a guy told me he had been asked to help with very significant performance issues on a server running a critical application. He quickly figured out that the server was running out of memory. Option 1 was to rewrite the application to use less memory. He chose option 2: increase the server memory, going from 64 MB to 128 MB (yes, MB).
At that time, 128 MB was an ungodly amount of memory and memory was very expensive. But it was still cheaper to just throw RAM at the problem than to spend many hours rewriting the application.
Several years ago, my job got a dev and a prod server, each with a terabyte of RAM. I liked the dev server because a few times I found myself thinking "this would be easy to debug if I had an insane amount of RAM" and then I would remember I did.
Basically working with code manipulating a couple dozen GB of data and then keeping a couple dozen copies of that to examine it after various stages of manipulation.
Absolutely not. You can purchase a system with 1 TB of RAM and some decent CPUs etc. for ~25k.
My lab just did this. That's far overpriced.
133k is closer to what you would spend if you used a machine with 1 TB "in the cloud".
I still remember the first advertisement I saw for 1TB of disk space. I think it was about 1997 and about the biggest individual drive you could buy was 2GB. The system was the size of a couple of server racks and they put 500 of those disks in it. It cost over $1M for the whole system.
Note that when you start talking about multiple tens of TiB of RAM, you have to buy super-high-density DIMMs, which are very expensive (because not many of them get made and anyone who needs one has lots of money).
Amazing. This has been the solution to postgres issues for me. Just add enough memory that everything, or at least everything that is accessed frequently can fit in RAM. Suddenly everything is cached and fast.
Your data might even fit in the CPU L3 cache... But most likely you want your data to be persistent. But how often do you actually "pull the plug" on your servers!? And what happens when SSDs are fast enough? Will we see a whole new architecture where the working memory is integrated into the CPU and the main memory is persistent?
On one hand it would be cool to have some persistence in the CPU. On the other -- imagine if rebooting a computer didn't make the problems all go away. What a nightmare.
A lot of people were scratching their heads over how to put Optane to use. A fast SSD? Or a slow DRAM? With a proprietary API? And requiring support from the hardware platform? The whole product line was inconsistent, as they used the same name for both storage devices and NVDIMMs.
Don't forget memory bandwidth. You may be able to shoehorn your data in a single machine's memory, but it may take a very long time to process all of it.
Sometimes sharding will yield better results. It will still take a long time to read through 64 TB of memory, even if you are just NOPing on it.
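Rough back-of-the-envelope, assuming ~200 GB/s of memory bandwidth per socket (the exact number is an assumption):

    # A single socket streaming 64 TB of RAM once, at ~200 GB/s:
    print(64e12 / 200e9, "seconds per full pass")   # 320 seconds, i.e. over 5 minutes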
I actually worked on a project that did this. We used a database called "Tile38" [1] which used an R-Tree to make geospatial queries speedy. It was pretty good.
Our dataset was ~150 GiB, I think? All in RAM. Took a while to start the server, as it all came off disk. Could have been faster. (It borrowed Redis's query language, and its storage was just "store the commands that recreate the DB, literally", IIRC. Dead simple, but a lot of slack/wasted space there.)
Overall not a bad database. Latency serving out of RAM was, as one should/would expect, very speedy!
>Gates himself has strenuously denied making the comment. In a newspaper column that he wrote in the mid-1990s, Gates responded to a student's question about the quote: "I've said some stupid things and some wrong things, but not that. No one involved in computers would ever say that a certain amount of memory is enough for all time." Later in the column, he added, "I keep bumping into that silly quotation attributed to me that says 640K of memory is enough. There's never a citation; the quotation just floats like a rumor, repeated again and again."
Anyone have any recommendations for a SQL engine that works on in-memory data and has a simple/monolithic architecture? Our data is about 50-100 GB (uncompressed) and thus easily fits into memory. I am sure we could do our processing using something like Polars or pandas in memory quite quickly, but we prefer a SQL interface. Using Postgres is still quite slow even when it has more than enough memory available, compared to something like DuckDB. DuckDB has other limitations, however. I've been eyeing MemSQL, but that also seems to be targeted more towards multi-machine deployments.
Is the point of this that you can do large-scale data processing without the overhead of distribution if you're willing to pay for the kind of hardware that can give you fast random access to all of it?
Funny, it lets you click to negative amounts of RAM. My -1 PiB fits in RAM, so having it as a unit is not useless. (It also accepts fractions but not octal)
It’s using an HTML <input type=number> element, which requires that the value be a valid floating-point number <https://html.spec.whatwg.org/multipage/common-microsyntaxes....>. You’ll note from this that exponent notation is allowed, something people are unlikely to expect. For restricting negative numbers, it should probably have had the attribute min=0; then negative zero would be the lowest you could go.
The only people who treat data as plural are academics, usually social scientists, and overeducated journalists. Sentences like, "What do these data tell us?" are awkward for everyone else.
Like "information", "data" is correctly used as a singular mass noun.
Am I the only one here using Chrome or is everyone else just ignoring the table being broken? The author used an <object> tag which just results in Chrome displaying "This plugin is not supported". I'm unsure why they didn't just use an iframe instead.
Hilarious to get voted down for this when the equivalent comment is still: please stop wasting storage on data that is never going to be needed. If the LHC can find the Higgs boson before hitting the exa-scale, then it's not needed to identify "not a hotdog".
The technique used means that it always takes a while to load, as it's reloading after every change, which generally takes around two seconds for me. You make it still worse by using 'change' and 'keyup' events, so any key press (e.g. left/right arrow) triggers it. It should use the 'input' event.
Even if it were done as a separate page, it should still be shown or hidden via the CSS display property in order to avoid this reloading.
I was disappointed that the page didn't start offering vintage computers for very small datasets given that it has bytes and kilobytes as options ("your data is too large for a VIC-20, but a Commodore 64 should handle it")
* Python list of N Python floats: 32×N bytes (approximate, the Python float is 24 bytes + 8-byte pointer for each item in the list)
* NumPy array of N double floats: 8×N bytes
* Hey, we don't need that much precision, let's use 32-bit floats in NumPy: 4×N
* Actually, values of 0-100 are good enough, let's just use uint8 in NumPy and divide by 100 if necessary to get the fraction: N bytes
And now we're down to 3% of original memory usage, and quite possibly with no meaningful impact on the application.
(See e.g. https://pythonspeed.com/articles/python-integers-memory/ and https://pythonspeed.com/articles/pandas-reduce-memory-lossy/ for longer prose versions that approximate the above.)
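In NumPy the whole ladder is only a few lines (sizes shown for 10 million values):

    import numpy as np

    n = 10_000_000
    f64 = np.random.default_rng(0).random(n)        # float64: 8*n bytes (~80 MB)
    f32 = f64.astype(np.float32)                    # 4*n bytes (~40 MB)
    u8  = np.round(f64 * 100).astype(np.uint8)      # n bytes (~10 MB), values 0-100
    print(f64.nbytes, f32.nbytes, u8.nbytes)
    fractions = u8 / 100.0                          # recover approximate fractions if needed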