
I’ve argued this before and I’ll argue it here now:

Modern computers are fast enough that in many cases “the only database you will ever need” can be files on the filesystem. For example “1 row = 1 file”.

It brings additional benefits as well: for low-write applications you can use git to get a history (+transactions if you store them in the log), backups are super easy, replication is trivial. For higher-write applications it gets more complex, but you can still plan and implement most of the traditional DB scaling techniques (and even implement them one at a time as you grow).
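To make the "1 row = 1 file" idea concrete, here's a minimal sketch: a hypothetical "users" table where each row is a JSON file named by its primary key. The directory name, field names, and atomic-rename trick are my assumptions, not anything prescribed above.

```python
import json
from pathlib import Path

# Hypothetical "users" table: one JSON file per row, filename = primary key.
DB_DIR = Path("users")
DB_DIR.mkdir(exist_ok=True)

def write_row(pk: str, row: dict) -> None:
    """Insert or update a row by writing to a temp file, then renaming.
    rename() is atomic on POSIX, so readers never see a half-written row."""
    tmp = DB_DIR / (pk + ".tmp")
    tmp.write_text(json.dumps(row))
    tmp.rename(DB_DIR / (pk + ".json"))

def read_row(pk: str) -> dict:
    """Fetch a row by primary key: just read its file."""
    return json.loads((DB_DIR / (pk + ".json")).read_text())

write_row("42", {"name": "Ada", "email": "ada@example.com"})
print(read_row("42")["name"])
```

With this layout, `git add users && git commit` after each batch of writes gives you the history mentioned above for free.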

Computers are “stupid fast” now that we’ve gotten off platters.



http://howfuckedismydatabase.com/letters/

> Name: Edward I'm using grep and find and the Unix file system

> Name: Toni Why do you need a database? I'm using CSV files!

> Name: Carlos I came after a long journey to your website to seek enlightenment : am I fucked? And it didn't answer that question clearly. Let me re-phrase it: I first search an LDAP directory, then I remotely execute a quota status, after that I query a PostgreSQL database, and then I generate a .txt file with the timestamp as its name or a .csv file with an hash as its name, and then I look up the files from a web page, load it all to a multi-dimensional array, and generate a nice report, re-loading the entire file every time the user wants to, say, sort by another field. Something this complex can't ever possibly fuck up, can it?


Transactions, really? How do you lock rows, how do you have relations, how do you do joins? In fact, ext3 can only handle about 50,000 files in one directory. So you'll have to split up your "primary key" into letters like abc/def/foo like we do


The bad but valid answer to locking rows and doing relations is that you write the logic into your application. The better answer is of course that if you’re doing specific types of complex things, a DB is a better fit.

Honestly, splitting into nested subfolders is not a big deal anymore; it’s a single function you can write even if you’re having a “.10X” day.
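That "single function" might look something like this: a sketch that maps a key to a sharded path like the `abc/def/foo` scheme mentioned above. The level and width parameters are assumptions; tune them so no directory exceeds your filesystem's comfortable limit.

```python
def shard_path(key: str, levels: int = 2, width: int = 3) -> str:
    """Map a key like 'abcdeffoo' to 'abc/def/abcdeffoo' so that no
    single directory accumulates too many entries."""
    parts = [key[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts + [key])

print(shard_path("abcdeffoo"))  # -> abc/def/abcdeffoo
```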


Postfix turning eyes away


I honestly can't tell if this is satire or not.

But I think I'm curious either way: how do you index a column?


Store the index on the filesystem and populate it on write.
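One hypothetical way to do that: keep an index directory per indexed column, where each distinct value is a subdirectory containing empty marker files named by primary key. Lookups become a directory listing. The directory names and column here are illustrative assumptions.

```python
from pathlib import Path

# Hypothetical secondary index over an "email" column.
INDEX_DIR = Path("index/email")

def index_on_write(pk: str, email: str) -> None:
    """On every row write, also touch an empty marker file under the
    indexed value, so the index stays current."""
    bucket = INDEX_DIR / email
    bucket.mkdir(parents=True, exist_ok=True)
    (bucket / pk).touch()

def lookup(email: str) -> list[str]:
    """Find all primary keys with this email: list the bucket directory."""
    bucket = INDEX_DIR / email
    return [p.name for p in bucket.iterdir()] if bucket.exists() else []

index_on_write("42", "ada@example.com")
print(lookup("ada@example.com"))  # -> ['42']
```

Note this assumes the indexed values are safe as directory names; anything with `/` or other special characters would need escaping or hashing first.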

Not satire, though a bit sensationalistic to argue it’s a solid solution that’s usually overlooked because it’s “too slow”. I’m just pointing out it’s not actually slow any more.

Back in the days of scaled applications running on MySQL, DDR2 was 3200MB/s and people were so happy when their DB was small enough they could fit it in RAM.


I don't think the problem is that it's too slow.

I think the problem is that all sorts of utilities and commands break when dealing with hundreds of thousands of files in a single directory.

Also the block size means you'll waste an incredible amount of disk space.


I feel like wasting disk space is not a real issue any more: if you’re working with more than 1TB of data, you have (or at least had better have) the resources to pay for bigger HDs, which come at an almost trivial cost per TB.

Files in a single directory: it was discussed in another comment, but there’s a tried-and-true solution to that: simply nest your items in folders. For example, with a UUID as primary key you could use a folder structure of ‘(first 4 bytes)/(next 4 bytes)/(...and so on)/(full uuid)’, nesting deep enough that no directory holds more than 50,000 files. For smaller pools you can reduce the layers; ‘(first 2 bytes)/(full uuid)’, for example, still gives you quite a few entries before any one folder reaches 50,000.


> how do you do joins?


Use filename as your primary key and folder structure as “table”

Though, if you are doing a lot of joins your application is probably a better fit for a graph database instead of a “relational” one anyway
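A sketch of what "filename as primary key, folder as table" looks like for an application-level join: load both "table" folders and do a nested-loop join in memory. The table names, the `team_id` foreign-key column, and the JSON row format are all assumptions for illustration.

```python
import json
from pathlib import Path

def load_table(table_dir: str) -> dict:
    """Load every row file in a 'table' folder, keyed by filename (the PK)."""
    return {p.stem: json.loads(p.read_text())
            for p in Path(table_dir).glob("*.json")}

def join(left: dict, right: dict, fk: str) -> list:
    """Application-level nested-loop join: pair each left row with the
    right row whose primary key matches the left row's foreign-key column."""
    return [(lrow, right[lrow[fk]])
            for lrow in left.values() if lrow.get(fk) in right]

# Usage sketch, assuming users/42.json holds {"name": "Ada", "team_id": "7"}
# and teams/7.json holds {"name": "Compilers"}:
# pairs = join(load_table("users"), load_table("teams"), fk="team_id")
```

This is O(n) per lookup into `right` since it's a dict, but it re-reads whole tables into memory, which is exactly the kind of workload where a real relational DB starts to earn its keep.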



