
I’ve argued this before and I’ll argue it here now:

Modern computers are fast enough that in many cases “the only database you will ever need” can be files on the filesystem. For example “1 row = 1 file”.

It brings additional benefits as well: for low-write applications you can use git to get a history (+transactions if you store them in the log), backups are super easy, replication is trivial. For higher-write applications it gets more complex, but you can still plan and implement most of the traditional DB scaling techniques (and even implement them one at a time as you grow).
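To make the "1 row = 1 file" idea concrete, here's a minimal sketch: a hypothetical "users" table where each row is a JSON file named by its primary key. The directory name, field names, and atomic-rename trick are my assumptions, not anything prescribed above.

```python
import json
from pathlib import Path

# Hypothetical "users" table: one JSON file per row, filename = primary key.
DB_DIR = Path("users")
DB_DIR.mkdir(exist_ok=True)

def write_row(pk: str, row: dict) -> None:
    """Insert or update a row by writing to a temp file, then renaming.
    rename() is atomic on POSIX, so readers never see a half-written row."""
    tmp = DB_DIR / (pk + ".tmp")
    tmp.write_text(json.dumps(row))
    tmp.rename(DB_DIR / (pk + ".json"))

def read_row(pk: str) -> dict:
    """Fetch a row by primary key: just read its file."""
    return json.loads((DB_DIR / (pk + ".json")).read_text())

write_row("42", {"name": "Ada", "email": "ada@example.com"})
print(read_row("42")["name"])
```

With this layout, `git add users && git commit` after each batch of writes gives you the history mentioned above for free.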

Computers are “stupid fast” now that we’ve gotten off platters.



http://howfuckedismydatabase.com/letters/

> Name: Edward I'm using grep and find and the Unix file system

> Name: Toni Why do you need a database? I'm using CSV files!

> Name: Carlos I came after a long journey to your website to seek enlightenment : am I fucked? And it didn't answer that question clearly. Let me re-phrase it: I first search an LDAP directory, then I remotely execute a quota status, after that I query a PostgreSQL database, and then I generate a .txt file with the timestamp as its name or a .csv file with an hash as its name, and then I look up the files from a web page, load it all to a multi-dimensional array, and generate a nice report, re-loading the entire file every time the user wants to, say, sort by another field. Something this complex can't ever possibly fuck up, can it?


Transactions, really? How do you lock rows, how do you have relations, how do you do joins? In fact, ext3 can only handle about 50,000 files in one directory. So you'll have to split up your "primary key" into letters like abc/def/foo like we do


The bad but valid answer to locking rows and doing relations is that you write the logic into your application. The better answer is of course that if you’re doing specific types of complex things, a DB is a better fit.

Honestly, splitting into nested subfolders is not a big deal anymore; it’s a single function you can write even if you’re having a “.10X” day.
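That "single function" might look something like this: a sketch that maps a key to a sharded path like the `abc/def/foo` scheme mentioned above. The level and width parameters are assumptions; tune them so no directory exceeds your filesystem's comfortable limit.

```python
def shard_path(key: str, levels: int = 2, width: int = 3) -> str:
    """Map a key like 'abcdeffoo' to 'abc/def/abcdeffoo' so that no
    single directory accumulates too many entries."""
    parts = [key[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts + [key])

print(shard_path("abcdeffoo"))  # -> abc/def/abcdeffoo
```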


Postfix turning eyes away


I honestly can't tell if this is satire or not.

But I think I'm curious either way: how do you index a column?


Store the index on the filesystem and populate it on write.
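One hypothetical way to do that: keep an index directory per indexed column, where each distinct value is a subdirectory containing empty marker files named by primary key. Lookups become a directory listing. The directory names and column here are illustrative assumptions.

```python
from pathlib import Path

# Hypothetical secondary index over an "email" column.
INDEX_DIR = Path("index/email")

def index_on_write(pk: str, email: str) -> None:
    """On every row write, also touch an empty marker file under the
    indexed value, so the index stays current."""
    bucket = INDEX_DIR / email
    bucket.mkdir(parents=True, exist_ok=True)
    (bucket / pk).touch()

def lookup(email: str) -> list[str]:
    """Find all primary keys with this email: list the bucket directory."""
    bucket = INDEX_DIR / email
    return [p.name for p in bucket.iterdir()] if bucket.exists() else []

index_on_write("42", "ada@example.com")
print(lookup("ada@example.com"))  # -> ['42']
```

Note this assumes the indexed values are safe as directory names; anything with `/` or other special characters would need escaping or hashing first.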

Not satire, though a bit sensationalistic to argue it’s a solid solution that’s usually overlooked because it’s “too slow”. I’m just pointing out it’s not actually slow any more.

Back in the days of scaled applications running on MySQL, DDR2 was 3200MB/s and people were so happy when their DB was small enough they could fit it in RAM.


I don't think the problem is that it's too slow.

I think the problem is that all sorts of utilities and commands break when dealing with hundreds of thousands of files in a single directory.

Also the block size means you'll waste an incredible amount of disk space.


I feel like wasting disk space is not a real issue any more: if you’re working with more than 1TB of data, you have (or at least had better have) the resources to pay for bigger HDs, which come at an almost trivial cost per TB.

Files in a single directory: it was discussed in another comment, but there’s a tried-and-true solution to that: simply nest your items in folders. For example, with a UUID as primary key you could use a folder structure of ‘(first 4 bytes)/(next 4 bytes)/(...and so on)/(full uuid)’, nesting deep enough that no directory holds more than 50,000 files. For smaller pools you can reduce the layers; ‘(first 2 bytes)/(full uuid)’, for example, still gives you quite a few entries before any one folder reaches 50,000.


> how do you do joins?


Use filename as your primary key and folder structure as “table”

Though, if you are doing a lot of joins your application is probably a better fit for a graph database instead of a “relational” one anyway
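A sketch of what "filename as primary key, folder as table" looks like for an application-level join: load both "table" folders and do a nested-loop join in memory. The table names, the `team_id` foreign-key column, and the JSON row format are all assumptions for illustration.

```python
import json
from pathlib import Path

def load_table(table_dir: str) -> dict:
    """Load every row file in a 'table' folder, keyed by filename (the PK)."""
    return {p.stem: json.loads(p.read_text())
            for p in Path(table_dir).glob("*.json")}

def join(left: dict, right: dict, fk: str) -> list:
    """Application-level nested-loop join: pair each left row with the
    right row whose primary key matches the left row's foreign-key column."""
    return [(lrow, right[lrow[fk]])
            for lrow in left.values() if lrow.get(fk) in right]

# Usage sketch, assuming users/42.json holds {"name": "Ada", "team_id": "7"}
# and teams/7.json holds {"name": "Compilers"}:
# pairs = join(load_table("users"), load_table("teams"), fk="team_id")
```

This is O(n) per lookup into `right` since it's a dict, but it re-reads whole tables into memory, which is exactly the kind of workload where a real relational DB starts to earn its keep.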



