Using SQLite in my ETL processes is something I have done for over a decade. It'...

chrisweekly · on Jan 26, 2020

Yes! Along these lines I heartily recommend `lnav` ^1, a fantastic, lightweight, scriptable CLI mini-ETL tool with an embedded SQLite engine, ideally suited for working with moderately sized data sets (i.e., millions of rows, not billions) ... so useful!

1. https://lnav.org

tenken · on Jan 29, 2020

I have used it to inspect, say, the history of a user's requests on a load-balanced server. I like to permanently store the results of the logfile excerpt to a DB table for posterity and future reporting.

Figuring out how to enter "sql" mode in lnav, generate a logfile table, and then persist it from an in-memory sqlite db to a saved-to-disk sqlite db .... was frustratingly annoying.

It boils down to:

    :create-logline-table custom_log
    ;ATTACH DATABASE 'test02.db' AS bkup;
    ;create table bkup.custom_log as select * from custom_log;
    ;detach database bkup;

If I recall, you cannot call sqlite3 shell dot-commands such as ".backup" in lnav's SQL mode. So lnav's interjection into the sqlite command processing is annoying (I'm actually very familiar with sqlite).
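Outside lnav, the same ATTACH-and-copy pattern can be sketched with Python's standard `sqlite3` module. This is only an illustration of the technique, not anything lnav provides: the function name `persist_table` is hypothetical, and the table and file names are borrowed from the commands above.

```python
import sqlite3


def persist_table(conn: sqlite3.Connection, table: str, dest_path: str) -> None:
    """Copy one table from an open (e.g. in-memory) database into a file on disk."""
    # ATTACH and DETACH are not allowed inside an open transaction.
    conn.commit()
    # Table names cannot be bound as parameters, so quote the identifier by hand.
    quoted = '"' + table.replace('"', '""') + '"'
    # The file name, unlike the table name, can be bound as a parameter.
    conn.execute("ATTACH DATABASE ? AS bkup", (dest_path,))
    try:
        conn.execute(f"CREATE TABLE bkup.{quoted} AS SELECT * FROM {quoted}")
        conn.commit()
    finally:
        conn.execute("DETACH DATABASE bkup")
```

For copying a whole database rather than a single table, Python 3.7 and later also offer `sqlite3.Connection.backup()`.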

pjot · on Jan 26, 2020

Would you mind elaborating on your ETL process a little more? I'm a junior DE and curious about how I would implement this.

at_a_remove · on Jan 26, 2020

It's pretty straightforward, really.

I construct the .sqlite database from scratch each time in Python, building out table after table as I like it.

Some configuration data is loaded in from files first. This could be some default values or even test records for later injection.

The input data is loaded into the appropriate tables and then indexed as appropriate (or if appropriate). It is as "raw" as I can get it.
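The first three steps might look roughly like the sketch below. It is a guess at the shape of the process, not the original code: the table names (`config`, `raw_input`), the columns, and the CSV input are all hypothetical.

```python
import csv
import os
import sqlite3


def build_database(db_path, config_rows, input_csv):
    """Rebuild the .sqlite file from scratch and load the raw input into it."""
    if os.path.exists(db_path):
        os.remove(db_path)  # start from an empty file on every run
    conn = sqlite3.connect(db_path)

    # Configuration data (defaults, test records) goes in first.
    conn.execute("CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT)")
    conn.executemany("INSERT INTO config VALUES (?, ?)", config_rows)

    # Raw input is stored as-is: every column TEXT, no cleanup yet.
    conn.execute("CREATE TABLE raw_input (id TEXT, name TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO raw_input VALUES (:id, :name, :email)",
        csv.DictReader(input_csv),  # each row is a dict keyed by the CSV header
    )

    # Index only after the bulk load, and only if later steps need it.
    conn.execute("CREATE INDEX idx_raw_input_id ON raw_input (id)")
    conn.commit()
    return conn
```

Here `input_csv` is any open text file (or `io.StringIO`) whose header row is `id,name,email`.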

Each successive transformation occurs on a new table. This is so I can always go back one step for any post-mortem if I need to. Also, I can reference something that might be DELETEd in a later table.
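One way to sketch the table-per-step idea is to materialize every transformation with `CREATE TABLE ... AS SELECT`. The step names, the queries, and the `raw_input` table below are hypothetical examples, not the actual transformations.

```python
import sqlite3

# Each step reads from an earlier table and writes a brand-new one,
# so every intermediate result stays available for a post-mortem.
STEPS = [
    ("step1_trimmed",
     "SELECT id, TRIM(name) AS name, LOWER(TRIM(email)) AS email FROM raw_input"),
    ("step2_valid",
     "SELECT * FROM step1_trimmed WHERE email LIKE '%_@_%'"),
]


def run_steps(conn, steps=STEPS):
    """Materialize each transformation as its own table, in order."""
    for table, select_sql in steps:
        # Table names come from the trusted STEPS list, never from user input.
        conn.execute(f"DROP TABLE IF EXISTS {table}")
        conn.execute(f"CREATE TABLE {table} AS {select_sql}")
    conn.commit()
```

Because nothing is updated in place, a row filtered out in `step2_valid` can still be looked up in `step1_trimmed` or `raw_input`.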

Often (and this is task-dependent), I will have to pull in data from other server-based databases, typically the target. They get their own tables. Then I can mark certain records as not being present in the target database, so they must be INSERTed. If a record is not present in my input and is there in the target, that would suggest a DELETE. Finally, I can compare records where some ID is present in both my input and my .sqlite; they might be good for an UPDATE. All of this is so I can make only the changes that need to be made. Speed is not important to me here, only understanding what changes needed to be made and having a record of what they were and why.
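The INSERT / DELETE / UPDATE classification can be sketched as three joins between the staged input and a local copy of the target. The table names (`staged_input`, `target_snapshot`, `planned_changes`) and the single compared column `name` are assumptions made for the example.

```python
import sqlite3


def plan_changes(conn):
    """Record which ids need an INSERT, DELETE, or UPDATE in the target."""
    conn.execute("DROP TABLE IF EXISTS planned_changes")
    conn.execute("""
        CREATE TABLE planned_changes AS
        -- In the input but missing from the target: INSERT.
        SELECT i.id AS id, 'INSERT' AS action
          FROM staged_input i
          LEFT JOIN target_snapshot t ON t.id = i.id
         WHERE t.id IS NULL
        UNION ALL
        -- In the target but missing from the input: DELETE.
        SELECT t.id, 'DELETE'
          FROM target_snapshot t
          LEFT JOIN staged_input i ON i.id = t.id
         WHERE i.id IS NULL
        UNION ALL
        -- In both, but the values differ: UPDATE.
        -- IS NOT is SQLite's NULL-safe inequality.
        SELECT i.id, 'UPDATE'
          FROM staged_input i
          JOIN target_snapshot t ON t.id = i.id
         WHERE i.name IS NOT t.name
    """)
    conn.commit()
    return dict(conn.execute("SELECT id, action FROM planned_changes"))
```

Keeping `planned_changes` as a table rather than only a return value means the reasons for each change stay in the .sqlite file alongside the logs.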

I am happy to say that an ETL process I wrote using this general method back around 2009 is probably still running. I haven't had to touch it in years. Occasionally I will receive questions as to "why did this happen?" and I can just start running queries on the resultant .sqlite database file, kept with the logs, for answers.

Similarly, I can use these sorts of techniques when I am analyzing other datasets. The value here is that I can just refresh one table when the relevant data comes in, rather than having to run the ingest process for everything all over again. This can save me a lot of time.
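Refreshing a single table might be sketched as below, assuming the table already exists and the new rows match its column order. The helper name `refresh_table` is hypothetical.

```python
import sqlite3


def refresh_table(conn, table, rows):
    """Replace the contents of one table, leaving every other table untouched."""
    rows = list(rows)
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so a failed load does not leave the table half-empty.
    with conn:
        conn.execute(f"DELETE FROM {table}")  # table name is trusted, not user input
        if rows:
            placeholders = ", ".join("?" * len(rows[0]))
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
```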

F_J_H · on Jan 26, 2020

Awesome - elegantly simple using very common technologies.

at_a_remove · on Jan 26, 2020

I am not a very talented programmer so I stick very close to what is common, standard, and easy to understand. It usually means I am on the downslope of the hype cycle and it limits some opportunities but I have become okay with that.

I have gotten some CS students who were about to shoot flies with various cannons turned on to SQLite. I kept a couple of the decent books about it nearby and would shove one into their hands at that point. Usually a week later they would be raving about it.

julianee · on Jan 27, 2020

Do you still have the titles of those books at hand? I'd love to take a look at them.

at_a_remove · on Jan 27, 2020

They are "The Definitive Guide to SQLite" by Mike Owens and "Using SQLite" by Jay A. Kreibich. I am quite sure they are more book than I needed; I only plumbed a fraction of SQLite's immense capabilities.

hinkley · on Jan 26, 2020

Do you generate the file from scratch every time or do you modify the previous one as new data arrives?

tracker1 · on Jan 26, 2020

Depends on what you want... if you have a separate db project, you can have the output of that project be a clean database for testing other things, or a set of migration scripts for existing deployments.

I've been working on doing something similar with containerized database servers for testing, while still having versioned scripts for prod (multiple separate deployments).

at_a_remove · on Jan 26, 2020

It is a bit of a hybrid.

In the early stages of development of whatever the ETL process is, I keep the database and just empty it out each time. As I get more of a sense of what I need, I start DROPping my TABLEs more often and remaking them. Eventually I make the whole database from scratch, once I am well along and have most everything fleshed out.

hinkley · on Jan 26, 2020

Ok. So each export is a full dump, not a delta on a previous one.

Do you anticipate hitting a wall at some point where the total time becomes a problem?

at_a_remove · on Jan 26, 2020

Well, it depends on the process. Some were full dumps, some were deltas pushed up to the final database, sometimes both (this product in particular had a load-from-file capability that you were supposed to use, but it had some edge cases that were not well addressed).

No, the time never grew significantly.

For one of the analysis projects, just one step of the analysis was quite time consuming but it would have been that way no matter what. SQLite allowed me to let it grind away overnight (or even over a weekend) on a workstation without tormenting production servers.