
We are having atrocious READ/WRITE latency with our PG database (the API layer is Django REST Framework). The problem table consists of multiple JSON blob fields with quite a bit of data, and I am convinced these need to be abstracted out into their own relational tables. Is this a sound solution? I believe it is the deserialization of the large nested JSON blobs in these fields that is causing the latency. Note: this database architecture was created by a contractor. There is no indexing and there are no relations in the current schema, just a single “Videos” table with all metadata stored as Postgres JSON field type blobs. EDIT: rebuilding the schema from the ground up with 5-6GB of data in the production database (not much, but still at the production level) is a hard sell, but I think it is necessary as we will be scaling enormously very soon. When I say rebuild, I mean a proper relational table layout with indexing, FKs, etc.

EDIT2: to further comment on the current table architecture, we have 3-4 other tables with minimal fields (3-4 Boolean/Char fields) that are relationally linked back to the Videos table via a char field ‘video_id’, which is unique on the Videos table. Again, not a proper foreign key, so no indexing.



Are you just doing primary key lookups? If so, a new index won’t do much as Postgres already has you covered there.

If you have any foreign key columns, add indexes on them. And if you’re doing any joins, make sure the criteria have indexes.

Similarly, if you’re filtering on any of the nested JSON fields, index them directly.
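
For what it’s worth, the table and column names below are made up since I don’t know your schema, but the indexes would look something like this:

    -- index a column you join or filter on
    CREATE INDEX videos_owner_id_idx ON videos (owner_id);

    -- expression index on a key inside a JSON(B) column you filter on
    CREATE INDEX videos_metadata_device_idx ON videos ((metadata ->> 'device_id'));

A query filtering on metadata ->> 'device_id' can then use the second index instead of scanning (and detoasting) the whole table.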

This alone may be sufficient to fix your perf problems.

If it isn’t, then here are some tips for the blobs.

The JSON blobs are likely already being stored in TOAST storage, so moving them to a new table might help (e.g. if you’re blindly selecting all the columns on the table) but won’t do much if you actually need to return the JSON with every query.

If you don’t need to index into the JSON, I’d consider storing the blobs in a blob store (like S3). There are trade-offs here, such as your API layer needing to read from multiple data sources, but you’ll get some nice scaling benefits and your DB will just need to store a reference to the blob.

If your JSON blobs have a schema that you control, deprecate the blobs and break them out into explicit tables with explicit types and a proper normalized schema. Once you’ve got a properly normalized schema, you can opt-in to denormalization as needed (leveraging triggers to invalidate and update them, if needed), but I’m betting you won’t need to do any denorm’ing if you have the correct indexes here.

And since you have an API layer, ideally you’ve also considered a caching layer in front of your DB calls, if you don’t have one already.


This is super interesting stuff.

First of all, I think the caching layer (which we currently don’t have) is going to be a necessity in the coming weeks as we scale for an additional project (that will be relying on this architecture).

Second of all, it is just PK lookups. We don’t actually have a single FK (the contractor did not set up any relations), which makes me think moving all of this replicated JSON data out of fields and into tables may help.

The queries that are currently causing issues are not filtering out any data but returning entire records. In ORM terms, it is Video.objects.all(), with a URL param on our GET endpoint limiting the number of entries returned. What’s interesting is that this latency scales linearly, and at the point where we ask for ~50 records we hit the maximum raw memory allocation for PG (1GB), causing the entire app to crash.
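
If I understand the ORM right, what it runs boils down to something like the first query below (column names are approximate), which drags the full blobs along for every row; deferring the JSON columns (.defer() in Django, I believe) would produce something closer to the second:

    -- roughly what Video.objects.all()[:50] fetches: every column, blobs included
    SELECT * FROM videos LIMIT 50;

    -- versus pulling only the narrow columns and leaving the blobs in TOAST
    SELECT id, video_id, title FROM videos LIMIT 50;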

The solution you propose for an S3 blob store is enormously fascinating. The one thing I’d mention is that these JSON fields on the Video table have a defined schema that is replicated for each Video record (this is video/sensor metadata, including stuff like GPS coords, temperature, and a lot more).

So retrieving a Video record will retrieve those JSON fields, but not just the values: the entire nested blob. And it does so for each and every record if we are fetching more than one.

Would defining this schema with something like Marshmallow/JSON-Schema be a good idea when you mention JSON schemas we control? As well as explicitly migrating those JSON fields to their own tables, replaced with an FK on the Video table?


I do want to emphasize that the S3 approach has a lot of trade-offs worth considering. There is something really nice about having all of your data in one place (transactions, backups, indexing, etc... all become trivial), and you lose that with the S3 approach. BUT in a lot of cases, splitting out blobs is fine. Just treat them as immutable, and write them to S3 before committing your DB transaction to help ensure consistency.

Regarding JSON schema, if you have a Marshmallow schema or similar, yes that’s a wonderful starting point. This should map pretty closely to your DB schema (but may not be 1-to-1, as not every field in your DB will be needed in your API).

I’d suggest avoiding storing JSON at all in the DB unless you’re storing JSON that you don’t control.

For example, if the JSON you’re storing today has a nested object of GPS coords, temperature, etc., make that an explicit table (or tables) as needed. The benefits are many: indexing the data becomes easier, the data is stored more efficiently so the table takes up less storage, the columns are validated for you, you can choose to return a subset of the data, and so on. You will not regret it.
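
As a rough sketch (the names and types here are guesses based on what you’ve described, and it assumes video_id has a unique constraint on the Videos table):

    -- one row per sensor reading instead of a nested blob per video;
    -- the video_id type should match videos.video_id
    CREATE TABLE video_sensor_readings (
        id          bigserial PRIMARY KEY,
        video_id    varchar NOT NULL REFERENCES videos (video_id),
        recorded_at timestamptz NOT NULL,
        latitude    double precision,
        longitude   double precision,
        temperature real
    );

    CREATE INDEX video_sensor_readings_video_id_idx ON video_sensor_readings (video_id);

Fetching a video’s metadata then becomes an indexed join that returns only the columns you ask for, rather than detoasting the whole blob.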


Unrelated to the post, but as you seem well informed in the field: would you agree that if a schema is not likely to change and is controlled, as you put it, there is no reason to attempt to store that data as a denormalized document?

Or at least, as you suggest, if it’s required for performance, the data would additionally be stored denormalized, materialized / document-ized where needed?

At my current company there seems to be a belief that everything should be moved to Mongo / Cosmos (as a document store) for performance reasons and moved away from SQL Server. But really I think the issue is that the code uses an in-house ORM that requires code generation for schema changes and probably generates less-than-ideal queries.

But then I am also aware of the ease of horizontal scaling with the more NoSQL-oriented products, and I’m trying to be aware of my bias as someone who did not write the original code base.


> would you agree that if a schema is not likely to change and is controlled, as you put it, there is no reason to attempt to store that data as a denormalized document

As a general rule of thumb, yes. Starting with denormalization often opens you up to all sorts of data consistency issues and data anomalies.

I like how the first sentence of the Wikipedia page on denormalization frames it (https://en.wikipedia.org/wiki/Denormalization):

> Denormalization is a strategy used on a previously-normalized database to increase performance.

The nice thing about starting with a normalized schema and then materializing denormalized views from it is that you always have a reliable source of truth to fall back on (and you'll appreciate that, on a long enough timeline).

You also tend to get better data validation, reference consistency, type checking, and data compactness with a lot less effort. That is, it comes built into the DB rather than introducing some additional framework or serialization library into your application layer.

I guess it's worth noting that denormalized data and document-oriented data aren't strictly the same, but they tend to be used in similar contexts with similar patterns and trade-offs (you could, however, have normalized data stored as documents).

Typically I suggest you start by caching your API responses. Possibly breaking up one API response into multiple cache entries, along what would be document boundaries. Denormalized documents are, in a certain lens, basically cache entries with an infinite TTL... so it's good to just start by thinking of it as a cache. And if you give them a TTL, then at least when you get inconsistencies, or need to make a massive migration, you just have to wait a little bit and the data corrects itself for "free".

Also, there are really great horizontally scalable caching solutions out there and they have very simple interfaces.


Thanks for your response. The comparison between infinite-TTL cache entries and a denormalized doc is an insight I can't say I've had before, and it makes intuitive sense.


Doesn’t Postgres have a way to index JSONB if needed?


You can index on fields in JSONB, but I don’t believe that’s what the OP is solving for here.
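
For reference, assuming the column is (or gets converted to) jsonb and is named something like metadata, it looks roughly like this:

    -- GIN index over the whole document; supports containment queries
    CREATE INDEX videos_metadata_gin ON videos USING GIN (metadata);

    -- queries of this shape can then use the index
    SELECT video_id FROM videos WHERE metadata @> '{"camera": "front"}';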

In either scenario, I’d still generally encourage avoiding storing JSON(B) unless there isn’t a better alternative. There are a lot of maintenance, size, I/O, and validation disadvantages to using JSON in the DB.


IMO, the JSON data type should be an intermediary step in a relational DB's structure, never the final one. Once you know and have stable columns, unravel the JSON into proper cols with indexing; it should improve the situation.

If you're having issues with 5GB, you will face exponential problems as it grows, due to the lack of indexing.


Cheers for the response (and affirmation). After some latency profiling I am convinced proper cols with indexing will vastly improve our situation since the queries themselves are very simple.


Depending on how much of the data in your JSON payload is required, extract that data into its own tables/cols, and store the full payload in a file system / cloud storage.


Also, there's a way to profile which queries take longest via the DB itself, and then you can just run EXPLAIN ANALYZE to figure out what's wrong.
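
In Postgres that's roughly the pg_stat_statements extension (if it's enabled) plus EXPLAIN ANALYZE on the worst offenders, something like this (column names vary a bit between versions):

    -- needs the pg_stat_statements extension
    SELECT query, calls, mean_exec_time, total_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;

    -- then dig into one suspect query
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM videos LIMIT 50;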


You can take an incremental approach and do a proof of concept with the data that you have, so you can justify your move too!


I don't think the latency issues are necessarily related to the poor schema. I'd say to dig into the query planning for your current queries and figure out what's actually slow, since it may not be what you expect.

Rearchitecting the schema might be worth doing. From the technical side, PG is pretty nice about doing transactional schema changes. I'd be more worried about the data, though. Are you sure that every single row's JSON columns have the keys and value types that you expect? Usually in this type of database, some records will be weird in unexpected ways. You'll need to find and account for them all before you can migrate over to a stricter schema. And do any of them have extra, unexpected data?
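
One way to get a feel for that before migrating (assuming the blob column is jsonb and called something like metadata; cast it with ::jsonb if it's plain json) is to survey the keys and value types across all rows:

    -- which top-level keys exist, what types they hold, and how often
    SELECT key, jsonb_typeof(metadata -> key) AS value_type, count(*)
    FROM videos, LATERAL jsonb_object_keys(metadata) AS key
    GROUP BY key, value_type
    ORDER BY key, value_type;

Any key that shows up under more than one type, or fewer times than you have rows, points at records you'll need to handle before moving to a stricter schema.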


I had to migrate a MongoDB database to PG at my new job (the old contractor created the MVP; I was hired to be the "CTO" of this new startup) and I had some problems at first, but after I created the related models and added indexes, everything worked fine.

As someone said, indexes are the best way to do lookups. Remember that your DB engine does lookups internally even when you are not aware of it (joins, for example), so add indexes to join fields.

Another thing that worked for me (and I don't know if it's your case) was to add trigram text indexes, which make full-text search faster. Remember, anyway, that adding an index makes searches faster but inserts slower, so be careful if you are inserting a lot of data.
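
In case it's useful, the trigram setup was roughly this (the column name is just an example):

    -- pg_trgm speeds up LIKE/ILIKE and similarity searches
    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE INDEX videos_title_trgm_idx ON videos USING GIN (title gin_trgm_ops);

    -- queries like this can then use the index
    SELECT video_id FROM videos WHERE title ILIKE '%sunset%';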


Other tips:

- Change the field type from JSON to JSONB (better storage and the rest) https://www.postgresql.org/docs/13/datatype-json.html

- Learn the built-in JSON functions and see if one of them can replace one you made ad hoc

- Seriously, replace JSON with normal tables for the most common stuff. That alone will speed things up massively. Maybe keep the old JSON around just in case, but remove it once it's no longer needed(?)

- Use views. Views allow you to abstract over your database and change the internals later (there's a rough sketch after this list)

- If a big thing is searching, and that searching is kind of complex/flexible, add FTS with proper indexing to your JSON, then use it as a first filter layer:

    SELECT .. FROM table WHERE id IN (SELECT .. FROM search_table WHERE FTS_query) AND ...other filters
This speeds things up beautifully! (I get sub-second queries!)

- If your queries do heavy calculations and your query planner shows it, consider moving them into a trigger and writing the computed result into a table, then query that table instead. I need to do loan calculations that require sub-second answers, and this is how I solve it.
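
For the views point above, a minimal sketch (the field names are invented, adapt them to your JSON):

    -- the app reads from the view, not from the raw JSON column
    CREATE VIEW video_summary AS
    SELECT video_id,
           metadata ->> 'title'                            AS title,
           (metadata -> 'gps' ->> 'lat')::double precision AS latitude,
           (metadata -> 'gps' ->> 'lon')::double precision AS longitude
    FROM videos;

When you later move those fields into real columns or tables, you only redefine the view and the queries on top of it keep working.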

And for your query planner investigation, this handy tool is great:

https://tatiyants.com/pev/#/plans


Questions first:

1. What are the CRUD patterns for the "blobby data"?
2. What are the read patterns, and how much data needs to be read?

Until the read/write patterns are properly understood, the following should be considered general guidelines only.

If staying in PG: JSON can be indexed in Postgres. You could also support a hybrid JSON/relational model, giving the best of both worlds.

Read:

Create views into the JSON schema that model your READ access patterns and expose them as IMMUTABLE relational entities. (Clearly they should be as lightweight as possible.)

Modify:

You can split the JSON blobs into their own skinny tables. This should keep your current semantics and facilitate faster targeted updating.
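
A rough sketch of that split, with hypothetical names (it also assumes video_id has a unique constraint on the main table and that the types match):

    -- keep the wide payload in a 1:1 side table so the main table stays narrow
    CREATE TABLE video_metadata (
        video_id varchar PRIMARY KEY REFERENCES videos (video_id),
        payload  jsonb NOT NULL
    );

    -- join only when the payload is actually needed
    SELECT v.video_id, m.payload
    FROM videos v
    JOIN video_metadata m USING (video_id)
    WHERE v.video_id = 'abc123';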

Big blobby resources such as video/audio should be managed as resources and not junk up your DB

Warning:

Abstracting the model into multiple tables may cause its own issues, depending on how your ORM maps your entities.

Outside the Box Thinking:

- Extract and transform the data for optimized reading.
- Move to MongoDB or a key-value store.

Conclusion:

What are the update patterns?
- Is only one field being updated?
- What are the inter-dependencies of the data being updated?

How are "update anomalies" minimized?

You will need to create a migration strategy toward a more optimal solution, and you would do well to start abstracting with views. As the data model is improved this will be a continuous process, and the model can be "optimized" without disturbing the core infrastructure or requiring rewrites.


I had this issue at a previous job where we would query an API (AWS, actually) and store the entire response payload. As we started out we would query into the JSONB fields using the JSON operators; however, at some point we started to run into performance issues and ended up "lifting" the data we cared about into columns on the same record that stored the JSON.
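
On a recent Postgres (12+) that lifting can even be done with stored generated columns; a hedged sketch with made-up names, since I don't remember our exact schema:

    -- lift a frequently-queried key out of the JSONB payload into a real, indexable column
    ALTER TABLE api_responses
        ADD COLUMN instance_id text GENERATED ALWAYS AS (payload ->> 'instanceId') STORED;

    CREATE INDEX api_responses_instance_id_idx ON api_responses (instance_id);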


It's a bit hard to tell without some idea of the structure of the data, but my experience has been that storing blobs in the database is only a good idea if those objects are completely self-contained, i.e. entire files.

If you write a small program to check the integrity of your blobs, i.e. that the structure of the JSON didn't change over time, you may be able to infer a relational table schema that isolates the bits that really need to be blobs. Leaving it too long invites long-term compatibility issues if somebody changes the structure of your JSON objects.


I think your heart shouldn't quail at the thought of re-schemaing 5-6GB! I'm going to claim that the actual migration will be very quick.


This is an affirmation I’ve been longing to hear, lol!

I’ve already done the legwork, cloning the current prod DB locally and playing around with migrations, but the fear of applying anything potentially production-breaking is scary to a dev who has never had to work on a “critical” production system!


I would recommend setting up a staging app with a copy of the production database, testing a migration script there, then running the same script on production once you're confident.
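
Since Postgres DDL is transactional, the migration script can usually be wrapped in a single transaction so a failed run on staging (or production) rolls back cleanly; note that CREATE INDEX CONCURRENTLY is the exception and can't run inside one. A purely hypothetical example, assuming video_id already has a unique constraint on the Videos table:

    BEGIN;

    -- promote the informal video_id link on a side table into a real, indexed FK
    ALTER TABLE video_flags
        ADD CONSTRAINT video_flags_video_id_fkey
        FOREIGN KEY (video_id) REFERENCES videos (video_id);

    CREATE INDEX video_flags_video_id_idx ON video_flags (video_id);

    COMMIT;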


Large blobs are not the use case of relational databases - this is the starting point for any such discussion. I have 2 current projects where I am convincing the app builders (external companies, apps used industry-wide) to change this: keep the relational data in the database and take the blobs out. So far it's going better than expected.


I don't know about PG, but with MariaDB, a nice way to find bottlenecks is to run SHOW FULL PROCESSLIST in a loop and log the output. So you see which queries are actually taking up the most time on the production server.

If you post those queries here, we can probably give tips on how to improve the situation.


Interesting. I believe I noted a similar function in the Postgres docs I was scouring through Friday. I’ll give it a look and see what I can find.

Tangentially related for those who have experience, I am using Django-silk for latency profiling.


Also, never trust ORMs. They make it easier to query, but they do not always use/output the most optimized queries.


Examine the slow queries with the query planner; don’t spend a bunch of time re-architecting on a hunch until you know for sure why it’s slow!

An hour with the query planner can save you days or weeks of wasted work!



