This system contains measurements and state of physical devices (time series). It’s designed for both heavy writes and heavy reads, with a slight emphasis on writes.
Each table is one type of device and contains 1 to 5 different measurements/states. But here’s the trick: because data is queried with a minimum bucket size of 15 minutes, I figured we could just create a column for each measurement + quarter of the day (e.g. measure0000, measure0015), so that’s 100 columns for each measurement (96 quarters + 4 for DST), include the date in the key, et voilà: excellent write performance (because it’s mainly UPDATE queries) and good read performance.
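A minimal sketch of what that might look like (illustrative names and types, not the actual schema), with the write path being an upsert that just fills in the next quarter:

    CREATE TABLE device_meter (
        device_id    bigint NOT NULL,
        day          date   NOT NULL,
        -- one column per 15-minute bucket of the day
        measure0000  double precision,
        measure0015  double precision,
        measure0030  double precision,
        -- ... generated up to measure2345, plus a few spare columns for the long DST day
        PRIMARY KEY (device_id, day)
    );

    -- most writes just fill in the next quarter on an existing (device, day) row
    INSERT INTO device_meter (device_id, day, measure0015)
    VALUES (42, '2024-03-31', 17.3)
    ON CONFLICT (device_id, day)
    DO UPDATE SET measure0015 = EXCLUDED.measure0015;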
Okay, the queries to make sense of the data aren’t pretty, but they can be generated.
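For example, a generated read query can unpivot the quarter columns back into (timestamp, value) rows, roughly like this (column list abbreviated, table name from the sketch above):

    SELECT m.device_id,
           m.day + q.t AS bucket_ts,
           q.v         AS value
    FROM   device_meter m
    CROSS  JOIN LATERAL (VALUES
             (time '00:00', m.measure0000),
             (time '00:15', m.measure0015),
             (time '00:30', m.measure0030)
             -- ... the generator emits all 96 quarters here
           ) AS q(t, v)
    WHERE  m.device_id = 42
      AND  m.day BETWEEN '2024-03-01' AND '2024-03-31';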
I find it really cool how effective this is for time-series data without Postgres extensions (we’re on RDS).
One interesting thing your team may want to look into (if you haven't already) is compression.
Of course there are a multitude of variables we don't have access to from the outside, but Postgres only compresses data that is TOASTed, and based on your description of the table, the data is not being TOASTed (and therefore not being compressed).
Instead, if you could somehow pack your timeseries entries into an array, you would get the benefits of compression automatically.
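A rough sketch of the packed layout I mean, assuming one row per device and day with a float8 array of the 96 quarters (table and column names are made up):

    CREATE TABLE device_meter_packed (
        device_id  bigint NOT NULL,
        day        date   NOT NULL,
        readings   double precision[],  -- elements 1..96 = 15-minute buckets
        PRIMARY KEY (device_id, day)
    );

    -- if I remember correctly, compression is only attempted once the row exceeds
    -- toast_tuple_target, so with ~800 bytes of float8 you may need to lower it
    ALTER TABLE device_meter_packed SET (toast_tuple_target = 256);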
Given your write performance requirements, using an array may be out of the question (and you may get too much overhead from dead tuples) -- but who knows? It's always a good idea to benchmark.
I actually considered mentioning this in the post but figured it was too long already and could be the material for a future one :)
The updates wouldn’t be in place anyway because of the Postgres MVCC approach, right?
I can see the advantage in terms of just needing a single tuple for reads. So a timestamp + value model would likely take twice as much heap space as your approach?
Given that you’re probably always just inserting new data, you could use a BRIN index to get fast reads on the date ranges. Would be interesting to see it in action and play around to see the tradeoffs. The model you’ve settled on sounds like it would be a pain to query.
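Something like this (reusing the sketch table from above; whether the physical row order stays correlated with the date is something you'd have to check):

    -- BRIN indexes are tiny because they only store min/max per block range
    CREATE INDEX device_meter_day_brin ON device_meter USING brin (day);

    -- date-range scans can then skip most of the heap
    SELECT * FROM device_meter
    WHERE day BETWEEN '2024-03-01' AND '2024-03-31';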
Fascinating. How reliable is the measurement apparatus? Or to put it another way, how tolerant of missed measurements is this scheme? I’ve been in a similar situation in a past life and we took a different approach. I was never satisfied with how we ended up doing it though.
I see your point. Some of our measurements are counters, and we interpolate missing data via certain rules. We store these interpolations in a separate table with exactly the same format and join the two tables directly when querying the data. Since the real value always takes precedence over the interpolated value, this is just a COALESCE(realmeasure0000, interpolation0000) expression.
This is super fast when taking advantage of Postgres' partition-wise joins.
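Roughly, with simplified table names and only the first two quarters spelled out:

    -- both tables are partitioned the same way, so the planner can join matching
    -- partitions pair-wise when this is enabled
    SET enable_partitionwise_join = on;

    SELECT r.device_id,
           r.day,
           COALESCE(r.realmeasure0000, i.interpolation0000) AS measure0000,
           COALESCE(r.realmeasure0015, i.interpolation0015) AS measure0015
           -- ... one COALESCE per quarter, generated
    FROM   real_measurements r
    LEFT   JOIN interpolations i USING (device_id, day)
    WHERE  r.day = '2024-03-31';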
I assume you looked at the various Postgres functions, like width_bucket()?
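e.g. in a conventional (timestamp, value) layout, something like this does the 15-minute bucketing (hypothetical readings table, just to illustrate):

    -- hypothetical readings(device_id, ts, value) table, one row per measurement
    SELECT device_id,
           width_bucket(extract(epoch FROM ts),
                        extract(epoch FROM timestamp '2024-03-31 00:00'),
                        extract(epoch FROM timestamp '2024-04-01 00:00'),
                        96) AS quarter_of_day,
           avg(value) AS avg_value
    FROM   readings
    WHERE  ts >= '2024-03-31' AND ts < '2024-04-01'
    GROUP  BY device_id, quarter_of_day;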
I’ve never run a PG-based TSDB so I’m reluctant to pass judgment one way or the other, but as I hope you can understand, “we have hundreds of columns in a table” is alarming at first glance.
The schema is organized around how we need to query it, which is mostly by fetching all measurements for an entire day or month. Querying statistics on those metrics is not often needed, maybe a few times per week or day, so we do that in Snowflake if necessary. Or we create a specialized streaming data flow when the stats are queried more often.