Show HN: Better SQL group by date (ankane.github.io)
60 points by ankane on July 9, 2013 | 44 comments


If you know that you need to be grouping by dates... then it's usually best to store the date fragments in separate columns so you can take advantage of indexes.

The project is pretty cool, but I don't think it's worth adding the dependency lock-in for the functions.

Simply put, if you know ahead of time that you want to GROUP BY date, then you should create a new column for each interval you need: `week`, `day`, `hour`.

That way you will have fast queries...
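
For example (a minimal sketch; the `events` table and column names are made up):

  ALTER TABLE events
    ADD COLUMN created_day DATE,
    ADD COLUMN created_week DATE,
    ADD COLUMN created_hour DATETIME;
  CREATE INDEX idx_events_created_day ON events (created_day);
  -- populate the columns on insert (or with a trigger), then:
  SELECT created_day, COUNT(*) FROM events GROUP BY created_day;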


You don't need that if you use postgresql:

    create index foo1 on t1 (date(timestamp));
    create index foo2 on t2 (extract(week from timestamp));
Postgres can build an index on almost any row expression.


These are the cool practical reasons to switch from MySQL to Postgres that I so rarely see (usually the reasons are more ad hominem and handwavy).


Really? There are loads of practical arguments: The incredibly rich type system (hello array/schemaless columns), indexes (partial indexes, several different types to choose from), language integration (write index functions in javascript/python/ruby/c), foreign data wrappers (connect pg to mysql/mongodb/etc) to name but a few.
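
A quick sketch of two of these (table and column names are made up):

  -- an array column plus a partial index, both standard Postgres
  CREATE TABLE events (id serial, tags text[], deleted_at timestamptz);
  CREATE INDEX ON events (id) WHERE deleted_at IS NULL;
  SELECT * FROM events WHERE tags @> ARRAY['billing'];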


I don't doubt they exist, it's just in the comments on HN, usually the only thing that's touted is that "it's not MySQL" in that it doesn't have the quirks MySQL does (in regards to silently munging data types, etc). Sometimes performance/scalability is also discussed.


A common alternative to this for data warehouses and reporting platforms is to construct a date table in advance, with every date enumerated from, say, 1990 through 2050. Every row represents one day, with day of week, weeks, months, and quarters as columns all set in advance. This also allows for named weeks and months, for example 2013Q3, 2013W30, and 2013M07. I also tend to put in the starting and end dates for the week in each row. It's also useful for left joining against if you need a report to show a 0 for days where you have no data.
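
A rough sketch (Postgres-flavored; all names are illustrative):

  CREATE TABLE dim_date (
    date_key      date PRIMARY KEY,
    day_of_week   int,
    week_label    text,  -- e.g. '2013W30'
    month_label   text,  -- e.g. '2013M07'
    quarter_label text,  -- e.g. '2013Q3'
    week_start    date,
    week_end      date
  );
  -- left join against it to get a 0 for days with no data
  SELECT d.date_key, COUNT(e.id)
  FROM dim_date d
  LEFT JOIN events e ON e.created_at::date = d.date_key
  WHERE d.date_key BETWEEN '2013-07-01' AND '2013-07-31'
  GROUP BY d.date_key
  ORDER BY d.date_key;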


For the interest of persons here assembled, this is known as "dimensional modelling". Done properly, it makes querying tremendously simpler and more powerful for end users.


This is the approach recommended by Ralph Kimball for data warehouses. I haven't tried it for OLTP systems. I'd be more inclined to go with ad hoc functional indexes (as recommended by another commenter) based on what queries the system actually runs. But if you have a reporting interface that permits arbitrary group-by functionality, like in a data warehouse, I think it's the way to go.

EDIT: Technically chaz is right and a data warehouse puts all these date columns in a separate table you can join against.


If you just need simple subdivisions of time, like

  group by date(col)
or

  group by date(col), hour(col), floor(minute(col)/5)
an index on col is enough, or at least it was the last time I checked for my specific queries. If you need something like group by hour-of-day, you'll need to create an indexed column, of course. (This is with MariaDB, but I think basic MySQL handles it too.)


Update: Upon testing, it's more subtle than that.

MySQL will still use the column for a covering index and run the query in a single pass, but it will still generate a temporary table for the results (and do a really slow sort unless you add 'order by null'). It doesn't look like MySQL has any way of doing date grouping without temporaries, even in the trivial cases. I guess eliminating the sort was enough for my queries.

Using date()-like functions, date_format(), and left() all have identical query plans and roughly comparable performance.
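
In case it saves anyone a lookup, the 'order by null' variant looks like this (MySQL/MariaDB; table name is made up):

  SELECT DATE(col), COUNT(*)
  FROM t1
  GROUP BY DATE(col)
  ORDER BY NULL;  -- skips the implicit sort on the grouped result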


I would suggest denormalizing only after you find a performance bottleneck. As they say, "premature optimization is the root of all evil".


Pretty cool! I can see this being really useful.

Beware, however: while there are certain cases where MySQL can utilize indexes in a GROUP BY clause if values are constant [1], MySQL definitely won't be able to use indexes with any of these functions (I'm not very familiar with Postgres, so somebody else would have to comment on that). If your load is relatively light or your table is small enough, you'd probably be fine. Otherwise, I would either 1) set another column to the value you want when inserting the row, or 2) have a cron run in the background to fill in values on another column, something like:

  UPDATE tbl SET creation_week = gd_week(creation_timestamp) WHERE creation_week IS NULL;

[1] http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization...


Postgres can use indexes with these functions.

If the functions are marked stable or immutable, then the function bodies will be inlined directly into the query.
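
For example (a sketch; the table and the 'UTC' argument are made up), if gd_day is marked IMMUTABLE you can even index the expression, and the planner can then consider that index for the grouped query:

  CREATE INDEX ON events (gd_day(created_at, 'UTC'));
  SELECT gd_day(created_at, 'UTC') AS day, COUNT(*)
  FROM events
  GROUP BY 1;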


A common grouping needed by sysadmins is to group by 5 minute intervals. That query may be more likely to trip up a junior DBA than most of these. Could be useful to add.

Example use case: http://www.techtalkz.com/microsoft-sql-server/170861-groupin...


Do you happen to know the most elegant solution to this problem?

Convert the date to seconds since some reference date X (or unix timestamp / 60), divide by the number of minutes you need, and convert to int.
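
In MySQL terms, 5-minute buckets would look something like this (table and column names are made up):

  SELECT FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(ts) / 300) * 300) AS bucket,
         COUNT(*)
  FROM logs
  GROUP BY bucket;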


Never use raw numbers to represent slices of time; there are just too many corner cases that can throw your calculations off.

Use the INTERVAL data type, it handles the logic properly.
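
For the 5-minute case in Postgres, a sketch with interval arithmetic (names are made up):

  SELECT DATE_TRUNC('hour', ts)
         + (EXTRACT(MINUTE FROM ts)::int / 5) * INTERVAL '5 minutes' AS bucket,
         COUNT(*)
  FROM logs
  GROUP BY bucket;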


Postgres has `EXTRACT(EPOCH FROM date_col)`, which might not be where you'd think to find that functionality.


Good suggestion!


The postgres functions should be marked IMMUTABLE.

http://www.postgresql.org/docs/9.2/static/xfunc-volatility.h...

This allows the planner to properly optimize queries that use the functions, instead of treating them like a black box.

Edit: STABLE would indeed be better, forgot about timezones.


Not really; the ones calling down to date_trunc should be marked STABLE, since the underlying function is STABLE.

  psql=$ \df+ date_trunc
     Schema   |    Name    |      Result data type       |        Argument data types        |  Type  | Volatility |  Owner   | Language |    Source code    |                     Description                      
  ------------+------------+-----------------------------+-----------------------------------+--------+------------+----------+----------+-------------------+------------------------------------------------------
   pg_catalog | date_trunc | interval                    | text, interval                    | normal | immutable  | postgres | internal | interval_trunc    | truncate interval to specified units
   pg_catalog | date_trunc | timestamp without time zone | text, timestamp without time zone | normal | immutable  | postgres | internal | timestamp_trunc   | truncate timestamp to specified units
   pg_catalog | date_trunc | timestamp with time zone    | text, timestamp with time zone    | normal | stable     | postgres | internal | timestamptz_trunc | truncate timestamp with time zone to specified units


No, the code casts to "timestamp without time zone" before running date_trunc so it is immutable.

For example:

  CREATE OR REPLACE FUNCTION gd_day(timestamptz, text)
    RETURNS timestamptz AS
  $$
    SELECT DATE_TRUNC('day', $1 AT TIME ZONE $2) AT TIME ZONE $2;
  $$
    LANGUAGE SQL;
EDIT: The reason that some time functions in PostgreSQL are not immutable is that they are affected by the current time zone setting of the session.
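
If I'm reading it right, the definition could safely carry the marking, i.e. the same function as above with the volatility keyword appended:

  CREATE OR REPLACE FUNCTION gd_day(timestamptz, text)
    RETURNS timestamptz AS
  $$
    SELECT DATE_TRUNC('day', $1 AT TIME ZONE $2) AT TIME ZONE $2;
  $$
    LANGUAGE SQL IMMUTABLE;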


Awesome, added this to the todo list.


If you have an indexed DATE or DATETIME field in MySQL, you really want to use the date and time functions to do your grouping.

This seems to add unnecessary complexity and overhead.

There is already plenty of built-in abstraction for date manipulation in MySQL, and you really don't want to be dependent on any stored procedure or function if you can avoid it.
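
For example (a sketch; the orders table is made up), the built-ins cover the common cases:

  SELECT DATE(created_at) AS d, COUNT(*) FROM orders GROUP BY d;
  SELECT YEARWEEK(created_at) AS w, COUNT(*) FROM orders GROUP BY w;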

http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functio...

I can't speak for postgres though. It might be different.


You can actually do really fun (and fast!) things with dates by mapping them to geospatial types.

For an example see: http://explainextended.com/2009/08/11/efficient-date-range-q...

You can easily extend this map into more dimensions depending on your common use cases and the capabilities of your DB's spatial library.


Looks simple and useful. My only worry is that the installation instructions pipe a curl request into mysql as the root user...

Thanks for sharing!


curl | sh is a very transparent installer. Here's the source. https://raw.github.com/ankane/groupdate.sql/master/mysql.sql

Also, you don't have to use the root user - I'll make that more clear. Thanks for the feedback!


Did you access that URL with curl to make sure it does no user-agent sniffing?

How do you know the script is the same on subsequent requests?


It's an https request to GitHub, which has about a million more subtle ways to own your machine if they really wanted to.

Figuring out how to solve the TOCTOU problem for a small script in a source control repo should not be difficult for anyone actually qualified to tell whether a script is evil by looking at it.


You can always `curl`, then `cat | sh` if you don't trust me :)


I'm not sure why this is necessary. It's a very light wrapper around existing SQL functions that adds additional dependencies and makes it more difficult for developers who already know SQL.

Maybe the use case is for installable packages that would be used with different backend databases... but in general, it seems like a bad idea.


Thanks for the feedback. Working correctly with time zones in the database can be tricky.

  DATE_TRUNC('day', $1 AT TIME ZONE $2) AT TIME ZONE $2

or

  CONVERT_TZ(DATE_FORMAT(CONVERT_TZ(ts, '+00:00', time_zone), '%Y-%m-%d 00:00:00'), time_zone, '+00:00')

are not very intuitive (or aren't to me at least).


Simply use and store everything in UTC. If you ever need to do grouping by time in any other timezone, it would be trivial to do the conversion.

  DAYOFYEAR(CONVERT_TZ(my_datetime_field, '+00:00', '+02:00'))

Seems very intuitive to me.


I'm sure I'm a neophyte at this, but can someone explain to me why we would want to store functions in a database? It fragments the logic, adds another layer of complexity when troubleshooting, and adds another skill requirement when onboarding new talent.

I've recently looked at a project where there were a number of postgres functions floating around and it seemed terribly difficult to set breakpoints, profile, and troubleshoot.

Maybe I'm missing something?


I'd suggest it comes down to business logic vs data logic. Business logic rightly belongs in [insert your programming language] code.

Data logic, however, is worth thinking through. When you have a lot of calculations that span multiple tables, need fine-grained access control, need reusable result sets that benefit from being indexed, or need to query and package the same datasets for other systems that access them via ODBC, functions, stored procedures, and database views are there for a reason (versus mapping all of this through ORM objects, or pulling out raw data and transforming it in app code).

That some ORMs make it difficult to leverage this functionality does not mean there is no need for separate data logic, or for taking advantage of database capabilities. If all you're building is CRUD apps with a pretty face, then you likely don't need it. If, however, you're building complicated systems, calculating a result in the database, where it can be kept up to date, is far easier and more reusable than pulling out more raw data than you need and computing the result in app code. Moreover, ORMs often sour in the face of very complicated data requirements that are rather trivial to set up in a view, function, or stored procedure.

As far as onboarding is concerned, in my humble opinion, if you are bringing in new talent to deal with data who do not have a strong understanding of SQL, I'd be wary of far more significant problems developing from their unfamiliarity with what the ORM is doing behind the scenes (and with the performance tradeoffs that are occurring).

As far as troubleshooting is concerned, if you can confirm that the data is leaving the database as the data logic expects (and that is as simple as executing a query, function, sproc, or view), then you can at least narrow the search to where the offending business logic is located.


Some things can literally be thousands of times faster when run on the database. The DB can also be a good place to handle data integrity, thanks to things like rollbacks.

Don't forget that while DB scaling can be an issue, the overwhelming majority of projects have a single DB machine that sits mostly idle. (E.g., Slashdot can run just fine on a single DB, even if they keep a hot spare running.)


>but can someone explain to me why we would want to store functions in a database?

Because that is where they belong. A database is not a storage device; it is an application development environment. Everything about your data should be in your database, including the rules, restrictions, relationships, and ways of manipulating it. In this case, trying to do it in a layer above the database would mean either doing it horribly inefficiently by copying too much data to the app server and then grouping it there, or embedding the SQL into your app, which means you need to go find it and copy-paste it whenever you want to run it manually for testing/debugging/etc. Plus, when you end up with multiple applications using a database, do you really want to keep multiple copies of the same code in sync across multiple apps for no reason?


Here's a Ruby library (similar to his groupdate gem) I made, with SUM as an additional feature (but no time zones). MySQL and PostgreSQL are supported: https://github.com/jotto/ar_aggregate_counter/blob/master/li...


For those concerned about a SQL dependency, it might be useful to create projects in different languages that package these functions. These methods were extracted from https://github.com/ankane/groupdate, which you can use for Ruby + ActiveRecord

(edit: spelling)


Very cool. Thank you for sharing!

(Now we just need one for Oracle, DB2, and Sybase...)


Do you really use all those databases?


The number of commonly used database engines isn't small either:

Oracle, MySQL, PostgreSQL, SQLite, MS-SQL, and Access.

Six in total.


PostgreSQL and MySQL? Where is the Hive version? :)


Just a pull request away :)


very very nice.



