If you know that you need to be grouping by dates... then it's usually best to store the date fragments in a separate column so you can take advantage of indexes.
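For example, a minimal sketch of that approach in Postgres (table and column names are invented for illustration):

-- add an indexed "day" column derived from the timestamp
ALTER TABLE events ADD COLUMN created_day date;
UPDATE events SET created_day = created_at::date;  -- backfill existing rows
CREATE INDEX events_created_day_idx ON events (created_day);

-- grouping can then lean on the index
SELECT created_day, COUNT(*) FROM events GROUP BY created_day;

New rows would need created_day set on insert, whether from app code, a default, or a trigger.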
Really? There are loads of practical arguments: The incredibly rich type system (hello array/schemaless columns), indexes (partial indexes, several different types to choose from), language integration (write index functions in javascript/python/ruby/c), foreign data wrappers (connect pg to mysql/mongodb/etc) to name but a few.
I don't doubt they exist; it's just that in the comments on HN, usually the only thing that's touted is that "it's not MySQL", i.e. it doesn't have the quirks MySQL does (with regard to silently munging data types, etc.). Sometimes performance/scalability is also discussed.
A common alternative to this for data warehouses and reporting platforms is to construct a date table in advance, with every date enumerated from, say, 1990 through 2050. Every row represents one day, with day of week, weeks, months, and quarters as columns all set in advance. This also allows for named weeks and months, for example 2013Q3, 2013W30, and 2013M07. I also tend to put in the starting and end dates for the week in each row. It's also useful for left joining against if you need a report to show a 0 for days where you have no data.
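A rough sketch of such a date table in Postgres (column names and the exact ranges are just an example):

CREATE TABLE dim_date AS
SELECT d::date                                           AS calendar_date,
       EXTRACT(dow FROM d)::int                          AS day_of_week,
       EXTRACT(week FROM d)::int                         AS week_of_year,
       EXTRACT(month FROM d)::int                        AS month_of_year,
       EXTRACT(quarter FROM d)::int                      AS quarter_of_year,
       TO_CHAR(d, 'YYYY"Q"Q')                            AS quarter_name,  -- e.g. 2013Q3
       TO_CHAR(d, 'IYYY"W"IW')                           AS week_name,     -- e.g. 2013W30
       TO_CHAR(d, 'YYYY"M"MM')                           AS month_name,    -- e.g. 2013M07
       DATE_TRUNC('week', d)::date                       AS week_start,
       (DATE_TRUNC('week', d) + INTERVAL '6 days')::date AS week_end
FROM generate_series('1990-01-01'::timestamp, '2050-12-31', '1 day') AS s(d);

Left joining fact tables against calendar_date then gives you the zero rows for days with no data.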
For the interest of persons here assembled, this is known as "dimensional modelling". Done properly, it makes querying tremendously simpler and more powerful for end users.
This is the approach recommended by Ralph Kimball for data warehouses. I haven't tried it for OLTP systems. I'd be more inclined to go with ad hoc functional indexes (as recommended by another commenter) based on what queries the system actually runs. But if you have a reporting interface that permits arbitrary group-by functionality, like in a data warehouse, I think it's the way to go.
EDIT: Technically chaz is right and a data warehouse puts all these date columns in a separate table you can join against.
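For what it's worth, a functional index along those lines in Postgres might look like this (assuming created_at is a plain timestamp without time zone, since the timestamptz variant of date_trunc isn't immutable, as discussed further down):

-- index on the expression used for grouping/filtering
CREATE INDEX events_day_trunc_idx ON events (DATE_TRUNC('day', created_at));

-- queries using the exact same expression can take advantage of it
SELECT DATE_TRUNC('day', created_at) AS day, COUNT(*)
FROM events
GROUP BY 1;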
If you just need simple subdivisions of time, like
group by date(col)
or
group by date(col), hour(col), floor(minute(col)/5)
an index on col is enough, or at least it was the last time I checked for my specific queries. If you need something like group by hour-of-day, you'll need to create an indexed column, of course. (This is with MariaDB, but I think basic MySQL handles it too.)
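For the hour-of-day case, a stored generated column is one way to get that indexed column (MySQL 5.7+ / MariaDB 10.2+ syntax; table and column names are made up):

ALTER TABLE metrics
  ADD COLUMN hour_of_day TINYINT AS (HOUR(created_at)) STORED,
  ADD INDEX idx_hour_of_day (hour_of_day);

SELECT hour_of_day, COUNT(*) FROM metrics GROUP BY hour_of_day;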
MySQL will still use the column for a covering index and run the query in a single pass, but it will still generate a temporary table for the results (and do a really slow sort unless you add 'order by null'). It doesn't look like MySQL has any way of doing date grouping without temporaries, even in the trivial cases. I guess eliminating the sort was enough for my queries.
date()-like functions, date_format(), and left() all produce identical query plans and roughly comparable performance.
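A quick sketch of the 'order by null' trick mentioned above (assuming a MySQL version old enough that GROUP BY implies sorting; table and column names invented):

SELECT DATE(created_at) AS day, COUNT(*)
FROM hits
GROUP BY DATE(created_at)
ORDER BY NULL;  -- suppress the implicit filesort that GROUP BY would otherwise add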
Beware, however: while there are certain cases where MySQL can use indexes for a GROUP BY clause if values are constant [1], MySQL definitely won't be able to use indexes with any of these functions (I'm not very familiar with Postgres, so somebody else would have to comment on that). If your load is relatively light or your table is small enough, you'd probably be fine; otherwise I would either 1) set another column to the value you want when inserting the row, or 2) have a cron job run in the background to fill in values on another column, something like update tbl set creation_week = gd_week(creation_timestamp) where creation_week is null.
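A sketch of option 1 as a MySQL trigger (assuming the creation_week column and the project's gd_week function already exist):

CREATE TRIGGER set_creation_week
BEFORE INSERT ON tbl
FOR EACH ROW
SET NEW.creation_week = gd_week(NEW.creation_timestamp);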
A common grouping needed by sysadmins is to group by 5 minute intervals. That query may be more likely to trip up a junior DBA than most of these. Could be useful to add.
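For reference, one common way to get 5-minute buckets in Postgres without a dedicated function (table and column names invented):

SELECT DATE_TRUNC('hour', created_at)
       + INTERVAL '5 minutes' * FLOOR(EXTRACT(minute FROM created_at) / 5) AS bucket,
       COUNT(*)
FROM requests
GROUP BY 1
ORDER BY 1;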
Not really, the ones calling down to date_trunc should be marked as STABLE, since the underlying function is STABLE.
psql=$ \df+ date_trunc
Schema | Name | Result data type | Argument data types | Type | Volatility | Owner | Language | Source code | Description
------------+------------+-----------------------------+-----------------------------------+--------+------------+----------+----------+-------------------+------------------------------------------------------
pg_catalog | date_trunc | interval | text, interval | normal | immutable | postgres | internal | interval_trunc | truncate interval to specified units
pg_catalog | date_trunc | timestamp without time zone | text, timestamp without time zone | normal | immutable | postgres | internal | timestamp_trunc | truncate timestamp to specified units
pg_catalog | date_trunc | timestamp with time zone | text, timestamp with time zone | normal | stable | postgres | internal | timestamptz_trunc | truncate timestamp with time zone to specified units
No, the code casts to "timestamp without time zone" before running date_trunc so it is immutable.
For example:
CREATE OR REPLACE FUNCTION gd_day(timestamptz, text)
RETURNS timestamptz AS
$$
SELECT DATE_TRUNC('day', $1 AT TIME ZONE $2) AT TIME ZONE $2;
$$
LANGUAGE SQL;
EDIT: The reason that some time functions in PostgreSQL are not immutable is that they are affected by the current time zone setting of the session.
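A quick illustration of why the timestamptz variant can only be STABLE: the same input gives different answers depending on the session's TimeZone setting.

SET TIME ZONE 'UTC';
SELECT DATE_TRUNC('day', TIMESTAMPTZ '2013-07-25 01:00:00+00');
-- 2013-07-25 00:00:00+00

SET TIME ZONE 'America/New_York';
SELECT DATE_TRUNC('day', TIMESTAMPTZ '2013-07-25 01:00:00+00');
-- 2013-07-24 00:00:00-04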
If you have an indexed DATE or DATETIME field in MySQL, you really want to use the date and time functions to do your grouping.
This seems to add unnecessary complexity and overhead.
There is already plenty of built-in abstraction for date manipulation in MySQL, and you really don't want to be dependent on any stored procedure or function if you can avoid it.
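For example, plain MySQL already handles common groupings directly (table and column names invented):

SELECT DATE_FORMAT(created_at, '%Y-%m') AS month, COUNT(*)
FROM orders
GROUP BY DATE_FORMAT(created_at, '%Y-%m');

-- or by calendar week
SELECT YEARWEEK(created_at) AS yw, COUNT(*)
FROM orders
GROUP BY YEARWEEK(created_at);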
It's an https request to GitHub, which has about a million more subtle ways to own your machine if they really wanted to.
Figuring out how to solve the TOCTOU problem for a small script in a source control repo should not be difficult for anyone actually qualified to tell whether a script is evil by looking at it.
I'm not sure why this is necessary. It's a very light wrapper around existing SQL functions that adds additional dependencies and makes it more difficult for developers who already know SQL.
Maybe the use case is for installable packages that would be used with different backend databases... but in general, it seems like a bad idea.
I'm sure I'm a neophyte at this, but can someone explain to me why we would want to store functions in a database? It fragments the logic, adds another layer of complexity when troubleshooting, and adds another skill requirement when on-boarding new talent.
I've recently looked at a project where there were a number of postgres functions floating around and it seemed terribly difficult to set breakpoints, profile, and troubleshoot.
I'd suggest it comes down to business logic vs data logic. Business logic rightly belongs in [insert your programming language] code.
Data logic, however, is worth thinking through. When you have a lot of calculations that span multiple tables, or need fine-grained access control, or need reusable result sets that benefit from being indexed and delivered to a user (versus mapping all of this through ORM objects, or pulling out raw data and transforming it in app code), or need to query and package the same datasets (or combinations thereof) to share with other systems that access them via ODBC, then functions, stored procedures, and database views are there for a reason.
That some ORMs make it difficult to leverage this functionality does not mean there is no need for separate data logic and for taking advantage of database capabilities. If all you're building is CRUD apps with a pretty face, then you likely don't need it. If, however, you're building pretty complicated systems, calculating a result in the database and keeping it up to date there is far easier and more reusable than pulling out more raw data than you need and computing it in the app. Moreover, ORMs often sour in the face of very complicated data requirements that are rather trivial to set up in a view, function, or stored procedure.
As far as onboarding is concerned, in my humble opinion, if you are bringing in new talent to deal with data who do not have a strong understanding of SQL, I'd be pretty wary of far more significant problems developing as a result of not being familiar with what the ORM is doing behind the scenes (and of what performance tradeoffs are occurring).
As far as troubleshooting is concerned, if you can confirm that the data is leaving the database as expected by data logic/code--and this is as simple as executing a query, function, sproc, or view--then you can at least limit yourself to discovering where the offending business logic is located.
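As a small illustration of data logic living in the database, a view like this (schema invented) can be queried identically from app code, reporting tools, or over ODBC:

CREATE VIEW customer_order_totals AS
SELECT c.id                      AS customer_id,
       c.name                    AS customer_name,
       COUNT(o.id)               AS order_count,
       COALESCE(SUM(o.total), 0) AS lifetime_total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name;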
Some things can literally be thousands of times faster when run on the database. The DB can also be a good place to handle data integrity, due to things like rollbacks.
Don't forget that, while DB scaling can be an issue, the overwhelming majority of projects have a single DB machine that sits mostly idle. (For example: Slashdot can run just fine with a single DB, even if they keep a hot spare running.)
>but can someone explain to me why we would want to store functions in a database?
Because that is where they belong. A database is not a storage device; it is an application development environment. Everything about your data should be in your database, including the rules, restrictions, relationships, and ways of manipulating it. In this case, trying to do it in a layer above the database would mean either doing it horribly inefficiently by copying too much data to the app server and then grouping it, or embedding the SQL into your app, which means you need to go and find it and copy+paste it whenever you want to run it manually for testing/debugging/etc. Plus, when you end up with multiple applications using a database, do you really want to be keeping multiple copies of the same code in sync across multiple apps for no reason?
For those concerned about a SQL dependency, it might be useful to create projects in different languages that package these functions. These methods were extracted from https://github.com/ankane/groupdate, which you can use with Ruby + ActiveRecord.
The project is pretty cool, but I don't think it's worth adding the dependency lock-in for the functions.
Simply, if you know ahead of time that you want GROUP BY date, then you should create a new column for each interval. `week`, `day`, `hour`.
That way you will have fast queries...