
There are some big flaws in window functions I'd love to see fixed. The main one being the very limited ways in which you can refer to the 'current' row in window and filter expressions. The second being... you're almost always better off just doing all this work in code, in a programming language with dependable performance characteristics.
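For example, to aggregate over rows selected by a predicate against the current row, a window FILTER clause won't do it - the FILTER predicate sees each aggregated row, not the row the result is being computed for - so you fall back to something like a correlated LATERAL subquery (a sketch, with a made-up events table):

    -- Hypothetical schema: events(user_id, created_at, amount).
    -- "Sum of this user's earlier events that were larger than this one":
    -- the inner predicate references the current row (e), which window
    -- frames and FILTER clauses can't express directly.
    SELECT e.*, s.bigger_prior_total
    FROM events e
    CROSS JOIN LATERAL (
        SELECT COALESCE(SUM(p.amount), 0) AS bigger_prior_total
        FROM events p
        WHERE p.user_id = e.user_id
          AND p.created_at < e.created_at
          AND p.amount > e.amount
    ) s;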


I don't know, I think most performance gained from pulling data out and processing it is generally an illusion. Windowing functions can have terrible performance, but that's pretty easy to determine and mitigate (indexing, multiple queries in a procedure). Pulling data down and 'doing something' with it starts getting into the world of round trips, serialization, concurrency models...

Plus there's the whole "my report wants to use the same logic as this web form" problem. Oh, it's in Java? Guess we'll just pull the data out, process it, populate a temp table, report on it, clear those values....

If you have business data in SQL, just write the logic in SQL.


I generally haven't had a problem with window function performance - I was mostly using MSSQL.

That said, the single biggest performance drain I've seen is people not using the database, and instead pulling data out to process in code. I don't think I've ever seen a case where it was faster to process data outside of the database - though there have also been plenty of times when the pertinent SQL had to be written "smarter." Think in sets, not iteratively! There's a terrifying amount of loops in a lot of legacy SQL statements.
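For example (names invented), the legacy shape versus the set-based shape of the same update:

    -- Iterative legacy shape: a cursor, one row at a time.
    DECLARE @id int;
    DECLARE c CURSOR FOR
        SELECT order_id FROM orders WHERE status = 'open';
    OPEN c;
    FETCH NEXT FROM c INTO @id;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE orders SET status = 'late'
        WHERE order_id = @id AND ship_date < GETDATE();
        FETCH NEXT FROM c INTO @id;
    END
    CLOSE c;
    DEALLOCATE c;

    -- Set-based shape: one statement, one pass, and the optimizer can help.
    UPDATE orders
    SET status = 'late'
    WHERE status = 'open' AND ship_date < GETDATE();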


If you ever want to see some interesting thoughts on this, look up "Toon Koppelaars thickdb". Granted, he works for Oracle, but I think his point is quite well articulated.

Even if you write optimized C code, you tend not to beat well-written DB code, because the IO completely overwhelms the processing time. And somewhat counterintuitively, moving processing outside the database tends to increase overall database CPU usage, because you tend to force more IO through the system.

That makes sense: IO costs more than most CPU processing, and if you calculate in the DB engine, you don't need to pull as much raw data out - you only need the results.
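Toy illustration (made-up sales table): both queries read the same pages server-side, but the first ships every matching row out of the database just so the client can add them up, while the second ships a single number:

    -- Client-side processing: N rows cross the wire.
    SELECT amount FROM sales WHERE sale_date >= '2018-01-01';

    -- In-engine processing: only the result crosses the wire.
    SELECT SUM(amount) FROM sales WHERE sale_date >= '2018-01-01';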

Basically I want a modern, cost-effective mainframe that uses a reasonable language for the data layer. The earlier suggestions of Python + Postgres or C# + MSSQL are nearer the mark, but never quite feel first class.


Another Toon fan here. He also co-wrote a book called Applied Mathematics for Database Professionals, which is superb. It's dry, like a textbook, but the clarity and depth with which they cover the underpinnings of how all this works, and why, are excellent.


I think we could be friends. As an appdev turned database/BI dev, there are a lot of intricacies of databases that are just not intuitive and take a while to get down right.

We're getting away from it, either by moving to new tech or because the tech is improving (adaptive query processing, auto-tuning indexes, writable columnstore).

Even when I was an app dev I said you live and die by your database, whether by its data model or by not putting processing in the right place. It's an architecture thing and something that (generally) has to be addressed early on.


I was given a report written in T-SQL that had multiple nested loops and a select statement inside the inner loop.

Rewriting it as one SQL query made it 4 orders of magnitude faster.
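It was roughly this shape (names invented): a loop over customers, a loop over their orders, and a SELECT per order, replaced by a single join plus GROUP BY:

    -- The whole nested-loop report collapses into one statement:
    SELECT c.customer_id,
           c.customer_name,
           SUM(oi.quantity * oi.unit_price) AS total_spend
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    JOIN order_items oi ON oi.order_id = o.order_id
    GROUP BY c.customer_id, c.customer_name;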


The point is you don't have business data yet; you need to create it... using business logic. All the data is in SQL, but for some reason you don't store value x (probably an aggregate of sorts), and the SQL environment doesn't have a good generalized syntax for logic.

I think the problem is maintenance rather than performance. When you really want performance (in data reads), you'll be precomputing and storing the values. Then your report can filter and be indexed properly.

I don't think any tech solution will give you fast reads and fast writes; you need to pick one.

Your server probably has a lot of down time... use it. Make a cron job and a separate reporting table. Reporting on raw data does not scale, and window functions do not scale.
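Sketch of the pattern (all names invented): a scheduled job rolls the raw rows up into a narrow, indexed reporting table during off-peak hours, and reports only ever touch that:

    -- Nightly rollup, run from cron during down time.
    -- :last_run / :this_run are bind parameters tracked by the job.
    INSERT INTO sales_report_daily (report_day, region, order_count, revenue)
    SELECT CAST(created_at AS date), region, COUNT(*), SUM(amount)
    FROM sales_raw
    WHERE created_at >= :last_run AND created_at < :this_run
    GROUP BY CAST(created_at AS date), region;

    -- Reports filter the small, indexed table instead of the raw data:
    SELECT * FROM sales_report_daily WHERE report_day >= '2018-01-01';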

Using window functions is equivalent to using a better CPU. Maybe the environment you work in is faster, but you haven't factored the logic to make the best use of the resources.

You can take a crap approach in Java and port it to SQL, but the approach is still crap: you should create a separate reporting process, or at least use Hadoop.


Rather than taking the data out of the database, how about plugging your programming language of choice into the database?

For Postgres, for example, you can write stored procedures using any of these: https://www.postgresql.org/docs/9.1/static/external-pl.html
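For example, with PL/Python installed (plpythonu in the 9.x docs linked above; plpython3u on newer versions), a function body is just Python - a minimal sketch:

    -- One-time setup (superuser): CREATE EXTENSION plpythonu;
    CREATE FUNCTION slugify(title text) RETURNS text AS $$
        import re
        return re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')
    $$ LANGUAGE plpythonu;

    SELECT slugify('Window Functions in SQL');  -- window-functions-in-sql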


Not positive about Postgres, but in MSSQL you still wind up copying the data and using additional resources; it's just handled for you. I'm a big fan of scripting in DBs becoming a thing (and a little scared), but it's not all roses (yet). You would also lose optimizer "fixes", right?


The PHP version looks like it hasn't been touched since 2011 and the Python link is dead. Doesn't give much confidence that this is a viable and supported pathway to Db <-> Application communication.


AWS RDS doesn't support any of the PL extensions for Postgres. And unfortunately, that's a real killer for a lot of people.

But when I've been able to use Python functions directly inside my SQL code, it is a thing of absolute beauty.


I agree, too little is done in SQL.

One good reason to avoid it and stored procedures is parallelism. But you need a lot of data to make that worthwhile, especially in these days of 3+TB RAM machines.


Access to the current row was improved in the SQL:2011 standard. See section "8.2 Navigation within a window" in this paper: https://sigmodrecord.org/publications/sigmodRecord/1203/pdfs...
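The navigation it covers there includes LAG, LEAD, FIRST_VALUE, LAST_VALUE and NTH_VALUE, which look like this (hypothetical readings table):

    -- Each row can look at its neighbours within the window:
    SELECT reading_time,
           temperature,
           temperature - LAG(temperature) OVER w AS delta_from_prev,
           LEAD(temperature) OVER w AS next_temp,
           FIRST_VALUE(temperature) OVER w AS first_of_day
    FROM sensor_readings
    WINDOW w AS (PARTITION BY CAST(reading_time AS date)
                 ORDER BY reading_time);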


This is great news, although I'm not sure it gives me the thing I _most_ want, which is just to be able to refer to the 'current' row within a filter's where clause.


Not always the case, especially with columnstore indexes and batch mode. Even traditionally, rownum/lead/lag and the other ordering functions are extremely handy along with aggregates. I can't actually think of many result-set/batch scenarios that would be better replicated in code, not counting the cost of extra delivery (every row for a sum) and possible miscalculations. Add in the ability to index views, and some other things for specific scenarios. Then again, I'm mostly MSSQL and don't know where other offerings might lack.

Also, the site is blocked for me currently...


I think often conversations about this are really people talking past each other. If you work in the enterprise, and spend tons of time dealing with 'business logic' and bespoke systems, using SQL seems preferable because many of your headaches come from systems being out of sync, or reports showing different values. Databases represent a way to guard against a lot of problems that tend to occur in that environment. It's only when you branch into other realms of development where it starts to feel really backwards and archaic.


But is getting correct results reliably really "backwards and archaic", or does it only seem so because it is not the latest shiny toy with the trendiest name? (edited: spelling)


SQL has dependable performance characteristics... What is not dependable about how window functions are computed?


Where clause pushdown through views, or worse, CTEs. Views with window functions querying views with window functions. Queries with multiple windows. I didn't make the statement at random: modern query planners are an absolute arse to predict, in a way that external code simply isn't. The moment you're doing anything really 'sequency' instead of just calculating some dumb running totals, window functions can become a nightmare. I made the statement so people don't automatically assume there's a friction-free story for doing general sequence processing in SQL. Not sure why this has offended people so much.
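Concrete version of the pushdown problem (made-up view):

    CREATE VIEW ranked_events AS
    SELECT user_id, created_at,
           ROW_NUMBER() OVER (PARTITION BY user_id
                              ORDER BY created_at DESC) AS rn
    FROM events;

    -- A filter on the PARTITION BY column can be pushed below the
    -- window function, so this stays cheap:
    SELECT * FROM ranked_events WHERE user_id = 42;

    -- This filter can never be pushed down: the engine has to rank
    -- the entire table just to keep one row per user.
    SELECT * FROM ranked_events WHERE rn = 1;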


If you have a small amount of data, yes; but if you have any significant amount of data, you are losing multiple orders of magnitude of performance when you pull all that data out of the database.


I lose orders of magnitude of dev time every time Postgres can't work out how to push a where clause down into a view, tbh. In our case, doing processing on sequences at insert time is much more pleasant, performant and featureful than trying to do it after the fact with window functions.
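Rough sketch of that insert-time approach (hypothetical events table, plpgsql):

    CREATE FUNCTION set_running_total() RETURNS trigger AS $$
    BEGIN
        NEW.running_total := COALESCE(
            (SELECT running_total FROM events
             WHERE user_id = NEW.user_id
             ORDER BY created_at DESC
             LIMIT 1), 0) + NEW.amount;
        RETURN NEW;  -- concurrent writers per user_id need serializing
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER events_running_total
        BEFORE INSERT ON events
        FOR EACH ROW EXECUTE PROCEDURE set_running_total();

Reading the running total back is then a plain indexed lookup, no window function needed.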




