(I am not affiliated with Keebo, although I had a recruiting meeting with them earlier this year)

FWIW, Keebo (https://keebo.ai/) tries to solve this problem & reduce your Snowflake bill by using Data Learning techniques. It can be configured to return exact results or approximate results.


It is always interesting to see companies building on top of the products/services of other companies. Kinda like how TurboTax is built on top of the IRS, these "child" companies (is there a better term?) are quite dependent on the "parent" company not changing or improving its product/service.

I don't see AWS changing so dramatically that companies like Databricks are put in hot water (but I could be wrong), but I could see Snowflake improving its product due to competition, putting Keebo in a tough situation.


By the time I reached this comment I counted no fewer than five completely separate links to offerings to help reduce your Snowflake bill. For something that is already a focused SaaS product, I have to say that starts to smell a bit.


Ex-Citus here. The open-sourced shard rebalancer blocks writes to the shard being moved. The online rebalancer (closed source) uses logical replication and doesn't block writes to the shards being moved, except for a brief period. Everything else is the same; only how shard moves are implemented differs.


Yeah, echoing here to be extra clear. What is open-sourced holds a write lock while rebalancing. So it rebalances, sure, but it's only marginally better than a dump/restore. You can still read, yes, but I'm not aware of many applications that can be okay with no writes flowing to a table for hours while rebalancing is happening.


Got it, thank you! I'm just wondering whether this limitation can be alleviated or worked around by combining sharding and replication. In that case, I would expect the primary DB cluster to hold the write lock during shard rebalancing while still allowing writes to the replicas (once the primary cluster finishes rebalancing, the roles would reverse and rebalancing would be applied to the replicas, while the primary is already free of the write lock).


Understood. Thank you for the clarification.


Citus Data | Software Engineer | Waterloo, ON, Canada / San Francisco, US / Amsterdam, Netherlands | Full-Time Onsite | https://www.citusdata.com/jobs/

We're looking for software engineers for the database development team, which is responsible for developing the Citus extension and related tools and for providing custom solutions for customers. Programming is done primarily in C, but without all the usual messiness, thanks to Postgres' elegant internal APIs. We also build tools to help customers in whichever language is appropriate. If you're interested in working on distributed SQL, high availability, distributed transactions, seamless scale-out, and other parts of a distributed database, and you're excited about working in a distributed organisation with engineers from companies like Amazon/Heroku/Google/Uber, then Citus might be the place for you.

To see our other positions visit: https://www.citusdata.com/jobs/

Apply by sending your resume to hadi@citusdata.com or imagine@citusdata.com.


I think the use case for Google Cloud SQL, RDS, Heroku, etc. is a bit different from Citus and other distributed databases. It seems that Cloud SQL has very limited scalability (32 processors, 200GB of RAM), so it might not be a good fit for use cases where your working dataset is on the order of terabytes or more. Citus, on the other hand, scales horizontally: you can add more CPU power and RAM by adding another machine to your cluster.

If my data were on the order of 10GB, I would choose Cloud SQL, RDS, etc. On the order of 100GB, I would try both Cloud SQL, RDS, etc. and Citus, etc. to see which one fits my use case. On the order of terabytes, I would choose Citus or some other distributed database.

(I'm a former Citus employee and a current Googler on a non-Cloud SQL team)


(As someone who was pretty familiar with PgSQL internals)

1. I think from the 'storage' point of view, a column-store will use much less space than a collection of key/value PgSQL tables:

A) Each PgSQL table carries several system columns on every row, so each key/value pair will have a large overhead (key size + system column size), which will most likely be larger than the actual data. This defeats one of the goals of columnar stores, which is to load less data into memory (to achieve less disk I/O, assuming the more frequently used columns eventually get cached), unless you have a very large number of columns and use only a few of them.

B) PostgreSQL doesn't compress individual columns. Column stores usually apply some kind of lightweight compression to use less disk space. When I worked on cstore_fdw, this was one of the features that attracted users.

2. I haven't done PostgreSQL benchmarking in the last ~10 months, but unless your query is highly selective, even the fastest joins won't come close to a sequential table scan. If the query is selective enough that it only needs to scan a small subset of rows, then joins + indexes should be faster, but then again you have the overhead of the indexes.

One method I found useful in PostgreSQL is to create an index covering the columns used by very frequent queries, and then tune the system so those queries use index-only scans.


"and then tune the system to use index-only-scan for these queries."

Caaareful there. I hope you're monitoring your query times for outliers.


A new edX course [1] was announced a few days ago, created by SPb ITMO (which finished 7th this year).

[1] https://www.edx.org/course/how-win-coding-competitions-secre...


The course looks really interesting to be frank. I've recommended it to my college's competitive programming team actually!

ITMO did not get 1st place this year because the team changed, and I believe tourist graduated. Last year [1], for example, they solved all 13 problems while the 2nd team solved 11; there simply was no competition.

7th is still an excellent result. Let's see what they do next year!

[1]: http://a2oj.com/ICPC.jsp?y=2015


Tourist is ineligible because he has already participated twice, which is the limit.


The course is an introductory course, suitable only for beginners. Do you know any good resource for intermediate-level programmers?



You are right. Thank you. Going to correct this.


Just to make sure, are you referring to bucket sort and similar algorithms or something else?


"foldl" works like aggregation in databases. When you say "fold func init values", the result is calculated as:

  result = init
  for value in values:
     result = func(result, value)
  return result
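As a tiny standalone illustration (not part of the solution itself):

  foldl (+) 0 [1, 2, 3, 4]   -- = ((((0 + 1) + 2) + 3) + 4) = 10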
So, "foldl add_endpoints [] bs" will translate to:

  result = []
  for (x1, h, x2) in bs:
     result = add_endpoints(result, (x1, h, x2))
  return result
If you expand the 3rd line (the call to add_endpoints), you get:

  result = []
  for (x1, h, x2) in bs:
     result = result ++ [(x1, height x1), (x2, height x2)]
  return result
where "++" is the list concatenation, and "height x" is a function which finds the skyline height by finding the tallest building with "x1 <= x && x < x2".

I think the 2nd solution should be easy to understand if you understand the 1st solution. Probably the only new thing is that I've sorted the buildings by height before passing them to "foldl".
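For what it's worth, that sort could look something like this in Haskell (an assumption on my part, with buildings as (x1, h, x2) triples; the original may sort in the opposite direction):

  import Data.List (sortOn)

  -- hypothetical: order the buildings by height before folding
  sortedBuildings = sortOn (\(_, h, _) -> h) bs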

In the 3rd solution, the following lines:

  skyline bs = (skyline (take n bs), 0) `merge` (skyline (drop n bs), 0)
                where n = (length bs) `div` 2
mean:

  first_half = first_n_elements(bs, length(bs) / 2)
  second_half = remove_first_n_elements(bs, length(bs) / 2)
  result = merge ((skyline(first_half), 0), (skyline(second_half), 0))
You may wonder what those 0s are. In the merge function I need to keep track of the current height of each half, and the initial heights of the left and right skylines are 0.
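In other words, each argument to "merge" is a pair of (list of (x, height) key points, current height). As a rough type annotation (my own reading of the code, assuming Int coordinates):

  -- hypothetical type of the merge arguments
  type HalfSkyline = ([(Int, Int)], Int)   -- ((x, height) key points, current height)

  merge :: HalfSkyline -> HalfSkyline -> [(Int, Int)]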

Then, in the merge function:

  merge ([], _) (ys, _) = ys
  merge (xs, _) ([], _) = xs
means: return the other list if either of the lists becomes empty. The underscores stand for variables whose values are not important to us. We don't care about the current height values here, so I've put _'s instead of real names.

Then the other cases:

  merge ((x, xh):xs, xh_p) ((y, yh):ys, yh_p)
    | x > y = merge ((y, yh):ys, yh_p) ((x, xh):xs, xh_p)
    | x == y = (x, max xh yh) : merge (xs, xh) (ys, yh)
    | max xh_p yh_p /= max xh yh_p = (x, max xh yh_p) : merge (xs, xh) ((y, yh):ys, yh_p)
    | otherwise = merge (xs, xh) ((y, yh):ys, yh_p)
First case, "x > y" simply swaps the two args. This ensures that in the following cases we have x <= y.

The second case (x == y) is probably easy to understand: both skylines have a key point at the same x, so the merged skyline takes the higher of the two heights there, and we continue merging the two tails.

In the third case, we know that x < y. Just before reaching x, the skyline has height "max xh_p yh_p". When we reach x, the height of the skyline changes to "max xh yh_p". If these values are not equal, we have a height change, so we construct a new list whose head is "(x, new height)", followed by the result of merging the rest of the skylines.

If the height doesn't change, we just ignore the point at x and continue with the rest of the skylines.
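To make the merge concrete, here is a tiny hand-worked trace (assuming buildings are (x1, h, x2) triples and that the single-building base case produces [(x1, h), (x2, 0)]; neither assumption is shown in the code above):

  -- two overlapping buildings: height 3 over [1, 5) and height 4 over [2, 7)
  -- skyline [(1, 3, 5), (2, 4, 7)]
  --   = merge ([(1, 3), (5, 0)], 0) ([(2, 4), (7, 0)], 0)
  --   = [(1, 3), (2, 4), (7, 0)]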


Upvote for a really accessible explanation. I wish I had posts like these to read when I was learning Haskell half a decade ago. You've got a knack for conveying concepts! Write some texts or at least a blog!


Thank you! That was exactly what I was looking for, I think I can follow from here.

