This is a problem I've researched fairly extensively the last few months. My ideal solution looks something like:
* Initial pull
* Secondary pulls x time later, where x doubles each time, up to a maximum value, y
y is the one that's tricky to define. For us, it's computed from the update frequency of similar URLs on that domain, of the domain as a whole, of similar content, and a few other bits and pieces. Essentially, our thinking is that if we can understand how alike any page is to a cluster of other pages, we can use their average frequency of update to give reasonably likely initial values for x, and sensible thresholds for y. We also temper this with how much change there is, to determine whether the differences are something we care about.
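As a rough sketch of that cohort-based seeding: given the observed gaps between updates for pages judged similar, you can pick a starting x and a cap y relative to the typical gap. The function name and the 0.5x/4x multipliers here are illustrative assumptions, not values from any real system:

```python
from statistics import median

def derive_schedule_params(cohort_update_gaps):
    """Seed the revisit schedule from a cohort of similar pages.

    cohort_update_gaps: observed gaps (in seconds) between updates
    for pages clustered as similar to this one.
    Returns (initial_x, cap_y). The multipliers are illustrative.
    """
    typical_gap = median(cohort_update_gaps)
    initial_x = typical_gap / 2   # start checking twice as often as the cohort updates
    cap_y = typical_gap * 4       # never back off beyond a few typical gaps
    return initial_x, cap_y
```

The median rather than the mean keeps one pathologically stale or hyperactive page in the cohort from skewing the seed values.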
Obviously, should the system notice that a page's change timings fall well outside what it'd expect for the cohort it was assigned, it can start moving the page between comparison clusters. An example would be a blog category page which updates so infrequently that it's particularly unusual, or a page with a lot of social feeds on it that's in constant flux.
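The pull schedule itself can be sketched as a small state machine: x doubles on every unchanged pull, is capped at y, and snaps back to its base value when a meaningful change is seen. Class and parameter names here are hypothetical, and the defaults are placeholders rather than tuned values:

```python
import time

class RevisitScheduler:
    """Doubling-interval revisit schedule, capped at a maximum.

    interval (x) doubles after each pull that finds no meaningful change,
    up to cap (y); a detected change resets x to its base value.
    """

    def __init__(self, base_interval=3600, max_interval=7 * 86400):
        self.base = base_interval      # initial x, in seconds
        self.cap = max_interval        # y, the ceiling on x
        self.interval = base_interval
        self.next_pull = time.time() + self.interval

    def record_pull(self, changed):
        """Call after each pull with whether the content meaningfully changed."""
        if changed:
            # Content is moving: tighten back to the base interval.
            self.interval = self.base
        else:
            # No change: back off by doubling x, but never beyond y.
            self.interval = min(self.interval * 2, self.cap)
        self.next_pull = time.time() + self.interval
```

Feeding `derive_schedule_params`-style cohort values into `base_interval` and `max_interval`, and re-deriving them when a page is reassigned to a new cohort, would cover the adaptive part described above.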
Works pretty well, but if anyone's got a better solution I'd love to hear of it.