Hello, Hynek!
Thanks for attrs, wonderful library!
I'd suggest marketing the "serious business aliases" more prominently. It's not about aesthetics but expectations: people like me are immediately puzzled by "attr.ib()", thinking "why ib?".
The reason is that after reading a lot of Python code, our brains are already trained to recognize attribute and method names after the dot.
Also, it's a reasonable expectation that you can import a function or submodule and the code still makes sense, but this won't:
    from attr import s, ib

    @s
    class Thing(object):
        x = ib()
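For contrast, here is roughly what the "serious business aliases" version looks like (a minimal sketch, assuming the attrs library is installed; `attr.attrs` and `attr.attrib` are the documented aliases for `attr.s` and `attr.ib`):

```python
import attr  # the attrs library

# attr.attrs / attr.attrib are aliases for attr.s / attr.ib,
# but read as ordinary names after the dot.
@attr.attrs
class Thing(object):
    x = attr.attrib()

t = Thing(42)
print(t.x)  # 42
```

The generated class is identical either way; only the spelling at the definition site changes.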
I understand it looks like a small thing to you and to others already used to this DSL, and that it's not the most important technical aspect of the library. However, I do think it's an important human aspect of it.
I have the feeling that making the "no-nonsense" alternative more prominent in the examples and documentation would reduce the cognitive load for new users, and maybe make your own life easier by not having to explain and paste this link every time someone finds it odd (which will probably keep happening).
Hello, thank you for the rare nice words in this thread!
The submission of this article to HN took me by surprise (it’s like two weeks old now), so it caught me off guard right before a new release, and there’s some inconsistency between the GitHub README and the docs on RTD.
However, as you said, it'd be a major change and it would affect the whole ecosystem (plugins and extensions), so it's complicated. We'll see what happens. :)
Well, missing data can happen from problems at several different levels:
1) site changes caused the scraped items to be incomplete (missing fields) -- for this, one approach is to use an Item Validation Pipeline in Scrapy, perhaps using a JSON schema or something similar, logging errors or rejecting an item if it doesn't pass validation.
2) site changes caused the scraping of the items itself to fail: one solution is to store the page sources and monitor the spider errors -- when there are errors, you can rescrape from the stored sources (it can get a bit expensive to store sources for big crawls). Scrapy doesn't have a complete solution for this out of the box; you have to build your own. You could use the HTTP cache mechanism and build a custom cache policy: http://doc.scrapy.org/en/latest/topics/downloader-middleware...
3) the site changed its navigation structure, so the pages to be scraped were never reached: this is the worst one. It's similar to the previous case, but it's one you want to detect earlier -- saving the sources doesn't help much, since it happens early in the crawl, so you want to be monitoring for it.
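The validation pipeline in point 1 could be sketched roughly like this (a minimal stand-alone sketch: the schema, field names, and the locally defined DropItem are all hypothetical -- in a real project you'd use scrapy.exceptions.DropItem and probably a real JSON-schema validator):

```python
# Stand-in for scrapy.exceptions.DropItem, so the sketch is self-contained.
class DropItem(Exception):
    pass

# Hypothetical "schema": required fields and their expected types.
ITEM_SCHEMA = {"title": str, "price": float, "url": str}

class ValidationPipeline(object):
    """Reject scraped items that are missing fields or have wrong types."""

    def process_item(self, item, spider=None):
        for field, expected_type in ITEM_SCHEMA.items():
            if field not in item:
                raise DropItem("missing field: %s" % field)
            if not isinstance(item[field], expected_type):
                raise DropItem("bad type for %s: %r" % (field, item[field]))
        return item  # item passed validation; hand it to the next pipeline
```

In an actual Scrapy project you would register such a class via the ITEM_PIPELINES setting and let Scrapy call process_item for every scraped item.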
One good practice is to split the crawl in two: one spider does the navigation and pushes the links of the pages to be scraped into a queue or something, and another spider reads the URLs from that queue and just scrapes the data.
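The two-phase split could be sketched like this (everything here is illustrative: the deque stands in for a real shared queue such as Redis or a message broker, and fetch() is a placeholder for the download-and-parse step):

```python
from collections import deque

# Stand-in for a shared queue; in production this would be Redis,
# a message broker, or a database table shared by the two spiders.
url_queue = deque()

def navigation_phase(item_links):
    """First spider: walk the site structure, only collecting item URLs."""
    for link in item_links:
        url_queue.append(link)

def scraping_phase(fetch):
    """Second spider: consume queued URLs and extract data from each page."""
    items = []
    while url_queue:
        url = url_queue.popleft()
        items.append(fetch(url))  # fetch() is a placeholder: download + parse
    return items
```

The benefit is that a navigation breakage (case 3) shows up as an empty or shrunken queue, which is easy to monitor independently of the scraping itself.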
Agreed, that can be truly challenging.
It's important to store the timestamp of when the page source was downloaded, together with the timezone if available.
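For example, with the standard library this can be as simple as attaching an aware UTC timestamp to each stored record (the record fields here are hypothetical):

```python
from datetime import datetime, timezone

# Record when the page source was downloaded, as a timezone-aware UTC
# timestamp; ISO 8601 keeps the offset (+00:00) together with the value.
downloaded_at = datetime.now(timezone.utc)
record = {
    "url": "http://example.com/page",  # hypothetical page
    "downloaded_at": downloaded_at.isoformat(),
}
```

Storing aware timestamps avoids ambiguity later when you compare crawls made from machines in different timezones.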
Well, it is meant to be very forgiving. Right now it outputs the same thing for both "1 week ago" and "1 weeks ago" (even though the latter is grammatically incorrect).
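This isn't the library's actual implementation, but a minimal sketch of how such forgiving parsing could work -- the plural "s" is simply made optional in the pattern (units and conversion factors here are simplified assumptions):

```python
import re
from datetime import timedelta

_DAYS_PER_UNIT = {"week": 7, "day": 1}  # simplified for the sketch

def parse_ago(text):
    """Accept "1 week ago" and "1 weeks ago" alike: the trailing
    plural 's' is optional in the regular expression."""
    m = re.match(r"(\d+)\s+(week|day)s?\s+ago$", text.strip())
    if not m:
        raise ValueError("unrecognized: %r" % text)
    count, unit = int(m.group(1)), m.group(2)
    return timedelta(days=count * _DAYS_PER_UNIT[unit])
```

Both the grammatical and the ungrammatical form map to the same timedelta, which matches the "very forgiving" behavior described above.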
Can you elaborate what you mean by "proper grammar for singular values"?
Cheers!