
Note that this is quite strict about which characters a bot's user agent may contain. This is due to strictness in the REP standard.

https://github.com/google/robotstxt/blob/master/robots_test....

    // A user-agent line is expected to contain only [a-zA-Z_-] characters and must
    // not be empty. See REP I-D section "The user-agent line".
    // https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1
So you may need to adjust your bot’s UA for proper matching.
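
To make the character restriction concrete, here is a minimal Python sketch of the check described in the quoted comment. It is not the library's API: `is_valid_product_token` and `leading_product_token` are hypothetical helper names, and `FooBot` is a made-up bot name.

```python
import re

# Allowed characters per the comment quoted above: [a-zA-Z_-], non-empty.
_TOKEN_RE = re.compile(r"[A-Za-z_-]+")


def is_valid_product_token(token: str) -> bool:
    """Return True if token is non-empty and uses only [a-zA-Z_-]."""
    return _TOKEN_RE.fullmatch(token) is not None


def leading_product_token(user_agent: str) -> str:
    """Return the leading run of allowed characters (hypothetical helper).

    For a full header value such as "FooBot/2.1 (+https://example.com)"
    this keeps only the part usable for matching in robots.txt.
    """
    match = _TOKEN_RE.match(user_agent)
    return match.group(0) if match else ""
```

Under these assumptions, a version suffix such as `/2.1` is what makes a full header value fail the check, so trimming to the leading token is one way to get a matchable name.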

(Disclosure, I work at Google, though not on anything related to this.)



The strictness applies to what may be listed in robots.txt, not to the User-Agent header as sent by bots. The example given in the linked draft standard[0] makes it abundantly clear that it's on the bot to understand how to interpret the corresponding lines of robots.txt.

Of course, in practice robots.txt files tend to look less like [1] and more like [2].

[0]: https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1

[1]: https://github.com/robots.txt

[2]: https://wpengine.com/robots.txt
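
As a rough illustration of "it's on the bot to interpret the lines", here is a simplified Python sketch of how a bot might pick its rules from a robots.txt file: compare its short product token case-insensitively against each user-agent line, and fall back to the `*` group if nothing matches. The function name `rules_for` and the bot name `FooBot` are hypothetical, and the parsing is deliberately minimal (only `allow` and `disallow` lines are kept); it is not how any particular library does it.

```python
from typing import List, Tuple


def rules_for(robots_txt: str, product_token: str) -> List[Tuple[str, str]]:
    """Return (directive, path) rules that apply to product_token."""
    token = product_token.lower()
    specific: List[Tuple[str, str]] = []  # rules from groups naming the token
    wildcard: List[Tuple[str, str]] = []  # rules from "*" groups
    found_specific = False
    agents: List[str] = []  # user-agent values of the group being read
    in_rules = False        # True once the current group has a rule line

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
            if value.lower() == token:
                found_specific = True
        elif key in ("allow", "disallow"):
            in_rules = True
            if token in agents:
                specific.append((key, value))
            elif "*" in agents:
                wildcard.append((key, value))

    return specific if found_specific else wildcard


sample = """
User-agent: FooBot
Disallow: /private/

User-agent: *
Disallow: /
"""
print(rules_for(sample, "foobot"))
print(rules_for(sample, "BarBot"))
```

Note that the bot passes its short token here, not the full header it sends over HTTP.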


Sorry, I meant for matching, and I did try to imply that it is a limitation of the standard and not the library. To avoid confusion, though, I do personally think keeping the user agent minimal is wise, since users might have difficulty guessing what value to use if it differs sufficiently from the real user agent that's sent.



