Contradictory rules in robots.txt
I'm attempting to scrape a website, and these two rules in its robots.txt seem to be contradictory:
User-agent: *
Disallow: *
Allow: /
Does `Allow: /` mean that I can scrape the entire website, or just the root? If it means I can scrape the entire site, then it directly contradicts the previous rule.
Solution 1:[1]
If you are following the original robots.txt standard:
- The `*` in the `Disallow` line would be treated as a literal rather than a wildcard. That line would disallow URL paths that start with an asterisk. All URL paths start with a `/`, so that rule disallows nothing.
- The `Allow` rule isn't in the specification, so that line would be ignored.
- Anything that isn't specifically disallowed is allowed to be crawled.
Verdict: You can crawl the site.
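The reasoning above can be sketched in Python. This is only an illustration of literal prefix matching, not a complete parser: the function name `allowed_original` is hypothetical, only the `User-agent: *` group is honored, and grouping of several consecutive `User-agent` lines is not handled.

```python
def allowed_original(robots_txt: str, path: str) -> bool:
    """Literal prefix matching: only Disallow lines are honored, no wildcards."""
    applies = False  # True while we are inside a "User-agent: *" group
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            applies = value == "*"
        elif field == "disallow" and applies and value:
            # "*" is an ordinary character here, so "Disallow: *" only
            # blocks paths that literally begin with an asterisk.
            if path.startswith(value):
                return False
        # "Allow" and any other fields are ignored in this sketch.
    return True


ROBOTS = "User-agent: *\nDisallow: *\nAllow: /\n"
print(allowed_original(ROBOTS, "/any/page.html"))
```

Because every URL path begins with `/`, no path can start with `*`, so the function never finds a matching `Disallow` prefix for this file.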
Google and a few other crawlers support wildcards and `Allow` rules. If you are following Google's extensions to robots.txt, here is how Google would interpret this robots.txt:
- Both `Allow: /` and `Disallow: *` match any specific path on the site.
- In the case of such a conflict, the more specific (i.e. longer) rule wins. `/` and `*` are each one character, so neither is considered more specific than the other.
- In the case of a tie for specificity, the least restrictive rule wins. `Allow` is considered less restrictive than `Disallow`.
Verdict: You can crawl the site.
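The precedence rules described above (longest matching rule wins, `Allow` wins a tie) can be sketched as follows. This is a simplified illustration, not Google's actual parser: the names `pattern_matches` and `allowed_google_style` are hypothetical, rules are passed in as already-parsed `(is_allow, pattern)` pairs, and specificity is approximated as the character length of the pattern.

```python
import re


def pattern_matches(pattern: str, path: str) -> bool:
    """Match a path against a pattern where '*' is a wildcard and a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    return re.match(regex, path) is not None  # prefix match


def allowed_google_style(rules, path: str) -> bool:
    """rules is a list of (is_allow, pattern) pairs for one user-agent group."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for is_allow, pattern in rules:
        if not pattern or not pattern_matches(pattern, path):
            continue
        length = len(pattern)
        # A longer pattern wins; on equal length, Allow beats Disallow.
        if length > best_len or (length == best_len and is_allow):
            best_len, best_allow = length, is_allow
    return best_allow


RULES = [(False, "*"), (True, "/")]  # Disallow: *  then  Allow: /
print(allowed_google_style(RULES, "/any/page.html"))
```

Here both patterns match and both have length 1, so the tie goes to the `Allow` rule regardless of the order in which the rules are listed.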
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |