Contradictory rules in robots.txt

I'm attempting to scrape a website, and these two rules in its robots.txt seem contradictory:

User-agent: *
Disallow: *
Allow: /

Does Allow: / mean that I can scrape the entire website, or just the root? If it means I can scrape the entire site, then it directly contradicts the previous rule.
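
One practical way to see how a given parser resolves these rules is to feed them to it and ask directly. The sketch below uses Python's urllib.robotparser; the example.com URLs are placeholders, and other crawlers' parsers may answer differently for this file.

    from urllib import robotparser

    # Parse the three rules from memory instead of fetching a live robots.txt.
    rp = robotparser.RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: *",
        "Allow: /",
    ])

    # example.com is only a placeholder domain; the answer depends on how
    # this particular parser handles wildcards and Allow lines.
    print(rp.can_fetch("*", "https://example.com/"))
    print(rp.can_fetch("*", "https://example.com/some/page"))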



Solution 1:[1]

If you are following the original robots.txt standard:

  • The * in the disallow line would be treated as a literal rather than a wildcard. That line would disallow URL paths that start with an asterisk. All URL paths start with a /, so that rule disallows nothing.
  • The Allow rule isn't in the original specification, so that line would be ignored.
  • Anything that isn't specifically disallowed is allowed to be crawled.

Verdict: You can crawl the site.
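
To make the bullets above concrete, here is a minimal sketch of original-standard matching (original_standard_allows is a hypothetical helper, not any particular crawler's code): Disallow values are literal path prefixes, unknown directives such as Allow are skipped, and anything not disallowed is allowed.

    # Hypothetical sketch of original-standard robots.txt matching:
    # Disallow values are literal path prefixes, everything else is ignored.
    RULES = [
        ("Disallow", "*"),   # literal prefix "*"; no URL path starts with "*"
        ("Allow", "/"),      # Allow is not in the original spec, so it is skipped
    ]

    def original_standard_allows(path):
        for directive, value in RULES:
            if directive != "Disallow" or value == "":
                continue              # skip unknown directives and empty Disallow
            if path.startswith(value):
                return False          # path matches a disallowed prefix
        return True                   # nothing disallowed it

    print(original_standard_allows("/"))           # True
    print(original_standard_allows("/some/page"))  # True: no path starts with "*"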


Google and a few other crawlers support wildcards and Allow directives. If you are following Google's extensions to robots.txt, here is how Google would interpret this file:

  • Both Allow: / and Disallow: * match every path on the site.
  • When rules conflict, the more specific (i.e. longer) rule wins. / and * are each one character long, so neither is considered more specific than the other.
  • When specificity is tied, the least restrictive rule wins, and Allow is considered less restrictive than Disallow.

Verdict: You can crawl the site.
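
Under those same assumptions, here is a minimal sketch of longest-match resolution with Allow winning ties (google_style_allows and _matches are hypothetical helpers, not Google's actual parser, and the sketch ignores extras such as the $ end-of-URL anchor).

    import re

    RULES = [
        ("Disallow", "*"),
        ("Allow", "/"),
    ]

    def _matches(pattern, path):
        # Treat "*" as a wildcard; anchor the pattern at the start of the path.
        regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
        return re.match(regex, path) is not None

    def google_style_allows(path):
        matching = [(len(value), directive == "Allow")
                    for directive, value in RULES if _matches(value, path)]
        if not matching:
            return True               # no rule matches: crawling is allowed
        # Longest rule wins; on a tie in length, Allow sorts after Disallow.
        matching.sort()
        return matching[-1][1]

    print(google_style_allows("/"))           # True: "/" and "*" tie, Allow wins
    print(google_style_allows("/some/page"))  # True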

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
