• Aux@feddit.uk
    link
    fedilink
    English
    arrow-up
    2
    ·
    13 hours ago

    It does respect robots.txt, but that doesn’t mean it won’t index the content hidden behind robots.txt. That file is context dependent. Here’s an example.

    Site X has a link to sitemap.html on the front page and it is blocked inside robots.txt. When Google crawler visits site X it will first load robots.txt and will follow its instructions and will skip sitemap.html.

    Now there’s site Y and it also links to sitemap.html on X. Well, in this context the active robots.txt file is from Y and it doesn’t block anything on X (and it cannot), so now the crawler has the green light to fetch sitemap.html.

    This behaviour is intentional.