• jaschen@lemm.ee · 21 hours ago

    Web manager here. Don’t do this unless you wanna accidentally send Google’s crawlers to the same fate and have your site delisted.

        • rosco385@lemm.ee · 20 hours ago

          It’d be more naive to have a robots.txt file on your webserver and be surprised when webcrawlers don’t stay away. 😂

      • Aux@feddit.uk · 13 hours ago

        It does respect robots.txt, but that doesn’t mean it won’t index content blocked by robots.txt. robots.txt controls crawling, not indexing. Here’s an example.

        Site X links to sitemap.html on its front page, and sitemap.html is blocked in robots.txt. When the Google crawler visits site X, it first loads robots.txt, follows its instructions, and skips fetching sitemap.html.

        Now site Y also links to sitemap.html on X. Googlebot still won’t fetch the page, since X’s robots.txt always applies to URLs on X, but it has now discovered the URL through an external link, so it may index the bare URL, along with the anchor text of the links pointing at it, without ever downloading the content.

        This behaviour is intentional.
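        For reference, Python’s standard urllib.robotparser shows how a well-behaved crawler evaluates a host’s robots.txt before fetching a URL on that host; the hostname example.com and the rules below are made up to match the scenario above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for site X (example.com), as in the scenario above.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /sitemap.html",
])

# A crawler checks the target host's rules before every fetch,
# regardless of which site it found the link on.
print(rp.can_fetch("*", "https://example.com/sitemap.html"))  # False: blocked
print(rp.can_fetch("*", "https://example.com/index.html"))    # True: allowed
```

        Note that blocking the fetch this way says nothing about whether the URL itself ends up in a search index.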

      • Zexks@lemmy.world · 20 hours ago

        Lol. And they’ll delist you. Unless you’re really important, good luck with that.

        robots.txt:

        User-agent: *
        Disallow: /some-page.html

        If you disallow a page in robots.txt, Google won’t crawl it. Even when Google finds links to the page and knows it exists, Googlebot won’t download the page or see its contents. Google will usually choose not to index the URL, but that isn’t guaranteed: it may include the URL in the search index, along with words from the anchor text of links to it, if it decides the page may be important.
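        For completeness: if the goal is to keep a page out of Google’s index entirely, not just uncrawled, Google’s documented mechanism is a noindex directive, and the page must not be disallowed in robots.txt, because Googlebot never sees directives on a page it is forbidden to fetch:

```
<!-- In the page's <head>; only works if crawlers are allowed to fetch the page -->
<meta name="robots" content="noindex">
```

        The same directive can be sent as an HTTP response header (X-Robots-Tag: noindex), which also covers non-HTML resources like PDFs.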