Meaning of Robots.txt
Robots.txt is a plain-text file placed in your website’s root directory that tells search engine crawlers which parts of the site they may fetch. One of the most useful directives is “Disallow”: it stops compliant crawlers from accessing private or irrelevant sections or pages of your website, e.g.
Disallow: /temp/
Disallow: /mypage.html
You can even block search engines from crawling every page on your domain, e.g.:
User-agent: *
Disallow: /
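Before uploading a robots.txt file, it can help to test locally what its rules actually block. The following is a minimal sketch using Python's standard urllib.robotparser module. The helper names build_parser and allowed are invented for illustration, and www.abc.com is only a placeholder domain. Python's parser uses simple prefix matching, so real crawlers may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, path: str) -> bool:
    """Return True if a crawler identifying as Googlebot may fetch the path."""
    # www.abc.com is a placeholder domain used only for this illustration.
    return parser.can_fetch("Googlebot", "http://www.abc.com" + path)


# Block one directory and one page.
section_rules = build_parser(
    "User-agent: *\n"
    "Disallow: /temp/\n"
    "Disallow: /mypage.html\n"
)

# Block the whole domain.
block_everything = build_parser(
    "User-agent: *\n"
    "Disallow: /\n"
)

# A trailing slash turns the rule into a directory prefix,
# so it no longer matches the file /mypage.html itself.
trailing_slash = build_parser(
    "User-agent: *\n"
    "Disallow: /mypage.html/\n"
)

for label, parser, path in [
    ("directory rule", section_rules, "/temp/draft.html"),
    ("page rule", section_rules, "/mypage.html"),
    ("unlisted page", section_rules, "/index.html"),
    ("block everything", block_everything, "/index.html"),
    ("trailing slash", trailing_slash, "/mypage.html"),
]:
    print(f"{label}: {path} allowed = {allowed(parser, path)}")
```

A check like this only tells you whether a compliant crawler may fetch a URL. It says nothing about whether that URL can show up in search results.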
Blocked Pages can Still Appear in Google – HOW?
Take a moment to understand how and why this happens. Assume you have a page at http://www.abc.com/mypage.html containing confidential information about your company’s new “coupon codes” project. You may want to share that page with partners, but you don’t want the information to become public knowledge just yet. Therefore, you block the page with a rule in http://www.abc.com/robots.txt:
User-agent: *
Disallow: /mypage.html
A few weeks later, you search for “coupon codes” in Google and find http://www.abc.com/mypage.html on the first page of results. How could this happen? Google is supposed to obey your robots.txt instructions, isn’t it?
This is not a violation of the robots.txt rules. Robots.txt stops Google from crawling the page, not from indexing its URL. Google found the URL elsewhere: http://www.abc.com/mypage.html may be linked from an external website, and Google discovered it there. The title and description shown in the results also come from that external link (for example, its anchor text), not from your page content.
There are several solutions that will stop your pages appearing in Google search results:
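- Set a “noindex” meta tag: Google will not show your page or follow its links if you add this code to your HTML head section: <meta name="robots" content="noindex, nofollow">. Google must be able to crawl the page to see the tag, so the page must not also be disallowed in robots.txt.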
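- Use the URL removal tool: Google offers a URL removal tool within Webmaster Tools (now called Google Search Console). The removal is temporary, so combine it with one of the other solutions.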
- Add authentication: Apache, IIS, and most other web servers offer basic authentication facilities. The visitor must enter a user ID and password before the page can be viewed. This may not stop Google showing the page URL in results, but it will stop unauthorized visitors reading the content.
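The first solution only works if the crawler can actually read the page, so it is worth checking that a robots meta tag and the robots.txt rules do not cancel each other out. The sketch below is one possible way to do that with Python's standard library. The names RobotsMetaParser, robots_directives, and noindex_can_take_effect are hypothetical, and the sample page and rules are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for directive in attr_map.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def noindex_can_take_effect(robots_txt: str, url: str, html: str,
                            agent: str = "Googlebot") -> bool:
    """True only if the page is crawlable AND carries a noindex directive."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    crawlable = rules.can_fetch(agent, url)
    return crawlable and "noindex" in robots_directives(html)


# Made-up sample page for illustration.
sample_page = """<html><head>
<title>Coupon codes</title>
<meta name="robots" content="noindex, nofollow">
</head><body>Confidential</body></html>"""

page_url = "http://www.abc.com/mypage.html"

# The page is disallowed, so a compliant crawler never sees the tag.
blocked_rules = "User-agent: *\nDisallow: /mypage.html\n"
# The page is crawlable, so the noindex tag can be read and honoured.
open_rules = "User-agent: *\nDisallow: /temp/\n"

print(noindex_can_take_effect(blocked_rules, page_url, sample_page))
print(noindex_can_take_effect(open_rules, page_url, sample_page))
```

In practice you would fetch the live robots.txt and page HTML instead of using strings. The strings here keep the example self-contained.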