Everything about BeansBox and Web Design in Hong Kong

By Elmer on February 25, 2009

We all want our pages in the top positions on search engine results, right? Yes, we all say in chorus. But we often forget that there is also online content we’d rather keep to ourselves than show to others: non-public files such as temporary web pages, a private collection of stock photos, or pages that are supposed to be accessible only after a successful login. Exposing them can result in embarrassment, a tarnished reputation, or both, as a Twitter HR executive found out by exposing private information to an unintended audience. The consequences can even be scandalous, but I won’t delve into that. To see what is exposed, use the “site:website.com” operator in Google, which lists all pages of website.com that the search engine has indexed. Try it on your website and maybe you’ll see certain pages that you thought shouldn’t be there. So our answer to the question had better be “yes, we want top rankings, but only for certain pages”.

Fortunately, there are ways to tell search engines to stay away from certain pages within our site.

Robots.txt
Robots.txt has been in use on the web for a long while, but it gained prominence as search engines grew popular as a tool for finding information quickly. Robots.txt is a plain text file that resides in the root folder of a website, such as www.website.com/robots.txt. (I wish I had registered www.example.com or www.website.com long ago so I could benefit from the links that use these domains as sample websites.)
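Since the file always sits at the root, its address can be worked out from any page URL on the same site. Here is a minimal Python sketch of that idea; the function name robots_txt_url and the sample page URL are my own, made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only as an example.
print(robots_txt_url("http://www.website.com/blog/post.html?page=2"))
```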

A sample robots.txt looks like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Sitemap: http://www.website.com/sitemap.xml

User-agent specifies which search engine crawlers our instructions are directed at. It can be Google’s Googlebot or Yahoo’s Slurp. But most of the time instructions apply to all search engine robots, so we just specify the asterisk symbol “*” as shown. The asterisk may suggest that robots.txt accepts wildcard patterns in this field, but it doesn’t. For example, User-agent: *bot will not apply our directives only to crawlers whose names end with “bot”. Disallow specifies a directory to exclude from crawling, one directory per line. Make sure the directories you list here are those whose pages should NOT rank on search engines. Finally, the Sitemap line lets robots.txt point crawlers to an XML sitemap, as an alternative to submitting it through Google Webmaster Tools.
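To check how a well-behaved crawler might read these rules, we can feed the sample file to the robots.txt parser in Python’s standard library. This is only a sketch: it shows how Python’s parser interprets the file, which is not necessarily how every search engine does. The helper load_rules and the test page paths are made up for illustration, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from this post.
SAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Sitemap: http://www.website.com/sitemap.xml
"""


def load_rules(text):
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(SAMPLE)
base = "http://www.website.com"

# A page inside a disallowed directory is off limits to any crawler.
print(rules.can_fetch("Googlebot", base + "/temp/draft.html"))
# Disallow matches by path prefix, so /temporary.html is not covered by /temp/.
print(rules.can_fetch("Googlebot", base + "/temporary.html"))
# Pages outside the listed directories stay crawlable.
print(rules.can_fetch("Slurp", base + "/index.html"))
# The Sitemap line is exposed as a list of URLs.
print(rules.site_maps())

# "*bot" is read as a literal name here, not as a pattern, so this
# record does not apply to Googlebot in Python's parser.
literal = load_rules("User-agent: *bot\nDisallow: /temp/\n")
print(literal.can_fetch("Googlebot", base + "/temp/draft.html"))
```

Running a check like this against your own robots.txt before uploading it is a cheap way to catch a directory you forgot to list.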

For working examples, see the robots.txt files of the White House and Google websites.

Robots Exclusion Protocol
In addition to robots.txt, it is also possible to tell search engines what to do with individual pages using the Robots Exclusion Protocol (REP). This is done by adding a robots META tag to the HEAD section of a web page, using the following syntax:

<META NAME="ROBOTS" CONTENT="[DIRECTIVE],[DIRECTIVE]">

Where DIRECTIVE (the first is required, the second is optional) can be any of the following: NOINDEX, NOFOLLOW, NOARCHIVE and NOSNIPPET. Each works as follows:

NOINDEX tells search engines not to index a specific page
NOFOLLOW tells search engines not to follow the links on a specific page
NOARCHIVE tells search engines not to store a cached copy of your page
NOSNIPPET tells Google not to show a snippet (description) under your Google listing; it also suppresses the cached link in the search results
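To verify that a page really carries the directives we intended, we can pull them out of its HTML. The sketch below uses Python’s built-in HTML parser; the class RobotsMetaParser, the helper robots_directives and the sample page are all made up for illustration, and a real crawler may read the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().upper()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that should stay out of the index.
page = """<html><head>
<title>Client drafts</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head><body>Work in progress</body></html>"""

print(sorted(robots_directives(page)))
```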

The good thing about this is that REP is endorsed by major players like Google, Yahoo! and MSN’s Live Search.

Now we know how to keep search engines away from sensitive material we place on our web servers. I hope we can start identifying such material and applying the two methods described above.

Photo credit: tonyadam
