Creating a Robots.txt File and Its Importance
- By the SearchEngineOptimizationPromotion.com team
Do you know why a robots.txt file matters? Read on to find out.
Big companies succeed partly by keeping their confidential plans out of public view, which lets them act on those plans and change course as the situation demands. A robots.txt file does a similar job for your website: it tells search engines which of your web pages they may visit and which they may not. Human visitors can still visit every page, so the search engines may see a different website than your visitors do. If you think one or more of your pages should not be visited by search engines, robots.txt lets you say so.
Every search engine has a "robot" (a software program) that does the job of visiting websites. Its purpose is to gather a copy of each site and store it in the search engine's database. If your site is not in that database, it never shows up in the search results.
Web robots are sometimes called web crawlers or spiders, so the process of a robot visiting your website is called "spidering" or "crawling". When somebody says "the search engines have spidered my website," it means the search engine robots have visited that website. Each robot is known by a name and visits from its own IP address. The IP address is of no importance to us, but the name is, because we use it when we create a robots.txt file. The file is called "robots.txt" because it is addressed to these robots. The examples in this article use the names of two of them: Googlebot (Google) and ArchitextSpider (Excite).
Writing Robots.txt:
Let's learn to write robots commands. There are two ways to write them: put all the commands in a text file called "robots.txt", or write them in a meta tag on each page. We will learn both ways.
Writing robots command in Meta tag:
There are 4 things you can tell a search engine robot when it visits your page:
1) Do not index this page - the search engines will not index the page.
2) Do not follow any links on this page - the search engines will not follow the links on the page, i.e. they will not use this page to reach the pages it links to.
3) Do index this page - the search engines will index the page.
4) Do follow the links - the search engines will follow the links and may index the pages that this page links to.
Note that "index" is different from "spider". A search engine first spiders (visits) a page and then indexes it. Indexing means storing the page so that it can be ranked for a searched keyword on the basis of its content, meta tags and link popularity. The ranking itself is decided at search time. When you tell search engines not to index a page, they know that the page exists but do not list it, so a no-index page will never be shown in their search results. This does not mean a no-index page gets no visitors: it can still get visitors indirectly from a page that links to it. It just gets no direct visitors from the search engines.
Suppose you want the search engines to index a page and also follow its links. Include the following command in the Meta Tag:
<meta name="robots" content="index, follow">
Suppose you want the search engines to index a page but not follow its links. Include the following command in the Meta Tag:
<meta name="robots" content="index, nofollow">
Suppose you do not want the search engines to index a page but do want them to follow its links. Include the following command in the Meta Tag:
<meta name="robots" content="noindex, follow">
Suppose you do not want the search engines to either index a particular page or follow its links. Include the following command in the Meta Tag:
<meta name="robots" content="noindex, nofollow">
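The four combinations above differ only in two yes/no choices, so they can be produced by a small helper. The following is a minimal Python sketch. The function name robots_meta_tag and its parameters are made up for this illustration and are not part of any library.

```python
def robots_meta_tag(index=True, follow=True):
    """Return a robots meta tag for the given index/follow choices.

    Hypothetical helper written for this article.
    """
    index_word = "index" if index else "noindex"
    follow_word = "follow" if follow else "nofollow"
    return '<meta name="robots" content="{}, {}">'.format(index_word, follow_word)


# Print all four combinations discussed above.
for index in (True, False):
    for follow in (True, False):
        print(robots_meta_tag(index, follow))
```

Paste the line you need into the page by hand, or call the helper from whatever script builds your pages.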
Note: Google keeps a "Cached" copy of every page it spiders. It is a snapshot of the page as it was when the robot visited. Want to stop Google from doing so? Add "noarchive" to the Meta Tag. For example, to block indexing, link following and caching together:
<meta name="robots" content="noindex, nofollow, noarchive">
Like any meta tag, the tags written above should be placed in the HEAD section of an HTML page:
<html>
<head>
<title>your title</title>
<meta name="description" content="your description.">
<meta name="keywords" content="your keywords">
<meta name="robots" content="index, follow">
</head>
<body>
<!-- your page content -->
</body>
</html>
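To check which robots instructions a page actually carries, you can read them back out of its HTML. Here is a minimal sketch in Python using the standard html.parser module. The class name RobotsMetaParser, the helper robots_directives and the sample page are assumptions made for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Split "noindex, follow" into {"noindex", "follow"}.
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for this illustration.
sample_page = """
<html>
<head>
<title>your title</title>
<meta name="robots" content="noindex, follow">
</head>
<body></body>
</html>
"""

found = robots_directives(sample_page)
print("May be indexed:", "noindex" not in found)
print("Links may be followed:", "nofollow" not in found)
```

A page with no robots meta tag yields an empty set, which this sketch treats the same as "index, follow".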
Creating robots.txt file:
A robots.txt file is an independent file and should be written in a plain text editor like Notepad. Do not use MS-Word or any other word processor to create robots.txt, because word processors add formatting that robots cannot read. The bottom line is that this file must be plain text with the extension ".txt", or else it will be useless.
Let's begin. Open Notepad (it comes free with Microsoft Windows) and save the file with the name "robots.txt". Make sure that the name is in lowercase and that the extension is .txt.
By the way, did you notice that we did not use the name of any robot in the meta tag? What does that indicate? Simple: with the robots meta tag shown above you direct all the search engines to do something or not do something on a page. You do not have control over any one search engine. The solution is robots.txt.
You may not want a particular search engine to index a page for certain reasons. In that case a robots.txt file will help, even though I do not recommend such a thing. The search engines get you traffic, so why hate them? Stop them from doing their job and they will hate you. I repeat: keep your pages smart for the search engines and welcome them. Fine, then why take the trouble to learn robots.txt? Why should you include a robots.txt file at all?
Let's suppose yours is a dynamic database site containing information about your newsletter subscribers and customers: their addresses, phone numbers etc. All this confidential information is kept in a separate directory called "admin". (It is recommended to keep such information in a separate directory. Handling the data will be easier for you, and so will keeping the search engines away. We will see how shortly.) I am sure you would never want any unauthorized person to visit this area, let alone the search engines. It does not help the search engines either, since they have nothing to do with the data or files there. This is where a robots.txt file comes in. Keep in mind that robots.txt only asks robots to stay away and can be read by anyone, so it is no substitute for password-protecting the directory.
Write the following in the robots.txt file:
User-agent: *
Disallow: /admin/
This tells the spiders not to visit anything in the admin directory, including its sub-directories, if any.
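You can test a rule like this before uploading it. Python's standard urllib.robotparser module reads robots.txt rules and answers whether a given robot may fetch a given URL. The following is a minimal sketch. The page names subscribers.html, old.html and index.html are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The same two lines as in the robots.txt file above.
rules = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

base = "http://www.your-domain.com"
for path in ("/admin/subscribers.html", "/admin/backup/old.html", "/index.html"):
    allowed = parser.can_fetch("Googlebot", base + path)
    print(path, "allowed" if allowed else "blocked")
```

Because the rule is written for "*", the answer is the same whichever robot name you pass to can_fetch.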
The asterisk (*) stands for all the search engines. How do you stop a particular search engine from spidering your files or directory?
Suppose you want to stop Excite from spidering this directory:
User-agent: ArchitextSpider
Disallow: /admin/
Suppose you want to stop Excite and Google from spidering this directory:
User-agent: ArchitextSpider
Disallow: /admin/

User-agent: Googlebot
Disallow: /admin/
Files are no different. Suppose you do not want a file datafile.html to be spidered by Excite:
User-agent: ArchitextSpider
Disallow: /datafile.html
Similarly, suppose you do not want it to be spidered by Google either:
User-agent: ArchitextSpider
Disallow: /datafile.html

User-agent: Googlebot
Disallow: /datafile.html
Suppose you want two files datafile1.html and datafile2.html not to be spidered by Excite:
User-agent: ArchitextSpider
Disallow: /datafile1.html
Disallow: /datafile2.html
Can you guess what the following means?
User-agent: ArchitextSpider
Disallow: /datafile1.html
Disallow: /datafile2.html

User-agent: Googlebot
Disallow: /datafile1.html
Excite will not spider datafile1.html or datafile2.html, while Google will skip only datafile1.html. Google will still spider datafile2.html and the rest of the files in the directory.
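The answer can be confirmed with Python's standard urllib.robotparser module. This is a minimal sketch. The helper may_fetch and the robot name SomeOtherBot are made up for this example.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: ArchitextSpider
Disallow: /datafile1.html
Disallow: /datafile2.html

User-agent: Googlebot
Disallow: /datafile1.html
"""


def may_fetch(robot, path, rules_text=rules):
    """Return True if the named robot may fetch the path under the rules."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(robot, "http://www.your-domain.com" + path)


for robot in ("ArchitextSpider", "Googlebot", "SomeOtherBot"):
    for path in ("/datafile1.html", "/datafile2.html"):
        verdict = "allowed" if may_fetch(robot, path) else "blocked"
        print(robot, path, verdict)
```

A robot that is not named in the file, and for which there is no "*" record, is allowed everywhere.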
Imagine you have a file kept in a sub-directory that you would not like to be spidered. What do you do? Let's suppose the sub-directory is "official" and the file is "confidential.html":
User-agent: *
Disallow: /official/confidential.html
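If you maintain many rules, you may prefer to generate the file with a script instead of typing it by hand. Here is a minimal Python sketch. build_robots_txt is a hypothetical helper written for this example, and it covers only the User-agent and Disallow lines used in this article.

```python
def build_robots_txt(rules):
    """Build robots.txt text from a mapping of robot name -> blocked paths.

    Hypothetical helper: one record per robot, records separated by a
    blank line.
    """
    records = []
    for robot, paths in rules.items():
        lines = ["User-agent: " + robot]
        lines.extend("Disallow: " + path for path in paths)
        records.append("\n".join(lines))
    return "\n\n".join(records) + "\n"


rules = {
    "ArchitextSpider": ["/datafile1.html", "/datafile2.html"],
    "*": ["/admin/", "/official/confidential.html"],
}

print(build_robots_txt(rules), end="")
```

Save the resulting text as a plain text file named robots.txt.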
If a command in your robots.txt file is not written correctly, the search engines will ignore that particular command. Before uploading the robots.txt file, double-check it for any possible errors. You should upload the robots.txt file to the ROOT directory of your server. The search engines look for the robots.txt file only in the root directory.
Note: You should be able to see your robots.txt file if you type the following in the address bar of your Internet browser:
http://www.your-domain.com/robots.txt
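Since the search engines look only in the root directory, the address of the robots.txt file for any page can be worked out from the page's own address. This is a minimal Python sketch using the standard urllib.parse module. The function name robots_txt_url is chosen for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt address for the site of page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.your-domain.com/official/confidential.html"))
```

However deep the page sits in sub-directories, the result always points at the root of the same site.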
Here is Google's robots.txt file:
http://www.google.com/robots.txt
All the major search engines follow robots.txt commands. Following them is voluntary, however, so a badly behaved robot can ignore the file.
You can look in your web server log files to see which search engine robots have visited. They all leave signatures that can be detected. These signatures are nothing but the names of their robots. For instance, if Google has spidered your site, your log file will contain entries that include the name Googlebot. This is how you know which search engine has spidered your pages and when!
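Looking for these names can be automated. The following minimal Python sketch counts visits per robot. It assumes each log line contains the visitor's user-agent text, as in the widely used "combined" access log format. The function count_robot_visits and the sample log lines are invented for this illustration, so adjust them to your own server's log format.

```python
from collections import Counter

# Robot names used in this article; extend the tuple with others you care about.
KNOWN_ROBOTS = ("Googlebot", "ArchitextSpider")


def count_robot_visits(log_lines, robot_names=KNOWN_ROBOTS):
    """Count log lines that mention each robot name (case-insensitive)."""
    visits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for name in robot_names:
            if name.lower() in lowered:
                visits[name] += 1
                break  # count each line for at most one robot
    return visits


# Invented sample lines for illustration only.
sample_log = [
    '192.0.2.1 - - [10/Oct/2005:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [10/Oct/2005:14:02:11 +0000] "GET /index.html HTTP/1.1" 200 2326 '
    '"-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '192.0.2.20 - - [10/Oct/2005:14:30:45 +0000] "GET /robots.txt HTTP/1.0" 200 68 '
    '"-" "ArchitextSpider"',
]

visits = count_robot_visits(sample_log)
for name, count in visits.items():
    print(name, count)
```

The timestamp on each matching line tells you when that robot visited.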
We are highly experienced in SEO/SEM/Pay Per Click Management. Please contact us regarding any query you may have.