Tuesday, February 28, 2006
What's The Point of A Robots.txt File If Google Ignores It?:
"...A few days ago, Rand Fishkin of SEOMOZ pointed out how Google and other engines often ignore the robots.txt files we place in the root of our sites.
My question today revolves around this same issue.
I noticed today that Google is indexing my images folder, even though I explicitly prevent ALL SEARCH ENGINE SPIDERS from indexing that folder for various reasons. I have had this robots.txt file in the root of my site since the day it was launched and am quite annoyed and frustrated with Google for ignoring it and indexing the contents of the folder anyway.
First of all, I am curious whether this might constitute a violation of copyright law, as some of the contents of the images folder are the website owner's copyrighted material and he does not want his images displayed in Google Images.
Furthermore, does anyone have any idea why Google continues to do this and how one can actually prevent the spiders from indexing the folders you specify in the robots.txt file?
...
They obey any of them that I construct. Perhaps take a look at it again and make sure that you have done it properly: www.robotstxt.org
...
Not to be too blunt, but I've been in the SEO industry for over 5 years now and, along with many SEO professionals, seem to encounter this quite often.
My robots.txt file is perfect - just amazing how Google does what they want.
...
robots.txt prevents compliant robots from reading content at the URL you specify. It does not, in and of itself, stop the URL being indexed - just the content at that URL.
Google supports the indexing of URLs without content. So you will sometimes see results in SERPs that contain no title, no snippet, no cache, no date, no size ... just a link to a URL. robots.txt may prevent the content at these URLs from being read, but it does not prevent the URLs being indexed.
In practice, Google will remove a protected URL shortly after it tries to retrieve the content at that URL and is prevented from doing so by robots.txt.
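The point of this reply, that a Disallow rule governs fetching rather than listing, can be sketched with Python's standard urllib.robotparser module. The robots.txt body, the www.example.com URLs, the "ExampleBot" agent name, and the may_fetch helper are all assumptions made up for illustration; they are not the original poster's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks every robot from /images/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if robots.txt lets `agent` download the content at `url`."""
    return parser.can_fetch(agent, url)


# A compliant robot asks this question before downloading a page.
print(may_fetch("http://www.example.com/images/photo.jpg"))
print(may_fetch("http://www.example.com/index.html"))
```

Note that the answer only says whether the content may be downloaded. Nothing in the file tells a search engine to forget a URL it has learned about from links on other pages, which is why bare URL-only listings can still appear.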
...
To take Alan's comments a step further:
robots.txt says "don't download content that matches this", and a robots metatag says "Don't have this content in your index". I think we need to be clear on whether Google shows a link to a file, or actually comes and downloads a file. A robots.txt disallow makes the former acceptable, but not the latter.
Google does have, however, a tool that allows you to delete URLs you don't want indexed by reusing your robots.txt file: http://services.google.com:8882/url...d&lastcmd=login (may have to hit refresh).
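The robots metatag mentioned in this reply lives in the page itself, so a robot has to download the page to see it. As a rough sketch of how a crawler might read that tag, using only Python's standard html.parser module (the RobotsMetaParser class, the is_noindex helper, and the sample HTML are hypothetical names chosen for this example):

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def is_noindex(html: str) -> bool:
    """Return True if the page asks robots not to keep it in their index."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(is_noindex(page))
```

One consequence worth keeping in mind: if the same URL is also disallowed in robots.txt, a compliant robot never downloads the page and so never gets to read the metatag.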
...
From Google:
To save bandwidth, Googlebot only downloads the robots.txt file once a day or whenever we've fetched many pages from the server. So, it may take a while for Googlebot to learn of changes to your robots.txt file. Also, Googlebot is distributed on several machines. Each of these keeps its own record of your robots.txt file.
We always suggest verifying that your syntax is correct against the standard at http://www.robotstxt.org/wc/exclusion.html#robotstxt. A common source of problems is that the robots.txt file isn't placed in the top directory of the server (e.g., www.myhost.com/robots.txt); placing the file in a subdirectory won't have any effect.
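Because the file only counts at the top of the host, the location a robot checks can be derived from any page URL by discarding the path. A minimal sketch with Python's standard urllib.parse module follows; the robots_url helper and the sample page path are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the only robots.txt location that applies to `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host, drop the page's path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.myhost.com/shop/items/page.html"))
```

A file saved at a deeper path such as /shop/robots.txt is never requested under this scheme, which matches the "won't have any effect" warning above.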
Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do. For example, consider the following robots.txt file:
User-Agent: *
Allow: /
Disallow: /cgi-bin
It's obvious that the webmaster's intent here is to allow robots to crawl everything except the /cgi-bin directory. Consequently, that's what we do."
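The difference between the two rule-selection strategies described in Google's note can be sketched in a few lines of Python. This is a simplified illustration using plain prefix matching on the example file above, not Googlebot's actual implementation; the RULES list, the function names, and the tie-breaking choice are assumptions.

```python
# Rules in file order, as (directive, path prefix) pairs,
# taken from the example robots.txt above.
RULES = [("allow", "/"), ("disallow", "/cgi-bin")]


def first_match(path, rules=RULES):
    """Obey the first rule in file order whose prefix matches the path."""
    for directive, prefix in rules:
        if path.startswith(prefix):
            return directive == "allow"
    return True  # no rule applies, so crawling is allowed


def longest_match(path, rules=RULES):
    """Obey the matching rule with the longest (most specific) prefix."""
    applicable = [(d, p) for d, p in rules if path.startswith(p)]
    if not applicable:
        return True
    # Assumption: on equal lengths the earlier rule in the file wins.
    directive, _ = max(applicable, key=lambda rule: len(rule[1]))
    return directive == "allow"


for path in ("/index.html", "/cgi-bin/search"):
    print(path, first_match(path), longest_match(path))
```

Under first-match, "Allow: /" matches every path and the Disallow line is never reached, so /cgi-bin/search would be crawlable. Under longest-match, "/cgi-bin" is the more specific rule and the path is blocked, which is the behaviour the note describes.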