Yahoo's official explanation of its search crawler

Category: Website Promotion | Posted by lao8 on 2008-7-7 15:29:00

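I haven't blogged for a long time. There is actually a lot worth writing down, but I have been too busy, and some search engine optimization topics are too sensitive to discuss openly. To thank the readers who keep following this blog, here are some details from Yahoo's official documentation about the Yahoo search spider.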

Here are excerpts of the most important points:

1. Avoid frames and dynamic content

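Yahoo's documentation says: "Yahoo! Slurp follows HREF links. It does not follow SRC links. This means that Yahoo! Slurp does not retrieve or index individual frames referred to by SRC links." In other words, the Yahoo crawler retrieves files linked through href but not content linked through src, and Yahoo gives frames as an example of content it does not retrieve. Yet another passage says: "Yahoo! Slurp has support for frames and makes an effort to crawl complex URLs such as those generated by forms, content generation systems, and dynamic page generation software." The two statements look contradictory, but they do show that Yahoo has been working on crawling src-linked pages and dynamic pages. The first explanation seems a bit strained to me, since images are also referenced through src. The logic of these explanations is somewhat muddled, so the only safe reading is that this kind of content takes Yahoo more effort to crawl.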
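
To make the href/src distinction concrete, here is a minimal Python sketch that sorts a page's links the way the first quote describes. It is only an illustration and says nothing about how Yahoo! Slurp is really implemented: the `LinkCollector` class, the `collect_links` helper, and the sample markup are all made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href targets and src targets into two separate lists."""

    def __init__(self):
        super().__init__()
        self.href_links = []  # links a crawler that follows HREF would visit
        self.src_links = []   # links a crawler that ignores SRC would skip

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "href":
                self.href_links.append(value)
            elif name == "src":
                self.src_links.append(value)


def collect_links(html):
    """Return (href_links, src_links) found in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.href_links, collector.src_links


# Hypothetical sample page: one ordinary link, one frame, one image.
SAMPLE = """
<a href="about.html">About</a>
<frame src="menu.html">
<img src="logo.png">
"""

followed, skipped = collect_links(SAMPLE)
print(followed)  # ['about.html']
print(skipped)   # ['menu.html', 'logo.png']
```

Under this reading, a page that is reachable only through a frame's src would need at least one ordinary href link pointing at it to be discovered.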

2. Never redirect pages that return a 404 error to your home page; doing so will only get your home page removed or demoted.

Why is your crawler asking for strange URLs that have never existed on my site?
Some web servers send a site navigation page or other response page with a "HTTP 200 OK" response instead of a "HTTP 404 Not Found" result for page-not-found conditions. To check on web server handling of page-not-found conditions, Yahoo! Slurp occasionally sends deliberately odd URLs built from random words to sites from which no 404 results have been seen. These URLs are built intentionally to not match any actual content at the site. We save information on the web server response to requests for non-existent pages so we can correctly recognize and remove obsolete URLs in our search database.

A Yahoo! Slurp check for 404 results from a web server consists of requests for up to 10 such URLs. The check for 404 behavior is not a normal part of Yahoo! Slurp site refresh, so such requests are rare.

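Roughly, this says that the Yahoo crawler occasionally requests random URLs that do not exist on our site in order to see how we handle 404 errors. My reading is that if all of these errors are redirected to the same page, the search engine will remove that page from its database. You can think of it as the search engine treating the redirect target as a useless page and deleting it.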

This may answer a question some webmasters have: why isn't my home page indexed?
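
The probing idea in the quoted text can be sketched in a few lines of Python. This is a rough illustration under my own assumptions, not Yahoo's code: random letter strings stand in for Yahoo's "random words", the limit of 10 probes is taken from the quote, and `fetch_status` is a hypothetical callable that returns the final HTTP status for a path (after following any redirects). The two fake sites are invented so the example runs without a network.

```python
import random
import string
from typing import Callable


def random_probe_path(rng: random.Random, words: int = 3) -> str:
    """Build a deliberately odd path that should not exist on any real site."""
    parts = [
        "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        for _ in range(words)
    ]
    return "/" + "-".join(parts) + ".html"


def never_returns_404(
    fetch_status: Callable[[str], int], probes: int = 10, seed: int = 0
) -> bool:
    """Request `probes` nonexistent paths; True means no 404 was ever seen."""
    rng = random.Random(seed)
    statuses = [fetch_status(random_probe_path(rng)) for _ in range(probes)]
    return 404 not in statuses


def honest_site(path: str) -> int:
    # Hypothetical site that answers 404 for anything it does not have.
    return 200 if path in {"/", "/index.html"} else 404


def redirecting_site(path: str) -> int:
    # Hypothetical site that sends every unknown URL to the home page,
    # so the final status the crawler sees is always 200.
    return 200


print(never_returns_404(honest_site))       # False
print(never_returns_404(redirecting_site))  # True
```

To check your own site, you would swap the fake functions for a real HTTP request and confirm that a made-up URL comes back with status 404.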

3. The difference between noarchive, noindex, and nofollow:

Our search engine contains snapshots of the majority of pages discovered during the crawl of the Web. We link to the cached page so the user can click it if the original site's server is down. When you view a cached page you see it as it looked when Yahoo! Slurp last crawled it, with search terms highlighted.

If you do not want your content to be accessible through the cache link, you can use either of these methods to instruct robots not to archive the page:

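When it crawls, the Yahoo crawler creates a snapshot (cached copy) of our pages. You can use the noarchive attribute to stop the crawler from creating the snapshot. (Note: the page is still indexed normally, just without a snapshot.)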

Important: NOARCHIVE only removes the cached page. To prevent the page content from being indexed, use:

noarchive only stops the crawler from creating a snapshot. To keep the search engine from indexing the current page at all, use the noindex attribute, as follows:

<META NAME="robots" CONTENT="noindex">
   or
<META NAME="Slurp" CONTENT="noindex">

To prevent the crawler from following links, use the nofollow attribute:

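<a href="http://spammer.example.com/" rel="nofollow">buy now</a>
This stops the search engine from following this one link.
<meta name="robots" content="index,nofollow">
This tells the search engine not to follow any of the links on the current page.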
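
To keep the three directives apart, here is a small Python sketch that turns the CONTENT value of a robots meta tag into three yes/no answers. It only models the distinction described above; the `parse_robots_meta` function is hypothetical and is not taken from any search engine.

```python
def parse_robots_meta(content: str) -> dict:
    """Interpret a robots meta CONTENT value such as "index,nofollow".

    Each directive controls one thing only:
      noindex   -> the page itself is not indexed
      nofollow  -> links on the page are not followed
      noarchive -> no cached snapshot is offered
    Anything not explicitly forbidden stays allowed.
    """
    tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
    return {
        "index": "noindex" not in tokens,
        "follow": "nofollow" not in tokens,
        "archive": "noarchive" not in tokens,
    }


print(parse_robots_meta("noarchive"))
# {'index': True, 'follow': True, 'archive': False}
print(parse_robots_meta("index,nofollow"))
# {'index': True, 'follow': False, 'archive': True}
```

The first result shows the point made earlier: noarchive alone leaves the page indexed and its links followed.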

4. Yahoo lets you mark non-essential content with the class=robots-nocontent attribute

Web pages often include headers, footers, navigational sections, repeated boilerplate text, copyright notices, ad sections, or dynamic content that is useful to users — but not to search engines. Webmasters can apply the "robots-nocontent" attribute to indicate to search engines any content that is extraneous to the main unique content of the page. Yahoo! Search observes the class="robots-nocontent" present on XHTML elements, such as div, span, and all others.

This attribute offers webmasters a great deal of flexibility. You can use the "class=robots-nocontent" attribute with all XHTML tags.

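Web pages usually contain headers, navigation, footer copyright notices, and repetitive or constantly changing dynamic content, none of which matters much to search engines. You can mark this non-essential content with class=robots-nocontent so that the important information stands out. The attribute is very flexible and can be used on any XHTML tag.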
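
As a rough picture of what the attribute asks a search engine to do, the Python sketch below extracts a page's text while skipping everything inside an element marked robots-nocontent. It is my own simplification, assuming well-formed markup; the `MainTextExtractor` class, the `main_text` helper, and the sample page are invented for the example and do not describe Yahoo's actual processing.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class MainTextExtractor(HTMLParser):
    """Keeps text that is outside any element marked robots-nocontent."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a robots-nocontent element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or "robots-nocontent" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def main_text(html):
    """Return the text chunks that are not marked as extraneous."""
    extractor = MainTextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.chunks


# Hypothetical page: navigation and footer are marked, the article is not.
PAGE = """
<div class="robots-nocontent"><p>Site navigation</p><span>Ads</span></div>
<div><h1>Article title</h1><p>Main body text.</p></div>
<p class="footer robots-nocontent">Copyright notice</p>
"""

print(main_text(PAGE))  # ['Article title', 'Main body text.']
```

Note that the class can sit next to other class names, as in the footer paragraph of the sample.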

5. We can use Crawl-delay to control the Yahoo crawler's request rate

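I (Lao8) once had Yahoo's spiders crawl a server to death: a large number of Yahoo crawlers fetched pages heavily for a long time until the server froze. Yahoo's official documentation gives a fix, excerpted below: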

For example, a robots.txt rule to set a crawl-delay of 5 for Yahoo! Slurp looks like:

User-agent: Slurp
Crawl-delay: 5

A shorter delay value of 0.5 would look like:

User-agent: Slurp
Crawl-delay: 0.5

However, setting this value in robots.txt holds the crawler back to some extent, so I don't recommend using it unless your server really can't cope.
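
If you want to see how a well-behaved crawler would read such a rule, Python's standard `urllib.robotparser` module can parse it. The sketch below is only a quick check on the whole-number example, with the two rule lines written on consecutive lines; the `crawl_delay_for` helper and the "OtherBot" name are my own, and how Yahoo itself interprets the value is not something this code can tell you.

```python
from urllib.robotparser import RobotFileParser

# The rule for Yahoo! Slurp, with the two lines kept together in one record.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 5
"""


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay that applies to user_agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


print(crawl_delay_for(ROBOTS_TXT, "Slurp"))     # 5
print(crawl_delay_for(ROBOTS_TXT, "OtherBot"))  # None
```

Because the rule is scoped to Slurp, other crawlers are unaffected, which is worth remembering if only one spider is overloading the server.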

I've been writing for a long time, so that's all for today. If you're interested, keep following Lao8's posts.

Reader comments
anliu 2008-7-7 20:28:00
It really has been a long time since the last update!
花果山寨 2008-7-7 22:07:00
They're all about the same.
猎户星 2008-7-8 13:14:00
Finally, an update.
