Blue Hat Technique #10-Teaching The Crawlers To Run
One thing that can be learned only by running quite a few websites at once is how differently the bots treat different sites. One of the biggest differences is how often they pull your pages and how often they update your site in the index. One day while browsing through my various stats, I noticed that certain sites get updated in the indexes daily while some get updated monthly. Some sites that only have about 1,000 links get hit by Googlebot 700 times/day, while others with over 20,000 links only get hit about 30 times/day. This inspired me to begin an experiment.
The Experiment
Being one of the few who paid attention in junior high science class, I did this test the right way and put on a white lab coat (just kidding, but wouldn’t that be cool? Where do you buy those things?). My constants were simple. Each site was a brand new domain with similar keywords, similar competition, and similar searches/day. Each site had extremely similar content and used the same template. I also pointed exactly 10 links from the same sites to each site. My variables were also simple. Each site was automatically updated with new pages and new content at random times; the only difference was how many times in one day each was updated.
Site 1-Updated 1 time/day
Site 2-Updated 3 times/day
Site 3-Updated 5 times/day
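A schedule like the one above can be sketched by picking a fresh set of random publish times for each site every day. A minimal illustration in Python (the site names and update counts mirror the experiment; everything else is an assumption for demonstration):

```python
import random

def random_update_times(updates_per_day, seed=None):
    """Pick N distinct random minutes of the day at which to publish
    new content, returned as sorted HH:MM strings."""
    rng = random.Random(seed)
    minutes = sorted(rng.sample(range(24 * 60), updates_per_day))
    return [f"{m // 60:02d}:{m % 60:02d}" for m in minutes]

# One daily schedule per test site, matching the 1/3/5 setup.
for site, n in [("Site 1", 1), ("Site 2", 3), ("Site 3", 5)]:
    print(site, random_update_times(n))
```

Re-running this each day gives you updates that are consistent in count but random in timing, which is exactly the variable being tested.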
Hypothesis
The crawlers behave differently depending on how often the site is updated. The indexes will update more or less frequently depending on how often the site is updated.
Time Frames
I let the sites sit for one month, closely monitoring each site and its progress every day.
Spider Hits After First Month
Site 1- MSN: 214 Google: 184 Inktomi: 226
Site 2- MSN: 478 Google: 523 Inktomi: 391
Site 3- MSN: 1170 Google: 957 Inktomi: 514
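Spider-hit counts like these come straight out of the raw access logs. A minimal sketch of tallying them by User-Agent substring (the sample log lines are made up for illustration; a real run would read a file like access.log):

```python
from collections import Counter

# Hypothetical Apache-style combined log lines standing in for a real log file.
LOG_LINES = [
    '66.249.66.1 - - [10/Jun/2005:06:25:12 -0700] "GET /a.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '207.46.13.5 - - [10/Jun/2005:07:02:44 -0700] "GET /b.html HTTP/1.1" 200 734 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '66.249.66.1 - - [10/Jun/2005:08:11:03 -0700] "GET /c.html HTTP/1.1" 200 815 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

# Substrings that identify each crawler in the User-Agent header.
BOTS = {"Googlebot": "Google", "msnbot": "MSN", "Slurp": "Inktomi"}

def count_bot_hits(lines):
    """Tally hits per search-engine crawler by User-Agent substring."""
    counts = Counter()
    for line in lines:
        for needle, name in BOTS.items():
            if needle in line:
                counts[name] += 1
    return counts

print(count_bot_hits(LOG_LINES))
```

Run daily against the day's log slice, this gives per-engine hit counts comparable to the table above.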
Time Frames
Then I monitored the sites for 6 months.
Cache Update Averages After 6 Months
Site 1- MSN: 1.52 times/month Google: 1.4 times/month
Site 2- MSN: 18.24 times/month Google: 4.1 times/month
Site 3- MSN: 21.70 times/month Google: 13.4 times/month
*Yahoo excluded because it’s tougher to tell cache times and date stamps vs. cached pages/title changes.
I also tracked the percentage of actual pages that were indexed across Google, MSN, and Yahoo.
Site 1-57%
Site 2-81%
Site 3-83%
Conclusion
It is understood that spiders will hit your site for four primary reasons. First, validating a link from another site. Second, checking for changes to your site. Third, reindexing your site. Fourth, pulling robots.txt. With the first and fourth factors neutralized, we can assume the update and spider stats are due to the second and third reasons.
Practical Use
I understand from this experiment that if you keep your updates consistent, even at random times, it will force the bots to revisit your site more often. They will all start visiting your site at consistent intervals depending on your number of links. Once they start to build a rhythm of how often your content changes, they will adapt and start visiting more. Once they build that rhythm into their timing, they will update your site in the indexes accordingly.
Therefore a theory can be built: crawlers are designed to accommodate your site and the practices of the webmaster. Thus, you can train the crawlers to how your site operates, and this will result in differences in performance in the indexes.
Flaws In The Experiment
Upon factoring the final results, I wish I had overdone it with a fourth site, updating it 100 or 1,000 times a day, to see if it performed better or worse than Site 3. The second flaw falls into the category of seasonal changes. I did this experiment between June 2005 and January 2006, and the engines could have been acting differently during that time. I know for a fact that MSN was, because it was so new.
Very interesting information! Question: were these hand-built sites or auto-generated content?
Do robots treat blogs differently from “traditional” more static sites? Or are the sites treated the same, only crawled more frequently because they’re updated regularly?
Great site, BTW.
Great question, George. They were auto-generated content, but put into static pages. They weren’t blog sites, however. I do think robots treat blogs differently than traditional static sites, but that is only because blogs are updated at more random intervals than larger sites. Blogging and pinging has its effects as well.
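For reference, the "ping" half of blog-and-ping is a `weblogUpdates.ping` XML-RPC call announcing that your blog has new content. A minimal sketch in Python (the endpoint URL and blog details are illustrative; `send_ping` would only be called after actually posting):

```python
import xmlrpc.client

# Classic ping endpoint shown for illustration; ping services vary.
PING_ENDPOINT = "http://rpc.weblogs.com/RPC2"

def build_ping(blog_name, blog_url):
    """Serialize a weblogUpdates.ping XML-RPC request body."""
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")

def send_ping(blog_name, blog_url, endpoint=PING_ENDPOINT):
    """Send the ping over HTTP and return the server's response struct."""
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.weblogUpdates.ping(blog_name, blog_url)

print(build_ping("My Blog", "http://example.com/blog"))
```

The ping service then notifies aggregators (and, indirectly, crawlers) that the blog URL changed, which is why pinging can pull spiders in quickly after an update.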
Thanks for the response.
I’ve been reading about the blog/ping cycle lately (just getting started with the technical aspects of SEO — no hat yet) and I’m simply not clear on it. Could you do a post about blog/ping?
There are only two benefits that I see:
1. IF it works, you can get new pages indexed fast by blogging a link and then pinging.
2. You can POSSIBLY give your sites worthwhile links by blogging links then pinging.
I’ve read that this is “dead” (definition: anything I’ve heard about — Capri pants, the Decembrists, blogging/pinging) as a technique. What’s your take?
Can you tell me how to build a self-updating website?
thank you for the great info
deeb basheer
Sure, Deeb,
You will need some experience coding either CGI or PHP. Basically, you just write all your content and put it into a database, then write a script to pull one of the sections of content and feed it into the main page. The other way of doing it is to create the pages and then cycle links to them on the main page on a schedule. Creating a cronjob (a scheduled server event) will be needed.
Got your cool ass lab coat for you. Just hit me with the size. The wife works for Clinique and they go with the “laboratory” look.
They sell to their employees at $200+/coat but for you, my friend, $0.
Worth every penny for all the sweet advice from an evil genius. I’ve only been here about an hour and you have already taught me a trick or two. Are any of the methods discussed on this site your favorite?
Hehe, I have no idea what labcoat size I am. I wear a men’s large shirt, if that helps. Labcoats are badass; I’d totally wear one all the time. I’d be one of those creepy scientists. So if anyone has an evil-looking labcoat to hook me up with, you can mail it to my office at the BlueHatSEO.com whois info.
thanks for the compliments by the way. Feel free to visit anytime.
These days, blogs that release a new post get that post indexed in literally less than an hour!
From your experiment, is it safe to say that putting a blog, mydomain.com/blog for example, in my non-blog website improves indexing?
Great stats, man, keep it up…
According to my website reporting of crawler hits below, it has slowed considerably. What do you think is the cause and how can I remedy this? Thanks so much!
Crawler Hits
June 2008 104
May 2008 151
April 2008 0
March 2008 149
February 2008 136
January 2008 128
December 2007 185
November 2007 160
October 2007 153
September 2007 212
August 2007 277
July 2007 580
June 2007 685
May 2007 11
April 2007 791
March 2007 1201
February 2007 948
January 2007 911
December 2006 746
November 2006 460
October 2006 472
September 2006 796
August 2006 1118
July 2006 673
June 2006 820
Cool experiment. I have noticed it myself too, but I’m not one for running experiments. Too lazy to start, so I end up waiting for others and then reading about their results.
Thanks again, Eli. I have read 4 articles so far and am still craving more.
A really interesting and very useful article. How do these figures add up today, seeing as the experiment was posted more than 3 years ago?
Also, like someone mentioned, how do search engines treat blogs vs. “normal” pages? Sure, pinging and such have their effects, but are those positive or negative? As for the trackback and pingback protocols, are these links treated as “real” links in the eyes of a search engine?
Keep up the good work Eli!
Through the many comments on your site, I have learned that the site is extremely good at offering the latest information.
This is a good experiment to try. Lately Google hasn’t updated my blog for a while.
hi,
Eli, Very Nice Post Wow!
I think I am just having some problems with subscribing to the RSS feed here.
Thanks, I like your blog very much. I come back most days to find new posts like this.
Yes, I agree too. Anyway, thanks for sharing!
I do agree with all of the ideas you have presented in your post. They’re really convincing and will definitely work. Still, the posts are too short for newbies. Could you please extend them a bit from next time? Thanks for the post.
Great post, thanks Blue Hat. Is the content unique or just scraped from other sites?
Does anyone have any example of this in action?
I submitted my site with both programs (demos) and got about 10 successful submissions with Promosoft and about 70 with Robosoft. (Btw, the demo from Robosoft is great, same as the full version with a 30-day limit.)
Obviously there are many other factors, but perhaps the SEs see these links as low quality (or spam)?
Nice post. This post explains it very well.
I’ve done a little research on linking, and at this moment I agree with you.