Complete Guide To Scraping Pt. 2 - Crawling
Well I hope everyone had a great thanksgiving. I love them turkey birds! I love them stuffed. I love them covered in gravy. I love the little gobbling noises they make.
Back to business. By now you should have at least a decent understanding of what scraping is and how to use it. We just need to continue on to the next most obvious step: crawling. A crawler is a script that simply makes a list of all the pages on a site you would like to scrape. Creating a decent and versatile crawler is of the utmost importance. A good crawler will not only be thorough but will also weed out a lot of the bullshit big sites tend to have. There are many different methods of crawling a site; it really is limited only by your imagination. The one I'm going to cover in this post isn't the most efficient, but it is very simple to understand and thorough.
Since I don't feel like turning this post into a MySQL tutorial, I whipped up some quick code for a crawler script that will make a list of every page on a domain (supports subdomains) and put it into a return-delimited text file. Here is an example script that will crawl a website and make an index of all the pages. For you master coders out there: I realize there are more efficient ways to code this (especially the file scanning portion), but I was going for simplicity. So bear with me.
The Script
Crawler.cgi
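Here's a minimal sketch of the idea, assuming LWP::Simple and HTML::LinkExtor as described in the methodology below; the example.com domain, the page cap, and the file names are placeholders you'll want to change.

#!/usr/bin/perl
# crawler.cgi -- a minimal sketch; domain, cap, and file names are placeholders
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::LinkExtor;
use URI;

my $domain        = 'example.com'; # use 'www.example.com' to exclude subdomains
my $crawl_dynamic = 0;             # 1 = follow URLs with query strings (careful!)
my $max_pages     = 1000;          # always cap the number of pages you index

print "Content-type: text/plain\n\n";   # we run as a CGI script

my $start = "http://$domain/";
my %seen  = ($start => 1);
my @queue = ($start);

while (@queue and scalar(keys %seen) < $max_pages) {
    my $url  = shift @queue;
    my $html = get($url) or next;       # skip pages that fail to fetch

    # The base URL argument makes every extracted link absolute.
    my $extor = HTML::LinkExtor->new(undef, $url);
    $extor->parse($html);

    for my $link ($extor->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' and defined $attr{href};        # anchors only
        my $uri = URI->new($attr{href})->canonical;
        next unless $uri->scheme and $uri->scheme =~ /^https?$/;
        next unless $uri->host =~ /\Q$domain\E$/i;              # stay on the site
        next if $uri->path =~ /\.(jpe?g|gif|png|swf|js|css)$/i; # no images/flash/js/css
        next if !$crawl_dynamic and defined $uri->query;        # skip dynamic URLs
        $uri->fragment(undef);
        my $page = $uri->as_string;
        next if $seen{$page}++;                                 # duplicate check
        push @queue, $page;
        print "$page\n";
    }
}

# Dump the finished index into a return-delimited text file.
open my $fh, '>', 'pages.txt' or die "can't write pages.txt: $!";
print $fh "$_\n" for keys %seen;
close $fh;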
How To Use
Copy and paste the code into Notepad and save it as crawler.cgi. Change the variables at the top. If you would like to exclude all the subdomains on the site, include the www. in front of the domain. If not, then just leave it as the domain. Be very careful with the crawl dynamic option. With crawl dynamic on, certain sites will cause this script to run for a VERY long time. In any crawler you design or use, it is also a very good idea to set a limit on the maximum number of pages you would like to index. Once this is completed, upload crawler.cgi into your hosting's cgi-bin in ASCII mode. Set the chmod permissions to 755. Depending on your current server permissions, you may also have to create a text file in the same directory called pages.txt and set its permissions to 666 or 777.
The Methodology
Create a database- Any database will work. I prefer SQL, but anything will work. A flat file is great because it can be used later on anything, including Windows apps.
Specify the starting URL you would like to crawl- In this instance the script will start at a domain. It can also index everything under a subpage as long as you don't include the trailing slash.
Pull the starting page- I used the LWP::Simple module. It's easy to use and easy to get started with if you have no prior experience.
Parse for all the links on the page- I use the HTML::LinkExtor module, which ships with the HTML::Parser distribution and works hand in hand with LWP. It will take content from the LWP call and generate a list of all the links on the page. This includes links made on images.
Remove unwanted links- Be sure to remove any links it grabs that are unwanted. In this example I removed links to images, flash, JavaScript files, and CSS files. Also be sure to remove any links that don't exist outside of the specified domain. Test and retest your results on this. There are many more you will find that need to be removed before you actually start the scraping process. It is very site dependent.
Check your database for duplicates- Scan through your new links and make sure none already exist in your database. If they do, remove them.
Add the remaining links to your database- In this example I appended the links to the bottom of the text file.
Rinse and repeat- Move to the next page in your database and do the same thing. In this instance I used a while loop to cycle through the text file until it reaches the end (see the sketch just below). When it finally reaches the end of the file the script is done, and it can assume every crawlable page on the site has been accounted for.
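To make that last step concrete, here's a rough sketch of the cycle against the flat file. The links_on_page helper is a hypothetical stand-in (the LinkExtor code above shows one way to fill it in), and rescanning the whole file on every pass is exactly the inefficient-but-simple part I warned you about:

use strict;
use warnings;

# Hypothetical helper: return the filtered, absolute links found on $url.
# Wire in the LWP/LinkExtor code from the script above.
sub links_on_page { return () }

my $file   = 'pages.txt';
my $cursor = 0;                        # which line of the file we're on

while (1) {
    open my $in, '<', $file or die $!;
    chomp(my @pages = <$in>);
    close $in;
    last if $cursor >= @pages;         # hit the end: every page accounted for

    my $url  = $pages[$cursor++];
    my %have = map { $_ => 1 } @pages; # everything already in the index

    open my $out, '>>', $file or die $!;
    for my $link (links_on_page($url)) {
        next if $have{$link}++;        # duplicate check
        print $out "$link\n";          # append new links to the bottom
    }
    close $out;
}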
This method is called the pyramid crawl. There are many different methods of crawling a website. Here are a few to give you a good idea of your options.
Pyramid Crawl
It assumes the website flows outward in an expanding fashion, like an upside-down pyramid. It starts with the initial page, which has links to pages 2, 3, 4, etc. Each one of those pages has more pages that it links to. They may also link back up the pyramid, but they also link further down. From the starting point, the pyramid crawl works its way down until every building block in the pyramid contains no unaccounted-for links.
Block Crawl
This type of crawl assumes a website flows in levels, or "stages." It takes the first level (every link on the main page) and creates an index for it. It then takes all the pages on level 1 and uses their links to create level 2. This continues until it has reached a specified number of levels. This is a much less thorough method of crawling, but it accomplishes a very important task. Let's say you wanted to determine how deep your backlink is buried in a site. You could use this method to say your link is located on level 3 or level 17 or whatever. You could use this information to determine the average link depth of all your site's inbound links.
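A quick sketch of a block crawl, again leaning on a hypothetical links_on_page helper; it records the level on which each page is first found:

use strict;
use warnings;

sub links_on_page { return () }   # stub: see the LinkExtor code above

my $start      = 'http://example.com/';   # placeholder
my $max_levels = 5;

my %level   = ($start => 0);
my @current = ($start);

for my $depth (1 .. $max_levels) {
    my @next;
    for my $page (@current) {
        for my $link (links_on_page($page)) {
            next if exists $level{$link};  # first sighting wins
            $level{$link} = $depth;
            push @next, $link;
        }
    }
    @current = @next;
    last unless @current;                  # ran out of new pages early
}

# How deep is a given backlink buried? Look it up:
# print $level{'http://example.com/partners.html'}, "\n";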
Linear Crawl
This method assumes a website flows as a series of linear links. You take the first link on the first page and crawl it. Then you take the first link on that page and crawl it. You repeat this until you reach a stopping point. Then you take the second link on the first page and crawl it. In other words, you work your way linearly through the website. This is also not a very thorough process, though it can be with a little work; for instance, on your second cycle you could take the second link from the last page instead of the first and work your way backwards. However, this crawl also has its purpose. Let's say you wanted to determine how prominent your backlink is on a site. The sooner your linear crawl finds your link, the more prominently it can be assumed the link is placed on the website.
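A bare-bones sketch of the forward version as a depth-first walk, same hypothetical helper as above; the earlier your link shows up in the visit order, the more prominently it's placed:

use strict;
use warnings;

sub links_on_page { return () }   # stub: see the LinkExtor code above

my $start = 'http://example.com/';      # placeholder
my @stack = ($start);
my %seen  = ($start => 1);
my $order = 0;

while (@stack) {
    my $page = pop @stack;
    $order++;                           # visit position = rough prominence
    # stop here if $page is the backlink you're measuring

    my @fresh = grep { !$seen{$_}++ } links_on_page($page);
    push @stack, reverse @fresh;        # reverse so the first link is crawled next
}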
Sitemap Crawl
This is exactly what it sounds like. You find their sitemap and crawl it. This is probably the quickest crawl method you can do.
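A tiny sketch, assuming the site publishes a standard XML sitemap at /sitemap.xml; a quick regex over the <loc> entries does the job:

use strict;
use warnings;
use LWP::Simple qw(get);

my $xml = get('http://example.com/sitemap.xml')   # placeholder domain
    or die "couldn't fetch the sitemap\n";

my @pages = $xml =~ m{<loc>\s*(.*?)\s*</loc>}gs;  # pull every URL out

open my $fh, '>', 'pages.txt' or die $!;
print $fh "$_\n" for @pages;
close $fh;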
Search Engine Crawl
Also very easy. You just crawl all the pages they have listed under the site: command in the search engine. This one has its obvious benefits.
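A rough sketch: pull a site: results page and keep any links that point back at the target domain. Treat it as fragile; the results markup changes, the engine may block scripted queries, and a full index means walking through every page of results.

use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $domain = 'example.com';   # placeholder
my $ua = LWP::UserAgent->new(agent => 'Mozilla/5.0');  # the default agent gets blocked

my $res = $ua->get("http://www.google.com/search?q=site:$domain&num=100");
die 'query failed: ' . $res->status_line . "\n" unless $res->is_success;

my %pages;
my $extor = HTML::LinkExtor->new(undef, $res->base);
$extor->parse($res->decoded_content);

for my $link ($extor->links) {
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' and defined $attr{href};
    my $uri = URI->new($attr{href});
    next unless $uri->scheme and $uri->scheme =~ /^https?$/;
    next unless $uri->host =~ /\Q$domain\E$/i;   # result links on our domain only
    $pages{$uri->as_string} = 1;
}

print "$_\n" for keys %pages;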
Black Hatters: If you're looking for a sneaky way to get by that pesky little duplicate content filter, consider doing both the Pyramid Crawl and the Search Engine Crawl and then comparing your results; the pages that turn up in the Pyramid Crawl but not in the Search Engine Crawl presumably haven't been indexed yet.
For those of you who are new to crawling you probably have a ton of questions about this. So feel free to ask them in the comments below and the other readers and I will be happy to answer them the best we can.
When you do crawl a site with this CGI program, do you normally run it through an article rewriter and republish it, or republish sections, or what?
What do you think of the article rewriter programs? Do they work? Are they worth it?
Below is one such program I was considering. I have a friend who did buy this one, but he has been busy and has not used it much yet, so no reports. By the way, this is not an affiliate link!
http://www.websitecontentwizard.com/wbctwz/14SecondReport.php
This is off-topic, but I was researching domain names and found a site that shows up in the defunct .gb domain search by adding an ampersand:
http://www.google.com/search?q=site%3Agb&ie=utf-8&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a
Do you realize the power of this?
You could have your site end up in a site:edu or site:gov search, although this could just be a frontend glitch.
I’m not sure if it would rank better either, since most edu and gov sites have been well established.
Hi, it's Joe Cracker again
Yea, this is too much work. And you kinda brushed through the steps. If I were a newbie I wouldn't have a clue what to do, nor would I make much money. Like your last article regarding the screensavers and the spyware installation, your idea kinda sucks.
So after you scrape and steal other people’s content where does the money come in? Huh? Does it magically appear in your bank? Or at your door? Or in your computer? If I’m gonna be a content thief I hope to get paid so maybe when my ass gets sued I’ll have enough to settle..lol
Why not use myspace? I make so much money with that. I run my program a few times and p00f 100’s of dollars in my account. I be raking in dough.
It is so easy, too. I wish everything in life were that easy. Just last month, I made a ton of cash.
-Cracker
Welcome to advanced SEO; we take no prisoners.
So Joe Cracker, would you please let us all know the technique and system you use on MySpace to make so much money? Please do so, as we would all appreciate that.
lol
Joe Cracker you are my new favorite reader!
Well buddy, I guess the best advice I can give you is to stick with what you know.
To answer your rivetingly brilliant economic question, I'll attempt to explain it visually:
content-> traffic-> advertisers-> money
BTW I don’t censor the comments. So if your comment doesn’t show up immediately, it usually means it got caught in the spam filter. I will eventually retrieve it. There is no need to panic or continue attempting to post it.
Joe Cracker, we don't want to see how easy you made the money. If you could give hints on what you are promoting or how you use MySpace, we would appreciate it :)
haha.. Joe Cracker is such a noob. I'm pretty sure he was being sarcastic about the MySpace thing too, but wow… would it hurt to be a bit funny-sarcastic and not so much negative-I-have-nothing-to-contribute sarcastic?
Hey, today I discovered that in my company there is a big printer that also does scan-to-email… I can put a big pile of papers in the machine and it will scan everything in less than one minute. With some nice OCR, it would mean a lot of fresh content!
About scraping, indeed the idea is to generate thousands of pages that you can re-use on a website. Every page should have some advertisers on it (affiliates and/or AdSense). The idea is to get A LOT of content (like 10,000 pages), build a website, get it known by the search engines (you can use Eli's QUIT tool), and wait until the search engines discover that you just stole the content, at which point they'll ban your site. It usually takes somewhere between 1 and 3 months, and meanwhile you'll have earned money from your advertisers.
Then you repeat the whole procedure.
I'm quite new to the game of scraping, but I did one site that makes me somewhere between $5 and $10 a day with 10,000 pages, so if you manage to automate things well enough, you may be able to generate enough sites to multiply your income!
And by the way, I'm working on a tool that may let a scraped site avoid being banned so quickly (maybe not at all!). I'm currently testing and refining it; more news about it later!
About Joe Cracker: he's the kind of person who would like to be admired for what he does, so he will just boast and not understand why people aren't impressed. On the other hand, someone like Eli just gives you real keys to progress, and he deserves some admiration!
The MySpace technique Joe's talking about is just the following: use a MySpace bot to add hundreds of friends. Then when you have friends, you can post a "bulletin," which is an announcement that will be seen by all your friends. This bulletin will send them to some affiliate (like a dating service). Again, the numbers game is what is important: of thousands of friends, most will ignore your bulletin, but some will not, and you will get money from affiliate commissions.
"Clicking on one button" is what Joe meant: let your bot add friends, and then when you have enough friends, post a bulletin pointing somewhere that will make you some money.
Then repeat as much as you can.
Thank you for another great article
I use MySpace for indexing purposes… used to monetize it but was asked by my affiliate managers to stop… it works well for indexing though.
You ignored the issue regarding copyrights, Eli. I don't think publishers want you profiteering off their original work without permission. Scraping for content is stealing.
Also, Google has duplicate content filters, so your scraped content probably won't rank too well.
Hey Joe Cracker-
Welcome to the dark side of the web. Black hat stuff it is. Does it involve unethical tactics? You bet your ass it does. If this bothers you then go play somewhere else, cause we know what we are doing and we don't need you to stumble in and start blabbing to us about "hey, do you realize this is copyright infringement?" Fuck yes we realize it, but the money is too good to pass up. So as I said, find someone else to bother and get the fuck outa here. If you stick around we will turn you to the dark side. You been warned!!
Can’t believe I’m backing up Joe Cracker, but black-hat doesn’t mean illegal… it means “unethical”. Now what is “unethical?” I suppose that depends on your own values and that of the industry to which you belong.
Conversely, laws are not issues of ethics. You break them and you pay one way or the other.
The government can make anyone's life miserable, so follow copyrights & attribute sources. If it's Wikipedia you're scraping or some other GNU or CC work, it's easy. For article sites, put a little more work into it and reference the source (author). Then scrape away!
Joe Cracker is a noob and a half.
OOOOOOOOOOOOOOOOJ Joe Cracker
Do you remember me?
You were my driving instructor.
You said that a woman must give me permission to have sexy time with me.
hahahahahaha what nonsense :)
So Joe Cracker looks to have become the topic of the thread… was there ever any answer about the use of content re-writers, though? Something I would also like to use to 'flip' content pieces.
The HTML::LinkExtor link is invalid; the new link is
http://search.cpan.org/~gaas/HTML-Parser-3.56/lib/HTML/LinkExtor.pm
Thanks, keep up the good work!
Nice idea but I see too many ways for you to get in legal trouble with this method so I’m not sure I will try it.
I have just read part 1, which was brilliant, but part 2 was super. Is there a part 3 to this?
I do hope so.
Is there any further article coming on the same topic?
I'm not big on CGI; could you possibly translate it into a PHP-compatible format? Thanks!
Well, maybe he could, but the blog hasn't been updated for months.
Loving the blog Eli, I like this idea. Now off to see if there are any left.
I really love scraping, and your posts are awesome!
yeah it is when you compare it with scraping
Well, I don't understand the difference between the sitemap crawl and the search engine crawl. Aren't they both the same? Either way, I'd usually avoid playing with crawlers. It is not wrong to play with a bad crawler and block it, but playing with search bots may end up with a penalty or temporary or permanent removal from the index itself.
I do agree with all of the ideas you have presented in your post. They're really convincing and will definitely work. Still, the posts are too short for newbies. Could you please extend them a bit next time? Thanks for the post.
keep it up
thanx
Yeah they should do that but they are not
may have to try that crawler…
It is easy to see that you are impassioned about your writing. I wish I had your ability to write. I look forward to more updates and will be returning. I would like to thank you for the efforts you have made in writing this post.
Nice post… worth a bookmark
Smart one… I never knew I could crawl to scrape.
Thanks for defining the various types of crawlers; I didn't realise they were so diverse. I have to say, though, I could do with a whole lot more hand-holding to get my crawler off the ground. Thx for the intro, however.
These are really nice tips. These would be helpful in finding some blogs to comment on. Good thing I saw your post.
Crawler.cgi is what I would recommend too.
I used another method, but I gotta try crawler.cgi.
oook thanks a lot
It is very good and helpful for us. My friend suggested it to me, and I have suggested it to many other friends; they like your blog and want more posts like this.
thanks man
it's a very good article
Hmm, it seems the code has been stripped out of the above. It's the PHP require posts.php that was meant to show between the "".