Complete Guide To Scraping Pt. 1
In the spirit of releasing part four of my Wikipedia Links series, we're going to spend a couple of posts delving into good ol' black hat, starting, of course, with scraping. I've been getting a few questions lately about scraping and how to do it, so I might as well get it all out of the way, explain the whole damn thing, and maybe someone will hear something they can use. Let's start at the beginning.
What exactly is scraping?
Scraping is one of those necessary evils that is used simply because writing 20,000+ pages of quality content is a real bitch. So when you're in need of tons of content really fast, what better way of getting it than copying it from someone else? Teachers in school never imagined you'd be making a living copying other people's work, did they? The basic idea behind scraping is to grab content from other sources and store it in a database for use later. Those uses include, but are not limited to: putting up huge websites very quickly, updating old websites with new information, creating blogs, filling your spam sites with content, and filling multimedia pages with actual text. Text isn't the only thing that can be scraped, either. Anything can be scraped: documents, images, videos, and anything else you could want for your website. Also, just about any source can be scraped. If you can view it or download it, chances are you can figure out a way to copy it. That, my friend, is what scraping is all about. It's easy, it's fast, and it works very, very well. The potential is also limitless. For now, let's begin with the basics, work our way into the advanced sector, and eventually get into actual usable code examples.
What are the goals behind scraping?
The ultimate goals behind scraping are the same as those of actually writing content.
1) Cleanliness - Filter out as much garbage and as many useless tags as possible. The must-have goal behind a good scrape is to get the content clean, without any chunks of the source's template or ads remaining in it.
2) Unique Content - The biggest money lies in finding and scraping content that doesn't exist in the engines yet. Another alternative lies in finding content produced by small-timers who aren't even in the search engines and aren't popular enough for anyone to know the difference.
3) Quantity - The more the better! This also means finding tons of sources for your content instead of taking it all from one single place. The key here is to integrate many different content sources together seamlessly.
4) Authoritative Content - Try to find content that has already proven itself to be not only search engine friendly but also actually useful to visitors. Forget everything you've ever heard about black hat SEO. It's not about providing a poor user experience; in fact, it's exactly the opposite. Good content and a good user experience are what black hat strives for. That's the ultimate goal. The rest is just sloppiness.
Where do I scrape?
There are basically four general source types that all scraping falls into.
1) Feeds - Really Simple Syndication (RSS) feeds are one of the easiest forms of content to scrape. In fact, that is what RSS was designed for. Remember, not all scraping is stealing; it has very legitimate uses. RSS feeds give you a quick and easy way to separate the real content from the templates and other junk that may stand in your way. They also provide useful information about the content, such as the date, direct link, author, and category, which helps in filtering out content you don't want. (A feed-scrape sketch follows this list.)
2) Page Scrapes - Page scrapes involve grabbing an entire page of a website, then, through a careful process that I'll go into in further detail later, filtering out the template and all the extra crap. Grab just the content and store it in your database.
3) Gophers - Other portions of the Internet that aren't websites. This includes many places like IRC and newsgroups… aw hell, here's a list -> Hot New List of Places To Scrape
4) Offline - Sources and databases that aren't online. As mentioned in the other post: encyclopedias, dictionary files, and, let us not forget, user manuals.
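Since feeds are the easiest place to start, here's a minimal sketch of a feed scrape in PHP (the feed URL is just a placeholder, this assumes the SimpleXML extension, and the error handling is bare-bones):

<?php
// Minimal RSS scrape: pull a feed, keep each item's useful metadata,
// and do a first pass of cleaning on the description.
$feedUrl = 'http://www.example.com/feed.xml'; // hypothetical feed
$rss = simplexml_load_file($feedUrl);
if ($rss === false) {
    die("Could not fetch or parse the feed.\n");
}
foreach ($rss->channel->item as $item) {
    $title = (string) $item->title;
    $link  = (string) $item->link;
    $date  = (string) $item->pubDate;
    $text  = strip_tags((string) $item->description); // basic cleanup
    // A real scrape would INSERT these into a database instead.
    echo "$date | $title | $link\n";
}
?>

The date, link, author, and category fields are exactly what you filter on before anything ever hits your database.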
How Is Scraping Performed?
Scraping is done through a set methodology.
1) Pulling - First you grab the other site and download all its content and text. In the future I will refer to this as an LWP call, because that is the Perl module used to perform the pull action. (A sketch of the whole flow follows this list.)
2) Parsing - Parsing is nothing short of an art. It involves taking the grabbed page's markup and removing everything that isn't the actual content (the template and ads, for instance).
3) Cleaning - Reformatting the content in preparation for your use. Make the content as clean as possible, without any signs of the true source.
4) Storage - Any form of database will work. I prefer MySQL or even flat files (text files).
5) Rewrite - This is an optional step. Sometimes, if you're scraping non-original content, it helps to make some small changes so it passes as original. You'll learn soon enough that I don't waste my time scraping content that isn't original (i.e., already in the engines); I focus most of my efforts on grabbing content that isn't already on any pages in the search engines.
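To tie those steps together: the pulls here are described in terms of Perl's LWP, but the flow is the same in any language. Here's a rough PHP/cURL sketch of pull, parse, clean, and store; the URL and the content markers are made up, and a real scrape needs markers matched to the target's actual template:

<?php
// 1) Pulling: download the raw page.
$url = 'http://www.example.com/some-article.html'; // hypothetical target
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible)');
$page = curl_exec($ch);
curl_close($ch);

// 2) Parsing: cut out just the content block between template markers
//    (a hypothetical div, found by eyeballing the page source).
if (!preg_match('#<div id="content">(.*?)</div>#s', $page, $m)) {
    die("Content markers not found; adjust the pattern.\n");
}

// 3) Cleaning: strip tags and collapse leftover whitespace.
$content = trim(preg_replace('/\s+/', ' ', strip_tags($m[1])));

// 4) Storage: a flat file here; a MySQL INSERT works just as well.
file_put_contents('scraped.txt', $content . "\n", FILE_APPEND);
?>

The parsing regex is where the "art" comes in: every template needs its own markers, and picking durable ones is most of the work.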
In the next couple of posts in this series I'll start delving into each scrape type and source. I'll even see about giving out some code and useful resources to help you along the way. How many posts are going to be in this series? I really have no idea; it's one of those poorly planned-out series that I enjoy doing, so I guess as many as are necessary. Likewise, they'll follow suit with the rest of my series and get better as the understanding and knowledge of the processes progresses. Expect this series to get very advanced. I may even give out a few secrets I never planned on sharing, should I get a hair up my ass to do so.
Yep, this pretty much summarizes what a content scrape is. Very detailed, and it is a good review, although it is aimed at people who have heard of scraping but don't know what it is, or people just lurking around trying to advance their knowledge of SEO.
Charles
It's important, however, to avoid duplicated content!
There is a nice website/tool/webservice to scrape from other sites. It is not perfect and not as flexible as using regexps, but if you want a job done really fast, it is very nice:
http://www.dappit.com
What I'm wondering is if you could couple this with some kind of semantic translation to create totally unique content that was still readable English. I haven't found any programs or code that do this, but I wonder if it is possible…
This should be a fun series. Anyone got a request for a site to use as an example of a page scrape and a crawl?
Hey, actually I got inspired by your whole blog (Thanks Eli!) to start playing with scraping.
I am currently writing a lexical tool that takes a web page and recreates its sentences with different words but the same meaning, and of course still readable! My goal is to be able to pass the plagiarism test on a tool such as iThenticate.
It is some work and it is just getting started, but the first test results are not too bad!
By the way, I said in another post that I was working at my company on a powerful automation tool that could help Blue/Black Hatters. Unfortunately, this project is on hold for some time (Christmas brought other priorities to the company!), so I think I'm gonna concentrate on scraping and my lexical tool.
I'll keep you informed about it when I have made it work well enough (and hopefully earned some money with it); I may even release it to the public if I feel it is good enough.
One problem with scraping is that you keep the links from the original source. When somebody clicks on one of those links, it usually goes to the original scraped website. The original website owner may then discover in his stats logs that a lot of visitors are coming from your website. He goes there, sees that you just scraped his content, and may complain to your hosting company, Google, or whoever else could get your site shut down. (It seems this is one of the main reasons why scraped sites often get banned from Google after some weeks.)
So what is the solution?
Usually, remove all the links from your scraped content, or change them to point to your own website.
Easy, but not very good for the user. He clicks, thinking he's going to another article, and he may end up nowhere…
So I had this idea: set up a website you could call "www.my-search-engine.com", with only one main PHP file that you would invoke this way:
http://www.my-search-engine.com/index.php?url=www.scraped-website.com/whatever-article.html
This main file just takes one parameter, a URL, and redirects you to it.
Then you only need to modify all the links in your scraped content, adding your "search engine invocation" to them…
e.g.
blablabla <a href="http://www.example.com/article.html">click here</a> blabla
would be transformed into:
blablabla <a href="http://www.my-search-engine.com/index.php?url=www.example.com/article.html">click here</a> blabla
Then when a user clicks on the link, he goes to your "search engine" website and is redirected to the page he wanted to read.
The scraped website's owner will see in his logs only some people coming from your fake search engine, so he will even be happy, thinking "cool, I'm indexed in a new search engine".
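As a rough sketch, that link modification could be a single regex in PHP (this assumes simple double-quoted absolute links, which real-world HTML often breaks):

<?php
// Rewrite every absolute link in the scraped content so it routes
// through the (hypothetical) fake search engine's redirect script.
$content = preg_replace(
    '#href="http://([^"]+)"#i',
    'href="http://www.my-search-engine.com/index.php?url=$1"',
    $content
);
?>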
Now you're thinking… eh, how do you do such a redirect page in PHP?
Hopefully most of you already know, but here is the code anyway; just copy the following into a file called "index.php":
<?php
// Redirect to the URL passed in the "url" parameter.
// ($HTTP_GET_VARS is long deprecated; $_GET is the modern equivalent.)
header("Location: http://" . $_GET['url']);
exit;
?>
That’s it!!!
So, B-hatters, I'd be interested in your opinion on this technique…
And once again, great work Eli!!!
Another solution to the linking problem mentioned above is to set up a specific page for redirection, pass the actual URL to it like this: href="/redirect.php?to=redirecturl", and add a meta refresh tag to redirect.php.
This will forward the user to the URL you specify. Browsers don't pass the referrer header if you use a meta refresh, so the original website owner will never find out where the user is coming from.
I will also add a noindex/nofollow metatag to redirect.php and disallow redirect.php in my robots.txt.
Oops!! The code was removed from the above post. Add the meta refresh and robots noindex/nofollow metatags to redirect.php.
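For reference, the stripped tags would presumably look something like this (a reconstruction from the description above; echoing the "to" parameter through PHP is an assumption):

<?php $to = htmlspecialchars($_GET['to']); ?>
<!-- Meta refresh: forwards the visitor without sending a referrer header -->
<meta http-equiv="refresh" content="0;url=http://<?php echo $to; ?>">
<!-- Keep search engines from indexing or following the redirect page -->
<meta name="robots" content="noindex,nofollow">

And in robots.txt:

User-agent: *
Disallow: /redirect.php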
ohh my my nice….
Thanks Eli. I appreciate the work you are putting into this.
We all do my friend, we all do.
yeah, we all do…
Aur,
Thanks man, that is probably the best comment this blog has ever gotten. You obviously know your stuff. I look forward to your results.
Yes, you are right, Eli.
Great post. Cannot wait to read the follow-ups.
Can't wait to hear from Aur as well. Good stuff.
I can't wait for the follow-ups.
I would like to know more about Gophers
Can you guys show me some examples or something?
Sorry, I basically have no clue what you're saying.
What are you trying to say? :S
Thanks for another great post Eli!
One thing that makes me giggle every time I do some scraping is to set my agent to the agent string for GoogleBot *grin*. That way, as I'm ripping their content, the webmaster likely feels all warm and fuzzy because it looks like GoogleBot is making sweet, sweet love to his site.
(I use PHP and Snoopy for scraping… that makes setting things like agent and referrer really easy.)
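For anyone who hasn't seen Snoopy, the trick looks roughly like this (a sketch; the target URL is a placeholder, and Googlebot's exact agent string has changed over the years):

<?php
// Fetch a page while announcing ourselves as Googlebot, using the
// Snoopy HTTP client class (Snoopy.class.php must be downloaded).
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->agent   = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
$snoopy->referer = "http://www.google.com/";
if ($snoopy->fetch("http://www.example.com/page.html")) {
    $html = $snoopy->results; // raw page, ready for parsing
}
?>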
Eli
A couple of questions:
Would you link directly to a WH money site? Should an in-between site be used to protect it from a bad neighbourhood?
Do you filter titles for adult content, etc. (unless you want that vertical)?
Eli,
Do you scrape the names of the blogs or the titles of the new blog posts? If you are scraping weblogs, then it seems you are just taking the blog name, right?
I have a little experience in the field of scraping; it was in fact one of my first attempts at doing anything related to web programming. Really though, with a little coding knowledge and a good eye for patterns, it's very easy to make a site-wide scraper.
Once you have one or two scrapers in your back pocket, you will find it very quick and easy to convert any current scraper into one that is useful for your next big project.
Thank you for the introduction to scraping. Has anyone earned a good income from a scraped site? If I am to scrape, I am not going to scrape one site but a few, to make the content unique.
Sometimes it is not useful to get that information from some resource. I would use Wikipedia for writing a lot of that information, wouldn't you?
Now we know where to go for some content spinning. Thanks!
I am glad that someone finally figured out how to manage multiple online MySQL servers.
Hi, I know it's been years since you wrote this, but I just want to say thanks for writing it; it's still very relevant information even now. I'm trying to learn SEO, and scraping is one of the black hat techniques that I need to use ASAP.
You just gotta love this idea. I wrote a similar blog post and got an unexpected amount of feedback. It is a rare article that is both entertaining and informative.
Ya, you bet!!!
looking for more info
I don’t think scraping is a good idea
thanks for sharing dude
no problem
Yes, your logic is correct; try it on not just one social network site, but many!
I don't think scraping is ever going away anytime soon. It's too easy!
It's so easy, you say.
I think that with complete guidance it is really easy, unless you never read the instructions carefully.
As ever, it's an interesting concept for creating additional content and building up an online presence.
right
I do agree with all of the ideas you have presented in your post. They're really convincing and will definitely work. Still, the posts are too short for newbies. Could you please extend them a bit next time? Thanks for the post.
keep it up
thanx
Yeah, true. Keep it up.
Balenciaga Handbags Shopfh
Can you please change your comment? :S
Great post, Eli; it was worth a read.
Yeah, true, Nitish.
I have always supported Eli's website.
Yeah, true, very nice, Eli.
Thanks so much, a very good guide. Perfect.
Yes, it's really true! Some articles posted mostly for backlinks are scraped. I have read a lot of them that are just there for the keywords, and the content makes no sense at all.
Scraping must be done very properly, or else it may have side effects.
Thank you for the following tips; they are really good. Good thing I saw your post. This will be really helpful.
This scraping guide is a tutorial of sorts… good post.
By participating in those sites and building up your friendships, you may get high traffic.
great guide for scraping
Thank you for sharing this information. It was very helpful and saved a lot of my time. Thanks once again.
thanks man
It's a very good article.
Hmm, it seems the code has been stripped out of the above. It's the PHP require of posts.php that was meant to show between the quotes.
Thanks, man, for the post.
Happy for the insight!
What a helpful article; I have never seen one this good on another website.
Outstanding read; I never fully understood scraping until I saw this. Thanks a ton!