Sometimes Kaggle is not enough, and you need to generate your own data set.
Maybe you need pictures of spiders for this crazy Convolutional Neural Network you’re training, or maybe you want to scrape the NSFW subreddits for, um, scientific purposes.
Whatever your reasons, scraping the web can give you very interesting data, and help you compile awesome data sets.
In this article we’ll use ScraPy to scrape a Reddit subreddit and get pictures.
Some will tell me using Reddit’s API is a much more practical method to get their data, and that’s strictly true. So true, I’ll probably write an article about it soon.
But as long as we do it in a very small dose, and don’t overwork Reddit’s busy servers, it should be alright. So keep in mind, this tutorial is for educational purposes only, and if you ever need Reddit’s data you should use the official channels, like their awesome API.
So how do we go about scraping a website? Let’s start from the beginning.
Checking the robots.txt
First we’ll go to reddit.com/robots.txt. It’s customary for a site to make its robots.txt file accessible from its main domain. It follows this format:
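A robots.txt file is just a plain-text list of rules. As a rough sketch (the actual contents of reddit.com/robots.txt are longer and may change over time), it looks something like this:

```text
User-agent: *
Disallow: /login
Disallow: /r/*/comments/*
```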
Where User-agent describes a type of device (we fall under *, the wildcard pattern), and Disallow points to a list of URL patterns we can’t crawl.
I don’t see /r/* in there, so I think it’s ok to scrape a subreddit’s main page.
I’d still advise you to use the API for any serious project, as a matter of etiquette.
Not respecting a site’s robots.txt file may have legal ramifications, but it mainly just makes you look like a mean person, and we don’t want that.
Setting up our Project
In order to scrape a website in Python, we’ll use ScraPy, one of its most popular scraping frameworks. Some people prefer BeautifulSoup, but I find ScraPy to be more dynamic.
ScraPy’s basic units for scraping are called spiders, and we’ll start off this program by creating an empty one.
So, first of all, we’ll install ScraPy:
pip install --user scrapy
And then we’ll start a ScraPy project:
scrapy startproject project_name
Here you can use anything instead of project_name. This command creates a directory with a number of configuration files and Python scripts in it.
Now for our last initialization command, we’ll create our first spider. To do that we’ll run scrapy’s genspider command, which takes a spider’s name and a domain URL as its arguments.
I’ll name mine kitten_getter (beware: spoilers) and crawl reddit.com/r/cats.
scrapy genspider kitten_getter reddit.com/r/cats
Now we’ll just go into the /spiders directory and not worry about the rest. As always, I’ve made my code available in this GitHub project.
Setting up our first spider
In the spiders directory, we’ll open the file called kitten_getter.py and paste this code:
What’s happening here? Well, each spider needs three things: a name, a start_requests method, and a parse method.
- The spider’s name will be used whenever we start the spider from the console.
- Running the spider from the console will make it start from the start_requests routine.
- We make the routine issue HTTP requests for a list of URLs, and call our parse method on their HTTP responses.
In order to run this, all we have to do is open our terminal in the project’s directory and run:
scrapy crawl kitten_getter
To set your spiders free! Let them roam the web, snatching its precious data.
If you run that command, it will run the spider we just wrote: it’ll make a request, get the HTML for the first URL in the url_list we supplied, and parse it the way we asked it to. In this case, all we’re doing is writing the whole response straight into a file (~140Kb in size) called ‘kitten_response0’.
If you open it, you’ll see it’s just the HTML code for the website we scraped. This’ll come in handy for our next goal.
If you go to the link reddit.com/r/cats with the intention of scraping the subreddit for kitten pictures, you’ll notice there are two kinds of posts.
- Posts that link to their comments section when clicked.
- Posts that lead straight to a picture.
We also noticed that we can’t scrape anything matching reddit.com/r/*/comments/* without violating robots.txt, so extracting a picture from a post’s comments page would be wrong. We can, however, get the picture URLs if they’re directly linked from the subreddit’s main page. We see those links are always the href property in an <a> tag, so what we’ll do to get them is call the response object’s xpath method.
XPath is a way to move through a website’s HTML tree and select some of its elements. ScraPy also provides us with the css method, which allows for a different way of indexing and tagging elements. I personally find that right-clicking an element in the browser, hitting Inspect and then Copy XPath is a quick way to get started, and then I just play around with the output a bit.
In this particular case, since all we need is the href value for each <a> element, we’ll call response.xpath('//a/@href') on the response, which returns an iterator of selector objects (from the ScraPy library), one per href value. We then extract the string form of each value by calling its extract method, and check whether it’s actually a link to an image by seeing if it ends with ‘.png’ or ‘.jpg’.
Here’s the whole improved parse method, which now also creates an html file to display all the images without downloading them:
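A sketch of that improved parse method could look like the following. The image check is split out into a small helper here so it can be tested on its own; inside the project, parse goes in the spider class as before:

```python
def is_image_link(url):
    # Direct image links are the ones ending in .png or .jpg.
    return url.endswith(".png") or url.endswith(".jpg")


def parse(self, response):
    # Get the href value of every <a> element, keep only direct image links.
    links = [href.extract() for href in response.xpath("//a/@href")]
    image_links = [link for link in links if is_image_link(link)]

    # Build a small HTML page that displays the pictures without downloading them.
    with open("kittens.html", "w") as page:
        page.write("<html><body>\n")
        for link in image_links:
            page.write('<img src="%s" /><br/>\n' % link)
        page.write("</body></html>\n")
```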
So we make our spider crawl again, and the output should look something like this:
Crawled (200) <GET https://www.reddit.com/r/cats/> (referer: None)
Where each link is a cute kitten’s picture. As a bonus, the file kittens.html should be overflowing with cuteness.
That’s it! You’ve successfully crawled your first site!
Saving the images
Suppose that instead of making an HTML file, we wanted to download the images. What we’d do then is import Python’s requests library, and the unicodedata one. Requests is going to do the grunt work, while unicodedata lets us normalize the extracted links: ScraPy hands them back as unicode strings, and under Python 2 requests expected a plain ASCII URL (on Python 3 you can skip this step).
Now instead of the parse method, we’ll pass our scrapy.Request function the following function as callback argument:
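A sketch of what that callback could look like, assuming the spider defines an index attribute starting at 0. The save_image helper is just a name used here for illustration, and on Python 3 the unicodedata normalization step is omitted since requests handles unicode URLs fine:

```python
import requests


def save_image(content, index):
    # Write raw image bytes to an incrementally numbered JPG file.
    filename = "kitten%d.jpg" % index
    with open(filename, "wb") as image_file:
        image_file.write(content)
    return filename


def download_pictures(self, response):
    # Same href extraction as in parse, but now we fetch each image.
    for href in response.xpath("//a/@href"):
        link = href.extract()
        if link.endswith(".jpg") or link.endswith(".png"):
            picture = requests.get(link)  # requests does the grunt work
            save_image(picture.content, self.index)
            self.index += 1  # the index gives each image a unique name
```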
All it does is download an image and save it as a JPG. It also auto-increments an index attribute stored in the spider, which gives each image its name.
Playing around: interactive shell
ScraPy provides us with an interactive shell where we can try out different commands, expressions and XPaths. This is a much more productive way of iterating and debugging a spider than running the whole thing over and over with a crawl command. All we need to do to start the shell is run this:
scrapy shell 'http://reddit.com/r/cats'
Of course the URL can be replaced with any other.
Extending our Spider
If we wanted to get more images, we could make the download_pictures method call scrapy.Request on the URL of the next page, which can be obtained from the href attribute of the ‘next page’ button. We could also make the spider take a subreddit as argument, or change the downloaded file extensions.
All in all though, the best solution is usually the simplest one, and so using Reddit’s API will save us a lot of headaches.
I hope you now feel empowered to make your own spider and obtain your own data. Please tell me if you found this useful, and what’s a good data set you think you could generate using this tool — the more creative the better.
Finally, there is an O’Reilly book I love. I found it very useful when I started my Data Science journey, and it exposed me to a different, easier-to-use (though less flexible) web scraping library. It’s called Data Science from Scratch, and it’s probably half the reason I got my job. If you read this far, you may enjoy it!