Praelexis runs an exciting internship program. Twice a year (June/July and Dec/Jan) we are joined by bright young minds and the mutual knowledge sharing commences. Interns at Praelexis join our teams and work on real industry problems, while upskilling for their careers as Data Scientists.
If you have ever copied and pasted content from a web page into a local file, then you’ve done some manual web scraping (albeit on a very small scale). Web scraping is a popular technique used by both tech-savvy individuals and data science companies alike to collect data for analysis.
Most web scraping is done by software tools that can collect large amounts of data very quickly, and they don’t get bored nearly as easily as humans do.
Let’s go over the process of web scraping, so that you can minimize wasted time and maximize the time spent collecting that all-important data. What follows is a brief but useful guide for the beginner web scraper.
What problem are you trying to solve with the data? Looking for a bargain on a second-hand car? Web scraping could help with that. Need data for a research project you’re working on? Scrape away. There are endless possibilities here: remember that data is valuable, very valuable. Don’t be surprised if you struggle to find some types of data freely available on the internet.
The next challenge is finding a website from which you can scrape the data you’re looking for. Each website has its own rules about what data may be scraped. To find out what they are for your particular target, you’ll need to check the robots.txt file, which typically lives at the root domain of the website. For example, here is the Praelexis website’s robots.txt. This file tells a scraper what it is (and is not!) allowed to do on the site. The site’s “terms of service”, usually linked at the bottom of the page, are also worth reading, since they often spell out the “rules of engagement” for scraping alongside the robots.txt file.
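As a quick sanity check, Python’s standard library can read a robots.txt file for you. Here is a minimal sketch using urllib.robotparser; the Praelexis URLs are just an example target, so swap in your own.

```python
import urllib.robotparser

# Point the parser at the robots.txt in the site's root domain
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.praelexis.com/robots.txt")
rp.read()

# can_fetch() reports whether the given user agent may fetch the given URL
page = "https://www.praelexis.com/blog"
if rp.can_fetch("*", page):
    print(f"robots.txt allows scraping {page}")
else:
    print(f"robots.txt disallows scraping {page}")
```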
There are several options when deciding on a tool for web scraping. Pre-built solutions exist in the form of applications and APIs, such as Bright Data (paid) and ParseHub (free), which are great options for those not interested in coding their own scraper. Some of these tools also come as browser extensions that are simple and easy to learn for smaller projects.
However, for the ultimate in customization you’re better off using a Python library specifically crafted for scraping. Let’s go over some of the most popular Python libraries, which are relatively simple to learn and free to use!
BeautifulSoup is very popular amongst the data science community and is used to parse data from HTML or XML documents. Scrapy, which was designed specifically for web scraping, is another good option with great documentation and an active support community. Pandas in conjunction with BeautifulSoup (my personal favourite) is a great “all in one” solution for data science, as it includes the tools required for data manipulation and analysis. Once the data has been extracted, it needs to be stored somewhere, usually in a structured format like a CSV or XLSX file.
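To make that concrete, here is a minimal sketch of the BeautifulSoup-plus-Pandas approach. It targets quotes.toscrape.com, a sandbox site built for scraping practice; the CSS classes below are assumptions based on that site’s markup, so adjust them to match whatever page you scrape.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch the page and parse the HTML
response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.text, "html.parser")

# Extract each quote and its author into a list of rows
rows = []
for quote in soup.find_all("div", class_="quote"):
    rows.append({
        "text": quote.find("span", class_="text").get_text(strip=True),
        "author": quote.find("small", class_="author").get_text(strip=True),
    })

# Store the extracted data in a structured format (CSV) via Pandas
df = pd.DataFrame(rows)
df.to_csv("quotes.csv", index=False)
print(df.head())
```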
Armed with this knowledge, you can write the code for your scraper.
Time to get your hands dirty; this is where you’ll learn the most.
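As one example of the kind of iteration you may go through: a first draft often needs polite delays and basic error handling added after the fact. The sketch below is illustrative only; the URLs, timeout and one-second pause are assumptions, not fixed rules.

```python
import time
import requests

# A handful of pages from the practice site used above
urls = [f"https://quotes.toscrape.com/page/{n}/" for n in range(1, 4)]

for url in urls:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # surface HTTP errors early
        print(f"Fetched {url} ({len(response.text)} bytes)")
    except requests.RequestException as exc:
        print(f"Skipping {url}: {exc}")
    time.sleep(1)  # be polite: pause between requests so you don't hammer the server
```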
Time to sit back, relax and watch the magic happen. Don’t get frustrated if you end up rewriting bits of the code; all code writing is an iterative process. As a wise man, Joel Spolsky, once said: “Good software, like wine, takes time.”
For a more detailed look at web scraping, check out this article by Songhao Wu.