Web Scraping Fundamentals Using Selenium - Reading Time: 5 Mins

Introduction

There is a huge emphasis on data in Data Science, yet I have rarely come across material that focuses on data collection, data management and data wrangling. I find it funny that everyone treats data as if it arrives packaged in a nice, clean, categorised and labelled way, ready for you to make sense of it in your own data exploration phase.

Sadly, I have to burst your bubble if you think data science works that way. Either you are living the dream in an organisation that hands you data neatly packed with a big bow on it (in which case I'm really jealous), or you have yet to encounter the need to find and gather data online. Gathering data is not really sexy, but it is necessary: it serves plenty of business functions where you don't need to apply any data science at all to get useful insights or intelligence that you can act on.

These functions range from building a list of potential leads or prospective clients for your sales representatives to sell your company's product or service to, through to analysing news in your industry so that you stay aware of the latest trends or technology developments, which could give your company an edge over its competitors.

Why Use Selenium for Web Scraping?

Most, if not all, of the modern websites that you plan to scrape your precious data from contain JavaScript, which displays data, synchronously or asynchronously, after the webpage has loaded.

If you access the website as a regular user, this does not really affect you. The focus is on displaying data or information in a human-readable format as a webpage, which the website's UI/UX designers took great pains to make aesthetically pleasing.

For a data scraper, that is not the case. Unless your scraper library can execute JavaScript, the data you want may simply be missing from the HTML it downloads, and you might need third-party libraries or services like Scrapy Splash to scrape it.

That is before considering other hurdles and gotchas, such as a website blocking your IP address because it knows that a spider is scraping it. To get around this, you might use the Tor network or a proxy service that provides a list of proxy IP addresses.

These IP addresses act like bullets for your scrapers: each one gets tossed aside once the website owners start blocking your scraper by its IP address, despite your best intentions not to disrupt them.

Terms In Data Scraping

Here are a few terms that are useful for you to understand.

  • Spider - An alternative term for a data scraper. Building one means spending time on the logic that lets your scraper fetch the correct data for you.
  • robots.txt - A file published by the website owner that tells your spider which data or links are off-limits for scraping.
  • ETL (Extract, Transform, Load) - A data transformation process that runs from the point you scrape the data, through superficial cleaning and processing, to storing the data.
  • Element - The HTML component that you locate in the webpage to scrape data from. You can locate it with either an XPath or a CSS selector. Pick a CSS selector as your first choice, and move to XPath only when you cannot select the element that way, for example because it has no class or id.
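
To make the robots.txt term concrete, here is a minimal sketch of how a spider could check whether a URL is off-limits before scraping it. It uses Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs and the "my-spider" user agent are all made up for illustration, and the rules are inlined so the sketch runs without touching a real website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="my-spider"):
    """Return True if the rules allow this (hypothetical) user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/articles/1"))    # not under /private/
print(is_allowed(parser, "https://example.com/private/data"))  # under /private/
```

Against a live website you would point the parser at the real file with parser.set_url(...) followed by parser.read() instead of inlining the rules.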

Web Scraping Fundamentals

This section gives you a basic walkthrough of the steps involved in web scraping with Selenium.

Do note that data scraping can be illegal, depending on the country you are scraping data from. Always check whether it is legal in that country before you attempt to scrape any data from a website. Here are the steps carried out in Selenium that you need to know:

  • Browser - You need to download the Selenium web driver that matches your browser, plus the browser itself, to run your data scraper. For production use, I would pick Chrome because its headless mode speeds up execution time.
  • Selection - Use a wait command to wait until the element has loaded before you select it. Selenium then locates and selects the HTML element based on your XPath or CSS selector.
  • Extraction - After locating the element, you extract its data and store it in a variable for later processing.
  • Processing - Once the extracted data is stored in a variable, you can do an initial clean-up with libraries like Pandas to make storing and analysis easier.
  • Storing - Finally, you store the newly cleaned data in a text file or a database. My personal suggestion is to store your data as a CSV, as it gives the data some structure for further cleaning and analysis.
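
The five steps above can be sketched roughly as follows. This is an illustration under assumptions rather than a production scraper: it assumes Selenium 4 with Chrome and a matching driver available on your machine, the function names (extract_texts, clean_texts, save_csv), the URL and the "h1" selector are hypothetical, and the cleaning step uses plain Python plus the standard csv module instead of Pandas to keep the example small.

```python
import csv


def extract_texts(url, css_selector, timeout=10):
    """Browser, selection and extraction: return the text of matching elements."""
    # Selenium is imported inside the function so the cleaning and storing
    # helpers below can be used on a machine without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one matching element is present in the page.
        # An XPath locator would look like (By.XPATH, "//h1") instead.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        return [element.text for element in elements]
    finally:
        driver.quit()  # always release the browser, even on errors


def clean_texts(texts):
    """Processing: collapse whitespace, drop blanks and duplicates, keep order."""
    seen = set()
    cleaned = []
    for text in texts:
        value = " ".join(text.split())
        if value and value not in seen:
            seen.add(value)
            cleaned.append(value)
    return cleaned


def save_csv(values, path, header="text"):
    """Storing: write one value per row under a single header column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([header])
        writer.writerows([value] for value in values)


# Example usage (hypothetical target; use a page you are permitted to scrape):
# raw = extract_texts("https://example.com", "h1")
# save_csv(clean_texts(raw), "headings.csv")
```

Keeping the browser work in one function and the cleaning and storing in separate helpers means you can rerun the clean-up on already extracted data without launching Chrome again.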

Conclusion

Selenium's primary purpose is website UI testing for QA and automation. Therefore, what is covered in this article also applies when you create website UI testing scripts or automation scripts.

A fair warning for those who plan to use it to scrape data: please be mindful of how you scrape. Website owners hate scrapers, as aggressive scraping can look like an attempt to DDoS their website and forces them to spend additional resources on solving the issue. If you cannot build a non-aggressive data scraper, they might employ masking techniques to prevent you from scraping their precious data.
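
One simple way to keep a scraper non-aggressive is to enforce a minimum pause between page loads. The sketch below is one possible approach, not the only one: the PoliteThrottle class and its two-second default are assumptions chosen for illustration, and the right delay depends on the website you are scraping.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay_seconds=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to honour the delay; return the seconds slept."""
        pause = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            pause = max(0.0, self.min_delay - elapsed)
            if pause:
                self._sleep(pause)
        self._last = self._clock()
        return pause
```

In a scraping loop you would call throttle.wait() right before each page load, so that requests are spaced at least min_delay_seconds apart no matter how fast the rest of your code runs.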

Reference