Published: August 07, 2020
There is a huge emphasis placed on data in Data Science. Yet I have rarely come across material that focuses on data collection, data management and data wrangling. I find it funny that everyone treats data as if it comes packaged in a nice, clean, categorised and labelled way, ready for you to make sense of it during your own data exploration phase.
Sadly, I need to burst your bubble if you feel that data science works that way. Either you are living the dream in an organisation that hands you data nicely packed with a big bow tie on it (in which case I'm really jealous), or you have yet to encounter the need to find and gather data online. Gathering data is not really sexy by itself, but it is necessary, as it fulfils tons of business functions that provide useful insights or intelligence to act on, without any real need to "data science" it.
These functions range from creating a list of potential leads or prospective clients for your sales representatives to sell your company's product or service, to analysing news in your industry so that you stay aware of the latest trends or technology developments, which could give your company an edge over its competitors.
If you access a website as a regular user, the way its data is presented does not really affect you a great deal, because the focus is on displaying data or information in a human-readable format as a webpage, one that website UI/UX designers took great pains to make aesthetically pleasing for a user.
That is before considering other hurdles and gotchas, such as a website blocking your IP address because it knows that a spider is scraping it. To get around this, you might use the Tor network or use a service like Proxy for a list of proxy IP addresses.
These proxy IP addresses act as bullets for your scrapers: each one gets tossed aside once the website owners start to block your scraper by its IP address, despite your best intentions of not causing them any disruption.
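To make the "bullets" idea a bit more concrete, here is a minimal sketch of a proxy pool that hands out addresses in turn and tosses aside the ones that get blocked. The `ProxyPool` class and the addresses are hypothetical and only for illustration. How you actually detect a block, and how you plug the proxy into your scraper, depends on your own setup.

```python
from collections import deque


class ProxyPool:
    """Rotate through proxy addresses and drop the ones that get blocked."""

    def __init__(self, proxies):
        # Remove duplicates while keeping order; deque makes rotation cheap.
        self._proxies = deque(dict.fromkeys(proxies))

    def __len__(self):
        return len(self._proxies)

    def next_proxy(self):
        """Return the next proxy in round-robin order."""
        if not self._proxies:
            raise RuntimeError("no usable proxies left")
        proxy = self._proxies[0]
        self._proxies.rotate(-1)  # move the proxy we just used to the back
        return proxy

    def discard(self, proxy):
        """Toss aside a proxy that the target website has blocked."""
        try:
            self._proxies.remove(proxy)
        except ValueError:
            pass  # it was already discarded


# Placeholder addresses taken from a range reserved for documentation.
pool = ProxyPool(["203.0.113.1:8080", "203.0.113.2:8080", "203.0.113.3:8080"])

first = pool.next_proxy()
pool.discard(first)  # pretend the website has just blocked this address
remaining = [pool.next_proxy() for _ in range(2)]
```

Once `discard` has been called on an address, it is never handed out again, so the pool only shrinks. When it runs empty, `next_proxy` raises an error instead of silently reusing a blocked address.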
Here are a few terms that are useful for you to understand.
This will give you a basic walkthrough of the various steps of web scraping done in Selenium.
Do note that data scraping can be illegal, depending on the country you are scraping data from. Always check whether it is legal to scrape data in that country before you attempt to scrape any data from the website. Here are the steps carried out in Selenium that you need to know:
By default, Selenium is intended for website UI testing for QA and automation purposes. Therefore, what is taught in this article also applies to you if you are creating website UI testing scripts or automating tasks.
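As a rough illustration of how the same few Selenium calls serve both scraping and UI testing, here is a minimal sketch of a session that opens a page, locates elements, reads a value from each and closes the browser. The function names, the default CSS selector and the choice of Chrome are my own assumptions. The script also assumes Selenium 4 with a working Chrome setup on your machine.

```python
def clean_hrefs(hrefs):
    """Drop missing values and duplicates while keeping the original order."""
    return list(dict.fromkeys(h for h in hrefs if h))


def collect_link_targets(url, css_selector="a"):
    """Open `url` in a browser and return the href of every matching element."""
    # Imported here so this file still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes Chrome and its driver are available
    try:
        driver.get(url)  # load the page
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return clean_hrefs(el.get_attribute("href") for el in elements)
    finally:
        driver.quit()  # always close the browser, even after an error
```

Calling `collect_link_targets("https://example.com")` would start a browser and return the link targets found on that page. A UI test would follow the same flow but end with an assertion on what was found instead of returning the data.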
A fair warning for those who plan to use it to scrape data: please be mindful of how you scrape. Website owners hate scrapers, as they might assume from your aggressive scraping activity that you are trying to DDoS their website, which forces them to spend additional resources to solve the issue. They might also employ masking techniques to prevent you from scraping their precious data, all because you failed to build a non-aggressive data scraper.
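One possible way to keep a scraper non-aggressive is to respect the website's robots.txt and to pause between requests. The sketch below uses Python's standard `urllib.robotparser` for that. The robots.txt content, the `example-scraper` user agent and the helper names are hypothetical, and the `fetch` argument stands in for whatever actually loads the page (for example, a Selenium call).

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

USER_AGENT = "example-scraper"  # hypothetical scraper name


def load_rules(robots_text):
    """Parse robots.txt text into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_delay(parser, default_seconds=1.0):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


def scrape_politely(parser, urls, fetch, sleep=time.sleep):
    """Fetch only the allowed URLs, pausing after each request."""
    delay = polite_delay(parser)
    results = []
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # the site asked crawlers to stay away from this path
        results.append(fetch(url))
        sleep(delay)
    return results


rules = load_rules(ROBOTS_TXT)
pauses = []  # records the pauses instead of really sleeping in this demo
visited = scrape_politely(
    rules,
    ["https://example.com/articles/1", "https://example.com/private/report"],
    fetch=lambda url: url,  # stand-in for a real page load
    sleep=pauses.append,
)
```

In real use you would leave `sleep` at its default so the scraper truly waits, and pass a `fetch` function that loads the page. Following robots.txt does not settle whether scraping is legal or welcome, so it is a courtesy on top of the checks mentioned earlier, not a replacement for them.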