My first blog post on automated web scraping

Hey guys! I’ve started blogging about my automation learning journey with Python and Selenium!

Here’s my first post, on how I learned to build web scrapers with Python and Selenium and how I got them working. I have more posts planned that go over how I improved the scripts further.

Feel free to check it out!
https://sigrothian.co.uk/posts/creating-python-webscrapers-part-1/

11 Likes

I’m curious, have you had any problems with bot detection/protection systems on the websites you scrape data from? Some websites use mechanisms to block scrapers, DDoS attacks, and other bot activities. How many streams do you use to scrape data - just one, or do you use multiple streams (hundreds of pages in parallel)? And I assume those websites don’t require logging in to get access to the info?

4 Likes

A totally excellent explainer of the process. I wish more people would share their journey like you have here, Stephen. It proves that you will be able to recall and repeat your learnings, and that you are also able to size up the task well. Setting constraints on a learning task as boundaries is something very few people do for themselves. Personally, I would try using one of the web browser plugins that allow you to interactively navigate the DOM.

Appium has an inspector app for mobile that lets you interactively highlight any element and then takes you to that element in the DOM. Note that WebDriver is looking at the DOM, not the HTML. The HTML is the static document, and when you stop thinking about it all as HTML, things like controls, JavaScript, events and other element locator strategies start to make far more sense. The HTML can actually contain more than one ‘view’ of the page; elements stack, hide and move all of the time, especially in modern web apps that use fancy frameworks. That’s an area a lot of folk struggle with until they stop thinking of the page as static and start thinking of it as containing objects. Then paging through a site with longer paginated lists, for example, becomes far easier to automate. Great start. Keep this coming.
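
To make the pagination point a bit more concrete, here is a minimal sketch of paging through a list while treating the page as live objects rather than a static document. It is only an illustration under some assumptions: `driver` is anything with Selenium's `find_elements(by, value)` method, the selectors `li.row` and `a.next` are made up, and the `wait_until` helper is a hand-rolled stand-in so the loop has no dependencies. With real Selenium you would more likely reach for `WebDriverWait` and `expected_conditions` instead.

```python
import time

# "css selector" is the string value behind Selenium's By.CSS_SELECTOR.
CSS = "css selector"


def wait_until(condition, timeout=10.0, poll=0.25, ignore=()):
    """Call `condition` repeatedly until it returns a truthy value.

    Exception types listed in `ignore` are treated as "not ready yet".
    Raises TimeoutError if nothing truthy comes back in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignore:
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(poll)


def read_rows(driver, row_selector):
    """Query the DOM as it is right now and return each row's text."""
    return [row.text for row in driver.find_elements(CSS, row_selector)]


def scrape_all_pages(driver, row_selector, next_selector,
                     max_pages=50, ignore=()):
    """Collect row text from every page of a paginated list."""
    collected = []
    for _ in range(max_pages):
        # Look the rows up afresh on every page instead of keeping old
        # element references: the framework may have replaced them.
        rows = wait_until(lambda: read_rows(driver, row_selector),
                          ignore=ignore)
        collected.extend(rows)

        next_buttons = driver.find_elements(CSS, next_selector)
        if not next_buttons or not next_buttons[0].is_enabled():
            break  # no usable "next" control, so this is the last page
        next_buttons[0].click()

        # Wait until the DOM shows a different set of rows. Assumption:
        # two consecutive pages never contain identical rows.
        wait_until(
            lambda: read_rows(driver, row_selector) not in ([], rows),
            ignore=ignore,
        )
    return collected


# Hypothetical usage with a real browser (URL and selectors are made up):
#   from selenium import webdriver
#   from selenium.common.exceptions import StaleElementReferenceException
#   driver = webdriver.Chrome()
#   driver.get("https://example.com/list")
#   texts = scrape_all_pages(driver, "li.row", "a.next",
#                            ignore=(StaleElementReferenceException,))
```

The part that matters is that nothing is held on to between pages. Every read goes back to the DOM and asks what is there now, which is why elements being swapped out underneath you stops being a problem.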

2 Likes

Hey,

Really it’s a case of running one script at a time, as this has mainly been a learning exercise to get to grips with Selenium in tandem with Python. I stuck to websites that don’t require logging in to access the info, so that I’d at least have something to show as a first project.

It’s about taking it one step at a time really.

2 Likes

Hey Conrad.

Thanks for sharing all of that. It goes to show how much more I need to learn about automation, especially coming from my background of purely manual testing.

As for the web browser plugins, I’ve made use of SelectorHub, but I’d be happy to hear of any alternatives that could help me navigate the DOM.

Other than that, I’m more than happy to continue sharing my learning journey as that has basically become my main focus after my recent layoff.

1 Like

Do you have an RSS feed for your blog? I was trying to find it so that I can feed it into our software testing news section.

Very nicely written. Great job.

Hey! That’s something I plan to implement soon. The website still has some kinks to fix on the backend.

1 Like