In this blog, we will walk through scraping a website from scratch. The site we will scrape is an online computer shop. Let’s get started.

We are using a Windows machine, so some steps might differ slightly on other operating systems.

Creating a Virtual Environment

First, create a new directory named scrapy_tutorial and move into it:

mkdir scrapy_tutorial
cd scrapy_tutorial

Then run this command to create a virtual environment in a venv directory:

virtualenv venv

Activate the virtual environment and return to the main directory:

cd venv/Scripts
activate
cd ../..
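
The activation step above is Windows-specific. On macOS or Linux, assuming the same venv directory, the equivalent would be:

source venv/bin/activate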

Installing Dependencies

Besides Scrapy, we will use a library named scrapy-user-agents. It provides user-agent strings for our requests and handles rotating them automatically.

Install Scrapy first:

pip install Scrapy

Then install scrapy-user-agents:

pip install scrapy-user-agents
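
To confirm that both packages are in place, these two commands should print the installed versions:

scrapy version
pip show scrapy-user-agents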

Now create the Scrapy project and move into its directory:

scrapy startproject elextra
cd elextra
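
For reference, startproject generates the standard Scrapy layout inside the new elextra directory, and the file paths used later in this post are relative to it:

scrapy.cfg
elextra/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py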

Open elextra/settings.py and add the following:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

This is the configuration scrapy-user-agents needs: it disables Scrapy’s built-in UserAgentMiddleware (by setting it to None) and enables the random user-agent middleware at priority 400.

Next, generate a spider:

scrapy genspider elextraonline elextraonline.com
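
Depending on your Scrapy version, the generated spider stub will look roughly like this:

import scrapy


class ElextraonlineSpider(scrapy.Spider):
    name = 'elextraonline'
    allowed_domains = ['elextraonline.com']
    start_urls = ['http://elextraonline.com/']

    def parse(self, response):
        pass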

Then open elextra/items.py and fill it with this code:
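
import scrapy


class ElextraItem(scrapy.Item):
    # One Field per piece of data we collect from a product page
    name = scrapy.Field()
    price = scrapy.Field()
    image_link = scrapy.Field()
    description = scrapy.Field()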

What we do here is define containers for the data we are going to scrape.

Open elextra/spiders/elextraonline.py and delete allowed_domains, since we don’t need it for now. Then replace start_urls with:

start_urls = ['https://elextraonline.com/kategori/logitech-products/']

Here, we will extract all the Logitech products on the website.

If you open that page in a browser, you will see that it is paginated, so we need to handle pagination. Replace the parse function with this:

def parse(self, response, **kwargs):
    # Handling pagination
    next_page: str = response.css("a.next.page-number::attr('href')").get()
    if next_page:
        print(next_page)
        yield response.follow(next_page, callback=self.parse)

You might be confused by this line of code:

response.css("a.next.page-number::attr('href')")

This is how we select elements from the page with CSS selectors: we match any anchor element with the classes next and page-number, and then grab its href attribute. You can find an element’s classes with the inspector tool in your browser.
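
If you want to experiment with selectors before putting them into the spider, Scrapy’s interactive shell is a handy place to try them; the value returned depends on the live page:

scrapy shell "https://elextraonline.com/kategori/logitech-products/"
>>> response.css("a.next.page-number::attr('href')").get()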

The spider keeps following the next-page link until there is no next-page button anymore, which means we have reached the last page.

Next, we will extract data for each product. We need to open each product’s link to get its full details. Let’s write a function to handle that first:

def parse_product(self, response):
    item: ElextraItem = ElextraItem()
    item['name'] = response.css("h1.product-title.entry-title::text").get().strip()
    item['price'] = response.css("span.woocommerce-Price-amount.amount::text").getall()[-1].strip()
    item['image_link'] = response.css(
        "img.attachment-shop_single.size-shop_single.wp-post-image::attr('src')").get().strip()

    desc_selector = Selector(text=response.css("div#tab-description").get())
    desc_text_list = desc_selector.xpath('//div//text()').getall()
    desc = ''

    for desc_text in desc_text_list:
        desc += desc_text

    desc = desc.replace('DESKRIPSI', '').strip()

    description_result = response.css("div#tab-description > p::text").extract()
    for res in description_result:
        desc += res
    item['description'] = desc

    return item

The CSS selector logic here is the same as before. We call the string strip method to remove stray whitespace and keep our data clean.

The code for getting the product description is a bit more involved. The problem is that there is no consistent element that holds the description body: sometimes it is a p, sometimes a ul. So we select a common parent and extract all the text inside it, while removing unwanted pieces such as the DESKRIPSI label.
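
To see why grabbing every text node under one parent works regardless of the markup, here is a tiny standalone sketch; the HTML string is made up for illustration:

from scrapy import Selector

html = "<div id='tab-description'><h2>DESKRIPSI</h2><ul><li>Wireless</li><li>USB receiver</li></ul></div>"
sel = Selector(text=html)
# Every text node under the div is returned, whether it sits in a p, ul, or heading
print(sel.xpath('//div//text()').getall())
# ['DESKRIPSI', 'Wireless', 'USB receiver']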

Going back to the parse function, add this above the pagination section:

product_links: list = response.css("p.name.product-title > a::attr(href)")
for product in product_links:
    href = product.extract()
    yield response.follow(href, callback=self.parse_product)

This way, we open every product page and parse it with parse_product.

Your elextra/spiders/elextraonline.py should look like this now:

import scrapy
from scrapy import Selector
from ..items import ElextraItem


class ElextraonlineSpider(scrapy.Spider):
    name = 'elextraonline'
    start_urls = ['https://elextraonline.com/kategori/logitech-products/']

    def parse(self, response, **kwargs):
        product_links: list = response.css("p.name.product-title > a::attr(href)")
        for product in product_links:
            href = product.extract()
            yield response.follow(href, callback=self.parse_product)
        # Handling pagination
        next_page: str = response.css("a.next.page-number::attr('href')").get()
        if next_page:
            print(next_page)
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response):
        item: ElextraItem = ElextraItem()
        item['name'] = response.css("h1.product-title.entry-title::text").get().strip()
        item['price'] = response.css("span.woocommerce-Price-amount.amount::text").getall()[-1].strip()
        item['image_link'] = response.css(
            "img.attachment-shop_single.size-shop_single.wp-post-image::attr('src')").get().strip()

        desc_selector = Selector(text=response.css("div#tab-description").get())
        desc_text_list = desc_selector.xpath('//div//text()').getall()
        desc = ''

        for desc_text in desc_text_list:
            desc += desc_text

        desc = desc.replace('DESKRIPSI', '').strip()

        description_result = response.css("div#tab-description > p::text").extract()
        for res in description_result:
            desc += res
        item['description'] = desc

        return item

Exporting the Results

Run this command to export the scraping results to a CSV file:

scrapy crawl elextraonline -o result.csv -t csv
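
The -t csv flag is optional on recent Scrapy versions, since the output format is inferred from the file extension. If you would rather have JSON, the same command with a .json file name should work:

scrapy crawl elextraonline -o result.json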

The result.csv will look something like this:

[Image: the resulting result.csv file with the scraped name, price, image_link, and description columns]

That’s it! You don’t have to scrape this exact website; treat this as a basic building block for your own project. We hope you find it useful.

For more information, contact Retailgators now or ask for a free quote!