Introduction

In this web scraping tutorial, we build an Amazon review scraper with Python in 3 steps. It scrapes data from Amazon product review pages, such as the review content, review title, product name, author, product rating, and review date, into a spreadsheet. The result is a simple and robust Amazon product review scraper.

Here are the 3 steps to extract Amazon reviews using Python:
  • 1. Mark up the data fields to be extracted using Selectorlib.
  • 2. Copy and run the code.
  • 3. Download the data as a spreadsheet (CSV) file that opens in Excel.

We can also show you how to extract product information from Amazon search result pages, how to avoid being blocked by Amazon, and how to scrape Amazon at a large scale.
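One common precaution against being blocked is to slow down and retry when the server answers with an error. The following is only a minimal sketch of that idea, not a complete anti-blocking setup. The function `fetch_with_backoff`, its parameters, and the `fetch` callable (any function that takes a URL and returns a status code and a body) are hypothetical names introduced for illustration.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url) until the status is below 500 or the retries run out.

    fetch is a hypothetical callable that returns (status_code, body).
    The wait doubles after every failed attempt: base_delay, 2x, 4x, ...
    """
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status < 500:
            return body
        if attempt < retries:
            # Wait longer after each failure before trying again
            sleep(base_delay * (2 ** attempt))
    return None  # still failing after all retries
```

Passing `sleep` as a parameter keeps the sketch easy to try without real waiting. In a real run the default `time.sleep` is used.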

Here are the data fields we scrape from Amazon into the spreadsheet:

  • Name of Product
  • Review title
  • Review content (text)
  • Product Ratings
  • Review Publishing Date
  • Verified Purchase
  • Name of Author
  • Product URL

We save all the data into a spreadsheet (a CSV file) that you can open in Excel.

Install the required packages for the Amazon review scraper

This tutorial extracts Amazon product reviews using Python 3 and a few libraries. We do not use Scrapy in this tutorial. The code should run quickly and easily on any computer.

If Python 3 is not installed yet, install it first, for example on your Windows PC.

We use the following libraries:

  • Python Requests, to make requests and download the HTML content of the pages (http://docs.python-requests.org/en/master/user/install/)
  • LXML, to parse the HTML tree structure using XPaths (http://lxml.de/installation.html)
  • Python Dateutil, to parse the review dates (https://retailgators/dateutil/dateutil/)
  • Selectorlib, to extract data from the pages we download using the YAML file we create.
Install them with pip3:

pip3 install python-dateutil lxml requests selectorlib

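To confirm that the installation worked, you can check that each package can be imported. This is a small sketch using only the standard library. `missing_modules` and `REQUIRED` are hypothetical names, and the list holds the import names of the four packages (python-dateutil is imported as `dateutil`).

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names of the four packages installed with pip3
REQUIRED = ["dateutil", "lxml", "requests", "selectorlib"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```

If anything is listed as missing, run the pip3 command again for that package.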

The Code

Create a file named reviews.py and paste the following Python code into it.

What does the Amazon product review scraper do?

  • 1. Reads the product review page URLs from a file named urls.txt.
  • 2. Uses a YAML file named selectors.yml that identifies the data on the Amazon page.
  • 3. Extracts the data.
  • 4. Saves the data as a CSV file named data.csv.

from selectorlib import Extractor
import requests
from time import sleep
import csv
from dateutil import parser as dateparser

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('selectors.yml')


def scrape(url):
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)
    # Simple check to see if the page was blocked (usually 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None
    # Pass the HTML of the page to the extractor and return the extracted data
    return e.extract(r.text)


with open("urls.txt", 'r') as urllist, open('data.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=["title", "content", "date", "variant", "images", "verified", "author", "rating", "product", "url"], quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for url in urllist.readlines():
        # Drop the trailing newline and skip empty lines
        url = url.strip()
        if not url:
            continue
        data = scrape(url)
        if data:
            # 'reviews' is None when no review matched the selector
            for r in data['reviews'] or []:
                r["product"] = data["product_title"]
                r['url'] = url
                # The badge text is None when the review has no badge
                if 'Verified Purchase' in (r.get('verified') or ''):
                    r['verified'] = 'Yes'
                else:
                    r['verified'] = 'No'
                # "4.0 out of 5 stars" -> "4.0"
                r['rating'] = r['rating'].split(' out of')[0]
                # Keep only the date that follows "on " in the date text
                date_posted = r['date'].split('on ')[-1]
                if r['images']:
                    r['images'] = "\n".join(r['images'])
                r['date'] = dateparser.parse(date_posted).strftime('%d %b %Y')
                writer.writerow(r)
            # sleep(5)

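The clean-up applied to each review in the loop above can also be tried on its own, without downloading anything. The sketch below is an illustration under stated assumptions, not part of reviews.py. `clean_review` is a hypothetical helper, the sample review is made up, and it parses the date with the standard library's `datetime.strptime` instead of dateutil, which assumes the date text ends in an English "Month D, YYYY" date. It also treats a missing badge as "No".

```python
from datetime import datetime


def clean_review(review):
    """Return a copy of one extracted review with normalized fields."""
    cleaned = dict(review)

    # Badge text (or None) -> "Yes" / "No"
    badge = cleaned.get("verified") or ""
    cleaned["verified"] = "Yes" if "Verified Purchase" in badge else "No"

    # "4.0 out of 5 stars" -> "4.0"
    rating = cleaned.get("rating")
    if rating:
        cleaned["rating"] = rating.split(" out of")[0]

    # "Reviewed in ... on March 3, 2020" -> "03 Mar 2020"
    raw_date = cleaned.get("date")
    if raw_date:
        date_posted = raw_date.split("on ")[-1]
        parsed = datetime.strptime(date_posted, "%B %d, %Y")
        cleaned["date"] = parsed.strftime("%d %b %Y")

    # List of image URLs -> one newline-separated string
    images = cleaned.get("images")
    if images:
        cleaned["images"] = "\n".join(images)
    return cleaned


# Made-up review used only to demonstrate the function
sample_review = {
    "rating": "4.0 out of 5 stars",
    "date": "Reviewed in the United States on March 3, 2020",
    "verified": "Verified Purchase",
    "images": None,
}
print(clean_review(sample_review))
```

Keeping the clean-up in one function makes it easier to adjust when Amazon changes the wording of the date or rating text.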

Creating the YAML file selectors.yml

You may have noticed that the code above uses a file named selectors.yml. This file is what makes this tutorial easy to create and follow.

Selectorlib is a tool that makes it easy to mark up and extract data from web pages visually. The Selectorlib Chrome extension lets you mark the data you need to scrape and generates the CSS selectors or XPaths needed to extract it.

Here is how we marked up the fields for the data we need to extract from an Amazon product review page using the Chrome extension.

Once you have created the template, click 'Highlight' to highlight and preview all of your selectors.

Here is what our template looks like:

product_title:
    css: 'h1 a[data-hook="product-link"]'
    type: Text
reviews:
    css: 'div.review div.a-section.celwidget'
    multiple: true
    type: Text
    children:
        title:
            css: a.review-title
            type: Text
        content:
            css: 'div.a-row.review-data span.review-text'
            type: Text
        date:
            css: span.a-size-base.a-color-secondary
            type: Text
        variant:
            css: 'a.a-size-mini'
            type: Text
        images:
            css: img.review-image-tile
            multiple: true
            type: Attribute
            attribute: src
        verified:
            css: 'span[data-hook="avp-badge"]'
            type: Text
        author:
            css: span.a-profile-name
            type: Text
        rating:
            css: 'div.a-row:nth-of-type(2) > a.a-link-normal:nth-of-type(1)'
            type: Attribute
            attribute: title
next_page:
    css: 'li.a-last a'
    type: Link

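With a template like this, the extractor should return a nested dictionary whose keys mirror the template: one entry per top-level field, with `reviews` holding a list of dictionaries, one per review. The sketch below uses a hand-written dictionary of that shape (every value is made up for illustration) and a hypothetical helper, `review_rows`, to show how it flattens into one row per review.

```python
# Hand-written example of the shape the template is expected to produce.
# All values are invented; fields that match nothing come back as None.
sample = {
    "product_title": "Example Earplugs",
    "reviews": [
        {
            "title": "Works well",
            "content": "Blocks most of the noise.",
            "date": "Reviewed in the United States on March 3, 2020",
            "variant": "Size: 50 Pair",
            "images": None,
            "verified": "Verified Purchase",
            "author": "Jane",
            "rating": "4.0 out of 5 stars",
        },
        {
            "title": "Too small",
            "content": "They keep falling out.",
            "date": "Reviewed in the United States on April 12, 2020",
            "variant": None,
            "images": None,
            "verified": None,
            "author": "Sam",
            "rating": "2.0 out of 5 stars",
        },
    ],
    "next_page": None,
}


def review_rows(data):
    """Flatten the nested result into one dictionary per review."""
    rows = []
    for review in data.get("reviews") or []:
        row = dict(review)
        # Copy the page-level product title onto every review row
        row["product"] = data.get("product_title")
        rows.append(row)
    return rows


for row in review_rows(sample):
    print(row["product"], "|", row["author"], "|", row["rating"])
```

Each row carries the same keys as the template's children plus the product title, which is the layout the CSV writer needs.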

Running the Amazon review scraper

Add the URLs you want to scrape to a text file named urls.txt in the same folder as reviews.py, one URL per line. For example, to scrape the reviews of two separate products, such as earplugs and headphones, the file would contain the review page URL of each product.

Then run the scraper with the command:

python3 reviews.py
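Because urls.txt is read one line at a time, blank lines and stray spaces are worth filtering out before any request is made. The following is a minimal sketch of such a reader, assuming one URL per line. `read_urls` is a hypothetical helper, not part of reviews.py.

```python
from pathlib import Path


def read_urls(path="urls.txt"):
    """Return the non-empty lines of the file with whitespace removed."""
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:
            urls.append(line)
    return urls
```

Calling `read_urls()` in the folder that holds urls.txt returns the list of URLs in file order, ready to be passed to the scraper one by one.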