Crawling Amazon for Review Text

dhunnewe Member Posts: 2 Contributor I
edited November 2018 in Help

Hello,

I am looking to better understand how to use the "Crawl Web" operator to pull review text from Amazon.

I have looked through a few posts but nothing seems to be getting at exactly what I am looking for. The goal would be to use Amazon's link structure to scan all of the reviews for a given product.

Below is the basic link structure for getting reviews for item "B019XFKM3M". The only parts of the link that need to be changed or looped over are the trailing numbers in "paging_btm_1" and "pageNumber=1". Changing these numbers to 1, 2, 3, ... would let us scan the corresponding page of reviews.

https://www.amazon.com/product-reviews/B019XFKM3M/ref=cm_cr_arp_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1

How would I set this up using the "Crawl Web" operator and, further, how would I pull just the review text and star rating?

Any help would be greatly appreciated.

Best,

Dan

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    The "Crawl Web" operator has the option to add crawling rules in the parameters. You basically need to set up rules that correspond to your root URL and then use regular expressions to define the possible variations (like the final ...reviews&pageNumber=x portion of the URL). This is a very typical use case for the operator, and with a bit of trial and error you should be able to get it performing as you wish. You'll also want to look at the crawl depth parameter, which controls how many successive pages the crawler will follow.
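    As a rough sketch of what such a regular expression could look like, here is the pattern tried out in Python against the URL from the question. The names ASIN, RULE and page_url are made up for illustration, and I am assuming the operator accepts the same plain regex syntax in its URL-matching rules; treat it as a starting point for the trial and error, not a tested configuration.

```python
import re

# Product ID copied from the URL in the question.
ASIN = "B019XFKM3M"

# Hypothetical rule pattern: the fixed part of the URL is escaped literally,
# and the two page counters are replaced by \d+ so any page number matches.
RULE = re.compile(
    r"https://www\.amazon\.com/product-reviews/" + re.escape(ASIN)
    + r"/ref=cm_cr_arp_d_paging_btm_\d+"
    + r"\?ie=UTF8&reviewerType=all_reviews&pageNumber=\d+"
)


def page_url(page: int) -> str:
    """Build the review URL for one page by filling in both counters."""
    return (
        f"https://www.amazon.com/product-reviews/{ASIN}"
        f"/ref=cm_cr_arp_d_paging_btm_{page}"
        f"?ie=UTF8&reviewerType=all_reviews&pageNumber={page}"
    )


# Generate the first three page URLs and check each against the rule.
urls = [page_url(n) for n in range(1, 4)]
for url in urls:
    print(url, bool(RULE.fullmatch(url)))
```

    Checking the pattern outside the crawler first makes it easier to tell whether a failed crawl is down to the rule or to something else, such as the crawl depth.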

    As far as saving only certain elements from the resulting page (like the text and rating), that can be quite a bit more complicated. You'll probably end up with some combination of Cut Document and Extract Content and then you'll need to Process Documents later to tokenize the review text, etc. The exact configuration of those operators is highly dependent on the data retrieved from the page, so you may also need to get creative with text searching or regular expressions to keep only the pieces that you want.
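    To give an idea of the regular-expression approach to keeping only the rating and the text, here is a minimal Python sketch. The HTML in SAMPLE is invented for illustration (the class names "rating" and "review-text" are assumptions, not the markup of any real page), so the pattern would have to be adapted to whatever the crawl actually returns.

```python
import re

# Invented markup standing in for a crawled page; real pages will differ.
SAMPLE = """
<div class="review">
  <span class="rating">4.0 out of 5 stars</span>
  <span class="review-text">Works well, battery life is fine.</span>
</div>
<div class="review">
  <span class="rating">2.0 out of 5 stars</span>
  <span class="review-text">Stopped charging after a month.</span>
</div>
"""

# One match per review: a star rating followed by the review text.
REVIEW = re.compile(
    r'<span class="rating">(?P<stars>\d(?:\.\d)?) out of 5 stars</span>\s*'
    r'<span class="review-text">(?P<text>.*?)</span>',
    re.DOTALL,
)


def extract_reviews(html: str) -> list[tuple[float, str]]:
    """Return (stars, text) pairs for every review found in the HTML."""
    return [(float(m["stars"]), m["text"].strip()) for m in REVIEW.finditer(html)]


reviews = extract_reviews(SAMPLE)
for stars, text in reviews:
    print(stars, text)
```

    The same idea carries over to Cut Document: one expression isolates each review, and the captured groups become the fields you keep for later tokenizing.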

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @dhunnewe, welcome to the community. I did some searching and could not find the link, BUT I am 99.9% sure that "scraping" amazon.com is against their Terms of Service. Using an operator like "Crawl Web" would therefore violate their policy, so I cannot really help you use this operator to do what you want. :)

    That said, you can accomplish what you want in a better, and legal, way using their Product Advertising API. As others know on this forum, I am a huge REST API advocate and use them all the time with either the "Enrich Data via Webservice" operator or other methods. I would strongly suggest that you try going this route.

    Scott

  • diggy Member Posts: 1 Newbie
    I've also needed to extract Amazon data and, as @sgenzer says, you can use the Amazon API. The problem is that it's very restricted, so I ended up using proxycrawl.com; that way I didn't have to deal with Amazon directly, and a third party deals with them instead. Posting here in case someone is in the same situation.