"Web Mining crawling prices of an internet page"
Guys,
I am trying to create a process to crawl web pages from a site in order to get the prices of a variety of products. I created a loop, because I want to crawl page by page and save each page to my disk; after that I want to take the HTML saved on my disk and extract only the product name and price, for example, but I'm not able to do that. Would you guys please help me?
I was able to get the pages in sequence, but somehow I can't save them to disk, as they keep getting overwritten.
First I want to collect the pages:
https://www.buscape.com.br/cerveja?pagina=1
https://www.buscape.com.br/cerveja?pagina=2
...
https://www.buscape.com.br/cerveja?pagina=200
My process is below:
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
  <parameter key="url" value="https://www.buscape.com.br/cerveja?pagina=%{page}"/>
  <connect from_op="Crawl Web" from_port="example set" to_port="output 2"/>
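For reference, here is a rough Python sketch of the same loop outside RapidMiner (the pages/ folder name is my own choice). The key point is that each file name contains the page number, so page 2 does not overwrite page 1; in RM the equivalent is to also put the %{page} macro into the output file name.

import os
import urllib.request

os.makedirs("pages", exist_ok=True)
for page in range(1, 201):
    url = f"https://www.buscape.com.br/cerveja?pagina={page}"
    html = urllib.request.urlopen(url).read()
    # one file per page: pages/cerveja_001.html, pages/cerveja_002.html, ...
    with open(f"pages/cerveja_{page:03d}.html", "wb") as f:
        f.write(html)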
After that, when I have all the pages collected, I was trying to use XPath to get only the field I need inside the HTML.
But somehow, when I copy-paste it from Google, it doesn't work.
Can you guys please help me create a simple example process?
Thanks in advance.
Best Answer
luiz_vidal, Member, Posts: 14, Contributor II
Ugh,
After almost giving up, I was able to retrieve the piece of data I want; the thing is that it only brings back the first one it finds.
I need to find a way to fetch all product names and prices.
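As a sketch of how to get every match instead of only the first one, here is the same extraction in Python's lxml (a stand-in for the RM operator; the relative paths are the ones from the answer below and may not match the current markup):

from lxml import html

tree = html.parse("pages/cerveja_001.html")
# every element whose id starts with "product_", not only product_383527
for product in tree.xpath('//*[starts-with(@id, "product_")]'):
    names = product.xpath('./div/div[1]/div[3]/div[1]/a/span/text()')
    prices = product.xpath('./div/div[2]/div[1]/div[1]/a/span/text()')
    print(names, prices)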
Answers
Hi Luiz-Vidal,
I came across that issue a few days ago.
Just copying and pasting the XPath from Google won't work, due to namespaces.
Google gives
//*[@id="product_383527"]/div/div[1]/div[3]/div[1]/a/spanfor the first product:Paulistânia Puro Malte Premium Lager Garrafa 600 ml 1 Unidadeand
//*[@id="product_383527"]/div/div[2]/div[1]/div[1]/a/spanfor the price 14,99
In RM you have to use //*[@id="product_383527"]/h:div/h:div[1]/h:div[3]/h:div[1]/h:a/h:span
and //*[@id="product_383527"]/h:div/h:div[2]/h:div[1]/h:div[1]/h:a/h:span
See the discussion here: https://community.www.turtlecreekpls.com/t5/RapidMiner-Studio-Forum/Extracting-Information-With-XPath/td-p/9883
Cheers
miner
Hey,
Thanks for your reply
I still can't make it work, though.
Any idea what I am doing wrong?
I'm not quite sure.
The website is using product-id for reference.
For the first product, the XPath I took was //*[@id="product_383527"]; assuming the id changes for every product, that XPath only works for this specific product.
Then you would have to go up the tree to get a "non-id-related" node and then pick the detail from there.
That would be /html/body/main/div[3]/div/div[3]/section?
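A small sketch of that idea, again in lxml; the section path is the one guessed above, and the assumption that each child div of the section is one product is mine:

from lxml import html

tree = html.parse("pages/cerveja_001.html")
# anchor on a stable, non-id ancestor and work with paths relative to it
sections = tree.xpath('/html/body/main/div[3]/div/div[3]/section')
if sections:
    for product in sections[0].xpath('./div'):  # assumed: one child div per product
        print(product.xpath('.//a/span/text()'))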
Sorry,
I know nothing about XPath; I've been trying all day to get it working.
I try, try, try, and the Extract Document operator returns me only true or false or ?
I've been searching and trying with //input[@name="productName"], and it returns true or false.. but what I want is the value for productName and for priceProduct, which will probably have to be returned in a list, or as a huge string to be split; I don't know yet.
A victory would be just getting one value returned
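The true/false result usually means the XPath is being evaluated as a test for the node rather than returning its content. A small sketch of the difference, where the value attribute is an assumption about the page's markup:

from lxml import html

tree = html.parse("pages/cerveja_001.html")
exists = bool(tree.xpath('//input[@name="productName"]'))   # True/False: does the node exist?
values = tree.xpath('//input[@name="productName"]/@value')  # the actual value(s), as a list of strings
print(exists, values)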
Hi @luiz_vidal,
xpath can be a mess...
A good way to test XPath strings is to use Google Docs, where you can quickly copy the XPath from Chrome into the spreadsheet and test the result. This is much faster than testing the structure in RM.
On YouTube you can find a lot of tutorials on XPath and Google Docs.
My recommendation is the video by community member el chief; find it here: https://www.youtube.com/watch?v=UG6223p9fZE
Cheers
miner
Overall,
It was a matter of getting to know how to use XPath and configuring it correctly in the operators.
Thanks for your help
"xpath can be a mess..."
Definitely agree, but it's powerful when it works.
Help me please. Which currency is best to mine? Here (https://en.bitcoinwiki.org/wiki/Web_mining) it is written that experts advise "Monero".
Hi,
Trying the XPaths in a shell environment can make things faster.
A simple command line tool is XML Shell:
http://www.xmlsh.org/CommandXPath
You can also find the same functionality in Python's Scrapy, but it is overkill for your actual needs; a quick-test sketch follows below.
Regards,
Sebastian
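A hedged sketch of that kind of quick test with Scrapy's Selector, no full spider needed; the XPath is the prefix-free, starts-with form used earlier in the thread:

import requests
from scrapy.selector import Selector

body = requests.get("https://www.buscape.com.br/cerveja?pagina=1").text
sel = Selector(text=body)
# all product name/price spans in one go, as a list of strings
print(sel.xpath('//*[starts-with(@id, "product_")]//a/span/text()').extract())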