"Web Mining crawling prices of an internet page"
Guys,
I am trying to create a process to crawl web pages from a site in order to get the prices of a variety of products. I created a loop, because I want to crawl page by page and save each page to my disk; after that I want to take the HTML saved on my disk and extract only the product name and price, for example, but I'm not able to do that. Would you guys please help me?
I was able to get the pages in sequence, but somehow I can't save them to disk, as they keep getting overwritten.
First I want to collect the pages:
https://www.buscape.com.br/cerveja?pagina=1
https://www.buscape.com.br/cerveja?pagina=2
...
https://www.buscape.com.br/cerveja?pagina=200
My process is below:
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
  <parameter key="url" value="https://www.buscape.com.br/cerveja?pagina=%{page}"/>
  <connect from_op="Crawl Web" from_port="example set" to_port="output 2"/>
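For reference, here is a rough Python sketch of the same loop outside RapidMiner (the pages/ folder name is my own choice). The key point is that each file name contains the page number, so page 2 does not overwrite page 1; in RM the equivalent is to also put the %{page} macro into the output file name.

import os
import urllib.request

os.makedirs("pages", exist_ok=True)
for page in range(1, 201):
    url = f"https://www.buscape.com.br/cerveja?pagina={page}"
    html = urllib.request.urlopen(url).read()
    # one file per page: pages/cerveja_001.html, pages/cerveja_002.html, ...
    with open(f"pages/cerveja_{page:03d}.html", "wb") as f:
        f.write(html)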
After that, when I have all the pages collected, I was trying to use XPath to get only the field I need inside the HTML.
But somehow, when I copy-paste it from Google, it doesn't work.
Can you guys please help me create a simple example process?
Thanks in advance.
Best Answer
luiz_vidal, Member, Posts: 14, Contributor II
Ugh,
After almost giving up, I was able to retrieve the piece of data I want; the thing is that it only brings back the first one it finds.
I need to find a way to fetch all product names and prices.
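As a sketch of how to get every match instead of only the first one, here is the same extraction in Python's lxml (a stand-in for the RM operator; the relative paths are the ones from the answer below and may not match the current markup):

from lxml import html

tree = html.parse("pages/cerveja_001.html")
# every element whose id starts with "product_", not only product_383527
for product in tree.xpath('//*[starts-with(@id, "product_")]'):
    names = product.xpath('./div/div[1]/div[3]/div[1]/a/span/text()')
    prices = product.xpath('./div/div[2]/div[1]/div[1]/a/span/text()')
    print(names, prices)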
Answers
Hi Luiz-Vidal,
I came across that issue a few days ago.
Just copying and pasting the XPath from Google won't work, due to namespaces.
Google gives
//*[@id="product_383527"]/div/div[1]/div[3]/div[1]/a/spanfor the first product:Paulistânia Puro Malte Premium Lager Garrafa 600 ml 1 Unidadeand
//*[@id="product_383527"]/div/div[2]/div[1]/div[1]/a/spanfor the price 14,99
In RM you have to use //*[@id="product_383527"]/h:div/h:div[1]/h:div[3]/h:div[1]/h:a/h:span
and //*[@id="product_383527"]/h:div/h:div[2]/h:div[1]/h:div[1]/h:a/h:span
See the discussion here: https://community.www.turtlecreekpls.com/t5/RapidMiner-Studio-Forum/Extracting-Information-With-XPath/td-p/9883
Cheers
miner
Hey,
Thanks for your reply
I still can't make it work, though.
Any idea what I am doing wrong?
I'm not quite sure.
The website is using product-id for reference.
For the first product, the XPath I took was //*[@id="product_383527"]; assuming the id changes for every product, that XPath only works for this specific product.
Then you would have to go up the tree to get a "non-id-related" node and then pick the detail from there.
That would be /html/body/main/div[3]/div/div[3]/section?
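A small sketch of that idea, again in lxml; the section path is the one guessed above, and the assumption that each child div of the section is one product is mine:

from lxml import html

tree = html.parse("pages/cerveja_001.html")
# anchor on a stable, non-id ancestor and work with paths relative to it
sections = tree.xpath('/html/body/main/div[3]/div/div[3]/section')
if sections:
    for product in sections[0].xpath('./div'):  # assumed: one child div per product
        print(product.xpath('.//a/span/text()'))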
Sorry,
I know nothing about XPath; I've been trying all day to get it working.
I try, try, try, and the Extract Document operator returns me only true or false or ?
I've been searching and trying with //input[@name="productName"], and it returns true or false.. but what I want is the value for productName and for priceProduct, which will probably have to be returned in a list, or as a huge string to be split; I don't know yet.
A victory would be just getting one value returned
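The true/false result usually means the XPath is being evaluated as a test for the node rather than returning its content. A small sketch of the difference, where the value attribute is an assumption about the page's markup:

from lxml import html

tree = html.parse("pages/cerveja_001.html")
exists = bool(tree.xpath('//input[@name="productName"]'))   # True/False: does the node exist?
values = tree.xpath('//input[@name="productName"]/@value')  # the actual value(s), as a list of strings
print(exists, values)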
Hi @luiz_vidal,
xpath can be a mess...
A good way to test XPath strings is to use Google Docs, where you can quickly copy the XPath from Chrome into the spreadsheet and test the result. This is much faster than testing the structure in RM.
On YouTube you can find a lot of tutorials on XPath and Google Docs.
My recommendation is the video by community member el chief; find it here: https://www.youtube.com/watch?v=UG6223p9fZE
Cheers
miner
Overall,
It was a matter of getting to know how to use XPath and configuring it correctly in the operators.
Thanks for your help
"xpath can be a mess..."
Definitely agree, but it's powerful when it works.
Help me please. Which currency is best to mine? Here (https://en.bitcoinwiki.org/wiki/Web_mining) it is written that experts advise "Monero".
Hi,
Trying the XPaths in a shell environment can make things faster.
A simple command line tool is XML Shell:
http://www.xmlsh.org/CommandXPath
You can also find the same functionality in Python's Scrapy, but it is overkill for your actual needs; a quick-test sketch follows below.
Regards,
Sebastian
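A hedged sketch of that kind of quick test with Scrapy's Selector, no full spider needed; the XPath is the prefix-free, starts-with form used earlier in the thread:

import requests
from scrapy.selector import Selector

body = requests.get("https://www.buscape.com.br/cerveja?pagina=1").text
sel = Selector(text=body)
# all product name/price spans in one go, as a list of strings
print(sel.xpath('//*[starts-with(@id, "product_")]//a/span/text()').extract())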