Text Mining - Industry 4

charlesmrt · November 2017

Hey,

I want to extract all the texts from this page: http://www.plattform-i40.de/I40/Navigation/Karte/SiteGlobals/Forms/Formulare/EN/map-use-cases-formular.html and create a table of factors extracted from these texts, where each row is a case and each column is a piece of data extracted from the text. I think I'll use six columns: Value Creation, Product Examples, Region, ...

Then I want to compare those data to find out which cases best fit an external given case. For instance: given case X fits at 80% with the company on line 35, at 60% with the company on line 118, etc.

Do you know how I can do all of that?

It's for my Master Thesis.

Thanks a lot,

Charles
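The matching step asked about above (a given case fitting at 80% with one row, 60% with another) could be approached with a text similarity score. The following is only a minimal sketch, assuming each row of the table is reduced to a short text and that cosine similarity over word counts is an acceptable measure; the function names and the two sample rows are hypothetical, not taken from the site.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two texts (0.0 to 1.0)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def rank_cases(query, cases):
    """Return (score, name) pairs sorted from best to worst fit."""
    scored = [(cosine_similarity(query, text), name) for name, text in cases.items()]
    return sorted(scored, reverse=True)


# Hypothetical rows standing in for the extracted table.
cases = {
    "line 35": "software solution for smart engineering in mechanical engineering",
    "line 118": "research and development center for industrial automation",
}

for score, name in rank_cases("smart engineering software", cases):
    print(f"{name}: {score:.0%}")
```

Word counts are the simplest possible representation; weighting schemes such as TF-IDF would be a natural refinement once the table exists.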

luc_bartkowski · November 2017

To summarize the first part of your question: you want to scrape this web page and obtain the information it contains. So how do you do this?

This web page is clearly the result of a combination of HTML, CSS and JavaScript. See the picture. All the information is there, but not all of it is in plain HTML, so "traditional" web scraping doesn't give the required results. The data is still available, but you have to do something smarter, such as using XPath on the web page document to find and retrieve every individual piece of (AJAX/JavaScript) data. You can do that in RapidMiner: have a look at the tutorial by a guy called El Chief on YouTube. https://www.youtube.com/watch?v=vKW5yd1eUpA
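To illustrate the XPath idea outside RapidMiner, here is a minimal Python sketch. It assumes a well-formed fragment in which each label sits in a <strong> element next to a <span> holding the value; that markup and the "Example region" value are hypothetical, and the real page may be structured differently. Python's built-in ElementTree supports only a subset of XPath (no contains(), for example), which is enough for this layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; the real page markup may differ.
PAGE = """
<div class="use-case">
  <p><strong>Product example</strong><span>Software solution</span></p>
  <p><strong>Region</strong><span>Example region</span></p>
</div>
"""


def extract_fields(markup, labels):
    """Build one table row: label -> text of the <span> next to that label."""
    root = ET.fromstring(markup)
    row = {}
    for label in labels:
        # Any element that has a <strong> child whose text equals the label,
        # then that element's <span> child.
        span = root.find(f".//*[strong='{label}']/span")
        row[label] = span.text.strip() if span is not None and span.text else None
    return row


print(extract_fields(PAGE, ["Product example", "Region"]))
```

Each call yields one row of the table; running it over every use-case page would give one row per case. Real-world HTML is often not well-formed XML, so an HTML-aware parser would be needed before this step.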

charlesmrt · November 2017

Hey,

Thanks a lot for answering. I didn't manage to extract data from the HTML page. The link you sent me seems very useful, but the classes used are not exactly the same and I can't find the correct XPath to extract the data.

Could you help me, if you know how to correctly extract data from HTML?

From this page: http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/082-research-and-development-center-in-the-field-of-industrial-automation/article-research-and-development-center-in-the-field-of-industrial-automation.html I want to extract Manufacturing industry and automatically link it with Application example.

Thanks,

Charles

sgenzer · November 2017

Hello @charlesmrt - welcome to the community. It was my hope that @ey's nice "Read HTML Table" operator would do the trick here, but alas it did not. However, using "Get Page" and "Extract Content" gets you pretty far:

[Process XML not preserved; the only surviving fragment is the page URL: http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/082-research-and-development-center-in-the-field-of-industrial-automation/article-research-and-development-center-in-the-field-of-industrial-automation.html]

Scott

charlesmrt · November 2017

Hey,

Thanks for answering. In the attached file you can see the HTML. I just want to extract "software solution". I tried "//*[contains(.,'Product example')]/../span[last()]" and "//*[contains(.,'Product example')]/../span[1]", but neither works. What should I do?

The link: http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/150-smart-engineering-and-production-4-0-en/article-smart-engineering-and-production-4-0-en.html

Thanks,

Charles
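One likely reason the expressions above misbehave: in XPath 1.0, contains(., 'Product example') tests an element's full string value, which includes the text of all its descendants, so //* matches every ancestor of the label as well as the label itself. The sketch below mimics that behaviour in plain Python on a hypothetical fragment (the attached HTML is not available, so the markup is an assumption) and contrasts it with matching on an element's own text.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real page markup may differ.
PAGE = """
<html><body>
  <div>
    <p><strong>Product example</strong><span>Software solution</span></p>
  </div>
</body></html>
"""


def matches_by_string_value(root, needle):
    """Tags whose full text, descendants included, contains needle.

    This mimics the XPath test contains(., needle).
    """
    return [el.tag for el in root.iter() if needle in "".join(el.itertext())]


def matches_by_own_text(root, needle):
    """Tags whose own leading text node contains needle."""
    return [el.tag for el in root.iter() if needle in (el.text or "")]


root = ET.fromstring(PAGE)
print(matches_by_string_value(root, "Product example"))
# ['html', 'body', 'div', 'p', 'strong']
print(matches_by_own_text(root, "Product example"))
# ['strong']
```

In a full XPath engine, restricting the test to the element's own text nodes, for example //*[text()[contains(., 'Product example')]]/following-sibling::span[1], avoids the ancestor matches. Whether that exact expression fits the real page depends on its markup.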

sgenzer · November 2017

Oh, that seems very complicated. I would use RegEx.

[Process XML not preserved; the only surviving fragment is the page URL: http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/082-research-and-development-center-in-the-field-of-industrial-automation/article-research-and-development-center-in-the-field-of-industrial-automation.html]

Scott
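The process from this reply is not preserved, so the following is only a sketch of what a regular-expression approach could look like, written in Python rather than as RapidMiner operators. It assumes the label is closed by one tag and the value sits inside the next tag; the sample markup and the function name are hypothetical.

```python
import re
from html import unescape

# Hypothetical markup; the real page may differ.
HTML = "<p><strong>Product example</strong> <span>Software solution</span></p>"


def extract_after_label(html, label):
    """Return the text of the first tag that follows the label's closing tag."""
    pattern = re.compile(
        re.escape(label)       # the literal label text
        + r"\s*</[^>]+>"       # closing tag of the label element
        + r"\s*<[^>]+>\s*"     # opening tag of the value element
        + r"([^<]+)",          # the value: everything up to the next tag
        re.IGNORECASE,
    )
    match = pattern.search(html)
    return unescape(match.group(1)).strip() if match else None


print(extract_after_label(HTML, "Product example"))
```

A pattern like this is tied to one specific layout and breaks if the markup between label and value changes, which is the usual trade-off of regular expressions against XPath.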

charlesmrt · November 2017

Thanks,

I found another way to do it: I downloaded the HTML pages to my computer with "Download them all", then used text processing and Extract Information with a regular expression. I obtained a table containing all the information.

But I still have a question. With a regular expression I can extract only one match per column of my table, because each attribute has a single query expression, but sometimes there are several matches for one attribute name. How can I get multiple matches into one column? I tried "|", but it produces a disjunction of elements (one or the other), not an accumulation.

Thanks,

Charles
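On the last question: alternation with "|" lets a single match be one alternative or another, so it never accumulates values. Outside RapidMiner, one way to collect every match of a pattern into a single cell is to find all matches and join them. This is a minimal Python sketch; the list markup, the class name and the second value are hypothetical.

```python
import re


def extract_all(text, pattern, sep="; "):
    """Join every match of the pattern's single capture group into one string."""
    values = [value.strip() for value in re.findall(pattern, text)]
    unique = list(dict.fromkeys(values))  # drop duplicates, keep first-seen order
    return sep.join(unique)


# Hypothetical markup with two values for the same attribute.
HTML = (
    '<li class="product">Software solution</li>'
    '<li class="product">Sensor system</li>'
)

print(extract_all(HTML, r'<li class="product">([^<]+)</li>'))
# Software solution; Sensor system
```

The joined string can then be stored as one column value, or split again later if each match should become its own row.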
