Text Mining - Industry 4

charlesmrt Member Posts: 4 Contributor I
edited December 2018 in Help

Hey,

I want to extract all the text from this page: http://www.plattform-i40.de/I40/Navigation/Karte/SiteGlobals/Forms/Formulare/EN/map-use-cases-formular.html and create a table with different factors extracted from that text. Each row is a case and each column is one item extracted from the text. I think I'll use 6 columns: Value Creation, Product Examples, Region, ...

Then I want to compare these data with an external case to find out which rows fit it best. For instance: given case X fits 80% with the company on line 35, 60% with the company on line 118, etc.
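For the matching step, one simple option (an assumption here, not something this thread settles on) is to treat each row as a bag of words and score its overlap with the given case using Jaccard similarity. The sketch below is plain Python rather than a RapidMiner process; the column values and the names `tokens`, `jaccard` and `rank_cases` are made up for illustration.

```python
import re


def tokens(case):
    """Return the set of lower-cased words over all column values of one case."""
    words = set()
    for value in case.values():
        words.update(re.findall(r"[a-z0-9]+", value.lower()))
    return words


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_cases(given, table):
    """Return (line_number, percent_match) pairs, best match first."""
    given_tokens = tokens(given)
    scored = [
        (jaccard(given_tokens, tokens(row)), line)
        for line, row in enumerate(table, start=1)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(line, round(score * 100)) for score, line in scored]


# Hypothetical rows; the real table would come from the scraped pages.
table = [
    {"Region": "Bavaria", "Product Examples": "software solution"},
    {"Region": "Hesse", "Product Examples": "robot gripper"},
]
given_case = {"Region": "Bavaria", "Product Examples": "software platform"}

for line, percent in rank_cases(given_case, table):
    print(f"line {line}: {percent}% match")
```

Word overlap is crude; weighting columns differently or using TF-IDF vectors with cosine similarity would be natural refinements once the table exists.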

Do you know how I can do all of that?

It's for my Master Thesis.

Thanks a lot,

Charles

Answers

  • luc_bartkowski Member Posts: 46 Maven

    To summarize the first part of your question: you want to scrape this web page and obtain the information on it. So how do you do this?

    This web page is clearly built from a combination of HTML, CSS and JavaScript (see the picture). All the information is there, but not all of it is in plain HTML, so "traditional" web scraping doesn't give the required results. The data is still available, but you have to do something smarter, such as using XPath on the page document to find and retrieve each individual piece of (AJAX/JavaScript) data. You can do that in RapidMiner: have a look at the tutorial by a guy called El Chief on YouTube: https://www.youtube.com/watch?v=vKW5yd1eUpA

    RMscaping.jpeg

  • charlesmrt Member Posts: 4 Contributor I

    Hey,

    Thanks a lot for answering. I didn't manage to extract data from the HTML page. The link you sent seems very useful, but the classes used are not exactly the same and I can't find the correct XPath to extract the data.

    Could you help me, if you know how to correctly extract data from HTML?

    From this page: http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/082-research-and-development-center-in-the-field-of-industrial-automation/article-research-and-development-center-in-the-field-of-industrial-automation.html I want to extract Manufacturing industry and automatically link it with Application example.

    Thanks,

    Charles

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @charlesmrt, welcome to the community. It was my hope that @ey's nice "Read HTML Table" operator would do the trick here, but alas it did not. However, using "Get Page" and "Extract Content" gets you pretty far:

    [Process XML not preserved. Only the page URL survives: http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/082-research-and-development-center-in-the-field-of-industrial-automation/article-research-and-development-center-in-the-field-of-industrial-automation.html]

    Scott

  • charlesmrt Member Posts: 4 Contributor I

    Hey,

    Thanks for answering. In the attached file you can see the HTML. I just want to extract "software solution". I tried "//*[contains(.,'Product example')]/../span[last()]" and "//*[contains(.,'Product example')]/../span[1]", but neither works. What should I do?

    The link: http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/150-smart-engineering-and-production-4-0-en/article-smart-engineering-and-production-4-0-en.html
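    One possible reason the expressions above misbehave: `contains(., '...')` tests an element's whole string value, so it is also true for every ancestor of the label (body, html, ...), and the `/../span[...]` step is then evaluated from far too many starting points. Anchoring on the label element itself avoids that. Below is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup. The snippet is a made-up stand-in, since the real page structure is only visible in the attached screenshot, and `value_for_label` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the label/value markup on the page.
SNIPPET = """
<div>
  <p><span>Product example</span><span>software solution</span></p>
  <p><span>Region</span><span>Bavaria</span></p>
</div>
"""


def value_for_label(root, label):
    """Find the span whose text equals `label`, then return the text of
    the last span under the same parent (assumed to hold the value)."""
    # [.='text'] matches on the element's own complete text, so ancestors
    # that merely contain the label are not selected.
    # Note: labels containing a single quote would need escaping.
    path = f".//span[.='{label}']/../span[last()]"
    match = root.find(path)
    if match is None or match.text is None:
        return None
    return match.text.strip()


root = ET.fromstring(SNIPPET)
print(value_for_label(root, "Product example"))
```

    In full XPath 1.0 the same idea would read something like `//span[text()='Product example']/following-sibling::span[1]`, again assuming the label and value are sibling spans.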

    Thanks,

    Charles

    Path.JPG 20.7K
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    oh that seems very complicated. I would use RegEx.

    [Process XML not preserved. Only the page URL survives: http://www.plattform-i40.de/I40/Redaktion/EN/Use-Cases/082-research-and-development-center-in-the-field-of-industrial-automation/article-research-and-development-center-in-the-field-of-industrial-automation.html]

    Scott

  • charlesmrt Member Posts: 4 Contributor I

    Thanks,

    I found another way to do it: I downloaded the HTML pages to my computer with "Download them all", then used text processing and Extract Information with regular expressions. I obtained a table containing all the information.
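    For readers outside RapidMiner, the same workflow (one regular expression per column, one row per downloaded page) can be sketched in plain Python. The patterns, the label/value markup and the names `extract_row` and `build_table` are assumptions for illustration; the real pages may be structured differently.

```python
import re

# One pattern per output column. The markup these patterns expect is hypothetical.
PATTERNS = {
    "Product example": r"Product example</span>\s*<span>(.*?)</span>",
    "Region": r"Region</span>\s*<span>(.*?)</span>",
}


def extract_row(html, patterns=PATTERNS):
    """Apply each column's pattern once and keep the first match (or '')."""
    row = {}
    for column, pattern in patterns.items():
        match = re.search(pattern, html, flags=re.DOTALL)
        row[column] = match.group(1).strip() if match else ""
    return row


def build_table(pages):
    """pages maps a file name to its HTML text; returns one dict per page."""
    return [
        {"file": name, **extract_row(html)}
        for name, html in sorted(pages.items())
    ]


# In practice the HTML would be read from the downloaded files on disk.
pages = {
    "a.html": "<p><span>Product example</span> <span>software solution</span></p>"
              "<p><span>Region</span><span>Bavaria</span></p>",
    "b.html": "<p><span>Region</span><span>Hesse</span></p>",
}
for row in build_table(pages):
    print(row)
```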

    But I still have a question. With regular expressions I can extract only one expression per column of my table, because the query expression is unique, but sometimes there are several matches for one attribute name. How can I get multiple matches into one column? I used "|", but it makes a disjunction of elements, not an accumulation.
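    To illustrate the difference: "|" is alternation, so a single search still stops at the first hit of either alternative. Accumulating values means collecting every match of the pattern and joining them into one cell. A minimal sketch in plain Python follows (not a RapidMiner operator setting; the markup and the helper name `all_matches` are assumptions).

```python
import re


def all_matches(pattern, text, sep="; "):
    """Collect every match of the pattern's single capture group, drop
    duplicates while keeping first-seen order, and join into one string."""
    found = [m.strip() for m in re.findall(pattern, text, flags=re.DOTALL)]
    unique = list(dict.fromkeys(found))
    return sep.join(unique)


# Hypothetical markup where one attribute has several values on a page.
html = (
    "<li class='industry'>Automotive</li>"
    "<li class='industry'>Mechanical engineering</li>"
    "<li class='industry'>Automotive</li>"
)
print(all_matches(r"<li class='industry'>(.*?)</li>", html))
```

    The joined string fits in a single table column and can be split again on the separator later if the individual values are needed.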

    Thanks,

    Charles

    Capture3.JPG 75.6K