[ALMOST SOLVED] Web Crawling and Text Editing challenge

leon86it Member Posts: 2 Contributor I
edited June 2019 in Help
Kind people of the rapid-i,
I'm brand new to the RapidMiner world and I am dealing with a project that is turning out harder than expected. Maybe it's just that I am still learning all of RM's tools and operators, but here's the situation:

I've got a website with news and articles (i.e. www.parolibero.it).

I would like to do three things:
1. Extract the articles from the website (in text format or, even better, in XML format, keeping tags such as title, subtitle, body...)
2. Create an Excel list of the articles with the title and URL of each article
3. Export the data in a graphical format to highlight some chosen differences: for example, I would like to get a diagram showing how many articles were written in a specific year or by a specific journalist (how can I apply search filters once I have downloaded the data files?)

I tried using web crawling, but all I get is the home page in txt format and then the Excel file with just one record.

Can you please help me? At the very least, I would like to know where I went wrong or which operators to use for this.

Thank you very much indeed for your help!
Leon

P.S. There is no copyright issue at all, as I am on the staff of that website.

Answers

  • Nils_Woehler Member Posts: 463 Maven
    Hi,

    to extract information from the site you can for example use the Get Page Operator followed by Cut Documents and Extract Information, see here:

    [process XML lost in extraction; only these fragments survive]
    [...] http://www.parolibero.it/"/>
    [...] @class="publicationsList"]"/>

    One thing you have to notice is that in XPath every HTML element name must be prefixed with 'h:' (for example //h:div rather than //div). Otherwise the query won't match anything.
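
    To illustrate why the 'h:' prefix matters, here is a minimal sketch in plain Python rather than RapidMiner. It assumes that 'h:' stands for the XHTML namespace, and the sample markup, link paths, and function names are hypothetical; only the class name publicationsList comes from the process above. It is not the actual structure of the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XHTML standing in for the real page.
# The actual markup of the site is an assumption.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="publicationsList">
      <a href="/articles/1">First article</a>
      <a href="/articles/2">Second article</a>
    </div>
    <div class="sidebar"><a href="/about">About</a></div>
  </body>
</html>"""

# Bind the prefix 'h' to the XHTML namespace (assumed meaning of 'h:').
NS = {"h": "http://www.w3.org/1999/xhtml"}


def extract_links(xhtml):
    """Return (title, url) pairs for links inside the publications list."""
    root = ET.fromstring(xhtml)
    # Every element name in the path carries the 'h:' prefix.
    links = root.findall(".//h:div[@class='publicationsList']/h:a", NS)
    return [(a.text, a.get("href")) for a in links]


def extract_links_unprefixed(xhtml):
    """Same query without the prefix: matches nothing in namespaced XHTML."""
    root = ET.fromstring(xhtml)
    return root.findall(".//div[@class='publicationsList']/a")


if __name__ == "__main__":
    print(extract_links(SAMPLE_XHTML))
    print(len(extract_links_unprefixed(SAMPLE_XHTML)))
```

    The unprefixed query returns an empty list because the elements live in a namespace, which is the same failure you would see with a query like //div instead of //h:div.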

    Best,
    Nils