"Extract data from a website to Excel-file"

currantcurrant MemberPosts:14Contributor II
edited May 2019 inHelp
Hi!

I want to extract some data from the following website:
http://ec.europa.eu/sanco_pesticides/public/index.cfm?event=substance.selection

I want to generate a list in Excel with an ID-column and a pesticide name-column. Both should be extracted from the website, section "2) Select Pesticide".

As I am a RM beginner, I did not manage to cut the html-document and generate the two columns... The only thing I was able to do, was to get the web page on my computer.

Could anyone give me some hints I could start with?

Thanks in advance!
Currant
Tagged:

Answers

  • colocolo MemberPosts:236Maven
    Hu Currant,

    this seems to be an easy task. After retrieving the website I would suggest to use XPath since the desired data is structured within a HTML select element. If you want to deal with multiple XPath or RegEx matches there is only one interesting operator: "Cut Document" (contained in the Text Processing extension).

    Here a little working example. You just have to add the id and convert the collection into a table ("Documents to Data").










    http://ec.europa.eu/sanco_pesticides/public/index.cfm?event=substance.selection"/>









    @id='optionsList1']/h:option/text()"/>


















    Best regards
    Matthias
  • currantcurrant MemberPosts:14Contributor II
    Dear All,

    Now, I managed to the html-files on my computer and I could extract the html-information from the file. See script below.

    Code:




















    @style='text-align:left']/text()"/>
    @class='col60']/h:ul/h:li[2]/text()"/>















    <连接from_op = from_port =“阅读文档输出" to_op="Process Documents" to_port="documents 1"/>






    Now, my question is:

    How can I export the attributes "Namen" and "Animal" to Excel?
    (It's probably an easy job ... but I could not work it out...

    currant
  • Marco_BoeckMarco_Boeck Administrator, Moderator, Employee, Member, University ProfessorPosts:1,984RM Engineering
    Hi,

    you can select the desired attributes via the "Select Attributes" operator, just select subset as the attribute filter type and select the desired attributes. Then put in a "Write Excel" operator, specify the file, and voilà.

    Your modified process (you might have to change the selected attributes):





















    @style='text-align:left']/text()"/>
    @class='col60']/h:ul/h:li[2]/text()"/>




















    < =“excel_file”价值=参数的关键"C:\export.xls"/>

    <连接from_op = from_port =“阅读文档输出" to_op="Process Documents" to_port="documents 1"/>









    Regards,
    Marco
  • currantcurrant MemberPosts:14Contributor II
    Now it works!

    Thanx a lot for your support!

    currant
Sign InorRegisterto comment.