"Scrape a website and download hyperlinked PDF files"

gary_molloy Member Posts: 4 Contributor I
edited June 2019 in Help

I can scrape in Python, but how do I download and store hyperlinked PDFs or other files in their native format using RapidMiner?

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Is the "Open File" operator not doing what you want? It allows you to get files from any URL or file path and have them as a file object, which can then be stored. If you have multiple files then you can use macros and put this in a loop.

    If you want to scrape actual web pages, then use "Get Page" or "Get Pages" instead.

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @gary_molloy, if you use the "Crawl Web" operator (Web Mining extension), there is an option to "write pages to disk". This saves the PDFs to disk as ordinary files. I have done this many times.


    Scott
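
For comparison with the Python side mentioned in the question, here is a minimal sketch of the same idea as the "loop over files" suggestion above: collect the hyperlinks on a page, keep those that point at PDFs, and write each one to disk as raw bytes so the native format is preserved. It is not a RapidMiner process and does not use any RapidMiner API; the names `LinkCollector`, `find_file_links`, and `download_files`, the `downloads` folder, and the example URL are all hypothetical, and only the Python standard library is assumed.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_file_links(html, base_url, extensions=(".pdf",)):
    """Return absolute, de-duplicated links whose path ends in one of `extensions`."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        path = urlparse(absolute).path.lower()  # ignore query strings and case
        if path.endswith(tuple(extensions)) and absolute not in links:
            links.append(absolute)
    return links


def download_files(page_url, out_dir="downloads"):
    """Fetch `page_url`, then save every linked PDF into `out_dir` unchanged."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    saved = []
    for link in find_file_links(html, page_url):
        target = out / Path(urlparse(link).path).name
        with urlopen(link) as response:
            target.write_bytes(response.read())  # raw bytes, no re-encoding
        saved.append(target)
    return saved
```

Calling `download_files("https://example.com/reports/")` (a placeholder URL) would save each linked PDF under `downloads/`. Passing a different tuple to `extensions` in `find_file_links` covers other file types; two links with the same file name would overwrite each other in this sketch.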
