Scrape a website and download hyperlinked PDF files
gary_molloy
I can scrape in Python, but how do I download and store hyperlinked PDFs or other files in their native format using RapidMiner?
Answers
Is the "Open File" operator not doing what you want? It allows you to get files from any URL or file path and have them as a file object, which can then be stored. If you have multiple files then you can use macros and put this in a loop.
If you want to scrape actual web pages, then use "Get Page" or "Get Pages" instead.
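Since the question mentions scraping in Python, here is a rough sketch of the same logic outside RapidMiner: fetch a page, collect the hyperlinks that point at files, then loop over them and store each one unchanged. It is only an illustration of the loop described above, not a RapidMiner process. The names `find_file_links` and `download_files`, the `downloads` folder, and the `.pdf` filter are all assumptions made for this example.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_file_links(html, base_url, extensions=(".pdf",)):
    """Return absolute URLs of links whose path ends in one of the extensions."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if urlparse(absolute).path.lower().endswith(tuple(extensions)):
            links.append(absolute)
    return links


def download_files(page_url, target_dir="downloads"):
    """Fetch one page and save each linked file byte for byte."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    saved = []
    for link in find_file_links(html, page_url):
        filename = Path(urlparse(link).path).name
        with urlopen(link) as response:
            # Write the raw bytes so the file keeps its native format.
            (target / filename).write_bytes(response.read())
        saved.append(target / filename)
    return saved
```

Calling `download_files("https://example.com/reports/")` (a placeholder URL) would save every linked PDF on that page into the `downloads` folder. Files that share a name would overwrite each other in this sketch, so a real run would need some renaming scheme.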
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Hello @gary_molloy, if you use the "Crawl Web" operator (Web Mining extension), there is an option to "write pages to disk". This saves the PDFs as ordinary files. I have done this many times.
Scott