"Web Crawling guide - help much needed"

milkshake_luva · February 2017

Hi there,

I am new to RapidMiner, but I have a deadline coming up soon and would like some help with web crawling.

I'm doing a crowdsourcing assignment where I need to 'crawl' a website to find detailed information, which I can then subject to further processing. However, I am having trouble running my initial analysis. I've downloaded both the web and text mining extensions, put in the URL to crawl, and tried to add parameters so that the results returned match my URL and links containing the name of the site itself. I've followed some tutorials and told RapidMiner to save results to a directory in .txt format.

I'm not sure how 'max crawl depth' translates to actually 'going through' links and pages from my given URL. I want to search through user suggestions in a crowdsourcing project, but there is no way to specify a time window for these results. I set the max depth to 400. I've selected 'add content as attribute' and chosen to write pages to disk. I also put in my user agent before running the analysis.

In one instance, I did manage to get 60 or so text files in my directory which pertained to the analysis. Whilst some of these were links I wanted, a lot weren't, and the dates were too recent anyway. I wasn't sure how to further systematise my search criteria.

It is frustrating because I have a whole design set up, but no way to 1) get the data into RapidMiner, or even 2) review the text files reliably and go through them whilst specifying that I want user reviews posted from a certain date. I also don't know how I would include user metadata, such as past voting and commenting history, in the analysis, or whether this is done afterwards. All this information is available on the website itself: when you click on a given idea, the website shows how many ideas the user has submitted, how many votes and comments they've made, etc. I could do this by hand, but I need hundreds if not over 1,000 different links to analyse reliably.

If anyone could provide further guidance I would be wholly appreciative. I have a deadline but not much time.

Thanks,
milkshake_luva

kayman · February 2017

This might get you started:

It takes one page, looks at the content, and stores the content of interest in an ExampleSet for further analysis.

What you still need to do is set up the actual crawl logic, and modify it where needed if you want more, less, or other data from the page, but the principle remains the same.

http://www.ideastorm.com/idea2ExploreMore?v=1487149161389&Type=TrendingIdeas

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<root>
<xsl:for-each select="//article[@class='search-result']">
<xsl:variable name="idea" select="h3/a"/>
<xsl:variable name="added" select="div[1]/p[@class='date']"/>
<xsl:variable name="votes" select="div[1]/p[@class='votes']/em"/>
<xsl:variable name="details" select="normalize-space(p[@class='truncatedBody'])"/>
<row idea="{$idea}" added="{$added}" votes="{$votes}" details="{$details}"/>
</xsl:for-each>
</root>
</xsl:template>
</xsl:stylesheet>
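
For checking the selections outside RapidMiner, the same extraction can be sketched in Python with only the standard library. This is a rough equivalent, not the process itself: the sample markup and its values are invented for illustration, and it assumes the page has already been converted to well-formed XML without namespaces. The helper names `text_of` and `extract_rows` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup: the element and class names come from the
# stylesheet, but the values are invented for illustration.
SAMPLE = """
<html><body>
  <article class="search-result">
    <h3><a href="/idea/1">Example idea title</a></h3>
    <div>
      <p class="date">Jan 5, 2015</p>
      <p class="votes"><em>42</em> votes</p>
    </div>
    <p class="truncatedBody">  Some   example
       text.  </p>
  </article>
</body></html>
"""


def text_of(node, path):
    """Return all text under the first match for path, or '' if none."""
    found = node.find(path)
    return "".join(found.itertext()) if found is not None else ""


def extract_rows(xml_text):
    """Build one dict per search result, mirroring the <row> attributes."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iterfind(".//article[@class='search-result']"):
        details = text_of(article, "p[@class='truncatedBody']")
        rows.append({
            "idea": text_of(article, "h3/a"),
            "added": text_of(article, "div[1]/p[@class='date']"),
            "votes": text_of(article, "div[1]/p[@class='votes']/em"),
            # " ".join(s.split()) mimics XPath normalize-space()
            "details": " ".join(details.split()),
        })
    return rows


rows = extract_rows(SAMPLE)
```

If the real page uses different class names or nests the date and vote paragraphs elsewhere, only the path strings need to change.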

<operator activated="true" class="text:html_to_xml" compatibility="7.3.000" expanded="true" height="68" name="HTML to XML" width="90" x="179" y="34"/>

<operator activated="true" class="text:cut_document" compatibility="7.3.000" expanded="true" height="68" name="Cut Document" width="90" x="179" y="646">

@idea
@added
@votes
@details
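
The crawl logic that still has to be set up could look roughly like the following. It is a minimal Python sketch of a breadth-first crawl with a depth limit, not a RapidMiner process. It assumes that "depth" means the number of link hops from the start URL, which is the usual meaning but is not confirmed here for RapidMiner's crawler. The `SITE` link graph, the `get_links` callback, and the `follow` rule are all hypothetical stand-ins for real pages and real URL rules.

```python
from collections import deque


def crawl(start_url, get_links, max_depth, follow=lambda url: True):
    """Breadth-first crawl up to max_depth link hops from start_url.

    get_links(url) returns the links found on a page; follow(url) decides
    whether a link is worth visiting (for example, idea pages only).
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen and follow(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for real pages.
SITE = {
    "/ideas": ["/idea/1", "/idea/2", "/about"],
    "/idea/1": ["/user/a"],
    "/idea/2": ["/user/b", "/idea/1"],
    "/user/a": [],
    "/user/b": [],
    "/about": [],
}

pages = crawl(
    "/ideas",
    lambda url: SITE.get(url, []),
    max_depth=1,
    follow=lambda url: url.startswith("/idea/"),
)
all_pages = crawl("/ideas", lambda url: SITE.get(url, []), max_depth=2)
```

Under this reading of depth, a listing page that links straight to the idea pages needs a limit of only 1 or 2, and the `follow` rule is what keeps unrelated links out of the results.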

Edin_Klapic · February 2017

Hi milkshake_luva,

Web crawling depends heavily on the structure of the website. This makes a general answer to your problem really difficult.

Could you therefore please provide the URL (in this forum or by PM) and more details about the information you want to retrieve, so I can create a process myself?

Best regards,

Edin

milkshake_luva · February 2017

Hi Edin,

Thanks very much. The assignment is looking at crowdsourcing and the implementation success of 'ideators'. I also want to look at past user (ideator) activity in terms of ideas submitted previously, general voting behaviour, and commenting behaviour.

The idea is basically to build a networked or 'bundled' state of creativity and subject this to a test: was the idea useful and implemented by an organisation, or not? The website, by the way, is Dell's IdeaStorm, where implementation data is publicly available for each idea. I already have software for subjecting user info to sentiment analysis; I just need the text itself and, hopefully, some kind of organisation to this text. On that point, in terms of a time window, I'd also like to exclude all ideas from the most recent 4 months. So maybe user activity from http://www.ideastorm.com/ in the way of votes, with metadata about prior activity, between say summer 2013 and summer 2015.

I'd love to be able to do this stuff myself; I've been reading the RapidMiner manual and it would be great to get some practice. It's just that my deadline is not far away at all, so it's a matter of need above all else.

I hope that was sufficient information for you - I've made this public should any other contributors have success tips for me.

Thanks,

MSL
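
The time window described above can be applied after extraction rather than during the crawl. The following is a minimal Python sketch under stated assumptions: the exact bounds for "summer 2013" and "summer 2015" are guesses, the date format `"%b %d, %Y"` is an assumption about how the site prints dates, and the rows are invented examples. The names `parse_added` and `in_window` are hypothetical.

```python
from datetime import date, datetime

# Assumed bounds for "summer 2013" to "summer 2015"; adjust as needed.
WINDOW_START = date(2013, 6, 1)
WINDOW_END = date(2015, 8, 31)


def parse_added(text, fmt="%b %d, %Y"):
    """Parse a date such as 'Jan 5, 2015'; the format is an assumption."""
    return datetime.strptime(text.strip(), fmt).date()


def in_window(rows, start=WINDOW_START, end=WINDOW_END):
    """Keep only rows whose 'added' date falls inside the window."""
    return [r for r in rows if start <= parse_added(r["added"]) <= end]


# Invented example rows with the same 'added' field as the extracted data.
rows = [
    {"idea": "A", "added": "Jan 5, 2015"},
    {"idea": "B", "added": "Feb 1, 2017"},
    {"idea": "C", "added": "Jul 20, 2013"},
]
kept = in_window(rows)
```

Filtering after extraction means the crawl can stay simple, and the window can be changed without fetching the pages again.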
