I have a problem removing URLs and hashtags from my data (from Excel)
I'm having a problem removing URLs and hashtags from my data (from Excel). I loaded the data (tweets) with three Read Excel operators and appended them. I then connected the Append operator to a Replace operator and entered regexes for URLs and hashtags in the parameters named "regular expression" and "replace what". After that, I connected it to Data to Documents and then to Process Documents, which contains Transform Cases, Tokenize, and Filter Stopwords (Dictionary), in that order. The results were tokenized and the stopwords I defined were removed. For the hashtags, however, only the # symbol is removed: for example, the original text #vscocam becomes vscocam. The URLs are not removed at all; they are just tokenized too.
Answers
Hello @fangirl96, welcome to the community. I think I understand, and I believe you just need to adjust your regex. Can you give some examples and share the process you're using (see the "Read Before Posting" instructions on the right)?
Scott
This is the full xml of my process.
The links are not removed, but the hashtags were removed.
PS: The links in my data start with https.
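For readers following along, here is a minimal Python sketch of the kind of regex adjustment being discussed. It is not the process from this thread: the two patterns and the function name `strip_urls_and_hashtags` are assumptions chosen for illustration. The idea is that the hashtag pattern has to match the `#` together with the word after it (a pattern that matches only `#` would leave `vscocam` behind), and the link pattern has to match the whole `https` address up to the next whitespace.

```python
import re

# Assumed patterns for illustration; not the regexes used in the original process.
URL_PATTERN = re.compile(r"https?://\S+")   # the whole link, up to the next whitespace
HASHTAG_PATTERN = re.compile(r"#\w+")       # the '#' plus the word that follows it


def strip_urls_and_hashtags(text: str) -> str:
    """Remove whole URLs and whole hashtags, then collapse leftover whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = HASHTAG_PATTERN.sub(" ", text)
    return " ".join(text.split())


cleaned = strip_urls_and_hashtags("Sunset walk #vscocam https://t.co/abc123 today")
print(cleaned)  # Sunset walk today
```

The same two patterns should also be valid in Java-style regex engines, but whether they behave identically inside a particular operator is something to verify on a small sample first. The removal also needs to run before tokenization, since a tokenizer that splits on non-letters would break a link into pieces.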
Thank you, @fangirl96. Can you share one of those Excel sheets as well?
Scott
@fangirl96 take a look at my tutorial process here: http://www.neuralmarkettrends.com/blog/entry/use-rapidminer-discover-twitter-content
I extract the hashtags and replace the https: links with a generic word, 'link'.
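As a rough illustration of that approach outside RapidMiner, the Python sketch below pulls hashtags out into a list and swaps each link for the placeholder word. It is not the linked tutorial process; the helper names `extract_hashtags` and `replace_links` and both patterns are hypothetical.

```python
import re

# Hypothetical patterns and helper names, used only for this sketch.
URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#(\w+)")   # the group captures the tag without '#'


def extract_hashtags(text: str) -> list[str]:
    """Return the hashtags found in the text, lowercased and without the '#'."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]


def replace_links(text: str, placeholder: str = "link") -> str:
    """Replace every http/https URL with a generic placeholder word."""
    return URL_PATTERN.sub(placeholder, text)


tweet = "Nice #VSCOcam shot #sunset https://t.co/abc123"
print(extract_hashtags(tweet))  # ['vscocam', 'sunset']
print(replace_links(tweet))     # Nice #VSCOcam shot #sunset link
```

Replacing links with a single word rather than deleting them keeps the fact that a tweet contained a link, which can be a useful token later on.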