Remove or replace URL and RT from Twitter dataset

ikayunida123  Member, Posts: 17, Contributor II
edited December 2018 in Help

Hello everyone!

So right now I'm doing the data cleaning phase for text classification on a Twitter dataset. But I have a problem: I don't know how to replace (or maybe remove) the URLs, "RT" and the @ character. I've read some posts on the forum but I didn't understand anything :catsad:

For the URLs in the dataset, I want to replace anything starting with "https:" or "http:" with "link" (I don't know why the replacement can't be a null value like " "). But after I ran my process with the Replace operator, "http://blablabla" did not become "link"; the result came out as "linkblablabla" instead. Maybe it has something to do with the RegEx? :catsad: I know what RegEx is, but I don't know how to use or write it :catsad:

I'm really confused right now. Please help me.

This is my RapidMiner process:

<operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
<operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve Dataset Skripsi" width="90" x="45" y="34">
<operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
<operator activated="true" class="set_role" compatibility="8.1.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
<operator activated="true" class="filter_examples" compatibility="8.1.001" expanded="true" height="103" name="Filter Examples" width="90" x="447" y="34">
<operator activated="true" class="remove_duplicates" compatibility="8.1.001" expanded="true" height="103" name="Remove Duplicates" width="90" x="581" y="34">
<operator activated="true" class="replace" compatibility="8.1.001" expanded="true" height="82" name="Replace" width="90" x="715" y="34">
https://)"/>

I need your help. Thank you!
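
The symptom described above can be reproduced in a few lines. This is only a sketch: it uses Python's `re` module as a stand-in for the Replace operator (an assumption; RapidMiner uses Java regular expressions, but the constructs used here behave the same way in both). The `clean_tweet` helper and the patterns for "RT" and @mentions are hypothetical illustrations, not part of the original process. The URL pattern is the one posted later in this thread.

```python
import re

# Replacing only the scheme leaves the rest of the URL in place.
prefix_only = re.sub(r"(http://|https://)", "link", "http://blablabla")

# Matching the whole URL lets the replacement swallow all of it.
URL = re.compile(r"https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]")
whole_url = URL.sub("link", "http://blablabla")


def clean_tweet(text):
    """Hypothetical helper: replace URLs, drop a leading RT and any @mentions."""
    text = URL.sub("link", text)
    text = re.sub(r"^RT\s+", "", text)   # retweet marker at the start of the tweet
    text = re.sub(r"@\w+:?", "", text)   # @mention, with an optional trailing colon
    return " ".join(text.split())        # collapse the leftover whitespace


cleaned = clean_tweet("RT @someone: check http://blablabla now")
```

Here `prefix_only` is "linkblablabla", `whole_url` is "link", and `cleaned` is "check link now".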

Best Answer

Answers

  • David_A  Administrator, Moderator, Employee, RMResearcher, Member, Posts: 296, RM Research

    Woah great solution and very detailed.

    I took the liberty to re-use it to answer the same question on Stack Overflow.

  • ikayunida123  Member, Posts: 17, Contributor II

    @rfuentealba Oh my god, thank you so much! It works nicely on my process :catvery-happy:

  • rfuentealba  Moderator, RapidMiner Certified Analyst, Member, University Professor, Posts: 568, Unicorn

    Glad it helped. However, I was reading my answer again and found that I made a mistake. Not a serious one, unless you are parsing thousands of URLs (in that case, every saved FLOP counts):

    https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]

    This is the final regular expression you should use. Using (http|https?) is redundant (like asking if it's http, or it's http, or it's https), because s? means that the content might or might not have the character s at the end.
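
A quick way to check that claim, sketched with Python's `re` module rather than RapidMiner itself (an assumption; the optional-character syntax is the same in Java regular expressions). The sample strings and variable names are made up for illustration.

```python
import re

REDUNDANT = re.compile(r"(http|https?)://")
SIMPLE = re.compile(r"https?://")
URL = re.compile(r"https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]")

# Both scheme patterns accept and reject exactly the same inputs.
samples = ["http://a.b", "https://a.b", "ftp://a.b", "htt://a.b"]
same = all(bool(REDUNDANT.match(s)) == bool(SIMPLE.match(s)) for s in samples)

# The second character class has no '.', so a sentence-ending period
# is left out of the match.
found = URL.search("see https://example.com/path?x=1.").group(0)
```

`same` is True, and `found` is "https://example.com/path?x=1" without the trailing period.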

    Also, for future reference, I've found that on this implementation of regular expressions there is no need to escape the / character. That's a behaviour I acquired from using UNIX command line tools such as vim or sed.
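
The same holds in Python's `re` module, which is used here only as a stand-in (an assumption; the post above is about RapidMiner's regex implementation). The slash is an ordinary character, so the escaped and unescaped patterns find the same matches. The sample text is made up.

```python
import re

escaped = re.compile(r"https?:\/\/")  # sed/vim habit: slash escaped
plain = re.compile(r"https?://")      # slash left as-is

text = "visit https://example.com or http://example.org"
escaped_hits = escaped.findall(text)
plain_hits = plain.findall(text)
```

Both lists contain "https://" followed by "http://".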

  • AmosGH  Member, Posts: 7, Learner I
    I also tried (https|http)(.*) for my URL and it worked.
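
One thing worth knowing about that shorter pattern: `(.*)` runs to the end of the line, so any text after the URL is replaced as well. A small sketch of the difference, using Python's `re` as a stand-in for the Replace operator (an assumption) and a made-up sample tweet:

```python
import re

GREEDY = re.compile(r"(https|http)(.*)")
BOUNDED = re.compile(r"https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]")

text = "read https://example.com/a today"

greedy_result = GREEDY.sub("link", text)    # everything after "https" is consumed
bounded_result = BOUNDED.sub("link", text)  # only the URL itself is consumed
```

`greedy_result` is "read link" while `bounded_result` is "read link today". If the URL is always the last thing in the tweet, both give the same output.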
  • kayman  Member, Posts: 662, Unicorn
    If you want a bit more 'readability' you could also replace A-Za-z0-9_ with \w\d, which covers every word character and digit.
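
A sketch of that shortening, again in Python's `re` as a stand-in (an assumption). In ASCII mode `\w` already stands for letters, digits and the underscore, so `\w` on its own is enough and the extra `\d` is harmless but not needed. `re.ASCII` is passed because Python's `\w` is Unicode-aware by default, whereas Java's is ASCII-only unless a flag says otherwise. The sample text and names are made up.

```python
import re
import string

# \w in ASCII mode is exactly [A-Za-z0-9_].
word_chars = set(string.ascii_letters + string.digits + "_")
w_matches = {chr(c) for c in range(128) if re.fullmatch(r"\w", chr(c), re.ASCII)}

LONG = re.compile(r"https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]")
SHORT = re.compile(r"https?://[-\w+&@#/%?=~|!:,.;]*[-\w+&@#/%=~|]", re.ASCII)

text = "docs at https://example.com/a_b-c?id=42, thanks"
long_hit = LONG.search(text).group(0)
short_hit = SHORT.search(text).group(0)
```

Both patterns find "https://example.com/a_b-c?id=42" in the sample text.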