HTML Tag Removal using Regular Expression/Replace Tokens

J_Hering Member Posts: 3 Contributor I
edited November 2018 in Help
Hello friends,

I am faced with a huge txt file containing a huge number of HTML tags. I want to remove all HTML tags with a regular expression using "Replace Tokens" in RapidMiner so that I can read only the plain text.
Since my file is so big (a U.S. Securities and Exchange Commission annual report text file), I cannot even identify all the HTML tags within the file.

Due to complex tagging (<Tag> TEXT to extract <Tag>) and the fact that I do not "see" all the tags, it is hard for me to find the right regex.

I realised that all text parts basically start with > (end of a tag) and end with < (start of a new tag).
Is there a regular expression that gives me only >Text<, since I want to extract only the text parts?

Thanks for your help!

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,404 RM Data Scientist
    Hey,
    a few comments:

    1. Have you tried the Unescape HTML or Extract Content operators from the Web Mining extension?
    2. Have you considered using the Extract Content operator from Aylien? They have a free API with 1,000 calls per day.
    3. When I crawled Wikipedia some time ago, I used something like the attached process. I don't remember exactly what the regexes do. It was back in RM 6.2...

    I hope this helps,
    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    What the first two Replace operators in Martin's process are doing is removing things that are enclosed in HTML tags but that you don't want in your text extract: <script> for JavaScript code and <style> for CSS. The last step, which the operator Replace (3) does, is removing all the HTML tags.

    Ta da! :)
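
The three steps described above can be sketched outside RapidMiner in Python. This is a minimal sketch, not the attached process itself: the patterns and the names `SCRIPT_BLOCK`, `STYLE_BLOCK`, `ANY_TAG` and `strip_html` are assumptions made for illustration, and the blocks being dropped are assumed to be script and style elements.

```python
import re

# Assumed patterns for illustration; the regexes of the attached process
# are not shown in this thread.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^<>]+>")


def strip_html(html: str) -> str:
    """Drop script blocks, then style blocks, then every remaining tag."""
    text = SCRIPT_BLOCK.sub(" ", html)   # step 1: JavaScript code and its tags
    text = STYLE_BLOCK.sub(" ", text)    # step 2: CSS and its tags
    text = ANY_TAG.sub(" ", text)        # step 3: all remaining tags
    return " ".join(text.split())        # collapse the leftover whitespace


sample = (
    "<html><head><style>p {color: red;}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Annual <b>Report</b> 2018</p></body></html>"
)
print(strip_html(sample))  # Annual Report 2018
```

The order matters: if the generic tag removal ran first, the JavaScript and CSS source would be left behind as if it were ordinary text.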
  • ahootanha Member Posts: 69 Contributor I

    Hello,
    how can web links such as
    https://t.co/ghtyd
    be deleted from text?
    Does anyone know the regular expression?
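
One possible approach to the link question, sketched in Python rather than as a RapidMiner operator setting: match `http://` or `https://` followed by everything up to the next whitespace. The pattern and the helper name `remove_urls` are assumptions for illustration, and the pattern will also swallow punctuation that directly follows a link.

```python
import re

# Assumed pattern: a scheme followed by any run of non-whitespace characters.
URL = re.compile(r"https?://\S+")


def remove_urls(text: str) -> str:
    """Delete web links and collapse the whitespace they leave behind."""
    return " ".join(URL.sub("", text).split())


print(remove_urls("Check this https://t.co/ghtyd out"))  # Check this out
```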

  • kayman Member Posts: 662 Unicorn

    There are a couple of options when you want to use regex, but you probably need to do it in several steps to be on the safe side.

    If your structure is indeed like your example (<Tag>), one way is to remove the 'correct' tags first by using this regex:

    <\/?\w[^<>].*?>

    Read it a bit like 'select anything starting with a <, optionally followed by the tag-closing slash, then followed by a word character ([a-zA-Z0-9_]), then by one character that is neither < nor >, then by as little as possible up to the first >'.

    This will change <Tag> TEXT to extract <Tag> into TEXT to extract, and if you run the same regex again you will only keep your text.
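
To see what the regex above does on concrete input, here is a small Python check (Python's `re` treats this particular pattern the same way as the Java regex engine RapidMiner uses). The name `remove_tags` and the variant `TAG_ANY_LENGTH` are assumptions for illustration, not part of the original suggestion. Note that the pattern needs at least two characters after the `<`, so one-letter tags such as `<b>` are left alone.

```python
import re

# The regex from the post, unchanged.
TAG = re.compile(r"<\/?\w[^<>].*?>")


def remove_tags(text: str) -> str:
    """Replace every match of the posted regex with nothing."""
    return TAG.sub("", text)


print(repr(remove_tags("<Tag> TEXT to extract <Tag>")))  # ' TEXT to extract '
print(remove_tags("<b>bold</b>"))                        # <b>bold</b> (unchanged)

# Hypothetical variant: allow zero or more non-bracket characters after the
# first word character, so one-letter tags are matched as well.
TAG_ANY_LENGTH = re.compile(r"</?\w[^<>]*>")
print(TAG_ANY_LENGTH.sub("", "<b>bold</b>"))             # bold
```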

    Now, typically tags should have a closing indicator (

    <Tag> TEXT to extract Tag> or any combination

    Anyway, be careful using regex: if there are actual < and > characters used for greater than / less than instead of HTML tags, you may remove more than needed. All in all, though, it should allow you to get started (and kick the guy who created this bad HTML...).
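
The over-removal risk can be reproduced with a made-up sentence. The sample text below is invented for illustration and does not come from an actual filing; the regex is the one posted above.

```python
import re

TAG = re.compile(r"<\/?\w[^<>].*?>")  # the regex from the post, unchanged

# Comparison operators written without spaces look like a tag to the regex,
# so everything between the < and the > is deleted.
tight = "revenue<costs in 2017 but revenue>costs in 2018"
print(TAG.sub("", tight))   # revenuecosts in 2018

# With a space after <, no word character follows it, so nothing matches.
spaced = "a < b and c > d"
print(TAG.sub("", spaced))  # a < b and c > d
```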
