HTML Tag Removal using Regular Expression/Replace Tokens

J_Hering · January 2016

Hello friends,

I am faced with a huge txt file containing a large number of HTML tags. I want to remove all HTML tags with a regular expression using "Replace Tokens" in RapidMiner so that I can read only the pure text.
Since my file is so big (a U.S. Securities and Exchange Commission annual report text file), I cannot even identify all the HTML tags within the file.

Due to the complex tagging (<Tag> TEXT to extract <Tag>) and the fact that I do not "see" all tags, it is hard for me to find the right regex.

I realised that all text parts basically start with > (end of a tag) and end with < (start of a new tag).
Is there a regular expression that gives me only >Text<, since I want to extract only the text parts?

Thanks for your help!

MartinLiebig · January 2016

Hey,
a few comments,

1. Have you tried the Unescape HTML or Extract Content operators from the Web Mining extension?
2. Have you considered using the Extract Content operator from Aylien? They have a free API for 1000 calls per day.
3. When I crawled Wikipedia some time ago I used something like the attached process. I don't remember exactly what the regexes do. It was back in RM 6.2...

I hope this helps,
Martin

JEdward · January 2016

The first two Replace operators in Martin's process remove content that is enclosed in HTML tags but that you don't want in your text extract: <script> for JavaScript code and <style> for CSS. The last step, done by the operator Replace (3), removes all the remaining HTML tags.
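
For anyone who wants to try the same three-step idea outside RapidMiner, here is a minimal Python sketch. The three patterns are assumptions for illustration (the actual regexes from Martin's process are not shown in this thread), and `strip_html` is a hypothetical helper name. It first drops <script> and <style> blocks together with their content, then removes the remaining tags.

```python
import re

# Assumed patterns: the regexes in Martin's attached process are not visible here.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"</?\w[^<>]*>")


def strip_html(text: str) -> str:
    """Remove script/style blocks first, then any remaining tags."""
    text = SCRIPT_RE.sub(" ", text)   # step 1: JavaScript code and its tags
    text = STYLE_RE.sub(" ", text)    # step 2: CSS and its tags
    text = TAG_RE.sub(" ", text)      # step 3: every remaining tag
    return re.sub(r"\s+", " ", text).strip()  # tidy up leftover whitespace


page = (
    "<html><head><style>p {color: red;}</style>"
    "<script>var x = 1 < 2;</script></head>"
    "<body><p>Annual <b>Report</b></p></body></html>"
)
print(strip_html(page))  # Annual Report
```

The order matters: if the tags were stripped first, the JavaScript and CSS source would be left behind as "text". Note that this sketch does not handle HTML comments or a <!DOCTYPE> declaration.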

Ta da!

ahootanha · April 2018

Hello,
How can web links such as
https://t.co/ghtyd
be deleted from text?
Does anyone know the regular expression?

kayman · April 2018

There are a couple of options you have when you want to use regex, but you probably need to do it in several steps to be on the safe side.

If your structure is indeed like your example (<Tag>), one way is to remove the 'correct' tags first by using this regex:

<\/?\w[^<>]*>

Read it a bit like 'select anything starting with a <, optionally followed by a closing slash (/), then followed by a word character ([a-zA-Z0-9_]), then followed by anything but < or > until the first >'.

This will change <Tag> TEXT to extract <Tag> into TEXT to extract , and if you run the same regex again you will only keep your text.
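
To make the reading above concrete, here is a small Python sketch that spells the pattern out piece by piece. The exact pattern used here (`</?\w[^<>]*>`) and the sample string are assumptions for this illustration. Python's `re` module stands in for RapidMiner's Java regex engine, and this particular pattern uses only syntax that both support.

```python
import re

# The pattern written out in verbose mode, one piece per line.
TAG = re.compile(
    r"""
    <          # opening angle bracket
    /?         # optional slash of a closing tag
    \w         # one word character (letter, digit or underscore)
    [^<>]*     # anything except < or > ...
    >          # ... up to the first closing angle bracket
    """,
    re.VERBOSE,
)

sample = "<Tag> TEXT to extract </Tag>"
cleaned = TAG.sub("", sample).strip()
print(cleaned)  # TEXT to extract
```

Because the replacement is applied to every match, one pass is enough for well-formed tags in this sketch, including single-letter ones such as <p> or <b>.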

Now, typically tags should have a closing indicator (

<Tag> TEXT to extract Tag> or any combination

Anyway, be careful using regex: if there are actual < and > characters used for less than / greater than instead of HTML tags, you may remove more than needed. All in all it should allow you to get started, though. (And kick the guy who created this bad HTML...)
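
Here is a short Python sketch of that pitfall. The sample sentence and both patterns are made up for illustration. A tag regex of this kind cannot tell a comparison sign from a tag, so text between a < and a later > can disappear.

```python
import re

# Assumed tag pattern for this sketch: '<', optional '/', a word character,
# then anything except '<' or '>' up to the next '>'.
TAG = re.compile(r"</?\w[^<>]*>")

# A stricter, hypothetical variant: the tag name must start with a letter.
STRICT = re.compile(r"</?[A-Za-z][^<>]*>")

prose = "Revenue was <5 million in 2014 and >7 million in 2015."

print(TAG.sub("", prose))     # Revenue was 7 million in 2015.
print(STRICT.sub("", prose))  # sentence is left unchanged
```

Requiring a letter after the < reduces the damage but does not eliminate it: something like "x <y and z> w" would still be treated as a tag, so it is worth spot-checking the output on a few pages.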

Altair RapidMiner Community