HTML Tag Removal using Regular Expression/Replace Tokens
Hello friends,
I am faced with a huge txt file containing a huge number of HTML tags. I want to remove all HTML tags with a regular expression using "Replace Tokens" in RapidMiner so that I can read only the pure text.
Since my file is so big (a U.S. Securities and Exchange Commission Annual Report text file), I cannot even identify all the HTML tags within the file.
Because of the complex tagging (<Tag> TEXT to extract <Tag>) and the fact that I do not "see" all the tags, it is hard for me to find the right regex.
I realised that all text parts basically start with > (end of a tag) and end with < (start of a new tag).
Is there a regular expression giving me only >Text<, since I want to extract only the text parts?
Thanks for your help!!!
Answers
A few comments:
1. Have you tried the Unescape HTML or Extract Content operators from the Web Mining extension?
2. Have you considered using the Extract Content operator from Aylien? They have a free API with 1000 calls per day.
3. When I crawled Wikipedia some time ago I used something like the attached process. I don't remember exactly what the regexes do. It was back in RM 6.2...
I hope this helps,
Martin
Dortmund, Germany
Ta da!
Hello
How can web links such as https://t.co/ghtyd be deleted from text?
Does anyone know the regular expression?
There are a couple of options when you want to use regex, but you probably need to do it in several steps to be on the safe side.
If your structure is indeed like your example (<Tag>), one way is to remove the 'correct' tags first by using this regex:
Read it a bit like 'select anything starting with a <, optionally followed by a tag-closing slash (/), then followed by a word character ([a-zA-Z]), then followed by anything but < or > until the first >'.
This will change <Tag> TEXT to extract <Tag> into TEXT to extract, and if you run the same regex again you will only keep your text.
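The regex itself did not survive in this post, so what follows is only a reconstruction from the verbal description above, not necessarily the original pattern. It is a minimal Python sketch that assumes the pattern `</?[a-zA-Z][^<>]*>` and a hypothetical helper named `strip_tags`. The pattern uses only basic syntax, so it should carry over to Java-style regex engines as well.

```python
import re

# Reconstructed from the description (an assumption, not the original regex):
# '<', an optional '/', one letter, then any characters except '<' or '>'
# up to the first '>'.
TAG_PATTERN = re.compile(r"</?[a-zA-Z][^<>]*>")


def strip_tags(text: str) -> str:
    """Remove everything that looks like a well-formed HTML tag."""
    return TAG_PATTERN.sub("", text)


sample = "<Tag> TEXT to extract <Tag>"
print(strip_tags(sample).strip())  # TEXT to extract
```

Replacing every match with an empty string leaves the text between the tags, including the surrounding whitespace, which is why the example trims the result before printing.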
Now, typically tags should have a closing indicator (
Anyway, be careful using regex: if there are actual < and > characters used for less than / greater than instead of HTML tags, you may remove more than needed. But all in all it should allow you to get started. (And kick the guy who created this bad HTML...)
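To illustrate this caveat, here is a small Python sketch. It assumes a tag pattern matching the verbal description given earlier (`</?[a-zA-Z][^<>]*>`, a reconstruction rather than the original regex) and uses made-up sample strings. It shows how a comparison written without spaces gets mistaken for a tag.

```python
import re

# Assumed tag pattern: '<', optional '/', a letter, then anything but '<' or '>'
# up to the first '>'.
TAG_PATTERN = re.compile(r"</?[a-zA-Z][^<>]*>")


def strip_tags(text: str) -> str:
    return TAG_PATTERN.sub("", text)


# '<' followed by a space does not look like a tag start, so this text survives.
spaced = "revenue < costs and assets > debts"
# Here '<costs and assets>' looks like a tag and is removed along with its text.
unspaced = "revenue<costs and assets>debts"

print(strip_tags(spaced))    # revenue < costs and assets > debts
print(strip_tags(unspaced))  # revenuedebts
```

Spot-checking a sample of the cleaned output against the original file is one way to catch this kind of over-removal.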