Filter Tokens by POS Tags slow

AndyKir · Member, University Professor · Posts: 3
edited December 2018 in Help

I have Filter Tokens by POS Tags inside a loop and it's slow. My guess is that on each iteration the tagger reloads some data (a dictionary?) from disk. Any tips on how to improve the performance? I see the same question was asked 4 years ago and was never answered.

Answers

  • 781194025 · Member · Posts: 32 · Contributor I

    Try to pre-process the data as much as possible so the filter operation doesn't have to work as hard.

    I'm experimenting with disabling CPU hyper-threading; maybe you could try that? Another tip is to set the amount of memory usable in Settings to a higher amount.

    Otherwise, I dunno, some processes are SLOW!

  • kayman · Member · Posts: 662 · Unicorn

    Use Python NLTK instead if that's an option. It's much more flexible with regard to POS tagging, and much faster.

    Or use R; that's also an option.
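
    If the slowdown really is the tagger reloading its model on every loop iteration, the NLTK route lets you load the model once and reuse it. A minimal sketch, assuming the punkt and averaged_perceptron_tagger data have already been downloaded (the function name and the kept tag set are just examples):

    # Minimal sketch of filtering tokens by POS tag in plain NLTK.
    # Load the tagger model once, outside any loop, and reuse it.
    import nltk
    from nltk.tag.perceptron import PerceptronTagger

    tagger = PerceptronTagger()  # one-time model load

    def filter_tokens_by_pos(text, keep=("NN", "NNS", "JJ")):
        tokens = nltk.word_tokenize(text)
        tagged = tagger.tag(tokens)  # no per-call model loading
        return [word for word, tag in tagged if tag in keep]

    print(filter_tokens_by_pos("The quick brown fox jumps over the lazy dog"))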

    Below you can find something I created a while ago to give me different outputs based on POS combinations; maybe it can help you further.





    <macros/>

    <process expanded="true">
      <operator activated="true" class="python_scripting:execute_python" expanded="true" name="Noun phrases (2)">
        <parameter key="script" value="[script below]"/>
        <description>Apply python (NLTK) to get POS tags and some other magic</description>
      </operator>
      <connect from_op="Noun phrases (2)" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>

    The script parameter of the Execute Python operator:

    import nltk, re
    from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize, wordpunct_tokenize
    from nltk.chunk import *
    from nltk.chunk.util import *
    from nltk.chunk.regexp import *
    from nltk import untag
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    from nltk.stem.lancaster import LancasterStemmer
    from nltk.stem.snowball import SnowballStemmer

    # Parse one text with a chunk rule and return the matched phrases
    # (untagged, lowercased).
    def chunckMe(str, rule):
        np = []
        chunk_parser = RegexpChunkParser(rule, chunk_label='LBL')
        sentences = sent_tokenize(str)
        for sent in sentences:
            d_words = nltk.word_tokenize(sent)
            d_tagged = nltk.pos_tag(d_words)
            chunked_text = chunk_parser.parse(d_tagged)
            tree = chunked_text
            for subtree in tree.subtrees():
                if subtree.label() == 'LBL':
                    np.append(" ".join(untag(subtree)).lower())
        return np

    # Entry point for RapidMiner's Execute Python operator; 'data' is the
    # connected example set as a pandas DataFrame.
    def rm_main(data):
        np_all = []
        ap_all = []
        aa_all = []
        vj_all = []
        vb_all = []
        nn_all = []
        stopwords_dt = ['the', 'a', 'this', 'that', 'an', 'another', 'these', 'some', 'every', 'any']
        lm = nltk.WordNetLemmatizer()
        for index, row in data.iterrows():
            str = row["case_details"]

            chunk_rule = ChunkRule("<JJ.*><NN.*>+|<JJ.*>*<NN.*><CC>*<NN.*>+|<CD><NN.*>", "Simple noun phrase")
            tags = chunckMe(str, [chunk_rule])
            np_all.append(', '.join(set(tags)))

            chunk_rule = ChunkRule("<JJ.*><CC><JJ.*>|<JJ.*><TO>*<VB.*><TO>*<NN.*>+", "adjective phrase")
            tags = chunckMe(str, [chunk_rule])
            ap_all.append(', '.join(set(tags)))

            chunk_rule = ChunkRule("<RB.*><JJ.*>|<VB.*>+<RB.*>", "Adverb - Adjectives")
            tags = chunckMe(str, [chunk_rule])
            aa_all.append(', '.join(set(tags)))

            chunk_rule = ChunkRule("<VB.*>(<JJ.*>|<NN.*>)+", "verbs - Adjectives")
            tags = chunckMe(str, [chunk_rule])
            vj_all.append(', '.join(set(tags)))

            chunk_rule = ChunkRule("<WRB><.*>+<NN>+", "Nouns")
            tags = chunckMe(str, [chunk_rule])
            nn_all.append(', '.join(set(tags)))

            # Verbs: lemmatize, then drop the auxiliaries be/do/have.
            stopwords = ['be', 'do', 'have']
            chunk_rule = ChunkRule("<VB.*>", "Verbs")
            tags = chunckMe(str, [chunk_rule])
            vb_all.append(', '.join([word for word in nltk.word_tokenize(
                ' '.join(set(lm.lemmatize(w, 'v') for w in tags)))
                if word.lower() not in stopwords]))

        data['noun_phrases'] = np_all
        data['adjective_phrases'] = ap_all
        data['adverb_phrases'] = aa_all
        data['verb_phrases'] = vj_all
        data['verbs'] = vb_all
        data['nouns'] = nn_all
        return data
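
    For reference, rm_main is the function the Execute Python operator calls, passing the connected example set in as a pandas DataFrame. If you want to smoke-test the script outside RapidMiner, a hypothetical snippet like this works (the case_details column name comes from the script above; the test sentence is made up):

    # Hypothetical standalone test of rm_main outside RapidMiner. Assumes the
    # NLTK data used above (punkt, averaged_perceptron_tagger, wordnet) is installed.
    import pandas as pd

    df = pd.DataFrame({"case_details": [
        "The quick brown fox jumps over the lazy dog and eats fresh red apples."
    ]})
    print(rm_main(df)[["noun_phrases", "verbs"]])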



  • AndyKir · Member, University Professor · Posts: 3

    That's what I do for my research, but for teaching I use RapidMiner...
