Extracting words from text using Python regex
I have a text (a string) and want to perform the following task in Python:
I use the CountVectorizer method in order to make a bag of words. You may find the method here: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.countvectorizer.html
This method includes stopword removal and works fine. It removes punctuation and breaks the text into individual words. Besides words, however, it returns lots of trash such as single letters and numbers.
The method does have a parameter called "token_pattern" that takes a string (a regex), which might give me better results.
What I want is: a) exclude words that start with, end with, or include numbers, b) exclude numbers from the text, c) exclude any words of <= 2 letters, d) exclude HTTP pages (URLs).
For example, the regex should give me this:
text = "it can dangerous take fido ride: http://t.co/er2wfanzbi http://t.co/rf3bhpnpwr',each year, on average, 20 billion empty miles incurred trucks, costs economy billions"
final_text = "can dangerous take fido ride each year average billion empty miles incurred trucks costs economy billions"
Thank you in advance for your time and attention :)
Here is a piece of regex that grabs any word made solely of letters, with a length of 3 or more.
[a-zA-Z]{3,}
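One caveat worth checking: without word boundaries, a letters-only pattern can also match a run of letters inside a longer token that contains digits. The sketch below compares the bare pattern with a `\b`-anchored variant. The anchored variant and the sample string are my own additions for illustration, not part of the answer above.

```python
import re

# Illustrative sample: a short word, a number, a normal word,
# and a token that mixes letters and digits.
sample = "on 20 trucks er2wfanzbi"

# Bare pattern: any run of 3+ letters, even inside a longer token.
bare_matches = re.findall(r"[a-zA-Z]{3,}", sample)

# Anchored variant (assumption): \b on both sides means the whole
# token must consist of letters only, so digit-bearing tokens drop out.
bounded_matches = re.findall(r"\b[a-zA-Z]{3,}\b", sample)

print(bare_matches)     # ['trucks', 'wfanzbi']
print(bounded_matches)  # ['trucks']
```

The bare pattern leaks "wfanzbi" from the mixed token, while the anchored one keeps only "trucks". Note that `\b` treats the underscore as a word character, so tokens such as "foo_bar" are dropped as well.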
Here is a piece of regex that grabs a line without a URL in it.
^((?!(https?:\/\/)+([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w=?$#% \.-]*)).)*$
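As a rough illustration of how the line-level pattern behaves, the sketch below applies it to each line of a small multi-line string and keeps only the lines it matches. The helper name `lines_without_urls` and the sample input are hypothetical.

```python
import re

# The pattern from the answer, unchanged: it matches a whole line only
# if no position on that line starts something that looks like a URL.
NO_URL_LINE = re.compile(
    r"^((?!(https?:\/\/)+([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w=?$#% \.-]*)).)*$"
)

def lines_without_urls(text):
    """Return the lines of `text` that contain no http(s) URL."""
    return [line for line in text.splitlines() if NO_URL_LINE.match(line)]

# Hypothetical input with one token per line.
sample = "take fido ride\nhttp://t.co/er2wfanzbi\neach year"
kept = lines_without_urls(sample)
print(kept)  # ['take fido ride', 'each year']
```

Because the character classes only list lowercase letters, an uppercase URL would slip through unless `re.IGNORECASE` is added.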
I haven't figured out how to combine the two yet. At least it's a step in the right direction. You could put each word on its own line, remove the URLs, then match words of 3 or more letters. It's ugly, but it should work.
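One possible way to combine the two steps, sketched under some assumptions: instead of splitting into lines, URLs are blanked out with `re.sub` using a simpler URL pattern than the one above, and the letters-only pattern is wrapped in `\b` so that tokens containing digits are skipped. The names `URL`, `WORD`, and `extract_words` are hypothetical, and the simpler URL pattern is a swapped-in choice so that the text glued onto the second URL ("',each") is not lost.

```python
import re

# Assumption: a URL is "http(s)://" followed by word characters,
# dots, slashes, or hyphens. It stops at quotes, commas, and spaces.
URL = re.compile(r"https?://[\w./-]+")

# Whole tokens made only of letters, 3 or more characters long.
WORD = re.compile(r"\b[a-zA-Z]{3,}\b")

def extract_words(text):
    """Blank out URLs first, then collect letters-only words of length >= 3."""
    without_urls = URL.sub(" ", text)
    return WORD.findall(without_urls)

text = ("it can dangerous take fido ride: http://t.co/er2wfanzbi "
        "http://t.co/rf3bhpnpwr',each year, on average, 20 billion empty "
        "miles incurred trucks, costs economy billions")

final_text = " ".join(extract_words(text))
print(final_text)
# can dangerous take fido ride each year average billion empty miles incurred trucks costs economy billions
```

The word pattern has no capture groups, so it could presumably be tried as the "token_pattern" value. On its own, though, it would still let fragments such as "http" through, so the URLs need to be stripped before tokenizing.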