Extracting words from text using Python regex


I have some text (a string) and want to perform the following task in Python:

I use scikit-learn's CountVectorizer to build a bag of words. You can find its documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.countvectorizer.html

This method includes stop-word removal and works fine: it removes punctuation and splits the text into individual words. Besides real words, however, it also returns a lot of junk such as single letters and numbers.

The method does have a parameter called "token_pattern" that takes a string (a regex), which could give me better results.

What I want is: a) exclude words that start with, end with, or include numbers; b) exclude numbers from the text; c) exclude any words of 2 letters or fewer; d) exclude http pages (URLs).

For example, the regex should give me this:

text = "it can dangerous take fido ride: http://t.co/er2wfanzbi http://t.co/rf3bhpnpwr',each year, on average, 20 billion empty miles incurred trucks, costs economy billions"

final_text = "can dangerous take fido ride each year average billion empty miles incurred trucks costs economy billions"

Thank you in advance for your time and attention :)

Here is a piece of regex that grabs a word made solely of letters, of length 3 or more.

[a-zA-Z]{3,}
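
As a quick illustration of the letters-only pattern above (written here with an explicit A-Z range), the sketch below runs it through Python's `re.findall`. The sample string is made up. One thing to watch, as far as I can tell: without word boundaries the pattern also picks up letter runs inside tokens that contain digits, so a `\b`-anchored variant is shown next to it as one possible tightening.

```python
import re

# Made-up sample text, only for illustration.
sample = "go to abc 12 x9y hello2 world"

# The bare pattern: any run of 3 or more ASCII letters.
# It also matches the "hello" inside "hello2".
loose = re.findall(r"[a-zA-Z]{3,}", sample)

# Assumed tightening: \b on both sides, so the run of letters must be a
# whole word. Digits count as word characters, so "hello2" is rejected.
strict = re.findall(r"\b[a-zA-Z]{3,}\b", sample)

print(loose)   # ['abc', 'hello', 'world']
print(strict)  # ['abc', 'world']
```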

Here is a piece of regex that matches a line only if it has no URL in it.

^((?!(https?:\/\/)+([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w=?$#% \.-]*)).)*$ 
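
A rough sketch of how the URL-free-line pattern above might be used from Python, assuming the text has already been split so that each token or phrase sits on its own line. The helper name `lines_without_urls`, the sample lines, and the example.com address are invented for illustration.

```python
import re

# The pattern from above, unchanged: it matches a whole line only if no
# position in that line starts something that looks like an http(s) URL.
NO_URL_LINE = re.compile(
    r"^((?!(https?:\/\/)+([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w=?$#% \.-]*)).)*$"
)

def lines_without_urls(lines):
    """Keep only the lines that the pattern accepts (hypothetical helper)."""
    return [line for line in lines if NO_URL_LINE.match(line)]

# Made-up input: two plain words, one bare URL, one sentence containing a URL.
sample_lines = [
    "ride",
    "http://t.co/er2wfanzbi",
    "each",
    "see https://example.com/page now",
]

clean_lines = lines_without_urls(sample_lines)
print(clean_lines)  # ['ride', 'each']
```

Note that a line is dropped entirely as soon as it contains a URL anywhere, which is why this only fits the one-word-per-line approach.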

I haven't figured out how to combine the two yet, but at least it's a step in the right direction. You could put each word on its own line, remove the URLs, and then match words of 3 or more letters. It's ugly, but it would work.
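
One possible way to combine the two steps in Python, offered as a sketch rather than a definitive answer: first blank out URLs with a simpler substitution pattern (an assumption on my part, not the line-matching regex above), then collect whole words of 3 or more letters. The names `URL_RE`, `WORD_RE`, and `extract_words` are hypothetical.

```python
import re

# Assumed, simplified URL pattern: scheme followed by typical URL characters.
# It stops at quotes and commas, so text glued to a URL is not swallowed.
URL_RE = re.compile(r"https?://[\w./?=&#%-]+")

# Whole words made only of ASCII letters, 3 or more characters long.
# Tokens that contain digits fail the \b check and are skipped.
WORD_RE = re.compile(r"\b[a-zA-Z]{3,}\b")

def extract_words(text):
    """Strip URLs, then return the remaining letter-only words (3+ chars)."""
    without_urls = URL_RE.sub(" ", text)
    return WORD_RE.findall(without_urls)

text = ("it can dangerous take fido ride: http://t.co/er2wfanzbi "
        "http://t.co/rf3bhpnpwr',each year, on average, 20 billion empty "
        "miles incurred trucks, costs economy billions")

final_text = " ".join(extract_words(text))
print(final_text)
```

On the example from the question this prints the requested final_text. The word pattern on its own could presumably be passed to CountVectorizer as token_pattern, but the URLs would still need to be stripped before tokenizing, otherwise "http" comes through as a word.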

