NLP Using the NLTK Library:
-------------------------------------
TOKENIZE:
from nltk.tokenize import sent_tokenize, word_tokenize
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."
print(sent_tokenize(EXAMPLE_TEXT))
print(word_tokenize(EXAMPLE_TEXT))
STOP WORDS:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# or equivalently, as a plain loop:
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
STEMMING:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
for w in example_words:
    print(ps.stem(w))

OUTPUT:
python
python
python
python
pythonli
PARTS OF SPEECH TAGGING:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))
process_content()
NER:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))
process_content()
NE Types and Examples:
ORGANIZATION - Georgia-Pacific Corp., WHO
PERSON - Eddy Bonte, President Obama
LOCATION - Murray River, Mount Everest
DATE - June, 2008-06-29
TIME - two fifty a m, 1:30 p.m.
MONEY - 175 million Canadian Dollars, GBP 10.40
PERCENT - twenty pct, 18.75 %
FACILITY - Washington Monument, Stonehenge
GPE - South East Asia, Midlothian
LEMMATIZATION (unlike stemming, produces a word that actually exists):
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
CORPORA:
The NLTK corpus is a massive dump of all kinds of natural language data sets that are definitely worth taking a look at.
WORDNET:
WordNet is a lexical database for the English language, which was created by Princeton, and is part of the NLTK corpus.
https://pythonprogramming.net/
TEXT CLASSIFICATION
TEXT FEATURES
NAÏVE BAYES CLASSIFIER
SKLEARN
SENTIMENT ANALYSIS
N-Grams
N-grams are contiguous sequences of nearby words combined into a single unit for representation purposes, where N is the number of words combined.
Uni-Gram
Bi-Gram
Tri-Gram
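The idea above can be sketched as a simple sliding window in plain Python (a minimal sketch; NLTK also provides this via nltk.util.ngrams):

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens (a sliding window)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the weather is great today".split()
print(ngrams(tokens, 1))  # uni-grams: [('the',), ('weather',), ...]
print(ngrams(tokens, 2))  # bi-grams: [('the', 'weather'), ('weather', 'is'), ...]
print(ngrams(tokens, 3))  # tri-grams
```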
Bag of Words
Bag of words is a way to represent the data in a tabular format with columns representing the total vocabulary of the corpus and each row representing a single observation. The cell (intersection of the row and column) represents the count of the word represented by the column in that particular observation.
Limitations:
1. It disregards the order/grammar of the text.
2. The matrix generated by this representation is highly sparse.
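A minimal bag-of-words matrix can be built in plain Python (a sketch of the idea with a made-up two-document corpus; in practice a vectorizer such as sklearn's CountVectorizer does this):

```python
from collections import Counter

docs = ["the sky is blue", "the sun is bright"]

# columns: the total vocabulary of the corpus (sorted for a stable order)
vocab = sorted({w for doc in docs for w in doc.split()})

# rows: one observation per document; each cell counts that column's word
matrix = [[Counter(doc.split())[w] for w in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Even in this tiny example some cells are already zero; with a realistic vocabulary, nearly every cell is zero, which is the sparsity problem noted above.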
TF-IDF (BAG OF WORDS)
TF-IDF is a way of scoring the vocabulary so as to give each word a weight in proportion to its impact on the meaning of a sentence. The score is the product of two independent scores: term frequency (TF) and inverse document frequency (IDF).
Term Frequency (TF): Term frequency is defined as frequency of word in the current document.
Inverse Document Frequency (IDF): a measure of how much information the word provides, i.e., whether it is common or rare across all documents. It is calculated as log(N/d), where N is the total number of documents and d is the number of documents in which the word appears.
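The two scores can be combined directly using the log(N/d) formula above (a sketch with a made-up corpus; production implementations such as sklearn's TfidfVectorizer use smoothed variants of the same idea):

```python
import math
from collections import Counter

def tfidf(docs):
    """Score each word in each document as tf * log(N / d)."""
    tokenized = [doc.split() for doc in docs]
    N = len(tokenized)
    # d: number of documents in which each word appears
    d = Counter(w for doc in tokenized for w in set(doc))
    return [{w: tf * math.log(N / d[w]) for w, tf in Counter(doc).items()}
            for doc in tokenized]

docs = ["the sky is blue", "the sun is bright", "the sun in the sky"]
scores = tfidf(docs)
# "the" occurs in all 3 documents, so idf = log(3/3) = 0: no weight
print(scores[0]["the"])   # 0.0
print(scores[0]["blue"])  # log(3/1), the word is rare, so it gets weight
```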
Embedding Matrix
Each word, in its one-hot encoded form, is multiplied by the embedding matrix to produce the word embedding for the sample.
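In matrix terms, multiplying a one-hot vector by the embedding matrix simply selects one row of the matrix, and that row is the word's embedding (a tiny sketch with a made-up 3-word vocabulary and 2-dimensional embeddings):

```python
vocab = ["cat", "dog", "sky"]   # hypothetical vocabulary
E = [[0.1, 0.3],                # embedding matrix: one row per vocabulary word,
     [0.2, 0.5],                # one column per embedding dimension
     [0.9, 0.4]]

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

def embed(word):
    """one_hot(word) x E -- picks out the matrix row for the word."""
    v = one_hot(word)
    return [sum(v[i] * E[i][j] for i in range(len(vocab)))
            for j in range(len(E[0]))]

print(embed("dog"))  # [0.2, 0.5] -- the second row of E
```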
WORD2VEC
word2vec can be trained in two ways: CBOW (Continuous Bag-Of-Words) and Skip-gram.
In CBOW we have a window around some target word and then consider the words around it (its context). We supply those words as input into our network and then use it to try to predict the target word.
Skip-gram does the opposite of CBOW: you have a target word, and you try to predict the words that are in the window around that word, i.e. predict the context around a word.
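The difference between the two is easiest to see in the (input, output) training pairs each variant generates (a sketch; the window size and sentence are made up, and real word2vec trains a network on these pairs rather than just listing them):

```python
def training_pairs(tokens, window=1, mode="skipgram"):
    """Generate (input, output) training examples for a toy word2vec."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((context, target))             # context predicts target
        else:
            pairs.extend((target, c) for c in context)  # target predicts each context word
    return pairs

tokens = "the sky is blue".split()
print(training_pairs(tokens, mode="cbow"))
print(training_pairs(tokens, mode="skipgram"))
```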