NLP Using the NLTK Library:
-------------------------------------
TOKENIZE:
from nltk.tokenize import sent_tokenize, word_tokenize
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."
print(sent_tokenize(EXAMPLE_TEXT))
print(word_tokenize(EXAMPLE_TEXT))
STOP WORDS:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# or equivalently, as a plain loop:
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
STEMMING:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
for w in example_words:
    print(ps.stem(w))

OUTPUT:
python
python
python
python
pythonli
PARTS OF SPEECH TAGGING:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))
process_content()
NER:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))
process_content()
NE Types and Examples:
ORGANIZATION - Georgia-Pacific Corp., WHO
PERSON - Eddy Bonte, President Obama
LOCATION - Murray River, Mount Everest
DATE - June, 2008-06-29
TIME - two fifty a m, 1:30 p.m.
MONEY - 175 million Canadian Dollars, GBP 10.40
PERCENT - twenty pct, 18.75 %
FACILITY - Washington Monument, Stonehenge
GPE - South East Asia, Midlothian
LEMMATIZATION (unlike stemming, produces a word that actually exists):
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
CORPORA:
The NLTK corpus is a massive dump of all kinds of natural language data sets that are definitely worth taking a look at.
WORDNET:
WordNet is a lexical database for the English language, which was created by Princeton, and is part of the NLTK corpus.
https://pythonprogramming.net/
TEXT CLASSIFICATION
TEXT FEATURES
NAÏVE BAYES CLASSIFIER
SKLEARN
SENTIMENT ANALYSIS
N-Grams
N-grams are contiguous sequences of nearby words combined into a single unit for representation purposes, where N is the number of words combined.
Uni-Gram
Bi-Gram
Tri-Gram
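The idea above can be sketched as a simple sliding window in plain Python (a minimal sketch; NLTK also provides this via nltk.util.ngrams):

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens (a sliding window)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the weather is great today".split()
print(ngrams(tokens, 1))  # uni-grams: [('the',), ('weather',), ...]
print(ngrams(tokens, 2))  # bi-grams: [('the', 'weather'), ('weather', 'is'), ...]
print(ngrams(tokens, 3))  # tri-grams
```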
Bag of Words
Bag of words is a way to represent the data in a tabular format with columns representing the total vocabulary of the corpus and each row representing a single observation. The cell (intersection of the row and column) represents the count of the word represented by the column in that particular observation.
Limitations:
1. It disregards the order/grammar of the text.
2. The matrix generated by this representation is highly sparse.
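A minimal bag-of-words matrix can be built in plain Python (a sketch of the idea with a made-up two-document corpus; in practice a vectorizer such as sklearn's CountVectorizer does this):

```python
from collections import Counter

docs = ["the sky is blue", "the sun is bright"]

# columns: the total vocabulary of the corpus (sorted for a stable order)
vocab = sorted({w for doc in docs for w in doc.split()})

# rows: one observation per document; each cell counts that column's word
matrix = [[Counter(doc.split())[w] for w in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Even in this tiny example some cells are already zero; with a realistic vocabulary, nearly every cell is zero, which is the sparsity problem noted above.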
TF-IDF (BAG OF WORDS)
TF-IDF is a way of scoring the vocabulary so as to give each word a weight in proportion to its impact on the meaning of a sentence. The score is the product of two independent scores: term frequency (TF) and inverse document frequency (IDF).
Term Frequency (TF): Term frequency is defined as frequency of word in the current document.
Inverse Document Frequency (IDF): a measure of how much information the word provides, i.e., whether it is common or rare across all documents. It is calculated as log(N/d), where N is the total number of documents and d is the number of documents in which the word appears.
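The two scores can be combined directly using the log(N/d) formula above (a sketch with a made-up corpus; production implementations such as sklearn's TfidfVectorizer use smoothed variants of the same idea):

```python
import math
from collections import Counter

def tfidf(docs):
    """Score each word in each document as tf * log(N / d)."""
    tokenized = [doc.split() for doc in docs]
    N = len(tokenized)
    # d: number of documents in which each word appears
    d = Counter(w for doc in tokenized for w in set(doc))
    return [{w: tf * math.log(N / d[w]) for w, tf in Counter(doc).items()}
            for doc in tokenized]

docs = ["the sky is blue", "the sun is bright", "the sun in the sky"]
scores = tfidf(docs)
# "the" occurs in all 3 documents, so idf = log(3/3) = 0: no weight
print(scores[0]["the"])   # 0.0
print(scores[0]["blue"])  # log(3/1), the word is rare, so it gets weight
```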
Embedding Matrix
Each word, in its one-hot encoded form, is multiplied by the embedding matrix to produce the word embedding for the sample.
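In matrix terms, multiplying a one-hot vector by the embedding matrix simply selects one row of the matrix, and that row is the word's embedding (a tiny sketch with a made-up 3-word vocabulary and 2-dimensional embeddings):

```python
vocab = ["cat", "dog", "sky"]   # hypothetical vocabulary
E = [[0.1, 0.3],                # embedding matrix: one row per vocabulary word,
     [0.2, 0.5],                # one column per embedding dimension
     [0.9, 0.4]]

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

def embed(word):
    """one_hot(word) x E -- picks out the matrix row for the word."""
    v = one_hot(word)
    return [sum(v[i] * E[i][j] for i in range(len(vocab)))
            for j in range(len(E[0]))]

print(embed("dog"))  # [0.2, 0.5] -- the second row of E
```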
WORD2VEC
word2vec can be trained in two ways: CBOW (Continuous Bag-Of-Words) and Skip-gram.
In CBOW we have a window around some target word and then consider the words around it (its context). We supply those words as input into our network and then use it to try to predict the target word.
Skip-gram does the opposite of CBOW: you have a target word, and you try to predict the words that are in the window around that word, i.e. predict the context around a word.
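The difference between the two is easiest to see in the (input, output) training pairs each variant generates (a sketch; the window size and sentence are made up, and real word2vec trains a network on these pairs rather than just listing them):

```python
def training_pairs(tokens, window=1, mode="skipgram"):
    """Generate (input, output) training examples for a toy word2vec."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((context, target))             # context predicts target
        else:
            pairs.extend((target, c) for c in context)  # target predicts each context word
    return pairs

tokens = "the sky is blue".split()
print(training_pairs(tokens, mode="cbow"))
print(training_pairs(tokens, mode="skipgram"))
```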