
Page 1

Mining The Social Web

Ch8 Blogs et al.: Natural Language Processing (and Beyond) Ⅰ

Presenter: 김연기 (Kim Yeon-gi), People Who Dream of Becoming Naver Architects (http://Cafe.naver.com/architect1)

Page 2

Natural Language Processing

• Split sentences on periods!
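A quick illustration of where that naive rule breaks (Python 2, matching the book's snippets; the Mr. Green sentences are the book's running example). Abbreviations such as "Mr." also end in a period:

txt = "Mr. Green killed Colonel Mustard in the study with the " + \
      "candlestick. Mr. Green is not a very nice fellow."

# Naive EOS detection: treat every period as a sentence boundary
print txt.split('. ')
# The split also fires on the period in "Mr.", leaving fragments like
# ['Mr', 'Green killed Colonel Mustard in the study with the candlestick', 'Mr', ...]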


Page 4

NLP Pipeline with NLTK

• End-of-sentence (EOS) detection

• Tokenization

• POS tagging

• Chunking

• Extraction
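A minimal sketch of these stages with NLTK's stock tools, one call per stage (the Mr. Green text is the book's running example; exact output depends on the NLTK version and its bundled models):

import nltk

txt = "Mr. Green killed Colonel Mustard in the study with the " + \
      "candlestick. Mr. Green is not a very nice fellow."

sentences = nltk.tokenize.sent_tokenize(txt)                   # EOS detection
tokens = [nltk.tokenize.word_tokenize(s) for s in sentences]   # tokenization
pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]          # POS tagging
chunked = [nltk.ne_chunk(p) for p in pos_tagged_tokens]        # chunking / extraction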

Page 5

Natural Language Processing

• End-of-sentence detection (EOS Detection)
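For comparison with the naive period split, NLTK's Punkt-based sentence detector copes with the "Mr." abbreviation (the printed output below is what it is expected to produce):

import nltk

txt = "Mr. Green killed Colonel Mustard in the study with the " + \
      "candlestick. Mr. Green is not a very nice fellow."

# sent_tokenize uses a pretrained Punkt model that knows common abbreviations
for s in nltk.tokenize.sent_tokenize(txt):
    print s
# Mr. Green killed Colonel Mustard in the study with the candlestick.
# Mr. Green is not a very nice fellow.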


Page 7

Natural Language Processing

• Part-of-speech tagging (POS Tagging)
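A one-liner with NLTK's default tagger, which assigns Penn Treebank tags (the tags shown are indicative; they may vary by NLTK version):

import nltk

tokens = nltk.word_tokenize("Mr. Green is not a very nice fellow.")
print nltk.pos_tag(tokens)
# e.g. [('Mr.', 'NNP'), ('Green', 'NNP'), ('is', 'VBZ'), ('not', 'RB'),
#       ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'NN'), ('.', '.')]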

Page 8

Natural Language Processing

Page 9

Natural Language Processing

• Extraction
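Entity extraction sits on top of the POS-tagged tokens: nltk.ne_chunk returns a tree in which named entities appear as labeled subtrees (labels such as PERSON depend on NLTK's bundled chunker):

import nltk

sentence = "Mr. Green killed Colonel Mustard in the study with the candlestick."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

tree = nltk.ne_chunk(tagged)
print tree
# Named entities show up as subtrees, e.g. (PERSON Green/NNP)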

Page 10

Natural Language Processing

Page 11

Natural Language Processing

Page 12

Natural Language Processing

import feedparser
from nltk import clean_html
from BeautifulSoup import BeautifulStoneSoup

def cleanHtml(html):
    # Strip tags with NLTK's clean_html, then decode HTML entities
    return BeautifulStoneSoup(clean_html(html),
        convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

# FEED_URL is assumed to be set to the blog feed's URL earlier in the example
fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})

Page 13

Natural Language Processing

# Basic stats
num_words = sum([i[1] for i in fdist.items()])
num_unique_words = len(fdist.keys())

# Hapaxes are words that appear only once
num_hapaxes = len(fdist.hapaxes())

top_10_words_sans_stop_words = [w for w in fdist.items()
                                if w[0] not in stop_words][:10]

print post['title']
print '\tNum Sentences:'.ljust(25), len(sentences)
print '\tNum Words:'.ljust(25), num_words
print '\tNum Unique Words:'.ljust(25), num_unique_words
print '\tNum Hapaxes:'.ljust(25), num_hapaxes
print '\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
    '\n\t\t'.join(['%s (%s)' % (w[0], w[1])
                   for w in top_10_words_sans_stop_words])
print

Page 14

Natural Language Processing

Page 15

Natural Language Processing

# Summarization Approach 1:
# Filter out non-significant sentences by using the average score plus a
# fraction of the std dev as a filter
avg = numpy.mean([s[1] for s in scored_sentences])
std = numpy.std([s[1] for s in scored_sentences])
mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
               if score > avg + 0.5 * std]

# Summarization Approach 2:
# Another approach would be to return only the top N ranked sentences
top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

Page 16

Natural Language Processing

Page 17

Natural Language Processing

– Luhn’s Summarization Algorithm

• Score = (number of significant words in the sentence)^2 / (total number of words in the sentence)
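A hedged sketch of that score as a function (the function and parameter names are mine, not the book's; the book's full version scores clusters of significant words within a sentence rather than the sentence as a whole):

def luhn_score(sentence_tokens, important_words):
    # Score = (number of important words in the sentence)^2
    #         / (total number of words in the sentence)
    if not sentence_tokens:
        return 0.0
    hits = len([w for w in sentence_tokens if w in important_words])
    return float(hits ** 2) / len(sentence_tokens)

# e.g. 2 important words in a 10-word sentence -> 2**2 / 10 = 0.4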

Page 18

Natural Language Processing

– Luhn’s Summarization Algorithm

• Score = (number of significant words in the sentence)^2 / (total number of words in the sentence)