Text Mining Maurice Masih 13030141093 03/24/22 1


Text Mining

Maurice Masih 13030141093

04/15/23 1

Topic of Discussion

• Introduction
• Text mining Comparison with other mining
• Text Mining Process
• How Algorithm is derived for Text Mining
• Text Analysis For Google Sheet
• Conclusion


Introduction

• Text mining is the process of deriving high-quality information from text.
– Non-trivial information
– Unstructured text
• It is also called text data mining or text analytics.

Need

Bio Tech Industry

80% of biological knowledge exists only in research papers (unstructured).

If a scientist manually reads 50 research papers per week and only 10% of them are useful, he or she gets through only 5 useful papers per week.


Text mining Comparison with…

Text Mining

• Information Retrieval
• Web Mining
• Data Mining
• Statistics
• Computational Linguistics & Natural Language Processing

Text Mining Process

1. Text
• Document Clustering
• Text Characteristics

2. Text Preprocessing
• Text Cleanup
• Tokenization

3. Text Transformation
• Text Representation
• Feature Selection

4. Attribute Selection
• Reduce Dimensionality
• Remove irrelevant attributes

5. Data Mining / Pattern Discovery
• Structured database
• Application dependent
• Classic data mining techniques

6. Interpretation / Evaluation
• Terminate or iterate


1. Text

Document clustering
• Large volume of textual data.
• No clear picture of which documents suit the application.
• A common technique is k-means clustering.
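As an illustration of document clustering, here is a minimal k-means sketch over term-count vectors. The helper names `vectorize` and `kmeans`, the toy documents, and the choice to seed the centroids with the first k documents are all assumptions made for this example, not part of the original slides.

```python
from collections import Counter
import math

def vectorize(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts]

def kmeans(points, k, iterations=10):
    """Plain k-means. Seeding with the first k points is an assumption
    made here so that the result is deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy corpus (hypothetical): two biology-like and two finance-like documents.
docs = [
    "gene protein cell",
    "stock market price",
    "protein cell dna",
    "market price trade",
]
labels = kmeans(vectorize(docs), k=2)
print(labels)
```

In practice the seeding strategy and the distance measure (for example cosine instead of Euclidean) strongly affect the clusters, so this should be read as a sketch only.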

Text Characteristics
• Dependency
• Ambiguity
• Noisy data
• Unstructured data


2. Text Preprocessing

Text Cleanup
• Remove ads from the page
• Convert from binary format
• Normalize text
• Deal with tables, figures and formulas
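The cleanup steps can be sketched for the simple case of an HTML page. The function name `clean_text` and the specific rules (drop script/style blocks, strip tags, decode entities, Unicode-normalize, collapse whitespace, lowercase) are assumptions chosen for illustration; real pages usually need a proper HTML parser.

```python
import html
import re
import unicodedata

def clean_text(raw):
    """Hypothetical helper: strip markup and normalize the remaining text."""
    # Drop script/style blocks entirely (a common place for ad code).
    text = re.sub(r"<(script|style)\b.*?</\1>", " ", raw, flags=re.S | re.I)
    # Replace any remaining tags with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode HTML entities such as &nbsp; and &amp;.
    text = html.unescape(text)
    # Fold compatibility characters (e.g. non-breaking spaces) to plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace and lowercase the result.
    return re.sub(r"\s+", " ", text).strip().lower()

cleaned = clean_text("<p>Text&nbsp;Mining   is <b>FUN</b></p><script>ads()</script>")
print(cleaned)
```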

Tokenization
• Splitting up a string of characters into a set of tokens.
• Need to deal with issues like apostrophes and hyphens.
• Need to deal with tenses, parts of speech, etc.
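A minimal tokenizer sketch shows one possible way to treat apostrophes and hyphens: keep them when they sit inside a word. The regular expression and the `tokenize` name are assumptions for this example; handling tenses or parts of speech would need stemming or tagging, which is not shown here.

```python
import re

# A token is a run of letters/digits, optionally joined by internal
# apostrophes or hyphens (e.g. "don't", "state-of-the-art").
TOKEN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenize(text):
    """Split text into lowercase tokens, dropping punctuation."""
    return [t.lower() for t in TOKEN.findall(text)]

tokens = tokenize("Don't split state-of-the-art text-mining tools.")
print(tokens)
```

Whether "text-mining" should be one token or two depends on the application; this sketch keeps it as one.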


3. Text transformation

Text Representation
• A text document is represented by the words (features) it contains and their occurrences.
• Bag of Words
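To make the bag-of-words representation concrete, the sketch below builds a small document-term matrix from two toy sentences. The documents and the variable names are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Toy documents (hypothetical), already lowercased.
docs = [
    "text mining finds patterns in text",
    "data mining finds patterns in data",
]

# One bag (word -> count) per document; word order is discarded.
bags = [Counter(d.split()) for d in docs]

# Shared vocabulary: one column per distinct word.
vocab = sorted(set().union(*bags))

# Document-term matrix: rows are documents, columns follow vocab.
matrix = [[bag[w] for w in vocab] for bag in bags]

print(vocab)
for row in matrix:
    print(row)
```

Each row is the structured form of one document that later stages can work on.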


4. Attribute Selection

Reduction of dimensionality
• Learners have difficulty addressing tasks with high dimensionality.
• Scarcity of resources and feasibility issues also call for a further cutback of attributes.

Irrelevant features
• Not all features help!
• e.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or “sport”.
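One simple way to cut attributes is to filter terms by document frequency: drop terms that occur in almost every document (they do not discriminate) and terms that occur in too few (they are too rare to generalize). This is only one possible technique, swapped in here for illustration; the function name and the `min_df` and `max_df_ratio` thresholds are assumptions.

```python
from collections import Counter

def select_by_document_frequency(docs_tokens, min_df=2, max_df_ratio=0.9):
    """Keep terms that appear in at least min_df documents and in at most
    max_df_ratio of all documents (thresholds are illustrative)."""
    n = len(docs_tokens)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in docs_tokens for t in set(tokens))
    return sorted(
        t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio
    )

# Toy tokenized news snippets (hypothetical).
docs_tokens = [
    ["the", "election", "vote", "result"],
    ["the", "match", "goal", "result"],
    ["the", "election", "campaign"],
    ["the", "match", "score"],
]
kept = select_by_document_frequency(docs_tokens)
print(kept)
```

Here "the" is dropped because it appears everywhere, and terms seen in a single document are dropped as too rare.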


5. Data Mining / Pattern Discovery

• The text mining process merges with the traditional data mining process.
• Classic data mining techniques are used on the structured database that resulted from the previous stages.

6. Interpretation & Evaluation

What to do next?
• Terminate
• Iterate


How Algorithm is derived for Text Mining


Text Analysis For Google Sheet

• Perform sentiment analysis.
• Extract mentions of entities and concepts.
• Summarize long chunks of text.
• Detect the language of a document.
• Find the best hashtags.
• Extract the full text of an article, as well as its author name, embedded media, etc.


Conclusion

Text mining generally consists of analyzing (multiple) text documents by extracting key phrases, concepts, matches, etc., and preparing the text processed in that manner for further analysis with numeric data mining techniques.


