Data Mining Eamonn Keogh

Page 1: Data Mining

Data Mining  

Eamonn Keogh

Page 2: Data Mining

What is data mining?

• Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information.

• In my lab, we tend to look at data and problems that no one else looks at. 

Page 3: Data Mining

Data Mining People

• Eamonn Keogh
• Vagelis Hristidis
• Vassilis Tsotras
• Chinya Ravishankar
• Michael Pazzani
• Christian Shelton (AI)
• Stefano Lonardi (Bioinformatics)

Page 4: Data Mining

My PhD Students

• Jessica Lin (Ph.D. 2005, George Mason University)
• Chotirat (Ann) Ratanamahatana (Ph.D. 2005, Chulalongkorn University)
• Li Wei (Ph.D. 2006, Google)
• Xiaopeng Xi (Ph.D. 2007, Yahoo)
• Dragomir Yankov (Ph.D. 2008, Yahoo)
• Lexiang Ye (Ph.D. 2010, Google)
• Xiaoyue (Elaine) Wang (Ph.D. 2010, Nokia)
• Jin-Wien Shieh (Ph.D. 2010, Microsoft)
• Qiang Zhu (Ph.D. 2011, stumbleupon.com)
• Abdullah Mueen (Ph.D. 2012, Microsoft)
• Bilson Campana (Ph.D., going to Google at Xmas)
• Thanawin (Art) Rakthanmanon (Ph.D. ongoing)
• Bing Hu (Ph.D. ongoing)
• Yuan Hao (Ph.D. ongoing)
• Jesin Zakaria (Ph.D. ongoing)
• Yipeng Chen (Ph.D. ongoing)

Page 5: Data Mining

Figure: leaves of false nettles and stinging nettles.

Page 6: Data Mining

Figure: leaf outlines labeled false nettles and stinging nettles, with a panel titled Shapelet. Further panels: Shapelet Dictionary (shapelet I, value 5.1) and Leaf Decision Tree (a yes/no test on shapelet I, with leaves 0 and 1).

Page 7: Data Mining

Decision Tree for Arrowheads

Figure: Shapelet Dictionary with shapelets I (Clovis) and II (Avonlea) and the values 11.24 and 85.47 (x-axis 0 to 400, y-axis 0 to 1.5); Arrowhead Decision Tree with nodes I and II and leaves 0, 1, and 2; training data (subset) for the classes Clovis, Avonlea, and Mix.

Of course, this is a decision tree, and we eventually want to do clustering. However, in general, features that are good for classification are also good for clustering.

To do: On a small labeled subset of the data, learn a dictionary of shapelets. Code the large unlabeled dataset with reference to that dictionary.

The shapelet decision tree classifier achieves an accuracy of 80.0%, whereas the rotation-invariant one-nearest-neighbor classifier achieves 68.0%.
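The to-do item on this slide can be sketched in code. This is a minimal illustration and not the authors' implementation: it assumes each shape has already been converted to a one-dimensional series, it uses plain Euclidean distance with no normalization, and the shapelets and thresholds are made-up toy values. The names `subsequence_distance`, `encode`, and `classify` are hypothetical.

```python
import math


def subsequence_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any
    equal-length contiguous window of the series."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, shapelet)))
        best = min(best, dist)
    return best


def encode(series, dictionary):
    """Code one series as its vector of distances to every shapelet."""
    return [subsequence_distance(series, shapelet) for shapelet, _ in dictionary]


def classify(series, shapelet, threshold, near_label, far_label):
    """One decision-tree node: branch on the distance to a shapelet."""
    if subsequence_distance(series, shapelet) <= threshold:
        return near_label
    return far_label


# Toy dictionary of (shapelet, threshold) pairs; the values are illustrative only.
dictionary = [([0.0, 1.0, 0.0], 0.5), ([1.0, 1.0, 1.0], 0.5)]

spiky = [0.0, 0.0, 1.0, 0.0, 0.0]
flat = [1.0, 1.0, 1.0, 1.0, 1.0]

spiky_code = encode(spiky, dictionary)
flat_code = encode(flat, dictionary)
```

Each unlabeled series becomes a short vector of distances, one entry per shapelet, which could then be handed to any standard clustering algorithm.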

Page 8: Data Mining

There now exist perhaps tens of millions of digitized pages of historical manuscripts, dating back to the 12th century, that feature one or more heraldic shields.

The images are often stained, faded or torn

Page 9: Data Mining

Wouldn’t it be great if we could automatically hyperlink all similar shields to each other?

For example, here we could link two occurrences of the Von Sax family shield.

To do this, we need to consider shape, color and texture. Let's just consider shape for now…

Manesse Codex: an illuminated manuscript in codex form, copied and illustrated between 1304 and 1340 in Zurich

Page 10: Data Mining

Indexing and Mining Rock Art

Rock art is found on every continent except Antarctica.

To date, computer science has had little impact on analysis of rock art.

A decade ago, Walt et al. summed up the state of petroglyph research by noting, “Complete-site and cross-site research thus remains impossible, incomplete, or impressionistic”

Australia may have 100 million examples

Page 11: Data Mining

Atlatls

Anthropomorphs

Bighorn Sheep

One challenge is designing distance measures.

For example, we would like to find two petroglyphs (shown as images on the slide) similar, even though one is solid and one is hollow.*

*Zhu, Wang, Keogh, Lee (2009). Augmenting the Generalized Hough Transform to Enable the Mining of Petroglyphs. SIGKDD 2009.

If we assume that we have high-quality binary images of rock art, then we can do clustering, classification, indexing, and motif discovery.
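To make the solid-versus-hollow requirement concrete, here is a small sketch of one possible distance measure. It is not the augmented Generalized Hough Transform from the cited paper; it swaps in a simpler chamfer-style comparison of boundary pixels, and it ignores translation, scale, and rotation. The binary images and the function names (`boundary_pixels`, `chamfer_distance`, `pixel_mismatch`) are hypothetical.

```python
import math


def boundary_pixels(image):
    """Foreground pixels ('#') that touch the background or the image edge."""
    rows, cols = len(image), len(image[0])
    edge = set()
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != "#":
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < rows and 0 <= nc < cols)
                if outside or image[nr][nc] != "#":
                    edge.add((r, c))
                    break
    return edge


def chamfer_distance(points_a, points_b):
    """Symmetric mean nearest-neighbour distance between two point sets."""
    if not points_a or not points_b:
        raise ValueError("both shapes need at least one boundary pixel")

    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))


def pixel_mismatch(image_a, image_b):
    """Naive baseline: number of pixel positions where the images differ."""
    return sum(a != b
               for row_a, row_b in zip(image_a, image_b)
               for a, b in zip(row_a, row_b))


# Toy binary images: a solid square, the same square hollowed out, and a bar.
solid = [".......", ".#####.", ".#####.", ".#####.",
         ".#####.", ".#####.", "......."]
hollow = [".......", ".#####.", ".#...#.", ".#...#.",
          ".#...#.", ".#####.", "......."]
bar = [".......", ".......", ".......", ".#####.",
       ".......", ".......", "......."]

solid_vs_hollow = chamfer_distance(boundary_pixels(solid), boundary_pixels(hollow))
solid_vs_bar = chamfer_distance(boundary_pixels(solid), boundary_pixels(bar))
```

In this toy case the solid and hollow squares share the same boundary, so their boundary distance is zero even though nine pixels differ, while the bar remains at a positive distance from both.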

Page 12: Data Mining

Apple maggots cause two types of injury: dimpling and tunneling. Dimpling occurs around the site where eggs are laid, causing the flesh to stop growing, resulting in a sunken, misshapen, dimpled area. Tunneling, done by the larvae (maggots) eating in the fruit, causes the pulp to break down, discolor, and start to rot. The tunnels are often enlarged by bacterial decay. Damaged fruit eventually becomes soft and rotten and cannot be used.

Apple Maggot (Rhagoletis pomonella)

Carbaryl is an insecticide that is widely used agriculturally. Effective, but likely a human carcinogen, and it kills honey bees and other pollinators [1].

[1] http://npic.orst.edu/factsheets/carbgen.pdf
[2] http://www.maine.gov/agriculture/pesticides/gotpests/bugs/factsheets/apple-maggot-cornell.pdf

One Example Crop/Insect

Why Insects Matter I

Because they eat/destroy $40 billion+ worth of food each year

Surround WP: a crop protectant against insects. Derived from kaolin clay, a natural mineral, it forms a barrier that acts to control insect pests.

Effective & safe, but very expensive

Page 13: Data Mining

Why Insects Matter II

Because they kill over one million people each year

Page 14: Data Mining

Figure: One second of audio from our sensor (x-axis 0 to 4.5 × 10^4, y-axis -0.2 to 0.2). The Common Eastern Bumble Bee (Bombus impatiens) takes about one tenth of a second to pass the laser.

Plot annotations: background noise; bee begins to cross laser; bee has passed through the laser.

Our Sensor

Page 15: Data Mining

Figure: frequency spectra (x-axis: Frequency (Hz), 100 to 800) for Bombus impatiens, Culex quinquefasciatus, and Aedes aegypti.

Figure: a further spectrum (x-axis: Frequency (Hz), 0 to 1000).

Peak at 705 Hz

Almost certainly an Aedes aegypti
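The peak-finding step behind this slide can be sketched as follows. This is an assumption-laden illustration rather than the lab's actual pipeline: the signal is synthetic, the 2000 Hz sample rate is made up, and a direct DFT is used only to keep the example dependency-free (an FFT library would normally be used instead). The function name `dominant_frequency` is hypothetical. Turning a peak into a species label would additionally need reference spectra like those shown on the slide, which are not reproduced here.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate, max_hz):
    """Return the frequency (Hz) of the largest DFT magnitude up to max_hz.

    Uses a direct DFT, so the cost grows with len(samples) * number of bins.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("samples must be non-empty")
    resolution = sample_rate / n            # width of one DFT bin in Hz
    last_bin = min(n // 2, int(max_hz / resolution))
    best_bin, best_mag = 0, -1.0
    for k in range(1, last_bin + 1):        # skip bin 0 (the DC offset)
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * resolution


# Synthetic stand-in for a sensor recording: 0.2 s of a 705 Hz tone
# plus a weaker 300 Hz component. All values here are illustrative.
SAMPLE_RATE = 2000  # Hz, assumed for this sketch
signal = [
    math.sin(2 * math.pi * 705 * t / SAMPLE_RATE)
    + 0.3 * math.sin(2 * math.pi * 300 * t / SAMPLE_RATE)
    for t in range(400)
]

peak_hz = dominant_frequency(signal, SAMPLE_RATE, max_hz=1000)
```

With 400 samples at 2000 Hz, each DFT bin is 5 Hz wide, so the 705 Hz tone falls exactly on a bin and is recovered as the peak.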

Page 16: Data Mining

Eamonn Keogh
Computer Science & Engineering Department
University of California – Riverside

[email protected]