
Data Mining  

Eamonn Keogh

What is data mining?

• Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information.

• In my lab, we tend to look at data and problems that no one else looks at. 

Data Mining People

• Eamonn Keogh
• Vagelis Hristidis
• Vassilis Tsotras
• Chinya Ravishankar
• Michael Pazzani
• Christian Shelton (AI)
• Stefano Lonardi (Bioinformatics)

My PhD Students

• Jessica Lin (Ph.D. 2005, George Mason University)
• Chotirat (Ann) Ratanamahatana (Ph.D. 2005, Chulalongkorn University)
• Li Wei (Ph.D. 2006, Google)
• Xiaopeng Xi (Ph.D. 2007, Yahoo)
• Dragomir Yankov (Ph.D. 2008, Yahoo)
• Lexiang Ye (Ph.D. 2010, Google)
• Xiaoyue (Elaine) Wang (Ph.D. 2010, Nokia)
• Jin-Wien Shieh (Ph.D. 2010, Microsoft)
• Qiang Zhu (Ph.D. 2011, stumbleupon.com)
• Abdullah Mueen (Ph.D. 2012, Microsoft)
• Bilson Campana (Ph.D., going to Google at Xmas)
• Thanawin (Art) Rakthanmanon (Ph.D. ongoing)
• Bing Hu (Ph.D. ongoing)
• Yuan Hao (Ph.D. ongoing)
• Jesin Zakaria (Ph.D. ongoing)
• Yipeng Chen (Ph.D. ongoing)

[Figure: Leaf Decision Tree and Shapelet Dictionary for false nettles vs. stinging nettles. Labels shown: shapelet I, value 5.1, yes/no branches, leaves 0 and 1.]

[Figure: Decision Tree for Arrowheads. Shapelet Dictionary entries I and II, labeled (Clovis) and (Avonlea), with values 11.24 and 85.47. Tree leaves: Clovis, Avonlea. Inset: training data (subset) for the classes Avonlea, Clovis, and Mix.]

Of course, this is a decision tree, and we eventually want to do clustering. However, in general, features that are good for classification are also good for clustering.

To do: On a small labeled subset of the data, learn a dictionary of shapelets. Code the large unlabeled dataset with reference to that dictionary.

The shapelet decision tree classifier achieves an accuracy of 80.0%; the rotation-invariant one-nearest-neighbor classifier achieves 68.0%.
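
The "code the dataset with reference to a dictionary" step above can be sketched in a few lines. This is only an illustration under assumptions: the shapelets and series are made-up numbers, the function names are hypothetical, and the z-normalization that shapelet methods usually apply to each window is omitted for brevity. Each series is turned into a vector of its distances to the dictionary shapelets; a decision tree node would compare one such distance to a threshold, and a clustering algorithm could use the whole vector as features.

```python
import math

def shapelet_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any
    equal-length sliding window of the series."""
    m = len(shapelet)
    best = math.inf
    for start in range(len(series) - m + 1):
        d = math.sqrt(sum((series[start + i] - shapelet[i]) ** 2
                          for i in range(m)))
        best = min(best, d)
    return best

def encode(series, dictionary):
    """Code one series as its distances to every dictionary shapelet."""
    return [shapelet_distance(series, s) for s in dictionary]

# Hypothetical dictionary, as if learned on a small labeled subset.
dictionary = [[0.0, 1.0, 0.0],   # a "spike" shapelet
              [1.0, 1.0, 1.0]]   # a "plateau" shapelet

# Hypothetical unlabeled series to be coded.
series_a = [0.0, 0.0, 1.0, 0.0, 0.0]   # contains the spike
series_b = [0.0, 1.0, 1.0, 1.0, 0.0]   # contains the plateau

codes = [encode(s, dictionary) for s in (series_a, series_b)]
for code in codes:
    print(code)
```

Here series_a sits at distance 0 from the spike shapelet and series_b at distance 0 from the plateau shapelet, so the two coded vectors separate cleanly.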

There now exist perhaps tens of millions of digitized pages of historical manuscripts, dating back to the 12th century, that feature one or more heraldic shields.

The images are often stained, faded, or torn.

Wouldn’t it be great if we could automatically hyperlink all similar shields to each other?

For example, here we could link two occurrences of the Von Sax family shield.

To do this, we need to consider shape, color, and texture. Let’s just consider shape for now…

Manesse Codex: an illuminated manuscript in codex form, copied and illustrated between 1304 and 1340 in Zurich.

Indexing and Mining Rock Art

Rock art is found on every continent except Antarctica.

To date, computer science has had little impact on analysis of rock art.

A decade ago, Walt et al. summed up the state of petroglyph research by noting, “Complete-site and cross-site research thus remains impossible, incomplete, or impressionistic”

Australia may have 100 million examples

[Figure: example petroglyph classes: Atlatls, Anthropomorphs, Bighorn Sheep.]

One challenge is designing distance measures.

For example, we would like to find [image] and [image] similar, even though one is solid and one is hollow.*

*Zhu, Wang, Keogh, Lee (2009). Augmenting the Generalized Hough Transform to Enable the Mining of Petroglyphs. SIGKDD 2009.
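
To make the solid-versus-hollow problem concrete, here is a toy sketch. It is not the Generalized Hough Transform approach of the paper cited above; it only shows why a naive pixel-by-pixel distance fails and how comparing outlines instead removes the problem in this simple case. The images, sizes, and function names are all made up for illustration.

```python
def outline(img):
    """Keep only set pixels that touch an unset pixel (or the image
    edge) through one of their four direct neighbors."""
    h, w = len(img), len(img[0])

    def get(r, c):
        return img[r][c] if 0 <= r < h and 0 <= c < w else 0

    return [[1 if img[r][c] and not (get(r - 1, c) and get(r + 1, c)
                                     and get(r, c - 1) and get(r, c + 1))
             else 0
             for c in range(w)]
            for r in range(h)]

def hamming(a, b):
    """Number of pixel positions where two binary images differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def square(size, lo, hi, filled):
    """Binary image with a square covering rows/cols lo..hi-1,
    either filled in or drawn as a one-pixel-wide frame."""
    img = [[0] * size for _ in range(size)]
    for r in range(lo, hi):
        for c in range(lo, hi):
            on_frame = r in (lo, hi - 1) or c in (lo, hi - 1)
            if filled or on_frame:
                img[r][c] = 1
    return img

solid = square(12, 2, 10, filled=True)
hollow = square(12, 2, 10, filled=False)

raw_distance = hamming(solid, hollow)                        # counts the interior
outline_distance = hamming(outline(solid), outline(hollow))  # compares shapes only
print(raw_distance, outline_distance)
```

The raw distance counts every interior pixel as a mismatch, while the outline distance treats the two squares as the same shape. Real petroglyphs are noisy and hand-drawn, so a practical measure needs to be far more tolerant than this.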

If we assume that we have high-quality binary images of rock art, then we can do clustering, classification, indexing, and motif discovery.

Apple maggots cause two types of injury: dimpling and tunneling. Dimpling occurs around the site where eggs are laid, causing the flesh to stop growing, resulting in a sunken, misshapen, dimpled area. Tunneling, done by the larvae (maggots) eating in the fruit, causes the pulp to break down, discolor, and start to rot. The tunnels are often enlarged by bacterial decay. Damaged fruit eventually becomes soft and rotten and cannot be used.

Apple Maggot (Rhagoletis pomonella)

Carbaryl is an insecticide that is widely used agriculturally. Effective, but likely a human carcinogen, and it kills honey bees and other pollinators [1].

[1] http://npic.orst.edu/factsheets/carbgen.pdf
[2] http://www.maine.gov/agriculture/pesticides/gotpests/bugs/factsheets/apple-maggot-cornell.pdf

One Example Crop/Insect

Why Insects Matter I

Because they eat/destroy $40 billion+ worth of food each year.

Surround WP: a crop protectant against insects. Derived from kaolin clay, a natural mineral, it forms a barrier that acts to control insect pests.

Effective and safe, but very expensive.

Why Insects Matter II

Because they kill over one million people each year.

[Figure: One second of audio from our sensor. Horizontal axis from 0 to 4.5 × 10^4, vertical axis from -0.2 to 0.2. Annotations: background noise; bee begins to cross laser; bee has passed through the laser.]

One second of audio from our sensor. The Common Eastern Bumble Bee (Bombus impatiens) takes about one tenth of a second to pass the laser.

Our Sensor

[Figure: Frequency spectra, Frequency (Hz) from 100 to 800, for Bombus impatiens, Culex quinquefasciatus, and Aedes aegypti.]

[Figure: Spectrum, Frequency (Hz) from 0 to 1000. Peak at 705 Hz.]

Almost certainly an Aedes aegypti.
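
Reading off a spectral peak like the 705 Hz one above can be sketched as follows. This is a minimal illustration, not the lab's actual pipeline. The 44,100 Hz sample rate is an assumption, suggested only by the audio plot's horizontal axis running to about 4.5 × 10^4 over one second. The signal is synthetic, a sine wave plus noise standing in for a real recording, and the function names are hypothetical.

```python
import cmath
import math
import random

SAMPLE_RATE = 44100   # assumed; not stated in the slides
DURATION = 0.1        # seconds, roughly one pass through the laser
WINGBEAT_HZ = 705     # used only to synthesize a stand-in signal

def magnitude_at(signal, freq_hz, sample_rate):
    """Magnitude of the discrete Fourier sum of the signal at freq_hz."""
    step = -2j * math.pi * freq_hz / sample_rate
    return abs(sum(x * cmath.exp(step * n) for n, x in enumerate(signal)))

def peak_frequency(signal, sample_rate, candidates):
    """Candidate frequency (Hz) with the largest spectral magnitude."""
    return max(candidates, key=lambda f: magnitude_at(signal, f, sample_rate))

# Synthetic stand-in for sensor audio: a wingbeat tone plus mild noise.
random.seed(0)
n_samples = round(SAMPLE_RATE * DURATION)
signal = [0.2 * math.sin(2 * math.pi * WINGBEAT_HZ * n / SAMPLE_RATE)
          + random.gauss(0, 0.02)
          for n in range(n_samples)]

# Scan 100..1000 Hz in 5 Hz steps, roughly the range shown in the plots.
peak = peak_frequency(signal, SAMPLE_RATE, range(100, 1001, 5))
print(peak)
```

A classifier would then compare the recovered peak with known wingbeat frequencies for each species. Those reference values are not given here, so the sketch stops at the peak.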

Eamonn Keogh
Computer Science & Engineering Department
University of California – Riverside
