3/9/2015 Joydeep Ghosh UT-ECE Approaches to Mining Large-Scale Heterogeneous Data: Old and New Prof. Joydeep Ghosh Schlumberger Centennial Chaired Professor Fellow, IEEE Director, IDEAL (Intelligent Data Exploration and Analysis Lab) University of Texas at Austin


What we do

• Data-Driven Modeling & Knowledge Discovery: “Big Data Predictive and Prescriptive Analytics”
  – Data Types:
    • relational databases, distributed sensors, signals, images, web-logs, key-value….
    • data (continuous + symbolic) + domain knowledge
  – Tools:
    • Data mining/stats; web mining; machine learning, neural nets, signal/image processing….
  – Large Scale System issues
  – Speciality: Multi-learner systems
    • Use multiple, complementary approaches for more robust modeling of complex engineering problems
    • Custom models, where “canned solutions” are inadequate.

Multi-sensor Fusion (80s, 90s)

• Blackboards (KBS)

• Multiple Hypothesis Tracking

• Basic Tracking (Kalman filters, Gauss-Markov,..)

• Detection/Identification

• The usual ones +

“Important applications can be found in time-critical situations or in situation with a high decision risk, where human deficiencies are to be compensated for by automatically or interactively working fusion techniques (compensating for decreasing attention in routine situations; focusing the attention on anomalous or rare events; complementing limited memory, reaction, or combination capabilities of human beings)” Koch, 2010.

Rationale

Overall Architecture

(Extreme) Design Choices

Combining Multiple Classifiers

J. Ghosh, S. Beck and L. Deuser, IEEE Jl. of Oceanic Engineering, Vol. 17, No. 4, October 1992, pp. 351-363.

[Figure: pre-processed data from the observed phenomenon → feature sets (FFT, Gabor Wavelets, …, Feature Set M) → classifiers (MLP, RBF, …, Classifier N) → combiner (Ave/median/..)]
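The combiner stage named on this slide (average, median, ..) can be sketched in a few lines. This is a minimal illustration only, assuming each classifier emits a vector of class probabilities for the same input; the function name `combine_posteriors` and the example numbers are hypothetical and do not come from the talk or the cited paper.

```python
from statistics import mean, median


def combine_posteriors(posteriors, rule="average"):
    """Combine per-classifier class-probability vectors into one decision.

    posteriors: list of lists, one probability vector per classifier.
    rule: "average" or "median" (the two combiners named on the slide).
    Returns (predicted_class_index, combined_scores).
    """
    reducer = {"average": mean, "median": median}[rule]
    # zip(*posteriors) groups the scores class by class across classifiers.
    combined = [reducer(scores) for scores in zip(*posteriors)]
    predicted = max(range(len(combined)), key=combined.__getitem__)
    return predicted, combined


# Hypothetical outputs of three classifiers (e.g. an MLP, an RBF network, ...)
# for a single input; the third classifier disagrees with the other two.
outputs = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
]
label_avg, scores_avg = combine_posteriors(outputs, "average")
label_med, scores_med = combine_posteriors(outputs, "median")
```

In this toy case both rules pick class 0, but the median score for class 2 is far less affected by the single dissenting classifier than the average is, which is one common reason to prefer order-statistic combiners when some learners are unreliable.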

Combining Multiple Clusterings (2002)

• Given a set of provisional partitionings, we want to aggregate them into a single consensus partitioning, even without access to original features.

[Figure: Clusterer #1, … (individual cluster labels) → consensus function → (consensus labels)]

Provides Improved Accuracy + Robustness + Knowledge Re-use
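One simple way to build a consensus partitioning from labels alone is a co-association heuristic: count how often each pair of objects is placed in the same cluster, then group objects that co-occur in a majority of the partitionings. The sketch below is an assumption-laden illustration of that idea, not the specific algorithm presented in the talk; the function names and the 0.5 threshold are hypothetical choices.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of clusterings in which each pair of objects shares a cluster."""
    n = len(labelings[0])
    co = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        agree = sum(lab[i] == lab[j] for lab in labelings) / len(labelings)
        co[i][j] = co[j][i] = agree
    return co


def consensus_labels(labelings, threshold=0.5):
    """Link objects co-clustered in more than `threshold` of the clusterings
    and return the connected components as consensus cluster labels."""
    co = co_association(labelings)
    n = len(co)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:  # depth-first search over the thresholded graph
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Three hypothetical clusterers over six objects. Label names differ between
# clusterers, and the third one splits the data differently.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 2, 2],
]
consensus = consensus_labels(partitions)
```

Only the cluster labels are used, so the original features are never needed, and the result is unaffected by how each clusterer happens to name its clusters.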

Combining Multiple Trackers (97,98)

Adaptive Kalman Filter Bank
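The slide gives no detail on how the filter bank adapts, so the following is only a generic sketch of one well-known scheme: run several Kalman filters with different noise assumptions in parallel and weight each by how well it has predicted the measurements so far. The scalar random-walk model, the class `ScalarKalman`, the function `run_bank`, and all numeric values are assumptions made for illustration.

```python
import math


class ScalarKalman:
    """1-D Kalman filter for a random walk: x_k = x_{k-1} + w, z_k = x_k + v."""

    def __init__(self, q, r, x0=0.0, p0=1.0):
        self.q, self.r = q, r      # process / measurement noise variances
        self.x, self.p = x0, p0    # state estimate and its variance

    def step(self, z):
        """Predict, update with measurement z, and return the likelihood of z."""
        p_pred = self.p + self.q   # predicted variance (state mean is unchanged)
        s = p_pred + self.r        # innovation variance
        innov = z - self.x
        k = p_pred / s             # Kalman gain
        self.x += k * innov
        self.p = (1.0 - k) * p_pred
        return math.exp(-0.5 * innov * innov / s) / math.sqrt(2.0 * math.pi * s)


def run_bank(measurements, q_values, r=1.0):
    """Fuse a bank of filters using likelihood-based weights."""
    filters = [ScalarKalman(q, r) for q in q_values]
    weights = [1.0 / len(filters)] * len(filters)
    estimates = []
    for z in measurements:
        likes = [max(f.step(z), 1e-300) for f in filters]  # floor avoids all-zero weights
        weights = [w * l for w, l in zip(weights, likes)]
        total = sum(weights)
        weights = [w / total for w in weights]
        estimates.append(sum(w * f.x for w, f in zip(weights, filters)))
    return estimates, weights


# Hypothetical track: the target sits near 0, then jumps to about 5.
measurements = [0.1, -0.2, 0.05, 5.1, 4.9, 5.2, 5.0]
estimates, weights = run_bank(measurements, q_values=[0.001, 1.0])
```

After the jump, the filter that assumes larger process noise explains the data much better, so the bank shifts nearly all of its weight to it and the fused estimate follows the maneuver.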

Modern Settings: Networked, Heterogeneous Data

(Collective) Matrix Factorization

Factorization of Heterogeneous Data

[Figure: entity types Patients, Diagnoses, Procedures, Medications, Demographics, Physicians; matrices labeled W, X, Y, Z]
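As a hedged illustration of what "collective" factorization usually means, consider two data matrices that share one entity type, for example a hypothetical Patients × Diagnoses matrix X_1 and a Patients × Procedures matrix X_2. A common formulation fits both with a single shared patient factor U. The symbols X_1, X_2, U, V_1, V_2, the trade-off weight α, and the regularization weight γ are assumptions for this sketch and are not the W, X, Y, Z of the slide.

```latex
\min_{U,\,V_1,\,V_2}\;
\lVert X_1 - U V_1^{\top} \rVert_F^2
+ \alpha\,\lVert X_2 - U V_2^{\top} \rVert_F^2
+ \gamma\left(\lVert U \rVert_F^2 + \lVert V_1 \rVert_F^2 + \lVert V_2 \rVert_F^2\right)
```

Because U appears in both reconstruction terms, information from one relation can improve the factors learned for the other, which is the main appeal of this approach for heterogeneous data.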

High-throughput Phenotyping on Electronic Health Records using Multi-Tensor Factorization ($2.2 Mil grant from NSF)

[Figure: phenotyping pipeline. EHRs from Site A and Site B feed Tensor Construction + Generation, where the tensor ≈ λ1 · Phenotype 1 + … + λR · Phenotype R. Further stages are labeled Refinement and Adaptation. Applications: GWAS, Predictive Models, Cohort Construction.]
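The relation in the figure, a tensor approximated by a weighted sum of R phenotypes, has the form of a CP (CANDECOMP/PARAFAC) decomposition. A sketch in standard notation follows, assuming a third-order tensor; the choice of three modes and the factor symbols a_r, b_r, c_r are assumptions, since the slide does not state how the tensor is constructed.

```latex
\mathcal{X} \;\approx\; \sum_{r=1}^{R} \lambda_r \; a_r \circ b_r \circ c_r
```

Here ∘ is the vector outer product, each rank-one term a_r ∘ b_r ∘ c_r plays the role of one candidate phenotype, and λ_r is its weight, matching λ1 … λR in the figure.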

To Come

• Internet of Things (IoT)

• Network of (information) networks

• ….