Data Exploration. Jim Hendler, Director, Rensselaer Institute for Data Exploration and Applications (The Rensselaer IDEA), Rensselaer Polytechnic Institute, USA. http://www.cs.rpi.edu/~hendler

The Rensselaer IDEA: Data Exploration

DESCRIPTION

The Rensselaer Institute for Data Exploration and Applications is addressing new modes of data exploration and integration to enhance the work of campus researchers (and beyond). This talk outlines the "data exploration" technologies being explored

Page 1: The Rensselaer IDEA: Data Exploration

Data Exploration

Jim Hendler

Director, Rensselaer Institute for Data Exploration and Applications

THE RENSSELAER IDEA

Rensselaer Polytechnic Institute, USA

http://www.cs.rpi.edu/~hendler

Page 2: The Rensselaer IDEA: Data Exploration

IDEA

Data-driven research areas at RPI

• Data-driven Medical and Healthcare Applications
• Predictive Models for Business and Economics
• “Biome” studies for Built and Natural Environments
• Question Answering from texts and data
• Resiliency Models for Population-Scale Problems and cyber-security domains
• Semantically-enabled Data Services for Science and Engineering Research
• Materials genome and nano-manufacturing informatics
• Platforms for testing Policy and Open Data issues
• …

Page 3: The Rensselaer IDEA: Data Exploration

The Rensselaer IDEA: empowering our researchers

Data discovery, integration, and interaction technologies

Application-specific data tools

Page 4: The Rensselaer IDEA: Data Exploration

The trunk: Shared Data Technologies

High Performance Modeling and Simulation
• Center for Computational Innovation

Cognitive Computing
• Watson at Rensselaer IBM Partnership

Perceptualization
• Experimental Multimedia Performing Arts Center

Data Science
• Data Science Research Center

Page 5: The Rensselaer IDEA: Data Exploration

Roots: Data Exploration

Discover

Integrate

Validate

Explain

Geekopedia: Data exploration helps a data consumer focus an information search on the pertinent aspects of relevant data before true analysis can be achieved. In large data sets, data is not gathered or controlled in a focused manner. Even in smaller data sets, data that are not gathered with a rigid and specific technique can end up disorganized, in a myriad of subsets each…

DATA

Page 6: The Rensselaer IDEA: Data Exploration

Data Exploration Challenges

Discover

Integrate

Validate

Explain

These needs live outside traditional data/info architectures

Page 7: The Rensselaer IDEA: Data Exploration

Discovery needs semantics

How do you find the data you need?

Middle Eastern Terrorists for $800?

Page 8: The Rensselaer IDEA: Data Exploration

Discovery – there’s a lot out there

Page 9: The Rensselaer IDEA: Data Exploration

Discovery needs more than keywords

World Bank: Africa

US Data.gov: Crop

Africover: Agriculture

Kenya: Agricultural
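
The four catalog entries above use different words for closely related material, so an exact keyword match finds only one of them. A minimal sketch of the difference, assuming each entry reads as "source: keyword" and assuming a small hand-written concept map (the map, the function names, and the matching rule are illustrative and not from the talk):

```python
# Catalog entries as listed on the slide, read here as (source, keyword).
CATALOG = [
    ("World Bank", "Africa"),
    ("US Data.gov", "Crop"),
    ("Africover", "Agriculture"),
    ("Kenya", "Agricultural"),
]

# Assumed concept map (hypothetical): lowercase term -> broader concepts.
CONCEPTS = {
    "africa": {"africa"},
    "kenya": {"africa"},
    "africover": {"africa", "agriculture"},
    "crop": {"agriculture"},
    "agriculture": {"agriculture"},
    "agricultural": {"agriculture"},
}


def keyword_search(query):
    """Return sources whose keyword equals the query (case-insensitive)."""
    q = query.lower()
    return [src for src, kw in CATALOG if kw.lower() == q]


def concept_search(query):
    """Return sources whose name or keyword shares a concept with the query."""
    q = query.lower()
    wanted = CONCEPTS.get(q, {q})
    hits = []
    for src, kw in CATALOG:
        found = set()
        for term in (src.lower(), kw.lower()):
            found |= CONCEPTS.get(term, set())
        if wanted & found:
            hits.append(src)
    return hits


print(keyword_search("agriculture"))
print(concept_search("agriculture"))
```

In this toy setup the keyword search returns only the Africover entry, while the concept search also returns the Data.gov and Kenya entries. A real system would draw the concept map from a shared vocabulary or ontology rather than a hard-coded dictionary.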

Page 10: The Rensselaer IDEA: Data Exploration

Integration needs Semantics

Person
• RIN: 660125137
• Address #: 1118
• Address St: Pinehurst
• Address zip: 12203
• Course topic: CSCI
• Course #: 4961

Campus Personnel
• RPI ID: 660125137
• Name: Hendler

Campus Classes
• CRN: 1118
• Name: Intro to Physics

YES

NO!!!!
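
One reading of the YES and NO labels is that the RIN and the RPI ID denote the same person identifier, while the street number 1118 and the CRN 1118 merely share a value. A minimal sketch of that distinction, assuming each field carries a semantic label (the `meaning` strings and function names are hypothetical, not from the talk):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str     # column name as shown on the slide
    meaning: str  # assumed semantic label (hypothetical)
    value: str


person = [
    Field("RIN", "rpi-person-id", "660125137"),
    Field("Address #", "street-number", "1118"),
]
personnel = [Field("RPI ID", "rpi-person-id", "660125137")]
classes = [Field("CRN", "course-registration-number", "1118")]


def value_matches(a, b):
    """Pair up fields that merely hold equal values."""
    return [(x.name, y.name) for x in a for y in b if x.value == y.value]


def semantic_matches(a, b):
    """Pair up fields that hold equal values and mean the same thing."""
    return [
        (x.name, y.name)
        for x in a
        for y in b
        if x.value == y.value and x.meaning == y.meaning
    ]


print(value_matches(person, classes))       # value-only join links Address # to CRN
print(semantic_matches(person, classes))    # semantic join rejects that link
print(semantic_matches(person, personnel))  # RIN and RPI ID still join
```

The value-only join produces the spurious Address # to CRN link; requiring the meanings to agree removes it while keeping the RIN to RPI ID link.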

Page 11: The Rensselaer IDEA: Data Exploration

Semantic Web and Linked Data (UK)

County Council

Ordnance Survey

Royal Mail

IOGDC Open Data Tutorial 11

Page 12: The Rensselaer IDEA: Data Exploration

http://logd.tw.rpi.edu

Data Mashups

Page 13: The Rensselaer IDEA: Data Exploration

Validation needs semantics

Easy for us

Page 14: The Rensselaer IDEA: Data Exploration

Hard for machines…

Head to head comparison shows that burglaries in Avon and Somerset (UK) far exceed those in Los Angeles, California

Page 15: The Rensselaer IDEA: Data Exploration

Data + everything else you know

Same or different?

Do the terms mean the same? Are they collected in the same way? Are they processed differently? …
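
The three questions above can be treated as a pre-check that runs before any two series are compared. A minimal sketch, assuming each series carries descriptive metadata; the field names and the placeholder values are hypothetical and say nothing about how either police force actually records burglaries:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesInfo:
    label: str
    term_definition: str    # what the term (e.g. "burglary") covers
    collection_method: str  # how the counts were gathered
    processing: str         # how the counts were processed afterwards


def comparability_issues(a, b):
    """Return the reasons two series should not be compared directly."""
    issues = []
    if a.term_definition != b.term_definition:
        issues.append("terms may not mean the same")
    if a.collection_method != b.collection_method:
        issues.append("collected in different ways")
    if a.processing != b.processing:
        issues.append("processed differently")
    return issues


# Placeholder metadata only; the values are not real descriptions.
uk = SeriesInfo("Avon and Somerset", "definition-A", "method-A", "raw counts")
us = SeriesInfo("Los Angeles", "definition-B", "method-B", "raw counts")

print(comparability_issues(uk, us))
```

An empty list would mean the metadata gives no reason to distrust the comparison; it does not prove the comparison is sound.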

Page 16: The Rensselaer IDEA: Data Exploration

Trends in Smoking Prevalence, Tobacco Policy Coverage and Tobacco Prices (1991-2007)

Validation/Explanation need knowledge

Statistical correlation needs explanation

Page 17: The Rensselaer IDEA: Data Exploration

Explanation also needs Semantics

Inference Web: McGuinness – various DoD/IC projects

Page 18: The Rensselaer IDEA: Data Exploration

Closing the loop: where do the semantics come from?

Data

Prediction

Model

Design

How do we go from the predictive analytics of Big Data to models/explanations that allow new understanding?

Page 19: The Rensselaer IDEA: Data Exploration

1. Better tools for Analytics, Agents and HPC

Make the tools and algorithms being developed by RPI researchers more “reusable” and multitask (including HPC data-analytic tools)

Page 20: The Rensselaer IDEA: Data Exploration

2. Next-Gen Visualization (at scale)

How can multi-modal, multi-user, large scale sensory (visualization, sonification, haptics) interaction change the way we understand data?

Page 21: The Rensselaer IDEA: Data Exploration

3. Include “agents” in the modeling

Develop technologies that enable researchers to work with “human-based” data at larger scales and in new ways
• Population-scale computing models for agent-based simulations

Page 22: The Rensselaer IDEA: Data Exploration

Approach

Platform: Research in using supercomputers for discrete modeling
• Carothers’ ROSS model

KR Model:
• Weaver’s restricted rules on graphs

Challenge problem:
• Classification algorithms at petaflop scale
• “Logical” (nonlinear, discontinuous) agents
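
To make "discrete modeling" with "logical" agents concrete, here is a toy, single-process discrete-event loop. It is only a sketch under stated assumptions: it is not ROSS and does not reproduce its parallel execution, and the ring topology, threshold rule, and function name are invented for illustration. Each agent stays inactive until it has received a threshold number of signals, then activates and signals its neighbor, which is a discontinuous response rather than a smooth one.

```python
import heapq


def simulate(num_agents, threshold, initial_signals, delay=1.0, end_time=100.0):
    """Run a tiny discrete-event simulation of threshold agents on a ring.

    initial_signals is a list of (time, agent_index) events.
    Returns a dict mapping agent index to the time it activated.
    """
    counts = [0] * num_agents
    activated_at = {}
    queue = list(initial_signals)
    heapq.heapify(queue)  # events are processed in timestamp order

    while queue:
        time, agent = heapq.heappop(queue)
        if time > end_time:
            break
        counts[agent] += 1
        # Discontinuous rule: nothing happens until the threshold is hit.
        if counts[agent] == threshold and agent not in activated_at:
            activated_at[agent] = time
            neighbor = (agent + 1) % num_agents
            for k in range(threshold):
                heapq.heappush(queue, (time + delay * (k + 1), neighbor))
    return activated_at


print(simulate(3, 2, [(0.0, 0), (0.5, 0)]))
print(simulate(3, 2, [(0.0, 0)]))
```

With two initial signals the activation cascades around the ring; with one signal nothing activates at all, which is the kind of discontinuity that is hard to capture with smooth models.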

Page 23: The Rensselaer IDEA: Data Exploration

4. Exploit Cognitive Computing

IDEA will be the hub of Rensselaer’s cognitive-computing research
• e.g., answer questions such as “Why” and “How” integrated with large-scale simulations

Page 24: The Rensselaer IDEA: Data Exploration

Watson’s parallel model

Distributed (coarse-grained) parallelism

© Making Watson Fast, IBM J Res and Dev, 3/4 2012
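
Coarse-grained parallelism means splitting the work into large, independent tasks that need no communication while they run. A minimal sketch of that decomposition, assuming a set of candidate answers that can each be scored on its own; the word-overlap scoring function is a hypothetical stand-in and not Watson's scoring:

```python
from concurrent.futures import ThreadPoolExecutor


def score(candidate, clue_terms):
    """Hypothetical scorer: count words shared by the candidate and the clue."""
    return len(set(candidate.lower().split()) & clue_terms)


def rank(candidates, clue, workers=4):
    """Score every candidate as an independent task, then sort by score."""
    clue_terms = set(clue.lower().split())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each candidate is one coarse-grained task; map preserves input order.
        scores = list(pool.map(lambda c: score(c, clue_terms), candidates))
    return sorted(zip(candidates, scores), key=lambda pair: -pair[1])


print(rank(["red apple", "green pear", "red green apple"], "red apple pie"))
```

Threads are used here only to keep the sketch self-contained; in CPython they do not speed up CPU-bound scoring, so a real deployment would distribute the tasks across processes or machines.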

Page 25: The Rensselaer IDEA: Data Exploration

DeepQA type approach best on large clusters

(Physical) Simulation runs on supercomputers

Cognitive Computing at Scale

Page 26: The Rensselaer IDEA: Data Exploration

Approach: link these computational models

Surmise (unproven): Cognitive Computing on a fast (large) cluster can query computations run against data generated by simulations (physical or agent-based) on the supercomputer

Page 27: The Rensselaer IDEA: Data Exploration

5. Data services will provide synergy across disciplines

• Semantics is a key technology for common data services

Discovery, Integration, Validation, Curation, Citation, Archiving …

Page 28: The Rensselaer IDEA: Data Exploration

Conclusions

• The “warehouse” is only a small part of the data ecosystem
• Database technologies are only part of the story
• Discovery, Integration, … , validation, explanation are key to solving problems with data
• Closing the loop means “exploring” our data
• Humans are still a key player in this
• The Rensselaer IDEA will explore
• Data-driven applications and tools, but also…
• … multimodal visualization, multiscale and agent modeling, cognitive computing, and semantic data platforms

Page 29: The Rensselaer IDEA: Data Exploration

Rensselaer Institute for Data Exploration and Applications