The Rensselaer IDEA: Data Exploration


The Rensselaer Institute for Data Exploration and Applications is addressing new modes of data exploration and integration to enhance the work of campus researchers (and beyond). This talk outlines the "data exploration" technologies being explored


Data Exploration

Jim Hendler

Director, Rensselaer Institute for Data Exploration and Applications

THE RENSSELAER IDEA

Rensselaer Polytechnic Institute, USA

http://www.cs.rpi.edu/~hendler

IDEA

Data-driven research areas at RPI

• Data-driven Medical and Healthcare Applications
• Predictive Models for Business and Economics
• “Biome” studies for Built and Natural Environments
• Question Answering from texts and data
• Resiliency Models for Population-Scale Problems and cyber-security domains
• Semantically-enabled Data Services for Science and Engineering Research
• Materials genome and nano-manufacturing informatics
• Platforms for testing Policy and Open Data issues
• …

The Rensselaer IDEA: empowering our researchers

Data discovery, integration, and interaction technologies

Application-specific data tools

High Performance Modeling and Simulation
• Center for Computational Innovation

Cognitive Computing
• Watson at Rensselaer IBM Partnership

Perceptualization
• Experimental Multimedia Performing Arts Center

Data Science
• Data Science Research Center

The trunk: Shared Data Technologies

Roots: Data Exploration

Discover

Integrate

Validate

Explain

Geekopedia: Data exploration helps a data consumer focus an information search on the pertinent aspect of relevant data before true analysis can be achieved. In large data sets, data is not gathered or controlled in a focused manner. Even in smaller data sets, it is also true that data gathered are not in a very rigid and specific technique can result in a disorganized manner and a myriad of subsets each…

DATA

Data Exploration Challenges

Discover

Integrate

Validate

Explain

These needs live outside traditional data/info architectures

Discovery needs semantics

How do you find the data you need?

Middle Eastern Terrorists for $800?

Discovery – there’s a lot out there

Discovery needs more than keywords

World Bank: Africa

US Data.gov: Crop

Africover: Agriculture

Kenya: Agricultural
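
The four catalog labels above suggest why an exact keyword match falls short: a search for "agriculture" would miss datasets filed under "Crop" or "Agricultural". The following is a minimal sketch of that point, not a description of any RPI system. The labels come from the slide; the `CONCEPT_OF` table and both function names are hypothetical stand-ins for a real shared vocabulary or ontology.

```python
# Catalog labels as listed on the slide (source -> category label).
CATALOG = {
    "World Bank": "Africa",
    "US Data.gov": "Crop",
    "Africover": "Agriculture",
    "Kenya": "Agricultural",
}

# Hypothetical table mapping surface terms to a shared concept.
# A real system would draw this from a curated vocabulary or ontology.
CONCEPT_OF = {
    "crop": "agriculture",
    "agriculture": "agriculture",
    "agricultural": "agriculture",
    "africa": "africa",
}


def keyword_search(query):
    """Return sources whose label equals the query string (case-insensitive)."""
    q = query.lower()
    return sorted(src for src, label in CATALOG.items() if label.lower() == q)


def concept_search(query):
    """Return sources whose label maps to the same concept as the query."""
    target = CONCEPT_OF.get(query.lower())
    if target is None:
        return []
    return sorted(
        src
        for src, label in CATALOG.items()
        if CONCEPT_OF.get(label.lower()) == target
    )


print(keyword_search("agriculture"))  # only the exact label match
print(concept_search("agriculture"))  # every label mapped to the same concept
```

Under these assumptions the keyword search finds one source, while the concept search also finds the "Crop" and "Agricultural" entries.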

Integration needs Semantics

Person

RIN 660125137

Address # 1118

Address St Pinehurst

Address zip 12203

Course topic CSCI

Course # 4961

Campus Personnel

RPI ID 660125137

Name Hendler

Campus Classes

CRN 1118

Name Intro to Physics

YES

NO!!!!
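
One reading of the records above is that the shared value 660125137 (RIN and RPI ID) is a correct link, while the shared value 1118 (a street number and a CRN) is a coincidence. The sketch below illustrates that reading under stated assumptions: the field names and values come from the slide, but the `semantic_type` labels and the two matching functions are hypothetical and only stand in for real schema annotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    semantic_type: str  # hypothetical annotation; not given on the slide
    value: str


# Values from the slide; the semantic_type labels are assumptions.
person = [
    Field("RIN", "rpi_person_id", "660125137"),
    Field("Address #", "street_number", "1118"),
]
campus_personnel = [Field("RPI ID", "rpi_person_id", "660125137")]
campus_classes = [Field("CRN", "course_section_id", "1118")]


def value_matches(left, right):
    """Pair fields whenever their raw values are equal."""
    return [(a.name, b.name) for a in left for b in right if a.value == b.value]


def semantic_matches(left, right):
    """Pair fields only when both the value and the semantic type agree."""
    return [
        (a.name, b.name)
        for a in left
        for b in right
        if a.value == b.value and a.semantic_type == b.semantic_type
    ]


others = campus_personnel + campus_classes
print(value_matches(person, others))     # includes the accidental 1118 pairing
print(semantic_matches(person, others))  # keeps only the identifier pairing
```

Matching on values alone links the street number to a class; adding the type check keeps the person-identifier link and drops the accidental one.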

Semantic Web and Linked Data (UK)

County Council

Ordnance Survey

Royal Mail

IOGDC Open Data Tutorial

http://logd.tw.rpi.edu

Data Mashups

Validation needs semantics

Easy for us

Hard for machines…

Head to head comparison shows that burglaries in Avon and Somerset (UK) far exceed those in Los Angeles, California

Data + everything else you know

Same or different?

Do the terms mean the same? Are they collected in the same way? Are they processed differently? …

Trends in Smoking Prevalence, Tobacco Policy Coverage and Tobacco Prices (1991-2007)

Validation/Explanation need knowledge

Statistical correlation needs explanation

Explanation also needs Semantics

Inference Web: McGuinness – various DoD/IC projects

Closing the loop: where do the semantics come from?

Data

Prediction

Model

Design

How do we go from the predictive analytics of Big Data to models/explanations that allow new understanding?

1. Better tools for Analytics, Agents and HPC

Make the tools and algorithms being developed by RPI researchers more “reusable” and multitask (including HPC data-analytic tools)

2. Next-Gen Visualization (at scale)

How can multi-modal, multi-user, large-scale sensory (visualization, sonification, haptics) interaction change the way we understand data?

3. Include “agents” in the modeling

Develop technologies that enable researchers to work with “human-based” data at larger scales and in new ways
• Population-scale computing models for agent-based simulations

Approach

Platform: Research in using supercomputers for discrete modeling
• Carothers’ ROSS model

KR Model:
• Weaver’s restricted rules on graphs

Challenge problem:
• Classification algorithms at petaflop scale
• “Logical” (nonlinear, discontinuous) agents

4. Exploit Cognitive Computing

IDEA will be the hub of Rensselaer’s cognitive-computing research
• e.g., answer questions such as “Why” and “How” integrated with large-scale simulations

Watson’s parallel model

Distributed (coarse-grained) parallelism

© Making Watson Fast, IBM J Res and Dev, 3/4 2012

DeepQA type approach best on large clusters

(Physical) Simulation runs on supercomputers

Cognitive Computing at Scale

Approach: link these computational models

Surmise (unproven): Cognitive Computing on a fast (large) cluster can query computations run against data generated by simulations (physical or agent-based) on the supercomputer

5. Data services will provide synergy across disciplines

• Semantics is a key technology for common data services

Discovery, Integration, Validation

Curation, Citation, Archiving …

Conclusions

• The “warehouse” is only a small part of the data ecosystem
• Database technologies are only part of the story
• Discovery, Integration, … , validation, explanation are key to solving problems with data
• Closing the loop means “exploring” our data
• Humans are still a key player in this
• The Rensselaer IDEA will explore
  • Data-driven applications and tools, but also…
  • … multimodal visualization, multiscale and agent modeling, cognitive computing, and semantic data platforms

Rensselaer Institute for Data Exploration and Applications
