The Future of Data Management
or
The Structure of (Computer) Scientific Revolutions
Michael Franklin
UC Berkeley & Amalgamated Insight, Inc.
EECS BEARS Conference - February 2007
The Structure Spectrum
• Structured (schema-first): Relational Database, Formatted Messages
• Semi-Structured (schema-later): XML, Tagged Text/Media
• Unstructured (schema-never): Plain Text, Media
Structured Data Management
A “Modern” View of Data Management
Whither Structured Data?
• Conventional Wisdom: only 20% of data is structured.
• Decreasing due to:
  • Consumer applications
  • Enterprise search
  • Media applications
Structured Data Management
Two reasons why this is where the future is:
• The Data Integration quagmire: The perennial IT problem. Structure provides crucial cues.
• The “Data Industrial Revolution*”: Data used to be hand-crafted, now it’s machine-generated!
* Credit to Prof. Joe Hellerstein for this analogy.
Reason 1: Data Integration
• The ultimate schema-first problem.
• In the future, required for all applications.
• Structure is both an enabler and a key impediment.
[Figure: a Mediated Schema linked by Semantic mappings to five source wrappers. Courtesy of Alon Halevy.]
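The wrapper and mediated-schema arrangement on this slide can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the architecture of any particular system: the two sources, their field names, their records, and the `MEDIATED_FIELDS` schema are all hypothetical, and each wrapper encodes one semantic mapping by hand.

```python
# Hypothetical sketch: two sources with different schemas, each behind a
# wrapper that maps its records onto one mediated schema.
MEDIATED_FIELDS = ("name", "city")  # assumed mediated schema

# Source 1 stores a full name and a city (made-up record).
source_crm = [{"full_name": "Pat Lee", "city": "Oakland"}]
# Source 2 splits the name and calls the city "town" (made-up record).
source_hr = [{"first": "Sam", "last": "Roy", "town": "Berkeley"}]


def wrap_crm(record):
    """Semantic mapping for source 1: rename full_name to name."""
    return {"name": record["full_name"], "city": record["city"]}


def wrap_hr(record):
    """Semantic mapping for source 2: join first/last, rename town to city."""
    return {"name": f"{record['first']} {record['last']}", "city": record["town"]}


def query_mediated(predicate):
    """Answer a query posed against the mediated schema over all sources."""
    wrapped = [wrap_crm(r) for r in source_crm] + [wrap_hr(r) for r in source_hr]
    return [row for row in wrapped if predicate(row)]


berkeley = query_mediated(lambda row: row["city"] == "Berkeley")
print(berkeley)
```

The query never mentions `full_name` or `town`; the structure of the mediated schema is what lets one predicate run over both sources, and writing the mappings is where the integration cost sits.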
Why Structure?
What if you wanted to find out which actors donated to John Kerry’s 2004 presidential campaign…
• Text “Search” can return only what’s been previously “stored”.
What if you wanted to…
• find out the average donation of actors to each candidate?
• compare actor donations this campaign to the last one?
• find out who gave the most to each candidate?
• organize the information by source or age?
A “Deep-Web” Query Approach
SELECT y.name, f.occupation, …
FROM Yahoo_Actors y, FECInfo f
WHERE y.name = f.name
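To make the join concrete, here is a small runnable sketch using Python's built-in `sqlite3` module. Only the table names, the `name` and `occupation` columns, and the join condition come from the slide; the `candidate` and `amount` columns and every row are made-up placeholders, not real donation data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Yahoo_Actors (name TEXT);
    -- candidate and amount are assumed columns, not taken from the slide
    CREATE TABLE FECInfo (name TEXT, occupation TEXT, candidate TEXT, amount REAL);
    INSERT INTO Yahoo_Actors VALUES ('Actor A'), ('Actor B');
    INSERT INTO FECInfo VALUES
        ('Actor A',  'Actor',    'Candidate X', 1000.0),
        ('Actor B',  'Actor',    'Candidate X', 500.0),
        ('Actor B',  'Actor',    'Candidate Y', 250.0),
        ('Person C', 'Engineer', 'Candidate X', 2000.0);
""")

# The join from the slide: actors who also appear in the donation records.
donors = conn.execute("""
    SELECT y.name, f.occupation, f.candidate, f.amount
    FROM Yahoo_Actors y, FECInfo f
    WHERE y.name = f.name
    ORDER BY y.name, f.candidate
""").fetchall()

# One of the follow-up questions: average actor donation to each candidate.
averages = conn.execute("""
    SELECT f.candidate, AVG(f.amount)
    FROM Yahoo_Actors y, FECInfo f
    WHERE y.name = f.name
    GROUP BY f.candidate
    ORDER BY f.candidate
""").fetchall()

print(donors)
print(averages)
```

Once both sources are exposed as tables, the follow-up questions become small variations on the same query. The join on `name` is also the weak point: two different people with the same name would match each other.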
Did it Work?
What’s Missing?
• Common Schema
• Any Schema
• Strong Identifiers (keys)
• Data Independence
• Metadata
• Consistency Guarantees
• Access Control
The Fundamental Tradeoff
[Figure: Functionality plotted against Time (and cost) for three approaches: Structured (schema-first), Semi-Structured (schema-later), and Unstructured (schema-less).]
Structure enables computers to help users manipulate and maintain the data.
“Flexible” Structure: Dataspaces*
• Deal with all the data from an enterprise – in whatever form
• Data co-existence: no integrated schema, no single warehouse
• Pay-as-you-go services
  • Keyword search is bare minimum.
  • Data manipulation and increased consistency as you add work.
* “From Databases to Dataspaces: A New Abstraction for Information Management”, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
Databases vs. Dataspaces
Databases
• Single Schema
• Centralized Administration
• Structured Query
• Strict Integrity Constraints
Dataspaces
• Data Coexistence
• Autonomous Sources
• Search, Browse, Approximate Answer, Structured Query
• Best Effort Guarantees
The World of Dataspaces
[Figure: systems placed along two axes, Semantic Integration (High to Low) and Administrative Proximity (Near to Far). Systems shown: DBMS, Federated DBMS, Desktop Search, Virtual Organization, Web Search.]
DataSpace Technology
• Probabilistic Databases
• Schema Matching
• Judicious use of User Input
• Approx. Query Answering
• Probabilistic Reasoning
• Uncertainty Management
• Data Model Learning
• Structured & Unstructured Search
Reason 2: Data Industrial Revolution
Bell’s Law: Every decade, a new, lower cost, class of computers emerges, defined by platform, interface, and interconnect
• Mainframes 1960s
• Minicomputers 1970s
• Microcomputers/PCs 1980s
• Web-based computing 1990s
• Devices (Cell phones, PDAs, wireless sensors, RFID) 2000s
Enabling a new generation of applications for Operational Visibility, monitoring, and alerting.
Data Streams Data Flood
[Figure: data sources: Clickstream, Barcodes, PoS System, Sensors, RFID, Telematics, Inventory, Phones, Transactional Systems.]
• Exponential data growth
• New challenges: continuous, inter-connected, distributed, physical
• Shrinking business cycles
• More complex decisions
Device Data Management
• Devices generate streams of structured data.
• Wide-spread deployment will lead to huge data volumes.
• Can we develop the right infrastructure to support large-scale data streaming apps?
• Can we incorporate devices into existing (legacy) IT infrastructure?
High Fan In Systems*
• A data management infrastructure for large-scale data streaming environments.
• Uniform Declarative Framework
  • Every node is a SQL data stream processor: stream-oriented queries at all levels
• Hierarchical, stream-based views as an organizing principle.
  • Can impose a “view” over messy devices.
*Design Considerations for High Fan In Systems - The HiFi Approach; CIDR 2005
HiFi - Taming the Data Flood
[Figure: the HiFi hierarchy, with levels Receptors; Dock doors, Shelves; Warehouses, Stores; Regional Centers; Headquarters.]
Hierarchical Aggregation
• Spatial
• Temporal
In-network Stream Query Processing and Storage
Fast Data Path vs. Slow Data Path
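Hierarchical aggregation along both dimensions can be sketched as a function applied once per level of the hierarchy. This is a minimal illustration rather than HiFi's implementation: the topology maps, the window sizes, and the readings are hypothetical. Each level buckets timestamps into windows (temporal) and replaces node ids with their parent's id (spatial).

```python
from collections import defaultdict

# Hypothetical topology: receptor -> shelf -> store.
RECEPTOR_TO_SHELF = {"r1": "shelf1", "r2": "shelf1", "r3": "shelf2"}
SHELF_TO_STORE = {"shelf1": "storeA", "shelf2": "storeA"}

# Made-up readings: (timestamp in seconds, receptor_id, tag_id).
readings = [(0, "r1", "t1"), (1, "r2", "t1"), (2, "r3", "t2"), (6, "r1", "t3")]


def aggregate(events, parent_of, window):
    """Group events by (time window, parent node), collecting distinct tags.

    Temporal aggregation: timestamps are bucketed into `window`-second windows.
    Spatial aggregation: each node id is replaced by its parent's id.
    """
    groups = defaultdict(set)
    for ts, node, tag in events:
        groups[(ts // window * window, parent_of[node])].add(tag)
    # Re-emit one event per distinct tag so the next level can consume it.
    return sorted((start, parent, tag)
                  for (start, parent), tags in groups.items() for tag in tags)


shelf_level = aggregate(readings, RECEPTOR_TO_SHELF, window=5)
store_level = aggregate(shelf_level, SHELF_TO_STORE, window=10)
print(shelf_level)
print(store_level)
```

The output of one level has the same shape as its input, so the same operator can be stacked from receptors up to headquarters, with coarser windows at higher levels.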
“Virtual Device (VICE) API”
The VICE API is a natural place to hide much of the complexity arising from physical devices.
VICE: Virtual Device Interface [Jeffery et al., Pervasive 2006, VLDBJ 07]
Device Issues: example
Shelf RFID Test - Ground Truth
Actual RFID Readings
“Restock every time inventory goes below 5”
Query-based Data Cleaning
Point
Smooth
CREATE VIEW smoothed_rfid_stream AS
(SELECT receptor_id, tag_id
 FROM cleaned_rfid_stream [range by '5 sec', slide by '5 sec']
 GROUP BY receptor_id, tag_id
 HAVING count(*) >= count_T)
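The Smooth view can be mimicked in plain Python to show what it computes. This is a sketch under assumptions, not the stream engine's semantics in full: because range and slide are both 5 seconds, the windows are treated as non-overlapping; the value 2 for `count_T` and the readings are made up.

```python
from collections import Counter

WINDOW_SECONDS = 5  # matches the range and slide of '5 sec' in the view
COUNT_T = 2         # assumed value for the count_T threshold


def smooth(readings, window=WINDOW_SECONDS, count_t=COUNT_T):
    """Keep (receptor_id, tag_id) pairs seen at least count_t times per window.

    `readings` is an iterable of (timestamp_seconds, receptor_id, tag_id).
    Returns a sorted list of (window_start, receptor_id, tag_id).
    """
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        window_start = int(ts // window) * window
        counts[(window_start, receptor_id, tag_id)] += 1
    return sorted(key for key, n in counts.items() if n >= count_t)


# Made-up readings: tag "a" is read steadily, tag "b" only once (a stray read).
cleaned = [(0.5, "shelf0", "a"), (1.5, "shelf0", "a"), (2.0, "shelf0", "b"),
           (5.5, "shelf0", "a"), (7.0, "shelf0", "a"), (9.0, "shelf0", "a")]
smoothed = smooth(cleaned)
print(smoothed)
```

The stray read of tag "b" is dropped, and tag "a" is reported once per window however many times it was read.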
Query-based Data Cleaning
Point
Smooth
Arbitrate
CREATE VIEW arbitrated_rfid_stream AS
(SELECT receptor_id, tag_id
 FROM smoothed_rfid_stream rs [range by '5 sec', slide by '5 sec']
 GROUP BY receptor_id, tag_id
 HAVING count(*) >= ALL
   (SELECT count(*)
    FROM smoothed_rfid_stream [range by '5 sec', slide by '5 sec']
    WHERE tag_id = rs.tag_id
    GROUP BY receptor_id))
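The Arbitrate step resolves a tag that several receptors report in the same window by keeping the receptor that saw it most often. A minimal Python sketch of that logic follows, again assuming non-overlapping 5-second windows and using made-up readings; the function name `arbitrate` is illustrative only.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 5  # matches the range and slide of '5 sec' in the view


def arbitrate(readings, window=WINDOW_SECONDS):
    """Assign each tag to the receptor(s) that read it most in each window.

    `readings` is an iterable of (timestamp_seconds, receptor_id, tag_id).
    Returns a sorted list of (window_start, receptor_id, tag_id).
    """
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        window_start = int(ts // window) * window
        counts[(window_start, tag_id, receptor_id)] += 1

    # Highest per-receptor count for each (window, tag): the ">= ALL" test.
    best = defaultdict(int)
    for (window_start, tag_id, _), n in counts.items():
        best[(window_start, tag_id)] = max(best[(window_start, tag_id)], n)

    return sorted((w, receptor_id, tag_id)
                  for (w, tag_id, receptor_id), n in counts.items()
                  if n == best[(w, tag_id)])


# Made-up input: tag "a" is seen twice by shelf0 and once by shelf1.
smoothed = [(0, "shelf0", "a"), (1, "shelf0", "a"), (2, "shelf1", "a"),
            (3, "shelf1", "b")]
arbitrated = arbitrate(smoothed)
print(arbitrated)
```

As with `>= ALL` in the view, a tie keeps every receptor that shares the top count, so a tag can still appear on two shelves when the evidence is evenly split.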
After Query-based Cleaning
“Restock every time inventory goes below 5”
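The restocking rule is itself a small continuous query over the cleaned stream: count the distinct tags on the shelf in each window and raise an alert when the count falls below 5. The sketch below is a hypothetical illustration; the 5-second window, the function name, and the readings are assumptions.

```python
RESTOCK_THRESHOLD = 5  # from the rule "inventory goes below 5"
WINDOW_SECONDS = 5     # assumed window, matching the cleaning views


def restock_alerts(readings, window=WINDOW_SECONDS, threshold=RESTOCK_THRESHOLD):
    """Return the start times of windows whose distinct-tag count is too low.

    `readings` is an iterable of (timestamp_seconds, tag_id). Windows with
    no readings at all are not reported in this sketch.
    """
    tags_per_window = {}
    for ts, tag_id in readings:
        window_start = int(ts // window) * window
        tags_per_window.setdefault(window_start, set()).add(tag_id)
    return sorted(start for start, tags in tags_per_window.items()
                  if len(tags) < threshold)


# Made-up shelf: six items in the first window, four in the second.
stream = [(0, f"tag{i}") for i in range(6)] + [(5, f"tag{i}") for i in range(4)]
alerts = restock_alerts(stream)
print(alerts)
```

A rule like this is only as good as its input: readings that miss tags already on the shelf would push the count below the threshold and fire the alert when no restocking is needed.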
SQL Abstraction Makes it Easy
• “Soft Sensors”
• Quality and lineage
• Optimization (power, etc.)
• Pushdown of external validation information
• Data archiving
• Imperative processing
• …
Amalgamated Insight: The Company
[Figure: product landscape. Axes: Complexity, Performance. Labels: Centralized, Distributed, Event-Driven, Query-Driven, Next-Generation Business Intelligence. Database/Data Warehouse Products: RDBMS, Data Warehouse, Appliance, In-Memory Accelerators. Data Analytics Products: Reporting, Analysis, Predictive Analytics, Data Mining, “Operational” BI/BAM.]
Stream Query Processing is the Key
• Visibility: “What’s happening now?”
• Notification: “Tell me when something happens.”
• Learning: “Why is it happening and how to improve it?”
• Intelligent Action: “Automatically react when things happen.”
[Figure labels: Integrated Event Handling and Alerting; Interfaces to Operational Systems; Drill Down, Replay, Reports.]
Company Overview
• Founded November 2005
• Headquarters in Foster City, CA
• Series A Financing: May 2006
• 10 Employees (and growing!)
Technology
• Breakthrough technology for stream query processing
• Proven software base – leveraging open source platform
• Used in demanding high-volume networked applications
Key Team Members
• Boyd Pearce, President and CEO
• Michael Franklin, Ph.D., CTO
• Michael Trigg, EVP, Marketing
• Sailesh Krishnamurthy, Ph.D., Chief Architect
• Robert Krauss, VP, Business Development
Conclusions
• Structured data increasingly important.
  • In fact, there will be lots more of it.
  • and it must be processed as fast as it is created.
• Traditional (structured) database technology is not up to the task.
• Great opportunities for innovation.
  • HiFi, Dataspaces (and Amalgamated Insight!) are examples.
http://www.cs.berkeley.edu/~franklin