rod smith (rod.smith@us.ibm.com)
© 2006 IBM Corporation
Enabling ad-hoc
Analytic Apps
with Hadoop
Friday, October 2, 2009
IBM Software Group, October 2009, SWG Emerging Internet Technology
Hadoop World ’09
Emerging Technology - What do we work on?
Making Hadoop accessible to business professionals
New Intelligence - Big Data
Nearly 15 petabytes of data are created every day, eight times more than the information in all the libraries in the U.S.
Volume of data in enterprises is doubling approximately every 3 years (Forrester Research)
• Includes structured and unstructured data, excludes rich media
The cost to find, collect & analyze data is decreasing significantly as web innovation proceeds
Content is untapped value for business insights & intelligence
New Intelligence - New Class of Application on Horizon?
Explore
Extract
Gather
Internet Evolution: A web of data sources, services for exploring & manipulating data, and ways that users can connect them together (Tom Coates/Yahoo™)
Enterprises recognizing potential of leveraging the broader web for business intelligence coverage - as well as for internal data
Next wave of content-centric webApps emerging
• Long(er) running data collection & analytic applications
Friday, October 2, 2009
IBM Software GroupOctober 2009 SWG Emerging Internet Technology
Hadoop World ’09
New Intelligence - New Class of Application on Horizon?
Hear business users asking for the ability to directly manipulate, analyze & remix massive data sources & services
• LOB “… Google whetted my appetite...I want more customizable analytics with me in the driver's seat…”
Leveraging easy-to-use, rich data manipulation metaphors like spreadsheets, etc.
Rich visualizations to quickly identify insights
Rich Spectrum
DIY Analytic Applications
Emerging
Let's Talk Customer Scenarios - BBC
BBC Digital Democracy Project: Achieving Increased Government Transparency
Web Content To Gather:
• UK Parliament Web Site
• Timeframe: 10+ years
Business Questions
• Name names: Who is doing what, who isn't doing what
• Overlay voting record with demographic & voting records over time
• Buzz - what are people talking about?
• Visualize content relationships
Knowledge of Interest:
• Members of Parliament (MPs)
• Bills, Debates, Voting Districts
Let's Talk Customer Scenarios - Thomson Reuters
Enrich Trader's Desktop Enhancement
Timely aggregation & analytics of content originating from public internet sites
Web Content To Gather:
• ~118 3rd Party Financial News Services and Blogs, including: BBC, CNN, Yahoo News, Financial Times, NY Times, The Big Picture, Fox News, PR Newswire, Market Watch, World Press, Forbes, Google News, Wall Street Journal, MSNBC, The Sun, ZDNet
Business Questions
• NewsBuzz: What are the headlines? What are not the headlines but still in focus?
• OpinionMonitor: Who is saying what? What are the debate topics?
• NewsTimeline: Chronology (pulse) of headline news?
• TopicCloud: Tag based topic metrics
• IssueAnalytics: Link backs to semantically related news
Knowledge of Interest:
• People, places, events
Scenario
• Gather unstructured data from anywhere between 200 and 2000 data sources - every 15 minutes
• Perform preprocessing (search, transform, index) over each source
• Publish harvested content for distributed content services and downstream Mashups
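The gather / preprocess / publish scenario above can be sketched as one harvest pass. This is a minimal illustration, not the deployed system: `HarvestedItem`, `preprocess`, `harvest_once`, `build_index`, and the injected `fetch` callable are all hypothetical names, and the "transform" and "index" steps are reduced to whitespace normalization and a simple term index.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# Assumption taken from the scenario: one harvest pass every 15 minutes.
INTERVAL_SECONDS = 15 * 60


@dataclass
class HarvestedItem:
    source: str
    text: str          # transformed (whitespace-normalized) content
    terms: List[str]   # sorted, lowercase index terms


def preprocess(source: str, raw: str) -> HarvestedItem:
    """Transform raw content and derive index terms (a stand-in for real analytics)."""
    text = " ".join(raw.split())
    terms = sorted({w.strip(".,;:!?\"'()").lower() for w in text.split()} - {""})
    return HarvestedItem(source, text, terms)


def harvest_once(sources: Iterable[str],
                 fetch: Callable[[str], str]) -> List[HarvestedItem]:
    """Fetch every source once; a failing source is skipped, not fatal."""
    items = []
    for source in sources:
        try:
            raw = fetch(source)
        except Exception:
            continue
        items.append(preprocess(source, raw))
    return items


def build_index(items: Iterable[HarvestedItem]) -> Dict[str, List[str]]:
    """Publishable inverted index: term -> sources that mention it."""
    index: Dict[str, List[str]] = {}
    for item in items:
        for term in item.terms:
            index.setdefault(term, []).append(item.source)
    return index
```

In a real deployment a scheduler would call `harvest_once` every `INTERVAL_SECONDS` and hand the result to the downstream content services; passing `fetch` in as a parameter keeps the pass testable without network access.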
IBM Emerging Technology Project: M2
What is it?
An insight engine for enabling ad-hoc business insights for business users - at web scale
How does it work?
Discovery Process
1. Point M2 to data sources of interest
• unstructured web data, feeds, XML, etc.
2. Transform data into a form that can be analyzed
• Unstructured data becomes semi-structured data
• Example: name: Rod Smith, employer: IBM, state: GA
• Apply analytics - enriching the data
3. “What if tooling” - browser-based visual front end - spreadsheet metaphor to create worksheets for exploring/visualizing the data
What's different?
• Unlocking insights embedded in unstructured data
• Analyzing data previously unavailable to analyze
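To make the "unstructured data becomes semi-structured data" step concrete, here is a minimal sketch that turns one sentence into the name / employer / state record used as the example above. The slides do not describe how M2 performs extraction; the regular expression, the sentence shape it expects, and the function name `extract_record` are assumptions for illustration only.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for sentences shaped like "Rod Smith works at IBM in GA."
PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+) works at "
    r"(?P<employer>[A-Z][A-Za-z]*) in (?P<state>[A-Z]{2})\b"
)


def extract_record(text: str) -> Optional[Dict[str, str]]:
    """Return a semi-structured record, or None when the text does not match."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None


record = extract_record("Rod Smith works at IBM in GA.")
# record == {"name": "Rod Smith", "employer": "IBM", "state": "GA"}
```

Once text is in this keyed form, it can be loaded into worksheet-style tooling and enriched with further analytics.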
M2 -> Demo
Project: Improve IP Portfolio Analysis for Mergers & Acquisitions
“...please collect all US Patent filings… then let’s do…”
Web Content To Gather:
• Gathered 1.4m patent docs from USPTO
• 1991-2007 case records from the United States Court of Appeals for the Federal Circuit (CAFC)
Business Questions
• How much is a target company worth?
• What are the high-value areas of their portfolio?
• Explored cited patent topics, litigated patents
Knowledge of Interest:
• Patents ranked by citation - e.g. how often a patent was referenced determines its value
• Corporate genealogies: IP ownership roll-up
• Augment analysis with items affecting IP value, inventor affiliation, citation rank by time
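Ranking patents by how often they are cited can be sketched in a few lines. This is an illustration under stated assumptions, not the demo's implementation: the input is taken to be (citing, cited) pairs, the patent identifiers are made up, and `rank_by_citations` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_by_citations(citations: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Count how often each patent is cited and sort by descending count.

    Ties are broken by patent id so the ordering is deterministic.
    """
    counts = Counter(cited for _citing, cited in citations)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up (citing, cited) pairs, for illustration only.
pairs = [("P3", "P1"), ("P4", "P1"), ("P4", "P2"), ("P5", "P2"), ("P5", "P1")]
ranking = rank_by_citations(pairs)
# ranking == [("P1", 3), ("P2", 2)]
```

Rolling the counts up by owning company, or bucketing them by year, would follow the same counting pattern with a different key.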
What's Under the Covers: Hadoop
Emergence of map/reduce programming model for a new class of webApp
Hadoop: provides a framework for large scale parallel processing map/reduce apps (Apache projects led by Yahoo)
• Offers simplicity of “programming” - looks like a simple single-threaded app model for developers
• Handles big data - scalable storage across machine clusters (think read-only file system)
• Deployment: no application knowledge of runtime or OS or cloud necessary
• Today - setting up and coding Hadoop jobs in Java, etc. is the domain of skilled Java engineers
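The map/reduce model mentioned above can be illustrated with a word count run entirely in memory. This sketch does not use the Hadoop API; `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical names that only mirror the three stages a framework like Hadoop distributes across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially on one machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(["big data", "Big insights"])
# counts == {"big": 2, "data": 1, "insights": 1}
```

The developer writes only the map and reduce functions as ordinary single-threaded code; partitioning the input, moving grouped data between machines, and retrying failures are left to the framework.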
IBM Emerging Technology Project: M2 Architectural Components
Expanding upon the Hadoop stack
• Visual tooling builds extensively on Pig
M2 Architecture Characteristics:
• Extensible via UDFs
• REST API for customer choice of analytic service/engine
• REST API for choice of visualization packages
• Export content as feeds, XML, etc.
• ...more to come
Conclusions
In God we trust
…all others bring data
Enterprises quickly evolving their thinking from a Database strategy to a Data Strategy encompassing unstructured & structured content
Repeatable business patterns in a broad range of industries emerging
Hadoop has potential to be the platform for a broad range of solutions from web-based analytics -> business event processing -> collaboration
Almost The End
Selecting customer proof of concept projects
www-01.ibm.com/software/ebusiness/jstart/about.html
INTERESTED?