Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Company Confidential – Do Not Distribute 1
Two Goals…
HPC Systems
Breakthrough supercomputing performance and
scalability
Storage & Data Management
Scale-out, manageable high
performance storage systems
YarcData
Solutions for Big Data analytics
Supercomputing Market Leadership Grow Profitably
…Three Business Units
Created a new division within Cray
100% focused on Big Data solutions
Rapidly-growing, multi-billion market
Hired experienced management team with deep Enterprise experience
Company Confidential – Do Not Distribute 4
“Needle in a Haystack” Needle in a Needlestack!
Paradigm: SEARCH Paradigm: DISCOVERY
Discovery is about finding QUESTIONS as well as finding the answers to ques?ons
Company Confidential – Do Not Distribute 5
G00232737
YarcData's uRiKA Shows Big Data Is MoreThan Hadoop and Data WarehousesPublished: 11 September 2012
Analyst(s): Carl Claunch
The hype about big data is mostly on Hadoop or data warehouses, but bigdata involves a much wider and varied set of needs, practices andtechnologies. We offer recommendations for IT organizations seeking asolution to "graph" problems, including use of the uRiKA graph appliance.
ImpactsŶ IT organizations faced with previously infeasible graph-style discovery problems may succeed
using a focused solution like uRiKA.
Ŷ To address all their data requirements, IT organizations may be forced to duplicate databetween systems such as uRiKA and transactional systems.
RecommendationsŶ Select candidates to place on uRiKA where processing is graph-oriented, the data is large-
scale, and discovery of complex relationships is a core focus.
Ŷ Validate the appropriateness of specialized systems and achievability of performance targets byproof-of-concept and pilot tests.
Ŷ Carefully define the volume and performance requirements for processing data.
Ŷ If requirements can be met using fewer machine types and the economics make sense, use thefewest platform types possible.
AnalysisTo listen to the hype, you might think big data is only about Hadoop, but Gartner deems big data tocomprise a wider, more varied set of needs, practices and technologies. The uRiKA graph appliancefrom YarcData, a new company spinoff of Cray, illustrates our point and is used here to highlightjust one of the classes of big data problems that are poorly addressed by traditional systems.
G00232737
YarcData's uRiKA Shows Big Data Is MoreThan Hadoop and Data WarehousesPublished: 11 September 2012
Analyst(s): Carl Claunch
The hype about big data is mostly on Hadoop or data warehouses, but bigdata involves a much wider and varied set of needs, practices andtechnologies. We offer recommendations for IT organizations seeking asolution to "graph" problems, including use of the uRiKA graph appliance.
ImpactsŶ IT organizations faced with previously infeasible graph-style discovery problems may succeed
using a focused solution like uRiKA.
Ŷ To address all their data requirements, IT organizations may be forced to duplicate databetween systems such as uRiKA and transactional systems.
RecommendationsŶ Select candidates to place on uRiKA where processing is graph-oriented, the data is large-
scale, and discovery of complex relationships is a core focus.
Ŷ Validate the appropriateness of specialized systems and achievability of performance targets byproof-of-concept and pilot tests.
Ŷ Carefully define the volume and performance requirements for processing data.
Ŷ If requirements can be met using fewer machine types and the economics make sense, use thefewest platform types possible.
AnalysisTo listen to the hype, you might think big data is only about Hadoop, but Gartner deems big data tocomprise a wider, more varied set of needs, practices and technologies. The uRiKA graph appliancefrom YarcData, a new company spinoff of Cray, illustrates our point and is used here to highlightjust one of the classes of big data problems that are poorly addressed by traditional systems.
G00232737
YarcData's uRiKA Shows Big Data Is MoreThan Hadoop and Data WarehousesPublished: 11 September 2012
Analyst(s): Carl Claunch
The hype about big data is mostly on Hadoop or data warehouses, but bigdata involves a much wider and varied set of needs, practices andtechnologies. We offer recommendations for IT organizations seeking asolution to "graph" problems, including use of the uRiKA graph appliance.
ImpactsŶ IT organizations faced with previously infeasible graph-style discovery problems may succeed
using a focused solution like uRiKA.
Ŷ To address all their data requirements, IT organizations may be forced to duplicate databetween systems such as uRiKA and transactional systems.
RecommendationsŶ Select candidates to place on uRiKA where processing is graph-oriented, the data is large-
scale, and discovery of complex relationships is a core focus.
Ŷ Validate the appropriateness of specialized systems and achievability of performance targets byproof-of-concept and pilot tests.
Ŷ Carefully define the volume and performance requirements for processing data.
Ŷ If requirements can be met using fewer machine types and the economics make sense, use thefewest platform types possible.
AnalysisTo listen to the hype, you might think big data is only about Hadoop, but Gartner deems big data tocomprise a wider, more varied set of needs, practices and technologies. The uRiKA graph appliancefrom YarcData, a new company spinoff of Cray, illustrates our point and is used here to highlightjust one of the classes of big data problems that are poorly addressed by traditional systems.
Source: Gartner
Company Confidential – Do Not Distribute 6
Figure 3 | The network of the top borrowers. Nodes represent financial institutions (selected as explained in the text). Outgoing links represent theestimated potential impact of an institution to another one (see Methods). The nodes are positioned within a circle of radius 1, centred in 0. The distanceof each node from the center is 1{R D! "i , while the angle increases linearly with R(D) from 0 to 10p. Thus, the closer a node is to the center the higher is itsDebtRank (the intuition here is its centrality). A node in the center (DebtRank 5 1) is able to put under distress the entire economic value of the network.DebtRank decreases by moving outwards and leftwards along the spiral. The diagram allows at the same time to visualise the structure of the network andto compare the importance of any two given nodes. The size and the color of the node reflects the DebtRank value (larger and red nodes have higherDebtRank). The color of a link reflects the DebtRank of the node from which it originates (red links originate from node with high DebtRank and makehigh impact to the destination nodes). (a) Period one, at the beginning of the crisis (see also inset of Fig. 2a). Most of the nodes have low levels ofDebtRank, i.e., they are located close to the border. (b) Period four (peak of the crisis). Nodes have comparable levels of Debt Rank. However, they are alsomuch more central, i.e., they can impact a large fraction of the total economic value. A single default is likely to trigger a systemic failure.
Figure 4 | Debt Rank, asset size and fragility. Scatter plot of DebtRank versus asset size, measured as a fraction (in %) of the total of the asset size in thenetwork. For sake of simplicity, in the experiment, asset size was assumed constant during the time span of the data. Notice that institutions such as UBS,or CITIGROUP alone account for almost 10% of the total assets. The size of each bubble is proportional to the outstanding debt of the institution whilethe color reflects its fragility, defined as the ratio of debt over market capitalization in the given period, as in the previous section. (a) Period one. Since theoutstanding debt was very low or zero, most nodes appear small and have levels of DebtRank below 0.3, but comparable among each other. (b) Periodfour. Many institutions have a Debt Rank larger then 0.5, i.e. each can impact, alone, the majority of the economic value in the network. The outstandingdebt in this period is close to the peak for all the institutions, as reflected by the size of the bubbles. Notice, also a higher fragility, most bubbles are red,although with some heterogeneity.
www.nature.com/scientificreports
SCIENTIFIC REPORTS | 2 : 541 | DOI: 10.1038/srep00541 4
Company Confidential – Do Not Distribute 7
Telecom/Mobile
Life Sciences/Biology
Social Networking
Supply Chain
Healthcare/Medicine
Intelligence/Security
Internet/WWW Finance
Company Confidential – Do Not Distribute 8
Common Theme: DISCOVERING RELATIONSHIPs spanning large, fragmented data sets, and doing so at an
increasing speed
Domain
Risk
Compliance
Customer Insight
Cyber Security
Current
Departmental Computa?ons & aggrega?ons
Recogni?on of known behaviors
Profitability, reten?on and marke?ng
Siloed examina?on of network incidents
Emerging
Risk as a func?on of the connec+ons between
counterpar?es
Proac?vely discover new pa4erns of suspicious
behavior
Follow connec+ons across customer social networks to
drive revenue
Find hidden threats through a connected view of network ac?vity
Company Confidential – Do Not Distribute 9
No ACID
No SQL
Key Value
Column Oriented
Relational Extensions
RDBMS
Document Stores In Memory
Company Confidential – Do Not Distribute 10
Graphs are hard to Partition
High cost to follow relationships that
span Cluster Nodes
Graphs are not Predictable Graphs are highly Dynamic
High cost to follow multiple competing paths
which cannot be pre-fetched/cached
High cost to load multiple, constantly changing
datasets into in-memory graph models
?
*Source: Hennessy, J. and Patterson, D., “Computer Architecture: A Quantitative Approach”, 2012 edition
Company Confidential – Do Not Distribute 11
URiKA Advantages Business Challenge
11
Massively Mul?-‐threaded 128 Hardware Threads/
processor Up to 8,192 Processors
Large Shared Memory Up to 512 TB
Highly Scalable I/O Up to 350 TB/hr
Graphs are hard to Partition
Graphs are not Predictable
Graphs are highly Dynamic
Real-time, Interactive
Analytics on Big Data
Graph Problems
Company Confidential – Do Not Distribute 12
12
Linux Apache Tomcat/Jena
� Data Center sees another Linux server
� Applica+ons see industry standard interface � RDF, SPARQL, Java,
Gadgets… � Reuse Exis+ng Skillsets
� Java, OSGI, App Server, SOA, ESB, Web toolkit…
� No Lock-‐in � All applica?ons and ar?facts
built on uRiKA can be run on other pla_orms
� Subscrip+on Pricing Model
SPARQL RDF Java Gadgets /Mashups
Graph Analy+c PlaJorm Shared-‐memory, Mul?-‐threaded, Scalable I/O
graph-‐op?mized hardware
Graph Analy+c Database RDF datastore and SPARQL engine
Graph Analy+c Tools Management, Security, Data Pipeline
Company Confidential – Do Not Distribute 13
� Install in your Data Center like any other purchased equipment � … But pay a subscrip?on fee for use of the box instead of purchasing the box up front
� Incrementally add processors and memory as use-‐cases expand
� License for as li4le as a one-‐year term � Enables firms to prove out the value incrementally
Company Confidential – Do Not Distribute 14
EDW
Customer Mart
Risk Mart
Finance Mart
Compliance Mart
Core Systems
Ops Data
Emails
Social Media
Stru
ctur
ed
Uns
truc
ture
d
External Internal
Data Gateways
Market Data
News Feeds
Web Logs
Incident Reports
Call Data Records
Netflow Data
IDS Logs
Departmental DWs
Regulatory Capital
Average Balance
Rela?onship Depth
Regulatory Capital
Product Profitability
BI/DW Infrastructure Data Sources Analytical Outputs
Paths to Fraud Developing plots Developing cyber aaacks Impac_ul market events Ul?mate counterparty exposure
Graph Analytics
Company Confidential – Do Not Distribute 15
Load any type of
data source
Hypothesize with partial information
Validate with existing knowledge/data
“In the amount of -me it takes to validate one hypothesis, we can now validate 1,000 hypotheses – massively improving our success rate and systema-zing serendipity.“ (YarcData Govt Customer )
Entire dataset
Unknown Relationships
Fast real time response
Company Confidential – Do Not Distribute 16
• Stores informa+on in the following format – an RDF Triple
• Each RDF triple is a pair of ver+ces and an edge in a directed graph:
subject predicate object
David Jones WorksFor Bank of Nebraska LLC
David Jones
Bank of Nebraska
LLC
WorksFor
Slide 16
Company Confidential – Do Not Distribute 17
Company Confidential – Do Not Distribute 18
� Goal: Proactively identify patterns of activity and threat candidates by aggregating intelligence and analysis
� Data sets: Reference data, People, Places, Things, Organizations, Communications…
� Technical Challenges: Volume, Variety and Velocity of data; Inaccurate, incomplete and falsified data
� Users: Intelligence Analysts � Usage model: Search for patterns of
activity and graphically explore relationships between entities for candidate behavior and activities
� Augmenting: Existing Hadoop cluster and multiple data appliances
Concept
Fertilizer Factory Transport
Company Confidential – Do Not Distribute 19
Discover “similar” Pa+ents to op+mize treatment
The Challenge � Longitudinal, historical data spanning all events, symptoms,
diagnoses, diseases, treatments, prescrip?ons, etc of 10M pa?ents including gene?cs and family history
� Ad-‐hoc, constantly changing defini?on of “similarity” based on thousands of parameters
� Interac?ve, real-‐?me response during consulta?on
uRiKA Solu+on � uRiKA holds en?re rela?onship graph in memory
� Iden?fy “similar” pa?ents based on ad-‐hoc physician specified paaerns
� Interac?ve, real-‐?me access by en?re physician community
Business Value � Consistent selec?on of the most effec?ve treatment for
each pa?ent by each doctor every ?me
Blood pressure Prior myocardial infarction Hypertension Body mass index HDL Anti-hypertension meds
√ √
√ √
…
Patient: Jean Generic
Company Confidential – Do Not Distribute 20
Quickly adapt surveillance to new rules and pa4erns
The Challenge � Con?nuous stream of possible compliance
events
� New compliance rules require constant updates of signals and paaerns
� Current solu?ons require offline prepara?on and are based on rigid rules
uRiKA Solu+on � En?re rela?onship graph in memory
� New Paaerns/templates can be iden?fied and added in real-‐?me
� Graphical interac?ve explora?on of rela?onships between people, places, things, organiza?ons, communica?ons, etc.
Business Value � A responsive, flexible event detec?on
pla_orm that adapts to new knowledge
Figure Source: Graph-based technologies for intelligence analysis,T. Coffman, S. Greenblatt, S. Marcus, Commun. ACM, 47(3):45-47, 2004
Company Confidential – Do Not Distribute 21
Monitor, analyze complex opera+onal processes to reduce risk
The Challenge � In large ins?tu?ons, opera?ons (eg. Trade Lifecycle)
can be complex, inter-‐connected processes involving people, technology and process steps
� Mul?ple data collec?on points exist, but hard to combine to get a holis?c picture
� Hard to monitor if cri?cal controls are effec?ve
uRiKA Solu+on � Ability to hold en?re ‘opera?onal model’ in
memory
� Can seamlessly add addi?onal incremental data sources relevant to analysis
� Interac?ve, real-‐?me querying capability allows ‘constant’ monitoring of opera?ons
Business Value � Beaer control of opera?onal risk by combining
disparate, disconnected instrumenta?on data for a complex opera?on
Company Confidential – Do Not Distribute 22
Understand financial risk emerging from connec+ons in a financial system
The Challenge � The financial crisis illustrated that counterparty risk is a
func?on of the network of counterpar?es, not just each in isola?on
� Eg. Ford, its suppliers, employees are all counterpar?es with interconnected exposures to a lender like BofA
uRiKA Solu+on � Ability to hold en?re dependency graph of posi?ons in
memory
� Founda?on to run complex graph analy?c algorithms without losing interac?vity
Business Value � Ability to gain addi?onal perspec?ve on complex
counterparty risk in addi?on to exis?ng procedures for risk calcula?ons
Figure 3 | The network of the top borrowers. Nodes represent financial institutions (selected as explained in the text). Outgoing links represent theestimated potential impact of an institution to another one (see Methods). The nodes are positioned within a circle of radius 1, centred in 0. The distanceof each node from the center is 1{R D! "i , while the angle increases linearly with R(D) from 0 to 10p. Thus, the closer a node is to the center the higher is itsDebtRank (the intuition here is its centrality). A node in the center (DebtRank 5 1) is able to put under distress the entire economic value of the network.DebtRank decreases by moving outwards and leftwards along the spiral. The diagram allows at the same time to visualise the structure of the network andto compare the importance of any two given nodes. The size and the color of the node reflects the DebtRank value (larger and red nodes have higherDebtRank). The color of a link reflects the DebtRank of the node from which it originates (red links originate from node with high DebtRank and makehigh impact to the destination nodes). (a) Period one, at the beginning of the crisis (see also inset of Fig. 2a). Most of the nodes have low levels ofDebtRank, i.e., they are located close to the border. (b) Period four (peak of the crisis). Nodes have comparable levels of Debt Rank. However, they are alsomuch more central, i.e., they can impact a large fraction of the total economic value. A single default is likely to trigger a systemic failure.
Figure 4 | Debt Rank, asset size and fragility. Scatter plot of DebtRank versus asset size, measured as a fraction (in %) of the total of the asset size in thenetwork. For sake of simplicity, in the experiment, asset size was assumed constant during the time span of the data. Notice that institutions such as UBS,or CITIGROUP alone account for almost 10% of the total assets. The size of each bubble is proportional to the outstanding debt of the institution whilethe color reflects its fragility, defined as the ratio of debt over market capitalization in the given period, as in the previous section. (a) Period one. Since theoutstanding debt was very low or zero, most nodes appear small and have levels of DebtRank below 0.3, but comparable among each other. (b) Periodfour. Many institutions have a Debt Rank larger then 0.5, i.e. each can impact, alone, the majority of the economic value in the network. The outstandingdebt in this period is close to the peak for all the institutions, as reflected by the size of the bubbles. Notice, also a higher fragility, most bubbles are red,although with some heterogeneity.
www.nature.com/scientificreports
SCIENTIFIC REPORTS | 2 : 541 | DOI: 10.1038/srep00541 4
Figure 3 | The network of the top borrowers. Nodes represent financial institutions (selected as explained in the text). Outgoing links represent theestimated potential impact of an institution to another one (see Methods). The nodes are positioned within a circle of radius 1, centred in 0. The distanceof each node from the center is 1{R D! "i , while the angle increases linearly with R(D) from 0 to 10p. Thus, the closer a node is to the center the higher is itsDebtRank (the intuition here is its centrality). A node in the center (DebtRank 5 1) is able to put under distress the entire economic value of the network.DebtRank decreases by moving outwards and leftwards along the spiral. The diagram allows at the same time to visualise the structure of the network andto compare the importance of any two given nodes. The size and the color of the node reflects the DebtRank value (larger and red nodes have higherDebtRank). The color of a link reflects the DebtRank of the node from which it originates (red links originate from node with high DebtRank and makehigh impact to the destination nodes). (a) Period one, at the beginning of the crisis (see also inset of Fig. 2a). Most of the nodes have low levels ofDebtRank, i.e., they are located close to the border. (b) Period four (peak of the crisis). Nodes have comparable levels of Debt Rank. However, they are alsomuch more central, i.e., they can impact a large fraction of the total economic value. A single default is likely to trigger a systemic failure.
Figure 4 | Debt Rank, asset size and fragility. Scatter plot of DebtRank versus asset size, measured as a fraction (in %) of the total of the asset size in thenetwork. For sake of simplicity, in the experiment, asset size was assumed constant during the time span of the data. Notice that institutions such as UBS,or CITIGROUP alone account for almost 10% of the total assets. The size of each bubble is proportional to the outstanding debt of the institution whilethe color reflects its fragility, defined as the ratio of debt over market capitalization in the given period, as in the previous section. (a) Period one. Since theoutstanding debt was very low or zero, most nodes appear small and have levels of DebtRank below 0.3, but comparable among each other. (b) Periodfour. Many institutions have a Debt Rank larger then 0.5, i.e. each can impact, alone, the majority of the economic value in the network. The outstandingdebt in this period is close to the peak for all the institutions, as reflected by the size of the bubbles. Notice, also a higher fragility, most bubbles are red,although with some heterogeneity.
www.nature.com/scientificreports
SCIENTIFIC REPORTS | 2 : 541 | DOI: 10.1038/srep00541 4
Company Confidential – Do Not Distribute 23
� Goal: Proactively identify optimal treatments for patients based on treatment results for “similar” patients
� Data sets: Longitudinal patient data, Family history, Genetics, Reference data,
� Technical Challenges: Ad-hoc constantly changing definition of “similarity” across thousands of constantly changing attributes
� Users: Doctors � Usage model: Compare results of
candidate treatment options for “similar” patients based on ad-hoc physician specified patterns
� Augmenting: Existing data warehouse
Company Confidential – Do Not Distribute 24
� Goal: Proactively identify unknown cyber threats by examining all relationships
� Data sets: IP, MAC, BGP, Firewall, DNS, Netflow, Whois, NVD, CIDR…
� Technical Challenges: Volume and Velocity of data; Temporal dependencies; Real-time response
� Users: Cyber Analysts � Usage model: Iterative analysis of all
patterns across all traffic to explore deviations in frequency of occurrence, derivative patterns of known threats and linking patterns through relationships in offline data
� Augmenting: Existing data appliances
Company Confidential – Do Not Distribute 25
� Goal: Proactively identify new patterns of healthcare fraud (perpetrator/provider/patient nexus) by examining all healthcare relationships
� Data sets: Provider, Beneficiary, Policy, Claims, Billing, Services, Outcomes, Social Network, Organization…
� Technical Challenges: Volume and Velocity of data; Entity Resolution; Complex Inter-relationships; Temporal dependencies
� Users: Fraud Inspectors/Analysts � Usage model: Analyze outcome and cost
for various disease trajectories and identify outliers for treatment optimization and fraud investigation; Separate fraud patterns from benign treatment or legitimate errors
� Augmenting: Existing data warehouse, Predictive Analytics
Provider Patient Social Network
Billing Claims
External Sources Outcomes
Services
Medicare Medicaid
Company Confidential – Do Not Distribute 26
� When you can interac+vely ask complex, graph-‐oriented ques+ons of large, diverse data sets, you can discover pa4erns and rela+onships that are beyond the capability of other systems
� You can solve analy+cal challenges you thought were intractable
Company Confidential – Do Not Distribute 27