Permission to reprint or distribute any content from this presentation requires the prior written approval of S&P Capital IQ. Not for distribution to the public.
Copyright © 2012 Standard & Poor’s Financial Services LLC, a subsidiary of The McGraw-Hill Companies, Inc. All rights reserved.
Big Data: Wall Street Style
Jeff Sternberg and Jen Zeralli, S&P Capital IQ
February 29, 2012
Boring Financial Chart
Boring Financial Chart: less boring with labels
As of 2/24/2012.
Boring Financial Chart = kind of interesting, actually
More than $2.35 trillion invested in Information Technology over the last 10 years.
Source: S&P Capital IQ Transaction Screening. As of 2/24/2012.
How Does That Compare?
Total Investment over the last 10 years:
• Industrials = $3.49 trillion
• Energy = $2.61 trillion
• Healthcare = $2.47 trillion
• Information Technology = $2.35 trillion
• Telecom = $2.13 trillion
Source: S&P Capital IQ Transaction Screening. As of 2/24/2012.
So Is Big Data…
Big Money?
Big Money?
Total Investment over the last three years:
• Information Technology = $774.4 billion
• Big Data = $32.4 billion
So, 4.2%
Hey, at least we’re not just “the 1%”
Source: S&P Capital IQ Transaction Screening. As of 2/24/2012.
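The 4.2% figure follows directly from the two totals on the slide. A minimal sketch of the arithmetic (the variable names are illustrative, not from the presentation):

```python
# Three-year totals from the slide, in billions of US dollars.
it_total = 774.4
big_data_total = 32.4

# Big Data's share of all Information Technology investment, as a percentage.
share_pct = big_data_total / it_total * 100
print(f"Big Data share of IT investment: {share_pct:.1f}%")
```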
But What We Really Wanted To Talk About…
Strata: Making Data Work
February 29, 2012
• S&P Capital IQ: Data Is Our Product
• About Data Collection
• Standardization
• Linking: The Curious, Special Case of Entities
• Suggesting Data
• Projections
S&P Capital IQ: Data Is Our Product
Data Is Our Product
• Capital IQ started as an investment bank in 1999*
• Data = competitive advantage over other banks
• Built a database of financial investments, relationships and transactions
*Acquired by Standard & Poor’s in 2004, now part of S&P Capital IQ.
Hey, Let’s Sell That!
For illustrative purposes only. Source: S&P Capital IQ as of 2/2012.
Data Is Our Product: What We Offer
Datasets
• Financials and Valuation
• Qualitative Data
• Global Market Data
• Sell-Side Research
• Earnings Estimates
• News and Events
• Fixed Income
• Alpha and Risk Models
Use Cases
• Research Companies
• Generate Ideas
• Build Models
• Monitor Markets
• Analyze Performance
• Quantitative Research
Tools
• Web Portal
• Real-Time Workstation
• ClariFi
• Mobile
• Data Feeds
• Web Services
• Office Plug-Ins
Data Is Our Product: Who We Help
• Investment Bankers
• Asset Managers
• Private Equity Firms
• Venture Capital Firms
• Credit/Equity Analysts
• Corporations
• Consultants and Advisors
• Academia & Government
Data Is Our Product: Some Stats
Company and Person Profiles
• Companies with full quantitative data: 100,000
• Private company profiles: 2.7 million
• Professionals and board members: 4.2 million
• Quantitative data points per company: 5,000
• Qualitative data points per company: 1,500
Transactions
• M&A Transactions: 425,000
• Private Placements: 190,000
• Public Offerings: 138,000
News and Key Developments
• Daily News articles across 184 countries: 16,000
• Key Developments (curated news): 9.7 million
As of 2/2/2012.
Data Is Our Product
DEMO
About Data Collection
To Have A Data Product, One Must First Acquire Data.
About Data Collection
Data Collection Goals
• Coverage
• Quality
• Timeliness
• Auditability
About Data Collection
• It starts with documents – 67,000 per day
• Sources
– Company filings (SEC)
– News feeds (press releases)
– Web crawling
• We store these in our document repository
About Data Collection
• Document repository
– SQL for metadata
– “Regular” file storage for docs
– Solr/Lucene indexing for fast search
– 99.3 million documents
– 240.3 million versions (files)
As of 2/24/2012.
About Data Collection
Document Repository
SQL db:
Document_tbl
  documentID int PK
  sourceID smallint FK
Version_tbl
  versionID int PK
  documentID int FK
  rootID smallint FK
  versionIndex smallint
  filePath varchar(100)
Element_tbl
  elementID int PK
  [doc/vers/rel]ID int FK
  typeID int FK
  value [strongly typed]
ObjectRel_tbl
  relID int PK
  documentID int FK
  objectID int FK
+ Filesystem: html, pdf, text, sgml, …
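To make the metadata layout concrete, here is a minimal sketch of three of the tables using SQLite from Python. It is an illustration under stated assumptions, not the production schema: the slide does not name the database engine, the tables that sourceID, rootID and objectID point to are not shown, Element_tbl is omitted because its polymorphic key needs design decisions the slide does not give, and the helper latest_version_path and the sample file paths are hypothetical.

```python
import sqlite3

# Column names and types follow the slide; constraints beyond PK/FK are assumptions.
SCHEMA = """
CREATE TABLE Document_tbl (
    documentID INTEGER PRIMARY KEY,
    sourceID   INTEGER NOT NULL          -- FK to a source table not shown on the slide
);
CREATE TABLE Version_tbl (
    versionID    INTEGER PRIMARY KEY,
    documentID   INTEGER NOT NULL REFERENCES Document_tbl(documentID),
    rootID       INTEGER,                -- FK target not shown on the slide
    versionIndex INTEGER NOT NULL,
    filePath     VARCHAR(100) NOT NULL   -- where the html/pdf/text/sgml file lives on disk
);
CREATE TABLE ObjectRel_tbl (
    relID      INTEGER PRIMARY KEY,
    documentID INTEGER NOT NULL REFERENCES Document_tbl(documentID),
    objectID   INTEGER NOT NULL          -- linked object; target table not shown
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One document with two stored versions (hypothetical sample data).
conn.execute("INSERT INTO Document_tbl (documentID, sourceID) VALUES (1, 10)")
conn.executemany(
    "INSERT INTO Version_tbl (versionID, documentID, rootID, versionIndex, filePath) "
    "VALUES (?, ?, ?, ?, ?)",
    [(100, 1, 1, 0, "docs/1/v0.html"), (101, 1, 1, 1, "docs/1/v1.pdf")],
)


def latest_version_path(connection, document_id):
    """Return the file path of the highest-indexed version, or None if there is none."""
    row = connection.execute(
        "SELECT filePath FROM Version_tbl WHERE documentID = ? "
        "ORDER BY versionIndex DESC LIMIT 1",
        (document_id,),
    ).fetchone()
    return row[0] if row else None


print(latest_version_path(conn, 1))
```

The point of the split is that the database holds only metadata and a path, while the document bytes stay on the filesystem.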
About Data Collection
• Content search
– Which docs have relevant content?
– Search rules drive collection workflow
– 1000+ search rules per doc
– 65,000+ automated searches per day
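One way to picture search rules driving the workflow is a set of named patterns, each routing matching documents to a collection process. The sketch below assumes regular-expression rules held in memory; the rule names, patterns and process names are hypothetical, and the real system runs its searches against the Solr/Lucene index rather than in Python.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchRule:
    name: str
    pattern: re.Pattern
    process: str  # hypothetical name of the collection process to trigger


# Hypothetical rules; a real rule set would be far larger.
RULES = [
    SearchRule("dividend_change", re.compile(r"\bdividend\b", re.I), "KeyDevelopments"),
    SearchRule("deal", re.compile(r"\b(acquire[sd]?|acquisition|merger)\b", re.I), "Transactions"),
    SearchRule("results", re.compile(r"\bquarterly (results|earnings)\b", re.I), "Financials"),
]


def route_document(text):
    """Return the sorted list of processes whose rules match the document text."""
    return sorted({rule.process for rule in RULES if rule.pattern.search(text)})


print(route_document("Acme to acquire Widget Co and raises dividend"))
```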
About Data Collection
• Collection workflow
– Core engine that routes work items
– Organized into Processes, Stages, Statuses
– Prioritization based on usage (and other factors)
– Simple GetNext(), Commit() API
– 177.8 million Commits in 2011
– Avg. 130K+ Commits per day in Financials
As of 2/24/2012.
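The GetNext()/Commit() pair can be sketched as a priority queue per stage. This is a toy, in-memory illustration only: the class name, stage names, the numeric priority and the "Done" status are assumptions, and the slide says nothing about how the real engine stores or locks work items.

```python
import heapq
import itertools


class CollectionWorkflow:
    """Toy work-item router: one priority queue per stage."""

    def __init__(self, stages):
        self.stages = list(stages)
        self._queues = {stage: [] for stage in self.stages}
        self._counter = itertools.count()  # tie-breaker keeps equal priorities FIFO
        self.commits = 0

    def add(self, item, priority, stage=None):
        """Queue an item in a stage (the first stage by default)."""
        stage = stage or self.stages[0]
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._queues[stage], (-priority, next(self._counter), item))

    def get_next(self, stage):
        """Return (item, priority) for the most urgent item in a stage, or None."""
        queue = self._queues[stage]
        if not queue:
            return None
        neg_priority, _, item = heapq.heappop(queue)
        return item, -neg_priority

    def commit(self, item, priority, stage):
        """Record finished work and move the item to the next stage, if any."""
        self.commits += 1
        index = self.stages.index(stage)
        if index + 1 < len(self.stages):
            next_stage = self.stages[index + 1]
            self.add(item, priority, next_stage)
            return next_stage
        return "Done"


wf = CollectionWorkflow(["Extract", "Review"])
wf.add("doc-A", priority=1)
wf.add("doc-B", priority=5)  # e.g. a heavily used company gets higher priority
first_item, first_priority = wf.get_next("Extract")
moved_to = wf.commit(first_item, first_priority, "Extract")
print(first_item, moved_to)
```

Keeping the API to two calls means a collector, human or automated, never needs to know how prioritization works.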
About Data Collection
• Collection process
– Automated extraction
– Manual collection
– 1000s of quality checks, such as basic integrity and variance from prior period
– All data stored “as reported” with Doc ID
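The two check families named above can be sketched as small predicates run over each collected record. The specific rules here (a balance-sheet identity and a 50% change threshold), the field names and the sample record are assumptions chosen for illustration; the slides do not list the actual checks.

```python
def check_balance_sheet(values, tolerance=0.01):
    """Basic integrity: total assets should equal liabilities plus equity."""
    expected = values["total_liabilities"] + values["total_equity"]
    return abs(values["total_assets"] - expected) <= tolerance


def check_variance(current, prior, max_change=0.5):
    """Variance from prior period: pass only if the relative change is within max_change."""
    if prior == 0:
        return current == 0
    return abs(current - prior) / abs(prior) <= max_change


def run_checks(record):
    """Return the record's source document ID with the names of any failed checks."""
    failures = []
    if not check_balance_sheet(record["values"]):
        failures.append("basic_integrity")
    if not check_variance(record["values"]["revenue"], record["prior_revenue"]):
        failures.append("variance_from_prior")
    return {"doc_id": record["doc_id"], "failures": failures}


# Hypothetical record: the balance sheet ties out, but revenue tripled.
sample = {
    "doc_id": 12345,
    "prior_revenue": 100.0,
    "values": {
        "total_assets": 100.0,
        "total_liabilities": 60.0,
        "total_equity": 40.0,
        "revenue": 300.0,
    },
}
result = run_checks(sample)
print(result)
```

Carrying the document ID through the result is what lets a flagged value be traced back to the filing it was collected from.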
Standardization
Compare “apples to apples” (or Facebooks)
For illustrative purposes only. Source: S&P Capital IQ as of 2/24/2012.
Linking: The Curious, Special Case Of Entities
Linking: Managing Entities
• Entities we like to think about
– Companies (public, private, investment firms)
– Government agencies (the Fed)
– Governments (munis, countries, the EU)
– Securities (equity or debt, issued by the above)
– Indices, funds, rates, other aggregations
– People (executives, board members, investors, shareholders)
Linking: Managing Entities
• Goal: Blend entity data from different sources
– Ex: unified view of stock price and ratings
• First: What’s the identifier? Or identifiers?
– Name, ticker, CUSIP®, others…
• Next: Can we auto-link?
– Use historical links to make future links easier
• Quality checks
– Look for outlier cases
• Remember that things change over time
– So entity links create a time series
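Because identifiers get reassigned, a link is best thought of as valid over a date range rather than forever. A minimal sketch of that idea follows; the IdentifierLink structure, the sample ticker, the entity IDs and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class IdentifierLink:
    id_type: str               # e.g. "ticker", "name"
    value: str
    entity_id: int             # internal entity key (hypothetical)
    valid_from: date
    valid_to: Optional[date]   # None means the link is still in effect


# Hypothetical history: ticker "ABC" moved from entity 1001 to entity 2002.
LINKS = [
    IdentifierLink("ticker", "ABC", 1001, date(2005, 1, 1), date(2010, 6, 30)),
    IdentifierLink("ticker", "ABC", 2002, date(2010, 7, 1), None),
    IdentifierLink("name", "Example Holdings", 1001, date(2005, 1, 1), None),
]


def resolve(id_type, value, as_of):
    """Return the entity an identifier pointed to on a given date, or None."""
    for link in LINKS:
        if (link.id_type, link.value) != (id_type, value):
            continue
        if link.valid_from <= as_of and (link.valid_to is None or as_of <= link.valid_to):
            return link.entity_id
    return None


print(resolve("ticker", "ABC", date(2008, 3, 1)))
print(resolve("ticker", "ABC", date(2011, 1, 1)))
```

Asking "who was ABC on this date?" instead of "who is ABC?" is what keeps historical prices and ratings attached to the right entity.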
An Example Of Difficult Entity Linking: Public Ownership
• Tracks portfolio holdings and values over time
– Example: Vanguard vs. Fidelity Funds
• Many disparate sources
– Reported from both “owner” and “owned” side
– Varied requirements by exchange (50+ countries)
• Many different entity types
– People, Institutions, Pension Funds, Mutual Funds…
– Common Equity, Derivatives, Options…
• Many different security identifiers
– CUSIP®, ISIN, SEDOL, Ticker, Name, etc.
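A small first step when linking on security identifiers is rejecting malformed ones before they reach the matcher. As an illustration, the sketch below validates a 9-character CUSIP with the standard check-digit rule (a modulus-10 "double-add-double" scheme). The function names are hypothetical, and nothing in the slides says this is how the authors clean identifiers.

```python
def cusip_check_digit(base8):
    """Compute the check digit for the first 8 characters of a CUSIP."""
    special = {"*": 36, "@": 37, "#": 38}
    total = 0
    for position, char in enumerate(base8.upper(), start=1):
        if "0" <= char <= "9":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10  # A=10 ... Z=35
        elif char in special:
            value = special[char]
        else:
            raise ValueError(f"invalid CUSIP character: {char!r}")
        if position % 2 == 0:  # every second character is doubled
            value *= 2
        total += value // 10 + value % 10  # add the digits of the value
    return (10 - total % 10) % 10


def normalize_cusip(raw):
    """Return the cleaned CUSIP if it is well formed, otherwise None."""
    cusip = raw.strip().upper()
    if len(cusip) != 9 or not ("0" <= cusip[8] <= "9"):
        return None
    try:
        expected = cusip_check_digit(cusip[:8])
    except ValueError:
        return None
    return cusip if expected == int(cusip[8]) else None


print(normalize_cusip(" 037833100 "))
print(normalize_cusip("037833101"))
```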
Suggesting Data
• Goal: Platform that learns from user behavior
• Suggest company profiles that the user may be interested in viewing
• Use “data exhaust” to build better products
Suggesting Data
• Challenges
– We’re an impartial data platform
– We may not provide investment advice!
– Clients are super-secret about their deals
– Ergo, can’t use collaborative filtering approach
Suggesting Data
• Advantage: We have lots of great data!
• Key developments
– Curated news product
– “Get smart” on a company
– News searches catch interesting press releases
– In-house researchers ensure:
Quality entity linking
Event typing (categorization)
Suggesting Data
• Key development event ranking
– Popular & infrequent events = interesting
– Example: Dividend increase is more noteworthy than dividend affirmation
• User selectivity
– Based on clicks
– Sector, region, company type
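One possible reading of "popular & infrequent = interesting" is to multiply a click rate by a rarity term. The formula below (clicks per occurrence times the log of inverse frequency) and all the counts are assumptions made for illustration; the slides do not give the actual ranking function.

```python
import math

# Hypothetical per-event-type counts: (occurrences, clicks).
EVENT_STATS = {
    "dividend_increase": (200, 120),
    "dividend_affirmation": (5000, 250),
    "ceo_change": (50, 45),
}


def event_scores(stats):
    """Score each event type as click rate times log inverse frequency."""
    total = sum(occurrences for occurrences, _ in stats.values())
    scores = {}
    for event_type, (occurrences, clicks) in stats.items():
        popularity = clicks / occurrences          # how often users click it
        rarity = math.log(total / occurrences)     # rarer types score higher
        scores[event_type] = popularity * rarity
    return scores


scores = event_scores(EVENT_STATS)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With these made-up counts, a dividend increase outranks a dividend affirmation, which matches the example on the slide.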
Suggesting Data
• Score each suggestion for each user based on signals via Hadoop + Hive
• Remove items that the user has already seen!
• Present in a “widget” on the “dashboard”
• Measure the clickthroughs
• Rinse, wash, repeat
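The loop above can be sketched in a few lines of plain Python. This is an in-memory stand-in for what the talk describes as a Hadoop + Hive job: the scoring formula (the user's click share for the company's sector, region and type, multiplied by an event score), the field names and the sample data are all assumptions.

```python
def suggest(user, candidates, seen, top_n=3):
    """Score candidates for one user, drop already-seen companies, return the top IDs."""
    scored = []
    for company in candidates:
        if company["id"] in seen:
            continue  # never suggest something the user has already viewed
        affinity = sum(
            user["clicks"].get(key, 0.0)
            for key in (company["sector"], company["region"], company["type"])
        )
        scored.append((affinity * company["event_score"], company["id"]))
    # Highest score first; ties broken by ID so the output is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [company_id for _, company_id in scored[:top_n]]


def clickthrough_rate(impressions, clicks):
    """Share of shown suggestions that were clicked."""
    return clicks / impressions if impressions else 0.0


# Hypothetical user profile: share of past clicks by attribute value.
user = {"clicks": {"Technology": 0.6, "Energy": 0.1, "Europe": 0.3, "Private": 0.5}}
candidates = [
    {"id": "A", "sector": "Technology", "region": "Europe", "type": "Private", "event_score": 2.0},
    {"id": "B", "sector": "Energy", "region": "Asia", "type": "Public", "event_score": 4.0},
    {"id": "C", "sector": "Technology", "region": "Asia", "type": "Private", "event_score": 1.0},
]
suggestions = suggest(user, candidates, seen={"C"})
print(suggestions, clickthrough_rate(impressions=200, clicks=14))
```

Note that the score uses only the user's own clicks and the event data, consistent with the earlier point that collaborative filtering across clients is off the table.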
Companies You May Be Interested In
For illustrative purposes only.
Projections
As of 2/24/2012.
Projections – Simple Growth Rates
Transaction Valuation     First Year ($ billion)    3-Year Total ($ billion)
Information Technology    209.8                     774.4
Big Data                  5.0                       32.4
• Let S represent the first year
• Let T represent the 3-year total
• Let x represent the yearly growth rate (%)
• Solve for x:
As of 2/24/2012.
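The equation itself did not survive extraction. A reconstruction that reproduces the growth rates quoted on the next slide is that the 3-year total is the sum of three compounding years, T = S + S(1 + x) + S(1 + x)^2. Treating that as an assumption, the sketch below solves the resulting quadratic for x; the function name is illustrative.

```python
import math


def yearly_growth_rate(first_year, three_year_total):
    """Solve T = S + S(1+x) + S(1+x)^2 for x (assumed model, see lead-in)."""
    ratio = three_year_total / first_year
    # With g = 1 + x, the model is g^2 + g + (1 - T/S) = 0; take the positive root.
    growth_factor = (-1 + math.sqrt(4 * ratio - 3)) / 2
    return growth_factor - 1


it_rate = yearly_growth_rate(209.8, 774.4)
big_data_rate = yearly_growth_rate(5.0, 32.4)
print(f"Information Technology: {it_rate:.1%}")
print(f"Big Data: {big_data_rate:.1%}")
```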
Projections – Simple Growth Rates
Transaction Valuation     First Year ($ billion)    3-Year Total ($ billion)    Yearly Growth Rate (%)
Information Technology    209.8                     774.4                       21.5%
Big Data                  5.0                       32.4                        89.4%
• When will Big Data catch up with IT?
• Let y be the number of years this will take
• Solve for y:
As of 2/24/2012.
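The equation for y is also missing from the extracted slide. One plausible formulation, offered as an assumption rather than the authors' own, is that both series keep compounding from their first-year values at the rates in the table, and "catching up" means equal yearly valuations: 5.0 (1.894)^y = 209.8 (1.215)^y. Taking logarithms gives y directly; the function name is illustrative.

```python
import math


def years_to_catch_up(small_start, small_rate, large_start, large_rate):
    """Years until the smaller, faster-growing series equals the larger one."""
    if small_rate <= large_rate:
        raise ValueError("the smaller series must grow faster to catch up")
    # small_start * (1 + small_rate)^y = large_start * (1 + large_rate)^y
    return math.log(large_start / small_start) / math.log(
        (1 + small_rate) / (1 + large_rate)
    )


y = years_to_catch_up(5.0, 0.894, 209.8, 0.215)
print(f"About {y:.1f} years after the first year, if both rates held")
```

Constant growth rates over that horizon are a strong assumption, so the result is best read as a back-of-the-envelope figure.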
So Is Big Data…
Big Money? YES!
Questions?
• We’re hiring!
• We want to learn from YOU!
S&P Capital IQ
http://www.spcapitaliq.com
Jeff Sternberg [email protected]
Jen Zeralli [email protected]
Copyright © 2012 by Standard & Poor’s Financial Services LLC (S&P), a subsidiary of The McGraw-Hill Companies, Inc. All rights reserved. No content (including ratings, credit-related analyses and data, model, software or other application or output therefrom) or any part thereof (Content) may be modified, reverse engineered, reproduced or distributed in any form by any means, or stored in a database or retrieval system, without the prior written permission of S&P. The Content shall not be used for any unlawful or unauthorized purposes. S&P, its affiliates, and any third-party providers, as well as their directors, officers, shareholders, employees or agents (collectively S&P Parties) do not guarantee the accuracy, completeness, timeliness or availability of the Content. S&P Parties are not responsible for any errors or omissions, regardless of the cause, for the results obtained from the use of the Content, or for the security or maintenance of any data input by the user. The Content is provided on an “as is” basis. S&P PARTIES DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE OR USE, FREEDOM FROM BUGS, SOFTWARE ERRORS OR DEFECTS, THAT THE CONTENT’S FUNCTIONING WILL BE UNINTERRUPTED OR THAT THE CONTENT WILL OPERATE WITH ANY SOFTWARE OR HARDWARE CONFIGURATION. In no event shall S&P Parties be liable to any party for any direct, indirect, incidental, exemplary, compensatory, punitive, special or consequential damages, costs, expenses, legal fees, or losses (including, without limitation, lost income or lost profits and opportunity costs) in connection with any use of the Content even if advised of the possibility of such damages. 
Credit-related analyses, including ratings, and statements in the Content are statements of opinion as of the date they are expressed and not statements of fact or recommendations to purchase, hold, or sell any securities or to make any investment decisions. S&P assumes no obligation to update the Content following publication in any form or format. The Content should not be relied on and is not a substitute for the skill, judgment and experience of the user, its management, employees, advisors and/or clients when making investment and other business decisions. S&P’s opinions and analyses do not address the suitability of any security. S&P does not act as a fiduciary or an investment advisor. While S&P has obtained information from sources it believes to be reliable, S&P does not perform an audit and undertakes no duty of due diligence or independent verification of any information it receives. S&P keeps certain activities of its business units separate from each other in order to preserve the independence and objectivity of their respective activities. As a result, certain business units of S&P may have information that is not available to other S&P business units. S&P has established policies and procedures to maintain the confidentiality of certain non–public information received in connection with each analytical process. S&P may receive compensation for its ratings and certain credit-related analyses, normally from issuers or underwriters of securities or from obligors. S&P reserves the right to disseminate its opinions and analyses. S&P's public ratings and analyses are made available on its Web sites, www.standardandpoors.com (free of charge), and www.ratingsdirect.com and www.globalcreditportal.com (subscription), and may be distributed through other means, including via S&P publications and third-party redistributors. Additional information about our ratings fees is available at www.standardandpoors.com/usratingsfees. 
STANDARD & POOR’S and S&P are registered trademarks of Standard & Poor’s Financial Services LLC.
www.standardandpoors.com