Video: http://www.youtube.com/watch?v=BT8WvQMMaV0
Hadoop is the technology of choice for processing large data sets. At salesforce.com, we service internal and product big data use cases using a combination of Hadoop, Java MapReduce, Pig, Force.com, and machine learning algorithms. In this webinar, we will discuss an internal use case and a product use case:
§ Product Metrics: Internally, we measure feature usage using a combination of Hadoop, Pig, and the Force.com platform (Custom Objects and Analytics).
§ Community-Based Recommendations: In Chatter, our most successful people and file recommendations are built on a collaborative filtering algorithm that is implemented on Hadoop using Java MapReduce.
Follow us @forcedotcom
How Salesforce.com uses Hadoop
Narayan Bharadwaj Data Science @nadubharadwaj
Jed Crosby Data Science @JedCrosby
#forcewebinar
Safe harbor statement under the Private Securities Litigation Reform Act of 1995:
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year ended January 31, 2011 and in our quarterly report on Form 10-Q for the most recent fiscal quarter ended October 31, 2011. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
Safe Harbor
Agenda
§ Hadoop use cases
§ Use case 1 - Product Metrics*
§ Technology
§ Use case 2 - Collaborative Filtering*
§ Q&A
*Every time you see the elephant, we will attempt to explain a Hadoop-related concept.
Got “Cloud Data”?
780 million transactions/day Terabytes/day
130k customers Millions of users
Hadoop Overview
§ Created by Doug Cutting and Mike Cafarella; much of its early development took place at Yahoo!
§ Based on two Google papers
– Google File System (GFS): http://research.google.com/archive/gfs.html
– Google MapReduce: http://research.google.com/archive/mapreduce.html
§ Hadoop is an open source Apache project
– Hadoop Distributed File System (HDFS)
– Distributed Processing Framework (MapReduce)
§ Several related projects
– HBase, Hive, Pig, Flume, ZooKeeper, Mahout, Oozie, HCatalog
Hadoop use cases
§ Product Metrics
§ User behavior analysis
§ Capacity planning
§ Monitoring intelligence
§ Performance analysis
§ Security
§ Ad-hoc log searches
§ Collaborative Filtering
§ Search Relevancy
Product Metrics
§ Track feature usage/adoption across 130k+ customers
– E.g., Accounts, Contacts, Visualforce, Apex, …
§ Track standard metrics across all features
– E.g., #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime, …
§ Track features and metrics across all channels
– API, UI, Mobile
§ Primary audience: Executives, Product Managers
Product Metrics – Problem Statement
[Diagram of the data pipeline. Stages shown: Feature (What?); Feature Metadata (Instrumentation); Storage & Processing; Crunch it (How?); Daily Summary (Output); Fancy UI (Visualize); Collaborate & Iterate]
Data Pipeline
[Diagram of the Product Metrics pipeline. Components shown: User Input (Page Layout); Feature Metrics (Custom Object); Workflow; Formula Fields; API; Client Machine (Java Program, Pig script generator); Hadoop (Log Files, Log Pull); Trend Metrics (Custom Object); Reports, Dashboards; Collaboration (Chatter)]
Product Metrics Pipeline
Id | Feature Name | PM | Instrumentation | Metric1 | Metric2 | Metric3 | Metric4 | Status
F0001 | Accounts | John | /001 | #requests | #UniqOrgs | #UniqUsers | AvgRT | Dev
F0002 | Contacts | Nancy | /003 | #requests | #UniqOrgs | #UniqUsers | AvgRT | Review
F0003 | API | Eric | A | #requests | #UniqOrgs | #UniqUsers | AvgRT | Deployed
F0004 | Visualforce | Roger | V | #requests | #UniqOrgs | #UniqUsers | AvgRT | Decom
F0005 | Apex | Kim | axapx | #requests | #UniqOrgs | #UniqUsers | AvgRT | Deployed
F0006 | Custom Objects | Chun | /aXX | #requests | #UniqOrgs | #UniqUsers | AvgRT | Deployed
F0008 | Chatter | Jed | chcmd | #requests | #UniqOrgs | #UniqUsers | AvgRT | Deployed
F0009 | Reports | Steve | R | #requests | #UniqOrgs | #UniqUsers | AvgRT | Deployed
Feature Metrics (Custom Object)
User Input (Page Layout) Formula Field
Workflow Rule
User Input (Child Custom Object)
Child Objects
Apache Pig
-- Define UDFs
DEFINE GFV GetFieldValue('/path/to/udf/file');
-- Load data
A = LOAD '/path/to/cloud/data/log/files' USING PigStorage();
-- Filter data
B = FILTER A BY GFV(*, 'logRecordType') == 'U';
-- Extract fields (aliases let later steps refer to them by name)
C = FOREACH B GENERATE GFV(*, 'orgId') AS orgId, GFV(*, 'userId') AS userId, ...;
-- Group
G = GROUP C BY ...;
-- Compute output metrics
O = FOREACH G {
    orgs = C.orgId;
    uniqueOrgs = DISTINCT orgs;
    -- A nested FOREACH block must end with a GENERATE
    GENERATE group, COUNT(uniqueOrgs) AS numUniqueOrgs;
}
-- Store or dump results
STORE O INTO '/path/to/user/output';
Basic Pig script construct
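For readers unfamiliar with Pig, the same filter / group / distinct-count flow can be sketched in plain Python. This is an illustration only, not the production job: the sample records and the "feature" grouping key are assumptions (the slide elides the actual GROUP key), and only the field names logRecordType, orgId and userId come from the script above.

```python
from collections import defaultdict

# Hypothetical parsed log records. The "feature" key is an assumed grouping
# key; logRecordType, orgId and userId mirror the fields in the Pig script.
records = [
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org1", "userId": "u1"},
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org1", "userId": "u2"},
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org2", "userId": "u3"},
    {"logRecordType": "A", "feature": "Accounts", "orgId": "org3", "userId": "u4"},
    {"logRecordType": "U", "feature": "Contacts", "orgId": "org1", "userId": "u1"},
]


def summarize(records, record_type="U"):
    """Filter by record type, group by feature, then count distinct values."""
    groups = defaultdict(list)
    for rec in records:
        if rec["logRecordType"] == record_type:   # FILTER
            groups[rec["feature"]].append(rec)    # GROUP
    summary = {}
    for feature, rows in groups.items():          # FOREACH group ... GENERATE
        summary[feature] = {
            "requests": len(rows),
            "uniqueOrgs": len({r["orgId"] for r in rows}),    # DISTINCT + COUNT
            "uniqueUsers": len({r["userId"] for r in rows}),
        }
    return summary


summary = summarize(records)
```

Each entry of summary corresponds to one row of daily output for a feature, which is the shape the Trend Metrics object expects.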
Java Pig Script Generator (Client)
Id | Date | #Requests | #Unique Orgs | #Unique Users | Avg Response Time
F0001 | 06/01/2012 | <big> | <big> | <big> | <little>
F0002 | 06/01/2012 | <big> | <big> | <big> | <little>
F0003 | 06/01/2012 | <big> | <big> | <big> | <little>
F0001 | 06/02/2012 | <big> | <big> | <big> | <little>
F0002 | 06/02/2012 | <big> | <big> | <big> | <little>
F0003 | 06/03/2012 | <big> | <big> | <big> | <little>
Trend Metrics (Custom Object)
Upload to Trend Metrics (Custom Object)
Visualization (Reports & Dashboards)
Collaborate, Iterate (Chatter)
[Diagram of the Product Metrics pipeline. Components shown: User Input (Page Layout); Feature Metrics (Custom Object); Workflow; Formula Fields; API; Client Machine (Java Program, Pig script generator); Hadoop (Log Files, Log Pull); Trend Metrics (Custom Object); Reports, Dashboards; Collaboration (Chatter)]
Recap
Technology
Apache Hadoop Version=0.20.2
Hadoop ecosystem
Contributions
Prashant Kommireddi (@pRaShAnT1784)
Lars Hofhansl
Ian Varley (@thefutureian)
Apache Pig Version=0.9.1
Data Science tools ecosystem
Collaborative Filtering
§ Show similar files within an organization
– Content-based approach
– Community-based approach
Collaborative Filtering – Problem Statement
Popular File
Related File
§ Amazon published this algorithm in 2003.
– Amazon.com Recommendations: Item-to-Item Collaborative Filtering, by Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet Computing, January-February 2003.
§ At Salesforce, we adapted this algorithm for Hadoop, and we use it to recommend files to view and users to follow.
We found this relationship using item-to-item collaborative filtering
Annual Report
Vision Statement
Dilbert Cartoon
Darth Vader Cartoon
Disk Usage Report
Example: CF on 5 files
User | Annual Report | Vision Statement | Dilbert Cartoon | Darth Vader Cartoon | Disk Usage Report
Miranda (CEO) | 1 | 1 | 1 | 0 | 0
Bob (CFO) | 1 | 1 | 1 | 0 | 0
Susan (Sales) | 0 | 1 | 1 | 1 | 0
Chun (Sales) | 0 | 0 | 1 | 1 | 0
Alice (IT) | 0 | 0 | 1 | 1 | 1
View History Table
Annual Report
Disk Usage Report
Darth Vader Cartoon
Dilbert Cartoon
Vision Statement
Relationships between the files
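Annual Report / Vision Statement: 2
Annual Report / Dilbert Cartoon: 2
Annual Report / Darth Vader Cartoon: 0
Annual Report / Disk Usage Report: 0
Vision Statement / Dilbert Cartoon: 3
Vision Statement / Darth Vader Cartoon: 1
Vision Statement / Disk Usage Report: 0
Dilbert Cartoon / Darth Vader Cartoon: 3
Dilbert Cartoon / Disk Usage Report: 1
Darth Vader Cartoon / Disk Usage Report: 1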
Relationships between the files
Annual Report: Dilbert (2), Vision Stmt. (2)
Vision Statement: Dilbert (3), Annual Rpt. (2), Darth Vader (1)
Dilbert Cartoon: Vision Stmt. (3), Darth Vader (3), Annual Rpt. (2), Disk Usage (1)
Darth Vader Cartoon: Dilbert (3), Vision Stmt. (1), Disk Usage (1)
Disk Usage Report: Dilbert (1), Darth Vader (1)
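The popularity problem: notice that Dilbert appears first in every other file's list. This is probably not what we want.
The solution: divide the relationship tallies by file popularities (each tally is divided by the square root of the product of the two files' popularities).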
Sorted relationships for each file
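Annual Report / Vision Statement: .82
Annual Report / Dilbert Cartoon: .63
Annual Report / Darth Vader Cartoon: 0
Annual Report / Disk Usage Report: 0
Vision Statement / Dilbert Cartoon: .77
Vision Statement / Darth Vader Cartoon: .33
Vision Statement / Disk Usage Report: 0
Dilbert Cartoon / Darth Vader Cartoon: .77
Dilbert Cartoon / Disk Usage Report: .45
Darth Vader Cartoon / Disk Usage Report: .58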
Normalized relationships between the files
Annual Report: Vision Stmt. (.82), Dilbert (.63)
Vision Statement: Annual Report (.82), Dilbert (.77), Darth Vader (.33)
Dilbert Cartoon: Darth Vader (.77), Vision Stmt. (.77), Annual Report (.63), Disk Usage (.45)
Darth Vader Cartoon: Dilbert (.77), Disk Usage (.58), Vision Stmt. (.33)
Disk Usage Report: Darth Vader (.58), Dilbert (.45)
High relationship tallies AND similar popularity values now drive closeness.
Sorted relationships for each file, normalized by file popularities
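The slides do not spell out the normalization at this point, but the numbers shown are consistent with the cosine formula from the appendix: each tally is divided by the square root of the product of the two files' popularities. A worked check, using the popularities from the view history table (Annual Report 2, Vision Statement 3, Dilbert 5, Darth Vader 3, Disk Usage 1); the names sim, tally and pop are notation introduced here for illustration:

```latex
\[
\mathrm{sim}(A, B) = \frac{\mathrm{tally}(A, B)}{\sqrt{\mathrm{pop}(A)\,\mathrm{pop}(B)}}
\]
\[
\mathrm{sim}(\text{Annual Report}, \text{Vision Stmt.}) = \frac{2}{\sqrt{2 \cdot 3}} \approx 0.82
\]
\[
\mathrm{sim}(\text{Dilbert}, \text{Darth Vader}) = \frac{3}{\sqrt{5 \cdot 3}} = \sqrt{3/5} \approx 0.77
\]
\[
\mathrm{sim}(\text{Dilbert}, \text{Disk Usage}) = \frac{1}{\sqrt{5 \cdot 1}} \approx 0.45
\]
```

Dilbert's high popularity now shrinks its scores, so it no longer leads every list.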
1) Compute file popularities
2) Compute relationship tallies and divide by file popularities
3) Sort and store the results
The item-to-item CF algorithm
MapReduce Overview: Map, Shuffle, Reduce
(adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
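To make the three phases concrete, here is a minimal in-memory sketch in plain Python. It is not the Hadoop API: run_mapreduce, word_mapper and count_reducer are hypothetical names, and a real job distributes each phase across machines. The example job is a word count, the pattern step 2b refers to.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, shuffle, then reduce."""
    # Map: each input record yields zero or more (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle: gather all values that share a key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce: each key and its list of values yields output pairs.
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield (word, 1)


def count_reducer(word, ones):
    """Sum the 1s emitted for a word."""
    yield (word, sum(ones))


counts = dict(run_mapreduce(["pig hadoop pig", "hadoop hive"], word_mapper, count_reducer))
```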
<user, file>
Inverse identity map
<file, List<user>>
Reduce
<file, (user count)>
Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.
1. Compute File Popularities
(Miranda, Dilbert), (Bob, Dilbert), (Susan, Dilbert), (Chun, Dilbert), (Alice, Dilbert)
Inverse identity map
<Dilbert, {Miranda, Bob, Susan, Chun, Alice}>
Reduce
(Dilbert, 5)
Example: File popularity for Dilbert
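Step 1 can be sketched in plain Python as follows. This is an in-memory illustration rather than the Java MapReduce job; VIEW_HISTORY restates the five-file view history table, and file_popularities is a hypothetical helper name.

```python
from collections import defaultdict

# The view history table from the five-file example.
VIEW_HISTORY = {
    "Miranda": ["Annual Report", "Vision Statement", "Dilbert"],
    "Bob": ["Annual Report", "Vision Statement", "Dilbert"],
    "Susan": ["Vision Statement", "Dilbert", "Darth Vader"],
    "Chun": ["Dilbert", "Darth Vader"],
    "Alice": ["Dilbert", "Darth Vader", "Disk Usage"],
}
views = [(user, file) for user, files in VIEW_HISTORY.items() for file in files]


def file_popularities(views):
    """Count the distinct users who viewed each file."""
    users_by_file = defaultdict(set)
    for user, file in views:
        # Inverse identity map: swap (user, file) to (file, user).
        users_by_file[file].add(user)
    # Reduce: the popularity of a file is the size of its user set.
    return {file: len(users) for file, users in users_by_file.items()}


popularity = file_popularities(views)
```

Using a set per file means a user who views the same file twice is still counted once; whether the production job deduplicates this way is not stated in the slides.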
<user, file>
Identity map
<user, List<file>>
Reduce
<(file1, file2), Integer(1)>, <(file1, file3), Integer(1)>, … <(file(n-1), file(n)), Integer(1)>
Relationships have their file IDs in alphabetical order to avoid double counting.
2a. Compute relationship tallies - find all relationships in view history table
(Miranda, Annual Report), (Miranda, Vision Statement), (Miranda, Dilbert)
Identity map
<Miranda, {Annual Report, Vision Statement, Dilbert}>
Reduce
<(Annual Report, Dilbert), Integer(1)>, <(Annual Report, Vision Statement), Integer(1)>, <(Dilbert, Vision Statement), Integer(1)>
Example 2a: Miranda’s (CEO) file relationship votes
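The reduce side of step 2a amounts to emitting every pair from one user's file list. A minimal Python sketch (relationship_votes is a hypothetical name; the real reducer is written in Java):

```python
from itertools import combinations


def relationship_votes(files_viewed):
    """Emit ((file1, file2), 1) for every pair of files one user viewed."""
    # Sorting first puts each pair in alphabetical order, so (A, B) and
    # (B, A) can never be tallied as two different relationships.
    ordered = sorted(set(files_viewed))
    return [(pair, 1) for pair in combinations(ordered, 2)]


miranda_votes = relationship_votes(["Annual Report", "Vision Statement", "Dilbert"])
```

A user with k files emits k(k-1)/2 votes, which is why a few users with very long view histories can dominate the cost of this step.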
<(file1, file2), Integer(1)>
Identity map
<(file1, file2), List<Integer(1)>>
Reduce: count and divide by popularities
<file1, (file2, similarity score)>, <file2, (file1, similarity score)>
Note that we emit each result twice, one for each file that belongs to a relationship.
2b. Tally the relationship votes - just a word count, where each relationship occurrence is a word
<(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>
Identity map
<(Dilbert, Vader), {1, 1, 1}>
Reduce: count and divide by popularities
<Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
Example 2b: the Dilbert/Darth Vader relationship
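A Python sketch of step 2b, under the assumption that "divide by popularities" means dividing by the square root of the product of the two popularities (which matches the sqrt(3/5) in the example). similarity_scores is a hypothetical name, and the popularity table is passed in as a dict where the slides use the Hadoop distributed cache.

```python
import math
from collections import Counter


def similarity_scores(votes, popularity):
    """Tally relationship votes, normalize, and emit each result twice."""
    tallies = Counter()
    for pair, one in votes:
        tallies[pair] += one            # the "word count" over relationships
    scores = []
    for (file1, file2), tally in tallies.items():
        score = tally / math.sqrt(popularity[file1] * popularity[file2])
        scores.append((file1, (file2, score)))   # keyed by the first file
        scores.append((file2, (file1, score)))   # and by the second file
    return scores


votes = [(("Dilbert", "Vader"), 1)] * 3
scores = similarity_scores(votes, {"Dilbert": 5, "Vader": 3})
```

Emitting the score under both file keys is what lets step 3 group by a single file and see all of its neighbors.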
<file1, (file2, similarity score)>
Identity map
<file1, List<(file2, similarity score)>>
Reduce
<file1, {top n similar files}>
Store the results in your location of choice
3. Sort and store results
<Dilbert, (Annual Report, .63)>, <Dilbert, (Vision Statement, .77)>, <Dilbert, (Disk Usage, .45)>, <Dilbert, (Darth Vader, .77)>
Identity map
<Dilbert, {(Annual Report, .63), (Vision Statement, .77), (Disk Usage, .45), (Darth Vader, .77)}>
Reduce
<Dilbert, {Darth Vader, Vision Statement}> (Top 2 files)
Store results
Example 3: Sorting the results for Dilbert
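The reduce side of step 3 is a top-n selection per file. A minimal Python sketch (top_n_similar is a hypothetical name; breaking score ties alphabetically is an assumption made here so the output is deterministic, since the slides do not say how ties are ordered):

```python
def top_n_similar(scored_files, n):
    """Return the names of the n files with the highest similarity scores."""
    # Sort by descending score; equal scores fall back to alphabetical order.
    ranked = sorted(scored_files, key=lambda item: (-item[1], item[0]))
    return [name for name, _score in ranked[:n]]


dilbert_scores = [
    ("Annual Report", 0.63),
    ("Vision Statement", 0.77),
    ("Disk Usage", 0.45),
    ("Darth Vader", 0.77),
]
top_two = top_n_similar(dilbert_scores, 2)
```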
§ Cosine formula and normalization trick to avoid the distributed cache
§ Mahout has CF
§ Asymptotic order of the algorithm is O(M*N^2) in the worst case, but is helped by sparsity.
$$\cos\theta_{AB} = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{A}{\|A\|} \cdot \frac{B}{\|B\|}$$
Appendix
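The slides do not explain the normalization trick further. One plausible reading of the identity above is that each file's view vector can be scaled to unit length up front, so that a plain dot product already equals the cosine and no popularity lookup is needed at reduce time. The Python sketch below only checks that identity on the Dilbert and Darth Vader columns of the view history table; it is not a description of the production job, and cosine and normalize are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity computed directly: (A . B) / (|A| |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Columns of the view history table; rows are Miranda, Bob, Susan, Chun, Alice.
dilbert = [1, 1, 1, 1, 1]
vader = [0, 0, 1, 1, 1]

direct = cosine(dilbert, vader)
pre_normalized = sum(x * y for x, y in zip(normalize(dilbert), normalize(vader)))
```

Both routes give sqrt(3/5), the Dilbert/Darth Vader score from example 2b.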
Summary
Hadoop Cloud Data
Hadoop + Force.com = Recommendation algorithms
@forcedotcom / #forcewebinar
Developer Force Group
facebook.com/forcedotcom
Developer Force – Force.com Community
Upcoming Events
§ June 26 – Mobile CodeTalk – http://bit.ly/mct-wr
§ June 27 – Painless Mobile App Development – http://bit.ly/mobileapp-hp
http://bit.ly/mdc-hp
Q&A http://bit.ly/hadoopsurvey
Narayan Bharadwaj (@nadubharadwaj), Jed Crosby (@JedCrosby), Prashant Kommireddi (@pRaShAnT1784), Santosh Rau (@santoshrau)
@SalesforceEng