shesh-ratnala
Problem: Product Recommendation
Type of Data | Source of Data
Product Information | Product Catalogue
Customer Information | Customer Data (demographic)
Customer Purchase History | Transactional Data (RDBMS and HDFS)
User Activity Tracking Data | Third-party hybrid cloud data: heat map, click data, demographic (heat map tools: crazyegg, Google Analytics, etc.); Real Time Bidding (RTB) ad inventory data (e.g. Deltax); user activity log, on-premise data (e.g. browser cookie, local storage data)
Social Network Activity (analytics data about products, likes, usage, shares, no. of participants, etc.) | Social Network (Facebook, Twitter, Google+, Instagram)
Is it a Big Data Problem: Product Recommendation
Dimension | Big Data? | Remarks
Volume | Yes | As per the capacity planning, the total data generated over 7 years is 5932 TB.
Velocity | Yes | Speed of data generation and analysis for transactional and social network activities; speed of capturing browser cookie data, and of generating and analyzing structured/unstructured data.
Variety | Yes | Use of incompatible and non-integrated data from heterogeneous sources such as customer purchase data, activity logs, and social networks.
ML Approach: Product Recommendation
Machine Learning Problem | Reasoning
Unsupervised | The problem requires deriving product recommendations from observed similarity among customers; there is no labeled outcome to learn from.
Clustering | We cluster customers with similar attributes (browsing patterns, purchase history, demographic information, and behavioral data) based on the observed data set. [K-means clustering]
Recommendation | Within a cluster, we use user-based collaborative filtering, since the recommendation is driven by customer attributes. [User-based Collaborative Filtering]
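The two steps named above (K-means clustering, then user-based collaborative filtering inside a cluster) can be sketched in plain Python. This is a minimal illustration, not the deck's implementation: the customer features, product names, and ratings are hypothetical, and the farthest-first seeding is a simplification chosen only to keep the toy example deterministic.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means; returns one cluster index per point."""
    # Farthest-first seeding (an assumption, keeps the toy run deterministic).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cosine(r1, r2):
    """Cosine similarity between two sparse rating dictionaries."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    n1 = math.sqrt(sum(v * v for v in r1.values()))
    n2 = math.sqrt(sum(v * v for v in r2.values()))
    return dot / (n1 * n2)

def recommend(target, ratings, cluster_of, top_n=2):
    """Rank items the target has not rated, using peers in the same cluster."""
    scores, weights = {}, {}
    for peer, peer_ratings in ratings.items():
        if peer == target or cluster_of[peer] != cluster_of[target]:
            continue
        sim = cosine(ratings[target], peer_ratings)
        if sim <= 0:
            continue
        for item, rating in peer_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    # Similarity-weighted average rating per candidate item.
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:top_n]

# Hypothetical customer features: [site visits per week, average basket value].
features = {"u1": [1, 2], "u2": [2, 1], "u3": [1, 1],
            "u4": [9, 10], "u5": [10, 9], "u6": [10, 10]}
# Hypothetical product ratings on a 1-5 scale.
ratings = {
    "u1": {"phone": 5, "case": 4},
    "u2": {"phone": 4, "case": 5, "charger": 5},
    "u3": {"phone": 5, "charger": 4, "earbuds": 2},
    "u4": {"tent": 5, "stove": 4},
    "u5": {"tent": 4, "stove": 5, "lantern": 5},
    "u6": {"tent": 5, "lantern": 3},
}

users = list(features)
cluster_of = dict(zip(users, kmeans([features[u] for u in users], k=2)))
recs = recommend("u1", ratings, cluster_of)
print(cluster_of, recs)
```

Restricting the neighbour search to one cluster keeps the similarity computation small, which is the main reason to cluster first on a data set of this size.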
Big Data Components: Product Recommendation
Component | Remarks
Hadoop Distributed File System (HDFS) - Primary | Stores structured and semi-structured data (activity log, purchase history, user information, social analytics data, etc.) in raw format.
Sqoop | Moves transactional data from the RDBMS into HDFS and back.
Flume | Collects, aggregates, and moves large amounts of user activity log data.
Chukwa | Moves the log data generated on the primary HDFS to another HDFS for analysis.
Pig/Hive | Help write MapReduce scripts that put the data into a structured key-value format.
Mahout/R-Hadoop | Mahout implements core algorithms for clustering, classification, and batch-based collaborative filtering, which we can use to generate product recommendations.
ZooKeeper | Coordinates common services such as naming, configuration management, and synchronization of data and services among NameNodes and DataNodes in Hadoop.
Problem: Demand Analysis and Forecasting for existing product line
Type of Data | Source of Data
Product Information | Product Catalogue
Customer Information | Customer Data (demographic)
Product Purchase Information, inventory lifetime, wish list, product sales volume | Transaction Data (RDBMS and HDFS)
Social Network Activity (analytics data about products, likes, usage, shares, no. of participants, etc.) | Social Network (Facebook, Twitter, Google+, Instagram)
Is it a Big Data Problem: Demand Analysis and Forecasting for existing product line
Dimension | Big Data? | Remarks
Volume | Yes | As per the capacity planning, the total data generated over 7 years is 5932 TB.
Velocity | Yes | Speed of data generation and analysis for transactional and social network activities; speed of capturing product inventory lifetime, point of sale (POS), and sales volume, and of generating and analyzing structured/unstructured data.
Variety | Yes | Use of incompatible and non-integrated data from heterogeneous sources such as customer purchase data, activity logs, and social networks.
ML Approach: Demand Analysis and Forecasting for existing product line
Machine Learning Problem | Reasoning
Supervised | Our target is to determine the future demand for merchandise.
Prediction | We are predicting the future demand for merchandise.
Regression | Based on the observed data set, we predict future demand by establishing the correlation between the data set and the outcome. [Linear Regression Tree]
Time Series | We establish a pattern of merchandise demand over continuous time intervals, based on the correlation between demand and the observed data set. [ARIMA parametric time series modeling]
Big Data Components: Demand Analysis and Forecasting for existing product line
Component | Remarks
Hadoop Distributed File System (HDFS) - Primary | Stores structured and semi-structured data (purchase history, product inventory lifetime, wish list, user information, social analytics data, etc.) in raw format.
Sqoop | Moves transactional data, product inventory lifetime, POS, wish list, etc. from the RDBMS into HDFS and back.
Flume | Collects, aggregates, and moves large amounts of product activity log and purchase log information.
Chukwa | Moves the log data generated on the primary HDFS to another HDFS for analysis.
Pig/Hive | Help write MapReduce scripts that put the data into a structured key-value format.
Mahout/R-Hadoop | Time series data consists of four components: trend, season, cycle, and noise. We need to estimate the trend and seasonal components (e.g. day of week, month of year) from the data for any specific region or location, and use them to forecast the future. The ML packages provide forecasting routines that are quick and work well in combination.
ZooKeeper | Coordinates common services such as naming, configuration management, and synchronization of data and services among NameNodes and DataNodes in Hadoop.
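The Mahout/R-Hadoop remark above describes estimating a trend and a seasonal component and using both to forecast. A minimal sketch of that idea follows, assuming an additive model and ignoring the cycle and noise components. It is not ARIMA and not a Mahout or R API; the function names and the quarterly demand figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope

def fit_trend(series, period):
    """Fit a linear trend to per-season averages, which removes seasonality.

    Needs at least two complete seasons in the series.
    """
    full = len(series) // period
    centers = [k * period + (period - 1) / 2 for k in range(full)]
    means = [sum(series[k * period:(k + 1) * period]) / period
             for k in range(full)]
    return fit_line(centers, means)

def seasonal_offsets(series, intercept, slope, period):
    """Average detrended value for each position within the season."""
    sums = [0.0] * period
    counts = [0] * period
    for t, y in enumerate(series):
        sums[t % period] += y - (intercept + slope * t)
        counts[t % period] += 1
    return [s / c for s, c in zip(sums, counts)]

def forecast(series, period, steps):
    """Extrapolate trend plus seasonal offset for the next `steps` points."""
    intercept, slope = fit_trend(series, period)
    season = seasonal_offsets(series, intercept, slope, period)
    n = len(series)
    return [intercept + slope * t + season[t % period]
            for t in range(n, n + steps)]

# Hypothetical quarterly demand for three years (units sold).
demand = [110, 97, 104, 101, 118, 105, 112, 109, 126, 113, 120, 117]
next_year = forecast(demand, period=4, steps=4)
print([round(v, 1) for v in next_year])  # [134.0, 121.0, 128.0, 125.0]
```

Fitting the trend on per-season averages rather than on the raw points keeps the seasonal swing from biasing the slope; a production forecast would also model noise, for example with the ARIMA approach named in the ML table.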
Problem: Customer Churn
Type of Data | Source of Data
Customer Purchase History | Transaction Database
Customer Complaints (rating, sentiment score, etc.) | Complaint data (NoSQL, e.g. MongoDB)
User Activity (page navigation, product catalogue visits) | Heat map, click data, navigation data, demographic (heat map tools: crazyegg, Google Analytics, etc.); Real Time Bidding (RTB) ad inventory data (e.g. Deltax)
User Activity (e.g. wish list, abandoned cart) | User Activity Logs
Comparative Product Analysis (reviews, price, product description, etc.) | Third-party vendor data, e.g. Compareraja.in, compare.buy.hatke.com
Customer Sentiment Score | Aggregated data from different social networks (Facebook, Twitter, Google+, Instagram)
Customer Loyalty | Transaction Database, User Activity Logs
Is it a Big Data Problem: Customer Churn?
Dimension | Big Data? | Remarks
Volume | Yes | As per the capacity planning, the total data generated over 7 years is 5932 TB.
Velocity | Yes | Speed of data generation and analysis for transactional, sentiment, and social network activities.
Variety | Yes | Use of incompatible and non-integrated data from heterogeneous sources such as customer purchase data, activity logs, and social networks.
ML Approach: Customer Churn
Machine Learning Model | Reasoning
Supervised | Our target is to determine whether a customer will churn or not.
Classification | The problem asks whether a customer will churn or not, which is a categorical outcome.
Binary | The outcome has exactly two classes: churn or no churn. [Decision Tree]
Unbiased | The initial probability of churn is taken to be equal for the positive and negative outcomes, so the problem falls under an unbiased model. [C5.0]
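Decision tree learners in the C4.5/C5.0 family choose splits with entropy-based criteria. The sketch below shows plain information gain on a binary churn label; it is an illustration of the idea rather than the C5.0 algorithm itself, and the customer records and feature names (`complained`, `abandoned_cart`) are hypothetical. The sample is balanced (4 churned, 4 retained) to mirror the equal-prior assumption in the table.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, feature, target="churned"):
    """Entropy reduction obtained by splitting `rows` on `feature`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_split(rows, features, target="churned"):
    """Feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, f, target))

# Hypothetical, balanced sample of customers.
customers = [
    {"complained": "yes", "abandoned_cart": "yes", "churned": "yes"},
    {"complained": "yes", "abandoned_cart": "no",  "churned": "yes"},
    {"complained": "yes", "abandoned_cart": "yes", "churned": "yes"},
    {"complained": "no",  "abandoned_cart": "no",  "churned": "yes"},
    {"complained": "yes", "abandoned_cart": "yes", "churned": "no"},
    {"complained": "no",  "abandoned_cart": "no",  "churned": "no"},
    {"complained": "no",  "abandoned_cart": "yes", "churned": "no"},
    {"complained": "no",  "abandoned_cart": "no",  "churned": "no"},
]

root = best_split(customers, ["complained", "abandoned_cart"])
print(root, round(information_gain(customers, root), 4))
```

A full tree applies the same selection recursively to each subset until the leaves are pure or too small to split.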
Big Data Components: Customer Churn
Component | Remarks
Hadoop Distributed File System (HDFS) - Primary | Stores structured and semi-structured data (purchase history, activity log, competitive analysis data, aggregated social data, RTB data, etc.) in raw format.
Sqoop | Moves transactional data, real-time wish list, and cart information into HDFS and back.
Flume | Collects, aggregates, and moves large amounts of product activity log and purchase log information.
Chukwa | Moves the log data generated on the primary HDFS to another HDFS for analysis.
Pig/Hive | Help write MapReduce scripts that put the data into a structured key-value format.
Mahout/R-Hadoop | To predict customer churn we can use the Decision Tree / C5.0 algorithm.
NLP Toolkit (nltk.org)/IBM Watson | Can be used to parse customer feedback and comments about products to produce sentiment scores and insight analysis data, and then feed the output to Hadoop.
ZooKeeper | Coordinates common services such as naming, configuration management, and synchronization of data and services among NameNodes and DataNodes in Hadoop.
[Figure: architecture diagram. Data sources: Product & Service Offerings, Customer Profile, Customer Feedback/Social Media, Account Transactions, Customer Service Logs & Surveys, Marketing Campaigns. Components: Big Data Infrastructure (Hadoop cluster, HDFS), NLP Data Processing, Analytics Systems, Visualization.]
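Assumptions
Type | Assumption | Value (M = millions, MB = megabytes) | Reference
Baseline Assumptions | No. of Online Customers | 100 M | http://goo.gl/hHb66n
Baseline Assumptions | Our Market Share | 25 M | Assume 25% share
Baseline Assumptions | No. of Products | 12 M |
Problem Space Assumptions | Customer Growth Rate | 40% | http://goo.gl/pm9ydJ
Problem Space Assumptions | Product Growth Rate | 15% | Avg
Problem Space Assumptions | Avg Monthly Transactions | 9 M | http://tinyurl.com/gw9dm43
Problem Space Assumptions | Avg Monthly Complaints | 0.12 M | Assume 0.01%
Data/Infrastructure | Avg Customer Info Size | 1 MB |
Data/Infrastructure | Avg Complaint Info Size | 0.5 MB |
Data/Infrastructure | Avg Data Node RAM Size | 8 GB |
Data/Infrastructure | Replica Factor | 3 |
Data/Infrastructure | Data Block Size | 128 MB |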
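To show how assumptions like these feed a capacity estimate, here is a rough sketch covering only two data types: customer records and complaints. It assumes yearly compounding of the customer growth rate, a constant monthly complaint volume, and decimal units (1 TB = 1,000,000 MB). None of these choices come from the deck, and the result is not meant to reproduce the 5932 TB total, which also includes transactions, logs, and social data.

```python
import math

MB_PER_TB = 1_000_000  # decimal units (an assumption)

def customer_storage_tb(customers_m, yearly_growth, mb_each, years):
    """Storage for customer records after `years` of compounded growth."""
    customers = customers_m * 1_000_000 * (1 + yearly_growth) ** years
    return customers * mb_each / MB_PER_TB

def complaint_storage_tb(monthly_m, mb_each, years):
    """Storage for complaints accumulated at a constant monthly rate."""
    complaints = monthly_m * 1_000_000 * 12 * years
    return complaints * mb_each / MB_PER_TB

def hdfs_blocks(raw_tb, block_mb):
    """Number of HDFS blocks needed for the raw (unreplicated) data."""
    return math.ceil(raw_tb * MB_PER_TB / block_mb)

YEARS = 7
REPLICA_FACTOR = 3
BLOCK_MB = 128

customers_tb = customer_storage_tb(25, 0.40, 1.0, YEARS)  # 25 M customers, 1 MB each
complaints_tb = complaint_storage_tb(0.12, 0.5, YEARS)    # 0.12 M per month, 0.5 MB each
raw_tb = customers_tb + complaints_tb
replicated_tb = raw_tb * REPLICA_FACTOR

print(f"customers:  {customers_tb:.2f} TB")
print(f"complaints: {complaints_tb:.2f} TB")
print(f"raw total:  {raw_tb:.2f} TB, replicated x{REPLICA_FACTOR}: {replicated_tb:.2f} TB")
print(f"blocks of {BLOCK_MB} MB: {hdfs_blocks(raw_tb, BLOCK_MB)}")
```

The replication factor multiplies the disk footprint but not the block count reported here, since the blocks are counted once per logical file.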