Online Retail: A Look at a Data Consulting Approach

Retail Analytics

Online Retail Big Data Landscape

Product Recommendation

Problem: Product Recommendation

Type of Data | Source of Data
Product Information | Product Catalogue
Customer Information | Customer Data (demographic)
Customer Purchase History | Transactional Data (RDBMS and HDFS)
User Activity Tracking Data | Third-party hybrid cloud data: heat map, click data, demographic (heat map tools: Crazy Egg, Google Analytics, etc.); Real Time Bidding (RTB) ad inventory data (e.g. DeltaX); on-premise user activity logs, e.g. browser cookie and local storage data
Social Network Activity (analytics data about products, likes, usage, shares, no. of participants, etc.) | Social Network (Facebook, Twitter, Google+, Instagram)

Is it a Big Data Problem: Product Recommendation

Characteristic | Big Data? | Remarks
Volume | Yes | As per the capacity planning, the total data generated over 7 years is 5932 TB.
Velocity | Yes | Speed of generating and analyzing transactional and social network activity data; speed of capturing browser cookie data and of generating and analyzing structured/unstructured data.
Variety | Yes | Use of incompatible, non-integrated data from heterogeneous sources such as customer purchase data, activity logs, and social networks.

ML Approach: Product Recommendation

Machine Learning Problem | Reasoning
Unsupervised | The problem requires us to derive product recommendations from observed similarity among customers; there is no labeled outcome to learn from.
Clustering | We cluster customers with similar attributes (browsing patterns, purchase history, demographic information, and behavioral data) based on the observed data set. [K-means clustering]
Recommendation | Within a cluster, we use user-based collaborative filtering, since recommendations are driven by customer attributes. [User-based Collaborative Filtering]
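The recommendation step above can be sketched as follows. This is a minimal illustration, assuming the customers have already been grouped into one cluster; the customer names, products, and purchase counts are hypothetical, and cosine similarity is one common choice of similarity measure rather than something the slides prescribe.

```python
from math import sqrt

# Hypothetical purchase counts for customers already assigned to one cluster.
cluster_purchases = {
    "alice": {"shoes": 2, "socks": 1, "belt": 1},
    "bob": {"shoes": 1, "socks": 2, "hat": 3},
    "carol": {"belt": 2, "scarf": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse purchase vectors (dicts)."""
    shared = set(a) & set(b)
    dot = sum(a[p] * b[p] for p in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target, purchases, top_n=2):
    """Rank products the target has not bought, weighted by neighbour similarity."""
    scores = {}
    for other, basket in purchases.items():
        if other == target:
            continue
        sim = cosine_similarity(purchases[target], basket)
        if sim <= 0:
            continue  # ignore customers with no overlap
        for product, count in basket.items():
            if product not in purchases[target]:
                scores[product] = scores.get(product, 0.0) + sim * count
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, _ in ranked[:top_n]]


print(recommend("alice", cluster_purchases))
```

At production scale the same idea would run as a batch job (for example in Mahout) rather than in memory.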

Big Data Components: Product Recommendation

Component | Remarks
Hadoop Distributed File System (HDFS) - Primary | Stores structured and semi-structured data (activity logs, purchase history, user information, social analytics data, etc.) in raw format.
Sqoop | Moves transactional data into HDFS and back.
Flume | Collects, aggregates, and moves large amounts of user activity log data.
Chukwa | Moves log data generated on the primary HDFS to another HDFS for analysis.
Pig/Hive | Help write MapReduce scripts that put the data into a key-value structured format.
Mahout/R-Hadoop | Mahout's core algorithms for clustering, classification, and batch-based collaborative filtering can be used to generate product recommendations.
Zookeeper | Coordinates common services such as naming, configuration management, and synchronization of data and services among NameNodes and DataNodes in Hadoop.
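To make the "key-value structured format" in the Pig/Hive row concrete, here is a small in-memory sketch of the map and reduce phases. The log layout (customer_id,product_id,event) is a hypothetical assumption, and plain Python stands in for what Pig or Hive would compile to MapReduce jobs.

```python
from collections import defaultdict

# Hypothetical activity-log lines in the form "customer_id,product_id,event".
log_lines = [
    "c1,p10,view",
    "c2,p10,purchase",
    "c1,p11,view",
    "c3,p10,view",
]


def map_phase(line):
    """Emit a (product_id, 1) key-value pair for one log line."""
    _customer, product, _event = line.split(",")
    return product, 1


def reduce_phase(pairs):
    """Sum the values for each key, as a reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


event_counts = reduce_phase(map_phase(line) for line in log_lines)
print(event_counts)
```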

Demand Analysis and Forecasting

Problem: Demand Analysis and Forecasting for existing product line

Type of Data | Source of Data
Product Information | Product Catalogue
Customer Information | Customer Data (demographic)
Product Purchase Information, inventory lifetime, wish list, product sales volume | Transaction Data (RDBMS and HDFS)
Social Network Activity (analytics data about products, likes, usage, shares, no. of participants, etc.) | Social Network (Facebook, Twitter, Google+, Instagram)

Is it a Big Data Problem: Demand Analysis and Forecasting for existing product line

Characteristic | Big Data? | Remarks
Velocity | Yes | Speed of generating and analyzing transactional and social network activity data; speed of capturing product inventory lifetime, point of sale (POS), and sales volume data, and of generating and analyzing structured/unstructured data.


ML Approach: Demand Analysis and Forecasting for existing product line

Machine Learning Problem | Reasoning
Supervised | Our target is to determine the future demand for merchandise.
Prediction | We are predicting the future demand for merchandise.
Regression | Based on the observed data set, we predict future demand by establishing the correlation between the data and the outcome. [Linear Regression Tree]
Time Series | We establish a pattern of merchandise demand over continuous time intervals, based on the correlation between demand and the observed data set. [ARIMA parametric time series modeling]
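As a rough illustration of the regression row, the sketch below fits an ordinary least-squares line to past demand and extrapolates one period ahead. The monthly figures are hypothetical, and a single-variable line is a simplification that stands in for the regression model named above; it is not ARIMA.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical units sold in months 1 to 6.
months = [1, 2, 3, 4, 5, 6]
units_sold = [100, 110, 120, 130, 140, 150]

slope, intercept = fit_line(months, units_sold)
next_month_demand = slope * 7 + intercept
print(next_month_demand)
```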

Big Data Components: Demand Analysis and Forecasting for existing product line

Component | Remarks
Hadoop Distributed File System (HDFS) - Primary | Stores structured and semi-structured data (purchase history, product inventory lifetime, wish list, user information, social analytics data, etc.) in raw format.
Sqoop | Moves transactional data, product inventory lifetime, POS, wish list data, etc. into HDFS and back.
Flume | Collects, aggregates, and moves large amounts of product activity log and purchase log data.
Mahout/R-Hadoop | Time series data consists of four components: trend, season, cycle, and noise. We estimate the trend and seasonal components (e.g. day of week, or month of year) for a specific region or location from the data and use them to forecast the future. Used together, these ML packages allow quick and effective forecasting.
Zookeeper | Coordinates common services such as naming, configuration management, and synchronization of data and services among NameNodes and DataNodes in Hadoop.
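The Mahout/R-Hadoop row describes estimating the trend and seasonal components and using them to forecast. A minimal sketch of that idea follows, assuming an additive model (value = trend + seasonal offset) and ignoring the cycle and noise components. The sales history and the period of 4 are made up for illustration, and the history must cover at least one full period.

```python
def fit_trend(values):
    """Least-squares line through (t, value) for t = 0, 1, 2, ..."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    stt = sum((t - mean_t) ** 2 for t in range(n))
    stv = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    slope = stv / stt
    return slope, mean_v - slope * mean_t


def seasonal_offsets(values, slope, intercept, period):
    """Average detrended residual for each position within the season."""
    buckets = [[] for _ in range(period)]
    for t, v in enumerate(values):
        buckets[t % period].append(v - (intercept + slope * t))
    return [sum(b) / len(b) for b in buckets]


def forecast(values, period, steps):
    """Extend the trend line and add back the seasonal offset for each step."""
    slope, intercept = fit_trend(values)
    offsets = seasonal_offsets(values, slope, intercept, period)
    n = len(values)
    return [intercept + slope * t + offsets[t % period]
            for t in range(n, n + steps)]


# Hypothetical history: upward trend plus a repeating 4-step seasonal pattern.
history = [100 + 2 * t + [5, -5, -5, 5][t % 4] for t in range(12)]
print(forecast(history, period=4, steps=4))
```

A parametric model such as ARIMA would replace the straight-line trend here; the decomposition idea stays the same.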

Customer Churn

Problem: Customer Churn

Type of Data | Source of Data
Customer Purchase History | Transaction Database
Customer Complaints (rating, sentiment score, etc.) | Complaint data (NoSQL, e.g. MongoDB)
User Activity (page navigation, Product Catalogue visits) | Heat map, click data, navigation data, demographic (heat map tools: Crazy Egg, Google Analytics, etc.); Real Time Bidding (RTB) ad inventory data (e.g. DeltaX)
User Activity (e.g. wish list, abandoned cart) | User Activity Logs
Comparative Product Analysis (reviews, price, product description, etc.) | Third-party vendor data, e.g. Compareraja.in, compare.buy.hatke.com
Customer Sentiment Score | Aggregated data from different social networks (Facebook, Twitter, Google+, Instagram)
Customer Loyalty | Transaction Database, User Activity Logs

Is it a Big Data Problem: Customer Churn?

Characteristic | Big Data? | Remarks
Velocity | Yes | Speed of generating and analyzing transactional, sentiment, and social network activity data.


ML Approach: Customer Churn

Machine Learning Model | Reasoning
Supervised | Our target is to determine whether a customer will churn.
Classification | The problem asks whether a customer will churn, which is a categorical outcome.
Binary | The outcome has two possible values: churn or no churn. [Decision Tree]
Unbiased | The initial probabilities of churn and no churn are taken to be equal, so the problem falls under an unbiased model. [C5.0]
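To show what a decision tree does at a single node for this binary problem, the sketch below picks the split threshold with the highest information gain. The feature (complaints in the last quarter) and the sample rows are hypothetical. Tree learners in the C4.5/C5.0 family use entropy-based criteria of this kind, but this is a simplified illustration and not the C5.0 algorithm itself.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / total
        result -= p * log2(p)
    return result


def information_gain(rows, threshold):
    """Entropy reduction from splitting rows at feature <= threshold."""
    left = [churned for x, churned in rows if x <= threshold]
    right = [churned for x, churned in rows if x > threshold]
    all_labels = [churned for _, churned in rows]
    weight_left = len(left) / len(rows)
    return (entropy(all_labels)
            - weight_left * entropy(left)
            - (1 - weight_left) * entropy(right))


def best_threshold(rows):
    """Try each distinct feature value except the largest as a split point."""
    candidates = sorted({x for x, _ in rows})[:-1]
    return max(candidates, key=lambda t: information_gain(rows, t))


# Hypothetical rows: (complaints_last_quarter, churned)
customers = [(0, False), (1, False), (1, False), (3, True), (4, True), (5, True)]

split = best_threshold(customers)
print(split, information_gain(customers, split))
```

The sample has equal numbers of churned and retained customers, matching the equal-prior assumption in the Unbiased row.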

Big Data Components: Customer Churn

Component | Remarks
Hadoop Distributed File System (HDFS) - Primary | Stores structured and semi-structured data (purchase history, activity logs, competitive analysis data, aggregated social data, RTB data, etc.) in raw format.
Sqoop | Moves transactional data and real-time wish list and cart information into HDFS and back.
Flume | Collects, aggregates, and moves large amounts of product activity log and purchase log data.
Mahout/R-Hadoop | To predict customer churn we can use the Decision Tree / C5.0 algorithm.
NLP Toolkit (nltk.org) / IBM Watson | Parses customer feedback and comments about products to produce sentiment scores and insight data, then feeds the output to Hadoop.
Zookeeper | Coordinates common services such as naming, configuration management, and synchronization of data and services among NameNodes and DataNodes in Hadoop.
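The NLP row above turns free-text feedback into a sentiment score. As a toy stand-in for what NLTK or IBM Watson would do, the sketch below scores text against a tiny word lexicon. The word lists and the scoring formula are hypothetical assumptions chosen for illustration, not part of either tool.

```python
import re

# Hypothetical lexicons; a real deployment would use a trained model or a
# much larger curated word list.
POSITIVE = {"good", "great", "love", "fast"}
NEGATIVE = {"bad", "slow", "broken", "refund"}


def sentiment_score(text):
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    if positive + negative == 0:
        return 0.0  # no lexicon words found, treat as neutral
    return (positive - negative) / (positive + negative)


for comment in ["Great product, fast delivery",
                "Broken on arrival, slow refund",
                "It arrived"]:
    print(comment, sentiment_score(comment))
```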

[Architecture diagram. Data sources: Product & Service Offerings, Customer Profile, Customer Feedback/Social Media, Account Transactions, Customer Service Logs & Surveys, Marketing Campaigns. Big Data Infrastructure: Hadoop cluster, HDFS, NLP Data Processing. Outputs: Analytics Systems, Visualization.]

Assumptions (M = millions, MB = megabytes)

Baseline Assumptions
No. of Online Customers | 100 M
Our Market Share | 25 M
No. of Products | 12 M
Reference: http://goo.gl/hHb66n; assume 25% share

Problem Space Assumptions
Customers' Growth Rate | 40%
Growth Rate of Product | 15%
Avg Monthly Transactions | 9 M
Avg Monthly Complaints | 0.12 M
Reference: http://goo.gl/pm9ydJ; Avg http://tinyurl.com/gw9dm43; assume 0.01%

Data/Infrastructure
Avg Customer Info Size | 1 MB
Avg Complaint Info Size | 0.5 MB
Avg Data Node RAM Size | 8 GB
Replication Factor | 3
Data Block Size | 128 MB

Capacity Planning

Problem | Metric | Value
Product Recommendation | No. of Data Nodes | 23713
Product Recommendation | RAM Capacity | 2145 GB
Demand Forecasting | No. of Data Nodes | 23713
Customer Churn | No. of Data Nodes | 23724
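The worksheet behind these node counts is not reproduced here, so the following is only a rough sketch of the kind of arithmetic involved. It takes the 5932 TB total, the replication factor of 3, the 128 MB block size, and the 25 M customers growing at 40% from the slides. The 1 TB of disk per node and the 75% usable fraction are our own placeholder assumptions, so the result is not expected to match the figures above.

```python
import math


def customers_after(years, initial_millions, growth_rate):
    """Customer base (in millions) after compounding annual growth."""
    return initial_millions * (1 + growth_rate) ** years


def data_nodes_needed(raw_tb, replication, disk_tb_per_node, usable_fraction):
    """Nodes required to hold all replicas within the usable disk space."""
    return math.ceil(raw_tb * replication / (disk_tb_per_node * usable_fraction))


def block_count(raw_tb, block_mb):
    """Number of HDFS blocks before replication (1 TB = 1024 * 1024 MB)."""
    return math.ceil(raw_tb * 1024 * 1024 / block_mb)


# From the slides: 5932 TB over 7 years, replication factor 3, 128 MB blocks,
# 25 M customers growing 40% per year.
# Assumed here (not in the slides): 1 TB disk per node, 75% of it usable.
nodes = data_nodes_needed(raw_tb=5932, replication=3,
                          disk_tb_per_node=1, usable_fraction=0.75)
blocks = block_count(raw_tb=5932, block_mb=128)
customers = customers_after(years=7, initial_millions=25, growth_rate=0.40)
print(nodes, blocks, round(customers, 2))
```

Changing the per-node disk size or usable fraction shifts the node count proportionally, which is why those two inputs need to be fixed before the estimate means anything.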


Detailed Planning: Microsoft Excel Worksheet