

ABSTRACT

One of the most important sales metrics for companies to track and analyze is bookings revenue. Quite simply, this represents how much (in both value and deal count) was booked by the customer partners. Companies that fail to give Wall Street accurate revenue forecasts see their stock price affected. Understanding these forecasts is essential for marketplace expansion and for determining the most effective product mix per opportunity. Forecasts are vital in estimating demand and creating effective channels. The CEO executive committee needs this information at two levels. First, they need to know how well the company is doing on a total sales basis. Second, they need to know how the respective sales orgs are doing against their quota and bookings goal. We develop a predictive forecast model trained on historical earnings data, aggregated by product type, category and region and split by services and product revenue. The model gives a range of expected revenue values for the forthcoming months and quarters.

SUPPORTING DATASETS

SYSTEM ARCHITECTURE

The raw data is ingested into the Snowflake data store, which stores all the Enterprise data. After preprocessing, the data is consumed by the predefined events of the compute cluster, a multi-node Apache Spark cluster on AWS EMR. The time-series model is developed on the Apache Spark cluster using the Spark Streaming and MLlib frameworks. We have leveraged the features of Apache Zeppelin and its compatibility with the Spark cluster for real-time visual analytics.

APPLICATION & USE CASES

We developed this model with two main applications in mind:
1. Obtain an understanding of the underlying data and the major trend points affecting the growth of the business.
2. Fit a predictive model and proceed to forecasting, monitoring, or even feedback and alerts.

The model is mainly developed for the following applications:
• Economic Forecasting
• Sales Forecasting
• Budgetary Analysis
• Stock Market Analysis
• Yield Projections
• Process and Quality Control
• Inventory Studies
• Workload Projections

However, because of the difficulty of aggregating data within the time available, this project limits its exploration to sales forecasting and bookings predictions.

CONCLUSION

Actuals Vs Predictions

REFINING THE MODEL

The project can be further improved by adding variable parameters within the same model using the ARIMA-X method and decision trees. Some additional parameters that can be introduced to this model are:
• Customer rep productivity
• Opportunity ranking
• Sales pipeline

ACKNOWLEDGEMENTS

• Shantanu Biswas & Badrinarayan Jagannathan for their unending support, Pavan Rangavajhala, Sreenivasa Pocha, Suman Shanthakumar and the Enterprise Data Platform team.
• FB opensource community for their work on fbprophet.
• Apache Zeppelin and Apache Spark open source dev community.
• AWS Infrastructure team
• Ben Chen, Grace Ng and the University Talent Program team.

Forecasting Bookings using Machine Learning

Raghunandana Jayarama Reddy¹, Ameet Ubhayaker²
¹Intern, ²Director, Enterprise Data Platform, Juniper Networks, California, USA

DATA PREPROCESSING

Data preparation and preprocessing is an important requirement for this model and involves a lot of querying and ETL processing. Whenever data is gathered from different sources, it arrives in a raw format that is not suitable for analysis. All the data ingested from the core applications and 3rd party apps is stored in Snowflake, the Enterprise Data Store for ETL processing and analytics, through Amazon S3 buckets via middleware.

The data preprocessing step involves cleaning, integration, reduction and transformation, and is important mainly for the reasons below:
• Inaccurate data (missing data)
• The presence of noisy data (erroneous data and outliers)
• Inconsistent data

Data Integration

Data Cleaning

Data Reduction

Data Transformation (Normalization, Aggregation, Generalization)
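The cleaning and transformation steps named above can be illustrated with a minimal sketch. It is not the project's actual ETL code: the function names, the median-based outlier rule and the sample values are assumptions chosen only for illustration, and missing values are represented as None.

```python
from statistics import median


def clean_series(values, outlier_factor=3.0):
    """Fill missing values (None) and clip outliers in a list of numbers.

    Assumption: missing points are replaced by the median, and anything
    further than outlier_factor * MAD from the median is clipped.
    """
    observed = [v for v in values if v is not None]
    mid = median(observed)
    # Median absolute deviation, a spread estimate that outliers barely move.
    mad = median(abs(v - mid) for v in observed) or 1.0
    low, high = mid - outlier_factor * mad, mid + outlier_factor * mad
    cleaned = []
    for v in values:
        if v is None:
            v = mid  # data cleaning: fill the gap
        cleaned.append(min(max(v, low), high))  # clip noisy values
    return cleaned


def min_max_normalize(values):
    """Data transformation: rescale values to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical bookings values with one gap and one obvious outlier.
raw = [10.0, None, 12.0, 11.0, 500.0]
cleaned = clean_series(raw)
normalized = min_max_normalize(cleaned)
```

Clipping rather than dropping outliers keeps the series length unchanged, which matters when the points must stay aligned with their time stamps.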

Figure 3: System architecture diagram

Figure 1: Datasets used in the forecast analytics model

Figure 2: The data preprocessing workflow

1. Visual Interactive Pipeline: We introduce a visual-interactive system for generating time series preprocessing pipelines; the conceptual workflow is shown in Figure 3. In the following, we call the preprocessing pipeline a time series scenario. We choose a generalizable approach for time series preprocessing. Beginning with the selection of raw data, a variety of preprocessing operations can be added to the pipeline and (re-)arranged in arbitrary order.
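One way to picture a pipeline whose operations can be rearranged freely is a list of functions that all share the same input and output type. The sketch below assumes that design; the step names (drop_missing, moving_average, scale) are hypothetical and are not taken from the system described here.

```python
from typing import Callable, List, Optional

Series = List[Optional[float]]
Step = Callable[[Series], Series]


def drop_missing(series: Series) -> Series:
    """Remove missing points (None) from the series."""
    return [v for v in series if v is not None]


def moving_average(window: int) -> Step:
    """Build a step that averages the observed values in a trailing window."""
    def step(series: Series) -> Series:
        out: Series = []
        for i in range(len(series)):
            chunk = [v for v in series[max(0, i - window + 1): i + 1]
                     if v is not None]
            out.append(sum(chunk) / len(chunk) if chunk else None)
        return out
    return step


def scale(factor: float) -> Step:
    """Build a step that multiplies every observed value by factor."""
    def step(series: Series) -> Series:
        return [None if v is None else v * factor for v in series]
    return step


def run_pipeline(raw: Series, steps: List[Step]) -> Series:
    """Apply the steps in the given order; each step tolerates missing values."""
    series = list(raw)
    for step in steps:
        series = step(series)
    return series


raw = [1.0, None, 3.0, 5.0]
scenario_a = run_pipeline(raw, [drop_missing, moving_average(2), scale(10)])
scenario_b = run_pipeline(raw, [scale(10), moving_average(2), drop_missing])
```

Because every step accepts and returns the same series type and tolerates missing values, any ordering runs, although different orderings can give different results, as the two scenarios show.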

2. System and Views: We aim to make the different operations as exchangeable and compatible as possible. Hence, the data model of our input time series consists of a list of so-called time-value pairs, each containing a time stamp and a corresponding value. This data model can represent virtually all characteristics of time series data, such as non-equidistant time stamps or missing values. The user can select the favored normalization variant with a single click.
• Views based on month, quarter and year
  o Statistical view
  o Detailed view
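The time-value pair model and the month, quarter and year views can be sketched as follows. This is an illustration only: the type alias, the period-key format and the sample values are assumptions, and summation is just one possible aggregate for a view.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Optional, Tuple

# A time series is a list of (time stamp, value) pairs; None marks a missing value.
TimeValue = Tuple[date, Optional[float]]


def period_key(day: date, view: str) -> str:
    """Map a time stamp to the label of the month, quarter or year it falls in."""
    if view == "month":
        return f"{day.year}-{day.month:02d}"
    if view == "quarter":
        return f"{day.year}-Q{(day.month - 1) // 3 + 1}"
    if view == "year":
        return str(day.year)
    raise ValueError(f"unknown view: {view}")


def aggregate(pairs: List[TimeValue], view: str) -> Dict[str, float]:
    """Sum the observed values per period; missing values are skipped."""
    totals: Dict[str, float] = defaultdict(float)
    for day, value in pairs:
        if value is not None:
            totals[period_key(day, view)] += value
    return dict(totals)


# Hypothetical bookings with non-equidistant time stamps and one missing value.
pairs = [
    (date(2018, 1, 15), 100.0),
    (date(2018, 2, 3), None),
    (date(2018, 3, 30), 50.0),
    (date(2018, 4, 2), 25.0),
]
by_quarter = aggregate(pairs, "quarter")
by_year = aggregate(pairs, "year")
```

Storing explicit time stamps rather than positions is what lets the same list serve irregular spacing and gaps without any special casing.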

GENERAL PROCESS TO DERIVE FORECASTS

Figure 4: A generic workflow of the forecast model

MODEL FEATURES

RESULTS

Figure 5: Logistic regression compute results

Figure 6: Model training results

The following figures show the computation results for the algorithms run on the model. The visual plots were extracted from the Zeppelin notebook running on the EMR Spark compute cluster.

Figure 7a: Forecast series without regressors

Figure 7b: Forecast series with regressors

As seen in the figures above, the default range values are often not appropriate and seem very large, but they can be reduced when the seasonality needs to fit higher-frequency changes and generally be less smooth. The additional features implemented in the model, seen in Figure 7b, are custom seasonality trends, product sales in a specific geographic area, and identifying and normalizing the effect of outliers using additional regressors.
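The trade-off between smoothness and fitting higher-frequency changes can be made concrete with a small sketch. It assumes the custom seasonality follows the Fourier-series approach that fbprophet uses for seasonal components; the function name and the chosen periods are illustrative, not taken from the model itself.

```python
import math
from typing import List


def fourier_features(t_days: float, period: float, order: int) -> List[float]:
    """Return sin/cos terms for harmonics k = 1..order of a seasonal period.

    A higher order adds faster-oscillating terms, so the fitted seasonal
    curve can follow higher-frequency changes but becomes less smooth.
    """
    features = []
    for k in range(1, order + 1):
        angle = 2.0 * math.pi * k * t_days / period
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


# Illustrative values: a weekly period with two harmonics, and a yearly
# period with three harmonics evaluated on day 10.
weekly_day0 = fourier_features(0, 7, 2)
yearly_day10 = fourier_features(10, 365.25, 3)
```

A regression on these features, together with any additional regressor columns, yields the seasonal component; in fbprophet the number of harmonics is set through the fourier_order argument of add_seasonality.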

The prediction ranges appeared to be accurate, though with a percentage error (SSE) of up to ±10%. The prediction ranges are relatively large because of the comparatively small dataset and the number of parameters introduced to the model. However, these time series data are affected by a number of parameters, and the model cannot predict a change in values if the parameters are not predefined. The correctness of the values clearly depends on the definition of the variable parameters and the accuracy of the data.
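A comparison of actuals against predictions, such as the ±10% figure above, could be computed along the following lines. This is a sketch under assumptions: the poster does not state the exact error formula, so a signed percentage error relative to the actual value is used here, and the sample numbers are invented for illustration.

```python
from typing import List


def percentage_errors(actuals: List[float], predictions: List[float]) -> List[float]:
    """Signed error of each prediction as a percentage of the actual value."""
    return [100.0 * (p - a) / a for a, p in zip(actuals, predictions)]


def within_tolerance(actuals: List[float], predictions: List[float],
                     tolerance_pct: float = 10.0) -> bool:
    """True when every prediction is within +/- tolerance_pct of its actual."""
    return all(abs(e) <= tolerance_pct
               for e in percentage_errors(actuals, predictions))


def sum_squared_error(actuals: List[float], predictions: List[float]) -> float:
    """Sum of squared errors (SSE) between actuals and predictions."""
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions))


# Invented example values, not results from the poster.
actuals = [100.0, 200.0, 400.0]
predictions = [110.0, 190.0, 400.0]
errors = percentage_errors(actuals, predictions)
```

SSE is scale-dependent while the percentage error is not, so the two are best reported side by side rather than treated as the same quantity.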


3. User-support for Parameter Setting: We give details about the parameterization of a single module, which is a problem in itself. Each preprocessing module in the system provides ensembles of n alternative parameter values, as appropriate, where n is a user parameter. The time series arising from the alternative parameterizations are visualized as a line chart bundle in the detail view.
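The idea of an ensemble of n alternative parameter values can be sketched with a single smoothing module. Everything here is an assumption for illustration: the module is a trailing moving average, its only parameter is the window size, and the alternatives are consecutive window sizes around a base value.

```python
from typing import Dict, List


def moving_average(series: List[float], window: int) -> List[float]:
    """Trailing moving average; early points use the values available so far."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def alternative_windows(base: int, n: int) -> List[int]:
    """Return n consecutive window sizes around base, never smaller than 1."""
    start = max(1, base - n // 2)
    return list(range(start, start + n))


def ensemble(series: List[float], base: int, n: int) -> Dict[int, List[float]]:
    """Run the module once per alternative parameter value.

    The result maps each window size to its output series, i.e. the data
    behind one line of a line chart bundle.
    """
    return {w: moving_average(series, w) for w in alternative_windows(base, n)}


# Illustrative input series and a user-chosen n of 3 alternatives.
bundle = ensemble([2.0, 4.0, 6.0, 8.0], base=2, n=3)
```

Plotting every series in the bundle on one chart lets the user compare parameterizations visually before committing to one.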

Figure 8: Experimentation results for different datasets