TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS
Subscriber Data Mining for Business Reporting and Decision Making in
Telecommunications
[CT755]
By:
Bishal Timilsina (16209)
Bishnu Bhattarai (16210)
Narayan Prasad Kandel (16220)
Niroj Karki (16222)
A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS
AND COMPUTER ENGINEERING IN PARTIAL FULFILLMENT OF THE
REQUIREMENT FOR THE BACHELOR’S DEGREE IN COMPUTER
ENGINEERING
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING LALITPUR,
NEPAL
August, 2013
The undersigned certify that they have read, and recommended to the Institute of Engineering for
acceptance, a project report entitled "Subscriber Data mining for business reporting in
telecommunication" submitted by Bishal Timilsina, Bishnu Bhattarai, Narayan pd. Kandel, Niroj
Karki in partial fulfilment of the requirements for the Bachelor’s degree in Computer Engineering.
_________________________________________________
Supervisor, Babu Ram Dawadi
Lecturer
Department of Electronics and Computer Engineering, Pulchowk Campus
_________________________________________________
Co-Supervisor, Manoj Ghimire
Visiting Lecturer
Department of Electronics and Computer Engineering, Pulchowk Campus
__________________________________________________
Internal Examiner, Dr. Surendra Shrestha
Associate Professor
Department of Electronics and Computer Engineering, Pulchowk Campus
__________________________________________________
External Examiner, Ramesh Kumar Shreewastava
Unit Head - NOC
Ncell Private Limited
__________________________________________________
Coordinator, Dr. Aman Shakya
Deputy Head, Lecturer
Department of Electronics and Computer Engineering, Pulchowk Campus
DATE OF APPROVAL: 30 August 2013
COPYRIGHT
The author has agreed that the Library, Department of Electronics and Computer Engineering,
Pulchowk Campus, Institute of Engineering may make this report freely available for inspection.
Moreover, the author has agreed that permission for extensive copying of this project report for
scholarly purpose may be granted by the supervisors who supervised the project work recorded
herein or, in their absence, by the Head of the Department wherein the project report was done. It
is understood that the recognition will be given to the author of this report and to the Department
of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering in any use
of the material of this project report. Copying, publication or any other use of this report for
financial gain without the approval of the Department of Electronics and Computer Engineering,
Pulchowk Campus, Institute of Engineering and the author’s written permission is prohibited.
Request for permission to copy or to make any other use of the material in this report in whole or
in part should be addressed to:
Head
Department of Electronics and Computer Engineering
Pulchowk Campus, Institute of Engineering
Lalitpur, Kathmandu
Nepal
ACKNOWLEDGEMENT
We owe many thanks to the people who helped and supported us in making this project a
success.
We express our deepest thanks to Mr. Babu Ram Dawadi, lecturer, and Mr. Manoj Ghimire, visiting
lecturer, who guided us to adopt best practices during the project development phases.
We would also like to thank our institution and our faculty members, without whom this
project would not have been possible. We also extend our heartfelt thanks to our friends, seniors
and well-wishers.
Thanking You,
Bishal Timilsina 16209
Bishnu Bhattarai 16210
Narayan Prasad Kandel 16220
Niroj Karki 16222
ABSTRACT
Subscriber Data Mining for Business Reporting and Decision Making in Telecommunications is a
web application for implementing Customer Relationship Management (CRM) in
telecommunications. CRM is a model for managing a company’s interactions with current and
future customers, and data mining is an effective means of formulating CRM strategy in
telecommunication companies. In the present competitive scenario among telecommunication
companies, proper strategy formulation contributes greatly to a company’s success. CRM strategy
drives business and data mining processes, and analytical CRM applies data mining and data
warehousing concepts to the benefit of telecom companies.
In this project, visualizations following a business intelligence approach and data analysis are
carried out to give end users better insight into mobile subscribers and their use of services
with respect to age, gender, time, date and so on. Customer profiling is carried out in addition
to customer segmentation. Churn pattern analysis also helps a telecom company become aware of
churn behavior and helps to detect fraudulent behavior of customers based on their call
behavior.
Table of Contents
COPYRIGHT
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATION
1. INTRODUCTION
1.1. Motivation
1.2. Hypothesis
1.3. Objectives
1.4. Project Description
1.5. Overview of the Report
2. LITERATURE REVIEW
2.1. Related Research Work
2.2. Companies using Data Mining in CRM
3. THEORETICAL BACKGROUND
3.1. Customer Segmentation
3.2. Customer Profiling
3.3. Churn Prediction
4. TECHNICAL BACKGROUND
4.1. Data Loading through Extract, Transform & Load Process
4.2. K-means Clustering Algorithm & Bisecting K-means Clustering Algorithm
4.3. Recency Frequency & Monetary Model
4.4. Gaussian Distribution
4.5. Time Series Visualization
4.6. On-Line Analytical Processing & Snowflake Schema for Datamart
4.6.1. Data Cube
4.6.2. Data Mart
4.6.3. Multidimensional Data Models Schema
4.6.4. On-Line Analytical Processing
5. SYSTEM ANALYSIS
5.1. Requirements Analysis
5.1.1. Assumptions and Dependencies
5.1.2. High Level Requirements
5.1.3. Functional Requirements
5.1.4. Non Functional Requirements
5.2. Feasibility Analysis
5.2.1. Operational Feasibility
5.2.2. Technical Feasibility
5.2.3. Economic Feasibility
6. SYSTEM DESIGN
6.1. Use case Modeling
6.1.1. Use case Modeling of ETL and Visualization Processes
6.1.2. Use Case Modeling of User Management
6.2. System Architecture
6.2.1. System Block Diagram
6.2.2. Data-mart & OLAP Design
6.3. Sequence Diagram
6.3.1. Sequence Diagram of login system
6.3.2. Sequence Diagram of Visualization Process
6.4. Interaction Diagram
6.5. Class Diagram
6.6. Activity Diagram
6.7. Deployment Diagram
7. IMPLEMENTATION
7.1. Data Collection
7.1.1. Call Detail Records
7.1.2. Customer Data
7.2. ETL Process
7.3. Implementing Customer Segmentation
7.4. Implementing Customer Profiling
7.5. Implementing Churn Prediction
7.6. Implementing Report Visualization
7.6.1. Demographic Visualization
7.6.2. Call Pattern Visualization
7.6.3. Time Series Visualization
7.7. Data Analysis through Clustering
7.8. Development Environment
7.9. Project Activities and Milestones
8. TESTING
8.1. Unit Testing
8.2. Integration Testing
8.3. Black Box Testing
8.4. Alpha Testing
8.5. Performance Testing
8.6. Documentation Testing
9. RESULTS & CONCLUSIONS
9.1. Customer Segmentation
9.2. Customer Profiling
9.3. Churn Prediction
9.4. Reporting
9.5. Data Analysis Through Clustering
9.6. Problem Faced
9.7. Conclusion
9.8. Limitation and Further Enhancement
10. BIBLIOGRAPHY
APPENDIX A: - DATASET SNIPPETS
APPENDIX B: - GANTT CHART
APPENDIX C: - OUTPUT SNAPSHOT
LIST OF FIGURES
Figure 6.1 System Design Phases
Figure 6.2 Use case Diagram of ETL and Visualization Process
Figure 6.3 Use case Diagram of User Management
Figure 6.4 System block Diagram
Figure 6.5 OLAP Design
Figure 6.6 Fact Table
Figure 6.7 Sequence Diagram for system login
Figure 6.8 Sequence Diagram of Visualization Process
Figure 6.9 Interaction Diagram for visualizing query output
Figure 6.10 Interaction Diagram of user registration
Figure 6.11 Class Diagram
Figure 6.12 Activity Diagram for Report Generation
Figure 6.13 Activity Diagram of User Validation
Figure 6.14 Deployment Diagram
Figure 7.1 Block Diagram of 2 phase clustering
Figure 7.2 Normal Distribution for Churn prediction
Figure 9.1 K-mean clustering using RFM method
Figure 9.2 Distribution of dialed call over day of week
Figure 9.3 Distribution of dial call in hour wise basis
Figure 9.4 Distribution of message sents
Figure 9.5 Gender vs Total call duration
Figure 9.6 Age group vs Total call duration for male subscriber
Figure 9.7 Age group vs Average call duration
Figure 9.8 Monthly average call duration for age group 40-45
Figure 9.9 Call count vs Hours of day
Figure 9.10 Call count vs Age group for 6 to 7pm
Figure 9.11 Clustering result of average call duration of day…
Figure A.1 Customer demographic data snippets
Figure A.2 CDR data snippets
Figure B.1 Part A Gantt Chart
Figure B.2 Part B Gantt Chart
Figure C.1 Age and Gender distribution of subscriber
Figure C.2 Interface for filtering data
LIST OF TABLES
Table 7.1 Project Activity and Milestones
Table 9.1 RFM Clustering
Table 9.2 Two-phase Clustering with demographic data
Table 9.3 Cluster comparisons regarding different attributes
LIST OF ABBREVIATION
BI Business Intelligence
CDR Call Detail Records
CHAID Chi-squared Automatic Interaction Detection
CRM Customer Relationship Management
DoB Date of Birth
ETL Extract, Transform, Load
GUI Graphical User Interface
IDE Integrated development environment
NTC Nepal Telecommunication Corporation
OLAP Online Analytical Processing
QoS Quality of Service
RFM Recency Frequency Monetary
SRS Software Requirement Specification
UI User Interface
1. INTRODUCTION
1.1. Motivation
It goes without saying that we now live in a business-dominated world. Our lives are influenced by
fluctuations in the economy. A hundred years ago, we did not have to care about what happened in
other countries because it would make no difference to our lives. Today the story is different:
an event in any related sector can have a significant impact. Without careful handling of the
situation, anyone could be the next victim, and telecommunication corporations are no exception
to this.
Communications companies are under intense pressure to keep operating costs, such as customer
service costs, low while growing customer share, improving customer retention and
increasing revenues with new service expansions. To meet these challenges, telecom companies
are increasing their investments in CRM strategies and software [1].
Customer segmentation means grouping similar customers together, based on many different
criteria. In this way it is possible to target each and every group depending on their
characteristics. Customer segmentation helps companies develop appropriate marketing
campaigns and pricing strategies. For example, it is possible to offer a special price or free
minutes to a certain group.
Customer segmentation is one of the most important data mining methodologies used in
marketing and CRM. It helps telecommunications companies to discover the characteristics of
their customers and to derive appropriate marketing activities from the information
discovered.
The main challenge in customer segmentation is choosing proper variables for
the segmentation process. The data to be mined should include both behavioral data (call detail
data) and demographic data (customer data) in order to produce better results.
Segmented marketing is a crucial factor for present-day telecommunication companies to survive in
this highly competitive environment. Segmentation allows the firm to better satisfy the needs of
its potential customers. Hence, the marketing concept calls for understanding customers and
satisfying their needs better than the competition [2]. For telecommunication
companies, customers are assets, so appropriate CRM software can benefit them a lot.
CRM software systems help telecom operators manage and control one of their biggest concerns
in light of increased competition: customer turnover. Customer analytics has
become a boon for the telecommunication industry and is now a standard part of CRM applications
such as Customer Service Management. CRM solutions for the telecom industry give telecoms
a competitive advantage by providing the tools to identify and retain profitable customers.
In this telecom environment, CRM software plays additional, specialized roles beyond
traditional activity management [1]. The need for a specialized analytical
CRM application that uncovers hidden patterns for strategy formulation is therefore indisputable.
1.2. Hypothesis
Sound telecommunication companies are those that are attentive to understanding customer
behavior, interacting with customers and delivering advanced, flexible services that meet Quality
of Service (QoS) requirements. To this end, analytical results obtained by mining
telecommunications datasets help greatly.
Behavioral data helps one to identify groups of customers who have similar calling patterns.
Identifying customers’ needs only from their demographic data does not produce much value in
the market. However, schemes based on behavioral data alone are much less
profitable to the telecommunication company than schemes based on customer
profile data along with behavioral data. Thus, behavioral data and demographic data are like
two sides of the same coin when formulating an attractive strategic roadmap for
telecommunication companies.
1.3. Objectives
The main goal of this project is to develop an application that segments telecom
subscribers in order to improve the relationship between telecom companies and subscribers.
To meet this goal, we have set the following specific objectives:
Visualize call pattern and behavior.
Classify customers for executing new campaigns and other profitable operations.
Utilize business process and strategy in service analysis for Business Intelligence in light
of customer relationship.
1.4. Project Description
The project aims to extract patterns in how the services offered by a telecom operator are
distributed from a multidimensional perspective. To achieve this high-level goal, we
need two categories of data: behavioral and demographic.
Behavioral data depict the call behavior of customers and comprise call detail
records over a period of time. Call details such as call duration, dialed numbers, time of call,
date of call and received calls form behavioral data. Demographic data, on the other hand,
comprise customer details such as name, address, date of birth and occupation.
Behavioral data helps one to identify groups of customers who have similar calling behaviors.
In this way it is possible to focus on what customers do rather than what they are [3]. However,
schemes based on behavioral data alone are much less profitable to the
telecommunication company than schemes based on customer profile data
along with behavioral data. This is why it is highly recommended to combine behavioral with
demographic data. For this reason, in this project we segment customers simultaneously on address,
age group, gender, time of call, call duration and so on.
Two-phase clustering is a suitable technique for clustering on both of the kinds of data mentioned
above. In the first phase, customers are clustered into segments according to
their RFM (Recency, Frequency, Monetary value) using K-means clustering on call detail
records. In the second phase, each cluster is partitioned again into new clusters using
demographic data. A profile is then formed for each resulting cluster and interpreted to produce
well-founded marketing schemes.
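The two phases described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the project's actual implementation: the record layout (subscriber, call date, charge), the min-max scaling step, the deterministic seeding of K-means with the first k points, and the use of a single demographic attribute (here an age-group label) for the second phase are all assumptions made for the example. All function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def rfm_features(cdrs, today):
    """Build {subscriber: (recency_days, frequency, monetary)} from
    (subscriber, call_date, charge) records. The record layout is assumed."""
    last, freq, money = {}, defaultdict(int), defaultdict(float)
    for sub, day, charge in cdrs:
        last[sub] = max(last.get(sub, day), day)
        freq[sub] += 1
        money[sub] += charge
    return {s: ((today - last[s]).days, freq[s], money[s]) for s in last}


def min_max(points):
    """Scale every column to [0, 1] so that no RFM component dominates."""
    cols = list(zip(*points))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1 for c in cols]
    return [tuple((v - lo) / sp for v, lo, sp in zip(p, lows, spans))
            for p in points]


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; the first k points seed the centroids
    (a simplification chosen to keep the sketch deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


def two_phase_segments(cdrs, demographics, today, k):
    """Phase 1: K-means on scaled RFM. Phase 2 (assumed here): split each
    RFM cluster by one demographic attribute, e.g. an age-group label."""
    rfm = rfm_features(cdrs, today)
    subs = sorted(rfm)
    labels = kmeans(min_max([rfm[s] for s in subs]), k)
    segments = defaultdict(list)
    for sub, label in zip(subs, labels):
        segments[(label, demographics[sub])].append(sub)
    return dict(segments)
```

Each key of the returned dictionary pairs an RFM cluster label with a demographic value, so a profile can then be computed per segment. A real second phase could instead run another clustering pass over several demographic attributes.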
Churn prediction and prevention is a challenging area for modern telecommunication
companies. In this project, a basic univariate analysis based on call diameter is carried out. The
call diameter of a subscriber is the number of unique transactions (i.e., dialed and/or received)
in the time span under consideration. It acts as one indicator of churn when the time interval
considered is significant. Detecting churn patterns among customers and the causes
of churn is of great value for retaining customers in a market of rapidly emerging
telecom offers.
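As a rough illustration of the call-diameter idea, the sketch below counts, for each number, the distinct counterpart numbers it dialed or received calls from within a time window, and then flags numbers whose diameter lies far below the mean. Reading "unique transactions" as distinct counterpart numbers, the (caller, callee, call date) record layout, and the z-score cutoff of -1.0 are all assumptions made for this example and are not taken from the project.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev


def call_diameters(cdrs, start, end):
    """Return {number: count of distinct counterparts} for calls dated
    within [start, end]. Records are assumed to be (caller, callee, date)."""
    contacts = defaultdict(set)
    for caller, callee, day in cdrs:
        if start <= day <= end:
            contacts[caller].add(callee)   # dialed
            contacts[callee].add(caller)   # received
    return {number: len(peers) for number, peers in contacts.items()}


def low_diameter_numbers(diameters, z_cutoff=-1.0):
    """Flag numbers whose diameter z-score is at or below z_cutoff.
    The cutoff is a hypothetical tuning parameter, not a project value."""
    values = list(diameters.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every diameter is identical, nothing stands out
    return sorted(n for n, d in diameters.items()
                  if (d - mu) / sigma <= z_cutoff)
```

A number that appears in no call within the window has no entry at all, so a production version would start from the full subscriber list and treat missing entries as a diameter of zero.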
1.5. Overview of the Report
In Chapter one, the underlying concepts behind building the CRM application for
telecommunication companies are discussed. The project is introduced and its
importance and scope are clarified.
In Chapter two, the state of the art of CRM applications for telecommunication companies is
presented. Numerous efforts have been made and are ongoing in this field.
In Chapter three, the conceptual foundation of the application is detailed. Theoretical
background such as customer segmentation, profiling and churn prediction is described in the
chapter.
In Chapter four, technical aspects are covered. The K-means clustering algorithm, RFM clustering
model, normal distribution, time series visualization and ETL processes that form the backbone of
the project are explained here.
In Chapter five, the system is analyzed to assure its feasibility on economic, technical and
operational grounds. Requirement elicitation, the most important part of system
analysis, is covered for both functional and non-functional requirements.
In Chapter six, the design of the system is presented using UML diagrams, including
class, deployment, sequence, activity and use case
diagrams.
In Chapter seven, implementation details of the project are given. It covers the stages from data
collection, through the ETL process, to clustering, analysis outcomes (including churn
analysis) and final visualizations of data (including time series).
In Chapter eight, the testing and debugging tasks performed are discussed.
In Chapter nine, the project is concluded with the results achieved in light of the objectives set
at the project’s commencement, along with room for improvement.
In Appendix A, a sample dataset is given.
In Appendix B, Gantt charts of the project time schedule are provided.
In Appendix C, output snapshots of the project are included.
The next chapter deals with the literature review of the project.
2. LITERATURE REVIEW
2.1. Related Research Work
In this competitive time, implementing CRM matters a great deal for a company’s prospects. There
are different methods of implementing CRM. Some CRM practices yield the best results
while others bring no significant benefit to the company. A company therefore has to
identify and implement the best of the relevant CRM practices.
2.2. Companies using Data Mining in CRM
CRM has become a leading business strategy in a highly competitive business environment. CRM
can be viewed as ‘managerial efforts to manage business interactions with customers by
combining business processes and technologies that seek to understand a company’s customers’.
Companies are becoming increasingly aware of the many potential benefits provided by CRM.
The telecom industry faces tremendous competition today with the emergence
of a number of vendors, each with unique brand propositions. In a market where the customer
has no dearth of choices and the cost of switching is minimal, a key strategy for companies
to retain customers is Customer Relationship Management. Within the ambit of CRM
applications, data mining is very popular in the telecom industry [4].
One successful CRM implementation in the telecom industry is that of the Canadian
telecommunication company Bell Canada. It gathers a large amount of data from
telephone, internet, wireless, voice over IP and digital television services. The success of the
CRM implementation at this company can be studied in three phases:
a) Pre CRM-Scenario
The solutions, business processes and methods employed prior to the CRM
solution did not meet the business needs. Bell Canada needed a
full-fledged customer-centric strategy that catered to the company’s requirements.
After scrutiny, the company decided to implement CRM
for its advantages. The main problem was that the existing disparate
solutions created a lot of extra work for employees and increased their task load.
This had resulted in a decrease in employee satisfaction and posed numerous problems.
In addition, Bell required the front-end and back-end operations of its shared services
center to be integrated, which could not be achieved through existing processes. Access
to current employee case status and reporting capabilities were also required
[5].
b) Implementing CRM
Bell adopted CRM customer service and support initiatives. The
CRM solution was deployed to more than 200 users in 2 months. The staff were trained
to use multi-language systems, which helped them immensely
when dealing with multilingual customers and customer data. The key elements
of the implementation were speed, data integration, ease of use and
more efficient reporting capabilities [5].
c) The Result
Bell ultimately saw better customer service
from its employees. Another advantage was the internal efficiency created within
the organization. The flexibility and customization traits of CRM enabled a reduction in
the total case volume. Its ease of use and adaptability also resulted in increased
integration of data between the systems. The entire implementation required less
time and was carried out with very little effort. Speed was a dominating factor in this
implementation. The organization was able to meet the business requirements it
needed so much [5].
In the context of Nepal, the two giant telecom vendors, NTC and NCell, use data mining
techniques to devise new policies and strategies. These two companies focus on
strategies aimed at customers of particular age groups and particular areas. Prioritizing
customers by age group and area helps them devise
strategies on a segmentation basis. Hence they can emphasize profitability and
customers of high monetary value.
3. THEORETICAL BACKGROUND
3.1. Customer Segmentation
Customer segmentation is the practice of dividing a customer base into groups of individuals
that are similar in specific ways relevant to marketing, such as age, gender, interests, spending
habits and so on. It allows a company to target specific groups of customers effectively and
allocate marketing resources to best effect. According to an article by Jill Griffin for Cisco
Systems, traditional segmentation focuses on identifying customer groups based on
demographics and attributes such as attitude and psychological profiles. Value-based
segmentation, on the other hand, looks at groups of customers in terms of the revenue they
generate and the costs of establishing and maintaining relationships with them [6].
Examples of common segmentation objectives include [7] :
Develop new products
Create segmented ads & marketing communications
Develop differentiated customer servicing & retention strategies
Target prospects with the greatest profit potential
Optimize your sales-channel mix
Segmentation is a way to have more targeted communication with customers. The process
of segmentation describes the characteristics of the customer groups (called segments or
clusters) within the data. Segmenting means dividing the population into segments according to
their affinity or similar characteristics. Customer segmentation is a preparation step for
classifying each customer according to the customer groups that have been defined [8].
Segmentation is essential to cope with today’s dynamically fragmenting consumer marketplace.
By using segmentation, marketers are more effective in channeling resources and discovering
opportunities. Constructing customer segmentations is not an easy task. Difficulties in making
a good segmentation are as follows [8]:
Relevance and quality of data: Relevant, high-quality data is essential to develop
meaningful segments. If the company has insufficient customer data, the customer
segmentation is unreliable and almost worthless. Alternatively, too much data can lead to
complex and time-consuming analysis. Poorly organized data (different formats, different
source systems) also makes it difficult to extract interesting information. Furthermore, the
resulting segmentation can be too complicated for the organization to implement
effectively. In particular, the use of too many segmentation variables can be confusing
and result in segments which are unfit for management decision making. On the other
hand, apparently effective variables may not be identifiable. Many of these problems
are due to an inadequate customer database.
Intuition: Although data can be highly informative, data analysts need to be
continuously developing segmentation hypotheses in order to identify the ’right’ data
for analysis.
Continuous process: Segmentation demands continuous development and updating as
new customer data is acquired. In addition, effective segmentation strategies will
influence the behavior of the customers affected by them; thereby necessitating
revision and reclassification of customers. Moreover, in an e-commerce environment
where feedback is almost immediate, segmentation would require almost a daily
update.
Over-segmentation: A segment can become too small and/or insufficiently distinct to
justify treatment as a separate segment.
One way to construct segments is to use data mining methods that belong to the
category of clustering algorithms. K-means clustering is used here to segment
telecommunications customers.
3.2. Customer Profiling
Customer profiling is a way to create a portrait of the customers to help companies make design
decisions concerning the service. Customer profiling provides a basis for marketers to
’communicate’ with existing customers in order to offer them better services and retain them.
This is done by assembling collected information on the customer such as demographic and
personal data. Customer profiling is also used to prospect new customers using external sources,
such as demographic data purchased from various sources. This data is used to find a relation
with the customer segmentations that were constructed before. This makes it possible to estimate
for each profile (the combination of demographic and personal information) the related segment
and vice versa. More directly, for each profile, an estimation of the usage behavior can be
obtained [8].
Depending on the goal, one has to select the profile that is relevant to the project.
A simple customer profile is a file that contains at least age and gender. If one needs profiles for
specific products, the file would contain product information and/or volume of money spent.
Customer features one can use for profiling are [8]:
Geographic: Are they grouped regionally, nationally or globally?
Cultural and ethnic: What languages do they speak? Does ethnicity affect their tastes or
buying behaviors?
Economic conditions, income and/or purchasing power: What is the average household
income or purchasing power of the customers? Do they have any payment difficulty? How much or
how often does a customer spend on each product?
Age and gender: What is the predominant age group of the target buyers? How many
children are in the family, and of what ages? Are more females or males using a certain service
or product?
Values, attitudes and beliefs: What is the customers’ attitude toward your kind of product
or service?
Life cycle: How long has the customer been regularly purchasing products?
Knowledge and awareness: How much knowledge do customers have about a product,
service, or industry? How much education is needed? How much brand building
advertising is needed to make a pool of customers aware of the offer?
Lifestyle: How many lifestyle characteristics about purchasers are useful?
Recruitment method: How was the customer recruited?
The choice of features also depends on the availability of the data. With these features, an
estimation model can be built. Using this model, companies can devise strategies
based on the developed profiles.
3.3. Churn Prediction
Churn rate, as it relates to mobile network carriers, is the percentage of subscribers in a given
time frame who cease to use the company's services for one reason or another. It is used as an
indicator of the health of a company's subscriber base. The lower the churn rate, the better the
outlook is for the company [9]. In the new economy, which provides unprecedented choice and
instant, global access to products and information, churn rate determines business earnings
and growth [10].
Churn rate can be represented in a number of ways, including [11]:
1. Number of customers lost
2. Percent of customers lost
3. Value of recurring business lost
4. Percent of recurring value lost
Even within a single method, the calculation can vary. For example, different practitioners
may choose to calculate the “churn rate” for a month in different ways. The most traditional
formula is the number of customers lost divided by the number of customers at the start
of the month. However, some businesses base their churn rate on the number of
subscribers at the end of the period instead of the beginning of the period. Concisely, churn can
be expressed as:
$$\text{Churn} = \frac{C_{\text{lost}}}{t \cdot C_{0}}$$
where $C_{\text{lost}}$ = number of customers cancelling service,
$t$ = time interval,
$C_{0}$ = number of customers at the beginning of the interval.
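The ratio above can be computed directly. The following is a minimal sketch, not the project's own code; the function name, parameter names and the sample figures are illustrative assumptions.

```python
def churn_rate(customers_lost, customers_at_start, interval=1.0):
    """Churn per unit of time: customers lost / (interval * customers at start).

    All names here are hypothetical; `interval` is measured in whatever unit
    the rate should be expressed in (for example, months).
    """
    if customers_at_start <= 0 or interval <= 0:
        raise ValueError("customers_at_start and interval must be positive")
    return customers_lost / (interval * customers_at_start)


# Illustrative figures only: 50 of 2000 subscribers lost in one month.
monthly = churn_rate(customers_lost=50, customers_at_start=2000)

# 150 subscribers lost over a three-month interval, expressed per month.
per_month_over_quarter = churn_rate(150, 2000, interval=3)

print(f"monthly churn: {monthly:.1%}")
print(f"average monthly churn over the quarter: {per_month_over_quarter:.1%}")
```

Basing the rate on subscribers at the end of the period instead, as some businesses do, only changes the value passed as the denominator.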
Churn prediction is currently a relevant subject in data mining and has been applied in the fields
of banking, mobile telecommunication, life insurance, and others. In fact, all companies that
deal with long-term customers can take advantage of churn prediction methods.
Models such as neural networks, logistic regression and decision trees are common choices of
data miners to tackle the churn prediction problem. These models are trained on
snapshots of churned customers and non-churned customers. The goal is to distinguish churners
from non-churners as well as possible. When new customers are presented, the model attempts
to predict to which class each customer belongs [12].
Churn prediction is especially important for telecommunications in the prevailing
competitive landscape. The ability to predict that a particular customer is at a high risk of
churning, while there is still time to do something about it, represents a huge additional potential
revenue source for every online business. Besides the direct loss of revenue that results from a
customer abandoning the business, the costs of initially acquiring that customer may not have
already been covered by the customer’s spending to date. In other words, acquiring that
customer may have actually been a losing investment. Furthermore, it is always more difficult
and expensive to acquire a new customer than it is to retain a current paying customer.
In order to succeed at retaining customers who would otherwise abandon the business, marketers
and retention experts must be able to:
(a) predict in advance which customers are going to churn and
(b) know which marketing actions will have the greatest retention impact on each particular
customer.
Armed with this knowledge, a large proportion of customer churn can be eliminated [13].
4. TECHNICAL BACKGROUND
4.1. Data Loading through Extract, Transform & Load Process
A data warehouse is loaded regularly so that it can serve its purpose of facilitating business
analysis. To do this, data from one or more operational systems needs to be extracted and copied
into the data warehouse. The challenge in data warehouse environments is to integrate, rearrange
and consolidate large volumes of data over many systems, thereby providing a new unified
information base for business intelligence.
The process of extracting data from source systems and bringing it into the data warehouse is
commonly called ETL, which stands for extraction, transformation, and loading. Note that ETL
refers to a broad process, and not three well-defined steps. The acronym ETL is perhaps too
simplistic, because it omits the transportation phase and implies that each of the other phases of
the process is distinct. Nevertheless, the entire process is known as ETL.
The methodology and tasks of ETL have been well known for many years, and are not
necessarily unique to data warehouse environments: a wide variety of proprietary applications
and database systems are the IT backbone of any enterprise. Data has to be shared between
applications or systems, trying to integrate them, giving at least two applications the same
picture of the world. This data sharing was mostly addressed by mechanisms similar to what we
now call ETL.
Extraction:
During extraction, the desired data is identified and extracted from many different sources,
including database systems and applications. Very often, it is not possible to identify the specific
subset of interest, therefore more data than necessary has to be extracted, so the identification
of the relevant data will be done at a later point in time. Depending on the source system's
capabilities (for example, operating system resources), some transformations may take place
during this extraction process. The size of the extracted data varies from hundreds of kilobytes
up to gigabytes, depending on the source system and the business situation. The same is true for
the time delta between two (logically) identical extractions: the time span may vary between
days/hours and minutes to near real-time. Web server log files, for example, can easily grow to
hundreds of megabytes in a very short period of time. In the telecommunication industry,
the same is true of call records, which are generated at a tremendous rate. Real-time
extraction is not addressed in our project. Nevertheless, the system is not restricted to a
static, limited dataset: there is explicit provision for extracting call detail records (CDRs).
In many cases, this first part of an ETL process, extracting the data from the source
systems, is the most challenging aspect of ETL, since extracting data correctly sets the stage for
the success of subsequent processes. The goal of the extraction phase is to convert the data into
a single format that is appropriate for transformation processing.
Transformation:
The transformation stage applies a series of rules or functions to the extracted data from the
source to derive the data for loading into the end target. This includes converting any measured
data to the same dimension (i.e. conformed dimension) using the same units so that they can
later be joined. Some data sources will require very little or even no manipulation of data. In
other cases, one or more transformations may be required to meet the business and technical
needs of the target database.
Loading:
After the raw data was cleaned and transformed to a consistent format, it was loaded into the
end target, here the data-marts, in the loading phase. Depending on the requirements of the
organization, this process varies widely.
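The three ETL phases described above can be illustrated end to end with a small sketch. It is an assumption-laden example rather than the project's pipeline: the raw record layout, column names and table name are hypothetical, and SQLite from the Python standard library stands in for the database used in the project.

```python
import csv
import io
import sqlite3

# Hypothetical raw CDR extract in text format; values are illustrative only.
RAW_CDR = """caller,callee,duration,unit
9801000001,9801000002,2,min
9801000003,9801000001,45,sec
9801000002,,30,sec
"""


def extract(text):
    """Extraction: read every raw record from the text source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: drop incomplete rows and convert durations to seconds."""
    cleaned = []
    for row in rows:
        if not row["caller"] or not row["callee"]:
            continue  # data cleaning: skip records with a missing party
        factor = 60 if row["unit"] == "min" else 1
        cleaned.append((row["caller"], row["callee"], int(row["duration"]) * factor))
    return cleaned


def load(conn, rows):
    """Loading: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS call_fact "
        "(caller TEXT, callee TEXT, duration_sec INTEGER)"
    )
    conn.executemany("INSERT INTO call_fact VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CDR)))
loaded = conn.execute(
    "SELECT caller, duration_sec FROM call_fact ORDER BY caller"
).fetchall()
print(loaded)
```

Converting every duration to seconds is an instance of bringing measured data to the same units so that records from different sources can later be joined.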
4.2. K-means Clustering Algorithm & Bisecting K-means Clustering Algorithm
Given an initial set of k means $m_1^{(1)}, \ldots, m_k^{(1)}$, the algorithm proceeds by
alternating between two steps:
Assignment step: Assign each observation to the cluster whose mean yields the least within-
cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance, this
is intuitively the "nearest" mean.
$$S_i^{(t)} = \left\{ x_p : \left\| x_p - m_i^{(t)} \right\|^2 \le \left\| x_p - m_j^{(t)} \right\|^2 \ \ \forall j,\ 1 \le j \le k \right\},$$
where each $x_p$ is assigned to exactly one $S^{(t)}$, even if it could be assigned to two or
more of them.
Update step: Calculate the new means to be the centroids of the observations in the new clusters.
$$m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{x_j \in S_i^{(t)}} x_j$$
Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster
sum of squares (WCSS) objective [14].
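The two alternating steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the initial means are drawn at random from the observations, ties go to the lowest-numbered cluster, an empty cluster keeps its previous mean, and the sample points are invented.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Centroid (coordinate-wise arithmetic mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)  # initial means: k of the observations
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, means[i]))
            clusters[nearest].append(p)
        # Update step: each mean becomes the centroid of its cluster.
        new_means = [mean(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:  # converged: assignments can no longer change
            break
        means = new_means
    return means, clusters


# Illustrative data: two well-separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
means, clusters = k_means(points, k=2)
for m, c in zip(means, clusters):
    print(m, sorted(c))
```

Like the basic algorithm, this sketch converges to a local optimum that depends on the initial means, so in practice it is usually run several times with different seeds.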
Algorithm for Bisecting K-means Clustering
1. Pick a cluster to split.
2. Find 2 sub-clusters using the basic K-means algorithm. (Bisecting step)
3. Repeat step 2, the bisecting step, for ITER times and take the split that produces the
clustering with the highest overall similarity.
4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached [15].
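The bisecting procedure above can be sketched as follows. Two choices in this sketch are assumptions that the steps leave open: the cluster picked for splitting is the one with the largest sum of squared errors (SSE), and "highest overall similarity" is approximated by the lowest total SSE of the two sub-clusters. `n_trials` plays the role of ITER, and the sample points are invented.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sse(points):
    """Sum of squared distances of the points to their centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, rng, max_iter=50):
    """Basic K-means with k = 2; returns the two halves as a pair of lists."""
    means = rng.sample(points, 2)
    halves = ([], [])
    for _ in range(max_iter):
        halves = ([], [])
        for p in points:
            side = 0 if sq_dist(p, means[0]) <= sq_dist(p, means[1]) else 1
            halves[side].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split; the caller discards it
        new_means = [centroid(h) for h in halves]
        if new_means == means:
            break
        means = new_means
    return halves


def bisecting_k_means(points, k, n_trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        target = max(candidates, key=sse)  # step 1 (assumed rule: largest SSE)
        best, best_cost = None, float("inf")
        for _ in range(n_trials):  # steps 2 and 3: bisect n_trials times
            left, right = two_means(target, rng)
            if not left or not right:
                continue
            cost = sse(left) + sse(right)
            if cost < best_cost:
                best, best_cost = (left, right), cost
        if best is None:
            break  # no usable split found for this cluster
        clusters.remove(target)
        clusters.extend(best)  # step 4: repeat until k clusters exist
    return clusters


# Illustrative data: three groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
clusters = bisecting_k_means(points, k=3)
for c in clusters:
    print(sorted(c))
```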
Complexity
The complexity of finding the optimal solution to the k-means clustering problem for
observations in d dimensions is as follows: if k and d (the dimension) are fixed, the problem
can be solved exactly in time $O(n^{dk+1} \log n)$, where n is the number of entities to be
clustered.
4.3. Recency Frequency & Monetary Model
Recency, Frequency & Monetary (RFM) is a method used for analyzing customer value. It is
commonly used in database marketing and direct marketing and has received particular attention
in retail and professional services industries.
RFM stands for
Recency - How recently did the customer take service?
Frequency - How often do they take service?
Monetary Value - How much do they spend?
Most businesses keep data about customer services. All that is needed is a table with the
customer name, the date of service and the total money spent. One methodology is to assign a
scale of 1 to 10, whereby 10 is the maximum value, and to stipulate the formula by which the
data fits the scale.
Alternatively, one can create categories for each attribute. For instance, the Recency attribute
might be broken into three categories: customers with purchases within the last 90 days; between
91 and 365 days; and longer than 365 days. Such categories may be arrived at by applying
business rules, or using a data mining technique, such as CHAID, to find meaningful breaks.
Once each of the attributes has appropriate categories defined, segments are created from the
intersection of the values. If there were three categories for each attribute, then the resulting
matrix would have twenty-seven possible combinations (one well-known commercial approach
uses five bins per attribute, which yields 125 segments). Companies may also decide to collapse
certain subsegments, if the gradations appear too small to be useful. The resulting segments can
be ordered from most valuable (highest recency, frequency, and value) to least valuable (lowest
recency, frequency, and value). Identifying the most valuable RFM segments can capitalize on
chance relationships in the data used for this analysis [16].
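The category-based variant of RFM can be sketched as follows. The recency breaks (90 and 365 days) come from the description above; the frequency and monetary thresholds, the function names and the sample customer are hypothetical and would in practice come from business rules or a technique such as CHAID.

```python
from datetime import date
from itertools import product


def recency_score(last_service, today):
    """3 = within the last 90 days, 2 = 91 to 365 days, 1 = longer than 365 days."""
    days = (today - last_service).days
    if days <= 90:
        return 3
    if days <= 365:
        return 2
    return 1


def band_score(value, low, high):
    """1 below `low`, 2 from `low` up to `high`, 3 above `high`."""
    if value < low:
        return 1
    if value <= high:
        return 2
    return 3


def rfm_segment(last_service, n_services, total_spend, today):
    """Segment label as a (recency, frequency, monetary) triple of scores."""
    return (
        recency_score(last_service, today),
        band_score(n_services, low=5, high=20),       # assumed frequency breaks
        band_score(total_spend, low=500, high=2000),  # assumed monetary breaks
    )


today = date(2013, 8, 30)
segment = rfm_segment(date(2013, 7, 15), n_services=25, total_spend=1200, today=today)
print(segment)

# Three categories per attribute give 3 * 3 * 3 possible segments.
all_segments = list(product((1, 2, 3), repeat=3))
print(len(all_segments))
```

Sorting the triples in descending order ranks segments from most valuable to least valuable.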
4.4. Gaussian Distribution
The Gaussian distribution is also called the normal distribution. A continuous random variable
X is said to have a univariate Gaussian distribution with parameters µ and σ if its
probability density function is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where e ≈ 2.71828, π ≈ 3.1416, µ is the mean and σ is the standard deviation.
Typical properties of the normal distribution f(x), with any mean µ and any positive standard
deviation σ, are as follows:
It is symmetric around the point x = µ, which is at the same time the mode, the median
and the mean of the distribution.[9]
It is unimodal: its first derivative is positive for x < µ, negative for x > µ, and zero only
at x = µ.
It has two inflection points (where the second derivative of f is zero and changes sign),
located one standard deviation away from the mean, namely at x = µ - σ and x = µ + σ.
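These properties can be checked numerically. The sketch below evaluates the density and a finite-difference estimate of its second derivative; the parameter values µ = 5 and σ = 2 and the helper names are illustrative assumptions.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of the univariate Gaussian with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


mu, sigma = 5.0, 2.0


def f(x):
    return gaussian_pdf(x, mu, sigma)


# Symmetry around the mean.
print(math.isclose(f(mu + 1.3), f(mu - 1.3)))

# Unimodality: the density at the mean exceeds the density on either side.
print(f(mu) > f(mu - 0.5) and f(mu) > f(mu + 0.5))

# Inflection at mu + sigma: the curvature changes sign from negative to positive.
inside = second_derivative(f, mu + 0.9 * sigma)
outside = second_derivative(f, mu + 1.1 * sigma)
print(inside < 0 < outside)
```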
4.5. Time Series Visualization
Visualizing data through charts reveals patterns far more clearly than reading the same data
in a table. Visualization is a powerful tool for monitoring system
performance, analyzing service traffic, helping merchants to optimize their business, or finding
new ways to combat fraud. It makes data analysis easier and helps analysts gain the insight
needed to make decisions quickly.
For decision-making, time is one of the important parameters. Time series visualization is a
visualization technique in which time is one of the parameters. It helps us
understand how a parameter changes over time. For example, in
telecommunications, seeing how call duration changes over time shows how customers use the
service and helps the company expand its services.
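As an illustration of preparing such a series, the sketch below averages call duration per hour of day and prints a crude text chart. The record format, the sample calls and the function name are hypothetical; a real dashboard would pass the resulting series to a charting component.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical call records: (start timestamp, duration in seconds).
calls = [
    ("2013-08-01 09:15:00", 120),
    ("2013-08-01 09:40:00", 60),
    ("2013-08-01 18:05:00", 300),
    ("2013-08-02 18:30:00", 180),
]


def average_duration_by_hour(records):
    """Return {hour of day: average call duration in seconds}, ordered by hour."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [sum of durations, call count]
    for timestamp, duration in records:
        hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
        totals[hour][0] += duration
        totals[hour][1] += 1
    return {hour: s / n for hour, (s, n) in sorted(totals.items())}


series = average_duration_by_hour(calls)

# One '#' per 30 seconds of average duration.
for hour, avg in series.items():
    print(f"{hour:02d}:00 {'#' * int(avg // 30)} {avg:.0f}s")
```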
4.6. On-Line Analytical Processing & Snowflake Schema for Datamart
4.6.1. Data Cube
A data cube is a multidimensional data model defined by dimensions and facts. Dimensions are
the perspectives or entities with respect to which an organization wants to keep records. Facts
are numeric measures. Each dimension may have a table associated with it, called a dimension
table. The fact table contains the names of the facts, or measures, as well as keys to each of the
related dimension tables. A data cube can be visualized as an n-dimensional geometric structure
formed of cuboids that represent various levels of summarization (the lowest level of
summarization corresponds to the base cuboid and the highest to the apex cuboid).
4.6.2. Data Mart
Unlike a data warehouse, which collects information about subjects that span the entire
organization, a data mart is a departmental subset of the data warehouse that focuses on
selected subjects, and thus its scope is department-wide. In data mart design, the star or
snowflake schema is popular.
4.6.3. Multidimensional Data Models Schema
Star, snowflake and fact constellation are schemas for the data cube. The schema followed in
this project is the snowflake schema. It refers to a schema in which some dimension tables are
normalized, thereby further splitting the data into additional tables, so that the resulting
schema graph forms a shape similar to a snowflake.
4.6.4. On-Line Analytical Processing
Data warehouse systems serve users or knowledge workers in the role of data analysis and
decision making. Such systems can organize and present data in various formats in order to
accommodate the diverse needs of different users. These systems are known as On-Line Analytical
Processing (OLAP) systems, in contrast to On-Line Transaction Processing (OLTP) systems,
which encompass operational day-to-day database systems for keeping records and simple
querying. The OLAP operations available include slicing, dicing, roll-up and drill-down.
Slice is the act of picking a rectangular subset of a cube by choosing a single value for one of
its dimensions, creating a new cube with one fewer dimension. The dice operation produces a
subcube by allowing the analyst to pick specific values of multiple dimensions. Drill Down/Up
allows the user to navigate among levels of data ranging from the most summarized (up) to the
most detailed (down). A roll-up involves summarizing the data along a dimension.
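The slice, dice and roll-up operations can be illustrated on a tiny in-memory cube. This is a sketch only: the dimensions, the call counts and the function names are hypothetical, and a real OLAP engine would run such operations against the data mart rather than a Python dictionary.

```python
from collections import defaultdict

# Hypothetical cube: (region, gender, month) -> number of calls.
DIMENSIONS = ("region", "gender", "month")
cube = {
    ("Kathmandu", "F", "Jan"): 120,
    ("Kathmandu", "M", "Jan"): 150,
    ("Kathmandu", "F", "Feb"): 130,
    ("Lalitpur", "F", "Jan"): 80,
    ("Lalitpur", "M", "Feb"): 95,
}


def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value and drop that dimension."""
    i = DIMENSIONS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice_cube(cube, **allowed):
    """Dice: keep cells whose coordinates fall in the chosen value sets."""
    wanted = {DIMENSIONS.index(d): set(vals) for d, vals in allowed.items()}
    return {
        k: v for k, v in cube.items()
        if all(k[i] in vals for i, vals in wanted.items())
    }


def roll_up(cube, dim):
    """Roll-up: summarize (sum) the measure along one dimension."""
    i = DIMENSIONS.index(dim)
    summary = defaultdict(int)
    for k, v in cube.items():
        summary[k[:i] + k[i + 1:]] += v
    return dict(summary)


january = slice_cube(cube, "month", "Jan")
kathmandu_women = dice_cube(cube, region=["Kathmandu"], gender=["F"])
by_region_gender = roll_up(cube, "month")
print(january)
print(kathmandu_women)
print(by_region_gender)
```

Drill-down is the inverse of the roll-up shown here: it returns from `by_region_gender` to the monthly cells of the original cube.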
5. SYSTEM ANALYSIS
Numerous data mining applications have been deployed in the telecommunications industry.
However, most applications fall into one of the following three categories [17]:
Marketing
Fraud detection
Network fault isolation and prediction.
Marketing applications are the ones that make a significant difference in the context of our
country. To meet marketing-related challenges, telecom companies are
increasing their investments in CRM strategies and software [1].
Customer segmentation is one of the key candidates for building
appropriate CRM strategies and software. Before designing any system, analysis of the system is
vital to cater to informational as well as architectural needs. System analysis is meant
to elicit the prevailing requirements and assess various facets of feasibility. These
two purposes are therefore detailed in light of our project, whose theme is customer
segmentation.
5.1. Requirements Analysis
Software requirements are captured in a Software Requirements Specification (SRS) document.
The SRS documents the various requirements the system must meet, both functional and
non-functional. Requirements pertaining to what the software does, what inputs modules need
to generate output, and so on are functional requirements, which can be analyzed at various
levels of granularity. In contrast, holistic requirements of the system such as reliability and
user-friendliness are non-functional requirements [18]. So, while functional requirements
determine the effectiveness of the system, non-functional requirements determine how efficient
the system is subject to constraints.
Many techniques exist for requirements elicitation. To gather the requirements, we adopted
the following approaches:
Expert Suggestion
Visit to a telecommunication company (NTC, NCell)
Interview method
Analysis of prevailing market of Nepalese telecommunication companies
5.1.1. Assumptions and Dependencies
Certain assumptions and dependencies exist in any software system. In our project, the
following assumptions have been made:
1. The raw dataset feed (either sample or the entire population) used would be in the text
format or any other format convertible to the text format.
2. The language supported is only English.
3. The system is a web application.
4. For the NTC dataset, average call duration represents the money charged for
each call.
5.1.2. High Level Requirements
A telecommunication company seeks to provide the services that subscribers demand most
and also to check the QoS delivered. In this regard, the main requirement of
this project is to develop an application that segments the customers of a telecommunication
company in order to improve the relationship with customers.
To fulfill this requirement, we have set the following specific actions:
Classification of customers for executing new campaigns and other profitable
operations.
Deduction of pricing optimization and service development (products inclusive)
strategies.
Utilization of business process and strategy in service analysis for Business Intelligence
in light of customer relationship.
5.1.3. Functional Requirements
The various functional requirements of the project are given below:
1. Reporting
The system is able to describe the dataset and then report on its status, for instance,
call traffic distribution by gender.
The system makes it possible to interpolate the behavior of one parameter with respect to
another, for example, how average call duration would change if calls were
discounted by a certain percentage for certain customer segments.
2. Visualization
The system will depict the dataset in graphical forms that are easily comprehensible.
It also shows various aggregates using BI.
Outputs of data mining are presented in a practically useful form.
3. Decision Making
It helps to prioritize customers after detailed analysis.
It helps to infer probable benefits and to discover attractive customer segments.
5.1.4. Non Functional Requirements
The various non-functional requirements of the project are given below:
1. Usability
The system has an authentication system.
The user interface is a simple, user-friendly GUI.
General acquaintance with data-mining terminology and its purpose is essential for
using the system.
2. Scalability
The system is highly scalable in the sense that if it can handle tens of records, it can
also handle thousands of customer records.
It can be utilized not only for one telecommunication company but also for several
of them.
3. Performance
As with any data-mart based project, response time is an important factor in the
performance of the system. Response time is affected by the following
sequence of actions that take place before visual output is generated:
OLTP Database Acquisition (conversion of available text format reducible data to
MySQL tables used by us)
Data Preprocessing (Transformation of the data in MySQL tables to the form suitable
for the data-mart designed)
Loading to data-mart (loading transformed data to data-mart)
Data Mining (applying data-mining algorithms on OLAP cube so obtained)
As the dataset grows rapidly in volume, loading time becomes the bottleneck for
response time. Thus, to optimize the system, an incremental loading approach is utilized.
Furthermore, to boost performance, incremental computation is implemented where
and when feasible.
4. Reliability, Extensibility and Security are other non-functional areas that are substantially
addressed in this project.
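The incremental loading and incremental computation mentioned under Performance can be illustrated with a small sketch. It is one possible reading, not the project's implementation: it assumes each source record carries an increasing `id` that can serve as a watermark, and all names and values are hypothetical.

```python
def incremental_load(source_rows, mart, state):
    """Load only rows newer than the stored watermark and update running aggregates.

    `state` keeps the highest id loaded so far plus a running sum and count, so
    the average call duration is maintained without rescanning the data mart.
    """
    new_rows = [r for r in source_rows if r["id"] > state["last_id"]]
    for row in new_rows:
        mart.append(row)
        state["total_duration"] += row["duration"]  # incremental computation
        state["count"] += 1
    if new_rows:
        state["last_id"] = max(r["id"] for r in new_rows)
    return len(new_rows)


mart = []
state = {"last_id": 0, "total_duration": 0, "count": 0}

source = [{"id": 1, "duration": 60}, {"id": 2, "duration": 120}]
first = incremental_load(source, mart, state)   # loads both rows

source.append({"id": 3, "duration": 30})        # the source grows over time
second = incremental_load(source, mart, state)  # loads only the new row

average = state["total_duration"] / state["count"]
print(first, second, average)
```

Only the rows past the watermark are touched on each run, so loading cost follows the volume of new records rather than the size of the whole dataset.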
5.2. Feasibility Analysis
Feasibility Study incorporates the potential of the project with due consideration of three
perspectives- operational, technical and economic feasibility. Feasibility studies aim to
objectively and rationally uncover the strengths and weaknesses of an existing business or
proposed venture, opportunities and threats as presented by the environment, the resources
required to carry through, and ultimately the prospects for success [19]. Feasibility assessment
is carried out prior to the initiation of the project so as to evaluate whether the project being
considered is worthwhile. Considering the above-mentioned perspectives, a decision has to be
made whether to proceed with or terminate the project.
5.2.1. Operational Feasibility
Operational feasibility assesses how well a proposed system solves the problems and takes
advantage of the opportunities identified during scope definition, and how it satisfies the
requirements identified in the requirements analysis phase of system development [19].
Specifically, it deals with whether the proposed system covers the scope and requirements of
the project under consideration. Our project is a decision support system for the telecom
industry. The dashboard of the project presents the analysis in graphical views, so any
professional with basic knowledge of graphical visualization can operate it.
5.2.2. Technical Feasibility
Technical feasibility focuses on comparing the present technical resources of the
organization with the expected needs of the proposed system. Our project can be applied
without any technical difficulty. The technical and resource requirements are already
in place, so handling the system will not pose any difficulty. The system will also need
to be upgraded to introduce changes, but upgrade and maintenance will
not require additional resources.
5.2.3. Economic Feasibility
Economic feasibility determines the positive economic benefits that the system will provide
to the organization. It includes quantification and identification of all the benefits expected
[19]. As our system is a business decision support system, managers can formulate new strategies
from the results of the analysis. This reduces decision-making time and helps in bringing out
new schemes faster than competitors. Thus our system can assist in lifting the revenue of the
organization.
6. SYSTEM DESIGN
Our project can be divided into 3 phases. First, we perform data preprocessing in order
to integrate different data sets and to clean the missing values. We also apply discretization to
our data when it is necessary for our analysis tasks. In the second phase, high-level data
descriptions are produced. We use different data characterization techniques to get a better
understanding of what the data distribution looks like and what general information we can
obtain before we proceed to more in-depth analysis. Finally, three data mining tasks are
conducted independently, using different methods to uncover more of the intrinsic
characteristics hidden in the data.
Figure 6.1 System Design Phases
Phase 1: Data Selection, Data Cleaning, Data Integration, Data Transformation, Data
Normalization, Data Discretization
Phase 2: Generalization, Analysis of Attribute Relevance, Attribute Removal, Attribute
Analysis
Phase 3: Select Attributes, Discretize, Correlation, Clustering, Classification.
[Figure 6.1 diagram. Phase 1, Data Preprocess: Missing Values, Integration, Discretization.
Phase 2, Data Characterize: Generalization, Attribute Analysis, Comparison.
Phase 3: Classify/Predict, Cluster, Visualize.]
6.1. Use case Modeling
6.1.1. Use case Modeling of ETL and Visualization Processes
The use case diagram for the system is shown below. It contains two actors: the database
administrator and the decision maker. The database administrator is responsible for loading
data into the database and transforming the data into the data mart, whereas the decision maker
is responsible for visualizing calling patterns by age and gender according to time and date.
Decision makers are also responsible for performing various clustering and association tasks.
Figure 6.2 Use case Diagram of ETL and Visualization Process
6.1.2. Use Case Modeling of User Management
The use case diagram for user management is shown below. The administrator user and the anonymous
user are generalized as an actor called ‘user’ in the diagram. The user passes through authentication.
If s/he is verified as an administrative person, then account maintenance, filtering and visualizing
functionalities are granted in addition to browsing the home page contents, the ‘contacts’ page
and the ‘about’ page of the website.
Figure 6.3 Use Case Diagram of User Management
6.2. System Architecture
6.2.1. System Block Diagram
The block diagram below shows the steps in application development. Each component is
described underneath:
Data Source:
It refers to the flat-file data about customers and call details in relational format. This data
is used for the construction of the data marts.
Data Pre-Processing:
It consists of the following sections:
i) Extraction: Here we gather data from multiple external sources.
ii) Data cleaning: It detects errors in the data and rectifies them when possible.
iii) Aggregation: Operations like dicing, slicing, roll up etc. are performed in this step.
iv) Analysis: Various statistical techniques like correlation, regression etc. are utilized to
analyze the dataset under consideration.
Figure 6.4 System Block Diagram
Data OLAP Repository:
It provides architectures and tools for business executives (here the telecommunication
company) to systematically organize, understand, and use their data to make strategic decisions
[20].
Mining Tools:
Techniques like clustering, classification etc. are used to find patterns within the dataset.
Visualization:
Various BI tools like bar charts, pie charts, graphs, decision trees etc. are used so that
the numerical results obtained from analysis and mining of the dataset can be interpreted easily.
6.2.2. Data-mart & OLAP Design
Call detail records are important marketing data that a mobile phone service provider can
analyze to improve customer relationships. These records can be used in conjunction with
subscriber demographic data in order to get better and more valuable results. All the data can be
put together and stored in an OLAP cube. In our case we use CDR data covering a 4-month period. The
OLAP cube representation of our data is shown below.
Figure 6.5 OLAP Design
Here the Subscriber dimension is broken into 2 different tables (the Subscriber table and the
Demographic table). Time of day is separated from the date dimension to avoid an explosion in the
date dimension row count [21]. Because Subscriber ID, Gender and Age are independent data, we keep
them in one dimension. This OLAP design can be used to answer some very
important marketing questions like:
What does the data tell us about patterns of calls during the day and during the night?
Is there any difference in mobile phone usage between men and women? And what about
different age groups?
The snowflake schema, consisting of a fact table and dimension tables, that represents the above
OLAP data mart design is shown below.
Figure 6.6 Fact Table
The call fact table contains the duration and phone number measures. The table has a grain of one
call (one row for every call originated over the network), which is the atomic level of detail
provided by the operational system. Loading the fact table with atomic data provides the greatest
flexibility because that data can be constrained and rolled up in every way possible [21].
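As a rough illustration of the design described above, the following sketch builds a tiny snowflake schema in an in-memory SQLite database and rolls the call-grain facts up to gender and day/night totals. SQLite only stands in for the project's MySQL database, and every table name, column name, sample row and the 6:00 to 17:59 "day" boundary is an assumption made for this example, not the schema of Figure 6.6.

```python
import sqlite3

# Hypothetical snowflake schema: the subscriber dimension is split into a
# subscriber table and a demographic table, and time of day is kept apart
# from the date dimension, as the text describes.
SCHEMA = """
CREATE TABLE demographic_dim (demographic_id INTEGER PRIMARY KEY,
                              district TEXT, zone TEXT);
CREATE TABLE subscriber_dim (subscriber_id INTEGER PRIMARY KEY,
                             gender TEXT, age INTEGER,
                             demographic_id INTEGER REFERENCES demographic_dim(demographic_id));
CREATE TABLE date_dim (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER, day INTEGER);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, hour INTEGER, minute INTEGER);
CREATE TABLE call_fact (call_id INTEGER PRIMARY KEY,
                        subscriber_id INTEGER REFERENCES subscriber_dim(subscriber_id),
                        date_id INTEGER REFERENCES date_dim(date_id),
                        time_id INTEGER REFERENCES time_dim(time_id),
                        called_number TEXT, duration_sec INTEGER);
"""


def build_demo_mart():
    """Create the schema and load a few made-up rows (one fact row per call)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO demographic_dim VALUES (1, 'Lalitpur', 'Bagmati')")
    conn.executemany("INSERT INTO subscriber_dim VALUES (?, ?, ?, 1)",
                     [(1, "M", 27), (2, "F", 31)])
    conn.execute("INSERT INTO date_dim VALUES (1, 2012, 8, 3)")
    conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)", [(1, 9, 15), (2, 20, 5)])
    conn.executemany("INSERT INTO call_fact VALUES (?, ?, ?, ?, ?, ?)", [
        (1, 1, 1, 1, "9800000001", 60),
        (2, 1, 1, 2, "9800000002", 120),
        (3, 2, 1, 2, "9800000003", 300),
    ])
    return conn


def duration_by_gender_and_period(conn):
    """Roll atomic call facts up to (gender, day/night) total duration."""
    return conn.execute("""
        SELECT s.gender,
               CASE WHEN t.hour BETWEEN 6 AND 17 THEN 'day' ELSE 'night' END AS period,
               SUM(f.duration_sec)
        FROM call_fact f
        JOIN subscriber_dim s ON s.subscriber_id = f.subscriber_id
        JOIN time_dim t ON t.time_id = f.time_id
        GROUP BY s.gender, period
        ORDER BY s.gender, period
    """).fetchall()
```

Because the fact table keeps one row per call, the same data could equally be rolled up by age, date or district just by changing the joined dimension and the GROUP BY columns.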
6.3. Sequence Diagram
6.3.1. Sequence Diagram of login system
A user visits the website. Only after the user logs into the MySQL server is s/he granted
administrative privileges to conduct administrative tasks. The administrator can perform any task
within the set of privileges granted. When s/he logs out of the system, the ‘MySQL server’
instance is destroyed, so no more administrative tasks are allowed.
Figure 6.7 Sequence Diagram for System login
6.3.2. Sequence Diagram of Visualization Process
The sequence diagram below shows that the decision maker first creates a connection to
the controller of the system and then sends a request to view a desired report, which the controller
forwards to the model. The model loads the data required for the operation from the OLAP store, the
records are returned to the controller, and the records are then displayed in the desired
form.
Figure 6.8 Sequence Diagram of Visualization Process
[Figure 6.8 labels: participants Decision Maker, Controller, Model, Database, with annotations (Template) and (View); messages Create Connection(), Request Report(), Request Data(), Return Data, Return Report, Display Records, Destroy Connection()]
6.4. Interaction Diagram
In the interaction diagram displayed below, the admin fires a query on the View. The
generated query is presented to the Model, which sends a request to the database based on the query.
The database then gives access to the data, which is passed back to the Model.
Depending on the requirements of the query, different aggregations are performed on the data. After
these operations, the data is presented to the View for OLAP visualization.
Figure 6.9 Interaction Diagram for Visualizing query output
[Figure 6.9 labels: Admin, View, Model, Database; messages 1. Generate Query, 1.1 Query, 1.2 Request Database, 1.3 Data, 1.4 Aggregate Data]
The interaction diagram shown below depicts the registration process that gives users
authority to access the key operations of the system. The first registered super user,
the admin, can add other users. The record details of the new user are added to the database and a
status message for the registration is displayed. Likewise, the admin can update the record of
a user; after the update, the status of the operation is displayed. Hence this diagram
illustrates the ability of the administrator to add a new user or update the record of a user.
Figure 6.10 Interaction Diagram of user registration
6.5. Class Diagram
A ‘call’ class consists of the call records of the subscribers. An arbitrary subscriber can, within
a prescribed time period, generate no calls, one call, or two or more calls, hence the
cardinality relationship portrayed in the class diagram. Similar to this one-to-many relation from
‘subscriber’ to ‘call’, the ‘time’, ‘date’ and ‘service’ classes have a one-to-many relation to
the ‘call’ class. This follows from the fact that for any given date, time or service key we can
have any number of call records, but the reverse is not valid. A subscriber has an address that
is modeled as a composition of the ‘district’, ‘zone’, ‘development region’ and ‘physical region’
classes.
[Figure 6.10 labels: Admin, Registration, Database; messages 1. Add User, 1.1 Add Record, 1.2 Status Message, 2. Update User Info, 2.1 Update Record, 2.2 Status Message]
Figure 6.11 Class Diagram
6.6. Activity Diagram
The activity diagram for the report generation process is shown below. Initially, we design the
OLAP store using a snowflake schema. Various OLAP operations like roll up, drill down etc. are then
performed. The interesting results are visualized in the dashboard, and appropriate reporting then
follows.
Figure 6.12 Activity Diagram of Report Generation
The activity diagram for user validation is shown below. Here, the system first checks
whether the user is registered or not. An unregistered user can see only the home,
about and contact pages, whereas a registered user can add users, delete users, load data,
visualize and cluster.
Figure 6.13 Activity Diagram of User Validation
6.7. Deployment Diagram
The runtime processing nodes of our system are represented in the figure below. It contains
three nodes, i.e. user, application and database.
1) User: It represents anyone who browses the website, either an administrator or an anonymous
user.
2) Application: It is the heart of the system, which acts as a bridge between the bulky database and
the user so as to provide understandable results. It authenticates users so as to identify
the administrator. Hence, unlike a general user visiting local web pages, the
administrator, after logging in, can:
a. Maintain users
b. Filter data
c. Cluster and mine data and
d. Visualize interesting results
All these are permitted by the ‘provider’, which is synonymous with the controller in the MVC
framework. The provider communicates with the ‘command’, which is the model of the application and
retrieves data from the database.
3) Database: It refers to the storehouse of all the detailed data on which the system runs, which in
our case is a MySQL database.
Figure 6.14 Deployment Diagram
7. IMPLEMENTATION
7.1. Data Collection
Data mining involves selecting, exploring and modeling large amounts of data to uncover
previously unknown patterns, and ultimately comprehensible information, from large databases.
So the first and foremost task in data mining is to collect the data sets. The results of data mining
make no sense on fake data sets, because each later mining step is driven by the previously mined
results. Keeping this in mind, we approached a telecommunication company with a request for access
to their data. After approving the request, they gave us CDRs in flat files. One flat file had
many fields while the other had fewer. Not all of the fields in the flat files were required in this
project, so we removed the unwanted fields. Furthermore, these flat files were
not in a format that could be inserted directly into the database. The datasets had to be cleaned and
preprocessed. After cleaning the data, we successfully loaded it into the database.
7.1.1. Call Detail Records
Every time a call is placed on a telecommunication network, descriptive information about the
call is saved as a call detail record. It includes sufficient information to describe the important
characteristics of each call. In our case, the record includes card-number, service-key,
calling-number, called-number, answer-time, clear-time, duration and more.
As the data mining process focuses on extracting knowledge about customers rather than individual
phone calls, we perform feature selection and feature creation in order to generate a summary
description of each customer based on the calls they originated, such as:
1. average call duration
2. % of weekend calls
3. % of daytime calls
These features can be used to distinguish between business and residential customers.
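The summary features listed above could be derived from one customer's call records roughly as follows. This is a minimal sketch under stated assumptions: the function name `summarize_customer`, the input layout (answer-time string plus duration in seconds), the weekend definition (Saturday and Sunday by default) and the 08:00 to 17:59 daytime window are all hypothetical choices for illustration, not values taken from the project.

```python
from datetime import datetime


def summarize_customer(calls, weekend_days=(5, 6), day_start=8, day_end=18):
    """Collapse one customer's calls into a single summary record.

    calls: list of (answer_time, duration_sec) tuples, with answer_time
    formatted like '2012/08/03 18:18:18'.
    weekend_days: weekday numbers (Monday = 0) treated as weekend (assumed).
    """
    if not calls:
        return {"avg_duration": 0.0, "pct_weekend": 0.0, "pct_daytime": 0.0}

    total_duration = 0
    weekend_calls = 0
    daytime_calls = 0
    for answer_time, duration in calls:
        ts = datetime.strptime(answer_time, "%Y/%m/%d %H:%M:%S")
        total_duration += duration
        if ts.weekday() in weekend_days:
            weekend_calls += 1
        if day_start <= ts.hour < day_end:
            daytime_calls += 1

    n = len(calls)
    return {
        "avg_duration": total_duration / n,
        "pct_weekend": 100.0 * weekend_calls / n,
        "pct_daytime": 100.0 * daytime_calls / n,
    }
```

A customer with a high share of daytime, weekday calls would then look more like a business customer, while a high weekend share would point towards a residential one.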
7.1.2. Customer Data
Telecom companies have millions of customers and keep information about each of them,
such as name, date of birth, address, gender and other details. This customer information can be
used in conjunction with call detail data to improve results.
The tentative database schema for these tables is given in the appendices. From the massive dataset,
we acquired a sample dataset of around 1000 customers with their call detail records for around
120 days.
7.2. ETL Process
Extraction:
In our context, the raw data of customers and their call records is available in text format,
i.e. a .txt file or any other form from which the data can be exported as such. To get the flat file
into the required format, the file was first obtained and converted into comma- or tab-delimited
form so that it could be loaded into the database. Using regular expressions, the non-data parts such
as headers, comments, runs of blank spaces and blank lines were omitted so that only the data parts
were kept. The resulting output was then mapped to the appropriate fields in the database, thus
completing the extraction phase.
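The filtering step described above might look something like the sketch below. The layout of the operator's flat files is not given in this report, so the rule used here (a data line starts with a digit, and fields are separated by a pipe, tab or comma) and the names `DATA_LINE` and `extract_rows` are assumptions for illustration only.

```python
import re

# Assumption: every data line begins with a digit (e.g. a card number),
# whereas headers, comments, separators and blank lines do not.
DATA_LINE = re.compile(r"^\s*\d")
# Assumption: fields are separated by '|', tab or ',' with optional padding.
DELIMITER = re.compile(r"\s*[|\t,]\s*")


def extract_rows(lines, n_fields):
    """Keep only well-formed data lines and split them into fields."""
    rows = []
    for line in lines:
        if not DATA_LINE.match(line):
            continue  # header, comment, blank line or separator
        fields = DELIMITER.split(line.strip())
        if len(fields) == n_fields:
            rows.append(fields)
    return rows


def to_csv_line(fields):
    """Re-emit one row in comma-delimited form for loading."""
    return ",".join(fields)
```

Rows with an unexpected number of fields are silently skipped in this sketch; a real loader would more likely log them for manual inspection.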
Transformation:
Various transformations were carried out so as to bring the data stored in the tables into a form
suitable for the data mart. Such transformations are done mainly for the following reasons:
To bring data to computable form
To represent it appropriately
To boost application-specific performance
Some examples of the transformations done are as follows:
Succinct gender representation using M for male and F for female
Segregation of a single date-time stamp like 2012/08/03 18:18:18 into the various inherent
fields that are required for future analysis, like year (2012), month (08), day (03) etc.
Computation of age from DoB to reduce overhead in age-wise segmentation.
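The three example transformations can be sketched in a few lines of Python. The function names and the accepted raw gender spellings are assumptions made for this illustration; only the M/F coding, the timestamp format and the age-from-DoB idea come from the text.

```python
from datetime import date, datetime


def encode_gender(raw):
    """Map free-text gender values to the succinct codes 'M' and 'F'."""
    value = raw.strip().lower()
    if value in ("m", "male"):
        return "M"
    if value in ("f", "female"):
        return "F"
    return None  # unknown or missing value


def split_timestamp(stamp):
    """Break a stamp like '2012/08/03 18:18:18' into its component fields."""
    ts = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
    return {"year": ts.year, "month": ts.month, "day": ts.day,
            "hour": ts.hour, "minute": ts.minute, "second": ts.second}


def age_on(dob, today):
    """Age in completed years on the given day."""
    years = today.year - dob.year
    # Subtract one year if this year's birthday has not been reached yet.
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years
```

Storing the derived year, month, day and age columns once at load time means later age-wise or month-wise queries do not have to recompute them for every record.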
Loading:
We loaded the data into the MySQL database using a Python script.
7.3. Implementing Customer Segmentation
In this project we implement customer segmentation via a 2-phase clustering method. First,
through K-means clustering, customers are clustered into different segments according to their
RFM values. Second, using demographic data, each cluster is again partitioned into new clusters [22].
In the first k-means clustering, customers are partitioned on the basis of their usage. After that,
for each cluster, we again apply k-means clustering on the basis of the age and gender information of
the subscribers. This information is later used in building customer profiles. The customer profiles
thus made help the telecommunication company to devise effective marketing strategies. Beyond
simply understanding customer value in each cluster, the telecom company gains the opportunity
to establish better customer relationship management strategies, improve customer loyalty and
revenue, and find opportunities for up-selling and cross-selling.
From the call detail records, average call duration and total call count are used as the monetary
and frequency values, respectively. These values are then normalized using the equation
$\frac{value - min\_value}{max\_value - min\_value} \times 10$
and k-means clustering is applied to the normalized values. With this approach, high-level,
medium-level and low-level customers were obtained. After that we again applied
clustering based on demographic data like age and gender, where gender is represented as 5 for
male and -5 for female. Thus customers with different profiles were obtained. Based on these profiles
we can apply marketing strategies focusing on specific groups. A block diagram representing
the overall process is shown below.
Here the number of clusters can either be supplied by the user or be derived from the sum of
squared errors. We have implemented the first approach, i.e. we cluster the data from the call detail
records into 6 parts and again divide each part into 3 on the basis of the demographic data. In the
end, 18 clusters were obtained.
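The two-phase procedure described above can be sketched roughly as follows. This is a minimal, dependency-free Python illustration and not the project's own code: the helper names (`scale_0_10`, `gender_code`, `kmeans`, `two_phase`), the seeded random initialization and the fixed iteration count are all assumptions. Only the 0 to 10 min-max scaling, the 5 / -5 gender coding and the idea of clustering usage first and demographics second are taken from the text.

```python
import random


def scale_0_10(values):
    """Min-max normalize one column to the range 0..10."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) * 10 for v in values]


def gender_code(gender):
    """Encode gender as in the text: 5 for male, -5 for female."""
    return 5 if gender == "M" else -5


def _sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k), key=lambda c: _sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def two_phase(usage, demographics, k1, k2):
    """usage: (avg_duration, call_count) rows; demographics: (gender_code, age) rows.

    Returns a (phase-1 label, phase-2 label) pair for every customer.
    """
    scaled = list(zip(*(scale_0_10(col) for col in zip(*usage))))
    first = kmeans(scaled, k1)
    labels = [None] * len(usage)
    for c in range(k1):
        idx = [i for i, lab in enumerate(first) if lab == c]
        if not idx:
            continue
        second = kmeans([demographics[i] for i in idx], min(k2, len(idx)))
        for i, s in zip(idx, second):
            labels[i] = (c, s)
    return labels
```

With `k1 = 6` and `k2 = 3` this would yield up to 18 labels, matching the cluster count reported in the text; the actual cluster contents of course depend on the data and on the initialization.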
Figure 7.1 Block diagram of two phase clustering
[Figure 7.1 labels: Call Detail Records, RFM, Customer Demographic Data, Feature Column, K-means clustering (twice), Two Phase Clustering, Cluster’s Profiles, Marketing Strategies]
7.4. Implementing Customer Profiling
Call detail records cannot be used directly for data mining, since the goal of data mining
applications is to extract knowledge at the customer level, not at the level of individual phone
calls. Thus, the call detail records associated with a customer must be summarized into a single
record that describes the customer’s calling behavior. To determine the behavior of an individual
customer we used the following parameters:
How? : How can a customer cause a call detail record? By making a voice call or
sending an SMS.
When? : When does a customer call? A business customer may call during office
hours in the daytime, or in private time in the evening, at night and during the weekend.
How long? : How long does the customer call?
How often? : How often does a customer call or receive a call?
From these parameters we generated features such as the day-wise received/dialed call pattern,
the hour-wise received/dialed call pattern, the hour-wise distribution of dialed calls and SMS sent,
the total duration of dialed calls and the count of messages sent. Based on these features we
developed a profile of the customer. Such a profile describes the call and message pattern of the
user over a period of time [8].
7.5. Implementing Churn Prediction
Using a two-tailed hypothesis test at the 10% level of significance, probable churners can be
anticipated by following these four steps:
Step 1: State the hypotheses. The null hypothesis in our case is that the given customer is not
a churning customer.
Step 2: Set the criteria for a decision. Call diameter (the call diameter of a given time interval is
defined as the number of unique subscribers from whom calls are received or to whom calls are
placed) was taken as the measure of call behavior relating to churn. The decision is taken at the 90%
confidence level. If the computed value of z is less than the tabulated value, it indicates
churn.
Step 3: Compute the test statistic. In our case, the test statistic is ‘z’, i.e. we conduct a z-test.
Step 4: Make a decision. The decision is made based on the region of the Gaussian curve in which the
computed point lies.
Figure 7.2 Normal distribution for churn prediction
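One possible reading of the four steps above is sketched below. The report does not spell out how z is computed, so this example makes several assumptions that should be treated as such: the subscriber's own past call diameters (one per interval) serve as the reference distribution, z is the standardized distance of the most recent diameter from their mean, and a value below the lower two-tailed critical value (about -1.645 at the 10% level) is read as probable churn. The function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def call_diameter(calls, subscriber):
    """Number of distinct other parties the subscriber called or was called by.

    calls: iterable of (calling_number, called_number) pairs for one interval.
    """
    others = set()
    for caller, callee in calls:
        if caller == subscriber:
            others.add(callee)
        elif callee == subscriber:
            others.add(caller)
    return len(others)


def churn_z(history, recent):
    """z-score of the recent call diameter against past intervals (assumed statistic)."""
    spread = stdev(history)
    if spread == 0:
        return 0.0
    return (recent - mean(history)) / spread


def is_probable_churn(history, recent, alpha=0.10):
    """Flag churn when z falls below the lower two-tailed critical value."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.645 for alpha = 0.10
    return churn_z(history, recent) < -critical
```

For example, a subscriber whose weekly call diameter hovered around 20 and then dropped to 10 would be flagged, while a drop to 19 would not.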
7.6. Implementing Report Visualization
Visualization is the process of presenting data stored in the database as different charts such as
bar charts, line charts and pie charts. The database contains all the data that we have stored, so
the problem here is to filter the required data from it. The data is first filtered and
retrieved from the database and then processed to get the result required for plotting the
chart. The processing at this point includes different aggregations such as sum, count,
average etc.
7.6.1. Demographic Visualization
In demographic visualization, the customers’ demographic data is visualized. Demographic
data includes name, address, age etc. All this demographic data is stored in the subscriber
table in our database. Demographic visualization is performed by reading data from the
subscriber table.
7.6.2. Call Pattern Visualization
In call pattern visualization, call duration and call count are plotted against age, gender, day
of week, hour of day, month etc. To do this, the required data is first filtered from
the database according to the parameters specified by the user. The data thus obtained is further
processed to get the result for either call duration or call count.
7.6.3. Time Series Visualization
Time series visualization is an effective means of visualizing the call usage pattern over
time. In telecommunication, a large number of CDRs are generated every instant. These
CDRs have to be visualized and then analyzed before any further strategies are made. To
implement time series visualization in our project, we used the googleVis package
available for the R language. The googleVis package supports visualization types such as bubble
charts, bar charts and line charts with animation of the data. To accomplish this visualization we
fed the data to this package and then specified the id, which is the mobile number in our case.
Finally, the package generates the chart on a web page when connected to the internet. With the
generated chart one can visualize the daily call count together with the total dialed duration on a
daily basis.
7.7. Data Analysis through Clustering
We perform various univariate cluster analyses via the k-means clustering algorithm. To
implement k-means clustering we use the stats package available in R. For
the data analysis, we first set a problem statement and then tried to verify the statement
through clustering. Here we also use the sqldf package to run SQL queries and then
perform clustering on the basis of the duration variables. The results of the clustering are
included in the results section.
In this project, we built a general interface through which a non-technical person is also able to
perform clustering analysis and visualize the result of the clustering as a graph. Initially, we
choose the optimal number of clusters, calculated from the sum of squared errors (SSE) curve, and
later the user can enter the number of clusters he wishes to see. The stopping criterion for finding
the optimal number of clusters is $|SSE_n - SSE_{n-1}| < 0.1 \times (SSE_1 - SSE_2)$, where $SSE_n$ is the SSE obtained with n clusters.
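The stopping rule on the SSE curve can be illustrated with a short sketch. It assumes that the criterion compares the drop in SSE gained by adding one more cluster with 10% of the initial drop from one to two clusters, and that the smaller cluster count is the one returned once the gain falls below that threshold; both points are an interpretation of the formula, and the function name `optimal_cluster_count` is hypothetical.

```python
def optimal_cluster_count(sse):
    """Pick a cluster count from an SSE curve.

    sse[i] is the sum of squared errors obtained with i + 1 clusters.
    The search stops at the first k where going from k to k + 1 clusters
    reduces the SSE by less than 10% of the drop from 1 to 2 clusters.
    """
    if len(sse) < 3:
        return len(sse)
    threshold = 0.1 * (sse[0] - sse[1])
    for k in range(2, len(sse)):
        gain = sse[k - 1] - sse[k]  # improvement from k to k + 1 clusters
        if gain < threshold:
            return k
    return len(sse)  # no flattening found within the supplied curve
```

For an SSE curve of 1000, 400, 200, 150, 140, 135 the threshold is 60, and the gain from 3 to 4 clusters (50) is the first to fall below it, so this sketch would suggest 3 clusters.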
7.8. Development Environment
Sublime Text editor [version 2.0.2] is used for Python and the RStudio IDE [version 0.97.311] is
used for R. We also use Git for maintaining the repository. MySQL Workbench [version
5.2.47] is used for maintaining the database.
The development hardware was as follows:
1. 1 laptop, Intel Core i7 with 4 GB RAM & 2 GB NVIDIA graphics
2. 1 laptop, Intel Core 2 Duo with 2 GB RAM
3. 1 laptop, Intel Core 2 Duo with 2 GB RAM and ATI graphics
4. 1 laptop, Intel Core i3 with 2 GB RAM
7.9. Project Activities and Milestones
The major milestones of the project are listed below.
Table 7.1 Project activities and milestones
S.N.  Milestone                              Date of Completion
1.    Project Analysis & Feasibility Study   Dec 15, 2012
2.    Data Collection                        Jan 25, 2013
3.    Data Pre-processing                    Feb 15, 2013
4.    Data Mart Design                       Feb 28, 2013
5.    Data Analysis & System Design          May 28, 2013
6.    Coding & Testing                       June 28, 2013
7.    Visualization                          July 2013
8.    Documentation                          August 2013
8. TESTING
The system has been tested since its inception for quality assurance. The traditional
approach of testing software only after completion of the project was not adopted;
rather, testing was carried out throughout the development time. The following are the
various testing steps implemented.
8.1. Unit Testing
Unit testing, as the name suggests, is a method for testing the smallest part of the
source code that can be termed a unit. It checks whether the unit is fit for use,
i.e. whether the unit is bug free. In a procedural programming language
like C, unit testing is used to test an individual function or procedure, whereas in
object-oriented programming it is used to test classes.
We have implemented a modular design where each component is independent and
swappable. So, we performed the unit tests on each of the elements separately.
8.2. Integration Testing
Integration testing is a systematic technique for constructing the program structure while at the
same time conducting tests to uncover errors associated with interfacing. The objective
is to take unit-tested components and build the program structure that has been dictated by the
design. The unit tests were repeated using the actual system components instead of the test
doubles. Due to the properly constructed interfaces, very little had to be done to turn the unit
tests into integration tests.
8.3. Black Box Testing
This testing was performed to see whether the outputs of the application were as
expected. The output page visuals, the formatting of the displayed digits and their
values were checked for validity, and necessary corrections were made accordingly.
8.4. Alpha Testing
The system was tested by the project developers individually and as a group so as to find errors.
The system was tested against the functional requirements specified in the SRS
document prepared during the system analysis phase. Functionality tests were carried out to
check whether the system satisfied the functional requirements documented in the SRS document.
8.5. Performance Testing
Performance testing was used to analyze the system behavior in various hardware and
software configurations. During the tests it was found that the loading time for
displaying the list of companies was somewhat long and needed some
consideration, while the performance of the other modules was acceptable. The system ran
smoothly on browsers such as Google Chrome, Internet Explorer and Mozilla Firefox.
8.6. Documentation Testing
The documents of each phase of the software development process were verified by the project
supervisor for their consistency. Each of the team members reviewed the documentation to
confirm the validity of its parts.
9. RESULTS & CONCLUSIONS
9.1. Customer Segmentation
First we performed k-means clustering on the RFM values and obtained the result shown below.
Table 9.1 RFM Clustering
Cluster Number   Avg_call_duration in minutes (m)   Total number of calls (f)   Number of customers
1 (Black)        8.29                               390                         8
2 (Red)          3                                  316                         78
3 (Green)        1.44                               287                         179
4 (Blue)         2                                  717                         41
5 (Cyan)         1.58                               465                         107
6 (Purple)       1.484                              1626                        6
The clusters are shown in the figure below.
Figure 9.1 K-means clustering using RFM method
We then applied k-means clustering again within each cluster; the result obtained is as follows.
Each cell gives the (gender, age) centroid of a sub-cluster followed by its number of customers.
Table 9.2 Two phase clustering with demographic data
Cluster1 (Black)   Cluster2 (Red)    Cluster3 (Green)   Cluster4 (Blue)   Cluster5 (Purple)   Cluster6 (Cyan)
(-5, 24) 1         (1, 61) 10        (5, 29) 104        (4.31, 27) 29     (3, 69) 10          (-5, 30) 2
(5, 29) 5          (2.17, 26) 39     (2.7, 57) 35       (5, 69) 3         (2.64, 43) 34       (5, 29) 1
(5, 59.5) 2        (3.62, 38) 29     (-5, 31) 40        (1.6, 40) 9       (3.25, 27) 63       (5, 26.33) 3
Finally, by analyzing all of the described features, a profile of each cluster can be
constructed. This profile is shown in the table below. We compare the clusters on different
attributes such as RFM rank, age rank and largeness rank (18 for high RFM value and 3 for low RFM
value, 18 for low age and 1 for high age, 18 for a large and 1 for a small cluster).
Table 9.3 Cluster comparisons regarding different attributes
Cluster No.   RFM Rank   Age Rank   Largeness Rank   Sum
11            15         18         2                35
12            15         11         7                33
13            15         4          4                23
21            9          3          9                21
22            9          17         15               41
23            9          8          11               28
31            3          12         18               33
32            3          5          14               22
33            3          9          16               28
41            12         14         12               38
42            12         2          6                20
43            12         7          8                27
51            6          1          10               17
52            6          6          13               25
53            6          15         17               38
61            18         10         3                31
62            18         13         1                32
63            18         16         5                39
For example, users in cluster 11 are placed in the eighteenth position regarding age rank. It means
that this cluster dominates the other clusters with respect to age. But the largeness rank of this
cluster is second to last, which means it is not beneficial to invest in them. In this way, cluster
comparison helps us to define schemes that may be effective for each group.
9.2. Customer Profiling
With the main objective of analyzing the dialed call duration over the days of the week, we first
queried the database to extract the required data. The visualization of the extracted data
is shown in figure 9.2. From this visualization we found that the dialed call
pattern of the user is not uniform across days. The subscriber dialed the most calls at weekends
and the fewest calls during mid-week.
Figure 9.2 Distribution of Dialed call by 9849291555 over day of week
To visualize the dialed calls on an hourly basis, we first aggregated the CDRs of one month by
hour. We then plotted the aggregated data, which gave the output shown in figure 9.3.
From the generated output we found that the subscriber’s peak call times are 7 am during
daytime and 8 pm during nighttime.
Figure 9.3 Distribution of Dialed call by 98489291555 Hour wise
To visualize the message sending pattern, we first extracted
the aggregated data from the database. Visualizing the data (figure 9.4), we found that
the subscriber’s maximum message traffic is at 8 pm. Apart from this time the subscriber sends very
few messages.
Figure 9.4 Distribution of Message send by 98489291574 Hour wise
9.3. Churn Prediction
We generated a list of card numbers in decreasing order of the likelihood of churn of the
subscribers present in our dataset. On selecting a card number, one can view the details of the
subscriber’s call behavior over the time span to see whether the subscriber exhibited customer
attrition. In constructing the model we used the oldest 70% of the records and tested the behavior
on the most recent 30% of the records with respect to time. The accuracy, computed as
$Accuracy = \frac{True\ Positive}{True\ Positive + False\ Positive} \times 100\%$
came to 65% on average over the tests performed, from which the model can be regarded as
satisfactory though not very robust.
9.4. Reporting
Reporting is the process of displaying the facts in the data through visualization techniques.
Different visualization reports are generated in our application on the basis of call count and call
duration against age, gender, day of week, month, hour of day etc. A two-level drill-down
is also maintained to provide the user with more information about a certain part of the
data. For example, the total call duration of male, female and all subscribers is first shown in
a graph as in figure 9.5, which makes it clear that most of the calls are made by males
and that females are less active. There are two possible reasons for this: either female subscribers
do not prefer calling for long durations or they are few in number. Here, the sample of data for
females is much smaller than that for males. Further, on clicking the bar for males, details about
the male subscribers can be viewed; the detail shown depends on what the user selects in the form.
Here, the detail is the age group. Most of the calls by males are made by those aged 20 to 35,
and the peak age group is 25 to 30, as shown in figure 9.6. Similarly, details for female and all
subscribers can also be visualized.
Figure 9.5 Gender vs Total call duration
Figure 9.6 Age Group vs Total call duration for male subscriber
By visualizing the average call duration of different age groups, it is found that the age group of
40 to 45 has the peak average call duration, as shown in figure 9.7. On further visualizing the
details of this age group, their average call duration is seen to increase month by month, as shown
in figure 9.8, which suggests that they are increasingly attracted to telecommunication
services.
Figure 9.7 Age Group vs Average Call Duration
Figure 9.8 Monthly Average Call Duration for age group 40-45
If the analysis is carried out on an hourly basis, the most calls are made in the morning and the
evening. In the morning, most calls are made from 8 am to 12 noon, and in the evening from 6 to
8 pm. The peak hour of the day is 7-8 pm, as shown in figure 9.9. On visualizing further details of
this peak hour, the 25-30 age group makes the most use of call services at that time, as shown in
figure 9.10.
Figure 9.9 Call Count vs Hours of day
Figure 9.10 Call Count vs Age Group for 6 to 7 pm
9.5. Data Analysis Through Clustering
Some results obtained from clustering, with their problem statements, are listed below.
1) Clustering average call duration over 24 hours.
Clustering on the above statement, we found that the average call in non-business hours is
longer than that in business hours. In business hours the male average call duration is less
than that of females, whereas in non-business hours the average call duration of males is
larger than that of females.
Figure 9.11 Clustering result of average call duration of day
2) Classifying total calls originated by each subscriber in a day according to gender
and age group.
Result: - Analyzing the above scenario, we found that males in the age group 25-40 have
higher call duration than the other age groups in non-business hours,
whereas the age groups <25, 25-40 and 40-55 have high call duration in business hours. The age
group above 55 has lower call duration than the other age groups. In business
hours males and females have similar calling behavior, but in non-business hours males
make longer calls than female subscribers.
3) Finding the received call pattern based on age group and gender in business and
non-business time.
Result: - Similar to dialed calls, received calls in non-business hours are longer than in
business hours. Males in the age group <25 have slightly higher receiving duration
than the others, and females in the age group 40-55 years have higher receiving
duration in business hours. In non-business hours a small number of males (>55 years) have
higher receiving duration, whereas among females the 40-55 age group is ahead of the other age
groups.
4) Value-based segmentation and analysis of customers based on dialed call duration.
Result: - Subscribers are first arranged in descending order of total dialed call
duration and then partitioned on that basis as Platinum (top 1%), Gold (4%),
Silver (15%), Bronze (40%) and Mass (40%). Because the number of active subscribers
was small, this partition did not give useful segments, so we instead classified
subscribers as Gold (10%), Silver (10%), Bronze (40%) and Mass (40%). The call
duration of these groups was then clustered and the resulting clusters were analyzed.
5) SMS sending pattern of business and non-business customers.
Result: - In this scenario we found that subscribers with many business calls have a
lower SMS sending rate than subscribers with few business calls.
6) Distribution of (total received call duration) / (total dialed call duration) according
to age group, gender and customer label.
Result: - In this scenario we found that males aged <25 in the Silver group dialed more
calls than they received, and females aged 25-40 in the Bronze group also dialed more
calls than they received.
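The value-based segmentation in item 4 and the received/dialed ratio in item 6 can be
sketched as follows. This is a minimal Python sketch under stated assumptions, not the
project's implementation: the subscriber records, the field names (dialed, received), the
functions age_group, assign_tiers and receive_dial_ratio, and the handling of the boundary
ages 40 and 55 are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Tier shares from the revised scheme in the text: Gold 10%, Silver 10%, Bronze 40%, Mass 40%.
TIERS = [("Gold", 0.10), ("Silver", 0.10), ("Bronze", 0.40), ("Mass", 0.40)]


def age_group(age):
    """Map an age to the report's groups. Boundary handling (40, 55) is an assumption."""
    if age < 25:
        return "<25"
    if age < 40:
        return "25-40"
    if age <= 55:
        return "40-55"
    return ">55"


def assign_tiers(dialed_seconds, tiers=TIERS):
    """Rank subscribers by total dialed duration (descending) and cut the ranking into tiers."""
    ranked = sorted(dialed_seconds, key=dialed_seconds.get, reverse=True)
    n = len(ranked)
    labels, start, cumulative = {}, 0, 0.0
    for name, share in tiers:
        cumulative += share
        # The last tier absorbs any rounding remainder.
        end = n if name == tiers[-1][0] else round(cumulative * n)
        for sub_id in ranked[start:end]:
            labels[sub_id] = name
        start = end
    return labels


def receive_dial_ratio(subscribers, labels):
    """Total received / total dialed duration per (gender, age group, tier)."""
    received, dialed = defaultdict(float), defaultdict(float)
    for sub_id, s in subscribers.items():
        key = (s["gender"], age_group(s["age"]), labels[sub_id])
        received[key] += s["received"]
        dialed[key] += s["dialed"]
    return {k: received[k] / dialed[k] for k in dialed if dialed[k] > 0}


# Hypothetical subscribers; durations are in seconds.
subscribers = {
    "S01": {"gender": "M", "age": 22, "dialed": 1000, "received": 400},
    "S02": {"gender": "F", "age": 30, "dialed": 900, "received": 950},
    "S03": {"gender": "M", "age": 35, "dialed": 800, "received": 700},
    "S04": {"gender": "F", "age": 45, "dialed": 700, "received": 900},
    "S05": {"gender": "M", "age": 60, "dialed": 600, "received": 300},
    "S06": {"gender": "F", "age": 28, "dialed": 500, "received": 650},
    "S07": {"gender": "M", "age": 41, "dialed": 400, "received": 380},
    "S08": {"gender": "F", "age": 19, "dialed": 300, "received": 310},
    "S09": {"gender": "M", "age": 52, "dialed": 200, "received": 150},
    "S10": {"gender": "F", "age": 67, "dialed": 100, "received": 90},
}

labels = assign_tiers({sub_id: s["dialed"] for sub_id, s in subscribers.items()})
ratios = receive_dial_ratio(subscribers, labels)
for key in sorted(ratios):
    print(key, round(ratios[key], 2))
```

A ratio below 1 for a group means its members dialed more than they received, which is the
pattern reported above for the two groups in item 6.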
9.6. Problem Faced
During the project, the main problems that arose were at the tactical level. Lacking a
standard design method, we had to repeat several steps in search of a better result. Lack of
domain knowledge was another problem faced during the project.
9.7. Conclusion
Analytical CRM applications are empowering telecommunication companies to retain and
attract customers, which eventually brings long-term competitive economic advantages.
Developing such applications for use at a telecommunication company's tactical and
strategic organizational levels is of great value. Mining and analysis should precede the
formulation and execution of the influential business and marketing strategies that determine
the company's future state.
Data volume in the telecommunication industry is growing massively. Retrieving value from the
dataset requires advanced analytics, and this project is an effort in that direction. In the project,
important customer segments can be segregated using clustering and classification techniques.
Churn pattern analysis helps to reduce subscriber churn by maintaining an appropriate
level of QoS. These results, when visualized systematically using BI visualization
techniques, facilitate business reporting. Besides, telecommunication decision makers could
use the outcomes obtained from the CDR data and demographic data of subscribers to gain better
insight into customers and their calling behaviors. Hence, telecommunication companies could
launch better marketing and customer relationship management strategies.
9.8. Limitation and Further Enhancement
A limitation of this project is that data loading time increases as the data size increases. Our
system also could not support all customizable queries for visualization and clustering.
In this project, we implemented a two-phase clustering method for customer profiling.
Customer segmentation could be made more accurate by combining the lifetime value of the
customer with the two-phase clustering results. The Apriori algorithm could also be used for
rule generation. The project could be further expanded with classification analyses such as
decision tree analysis for predicting the age group of a subscriber from time of call, gender, SMS
sending rate, etc. Gender prediction based on call duration and call time, call tariff rate
shifting analysis, competitor analysis and call network analysis could also be done. Finally,
the system could be implemented on a distributed system to decrease response time.
10. BIBLIOGRAPHY
[1] "Communications and Media Industry CRM Software Solutions," 23 April 2013. [Online].
Available: http://crmforecast.com/telecom.htm.
[2] "Market Segmentation," 23 April 2013. [Online]. Available:
http://www.netmba.com/marketing/market/segmentation/.
[3] D. Camilovic, "Data Mining and CRM in Telecommunications," Serbian Journal of
Management, pp. 61-72, 2008.
[4] N. Kapoor, "Optimizing CRM in Telecom with Data Mining," 23 April 2013. [Online].
Available: http://crmsolutions.crmnext.com/2012/09/optimizing-crm-in-telecom-with-
data.html.
[5] D. Chandrasekar, "CRM Success Chronicles: The Master Strokes," 23 April 2013.
[Online]. Available: http://dineshknowledgeplanet.blogspot.com/2010/10/crm-success-
chronicles-master-strokes.html.
[6] M. Rouse, "What is customer segmentation?," [Online]. Available:
http://searchcrm.techtarget.com/definition/customer-segmentation. [Accessed 25 August
2013].
[7] "What is customer segmentation?," [Online]. Available:
http://www.mindofmarketing.net/2007/05/customer-segmentation-why-exactly-
does.html. [Accessed 25 August 2013].
[8] S. Jansen, "Customer Segmentation and Customer Profiling for a Mobile
Telecommunications Company Based on Usage Behavior : A Vodafone Case Study,"
2007.
[9] "What is churn rate? - Definition," [Online]. Available:
http://www.mobileburn.com/definition.jsp?term=churn+rate. [Accessed 25 August
2013].
[10] "What is churn? definition and meaning," [Online]. Available:
http://www.businessdictionary.com/definition/churn.html. [Accessed 25 August 2013].
[11] "What is Churn-Rate?," [Online]. Available: http://www.churn-rate.com/. [Accessed 26
August 2013].
[12] L. Alberts, "Churn Prediction in the Mobile Telecommunications Industry," 2006.
[13] "Customer Churn Software: Prediction, Prevention, Analysis & Action | Optimove,"
[Online]. Available: http://www.optimove.com/learning-center/customer-churn-
prediction-and-prevention. [Accessed 23 August 2013].
[14] "k-means clustering," 7 August 2013. [Online]. Available: http://en.wikipedia.org/wiki/K-
means_clustering.
[15] M. Steinbach, G. Karypis and V. Kumar, "A Comparison of Document Clustering Techniques,"
Department of Computer Science and Engineering.
[16] "RFM (customer value)," [Online]. Available:
http://en.wikipedia.org/wiki/RFM_(customer_value). [Accessed 13 August 2013].
[17] G. M. Weiss, "Data Mining in the Telecommunications Industry," 28 April 2013. [Online].
Available:
http://mtscertification.com/data_mining/Telecom/telcom%20data%20mining.pdf.
[18] I. Sommerville, Software Engineering, Boston: Pearson Education, 2009.
[19] "Feasibility study," 23 April 2013. [Online]. Available:
http://en.wikipedia.org/wiki/Feasibility_study.
[20] J. Han, M. Kamber and J. Pei, Data Mining Concepts and Techniques, Waltham: Morgan
Kaufmann Publishers, 2012.
[21] R. Kimball and M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling, New York: John Wiley & Sons, 2002.
[22] M. R. G. K. Morteza Namvar, "Two Phase Clustering Method for Intelligent Customer
Segmentation," Tehran, Iran, 2010.
APPENDIX A: - DATASET SNIPPETS
Figure A.1 Customer Demographic Data Snippets
Figure A.2 CDR Data Snippets
APPENDIX B: - GANTT CHART
Project - Part A
Figure B.1 Part A Gantt Chart
Project - Part B
Figure B.2 Part B Gantt Chart
APPENDIX C: - OUTPUT SNAPSHOT
1. Age – Gender Distribution
The line chart below shows the age distribution of the subscribers in our data set, broken down
by gender. It shows that the number of male subscribers is higher than the number of female subscribers.
Figure C.1 Age and Gender Distribution of Subscriber
2. Filter Data
The interface below is the filter interface, from which the user can filter data from
the database and perform analysis.
Figure C.2 Interface for filtering data