

TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING

PULCHOWK CAMPUS

Subscriber Data Mining for Business Reporting and Decision Making in

Telecommunications

[CT755]

By:

Bishal Timilsina (16209)

Bishnu Bhattarai (16210)

Narayan Prasad Kandel (16220)

Niroj Karki (16222)

A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS

AND COMPUTER ENGINEERING IN PARTIAL FULFILLMENT OF THE

REQUIREMENT FOR THE BACHELOR’S DEGREE IN COMPUTER

ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING LALITPUR,

NEPAL

August, 2013


TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING

PULCHOWK CAMPUS

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

The undersigned certify that they have read, and recommend to the Institute of Engineering for acceptance, a project report entitled "Subscriber Data mining for business reporting in telecommunication" submitted by Bishal Timilsina, Bishnu Bhattarai, Narayan Pd. Kandel and Niroj Karki in partial fulfilment of the requirements for the Bachelor’s degree in Computer Engineering.

_________________________________________________

Supervisor, Babu Ram Dawadi

Lecturer

Department of Electronics and Computer Engineering, Pulchowk Campus

_________________________________________________

Co-Supervisor, Manoj Ghimire

Visiting Lecturer


__________________________________________________

Internal Examiner, Dr. Surendra Shrestha

Associate Professor


__________________________________________________

External Examiner, Ramesh Kumar Shreewastava

Unit Head - NOC

Ncell Private Limited

__________________________________________________

Coordinator, Dr. Aman Shakya

Deputy Head, Lecturer


DATE OF APPROVAL: 30 August 2013


COPYRIGHT

The author has agreed that the Library, Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering may make this report freely available for inspection. Moreover, the author has agreed that permission for extensive copying of this project report for scholarly purposes may be granted by the supervisors who supervised the project work recorded herein or, in their absence, by the Head of the Department wherein the project was done. It is understood that recognition will be given to the author of this report and to the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this project report. Copying, publication or any other use of this report for financial gain without the approval of the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering and the author’s written permission is prohibited.

Requests for permission to copy or to make any other use of the material in this report in whole or in part should be addressed to:

Head

Department of Electronics and Computer Engineering

Pulchowk Campus, Institute of Engineering

Lalitpur, Kathmandu

Nepal


ACKNOWLEDGEMENT

We owe a great many thanks to the people who helped and supported us in making this project a success.

We express our deepest thanks to Mr. Babu Ram Dawadi, lecturer, and Mr. Manoj Ghimire, visiting lecturer, who guided us to adopt best practices during the project development phases.

We would also like to thank our institution and our faculty members, without whom this project would not have been possible. We also extend our heartfelt thanks to our friends, seniors and well-wishers.

Thanking You,

Bishal Timilsina 16209

Bishnu Bhattarai 16210

Narayan Prasad Kandel 16220

Niroj Karki 16222


ABSTRACT

Subscriber Data Mining for Business Reporting and Decision Making in Telecommunications is a web application for implementing Customer Relationship Management (CRM) in telecommunications. CRM is a model for managing a company’s interactions with current and future customers, and data mining is an effective means of formulating CRM strategy in telecommunication companies. In the present competitive scenario among telecommunication companies, proper strategy formulation contributes greatly to a company’s success. CRM strategy drives business and data mining processes, and analytical CRM applies mining and warehousing concepts to the benefit of telecoms.

In this project, various visualizations and data analyses are carried out using a business intelligence approach. They give end users better insight into mobile subscribers and their service usage with respect to age, gender, time, date and so on. Customer profiling is carried out in addition to the segregation of customer segments. Besides, churn pattern analysis helps a telecom to be aware of churn behavior and also helps to detect fraudulent behavior of customers based on their call behavior.


Table of Contents

COPYRIGHT
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATION
1. INTRODUCTION
1.1. Motivation
1.2. Hypothesis
1.3. Objectives
1.4. Project Description
1.5. Overview of the Report
2. LITERATURE REVIEW
2.1. Related Research Work
2.2. Companies using Data Mining in CRM
3. THEORETICAL BACKGROUND
3.1. Customer Segmentation
3.2. Customer Profiling
3.3. Churn Prediction
4. TECHNICAL BACKGROUND
4.1. Data Loading through Extract, Transform & Load Process
4.2. K-means Clustering Algorithm & Bisecting K-means Clustering Algorithm
4.3. Recency Frequency & Monetary Model
4.4. Gaussian Distribution
4.5. Time Series Visualization
4.6. On-Line Analytical Processing & Snowflake Schema for Datamart
4.6.1. Data Cube
4.6.2. Data Mart
4.6.3. Multidimensional Data Models Schema
4.6.4. On-Line Analytical Processing
5. SYSTEM ANALYSIS
5.1. Requirements Analysis
5.1.1. Assumptions and Dependencies
5.1.2. High Level Requirements
5.1.3. Functional Requirements
5.1.4. Non Functional Requirements
5.2. Feasibility Analysis
5.2.1. Operational Feasibility
5.2.2. Technical Feasibility
5.2.3. Economic Feasibility
6. SYSTEM DESIGN
6.1. Use case Modeling
6.1.1. Use case Modeling of ETL and Visualization Processes
6.1.2. Use Case Modeling of User Management
6.2. System Architecture
6.2.1. System Block Diagram
6.2.2. Data-mart & OLAP Design
6.3. Sequence Diagram
6.3.1. Sequence Diagram of login system
6.3.2. Sequence Diagram of Visualization Process
6.4. Interaction Diagram
6.5. Class Diagram
6.6. Activity Diagram
6.7. Deployment Diagram
7. IMPLEMENTATION
7.1. Data Collection
7.1.1. Call Detail Records
7.1.2. Customer Data
7.2. ETL Process
7.3. Implementing Customer Segmentation
7.4. Implementing Customer Profiling
7.5. Implementing Churn Prediction
7.6. Implementing Report Visualization
7.6.1. Demographic Visualization
7.6.2. Call Pattern Visualization
7.6.3. Time Series Visualization
7.7. Data Analysis through Clustering
7.8. Development Environment
7.9. Project Activities and Milestones
8. TESTING
8.1. Unit Testing
8.2. Integration Testing
8.3. Black Box Testing
8.4. Alpha Testing
8.5. Performance Testing
8.6. Documentation Testing
9. RESULTS & CONCLUSIONS
9.1. Customer Segmentation
9.2. Customer Profiling
9.3. Churn Prediction
9.4. Reporting
9.5. Data Analysis Through Clustering
9.6. Problem Faced
9.7. Conclusion
9.8. Limitation and Further Enhancement
10. BIBLIOGRAPHY
APPENDIX A: - DATASET SNIPPETS
APPENDIX B: - GANTT CHART
APPENDIX C: - OUTPUT SNAPSHOT


LIST OF FIGURES

Figure 6.1 System Design Phases
Figure 6.2 Use case Diagram of ETL and Visualization Process
Figure 6.3 Use case Diagram of User Management
Figure 6.4 System block Diagram
Figure 6.5 OLAP Design
Figure 6.6 Fact Table
Figure 6.7 Sequence Diagram for system login
Figure 6.8 Sequence Diagram of Visualization Process
Figure 6.9 Interaction Diagram for visualizing query output
Figure 6.10 Interaction Diagram of user registration
Figure 6.11 Class Diagram
Figure 6.12 Activity Diagram for Report Generation
Figure 6.13 Activity Diagram of User Validation
Figure 6.14 Deployment Diagram
Figure 7.1 Block Diagram of 2 phase clustering
Figure 7.2 Normal Distribution for Churn prediction
Figure 9.1 K-mean clustering using RFM method
Figure 9.2 Distribution of dialed call over day of week
Figure 9.3 Distribution of dial call in hour wise basis
Figure 9.4 Distribution of message sents
Figure 9.5 Gender vs Total call duration
Figure 9.6 Age group vs Total call duration for male subscriber
Figure 9.7 Age group vs Average call duration
Figure 9.8 Monthly average call duration for age group 40-45
Figure 9.9 Call count vs Hours of day
Figure 9.10 Call count vs Age group for 6 to 7pm
Figure 9.11 Clustering result of average call duration of day…
Figure A.1 Customer demographic data snippets
Figure A.2 CDR data snippets
Figure B.1 Part A Gantt Chart
Figure B.2 Part B Gantt Chart
Figure C.1 Age and Gender distribution of subscriber
Figure C.2 Interface for filtering data


LIST OF TABLES

Table 7.1 Project Activity and Milestones
Table 9.1 RFM Clustering
Table 9.2 Two-phase Clustering with demographic data
Table 9.3 Cluster comparisons regarding different attributes


LIST OF ABBREVIATION

BI Business Intelligence

CDR Call Detail Records

CHAID Chi-squared Automatic Interaction Detection

CRM Customer Relationship Management

DoB Date of Birth

ETL Extract, Transform, Load

GUI Graphical User Interface

IDE Integrated Development Environment

NTC Nepal Telecommunication Corporation

OLAP Online Analytical Processing

QoS Quality of Service

RFM Recency Frequency Monetary

SRS Software Requirement Specification

UI User Interface


1. INTRODUCTION

1.1. Motivation

It goes without saying that we live in a business-dominated world. Our lives are influenced by fluctuations in the economy. A hundred years ago, we did not have to care about what happened in other countries because it made little difference to our lives. Nowadays, it is another story: any event occurring in a related sector has a significant impact. Without careful handling of the situation, anyone could be the next victim, and telecommunication corporations are no exception.

Communications companies are under intense pressure to keep operating costs, such as customer service costs, low while growing customer share, improving customer retention and increasing revenues through new service expansions. To meet these challenges, telecom companies are increasing their investments in CRM strategies and software [1].

Customer segmentation means grouping similar customers together, based on many different

criteria. In this way it is possible to target each and every group depending on their

characteristics. Customer segmentation helps companies develop appropriate marketing

campaigns and pricing strategies. For example, it is possible to offer a special price or free

minutes to a certain group.

Customer segmentation is one of the most important data mining methodologies used in marketing and CRM. It helps telecommunications companies discover the characteristics of their customers and derive appropriate marketing activities from the information discovered.

The main challenge in customer segmentation is choosing proper variables for the segmentation process. The data to be mined should include both behavioral data (call detail data) and demographic data (customer data) in order to produce better results.

Segmented marketing is a crucial factor for present-day telecommunication companies to sustain themselves in this highly competitive environment. Segmentation allows the firm to better satisfy the needs of its potential customers. Hence, the marketing concept calls for understanding customers and satisfying their needs better than the competition [2]. A telecommunication company’s customers are its assets, so appropriate CRM software can benefit it a lot.

CRM software systems help telecom operators manage and control the concern that presses on them most in light of increased competition: customer turnover. Customer analytics has become a boon for the telecommunication industry and is now a standard part of CRM applications such as Customer Service Management. CRM solutions for the telecom industry equip telecoms with a competitive advantage by providing the tools to identify and retain profitable customers.

In this telecom environment, CRM software plays additional, specialized roles beyond traditional activity management [1]. Hence, the need for a specialized analytical CRM application that uncovers hidden, valuable patterns for strategy formulation is indisputable.

1.2. Hypothesis

Sound telecommunication companies are those that are sensitive to understanding customer behavior, interacting with customers and delivering advanced, flexible services that meet Quality of Service (QoS) requirements. To this end, analytical results obtained by mining telecommunications datasets assist greatly.

Behavioral data helps one to identify groups of customers who have similar calling patterns. Identifying customers’ needs only from their demographic data does not produce much value in the market. However, schemes based on behavioral data alone are much less profitable to the telecommunication company than schemes that consider customer profile data along with behavioral data. Thus, behavioral data and demographic data are like two sides of the same coin when formulating an attractive strategic roadmap for telecommunication companies.

1.3. Objectives

The main goal of this project is to develop an application that segments telecom subscribers in order to improve the relationship between telecom companies and subscribers.

To fulfill this goal, we have set forward the following specific objectives:

Visualize call pattern and behavior.

Classify customers for executing new campaigns and other profitable operations.

Utilize business process and strategy in service analysis for Business Intelligence in light of customer relationship.

1.4. Project Description

The project aims to extract patterns in how the services offered by a telecom operator are distributed, viewed from a multidimensional perspective. To achieve this high-level goal, we need to acquire two categories of data: behavioral and demographic.

Behavioral data depict the call behavior of customers and comprise call detail records over a period of time. Call details such as call duration, dialed numbers, time of call, date of call and received calls form behavioral data. Demographic data, on the other hand, encompass customer data such as name, address, date of birth and occupation.

Behavioral data helps one to identify groups of customers who have similar calling behaviors. In this way it is possible to focus on what customers do rather than what they are [3]. However, schemes based on behavioral data alone are much less profitable to the telecommunication company than schemes that consider customer profile data along with behavioral data. This is why it is highly recommended to combine behavioral with demographic data. For this reason, in this project we simultaneously segment customers based on address, age group, gender, time of call, call duration and so on.

Two-phase clustering is a suitable technique for clustering on both of the kinds of data mentioned above. In the first phase, customers are clustered into segments according to their RFM (Recency, Frequency, Monetary) values using K-means clustering on call detail records. In the second phase, each cluster is partitioned again into new clusters using demographic data. A profile of each resulting cluster is then formed and finally interpreted to bring forth genuine marketing schemes.
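The two-phase idea can be illustrated with a minimal sketch. It is not the project's implementation: the record layouts, the sample values, the `charge` field used as the monetary value, the evenly spaced centroid seeding and the use of gender and ten-year age bands for the second phase are all assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import date

# Hypothetical CDR rows: (subscriber_id, call_date, charge)
CDRS = [
    ("A", date(2013, 6, 28), 12.0), ("A", date(2013, 6, 29), 15.0),
    ("A", date(2013, 6, 30), 9.0),
    ("B", date(2013, 6, 27), 14.0), ("B", date(2013, 6, 29), 11.0),
    ("B", date(2013, 6, 30), 16.0),
    ("C", date(2013, 4, 2), 1.0),
    ("D", date(2013, 4, 5), 2.0),
]
# Hypothetical demographic rows: subscriber_id -> (gender, age)
CUSTOMERS = {"A": ("F", 24), "B": ("M", 41), "C": ("F", 33), "D": ("F", 36)}


def rfm_table(cdrs, today):
    """Return {subscriber: (recency_in_days, frequency, monetary)}."""
    last, freq, money = {}, defaultdict(int), defaultdict(float)
    for sub, day, charge in cdrs:
        last[sub] = max(last.get(sub, day), day)
        freq[sub] += 1
        money[sub] += charge
    return {s: ((today - last[s]).days, freq[s], money[s]) for s in last}


def min_max_scale(table):
    """Scale each RFM column to [0, 1] so that no single unit dominates the distance."""
    columns = list(zip(*table.values()))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = {}
    for sub, row in table.items():
        scaled[sub] = tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        )
    return scaled


def kmeans(points, k, iterations=20):
    """Plain K-means. Seeds are evenly spaced over the sorted keys, a
    simplification chosen here for reproducibility."""
    keys = sorted(points)
    centroids = [points[keys[i * len(keys) // k]] for i in range(k)]
    labels = {}
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        for key in keys:
            dists = [sum((a - b) ** 2 for a, b in zip(points[key], c))
                     for c in centroids]
            labels[key] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for idx in range(k):
            members = [points[key] for key in keys if labels[key] == idx]
            if members:
                centroids[idx] = tuple(sum(col) / len(members)
                                       for col in zip(*members))
    return labels


def two_phase_segments(cdrs, customers, today, k=2):
    """Phase 1: K-means on RFM. Phase 2: split each cluster by demographics."""
    labels = kmeans(min_max_scale(rfm_table(cdrs, today)), k)
    segments = defaultdict(list)
    for sub, cluster in sorted(labels.items()):
        gender, age = customers[sub]
        low = age // 10 * 10
        segments[(cluster, gender, f"{low}-{low + 9}")].append(sub)
    return dict(segments)


SEGMENTS = two_phase_segments(CDRS, CUSTOMERS, today=date(2013, 7, 1))
print(SEGMENTS)
```

In this sketch the second phase is a simple grouping rather than a second clustering run; a real system could equally apply a clustering algorithm to the demographic attributes inside each RFM cluster.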

Churn prediction and prevention is a challenging arena for modern telecommunication companies. In this project, a basic univariate analysis based on call diameter is carried out. The call diameter of a subscriber represents the number of unique transactions (i.e., dialed and/or received) in the time span under consideration. It acts as one indicator of churn when the time interval considered is significant. Nowadays, detecting churn patterns among customers and the causes of churn is of immense value for retaining customers amid rapidly emerging telecom offers.
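A univariate analysis of this kind might be sketched as follows. The sketch assumes that "unique transactions" means the distinct numbers a subscriber dialed or received calls from within the period, and it flags subscribers whose diameter lies well below the mean; the sample calls, the function names and the z-score threshold of -1.5 are illustrative assumptions, not values taken from the project.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical CDR rows for one time span: (caller, callee)
CALLS = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "B"),
    ("B", "C"), ("B", "D"),
    ("C", "D"),
    ("E", "A"),
]


def call_diameters(calls):
    """Count the distinct numbers each subscriber dialed or received calls from."""
    contacts = defaultdict(set)
    for caller, callee in calls:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return {sub: len(peers) for sub, peers in contacts.items()}


def low_diameter_subscribers(diameters, z_threshold=-1.5):
    """Return subscribers whose call diameter is far below the population mean."""
    mu = mean(diameters.values())
    sigma = pstdev(diameters.values())
    if sigma == 0:
        return []  # every subscriber has the same diameter, nothing stands out
    return sorted(sub for sub, d in diameters.items()
                  if (d - mu) / sigma < z_threshold)


DIAMETERS = call_diameters(CALLS)
print(DIAMETERS)
print(low_diameter_subscribers(DIAMETERS))
```

A low diameter on its own is only a hint: a subscriber with few contacts may simply be a light user, so in practice the indicator would be tracked over a significant interval, as the text notes.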

1.5. Overview of the Report

In Chapter one, the underlying concepts behind building the CRM application for telecommunication companies are discussed. The project is introduced and its importance and scope are clarified.

In Chapter two, the state of the art of CRM applications for telecommunication companies is presented. Numerous efforts have been made, and are still ongoing, in this field.

In Chapter three, the conceptual foundation of the application is detailed. Theoretical background such as customer segmentation, profiling and churn prediction is described in the chapter.

In Chapter four, technical aspects are covered. The K-means clustering algorithm, the RFM clustering model, the normal distribution, time series visualization and the ETL processes that form the backbone of the project are explained here.

In Chapter five, the system is analyzed to assure its feasibility on economic, technical and operational grounds. Requirement elicitation, which forms the most important portion of system analysis, is dealt with by referring to both functional and non-functional requirements.

In Chapter six, the design of the system is accomplished using UML diagrams, comprising the class diagram, deployment diagram, sequence diagrams, activity diagrams, use case diagrams etc.

In Chapter seven, implementation details of the project are given. They cover the stages from data collection, through the intervening ETL process, to the final visualizations of data (including time series), analysis outcomes (including churn analysis) and clustering.

In Chapter eight, the testing and debugging tasks performed are discussed.

In Chapter nine, the project is concluded with the results achieved in light of the objectives set at the project’s commencement, and room for improvement is indicated.

In Appendix A, a sample dataset is given.

In Appendix B, Gantt charts of project time schedule have been provided.

In Appendix C, output snapshots of the project are included.

The next chapter deals with literature review of the project.


2. LITERATURE REVIEW

2.1. Related Research Work

In this competitive time, implementing CRM matters a lot for a company’s prospects. There are different methods of implementing CRM. Some CRM practices yield the best outputs, while others bring nothing of significance to the company. A company therefore has to identify and implement the best of the relevant CRM practices.

2.2. Companies using Data Mining in CRM

CRM has become a leading business strategy in a highly competitive business environment. CRM can be viewed as ‘managerial efforts to manage business interactions with customers by combining business processes and technologies that seek to understand a company’s customers’. Companies are becoming increasingly aware of the many potential benefits provided by CRM. The telecom industry faces tremendous competition today with the emergence of a number of vendors, each with unique brand propositions. In a market where the customer has no dearth of choices and the cost of switching over is minimal, a key strategy for companies to retain customers is Customer Relationship Management. Within the ambit of CRM applications, data mining is very popular in the telecom industry [4].

One successful CRM implementation in the telecom industry is that of the Canadian telecommunication company Bell Canada. It gathers a large amount of data from telephone, internet, wireless, voice over IP and digital television services. The success of the CRM implementation at this company can be studied in three phases:

a) Pre-CRM Scenario

The solutions, business processes and methods employed prior to the CRM solution clearly did not meet the business needs. Bell Canada needed a full-fledged customer-centric strategy that catered to the company’s requirements. After scrutiny, it decided to embark on the implementation of CRM for its advantages. The basic problem was that the existing disparate solutions created a lot of extra work for employees and increased their task load. This had resulted in a decrease in employee satisfaction and posed numerous problems. In addition, Bell required the front-end and back-end operations of its shared services center to be integrated, which could not be achieved through the existing processes. Access to current employee case status and reporting capabilities were also required [5].

b) Implementing CRM

As a result, CRM customer service and support initiatives were adopted. The CRM
solution was deployed to a total of 200+ users in 2 months. The staff was trained to use
multi-language systems, which helped them immensely, especially when dealing with
multilingual customers and customer data. The key elements of the implementation were
speed, data integration, ease of use and more efficient reporting capabilities [5].

c) The Result

Bell ultimately saw better customer service from its employees. Another advantage was
the internal efficiency created within the organization. The flexibility and
customization traits of CRM enabled a reduction in the total case volume. Its ease of
use and adaptability also resulted in increased integration of data between the systems.
The entire implementation required little time and was carried out with very little
effort. Speed was a dominating factor in this implementation. The organization was able
to meet the business requirements it needed so much [5].


In the context of Nepal, the two giant telecom vendors NTC and NCell use data mining
techniques to introduce new policies and strategies. These two companies focus on
strategies aimed at customers of particular age groups and particular areas. Prioritizing
customers by age group and area helps in devising strategies on a segmentation basis.
Hence they can place emphasis on profitability and on high-monetary-value customers.


3. THEORETICAL BACKGROUND

3.1. Customer Segmentation

Customer segmentation is the practice of dividing a customer base into groups of individuals

that are similar in specific ways relevant to marketing, such as age, gender, interests, spending

habits and so on. It allows a company to target specific groups of customers effectively and

allocate marketing resources to best effect. According to an article by Jill Griffin for Cisco

Systems, traditional segmentation focuses on identifying customer groups based on

demographics and attributes such as attitude and psychological profiles. Value-based

segmentation, on the other hand, looks at groups of customers in terms of the revenue they

generate and the costs of establishing and maintaining relationships with them [6].

Examples of common segmentation objectives include [7] :

Develop new products

Create segmented ads & marketing communications

Develop differentiated customer servicing & retention strategies

Target prospects with the greatest profit potential

Optimize your sales-channel mix

Segmentation is a way to have more targeted communication with the customers. The process

of segmentation describes the characteristics of the customer groups (called segments or

clusters) within the data. Segmenting means dividing the population into segments according to
their affinity or similar characteristics. Customer segmentation is a preparation step for
classifying each customer according to the customer groups that have been defined [8].

Segmentation is essential to cope with today’s dynamically fragmenting consumer marketplace.

By using segmentation, marketers are more effective in channeling resources and discovering
opportunities. The construction of user segmentations is not an easy task. Difficulties in making
a good segmentation are as follows [8]:

Relevance and quality of data: Relevant, good-quality data is essential for developing
meaningful segments. If the company has insufficient customer data, the resulting customer
segmentation is unreliable and almost worthless. Alternatively, too much data can lead to
complex and time-consuming analysis. Poorly organized data (different formats, different
source systems) also makes it difficult to extract interesting information. Furthermore, the
resulting segmentation can be too complicated for the organization to implement
effectively. In particular, the use of too many segmentation variables can be confusing
and result in segments which are unfit for management decision making. On the other
hand, apparently effective variables may not be identifiable. Many of these problems
are due to an inadequate customer database.

Intuition: Although data can be highly informative, data analysts need to be

continuously developing segmentation hypotheses in order to identify the ’right’ data

for analysis.

Continuous process: Segmentation demands continuous development and updating as

new customer data is acquired. In addition, effective segmentation strategies will

influence the behavior of the customers affected by them; thereby necessitating

revision and reclassification of customers. Moreover, in an e-commerce environment

where feedback is almost immediate, segmentation would require almost a daily

update.

Over-segmentation: A segment can become too small and/or insufficiently distinct to
justify treatment as a separate segment.

One way to construct segments is provided by data mining methods that belong to the
category of clustering algorithms. K-means clustering is used here to segment
telecommunications customers.


3.2. Customer Profiling

Customer profiling is a way to create a portrait of the customers to help companies make design
decisions concerning the service. Customer profiling provides a basis for marketers to
'communicate' with existing customers in order to offer them better services and retain them.

This is done by assembling collected information on the customer such as demographic and

personal data. Customer profiling is also used to prospect new customers using external sources,

such as demographic data purchased from various sources. This data is used to find a relation

with the customer segmentations that were constructed before. This makes it possible to estimate

for each profile (the combination of demographic and personal information) the related segment

and vice versa. More directly, for each profile, an estimation of the usage behavior can be

obtained [8].

Depending on the goal, one has to select the profile that is relevant to the project.

A simple customer profile is a file that contains at least age and gender. If one needs profiles for

specific products, the file would contain product information and/or volume of money spent.

Customer features one can use for profiling are [8]:

Geographic: Are they grouped regionally, nationally or globally?

Cultural and ethnic: What languages do they speak? Does ethnicity affect their tastes or

buying behaviors?

Economic conditions, income and/or purchasing power: What is the average household
income or purchasing power of the customers? Do they have any payment difficulty? How
much or how often does a customer spend on each product?

Age and gender: What is the predominant age group of the target buyers? How many
children, and of what ages, are in the family? Are more females or males using a certain
service or product?

Values, attitudes and beliefs: What is the customers’ attitude toward your kind of product

or service?

Life cycle: How long has the customer been regularly purchasing products?


Knowledge and awareness: How much knowledge do customers have about a product,

service, or industry? How much education is needed? How much brand-building
advertising is needed to make a pool of customers aware of the offer?

Lifestyle: How many lifestyle characteristics about purchasers are useful?

Recruitment method: How was the customer recruited?

The choice of features also depends on the availability of the data. With these features, an
estimation model can be built. With this model, companies can formulate strategies
based on the developed profiles.

3.3. Churn Prediction

Churn rate, as it relates to mobile network carriers, is the percentage of subscribers in a given

time frame who cease to use the company's services for one reason or another. It is used as an

indicator of the health of a company's subscriber base. The lower the churn rate, the better the

outlook is for the company [9]. In the new economy, which provides unprecedented choice and
instant, global access to products and information, churn rate determines business earnings
and growth [10].

Churn rate can be represented in a number of ways, including [11]:

1. Number of customers lost

2. Percent of customers lost

3. Value of recurring business lost

4. Percent of recurring value lost

Even within a single method, the calculation can vary. For example, different practitioners
may choose to calculate the “churn rate” for a month in different ways. The most traditional
formula is the number of customers lost divided by the number of customers at the start
of the month. However, some businesses choose to base their churn rate on the number of
subscribers at the end of the period instead of the beginning of the period. Concisely, churn can
be expressed as:

$$\text{Churn} = \frac{C_{\text{lost}}}{t \cdot C_{\text{start}}}$$

where $C_{\text{lost}}$ = number of customers cancelling service,

$t$ = time interval,

$C_{\text{start}}$ = number of customers at the beginning of the interval.
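As a rough illustration of the churn expression above, the following Python sketch divides the number of customers lost by the product of the time interval and a base subscriber count. The function name `churn_rate` and the figures used are hypothetical and are not taken from the project data.

```python
def churn_rate(customers_lost, base_customers, intervals=1.0):
    """Churn per unit time: customers lost / (intervals * base subscriber count).

    base_customers is normally the count at the beginning of the interval;
    some businesses pass the count at the end of the period instead.
    """
    if base_customers <= 0 or intervals <= 0:
        raise ValueError("base_customers and intervals must be positive")
    return customers_lost / (intervals * base_customers)


# Hypothetical figures: 50 of 2,000 subscribers cancel during one month.
start_based = churn_rate(50, 2000)   # base = subscribers at start of month
end_based = churn_rate(50, 1950)     # base = subscribers at end of month
print(f"start-based: {start_based:.4f}, end-based: {end_based:.4f}")
```

The two calls show how the choice of base changes the reported figure even though the same customers were lost.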

Churn prediction is currently a relevant subject in data mining and has been applied in the fields
of banking, mobile telecommunication, life insurance, and others. In fact, all companies that
deal with long-term customers can take advantage of churn prediction methods.

Models such as neural networks, logistic regression and decision trees are common choices of

data miners to tackle this churn prediction problem. These models are trained by offering

snapshots of churned customers and non-churned customers. The goal is to distinguish churners

from non-churners as much as possible. When new customers are offered, the model attempts

to predict to which class each customer belongs [12].

Churn prediction is especially important for telecommunications in the prevailing
competitive landscape. The ability to predict that a particular customer is at a high risk of

churning, while there is still time to do something about it, represents a huge additional potential

revenue source for every online business. Besides the direct loss of revenue that results from a

customer abandoning the business, the costs of initially acquiring that customer may not have

already been covered by the customer’s spending to date. In other words, acquiring that

customer may have actually been a losing investment. Furthermore, it is always more difficult

and expensive to acquire a new customer than it is to retain a current paying customer.

In order to succeed at retaining customers who would otherwise abandon the business, marketers

and retention experts must be able to:


(a) predict in advance which customers are going to churn and

(b) know which marketing actions will have the greatest retention impact on each particular

customer.

Armed with this knowledge, a large proportion of customer churn can be eliminated [13].


4. TECHNICAL BACKGROUND

4.1. Data Loading through Extract, Transform & Load Process

A data warehouse is loaded regularly so that it can serve its purpose of facilitating business

analysis. To do this, data from one or more operational systems needs to be extracted and copied

into the data warehouse. The challenge in data warehouse environments is to integrate, rearrange

and consolidate large volumes of data over many systems, thereby providing a new unified

information base for business intelligence.

The process of extracting data from source systems and bringing it into the data warehouse is

commonly called ETL, which stands for extraction, transformation, and loading. Note that ETL

refers to a broad process, and not three well-defined steps. The acronym ETL is perhaps too

simplistic, because it omits the transportation phase and implies that each of the other phases of

the process is distinct. Nevertheless, the entire process is known as ETL.

The methodology and tasks of ETL have been well known for many years, and are not

necessarily unique to data warehouse environments: a wide variety of proprietary applications

and database systems are the IT backbone of any enterprise. Data has to be shared between

applications or systems, trying to integrate them, giving at least two applications the same

picture of the world. This data sharing was mostly addressed by mechanisms similar to what we

now call ETL.

Extraction:

During extraction, the desired data is identified and extracted from many different sources,

including database systems and applications. Very often, it is not possible to identify the specific

subset of interest, therefore more data than necessary has to be extracted, so the identification

of the relevant data will be done at a later point in time. Depending on the source system's

capabilities (for example, operating system resources), some transformations may take place


during this extraction process. The size of the extracted data varies from hundreds of kilobytes

up to gigabytes, depending on the source system and the business situation. The same is true for

the time delta between two (logically) identical extractions: the time span may vary between

days/hours and minutes to near real-time. Web server log files, for example, can easily grow to

hundreds of megabytes in a very short period of time. In the telecommunication industry, the
same is true of call records, which are generated at a tremendous rate every second. The
real-time scenario is not addressed in our project. Nevertheless, the system is not static or
restricted to limited datasets; there is explicit provision for the extraction of call detail
records (CDRs).

In many cases this first part of an ETL process, extracting the data from the source
systems, is the most challenging aspect of ETL, since extracting data correctly sets the stage for
how subsequent processes proceed. The goal of the extraction phase is to convert the data into
a single format which is appropriate for transformation processing.

Transformation:

The transformation stage applies a series of rules or functions to the extracted data from the

source to derive the data for loading into the end target. This includes converting any measured

data to the same dimension (i.e. conformed dimension) using the same units so that they can

later be joined. Some data sources will require very little or even no manipulation of data. In

other cases, one or more transformations may be required to meet the business and technical

needs of the target database.

Loading:

After the raw data had been cleaned and transformed into a consistent format, it was loaded
into the data mart in the loading phase. In general, the transformed data is loaded into the
end target, usually the data marts. Depending on the requirements of the organization, this
process varies widely.
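The three stages described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample call records, the column names and the `call_fact` table are hypothetical, and the standard-library `sqlite3` module stands in for the MySQL database used in the project.

```python
import csv
import io
import sqlite3

# Hypothetical call detail records in delimited text form (mixed duration units).
RAW_CDR_TEXT = """caller,callee,start_time,duration,unit
9841000001,9841000002,2013-05-01 09:15:00,120,sec
9841000003,9841000001,2013-05-01 09:20:00,2,min
9841000002,,2013-05-01 10:00:00,45,sec
"""


def extract(text):
    """Extract: read every row from the delimited text source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows and convert all durations to seconds."""
    cleaned = []
    for row in rows:
        if not row["caller"] or not row["callee"]:
            continue  # cleaning: skip records with a missing party
        seconds = int(row["duration"]) * (60 if row["unit"] == "min" else 1)
        cleaned.append((row["caller"], row["callee"], row["start_time"], seconds))
    return cleaned


def load(conn, records):
    """Load: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS call_fact "
        "(caller TEXT, callee TEXT, start_time TEXT, duration_sec INTEGER)"
    )
    conn.executemany("INSERT INTO call_fact VALUES (?, ?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the data mart
load(conn, transform(extract(RAW_CDR_TEXT)))
total_calls, total_seconds = conn.execute(
    "SELECT COUNT(*), SUM(duration_sec) FROM call_fact"
).fetchone()
print(total_calls, total_seconds)
```

Converting every duration to seconds in the transformation step is an example of bringing measured data to the same units so that records can later be joined and aggregated.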


4.2. K-means Clustering Algorithm & Bisecting K-means Clustering Algorithm

Given an initial set of k means $m_1^{(1)}, \dots, m_k^{(1)}$, the algorithm proceeds by
alternating between two steps:

Assignment step: Assign each observation to the cluster whose mean yields the least within-

cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance, this

is intuitively the "nearest" mean.

$$S_i^{(t)} = \left\{ x_p : \left\| x_p - m_i^{(t)} \right\|^2 \le \left\| x_p - m_j^{(t)} \right\|^2 \ \forall j,\ 1 \le j \le k \right\},$$

where each $x_p$ is assigned to exactly one $S^{(t)}$, even if it could be assigned to two or
more of them.

Update step: Calculate the new means to be the centroids of the observations in the new clusters.

$$m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{x_j \in S_i^{(t)}} x_j$$

Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster

sum of squares (WCSS) objective [14].
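The two alternating steps can be sketched in plain Python as follows. This is a minimal illustration, not the project's implementation: the function names, the choice of random observations as initial means, and the six sample points are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Centroid (coordinate-wise arithmetic mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)  # assumption: k random observations as initial means
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest mean.
        # Ties go to the lowest cluster index, so a point is in exactly one cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, means[i]))
            clusters[nearest].append(p)
        # Update step: each mean becomes the centroid of its cluster.
        new_means = [mean_point(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:
            break  # assignments no longer change
        means = new_means
    return means, clusters


# Hypothetical two-feature observations forming two well-separated groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, clusters = k_means(POINTS, k=2)
print(means)
```

An empty cluster keeps its previous mean in this sketch; other strategies (such as re-seeding the mean) are equally possible.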

Algorithm for Bisecting K-means Clustering

1. Pick a cluster to split.

2. Find 2 sub-clusters using the basic K-means algorithm. (Bisecting step)

3. Repeat step 2, the bisecting step, for ITER times and take the split that produces the

clustering with the highest overall similarity.

4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached [15].
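The four steps above can be sketched in Python as shown below. Two choices in the sketch are assumptions rather than part of the stated algorithm: the cluster picked for splitting is the one with the largest sum of squared errors (SSE), and "highest overall similarity" is approximated by the lowest total SSE of the two sub-clusters. The parameter `n_trials` plays the role of ITER, and the sample points are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def sse(points):
    """Sum of squared distances of the points to their centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, rng, max_iter=50):
    """Basic K-means with k = 2, started from two random observations."""
    means = rng.sample(points, 2)
    halves = [list(points), []]
    for _ in range(max_iter):
        halves = [[], []]
        for p in points:
            side = 0 if sq_dist(p, means[0]) <= sq_dist(p, means[1]) else 1
            halves[side].append(p)
        new_means = [centroid(h) if h else means[i] for i, h in enumerate(halves)]
        if new_means == means:
            break
        means = new_means
    return halves


def bisecting_k_means(points, k, n_trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Step 1: pick a cluster to split (assumption: the one with largest SSE).
        target = max(clusters, key=sse)
        if len(target) < 2:
            break  # nothing left that can be split
        clusters.remove(target)
        # Steps 2-3: bisect n_trials times and keep the split with lowest total SSE.
        best = None
        for _ in range(n_trials):
            left, right = two_means(target, rng)
            if not left or not right:
                continue  # degenerate split, ignore this trial
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, left, right)
        if best is None:
            clusters.append(target)
            break
        clusters.extend(best[1:])
    # Step 4 is the enclosing loop: repeat until k clusters are reached.
    return clusters


# Hypothetical observations forming three separated groups.
POINTS = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (20, 0), (21, 0), (20, 1)]
clusters = bisecting_k_means(POINTS, k=3)
print(clusters)
```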

Complexity

Finding the optimal solution to the k-means clustering problem for observations in d
dimensions has the following complexity: if k and d (the dimension) are fixed, the problem
can be solved exactly in time $O(n^{dk+1} \log n)$, where n is the number of entities to be
clustered.

4.3. Recency Frequency & Monetary Model

Recency, Frequency & Monetary (RFM) is a method used for analyzing customer value. It is

commonly used in database marketing and direct marketing and has received particular attention

in retail and professional services industries.

RFM stands for

Recency - How recently did the customer take service?

Frequency - How often do they take service?

Monetary Value - How much do they spend?

Most businesses keep data about the services used by their customers. All that is needed is a
table with the customer name, the date of service and the total money spent. One methodology
is to assign a scale of 1 to 10, where 10 is the maximum value, and to stipulate a formula by
which the data is fitted to the scale.

Alternatively, one can create categories for each attribute. For instance, the Recency attribute

might be broken into three categories: customers with purchases within the last 90 days; between

91 and 365 days; and longer than 365 days. Such categories may be arrived at by applying

business rules, or using a data mining technique, such as CHAID, to find meaningful breaks.

Once each of the attributes has appropriate categories defined, segments are created from the

intersection of the values. If there were three categories for each attribute, then the resulting

matrix would have twenty-seven possible combinations (one well-known commercial approach

uses five bins per attribute, which yields 125 segments). Companies may also decide to collapse

certain subsegments, if the gradations appear too small to be useful. The resulting segments can

be ordered from most valuable (highest recency, frequency, and value) to least valuable (lowest


recency, frequency, and value). Identifying the most valuable RFM segments can capitalize on

chance relationships in the data used for this analysis [16].
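The category-based variant of RFM described above can be sketched as follows. The recency cut-offs (90 and 365 days) come from the example in the text; the frequency and monetary cut-offs, the sample records and the reference date are hypothetical values chosen only for illustration.

```python
from datetime import date

# Hypothetical service records: (customer, date of service, amount spent).
RECORDS = [
    ("A", date(2013, 7, 20), 500.0),
    ("A", date(2013, 6, 2), 300.0),
    ("A", date(2013, 1, 15), 250.0),
    ("B", date(2012, 12, 1), 80.0),
    ("C", date(2011, 5, 9), 40.0),
    ("C", date(2011, 3, 1), 20.0),
]


def recency_score(days_since_last):
    """Three categories from the text: within 90 days, 91 to 365 days, longer."""
    if days_since_last <= 90:
        return 3
    if days_since_last <= 365:
        return 2
    return 1


def frequency_score(count):
    """Hypothetical cut-offs: 3 or more services, exactly 2, or 1."""
    return 3 if count >= 3 else 2 if count == 2 else 1


def monetary_score(total):
    """Hypothetical cut-offs on total spend."""
    return 3 if total >= 1000 else 2 if total >= 100 else 1


def rfm_segments(records, today):
    """Return {customer: (R, F, M)}; each score is a category from 1 to 3."""
    summary = {}
    for customer, when, amount in records:
        last, count, total = summary.get(customer, (when, 0, 0.0))
        summary[customer] = (max(last, when), count + 1, total + amount)
    return {
        customer: (
            recency_score((today - last).days),
            frequency_score(count),
            monetary_score(total),
        )
        for customer, (last, count, total) in summary.items()
    }


segments = rfm_segments(RECORDS, today=date(2013, 8, 1))
print(segments)
```

With three categories per attribute, each customer falls into one of the twenty-seven possible (R, F, M) combinations.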

4.4. Gaussian Distribution

The Gaussian distribution is also called the normal distribution. A univariate Gaussian
distribution is defined as follows: a continuous random variable X is said to have a Gaussian
distribution with parameters µ and σ if the probability density function of X is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where e ≈ 2.71828, π ≈ 3.1416, µ is the mean and σ is the standard deviation.

Typical properties of the normal distribution f(x), with any mean µ and any positive standard
deviation σ, are as follows:

It is symmetric around the point x = µ, which is at the same time the mode, the median
and the mean of the distribution. [9]

It is unimodal: its first derivative is positive for x < µ, negative for x > µ, and zero only
at x = µ.

It has two inflection points (where the second derivative of f is zero and changes sign),
located one standard deviation away from the mean, namely at x = µ - σ and x = µ + σ.
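The density and the properties listed above can be checked numerically with a short Python sketch. The parameter values (mean 5, standard deviation 2) and the finite-difference helper are assumptions made for the illustration.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of the univariate Gaussian with mean mu and std sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


MU, SIGMA = 5.0, 2.0  # hypothetical parameters


def f(x):
    return gaussian_pdf(x, MU, SIGMA)


# Symmetry around the mean.
print(math.isclose(f(MU - 1.3), f(MU + 1.3)))
# Unimodality: the density at the mean exceeds the density on either side.
print(f(MU) > f(MU - 0.5) and f(MU) > f(MU + 0.5))
# Inflection point near mu + sigma: the second derivative changes sign.
print(second_derivative(f, MU + SIGMA - 0.1) < 0 < second_derivative(f, MU + SIGMA + 0.1))
```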

4.5. Time Series Visualization

Visualizing data through charts makes the pattern in the data far clearer than simply looking
at the data in a table. Visualization is a powerful tool for monitoring system
performance, analyzing service traffic, helping merchants to optimize their business, or finding
new ways to combat fraud. Visualization thus makes data analysis easier and helps to gain the
insight needed to make decisions quickly.


For decision making, time is one of the important parameters. Time series visualization is a
visualization technique in which time is one of the parameters. It helps us understand how a
parameter changes over time. For example, in telecommunications, knowing how call duration
changes over time shows how the service is used by customers and helps the company to
expand its services.
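As a small illustration of the call duration example, the sketch below aggregates average call duration by hour of day and prints it as a text bar chart. The call records are hypothetical, and a real dashboard would feed the same aggregated series to a charting component rather than printing it.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical call records: (start time, duration in seconds).
CALLS = [
    ("2013-05-01 09:05:00", 60),
    ("2013-05-01 09:40:00", 120),
    ("2013-05-01 10:15:00", 30),
    ("2013-05-01 18:20:00", 300),
    ("2013-05-01 18:45:00", 180),
]


def average_duration_by_hour(calls):
    """Return {hour: average call duration in seconds}, ordered by hour."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [sum of seconds, number of calls]
    for start, seconds in calls:
        hour = datetime.strptime(start, "%Y-%m-%d %H:%M:%S").hour
        totals[hour][0] += seconds
        totals[hour][1] += 1
    return {hour: total / count for hour, (total, count) in sorted(totals.items())}


series = average_duration_by_hour(CALLS)
for hour, avg in series.items():
    # One '#' per 30 seconds of average duration.
    print(f"{hour:02d}:00  {'#' * int(avg // 30)}  {avg:.0f} s")
```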

4.6. On-Line Analytical Processing & Snowflake Schema for Datamart

4.6.1. Data Cube

Data Cube is a multidimensional data model defined by dimensions and facts. Dimensions are

the perspectives or entities with respect to which an organization wants to keep records. Facts

are numeric measures. Each dimension may have a table associated with it, called a dimension

table. The fact table contains the names of the facts, or measures, as well as keys to each of
the related dimension tables. A data cube can be visualized as an n-dimensional geometric
structure formed of cuboids which represent various levels of summarization (the lowest level
of summarization corresponds to the base cuboid and the highest to the apex cuboid).

4.6.2. Data Mart

Unlike a data warehouse, which collects information about subjects that span the entire
organization, a data mart is a departmental subset of the data warehouse that focuses on
selected subjects, and thus its scope is department-wide. In data mart design, the star or
snowflake schema is popular.

4.6.3. Multidimensional Data Models Schema

Star, snowflake and fact constellation are schemas of a data cube. The schema followed in
this project is the snowflake schema. It refers to the schema in which some dimension tables are
normalized, thereby further splitting the data into additional tables, so that the resulting
schema graph forms a shape similar to a snowflake.
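A snowflake schema of this kind can be sketched with a few table definitions. The tables and columns below (`call_fact`, `customer_dim`, `region_dim`, `time_dim`) are hypothetical and do not reproduce the project's actual data mart; the standard-library `sqlite3` module is used here in place of MySQL so that the example is self-contained.

```python
import sqlite3

# region_dim is split out of customer_dim: this normalization of a dimension
# table is what turns a star schema into a snowflake schema.
SCHEMA = """
CREATE TABLE region_dim (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE customer_dim (
    customer_id INTEGER PRIMARY KEY,
    gender TEXT,
    age INTEGER,
    region_id INTEGER REFERENCES region_dim(region_id)
);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, call_date TEXT, call_hour INTEGER);
CREATE TABLE call_fact (
    customer_id INTEGER REFERENCES customer_dim(customer_id),
    time_id INTEGER REFERENCES time_dim(time_id),
    duration_sec INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO region_dim VALUES (?, ?)",
                 [(1, "Kathmandu"), (2, "Pokhara")])
conn.executemany("INSERT INTO customer_dim VALUES (?, ?, ?, ?)",
                 [(10, "F", 24, 1), (11, "M", 37, 2)])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(100, "2013-05-01", 9), (101, "2013-05-01", 18)])
conn.executemany("INSERT INTO call_fact VALUES (?, ?, ?)",
                 [(10, 100, 60), (10, 101, 240), (11, 100, 90)])

# Reaching the region name requires two joins: fact -> customer -> region.
rows = conn.execute("""
    SELECT r.region_name, SUM(f.duration_sec)
    FROM call_fact f
    JOIN customer_dim c ON c.customer_id = f.customer_id
    JOIN region_dim r ON r.region_id = c.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
print(rows)
```

The extra join is the usual price of the normalized dimension; in exchange, region names are stored only once.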

4.6.4. On-Line Analytical Processing

Data warehouse systems serve users or knowledge workers in the role of data analysis and

decision making. Such systems can organize and present data in various formats in order to

accommodate the diverse needs of different users. These systems are known as online analytical
processing (OLAP) systems, in contrast to Online Transaction Processing (OLTP) systems,
which encompass operational day-to-day database systems for keeping records and simple
querying. The OLAP operations available include slicing, dicing, roll-up and drill-down.

Slice is the act of picking a rectangular subset of a cube by choosing a single value for one of

its dimensions, creating a new cube with one fewer dimension. The dice operation produces a

subcube by allowing the analyst to pick specific values of multiple dimensions. Drill Down/Up

allows the user to navigate among levels of data ranging from the most summarized (up) to the

most detailed (down). A roll-up involves summarizing the data along a dimension.
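The slice, dice and roll-up operations can be illustrated on a tiny cube held in a Python dictionary. The dimensions (region, gender, month), the measure (total call minutes) and the cell values are hypothetical; a real OLAP engine would run these operations against the data mart rather than an in-memory dictionary.

```python
from collections import defaultdict

DIMENSIONS = ("region", "gender", "month")

# Hypothetical base cuboid: (region, gender, month) -> total call minutes.
CUBE = {
    ("Kathmandu", "F", "Jan"): 120,
    ("Kathmandu", "M", "Jan"): 200,
    ("Kathmandu", "F", "Feb"): 150,
    ("Pokhara", "F", "Jan"): 80,
    ("Pokhara", "M", "Feb"): 60,
}


def slice_cube(cube, dimension, value):
    """Fix one dimension to a single value; the result has one fewer dimension."""
    i = DIMENSIONS.index(dimension)
    return {key[:i] + key[i + 1:]: v for key, v in cube.items() if key[i] == value}


def dice_cube(cube, allowed):
    """Keep cells whose coordinates lie in the allowed values of several dimensions."""
    return {
        key: v
        for key, v in cube.items()
        if all(key[DIMENSIONS.index(d)] in values for d, values in allowed.items())
    }


def roll_up(cube, dimension):
    """Summarize along one dimension by summing it out of the cube."""
    i = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, v in cube.items():
        totals[key[:i] + key[i + 1:]] += v
    return dict(totals)


january = slice_cube(CUBE, "month", "Jan")
kathmandu_cells = dice_cube(CUBE, {"region": {"Kathmandu"}, "month": {"Jan", "Feb"}})
by_region_gender = roll_up(CUBE, "month")
print(january)
print(kathmandu_cells)
print(by_region_gender)
```

Drill-down is the reverse navigation: moving from `by_region_gender` back to the more detailed cells of `CUBE`.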


5. SYSTEM ANALYSIS

Numerous data mining applications have been deployed in the telecommunications industry.

However, most applications fall into one of the following three categories [17]:

Marketing

Fraud detection

Network fault isolation and prediction.

Marketing applications are the ones making a significant difference in the context of our
country. To meet the marketing-related challenges mentioned, telecom companies are
increasing their investments in CRM strategies and software [1].

Customer segmentation is one of the key candidates in the above-mentioned context of building
appropriate CRM strategies and software. Before designing any system, analysis of the system is
vital to cater to informational as well as architectural needs. Basically, system analysis is meant
to elicit the requirements as they prevail and to assess various facets of feasibility. Therefore,
these two purposes are detailed in light of our project, which has customer segmentation as its
theme.

5.1. Requirements Analysis

Software requirements are captured in the Software Requirements Specification (SRS) document.
The SRS documents the various sorts of requirements the system is expected to meet, including
functional and non-functional ones. The requirements pertaining to what the software does, what
inputs the modules need to generate output, etc. are functional requirements, which can be
analyzed at various levels of granularity. On the contrary, holistic requirements of the system
such as reliability and user-friendliness come under non-functional ones [18]. So, while
functional requirements determine the effectiveness of the system, non-functional requirements
determine how efficient the system is subject to constraints.


Requirement elicitation can employ many techniques. To gather the various requirements, we
adopted the following approaches:

Expert Suggestion

Visit to a telecommunication company (NTC, NCell)

Interview method

Analysis of prevailing market of Nepalese telecommunication companies

5.1.1. Assumptions and Dependencies

It is evident that certain assumptions and dependencies exist in any software system. In our
project, the following assumptions have been made:

1. The raw dataset feed (either sample or the entire population) used would be in the text

format or any other format convertible to the text format.

2. The language supported is only English.

3. The system is a web application.

4. For the NTC dataset, average call duration is taken to represent the monetary charge for
each call.

5.1.2. High Level Requirements

A telecommunication company seeks to provide the services that are in significant demand
among subscribers and also to check the QoS delivered. In this regard, the main requirement of
this project is to develop an application that segments the customers of a telecommunication
company in order to improve the relationship with customers.

For the fulfillment of this requirement, we have set the following specific actions:

Classification of customers for executing new campaigns and other profitable

operations.


Deduction of pricing optimization and service development (products inclusive)

strategies.

Utilization of business process and strategy in service analysis for Business Intelligence

in light of customer relationship.

5.1.3. Functional Requirements

The various functional requirements of the project are given below:

1. Reporting

The system is able to describe the dataset and then report on its status, for instance,
call traffic distribution by gender.

The system makes it possible to interpolate the behavior of a certain parameter with respect
to another parameter, e.g. how average call duration would change if calls were
discounted by a certain percentage for certain customer segments.

2. Visualization

The system will depict the dataset in graphical forms that are easily comprehensible.

It also shows various aggregates using BI.

The output of data mining is presented in a practically useful form.

3. Decision Making

It helps to prioritize customers after detailed analysis.

It helps to infer probable benefits in order to discover attractive customer segments.

5.1.4. Non Functional Requirements

The various non-functional requirements of the project are given below:

1. Usability


The system has an authentication system.

The user interface is a simple and user-friendly GUI.

General acquaintance with data-mining terminologies and their purpose is essential for

the usage.

2. Scalability

The system is highly scalable in the sense that, if it can handle tens of records, it can
also handle thousands of customer records.

It can be used not only for one telecommunication company but also for many of them.

3. Performance

As with any data-mart-based project, response time is an important factor in the
performance of the system. Response time is affected by the following sequence of
actions that take place before visual output is generated:

OLTP Database Acquisition (conversion of available text format reducible data to

MySQL tables used by us)

Data Preprocessing (Transformation of the data in MySQL tables to the form suitable

for the data-mart designed)

Loading to data-mart (loading transformed data to data-mart)

Data Mining (applying data-mining algorithms on OLAP cube so obtained)

As the dataset grows in volume at a great pace, loading time turns out to be the bottleneck
for response time. Thus, to optimize the system, an incremental loading approach is used.
Furthermore, to boost performance, incremental computation is implemented where
and when feasible.

4. Reliability, Extensibility and Security are other non-functional areas that are substantially

addressed in this project.
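The incremental loading approach mentioned under Performance can be sketched as follows. This is one common way of doing it, assumed here for illustration and not necessarily the mechanism used in the project: the highest key already present in the target serves as a watermark, and only source rows beyond it are copied. The table names are hypothetical, and `sqlite3` stands in for MySQL.

```python
import sqlite3


def incremental_load(source, target):
    """Copy only the source rows that are not yet in the target; return their count."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS call_fact "
        "(call_id INTEGER PRIMARY KEY, duration_sec INTEGER)"
    )
    # Watermark: the highest call_id that has already been loaded (0 if none).
    (watermark,) = target.execute(
        "SELECT COALESCE(MAX(call_id), 0) FROM call_fact"
    ).fetchone()
    new_rows = source.execute(
        "SELECT call_id, duration_sec FROM calls WHERE call_id > ? ORDER BY call_id",
        (watermark,),
    ).fetchall()
    target.executemany("INSERT INTO call_fact VALUES (?, ?)", new_rows)
    target.commit()
    return len(new_rows)


# Hypothetical operational source with two call records.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE calls (call_id INTEGER PRIMARY KEY, duration_sec INTEGER)")
source.executemany("INSERT INTO calls VALUES (?, ?)", [(1, 60), (2, 120)])

target = sqlite3.connect(":memory:")
first_run = incremental_load(source, target)    # loads both existing rows

source.execute("INSERT INTO calls VALUES (3, 45)")
second_run = incremental_load(source, target)   # loads only the newly arrived row
print(first_run, second_run)
```

Because each run touches only the rows added since the previous run, loading time grows with the volume of new data rather than with the size of the whole dataset.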


5.2. Feasibility Analysis

A feasibility study examines the potential of the project from three perspectives:
operational, technical and economic feasibility. Feasibility studies aim to
objectively and rationally uncover the strengths and weaknesses of an existing business or
proposed venture, opportunities and threats as presented by the environment, the resources
required to carry through, and ultimately the prospects for success [19]. Feasibility assessment
is carried out prior to the initiation of the project so as to evaluate whether the project being
considered is worthwhile. Considering the above-mentioned perspectives, a decision has to be
made on whether to accept or terminate the project.

5.2.1. Operational Feasibility

Operational feasibility assesses how well a proposed system solves the problems and takes
advantage of the opportunities identified during scope definition, and how it satisfies the
requirements identified in the requirements analysis phase of system development [19].
Specifically, it deals with whether the proposed system covers the scope and requirements of
the project under consideration. Our project is a decision support system for the telecom
industry. The dashboard of the project presents the analysis in graphical views, so any
professional with basic knowledge of graphical visualization can operate it.

5.2.2. Technical Feasibility

Technical feasibility focuses on comparing the present technical resources of the
organization with the expected needs of the proposed system. Our project can be applied
without any technical difficulty. The technical and resource requirements are already
in place, so handling the system will not pose any difficulty. The system will also have
to be upgraded to accommodate certain changes, but upgrade and maintenance will
not require additional resources.


5.2.3. Economic Feasibility

Economic feasibility determines the positive economic benefits that the system will provide
to the organization. It includes quantification and identification of all the benefits expected
[19]. As our system is a business decision support system, managers can formulate new
strategies from the results of the analysis. This reduces decision-making time and helps in
bringing out new schemes faster than competitors. Thus our system can assist in lifting the
revenue of the organization.


6. SYSTEM DESIGN

Our project can basically be divided into three phases. First, we perform data preprocessing in
order to integrate different data sets and to clean the missing values. We also apply
discretization to our data when it is necessary for our analysis tasks. In the second phase,
high-level data descriptions are produced. We use different data characterization techniques to
get a better understanding of what the data distribution looks like and what general information
we can obtain before we proceed to more in-depth analysis. Finally, three data mining tasks are
conducted. Independently, we apply different methods to uncover more of the intrinsic
characteristics hidden in the data.

Figure 6.1 System Design Phases

Phase 1: Data Selection, Data Cleaning, Data Integration, Data Transformation, Data

Normalization, Data Discretization

Phase 2: Generalization, Analysis of Attribute Relevance, Attribute Removal, Attribute

Analysis

Phase 3: Select Attributes, Discretize, Correlation, Clustering, Classification.

[Figure 6.1 blocks: Phase 1, Data Preprocess (Missing Values, Integration, Discretization); Phase 2, Data Characterize (Generalization, Attribute Analysis, Comparison); Phase 3, Classify/Predict, Cluster, Visualize]

6.1. Use case Modeling

6.1.1. Use case Modeling of ETL and Visualization Processes

The use case diagram for the system is shown below. It contains two actors: the database administrator and the decision maker. The database administrator is responsible for loading data into the database and transforming it into the data mart, whereas the decision maker is responsible for visualizing calling patterns by age and gender according to time and date. Decision makers are also responsible for performing various clustering and association tasks.

Figure 6.2 Use case Diagram of ETL and Visualization Process

6.1.2. Use Case Modeling of User Management

The use case diagram for user management is shown below. The administrator and the anonymous user are generalized as an actor called ‘user’ in the diagram. The user passes through authentication. If s/he is verified as an administrator, then account maintenance, filtering and visualization functionalities are granted in addition to browsing the home page, the ‘contacts’ page and the ‘about’ page of the website.

Figure 6.3 Use Case Diagram of User Management

6.2. System Architecture

6.2.1. System Block Diagram

The block diagram below shows the steps in application development. Each component is described underneath:

Data Source:

It refers to the flat-file data about customers and call details, organized in relational format. This data is used for the construction of the data marts.

Data Pre-Processing:

It consists of the following steps:

i) Extraction: Here we gather data from multiple external sources.

ii) Data cleaning: It detects errors in the data and rectifies them when possible.

iii) Aggregation: Operations like dicing, slicing and roll-up are performed in this step.

iv) Analysis: Various statistical techniques like correlation and regression are used to analyze the dataset under consideration.

Figure 6.4 System Block Diagram

Data OLAP Repository:

It provides architectures and tools for business executives (here the telecommunication

company) to systematically organize, understand, and use their data to make strategic decisions

[20].

Mining Tools:

Techniques like clustering and classification are used to find patterns within the dataset.

Visualization:

Various BI tools like bar charts, pie charts, graphs and decision trees are used to make the numerical results of the analysis and mining easy to interpret.

6.2.2. Data-mart & OLAP Design

Call detail records are important marketing data that a mobile phone service provider can analyze to improve customer relationships. These records can be used in conjunction with subscriber demographic data to get better and more valuable results. All the data can be put together and stored in an OLAP cube. In our case, we use CDR data covering a four-month period. The OLAP cube representation of our data is shown below.

Figure 6.5 OLAP Design

Here, the Subscriber dimension is broken into two different tables (the Subscriber table and the Demographic table). Time of day is separated from the date dimension to avoid an explosion in the date dimension row count [21]. As Subscriber ID, Gender and Age are independent data, we keep them in one dimension. This OLAP cube can be used to answer some very important marketing questions, such as:

What does the data tell us about patterns of calls during the day and during the night?

Is there any difference in mobile phone usage between men and women? And what about different age groups?

The snowflake schema, consisting of the fact table and dimension tables that represent the above OLAP data mart design, is shown below.

Figure 6.6 Fact Table

The call fact table contains the duration and phone number measures. The table has the grain of a call (one row for every call originated over the network), which is the atomic level of detail provided by the operational system. Loading the fact table with atomic data provides the greatest flexibility because that data can be constrained and rolled up in every way possible [21].
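The design described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the MySQL database used in the project, and every table name, column name and sample row is hypothetical, chosen only to mirror the dimensions discussed in this section. The day/night boundary (06:00 to 17:59 counted as day) is likewise an assumption.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; all table/column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE demographic (demographic_id INTEGER PRIMARY KEY, district TEXT);
CREATE TABLE subscriber (
    subscriber_id INTEGER PRIMARY KEY,
    gender TEXT,
    age INTEGER,
    demographic_id INTEGER REFERENCES demographic(demographic_id)
);
CREATE TABLE date_dim (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER, day INTEGER);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, hour INTEGER);
-- Fact table: one row per originated call (the atomic grain).
CREATE TABLE call_fact (
    subscriber_id INTEGER REFERENCES subscriber(subscriber_id),
    date_id INTEGER REFERENCES date_dim(date_id),
    time_id INTEGER REFERENCES time_dim(time_id),
    called_number TEXT,
    duration INTEGER  -- seconds
);
""")

# Made-up sample rows.
conn.executemany("INSERT INTO demographic VALUES (?, ?)", [(1, "Lalitpur")])
conn.executemany("INSERT INTO subscriber VALUES (?, ?, ?, ?)",
                 [(1, "M", 27, 1), (2, "F", 34, 1)])
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?, ?)", [(1, 2012, 8, 3)])
conn.executemany("INSERT INTO time_dim VALUES (?, ?)", [(h, h) for h in range(24)])
conn.executemany("INSERT INTO call_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 9, "9800000001", 60),
    (1, 1, 21, "9800000002", 120),
    (2, 1, 10, "9800000003", 30),
])


def duration_by_gender_and_period(connection):
    """Roll atomic calls up to total duration per (gender, day/night)."""
    query = """
        SELECT s.gender,
               CASE WHEN t.hour BETWEEN 6 AND 17 THEN 'day' ELSE 'night' END AS period,
               SUM(f.duration) AS total_duration
        FROM call_fact f
        JOIN subscriber s ON s.subscriber_id = f.subscriber_id
        JOIN time_dim t ON t.time_id = f.time_id
        GROUP BY s.gender, period
        ORDER BY s.gender, period
    """
    return connection.execute(query).fetchall()


rollup = duration_by_gender_and_period(conn)
```

Because the fact table keeps one row per call, the same data could be rolled up by date, age or district simply by joining a different dimension table.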

6.3. Sequence Diagram

6.3.1. Sequence Diagram of login system

A user visits the website. Only after the user logs in against the MySQL server is s/he granted administrative privileges to conduct administrative tasks. The administrator can perform any task within the granted set of privileges. When s/he logs out of the system, the ‘MySQL server’ instance is destroyed, so no further administrative tasks are allowed.

Figure 6.7 Sequence Diagram for System login

6.3.2. Sequence Diagram of Visualization Process

The sequence diagram below shows that the decision maker first creates a connection to the controller of the system and then sends a request to view a desired report, which the controller forwards to the model. The model loads the data required for the operation from the OLAP store, the records are returned to the controller, and they are then displayed in the desired form.

Figure 6.8 Sequence Diagram of Visualization Process

[Figure 6.8 participants: Decision Maker (Template), Controller (View), Model, Database. Messages: Create Connection(), Request Report(), Request Data(), Return Data, Return Report, Display Records, Destroy Connection()]

6.4. Interaction Diagram

In the interaction diagram displayed below, the admin fires a query from the View. The generated query is passed to the Model, which requests the Database accordingly. The Database then returns the data to the Model. Depending on the requirements of the query, different aggregations are performed on the data. After these operations, the data is presented to the View for OLAP visualization.

Figure 6.9 Interaction Diagram for Visualizing query output

The interaction diagram shown below depicts the registration process, which gives users the authority to access the key operations of the system. The first registered super user, the admin, can add other users. The record details of the user are added to the database, and a status message for the registration is displayed. Likewise, the admin can update the record of a user; after the update, the status of the operation is displayed. Hence, this diagram illustrates the ability of the administrator to add a new user or update the record of an existing user.

[Figure 6.9 participants: Admin, View, Model, Database. Messages: 1. Generate Query, 1.1 Query, 1.2 Request Database, 1.3 Data, 1.4 Aggregate Data]

Figure 6.10 Interaction Diagram of user registration

6.5. Class Diagram

A ‘call’ class consists of the call records of the subscribers. Within a prescribed time period, an arbitrary subscriber can generate no calls, one call, or two or more calls; hence the cardinality relationship portrayed in the class diagram. Similar to this one-to-many relation from ‘subscriber’ to ‘call’, the ‘time’, ‘date’ and ‘service’ classes each have a one-to-many relation to the ‘call’ class. This follows from the fact that for any given date, time or service key we can have any number of call records, but the reverse is not valid. A subscriber has an address that is modeled as a composition of the ‘district’, ‘zone’, ‘development region’ and ‘physical region’ classes.

[Figure 6.10 participants: Admin, Registration, Database. Messages: 1. Add User, 1.1 Add Record, 1.2 Status Message, 2. Update User Info, 2.1 Update Record, 2.2 Status Message]

Figure 6.11 Class Diagram

6.6. Activity Diagram

The activity diagram for the report generation process is shown below. Initially, we design the OLAP cube using a snowflake schema. Various OLAP operations like roll-up and drill-down are then performed. Interesting results are visualized in the dashboard, and appropriate reporting follows.

Figure 6.12 Activity Diagram of Report Generation

The activity diagram for user validation is shown below. The system first checks whether the user is registered. An unregistered user can see only the home, about and contact pages, whereas a registered user can add users, delete users, load data, visualize and cluster.

Figure 6.13 Activity Diagram of User Validation

6.7. Deployment Diagram

The runtime processing nodes of our system are represented in the figure below. It contains three nodes: user, application and database.

1) User: It represents anyone who browses the website, either an administrator or an anonymous user.

2) Application: It is the heart of the system and acts as a bridge between the bulky database and the user so as to provide understandable results. It authenticates users in order to grant administrator access. Hence, unlike a general user visiting local web pages, the administrator, once logged in, can:

a. Maintain users

b. Filter data

c. Cluster and mine the data

d. Visualize interesting results

All of these are permitted by the ‘provider’, which corresponds to the controller in the MVC framework. The provider communicates with the ‘command’, which is the model of the application and retrieves data from the database.

3) Database: It is the storehouse of all the detailed data on which the system runs, which in our case is a MySQL database.

Figure 6.14 Deployment Diagram

7. IMPLEMENTATION

7.1. Data Collection

Data mining involves selecting, exploring and modeling large amounts of data to uncover previously unknown patterns, and ultimately comprehensible information, from large databases. The first and foremost task in data mining is therefore to collect the data sets. Mining results obtained from fake data sets are meaningless, because each subsequent mining step is driven by previously mined results. Keeping this in mind, we approached a telecommunication company with a request for access to their data. After the request was approved, they provided CDRs in flat files. One flat file had many fields while the other had fewer. Not all of the fields in the flat files were required for this project, so we removed the unwanted ones. Furthermore, the flat files were not in a format that could be inserted directly into the database; the datasets had to be cleaned and preprocessed. After cleaning, we successfully loaded the data into the database.

7.1.1. Call Detail Records

Every time a call is placed on a telecommunication network, descriptive information about the call is saved as a call detail record. It includes sufficient information to describe the important characteristics of each call. In our case, the records include card-number, service-key, calling-number, called-number, answer-time, clear-time, duration and more.

As the data mining process focuses on extracting knowledge about customers rather than individual phone calls, we perform feature selection and feature creation to generate a summary description of each customer based on the calls they originated, such as:

1. average call duration

2. % of weekend calls

3. % of daytime calls

These features can be used to distinguish between business and residential customers.
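The feature creation step described above can be sketched as follows. This is a minimal illustration, not the project's actual script: the record layout (calling number, answer time, duration in seconds), the function name and the sample numbers are hypothetical, and the definitions of "weekend" (Saturday and Sunday) and "daytime" (08:00 to 17:59) are assumptions exposed as parameters so they can be adapted to the market being studied.

```python
from collections import defaultdict
from datetime import datetime


def summarize_customers(cdrs, weekend_days=(5, 6), day_start=8, day_end=18):
    """Collapse per-call records into one summary record per calling number.

    cdrs: iterable of (calling_number, answer_time, duration_seconds), with
    answer_time formatted as 'YYYY/MM/DD HH:MM:SS'.
    weekend_days uses datetime.weekday() numbering (Monday = 0).
    """
    calls = defaultdict(list)
    for calling_number, answer_time, duration in cdrs:
        started = datetime.strptime(answer_time, "%Y/%m/%d %H:%M:%S")
        calls[calling_number].append((started, duration))

    summary = {}
    for number, records in calls.items():
        total = len(records)
        weekend = sum(1 for started, _ in records if started.weekday() in weekend_days)
        daytime = sum(1 for started, _ in records if day_start <= started.hour < day_end)
        summary[number] = {
            "avg_duration": sum(d for _, d in records) / total,
            "pct_weekend": 100.0 * weekend / total,
            "pct_daytime": 100.0 * daytime / total,
        }
    return summary


# Made-up sample records for illustration.
sample_cdrs = [
    ("9800000001", "2012/08/03 10:00:00", 60),   # Friday, daytime
    ("9800000001", "2012/08/04 20:00:00", 180),  # Saturday, evening
    ("9800000002", "2012/08/06 09:30:00", 45),   # Monday, daytime
]
features = summarize_customers(sample_cdrs)
```

A subscriber whose calls fall mostly on weekdays and in the daytime would show a low `pct_weekend` and a high `pct_daytime`, which is the kind of signal that separates business from residential usage.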

7.1.2. Customer Data

Telecom companies have millions of customers and keep information about each of them, such as name, date of birth, address and gender. This customer information can be used in conjunction with call detail data to improve results.

The tentative database schemas for these tables are given in the appendices. From the massive dataset, we acquired a sample dataset of around 1000 customers with their call detail records for around 120 days.

7.2. ETL Process

Extraction:

In our context, the raw data of customers and their call records is available in text format, i.e. as a .txt file or in any other form from which the data can be exported as text. To get the flat file into the required format, we first obtained the file and converted it into comma- or tab-delimited form so that it could be loaded into the database. Using regular expressions, we omitted the non-data parts such as headers, comments, blocks of blank spaces and blank lines, keeping only the data parts. The resulting output was then mapped to the appropriate fields in the database, completing the extraction phase.
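The filtering described above might look like the following sketch. The layout of the raw file is not given in this report, so the rules here are assumptions: a data row is taken to start with a digit (for example a card number), and fields are taken to be separated by a tab or by two or more spaces, which keeps a timestamp such as 2012/08/03 18:18:18 in one field. The function names and sample text are hypothetical.

```python
import csv
import io
import re

# Assumption: data rows start with a digit; headers and comments do not.
DATA_ROW = re.compile(r"^\d")
# Assumption: fields are separated by a tab or by two or more whitespace characters.
FIELD_SEPARATOR = re.compile(r"\t|\s{2,}")


def extract_rows(raw_text):
    """Drop headers, comments and blank lines; split data rows into fields."""
    rows = []
    for line in raw_text.splitlines():
        stripped = line.strip()
        if not stripped or not DATA_ROW.match(stripped):
            continue
        rows.append(FIELD_SEPARATOR.split(stripped))
    return rows


def to_csv(rows):
    """Serialize the extracted rows as comma-delimited text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up raw export for illustration.
raw = (
    "CARD_NUMBER  CALLING_NUMBER  ANSWER_TIME  DURATION\n"
    "# exported sample\n"
    "\n"
    "1001\t9800000001\t2012/08/03 18:18:18\t42\n"
    "1002  9800000002  2012/08/04 07:05:00  310\n"
)
rows = extract_rows(raw)
csv_text = to_csv(rows)
```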

Transformation:

Various transformations were carried out to bring the data stored in the tables into a form suitable for the data mart. Such transformations are done mainly for the following reasons:

To bring data to a computable form

To represent it appropriately

To boost application-specific performance

Some examples of the transformations performed are as follows:

Succinct gender representation using M for male and F for female

Segregation of a single date-time stamp like 2012/08/03 18:18:18 into the various inherent fields that are required for future analysis, like year (2012), month (08), day (03) etc.

Computation of age from DoB to reduce overhead in age-wise segmentation.
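The three example transformations listed above can be sketched as small functions. The function names and the set of accepted raw gender spellings are hypothetical; the timestamp layout follows the example given in the text.

```python
from datetime import date, datetime


def transform_gender(raw):
    """Map free-form gender values to the compact codes M and F."""
    value = raw.strip().lower()
    if value in ("m", "male"):
        return "M"
    if value in ("f", "female"):
        return "F"
    return None  # unrecognized values are left for the cleaning step


def split_timestamp(stamp):
    """Break a 'YYYY/MM/DD HH:MM:SS' stamp into its component fields."""
    parsed = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
    return {
        "year": parsed.year, "month": parsed.month, "day": parsed.day,
        "hour": parsed.hour, "minute": parsed.minute, "second": parsed.second,
    }


def age_on(dob, reference):
    """Completed years between the date of birth and a reference date."""
    years = reference.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (dob.month, dob.day):
        years -= 1
    return years


fields = split_timestamp("2012/08/03 18:18:18")
```

Storing the age (or an age group) once at load time avoids recomputing it from the date of birth in every age-wise query.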

Loading:

We loaded the data into the MySQL database using a Python script.

7.3. Implementing Customer Segmentation

In this project, we implement customer segmentation via a two-phase clustering method. First, through k-means clustering, customers are clustered into different segments according to their RFM values. Second, using demographic data, each cluster is partitioned again into new clusters [22]. In the first k-means clustering, customers are partitioned on the basis of their usage. After that, within each cluster, we again apply k-means clustering on the basis of the age and gender of the subscribers. This information is later used to build customer profiles, which help the telecommunication company devise effective marketing strategies. Beyond simply understanding customer value in each cluster, the telecom gains opportunities to establish better customer relationship management strategies, improve customer loyalty and revenue, and find opportunities for up-selling and cross-selling.

From the call detail records, average call duration and total call count are used as the monetary and frequency values, respectively. These values are then normalized using the equation

$$\frac{value - min\_value}{max\_value - min\_value} \times 10$$

and k-means clustering is applied to the normalized values. With this approach, high-level, medium-level and low-level customers were obtained. After that, we again applied clustering based on demographic data such as age and gender, where gender is represented as 5 for male and -5 for female. In this way, customers with different profiles were obtained. Based on these profiles, we can apply marketing strategies focused on specific groups. A block diagram representing the overall process is shown below.

Here, the number of clusters can either be specified by the user or derived from the sum of squared errors (SSE). We implemented the first approach, i.e. we clustered the call detail record data into 6 parts and divided each of them again into 3 on the basis of the demographic data. In total, 18 clusters were obtained.

Figure 7.1 Block diagram of two phase clustering

[Figure 7.1 blocks: Call Detail Records, RFM, Customer Demographic Data, Feature Column, K-means clustering (applied twice), Two Phase Clustering, Cluster’s Profiles, Marketing Strategies]
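The two-phase procedure of this section can be sketched in plain Python. This is an illustration under stated assumptions rather than the project's implementation (the project used R): the k-means routine below is a basic Lloyd's algorithm with seeded random initialization, the customer dictionary keys are hypothetical, and age is used unscaled in the second phase, which is an assumption based on the centroids reported in the results.

```python
import math
import random


def scale_0_10(values):
    """Min-max normalization to the range [0, 10]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    return [(v - low) / (high - low) * 10 for v in values]


def kmeans(points, k, iterations=100, seed=0):
    """Basic Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def two_phase_clustering(customers, k_usage, k_demo):
    """Phase 1 clusters on usage; phase 2 re-clusters each group on demographics."""
    m = scale_0_10([c["avg_duration"] for c in customers])
    f = scale_0_10([c["call_count"] for c in customers])
    usage_labels = kmeans(list(zip(m, f)), k_usage)

    segments = [None] * len(customers)
    for usage in range(k_usage):
        idx = [i for i, label in enumerate(usage_labels) if label == usage]
        if not idx:
            continue
        # Gender encoded as 5 (male) / -5 (female), as described in the text.
        demo = [(5.0 if customers[i]["gender"] == "M" else -5.0,
                 float(customers[i]["age"])) for i in idx]
        k = min(k_demo, len(set(demo)))  # cannot ask for more clusters than distinct points
        for i, sub in zip(idx, kmeans(demo, k)):
            segments[i] = (usage, sub)
    return segments


# Made-up customers for illustration.
customers = [
    {"avg_duration": 1.0, "call_count": 100, "gender": "M", "age": 25},
    {"avg_duration": 1.2, "call_count": 110, "gender": "F", "age": 30},
    {"avg_duration": 8.0, "call_count": 900, "gender": "M", "age": 60},
    {"avg_duration": 8.5, "call_count": 950, "gender": "M", "age": 27},
]
segments = two_phase_clustering(customers, k_usage=2, k_demo=2)
```

Each customer ends up with a pair (usage cluster, demographic sub-cluster), so 6 usage clusters with 3 sub-clusters each give at most 18 segments.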

7.4. Implementing Customer Profiling

Call detail records cannot be used directly for data mining, since the goal of data mining applications is to extract knowledge at the customer level, not at the level of individual phone calls. Thus, the call detail records associated with a customer must be summarized into a single record that describes the customer’s calling behavior. To determine the behavior of an individual customer, we used the following parameters:

How? : How can a customer cause a call detail record? By making a voice call or sending an SMS.

When? : When does a customer call? A business customer can call during office hours in the daytime, or in private time in the evening, at night and during the weekend.

How long? : How long is the customer calling?

How often? : How often does a customer call or receive a call?

From these parameters, we generated features such as the day-wise received/dialed call pattern, the hour-wise received/dialed call pattern, the hour-wise distribution of dialed calls and SMS sent, the total duration of dialed calls and the count of messages sent. Based on these features, we developed a profile of each customer. Such a profile describes the call and message pattern of the user over a period of time [8].
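As an illustration of how such a profile could be assembled for dialed calls, the sketch below counts one subscriber's calls per hour and per day of week and totals the duration. The record layout, function name and sample numbers are hypothetical; received calls and SMS would be handled the same way with their own record sets.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def dialed_call_profile(cdrs, number):
    """Hour-wise and day-wise dialed-call counts plus total duration for one subscriber.

    cdrs: iterable of (calling_number, answer_time, duration_seconds), with
    answer_time formatted as 'YYYY/MM/DD HH:MM:SS'.
    """
    by_hour = Counter()
    by_weekday = Counter()
    total_duration = 0
    for calling_number, answer_time, duration in cdrs:
        if calling_number != number:
            continue
        started = datetime.strptime(answer_time, "%Y/%m/%d %H:%M:%S")
        by_hour[started.hour] += 1
        by_weekday[WEEKDAYS[started.weekday()]] += 1
        total_duration += duration
    return {
        "by_hour": dict(by_hour),
        "by_weekday": dict(by_weekday),
        "total_duration": total_duration,
    }


# Made-up sample records for illustration.
sample_cdrs = [
    ("9800000001", "2012/08/03 10:00:00", 60),
    ("9800000001", "2012/08/04 20:00:00", 180),
    ("9800000002", "2012/08/06 09:30:00", 45),
]
profile = dialed_call_profile(sample_cdrs, "9800000001")
```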

7.5. Implementing Churn Prediction

Using a two-tailed hypothesis test at the 10% level of significance, probable churns can be anticipated by following these four steps:

Step 1: State the hypotheses. The null hypothesis in our case is that the given customer is not a churning customer.

Step 2: Set the criteria for a decision. Call diameter (the call diameter of a given time interval is defined as the number of unique subscribers from whom calls are received or to whom calls are placed) was taken as the measure of call behavior related to churn. The decision is taken at the 90% confidence level. If the computed value of z is less than the tabulated value, it indicates churn.

Step 3: Compute the test statistic. In our case, the test statistic is ‘z’, obtained by conducting a z-test.

Step 4: Make a decision. The decision is made based on the region of the Gaussian curve in which the computed point lies.

Figure 7.2 Normal distribution for churn prediction
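The four steps can be sketched as follows. The report does not state how z is computed, so this sketch makes two assumptions: the z-score compares a subscriber's most recent call diameter with the mean and sample standard deviation of that subscriber's earlier intervals, and churn is flagged when z falls below the negative critical value (a marked drop in call diameter). The function names and sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def churn_z_score(history, recent):
    """z-score of the recent call diameter against the subscriber's own history.

    history: call diameters of earlier intervals (at least two values).
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0  # simplification: no variation in history, treated as no evidence
    return (recent - mu) / sigma


def is_probable_churn(history, recent, significance=0.10):
    """Flag churn when z lies in the lower rejection region of a two-tailed test."""
    # Two-tailed critical value: about 1.645 at the 10% level of significance.
    critical = NormalDist().inv_cdf(1 - significance / 2)
    return churn_z_score(history, recent) < -critical


# Made-up call diameters for illustration.
history = [20, 22, 18, 21, 19]
dropped = is_probable_churn(history, recent=10)
steady = is_probable_churn(history, recent=19)
```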

7.6. Implementing Report Visualization

Visualization is the process of presenting data stored in the database as different charts, such as bar charts, line charts and pie charts. The database contains all the data we have stored, so the challenge is to filter out the data required for a given chart. The data is first filtered and fetched from the database and then processed to obtain the result required for plotting. Processing at this point includes different aggregations such as sum, count and average.

7.6.1. Demographic Visualization

In demographic visualization, the customers’ demographic data, which includes name, address, age etc., is visualized. All of this data is stored in the subscriber table of our database, and demographic visualization is performed by fetching data from this table.

7.6.2. Call Pattern Visualization

In call pattern visualization, call duration and call count are plotted against age, gender, day of week, hour of day, month etc. To do this, the required data is first filtered from the database according to the parameters specified by the user. The data thus obtained is then processed to get the result for either call duration or call count.
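The filter-then-aggregate step described above can be sketched as a single function. In the project the filtering is done by the database; here plain Python dictionaries stand in for the query result, and the function name, field names and sample rows are all hypothetical.

```python
from collections import defaultdict


def aggregate_calls(calls, group_by, measure="duration", **filters):
    """Filter call rows by exact field matches, then aggregate per group.

    measure: 'duration' sums call durations, 'count' counts calls.
    """
    if measure not in ("duration", "count"):
        raise ValueError("measure must be 'duration' or 'count'")
    totals = defaultdict(int)
    for call in calls:
        # Skip rows that do not match every user-specified filter.
        if any(call.get(field) != wanted for field, wanted in filters.items()):
            continue
        totals[call[group_by]] += call["duration"] if measure == "duration" else 1
    return dict(totals)


# Made-up rows for illustration.
calls = [
    {"gender": "M", "age_group": "25-30", "day_of_week": "Friday", "duration": 60},
    {"gender": "M", "age_group": "25-30", "day_of_week": "Saturday", "duration": 120},
    {"gender": "M", "age_group": "40-45", "day_of_week": "Friday", "duration": 300},
    {"gender": "F", "age_group": "25-30", "day_of_week": "Friday", "duration": 90},
]
male_duration_by_age = aggregate_calls(calls, "age_group", "duration", gender="M")
count_by_day = aggregate_calls(calls, "day_of_week", "count")
```

The resulting dictionaries map directly onto the categories and bar heights of a chart.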

7.6.3. Time Series Visualization

Time series visualization is an effective means of visualizing call usage patterns over time. In telecommunications, a large number of CDRs is generated every instant. These CDRs have to be visualized and analyzed before any further strategies can be made. To implement time series visualization in our project, we used the googleVis package available for the R language. The googleVis package supports visualization types such as bubble charts, bar charts and line charts with animation of the data. To accomplish this visualization, we fed the data to the package and specified the id, which is the mobile number in our case. The package then generates the chart on a web page when connected to the internet. With the generated chart, one can visualize the daily call count along with the total dialed duration on a daily basis.

7.7. Data Analysis through Clustering

We perform various univariate cluster analyses using the k-means clustering algorithm. To implement k-means, we use the stats package available in R. For each analysis, we first set a problem statement and then try to verify the statement through clustering. We also use the sqldf package to perform SQL queries and then cluster on the basis of the duration variables. The results of the clustering are included in the results section.

In this project, we built a general interface through which a non-technical person is also able to perform clustering analysis and visualize the result as a graph. Initially, we choose the optimal number of clusters, calculated from the sum of squared errors (SSE) curve; later, the user can enter the number of clusters he/she wishes to see. The stopping criterion for finding the optimal number of clusters is $|SSE_n - SSE_{n-1}| < 0.1 \times (SSE_1 - SSE_2)$.
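A minimal sketch of this stopping rule is given below. It takes the absolute difference between successive SSE values, since SSE decreases as clusters are added. Whether the optimal count should be reported as n or n - 1 once the criterion is met is not stated in the report; returning n is an assumption, as are the function name and the sample SSE values.

```python
def optimal_cluster_count(sse):
    """Pick a cluster count from an SSE curve.

    sse[i] is the SSE obtained with i + 1 clusters, so sse[0] is SSE_1.
    Returns the first n at which the change from n - 1 to n clusters falls
    below 10% of the initial improvement SSE_1 - SSE_2; if that never
    happens, the largest cluster count tried is returned.
    """
    if len(sse) < 3:
        raise ValueError("need SSE values for at least 3 cluster counts")
    threshold = 0.1 * (sse[0] - sse[1])
    for n in range(3, len(sse) + 1):
        if abs(sse[n - 1] - sse[n - 2]) < threshold:
            return n
    return len(sse)


# Made-up SSE curve: threshold is 0.1 * (100 - 50) = 5.
best_k = optimal_cluster_count([100, 50, 30, 20, 17, 16])
```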

7.8. Development Environment

The Sublime Text editor [version 2.0.2] is used for Python, and the RStudio IDE [version 0.97.311] is used for R. We also use Git for maintaining the repository. MySQL Workbench [version 5.2.47] is used for maintaining the database.

Development Environment

1. 1 laptop with Intel Core i7, 4 GB RAM and 2 GB NVIDIA graphics

2. 1 laptop with Intel Core 2 Duo and 2 GB RAM

3. 1 laptop with Intel Core 2 Duo, 2 GB RAM and ATI graphics

4. 1 laptop with Intel Core i3 and 2 GB RAM

7.9. Project Activities and Milestones

The major milestones of the project are listed below.

Table 7.1 Project activities and milestones

S.N.  Milestone                             Date of Completion
1.    Project Analysis & Feasibility Study  Dec 15, 2012
2.    Data Collection                       Jan 25, 2013
3.    Data Pre-processing                   Feb 15, 2013
4.    Data Mart Design                      Feb 28, 2013
5.    Data Analysis & System Design         May 28, 2013
6.    Coding & Testing                      June 28, 2013
7.    Visualization                         July 2013
8.    Documentation                         August 2013

8. TESTING

The system has been tested since its inception for quality assurance. The traditional approach of testing software only after completion of the project was not adopted; rather, testing was carried out throughout development. The various testing steps implemented are as follows.

8.1. Unit Testing

Unit testing, as the name suggests, is a method for testing the smallest part of a given source code, which can be termed a unit. It checks whether the unit is fit for use, i.e. whether it is free of bugs. In a procedural programming language like C, unit testing is used to test an individual function or procedure, whereas in object-oriented programming it is used to test classes.

We implemented a modular design in which each component is independent and swappable. So, we performed unit tests on each of the components separately.

8.2. Integration Testing

Integration testing is a systematic technique for constructing the program structure while at the same time conducting tests to uncover errors associated with interfacing. The objective is to take unit-tested components and build the program structure dictated by the design. The unit tests were repeated using the actual system components instead of test doubles. Because the interfaces were properly constructed, very little work was needed to turn the unit tests into integration tests.

8.3. Black Box Testing

This testing is generally performed to see whether the outputs of the application are as expected. The visuals of the output pages, the formatting of the displayed digits and their values were checked for validity, and the necessary corrections were made accordingly.

8.4. Alpha Testing

The system was tested by the project developers, individually and in groups, so as to find errors. Functionality tests were carried out to check whether the system satisfied the functional requirements specified in the SRS document prepared during the system analysis phase.

8.5. Performance Testing

Performance testing was used to analyze the system behavior in various hardware and software configurations. During the tests, it was found that loading and displaying the list of companies took a noticeable amount of time and needed some consideration, while the performance of the other modules was acceptable. The system ran smoothly on browsers such as Google Chrome, Internet Explorer and Mozilla Firefox.

8.6. Documentation Testing

The documents of each phase of the software development process were verified by the project

supervisor for their consistency. Each of the team members reviewed the documentation to

confirm the validity of its parts.

9. RESULTS & CONCLUSIONS

9.1. Customer Segmentation

First, we performed k-means clustering on the RFM values and obtained the results shown below.

Table 9.1 RFM Clustering

Cluster Number  Avg_call_duration in minutes (m)  Total number of calls (f)  Number of Customers
1 (Black)       8.29                              390                        8
2 (Red)         3                                 316                        78
3 (Green)       1.44                              287                        179
4 (Blue)        2                                 717                        41
5 (cyan)        1.58                              465                        107
6 (purple)      1.484                             1626                       6

The clusters are plotted in the figure below.

Figure 9.1 K-means clustering using RFM method

We then applied k-means clustering again within each cluster; the results obtained are as follows.

Table 9.2 Two phase clustering with demographic data

Cluster1 (black)  Cluster2 (Red)  Cluster3 (Green)  Cluster4 (Blue)  Cluster5 (purple)  Cluster6 (cyan)
(-5, 24) 1        (1, 61) 10      (5, 29) 104       (4.31, 27) 29    (3, 69) 10         (-5, 30) 2
(5, 29) 5         (2.17, 26) 39   (2.7, 57) 35      (5, 69) 3        (2.64, 43) 34      (5, 29) 1
(5, 59.5) 2       (3.62, 38) 29   (-5, 31) 40       (1.6, 40) 9      (3.25, 27) 63      (5, 26.33) 3

Finally, by analyzing all of the described features, a profile of each cluster can be constructed. This profile is shown in the table below. We compare the clusters on different attributes: RFM rank, age rank and largeness rank (18 for the highest RFM value and 3 for the lowest, 18 for the lowest age and 1 for the highest, 18 for the largest cluster and 1 for the smallest).

Table 9.3 Cluster comparisons regarding different attributes

Cluster No.  RFM Rank  Age Rank  Largeness Rank  Sum
11           15        18        2               35
12           15        11        7               33
13           15        4         4               23
21           9         3         9               21
22           9         17        15              41
23           9         8         11              28
31           3         12        18              33
32           3         5         14              22
33           3         9         16              28
41           12        14        12              38
42           12        2         6               20
43           12        7         8               27
51           6         1         10              17
52           6         6         13              25
53           6         15        17              38
61           18        10        3               31
62           18        13        1               32
63           18        16        5               39

For example, users in cluster 11 are placed in the eighteenth position in the age rank, which means this cluster is the youngest of all the clusters. However, its largeness rank is second to last, so it is not beneficial to invest in this group. In this way, cluster comparison helps us define schemes that are likely to be effective for each group.

9.2. Customer Profiling

To analyze dialed call duration over the days of the week, we first queried the database to extract the required data. The visualization of the extracted data is shown in Figure 9.2. From this visualization, we found that the dialed call pattern of the user is not uniform across days: the subscriber dialed the most calls at weekends and the fewest during mid-week.

Figure 9.2 Distribution of Dialed call by 9849291555 over day of week

To visualize dialed calls on an hourly basis, we first aggregated one month of CDRs by hour and then plotted the aggregated data, as shown in Figure 9.3. From this output, we found that the subscriber’s peak call time is 7 am during the daytime and 8 pm during the nighttime.

Figure 9.3 Distribution of Dialed call by 98489291555 Hour wise

To visualize the message sending pattern, we first extracted the aggregated data from the database. From the visualization, we found that the subscriber’s message traffic peaks at 8 pm; apart from this time, the subscriber sends very few messages.

Figure 9.4 Distribution of Message send by 98489291574 Hour wise

9.3. Churn Prediction

We generated a list of the card numbers of the subscribers in our dataset, in decreasing order of churn likelihood. On selecting a card number, one can view the details of that subscriber’s call behavior over the time span to see whether the subscriber exhibited attrition. To construct the model, we used the oldest 70% of the records and tested the behavior against the most recent 30% of the records with respect to time. The accuracy, computed as

$$\text{Accuracy} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \times 100\%$$

was 65% on average over the tests performed, so the model can be regarded as satisfactory, though not very robust.
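The measure used above can be computed as in the sketch below. Note that the ratio of true positives to all predicted positives is commonly called precision; the function name follows the report's term "accuracy". Representing predictions and actual churners as sets of card numbers is an assumption made for illustration, as are the sample values.

```python
def churn_accuracy(predicted, actual):
    """Percentage of predicted churners that actually churned.

    predicted, actual: sets of card numbers.
    Computes TP / (TP + FP) * 100, the measure used in this section.
    """
    true_positive = len(predicted & actual)
    false_positive = len(predicted - actual)
    if true_positive + false_positive == 0:
        return 0.0  # nothing was predicted as churn
    return true_positive / (true_positive + false_positive) * 100


# Made-up card numbers: 2 of the 4 predicted churners actually churned.
score = churn_accuracy({1001, 1002, 1003, 1004}, {1001, 1002, 1005})
```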

9.4. Reporting

Reporting is the process of presenting facts about the data through visualization. Different visualization reports are generated in our application on the basis of call count and call duration against age, gender, day of week, month, hour of day etc. A two-level drill-down is also maintained to provide the user with more information about a particular part of the data. For example, the total call duration of male, female and all subscribers is first shown in a graph, as in figure 9.5, which makes it clear that most of the calls are made by males and that females are less active. There are two possible reasons: either female subscribers do not prefer calling for long durations, or they are few in number. Here, the sample of female subscribers is very small compared to that of males. On clicking the bar for males, details about the male subscribers can be viewed, according to what the user selects in the form. Here, the detail is the age group. Most of the calls by males are made by those aged 20 to 35, and the peak age group is 25 to 30, as shown in figure 9.6. Details for female subscribers and for all subscribers can be visualized similarly.

Figure 9.5 Gender vs Total call duration

Figure 9.6 Age Group vs Total call duration for male subscriber

By visualizing the average call duration of different age groups, we found that the 40 to 45 age group has the highest average call duration, as shown in figure 9.7. On visualizing the details of this age group, we see that their average call duration is increasing month by month, as shown in figure 9.8, which suggests that they are increasingly attracted to telecommunication services.

Figure 9.7 Age Group vs Average Call Duration

Figure 9.8 Monthly Average Call Duration for age group 40-45

If the analysis is carried out on an hourly basis, the maximum number of calls is made in the morning and in the evening: from 8 am to 12 noon in the morning and from 6 to 8 pm in the evening. The peak hour of the day is 7-8 pm, as shown in figure 9.9. On visualizing further details of this peak hour, we see that the 25 – 30 age group uses the call services the most at that time, as shown in figure 9.10.

Figure 9.9 Call Count vs Hours of day

Figure 9.10 Call Count vs Age Group for 6 to 7 pm

9.5. Data Analysis Through Clustering

Some results obtained from clustering, along with their problem statements, are listed below.

1) Clustering average call duration in 24 hours.

From this clustering, we found that the average call in non-business hours is longer than in business hours. In business hours, the average call duration of males is less than that of females, whereas in non-business hours the average call duration of males is greater than that of females.

Figure 9.11 Clustering result of average call duration of day

2) Classifying the total calls originated by each subscriber in a day according to gender
and age group.

Result: - Male subscribers in the 25-40 age group have a higher call duration in non-business
hours than the other age groups, whereas the <25, 25-40 and 40-55 age groups all have a high
call duration in business hours. The over-55 age group has a lower call duration than the other
age groups. In business hours male and female subscribers show similar calling behavior, but in
non-business hours male subscribers have a higher call duration than female subscribers.

3) Finding the received call pattern based on age group and gender in business and
non-business hours.

Result: - As with dialed calls, received calls in non-business hours are longer than in
business hours. In business hours, male subscribers under 25 have a slightly higher receiving
duration than the others, and female subscribers aged 40-55 have a higher receiving duration.
In non-business hours, a small number of male subscribers over 55 have a higher receiving
duration, whereas among female subscribers the 40-55 age group is ahead of the other age groups.

4) Value-based segmentation and analysis of customers based on dialed call duration.

Result: - Subscribers are first arranged in descending order of total dialed call duration.
We then partitioned them on the basis of total call duration as Platinum (top 1%), Gold (4%),
Silver (15%), Bronze (40%) and Mass (40%). However, because of the small number of active
subscribers, this partition did not work well. So we instead classified subscribers as
Gold (10%), Silver (10%), Bronze (40%) and Mass (40%). The call durations of these groups were
then classified, and the resulting clusters were analyzed.

5) SMS sending pattern of business customers and non-business customers.

Result: - Subscribers with high business call usage have a lower SMS sending rate than
subscribers with low business call usage.

6) Distribution of (Total received call duration) / (Total dialed call duration) according to
age group, gender and customer label.

Result: - Male subscribers under 25 in the Silver group dialed more calls than they received,
and female subscribers aged 25-40 in the Bronze group dialed more calls than they received.
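
The value-based segmentation in result 4 can be sketched as a ranking followed by a split at cumulative percentage boundaries. The snippet below is a minimal illustration under stated assumptions, not the project's code: the function name segment_by_value, the input dictionary and the sample durations are hypothetical, and the report does not say how ties or rounding at tier boundaries were handled, so rounding the cumulative share is an arbitrary choice here. The tier shares (Gold 10%, Silver 10%, Bronze 40%, Mass 40%) are the ones given in the text.

```python
# Tier shares from the text, listed from highest to lowest value.
TIERS = [("Gold", 0.10), ("Silver", 0.10), ("Bronze", 0.40), ("Mass", 0.40)]


def segment_by_value(total_duration, tiers=TIERS):
    """Map each subscriber id to a tier label.

    total_duration: dict of subscriber id -> total dialed call duration.
    Subscribers are ranked in descending order of duration and cut at the
    rounded cumulative share of each tier (an assumption of this sketch).
    """
    ranked = sorted(total_duration, key=total_duration.get, reverse=True)
    n = len(ranked)
    labels = {}
    start = 0
    cumulative = 0.0
    for index, (name, share) in enumerate(tiers):
        cumulative += share
        is_last = index == len(tiers) - 1
        # The last tier takes everyone left, so no subscriber is dropped.
        end = n if is_last else round(cumulative * n)
        for subscriber in ranked[start:end]:
            labels[subscriber] = name
        start = end
    return labels


# Hypothetical totals for ten subscribers: s0 has the longest total duration.
sample = {f"s{i}": 1000 - 100 * i for i in range(10)}
for subscriber, tier in segment_by_value(sample).items():
    print(subscriber, tier)
```

With very few subscribers the top tiers can come out empty, which is consistent with the observation above that the finer Platinum/Gold/Silver split was not useful on a small active subscriber base.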

9.6. Problem Faced

While doing the project, the main problems that arose were at the tactical level. Lacking a
standard design method, we had to repeat several steps in search of a better outlook. Lack of
domain knowledge was another problem faced during the project.

9.7. Conclusion

Analytical CRM applications are empowering telecommunication companies to retain and attract
customers, which eventually brings long-term competitive economic advantages. Developing such
applications for use at the tactical and strategic organizational levels of a telecommunication
company is of great value. Mining and analysis should precede the formulation and execution of
the influential business and marketing strategies that determine the company's future position.

Data volume in the telecommunication industry is growing massively. Retrieving value from these
datasets requires advanced analytics, and this project is an effort in that direction. In the
project, important customer segments can be segregated using clustering and classification
techniques. Churn pattern analysis helps reduce subscriber churn by maintaining an appropriate
level of QoS. These results, when visualized systematically using BI visualization techniques,
facilitate business reporting. Besides, telecommunication decision makers could use the outcomes
obtained from CDR data and subscriber demographic data to gain better insight into customers and
their calling behaviors. Hence, telecommunication companies could launch better marketing and
customer relationship management strategies.

9.8. Limitation and Further Enhancement

A limitation of this project is that data loading time increases as the data size increases. In
addition, our system could not support all customizable queries for visualization and clustering.

In this project, we implemented a two-phase clustering method for customer profiling. Customer
segmentation could be made more accurate by incorporating the lifetime value of the customer
along with the two-phase clustering results. The Apriori algorithm could also be used for rule
generation. The project could be extended further with classification analyses, such as decision
tree analysis for predicting the age group of a subscriber from time of call, gender, SMS sending
rate, etc. Other possible extensions are gender prediction based on call duration and call time,
call tariff rate shifting analysis, competitor analysis and call network analysis. The system
could also be implemented on a distributed system to decrease response time.

10. BIBLIOGRAPHY

[1] "Communications and Media Industry CRM Software Solutions," 23 April 2013. [Online]. Available: http://crmforecast.com/telecom.htm.

[2] "Market Segmentation," 23 April 2013. [Online]. Available: http://www.netmba.com/marketing/market/segmentation/.

[3] D. Camilovic, "Data Mining and CRM in Telecommunications," Serbian Journal of Management, pp. 61-72, 2008.

[4] N. Kapoor, "Optimizing CRM in Telecom with Data Mining," 23 April 2013. [Online]. Available: http://crmsolutions.crmnext.com/2012/09/optimizing-crm-in-telecom-with-data.html.

[5] D. Chandrasekar, "CRM Success Chronicles: The Master Strokes," 23 April 2013. [Online]. Available: http://dineshknowledgeplanet.blogspot.com/2010/10/crm-success-chronicles-master-strokes.html.

[6] M. Rouse, "What is customer segmentation?," [Online]. Available: http://searchcrm.techtarget.com/definition/customer-segmentation. [Accessed 25 August 2013].

[7] "What is customer segmentation?," [Online]. Available: http://www.mindofmarketing.net/2007/05/customer-segmentation-why-exactly-does.html. [Accessed 25 August 2013].

[8] S. Jansen, "Customer Segmentation and Customer Profiling for a Mobile Telecommunications Company Based on Usage Behavior: A Vodafone Case Study," 2007.

[9] "What is churn rate? - Definition," [Online]. Available: http://www.mobileburn.com/definition.jsp?term=churn+rate. [Accessed 25 August 2013].

[10] "What is churn? definition and meaning," [Online]. Available: http://www.businessdictionary.com/definition/churn.html. [Accessed 25 August 2013].

[11] "What is Churn-Rate?," [Online]. Available: http://www.churn-rate.com/. [Accessed 26 August 2013].

[12] L. Alberts, "Churn Prediction in the Mobile Telecommunications Industry," 2006.

[13] "Customer Churn Software: Prediction, Prevention, Analysis & Action | Optimove," [Online]. Available: http://www.optimove.com/learning-center/customer-churn-prediction-and-prevention. [Accessed 23 August 2013].

[14] "k-means clustering," 7 August 2013. [Online]. Available: http://en.wikipedia.org/wiki/K-means_clustering.

[15] M. Steinbach, G. Karypis and V. Kumar, "A Comparison of Document Clustering Techniques," Department of Computer Science and Engineering.

[16] "RFM (customer value)," [Online]. Available: http://en.wikipedia.org/wiki/RFM_(customer_value). [Accessed 13 August 2013].

[17] G. M. Weiss, "Data Mining in the Telecommunications Industry," 28 April 2013. [Online]. Available: http://mtscertification.com/data_mining/Telecom/telcom%20data%20mining.pdf.

[18] I. Sommerville, Software Engineering, Boston: Pearson Education, 2009.

[19] "Feasibility study," 23 April 2013. [Online]. Available: http://en.wikipedia.org/wiki/Feasibility_study.

[20] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Waltham: Morgan Kaufmann Publishers, 2012.

[21] R. Kimball and M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, New York: John Wiley & Sons, 2002.

[22] M. Namvar, M. R. Gholamian and S. KhakAbi, "Two Phase Clustering Method for Intelligent Customer Segmentation," Tehran, Iran, 2010.

APPENDIX A: - DATASET SNIPPETS

Figure A.1 Customer Demographic Data Snippets

Figure A.2 CDR Data Snippets

APPENDIX B: - GANTT CHART

Project – Part A

Figure B.1 Part A Gantt Chart

Project - Part B

Figure B.2 Part B Gantt Chart

APPENDIX C: - OUTPUT SNAPSHOT

1. Age – Gender Distribution

The line chart below shows the age distribution of the subscribers in our data set, broken down by
gender. It shows that the number of male subscribers is higher than the number of female subscribers.

Figure C.1 Age and Gender Distribution of Subscriber

2. Filter Data

The interface below is the filter interface for our data, from which the user can filter data from
the database and carry out analysis.

Figure C.2 Interface for filtering data
