Business Intelligence Using Data Mining
Bribe Payments For Land Registrations
Submitted By:
Hussain Boltwala 61210213
Karthik Vemparala 61210505
Naveen Kumar HS 61210144
Salman Siddiqui 61210626
Smita Chakravorty 61210558
BIDM – Bribing Behaviour for e-Governance Services
INTRODUCTION
PROBLEM STATEMENT
DATA PREPARATION AND VISUALIZATION
THE PREDICTION METHOD
CLASSIFICATION TREES
K-NEAREST NEIGHBOUR
NAÏVE BAYES
CONCLUSION & FURTHER ANALYSIS
Introduction
The project is based on data collected over a period of time from customers who have used e-Governance
services for the land registration process. The resulting framework will be useful for intermediaries, who
can target customers based on demographic criteria. These intermediaries can charge a fee that is
typically lower than the bribe paid and provide a convenient, fast service to the people who are most
likely to pay bribes. This is similar to the freelance notaries outside court houses who charge customers
a fee for guiding them through a legal process. The framework will also provide insights into
customer behaviour and the effectiveness of e-Governance initiatives.
This project also analyses the relationship between bribe payment and differentiating factors such as age,
level of education and place that contribute significantly to the payment of bribes. Our analysis
is based on land registration transactions carried out in Delhi, Haryana and Gujarat. Data was collected via
a handwritten survey, with the respondents being interviewed. This resulted in a lot of
misclassified data, and the group has endeavoured to clean and interpret as many data points as possible
to ensure that a robust model is obtained.
Problem Statement
Predict whether a person availing the e-Governance service will pay a bribe of over INR 100.
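As a minimal sketch of how this categorical outcome could be derived from the survey data, the snippet below flags a record as 1 when the reported bribe exceeds Rs. 100 and as 0 otherwise. The field name `bribe_amount` and the sample records are hypothetical, since the survey's actual column names are not given here.

```python
# Hypothetical records; `bribe_amount` is an assumed field name (amount in INR).
records = [
    {"id": 1, "bribe_amount": 0},
    {"id": 2, "bribe_amount": 100},
    {"id": 3, "bribe_amount": 250},
]

THRESHOLD_INR = 100  # threshold taken from the problem statement


def bribe_over_threshold(amount, threshold=THRESHOLD_INR):
    """Return 1 if the bribe strictly exceeds the threshold, else 0."""
    return 1 if amount > threshold else 0


labels = [bribe_over_threshold(r["bribe_amount"]) for r in records]
print(labels)  # [0, 0, 1]
```

A bribe of exactly Rs. 100 is treated as class 0 here because the target is "over INR 100".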
Data Preparation and Visualization
In order to better understand the key predictors of susceptibility to bribing behaviour, different metrics
were analysed against whether a bribe was paid (categorical) and the amount of bribe paid (numerical). Some
insights are presented below:
1 Below Rs.500
2 Rs. 500-1000
3 Rs.1000-2999
4 Rs.3000-4999
5 Rs.5000-6999
6 Rs.7000-9999
7 More than Rs.10,000
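Coded survey responses such as the brackets listed above can be decoded with a simple lookup table. The sketch below is illustrative only: the function name and the "Unknown" fallback for miscoded entries are assumptions, not part of the original analysis.

```python
# Income bracket codes as listed in the legend above.
INCOME_BRACKETS = {
    1: "Below Rs.500",
    2: "Rs.500-1000",
    3: "Rs.1000-2999",
    4: "Rs.3000-4999",
    5: "Rs.5000-6999",
    6: "Rs.7000-9999",
    7: "More than Rs.10,000",
}


def decode_income(code):
    """Map a survey code to its bracket label; flag anything unexpected."""
    return INCOME_BRACKETS.get(code, "Unknown")


print(decode_income(6))   # Rs.7000-9999
print(decode_income(42))  # Unknown
```

Flagging unexpected codes rather than raising an error makes it easier to count and review miscoded entries from a handwritten survey.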
The amount of bribe paid by people in the higher income brackets (Rs. 7000-9999 and more than Rs. 10,000) is
higher in both Delhi and Haryana. This could possibly indicate that affluent people are generally targeted by
officials.
Gujarat seemed to have the weakest bribing culture, whereas Delhi and Haryana fared badly on most
markers. This is also depicted in the bar chart below, which shows the number of people who paid bribes
(code 1, in pink) vs. those who did not (code 2, in blue). Gujarat has the largest number of non-bribe payers.
From the above plot, we see that most people in Delhi and Haryana have paid bribes between Rs 100-200.
In Delhi, the number of people who paid a bribe was higher among those who availed the services offered by
the TCC less frequently. For Haryana, however, there is no information on service availing frequency and
bribing pattern. The total amount of bribe paid also increases when the services are availed less frequently,
as seen from the bar graph below.
1 Once in 3 Months
2 Once in 6 Months
3 Once in a Year
4 Less than once a year
5 Others
In Delhi, more people paid a bribe on their first trip, and this number decreased as the number of trips to
the TCC increased. Haryana does not follow any discernible pattern.
People who frequented the office at least once every 3 months and made more than one trip appear to have
paid very little in bribes. This may indicate that people who have a high level of familiarity with the office
(and have perhaps built relationships with officials) do not pay much to get their work done. Alternatively,
they may simply be unable or unwilling to pay a bribe and hence have to make more trips to avail the same
services.
If we look at box plots of the age of individuals, split by whether they paid a bribe greater than Rs. 100,
we do not see any discernible pattern.
But if we plot the bribe amount against age brackets, we find that elderly people mostly end up paying
bribes of less than Rs. 200.
1 Illiterate
2 Literate without Education
3 Below Primary
4 Primary
5 Middle
6 Matric/Secondary
7 Higher Secondary/Intermediate
8 Non-Technical Diploma
9 Technical Diploma
10 Graduate & Above
11 Others
The median amount of bribe paid across education levels remains between Rs. 100 and Rs. 150, with the sole
exception of individuals who are “literate without education”. The amount of bribe paid by this group is higher.
The above plot indicates that people in semi-urban areas generally paid much more in bribes than those in
either rural or urban areas.
It came as no surprise that larger pockets of land attracted relatively higher bribe amounts.
Distance from the Land Registry office did not seem to play any significant role in the bribing patterns.
Wage loss, however, did: the higher the loss of wages, the higher the bribe amount.
Whilst the total cost of availing the service appeared to be an important factor, it was ultimately ruled out
since it comprised the total amount paid by the user, including the land registry charges.
Surprisingly, bribe amounts did not appear to be closely tied to satisfaction levels, with Delhi and Haryana
reporting the most data. This could indicate that bribing is considered a part of any government transaction
and has no bearing on overall perceptions of satisfaction.
From a service provider’s perspective, most of the bribes given were under Rs. 100. This group is not
considered the target market, and only those people who would pay over Rs. 100 are considered in
this study.
Also, most bribes were paid in order to expedite the process. It was therefore logical to look at predictors
that would cause the individual to spend more time at the land registry office.
Total Bribes Paid
The Prediction Method
Classification Trees
Since there are a lot of variables, we decided to run a classification tree to find out which predictor
variables are most relevant. The variables considered were wage loss, service charges, wait time, total
payment, age, level of education, occupation, mode of travel, number of trips made to the TCC, travel time,
and reason for bribe payment (which is largely to expedite the process).
Certain predictors above are not relevant for a prediction model. For example, the reason for bribe payment
will not apply, as it will not be available at the time of prediction. Also, a person who has already paid a
bribe may not want to avail the services of an intermediary. However, a person who tried to avail the services
previously but faced a long wait time might be more inclined to use the services of a broker.
[Figure: classification trees (Full Tree and Pruned Tree). Split variables legible in the figure: travel_mode, serv_charge, wait_time, total_paymen, expedite_pro, Occupation, wage_loss.]
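To illustrate how a classification tree chooses a split value on a numeric predictor, the sketch below searches for the threshold that minimises the weighted Gini impurity of the two resulting groups. The report does not state which splitting criterion its software used, so Gini impurity is an assumption here, and the wait times and labels are hypothetical values chosen for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one numeric predictor."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # start from the unsplit impurity
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        threshold = (lo + hi) / 2  # midpoint between consecutive values
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical wait times (minutes) and 'Bribe > 100' labels, for illustration only.
wait_time = [10, 20, 30, 60, 85, 95, 120, 150]
paid_over_100 = [0, 0, 0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(wait_time, paid_over_100)
print(threshold, impurity)  # 90.0 0.0
```

A full tree repeats this search over every predictor at every node; pruning then removes splits that do not improve performance on validation data.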
K-Nearest Neighbour
Running K-NN (k=1) with the above predictor variables, we get an error rate of 12.4% on the validation data
and 10.7% on the test data.
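The summary reports for this model score 1100 training, 660 validation and 440 test records, which corresponds to a 50/30/20 partition of the 2200 records. A minimal sketch of such a partition follows, assuming a simple random shuffle; the report does not state how records were assigned, and the seed value is arbitrary.

```python
import random


def partition(records, fractions=(0.5, 0.3, 0.2), seed=12345):
    """Shuffle records and split them into training, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_valid = round(n * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains goes to the test set
    return train, valid, test


train, valid, test = partition(range(2200))
print(len(train), len(valid), len(test))  # 1100 660 440
```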
Variables
# Input variables: 11
Input variables: Age, Lev_Education, Occupation, travel_mode, no_of_trip, travel_time, wait_time, wage_loss, serv_charge, expedite_proc, total_payment
Output variable: Bribe > 100
Cut off Prob.Val. for Success (Updatable): 0.5

Training Data scoring - Summary Report (for k=1)
Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               168     0
0               0       932
Error Report
Class      # Cases   # Errors   % Error
1          168       0          0.00
0          932       0          0.00
Overall    1100      0          0.00

Validation Data scoring - Summary Report (for k=1)
Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               80      35
0               47      498
Error Report
Class      # Cases   # Errors   % Error
1          115       35         30.43
0          545       47         8.62
Overall    660       82         12.42

Test Data scoring - Summary Report (for k=1)
Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               52      23
0               24      341
Error Report
Class      # Cases   # Errors   % Error
1          75        23         30.67
0          365       24         6.58
Overall    440       47         10.68
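The 0.00% training error is expected for k=1: every training record is its own nearest neighbour, so the validation and test figures are the meaningful ones. A minimal nearest-neighbour sketch illustrates this, assuming Euclidean distance on normalised numeric predictors (the report does not state the distance measure or whether predictors were normalised); the data values are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=1):
    """Classify x by majority vote among its k nearest training records."""
    distances = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already-normalised predictors (e.g. wait time, wage loss).
train_X = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.7)]
train_y = [0, 0, 1, 1]

# With k=1 each training record matches itself at distance 0,
# so scoring the training data reproduces the labels exactly.
train_pred = [knn_predict(train_X, train_y, x, k=1) for x in train_X]
print(train_pred == train_y)  # True

# A new record is assigned the class of its closest training record.
print(knn_predict(train_X, train_y, (0.85, 0.85), k=1))  # 1
```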
Naïve Bayes
The Naïve Bayes method resulted in a higher error rate than the K-NN method: approximately 16.7% on the validation data and 15% on the test data.
Variables
# Input variables: 11
Input variables: Age, Lev_Education, Occupation, travel_mode, no_of_trip, travel_time, wait_time, wage_loss, serv_charge, expedite_proc, total_payment
Output variable: Bribe > 100

Prior class probabilities (according to relative occurrences in training data)
Class   Prob.
1       0.152727273   <-- Success Class
0       0.847272727

Cut off Prob.Val. for Success (Updatable): 0.5

Training Data scoring - Summary Report
Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               148     20
0               91      841
Error Report
Class      # Cases   # Errors   % Error
1          168       20         11.90
0          932       91         9.76
Overall    1100      111        10.09

Validation Data scoring - Summary Report
Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               79      36
0               74      471
Error Report
Class      # Cases   # Errors   % Error
1          115       36         31.30
0          545       74         13.58
Overall    660       110        16.67

Test Data scoring - Summary Report
Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               49      26
0               40      325
Error Report
Class      # Cases   # Errors   % Error
1          75        26         34.67
0          365       40         10.96
Overall    440       66         15.00
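The sketch below shows how a Naïve Bayes classifier combines a prior class probability with per-predictor conditional probabilities and then applies the 0.5 cut-off. It is a minimal illustration, not the software's implementation: it assumes categorical predictors and add-one (Laplace) smoothing, and the records and the `wants_expedited_service` predictor are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Estimate class priors and per-predictor value counts from categorical data."""
    class_counts = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    # value_counts[class][predictor_index][value] -> count
    value_counts = {c: defaultdict(Counter) for c in class_counts}
    for row, c in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[c][j][value] += 1
    return priors, class_counts, value_counts


def posterior_success(row, model, levels, success=1):
    """Posterior probability of the success class, with add-one smoothing."""
    priors, class_counts, value_counts = model
    scores = {}
    for c in priors:
        score = priors[c]
        for j, value in enumerate(row):
            score *= (value_counts[c][j][value] + 1) / (class_counts[c] + levels[j])
        scores[c] = score
    return scores[success] / sum(scores.values())


# Hypothetical categorical records: (travel_mode, wants_expedited_service).
rows = [("bus", "yes"), ("bus", "no"), ("car", "yes"),
        ("car", "yes"), ("bus", "no"), ("car", "no")]
labels = [0, 0, 1, 1, 0, 0]
levels = [2, 2]  # number of distinct values per predictor

model = train_naive_bayes(rows, labels)
p = posterior_success(("car", "yes"), model, levels)
predicted = 1 if p >= 0.5 else 0  # 0.5 cut-off, as in the summary reports
print(round(p, 3), predicted)  # 0.717 1

# Prior for the success class implied by the training counts in the report.
print(round(168 / 1100, 9))  # 0.152727273
```

The prior of 0.1527 is simply the share of training records in class 1 (168 of 1100), which is why the report describes it as based on relative occurrences in the training data.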
Conclusion & Further Analysis
When we started with the raw data, a tremendous amount of clean-up and classification was needed to make the
data usable. We also had to define our goals clearly, keeping in mind the practicality and usefulness of the
model we were building.
Initially the idea was to estimate the amount of bribe a person would pay. However, a relatively small number
of people reported paying bribes, and many of those bribes were very small amounts. It was therefore more
useful to classify records that paid over a certain threshold (in our case, Rs. 100) and to create a model
around this end goal, i.e. a categorical ‘Y’ of ‘Bribe > 100’.
Initial Benchmark    Records       %
# Paid Bribes        397 / 2200    18.0%
# Paid > 100         358 / 2200    16.3%
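These benchmark shares also define what a naive rule would achieve. The short calculation below, using only the counts quoted above, shows that classifying every record as "did not pay over Rs. 100" would be wrong for about 16.3% of records while identifying none of the actual payers. The variable names are illustrative.

```python
TOTAL_RECORDS = 2200
PAID_ANY_BRIBE = 397
PAID_OVER_100 = 358

share_any = PAID_ANY_BRIBE / TOTAL_RECORDS
share_over_100 = PAID_OVER_100 / TOTAL_RECORDS

# Naive rule: predict the majority class ("did not pay over Rs. 100") for everyone.
naive_error = share_over_100   # every actual payer is misclassified
naive_sensitivity = 0.0        # no payer is ever identified

print(f"{share_any:.1%} {share_over_100:.1%}")  # 18.0% 16.3%
print(f"naive error {naive_error:.1%}, sensitivity {naive_sensitivity:.0%}")
```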
The results of the prediction models on the validation data are as follows:
Method Used    Error Rate   Accuracy   Sensitivity   Specificity
K-NN           12.4%        87.6%      69.6%         91.4%
Naïve Bayes    16.7%        83.3%      68.7%         86.4%
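Each of these measures follows directly from a confusion matrix. As a worked check, the sketch below recomputes them from the validation-data counts in the K-NN and Naïve Bayes summary reports, treating class 1 ('Bribe > 100') as the success class and rounding to one decimal place. The function and variable names are illustrative.

```python
def metrics(tp, fn, fp, tn):
    """Error rate, accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "error": (fn + fp) / total,
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of actual payers identified
        "specificity": tn / (tn + fp),   # share of non-payers identified
    }


# Validation-data confusion matrices from the summary reports.
knn = metrics(tp=80, fn=35, fp=47, tn=498)
nb = metrics(tp=79, fn=36, fp=74, tn=471)

for name, m in (("K-NN", knn), ("Naive Bayes", nb)):
    print(name, {k: round(100 * v, 1) for k, v in m.items()})
# K-NN {'error': 12.4, 'accuracy': 87.6, 'sensitivity': 69.6, 'specificity': 91.4}
# Naive Bayes {'error': 16.7, 'accuracy': 83.3, 'sensitivity': 68.7, 'specificity': 86.4}
```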
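Compared with the initial benchmark, in which about 16% of records paid more than Rs. 100, K-NN reduced the
overall error rate to about 12%, whereas the Naïve Bayes error rate (15-17%) was close to what a rule
classifying every record as a non-payer would achieve. Unlike such a rule, however, both models correctly
identify roughly 69% of the people who paid more than Rs. 100 in the validation data. K-NN yielded better
results than Naïve Bayes on both the validation and the test data.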
The predictors of interest are listed below. Each could be estimated or determined by direct or indirect
probing by the service provider, or is already known to the provider (e.g. the official service charge). The
idea of this model is that the service provider approaches suitable prospects and finds out the relevant
information for each parameter, mostly through conversation.
Predictor                             Method of Determination
Age                                   Estimated / Indirectly determined from conversation
Level of Education                    Estimated / Indirectly determined from conversation
Occupation                            Directly queried from prospect
Mode of Travel                        Determined from conversation
No of trips                           Determined from conversation
Travel Time                           Determined from the ‘Mode of Travel’ query
Wait Time                             If first trip – communicated to prospect based on general wait times for
                                      the type of service required. If more than one trip – query prospect herself
Wage Loss                             Determined from ‘Occupation’ query
Official Service Charge               Known to service provider – communicated to prospect
Desired Expediency                    Directly queried from prospect
Total Payment (charge) for services   Known to service provider – communicated to prospect
Therefore, using the above probes, a service provider should be well placed to target prospective customers.