Santander Product Recommendation
2016.12.13
Korean Wave
Contents
1. Project Objective
2. Data Introduction
3. Analysis Environment
4. SEMMA Process
1. Project Objective
Kaggle competition: Santander Product Recommendation
Data: 2015.01 - 2016.05
Prediction: 2016.06
Introducing Our Team
Kaggle team: Korean Wave
Roles shared among the members include EDA.
2. Data Introduction
Santander customer data
2015.01 ~ 2016.05: Training set
2016.06: Test set
Size: 2GB, 13,647,309 rows, 48 columns (24 customer features / 24 products)
Feature                 Description
fecha_dato              date
ncodpers                cust_id
ind_empleado            emp_index
pais_residencia         cust_country
sexo                    sex
age                     age
fecha_alta              cust_firstdate
ind_nuevo               new_cust (registered within the last 6 months)
antiguedad              cust_seni
indrel                  cust_pri
ult_fec_cli_1t          cust_pri_date
indrel_1mes             cust_type
tiprel_1mes             cust_relation_type
indresi                 residence_index
indext                  foreigner_index
conyuemp                spouse_index
canal_entrada           channel
indfall                 deceased_index
tipodom                 tipodom
cod_prov                location_code
nomprov                 location_name
ind_actividad_cliente   activity_index
renta                   income
segmento                segment
Feature                 Description
ind_ahor_fin_ult1       Saving Account
ind_aval_fin_ult1       Guarantees
ind_cco_fin_ult1        Current Accounts
ind_cder_fin_ult1       Derivada Account
ind_cno_fin_ult1        Payroll Account
ind_ctju_fin_ult1       Junior Account
ind_ctma_fin_ult1       Más particular Account
ind_ctop_fin_ult1       particular Account
ind_ctpp_fin_ult1       particular Plus Account
ind_deco_fin_ult1       Short-term deposits
ind_deme_fin_ult1       Medium-term deposits
ind_dela_fin_ult1       Long-term deposits
ind_ecue_fin_ult1       e-account
ind_fond_fin_ult1       Funds
ind_hip_fin_ult1        Mortgage
ind_plan_fin_ult1       Pensions
ind_pres_fin_ult1       Loans
ind_reca_fin_ult1       Taxes
ind_tjcr_fin_ult1       Credit Card
ind_valo_fin_ult1       Securities
ind_viv_fin_ult1        Home Account
ind_nomina_ult1         Payroll
ind_nom_pens_ult1       Pensions
ind_recibo_ult1         Direct Debit
3. Analysis Environment
Ubuntu 16.04 LTS / Amazon EC2 / R 3.3.1 / RStudio Server
vCPU: 4 / Memory: 30.5 GB (RAM) / SSD: 80 GB
The 2GB data set was difficult to process with R on a Desktop or Laptop.
4. SEMMA Process
4-1. Sampling
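Sampling: Sampling Data and Partitioning Data
Raw Data Set (nrow=13,647,309): 200,000 of 956,645 customers were sampled.
Sampled Data (nrow=2,852,306) was partitioned as follows:
  Training Data: 2015-01 ~ 2016-04 (nrow=2,657,643)
  Test Data: 2016-05 (nrow=194,663)
Data from 2015.01 through 2016.04 is used as Training Data, and the final goal is to predict 2016.06.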
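The sampling and partitioning step can be sketched as follows. This is a minimal illustration in Python, not the team's R code: the function name `sample_and_split`, the toy rows, and the choice to sample by customer ID (`ncodpers`) are assumptions made for the example.

```python
import random

def sample_and_split(rows, n_customers, split_month, seed=0):
    """Sample customers by ID, then split their rows by month.

    rows: list of dicts with 'fecha_dato' (YYYY-MM-DD) and 'ncodpers'.
    Rows before split_month go to train, rows in split_month go to test.
    Rows after split_month are dropped.
    """
    ids = sorted({r["ncodpers"] for r in rows})
    rng = random.Random(seed)
    keep = set(rng.sample(ids, min(n_customers, len(ids))))
    sampled = [r for r in rows if r["ncodpers"] in keep]
    train = [r for r in sampled if r["fecha_dato"][:7] < split_month]
    test = [r for r in sampled if r["fecha_dato"][:7] == split_month]
    return train, test

# Toy data (hypothetical): 10 customers, one row per month for 5 months.
rows = [
    {"fecha_dato": f"2016-{m:02d}-28", "ncodpers": cid}
    for cid in range(1, 11)
    for m in range(1, 6)
]
train, test = sample_and_split(rows, n_customers=4, split_month="2016-05")
print(len(train), len(test))
```

Sampling whole customers rather than individual rows keeps each sampled customer's monthly history intact across the train/test boundary.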
Sampling: Test Data
Test Data: nrow=194,663

Product            Not held (0)   Held (1)
Saving Account          194,648         15
Guarantees              194,659          4
Current Accounts         77,290    117,373
Payroll Account         179,756     14,907
Junior Account          193,045      1,618
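Of the 24 products in the Test Data (2016.05), some have almost no holders (for example, 4 for Guarantees); 7 products were handled with additional Sampling.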
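To make the imbalance concrete, the holding rate of each product can be computed from the counts on this slide. The snippet below is only an arithmetic illustration; the dictionary layout is an assumption, and the numbers are the ones listed above.

```python
# (not held, held) counts per product in the 2016-05 Test Data, from the slide.
counts = {
    "Saving Account": (194_648, 15),
    "Guarantees": (194_659, 4),
    "Current Accounts": (77_290, 117_373),
    "Payroll Account": (179_756, 14_907),
    "Junior Account": (193_045, 1_618),
}

# Share of customers holding each product.
rates = {
    name: held / (not_held + held)
    for name, (not_held, held) in counts.items()
}

for name, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{name:17s} {rate:.5%}")
```

Each pair sums to the 194,663 rows of Test Data, so the rates are directly comparable across products.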
4-2. Exploring
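Exploring
[Figure: number of customers by age; x-axis: age 20 to 100; y-axis: 0 to 150,000]
[Figure: number of customers by sex; categories: Female, Male, Unknown; y-axis: 50,000 to 150,000]
[Figure: number of customers by segment; categories: VIP, Individuals, college graduated, unknown; y-axis: 50,000 to 150,000]
[Figure: number of customers by seniority; bins: ~50, 51~100, 101~150, 151~200, 200~; y-axis: 50,000 to 150,000]
Highlighted ranges: age 18~30 and 40~50; seniority up to 50.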
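Exploring (Segment)
[Figure: product holding by Segment (VIP, Individuals, college graduated, unknown); panels: Funds, Long Term Deposit, Junior Account, Current Account]
Long-term deposit and Funds stand out in the VIP segment. Junior Account and Current Account are also compared by segment.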
Exploring (Seniority)
[Figure: product holding by seniority (bins: ~50, 51~100, 101~150, 151~200, 200~); panels: Guarantees, Particular Account, Current Account, Más particular Account]
Current Account: seniority up to 50. Más particular Account: 51~100. Particular Account: 151~200. Guarantees: 200 and above.
Exploring (Age)
[Figure: product holding by age (bins: ~25, 26~35, 36~45, 46~55, 56~65, 66~); panels: Junior Account, Mortgage, Saving Account, Guarantees]
Junior Account: age 25 and under. Saving Account, Mortgage: 36~45. Guarantees: around 40.
4-3. Modify
Data Cleansing
Missing values (NULL) in the data set were replaced with the Average, Median, or most frequent value (Frequency), or labeled UNKNOWN.
Variables cleansed:
  01 age
  02 new customer index (6 months)
  03 Seniority
  04 Cust_pri
  05 Joining Date
  06 Gross Income
  07 Etc.
Replacement values on the slide: Minimum, First/Primary, Median (41), UNKNOWN.
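A minimal sketch of this kind of cleansing, assuming median replacement for numeric columns and an UNKNOWN label for categorical ones. The function `fill_missing`, the column choices, and the toy rows are hypothetical; the slide does not show which rule was applied to which variable.

```python
from statistics import median

def fill_missing(rows, numeric_cols, categorical_cols):
    """Return a copy of rows with None replaced.

    Numeric columns get the column median; categorical columns get "UNKNOWN".
    Assumes every numeric column has at least one observed value.
    """
    filled = [dict(r) for r in rows]
    for col in numeric_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        col_median = median(observed)
        for r in filled:
            if r[col] is None:
                r[col] = col_median
    for col in categorical_cols:
        for r in filled:
            if r[col] is None:
                r[col] = "UNKNOWN"
    return filled

# Toy rows (hypothetical values) using column names from the data set.
rows = [
    {"age": 23, "renta": 80000.0, "segmento": "VIP"},
    {"age": None, "renta": 120000.0, "segmento": None},
    {"age": 40, "renta": None, "segmento": "Individuals"},
    {"age": 58, "renta": 100000.0, "segmento": "Individuals"},
]
clean = fill_missing(rows, ["age", "renta"], ["segmento"])
print(clean[1], clean[2])
```

The median is less sensitive than the mean to extreme incomes, which is one common reason to prefer it for a variable like `renta`.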
Feature Engineering: numeric variables
Target variables: Age, Gross income, Seniority
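One way to turn numeric variables such as Age and Seniority into levels is fixed-edge binning. The sketch below reuses the bin labels that appear on the Exploring slides (6 age bins, 5 seniority bins); whether the team used exactly these cut points for Feature Engineering is an assumption, and `bin_value` is a hypothetical helper.

```python
from bisect import bisect_left

# Upper bounds of all bins except the last (assumed from the Exploring slides).
AGE_EDGES = [25, 35, 45, 55, 65]
AGE_LABELS = ["~25", "26~35", "36~45", "46~55", "56~65", "66~"]

SENIORITY_EDGES = [50, 100, 150, 200]
SENIORITY_LABELS = ["~50", "51~100", "101~150", "151~200", "200~"]

def bin_value(value, edges, labels):
    """Map a number to a bin label; a value equal to an edge stays in the lower bin."""
    return labels[bisect_left(edges, value)]

for age in (18, 25, 26, 70):
    print(age, bin_value(age, AGE_EDGES, AGE_LABELS))
```

`bisect_left` is used so that boundary values such as 25 fall into "~25" rather than "26~35".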
Feature Engineering: Level reduction and Dummy Variables
Variables were reduced to fewer Levels and then converted to Dummy Variables.
Levels after engineering:
  Age: 6 levels
  Gross_income: 4 levels
  Seniority: 5 levels
  Province name (nomprov): 5 levels
  Channel: 5 levels
  Residence (pais_residencia): 9 levels
Levels before reduction:
  Date: 17
  Province name (nomprov): 53
  Residence (pais_residencia): 93
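Level reduction followed by dummy coding can be sketched like this. The slide does not say how levels were merged, so keeping the most frequent levels and pooling the rest into "OTHER" is an assumption, as are the helper names and the toy province list.

```python
from collections import Counter

def collapse_levels(values, n_keep, other="OTHER"):
    """Keep the n_keep most frequent levels; map everything else to `other`."""
    top = {level for level, _ in Counter(values).most_common(n_keep)}
    return [v if v in top else other for v in values]

def one_hot(values):
    """Return (sorted levels, one 0/1 row per value)."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Toy province column (hypothetical frequencies).
provinces = (
    ["MADRID"] * 5 + ["BARCELONA"] * 3 + ["VALENCIA"] * 2 + ["SEVILLA", "MALAGA"]
)
reduced = collapse_levels(provinces, n_keep=3)
levels, dummies = one_hot(reduced)
print(levels)
print(dummies[0], dummies[-1])
```

Reducing 53 provinces to a handful of levels before dummy coding keeps the number of generated columns small, which matters for models such as SVM.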
Feature Engineering: derived variables
New variables were derived by combining existing ones, including a Clustering-based variable with 6 Levels.
Feature Engineering: association rules between products

     Rule                           Support     Confidence   Lift
[1]  Mortgage => Direct Debit       0.0061232   0.8603142     5.960670
[2]  Payroll => Pensions            0.0625038   1.0000000    14.813015
[3]  Pensions => Payroll            0.0625038   0.9258697    14.813015
[4]  Payroll => Payroll Account     0.0592032   0.9471936    10.451423
[5]  Pensions => Payroll Account    0.0638628   0.9460006    10.438259
...

No.  Product           Associated product(s)
1    Direct Debit      Credit Card, Pensions
2    Payroll           Pensions
3    Payroll Account   Payroll
4    Pensions          Payroll
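For readers unfamiliar with the three measures, the following sketch computes Support, Confidence, and Lift for a single rule from a list of product baskets. The baskets are toy data and `rule_metrics` is a hypothetical helper; the figures on the slide came from the team's own tooling, which is not shown.

```python
def rule_metrics(baskets, lhs, rhs):
    """Support, confidence, and lift for the rule lhs => rhs.

    baskets: list of sets of product names, one set per customer.
    Assumes lhs and rhs each occur in at least one basket.
    """
    n = len(baskets)
    n_lhs = sum(1 for b in baskets if lhs in b)
    n_rhs = sum(1 for b in baskets if rhs in b)
    n_both = sum(1 for b in baskets if lhs in b and rhs in b)
    support = n_both / n                 # P(lhs and rhs)
    confidence = n_both / n_lhs          # P(rhs | lhs)
    lift = confidence / (n_rhs / n)      # P(rhs | lhs) / P(rhs)
    return support, confidence, lift

# Toy baskets (hypothetical).
baskets = [
    {"Payroll", "Pensions", "Payroll Account"},
    {"Payroll", "Pensions"},
    {"Pensions"},
    {"Current Accounts"},
    {"Current Accounts", "Direct Debit"},
    {"Current Accounts"},
    {"Direct Debit"},
    {"Current Accounts"},
]
forward = rule_metrics(baskets, "Payroll", "Pensions")
reverse = rule_metrics(baskets, "Pensions", "Payroll")
print(forward)
print(reverse)
```

Lift is symmetric in the two products, which is why rules [2] and [3] on the slide share the same Lift while their Confidence values differ.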
4-4. Modeling
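Feature Choice: Near Zero Variance
Variables whose variance is close to 0 (almost every row has the same value): pais_residencia, ult_fec_cli_1t (Primary), conyuemp, indfall
Near Zero Variance variables were excluded from Modeling.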
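A near-zero-variance check can be sketched as below. The rule used here (ratio of the most common to the second most common value above 95/5, and fewer than 10% unique values) mirrors the default thresholds of the `nearZeroVar` function in R's caret package; that the team used caret is an assumption, and the toy columns are hypothetical.

```python
from collections import Counter

def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a column as near zero variance.

    True when the column is constant, or when the most common value
    dominates the second most common by more than freq_cut AND the
    percentage of distinct values is below unique_cut.
    """
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # zero variance: a single value only
    freq_ratio = counts[0][1] / counts[1][1]
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut

# Toy columns (hypothetical distributions).
indfall = ["N"] * 997 + ["S"] * 3      # almost always "N"
sexo = ["V"] * 550 + ["H"] * 450       # reasonably balanced

print(near_zero_variance(indfall), near_zero_variance(sexo))
```

Columns flagged this way carry almost no information for separating classes, so dropping them shortens training without much loss.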
Feature Choice:
Candidate variables:
  Province name (nomprov): 5 levels or 53 levels
  Channel: 5 levels
  Residence (pais_residencia): 9 levels
  Date: 17 levels
  Clustering-based variable: 6 levels
Variables used for Modeling: 20
Modeling Process
Modeling flow: Training -> Model -> Prediction (Test set) -> Evaluation
Models: SVM, Naive Bayes, Random Forest, XGBoost
Sampling: SMOTE
Ensemble Model (Stacking Model): XGBoost, Random Forest
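The slide names SMOTE as part of the process. The core idea, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration with toy points, not the team's R code or a full SMOTE implementation; `smote` and its parameters are hypothetical.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from minority-class points.

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort remaining minority points by squared distance to base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Toy minority class (hypothetical 2-D feature vectors).
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.6)]
new_points = smote(minority, n_new=6)
print(new_points)
```

Because every synthetic point is an interpolation, it stays inside the range spanned by the real minority points, which is what distinguishes SMOTE from simple duplication.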
Model Description
Models used in Modeling: SVM, XGBoost, Naive Bayes
Model Choice
Models were compared on Recall: SVM, XGBoost, Naive Bayes
Recall: $\text{Recall} = \dfrac{TP}{TP + FN}$
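As a small worked example of the Recall definition, the sketch below counts true positives and false negatives from binary labels. The labels are toy values and `recall` is a hypothetical helper, not the team's evaluation code.

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN) for binary labels (1 = product held)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy labels (hypothetical): 4 actual holders, 3 of them predicted correctly.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
score = recall(y_true, y_pred)
print(score)
```

Recall ignores false positives, so it rewards a model for finding the rare holders of a product even if it also flags some non-holders.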