demin-damian-wang
Introduction
Some of the most important new data on young adult drinking come from a recent nationwide survey, the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC). According to these data, about 70 percent of young adults, or about 19 million people, consumed alcohol in the year preceding the survey.
This is a short exploratory data analysis focusing on the alcohol variables in the Portuguese school dataset. Our main goal is to use data mining to predict school students' alcohol consumption and to find the significant factors.
Objective/problem statement
• Build models to predict school students’ drinking behavior on weekdays and weekends.
• Compare various models and choose the best.
• Find out which factors influence school students’ alcohol consumption, and make sensible recommendations.
Dataset
Data collected through a survey of two classes in two schools in Portugal
33 Variables
• Personal, e.g. school, sex, age, address, health status, romantic experience, going out with friends, free time after school
• Educational, e.g. study time, class failures, intention for higher education, extra-curricular activities, educational support, number of school absences, grades
• Family, e.g. mother’s/father’s education, mother’s/father’s job, family size, quality of family relationships, parents’ cohabitation status
• Alcohol consumption, e.g. workday alcohol consumption, weekend alcohol consumption
Data Types
Binary
Ordinal
Nominal
Numeric
Data preparation
• No missing data
• Overlap: some students take both the Math and the Portuguese class
• 649 students in the Portuguese class, 395 students in the Math class
Merging data
Criterion: records are matched on
"school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet"
• 382 students identified in both classes
• Deciding attribute values for matched students: keep the max values, and keep “yes” for paid class
• Resulting 674 students in total
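The merge rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the records are toy data, the helper names (`student_key`, `combine`, `merge_classes`) are hypothetical, and the slides do not say how non-numeric attributes other than `paid` were resolved, so the sketch simply keeps the first record's value for those.

```python
# Toy sketch of the merge rule; not the authors' actual code.
KEYS = ["school", "sex", "age", "address", "famsize", "Pstatus", "Medu",
        "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"]

def student_key(record):
    """Identify a student by the 13 matching attributes."""
    return tuple(record[k] for k in KEYS)

def combine(first, second):
    """Resolve one student who appears in both class files."""
    merged = dict(first)
    for name, value in second.items():
        if name in KEYS:
            continue  # identical by construction
        if name == "paid":
            # Keep "yes" if either record says the student paid for classes.
            merged[name] = "yes" if "yes" in (first[name], value) else "no"
        elif isinstance(value, (int, float)):
            # Numeric or ordinal attribute: keep the larger value.
            merged[name] = max(first[name], value)
        # Other text attributes: keep the first record's value (assumption).
    return merged

def merge_classes(math_rows, por_rows):
    """Merge both class files into one record per matched key."""
    students = {}
    for row in list(math_rows) + list(por_rows):
        key = student_key(row)
        students[key] = combine(students[key], row) if key in students else dict(row)
    return list(students.values())

# Hypothetical records: one student in both classes, one only in Portuguese.
base = dict(school="GP", sex="F", age=17, address="U", famsize="GT3",
            Pstatus="T", Medu=2, Fedu=2, Mjob="other", Fjob="other",
            reason="course", nursery="yes", internet="yes")
math_rows = [dict(base, paid="yes", absences=4, Dalc=1, Walc=2)]
por_rows = [dict(base, paid="no", absences=6, Dalc=2, Walc=2),
            dict(base, sex="M", paid="no", absences=0, Dalc=1, Walc=1)]
merged = merge_classes(math_rows, por_rows)
```

Note that any key of this kind collapses two different students who happen to share all 13 attributes, so the matched count depends on how distinctive the key is.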
Approaches
We build two separate models: one for weekday alcohol consumption and one for weekend alcohol consumption.
Target variables: weekday alcohol consumption and weekend alcohol consumption
For weekdays (a more serious issue than weekends):
Level 1 - acceptable alcohol consumption
Levels 2-5 - unacceptable
For weekends:
Levels 1-2 - acceptable alcohol consumption
Levels 3-5 - unacceptable
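As a small illustration of this recoding, the two cut-offs can be written as one helper with a different threshold per target. The function and constant names are hypothetical; only the thresholds come from the slides.

```python
# Highest rating still counted as acceptable, per target (from the slides).
WEEKDAY_MAX_ACCEPTABLE = 1
WEEKEND_MAX_ACCEPTABLE = 2

def binarize(level, max_acceptable):
    """Map a 1-5 consumption rating to a two-class label."""
    if not 1 <= level <= 5:
        raise ValueError(f"rating out of range: {level}")
    return "acceptable" if level <= max_acceptable else "unacceptable"

# The same rating of 2 is unacceptable on a weekday but acceptable on a weekend.
weekday_label = binarize(2, WEEKDAY_MAX_ACCEPTABLE)
weekend_label = binarize(2, WEEKEND_MAX_ACCEPTABLE)
```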
Techniques Used
Decision Tree
Poor performance ☹
• Overall error rate 38%
• Tried improving the model with a cost matrix (0,25,80,0) → 32% error in predicting unacceptable behavior
• But this increased the error rate for the acceptable class to 44%
REJECTED DECISION TREE
Neural Network
Poor performance ☹
• The neural network worked best with 15 nodes
• But the error rate is quite high → 53% for the unacceptable class
• The error rate for the acceptable class was 22%
REJECTED NEURAL NETWORK
Boosting
Poor performance ☹
• Overall error rate is 25%, which is quite low ☺
• However, 59% of the data is wrongly classified into unacceptable
• Area under ROC curve is 0.6782
REJECTED BOOSTING
Naïve Bayes
Poor performance ☹
• Overall error rate was 38.46%
• Couldn’t properly classify the unacceptable class
• Accuracy was also very low
REJECTED NAÏVE BAYES
Random Forest
Winner ☺
• Unacceptable class error rate was 29%
• The unacceptable class is the one that matters most for this prediction task
ACCEPTED RANDOM FOREST
Weekday Alcohol Consumption
Input variables: all variables were used as inputs for the weekday alcohol consumption model except G1, G2 and weekend alcohol consumption.
Weekend alcohol consumption is excluded to avoid target leakage.
G1, G2 - grades for the first and second period. We include G3 (derived from G1 and G2) and exclude G1 and G2 to reduce dependence among the input variables.
Target:
Weekday alcohol consumption
We recoded the ordinal variable weekday alcohol consumption (ratings 1-5) into two classes:
Acceptable (rating 1) and
Unacceptable (ratings 2-5)
Weekday Alcohol Consumption
Random Forest Model:
Partitioning:
Training : Validation : Test = 70 : 15 : 15
Sample sizes chosen as 85 and 100 to downsample the acceptable class
No. of trees: 5200
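The partitioning and downsampling settings above can be sketched as follows. This is a minimal illustration, not the authors' code. The slides name neither the tool nor which class each of the two sample sizes belongs to, so the assignment of 85 and 100 to classes, the function names, and the toy class counts below are all assumptions.

```python
import random

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, validation, test) index lists that preserve class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train, valid, test = [], [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_valid = round(fractions[1] * len(idx))
        train.extend(idx[:n_train])
        valid.extend(idx[n_train:n_train + n_valid])
        test.extend(idx[n_train + n_valid:])
    return train, valid, test

def balanced_sample(indices, labels, per_class, rng):
    """Draw a fixed number of rows per class, with replacement, for one tree."""
    sample = []
    for cls, n in per_class.items():
        pool = [i for i in indices if labels[i] == cls]
        sample.extend(rng.choices(pool, k=n))
    return sample

# Toy labels (hypothetical counts, not the study's class distribution).
labels = ["acceptable"] * 300 + ["unacceptable"] * 100
train, valid, test = stratified_split(labels)

# Assumed assignment of the two sample sizes to the classes.
per_class = {"unacceptable": 85, "acceptable": 100}
tree_rows = balanced_sample(train, labels, per_class, random.Random(1))
```

Each tree then sees a roughly balanced sample even though the acceptable class dominates the training partition, which is the point of downsampling the majority class.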
Weekday Alcohol Consumption
Random Forest Model:
Overall error: 35%
For the unacceptable class: Precision: 52%, Recall: 70.5%
Weekday Alcohol Consumption
Important Factors:
● Sex being male
● Grades
● Mother’s education
● Going out
● Mother’s job
● Failures
Weekend Alcohol Consumption - Input & Balance
The best model is a balanced random forest:
Ignore the variables Dalc, G1 & G2
The target variable Walc: 1-2 “Low” & 3-5 “High”
High : Low = 262 : 412 = 39 : 61
Train : Validation : Test = 70 : 15 : 15
Weekend Alcohol Consumption - Validation
AUC = 0.748
Overall error: 32%
Precision: 58.5%, Recall: 73.8%
Confusion matrix (rows: actual class, columns: predicted class, cells: proportion of cases)
Actual   Unac   Accp   Error
Unac     0.31   0.11   0.26
Accp     0.22   0.36   0.37
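The reported precision and recall can be recomputed from the four cells of the table, reading rows as the actual class and columns as the predicted class, with "unacceptable" as the positive class. The variable names are illustrative; the numbers are the ones in the table.

```python
# Proportions from the validation table (rows: actual, columns: predicted).
tp = 0.31  # actual unacceptable, predicted unacceptable
fn = 0.11  # actual unacceptable, predicted acceptable
fp = 0.22  # actual acceptable, predicted unacceptable
tn = 0.36  # actual acceptable, predicted acceptable

precision = tp / (tp + fp)            # share of predicted unacceptable that are correct
recall = tp / (tp + fn)               # share of actual unacceptable that are found
unacc_error = fn / (tp + fn)          # per-class error, unacceptable row
accp_error = fp / (fp + tn)           # per-class error, acceptable row
overall_error = (fn + fp) / (tp + fn + fp + tn)

print(f"precision={precision:.3f} recall={recall:.3f} overall_error={overall_error:.3f}")
```

The results agree with the reported 58.5% precision and 73.8% recall. The overall error comes out near 33% rather than 32%, a gap consistent with the table cells being rounded to two decimals.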
Weekend Alcohol Consumption - Importance
Important Factors:
● Going out with friends
● Sex
● Grades
● Family size
● Absences
● Freetime
● Father’s job
Compare two models
Random forest gives the best predictions for both targets.
For weekday alcohol consumption, the overall error rate is 35%, with an error rate of 29% in the unacceptable group. However, the AUC is only about 0.69.
For weekend alcohol consumption, the overall error rate is 32%, with an error rate of 26% in the high-consumption group. The AUC is 0.748.
The weekend model is the better one.
Insights of the models
1. Drinking is a daily behavior
Most of the drinkers drink on both weekends and weekdays. Students tend to drink more on weekends.
2. Mothers and fathers play important roles at different times
According to the weekday alcohol consumption model, mother’s education and mother’s job are related to the child’s weekday drinking behavior.
On weekends, by contrast, father’s job is related to the child’s drinking behavior.
Insights of the models
3. Common factors show up in both models
● Sex -- boys tend to drink more than girls
● Grades -- kids with lower grades drink more than those with higher grades
● Absences -- kids with more absences tend to drink more
● Freetime -- kids with more free time tend to drink more
4. Exclusive factors related to alcohol consumption
● Going out with friends -- on weekends, peer behavior is related to alcohol consumption
● Family size -- kids from larger families tend to drink less on weekends
● Going out for more time -- during weekdays, more free time is related to alcohol consumption
Recommendation
Family and school are both important.
When we ran both models on only school-related data or only family-related data, the prediction error rate got even higher, which indicates that alcohol consumption behaviour is related to both aspects. Solving the alcohol consumption problem among high-school students needs effort from both school and family.
● Educate the students. Reduce negative peer impacts. Build their awareness of harmful effects of alcohol use.
● Educate the parents, and get them to keep track of their kids’ after-school behavior.
● Keep tracking the data to build students’ behavior profiles for future prediction.
Recommendation
How to predict better.
Neither model predicts the drinkers group well. We could collect more data from a larger sample to build better models. There might also be more relevant variables, such as the group the kids hang out with, how much money they have, or other factors that are not included in the study.
Weekday Alcohol Consumption
Decision Tree Model:
The factors that appeared to be associated with unacceptable drinking behavior (ratings 2-5) were:
● Sex being male
● Lower final grade (G3 < 14)
● Going out more
● More absences from class
● Mother’s education level below 1.5 (Medu < 1.5)
● Mother’s job other than at home, health or teacher
Weekday Alcohol Consumption
Decision Tree with Loss Matrix:
Loss matrix used: 0,25,80,0
Training : Validation : Test = 70 : 15 : 15
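To show how a loss matrix shifts predictions toward the unacceptable class, here is a minimal sketch of the minimum-expected-loss rule at a single leaf. The slides do not state the layout of the matrix, so this assumes the cost of 80 applies to missing an unacceptable student and the cost of 25 to a false alarm. The constant and function names are hypothetical.

```python
# Assumed reading of the loss matrix 0,25,80,0 (layout not given in the slides).
COST_FALSE_ALARM = 25  # actual acceptable, predicted unacceptable
COST_MISS = 80         # actual unacceptable, predicted acceptable

def predict_min_loss(p_unacceptable):
    """Pick the label with the lower expected loss for one leaf."""
    loss_if_acceptable = p_unacceptable * COST_MISS
    loss_if_unacceptable = (1 - p_unacceptable) * COST_FALSE_ALARM
    if loss_if_unacceptable < loss_if_acceptable:
        return "unacceptable"
    return "acceptable"

# Probability above which "unacceptable" becomes the cheaper prediction.
threshold = COST_FALSE_ALARM / (COST_FALSE_ALARM + COST_MISS)

label_at_30_percent = predict_min_loss(0.30)
label_at_20_percent = predict_min_loss(0.20)
```

Under this reading, a leaf only needs about a 24% share of unacceptable cases to be labelled unacceptable, instead of 50%. That is consistent with the reported trade-off: fewer errors on the unacceptable class and more on the acceptable class. Tree implementations may also use the losses when choosing splits, which this sketch does not cover.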