Pattern Mining and Machine Learning for Demographic Sequences

  • View
    442

  • Download
    0

  • Category

    Science

Preview:

Citation preview

Pa#ernMiningandMachineLearningforDemographicSequences

DmitryI.Ignatov1,EkaterinaMitrofanova2,AnnaMuratova1,andDanilGizdatulin1

1Computer Science Faculty & 2Institute of Demography

National Research University Higher School of Economics, Moscow, Russia

KESW 2015, Moscow

Outline

KESW 2015, Moscow

• ProblemStatement• DemographicData• MethodsandResults

Ø DecisionTrees§ FirstlifecourseeventpredicGon§ NextlifecourseeventpredicGon§ Rulesfor‘gender’aLribute

Ø FrequentPaLerns§ Frequentclosedsequences§ Emergingsequencesformenandwomen

• ConclusionandFutureWork

2

•  Analysisofdemographicdatabymeansofmachinelearninganddatamining[Blockeel et al, 2001; Billari et al., 2006]•  DemographersquesGons:

§ Whatarethedifferencesbetweendemographic behaviour of menandwomen?

§ Whatarethetypical(frequent)eventsequencesthatappearinlifecoursetrajectories?

§ WhatarethefirstandthenextstarGngeventsaparGcularpersonmayhave?

§ Andmanymore…

• SimpleDM&MLtoolspreferablywithGUI:§ Orange http://orange.biolab.si/§ SPMFhLp://www.philippe-fournier-viger.com/spmf/§ Ad-hocscriptsinPython

ProblemStatement

3KESW 2015, Moscow

Thesurvey«Parentsandchildren,menandwomeninfamilyandsociety»www.socpol.ru/gender/RIDMIZ.shtml•  4857peopleincluding1545men3312women•  11generaGonsaresplitin5yearsintervalsfrom1930Gll1984

DemographicData

4KESW 2015, Moscow

Comparisonofclassifiersforthefirsteventpredic?on

5KESW 2015, Moscow

WhyDecisionTrees?

5KESW 2015, Moscow

•  Notablack-boxapproach•  Simpleif-thenrulebasedrepresentaGon•  However…

Thefirsteventpredic?onbyDecisionTrees

ThetreeisbuiltinOrangeThefirsteventmainlydependsoneducaGontype

6KESW 2015, Moscow

FeatureEncodingSchemes

General attributes: •  Gender •  Generation •  Type of education •  Locality •  Religious views •  Religious activity

Events encoding: •  Binary encoding (0 - absence, 1 - presence) •  Time-base encoding (age in months) •  Pairs of events ordered by precedence relations (<, >, =, n/a)

7KESW 2015, Moscow

The anonymised datasets for each experiment are freely available in CSV files: http://bit.ly/KESW2015seqdem

Influenceoffeatureencodingonthenexteventpredic?on

Encodingtype

Classifica?onAccuracy

Imbalanceddata Balanceddata

Binary(BE) 0.8498 0.8780(*)

Time-based(TE) 0.3516 0.3591

Pairsofevents(PE) 0.7076 0.7013

BE+TE 0.7293(~) 0.7459

BE+PE 0.8407 0.8438

TE+PE 0.5465 0.4959

BE+TE+PE 0.7295(~) 0.7503

(*)meansthebestresult,(~)meansalmostequivalentresults

8KESW 2015, Moscow

Theconfusionmatrixforthenexteventpredic?on

8KESW 2015, Moscow

– binary encoding for balanced dataset – “br” means “break up” and “div” means “divorce” events

Examplesofrulesforthenexteventpredic?on

9KESW 2015, Moscow

Thenexteventpredic?onbasedongenerala#ributes

A decision tree diagram built only for general attributes. The next event is influenced by the type of education.

10KESW 2015, Moscow

Classifica?onaccuracyofdifferentencodingschemesforgenderpredic?on

(∼) means very close results, and (*) means the best result in the column.

11KESW 2015, Moscow

Men:Women:

Examplesofrulesforpredic?onoftargeta#ribute«gender»

Antecedent Confidence

Firstjobaper19.9years,marriagein20.6-22.4,educaGonbefore20.7,breakupaper27.6,divorcebefore30.5

65.9%

Firstjobaper19.9,marriagein20.6-22.4,break-upbefore27.6 61.1%

Firstjobbefore17.2,marriagein20.6-22.4,break-upbefore27.6 61.3%

Firstjobaper21,marriageaper29.5 70.2%

Antecedent Confidence

Firstjobin18.2-19.9,marriagein20.6-22.4,break-upaper27.6,divorceaper30.5

71.9%

Firstjobin18.2-19.9,marriagein20.6-22.4,break-upaper27.6,divorcebefore30.5

70.9%

Firstjobin17.2-19.9,marriagein20.6-22.4,break-upbefore27.6 62.8%

Firstjobin17.7-21,marriageaper29.5 62.8%

12KESW 2015, Moscow

•  Asequenceisanorderedsetofelements(events)•  ThesupportofsequencerindatabaseisthenumberofsequencesinDthatcontainr

•  Therela?vesupportofristhefracGonofsequencesthatcontainr•  AsequencerisfrequentinadatabaseDif,whereminsupisaminimalsupportthreshold•  Afrequentclosedsequenceis a sequence such that there is no any its

supersequence with the same support

SequenceMining

13KESW 2015, Moscow

[Zaki & Meira, 2014; Agrawal & Srikant, 1995]

sup(r) = #{si ∈ D | r is subsequence of si}

•  AnemergingsequenceisasequencesuchthatitssupportgrowthsdrasGcallyinthetransiGonfromonedatabasetoanotherone

•  ThegrowthrateofasequencesinthetransiGonfromdatabasesD1toD2: (1) •  Thecontribu?onofasequencetoaparGcularclass:,(2)whereeisasubsequenceofsTheideaisbasedon[Mill,1843;Finn,1983;Dong&Li,1999]

EmergingSequences

14KESW 2015, Moscow

•  An SPMFoutputforfrequentclosedsequencesmining

Frequentclosedsequences

15KESW 2015, Moscow

•  ImplementedinPython2.7Input:twodatasetsformenandwomenwithageindicaGonfordemographicevents1.  TransformaGonofeventstosequences(80:20test-to-trainingraGo)2.  PassingthetestsettoSPMFandfindingfrequentsequences.3.  Findingemergingsequences(classificaGonrules)andtheircontribuGonsto

classes.4.  DefiningtheclassofarulebyitscontribuGonandthencontribuGon

normalisaGon.5.  AccuracycalculaGonforthetestset.Output:fileswithclassificaGonrulesformenandwomen

• minsup = 0.005; for each class we use 3312 sequences after oversampling.

•  The best classification accuracy (0.936) has been reached at minimal growth rate 1.0, with 577 rules for men and 1164 for women, and 3 non-covered objects.

EmergingSequenceMining

16KESW 2015, Moscow

•  The list of emerging sequences for men(with their class contribution):•  Thelistofemergingsequencesforwomen:

Emergingsequences

17KESW 2015, Moscow

• We have shown that decision trees and sequence mining could become the tools of choice for demographers.

• Machine learning and data mining tools can help in finding regularities

and dependencies that are hidden in voluminous demographic datasets. However, these methods need to be properly tuned and adapted to the domain needs.

•  In the near future we are planning:

•  to implement emerging prefix-string mining and learning to deal with sequences without gaps

•  to use different rule-based techniques that are able to cope with unbalanced multi-class data (Cerf et al., 2013)

•  to apply Pattern Structures to demographic sequence mining (Kuznetsov et al., 2013).

•  andmanymoreideasandtricks…

ConclusionandFutureWork

18KESW 2015, Moscow

Thankyou.Ques?ons?

KESW 2015, Moscow

Recommended