Linear Regression Model Selection Using a Hybrid Genetic-Improved Harmony Search Parallelized Algorithm
Blanka Láng, László Kovács, László Mohácsi
Corvinus University of Budapest, Institute of Information Technology
Contents
- Linear Regression
- Model Selection Problem
- Datasets Used
- Performance of Selection Algorithms on Our Data
- The Need for a New Solution
- The Performance of Our Hybrid Algorithm
Linear Regression
We have:
- Y: dependent variable
- X = (X₁, X₂, …, Xₘ): vector of independent variables
Goal:
Y = β₀ + β₁X₁ + β₂X₂ + … + βₘXₘ + ε
OLS model:
Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂ₘXₘ = β̂₀ + Σ_{j=1..m} β̂ⱼXⱼ
Parsimony: choose a subset X' ⊆ X → minimize the residuals using as few independents as possible, maximizing the model's ability to generalize.
Partial effects of the independents → keep only significant variables in the model; these hypotheses can be statistically tested.
Objective functions: AIC, SBC, HQC → MIN; adjusted R² → MAX
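The selection criteria above can be sketched in Python for a single candidate subset. The scaling conventions for AIC and SBC differ between packages, so the forms below (AIC = ln(SSE/n) + 2k/n, SBC = ln(SSE/n) + k·ln(n)/n) are one common choice, not necessarily the one SPSS reports; the function names are illustrative.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the residual sum of squares."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def criteria(X, y):
    """AIC, SBC and adjusted R^2 for one candidate subset of regressors.

    One common convention: AIC = ln(SSE/n) + 2k/n, SBC = ln(SSE/n) + k ln(n)/n,
    where k counts the estimated coefficients; exact scalings vary by package."""
    n, k = len(y), X.shape[1] + 1                 # +1 for the intercept
    sse = fit_ols(X, y)
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    aic = np.log(sse / n) + 2 * k / n
    sbc = np.log(sse / n) + k * np.log(n) / n
    return aic, sbc, adj_r2
```

A subset containing the truly relevant regressors scores a lower AIC/SBC and a higher adjusted R² than a subset of irrelevant ones, which is exactly what the MIN/MAX directions above express.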
Dataset #1: BodyFat Measurements (FAT), a real dataset from 1996
- n = 252
- Y: percent of body fat to muscle tissue
- m = 16 independents (age, abdomen circumference, weight, height, etc.)
Multicollinearity: redundancy between independents. E.g.: which of two redundant independents matters most when predicting Y? How can we interpret the partial effects of these independents?
Measure: regress each independent on the others → a VIF indicator for each independent; if VIF > 2 → multicollinearity.
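The VIF measure described above can be sketched directly: regress each independent on the remaining ones and compute 1/(1 − R²). The function name and the test against the slide's cutoff of 2 are illustrative.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j of X on the remaining columns (with an intercept)."""
    n, m = X.shape
    out = []
    for j in range(m):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        out.append(1.0 / (1.0 - r2))
    return out
```

Independent columns give VIF ≈ 1; a near-duplicate column pushes both copies well past the threshold of 2.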
Dataset #2: DATA26, a dataset simulated from a Gumbel copula
- n = 1000
- m = 25 (plus Y)
- Generating a correlation matrix (CM) with high correlations in absolute value: vineBeta method (Lewandowski et al., 2009)
- Simulating multicollinearity
- All 26 generated variables follow N(µ, σ) distributions, where µ and σ are randomly generated for each variable.
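The second step of the recipe above, turning a given correlation matrix into correlated N(µ, σ) draws, can be sketched with a Gaussian construction via Cholesky factorization. This is a simplified stand-in: the slide's Gumbel copula and the vineBeta generation of the correlation matrix itself are not reproduced here, and the function name is illustrative.

```python
import numpy as np

def correlated_normals(corr, mus, sigmas, n, seed=0):
    """Draw n samples of variables with target correlation matrix `corr`
    and per-variable means/standard deviations, via Cholesky factorization."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(corr)             # corr must be positive definite
    z = rng.standard_normal((n, len(mus)))   # independent N(0, 1) draws
    return mus + sigmas * (z @ L.T)          # impose correlation, then scale
```

Feeding in a matrix with high off-diagonal entries yields exactly the kind of strongly multicollinear regressors the DATA26 experiment needs.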
Performance of Selection Algorithms (FAT)
| Method | AIC | SBC | adj. R² | Runtime (s) | St. Dev. (s) |
|---|---|---|---|---|---|
| Best Subsets (SPSS Leaps and Bound) | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15) | 4.558 | 0.878 |
| Best Subsets (Minerva: GARS) | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15) | 5.921 | 1.658 |
| improved GARS | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15) | 11.268 | 2.941 |
| IHSRS | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0.968 | 0.188 |
| Forward + Backward | 0.058 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0.239 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0.976 | 0.050 |
| Variable Importance in Projection (Partial Least Squares) | -0.247 (Variables: 1, 2, 5, 6, 8, 9) | -0.092 (Variables: 1, 2, 5, 6, 8, 9) | 0.9618 (Variables: 1, 2, 5, 6, 8, 9) | 1.807 | 0.896 |
| Elastic Net | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9410 (Variables: 1) | 50.858 | 9.019 |
| Stepwise VIF Selection | -0.189 (Variables: 1, 2, 15) | -0.008 (Variables: 1, 2, 15) | 0.954 (Variables: 1, 2, 15) | 0.832 | 0.034 |
| Nested Estimate Procedure | -1.402 (Variables: 1, 8) | -1.351 (Variables: 1, 8) | 0.9538 (Variables: 1, 8) | 0.352 | 0.047 |
Performance of Selection Algorithms (DATA26)
| Method | AIC | SBC | R² | Runtime (s) | St. Dev. (s) |
|---|---|---|---|---|---|
| Best Subsets (SPSS Leaps and Bound) | -8.840 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | -8.756 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | 0.9999944 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3) | 32.352745 | 7.04028 |
| Best Subsets (Minerva: GARS) | -8.841 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3) | -8.826 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 0.9999944 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3) | 52.714638 | 12.62692 |
| improved GARS | -8.731 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | -8.826 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 0.99999744 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 1281.45823 | 380.10328 |
| IHSRS | -8.731 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | -8.826 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 0.99999744 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 402.1666233 | 79.070735 |
| Forward + Backward | -8.840 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | -8.756 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | 0.9999944 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | 1.0744 | 0.0937 |
| Variable Importance in Projection (Partial Least Squares) | -5.196 (Variables: X24, X5, X4, X10, X20, X18, X8, X22, X23, X11, X15, X6, X12) | -5.132 (Variables: X24, X5, X4, X10, X20, X18, X8, X22, X23, X11, X15, X6, X12) | 0.99979 (Variables: X24, X5, X4, X10, X20, X18, X8, X22, X23, X11, X15, X6, X12) | 15.095273 | 7.19626 |
| Elastic Net | -4.363 (full model; not significant: X5, X13) | -4.240 (full model; not significant: X5, X13) | 0.993 (full model; not significant: X5, X13) | 478.683794 | 99.82244 |
| Stepwise VIF Selection | 0.434 (Variables: X6, X10, X16, X17, X19, X24) | 0.464 (Variables: X6, X10, X16, X17, X19, X24) | 0.940 (Variables: X6, X10, X16, X17, X19, X24) | 0.93415 | 0.02986 |
| Nested Estimate Procedure | 0.760 (Variables: X10, X15, X23, X24) | 0.780 (Variables: X10, X15, X23, X24) | 0.917 (Variables: X10, X15, X23, X24) | 0.39289 | 0.0533 |
Problem with the results

FAT: collinearity statistics of the selected model

| Variable | Tolerance | VIF |
|---|---|---|
| X1 | 0.069 | 14.490 |
| X3 | 0.017 | 59.097 |
| X5 | 0.089 | 11.271 |
| X6 | 0.030 | 33.682 |
| X8 | 0.105 | 9.540 |
| X12 | 0.239 | 4.182 |
| X15 | 0.399 | 2.509 |
DATA26: collinearity statistics of the selected model

| Variable | Tolerance | VIF |
|---|---|---|
| X1 | 0.065 | 15.347 |
| X4 | 0.001 | 1644.939 |
| X5 | 0.003 | 388.860 |
| X6 | 0.002 | 538.248 |
| X8 | 0.005 | 197.505 |
| X10 | 0.050 | 20.165 |
| X12 | 0.001 | 1366.452 |
| X13 | 0.030 | 33.293 |
| X15 | 0.001 | 1133.939 |
| X16 | 0.048 | 20.828 |
| X17 | 0.041 | 24.297 |
| X18 | 0.016 | 64.340 |
| X21 | 0.003 | 393.569 |
| X23 | 0.002 | 554.800 |
| X24 | 0.004 | 262.232 |
| X25 | 0.001 | 825.023 |
Optimal solutions of IHSRS for R̄²
Modify the IHSRS: add an "all VIFs < 2" condition to the optimization task.
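One simple way to realize an "all VIFs < 2" condition is a feasibility filter on the objective: any candidate subset that violates it scores minus infinity and is never accepted. This is a sketch of the idea, not necessarily how the authors wired the condition into the IHSRS; it uses the fact that the VIFs are the diagonal of the inverse correlation matrix of the regressors.

```python
import numpy as np

def all_vifs_below(X, limit=2.0):
    """True iff every VIF among the columns of X is below `limit`.
    The VIFs are the diagonal of the inverse correlation matrix."""
    if X.shape[1] < 2:
        return True                                # a lone regressor has VIF 1
    R = np.corrcoef(X, rowvar=False)
    return float(np.diag(np.linalg.inv(R)).max()) < limit

def constrained_score(X, y, subset, limit=2.0):
    """Adjusted R^2 of the candidate model if the VIF condition holds,
    else -inf, so the search can never accept an infeasible subset."""
    cols = X[:, list(subset)]
    if not all_vifs_below(cols, limit):
        return float("-inf")
    n, k = len(y), cols.shape[1] + 1               # +1 for the intercept
    A = np.column_stack([np.ones(n), cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)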
Optimal solutions of IHSRS with VIF conditions:
FAT: optimal IHSRS solution with VIF condition (R̄² = 0.9854)

| Variable | Tolerance | VIF |
|---|---|---|
| X1 | 0.508 | 1.970 |
| X2 | 0.879 | 1.138 |
| X8 | 0.558 | 1.791 |
DATA26: optimal IHSRS solution with VIF condition (R̄² = 0.991)

| Variable | Tolerance | VIF |
|---|---|---|
| X2 | 0.503 | 1.986 |
| X6 | 0.548 | 1.825 |
| X10 | 0.500 | 1.999 |
| X14 | 0.526 | 1.902 |
| X23 | 0.565 | 1.770 |
Other models with all VIF values smaller than 2:
- Backward-VIF: R̄² = 0.9540 (FAT); 0.940 (DATA26)
- Nested Estimates: R̄² = 0.9538 (FAT); 0.917 (DATA26)
A great setback for the modified IHSRS
[Bar charts, FAT and DATA26: average solution time and standard deviation of solution times for IHSRS without the VIF condition vs. IHSRS with the VIF condition, measured both in number of steps (FAT axis up to 60,000; DATA26 up to 250,000) and in seconds (FAT up to 70; DATA26 up to 3,500).]
Average runtime is almost an hour!
We cannot parallelize the IHSRS.
individual / melody: ● = 0 0 1 0 1 1 1
population / harmony memory: ● ● ● ●
STEP 1 & 2: Generate a random harmony memory and evaluate the regressions for each individual.
For each position of a new harmony:
- with probability HMCR: take the value from a random member of the harmony memory;
- with probability 1 - HMCR: generate a RANDOM value.
Then:
- with probability PAR: mutate ● with mutation bandwidth (bw);
- with probability 1 - PAR: no modification on ●.
Increase PAR and decrease bw over the iterations.
Is the new ● better than the worst individual? YES → replace the worst individual; NO → keep the memory unchanged.
Termination criterion met? YES → STOP; NO → generate the next harmony.
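The loop above can be sketched for the binary subset encoding. In a 0/1 encoding, pitch adjustment reduces to a bit flip, so the bandwidth bw is not needed here, and the improved variant's adaptation of PAR and bw over time is omitted for brevity; parameter names and defaults are illustrative, not the authors' settings, and the objective function is passed in as `score`.

```python
import random

def harmony_search(m, score, hms=4, hmcr=0.9, par=0.3, iters=2000, seed=0):
    """Harmony-search-style subset search over 0/1 vectors of length m,
    following the flow above: build each new harmony bit-by-bit from the
    memory (prob. HMCR) or at random, pitch-adjust (flip) with prob. PAR,
    and replace the worst memory member whenever the newcomer beats it."""
    rng = random.Random(seed)
    memory = [[rng.randint(0, 1) for _ in range(m)] for _ in range(hms)]
    scores = [score(h) for h in memory]
    for _ in range(iters):
        new = []
        for j in range(m):
            if rng.random() < hmcr:                 # take bit j from memory
                bit = memory[rng.randrange(hms)][j]
            else:                                   # else a random bit
                bit = rng.randint(0, 1)
            if rng.random() < par:                  # pitch adjustment: flip
                bit = 1 - bit
            new.append(bit)
        s = score(new)
        worst = min(range(hms), key=lambda i: scores[i])
        if s > scores[worst]:                       # replace worst if better
            memory[worst], scores[worst] = new, s
    best = max(range(hms), key=lambda i: scores[i])
    return memory[best], scores[best]
```

In the model-selection setting, `score` would be one of the objective functions from the earlier slides (e.g. adjusted R² of the subset's regression).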
Our GA-HS hybrid solution
individual: ● = 0 0 1 0 1 1 1
population: ● ● ● ●
STEP 1 & 2: Generate a random population and evaluate the regressions for each individual.
Select the better-than-average individuals: ● ●
Start a new population: ● ● x x
Can be parallelized!
For each empty slot x:
- with probability HMCR: take values from the surviving individuals;
- with probability 1 - HMCR: generate a RANDOM individual.
Mutate ● with mutation bandwidth (bw); increase HMCR and decrease bw over the iterations.
Is every x filled? NO → fill the next slot; YES → evaluate the regressions for the new individuals in the population.
Termination criterion met? YES → STOP; NO → start the next generation.
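The "Can be parallelized!" step is the key structural difference: the new individuals that fill the x slots are independent of one another, so their regression evaluations (the expensive part) can run concurrently. A minimal Python sketch of that step, assuming a `score` callback for the regression fit; the authors' implementation is in C#, so `ThreadPoolExecutor` and the function name are stand-ins (for CPU-bound fits a process pool would give the real speedup):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def refill_population(survivors, m, score, pop_size, hmcr=0.9, seed=0):
    """Fill the empty (x) slots of the new population, then evaluate the new
    individuals in parallel, as in the hybrid's flow above. Each new bit is
    copied from a random survivor with probability HMCR, otherwise drawn at
    random."""
    rng = random.Random(seed)
    def make_one():
        return [survivors[rng.randrange(len(survivors))][j]
                if rng.random() < hmcr else rng.randint(0, 1)
                for j in range(m)]
    new = [make_one() for _ in range(pop_size - len(survivors))]
    with ThreadPoolExecutor() as pool:       # independent evaluations
        new_scores = list(pool.map(score, new))
    return survivors + new, [score(s) for s in survivors] + new_scores
```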
Differences from GA:
1. More than one kind of mutation
2. No crossover
In linear regression model selection, randomization is more important than inherited good properties: the inclusion or exclusion of a single independent can save or ruin a model. We observed that GA is a relatively slow algorithm when applied to model selection.
ThePerformance
[Bar charts, DATA26: average solution time and standard deviation of solution times (number of steps, axis up to 250,000) for IHSRS + VIF vs. GA-IHSRS + VIF. Bar charts, FAT (seconds, axis up to 70) and DATA26 (seconds, axis up to 4,000): standard vs. parallel implementation.]
Average runtime and standard deviation are decreased by two-thirds.
Thank you for your attention!
[Bar chart, FAT: average solution time and standard deviation of solution times (number of steps, axis up to 60,000) for IHSRS + VIF vs. GA-IHSRS + VIF.]
Environment
The solution times are an average of 30 runs; the standard deviation of the runtimes is determined from the same 30 runs.
Most selection algorithms were run in IBM SPSS Statistics 22. Elastic Net: the Catreg SPSS macro by the University of Leiden. Partial Least Squares: the NumPy and SciPy Python libraries.
The metaheuristics (GARS, improved GARS, IHSRS, GA-IHSRS) are implemented in C#.
OS and hardware configuration:
- OS: Windows 8.1 Ultimate, 64-bit
- CPU: Intel Core i7-2700K, 3.5 GHz
- RAM: 16 GB DDR3 SDRAM