Linear Regression Model Selection Using a Hybrid Genetic-Improved Harmony Search Parallelized Algorithm
Blanka Láng, László Kovács, László Mohácsi
Corvinus University of Budapest, Institute of Information Technology


Page 1

Linear Regression Model Selection Using a Hybrid Genetic-Improved Harmony Search Parallelized Algorithm

Blanka Láng, László Kovács, László Mohácsi
Corvinus University of Budapest, Institute of Information Technology

Page 2

Contents

Linear Regression
Model Selection Problem
Datasets Used
Performance of Selection Algorithms on Our Data
The Need for a New Solution
The Performance of Our Hybrid Algorithm

Page 3

Linear Regression

We have:
Y: dependent variable
X = (X1, X2, …, Xm): the vector of independent variables

Goal:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m + \varepsilon$

OLS model:
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_m X_m = \hat{\beta}_0 + \sum_{j=1}^{m} \hat{\beta}_j X_j$

Parsimony: choose $X' \subseteq X$ so that the residuals are minimized with as few independents as possible, which maximizes the model's ability to generalize.

Partial effects of the independents: keep only significant variables in the model, so that these hypotheses can be statistically tested.

Objective functions: AIC, SBC, HQC (to be minimized) and adjusted R² (to be maximized).
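These criteria can be evaluated for any candidate subset of independents from a single OLS fit. A minimal Python/NumPy sketch (the presentation's metaheuristics are implemented in C#; the AIC/SBC scaling below is one common log-likelihood-based convention and need not match the SPSS values reported in the later tables):

```python
import numpy as np

def fit_criteria(X, y, subset):
    """OLS fit of y on the columns of X listed in `subset`;
    returns (adjusted R^2, AIC, SBC)."""
    n = len(y)
    k = len(subset)                                   # regressors without intercept
    Z = np.column_stack([np.ones(n), X[:, subset]])   # add the intercept column
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    rss = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    aic = n * np.log(rss / n) + 2 * (k + 1)           # to be minimized
    sbc = n * np.log(rss / n) + np.log(n) * (k + 1)   # to be minimized
    return adj_r2, aic, sbc
```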


Page 4

Dataset #1
Body Fat Measurements: a real dataset from 1996
n = 252
Y: percentage of body fat relative to muscle tissue
m = 16 independents (age, abdomen circumference, weight, height, etc.)

Multicollinearity: redundancy between the independents. E.g.: which of two redundant independents matters most when predicting Y? How can we interpret the partial effects of these independents?

Measure: regress the independents on each other, which yields a VIF indicator for each independent; if VIF > 2, there is multicollinearity.
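The VIF of each independent follows from regressing it on all the other independents. A short NumPy sketch of that measure (the helper name is illustrative):

```python
import numpy as np

def vif(X):
    """Variance inflation factors for every column of the n-by-m matrix X
    (independents only, no intercept column)."""
    n, m = X.shape
    out = np.empty(m)
    for j in range(m):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2_j = 1.0 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2_j)          # VIF_j = 1 / (1 - R_j^2)
    return out

# rule of thumb used in the presentation: VIF > 2 signals multicollinearity
# collinear = vif(X) > 2
```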


Page 5

Dataset #2
DATA26: a simulated dataset generated from a Gumbel copula
n = 1000
m = 25 independents (plus Y)

A correlation matrix (CM) with high correlations in absolute value is generated with the vineBeta method (Lewandowski et al., 2009) to simulate multicollinearity.

All 26 generated variables follow N(μ, σ) distributions, where μ and σ are randomly generated for each variable.
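A sketch of one way to obtain such multicollinear normal variables: take a valid correlation matrix R (the authors generate it with the vineBeta method and draw the joint distribution from a Gumbel copula, neither of which is reproduced here), push standard normal draws through its Cholesky factor, and rescale each column to its own randomly drawn mean and standard deviation (the uniform ranges below are purely illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_multicollinear(R, n=1000):
    """n draws of m jointly normal variables with correlation matrix R;
    each variable gets its own random mean and standard deviation."""
    m = R.shape[0]
    L = np.linalg.cholesky(R)                 # R must be positive definite
    z = rng.standard_normal((n, m)) @ L.T     # correlated standard normals
    mu = rng.uniform(-10.0, 10.0, size=m)     # illustrative marginal means
    sigma = rng.uniform(0.5, 5.0, size=m)     # illustrative marginal std devs
    return mu + sigma * z                     # rescaling preserves the correlations
```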


Page 6

Performance of Selection Algorithms – FAT


For each method: AIC, SBC, adjusted R² (selected variables in parentheses), average runtime in seconds, and standard deviation of the runtime in seconds.

Best Subsets (SPSS Leaps and Bound): AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adjusted R² 0.9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15); runtime 4.558 s; st. dev. 0.878 s
Best Subsets (Minerva: GARS): AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adjusted R² 0.9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15); runtime 5.921 s; st. dev. 1.658 s
Improved GARS: AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adjusted R² 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15); runtime 11.268 s; st. dev. 2.941 s
IHSRS: AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adjusted R² 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15); runtime 0.968 s; st. dev. 0.188 s
Forward + Backward: AIC 0.058 (Variables: 1, 3, 5, 6, 8, 12, 15); SBC 0.239 (Variables: 1, 3, 5, 6, 8, 12, 15); adjusted R² 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15); runtime 0.976 s; st. dev. 0.050 s
Variable Importance in Projection (Partial Least Squares): AIC -0.247 (Variables: 1, 2, 5, 6, 8, 9); SBC -0.092 (Variables: 1, 2, 5, 6, 8, 9); adjusted R² 0.9618 (Variables: 1, 2, 5, 6, 8, 9); runtime 1.807 s; st. dev. 0.896 s
Elastic Net: AIC -2.013 (Variables: 1); SBC -1.987 (Variables: 1); adjusted R² 0.9410 (Variables: 1); runtime 50.858 s; st. dev. 9.019 s
Stepwise VIF Selection: AIC -0.189 (Variables: 1, 2, 15); SBC -0.008 (Variables: 1, 2, 15); adjusted R² 0.954 (Variables: 1, 2, 15); runtime 0.832 s; st. dev. 0.034 s
Nested Estimate Procedure: AIC -1.402 (Variables: 1, 8); SBC -1.351 (Variables: 1, 8); adjusted R² 0.9538 (Variables: 1, 8); runtime 0.352 s; st. dev. 0.047 s

Page 7

Performance of Selection Algorithms – DATA26


For each method: AIC, SBC, adjusted R² (selected variables in parentheses), average runtime in seconds, and standard deviation of the runtime in seconds.

Best Subsets (SPSS Leaps and Bound): AIC -8.840 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18); SBC -8.756 (same variables); adjusted R² 0.9999944 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3); runtime 32.352745 s; st. dev. 7.04028 s
Best Subsets (Minerva: GARS): AIC -8.841 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3); SBC -8.826 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4); adjusted R² 0.9999944 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3); runtime 52.714638 s; st. dev. 12.62692 s
Improved GARS: AIC -8.731 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4); SBC -8.826 (same variables); adjusted R² 0.99999744 (same variables); runtime 1281.45823 s; st. dev. 380.10328 s
IHSRS: AIC -8.731 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4); SBC -8.826 (same variables); adjusted R² 0.99999744 (same variables); runtime 402.1666233 s; st. dev. 79.070735 s
Forward + Backward: AIC -8.840 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18); SBC -8.756 (same variables); adjusted R² 0.9999944 (same variables); runtime 1.0744 s; st. dev. 0.0937 s
Variable Importance in Projection (Partial Least Squares): AIC -5.196 (Variables: X24, X5, X4, X10, X20, X18, X8, X22, X23, X11, X15, X6, X12); SBC -5.132 (same variables); adjusted R² 0.99979 (same variables); runtime 15.095273 s; st. dev. 7.19626 s
Elastic Net: AIC -4.363 (full model; not significant: X5, X13); SBC -4.240 (full model; not significant: X5, X13); adjusted R² 0.993 (full model; not significant: X5, X13); runtime 478.683794 s; st. dev. 99.82244 s
Stepwise VIF Selection: AIC 0.434 (Variables: X6, X10, X16, X17, X19, X24); SBC 0.464 (same variables); adjusted R² 0.940 (same variables); runtime 0.93415 s; st. dev. 0.02986 s
Nested Estimate Procedure: AIC 0.760 (Variables: X10, X15, X23, X24); SBC 0.780 (same variables); adjusted R² 0.917 (same variables); runtime 0.39289 s; st. dev. 0.0533 s

Page 8

Problem with the results

FAT: collinearity statistics of the optimal IHSRS solution for adjusted R²
Variable    Tolerance   VIF
X1          0.069       14.490
X3          0.017       59.097
X5          0.089       11.271
X6          0.030       33.682
X8          0.105       9.540
X12         0.239       4.182
X15         0.399       2.509


DATA26: collinearity statistics of the optimal IHSRS solution for adjusted R²
Variable    Tolerance   VIF
(Constant)
X1          0.065       15.347
X4          0.001       1644.939
X5          0.003       388.860
X6          0.002       538.248
X8          0.005       197.505
X10         0.050       20.165
X12         0.001       1366.452
X13         0.030       33.293
X15         0.001       1133.939
X16         0.048       20.828
X17         0.041       24.297
X18         0.016       64.340
X21         0.003       393.569
X23         0.002       554.800
X24         0.004       262.232
X25         0.001       825.023

Page 9

Modify the IHSRS: include an "all VIFs < 2" condition in the optimization task (one way to encode this constraint is sketched at the end of this slide).

Optimal solutions of IHSRS with the VIF condition:


FAT: optimal IHSRS solution with the VIF condition
Variable    Tolerance   VIF
X1          0.508       1.970
X2          0.879       1.138
X8          0.558       1.791
Adjusted R² = 0.9854

DATA26: optimal IHSRS solution with the VIF condition
Variable    Tolerance   VIF
(Constant)
X2          0.503       1.986
X6          0.548       1.825
X10         0.500       1.999
X14         0.526       1.902
X23         0.565       1.770
Adjusted R² = 0.991

Other models with all VIF values smaller than 2:
Backward – VIF: adjusted R² = 0.9540 (FAT); 0.940 (DATA26)
Nested Estimates: adjusted R² = 0.9538 (FAT); 0.917 (DATA26)
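One simple way to add the "all VIFs < 2" condition to the optimization task is to make infeasible subsets lose automatically: reject any candidate whose VIFs reach the limit before its adjusted R² can compete. A sketch reusing the `vif` and `fit_criteria` helpers from the earlier slides (the authors' C# implementation may enforce the constraint differently):

```python
import numpy as np

def constrained_fitness(X, y, subset, vif_limit=2.0):
    """Adjusted R^2 of the subset, or -inf when any VIF reaches the limit."""
    subset = list(subset)
    if len(subset) >= 2 and np.any(vif(X[:, subset]) >= vif_limit):
        return -np.inf                        # infeasible: multicollinear subset
    adj_r2, _, _ = fit_criteria(X, y, subset)
    return adj_r2
```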

Page 10

A Great Setback for the Modified IHSRS


[Bar charts: average solution time and standard deviation of solution times, in number of steps and in seconds, on FAT and DATA26, comparing IHSRS without the VIF condition with IHSRS with the VIF condition]

Average runtime is almost an hour!

Page 11

We cannot parallelize the IHSRS


An individual (melody) is a 0/1 vector over the independents, e.g. ● = 0 0 1 0 1 1 1.
The population (harmony memory) holds several such individuals: ● ● ● ●.

Step 1 & 2: Generate a random harmony memory and evaluate the regressions for each individual.
Improvise a new individual: with probability HMCR take ● from the harmony memory, with probability 1-HMCR generate a RANDOM individual.
Pitch adjustment: with probability PAR mutate ● with mutation probability bw, with probability 1-PAR leave ● unmodified; PAR is increased and bw is decreased during the run.
Is the new ● better than the worst individual? If YES, replace the worst individual.
Termination criterion met? If YES, STOP; if NO, improvise again.
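Read as pseudocode, the loop above could be rendered roughly as follows in Python. This is a hedged sketch of one plausible binary reading of the flowchart: the names `hmcr`, `par0` and `bw0` mirror the slide, everything else is an assumption, and the authors' actual implementation is in C#. The `fitness` callback is any score to maximise, e.g. the VIF-constrained adjusted R² sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

def ihsrs(m, fitness, hms=4, hmcr=0.9, par0=0.35, bw0=0.1, max_iter=50_000):
    """Binary improved harmony search for subset selection (illustrative only).
    An individual is a 0/1 vector over the m independents."""
    memory = rng.integers(0, 2, size=(hms, m))                # harmony memory
    scores = np.array([fitness(np.flatnonzero(h)) for h in memory])
    for it in range(max_iter):
        par = par0 + (0.99 - par0) * it / max_iter            # PAR grows ...
        bw = bw0 * (1.0 - it / max_iter)                       # ... bw shrinks
        new = np.empty(m, dtype=int)
        for j in range(m):
            if rng.random() < hmcr:                            # memory consideration
                new[j] = memory[rng.integers(hms), j]
                if rng.random() < par and rng.random() < bw:   # pitch adjustment
                    new[j] ^= 1                                # flip the bit
            else:                                              # random individual
                new[j] = rng.integers(0, 2)
        score = fitness(np.flatnonzero(new))
        worst = scores.argmin()
        if score > scores[worst]:                              # replace the worst
            memory[worst], scores[worst] = new, score
    best = scores.argmax()
    return np.flatnonzero(memory[best]), scores[best]
```

A call might look like `ihsrs(X.shape[1], lambda s: constrained_fitness(X, y, s))` (hypothetical wiring); the real termination criterion and parameter schedule are those of the authors' C# code.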

Page 12

Our GA-HS hybrid solution


An individual is a 0/1 vector over the independents, e.g. ● = 0 0 1 0 1 1 1.
The population holds several such individuals: ● ● ● ●.

Step 1 & 2: Generate a random population and evaluate the regressions for each individual.
Selection: keep the better-than-average individuals and start a new population with the remaining slots empty (● ● x x).
Fill the empty slots (this part can be parallelized!): with probability HMCR copy a surviving ● and mutate it with mutation probability bw, with probability 1-HMCR generate a RANDOM individual; HMCR is increased and bw is decreased during the run.
Is every x filled? If NO, keep filling; if YES, evaluate the regressions for the new individuals in the population.
Termination criterion met? If YES, STOP; if NO, start the next generation.
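The structural difference that makes the hybrid parallelizable is that a whole batch of new individuals is created per generation, so their regression evaluations are independent of each other. A rough Python sketch of one generation (the function name, parameters and the use of a process pool are illustrative assumptions; the authors' implementation is in C# and its parallelisation mechanism is not described here):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

rng = np.random.default_rng(1)

def ga_hs_generation(population, scores, fitness, hmcr=0.9, bw=0.05):
    """One generation: keep better-than-average individuals, refill the empty
    slots harmony-search style, evaluate the new individuals in parallel."""
    pop_size, m = population.shape
    keep = scores >= scores.mean()                 # selection: better than average
    survivors = population[keep]
    new = []
    while len(survivors) + len(new) < pop_size:    # fill every empty slot ("x")
        if rng.random() < hmcr:                    # copy a survivor and mutate it
            child = survivors[rng.integers(len(survivors))].copy()
            child[rng.random(m) < bw] ^= 1         # bw-probability bit flips
        else:                                      # or generate a random individual
            child = rng.integers(0, 2, size=m)
        new.append(child)
    # the parallelisable part: the fitness of each new individual is independent
    # (requires a picklable fitness and an `if __name__ == "__main__":` guard)
    with ProcessPoolExecutor() as pool:
        new_scores = list(pool.map(fitness, [np.flatnonzero(c) for c in new]))
    return np.vstack([survivors] + new), np.concatenate([scores[keep], new_scores])
```

As the flowchart indicates, the caller would increase hmcr and decrease bw from generation to generation.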

Page 13

Differences from GA
1. More than one kind of mutation
2. No crossover

In linear regression model selection, randomization is more important than inherited good properties: the inclusion or exclusion of a single independent can save or ruin a model. We observed that GA is a relatively slow algorithm when applied to model selection.


Page 14

The Performance

[Bar charts: average solution time and standard deviation of solution times, in number of steps and in seconds, on FAT and DATA26, comparing IHSRS + VIF with GAIHSRS + VIF and the standard implementation with the parallel one]

Average runtime and standard deviation are decreased by 2/3.

Thank you for your attention!


Page 15

Environment

The solution times are an average of 30 runs. The standard deviation of the runtimes is determined from the same 30 runs.

Most of the selection algorithms were run in IBM SPSS Statistics 22. Elastic Net: the CATREG SPSS macro from Leiden University. Partial Least Squares: the NumPy and SciPy Python libraries.

The metaheuristics (GARS, improved GARS, IHSRS, GAIHSRS) are implemented in C#.

OS and hardware configuration
OS: Windows 8.1 Ultimate 64-bit
CPU: Intel Core i7-2700K, 3.5 GHz
RAM: 16 GB DDR3 SDRAM