Linear Regression Model Selection Using a Hybrid Genetic-Improved Harmony Search Parallelized Algorithm
Blanka Láng, László Kovács, László Mohácsi
Corvinus University of Budapest, Institute of Information Technology
Contents
- Linear Regression
- Model Selection Problem
- Datasets Used
- Performance of Selection Algorithms on Our Data
- The Need for a New Solution
- The Performance of Our Hybrid Algorithm
Linear Regression
We have:
- Y: dependent variable
- X = (X₁, X₂, …, Xₘ): vector of independent variables
Goal:
Y = β₀ + β₁X₁ + β₂X₂ + … + βₘXₘ + ε
OLS model:
Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂ₘXₘ = β̂₀ + Σ_{j=1..m} β̂ⱼXⱼ
Parsimony: choose a subset X' ⊆ X → minimize the residuals using as few independents as possible, maximizing the model's ability to generalize.
Partial effects of the independents → keep only significant variables in the model; these hypotheses can be statistically tested.
Objective functions: AIC, SBC, HQC → MIN; adjusted R² → MAX
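The selection criteria above can be sketched in Python for a single candidate subset. The scaling conventions for AIC and SBC differ between packages, so the forms below (AIC = ln(SSE/n) + 2k/n, SBC = ln(SSE/n) + k·ln(n)/n) are one common choice, not necessarily the one SPSS reports; the function names are illustrative.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the residual sum of squares."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def criteria(X, y):
    """AIC, SBC and adjusted R^2 for one candidate subset of regressors.

    One common convention: AIC = ln(SSE/n) + 2k/n, SBC = ln(SSE/n) + k ln(n)/n,
    where k counts the estimated coefficients; exact scalings vary by package."""
    n, k = len(y), X.shape[1] + 1                 # +1 for the intercept
    sse = fit_ols(X, y)
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    aic = np.log(sse / n) + 2 * k / n
    sbc = np.log(sse / n) + k * np.log(n) / n
    return aic, sbc, adj_r2
```

A subset containing the truly relevant regressors scores a lower AIC/SBC and a higher adjusted R² than a subset of irrelevant ones, which is exactly what the MIN/MAX directions above express.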
Dataset #1: BodyFat Measurements (FAT), a real dataset from 1996
- n = 252
- Y: percent of body fat to muscle tissue
- m = 16 independents (age, abdomen circumference, weight, height, etc.)
Multicollinearity: redundancy between independents. E.g.: which of two redundant independents matters most when predicting Y? How can we interpret the partial effects of these independents?
Measure: regress each independent on the others → a VIF indicator for each independent; if VIF > 2 → multicollinearity.
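The VIF measure described above can be sketched directly: regress each independent on the remaining ones and compute 1/(1 − R²). The function name and the test against the slide's cutoff of 2 are illustrative.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j of X on the remaining columns (with an intercept)."""
    n, m = X.shape
    out = []
    for j in range(m):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        out.append(1.0 / (1.0 - r2))
    return out
```

Independent columns give VIF ≈ 1; a near-duplicate column pushes both copies well past the threshold of 2.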
Dataset #2: DATA26, a dataset simulated from a Gumbel copula
- n = 1000
- m = 25 (plus Y)
- Generating a correlation matrix (CM) with high correlations in absolute value: vineBeta method (Lewandowski et al., 2009)
- Simulating multicollinearity
- All 26 generated variables follow N(µ, σ) distributions, where µ and σ are randomly generated for each variable.
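The second step of the recipe above, turning a given correlation matrix into correlated N(µ, σ) draws, can be sketched with a Gaussian construction via Cholesky factorization. This is a simplified stand-in: the slide's Gumbel copula and the vineBeta generation of the correlation matrix itself are not reproduced here, and the function name is illustrative.

```python
import numpy as np

def correlated_normals(corr, mus, sigmas, n, seed=0):
    """Draw n samples of variables with target correlation matrix `corr`
    and per-variable means/standard deviations, via Cholesky factorization."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(corr)             # corr must be positive definite
    z = rng.standard_normal((n, len(mus)))   # independent N(0, 1) draws
    return mus + sigmas * (z @ L.T)          # impose correlation, then scale
```

Feeding in a matrix with high off-diagonal entries yields exactly the kind of strongly multicollinear regressors the DATA26 experiment needs.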
Performance of Selection Algorithms (FAT)
| Method | AIC | SBC | adj. R² | Runtime (s) | St. Dev. (s) |
|---|---|---|---|---|---|
| Best Subsets (SPSS Leaps and Bound) | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15) | 4.558 | 0.878 |
| Best Subsets (Minerva: GARS) | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15) | 5.921 | 1.658 |
| improved GARS | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15) | 11.268 | 2.941 |
| IHSRS | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0.968 | 0.188 |
| Forward + Backward | 0.058 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0.239 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0.976 | 0.050 |
| Variable Importance in Projection (Partial Least Squares) | -0.247 (Variables: 1, 2, 5, 6, 8, 9) | -0.092 (Variables: 1, 2, 5, 6, 8, 9) | 0.9618 (Variables: 1, 2, 5, 6, 8, 9) | 1.807 | 0.896 |
| Elastic Net | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9410 (Variables: 1) | 50.858 | 9.019 |
| Stepwise VIF Selection | -0.189 (Variables: 1, 2, 15) | -0.008 (Variables: 1, 2, 15) | 0.954 (Variables: 1, 2, 15) | 0.832 | 0.034 |
| Nested Estimate Procedure | -1.402 (Variables: 1, 8) | -1.351 (Variables: 1, 8) | 0.9538 (Variables: 1, 8) | 0.352 | 0.047 |
Performance of Selection Algorithms (DATA26)
| Method | AIC | SBC | R² | Runtime (s) | St. Dev. (s) |
|---|---|---|---|---|---|
| Best Subsets (SPSS Leaps and Bound) | -8.840 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | -8.756 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | 0.9999944 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3) | 32.352745 | 7.04028 |
| Best Subsets (Minerva: GARS) | -8.841 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3) | -8.826 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 0.9999944 (Variables: X15, X6, X24, X23, X5, X12, X9, X4, X1, X25, X10, X21, X13, X17, X16, X18, X14, X3) | 52.714638 | 12.62692 |
| improved GARS | -8.731 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | -8.826 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 0.99999744 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 1281.45823 | 380.10328 |
| IHSRS | -8.731 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | -8.826 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 0.99999744 (Variables: X25, X10, X17, X13, X1, X16, X24, X18, X5, X21, X8, X23, X15, X12, X6, X4) | 402.1666233 | 79.070735 |
| Forward + Backward | -8.840 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | -8.756 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | 0.9999944 (Variables: X24, X23, X10, X6, X4, X15, X17, X1, X13, X14, X12, X16, X5, X25, X9, X21, X18) | 1.0744 | 0.0937 |
| Variable Importance in Projection (Partial Least Squares) | -5.196 (Variables: X24, X5, X4, X10, X20, X18, X8, X22, X23, X11, X15, X6, X12) | -5.132 (Variables: X24, X5, X4, X10, X20, X18, X8, X22, X23, X11, X15, X6, X12) | 0.99979 (Variables: X24, X5, X4, X10, X20, X18, X8, X22, X23, X11, X15, X6, X12) | 15.095273 | 7.19626 |
| Elastic Net | -4.363 (full model; not significant: X5, X13) | -4.240 (full model; not significant: X5, X13) | 0.993 (full model; not significant: X5, X13) | 478.683794 | 99.82244 |
| Stepwise VIF Selection | 0.434 (Variables: X6, X10, X16, X17, X19, X24) | 0.464 (Variables: X6, X10, X16, X17, X19, X24) | 0.940 (Variables: X6, X10, X16, X17, X19, X24) | 0.93415 | 0.02986 |
| Nested Estimate Procedure | 0.760 (Variables: X10, X15, X23, X24) | 0.780 (Variables: X10, X15, X23, X24) | 0.917 (Variables: X10, X15, X23, X24) | 0.39289 | 0.0533 |
Problem with the results

FAT: collinearity statistics of the selected model

| Variable | Tolerance | VIF |
|---|---|---|
| X1 | 0.069 | 14.490 |
| X3 | 0.017 | 59.097 |
| X5 | 0.089 | 11.271 |
| X6 | 0.030 | 33.682 |
| X8 | 0.105 | 9.540 |
| X12 | 0.239 | 4.182 |
| X15 | 0.399 | 2.509 |
DATA26: collinearity statistics of the selected model

| Variable | Tolerance | VIF |
|---|---|---|
| X1 | 0.065 | 15.347 |
| X4 | 0.001 | 1644.939 |
| X5 | 0.003 | 388.860 |
| X6 | 0.002 | 538.248 |
| X8 | 0.005 | 197.505 |
| X10 | 0.050 | 20.165 |
| X12 | 0.001 | 1366.452 |
| X13 | 0.030 | 33.293 |
| X15 | 0.001 | 1133.939 |
| X16 | 0.048 | 20.828 |
| X17 | 0.041 | 24.297 |
| X18 | 0.016 | 64.340 |
| X21 | 0.003 | 393.569 |
| X23 | 0.002 | 554.800 |
| X24 | 0.004 | 262.232 |
| X25 | 0.001 | 825.023 |
Optimal solutions of IHSRS for R̄²
Modify the IHSRS: add an "all VIFs < 2" condition to the optimization task.
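One simple way to realize an "all VIFs < 2" condition is a feasibility filter on the objective: any candidate subset that violates it scores minus infinity and is never accepted. This is a sketch of the idea, not necessarily how the authors wired the condition into the IHSRS; it uses the fact that the VIFs are the diagonal of the inverse correlation matrix of the regressors.

```python
import numpy as np

def all_vifs_below(X, limit=2.0):
    """True iff every VIF among the columns of X is below `limit`.
    The VIFs are the diagonal of the inverse correlation matrix."""
    if X.shape[1] < 2:
        return True                                # a lone regressor has VIF 1
    R = np.corrcoef(X, rowvar=False)
    return float(np.diag(np.linalg.inv(R)).max()) < limit

def constrained_score(X, y, subset, limit=2.0):
    """Adjusted R^2 of the candidate model if the VIF condition holds,
    else -inf, so the search can never accept an infeasible subset."""
    cols = X[:, list(subset)]
    if not all_vifs_below(cols, limit):
        return float("-inf")
    n, k = len(y), cols.shape[1] + 1               # +1 for the intercept
    A = np.column_stack([np.ones(n), cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)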
Optimal solutions of IHSRS with VIF conditions:
FAT: optimal IHSRS solution with VIF condition (R̄² = 0.9854)

| Variable | Tolerance | VIF |
|---|---|---|
| X1 | 0.508 | 1.970 |
| X2 | 0.879 | 1.138 |
| X8 | 0.558 | 1.791 |
DATA26: optimal IHSRS solution with VIF condition (R̄² = 0.991)

| Variable | Tolerance | VIF |
|---|---|---|
| X2 | 0.503 | 1.986 |
| X6 | 0.548 | 1.825 |
| X10 | 0.500 | 1.999 |
| X14 | 0.526 | 1.902 |
| X23 | 0.565 | 1.770 |
Other models with all VIF values smaller than 2:
- Backward-VIF: R̄² = 0.9540 (FAT); 0.940 (DATA26)
- Nested Estimates: R̄² = 0.9538 (FAT); 0.917 (DATA26)
A great setback for the modified IHSRS
[Bar charts, FAT and DATA26: average solution time and standard deviation of solution times for IHSRS without the VIF condition vs. IHSRS with the VIF condition, measured both in number of steps (FAT axis up to 60,000; DATA26 up to 250,000) and in seconds (FAT up to 70; DATA26 up to 3,500).]
Average runtime is almost an hour!
We cannot parallelize the IHSRS.
individual / melody: ● = 0 0 1 0 1 1 1
population / harmony memory: ● ● ● ●
STEP 1 & 2: Generate a random harmony memory and evaluate the regressions for each individual.
For each position of a new harmony:
- with probability HMCR: take the value from a random member of the harmony memory;
- with probability 1 - HMCR: generate a RANDOM value.
Then:
- with probability PAR: mutate ● with mutation bandwidth (bw);
- with probability 1 - PAR: no modification on ●.
Increase PAR and decrease bw over the iterations.
Is the new ● better than the worst individual? YES → replace the worst individual; NO → keep the memory unchanged.
Termination criterion met? YES → STOP; NO → generate the next harmony.
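The loop above can be sketched for the binary subset encoding. In a 0/1 encoding, pitch adjustment reduces to a bit flip, so the bandwidth bw is not needed here, and the improved variant's adaptation of PAR and bw over time is omitted for brevity; parameter names and defaults are illustrative, not the authors' settings, and the objective function is passed in as `score`.

```python
import random

def harmony_search(m, score, hms=4, hmcr=0.9, par=0.3, iters=2000, seed=0):
    """Harmony-search-style subset search over 0/1 vectors of length m,
    following the flow above: build each new harmony bit-by-bit from the
    memory (prob. HMCR) or at random, pitch-adjust (flip) with prob. PAR,
    and replace the worst memory member whenever the newcomer beats it."""
    rng = random.Random(seed)
    memory = [[rng.randint(0, 1) for _ in range(m)] for _ in range(hms)]
    scores = [score(h) for h in memory]
    for _ in range(iters):
        new = []
        for j in range(m):
            if rng.random() < hmcr:                 # take bit j from memory
                bit = memory[rng.randrange(hms)][j]
            else:                                   # else a random bit
                bit = rng.randint(0, 1)
            if rng.random() < par:                  # pitch adjustment: flip
                bit = 1 - bit
            new.append(bit)
        s = score(new)
        worst = min(range(hms), key=lambda i: scores[i])
        if s > scores[worst]:                       # replace worst if better
            memory[worst], scores[worst] = new, s
    best = max(range(hms), key=lambda i: scores[i])
    return memory[best], scores[best]
```

In the model-selection setting, `score` would be one of the objective functions from the earlier slides (e.g. adjusted R² of the subset's regression).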
Our GA-HS hybrid solution
individual: ● = 0 0 1 0 1 1 1
population: ● ● ● ●
STEP 1 & 2: Generate a random population and evaluate the regressions for each individual.
Select the better-than-average individuals: ● ●
Start a new population: ● ● x x
Can be parallelized!
For each empty slot x:
- with probability HMCR: take values from the surviving individuals;
- with probability 1 - HMCR: generate a RANDOM individual.
Mutate ● with mutation bandwidth (bw); increase HMCR and decrease bw over the iterations.
Is every x filled? NO → fill the next slot; YES → evaluate the regressions for the new individuals in the population.
Termination criterion met? YES → STOP; NO → start the next generation.
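The "Can be parallelized!" step is the key structural difference: the new individuals that fill the x slots are independent of one another, so their regression evaluations (the expensive part) can run concurrently. A minimal Python sketch of that step, assuming a `score` callback for the regression fit; the authors' implementation is in C#, so `ThreadPoolExecutor` and the function name are stand-ins (for CPU-bound fits a process pool would give the real speedup):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def refill_population(survivors, m, score, pop_size, hmcr=0.9, seed=0):
    """Fill the empty (x) slots of the new population, then evaluate the new
    individuals in parallel, as in the hybrid's flow above. Each new bit is
    copied from a random survivor with probability HMCR, otherwise drawn at
    random."""
    rng = random.Random(seed)
    def make_one():
        return [survivors[rng.randrange(len(survivors))][j]
                if rng.random() < hmcr else rng.randint(0, 1)
                for j in range(m)]
    new = [make_one() for _ in range(pop_size - len(survivors))]
    with ThreadPoolExecutor() as pool:       # independent evaluations
        new_scores = list(pool.map(score, new))
    return survivors + new, [score(s) for s in survivors] + new_scores
```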
Differences from GA:
1. More than one kind of mutation
2. No crossover
In linear regression model selection, randomization is more important than inherited good properties: the inclusion or exclusion of a single independent can save or ruin a model. We observed that GA is a relatively slow algorithm when applied to model selection.
ThePerformance
[Bar charts, DATA26: average solution time and standard deviation of solution times (number of steps, axis up to 250,000) for IHSRS + VIF vs. GA-IHSRS + VIF. Bar charts, FAT (seconds, axis up to 70) and DATA26 (seconds, axis up to 4,000): standard vs. parallel implementation.]
Average runtime and standard deviation are decreased by two-thirds.
Thank you for your attention!
[Bar chart, FAT: average solution time and standard deviation of solution times (number of steps, axis up to 60,000) for IHSRS + VIF vs. GA-IHSRS + VIF.]
Environment
The solution times are an average of 30 runs; the standard deviation of the runtimes is determined from the same 30 runs.
Most selection algorithms were run in IBM SPSS Statistics 22. Elastic Net: the Catreg SPSS macro by the University of Leiden. Partial Least Squares: the NumPy and SciPy Python libraries.
The metaheuristics (GARS, improved GARS, IHSRS, GA-IHSRS) are implemented in C#.
OS and hardware configuration:
- OS: Windows 8.1 Ultimate, 64-bit
- CPU: Intel Core i7-2700K, 3.5 GHz
- RAM: 16 GB DDR3 SDRAM