ModDrop: Adaptive Multi-Modal Gesture Recognition
Natalia Neverova, Christian Wolf, Graham Taylor, and Florian Nebout
PAMI 2016
Presented by Chongyang Bai, May 14, 2020, Dartmouth
Gesture Recognition from RGBD Video
• Multi-modal inputs
Gesture Recognition from RGBD Video
• Multi-modal inputs
• Localization
▫ Start frame
▫ End frame
• Recognition
▫ Which gesture?
Challenges
• Gestures at various spatial and temporal scales
• Noisy/missing signals (e.g., depth)
• Limited training data vs. flexible gestures
• Real-time operation
Contributions
• Multi-modal and multi-scale deep learning framework
• Challenges addressed
▫ Gestures at various spatial and temporal scales
▫ Flexible gestures
▫ Real-time operation
Contributions
• Multi-modal and multi-scale deep learning framework
▫ ModDrop: multi-modal dropout training technique
• Challenges addressed
▫ Noisy/missing signals (e.g., depth)
▫ Limited training data
Contributions
• Multi-modal and multi-scale deep learning framework
▫ ModDrop: multi-modal dropout training technique
▫ Model initialization for multi-modal fusion
• Challenges addressed
▫ Limited training data
Results Summary
• Achieves 0.87 Jaccard index (rank 1) in the ChaLearn 2014 challenge
▫ Improves to 0.88 when adding the audio modality (ChaLearn + A)
• A localization refinement technique further improves accuracy
• ModDrop is robust to noisy or missing samples at test time on
▫ MNIST
▫ ChaLearn + A
• The initialization for multi-modal fusion is effective
Related Work
• Gesture recognition
▫ Classification with motion trajectories [1]
▫ HoG features from RGB and depth images [2]
▫ 3D CNNs to learn spatio-temporal representations [3]
• Multi-modal fusion
▫ Early fusion and late fusion [4]
▫ Multiple Kernel Learning (MKL) [5]
▫ Deep neural nets [6]
Methodology Outline
• Overall architecture
• Multi-modal framework
▫ Initialization for multi-modal fusion
▫ ModDrop method and its regularization properties
• Inter-scale late fusion
• Gesture localization as refinement
Overall Architecture
• Multi-scale sampling
• Single-scale multi-modal fusion
• Inter-scale late fusion
Single-scale Multi-modal Fusion
• Four paths
• Single-path pre-training
• Initialization for multi-modal fusion
• ModDrop
Single-scale Multi-modal Fusion
• Paths V1/V2 for hands
▫ Input
- Depth volume (W x H x 5)
- Gray-scale volume (W x H x 5)
▫ Architecture
- Conv3D, max pooling over time, Conv2D
- Flatten and concatenate into HLV1
▫ Horizontally flipped input for V2, sharing parameters with V1
▫ Detect active hands for training from the trajectory of the hand joint
Single-scale Multi-modal Fusion
• Input normalization
▫ Normalize hand bounding boxes in each frame according to the hand's distance to the sensor [7]
▫ H_x: bounding box size (pixels), h_x: actual hand size (mm), z: distance to the sensor, X: image width
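The normalization above can be sketched with a pinhole-camera relation: a hand of physical size h_x at distance z projects to roughly f * h_x / z pixels, so crops of a fixed physical size stay comparable across depths. The focal length `f_px` below is an assumed parameter (the slide's formula is expressed via the image width X instead):

```python
def hand_crop_px(h_mm: float, z_mm: float, f_px: float) -> float:
    """Pinhole-camera sketch of depth-normalized crop sizing: a physical
    extent h_mm at distance z_mm projects to ~f_px * h_mm / z_mm pixels.
    f_px (focal length in pixels) is an assumption, not from the slide."""
    return f_px * h_mm / z_mm
```

Doubling the distance halves the crop size in pixels, which is exactly the variation the normalization compensates for.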
Single-scale Multi-modal Fusion
• Path M (articulated pose)
▫ 3-layer MLP
▫ Per-frame input features
- Normalized joint positions
- Joint velocities and accelerations
- Inclination angles
- Azimuth angles
- Bending angles
- Pairwise distances
▫ A rich representation of individual articulation differences
▫ Concatenate the features over 5 frames
(Body pose image source: [7])
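The velocity and acceleration components of the descriptor can be sketched as finite temporal differences of the normalized joint positions (a minimal sketch; the exact descriptor follows [7]):

```python
import numpy as np

def motion_features(joints: np.ndarray) -> np.ndarray:
    """Append first/second temporal differences (velocity, acceleration)
    to a (T, D) array of normalized joint coordinates. A sketch of the
    kind of pose descriptor listed on the slide; padding the first frame
    with a zero difference is an assumption."""
    vel = np.diff(joints, axis=0, prepend=joints[:1])  # frame-to-frame velocity
    acc = np.diff(vel, axis=0, prepend=vel[:1])        # change of velocity
    return np.concatenate([joints, vel, acc], axis=1)
```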
Single-scale Multi-modal Fusion
• Path M (articulated pose)
▫ Input features [7]
- Normalized joint positions
- Azimuth angles
- Bending angles
- Pairwise distances
(Pose image source: [7])
Single-scale Multi-modal Fusion
• Path A (audio)
▫ Input: mel-frequency histograms (time-frequency-amplitude)
▫ Fed into a Conv2D layer + 2 hidden layers
Single-scale Multi-modal Fusion
• Single-path pre-training
▫ The whole network has too many parameters
• Early fusion of heterogeneous data sources is not effective
• Fuse the paths in a late hidden layer (HLS)
(Diagram: each path is pre-trained with its own softmax classifier.)
Initialization of Multi-modal Fusion
• Shared layer 1: W_1
▫ N classes
▫ K modalities with feature dimensions d_1, ..., d_K; d = Σ_k d_k is the input dimension
▫ NK: output dimension
Initialization of Multi-modal Fusion
• Shared layer 1: W_1
▫ N classes
▫ K modalities with feature dimensions d_1, ..., d_K; Σ_k d_k is the input dimension
▫ NK: output dimension
▫ x_i^(k): feature i of modality k (blue circles)
▫ u_j^(m): unit j associated with modality m (pink circle)
▫ w_{i,j}^{(m,k)}: weight between the two
▫ Initialization: α = 0, and the within-modality weights w_{i,j}^{(k,k)} come from pre-training
▫ Increase α to learn cross-modality correlations
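The block structure of this initialization can be sketched as follows: within-modality blocks are copied from the pre-trained paths, while cross-modality blocks are scaled by α (filling them with small random values is an assumption; the slide only specifies α = 0 at the start):

```python
import numpy as np

def init_fusion_weights(pretrained: list, alpha: float = 0.0) -> np.ndarray:
    """Block-structured initialization of the first shared fusion layer:
    within-modality blocks w^{(k,k)} are copied from per-path pre-training,
    cross-modality blocks start as alpha * (small random values). A sketch
    of the slide's scheme; the random cross-block values are an assumption."""
    rng = np.random.default_rng(0)
    rows = sum(w.shape[0] for w in pretrained)  # stacked per-modality units
    cols = sum(w.shape[1] for w in pretrained)  # stacked per-modality inputs
    W = alpha * rng.normal(scale=0.01, size=(rows, cols))
    r = c = 0
    for w in pretrained:
        W[r:r + w.shape[0], c:c + w.shape[1]] = w  # keep pre-trained block
        r += w.shape[0]
        c += w.shape[1]
    return W
```

With `alpha=0` the fused layer initially behaves exactly like the independent pre-trained paths; raising α injects small cross-modality weights that training can then grow.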
Initialization of Multi-modal Fusion
• Shared layer 2: W_2
▫ NK: input dimension
(Diagram: the four per-modality softmax outputs from pre-training feed the shared fusion layers.)
Initialization of Multi-modal Fusion
• Shared layer 2: W_2
▫ NK: input dimension
▫ N: output dimension (number of classes)
▫ Input h_c^(k) is the pre-softmax score for class c predicted via modality k during pre-training
▫ W_2^(k) ∈ R^(N x N): the block of W_2 acting on modality k's scores
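One plausible reading of this initialization (an assumption on my part, not a verbatim formula from the slide) is that each block W_2^(k) starts as a scaled identity, so the fused layer initially averages the per-modality class scores:

```python
import numpy as np

def init_w2(n_classes: int, n_modalities: int) -> np.ndarray:
    """Assumed initialization of the second shared layer: each modality's
    N pre-softmax class scores pass through an identity block scaled by
    1/K, so the fused output starts as the mean of the per-modality
    predictions. A sketch, not the paper's exact formula."""
    block = np.eye(n_classes) / n_modalities               # W_2^(k), shape (N, N)
    return np.concatenate([block] * n_modalities, axis=1)  # shape (N, N*K)
```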
ModDrop: Multimodal Dropout
• Inspired by Dropout
• Avoids false/redundant co-adaptation between modalities
• How can we obtain robust predictions when some modalities are missing or noisy at test time?
• L: cross-entropy loss
• f^(k): the model specific to modality k
• W_h: weights for layer h
• Ideal loss: the cross-entropy summed over every subset of available modalities
ModDrop: Multimodal Dropout
• Inspired by Dropout
• Avoids false co-adaptation between modalities
• Obtains robust predictions when some modalities are missing or noisy
• L: cross-entropy loss
• W_h: weights
• Ideal loss: a huge amount of computation for its ~2^K subset terms!
ModDrop: Multimodal Dropout
• Solution: randomly drop whole-modality inputs when training each batch
• Input sampled from modality k: x^(k)
• Bernoulli selector δ^(k) as the random indicator variable, with P(δ^(k) = 1) = p_k
• Multi-modal network output: f(δ^(1) x^(1), ..., δ^(K) x^(K))
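The batch-level procedure can be sketched as per-sample Bernoulli masking of whole modalities (a minimal sketch of the training-time scheme; `p[k]` is the probability of keeping modality k):

```python
import numpy as np

def moddrop_batch(inputs: list, p: list, rng=None) -> list:
    """ModDrop-style input masking: for each modality k, draw a Bernoulli
    selector per sample with keep-probability p[k] and zero out that
    sample's whole modality when the selector is 0. A sketch of the
    scheme described on the slide."""
    rng = rng or np.random.default_rng()
    batch = inputs[0].shape[0]
    out = []
    for x, pk in zip(inputs, p):
        # one selector per sample, broadcast over the modality's features
        delta = rng.binomial(1, pk, size=(batch,) + (1,) * (x.ndim - 1))
        out.append(x * delta)
    return out
```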
ModDrop: Multimodal Dropout
• Regularization properties on a one-layer network with sigmoid activation
▫ Original: s = Σ_k Σ_i w_i^(k) x_i^(k), prediction σ(s)
▫ ModDrop: s̃ = Σ_k δ^(k) Σ_i w_i^(k) x_i^(k), prediction σ(s̃)
ModDrop: Multimodal Dropout
Regularization properties on a one-layer network + sigmoid activation
(Per-sample losses of the original network and of ModDrop, side by side.)
ModDrop: Multimodal Dropout
Regularization properties on a one-layer network + sigmoid activation
▫ Indices dropped for brevity
(Per-sample losses of the original network and of ModDrop, side by side.)
ModDrop: Multimodal Dropout
Regularization properties on a one-layer network + sigmoid activation
• Take the expectation of the equation over the Bernoulli selectors, using E[δ^(k)] = p_k
▫ Approximation 1: E[σ(x)] ≈ σ(E[x]) [8]
▫ Approximation 2: first-order Taylor expansion around s
ModDrop: Multimodal Dropout
Regularization properties on a one-layer network + sigmoid activation
• Substitute the approximations into the gradient
• The gradient for ModDrop is the drop probability times the original gradient, minus a regularization term
ModDrop: Multimodal Dropout
Regularization properties on a one-layer network + sigmoid activation
• The gradient for ModDrop is the drop probability times the original gradient, minus a regularization term
• Under the stated condition, integrate over the partial derivative and take the summation over all k and i
ModDrop: Multimodal Dropout
Regularization properties on a one-layer network + sigmoid activation
• Substitute the gradient of the original network into the gradient for ModDrop
• The gradient for ModDrop is the drop probability times the original gradient, minus a regularization term
• Integrating over the partial derivative and summing over all k and i:
▫ The ModDrop loss is p times the complete-model loss minus a regularization term
▫ The regularization term contains only cross-modality multiplications!
ModDrop: Multimodal Dropout
Regularization properties on a one-layer network + sigmoid activation
• For any features i, j from modalities k, m (per-sample loss):
▫ x_i^(k) and x_j^(m)
▫ E[x_i^(k)] = E[x_j^(m)] = 0; this can always be enforced by input normalization
▫ E[x_i^(k) x_j^(m)] = E[x_i^(k)] E[x_j^(m)] + Cov(x_i^(k), x_j^(m)) = Cov(x_i^(k), x_j^(m))
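A quick numeric check of the identity (on synthetic data, not from the paper): for zero-mean features the mean product equals the covariance, and it is positive for positively correlated features:

```python
import numpy as np

# For zero-mean features, E[x_i x_j] equals Cov(x_i, x_j), so correlated
# features make the expected cross-modality product nonzero.
rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)  # positively correlated with x
x -= x.mean()
y -= y.mean()                            # "input normalization"
prod_mean = np.mean(x * y)               # empirical E[x y]
cov = np.cov(x, y, bias=True)[0, 1]      # empirical Cov(x, y)
```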
ModDrop: Multimodal Dropout
Regularization properties on a one-layer network + sigmoid activation
• For any features i, j from modalities k, m:
▫ E[x_i^(k)] = E[x_j^(m)] = 0, due to input normalization
▫ E[x_i^(k) x_j^(m)] = E[x_i^(k)] E[x_j^(m)] + Cov(x_i^(k), x_j^(m)) = Cov(x_i^(k), x_j^(m))
• Case 1: x_i^(k) and x_j^(m) positively correlated
▫ E[x_i^(k) x_j^(m)] > 0
▫ The training process encourages w_i^(k) w_j^(m) to be positive
ModDrop: Multimodal Dropout
Regularization properties on a one-layer network + sigmoid activation
• Case 1: x_i^(k) and x_j^(m) positively correlated
▫ E[x_i^(k) x_j^(m)] > 0
▫ The training process encourages w_i^(k) w_j^(m) to be positive
• Case 2: x_i^(k) and x_j^(m) negatively correlated
▫ E[x_i^(k) x_j^(m)] < 0
▫ The training process encourages w_i^(k) w_j^(m) to be negative
ModDrop: Multimodal Dropout
Regularization properties on a one-layer network + sigmoid activation
• For correlated modalities, the regularization term encourages the network to
▫ Discover similarities between the modalities
▫ Align the modalities by learning the weights
ModDrop: Multimodal Dropout
Regularization properties on a one-layer network + sigmoid activation
• Case 3: x_i^(k) and x_j^(m) uncorrelated
▫ E[x_i^(k) x_j^(m)] = 0
▫ Assumption: the weights obey a unimodal distribution with zero expectation [9]
▫ By Lyapunov's central limit theorem, the term tends to zero as the number of training samples tends to infinity [10]
▫ Additional constraint: L2 regularization on the weights
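The Case-3 behaviour can be illustrated numerically (synthetic data; an illustration, not the paper's proof): for independent zero-mean features, the empirical cross term shrinks toward zero as the sample count grows:

```python
import numpy as np

# For uncorrelated zero-mean features, the averaged cross-modality term
# E[x_i^(k) x_j^(m)] vanishes as the number of samples grows.
rng = np.random.default_rng(2)
xk = rng.normal(size=1_000_000)  # feature from modality k
xm = rng.normal(size=1_000_000)  # independent feature from modality m
cross_term = np.mean(xk * xm)    # empirical E[x_i^(k) x_j^(m)], near zero
```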
Methodology Outline
• Overall architecture
• Multi-modal framework
▫ Initialization for multi-modal fusion
▫ ModDrop method and its regularization properties
• Inter-scale late fusion
• Gesture localization as refinement
Inter-scale Late Fusion
• For frame t and class k, take a weighted sum of the predictions over the temporal scales s = 2, 3, 4
• This yields the final frame-wise prediction
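The fusion step can be sketched as a normalized weighted sum of per-scale score arrays (the per-scale weights here are assumed inputs; the slide does not state how they are chosen):

```python
import numpy as np

def fuse_scales(preds: dict, weights: dict) -> np.ndarray:
    """Inter-scale late fusion sketch: weighted sum of per-scale
    frame-wise class scores. preds[s] is a (T, n_classes) array for
    temporal scale s; the weights are assumptions."""
    total = sum(weights.values())
    fused = sum(weights[s] * preds[s] for s in preds)
    return fused / total  # normalized frame-wise prediction
```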
Gesture Localization as Refinement
• The recognition framework (R) makes predictions over sliding windows
▫ Noisy windows cover the intersection of a gesture and the rest state
• Train an MLP (M) to classify "motion" vs. "no motion" for each frame
▫ Input: pose descriptors
▫ 98% accuracy
• Post-refinement
▫ For each gesture predicted by R, assign its boundary frames to the closest switching frames predicted by M
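The post-refinement rule can be sketched as snapping each predicted boundary to the nearest switching frame (representing M's output as a list of switching-frame indices is an assumption):

```python
def refine_boundaries(start: int, end: int, switches: list) -> tuple:
    """Post-refinement sketch: snap a predicted gesture's start/end frames
    to the nearest motion/no-motion switching frames found by the binary
    MLP. `switches` is an assumed list of switching-frame indices."""
    snap = lambda t: min(switches, key=lambda s: abs(s - t))
    return snap(start), snap(end)
```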
Experiments
• Dataset and evaluation
• Multi-modal prediction results
• Comparison of training techniques
▫ Pre-training, Dropout (applied to inputs), initialization, ModDrop
▫ Datasets: MNIST, ChaLearn 2014
Dataset and Evaluation
• ChaLearn 2014 challenge dataset
▫ ~14K labeled gesture clips
▫ 20 gesture categories
• Dataset augmented with audio (vocal phrases)
• Evaluation metric
▫ Jaccard index for sequence s and gesture n, averaged over all s and n
▫ For audio, also use clip-based accuracy: a clip counts as correct if 20% of it is predicted correctly
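The per-sequence metric can be sketched as intersection-over-union on binary frame masks (in the spirit of the ChaLearn metric; the official evaluation script may differ in edge-case handling):

```python
import numpy as np

def jaccard_index(gt: np.ndarray, pred: np.ndarray) -> float:
    """Frame-wise Jaccard index for one sequence/gesture pair: the
    intersection over union of binary ground-truth and predicted frame
    masks. Returning 0 for an empty union is an assumption."""
    inter = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    return inter / union if union else 0.0
```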
Multi-Modal Prediction Results
• Multi-modal multi-scale results (Jaccard index)
1. Except for audio, a larger sampling step yields better results
2. Although the audio modality alone performs the worst (due to alignment issues), it still boosts performance when combined with the pose and video modalities
Multi-Modal Prediction Results
• Challenge results (without audio, Jaccard index)
1. Gesture localization corrects predictions
2. Performance is further boosted when combined with the baseline [49]
[49] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, "Multi-scale deep learning for gesture detection and localization," in ECCVW, 2014.
Training Techniques Comparison
• MNIST (10K test images, 10 classes)
▫ Multi-modal setting
Training Techniques Comparison
• MNIST (10K test images, 10 classes)
1. Training from scratch increases error
2. Dropout [55] and pre-training are useful
3. ModDrop gives no lift here
4. The model is lightweight
[55] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv:1207.0580, 2012.
Training Techniques Comparison
• MNIST (10K test images, 10 classes)
▫ Effect of ModDrop under occlusion and noise
• Pre-training and initialization are employed
• In both cases, ModDrop makes the model more robust
Training Techniques Comparison
• ChaLearn 2014 + audio
• All four training techniques are helpful
Training Techniques Comparison
• ChaLearn + audio: effect of ModDrop
• Pre-training and initialization are employed
• When the test input is corrupted, ModDrop is robust, especially when combined with Dropout
Summary
• Multi-modal multi-scale deep framework
▫ Initialization for multi-modal fusion
▫ ModDrop
• Showed efficacy on ChaLearn and MNIST
References
[1] H. Wang, A. Kläser, C. Schmid, and C. L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," Int. J. Comput. Vis., vol. 103, pp. 60-79, 2013.
[2] J. Sung, C. Ponce, B. Selman, and A. Saxena, "Unstructured human activity detection from RGBD images," in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 842-849.
[3] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221-231, Jan. 2013.
[4] S. E. Kahou et al., "Combining modality specific deep neural networks for emotion recognition in video," in Proc. 15th ACM Int. Conf. Multimodal Interaction, 2013, pp. 543-550.
[5] F. Bach, G. Lanckriet, and M. Jordan, "Multiple kernel learning, conic duality, and the SMO algorithm," in Proc. 21st Int. Conf. Mach. Learning, 2004, p. 6.
[6] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proc. 28th Int. Conf. Mach. Learning, 2011, pp. 689-696.
[7] N. Neverova, "Deep learning for human motion analysis," doctoral dissertation, 2016.
[8] P. Baldi and P. Sadowski, "The dropout learning algorithm," Artificial Intelligence, vol. 210, pp. 78-122, 2014.
[9] S. Wang and C. Manning, "Fast dropout training," in Proc. 30th Int. Conf. Mach. Learning, 2013, pp. 118-126.
[10] E. L. Lehmann, Elements of Large-Sample Theory. Springer Science & Business Media, p. 631, 1999.
Thanksforlistening!