ModDrop: Adaptive Multi-Modal Gesture Recognition
Presented by Chongyang Bai, May 14, 2020, Dartmouth
Natalia Neverova, Christian Wolf, Graham Taylor, and Florian Nebout. PAMI 2016.


  • ModDrop: Adaptive Multi-Modal Gesture Recognition

    Presented by Chongyang Bai, May 14, 2020, Dartmouth

    Natalia Neverova, Christian Wolf, Graham Taylor, and Florian Nebout

    PAMI 2016

  • Gesture Recognition from RGBD Video
    • Multi-modal inputs

  • Gesture Recognition from RGBD Video
    • Multi-modal inputs
    • Localization
      ▫ Start frame
      ▫ End frame
    • Recognition
      ▫ Which gesture?

  • Challenges
    • Gestures at various spatial and temporal scales
    • Noisy / missing signals (e.g., depth)
    • Limited training data vs. flexible gestures
    • Real-time requirement

  • Contributions
    • Multi-modal and multi-scale deep learning framework

    • Challenges addressed
      ▫ Gestures at various spatial and temporal scales
      ▫ Flexible gestures
      ▫ Real-time

  • Contributions
    • Multi-modal and multi-scale deep learning framework
      ▫ ModDrop: multimodal dropout training technique

    • Challenges addressed
      ▫ Noisy / missing signals (e.g., depth)
      ▫ Limited training data

  • Contributions
    • Multi-modal and multi-scale deep learning framework
      ▫ ModDrop: multimodal dropout training technique
      ▫ Model initialization for multi-modal fusion
    • Challenge addressed
      ▫ Limited training data

  • Results Summary
    • Achieves 0.87 Jaccard index (rank 1) in the ChaLearn 2014 challenge
      ▫ Improves to 0.88 when adding the audio modality (ChaLearn + A)
    • A localization refinement technique further improves the accuracy
    • ModDrop is robust to noisy or missing samples at test time on
      ▫ MNIST
      ▫ ChaLearn + A
    • The initialization for multi-modal fusion is effective

  • Related Work
    • Gesture recognition
      ▫ Classification with motion trajectories [1]
      ▫ HoG features from RGB and depth images [2]
      ▫ 3D CNN to learn spatio-temporal representations [3]
    • Multi-modal fusion
      ▫ Early fusion and late fusion [4]
      ▫ Multiple Kernel Learning (MKL) [5]
      ▫ Deep neural nets [6]

  • Methodology Outline
    • Overall architecture
    • Multi-modal framework
      ▫ Initialization for multi-modal fusion
      ▫ ModDrop method and its regularization properties

    • Inter-scale late fusion
    • Gesture localization as refinement

  • Overall Architecture
    • Multi-scale sampling
    • Single-scale multi-modal fusion
    • Inter-scale late fusion

  • Single-scale Multi-modal Fusion

    • Four paths
    • Single-path pre-training
    • Initialization for multi-modal fusion
    • ModDrop

    [Figure: single-scale fusion architecture, with the fusion layers annotated "initialization for fusion" and "ModDrop"]

  • Single-scale Multi-modal Fusion

    • Paths V1/V2 for hands
      ▫ Input
        - Depth volume (W x H x 5)
        - Gray-scale volume (W x H x 5)
      ▫ Architecture
        - Conv3D, max pooling over time, Conv2D
        - Flattened and concatenated into hidden layer HLV1
      ▫ V2 takes the horizontally flipped input and shares parameters with V1
      ▫ The active hand is detected for training from the trajectory of the hand joint

  • Single-scale Multi-modal Fusion

    • Input normalization (a sketch follows)
      ▫ Normalize hand bounding boxes in each frame according to the hand's distance to the sensor [7]
      ▫ H_x: bounding box size (pixels), h_x: actual hand size (mm), z: distance to the sensor
      ▫ X: image width
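    The normalization formula itself is lost in this extraction. Under a standard pinhole-camera assumption the pixel extent of the hand scales as h_x / z, which is enough for a minimal sketch; `hand_mm`, `fx_px`, and `out_size` below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def hand_crop_size(z_mm, hand_mm=120.0, fx_px=575.0):
    """Depth-normalized crop size (pixels) under a pinhole camera model.

    z_mm    : distance of the hand joint to the sensor (mm)
    hand_mm : assumed physical hand extent (mm) -- illustrative value
    fx_px   : assumed focal length of the sensor (px) -- illustrative value
    """
    # Pinhole projection: pixel extent scales inversely with depth.
    return fx_px * hand_mm / z_mm

def crop_hand(frame, joint_xy, z_mm, out_size=72):
    """Crop a square patch around the hand joint and resize to a fixed size."""
    half = int(round(hand_crop_size(z_mm) / 2))
    x, y = int(joint_xy[0]), int(joint_xy[1])
    h, w = frame.shape[:2]
    patch = frame[max(0, y - half):min(h, y + half),
                  max(0, x - half):min(w, x + half)]
    # Nearest-neighbour resize keeps the sketch dependency-free.
    ys = np.linspace(0, patch.shape[0] - 1, out_size).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, out_size).astype(int)
    return patch[np.ix_(ys, xs)]
```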

  • Single-scale Multi-modal Fusion
    • Path M (articulated pose), a descriptor sketch follows below
      ▫ 3-layer MLP
      ▫ Per-frame input feature
        - Normalized joint positions
        - Joint velocities and accelerations
        - Inclination angles
        - Azimuth angles
        - Bending angles
        - Pairwise distances
      ▫ A rich representation of differences between individual articulations
      ▫ Concatenated over 5 frames

    [Figure: body pose descriptor; image source: [7]]
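    A minimal NumPy sketch of such a frame-wise pose descriptor, assuming 3D joint positions are available; it covers positions, velocities, accelerations, and pairwise distances, and omits the angle features for brevity. Function names are mine, not the authors'.

```python
import numpy as np

def pose_descriptor(joints, t):
    """Simple frame-wise pose feature at frame t (assumes 1 <= t < T - 1).

    joints: array of shape (T, J, 3) with 3D joint positions per frame.
    """
    pos = joints[t]                                      # (J, 3) positions
    vel = joints[t] - joints[t - 1]                      # finite-difference velocity
    acc = joints[t + 1] - 2 * joints[t] + joints[t - 1]  # acceleration
    diffs = pos[:, None, :] - pos[None, :, :]            # (J, J, 3)
    dists = np.linalg.norm(diffs, axis=-1)               # pairwise joint distances
    iu = np.triu_indices(pos.shape[0], k=1)              # upper triangle only
    return np.concatenate([pos.ravel(), vel.ravel(), acc.ravel(), dists[iu]])

def window_descriptor(joints, t, n_frames=5):
    """Concatenate per-frame descriptors over a temporal window of n_frames."""
    half = n_frames // 2
    return np.concatenate([pose_descriptor(joints, t + dt)
                           for dt in range(-half, half + 1)])
```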

  • Single-scale Multi-modal Fusion
    • Path M (articulated pose)
      ▫ Input feature [7]
        - Normalized joint positions
        - Azimuth angles
        - Bending angles
        - Pairwise distances

    [Figure: pose feature illustration; image source: [7]]

  • Single-scale Multi-modal Fusion
    • Path A (audio), a sketch of computing such an input follows
      ▫ Input: mel-frequency histograms (time, frequency, amplitude)
      ▫ Fed to a Conv2D layer + 2 hidden layers
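    For context, a minimal sketch of computing a log mel-frequency input with librosa; the sampling rate, frame length, hop size, and number of mel bands are illustrative assumptions, not the paper's settings.

```python
import librosa
import numpy as np

def mel_input(wav_path, sr=16000, n_mels=40, n_fft=400, hop_length=160):
    """Log mel spectrogram: one row per mel band, one column per time frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)              # amplitude on a dB scale
    # Global normalization before feeding the Conv2D layer.
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
```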

  • Single-scale Multi-modal Fusion
    • Single-path pre-training
      ▫ The whole network has too many parameters

    • Early fusion of heterogeneous data sources is not effective

    • The paths are therefore fused in a late shared hidden layer (HLS)

    [Figure: during pre-training, each modality path ends in its own softmax classifier]

  • Initialization of Multi-modal Fusion
    • Shared layer 1, weights $W_1$
      ▫ $N$ classes
      ▫ $K$ modalities with feature dimensions $d_1, \dots, d_K$; $d = \sum_k d_k$ is the input dimension
      ▫ $NK$: output dimension

  • Initializationofmulti-modalfusion• Sharedlayer1:!"▫ Nclasses▫ Kmodalitieswithfeaturedimensions#",… , #&,∑ #*�* isinputdimension

    ▫ NK,outputdimension▫ ,-

    (/), feature-iformodalityk(bluecircles)▫ 12

    (3),unit-2 relatedtomodalitym(pinkcircle)▫ 4-,2

    (3,/),weightbetweenabovetwo

    ▫ Init:5 = 6,4-,2(/,/) arefrompre-training

    ▫ Increase5 tolearncross-modalitycorrelations

    [email protected]

    21

    m
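    A minimal NumPy sketch of this block-diagonal style of initialization; the modality dimensions and class count in the usage example are illustrative assumptions.

```python
import numpy as np

def init_fusion_layer1(pretrained_blocks, n_classes):
    """Build the weight matrix W1 of the first shared fusion layer.

    pretrained_blocks: list of K matrices; block k has shape (d_k, n_classes)
    and comes from the pre-trained path for modality k.  Cross-modality
    entries start at zero and are free to grow during joint fusion training.
    """
    K = len(pretrained_blocks)
    d_total = sum(b.shape[0] for b in pretrained_blocks)
    W1 = np.zeros((d_total, n_classes * K))       # w_{i,j}^{(m,k)} = 0 for m != k
    row = 0
    for k, block in enumerate(pretrained_blocks):
        d_k = block.shape[0]
        # Within-modality block w^{(k,k)} copied from pre-training.
        W1[row:row + d_k, k * n_classes:(k + 1) * n_classes] = block
        row += d_k
    return W1

# Illustrative shapes only: 4 modalities, 20 gesture classes.
rng = np.random.default_rng(0)
blocks = [rng.normal(0.0, 0.01, size=(d, 20)) for d in (84, 84, 183, 64)]
W1 = init_fusion_layer1(blocks, n_classes=20)     # shape (415, 80)
```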

  • Initialization of Multi-modal Fusion
    • Shared layer 2, weights $W_2$
      ▫ $NK$: input dimension

    [Figure: per-modality paths with their pre-training softmax outputs feeding the shared fusion layers]

  • Initialization of Multi-modal Fusion
    • Shared layer 2, weights $W_2$
      ▫ $NK$: input dimension
      ▫ $N$: output dimension (number of classes)
      ▫ Input $h_c^{(k)}$ is the pre-softmax score for class $c$ predicted via modality $k$ during pre-training
      ▫ $W_2^{(k)} \in \mathbb{R}^{N \times N}$ denotes the block of $W_2$ acting on modality $k$


  • ModDrop: Multimodal Dropout
    • Inspired by dropout
    • Avoids false / redundant co-adaptation between modalities

    • Question: how do we obtain robust predictions when some modalities are missing or noisy at test time?

    • $L$: cross-entropy loss
    • $F^{(k)}$: the modality-specific model for modality $k$
    • $W_h$: weights of layer $h$

    [Ideal loss equation shown on the slide]

  • ModDrop:Multimodaldropout• InspiredbyDropout• Avoidfalseco-adaptationbetweenmodalities• Obtainrobustpredictionwhensomemodalitiesaremissing/noisy

    • B :Cross-entropyloss• !D:weights

    [email protected]

    26

    IdealLoss

    Hugeamountofcomputationfor~2^Kterms!
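    The ideal-loss equation on the slide is an image that did not survive extraction; a hedged reconstruction from the definitions above sums one cross-entropy term per non-empty subset of modalities, which is where the ~$2^K$ terms come from.

```latex
% Ideal loss: be accurate under every combination of present/absent
% modalities -- one cross-entropy term per non-empty subset S of the
% K modality inputs, hence roughly 2^K terms.
\mathcal{L}_{\mathrm{ideal}}
  = \sum_{\substack{S \subseteq \{1,\dots,K\} \\ S \neq \emptyset}}
    L\big( F(\{ x^{(k)} : k \in S \}),\, y \big)
```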

  • ModDrop: Multimodal Dropout
    • Solution: randomly drop whole modality inputs while training each batch (a minimal training-time sketch follows)
    • Input sampled from modality $k$: $x^{(k)}$
    • Bernoulli selector as the random indicator variable: $\delta^{(k)} \sim \mathrm{Bernoulli}(p^{(k)})$
    • The multi-modal network is applied to the masked inputs $\{\delta^{(k)} x^{(k)}\}_{k=1}^{K}$
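    A minimal PyTorch-style sketch of this per-batch modality dropping; the keep probability, function names, and the concatenation-based fusion network are illustrative assumptions rather than the authors' exact implementation.

```python
import torch

def moddrop_mask(batch_size, num_modalities, p_keep=0.8, device="cpu"):
    """One Bernoulli keep/drop indicator per modality and per sample."""
    return torch.bernoulli(
        torch.full((batch_size, num_modalities), p_keep, device=device))

def moddrop_forward(fusion_net, inputs, p_keep=0.8, training=True):
    """Zero out whole modality inputs before fusion (ModDrop).

    inputs: list of K tensors, each of shape (batch, d_k).
    """
    if training:
        mask = moddrop_mask(inputs[0].shape[0], len(inputs),
                            p_keep, inputs[0].device)
        inputs = [x * mask[:, k:k + 1] for k, x in enumerate(inputs)]
    return fusion_net(torch.cat(inputs, dim=1))
```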

  • ModDrop: Multimodal Dropout
    • Regularization properties, analyzed on a one-layer network with sigmoid activation

  • Original vs. ModDrop

    ModDrop: multimodal dropout, regularization properties on a one-layer network + sigmoid activation

    [Figure: the original network output and the ModDrop output written side by side]

  • Original vs. ModDrop

    ModDrop: multimodal dropout, regularization properties on a one-layer network + sigmoid activation

    [Figure: the same equations with the Bernoulli drop indicators inserted for each modality $k$ and feature $i$]

  • Original vs. ModDrop

    ModDrop: multimodal dropout, regularization properties on a one-layer network + sigmoid activation

    Take the expectation of the equation, using $\mathbb{E}[\delta^{(k)}] = p^{(k)}$.
    Approximation 1: $\mathbb{E}[\sigma(x)] \approx \sigma(\mathbb{E}[x])$ [8].
    Approximation 2: a first-order Taylor expansion around the pre-activation $s$.

  • ModDrop: Multimodal Dropout, regularization properties on a one-layer network + sigmoid activation

    • Substitute the expectation back into the gradient:
    • The gradient for ModDrop is the drop probability times the original gradient minus a regularization term.

    [Equation on slide, with the regularization term highlighted]

  • ModDrop: Multimodal Dropout, regularization properties on a one-layer network + sigmoid activation

    • Substitution
    • The gradient for ModDrop is the drop probability times the original gradient minus a regularization term.

    • If [condition shown on slide], integrate the partial derivative and take the summation over all $k$ and $i$:

    [Equations on slide, with the regularization terms highlighted]

  • ModDrop: Multimodal Dropout, regularization properties on a one-layer network + sigmoid activation

    • Substitute the gradient of the original network into the gradient for ModDrop:
    • The gradient for ModDrop is the drop probability times the original gradient minus a regularization term.

    • If [condition shown on slide], integrate the partial derivative and take the summation over all $k$ and $i$:

    [Equations on slide, with the regularization terms highlighted]

    The ModDrop loss is $p$ times the complete-model loss minus a regularization term.
    The regularization term contains only cross-modality multiplications! (A schematic form of this result is sketched below.)
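    The equations themselves are slide images lost in extraction; schematically, the result stated above can be written as follows, where $c(p, \sigma'(s))$ stands in for the probability and sigmoid-derivative factors that cannot be recovered exactly. This is a hedged reconstruction, not the paper's exact expression.

```latex
% ModDrop loss for one sample on a one-layer sigmoid network:
% p times the complete-model loss minus a regularization term that
% couples only features from different modalities (k != m).
\mathcal{L}_{\mathrm{ModDrop}}
  \;\approx\; p\,\mathcal{L}_{\mathrm{full}}
  \;-\; c\big(p, \sigma'(s)\big)
        \sum_{k \neq m} \sum_{i,j}
        w_i^{(k)} x_i^{(k)}\; w_j^{(m)} x_j^{(m)}
```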

  • ModDrop: Multimodal Dropout, regularization properties on a one-layer network + sigmoid activation

    • For any features $i, j$ from modalities $k, m$:
      ▫ $x_i^{(k)}$ and $x_j^{(m)}$
      ▫ $\mathbb{E}[x_i^{(k)}] = \mathbb{E}[x_j^{(m)}] = 0$; we can always enforce this by input normalization
      ▫ $\mathbb{E}[x_i^{(k)} x_j^{(m)}] = \mathbb{E}[x_i^{(k)}]\,\mathbb{E}[x_j^{(m)}] + \mathrm{Cov}(x_i^{(k)}, x_j^{(m)}) = \mathrm{Cov}(x_i^{(k)}, x_j^{(m)})$

    (Loss for one sample.)

  • ModDrop: Multimodal Dropout, regularization properties on a one-layer network + sigmoid activation

    • For any features $i, j$ from modalities $k, m$:
      ▫ $x_i^{(k)}$ and $x_j^{(m)}$
      ▫ $\mathbb{E}[x_i^{(k)}] = \mathbb{E}[x_j^{(m)}] = 0$, due to input normalization
      ▫ $\mathbb{E}[x_i^{(k)} x_j^{(m)}] = \mathbb{E}[x_i^{(k)}]\,\mathbb{E}[x_j^{(m)}] + \mathrm{Cov}(x_i^{(k)}, x_j^{(m)}) = \mathrm{Cov}(x_i^{(k)}, x_j^{(m)})$

    • Case 1: $x_i^{(k)}$ and $x_j^{(m)}$ are positively correlated
      ▫ $\mathbb{E}[x_i^{(k)} x_j^{(m)}] > 0$
      ▫ The training process encourages $w_i^{(k)} w_j^{(m)}$ to be positive

    (Loss for one sample.)

  • ModDrop: Multimodal Dropout, regularization properties on a one-layer network + sigmoid activation

    • Case 1: $x_i^{(k)}$ and $x_j^{(m)}$ are positively correlated
      ▫ $\mathbb{E}[x_i^{(k)} x_j^{(m)}] > 0$
      ▫ The training process encourages $w_i^{(k)} w_j^{(m)}$ to be positive

    • Case 2: $x_i^{(k)}$ and $x_j^{(m)}$ are negatively correlated
      ▫ The training process encourages $w_i^{(k)} w_j^{(m)}$ to be negative

    (Loss for one sample; the remaining factors in the regularization term are positive.)

  • ModDrop: Multimodal Dropout, regularization properties on a one-layer network + sigmoid activation

    • For correlated modalities, the regularization term encourages the network to
      ▫ Discover similarities between the modalities
      ▫ Align the modalities by learning the weights

  • ModDrop: Multimodal Dropout, regularization properties on a one-layer network + sigmoid activation

    • Case 3: $x_i^{(k)}$ and $x_j^{(m)}$ are uncorrelated
      ▫ $\mathbb{E}[x_i^{(k)} x_j^{(m)}] = 0$
      ▫ Assumption: the weights obey a unimodal distribution with zero expectation [9]
      ▫ By Lyapunov's central limit theorem, the term tends to zero as the number of training samples tends to infinity [10]
      ▫ Additional constraint: L2 regularization on the weights

    (Loss for one sample.)

  • Methodology Outline
    • Overall architecture
    • Multi-modal framework
      ▫ Initialization for multi-modal fusion
      ▫ ModDrop method and its regularization properties

    • Inter-scale late fusion
    • Gesture localization as refinement

  • Inter-scale Late Fusion
    • For frame $t$ and class $k$, take a weighted sum of the per-scale predictions over the temporal scales $s = 2, 3, 4$
    • This yields the final frame-wise prediction (a minimal sketch follows)
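    A minimal sketch of this weighted late fusion; the uniform fallback weights are a placeholder assumption, since the actual per-scale weights are not recoverable from the slide.

```python
import numpy as np

def fuse_scales(preds_by_scale, weights=None):
    """Weighted late fusion of per-scale, per-frame class scores.

    preds_by_scale: dict {scale: array of shape (T, n_classes)},
    e.g. scales 2, 3, 4 as on the slide.
    """
    scales = sorted(preds_by_scale)
    if weights is None:                        # uniform weights as a placeholder
        weights = {s: 1.0 / len(scales) for s in scales}
    fused = sum(weights[s] * preds_by_scale[s] for s in scales)
    return fused.argmax(axis=1)                # final frame-wise class labels
```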

  • Gesture Localization as Refinement
    • The recognition framework (R) makes predictions over sliding windows
      ▫ Noisy windows: those covering the transition between a gesture and the rest state
    • An MLP (M) is trained to classify "motion" vs. "no motion" for each frame
      ▫ Input: pose descriptors
      ▫ 98% accuracy
    • Post-refinement (a sketch follows)
      ▫ For each gesture predicted by R, assign its boundary frames to the closest switching frames predicted by M
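    A minimal sketch of the boundary-snapping refinement described above; function and variable names are mine, not the authors'.

```python
import numpy as np

def refine_boundaries(gesture_spans, motion_flags):
    """Snap predicted gesture boundaries to the nearest motion/no-motion switch.

    gesture_spans: list of (start_frame, end_frame) pairs from the recognizer R.
    motion_flags : binary array over frames, 1 = "motion" predicted by the MLP M.
    """
    # Frames where the motion flag switches value.
    switches = np.flatnonzero(np.diff(motion_flags.astype(int))) + 1
    if switches.size == 0:
        return gesture_spans
    refined = []
    for start, end in gesture_spans:
        new_start = switches[np.argmin(np.abs(switches - start))]
        new_end = switches[np.argmin(np.abs(switches - end))]
        if new_end > new_start:
            refined.append((new_start, new_end))
    return refined
```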

  • Experiments
    • Dataset and evaluation
    • Multi-modal prediction results
    • Comparison of training techniques
      ▫ Pre-training, dropout (applied to the input), initialization, ModDrop
      ▫ Datasets: MNIST, ChaLearn 2014

  • Dataset and Evaluation
    • ChaLearn 2014 challenge dataset
      ▫ ~14K labeled gesture clips
      ▫ 20 gesture categories
    • The dataset is augmented with audio (vocal phrases)
    • Evaluation metric (formula sketched below)
      ▫ Jaccard index for sequence $s$ and gesture $n$, averaged over all $s$ and $n$
      ▫ For audio, clip-based accuracy is also used: a clip counts as correct if at least 20% of it is predicted correctly
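    The formula image is lost in the extraction; the Jaccard index used by the ChaLearn 2014 challenge is the standard frame-set overlap, averaged over sequences and gesture classes.

```latex
% A_{s,n}: ground-truth frames of gesture n in sequence s;
% B_{s,n}: predicted frames.  The final score averages J_{s,n}
% over all sequences s and gesture classes n.
J_{s,n} = \frac{\lvert A_{s,n} \cap B_{s,n} \rvert}{\lvert A_{s,n} \cup B_{s,n} \rvert}
```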

  • Multi-Modal Prediction Results
    • Multi-modal, multi-scale results (Jaccard index)

    1. Except for audio, a larger sampling step yields better results.
    2. Although the audio modality alone performs the worst (due to an alignment issue), it still boosts performance when combined with the pose and video modalities.

  • Multi-Modal Prediction Results
    • Challenge results (without audio, Jaccard index)

    1. Gesture localization corrects predictions.
    2. The performance can be further boosted by combining with the baseline [49].

    [49] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, "Multi-scale deep learning for gesture detection and localization," in ECCVW, 2014.

  • Training Techniques Comparison
    • MNIST (10K test images, 10 classes)
      ▫ Multi-modal setting

  • Training Techniques Comparison
    • MNIST (10K test images, 10 classes)

    1. Training from scratch increases the error.
    2. Dropout [55] and pre-training are useful.
    3. ModDrop gives no lift here.
    4. The model is lightweight.

    [55] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv:1207.0580, 2012.

  • Training Techniques Comparison
    • MNIST (10K test images, 10 classes)
      ▫ Effect of ModDrop under occlusion and noise

    Pre-training and initialization are employed.
    In both cases, ModDrop makes the model more robust.

  • Training Techniques Comparison
    • ChaLearn 2014 + audio

    All four training techniques are helpful.

  • Training Techniques Comparison
    • ChaLearn + audio: effect of ModDrop

    When the test inputs are corrupted, ModDrop remains robust, especially when combined with dropout.
    Pre-training and initialization are employed.

  • Summary
    • Multi-modal, multi-scale deep framework
      ▫ Initialization for multi-modal fusion
      ▫ ModDrop
    • Demonstrated efficacy on ChaLearn and MNIST

  • References
    • [1] H. Wang, A. Kläser, C. Schmid, and C. L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," Int. J. Comput. Vis., vol. 103, pp. 60-79, 2013.
    • [2] J. Sung, C. Ponce, B. Selman, and A. Saxena, "Unstructured human activity detection from RGBD images," in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 842-849.
    • [3] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221-231, Jan. 2013.
    • [4] S. E. Kahou et al., "Combining modality specific deep neural networks for emotion recognition in video," in Proc. 15th ACM Int. Conf. Multimodal Interaction, 2013, pp. 543-550.
    • [5] F. Bach, G. Lanckriet, and M. Jordan, "Multiple kernel learning, conic duality, and the SMO algorithm," in Proc. 21st Int. Conf. Mach. Learning, 2004, p. 6.
    • [6] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proc. 28th Int. Conf. Mach. Learning, 2011, pp. 689-696.
    • [7] N. Neverova, "Deep learning for human motion analysis," Doctoral dissertation, 2016.
    • [8] P. Baldi and P. Sadowski, "The dropout learning algorithm," Artificial Intelligence, vol. 210, pp. 78-122, 2014.
    • [9] S. Wang and C. Manning, "Fast dropout training," in Proc. 30th Int. Conf. Mach. Learning, 2013, pp. 118-126.
    • [10] E. L. Lehmann, Elements of Large-Sample Theory, Springer Science & Business Media, 1999.

  • Thanks for listening!