CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/CAP6412.html Boqing Gong April 21st, 2016



CAP 6412 Advanced Computer Vision

http://www.cs.ucf.edu/~bgong/CAP6412.html

Boqing Gong
April 21st, 2016

Today

• Administrivia
• Free parameters in an approach, model, or algorithm?
• Egocentric videos, by Aisha

• Project II due: next Wednesday (04/27, 5 PM)

• Final project presentation: 04/28, 1:00 to 3:50 PM

• Late submissions: https://docs.google.com/spreadsheets/d/1uNPfUsdnw5xfzIV-PrQTo9xTWKfv7s-OPyuV_zZw9Fc/edit?usp=sharing


Free parameters (hyper-parameters)

• In Project 2, when you train the CNNs
– Learning rate, momentum, weight decay, dropout rate, early stopping, etc.
– Network architecture, nonlinear functions, strides, etc.

• In linear regression

$$\min_{w} \sum_{m=1}^{M} \left(y_m - x_m^{T} w\right)^2 + \lambda \|w\|_2^2$$

• In SVM

$$\min_{w,\ \xi_m,\ m=1,\dots,M} \ \sum_{m=1}^{M} \xi_m + \lambda \|w\|_2^2 \quad \text{s.t.} \quad y_m \left(x_m^{T} w\right) \ge 1 - \xi_m, \ \ \xi_m \ge 0 \ \ \forall m$$

Free parameters (hyper-parameters)

• In K-means clustering: K, the number of clusters
• In K-nearest neighbors classifier: K, the number of neighbors
• In Canny edge detection
– Gaussian filter, thresholds

• In R-CNN
– Threshold of selective search
– # layers, filter size, stride, where to apply max pooling
– Padding or not, learning rate, momentum, weight decay, # iterations
– Trade-off parameter
– Feature selection for regression
– Batch size

Free parameters (hyper-parameters)

• Free parameters vs. model parameters (below, w is the model parameter and λ is the free parameter)

$$\min_{w} \sum_{m=1}^{M} \left(y_m - x_m^{T} w\right)^2 + \lambda \|w\|_2^2$$

• Often seek model parameters by optimization
– Gradient descent (GD), coordinate descent, Newton, stochastic GD, etc.

• How to choose the free parameters?
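As an illustration of seeking model parameters by optimization while a free parameter stays fixed, the sketch below minimizes a regularized least-squares objective over w with plain gradient descent for a fixed regularization weight. It assumes a squared residual in the objective; the function names, step size, iteration count, and synthetic data are hypothetical choices made for this example, not part of the lecture.

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """Sum of squared residuals plus lam * ||w||_2^2."""
    residual = y - X @ w
    return float(residual @ residual + lam * (w @ w))

def fit_ridge_gd(X, y, lam, step=1e-3, n_iters=5000):
    """Minimize the objective over w (the model parameters) by gradient descent.

    lam is a free parameter: it is held fixed during the optimization.
    step and n_iters are hypothetical values chosen for this small example.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -2.0 * X.T @ (y - X @ w) + 2.0 * lam * w
        w -= step * grad
    return w

def fit_ridge_closed_form(X, y, lam):
    """Closed-form minimizer, used here only to check the GD result."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w_gd = fit_ridge_gd(X, y, lam=1.0)
w_cf = fit_ridge_closed_form(X, y, lam=1.0)
```

Changing `lam` changes which w is optimal, which is why the free parameter has to be chosen by some procedure outside this inner optimization.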

How to choose the free parameters

• Smallest error rate on
– Test set?
– Validation set?

• Smallest expected error rate on the entire population
– In practice, however, we have access to a finite set of examples!
– Approximate the expected error rate
– Choose free parameters which minimize the approximate error

• How to approximate the expected error?

Weak approximation of the expected error!

Rarely used in practice.

Popular for small data.


Popular for big data.

1. Divide data into training, validation, and test sets.
2. Select free parameters
   – E.g., network layers, # hidden states, nonlinear functions, etc.
3. Train the model using the training set
4. Evaluate the model using the validation set
5. Repeat steps 2 to 4 using different free parameters → different models
6. Select the best model (and its associated free parameters)
7. Train the model (with the associated free parameters) using both training and validation sets.
8. Assess this final model using the test set.

Skip step 7 for big data.
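The eight steps above can be sketched in a few lines. This is a minimal illustration, assuming a ridge-regression model whose only free parameter is the regularization weight; the 60/20/20 split, the candidate values, the helper names, and the synthetic data are all hypothetical and not taken from the lecture.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Train: closed-form ridge regression for a fixed free parameter lam."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(w, X, y):
    """Evaluate: mean squared error of model w on (X, y)."""
    return float(np.mean((y - X @ w) ** 2))

def select_and_assess(X, y, candidates, use_step7=True, seed=0):
    # Step 1: divide data into training, validation, and test sets.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train, n_val = int(0.6 * len(y)), int(0.2 * len(y))
    tr = idx[:n_train]
    va = idx[n_train:n_train + n_val]
    te = idx[n_train + n_val:]

    # Steps 2 to 5: for each candidate, train on training, evaluate on validation.
    val_errors = {lam: mse(fit_ridge(X[tr], y[tr], lam), X[va], y[va])
                  for lam in candidates}

    # Step 6: select the candidate with the smallest validation error.
    best = min(val_errors, key=val_errors.get)

    # Step 7 (skipped for big data): retrain on training + validation.
    fit_idx = np.concatenate([tr, va]) if use_step7 else tr
    w_final = fit_ridge(X[fit_idx], y[fit_idx], best)

    # Step 8: assess the final model once on the test set.
    return best, val_errors, mse(w_final, X[te], y[te])

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=200)

best_lam, val_errors, test_error = select_and_assess(
    X, y, candidates=[0.01, 0.1, 1.0, 10.0, 100.0])
```

The test set is touched exactly once, after the free parameter has been fixed, so the reported test error is not used for selection.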



Hand detection in Egocentric videos

Aisha Urooj

Course Instructor: Dr. Boqing Gong Advanced Computer Vision


Motivation

• Emergence of new wearable technologies

– Action cameras

– Smart glasses, and so on

• These devices capture videos from a first-person perspective.

• They record the user's experiences

Image Source: [1]


An overview of First Person Vision


Image Credits: [1]


A hierarchical structure, starting from the raw video sequence (bottom) to the desired objectives (top)

Image Credits: [1]


Image Credits: [1]

Image Credits: [1]


Image Credits: [1]


Related Datasets [1]

Motivation

• Hands are very common in egocentric videos

• Hand appearance and pose give important cues for:
– Actions

– Attention

– Activity recognition

– User-machine interaction, and so on.

• Most egocentric computer vision problems, from object detection to activity recognition, require accurate hand detection.


Challenges in hand detection

• Hands are highly deformable objects.

• Occlusion

• Cluttered background

• Dynamic background

• Inconsistent lighting

• Poor imaging conditions

• Highly dynamic camera motion

• And so on.

Lending a Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions

Sven Bambach, Stefan Lee, David J. Crandall, Chen Yu

Indiana University


Outline

• Paper’s contribution

• Dataset details

• Approach

• Results

• Possible future directions

Paper’s Contributions

• A deep model for hand detection and classification in egocentric video, including fast domain-specific region proposals.

• A new technique for pixel-wise hand segmentation.

• A quantitative analysis of how hand location and pose can be useful for accurate activity recognition.

• A large dataset of egocentric interactions with fine-grained ground truth.


Overview

Image source: http://vision.soic.indiana.edu/projects/lending-a-hand/

Ground-truth hand segmentation masks on sample frames from the dataset.

A random subset of hands cropped according to the ground truth.

Dataset details

• 4 participants, 4 activities, 3 different locations (office, home, courtyard)

• 48 unique videos in total.

• Recorded with Google Glass, 720x1280 at 30 fps.

• 2 persons in each recording, each wearing Google Glass (video pairs were synchronized and cut to 90 seconds).

• Pixel-level ground truth for over 15,000 hand instances.

• Manual annotation of 100 frames per video, i.e., 4,800 ground-truth frames.

• Main split: 36 training, 4 validation, 8 test videos.


Hand Detection: Approach

• Candidate windows generation

• Window classification using CNNs

Window Proposals Generation

• Model the probability that an object O appears in a region R of an image I.

• The proposed approach for candidate window generation combines spatial biases and appearance models.

Window Proposals Generation (contd.)

• P(O): Object occurrence probability

• P(R | O): Probability that a certain region R (a bounding box) contains a specific hand (O)

• P(I | R, O): A pixel-level skin classifier
– Estimates the probability that the central pixel of R is skin.
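One way to read the three terms above is as the factors of a Bayes-rule decomposition of the probability that object O appears in region R of image I. The slide text does not show the combined formula, so the form below is an assumption that is consistent with the listed terms:

```latex
P(O \mid R, I) \;\propto\; P(I \mid R, O)\, P(R \mid O)\, P(O)
```

Under this reading, the spatial bias enters through P(R | O) and the appearance model through P(I | R, O), with the normalizing constant not depending on O.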


Coverage Results for Different Proposal Methods

Window classification

• A standard CNN classification framework is used.

• CaffeNet from the Caffe software package
– A slight variation of AlexNet

• Each training batch contains an equal number of samples from each class.

• Horizontal and vertical flipping of sample images is disabled in Caffe
– Needed to differentiate between left and right hands.
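The class-balanced batches mentioned above can be illustrated with a small index sampler. This is only a sketch of the idea, not the Caffe data layer used in the paper; the function name, the integer class labels, and the choice to sample with replacement for very small classes are assumptions made for this example.

```python
import numpy as np

def balanced_batch(labels, per_class, rng):
    """Return shuffled sample indices with exactly per_class items from each class."""
    labels = np.asarray(labels)
    picks = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        # Assumption: fall back to sampling with replacement when a class
        # has fewer than per_class examples.
        replace = len(members) < per_class
        picks.append(rng.choice(members, size=per_class, replace=replace))
    batch = np.concatenate(picks)
    rng.shuffle(batch)
    return batch

# Hypothetical, heavily imbalanced labels: class 0 dominates.
labels = np.array([0] * 90 + [1] * 6 + [2] * 4)
rng = np.random.default_rng(0)
batch = balanced_batch(labels, per_class=8, rng=rng)
counts = np.bincount(labels[batch], minlength=3)
```

Balancing this way keeps a dominant class, such as background windows, from swamping the gradient in every batch.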

Window classification (contd.)

• The CNN weights are initialized from CaffeNet

– Except the final fully connected layer, which is initialized from a zero-mean Gaussian.

• Fine-tuning using SGD

– Learning rate = 0.001

– Momentum = 0.999

Input → Generate spatially sampled window proposals → Classify window crops using the fine-tuned CNN → Perform non-maximum suppression for each test frame
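The last stage of the pipeline, non-maximum suppression, can be sketched as a greedy procedure: keep the highest-scoring window, drop windows that overlap it too much, and repeat. The slides do not give the overlap threshold or the box format, so the `(x1, y1, x2, y2)` convention, the default `iou_thresh`, and the function names below are assumptions.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        # Keep only the windows that do not overlap the winner too much.
        order = rest[overlaps <= iou_thresh]
    return keep

# Hypothetical window proposals and classifier scores for one frame.
boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy frame the second window overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third window survives.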

Hand Detection

• Two cases:
– Detect hands of any type

– Detect hands of a specific type (own left, other's right, etc.)

• The PASCAL VOC criterion for scoring detections is used
– The Intersection over Union between the ground-truth and detected bounding boxes should be > 0.5
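The scoring criterion above can be written down directly. The sketch below computes Intersection over Union for two boxes and applies the > 0.5 test; the `(x1, y1, x2, y2)` box format, the function names, and the optional hand-type check for the "specific type" case are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct_detection(det_box, gt_box, det_type=None, gt_type=None, thresh=0.5):
    """A detection counts as correct if IoU with the ground truth exceeds thresh.

    If both hand types are given (the "specific type" case), they must also match.
    """
    if det_type is not None and gt_type is not None and det_type != gt_type:
        return False
    return iou(det_box, gt_box) > thresh

# Hypothetical boxes: a near match and a half-shifted box.
near = is_correct_detection((0, 0, 10, 10), (1, 1, 11, 11))
shifted = is_correct_detection((0, 0, 10, 10), (5, 0, 15, 10))
```

A box shifted by half its width has an IoU of only 1/3 with the ground truth, which shows how strict the 0.5 threshold is for small objects such as hands.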


Precision-Recall curves for Hand detection


Qualitative Results for Hand Detection


Quantitative Results for Hand Detection

Hand Segmentation

• Pixel-wise hand segmentation is useful for:
– Hand pose recognition
– In-hand object detection, and so on.

• Goal: Label each pixel as either background or a specific hand class.

• Applied GrabCut, a semi-supervised segmentation algorithm.

• Given an approximate foreground mask, GrabCut iteratively refines the foreground and background pixels, relabeling them using a Markov Random Field.

Hand Segmentation

• For each detected hand bounding box, an initial foreground estimate is computed using the same color-based skin model.

• The skin probabilities are thresholded: every pixel within the box is marked as foreground unless its skin probability is very low.

• The GrabCut algorithm is run on the bounding box, including a padded region around it.

• The final segmentation is the union of the output masks for all detected bounding boxes.
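The mask bookkeeping around GrabCut in the steps above can be sketched without running GrabCut itself. The example below builds the initial per-box mask from skin probabilities and takes the union of per-box output masks. The padding width, the "very low" threshold, and the function names are hypothetical; the label values mirror OpenCV's GrabCut mask convention (0 for definite background, 2 for probable background, 3 for probable foreground), which is an assumption about how such a mask would be passed to a GrabCut implementation.

```python
import numpy as np

# Label values following OpenCV's GrabCut mask convention.
BGD, PR_BGD, PR_FGD = 0, 2, 3

def initial_mask(skin_prob, box, pad=5, low=0.1):
    """Initial GrabCut-style mask for one detected box (x1, y1, x2, y2).

    Pixels outside the padded box are definite background. Pixels in the
    padded region are probable background. Pixels inside the box are
    probable foreground unless their skin probability is below `low`.
    """
    h, w = skin_prob.shape
    x1, y1, x2, y2 = box
    mask = np.full((h, w), BGD, dtype=np.uint8)
    px1, py1 = max(x1 - pad, 0), max(y1 - pad, 0)
    px2, py2 = min(x2 + pad, w), min(y2 + pad, h)
    mask[py1:py2, px1:px2] = PR_BGD
    inside = mask[y1:y2, x1:x2]  # a view, so writes go into mask
    inside[skin_prob[y1:y2, x1:x2] >= low] = PR_FGD
    return mask

def union_of_masks(masks):
    """Final segmentation: pixel-wise union of the per-box foreground masks."""
    return np.logical_or.reduce([np.asarray(m, dtype=bool) for m in masks])

# Hypothetical 20x20 skin-probability map with one small skin patch.
skin_prob = np.zeros((20, 20))
skin_prob[9:11, 9:11] = 0.9
mask = initial_mask(skin_prob, box=(8, 8, 12, 12), pad=2)

# Two hypothetical per-box output masks combined into one segmentation.
m1 = np.zeros((4, 4), dtype=bool); m1[0, 0] = True
m2 = np.zeros((4, 4), dtype=bool); m2[3, 3] = True
final = union_of_masks([m1, m2])
```

In a full pipeline the initial mask would be refined by a GrabCut implementation before the union step; that refinement is deliberately left out here.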


Quantitative Results for Hand Segmentation

Two modes of possible failures

• Failure to properly detect hand bounding boxes.

• Inaccuracy in distinguishing hand pixels from background.

• Applying the segmentation algorithm to ground-truth bounding boxes raises the average to 0.73

• Taking the output of the hand detector but using ground-truth segmentation masks also increases the average, to 0.76


Qualitative Results for Hand Segmentation

Hand-based Activity Recognition

• All non-hand background information was masked out using ground-truth hand segmentations.

• A CNN was fine-tuned to classify whole frames as one of the four activities.

– Training: 900 frames per activity from 36 videos

– Validation: 100 frames per activity from four videos

– Classification accuracy: 66.4% per frame

Hand-based Activity Recognition (contd.)

Incorporating temporal constraints:

• Simple voting based approach

• Classify each individual frame in the context of a fixed-size temporal window centered on the frame

• Scores are summed across the window

• Frame is labeled as the highest scoring class
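The voting scheme above can be sketched as follows: for each frame, sum the per-class scores over a window centered on that frame and pick the class with the highest total. The window size, the truncation of the window at the start and end of the video, and the function name are assumptions; the slides do not specify them.

```python
import numpy as np

def smooth_labels(frame_scores, window=5):
    """Label each frame by summing class scores over a centered temporal window.

    frame_scores has shape (n_frames, n_classes). Windows are truncated at
    the video boundaries (an assumption made for this sketch).
    """
    scores = np.asarray(frame_scores, dtype=float)
    n_frames = scores.shape[0]
    half = window // 2
    labels = np.empty(n_frames, dtype=int)
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        labels[t] = int(np.argmax(scores[lo:hi].sum(axis=0)))
    return labels

# Hypothetical per-frame scores for two activities; frame 2 is a lone outlier.
scores = np.array([[0.9, 0.1],
                   [0.8, 0.2],
                   [0.4, 0.6],
                   [0.9, 0.1],
                   [0.8, 0.2]])
per_frame = scores.argmax(axis=1)
smoothed = smooth_labels(scores, window=3)
```

Here the isolated misclassified frame is outvoted by its neighbors, which is the effect the temporal constraint is meant to have.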


Hand-based Activity Recognition


Some sample hand poses not present in their dataset

Related work on Egocentric Hands Detection

Work by A. Betancourt, University of Genoa, Italy:
1) Hand Segmentation and Tracking in FPV
2) A Sequential Classifier for Hand Detection in the Framework of Egocentric Vision. CVPR 2014
3) The Evolution of First Person Vision Methods: A Survey.

Observations:
– Misses detection of hands in many frames for other people.
– Results show false positives in many frames.
– No detection on hands shown in videos running within a video.
– Segmentation is not efficient.
– At times both hands are detected as either left or right.
– Full arm is being considered as hand.


Possible Future Directions

• Improve segmentation technique

• Have an unbiased dataset

• Use an efficient tracking approach to incorporate temporal information

• Improve hand classifier

References

• [1] A. Betancourt, P. Morerio, C. S. Regazzoni, and M. Rauterberg. The Evolution of First Person Vision Methods: A Survey. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 25, Issue 5.


THANK YOU!