
Proxemics Recognition
Yi Yang1, Simon Baker, Anitha Kannan, Deva Ramanan1

1Department of Computer Science, UC Irvine

Hello, everyone. Thank you for attending my talk. Today's topic is Proxemics Recognition, which is a novel problem in the field of computer vision. This is joint work with Simon, Anitha, and Deva, who is my supervisor at UCI. First, what is proxemics?

Proxemics

Proxemics: the study of the spatial arrangement of people as they interact
- anthropologist Edward T. Hall, 1963

The term proxemics was first introduced by anthropologist Edward Hall in 1963 when he studied the spatial arrangement of people as they interact. In daily life, proxemics happens almost all the time, as long as there are multiple people in roughly the same space. For example, right now.

In his classic study, he found that we humans normally maintain about four different kinds of space. From our own point of view,

the farthest is the public space: imagine that we stand in the open and other people are too far away to have any possible direct communication with us.

In the social space, it is possible for us to talk to a person, but hard to touch them.

As people come closer and closer, they start to enter our personal space. Normally that means we can talk to them face to face and touch them with our hands.

Once they enter our intimate space, they can use all their body parts to interact with us: we can walk with them side by side, hug them, or even kiss them.

Brother and Sister Holding Hands
Friends Walking Side by Side
Husband Hugging and Holding Wife's Hand
Mom Holding Baby

During this summer internship, we focused on proxemics in the personal and intimate spaces, which are much more interesting to study in computer vision because many complicated interactions happen in these two spaces. On the slide you can see examples, from top left to bottom right: holding hands, walking side by side, holding a baby, and hugging while holding hands.

Touch Codes

Hand Touch Hand
Hand Touch Shoulder

Shoulder Touch Shoulder
Arm Touch Torso

According to anthropologist Edward Hall, such complex interactions can be explained using a small collection of basic elements called touch codes, such as hand touch hand, hand touch shoulder, shoulder touch shoulder, and arm touch torso. The set is small, but the number of combinations can be exponentially large. For example, the top-right proxemic can be described using shoulder touch shoulder, while the bottom-right one can be described using the combination of hand touch hand and hand touch shoulder.

Applications

Personal Photo Search: find a specific interesting photo
Analysis of TV shows and movies
Kinect
Web Search
Auto-Movie / Auto-Slideshow
Locate interesting scenes

Once we understand how people interact with each other in images, it enables many potential applications involving multi-person interactions. We can discuss them after the talk.

Proxemics Dataset

200 training images
150 testing images

Collected from Simon, Bing, Google, Gettyimages, Flickr

No video data
No Kinect 3D depth data

To study this problem, we built a dataset of 200 training images and 150 testing images collected from Simon's personal photos, Bing Images, Google Images, Gettyimages, and Flickr. One point we need to make: our system's input is pure image pixels. There is no video data and no Kinect depth data.

Proxemics Dataset
[Figure: dataset examples arranged by Number of People (x-axis) vs. Complexity (y-axis)]

Here is a simple illustration of our dataset. The x-axis is the number of people in an image. The y-axis is the proxemic complexity, judged subjectively by myself; you may have a different opinion. As you can see, our dataset contains a wide range of human interactions: from far away to close by, from two people to many people, and from simple touches to very complex ones. Recall that our goal is to recognize the touch codes and interactions between pairs of people without help from a Kinect. So how can we handle such a complicated problem?

A Naive Approach

Input Image

Well, suppose we had a Kinect that worked on still images, and suppose this image Kinect could report the full pose of the humans in this photo of two boys. Then we could directly use the predicted locations of the two boys' hands and easily report that they are holding hands, based on the distance between the hands. Hence a naive approach to recognizing human interactions in images would be: given an input image with people,

A Naive Approach

Input Image

Image Feature

we first extract discriminative image features, such as edge features,

A Naive Approach

Input Image

Pose Estimation i.e. Find skeletons

Image Feature

and apply current state-of-the-art human pose estimation methods to find the skeleton and joint locations of each person independently.

A Naive Approach

Input Image
Interaction Recognition

Hand touch Hand
Pose Estimation, i.e., Find Skeletons

Image Feature

Then, given the predicted joint locations, we build a classifier on top of them and output the final interaction recognition labels. This is a very straightforward pipeline, and you would hope that it would work; a toy sketch of it follows below.
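To make the pipeline concrete, here is a minimal sketch in Python. It assumes a hypothetical pose estimator has already returned per-person joint locations; the joint names, distance features, and fixed threshold are all illustrative (the actual system trains a classifier on such features rather than hand-picking a threshold):

```python
import numpy as np

# Toy poses: each person is a dict from joint name to an (x, y) pixel
# location, as a hypothetical per-person pose estimator might return.

def pairwise_joint_distances(pose_a, pose_b, joints):
    """Distance features between two people's joints (classifier input)."""
    return np.array([np.linalg.norm(np.subtract(pose_a[i], pose_b[j]))
                     for i in joints for j in joints])

def naive_hand_touch_hand(pose_a, pose_b, threshold=20.0):
    """Declare 'hand touch hand' if the closest hand pair is near enough."""
    hands = ("l_hand", "r_hand")
    dists = [np.linalg.norm(np.subtract(pose_a[h1], pose_b[h2]))
             for h1 in hands for h2 in hands]
    return min(dists) < threshold

# Example with made-up joint locations:
a = {"l_hand": (120, 310), "r_hand": (180, 305)}
b = {"l_hand": (190, 300), "r_hand": (260, 315)}
print(naive_hand_touch_hand(a, b))  # True: closest hands are ~11px apart
```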

Naive Approach Results

Hand touch Hand
Hand touch Shoulder

But when we looked at the experimental results, we were very disappointed. We show two of the precision-recall curves, for recognizing hand touch hand and hand touch shoulder, tested on our dataset. You can see that this naive approach works almost as badly as a random guess.

Human Pose Estimation

Not bad when there is no real interaction between people

- Y. Yang & D. Ramanan, CVPR 2011To find out the failure reasons, we go back looking at the pose estimation results. One of the potential reasons is that we dont have Kinect that working for images, but what we found is that when the two people have no real interactions, our current pose estimation algorithm is not that bad.18Interactions Hurt Pose EstimationOcclusion + Ambiguous Parts

However, when we look at the more interesting human interaction images, we find that independent human pose estimation fails to recover the correct poses because of heavy occlusion and ambiguous parts. To make it work, we would have to reason about whether a part is occluded by another person, or whether a predicted part actually belongs to another person. But explicitly reasoning about occlusion and ambiguous parts can be very expensive, so we chose another way to handle this problem directly.

Our Approach: Direct Proxemics Recognition

Input Image

Image Feature

Given an input image with its edge features extracted,

Our Approach: Direct Proxemics Recognition

Input Image
Interaction Recognition

Image Feature

Hand touch Hand

we skip the independent pose estimation step and directly decide whether there is a human interaction or not. So what is the underlying model?

Pictorial Structure Model

- P. Felzenszwalb et al., PAMI 2009

To explain our approach, we need to trace back to the well-known pictorial structure model used in computer vision for human detection. In the pictorial structure model, a human is represented as a set of parts, such as the head, torso, arms, and legs, with springs encoding the relative locations between parts. A scoring function is defined over the locations of these parts. We write l_i for the 2D location of part i, and we score a configuration of parts with a sum of two terms.

Pictorial Structure Model

The first term scores the local match for placing a part template at a particular image location. We write alpha_i for the part template and phi(I, l_i) for the image features extracted at that location. We use edge features, and score the template with a dot product.

Pictorial Structure Model

The second term is a deformation model that evaluates the relative locations of pairs of parts. We write psi(l_i, l_j) for the spatial features between two part locations, and beta_ij for the parameters of a spring that favors certain offsets over others.
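Written out, the score is the following (a reconstruction from the quantities just defined, in the usual pictorial-structure notation, with V the set of parts and E the set of spring edges):

```latex
S(I, l) \;=\; \sum_{i \in V} \alpha_i \cdot \phi(I, l_i)
       \;+\; \sum_{(i,j) \in E} \beta_{ij} \cdot \psi(l_i, l_j)
```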

Two-Head Monster Model

i.e. Hand-Touching-Hand

Our direct proxemics model can be viewed as an extension of the previous model, except that now there are two humans in the model. We give it a nickname: the two-head monster model. Using hand-touch-hand as an example, our model detects a two-head monster in which a hand-hand spring connection is enforced.

Two-Head Monster Model

i.e. Hand-Touching-Hand

Based on the two full-body skeletons, we develop a simplified version that crops out the uninformative parts and keeps only the two heads and two arms. That is our model for recognizing hand touch hand.
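In equation form, the two-head monster score is, roughly, two single-person pictorial-structure scores plus one extra cross-person spring. The notation below is our reconstruction for the hand-touch-hand case, not the paper's exact formulation:

```latex
S(I, l^{A}, l^{B}) \;=\; S(I, l^{A}) \;+\; S(I, l^{B})
      \;+\; \beta_{hh} \cdot \psi\big(l^{A}_{hand},\, l^{B}_{hand}\big)
```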

The Models

Hand touch Hand
Hand touch Shoulder
Shoulder touch Shoulder
Arm touch Torso

We build four models in our experimental study: hand touch hand, hand touch shoulder, shoulder touch shoulder, and arm touch torso. Each of them has one extra spring connection between the two people, capturing the relative locations in the interaction.

Match Model to Image

Given an image, inference in our model can be done very efficiently using dynamic programming, since the combined two-person model is still tree-structured. The model parameters are learned from supervised data with a structured SVM solver; a sketch of the inference follows below.
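For intuition, here is a minimal max-sum dynamic program over a tree of parts, assuming part locations have been discretized to L candidates and the unary template scores precomputed. All names are illustrative, and the brute-force O(L^2)-per-edge inner loop is exactly where a real implementation would substitute the distance transform:

```python
import numpy as np

def tree_dp(unary, children, spring, root=0):
    """Max-sum DP on a tree-structured part model.
    unary[i]:  length-L array of local template scores for part i.
    children:  dict mapping each part to its child parts in the tree.
    spring(c, lc, lp): deformation score for child c at location lc
                       given its parent at location lp.
    Returns the best total score and one optimal location per part."""
    L = len(unary[root])
    best, back = {}, {}

    def up(i):  # post-order pass: fold each subtree into its root
        score = np.asarray(unary[i], dtype=float).copy()
        for c in children.get(i, ()):
            up(c)
            # For each parent location, find the best child location.
            # O(L^2) per edge here; a real system uses a distance
            # transform to make this linear in L.
            table = np.array([[best[c][lc] + spring(c, lc, lp)
                               for lc in range(L)] for lp in range(L)])
            score += table.max(axis=1)
            back[c] = table.argmax(axis=1)
        best[i] = score

    up(root)
    loc, stack = {root: int(np.argmax(best[root]))}, [root]
    while stack:  # top-down pass: decode locations via backpointers
        p = stack.pop()
        for c in children.get(p, ()):
            loc[c] = int(back[c][loc[p]])
            stack.append(c)
    return float(best[root].max()), loc
```

The two-head monster model shares this inference: adding one cross-person spring still leaves the merged part graph a tree, so the same DP applies.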

Refinements + Extensions

Sub-categories: because of symmetry, 4 models for hand-hand, etc.
R. Hand - L. Hand, L. Hand - L. Hand, L. Hand - R. Hand, R. Hand - R. Hand

There is still some ambiguity in our model because of the left-right symmetry of arms and shoulders. Hence, as a refinement, we build four sub-models for hand-hand to capture the different visual appearances.

Refinements + Extensions

Sub-categories: because of symmetry, 4 models for hand-hand, etc.

Co-occurrence of proxemics: reduce redundancy, map multi-label -> multi-class

R. Hand - L. Hand, L. Hand - L. Hand, L. Hand - R. Hand, R. Hand - R. Hand

Another key point is that not all proxemics can happen simultaneously between two people. We take their co-occurrence into account as well, mapping the multi-label problem to a multi-class one; a sketch of the mapping follows below.
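One way to picture the multi-label to multi-class mapping: enumerate only the touch-code combinations that actually co-occur and give each one a class index. The combination list below is made up for illustration; the real set would come from the training data:

```python
# Map a multi-label touch-code problem to one multi-class problem by
# enumerating only label combinations that co-occur in practice.
# This combination list is illustrative, not the paper's actual set.
VALID_COMBOS = [
    frozenset(),                                 # no interaction
    frozenset({"hand-hand"}),
    frozenset({"hand-shoulder"}),
    frozenset({"shoulder-shoulder"}),
    frozenset({"arm-torso"}),
    frozenset({"hand-hand", "hand-shoulder"}),   # e.g. hugging + holding hands
]
COMBO_TO_CLASS = {c: k for k, c in enumerate(VALID_COMBOS)}

def to_class(labels):
    """Convert a set of touch-code labels into a single class index."""
    return COMBO_TO_CLASS[frozenset(labels)]

print(to_class({"hand-hand", "hand-shoulder"}))  # -> 5
```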

Naive Approach Results

Hand touch Hand
Hand touch Shoulder

We ran experiments and compared our model against the previous naive approach.

Direct Approach Results

Hand touch Hand
Hand touch Shoulder

As you can see from the precision-recall curves, shown as the blue lines, our model significantly outperforms the baseline for both hand touch hand and hand touch shoulder.
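For reference, the average precision that summarizes each curve can be computed from a ranked list of detections roughly as follows (a standard AP sketch, not the authors' evaluation code):

```python
import numpy as np

def average_precision(scores, labels):
    """Mean precision measured at the rank of each true positive.
    scores: detection confidences; labels: 1 if correct, else 0."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    if y.sum() == 0:
        return 0.0
    precision = np.cumsum(y) / np.arange(1, len(y) + 1)
    return float(precision[y == 1].mean())

# A perfect ranking yields AP = 1.0:
print(average_precision([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```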

Quantitative Results

[1] Y. Yang & D. Ramanan, CVPR 2011

For a quantitative comparison, you can see there is a 30% average precision boost from using our direct model, which is again very significant.

Improves Pose Estimation

Y & D CVPR 2011

Our Model

The direct proxemics model can in turn improve pose estimation, in addition to recognizing human interactions. Here we show three examples for hand-touching-hand. By recognizing that two people are touching hands, and using the constraint that their two hands must be very close, the arm locations are predicted much more accurately.

Improves Pose Estimation

Y & D CVPR 2011
Our Model

A similar phenomenon occurs for hand touching shoulder. The independent pose estimation algorithm fails to handle the occlusion and to reason about the ambiguous arm, while our model can even predict the locations of occluded parts by using the information that a hand should be near a shoulder.

Improves Pose Estimation

Y & D CVPR 2011
Our Model

For hand touch torso, similar improvements happen again. There are a number of photos of Simon holding a baby, but somehow they are not on the highest-score list.

Conclusion

Proxemics and touch codes for human interaction

In conclusion, we introduce novel concepts to the computer vision literature, proxemics and touch codes, for modeling human interactions.

Conclusion

Proxemics and touch codes for human interaction

Directly recognizing proxemics significantly outperforms

We develop a novel model that directly recognizes proxemics, and it significantly outperforms the naive approach that uses pose estimation as a preprocessing step.

Conclusion

Proxemics and touch codes for human interaction

Directly recognizing proxemics significantly outperforms

Recognizing proxemics helps pose estimation

And finally, we show that recognizing proxemics also helps pose estimation.

Acknowledgements

Thank Simon and MSR for the internship

That's all. I thank Simon and MSR for giving me this exciting internship experience.

Acknowledgements

Thank Anitha for a lot of suggestions

I thank Anitha for many wonderful suggestions on both my research and my life.

Acknowledgements

Thank Anarb for Gettyimages

I also thank Anarb, my labmate, for telling me about Gettyimages, so I was able to download many useful images for this study.

Acknowledgements

Thank Eletee for her beautiful smile

I also thank Eletee, who is also my labmate, for her beautiful smile that fills our lab with sunshine every day.

Acknowledgements

Thank everybody for not falling asleep

Finally, I thank everybody for not falling asleep during my talk.

Thank You

Thank you very much.

Articulated Pose Estimation
- Yi Yang & Deva Ramanan, CVPR 2011

Inference & LearningLearningInferenceIn order to train our model, we assume a fully supervised training dataset, where we are given positive and negative images with part locations and mixture types. Since the scoring function is linear in its parameters, we can learn the model using a SVM solver. This means we learn the local part templates, the springs, and the mixture co-occurences simultaneously.47