eNTERFACE'10
VISION BASED HAND PUPPET
Final Presentation
PROJECT OBJECTIVE
- To develop a multimodal interface to manipulate the low- and high-level aspects of 3D hierarchical digital models
- The hands and the face of possibly separate performers will be tracked
- Their gestures and facial expressions will be recognized
- Estimated features will be mapped to digital puppets in real time
PROJECT OBJECTIVE
The project involves:
- Tracking of both hands: background segmentation, skin color filtering, particle filtering
- Hand pose estimation: dimensionality reduction
- Gesture and expression recognition: hidden semi-Markov models, keyframe classification
- Facial parameter tracking: active appearance models, filtering
- Visualization: skeleton model, inverse kinematics, physics
- Networking: XML
WORK PACKAGES
- WP1: Data collection and ground-truth creation for the pose estimation module
- WP2: Hand posture modeling and training
- WP3: Stereovision-based hand tracking
- WP4: Vision-based facial expression tracking
WORK PACKAGES
- WP5: Gesture and expression spotting and recognition
- WP6: Skeleton and 3D model generation
- WP7: Development of the graphics engine with skeletal animation support
- WP8: Network protocol design and module development
FLOWCHART OF THE SYSTEM
WP1: DATA COLLECTION AND GROUND-TRUTH CREATION FOR POSE ESTIMATION
PROBLEM: Hand pose estimation requires annotated images for training.
- Each hand pose must be known exactly, which is not possible without special devices such as data gloves.
- As this annotation process requires a lot of work, we create the training images synthetically.
- Poser: a software package that can manipulate skeleton-rigged models and render photorealistic images via Python scripts.
WP1: DATA COLLECTION AND GROUND-TRUTH CREATION FOR POSE ESTIMATION
Poser:
- Imitates a stereo camera setup and produces photorealistic renders
- Automatically generates silhouettes from the rendered images
- Allows Python scripts to manipulate any parameter of the scene
- A single script can generate an entire multi-camera dataset
WP1: DATA COLLECTION - METHODOLOGY
- We iteratively increase the complexity of the visualized hand, starting with an open hand
- Start with 3 degrees of freedom: side to side (waving), bend (up/down), twist
- We created 2x1000 images for training (see the script sketch below): created for a certain stereo camera setup, each d.o.f. manipulated in sequence, silhouettes extracted and saved along with the generating parameters
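A minimal sketch of such a generation script, in plain Python. The render_hand function is only a stand-in for the actual Poser Python API calls (pose the model, render a camera view, return the silhouette), and the step counts and loop nesting are illustrative choices that happen to yield 2x1000 samples.

    import numpy as np

    def render_hand(side, bend, twist, camera):
        """Stand-in for the Poser render call: would pose the hand model,
        render the given camera view, and return the binary silhouette."""
        return np.zeros((80, 80), dtype=np.uint8)

    samples = []
    for side in np.linspace(0, 90, 10):            # side-to-side (waving)
        for bend in np.linspace(-90, 90, 10):      # bend (up/down)
            for twist in np.linspace(-60, 60, 10): # twist
                for camera in ("left", "right"):   # stereo camera setup
                    silhouette = render_hand(side, bend, twist, camera)
                    # save the silhouette together with its generating parameters
                    samples.append((silhouette, (side, bend, twist), camera))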
WP1: DATA COLLECTION - CONCLUSION
Using Poser to generate training images is a very efficient method:
- Can potentially create a very large database in a few hours
- It is very simple to create any multi-camera setup
- Each extrinsic and intrinsic camera parameter can be set via Python script
- Automatically extracts the silhouettes
- Provides high-level manipulation parameters for the body parts, e.g. grasping and spreading for the hand
WP2: HAND POSE ESTIMATION
AIM: To estimate hand skeleton parameters using hand silhouettes from multiple cameras.
IDEA:
- Use dimensionality reduction to map silhouettes to a space with much lower dimensionality
- When an unknown silhouette arrives, search for the closest known point in the reduced space
WP2: MANIFOLD LEARNING WITH SPARSE-GPLVM
- Poser's hand model is used for rendering.
- Hand silhouette images (80x80) are rendered: 80x80 = 6400-dimensional silhouette vectors.
- 1000 training samples per camera have been captured by iterating x, y and z over the following ranges using a Python script: x = [0°, 90°], y = [-90°, +90°], z = [-60°, +60°].
- 2 cameras placed orthogonally to each other are simulated: 2000 training samples are collected.
WP2: PCA PREPROCESSING, GPLVM, AND LEARNING THE FORWARD MAPPING WITH NNs
- GPLVM is a non-linear, probabilistic extension of PCA.
- For additional speed gains, a conventional PCA has been applied as a preprocessing step, capturing 99% of the total variance; GPLVM is applied afterwards. This made the optimization process ~4 times faster.
- GPLVM finds a backward mapping from the latent space (2D) to the PCA feature space (~250D for 99% variance).
- For fast generation of initial search points, a forward mapping from feature space to latent space is trained using a NN with 15 hidden neurons (see the sketch below).
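A minimal sketch of the PCA preprocessing and the NN forward mapping, assuming scikit-learn. The GPLVM optimization itself is not shown, so random arrays stand in for the silhouette vectors and for the learned 2D latent coordinates.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neural_network import MLPRegressor

    silhouettes = np.random.rand(2000, 6400)   # placeholder for the 80x80 silhouette vectors
    latent = np.random.rand(2000, 2)           # placeholder for the 2D GPLVM latent coordinates

    # PCA preprocessing: keep enough components to capture 99% of the variance (~250D)
    pca = PCA(n_components=0.99)
    features = pca.fit_transform(silhouettes)

    # Forward mapping (feature space -> latent space) with a 15-hidden-neuron NN,
    # used to generate initial search points for the GPLVM back-projection
    forward_map = MLPRegressor(hidden_layer_sizes=(15,), max_iter=2000)
    forward_map.fit(features, latent)

    initial_latent_point = forward_map.predict(features[:1])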
WP2: HAND POSE ESTIMATION - FLOWCHART
Pipeline: capture frame -> find foreground -> filter skin colors -> extract silhouette -> project with PCA -> map from PCA space to the latent space with the NN -> GPLVM (2D latent space per camera) -> nearest-neighbor classifier. (The slide also shows a scatter plot of the 2D latent space with axes from -1 to 1.)
WP2: CLASSIFICATION
- A 2-dimensional latent space is found in a smooth fashion by the GPLVM optimization.
- Therefore, a nearest-neighbor matcher is used as the classifier in the latent space (see the sketch below).
- The ground-truth angles of the hand poses are known; an exact pose match is sought, and any divergence from the exact angles is counted as a classification error.
- For the synthetic environment prepared with Poser, a classification performance of 94% has been reached in the 2D latent space.
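A minimal sketch of the nearest-neighbor lookup in the latent space, assuming scikit-learn. The latent points and pose angles below are random placeholders for the trained embedding and the ground-truth parameters.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    train_latent = np.random.rand(2000, 2)   # placeholder 2D latent coordinates of training silhouettes
    train_poses = np.random.rand(2000, 3)    # placeholder ground-truth (x, y, z) angles

    matcher = NearestNeighbors(n_neighbors=1).fit(train_latent)

    query_latent = np.random.rand(1, 2)      # latent point of an unknown silhouette
    _, idx = matcher.kneighbors(query_latent)
    estimated_pose = train_poses[idx[0, 0]]  # pose of the closest known point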
WP3: STEREOVISION BASED HAND TRACKING
Objective:
- Obtain the 3D position of the hands
- Enable real-time, low-noise tracking of robust features for gesture recognition tasks
- Track some intuitive spatial parameters to map them directly to the puppet
Approach:
- Skin color as the main cue for hand location
- Stereo camera to obtain 3D information
- Particle filtering for robust tracking
WP3: STEREOVISION BASED HAND TRACKING
Skin-color segmentation (see the sketch below):
- Bayesian color model
- Chromatic color space (HS)
- Train the color model on image regions obtained from a face tracker
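A minimal sketch of such a Bayesian skin-color model in the HS plane, assuming OpenCV and NumPy. The input frame is random data and the face-tracker output is replaced by a hard-coded rectangle; histogram sizes are illustrative choices.

    import cv2
    import numpy as np

    frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # placeholder camera frame
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    # Region reported by the face tracker (hard-coded here) provides skin samples
    face = hsv[100:200, 150:250]

    # H-S histograms for skin (from the face region) and for the whole image
    skin_hist = cv2.calcHist([face], [0, 1], None, [30, 32], [0, 180, 0, 256])
    all_hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])

    # Bayes rule per H-S bin: P(skin | color) ~ P(color | skin) * P(skin) / P(color)
    p_skin = face.shape[0] * face.shape[1] / float(hsv.shape[0] * hsv.shape[1])
    posterior = (skin_hist / skin_hist.sum()) * p_skin / (all_hist / all_hist.sum() + 1e-6)
    posterior = np.clip(posterior, 0, 1).astype(np.float32)

    # Per-pixel skin probability (scaled to 0..255) via backprojection of the posterior
    skin_prob = cv2.calcBackProject([hsv], [0, 1], posterior, [0, 180, 0, 256], 255)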
WP3: STEREOVISION BASED HAND TRACKING
Particle filtering (CONDENSATION):
- Initialization (midterm result):
  - The biggest skin-colored blob is assumed to be the hand
  - Stereo matching to obtain the 3D hand location
- Tracking (new), see the weighting sketch below:
  - Color cue: accumulated skin-color probability, weighted by the percentage of skin-colored pixels in the particle ROI
  - Depth cue: deviation of the ROI disparity from the disparity value implied by the particle location
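A minimal sketch of how such particle weights could be combined, in NumPy. The skin-probability and disparity maps are random placeholders, and the ROI size, skin threshold and disparity noise scale are assumptions for illustration, not values from the project.

    import numpy as np

    skin_prob = np.random.rand(480, 640)                   # per-pixel skin probability (placeholder)
    disparity = np.random.rand(480, 640) * 64              # stereo disparity map (placeholder)
    particles = np.random.rand(100, 3) * [640, 480, 64]    # (x, y, disparity) hypotheses
    half = 15                                              # half-size of the particle ROI

    weights = np.empty(len(particles))
    for i, (x, y, d) in enumerate(particles):
        x, y = int(x), int(y)
        roi = skin_prob[max(y - half, 0):y + half, max(x - half, 0):x + half]
        # Color cue: accumulated skin probability, weighted by the fraction of skin pixels in the ROI
        color_cue = roi.sum() * (roi > 0.5).mean() if roi.size else 0.0
        # Depth cue: penalize deviation of the ROI disparity from the particle's disparity
        roi_disp = disparity[max(y - half, 0):y + half, max(x - half, 0):x + half]
        depth_cue = np.exp(-((roi_disp.mean() - d) ** 2) / (2 * 4.0 ** 2)) if roi_disp.size else 0.0
        weights[i] = color_cue * depth_cue

    weights /= weights.sum() + 1e-12                       # normalize before resampling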
WP3: STEREOVISION BASED HAND TRACKING
WP4: VISION-BASED EMOTION RECOGNITION
AIM:
- To enable the digital puppet to imitate facial emotions
- To change the digital puppet's state using facial expressions
METHODOLOGY:
- Active shape model based facial landmark tracker: track all shape parameters (a set of points)
- Extract useful features manually: eyebrow elevation, lip width, etc.
- Classify the features: using HSMMs, using a nearest-neighbor classifier
WP4: VISION-BASED EMOTION RECOGNITION
- The ASM is trained using an annotated set of face images.
- The search starts from the mean shape, aligned to the face located by a global face detector.
- The following steps are repeated until convergence (see the sketch below):
  - Adjust the locations of the shape points by template matching of the image texture around each landmark and propose a new shape
  - Conform this new shape to a global shape model (based on PCA)
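A minimal sketch of the "conform to the global shape model" step, assuming scikit-learn. The training shapes and the proposed shape are random placeholders, and the +/-3 standard deviation clipping is a common ASM convention rather than a detail stated in the slides.

    import numpy as np
    from sklearn.decomposition import PCA

    train_shapes = np.random.rand(200, 68 * 2)   # placeholder: aligned training shapes (68 landmarks)
    shape_model = PCA(n_components=0.98).fit(train_shapes)

    proposed = np.random.rand(68 * 2)            # shape proposed by per-landmark template matching

    # Project the proposed shape into the PCA shape space ...
    b = shape_model.transform(proposed.reshape(1, -1))[0]
    # ... clip the coefficients to plausible values (+/- 3 standard deviations per mode) ...
    limits = 3 * np.sqrt(shape_model.explained_variance_)
    b = np.clip(b, -limits, limits)
    # ... and reconstruct the conformed shape from the constrained coefficients
    conformed = shape_model.inverse_transform(b.reshape(1, -1))[0].reshape(68, 2)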
REAL-TIME VISION-BASED EMOTION RECOGNITION
Pipeline: a video frame captured from an ordinary webcam -> facial landmark tracking based on Active Shape Models (both a generic and a person-specific model) -> feature extraction based on intensity changes in specific face regions and on distances between specific landmarks -> recognition of the six universal emotions: happiness, sadness, surprise, anger, fear, disgust.
WP4: VISION-BASED EMOTION RECOGNITION
WP4: VISION-BASED EMOTION RECOGNITION – NEAREST NEIGHBOR
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
- Has to "spot" gestures and expressions in continuous streams: no start/end signals
- Should recognize a command only once it is over
- Should run in real time
- We use hidden semi-Markov models (HSMMs): inhomogeneous, explicit-duration models
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
[Figure: HMM vs. HSMM state-duration structure]
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
Why HSMMs?
- HMMs model duration lengths only implicitly, through self-transition probabilities
- This imposes a geometric distribution on each duration, whose variance and mean are coupled: high mean implies high variance
- A geometric distribution and/or high variance do not suit every application: speech, hand gestures, expressions, ...
- HSMMs explicitly model durations; HMMs are a special case of HSMMs (see the sketch after this list)
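A minimal sketch contrasting the two duration models, in NumPy. The self-transition probability and the explicit duration distribution (a Poisson here) are illustrative choices, not parameters from the project.

    import numpy as np

    rng = np.random.default_rng(0)

    # HMM: staying in a state with self-transition probability p yields a
    # geometric duration, so the mean and variance are tied together.
    p = 0.9
    hmm_durations = rng.geometric(1 - p, size=10000)     # mean ~ 1/(1-p) = 10

    # HSMM: the duration is drawn from an explicit distribution that can have
    # any mean/variance combination (here: Poisson, shifted so durations >= 1).
    hsmm_durations = rng.poisson(lam=10, size=10000) + 1

    for name, d in [("HMM (geometric)", hmm_durations), ("HSMM (Poisson)", hsmm_durations)]:
        print(f"{name}: mean={d.mean():.1f}, std={d.std():.1f}")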
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
Training module:
- Developed in MATLAB (no real-time requirement), yet very fast; does not require many samples
- Previously experimented with 25 hand gestures and continuous streams; achieved a 99.7% recognition rate
- For this project, also experimented with facial expressions: six expressions, long continuous training streams (not annotated)
- Results look good; no numerical results yet due to the lack of ground truth (future work)
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
Recognition module:
- Converted to an on-line algorithm: uses the recent history to determine the current state via Viterbi on a large HSMM
- As expressions are independent, this does not introduce much error (about 1.5% of frames misclassified)
- Runs in real time in MATLAB (not yet ported to C++)
Performance analysis: most of the error is attributable to
- Noise
- Global head motion
- A rather weak vector quantization method
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
Preliminary results
WP6: SKELETON AND 3D MODEL GENERATION
We have utilized a skeletal animation technique. The skeleton is predetermined and consists of 16 joints and the accompanying bones.
WP7: DEVELOPMENT OF THE GRAPHICS ENGINE
- Supports skeletal animation for the predetermined skeleton: reads the skeleton parameters for each frame from incoming command files and applies them to the model in real time
- Allows different models to be bound to the skeleton: the same skeleton can be bound to different 3D models
- Supports inverse kinematics: allows absolute coordinates as commands
- Supports basic physics (gravity): allows forward kinematics via forces
WP7: DEVELOPMENT OF THE GRAPHICS ENGINE
Forward kinematics: "Given the angles at all of the robot's joints, what is the position of the hand?"
Inverse kinematics: "Given the desired position of the robot's hand, what must be the angles at all of the robot's joints?"
Cyclic-coordinate descent (CCD) algorithm for IK (see the sketch after this list):
- Traverse the linkage from the distal joint inwards
- Optimally set one joint at a time
- Update the end effector with each joint change
- At each joint, minimize the difference between the end effector and the goal
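A minimal sketch of CCD on a planar three-bone chain, in NumPy; the engine's actual skeleton and joint parameterization are not reproduced here.

    import numpy as np

    def fk(angles, lengths):
        """Forward kinematics: joint positions of a planar chain from relative joint angles."""
        pos, theta = np.zeros(2), 0.0
        joints = [pos.copy()]
        for a, l in zip(angles, lengths):
            theta += a
            pos = pos + l * np.array([np.cos(theta), np.sin(theta)])
            joints.append(pos.copy())
        return joints                                   # joints[-1] is the end effector

    def ccd_ik(angles, lengths, goal, iters=50, tol=1e-3):
        angles = list(angles)
        for _ in range(iters):
            for i in reversed(range(len(angles))):      # traverse from the distal joint inwards
                joints = fk(angles, lengths)
                to_eff = joints[-1] - joints[i]
                to_goal = goal - joints[i]
                # rotate joint i so the end effector swings toward the goal
                delta = np.arctan2(to_goal[1], to_goal[0]) - np.arctan2(to_eff[1], to_eff[0])
                angles[i] += delta
                if np.linalg.norm(fk(angles, lengths)[-1] - goal) < tol:
                    return angles
        return angles

    # Example: reach for a point with a three-bone arm of unit-length bones
    print(ccd_ik([0.1, 0.1, 0.1], [1.0, 1.0, 1.0], np.array([1.5, 1.0])))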
WP7: DEVELOPMENT OF THE GRAPHICS ENGINE
Future work:
- Implement and optimize CCD (90% complete)
- Load geometry data from Autodesk FBX files
- Advanced shading for the puppets, e.g. fur
- Rig to multiple models
- Choose and implement a convenient method for visualizing face parameters and expressions
WP8: NETWORK PROTOCOL DESIGN AND MODULE DEVELOPMENT
- The "visualization computer" acts as a server: it listens to the other computers and accepts binary XML files; works over TCP/IP
- The XML is parsed and the parameters are extracted
- Each packet may contain several parameters and commands: either low-level joint angles as a set, or a high-level command such as a new gesture or expression
WP8: NETWORK PROTOCOL DESIGN AND MODULE DEVELOPMENT
- Threaded TCP/IP server
- Binary XML message format (parsing sketch below):

    <?xml version="1.0" encoding="UTF-8" ?>
    <handPuppet timeStamp="str" source="str">
      <paramset>
        <H rx="f" ry="f" rz="f" />
        <ER ry="f" rz="f" />
        <global tx="f" ty="f" tz="f" rx="f" ry="f" rz="f" />
      </paramset>
      <anim id="str" />
      <emo id="str" />
    </handPuppet>
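A minimal sketch of parsing such a packet on the receiving side, using Python's standard xml.etree.ElementTree. The attribute values in the sample packet are illustrative; in the real system the payload arrives as binary XML over the threaded TCP/IP server, which is not reproduced here.

    import xml.etree.ElementTree as ET

    packet = b"""<?xml version="1.0" encoding="UTF-8" ?>
    <handPuppet timeStamp="2010-08-06T12:00:00" source="handTracker">
      <paramset>
        <H rx="0.1" ry="0.2" rz="0.3" />
        <global tx="0.0" ty="1.0" tz="0.0" rx="0.0" ry="0.0" rz="0.0" />
      </paramset>
      <anim id="wave" />
    </handPuppet>"""

    root = ET.fromstring(packet)
    print(root.get("source"), root.get("timeStamp"))

    # Low-level joint angles arrive as a set of parameters
    for element in root.find("paramset"):
        angles = {name: float(value) for name, value in element.attrib.items()}
        print(element.tag, angles)

    # High-level commands, such as a new gesture or expression
    for command in ("anim", "emo"):
        node = root.find(command)
        if node is not None:
            print(command, node.get("id"))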
CONCLUSION
Individual modules:
- Most of the modules are nearly complete
Final application:
- Tracked features are not yet bound to the skeleton parameters
- Model skin and animations are missing
New ideas that emerged during the workshop:
- Estimate the forward mapping for GPLVMs using NNs
- Use HSMMs for facial expressions
- Fit a 3D ellipse to the 3D point cloud of the hand
- Extract manual features such as edge activity on the forehead
FUTURE WORK
- Once hand tracking is complete, gestures will be trained using HSMMs
- All MATLAB code will be ported to C++ (mostly OpenCV)
- Hand pose complexity will be gradually increased until it is no longer feasible in real time
- Inverse kinematics will be fully implemented
- A face model capable of showing emotions will be incorporated into the 3D model for easy visualization