eNTERFACE'10
VISION BASED HAND PUPPET
Final Presentation
PROJECT OBJECTIVE
- To develop a multimodal interface to manipulate the low- and high-level aspects of 3D hierarchical digital models
- The hands and the face of possibly separate performers will be tracked
- Their gestures and facial expressions will be recognized
- Estimated features will be mapped to digital puppets in real time
PROJECT OBJECTIVE
The project involves:
- Tracking of both hands: background segmentation, skin color filtering, particle filtering
- Hand pose estimation: dimensionality reduction
- Gesture and expression recognition: hidden semi-Markov models, keyframe classification
- Facial parameter tracking: active appearance models, filtering
- Visualization: skeleton model, inverse kinematics, physics
- Networking: XML
WORK PACKAGES
- WP1: Data collection and ground-truth creation for the pose estimation module
- WP2: Hand posture modeling and training
- WP3: Stereovision-based hand tracking
- WP4: Vision-based facial expression tracking
WORK PACKAGES
- WP5: Gesture and expression spotting and recognition
- WP6: Skeleton and 3D model generation
- WP7: Development of the graphics engine with skeletal animation support
- WP8: Network protocol design and module development
FLOWCHART OF THE SYSTEM
WP1: DATA COLLECTION AND GROUND-TRUTH CREATION FOR POSE ESTIMATION
PROBLEM: Hand pose estimation requires annotated images for training.
- Each hand pose must be known exactly, which is not possible without special devices such as data gloves.
- As this annotation process requires a lot of work, we create the training images synthetically.
- Poser: a software package that can manipulate skeleton-rigged models and render photorealistic images via Python scripts.
WP1: DATA COLLECTION AND GROUND-TRUTH CREATION FOR POSE ESTIMATION
Poser:
- Imitates a stereo camera setup and produces photorealistic renders
- Automatically generates silhouettes from the rendered images
- Allows Python scripts to manipulate any parameter of the scene
- A single script can generate an entire multi-camera dataset
WP1: DATA COLLECTION - METHODOLOGY
- We iteratively increase the complexity of the visualized hand, starting with an open hand
- Start with 3 degrees of freedom: side to side (waving), bend (up/down), twist
- We created 2x1000 images for training (see the script sketch below): created for a certain stereo camera setup, each d.o.f. manipulated in sequence, silhouettes extracted and saved along with the generating parameters
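A minimal sketch of such a generation script, in plain Python. The render_hand function is only a stand-in for the actual Poser Python API calls (pose the model, render a camera view, return the silhouette), and the step counts and loop nesting are illustrative choices that happen to yield 2x1000 samples.

    import numpy as np

    def render_hand(side, bend, twist, camera):
        """Stand-in for the Poser render call: would pose the hand model,
        render the given camera view, and return the binary silhouette."""
        return np.zeros((80, 80), dtype=np.uint8)

    samples = []
    for side in np.linspace(0, 90, 10):            # side-to-side (waving)
        for bend in np.linspace(-90, 90, 10):      # bend (up/down)
            for twist in np.linspace(-60, 60, 10): # twist
                for camera in ("left", "right"):   # stereo camera setup
                    silhouette = render_hand(side, bend, twist, camera)
                    # save the silhouette together with its generating parameters
                    samples.append((silhouette, (side, bend, twist), camera))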
WP1: DATA COLLECTION - CONCLUSION
Using Poser to generate training images is a very efficient method:
- Can potentially create a very large database in a few hours
- It is very simple to create any multi-camera setup
- Each extrinsic and intrinsic camera parameter can be set via Python script
- Automatically extracts the silhouettes
- Provides high-level manipulation parameters for the body parts, e.g. grasping and spreading for the hand
WP2: HAND POSE ESTIMATION
AIM: To estimate hand skeleton parameters using hand silhouettes from multiple cameras.
IDEA:
- Use dimensionality reduction to map silhouettes to a space with much lower dimensionality
- When an unknown silhouette arrives, search for the closest known point in the reduced space
WP2: MANIFOLD LEARNING WITH SPARSE-GPLVM
- Poser's hand model is used for rendering.
- Hand silhouette images (80x80) are rendered: 80x80 = 6400-dimensional silhouette vectors.
- 1000 training samples per camera have been captured by iterating x, y and z over the following ranges using a Python script: x = [0°, 90°], y = [-90°, +90°], z = [-60°, +60°].
- 2 cameras placed orthogonally to each other are simulated: 2000 training samples are collected.
WP2: PCA PREPROCESSING, GPLVM, AND LEARNING THE FORWARD MAPPING WITH NNs
- GPLVM is a non-linear, probabilistic extension of PCA.
- For additional speed gains, a conventional PCA has been applied as a preprocessing step, capturing 99% of the total variance; GPLVM is applied afterwards. This made the optimization process ~4 times faster.
- GPLVM finds a backward mapping from the latent space (2D) to the PCA feature space (~250D for 99% variance).
- For fast generation of initial search points, a forward mapping from feature space to latent space is trained using a NN with 15 hidden neurons (see the sketch below).
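A minimal sketch of the PCA preprocessing and the NN forward mapping, assuming scikit-learn. The GPLVM optimization itself is not shown, so random arrays stand in for the silhouette vectors and for the learned 2D latent coordinates.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neural_network import MLPRegressor

    silhouettes = np.random.rand(2000, 6400)   # placeholder for the 80x80 silhouette vectors
    latent = np.random.rand(2000, 2)           # placeholder for the 2D GPLVM latent coordinates

    # PCA preprocessing: keep enough components to capture 99% of the variance (~250D)
    pca = PCA(n_components=0.99)
    features = pca.fit_transform(silhouettes)

    # Forward mapping (feature space -> latent space) with a 15-hidden-neuron NN,
    # used to generate initial search points for the GPLVM back-projection
    forward_map = MLPRegressor(hidden_layer_sizes=(15,), max_iter=2000)
    forward_map.fit(features, latent)

    initial_latent_point = forward_map.predict(features[:1])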
WP2: HAND POSE ESTIMATION - FLOWCHART
Pipeline: capture frame -> find foreground -> filter skin colors -> extract silhouette -> project with PCA -> map from PCA space to the latent space with the NN -> GPLVM (2D latent space per camera) -> nearest-neighbor classifier. (The slide also shows a scatter plot of the 2D latent space with axes from -1 to 1.)
WP2: CLASSIFICATION
- A 2-dimensional latent space is found in a smooth fashion by the GPLVM optimization.
- Therefore, a nearest-neighbor matcher is used as the classifier in the latent space (see the sketch below).
- The ground-truth angles of the hand poses are known; an exact pose match is sought, and any divergence from the exact angles is counted as a classification error.
- For the synthetic environment prepared with Poser, a classification performance of 94% has been reached in the 2D latent space.
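A minimal sketch of the nearest-neighbor lookup in the latent space, assuming scikit-learn. The latent points and pose angles below are random placeholders for the trained embedding and the ground-truth parameters.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    train_latent = np.random.rand(2000, 2)   # placeholder 2D latent coordinates of training silhouettes
    train_poses = np.random.rand(2000, 3)    # placeholder ground-truth (x, y, z) angles

    matcher = NearestNeighbors(n_neighbors=1).fit(train_latent)

    query_latent = np.random.rand(1, 2)      # latent point of an unknown silhouette
    _, idx = matcher.kneighbors(query_latent)
    estimated_pose = train_poses[idx[0, 0]]  # pose of the closest known point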
WP3: STEREOVISION BASED HAND TRACKING
Objective:
- Obtain the 3D position of the hands
- Enable real-time, low-noise tracking of robust features for gesture recognition tasks
- Track some intuitive spatial parameters to map them directly to the puppet
Approach:
- Skin color as the main cue for hand location
- Stereo camera to obtain 3D information
- Particle filtering for robust tracking
WP3: STEREOVISION BASED HAND TRACKING
Skin-color segmentation (see the sketch below):
- Bayesian color model
- Chromatic color space (HS)
- Train the color model on image regions obtained from a face tracker
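A minimal sketch of such a Bayesian skin-color model in the HS plane, assuming OpenCV and NumPy. The input frame is random data and the face-tracker output is replaced by a hard-coded rectangle; histogram sizes are illustrative choices.

    import cv2
    import numpy as np

    frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # placeholder camera frame
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    # Region reported by the face tracker (hard-coded here) provides skin samples
    face = hsv[100:200, 150:250]

    # H-S histograms for skin (from the face region) and for the whole image
    skin_hist = cv2.calcHist([face], [0, 1], None, [30, 32], [0, 180, 0, 256])
    all_hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])

    # Bayes rule per H-S bin: P(skin | color) ~ P(color | skin) * P(skin) / P(color)
    p_skin = face.shape[0] * face.shape[1] / float(hsv.shape[0] * hsv.shape[1])
    posterior = (skin_hist / skin_hist.sum()) * p_skin / (all_hist / all_hist.sum() + 1e-6)
    posterior = np.clip(posterior, 0, 1).astype(np.float32)

    # Per-pixel skin probability (scaled to 0..255) via backprojection of the posterior
    skin_prob = cv2.calcBackProject([hsv], [0, 1], posterior, [0, 180, 0, 256], 255)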
WP3: STEREOVISION BASED HAND TRACKING
Particle filtering (CONDENSATION):
- Initialization (midterm result):
  - The biggest skin-colored blob is assumed to be the hand
  - Stereo matching to obtain the 3D hand location
- Tracking (new), see the weighting sketch below:
  - Color cue: accumulated skin-color probability, weighted by the percentage of skin-colored pixels in the particle ROI
  - Depth cue: deviation of the ROI disparity from the disparity value implied by the particle location
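A minimal sketch of how such particle weights could be combined, in NumPy. The skin-probability and disparity maps are random placeholders, and the ROI size, skin threshold and disparity noise scale are assumptions for illustration, not values from the project.

    import numpy as np

    skin_prob = np.random.rand(480, 640)                   # per-pixel skin probability (placeholder)
    disparity = np.random.rand(480, 640) * 64              # stereo disparity map (placeholder)
    particles = np.random.rand(100, 3) * [640, 480, 64]    # (x, y, disparity) hypotheses
    half = 15                                              # half-size of the particle ROI

    weights = np.empty(len(particles))
    for i, (x, y, d) in enumerate(particles):
        x, y = int(x), int(y)
        roi = skin_prob[max(y - half, 0):y + half, max(x - half, 0):x + half]
        # Color cue: accumulated skin probability, weighted by the fraction of skin pixels in the ROI
        color_cue = roi.sum() * (roi > 0.5).mean() if roi.size else 0.0
        # Depth cue: penalize deviation of the ROI disparity from the particle's disparity
        roi_disp = disparity[max(y - half, 0):y + half, max(x - half, 0):x + half]
        depth_cue = np.exp(-((roi_disp.mean() - d) ** 2) / (2 * 4.0 ** 2)) if roi_disp.size else 0.0
        weights[i] = color_cue * depth_cue

    weights /= weights.sum() + 1e-12                       # normalize before resampling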
WP3: STEREOVISION BASED HAND TRACKING
WP4: VISION-BASED EMOTION RECOGNITION
AIM:
- To enable the digital puppet to imitate facial emotions
- To change the digital puppet's state using facial expressions
METHODOLOGY:
- Active shape model based facial landmark tracker: track all shape parameters (a set of points)
- Extract useful features manually: eyebrow elevation, lip width, etc.
- Classify the features: using HSMMs, using a nearest-neighbor classifier
WP4: VISION-BASED EMOTION RECOGNITION
- The ASM is trained using an annotated set of face images.
- The search starts from the mean shape, aligned to the face located by a global face detector.
- The following steps are repeated until convergence (see the sketch below):
  - Adjust the locations of the shape points by template matching of the image texture around each landmark and propose a new shape
  - Conform this new shape to a global shape model (based on PCA)
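A minimal sketch of the "conform to the global shape model" step, assuming scikit-learn. The training shapes and the proposed shape are random placeholders, and the +/-3 standard deviation clipping is a common ASM convention rather than a detail stated in the slides.

    import numpy as np
    from sklearn.decomposition import PCA

    train_shapes = np.random.rand(200, 68 * 2)   # placeholder: aligned training shapes (68 landmarks)
    shape_model = PCA(n_components=0.98).fit(train_shapes)

    proposed = np.random.rand(68 * 2)            # shape proposed by per-landmark template matching

    # Project the proposed shape into the PCA shape space ...
    b = shape_model.transform(proposed.reshape(1, -1))[0]
    # ... clip the coefficients to plausible values (+/- 3 standard deviations per mode) ...
    limits = 3 * np.sqrt(shape_model.explained_variance_)
    b = np.clip(b, -limits, limits)
    # ... and reconstruct the conformed shape from the constrained coefficients
    conformed = shape_model.inverse_transform(b.reshape(1, -1))[0].reshape(68, 2)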
REAL-TIME VISION-BASED EMOTION RECOGNITION
Pipeline: a video frame captured from an ordinary webcam -> facial landmark tracking based on Active Shape Models (both a generic and a person-specific model) -> feature extraction based on intensity changes in specific face regions and on distances between specific landmarks -> recognition of the six universal emotions: happiness, sadness, surprise, anger, fear, disgust.
WP4: VISION-BASED EMOTION RECOGNITION
WP4: VISION-BASED EMOTION RECOGNITION – NEAREST NEIGHBOR
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
- Has to "spot" gestures and expressions in continuous streams: no start/end signals
- Should recognize a command only once it is over
- Should run in real time
- We use hidden semi-Markov models (HSMMs): inhomogeneous, explicit-duration models
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
[Figure: HMM vs. HSMM state-duration structure]
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
Why HSMMs?
- HMMs model duration lengths only implicitly, through self-transition probabilities
- This imposes a geometric distribution on each duration, whose variance and mean are coupled: high mean implies high variance
- A geometric distribution and/or high variance do not suit every application: speech, hand gestures, expressions, ...
- HSMMs explicitly model durations; HMMs are a special case of HSMMs (see the sketch after this list)
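A minimal sketch contrasting the two duration models, in NumPy. The self-transition probability and the explicit duration distribution (a Poisson here) are illustrative choices, not parameters from the project.

    import numpy as np

    rng = np.random.default_rng(0)

    # HMM: staying in a state with self-transition probability p yields a
    # geometric duration, so the mean and variance are tied together.
    p = 0.9
    hmm_durations = rng.geometric(1 - p, size=10000)     # mean ~ 1/(1-p) = 10

    # HSMM: the duration is drawn from an explicit distribution that can have
    # any mean/variance combination (here: Poisson, shifted so durations >= 1).
    hsmm_durations = rng.poisson(lam=10, size=10000) + 1

    for name, d in [("HMM (geometric)", hmm_durations), ("HSMM (Poisson)", hsmm_durations)]:
        print(f"{name}: mean={d.mean():.1f}, std={d.std():.1f}")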
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
Training module:
- Developed in MATLAB (no real-time requirement), yet very fast; does not require many samples
- Previously experimented with 25 hand gestures and continuous streams; achieved a 99.7% recognition rate
- For this project, also experimented with facial expressions: six expressions, long continuous training streams (not annotated)
- Results look good; no numerical results yet due to the lack of ground truth (future work)
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
Recognition module:
- Converted to an on-line algorithm: uses the recent history to determine the current state via Viterbi on a large HSMM
- As expressions are independent, this does not introduce much error (about 1.5% of frames misclassified)
- Runs in real time in MATLAB (not yet ported to C++)
Performance analysis: most of the error is attributable to
- Noise
- Global head motion
- A rather weak vector quantization method
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
Preliminary results
WP6: SKELETON AND 3D MODEL GENERATION
We have utilized a skeletal animation technique. The skeleton is predetermined and consists of 16 joints and the accompanying bones.
WP7: DEVELOPMENT OF THE GRAPHICS ENGINE
- Supports skeletal animation for the predetermined skeleton: reads the skeleton parameters for each frame from incoming command files and applies them to the model in real time
- Allows different models to be bound to the skeleton: the same skeleton can be bound to different 3D models
- Supports inverse kinematics: allows absolute coordinates as commands
- Supports basic physics (gravity): allows forward kinematics via forces
WP7: DEVELOPMENT OF THE GRAPHICS ENGINE
Forward kinematics: "Given the angles at all of the robot's joints, what is the position of the hand?"
Inverse kinematics: "Given the desired position of the robot's hand, what must be the angles at all of the robot's joints?"
Cyclic-coordinate descent (CCD) algorithm for IK (see the sketch after this list):
- Traverse the linkage from the distal joint inwards
- Optimally set one joint at a time
- Update the end effector with each joint change
- At each joint, minimize the difference between the end effector and the goal
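A minimal sketch of CCD on a planar three-bone chain, in NumPy; the engine's actual skeleton and joint parameterization are not reproduced here.

    import numpy as np

    def fk(angles, lengths):
        """Forward kinematics: joint positions of a planar chain from relative joint angles."""
        pos, theta = np.zeros(2), 0.0
        joints = [pos.copy()]
        for a, l in zip(angles, lengths):
            theta += a
            pos = pos + l * np.array([np.cos(theta), np.sin(theta)])
            joints.append(pos.copy())
        return joints                                   # joints[-1] is the end effector

    def ccd_ik(angles, lengths, goal, iters=50, tol=1e-3):
        angles = list(angles)
        for _ in range(iters):
            for i in reversed(range(len(angles))):      # traverse from the distal joint inwards
                joints = fk(angles, lengths)
                to_eff = joints[-1] - joints[i]
                to_goal = goal - joints[i]
                # rotate joint i so the end effector swings toward the goal
                delta = np.arctan2(to_goal[1], to_goal[0]) - np.arctan2(to_eff[1], to_eff[0])
                angles[i] += delta
                if np.linalg.norm(fk(angles, lengths)[-1] - goal) < tol:
                    return angles
        return angles

    # Example: reach for a point with a three-bone arm of unit-length bones
    print(ccd_ik([0.1, 0.1, 0.1], [1.0, 1.0, 1.0], np.array([1.5, 1.0])))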
WP7: DEVELOPMENT OF THE GRAPHICS ENGINE
Future work:
- Implement and optimize CCD (90% complete)
- Load geometry data from Autodesk FBX files
- Advanced shading for the puppets, e.g. fur
- Rig to multiple models
- Choose and implement a convenient method for visualizing face parameters and expressions
WP8: NETWORK PROTOCOL DESIGN AND MODULE DEVELOPMENT
- The "visualization computer" acts as a server: it listens to the other computers and accepts binary XML files; works over TCP/IP
- The XML is parsed and the parameters are extracted
- Each packet may contain several parameters and commands: either low-level joint angles as a set, or a high-level command such as a new gesture or expression
WP8: NETWORK PROTOCOL DESIGN AND MODULE DEVELOPMENT
- Threaded TCP/IP server
- Binary XML message format (parsing sketch below):

    <?xml version="1.0" encoding="UTF-8" ?>
    <handPuppet timeStamp="str" source="str">
      <paramset>
        <H rx="f" ry="f" rz="f" />
        <ER ry="f" rz="f" />
        <global tx="f" ty="f" tz="f" rx="f" ry="f" rz="f" />
      </paramset>
      <anim id="str" />
      <emo id="str" />
    </handPuppet>
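A minimal sketch of parsing such a packet on the receiving side, using Python's standard xml.etree.ElementTree. The attribute values in the sample packet are illustrative; in the real system the payload arrives as binary XML over the threaded TCP/IP server, which is not reproduced here.

    import xml.etree.ElementTree as ET

    packet = b"""<?xml version="1.0" encoding="UTF-8" ?>
    <handPuppet timeStamp="2010-08-06T12:00:00" source="handTracker">
      <paramset>
        <H rx="0.1" ry="0.2" rz="0.3" />
        <global tx="0.0" ty="1.0" tz="0.0" rx="0.0" ry="0.0" rz="0.0" />
      </paramset>
      <anim id="wave" />
    </handPuppet>"""

    root = ET.fromstring(packet)
    print(root.get("source"), root.get("timeStamp"))

    # Low-level joint angles arrive as a set of parameters
    for element in root.find("paramset"):
        angles = {name: float(value) for name, value in element.attrib.items()}
        print(element.tag, angles)

    # High-level commands, such as a new gesture or expression
    for command in ("anim", "emo"):
        node = root.find(command)
        if node is not None:
            print(command, node.get("id"))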
CONCLUSION
Individual modules:
- Most of the modules are nearly complete
Final application:
- Tracked features are not yet bound to the skeleton parameters
- Model skin and animations are missing
New ideas that emerged during the workshop:
- Estimate the forward mapping for GPLVMs using NNs
- Use HSMMs for facial expressions
- Fit a 3D ellipse to the 3D point cloud of the hand
- Extract manual features such as edge activity on the forehead
FUTURE WORK
- Once hand tracking is complete, gestures will be trained using HSMMs
- All MATLAB code will be ported to C++ (mostly OpenCV)
- Hand pose complexity will be gradually increased until it is no longer feasible in real time
- Inverse kinematics will be fully implemented
- A face model capable of showing emotions will be incorporated into the 3D model for easy visualization