Using a Webcam as a Game Controller Jonathan Blow GDC 2002


Using a Webcam as a Game Controller

Jonathan Blow

GDC 2002

Motivation

• A potentially rich control paradigm, allowing for nuance.

• Removes the barrier of some funny plastic controller.

• Successful experiment: Konami’s Police 911

My game: Air Guitar

• A “beat-matching” game where you stand and play air guitar to your favorite songs.

• Previous beat-matching games (Parappa, DDR) are very digital; I want to use a webcam to make Air Guitar more organic and to allow the user to be expressive.

• Technically demanding as a vision app (needs semantics about what is what).

Real-World Concerns

• Noise

• Illumination changes

• Camera auto-adjusts

• Background changes / camera moves

• Shadows

• Camera saturation / under-excitement

Varying Lighting Conditions

• Can’t rely on RGB values to identify pixels

• Need context… hmm… this becomes a hard AI problem.

Vision Techniques That Suck

• Background subtraction (shadows, motion!)

• Noise reduction by smoothing (resolution!)

• Turning functions (unstable)

• Frame coherence (just a band-aid)

• Edge detection

• Hysteresis (Latin for “cheap hack”)

• Discreteness

General Paradigm

• Technique should:
  – Work on a still image
  – Be robust: avoid discrete decisions wherever possible.
  – Work in as general a case as we can manage, but we won’t strive to be ideally general.

• We will do “whatever it takes” to get the job done.

Restrained Ambition

• Only trying to roughly determine the positions of torso and arms

• Okay to say “the user must wear a long-sleeved shirt of uniform color that contrasts with the background”

• We won’t dictate the color of the shirt (too restrictive!)

• We won’t dictate colors of other things (user’s skin, background).

Early Segmentation

• Divide up the image into regions of “like” pixels to ease computation.

• Ad hoc technique: iterate over scanlines potentially adding each pixel to its neighbor’s group.

• This technique sucks.

The Unreasonable Instability of Approximate Clustering

• “Real” clustering is slow

• “Loose” clustering is interactively unstable

• Even just the small amount of camera noise makes things go berserk… motion is even worse.

• Clustering is about continuous ==> discrete. We wanted to avoid that so we should be very careful.

My solution: Be Inflexible

• Simply divide the image into square regions of constant size.

• If any region needs more detail, subdivide it.

• Noise still affects this system (some regions subdividing / recombining from frame to frame) but it’s relatively stable.
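The fixed-grid scheme above can be sketched roughly as follows. The subdivision criterion (color variance of a region compared against a threshold) and all names (`segment`, `subdivide`, `variance_of`) are illustrative assumptions; the talk does not specify them.

```python
# Sketch of fixed-square-grid segmentation with subdivision.
# Assumption: a region subdivides when some measure of its pixel
# variance exceeds a threshold; the talk leaves the criterion open.

def subdivide(region, variance_of, threshold, min_size):
    """Recursively split a square region (x, y, size) into quadrants
    while its variance measure is above the threshold."""
    x, y, size = region
    if size <= min_size or variance_of(region) <= threshold:
        return [region]
    half = size // 2
    out = []
    for qy in (y, y + half):
        for qx in (x, x + half):
            out.extend(subdivide((qx, qy, half), variance_of, threshold, min_size))
    return out

def segment(width, height, variance_of, threshold=100.0, base=32, min_size=4):
    """Divide the image into fixed base-size squares, subdividing
    only the squares that need more detail."""
    regions = []
    for y in range(0, height, base):
        for x in range(0, width, base):
            regions.extend(subdivide((x, y, base), variance_of, threshold, min_size))
    return regions
```

Because the grid positions are fixed, frame-to-frame noise can only toggle individual squares between subdivided and not, rather than shifting every cluster boundary.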

Which color space do we work in?

• Want to group pixels that are “alike”: nearby in some color space.

• Choices: nonlinear RGB, linear-light RGB, CIE LAB, many others.

• CIE LAB produced nicer results for some ad hoc segmentation experiments, but is expensive to compute.

• Linear-light RGB is the right thing for inverse rendering techniques; it is cheap to compute.

• I started with CIE LAB, but now use linear RGB.

Simple Inverse Rendering

• Assume all surfaces have Lambertian reflectance

• p = m·l·cos θ, where p is the observed pixel color, m the material color, l the illuminant color, and θ the angle between the light direction and the surface normal.

• Can’t disambiguate material color from illuminant color

• The compound color ml, under varying scale, forms a vector through the origin in RGB space.

• This is a much more specific relation than e.g. Euclidean distance.
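As a sketch of exploiting that relation: two linear-RGB colors that are the same compound color ml under different brightness lie on the same ray through the origin, so they can be compared by angle rather than Euclidean distance. The function name and cosine tolerance below are illustrative assumptions.

```python
import math

def same_material_line(c1, c2, cos_threshold=0.999):
    """Test whether two linear-RGB colors lie (nearly) on the same ray
    through the origin, i.e. differ only by a brightness scale.
    cos_threshold is an assumed tolerance, not from the talk."""
    dot = sum(a * b for a, b in zip(c1, c2))
    n1 = math.sqrt(sum(a * a for a in c1))
    n2 = math.sqrt(sum(b * b for b in c2))
    if n1 == 0.0 or n2 == 0.0:
        return False  # black pixels carry no direction information
    return dot / (n1 * n2) >= cos_threshold
```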

Covariance Bodies:

• 5 numbers’ worth of storage

• Ellipsoid-shaped (take eigenvectors of matrix)

• Statistical significance: expected value of points

• Advantage: consistency under summation

• Can use them to vaguely characterize shapes.

• Generalizes to n dimensions.

  Centroid (x̄, ȳ) plus the symmetric matrix:

  [ Σx²  Σxy ]
  [ Σxy  Σy² ]
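A minimal sketch of a 2D covariance body, kept as raw sums so that merging two bodies is just adding their numbers (the consistency-under-summation property above). The class name and API are assumptions; the five stored numbers on the slide correspond to the centroid plus the three distinct matrix entries, and this sketch also carries the point count n to make merging work.

```python
import math

class CovBody2D:
    """Covariance body as running sums: n, Sum(x), Sum(y),
    Sum(x^2), Sum(y^2), Sum(xy). Raw sums add across groups."""
    def __init__(self):
        self.n = 0
        self.sx = self.sy = 0.0
        self.sxx = self.syy = self.sxy = 0.0

    def add(self, x, y):
        self.n += 1
        self.sx += x; self.sy += y
        self.sxx += x * x; self.syy += y * y; self.sxy += x * y

    def merge(self, other):
        """Combine two bodies by summation, no re-scan of pixels."""
        out = CovBody2D()
        out.n = self.n + other.n
        out.sx = self.sx + other.sx; out.sy = self.sy + other.sy
        out.sxx = self.sxx + other.sxx
        out.syy = self.syy + other.syy
        out.sxy = self.sxy + other.sxy
        return out

    def covariance(self):
        """Central covariance entries (cxx, cxy, cyy)."""
        mx, my = self.sx / self.n, self.sy / self.n
        return (self.sxx / self.n - mx * mx,
                self.sxy / self.n - mx * my,
                self.syy / self.n - my * my)

    def ellipse_axes(self):
        """Eigenvalues of the 2x2 covariance matrix: squared semi-axis
        lengths of the ellipse that vaguely characterizes the shape."""
        cxx, cxy, cyy = self.covariance()
        t = 0.5 * (cxx + cyy)
        d = math.sqrt(max(0.0, t * t - (cxx * cyy - cxy * cxy)))
        return t + d, t - d
```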

Covariance Bodies for Color Plane Fitting

• Least-squares plane fit uses the same matrix.

• Track the RHS as well. 3 more numbers: Σxz, Σyz, Σz

• Sum these to get group plane fits.

• (example)

  [ Σx²  Σxy ] [ px ]   [ Σxz ]
  [ Σxy  Σy² ] [ py ] = [ Σyz ]
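The plane fit can be sketched from the same summed quantities. This assumes a plane through the origin, z = px·x + py·y, matching the 2×2 system of sums on the slide, and solves it by Cramer's rule; the function name is illustrative.

```python
def fit_plane_through_origin(points):
    """Least-squares fit of z = px*x + py*y using only summed
    quantities (the covariance-body sums plus Sum(xz), Sum(yz)).
    Because these are raw sums, per-region fits combine by adding
    the sums, the same consistency-under-summation property."""
    sxx = sum(x * x for x, y, z in points)
    syy = sum(y * y for x, y, z in points)
    sxy = sum(x * y for x, y, z in points)
    sxz = sum(x * z for x, y, z in points)
    syz = sum(y * z for x, y, z in points)
    # Solve [sxx sxy; sxy syy][px; py] = [sxz; syz] by Cramer's rule.
    det = sxx * syy - sxy * sxy
    px = (sxz * syy - syz * sxy) / det
    py = (syz * sxx - sxz * sxy) / det
    return px, py
```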

Calibration Mode

• Stand in a fixed pose

• Pose designed to be easily recognizable

• Gives us things that help later:
  – Body measurements
  – Background of scene
  – Shirt color (and histogram)
  – Skin color
  – Coarse model of environment illumination

How We Recognize This Pose

• Pick a color to look for; isolate it.

• Project this color to the X and Y image axes

• Find spikes in projection

• Use heuristics to judge shape and give a confidence value:
  – Outliers
  – Relative spike sizes
  – Screen real-estate occupied

• (example)
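The projection-and-spike step might look like this sketch. The binary-mask representation and the spike threshold (a fraction of the projection's maximum) are illustrative assumptions.

```python
def axis_projections(mask):
    """Project a binary mask (rows of 0/1) onto the image axes:
    column sums (X projection) and row sums (Y projection)."""
    h, w = len(mask), len(mask[0])
    proj_x = [sum(mask[y][x] for y in range(h)) for x in range(w)]
    proj_y = [sum(row) for row in mask]
    return proj_x, proj_y

def find_spikes(proj, frac=0.5):
    """Return (start, end) index runs whose value exceeds
    frac * max(proj). frac is an assumed threshold; the talk
    only says to find spikes."""
    cutoff = frac * max(proj)
    spikes, run = [], []
    for i, v in enumerate(proj):
        if v > cutoff:
            run.append(i)
        elif run:
            spikes.append((run[0], run[-1]))
            run = []
    if run:
        spikes.append((run[0], run[-1]))
    return spikes
```

For a T-like calibration pose, the X projection should show one tall spike at the torso column and the Y projection one spike at the row of the outstretched arms; the heuristics on the slide then score how well the spikes match that expectation.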

Try many colors.

• Sort colors present in scene by popularity; cluster them.

• Create a fuzzy color cone through each cluster.

• Vary the cone radius.

• Do the recognition listed on the previous slide; select the color cone with the best score.

• Fixed color grid (to combat instability!)

(demo of calibration mode)

Head Finding

• Many heuristics:
  – Medium-detail region (flatness + sharpness)
    • But not a long sharp edge
  – Compact body
  – Skin-colored
  – Not the background

Skin color?

• Fit points in RGB space with an approximating surface?

• Where do I get a good skin color database?

www.hotornot.com!

• I get to work and check people out at the same time.

• (app demo)

Gameplay Recognition Mode

• Goal: Find positions of user’s torso and arms.

• When we’re actually playing the game, we use the info provided by calibration to help us.

• Currently only use shirt + skin color.

Body Shape Analysis

• Slide a square window across the image; for each window position, use the pixel regions falling within the window to perform a local shape analysis.

• Examine the resulting ellipses to find the arms. These are long, centered ellipses; round regions are the torso. (example)

• Path-trace these to get an ordered series of points representing each arm.

• Fit one or two line segments to this series of points (one segment = straight arm, two = bent).
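A sketch of the ellipse test: the eigenvalues of a window's 2D point covariance give the squared semi-axes of its characterizing ellipse, and the ratio of the axes separates long thin (arm-like) shapes from round (torso-like) ones. The cutoff ratio is an assumption.

```python
import math

def elongation(points):
    """Ratio of the covariance ellipse's major to minor axis for a
    set of 2D points; large ratio = long thin shape, near 1 = round."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Eigenvalues of the symmetric 2x2 covariance matrix.
    t = 0.5 * (cxx + cyy)
    d = math.sqrt(max(0.0, t * t - (cxx * cyy - cxy * cxy)))
    lam1, lam2 = t + d, max(t - d, 1e-12)
    return math.sqrt(lam1 / lam2)

def classify(points, arm_ratio=3.0):
    """arm_ratio is an assumed cutoff, not a value from the talk."""
    return "arm" if elongation(points) >= arm_ratio else "torso"
```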

Hands in front of body?

• The arm will blend into the body.

• The hands will look like “holes” in the body.

• This messes up arm detection.

Multi-step Process:

• Do a sliding window pass; approximate extents of torso using initial set of regions (holes may be there).

• Look for hand-colored blobs in this area.

• Merge those blobs with the set of torso regions.

• Do another sliding window pass, now detecting elongated shapes (for arms).

Creating a 3D character pose from 2D information

• Resolve ambiguities with game-domain constraints (e.g. hands always within some plane in front of torso).

• Use inverse kinematics and some simple body knowledge to recover 3D joint angles.

• See the column “The Inner Product” in the April 2002 issue of Game Developer for an explanation of 3D IK, and source code.

Method Advantages

• It’s reasonably fast

• Works with moving background / camera

• Doesn’t care much about shadows

Method Shortcomings

• Currently confused by similar colors (low clustering resolution)

• Requires a few more technical solutions before it will be truly robust (e.g. auto gamma detection).

Future Work

• Performance: 640x480 @ 30fps

• More inverse rendering work (specularity)

• Local surface modeling (eliminate confusion due to similar colors)

• Texture classification

• Mental model feedback

Coding Issues

• How do you get video images from a webcam in Windows?
  – VFW code by Nathan d’Obrenan in Game Programming Gems 2
  – Unfortunately, VFW is a legacy API
  – DirectShow is the thing you need to use for future compatibility.

DirectShow is terrible!

• Needlessly complex and bloated.

• The base classes provided in the DirectX SDK induce a lot of latency (latency = death).

• A minimal implementation of “just give me a damn frame from the camera” took 1,500 lines of code; should have taken 8.

• Ask me if you want the source code ([email protected])

• Or use VFW or a proprietary API.

Blatant Plug

• Experimental Gameplay Workshop

• Friday, 4pm-7pm, Fairmont Regency I