
Vis Comput (2010) 26: 3–15
DOI 10.1007/s00371-009-0367-8

ORIGINAL ARTICLE

Multi-camera tele-immersion system with real-time model driven data compression

A new model-based compression method for massive dynamic point data

Jyh-Ming Lien · Gregorij Kurillo · Ruzena Bajcsy

Published online: 19 May 2009
© Springer-Verlag 2009

Abstract Vision-based full-body 3D reconstruction for tele-immersive applications generates a large amount of data points, which have to be sent through the network in real time. In this paper, we introduce a skeleton-based compression method using motion estimation, where kinematic parameters of the human body are extracted from the point cloud data in each frame. First, we address the issues regarding data capturing and transfer to a remote site for tele-immersive collaboration. We then compare the results of existing compression methods with the proposed skeleton-based compression technique. We examine the robustness and efficiency of the algorithm through experimental results with our multi-camera tele-immersion system. The proposed skeleton-based method provides high and flexible compression ratios, from 50:1 to 5000:1, with reasonable reconstruction quality (peak signal-to-noise ratio from 28 to 31 dB) while preserving real-time (10+ fps) processing.

Keywords 3D tele-immersion · Multi-camera tele-immersion system · Point data compression · Model-based compression

J.-M. Lien (✉)
George Mason University, Fairfax, VA, USA
e-mail: [email protected]

G. Kurillo · R. Bajcsy
University of California, Berkeley, CA, USA

G. Kurillo
e-mail: [email protected]

R. Bajcsy
e-mail: [email protected]

1 Introduction

Tele-immersion (TI) aims to enable users in geographically distributed sites to collaborate and interact in real time inside a shared simulated environment as if they were in the same physical space [4, 14, 26, 46]. The TI technology combines virtual reality for rendering, computer vision for image acquisition and 3D reconstruction, and various networking techniques for transmitting the data between the remote sites. All these steps need to be done in real time with the smallest delays possible. By combining these technologies, physically dislocated users are rendered inside the same virtual environment. TI is aimed at different network applications, such as collaborative work in industry and research, remote evaluation of products for ergonomics, remote learning and training, coordination of physical activities (e.g., dancing [48], rehabilitation [15]), and entertainment (e.g., games, interactive music videos, and 'edutainment' [32]). Figure 1 shows an example of a student learning tai-chi by following the motion of a tai-chi master captured using our system. In addition, 3D data captured locally could be used for kinematic analysis of body movement (e.g., in medicine [40] and rehabilitation [21, 22, 33]) or in computer animation [19]. The virtual environment can also include different synthetic objects (e.g., three-dimensional models of buildings) or digitized objects (e.g., a magnetic resonance image of a brain), which can be explored through 3D interactions.

Figure 2 shows our TI apparatus and an example of its 360-degree stereo reconstruction of a tai-chi master. The stereo reconstruction is represented in the form of a point cloud. We will discuss more details regarding our TI system in Sect. 6.1.

Major bottlenecks of existing TI systems A critical aspect of a TI system is its interactiveness.


Fig. 1 A student (right, in blue shirt) is learning the tai-chi movements by following pre-recorded 3D TI data from a tai-chi teacher. These images (both the student and the tai-chi teacher) are rendered from the reconstructed 3D points using our tele-immersion system

In order to provide interactiveness, we must satisfy the real-time requirement. The term 'real time' in this paper refers to real-time capturing and/or rendering, which should run at 10 frames per second (fps) or more. One of the major bottlenecks preventing current tele-immersion systems from fulfilling the real-time requirement is the transfer of the large number of data points generated by the stereo reconstruction from multiple images to the remote site. For example, data from 640 × 480 pixel depth and color maps from 10 camera clusters with a frame rate of 15 fps would require 220 MB/s of network bandwidth. When two TI systems communicate, 440 MB/s of bandwidth is required. The current image/video-based compression method [48] in our TI system only allows transmission of data at a rate of 5 or 6 frames per second.
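
The bandwidth figure above can be reproduced with a short back-of-the-envelope calculation. A minimal sketch follows; the per-pixel payload (24-bit color plus 16-bit depth) and the reading of MB as 2^20 bytes are our assumptions, since the paper does not spell them out.

```python
# Rough bandwidth estimate for the figures quoted above.
# Assumed payload: 3 bytes of color + 2 bytes of depth per pixel.
width, height = 640, 480        # pixels per depth/color map
bytes_per_pixel = 3 + 2         # 24-bit color + 16-bit depth (assumed)
clusters = 10
fps = 15

bytes_per_second = width * height * bytes_per_pixel * clusters * fps
print(bytes_per_second / 2**20)      # ~220 MB/s for one TI system
print(2 * bytes_per_second / 2**20)  # ~440 MB/s for two systems talking
```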

Note that although network bandwidth may increase, an efficient and effective compression method will remain an important component of our TI system, as the size of the generated data keeps growing. In the near future we will improve the resolution of the captured images and the frame rate by optimizing the depth calculation using finite elements and triangular wavelets [30]. These improvements will inevitably increase the bandwidth requirements for transferring data through the network in real time. For this purpose, different compression schemes for real-time 3D video should be investigated.

In this paper, we focus on data compression using motion estimation for TI applications. The proposed compression method provides high and flexible compression ratios with reasonable reconstruction quality. We will first provide an overview of our method and a review of related work in Sects. 2 and 3. In Sect. 4, we briefly discuss the idea of the "iterative closest points" (ICP) method. Then, we will describe a novel technique for skeleton-based compression of 3D point data captured by our system. We discuss how to compress the data using the estimated articulated motions and how to encode non-rigid motions using regular grids (residual maps) in Sect. 5.1. Finally, we propose several efficient ways to detect temporal coherence in our residual maps, including accumulating residuals from a few consecutive frames and compressing them using established video compression techniques, in Sect. 5.2. The robustness and efficiency of the algorithm are examined both theoretically and experimentally in Sect. 6.

Main contribution To our knowledge, we are the first to propose a skeleton-based compression method for TI 3D point data. This paper presents an extensive study of our preliminary results [29] on the compression method. We focus on providing both theoretical and experimental evidence to demonstrate the advantages of the new compression method over existing methods. We also provide a better motion estimation method that considers both local and global fitting errors.

Our compression method provides tunable and high compression ratios (from 50:1 to 5000:1) with reasonable reconstruction quality. The "peak signal-to-noise ratio" of the proposed compression method is between 28 dB and 30.8 dB. Most importantly, our method can estimate motions from the data captured by our TI system in real time (10+ fps).

2 An overview of model driven compression

Few compression methods [46, 48] for TI data have been proposed, and most of them are image-based. In this work, we focus on compressing 3D point data instead.

Several methods have been proposed to compress static or dynamic point data [3, 13, 20, 31, 36, 42]. However, several issues have not yet been addressed in the literature, which makes our compression problem much more challenging. The main challenges include:

1. the real-time requirement,
2. only a short period of time for data accumulation, and
3. multiple or no data correspondences.

The real-time requirement prevents direct application of the majority of existing point-data compression methods. For example, many methods [13, 36] are designed for off-line compression.


Fig. 2 (a) The multi-camera stereo system for tele-immersion consists of 48 cameras arranged in 12 stereo clusters. The captured images are processed by 12 multi-core computers to provide 360-degree 3D full-body reconstruction in real time. (b) A schematic diagram of the connections and interfaces of our system. The processing cluster performs stereo reconstructions in parallel. (c) Partial depth image data captured by each of the 12 stereo camera clusters, which are accurately calibrated to a reference coordinate system, are combined inside the renderer into a full-body 3D reconstruction

These methods usually take tens of seconds to several minutes to compress a few thousand points; our system creates about 50–100 thousand points in each frame. The real-time constraint also requires us to accumulate only a limited amount of data, i.e., the captured data should be transmitted as soon as they become available. However, methods that exploit temporal coherence usually require data accumulation over a long period of time [3, 20, 31, 42].

Finding correspondences between two consecutive point sets is another challenging problem. In many existing motion compression methods [16, 27, 49], the correspondence information is given. This assumption is unrealistic for our TI data. To make the correspondence problem even more difficult, a point in one point set may correspond to zero or multiple points in the next point set. A point may not have any correspondence because it may become occluded from all cameras in the next frame; similarly, a point can correspond to multiple points because it lies in the visible regions of multiple cameras and is reconstructed multiple times.

Our approach This paper aims to address the challenges mentioned above in TI data compression. To cope with these challenges, we use a model-based approach. The main idea of this work is to take advantage of prior knowledge of the objects in the TI environments, e.g., human figures, and to represent their motions using just a few parameters, e.g., joint positions and angles.

More specifically, our proposed method compresses points that represent human motion using motion estimation. Therefore, instead of transmitting the point clouds, we can simply transmit the motion parameters (e.g., joint positions and angles). This approach is based on the assumption that most points will move under a rigid-body transform along with the associated skeleton.

In reality, point movements may deviate from this assumption, for example due to muscle movements and hair or cloth deformations. Our experimental results also show that when these "secondary" motions are discarded, the entire human movement becomes very unnatural. The same issue has been observed by others, e.g., Arikan [3], in compressing optical-based motion capture data. Therefore, we further compress the deviations from the rigid movements. As we will see later (in Sect. 5.2), these deviations (which we call "prediction residuals") are in most cases small and can be dramatically compressed without reducing the quality significantly. An overview of this skeleton-based approach is shown in Fig. 3.

3 Related work

Many applications of distributed virtual reality [19, 28, 47] feature avatars to represent the human user inside the computer-generated environment.


Fig. 3 Skeleton-based compression of the point data Q, where P is the skeleton, and T is the motion parameter (e.g., the skeleton configuration) that transforms P to Q. The prediction residual R is the difference between T(P) and Q

An avatar is often designed as a simplified version of the user's physical features, while its movements are controlled via tracking of the user in the real world. Due to the limitations of the model's appearance and movement ability, the interactive experience with other users is much more restricted. The level of immersion and interaction with the virtual environment can be increased by using realistic real-time models of the users. Vision-based tele-immersion systems are designed to address these issues.

One of the first TI systems was presented by researchers at the University of Pennsylvania [35]. Their system consisted of several stereo camera triplets used for image-based reconstruction of the upper body, which allowed a local user to communicate with remote users while sitting behind a desk. A simplified version of the desktop tele-immersive system, based on reconstruction from silhouettes, was proposed by Baker et al. [4].

The blue-c tele-immersion system presented by Gross et al. [12, 46] allowed full-body reconstruction based on silhouettes obtained from several cameras arranged around the user. Reconstruction from silhouettes provides faster stereo reconstruction compared to image-based methods; however, this approach is limited in accuracy, in its ability to reconstruct concave surfaces, and in discriminating occlusions by objects or other persons inside the viewing area. The voxel-based interactive virtual reality system presented by Hasenfratz et al. [14] featured fast (25 fps) real-time reconstruction of the human body but was limited in acquiring details, such as clothing and facial features.

Our TI system performs 360-degree full-body 3D reconstruction from images using twelve stereo triplets [18]. The captured 3D data are transferred using the TCP/IP protocol to the rendering computer, which reconstructs and displays point clouds inside a virtual environment at the local or remote sites in real time. The presented system has been successfully used in remote dancing applications [48] and in learning tai-chi movements [37] (also see Fig. 1). Videos and images are available at: http://tele-immersion.citris-uc.org.

TI data compression Only a few methods have been proposed to compress TI data. Kum and Mayer-Patel [24] proposed an algorithm to compress TI data in real time. Their method aimed to provide scalability with respect to the number of cameras, network bandwidth, and rendering power, and achieves a 5:1 compression ratio. Recently, Yang et al. [48] proposed a real-time compression method that compresses depth and color maps using lossless and lossy compression, respectively; it achieves a 15:1 compression ratio. Unfortunately, the volume produced by these real-time compression methods is still too large to be efficiently transmitted in real time.

Lamboray et al. [25] proposed and studied several encoding techniques for streaming point set data based on the ideas of pixel-based differential prediction and redundancy. Their method is also based on image/video compression, i.e., MPEG video compression. Gross et al. [25, 46] also proposed an idea based on a concept similar to back-face culling, transmitting only visible data. While we could certainly apply this strategy to points, it would prevent us from supporting some important functionalities, including interaction with virtual objects and motion analysis or understanding. In addition, transmitting only visible data provides no benefit when multiple views are needed, as in the tai-chi learning example shown in Fig. 1.

4 Preliminaries: motion estimation and ICP

Our method is heavily based on the idea of motion estimation. Extensive work has been done on tracking human motion in images (see surveys [2, 11]). Much of the existing work cannot be used to solve our motion estimation problem directly. For example, one of the most promising approaches for robust tracking is particle-filter-based tracking [17]. However, these methods are slow and thus cannot satisfy our real-time requirement. In this paper, we focus on motion estimation from 3D points.

A popular method called "Iterative Closest Points" (ICP) [6, 41] has been shown to be an efficient method for estimating rigid-body motion [43] in real time. For two given point sets P and Q, ICP first computes corresponding pairs {(p_i ∈ P, q_i ∈ Q)}. Using these corresponding pairs, ICP computes a rigid-body transform T such that the "matching error" defined in (1) between P and Q is minimized:

$$\text{error} = \min_T \sum_i \big| T(p_i), q_i \big|^2, \qquad (1)$$

where |x, y| is the Euclidean distance between x and y.

The main step in minimizing the matching error (see details in [6]) is to compute the cross-covariance matrix Σ_PQ of the corresponding pairs {p_i, q_i},

$$\Sigma_{PQ} = \frac{1}{n} \sum_{i=1}^{n} \big[ (p_i - \mu_p)(q_i - \mu_q)^t \big], \qquad (2)$$


where μ_p and μ_q are the centers of {p_i} and {q_i}, respectively, and n is the size of {p_i, q_i}. As outlined in Algorithm 4.1, ICP iterates these steps until the error is small enough. An important property of ICP is that the error always decreases monotonically to a local minimum when the Euclidean distance is used for determining corresponding points.

Algorithm 4.1 ICP(P, Q, τ)

repeat
    find corresponding points {(p_i ∈ P, q_i ∈ Q)}
    compute error and T in (1)
    P = T(P)
until error < τ
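
For illustration, a minimal NumPy sketch of Algorithm 4.1 is given below. It recovers the rigid transform from the cross-covariance matrix of (2) using the SVD-based closed form of Arun et al. rather than the quaternion derivation of Besl and McKay [6], and it finds correspondences by brute force; all function and variable names are ours.

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Rigid transform (R, t) minimizing sum_i |R p_i + t - q_i|^2 for
    paired points, built from the cross-covariance matrix in (2)."""
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    Sigma = (P - mu_p).T @ (Q - mu_q) / len(P)         # equation (2)
    U, _, Vt = np.linalg.svd(Sigma)
    # Guard against reflections so that R is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, mu_q - R @ mu_p

def icp(P, Q, tau=1e-4, max_iters=50):
    """Algorithm 4.1 with a brute-force O(|P||Q|) nearest-neighbor
    search; a k-d tree would replace it in practice (Sect. 5.1.3)."""
    P = P.copy()
    error = np.inf
    for _ in range(max_iters):
        # Each p_i corresponds to its Euclidean nearest neighbor in Q.
        d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
        corr = Q[d2.argmin(axis=1)]
        R, t = best_rigid_transform(P, corr)           # minimizes (1)
        P = P @ R.T + t                                # P = T(P)
        error = ((P - corr) ** 2).sum()                # error as in (1)
        if error < tau:
            break
    return P, error
```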

ICP has also been applied to estimate motions of articulated objects, e.g., hands [10], heads [34], upper bodies [9], and full bodies [23]. In these methods, ICP is applied to minimize the distance between each body part and the point cloud Q, i.e., error(P, Q) = Σ_l error(P_l, Q), where P_l represents the points of the body part l. However, this approach lacks a global mechanism to oversee the overall fitting quality. For example, it is common that two body parts overlap and that a large portion of Q is not "covered" by P. Therefore, considering the global structure as a whole is needed. To the best of our knowledge, the joint constraint [9, 23] is the only global measure studied.

5 Skeleton-based compression

One of the major bottlenecks of our current tele-immersion system is the transfer of a large number of data points. A few methods [46, 48] have been proposed to address this problem using image and video compression techniques, but the data volume remains prohibitively large. In this section we discuss our solutions to the problems in TI data compression from a model driven approach. Figure 3 shows an overview of this model driven approach, which consists of two main steps: motion estimation (Sect. 5.1) and residual prediction (Sect. 5.2).

Our compression method provides flexible, high compression ratios (from 50:1 to 5000:1) with reasonable reconstruction quality. Our method can estimate motions from the data captured by our TI system in real time (10+ fps).

5.1 Motion estimation

We extend ICP to estimate the motion of dynamic point data in real time. Our approach uses an initialization step (which may not be real time) to generate a skeleton of the subject from the first point cloud; a real-time tracking method then fits the skeleton to the point clouds captured from the rest of the movements. Currently, the initialization step is done manually. Several methods, e.g., [7], exist and can provide us an initial skeleton to start the process. In this paper, we focus on the motion tracking aspect.

5.1.1 Model and segmentation

Our method uses the segmentation of the initial frame to estimate the motion. The intuition behind this is to take advantage of the fact that the appearances of the initial frame and the remaining frames are usually similar. From now on, we denote the point set in the first frame as P. We compute a segmentation of P using a given skeleton S. A skeleton S is organized as a tree structure with a root link, l_root, representing the torso of the body. Each link has the shape of a cylinder. An example of a skeleton is shown in Fig. 5. After segmentation, each point of P is associated with a link. We denote the points associated with a link l and with the skeleton S as P_l and P_S, respectively. The associated points move rigidly with the link as we estimate the motion. Note that P_l ⊂ P_S and, initially, P_S = P.

5.1.2 ICP of articulated body

Now, we can start to fit the skeleton S (and its associated points P_S) to the next point cloud Q. Our goal here is to find a rigid-body transform for each link so that the (global) distance between P_S and Q is minimized. Our method ARTICP is outlined in Algorithm 5.1.

Algorithm 5.1 ARTICP(S, Q, τ)

cluster Q
q.push(l_root)
while q ≠ ∅ do
    l ← q.pop()
    T ← ICP(P_l, Q)
    for each child c of l do
        apply T to c
        q.push(c)
global fitting
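
In code, the traversal of Algorithm 5.1 might look like the sketch below. The skeleton representation and the helpers icp_fit, cluster_by_normal_and_color, and global_fitting are our assumptions standing in for Algorithm 4.1, Sect. 5.1.3, and Sect. 5.1.4, respectively.

```python
from collections import deque

def articp(skeleton, Q, tau):
    """Sketch of ARTICP. Assumed skeleton representation: each link
    carries its associated points P_l (.points), a list of .children,
    and an .apply(T) that rigidly moves the link and its points."""
    clusters = cluster_by_normal_and_color(Q)   # Sect. 5.1.3 (assumed)
    queue = deque([skeleton.root])              # torso first, limbs last
    while queue:
        link = queue.popleft()
        # Algorithm 4.1; assumed to return the net transform T and to
        # leave link.points transformed (P = T(P) happens inside ICP).
        T = icp_fit(link.points, clusters, tau)
        for child in link.children:
            child.apply(T)        # child-links follow the parent's motion,
            queue.append(child)   # which constrains and speeds their own fit
    global_fitting(skeleton, Q)                 # Sect. 5.1.4 (assumed)
```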

Clustering ARTICP first clusters the current point cloud Q. Each cluster contains a set of points with similar outward normals and colors. These clusters will be used in ICP for finding better correspondences more efficiently. (See details in Sect. 5.1.3.)

Hierarchical ICP Next, we invoke ICP to compute a rigid-body transform for each link. Note that we do this in such a way that the torso (root) of the skeleton is fitted first and the limbs are fitted last. By considering links in this order, ARTICP can increase the convergence rate by applying the transform not only to the link under consideration but also to its child-links. The rationale behind this is that a child-link, e.g., a limb, generally moves with its parent, e.g., the torso. If the child-link does not follow the parent's movement, the movement of the child-link is generally constrained and is usually easier to track.

Articulation constraint We consider the articulation constraint, i.e., the transformed links should remain jointed. This is simply done by replacing both of the centers μ_p and μ_q in (2) by the joint position.

Global fitting Now, after we apply ICP to the individual links, the skeleton S is roughly aligned with the current point cloud Q but may still be fitted incorrectly if the entire skeleton is not considered as a whole. We propose to estimate the global fitting error as:

$$\text{global\_error}(P, Q) = \big| F_e(P) - F_e(Q) \big|, \qquad (3)$$

where the function F_e transforms a point set to a space where the difference is easier to compute. A more detailed discussion of F_e can be found in Sect. 5.1.4. Note that (3) also computes the prediction residuals.

Features Before discussing any details, we summarize the main features of our ARTICP below:

– hierarchical fitting for faster convergence
– articulation constraint
– guaranteed monotonic convergence to a local minimum
– global error minimization.

5.1.3 Robustness, efficiency and convergence

Real time is one of the most important requirements of our system. Computing correspondences is the major bottleneck of ICP. Preprocessing the points using spatial partitioning data structures, e.g., a k-d tree [5], can alleviate this problem. Considering additional point attributes, e.g., color and normal information, can also increase both ICP's efficiency and robustness. Ideally, therefore, we would like to use all of these tools to help us achieve the real-time requirement.

However, combining normals and colors with a k-d tree is not trivial. For instance, the performance of a k-d tree degrades quickly (in fact, exponentially) as the dimensionality of the space becomes high. Moreover, considering additional attributes usually does not guarantee monotonic convergence, which is an important property provided by ICP.

Fig. 4 An (n, c)-clustered k-d tree. Points n and m only look for their corresponding points in the gray regions around n and m

To gain efficiency, robustness, and convergence using point colors, normals, and a k-d tree, we construct a data structure called the (n, c)-clustered k-d tree (shown in Fig. 4), where n and c indicate the normal direction and color, respectively, associated with the points. In this data structure, we cluster normals by converting them into a geographic coordinate system and cluster colors using hues and saturations. Then, we construct a k-d tree from the 3D point positions in each cluster.
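
One possible construction of this structure is sketched below with SciPy; the binning granularity and value ranges are our assumptions, since the paper only states that normals are clustered in a geographic coordinate system and colors by hue and saturation.

```python
import numpy as np
from scipy.spatial import cKDTree

def bin_index(values, lo, hi, bins):
    """Uniform binning of values in [lo, hi] into `bins` bins."""
    idx = np.floor((values - lo) / (hi - lo) * bins).astype(int)
    return np.clip(idx, 0, bins - 1)

def build_nc_clustered_kdtree(points, normals, hues, sats,
                              n_bins=8, h_bins=6, s_bins=2):
    """One k-d tree of 3D positions per (normal, color) cluster."""
    # Geographic coordinates (latitude, longitude) of the unit normals.
    lat = np.arcsin(np.clip(normals[:, 2], -1.0, 1.0))
    lon = np.arctan2(normals[:, 1], normals[:, 0])
    keys = np.stack([bin_index(lat, -np.pi / 2, np.pi / 2, n_bins),
                     bin_index(lon, -np.pi, np.pi, n_bins),
                     bin_index(hues, 0.0, 1.0, h_bins),
                     bin_index(sats, 0.0, 1.0, s_bins)], axis=1)
    trees = {}
    for key in np.unique(keys, axis=0):
        mask = (keys == key).all(axis=1)
        # Keep original indices so matches can be mapped back to Q.
        trees[tuple(key)] = (cKDTree(points[mask]), np.flatnonzero(mask))
    return trees
```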

When a point m looks for its corresponding point, ICP first determines which cluster m falls in and then examines the points in m's neighboring clusters using their k-d trees. By doing so, we can find better correspondences efficiently while guaranteeing that ICP always converges to a local minimum. Theorem 5.1 proves this property.

Theorem 5.1 Using an (n, c)-clustered k-d tree, ICP always converges monotonically to a local minimum.

Proof Besl and McKay [6] have shown that ICP always converges monotonically. An important fact that makes this true is that the corresponding points of a point p are always extracted from the same point set Q. The reason why considering point normals may break monotone convergence is that in each iteration p's corresponding point can be extracted from a different subset of Q. On the contrary, ICP using an (n, c)-clustered k-d tree always searches for p's corresponding point in the same set of clusters and is therefore guaranteed to converge monotonically. □

5.1.4 Global fitting

So far, we have only considered fitting each link to the point cloud independently. However, as we mentioned earlier, considering links independently sometimes produces incorrect estimations even when the fitting error (1) is small. Therefore, in order to get more robust results, it is necessary to consider the entire skeleton as a whole.

Global error The global fitting error (3) is measured as the difference between the skeleton-associated point set P_S and the current point set Q. Due to the size difference and ambiguous correspondences between P_S and Q, it is not trivial to compute the difference directly. We need a function F_e to transform P_S and Q so that the difference is easier to estimate.


In fact, our selection of F_e has been mentioned before (in Sect. 5.1.1: Segmentation). It is important to note that the segmentation process is a global method, i.e., each link competes with the other links to be associated with a point. Because F_e(P_S) is already known at the beginning, we only need to compute F_e(Q) at each time step. After segmenting Q, each link l has two associated point sets, P_l and Q_l. To measure the difference between P_l and Q_l, we use a compact shape descriptor, the second moment μ of P_l and Q_l, and compute the difference of P_l and Q_l as

$$\big| \mu(P_l) - \mu(Q_l) \big|. \qquad (4)$$

The idea behind this equation is that the second moment provides an idea of how the points distribute around the center of the link. Therefore, an incorrect estimation will produce a large error.
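
A minimal sketch of this descriptor follows, assuming the second moment is the 3 × 3 scatter matrix of the points about the link center and that the difference in (4) is taken with the Frobenius norm; the paper does not specify either choice.

```python
import numpy as np

def second_moment(points, center):
    """Second moment of a point set about the link center (3x3 matrix)."""
    d = points - center
    return d.T @ d / len(points)

def link_error(P_l, Q_l, center):
    """Per-link global fitting error, as in (4)."""
    return np.linalg.norm(second_moment(P_l, center)
                          - second_moment(Q_l, center), ord='fro')
```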

When the error of any link is above a user-defined threshold, we start to minimize the global error. The minimization process is iterative and multi-resolution. That is, our global minimization method focuses on the difficult areas (i.e., areas that have larger error) by fitting P_l to Q_l and ignores the links that have already been fitted correctly. More specifically, to minimize the global error, we separate the links in the skeleton S and the points in the current point set Q into two subgroups, i.e., we separate S into S+ and S−, and Q into Q+ and Q−. A link of S is in S+ if its global error is lower than the threshold; otherwise the link is in S−. A point of Q is in Q+ if the point is associated with a link in S+; otherwise the point is in Q−. By separating the links and points into subgroups, we can ignore the groups that are considered correctly estimated and repeat ICP (Algorithm 5.1) using only the links in S− and the point set Q−, i.e., we call ARTICP(S−, Q−, τ). This process is iterated until all links have acceptable global errors or until no improvement can be made.
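
The refit loop might be sketched as follows; segment, sub_skeleton, and the per-link fields are assumed helpers, and a fixed round limit stands in for 'until no improvement can be made'.

```python
import numpy as np

def minimize_global_error(skeleton, Q, segment, tau, max_rounds=5):
    """Repeat ARTICP on only the badly fitted links (S-) and their
    points (Q-). `segment` assigns each point of Q to a link as in
    Sect. 5.1.1 (assumed helper)."""
    for _ in range(max_rounds):
        Q_by_link = segment(skeleton, Q)
        bad = [l for l in skeleton.links
               if link_error(l.points, Q_by_link[l], l.center) > tau]
        if not bad:
            break                                  # all links fit well
        S_minus = skeleton.sub_skeleton(bad)       # assumed helper
        Q_minus = np.concatenate([Q_by_link[l] for l in bad])
        articp(S_minus, Q_minus, tau)              # refit only S-, Q-
```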

5.2 Prediction residuals

Motion estimation brings the skeleton S (and its associated point set P_S) close to the current point cloud Q. Nevertheless, for several reasons, e.g., non-rigid movements and estimation errors, our model P_S may not match Q exactly. We call the difference between P_S and Q the "prediction residuals" (or simply residuals). Because P_S and Q are aligned to each other after motion estimation, we expect the residuals to be small. In this section, we present a method to compute the prediction residuals.

Similarly to (4) for estimating global errors, we compute the prediction residuals for each link by measuring the difference between P_l and Q_l. This time, however, the difference is measured by regularly re-sampling the points on a grid. More precisely, we project both P_l and Q_l onto a regular 2-D grid embedded in a cylindrical coordinate system defined by the link l. Because P_l and Q_l are now encoded in regular grids, we can easily compute the difference, which can be compressed using image compression techniques. Figures 5(a)–(c) illustrate the prediction residual computed from the torso link. Note that because this projection is invariant to a rigid-body transform, we only need to re-sample Q_l at each time step.

We determine the size of a 2-D grid from the shape of a link l and the size of l's associated point set, |P_l|. We make sure that our grid size is at least 2|P_l| using the following formulation: the width and the height of the grid are $2\pi R_l S$ and $L_l S$, respectively, where $R_l$ and $L_l$ are the radius and the length of the link l and $S = \sqrt{\frac{|P_l|}{\pi R_l L_l}}$. We say that a grid encodes x% prediction residuals if the grid has size $2|P_l| \cdot \frac{x}{100}$. As we will see later, we can tune the grid size to produce various compression ratios and qualities.
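
The grid sizing and the cylindrical re-sampling can be sketched as below. Splitting the x% scaling evenly between the two grid dimensions, and the link_frame helper (which maps world points into the link's coordinate frame, z along the axis), are our assumptions.

```python
import numpy as np

def residual_grid_size(R_l, L_l, n_points, percent=100):
    """Grid of width 2*pi*R_l*S and height L_l*S, so that the total
    number of cells is about 2*|P_l| * percent/100."""
    S = np.sqrt(n_points / (np.pi * R_l * L_l))
    scale = np.sqrt(percent / 100.0)               # split x% over w and h
    width = max(1, round(2 * np.pi * R_l * S * scale))
    height = max(1, round(L_l * S * scale))
    return width, height

def project_to_grid(points, colors, link_frame, L_l, shape):
    """Re-sample a link's points on a regular grid in the link's
    cylindrical coordinate system (Fig. 5)."""
    local = link_frame(points)                     # into link coordinates
    theta = np.arctan2(local[:, 1], local[:, 0])   # angle around the axis
    u = ((theta + np.pi) / (2 * np.pi) * (shape[0] - 1)).astype(int)
    v = np.clip(local[:, 2] / L_l * (shape[1] - 1), 0, shape[1] - 1).astype(int)
    depth = np.full(shape, np.nan)                 # NaN marks empty cells
    color = np.zeros(shape + (3,))
    depth[u, v] = np.hypot(local[:, 0], local[:, 1])  # radial depth
    color[u, v] = colors                           # later points overwrite
    return depth, color
```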

Fig. 5 A skeleton and the torso link shown as a cylinder. (a) Color (top) and depth (bottom) maps at time t − 1 of the torso link. The Y-axis of the maps is parallel to the link. (b) Color (top) and depth (bottom) maps at time t of the torso link. (c) The differences between the maps at times t − 1 and t


6 System overview and experimental results

In this section, we first briefly describe our TI system (Sect. 6.1), and then present experimental results that evaluate our method using both synthetic (Sect. 6.2) and TI data (Sect. 6.3). Synthetic data is generated by sampling points from a polygonal mesh animated using motion capture (mocap) data obtained from the online mocap library at Carnegie Mellon University [8]. We study synthetic data because it allows us to compare estimated motion with mocap, thus providing a 'ground truth' with which to evaluate our method. To evaluate the quality of our skeleton-based compression on the TI data, we project the decompressed data with various levels of prediction residuals back to each of the 6 virtual camera coordinates. We also compare the compression ratio and quality of our skeleton-based compression to H.264 video compression [1] and to Yang et al.'s method [48]. All the experimental results in this section are obtained using a Pentium 4 3.2 GHz CPU with 512 MB of RAM. The best way to visualize our results is via the videos, which can be found on our project web page: http://tele-immersion.citris-uc.org.

6.1 System overview

Our tele-immersion apparatus consists of 48 Dragonfly cameras [39], which are arranged in 12 clusters covering a 360-degree view of the user(s). The cameras, equipped with 6 and 3.8 mm lenses, are mounted on an aluminum frame with dimensions of about 4.0 × 4.0 × 2.5 m (Fig. 2(a)). Each cluster consists of three black-and-white cameras intended for stereo reconstruction and a color camera used for texture acquisition. The cluster PCs are dual- or quad-CPU (Intel Xeon, 3.06 GHz) machines with 1 GB of memory.

Synchronization Image-based stereo reconstruction requires precise synchronization between the cameras to calculate per-pixel correspondence. To synchronize image acquisition on all 48 cameras, we use external triggering. The camera triggering pins are connected to the parallel port of the triggering server. In each data acquisition loop, the trigger computer sends a signal to initiate image capturing, while TCP/IP messaging is used by the cluster computers to notify the server once the reconstruction has been completed.

Calibration The accuracy of the stereo reconstruction is greatly affected by the quality of the camera calibration. Errors can occur due to lens distortion, misalignment between the image and lens planes, and deviations of position and rotation between the stereo cameras [50, 51]. The calibration of our multi-camera TI system is performed in two steps. The intrinsic calibration of camera parameters (i.e., distortion, optical center) is performed with the well-known Tsai algorithm [44] using a checkerboard. The extrinsic parameters of each camera (i.e., position, orientation) are obtained in the second step by capturing the position of an LED marker at about 10,000 points located inside the overlapping volume of all cameras.

Reconstruction algorithm The stereo reconstruction algorithm, which runs independently on each cluster computer, is composed of several steps [18]. First, the image is rectified and the lens distortion is corrected. Next, the background is subtracted from the image using a modified Gaussian average algorithm [38]. Finally, the depth map is computed pairwise from the three gray-scale images using a triangulation method.

Rendering Data streams from the twelve calibrated clusters are composed into a 3D image by a point-based rendering application developed using OpenGL. Based on the camera calibration parameters, the renderer combines the points received from each cluster into the full 3D reconstruction of the TI user (Fig. 2(c)). The complexity of the stereo reconstruction currently limits the frame rate to about 5 to 6 frames per second (fps). The required network bandwidth for this acquisition rate is below 5 Mbps. With our new stereo reconstruction algorithm that is currently under development [30], we anticipate a significant increase in frame rate and resolution. With this improvement in the captured data, the bandwidth requirements can easily exceed 1 Gbps; therefore a better compression technique for the 3D data, such as the one proposed in this paper, becomes even more critical.

6.2 Motion estimation of synthetic data

We study the robustness of our motion estimation using four synthetic data sets: dance, crouch & flip, turn, and tai-chi. Each frame in the synthetic data contains about 10,000 points without color, and 75% of the points are removed randomly to eliminate easy correspondences. The table below shows a summary of the mean motion estimation frame rates. Clearly, all motions can be estimated in real time (i.e., at more than 24 frames per second) by our method.

Motion estimated   Dance (Fig. 6)   Crouch & flip (Fig. 7)   Turn (Fig. 8)   Tai-chi (not shown)
Avg. fps           39.1 fps         29.6 fps                 48.4 fps        61.3 fps

The motion capture data from Carnegie Mellon University [8] is composed of a skeleton and a sequence of joint angles for each joint. In order to measure the quality of the estimated motion, we convert the joint angles to joint positions.


Fig. 6 Top: The postures from (a) Rond de Jambe motion capture data, and (b) estimation with global constraints and (c) without global constraints. Bottom: Tracking errors with and without global constraints. The mean errors with and without global constraints are 0.01 and 0.12, respectively, and the maximum errors with and without global constraints are 0.02 and 0.20, respectively

Then, we measure the quality of the estimated motion as the normalized distance offset e_t between all joint pairs, i.e.,

$$e_t = \frac{1}{n \cdot R} \sum_{i=1}^{n} \big| j_i^{\text{est}} - j_i^{\text{mocap}} \big|,$$

where n is the number of joints, $j_i^{\text{est}}$ and $j_i^{\text{mocap}}$ are the ith estimated and mocap joint positions, respectively, and R is the radius of the minimum bounding ball of the point cloud, i.e., we scale the skeleton so that the entire skeleton fits inside a unit sphere. In the following paragraphs, we compare the quality of our method in several scenarios.
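
Computed directly from the definition, e_t is a one-liner; the joint arrays are n × 3 and R is the bounding-ball radius.

```python
import numpy as np

def tracking_error(est_joints, mocap_joints, R):
    """Normalized distance offset e_t between estimated and mocap joints."""
    n = len(est_joints)
    return np.linalg.norm(est_joints - mocap_joints, axis=1).sum() / (n * R)
```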

With and without global constraints We compare the difference between the tracking algorithm with and without the articulation and global fitting constraints. Figure 6 shows that the global constraints have a significant influence on motion estimation quality.

Downsampling factor In this experiment, we study how the point count (% of downsampling) affects the motion estimation quality. A 10% downsampling of a point set with 10,000 points means that only 1000 points are used for estimating motion. Figure 7 shows our experimental results. As we can see from the results, the downsampling rate does not have a significant influence on the quality until it drops to 1% for this crouch & flip motion. This is important because the experiment shows that we can indeed downsample significantly without introducing visible estimation errors.

Fig. 7 Tracking errors with different downsampling factors

Fig. 8 Tracking errors with different noise intensities

Noise level In this experiment, we study how noise affects the quality. The noise is introduced by perturbing each point in a random direction by a distance x, where x is a random variable with a Gaussian distribution. From Fig. 8, it is clear that as we increase the noisiness of the points, the difference between the estimated motion and the "ground truth" increases. However, the overall difference remains small even for very noisy data.
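
A sketch of this perturbation, with sigma as our name for the noise intensity:

```python
import numpy as np

def perturb(points, sigma, rng=None):
    """Move each point along a random unit direction by a distance drawn
    from a zero-mean Gaussian with standard deviation sigma."""
    if rng is None:
        rng = np.random.default_rng()
    directions = rng.normal(size=points.shape)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    distances = rng.normal(0.0, sigma, size=(len(points), 1))
    return points + distances * directions
```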

6.3 Compressing TI data

We evaluate the results of our compression method using four motion sequences captured by our TI system. These motions are performed by two professional dancers, a student, and a tai-chi master. The data captured by our TI system contain about 7 to 9 thousand points in each frame. Because the TI data is much noisier than the synthetic data and can have significant missing data (due to occlusions) or overlapping data (due to overlapping camera views), estimating motion from the TI data is more difficult; we therefore use 50% of the initial point set as our model. This value was selected experimentally. The table below shows that we can still maintain at least a 10 fps interactive rate in all studied cases.

Motion estimated   Dancer 1 (Fig. 3)   Dancer 2 (Fig. 5)   Student (Fig. 1)   Tai-chi master (Fig. 10)
Avg. fps           11.9 fps            11.5 fps            12.5 fps           12.6 fps

Quality Unlike the synthetic data, we do not have a ground truth with which to directly evaluate the quality of our method on TI data. Our approach is to compute the differences between the point data without compression (called input data) and the data compressed and decompressed using the proposed method (called compressed data). In order to measure the difference between these two point sets, we render images of each point set from six camera views in each time frame. These cameras are separated by 60 degrees and are placed at a radius and y coordinate such that the whole body can be captured. Then, we compute the "peak signal-to-noise ratio" (PSNR) for each pair of images (I, J) rendered from the input data and the compressed data by the same camera in the same frame. PSNR is defined as $20 \log_{10}(\frac{255}{\text{rmse}})$ [45], where $\text{rmse} = \sqrt{\sum_i |I_i - J_i|^2}$ is the root mean squared error of images I and J. A larger PSNR indicates a better quality. Typical PSNR values in lossy image compression are between 20 and 40 dB [45]. The results of our study are summarized in Table 1.
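
Implemented exactly as printed, PSNR is a few lines; note that this rmse is a root sum of squared differences, whereas a conventional RMSE would also divide by the pixel count before the square root.

```python
import numpy as np

def psnr(I, J):
    """PSNR between two 8-bit rendered images, per the formula above."""
    rmse = np.sqrt(((I.astype(float) - J.astype(float)) ** 2).sum())
    if rmse == 0:
        return np.inf          # identical images
    return 20 * np.log10(255.0 / rmse)
```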

In Table 1, we also measure the quality by varying the level of residuals considered. We considered three compression levels, i.e., compression without residuals and with 25% and 100% residuals. A compression with x% residuals means its residual maps are x% of the size of the full residual maps. We see that encoding residuals indeed generates a better reconstruction than compression without residuals, by about 4 dB. Figure 9 provides even stronger evidence that considering residuals always produces better reconstructions for all frames. Another important observation is that the compression quality remains the same (around 30 dB) for the entire motion. Figure 10 shows that the difference is more visible when the prediction residuals are not considered.

We would like to note that PSNR may not fully reflect the compression quality. For example, the compressed data generated using no residuals looks much more robotic and unnatural than that generated using only 25% residuals. This difference is easily observed in the rendered animation but not in Table 1. Another property that is not reflected by measuring PSNR is that some noisy data or outliers are removed after compression.

Compression ratio One of the motivations of this work is that no existing TI compression method can provide compression ratios high enough for real-time transmission. In this experiment, we show that our skeleton-based compression method can achieve 50:1 (with 100% residuals) to 5000:1 (without residuals) compression ratios. As we have shown earlier, our compression method can provide different compression ratios by varying the level of residuals considered during encoding. To increase the compression

Fig. 9 PSNR values from the tai-chi motion. Each point in the plot indicates a PSNR value from a rendered image. For each time step, 12 dots are plotted, indicating 12 images rendered from two point sets with 100% and 0% residuals. The bold lines are the mean PSNR values for 100% (top) and 0% (bottom) residuals

Table 1 Compression quality. The "peak signal-to-noise ratio" (PSNR) is computed between rendered images of the input and compressed point data. PSNR is defined as $20\log_{10}(\frac{255}{\text{rmse}})$, where $\text{rmse} = \sqrt{\sum_i |I_i - J_i|^2}$ is the root mean squared error of images I and J. We also measure the quality by varying the levels of residual considered. A compression with x% residuals means its residual maps are x% of the size of the full residual maps

Motion                       Dancer 1   Dancer 2   Student    Tai-chi master
Avg. PSNR   no residuals     23.87 dB   24.10 dB   25.82 dB   26.27 dB
            25% residuals    25.77 dB   26.18 dB   27.96 dB   28.75 dB
            100% residuals   27.95 dB   28.27 dB   29.91 dB   30.83 dB


Fig. 10 Reconstructions from the compressed data and their differences from the uncompressed data. (a) Uncompressed data. (b) Compressed without residuals. (c) Difference between (a) and (b). (d) Compressed with 25% residuals. (e) Difference between (a) and (d)

Table 2 Compression ratio. Both Yang et al.'s [48] and H.264 (we use an implementation from [1]) compression methods took the color and depth images as their input and output. The compression ratio of H.264 reported in this table is obtained using 75% of its best quality. We use the JPEG and PNG libraries to compress color and depth residuals, respectively

Motion                             Dancer 1    Dancer 2    Student    Tai-chi master
Size before compression            142.82 MB   476.07 MB   1.14 GB    988.77 MB
Compression   Yang et al.'s [48]   11.36       11.73       10.23      14.41
ratio         H.264 [1]            64.04       53.78       32.04      49.58
              no residuals         1490.31     3581.52     5839.59    5664.55
              25% residuals        195.62      173.52      183.80     183.82
              100% residuals       66.54       55.33       60.29      61.43

further, we use the JPEG and PNG libraries to compress the color and depth residuals, respectively. In addition, we compare our compression method to other compression methods, including the method proposed by Yang et al. [48] and H.264 (for H.264, we compare to an implementation in Apple's QuickTime [1]). We summarize our experimental results in Table 2.

We would like to note that we can achieve very high compression ratios while maintaining reasonable reconstruction quality (as shown earlier) because our method is fundamentally different from the other two methods. Both Yang et al.'s method and H.264 are image (or video) based compressions, which take color and depth images as their input and output. In contrast, our skeleton-based compression converts the color and depth images into motion parameters and prediction residuals. Moreover, even though H.264 provides high-quality and high-ratio compression, it does not offer real-time compression for the amount of data considered in this work.

7 Conclusion

In this paper we have presented a skeleton-based data compression method aimed at use in tele-immersive environments. Data produced by full-body 3D stereo reconstruction requires high network bandwidth for real-time transmission. The current implementation of the image/video-based compression method [48] in our tele-immersion system only allows transmission of data at a rate of 5 or 6 fps. Using model-based compression techniques can significantly reduce the amount of data transferred between remote TI sites.

Using the real-time (10+ fps) motion estimation technique described in this paper, we can compress data by converting the point cloud data to a limited number of motion parameters, while the non-rigid movements are encoded in a small set of regular grid maps. Prediction residuals are computed by projecting the points associated with each link onto a regular 2D grid embedded in a cylindrical coordinate system defined by the skeleton link.

Our experiments showed that our compression method provides adjustable, high compression ratios (from 50:1 to 5000:1) with reasonable reconstruction quality, with peak signal-to-noise ratios (PSNR) from 28 to 31 dB.

Limitations and future work The proposed method, however, has several drawbacks. In the case of occlusions, the motion estimation fails, since the skeleton cannot be fitted uniquely into the point cloud data. The algorithm is also not able to cope with non-rigid motions of the body (e.g., long hair, a user wearing a skirt, etc.). Compared to current video compression standards, the PSNR values of the described algorithm are still low. The described algorithm could be extended to motion-based compression of multiple targets (i.e., two or more users inside the TI system).

Currently, we are upgrading our TI system with an improved and more robust stereo reconstruction algorithm which will produce less noisy point clouds at a higher frame rate (15 to 20 fps) [30]. We have shown that the integration of the described skeleton-based compression algorithm can significantly reduce our network bandwidth requirements. The extracted skeleton information can be used to interact with objects in the virtual environment or to analyze the subject's movement. The kinematic data can also be applied as feedback to the user while learning motions. For example, in the case of tai-chi learning via tele-immersion [37], the student could receive real-time feedback on how close his/her movement is to the teacher's movement. In addition, the skeletonization method can be applied off-line to analyze motion in connection with other motion sensor technologies (e.g., accelerometers).

References

1. Adobe: QuickTime 7.0 H.264 implementation (2006)
2. Aggarwal, J.K., Cai, Q.: Human motion analysis: a review. Comput. Vis. Image Underst. 73(3), 428–440 (1999)
3. Arikan, O.: Compression of motion capture databases. ACM Trans. Graph. 25(3), 890–897 (2006)
4. Baker, H., Tanguay, D., Sobel, I., Gelb, D., Gross, M., Culbertson, W., Malzenbender, T.: The coliseum immersive teleconferencing system. In: Proceedings of International Workshop on Immersive Telepresence, Juan-les-Pins, France (2002)
5. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
6. Besl, P.J., McKay, N.D.: A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14(2), 239–256 (1992)
7. Cheung, K.M., Baker, S., Kanade, T.: Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2003
8. CMU: Graphics Lab Motion Capture Database. Carnegie Mellon University. http://mocap.cs.cmu.edu/
9. Demirdjian, D., Darrell, T.: 3-D articulated pose tracking for untethered deictic reference. In: ICMI'02: Proceedings of the 4th IEEE International Conference on Multimodal Interfaces, Washington, DC, p. 267. IEEE Comput. Soc., Los Alamitos (2002)
10. Dewaele, G., Devernay, F., Horaud, R.: Hand motion from 3D point trajectories and a smooth surface model. In: ECCV (1), pp. 495–507 (2004)
11. Gavrila, D.M.: The visual analysis of human movement: a survey. Comput. Vis. Image Underst. 73(1), 82–98 (1999)
12. Gross, M., Würmlin, S., Naef, M., Lamboray, E., Spagno, C., Kunz, A., Koller-Meier, E., Svoboda, T., Gool, L.V., Lang, S., Strehlke, K., Moere, A.V., Staadt, O.: Blue-c: a spatially immersive display and 3D video portal for telepresence. ACM Trans. Graph. 22(3), 819–827 (2003)
13. Gumhold, S., Karni, Z., Isenburg, M., Seidel, H.-P.: Predictive point-cloud compression. In: Siggraph 2005 Sketches (2005)
14. Hasenfratz, J., Lapierre, M., Sillion, F.: A real-time system for full-body interaction with virtual worlds. In: Proceedings of Eurographics Symposium on Virtual Environments, pp. 147–156. Eurographics Association, Aire-la-Ville (2004)
15. Holden, M.K.: Virtual environments for motor rehabilitation: review. Cyberpsychol. Behav. 8(3), 187–211 (2005)
16. Ibarria, L., Rossignac, J.: Dynapack: space-time compression of the 3D animations of triangle meshes with fixed connectivity. In: SCA'03: Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Aire-la-Ville, Switzerland, pp. 126–135. Eurographics Association, Aire-la-Ville (2003)
17. Isard, M., Blake, A.: Condensation—conditional density propagation for visual tracking. Int. J. Comput. Vis. 29(1), 5–28 (1998)
18. Jung, S., Bajcsy, R.: A framework for constructing real-time immersive environments for training physical activities. J. Multimed. 1(7), 9–17 (2006)
19. Kalra, P., Magnenat-Thalman, N., Moccozet, L., Sannier, G., Aubel, A., Thalman, D.: Real-time animation of realistic virtual humans. IEEE Comput. Graph. Appl. 18(25), 42–56 (1998)
20. Karni, Z., Gotsman, C.: Compression of soft-body animation sequences. Comput. Graph. 28(1), 25–34 (2004)
21. Keshner, E.A.: Virtual reality and physical rehabilitation: a new toy or a new research and rehabilitation tool? J. Neuroengineering Rehabil. 1(8) (2004)
22. Keshner, E., Kenyon, R.: Using immersive technology for postural research and rehabilitation. Assist. Technol. 16, 54–62 (2004)
23. Knoop, R.D.S., Vacek, S.: Sensor fusion for 3D human body tracking with an articulated 3D body model. In: Proceedings of the IEEE International Conference on Robotics and Automation, Walt Disney Resort, Orlando, Florida, 15 May 2006
24. Kum, S.-U., Mayer-Patel, K.: Real-time multidepth stream compression. ACM Trans. Multimedia Comput. Commun. Appl. 1(2), 128–150 (2005)
25. Lamboray, E., Würmlin, S., Gross, M.: Real-time streaming of point-based 3D video. In: VR'04: Proceedings of the IEEE Virtual Reality 2004, Washington, DC, p. 91. IEEE Comput. Soc., Los Alamitos (2004)
26. Lanier, J.: Virtually there. Sci. Am. 4, 52–61 (2001)
27. Lengyel, J.E.: Compression of time-dependent geometry. In: SI3D'99: Proceedings of the 1999 Symposium on Interactive 3D Graphics, New York, NY, pp. 89–95. ACM, New York (1999)
28. Li, L., Zhang, M., Xu, F., Liu, S.: Ert-vr: an immersive virtual reality system for emergency rescue training. Virtual Real. 8(3), 194–197 (2005)
29. Lien, J.-M., Kurillo, G., Bajcsy, R.: Skeleton-based data compression for multi-camera tele-immersion system. In: Proceedings of the International Symposium on Visual Computing (ISVC), pp. 347–354 (2007)
30. Lopez, E.J.L.: Finite element surface-based stereo 3D reconstruction, April 2006. Poster presentation given at the Trust NSF Site Visit
31. Alexa, M., Müller, W.: Representing animations by principal components. Comput. Graph. Forum 19(3), 411–418 (2000)
32. Mason, H., Moutahir, M.: Multidisciplinary experiential education in second life: a global approach. In: Second Life Education Workshop, San Francisco, California, pp. 30–34 (2006)
33. McComas, J., Pivik, J., Laflamme, M.: Current uses of virtual reality for children with disabilities. In: Virtual Environments in Clinical Psychology and Neuroscience (1998)
34. Morency, L.-P., Darrell, T.: Stereo tracking using ICP and normal flow constraint. In: Proceedings of International Conference on Pattern Recognition (2002)
35. Mulligan, J., Daniilidis, K.: Real-time trinocular stereo for tele-immersion. In: Proceedings of 2001 International Conference on Image Processing, Thessaloniki, Greece, pp. 959–962 (2001)
36. Ochotta, T., Saupe, D.: Compression of point-based 3D models by shape-adaptive wavelet coding of multi-height fields. In: Symposium on Point-Based Graphics, pp. 103–112 (2004)
37. Patel, K., Bailenson, J.N., Hack-Jung, S., Diankov, R., Bajcsy, R.: The effects of fully immersive virtual reality on the learning of physical tasks. In: Proceedings of the 9th Annual International Workshop on Presence, Ohio, USA, pp. 87–94 (2006)
38. Piccardi, M.: Background subtraction techniques: a review. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Hague, Netherlands, pp. 3099–3104 (2004)
39. Point Grey Research Inc., Vancouver, Canada
40. Robb, R.: Virtual reality in medicine: a personal perspective. J. Vis. 5(4), 317–326 (2002)
41. Rusinkiewicz, S., Levoy, M.: Efficient variants of the ICP algorithm. In: Proceedings of the Third International Conference on 3-D Digital Imaging and Modeling (3DIM), pp. 145–152 (2001)
42. Sattler, M., Sarlette, R., Klein, R.: Simple and efficient compression of animation sequences. In: SCA'05: Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, New York, NY, pp. 209–217. ACM, New York (2005)
43. Simon, D., Hebert, M., Kanade, T.: Real-time 3-D pose estimation using a high-speed range sensor. In: Proceedings of IEEE International Conference on Robotics and Automation (ICRA'94), vol. 3, pp. 2235–2241 (1994)
44. Tsai, R.: A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE J. Robot. Autom. RA-3(4), 323–344 (1987)
45. Pennebaker, W.B., Mitchell, J.L.: JPEG: Still Image Data Compression Standard. Springer, Berlin (1992)
46. Würmlin, S., Lamboray, E., Gross, M.: 3D video fragments: dynamic point samples for real-time free-viewpoint video. Comput. Graph. 28, 3–14 (2004)
47. Yang, Y., Wang, X., Chen, J.X.: Rendering avatars in virtual reality: integrating a 3D model with 2D images. Comput. Sci. Eng. 4(1), 86–91 (2002)
48. Yang, Z., Cui, Y., Anwar, Z., Bocchino, R., Kiyanclar, N., Nahrstedt, K., Campbell, R.H., Yurcik, W.: Real-time 3D video compression for tele-immersive environments. In: Proc. of SPIE/ACM Multimedia Computing and Networking (MMCN'06), San Jose, CA (2006)
49. Zhang, J., Owen, C.B.: Octree-based animated geometry compression. In: DCC'04: Proceedings of the Conference on Data Compression, Washington, DC, p. 508. IEEE Comput. Soc., Los Alamitos (2004)
50. Zhang, D., Nomura, Y., Fujii, S.: Error analysis and optimization of camera calibration. In: Proceedings of IEEE/RSJ International Workshop on Intelligent Robots and Systems (IROS'91), Osaka, Japan, pp. 292–296 (1991)
51. Zhao, W., Nandhakumar, N.: Effects of camera alignment errors on stereoscopic depth estimates. Pattern Recogn. 29(12), 2115–2126 (1996)

Jyh-Ming Lien is currently (2007–) Assistant Professor in the Computer Science Department at George Mason University. He received B.Sc. (1999) and Ph.D. (2006) degrees in Computer Science from the National Chengchi University, Taiwan, and from Texas A&M University, respectively. He was a Postdoctoral Researcher at the University of California, Berkeley.

Gregorij Kurillo received B.Sc. (2001) and D.Sc. (2006) degrees in Electrical Engineering from the University of Ljubljana, Slovenia. There he was a Research Assistant with the Laboratory of Robotics and Biomedical Engineering from 2002 to 2006. He has participated in two European Union projects aimed at assistive technology. Dr. Kurillo has been a Postdoctoral Researcher with the University of California, Berkeley, since 2006, where he is working on the tele-immersion project. His research interests include multi-camera calibration, stereo vision, virtual reality, and robotics.

Ruzena Bajcsy was Director of CITRIS and Professor in the EECS Department at the University of California, Berkeley. Prior to coming to Berkeley, she was Assistant Director of the Computer Information Science and Engineering Directorate (CISE) between December 1, 1998 and September 1, 2001. She came to the NSF from the University of Pennsylvania, where she was Professor of Computer Science and Engineering. She was also Director of the University of Pennsylvania's General Robotics and Active Sensory Perception Laboratory, which she founded in 1978. In 2001 she became a recipient of the ACM A. Newell award.
Dr. Bajcsy received her Master's and Ph.D. degrees in Electrical Engineering from Slovak Technical University in 1957 and 1967, respectively. She received a Ph.D. in Computer Science in 1972 from Stanford University, and since that time has been teaching and doing research at Penn's Department of Computer and Information Science. She began as Assistant Professor and within 13 years became Chair of the department.