
Human Sensing Using Computer Vision for Personalized Smart Spaces

Dipak Surie, Saeed Partonia, Helena Lindgren User Interaction and Knowledge Modeling Group

Dept. of Computing Science Umeå University, Sweden

{dipak, mcs10spa, helena}@cs.umu.se

Abstract—Smart spaces are everyday environments augmented with computing technologies that enhance human experience and activity performance. Continuous, real-time recognition of the presence of people, their identity, location, movement and activity patterns is a key challenge to address if smart spaces are to be envisioned as personalized and adaptive spaces. This paper introduces the multiple technologies available for human sensing and identification, discussing their advantages and disadvantages. In particular, Kitchen As-A-Pal is described as a smart space with real-time human sensing capabilities based on computer vision, fusing Fisherface recognition and skeletal tracking. A wall-mounted Kinect is used for both single-occupant and multi-occupant settings in Kitchen As-A-Pal. The fused approach achieves 91.75% precision and 66% recall for human identity recognition in the single-occupant setting, with good smart space coverage. Challenges remain for human identity recognition in multi-occupant settings.

Keywords—human sensing; human identity recognition; computer vision; smart spaces; ubiquitous computing

I. INTRODUCTION

Advancements in ubiquitous sensing, communication, actuation and interaction technologies are enabling the possibility of a smart world in which computational intelligence is thoroughly integrated within everyday physical environments occupied by people [59]. Such environments, also referred to as smart spaces, are expected to offer reliable and personalized services that are adaptive to natural human abilities and lifestyles. Smart spaces extend the realm of computing to include novel applications and services that enhance human experience [1, 51] and activity performance, and in doing so are efficient, comfortable, and enjoyable [47].

Human sensing [52] is a cornerstone for the realization of smart spaces. Typical human sensing systems that exist today include lights that automatically turn on/off depending on human presence detected using simple sensors like passive infrared detectors or non-contact MEMS thermal sensors. While such systems offer reactive and shallow smartness, systems that can identify who the occupants of a smart space are and what their preferences are offer the possibility of proactive and personalized smartness. User modeling [11] is a field closely related to human sensing: the former usually depends on explicitly created human models while the latter strives for implicitly trained human models using sensors, so the two complement each other. Human sensing, and human identity recognition in particular, is useful for applications that offer information access and control to humans. Initial work on human identification using face recognition for smart spaces was presented in [37]. In a smart home setting, who can enter the home, turn on specific TV channels, eat a piece of cake from the refrigerator or listen to a specific music track is determined by human identity. Knowing that a person has not been following healthy eating habits, has had disturbed sleep during the entire weekend or has not been brushing their teeth regularly can trigger the smart home to offer persuasion and support for that individual in moving towards an enhanced lifestyle. Such information can also be connected to healthcare systems offering proactive support to occupants of a smart home, especially elderly people leading an independent life [29].

The primary goal of this paper is to present a continuous, real-time computer vision based infrastructure for unobtrusive human sensing in Kitchen As-A-Pal [50, 48], an interactive and personalized smart space. The paper is organized as follows: section 2 describes the dimensions of human sensing within smart spaces with an introduction to Kitchen As-A-Pal, and related work. Section 3 describes a computer vision based approach to continuous human sensing and section 4 presents the evaluation results. Section 5 provides challenges and conclusions.

II. HUMAN SENSING WITHIN SMART SPACES

Smart spaces are developed using a human-centric approach, viewing human occupants as the central focus [49]. While technological advancements are key to the realization of smart spaces, human experience within such spaces determines their success [51]. To enhance human experience, such spaces require real-time information about their occupants at different levels of granularity, ranging from knowing whether a person is present to knowing who the person is, where they are located and what they are doing. According to Nakamura [35], the purpose of human sensing is to observe and analyze humans, their social interaction and the activities performed, in order to develop smart spaces as novel information media spaces. Teixeira et al. [52] describe a taxonomy of the human properties that are associated with human sensing. Restricting itself to the spatio-temporal properties, this work focuses on human presence, count, location, movement and identity, leaving human behavioral and activity properties to future work. Refer to Table 1 for the taxonomy of human properties relevant for smart spaces, with their information content being cumulative in nature [52]. Human presence enables counting the number of humans in a space and detecting their location. Location information captured over time facilitates tracking human movement patterns, and a track can be assigned a permanent unique label for human identity recognition. Proxemics awareness [48, 15] refers to the spatial relationship between objects and humans in an environment. It is closely related to the concept of human sensing, and together they are useful for developing smart spaces.


Table 1. Taxonomy of human properties relevant for smart spaces, inspired by [52].

A. Kitchen As-A-Pal as a Smart Space

Kitchen As-A-Pal is an interactive smart kitchen that is aware of its individual occupants and offers personalized and adaptive services. The name “As-A-Pal” means “like a friend”. Human sensing, and recognizing the identity of its occupants, is a key feature for offering services that are truly smart. Here the term “smart” is not viewed as objective and the same for all humans, but is regarded as the subjective personal experience of individual humans. Kitchen As-A-Pal is expected to be adaptive to natural human abilities and to compensate for their limitations by offering enhanced everyday activity support. We envision the kitchen to be a companion that is always there to support humans and a building block of a holistic approach to ambient assisted living [29]. Kitchen As-A-Pal is located at MIT-Huset, Umeå University, Sweden, as an environment for developing and testing ubiquitous computing technologies with potential users participating in the interaction design process. Kitchen As-A-Pal has an area of 14.7 m² with spaces for dining, kitchen appliances, mixing and cutting, cooking, dishing and storage. The kitchen environment is capable of supporting everyday kitchen activities like preparing a sandwich, preparing coffee, having breakfast, and doing the dishes. Kitchen As-A-Pal comprises an ecology of smart objects augmented with simple state-change sensors and RFID technology used for proxemics awareness. Further description of the smart object ecology is beyond the scope of this paper and is available elsewhere [50, 48].

B. Related Work

Human sensing and identification by computing systems goes back to the concept of username/password and PIN (personal identification number). However, such an approach is not suitable for smart spaces, where human interaction with computing systems is expected to be natural, ambient and, if possible, implicit. Human sensing and identification is achieved using a wide range of approaches: biometrics, speech, body postures, interaction with objects, actions performed, smart phone usage, RFID technology and sensor networks. Early efforts on human sensing include biometrics using physical or behavioral traits of humans: body part analysis, gait characteristics [58], face, iris [39, 57], finger and palm prints, and hand geometry [12]. Biometric identifiers cannot be stolen and are always with the human actor; however, some of them require explicit human cooperation and cannot be recognized naturally from a distance, making them less conducive to smart spaces. Near-infrared images of the iris provide complex patterns useful for identity recognition of even monozygotic twins [7]; however, capturing an eye image of sufficient quality in real-world environments is a challenge and only works within 1 m of the camera [32]. Face and gait recognition are suitable for smart spaces. Human identification using gait patterns from Kinect and an adaptive neural network is researched in [44]. Gait can be detected from long distances with low-resolution images; however, such an approach requires the human actors to keep walking constantly for recognition.

Human detection includes determining where a person is in space [5, 41]. While human detection is usually performed using RGB cameras, similar to how humans detect other humans in a space using vision, a pure RGB image based approach struggles to detect human body parts and motion due to pose variation, occlusion and complex backgrounds, which increases the computational cost and makes real-time processing difficult. Dense depth information at every point in 2D, obtained using time-of-flight sensors, is gaining popularity for human sensing due to its robustness to color and illumination changes. Microsoft Kinect is cheap, available off-the-shelf, and an alternative to the laser scanners used in robotics. While stereo cameras have been used for human pose detection [6, 8, 62, 10], stereo camera calibration and the matching process are computationally expensive, making them unsuitable for real-time processing. Human detection using depth information from Kinect is explored in [60], where a 2D head contour model using Chamfer distance matching and a 3D head surface model are fused. Fusing RGB and depth information for human body and pose detection is researched in [21]. Appearance-based HOG (Histograms of Oriented Gradients) fused with depth information is used in [40, 46]. However, such approaches do not perform human identity recognition.

Human body part detection using depth data is explored in [38, 19]. Real AdaBoost is used to classify human and non-human objects, and mean-shift clustering in 3D space is done to determine human location in [19]. [38] deals with real-time identification and localization of body parts from depth images, where hands, feet and head are classified using AGEX (accumulative geodesic extrema) interest points computed from depth data. Human body part detection is useful for facilitating gesture-based interaction in smart spaces. Body part detection is also researched in [36]. Such approaches often ignore the human identity tracking useful for smart spaces. Robust face detection and tracking using a boosted adaptive particle filter under various tracking scenarios, like changing distance from the camera, illumination variations, moderate occlusion and head posture variations, is described in [65]. Face detection works well in multi-occupant settings where frontal and side faces are detected. These varied conditions were implicitly present in the breakfast scenario, described later for our system evaluation, making the human identity recognition problem harder. Human identity recognition is not addressed in [65]. Human sensing by keeping track of human motion [55, 13]; human location and count using ultra-low-power camera sensor nodes [53]; and human actions using histograms of 3D skeletal joint locations from Kinect depth maps [61] has also been researched.

Multimodal approaches are useful for smart spaces since one information channel might be insufficient for human sensing [3, 9, 17, 31]. Multiple modalities (face, gait and voice) for human identification and tracking using a state transition model are researched in [33]. They track a discrete set of events based on human movement patterns in an environment. While continuous monitoring involves gathering large amounts of data, it is important for enhancing human experience, and real-time processing removes the need for large data storage. Simultaneous tracking of multiple occupants using face and voice recognition cues in a smart room, based on a Bayesian filtering method, is researched in [3].

Face recognition for human identity recognition is a widely researched area [18, 65]. Frontal face recognition based on robust alignment and illumination by sparse representation is researched in [56], which works well in real-world settings due to the handling of illumination, image misalignment and occlusion in the training data. However, this approach requires a person to be in front of a camera, limiting the sensing coverage area that is important for smart spaces. Face recognition based on thermal infrared face images has been researched in recent years [16, 45], where the heat energy of various objects is used for face recognition. Unlike IR reflection based approaches, thermal techniques are immune to scattering and work with occlusion caused by moustaches, hairstyles, beards, etc. [28]. Also, fake pictures posing as a human actor can be filtered out, since heat emission is a property of living beings [26, 20]. Thermal face recognition could be a future extension to the work presented in this paper.

Recognizing human identity using passive RFID technology is not suitable for smart spaces since human actors must explicitly register their identity. Implicit RFID-based tracking requires long-range RFID readers and active tags that must always be worn by human actors, which can be obtrusive. Fusing computer vision with RFID data for tracking people in robotics is investigated in [14]. Tracking the identity of a person based on their personal object usage, like a toothbrush or a shaving razor, is interesting for smart spaces; however, it requires the human actors to be constantly active, which may not be the case in reality [24]. A sensor network on the floor is used to recognize humans based on their movement and walking patterns in the Aware Home [25]. However, such an approach requires a dense network of pressure sensors, creating scalability issues.

III. COMPUTER VISION FOR HUMAN SENSING

Computer vision for human sensing has traditionally focused on using RGB images. While such an approach comes with inherent challenges like lighting conditions, variations in clothing, environmental complexity, etc., the main problem with it is human sensing coverage within a smart space. To identify a human using RGB images, their frontal face should be in front of the camera within a restricted space. For an application like automatically opening a door depending on who is in front of it, RGB image based computer vision works fine. However, human actors are usually mobile, moving around a space and performing everyday activities with varied body and face postures, making it difficult to apply a traditional face recognition approach to smart spaces. In this work, face recognition is fused with skeletal tracking, thereby enhancing the coverage area of human sensing within a smart space. Refer to Fig. 1 for the human sensing architecture.

A. Kinect as a Ubiquitous Sensor

Microsoft Kinect [63] is used as the image acquisition sensor, producing both RGB images and depth images. The RGB images can be obtained at a high resolution of 1280x1024 pixels (the sensor supports about 10 Hz, but we used 4 Hz for real-time processing), useful for accurate face recognition. The depth images are obtained using infrared time-of-flight measures taken to reflect IR light off the environmental obstacles. Kinect offers IR grid projection across the scene, useful for modeling curved surfaces including the human body and body parts, unaffected by varying ambient light. Kinect can detect 6 people simultaneously, and can actively keep track of the movement patterns of 2 people, yielding 20 skeletal joints per person. The multi-occupant setting described later is restricted to 2 persons for this reason. Skeletal tracking of the human body for creating a stick skeleton model using Kinect is researched in related work [23].

Fig. 1. Human sensing infrastructure: fusion of face recognition and skeletal tracking approaches.

The field of view of the Kinect camera extends over a range of 1.2 to 3.5 m, thereby covering a major part of Kitchen As-A-Pal where human presence is possible (no other obstacles, like furniture, for instance). For accurate face recognition, a maximum distance of 2.6 m (with a minimum 50x50 pixel frontal face image) is used as the threshold. A viewing angle of 57° horizontally and 43° vertically determines the scope of the space in which human sensing is done. A Kinect is mounted on a wall in Kitchen As-A-Pal, without the motorized tilting facility, such that the viewing height and orientation maximize human sensing coverage. Note that the Kinect was placed approximately in line with the facial height of an adult standing on the kitchen floor. Since Kinect is available off-the-shelf at an affordable price for capturing RGB and depth images, it is gaining popularity within the computer vision research community and at home as a motion-sensing device for gaming. While Kinect is beginning to be used as a ubiquitous computing sensor, limited research exists so far presenting empirical results using this sensor for smart spaces. This paper can be viewed as one such effort to understand Kinect's usage for smart spaces, taking into account the different practical challenges.
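As a rough illustration, these spatial thresholds could be encoded as follows. This is a minimal sketch of our own: the constants come from the measures above, while the function and variable names are assumptions for illustration only.

import math

# Spatial thresholds from the text (Kinect v1, single wall-mounted sensor).
MIN_RANGE_M = 1.2          # near limit of skeletal tracking
MAX_RANGE_M = 3.5          # far limit of skeletal tracking
FACE_MAX_RANGE_M = 2.6     # beyond this, frontal faces fall below 50x50 px
H_FOV_DEG = 57.0           # horizontal field of view
V_FOV_DEG = 43.0           # vertical field of view

def in_sensing_volume(x, y, z):
    """True if a point (metres, Kinect-centred coordinates, z = depth)
    lies inside the pyramidal field of view and the usable depth range."""
    if not (MIN_RANGE_M <= z <= MAX_RANGE_M):
        return False
    h_limit = z * math.tan(math.radians(H_FOV_DEG / 2))
    v_limit = z * math.tan(math.radians(V_FOV_DEG / 2))
    return abs(x) <= h_limit and abs(y) <= v_limit

def face_recognizable(z):
    """True if the subject is close enough for the 50x50 px face threshold."""
    return z <= FACE_MAX_RANGE_M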

B. Face Detection and Recognition

High-resolution RGB images acquired from Kinect are pre-processed before performing face detection. Refer to Fig. 1. Face detection in real-time is computationally expensive, resulting in a need for pre-processing. In this work, the high-resolution images are reduced to a smaller size (downscaling by a factor of 2) and converted into grayscale images. The pre-processing does not affect face recognition accuracy, since the aim is only to perform face detection; for face recognition, the high-quality RGB images are used. Face detection usually precedes face recognition, since a face must first be detected before recognizing whose face it is.
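A minimal OpenCV sketch of this pre-processing step might look as follows; the factor-of-2 downscaling and grayscale conversion are from the text, while the function name is our assumption.

import cv2

def preprocess_for_detection(rgb_frame):
    """Downscale by a factor of 2 and convert to grayscale.
    Detection runs on the small gray image; recognition later
    uses the original high-resolution frame."""
    small = cv2.resize(rgb_frame, None, fx=0.5, fy=0.5,
                       interpolation=cv2.INTER_AREA)
    return cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)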


Features for detecting humans in related work include the Scale Invariant Feature Transform [30, 34], Haar-like wavelets [55], shape [64] and HOG (Histograms of Oriented Gradients) [4]. Several techniques are applied for face detection, including canny pruning and rough search. Objects bigger than 40x40 pixels on the downscaled images are searched when detecting faces, and no maximum object (face) size restriction is applied. No confidence threshold is used for filtering the prediction results. In this work, Haar cascade classifiers are used for face detection. Selecting an appropriate method for face detection also depends on the model used for face modeling and recognition. For instance, Haar cascade classifiers suit Fisherfaces as a recognizer, while local binary pattern features are useful for local binary pattern histograms. Once a face is detected, it is registered to the original high-quality RGB image used for face recognition. Face segmentation is performed where the frontal faces are segmented as 100x100 pixel images (the minimum threshold is 50x50 pixels) from the original RGB image, which includes the background and other details besides the human face.
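The detection and segmentation step could be sketched with OpenCV's stock frontal-face Haar cascade as below. The 40x40 px minimum size on the downscaled image, the canny-pruning flag and the 100x100 px face crops follow the text; the cascade file and helper names are assumptions, not the authors' implementation.

import cv2

# Stock OpenCV frontal-face cascade (path assumed; ships with opencv-python).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_segment(gray_small, rgb_full, scale=2):
    """Detect faces on the downscaled gray image, then crop the
    corresponding regions from the full-resolution RGB frame."""
    rects = cascade.detectMultiScale(
        gray_small, scaleFactor=1.1, minNeighbors=5,
        minSize=(40, 40), flags=cv2.CASCADE_DO_CANNY_PRUNING)
    faces = []
    for (x, y, w, h) in rects:
        # Map the detection back onto the original high-resolution image.
        X, Y, W, H = x * scale, y * scale, w * scale, h * scale
        crop = rgb_full[Y:Y + H, X:X + W]
        # Normalise every segmented face to 100x100 px for recognition.
        faces.append(cv2.resize(crop, (100, 100)))
    return faces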

There are several methods for face recognition. Face recognition based on the geometric features of a face was initially computed by [22]. Marker points were used to register the key parts of a face like the eyes, nose and lips, and the angles and distances between the marker points were extracted as features. Even though such an approach is robust to ambient light variations, registration errors in marker point detection make it inefficient for face recognition. The Eigenfaces approach [54] reduces the high-dimensional image space to a lower dimension using Principal Component Analysis, retaining the axes with maximum variance. Eigenfaces offer good recognition accuracy, but are computationally heavy. Eigenfaces do not separate data belonging to different classes, losing discriminative power when lower-dimensional components are thrown away. Linear Discriminant Analysis [2] instead minimizes the variance within classes while maximizing the variance between classes. Fisherfaces represent the same class as tightly clustered, while different classes are as far apart as possible from one another in the lower-dimensional representation. Fisherfaces are more accurate than Eigenfaces, but are affected by changing ambient light, wherein the accuracy drops dramatically. The Local Binary Pattern Histogram (LBPH) is not sensitive to ambient light and is faster than the other mentioned models, but does not have good recognition accuracy.
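For reference, the Fisherfaces criterion from [2] can be written as the projection $W$ that maximizes between-class scatter relative to within-class scatter:

\[ W_{\mathrm{opt}} = \arg\max_{W} \frac{\left| W^{T} S_{B} W \right|}{\left| W^{T} S_{W} W \right|} \]

where $S_B$ and $S_W$ denote the between-class and within-class scatter matrices of the training faces.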

In this work, Fisherfaces are used considering their accuracy and computing speed advantages. They are suitable for indoor spaces like a kitchen environment that have limited ambient light variations. The Fisherfaces are implemented using the OpenCV (Open Source Computer Vision) library and are trained offline using approximately 70 pictures of each person. Image acquisition for training takes little time (between 2 and 3 min), during which the frontal face is captured with the varied head postures that are natural while being part of the kitchen environment and performing everyday activities. Note that it is important to keep the training time requirement to a minimum so that first-time occupants of a space are not distracted. For ideal training, image capturing should be done when a person is between 1.2 and 2 m from the camera; beyond 2 m, fewer details are retained. In recognition mode, the classified face identity is obtained. The accuracy of face recognition can be improved if a person is at a fixed distance with frontal face alignment towards the Kinect camera. However, everyday spaces usually host mobile human actors with ever-varying body postures and locations. Introducing fixed distance and face alignment restrictions makes an approach unsuitable for ubiquitous computing environments. Even though our face recognition approach has yielded good results, applying additional image enhancement techniques both before training the model and in recognition mode would improve the results further. Also, aligning the faces to compensate for inappropriate facial orientation before training the model should improve the results. Refer to section 4 for evaluation results.
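A hedged sketch of the offline training and recognition calls is given below, assuming OpenCV's contrib face module (the exact factory name varies across OpenCV versions, and OpenCV's recognizers operate on single-channel images of equal size, so the 100x100 crops are converted to grayscale first); the helper names are ours.

import cv2
import numpy as np

# Requires opencv-contrib; in OpenCV 3.x/4.x the factory is
# cv2.face.FisherFaceRecognizer_create (older versions differ).
recognizer = cv2.face.FisherFaceRecognizer_create()

def train(face_crops, labels):
    """face_crops: list of 100x100 BGR crops (~70 per person);
    labels: one integer identity per image."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in face_crops]
    recognizer.train(gray, np.array(labels))

def recognize(face_crop):
    """Returns (predicted_label, confidence); in OpenCV's convention,
    lower confidence values indicate a closer match."""
    gray = cv2.cvtColor(face_crop, cv2.COLOR_BGR2GRAY)
    return recognizer.predict(gray)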

C. Skeletal Tracking

A pattern of infrared light emitted by Kinect is used to calculate depth information for the scene in the field of view. The depth information enables the recognition of people and their body parts as 20 skeletal joints, either tracked or inferred from existing data [43]. Skeletal tracking enables the detection of human presence, the number of humans present in the space (max. 6 persons can be detected), their location and their movement patterns over time (max. 2 persons can be tracked simultaneously). Skeletal tracking can be used for following people performing actions. It is implemented using Kinect SDK 1.7 from Microsoft, from which the skeletal head coordinates are obtained. Skeletal tracking was optimized for the standing position, since people usually stand while cooking in a kitchen. However, skeletal tracking in general is effective for varied human body postures. While facing the Kinect yields good results, sideways poses are usually challenging to recognize. Increasing human sensing coverage within a smart space might introduce a need for multiple Kinects. However, it is important to physically place the individual Kinects such that the interference caused by multiple infrared light sources is avoided as far as possible. One option is to assign a unique field of view in 3D space to each individual Kinect so that they do not interfere. Human location usually changes over time within a space. Skeletal tracking keeps track of such movement patterns with good accuracy. However, two challenges exist: 1) when a skeleton moves beyond Kinect's field of view it is lost; 2) when two or more skeletons are in close proximity to one another, they are confused. It is important to explore mechanisms for addressing these two challenges, as will be shown in the evaluation section (section 4).

D. Information Fusion

Information arriving from several sources is fused to establish a face-skeleton mapping, a novel approach to human sensing. The center point coordinates of the detected face and the recognized face label are passed on for face-skeleton mapping. The skeleton with (roughly) the same head position as the face coordinates is then found. The function assignFaceToSkeleton assigns the face label to the skeleton whose coordinates are proximal to the face coordinates. A Euclidean distance of 15 pixels is used as the threshold between the detected face center and the skeleton head center. An assigned skeleton holds the assigned name as long as it is not lost from the Kinect's field of view. Using this approach, a person in Kitchen As-A-Pal can be tracked with an assigned label as long as the skeleton does not leave the Kinect's field of view. Refer to Fig. 2 for skeletal tracking and human sensing by information fusion from multiple sources.
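A minimal sketch of this assignment step follows. The 15-pixel threshold and the assignFaceToSkeleton name come from the text; the dictionary-based skeleton representation is our assumption about the data structures, not the authors' implementation.

import math

ASSIGN_THRESHOLD_PX = 15  # max face-centre to head-centre distance

def assign_face_to_skeleton(face_center, face_label, skeletons):
    """Assign the recognized face label to the skeleton whose head
    position is closest to the detected face centre, within threshold.
    skeletons: dict of skeleton_id -> {'head': (x, y), 'label': str|None}.
    Returns the matched skeleton id, or None if no skeleton is close."""
    best_id, best_dist = None, ASSIGN_THRESHOLD_PX
    for sk_id, sk in skeletons.items():
        d = math.dist(face_center, sk['head'])
        if d <= best_dist:
            best_id, best_dist = sk_id, d
    if best_id is not None and skeletons[best_id]['label'] is None:
        # Labels stick until the skeleton leaves the field of view.
        skeletons[best_id]['label'] = face_label
    return best_id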


Algorithm 1. Information fusion using the dynamic person-tracking algorithm.

Fig. 2. Acquired original images (top & bottom left); skeletal tracking (top & bottom center); and human sensing by information fusion of face recognition and skeletal tracking (top & bottom right).

A label is assigned to a skeleton only if the skeleton does not already have a human identity label. While it is possible to always assign the proximal skeleton the current identity of the recognized face, doing so regularly results in fluctuations, since face recognition has a sampling rate greater than 1 Hz. Further research is required to determine the optimal mechanisms for fusing the information obtained from face recognition and skeletal tracking, especially for multi-occupant settings. Information fusion has challenges, particularly since skeletal tracking assigns only temporary identities to the tracked skeletons. Such a temporary identity must be associated with a permanent identity like a person's name or personal number. Since skeletons at the field-of-view boundary are usually lost, selecting the right permanent identity for association when they reappear is a challenge. A similar problem, also mentioned earlier, arises when multiple skeletons are in close proximity, resulting in information fusion difficulties for human identity recognition. Information fusion is also useful for discarding face detection errors. In our case some kitchen objects, for instance a toaster, were wrongly recognized as faces. In such cases, skeletal association discards the detected faces as noise. A dynamic person-tracking algorithm was implemented for human identity recognition by information fusion (refer to Algorithm 1).
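The per-frame tracking loop could look roughly as follows. This sketch builds on assign_face_to_skeleton from the previous sketch and on our assumed skeleton dictionary; Algorithm 1 in the original paper is the authoritative version.

def track_frame(recognized_faces, skeletons):
    """One iteration of a dynamic person-tracking loop.
    recognized_faces: list of ((x, y) face centre, identity label) pairs
    for the current frame; skeletons: dict as in the previous sketch."""
    for center, label in recognized_faces:
        sk_id = assign_face_to_skeleton(center, label, skeletons)
        if sk_id is None:
            # No skeleton near the detection: discard it as noise
            # (e.g. the toaster misdetected as a face, mentioned above).
            continue
    # A skeleton lost from the field of view loses its identity and
    # must be re-associated with a face when it reappears.
    for sk_id in [k for k, sk in skeletons.items() if sk.get('lost')]:
        del skeletons[sk_id]
    # Current best estimate: one identity label (or None) per skeleton.
    return {sk_id: sk['label'] for sk_id, sk in skeletons.items()}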

IV. EVALUATION

Computer vision for face recognition has existed as a research area for more than 15 years. While tremendous ground has already been covered, using computer vision in the ubiquitous computing landscape is still in its infancy. This work presents initial empirical results on understanding computer vision as a technology for human sensing in smart spaces.

A. Participants and Scenario

Four participants (2 males and 2 females) from Umeå University took part in a pilot study at Kitchen As-A-Pal. The subjects were between 24 and 37 years of age. The participants were food enthusiasts who spend a considerable amount of time in their kitchens every day preparing food. They were also used to multiple occupants sharing a kitchen space (each participant shared their kitchen with a partner or a friend). The participants had a technical background and were positive about human sensing in kitchen spaces, with some cautious efforts to maintain privacy and security. The participants took part in a breakfast scenario in which they performed different everyday activities in Kitchen As-A-Pal. They were free to choose the activities they would like to perform, based on their everyday lifestyle and the facilities available in Kitchen As-A-Pal. The participants displayed natural body movements in the space, with varying face orientations relative to the wall-mounted Kinect's field of view. Because the participants were performing everyday activities, they were not always looking directly into the Kinect, but their proxemics with reference to the Kinect were natural. As mentioned earlier, frontal faces are the ones that support face recognition, while back and side facing poses offer little or no support, requiring alternative complementary approaches like the skeletal tracking explored in this paper.

Each participant took part in 3 sessions: a quick training session, a single-occupant session, and a multi-occupant session. The participants were free to decide on their clothing styles and facial looks. For instance, one subject was well shaven during one session and had a fine beard in the other two sessions. The training session was quick (between 2 and 3 min), during which facial images with varying face postures while performing a breakfast activity were acquired as high-quality images for training the Fisherface model. The image acquisition is done in an unobtrusive and implicit manner, wherein the participant is usually unaware of when the system acquires the images. The offline system training takes about 15 min per human actor. In the single-occupant session, each participant was alone in Kitchen As-A-Pal. The participants on average spent 10 min (ranging between 6 and 14 min) performing breakfast activities individually. The system was run in recognition mode and video was captured with the recognition visualization, which was also used for calculating the precision, recall and confusion matrix. In the third session, the 4 participants were paired into 2 groups. The participants in each pair knew each other, making it easier to create a social situation in which the pair together prepared breakfast in Kitchen As-A-Pal with natural social interaction. Knowing each other also ensured that their movement patterns within the smart space were realistic. The participant pairs spent 14 and 6 min respectively performing breakfast activities together in Kitchen As-A-Pal. The participants were enthusiastic and expressed a positive attitude towards human sensing in return for personalized services in their own kitchen. While the number of participants and the time they spent in Kitchen As-A-Pal are limited, this work is a pilot study opening up the possibilities of human sensing using computer vision and information fusion.


Table 2. Precision and Recall values in % for single occupant setting. (a) Face Recognition and (b) Face Recognition fused with Skeletal Tracking.

Table 3. Precision and Recall values in % for multi-occupant setting. (a) Face Recognition and (b) Face Recognition fused with Skeletal Tracking.

B. Precision and Recall

The accuracy of human identity recognition is presented using precision and recall values. Here TP, FP and FN refer to true positives, false positives and false negatives, and precision and recall are defined as follows:

!"#$%&%'()*!+)!"""""""""#$"""""""""""""""""""""",#$-..)*,+)!"""""#$"""""""""""""""""""""""#$"%"&$"" " " """""""""""""#$"%"&'"#$ &$ #$ % &'

P1, P2, P3 and P4 refer to the individual study participants. Refer to Tables 2 and 3. As expected, the single-occupant setting outperformed the multi-occupant setting in terms of precision and recall. It should be noted that while precision is high (85.5%) for the traditional face recognition approach, its recall is far from acceptable (34%) due to the limitations in covering the area visited by the participants in Kitchen As-A-Pal (refer to Table 2). However, the information fusion approach increased the precision to 91.75%, with a 32 percentage point rise in recall (to 66%). This increase in accuracy makes human sensing using computer vision suitable for smart spaces, where sensor coverage is an important challenge. It should be noted that 100% smart space coverage is a challenge using a single Kinect, and in this work we explore the boundaries of sensor coverage with a single Kinect. For our ambient assisted living application [29], it is important to have high precision (above 90%) to facilitate personalized services to the occupant, like diet reminders, access to specific food ingredients, activity support and sharing information about successfully completed activities with caregivers. For the multi-occupant setting, the precision of traditional face recognition is high (81%), but the recall is low (refer to Table 3). It is difficult to increase the recall beyond a certain point without reducing the precision drastically. For the information fusion approach, the recall is decent (59.5%) owing to the use of skeletal tracking. However, the precision is still low, mainly due to skeletons in close proximity or skeletons leaving/entering the Kinect field of view more often than in the single-occupant setting (we observed that two occupants sharing a space cross the sensor coverage boundary often, to maintain a comfortable personal space for each other while performing activities). Additional mechanisms are required for skeletal tracking and the subsequent information fusion to address this challenge.

Table 4. Confusion matrix for single occupant and multi-occupant settings.

Fig. 3. Skeletal tracking is dominant in the single-occupant setting.

Research on multi-occupant cases with ecologically valid human behaviors is often ignored due to the complexity involved in human tracking, especially using computer vision (we have taken our first step in this research direction). While multi-occupant settings introduce additional challenges, human beings in general are social beings, and it is important to investigate solutions that include multi-occupant cases. Even in a setup where a person lives independently, it is likely that other humans visit the person from time to time, making the setting multi-occupant.

C. Confusion Matrix

A confusion matrix is used to determine how the actual classes are confused with the recognized classes; in our case, it represents the human identity confusion among the participants. For an ideal system, all cells of the confusion matrix except the diagonal should be zero. It should be noted that the confusion matrix diagonals for both the single-occupant and the multi-occupant cases are dominant (refer to Table 4). P1 is the participant least confused with the others (a 2.5% confusion ratio with P3 alone). P2 is confused the most with P3 (a 24.25% confusion ratio) but is not confused with P1 and is only 0.45% confused with P4. The confusion matrix clearly shows the accuracy results obtained using information fusion of face recognition and skeletal tracking. While additional mechanisms should improve the system performance in the future, the infrastructure presented in this paper can be used as a foundation for further research on human sensing within smart spaces using computer vision.
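A row-normalized confusion matrix of the kind reported in Table 4 can be computed as in the sketch below (a helper of our own, not the authors' code): rows are actual identities, columns are recognized identities, and each row sums to 100%.

import numpy as np

def confusion_matrix_percent(actual, predicted, participants):
    """Build a confusion matrix in percent from paired label sequences.
    actual/predicted: sequences of identity labels per recognition event;
    participants: ordered list of labels, e.g. ['P1', 'P2', 'P3', 'P4']."""
    idx = {p: i for i, p in enumerate(participants)}
    m = np.zeros((len(participants), len(participants)))
    for a, p in zip(actual, predicted):
        m[idx[a], idx[p]] += 1
    row_sums = m.sum(axis=1, keepdims=True)
    return 100.0 * m / np.where(row_sums == 0, 1, row_sums)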

D. Skeletal Tracking as a Dominant Information Channel

Information fusion involves the combination of information from multiple sources, thereby creating a new set of information. It is often used in scenarios where information from a single channel is not sufficient for decision-making, resulting in conflicts. The importance of information fusion can be seen from the contribution of the individual information channels towards human sensing. Refer to Fig. 3 and Fig. 4 for examples where the person's identity is maintained even when their face is oriented away from the Kinect's field of view.


Fig. 4. Skeletal tracking is dominant in multi-occupant setting.

Fig. 5. Face recognition is dominant in multi-occupant setting.

In such cases, skeletal tracking is the dominant information channel, and the information from the face recognizer is either unavailable or unreliable. The approach of associating the recognized face with the nearest skeleton enables decision-making in the absence of a frontal face image for face recognition. The only way to determine the person's identity is then the skeleton mapping made when the person's face was last recognizable. As discussed earlier, skeletal mapping works better for single-occupant spaces than for multi-occupant spaces.

E. Face Recognition as a Dominant Information Channel

Skeletal tracking works fine when two skeletons are not in close proximity. However, in everyday spaces with multiple occupants it is likely that the skeletons come close to one another, creating confusion. Refer to Fig. 5 for an example where the skeletons are confused and merged, making it difficult to recognize their identities. Face recognition is the dominant information channel in such cases. Future work includes applying an appropriate fusion approach for determining when to use, and not to use, skeletal mapping.

V. CHALLENGES AND CONCLUSION

Spatial coverage refers to the area within which human sensing is feasible. While it is important that face recognition and skeletal tracking are available throughout the smart space, increasing the sensor coverage area beyond a certain threshold reduces face recognition accuracy and also causes lost skeletons. The spatial measures described in section 3 were found to be appropriate thresholds when using a single Kinect for human sensing. Several extensions are possible to improve the recognition accuracy. Detecting facial parts like the eyes, ears, nose, etc. could increase the accuracy, but at computational cost. Handling occlusion, illumination variations, misalignment, corrupted pixels and complex backgrounds is important for better results, especially since this work was performed in a natural human setup (but remains a challenge [42]). Head pose estimation using Kinect, researched in [27], could improve head alignment and recognition accuracy, but is a challenge in real-time. Using an ambient light sensor with models trained per individual and per specific lighting condition will be explored in the future. Privacy and trust are other important challenges to consider in camera-based approaches for smart spaces. The visualization presented in this paper is for descriptive purposes, and the only information that will be stored is human identity codes, similar to using RFID technology, for instance. Similarities between occupants, and variations within an occupant over time, are further challenges to address. This work will be extended to include human activity recognition using computer vision in the future. A novel approach of fusing face recognition with skeletal tracking for human sensing using Kinect has been described, with good results (91.75% precision, 66% recall, and satisfactory smart space coverage) in a realistic breakfast scenario. Human sensing is a wide research area, and multimodal approaches are likely to be successful.

ACKNOWLEDGMENT

This work is sponsored by the Swedish Brain Power and is part of the As-A-Pal project.

REFERENCES

[1] G. Abowd, E. Mynatt, "Designing for the human experience in smart environments", Smart environments: technologies, protocols, and applications, 2004, pp. 151-174.

[2] P. Belhumeur, J. Hespanha, D. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection”, IEEE Transactions on Pattern Analysis and Machine Intelligence vol. 19(7), 1997, pp. 711–720.

[3] K. Bernardin, H. Ekenel, R. Stiefelhagen, “Multimodal identity tracking in a smart room”, Pers Ubiquit Comput, vol. 13(1), 2009, pp. 25–31.

[4] N. Dalal, B. Triggs, “Histograms of Oriented Gradients for Human Detection”, In: IEEE Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 886–893.

[5] N. Dalal, B. Triggs, C. Schmid, “Human detection using oriented histograms of flow and appearance”, in: European Conference on Computer Vision, Graz, Austria, May 7–13, 2006.

[6] T. Darrell, G. Gordon, J. Woodfill, M. Harville, "Integrated Person Tracking using Stereo, Color, and Pattern Detection," Proceedings IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, June 1998.

[7] J. Daugman, C. Downing, “Epigenetic Randomness, Complexity, and Singularity of Human Iris Patterns,” Proc. Royal Soc., vol. B, no. 268, Biological Sciences, 2001, pp. 1737-1740.

[8] D. Demirdjian, T. Darrell, “3-D Articulated Pose Tracking for Untethered Diectic Reference”, ICMI, 2002, pp. 267-272.

[9] H. Ekenel, M. Fischer, Q. Jin, R. Stiefelhagen, “Multi-modal person identification in a smart environment”, In: Computer Vision and Pattern Recognition, IEEE Computer Society, 2007, pp 1–8.

[10] A. Ess, B. Leibe, L. Van Gool, “Depth and appearance for mobile scene analysis”, ICCV, 2007.

[11] G. Fischer, "User modeling in human–computer interaction", User modeling and user-adapted interaction, vol. 11 (1-2), 2001, pp. 65-86.

[12] D. Gavrila, “The Visual Analysis of Human Movement: A Survey”, Computer Vision and Image Understanding, vol. 73(1), 1999, pp. 82–98.

[13] D. Gavrila, J. Giebel, S. Munder, “Vision-based Pedestrian Detection: The Protector System”, In: Intelligent Vehicles Symposium, 2004, pp. 13–18.

[14] T. Germa, F. Lerasle, N. Ouadah, V. Cadenat, "Vision and RFID data fusion for tracking people in crowds by a mobile robot", Computer Vision and Image Understanding, vol. 114 (6), 2010, pp. 641-651.

[15] S. Greenberg, N. Marquardt, T. Ballendat, R. Diaz-Marino, M. Wang, "Proxemic interactions: the new ubicomp?", interactions vol. 18 (1), 2011, pp. 42-50.

[16] G. Hermosilla, J. Ruiz-Del-Solar, R. Verschae, M. Correa, “A Comparative Study of Thermal Face Recognition Methods in Unconstrained Environments”, Pattern Recognit., vol. 45 (7), 2012, pp. 2445-2459.


[17] K. Hong, J. Min, W. Lee, J. Kim, “Real time face detection and recognition system using Haar-like feature/HMM in ubiquitous network environments”, In: Computational science and its applications (ICCSA 2005), LNCS vol. 3480, 2005, pp 1154–1161.

[18] Y. Hu, A. Mian, R. Owens, "Face recognition using sparse approximated nearest points between image sets", Pattern Analysis and Machine Intelligence, IEEE Transactions on vol. 34, no. 10, 2012, pp. 1992-2004.

[19] S. Ikemura, H. Fujiyoshi, “Real-time human detection using relational depth similarity features”, ACCV 2010, LNCS vol. 6495, 2011, pp. 25-38.

[20] A. Jain, B. Klare, U. Park, “Face Recognition: Some Challenges in Forensics”, In IEEE International Conference on Automatic Face and Gesture Recognition and Workshops, 2011, pp. 726-733.

[21] H. Jain, A. Subramanian, “Real-time upper-body human pose estimation using a depth camera”, In HP Technical Reports, HPL-2010-190, 2010.

[22] T. Kanade, “Picture processing system by computer complex and recognition of human faces”, PhD thesis, Kyoto University, November 1973.

[23] A. Kar, "Skeletal tracking using microsoft kinect", Methodology 1, 2010, pp. 1-11.

[24] F. Kawsar, K. Fujinami, T. Nakajima, "Prottoy middleware platform for smart object systems", International Journal of Smart Home, vol. 2(3), 2008.

[25] C. Kidd, R. Orr, G. Abowd, C. Atkeson, I. Essa, B. MacIntyre, E. Mynatt, T. Starner, W. Newstetter, "The aware home: A living laboratory for ubiquitous computing research", In Cooperative buildings, Springer Berlin Heidelberg, 1999, pp. 191-198.

[26] Y. Kim, J. Yoo, K. Choi, “A Motion and Similarity-Based Fake Detection Method for Biometric Face Recognition Systems”, IEEE Trans. Consum. Electron., vol. 57, no. 2, 2011, pp. 756-762.

[27] F. Kondori, S. Yousefi, H. Li, S. Sonning, "3D head pose estimation using the Kinect", In Wireless Communications and Signal Processing (WCSP), IEEE, 2011, pp. 1-4.

[28] S. Kong, J. Heo, F. Boughorbel, Y. Zheng, B. Abidi, A. Koschan, M. Abidi, “Adaptive Fusion of Visual and Thermal Ir Images for Illumination-Invariant Face Recognition”, J. Comput. Vision, vol. 71, no. 2, 2007, pp. 215-233.

[29] H. Lindgren, D. Surie, I. Nilsson, “Agent-Supported Assessment for Adaptive and Personalized Ambient Assisted Living”, Trends in Practical Applications of Agents and Multiagent Systems, Springer, vol. 90, 2011, pp. 25-32.

[30] D. Lowe, “Object Recognition from Local Scale-invariant Features”, In: IEEE International Conference on Computer Vision, 1999, pp. 1150-1157.

[31] J. Luque, R. Morros, A. Garde, J. Anguita, M. Farrus, D. Macho, F. Marqués, C. Martínez, V. Vilaplana, J. Hernando, "Audio, video and multimodal person identification in a smart room", In Multimodal Technologies for Perception of Humans, Springer Berlin Heidelberg, 2007, pp. 258-269.

[32] J. Matey, L. Kennell, “Iris Recognition - Beyond One Meter”, Handbook of Remote Biometrics, 2009.

[33] V. Menon, B. Jayaraman, V. Govindaraju, "Multimodal identification and tracking in smart environments", Personal and Ubiquitous Computing, vol. 14 (8), 2010, pp. 685-694.

[34] K. Mikolajczyk, C. Schmid, A. Zisserman, “Human Detection based on a Probabilistic Assembly of Robust Part Detectors”, European Conference on Computer Vision, 2004, pp. 69–82.

[35] Y. Nakamura, “Human Sensing”, Field Informatics, Kyoto University Field Informatics Research Group, Springer Berlin Heidelberg, 2012, pp. 39-53.

[36] C. Papageorgiou, T. Poggio, “A Trainable System for Object Detection”, International Journal of Computer Vision, vol. 38(1), 2000, pp. 15–33.

[37] A. Pentland, T. Choudhury, "Face recognition for smart environments." Computer vol. 33, no. 2, 2000, pp. 50-55.

[38] C. Plagemann, V. Ganapathi, D. Koller, S. Thrun, “Realtime identification and localization of body parts from depth images”, In IEEE Int. Conference on Robotics and Automation (ICRA), Anchorage, Alaska, USA, 2010.

[39] A. Ross, “Iris recognition: The path forward”, Computer, vol. 43(2), 2010, pp. 30-35.

[40] J. Salas, C. Tomasi, "People detection using color and depth images", In Pattern Recognition, Springer Berlin Heidelberg, 2011, pp. 127-135.

[41] W. Schwartz, A. Kembhavi, D. Harwood, L. Davis, “Human detection using partial least square analysis”, In ICCV, 2009.

[42] H. Sellahewa, S. Jassim, "Image-quality-based adaptive face recognition", Instrumentation and Measurement, IEEE Transactions on 59, no. 4, 2010, pp. 805-813.

[43] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake, “Real-Time Human Pose Recognition in Parts from a Single Depth Image”, in CVPR, IEEE, June 2011.

[44] A. Sinha, K. Chakravarty, B. Bhowmick, "Person Identification using Skeleton Information from Kinect", In The Sixth International Conference on Advances in Computer-Human Interactions, 2013, pp. 101-108.

[45] D. Socolinsky, A. Selinger, “Thermal Face Recognition over Time”, In 17th International Conference on Pattern Recognition, U.K., 2004, pp.187-190.

[46] L. Spinello, K. Arras, "People detection in RGB-D data", In Intelligent Robots and Systems (IROS), IEEE, 2011, pp. 3838-3843.

[47] D. Surie, "Egocentric interaction for ambient intelligence", PhD diss., Umeå University, 2012.

[48] D. Surie, B. Baydan, H. Lindgren, "Proxemics Awareness in Kitchen AS-A-PAL: Tracking Objects and Human in Perspective", In 9th International Conference on Intelligent Environments, IEEE, 2013, pp. 157-164.

[49] D. Surie, L-E. Janlert, T. Pederson, D. Roy, "Exploring egocentric interaction paradigm for ambient ecologies", Pervasive and Mobile Computing, 2011.

[50] D. Surie, H. Lindgren, A. Qureshi, "Kitchen AS-A-PAL: Exploring Smart Objects as Containers, Surfaces and Actuators", In Ambient Intelligence-Software and Applications, Springer, 2013, pp. 171-178.

[51] D. Surie, T. Pederson, L-E. Janlert, "A Smart Home Experience using Egocentric Interaction Design Principles", In Computational Science and Engineering (CSE), IEEE, 2012, pp. 656-665.

[52] T. Teixeira, G. Dublon, A. Savvides, "A survey of human-sensing: Methods for detecting presence, count, location, track, and identity", ACM Computing Surveys, vol. 5, 2010.

[53] T. Teixeira, A. Savvides, "Lightweight people counting and localizing in indoor spaces using camera sensor nodes", In Distributed Smart Cameras, ICDSC'07, IEEE, 2007, pp. 36-43.

[54] M. Turk, A. Pentland, “Eigenfaces for recognition”, Journal of Cognitive Neuroscience, vol. 3, 1991, pp. 71–86.

[55] P. Viola, M. Jones, D. Snow, “Detecting Pedestrians using Patterns of Motion and Appearance”, International Journal of Computer Vision, vol. 63(2), 2005, pp. 153-161.

[56] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, H. Mobahi, Y. Ma, "Toward a practical face recognition system: Robust alignment and illumination by sparse representation", Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 2, 2012, pp. 372-386.

[57] N. Wang, Q. Li, A. El-Latif, T. Zhang, X. Niu, “Toward Accurate Localization and High Recognition Performance for Noisy Iris Images”, Multimed. Tools Appl., 2012, pp. 1-20.

[58] J. Wang, M. She, S. Nahavandi, A. Kouzani, "A review of vision-based gait recognition methods for human identification", In Digital Image Computing: Techniques and Applications (DICTA), IEEE, 2010, pp. 320-327.

[59] M. Weiser, "The computer for the 21st century", Scientific american, vol. 265, no. 3, 1991, pp. 94-104.

[60] L. Xia, C-C. Chen, J. Aggarwal, “Human Detection using Depth Information by Kinect”, In Computer Vision and Pattern Recognition Workshops, IEEE Computer Society, 2011, pp. 15-22.

[61] L. Xia, C-C. Chen, J. Aggarwal, "View invariant human action recognition using histograms of 3D joints", In Computer Vision and Pattern Recognition Workshops, IEEE Computer Society, 2012, pp. 20-27.

[62] H. Yang, S. Lee, “Reconstruction of 3D human body pose from stereo image sequences based on top-down learning”, Pattern Recognition, vol. 40(11), 2007, pp. 3120-3131.

[63] Z. Zhang, "Microsoft kinect sensor and its effect", Multimedia, IEEE, vol. 19, no. 2, 2012, pp. 4-10.

[64] L. Zhao, L. Davis, “Closely coupled object detection and segmentation” In: IEEE International Conference on Computer Vision, 2005, pp. 454–461.

[65] W. Zheng, S. Bhandarkar, "Face detection and tracking using a Boosted Adaptive Particle Filter", Journal of Visual Communication and Image Representation, vol. 20, no. 1, 2009, pp. 9-27.
