

Autonomous Learning of User’s Preference of Music and Light Services in Smart

Home Applications

Amir Hossein KHALILI a, Chen WU a, and Hamid AGHAJAN a

a Department of Electrical Engineering, Stanford University, CA 94305, USA

Abstract. Different users have different behaviors and habits. When an ambient intelligence system is designed for a smart home environment, such differences in the user’s activity patterns and preferences in environment settings have to be taken into account for the system to adapt to and satisfy the user. Automatic learning of the behavior and preferences of the user is essential for automatic regulation of services if the system is to be useful and practical. In this paper we use the temporal difference class of reinforcement learning, an unsupervised learning method, to learn the user’s preferred music and lighting settings under different user states. Our proposed method does not need a predefined environmental model and learns the optimal action in each state through continuous interaction with the user. The proposed method is able to adapt itself to changes in the user’s preferences over time. The results show how our system converges to increase the user’s satisfaction.

Keywords. Smart Homes, Reinforcement Learning, Temporal Difference Method.

Introduction

There has been extensive research on smart homes with a variety of applications, including monitoring systems for independent living of the elderly and accident detection [1], smart appliances/devices such as refrigerators, and automated services such as energy and lighting control [2][3]. The goal is to provide services that maximize the user’s comfort while minimizing the user’s explicit interaction with the environment as well as the cost of the service. One type of service aims to achieve both the utility requirements and the user’s comfort. For example, a lighting or heating control system needs to satisfy the user’s need for lighting/heating and at the same time be energy-efficient. Another type of service may be designed mainly to provide a comfortable ambience for the user. For example, in Philips’ Experience Lab [4], LED lights of different colors are installed at several locations in the room, and different themes of ambient lighting can be applied depending on the user’s preference. In most current applications, the system provides its services either based on a set of embedded fixed rules which determine the service parameters based on the situation, or based on a query to the user every time a change is to be applied to the service. For example, a simple light control system may be programmed to turn the lights off automatically when the user sits to watch TV. This results in a rigid operation which may not be desirable for many users. Alternatively, the system may ask the user to confirm the action every time the same event is detected. When different services are to be offered throughout the day, such queries disrupt the user’s normal routines and may render the system obtrusive.

In this work, our goal is to create a user-centric methodology for adaptation of system services based on accumulated knowledge about the user preferences. These preferences are learned through user’s explicit or implicit feedback to the system when the user opts to react to the provided service. As a result, the system adapts to provide the most satisfactory music and ambient lighting to the user.

In our approach, the system gains intelligence through observing the user, interacting with the user, and exploring the user’s interests via mutual discovery mechanisms. Context is inferred not just through generic data such as time of the day, but also via detecting user-specific situations such as location, activity, or event.

Learning the user’s preference of the services requires information about the user’s state. Existing smart home systems collect readings from different sensors as representations of the user’s activities [5][6], e.g., motion sensors, light sensors and sensors that indicate the user’s interactions with objects. But sometimes such sensory data cannot provide sufficient detail about the user’s activity. For example, the user may watch TV, read newspapers or talk with the family in the living room, and the type of background music he prefers in each situation may well differ. In our system image sensors are used, with the objective of obtaining a richer description of the user’s pose and activity by analyzing image data with computer vision techniques. In this way we are able to learn and correlate the user’s preferences with more specific contextual information about the user.

One challenge for the learning algorithm of the user’s preference is that the user may change his behavior over time [7]. Therefore it is not practical to learn the user’s preference off-line; instead, an adaptive learning algorithm is needed. We apply the framework of reinforcement learning (RL) to maximize the satisfaction of the user. RL, as an unsupervised algorithm, does not need a predefined model of the environment and learns the model from the user’s feedback to the service the system provides. The state of the user (time of the day, location, pose/activity) is observed by the system, and the service is chosen automatically by the algorithm based on the user’s preference learnt so far. If the user is not satisfied and applies a change using the system’s user interface, or simply leaves the area, the event is recorded as explicit or implicit feedback respectively, and is used to update the decision policy.

1. Proposed Method

The overall system architecture is shown as a block diagram in Figure 1. It aims to learn the user’s preference of a service in the smart environment (e.g., music and ambient lighting) through observing the user’s states and learning from the user’s feedback. Components of the system include sensory-motor, decision maker and actor.


Figure 1. System architecture.

The “sensory-motor” monitors the activity and location of the user at home. It activates the “decision-maker” when a change occurs or when a time slot has passed. According to the state of the environment s (which includes the user’s current activity, location and current time), the “decision-maker” learns and selects the satisfactory features of a service and makes the “actor-motor” take the desired action a, with the objective of maximizing the user satisfaction represented as the reward r.

The different parts are described in more detail in the following sections.

1.1. Actor-Motor

Two types of services are controlled in the proposed system: music and ambient lighting. Table 1 shows features of each service and their corresponding values.

Table 1. Services and values of features being controlled by the system.

Music
  Genre: Blues, Christian, Classical, Country, Dance, Electronic, Folk, Hip_Hop, Holiday, Jazz, Latin, New_Age, Oldies, Pop, R_n_B, Reggae, Rock
  Mood: Calm, Exciting, Happy, Neutral, Sad
  Volume: Silence, Low, Medium, High

Lighting
  Pattern: Cloudy, Starry, Fireworks, Solid
  Color: Purple, Blue, Gray, Green, Orange, Red, Yellow
  Brightness: Bright, Medium, Dark, Off


The feature values selected by the decision-maker are sent to the actor-motor as two triple vectors, shown in (1) and (2):

Music (‘Genre value’, ’Mood value’, ’Volume value’)                (1)
Lighting (‘Pattern value’, ’Color value’, ’Brightness value’)      (2)

According to the received feature vectors, the actor-motor chooses proper music and a lighting effect from the available archive and plays them for the user. For each feature, a decision maker is learned independently for the appropriate value of the action to be taken in the user’s state, which is detected by the sensory-motor.
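As an illustration, the two triple vectors of (1) and (2) and the value sets of Table 1 could be represented as in the following minimal Python sketch; the type and variable names are ours and not part of the implementation.

# Illustrative sketch (not the authors' code): the two action triples from (1) and (2),
# with the feature values listed in Table 1.
from collections import namedtuple

Music = namedtuple("Music", ["genre", "mood", "volume"])
Lighting = namedtuple("Lighting", ["pattern", "color", "brightness"])

MUSIC_VALUES = {
    "genre": ["Blues", "Christian", "Classical", "Country", "Dance", "Electronic",
              "Folk", "Hip_Hop", "Holiday", "Jazz", "Latin", "New_Age", "Oldies",
              "Pop", "R_n_B", "Reggae", "Rock"],
    "mood": ["Calm", "Exciting", "Happy", "Neutral", "Sad"],
    "volume": ["Silence", "Low", "Medium", "High"],
}
LIGHTING_VALUES = {
    "pattern": ["Cloudy", "Starry", "Fireworks", "Solid"],
    "color": ["Purple", "Blue", "Gray", "Green", "Orange", "Red", "Yellow"],
    "brightness": ["Bright", "Medium", "Dark", "Off"],
}

# Example action vectors sent to the actor-motor:
music_action = Music("Classical", "Calm", "Low")
lighting_action = Lighting("Solid", "Blue", "Medium")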

1.2. Sensory-Motor

The sensory-motor detects changes in the activity and location of the user and provides the “decision-maker” with the state of the user, including the current activity, location and time, as in (3).

State (‘current time slot No.’, ’User Activity’, ’User Location’) (3)
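Similarly, the state triple of (3) can be sketched as follows; the field names are illustrative assumptions.

# Illustrative sketch of the state triple in (3).
from collections import namedtuple

State = namedtuple("State", ["time_slot", "activity", "location"])

# e.g. the 145th five-minute slot (around 12:00 pm), user sitting in the TV area
state = State(time_slot=144, activity="Sitting", location="TV-area")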

In our framework, image sensors are deployed to cover the regions of interest in the home. The positions of the sensors and their views are shown in Figure 2. Vision processing of the images recognizes the user’s location and activity. The following sections present the algorithms used for activity and location detection.

Figure 2. a. Locations of installed cameras. b. View of each camera.


1.2.1. Location

Adaptive background subtraction is first applied to extract the user as a moving object from the static background. If the user is seen from multiple cameras, his/her location is calculated from the calibration parameters and the foreground bounding boxes, by back-projecting the bounding boxes and finding their intersection. Due to heavy occlusions of the user by furniture in the home, it is not reliable to calculate the position if only one camera sees the user. But in the hallway (the view of camera 1) there is no occlusion, so when foreground is detected in this region, the bottom of the foreground is taken to lie on the ground plane and the user’s position can be calculated. We label a mapping from position coordinates to semantic location definitions. The list of semantic locations in the setup is as follows:

“Kitchen”, “Hallway”, “Dining-table”, “TV-area”
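As an illustration of the single-camera case described above (the hallway view of camera 1), the following Python/OpenCV sketch extracts the foreground, takes the bottom of its bounding box as a ground-plane point, and maps it to a semantic location. The homography and the region boundaries are placeholders, not the calibration used in our setup.

# Hedged sketch of the hallway (camera 1) case: background subtraction, then the
# bottom of the foreground bounding box is assumed to lie on the ground plane and
# is mapped to floor coordinates with a calibration homography.
import cv2
import numpy as np

bg_subtractor = cv2.createBackgroundSubtractorMOG2()
H_image_to_floor = np.eye(3)  # placeholder image-to-floor homography from calibration

def locate_user(frame):
    fg = bg_subtractor.apply(frame)
    fg = cv2.medianBlur(fg, 5)
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    foot = np.array([[[x + w / 2.0, y + h]]], dtype=np.float32)  # bottom of the box
    floor_xy = cv2.perspectiveTransform(foot, H_image_to_floor)[0, 0]
    return to_semantic_location(floor_xy)

def to_semantic_location(xy):
    # Placeholder mapping from floor coordinates to the semantic labels listed above.
    x, y = xy
    if x < 2.0:
        return "Hallway"
    return "Kitchen" if y < 3.0 else "TV-area"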

Only those locations which affect the decision should be reported in the state. Other locations increase the state-action space and slow down the learning convergence speed while not contributing to the system functions. This consideration also applies to analyzing user activities.

1.2.2. Activity

In our implementation, seven activities are considered: “Standing”, “Sitting”, “Lying”, “Walking”, “Eating”, “Studying”, and “Watching”.

At first, based on the height and aspect ratio of the foreground, a conditional random field [9] classifies the user’s activity into three major classes: Standing, Sitting and Lying. Global motion is then used as a discriminator of “Walking” when the user is standing. Local motion is used as a discriminator of “Eating” and “Studying” when the user is sitting at the dining table (Figure 3). Face detection is used as an indication of “Watching”.

In each view the most probable activity is selected as current activity.
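A simplified sketch of this hierarchy is given below. The thresholds, and the rule that eating produces more local motion than studying, are illustrative assumptions; the simple pose rule stands in for the trained CRF.

# Hedged sketch of the hierarchy in Figure 3. classify_pose() is a stand-in for the
# CRF classifier that the paper trains on height and aspect ratio of the foreground.
def classify_pose(bbox_height, aspect_ratio):
    # Assumption: tall and narrow -> Standing, wide -> Lying, otherwise Sitting.
    if aspect_ratio < 0.6 and bbox_height > 120:
        return "Standing"
    if aspect_ratio > 1.5:
        return "Lying"
    return "Sitting"

def classify_activity(bbox_height, aspect_ratio, global_motion, local_motion,
                      at_dining_table, face_detected):
    pose = classify_pose(bbox_height, aspect_ratio)
    if pose == "Standing":
        return "Walking" if global_motion else "Standing"
    if pose == "Sitting":
        if at_dining_table:
            # Assumption: eating produces more local motion than studying.
            return "Eating" if local_motion > 0.5 else "Studying"
        if face_detected:  # frontal face as an indication of "Watching"
            return "Watching"
        return "Sitting"
    return pose  # "Lying"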

Figure 3. Hierarchical activity recognition.


1.2.3. Time

The user’s preference for a service may vary from time to time during the day. The user may enjoy classical music when he sits at the dining table around 4:00 pm, while he may prefer silence when he sits there during his study time around 8:00 pm. In order to discriminate these situations we include time intervals as part of the state. In the proposed method we divide the day into 288 time slots of 5 minutes.

Using 288 time slots makes our state space huge. In that case it would be hard to learn an appropriate value for each (state, action) pair with a limited number of runs. Furthermore, the favorite service for each (state, action) pair is usually similar across a range of neighboring time slots. Hence we use tile coding [9][10] as a function approximation technique, which groups states that are neighbors in time and propagates evidence of one (state, action) pair to neighboring states in time.
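The following Python sketch illustrates the idea of tile coding over the 288 slots. The tile width and the number of tilings are illustrative choices, not the values used in our experiments.

# Hedged sketch of tile coding over the 288 five-minute slots: each slot activates a
# few overlapping tiles, so evidence for one (state, action) pair is shared with its
# temporal neighbors.
import numpy as np

N_SLOTS = 288
TILE_WIDTH = 6      # each tile spans 6 slots (30 minutes) -- assumption
N_TILINGS = 3       # overlapping tilings shifted by 2 slots -- assumption

def active_tiles(slot):
    """Return one tile index per tiling for a given time slot."""
    tiles = []
    for t in range(N_TILINGS):
        offset = t * (TILE_WIDTH // N_TILINGS)
        tiles.append((t, ((slot + offset) % N_SLOTS) // TILE_WIDTH))
    return tiles

n_tiles_per_tiling = N_SLOTS // TILE_WIDTH
weights = np.zeros((N_TILINGS, n_tiles_per_tiling))  # one table per action in practice

def q_value(slot):
    # The value of a (time-slot, action) pair is the sum of its active tile weights.
    return sum(weights[t, i] for t, i in active_tiles(slot))

def update(slot, td_error, alpha=0.2):
    # Updating one slot also moves the estimate for neighboring slots.
    for t, i in active_tiles(slot):
        weights[t, i] += alpha / N_TILINGS * td_error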

1.3. User Feedback

The amount of satisfaction with each decision is measured by the user’s feedback. The feedback may be encouraging (positive values of r in Figure 1) or penalizing (negative values of r). The feedback can be received directly through explicit input from the user, or it may be estimated implicitly.

In our implementation, the user explicitly rewards a decision by inputting 0 and penalizes it by inputting -1. If the user does not rate the decision, the system estimates the penalty implicitly as in (4):

penalty = (time the user stayed with the provided service) / (total time) − 1        (4)

For the lighting service, the total time is the difference between the start time and the end of the current time slot, as shown in Figure 4. The decision maker receives explicit or implicit user feedback with every change in the state of the environment, or when the music or the lighting pattern changes. The decision maker tries to discover the best action per state that maximizes the overall user feedback. Reinforcement learning algorithms are effective methodologies for our aim of learning the user’s preferences dynamically from feedback and keeping the chosen actions up to date.

Figure 4. Calculating the implicit penalty for the lighting service. Ai, i = 1..6, are the ranges of time the user stayed with the lighting pattern in each time slot; Bi, i = 1..6, represent the total time of each slot.
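A minimal sketch of the implicit penalty computation, using the quantities of Figure 4, is shown below. The formula follows the reconstruction in (4) and should be read as an approximation rather than the exact implementation.

# Hedged sketch of the implicit penalty in (4): A_i is the time the user stayed with
# the provided lighting pattern in slot i, B_i is the total time of slot i.
def implicit_penalty(stay_times, slot_lengths):
    """Return a value in [-1, 0]: 0 if the user kept the service the whole time,
    approaching -1 if the user changed it (or left) immediately."""
    stayed = sum(stay_times)       # sum of A_i
    total = sum(slot_lengths)      # sum of B_i
    return stayed / total - 1.0

# Example: the user kept the lighting pattern for 3 of the 5 minutes of a slot.
print(implicit_penalty([3.0], [5.0]))   # -0.4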


1.4. Reinforcement Learning

The temporal difference (TD) class of reinforcement learning applies to smart-home applications where it is hard to define a model of the user’s behavior. TD, as an unsupervised learning method, does not require a predefined model and learns the best action for each state through dynamic and continuous interaction of the system with the user. The interaction takes a trial-and-error form that increases an objective function with conflicting goals, including the user’s comfort and the minimization of the total cost, defined as a penalty. The overall TD procedure is shown in Figure 5.

Here a and a’ denote the actions taken by the algorithm in the subsequent states s and s’, as described above. The feedback of the user or the environment is represented by the parameter r. Q(s, a) records the relative rate of satisfaction with action a taken in state s for the different state-action pairs, and is updated as shown in line 5 of Figure 5 at each observation. To avoid frustrating the user with too many non-optimal choices, the algorithm probes the user’s feedback by occasionally taking different actions with probability ε. This policy, called the “ε-policy”, balances exploitation of high-valued actions against exploration of new preferred actions. As a consequence, the system can adapt itself to behavioral changes in the user’s preferences and in the home through the received feedback.

2. Experimental Results

In this section the convergence of the proposed method in different scenarios is evaluated. In order to implement our proposed method, we used the OpenCV library [11] for the vision part and CRF++, a conditional random field library provided by [12]. We used ten sequences captured in Philips’ HomeLab [4] and repeated them frequently in order to train our system.

Figure 5. Temporal difference algorithm.

1  Initialize Q(s, a) arbitrarily.
2  Do forever:
3      Take action a, observe r and s'.
4      Choose either the action a' with the highest Q(s', a'), or a random action with probability ε.
5      Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]
6      s ← s', a ← a'
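For clarity, the loop of Figure 5 can be sketched in Python as follows. The observe_state, apply_action and get_feedback functions are placeholders standing in for the sensory-motor, actor-motor and feedback components, and the discount factor γ is an assumed value.

# Hedged sketch of the update loop in Figure 5 (a SARSA-style temporal-difference rule)
# with epsilon-greedy action selection.
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.2, 0.9, 0.002   # gamma is an assumed discount factor
Q = defaultdict(float)                     # Q[(state, action)]

def choose_action(state, actions):
    if random.random() < EPSILON:                      # exploration (epsilon-policy)
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])   # exploitation

def run(actions, observe_state, apply_action, get_feedback):
    s = observe_state()
    a = choose_action(s, actions)
    while True:
        apply_action(a)
        r = get_feedback()            # explicit (0 / -1) or implicit penalty
        s_next = observe_state()
        a_next = choose_action(s_next, actions)
        # Line 5 of Figure 5: Q(s,a) <- Q(s,a) + alpha [ r + gamma Q(s',a') - Q(s,a) ]
        Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next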


Figure 6. Implemented system architecture.

Microsoft Media Player 11.0 was used as the actor for the music service, and Microsoft PowerPoint 2007 was used as the actor for the lighting service, as shown in Figure 6. The user could change the current music and lighting configuration through a web-based platform available on his personal computer or PDA, as shown in Figure 7. The web interface can receive the user’s reward explicitly or calculate it implicitly.

Figure 8 shows some lighting patterns used in the implementation.

Figure 7. Web-based user interface to get the user’s preferences and feedback. a. On a personal computer. b. On a personal digital assistant.


Figure 8. Some patterns of lighting captured at the corner of a living room.

The diagram shown in Figure 9 demonstrates convergence of the implemented algorithm when the user prefers classical music at 12:00 pm, independent of location and of the activity that s/he was doing. The horizontal axis shows the number of time slots in which the system monitors the user, and the vertical axis shows the cumulative penalty received over those intervals. The following parameters were used in the implementation:

α = 0.2
ε = 0.002 with a decay factor of 0.9

where α is the learning rate and ε is the randomness factor in the ε-policy exploration method. Assuming the user’s preference does not change in the long term, ε can decay to zero; here the decay factor is set to 0.9 for each time slot.
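A small sketch of this decaying schedule, assuming the decay is applied once per time slot:

# epsilon starts at 0.002 and is multiplied by 0.9 at each time slot, so exploration
# fades away when the preference is assumed stable in the long term.
epsilon, decay = 0.002, 0.9

def epsilon_after(n_slots):
    return epsilon * decay ** n_slots

print(epsilon_after(50))   # ~1.0e-5: exploration has effectively stopped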

Figure 9. Cumulative penalty versus time slots when the user prefers one out of 9 types of music at 12:00 pm, independent of location and activity.


The convergence point for the experiment shown in Figure 9 is around 7000¹. Considering 15 minutes for each time interval, the algorithm needs 4 years and 288 days to converge. Because the user has uncertainty, and because there may be non-unique preferences, the algorithm should always keep the “exploration process” active (i.e., ε > 0). Such a policy increases the overall penalty for a user who does not change his preferences (e.g., in Figure 9), as shown in Figure 10, but prevents the system from converging to a non-optimal decision. Using small but fixed values of ε allows fast convergence while retaining exploration, as will be seen in the following experiments.

In case of a change in user preferences, α plays a considerable role in adaptation. Figure 11 shows two plots for the case where the user changes his/her favorite music in the middle of the experiment, with α = 0.2. With the decaying ε-policy it takes 12 days, and with the non-decaying ε-policy it takes 10 days, for the system to adapt itself to the newly desired music.

Figure 10. Cumulative penalty versus time slots for the user of the first experiment when ε is fixed to 0.002.

¹ The point from which the overall slope of the diagram stops changing.



Figure 11. Cumulative penalty versus time slots when the user changes his/her preference in the middle of the experiment. a. using decaying ε-policy, b. using fixed ε-policy.

It is not rare for the user to show uncertainty in rewarding a type of music or a lighting pattern. We model such uncertainty with Gaussian distributions. Figure 12 shows the average cumulative penalty over 10 experiments versus time for different values of α when selecting the preferred mood of music. The reward for the desired music mood was simulated by a zero-mean normal distribution with variance 0.5; all non-desired moods were penalized by a Gaussian distribution with mean -1 and variance 0.5. Figure 12 shows the effect of α on the speed of convergence and on robustness against noise.
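The simulated noisy feedback can be sketched as follows; the means and variances are those stated above, and numpy’s scale parameter is a standard deviation.

# Hedged sketch of the simulated feedback used for Figure 12: the desired music mood
# is rewarded from N(0, 0.5) and every other mood is penalized from N(-1, 0.5).
import numpy as np

rng = np.random.default_rng()

def noisy_feedback(chosen_mood, desired_mood, variance=0.5):
    mean = 0.0 if chosen_mood == desired_mood else -1.0
    return rng.normal(loc=mean, scale=np.sqrt(variance))

print(noisy_feedback("Calm", "Calm"))      # sample near 0
print(noisy_feedback("Exciting", "Calm"))  # sample near -1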

Figure 12. Averaged cumulative penalty over 10 experiments versus time in the case of uncertainty. Bright black: α = 0.01; gray: α = 0.5; black: α = 0.9.


With a small α the system converges slowly but yields a smoother curve, which means the system is more robust against uncertainty, while the curve for a larger α oscillates more but converges more rapidly. As α increases toward 1, the decision policy becomes more sensitive to the most recent rewards. The cumulative penalty tends toward a linear curve in the case of high uncertainty, as shown in black in Figure 12.

3. Conclusion

In this article we showed the use of the temporal difference method for autonomous learning of the user’s preferences in a smart home application. We demonstrated the effect of each parameter on the quality and speed of learning, even in the case of uncertainty in the user’s feedback or when the user changes his preferences. The proposed method is unsupervised and adaptive, and autonomously learns the optimal satisfactory decision policy through continuous interactions with the user. The proper parameter values of the learning algorithm should be selected by the designer according to the characteristics of the user and the desired characteristics of the system.

References

[1] A. Helal, M. Mokhtari, and B. Abdulrazak, The Engineering Handbook on Smart Technology for Aging, Disability and Independence, John Wiley & Sons, Computer Engineering Series, New York, USA, 2008.

[2] M.C. Mozer, Lessons from an adaptive house, in D. Cook and R. Das (Eds.), Smart Environments: Technologies, Protocols, and Applications, John Wiley & Sons, Hoboken, NJ, 2005.

[3] G.D. Abowd, E.D. Mynatt, and T. Rodden, The human experience [of ubiquitous computing], IEEE Pervasive Computing 1 (2002), 48-57.

[4] Website: http://www.research.philips.com/focused/experiencelab.html, 2009.

[5] B.D. Ruyter, E.V. Loenen, and V. Teeven, User centered research in ExperienceLab, Ambient Intelligence 4794 (2007), 305-313.

[6] S.S. Intille, K. Larson, E. Munguia Tapia, J. Beaudin, P. Kaushik, J. Nawyn, and R. Rockinson, Using a live-in laboratory for ubiquitous computing research, PERVASIVE 3968 (2006), 349-365.

[7] D. Cook, M. Schmitter-Edgecombe, A. Crandall, C. Sanders, and B. Thomas, Collecting and disseminating smart home sensor data in the CASAS project, CHI Workshop on Developing Shared Home Behavior Datasets to Advance HCI and Ubiquitous Computing Research, 2009.

[8] P. Rashidi and D.J. Cook, Keeping the resident in the loop: Adapting the smart home to the user, IEEE Transactions on Systems, Man, and Cybernetics, Part A (2009), 1-9.

[9] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas, Conditional models for contextual human motion recognition, International Conference on Computer Vision 2 (2005), 1808-1815.

[10] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, USA, 1998.

[11] Website: http://sourceforge.net/projects/opencvlibrary, 2009.

[12] Website: http://crfpp.sourceforge.net, 2009.