SEMINAR REPORT 2012
COMMUNICATION ROBOT SYSTEM BASED ON THE HANDSHAKING ACTION
DEPARTMENT OF E&I, VJEC, CHEMPERI
CHAPTER 1
INTRODUCTION
Robots have become indispensable in industrial production, and expectations are high for intelligent robot systems that work cooperatively with human beings in daily life, medical treatment, and welfare. Human-robot interaction is essential for the operation of robots by people: anyone can operate a robot with ease by giving it commands through gestures, just as people communicate with gestures among themselves. An intelligent manipulator system using gesture recognition has been developed, and omnidirectional images have been used for robot control based on hand gestures. A communication robot system based on stereo vision and voice instructions has also been developed, and a control algorithm for a service robot performing a hand-over task has been proposed. This report discusses human-robot interaction based on the handshaking action. We developed a communication robot, HAKUEN, composed of a multimedia robot with a stereo camera, a wheeled mobile robot, and a PC with a microphone. HAKUEN approaches the operator and holds out its hand according to voice commands. It detects the operator's face based on the pixel values of the flesh tint in the color image, and uses stereo disparity to calculate the distance between the robot and the operator. The effectiveness of the system is demonstrated by several experimental results.
CHAPTER 2
LITERATURE SURVEY
According to the UNECE 2004 World Robotics survey, worldwide investment in industrial robots rose 19% in 2003, and in the first half of 2004 orders for robots were up another 18% to the highest level ever recorded. Worldwide growth over 2004-2007 was forecast at an average annual rate of about 7%, with over 600,000 household robots in use and several million expected in the next few years. From this press release we can see that household (service) robots are becoming popular, which gives researchers more interest in making service robots user-friendly in a social context. Speech Recognition (SR) technology gives researchers the opportunity to add Natural Language (NL) communication with robots in a natural way. Thus the promise of robots that behave more like humans (at least from the perception-response point of view) is starting to become a reality [28]. Brooks' research [5] is an example of humanoid robot development that raised several research issues; among them, an important one is to develop machines that have human-like perception.
CHAPTER 3
ABOUT ROBOT
The term robot generally connotes some anthropomorphic (human-like) appearance; consider robot arms for welding. The main goal of robotics is to build robot workers smart enough to replace humans in labor or in any kind of dangerous task that could be harmful to people. The idea of a robot made of mechanical parts came from science fiction. Three classical films, Metropolis (1926), The Day the Earth Stood Still (1951), and Forbidden Planet (1956), cemented the connotation that robots were mechanical in origin, ignoring the biological origins in Capek's play. To work as a replacement for humans, a robot needs some intelligence in order to function autonomously, and AI (Artificial Intelligence) gives us the means to meet this requirement. Three paradigms are followed in AI robotics, depending on the problem: Hierarchical, Reactive, and Hybrid deliberative/reactive. Applying the right paradigm makes problem solving easier. An overview of the three paradigms, in terms of the three commonly accepted robotic primitives, is given in Figure 3.1. In our project we follow the Hybrid reactive paradigm.
Fig 3.1: Three paradigms: a) Hierarchical b) Reactive c) Hybrid reactive
CHAPTER 4
ROBOT CONSTRUCTION
We developed the communication robot HAKUEN, shown in Figure 4.1. The system is composed of a multimedia robot with a stereo camera, a wheeled mobile robot, and a PC with a microphone. HAKUEN has two arms, each with six degrees of freedom, and the head of the multimedia robot has two degrees of freedom. Several LEDs are mounted around the robot's eyes, and the base is a two-wheeled mobile robot. When the operator gives a voice command, HAKUEN approaches and holds out its hand toward the operator; it moves according to the operator's voice commands. We implemented four motion functions for HAKUEN, described below.
(1) Face tracking function
HAKUEN moves its head in order to follow the operator's face motion. We call this motion the "face tracking function". The operator's face is detected based on the pixel values of the flesh tint in the color image.
(2) Handshaking function
HAKUEN holds out its right hand toward the operator so that the operator can shake the robot's hand. We call this motion the "handshaking function".
(3) Voice recognition function
HAKUEN moves according to the operator's voice commands. We call this motion the "voice recognition function". We use voice recognition software (ViaVoice, IBM), controlled through an ActiveX program, to recognize the voice commands.
(4) Approach function
We consider the suitable distance range between HAKUEN and the operator to be 0.6 m to 1.2 m. The robot approaches the operator and maintains this distance. We call this motion the "approach function".
Our assistive robot system is shown in Figure 4.1. It is composed of the manipulator, a PC, a microphone, and stereo vision hardware. The manipulator used here has six degrees of freedom and a mechanical hand. Since the system has to recognize the position and posture of the hand in real time, we use dedicated stereo vision hardware. The operator gives hand gestures to the manipulator conversationally: for example, when the operator points at an object with the forefinger and gives a voice instruction indicating the target, the manipulator picks up the object and hands it over to the operator.
Fig 4.1: The HAKUEN robot
CHAPTER 5
FACE TRACKING FUNCTION
First, HAKUEN has to detect the human face in the color image. The face is detected based on the pixel values of the flesh tint. The color image is digitized as 24-bit RGB (Red, Green, Blue) pixel values, so each RGB element has 8 bits, or 256 levels of brightness. However, RGB values are easily influenced by lighting. Therefore, we use the HLS (hue, lightness, saturation) color specification system in order to detect the human face more robustly. Each element of the HLS system is calculated from the RGB pixel values. To detect the face, we transform the color image into a binary image based on threshold values in the HLS system; the threshold values for the flesh tint were determined through experiment.
The face detection system operates in two stages: it first applies a set of neural network-based filters to an image, and then uses an arbitrator to combine their outputs. The filters examine each location in the image at several scales, looking for locations that might contain a face; the arbitrator then merges detections from individual filters and eliminates overlapping detections.

The first component of the system is a filter that receives as input a 20x20 pixel region of the image and generates an output ranging from 1 to -1, signifying the presence or absence of a face, respectively. To detect faces anywhere in the input, the filter is applied at every location in the image. To detect faces larger than the window size, the input image is repeatedly reduced in size (by subsampling), and the filter is applied at each size. The filter must have some invariance to position and scale, and the amount of invariance determines the number of scales and positions at which it must be applied. For the work presented here, the filter is applied at every pixel position, and the image is scaled down by a factor of 1.2 for each step in the pyramid. First, a preprocessing step, adapted from [21], is applied to a window of the image; the window is then passed through a neural network, which decides whether the window contains a face. The preprocessing first attempts to equalize the intensity values across the window. A function that varies linearly across the window is fitted to the intensity values in an oval region inside the window. Pixels outside the oval may represent the background, so those intensity values are ignored in computing the lighting variation across the face. The linear function approximates the overall brightness of each part of the window and can be subtracted from the window to compensate for a variety of lighting conditions. Then histogram equalization is performed, which non-linearly maps the
intensity values to expand the range of intensities in the window. The histogram is computed for pixels inside an oval region in the window. This compensates for differences in camera input gains, as well as improving contrast in some cases. The preprocessed window is then passed through a neural network. The network has retinal connections to its input layer. There are three types of hidden units: 4 which look at 10x10 pixel subregions, 16 which look at 5x5 pixel subregions, and 6 which look at overlapping 20x5 pixel horizontal stripes. Each of these types was chosen to allow the hidden units to detect local features that might be important for face detection. In particular, the horizontal stripes allow the hidden units to detect features such as mouths or pairs of eyes, while the hidden units with square receptive fields might detect features such as individual eyes, the nose, or corners of the mouth. Although a single hidden unit can be shown for each subregion of the input, these units can be replicated; for the experiments described later, networks with two and three sets of these hidden units are used. Similar input connection patterns are commonly used in speech and character recognition tasks [10, 24]. The network has a single, real-valued output, which indicates whether or not the window contains a face.
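The pyramid construction described above can be sketched as follows. Only the sequence of level sizes is computed here; the 20x20 window size and 1.2 scale factor come from the text, while everything else is an illustrative simplification.

```python
def pyramid_sizes(width, height, window=20, scale=1.2):
    """Return the (width, height) of each level of the image pyramid,
    shrinking by `scale` per step until the image no longer fits the
    detector's 20x20 window."""
    sizes = []
    w, h = width, height
    while w >= window and h >= window:
        sizes.append((w, h))
        w, h = int(w / scale), int(h / scale)
    return sizes
```

The 20x20 filter would then be slid over every pixel position of every level, so that a face larger than the window at full resolution becomes window-sized at some coarser level.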
Fig 4.1: Detection of the flesh tint, case 1: (a) saturation threshold constant, S = 40-255 (b) saturation threshold adjusted, S = 70-255
Fig 4.2: Detection of the flesh tint, case 2: (a) saturation threshold constant, S = 40-255 (b) saturation threshold adjusted, S = 20-255
5.2 STAGE TWO: MERGING OVERLAPPING DETECTIONS AND
ARBITRATION
The raw output from a single network will contain a number of false detections. In this section, we present two strategies to improve the reliability of the detector: merging overlapping detections from a single network, and arbitrating among multiple networks.

5.2.1 Merging Overlapping Detections
Note that in Fig. 3, most faces are detected at multiple nearby positions or scales, while false detections occur with less consistency. This observation leads to a heuristic which can eliminate many false detections. For each location and scale, the number of detections within a specified neighborhood of that location can be counted. If the number is above a threshold, then that location is classified as a face. The centroid of the nearby detections defines the location of the detection result, thereby collapsing multiple detections. In the experiments, this heuristic is referred to as thresholding.

If a particular location is correctly identified as a face, then all other detection locations which overlap it are likely to be errors and can therefore be eliminated. Based on the above heuristic regarding nearby detections, we preserve the location with the higher number of detections within a small neighborhood, and eliminate locations with fewer detections. In the discussion of the experiments, this heuristic is called overlap elimination. There are relatively few cases in which this heuristic fails; one such case is illustrated by the left two faces in Fig. 3B, where one face partially occludes another.

The implementation of these two heuristics is illustrated in Fig. 6. Each detection at a particular location and scale is marked in an image pyramid, labelled the output pyramid. Then each location in the pyramid is replaced by the number of detections in a specified neighborhood of that location. This has the effect of spreading out the detections; the neighborhood extends an equal number of pixels in the dimensions of scale and position. A threshold is applied to these values, and the centroids (in both position and scale) of all above-threshold regions are computed. All detections contributing to a centroid are collapsed down to a single point. Each centroid is then examined in order, starting from the ones with the highest number of detections within the specified neighborhood. If any other centroid locations represent a face overlapping the current centroid, they are removed from the output pyramid. All remaining centroid locations constitute the final detection result. In the face detection work described in [3], similar observations about the nature of the outputs were made, resulting in the development of similar heuristics.
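A position-only sketch of the thresholding heuristic (ignoring the scale dimension for brevity) might look like this. The neighborhood radius and count threshold are illustrative assumptions, not values from the paper.

```python
def merge_detections(detections, radius=2, threshold=2):
    """Collapse nearby (x, y) detections: count neighbors within `radius`
    (Chebyshev distance), keep clusters whose count reaches `threshold`,
    and replace each surviving cluster with its centroid."""
    def near(a, b):
        return max(abs(a[0] - b[0]), abs(a[1] - b[1])) <= radius

    results = []
    used = set()
    for i, d in enumerate(detections):
        if i in used:
            continue
        cluster = [j for j, e in enumerate(detections) if near(d, e)]
        if len(cluster) >= threshold:
            xs = [detections[j][0] for j in cluster]
            ys = [detections[j][1] for j in cluster]
            results.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            used.update(cluster)
    return results
```

An isolated detection (count 1) falls below the threshold and is discarded as a likely false positive, while a pair of adjacent detections collapses to one centroid.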
5.2.2 Arbitration among Multiple Networks
To further reduce the number of false positives, we can apply multiple networks, and arbitrate
between their outputs to produce the final decision. Each network is trained in a similar manner, but with
random initial weights, random initial nonface images, and permutations of the order of presentation of the
scenery images. As will be seen in the next section, the detection and false positive rates of the individual
networks will be quite close. However, because of different training conditions and because of self-
selection of negative training examples, the networks will have different biases and will make different
errors.. Each detection at a particular position and scale is recorded in an image pyramid, as was done with
the previous heuristics. One way to combine two such pyramids is by ANDing them. This strategy signals
a detection only if both networks detect a face at precisely the same scale and position. Due to the
different biases of the individual networks, they will rarely agree on a false detection of a face. This
allows ANDing to eliminate most false detections. Unfortunately, this heuristic can decrease the detection
rate because a face detected by only one network will be thrown out. However, we will see later that
individual networks can all detect roughly the same set of faces, so that the number of faces lost
due to ANDing is small.
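Modelling each output pyramid as a set of (scale, x, y) detections, the ANDing and ORing combinations reduce to set operations; this is a simplification for illustration, not the paper's implementation.

```python
def and_pyramids(pyr_a, pyr_b):
    """AND two detection pyramids: keep only locations flagged by both
    networks at exactly the same scale and position."""
    return pyr_a & pyr_b

def or_pyramids(pyr_a, pyr_b):
    """OR two detection pyramids: a location counts if either network flags it."""
    return pyr_a | pyr_b
```

ANDing trades recall for precision (a face seen by only one network is lost), while ORing does the opposite.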
Similar heuristics, such as ORing the outputs of two networks or voting among three networks, were also tried. Each of these arbitration methods can be applied before or after the thresholding and overlap elimination heuristics. If applied afterwards, we combine the centroid locations rather than the actual detection locations, and require them to be within some neighborhood of one another rather than precisely aligned.
Arbitration strategies such as ANDing, ORing, or voting seem intuitively reasonable, but perhaps there are less obvious heuristics that could perform better. To test this hypothesis, a separate neural network was applied to arbitrate among multiple detection networks. For a location of interest, the arbitration network examines a small neighborhood surrounding that location in the output pyramid of each individual network. For each pyramid, the number of detections in a 3x3 pixel region at each of three scales around the location of interest is counted, resulting in three numbers for each detector; these are fed to the arbitration network, as shown in Fig. 8. The arbitration network is trained to produce a positive output for a given set of inputs only if that location contains a face, and a negative output for locations without a face.

When HAKUEN looks down, the total saturation value in the color image increases.
Therefore, the threshold values of saturation and value are adjusted automatically based on the total values of saturation and value in the color image. Examples of flesh-tint detection using image processing are shown in Figure 4.1 and Figure 4.2. Figure 4.1(a) is the case of a constant saturation threshold when HAKUEN looks down: since the color of the floor is similar to the flesh tint, the floor area is also detected as flesh tint. In Figure 4.1(b), the flesh-tint area is detected correctly, because the thresholds of saturation and value are adjusted automatically based on their totals in the color image. After the flesh-tint area is detected, the human face is recognized by considering the maximum area and the circularity of the flesh-tint regions. We set the threshold value of the circularity to 0.1.
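The circularity test on candidate flesh-tint regions could be sketched as below. The report does not state its exact circularity formula, so the standard definition 4*pi*A/P^2 (1.0 for a perfect circle) is shown here as an assumption; the 0.1 threshold is the report's value.

```python
import math

def circularity(area, perimeter):
    """Standard circularity measure: 4*pi*A / P^2 (assumed formula).
    Equals 1.0 for a perfect circle and approaches 0 for elongated shapes."""
    return 4.0 * math.pi * area / (perimeter ** 2)

def is_face_candidate(area, perimeter, threshold=0.1):
    """Accept a flesh-tint region as a face candidate if it is round enough."""
    return circularity(area, perimeter) >= threshold
```

A round region (e.g. a face) passes easily, while a long thin region such as an arm or a floor strip is rejected.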
5.3 EXPERIMENT OF THE FACE TRACKING FUNCTION
The face tracking function makes HAKUEN move its head so as to keep the operator's face at the center of the image. The relative locations of the operator and HAKUEN are shown in Figure 5.1. The number of operators was five, and each operator gave the voice command thirty times. The system detected the face in all cases, and the average face detection time was 18.16 s.

Fig 5.1: Experiment on the face tracking function
CHAPTER 6
DISTANCE DETECTION USING STEREO IMAGE
Since HAKUEN has two cameras, we use the disparity of the stereo image pair to detect the distance between HAKUEN and the operator. As shown in Figure 6.1, the disparity is the difference between the target object's position in the right image and its position in the left image. When the positions of the two cameras are fixed, the disparity changes according to the distance L between the object and the camera, so we can obtain the distance L between the robot and the operator from the disparity.

Fig 6.1: Disparity between the two images

As the figure shows, the disparity decreases as the distance between the camera and the object increases. The disparity is the difference, in pixels, between the center positions of the human face in the right and left images. The relation between the disparity and the distance to the object was obtained through experiment.
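The disparity-to-distance relation can be sketched with the standard pinhole stereo formula L = f * B / d. The focal length (in pixels) and baseline below are hypothetical values; the report instead calibrated the relation experimentally.

```python
def distance_from_disparity(disparity_px, focal_px=500.0, baseline_m=0.1):
    """Pinhole stereo relation L = f * B / d: distance is inversely
    proportional to the measured disparity (both cameras fixed)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px
```

Doubling the disparity halves the estimated distance, matching the observation that disparity decreases as the object moves away.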
6.1 EXPERIMENTS OF THE APPROACH FUNCTION

The approach function makes HAKUEN approach the operator and keep a suitable distance (0.6 m to 1.2 m). We defined five cases of initial distance (1.5 m, 2.0 m, 2.5 m, 3.0 m, 3.5 m) between HAKUEN and the operator. Each case of the experiment was run thirty times, and the average success rate was 88.33%. An example of the approach function is shown in Figure 6.2.
(a) initial state (b) approach function
Fig 6.2: Experiment on the approach function
CHAPTER 7
ROBOT HANDSHAKING ACTION
7.1 DETECTION OF THE HAND
First, the system has to detect the hand area in the image of the workspace. The hand area is detected based on the RGB pixel values of the flesh tint in the colour image. Since RGB values are easily influenced by lighting, we use the hue of the flesh tint to reduce this influence: the flesh-tint area is detected roughly in the colour image using the hue value, and the noise is removed using the RGB values.
Fig 7.1: Robot system
After the hand area is detected using the RGB values and the hue value of the colour image, we determine the center position of the hand, called the CP, in order to trace the hand. Since the size of a human fist is approximately equal to a sphere with a radius of 40 mm, the system searches for the center of the sphere with the maximum density of flesh-tint pixels. The center of this sphere is regarded as the CP of the hand. Once the CP is detected, the hand is traced by tracking the CP.
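A 2-D sketch of the CP search might look like the following: the pixel whose surrounding window contains the most flesh-tint pixels is taken as the CP. The square window is a stand-in for the 40 mm sphere, and the pixel radius would in practice be derived from the camera calibration; both are assumptions here.

```python
def find_cp(mask, radius=4):
    """Locate the hand center (CP): scan a 0/1 flesh-tint mask and return
    the (x, y) whose square window of the given radius holds the most
    flesh-tint pixels (a density-maximum search)."""
    best, best_count = None, -1
    h, w = len(mask), len(mask[0])
    for y in range(h):
        for x in range(w):
            count = sum(
                mask[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            )
            if count > best_count:
                best, best_count = (x, y), count
    return best
```

Once the CP is found in one frame, tracking can restrict this search to a neighborhood of the previous CP.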
7.2 RECOGNITION OF THE HAND GESTURE
As shown in Figure 7.2, we define several instructions using hand configurations, and make the manipulator move in accordance with these hand-gesture instructions. For example, when the operator opens the hand upwards (Inst. 2), the manipulator delivers the object.

Inst.1 Grasp Inst.2 Deliver the object Inst.3 Approach Inst.4 Stand by
Fig 7.2.Instructions of hand gestures
We define three characteristic dimensions (A, B, and C) of the hand in order to recognize the hand gesture rapidly. As shown in Figure 7.3, the hand gestures are divided into branches based on conditions on these lengths. Length A is the distance from the CP to the tip of the forefinger, length B is the maximum width of the hand block, and length C is the maximum width of the finger block. For example, if length A is less than 60 mm, we consider that the operator has closed the hand, and the gesture means Instruction 1; if length A is more than 60 mm, we calculate length B. Because we use the three characteristic dimensions rather than the whole hand configuration, the gesture is determined rapidly.
Fig 7.3: General flow of the recognition of the hand gesture
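The decision tree over the three dimensions can be sketched as below. Only the first branch (A < 60 mm means a closed fist, Instruction 1) is stated in the report; the remaining branches and their thresholds are hypothetical placeholders.

```python
def classify_gesture(a_mm, b_mm, c_mm, b_thresh=80, c_thresh=25):
    """Decision-tree gesture classifier over the three hand dimensions:
    A (CP to forefinger tip), B (hand-block width), C (finger-block width).
    Returns the instruction number 1-4."""
    if a_mm < 60:
        return 1        # closed fist: Inst.1 Grasp (threshold from the report)
    if b_mm >= b_thresh:
        return 2        # wide open hand: Inst.2 Deliver the object (assumed)
    if c_mm >= c_thresh:
        return 3        # Inst.3 Approach (assumed)
    return 4            # Inst.4 Stand by (assumed)
```

Because only three scalar comparisons are needed, the classification is fast compared with matching the whole hand configuration.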
7.3 EXPERIMENTS OF THE HAND SHAKING FUNCTION
The handshaking function makes HAKUEN hold out its right hand toward the operator when it is at a suitable distance (0.6 m to 1.2 m) from the operator. We defined four cases of distance (0.4 m, 0.8 m, 1.0 m, 1.4 m) and three cases of angle (-20°, 0°, 20°) between HAKUEN and the operator. Each case of the experiment was run thirty times. The percentage of success is shown in the table; the average success rate was 97.63%.

Fig 7.4: Experiment on the handshaking function
CHAPTER 8
VOICE INSTRUCTIONS
Speech Recognition technology promises to change the way we interact with machines (robots, computers, etc.) in the future. The technology is maturing day by day, and scientists are still working hard to overcome its remaining limitations. Nowadays it is being introduced into many important areas of the social context: in aerospace, where the training and operational demands on the crew have increased significantly with the proliferation of technology [27], and in the operating theater, as a surgeon's aid to control lights, cameras, pumps, and equipment by simple voice commands [1]. Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, into a set of words [8]. There are two important parts to speech recognition: i) recognizing the series of sounds, and ii) identifying the words from the sounds. The recognition technique also depends on many parameters: speaking mode, speaking style, speaker enrollment, vocabulary size, language model, perplexity, transducer, etc. [8]. There are two speaking modes: one word at a time (isolated-word speech) and continuous speech. Depending on speaker enrollment, speech recognition systems can also be divided into speaker-dependent and speaker-independent systems: in a speaker-dependent system the user needs to train the system before using it, whereas a speaker-independent system can identify any speaker's speech. Vocabulary size and the language model are also important: language models, or artificial grammars, are used to constrain the word combinations in a series of words or sounds, and the vocabulary should be kept to a suitable size.

The system does not determine the position of the target object by image processing when many objects lie on the table; instead, it recognizes the configuration and colour of the target object from the voice instruction. For example, when the operator gives the voice instruction "Take the red ball" to the manipulator, the position of the red ball is determined in the workspace. The operator gives voice commands to the system in order to move the HAKUEN.

The voice commands are composed of simple words. We define six voice commands, which are shown in Table 8.1. For example, when the operator gives the voice command "a ku shu", the HAKUEN approaches the operator and holds out its right hand to shake hands with the operator. We use voice recognition software (ViaVoice, IBM) in order to recognize the voice commands.
Table 8.1: Voice commands
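Dispatching a recognized command string to a motion function can be sketched as a simple lookup table. Only "a ku shu" (handshake) and "i do o" (move) appear in the report; the remaining mapping entries, the handler names, and the fallback behaviour are hypothetical.

```python
# Hypothetical command table; the report defines six commands in total.
COMMANDS = {
    "a ku shu": "handshake",  # approach and hold out the right hand
    "i do o": "approach",     # approach the operator and stop
}

def dispatch(recognized_text):
    """Map the recognizer's output string to a motion function name,
    falling back to standing by for unrecognized commands."""
    return COMMANDS.get(recognized_text.strip().lower(), "stand_by")
```

The recognizer (ViaVoice in the report) would feed its output string into `dispatch`, and the returned name would select the motion function to execute.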
8.1 EXPERIMENTS OF THE VOICE RECOGNITION FUNCTION

The voice recognition function makes HAKUEN move according to the voice commands. We defined six voice commands. Each voice command was given forty times in the experiment, and the number of operators was five. The average recognition rate of the voice commands was 92.8%.
CHAPTER 9
TOTAL EXPERIMENT OF THE SYSTEM

We carried out a total experiment in order to clarify the effectiveness of the system. First, the operator gives the voice command "i do o" to the HAKUEN; HAKUEN then approaches the operator and stops at a suitable position in front of the operator. Next, the operator gives the voice command "a ku shu", and HAKUEN holds out its right hand toward the operator. We defined four cases of initial distance (1.5 m, 2.0 m, 2.5 m, 3.0 m) between HAKUEN and the operator. Each case of the experiment was run thirty times, and the average success rate was 81.67%.
Fig 9.1: Total experiment
CONCLUSION

Human-robot interaction is an important, attractive, and challenging research area. The popularity of service robots gives researchers more incentive to work on user interfaces that make robots user-friendly in a social context, and Speech Recognition (SR) technology offers the opportunity to add Natural Language (NL) communication with robots in a natural way.

In this report, we presented the communication robot HAKUEN, based on image processing and voice recognition. The system has four motion functions: face tracking, handshaking, voice recognition, and approach. The average success rate of the total experiment was 81.67%. In future work, many more functions must be defined for the practical application of the system.
REFERENCES

[1] N. Yamasaki and Y. Anzai, "Active Interface for Human-Robot Interaction", Proc. of the IEEE Int. Conf. on Robotics and Automation, pp. 3103-3109, 1995.
[2] N. Kawarazaki, N. Kashiwagi, I. Hoya and K. Nishihara, "Manipulator Work System Using Gesture Instructions", Journal of Robotics and Mechatronics, Vol. 14, No. 5, pp. 506-513.
[3] N. Kawarazaki, Y. Suzuki, Y. Takashima, K. Nishihara and T. Yoshidome, "Robot Control System Using Omnidirectional Image", Proc. of Japan-China Conference on Mechatronics 2005, pp. 97-98.
[4] N. Kawarazaki, K. Kawashima, T. Yoshidome and K. Nishihara, "Communication Robot System Based on Stereo Vision and Voice Instructions", Proc. of China-Japan Conference on Mechatronics 2007, pp. 23-25.
[5] A. Agah and K. Tanie, "Human Interaction with a Service Robot: Mobile-Manipulator Handing Over an Object to a Human", Proc. of the IEEE Int. Conf. on Robotics and Automation, pp. 575-580.
[6] John C. Russ, The Image Processing Handbook, CRC Press in cooperation with IEEE Press, 1999.