[IEEE IECON 2006 - 32nd Annual Conference on IEEE Industrial Electronics - Paris, France (2006.11.6-2006.11.10)] IECON 2006 - 32nd Annual Conference on IEEE Industrial Electronics


Sound and video processing in wireless sensor networks

Andrea Kulakov Georgi Stojanov Danco Davcev Faculty of Electrical Engineering Faculty of Electrical Engineering Faculty of Electrical Engineering

University “Sts Cyril and Methodius” University “Sts Cyril and Methodius” University “Sts Cyril and Methodius” Karpos II bb Karpos II bb Karpos II bb

Skopje, MACEDONIA Skopje, MACEDONIA Skopje, MACEDONIA [email protected] [email protected] [email protected]

Abstract – The general problem of data management in Wireless Sensor Networks (WSNs) is to provide efficient aggregation of data from different sensors, taking into account the limited energy of the nodes and their unpredictable failures. Generally, this is solved by reducing the communication among nodes. In order to achieve efficient data aggregation, preprocessing is needed to reduce the amount of data sent over the communication channels. As an outcome of this research, we propose two similar architectures for data aggregation of sound and video signals. These classification architectures share the same core, consisting of a modified FuzzyART neural network and a modified SEQUITUR algorithm, previously used only for the analysis of symbolic sequences. The proposed architectures have been tested in a prototype implementation using Pocket-PCs with microphones and cameras as sensors.

I. INTRODUCTION

A motivation for this work was the fact that a fully centralized data collection approach in Wireless Sensor Networks (WSNs) is inefficient, given that sensor data are high-dimensional and exhibit significant redundancy over time, over space and over different types of sensor inputs, all due to the nature of the phenomena being sensed. A centralized method where data are collected from sensors and transmitted to a server for later querying is not appropriate for sensor networks, because valuable resources are used to transfer large quantities of raw data to the central system, much of which is often redundant. Sensor networks must save energy in order to extend the lifetime of the sensor nodes, because the nodes are generally powered by small-capacity batteries. It is well known that wireless communication is much more expensive than data processing. Instead of transmitting all the data to a central node, part of the processing can be moved from the central server into the sensor network, thus decreasing power dissipation.

In such cases an efficient learning method is needed which would analyse a sequence of multi-dimensional signal patterns and communicate only the results of classifications.

We have adopted a model of an ANN called FuzzyART [1], deriving from the Adaptive Resonance Theory (ART) models [2]. In the general scheme, the inputs from the sensors are first preprocessed in a few layers using either a Discrete Wavelet Transform or a Fast Fourier Transform. Then the FuzzyART network classifies the analog input data, and the classification IDs are used as symbol inputs to a modified version of the SEQUITUR algorithm [3], which is used for the analysis of a sequence of signal patterns. The SEQUITUR algorithm generates so-called rules from recurring symbol patterns in a sequence.

The modifications of the SEQUITUR algorithm include the calculation of an activation function for each symbol and rule obtained from the SEQUITUR algorithm, which is later used for correct recognition of the signal patterns over time. This spreading of activation through rule-nodes turns the whole SEQUITUR data structure into one big evolving neural network. For that purpose, the level of abstractness of each rule is calculated, and its length in number of symbols at the lowest level is constantly updated. To some selected rules we attach so-called annotation labels, which are sent over the communication channel whenever such a rule is recognized from the signal input or whenever its level of activation exceeds a certain high threshold.

The general idea is that the most compressed information would be sent over the communication channel if we could recognize the whole situation at hand and send only a description of that situation. For sound or speech input, this would mean actually recognizing the sounds or words and sending only the letters of those words; for a video signal, it would mean recognizing the relevant objects in front of the camera and sending only a textual description of those objects, or triggering an alarm that turns on a live video transmission, etc.

II. A GENERAL LEARNING SYSTEM

Most ANNs are designed for pattern recognition of static inputs. Exceptions are the variations of the so-called Recurrent Neural Networks, where the current outputs are fed back to the input layer and combined with subsequent inputs. Generally, their outputs are a function not only of their current inputs but also of all of their past inputs. Still, the number of recurrent neurons determines the memory capacity of the whole neural network, and it is usually very limited.

As an alternative, we have used a FuzzyART neural network as an initial input vector classifier, which classifies the input signals with certain granularity of the input space. The outputs of this neural network, i.e. the classification IDs, are later used as if they were symbols and are fed into the SEQUITUR algorithm, specialized for sequence analysis.

Fig. 1. A general learning system

1-4244-0136-4/06/$20.00 ©2006 IEEE

In what follows we briefly describe the classical FuzzyART and SEQUITUR algorithms, and then explain the modifications we have made in order to use them in a broader scheme.

A. FuzzyART

ART was developed primarily for pattern recognition. FuzzyART is a model of unsupervised learning for analog input patterns. Generally, ART networks develop stable recognition codes by self-organization in response to arbitrary sequences of input patterns. They were designed to solve the so-called stability-plasticity dilemma: how to continue to learn from new events without forgetting previously learned information. ART networks exhibit several useful features, such as robustness to variations in intensity, detection of signals mixed with noise, and both short- and long-term memory to accommodate variable rates of change in the environment.

Fig. 2. Architecture of the ART network.

A typical diagram of an ART Artificial Neural Network is given in Fig. 2. L2 is the layer of category nodes. If the degree of category match at the L1 layer is lower than the sensitivity threshold Θ, originally called the vigilance level, a reset signal is triggered, which deactivates the current winning L2 node (the one with maximum incoming activation) for the period of presentation of the current input.

An ART network is built up of three layers: the input layer (L0), the comparison layer (L1) and the recognition layer (L2). The input and the comparison layers have N neurons, while the recognition layer is not limited in the number of neurons. The input layer stores the input pattern, and each neuron in the input layer is connected to its corresponding node in the comparison layer via one-to-one, non-modifiable links. Nodes in the L2 layer represent categories into which the inputs have been classified so far. The L1 and L2 layers interact with each other through weighted bottom-up and top-down connections that are modified when the network learns.

The learning process of the network can be described as follows: At each presentation of a non-zero analog input pattern p (pj ∈[0, 1]; j = 1, 2, …, N), the network attempts to classify it into one of its existing categories based on its similarity to the stored prototype of each category node. More precisely, for each node i in the L2 layer, the bottom-up activation Ai is calculated, which is expressed as:

A_i = Σ_{j=1..N} min(p_j, w_{ij}) / (ε + Σ_{j=1..N} w_{ij}),   for i = 1, …, M,   (1)

where w_i is the weight vector or prototype of category i (w_{ij} ∈ [0, 1]; i = 1, …, M; j = 1, …, N), min(·,·) is the fuzzy AND (component-wise minimum), and ε > 0 is a parameter. Then the L2 node C that has the highest bottom-up activation, i.e. A_C = max{A_i | i = 1, …, M}, is selected. The weight vector of the winning node (w_C) is then compared to the current input at the comparison layer. If they are similar enough, i.e. if they satisfy the matching condition:

Σ_{j=1..N} min(p_j, w_{Cj}) / Σ_{j=1..N} p_j ≥ Θ,   (2)

where Θ is the sensitivity threshold (0 < Θ ≤ 1), then the L2 node C will capture the current input and the network learns by modifying wC:

w_C^new = γ min(p, w_C^old) + (1 − γ) w_C^old   (min taken component-wise),   (3)

where γ is the learning rate (0 < γ ≤ 1). All other weights in the network remain unchanged. The case when γ =1 is called “fast learning” mode.
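Equations (1)–(3), together with the mismatch-reset search cycle, can be condensed into a short sketch. This is an illustrative reconstruction, not the authors' Embedded C++ implementation; the parameter values (theta, eps, gamma) are arbitrary examples.

```python
import numpy as np

def fuzzy_art_step(p, W, theta=0.75, eps=0.001, gamma=1.0):
    """Present one analog input p (components in [0,1]) to a FuzzyART
    layer. W is the list of category prototypes; returns the index of
    the category that captured p (learning happens in place)."""
    masked = set()                       # categories reset for this input
    while True:
        # bottom-up activation, Eq. (1): |min(p,w)| / (eps + |w|)
        best, best_a = None, -1.0
        for i, w in enumerate(W):
            if i in masked:
                continue
            a = np.minimum(p, w).sum() / (eps + w.sum())
            if a > best_a:
                best, best_a = i, a
        if best is None:                 # no category left: allocate a new one
            W.append(p.copy())
            return len(W) - 1
        # vigilance test, Eq. (2): |min(p, w_C)| / |p| >= theta
        if np.minimum(p, W[best]).sum() / p.sum() >= theta:
            # learning, Eq. (3); gamma = 1 is the "fast learning" mode
            W[best] = gamma * np.minimum(p, W[best]) + (1 - gamma) * W[best]
            return best
        masked.add(best)                 # mismatch: reset and search again
```

Note that learning happens after each single presentation, in line with the modification described below, rather than over a preloaded training set.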

If, however, the stored prototype w_C does not match the input sufficiently, i.e. if condition (2) is not met, the winning L2 node is reset for the period of presentation of the current input. The L2 node with the next highest A_i is then selected, its prototype is matched against the input, and so on. This "hypothesis-testing" cycle is repeated until the network either finds a stored category whose prototype matches the input well enough, or allocates a new L2 node, in which case learning takes place according to (3). We have modified the ART cycle of testing and learning so that it is not necessary to load a whole set of input patterns in advance; instead, learning takes place after each new signal pattern is presented at the inputs.

B. SEQUITUR

SEQUITUR [3] is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing recurring groups of symbols with a rule, and continuing this process recursively while treating the rules as new symbols. The result is a hierarchical representation of the original sequence. The algorithm is driven by two constraints that reduce the size of the system of rules, also called the grammar, and produce structure as a by-product. These constraints state that no pair of adjacent symbols appears more than once, and that every rule is used more than once.

The left of Fig. 3a shows a sequence that contains the repeating digram bc. To compress it, SEQUITUR forms a new rule A → bc, and A replaces both occurrences of bc. The new SEQUITUR system of rules appears at the right of Fig. 3a.

Fig. 3. Example sequences and systems of rules (grammars) that reproduce them: a) a sequence with one repetition; b) a sequence with a nested repetition.

The sequence in Fig. 3b shows how rules can be reused in longer rules. The longer sequence consists of two copies of the sequence in Fig. 3a. Since it represents an exact repetition, compression can be achieved by forming the rule A → abcdbc to replace both halves of the sequence. Further gains can be made by forming rule B → bc to compress rule A. This demonstrates the advantage of treating the sequence itself, represented by rule S, as part of the SEQUITUR system of rules: rules may be formed from rule A in the same way as rules are formed from rule S. These rules within rules constitute SEQUITUR's hierarchical structure.

The algorithm operates by enforcing the constraints on the system of rules: when the digram uniqueness constraint is violated, a new rule is formed, and when the rule utility constraint is violated, the useless rule is deleted.
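SEQUITUR maintains both constraints incrementally, in linear time. As a rough illustration of the digram-substitution idea only, the following simplified offline sketch (closer in spirit to Re-Pair than to the incremental algorithm, and producing a different grammar than Fig. 3) repeatedly replaces a repeated adjacent pair with a fresh rule.

```python
def digram_grammar(seq):
    """Offline digram substitution: while some adjacent pair occurs
    at least twice, replace it everywhere with a new rule symbol.
    Returns the compressed sequence and the rule dictionary.
    (Overlapping pairs are counted naively; fine for illustration.)"""
    rules, n = {}, 0
    while True:
        counts = {}
        for pair in zip(seq, seq[1:]):            # count adjacent pairs
            counts[pair] = counts.get(pair, 0) + 1
        repeated = [p for p, c in counts.items() if c >= 2]
        if not repeated:
            return seq, rules
        pair = repeated[0]
        name = "R%d" % n; n += 1
        rules[name] = list(pair)                  # new rule: name -> pair
        out, i = [], 0
        while i < len(seq):                       # rewrite the sequence
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
```

Expanding every rule recursively recovers the original sequence, which is the reversibility property mentioned below.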

Ultimately, what is formed by SEQUITUR's operation on a sequence can be depicted as in Fig. 4, where rules are represented by circles and symbols by squares. The real structure is actually a graph, since a rule can be reused many times as a sub-rule of other rules. The only limit on the number of rules that can be formed is the capacity of the memory itself.

Fig. 4. Hierarchical structure in SEQUITUR, consisting of rules and symbols describing a sequence of symbols

It should be mentioned that SEQUITUR has been used successfully for the compression of long sequences such as DNA genomes, and that it is reversible: the original sequence can be recovered from the grammar.

What we have modified in the original SEQUITUR algorithm is that several new properties were added to rules and symbols, such as the level of abstractness of a rule and the total number of symbols at its bottom level. We have also added a level of activation to each rule and symbol, together with an activation-spreading mechanism.

C. Combining FuzzyART and SEQUITUR

The number of different categories into which the FuzzyART module classifies the sensory inputs tends to saturate; ultimately it is finite, depending on the sensitivity threshold. The idea behind this architecture is that the classification identification numbers from the FuzzyART module can be treated as symbols and entered into the SEQUITUR algorithm in order to analyze their pattern over time. With the activation-spreading mechanism added to the SEQUITUR rules and symbols, the SEQUITUR rules can be considered an ensemble of evolving neural networks specialized to activate on certain sensory stimuli.

Fig. 5. The category nodes of the FuzzyART module serve as input symbols to the SEQUITUR module in this unsupervised version of a general learning system

The winning category node from the FuzzyART module is added as the next symbol in the sequence of incoming symbols in the SEQUITUR module. The symbols at the bottom level of the hierarchy in SEQUITUR receive different levels of activation from the corresponding category nodes of the FuzzyART module: the winning category casts maximal activation onto the SEQUITUR symbol layer, but the remaining FuzzyART categories also transmit some activation to the symbols. Because similar categories receive similar activations in the FuzzyART module, this mechanism provides the necessary flexibility during recognition, i.e. in deciding which rule has gained the maximum activation.

The purpose of the spreading of the activation is to determine the relevance of each particular piece of knowledge (in our case the rules), bringing relevant ones into the working memory [4]. The associative mechanism used in our architecture is a modified version of the Grossberg activation function [5].
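The exact form of the modified Grossberg function is not given in the text, so the following is only a plausible sketch, assuming exponential decay plus bottom-up input averaged over a rule's constituents; the function name and parameters are our own illustration.

```python
def spread_activation(grammar, act, sim, decay=0.5):
    """One bottom-up activation-spreading step over the rule hierarchy.
    grammar: rule name -> list of constituents (symbols or rules);
    act:     previous activation of every node;
    sim:     graded sensory input per bottom-level symbol (winning
             FuzzyART category -> 1.0, similar categories -> less)."""
    new = {n: decay * a for n, a in act.items()}     # decay old activation
    for s, w in sim.items():                         # inject sensory input
        new[s] = new.get(s, 0.0) + w
    changed = True
    while changed:                                   # propagate upward
        changed = False
        for rule, parts in grammar.items():
            target = decay * act.get(rule, 0.0) + \
                sum(new.get(p, 0.0) for p in parts) / len(parts)
            if abs(new.get(rule, 0.0) - target) > 1e-9:
                new[rule] = target
                changed = True
    return new
```

The rule whose activation exceeds the high threshold would then trigger transmission of its annotation label.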

We define a continuous sensing activity differently for sound processing and for video processing. For sound processing, it spans from the onset of non-zero input signals until the input returns to zero. For video processing, we define it as a continuous group of saccades during which the expectations for the retinal image around the focus were met in a row.


III. A SYSTEM FOR SOUND PROCESSING

The general learning system can easily be adopted for sound processing applications. As can be seen from Fig. 6, the raw input from the microphone is preprocessed using the Fast Fourier Transform (FFT), and the output frequencies are then logarithmically transformed using a variation of the Mel frequency transform [6]:

M(f) = 1125 ln(1 + f / 700).   (6)

Many experiments have shown that the ear's perception of the frequency components in speech does not follow a linear scale but the Mel-frequency scale, which can be understood as linear frequency spacing below 1 kHz and logarithmic spacing above 1 kHz.
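The Mel warp of Eq. (6) can be sketched directly; note that the constant 1125 pairs with the natural logarithm (the widely cited equivalent form is 2595·log10, or 1127·ln), so the sketch below uses ln.

```python
import math

def mel(f_hz, c=1125.0):
    """Mel warping of a linear frequency in Hz, as in Eq. (6).
    Applied to each FFT bin's center frequency before classification."""
    return c * math.log(1.0 + f_hz / 700.0)
```

The warp is roughly linear below 1 kHz and logarithmic above it, matching the perceptual behavior described above.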

After that, a FuzzyART neural network initially classifies the current frequency response of the sound input, and the result of that classification is the identification number of a certain category. This number is used as an input symbol for the SEQUITUR algorithm, which analyses the stream of such symbols.

Fig. 6. Possible application of the general learning system in a sound processing application

The result of the analysis with SEQUITUR is not emitted until the continuous sensing activity returns to the resting state. This is because there is no point in outputting a rule that may still be replaced by some super-ordinate rule as subsequent symbols enter SEQUITUR.

Finally, this stream of output rules or symbols from the SEQUITUR algorithm, gathered during the last continuous sensing activity, is sent over the transmission channel to the clusterhead, which collects such compressed sound information from the other nodes in the same cluster of the WSN.

a b

Fig. 7. a) Screenshot from the sample application for sound processing with a Pocket-PC. b) Enlarged view of the spectrogram of the sound input, where the periods of continuous sensing activity are circumscribed by hand.

Fig. 7a shows a screenshot from the sample sound processing application written in Embedded C++ for Pocket-PCs. The lower black rectangle shows the frequency response of the sound input: different frequencies are arranged vertically, the intensity of each frequency is shown by color, and time runs horizontally. Three continuous sensing activities can easily be noticed in the same figure and are highlighted in Fig. 7b. These continuous periods of sensing activity are calculated using the moving average of the frequency response; in that way, the system can adapt to environments with different levels of noise.
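The exact moving-average scheme is not specified, so the following is a hedged sketch, assuming an exponential moving average of the background energy that is updated only outside active periods; the threshold factor k is an assumed parameter.

```python
def sensing_activities(energy, alpha=0.9, k=2.0):
    """Split a stream of per-frame spectral energies into continuous
    sensing activities, returned as (start, end) frame indices.
    A frame is active when its energy exceeds k times an exponential
    moving average of the background level; the background is updated
    only on inactive frames, so the detector adapts to environments
    with different noise floors."""
    segments, start = [], None
    baseline = energy[0] if energy else 0.0
    for t, e in enumerate(energy):
        if e > k * max(baseline, 1e-9):
            if start is None:
                start = t                     # activity onset
        else:
            if start is not None:
                segments.append((start, t))   # activity ended
                start = None
            baseline = alpha * baseline + (1 - alpha) * e
    if start is not None:
        segments.append((start, len(energy)))
    return segments
```

Each detected segment would then delimit the symbol stream handed to SEQUITUR as one continuous sensing activity.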

IV. A SYSTEM FOR VIDEO PROCESSING

We have adopted the Behavioral Model of Visual Perception (BMVP) [7], in which, among other things, we have replaced the sensory memory with a FuzzyART neural network instead of a Hopfield neural network, and used the SEQUITUR module instead of a fixed motor memory (see Fig. 8). BMVP develops representations of visual objects based on the responses of a fixed number of edge-detecting sensors during saccadic movements of a small Attention Window (AW). Instead of using these edge-detecting sensors as inputs to the sensory memory as in BMVP, we have experimentally found that using oriented Gabor wavelet responses yields better and faster results. Another difference from BMVP is that it was primarily used for recognition of static images, while we have used it on live captured video sequences.

Fig. 8. A sample architecture for video processing having the general learning system in its core

In Fig. 8, the thicker arrows denote information flow, while the thinner arrows denote control flow. Movement detection also controls the shift of the AW. Unlike in BMVP, there is no need for a human to designate the so-called "interesting" zones (e.g. around the eyes) in a picture: blinks and other movements of the head make those zones interesting by themselves, simply by letting movement detection influence the decision about the shift of the AW, i.e. about the position of the next focus of attention.

Fig. 9, taken from [7], explains the content of one Attention Window (AW). The relative orientation of each context point (ϕ) is calculated as the difference between the absolute angle of the edge at the center of the AW (ϕ0) and the absolute angle of the edge at that context point (ϕc). This relative orientation is used to obtain the oriented Gabor wavelet response at that context point in a small window. This response is then used as an input to the FuzzyART neural network, which plays the role of a sensory memory.
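The geometry above can be sketched as follows; the kernel size, wavelength, and sigma are illustrative values, not the ones used in the prototype.

```python
import numpy as np

def relative_orientation(phi0, phi_c):
    """phi = phi0 - phi_c, wrapped into (-pi, pi]."""
    d = phi0 - phi_c
    return float(np.arctan2(np.sin(d), np.cos(d)))

def gabor_kernel(theta, size=9, wavelength=4.0, sigma=2.0):
    """Real part of a Gabor kernel oriented at angle theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinate
    return (np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
            * np.cos(2.0 * np.pi * xr / wavelength))

def gabor_response(patch, theta):
    """Oriented response of a small square window at a context point,
    used as one component of the FuzzyART input vector."""
    return float((patch * gabor_kernel(theta, size=patch.shape[0])).sum())
```

Passing the relative orientation, rather than the absolute edge angle, into the kernel is what makes the percepts rotation-independent.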

Fig. 9. Schematic of the Attention Window (AW). The next possible focal points are located at the intersections of sixteen radiating lines and three concentric circles. XOY is the absolute coordinate system. The relative coordinate system X1OY1 is attached to the basic edge at the center of the AW. The absolute parameters of the edge at one possible next focal point, ϕc and ψc, are shown, as well as its relative parameters, ϕ and ψ.

SEQUITUR is used as a motor memory by providing alternating inputs, once from the sensory memory and once from the vector selector that determines the next focal point of the AW. As can be seen in Fig. 9, there are 48 different possible saccadic movements. These are represented by 48 different symbols, which are entered as input symbols to the SEQUITUR module. The SEQUITUR rules are of the form Percept-Saccade-Percept-Saccade-…-Percept, taken from the FuzzyART and the "Shift of AW" modules.
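Each of the 48 saccades (16 radiating directions × 3 concentric circles, Fig. 9) can be encoded as one symbol and interleaved with the FuzzyART percepts; the naming scheme below is our own illustration.

```python
def saccade_symbol(direction, ring):
    """One of 48 saccade symbols: 16 radiating directions x 3 circles."""
    assert 0 <= direction < 16 and 0 <= ring < 3
    return "S%d" % (ring * 16 + direction)

def percept_saccade_sequence(percepts, saccades):
    """Interleave FuzzyART category IDs with saccade symbols into the
    Percept-Saccade-...-Percept stream fed to SEQUITUR.
    Requires len(percepts) == len(saccades) + 1."""
    seq = []
    for p, s in zip(percepts, saccades):
        seq += ["C%d" % p, s]
    seq.append("C%d" % percepts[-1])
    return seq
```

The resulting stream is what SEQUITUR compresses into rules during a continuous sensing activity.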

Fig. 10. Two examples of the saccadic movements, shown with dotted lines oriented along the approximately detected edges at these points. Eighty saccadic movements are made over one video sequence. The small circle in the middle shows the estimated center of the movement activity.

Calculating the orientation of the edges at the next focal points relative to the orientation of the edge at the current focus of the AW makes the recognition of objects independent of orientation. The relative calculation of the distance between the current and the next focal point makes the recognition of objects independent of size.

Even though the selection of the next saccade is solved relatively simply and should be further improved, the results, as discussed later, are promising. Fig. 10 shows two examples of the saccadic movements over two different images.

The estimated center of the movement activity is used as an attractor towards which the saccade jumps at the start of each new video sequence, and also when the saccades tend to exit the image boundaries.

V. EXPERIMENTS

We have built a small testing environment for testing the functionality of the data management strategy discussed so far.

A. Experimental setup

We have used 5 HP iPAQ h4000-series Pocket-PCs, equipped with a built-in microphone, a wireless network card and an add-on camera, as nodes in this small prototype WSN. As a clusterhead we have used a laptop PC with a built-in wireless network card. All of them were first connected to the same network through the wireless access point, which can be seen mounted on the wall in the upper middle part of Fig. 11a.

a b

Fig. 11. a) The full experimental setup situated in our multimedia and video-conferencing lab, including 5 Pocket-PCs equipped with cameras (highlighted with circles), a wireless access point and a laptop PC acting as a clusterhead in this prototype WSN; b) a closer look at two Pocket-PCs equipped with cameras.

B. Results from the sound processing application

The current level of development of the sound processing application serves only as a proof of concept that the general learning system can recognize sound, or even speech, in any language. We believe that further development would result in much more reliable recognition rates. The application also needs to be trained and tested over a large corpus of different sound and speech samples.

C. Results from the video processing application

Important information showing that the system learns progressively during its operation is the average length and the average level of the SEQUITUR rules it creates. It shows that the system actually takes saccadic samples from the image and categorizes them in a consistent and consecutive manner.


Since the application deals with images from a video, which introduces additional novelty and diversity in terms of different object orientations and the shadows that these objects cast, the possibilities for detecting regularities in the saccadic streams, and thus for creating new SEQUITUR rules, are reduced.

Therefore, for comparison, we have made a test run in which the application dealt with a static image. The results show that the average and maximal lengths and levels of the rules are markedly larger in the case of static images, which was expected. These results are shown in Figs. 12 and 13, where one iteration involves 80 saccades over one image frame.

[Plot: "Maximal and average length of the rules" — x-axis: number of iterations (0–1200); y-axis: rule length (0–120); four curves: maximal length (one image), maximal length (video), average length (one image), average length (video).]

Fig. 12. Comparison of the maximal and the average lengths of the SEQUITUR rules when processing video or a static image taken from the same video sequence.

[Plot: "Maximal and average levels of the rules" — x-axis: number of iterations (0–1200); y-axis: rule level (0–14); four curves: maximal level (one image), maximal level (video), average level (one image), average level (video).]

Fig. 13. Comparison of the maximal and the average levels of the SEQUITUR rules when processing video or a static image taken from the same video sequence.

The number of confirmed expectations from SEQUITUR rules after each selection of a shift of the AW was also around 67% higher for static image processing than for video processing.

All of this confirms that, after a sufficient number of iterations, the video processing architecture, built around the general learning system, would be able to create structured, stable, and increasingly long and abstract recognition codes for objects. Since we have adapted the saccadic approach to image processing given in [7], where static object recognition was achieved regardless of the size or orientation of the objects, we believe that our approach has the same potential for video signal processing.

VI. CONCLUSION

In order to achieve efficient data aggregation, preprocessing is needed to reduce the amount of data sent over the communication channels. The most compressed information that can be sent over the communication channel results when the system can actually recognize the whole situation at hand and send only a description of that situation.

Two similar architectures for data aggregation of sound and video signals were proposed, sharing the same core, consisting of a modified FuzzyART neural network and a modified SEQUITUR algorithm. The proposed architectures have been tested in a prototype implementation using Pocket-PCs with microphones and cameras as sensors.

The core of these signal-processing architectures represents a general learning system for any signal patterns that unfold over time. This general learning system can be viewed as an extension of most of the Artificial Neural Networks proposed so far, which deal only with static patterns, and also as an alternative to the Recurrent Neural Network models.

ACKNOWLEDGMENT

This work has been partially supported by the Walter Karplus Summer Research Grant of the IEEE Computational Intelligence Society, which Andrea Kulakov received in 2005 for his work titled "Efficient Data Management in Wireless Sensor Networks using Artificial Neural Networks".

REFERENCES

[1] G. A. Carpenter, S. Grossberg, and D. B. Rosen, "Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system", Neural Networks, vol. 4, 1991, pp. 759-771.

[2] S. Grossberg, "Adaptive Resonance Theory", in Encyclopedia of Cognitive Science, Macmillan Reference, 2000.

[3] C. G. Nevill-Manning and I. H. Witten, "Identifying Hierarchical Structure in Sequences: A linear-time algorithm", Journal of Artificial Intelligence Research, vol. 7, 1997, pp. 67-82.

[4] J. Anderson, The Architecture of Cognition, Harvard University Press, Cambridge, MA, 1983.

[5] S. Grossberg, "A Theory of Visual Coding, Memory, and Development", in E. Leeuwenberg and H. Buffart (eds.), Formal Theories of Visual Perception, Wiley, New York, 1978.

[6] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, 1980, pp. 357-366.

[7] I. A. Rybak, V. I. Gusakova, A. V. Golovan, L. N. Podladchikova and N. A. Shevtsova, "A model of attention-guided visual perception and recognition", Vision Research, vol. 38, 1998, pp. 2387-2400.
