
University of Victoria
Faculty of Engineering

Spring 2007 Elec499B Final Report

Speech Activated Appliances

Group Members

David Beckstrom
Harun Ali
Kunal Jadav
Nicole Su Lee, Ng
Zorawar Bhatia

In partial fulfillment of the requirements of the B.Eng. Degree


Table of Contents

1.0 Introduction
    Concept
    Goal of Project
    Realization
    Demonstration System Operation
    Software Concept
        Feature extraction
        Database creation
        Pattern Recognition Algorithm
    Output interface and Hardware interface Concept
2.0 Background
    Speech recognition
    Requirements for speech recognition
    Information contained in Database
    Types of Algorithms
    Other concepts used in this report
        Windowing
        Modelling of speech
3.0 Database
    Database Design
    Database construction
        Feature Extraction
        Feature matrix and database formation
    Database comparison
4.0 Dynamic Time Warping (DTW)
    Distance between Two Sequences of Vectors
        Comparing the distance between two sequences of vectors of different length
    Finding the Optimal Path
        Local distances
5.0 Experiments & Results
6.0 Hardware
7.0 Conclusion
8.0 References


1.0 Introduction

Concept:

This project will demonstrate the next generation of home automation technology: speech activated appliances. Imagine being able to control the lights in a room, or the temperature of your home, with a simple spoken command. Imagine the security of having your home recognize and respond to your voice alone. This project will explore these ideas by developing voice recognition software, and then demonstrating that software through a basic implementation of a voice recognition system on readily available electronic components and hardware. The system will respond to a list of defined spoken commands, and control two basic household appliances: a lamp, and an LED display mimicking a thermostat.

Goal of Project:

A full realization of this concept would involve a few distinct steps. First, develop a database of commands the system will respond to. Second, develop voice recognition software that can perform a comparison between a command issued to the system and the database of commands. Third, develop a sufficient hardware system to translate a matched command into a control signal, and finally into a realized change of state in hardware. Fourth, develop the above system to exist on a programmable DSP chip such that it operates independently of an external computing source, and interacts with its hardware inputs and outputs independently. Such a system would be integrated in the user's home, use microphones installed in the home as input sources, and would issue control signals to hardware already installed in the home.

Realization:

As a full realization of this concept is beyond the time and budgetary constraints of this project, we plan instead to prove the concept by designing a demonstration circuit that will operate as a scaled down version of the above system. The aim of this project then, is to prove the concept of speech activated appliances by developing a voice recognition system to recognize 5 user-spoken commands in real time, and issue control signals to a pair of recognizable household appliances.


- A standard PC microphone will be used as the input source
- The voice recognition software will be written in MATLAB, to be run off of a desktop PC
- A simple hardware interface will be developed to translate control signals into a change of state in the appliances
- The appliances we have chosen to demonstrate the control aspect of the project are a light and a thermostat.
  o The thermostat will be simulated with two seven-segment LED displays showing the current temperature "set-point" of the demonstration thermostat
  o The lamp will be realized with a standard 120 V desk lamp.

Demonstration System Operation:

A fully functional system will operate as follows. An individual will approach the microphone and issue one of 5 pre-recorded commands: "On" or "Dark" to control the lamp, or "Fifteen", "Twenty", or "Twenty-Five" to change the set-point of the thermostat. The analog signal of this command will then be converted to a digital signal. After A/D conversion of the signal, software will process the signal and store it in memory. The stored information will then be compared to the information stored in a database of pre-recorded commands via a speech recognition algorithm. When a match is made, a control signal will be issued to the output interface circuitry, which will control the appliances. This will occur in real time, optimized for minimum delay. Here is a flow chart of this process:

[Analog Input]
      ↓
[A/D Conversion]
      ↓
[Software Processing]
      ↓
[Database Comparison via Speech Recognition Algorithm]
      ↓
[On match, issue control signal over serial output interface]
      ↓
[Serial Connection]
      ↓
[Hardware Interface Controls Hardware]


Figure 1 - General Project Layout

Software Concept:

A large amount of the work is done on the signal during the software processing part of the system. Through our research of voice recognition software, we have distinguished three distinct parts needed to create a software package that will effectively process and recognize our spoken commands. These parts are:

Feature extraction: Analyses both pre-recorded database samples and live signals from the microphone input and distils them into a matrix that describes the sample with cepstrum, log energy, delta cepstrum, and delta-delta cepstrum numerical coefficients.

Database creation: Holds the acquired matrices from the feature extraction in memory. Generally, a frame of 20-25 ms extracted from the recorded command can have up to 40 numeric values called parameters. Considering a recording length of about 1 second, each command can have up to 2000 parameters. For a set of five commands as is to be implemented in this project, the database could have up to 10000 numeric values.


Our demo system database has 5 stored commands recorded 5 times for 5 different users (group members). The 5 commands were also recorded an additional 2 times by 2 female users.

Pattern Recognition Algorithm: The Dynamic Time Warping algorithm compares the features extracted from the live input signal against the features stored in the database, and returns a match for the database entry that most closely resembles the input signal.

A match will then trigger the output interface to communicate an ASCII character describing the matched command to the hardware interface.

Output interface and Hardware interface Concept:

Figure 2 - Hardware Layout

The output interface will be a serial connection, comprising code in MATLAB configured to communicate over the serial connection of Kunal's personal desktop PC to the demonstration circuit. This will interface with the hardware interface on the demonstration circuit, which will receive the serial signal and translate the data into a control signal via a programmed microcontroller.
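As an illustration of this step, here is a minimal MATLAB sketch of sending a matched command over the serial port. This is a hedged sketch, not the project's actual code: the port name 'COM1', the baud rate, and the index-to-character mapping are assumptions, with the characters taken from Table 1 in the Hardware section.

```matlab
% Sketch only: send the ASCII character for a matched command over serial.
% Assumptions: port COM1, 9600 baud; characters as in Table 1 (Hardware section).
s = serial('COM1', 'BaudRate', 9600, 'DataBits', 8, 'StopBits', 1);
fopen(s);

chars = ['c' 'w' 'h' 'o' 'f'];        % Fifteen, Twenty, Twentyfive, On, Dark
matchIndex = 4;                       % e.g. the command 'On' was recognised
fwrite(s, uint8(chars(matchIndex)));  % one byte triggers the microcontroller

fclose(s);
delete(s);
clear s
```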

To translate the serial signals into control signals we plan to develop a small hardware demonstration circuit consisting of the following components:


- A serial port connection to receive signals from the PC
- A preprogrammed microcontroller to route signals to either the light or the thermostat
- A relay to translate a 5 VDC control circuit signal to 120 VAC for our light
- Two LED displays with appropriate control chips to display thermostat settings

This device will be packaged and wired appropriately to interface to a 120 V desk lamp and the LED displays for the thermostat.

Upon coordinating with the 499 Project technicians in the engineering department, we settled on the following setup to achieve the above design.

- Serial output from PC (cable and connectors)
- MAX232 serial line driver chip
- Atmel ATmega8 8-bit flash-programmable microcontroller
- Texas Instruments SN74LS47 BCD to seven-segment driver chips
- Common-anode 7-segment LEDs
- 5 VDC to 120 V relay (wired box to house the relay and circuit connections for 120 V outlets and wall connection to 120 V)

The rest of the demonstration circuit will be constructed with basic off-the-shelf lab components, including resistors, transistors, relays, and wiring. The thermostat demonstration circuit will be implemented with control signals from the microcontroller feeding the BCD driver chips, which drive the 7-segment LEDs to the right display values.


Figure 3 - Hardware Schematic

2.0 Background

Speech recognition:

Speech recognition is an advanced form of decision making whereby the input originates with the spoken word of a human user. Ideally, this is the only input that is required.

There are many ways in which speech recognition can be implemented. For the purposes of this report, it is assumed that a microphone connected to a computer is available. On the computer, a program called MATLAB is used to implement the algorithm and store the database.

Requirements for speech recognition

1. A database. The database serves as the main point of comparison. When an input is directed to the algorithm, the algorithm compares it to what is contained in the database (discussed below) using an algorithm (discussed below) that maximises accuracy while minimising computing time.

2. Input. Input in this case comes in the form of a microphone connected through a computer to MATLAB.

3. Algorithm for comparison. As the main computing point, the algorithm dictates the speed and accuracy of the whole system. For example, a point-by-point comparison of the input to the database would be costly in terms of time and highly inaccurate. Much work has been done to find an algorithm which provides the benefits required to make a practical speech recognition system. Algorithms are listed below.

Information contained in Database

The database contains the features of the pre-recorded commands.

Features include:

1. MFCCs

The procedure for extracting MFCCs is:

a. Take the Fourier transform of the signal (done for each window).
b. Map the log amplitudes of the spectrum onto the Mel scale. Triangular overlapping windows are used (a sketch of this conversion follows this list).

   Mel scale: to convert f hertz into m mel:
   m = 1127.01048 * ln(1 + f/700)
   For m mel into f hertz:
   f = 700 * (exp(m/1127.01048) - 1)

c. Take the Discrete Cosine Transform of the list of Mel log-amplitudes.
d. The amplitudes of the resulting spectrum are the MFCCs. [ref http://en.wikipedia.org/wiki/Mel_scale]

2. Delta
a. Taken by differentiating the MFCC coefficients to find the first-order degree of change.

3. Delta-delta
a. Found by differentiating the delta coefficients to find the second-order degree of change in the MFCCs.

4. Energy
a. The log energy of the signal is computed using overlapping triangular windows.
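As a rough illustration of the Mel conversion and the delta computations above, here is a MATLAB sketch under stated assumptions (not the report's actual code): the MFCC matrix is a random placeholder, and the simple frame-to-frame difference used for the deltas is one convention among several.

```matlab
% Hz <-> mel conversion using the constants quoted above.
hz2mel = @(f) 1127.01048 * log(1 + f/700);   % f hertz -> m mel
mel2hz = @(m) 700 * (exp(m/1127.01048) - 1); % m mel -> f hertz
hz2mel(1000)            % ~1000: the constant makes 1000 Hz map to 1000 mel

% Delta and delta-delta as frame-to-frame differences of the MFCCs.
% A leading column of zeros keeps the frame count unchanged.
mfcc   = randn(12, 50);                          % placeholder: 12 MFCCs x 50 frames
delta  = [zeros(size(mfcc,1),1),  diff(mfcc,  1, 2)];
ddelta = [zeros(size(delta,1),1), diff(delta, 1, 2)];
```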

The figure below illustrates the feature extraction in a flow chart.

Figure 5 - Feature extraction

Types of Algorithms

1. Dynamic Time Warping (DTW): Dynamic time warping is a method that is most applicable to signals which are skewed or shifted in time relative to each other. For example, if one signal is compared to another that is the same signal but shifted along the x (time) axis, a point-to-point Euclidean comparison will give a large error. However, if the shift is accounted for, as it is in DTW, the two signals will be recognised as being very similar, which they are. In this way, DTW is ideal for speech recognition, where one word spoken by two users is never exactly the same, but is often said with differing speed or emphasis.

In the figure below, the input signal and template signal are compared. If the two signals were exactly the same, the minimum distance path would be a 45 degree line between the two. However, any skew will cause the minimum distance mapping to shift. DTW takes advantage of this fact and gives a distance which accounts for this shift.

Figure 6 - Warping path

[Ref: http://www.cse.unsw.edu.au/~waleed/phd/html/node38.html]

Two signals warped in time:

Figure 7 - Two skewed signals

As shown above, the two signals are, in fact, the same. However, a Euclidean comparison would give a large difference. DTW skews the comparison to account for the shift, much like (B) above.
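A quick MATLAB illustration of this point (our own toy example, not from the report): a sine wave and a time-shifted copy of itself are far apart under a point-to-point comparison, but identical once the shift is undone, which is what DTW does systematically.

```matlab
% Toy example: point-to-point comparison of a signal and its shifted copy.
t = (0:199)/100;             % 200 samples, 100 samples per time unit
a = sin(2*pi*t);             % original signal
b = sin(2*pi*(t - 0.1));     % the same signal, shifted by 10 samples

pointToPoint = norm(a - b)            % large, despite 'equal' signals
aligned = norm(a(1:190) - b(11:200))  % ~0 once the shift is accounted for
```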


DTW skews the difference by computing the minimum distance between the two signals. Here, a minimum distance warping path is shown:

Figure 8 - Minimum distance warping path

[Ref for above two pics: http://www.cs.ucr.edu/~eamonn/sdm01.pdf]

2. Hidden Markov Model (HMM): The HMM algorithm is a statistical model. The process is assumed to be a Markov process with hidden (unknown) parameters, which are deduced by analysing the known parameters. By computing these states, pattern recognition is possible, which is how HMMs can be used for speech recognition. HMM is a complex algorithm which provides the most benefit for large-vocabulary systems; in this project, only five commands were to be recognised by the system.

3. Neural Networks (NN): Neural networks use a network of "neurons" of acoustic phonemes which are compared to the input to find a match. NN is a highly mathematical approach which is useful for recognising longer words with many phonemes in them. In this project, words were kept short.


In light of the above information, dynamic time warping was judged to be the best choice for this project.

Other concepts used in this report

Windowing

Windowing a signal ensures that there are no sharp cut-offs at the beginning or end of a signal, which could cause unwanted high frequencies to appear in the signal. In this project a Hamming window is used, with the formula:

w(n) = 0.53836 - 0.46164 * cos(2*pi*n/(N-1)),  n = 0, ..., N-1
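A minimal MATLAB sketch of the framing and windowing step follows. The 8 kHz sampling rate and the random placeholder signal are our assumptions; the 20 ms frames taken every 10 ms follow the figures quoted elsewhere in this report.

```matlab
% Split a recorded command into overlapping frames and apply the Hamming window.
fs  = 8000;                 % assumed sampling rate
x   = randn(fs, 1);         % placeholder for a ~1 s recorded command
N   = round(0.020 * fs);    % 20 ms frame length (160 samples)
hop = round(0.010 * fs);    % a new frame every 10 ms

n = (0:N-1)';
w = 0.53836 - 0.46164 * cos(2*pi*n/(N-1));   % the Hamming window above

nFrames = floor((length(x) - N)/hop) + 1;
frames  = zeros(N, nFrames);
for k = 1:nFrames
    seg = x((k-1)*hop + (1:N));
    frames(:,k) = seg .* w;   % windowed frame: no sharp edges at the ends
end
```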

Modelling of speech

To create the database and to use analog speech in the digital domain, it is necessary to model it accurately and succinctly. A simple analog-to-digital conversion is too large and not useful enough to be used for comparison with high accuracy. Speech and speech production are modelled as follows:

Speech consists of voiced and unvoiced sounds. Voiced sounds such as 'a' and 'b' are due to vibrations of the vocal cords and can be accurately modelled as a sum of sinusoids. Unvoiced sounds are, when looked at in a short time frame (say 10 ms), noise. They can be modelled as such.

Voiced sounds are made by allowing air to pass freely over the vocal cords out of the vocal tract.

Unvoiced sounds are made by a constricted vocal tract, producing turbulence.

The vocal tract changes in time to produce voiced and unvoiced sounds in succession. This is speech.

Speech can be modelled as a time-varying signal. However, since the vocal tract varies slowly with respect to the pitch of speech (typically on the order of hundreds of hertz), speech can be modelled over short time frames as an LTI (linear time invariant) system, where the impulse response of the vocal tract, v(t), is convolved with a driving impulse train, x(t), producing the sound s(t) = x(t) * v(t).


Figure 9 - Magnitude response of the vocal tract


Figure 10 – The vocal tract as an LTI filter.

[Ref: http://cobweb.ecn.purdue.edu/~ipollak/ee438/FALL04/notes/Section2.2.pdf]

To model the vocal tract as a digital filter, the poles of its transfer function can be computed.

Figure 11 - Poles near the unit circle correspond to large values of H(e^jw).

The location of the poles depends on the resonant frequencies which occur in the vocal tract. These are called formant frequencies.


3.0 Database

The database is constructed from the pre-recorded commands uttered by all the team members. In addition to the team members, a few more people were also recorded to improve recognition and speaker independence.

Database Design

Each entry in the database is designed to correspond to a single command utterance. The entries were designed to be feature matrices containing the features extracted from the pre-recorded samples, with one entry for each pre-recorded command.

Database construction

There are in all 155 commands in the database. Five utterances per command, for five commands and five team members, gives 125 commands; the rest of the commands were recorded by external people.

The entries in the database correspond to each command, and are in the form of feature matrices extracted from the vocal commands. The formation of the feature matrices is summarized in the next section, 'Feature Extraction'.

Feature Extraction

An overview of the feature extraction is presented in the diagram below. The input signals are digitized first (if they are not pre-recorded commands) and split into short-time frames. The 'Cepstral coefficient extraction' block returns the MFCC coefficients and the frame energy. These coefficients and the energy are further processed to derive the delta-cepstral and delta-delta cepstral coefficients.


Figure 12 - Feature Extraction process

The cepstral coefficient extraction is illustrated with plots in the following section.

Cepstral Coefficient Extraction

The block diagram of the cepstral coefficient extraction block is shown below:

Figure 13 - Cepstral coefficient calculation from frames

As shown above, the fast Fourier transform is applied to each of the frames of the digitized command. The next step is to calculate the frame energy. The Fourier transform produces complex values; in order to make use of those values, they must first be converted to real values. Taking the absolute value of a complex number returns its magnitude as a real number, and these real numbers are squared to calculate the energy. The magnitudes of the fast Fourier transform are plotted in the figure below:


Figure 14 - FFT Magnitude spectrum for a sample frame
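A sketch of this step in MATLAB, assuming the matrix of windowed frames from the windowing sketch earlier; the log follows the 'Energy' feature described in the Background section.

```matlab
% FFT of every frame (column-wise), magnitude squares, and log frame energy.
X = fft(frames);                        % frames: [N x nFrames] windowed frames
magSq = abs(X).^2;                      % real-valued magnitude squares
frameEnergy = log(sum(magSq, 1) + eps); % one log-energy value per frame
```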

These magnitude squares are summed up to form the frame energy, which is one of the parameters. The magnitude squares are also passed downstream to the Mel filter bank for further processing. The Mel filter bank consists of filters designed on the Mel frequency scale, which is constructed to represent the way humans perceive sound. The frequency mapping of the Mel scale is shown with respect to linear frequency below:


Figure 15 - MEL frequency vs Linear frequency

The Mel Filter bank is pictured in the figure below.

Figure 16 - Mel Filter bank magnitudes


The Discrete Cosine Transform (DCT) is applied to the signal after it has been filtered by the Mel filter bank and had its log value taken. The result of the DCT is the set of Mel-frequency cepstral coefficients. The cepstral coefficients for a sample frame are shown below:

Figure 17 - Mel Frequency Cepstral Coefficients for a sample frame
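A sketch of the Mel filter bank, log, and DCT stage in MATLAB. This is hedged: a random placeholder stands in for the real triangular filter bank (a precomputed [nBands x N/2+1] matrix; nBands = 26 is an assumed, typical choice), and dct is the Signal Processing Toolbox function.

```matlab
% Mel filter bank -> log -> DCT, giving the MFCCs per frame.
half = magSq(1:floor(size(magSq,1)/2)+1, :);  % keep non-negative frequencies
melBank = rand(26, size(half,1));             % placeholder for the triangular filters
melEnergies = melBank * half;                 % filter-bank output per frame
mfcc = dct(log(melEnergies + eps));           % DCT of the log Mel energies
mfcc = mfcc(1:12, :);                         % keep 12 MFCCs, as in this report
```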

Feature matrix and database formation

In the project, there were in total 39 features per frame:

1. 12 MFCC coefficients
2. 1 frame energy
3. 13 delta-cepstral coefficients
4. 13 delta-delta cepstral coefficients

The feature matrix is a 2D matrix containing 39 rows, one per feature, and 'x' columns, one per frame in the command. This feature matrix is then inserted into the database at an index that is mapped to the command input it was extracted from. Thus, the database is a 3D matrix where each 2D matrix corresponds to a command. The block diagram below describes the structure of the database.


Figure 18 - Database structure

Repeating the feature extraction and storing process for all the command utterances, the final size of the database was 39 x 198 x 155, where:

- 39 is the number of rows (features)
- 198 is the maximum number of frames
- 155 is the number of commands
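A sketch of how such a 3D database can be assembled in MATLAB (sizes from the report; the feature matrix and the index k are placeholders):

```matlab
% Preallocate the 3D database: 39 features x up to 198 frames x 155 commands.
nFeatures = 39; maxFrames = 198; nCommands = 155;
database = zeros(nFeatures, maxFrames, nCommands);

k = 1;                       % placeholder command index
featMat = randn(39, 120);    % placeholder feature matrix (120 frames here)
database(:, 1:size(featMat,2), k) = featMat;
% Columns 121..198 of slice k stay zero: this is exactly the zero padding
% discussed in the 'Database comparison' section below.
```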

Database comparison

For any command input to the program, the features are extracted in the same way as for the pre-recorded commands and stored in a feature matrix on a frame-by-frame basis. The resulting feature matrix is a 2D matrix of features.

To detect the command that was uttered, the feature matrix constructed from the input must be compared with the feature matrices inside the database. The algorithm used for the comparison of the feature matrices was Dynamic Time Warping (DTW), which is described in detail in the Background section. The flow chart below describes the algorithm for the comparison part:


Figure 19 - Database comparison algorithm flow chart

As shown in the diagram above, feature matrices are pulled from the database individually and compared with the input feature matrix. An important consideration when retrieving matrices from the database is the zero frames. MATLAB assigns matrix dimensions dynamically, expanding a matrix to accommodate as many columns as required. The database matrix therefore has only one size: that of the largest feature matrix. So what happens to the feature matrices whose dimensions are smaller than those of the database matrix? What values appear in the empty cells of the smaller matrices? The answer is zeros: MATLAB pads the empty cells of the smaller feature matrices with zeros.

The padding of zeros causes several problems:
- It corrupts the feature matrix with false values
- It adds unnecessary computations
- The comparison algorithm yields incorrect matches

To get good recognition, the zeros need to be removed when comparing the stored feature matrices with the input feature matrix. The 'Strip Zeros' block in the algorithm does just that: it removes the zeros from the feature matrices before passing them on to the comparison algorithm, so the comparison yields reliable and accurate results. This was one of the main challenges that we came across in the project; we had very minimal accuracy rates until we corrected it, and once corrected, the recognition started working really well.

Once the comparison is completed using DTW, the program returns a cumulative distance value, which is a representative distance of the difference between the two matrices. These cumulative distances from the input features are collected for all the feature matrices in the database. Once all the distances have been collected, the minimum distance among all comparisons is found and its feature matrix is passed on as the match. The code then maps the index of that feature matrix to the command index, and hence the command is identified.
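A sketch of this comparison loop in MATLAB. This is hedged: inputFeat is the input feature matrix, database and nCommands come from the sketch in the previous section, and dtwDistance is a placeholder name for the DTW routine sketched in section 4.0.

```matlab
% Compare the input features against every stored command and pick the best.
dist = zeros(1, nCommands);
for k = 1:nCommands
    M = database(:, :, k);
    M = M(:, any(M ~= 0, 1));             % 'Strip Zeros': drop padded columns
    dist(k) = dtwDistance(M, inputFeat);  % cumulative DTW distance
end
[minDist, best] = min(dist);              % 'best' indexes the matched command
```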

The serial port is driven with the output associated with the command, which then controls the hardware. More details about the hardware interface can be found in the '6.0 Hardware' section.

4.0 Dynamic Time Warping (DTW)

In our project, the speech signal is represented by a series of feature vectors which are computed every 10 ms. A whole word will comprise dozens of those vectors, and we know that the number of vectors (the duration) of a word will depend on how fast a person is speaking. In speech recognition, we have to classify sequences of vectors. Therefore, we need to find a way to compute a distance between an unknown sequence of vectors X and the known sequences of vectors W, which are prototypes for the words we want to recognize.

Distance between Two Sequences of Vectors

Classification of a spoken utterance would be easy if we had a good distance measure D(X, W) at hand. To be good, the distance measure must:

- Measure the distance between two sequences of vectors of different length
- While computing the distance, find an optimal assignment between the individual feature vectors
- Compute a total distance as the sum of the distances between individual pairs of feature vectors


Comparing the distance between two sequences of vectors of different length

In the dynamic time warping (DTW) method, when comparing sequences of different length, the sequence length is modified by repeating or omitting some frames so that both sequences have the same length, as shown in Figure 20 below. This modification of the sequences is called time warping.

Figure 20 – Linear Time Warping

[Ref:http://www.tik.ee.ethz.ch/~gerberm/PPS_Spracherkennung/SS06/lecture4.pdf]

As can be seen from Figure 20, the two sequences X and W consist of six and eight vectors, respectively. The sequence W was rotated by 90 degrees, so that the time index for this sequence runs from the bottom of the sequence to its top. The two sequences span a grid of possible assignments between the vectors. Each path through this grid (such as the path shown in the figure) represents one possible assignment of the vector pairs. For example, the first vector of X is assigned to the first vector of W, the second vector of X is assigned to the second vector of W, and so on. As an example, let us assume that the path P is given by the following sequence of time index pairs of the vector sequences:

P = {(0, 0), (1, 1), (2, 2), (3, 2), (4, 2), (5, 3), (6, 4), (7, 4)}

The length of path P is determined by the maximum of the number of vectors contained in X and W. The assignment between the time indices of W and X given by P can be interpreted as "time warping" between the time axes of W and X. In our example, the vectors x2, x3 and x4 were all assigned to w2, thus warping the duration of w2 so that it lasts three time indices instead of one. By this kind of time warping, the different lengths of the vector sequences can be compensated. For the given path P, the distance measure between the vector sequences can be computed as the sum of the distances between the individual vectors.
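As a sketch of this computation (our own illustration, not the report's code): the placeholder sequences are sized to cover the index ranges in P, and the text's 0-based indices are shifted by one for MATLAB.

```matlab
% Distance along the example path P: the sum of local vector distances.
xSeq = randn(39, 8);  wSeq = randn(39, 5);          % placeholder vector sequences
P = [0 0; 1 1; 2 2; 3 2; 4 2; 5 3; 6 4; 7 4] + 1;  % the path P above, 1-based

D = 0;
for p = 1:size(P,1)
    D = D + norm(xSeq(:, P(p,1)) - wSeq(:, P(p,2)));  % local distance
end
```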

Finding the Optimal Path

Once we have the path, computing the distance becomes a simple task. The DTW distance can be computed efficiently by using Bellman's principle of optimality. It states that if the optimal path through the matrix of grid points begins at A, ends at B, and passes through the grid point K, then the partial paths from A to K and from K to B are themselves parts of the optimal path. From this, we can construct a way of iteratively finding our optimal path P, as shown in Figure 21.

Figure 21 - Nonlinear path options

[Ref:http://www.tik.ee.ethz.ch/~gerberm/PPS_Spracherkennung/SS06/lecture5.pdf]

According to this principle, it is not necessary to compute all possible paths P and corresponding distances to find the optimal path. Out of the huge number of theoretically possible paths, only a fraction is computed. To illustrate this concept further, we need to discuss what is called a local path alternative, or local distance.

Local distances

Since both sequences of vectors represent feature vectors measured in short time intervals, we can restrict the time warping to reasonable boundaries. The first vectors of X and W should be assigned to each other, as should their last vectors. For the time indices in between, we want to avoid any big leap backward or forward in time, and instead restrict the time warping to the reuse of the preceding vector(s), to locally warp the duration of a short segment of the speech signal. With these restrictions, we can draw a diagram of the possible local path alternatives for one grid point and its possible predecessors.

As we can see, a grid point (i, j) can have the following predecessors:

Figure 22 – Accumulated distance at point (i, j)

- (i − 1, j): horizontal local path
- (i − 1, j − 1): diagonal local path
- (i, j − 1): vertical local path

[Ref:http://www.tik.ee.ethz.ch/~gerberm/PPS_Spracherkennung/SS06/lecture5.pdf]

All possible paths P which we will consider as candidates for the optimal path can be constructed as concatenations of the local path alternatives described above. According to the local path alternatives diagram, there are only three possible predecessor paths leading to a grid point (i, j): the partial paths from (0, 0) to the grid points (i − 1, j), (i − 1, j − 1) and (i, j − 1). The (globally) optimal path from (0, 0) to grid point (i, j) can then be found by selecting exactly the one path hypothesis among these alternatives which minimizes the accumulated distance A(i, j) of the resulting path from (0, 0) to (i, j):

A(i, j) = d(i, j) + min{ A(i − 1, j), A(i − 1, j − 1), A(i, j − 1) }

where d(i, j) is the local distance between the i-th vector of one sequence and the j-th vector of the other.

Starting from grid point (0, 0) and the vector distances defined at the grid points (1, 0) and (0, 1), we can compute A(1, 0) and A(0, 1). We then look at the points which can be computed from the three points just finished. For each of these points (i, j), we search for the optimal predecessor point out of the set of possible predecessors. In this way we walk through the matrix from bottom-left to top-right. Once we reach the top-right corner of the matrix, the accumulated distance A(RefFrames, TestFrames) is the distance D(W, X) between the vector sequences.
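The walk-through above corresponds to the following MATLAB sketch: a minimal version assuming a Euclidean local distance, saved as dtwDistance.m (the placeholder name used in the database comparison sketch earlier).

```matlab
function D = dtwDistance(W, X)
% Accumulated-distance DTW between two sequences of feature vectors.
% W: [nFeatures x RefFrames], X: [nFeatures x TestFrames].
nRef = size(W, 2); nTest = size(X, 2);
d = @(i, j) norm(W(:,i) - X(:,j));   % local (vector-to-vector) distance
A = inf(nRef, nTest);                % accumulated distances
A(1,1) = d(1,1);
for i = 1:nRef
    for j = 1:nTest
        if i == 1 && j == 1, continue; end
        pred = inf;
        if i > 1,          pred = min(pred, A(i-1, j));   end  % horizontal
        if i > 1 && j > 1, pred = min(pred, A(i-1, j-1)); end  % diagonal
        if j > 1,          pred = min(pred, A(i, j-1));   end  % vertical
        A(i,j) = d(i,j) + pred;      % Bellman recursion
    end
end
D = A(nRef, nTest);                  % D(W, X): distance between the sequences
```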

5.0 Experiments & Results

In addition to the results noted on the presentation day, this section will also present results for the recognition rates for the team members and five external speakers with varied accents.

On the presentation day, the project had more success with the 'On' and 'Dark' commands; the other commands, 'Fifteen', 'Twenty' and 'Twentyfive', suffered in accuracy for external speakers. One reason for this large difference in recognition could be the length of the commands: 'On' and 'Dark' are short, so different accents and speech variations affect them less, whereas for the longer commands the pitch and the accents come into the picture, and accuracy suffers.

On the presentation day:

- We had close to 100% accuracy for the speakers that the database was trained on
- We had about 80% accuracy for all speakers for the command 'On'
- We had about the same accuracy (80%) for the command 'Dark' for all speakers
- Many people did not try the other commands, but for some that did:
  o 'Fifteen' had about 60% accuracy
  o 'Twenty' had about the same (60%) hit rate
  o 'Twentyfive' was the worst, with a meagre hit rate of about 40%

For the report, it is planned to test the program in the following way:

- 10 repeats of every command by all the team members and, in addition, five external speakers
  o The hit rate will be treated as the accuracy
- For five commands, that is 50 repeats per speaker; for 10 speakers, the total number of test samples will be 500
- Results will be classified by command, by speaker type (external or database), and overall (including all speakers)


6.0 Hardware

Figure 23 - Hardware schematic


Figure 24 - 5 V DC to 120 V AC wiring box schematic

Command name      | Fifteen | Twenty | Twentyfive | On       | Dark
------------------|---------|--------|------------|----------|----------
2 LEDs display /  | 15      | 20     | 25         | Light on | Light off
device status     |         |        |            |          |
ASCII character   | c       | w      | h          | o        | f
Relay input PB0   | -       | -      | -          | 1        | 0
BCD chip 1 PC0    | 1       | 0      | 0          | -        | -
input      PC1    | 0       | 1      | 1          | -        | -
           PC2    | 0       | 0      | 0          | -        | -
BCD chip 2 PC3    | 1       | 0      | 1          | -        | -
input      PC4    | 0       | 0      | 0          | -        | -
           PC5    | 1       | 0      | 1          | -        | -

Table 1 – Microcontroller signal routing table

The microcontroller (Atmel) chip is programmed as in the table above. Bit strings are sent from the DSP chip (or the desktop computer through the serial port during the demonstration) as the ASCII characters shown in the table. When a specific character is recognised, for instance "o", output pin PB0 sends a true signal to turn on the relay, and thus the lamp is turned on; when "f" is detected, the same pin sends a false signal to turn off the lamp. Six output pins (PC0 to PC5) are routed to the two BCD drivers. When the characters "c", "w", or "h" are detected, the binary bits operate as shown in the table above to drive the two 7-segment LEDs to display the temperatures "15", "20", or "25".

7.0 Conclusion

The speech recognition was implemented in MATLAB using several extracted features: cepstrum, log energy, delta cepstrum, and delta-delta cepstrum numerical coefficients. Research was done on a few algorithms, Neural Networks (NN), Hidden Markov Models (HMM), and Dynamic Time Warping (DTW), and the conclusion was to perform our speech recognition using Dynamic Time Warping. The DTW algorithm compares the features extracted from the live input signal against the features stored in the database, and when a database match is found, the output interface is triggered to produce a serial bit string as output. Matched commands are described as ASCII characters; the bit strings are sent to the hardware interface and drive the Atmel chip, which routes the signal accordingly.

Five commands (On, Dark, Fifteen, Twenty, and Twentyfive) were recorded in different samples (high pitch, low pitch, slow, fast, and normal) by various speakers, both male and female, with different accents. Results on the presentation day were as follows:

- 100% accuracy for the speakers that the database was trained on
- 80% accuracy for the command 'On'
- 80% accuracy for the command 'Dark'
- About 60% accuracy for the command 'Fifteen'
- 60% accuracy for the command 'Twenty'
- 40% accuracy for the command 'Twentyfive'


However, these results are not final: another recording session will be done soon to gather a larger database, having more speakers record their commands to increase the accuracy. For the final report, it is planned to test the program in the following way:

- 10 repeats of every command by all the team members and, in addition, five external speakers
  o The hit rate will be treated as the accuracy
- For five commands, that is 50 repeats per speaker; for 10 speakers, the total number of test samples will be 500

8.0 References

[1] Mohammed Waleed Kadous, "Dynamic Time Warping", http://www.cse.unsw.edu.au/~waleed/phd/html/node38.html (2002-12-10)

[2] Eamonn J. Keogh and Michael J. Pazzani, "Derivative Dynamic Time Warping", http://www.cs.ucr.edu/~eamonn/sdm01.pdf (2003-10-16)

[3] Prof. Ilya Pollak, "Speech Processing", http://cobweb.ecn.purdue.edu/~ipollak/ee438/FALL04/notes/Section2.2.pdf (2004-09-05)

[4] Michael Gerber, "PPS on Speech Recognition", http://www.tik.ee.ethz.ch/~gerberm/PPS_Spracherkennung/SS06 (2006-01-05)

[5] Joseph W. Picone, "Signal Modeling Techniques in Speech Recognition", Proceedings of the IEEE, Vol. 81 (1993-09-09)

[6] "Window Function", http://en.wikipedia.org/wiki/Hamming_window (2006-12-12)

[7] Minh N. Do, "An Automatic Speaker Recognition System"

[8] B. Plannerer, "An Introduction to Speech Recognition", http://www.speech-recognition.de/textbook.html (2003-05-18)

[9] Longbiao Wang, Norihide Kitaoka, Seiichi Nakagawa, "Robust Distant Speaker Recognition Based on Position Dependent Cepstral Mean Normalization", 2005

[10] "Dynamic Time Warping", http://en.wikipedia.org/wiki/Dynamic_time_warping (2006-12-16)

[11] Stuart N. Wrigley, "Speech Recognition by Dynamic Time Warping", http://www.dcs.shef.ac.uk/~stu/com326/sym.html

[12] "Formant", http://en.wikipedia.org/wiki/Formant (2006-12-16)

[13] "Signal Energy vs. Signal Power", http://cnx.org/content/m10055/latest/ (2004-08-12)

[14] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, 77 (2), pp. 257-286 (1989-02-09)
