Speech Recognition

Class SpeechRec

The class SpeechRec is a tool for performing simple speech recognition tasks with a limited word dictionary.

Description: The speech recognition tool was developed for robot teleoperation and is therefore designed to recognize only a few words (robot commands). Execution of the SpeechRec tool consists of two phases. First, a set of training words (audio waves) is saved into a dictionary. Second, a new word is classified by matching it against the words in the dictionary, and the best match is reported together with its class label. Classification can run in two modes: a single word mode (an audio wave with one word is presented) or a multiple word mode (an audio wave with multiple words is presented, for example, in the case of continuous audio input).
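The two phases can be illustrated with a minimal Python sketch. It is not SpeechRec's implementation: the feature vectors, the Euclidean distance, and the names distance, classify and templates are assumptions made for illustration only, since the actual features and distance measure are not described in this document.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(features, templates):
    """Return (label, distance) of the closest template.

    templates: list of (label, feature_vector) pairs built in the training phase.
    Returns (0, inf) when the dictionary is empty; 0 stands for "unrecognized".
    """
    best_label, best_dist = 0, math.inf
    for label, template in templates:
        d = distance(features, template)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist

# Hypothetical dictionary: two templates for command 3, one for command 5.
templates = [(3, [1.0, 0.0]), (3, [0.9, 0.1]), (5, [0.0, 1.0])]
label, dist = classify([0.8, 0.2], templates)
```

Here the new input is closest to the second template of command 3, so label is 3. Storing several templates per word gives more chances for a close match, at the cost of one extra distance computation per template.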

Setup: The frame of the graphical user interface (GUI), as shown above, is divided into two parts. The top part contains 3 tabs (Run tab, Training tab, and Audio tab), while the bottom part contains user information and a text output window.

First time users should follow these instructions.
1) Specify the directory for users.
To select this directory press the Users Directory button, then create and/or choose a directory. It is best to use a directory that stores no other folders besides the user folders.
2) Select current user in the Select User box.
Choose your user ID from the list, or type in a new ID and press ENTER.
3) Set up your valid commands in the Run tab. Set the text and positive integer representation for each word (or phrase) you want to be recognized. For example, let us assume that you want the word "hello" to be one of your inputs. In the command string box type "hello", and in the command value box type an integer value greater than 0, e.g., 3.
Note: 0 is reserved for unrecognized commands, and negative values are reserved for future use.
Press the Add >> button. The entry "3; hello" should be listed in the Valid Commands window. Create an entry in the same way for each word (phrase) you want to use, with a different positive integer each time (the integers do not have to be consecutive). If you accidentally add a bad entry, click on it and press the Remove button. After you have added all the entries you want, press the Save button, which saves all the information in the Valid Commands window into your commands file.
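The binding rules above (one positive integer per word, 0 and negative values reserved, no integer used twice) can be checked with a short Python sketch. It assumes that each line has the form shown in the Valid Commands window ("3; hello"); the layout of the commands file on disk is not documented here, and parse_commands is a hypothetical helper, not part of SpeechRec.

```python
def parse_commands(lines):
    """Parse 'number; text' lines into a dict {number: text}.

    Raises ValueError for non-positive or duplicate numbers.
    """
    commands = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        number_text, _, text = line.partition(";")
        number = int(number_text.strip())
        if number <= 0:
            raise ValueError(f"command value must be positive: {line!r}")
        if number in commands:
            raise ValueError(f"duplicate command value: {number}")
        commands[number] = text.strip()
    return commands

commands = parse_commands(["3; hello", "7; stop", ""])
```

The result maps 3 to "hello" and 7 to "stop"; a line such as "0; bad" raises ValueError because 0 is reserved for unrecognized commands.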

Training Phase: After you have selected your user ID, you need to create the templates, that is, a dictionary of words to be recognized. A dictionary can be stored and reloaded later.
1) Record and edit sample wave files of each word to be recognized.
Recording two or three samples of each word with slightly varying voice tone or pronunciation will improve word recognition, but it will also increase the classification time in the second phase. The recording can be done with the Audio tab's Record button or with any other software, such as Windows Sound Recorder. After the files are recorded, each one should be resampled to an 8 kHz, 8-bit, mono wave by setting these parameters in the 3 boxes to the right of the Resample button (or by using other software). Use the Save Wave button to save each wave. It is strongly recommended that each wave be manually edited to remove silence and noise in order to clean the dictionary word definitions. One can use a wave editor for this purpose.
2) Create a training file from the Training tab.
This is similar to creating a file with valid commands. In the File Name box enter the name of the wave file that stores a word (or phrase), in the Command String box enter the text of this word, and in the Command Number box enter the number that represents the string, as defined in the Valid Commands window of the Run tab. Then press the Add >> button to add the entry to the list in the Training File window. For example, one can use "hello1.wav", "hello", "3" to go along with the command binding in step 3 of the Setup section. After adding all entries, press the Save button to save the contents of the Training File window into your training file.
3) Train the data.
Press the Train button to generate the templates. Look at the OUTPUT LOG window to see how many templates were created, or to see error messages that prevented the creation of template files. There should be one template file for each entry in the Training File window. Note that old templates are deleted before new ones are created. Templates are saved using the .lfcc extension.
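Before training, it can be useful to confirm that every recorded file really is an 8 kHz, 8-bit, mono wave. The following Python sketch does this with the standard wave module; it is an independent check outside the GUI, and is_template_format is a hypothetical helper name, not part of SpeechRec.

```python
import wave

def is_template_format(path, rate=8000, sample_bytes=1, channels=1):
    """Return True if the wave file is 8 kHz, 8-bit (1 byte per sample), mono."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == rate
                and w.getsampwidth() == sample_bytes
                and w.getnchannels() == channels)
```

For example, is_template_format("hello1.wav") returns False for a 16-bit or stereo recording, which should then be resampled before it is listed in the training file.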

Classification Phase: Once you have generated templates for the words to be recognized, you can classify new wave inputs. The software supports three classification methods. Because all microphone processing is fixed at 8 kHz, 8-bit, mono, the templates and the new audio waves must be sampled in the same format; otherwise classification methods 2 and 3 will not work.
Ranked from highest success rate to the lowest, the methods are:
1) Use the Training tab's Classify button for single file classification.
You can record and resample wave files to match the sampling rate of the wave files used to create the templates. Enter the name of the file you want to classify into the File Name box (the .wav extension is not necessary) and press ENTER or the Classify button. The OUTPUT LOG window will display information about the file: either that it was not found or, if it was found, the distance value for each template, the final classification, and the number of templates the file was compared to.
2) Use the Run tab's Give Order button for single microphone input classification.
It is assumed that the templates are sampled at 8 kHz, 8-bit, mono. After pressing the Give Order button, there is a two-second time interval for recording a word. At the end of the time interval, the word is classified. To change the recording time, manually edit the "defaults.txt" file in your user ID subdirectory of the users directory. Within that file, change the default value "miliseconds=2000" to "miliseconds=#", where # is the integer number of milliseconds to record. If the file contains no such line, add it as a new line somewhere between the [defaults] and [end] tags.
3) Use the Run tab's Start and Stop buttons for continuous microphone input classification.
It is assumed that the templates are sampled at 8 kHz, 8-bit, mono. This method is very demanding on system resources and may not work on slower computers, because wave inputs from the microphone are continuously resampled, filtered and classified. To begin, press the Start button. The continuous processing threads start and one can now speak into the microphone. Each recognized word is displayed in the OUTPUT LOG window. To stop the processing, press the Stop button.
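The edit to "defaults.txt" described in method 2 can also be scripted. The Python sketch below works on the file contents as a string and follows the two rules given there: replace an existing "miliseconds=" line, or add one between the [defaults] and [end] tags. The key is spelled as it appears in this manual. The helper name set_record_time and the sample contents are hypothetical; a real file may contain other settings, which are left untouched.

```python
def set_record_time(text, milliseconds):
    """Return the defaults.txt contents with miliseconds=<value> set.

    An existing entry is replaced; otherwise a new line is inserted
    just before the [end] tag. Raises ValueError if [end] is missing.
    """
    new_line = f"miliseconds={int(milliseconds)}"
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("miliseconds="):
            lines[i] = new_line
            break
    else:
        end = next((i for i, line in enumerate(lines)
                    if line.strip() == "[end]"), None)
        if end is None:
            raise ValueError("no [end] tag found")
        lines.insert(end, new_line)
    return "\n".join(lines) + "\n"

sample = "[defaults]\n[end]\n"          # hypothetical minimal file
updated = set_record_time(sample, 3000)  # record for 3 seconds
```

To apply it to a real file, read the file into a string, pass it through set_record_time, and write the result back while the GUI is not running.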

Description of GUI Items:

User Information and Output Window:
Here you can change the current user, and see any output messages that result from your interaction with the GUI.
OUTPUT LOG window: Displays messages from certain events and button presses. You can scroll it up and down to see old messages. There are two buttons associated with the OUTPUT LOG window, found on the left side of the window.
Clear Log button: Completely erases all information in the OUTPUT LOG window.
New Line button: Adds a blank line to the end of the text in the OUTPUT LOG window, which can be used as a separator for easier viewing.
Delete User button: Removes the selected user from the combo box. The folder with the user's information is not deleted. When the users directory is chosen again, or when the software is run again, user names removed with this button will reappear. To permanently remove a user, delete the folder named after the user ID within the users directory, using the operating system's commands.
Users Directory button: Select a folder in which the folders with each user's information are stored. Click the button then navigate through the directory structure, select the folder you want and press Open.
Select User selection box: Click on the down arrow to select your user ID. If you are a new user, type your ID into the box, then press ENTER. IDs for different users must be unique, so if the ID you want to use is taken, either append a number or pick another ID. If your ID isn't on the list and you are sure you created it, use the Users Directory button to find the location of the users folder.

Run Tab:
The run tab is used to classify sounds through the microphone and to create string to number bindings for commands. See methods 2 and 3 of Classification Phase.
Give Order button: Press it and say one word (phrase) into the microphone. The microphone grabs 2 seconds of speech, which is then classified based on the created templates. To record for a time other than 2 seconds, edit the "defaults.txt" file and add the line miliseconds=#, where # is an integer (the default is 2000 ms = 2 seconds).
Start button: Starts continuous processing of microphone input from the user and classifies the sounds automatically. This method is very demanding on system resources.
Stop button: Stops the continuous processing of microphone input from user.
Save button: Saves the contents of the Valid Commands window to the default commands file for the selected user.
Load button: Loads the contents of the selected user's commands file to the Valid Commands window.
Add >> button: Adds the entry typed into the Command Number and Command String fields to the Valid Commands window. You must press Save to make the changes to the commands file.
Remove button: Removes selected entry(s) from the Valid Commands window. To select multiple entries, use Shift or Ctrl keys. You must press Save to make the changes to the commands file.
Clear button: Removes all entries from the Valid Commands window. You must press Save to make the changes to the commands file.
Command Number field: Integer representation of the word (phrase) you want the system to recognize.
Command String field: Text representation of the word (phrase) you want the system to recognize.
Valid Commands window: Scrollable window displaying the list of all text and integer bindings. The format of each line is "integer value; text string bound to this value", for example "3; hello".

Training Tab:
The training tab is used to create template files, which are the basis of the speech recognition system. See steps 2 and 3 of Training Phase.
Train button: Performs training by extracting audio wave features.
Save button: Saves the contents of the Training File window to the default training file for the selected user.
Load button: Loads the contents of the selected user's training file to the Training File window.
Add >> button: Adds the entry typed into the Command Number, Command String and File Name fields to the Training File window. You must press Save to make the changes to the training file.
Remove button: Removes selected entry(s) from the Training File window. To select multiple entries, use Shift or Ctrl keys. You must press Save to make the changes to the training file.
Clear button: Removes all entries from the Training File window. You must press Save to make the changes to the training file.
Classify button: Classifies the file specified in the File Name field.
File Name field: Name of the wave file (.wav extension not needed) that is to be added to the Training File window or classified with the Classify button.
Command Number field: Type the integer corresponding to the word (phrase) of the wave file in File Name field. Pressing the ENTER key on a keyboard will automatically fill in the Command String field with the assigned text, as specified in Valid Commands window of Run tab. If "Unknown" appears in Command String field after pressing ENTER, then this number is not bound correctly.
Command String field: Type the text corresponding to the word (phrase) of the wave file in File Name field. Pressing ENTER will automatically fill in the Command Number field with the assigned integer, as specified in Valid Commands window of Run tab. If "0" appears in Command Number field after pressing ENTER, then this text string is not bound correctly.
Templates text: The number of templates that exist for this user. This is displayed to the left of the Classify button.
Training File window: Scrollable window displaying list of all wave files that are to be used in creating templates. The format of each line is "File name (with or without .wav extension); text string representing the sound; number bound to the text string."
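The automatic fill-in behavior of the Command Number and Command String fields amounts to a two-way lookup in the valid command bindings, with "Unknown" and 0 as the fallback values. A minimal Python sketch follows. The bindings, the helper names, the exact-match comparison, and the spacing of the formatted line are assumptions for illustration, not SpeechRec's actual code.

```python
commands = {3: "hello", 7: "stop"}  # hypothetical bindings from the Run tab

def string_for(number, commands):
    """Text bound to a command number, or "Unknown" if there is no binding."""
    return commands.get(number, "Unknown")

def number_for(text, commands):
    """Number bound to a command string, or 0 if there is no binding."""
    for number, bound_text in commands.items():
        if bound_text == text:
            return number
    return 0

def training_entry(file_name, text, number):
    """Format one Training File line: 'file name; text; number'."""
    return f"{file_name}; {text}; {number}"

entry = training_entry("hello1.wav", "hello", number_for("hello", commands))
```

With these bindings, entry is "hello1.wav; hello; 3", while string_for(9, commands) gives "Unknown" and number_for("go", commands) gives 0, which signals that the binding must first be added in the Run tab.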

Audio Tab:
The audio tab is used to record, play and manipulate wave files.
Play button: Plays the current wave through the speakers. The current wave's information is displayed in the bottom left corner of this tab.
Stop button: Stops the recording.
Record button: Starts recording a wave from the microphone. To stop recording, press the Stop button.
Save Wave button: Saves the current wave to a file that you choose. The current wave's information is displayed in the bottom left corner of this tab.
Load Wave button: Loads a wave from a file that you choose.
Resample at: button: Resamples the current wave according to the 3 fields on the right. The current wave's information is displayed in the bottom left corner of this tab.
File Information box: This box at the bottom left of the tab shows (1) the file name if the wave is saved (NA if no wave is present, or UNSAVED if the wave exists but is unsaved), (2) Rate: sampling frequency in Hertz, (3) sample size and channels, and (4) Time: length of the wave in seconds.
Sampling Frequency field: Frequency in Hertz (Hz) that you want to resample a wave to. This should be between 8000 and 48000 Hz.
Sampling Channels selection box: Number of channels that you want to resample a wave to. Mono is 1 channel, Stereo is 2 channels.
Sample Size selection box: Number of bits per sample of one channel. Either 8 bits or 16 bits.

Release notes: The software has been tested with Audio-Technica equipment (model ATW-T310) on a Windows machine.

Version:: 1.0
Author:: Martin Urban