Preparing the Data

In this directory we will prepare the given data so that it becomes Janus-readable. We will add a few missing words to the dictionary and create a Janus-style database for the given task. We will also split the given data into a training set and a test set.

We need a bigger database than in the previous steps, so create the following link in your home directory:

ln -s /home/islpra0/IslData IslData

Change to the directory step1. If it doesn't exist, create it next to the data directory. Initially this directory is empty; the rest of this page documents how to create the files that should eventually be there.

 

Looking at the Dictionary

Let's start by having a look at what we have. Type

more ../IslData/dict

and have a look at the given dictionary. You can find such dictionaries at various publicly accessible places on the internet.
The dictionary contains one word per line (given in all capital letters) followed by the space-separated phoneme sequence that describes one common pronunciation of the word. Some words can have multiple valid pronunciations, noted by numbers in parentheses. For example there are three pronunciations for the word 'ARBEITEN': the common one and two slightly different ones (one where the B is pronounced more like a P and one where the T is pronounced more like a D). Type the command

grep -v ";" ../IslData/dict | cut -d ' ' -f 2- | tr -s '[:space:]' | tr ' ' '\n' | sort | uniq -c | less

This will list all the used phonemes together with their frequencies. There are 49 different phones.
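If you want to convince yourself of what each stage of that pipeline does, you can run it on a tiny made-up dictionary first; the two entries below are invented for illustration and are not part of IslData:

```shell
# Two invented dictionary entries (word followed by its phonemes)
printf 'HALLO H A L O\nOTTO O T O\n' > toyDict

# Same pipeline as above: drop comment lines, keep the phoneme columns,
# put one phoneme per line, then count how often each occurs
grep -v ";" toyDict | cut -d ' ' -f 2- | tr -s '[:space:]' \
  | tr ' ' '\n' | sort | uniq -c
# The phone O occurs three times, all others once
```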

 

Looking at the Transcriptions

Transcriptions can come in many different formats and styles. Possibly you've done some transcribing yourself. Do a:

more ../IslData/transcripts 

to look at the given transcriptions. In our case we have one transcription per line in the transcriptions file. The first space-separated item in every line is the "name" of the utterance. The next entries are the speaker, the speaker's gender and the name of the ADC file. The rest of the line contains the transcribed words. Keep in mind that Janus reads transcriptions and dictionaries (as well as everything else) in a case-sensitive mode; never assume that two words capitalized differently are the same thing, because they are not.
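To illustrate the field layout, such a line can be pulled apart with cut; the concrete utterance name, speaker and file name below are invented for the example:

```shell
# One invented transcription line:
# <utterance> <speaker> <gender> <ADC file> <transcribed words...>
line='utt0001 alex_waibel m adcs/alex_waibel/1.adc.shn GUTEN MORGEN'

echo "$line" | cut -d ' ' -f 1    # utterance name: utt0001
echo "$line" | cut -d ' ' -f 4    # ADC file: adcs/alex_waibel/1.adc.shn
echo "$line" | cut -d ' ' -f 5-   # transcribed words: GUTEN MORGEN
```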

Looking at the Recordings

Usually, Janus can understand almost any format of raw data, so there is no need to manipulate the recordings. All we need is a suitable feature description that tells Janus how to interpret the recordings. In our case such a "feature description" could look like this:

readADC ADC (adc)
adc2mel MSC ADC 16ms

It has two lines. The first line tells the feature module how to read in the recording files. It defines a feature named "ADC" which is filled by the "readADC" command using the arguments that follow in the rest of the line. In the second line we specify how to preprocess the data. In this example the preprocessing command is "adc2mel", which means: compute mel-scale coefficients (the default is 16 coefficients), where each frame covers 16 ms on the time axis. Since this tutorial is not intended to teach you preprocessing, you should read the documentation of the feature module to find out what kinds of preprocessing are possible, and read a book for details about the theory. Usually you would have a feature description file that contains these two lines. For now we don't need a file; we can just type the commands manually into a running Janus.

You can enter the following commands in Janus to see what the preprocessing does:

% FeatureSet fs 
% fs readADC ADC IslData/adcs/alex_waibel/1.adc.shn -h 0 -offset mean -bm shorten
% fs adc2mel MSC ADC 16ms
% fs show ADC

A window will then pop up and display the waveform. Using the controls of the feature displaying tool, you can also select the MSC feature and have a look at the mel spectral coefficients.

 

Adding Missing Words to the Dictionary

Often you will encounter the case that some of the words in your database are not covered by the dictionary. If there are only a few missing words, you will simply have to add them manually, finding a pronunciation yourself. You should do this by looking at the pronunciations of similar words that are covered by the dictionary. Sometimes adding the new word just means adding a plural 's' at the end, or concatenating two other words, etc.

In order to find the missing words we first need a list of all the words that are in the dictionary.

cat ../IslData/dict | cut -d ' ' -f 1 | sort > wordsInDict

To find out which words are missing, type the following one-liner:

cut -f2- -d' ' ../IslData/transcripts | tr ' ' '\012' | sort -u | join -v 1 - wordsInDict > missingWords 
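Note that join expects both of its inputs to be sorted, which is why the transcript words are piped through sort -u and wordsInDict was sorted above. On a toy pair of word lists (the words are invented), the -v 1 option prints exactly those words of the first input that are missing from the second:

```shell
# Invented word lists; both must be sorted for join to work
printf 'ABEND\nGUTEN\nMORGEN\n' > toyTransWords
printf 'ABEND\nMORGEN\n'        > toyWordsInDict

# -v 1: print unpairable lines from file 1, i.e. the missing words
join -v 1 toyTransWords toyWordsInDict
# -> GUTEN
```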

Now the file "missingWords" contains a list of all words that should be added to the dictionary. The following lines are an example of how the dictionary could be completed:

echo "+GARBIFY_PREV+ GARBAGE" >> mappedDict 
echo "+LIP_SMACK+ GARBAGE" >> mappedDict 
echo "+MISC_NOISE+ GARBAGE" >> mappedDict 
echo "+TONGUE_CLICK+ GARBAGE" >> mappedDict 

So create a pronunciation for all the missing words and add them to mappedDict. The original 
dictionary needs to be in there as well, of course. 

cat ../IslData/dict >> mappedDict 
sort -o mappedDict mappedDict 

Creating a Janus-Readable Dictionary

Janus expects a special format for its dictionary. Fortunately this format does not differ much from commonly available dictionaries, which usually look like:

ABLE EY B AX L 
ABOUT AX B AW TD 
ACCEPTANCE AE K S EH PD T AX N S 
ACCEPT AE K S EH PD TD 
ACTION AE K SH AX N 
... 

Where Janus wants:

{ABLE} {EY B AX L} 
{ABOUT} {AX B AW TD} 
{ACCEPTANCE} {AE K S EH PD T AX N S} 
{ACCEPT} {AE K S EH PD TD}
{ACTION} {AE K SH AX N} 
... 

The curly braces are used because the dictionary will be interpreted by Tcl when it is read. Obviously you wouldn't need the braces around the words in the above example, but words that include special characters would have to be wrapped in braces. However, you are strongly discouraged from using special characters in the names of words or phonemes; they will almost certainly cause you trouble.

The following one-liner does the conversion of the dictionary for our example:

cat mappedDict | sed 's/ $//g' | perl -pe 's/([^ ]*) ([^ \n]*)(.*)/{$1} {{$2 WB} $3}/g' \
| perl -pe 's/ ([^ }]+)}\n/ {$1 WB}}\n/g' | tr -s ' ' | sed s/GARBAGE/+/g > convertedDict 

Remember, it's a one-line command; we've only split it with backslashes for your reading convenience.

This way, we can easily create our Janus-readable dictionary. It's now called "convertedDict" and will be used from now on without further modifications.

Creating a Janus Vocabulary File

In addition to the dictionary, Janus requires a vocabulary file that defines the search space of words that hypotheses can consist of. The first entry in every line is a search-vocabulary word. Optionally, separated by white space, the second entry is the so-called filler tag, which we will explain later when introducing the decoder:

cut -f1 -d" " convertedDict | tr -d "{" | tr -d "}" | grep -v "(" | grep -v ")" > vocab 

'(' and ')' denote the begin and end of a hypothesis and do not need to be in the vocabulary.
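On two invented converted-dictionary lines (a base form and a pronunciation variant), the command keeps the bare word and drops the variant entry, since its name contains parentheses:

```shell
# Invented converted-dictionary lines: base form ABLE and variant ABLE(2)
printf '{ABLE} {{EY WB} B AX {L WB}}\n{ABLE(2)} {{EY WB} B AX {L WB}}\n' \
  | cut -f1 -d" " | tr -d "{" | tr -d "}" | grep -v "(" | grep -v ")"
# -> ABLE
```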

 

Creating a Task Database

Databases are standard objects in Janus. They can be used for anything, but one of the most common usages is a task database, describing the recordings of the task and giving all the needed information about every utterance. In our example, we have a file ../IslData/transcripts which contains the most essential information about the task's utterances, namely an utterance ID and a transcription.
Other tasks can be organized in a different way, so you'll have to figure out yourself what is the best way to structure your data into a Janus database. In our example the following script can be run in Janus and will create a Janus-style database:

[DBase db] open db.dat db.idx -mode rwc 
set fp [open ../IslData/transcripts r] 
while { [gets $fp line] != -1 } { 
   set utt [lindex $line 0] 
   set spk [lindex $line 1] 
   set sex [lindex $line 2] 
   set adc [lindex $line 3] 
   set text [lrange $line 4 end] 
   db add $utt [list [list UTT $utt] [list SPK $spk] [list SEX $sex] [list ADC $adc] \ 
   [concat TEXT $text]] 
} 
db close 
exit

Defining a Training Set and a Test Set

You know that you should use a cross-validation set for developing a recognizer and report the results on a test set that has never been seen before. Therefore we have to split up the database: we simply take the first 8000 sentences as the training set and the last 881 sentences as the test set.

cut -d ' ' -f1 ../IslData/transcripts > uttIDs
head -n 8000 uttIDs > trainIDs
tail -n 881 uttIDs > testIDs

This creates two files: one that contains the utterance IDs of the training set and one that contains the utterance IDs of the test set.
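On a toy scale (10 invented utterance IDs, split 8/2 instead of 8000/881) you can verify that the two files are disjoint and together reproduce the full list:

```shell
# Ten fake utterance IDs: utt1 .. utt10
seq 1 10 | sed 's/^/utt/' > toyUttIDs

head -n 8 toyUttIDs > toyTrainIDs
tail -n 2 toyUttIDs > toyTestIDs

# Concatenating the two sets must give back the original list
cat toyTrainIDs toyTestIDs | cmp -s - toyUttIDs && echo "split OK"
# -> split OK
```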