Context-Dependent Acoustic Models

A context-dependent phone in Janus is called a polyphone. Many recognizers use context-dependent models; triphones have been the most common for many years. We use the term polyphone when we are talking about contexts of arbitrary width. Don't confuse triphones or polyphones with sequences of phones. A polyphone is still one single phone. It is just modeled depending on its context.

Before we can start training a context-dependent system we have to do some other work, namely find out which polyphones occur in our database and build a Janus object that maintains these polyphones. Because of the internal storage style, we call such objects trees (Janus object class PTree). PTrees can be "attached" to distribution tree nodes. Once a PTree is attached, it can accumulate polyphones and store each of them together with an index identifying which acoustic unit (i.e. distribution) is used to model it.

The collection process is triggered by the building of an utterance HMM. When doing this, Janus always provides the entire available context to the distribution tree when asking for a distribution. If there is no PTree attached to a leaf node of the distribution tree, then the distribution tree will return the distribution index that is stored in that leaf node. If there is a PTree attached, it will return the index of the distribution that is stored together with the polyphone that represents the provided context (possibly limited in its width). If there is no such polyphone in the PTree, the PTree can be configured to automatically create one and call a previously declared Tcl procedure, which will return the index of the associated distribution and do everything else that is necessary, e.g. create a new distribution for the polyphone.
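
The lookup logic just described can be sketched in plain Tcl, without Janus. This is only a toy model under stated assumptions: the procedure names lookup and toyAddProc, the use of a dict in place of a PTree, and all index values are made up for illustration and are not part of the Janus API.

```tcl
# Toy model of one distribution tree leaf (plain Tcl, no Janus needed).
# leafIndex : index returned when no PTree is attached to the leaf
# hasPTree  : 1 if a (toy) PTree is attached, 0 otherwise
# ptreeVar  : name of a dict variable mapping context -> distribution index
# addProc   : callback that supplies an index for an unseen context
proc lookup {leafIndex hasPTree ptreeVar context addProc} {
    upvar 1 $ptreeVar ptree
    # No PTree attached: the leaf answers with its own distribution index.
    if {!$hasPTree} { return $leafIndex }
    # PTree attached and polyphone already known: return the stored index.
    if {[dict exists $ptree $context]} { return [dict get $ptree $context] }
    # Unseen polyphone: ask the callback for an index and remember it.
    set idx [$addProc $context]
    dict set ptree $context $idx
    return $idx
}

# Hypothetical callback: hands out fresh indices starting at 100.
set nextIndex 100
proc toyAddProc {context} {
    global nextIndex
    set idx $nextIndex
    incr nextIndex
    return $idx
}

set pt [dict create]
puts [lookup 7 0 pt "B|C" toyAddProc]   ;# no PTree: prints 7
puts [lookup 7 1 pt "B|C" toyAddProc]   ;# unseen context: prints 100
puts [lookup 7 1 pt "B|C" toyAddProc]   ;# seen context: prints 100 again
```

The third call does not invoke the callback again, which mirrors the point that a polyphone is created only the first time its context is seen.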

So much for the brief description of the polyphone collection. Let's now have a look at the script, which you can find in full, as usual, in the scripts thread.

Start up Janus a bit differently:

[FeatureSet fs] setDesc @/home/islpra0/IslData/featDesc
              fs setAccess @/home/islpra0/IslData/featAccess 

fs FMatrix LDAMatrix
fs:LDAMatrix.data bload ../step5/ldaMatrix

[CodebookSet cbs fs]    read ../step8/codebookSet 
[DistribSet dss cbs]      read ../step2/distribSet 
[PhonesSet ps]            read ../step2/phonesSet 
ps:PHONES add pad 
[Tags tags]                 read ../step2/tags 
[Tree dst ps:PHONES ps tags dss] read ../step2/distribTree 

SenoneSet sns [DistribStream str dss dst] 

[TmSet tms] read                        ../step2/transitionModels 
[TopoSet tps sns tms] read             ../step2/topologies 
[Tree tpt ps:PHONES ps tags tps] read  ../step2/topologyTree 
[Dictionary diction ps:PHONES tags] read ../step4/convertedDict 

[DBase db] open ../step1/db.dat ../step1/db.idx -mode r 
AModelSet amo tpt ROOT 
HMM hmm diction amo

The difference is the additional phone "pad", which is added to the phones object right after reading it from file. The pad phone will be used later as a filler phone for unknown contexts. For example, at the beginning or at the end of a sentence there is no context in one direction. So rather than forcing the distribution tree to answer context questions with "don't know", we simply tell it to assume that every unknown context is the pad phone. This, however, means that every question about an unknown context will be answered with "no" as long as we don't explicitly ask for the pad phone or for a phone set which contains the pad phone.
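
To make the role of the pad phone concrete, here is a small plain-Tcl sketch (no Janus involved) that lists the left and right neighbour of every phone in a sequence and substitutes "pad" where no neighbour exists. The procedure name contexts and the phone sequence are made up for illustration; Janus does this internally when it builds an utterance HMM.

```tcl
# Return a list of {left phone right} triples for a phone sequence,
# using the pad phone wherever a neighbour is missing.
proc contexts {phones {pad pad}} {
    set result {}
    set last [expr {[llength $phones] - 1}]
    for {set i 0} {$i <= $last} {incr i} {
        set left  $pad
        set right $pad
        if {$i > 0}     { set left  [lindex $phones [expr {$i - 1}]] }
        if {$i < $last} { set right [lindex $phones [expr {$i + 1}]] }
        lappend result [list $left [lindex $phones $i] $right]
    }
    return $result
}

# The first and last triples contain the pad phone.
foreach triple [contexts {H E L O}] { puts $triple }
```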

After startup, we set the default configuration of all PTree objects. Remember that by configuring an object class instead of an object instance, we define the initial properties of any new object of that class which will be created in the future:

PTree configure -maxContext 1 
PTree configure -addProc ptreeAddProc

The first of the two parameters that we are configuring is maxContext, which defines the maximum context width we would like to consider. By setting it to one, we say that we only want to consider the context up to one phone to the right and to the left, thus giving "triphones".
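
The effect of a maximum context width can be illustrated with a plain-Tcl sketch. The procedure limitContext is hypothetical and only models the idea of keeping the nearest phones on each side; it makes no claim about how Janus implements the limit internally.

```tcl
# Keep at most maxContext phones on each side of the centre phone.
# left  : phones to the left, nearest phone last
# right : phones to the right, nearest phone first
# Assumes maxContext >= 1.
proc limitContext {left right maxContext} {
    set l [lrange $left end-[expr {$maxContext - 1}] end]
    set r [lrange $right 0 [expr {$maxContext - 1}]]
    return [list $l $r]
}

puts [limitContext {A B C} {D E F} 1]   ;# triphone context: C D
puts [limitContext {A B C} {D E F} 2]   ;# wider context: {B C} {D E}
```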

The other configuration parameter is the previously mentioned procedure that will be called whenever there is a context in an utterance that hasn't been seen before. We call this Tcl procedure "ptreeAddProc". We define it like this:

proc ptreeAddProc { ptree args } { 
     regexp {(.*)-([bme])} $ptree dummy phone subphone 
     set dsName [lindex $args 0]-$subphone 
     set cbName "$phone-$subphone" 
     if { [dss index $dsName] == -1 } { dss add $dsName $cbName } 
     return $dsName 
}

See the Janus documentation for details about the argument list of this procedure. For now, it is enough to know that the first argument is the name of the PTree into which the new polyphone will be inserted, and the first item of the args list is the context itself (a string containing comma-separated phonemes for both sides of the context, where the context -1 and +1 are separated by a pipe character).

So, the first thing that we do in the ptreeAddProc procedure is to get the name of the phone and the subphone segment out of the name of the ptree. Since we are going to give each ptree the same name as the node it is attached to, it is easy to decompose that name into phone and subphone unit. Then we compose the name of a new distribution, which is the name of the context plus the subphone name. The corresponding codebook name is the same as the ptree name. Finally we add the new distribution to the distribution set (unless it already exists) and return the distribution index to the calling ptree.
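
The name handling can be tried out in plain Tcl without Janus. In the sketch below, the helper namesFor is hypothetical, and the PTree name "AH-b" and the context string "K|T" are made-up sample values; the real context format is the one described in the Janus documentation.

```tcl
# Derive the distribution name and the codebook name the same way
# ptreeAddProc does, but without touching any Janus objects.
proc namesFor {ptree context} {
    # Split a name like "AH-b" into phone "AH" and subphone "b".
    regexp {(.*)-([bme])} $ptree dummy phone subphone
    set dsName "$context-$subphone"   ;# context plus subphone
    set cbName "$phone-$subphone"     ;# same as the ptree name
    return [list $dsName $cbName]
}

lassign [namesFor AH-b "K|T"] dsName cbName
puts "distribution: $dsName"
puts "codebook:     $cbName"
```

All polyphones of one subphone unit therefore get their own distribution but share the codebook of the context-independent model.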

Prepared with this procedure we can start attaching ptrees:

foreach ds [dss:] { 
        if { $ds != "_-m" && $ds != "+-m" } {
             set phone [string range $ds 0 [expr [string first "-" $ds] -1]] 
             dst.ptreeSet add $ds $phone 0 0 -count 1 
             dst:$ds configure -ptree [dst.ptreeSet index $ds]
        } 
}

For each distribution (except the silence and the garbage models, which we don't want to model context-dependently) we create a ptree with the same name $ds, initialize it with the monophone context $phone, identified by the two zeros, which mean that the context ranges from 0 to 0, and give it an initial count of 1 (all polyphones are counted). Then we attach this newly created PTree to its distribution tree node by configuring the -ptree parameter of that node.
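
The filtering and the phone extraction in this loop use only standard Tcl string commands, so they can be checked in isolation. In the following sketch the helper phoneOf and the list of distribution names are made up for illustration; in the real script the names come from the distribution set dss.

```tcl
# Everything before the first "-" in a distribution name is the phone.
proc phoneOf {ds} {
    return [string range $ds 0 [expr {[string first "-" $ds] - 1}]]
}

# Hypothetical distribution names, including the two excluded models.
foreach ds {AH-b AH-m AH-e _-m +-m} {
    if {$ds ne "_-m" && $ds ne "+-m"} {
        puts "$ds -> phone [phoneOf $ds]"
    }
}
```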

Now the only thing missing in the preparation for the polyphone collection is the configuration of the distribution tree. We have to tell it to use the pad phone that we've added before, and we have to tell it to make its ptrees add unseen polyphones:

dst configure -padPhone [ps:PHONES index pad] 
dst configure -ptreeAdd 1

Finally we have to loop over all of our data. Remember that it is the building of the utterance HMM that triggers the polyphone collection. We don't need anything else. We don't need any alignments. So the loop looks pretty simple:

foreach utt [db] { 
     puts $utt 
     makeArray arr [db get $utt] 
     hmm make $arr(TEXT) -optWord SIL 
}

We should not forget to store the data structures that we've just computed:

dss              write distribSet 
dst              write distribTree 
dst.ptreeSet write ptreeSet

Now have a look at the written files. Look at the distribution set file and see how many distributions we have now. You can also see the contexts in the distribution names. There are many phonemes that are tagged with the word-boundary tag, and all distributions use a maximum context width of +-1.

You are welcome to rerun the entire script, this time using a wider maximum context width. Compare the results.