Computing New Codebooks with k-means
Besides using weights that have been trained by some other system, the most popular way to initialize codebooks is the k-means algorithm. Whenever we change the feature space to something we have not used before, we have to find new reference vectors for our codebooks. This is always the case, for example, when we compute a new LDA matrix. The k-means algorithm needs a large number of example vectors, which it then clusters into fewer vectors, namely the number of reference vectors we want for our codebooks. Since both the set of reference vectors of a codebook and the set of example vectors can be regarded as matrices, the k-means operation is a matrix-to-matrix operation: the source matrices hold the sample vectors and the destination matrices hold the reference vectors.
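To make this more concrete, here is a minimal plain-Tcl sketch of the k-means idea itself. It is only an illustration, not the Janus implementation (Janus provides k-means through the neuralGas method used below), and the procedure names euclid2 and kmeans are invented for this example.

# Illustration only: cluster a list of sample vectors (each a list of numbers)
# into k reference vectors by repeated assign-to-nearest / re-average steps.
proc euclid2 {a b} {
    set s 0.0
    foreach x $a y $b { set d [expr {$x - $y}] ; set s [expr {$s + $d*$d}] }
    return $s
}

proc kmeans {samples k {iterations 5}} {
    set dim  [llength [lindex $samples 0]]
    set refs [lrange $samples 0 [expr {$k - 1}]]       ;# start with the first k samples
    for {set it 0} {$it < $iterations} {incr it} {
        for {set i 0} {$i < $k} {incr i} {
            set sum($i) [lrepeat $dim 0.0]
            set cnt($i) 0
        }
        foreach v $samples {
            set best 0 ; set bestD [euclid2 $v [lindex $refs 0]]
            for {set i 1} {$i < $k} {incr i} {          ;# find the closest reference vector
                set d [euclid2 $v [lindex $refs $i]]
                if {$d < $bestD} { set best $i ; set bestD $d }
            }
            set acc {}                                  ;# accumulate the sample there
            foreach x $v s $sum($best) { lappend acc [expr {$s + $x}] }
            set sum($best) $acc
            incr cnt($best)
        }
        set newRefs {}                                  ;# move each reference to its samples' mean
        for {set i 0} {$i < $k} {incr i} {
            if {$cnt($i) == 0} { lappend newRefs [lindex $refs $i] ; continue }
            set mean {}
            foreach s $sum($i) { lappend mean [expr {$s / $cnt($i)}] }
            lappend newRefs $mean
        }
        set refs $newRefs
    }
    return $refs
}

# e.g. six 2-dimensional samples clustered into two reference vectors:
puts [kmeans {{0 0} {0 1} {1 0} {9 9} {9 8} {8 9}} 2]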
The full script, which is described in detail below, can be found here.
Extracting Sample Vectors
Before we can run k-means, we have to extract some sample vectors. Therefore we start up Janus as usual, only this time using the ldaMatrix files from ../step5 instead of the old ones. After the startup we create a sample set object:
SampleSet sms fs LDA 32
foreach ds [dss:] { sms add $ds ; sms map [dss index $ds] -class $ds }
This should remind you very much of the creation of an LDA object. In fact, it is the same thing, only with a different object. Here, too, we define classes and specify which acoustic unit indices belong to which class.
When the sample set object has been created, we configure some of its properties:
set fp [open ../step5/ldaCounts] ; makeArray counts [read $fp] ; close $fp
foreach class [sms:] { sms:$class configure -maxCount 500 -modulus [expr 1+$counts($class)/500] }
First we read the counts file that we wrote when doing the LDA. This way we know how many vectors to expect from every class in the entire database. If we only want 500 example vectors per class for the k-means, it would be risky to simply take the first 500 occurrences; it is better to take examples from all over the database. So why not take every n-th vector that belongs to a class, where n is the occurrence frequency divided by 500? This way we get 500 examples for every class, spread over the whole database. To do this we first build an array named counts with the makeArray command that we have already used before. Then we configure the maximum number of vectors to be extracted for every class to be 500, and the modulus, which defines the n from above, to be the number of counts for the class divided by 500. The 1 is added to avoid a modulus of 0 for classes that have fewer than 500 counts.
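As a concrete example of the modulus arithmetic (the class names and counts below are made up for illustration):

set counts(AX-m) 2600 ; set counts(ZH-e) 300    ;# hypothetical class names and counts
expr 1+$counts(AX-m)/500    ;# integer division: 1 + 5 = 6, take every 6th vector (about 433 of them)
expr 1+$counts(ZH-e)/500    ;# 1 + 0 = 1, take every vector of this rare class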
Then we run a loop over the entire training data, which is very similar to the one we used for computing the LDA matrix:
foreach utt [db] {
    puts $utt
    set uttInfo [db get $utt]                ;# get the utterance's database entry
    makeArray arr $uttInfo                   ;# ... and make it accessible as a Tcl array
    fs eval $uttInfo                         ;# compute the features for this utterance
    hmm make $arr(TEXT) -optWord SIL         ;# build the utterance HMM with optional SIL words
    path bload ../step4/labels/$utt          ;# load the labels written in step 4
    path map hmm -senoneSet sns -stream 0    ;# map the path to the senones of this HMM
    sms accu path                            ;# accumulate sample vectors along the path
}
In fact, the only difference is that this time we accumulate into the sms object instead of the lda object. Finally, when the loop has finished, we do a
sms flush
to write out all the vectors that have not been written to a file yet. Remember that the sample set object is a buffer whose purpose is to extract sample vectors quickly. If the buffer is smaller than the maximum number of vectors to be extracted per class, it is flushed automatically whenever it becomes full. At the end of the loop we must flush the remainder manually.
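The following plain-Tcl sketch illustrates how a modulus, a maximum count, and buffer flushing interact; it is not the Janus internals, and all names in it (extractClass, writeVectors, bufferSize) are invented for this illustration.

# Illustration only: take every modulus-th vector of a class, keep at most
# maxCount of them, buffer them, and append the buffer to a file when full.
proc writeVectors {file vectors} {
    set fp [open $file a]                          ;# append the buffered vectors to the class file
    foreach v $vectors { puts $fp $v }
    close $fp
}

proc extractClass {file vectors modulus maxCount {bufferSize 100}} {
    set buffer {} ; set seen 0 ; set taken 0
    foreach v $vectors {
        incr seen
        if {$seen % $modulus != 0} { continue }    ;# keep only every modulus-th occurrence
        if {$taken >= $maxCount}   { break }       ;# never extract more than maxCount vectors
        lappend buffer $v
        incr taken
        if {[llength $buffer] >= $bufferSize} {    ;# buffer full: written out automatically
            writeVectors $file $buffer
            set buffer {}
        }
    }
    if {[llength $buffer] > 0} { writeVectors $file $buffer }   ;# the manual "flush"
}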
Creating New Codebooks
We can't simply replace the codebooks in cbs by the new vectors we will get from k-means, because the codebooks we have used so far were defined on the MSC feature, while the ones we will use from now on will be defined on the LDA feature. Also, we would like to reduce the dimensionality of the feature space from 16 (MSC) down to 12 (LDA). This makes the system smaller and faster, and hopefully even better at generalization. We can still use the old distributions and replace their values with the new ones that we will get from k-means, because the size of the codebooks will not change, and thus the size of the distributions will also remain the same. So we create a new codebook set named cbs2, plus some helper objects for holding vectors and matrices:
CodebookSet cbs2 fs
FMatrix smp
FVector cnt
Then we run a loop over all codebooks; for every codebook of our old codebook set we must create one in the new codebook set:
foreach cb [cbs:] {
    puts $cb
    cbs2 add $cb LDA [cbs:$cb configure -refN] 32 [cbs:$cb configure -type]

In the new codebook set we use the same number of reference vectors (-refN) and the same covariance matrix type (-type), but we use the LDA feature and only 32 coefficients.
Then - still within the loop - we load the extracted sample vectors from their file into the previously created smp matrix. We have to reduce the size of the matrix, because the sample set object saved not only the 32 LDA coefficients but also one additional coefficient containing the path likelihood of the frame (always 1.0 for Viterbi paths, and the "gamma" value for forward-backward paths):
    if [catch {smp bload $cb} msg] {
        puts "ERROR: $msg"
    } else {
        smp resize [smp configure -m] [expr [smp configure -n]-1]   ;# drop the extra column, keeping the 32 LDA coefficients

Now the smp matrix contains only the 32 LDA coefficients for all the extracted vectors of the $cb codebook. We can now call the k-means algorithm:
        cbs2:$cb.mat neuralGas smp -maxIter 5 -tempS 0 -counts cnt
    }
}

The method is called "neuralGas" because k-means is a special case of the neural gas algorithm. With -tempS 0 we are saying that we only want pure k-means.
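For intuition about -tempS, here is a schematic plain-Tcl sketch of the rank-based weighting idea behind neural gas; it is not the Janus implementation (the exact weighting Janus uses may differ), and gasWeight is an invented name. With a positive temperature every reference vector is pulled towards a sample with a weight that decays with its distance rank; at temperature 0 only the closest reference vector moves, which is exactly the k-means update.

# Illustration only: rank-based update weight for one sample,
# rank 0 = closest reference vector, rank 1 = second closest, ...
proc gasWeight {rank temp} {
    if {$temp <= 0} {
        return [expr {$rank == 0 ? 1.0 : 0.0}]     ;# temperature 0: only the winner moves = k-means
    }
    return [expr {exp(-double($rank)/$temp)}]      ;# otherwise a soft, rank-dependent weight
}

foreach rank {0 1 2} { puts "rank $rank:  temp 1.0 -> [gasWeight $rank 1.0]   temp 0 -> [gasWeight $rank 0]" }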
When the loop has finished, we have a new codebook set cbs2, filled with new codebooks, and the old distribution set, still filled with equally distributed mixture weights. All that is left to do is to store the new data structures:
cbs2 write codebookSet
cbs2 save codebookWeights
dss save distribWeights