Extracting Samples

The k-means and neural-gas algorithms are used to cluster many vectors into a few classes. Please do not confuse this 'clustering' of vectors with the clustering of contexts, which is a completely different issue. Also, the classes mentioned above have nothing to do with the LDA classes or with the sample-set buffers that will be explained further down.

Why Write Sample-Files

Why do we have to extract samples at all? Strictly speaking, we don't: it is possible to write no files at all and keep everything the k-means needs in memory. This is certainly no problem for context-independent systems. But suppose you want to create 5000 codebooks, each with 32 vectors of 32 coefficients. To build a 32-vector codebook with k-means you would like to have many times 32 example vectors, say 1000 or more. That means storing 5000*1000*32 = 160 million floating point numbers, which is about 640 MB assuming 4-byte floats, so you easily end up with more than half a gig of data. If this is not a problem for your computer, you can go ahead without writing sample files and simply keep the samples in memory, in large enough buffers of a SampleSet object. However, if you think you might need the extracted samples again, e.g. because the k-means went wrong for whatever reason, then having the files makes restarting the k-means much easier.
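
The estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the counts are the example figures from the text, and the 4 bytes per number is an assumption (single-precision floats), not something Janus is known to guarantee.

```python
# Rough memory estimate for keeping all k-means samples in memory.
NUM_CODEBOOKS = 5000          # codebooks to build (example figure from the text)
SAMPLES_PER_CODEBOOK = 1000   # example vectors per codebook (from the text)
COEFFS_PER_VECTOR = 32        # coefficients per vector (from the text)
BYTES_PER_FLOAT = 4           # ASSUMPTION: 32-bit single-precision floats


def sample_memory_bytes(codebooks, samples, coeffs, bytes_per_float):
    """Return the number of bytes needed to hold every sample vector."""
    return codebooks * samples * coeffs * bytes_per_float


total_floats = NUM_CODEBOOKS * SAMPLES_PER_CODEBOOK * COEFFS_PER_VECTOR
total_bytes = sample_memory_bytes(
    NUM_CODEBOOKS, SAMPLES_PER_CODEBOOK, COEFFS_PER_VECTOR, BYTES_PER_FLOAT
)

print(total_floats)                      # 160000000
print(f"{total_bytes / 1e6:.0f} MB")     # 640 MB
```

Plugging in your own number of codebooks and samples per codebook gives a quick idea of whether the in-memory route is realistic on your machine.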

Writing Sample-Files

In Janus, the object type SampleSet is very similar to the LDA object and is defined in a similar way: you must define buffers and assign a buffer to every acoustic model, just as you would assign a class to every model when defining an LDA object. Then comes a training iteration, which looks just like any other training iteration. During the training, the buffers are filled with the feature vectors of the training data. Whenever a buffer is full, it is flushed into a file. At the end of the training you have one file for each buffer, containing all the vectors that belong to it. You will use these files as inputs for the k-means and neural-gas algorithms to compute codebooks. So when defining the buffers, you have to define one buffer for each codebook you are going to have later.
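
To make the fill-and-flush behaviour concrete, here is a minimal sketch in plain Python. It is not the Janus SampleSet API: the class SampleBuffer, the function read_samples, the model and buffer names, the ".smp" file suffix and the raw little-endian float file layout are all hypothetical and chosen only for illustration.

```python
import os
import struct
import tempfile


class SampleBuffer:
    """Hypothetical stand-in for one sample-set buffer (NOT the Janus API)."""

    def __init__(self, path, dim, capacity):
        self.path = path          # file that receives the flushed vectors
        self.dim = dim            # coefficients per feature vector
        self.capacity = capacity  # vectors held in memory before a flush
        self.vectors = []
        open(path, "wb").close()  # start with an empty file

    def add(self, vector):
        """Store one feature vector; flush to disk when the buffer is full."""
        if len(vector) != self.dim:
            raise ValueError("vector has the wrong dimension")
        self.vectors.append(list(vector))
        if len(self.vectors) >= self.capacity:
            self.flush()

    def flush(self):
        """Append all buffered vectors to the file and empty the buffer."""
        with open(self.path, "ab") as f:
            for v in self.vectors:
                # ASSUMPTION: raw little-endian 32-bit floats, no header.
                f.write(struct.pack(f"<{self.dim}f", *v))
        self.vectors.clear()


def read_samples(path, dim):
    """Read back every vector from a file written by SampleBuffer."""
    size = dim * 4
    samples = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(size)
            if len(chunk) < size:
                break
            samples.append(list(struct.unpack(f"<{dim}f", chunk)))
    return samples


# One buffer per future codebook; every acoustic model is assigned a buffer.
workdir = tempfile.mkdtemp()
model_to_buffer = {"A-b": "A", "A-m": "A", "B-b": "B"}  # hypothetical names
buffers = {
    name: SampleBuffer(os.path.join(workdir, name + ".smp"), dim=2, capacity=2)
    for name in sorted(set(model_to_buffer.values()))
}

# Stand-in for a training iteration: (model, feature vector) pairs.
frames = [
    ("A-b", [1.0, 2.0]),
    ("B-b", [3.0, 4.0]),
    ("A-m", [5.0, 6.0]),
    ("A-b", [7.0, 8.0]),
]
for model, vec in frames:
    buffers[model_to_buffer[model]].add(vec)

# At the end of the iteration, write out whatever is still in memory.
for buf in buffers.values():
    buf.flush()

samples_a = read_samples(buffers["A"].path, 2)
samples_b = read_samples(buffers["B"].path, 2)
```

After the loop, each file holds exactly the vectors of the models assigned to its buffer, which is the input a later k-means or neural-gas run would need for the corresponding codebook.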