The k-means and neural-gas algorithms are used to cluster many vectors into a few classes. Please do not confuse this 'clustering' of vectors with the clustering of contexts, which is a completely different issue. Also, the classes mentioned above have nothing to do with the LDA classes or with the sample-set buffers that will be explained further down.
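To make "clustering many vectors into few classes" concrete, here is a minimal k-means sketch in plain Python. It is only an illustration and not the toolkit's own implementation: the function name `kmeans`, the choice of the first k vectors as initial centroids, and the fixed iteration count are all assumptions made for this example.

```python
def kmeans(vectors, k, iterations=10):
    """Cluster `vectors` (lists of floats) into k classes.

    Assumption for this sketch: the first k vectors serve as the
    initial centroids. Real systems usually initialise differently.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid
        # (squared Euclidean distance).
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a class is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Four 2-dimensional example vectors forming two obvious groups.
example = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
centroids, labels = kmeans(example, k=2)
print(labels)  # [0, 1, 0, 1]
```

The resulting centroids play the role of the codebook vectors; the example vectors are the samples the k-means needs to see.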
Why Write Sample Files
Why do we have to extract samples at all? Strictly speaking, it is not necessary: you can skip writing any files and just keep everything the k-means needs in memory. This is certainly no problem for context-independent systems. But suppose you want to create 5000 codebooks, each with 32 vectors of 32 coefficients. To build a 32-vector codebook with k-means you would like to have many times 32 example vectors, say 1000 or more. You would then have to store 5000*1000*32 floating-point numbers, and you easily end up with more than half a gigabyte of data. If this is not a problem for your computer, you can go ahead without sample files and simply keep the samples in memory, in large enough buffers of a SampleSet object. If you think you might need the extracted samples again, e.g. because the k-means went wrong for whatever reason, then having the files makes restarting the k-means much easier.
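The memory estimate above can be checked with a few lines of Python. This is a rough sketch only: the helper name `sample_buffer_bytes` is made up for this example, and the 4 bytes per number assumes single-precision floats, which the text does not state explicitly.

```python
def sample_buffer_bytes(num_codebooks, samples_per_codebook,
                        coeffs_per_vector, bytes_per_float=4):
    """Bytes needed to hold all sample vectors in memory at once.

    Assumption: each coefficient is stored as a 4-byte float
    unless bytes_per_float says otherwise.
    """
    return (num_codebooks * samples_per_codebook
            * coeffs_per_vector * bytes_per_float)


# Numbers from the text: 5000 codebooks, 1000 samples each,
# 32 coefficients per sample vector.
total = sample_buffer_bytes(5000, 1000, 32)
print(f"{total / 1e6:.0f} MB")  # 640 MB
```

With double-precision storage the figure doubles, so whether in-memory buffers are practical depends on both the number of codebooks and the storage format.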