python 2.7 - Building Speech Dataset for LSTM binary classification


posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm trying to do binary classification with an LSTM using Theano. I have gone through the example code, but I want to build my own.

I have a small set of "Hello" and "Goodbye" recordings. I preprocess them by extracting MFCC features and saving the features to text files. I have 20 speech files (10 per word) and generate one text file per recording, so 20 text files containing MFCC features. Each file is a 13x56 matrix.

My problem now is: how do I use these text files to train the LSTM?

I am relatively new to this. I have gone through some of the literature as well, but have not found a really good explanation of the concept.

Any simpler approach using LSTMs would also be welcome.

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:08:59+0000

There are many existing implementations, for example a TensorFlow implementation and a Kaldi-focused implementation with all the scripts; it is better to check them first.

Theano is too low-level; you might try Keras instead, as described in its tutorial. You can run the tutorial as is to understand how things work.

Then you need to prepare a dataset. Turn your data into sequences of frames, and assign an output label to every frame in each sequence.
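As a rough sketch of that preparation step, the text files could be read with NumPy and stacked into a single array. This assumes each file is a whitespace-delimited 13x56 matrix with coefficients in rows and frames in columns; the helper name load_mfcc_dataset and the file names are hypothetical, and the demo writes random matrices in place of real recordings. It is written for Python 3 with a recent NumPy.

```python
import os
import tempfile
import numpy as np

N_MFCC, N_FRAMES = 13, 56  # matrix shape stated in the question


def load_mfcc_dataset(paths, labels):
    """Load one 13x56 MFCC matrix per file and stack them into a
    (n_words, n_frames, n_mfcc) array plus a label vector."""
    sequences = []
    for path in paths:
        mfcc = np.loadtxt(path)  # assumed layout: (13, 56), coefficients x frames
        if mfcc.shape != (N_MFCC, N_FRAMES):
            raise ValueError("unexpected shape %r in %s" % (mfcc.shape, path))
        sequences.append(mfcc.T)  # (56, 13): one 13-float vector per frame
    return np.stack(sequences).astype("float32"), np.asarray(labels)


# Demo with synthetic files standing in for the real recordings.
tmp_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
paths, labels = [], []
for word_id, word in enumerate(["hello", "goodbye"]):
    for i in range(10):
        path = os.path.join(tmp_dir, "%s_%02d.txt" % (word, i))  # hypothetical names
        np.savetxt(path, rng.normal(size=(N_MFCC, N_FRAMES)))
        paths.append(path)
        labels.append(word_id)

X, y = load_mfcc_dataset(paths, labels)
print(X.shape, y.shape)  # (20, 56, 13) (20,)
```

The transpose matters: recurrent layers in Keras expect input shaped (samples, time steps, features), so each of the 56 frames becomes one time step carrying 13 features.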

Keras supports two types of RNN layers: layers that return a sequence (one output per frame) and layers that return a single value (the output for the last frame only). You can experiment with both; in code you just set return_sequences=True or return_sequences=False.

To train with sequences, you can assign a dummy label to all frames except the last one, which gets the label of the word you want to recognize. Place the inputs and output labels in arrays, so it will be:

X = [[word1frame1, word1frame2, ..., word1framen], [word2frame1, word2frame2, ..., word2framen]]

Y = [[0, 0, ..., 1], [0, 0, ..., 2]]

In X every element is a vector of 13 floats. In Y every element is just a number: 0 for intermediate frames and the word ID for the final frame.
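A minimal NumPy sketch of that per-frame label array, assuming 56 frames per word and word IDs 1 for "Hello" and 2 for "Goodbye" (the ID assignment and the four-word example are illustrative choices):

```python
import numpy as np

n_frames = 56
word_ids = np.array([1, 2, 1, 2])  # assumed IDs: 1 = "Hello", 2 = "Goodbye"

# Start with the dummy label 0 on every frame, then write the word ID
# into the last frame of each sequence.
Y_seq = np.zeros((len(word_ids), n_frames), dtype="int64")
Y_seq[:, -1] = word_ids

print(Y_seq[0, -3:])  # [0 0 1]
print(Y_seq[1, -3:])  # [0 0 2]
```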

To train with just labels, place the inputs and output labels in arrays as before; the output array is simpler, with one label per word. So the data will be:

X = [[word1frame1, word1frame2, ..., word1framen], [word2frame1, word2frame2, ..., word2framen]]

Y = [[0, 1, 0], [0, 0, 1]]

Note that the output is vectorized (np_utils.to_categorical) to turn the word IDs into vectors instead of plain numbers.
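To make the vectorization concrete, here is a small NumPy stand-in for what to_categorical does. The helper name one_hot is hypothetical, and the three classes assume ID 0 is reserved for the dummy label with IDs 1 and 2 for the two words:

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer class IDs into one-hot rows (NumPy stand-in for to_categorical)."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[labels]


Y = one_hot([1, 2], num_classes=3)
print(Y)
# [[0. 1. 0.]
#  [0. 0. 1.]]
```

For a purely binary task the dummy class could be dropped, using IDs 0 and 1 with two-element vectors instead.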

Then you create the network architecture. It takes 13 floats per frame as input and produces a vector as output. In the middle you might have one fully connected layer followed by one LSTM layer. Do not use layers that are too big; start with small ones.

Then you feed this dataset into model.fit and it trains the model for you. You can estimate model quality on a held-out set after training.
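The architecture and training step might look roughly like this. It is a sketch under several assumptions: it uses the Keras API bundled with TensorFlow 2 on Python 3 (not Theano or Python 2.7), the layer sizes, epoch count and batch size are arbitrary small values, and random numbers stand in for the real MFCC data. It trains on whole-word labels (return_sequences=False, the LSTM default).

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_words, n_frames, n_mfcc = 20, 56, 13

# Random stand-in data: 10 examples of each of two words.
rng = np.random.default_rng(0)
X = rng.normal(size=(n_words, n_frames, n_mfcc)).astype("float32")
y = np.array([0] * 10 + [1] * 10)

# Shuffle first: validation_split takes the last samples without shuffling.
order = rng.permutation(n_words)
X, y = X[order], y[order]
Y = keras.utils.to_categorical(y, num_classes=2)

model = keras.Sequential([
    keras.Input(shape=(n_frames, n_mfcc)),
    layers.Dense(16, activation="relu"),    # fully connected, applied to each frame
    layers.LSTM(16),                        # returns only the last output
    layers.Dense(2, activation="softmax"),  # one probability per word
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Hold out 20% of the examples to estimate quality after training.
history = model.fit(X, Y, epochs=2, batch_size=4,
                    validation_split=0.2, verbose=0)

probs = model.predict(X[:3], verbose=0)
print(probs.shape)  # (3, 2)
```

For the per-frame variant, the LSTM would be created with return_sequences=True and the labels would need one entry per frame instead of one per word.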

You will have a problem with convergence since you have just 20 examples. You need far more examples, preferably thousands, to train an LSTM; with this little data you will only be able to use very small models.
