Interspeech 2017 ComParE

Written by Hui-Ting Hong

Mar. 30, 2017

I started working on this challenge around Feb. 2017 with other members of our lab. We planned to apply several traditional procedures for detecting audio signals with specific characteristics, and to try other distinctive approaches to see whether they would give better results. In this article, I will give a brief overview of different encoding approaches, including BOW and GMM along with the Fisher Vector. I will also go into some detail on strength modeling, and finally show the results we obtained in this challenge.


Commonly applied encoding approaches

1. BOW (Bag of Words)

Bag of words is an approach that reduces the dimensionality of features extracted from raw data. The idea can be divided into two parts: first decide the criteria of categorization, then generate a histogram based on those criteria:


Suppose we have already processed our raw data (audio signal) and obtained NxD-dimensional features.

Decide the criteria of cluster categorization

This is often described as building a vocabulary codebook, which is later used to count the occurrences of each word.
The idea is to use the k-means method to generate the k centroids of the k clusters. If we have multiple feature matrices, Fi, each of dimension NxD, we can randomly sample an equal number of rows from each Fi and generate the cluster centroids based on the information from all of them. This way, the centroid of each cluster is more valid across the different features, which makes the later categorization result (the histogram) more representative and convincing.
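The codebook-building step above can be sketched as follows; the feature shapes, number of recordings, and choice of k are illustrative, not values from our actual experiments:

```python
# Build a BOW codebook: sample an equal number of rows from each
# N x D feature matrix, stack them, and cluster with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = [rng.normal(size=(100, 13)) for _ in range(5)]  # five N x D feature matrices

# Equal sampling, so every recording contributes equally to the codebook.
sampled = np.vstack([f[rng.choice(len(f), 50, replace=False)] for f in features])

k = 8
codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sampled)
print(codebook.cluster_centers_.shape)  # (8, 13): one D-dim centroid per cluster
```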

Generate histogram

After obtaining the centroid of each cluster, we can assign our features to the clusters. Each feature matrix, Fi, produces a categorization result showing how many vectors belong to each cluster. (Since the dimension of Fi is NxD, Fi contains N vectors of length D.) This 'voting' result can be seen as a histogram whose x-axis represents the k clusters and whose y-axis represents the number of vectors belonging to each cluster.
The generated histogram is then used to describe the original features, reducing the dimension from NxD to 1xK.
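A minimal sketch of the histogram step, assuming a codebook has already been fit (here on random data for self-containedness); the length normalization at the end is a common practical touch, not something the text prescribes:

```python
# Encode one N x D feature matrix as a k-bin histogram: assign each
# of the N frame vectors to its nearest centroid and count per cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
k = 8
codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(rng.normal(size=(500, 13)))

feature = rng.normal(size=(120, 13))            # one recording: N=120 frames, D=13
assignments = codebook.predict(feature)         # nearest-centroid index per frame
hist = np.bincount(assignments, minlength=k).astype(float)
hist /= hist.sum()                              # normalize so recordings of different N compare
print(hist.shape)  # (8,): N x D reduced to a single 1 x k vector
```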

In short, the BOW approach helps us encode the features into a more representative yet low-dimensional result. Here's the flow:

2. GMM and Fisher Vector

Another, more widely used encoding approach is the Gaussian Mixture Model (GMM) followed by the Fisher Vector. The GMM can be seen as an extended version of k-means clustering: it represents a cluster not by a single centroid but by a Gaussian distribution with a corresponding weight. With our features, Fi, of dimension NxD, fitting a GMM with K components (as we set the number of clusters to K) gives K Gaussian distributions over the D-dimensional feature space. Since describing a Gaussian requires specifying its mean and variance, the fitted model consists of a mean matrix MeanDxK and a covariance matrix CovarianceDxK. Additionally, to describe the relationship among the K Gaussians, we have the posterior values and a prior vector PriorKx1 (which can be considered the weights of the K Gaussians).


After building our GMM, we can encode our features, Fi, based on MeanDxK, CovarianceDxK, the posteriors, and PriorKx1. The encoded result is what we call the Fisher Vector (a special case of the general Fisher kernel).

When encoding a feature matrix FiNxD, each of the N frame vectors is softly assigned to the K Gaussians through its posterior probabilities. These posteriors then weight how much each frame adjusts each Gaussian's mean and variance to fit the data better, giving 2 values per Gaussian per feature dimension. Collecting them over all K Gaussians and all D dimensions yields the Fisher Vector of dimension DxKx2.

The Fisher Vector is then used to describe the original features. The original dimension is NxD while the encoded result is DxKx2; since K is usually smaller than N, the encoding reduces the dimension. Moreover, since a Gaussian distribution is a richer description of a cluster than a single centroid, using it makes the encoding more convincing. That is why the Fisher Vector is the more popular choice for encoding features. Here's the flow:
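The encoding above can be sketched with the standard diagonal-covariance Fisher Vector formulation (posterior-weighted gradients of the log-likelihood with respect to each component's mean and variance); shapes and parameter values are illustrative:

```python
# Minimal Fisher Vector sketch for a diagonal-covariance GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Encode an N x D feature matrix as a 2*K*D Fisher Vector."""
    N, D = X.shape
    gamma = gmm.predict_proba(X)                    # N x K posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]  # N x K x D
    g_mu = (gamma[:, :, None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    return np.hstack([g_mu.ravel(), g_var.ravel()])  # dimension 2*K*D

rng = np.random.default_rng(0)
K, D = 4, 13
gmm = GaussianMixture(n_components=K, covariance_type='diag', random_state=0)
gmm.fit(rng.normal(size=(500, D)))
fv = fisher_vector(rng.normal(size=(120, D)), gmm)
print(fv.shape)  # (2*K*D,) = (104,)
```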

Strength Modeling

Strength modeling is a fusion method proposed by J. Han, Z. Zhang, N. Cummins, F. Ringeval, and B. Schuller, which considers fusion not only at the feature level but also at the decision level, and shows great improvement from the fusion perspective.

1. Feature Level
Before fusion, we use the sub-dictionary concept to generate three different kinds of encoded features: a label-based approach, an unsupervised approach, and a supervised approach.

In the label-based approach, the data carry two kinds of labels: cold and no-cold. We first fit a cold-specific GMM and a no-cold-specific GMM, and then use the Fisher Vector described above to generate the new encoded features.
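A minimal sketch of the label-based sub-dictionary idea, on synthetic data; for brevity the encoder here is a simple posterior-based summary rather than the full Fisher Vector:

```python
# Fit one GMM per class label (0 = no-cold, 1 = cold) and use both
# class-specific models to encode every recording.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))               # frame-level features pooled per class
y = rng.integers(0, 2, size=200)             # synthetic cold / no-cold labels

gmms = {label: GaussianMixture(n_components=4, covariance_type='diag',
                               random_state=0).fit(X[y == label])
        for label in (0, 1)}

# Encode one recording against both class-specific GMMs and concatenate.
recording = rng.normal(size=(50, 13))
encoded = np.hstack([gmms[label].predict_proba(recording).mean(0) for label in (0, 1)])
print(encoded.shape)  # (8,): 4 mean posteriors per class-specific GMM
```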

Here's the brief flow:

In the unsupervised approach, we apply the k-means method to split the data into two clusters, giving the data another latent grouping (such as male or female speakers). After splitting into the two clusters, we apply the same procedure: fit a GMM for each cluster's data and then use the Fisher Vector to encode the features.
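The unsupervised split can be sketched in a few lines; the data and cluster count are illustrative, and the latent grouping the two clusters capture (e.g. gender) is whatever structure k-means happens to find:

```python
# Split the data into two k-means clusters, then fit one GMM
# per discovered cluster as its sub-dictionary.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))

groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
cluster_gmms = [GaussianMixture(n_components=4, covariance_type='diag',
                                random_state=0).fit(X[groups == g])
                for g in (0, 1)]
print(len(cluster_gmms))  # one GMM per discovered cluster
```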

Here's the brief flow:

The supervised approach is a bit more complicated. The main idea is to find the data whose connection between label and characteristics is very weak, i.e. the hard-to-classify samples. To find them, we first train a classifier such as an SVM on the training data, and then feed the same training data back in as test data and inspect the predictions. The UAR will of course be very high, since the model has already seen these samples during training; nevertheless, some samples will still get a wrong prediction, and those are exactly what we are looking for! After identifying them, we fit a GMM for each group according to the prediction result, and finally apply the Fisher Vector to obtain another set of encoded features.
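The "re-predict the training set" trick above can be sketched as follows, on synthetic data; a weak linear SVM is used here so that both the correctly and wrongly predicted groups are non-empty:

```python
# Train an SVM, predict back on the training data, split by
# prediction correctness, and fit one GMM per group.
import numpy as np
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = rng.integers(0, 2, size=200)

clf = SVC(kernel='linear').fit(X, y)
pred = clf.predict(X)                      # re-predict the training set
correct = pred == y                        # the misclassified rows are the "hard" data

group_gmms = {flag: GaussianMixture(n_components=2, covariance_type='diag',
                                    random_state=0).fit(X[correct == flag])
              for flag in (True, False)}
print(sorted(group_gmms))  # GMMs for the correctly and wrongly predicted groups
```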

Here's the brief flow:

2. Decision Level
In addition to the feature-level encodings mentioned above, we also need to obtain decision-level outcomes.
Since different classifiers provide different perspectives on the problem, we choose four well-known classifiers: SVM, AdaBoost, random forest, and naive Bayes. Their prediction results are taken as the decision-level outcomes, which will be fused later.

3. Fusion Process
In the fusion process, we combine the feature-level and decision-level results as the training input of our classifier, hoping it will learn the different perspectives and characteristics of the data and thus predict better.
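The decision level and the fusion step together can be sketched like this; cross-validated predictions stand in for held-out decisions so the final classifier does not see leaked training outputs, and all shapes and models are illustrative:

```python
# Collect decision-level outputs from four classifiers, stack them
# next to the feature-level encodings, and train the final classifier.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))               # feature-level encodings (e.g. Fisher Vectors)
y = rng.integers(0, 2, size=100)

classifiers = [SVC(), AdaBoostClassifier(random_state=0),
               RandomForestClassifier(random_state=0), GaussianNB()]
decisions = np.column_stack(
    [cross_val_predict(c, X, y, cv=5) for c in classifiers])  # 100 x 4 decision matrix

fused = np.hstack([X, decisions])            # feature level + decision level
final = SVC().fit(fused, y)
print(fused.shape)  # (100, 24)
```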

Approach and Result

We combine the sub-dictionary features, the decision-level features, and the eGeMAPS functional features, and optimize each feature set with different classifiers.

The result is as follows:

We can see that with the strength model, the performance improves. Combining the sub-dictionary approach with the eGeMAPS functional features also yields quite a convincing result.


H. Kaya and A. A. Karpov, "Fusing acoustic feature representations for computational paralinguistics tasks," in Interspeech 2016, pp. 2046–2050, 2016.
J. Han, Z. Zhang, N. Cummins, F. Ringeval, and B. Schuller, "Strength modelling for real-world automatic continuous affect recognition from audiovisual signals," Image and Vision Computing, 2016.
F. A. Laleye, E. C. Ezin, and C. Motamed, "Speech phoneme classification by intelligent decision-level fusion," in Informatics in Control, Automation and Robotics: 12th International Conference, ICINCO 2015, Colmar, France, July 21–23, 2015, Revised Selected Papers. Springer, 2016, pp. 63–78.
N. Zhou and J. Fan, "Jointly learning visually correlated dictionaries for large-scale visual recognition applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 715–730, April 2014.
F. Scalzo, G. Bebis, M. Nicolescu, L. Loss, and A. Tavakkoli, "Feature fusion hierarchies for gender classification," in 19th International Conference on Pattern Recognition (ICPR 2008). IEEE, 2008, pp. 1–4.
M. Liu, D. Zhang, and D. Shen, "Hierarchical fusion of features and classifier decisions for Alzheimer's disease diagnosis," Human Brain Mapping, vol. 35, no. 4, pp. 1305–1319, 2014.
F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13). New York, NY, USA: ACM, 2013, pp. 835–838.

Y.-Y. Kao et al., "Automatic detection of speech under cold using discriminative autoencoders and strength modeling with multiple sub-dictionary generation," in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2018, pp. 416–420.

Last Update Apr. 14th 2020

© 2020 by Hui-Ting(Winnie) Hong

All rights reserved.