How to properly remove redundant components for Scikit-Learn's DPGMM?
I am using scikit-learn to implement the Dirichlet Process Gaussian Mixture Model:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/dpgmm.py
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html
That is, sklearn.mixture.BayesianGaussianMixture() with weight_concentration_prior_type='dirichlet_process' (the default).
. As opposed to k-means, where users set the number of clusters "k" a priori, DPGMM is an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters.
My DPGMM model consistently uses exactly as many clusters as n_components. As discussed in the question below, the suggested way to deal with this is to "reduce redundant components" with predict(X):
Scikit-Learn's DPGMM fitting: number of components?
However, the linked example does not actually remove redundant components or output the "correct" number of clusters in the data; it simply plots the correct number of clusters.
http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html
How do users actually remove the redundant components and output a label array without them? Is this the "official"/only way to remove redundant clusters?
Here is my code:
>>> import pandas as pd
>>> import numpy as np
>>> import random
>>> from sklearn import mixture
>>> X = pd.read_csv(....) # my matrix
>>> X.shape
(20000, 48)
>>> dpgmm3 = mixture.BayesianGaussianMixture(n_components=20, weight_concentration_prior_type='dirichlet_process', max_iter=1000, verbose=2)
>>> dpgmm3.fit(X) # Fitting the DPGMM model
>>> labels = dpgmm3.predict(X) # Generating labels after model is fitted
>>> max(labels)
19
>>> np.unique(labels) # Number of labels == n_components specified above
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])
#Trying without specifying n_components
>>> dpgmm3_1 = mixture.BayesianGaussianMixture(weight_concentration_prior_type='dirichlet_process', max_iter=1000)
>>> dpgmm3_1.fit(X)
>>> labels_1 = dpgmm3_1.predict(X)
>>> labels_1
array([0, 0, 0, ..., 0, 0, 0]) #All were classified under the same label
#Trying with n_components = 7
>>> dpgmm3_2 = mixture.BayesianGaussianMixture(n_components=7, weight_concentration_prior_type='dirichlet_process', max_iter=1000)
>>> dpgmm3_2.fit(X)
>>> labels_2 = dpgmm3_2.predict(X)
>>> np.unique(labels_2)
array([0, 1, 2, 3, 4, 5, 6]) #number of labels == n_components
Solution 1:[1]
There is no automated method to do so yet, but you can have a look at the estimated weights_ attribute and prune components that have a small value (e.g. below 0.01).
Edit: to count the number of components effectively used by the model, you can do:
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

model = BayesianGaussianMixture(n_components=30).fit(X)
print("active components: %d" % np.sum(model.weights_ > 0.01))
This should print a number of active components lower than the provided upper bound (30 in this example).
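To go one step further and actually produce a label array without the redundant components, one option is to restrict the posterior responsibilities to the "active" components and re-assign each sample to its most probable active one. This is a minimal sketch, not an official scikit-learn API; the prune_labels helper and the 0.01 cutoff are my own choices:

import numpy as np

def prune_labels(model, X, weight_threshold=0.01):
    # Indices of the "active" components, i.e. those whose estimated
    # mixture weight exceeds the (assumed) cutoff.
    active = np.flatnonzero(model.weights_ > weight_threshold)
    # Posterior responsibilities restricted to the active components.
    proba = model.predict_proba(X)[:, active]
    # Assign each sample to its most probable active component;
    # labels are re-indexed to the compact range 0..len(active)-1.
    return np.argmax(proba, axis=1)

labels = prune_labels(model, X)
print("clusters actually used: %d" % np.unique(labels).size)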
Edit 2: the n_components parameter specifies the maximum number of components the model can use. The effective number of components actually used by the model can be retrieved by introspecting the weights_ attribute at the end of the fit. It will mostly depend on the structure of the data and on the value of weight_concentration_prior (especially if the number of samples is small).
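As a hedged illustration of that last point (the prior values below are arbitrary, and the resulting counts depend entirely on your data), you can compare how different weight_concentration_prior values change the number of active components; smaller concentration priors typically concentrate the weights on fewer components:

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Illustrative comparison only: refit the model with a small and a
# large concentration prior and count the active components.
for prior in (0.01, 1000.0):
    m = BayesianGaussianMixture(
        n_components=30,
        weight_concentration_prior_type='dirichlet_process',
        weight_concentration_prior=prior,
        max_iter=1000,
    ).fit(X)
    print("prior=%g -> %d active components"
          % (prior, np.sum(m.weights_ > 0.01)))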
Solution 2:[2]
Check out the repulsive Gaussian mixtures described in [1]. They fit a mixture whose Gaussians overlap less and are therefore typically less redundant.
I didn't find source code for it (yet).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | eavsteen |