Negative BIC values for GaussianMixture in scikit-learn (sklearn)
In scikit-learn, the GaussianMixture object has a method bic(X) that implements the Bayesian Information Criterion, which can be used to choose the number of components that best fits the data. This is an example of usage:
from sklearn import mixture

bics = []
for n in range(1, 11):  # n_components must be at least 1, so the range starts at 1
    gmm = mixture.GaussianMixture(n_components=n, max_iter=1000,
                                  covariance_type='diag', n_init=50)
    gmm.fit(data)       # data: the 600k-row, 7-column dataset
    bics.append(gmm.bic(data))
I am fitting a GMM on a dataset with 600k rows and 7 columns. The BIC values are always negative, e.g. [-2000, -3000, -3300, ...].
The documentation of the method bic() says "The lower the better". In the case of negative values as in my example, is -3300 then the best value, or does "lower" refer to the lowest value in absolute terms?
Solution 1:
Generally, the aim is to minimize the BIC, so when all values are negative, the number with the largest modulus (deepest down in negative territory) indicates the preferred model; in your example, -3300 is the best of the three values.
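As a concrete illustration, here is a minimal, self-contained sketch (using synthetic, well-separated data, not the asker's dataset) of selecting the number of components by taking the minimum BIC. When all BIC values are negative, argmin simply picks the most negative one:

import numpy as np
from sklearn import mixture

# Hypothetical data: three well-separated Gaussian blobs in 2 dimensions.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2))
                  for c in (0.0, 5.0, 10.0)])

n_range = range(1, 8)
bics = [mixture.GaussianMixture(n_components=n, covariance_type='diag',
                                n_init=5, random_state=0).fit(data).bic(data)
        for n in n_range]

# argmin picks the most negative BIC when all values are below zero.
best_n = n_range[int(np.argmin(bics))]
print(best_n)  # typically 3 for data this well separated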
Look at the source code: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/mixture/_gaussian_mixture.py#L727
def bic(self, X):
    """Bayesian information criterion for the current model on the input X.

    Parameters
    ----------
    X : array of shape (n_samples, n_dimensions)

    Returns
    -------
    bic : float
        The lower the better.
    """
    return (-2 * self.score(X) * X.shape[0] +
            self._n_parameters() * np.log(X.shape[0]))
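To make the formula concrete, the following minimal sketch (on synthetic data, not taken from the question) reproduces bic(X) by hand. Note that score(X) returns the mean log-likelihood per sample, hence the multiplication by X.shape[0]; for continuous densities the log-likelihood can be positive (density values above 1), which is exactly what drives the BIC negative. The call to the private _n_parameters() just mirrors the library code quoted above.

import numpy as np
from sklearn import mixture

rng = np.random.default_rng(0)
# Tightly concentrated data: the fitted density exceeds 1 at the samples,
# so the total log-likelihood is positive and -2 * logL is negative.
X = rng.normal(scale=0.01, size=(1000, 1))

gmm = mixture.GaussianMixture(n_components=1, random_state=0).fit(X)

log_likelihood = gmm.score(X) * X.shape[0]   # total log-likelihood
k = gmm._n_parameters()                      # number of free parameters
bic_manual = -2 * log_likelihood + k * np.log(X.shape[0])

print(np.isclose(bic_manual, gmm.bic(X)))    # True
print(bic_manual)                            # a large negative number here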
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow