Negative BIC values for GaussianMixture in scikit-learn (sklearn)

In scikit-learn, the GaussianMixture class has a method bic(X) that implements the Bayesian Information Criterion, which can be used to choose the number of components that best fits the data. This is an example of usage:

    from sklearn import mixture

    # n_components must be at least 1, so the range starts at 1
    for n in range(1, 10):
        gmm = mixture.GaussianMixture(n_components=n, max_iter=1000,
                                      covariance_type='diag', n_init=50)
        gmm.fit(data)
        bic_n = gmm.bic(data)

I am fitting a GMM on a dataset with 600k rows and 7 columns. The BIC values are always negative, e.g. [-2000, -3000, -3300, ...].
The documentation of bic() says "The lower the better". With negative values as in my example, is -3300 then the best value, or does "lowest" refer to the smallest absolute value?



Solution 1:

Generally, the aim is to minimize the BIC, so if you are in negative territory, the most negative number (the one with the largest absolute value, deepest down in negative territory) indicates the preferred model.
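As a minimal sketch of this (assuming `data` is the 600k x 7 array from your question; a small random stand-in is generated here so the snippet runs on its own, and `n_init` is lowered for speed), collect the BIC of each candidate and keep the model with the smallest value, negative or not:

    import numpy as np
    from sklearn import mixture

    # Stand-in for the `data` array from the question, so the sketch is
    # self-contained; replace with your own 600k x 7 array.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 7))

    candidates = range(1, 10)
    bics = []
    for n in candidates:
        gmm = mixture.GaussianMixture(n_components=n, max_iter=1000,
                                      covariance_type='diag', n_init=5)
        gmm.fit(data)
        bics.append(gmm.bic(data))

    # np.argmin returns the index of the smallest (most negative) entry,
    # which corresponds to the preferred model.
    best_n = candidates[int(np.argmin(bics))]
    print(best_n, min(bics))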

Look at the source code: https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/mixture/_gaussian_mixture.py#L727

    def bic(self, X):
        """Bayesian information criterion for the current model on the input X.
        Parameters
        ----------
        X : array of shape (n_samples, n_dimensions)
        Returns
        -------
        bic : float
            The lower the better.
        """
        return (-2 * self.score(X) * X.shape[0] +
                self._n_parameters() * np.log(X.shape[0]))
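The sign follows directly from this formula: self.score(X) is the mean per-sample log-likelihood, so whenever the fitted density exceeds 1 on average (a positive log-likelihood, common for tightly concentrated or low-variance data), the first term is negative and can outweigh the parameter penalty, pushing the BIC below zero. A small illustration with made-up low-variance data:

    import numpy as np
    from sklearn import mixture

    # Tightly concentrated data: the fitted densities are well above 1,
    # so the mean log-likelihood returned by score() is positive.
    rng = np.random.default_rng(0)
    data = rng.normal(scale=0.01, size=(500, 2))

    gmm = mixture.GaussianMixture(n_components=1, covariance_type='diag')
    gmm.fit(data)
    print(gmm.score(data))  # positive mean log-likelihood
    print(gmm.bic(data))    # -2 * score * n dominates the penalty: negative BIC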

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
