
RMixmod

The Rmixmod package contains:

  • a set of functions providing all Mixmod features in the R environment;
  • some graphical functions.
    Note: computations are performed in the underlying C++ library (mixmodLib) for better performance.

Download

License:
Rmixmod is available under the GNU GPL.

Rmixmod 2.1.1 - Download the Rmixmod package from CRAN (2016/07)
Note: this release is based on mixmodLib 3.2.2.

News:

  • Initialisation with labels (PARTITION) or with a parameter (PARAMETER)
  • Reduced memory leaks (mixmodLib)
  • Bug fixes

Instructions

The package can be installed directly from R:

# install the package
> install.packages("Rmixmod")
# load it
> library(Rmixmod)
# show the documentation
> help(Rmixmod)

Examples

Thanks to Rémi Lebret
Example 1: Unsupervised classification - Geyser dataset (continuous variables)

It is a data frame containing 272 observations of the Old Faithful geyser in Yellowstone National Park, taken from the Modern Applied Statistics with S library (Venables and Ripley, 2002).

Each observation consists of two measurements: the duration (in minutes) of the eruption and the waiting time (in minutes) until the next eruption. In this example we ignore the known partition and estimate the best Gaussian mixture model for the data set. The following code does so by running a cluster analysis over a range of cluster numbers (from 2 to 8 clusters).

> data(geyser)
> out_geyser <- mixmodCluster(geyser, nbCluster = 2:8)
> summary(out_geyser)
**************************************************************
* Number of samples    =  272
* Problem dimension    =  2
**************************************************************
*       Number of cluster =  3
*              Model Type =  Gaussian_pk_Lk_C
*               Criterion =  BIC(2321.9464)
*              Parameters =  list by cluster
*                  Cluster  1 :
                        Proportion =  0.4044
                             Means =  4.5139 80.9044
                         Variances = |     0.0677     0.4420 |
                                     |     0.4420    29.3023 |
*                  Cluster  2 :
                        Proportion =  0.2397
                             Means =  3.9112 78.3832
                         Variances = |     0.1055     0.6889 |
                                     |     0.6889    45.6731 |
*                  Cluster  3 :
                        Proportion =  0.3558
                             Means =  2.0363 54.4793
                         Variances = |     0.0742     0.4847 |
                                     |     0.4847    32.1360 |
*          Log-likelihood =  -1124.5355
**************************************************************

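The reported BIC value can be checked by hand against the usual formula BIC = -2 log L + ν log n. For the retained Gaussian_pk_Lk_C model with 3 clusters in dimension 2, the number of free parameters is ν = 13: 2 free proportions, 6 mean coordinates, 3 cluster volumes, and 2 parameters for the common shape/orientation matrix. A small base-R check (not part of the Rmixmod output above):

```r
logL <- -1124.5355        # log-likelihood reported in the summary
n    <- 272               # number of samples
nu   <- 2 + 6 + 3 + 2     # free parameters: proportions + means + volumes + common shape
bic  <- -2 * logL + nu * log(n)
round(bic, 4)             # 2321.9464, matching the summary
```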

The plot() function has been redefined to combine, on the same graph:

  • on the diagonal, a 1D representation with densities and data;
  • in the lower triangle, a 2D representation with isodensities, data points and the partition.
> plot(out_geyser)


Example 2: Unsupervised classification - Birds of different subspecies (categorical variables)

This data set (Bretagnolle 2007) provides details on the morphology of birds (puffins). Each bird is described by five qualitative variables: one variable for the gender and four variables giving a morphological description of the bird. There are 69 puffins divided into two subclasses: lherminieri and subalaris (34 and 35 individuals respectively). Here we run a cluster analysis of the birds with 2 clusters.

> data(birds)
> out_birds <- mixmodCluster(birds, nbCluster=2)
> summary(out_birds)
**************************************************************
* Number of samples    =  69
* Problem dimension    =  5
**************************************************************
*       Number of cluster =  2
*              Model Type =  Binary_pk_Ekjh
*               Criterion =  BIC(518.9159)
*              Parameters =  list by cluster
*                  Cluster  1 :
                        Proportion =  0.3456
                            Center =  2.0000 2.0000 2.0000 2.0000 1.0000
                           Scatter = |     0.4280     0.4280 |
                                     |     0.1203     0.1463     0.0153     0.0107 |
                                     |     0.0509     0.0751     0.0080     0.0080     0.0080 |
                                     |     0.3641     0.5495     0.1288     0.0485     0.0080 |
                                     |     0.1074     0.0940     0.0134 |
*                  Cluster  2 :
                        Proportion =  0.6544
                            Center =  1.0000 3.0000 1.0000 1.0000 1.0000
                           Scatter = |     0.4937     0.4937 |
                                     |     0.0761     0.0063     0.1741     0.0917 |
                                     |     0.1521     0.1391     0.0043     0.0043     0.0043 |
                                     |     0.0390     0.0045     0.0043     0.0259     0.0043 |
                                     |     0.0577     0.0288     0.0289 |
*          Log-likelihood =  -198.0634
**************************************************************
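The same BIC identity BIC = -2 log L + ν log n can be verified here. Reading the modality counts off the scatter rows above (2, 4, 5, 5 and 3 modalities for the five variables), the latent class model has Σ_j (m_j − 1) = 14 free multinomial parameters per cluster, plus K − 1 = 1 free proportion, giving ν = 1 + 2 × 14 = 29. A base-R check (not part of the Rmixmod output):

```r
logL <- -198.0634          # log-likelihood reported in the summary
n    <- 69                 # number of birds
nu   <- 1 + 2 * 14         # free proportion + per-cluster multinomial parameters
bic  <- -2 * logL + nu * log(n)
round(bic, 4)              # 518.9159, matching the summary
```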

The plot() function has been redefined in the qualitative case. A multiple correspondence analysis is performed to get a 2-dimensional representation of the data set, and a larger symbol is drawn where several observations coincide.
The barplot() function has also been redefined. For each qualitative variable, it shows:

  • A barplot with the frequencies of the modalities;
  • For each cluster a barplot with the probabilities for each modality to be in that cluster.
> barplot(out_birds)
> plot(out_birds)

Example 3: Supervised classification

The following example concerns quantitative data, but discriminant analysis in Rmixmod works equally well with qualitative data sets.

The outputs and graphs of discriminant analysis with Rmixmod are illustrated through prediction of a company's ability to cover its financial obligations (Du Jardin and Séverin 2010; Lourme and Biernacki 2011). It is an important question that requires a strong knowledge of the mechanisms leading to bankruptcy. The first sample (year 2002) is made up of 216 healthy firms and 212 bankrupt firms. The second sample (year 2003) is made up of 241 healthy firms and 220 bankrupt firms. Four financial ratios expected to provide meaningful information about a company's health are considered: EBITDA/Total Assets, Value Added/Total Sales, Quick Ratio, and Accounts Payable/Total Sales.

First step: Learning

After splitting the data into years 2002 and 2003, we learn the discriminant rule on year 2002, inspect the best result, and then call the plot() function to get a visualisation of it.

> data("finance")
> ratios2002 <- finance[finance["Year"] == 2002, 3:6]
> health2002 <- finance[finance["Year"] == 2002, 2]
> ratios2003 <- finance[finance["Year"] == 2003, 3:6]
> health2003 <- finance[finance["Year"] == 2003, 2]
> learn <- mixmodLearn(ratios2002, health2002)
> learn["bestResult"]

* nbCluster   =  2
* model name  =  Gaussian_pk_Lk_C
* criterion   =  CV(0.8178)
* likelihood  =  444.9579
****************************************
*** Cluster 1
* proportion =  0.4953
* means      =  -0.0386 0.2069 0.6089 0.1774
* variances  = |     0.0226     0.0064     0.0186    -0.0023 |
              |     0.0064     0.0166     0.0076    -0.0006 |
              |     0.0186     0.0076     0.2728    -0.0095 |
              |    -0.0023    -0.0006    -0.0095     0.0079 |
*** Cluster 2
* proportion =  0.5047
* means      =  0.1662 0.2749 1.0661 0.1079
* variances  = |     0.0172     0.0049     0.0142    -0.0017 |
              |     0.0049     0.0126     0.0058    -0.0005 |
              |     0.0142     0.0058     0.2076    -0.0073 |
              |    -0.0017    -0.0005    -0.0073     0.0060 |
****************************************
* Classification with CV:
          | Cluster 1 | Cluster 2 |
----------- ----------- -----------
Cluster 1 |       167 |        33 |
Cluster 2 |        45 |       183 |
----------- ----------- -----------
* Error rate with CV =  18.22 %

* Classification with MAP:
          | Cluster 1 | Cluster 2 |
----------- ----------- -----------
Cluster 1 |       212 |         0 |
Cluster 2 |         0 |       216 |
----------- ----------- -----------
* Error rate with MAP =  0.00 %
****************************************
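As a sanity check, the 18.22 % CV error rate follows from the cross-validation confusion matrix above (off-diagonal counts over the total), and the CV criterion value 0.8178 reported earlier is simply the complementary cross-validated accuracy. In base R:

```r
# CV confusion matrix printed above
cv <- matrix(c(167,  33,
                45, 183), nrow = 2, byrow = TRUE)
error_rate <- (cv[1, 2] + cv[2, 1]) / sum(cv)
round(100 * error_rate, 2)   # 18.22, as reported
round(1 - error_rate, 4)     # 0.8178, the CV criterion value
```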
> plot(learn)


Second step: Prediction

We perform predictions on year 2003, then get a summary (note that […] indicates that the output has been truncated) and finally compare the predicted 2003 health status with the true one (75.7% correct classification).

> prediction <- mixmodPredict(data = ratios2003, classificationRule = learn["bestResult"])
> summary(prediction)
**************************************************************
* partition     = 2 1 1 1 1 1 [...] 1 2
* probabilities = | 0.4966 0.5034 |
*                 | 0.8125 0.1875 |
*                 | 0.8851 0.1149 |
*                          [...]
*                 | 0.5626 0.4374 |
*                 | 0.0308 0.9692 |
**************************************************************
> paste("accuracy= ",mean(as.integer(health2003) == prediction["partition"])*100,"%",sep="")
[1] "accuracy= 75.704989154013%"
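Since the 2003 sample contains 241 healthy and 220 bankrupt firms (461 in total), the printed accuracy corresponds to 349 correctly classified firms, a count implied by (not printed in) the output above:

```r
n_total   <- 241 + 220     # firms observed in 2003
n_correct <- 349           # implied by the accuracy printed above
100 * n_correct / n_total  # 75.70499..., matching the printed accuracy
```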