NAME
    Statistics::Gap - Perl extension for the "Gap Statistic"

SYNOPSIS
      use Statistics::Gap;
      &gap("GapPrefix", "InputFile", "squared", "agglo", 5, 100, "unif", 90);

      OR

      use Statistics::Gap;
      &gap("GapPrefix", "InputFile", "squared", "agglo", 5, 100, "prop", 90);


      Input file is expected in the "dense" format -
      Sample Input file:

      6 5
        1       1       0       0       1
        1       0       0       0       0
        1       1       0       0       1
        1       1       0       0       1
        1       0       0       0       1
        1       1       0       0       1 	  	


DESCRIPTION
        Given a dataset how does one automatically find the optimal number 
        of clusters that the dataset should be grouped into? - is one of the 
        prevailing problems. Statisticians Robert Tibshirani, Guenther Walther 
        and Trevor Hastie  propose a solution for this problem in a Techinal 
        Report named - "Estimating the number of clusters in a dataset via 
        the Gap Statistic". This perl module implements the approach proposed 
        in the above paper. 

	If one tries to cluster a dataset (i.e. numerous observations described 
	in terms of a feature space) into n groups/clusters and if we plot the
	graph of Within Cluster (Dis)Similarity along Y-axis and Number of clusters
	along X-axis then this graph generally takes a form of a elbow/knee depending
	upon the measure on the Y-axis. The Gap Statistic seeks to locate this
	elbow/knee because the value on the X-axis at this elbow is the optimal 
	number of clusters for the data.	

        NOTE: 
        Gap Statistic uses reference distribution in the process of estimating
        the number of clusters. The appropriate methodology for generation of this 
        reference distribution is dependent on the data to be clustered. 
        This module was implemented for data with following characteristics:
        1. highly sparse - very few features occur in any given observation.
        2. high multivariate dimensionality (i.e. large feature space)
        3. binary feature frequency - feature either occurs or does not occur 
           in an observation but rarely occurs multiple times in the same observation.

EXPORT
     "gap" function by default.

INPUT
  Prefix
        The string that should be used to as a prefix while naming the 
        intermediate files and the .png files (graph files).

  InputFile
	The input dataset is expected in "dense" matrix format.
	The input dense matrix is expected in a plain text file where the first
	line in the file gives the dimensions of the dataset and then the 
	dataset in a matrix format should follow. The contexts / observations 
	should be along the rows and the features should be along the column.

	eg:
      	6 5
        1       1       0       0       1
        1       0       0       0       0
        1       1       0       0       1
        1       1       0       0       1
        1       0       0       0       1
        1       1       0       0       1 	

	The first line (6 5) gives the number of rows (observations) and the 
	number of columns (features) present in the following matrix.
	Following each line records the frequency of occurrence of the feature
	at the column in the given observation. Thus features1 (1st column) occurs
	once in the observation1 and infact once in all the other observations too 
	while the feature3 does not occur in observation1.

   DistanceMeasure
        The Distance Measure that should be used.
        Currrently this module supports the following distance measure:
        1. Squared Euclidean (string that should be used as an argument: "squared")
        2. Manhattan (string that should be used as an argument: "manhattan")
        3. Euclidean (string that should be used as an argument: "euclidean")

   ClusteringAlgorithm
        The Clustering Measures that can be used are:
        1. rb - Repeated Bisections [Default]
        2. rbr - Repeated Bisections for by k-way refinement
        3. direct - Direct k-way clustering
        4. agglo  - Agglomerative clustering
        5. graph  - Graph partitioning-based clustering
        6. bagglo - Partitional biased Agglomerative clustering

   K value
        This is an approximate upper bound for the number of clusters that may be
        present in the dataset. Thus for a dataset that you expect to be seperated
        into 3 clusters this value should be set some integer value greater than 3.

   B value
        Specifies the number of time the reference distribution should be generated.

   ReferenceGenerationMethod
        1. Uniform - While generating the reference distribution, all the features
        in the feature set have equal probability of being selected for the observation
        under consideration.
    
        2. Proportional - Each feature is assigned a probability of being selected
        depending upon its frequency of occurrence in the observed data. Thus feature
        distribution is taken into consideration while selecting the features for the 
        reference distribution generation.

   Percentage Confidence Interval
	This parameter specifies the percentage confidence to be reported in the log file.
	Since Statistics::Gap uses parametric bootstrap method for reference distribution 
	generation, it is critical to understand the interval around the sample mean that
	could contain the population ("true") mean and with what certainty.

OUTPUT		
        1. The PREFIX.out file contains a single integer number which is the Gap Statistic's 
	estimate of number of clusters present in the input dataset.
        2. The PREFIX.log file contains the log of various values at different K values.
	The first table in the file gives values like Gap(k), log(W(k)) etc. for every K value
	experimented with.	
        3. The PREFIX.fig*.png files are the graphical representations of different values which
	help locate the knee/elbow. 
	
PRE-REQUISITES
        1. This module uses suite of C programs called CLUTO for clustering purposes. 
        Thus CLUTO needs to be installed for this module to be functional.
        CLUTO can be downloaded from http://www-users.cs.umn.edu/~karypis/cluto/

        2. Following Perl Modules
            1. GD   (http://search.cpan.org/~lds/GD-2.19/GD.pm)
            2. GD::Text     (http://search.cpan.org/~mverb/GDTextUtil-0.86/Text.pm)
            3. GD::Graph::lines     (http://search.cpan.org/~mverb/GDGraph-1.43/)
            4. GD::Graph::colour    (http://search.cpan.org/~mverb/GDGraph-1.43/Graph/colour.pm)

SEE ALSO
        http://citeseer.ist.psu.edu/tibshirani00estimating.html
        http://www-users.cs.umn.edu/~karypis/cluto/

AUTHOR
        Anagha Kulkarni, University of Minnesota Duluth
        kulka020 <at> d.umn.edu
        
        Guergana Savova, Mayo Clinic
        savova.guergana <at> mayo.edu

COPYRIGHT AND LICENSE
        Copyright (C) 2005-2006, Guergana Savova and Anagha Kulkarni

        This program is free software; you can redistribute it and/or
        modify it under the terms of the GNU General Public License
        as published by the Free Software Foundation; either version 2
        of the License, or (at your option) any later version.
        This program is distributed in the hope that it will be useful,
        but WITHOUT ANY WARRANTY; without even the implied warranty of
        MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
        GNU General Public License for more details.

        You should have received a copy of the GNU General Public License
        along with this program; if not, write to the Free Software
        Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.