peakml.math
Class Statistical

java.lang.Object
  extended by peakml.math.Statistical

public abstract class Statistical
extends java.lang.Object

The class Statistical contains methods for performing basic statistical operations, such as mean, median and standard deviation. Also supported are correlation analysis, factorial calculations, ranking, analysis of variance, etc. In short, it provides a toolbox for analyzing a set of data.

A large number of the implementations have been taken from Numerical Recipes in C.


Field Summary
static int COLUMN
          Indicates that the column should be processed
static int MAXIMUM
          The maximum value in the result of the stats(double[]) method.
static int MEAN
          The mean value in the result of the stats(double[]) method.
static int MINIMUM
          The minimum value in the result of the stats(double[]) method.
static int NRSTATS
          The total number of statistics in the result of the stats(double[]) method.
static int PEARSON_CORRELATION
          The correlation value in the result of the pearsonsCorrelation(double[], double[]) method.
static int PEARSON_FISHER
          The Fisher-transformed correlation (normally distributed) in the result of the pearsonsCorrelation(double[], double[]) method.
static int PEARSON_FISHER_STDDEV
          The Fisher-transformed standard error value in the result of the pearsonsCorrelation(double[], double[]) method.
static int PEARSON_TTEST
          The Student's t-test value in the result of the pearsonsCorrelation(double[], double[]) method.
static int QUARTILE_LOWER
          The lower quartile in the result of the quartiles(double[]) method.
static int QUARTILE_MEDIAN
          The median quartile in the result of the quartiles(double[]) method.
static int QUARTILE_UPPER
          The upper quartile in the result of the quartiles(double[]) method.
static int ROW
          Indicates that the row should be processed
static int SHAPIRO_WILK_PVALUE
          The p-value in the result of the shapiroWilk(double[]) method.
static int SHAPIRO_WILK_WSTAT
          The w-statistic in the result of the shapiroWilk(double[]) method.
static int SPEARMAN_CORRELATION
          The correlation value in the result of the spearmanCorrelation(double[], double[]) method.
static int SPEARMAN_TWOSIDED_SIGNIFICANCE
          The significance level in the result of the spearmanCorrelation(double[], double[]) method.
static int STDDEV
          The standard deviation value in the result of the stats(double[]) method.
static int VARIANCE
          The variance value in the result of the stats(double[]) method.
 
Constructor Summary
Statistical()
           
 
Method Summary
static double beta(double z, double w)
           
static double betacf(double a, double b, double x)
           
static double betai(double a, double b, double x)
           
static double binomialln(int n, int k)
          A binomial is a polynomial with two terms (the sum of two monomials), often bound by parentheses or brackets when operated upon.
static double durbinWatson(double[] values)
          Simplistic implementation of the Durbin-Watson statistic.
static double durbinWatsonCERN(double[] values)
          This implementation has been taken from http://www.lbl.gov/
static double factorialln(int n)
          In mathematics, the factorial of a non-negative integer n, denoted by n!, is the product of all positive integers less than or equal to n.
static double ftest(double[] a, double[] b)
          An F-test is any statistical test in which the test statistic has an F-distribution if the null hypothesis is true.
static double gammaln(double z)
          The gamma function interpolates the factorial function.
static double geomean(double[] values)
          The geometric mean, in mathematics, is a type of mean or average, which indicates the central tendency or typical value of a set of numbers.
static int indexOfMax(double[] values)
           
static double interquartileRange(double[] values)
          In descriptive statistics, the interquartile range (IQR), also called the midspread, middle fifty and middle of the #s, is a measure of statistical dispersion, being equal to the difference between the third and first quartiles.
static double kthElement(double[] values, int k)
          Returns the k-th smallest value in the given array.
static double max(double[] values)
          Calculates the maximum value from the values in the given array.
static double mean(double[] values)
          Calculates the mean of the values in the given array.
static double median(double[] values)
          Calculates the median of the values in the given array.
static double min(double[] values)
          Calculates the minimum value from the values in the given array.
static void normalize(double[] values)
          Normalizes the values in the given vector to the maximum (i.e. the maximum value will be 1).
static void normalize(double[] values, double max)
          Normalizes the values in the given vector to the given maximum.
static double[] pearsonsCorrelation(double[] xvalues, double[] yvalues)
          In statistics, the Pearson product-moment correlation coefficient (sometimes referred to as the MCV or PMCC, and typically denoted by r) is a common measure of the correlation (linear dependence) between two variables X and Y.
static double pearsonsCorrelationS(double[] xvalues, double[] yvalues)
           
static double[] permute(double[] vector)
          Randomly permutes the values in the given vector.
static double[][] permute(double[][] data, int rowcolumn)
          Randomly permutes the values in the given matrix.
static void qsort(double[] values, double[]... arrays)
           
static double[] quartiles(double[] values)
          Quartiles partition the corresponding distribution into four quarters each containing 25% of the data.
static double[][] rank(double[][] data, boolean reversed, int rowcolumn)
          Highly optimized function for ranking the contents of a matrix either on the COLUMN or the ROW.
static double[] rank(double[] data, boolean reversed)
          Highly optimized function for ranking the contents of a vector.
static double rsd(double[] values)
           
static double[][] scale(double[][] data, int rowcolumn)
          Calculates the standard score for either each row or each column.
static double[] shapiroWilk(double[] values)
          The Shapiro-Wilk test tests the null hypothesis that a sample x1, ..., xn came from a normally distributed population.
static double[] spearmanCorrelation(double[] xvalues, double[] yvalues)
          Calculates the Spearman rank correlation and the two-sided significance levels of its deviation from zero between the two given arrays.
static double spearmanCorrelationS(double[] xvalues, double[] yvalues)
           
static double[] stats(double[] values)
          Calculates some basic statistics on the given array of data.
static double stddev(double[] values)
          The standard deviation of a sample is one measure of statistical dispersion, calculated by taking the square root of the variance.
static double sum(double[] values)
          Calculates the sum of the values in the given array.
static double ttest(double[] a, double[] b)
          A t-test is any statistical hypothesis test in which the test statistic has a Student's t distribution if the null hypothesis is true.
static double variance(double[] values)
          The variance of a sample is one measure of statistical dispersion, averaging the squared distance of its possible values from the expected value (mean).
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ROW

public static final int ROW
Indicates that the row should be processed

See Also:
Constant Field Values

COLUMN

public static final int COLUMN
Indicates that the column should be processed

See Also:
Constant Field Values

MINIMUM

public static final int MINIMUM
The minimum value in the result of the stats(double[]) method.

See Also:
Constant Field Values

MAXIMUM

public static final int MAXIMUM
The maximum value in the result of the stats(double[]) method.

See Also:
Constant Field Values

MEAN

public static final int MEAN
The mean value in the result of the stats(double[]) method.

See Also:
Constant Field Values

VARIANCE

public static final int VARIANCE
The variance value in the result of the stats(double[]) method.

See Also:
Constant Field Values

STDDEV

public static final int STDDEV
The standard deviation value in the result of the stats(double[]) method.

See Also:
Constant Field Values

NRSTATS

public static final int NRSTATS
The total number of statistics in the result of the stats(double[]) method.

See Also:
Constant Field Values

QUARTILE_LOWER

public static final int QUARTILE_LOWER
The lower quartile in the result of the quartiles(double[]) method.

See Also:
Constant Field Values

QUARTILE_MEDIAN

public static final int QUARTILE_MEDIAN
The median quartile in the result of the quartiles(double[]) method.

See Also:
Constant Field Values

QUARTILE_UPPER

public static final int QUARTILE_UPPER
The upper quartile in the result of the quartiles(double[]) method.

See Also:
Constant Field Values

SHAPIRO_WILK_WSTAT

public static final int SHAPIRO_WILK_WSTAT
The w-statistic in the result of the shapiroWilk(double[]) method.

See Also:
Constant Field Values

SHAPIRO_WILK_PVALUE

public static final int SHAPIRO_WILK_PVALUE
The p-value in the result of the shapiroWilk(double[]) method.

See Also:
Constant Field Values

PEARSON_CORRELATION

public static final int PEARSON_CORRELATION
The correlation value in the result of the pearsonsCorrelation(double[], double[]) method.

See Also:
Constant Field Values

PEARSON_FISHER

public static final int PEARSON_FISHER
The Fisher-transformed correlation (normally distributed) in the result of the pearsonsCorrelation(double[], double[]) method.

See Also:
Constant Field Values

PEARSON_FISHER_STDDEV

public static final int PEARSON_FISHER_STDDEV
The Fisher-transformed standard error value in the result of the pearsonsCorrelation(double[], double[]) method.

See Also:
Constant Field Values

PEARSON_TTEST

public static final int PEARSON_TTEST
The Student's t-test value in the result of the pearsonsCorrelation(double[], double[]) method.

See Also:
Constant Field Values

SPEARMAN_CORRELATION

public static final int SPEARMAN_CORRELATION
The correlation value in the result of the spearmanCorrelation(double[], double[]) method.

See Also:
Constant Field Values

SPEARMAN_TWOSIDED_SIGNIFICANCE

public static final int SPEARMAN_TWOSIDED_SIGNIFICANCE
The significance level in the result of the spearmanCorrelation(double[], double[]) method.

See Also:
Constant Field Values
Constructor Detail

Statistical

public Statistical()
Method Detail

min

public static double min(double[] values)
Calculates the minimum value from the values in the given array.

Parameters:
values - The array with the values.
Returns:
The minimum value.

max

public static double max(double[] values)
Calculates the maximum value from the values in the given array.

Parameters:
values - The array with the values.
Returns:
The maximum value.

indexOfMax

public static int indexOfMax(double[] values)

sum

public static double sum(double[] values)
Calculates the sum of the values in the given array.

Parameters:
values - The array with the values.
Returns:
The sum.

mean

public static double mean(double[] values)
Calculates the mean of the values in the given array.

Parameters:
values - The array with the values.
Returns:
The mean.

geomean

public static double geomean(double[] values)
                      throws java.lang.IllegalArgumentException
The geometric mean, in mathematics, is a type of mean or average, which indicates the central tendency or typical value of a set of numbers. The geometric mean can be understood in terms of geometry. The geometric mean of two numbers, a and b, is simply the side length of the square whose area is equal to that of a rectangle with side lengths a and b. That is, what is n such that n^2 = a × b? Similarly, the geometric mean of three numbers, a, b, and c, is the side length of a cube whose volume is the same as that of a rectangular prism with side lengths equal to the three given numbers.

The geometric mean only applies to positive numbers.

Parameters:
values - The array with the values.
Returns:
The geometric mean of the values in the array.
Throws:
java.lang.IllegalArgumentException - Thrown when the array has 0 elements or the array contains one or more negative values.
See Also:
http://en.wikipedia.org/wiki/Geometric_mean
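
The definition above can be illustrated with a minimal standalone sketch. It is not the library's implementation: the class name GeoMeanSketch is hypothetical, and computing the result through the mean of the logarithms (rather than a direct product) is an assumption chosen to avoid overflow on long arrays.

```java
final class GeoMeanSketch {
    /** Geometric mean computed as exp(mean(log(v))); hypothetical stand-in, not the library code. */
    static double geomean(double[] values) {
        if (values.length == 0)
            throw new IllegalArgumentException("array has 0 elements");
        double logSum = 0;
        for (double v : values) {
            if (v < 0)
                throw new IllegalArgumentException("negative value: " + v);
            logSum += Math.log(v); // log(0) is -Infinity, so a zero entry yields a result of 0
        }
        return Math.exp(logSum / values.length);
    }
}
```

For the two numbers 2 and 8 this gives 4, the side length of the square whose area equals that of a 2 by 8 rectangle.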

median

public static double median(double[] values)
Calculates the median of the values in the given array. This method is highly optimized and gives correct values for arrays of odd size. For arrays of even size with more than 100 elements a small error is introduced: the value values[values.length/2] is returned instead of (values[values.length/2-1]+values[values.length/2])/2. Quoting from Numerical Recipes: "When N is odd, the median is the kth element, with k=(N+1)/2. When N is even, statistics books define the median as the arithmetic mean of the elements k=N/2 and k=N/2+1 (that is, N/2 from the bottom and N/2 from the top). If you accept such pedantry, you must perform two separate selections to find these elements. For N>100 we usually define k=N/2 to be the median element, pedants be damned."

For some fixed sizes (3, 5, 6, 7, 9) an optimal sorting is implemented (taken from XILINX XCELL magazine, vol. 23 by John L. Smith). The method kthElement(double[], int) is used for other cases.

Parameters:
values - The array with the values.
Returns:
The median value.
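
The difference between the textbook median and the shortcut described above can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and a full sort is used for clarity instead of the optimized selection the method is documented to use.

```java
import java.util.Arrays;

final class MedianSketch {
    /** Textbook median: the mean of the two middle elements when the length is even. */
    static double exactMedian(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
    }

    /** Shortcut: always take the element at index n/2 of the sorted data. */
    static double shortcutMedian(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }
}
```

Both variants agree for odd lengths. For an even length such as {4, 1, 3, 2} the textbook value is 2.5, while the shortcut returns 3.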

variance

public static double variance(double[] values)
The variance of a sample is one measure of statistical dispersion, averaging the squared distance of its possible values from the expected value (mean). Whereas the mean is a way to describe the location of a distribution, the variance is a way to capture its scale or degree of being spread out.

Parameters:
values - The array with the values.
Returns:
The variance.
See Also:
http://en.wikipedia.org/wiki/Variance

stddev

public static double stddev(double[] values)
The standard deviation of a sample is one measure of statistical dispersion, calculated by taking the square root of the variance.

Parameters:
values - The array with the values.
Returns:
The standard deviation.
See Also:
http://en.wikipedia.org/wiki/Standard_deviation
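
The relationship between mean, variance and standard deviation can be sketched as below. The class name DispersionSketch is hypothetical, and the n-1 divisor (sample variance) is an assumption; this page does not state whether the library divides by n or by n-1.

```java
final class DispersionSketch {
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    /** Sample variance with the n-1 divisor (assumption, see the note above). */
    static double variance(double[] values) {
        double m = mean(values);
        double squares = 0;
        for (double v : values) squares += (v - m) * (v - m);
        return squares / (values.length - 1);
    }

    /** Standard deviation as the square root of the variance. */
    static double stddev(double[] values) {
        return Math.sqrt(variance(values));
    }
}
```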

rsd

public static double rsd(double[] values)
Parameters:
values -
Returns:

kthElement

public static double kthElement(double[] values,
                                int k)
Returns the k-th smallest value in the given array. This is a highly optimized method, which should be preferred over sorting the complete array with, for example, Arrays.sort.

This method is taken from Numerical Recipes, section 8.5 (select).

Parameters:
values - The array with the values.
k - The index.
Returns:
The k-th smallest element.
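
Selection without a full sort can be sketched with a plain quickselect. This is a simplified stand-in, not the Numerical Recipes routine itself: the class name is hypothetical, k is treated as a zero-based index (an assumption), and the sketch works on a copy, whereas the original routine may rearrange its input.

```java
final class SelectSketch {
    /** Returns the k-th smallest value, with k zero-based (assumption). The input is not modified. */
    static double kthSmallest(double[] values, int k) {
        if (k < 0 || k >= values.length)
            throw new IllegalArgumentException("k out of range: " + k);
        double[] a = values.clone();
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == k) break;          // pivot landed exactly on position k
            if (p < k) lo = p + 1;      // k-th element lies right of the pivot
            else hi = p - 1;            // k-th element lies left of the pivot
        }
        return a[k];
    }

    /** Lomuto partition around a[hi]; returns the final index of the pivot. */
    private static int partition(double[] a, int lo, int hi) {
        double pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) {
                swap(a, i, store);
                store++;
            }
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(double[] a, int i, int j) {
        double t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```

Only the part of the array that can still contain position k is partitioned again, which is why selection is cheaper on average than a complete sort.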

stats

public static double[] stats(double[] values)
Calculates some basic statistics on the given array of data. An array with the results is returned, which can safely be accessed by using the constant field values: MINIMUM, MAXIMUM, MEAN, VARIANCE and STDDEV.

Parameters:
values - The array with the values.
Returns:
The basic statistics calculated from the array.

quartiles

public static double[] quartiles(double[] values)
Quartiles partition the corresponding distribution into four quarters each containing 25% of the data. A particular quartile is therefore the border between two neighboring quarters of the distribution.

Parameters:
values - Array with the distribution to quartile.
Returns:
Array with the three quartile values.
See Also:
http://en.wikipedia.org/wiki/Quartile, http://www.vias.org/tmdatanaleng/cc_quartile.html

interquartileRange

public static double interquartileRange(double[] values)
In descriptive statistics, the interquartile range (IQR), also called the midspread, middle fifty and middle of the #s, is a measure of statistical dispersion, being equal to the difference between the third and first quartiles.
Unlike the (total) range, the interquartile range is a robust statistic, having a breakdown point of 25%, and is thus often preferred to the total range.

Parameters:
values - The array with the values.
Returns:
The interquartile range.
See Also:
http://en.wikipedia.org/wiki/Interquartile_range
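
Quartiles and the interquartile range can be sketched as below. Several quartile conventions exist, and this page does not say which one the library uses; the linear interpolation between sorted values used here is an assumption, as are the class name and the index constants.

```java
import java.util.Arrays;

final class QuartileSketch {
    // Hypothetical result indices; the real QUARTILE_* constant values are not shown on this page.
    static final int LOWER = 0, MEDIAN = 1, UPPER = 2;

    static double[] quartiles(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        return new double[] { quantile(sorted, 0.25), quantile(sorted, 0.5), quantile(sorted, 0.75) };
    }

    /** Linear interpolation at position (n-1)*p in the sorted data (one convention among several). */
    private static double quantile(double[] sorted, double p) {
        double pos = (sorted.length - 1) * p;
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    /** Difference between the third and the first quartile. */
    static double interquartileRange(double[] values) {
        double[] q = quartiles(values);
        return q[UPPER] - q[LOWER];
    }
}
```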

normalize

public static void normalize(double[] values)
Normalizes the values in the given vector to the maximum (i.e. the maximum value will be 1). The values in the vector are adjusted.

Parameters:
values - The array with the values.

normalize

public static void normalize(double[] values,
                             double max)
Normalizes the values in the given vector to the given maximum. The values in the vector are adjusted.

Parameters:
values - The array with the values.
max - The maximum to scale to.
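
In-place normalization to a maximum can be sketched as follows. The class name is hypothetical, and the handling of empty arrays and of a maximum of 0 (leaving the data untouched) is an assumption rather than documented behaviour.

```java
final class NormalizeSketch {
    /** Scales the values in place so that the largest value becomes targetMax. */
    static void normalize(double[] values, double targetMax) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) max = Math.max(max, v);
        if (values.length == 0 || max == 0) return; // nothing sensible to scale (assumed behaviour)
        double factor = targetMax / max;
        for (int i = 0; i < values.length; i++) values[i] *= factor;
    }

    /** Scales the values in place so that the largest value becomes 1. */
    static void normalize(double[] values) {
        normalize(values, 1.0);
    }
}
```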

scale

public static double[][] scale(double[][] data,
                               int rowcolumn)
Calculates the standard score for either each row or each column. The standard score indicates how many standard deviations an observation is above or below the mean. It allows comparison of observations from different normal distributions, which is done frequently in research.
For each value the mean of either the row or the column is subtracted and the result divided by the standard deviation of either the row or the column. This causes the mean to be 0 and the standard deviation to be 1 for either each row or each column.

Parameters:
data - The data matrix to be scaled
rowcolumn - Either ROW or COLUMN.
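
The standard-score calculation described above can be sketched as below. The class name and the values of the ROW and COLUMN constants are hypothetical, and the sample standard deviation (n-1 divisor) is an assumption.

```java
final class ScaleSketch {
    static final int ROW = 0;    // hypothetical value
    static final int COLUMN = 1; // hypothetical value

    /** Subtracts the mean and divides by the standard deviation. */
    static double[] standardScores(double[] v) {
        double mean = 0;
        for (double x : v) mean += x;
        mean /= v.length;
        double squares = 0;
        for (double x : v) squares += (x - mean) * (x - mean);
        double sd = Math.sqrt(squares / (v.length - 1)); // sample standard deviation (assumption)
        double[] z = new double[v.length];
        for (int i = 0; i < v.length; i++) z[i] = (v[i] - mean) / sd;
        return z;
    }

    /** Returns a new matrix with every row or every column converted to standard scores. */
    static double[][] scale(double[][] data, int rowcolumn) {
        int rows = data.length, cols = data[0].length;
        double[][] out = new double[rows][cols];
        if (rowcolumn == ROW) {
            for (int r = 0; r < rows; r++) out[r] = standardScores(data[r]);
        } else if (rowcolumn == COLUMN) {
            for (int c = 0; c < cols; c++) {
                double[] column = new double[rows];
                for (int r = 0; r < rows; r++) column[r] = data[r][c];
                double[] z = standardScores(column);
                for (int r = 0; r < rows; r++) out[r][c] = z[r];
            }
        } else {
            throw new IllegalArgumentException("rowcolumn must be ROW or COLUMN");
        }
        return out;
    }
}
```

After scaling on COLUMN, the columns {1, 2, 3} and {10, 20, 30} both become {-1, 0, 1}, which makes observations on different scales directly comparable.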

pearsonsCorrelation

public static double[] pearsonsCorrelation(double[] xvalues,
                                           double[] yvalues)
In statistics, the Pearson product-moment correlation coefficient (sometimes referred to as the MCV or PMCC, and typically denoted by r) is a common measure of the correlation (linear dependence) between two variables X and Y. It is very widely used in the sciences as a measure of the strength of linear dependence between two variables, giving a value somewhere between +1 and -1 inclusive.

This implementation has been based on Numerical Recipes in C paragraph 14.5.

Parameters:
xvalues - The array with the x-values.
yvalues - The array with the y-values.
Returns:
The correlation between the x- and y-values.
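
The correlation coefficient and its Fisher transform can be sketched as follows. This is a standalone illustration, not the library's code: the class name and the result indices are hypothetical, and the t-test entry of the real result array is left out because this page does not define it precisely.

```java
final class PearsonSketch {
    // Hypothetical result indices; the real PEARSON_* constant values are not shown on this page.
    static final int R = 0, FISHER_Z = 1, FISHER_SE = 2;

    static double[] pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 4)
            throw new IllegalArgumentException("need two arrays of equal length >= 4");
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            sxx += dx * dx;
            syy += dy * dy;
            sxy += dx * dy;
        }
        double r = sxy / Math.sqrt(sxx * syy);           // correlation coefficient in [-1, 1]
        double z = 0.5 * Math.log((1 + r) / (1 - r));    // Fisher z-transform of r
        double se = 1.0 / Math.sqrt(n - 3);              // approximate standard error of z
        return new double[] { r, z, se };
    }
}
```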

pearsonsCorrelationS

public static double pearsonsCorrelationS(double[] xvalues,
                                          double[] yvalues)

spearmanCorrelation

public static double[] spearmanCorrelation(double[] xvalues,
                                           double[] yvalues)
Calculates the Spearman rank correlation and the two-sided significance levels of its deviation from zero between the two given arrays. A small significance level indicates a good correlation (correlation positive) or anti-correlation (correlation negative).

This implementation has been based on Numerical Recipes in C paragraph 14.6.

Parameters:
xvalues - The array with the x-values.
yvalues - The array with the y-values.
Returns:
The calculated values.
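
The rank correlation itself can be sketched as below. The sketch assumes that neither array contains ties, which allows the short formula rho = 1 - 6*sum(d^2)/(n(n^2-1)); the class name is hypothetical and the two-sided significance is not computed here.

```java
final class SpearmanSketch {
    /** Rank of each element: 1 plus the number of strictly smaller elements (assumes no ties). */
    static double[] ranks(double[] v) {
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            int smaller = 0;
            for (double other : v) if (other < v[i]) smaller++;
            r[i] = smaller + 1;
        }
        return r;
    }

    /** Spearman rank correlation for tie-free data. */
    static double rho(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2)
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        double[] rx = ranks(x), ry = ranks(y);
        int n = x.length;
        double d2 = 0;
        for (int i = 0; i < n; i++) {
            double d = rx[i] - ry[i];
            d2 += d * d;
        }
        return 1 - 6 * d2 / (n * ((double) n * n - 1));
    }
}
```

Because only ranks enter the formula, any strictly increasing relation (for example y = x^2 on positive x) yields a correlation of 1.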

spearmanCorrelationS

public static double spearmanCorrelationS(double[] xvalues,
                                          double[] yvalues)

factorialln

public static double factorialln(int n)
In mathematics, the factorial of a non-negative integer n, denoted by n!, is the product of all positive integers less than or equal to n. This method calculates the natural logarithm of the factorial of the given n. This method utilizes the gammaln(double) method, which is highly optimized.

Parameters:
n - The non-negative integer.
Returns:
The natural logarithm of the factorial.

binomialln

public static double binomialln(int n,
                                int k)
A binomial is a polynomial with two terms (the sum of two monomials), often bound by parentheses or brackets when operated upon. It is the simplest kind of polynomial other than monomials. This method calculates the natural logarithm of the binomial.

Parameters:
n - The non-negative integer.
k - The non-negative integer.
Returns:
The natural logarithm of the binomial.
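
Assuming that binomialln(n, k) refers to the binomial coefficient "n choose k", as the signature suggests, its logarithm can be sketched from log-factorials. The class name is hypothetical, and the log-factorial is computed here as a plain sum of logarithms rather than through a gamma function.

```java
final class BinomialLnSketch {
    /** ln(n!) as the sum ln(2) + ... + ln(n). */
    static double factorialLn(int n) {
        if (n < 0)
            throw new IllegalArgumentException("n must be non-negative");
        double sum = 0;
        for (int i = 2; i <= n; i++) sum += Math.log(i);
        return sum;
    }

    /** ln(C(n, k)) = ln(n!) - ln(k!) - ln((n-k)!). */
    static double binomialLn(int n, int k) {
        if (k < 0 || k > n)
            throw new IllegalArgumentException("k must satisfy 0 <= k <= n");
        return factorialLn(n) - factorialLn(k) - factorialLn(n - k);
    }
}
```

Working with logarithms keeps the intermediate values small: 100! overflows a long, while ln(100!) is only about 363.7.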

ttest

public static double ttest(double[] a,
                           double[] b)
A t-test is any statistical hypothesis test in which the test statistic has a Student's t distribution if the null hypothesis is true. It is applied when the population is assumed to be normally distributed but the sample sizes are small enough that the statistic on which inference is based is not normally distributed because it relies on an uncertain estimate of standard deviation rather than on a precisely known value.

Parameters:
a - Population 1.
b - Population 2.
Returns:
The t-test p-value. Less than 0.05 could be considered significant.
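
The statistic behind such a test can be sketched as below. The sketch computes only Student's t with a pooled variance, which is an assumption about the variant used; turning t into a p-value additionally requires the cumulative t-distribution, which is not shown. All names are hypothetical.

```java
final class TTestSketch {
    static double mean(double[] v) {
        double sum = 0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    static double sampleVariance(double[] v) {
        double m = mean(v), squares = 0;
        for (double x : v) squares += (x - m) * (x - m);
        return squares / (v.length - 1);
    }

    /** Student's t statistic for two samples, using the pooled variance. */
    static double tStatistic(double[] a, double[] b) {
        int na = a.length, nb = b.length;
        double pooled = ((na - 1) * sampleVariance(a) + (nb - 1) * sampleVariance(b)) / (na + nb - 2);
        return (mean(a) - mean(b)) / Math.sqrt(pooled * (1.0 / na + 1.0 / nb));
    }
}
```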

ftest

public static double ftest(double[] a,
                           double[] b)
An F-test is any statistical test in which the test statistic has an F-distribution if the null hypothesis is true.

Parameters:
a - Population 1.
b - Population 2.
Returns:
The f-test p-value. Less than 0.05 could be considered significant.

rank

public static double[] rank(double[] data,
                            boolean reversed)
Highly optimized function for ranking the contents of a vector. The result is a vector of equal size as the data-vector given as a parameter, which is filled with the ranks of the cells in the data vector.

A binary search is utilized to find the rank-values of each of the cells.

Parameters:
data - The data vector to rank.
reversed - When set to true, the ranks will be reversed
Returns:
The ranks of the data in the vector.
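
Ranking a vector can be sketched by sorting the indices by value. This differs from the binary-search approach mentioned above and is meant only to show the result shape; the class name is hypothetical, ranks are assumed to start at 1, tied values receive consecutive ranks, and "reversed" is assumed to mean that the largest value gets rank 1.

```java
import java.util.Arrays;
import java.util.Comparator;

final class RankSketch {
    /** Returns the 1-based rank of every element; the input is not modified. */
    static double[] rank(double[] data, boolean reversed) {
        Integer[] order = new Integer[data.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Comparator<Integer> byValue = Comparator.comparingDouble(i -> data[i]);
        Arrays.sort(order, reversed ? byValue.reversed() : byValue);
        double[] ranks = new double[data.length];
        for (int pos = 0; pos < order.length; pos++) {
            ranks[order[pos]] = pos + 1; // element order[pos] is the (pos+1)-th in sorted order
        }
        return ranks;
    }
}
```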

rank

public static double[][] rank(double[][] data,
                              boolean reversed,
                              int rowcolumn)
                       throws java.lang.IllegalArgumentException
Highly optimized function for ranking the contents of a matrix either on the COLUMN or the ROW. The result is a matrix of equal size as the data-matrix given as a parameter, which is filled with the ranks of the cells in the data matrix. The rowcolumn parameter can be used to indicate whether the ranking needs to be done on the rows or the columns.

A binary search is utilized to find the rank-values of each of the cells.

Parameters:
data - The data-matrix to rank
reversed - When set to true, the ranks will be reversed
rowcolumn - Either ROW or COLUMN
Returns:
A matrix of the same size as data filled with the ranks.
Throws:
java.lang.IllegalArgumentException - Thrown when the rowcolumn parameter has an illegal value

qsort

public static final void qsort(double[] values,
                               double[]... arrays)
Parameters:
values -
arrays -

permute

public static double[] permute(double[] vector)
Randomly permutes the values in the given vector. This method makes use of an internal random number generator seeded with the time of creation of this class. This should ensure that the generated pseudo-random values are reasonable. The passed vector is not affected.

Parameters:
vector - The vector which needs to be permuted.
Returns:
The new vector with the permutation.
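
A permutation that leaves the input untouched can be sketched with a Fisher-Yates shuffle on a copy. Whether the library uses this particular shuffle is not stated, so it is an assumption, as is the class name; the generator is seeded once with the current time, in the spirit of the description above.

```java
import java.util.Random;

final class PermuteSketch {
    // Seeded once when the class is loaded.
    private static final Random RANDOM = new Random(System.currentTimeMillis());

    /** Returns a randomly permuted copy; the passed vector is not modified. */
    static double[] permute(double[] vector) {
        double[] out = vector.clone();
        for (int i = out.length - 1; i > 0; i--) {
            int j = RANDOM.nextInt(i + 1); // uniform index in [0, i]
            double t = out[i];
            out[i] = out[j];
            out[j] = t;
        }
        return out;
    }
}
```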

permute

public static double[][] permute(double[][] data,
                                 int rowcolumn)
                          throws java.lang.IllegalArgumentException
Randomly permutes the values in the given matrix. The rowcolumn argument indicates whether the permutation is done on the ROW or the COLUMN. This method makes use of an internal random number generator seeded with the time of creation of this class. This should ensure that the generated pseudo-random values are reasonable. The passed matrix is not affected.

Parameters:
data - The matrix to be permuted.
rowcolumn - Either ROW or COLUMN
Returns:
The new matrix with the permutation
Throws:
java.lang.IllegalArgumentException - Thrown when the rowcolumn parameter has an illegal value

gammaln

public static double gammaln(double z)
The gamma function interpolates the factorial function. For integer n: gamma(n+1) = n! = prod(1:n).

Reference: "Lanczos, C. 'A precision approximation of the gamma function', J. SIAM Numer. Anal., B, 1, 86-96, 1964.". Translation of Alan Miller's FORTRAN-implementation.

Parameters:
z - The value for which to calculate the gammaln.
Returns:
The gammaln of z.
See Also:
http://lib.stat.cmu.edu/apstat/245
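
The identity gamma(n+1) = n! can be checked with a small log-gamma sketch. Note the swap: the code below uses the six-term Lanczos series as published in Numerical Recipes (gammln), not the AS 245 routine referenced above, and the class name is hypothetical. It is intended for z > 0 only.

```java
final class GammaLnSketch {
    // Lanczos series coefficients as given in Numerical Recipes (gammln).
    private static final double[] COF = {
        76.18009172947146, -86.50532032941677, 24.01409824083091,
        -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5
    };

    /** Natural logarithm of the gamma function for z > 0. */
    static double gammaln(double z) {
        if (z <= 0)
            throw new IllegalArgumentException("z must be positive");
        double y = z;
        double tmp = z + 5.5;
        tmp -= (z + 0.5) * Math.log(tmp);
        double ser = 1.000000000190015;
        for (double c : COF) {
            y += 1;
            ser += c / y;
        }
        return -tmp + Math.log(2.5066282746310005 * ser / z);
    }
}
```

With such a function the log-factorial follows directly as ln(n!) = gammaln(n + 1); for example gammaln(6) is approximately ln(120).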

beta

public static double beta(double z,
                          double w)

betai

public static double betai(double a,
                           double b,
                           double x)

betacf

public static double betacf(double a,
                            double b,
                            double x)

shapiroWilk

public static double[] shapiroWilk(double[] values)
                            throws java.lang.IllegalArgumentException
The Shapiro-Wilk test tests the null hypothesis that a sample x1, ..., xn came from a normally distributed population.

Parameters:
values - The distribution to test.
Returns:
The result of the test, which is a double array of 2 elements: (SHAPIRO_WILK_PVALUE and SHAPIRO_WILK_WSTAT).
Throws:
java.lang.IllegalArgumentException - Thrown when the length of the vector is less than 3 elements.

durbinWatson

public static double durbinWatson(double[] values)
Simplistic implementation of the Durbin-Watson statistic. This statistic is a test statistic used to detect the presence of autocorrelation in the residuals from a regression analysis. It can be used to check whether a signal is very noisy.

The residual e is calculated by subtracting the measured value from the mean of all the values.

Parameters:
values - The array with the values.
Returns:
The Durbin-Watson statistic.
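
The statistic can be sketched from the description above, with the residual taken as the mean minus the measured value. The class name is hypothetical, and the standard form d = sum((e[t] - e[t-1])^2) / sum(e[t]^2) is assumed.

```java
final class DurbinWatsonSketch {
    /** Durbin-Watson statistic of the residuals around the mean. */
    static double durbinWatson(double[] values) {
        double mean = 0;
        for (double v : values) mean += v;
        mean /= values.length;

        double numerator = 0, denominator = 0, previous = 0;
        for (int t = 0; t < values.length; t++) {
            double e = mean - values[t];               // residual as described above
            denominator += e * e;
            if (t > 0) numerator += (e - previous) * (e - previous);
            previous = e;
        }
        return numerator / denominator;
    }
}
```

A smooth trend such as {1, 2, 3, 4, 5} gives a value close to 0 (here 0.4), while a signal that flips sign at every step, such as {1, -1, 1, -1}, gives a value well above 2 (here 3).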

durbinWatsonCERN

public static double durbinWatsonCERN(double[] values)
This implementation has been taken from http://www.lbl.gov/

Parameters:
values -
Returns: