stats.e
This stats library provides a number of basic statistical functions.
The definition of each function is given with the details of the function.
Preparing the data
All functions take as parameters one or more objects containing arrays
of data.
Consequently in order to use these functions you need to arrange your data
in an array, within a sequence.
Sometimes, however, your data may be in the form of a frequency
distribution - pairs of values and frequencies.
In this case you can easily use
Euphoria's 'repeat' function and the concatenation operator '&' to create
the right form of input.
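The expansion is mechanical: repeat each value as many times as its frequency and concatenate the pieces. The sketch below shows the idea in Python rather than Euphoria, so that it can be run without the library; `expand_frequencies` is a hypothetical helper name and is not part of stats.e. In Euphoria the loop body would use repeat(value, frequency) and '&' instead.

```python
def expand_frequencies(pairs):
    """Turn (value, frequency) pairs into a flat list of observations."""
    data = []
    for value, frequency in pairs:
        # Same idea as Euphoria's: data = data & repeat(value, frequency)
        data = data + [value] * frequency
    return data


# Three 1s, one 2 and two 5s
print(expand_frequencies([(1, 3), (2, 1), (5, 2)]))  # [1, 1, 1, 2, 5, 5]
```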
N.B. No provision is made in this library for 'missing values'.
If you wish to use these functions then you must remove such data from
your sequence.
Most functions operate on integer or real data values without problem.
Some functions, however, have two forms:
- one for integer data, and
- one for real data.
See Dual functions for details of the exceptions.
The function frequency can handle string-type
data as well as numeric data.
Moments in statistics are defined in a systematic way.
Each value can be seen as a deviation from a given value.
The mean of these deviations, each raised to the power 'r', is known as the
rth moment of the distribution, or sometimes the moment of order r.
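As an illustration of this definition, the following Python sketch computes the rth moment about an arbitrary centre. It is not the library's Euphoria code; the function name `moment` and its parameter names are assumptions made for the example.

```python
def moment(values, about, r):
    """Mean of the deviations from `about`, each raised to the power r."""
    n = len(values)
    return sum((x - about) ** r for x in values) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = moment(data, 0, 1)        # first moment about zero
second = moment(data, mean, 2)   # second moment about the mean
print(mean, second)              # 5.0 4.0
```

The first moment about zero is the arithmetic average, and the second moment about the mean is the variance based on 'n'.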
A number of the functions in this library are defined using moments.
Mainly for this reason all the functions are based on 'n' (the number of
observations) and not on 'n-1' (for sample-based estimates). If you need
such estimates then you can easily make the simple adjustment to the
calculation before reporting your results.
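For the variance and the standard deviation the adjustment is a single scaling factor. The symbols below are chosen for this note and do not appear in the library: \sigma_n^2 is the variance based on 'n' and s^2 is the corresponding estimate based on 'n-1'.

```latex
\[
  s^2 = \frac{n}{n-1}\,\sigma_n^2 ,
  \qquad
  s = \sqrt{\frac{n}{n-1}}\;\sigma_n
\]
```

The same factor n/(n-1) converts a covariance based on 'n' into one based on 'n-1'.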
Library specification
Version: 1.2.1
Date: February 2010
Author: C A Newbould (canewbould@users.sourceforge.net)
Licence: Free
Changes:
- Version 1.2.1
- added the zeroiseRoot function, so that
more than one frequency Tree can be analysed at a time.
- Version 1.2.0
- added the anova function
- added the count function
- added the simple linear regression function
- made the sum function global
- added the t_test function
- Version 1.1.0
- added the frequency function and associated
supporting data and routines
- added a function printTree which outputs the
details of a tree generated by the frequency
function.
(These functions have been modified from the RDS demo file "tree.ex".)
- Version 1.0.0 is the one filed in the
Euphoria Archive.
Interface
Globally-defined includes
- sort.e - for function sort
Routines:
** univariate routines **
All these functions take, as their parameter, a sequence consisting
of atoms in the form of a one-dimensional array. A vector type
has been defined with the library to enable type-checking to take place when
each function is called.
- [function] average(vector this)
- calculates the first moment, centred on zero.
The return value is the (real) arithmetic average of the values in the
vector.
- [function] count(vector this)
- synonym for Euphoria's length function
The return value is the (real) number of values in the vector.
- [function] kurtosis(vector this)
- calculates the kurtosis of the distribution in the vector.
This is based on the fourth moment about the arithmetic mean.
The kurtosis of the distribution is defined as the 'standardised' fourth
moment, that is, the fourth moment divided by the fourth power of the
standard deviation.
Because the kurtosis of the normal distribution is 3, some authors subtract
the 3 before reporting the value. If you wish to use this convention
then simply subtract three from the return value of this function.
The return value is the (real) kurtosis of the values in the vector.
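The definition above can be sketched as follows. This is a Python illustration under the stated definition (everything divided by 'n'), not the library's Euphoria implementation; `kurtosis_sketch` is a hypothetical name.

```python
def kurtosis_sketch(values):
    """Standardised fourth moment, using n (not n - 1) throughout."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second moment (variance)
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth moment about the mean
    return m4 / m2 ** 2  # fourth power of the sd equals variance squared


data = [2, 4, 4, 4, 5, 5, 7, 9]
k = kurtosis_sketch(data)
print(k)      # 2.78125
print(k - 3)  # -0.21875, the value under the 'subtract three' convention
```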
- [function] maximum(vector this)
- calculates the maximum value in the vector.
The return value is the (real) maximum.
- [function] median(vector this)
- calculates the median value in the vector.
If the number of values is even then the median is the average of the two
"middle" values.
The return value is the (real) median.
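A minimal Python sketch of this rule, for illustration only (the library itself is written in Euphoria and relies on sort.e; `median_sketch` is a hypothetical name):

```python
def median_sketch(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_sketch([7, 1, 3]))     # 3
print(median_sketch([7, 1, 3, 5]))  # 4.0
```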
- [function] minimum(vector this)
- calculates the minimum value in the vector.
The return value is the (real) minimum.
- [function] quartile(vector this)
- calculates the qth (1st, 2nd or 3rd) quartile of the vector.
If the quartile falls between two values then it is taken as the average of
the two adjacent values.
The return value is the (real) quartile.
- [function] range(vector this)
- calculates the difference between the maximum and minimum value in the
vector.
The return value is the (real) range.
- [function] skewness(vector this)
- calculates the skewness of the distribution in the vector.
This is based on the third moment about the arithmetic mean.
The skewness of the distribution is defined as the 'standardised' third
moment, that is, the third moment divided by the
cube of the standard deviation.
The return value is the (real) skewness coefficient of the values in the
vector.
Assumes no missing values.
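Written as a formula, the definition reads as follows. The symbols m_r and \bar{x} are notation introduced for this note, not names used by the library; m_r is the rth moment about the arithmetic mean, based on 'n'.

```latex
\[
  m_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r ,
  \qquad
  \text{skewness} = \frac{m_3}{m_2^{3/2}}
\]
```

Here m_2 is the variance, so m_2^{3/2} is the cube of the standard deviation.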
- [function] sum(vector this)
- calculates the sum of the values in the vector.
The return value is the (real) sum of values in the vector.
- [function] variance(vector this)
- calculates the second moment of the distribution,
about the arithmetic mean.
The return value is the (real) variance of the values in the vector.
** more generalised univariate routine **
This function takes, as its parameter, a sequence consisting
of objects in the form of a vector.
- [function] frequency(sequence this)
- determines the frequency distribution of the vector.
The return value is a tree containing counts.
Each node contains the value, its count and the pointers to its
neighbours.
Will handle any kind of values in the vector.
- [procedure] printTree(sequence node)
- displays the counts in a tree structure.
- [procedure] zeroiseRoot()
- zeroises a tree structure (needed if more than one frequency Tree is to be
analysed).
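To make the idea of a counting tree concrete, here is a small Python sketch of a binary tree in which each node holds a value, its count and links to its two neighbours. The node layout `[value, count, left, right]` and the names `insert` and `counts_in_order` are assumptions made for this illustration; the library's actual tree representation in Euphoria may differ.

```python
def insert(node, value):
    """Add value to the tree rooted at node; return the (possibly new) root."""
    if node is None:
        return [value, 1, None, None]  # new leaf with a count of one
    if value == node[0]:
        node[1] += 1                   # value seen before: bump its count
    elif value < node[0]:
        node[2] = insert(node[2], value)
    else:
        node[3] = insert(node[3], value)
    return node


def counts_in_order(node):
    """List (value, count) pairs in ascending order of value."""
    if node is None:
        return []
    return counts_in_order(node[2]) + [(node[0], node[1])] + counts_in_order(node[3])


root = None  # an empty tree; start again from None for each new data set
for item in ["pear", "apple", "pear", "fig", "apple", "pear"]:
    root = insert(root, item)
print(counts_in_order(root))  # [('apple', 2), ('fig', 1), ('pear', 3)]
```

Because the tree only needs values that can be compared with one another, the same approach works for strings as well as numbers.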
** bivariate routines **
N.B. All these functions take, as parameters, two sequences consisting
of atoms in the form of a vector. A vector type
has been defined with the library to enable type-checking to take place when
each function is called. The data are analysed pair-wise: that is, the first
value in vector1 is associated with the first value in vector2, etc.
- [function] anova(vector this, vector class)
- calculates the one-way analysis of variance of the array, given the
class groupings.
The return value is a sequence containing the following arrays:
- the sum of squares due to methods (classification) and error
- the corresponding degrees of freedom
The total sum of squares is the sum of the values in the first array and
the total degrees of freedom are the sum of the values in the second array.
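The shape of this result can be illustrated with a textbook one-way analysis of variance. The Python sketch below is not the library's Euphoria code; `one_way_anova` is a hypothetical name, and the nesting of the returned lists follows the description above rather than a confirmed return format.

```python
def one_way_anova(values, classes):
    """Return [[SS between classes, SS error], [df between classes, df error]]."""
    n = len(values)
    grand_mean = sum(values) / n
    groups = {}
    for value, label in zip(values, classes):  # pair each value with its class
        groups.setdefault(label, []).append(value)
    ss_between = 0.0
    ss_error = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_error += sum((x - group_mean) ** 2 for x in members)
    k = len(groups)
    return [[ss_between, ss_error], [k - 1, n - k]]


values = [1, 2, 3, 2, 3, 4, 6, 7, 8]
classes = [1, 1, 1, 2, 2, 2, 3, 3, 3]
print(one_way_anova(values, classes))  # [[42.0, 6.0], [2, 6]]
```

In this example the total sum of squares is 42 + 6 = 48 and the total degrees of freedom are 2 + 6 = 8, that is, n - 1.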
- [function] correlation(vector this, vector that)
- calculates the correlation coefficient of the paired values in the two
vectors.
The return value is the (real) correlation of the values in the vectors.
- [function] covariance(vector this, vector that)
- calculates the covariance of the paired values in the two vectors.
The return value is the (real) covariance of the values in the vectors.
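Covariance and correlation are closely related, which the following Python sketch makes explicit. It assumes the 'n'-based definitions used throughout this library and is an illustration only, not the Euphoria source; the function names are hypothetical.

```python
from math import sqrt


def covariance_sketch(xs, ys):
    """Mean product of paired deviations from the two means (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation_sketch(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    var_x = covariance_sketch(xs, xs)  # covariance of a vector with itself
    var_y = covariance_sketch(ys, ys)
    return covariance_sketch(xs, ys) / sqrt(var_x * var_y)


xs = [1, 2, 3, 4]
ys = [2, 4, 5, 9]
print(covariance_sketch(xs, ys))  # 2.75
print(correlation_sketch(xs, ys))
```

Because the same divisor appears in the numerator and the denominator, the correlation coefficient is the same whether 'n' or 'n-1' is used.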
- [function] regression(vector this, vector that)
- calculates the simple linear regression coefficient of the values in
this given the values in that.
The return value is the (real) simple linear regression coefficient.
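For reference, the usual least-squares slope can be written in terms of the covariance and variance. The formula below assumes that 'this' plays the role of the dependent variable and 'that' the independent variable, which is a reading of "this given that" and is not confirmed by the library text; the symbol b is introduced only for this note.

```latex
\[
  b = \frac{\mathrm{cov}(\mathit{this}, \mathit{that})}{\mathrm{var}(\mathit{that})}
\]
```

The ratio is unaffected by the choice of 'n' or 'n-1', provided the same divisor is used for both terms.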
- [function] t_test(vector this, vector that)
- calculates the t-test for the difference between the means of the two
vectors.
The return value is a sequence giving:
- the (real) t statistic, with sign
- a sequence containing the degrees of freedom (1,df)
The statistical significance of the result can be looked up in the tables
in the library F.e.
** Dual functions **
Instances where there are separate functions for integer and real data.
|         | Distributions     | Modal value[s] |
| Integer | distribution      | mode           |
| Real    | real_distribution | real_mode      |
Integer functions
- [function] distribution(vector this)
- determines the frequency distribution of the vector.
The base unit is relative to the minimum.
The return value is a sequence with counts.
It consists of the base unit, followed by all the counts.
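One way to read this description is sketched below in Python. It assumes that the "base unit" is the minimum value and that the counts cover every integer from the minimum to the maximum; both points are interpretations of the text, and `distribution_sketch` is a hypothetical name rather than the library's Euphoria routine.

```python
def distribution_sketch(values):
    """Return [base, count of base, count of base + 1, ...] for integer data."""
    base = min(values)
    counts = [0] * (max(values) - base + 1)  # one slot per integer in the range
    for v in values:
        counts[v - base] += 1
    return [base] + counts


print(distribution_sketch([3, 5, 3, 4, 5, 5]))  # [3, 2, 1, 3]
```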
- [function] mode(vector this)
- determines the modal value(s) of the frequency distribution of the
vector.
The return value is a sequence with one or more values.
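A sequence is returned because several values can share the highest count. The Python sketch below illustrates this for integer data; it is not the library's Euphoria code, `mode_sketch` is a hypothetical name, and the ascending order of the result is an assumption.

```python
def mode_sketch(values):
    """Return every value that occurs with the highest frequency."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(mode_sketch([1, 2, 2, 3, 3]))  # [2, 3]
print(mode_sketch([4, 4, 1]))        # [4]
```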
Real functions
- [function] real_distribution(vector this, integer interval)
- determines the frequency distribution of the vector.
The base unit is relative to the minimum.
The intervals used in the distribution are determined by the second
parameter - starting with the integer immediately less than
the minimum value.
The return value is a sequence with counts.
Its value consists of the base unit, followed by all the counts.
NB. The base will always be an integer.
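The grouping into intervals can be sketched in Python as follows. The sketch takes the base to be the floor of the minimum value and uses half-open intervals of width `interval` starting at the base; both choices are assumptions about details that the description leaves open, and `real_distribution_sketch` is a hypothetical name.

```python
from math import floor


def real_distribution_sketch(values, interval):
    """Return [base, count in 1st interval, count in 2nd interval, ...]."""
    base = floor(min(values))  # integer base at or just below the minimum
    bins = int((max(values) - base) // interval) + 1
    counts = [0] * bins
    for v in values:
        counts[int((v - base) // interval)] += 1  # index of the interval holding v
    return [base] + counts


# Intervals are [1, 3), [3, 5), [5, 7) and [7, 9)
print(real_distribution_sketch([1.2, 1.9, 3.5, 4.1, 7.8], 2))  # [1, 2, 2, 0, 1]
```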
- [function] real_mode(vector this, integer interval)
- determines the modal value(s) of the frequency distribution of the vector.
The return value is a sequence with one or more values.
Assumes that the values in the array will be grouped into intervals
of size 'interval'.