Distribution Functions

The distribution functions calculate the probability of a given value under a specified probability distribution.

Cumulative Distribution Functions (CDF): Calculate the probability of a random variable X taking on a value less than or equal to Y.
Bernoulli Distribution (BERNDIST): Determines the probability of a specific event occurring, or not occurring, in tests that have only two possible outcomes: success (1) or failure (0).
Beta-Binomial Distribution (BETABINDIST): Computes probability using a combination of both binomial and beta probability distributions.
Hypergeometric Distribution (HYPGEODIST): Calculates probability from a distribution that is often used to predict the outcome of a process in which different elements are randomly drawn from a collection and not replaced.
Logarithmic (Series) Distribution (LOGSERDIST): Calculates probability using a discrete probability distribution derived from the Maclaurin series expansion of -ln(1 - p).
Skellam Distribution (SKELLAMDIST): Calculates probability using the Skellam distribution, which models the difference between two independent Poisson-distributed variables.

The URI for the data science functions is <http://cambridgesemantics.com/anzograph/statistics#>. For readability, the syntax for each function below includes the prefix stats:, defined as PREFIX stats: <http://cambridgesemantics.com/anzograph/statistics#>.

Cumulative Distribution Functions (CDF)

A cumulative distribution function calculates the probability of a random variable X taking on a value less than or equal to Y. The following functions produce cumulative distribution calculations:

Binomial Distribution (BINOMDIST): Calculates the probability for X successes in N trials given a probability of success P for each trial.
Chi-Squared Distribution (CHISQDIST): Calculates probability often used in hypothesis testing to compare an observed distribution with a theoretical one. Also provides a way to show a relationship between two categorical variables.
Continuous Uniform Distribution (CONUNIDIST): Calculates probability using a continuous probability distribution concerned with events that are equally likely to occur.
Discrete Uniform Distribution (DISCUNIDIST): Calculates probability using a symmetric probability distribution in which each of a finite number n of values is equally likely to be observed.
Exponential Distribution (EXPDIST): Calculates probability using a distribution that describes time between events in a Poisson point process (where events occur continuously and independently at a constant average rate).
Laplace Distribution (LAPLACEDIST): Calculates probability using a distribution that represents differences between two independent variables that have identical exponential distributions (also called double exponential distribution).
Log Normal Distribution (LOGNORDIST): Calculates probability using a distribution of a random variable whose logarithm follows a normal distribution. Log normal distributions are widely used in risk analysis.
Negative Binomial Distribution (NEGBINDIST): Calculates probability using a discrete probability distribution that concerns the number of trials which must occur in order to have a predetermined number of successes.
Normal Distribution (NORMDIST): Calculates probability using a continuous probability distribution of data in which the majority of data points are relatively similar, within a small range of values having few outliers.
Poisson Distribution (POISDIST): Calculates probability using a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, given that the events occur at a known constant rate and independently of the time since the last event.
Student's T-Distribution (TDIST): Calculates probability using the Student's t-distribution and associated t scores. Often used in hypothesis testing when the sample size is small and/or when the population variance is unknown.
TDigest Metric (TDIGEST): Creates an estimate of the median (and more generally, any percentile) from either distributed data or streaming data, using a t-Digest probabilistic data structure.
Weibull Distribution (WEIBULDIST): Calculates probability from a continuous probability distribution that is commonly used to assess product reliability, analyze product life data and failure times.

Binomial Distribution (BINOMDIST)

The Binomial distribution aggregate calculates the probability for X successes in N trials given a probability of success P for each trial.

Syntax

stats:binomdist(data, n, k, "success_string")

Parameter	Type	Description
data	string	Column data.
n	long	Number of trials.
k	long	Number of successes in `n` trials.
success_string	string	Defines the success string.

Returns

Type	Description
double	Probability mass function value.
double	Lower cumulative distribution: probability (<=k) under the area of distribution.
double	Upper cumulative distribution: probability (>k) under the area of distribution.
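
As a rough illustration of the three return values, the following Python sketch computes them directly from the binomial formula. It is not AnzoGraph's implementation: the function name `binom_summary` is hypothetical, and the success probability `p` is passed in explicitly, on the assumption that the aggregate derives it from the share of rows matching the success string.

```python
from math import comb

def binom_summary(n: int, k: int, p: float) -> tuple[float, float, float]:
    """Return (pmf, lower_cdf, upper_cdf) for k successes in n trials."""
    def pmf(i: int) -> float:
        return comb(n, i) * p**i * (1 - p) ** (n - i)

    lower = sum(pmf(i) for i in range(k + 1))   # P(X <= k)
    return pmf(k), lower, 1.0 - lower           # P(X > k) is the complement

# Hypothetical inputs: 10 trials, 3 successes, p assumed to be 0.5.
pmf_k, lower_cdf, upper_cdf = binom_summary(n=10, k=3, p=0.5)
```

The lower and upper values always sum to 1, which is a quick sanity check on any result.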

Chi-Squared Distribution (CHISQDIST)

The Chi-squared distribution aggregate calculates probability that is often used in hypothesis testing to compare an observed distribution with a theoretical one. It also provides a way to show a relationship between two categorical variables.

Syntax

stats:chisqdist(data, s)

Parameter	Type	Description
data	double	Sample data.
s	double	Population standard deviation.

Returns

Type	Description
double	Mean of the distribution.
double	Standard deviation of the distribution.
double	Variance of the distribution.
double	Chi-squared statistic: `[(n - 1) * s^2] / d^2` where `d` is the standard deviation of the population, `s` is the standard deviation of the sample, and `n` is the sample size.
long	Number of samples: the degrees of freedom (`k`) is `(count-1)`.
double	Probability density function value.
double	Cumulative distribution: the probability for <= the chi-squared statistic.
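
The chi-squared statistic formula in the table can be checked by hand with a short Python sketch. The function name `chisq_statistic` and the sample values are hypothetical, and the sketch assumes `s` is the usual sample standard deviation (divisor `n - 1`).

```python
from statistics import stdev

def chisq_statistic(sample: list[float], pop_sd: float) -> tuple[float, int]:
    """Return (chi-squared statistic, degrees of freedom) for a sample."""
    n = len(sample)
    s = stdev(sample)                        # sample standard deviation
    stat = (n - 1) * s**2 / pop_sd**2        # [(n - 1) * s^2] / d^2
    return stat, n - 1

# Hypothetical sample with an assumed population standard deviation of 2.0.
chi_stat, dof = chisq_statistic([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], pop_sd=2.0)
```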

Continuous Uniform Distribution (CONUNIDIST)

The Continuous uniform distribution aggregate calculates probability using a continuous probability distribution concerned with events that are equally likely to occur.

Syntax

stats:conunidist(data, a, b)

Parameter	Type	Description
data	double	Column data.
a	double	Minimum value of the probability interval.
b	double	Maximum value of the probability interval.

Returns

Type	Description
double	Cumulative distribution: probability under the area of distribution.
double	Probability density function value.
double	Differential entropy in nats.
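
For reference, the three quantities have simple closed forms for a uniform distribution on the interval from `a` to `b`. The Python sketch below evaluates them at a single point `x`; the function name `conuni_summary` is hypothetical, and how the aggregate combines multiple rows of `data` is not covered here.

```python
from math import log

def conuni_summary(x: float, a: float, b: float) -> tuple[float, float, float]:
    """Return (cdf, pdf, entropy_nats) of Uniform(a, b) evaluated at x."""
    if not a < b:
        raise ValueError("a must be less than b")
    cdf = min(max((x - a) / (b - a), 0.0), 1.0)   # clamp outside [a, b]
    pdf = 1.0 / (b - a) if a <= x <= b else 0.0
    return cdf, pdf, log(b - a)                   # entropy is ln(b - a)

uni_cdf, uni_pdf, uni_entropy = conuni_summary(x=2.5, a=0.0, b=10.0)
```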

Discrete Uniform Distribution (DISCUNIDIST)

The Discrete uniform distribution aggregate calculates probability using a symmetric probability distribution in which each of a finite number n of values is equally likely to be observed.

Syntax

stats:discunidist(data, k)

Parameter	Type	Description
data	long	Column data.
k	long	The number of outcomes.

Returns

Type	Description
double	Cumulative distribution: probability under the area of distribution.
double	Probability mass function value.
double	Entropy in nats.

Exponential Distribution (EXPDIST)

The Exponential distribution aggregate calculates probability using a distribution that describes time between events in a Poisson point process (where events occur continuously and independently at a constant average rate).

Syntax

stats:expdist(data, x)

Parameter	Type	Description
data	long	Column data.
x	double	The value at which to compute the probability.

Returns

Type	Description
double	Lower cumulative distribution: probability (<=x) under the area of distribution.
double	Upper cumulative distribution: probability (>x) under the area of distribution.
double	Probability density function value.
double	Differential entropy in nats.
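
The four return values follow from the rate parameter of the exponential distribution. The Python sketch below takes the rate as an explicit argument; how AnzoGraph estimates it from `data` is not stated here, so treat `rate` and the function name `exp_summary` as assumptions made for illustration.

```python
from math import exp, log

def exp_summary(x: float, rate: float) -> tuple[float, float, float, float]:
    """Return (lower_cdf, upper_cdf, pdf, entropy_nats) at x for Exp(rate)."""
    if rate <= 0 or x < 0:
        raise ValueError("rate must be positive and x non-negative")
    upper = exp(-rate * x)                  # P(X > x), the survival function
    return 1.0 - upper, upper, rate * upper, 1.0 - log(rate)

exp_lower, exp_upper, exp_pdf, exp_entropy = exp_summary(x=2.0, rate=0.5)
```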

Laplace Distribution (LAPLACEDIST)

The Laplace distribution aggregate calculates probability using a distribution that represents differences between two independent variables that have identical exponential distributions (also called double exponential distribution).

Syntax

stats:laplacedist(data, "c", x1, x2)

Parameter	Type	Description
data	double	Column data.
c	string	"below", "above", "bet" (between), or "out" (outside).
x1	double	Lower number (>0) to find the probability.
x2	double	Upper number (>0) to find the probability.

Returns

Type	Description
double	Mean of the distribution.
double	Scale parameter of the distribution.
double	Standard deviation of the distribution.
double	Variance of the distribution.
double	Differential entropy in nats.
double	Cumulative distribution: probability under the area of distribution.
double	Probability density function value for x1.
double	Probability density function value for x2.
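
The Laplace quantities listed above can be written in closed form once a location `mu` and scale `b` are known. The following Python sketch assumes both are supplied directly (the way the aggregate estimates them from `data` is not described here), and the function name `laplace_summary` is hypothetical.

```python
from math import e, exp, log, sqrt

def laplace_summary(x: float, mu: float, b: float) -> dict[str, float]:
    """Closed-form Laplace(mu, b) quantities evaluated at x."""
    if b <= 0:
        raise ValueError("scale b must be positive")
    z = (x - mu) / b
    cdf = 0.5 * exp(z) if z < 0 else 1.0 - 0.5 * exp(-z)
    return {
        "sd": b * sqrt(2.0),
        "variance": 2.0 * b * b,
        "entropy": log(2.0 * b * e),        # differential entropy in nats
        "cdf": cdf,                         # P(X <= x)
        "pdf": exp(-abs(z)) / (2.0 * b),
    }

laplace = laplace_summary(x=0.0, mu=0.0, b=1.0)
```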

Log Normal Distribution (LOGNORDIST)

The Log-normal distribution aggregate calculates probability using the distribution of a random variable whose logarithm follows a normal distribution. The log-normal distribution is widely used in risk analysis.

Syntax

stats:lognordist(data, "c", x1, x2)

Parameter	Type	Description
data	double	Column data.
c	string	"below", "above", "bet" (between), or "out" (outside).
x1	double	Lower number (>0) to find the probability.
x2	double	Upper number (>0) to find the probability.

Returns

Type	Description
double	Mean of the distribution of natural logarithms.
double	Standard deviation of the distribution of natural logarithms.
double	Variance of the distribution.
double	Differential entropy in nats.
double	Cumulative distribution: probability under the area of distribution.
double	Probability density function value for x1.
double	Probability density function value for x2.
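
Because a log-normal variable is one whose logarithm is normally distributed, its CDF and PDF can be obtained from a normal distribution fitted to the logged data. The Python sketch below shows that relationship; the function name `lognormal_cdf_pdf` is hypothetical, and fitting by the sample mean and sample standard deviation of the logs is an assumption.

```python
from math import exp, log
from statistics import NormalDist, fmean, stdev

def lognormal_cdf_pdf(data: list[float], x: float) -> tuple[float, float]:
    """Return (cdf, pdf) at x for a log-normal fitted to positive data."""
    logs = [log(v) for v in data]           # every value must be > 0
    dist = NormalDist(fmean(logs), stdev(logs))
    # The CDF is the normal CDF of ln(x); the PDF picks up a 1/x factor.
    return dist.cdf(log(x)), dist.pdf(log(x)) / x

# Hypothetical data whose logarithms are 0, 1 and 2.
ln_cdf, ln_pdf = lognormal_cdf_pdf([1.0, exp(1.0), exp(2.0)], x=exp(1.0))
```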

Negative Binomial Distribution (NEGBINDIST)

The Negative binomial distribution aggregate calculates probability using a discrete probability distribution that concerns the number of trials which must occur in order to have a predetermined number of successes.

Syntax

stats:negbindist("data", k, r, "success_string")

Parameter	Type	Description
data	string	Column data.
k	long	Number of successes.
r	long	Number of failures.
success_string	string	Defines the success string.

Returns

Type	Description
double	Probability mass function value.
double	Lower cumulative distribution: probability (<=k) under the area of distribution.
double	Upper cumulative distribution: probability (>k) under the area of distribution.
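
The negative binomial distribution has several parameterizations. The Python sketch below uses one common convention, counting `k` successes before the `r`-th failure, which may or may not match the convention AnzoGraph uses. The function name `negbin_summary` is hypothetical and the success probability `p` is passed in explicitly.

```python
from math import comb

def negbin_summary(k: int, r: int, p: float) -> tuple[float, float, float]:
    """Return (pmf, lower_cdf, upper_cdf) for k successes before r failures."""
    def pmf(i: int) -> float:
        return comb(i + r - 1, i) * p**i * (1 - p) ** r

    lower = sum(pmf(i) for i in range(k + 1))   # P(X <= k)
    return pmf(k), lower, 1.0 - lower

nb_pmf, nb_lower, nb_upper = negbin_summary(k=2, r=3, p=0.5)
```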

Normal Distribution (NORMDIST)

The Normal distribution aggregate calculates probability using a continuous probability distribution of data in which the majority of data points are relatively similar, within a small range of values with few outliers.

Syntax

stats:normdist(data, "c", x1, x2)

Parameter	Type	Description
data	double	Column data.
c	string	"below", "above", "bet" (between), or "out" (outside).
x1	double	Lower number (>0) to find the probability.
x2	double	Upper number (>0) to find the probability.

Returns

Type	Description
double	Mean of the distribution.
double	Standard deviation of the distribution.
double	Variance of the distribution.
double	Differential entropy in nats.
double	Cumulative distribution: probability under the area of distribution.
double	Probability density function value for x1.
double	Probability density function value for x2.
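
The four region keywords map onto simple combinations of the normal CDF. The Python sketch below illustrates that mapping using the standard library; it assumes the distribution is fitted with the sample mean and sample standard deviation and that "below" and "above" are evaluated at `x1`, neither of which is confirmed here. The function name `normal_region_prob` is hypothetical.

```python
from statistics import NormalDist, fmean, stdev

def normal_region_prob(data, c, x1, x2=None):
    """Probability of the region named by c under a normal fitted to data."""
    dist = NormalDist(fmean(data), stdev(data))
    if c == "below":
        return dist.cdf(x1)                     # P(X <= x1)
    if c == "above":
        return 1.0 - dist.cdf(x1)               # P(X > x1)
    if c not in ("bet", "out") or x2 is None:
        raise ValueError("c must be below/above/bet/out; bet and out need x2")
    between = dist.cdf(x2) - dist.cdf(x1)       # P(x1 < X <= x2)
    return between if c == "bet" else 1.0 - between

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical data, mean 5
p_below = normal_region_prob(sample, "below", 5.0)
p_bet = normal_region_prob(sample, "bet", 3.0, 7.0)
p_out = normal_region_prob(sample, "out", 3.0, 7.0)
```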

Poisson Distribution (POISDIST)

The Poisson distribution function calculates probability using a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, given that the events occur at a known constant rate and independently of the time since the last event.

Syntax

stats:poisdist(data, k)

Parameter	Type	Description
data	long	Column data.
k	long	Number of events for which to compute the probability of observing `k` events in an interval.

Returns

Type	Description
double	Probability mass function value.
double	Lower cumulative distribution: probability (<=k) under the area of distribution.
double	Upper cumulative distribution: probability (>k) under the area of distribution.
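
The Poisson quantities depend only on the event rate and `k`. In the Python sketch below the rate `lam` is supplied directly; it is an assumption that the aggregate estimates it from `data` (for example as the mean count per interval), and the function name `poisson_summary` is hypothetical.

```python
from math import exp, factorial

def poisson_summary(k: int, lam: float) -> tuple[float, float, float]:
    """Return (pmf, lower_cdf, upper_cdf) for k events at average rate lam."""
    def pmf(i: int) -> float:
        return lam**i * exp(-lam) / factorial(i)

    lower = sum(pmf(i) for i in range(k + 1))   # P(X <= k)
    return pmf(k), lower, 1.0 - lower           # P(X > k) is the complement

pois_pmf, pois_lower, pois_upper = poisson_summary(k=2, lam=2.0)
```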

Student's T-Distribution (TDIST)

The Student's t-distribution function calculates probability using the Student's t-distribution and its associated t scores, which are often used in hypothesis testing when the sample size is small and/or when the population variance is unknown.

Syntax

stats:tdist(data, m)

Parameter	Type	Description
data	double	Sample data.
m	double	Population mean.

Returns

Type	Description
double	Mean of the distribution.
double	Standard deviation of the distribution.
double	Variance of the distribution.
double	T-statistic: `t = [ u - M ] / [ s / sqrt( N ) ]` where `u` is the sample mean, `M` is the population mean, `s` is the standard deviation of the sample, and `N` is the sample size.
long	Number of samples: the degrees of freedom is `(count-1)`.
double	Probability density function value.
double	Cumulative distribution: the probability for <= the t-statistic.
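
The t-statistic formula in the table can be reproduced with a few lines of Python. This is only a check of the arithmetic, not AnzoGraph's implementation; the function name `t_statistic` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev

def t_statistic(sample: list[float], pop_mean: float) -> tuple[float, int]:
    """Return (t, degrees of freedom) using t = (u - M) / (s / sqrt(N))."""
    n = len(sample)
    u = fmean(sample)                       # sample mean
    s = stdev(sample)                       # sample standard deviation
    return (u - pop_mean) / (s / sqrt(n)), n - 1

t_value, t_dof = t_statistic([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], pop_mean=4.0)
```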

TDigest Metric (TDIGEST)

This function creates an estimate of the median (and more generally, any percentile) from either distributed data or streaming data, using a t-Digest probabilistic data structure. For background information about this function, see Computing quantiles using t-Digests.

Syntax

stats:tdigest(data, p, q, cdf)

Parameter	Type	Description
data	double	Column data.
p	double	The percentile (0 - 100) to compute.
q	double	The quantile (0.0 - 1.0) to compute.
cdf	double	The value at which to evaluate the CDF.

Returns

Type	Description
double	Percentile: the value below which a given percentage of observations falls.
double	Quantile: the cut point that divides the observations in a sample at the given fraction.
double	The value of F(x), where F is the CDF of the distribution.
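
To make the three outputs concrete, the Python sketch below computes exact versions of them by sorting the data. This is deliberately not a t-digest: it is the exact calculation that a t-digest approximates when the data is too large or too distributed to sort. The function names and the nearest-rank quantile rule are choices made for this illustration.

```python
from bisect import bisect_right
from math import ceil

def exact_quantile(data: list[float], q: float) -> float:
    """Nearest-rank quantile for 0 < q <= 1 (exact, requires sorting)."""
    ordered = sorted(data)
    rank = max(1, ceil(q * len(ordered)))   # 1-based rank of the cut point
    return ordered[rank - 1]

def empirical_cdf(data: list[float], x: float) -> float:
    """Fraction of observations less than or equal to x."""
    ordered = sorted(data)
    return bisect_right(ordered, x) / len(ordered)

values = [float(v) for v in range(1, 11)]   # hypothetical data: 1.0 .. 10.0
median = exact_quantile(values, 0.5)        # a percentile p maps to q = p / 100
share_le_3 = empirical_cdf(values, 3.0)
```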

Weibull Distribution (WEIBULDIST)

The Weibull distribution function calculates probability from a continuous probability distribution commonly used to assess product reliability and analyze product life data and failure times.

Syntax

stats:weibuldist(data, k, x)

Parameter	Type	Description
data	double	Sample data.
k	double	The initial starting value for the shape parameter. A good guess is crucial to quick convergence.
x	double	The probability for a random variable.

Returns

Type	Description
double	The mean of the distribution.
double	The standard deviation of the distribution.
double	The variance of the distribution.
long	The count of the number of samples.
double	The estimated shape parameter (k) of the distribution from the mean and variance using the root finding method.
double	The estimated scale parameter (a) of the distribution from the mean and variance using the root finding method.
double	Differential entropy in nats.
double	Probability density function value.
double	Lower cumulative distribution: probability (<=x) under the area of distribution.
double	Upper cumulative distribution: probability (>x) under the area of distribution.
long	The actual number of iterations performed to get an estimate of the k value.
double	The mean calculated using estimated values of k and a.
double	The variance calculated using estimated values of k and a.
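
Estimating `k` and `a` from the mean and variance is a method-of-moments fit: the ratio of variance to squared mean depends only on `k`, so `k` can be found by root finding and `a` then follows from the mean. The Python sketch below uses plain bisection, which needs a bracket rather than a starting guess and is not necessarily the root finding method AnzoGraph uses. The function names and the bracket `[0.1, 50]` are assumptions.

```python
from math import gamma

def weibull_moments(k: float, a: float) -> tuple[float, float]:
    """Return (mean, variance) of a Weibull with shape k and scale a."""
    g1, g2 = gamma(1 + 1 / k), gamma(1 + 2 / k)
    return a * g1, a * a * (g2 - g1 * g1)

def fit_weibull(mean: float, variance: float,
                lo: float = 0.1, hi: float = 50.0, tol: float = 1e-10):
    """Return (k, a, iterations) matching the given mean and variance."""
    target = variance / mean**2             # depends on k only

    def cv2(k: float) -> float:
        g1 = gamma(1 + 1 / k)
        return gamma(1 + 2 / k) / (g1 * g1) - 1

    iterations = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cv2(mid) > target:               # cv2 decreases as k grows
            lo = mid
        else:
            hi = mid
        iterations += 1
    k = (lo + hi) / 2
    return k, mean / gamma(1 + 1 / k), iterations

true_mean, true_var = weibull_moments(k=2.0, a=3.0)
k_hat, a_hat, n_iter = fit_weibull(true_mean, true_var)
```

Recomputing the mean and variance from `k_hat` and `a_hat` with `weibull_moments` gives a check comparable to the last two return values in the table.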

Bernoulli Distribution (BERNDIST)

The Bernoulli distribution function determines the probability of success or failure (or Yes or No) in tests that have only two possible outcomes.

Syntax

stats:berndist("data", prob, "success_string")

Parameter	Type	Description
data	string	Column data.
prob	boolean	Probability of success (true) or failure (false).
success_string	string	The success message.

Returns

Type	Description
double	The Bernoulli distribution probability.

Beta-Binomial Distribution (BETABINDIST)

The Beta-binomial distribution function computes probability using a combination of both binomial and beta probability distributions.

Syntax

stats:betabindist(k, n, alpha, beta)

Parameter	Type	Description
k	double	The number of successes for which to compute the probability.
n	double	The number of trials.
alpha, beta	double	Shape parameters.

Returns

Type	Description
double	The probability of `k` successes under a beta-binomial distribution with parameters n, alpha, and beta.
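
The beta-binomial probability has a closed form in terms of the beta function. The Python sketch below evaluates it with log-gamma for numerical stability; the function names `log_beta` and `betabin_pmf` are hypothetical, and `k` and `n` are treated as integers.

```python
from math import comb, exp, lgamma

def log_beta(a: float, b: float) -> float:
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabin_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """P(X = k) = C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)."""
    return comb(n, k) * exp(log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

# With alpha = beta = 1 every outcome 0..n is equally likely.
bb_prob = betabin_pmf(k=2, n=4, alpha=1.0, beta=1.0)
```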

Hypergeometric Distribution (HYPGEODIST)

The Hypergeometric distribution function calculates probability from a distribution often used to predict the outcome of a process in which different elements are randomly drawn from a collection and not replaced.

Syntax

stats:hypgeodist("data", n, k, "success_string")

Parameter	Type	Description
data	string	Column data.
n	int	The number of trials.
k	int	The number of successes in n trials.
success_string	string	The success message.

Returns

Type	Description
double	The hypergeometric distribution probability.
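
The hypergeometric probability is a ratio of binomial coefficients. In the Python sketch below, the population size and the number of success items are passed explicitly as `population` and `successes`; it is an assumption that the aggregate derives them from `data` and the success string. The function name `hypgeo_pmf` is hypothetical.

```python
from math import comb

def hypgeo_pmf(k: int, population: int, successes: int, n: int) -> float:
    """P(exactly k successes in n draws without replacement)."""
    if k < 0 or k > n:
        return 0.0
    # Ways to pick k successes and n - k failures, over all ways to pick n.
    return comb(successes, k) * comb(population - successes, n - k) / comb(population, n)

# Hypothetical case: 5 items, 2 of them successes, draw 2, want exactly 1.
hg_prob = hypgeo_pmf(k=1, population=5, successes=2, n=2)
```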

Logarithmic (Series) Distribution (LOGSERDIST)

The Logarithmic (series) distribution function calculates probability using a discrete probability distribution derived from the Maclaurin series expansion of -ln(1 - p).

Syntax

stats:logserdist("data", k, "success_string")

Parameter	Type	Description
data	string	Column data.
k	long	The value for which to compute the probability.
success_string	string	The success message.

Returns

Type	Description
double	The logarithmic distribution probability.
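
The logarithmic distribution assigns probability `-p^k / (k * ln(1 - p))` to each integer `k >= 1`, which is each term of the series for `-ln(1 - p)` divided by the total. The Python sketch below evaluates that formula; the parameter `p` is passed explicitly (how it is derived from `data` is an assumption left open here) and the function name `logser_pmf` is hypothetical.

```python
from math import log

def logser_pmf(k: int, p: float) -> float:
    """P(X = k) for the logarithmic (series) distribution, k >= 1, 0 < p < 1."""
    if k < 1 or not 0.0 < p < 1.0:
        raise ValueError("require k >= 1 and 0 < p < 1")
    return -(p**k) / (k * log(1.0 - p))

ls_prob = logser_pmf(k=1, p=0.5)
```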

Skellam Distribution (SKELLAMDIST)

The Skellam distribution function calculates probability using the Skellam distribution, which models the difference between two independent Poisson-distributed variables.

Syntax

stats:skellamdist(n1_data, n2_data, k)

Parameter	Type	Description
n1_data	long	N1 column data.
n2_data	long	N2 column data.
k	long	The difference for which to compute the probability.

Returns

Type	Description
double	The Skellam probability.
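
Since a Skellam variable is the difference of two independent Poisson counts, its probability can be computed by summing over every pair of counts that differ by `k`. The Python sketch below does this with a truncated sum rather than the usual Bessel-function formula. The rates `mu1` and `mu2` are passed explicitly, on the assumption that they would be estimated from the two data columns, and the function names are hypothetical.

```python
from math import exp, lgamma, log

def poisson_pmf(j: int, mu: float) -> float:
    """Poisson probability of j events, computed in log space."""
    return exp(j * log(mu) - mu - lgamma(j + 1))

def skellam_pmf(k: int, mu1: float, mu2: float, terms: int = 200) -> float:
    """P(N1 - N2 = k) for independent Poisson N1 (rate mu1) and N2 (rate mu2)."""
    start = max(0, -k)                      # keeps j + k non-negative
    # Sum P(N1 = j + k) * P(N2 = j); 200 terms is ample for small rates.
    return sum(poisson_pmf(j + k, mu1) * poisson_pmf(j, mu2)
               for j in range(start, start + terms))

sk_prob = skellam_pmf(k=1, mu1=3.0, mu2=2.0)
```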