Distribution Functions

The distribution functions calculate probabilities for values drawn from a given probability distribution.

The URI for the data science functions is <http://cambridgesemantics.com/anzograph/statistics#>. For readability, the syntax for each function below includes the prefix stats:, defined as PREFIX stats: <http://cambridgesemantics.com/anzograph/statistics#>.

Cumulative Distribution Functions (CDF)

A cumulative distribution function (CDF) calculates the probability that a random variable X takes a value less than or equal to Y. The following functions produce cumulative distribution calculations:

  • Binomial Distribution (BINOMDIST): Calculates the probability for X successes in N trials given a probability of success P for each trial.
  • Chi-Squared Distribution (CHISQDIST): Calculates a probability that is often used in hypothesis testing to compare an observed distribution with a theoretical one. It also provides a way to show a relationship between two categorical variables.
  • Continuous Uniform Distribution (CONUNIDIST): Calculates probability using a continuous probability distribution concerned with events that are equally likely to occur.
  • Discrete Uniform Distribution (DISCUNIDIST): Calculates probability using a symmetric probability distribution in which each of a finite number of values is equally likely to be observed.
  • Exponential Distribution (EXPDIST): Calculates probability using a distribution that describes time between events in a Poisson point process (where events occur continuously and independently at a constant average rate).
  • Laplace Distribution (LAPLACEDIST): Calculates probability using a distribution that represents differences between two independent variables that have identical exponential distributions (also called double exponential distribution).
  • Log Normal Distribution (LOGNORDIST): Calculates probability using a distribution of a random variable whose logarithm follows a normal distribution. Log normal distributions are widely used in risk analysis.
  • Negative Binomial Distribution (NEGBINDIST): Calculates probability using a discrete probability distribution that concerns the number of trials which must occur in order to have a predetermined number of successes.
  • Normal Distribution (NORMDIST): Calculates probability using a continuous, symmetric probability distribution in which most data points fall within a small range of values around the mean, with few outliers.
  • Poisson Distribution (POISDIST): Calculates probability using a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, given that the events occur at a known constant rate and independently of the time since the last event.
  • Student's T-Distribution (TDIST): Calculates probability using the Student's t-distribution and associated t scores. Often used in hypothesis testing when the sample size is small and/or when the population variance is unknown.
  • TDigest Metric (TDIGEST): Creates an estimate of the median (and more generally, any percentile) from either distributed data or streaming data, using a t-Digest probabilistic data structure.
  • Weibull Distribution (WEIBULDIST): Calculates probability from a continuous probability distribution that is commonly used to assess product reliability, analyze product life data and failure times.

Binomial Distribution (BINOMDIST)

The Binomial distribution aggregate calculates the probability for X successes in N trials given a probability of success P for each trial.

Syntax

stats:binomdist(data, n, k, "success_string")

Parameter       Type    Description
data            string  Column data.
n               long    Number of trials.
k               long    Number of successes in n trials.
success_string  string  The string that identifies a success in the column data.

Returns

Type    Description
double  Probability mass function value.
double  Lower cumulative distribution: probability (<=k), the area under the distribution up to and including k.
double  Upper cumulative distribution: probability (>k), the area under the distribution above k.
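
The relationship between the three return values can be illustrated with a short Python sketch of the textbook binomial formulas. It assumes that the success probability P is estimated as the fraction of data values equal to success_string. That estimation step and the helper name binom_summary are illustrative assumptions, not a description of how the aggregate is implemented.

```python
from math import comb


def binom_summary(data, n, k, success_string):
    """Return (pmf, lower_cdf, upper_cdf) for k successes in n trials.

    Assumption: p is the share of rows in `data` equal to `success_string`.
    """
    p = sum(1 for value in data if value == success_string) / len(data)

    def pmf(i):
        return comb(n, i) * p**i * (1 - p) ** (n - i)

    lower = sum(pmf(i) for i in range(k + 1))  # P(X <= k)
    return pmf(k), lower, 1.0 - lower          # last value is P(X > k)


rows = ["yes", "no", "yes", "no"]  # hypothetical column data, so p = 0.5
print(binom_summary(rows, n=4, k=2, success_string="yes"))
```

The lower and upper cumulative values always sum to 1, so only one of them needs to be computed directly.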

Chi-Squared Distribution (CHISQDIST)

The Chi-squared distribution aggregate calculates probability that is often used in hypothesis testing to compare an observed distribution with a theoretical one. It also provides a way to show a relationship between two categorical variables.

Syntax

stats:chisqdist(data, s)

Parameter  Type    Description
data       double  Sample data.
s          double  Population standard deviation.

Returns

Type    Description
double  Mean of the distribution.
double  Standard deviation of the distribution.
double  Variance of the distribution.
double  Chi-squared statistic: [(n - 1) * s^2] / d^2, where d is the standard deviation of the population, s is the standard deviation of the sample, and n is the sample size. Note that in this formula s is the sample value, while the s parameter supplies the population value d.
long    Number of samples: the degrees of freedom (k) is (count - 1).
double  Probability density function value.
double  Cumulative distribution: the probability of a value <= the chi-squared statistic.
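
As a worked illustration of the chi-squared statistic formula in the table above, the following Python sketch computes the statistic and the degrees of freedom from a sample. The function name chisq_statistic and the sample values are hypothetical, and the sketch covers only the statistic, not the density or cumulative values.

```python
from statistics import stdev


def chisq_statistic(sample, population_sd):
    """Return (chi-squared statistic, degrees of freedom) for a sample."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 denominator)
    statistic = (n - 1) * s**2 / population_sd**2
    return statistic, n - 1


values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample data
print(chisq_statistic(values, population_sd=2.0))
```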

Continuous Uniform Distribution (CONUNIDIST)

The Continuous uniform distribution aggregate calculates probability using a continuous probability distribution concerned with events that are equally likely to occur.

Syntax

stats:conunidist(data, a, b)

Parameter  Type    Description
data       double  Column data.
a          double  Minimum value of the probability interval.
b          double  Maximum value of the probability interval.

Returns

Type    Description
double  Cumulative distribution: probability given by the area under the distribution.
double  Probability density function value.
double  Differential entropy in nats.
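
For a uniform distribution on the interval [a, b], all three return values have closed forms. The Python sketch below evaluates them at a single value x. How the aggregate combines the rows of data is not specified here, so the single-value interface and the name conuni are assumptions made for illustration.

```python
from math import log


def conuni(x, a, b):
    """Return (cdf, pdf, differential entropy in nats) of Uniform(a, b) at x."""
    if not a < b:
        raise ValueError("a must be less than b")
    cdf = min(max((x - a) / (b - a), 0.0), 1.0)  # clamp to [0, 1] outside [a, b]
    pdf = 1.0 / (b - a) if a <= x <= b else 0.0
    entropy = log(b - a)
    return cdf, pdf, entropy


print(conuni(2.5, a=0.0, b=10.0))
```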

Discrete Uniform Distribution (DISCUNIDIST)

The Discrete uniform distribution aggregate calculates probability using a symmetric probability distribution in which each of a finite number of values is equally likely to be observed.

Syntax

stats:discunidist(data, k)

Parameter  Type  Description
data       long  Column data.
k          long  The number of outcomes.

Returns

Type    Description
double  Cumulative distribution: probability given by the area under the distribution.
double  Probability mass function value.
double  Entropy in nats.

Exponential Distribution (EXPDIST)

The Exponential distribution aggregate calculates probability using a distribution that describes time between events in a Poisson point process (where events occur continuously and independently at a constant average rate).

Syntax

stats:expdist(data, x)

Parameter  Type    Description
data       long    Column data.
x          double  The value for which to find the probability.

Returns

Type    Description
double  Lower cumulative distribution: probability (<=x), the area under the distribution up to x.
double  Upper cumulative distribution: probability (>x), the area under the distribution above x.
double  Probability density function value.
double  Differential entropy in nats.
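
The four return values follow from the standard exponential formulas once a rate is known. The Python sketch below assumes the rate is estimated from the column data as 1 / mean, which is the usual maximum-likelihood estimate but is an assumption here, as is the helper name exp_summary.

```python
from math import exp, log
from statistics import fmean


def exp_summary(data, x):
    """Return (lower_cdf, upper_cdf, pdf, entropy_nats) evaluated at x."""
    rate = 1.0 / fmean(data)      # assumption: rate estimated as 1 / sample mean
    upper = exp(-rate * x)        # P(X > x)
    lower = 1.0 - upper           # P(X <= x)
    pdf = rate * upper
    entropy = 1.0 - log(rate)     # differential entropy in nats
    return lower, upper, pdf, entropy


waits = [1, 2, 3]  # hypothetical times between events, mean 2
print(exp_summary(waits, x=2.0))
```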

Laplace Distribution (LAPLACEDIST)

The Laplace distribution aggregate calculates probability using a distribution that represents differences between two independent variables that have identical exponential distributions (also called double exponential distribution).

Syntax

stats:laplacedist(data, "c", x1, x2)

Parameter  Type    Description
data       double  Column data.
c          string  The region for which to find the probability: "below", "above", "bet" (between), or "out" (outside).
x1         double  Lower bound (>0) of the region.
x2         double  Upper bound (>0) of the region.

Returns

Type    Description
double  Mean of the distribution.
double  Scale parameter of the distribution.
double  Standard deviation of the distribution.
double  Variance of the distribution.
double  Differential entropy in nats.
double  Cumulative distribution: probability given by the area under the distribution.
double  Probability density function value for x1.
double  Probability density function value for x2.
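
Given a location mu and a scale b, the Laplace density, cumulative probability, and entropy have simple closed forms, sketched below in Python. The function names are hypothetical, and the sketch takes mu and b as inputs rather than guessing how the aggregate estimates them from data.

```python
from math import exp, log


def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution with location mu and scale b."""
    return exp(-abs(x - mu) / b) / (2.0 * b)


def laplace_cdf(x, mu, b):
    """P(X <= x) for the Laplace distribution."""
    if x < mu:
        return 0.5 * exp((x - mu) / b)
    return 1.0 - 0.5 * exp(-(x - mu) / b)


def laplace_entropy(b):
    """Differential entropy in nats; the variance is 2 * b**2."""
    return 1.0 + log(2.0 * b)


print(laplace_cdf(1.0, mu=0.0, b=1.0), laplace_pdf(1.0, mu=0.0, b=1.0))
```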

Log Normal Distribution (LOGNORDIST)

The Log-normal distribution aggregate calculates probability using the distribution of a random variable whose logarithm follows a normal distribution. The log-normal distribution is widely used in risk analysis.

Syntax

stats:lognordist(data, "c", x1, x2)

Parameter  Type    Description
data       double  Column data.
c          string  The region for which to find the probability: "below", "above", "bet" (between), or "out" (outside).
x1         double  Lower bound (>0) of the region.
x2         double  Upper bound (>0) of the region.

Returns

Type    Description
double  Mean of the distribution of the natural logarithms.
double  Standard deviation of the distribution of the natural logarithms.
double  Variance of the distribution.
double  Differential entropy in nats.
double  Cumulative distribution: probability given by the area under the distribution.
double  Probability density function value for x1.
double  Probability density function value for x2.
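
Because the logarithm of a log-normal variable is normally distributed, the cumulative probability can be computed by taking logs and using a normal CDF. The Python sketch below shows this for the "below" case only. It assumes the sample standard deviation (n - 1 denominator) of the logs is used, and the name lognormal_below is hypothetical.

```python
from math import exp, log
from statistics import NormalDist, fmean, stdev


def lognormal_below(data, x):
    """Return (mean of logs, std dev of logs, P(X <= x))."""
    logs = [log(value) for value in data]   # all values must be > 0
    mu, sigma = fmean(logs), stdev(logs)    # assumption: sample std dev
    return mu, sigma, NormalDist(mu, sigma).cdf(log(x))


sample = [exp(0), exp(1), exp(2)]  # hypothetical data whose logs are 0, 1, 2
print(lognormal_below(sample, x=exp(1)))
```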

Negative Binomial Distribution (NEGBINDIST)

The Negative binomial distribution aggregate calculates probability using a discrete probability distribution that concerns the number of trials which must occur in order to have a predetermined number of successes.

Syntax

stats:negbindist("data", k, r, "success_string")

Parameter       Type    Description
data            string  Column data.
k               long    Number of successes.
r               long    Number of failures.
success_string  string  The string that identifies a success in the column data.

Returns

Type    Description
double  Probability mass function value.
double  Lower cumulative distribution: probability (<=k), the area under the distribution up to and including k.
double  Upper cumulative distribution: probability (>k), the area under the distribution above k.
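
The negative binomial distribution has several common parameterizations. The Python sketch below assumes the one in which the variable counts the successes observed before the r-th failure, which matches the parameter names k and r but may differ from the convention the aggregate uses. The success probability p is passed in directly, and the function names are hypothetical.

```python
from math import comb


def negbin_pmf(k, r, p):
    """P(exactly k successes before the r-th failure), success probability p."""
    return comb(k + r - 1, k) * p**k * (1 - p) ** r


def negbin_lower_cdf(k, r, p):
    """P(at most k successes before the r-th failure)."""
    return sum(negbin_pmf(i, r, p) for i in range(k + 1))


print(negbin_pmf(2, 3, 0.5), negbin_lower_cdf(2, 3, 0.5))
```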

Normal Distribution (NORMDIST)

The Normal distribution aggregate calculates probability using a continuous, symmetric probability distribution in which most data points fall within a small range of values around the mean, with few outliers.

Syntax

stats:normdist(data, "c", x1, x2)

Parameter  Type    Description
data       double  Column data.
c          string  The region for which to find the probability: "below", "above", "bet" (between), or "out" (outside).
x1         double  Lower bound (>0) of the region.
x2         double  Upper bound (>0) of the region.

Returns

Type    Description
double  Mean of the distribution.
double  Standard deviation of the distribution.
double  Variance of the distribution.
double  Differential entropy in nats.
double  Cumulative distribution: probability given by the area under the distribution.
double  Probability density function value for x1.
double  Probability density function value for x2.
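
The four region options can be expressed in terms of the normal CDF at the two bounds. The Python sketch below fits a normal distribution to the data and returns the probability for the chosen region. It assumes that "below" means P(X <= x1) and "above" means P(X > x2). Those interpretations, the use of the sample standard deviation, and the name normal_region are assumptions rather than documented behavior.

```python
from statistics import NormalDist


def normal_region(data, c, x1, x2):
    """Probability of the region selected by c for a normal fit to data."""
    dist = NormalDist.from_samples(data)  # sample mean and sample std dev
    lower, upper = dist.cdf(x1), dist.cdf(x2)
    if c == "below":
        return lower                      # assumption: P(X <= x1)
    if c == "above":
        return 1.0 - upper                # assumption: P(X > x2)
    if c == "bet":
        return upper - lower              # P(x1 < X <= x2)
    if c == "out":
        return 1.0 - (upper - lower)      # everything outside (x1, x2]
    raise ValueError(f"unknown region: {c!r}")


heights = [9.0, 10.0, 11.0]  # hypothetical data: mean 10, sample std dev 1
print(normal_region(heights, "bet", 9.0, 11.0))
```

Under these assumptions, "out" equals "below" plus "above", and "bet" and "out" sum to 1.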

Poisson Distribution (POISDIST)

The Poisson distribution function calculates probability using a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, given that the events occur at a known constant rate and independently of the time since the last event.

Syntax

stats:poisdist(data, k)

Parameter  Type  Description
data       long  Column data.
k          long  The number of events, k, for which to find the probability of observation in an interval.

Returns

Type    Description
double  Probability mass function value.
double  Lower cumulative distribution: probability (<=k), the area under the distribution up to and including k.
double  Upper cumulative distribution: probability (>k), the area under the distribution above k.
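
A minimal Python sketch of the three return values follows, using the textbook Poisson formulas. It assumes that the event rate is estimated as the mean of the column data. That assumption and the name poisson_summary are illustrative only.

```python
from math import exp, factorial
from statistics import fmean


def poisson_summary(data, k):
    """Return (pmf, lower_cdf, upper_cdf) for observing k events."""
    lam = fmean(data)  # assumption: rate estimated as the sample mean

    def pmf(i):
        return exp(-lam) * lam**i / factorial(i)

    lower = sum(pmf(i) for i in range(k + 1))  # P(X <= k)
    return pmf(k), lower, 1.0 - lower          # last value is P(X > k)


counts = [2, 2, 2]  # hypothetical event counts per interval, mean 2
print(poisson_summary(counts, k=2))
```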

Student's T-Distribution (TDIST)

The Student's t-distribution function calculates probability using the Student's t-distribution and associated t scores, which are often used in hypothesis testing when the sample size is small and/or when the population variance is unknown.

Syntax

stats:tdist(data, m)

Parameter  Type    Description
data       double  Sample data.
m          double  Population mean.

Returns

Type    Description
double  Mean of the distribution.
double  Standard deviation of the distribution.
double  Variance of the distribution.
double  T-statistic: t = [ u - M ] / [ s / sqrt( N ) ], where u is the sample mean, M is the population mean, s is the standard deviation of the sample, and N is the sample size.
long    Number of samples: the degrees of freedom is (count - 1).
double  Probability density function value.
double  Cumulative distribution: the probability of a value <= the t-statistic.
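
The t-statistic formula in the table above can be checked by hand with a few lines of Python. The sketch computes only the statistic and the degrees of freedom. The name t_statistic and the sample values are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev


def t_statistic(sample, population_mean):
    """Return (t, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    u = fmean(sample)   # sample mean
    s = stdev(sample)   # sample standard deviation (n - 1 denominator)
    t = (u - population_mean) / (s / sqrt(n))
    return t, n - 1


values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample data, mean 5
print(t_statistic(values, population_mean=4.0))
```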

TDigest Metric (TDIGEST)

This function creates an estimate of the median (and more generally, any percentile) from either distributed data or streaming data, using a t-Digest probabilistic data structure. For background information about this function, see Computing quantiles using t-Digests.

Syntax

stats:tdigest(data, p, q, cdf)

Parameter  Type    Description
data       double  Column data.
p          double  The percentile (0 - 100) to compute.
q          double  The quantile (0.0 - 1.0) to compute.
cdf        double  The value x at which to evaluate the CDF.

Returns

Type    Description
double  Percentile: the value below which a given percentage of observations falls.
double  Quantile: a cut point dividing the observations in a sample.
double  The computation of F(x), where F is the CDF of the distribution.
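
To make the three outputs concrete, the Python sketch below computes exact versions of them on a small in-memory list. This is not a t-Digest: a t-Digest approximates the same quantities from a compact summary so that the full data set never has to be sorted in one place. The function names and the linear-interpolation rule are assumptions chosen for illustration.

```python
from bisect import bisect_right


def exact_quantile(data, q):
    """Quantile q in [0, 1] with linear interpolation between order statistics."""
    xs = sorted(data)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def exact_percentile(data, p):
    """Percentile p in [0, 100]; equivalent to the quantile p / 100."""
    return exact_quantile(data, p / 100.0)


def empirical_cdf(data, x):
    """F(x): the fraction of observations less than or equal to x."""
    xs = sorted(data)
    return bisect_right(xs, x) / len(xs)


values = [5, 1, 4, 2, 3]  # hypothetical observations
print(exact_percentile(values, 25), exact_quantile(values, 0.5), empirical_cdf(values, 3))
```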

Weibull Distribution (WEIBULDIST)

The Weibull distribution function calculates probability from a continuous probability distribution commonly used to assess product reliability and analyze product life data and failure times.

Syntax

stats:weibuldist(data, k, x)

Parameter  Type    Description
data       double  Sample data.
k          double  The initial starting value for the shape parameter. A good guess is crucial to quick convergence.
x          double  The value of the random variable for which to find the probability.

Returns

Type    Description
double  The mean of the distribution.
double  The standard deviation of the distribution.
double  The variance of the distribution.
long    The count of the number of samples.
double  The estimated shape parameter (k) of the distribution from the mean and variance using the root finding method.
double  The estimated scale parameter (a) of the distribution from the mean and variance using the root finding method.
double  Differential entropy in nats.
double  Probability density function value.
double  Lower cumulative distribution: probability (<=x), the area under the distribution up to x.
double  Upper cumulative distribution: probability (>x), the area under the distribution above x.
long    The actual number of iterations performed to get an estimate of the k value.
double  The mean calculated using estimated values of k and a.
double  The variance calculated using estimated values of k and a.
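
Estimating the shape k and scale a from a mean and variance amounts to solving Gamma(1 + 2/k) / Gamma(1 + 1/k)^2 - 1 = variance / mean^2 for k, then setting a = mean / Gamma(1 + 1/k). The Python sketch below solves this with plain bisection. The root-finding method the function actually uses is not stated, and its need for a starting value suggests an iterative scheme such as Newton's method, so bisection here is a swapped-in technique and the names are hypothetical.

```python
from math import exp, gamma


def weibull_fit(mean, variance, lo=0.1, hi=50.0, tol=1e-10):
    """Return (k, a, iterations) matching the given mean and variance."""
    target = variance / mean**2  # squared coefficient of variation

    def cv2(k):
        return gamma(1 + 2 / k) / gamma(1 + 1 / k) ** 2 - 1

    if not cv2(hi) < target < cv2(lo):
        raise ValueError("shape parameter lies outside the search bracket")
    # cv2 decreases as k grows, so the root stays bracketed by [lo, hi]
    for iterations in range(1, 201):
        mid = (lo + hi) / 2
        if cv2(mid) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    k = (lo + hi) / 2
    a = mean / gamma(1 + 1 / k)
    return k, a, iterations


def weibull_lower_cdf(x, k, a):
    """P(X <= x) for shape k and scale a."""
    return 1.0 - exp(-((x / a) ** k)) if x > 0 else 0.0


k, a, steps = weibull_fit(mean=2.0, variance=4.0)  # exponential case, k near 1
print(k, a, steps, weibull_lower_cdf(2.0, k, a))
```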

Bernoulli Distribution (BERNDIST)

The Bernoulli distribution function determines the probability of success or failure (or Yes or No) in tests that have only two possible outcomes.

Syntax

stats:berndist("data", prob, "success_string")

Parameter       Type     Description
data            string   Column data.
prob            boolean  Whether to return the probability of success (true) or of failure (false).
success_string  string   The string that identifies a success in the column data.

Returns

Type    Description
double  The Bernoulli distribution probability.

Beta-Binomial Distribution (BETABINDIST)

The Beta-binomial distribution function computes probability using a combination of both binomial and beta probability distributions.

Syntax

stats:betabindist(k, n, alpha, beta)

Parameter    Type    Description
k            double  The number of successes for which to find the probability.
n            double  The number of trials.
alpha, beta  double  Shape parameters.

Returns

Type    Description
double  The probability of k occurrences for a beta-binomial distribution with parameters n, alpha, and beta.
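
The beta-binomial probability is C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta), where B is the beta function. The Python sketch below evaluates it through log-gamma values for numerical stability. It takes k and n as integers, and the function names are hypothetical.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabin_pmf(k, n, alpha, beta):
    """P(k successes in n trials) under a beta-binomial distribution."""
    log_ratio = log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return comb(n, k) * exp(log_ratio)


print(betabin_pmf(2, 4, alpha=1.0, beta=1.0))
```

With alpha = beta = 1 the distribution is uniform over 0..n, which gives a quick sanity check of the formula.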

Hypergeometric Distribution (HYPGEODIST)

The Hypergeometric distribution function calculates probability from a distribution often used to predict the outcome of a process in which different elements are randomly drawn from a collection and not replaced.

Syntax

stats:hypgeodist("data", n, k, "success_string")

Parameter       Type    Description
data            string  Column data.
n               int     The number of trials.
k               int     The number of successes in n trials.
success_string  string  The string that identifies a success in the column data.

Returns

Type    Description
double  The hypergeometric distribution probability.
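
The hypergeometric probability of k successes in n draws without replacement is C(K, k) * C(N - K, n - k) / C(N, n), where N is the population size and K is the number of successes in the population. The Python sketch below assumes that the column data is the population and that K is the count of rows equal to success_string. Both points, and the name hypgeo_pmf, are assumptions made for illustration.

```python
from math import comb


def hypgeo_pmf(data, n, k, success_string):
    """P(exactly k successes in n draws without replacement from data)."""
    population = len(data)                                         # N
    successes = sum(1 for value in data if value == success_string)  # K
    if not 0 <= k <= n <= population:
        raise ValueError("require 0 <= k <= n <= len(data)")
    ways = comb(successes, k) * comb(population - successes, n - k)
    return ways / comb(population, n)


marbles = ["red"] * 5 + ["blue"] * 5  # hypothetical population
print(hypgeo_pmf(marbles, n=2, k=1, success_string="red"))
```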

Logarithmic (Series) Distribution (LOGSERDIST)

The Logarithmic (series) distribution function calculates probability using a discrete probability distribution derived from the Maclaurin series expansion of -ln(1 - p).

Syntax

stats:logserdist("data", k, "success_string")

Parameter       Type    Description
data            string  Column data.
k               long    The number for which to find the probability.
success_string  string  The string that identifies a success in the column data.

Returns

Type    Description
double  The logarithmic distribution probability.
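
For a parameter p between 0 and 1, the logarithmic series probability of the value k (k = 1, 2, ...) is -p^k / (k * ln(1 - p)). The Python sketch below evaluates that formula directly. It takes p as an input rather than guessing how the function derives it from the column data, and the name logser_pmf is hypothetical.

```python
from math import log


def logser_pmf(k, p):
    """P(X = k) for the logarithmic series distribution with parameter p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return -(p**k) / (k * log(1.0 - p))


print(logser_pmf(1, 0.5), logser_pmf(2, 0.5))
```

The probabilities sum to 1 because the series p + p^2/2 + p^3/3 + ... equals -ln(1 - p).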

Skellam Distribution (SKELLAMDIST)

The Skellam distribution function calculates probability using the Skellam distribution which models the difference between two independent Poisson distributed variables.

Syntax

stats:skellamdist(n1_data, n2_data, k)

Parameter  Type  Description
n1_data    long  N1 column data.
n2_data    long  N2 column data.
k          long  The difference for which to find the probability.

Returns

Type    Description
double  The Skellam probability.
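
Since a Skellam variable is the difference N1 - N2 of two independent Poisson variables, its probability at k can be written as a sum over all pairs of counts whose difference is k. The Python sketch below truncates that sum after a fixed number of terms. It takes the two Poisson means mu1 and mu2 as inputs. Estimating them as the means of n1_data and n2_data would be one plausible choice, but that is an assumption, as are the function names.

```python
from math import exp, lgamma, log


def poisson_pmf(i, mu):
    """P(N = i) for a Poisson variable with mean mu > 0, computed in log space."""
    return exp(i * log(mu) - mu - lgamma(i + 1))


def skellam_pmf(k, mu1, mu2, terms=200):
    """P(N1 - N2 = k), summing over N2 = j and N1 = j + k."""
    start = max(0, -k)  # keeps j + k non-negative
    return sum(
        poisson_pmf(j + k, mu1) * poisson_pmf(j, mu2)
        for j in range(start, start + terms)
    )


print(skellam_pmf(0, mu1=1.0, mu2=1.0))
```

The truncation at 200 terms is adequate for small means. Larger means would need more terms or a Bessel-function formulation.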