Distribution Functions
The distribution functions calculate the probability of a given value over a random distribution.
The URI for the data science functions is <http://cambridgesemantics.com/anzograph/statistics#>. For readability, the syntax for each function below includes the prefix stats:, defined as PREFIX stats: <http://cambridgesemantics.com/anzograph/statistics#>.
Cumulative Distribution Functions (CDF)
A cumulative distribution function calculates the probability of a random variable X taking on a value less than or equal to Y. The following functions produce cumulative distribution calculations:
- Binomial Distribution (BINOMDIST): Calculates the probability for X successes in N trials given a probability of success P for each trial.
- Chi-Squared Distribution (CHISQDIST): Calculates probability often used in hypothesis testing to compare an observed distribution with a theoretical one. Also provides a way to show a relationship between two categorical variables.
- Continuous Uniform Distribution (CONUNIDIST): Calculates probability using continuous probability distribution concerned with events that are equally likely to occur.
- Discrete Uniform Distribution (DISCUNIDIST): Calculates probability using symmetric probability distribution where a finite number of values are equally likely to be observed and every one of n values has equal probability.
- Exponential Distribution (EXPDIST): Calculates probability using a distribution that describes time between events in a Poisson point process (where events occur continuously and independently at a constant average rate).
- Laplace Distribution (LAPLACEDIST): Calculates probability using a distribution that represents differences between two independent variables that have identical exponential distributions (also called double exponential distribution).
- Log Normal Distribution (LOGNORDIST): Calculates probability using a distribution of a random variable whose logarithm follows a normal distribution. Log normal distributions are widely used in risk analysis.
- Negative Binomial Distribution (NEGBINDIST): Calculates probability using a discrete probability distribution that concerns the number of trials which must occur in order to have a predetermined number of successes.
- Normal Distribution (NORMDIST): Calculates probability using a continuous probability distribution of data in which the majority of data points are relatively similar, within a small range of values having few outliers.
- Poisson Distribution (POISDIST): Calculates probability using a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, provided that the events occur at a known constant rate and independently of the time since the last event.
- Student's T-Distribution (TDIST): Calculates probability using the Student's t-distribution and associated t scores. Often used in hypothesis testing when the sample size is small and/or when the population variance is unknown.
- TDigest Metric (TDIGEST): Creates an estimate of the median (and more generally, any percentile) from either distributed data or streaming data, using a t-Digest probabilistic data structure.
- Weibull Distribution (WEIBULDIST): Calculates probability from a continuous probability distribution that is commonly used to assess product reliability, analyze product life data and failure times.
Binomial Distribution (BINOMDIST)
The Binomial distribution aggregate calculates the probability for X successes in N trials given a probability of success P for each trial.
Syntax
stats:binomdist(data, n, k, "success_string")
data | string | Column data.
n | long | Number of trials.
k | long | Number of successes in n trials.
success_string | string | Defines the success string.
Returns
double | Probability mass function value.
double | Lower cumulative distribution: probability (<=k) under the area of distribution.
double | Upper cumulative distribution: probability (>k) under the area of distribution.
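The relationship between the three returned values can be sketched in plain Python. This is only an illustration of the underlying formulas, not AnzoGraph's implementation: the function name `binomial_summary` is hypothetical, and the success probability `p` is passed in directly, whereas the aggregate presumably derives it from the column data and the success string.

```python
from math import comb

def binomial_summary(n: int, k: int, p: float) -> tuple[float, float, float]:
    """Return (pmf, lower_cdf, upper_cdf) for k successes in n trials."""
    def pmf(i: int) -> float:
        # Probability of exactly i successes in n independent trials.
        return comb(n, i) * p**i * (1 - p) ** (n - i)

    lower = sum(pmf(i) for i in range(k + 1))  # P(X <= k)
    return pmf(k), lower, 1.0 - lower          # P(X > k) is the complement

pmf, lower, upper = binomial_summary(n=10, k=3, p=0.5)
```

The lower and upper cumulative values always sum to 1, which is a quick sanity check on any result.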
Chi-Squared Distribution (CHISQDIST)
The Chi-squared distribution aggregate calculates probability that is often used in hypothesis testing to compare an observed distribution with a theoretical one. It also provides a way to show a relationship between two categorical variables.
Syntax
stats:chisqdist(data, s)
data | double | Sample data.
s | double | Population standard deviation.
Returns
double | Mean of the distribution.
double | Standard deviation of the distribution.
double | Variance of the distribution.
double | Chi-squared statistic: [(n - 1) * s^2] / d^2 where d is the standard deviation of the population, s is the standard deviation of the sample, and n is the sample size.
long | Number of samples: the degrees of freedom (k) is (count - 1).
double | Probability mass function value.
double | Cumulative distribution: the probability for <= the chi-squared statistic.
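The chi-squared statistic formula quoted in the return values can be worked through with a small Python sketch. The helper name `chi_squared_statistic` is hypothetical, and the sketch assumes the sample standard deviation uses the usual n - 1 denominator. Note that the population standard deviation is called d in the formula but is passed as the argument s in the function syntax.

```python
from statistics import stdev

def chi_squared_statistic(sample: list[float], population_sd: float) -> tuple[float, int]:
    """Return (chi-squared statistic, degrees of freedom) for a sample."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 denominator)
    statistic = (n - 1) * s**2 / population_sd**2
    return statistic, n - 1

stat, dof = chi_squared_statistic([2, 4, 4, 4, 5, 5, 7, 9], population_sd=2.0)
```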
Continuous Uniform Distribution (CONUNIDIST)
The Continuous uniform distribution aggregate calculates probability using a continuous probability distribution concerned with events that are equally likely to occur.
Syntax
stats:conunidist(data, a, b)
data | double | Column data.
a | double | Minimum value of the probability interval.
b | double | Maximum value of the probability interval.
Returns
double | Cumulative distribution: probability under the area of distribution.
double | Probability density function value.
double | Differential entropy in nats.
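For a single value x, the three quantities listed above have simple closed forms. The following Python sketch shows them under the assumption that they are evaluated at one point; how the aggregate combines the column data is not specified here, and the name `uniform_summary` is hypothetical.

```python
from math import log

def uniform_summary(x: float, a: float, b: float) -> tuple[float, float, float]:
    """Return (cdf, pdf, entropy) of the uniform distribution on [a, b] at x."""
    if not a < b:
        raise ValueError("a must be smaller than b")
    cdf = min(max((x - a) / (b - a), 0.0), 1.0)   # clamp outside the interval
    pdf = 1.0 / (b - a) if a <= x <= b else 0.0   # constant density inside [a, b]
    entropy = log(b - a)                          # differential entropy in nats
    return cdf, pdf, entropy

cdf, pdf, entropy = uniform_summary(2.5, a=0.0, b=10.0)
```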
Discrete Uniform Distribution (DISCUNIDIST)
The Discrete uniform distribution aggregate calculates probability using symmetric probability distribution where a finite number of values are equally likely to be observed and every one of n values has equal probability.
Syntax
stats:discunidist(data, k)
data | long | Column data.
k | long | The number of outcomes.
Returns
double | Cumulative distribution: probability under the area of distribution.
double | Probability density function value.
double | Differential entropy in nats.
Exponential Distribution (EXPDIST)
The Exponential distribution aggregate calculates probability using a distribution that describes time between events in a Poisson point process (where events occur continuously and independently at a constant average rate).
Syntax
stats:expdist(data, x)
data | long | Column data.
x | double | The probability for the interval.
Returns
double | Lower cumulative distribution: probability (<=k) under the area of distribution.
double | Upper cumulative distribution: probability (>k) under the area of distribution.
double | Probability density function value.
double | Differential entropy in nats.
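As a rough illustration of the four returned quantities, the sketch below evaluates the standard exponential formulas at a point x. It assumes the rate parameter is already known; how the aggregate estimates the rate from the column data is not documented here, so both `rate` and the name `exponential_summary` are assumptions made for the example.

```python
from math import exp, log

def exponential_summary(rate: float, x: float) -> tuple[float, float, float, float]:
    """Return (lower_cdf, upper_cdf, pdf, entropy) for an exponential distribution."""
    if rate <= 0 or x < 0:
        raise ValueError("rate must be positive and x non-negative")
    upper = exp(-rate * x)       # P(X > x), the survival function
    lower = 1.0 - upper          # P(X <= x)
    pdf = rate * upper           # density at x
    entropy = 1.0 - log(rate)    # differential entropy in nats
    return lower, upper, pdf, entropy

lower, upper, pdf, entropy = exponential_summary(rate=0.5, x=2.0)
```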
Laplace Distribution (LAPLACEDIST)
The Laplace distribution aggregate calculates probability using a distribution that represents differences between two independent variables that have identical exponential distributions (also called double exponential distribution).
Syntax
stats:laplacedist(data, "c", x1, x2)
data | double | Column data.
c | string | "below", "above", "bet" (between), or "out" (outside).
x1 | double | Lower number (>0) to find the probability.
x2 | double | Upper number (>0) to find the probability.
Returns
double | Mean of the distribution.
double | Scale parameter of the distribution.
double | Standard deviation of the distribution.
double | Variance of the distribution.
double | Differential entropy in nats.
double | Cumulative distribution: probability under the area of distribution.
double | Probability density function value for x1.
double | Probability density function value for x2.
Log Normal Distribution (LOGNORDIST)
The Log-normal distribution aggregate calculates probability using the distribution of a random variable whose logarithm follows a normal distribution. The log-normal distribution is widely used in risk analysis.
Syntax
stats:lognordist(data, "c", x1, x2)
data | double | Column data.
c | string | "below", "above", "bet" (between), or "out" (outside).
x1 | double | Lower number (>0) to find the probability.
x2 | double | Upper number (>0) to find the probability.
Returns
double | Mean of the distribution of the natural logarithms.
double | Standard deviation of the distribution of the natural logarithms.
double | Variance of the distribution.
double | Differential entropy in nats.
double | Cumulative distribution: probability under the area of distribution.
double | Probability density function value for x1.
double | Probability density function value for x2.
Negative Binomial Distribution (NEGBINDIST)
The Negative binomial distribution aggregate calculates probability using a discrete probability distribution that concerns the number of trials which must occur in order to have a predetermined number of successes.
Syntax
stats:negbindist("data", k, r, "success_string")
data | string | Column data.
k | long | Number of successes.
r | long | Number of failures.
success_string | string | Defines the success string.
Returns
double | Probability mass function value.
double | Lower cumulative distribution: probability (<=k) under the area of distribution.
double | Upper cumulative distribution: probability (>k) under the area of distribution.
Normal Distribution (NORMDIST)
The Normal distribution aggregate calculates probability using a continuous probability distribution of data in which the majority of data points are relatively similar, within a small range of values with few outliers.
Syntax
stats:normdist(data, "c", x1, x2)
data | double | Column data.
c | string | "below", "above", "bet" (between), or "out" (outside).
x1 | double | Lower number (>0) to find the probability.
x2 | double | Upper number (>0) to find the probability.
Returns
double | Mean of the distribution.
double | Standard deviation of the distribution.
double | Variance of the distribution.
double | Differential entropy in nats.
double | Cumulative distribution: probability under the area of distribution.
double | Probability density function value for x1.
double | Probability density function value for x2.
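One plausible reading of the c argument is that it selects which region of the curve the cumulative probability covers. The Python sketch below illustrates that reading with the standard library's `NormalDist`. It is an assumption, not documented behavior, that "below" and "above" are evaluated at x1, and the mean and standard deviation are passed in directly here rather than estimated from column data. The name `normal_region_probability` is hypothetical.

```python
from statistics import NormalDist

def normal_region_probability(mean: float, sd: float, c: str, x1: float, x2: float) -> float:
    """Probability of the region selected by c for a normal distribution."""
    dist = NormalDist(mean, sd)
    lo, hi = dist.cdf(x1), dist.cdf(x2)
    if c == "below":
        return lo                  # P(X <= x1), assumed to use x1
    if c == "above":
        return 1.0 - lo            # P(X > x1), assumed to use x1
    if c == "bet":
        return hi - lo             # P(x1 < X <= x2)
    if c == "out":
        return 1.0 - (hi - lo)     # everything outside (x1, x2]
    raise ValueError(f"unknown region selector: {c!r}")

within_one_sd = normal_region_probability(0.0, 1.0, "bet", -1.0, 1.0)
```

The same region logic would apply to the Laplace and log-normal functions, which take the same c, x1, and x2 arguments, with the corresponding cumulative distribution substituted.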
Poisson Distribution (POISDIST)
The Poisson distribution function calculates probability using a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, provided that the events occur at a known constant rate and independently of the time since the last event.
Syntax
stats:poisdist(data, k)
data | long | Column data.
k | long | Probability of observing k events in an interval.
Returns
double | Probability mass function value.
double | Lower cumulative distribution: probability (<=k) under the area of distribution.
double | Upper cumulative distribution: probability (>k) under the area of distribution.
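The three Poisson quantities can be illustrated with a short Python sketch. The event rate is supplied directly as `rate`, which is an assumption for the example. The aggregate presumably derives the rate from the column data, and the name `poisson_summary` is hypothetical.

```python
from math import exp, factorial

def poisson_summary(rate: float, k: int) -> tuple[float, float, float]:
    """Return (pmf, lower_cdf, upper_cdf) for observing k events at the given rate."""
    def pmf(i: int) -> float:
        # Probability of exactly i events in the interval.
        return rate**i * exp(-rate) / factorial(i)

    lower = sum(pmf(i) for i in range(k + 1))  # P(X <= k)
    return pmf(k), lower, 1.0 - lower          # P(X > k) is the complement

pmf, lower, upper = poisson_summary(rate=2.0, k=2)
```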
Student's T-Distribution (TDIST)
The Student's t-distribution function calculates probability using the Student's t-distribution (and associated t scores), which is often used in hypothesis testing when the sample size is small or when the population variance is unknown.
Syntax
stats:tdist(data, m)
data | double | Sample data.
m | double | Population mean.
Returns
double | Mean of the distribution.
double | Standard deviation of the distribution.
double | Variance of the distribution.
double | T-statistic: t = [ u - M ] / [ s / sqrt( N ) ] where u is the sample mean, M is the population mean, s is the standard deviation of the sample, and N is the sample size.
long | Number of samples: the degrees of freedom is (count - 1).
double | Probability mass function value.
double | Cumulative distribution: the probability for <= the t-statistic.
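The t-statistic formula quoted in the return values can be checked by hand with a few lines of Python. The helper name `t_statistic` is hypothetical, and the sketch assumes the sample standard deviation uses the n - 1 denominator.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample: list[float], population_mean: float) -> tuple[float, int]:
    """Return (t statistic, degrees of freedom) for a one-sample comparison."""
    n = len(sample)
    u = mean(sample)      # sample mean
    s = stdev(sample)     # sample standard deviation (n - 1 denominator)
    t = (u - population_mean) / (s / sqrt(n))
    return t, n - 1

t, dof = t_statistic([2, 4, 4, 4, 5, 5, 7, 9], population_mean=4.0)
```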
TDigest Metric (TDIGEST)
This function creates an estimate of the median (and more generally, any percentile) from either distributed data or streaming data, using a t-Digest probabilistic data structure. For background information about this function, see Computing quantiles using t-Digests.
Syntax
stats:tdigest(data, p, q, cdf)
data | double | Column data.
p | double | The percentile (0 - 100) to compute.
q | double | The quantile (0.0 - 1.0) to compute.
cdf | double | The CDF to use.
Returns
double | Percentile: the value below which a given percentage of observations falls.
double | Quantile: cut point dividing the observations in a sample.
double | The computation of F(x) where F is the CDF of the distribution.
Weibull Distribution (WEIBULDIST)
The Weibull distribution function calculates probability from a continuous probability distribution commonly used to assess product reliability and analyze product life data and failure times.
Syntax
stats:weibuldist(data, k, x)
data | double | Sample data.
k | double | The initial starting value for the shape parameter. A good guess is crucial to quick convergence.
x | double | The probability for a random variable.
Returns
double | The mean of the distribution.
double | The standard deviation of the distribution.
double | The variance of the distribution.
long | The count of the number of samples.
double | The estimated shape parameter (k) of the distribution from the mean and variance using the root finding method.
double | The estimated scale parameter (a) of the distribution from the mean and variance using the root finding method.
double | Differential entropy in nats.
double | Probability density function value.
double | Lower cumulative distribution: probability (<=x) under the area of distribution.
double | Upper cumulative distribution: probability (>x) under the area of distribution.
long | The actual number of iterations performed to get an estimate of the k value.
double | The mean calculated using estimated values of k and a.
double | The variance calculated using estimated values of k and a.
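To show what estimating the shape and scale from the mean and variance can look like, here is a minimal Python sketch. It uses plain bisection on the squared coefficient of variation, which is a stand-in chosen for simplicity. The product's own root finding method, which starts from the initial guess k, is not documented here and may differ. The function name `fit_weibull` and the bracket values are assumptions.

```python
from math import gamma

def fit_weibull(mean: float, variance: float,
                k_lo: float = 0.1, k_hi: float = 50.0,
                tol: float = 1e-10, max_iter: int = 200) -> tuple[float, float, int]:
    """Estimate (shape k, scale a, iterations) from a mean and variance."""
    target = variance / mean**2  # squared coefficient of variation

    def cv2(k: float) -> float:
        # For a Weibull distribution: Var / Mean^2 = G(1 + 2/k) / G(1 + 1/k)^2 - 1
        g1 = gamma(1 + 1 / k)
        return gamma(1 + 2 / k) / g1**2 - 1

    iterations = 0
    while k_hi - k_lo > tol and iterations < max_iter:
        mid = (k_lo + k_hi) / 2
        if cv2(mid) > target:   # cv2 decreases as k grows, so the root is above mid
            k_lo = mid
        else:
            k_hi = mid
        iterations += 1

    k = (k_lo + k_hi) / 2
    a = mean / gamma(1 + 1 / k)  # scale follows from Mean = a * G(1 + 1/k)
    return k, a, iterations

# Round trip: moments of a Weibull with k = 2, a = 3 should recover both values.
true_mean = 3 * gamma(1.5)
true_var = 9 * (gamma(2.0) - gamma(1.5) ** 2)
k, a, iterations = fit_weibull(true_mean, true_var)
```

Recomputing the mean and variance from the estimated k and a, as the last two return values do, gives a direct check on how well the fit converged.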
Bernoulli Distribution (BERNDIST)
The Bernoulli distribution function determines the probability of success or failure (or Yes or No) in tests that have only two possible outcomes.
Syntax
stats:berndist("data", prob, "success_string")
data | string | Column data.
prob | boolean | Probability of success (true) or failure (false).
success_string | string | The success message.
Returns
double | The Bernoulli distribution probability.
Beta-Binomial Distribution (BETABINDIST)
The Beta-binomial distribution function computes probability using a combination of both binomial and beta probability distributions.
Syntax
stats:betabindist(k, n, alpha, beta)
k | double | The probability for the number.
n | double | The number of trials.
alpha, beta | double | Shape parameters.
Returns
double | The probability of occurrence k for a beta binomial n, alpha, beta.
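The beta-binomial probability has a closed form, C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta), where B is the beta function. The Python sketch below evaluates it with log-gamma for numerical stability. It treats k and n as integers and uses hypothetical helper names, so it is an illustration of the formula rather than the function's actual implementation.

```python
from math import comb, exp, lgamma

def log_beta(a: float, b: float) -> float:
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """Probability of k successes in n trials under a beta-binomial model."""
    return comb(n, k) * exp(log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

# With alpha = beta = 1 every outcome 0..n is equally likely.
p = beta_binomial_pmf(k=2, n=4, alpha=1.0, beta=1.0)
```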
Hypergeometric Distribution (HYPGEODIST)
The Hypergeometric distribution function calculates probability from a distribution often used to predict the outcome of a process in which different elements are randomly drawn from a collection and not replaced.
Syntax
stats:hypgeodist("data", n, k, "success_string")
data | string | Column data.
n | int | The number of trials.
k | int | The number of successes in n trials.
success_string | string | The success message.
Returns
double | The hypergeometric distribution probability.
Logarithmic (Series) Distribution (LOGSERDIST)
The Logarithmic (series) distribution function calculates probability using a discrete probability distribution derived from the Maclaurin series expansion.
Syntax
stats:logserdist("data", k, "success_string")
data | string | Column data.
k | long | The probability for the number.
success_string | string | The success message.
Returns
double | The logarithmic distribution probability.
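For reference, the logarithmic series probability of the value k is -p^k / (k * ln(1 - p)) for 0 < p < 1 and k >= 1. The sketch below evaluates that formula in Python. The parameter p is passed directly, which is an assumption for the example. The function itself presumably derives it from the column data and success string. The name `log_series_pmf` is hypothetical.

```python
from math import log

def log_series_pmf(k: int, p: float) -> float:
    """Probability of the value k under a logarithmic series distribution."""
    if k < 1 or not 0.0 < p < 1.0:
        raise ValueError("k must be >= 1 and p must lie strictly between 0 and 1")
    return -(p**k) / (k * log(1.0 - p))

p1 = log_series_pmf(k=1, p=0.5)
```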
Skellam Distribution (SKELLAMDIST)
The Skellam distribution function calculates probability using the Skellam distribution which models the difference between two independent Poisson distributed variables.
Syntax
stats:skellamdist(n1_data, n2_data, k)
n1_data | long | N1 column data.
n2_data | long | N2 column data.
k | long | Probability for the number.
Returns
double | The Skellam probability.
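Because a Skellam variable is the difference of two independent Poisson variables, its probability can be written as a sum over all pairs of counts whose difference is k. The Python sketch below uses a truncated version of that sum. The rates `mu1` and `mu2` are supplied directly, which is an assumption. The function presumably estimates them from the two data columns. The helper names and the truncation length are also hypothetical.

```python
from math import exp, lgamma, log

def poisson_pmf(i: int, rate: float) -> float:
    """Poisson probability of i events, computed in log space for stability."""
    return exp(i * log(rate) - rate - lgamma(i + 1))

def skellam_pmf(k: int, mu1: float, mu2: float, terms: int = 200) -> float:
    """P(N1 - N2 = k) for independent Poisson N1 (rate mu1) and N2 (rate mu2)."""
    start = max(0, -k)  # smallest N2 count that keeps N1 = j + k non-negative
    return sum(poisson_pmf(j + k, mu1) * poisson_pmf(j, mu2)
               for j in range(start, start + terms))

p = skellam_pmf(k=1, mu1=1.5, mu2=0.5)
```

With equal rates the distribution is symmetric around zero, so `skellam_pmf(k, mu, mu)` and `skellam_pmf(-k, mu, mu)` should agree.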