Additional Data Science Functions (Preview)

AnzoGraph offers an additional PREVIEW package of pre-built data science functions that you can use in the same way as other native, built-in analytic functions. In addition, Cambridge Semantics offers an Apache Zeppelin Docker image, which includes a collection of individual notebooks that provide details and example usage of each of the AnzoGraph Data Science functions. The Docker image also includes a custom SPARQL interpreter, which allows you to securely connect to AnzoGraph, to run queries from the notebooks, or write your own queries to run against AnzoGraph data. See Zeppelin Notebook Integration for more information on installing the custom Apache Zeppelin Docker image.

This additional collection of Preview functions is subject to change based on feedback from users and, in particular, Data Science professionals who seek improvement or changes to individual functions, their signatures, or their operation.

The additional data science functions are organized into the following categories:

  • Correlation – determine the relationship between different elements.
  • Distribution – calculate the probability of a given X value over a random distribution.
  • Entropy – determine variance and probability density across a given distribution.
  • Feature Exploration – classify values in a distribution using techniques such as Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), or Singular Value Decomposition (SVD).
  • Linear Algebra – create product vectors or matrices from a given collection of random variables.
  • Profiling – produce different statistical metrics such as percentile, geometric mean, or skew on a given data population.
  • Sketching – estimate or determine frequency of items in a data distribution.
  • Utility – return information on various attributes of vector space mapping and related matrix tensors.

The topics in this section provide details about each of the additional data science functions that are available.

Category/Function Description
Correlation:
Canonical Correlation (CANCOR) Calculates the overall correlation between two sets of variables.
Covariance (COVARIANCE) Provides a measure of the strength of the correlation between two or more sets of random variables (or variates).
Matthews Correlation Coefficient (MCC) Provides a measure of the quality of binary classifications of a condition with observed versus predicted scoring.
Pearson Correlation Coefficient (PCC) Determines the extent to which two variables are linearly related: positive, negative, or no relationship.
Spearman Correlation Coefficient (SCC) Determines how well the relationship between two variables can be described using a monotonic function.
Distribution:
Cumulative Distribution Function (CDF) Calculates the probability of a random variable X taking on a value less than or equal to Y. Various other AnzoGraph distribution functions provide this calculation. Refer to the description and signatures of the following functions that can produce cumulative distribution calculations:
  • Binomial Distribution (BINOMDIST)
  • Chi-Squared Distribution (CHISQDIST)
  • Continuous Uniform Distribution (CONUNIDIST)
  • Discrete Uniform Distribution (DISCUNIDIST)
  • Exponential Distribution (EXPDIST)
  • Laplace Distribution (LAPLACEDIST)
  • Log Normal Distribution (LOGNORDIST)
  • Negative Binomial Distribution (NEGBINDIST)
  • Normal Distribution (NORMDIST)
  • Poisson Distribution (POISDIST)
  • Student's t-distribution (TDIST)
  • Weibull Distribution (WEIBULDIST)
  • T Digest Metric (TDIGEST)
Bernoulli Distribution (BERNDIST) Determines the probability of a specific event occurring, or not occurring, in tests that have only two possible outcomes (1 - Success or 0 - Failure).
Beta-Binomial Distribution (BETABINDIST) Computes probability using a combination of both binomial and beta probability distributions.
Binomial Distribution (BINOMDIST) Calculates the probability for X successes in N trials given a probability of success P for each trial.
Chi-Squared Distribution (CHISQDIST) Calculates probability often used in hypothesis testing to compare an observed distribution with a theoretical one. Also provides a way to show a relationship between two categorical variables.
Continuous Uniform Distribution (CONUNIDIST) Calculates probability using a continuous probability distribution concerned with events that are equally likely to occur.
Discrete Uniform Distribution (DISCUNIDIST) Calculates probability using a symmetric probability distribution in which each of a finite number n of values is equally likely to be observed.
Exponential Distribution (EXPDIST) Calculates probability using a distribution that describes time between events in a Poisson point process (where events occur continuously and independently at a constant average rate).
Hypergeometric Distribution (HYPGEODIST) Calculates probability from a distribution that is often used to predict the outcome of a process in which different elements are randomly drawn from a collection and not replaced.
Laplace Distribution (LAPLACEDIST) Calculates probability using a distribution that represents differences between two independent variables that have identical exponential distributions (also called double exponential distribution).
Log Normal Distribution (LOGNORDIST) Calculates probability using a distribution of a random variable whose logarithm follows a normal distribution. Log normal distributions are widely used in risk analysis.
Logarithmic (Series) Distribution (LOGSERDIST) Calculates probability using a discrete probability distribution derived from the Maclaurin series expansion.
Negative Binomial Distribution (NEGBINDIST) Calculates probability using a discrete probability distribution that concerns the number of trials which must occur in order to have a predetermined number of successes.
Normal Distribution (NORMDIST) Calculates probability using a continuous probability distribution of data in which the majority of data points are relatively similar, within a small range of values having few outliers.
Poisson Distribution (POISDIST) Calculates probability using a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space and those events occur with a known constant rate and occur independently of the time since the last event.
Skellam Distribution (SKELLAMDIST) Calculates probability using the Skellam distribution which models the difference between two independent Poisson distributed variables.
Student's T-Distribution (TDIST) Calculates probability using the Student's t-distribution and associated t scores. Often used in hypothesis testing when the sample size is small and/or when the population variance is unknown.
Weibull Distribution (WEIBULDIST) Calculates probability from a continuous probability distribution that is commonly used to assess product reliability, analyze product life data and failure times.
Entropy:
Cross Entropy (CROSSENTROPY) Computes cross-entropy, which is commonly used to quantify the difference between two probability distributions.
Discrete Entropy Metric (DISCENTROPY) Calculates discrete entropy for maps on finite sets.
Differential Entropy or Continuous Entropy Metric Computes differential entropy (also referred to as continuous entropy), which is entropy defined for distributions with a continuous random variable. Various other AnzoGraph distribution functions can provide this calculation. Refer to the description and signatures of the following functions that can produce entropy calculations:
  • Normal Distribution (NORMDIST)
  • Log Normal Distribution (LOGNORDIST)
  • Exponential Distribution (EXPDIST)
  • Discrete Uniform Distribution (DISCUNIDIST)
  • Continuous Uniform Distribution (CONUNIDIST)
  • Laplace Distribution (LAPLACEDIST)
  • Weibull Distribution (WEIBULDIST)
Feature Exploration:
Principal Component Analysis (PCA) Reduces a high-dimensional dataset into fewer dimensions while retaining important information, which makes it easier to explore and visualize data.
Singular Value Decomposition (SVD) Similar to PCA, except that the factorization for SVD is done on the data matrix, whereas the factorization is done on the covariance matrix with PCA.
Linear Discriminant Analysis (LDA) Uses dimensionality reduction and classifier to make predictions.
Linear Algebra:
Gramian Matrix (GRAMIAN) Creates a Gramian matrix commonly used to compute linear independence.
Profiling Metric:
Discrete Probability Metric Calculates a discrete probability distribution of values. Various other AnzoGraph distribution functions provide a discrete probability metric. Refer to the description and signatures of the following functions that can produce discrete probability metrics:
  • Binomial Distribution (BINOMDIST)
  • Poisson Distribution (POISDIST)
  • Negative Binomial Distribution (NEGBINDIST)
  • Bernoulli Distribution (BERNDIST)
Geometric Mean Metric (GMEAN) Calculates geometric mean, defined as the nth root of the product of n positive numbers.
Percentile Metric (PERCENTILE) Calculates 1 to 100 percentile of numeric values.
Skew Metric (SKEWCOEFF) Calculates Pearson’s coefficient of skewness on numeric values.
TDigest Metric (TDIGEST) Creates an estimate of the median (and more generally, any percentile) from either distributed data or streaming data, using a t-Digest probabilistic data structure.
Sketches:
Cardinality Metric (HLL) Calculates cardinality estimates of a data set.
Frequent Items (FI) Collection of function signatures used to create sketches and obtain the most frequent items from a stream of items.
Quantile/Rank Sketch (KLL) Collection of signatures used to calculate the quantile/rank from a stream of items using the KLL sketch computation model.
Theta Sketch (THETA) Collection of signatures used to perform estimates of set operations, Union, Intersection, and Difference, all using the Theta Sketch framework. There are several different function signatures available for Theta Sketch estimate calculations.
Miscellaneous:
Matrix Utilities Collection of functions that return information on various attributes of vector space mapping and related matrix tensors.

The following sections provide additional detail about each available data science function (listed in alphabetical order) as well as the syntax or signature of each function call.

Bernoulli Distribution (BERNDIST)

The Bernoulli Distribution function determines the probability of success or failure (or Yes or No) in tests that have only two possible outcomes.

Wikipedia Reference: Bernoulli Distribution

The general signature for calling the Bernoulli Distribution function is the following:

prefix:berndist(data : String, prob : bool, SuccessIs : String)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input :
data String Column data.
prob bool Probability of success (true) or failure (false).
SuccessIs String Success string.
Output :
Probability double Bernoulli distribution probability value.
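
To make the calculation concrete, the following Python sketch estimates a Bernoulli probability from a column of string outcomes. It runs outside AnzoGraph. The helper name bernoulli_probability and the estimation approach (the fraction of values equal to the success string) are assumptions for illustration and may differ from how BERNDIST is implemented.

```python
def bernoulli_probability(data, prob, success_is):
    """Estimate P(success) when prob is True, else P(failure).

    Hypothetical helper: success probability is taken to be the share of
    values in `data` that equal `success_is`.
    """
    if not data:
        raise ValueError("data must not be empty")
    p_success = sum(1 for value in data if value == success_is) / len(data)
    return p_success if prob else 1.0 - p_success


outcomes = ["yes", "no", "yes", "yes"]
print(bernoulli_probability(outcomes, True, "yes"))   # 0.75
print(bernoulli_probability(outcomes, False, "yes"))  # 0.25
```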

Beta-Binomial Distribution (BETABINDIST)

The Beta-Binomial Distribution function computes probability using a combination of both binomial and beta probability distributions.

Wikipedia Reference: Beta-Binomial Distribution

The general signature for calling the Beta-Binomial Distribution function is the following:

prefix:betabindist(k : double, n : long, alpha : double, beta : double)  

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
k double Value for which to find the probability.
n long Number of trials.
alpha, beta double Shape parameters.
Output:
probability double Probability of k occurrences for a beta-binomial distribution with parameters n, alpha, and beta.
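
As a point of comparison, the standard beta-binomial probability mass function can be sketched in Python as follows. The function name betabinom_pmf is hypothetical, and the code illustrates the textbook formula, not the AnzoGraph implementation.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k, n, alpha, beta):
    """P(X = k) = C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)."""
    if k < 0 or k > n:
        return 0.0
    # Work in log space to avoid overflow for large n.
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose
               + log_beta(k + alpha, n - k + beta)
               - log_beta(alpha, beta))


# With alpha = beta = 1 every outcome 0..n is equally likely (1 / (n + 1)).
print(betabinom_pmf(3, 10, 1.0, 1.0))
```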

Binomial Distribution (BINOMDIST)

The Binomial Distribution function calculates the probability for X successes in N trials given a probability of success P for each trial.

Wikipedia Reference: Binomial Distribution

The general signature for calling the Binomial Distribution function is the following:

prefix:binomdist(data : String, n : long, k : long, SuccessIs : String)  

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data String Column data.
n long Number of trials.
k long Number of successes in n trials.
SuccessIs String Define success string among the column data.
Output:
probability double Value of the probability mass function at k.
cdfLower double Cumulative distribution function value: the probability of k or fewer successes (<=k).
cdfUpper double Upper-tail probability: the probability of more than k successes (>k).
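
The three outputs relate to one another as shown in the following Python sketch. It takes the per-trial success probability p directly as an argument, which is an assumption made for illustration: BINOMDIST takes column data and a success string instead, and how it derives p is not described here. The function name binomial_summary is hypothetical.

```python
from math import comb


def binomial_summary(n, k, p):
    """Return (probability, cdf_lower, cdf_upper) for k successes in n trials."""
    def pmf(i):
        return comb(n, i) * p ** i * (1 - p) ** (n - i)

    probability = pmf(k)                            # P(X = k)
    cdf_lower = sum(pmf(i) for i in range(k + 1))   # P(X <= k)
    cdf_upper = 1.0 - cdf_lower                     # P(X > k)
    return probability, cdf_lower, cdf_upper


print(binomial_summary(n=4, k=2, p=0.5))
```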

Canonical Correlation (CANCOR)

The Canonical Correlation function calculates the overall correlation between two sets of variables.

Wikipedia Reference: Canonical correlation function

The general signature for calling the Canonical Correlation function is the following:

prefix:cancor(lc : int, m : int, x1, x2,...,xm : double, y1, y2,...,yn : double)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/matrices#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
lc int Display linear combinations for only the first 'lc' canonical correlations.
m int Number of columns in first set.
x1, x2,...,xm double Feature columns in the first dataset.
y1, y2,...,yn double Feature columns in the second dataset.
Output:
CanonicalCorrelations String Canonical Correlation.
SquaredCanonicalCorrelations String Square of Canonical Correlation.
CanonicalCoefficients String Canonical Coefficient.

Cardinality Metric (HLL)

This function calculates cardinality estimates of a data set using the Apache DataSketches HyperLogLog (HLL) sketch.

Wikipedia Reference: Cardinality Prominence Metric

The general signature for calling the Cardinality function is the following:

prefix:hll(data : Object, lgConfigK : int, TgtHllType : int)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/sketch#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data Object Data set (input can be a mix of all types).
lgConfigK int Log-base-2 of K, where K is the number of buckets or slots for the sketch. This value must be between 4 and 21, inclusive. Optional; the default value is 12.
TgtHllType int Specifies the target type of HLL sketch to be created. Its value must be 4 (for HLL_4), 6 (HLL_6), or 8 (HLL_8). Optional; the default value is 4.
Output:
cardinality double Cardinality metric value of a data set.

Chi-Squared Distribution (CHISQDIST)

The Chi-Squared Distribution function calculates probability often used in hypothesis testing, to compare an observed distribution with a theoretical one. Also provides a way to show a relationship between two categorical variables.

Wikipedia Reference: Chi-Squared Distribution

The general signature for calling the Chi-Squared Distribution function is the following:

prefix:chisqdist(data : double, S : double)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data double Sample data.
S double Population standard deviation.
Output:
mean double Mean of the distribution.
stdDev double Standard deviation of the distribution.
variance double Variance of the distribution.
chi-squareStatistic double [(n - 1) * s^2] / d^2, where d is the standard deviation of the population; s is the standard deviation of the sample, and n is the sample size.
count long Number of samples. The degrees of freedom (k) is count - 1.
pdf double Value of the probability density function at chi-squareStatistic.
cdf double Cumulative distribution function value: the probability of a value <= chi-squareStatistic.
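
The chi-squareStatistic formula above can be checked by hand with a short Python sketch. The function name chi_square_statistic is hypothetical, and the use of the sample standard deviation (n - 1 denominator) for s is an assumption. The code runs outside AnzoGraph.

```python
from statistics import stdev


def chi_square_statistic(sample, population_std_dev):
    """Return ((n - 1) * s^2 / d^2, degrees_of_freedom)."""
    n = len(sample)
    if n < 2:
        raise ValueError("at least two samples are required")
    s = stdev(sample)  # sample standard deviation (assumed n - 1 denominator)
    statistic = (n - 1) * s ** 2 / population_std_dev ** 2
    return statistic, n - 1


# Squared deviations from the mean (5) sum to 32, so the statistic is 32 / 2^2.
print(chi_square_statistic([2, 4, 4, 4, 5, 5, 7, 9], population_std_dev=2.0))
```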

Continuous Entropy Metric

This function is also referred to as Differential Entropy. See Differential Entropy or Continuous Entropy Metric for function call parameters and details.

Continuous Uniform Distribution (CONUNIDIST)

This function calculates probability using a continuous probability distribution concerned with events that are equally likely to occur.

Wikipedia Reference: Continuous Uniform Distribution

The general signature for calling the Continuous Uniform Distribution function is the following:

prefix:conunidist(data : double, a : double, b : double)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data double Column data.
a double Minimum value of the probability interval.
b double Maximum value of the probability interval.
Output:
cdf double Cumulative distribution function which is probability under the area of distribution.
pdf double Probability density function value.
diffEntropy double Differential Entropy in nats.
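
For reference, the three outputs follow directly from the interval bounds a and b. The Python sketch below evaluates them for a single value x using the textbook formulas. The function name continuous_uniform is hypothetical, and the code is not the AnzoGraph implementation.

```python
from math import log


def continuous_uniform(x, a, b):
    """Return (cdf, pdf, diff_entropy) for a uniform distribution on [a, b]."""
    if not a < b:
        raise ValueError("a must be less than b")
    pdf = 1.0 / (b - a) if a <= x <= b else 0.0
    cdf = min(max((x - a) / (b - a), 0.0), 1.0)  # clamp outside the interval
    diff_entropy = log(b - a)                    # natural log, so in nats
    return cdf, pdf, diff_entropy


print(continuous_uniform(2.5, a=0.0, b=10.0))
```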

Covariance (COVARIANCE)

The Covariance function provides a measure of the strength of the correlation between two or more sets of random variables (or variates).

Wikipedia Reference: Covariance

The general signature for calling the Covariance function is the following:

prefix:covariance(x1 : double, x2 : double,...,xn : double)  

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/matrices#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
x1, x2,...,xn double Feature column datasets.
Output:
covariance_matrix "http://anzograph.com/matrices#tensor" Co-variance matrix.
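
To illustrate what a covariance matrix over feature columns contains, the following Python sketch builds one from plain lists. The function name covariance_matrix is hypothetical, and the choice of the sample covariance (n - 1 denominator) is an assumption. The AnzoGraph function returns a tensor type, not a nested list.

```python
from statistics import mean


def covariance_matrix(*columns):
    """Entry [i][j] is the sample covariance between column i and column j."""
    n = len(columns[0])
    if n < 2 or any(len(col) != n for col in columns):
        raise ValueError("columns must have equal length of at least 2")
    means = [mean(col) for col in columns]
    return [
        [
            sum((a - mi) * (b - mj) for a, b in zip(ci, cj)) / (n - 1)
            for cj, mj in zip(columns, means)
        ]
        for ci, mi in zip(columns, means)
    ]


print(covariance_matrix([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The diagonal holds each column's variance, and the matrix is symmetric.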

Cross Entropy (CROSSENTROPY)

This function computes cross-entropy, which is commonly used to quantify the difference between two probability distributions.

Wikipedia Reference: Cross Entropy

The general signature for calling the Cross Entropy function is the following:

prefix:crossentropy(p : double, q : double)  

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
p double True probabilities for x.
q double Predicted probabilities for x.
Output:
cross_entropy double Cross entropy value.
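
The standard definition of cross-entropy, H(p, q) = -sum(p_i * log(q_i)), can be sketched in Python as follows. The function name cross_entropy and the use of the natural logarithm (result in nats) are assumptions. The logarithm base used by CROSSENTROPY is not stated here.

```python
from math import log


def cross_entropy(p, q):
    """Cross-entropy of predicted probabilities q relative to true probabilities p."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Terms with p_i == 0 contribute nothing and are skipped.
    return -sum(pi * log(qi) for pi, qi in zip(p, q) if pi > 0)


# A certain outcome predicted with probability 0.5 costs ln(2) nats.
print(cross_entropy([1.0, 0.0], [0.5, 0.5]))
```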

Cumulative Distribution Function (CDF)

A Cumulative Distribution function calculates the probability of a random variable X taking on a value less than or equal to Y.

Wikipedia Reference: Cumulative Distribution Function (CDF)

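Various other AnzoGraph distribution functions provide this calculation. Refer to the description and signatures of the following functions that can produce cumulative distribution calculations:

  • Binomial Distribution (BINOMDIST)
  • Chi-Squared Distribution (CHISQDIST)
  • Continuous Uniform Distribution (CONUNIDIST)
  • Discrete Uniform Distribution (DISCUNIDIST)
  • Exponential Distribution (EXPDIST)
  • Laplace Distribution (LAPLACEDIST)
  • Log Normal Distribution (LOGNORDIST)
  • Negative Binomial Distribution (NEGBINDIST)
  • Normal Distribution (NORMDIST)
  • Poisson Distribution (POISDIST)
  • Student's t-distribution (TDIST)
  • Weibull Distribution (WEIBULDIST)
  • T Digest Metric (TDIGEST)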

Differential Entropy or Continuous Entropy Metric

Differential entropy (also referred to as continuous entropy) is entropy that can be computed for distributions with a continuous random variable.

Wikipedia Reference: Differential entropy

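Various other AnzoGraph distribution functions can provide this calculation. Refer to the description and signatures of the following functions that can produce entropy calculations:

  • Normal Distribution (NORMDIST)
  • Log Normal Distribution (LOGNORDIST)
  • Exponential Distribution (EXPDIST)
  • Discrete Uniform Distribution (DISCUNIDIST)
  • Continuous Uniform Distribution (CONUNIDIST)
  • Laplace Distribution (LAPLACEDIST)
  • Weibull Distribution (WEIBULDIST)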

Discrete Entropy Metric (DISCENTROPY)

This function calculates entropy for maps on finite sets, referred to as discrete entropy.

ScienceDirect Reference: Discrete Entropy

The general signature for calling the Discrete Entropy Metric function is the following:

prefix:discentropy(data : String)  

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data String Column data.
Output:
discrete_entropy double Discrete entropy value.
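
One common reading of discrete entropy over a column of values is the Shannon entropy of the observed value frequencies. The Python sketch below follows that reading. The function name discrete_entropy, the frequency-based interpretation, and the natural-log base are all assumptions and may not match what DISCENTROPY computes.

```python
from collections import Counter
from math import log


def discrete_entropy(data):
    """Shannon entropy (in nats) of the empirical distribution of `data`."""
    if not data:
        raise ValueError("data must not be empty")
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * log(c / total) for c in counts.values())


# Two equally frequent values give ln(2) nats.
print(discrete_entropy(["a", "b", "a", "b"]))
```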

Discrete Probability Metric

This function calculates a discrete probability distribution of values.

Wikipedia Reference: Discrete Probability

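Various other AnzoGraph distribution functions can provide a discrete probability metric. Refer to the description and signatures of the following functions that can produce discrete probability metrics:

  • Binomial Distribution (BINOMDIST)
  • Poisson Distribution (POISDIST)
  • Negative Binomial Distribution (NEGBINDIST)
  • Bernoulli Distribution (BERNDIST)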

Discrete Uniform Distribution (DISCUNIDIST)

This function calculates probability using a symmetric probability distribution in which each of a finite number n of values is equally likely to be observed.

Wikipedia Reference: Discrete Uniform Distribution

The general signature for calling the Discrete Uniform Distribution function is the following:

prefix:discunidist(data : long, k : long) 

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data long Column data.
k long Value k for which to find the CDF (finite number of outcomes).
Output:
cdf double Cumulative distribution function which is probability under the area of distribution.
pdf double Probability density function value.
diffEntropy double Differential Entropy in nats.

Exponential Distribution (EXPDIST)

The Exponential Distribution function calculates probability using a distribution that describes time between events in a Poisson point process (where events occur continuously and independently at a constant average rate).

Wikipedia Reference: Exponential Distribution

The general signature for calling the Exponential Distribution function is the following:

prefix:expdist(data : long, x : double)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data long Column data.
x double Interval value for which to find the probability.
Output:
cdfLower double Cumulative distribution function which is probability (<=x) under the area of distribution.
cdfUpper double Cumulative distribution function which is probability (>x) under the area of distribution.
pdf double Probability density function value.
diffEntropy double Differential Entropy in nats.
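
The four outputs can be related to one another with the textbook exponential formulas, as in the following Python sketch. The function name exponential_summary is hypothetical, and estimating the rate as 1 / mean(data) is an assumption. How EXPDIST derives the rate from the column data is not described here.

```python
from math import exp, log
from statistics import mean


def exponential_summary(data, x):
    """Return cdfLower, cdfUpper, pdf, and diffEntropy for interval x >= 0."""
    if x < 0:
        raise ValueError("x must be non-negative")
    rate = 1.0 / mean(data)           # assumed estimate of the rate parameter
    cdf_lower = 1.0 - exp(-rate * x)  # P(X <= x)
    return {
        "cdfLower": cdf_lower,
        "cdfUpper": 1.0 - cdf_lower,  # P(X > x)
        "pdf": rate * exp(-rate * x),
        "diffEntropy": 1.0 - log(rate),  # in nats
    }


print(exponential_summary([1, 2, 3], x=2.0))
```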

Frequent Items (FI)

This function is used to create sketches and obtain the most frequent items from a stream of items. There are several different function signatures available for frequent items discovery.

Reference: Frequent Items

The prefix shown in the function signatures below points to the URI <http://cambridgesemantics.com/anzograph/sketch#> location of the AnzoGraph data science functions.

  • fi – Creates frequent items sketches.
    prefix:fi(val : Object, weight : long)  
    Parameter Data Type Description
    Input:
    val Object Data set (supports short, int, long, float, double, and string values).
    weight long Weight corresponding to 'val'. Optional; the default is 1.
    Output:
    fi_sketch "http://anzograph.com/statistics#fi_sketch" Binary stream containing sketch data type and its frequency sketch.
  • fi::get_estimates – Gets the estimate for the frequency, lower and upper bound of the given item.
    prefix:fi::get_estimates(fi_sketch : "http://anzograph.com/statistics#fi_sketch",
      item : Object) 
    Parameter Data Type Description
    Input:
    fi_sketch "http://anzograph.com/statistics#fi_sketch" Binary stream containing sketch data type and its frequency sketch.
    item Object Data item whose frequency is to be estimated.
    Output:
    frequency long Frequency estimate of the given item.
    lower_bound long Lower-bound frequency estimate of the given item.
    upper_bound long Upper-bound frequency estimate of the given item.
  • fi::get_active_items_total_weights – Gets the number of active items in the sketch and the estimated total stream weight.
    prefix:fi::get_active_items_total_weights(fi_sketch : "http://anzograph.com/statistics#fi_sketch")  
    Parameter Data Type Description
    Input:
    fi_sketch "http://anzograph.com/statistics#fi_sketch" Binary stream containing sketch data type and its frequency sketch.
    Output:
    num_active_items long The number of active items in the sketch.
    total_weights long The estimated total stream weight.
  • fi::get_top_items – Get top frequent items and their corresponding frequency.
    prefix:fi::get_top_items(fi_sketch : "http://anzograph.com/statistics#fi_sketch") 
    Parameter Data Type Description
    Input:
    fi_sketch "http://anzograph.com/statistics#fi_sketch" Binary stream containing sketch data type and its frequency sketch.
    Output:
    Item1 double Item that has the highest frequency.
    Item1_frequency long Frequency estimate of the first item.
    Item2 double Item that has the second highest frequency.
    Item2_frequency long Frequency estimate of the second item.
    ...
    Item5 double Item that has the 5th highest frequency.
    Item5_frequency long Frequency estimate of the 5th item.
  • fi::get_top_strings – Get top frequent strings and their corresponding frequency.
    prefix:fi::get_top_strings(fi_sketch : "http://anzograph.com/statistics#fi_sketch") 
    Parameter Data Type Description
    Input:
    fi_sketch "http://anzograph.com/statistics#fi_sketch" Binary stream containing sketch data type and its frequency sketch.
    Output:
    Item1 string String that has the highest frequency.
    Item1_frequency long Frequency estimate of the first string.
    Item2 string String that has the second highest frequency.
    Item2_frequency long Frequency estimate of the second string.
    ...
    Item5 string String that has the 5th highest frequency.
    Item5_frequency long Frequency estimate of the 5th string.

Geometric Mean Metric (GMEAN)

This function calculates geometric mean, defined as the nth root of the product of n positive numbers.

Wikipedia Reference: Geometric Mean

The general signature for calling the Geometric Mean function is the following:

prefix:gmean(data : double)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data double Column data
Output:
geometric_mean double Geometric mean value
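
The definition above (the nth root of the product of n positive numbers) can be sketched in Python as follows. The function name geometric_mean is hypothetical, and the code illustrates the definition only, not the AnzoGraph implementation. It averages logarithms instead of multiplying the values directly, which avoids overflow on long columns.

```python
from math import exp, log


def geometric_mean(data):
    """nth root of the product of n positive numbers, computed in log space."""
    if not data or any(value <= 0 for value in data):
        raise ValueError("data must contain only positive numbers")
    return exp(sum(log(value) for value in data) / len(data))


# The cube root of 1 * 3 * 9 = 27 is 3 (up to floating-point rounding).
print(geometric_mean([1.0, 3.0, 9.0]))
```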

Gramian Matrix (GRAMIAN)

This function creates a Gramian matrix commonly used to compute linear independence.

Wikipedia Reference: Gramian Matrix

The general signature for calling the Gramian Matrix function is the following:

prefix:gramian(x1 : double, x2 : double,...,xn : double)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/matrices#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
x1, x2,...,xn double Feature column data sets.
Output:
gramian_matrix "http://anzograph.com/matrices#tensor" Gramian matrix.
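
For a set of vectors, entry (i, j) of the Gramian matrix is the inner product of vector i and vector j. The Python sketch below treats each feature column as one vector. That treatment and the function name gramian_matrix are assumptions for illustration. The AnzoGraph function returns a tensor type, not a nested list.

```python
def gramian_matrix(*columns):
    """Entry [i][j] is the dot product of column i and column j."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("columns must have equal length")
    return [
        [sum(a * b for a, b in zip(ci, cj)) for cj in columns]
        for ci in columns
    ]


print(gramian_matrix([1.0, 2.0], [3.0, 4.0]))  # [[5.0, 11.0], [11.0, 25.0]]
```

The columns are linearly independent exactly when the determinant of this matrix is nonzero (here 5 * 25 - 11 * 11 = 4).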

Hypergeometric Distribution (HYPGEODIST)

The Hypergeometric Distribution function calculates probability from a distribution often used to predict the outcome of a process in which different elements are randomly drawn from a collection and not replaced.

Wikipedia Reference: HyperGeometric Distribution

The general signature for calling the Hypergeometric Distribution function is the following:

prefix:hypgeodist(data : String, n : int, k : int, SuccessIs : String) 

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data String Column data.
n int Number of trials.
k int Number of success in n trials.
SuccessIs String Success string.
Output:
probability double Hypergeometric distribution probability value.

Laplace Distribution (LAPLACEDIST)

The Laplace Distribution function calculates probability using a distribution that represents differences between two independent variables that have identical exponential distributions (also called double exponential distribution).

Wikipedia Reference: Laplace Distribution

The general signature for calling the Laplace Distribution function is the following:

prefix:laplacedist(data : double, c : String, x1 : double, x2 : double)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data double Column data.
c String User choice: 'below'; 'above'; 'bet'(Between); 'out'(Outside).
x1 double Lower number x1 to find the probability.
x2 double Upper number x2 to find the probability.
Output:
mean double Mean of the distribution.
scaleParam double Scale parameter of the distribution.
stdDev double Standard deviation of the distribution.
variance double Variance of the distribution.
diffEntropy double Differential Entropy in nats.
cdf double Cumulative distribution function which is probability under the area of distribution.
pdfLower double Probability density function value for x1.
pdfUpper double Probability density function value for x2.

Linear Discriminant Analysis (LDA)

This function applies linear discriminant analysis (LDA) to create combined eigen values and vectors that characterize or separate two or more classes of objects or events.

Wikipedia Reference: Linear Discriminant Analysis

There are several different function signatures available for Linear Discriminant Analysis.

The prefix shown in the function signatures below points to the URI <http://cambridgesemantics.com/anzograph/matrices#> location of the AnzoGraph LDA data science functions.

  • lda::create – Apply Linear Discriminant Analysis (LDA) to create combined eigenvalues and eigenvectors.
    prefix:lda::create(y : double, x1 : double, x2 : double,...,xn : double)
    Parameter Data Type Description
    Input:
    y double Class of feature tuple.
    x1, x2,...,xn double Feature column data sets.
    Output:
    eigen_values_vectors_mean "http://anzograph.com/matrices#lda_result" Combined eigenvalues, eigenvectors, class mean, count and class map.
  • lda::get_eigvec – Get LDA's eigen vectors as a matrix from LDA data.
    prefix:lda::get_eigvec(lda_data : "http://anzograph.com/matrices#lda_result") 
    Parameter Data Type Description
    Input:
    lda_data "http://anzograph.com/matrices#lda_result" Linear Discriminant Analysis data.
    Output:
    eigen_vectors "http://anzograph.com/matrices#tensor" Eigen vectors as a matrix.
  • lda::get_eigval – Get LDA's eigen values as a column vector from LDA data.
    prefix:lda::get_eigval(lda_data : "http://anzograph.com/matrices#lda_result") 
    Parameter Data Type Description
    Input:
    lda_data "http://anzograph.com/matrices#lda_result" LDA data.
    Output:
    eigen_values "http://anzograph.com/matrices#tensor" Eigen values in descending order as a column vector.
  • lda::transform – Apply Linear Discriminant Analysis (LDA) to transform the samples onto the new subspace.
    prefix:lda::transform(lda_data : "http://anzograph.com/matrices#lda_result",
      d : int, x1 : double, x2 : double,...,xn : double)
    Parameter Data Type Description
    Input:
    lda_data "http://anzograph.com/matrices#lda_result" LDA data.
    d int Number of eigen vectors to consider from the start.
    x1, x2,...,xn double Feature column data sets.
    Output:
    transformed_data double Original data transformed into the tuple of lower dimensional space.
  • dump_tensor – Get string representation of vector or matrix in row-wise/column-wise order.
    prefix:dump_tensor(m : "http://anzograph.com/matrices#tensor",
      type: int, isRowWize: Boolean)
    Parameter Data Type Description
    Input:
    m "http://anzograph.com/matrices#tensor" A tensor of matrix/row vector/column vector.
    type int Type of tensor: 0-Row vector, 1-Column Vector, 2-Matrix. Optional, default is 2.
    isRowWize Boolean False if the display matrix is column-wise. Optional, default is true.
    Output:
    dump String String representation of vector or matrix in row-wise/column-wise order.
  • lda::predict – Predict the class for the samples using Linear Discriminant Analysis (LDA) as a classifier.
    prefix:lda::predict(lda_data : "http://anzograph.com/matrices#lda_result",
       p1 : double, p2 : double,...,pn : double)
    Parameter Data Type Description
    Input:
    lda_data "http://anzograph.com/matrices#lda_result" LDA data.
    p1, p2,...,pn double Data sample whose class is to be predicted.
    Output:
    class_name String Class name to which data tuple belongs.
  • lda::get_raw_eigval – Get LDA's unsorted eigen values from LDA data.
    prefix:lda::get_raw_eigval(lda_data : "http://anzograph.com/matrices#lda_result") 
    Parameter Data Type Description
    Input:
    lda_data "http://anzograph.com/matrices#lda_result" LDA data.
    Output:
    eigen_values "http://anzograph.com/matrices#tensor" Eigen values in unsorted order as a column vector.

Log Normal Distribution (LOGNORDIST)

This function calculates probability using the distribution of a random variable whose logarithm follows a normal distribution. The log normal distribution is widely used in risk analysis.

Wikipedia Reference: Log Normal Distribution

The general signature for calling the Log Normal Distribution function is the following:

prefix:lognordist(data : double, c : String, x1 : double, x2 : double) 

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data double Column data.
c String User choice of the probability to compute: 'below', 'above', 'bet' (between), or 'out' (outside).
x1 double Lower number x1 (> 0) to find the probability.
x2 double Upper number x2 (> 0) to find the probability.
Output:
mean double Mean of the distribution of natural logarithms.
stdDev double Standard deviation of the distribution of natural logarithms.
variance double Variance of the distribution.
diffEntropy double Differential Entropy in nats.
cdf double Cumulative distribution function which is probability under the area of distribution.
pdfLower double Probability density function value for x1.
pdfUpper double Probability density function value for x2.
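
The outputs listed above can be reproduced outside the database with a few lines of Python. The following is only a sketch of the underlying math, not AnzoGraph's implementation: the function name `lognordist_sketch`, the use of the population (divide-by-n) standard deviation, and the choice to evaluate 'below' and 'above' at x1 are all assumptions made for illustration.

```python
import math

def lognordist_sketch(data, choice, x1, x2):
    """Fit a log-normal distribution to positive data and evaluate it."""
    logs = [math.log(v) for v in data]
    n = len(logs)
    mu = sum(logs) / n
    # Assumption: population (divide-by-n) standard deviation of the logs.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)

    def cdf(x):
        return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

    def pdf(x):
        z = (math.log(x) - mu) / sigma
        return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

    # Assumption: 'below' and 'above' are evaluated at x1.
    if choice == "below":
        prob = cdf(x1)
    elif choice == "above":
        prob = 1.0 - cdf(x1)
    elif choice == "bet":
        prob = cdf(x2) - cdf(x1)
    elif choice == "out":
        prob = 1.0 - (cdf(x2) - cdf(x1))
    else:
        raise ValueError("choice must be 'below', 'above', 'bet', or 'out'")

    return {
        "mean": mu,
        "stdDev": sigma,
        # Variance of the log-normal variable itself.
        "variance": (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2),
        # Differential entropy of a log-normal distribution, in nats.
        "diffEntropy": mu + 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2),
        "cdf": prob,
        "pdfLower": pdf(x1),
        "pdfUpper": pdf(x2),
    }
```

For example, `lognordist_sketch([1.0, 2.718281828459045, 7.38905609893065], "below", 2.718281828459045, 7.38905609893065)` fits logs of approximately 0, 1, and 2, so the probability below e is one half.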

Logarithmic (Series) Distribution (LOGSERDIST)

This function calculates probability using a discrete probability distribution derived from the Maclaurin series expansion of -ln(1 - p).

Wikipedia Reference: Logarithmic (Series) Distribution

The general signature for calling the Logarithmic (Series) Distribution function is the following:

prefix:logserdist(data : String, k : long, SuccessIs : String) 

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data String Column data.
k long Number for which to find the probability.
SuccessIs String String in the column data that defines success.
Output:
probability double Logarithmic distribution probability value.
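
As a rough illustration of the probability mass function behind this distribution, the sketch below evaluates P(k) = -p^k / (k ln(1 - p)) for k >= 1. It is not AnzoGraph's implementation. The names `logser_pmf` and `logserdist_sketch` are hypothetical, and estimating p as the fraction of rows equal to the success string is an assumption.

```python
import math

def logser_pmf(k, p):
    """Logarithmic (series) distribution PMF for k >= 1 and 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if k < 1:
        return 0.0
    return -1.0 / math.log(1.0 - p) * p ** k / k

def logserdist_sketch(data, k, success_is):
    # Assumption: p is the fraction of rows that equal the success string.
    p = sum(1 for v in data if v == success_is) / len(data)
    return logser_pmf(k, p)
```

With half of the rows marked as successes, `logserdist_sketch(["y", "n"], 1, "y")` evaluates 0.5 / ln 2.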

Matrix Utilities

This collection of functions returns information on various attributes of vector space mapping and related matrix tensors.

There are several different matrix utility functions available.

The prefix shown in the function signatures below points to the URI <http://cambridgesemantics.com/anzograph/matrices#> location of the AnzoGraph matrix utility functions.

  • get_rows – Get number of rows present in tensor.
    prefix:get_rows(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    n long Number of rows.
  • get_cols – Get number of columns present in tensor.
    prefix:get_cols(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    n long Number of columns.
  • get_slices – Get number of slices present in tensor.
    prefix:get_slices(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    n long Number of slices.
  • get_order – Get tensor order.
    prefix:get_order(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    n long Tensor order.
  • get_total_elem – Get total number of elements present in tensor.
    prefix:get_total_elem(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    n long Total number of elements.
  • get_nonzero – Get number of non-zero elements present in sparse matrix.
    prefix:get_nonzero(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    n long Number of non-zero elements present in sparse matrix.
  • get_elem – Access the individual element stored in tensor.
    prefix:get_elem(b : "http://anzograph.com/matrices#tensor",
       i : long, j : long, k : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    i long Element stored at ith row.
    j long Element stored at jth column; optional parameter.
    k long Element stored at kth slice; optional parameter.
    Output:
    v double Element value.
  • dump_tensor – Display the Armadillo header and the first few elements of the matrix or vector data as a string.
    prefix:dump_tensor(b : "http://anzograph.com/matrices#tensor", type : int, isRowWise : boolean)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A matrix/row vector/column vector.
    type int Type of tensor: 0 - Row vector; 1 - Column vector; 2 - Matrix. This parameter is optional; the default value is 2.
    isRowWise boolean False to display the matrix column-wise. This parameter is optional; the default value is true.
    Output:
    s String Row-wise or column-wise string representation of vector or matrix.
  • make_matrix – Create a matrix of doubles with the given dimensions and values.
    prefix:make_matrix(m : int, n : int, v : double, ...)
    Parameter Data Type Description
    Input:
    m int The number of rows in the new matrix.
    n int The number of columns in the new matrix.
    v double Matrix elements to fill in row-wise; optional repeatable parameter. The default value is 0 for all elements.
    Output:
    b "http://anzograph.com/matrices#tensor" Tensor representation for m x n element matrix of doubles.
  • subview_col – Extract a column from matrix or sparse matrix.
    prefix:subview_col(b: "http://anzograph.com/matrices#tensor", n : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    n long Column index.
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of the column vector.
  • subview_row – Extract a row from matrix or sparse matrix.
    prefix:subview_row(b: "http://anzograph.com/matrices#tensor", n : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    n long Row index.
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of the row vector.
  • dump_vec – Display the row or column vector data as a string.
    prefix:dump_vec(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" Row or column vector.
    Output:
    s String String representation of row or column vector.
  • subview_rows – Extract a range of rows from matrix or sparse matrix.
    prefix:subview_rows(b : "http://anzograph.com/matrices#tensor",
       r1 : long, ... rn : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    r1, r2,...,rn long Start row index (inclusive) to end row index (inclusive).
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of matrix with rows from r1 to rn.
  • subview_cols – Extract a range of columns from matrix or sparse matrix.
    prefix:subview_cols(b : "http://anzograph.com/matrices#tensor", c1 : long,... cn : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    c1, c2,...,cn long Start column index (inclusive) to end column index (inclusive).
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of matrix with columns from c1 to cn.
  • subview_mat – Extract a submatrix from matrix or sparse matrix.
    prefix:subview_mat(b : "http://anzograph.com/matrices#tensor",
       r1 : long, c1 : long,... rn : long, cn : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    r1, r2,...,rn long Start row index (inclusive) to end row index (inclusive).
    c1, c2,...,cn long Start column index (inclusive) to end column index (inclusive).
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of matrix of [1+(rn-r1)] x [1+(cn-c1)] size.
  • subview_head_rows – Extract starting rows from matrix or sparse matrix.
    prefix:subview_head_rows(b : "http://anzograph.com/matrices#tensor",
       n : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    n long Number of rows from the start.
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of matrix with rows from 0 to n-1.
  • subview_head_cols – Extract starting columns from matrix or sparse matrix.
    prefix:subview_head_cols(b : "http://anzograph.com/matrices#tensor",
       n : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    n long Number of columns from the start.
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of matrix with columns from 0 to n-1.
  • subview_tail_rows – Extract trailing rows from matrix or sparse matrix.
    prefix:subview_tail_rows(b : "http://anzograph.com/matrices#tensor",
      n : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    n long Number of rows from the tail.
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of matrix with n rows from the tail.
  • subview_tail_cols – Extract trailing columns from matrix or sparse matrix.
    prefix:subview_tail_cols(b : "http://anzograph.com/matrices#tensor",
       n : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    n long Number of columns from the tail.
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of matrix with n columns from the tail.
  • get_subvec – Extract range of elements from a row or column vector.
    prefix:get_subvec(b : "http://anzograph.com/matrices#tensor",
       i : long, j : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    i long Start index.
    j long End index.
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of row or column vector.
  • subvec_head – Extract starting elements from a row or column vector.
    prefix:subvec_head(b : "http://anzograph.com/matrices#tensor",
       n : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    n long Number of elements from the start.
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of row or column vector having elements from the start.
  • subvec_tail – Extract trailing elements from a row or column vector.
    prefix:subvec_tail(b : "http://anzograph.com/matrices#tensor",
       n : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    n long Number of elements from the tail.
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of row or column vector having elements from the tail.
  • get_diag – Extract a diagonal from matrix or sparse matrix.
    prefix:get_diag(b : "http://anzograph.com/matrices#tensor",
       k : long)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    k long Optional diagonal number parameter; by default, the main diagonal is accessed (k=0). For k > 0, the kth super-diagonal is accessed (top-right corner). For k < 0, the kth sub-diagonal is accessed (bottom-left corner).
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of the diagonal as a column vector.
  • flatten_as_col – Get a flattened version of the matrix as a column vector.
    prefix:flatten_as_col(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of a flattened version of the matrix as a column vector.
  • flatten_as_row – Get a flattened version of the matrix as a row vector.
    prefix:flatten_as_row(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    v "http://anzograph.com/matrices#tensor" Tensor representation of a flattened version of the matrix as a row vector.
  • getmax_val – Get the maximum value in the tensor.
    prefix:getmax_val(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    n double Maximum value in the tensor.
  • getmin_val – Get the minimum value in the tensor.
    prefix:getmin_val(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    n double Minimum value in the tensor.
  • is_vec – Check whether a matrix is a vector.
    prefix:is_vec(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    v boolean True if the matrix can be interpreted as a vector (either column or row vector). False if the matrix does not have exactly one column or one row.
  • is_rowvec – Check whether matrix is a row vector.
    prefix:is_rowvec(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    v boolean True if the matrix can be interpreted as a row vector. False if the matrix does not have exactly one row.
  • is_colvec – Check whether matrix is a column vector.
    prefix:is_colvec(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    v boolean True if the matrix can be interpreted as a column vector. False if the matrix does not have exactly one column.
  • is_sorted – Check whether vector or matrix is sorted.
    prefix:is_sorted(b : "http://anzograph.com/matrices#tensor",
       t : boolean, d : int)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    t boolean Sort dimension for a matrix: true to check whether elements are sorted in each row; false to check whether elements are sorted in each column. This parameter is optional; the default is false.
    d int Optional argument specifying the sort direction; the default is 0 (ascend). Allowed values are:
    0 - ascend: elements are ascending; consecutive elements can be equal.
    1 - descend: elements are descending; consecutive elements can be equal.
    2 - strictascend: elements are strictly ascending; consecutive elements cannot be equal.
    3 - strictdescend: elements are strictly descending; consecutive elements cannot be equal.
    Output:
    v boolean True if the elements are sorted, else false.
  • is_tri_mat_upper – Check whether matrix is upper triangular.
    prefix:is_tri_mat_upper(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    v boolean True if the matrix is upper triangular, that is, the matrix is square sized and all elements below the main diagonal are zero; otherwise, returns false.
  • is_tri_mat_lower – Check whether matrix is lower triangular.
    prefix:is_tri_mat_lower(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    v boolean True if the matrix is lower triangular, that is, the matrix is square sized and all elements above the main diagonal are zero; otherwise, returns false.
  • is_diag_mat – Check whether a matrix is diagonal.
    prefix:is_diag_mat(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    v boolean True if the matrix is diagonal, that is, all elements outside of the main diagonal are zero; otherwise, returns false.
  • is_square – Check whether matrix is square-sized.
    prefix:is_square(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    v boolean True if the matrix is square, that is, the number of rows is equal to the number of columns.
  • is_symmetric – Check whether the matrix is symmetric.
    prefix:is_symmetric(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    v boolean True if the matrix is symmetric.
  • is_hermitian – Check whether matrix is Hermitian.
    prefix:is_hermitian(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    v boolean True if the matrix is Hermitian (self-adjoint).
  • has_nan – Check whether a matrix contains NaN elements.
    prefix:has_nan(b : "http://anzograph.com/matrices#tensor")
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    Output:
    v boolean True if at least one of the elements of the object is NaN (not-a-number).
  • vec_all – Check whether all elements of a row or column vector are non-zero or satisfy a relational condition.
    prefix:vec_all(b : "http://anzograph.com/matrices#tensor",
       c : int, val : double)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    c int Relational condition. This parameter is optional; the default is 0. Allowed values are:
    0 - not equal
    1 - greater than
    2 - less than
    3 - equal
    4 - greater than or equal
    5 - less than or equal
    val double Value against which to apply condition c. This parameter is optional; the default is 0.
    Output:
    v boolean True if all elements of the vector are non-zero or satisfy a relational condition.
  • mat_all – Check whether all elements of a matrix are non-zero or satisfy a relational condition.
    prefix:mat_all(b : "http://anzograph.com/matrices#tensor",
       d : boolean, c : int, val : double)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    d boolean Whether to check rows or columns. This parameter is optional; the default is to check all columns.
    c int Relational condition. This parameter is optional; the default is 0. Allowed values are:
    0 - not equal
    1 - greater than
    2 - less than
    3 - equal
    4 - greater than or equal
    5 - less than or equal
    val double Value against which to apply condition c. This parameter is optional; the default is 0.
    Output:
    v "http://anzograph.com/matrices#tensor" A Tensor representation of a row vector, with each element (0 or 1) indicating whether the corresponding row/column has all non-zero elements.
  • vec_any – Check whether any element of a row or column vector is non-zero or satisfies a relational condition.
    prefix:vec_any(b : "http://anzograph.com/matrices#tensor",
       c : int, val : double)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    c int Relational condition. This parameter is optional; the default is 0. Allowed values are:
    0 - not equal
    1 - greater than
    2 - less than
    3 - equal
    4 - greater than or equal
    5 - less than or equal
    val double Value against which to apply condition c. This parameter is optional; the default is 0.
    Output:
    v "http://anzograph.com/matrices#tensor" True if any element of the vector is non-zero or satisfies a relational condition.
  • mat_any – Check whether any element of a matrix is non-zero or satisfies a relational condition.
    prefix:mat_any(b : "http://anzograph.com/matrices#tensor",
       d : boolean, c : int, val : double)
    Parameter Data Type Description
    Input:
    b "http://anzograph.com/matrices#tensor" A Tensor.
    d boolean Whether to check rows or columns. This parameter is optional; the default is to check all columns.
    c int Relational condition. This parameter is optional; the default is 0. Allowed values are:
    0 - not equal
    1 - greater than
    2 - less than
    3 - equal
    4 - greater than or equal
    5 - less than or equal
    val double Value against which to apply condition c. This parameter is optional; the default is 0.
    Output:
    v "http://anzograph.com/matrices#tensor" A Tensor representation of a row vector, with each element (0 or 1) indicating whether the corresponding row/column has any non-zero elements.
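
The integer codes used by vec_all, vec_any, and is_sorted can be easier to follow when written out as plain comparisons. The Python sketch below models those codes on ordinary lists. It is an illustration of the documented semantics only: the helper names mirror the function names above but are hypothetical stand-ins, and they do not call AnzoGraph or handle tensors.

```python
import operator

# Relational condition codes as listed for vec_all and vec_any.
CONDITIONS = {
    0: operator.ne,  # not equal
    1: operator.gt,  # greater than
    2: operator.lt,  # less than
    3: operator.eq,  # equal
    4: operator.ge,  # greater than or equal
    5: operator.le,  # less than or equal
}

# Sort direction codes as listed for is_sorted.
DIRECTIONS = {
    0: operator.le,  # ascend (equal neighbours allowed)
    1: operator.ge,  # descend (equal neighbours allowed)
    2: operator.lt,  # strictascend
    3: operator.gt,  # strictdescend
}

def vec_all(vec, c=0, val=0.0):
    """True if every element satisfies condition c against val."""
    return all(CONDITIONS[c](x, val) for x in vec)

def vec_any(vec, c=0, val=0.0):
    """True if at least one element satisfies condition c against val."""
    return any(CONDITIONS[c](x, val) for x in vec)

def is_sorted(vec, d=0):
    """True if each pair of neighbouring elements respects direction d."""
    return all(DIRECTIONS[d](a, b) for a, b in zip(vec, vec[1:]))
```

With the defaults (c = 0, val = 0), the condition is "not equal to 0", which is why the default behaviour is a non-zero check.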

Matthews Correlation Coefficient (MCC)

The Matthews Correlation Coefficient function returns a coefficient value between observed and predicted binary classifications.

Wikipedia Reference: Matthews correlation coefficient.

The general signature for calling the Matthews Correlation Coefficient function is the following:

prefix:mcc(x : boolean, y : boolean) 

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
x boolean 1st variable column data.
y boolean 2nd variable column data.
Output:
coefficient double Extent to which observed and predicted binary classifications are related.
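
For reference, the coefficient is computed from the four cells of the confusion matrix. The sketch below shows the standard formula in Python; the function name `mcc_sketch` is hypothetical, and returning 0.0 when the denominator is zero is a common convention assumed here, not documented AnzoGraph behavior.

```python
import math

def mcc_sketch(x, y):
    """Matthews correlation coefficient for two boolean sequences."""
    tp = tn = fp = fn = 0
    for observed, predicted in zip(x, y):
        if observed and predicted:
            tp += 1
        elif not observed and not predicted:
            tn += 1
        elif predicted:
            fp += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

The result ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect agreement).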

Negative Binomial Distribution (NEGBINDIST)

This function calculates probability using a discrete probability distribution that concerns the number of trials which must occur in order to have a predetermined number of successes.

Wikipedia Reference: Negative Binomial Distribution

The general signature for calling the Negative Binomial Distribution function is the following:

prefix:negbindist(data : String, k : long, r : long, SuccessIs : String) 

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data String Column data.
k long Number of successes.
r long Number of failures.
SuccessIs String String in the column data that defines success.
Output:
probability double Value of the probability mass function.
cdfLower double Cumulative distribution function value, that is, the probability (<= k) under the area of the distribution.
cdfUpper double Cumulative distribution function value, that is, the probability (> k) under the area of the distribution.
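
The sketch below illustrates how the three outputs relate to one another. It rests on two assumptions that the text above does not confirm: that the distribution counts k successes before the r-th failure, and that the success probability p is estimated as the fraction of rows equal to the success string. The function names are hypothetical.

```python
import math

def negbin_pmf(k, r, p):
    """P(k successes occur before the r-th failure), success probability p."""
    return math.comb(k + r - 1, k) * p ** k * (1.0 - p) ** r

def negbindist_sketch(data, k, r, success_is):
    # Assumption: p is the fraction of rows that equal the success string.
    p = sum(1 for v in data if v == success_is) / len(data)
    cdf_lower = sum(negbin_pmf(i, r, p) for i in range(k + 1))
    return {
        "probability": negbin_pmf(k, r, p),
        "cdfLower": cdf_lower,        # P(X <= k)
        "cdfUpper": 1.0 - cdf_lower,  # P(X > k)
    }
```

For example, `negbindist_sketch(["yes", "no"], 1, 1, "yes")` uses p = 0.5 and gives a probability of 0.25 with a lower cumulative value of 0.75.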

Normal Distribution (NORMDIST)

This function calculates probability using the normal distribution, a continuous, symmetric probability distribution in which most data points cluster around the mean within a small range of values, with few outliers.

Wikipedia Reference: Normal Distribution

The general signature for calling the Normal Distribution function is the following:

prefix:normdist(data : double, c : String, x1 : double, x2 : double) 

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data double Column data.
c String User choice of the probability to compute: 'below', 'above', 'bet' (between), or 'out' (outside).
x1 double Lower number x1 to find the probability.
x2 double Upper number x2 to find the probability.
Output:
mean double Mean of the distribution.
stdDev double Standard deviation of the distribution.
variance double Variance of the distribution.
diffEntropy double Differential Entropy in nats.
cdf double Cumulative distribution function which is probability under the area of distribution.
pdfLower double Probability density function value for x1.
pdfUpper double Probability density function value for x2.
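
To make the relationship between the outputs concrete, here is a small Python sketch that fits a normal distribution to a data column and evaluates it. It is not AnzoGraph's implementation: the name `normdist_sketch`, the population (divide-by-n) variance, and the choice to evaluate 'below' and 'above' at x1 are assumptions.

```python
import math

def normdist_sketch(data, choice, x1, x2):
    """Fit a normal distribution to data and evaluate the chosen region."""
    n = len(data)
    mean = sum(data) / n
    # Assumption: population variance (divide by n).
    variance = sum((v - mean) ** 2 for v in data) / n
    std_dev = math.sqrt(variance)

    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / (std_dev * math.sqrt(2.0))))

    def pdf(x):
        z = (x - mean) / std_dev
        return math.exp(-0.5 * z * z) / (std_dev * math.sqrt(2.0 * math.pi))

    # Assumption: 'below' and 'above' are evaluated at x1.
    if choice == "below":
        prob = cdf(x1)
    elif choice == "above":
        prob = 1.0 - cdf(x1)
    elif choice == "bet":
        prob = cdf(x2) - cdf(x1)
    elif choice == "out":
        prob = 1.0 - (cdf(x2) - cdf(x1))
    else:
        raise ValueError("choice must be 'below', 'above', 'bet', or 'out'")

    return {
        "mean": mean,
        "stdDev": std_dev,
        "variance": variance,
        # Differential entropy of a normal distribution, in nats.
        "diffEntropy": 0.5 * math.log(2.0 * math.pi * math.e * variance),
        "cdf": prob,
        "pdfLower": pdf(x1),
        "pdfUpper": pdf(x2),
    }
```

Calling `normdist_sketch([2, 4, 4, 4, 5, 5, 7, 9], "bet", 3, 7)` fits a mean of 5 and a standard deviation of 2, so the interval covers one standard deviation on each side of the mean.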

Pearson Correlation Coefficient (PCC)

The Pearson Correlation Coefficient function determines the extent to which two variables are linearly related: positive, negative, or no relationship.

Wikipedia Reference: Pearson correlation coefficient

The general signature for calling the Pearson Correlation Coefficient function is the following:

prefix:pcc(x : double, y : double)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
x double 1st variable column data.
y double 2nd variable column data.
Output:
coefficient double Extent to which two variables are linearly related.
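
The coefficient is the covariance of the two columns divided by the product of their standard deviations. A minimal Python sketch of that formula follows; the name `pcc_sketch` is hypothetical, and raising an error for a constant column is a choice made here for illustration.

```python
import math

def pcc_sketch(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant column")
    # The 1/n factors cancel, so sums can be used instead of averages.
    return cov / (spread_x * spread_y)
```

A value near 1 indicates a positive linear relationship, a value near -1 a negative one, and a value near 0 no linear relationship.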

Percentile Metric (PERCENTILE)

This function calculates the percentile (1 to 100) of numeric values.

Wikipedia Reference: Percentile Metric

The general signature for calling the Percentile function is the following:

prefix:percentile(data : double, p : double)  

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data double Data set.
p double Percentile to compute, specified as a value in [0, 100].
Output:
percentile double Percentile value.
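
There are several common ways to compute a percentile, and the text above does not say which one the function uses. The sketch below shows one widely used method, linear interpolation between the two closest ranks, purely as an illustration; the interpolation method and the name `percentile_sketch` are assumptions.

```python
def percentile_sketch(data, p):
    """p-th percentile (0 <= p <= 100) using linear interpolation."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be in [0, 100]")
    ordered = sorted(data)
    # Fractional position of the percentile in the sorted data.
    rank = (p / 100.0) * (len(ordered) - 1)
    lower = int(rank)  # rank is non-negative, so int() is a floor
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])
```

With this method, `percentile_sketch([1, 2, 3, 4], 50)` falls halfway between the two middle values.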

Poisson Distribution (POISDIST)

This function calculates probability using a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, given that these events occur at a known constant rate and independently of the time since the last event.

Wikipedia Reference: Poisson Distribution

The general signature for calling the Poisson Distribution function is the following:

prefix:poisdist(data : long, k : long)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data long Column data.
k long Number of events for which to find the probability of observing them in an interval.
Output:
probability double Value of the probability mass function.
cdfLower double Cumulative distribution function value, that is, the probability (<= k) under the area of the distribution.
cdfUpper double Cumulative distribution function value, that is, the probability (> k) under the area of the distribution.
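
The following Python sketch shows how the three outputs can be derived from the Poisson probability mass function. It is an illustration rather than AnzoGraph's implementation: the function names are hypothetical, and estimating the rate as the mean of the data column is an assumption.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events at rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisdist_sketch(data, k):
    # Assumption: the rate is estimated as the mean of the data column.
    lam = sum(data) / len(data)
    cdf_lower = sum(poisson_pmf(i, lam) for i in range(k + 1))
    return {
        "probability": poisson_pmf(k, lam),
        "cdfLower": cdf_lower,        # P(X <= k)
        "cdfUpper": 1.0 - cdf_lower,  # P(X > k)
    }
```

For example, `poisdist_sketch([1, 2, 3], 2)` estimates a rate of 2 events per interval and evaluates the probability of exactly 2 events.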

Principal Component Analysis (PCA)

Applies principal component analysis (PCA) to create combined eigen values and vectors that highlight patterns in a dataset, making it easier to explore and visualize data.

Wikipedia Reference: Principal Component Analysis

There are several different function signatures available for Principal Component Analysis.

The prefix shown in the function signatures below points to the URI <http://cambridgesemantics.com/anzograph/matrices#> location of the AnzoGraph PCA data science functions.

  • pca::create – Apply Principal Component Analysis (PCA) to create combined eigenvalues and eigenvectors.
    prefix:pca::create(x1 : double, x2 : double,...,xn : double)
    Parameter Data Type Description
    Input:
    x1, x2,...,xn double Feature column datasets.
    Output:
    eigen_values_vectors "http://anzograph.com/matrices#feature_result" PCA data containing eigenvalues and eigenvectors.
  • pca::get_eigvec – Get PCA's eigen vectors as a matrix from the PCA data.
    prefix:pca::get_eigvec(pca_data: "http://anzograph.com/matrices#feature_result")
    Parameter Data Type Description
    Input:
    pca_data "http://anzograph.com/matrices#feature_result" Principal Component Analysis data.
    Output:
    eigen_vectors "http://anzograph.com/matrices#tensor" Eigen vectors as a matrix.
  • pca::get_eigval – Get PCA's eigen values as a column vector from PCA data.
    prefix:pca::get_eigval(pca_data : "http://anzograph.com/matrices#feature_result")
    Parameter Data Type Description
    Input:
    pca_data "http://anzograph.com/matrices#feature_result" Principal Component Analysis data.
    Output:
    eigen_values "http://anzograph.com/matrices#tensor" Eigen values in descending order as a column vector.
  • transform – Apply Principal Component Analysis (PCA) to transform the samples onto the new subspace.
    prefix:transform(pca_data : "http://anzograph.com/matrices#feature_result",
       d : int, x1 : double, x2 : double,...,xn : double)
    Parameter Data Type Description
    Input:
    pca_data "http://anzograph.com/matrices#feature_result" PCA data.
    d int Number of eigen vectors to consider from the end.
    x1, x2,...,xn double Feature column data sets.
    Output:
    transformed_data double Sample data transformed into a tuple in the lower-dimensional space.
  • dump_tensor – Get string representation of vector or matrix in row-wise/column-wise order.
    prefix:dump_tensor(m : "http://anzograph.com/matrices#tensor",
       type: int, isRowWise: Boolean)
    Parameter Data Type Description
    Input:
    m "http://anzograph.com/matrices#tensor" A tensor of matrix/row vector/column vector.
    type int Type of tensor: 0 - Row vector, 1 - Column vector, 2 - Matrix. Optional; the default is 2.
    isRowWise Boolean False to display the matrix column-wise. Optional; the default is true.
    Output:
    dump String String representation of the vector or matrix in row-wise/column-wise order.
  • pca::get_raw_eigval – Get PCA's unsorted eigen values from the PCA data.
    prefix:pca::get_raw_eigval(pca_data : "http://anzograph.com/matrices#feature_result")
    Parameter Data Type Description
    Input:
    pca_data "http://anzograph.com/matrices#feature_result" Principal Component Analysis data.
    Output:
    eigen_values "http://anzograph.com/matrices#tensor" Eigen values in unsorted order as a column vector.
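
To show what the eigenvalues and eigenvectors returned by these functions represent, the sketch below works through PCA for the smallest interesting case: two feature columns, where the covariance matrix is 2 x 2 and its eigen decomposition has a closed form. This is a teaching sketch only, not AnzoGraph's implementation. The names `pca_2d` and `transform_2d` are hypothetical, the sample (divide-by-n-1) covariance is an assumption, and the projection takes the leading eigenvectors, which may differ from how the d parameter above selects them.

```python
import math

def pca_2d(x1, x2):
    """PCA for two feature columns: (eigenvalues, eigenvectors, means)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    # Assumption: sample covariance (divide by n - 1).
    a = sum((u - m1) ** 2 for u in x1) / (n - 1)
    c = sum((v - m2) ** 2 for v in x2) / (n - 1)
    b = sum((u - m1) * (v - m2) for u, v in zip(x1, x2)) / (n - 1)

    # Eigenvalues of the symmetric matrix [[a, b], [b, c]], largest first.
    half_trace = (a + c) / 2.0
    spread = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    eigenvalues = [half_trace + spread, half_trace - spread]

    if b == 0.0:
        # Uncorrelated features: the coordinate axes are the eigenvectors.
        eigenvectors = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        eigenvectors = []
        for lam in eigenvalues:
            vx, vy = b, lam - a  # solves (a - lam) * vx + b * vy = 0
            norm = math.hypot(vx, vy)
            eigenvectors.append((vx / norm, vy / norm))
    return eigenvalues, eigenvectors, (m1, m2)

def transform_2d(x1, x2, d=1):
    """Project centered samples onto the d leading eigenvectors."""
    _, eigenvectors, (m1, m2) = pca_2d(x1, x2)
    return [
        tuple((u - m1) * ex + (v - m2) * ey for ex, ey in eigenvectors[:d])
        for u, v in zip(x1, x2)
    ]
```

For perfectly correlated columns such as `[1, 2, 3, 4]` and `[2, 4, 6, 8]`, all of the variance lies along a single direction, so the second eigenvalue is approximately zero and one component is enough to represent the data.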

Quantile/Rank Sketch (KLL)

This function is used to calculate the quantile/rank from a stream of items using the KLL sketch computation model. There are several different signatures available for the Quantile/Rank Sketch function.

Reference: KLL Sketch

The prefix shown in the function signatures below points to the URI <http://cambridgesemantics.com/anzograph/sketch#> location of the AnzoGraph data science functions.

  • kll – Creates a binary image for the KLL sketch.
    prefix:kll(val : Object, k : int)
    Parameter Data Type Description
    Input:
    val Object Input data stream (supporting short, int, long, float, double and string).
    k int Sketch configuration parameter, which affects the size of the sketch and its estimation error. This parameter is optional; the default value is 200. k can be any value between 8 and 65535, inclusive. The default k = 200 results in a normalized rank error of about 1.65%. Higher values of k yield a smaller error, but the sketch is larger (and slower).
    Output:
    kll_sketch "http://anzograph.com/statistics#kll_sketch" Binary stream containing KLL sketch data.
  • kll::get_min_value – Gets the minimum value of the stream.
    prefix:kll::get_min_value(kll_sketch : "http://anzograph.com/statistics#kll_sketch")
    Parameter Data Type Description
    Input:
    kll_sketch "http://anzograph.com/statistics#kll_sketch" Binary stream containing KLL sketch data.
    Output:
    double_val double The min value of the stream.
    string_val String The string having min value when input stream is of string type.
  • kll::get_max_value – Gets the maximum value of the stream.
    prefix:kll::get_max_value(kll_sketch : "http://anzograph.com/statistics#kll_sketch")
    Parameter Data Type Description
    Input:
    kll_sketch "http://anzograph.com/statistics#kll_sketch" Binary stream containing KLL sketch data.
    Output:
    double_val double The maximum value of the stream.
    string_val String The maximum string value when the input stream is of string type.
  • kll::get_n – Gets the stream length.
    prefix:kll::get_n(kll_sketch : "http://anzograph.com/statistics#kll_sketch")
    Parameter Data Type Description
    Input:
    kll_sketch "http://anzograph.com/statistics#kll_sketch" Binary stream containing KLL sketch data.
    Output:
    n long The length of the input stream.
  • kll::get_num_retained – Gets the number of retained items (samples) in the sketch.
    prefix:kll::get_num_retained(kll_sketch : "http://anzograph.com/statistics#kll_sketch")
    Parameter Data Type Description
    Input:
    kll_sketch "http://anzograph.com/statistics#kll_sketch" Binary stream containing KLL sketch data.
    Output:
    n long The number of retained items (samples) in the sketch.
  • kll::get_rank – Gets an approximation to the normalized (fractional) rank of the given value from 0 to 1, inclusive.
    prefix:kll::get_rank(kll_sketch : "http://anzograph.com/statistics#kll_sketch",
       v : double)
    Parameter Data Type Description
    Input:
    kll_sketch "http://anzograph.com/statistics#kll_sketch" Binary stream containing KLL sketch data.
    v double Item to be ranked.
    Output:
    r double An approximate rank of the given item.
  • kll::get_quantile – Gets an approximation to the value of the data item from the rank.
    prefix:kll::get_quantile(kll_sketch : "http://anzograph.com/statistics#kll_sketch",
       fraction : double)
    Parameter Data Type Description
    Input:
    kll_sketch "http://anzograph.com/statistics#kll_sketch" Binary stream containing KLL sketch data.
    fraction double The specified fractional position in the hypothetical sorted stream.
    Output:
    v double An approximation to the value of the data item that would be preceded by the given fraction of a hypothetical sorted version of the input stream so far.
    stringVal String An approximation to the string when the input stream is of string type.
  • kll::get_quantiles – Provides more efficient multiple-query version of kll::get_quantile() and allows the caller to specify the number of evenly spaced fractional ranks.
    prefix:kll::get_quantiles(kll_sketch : "http://anzograph.com/statistics#kll_sketch", 
      f1 : double, f2 : double, ..., f10 : double)
    Parameter Data Type Description
    Input:
    kll_sketch "http://anzograph.com/statistics#kll_sketch" Binary stream containing KLL sketch data.
    f1, f2, ... f10 double Given fractional positions in the hypothetical sorted stream. These are also called normalized ranks or fractional ranks. These fractions must be in the interval [0.0, 1.0], inclusive.
    Output:
    v1, v2, ...v10 double An approximation to the values in the same order as the given fractional positions.
  • kll::get_quantiles_str – Provides an approximation to the strings when the input stream is of string type.
    prefix:kll::get_quantiles_str(kll_sketch : "http://anzograph.com/statistics#kll_sketch",
       f1 : double, f2 : double, ..., f10 : double)
    Parameter Data Type Description
    Input:
    kll_sketch "http://anzograph.com/statistics#kll_sketch" Binary stream containing KLL sketch data.
    f1, f2, ... f10 double Given fractional positions in the hypothetical sorted stream. These are also called normalized ranks or fractional ranks. These fractions must be in the interval [0.0, 1.0], inclusive.
    Output:
    v1, v2, ...v10 string An approximation to the strings when input stream is of string type.
  • kll::get_pmf – Provides an approximation to the Probability Mass Function (PMF) of the input stream given the values.
    prefix:kll::get_pmf(kll_sketch : "http://anzograph.com/statistics#kll_sketch",
       v1 : Object, v2 : Object, ..., v10 : Object)
    Parameter Data Type Description
    Input:
    kll_sketch "http://anzograph.com/statistics#kll_sketch" Binary stream containing KLL sketch data.
    v1, v2, ...v10 Object Input values between the min and max values of the input stream. Values must be unique and monotonically increasing.
    Output:
    r1, r2, ...r10 double PMF values corresponding to the input.
  • kll::get_cdf – Provides an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF of the input stream, given the values.
    prefix:kll::get_cdf(kll_sketch : "http://anzograph.com/statistics#kll_sketch",
       v1 : Object, v2 : Object, ..., v10 : Object)
    Parameter Data Type Description
    Input:
    kll_sketch "http://anzograph.com/statistics#kll_sketch" Binary stream containing KLL sketch data.
    v1, v2, ...v10 Object Input values between the min and max values of the input stream. Values must be unique and monotonically increasing.
    Output:
    r1, r2, ...r10 double CDF values corresponding to the input.
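
To make the rank, quantile, CDF, and PMF terminology above concrete, the following Python sketch computes the exact quantities that a KLL sketch approximates. It is not AnzoGraph code and does not implement the KLL algorithm: it sorts the whole stream in memory. The function names are hypothetical, and the convention that a rank counts items strictly less than the value is an assumption.

```python
from bisect import bisect_left

def exact_rank(sorted_items, v):
    """Normalized rank of v: fraction of items strictly less than v (assumed convention)."""
    return bisect_left(sorted_items, v) / len(sorted_items)

def exact_quantile(sorted_items, fraction):
    """Item found at the given fractional position of the sorted stream."""
    n = len(sorted_items)
    index = min(int(fraction * n), n - 1)  # clamp so that fraction = 1.0 is valid
    return sorted_items[index]

def exact_cdf(sorted_items, split_points):
    """Cumulative fraction of the stream below each split point."""
    return [exact_rank(sorted_items, v) for v in split_points]

def exact_pmf(sorted_items, split_points):
    """Fraction of the stream that falls between consecutive split points."""
    cdf = exact_cdf(sorted_items, split_points)
    return [hi - lo for lo, hi in zip([0.0] + cdf[:-1], cdf)]

stream = sorted(range(1, 101))          # the values 1..100
rank_of_51 = exact_rank(stream, 51)     # half of the items are below 51
median_item = exact_quantile(stream, 0.5)
cdf_values = exact_cdf(stream, [26, 51, 76])
pmf_values = exact_pmf(stream, [26, 51, 76])
```

A KLL sketch answers the same questions from a bounded number of retained samples, so its results differ from these exact values by up to the normalized rank error described for the k parameter.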

Singular Value Decomposition (SVD)

Singular value decomposition (SVD), a matrix factorization method, creates combined singular values and right singular vectors.

Wikipedia Reference: Singular Value Decomposition

There are several different function signatures available for Singular Value Decomposition.

The prefix shown in the function signatures below points to the URI <http://cambridgesemantics.com/anzograph/matrices#> location of the AnzoGraph SVD data science functions.

  • svd::create – Apply Singular Value Decomposition (SVD) to create combined singular values and right singular vectors.
    prefix:svd::create(x1 : double, x2 : double,...,xn : double)
    Parameter Data Type Description
    Input:
    x1, x2,...,xn double Feature column data sets.
    Output:
    svd_data "http://anzograph.com/matrices#feature_result" SVD data containing singular values and right singular vectors.
  • svd::get_sigval – Get SVD's singular values as a column vector from the SVD data.
    prefix:svd::get_sigval(svd_data : "http://anzograph.com/matrices#feature_result")
    Parameter Data Type Description
    Input:
    svd_data "http://anzograph.com/matrices#feature_result" SVD data.
    Output:
    singular_values "http://anzograph.com/matrices#tensor" Singular values in descending order as a column vector.
  • svd::get_sigvec – Get SVD's singular vector as a matrix from the SVD data.
    prefix:svd::get_sigvec(svd_data : "http://anzograph.com/matrices#feature_result")
    Parameter Data Type Description
    Input:
    svd_data "http://anzograph.com/matrices#feature_result" SVD data.
    Output:
    singular_vector "http://anzograph.com/matrices#tensor" Right singular vectors as a matrix.
  • transform – Apply PCA or SVD to transform the samples onto the new subspace.
    prefix:transform(svd_data : "http://anzograph.com/matrices#feature_result", 
       d : int, x1, x2,...,xn : double)
    Parameter Data Type Description
    Input:
    svd_data "http://anzograph.com/matrices#feature_result" SVD data.
    d int Number of singular vectors to consider from the start.
    x1, x2,...,xn double Feature column data sets.
    Output:
    transformed_data String Sample data transformed into the tuple of lower dimensional space.
  • dump_tensor – Return string representation of vector or matrix in row-wise/column-wise order.
    prefix:dump_tensor(m : Blob, type : Int, isRowWise : boolean)
    Parameter Data Type Description
    Input:
    m "http://anzograph.com/matrices#tensor" A matrix, row vector, or column vector.
    type Int Type of tensor: 0 = row vector; 1 = column vector; 2 = matrix.
    isRowWise boolean True to display the matrix row-wise.
    Output:
    dump String String representation of the vector or matrix in row-wise or column-wise order.
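
As a rough illustration of what singular values, right singular vectors, and the transform step mean, the Python sketch below finds the largest singular value and its right singular vector by power iteration, then projects a sample onto that vector. This is a swapped-in textbook technique chosen for brevity, not the method AnzoGraph uses, and the function names are hypothetical.

```python
from math import sqrt

def top_right_singular(rows, iterations=200):
    """Power iteration on X^T X; returns (largest singular value, right singular vector)."""
    n = len(rows[0])
    # Gram matrix G = X^T X (n x n); its eigenvectors are the right singular vectors of X.
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    v = [1.0] + [0.0] * (n - 1)  # start vector; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # The Rayleigh quotient v^T G v is the top eigenvalue of G; sigma is its square root.
    gv = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
    sigma = sqrt(sum(a * b for a, b in zip(v, gv)))
    return sigma, v

def project(sample, vector):
    """Coordinate of one sample along one singular vector (a d = 1 transform)."""
    return sum(a * b for a, b in zip(sample, vector))

X = [[3.0, 1.0], [1.0, 3.0]]   # two samples, two feature columns
sigma, v1 = top_right_singular(X)
reduced = project(X[0], v1)    # first sample reduced to one dimension
```

For this matrix the singular values are 4 and 2, so the iteration converges to sigma = 4 with the vector pointing along the diagonal.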

Skellam Distribution (SKELLAMDIST)

This function calculates probability using the Skellam distribution, which models the difference between two independent Poisson-distributed variables.

Wikipedia Reference: Skellam Distribution

The general signature for calling the Skellam Distribution function is the following:

prefix:skellamdist(N1_data : long, N2_data : long, k : long) 

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
N1_data long N1 Column data.
N2_data long N2 Column data.
k long The value for which to find the probability.
Output:
probability double Skellam probability value.
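
The Python sketch below shows one way to evaluate a Skellam probability from first principles, by summing over the pairs of Poisson outcomes whose difference is k. It is illustrative only. Taking the mean of each column as the Poisson rate is an assumption about how N1_data and N2_data are used, and the function names and sample values are hypothetical.

```python
from math import exp, lgamma, log

def poisson_pmf(n, mu):
    """P(N = n) for a Poisson variable with rate mu > 0, computed in log space."""
    return exp(n * log(mu) - mu - lgamma(n + 1))

def skellam_pmf(k, mu1, mu2, terms=200):
    """P(N1 - N2 = k) as a truncated sum over all pairs (k + j, j)."""
    start = max(0, -k)  # keep both counts non-negative
    return sum(poisson_pmf(k + j, mu1) * poisson_pmf(j, mu2)
               for j in range(start, start + terms))

# Hypothetical column data; the rates are assumed to be the column means.
n1_data = [3, 5, 4, 4]
n2_data = [2, 1, 3, 2]
mu1 = sum(n1_data) / len(n1_data)
mu2 = sum(n2_data) / len(n2_data)
probability = skellam_pmf(1, mu1, mu2)
```

Production implementations usually evaluate the closed form based on the modified Bessel function instead of a truncated sum.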

Skew Metric (SKEWCOEFF)

This function calculates Pearson’s coefficients of skewness on numeric values.

Wikipedia Reference: Skewness

The general signature for calling the Skew Metric function is the following:

prefix:skewcoeff(data : double, dp : int) 

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data double Data set.
dp int Number of decimal points to consider for the input data.
Output:
mode double Value that appears most often (highest frequency).
median double The middle number in an ordered set of data.
mean double Average value.
stdDev double Standard deviation.
modeSkewnessCoef double Pearson mode skewness or first skewness coefficient.
medianSkewnessCoef double The Pearson median skewness or second skewness coefficient.
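
The two Pearson coefficients listed above follow the standard definitions: (mean - mode) / stdDev for the first and 3 * (mean - median) / stdDev for the second. The Python sketch below computes them with the standard library. Rounding the input to dp decimal places before computing the statistics and using the sample standard deviation are both assumptions about how the function behaves, and the function name is hypothetical.

```python
from statistics import mean, median, mode, stdev

def pearson_skewness(data, dp):
    """Compute the six outputs named in the table above for a list of numbers."""
    rounded = [round(x, dp) for x in data]  # assumed role of the dp parameter
    m = mean(rounded)
    med = median(rounded)
    mo = mode(rounded)
    sd = stdev(rounded)                     # sample standard deviation (assumption)
    return {
        "mode": mo,
        "median": med,
        "mean": m,
        "stdDev": sd,
        "modeSkewnessCoef": (m - mo) / sd,
        "medianSkewnessCoef": 3 * (m - med) / sd,
    }

# 2.04 and 1.96 both round to 2.0 at one decimal place, which makes 2.0 the mode.
result = pearson_skewness([1.0, 2.04, 1.96, 3.0, 4.0, 6.0], dp=1)
```

Both coefficients are positive here because the long tail of the sample lies to the right of the mode and the median.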

Spearman Correlation Coefficient (SCC)

The Spearman Correlation Coefficient function determines how well the relationship between two variables can be described using a monotonic function.

Wikipedia Reference: Spearman's Correlation Coefficient

The general signature for calling the Spearman Correlation Coefficient function is the following:

prefix:scc(rank_X : double, rank_Y : double) 

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
rank_X double X ranked data.
rank_Y double Y ranked data.
Output:
coefficient double Coefficient between ranked data.
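
Because the inputs are already ranks, the Spearman coefficient is the Pearson correlation of the two rank columns. The Python sketch below shows that computation. It is not AnzoGraph code, and the function name and sample ranks are hypothetical.

```python
from math import sqrt

def spearman_from_ranks(rank_x, rank_y):
    """Pearson correlation of two equally long lists of ranks."""
    n = len(rank_x)
    mean_x = sum(rank_x) / n
    mean_y = sum(rank_y) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(rank_x, rank_y))
    var_x = sum((x - mean_x) ** 2 for x in rank_x)
    var_y = sum((y - mean_y) ** 2 for y in rank_y)
    return cov / sqrt(var_x * var_y)

rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]   # two adjacent pairs swapped
coefficient = spearman_from_ranks(rank_x, rank_y)
```

When there are no tied ranks, this agrees with the shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between paired ranks.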

Student's T-Distribution (TDIST)

This function calculates probability using the Student's t-distribution (and associated t scores), which is often used in hypothesis testing when the sample size is small or when the population variance is unknown.

Wikipedia Reference: Student's t-distribution

The general signature for calling the Student's t-distribution function is the following:

prefix:tdist(data : double, M : double) 

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data double Sample data.
M double Population mean.
Output:
mean double Mean of the distribution.
stdDev double Standard deviation of the distribution.
variance double Variance of the distribution.
t-statistics double t = [ u - M ] / [ s / sqrt( N ) ], where u is the sample mean, M is the population mean, s is the standard deviation of the sample, and N is the sample size.
count double Number of samples; the degrees of freedom is (count-1).
pdf double Probability value, which is the value of the probability density function.
cdf double Cumulative distribution function, which is the probability for <= t-statistics.
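
The t-statistics formula in the table can be checked with a few lines of Python. The sketch below computes t from a sample and a population mean and evaluates the standard t-distribution density for count - 1 degrees of freedom. It is illustrative only, the function names are hypothetical, and the CDF is omitted because it needs an incomplete beta function that the Python standard library does not provide.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

def t_statistic(data, M):
    """t = (u - M) / (s / sqrt(N)) using the sample standard deviation s."""
    n = len(data)
    return (mean(data) - M) / (stdev(data) / sqrt(n))

def t_pdf(t, dof):
    """Density of Student's t-distribution with dof degrees of freedom."""
    coeff = gamma((dof + 1) / 2) / (sqrt(dof * pi) * gamma(dof / 2))
    return coeff * (1 + t * t / dof) ** (-(dof + 1) / 2)

sample = [1, 2, 3, 4, 5]
t = t_statistic(sample, M=2)             # sample mean 3 against population mean 2
density = t_pdf(t, dof=len(sample) - 1)  # degrees of freedom = count - 1
```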

TDigest Metric (TDIGEST)

This function creates an estimate of the median (and more generally, any percentile) from either distributed data or streaming data, using a t-Digest probabilistic data structure.

Wikipedia Reference: Computing Quantiles using T-Digests

The general signature for calling the T-Digest function is the following:

prefix:tdigest(data : double, p : double, q : double, cdf : double)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics/sketch#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data double Column data.
p double The percentile to compute, specified as a value in [0, 100].
q double The quantile to compute, specified as a value in [0.0, 1.0].
cdf double The value at which to compute the CDF.
Output:
percentile double Value below which a given percentage of observations in a group of observations falls.
quantile double Cut point that divides the observations in a sample.
cdf double The computation of F(x); F denotes the CDF of the distribution.

Theta Sketch (THETA)

This function is used to perform estimates of set operations, Union, Intersection, and Difference, all using the Theta Sketch framework. There are several different function signatures available for Theta Sketch estimate calculations. Theta Sketches are a generalization of the well known Kth Minimum Value (KMV) sketches.

Reference: The Theta Sketch Framework

The prefix shown in the function signatures below points to the URI <http://cambridgesemantics.com/anzograph/sketch#> location of the AnzoGraph data science functions.

  • theta – Creates a binary image for the theta sketch.
    prefix:theta(val : Object)
    Parameter Data Type Description
    Input:
    val Object Data set (supporting short, int, long, float, double and string).
    Output:
    theta_sketch "http://anzograph.com/statistics#theta_sketch" Binary stream containing theta sketch data.
  • theta::cardinality – Gets the Cardinality estimate of the input stream.
    prefix:theta::cardinality(theta_sketch : "http://anzograph.com/statistics#theta_sketch")
    Parameter Data Type Description
    Input:
    theta_sketch "http://anzograph.com/statistics#theta_sketch" Binary stream containing theta sketch data.
    Output:
    n double Sketch's best estimate of the cardinality of the input stream.
  • theta::union – Gets the Cardinality estimate of the union of the input streams.
    prefix:theta::union(theta_sketch... : "http://anzograph.com/statistics#theta_sketch")
    Parameter Data Type Description
    Input:
    theta_sketch... "http://anzograph.com/statistics#theta_sketch" Binary stream containing theta sketch data. You can provide as many sketches as you want into the input as indicated by '...' in the signature.
    Output:
    n double Sketch's best estimate of the union of the input streams.
  • theta::intersection – Gets the Cardinality estimate of the intersection of the input streams.
    prefix:theta::intersection(theta_sketch... : "http://anzograph.com/statistics#theta_sketch")
    Parameter Data Type Description
    Input:
    theta_sketch... "http://anzograph.com/statistics#theta_sketch" Binary stream containing theta sketch data. You can provide as many sketches as you want into the input as indicated by '...' in the signature.
    Output:
    n double Sketch's best estimate of the intersection of the input streams.
  • theta::difference – Gets the Cardinality estimate of the set difference operation (A and not B).
    prefix:theta::difference(a : "http://anzograph.com/statistics#theta_sketch",
       b : "http://anzograph.com/statistics#theta_sketch")
    Parameter Data Type Description
    Input:
    a "http://anzograph.com/statistics#theta_sketch" Binary stream containing theta sketch for data set A.
    b "http://anzograph.com/statistics#theta_sketch" Binary stream containing theta sketch for data set B.
    Output:
    n double Sketch's best estimate of the set difference operation (A and not B).
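
The Python sketch below illustrates the idea behind these set operations with a simplified K Minimum Values style sketch: each item is hashed to a number in [0, 1), only the k smallest hashes are kept, and the threshold theta records how much of the hash space was sampled. It is a teaching aid, not AnzoGraph's implementation or its binary format, and every name in it is hypothetical.

```python
import hashlib

def _hash01(item):
    """Deterministically map an item to a pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(repr(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def make_sketch(items, k=512):
    """Return (theta, retained hashes); theta stays 1.0 until more than k distinct items are seen."""
    hashes = sorted({_hash01(x) for x in items})
    if len(hashes) <= k:
        return 1.0, set(hashes)
    return hashes[k], set(hashes[:k])  # theta is the (k+1)-th smallest hash

def estimate(sketch):
    """Cardinality estimate: retained count scaled by the sampled fraction theta."""
    theta, hashes = sketch
    return len(hashes) / theta

def union(a, b):
    theta = min(a[0], b[0])
    return theta, {h for h in a[1] | b[1] if h < theta}

def intersection(a, b):
    theta = min(a[0], b[0])
    return theta, {h for h in a[1] & b[1] if h < theta}

def difference(a, b):
    """A and not B."""
    theta = min(a[0], b[0])
    return theta, {h for h in a[1] - b[1] if h < theta}

set_a = make_sketch(range(0, 100))    # fewer than k items, so results are exact
set_b = make_sketch(range(50, 150))
union_size = estimate(union(set_a, set_b))
common_size = estimate(intersection(set_a, set_b))
only_a_size = estimate(difference(set_a, set_b))
large_estimate = estimate(make_sketch(range(20000)))  # approximate once n exceeds k
```

While a stream has at most k distinct items the sketch holds every hash and the estimates are exact. Beyond that point the results become estimates whose relative error shrinks roughly with the square root of k.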

Weibull Distribution (WEIBULDIST)

This function calculates probability from the Weibull distribution, a continuous probability distribution commonly used to assess product reliability and to analyze product life data and failure times.

Wikipedia Reference: Weibull Distribution

The general signature for calling the Weibull Distribution function is the following:

prefix:weibuldist(data : double, k : double, x : double)

Where prefix points to the URI <http://cambridgesemantics.com/anzograph/statistics#> location of the AnzoGraph data science functions.

Parameter Data Type Description
Input:
data double Sample data.
k double The initial starting value for the shape parameter. A good guess is crucial to quick convergence.
x double The value of the random variable for which to find the probability.
Output:
mean double Mean of the distribution.
stdDev double Standard deviation of the distribution.
variance double Variance of the distribution.
count long Number of samples.
ShapeParam double Estimated shape parameter (k) of the distribution from mean and variance using root finding method.
ScaleParam double Estimated scale parameter (a) of the distribution from mean and variance using root finding method.
diffEntropy double Differential Entropy in nats.
pdf double Probability value, which is the value of the probability density function.
cdfLower double Cumulative distribution function, which is the probability (<=x) under the area of the distribution.
cdfUpper double Complementary cumulative distribution function, which is the probability (>x) under the area of the distribution.
maxit long Actual number of iterations performed to get an estimate of the k value.
estimatedMean double Mean calculated using estimated values of k and a.
estimatedVariance double Variance calculated using estimated values of k and a.
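
The shape and scale outputs are described as estimates obtained from the mean and variance by root finding. The Python sketch below shows one way such an estimate can be made with the method of moments: it solves variance / mean^2 = Gamma(1 + 2/k) / Gamma(1 + 1/k)^2 - 1 for k and then derives the scale a from the mean. It uses bisection over a fixed bracket rather than an initial guess, which is an assumption made for simplicity and not necessarily the root finder AnzoGraph uses. All names are hypothetical.

```python
from math import exp, gamma

def fit_weibull(mean, variance, lo=0.1, hi=50.0, iterations=200):
    """Estimate shape k and scale a from the mean and variance by bisection."""
    target = variance / mean ** 2  # squared coefficient of variation

    def f(k):
        # Decreases as k grows, so the root can be bracketed between lo and hi.
        return gamma(1 + 2 / k) / gamma(1 + 1 / k) ** 2 - 1 - target

    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid   # k is too small: the spread is still larger than the target
        else:
            hi = mid
    k = (lo + hi) / 2
    a = mean / gamma(1 + 1 / k)
    return k, a

def weibull_pdf(x, k, a):
    return (k / a) * (x / a) ** (k - 1) * exp(-((x / a) ** k))

def weibull_cdf(x, k, a):
    return 1 - exp(-((x / a) ** k))

# Mean 2 and variance 4 describe an exponential distribution, so k = 1 and a = 2.
shape, scale = fit_weibull(2.0, 4.0)
cdf_lower = weibull_cdf(2.0, shape, scale)
cdf_upper = 1 - cdf_lower
```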