Sketch Library

The sketch library provides efficient streaming algorithms that approximate calculations such as count distinct, quantiles, most frequent items, joins, and matrix computations. Most of the functions return data sketches, compact binary summaries from which the retrieval functions extract values. This topic describes each of the sketch functions.

The URI for the sketch functions is <http://cambridgesemantics.com/anzograph/statistics/sketch#>. For readability, the syntax for each function below includes the prefix sketch:, defined as PREFIX sketch: <http://cambridgesemantics.com/anzograph/statistics/sketch#>.

  • Cardinality Metric (HLL): Uses Apache DataSketches HyperLogLog (HLL) to calculate cardinality estimates for a dataset.
  • Frequent Items (FI): Collection of functions used to create frequency sketches and obtain information about frequent items.
  • Quantile/Rank Sketch (KLL): Collection of functions that use the KLL sketch computation model to approximate minimum and maximum items in a dataset, the quantile and rank of items, the Probability Mass Function (PMF), and the Cumulative Distribution Function (CDF).
  • Theta Sketch (THETA): Collection of functions that use the Theta Sketch framework to compute estimates of the cardinality, union, intersection, and difference set operations and return a Theta Sketch.

Cardinality Metric (HLL)

This aggregate calculates cardinality estimates for a dataset using Apache DataSketches HyperLogLog (HLL).

Reference: Cardinality Prominence Metric

Syntax

sketch:hll(data [, log_base_2_K ] [, hll_target_type ]) 
Parameter Type Description
data byte, short, int, long, float, double, string, URI The dataset.
log_base_2_K int Optional argument that specifies the log base 2 of K, where K is the number of buckets or slots for the sketch. Must be between 4 and 21 (inclusive). Default value is 12.
hll_target_type int Optional argument that specifies the target type for the HLL sketch. Supported values are 4 (HLL_4), 6 (HLL_6), or 8 (HLL_8). Default value is 4.

Returns

Type Description
double The cardinality metric value.
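
To make the log_base_2_K parameter concrete, the following Python sketch implements the textbook HyperLogLog estimator with K = 2^log_base_2_K buckets. It is an illustration under stated assumptions, not the Apache DataSketches or AnzoGraph implementation: the function name hll_estimate, the SHA-256 hashing, and the small-range correction are choices made for this example only.

```python
import hashlib
import math


def hll_estimate(items, log_base_2_k=12):
    """Textbook HyperLogLog estimate of the number of distinct items.

    Hypothetical helper for illustration; not the AnzoGraph implementation.
    """
    if not 4 <= log_base_2_k <= 21:
        raise ValueError("log_base_2_k must be between 4 and 21 (inclusive)")
    k = 1 << log_base_2_k            # K buckets, K = 2 ** log_base_2_k
    rest_bits = 64 - log_base_2_k    # hash bits left after picking a bucket
    registers = [0] * k

    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")    # 64-bit hash value
        bucket = h >> rest_bits                  # top bits select the bucket
        rest = h & ((1 << rest_bits) - 1)        # remaining low bits
        # 1-based position of the leftmost 1-bit in the remaining bits.
        rank = rest_bits - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)

    # Bias-correction constant from the original HyperLogLog paper.
    if k == 16:
        alpha = 0.673
    elif k == 32:
        alpha = 0.697
    elif k == 64:
        alpha = 0.709
    else:
        alpha = 0.7213 / (1 + 1.079 / k)

    raw = alpha * k * k / sum(2.0 ** -r for r in registers)
    empty = registers.count(0)
    if raw <= 2.5 * k and empty > 0:
        # Small-range correction: fall back to linear counting.
        return k * math.log(k / empty)
    return raw


# Every value appears three times, so the true cardinality is 10,000.
distinct = hll_estimate(list(range(10_000)) * 3)
```

In the textbook algorithm the relative standard error is roughly 1.04 / sqrt(K), so raising log_base_2_K trades memory for accuracy. The error characteristics of the DataSketches HLL variants may differ from this figure.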

Frequent Items (FI)

The FI aggregate estimates the frequency of items in a dataset, the upper and lower bounds of those estimates, the number of active items, and the total stream weight. FI returns a binary stream (Frequent Items Sketch) containing all of the computed values. Values can be retrieved from the sketch using the Frequent Items Sketch Retrieval Functions: get_estimates, get_active_items_total_weights, get_top_items, and get_top_strings.

For more information about frequency sketches, see Frequency Sketches Overview.

FI Syntax

sketch:fi(values [, weight ])
Parameter Type Description
values short, int, long, float, double, string The dataset.
weight long Optional argument that specifies the weight of each value in values. The default value is 1.

Returns

Type Description
http://anzograph.com/statistics#fi_sketch Binary Frequent Items Sketch.

Frequent Items Sketch Retrieval Functions

The following functions are available for retrieving values from a Frequent Items Sketch:

fi::get_estimates

Returns the estimates for the frequency and lower and upper bound of the given item in a sketch.

Syntax

sketch:fi::get_estimates(fi_sketch, item) 
Parameter Type Description
fi_sketch http://anzograph.com/statistics#fi_sketch Frequent Items Sketch.
item Object Item for which to get estimates.

Returns

Type Description
long Frequency estimate for the item.
long Lower bound estimate for the item.
long Upper bound estimate for the item.
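
The relationship between an item's true frequency and its lower and upper bounds can be sketched with a small weighted Misra-Gries counter, a classic frequent-items algorithm. This is an assumption-laden illustration: the class MisraGries, its max_counters parameter, and its purge rule are hypothetical and are not the algorithm or API that AnzoGraph uses.

```python
class MisraGries:
    """Minimal weighted Misra-Gries summary (illustrative, hypothetical)."""

    def __init__(self, max_counters):
        self.max_counters = max_counters
        self.counters = {}       # item -> retained count (a lower bound)
        self.offset = 0          # total amount subtracted by purges so far
        self.total_weight = 0    # sum of all weights seen in the stream

    def update(self, item, weight=1):
        self.total_weight += weight
        self.counters[item] = self.counters.get(item, 0) + weight
        if len(self.counters) > self.max_counters:
            # Purge: subtract the smallest count from every counter and
            # drop the counters that reach zero.
            dec = min(self.counters.values())
            self.offset += dec
            self.counters = {
                i: c - dec for i, c in self.counters.items() if c > dec
            }

    def bounds(self, item):
        """Return (lower_bound, upper_bound) for the item's true frequency."""
        lower = self.counters.get(item, 0)
        return lower, lower + self.offset


sketch = MisraGries(max_counters=5)
stream = ["a"] * 50 + ["b"] * 30 + [f"x{i}" for i in range(20)]
for value in stream:
    sketch.update(value)          # default weight of 1 per value

lower_a, upper_a = sketch.bounds("a")   # true frequency of "a" is 50
lower_b, upper_b = sketch.bounds("b")   # true frequency of "b" is 30
```

The true frequency always lies between the two bounds, and the gap between them grows with the number of purges, which is why a sketch with more counters gives tighter bounds.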

fi::get_active_items_total_weights

Returns the number of active items and the estimated total stream weight from a sketch.

Syntax

sketch:fi::get_active_items_total_weights(fi_sketch)
Parameter Type Description
fi_sketch http://anzograph.com/statistics#fi_sketch Frequent Items Sketch.

Returns

Type Description
long The estimated number of active items.
long The estimated total stream weight.

fi::get_top_items

Returns the most frequent items and their corresponding frequency estimates.

Syntax

sketch:fi::get_top_items(fi_sketch)
Parameter Type Description
fi_sketch http://anzograph.com/statistics#fi_sketch Frequent Items Sketch.

Returns

Type Description
double The item with the highest frequency.
long Frequency estimate of the first item.
double The item with the second highest frequency.
long Frequency estimate of the second item.
double The item with the nth highest frequency.
long Frequency estimate of the nth item.

fi::get_top_strings

Returns the most frequent strings and their corresponding frequency estimates.

Syntax

sketch:fi::get_top_strings(fi_sketch)
Parameter Type Description
fi_sketch http://anzograph.com/statistics#fi_sketch Frequent Items Sketch.

Returns

Type Description
string The string with the highest frequency.
long Frequency estimate of the first string.
string The string with the second highest frequency.
long Frequency estimate of the second string.
string The string with the nth highest frequency.
long Frequency estimate of the nth string.
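
The return tables for fi::get_top_items and fi::get_top_strings list items and frequencies in alternating order. Assuming the values arrive in a client as one flat sequence in that order (an assumption, since the delivery format is not specified here), a small Python helper can regroup them into pairs. The helper name and the sample values are hypothetical.

```python
def pair_items_with_frequencies(flat_row):
    """Group [item1, freq1, item2, freq2, ...] into [(item1, freq1), ...]."""
    if len(flat_row) % 2 != 0:
        raise ValueError("expected an even number of values")
    # Even positions hold items, odd positions hold their frequencies.
    return list(zip(flat_row[0::2], flat_row[1::2]))


row = ["apple", 42, "pear", 17, "fig", 3]   # made-up values for illustration
pairs = pair_items_with_frequencies(row)
```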

Quantile/Rank Sketch (KLL)

The KLL aggregate uses the KLL Sketch computation model to calculate the approximate minimum and maximum items in a dataset, the quantile and rank of items, the Probability Mass Function (PMF), and the Cumulative Distribution Function (CDF). KLL returns a binary stream (KLL Sketch) containing all of the computed values. Values can be retrieved from the sketch using the KLL Sketch Retrieval Functions.

For more information about KLL sketches, see KLL Sketch.

KLL Syntax

sketch:kll(values [, k ])
Parameter Type Description
values short, int, long, float, double, string The dataset.
k int Optional argument that configures the size of the sketch and its estimation error. Can be any value between 8 and 65535 (inclusive). The default value is 200, which results in a normalized rank error of about 1.65%. Higher values will have a smaller error but the sketch will be larger (and slower).

Returns

Type Description
http://anzograph.com/statistics#kll_sketch Binary KLL sketch.
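
The trade-off between k and the estimation error can be approximated with a short Python function. The constants below are an empirical fit associated with the Apache DataSketches KLL code, quoted here from memory, so treat them as an assumption rather than a guarantee; the function name is hypothetical. With k = 200 the fit reproduces the figure of about 1.65% quoted above.

```python
def kll_normalized_rank_error(k, pmf=True):
    """Approximate normalized rank error of a KLL sketch of size k.

    pmf=True gives the "double-sided" error that applies to PMF-style
    queries; pmf=False gives the smaller single-sided rank error.
    The constants are an assumed empirical fit, not an exact bound.
    """
    if not 8 <= k <= 65535:
        raise ValueError("k must be between 8 and 65535 (inclusive)")
    if pmf:
        return 2.446 / k ** 0.9433
    return 2.296 / k ** 0.9723


default_error = kll_normalized_rank_error(200)   # roughly 0.0165
larger_k_error = kll_normalized_rank_error(400)  # smaller error, larger sketch
```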

KLL Sketch Retrieval Functions

The following functions are available for retrieving values from a KLL sketch:

kll::get_min_value

Returns the minimum value in a KLL sketch.

Syntax

sketch:kll::get_min_value(kll_sketch)
Parameter Type Description
kll_sketch http://anzograph.com/statistics#kll_sketch KLL sketch.

Returns

Type Description
double The minimum value in the sketch.
string If the input is a string, the minimum string is returned.

kll::get_max_value

Returns the maximum value in a KLL sketch.

Syntax

sketch:kll::get_max_value(kll_sketch)
Parameter Type Description
kll_sketch http://anzograph.com/statistics#kll_sketch KLL sketch.

Returns

Type Description
double The maximum value in the sketch.
string If the input is a string, the maximum string is returned.

kll::get_n

Returns the length of the input stream summarized by a KLL sketch, that is, the number of items the sketch has processed.

Syntax

sketch:kll::get_n(kll_sketch)
Parameter Type Description
kll_sketch http://anzograph.com/statistics#kll_sketch KLL sketch.

Returns

Type Description
long The number of items in the input stream that the sketch has processed.

kll::get_num_retained

Returns the number of retained items (samples) in a sketch.

Syntax

sketch:kll::get_num_retained(kll_sketch)
Parameter Type Description
kll_sketch http://anzograph.com/statistics#kll_sketch KLL sketch.

Returns

Type Description
long The number of retained items (samples) in the sketch.

kll::get_rank

Returns an approximation of the normalized (fractional) rank of the given item.

Syntax

sketch:kll::get_rank(kll_sketch, v)
Parameter Type Description
kll_sketch http://anzograph.com/statistics#kll_sketch KLL sketch.
v double The item to retrieve the rank for.

Returns

Type Description
double The approximate normalized rank of the item, between 0 and 1 (inclusive).

kll::get_quantile

Returns an approximation of the value at the given normalized (fractional) rank. This is the inverse of kll::get_rank.

Syntax

sketch:kll::get_quantile(kll_sketch, fraction)
Parameter Type Description
kll_sketch http://anzograph.com/statistics#kll_sketch KLL sketch.
fraction double The fractional position in the hypothetical sorted stream.

Returns

Type Description
double An approximation of the value of the item that would be preceded by the given fraction of a hypothetical sorted version of the sketch.
string An approximation of the string when the input is a string.
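
The quantities that kll::get_rank and kll::get_quantile approximate can be computed exactly on a small sorted list, which makes their inverse relationship easier to see. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes the rank counts the fraction of items strictly less than the given value (the inclusive convention is an equally plausible reading).

```python
from bisect import bisect_left


def exact_rank(sorted_values, v):
    """Fraction of items strictly less than v (normalized rank, 0 to 1)."""
    return bisect_left(sorted_values, v) / len(sorted_values)


def exact_quantile(sorted_values, fraction):
    """Item preceded by the given fraction of the sorted values."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0 (inclusive)")
    # Clamp so that fraction == 1.0 maps to the last item.
    index = min(int(fraction * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


data = sorted(range(100))              # the values 0, 1, ..., 99
median = exact_quantile(data, 0.5)     # item preceded by half of the data
rank_of_median = exact_rank(data, median)
```

A KLL sketch answers the same two questions from a bounded-size summary, so its answers are close to, but not necessarily equal to, these exact values.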

kll::get_quantiles

Provides a more efficient, multiple-query version of kll::get_quantile that enables you to specify a number of evenly spaced fractional ranks.

Syntax

sketch:kll::get_quantiles(kll_sketch, f1, f2, ..., f10)
Parameter Type Description
kll_sketch http://anzograph.com/statistics#kll_sketch KLL sketch.
f1–f10 double Normalized or fractional ranks in the hypothetical sorted stream. The ranks must be in the interval 0.0 - 1.0 (inclusive).

Returns

Type Description
double An approximation of the values in the same order as the given fractional positions.
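
Evenly spaced fractional ranks for kll::get_quantiles, such as quartiles or deciles, can be generated rather than typed by hand. This Python helper is a hypothetical convenience; the limit of ten ranks simply mirrors the f1 through f10 arguments shown in the syntax above.

```python
def evenly_spaced_ranks(count):
    """Return `count` fractional ranks spread evenly from 0.0 to 1.0."""
    if not 2 <= count <= 10:
        raise ValueError("count must be between 2 and 10 (inclusive)")
    return [i / (count - 1) for i in range(count)]


quartile_ranks = evenly_spaced_ranks(5)   # minimum, quartiles, and maximum
```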

kll::get_quantiles_str

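Provides the string version of kll::get_quantiles. Use this function when the sketch was built from string input; it returns an approximation of the strings at the given fractional ranks.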

Syntax

sketch:kll::get_quantiles_str(kll_sketch, f1, f2, ..., f10)
Parameter Type Description
kll_sketch http://anzograph.com/statistics#kll_sketch KLL sketch.
f1–f10 double Normalized or fractional ranks in the hypothetical sorted stream. The ranks must be in the interval 0.0 - 1.0 (inclusive).

Returns

Type Description
string An approximation of the strings.

kll::get_pmf

Provides an approximation to the Probability Mass Function (PMF) of the input stream.

Syntax

sketch:kll::get_pmf(kll_sketch, v1, v2, ..., v10)
Parameter Type Description
kll_sketch http://anzograph.com/statistics#kll_sketch KLL sketch.
v1–v10 Object Input values between the minimum and maximum values of the input stream. Values must be unique and monotonically increasing.

Returns

Type Description
double PMF values corresponding to the input.

kll::get_cdf

Provides an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF of the input stream.

Syntax

sketch:kll::get_cdf(kll_sketch, v1, v2, ..., v10)
Parameter Type Description
kll_sketch http://anzograph.com/statistics#kll_sketch KLL sketch.
v1–v10 Object Input values between the minimum and maximum values of the input stream. Values must be unique and monotonically increasing.

Returns

Type Description
double CDF values corresponding to the input.
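
The following Python sketch computes an exact PMF and CDF for a small dataset to show what kll::get_pmf and kll::get_cdf approximate. It assumes the common convention in which m split points divide the data into m + 1 intervals, each closed on the left and open on the right; whether AnzoGraph returns m or m + 1 values is not stated here, so treat that detail and the function names as assumptions.

```python
from bisect import bisect_left


def _cut_positions(values, split_points):
    """Count of items below each split point, framed by 0 and len(values)."""
    if any(a >= b for a, b in zip(split_points, split_points[1:])):
        raise ValueError("split points must be unique and increasing")
    data = sorted(values)
    return [0] + [bisect_left(data, s) for s in split_points] + [len(data)]


def exact_pmf(values, split_points):
    """Fraction of items falling in each interval between split points."""
    cuts = _cut_positions(values, split_points)
    return [(hi - lo) / len(values) for lo, hi in zip(cuts, cuts[1:])]


def exact_cdf(values, split_points):
    """Cumulative fraction of items below each split point, then 1.0."""
    cuts = _cut_positions(values, split_points)
    return [c / len(values) for c in cuts[1:]]


values = list(range(1, 11))            # the values 1 through 10
pmf = exact_pmf(values, [4, 8])        # intervals: <4, [4, 8), >=8
cdf = exact_cdf(values, [4, 8])
```

Each CDF entry is the running total of the PMF entries up to that interval, which is the sense in which the CDF is the cumulative analog of the PMF.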

Theta Sketch (THETA)

The THETA aggregate uses the Theta Sketch framework to support estimates of cardinality and of the union, intersection, and difference set operations, and returns a binary stream (Theta Sketch). Values can be retrieved from the sketch using the Theta Sketch Retrieval Functions: cardinality, union, intersection, and difference.

Theta Sketches are a generalization of the well-known Kth Minimum Value (KMV) sketches.

THETA Syntax

sketch:theta(values)
Parameter Type Description
values short, int, long, float, double, string The dataset to operate on.

Returns

Type Description
http://anzograph.com/statistics#theta_sketch Binary Theta Sketch
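
The Kth Minimum Value idea that Theta Sketches generalize can be sketched in a few lines of Python: hash every item to a number between 0 and 1, keep the k smallest distinct hashes, and infer the cardinality from how small the kth one is. This is a simplified illustration, not the AnzoGraph implementation; the function name, the parameter k, and the SHA-256 hashing are assumptions made for the example.

```python
import hashlib
import heapq


def kmv_estimate(items, k=1024):
    """Estimate the number of distinct items from the k smallest hashes."""
    scale = float(1 << 64)
    hashes = set()
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        # Map the first 64 bits of the digest to a value in [0, 1).
        hashes.add(int.from_bytes(digest[:8], "big") / scale)
    # A real sketch retains only k hashes while streaming; keeping the
    # whole set here just keeps the example short.
    smallest = heapq.nsmallest(k, hashes)
    if len(smallest) < k:
        return float(len(smallest))     # fewer than k distinct items: exact
    # Densely packed hashes mean many distinct items.
    return (k - 1) / smallest[-1]


estimate = kmv_estimate(list(range(20_000)) * 2)   # true cardinality: 20,000
```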

Theta Sketch Retrieval Functions

The following functions are available for retrieving values from a Theta Sketch:

theta::cardinality

Retrieves the estimated count of distinct values in a Theta Sketch.

Syntax

sketch:theta::cardinality(theta_sketch)
Parameter Type Description
theta_sketch http://anzograph.com/statistics#theta_sketch Binary Theta Sketch

Returns

Type Description
double The estimated number of distinct items in the sketch.

theta::union

Retrieves the estimate of the number of items that are in the union of two or more Theta Sketches.

Syntax

sketch:theta::union(theta_sketch1, theta_sketch2 [, theta_sketchN ])
Parameter Type Description
theta_sketch1–N http://anzograph.com/statistics#theta_sketch Two or more Theta Sketches.

Returns

Type Description
double The estimated number of items in the union.

theta::intersection

Retrieves the estimate of the number of items that are in the intersection between two or more Theta Sketches.

Syntax

sketch:theta::intersection(theta_sketch1, theta_sketch2 [, theta_sketchN ])
Parameter Type Description
theta_sketch1–N http://anzograph.com/statistics#theta_sketch Two or more Theta Sketches.

Returns

Type Description
double The estimated number of items that intersect in the sketches.

theta::difference

Retrieves the estimate of the number of items that are in the difference between two Theta Sketches, i.e., the number of items that are in the first sketch but not in the second sketch.

Syntax

sketch:theta::difference(a, b)
Parameter Type Description
a http://anzograph.com/statistics#theta_sketch The first Theta Sketch.
b http://anzograph.com/statistics#theta_sketch The Theta Sketch to compare to sketch a.

Returns

Type Description
double The estimated number of items in the difference between the sketches.
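
The three set operations can be illustrated with a simplified theta-style sketch in Python. Each sketch is a pair (theta, retained hashes below theta); a set operation takes the smaller theta of its inputs, combines the retained hashes, and divides the surviving count by theta. The code is a minimal sketch under stated assumptions, not the AnzoGraph or DataSketches implementation: every function name, the parameter k, and the SHA-256 hashing are hypothetical.

```python
import hashlib


def _hash01(item):
    """Map an item to a pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / float(1 << 64)


def build_sketch(items, k=4096):
    """Return (theta, retained): the k smallest hashes, all below theta."""
    hashes = sorted({_hash01(i) for i in items})
    if len(hashes) <= k:
        return 1.0, set(hashes)          # small input: nothing is discarded
    return hashes[k], set(hashes[:k])    # theta is the (k+1)th smallest hash


def estimate(sketch):
    theta, retained = sketch
    return len(retained) / theta


def union(a, b):
    theta = min(a[0], b[0])
    return theta, {h for h in a[1] | b[1] if h < theta}


def intersection(a, b):
    theta = min(a[0], b[0])
    return theta, {h for h in a[1] & b[1] if h < theta}


def difference(a, b):
    """Items in a but not in b; the argument order matters."""
    theta = min(a[0], b[0])
    return theta, {h for h in a[1] - b[1] if h < theta}


a = build_sketch(range(0, 30_000))
b = build_sketch(range(20_000, 50_000))
union_size = estimate(union(a, b))                # true value: 50,000
intersection_size = estimate(intersection(a, b))  # true value: 10,000
difference_size = estimate(difference(a, b))      # true value: 20,000
```

Because the difference counts items in the first sketch that are absent from the second, swapping the arguments generally gives a different result, unlike union and intersection.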