Sketch Library
The sketch library provides extremely efficient streaming algorithms that approximate calculations, such as count distinct, quantiles, most frequent items, joins, and matrix computations, and return data sketches. This topic describes each of the sketch functions.
The URI for the sketch functions is <http://cambridgesemantics.com/anzograph/statistics/sketch#>. For readability, the syntax for each function below includes the prefix sketch:, defined as PREFIX sketch: <http://cambridgesemantics.com/anzograph/statistics/sketch#>.
- Cardinality Metric (HLL): Uses Apache DataSketches HyperLogLog (HLL) to calculate cardinality estimates for a dataset.
- Frequent Items (FI): Collection of functions used to create frequency sketches and obtain information about frequent items.
- Quantile/Rank Sketch (KLL): Collection of functions that use the KLL sketch computation model to approximate minimum and maximum items in a dataset, the quantile and rank of items, the Probability Mass Function (PMF), and the Cumulative Distribution Function (CDF).
- Theta Sketch (THETA): Collection of functions that use the Theta Sketch framework to compute estimates of the cardinality, union, intersection, and difference set operations and return a Theta Sketch.
Cardinality Metric (HLL)
This aggregate calculates cardinality estimates for a dataset using Apache DataSketches HyperLogLog (HLL).
Reference: Cardinality Prominence Metric
Syntax
sketch:hll(data [, log_base_2_K ] [, hll_target_type ])
data
|
byte, short, int, long, float, double, string, URI |
The dataset. |
log_base_2_K
|
int |
Optional argument that specifies the log base 2 of K, where K is the number of buckets or slots for the sketch. Must be between 4 and 21 (inclusive). The default value is 12. |
hll_target_type
|
int |
Optional argument that specifies the target type for the HLL sketch. Supported values are 4 (HLL_4), 6 (HLL_6), or 8 (HLL_8). The default value is 4. |
Returns
double |
The cardinality metric value. |
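To make the role of log_base_2_K concrete, the following is a minimal Python sketch of the textbook HyperLogLog estimator. It is an illustration under stated assumptions, not the Apache DataSketches implementation that sketch:hll uses: the function name hll_estimate, the SHA-256 based hash, and the small-range correction are choices made for this example only.

```python
import hashlib
import math


def hll_estimate(items, log_base_2_k=12):
    """Textbook HyperLogLog cardinality estimate (illustrative only)."""
    if not 4 <= log_base_2_k <= 21:
        raise ValueError("log_base_2_k must be between 4 and 21 (inclusive)")
    k = 1 << log_base_2_k            # K = 2^log_base_2_K buckets
    tail_bits = 64 - log_base_2_k    # bits left after choosing a bucket
    registers = [0] * k
    for item in items:
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash (assumed choice)
        bucket = h >> tail_bits                      # top bits select the bucket
        tail = h & ((1 << tail_bits) - 1)
        rank = tail_bits - tail.bit_length() + 1     # position of the leftmost 1-bit
        registers[bucket] = max(registers[bucket], rank)
    # Bias-correction constant from the original HyperLogLog analysis.
    alpha = {16: 0.673, 32: 0.697, 64: 0.709}.get(k, 0.7213 / (1 + 1.079 / k))
    raw = alpha * k * k / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * k and zeros:
        return k * math.log(k / zeros)               # linear counting for small sets
    return raw


# 30,000 inputs containing 10,000 distinct values.
estimate = hll_estimate(i % 10000 for i in range(30000))
```

A larger log_base_2_K uses more buckets, which generally lowers the estimation error at the cost of a larger sketch.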
Frequent Items (FI)
The FI aggregate is used to estimate the frequency of items in a dataset, the upper and lower bounds of the items, the number of active items, and the total stream weight. FI returns a binary stream (Frequent Items Sketch) containing all of the computed values. Values can be retrieved from the sketch using the Frequent Items Sketch Retrieval Functions: get_estimates, get_active_items_total_weights, get_top_items, and get_top_strings.
FI Syntax
sketch:fi(values [, weight ])
values
|
short, int, long, float, double, string |
The dataset. |
weight
|
long |
Optional argument that specifies the weight of each value. The default value is 1. |
Returns
http://anzograph.com/statistics#fi_sketch |
Binary Frequent Items Sketch. |
Frequent Items Sketch Retrieval Functions
The following functions are available for retrieving values from a Frequent Items Sketch:
fi::get_estimates
Returns the estimates for the frequency and lower and upper bound of the given item in a sketch.
Syntax
sketch:fi::get_estimates(fi_sketch, item)
fi_sketch
|
http://anzograph.com/statistics#fi_sketch |
Frequent Items Sketch. |
item
|
Object |
Item for which to get estimates. |
Returns
long |
Frequency estimate for the item. |
long |
Lower bound estimate for the item. |
long |
Upper bound estimate for the item. |
fi::get_active_items_total_weights
Returns the number of active items and the estimated total stream weight from a sketch.
Syntax
sketch:fi::get_active_items_total_weights(fi_sketch)
fi_sketch
|
http://anzograph.com/statistics#fi_sketch |
Frequent Items Sketch. |
Returns
long |
The estimated number of active items. |
long |
The estimated total stream weight. |
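The quantities returned by fi::get_estimates and fi::get_active_items_total_weights can be illustrated with a weighted Misra-Gries summary, a classic frequent-items technique. This is a sketch under assumptions, not the library's algorithm: the names build_frequency_summary and frequency_bounds and the max_counters parameter are hypothetical and exist only in this example.

```python
def build_frequency_summary(stream, max_counters):
    """Weighted Misra-Gries summary over (item, weight) pairs (illustrative)."""
    counters = {}      # active items and their (possibly reduced) counts
    offset = 0         # total amount subtracted so far; caps the undercount
    total_weight = 0   # exact total stream weight
    for item, weight in stream:
        if weight <= 0:
            raise ValueError("weights must be positive")
        total_weight += weight
        counters[item] = counters.get(item, 0) + weight
        if len(counters) > max_counters:
            smallest = min(counters.values())
            offset += smallest
            # Subtract the smallest count everywhere and drop exhausted items.
            counters = {key: count - smallest
                        for key, count in counters.items() if count > smallest}
    return counters, offset, total_weight


def frequency_bounds(summary, item):
    """Return (lower bound, upper bound) on the true frequency of item."""
    counters, offset, _ = summary
    lower = counters.get(item, 0)
    return lower, lower + offset


stream = [("a", 1)] * 50 + [("b", 1)] * 30 + [(f"x{i}", 1) for i in range(20)]
summary = build_frequency_summary(stream, max_counters=4)
active_items = len(summary[0])
lower_a, upper_a = frequency_bounds(summary, "a")
```

In this example the true frequency of "a" (50) always lies between lower_a and upper_a, and the number of active items never exceeds max_counters.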
fi::get_top_items
Returns the most frequent items and their corresponding frequency.
Syntax
sketch:fi::get_top_items(fi_sketch)
fi_sketch
|
http://anzograph.com/statistics#fi_sketch |
Frequent Items Sketch. |
Returns
double |
The item with the highest frequency. |
long |
Frequency estimate of the first item. |
double |
The item with the second highest frequency. |
long |
Frequency estimate of the second item. |
double |
The item with the nth highest frequency. |
long |
Frequency estimate of the nth item. |
fi::get_top_strings
Returns the most frequent strings and their corresponding frequency.
Syntax
sketch:fi::get_top_strings(fi_sketch)
fi_sketch
|
http://anzograph.com/statistics#fi_sketch |
Frequent Items Sketch. |
Returns
string |
The string with the highest frequency. |
long |
Frequency estimate of the first string. |
string |
The string with the second highest frequency. |
long |
Frequency estimate of the second string. |
string |
The string with the nth highest frequency. |
long |
Frequency estimate of the nth string. |
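When validating the output of fi::get_top_items or fi::get_top_strings on a small sample, it can help to compare against an exact count. The following baseline is an assumption-free reference computation in plain Python, not part of the sketch library; the helper name exact_top_items is hypothetical.

```python
from collections import Counter


def exact_top_items(pairs, n=3):
    """Exact top-n items by total weight, as (item, frequency) tuples."""
    totals = Counter()
    for item, weight in pairs:
        totals[item] += weight
    return totals.most_common(n)


top_two = exact_top_items([("a", 1), ("b", 5), ("a", 2), ("c", 1)], n=2)
```

The exact count needs memory proportional to the number of distinct items, which is the cost a Frequent Items Sketch is designed to avoid.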
Quantile/Rank Sketch (KLL)
The KLL aggregate uses the KLL Sketch computation model to calculate the approximate minimum and maximum items in a dataset, the quantile and rank of items, the Probability Mass Function (PMF), and the Cumulative Distribution Function (CDF). KLL returns a binary stream (KLL Sketch) containing all of the computed values. Values can be retrieved from the sketch using various KLL Sketch Retrieval Functions.
For more information about KLL sketches, see KLL Sketch.
KLL Syntax
sketch:kll(values [, k ])
values
|
short, int, long, float, double, string |
The dataset. |
k
|
int |
Optional argument that configures the size of the sketch and its estimation error. Can be any value between 8 and 65535 (inclusive). The default value is 200, which results in a normalized rank error of about 1.65%. Higher values will have a smaller error but the sketch will be larger (and slower). |
Returns
http://anzograph.com/statistics#kll_sketch |
Binary KLL sketch. |
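A normalized rank error is a fraction of the stream length, so its absolute effect grows with the size of the input. The short calculation below shows what the default error of about 1.65% means in positions of the sorted stream; the helper name absolute_rank_error is hypothetical, and the error is a probabilistic bound rather than a guarantee.

```python
def absolute_rank_error(stream_length, normalized_rank_error=0.0165):
    """Approximate rank error expressed as a number of sorted positions."""
    return normalized_rank_error * stream_length


# For 1,000,000 values and the default k of 200, a reported rank may be
# off by roughly 16,500 positions in the sorted stream.
positions = absolute_rank_error(1_000_000)
```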
KLL Sketch Retrieval Functions
The following functions are available for retrieving values from a KLL sketch:
kll::get_min_value
Returns the minimum value in a KLL sketch.
Syntax
sketch:kll::get_min_value(kll_sketch)
kll_sketch
|
http://anzograph.com/statistics#kll_sketch |
KLL sketch. |
Returns
double |
The minimum value in the sketch. |
string |
If the input is a string, the minimum string is returned. |
kll::get_max_value
Returns the maximum value in a KLL sketch.
Syntax
sketch:kll::get_max_value(kll_sketch)
kll_sketch
|
http://anzograph.com/statistics#kll_sketch |
KLL sketch. |
Returns
double |
The maximum value in the sketch. |
string |
If the input is a string, the maximum string is returned. |
kll::get_n
Returns the length of the input stream summarized by a KLL sketch, that is, the number of items the sketch has processed.
Syntax
sketch:kll::get_n(kll_sketch)
kll_sketch
|
http://anzograph.com/statistics#kll_sketch |
KLL sketch. |
Returns
long |
The number of items in the input stream that the sketch summarizes. |
kll::get_num_retained
Returns the number of retained items (samples) in a sketch.
Syntax
sketch:kll::get_num_retained(kll_sketch)
kll_sketch
|
http://anzograph.com/statistics#kll_sketch |
KLL sketch. |
Returns
long |
The number of retained items (samples) in the sketch. |
kll::get_rank
Returns an approximation of the normalized (fractional) rank of the given item.
Syntax
sketch:kll::get_rank(kll_sketch, v)
kll_sketch
|
http://anzograph.com/statistics#kll_sketch |
KLL sketch. |
v
|
double |
The item to retrieve the rank for. |
Returns
double |
The approximate rank of the item from 0 - 1 (inclusive). |
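For reference, the exact normalized rank that kll::get_rank approximates can be computed on a small dataset as follows. This sketch assumes the rank of v is the fraction of values strictly less than v; the source does not state whether the library uses this convention or counts values equal to v as well, and the helper name normalized_rank is hypothetical.

```python
import bisect


def normalized_rank(values, v):
    """Exact fraction of values strictly less than v (assumed convention)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    return bisect.bisect_left(ordered, v) / len(ordered)


rank_of_30 = normalized_rank([40, 10, 30, 20], 30)   # two of four values are below 30
```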
kll::get_quantile
Returns an approximation of the item value at the given normalized (fractional) rank.
Syntax
sketch:kll::get_quantile(kll_sketch, fraction)
kll_sketch
|
http://anzograph.com/statistics#kll_sketch |
KLL sketch. |
fraction
|
double |
The fractional position in the hypothetical sorted stream. |
Returns
double |
An approximation of the value of the item that would be preceded by the given fraction of a hypothetical sorted version of the sketch. |
string |
An approximation of the string when the input is a string. |
kll::get_quantiles
Provides a more efficient, multiple-query version of kll::get_quantile that enables you to specify a number of evenly spaced fractional ranks.
Syntax
sketch:kll::get_quantiles(kll_sketch, f1, f2, ..., f10)
kll_sketch
|
http://anzograph.com/statistics#kll_sketch |
KLL sketch. |
f1–f10
|
double |
Normalized or fractional ranks in the hypothetical sorted stream. The ranks must be in the interval 0.0 - 1.0 (inclusive). |
Returns
double |
An approximation of the values in the same order as the given fractional positions. |
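The exact counterpart of kll::get_quantile and kll::get_quantiles can be written in a few lines, which is useful for checking sketch results on small inputs. The sketch below follows the description above (the returned item is preceded by the given fraction of the sorted data); the helper names quantile and quantiles are hypothetical, and the handling of a fraction of exactly 1.0 is an assumption.

```python
def quantile(values, fraction):
    """Exact item preceded by `fraction` of the sorted values."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in the interval 0.0 - 1.0")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # Clamp so that fraction == 1.0 returns the maximum (assumed behavior).
    index = min(int(fraction * len(ordered)), len(ordered) - 1)
    return ordered[index]


def quantiles(values, fractions):
    """Multiple-query version: results are in the order of the given fractions."""
    return [quantile(values, f) for f in fractions]


quartiles = quantiles(range(1, 101), [0.25, 0.5, 0.75])
```

Because the helpers rely only on sorting, they also work for strings, which mirrors the string variants of the retrieval functions.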
kll::get_quantiles_str
Provides an approximation to the strings when the input is a string type.
Syntax
sketch:kll::get_quantiles_str(kll_sketch, f1, f2, ..., f10)
kll_sketch
|
http://anzograph.com/statistics#kll_sketch |
KLL sketch. |
f1–f10
|
double |
Normalized or fractional ranks in the hypothetical sorted stream. The ranks must be in the interval 0.0 - 1.0 (inclusive). |
Returns
string |
An approximation of the strings. |
kll::get_pmf
Provides an approximation to the Probability Mass Function (PMF) of the input stream.
Syntax
sketch:kll::get_pmf(kll_sketch, v1, v2, ..., v10)
kll_sketch
|
http://anzograph.com/statistics#kll_sketch |
KLL sketch. |
v1–v10
|
Object |
Input values between the minimum and maximum values of the input stream. Values must be unique and monotonically increasing. |
Returns
double |
PMF values corresponding to the input. |
kll::get_cdf
Provides an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF of the input stream.
Syntax
sketch:kll::get_cdf(kll_sketch, v1, v2, ..., v10)
kll_sketch
|
http://anzograph.com/statistics#kll_sketch |
KLL sketch. |
v1–v10
|
Object |
Input values between the minimum and maximum values of the input stream. Values must be unique and monotonically increasing. |
Returns
double |
CDF values corresponding to the input. |
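The relationship between kll::get_pmf and kll::get_cdf can be illustrated with an exact computation over split points. This sketch assumes that m split points divide the data into m + 1 intervals and that each interval includes its lower split point; the source does not confirm how many values the library returns or how it treats values equal to a split point. The helper name pmf_and_cdf is hypothetical.

```python
import bisect


def pmf_and_cdf(values, split_points):
    """Exact PMF and CDF over the intervals defined by the split points."""
    if any(b <= a for a, b in zip(split_points, split_points[1:])):
        raise ValueError("split points must be unique and monotonically increasing")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    n = len(ordered)
    # cdf[i] is the fraction of values strictly below split_points[i];
    # the final entry covers the whole stream and is therefore 1.0.
    cdf = [bisect.bisect_left(ordered, s) / n for s in split_points] + [1.0]
    # Each PMF entry is the mass added between consecutive CDF entries.
    pmf = [cdf[0]] + [high - low for low, high in zip(cdf, cdf[1:])]
    return pmf, cdf


pmf, cdf = pmf_and_cdf(range(100), [25, 50, 75])
```

Under these assumptions the PMF entries sum to 1 and the CDF is the running total of the PMF.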
Theta Sketch (THETA)
The THETA aggregate uses the Theta Sketch framework to compute estimates of the cardinality, union, intersection, and difference set operations and returns a binary stream (Theta Sketch) containing the computed values. Values can be retrieved from the sketch using the Theta Sketch Retrieval Functions: cardinality, union, intersection, and difference.
Theta Sketches are a generalization of the well-known Kth Minimum Value (KMV) sketches.
THETA Syntax
sketch:theta(values)
values
|
short, int, long, float, double, string |
The dataset to operate on. |
Returns
http://anzograph.com/statistics#theta_sketch |
Binary Theta Sketch |
Theta Sketch Retrieval Functions
The following functions are available for retrieving values from a Theta Sketch:
theta::cardinality
Retrieves the estimated count of distinct values in a Theta Sketch.
Syntax
sketch:theta::cardinality(theta_sketch)
theta_sketch
|
http://anzograph.com/statistics#theta_sketch |
Binary Theta Sketch |
Returns
double |
The estimated number of distinct values in the sketch. |
theta::union
Retrieves the estimate of the number of items that are in the union of two or more Theta Sketches.
Syntax
sketch:theta::union(theta_sketch1, theta_sketch2 [, theta_sketchN ])
theta_sketch1–N
|
http://anzograph.com/statistics#theta_sketch |
Any number of Theta Sketches. |
Returns
double |
The estimated number of items in the union. |
theta::intersection
Retrieves the estimate of the number of items that are in the intersection between two or more Theta Sketches.
Syntax
sketch:theta::intersection(theta_sketch1, theta_sketch2 [, theta_sketchN ])
theta_sketch1–N
|
http://anzograph.com/statistics#theta_sketch |
Any number of Theta Sketches. |
Returns
double |
The estimated number of items that intersect in the sketches. |
theta::difference
Retrieves the estimate of the number of items that are in the difference between two Theta Sketches, i.e., the number of items that are in the first sketch but not in the second sketch.
Syntax
sketch:theta::difference(a, b)
a
|
http://anzograph.com/statistics#theta_sketch |
The first Theta Sketch. |
b
|
http://anzograph.com/statistics#theta_sketch |
The Theta Sketch to compare to sketch a. |
Returns
double |
The estimated number of items in the difference between the sketches. |
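The set operations above can be illustrated with a small KMV-style sketch in Python. It is a simplified model under stated assumptions, not the Theta Sketch framework's implementation: a sketch here is a (theta, retained hashes) pair, the hash is derived from SHA-256, and all function names and the parameter k are hypothetical. Unlike the retrieval functions above, which return estimates directly, these helpers return sketches that are then passed to cardinality.

```python
import hashlib


def _hash01(value):
    """Map a value to a pseudo-random number in [0, 1] (assumed hash choice)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64


def theta_sketch(values, k=4096):
    """Keep the k smallest hashes; theta is the next smallest hash."""
    hashes = sorted({_hash01(v) for v in values})
    if len(hashes) <= k:
        return 1.0, frozenset(hashes)      # small input: the sketch is exact
    return hashes[k], frozenset(hashes[:k])


def cardinality(sketch):
    theta, retained = sketch
    return len(retained) / theta


def union(a, b):
    theta = min(a[0], b[0])
    return theta, frozenset(h for h in a[1] | b[1] if h < theta)


def intersection(a, b):
    theta = min(a[0], b[0])
    return theta, frozenset(h for h in a[1] & b[1] if h < theta)


def difference(a, b):
    """Items in sketch a that are not in sketch b."""
    theta = min(a[0], b[0])
    return theta, frozenset(h for h in a[1] - b[1] if h < theta)


a = theta_sketch(range(0, 20000))
b = theta_sketch(range(10000, 30000))
estimates = {
    "union": cardinality(union(a, b)),                  # true value: 30000
    "intersection": cardinality(intersection(a, b)),    # true value: 10000
    "difference": cardinality(difference(a, b)),        # true value: 10000
}
```

Every operation first lowers theta to the smaller of the two thresholds, so both inputs are compared over the same sampled range of hash values before the surviving hashes are counted.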