UDX Terminology and Concepts
This topic introduces the Graph Lakehouse user-defined extensions (UDX) interface and describes the fundamental terminology and concepts associated with developing custom Graph Lakehouse extensions that implement it. It covers extension types, extension libraries, extension metadata, and extension data types.
Extension Types
Graph Lakehouse extensions are programs that implement the UDX interface and can be registered and loaded into the Graph Lakehouse system, where they can be used within queries or other command statements. Graph Lakehouse currently supports three kinds of extensions. Each kind has similar but distinct requirements:
- User-Defined Functions (UDF): A UDF extension maps or processes a single row of input values to return a single row of output values. For example, a developer can design a UDF extension to create an analytic function, such as those that concatenate values or convert integers to alternate currencies.
- User-Defined Aggregates (UDA): A UDA extension maps or processes multiple rows of input values to return a single row of output values. For example, a developer can design a UDA extension that computes an arithmetic mean or performs operations like SUM, STDDEV, or MAX. Unlike a UDF, which returns a distinct value each time it is applied, a UDA aggregates the collection of values to which it is applied into a single summary value.
- User-Defined Services (UDS): A UDS extension maps or processes multiple rows of input values to return multiple rows of output values. For example, a developer can design and register a UDS extension that defines a SPARQL endpoint.
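The three row-mapping shapes described above can be sketched in plain Java. This is only an illustration of one row in and one row out (UDF), many rows in and one row out (UDA), and many rows in and many rows out (UDS). The class and method names are hypothetical and do not use the UDX interface, which is not shown in this topic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: plain Java methods, not the UDX interface.
class ExtensionShapes {

    // UDF shape: one row of input values in, one output value out.
    static String concatUdf(String left, String right) {
        return left + right;
    }

    // UDA shape: many input rows in, one summary value out.
    static double meanUda(List<Double> rows) {
        if (rows.isEmpty()) {
            return Double.NaN; // no rows, so no meaningful mean
        }
        double sum = 0.0;
        for (double value : rows) {
            sum += value;
        }
        return sum / rows.size();
    }

    // UDS shape: many input rows in, many output rows out.
    // Each input row n is expanded into the output rows 1..n.
    static List<Integer> expandUds(List<Integer> rows) {
        List<Integer> out = new ArrayList<>();
        for (int n : rows) {
            for (int i = 1; i <= n; i++) {
                out.add(i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(concatUdf("graph", "lakehouse"));
        System.out.println(meanUda(List.of(1.0, 2.0, 3.0)));
        System.out.println(expandUds(List.of(2, 3)));
    }
}
```

The point of the sketch is the difference in cardinality between input and output, not the specific computations.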
Extension Libraries
Extension libraries are executable code modules that define and organize a collection of extensions. Libraries can be implemented in either C++ or any JVM-based language such as Java or Scala. Developers can create and register any number of extension libraries.
Extension Metadata
Extension libraries are self-describing; that is, they include the necessary metadata that describe the number, name, type, and calling signature of the various extensions they implement. When a new UDX is implemented, the developer adds the metadata to an extension library that describes each new UDX. When the extension library is loaded into Graph Lakehouse, the system adds the extension library metadata to an internal Graph Lakehouse registry so that the new UDX can be invoked from within subsequent SPARQL queries.
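As a rough mental model of the paragraph above, a self-describing library can be thought of as a list of descriptors that the system copies into a registry when the library is loaded. The sketch below assumes a hypothetical `Descriptor` class and registry; the field names and layout are illustrative assumptions, not the actual Graph Lakehouse metadata format.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and fields are assumptions, not the real metadata format.
class ExtensionRegistrySketch {

    // One self-describing entry: name, kind, and calling signature.
    static final class Descriptor {
        final String name;
        final String kind;           // "UDF", "UDA", or "UDS"
        final List<String> argTypes; // registry type names, such as "double"
        final String resultType;

        Descriptor(String name, String kind, List<String> argTypes, String resultType) {
            this.name = name;
            this.kind = kind;
            this.argTypes = argTypes;
            this.resultType = resultType;
        }
    }

    // Registry keyed by extension name, in load order.
    private final Map<String, Descriptor> byName = new LinkedHashMap<>();

    // Loading a library copies each of its descriptors into the registry.
    void load(List<Descriptor> library) {
        for (Descriptor descriptor : library) {
            byName.put(descriptor.name, descriptor);
        }
    }

    // Returns the descriptor for a name, or null if nothing is registered.
    Descriptor lookup(String name) {
        return byName.get(name);
    }

    public static void main(String[] args) {
        ExtensionRegistrySketch registry = new ExtensionRegistrySketch();
        registry.load(List.of(new Descriptor("my_mean", "UDA", List.of("double"), "Double")));
        System.out.println(registry.lookup("my_mean").kind);
    }
}
```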
Extension Data Types
The following table describes the types of values that can be passed into and returned from a user-defined extension. For each type, we can specify:
- Enum Type: A unique number that identifies the data type.
- RDF Type: The name by which the type is known within the SPARQL query language.
- C++ Type: The type by which it is known within the C++ language.
- JVM Type: The type by which it is known within JVM languages.
- UDX Registry Data Type: The language-independent name by which it is known within the Graph Lakehouse registry.
UDX Data Types
The following table describes the mapping for the various data types that can be specified in a Graph Lakehouse user-defined extension.
The data types listed in the table describe values that can be passed into and out of a user-defined extension. In C++, we do this by placing the values into the elements of a row. In JVM languages, the values are passed on the stack as explicit parameters to the relevant UDX.
Enum Type | RDF Type | Description | C++ Type | JVM Type | UDX Registry Data Type |
---|---|---|---|---|---|
t_boolean | xsd:boolean | A non-nullable 8-bit boolean value | bool | boolean | boolean |
t_byte | xsd:byte | A non-nullable 8-bit signed integer | byte/uint8_t | byte | byte |
t_short | xsd:short | A non-nullable 16-bit signed integer | short/int16_t | short | short |
t_int | xsd:int | A non-nullable 32-bit signed integer | int/int32_t | int | int |
t_long | xsd:long | A non-nullable 64-bit signed integer | long/int64_t | long | long |
t_float | xsd:float | A non-nullable 32-bit IEEE single precision float | float | float | float |
t_double | xsd:double | A non-nullable 64-bit IEEE double precision float | double | double | double |
t_Object | N/A | A direct sum of all possible nullable types | -- | java/lang/Object | Object |
t_Boolean | xsd:boolean | A nullable 8-bit boolean value | bool | java/lang/Boolean | Boolean |
t_Byte | xsd:byte | A nullable 8-bit signed integer | byte/uint8_t | java/lang/Byte | Byte |
t_Short | xsd:short | A nullable 16-bit signed integer | short/int16_t | java/lang/Short | Short |
t_Integer | xsd:int | A nullable 32-bit signed integer | int/int32_t | java/lang/Integer | Int |
t_Long | xsd:long | A nullable 64-bit signed integer | long/int64_t | java/lang/Long | Long |
t_Float | xsd:float | A nullable 32-bit IEEE single precision float | float | java/lang/Float | Float |
t_Double | xsd:double | A nullable 64-bit IEEE double precision float | double | java/lang/Double | Double |
t_Date | xsd:date | A nullable 32-bit signed number of days since 1/1/2000 | udx2::Date | java/time/LocalDate | Date |
t_Time | xsd:time | A nullable 64-bit signed number of microseconds since 1/1/2000 | udx2::Time | java/time/OffsetTime | Time |
t_DateTime | xsd:dateTime | A nullable <us, time zone> pair - since 1/1/2000 | udx2::DateTime | java/time/ZonedDateTime | DateTime |
t_Duration | xsd:duration | A nullable <months, us> pair - since 1/1/2000 | udx2::Duration | java/time/Duration | Duration |
t_String | xsd:string | A nullable view into a string of UTF8 characters | udx2::String | java/lang/String | String |
t_LString | xsd:string | A nullable pair of string views | udx2::LString | com/cambridgesemantics/anzograph/udx/LString | LString |
t_UDT | N/A | A nullable pair of string views | udx2::UDT | com/cambridgesemantics/anzograph/udx/UDT | UDT |
t_URI | IRI | A nullable view into a string of UTF8 characters | udx2::String | com/cambridgesemantics/anzograph/udx/URI | URI |
t_Blob | N/A | A nullable block of raw binary bytes | udx2::Blob | com/cambridgesemantics/anzograph/udx/Blob | N/A |
Data Type Handling
The illustration below provides a diagram of Graph Lakehouse's UDX data type handling. The top row in the diagram shows the built-in primitive types, and the bottom plane shows the corresponding reference types. The arrows pointing from primitive types to corresponding reference types represent automatic coercions. Details about data type processing and automatic type coercion follow the diagram.
Primitive Types
The top row in the diagram depicts non-nullable types that are native to both the C++ and JVM languages.
If a UDX registers itself as requiring a primitive type as one of its arguments, but it receives a null value at run time, the system generates an exception and the query is aborted. Similarly, if a UDX registers itself as returning a primitive type as one of its results, but it actually returns a null value, the system also generates an exception and the query is aborted.
Passing and returning values of primitive types is generally faster than using the corresponding reference types, and thus, is preferred whenever possible for best performance.
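The null-handling rule for primitive types has a close analogy in ordinary Java, where a primitive parameter cannot represent a missing value. The sketch below is that analogy only: the class and method names are hypothetical, and the `NullPointerException` is Java's own behavior on unboxing, not necessarily the exception Graph Lakehouse raises.

```java
// Java analogy only; this does not use the UDX interface.
class PrimitiveNullSketch {

    // Declares a primitive parameter, so it cannot represent "no value".
    static long doublePrimitive(long value) {
        return value * 2;
    }

    // Declares a reference parameter, so null is representable and is handled.
    static Long doubleNullable(Long value) {
        if (value == null) {
            return null;
        }
        return value * 2;
    }

    // Returns true if handing the boxed argument to the primitive method fails.
    static boolean failsOnNull(Long boxed) {
        try {
            doublePrimitive(boxed); // auto-unboxing of a null Long throws here
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsOnNull(null));
        System.out.println(doubleNullable(null));
        System.out.println(doubleNullable(21L));
    }
}
```

A function that must tolerate missing values would therefore declare the nullable reference type, at some cost in speed.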
Reference Types
The reference types shown in the bottom plane of the diagram represent data values that are passed by reference. These types are ultimately derived from "Object," have methods, are instances of classes, and are interrogated at run-time for their type. Reference types are also nullable. Each primitive type (boolean, byte, short, int, long, float, double) has a corresponding reference type that it is mapped to (Boolean, Byte, Short, Integer, Long, Float, Double).
Passing and returning values as reference types is generally slower than using their primitive counterparts, but using reference types often provides more flexibility.
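The flexibility of reference types comes from the fact that a single parameter can hold any nullable value and be inspected at run time. A minimal Java sketch of that idea follows; `describe` is a hypothetical helper, not part of the UDX interface.

```java
// Java illustration of run-time type interrogation; not the UDX interface.
class ReferenceTypeSketch {

    // Accepts any nullable reference value and reports its run-time type.
    static String describe(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Integer) {
            return "Integer";
        }
        if (value instanceof Double) {
            return "Double";
        }
        if (value instanceof String) {
            return "String";
        }
        // Fall back to the class name for any other reference type.
        return value.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(describe(3));      // the int is boxed to Integer
        System.out.println(describe(2.5));    // the double is boxed to Double
        System.out.println(describe("text"));
        System.out.println(describe(null));
    }
}
```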
Data Type Coercion
Graph Lakehouse supports automatic type coercion of certain data types. The supported coercions are represented by the downward-pointing arrows in the previous diagram showing Graph Lakehouse data type mapping. Where automatic conversion is supported, a value of one type can be supplied to a UDX where a value of another type is expected, and Graph Lakehouse will convert the data type without a loss of information or precision.
For example, if a UDX expects a Double value as an input argument and the value supplied is an int, Graph Lakehouse coerces the value as follows:
int→long→float→double→Double
If a UDX requires a long value but an int is supplied, Graph Lakehouse widens the value from a 32-bit signed integer to a 64-bit signed integer. For example, the int value 3 becomes the long value 3L, whose high 32 bits are all zero.
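The coercion chain can be traced step by step using Java's own implicit widening conversions, which follow the same order. This is a sketch of the concept in plain Java with hypothetical method names; it is not a statement about how Graph Lakehouse implements coercion internally. Note that in Java the long to float step can round integers whose magnitude exceeds 2^24, so the example uses small values.

```java
// Java analogy for the coercion chain; not the UDX interface.
class CoercionSketch {

    // Each assignment is an implicit Java widening conversion, followed by boxing.
    static Double intToNullableDouble(int value) {
        long asLong = value;       // int -> long (sign-extended)
        float asFloat = asLong;    // long -> float
        double asDouble = asFloat; // float -> double
        return asDouble;           // double -> Double (boxing)
    }

    // Widens a 32-bit signed integer to a 64-bit signed integer.
    static long widen(int value) {
        return value;
    }

    public static void main(String[] args) {
        System.out.println(intToNullableDouble(3));
        System.out.println(widen(3));
        // For a negative input, the high 32 bits are filled with the sign bit.
        System.out.println(Long.toHexString(widen(-1)));
    }
}
```

Widening preserves the numeric value in both directions of sign: a non-negative int gains zeroed high bits, while a negative int gains high bits set to one.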