Load File Requirements
AnzoGraph supports loading data from files on the AnzoGraph file system, on a remote web server or bucket, or on a mounted file system or bucket. You can load data from a single file or multiple files in a directory.
This topic provides details about the supported load file types, load directory requirements, and AnzoGraph load architecture and performance information.
- Supported File Types
- AnzoGraph Data Type Handling
- Directory Name Requirements
- File System Requirements
- Web Server Requirements
- Load Architecture and Performance Information
Supported File Types
AnzoGraph supports the following load file types:
- Turtle (.ttl file type): Terse RDF Triple Language that writes an RDF graph in compact form.
- N-Triple (.n3 and .nt file types): A subset of Turtle known as simple triples.
- N-Quad (.nq and .quads file types): N-Triples with a blank node or graph designation.
- TriG (.trig file type): An extension of Turtle that supports representing a complete RDF data set.
- CSV (.csv file type): Comma-separated value format.
You can GZIP any of the load file types and load the <filename>.<extension>.gz files into the database. In addition, AnzoGraph supports loading tarballs that contain the load files. For example, if you have a directory of gzipped TTL files, you can tar the directory and load the resulting .ttl.gz.tar file.
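The packaging steps described above can be sketched as follows. This is a minimal illustration, not an AnzoGraph utility: the helper name package_load_dir and the example directory name sales.ttl are hypothetical, and the sketch assumes standard gzip and tar are installed.

```shell
# package_load_dir is a hypothetical helper, not an AnzoGraph command.
# Usage: package_load_dir <directory> <extension>
package_load_dir() {
  dir=$1
  ext=$2
  # Compress each load file in place: part1.ttl becomes part1.ttl.gz
  find "$dir" -type f -name "*.$ext" -exec gzip {} +
  # Bundle the directory into <name>.<ext>.gz.tar in the current directory
  base=$(basename "$dir" ".$ext")
  tar -cf "$base.$ext.gz.tar" -C "$(dirname "$dir")" "$(basename "$dir")"
}
```

For example, running package_load_dir ./sales.ttl ttl would gzip the TTL files inside sales.ttl and write sales.ttl.gz.tar to the current directory.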
AnzoGraph supports a maximum URI length of 16K characters. There is also a limit of 64K on the number of unique predicate and graph URIs that can be stored in AnzoGraph. If the total number of unique predicate and graph URIs exceeds the 64K limit, the load operation that exceeds the limit fails and AnzoGraph returns the message "m_lowest_unused_index <= a_max_value()."
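Before a large load, it can help to estimate how close the data is to these limits. The following is a rough sketch for N-Triples (.nt) files only, where the predicate is the second whitespace-separated field on each line. The helper names are hypothetical, the URI scan also matches any <...> text that happens to appear inside a literal, and the reported length may be in bytes rather than characters depending on the awk implementation.

```shell
# Hypothetical helpers for checking N-Triples files; not part of AnzoGraph.

# Print the number of distinct predicates in the given .nt files.
count_unique_predicates() {
  # Skip blank lines and comment lines, then keep the second field
  awk '!/^[ \t]*(#|$)/ { print $2 }' "$@" | sort -u | wc -l | tr -d ' '
}

# Print the length of the longest <...> term in the given .nt files.
longest_uri() {
  awk '{
    line = $0
    while (match(line, /<[^>]*>/)) {
      len = RLENGTH - 2            # exclude the angle brackets
      if (len > max) max = len
      line = substr(line, RSTART + RLENGTH)
    }
  } END { print max + 0 }' "$@"
}
```

The results can then be compared against the 64K and 16K limits described above. Graph URIs in quad files also count toward the 64K limit and are not covered by this sketch.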
AnzoGraph Data Type Handling
AnzoGraph natively supports the following RDF data types. Literal values with types that are not included in the table below are treated as "user-defined" types in AnzoGraph. User-defined types are stored as strings and can be cast to supported types as needed to perform analytic operations.
RDF Data Types | Description |
---|---|
xsd:boolean | For true or false values. Whether the input value is "true," "false," "0," or "1," AnzoGraph stores and displays "t" for true and "f" for false. To use 1 and 0 for true and false, you must specify the xsd:boolean type in the load file. Otherwise the system assumes these values are integers. |
xsd:byte | For 1-byte integers from -128 to 127. |
xsd:date | For date values that follow a format such as YYYY-MM-DD. You can also include timezone indicators in xsd:date values. |
xsd:dateTime | 8-byte date and time values that follow a format such as YYYY-MM-DDThh:mm:ss. You can also include timezone indicators in xsd:dateTime values. |
xsd:double | 8-byte double-precision floating-point values. Note: Decimal values are converted to xsd:double in AnzoGraph. |
xsd:duration | Duration of time expressed as a number of years, months, days, hours, minutes, and seconds in a format such as PnYnMnDTnHnMnS. |
xsd:float | 4-byte floating point values with potential decimal places. |
xsd:int | 4-byte integers for values from -2,147,483,648 to 2,147,483,647. |
xsd:long | 8-byte integers for values from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. |
xsd:short | 2-byte integers for values from -32,768 to 32,767. |
xsd:string | Character values of varying length, up to 1 MB (by default) in size; 1 MB holds approximately 1 million characters. AnzoGraph can be configured to allow strings of up to 2 MB (approximately 2 million characters), but this is not recommended. Instead, enable the truncate_clob setting to truncate strings that are larger than 1 MB. |
xsd:time | Time values that follow a format such as hh:mm:ss. |
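To illustrate how explicit types from the table above appear in a load file, here is a minimal sketch that writes a small Turtle file from the shell. The helper name write_typed_sample, the ex: namespace, and the property names are all hypothetical and chosen only for illustration.

```shell
# write_typed_sample is a hypothetical helper.
# Usage: write_typed_sample <output-file>
write_typed_sample() {
  cat > "$1" <<'EOF'
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:item1 ex:active  "1"^^xsd:boolean ;     # typed explicitly, so not read as an integer
         ex:created "2020-01-31"^^xsd:date ;
         ex:updated "2020-01-31T08:15:00"^^xsd:dateTime ;
         ex:elapsed "P1DT2H30M"^^xsd:duration ;
         ex:count   "42"^^xsd:long ;
         ex:label   "example"^^xsd:string .
EOF
}
```

A literal whose type is not listed in the table would be stored as a user-defined type, as described above.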
Directory Name Requirements
Organize load files in directories by file extension type, and include the file type extension in the name of the directory. For example, place TTL files in a name.ttl directory, place TRIG files in a name.trig directory, place NQ files in a name.nq directory, and so on.
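The layout described above can be produced with a short script. This is a sketch under stated assumptions: the helper name sort_by_extension and the directory name prefix are hypothetical, and it handles only uncompressed files that sit directly in the source directory.

```shell
# sort_by_extension is a hypothetical helper.
# Usage: sort_by_extension <source-dir> <name>
# Moves each load file into <source-dir>/<name>.<ext>, for example sales.ttl or sales.nq.
sort_by_extension() {
  src=$1
  name=$2
  for ext in ttl n3 nt nq quads trig csv; do
    for f in "$src"/*."$ext"; do
      [ -f "$f" ] || continue      # skip directories and unmatched globs
      mkdir -p "$src/$name.$ext"
      mv "$f" "$src/$name.$ext/"
    done
  done
}
```

For example, sort_by_extension /data sales would move /data/a.ttl to /data/sales.ttl/a.ttl and /data/c.nq to /data/sales.nq/c.nq.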
File System Requirements
When loading data from files on a file system, the file system must be accessible from AnzoGraph. For example, in a Docker or Kubernetes container environment, the file or directory must be stored on the AnzoGraph container file system. For instructions on copying files or directories from a local file system to a container's file system, see How do I copy load files from the host to the AnzoGraph file system in Docker?
Web Server Requirements
When loading directories of files from a web server (HTTP/S URLs), include an ls.dir table of contents file in the root directory that contains the files that you want to load. If you have a data server with multiple directories and you plan to load one directory at a time, include an ls.dir file in each directory. If you have a server with multiple directories and you plan to load from all directories at once, include an ls.dir file at the top level of the server. All ls.dir files must include file names only and list the files in any sub-directories. Run the following command to create ls.dir files:
find . -type f ! -name ls.dir -print | sed 's|^\./||' | tee ls.dir
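When you plan to load one directory at a time, each directory needs its own ls.dir file. The following sketch loops over the immediate subdirectories of a root directory and writes an ls.dir in each one. The helper name make_ls_dir_per_directory is hypothetical, and the sketch uses find's -name test to keep ls.dir itself out of the listing.

```shell
# make_ls_dir_per_directory is a hypothetical helper.
# Usage: make_ls_dir_per_directory <root-dir>
make_ls_dir_per_directory() {
  root=$1
  for d in "$root"/*/; do
    [ -d "$d" ] || continue        # skip when the glob matches nothing
    (
      cd "$d" || exit 1
      # List every file below this directory, relative to it, excluding ls.dir
      find . -type f ! -name ls.dir -print | sed 's|^\./||' | sort > ls.dir
    )
  done
}
```

Each resulting ls.dir contains relative file names only, including files in any subdirectories of that directory.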
Load Architecture and Performance Information
Depending on how and where you stage load files, AnzoGraph loads the files in parallel, using all available cores on a server or all servers in the cluster. The location of the load files and the number of servers that can access those files affect network and load performance. For optimal performance, break data into several smaller load files and make sure that all servers in the cluster have access to the files. For example, on a four-node cluster with 256 vCPUs in total, AnzoGraph reaches the best load performance when all four nodes can access the files and the number of files is a multiple of 256 so that all 256 cores load the files in parallel.
- When loading data from a single file, only the leader server performs the load. For example, on an 8-node cluster, a single-file load uses only 1/8 of the cluster's CPU resources to load the data. Loading a large single file can take longer than loading a directory of files where all AnzoGraph servers have access to those files.
- When loading data from directories of files that only the leader server can access, only the cores on the leader server perform the load. For example, on a 4-node cluster where each server has 32 cores (128 total cores), only the leader server performs the load; that is, AnzoGraph loads 32 files at a time.
- When files are on S3, a web server, or a mounted file system and the leader and compute servers have access to those files, all servers load a subset of the files. For example, on a 4-node cluster where each server has 32 cores, 128 cores work in parallel to load the data; that is, AnzoGraph loads 128 files at a time.
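The arithmetic behind the three cases above can be restated as a small calculation. This is a simplified model rather than a description of AnzoGraph internals: the helper name files_in_parallel and its arguments are hypothetical, and it assumes one file per core, as the examples above suggest.

```shell
# files_in_parallel is a hypothetical helper that restates the rules above.
# Usage: files_in_parallel <nodes> <cores-per-node> <file-count> <all|leader>
#   all    = every server in the cluster can access the files
#   leader = only the leader server can access the files
files_in_parallel() {
  nodes=$1
  cores=$2
  files=$3
  access=$4
  if [ "$access" = all ]; then
    slots=$((nodes * cores))       # every server loads a subset of the files
  else
    slots=$cores                   # only the leader server's cores load files
  fi
  # The number of files loaded at a time cannot exceed the number of files
  if [ "$files" -lt "$slots" ]; then
    slots=$files
  fi
  echo "$slots"
}
```

For the 4-node, 32-core examples above, files_in_parallel 4 32 1000 all prints 128 and files_in_parallel 4 32 1000 leader prints 32.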