Loading Triple and Quad Files

This topic provides instructions for using the SPARQL LOAD statement to load data to AnzoGraph from files that are in Turtle, N-Triple, N-Quad, or TriG format.

For information about load file directory requirements and load architecture, see Load File Requirements. For more information on the data types that AnzoGraph uses to store loaded or inserted data, see AnzoGraph Data Type Handling.

Supported Load File Types

AnzoGraph supports the following load file types:

Turtle (.ttl file type): Terse RDF Triple Language that writes an RDF graph in compact form.
N-Triple (.n3 and .nt file types): A subset of Turtle known as simple triples.
N-Quad (.nq and .quads file types): N-Triples with a blank node or graph designation.
TriG (.trig file type): An extension of Turtle that supports representing a complete RDF data set.

You can GZIP any of the load file types and load the <filename>.<extension>.gz files into the database. In addition, AnzoGraph supports loading tarballs that contain the load files. For example, if you have a directory of gzipped TTL files, you can tar the directory and load the resulting .ttl.gz.tar file.
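The packaging described above can be sketched with standard gzip and tar commands. This is a minimal illustration, not a required procedure: the temporary staging directory, the sales.ttl.gz directory name, and the two sample files are hypothetical stand-ins for your own load files.

```shell
# Hypothetical staging area; a temporary directory keeps the sketch self-contained.
work_dir="$(mktemp -d)"
mkdir "$work_dir/sales.ttl.gz"

# Two tiny Turtle files stand in for real load files.
printf '<urn:s1> <urn:p> "a" .\n' > "$work_dir/sales.ttl.gz/part1.ttl"
printf '<urn:s2> <urn:p> "b" .\n' > "$work_dir/sales.ttl.gz/part2.ttl"

# Compress each file in place: part1.ttl becomes part1.ttl.gz, and so on.
gzip "$work_dir"/sales.ttl.gz/*.ttl

# Tar the directory of gzipped files; the archive name ends in .ttl.gz.tar.
tar -cf "$work_dir/sales.ttl.gz.tar" -C "$work_dir" sales.ttl.gz

# List the archive contents to confirm what the tarball holds.
tar -tf "$work_dir/sales.ttl.gz.tar"
```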

AnzoGraph supports a maximum URI length of 16K characters. There is also a limit of 64K on the number of unique URIs you can load into AnzoGraph: the number of unique URIs, including both graph URIs and predicate URIs, must be less than 64K. If a load exceeds this limit, the operation fails and AnzoGraph returns the message "m_lowest_unused_index <= a_max_value()".
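Before a large load, you may want to check a file against these limits. The following sketch is one possible pre-check for N-Triples files, not an AnzoGraph utility. The function names are hypothetical, and the default limit of 16384 is an assumption about what "16K characters" means.

```shell
# Print every IRI in an N-Triples or N-Quads file that is longer than a limit.
# Assumption: "16K characters" is taken to mean 16384.
find_long_uris() {
  file="$1"
  limit="${2:-16384}"
  # grep -o emits each <...> token on its own line; awk strips the angle
  # brackets and keeps only tokens longer than the limit. A literal that
  # happens to contain <...> text is also inspected, so treat hits as hints.
  grep -oE '<[^>]*>' "$file" | awk -v limit="$limit" '{
    uri = substr($0, 2, length($0) - 2)
    if (length(uri) > limit) print uri
  }'
}

# Count the distinct predicates in an N-Triples file. The predicate is the
# second whitespace-separated field because subjects never contain spaces.
count_unique_predicates() {
  awk 'NF >= 3 && $1 !~ /^#/ { print $2 }' "$1" | sort -u | wc -l | tr -d ' '
}
```

For example, `find_long_uris data.nt` prints nothing when every IRI is within the assumed limit, and `count_unique_predicates data.nt` gives one part of the unique URI count to compare against 64K.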

LOAD Syntax

Run the following statement to load data from Turtle, N-Triple, N-Quad, or TriG files.

LOAD [ SILENT ] [ WITH 'leader' | 'compute' | 'global' ] <URI> [...<URIn>] [ INTO GRAPH <graph_name> ]

Option	Description
SILENT	Include this optional keyword if you do not want AnzoGraph to return errors during the load. When SILENT is omitted, AnzoGraph aborts the load upon hitting an error and reports the error to the client. When SILENT is specified and AnzoGraph encounters an error, it logs the error to a graph and proceeds with the load. By default, any errors are captured in the <load_errors> graph. After a load completes, you can query the graph to review errors. To customize the load error graph URI, you can change the load_errors_graph setting value in the system configuration file, `<install_path>/config/settings.conf`. See Changing System Settings for instructions.
WITH 'leader'	Include this optional clause when loading files that only the leader server can access. WITH 'leader' is the default value for the LOAD statement. When the WITH clause is omitted, the load proceeds as if WITH 'leader' was specified. The "leader" keyword is case-sensitive. Type the term using lower case letters.
WITH 'compute'	Include this optional clause when all servers will load files from their local file systems. Use this option if you have arranged the load files so that each AnzoGraph server has a unique subset of files on its local file system. The "compute" keyword is case-sensitive. Type the term using lower case letters.
WITH 'global'	Include this optional clause when all servers will load a subset of files from directories on a mounted file system. Include this option when every AnzoGraph server in the cluster has visibility to the entire data set. AnzoGraph automatically divides file selection among the servers. The "global" keyword is case-sensitive. Type the term using lower case letters.
<URI>	Required clause that specifies the URIs to load. Each URI lists the path to the file or directory of files that you want to load. To load a single file, the scheme of the URI should be file:. To load a directory of files, the scheme of the URI should be dir:. When you specify a directory, AnzoGraph loads all valid files in that directory as well as any subdirectories. AnzoGraph does not load any hidden files that are named with a leading period, such as .file.ttl. For example, the following URI loads a single file from a shared directory: <file:/shared-files/data/tickit.ttl> This example URI loads a directory of .ttl.gz files on a mounted file system: <dir:/global/nfs/vpc_nfs_server/data/tickit_all.ttl.gz> If you specify more than one URI to load from, each URI must be of the same file type, that is, each URI must specify graph data in the same format such as `.ttl`, `.TriG`, etc. Also each URI must specify the same scheme, file: or dir:. Make sure that the file system is accessible from AnzoGraph. In a Docker environment, the file or directory must be shared between the host and the container or be stored in the AnzoGraph container file system. For instructions on copying files or directories from a local file system to the AnzoGraph file system in a Docker container, see Loading Files from the AnzoGraph File System in Docker. For more information on loading data into AnzoGraph from HDFS data sources, see Loading Files from HDFS.
INTO GRAPH <graph_name>	When loading files such as Turtle or N-Triple files without graph specifications, include this optional clause to specify the graph to load data into. If the graph does not exist, the system automatically creates it and then loads the data into it. If you do not specify a graph, AnzoGraph loads data into the default graph. You can also include the INTO GRAPH option when loading N-Quad files. If the N-Quad files contain a mixture of quads and triples, AnzoGraph loads the triples into the specified graph. Quads are still loaded according to their graph specification. If you omit this option for N-Quad files, any triples without graph specifications are loaded into the default graph.
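The rules in the table above can be sketched as a small helper that assembles a LOAD statement and rejects input that breaks them. The build_load function is hypothetical and is not part of AnzoGraph. It also assumes that the file type is compared after ignoring any .gz or .tar suffix.

```shell
# Print a LOAD statement, or fail if the arguments break the documented rules.
# Usage: build_load MODE GRAPH URI [URI...]   (pass "" as GRAPH to omit INTO GRAPH)
build_load() {
  mode="$1"
  graph="$2"
  shift 2

  # The WITH keyword is case-sensitive and must be lower case.
  case "$mode" in
    leader|compute|global) ;;
    *) echo "invalid WITH value: $mode" >&2; return 1 ;;
  esac

  [ "$#" -ge 1 ] || { echo "at least one URI is required" >&2; return 1; }

  first_scheme=""
  first_ftype=""
  for uri in "$@"; do
    scheme="${uri%%:*}"     # text before the first colon: file, dir, ...
    name="${uri%.tar}"      # assumption: packaging suffixes are ignored
    name="${name%.gz}"
    ftype="${name##*.}"     # remaining extension: ttl, nt, nq, trig, ...
    if [ -z "$first_scheme" ]; then
      first_scheme="$scheme"
      first_ftype="$ftype"
    elif [ "$scheme" != "$first_scheme" ] || [ "$ftype" != "$first_ftype" ]; then
      echo "URIs must share one scheme and one file type: $uri" >&2
      return 1
    fi
  done

  statement="LOAD WITH '$mode'"
  for uri in "$@"; do
    statement="$statement <$uri>"
  done
  if [ -n "$graph" ]; then
    statement="$statement INTO GRAPH <$graph>"
  fi
  printf '%s\n' "$statement"
}

# Prints a statement that loads a directory of gzipped Turtle files into <tickit>.
build_load global tickit dir:/global/nfs/vpc_nfs_server/data/tickit_all.ttl.gz
```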

Depending on how and where you stage load files, AnzoGraph loads the files in parallel, using all available cores on a server or all servers in the cluster. The location of the load files and the number of servers that can access those files affect network and load performance. For optimal performance, break data into several smaller load files and make sure that all servers in the cluster have access to the files. For example, on a four-node cluster with 256 vCPUs in total, AnzoGraph reaches the best load performance when all four nodes can access the files and the number of files is a multiple of 256 so that all 256 cores load the files in parallel.
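One way to break a large file into smaller load files is sketched below for N-Triples, which stores one triple per line, so splitting on line boundaries keeps every chunk valid. Turtle and TriG files cannot be split safely this way. The split_ntriples function, its arguments, and the part_ file prefix are hypothetical.

```shell
# Split one N-Triples file into gzipped chunks inside a target directory.
# Usage: split_ntriples SOURCE_FILE OUT_DIR LINES_PER_CHUNK
split_ntriples() {
  src="$1"        # path to the source .nt file
  out_dir="$2"    # target directory; include the file type, e.g. sales.nt.gz
  lines="$3"      # number of triples per chunk
  mkdir -p "$out_dir"
  # split writes part_aa, part_ab, ... each holding at most $lines lines.
  split -l "$lines" "$src" "$out_dir/part_"
  for chunk in "$out_dir"/part_*; do
    case "$chunk" in
      *.gz) continue ;;          # skip chunks compressed by an earlier run
    esac
    mv "$chunk" "$chunk.nt"      # give each chunk the N-Triples extension
    gzip "$chunk.nt"             # produces part_aa.nt.gz, part_ab.nt.gz, ...
  done
}
```

Choosing the chunk size so that the number of chunks is a multiple of the total core count follows the guidance above.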

When loading data from a single file, only the leader server performs the load. For example, on an 8-node cluster, a single-file load uses only 1/8 of the cluster's CPU resources to load the data. Loading a large single file can take longer than loading a directory of files where all AnzoGraph servers have access to those files.
When loading data from directories of files that only the leader server can access, only the cores on the leader server perform the load. For example, on a 4-node cluster where each server has 32 cores (128 total cores), only the 32 cores on the leader server perform the load; that is, AnzoGraph loads 32 files at a time.
When files are on S3, a web server, or a mounted file system and the leader and compute servers have access to those files, all servers load a subset of the files. For example, on a 4-node cluster where each server has 32 cores, 128 cores work in parallel to load the data; that is, AnzoGraph loads 128 files at a time.
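The arithmetic behind these examples can be sketched as follows. The load_waves function is hypothetical, and it assumes the simple model used above in which each core loads one file at a time.

```shell
# Number of load "waves" needed when each core loads one file at a time.
# Usage: load_waves FILES SERVERS CORES_PER_SERVER
load_waves() {
  files="$1"
  slots=$(( $2 * $3 ))                       # files loaded at the same time
  echo $(( (files + slots - 1) / slots ))    # ceiling division
}

load_waves 256 1 32   # only the leader can read the files: prints 8
load_waves 256 4 32   # all four servers can read the files: prints 2
```

Under this model, giving all four servers access to the same 256 files cuts the number of waves from 8 to 2.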

Loading Files on Amazon S3

The following example statement loads data from files in a directory on Amazon S3, for installations not requiring authorization. The data is loaded into a graph named tickit:

LOAD WITH 'global' <s3://mybucket/load-files/tickit_all.ttl> INTO GRAPH <tickit>

Omit the INTO GRAPH argument when you load files that include graph specifications.

To load data from an Amazon S3 bucket that requires permissions to access, you need to first create a file in the ~/.aws directory on your AnzoGraph server that includes the necessary credentials to access the bucket.

Create a directory named .aws in your home directory on the AnzoGraph host server.
In the ~/.aws directory, create a file named credentials (all lower-case and no extension in the file name).
Add the following statements in the credentials file:

aws_access_key_id = <access-key-value>
aws_secret_key = <secret-key-value>

For example:

aws_access_key_id = AKIARRAKWT4K362XDWUJ
aws_secret_key = 0NuZvJJvo4Q8pwKAqjIMmZnog2u3IdWcQbucpzA7
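The steps above can be sketched as a small shell function. The write_aws_credentials name and the scratch directory are hypothetical, the key names are the ones listed in this topic, and the chmod call is an added precaution rather than a documented requirement.

```shell
# Write the credentials file under a given home directory.
# Usage: write_aws_credentials HOME_DIR ACCESS_KEY SECRET_KEY
write_aws_credentials() {
  aws_dir="$1/.aws"
  mkdir -p "$aws_dir"
  # One "name = value" statement per line, as shown in the steps above.
  printf 'aws_access_key_id = %s\naws_secret_key = %s\n' "$2" "$3" > "$aws_dir/credentials"
  # Assumption: limit access to the file owner because the file holds a secret.
  chmod 600 "$aws_dir/credentials"
}

# Example call against a scratch directory; on a real server pass "$HOME".
scratch_home="$(mktemp -d)"
write_aws_credentials "$scratch_home" '<access-key-value>' '<secret-key-value>'
cat "$scratch_home/.aws/credentials"
```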

Run the LOAD command as before, that is:

LOAD WITH 'global' <s3://mybucket/load-files/tickit_all.ttl> INTO GRAPH <tickit>

Loading Files from a Mounted File System

To load data from TTL, NT, NQ, or TriG files stored on a mounted NFS server, you can use the following command:

LOAD WITH 'global' <dir:/global/nfs/vpc_nfs_server/directory_path/directory_name> [ INTO GRAPH <graph_name> ]

For example, the following command loads data from gzipped Turtle files in a directory on a mounted file system. All of the servers in the cluster have access to the file system. The data is loaded into a graph named sales:

LOAD WITH 'global' <dir:/global/nfs/vpc_nfs_server/data/sales_data.ttl.gz> INTO GRAPH <sales>

Omit the INTO GRAPH argument when you load files that include graph specifications.

Loading Files from a Data Server

The following example statement loads data from TriG files in the employees.trig directory on a data server.

LOAD WITH 'global' <https://data.cambridgesemantics.com/loads/employees.trig>

The employees.trig directory contains an ls.dir text file that lists the filenames for the files to load.
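If you need to produce such a listing, the following sketch writes one from the directory contents. It assumes that ls.dir holds one bare file name per line, which this topic does not spell out, so confirm the expected layout before relying on it. The write_ls_dir function name is hypothetical.

```shell
# Write an ls.dir file that names every visible file in a directory.
# Assumption: the listing is one bare file name per line.
write_ls_dir() {
  dir="$1"
  # ls skips hidden files by default; grep keeps ls.dir itself out of the listing.
  ( cd "$dir" && ls -1 | grep -v '^ls\.dir$' > ls.dir )
}
```

For example, running `write_ls_dir /var/www/loads/employees.trig` on a hypothetical web server directory would list the TriG files stored there.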

Loading Files from the AnzoGraph File System in Docker

When running AnzoGraph in a Docker container, the files you plan to load into AnzoGraph must already reside in the same Docker container where AnzoGraph is installed. In some cases, that means you may have to copy load files from another host system to the AnzoGraph file system in Docker. To do that:

In Docker, run the following command to access the AnzoGraph file system, the /opt/anzograph directory:
```
sudo docker exec -it anzograph_container_name /bin/bash
```
Where anzograph_container_name is the name of the AnzoGraph container whose file system you want to access. For example:
```
sudo docker exec -it anzograph /bin/bash
```
Determine where on the file system you would like to place the load files and create a new directory if necessary. If you plan to load a directory of files, remember to include the file type in the directory name. See Load File Requirements for more information. For example:
```
mkdir /opt/anzograph/load-files.ttl
```
Type exit to exit the container.

Run the following Docker command to copy files from the host server to a location in the AnzoGraph container.

sudo docker cp /path/filename anzograph_container_name:/path/dir

For example:

sudo docker cp /home/user/sales.ttl anzograph:/opt/anzograph/load-files.ttl/

To copy a directory to the container instead, run the following command. The docker cp command copies directories recursively, so no extra option is needed:

sudo docker cp /path/dirname anzograph_container_name:/path

For example:

sudo docker cp /home/user/load-files.ttl anzograph:/opt/anzograph/

Loading Files from HDFS

To load files into AnzoGraph from a Hadoop Distributed File System (HDFS), use the same LOAD syntax as for other file locations. The difference is the choice of file handling and authentication protocols. To load data from an HDFS cluster, use the following syntax:

LOAD [WITH 'leader'|'compute'|'global'] <HDFS_protocol://URLpath> [INTO GRAPH <graph_name>]

Where HDFS_protocol is one of the following:

hdfs – Load HDFS files via the non-secure HTTP protocol.
shdfs – Load HDFS files via the secure HTTPS protocol.
kshdfs – Load HDFS files via secure HTTPS with Kerberos authentication.

For example:

LOAD WITH 'global' <shdfs://load-files/tickit_all.ttl> INTO GRAPH <tickit>
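The choice among the three protocols can be sketched as a lookup on two yes/no answers about the HDFS cluster. The hdfs_scheme function is hypothetical. Kerberos over plain HTTP is treated as unsupported only because it does not appear in the list above.

```shell
# Pick the URI scheme for an HDFS load.
# Usage: hdfs_scheme USE_HTTPS USE_KERBEROS   (each argument is "yes" or "no")
hdfs_scheme() {
  case "$1:$2" in
    no:no)   echo hdfs ;;      # non-secure HTTP
    yes:no)  echo shdfs ;;     # secure HTTPS
    yes:yes) echo kshdfs ;;    # secure HTTPS with Kerberos authentication
    *) echo "unsupported combination: https=$1 kerberos=$2" >&2; return 1 ;;
  esac
}

# Build the example statement for a cluster that uses HTTPS and Kerberos.
printf "LOAD WITH 'global' <%s://load-files/tickit_all.ttl> INTO GRAPH <tickit>\n" "$(hdfs_scheme yes yes)"
```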

Along with specifying the HDFS protocol to use, you also need to edit the /etc/hosts file on all nodes of the AnzoGraph cluster to include the IP addresses and host names of the NameNode and all DataNodes of the HDFS cluster you are loading data from.
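To confirm that a hosts file covers every HDFS node, you could run a check like the following on each AnzoGraph server. This is a minimal sketch: the missing_hosts function and the host names in the usage example are hypothetical.

```shell
# Print each host name that has no entry in a hosts file.
# Usage: missing_hosts HOSTS_FILE NAME [NAME...]
missing_hosts() {
  hosts_file="$1"
  shift
  for name in "$@"; do
    # Drop comments, then look for the name as a whole field after the IP address.
    if ! sed 's/#.*//' "$hosts_file" | awk -v n="$name" '
      { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
      END { exit !found }
    '; then
      echo "$name"
    fi
  done
}
```

For example, `missing_hosts /etc/hosts namenode.example.com datanode1.example.com` prints nothing when both names are present.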

For HDFS data loading using Kerberos protocols, you also need to configure AnzoGraph to use Kerberos authentication. See Configuring AnzoGraph for Kerberos Authentication below for those instructions.

Configuring AnzoGraph for Kerberos Authentication

If you plan to load data to AnzoGraph from an HDFS file store that uses Kerberos authentication, follow the steps below to configure AnzoGraph for Kerberos authentication.

To generate an authentication token for requesting encrypted ticket-granting tickets (TGT) from the key distribution center (KDC), each AnzoGraph host server must have the Kerberos workstation package, krb5-workstation, installed. On each server in the cluster, run the following command to install the package:
```
sudo yum install -y krb5-workstation
```
To establish a connection to the KDC, AnzoGraph must have a copy of the KDC's krb5.conf file. Place a copy of krb5.conf in the /etc directory on each AnzoGraph host server.
In addition to krb5.conf, each AnzoGraph server needs a copy of the .keytab file from the principal node. The keytab file and principal name are used to generate an authentication token.
To find the location of the .keytab file and the principal name, you can look up the dfs.web.authentication.kerberos.keytab and dfs.web.authentication.kerberos.principal values in hdfs-site.xml on the HDFS master node.
Copy the .keytab file to any location on each AnzoGraph host server, and then run the following command to generate the authentication token:
```
kinit -p principal_name -k -t path/keytab_file
```
Where principal_name is the Kerberos principal name and path/keytab_file is the location and name of the .keytab file.
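The hdfs-site.xml lookup in the steps above can be sketched with awk. The hdfs_site_value function is hypothetical, and it assumes the usual Hadoop layout in which each name element is followed by its value element, with the whole value element on one line. It is not a general XML parser.

```shell
# Print the value of one property from a Hadoop configuration file.
# Usage: hdfs_site_value CONFIG_FILE PROPERTY_NAME
hdfs_site_value() {
  awk -v key="$2" '
    # Remember when the requested property name has been seen.
    index($0, "<name>" key "</name>") { want = 1 }
    # Print the text of the first value element that follows, then stop.
    want && match($0, /<value>[^<]*<\/value>/) {
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}
```

For example, `hdfs_site_value /path/hdfs-site.xml dfs.web.authentication.kerberos.keytab` prints the keytab location, where /path stands for wherever the file resides on the HDFS master node.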
