Loading Triple and Quad Files

This topic provides instructions for using the SPARQL LOAD statement to load data to AnzoGraph DB from files that are in Turtle, N-Triple, N-Quad, or TriG format.

For information about load file directory requirements and load architecture, see RDF Load File Requirements. For more information on the data types that AnzoGraph DB uses to store data, see Data Type Handling.

Supported Load File Types

AnzoGraph DB supports the following load file types:

  • Turtle (.ttl file type): Terse RDF Triple Language that writes an RDF graph in compact form.
  • N-Triple (.n3 and .nt file types): A subset of Turtle known as simple triples.
  • N-Quad (.nq and .quads file types): N-Triples with a blank node or graph designation.
  • TriG (.trig file type): An extension of Turtle that supports representing a complete RDF data set.

You can GZIP any of the load file types and load the <filename>.<extension>.gz files into the database. In addition, AnzoGraph DB supports loading tarballs that contain the load files. For example, if you have a directory of gzipped TTL files, you can tar the directory and load the resulting .ttl.gz.tar file.
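The packaging described above can be sketched in Python. This is a minimal illustration rather than a required procedure: the function name gzip_and_tar and the paths in the usage note are hypothetical, and the sketch assumes the source directory holds uncompressed .ttl files.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def gzip_and_tar(source_dir, tar_path):
    """Gzip each .ttl file in source_dir and pack the .gz copies into a tarball.

    Returns the member names written to the tarball. The original .ttl
    files are left in place.
    """
    source_dir = Path(source_dir)
    gz_files = []
    for ttl_file in sorted(source_dir.glob("*.ttl")):
        gz_file = ttl_file.with_name(ttl_file.name + ".gz")
        # Stream the copy so large files are not read into memory at once.
        with open(ttl_file, "rb") as src, gzip.open(gz_file, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_files.append(gz_file)

    members = [f"{source_dir.name}/{gz_file.name}" for gz_file in gz_files]
    # Plain (uncompressed) tar: the members are already gzipped.
    with tarfile.open(tar_path, "w") as tar:
        for gz_file, member in zip(gz_files, members):
            tar.add(gz_file, arcname=member)
    return members
```

For example, gzip_and_tar("/data/tickit", "/data/tickit.ttl.gz.tar") would produce a tarball whose members are the gzipped Turtle files.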

AnzoGraph DB supports a maximum URI length of 16K characters. There is also a limit on the number of unique URIs you can load: the number of unique URIs, including both graph URIs and predicate URIs, must be less than 64K. If a load exceeds this limit, the load fails and AnzoGraph DB returns the message "m_lowest_unused_index <= a_max_value()".

LOAD Syntax

Run the following statement to load data from Turtle, N-Triple, N-Quad, or TriG files.

LOAD [ SILENT ] [ WITH 'leader' | 'compute' | 'global' ] <URI> [...<URIn>] [ INTO GRAPH <graph_uri> ]

SILENT

Include this optional keyword if you want AnzoGraph DB to ignore "bad data" errors during the load. Data issues are problems such as dateTime values that are incorrectly formatted or strings that are tagged as double data types. The SILENT keyword does not silence syntax errors in the files. If a file is ill-formed, for example because it includes invalid characters in place of URIs, AnzoGraph DB cannot parse the data and the file must be corrected.

  • When SILENT is omitted, AnzoGraph DB aborts the load upon hitting a data or syntax error and reports the error to the client.
  • When SILENT is specified and AnzoGraph DB encounters an error with the data, it logs the error to a graph and proceeds with the load. By default, any errors are captured in the <load_errors> graph. After a load completes, you can query the graph to review errors.

    When SILENT is specified, the load will still be aborted if there are syntax errors in the files. AnzoGraph DB cannot parse the data if there are syntax errors. The file or files must be corrected and loaded again.

    To customize the load error graph URI, you can change the load_errors_graph setting value in the system configuration file, <install_path>/config/settings.conf. See Changing System Settings for instructions.
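    One way to review the errors after a SILENT load is to query the error graph with a generic triple pattern. The sketch below only builds the query text. The helper name load_errors_query is hypothetical, the default graph URI is the <load_errors> default mentioned above, and the shape of the logged triples is not assumed.

```python
def load_errors_query(graph_uri="load_errors", limit=100):
    """Build a SPARQL query that lists triples from the load error graph."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(ch in graph_uri for ch in '<>"{}|\\^` '):
        raise ValueError(f"invalid graph URI: {graph_uri!r}")
    return (
        "SELECT ?s ?p ?o\n"
        f"FROM <{graph_uri}>\n"
        "WHERE { ?s ?p ?o }\n"
        f"LIMIT {limit}"
    )
```

    If you changed the load_errors_graph setting, pass that URI as graph_uri instead of the default.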

leader

Include the optional WITH 'leader' clause when loading files that only the leader server can access. WITH 'leader' is the default value for the LOAD statement. When the WITH clause is omitted, the load proceeds as if WITH 'leader' was specified.

The "leader" keyword is case-sensitive. Type the term using lower case letters.

compute

Include the optional WITH 'compute' clause when all servers will load files from their local file systems. Use this option if you have arranged the load files so that each AnzoGraph DB server has a unique subset of files on its local file system.

The "compute" keyword is case-sensitive. Type the term using lower case letters.

global

Include the optional WITH 'global' clause when all servers will load a subset of the same files or directories on a mounted file system. Include this option when every AnzoGraph DB server in the cluster has visibility to the entire data set. AnzoGraph DB automatically divides file selection among the servers.

The "global" keyword is case-sensitive. Type the term using lower case letters.

URI

Required clause that specifies the absolute path to the load file or files. To load a single file, use the file: scheme. To load a directory of files, use the dir: scheme. When loading a directory of files, make sure the directory name includes the same file type extension as the files in the directory. For example, a directory of TTL files is named name.ttl, a directory of TriG files is named name.trig, and a directory of NQ files is named name.nq. When you specify a directory, AnzoGraph DB loads all valid files in that directory and in any subdirectories. AnzoGraph DB does not load hidden files, that is, files named with a leading period, such as .file.ttl.

For example, the following URI loads a single file from a shared directory:

<file:/shared-files/data/tickit.ttl>

This example URI loads a directory of .ttl.gz files on a mounted file system:

<dir:/global/nfs/vpc_nfs_server/data/tickit_all.ttl.gz>
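Before running a dir: load, it can help to preview which files the directory rules above would pick up. The following is a rough sketch, not AnzoGraph DB's own selection logic: the function name preview_load_files is hypothetical, it only applies the two rules stated above (subdirectories are included, files with a leading period are skipped), and it does not check whether a file is valid RDF.

```python
from pathlib import Path


def preview_load_files(directory):
    """List non-hidden files under directory, including subdirectories.

    Paths are returned relative to directory, sorted, with "/" separators.
    """
    root = Path(directory)
    selected = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Hidden files (leading period) are not loaded.
        if path.name.startswith("."):
            continue
        selected.append(path.relative_to(root).as_posix())
    return sorted(selected)
```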

If you specify more than one URI to load from, each URI must be of the same file type, that is, each URI must specify the same extension type, such as .ttl, .trig, etc. Also each URI must specify the same scheme, file: or dir:.
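The rule above can be checked before a statement is submitted. The sketch below is an illustration under stated assumptions: the names check_load_uris and load_extension are hypothetical, the extension list is taken from the supported file types in this topic, and a gzipped extension such as .ttl.gz is treated as different from .ttl, which may be stricter than AnzoGraph DB requires.

```python
from urllib.parse import urlsplit

# Extensions listed under Supported Load File Types.
RDF_EXTENSIONS = (".ttl", ".n3", ".nt", ".nq", ".quads", ".trig")


def load_extension(path):
    """Return the RDF extension of path, plus '.gz' if it is gzipped."""
    lowered = path.lower()
    gz = ".gz" if lowered.endswith(".gz") else ""
    stem = lowered[: len(lowered) - len(gz)]
    for ext in RDF_EXTENSIONS:
        if stem.endswith(ext):
            return ext + gz
    raise ValueError(f"no supported RDF extension in {path!r}")


def check_load_uris(uris):
    """Return the (scheme, extension) shared by all URIs, or raise ValueError."""
    seen = set()
    for uri in uris:
        parts = urlsplit(uri)
        seen.add((parts.scheme, load_extension(parts.path)))
    if len(seen) != 1:
        raise ValueError(f"URIs must share one scheme and extension, got {sorted(seen)}")
    return seen.pop()
```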

Make sure that the file system is accessible from AnzoGraph DB. In a Docker environment, the file or directory must be shared between the host and the container or be stored in the AnzoGraph DB container file system. For instructions on copying files or directories from a local file system to the AnzoGraph DB file system in a Docker container, see Loading Files from the File System in Docker. For more information on loading data into AnzoGraph DB from HDFS, see Loading Files from HDFS.

INTO GRAPH <graph_uri>

When loading files such as Turtle or N-Triple files without graph specifications, include this optional clause to specify the graph to load data into. If the graph does not exist, the system automatically creates it and then loads the data into it. If you do not specify a graph, AnzoGraph DB loads data into the default graph.

You can also include the INTO GRAPH option when loading N-Quad files. If the N-Quad files contain a mixture of quads and triples, AnzoGraph DB loads the triples into the specified graph. Quads are still loaded according to their graph specification. If you omit this option for N-Quad files, any triples without graph specifications are loaded into the default graph.
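The clauses described above can be assembled programmatically. The following sketch builds the statement text only and does not send it to the database. The function name build_load is hypothetical, and separating multiple URIs with a single space is an assumption based on the syntax line above.

```python
LOAD_MODES = ("leader", "compute", "global")


def build_load(uris, graph=None, mode=None, silent=False):
    """Assemble a LOAD statement from its optional and required clauses."""
    if not uris:
        raise ValueError("at least one URI is required")
    # The WITH keywords are case-sensitive, so only exact lowercase matches pass.
    if mode is not None and mode not in LOAD_MODES:
        raise ValueError(f"mode must be one of {LOAD_MODES}, got {mode!r}")
    parts = ["LOAD"]
    if silent:
        parts.append("SILENT")
    if mode is not None:
        parts.append(f"WITH '{mode}'")
    parts.extend(f"<{uri}>" for uri in uris)
    if graph is not None:
        parts.append(f"INTO GRAPH <{graph}>")
    return " ".join(parts)
```

For example, build_load(["file:/shared-files/data/tickit.ttl"], graph="tickit") returns a statement that relies on the default WITH 'leader' behavior.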

Loading Files on Amazon S3

The following example statement loads data from files in a directory on Amazon S3, for installations not requiring authorization. The data is loaded into a graph named tickit:

LOAD WITH 'global' <s3://mybucket/load-files/tickit_all.ttl> INTO GRAPH <tickit>

Omit the INTO GRAPH argument when you load files that include graph specifications.

To load data from an Amazon S3 bucket that requires permissions to access, you need to first create a file in the ~/.aws directory on your AnzoGraph DB server that includes the necessary credentials to access the bucket.

  1. Create a directory named .aws in your home directory on the AnzoGraph DB host server.
  2. In the ~/.aws directory, create a file named credentials (all lowercase and no extension in the file name).
  3. Add the following statements to the credentials file:
    aws_access_key_id = <access-key-value>
    aws_secret_key = <secret-key-value>

    For example:

    aws_access_key_id = AKIARRAKWT4K362XDWUJ
    aws_secret_key = 0NuZvJJvo4Q8pwKAqjIMmZnog2u3IdWcQbucpzA7
  4. Run the LOAD command as before, that is:
    LOAD WITH 'global' <s3://mybucket/load-files/tickit_all.ttl> INTO GRAPH <tickit>
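
The credentials file from the steps above could also be created with a short script. This is a sketch under assumptions: the function name write_s3_credentials is hypothetical, the two key names are the ones shown in the steps above, and restricting the file to its owner is an added precaution that the steps do not require.

```python
import os
from pathlib import Path


def write_s3_credentials(access_key, secret_key, home=None):
    """Create <home>/.aws/credentials with the two key statements."""
    home_dir = Path(home) if home is not None else Path.home()
    aws_dir = home_dir / ".aws"
    aws_dir.mkdir(mode=0o700, exist_ok=True)
    credentials = aws_dir / "credentials"
    credentials.write_text(
        f"aws_access_key_id = {access_key}\n"
        f"aws_secret_key = {secret_key}\n"
    )
    # Precaution (not from the steps above): owner read/write only.
    os.chmod(credentials, 0o600)
    return credentials
```

Run the script as the same user whose home directory AnzoGraph DB reads the ~/.aws directory from, and avoid committing real key values to source control.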

Loading Files from a Mounted File System

To load data from TTL, NT, NQ, or TriG files stored on a mounted NFS server, you can use the following command:

LOAD WITH 'global' <dir:/global/nfs/vpc_nfs_server/directory_path/directory_name> [ INTO GRAPH <graph_name> ]

For example, the following command loads data from gzipped Turtle files in a directory on a mounted file system. All of the servers in the cluster have access to the file system. The data is loaded into a graph named sales:

LOAD WITH 'global' <dir:/global/nfs/vpc_nfs_server/data/sales_data.ttl.gz> INTO GRAPH <sales>

Omit the INTO GRAPH argument when you load files that include graph specifications.

Loading Files from a Data Server

The following example statement loads data from TRIG files in the employees.trig directory on a data server.

LOAD WITH 'global' <https://data.cambridgesemantics.com/loads/employees.trig>

The employees.trig directory contains an ls.dir text file that lists the filenames for the files to load.
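
An ls.dir file like the one described above could be generated with a short script. This is a sketch under an assumption: the topic says only that ls.dir lists the filenames, so the one-filename-per-line layout used here is a guess, and the function name write_ls_dir is hypothetical.

```python
from pathlib import Path


def write_ls_dir(directory):
    """Write ls.dir in directory, listing its load files one per line.

    Hidden files and ls.dir itself are left out. Returns the listed names.
    """
    root = Path(directory)
    names = sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and not p.name.startswith(".") and p.name != "ls.dir"
    )
    # Assumed layout: one filename per line.
    (root / "ls.dir").write_text("".join(name + "\n" for name in names))
    return names
```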

Loading Files from the File System in Docker

When running AnzoGraph DB in a Docker container, the files you plan to load into AnzoGraph DB must already reside in the same Docker container where AnzoGraph DB is installed. In some cases, that means you may have to copy load files from another host system to the AnzoGraph DB file system in Docker. To do that:

  1. In Docker, run the following command to access the AnzoGraph DB file system, the /opt/anzograph directory:
    sudo docker exec -it anzograph_container_name /bin/bash

    Where anzograph_container_name is the name of the AnzoGraph DB container whose file system you want to access. For example:

    sudo docker exec -it anzograph /bin/bash
  2. Determine where on the file system you would like to place the load files and create a new directory if necessary. If you plan to load a directory of files, remember to include the file type in the directory name. See RDF Load File Requirements for more information. For example:
    mkdir /opt/anzograph/load-files.ttl
  3. Type exit to exit the container.
  4. Run the following Docker command to copy files from the host server to a location in the AnzoGraph DB container.
    sudo docker cp /path/filename anzograph_container_name:/path/dir

    For example:

    sudo docker cp /home/user/sales.ttl anzograph:/opt/anzograph/load-files.ttl/

    Or this command copies a directory to the container. The docker cp command copies directories recursively, so no additional option is needed:

    sudo docker cp /path/dirname anzograph_container_name:/path

    For example:

    sudo docker cp /home/user/load-files.ttl anzograph:/opt/anzograph/

Loading Files from HDFS

To load files into AnzoGraph DB from the Hadoop Distributed File System (HDFS), use the same LOAD syntax as for other file locations. The difference is that you can choose among several file handling and authentication protocols. The syntax for loading data from an HDFS cluster is the following:

LOAD [WITH 'leader'|'compute'|'global'] <HDFS_protocol://URLpath> [INTO GRAPH <graph_name>]

Where HDFS_protocol is one of the following options:

  • hdfs: Use non-secure HTTP protocol for loads from HDFS.
  • shdfs: Load HDFS files via secure HTTPS protocol.
  • kshdfs: Load HDFS files using secure HTTPS with Kerberos authentication.

For example:

LOAD WITH 'global' <shdfs://load-files/tickit_all.ttl> INTO GRAPH <tickit>

Along with specifying the HDFS protocol to use, you also need to edit the /etc/hosts file on all nodes of the AnzoGraph DB cluster to include the IP addresses and host names of the NameNode and all DataNodes of the HDFS cluster you are loading data from.
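
The hosts file lines for the HDFS nodes can be generated from a list of nodes. The sketch below only formats the lines and does not modify any file. The function name hosts_entries and the sample host names and addresses in the usage note are hypothetical.

```python
import ipaddress


def hosts_entries(nodes):
    """Format hosts file lines from a mapping of host name to IP address."""
    lines = []
    for hostname, ip in nodes.items():
        # Raises ValueError if the address is not valid IPv4 or IPv6.
        ipaddress.ip_address(ip)
        lines.append(f"{ip}\t{hostname}\n")
    return "".join(lines)
```

For example, hosts_entries({"namenode.example.com": "10.0.0.10", "datanode1.example.com": "10.0.0.11"}) returns two lines that an administrator can review and append to the hosts file on each AnzoGraph DB server.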

For HDFS data loading using Kerberos protocols, you also need to configure AnzoGraph DB to use Kerberos authentication. See Configuring AnzoGraph for Kerberos Authentication below for instructions.

Configuring AnzoGraph for Kerberos Authentication

If you plan to load data to AnzoGraph from an HDFS file store that uses Kerberos authentication, follow the steps below to configure AnzoGraph for Kerberos authentication.

  1. In order to be able to generate an authentication token for requesting encrypted ticket-granting tickets (TGT) from the key distribution center (KDC), each AnzoGraph host server must include the Kerberos workstation package, krb5-workstation. On each server in the cluster, run the following command to install the package:
    sudo yum install -y krb5-workstation
  2. In order to establish a connection to the KDC, AnzoGraph must have a copy of the KDC's krb5.conf file. Place a copy of krb5.conf in the /etc directory on each AnzoGraph host server.
  3. In addition to krb5.conf, each AnzoGraph server needs a copy of the .keytab file from the principal node. The keytab file and principal name are used to generate an authentication token.

    To find the location of the .keytab file and the principal name, you can look up the dfs.web.authentication.kerberos.keytab and dfs.web.authentication.kerberos.principal values in hdfs-site.xml on the HDFS master node.

    Copy the .keytab file to any location on each AnzoGraph host server, and then run the following command to generate the authentication token:

    kinit -p principal_name -k -t path/keytab_file

    Where principal_name is the Kerberos principal name and path/keytab_file is the location and name of the .keytab file.
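
The lookup in hdfs-site.xml and the kinit call from step 3 can be sketched as follows. This is an illustration under assumptions: the function names kerberos_settings and kinit_command are hypothetical, and the XML is assumed to use the usual Hadoop layout of property elements with name and value children. The keytab value read from the file is its location on the HDFS node, so pass the path of your local copy to kinit_command.

```python
import xml.etree.ElementTree as ET

KEYTAB_PROP = "dfs.web.authentication.kerberos.keytab"
PRINCIPAL_PROP = "dfs.web.authentication.kerberos.principal"


def kerberos_settings(hdfs_site_xml):
    """Return (principal, keytab_path) read from hdfs-site.xml content."""
    root = ET.fromstring(hdfs_site_xml)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    # A KeyError here means the property is missing from the file.
    return props[PRINCIPAL_PROP], props[KEYTAB_PROP]


def kinit_command(principal, keytab):
    """Build the argument list for the kinit command shown in step 3."""
    return ["kinit", "-p", principal, "-k", "-t", keytab]
```

The returned list can be passed to subprocess.run on each AnzoGraph host server once the keytab file has been copied there.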
