Loading Local or Remote RDF Files with the IO Load Service

This topic provides instructions for loading locally or remotely stored RDF files into AnzoGraph DB using the IO Load service. See RDF Load File Requirements for details about the supported file types, encryption, storage systems, and directory naming requirements.

The IO services require the AnzoGraph DB C++ extensions and their dependencies. Docker, Kubernetes, and AWS CloudFormation deployments include both by default. For RHEL/CentOS installer installations, they are optional: answer Yes when prompted about the C++ extensions (see Installing AnzoGraph DB), and then follow the instructions in Install Dependencies to Run the C++ Extensions to install the required dependencies.

For instructions on loading files with SPARQL LOAD queries, see Loading Local RDF Files with SPARQL LOAD.

Load Service Query Syntax

The following syntax shows the structure of a load service query. The clauses, patterns, and placeholders are described below.

PREFIX io: <http://cambridgesemantics.com/anzograph/io#>
INSERT {
  [ GRAPH <graph_name> { ]
    ?sub ?pred ?obj
  [ } ]
}
WHERE {
  { SELECT ?sub ?pred ?obj .
    WHERE {
      SERVICE io:load('<protocol://path_to_files[,protocol://path_to_files][,...]>'){} .
    }
  }
}

?sub ?pred ?obj

This triple pattern is required and the variable names must be ?sub ?pred ?obj. The WHERE clause requires a subquery that selects the same triple pattern.

GRAPH <graph_name>

This clause is optional. When loading files that contain no graph specifications, such as Turtle or N-Triples files, include this clause to specify the graph to load the data into. If the graph does not exist, AnzoGraph DB creates it automatically and then loads the data into it. If you do not specify a graph, the data is loaded into the default graph.
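As a minimal sketch of the default-graph case, the query below follows the syntax template above but leaves out the GRAPH clause, so the loaded triples would go to the default graph. The S3 path is illustrative only.

```sparql
PREFIX io: <http://cambridgesemantics.com/anzograph/io#>
# No GRAPH clause inside INSERT: triples are loaded into the default graph.
INSERT {
  ?sub ?pred ?obj
}
WHERE {
  { SELECT ?sub ?pred ?obj .
    WHERE {
      # Illustrative path: a directory of compressed Turtle files on Amazon S3.
      SERVICE io:load('<s3://shared-data/load-files/emr.ttl.gz>'){} .
    }
  }
}
```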

You can also include the GRAPH clause when loading quad files. If the quad files contain a mixture of quads and triples, AnzoGraph DB loads the triples into the specified graph. Quads are still loaded according to their graph specification. If you omit this option for quad files, any triples without graph specifications are loaded into the default graph.
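The quad-file behavior can be sketched as follows. The file path and graph name are hypothetical, and the sketch assumes that .nq is among the supported quad file extensions (check RDF Load File Requirements). Under the rules described above, quads in the file would keep their own graph, and only triples without a graph specification would land in the graph named in the GRAPH clause.

```sparql
PREFIX io: <http://cambridgesemantics.com/anzograph/io#>
INSERT {
  # Hypothetical fallback graph: receives only the triples that have no
  # graph specification. Quads are loaded into the graphs they name.
  GRAPH <http://anzograph.com/orders> {
    ?sub ?pred ?obj
  }
}
WHERE {
  { SELECT ?sub ?pred ?obj .
    WHERE {
      # Hypothetical local quad file (assumes the .nq extension is supported).
      SERVICE io:load('</opt/data/quads/orders.nq>'){} .
    }
  }
}
```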

SERVICE io:load

This is the required call to the IO load service. If your query omits the PREFIX clause, include the full URI in the call: SERVICE <http://cambridgesemantics.com/anzograph/io#load>.
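For illustration, a query without the PREFIX clause might look like the sketch below. It assumes the full service URI simply takes the place of io:load in the call, with the rest of the query unchanged. The graph name and path are illustrative.

```sparql
# No PREFIX clause: the service is referenced by its full URI.
INSERT {
  GRAPH <http://anzograph.com/emr> {
    ?sub ?pred ?obj
  }
}
WHERE {
  { SELECT ?sub ?pred ?obj .
    WHERE {
      # Assumption: the full URI replaces io:load and the argument is unchanged.
      SERVICE <http://cambridgesemantics.com/anzograph/io#load>('<s3://shared-data/load-files/emr.ttl.gz>'){} .
    }
  }
}
```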

protocol

The service call includes a URI that specifies the load protocol to use and the path to the load file or directory of files. The protocol that you specify depends on the type of file system that hosts the files:

  • Local File System: Specify file to access files that are stored on the local AnzoGraph DB file system or a file system that is mounted to the AnzoGraph DB servers. Including the file:// protocol is optional. When files are locally accessible, you can omit the protocol and specify only the path to the file or directory.
  • NFS: Specify nfs to access files on an NFS that is not mounted to the AnzoGraph DB servers.
  • Web Server: Specify http or https (for SSL connections) to access files on a web server.
  • Amazon S3: Specify s3 or s3crt to access files on S3. Using s3crt is recommended when loading extremely large files. The S3CrtClient improves the throughput for transfers of large files to and from Amazon S3. For more information about s3crt, see Using S3CrtClient for Amazon S3 operations in the AWS documentation.
  • Google Storage: Specify gs to access files on Google storage.
  • Azure Storage: Specify az to access files on Azure blob storage.
  • Azure WebDAV: Specify webdav or webdavs (for SSL connections) to access files on Azure WebDAV.

path_to_files

After the protocol in the service call URI, specify server connection details, if necessary, and the path to the load file or directory of files. When loading a directory of files, make sure the directory name includes the same file type extension as the files in the directory (see Directory Name Requirements for more information).

AnzoGraph DB loads all valid files in that directory and in any of its subdirectories. Hidden files, which are named with a leading period (such as .file.ttl), are not loaded. See Protocol and Path Examples below for example URIs.

Protocol and Path Examples

The following example URI loads a directory of compressed TTL files from Amazon S3:

<s3://shared-data/load-files/emr.ttl.gz>

The example below connects to an NFS that is not mounted and loads a single NT file:

<nfs://10.10.100.10/shared-data/load-files/rdf/sales-2022.nt>

This example loads a TTL file from a Google object store and another TTL file from a web server:

<gs://shared-data/load-files/emr-data.ttl/patients.ttl,https://10.30.103.3/emr/medications.ttl>
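As a sketch of how a comma-separated URI like the one above fits into a complete query, the example below passes both sources to a single io:load call, following the syntax template earlier in this topic. The target graph name is illustrative.

```sparql
PREFIX io: <http://cambridgesemantics.com/anzograph/io#>
INSERT {
  # Illustrative target graph for the triples from both files.
  GRAPH <http://anzograph.com/emr> {
    ?sub ?pred ?obj
  }
}
WHERE {
  { SELECT ?sub ?pred ?obj .
    WHERE {
      # Two sources in one call, separated by a comma inside the angle brackets.
      SERVICE io:load('<gs://shared-data/load-files/emr-data.ttl/patients.ttl,https://10.30.103.3/emr/medications.ttl>'){} .
    }
  }
}
```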

The following two examples load a directory of compressed TTL files from the AnzoGraph DB file system. The second example omits the file:// protocol since it is optional:

<file:///opt/data/airlines/airline-data.ttl.gz>
</opt/data/airlines/airline-data.ttl.gz>
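Placed in a full query, the protocol-free form might look like the sketch below. The graph name is hypothetical, and the path is the local directory from the example above.

```sparql
PREFIX io: <http://cambridgesemantics.com/anzograph/io#>
INSERT {
  # Hypothetical target graph.
  GRAPH <http://anzograph.com/airlines> {
    ?sub ?pred ?obj
  }
}
WHERE {
  { SELECT ?sub ?pred ?obj .
    WHERE {
      # Local directory of compressed Turtle files; file:// is omitted.
      SERVICE io:load('</opt/data/airlines/airline-data.ttl.gz>'){} .
    }
  }
}
```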

Load Service Query Examples

The example query below loads a directory of compressed TTL files from an Azure blob store:

PREFIX io: <http://cambridgesemantics.com/anzograph/io#>
INSERT {
  GRAPH <http://anzograph.com/emr> {
    ?sub ?pred ?obj
  }
}
WHERE {
  { SELECT ?sub ?pred ?obj .
    WHERE {
      SERVICE io:load('<az://shared-data/load-files/emr.ttl.gz>'){} .
    }
  }
}

This query loads a directory of compressed N3 files from Amazon S3:

PREFIX io: <http://cambridgesemantics.com/anzograph/io#>
INSERT {
  GRAPH <http://anzograph.com/sales> {
    ?sub ?pred ?obj
  }
}
WHERE {
  { SELECT ?sub ?pred ?obj .
    WHERE {
      SERVICE io:load('<s3://shared-data/load-files/sales.n3.gz>'){} .
    }
  }
}