Adding a Dataset to the Dataset Catalog

Source data that is not in RDF format is onboarded through structured or unstructured pipelines, where the data is imported to Anzo and converted to RDF format before becoming available in the Dataset catalog. Certain RDF file types, however, can be added to the catalog directly, making the data available to add to a Graphmart for loading and analyzing in AnzoGraph.

Users can add any pre-existing file-based linked data set (FLDS) to the Dataset catalog, for example when migrating a Dataset from one Anzo server to another. Alternatively, they can point Anzo to a directory of Turtle or N-Triple files, and Anzo creates the FLDS and adds the Dataset to the catalog.

To import data from CSV, JSON, XML, Parquet, or SAS files, follow the processes described in Adding Data Sources and Schemas.

This topic provides instructions for making RDF files available as a Dataset in the catalog.

File Requirements

To add data to the Dataset catalog, the location of the files, the file format, and the directory structure must meet the following requirements:

  • Supported File Locations: Files must be staged on a configured file store.
  • Supported File Formats: Files must be in one of the following formats.
    • Turtle (.ttl file type)
    • N-Triple (.n3 and .nt file types)

    Either of the file types listed above can be compressed in GZIP format and imported as filename.filetype.gz files.

  • Supported Directory Structure: The required directory structure depends on whether you are importing a File-Based Linked Data Set (FLDS), which is a data set that was previously created by onboarding data to Anzo, or files that are not yet part of an FLDS:
    • FLDS Imports: FLDS directories should contain an flds.trig file, an onts directory that includes the model .trig file, and an rdf.ttl or rdf.ttl.gz directory that contains the data files. For example:
      LoadEmployees_f7b1f
      ├── flds.trig
      ├── onts
      │   └── Employees.trig
      └── rdf.ttl.gz
          └── Loadnew_employees_8be23.ttl.gz
              └── 20191021034225.ttl.gz
                  ├── part-00000.ttl.gz
                  ├── part-00001.ttl.gz
                  └── part-00003.ttl.gz
      Models must be in TriG format, regardless of the file type of the data files.
    • RDF File Imports: When importing RDF files that are not part of an FLDS, the files must be placed in a directory named rdf.<filetype> or rdf.<filetype>.gz. Stage uncompressed TTL files in a directory called rdf.ttl, and stage compressed TTL files in a directory called rdf.ttl.gz. Stage uncompressed N-Triple files in a directory called rdf.nt or rdf.n3, depending on the file type extension. Place compressed files in an rdf.nt.gz or rdf.n3.gz directory. For example:
      External-RDF-Top-Level-Directory
      └── rdf.ttl.gz
          ├── external-rdf-file1.ttl.gz
          ├── external-rdf-file2.ttl.gz
          └── external-rdf-file3.ttl.gz

      All files inside an rdf.<filetype> directory must be the same format and end in the same extension. Data in mixed formats will not load successfully. If you plan to import multiple file types, organize files into separate directories by file extension type, and then import each directory separately.
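
The advice above about separating mixed file types can be sketched as a small Python script. This is only an illustration, not part of Anzo: the function names, the staging paths, and the choice to give each rdf.<filetype> directory its own parent directory are assumptions made for this example.

```python
from __future__ import annotations

import shutil
from pathlib import Path

# File types named in the requirements above; ".gz" marks the compressed form.
SUPPORTED = ("ttl", "nt", "n3")


def rdf_dir_name(filename: str) -> str | None:
    """Return the rdf.<filetype>[.gz] directory name for a file, or None."""
    parts = filename.lower().split(".")
    if len(parts) >= 3 and parts[-1] == "gz" and parts[-2] in SUPPORTED:
        return f"rdf.{parts[-2]}.gz"
    if len(parts) >= 2 and parts[-1] in SUPPORTED:
        return f"rdf.{parts[-1]}"
    return None  # not a Turtle or N-Triple file


def organize(source: Path, staging_root: Path) -> dict[str, list[str]]:
    """Copy supported files from source into one rdf.<filetype> directory each.

    Assumption: every rdf.<filetype> directory gets its own parent directory
    (for example staging_root/ttl_gz/rdf.ttl.gz) so that each one can be
    imported separately.
    """
    placed: dict[str, list[str]] = {}
    for path in sorted(source.iterdir()):
        if not path.is_file():
            continue
        target = rdf_dir_name(path.name)
        if target is None:
            continue  # leave unsupported files where they are
        parent = target[len("rdf."):].replace(".", "_")
        dest = staging_root / parent / target
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, dest / path.name)
        placed.setdefault(target, []).append(path.name)
    return placed
```

Calling organize() with a directory of mixed files and a staging path on the file store returns a mapping from each rdf.<filetype> directory name to the files copied into it. Each resulting directory can then be imported on its own.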

Importing RDF Files

Follow the instructions below to create a Dataset catalog entry from a directory of Turtle or N-Triple files. Make sure that the files and directory meet the requirements in File Requirements.
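
Before starting the import, the directory can be checked locally against the naming rules in File Requirements. The Python sketch below is a hypothetical helper, not an Anzo tool; it assumes the data files sit directly inside the rdf.<filetype> directory, as in the example layout, and it compares extensions case-sensitively.

```python
from __future__ import annotations

import re
from pathlib import Path

# Directory names allowed by the requirements: rdf.ttl, rdf.nt, rdf.n3,
# each optionally followed by .gz for compressed files.
DIR_PATTERN = re.compile(r"^rdf\.(ttl|nt|n3)(\.gz)?$")


def check_rdf_directory(directory: Path) -> list[str]:
    """Return a list of problems found; an empty list means no problems."""
    if DIR_PATTERN.match(directory.name) is None:
        return [f"{directory.name}: directory must be named rdf.<filetype> or rdf.<filetype>.gz"]

    # The directory name minus "rdf" is the extension every file must share,
    # for example ".ttl.gz" for a directory named rdf.ttl.gz.
    expected = directory.name[len("rdf"):]
    files = [p for p in directory.iterdir() if p.is_file()]

    problems = []
    if not files:
        problems.append(f"{directory.name}: contains no data files")
    for path in sorted(files):
        if not path.name.endswith(expected):
            problems.append(f"{path.name}: expected a file ending in {expected}")
    return problems
```

An empty result suggests that the directory name and file extensions are consistent. The check does not parse the files, so it cannot confirm that their contents are valid Turtle or N-Triple data.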

Anzo provides the option to link the files to an existing data model during the import. If the model is not yet available in Anzo, consider uploading it before importing the RDF files. See Uploading a Model to Anzo for instructions. You are not required to include a model at import time; a model can be associated with a data set at any time.

  1. In the Anzo application, expand the Blend menu and click Datasets. Anzo displays the Datasets screen, which lists the catalog of Datasets.

  2. On the Datasets screen, click Add Dataset. Anzo opens the Create Dataset dialog box.

  3. The From Existing RDF radio button is selected by default. Type a name for the new Dataset in the Title field and an optional description in the Description field.
  4. Click the RDF File Location field to open the File Location dialog box. Find and select the rdf.<filetype> directory that you want to import, and then click OK to close the dialog box.
  5. If you want to associate a model with this Dataset, click the Ontologies drop-down list and select the model. To include a system model, select the Include System Data checkbox. If you do not want to associate a model with the data at this time, leave the Ontologies field blank.

    Datasets without a model cannot be viewed in Hi-Res Analytics dashboards, but the imported data can still be queried. A model can be associated with the data set at a later time.

  6. Click Save to create the FLDS, add it to the catalog, and return to the Datasets screen. You can now select the FLDS from the catalog and create a Graphmart. See Creating a New Graphmart for instructions.

    Anzo generates an flds.trig file at the same level as the rdf.ttl or rdf.ttl.gz directory. The file contains metadata about the load files.

Importing an Existing Dataset

Follow the instructions below to add an existing Dataset, such as an exported Dataset, to the catalog. Make sure that the FLDS meets the requirements in File Requirements.
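
The FLDS layout described in File Requirements can also be checked with a short script before the import. The following Python sketch is illustrative only: the function name is hypothetical, and it verifies only that the expected files and directories are present, not that their contents are valid.

```python
from __future__ import annotations

from pathlib import Path


def check_flds_layout(root: Path) -> list[str]:
    """Return a list of problems with an FLDS root directory; empty if none."""
    problems = []

    # The root must contain the flds.trig metadata file.
    if not (root / "flds.trig").is_file():
        problems.append("missing flds.trig")

    # The onts directory must hold the model, which must be in TriG format.
    onts = root / "onts"
    if not onts.is_dir():
        problems.append("missing onts directory")
    elif not any(p.is_file() and p.suffix == ".trig" for p in onts.iterdir()):
        problems.append("onts directory contains no .trig model file")

    # The data files live in an rdf.ttl or rdf.ttl.gz directory.
    if not any((root / name).is_dir() for name in ("rdf.ttl", "rdf.ttl.gz")):
        problems.append("missing rdf.ttl or rdf.ttl.gz directory")

    return problems
```

Pass the Dataset root directory, which is the same directory that is selected in the File Location dialog box, and review any problems that the function reports.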

  1. In the Anzo application, expand the Blend menu and click Datasets. Anzo displays the Datasets screen, which lists the catalog of Datasets.

  2. On the Datasets screen, click Add Dataset. Anzo opens the Create Dataset dialog box.

  3. Select the From Existing Dataset radio button.

  4. Click the RDF File Location field to open the File Location dialog box. Select the root directory for the Dataset. This is the directory that contains the flds.trig file, the onts directory, and the rdf.ttl or rdf.ttl.gz directory.

  5. Click Save to import the FLDS and return to the Datasets screen. You can now select the Dataset in the catalog and create a Graphmart. See Creating a New Graphmart for instructions.