Adding a Dataset to the Dataset Catalog

Source data that is not in RDF format is onboarded through structured or unstructured pipelines, where the data is imported to Anzo and converted to RDF format before becoming available in the Dataset catalog. Certain RDF file types, however, can be added to the catalog directly, making the data available to add to a Graphmart for loading and analyzing in AnzoGraph.

Users can add any pre-existing file-based linked data set (FLDS) to the Dataset catalog, for example when migrating an FLDS from one Anzo server to another. Alternatively, they can point Anzo to a directory of Turtle, N-Triple, N-Quad, or TriG files, and Anzo will create the FLDS and add the data set to the catalog.

To import data from CSV, JSON, XML, Parquet, or SAS files, follow the processes described in Adding Data Sources and Schemas.

This topic provides instructions for making RDF files available as a Dataset in the catalog.

File Requirements

To add data to the Dataset catalog, the location of the files, the file format, and the directory structure must meet the following requirements:

  • Supported File Locations: Files must be staged on a configured file store.
  • Supported File Formats: Files must be in one of the following formats.
    • Turtle (.ttl file type)
    • N-Triple (.n3 and .nt file types)
    • N-Quad (.nq and .quads file types)
    • TriG (.trig file type)

    Any of the file types listed above can be compressed in GZIP format and imported as filename.filetype.gz files.

  • Supported Directory Structure: The directory structure that is required depends on whether you are importing a File-Based Linked Data Set (FLDS)—a data set that was previously created by onboarding data to Anzo—or files that are not yet part of an FLDS:
    • FLDS Imports: FLDS directories should contain an flds.trig file, an onts directory that includes the model .trig file, and an rdf.ttl or rdf.ttl.gz directory that contains the data files. For example:
      LoadEmployees_f7b1f
      ├── flds.trig
      ├── onts
      │   └── Employees.trig
      └── rdf.ttl.gz
          └── Loadnew_employees_8be23.ttl.gz
              └── 20191021034225.ttl.gz
                  ├── part-00000.ttl.gz
                  ├── part-00001.ttl.gz
                  └── part-00003.ttl.gz
      Models must be in TriG format, regardless of the file type of the data files.
    • RDF File Imports: When importing RDF files that are not part of an FLDS, the files must be placed in a directory named rdf.ttl or rdf.ttl.gz. Use one of those names regardless of the file format; for example, stage uncompressed N-Triple, N-Quad, and TriG files in a directory named rdf.ttl. Place uncompressed files in an rdf.ttl directory and gzipped files in an rdf.ttl.gz directory.

      For example:

      External-RDF-Top-Level-Directory
      └── rdf.ttl.gz
          ├── external-rdf-file1.ttl.gz
          ├── external-rdf-file2.ttl.gz
          └── external-rdf-file3.ttl.gz
      

      All files inside an rdf.ttl or rdf.ttl.gz directory must be the same format and end in the same extension. Data in mixed formats will not load successfully. If you plan to import multiple file types, organize files into separate directories by file extension type, and then import each directory separately.
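
The rule above about separating mixed file types can be sketched in a few lines of Python. This is only an illustration and not part of Anzo: the function names and the import_<extension> top-level directory names are hypothetical, and the extension list is copied from Supported File Formats. The sketch groups file names by extension and maps each group to an rdf.ttl or rdf.ttl.gz directory that could then be imported separately.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Extensions listed under Supported File Formats; each may also carry a .gz suffix.
SUPPORTED = {".ttl", ".n3", ".nt", ".nq", ".quads", ".trig"}


def rdf_extension(filename):
    """Return the RDF extension (e.g. '.ttl' or '.nt.gz'), or None if unsupported."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    is_gz = bool(suffixes) and suffixes[-1] == ".gz"
    if is_gz:
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in SUPPORTED:
        return None
    return suffixes[-1] + (".gz" if is_gz else "")


def plan_import_directories(filenames):
    """Group file names so that each target directory holds a single extension."""
    plan = defaultdict(list)
    for name in filenames:
        ext = rdf_extension(name)
        if ext is None:
            continue  # not an RDF file type that can be imported directly
        # The data directory is always rdf.ttl or rdf.ttl.gz, whatever the format.
        data_dir = "rdf.ttl.gz" if ext.endswith(".gz") else "rdf.ttl"
        # Hypothetical top-level directory name, one per extension.
        top = "import_" + ext.strip(".").replace(".", "_")
        plan[f"{top}/{data_dir}"].append(name)
    return dict(plan)
```

For instance, calling plan_import_directories on a.ttl, b.ttl, and c.nq.gz places the two Turtle files under import_ttl/rdf.ttl and the gzipped N-Quad file under import_nq_gz/rdf.ttl.gz.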

Importing RDF Files

Follow the instructions below to create an FLDS catalog entry from a directory of Turtle, N-Triple, N-Quad, or TriG files. Make sure that the files and directory meet the requirements in File Requirements.
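
Before starting the import, it can help to check the staged directory against the requirements. The following Python sketch is an assumption-laden illustration, not an Anzo utility: it assumes the file store is reachable as a local path, and check_rdf_directory is a hypothetical helper name. It checks the directory name, the file extensions, whether compression matches the directory name, and whether formats are mixed.

```python
from pathlib import Path

# Extensions listed under Supported File Formats.
SUPPORTED = {".ttl", ".n3", ".nt", ".nq", ".quads", ".trig"}


def check_rdf_directory(path):
    """Return a list of problems found in an rdf.ttl or rdf.ttl.gz directory."""
    directory = Path(path)
    problems = []
    if directory.name not in ("rdf.ttl", "rdf.ttl.gz"):
        problems.append(
            f"directory must be named rdf.ttl or rdf.ttl.gz, found {directory.name!r}"
        )
    expect_gz = directory.name.endswith(".gz")
    seen = set()
    for file in sorted(p for p in directory.iterdir() if p.is_file()):
        suffixes = [s.lower() for s in file.suffixes]
        is_gz = bool(suffixes) and suffixes[-1] == ".gz"
        base = suffixes[:-1] if is_gz else suffixes
        if not base or base[-1] not in SUPPORTED:
            problems.append(f"unsupported file type: {file.name}")
            continue
        if is_gz != expect_gz:
            problems.append(f"compression does not match directory name: {file.name}")
        seen.add(base[-1] + (".gz" if is_gz else ""))
    if len(seen) > 1:
        problems.append("mixed formats in one directory: " + ", ".join(sorted(seen)))
    if not seen:
        problems.append("no supported RDF files found")
    return problems
```

An empty list means the directory passed these particular checks; it does not guarantee that the files themselves contain valid RDF.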

Anzo provides the option to link the files to an existing data model during the import. If the model is not yet available in Anzo, consider uploading it before importing the RDF files. See Uploading a Model to Anzo for instructions. You are not required to include a model at import time; a model can be associated with a data set at any time.

  1. In the Anzo application, expand the Blend menu and click Datasets. Anzo displays the Datasets screen, which lists the catalog of data sets.

  2. On the Datasets screen, click Add Dataset > File Based Dataset. Anzo opens the Create Catalog Data dialog box.

  3. The Import RDF radio button is selected by default. Type a name for the data set in the Title field and an optional description in the Description field.
  4. Click the RDF File Location field to open the File Location dialog box. Find and select the rdf.ttl or rdf.ttl.gz directory that you want to import, and then click OK to close the dialog box.
  5. If you want to associate a model with this data set, click the Ontologies drop-down list and select the model. To include a system model, select the Include System Data checkbox. If you do not want to associate a model with the data at this time, leave the Ontologies field blank.

    Data sets without a model cannot be viewed in Hi-Res Analytics dashboards, but the imported data can still be queried. A model can be associated with the data set at a later time.

  6. Click Save to create the FLDS, add it to the catalog, and return to the Datasets screen. You can now select the FLDS from the catalog and create a graphmart. See Creating a Graphmart for instructions.

    Anzo generates an flds.trig file at the same level as the rdf.ttl or rdf.ttl.gz directory. The file contains metadata about the load files.

Importing an FLDS

Follow the instructions below to add an FLDS to the catalog. Make sure that the FLDS meets the requirements in File Requirements.
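
The FLDS layout described in File Requirements can be checked with a short script before the import. This is a minimal sketch under stated assumptions: the file store is reachable as a local path, check_flds_root is a hypothetical helper name, and only the presence of the expected entries is tested, not their contents.

```python
from pathlib import Path


def check_flds_root(path):
    """Return a list of problems with the layout of an FLDS root directory."""
    root = Path(path)
    problems = []
    # The root must contain the flds.trig metadata file.
    if not (root / "flds.trig").is_file():
        problems.append("missing flds.trig")
    # The onts directory must hold the model, which has to be in TriG format.
    onts = root / "onts"
    if not onts.is_dir():
        problems.append("missing onts directory")
    elif not any(
        p.is_file() and p.suffix.lower() == ".trig" for p in onts.iterdir()
    ):
        problems.append("onts contains no .trig model file")
    # The data files live in an rdf.ttl or rdf.ttl.gz directory.
    if not any((root / name).is_dir() for name in ("rdf.ttl", "rdf.ttl.gz")):
        problems.append("missing rdf.ttl or rdf.ttl.gz directory")
    return problems
```

Run against a directory such as LoadEmployees_f7b1f from the earlier example, the function would return an empty list if flds.trig, onts/Employees.trig, and rdf.ttl.gz are all present.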

  1. In the Anzo application, expand the Blend menu and click Datasets. Anzo displays the Datasets screen, which lists the catalog of data sets.

  2. On the Datasets screen, click Add Dataset > File Based Dataset. Anzo opens the Create Catalog Data dialog box.

  3. Select the Import FLDS radio button.
  4. Click the RDF File Location field to open the File Location dialog box. Select the root directory for the FLDS, that is, the directory that contains the flds.trig file, the onts directory, and the rdf.ttl or rdf.ttl.gz directory.

  5. Click Save to import the FLDS and return to the Datasets screen. You can now select the Dataset in the catalog and create a graphmart. See Creating a Graphmart for instructions.