Creating an Anzo Data Store

An Anzo data store, also known as a graph data source, is a designated directory on the shared file system where Anzo saves the AnzoGraph load files that are generated during the ETL process. You can create one Anzo data store and configure all pipelines to write to it (each ETL run automatically creates a new subdirectory under the data store directory), or you can create multiple Anzo data stores and use a different one for each data set. This topic provides instructions for creating a data store.
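
To make the directory layout concrete, the following minimal Python sketch walks a data store base directory and counts the load files under each immediate subdirectory. It is an illustration only and not part of Anzo: the function name summarize_data_store is hypothetical, and the sketch assumes nothing about how Anzo names the per-run subdirectories. The .ttl and .ttl.gz suffixes are the load file types described in this topic.

```python
from pathlib import Path

# Load file suffixes named in this topic: compressed and uncompressed Turtle.
LOAD_FILE_SUFFIXES = (".ttl.gz", ".ttl")


def summarize_data_store(base_dir):
    """Map each immediate subdirectory of base_dir to its load file count.

    Hypothetical helper: it treats every subdirectory of the base directory
    as the output of one ETL run, without assuming a naming scheme.
    """
    summary = {}
    for run_dir in sorted(Path(base_dir).iterdir()):
        if not run_dir.is_dir():
            continue  # ignore stray files at the top level
        load_files = [
            f for f in run_dir.rglob("*")
            if f.is_file() and f.name.endswith(LOAD_FILE_SUFFIXES)
        ]
        summary[run_dir.name] = len(load_files)
    return summary
```

Calling summarize_data_store on the mounted base directory of a data store returns a dictionary keyed by subdirectory name, which can be a quick way to confirm that each pipeline run wrote its files where you expected.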

For information about setting up a file store connection, see Connecting to a File Store.

  1. In the Anzo console, expand the Administration menu and click Anzo Data Store. Anzo displays the Graph Data Source screen, which lists any existing data stores.

  2. On the Graph Data Source screen, click the Create button. Anzo opens the Create Graph Data Source screen.

  3. Type a Title and optional Description for the graph source.
  4. Click in the Data Location field. Anzo opens the File Location dialog box.

  5. On the left side of the screen, select the storage location where you want to create this graph store. On the right side of the screen, navigate to the base directory where you want Anzo to save the data files for this graph source. Select a directory, and then click OK. Each time ETL runs for this store, Anzo creates a new subdirectory under the base location that you specify.

    Note: Ideally, the Data Location is a directory that the Anzo and AnzoGraph servers have access to, such as a mounted file system or cloud storage location. If you want Anzo to generate files for this graph source in one location and load the files into AnzoGraph from another location, specify the file generation location in this field, and then specify the AnzoGraph load location in the Alternate Data Location field that is displayed on the details screen after you save the data store.

  6. Specify whether to compress the generated load files. By default, the Compress output checkbox is selected, indicating that Anzo generates .ttl.gz files when writing to this graph data source. If you clear the checkbox, Anzo generates uncompressed .ttl files. To preserve disk space and reduce read times when loading data into memory, Cambridge Semantics recommends that you accept the default configuration and compress load files.
  7. The Spark ETL engine does not remove duplicates by default when running pipelines. If the source contains a significant number of duplicate entities, you have two options for deduplicating the data:
    • Deduplicate the data during the ETL process: To deduplicate the data while running the jobs that will generate this graph source, select the Dedupe output per executor option. With this option enabled, each executor node removes the duplicates within its own share of the data, so an entity can still appear once per executor node. For example, if the Spark configuration has 10 executor nodes, the resulting data set can contain up to 10 copies of the same entity.

      Note: Deduplication is based on primary keys and URI templates. If the source does not employ templating, do not enable the dedupe option.

      Important: Enabling this option substantially increases the time it takes to run the jobs for this graph source.

    • Deduplicate the data after loading it to AnzoGraph: AnzoGraph deduplicates data during a "vacuum" process that runs automatically after data is loaded into memory. If you leave the Dedupe output per executor option disabled, duplicates will be removed by AnzoGraph.

      Note: Deduplicating data with AnzoGraph streamlines the ETL process but can increase load time and temporary memory usage in AnzoGraph during the load.

  8. Click Save to create the data store. Anzo saves the configuration and displays the details view.

    You can click the Edit icon to modify any of the options. Click the check mark icon to save changes to an option, or click the X icon to clear the value for an option.

  9. If you plan to load files into AnzoGraph from a location that is different from the Data Location that you specified, edit the Alternate Data Location field and select the location for AnzoGraph load files.
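
The effect of the Dedupe output per executor option described in step 7 can be illustrated with a small simulation. This is a toy model in plain Python, not Spark or Anzo code. The function name dedupe_per_executor, the round-robin distribution of records, and the example entity URIs are all assumptions made for illustration. The sketch shows only that deduplicating within each executor leaves at most one copy of an entity per executor.

```python
from collections import Counter


def dedupe_per_executor(records, num_executors):
    """Simulate deduplication that happens separately on each executor.

    Records are dealt round-robin to executors (an illustrative assumption,
    not how Spark partitions data). Each executor drops duplicates only
    within its own share, so copies on different executors survive.
    """
    partitions = [set() for _ in range(num_executors)]
    for i, record in enumerate(records):
        partitions[i % num_executors].add(record)
    output = []
    for part in partitions:
        output.extend(sorted(part))
    return output


# Hypothetical input: one entity repeated 25 times, another repeated 3 times.
records = ["urn:entity:1"] * 25 + ["urn:entity:2"] * 3
copies = Counter(dedupe_per_executor(records, num_executors=10))
print(copies["urn:entity:1"])  # 10: one surviving copy per executor
print(copies["urn:entity:2"])  # 3: the entity reached only 3 executors
```

In this model, the remaining copies are the ones that AnzoGraph would still need to remove after the load, which is why the number of executors bounds the number of copies of any one entity.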