Creating an Anzo Data Store

This topic provides instructions for creating an Anzo data store, also known as a graph data source. Creating a data store designates a directory on the file store where file-based linked data sets and other files can be created and shared during the ETL process. All installations require at least one data store. You can create a single data store and configure all pipelines to write to it (each ETL run automatically creates a new subdirectory under the data store directory), or you can create multiple data stores to use for different data sets.

For information about setting up a connection to the shared file system that will host the data store, see Connecting to a File Store.
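Before you create the data store, it can help to confirm that the directory you plan to use is visible and writable from each server that mounts the shared file system. The following is a minimal sketch, not part of Anzo: the function name `check_data_location` and the example path `/opt/shared/anzo-data` are assumptions made for illustration only.

```python
import os
import tempfile


def check_data_location(path: str) -> dict:
    """Report whether `path` is an existing directory that this host can write to."""
    result = {
        "exists": os.path.exists(path),
        "is_dir": os.path.isdir(path),
        "writable": False,
    }
    if result["is_dir"]:
        try:
            # Create (and automatically remove) a temporary file to test write access.
            with tempfile.NamedTemporaryFile(dir=path):
                result["writable"] = True
        except OSError:
            pass
    return result


if __name__ == "__main__":
    # Hypothetical mount point; substitute the directory shared on your file store.
    print(check_data_location("/opt/shared/anzo-data"))
```

Running the script on each host that shares the file store gives a quick indication of whether the mount is in place on that host. It does not verify Anzo permissions or the file store connection itself.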

Administrator privileges are required to complete this task. Specifically, the Create Anzo Data Stores and Administer System Setup permissions are required.

  1. In the Administration application, expand the Connections menu and click Anzo Data Store. Anzo displays the Anzo Data Store screen, which lists any existing data stores.

  2. On the Anzo Data Store screen, click the Add Anzo Data Store button. Anzo opens the Create Anzo Data Store screen.

  3. Type a Title and optional Description for the data store.
  4. Click in the Data Location field. Anzo opens the File Location dialog box.

  5. On the left side of the dialog box, select the file store on which to create this data store. On the right side, navigate to the directory that you want to designate as the data location. Select the directory and click OK, or click Create New Folder to create a new directory. Each time a pipeline is run for this data store, a new subdirectory is created under the specified data location.

    The Data Location needs to be a directory on the file store that is shared between Anzo, AnzoGraph, and any Anzo Unstructured, Elasticsearch, or Spark servers. If you want Anzo to generate files for this data store in one location and then load the files into AnzoGraph from another location, specify the file generation location in this field, and then specify the AnzoGraph load location in the Alternate Data Location field that is displayed on the Details screen after you save the data store.

  6. Specify whether to compress the generated load files. By default, the Compress output checkbox is selected, indicating that Anzo generates .ttl.gz files when writing to this graph data source. If you clear the checkbox, Anzo generates uncompressed .ttl files. To preserve disk space and reduce read times when loading data into memory, Cambridge Semantics recommends that you accept the default configuration and compress load files.
  7. Specify how to handle duplicate data. The Spark ETL engine does not remove duplicates by default when running pipelines. If the source contains a significant number of duplicate entities, you have two options for deduplicating the data:
    • Deduplicate the data during the ETL process: To deduplicate the data while running the jobs that will generate this graph source, select the Dedupe output per executor option. Enabling the dedupe option limits the number of duplicates to one duplicate per executor node. For example, if the Spark configuration has 10 executor nodes, the resulting data set can contain at most 10 copies of any given entity.

      Deduplication is based on primary keys and URI templates. If the source does not employ templating, do not enable the dedupe option. In addition, enabling this option substantially increases the time it takes to run the jobs for this data store.

    • Deduplicate the data after loading it to AnzoGraph: AnzoGraph deduplicates data during a "vacuum" process that runs automatically after data is loaded into memory. If you leave the Dedupe output per executor option disabled, duplicates will be removed by AnzoGraph.

      Deduplicating data with AnzoGraph streamlines the ETL process but can increase load time and temporary memory usage in AnzoGraph during the load.

  8. Click Save to create the data store. Anzo saves the store and displays the data store overview.

    You can click a field to edit a value. Click the check mark icon to save changes to an option, or click the X icon to clear the value for an option.

  9. If you plan to load files into AnzoGraph from a location that is different from the Data Location that you specified, edit the Alternate Data Location field and select the location for AnzoGraph load files.
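
To illustrate the effect of the Compress output option in step 6, the following sketch writes the same Turtle content as a .ttl.gz file and as a plain .ttl file and compares their sizes. It uses only the Python standard library and does not call Anzo. The sample triples, the file name `sample`, and the function `write_load_file` are assumptions for illustration and do not reflect how Anzo names or structures its load files.

```python
import gzip
import os
import tempfile

# Illustrative Turtle content; real load files are produced by the ETL pipeline.
TRIPLES = "".join(
    f'<http://example.org/person/{i}> <http://example.org/name> "Person {i}" .\n'
    for i in range(1000)
)


def write_load_file(directory: str, compress: bool) -> str:
    """Write TRIPLES as sample.ttl.gz (compressed) or sample.ttl and return the path."""
    if compress:
        path = os.path.join(directory, "sample.ttl.gz")
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.write(TRIPLES)
    else:
        path = os.path.join(directory, "sample.ttl")
        with open(path, "w", encoding="utf-8") as f:
            f.write(TRIPLES)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for compress in (True, False):
            p = write_load_file(d, compress)
            print(os.path.basename(p), os.path.getsize(p), "bytes")
```

Because RDF text repeats the same URI prefixes many times, the gzip file is typically much smaller than the uncompressed file, which is consistent with the recommendation to leave compression enabled.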

See Making a Basic Connection to AnzoGraph for instructions on connecting Anzo to AnzoGraph for loading the files that are generated for the new data store.
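
The per-executor deduplication described in step 7 can be pictured with a small simulation. This is a conceptual sketch only, not Spark or Anzo code: the function `dedupe_per_executor` is hypothetical, and the round-robin assignment of records to executors is an assumption chosen to keep the example simple. Each simulated executor removes duplicates only among the records it holds, so an entity can survive once per executor.

```python
from collections import Counter


def dedupe_per_executor(records, num_executors):
    """Simulate per-executor deduplication.

    Records are dealt round-robin to executors (an assumption for this sketch);
    each executor drops duplicates only among the records it holds.
    """
    partitions = [[] for _ in range(num_executors)]
    for i, record in enumerate(records):
        partitions[i % num_executors].append(record)

    output = []
    for partition in partitions:
        seen = set()
        for record in partition:
            if record not in seen:
                seen.add(record)
                output.append(record)
    return output


if __name__ == "__main__":
    # 100 copies of one entity URI spread across 10 executors.
    records = ["http://example.org/person/1"] * 100
    result = dedupe_per_executor(records, num_executors=10)
    print(Counter(result))
```

With 100 copies of one entity and 10 simulated executors, 10 copies remain in the output, one per executor. The remaining copies are the ones that the AnzoGraph "vacuum" process would remove after loading.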
