Creating an Anzo Data Store
This topic provides instructions for creating an Anzo data store, also known as a graph data source. Creating a data store means designating a directory on the file storage system where Anzo can save the AnzoGraph load files that are generated during the ETL process. All installations require at least one data store. You can create one graph data store and configure all pipelines to write to that store (each ETL run automatically creates a new subdirectory under the data store directory), or you can create multiple data stores to use for different data sets.
For information about setting up a file system or storage connection, see Connecting to a File Store.
- In the Anzo console, expand the Administration menu and click Anzo Data Store.
- On the Graph Data Source screen, click the Create button. Anzo opens the Create Anzo Data Store screen.
- Type a Title and optional Description for the data store.
- Click in the Data Location field. Anzo opens the File Location dialog box.
- On the left side of the screen, select the storage location where you want to create this graph store. On the right side of the screen, navigate to the base directory where you want Anzo to save the data files for this data store. Select a directory, and then click OK. Each time ETL runs for this store, Anzo creates a new subdirectory under the base location that you specify.
Note: Ideally, the Data Location is a directory that the Anzo and AnzoGraph servers, as well as any Anzo Unstructured and Elasticsearch servers, can all access, such as a mounted file system or cloud storage location. If you want Anzo to generate files for this graph source in one location and load the files into AnzoGraph from another location, specify the file generation location in this field, and then specify the AnzoGraph load location in the Alternate Data Location field that is displayed on the details screen after you save the data store.
- Specify whether to compress the generated load files. By default, the Compress output checkbox is selected, indicating that Anzo generates .ttl.gz files when writing to this graph data source. If you clear the checkbox, Anzo generates uncompressed .ttl files. To preserve disk space and reduce read times when loading data into memory, Cambridge Semantics recommends that you accept the default configuration and compress load files.
- The Spark ETL engine does not remove duplicates by default when running pipelines. If the source contains a significant number of duplicate entities, you have two options for deduplicating the data:
- Deduplicate the data during the ETL process: To deduplicate the data while running the jobs that will generate this graph source, select the Dedupe output per executor option. Enabling the dedupe option limits the number of duplicates to one per executor node. For example, if the Spark configuration has 10 executor nodes, the resulting data set can contain a maximum of 10 copies of a duplicated entity.
Note: Deduplication is based on primary keys and URI templates. If the source does not employ templating, do not enable the dedupe option.
Important: Enabling this option substantially increases the time it takes to run the jobs for this graph source.
- Deduplicate the data after loading it to AnzoGraph: AnzoGraph deduplicates data during a "vacuum" process that runs automatically after data is loaded into memory. If you leave the Dedupe output per executor option disabled, duplicates will be removed by AnzoGraph.
Note: Deduplicating data with AnzoGraph streamlines the ETL process but can increase load time and temporary memory usage in AnzoGraph during the load.
- Click Save to create the graph source. Anzo saves the graph source and displays the graph source details view.
You can click the Edit icon to modify any of the options. Click the check mark icon to save changes to an option, or click the X icon to clear the value for an option.
- If you plan to load files into AnzoGraph from a location that is different from the Data Location that you specified, edit the Alternate Data Location field and select the location for AnzoGraph load files.
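To illustrate the disk-space effect of the Compress output option described in the steps above, the following Python sketch writes the same Turtle text as an uncompressed .ttl file and as a gzip-compressed .ttl.gz file, and then compares their sizes. It is an illustration only and does not use any Anzo API: the file names, the `write_load_files` helper, and the `urn:example:` URIs are hypothetical, and actual compression ratios depend on the data.

```python
import gzip
import os
import tempfile

# Hypothetical, repetitive Turtle content standing in for a generated load file.
turtle = "".join(
    f'<urn:example:person/{i}> <urn:example:name> "Person {i}" .\n'
    for i in range(1000)
)


def write_load_files(directory, text):
    """Write the same text as sample.ttl and sample.ttl.gz; return both paths."""
    plain_path = os.path.join(directory, "sample.ttl")
    gz_path = os.path.join(directory, "sample.ttl.gz")
    with open(plain_path, "w", encoding="utf-8") as f:
        f.write(text)
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        f.write(text)
    return plain_path, gz_path


with tempfile.TemporaryDirectory() as tmp:
    plain_path, gz_path = write_load_files(tmp, turtle)
    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)
    # Reading the compressed file back yields the original text unchanged.
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        round_trip = f.read()

print(f"uncompressed: {plain_size} bytes, compressed: {gz_size} bytes")
```

Because RDF load files repeat the same URI prefixes and predicates many times, they tend to compress well, which is consistent with the recommendation to keep the default setting.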
See Making a Basic Connection to AnzoGraph for instructions on connecting Anzo to AnzoGraph for loading the files that are generated for the new graph data store.
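The effect of the Dedupe output per executor option described above can be sketched in Python. This is a simplified model, not the Spark ETL engine's implementation: it assumes records are split round-robin across executors and that each executor removes repeats only within its own partition. The `dedupe_per_executor` function and the `urn:example:` URIs are hypothetical.

```python
from collections import Counter


def dedupe_per_executor(records, num_executors):
    """Split records across executors, dedupe within each partition, then merge."""
    # Assumption: round-robin partitioning; real partitioning may differ.
    partitions = [records[i::num_executors] for i in range(num_executors)]
    merged = []
    for partition in partitions:
        # dict.fromkeys drops repeats within this partition, keeping first-seen order.
        merged.extend(dict.fromkeys(partition))
    return merged


# Hypothetical source: one entity repeated 50 times, another repeated 3 times.
records = ["urn:example:person/1"] * 50 + ["urn:example:person/2"] * 3

output = dedupe_per_executor(records, num_executors=10)
copies = Counter(output)

# Each executor emits an entity at most once, so no entity can appear
# more times than there are executors.
print(copies)

# A global pass over the merged output removes the remaining copies.
fully_deduped = list(dict.fromkeys(output))
print(fully_deduped)
```

In this model the number of surviving copies of an entity is bounded by the number of executors, which matches the 10-executor example in the steps. Removing the remaining copies requires a pass over the combined output, which in this workflow is left to AnzoGraph after the load.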