Graph Storage Concepts
This topic describes how onboarded graph data is stored in, and shared between, the Anzo and AnzoGraph graph stores.
The onboarding process generates different types of graph data artifacts. Storage of the artifacts differs based on the type of data that is being stored and the purpose of the data. The list below describes the artifacts and storage methods:
- Metadata, such as data models, data source configuration details, Datasets catalog entries, registries, and access control definitions, is stored in Anzo's embedded graph store. The Anzo graph store is a transaction-oriented store that is built for processing many updates to small amounts of data. Data is persisted to disk in a journal, also known as a volume. The system volume (or system data source) is the default, required volume where Anzo stores models as well as system configuration, dataset, registry, and access control metadata. Users can also create secondary local volumes for more compartmentalized data. These volumes can be created and deleted without affecting the core system.
- When an unstructured pipeline is run or data is exported from a graphmart, the instance data and copies of the data models are written to a file-based linked data set (FLDS) on the shared file store. Each FLDS is represented as a dataset in Anzo's Datasets catalog. The catalog entry includes a pointer to the RDF files on disk.
- When a dataset from the catalog is added to a graphmart and the graphmart is activated, Anzo loads the data from the FLDS into the AnzoGraph graph store. Once the data is in memory, AnzoGraph executes the rest of the graphmart's data layer steps. This stage is known as the ELT process. Each data layer becomes a graph in AnzoGraph, and each layer graph includes the instance data created by that layer as well as the related models.
- Anzo system ontologies and metadata remain in Anzo's graph store, the system data source. They are not loaded to AnzoGraph unless the system data is added to a graphmart and the graphmart is activated.
- AnzoGraph does not have a "graphmart" or "step" construct, and graphmart and step URIs do not exist in the database. Though a graphmart acts as a container for data layers and its metadata can be queried in Anzo's embedded graph store, it does not include instance data that is needed by AnzoGraph.
Typically, organizations manage all data with Anzo. That is, data is onboarded to Anzo through unstructured pipelines or the Graph Data Interface, and Anzo then loads the data to AnzoGraph for analytics. When data is loaded to AnzoGraph through Anzo, Anzo manages the reloading of graphmarts if AnzoGraph is restarted. Users can also load data and create named graphs directly in AnzoGraph, but AnzoGraph is not configured by default to persist in-memory data to disk. As a result, graphs that do not originate in Anzo must be reloaded manually any time AnzoGraph is restarted.
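As an illustration of creating a named graph directly in AnzoGraph, the following standard SPARQL 1.1 Update request is a minimal sketch. The graph URI and the triple are hypothetical placeholders, not values from an Anzo deployment. Because such a graph does not originate in Anzo, it would need to be re-created manually after AnzoGraph restarts unless persistence has been configured.

```sparql
# Hypothetical example: the graph URI and the triple below are placeholders.
INSERT DATA {
  GRAPH <http://example.org/graphs/manual-load> {
    <http://example.org/resource/item1> <http://www.w3.org/2000/01/rdf-schema#label> "Item 1" .
  }
}
```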
Example
An Anzo instance has two active graphmarts. Each graphmart has two data layers, one for loading datasets into memory and another for creating views and running ELT queries. When the following query is run against AnzoGraph to return a list of all distinct graphs, the results show that there are five graphs:
SELECT DISTINCT ?graph WHERE { GRAPH ?graph { ?s ?p ?o } }
graph
------------------------------------------------------------------------------
http://cambridgesemantics.com/Layer/546fb89ac6d245f8bea2777a52077bc9
http://cambridgesemantics.com/Layer/1162fb0d0b724a18b4133c10d69f16b7
http://cambridgesemantics.com/Layer/12c7eedddff9449ab4b133373b56e65c
http://cambridgesemantics.com/Layer/b69bb3295ba3434e846b1ed372039416
http://cambridgesemantics.com/GqeDatasource/guid_10492203b5aa4a54f217ababb3dc6dee
5 rows
The first four graphs are the data layers for the two graphmarts. The graph URIs match the data layer URIs in Anzo. The last graph in the results above is the AnzoGraph data source graph. This graph contains one triple that records a timestamp for the last time the data source was updated. If Anzo loses the connection to AnzoGraph, it checks this timestamp when it reconnects. The last updated time is used to determine whether the Anzo and AnzoGraph graph stores are in sync or whether the graphmarts need to be reloaded to AnzoGraph.
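To inspect the timestamp triple, a query of the following shape could be run against AnzoGraph. This is a sketch: it reuses the data source graph URI from the example results above, which will differ in every deployment, and it selects the triple generically because the predicate that holds the timestamp is not named here.

```sparql
# Return every triple in the AnzoGraph data source graph.
# The graph URI is copied from the example results; substitute your own.
SELECT ?s ?p ?o
WHERE {
  GRAPH <http://cambridgesemantics.com/GqeDatasource/guid_10492203b5aa4a54f217ababb3dc6dee> {
    ?s ?p ?o
  }
}
```

Based on the description above, the result should be a single row whose object is the last updated timestamp.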