Graph Storage Concepts

This topic describes how onboarded graph data is stored in, and shared between, the Graph Studio and Graph Lakehouse graph stores.

The onboarding process generates different types of graph data artifacts. How an artifact is stored depends on the type of data and its purpose. The list below describes the artifacts and storage methods:

  • Metadata, such as data models, data source configuration details, Datasets catalog entries, registries, and access control definitions, is stored in Graph Studio's embedded graph store. The Graph Studio graph store is a transaction-oriented store that is built for processing many updates to small amounts of data. Data is persisted to disk in a journal, also known as a volume. The system volume (or system data source) is the default, required volume where Graph Studio stores models as well as system configuration, dataset, registry, and access control metadata. Users can create secondary local volumes for more compartmentalized data; these volumes can be created and deleted without affecting the core system.
  • When an unstructured pipeline is run or data is exported from a graphmart, the instance data and copies of the data models are written to a file-based linked data set (FLDS) on the shared file store. Each FLDS is represented as a dataset in Graph Studio's Datasets catalog. The catalog entry includes a pointer to the RDF files on disk.
  • When a dataset from the catalog is added to a graphmart and the graphmart is activated, Graph Studio loads the data from the FLDS into the Graph Lakehouse graph store. Once the data is in memory, Graph Lakehouse executes the rest of the graphmart's data layer steps; this is known as the ELT (extract, load, transform) process. Each data layer becomes a graph in Graph Lakehouse, and each layer graph includes the instance data created by that layer as well as the related models.
  • Graph Studio system ontologies and metadata remain in Graph Studio's graph store, the system data source. They are not loaded to Graph Lakehouse unless the system data is added to a graphmart and the graphmart is activated.
  • Graph Lakehouse does not have a "graphmart" or "step" construct, and graphmart and step URIs do not exist in the database. Though a graphmart acts as a container for data layers and its metadata can be queried in Graph Studio's embedded graph store, it does not include instance data that is needed by Graph Lakehouse.
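
Because each data layer becomes its own graph in Graph Lakehouse, a standard SPARQL 1.1 aggregate query can summarize how much data each layer graph holds. The following is a minimal sketch that uses only standard SPARQL; the variable names are illustrative:

```sparql
# Count the triples held in each named graph.
SELECT ?graph (COUNT(*) AS ?triples)
WHERE {
  GRAPH ?graph {
    ?s ?p ?o
  }
}
GROUP BY ?graph
ORDER BY DESC(?triples)
```

Each row pairs a graph URI with its triple count. The counts for a layer graph include both the instance data and the related models stored in that graph.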

Typically, organizations manage all data with Graph Studio; that is, data is onboarded to Graph Studio through unstructured pipelines or the Graph Data Interface, and Graph Studio then loads the data to Graph Lakehouse for analytics. When data is loaded to Graph Lakehouse through Graph Studio, Graph Studio manages the reloading of graphmarts if Graph Lakehouse is restarted. Though users can load data and create named graphs directly in Graph Lakehouse, Graph Lakehouse is not configured by default to persist in-memory data to disk. Graphs that do not originate in Graph Studio must therefore be reloaded manually any time Graph Lakehouse is restarted.
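
As an illustration of creating a named graph directly in Graph Lakehouse, the sketch below uses a standard SPARQL 1.1 Update request. It assumes that the endpoint accepts SPARQL Update; the graph URI, prefix, and triple are hypothetical and are not part of any Graph Studio deployment:

```sparql
# Hypothetical graph and resource URIs, for illustration only.
PREFIX ex: <http://example.org/>

INSERT DATA {
  GRAPH <http://example.org/graph/manual-load> {
    ex:item1 ex:label "Loaded directly into Graph Lakehouse" .
  }
}
```

A graph created this way is not managed by Graph Studio, so under the default configuration the request would need to be run again after Graph Lakehouse is restarted.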

Example

A Graph Studio instance has two active graphmarts. Each graphmart has two data layers, one for loading datasets into memory and another for creating views and running ELT queries. When the following query is run against Graph Lakehouse to return a list of all distinct graphs, the results show that there are five graphs:

SELECT DISTINCT ?graph
WHERE { 
  GRAPH ?graph {
    ?s ?p ?o
  }
}

graph
------------------------------------------------------------------------------
http://cambridgesemantics.com/Layer/546fb89ac6d245f8bea2777a52077bc9
http://cambridgesemantics.com/Layer/1162fb0d0b724a18b4133c10d69f16b7
http://cambridgesemantics.com/Layer/12c7eedddff9449ab4b133373b56e65c
http://cambridgesemantics.com/Layer/b69bb3295ba3434e846b1ed372039416
http://cambridgesemantics.com/GqeDatasource/guid_10492203b5aa4a54f217ababb3dc6dee
5 rows

The first four graphs are the data layers for the two graphmarts. The graph URIs match the data layer URIs in Graph Studio. The last graph in the results above is the Graph Lakehouse data source graph. This graph contains one triple that records a timestamp for the last time the data source was updated. If Graph Studio loses the connection to Graph Lakehouse, it checks this timestamp when it reconnects. The last updated time is used to determine whether the Graph Studio and Graph Lakehouse graph stores are in sync or if the graphmarts need to be reloaded to Graph Lakehouse.
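
To inspect the contents of the data source graph, a query can be scoped to that graph alone. The sketch below reuses the graph URI from the example results, which differs for each installation. The predicate that records the timestamp is not named in this topic, so the query uses variables rather than assuming one:

```sparql
# Graph URI copied from the example results; it differs per installation.
SELECT ?s ?p ?o
WHERE {
  GRAPH <http://cambridgesemantics.com/GqeDatasource/guid_10492203b5aa4a54f217ababb3dc6dee> {
    ?s ?p ?o
  }
}
```

Based on the description above, the result should be a single row whose object is the last updated timestamp.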