Unstructured Onboarding Process Overview

Graph Studio onboards unstructured data through pipelines that run in a distributed environment, where a cluster of worker nodes processes the incoming documents and generates output artifacts. This topic provides an overview of the Graph Studio Distributed Unstructured (DU) pipeline process and infrastructure.

The diagram below provides a high-level overview of the Graph Studio platform architecture with the integration of DU and Elasticsearch. The description that follows the diagram explains the unstructured data onboarding process and the resulting artifacts.

When an unstructured pipeline runs, a crawler service streams data to a pipeline service. The pipeline service reads the stream of files and constructs the appropriate request payloads, one request per document to process. Graph Studio sends the requests to the DU leader instance, which queues them and distributes them to the worker instances for parallel processing. For each document it processes, a worker creates a temporary output artifact on the shared file system. The artifact includes the following items:

  • An RDF file that describes the text annotations and general metadata about the processed document.
  • A binary store artifact for Graph Studio.
  • A JSON artifact that contains a reference to the extracted text of the document. Elasticsearch uses this artifact to generate the document index.
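The leader/worker flow described above can be sketched in Python. This is an illustrative sketch only, not the DU API: the in-process queue stands in for the leader, threads stand in for worker instances, and the artifact names, RDF stub content, and JSON layout are assumptions.

```python
import json
import queue
import tempfile
import threading
from pathlib import Path

def worker(request_queue: queue.Queue, shared_dir: Path) -> None:
    """Process queued document requests and write the three per-document
    artifacts (RDF, binary store, JSON text reference) to the shared dir."""
    while True:
        doc = request_queue.get()
        if doc is None:  # sentinel: no more documents for this worker
            request_queue.task_done()
            break
        stem = shared_dir / doc["id"]
        # 1. RDF file: text annotations and general document metadata (stub).
        stem.with_suffix(".ttl").write_text(
            f'<urn:doc:{doc["id"]}> <urn:prop:title> "{doc["title"]}" .\n'
        )
        # 2. Binary store artifact for Graph Studio (stub bytes).
        stem.with_suffix(".bin").write_bytes(doc["title"].encode())
        # 3. JSON artifact referencing the extracted text, used for indexing.
        stem.with_suffix(".json").write_text(
            json.dumps({"doc": doc["id"], "text_ref": f'{doc["id"]}.txt'})
        )
        request_queue.task_done()

def run_pipeline(documents, shared_dir: Path, n_workers: int = 2) -> None:
    """Leader role: queue one request per document, let workers drain it."""
    requests: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(requests, shared_dir))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for doc in documents:   # one request per document to process
        requests.put(doc)
    for _ in threads:       # one stop sentinel per worker
        requests.put(None)
    for t in threads:
        t.join()

shared = Path(tempfile.mkdtemp())
run_pipeline(
    [{"id": "doc1", "title": "Alpha"}, {"id": "doc2", "title": "Beta"}],
    shared,
)
print(sorted(p.name for p in shared.iterdir()))
```

Each processed document leaves three sibling files on the shared file system, which is what makes the later post-processing steps a simple directory scan.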

When the DU workers have processed all of the documents, Graph Studio completes the following post-processing steps:

  • Consolidate the RDF artifacts from the workers and create a file-based linked data set (FLDS) for loading to Graph Lakehouse.
  • Read the JSON artifacts and instruct the Elasticsearch server to build an index with the text extracted from the documents. A snapshot of the index is saved on the file system with the FLDS. Any time a graphmart that includes that FLDS is loaded to a Graph Lakehouse instance, Graph Studio loads the corresponding snapshot into the Elasticsearch server that is associated with the Graph Lakehouse connection.
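The two post-processing steps above can be sketched as a directory scan over the worker output. This is a hedged sketch under assumed file layouts: a plain dict stands in for the Elasticsearch index, and the FLDS is represented as a directory with one merged RDF file, since the actual FLDS and snapshot formats are not shown here.

```python
import json
import tempfile
from pathlib import Path

def consolidate(shared_dir: Path, flds_dir: Path) -> dict:
    """Consolidate worker artifacts: merge RDF files into an FLDS and
    build a text index from the JSON artifacts, snapshotting it alongside."""
    flds_dir.mkdir(parents=True, exist_ok=True)
    # Step 1: merge the per-document RDF artifacts into one FLDS data file.
    merged = "".join(p.read_text() for p in sorted(shared_dir.glob("*.ttl")))
    (flds_dir / "data.ttl").write_text(merged)
    # Step 2: read the JSON artifacts, build the document index, and save
    # a snapshot of the index on the file system with the FLDS.
    index = {}
    for p in sorted(shared_dir.glob("*.json")):
        record = json.loads(p.read_text())
        index[record["doc"]] = record["text_ref"]
    (flds_dir / "index_snapshot.json").write_text(json.dumps(index))
    return index

# Set up sample worker output, then run the consolidation.
shared = Path(tempfile.mkdtemp())
(shared / "doc1.ttl").write_text('<urn:doc:doc1> <urn:prop:title> "Alpha" .\n')
(shared / "doc1.json").write_text(
    json.dumps({"doc": "doc1", "text_ref": "doc1.txt"})
)
flds = Path(tempfile.mkdtemp()) / "flds"
idx = consolidate(shared, flds)
print(sorted(p.name for p in flds.iterdir()), idx)
```

Saving the snapshot next to the FLDS is what lets a later graphmart load restore the index into whichever Elasticsearch server is associated with the Graph Lakehouse connection.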

When the post-processing is finished, the pipeline service finalizes the FLDS metadata and stores it in its catalog. The new unstructured data set becomes available in the Datasets catalog, and it can be added to a graphmart and loaded to Graph Lakehouse for use in Hi-Res Analytics dashboards.
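The final cataloging step can be illustrated with a minimal sketch. The entry fields and the in-memory dict below are assumptions for illustration, not the actual catalog schema.

```python
from datetime import datetime, timezone

# Hypothetical in-memory catalog standing in for the pipeline service's
# dataset catalog; the real catalog is a persistent service.
catalog: dict = {}

def register_flds(name: str, flds_path: str, snapshot_path: str) -> dict:
    """Record finalized FLDS metadata so the data set appears in the
    Datasets catalog and can be added to a graphmart."""
    entry = {
        "name": name,
        "flds_path": flds_path,
        "elasticsearch_snapshot": snapshot_path,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "status": "available",  # selectable for graphmarts once registered
    }
    catalog[name] = entry
    return entry

entry = register_flds(
    "my_unstructured_dataset",
    "/shared/flds/ds1",
    "/shared/flds/ds1/index_snapshot.json",
)
print(entry["status"])
```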