Requirements Overview

This topic highlights the principal requirements to consider when planning and provisioning an environment that includes all of the Graph Studio platform components.

Cambridge Semantics recommends that you create separate development, staging, and production environments, and potentially a user acceptance testing environment. Separate environments are essential for promoting organized development, safeguarding data, and minimizing disruptions in data processing solutions.

The diagram below shows a high-level overview of the platform requirements. The table below the image describes the elements that are pictured and includes references to the detailed requirements for each component.

| Component | Details |
| --- | --- |
| Anzo Network | For security, deploy the platform component instances in the same network and set up firewall rules to allow connections only to trusted data sources and services. For details about the ports that need to be opened for inbound and outbound connections over the network, see Firewall Requirements in Graph Studio Server Requirements. |
| User Account | To support integration between components and appropriate ownership of installation directories and shared files, use the same service user account when installing and running all of the platform software. For security, the account should not have root privileges. For specifics about the account requirements, see Platform Service User Account Requirements. |
| Graph Lakehouse | Graph Lakehouse is a massively parallel processing (MPP) graph OLAP engine. To provide the highest performance possible, Graph Lakehouse stores all data and performs all analytic operations entirely in memory, making RAM the most important resource to consider when provisioning the host server or servers. Graph Lakehouse can be installed on a single server or on multiple servers in a cluster. In a cluster, you designate one server as the leader. The leader is the server that connects to Graph Studio and, if it is included in your platform, to Elasticsearch. When loading data from databases and HTTP endpoints, the Graph Lakehouse servers also need to be able to connect directly to those sources. The connection to Graph Studio is made on ports 5700 and 5600. Port 5700 is the gRPC port for all user-initiated SPARQL requests, and port 5600 is the system management port for system-level requests such as stopping or starting Graph Lakehouse from Graph Studio. For additional details on Graph Lakehouse requirements, see Graph Lakehouse Requirements. |
| Elasticsearch | Elasticsearch is required when using the Distributed Unstructured (DU) component. DU uses the Elasticsearch engine to build an index for each unstructured pipeline and to run text searches on the knowledge graph after it is created. Without DU, Elasticsearch is optional: it can also be used with structured sources to generate an index for data layers in graphmarts (see Creating an Elasticsearch Index from a Graphmart for more information). Elasticsearch connects to Graph Studio, the DU worker nodes, and the Graph Lakehouse leader node on ports 9200-9300. For more details on the Elasticsearch requirements, see Elasticsearch Requirements. |
| Distributed Unstructured | The Distributed Unstructured (DU) component is required to process and transform unstructured data. The DU cluster consists of one leader instance and one or more worker instances. The leader instance is a lightweight program that is typically installed on the Graph Studio host server. The worker instances require significantly more resources, CPU in particular, to process unstructured documents in parallel, so they are typically installed on dedicated servers. The worker instances communicate with the leader instance on port 2551 by default. For more details on the DU requirements, see Distributed Unstructured Requirements. |
| NFS | Though Graph Studio can connect to and read files from various types of long-term storage systems, it is critical to deploy a file system that consistently offers good read and write support and can be shared by all of the components in the platform. Cambridge Semantics strongly recommends that you deploy an NFS and mount it in the same location on all component host servers. For details on the requirements, see Platform Shared File Storage Requirements. |
| Anzo Server | The Graph Studio Server is the administrative layer that organizes and provides access control over all of the platform assets. It connects data sources and components and provides the Graph Studio application, Administration, and Hi-Res Analytics user interfaces as well as APIs and endpoints for accessing data from third-party applications. For more details about the requirements, see Graph Studio Server Requirements. |
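
When validating firewall rules, it can help to probe the ports named in the table above from the host that needs to reach them. The following Python script is a minimal sketch of such a check and is not part of the platform. The component keys, function names, and the choice to probe only the two ends of the 9200-9300 Elasticsearch range are assumptions made for illustration. A successful TCP connection shows only that the port is reachable, not that the service behind it is healthy.

```python
import socket
from typing import Dict, List, Tuple

# Ports taken from the table above. The dictionary keys are illustrative
# labels chosen for this sketch, not official component identifiers.
DEFAULT_PORTS: Dict[str, List[int]] = {
    "graph-lakehouse": [5700, 5600],  # gRPC (SPARQL requests) and system management
    "elasticsearch": [9200, 9300],    # assumption: probe only the ends of the 9200-9300 range
    "du-leader": [2551],              # default port DU workers use to reach the leader
}


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_components(
    hosts: Dict[str, str], timeout: float = 2.0
) -> List[Tuple[str, str, int, bool]]:
    """Probe every known port for each component label in `hosts`.

    `hosts` maps a component label (a key of DEFAULT_PORTS) to a host name
    or IP address. Labels without an entry in DEFAULT_PORTS are skipped.
    """
    results = []
    for component, host in hosts.items():
        for port in DEFAULT_PORTS.get(component, []):
            results.append((component, host, port, is_port_open(host, port, timeout)))
    return results
```

To use the sketch, call `check_components` with a mapping from each component label to the host it runs on, for example the Graph Lakehouse leader host and the Elasticsearch host, and review any entries whose last field is `False`. Run it from the server that initiates the connection, such as the Graph Studio host, because firewall rules are evaluated per source and destination.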