Distributed Unstructured Requirements

The Distributed Unstructured (DU) infrastructure is highly customizable and scalable. The number, size, and configuration of the servers in the environment depend on your unstructured data size, pipeline workload, and performance expectations. A DU cluster consists of one leader instance and one or more worker instances.

  • The leader instance is a lightweight program. It is typically installed on the Anzo host server but can be installed on a dedicated server and then connected to Anzo.
  • The worker instances require significant resources to process the unstructured documents and are typically installed on dedicated servers.
  • In addition to the DU cluster, Elasticsearch is required for indexing and searching documents.

Do not run any other software, including anti-virus software, on the DU worker servers. Additional programs running on the worker nodes may severely impact the performance of unstructured pipelines.

Consider the size of your unstructured data workload when deploying the worker host servers. Each worker instance can have multiple server instances to process documents. The table below lists the requirements for DU worker servers.

| Component | Recommendation | Description |
|---|---|---|
| Operating System | RHEL/CentOS 7.9 | DU is supported on RHEL/CentOS 7.9, RHEL/Rocky 8, and RHEL/Rocky 9 operating systems. |
| Elasticsearch | Installed separately | Elasticsearch is required for indexing and searching unstructured document contents. For information about requirements, see Elasticsearch Requirements. |
| CPU | 8+ CPU | The more CPU you provision, the more parallelism and higher throughput you can achieve. DU processes N documents in parallel, where N is the total number of worker cores in the cluster (minus 1-2 CPU per node for management processes). Since the nature of unstructured documents varies greatly from case to case and the number of annotations per document can vary significantly, Cambridge Semantics recommends that you start with at least 16 CPU per worker node. If you are deploying servers in a cloud environment, choose compute optimized machines that can be scaled to add CPU if needed. |
| RAM | 16+ GB | Unless you plan to process excessively large or complex documents, such as documents with many graphics, you do not need to provision a significant amount of RAM. Typical installations deploy about 2 GB RAM per CPU. |
| Disk Space | 10+ GB | When documents and the generated RDF files are stored on the shared file system, the DU installation path does not require a significant amount of disk space. |
| Java 8 | Java 8 | Java 8 is included in the installer. Java does not need to be pre-installed on the DU host servers. |
| Ports | 2551 | By default, the worker nodes use port 2551 for communication to the leader node. You can choose an alternate port during the installation. |
| Shared File System | Mounted NFS | The shared file system must be accessible from each DU host server. See Platform Shared File Storage Requirements. |
| Service User Account | Enterprise-level account | It is important to install and run DU (and all other platform components) as the same service user. See Platform Service User Account Requirements. |
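
The CPU and RAM guidance in the table can be turned into a rough sizing estimate. The following Python sketch is illustrative only and is not part of the DU product: the function names and the example cluster (3 worker nodes with 16 CPU each) are assumptions made for this example. It applies the rule that N equals the total worker cores minus 1-2 CPU per node, and the typical ratio of about 2 GB RAM per CPU.

```python
def estimate_parallelism(worker_nodes, cpus_per_node, reserved_per_node=2):
    """Estimate N, the number of documents processed in parallel.

    reserved_per_node is the CPU set aside on each node for management
    processes (the table gives a range of 1-2 per node).
    """
    if worker_nodes < 1:
        raise ValueError("a DU cluster needs at least one worker node")
    usable_per_node = max(cpus_per_node - reserved_per_node, 0)
    return worker_nodes * usable_per_node


def estimate_ram_gb(cpus_per_node, gb_per_cpu=2):
    """Estimate RAM per worker node from the typical 2 GB per CPU ratio."""
    return cpus_per_node * gb_per_cpu


# Hypothetical cluster: 3 worker nodes with 16 CPU each.
low = estimate_parallelism(3, 16, reserved_per_node=2)
high = estimate_parallelism(3, 16, reserved_per_node=1)
print(f"Estimated parallel documents: {low} to {high}")   # 42 to 45
print(f"Estimated RAM per node: {estimate_ram_gb(16)} GB")  # 32 GB
```

Treat the result as a starting point. Actual throughput depends on document size and the number of annotations per document, as noted in the CPU row.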

For instructions on installing Anzo DU, see Installing Distributed Unstructured.
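
Two of the requirements above, the leader port and the shared file system, can be spot-checked from a worker host with a short script. The Python sketch below uses only the standard library and is not a DU utility; the function names are hypothetical. The port check succeeds only when a leader is already listening, so it is most useful after the leader is installed and running.

```python
import os
import socket


def port_reachable(host, port=2551, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    2551 is the default port that worker nodes use to reach the leader;
    pass a different value if an alternate port was chosen at install time.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def shared_fs_usable(path):
    """Return True if path is a directory the current user can read and write."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK)
```

For example, calling `port_reachable("du-leader.example.com")` and `shared_fs_usable("/mnt/shared")` on each worker, while logged in as the service user, gives a quick pass or fail for each requirement. Both the host name and the mount path here are placeholders; substitute the values for your environment.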