Anzo Unstructured Requirements

The Anzo Unstructured (AU) infrastructure is highly customizable and scalable. The number, size, and configuration of the servers in the environment depend on your unstructured data size, pipeline workload, and performance expectations. This topic provides guidance on determining the infrastructure to deploy and lists the requirements for each of the AU components. For an introduction to the AU architecture and pipeline process, see Anzo Unstructured Data Onboarding Process.

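AU requires two programs that are installed separately from Anzo: the Anzo Unstructured cluster software and Elasticsearch. The sections below describe the requirements for each.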

Anzo Unstructured Cluster Requirements and Recommendations

An Anzo Unstructured (AU) cluster consists of one Leader instance and one or more Worker instances. Cambridge Semantics provides an installation script for installing the AU software. In an AU cluster:

  • The Leader instance is a lightweight program and is typically installed on the Anzo host server.
  • The Worker instances require significant resources to process the unstructured documents and are typically installed on dedicated servers.

Consider the size of your unstructured data workload when deploying Worker host servers. Each Worker instance can have multiple server instances to process documents. The table below lists the requirements for Anzo Unstructured Worker servers:

| Component | Requirement |
|---|---|
| Operating System | RHEL/CentOS 7.9. Cambridge Semantics recommends that you tune the ulimits for your Linux distribution to increase the limits for certain resources. See Configure User Resource Limits for more information. |
| CPU | 8+ CPU. The more CPU you provision, the more parallelism and higher throughput you can achieve. AU processes N documents in parallel, where N is the total number of Worker cores in the cluster (minus 1-2 CPU per node for management processes). Since the nature of unstructured documents varies greatly from case to case and the number of annotations per document can vary significantly, Cambridge Semantics recommends that you start with at least 16 CPU per Worker node. If you are deploying servers in a cloud environment, choose compute optimized machines that can be scaled to add CPU if needed. |
| RAM | 16+ GB. Unless you plan to process excessively large or complex documents, such as documents with many graphics, you do not need to provision a significant amount of RAM. Typical installations deploy about 2 GB RAM per CPU. |
| Disk Space | 10+ GB |
| File System | The Anzo file store (shared file system) must be accessible from each AU server in the cluster. For more information about the shared file system, see Deploying the Shared File System. |
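
As a rough illustration of the CPU and RAM guidance in the table above, the following Python sketch estimates how many documents a cluster could process in parallel and how much RAM a typical Worker node would have. The function names and the example cluster (three Worker nodes with 16 CPU each) are hypothetical. Only the 1-2 CPU per node reserved for management processes and the figure of about 2 GB RAM per CPU come from the guidance above, so treat the output as a planning estimate, not a measured result.

```python
def parallel_documents(worker_nodes: int, cpus_per_node: int,
                       reserved_per_node: int = 2) -> int:
    """Estimate N, the number of documents processed in parallel.

    N is the total Worker core count minus the cores reserved on each
    node for management processes (1-2 per node, per the guidance above).
    """
    if worker_nodes < 1 or cpus_per_node < 1:
        raise ValueError("worker_nodes and cpus_per_node must be positive")
    usable_per_node = max(cpus_per_node - reserved_per_node, 0)
    return worker_nodes * usable_per_node


def typical_ram_gb(cpus_per_node: int, gb_per_cpu: int = 2) -> int:
    """Typical RAM per Worker node at about 2 GB per CPU."""
    return cpus_per_node * gb_per_cpu


# Hypothetical cluster: 3 Worker nodes with 16 CPU each.
nodes, cpus = 3, 16
print(parallel_documents(nodes, cpus, reserved_per_node=1))  # 45 (upper estimate)
print(parallel_documents(nodes, cpus, reserved_per_node=2))  # 42 (lower estimate)
print(typical_ram_gb(cpus))                                  # 32 GB per node
```

Because document complexity and annotation counts vary, an estimate like this is only a starting point for sizing; actual throughput depends on the workload.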

Do not run any other software, including anti-virus software, on the Anzo Unstructured Worker servers. Additional programs running on the Worker nodes may severely impact the performance of Unstructured Pipelines.

For instructions on installing Anzo Unstructured, see Installing Anzo Unstructured.

Elasticsearch Requirements and Recommendations

Anzo Unstructured uses the Elasticsearch engine to build an index after an unstructured pipeline runs and to run searches on unstructured data that is onboarded to Anzo. When choosing an Elasticsearch host server, consider the following information:

  • Generating the index is a lightweight operation compared to document search operations. If you have a light unstructured data workload and do not perform text searches on large amounts of data, installing an Elasticsearch engine on the Anzo host server might be sufficient.
  • If you onboard a large number of unstructured documents and plan to perform text searches across a large amount of data, Cambridge Semantics recommends that you install Elasticsearch on a dedicated server.

The table below lists the Elasticsearch server requirements:

| Component | Requirement |
|---|---|
| Elasticsearch Version | Versions 7.10.2 – 7.17.3 are supported. |
| Java | Elasticsearch requires Java 11 or later. The software includes an embedded JDK. |
| CPU | 8+ cores |
| RAM | 64+ GB |
| Disk Space | 100+ GB |
| Ports | By default, the port range for Elasticsearch requests (http.port) is 9200-9300. If port 9200 is not available when Elasticsearch is started, Elasticsearch tries 9201 and so on until it finds an accessible port. The Anzo server and the AnzoGraph leader server need to be able to access Elasticsearch on the HTTP request port that Elasticsearch uses. |
| File System | The Anzo file store (shared file system) must be accessible from each Elasticsearch server. For more information about the shared file system, see Deploying the Shared File System. |
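
Two of the requirements above lend themselves to a quick pre-installation check: the supported version range and the HTTP request port. The Python sketch below is a minimal example and is not part of the Anzo or Elasticsearch tooling. The function names are hypothetical, version strings are assumed to be plain `major.minor.patch`, and a successful TCP connection shows only that something is listening on the port, not that it is Elasticsearch.

```python
import socket
from typing import Iterable, Optional, Tuple

# Supported range from the requirements table above.
MIN_SUPPORTED = (7, 10, 2)
MAX_SUPPORTED = (7, 17, 3)


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse a plain 'major.minor.patch' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def is_supported_version(version: str) -> bool:
    """Return True if the version lies within 7.10.2 - 7.17.3, inclusive."""
    return MIN_SUPPORTED <= parse_version(version) <= MAX_SUPPORTED


def first_open_port(host: str, ports: Iterable[int],
                    timeout: float = 1.0) -> Optional[int]:
    """Return the first port that accepts a TCP connection, or None.

    This mirrors the order in which Elasticsearch tries ports at startup
    (9200, then 9201, and so on), but it only tests TCP reachability.
    """
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return port
        except OSError:
            continue
    return None
```

Run from the Anzo server or the AnzoGraph leader server, a call such as `first_open_port("es-host.example.com", range(9200, 9301))` (the host name is a placeholder) would report the first reachable port in the default range, and `is_supported_version("7.17.3")` would confirm that a candidate release falls inside the supported range.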

For instructions on installing Elasticsearch, see Installing and Configuring Elasticsearch.
