Anzo Unstructured Requirements and Recommendations

The Anzo Unstructured (AU) infrastructure is highly customizable and scalable. The number, size, and configuration of the servers in the environment depend on your unstructured data size, pipeline workload, and performance expectations. This topic provides guidance on determining the infrastructure to deploy and lists the requirements for each of the AU components. For an introduction to the AU architecture and pipeline process, see Anzo Unstructured Architecture and Process Overview.

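AU requires two programs that are installed separately from Anzo:

  • The Anzo Unstructured cluster software. See Anzo Unstructured Cluster Requirements and Recommendations.
  • Elasticsearch. See Elasticsearch Requirements and Recommendations.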

Anzo Unstructured Cluster Requirements and Recommendations

An Anzo Unstructured (AU) cluster consists of one leader instance and one or more worker instances. Cambridge Semantics provides an installation script for installing the AU software. In an AU cluster:

  • The leader instance is a lightweight program and is typically installed on the Anzo host server.
  • The worker instances require significant resources to process the unstructured documents and are typically installed on dedicated servers.

Consider the size of your unstructured data workload when deploying worker host servers. Each worker instance can have multiple server instances to process documents. The table below lists the requirements for Anzo Unstructured worker servers:

Component | Requirement
Operating System | RHEL/CentOS 7+
CPU | 4+ cores
RAM | 16+ GB
Disk Space | 10+ GB
File System | The Anzo file store (shared file system) must be accessible from each AU server in the cluster. For more information about file stores, see Connecting to a File Store.
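
The minimums in the table can be checked on a candidate worker host before installation. The following is a minimal sketch, not part of the AU installer: the names `Requirements`, `AU_WORKER`, `check_host`, and `detect_local` are hypothetical, and it assumes that the disk requirement refers to free space on the volume you pass in.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Minimum host resources (hypothetical helper type)."""
    min_cores: int
    min_ram_gb: int
    min_disk_gb: int


# Minimums taken from the AU worker table above.
AU_WORKER = Requirements(min_cores=4, min_ram_gb=16, min_disk_gb=10)


def check_host(cores, ram_gb, disk_gb, req):
    """Return a list of shortfalls; an empty list means the host passes."""
    problems = []
    if cores < req.min_cores:
        problems.append(f"CPU: {cores} cores, need {req.min_cores}+")
    if ram_gb < req.min_ram_gb:
        problems.append(f"RAM: {ram_gb:.1f} GB, need {req.min_ram_gb}+ GB")
    if disk_gb < req.min_disk_gb:
        problems.append(f"Disk: {disk_gb:.1f} GB free, need {req.min_disk_gb}+ GB")
    return problems


def detect_local(path="/"):
    """Measure this host (Linux only). Assumption: disk means free space at `path`."""
    cores = os.cpu_count() or 0
    # The kernel reserves some memory, so a nominal 16 GB host can report
    # slightly less than 16 here; treat a near miss with judgment.
    ram_gb = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1024**3
    disk_gb = shutil.disk_usage(path).free / 1024**3
    return cores, ram_gb, disk_gb


if __name__ == "__main__":
    for line in check_host(*detect_local(), AU_WORKER) or ["All minimums met"]:
        print(line)
```

The same `check_host` function can be reused for other components by passing a different `Requirements` value.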

For instructions on installing Anzo Unstructured, see Deploying an Anzo Unstructured Cluster.
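
Because every AU server must be able to reach the Anzo file store, it can help to verify the mount on each host before deploying. The sketch below is illustrative only: `check_file_store` is a hypothetical helper, the mount path is whatever your environment uses, and the write probe assumes that the server needs write access to the store, which this topic does not state.

```python
import os
import tempfile


def check_file_store(path):
    """Return 'ok', 'missing', or 'not writable' for a shared file store path."""
    if not os.path.isdir(path):
        return "missing"
    try:
        # Create and write a temporary probe file; it is removed on close.
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"probe")
            probe.flush()
    except OSError:
        return "not writable"
    return "ok"
```

Run the check as the same operating system user that runs the AU service, since mount permissions often differ between users.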

Elasticsearch Requirements and Recommendations

Anzo Unstructured uses the Elasticsearch engine to build an index after an unstructured pipeline runs and to run searches on unstructured data that is onboarded to Anzo. When choosing an Elasticsearch host server, consider the following information:

  • Generating the index is a lightweight operation compared to document search operations. If you have a light unstructured data workload and do not perform text searches on large amounts of data, installing an Elasticsearch engine on the Anzo host server might be sufficient.
  • If you onboard a large number of unstructured documents and plan to perform text searches across a large amount of data, Cambridge Semantics recommends that you install Elasticsearch on a dedicated server.

The table below lists the Elasticsearch server requirements:

Component | Requirement
Elasticsearch Version | 7.1.1
CPU | 8+ cores
RAM | 64+ GB
Disk Space | 100+ GB
Ports | By default, the port range for Elasticsearch requests (http.port) is 9200-9300. If port 9200 is not available when Elasticsearch is started, Elasticsearch tries 9201 and so on until it finds an accessible port. The Anzo server and the AnzoGraph leader server need to be able to access Elasticsearch on the HTTP request port that Elasticsearch uses.
File System | The Anzo file store (shared file system) must be accessible from each Elasticsearch server. For more information about file stores, see Connecting to a File Store.
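
Because Elasticsearch may bind to a port other than 9200, it can be useful to confirm from the Anzo server which HTTP port is actually answering. The following is a rough sketch under stated assumptions: `find_elasticsearch_port` and `_http_get_json` are hypothetical names, the host name is yours to supply, and the probe assumes plain HTTP without authentication. It relies on the Elasticsearch root endpoint returning JSON that includes `version.number`.

```python
import http.client
import json
import urllib.request


def _http_get_json(host, port, timeout):
    """GET http://host:port/ and parse the body as JSON."""
    with urllib.request.urlopen(f"http://{host}:{port}/", timeout=timeout) as resp:
        return json.load(resp)


def find_elasticsearch_port(host, ports=range(9200, 9301), fetch=_http_get_json,
                            timeout=2.0):
    """Return (port, version) for the first port that answers like Elasticsearch.

    Returns None if no port in the range responds with a version number.
    """
    for port in ports:
        try:
            info = fetch(host, port, timeout)
        except (OSError, ValueError, http.client.HTTPException):
            # Closed port, timeout, non-HTTP listener, or non-JSON body.
            continue
        version = info.get("version", {}).get("number") if isinstance(info, dict) else None
        if version:
            return port, version
    return None


if __name__ == "__main__":
    # "es-host.example.com" is a placeholder, not a real server.
    print(find_elasticsearch_port("es-host.example.com"))
```

Run the probe from both the Anzo server and the AnzoGraph leader server, since a firewall can allow one and block the other. Scanning the full range with a 2-second timeout can take several minutes when ports silently drop traffic, so narrow `ports` if you know the likely values.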

For instructions on installing Elasticsearch, see Installing and Configuring Elasticsearch.
