Anzo Unstructured Requirements
The Anzo Unstructured (AU) infrastructure is highly customizable and scalable. The number, size, and configuration of the servers in the environment depend on your unstructured data size, pipeline workload, and performance expectations. This topic provides guidance on determining the infrastructure to deploy as well as the requirements for each of the AU components. For an introduction to the AU architecture and pipeline process, see Anzo Unstructured Data Onboarding Process.
AU requires two programs that are installed separately from Anzo:
- An Anzo Unstructured cluster for processing the incoming data. See Anzo Unstructured Cluster Requirements and Recommendations.
- Elasticsearch for indexing and searching unstructured document contents. See Elasticsearch Requirements and Recommendations.
Anzo Unstructured Cluster Requirements and Recommendations
An Anzo Unstructured (AU) cluster consists of one Leader instance and one or more Worker instances. Cambridge Semantics provides an installation script for installing the AU software. In an AU cluster:
- The Leader instance is a lightweight program and is typically installed on the Anzo host server.
- The Worker instances require significant resources to process the unstructured documents and are typically installed on dedicated servers.
Consider the size of your unstructured data workload when deploying Worker host servers. Each Worker instance can have multiple server instances to process documents. The table below lists the requirements for Anzo Unstructured Worker servers:
Component | Requirement |
---|---|
Operating System | RHEL/CentOS 7.5+. Cambridge Semantics recommends that you tune the ulimits for your Linux distribution to increase the limits for certain resources. See Configure User Resource Limits for more information. |
CPU | 4+ cores |
RAM | 16+ GB |
Disk Space | 10+ GB |
File System | The Anzo file store (shared file system) must be accessible from each AU server in the cluster. For more information about the shared file system, see Deploying the Shared File System. |
Do not run any other software, including anti-virus software, on the Anzo Unstructured Worker servers. Additional programs running on the Worker nodes may severely impact the performance of Unstructured Pipelines.
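The CPU, RAM, and disk minimums in the table above can be turned into a quick preflight check before installing a Worker. The following is a minimal sketch, not part of the AU installer: the function names `check_worker_host` and `probe_local_host` are hypothetical, the probe assumes a Linux host, and it assumes that the 10+ GB disk requirement refers to free space on the checked path.

```python
import os
import shutil

# Minimums copied from the Worker requirements table.
MIN_CORES = 4
MIN_RAM_GB = 16
MIN_DISK_GB = 10


def check_worker_host(cores, ram_gb, disk_gb):
    """Return a list of shortfalls; an empty list means all minimums are met."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"CPU: {cores} cores found, {MIN_CORES}+ required")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb:.1f} GB found, {MIN_RAM_GB}+ GB required")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"Disk: {disk_gb:.1f} GB free, {MIN_DISK_GB}+ GB required")
    return problems


def probe_local_host(path="/"):
    """Read core count, physical RAM, and free disk space on a Linux host.

    Assumption: `path` is the file system where AU will be installed.
    Note: the kernel usually reports slightly less than the nominal RAM size,
    so a nominal 16 GB host can read just under 16 here.
    """
    cores = os.cpu_count() or 0
    ram_gb = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1024**3
    disk_gb = shutil.disk_usage(path).free / 1024**3
    return cores, ram_gb, disk_gb
```

Running `check_worker_host(*probe_local_host())` on a candidate Worker host returns one message per unmet minimum. The check does not cover the operating system version, ulimits, or access to the shared file system, which still need to be verified separately.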
For instructions on installing Anzo Unstructured, see Installing Anzo Unstructured.
Elasticsearch Requirements and Recommendations
Anzo Unstructured uses Elasticsearch to build an index after an unstructured pipeline runs and to run searches on unstructured data that is onboarded to Anzo. When choosing an Elasticsearch host server, consider the following information:
- Generating the index is a lightweight operation compared to document search operations. If you have a light unstructured data workload and do not perform text searches on large amounts of data, installing an Elasticsearch engine on the Anzo host server might be sufficient.
- If you onboard a large number of unstructured documents and plan to perform text searches across a large amount of data, Cambridge Semantics recommends that you install Elasticsearch on a dedicated server.
The table below lists the Elasticsearch server requirements:
Component | Requirement |
---|---|
Elasticsearch Version | 7.1.1 |
CPU | 8+ cores |
RAM | 64+ GB |
Disk Space | 100+ GB |
Ports | By default, the port range for Elasticsearch requests (http.port) is 9200-9300. If port 9200 is not available when Elasticsearch is started, Elasticsearch tries 9201 and so on until it finds an accessible port. The Anzo server and the AnzoGraph leader server need to be able to access Elasticsearch on the HTTP request port that Elasticsearch uses. |
File System | The Anzo file store (shared file system) must be accessible from each Elasticsearch server. For more information about the shared file system, see Deploying the Shared File System. |
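Because Elasticsearch falls back to the next port in the range when 9200 is taken, the HTTP port in use is not always 9200. The sketch below, run from the Anzo server or the AnzoGraph leader server, scans the default range for the first port that accepts a TCP connection. The function name `first_reachable_port` is hypothetical, and a successful connection only shows that something is listening on that port; it does not confirm that the listener is Elasticsearch, so treat the result as a starting point and confirm it against the Elasticsearch configuration or logs.

```python
import socket


def first_reachable_port(host, ports=range(9200, 9301), timeout=1.0):
    """Return the first port in `ports` that accepts a TCP connection, else None.

    The default range mirrors the default http.port range (9200-9300).
    """
    for port in ports:
        try:
            # create_connection raises OSError if the port is closed,
            # filtered by a firewall, or the attempt times out.
            with socket.create_connection((host, port), timeout=timeout):
                return port
        except OSError:
            continue
    return None
```

For example, `first_reachable_port("es-host.example.com")` (a placeholder host name) returns a port number if the host is reachable on any port in the range, or `None` if a firewall or a stopped service blocks every port.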
For instructions on installing Elasticsearch, see Installing and Configuring Elasticsearch.