AnzoGraph Server and Cluster Sizing Guidelines

This topic provides guidance on determining the server and cluster size that is ideal for hosting AnzoGraph, depending on the characteristics of your data.

Memory Size Guidelines

Since AnzoGraph is a high-performance, in-memory database, it is important to consider the amount of memory needed to store the data that you plan to load. Estimating the amount of memory your workload requires can help you decide what size server to use and whether to use multiple servers. The sections below describe the key points to consider about AnzoGraph memory usage.

Data at rest should use less than 50% of the total memory

The data loaded into memory should not consume more than 50% of the total available memory on the instance or across a cluster. Preserve at least 50% of the memory for server processes, query processing, and storing intermediate results.

AnzoGraph reserves 20% of the memory for the OS

To avoid unexpected shutdowns by the Linux operating system, the default AnzoGraph configuration leaves 20% of memory available for the OS; AnzoGraph will not use more than 80% of the total available memory. Account for this memory buffer in sizing calculations.
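
Taken together, these two guidelines determine how much of a server's RAM is actually available for loaded data and for query processing. The short Python sketch below applies the 80% usage cap and the 50% data-at-rest guideline to a given amount of total RAM; the 64 GB figure and the helper name are illustrative assumptions, not part of AnzoGraph.

    def memory_budget(total_ram_gb):
        """Apply the sizing guidelines to a server's total RAM (rough estimate only)."""
        usable_gb = total_ram_gb * 0.80        # AnzoGraph leaves ~20% of RAM for the OS
        data_at_rest_gb = total_ram_gb * 0.50  # keep loaded data under 50% of total RAM
        working_gb = usable_gb - data_at_rest_gb  # headroom for queries and intermediate results
        return usable_gb, data_at_rest_gb, working_gb

    usable, at_rest, working = memory_budget(64)
    print(f"Usable by AnzoGraph:     {usable:.0f} GB")   # 51 GB
    print(f"Budget for loaded data:  {at_rest:.0f} GB")  # 32 GB
    print(f"Headroom for query work: {working:.0f} GB")  # 19 GB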

Memory usage depends on data characteristics

Memory usage varies significantly depending on the makeup of the data, such as the data types and sizes of literal values, and on the complexity of the queries that you run. Triple storage ranges from about 12 bytes per triple to roughly 1 megabyte for a triple that stores pages of text from an unstructured document. For example (a rough estimation sketch based on these figures follows this list):

  • Triples with integer objects like the example below require about 12 bytes to store in memory.
    <http://csi.com/resource/person1> <http://csi.com/resource/age> 50
  • Triples made up of edges like the example below require about 18 bytes to store in memory.
    <http://csi.com/resource/person1> <http://csi.com/resource/friend> <http://csi.com/resource/person100>

    Triples with user-defined data types like the example below also require about 18 bytes to store in memory.

    <http://csi.com/resource/person1> <http://csi.com/resource/height> "5'8""^^height
  • Triples with long strings like the example below require about 700 bytes to store in memory.
    <http://dbpedia.org/resource/Keanu_Reeves> <http://dbpedia.org/ontology/abstract> "Keanu Charles Reeves
    (/keɪˈɑːnuː/ kay-AH-noo; born September 2, 1964) is a Canadian actor, producer, director and musician.
    Reeves is best known for his acting career, beginning in 1985 and spanning more than three decades.
    He gained fame for his starring role performances in several blockbuster films including comedies
    from the Bill and Ted franchise (1989–1991), action thrillers Point Break (1991) and Speed (1994),
    and the science fiction-action trilogy The Matrix (1999–2003). He has also appeared in dramatic
    films such as Dangerous Liaisons (1988), My Own Private Idaho (1991), and Little Buddha (1993),
    as well as the romantic horror Bram Stoker's Dracula (1992)."
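
As a rough way to turn these per-triple figures into a memory estimate, you can multiply the number of triples of each kind by its approximate in-memory size. The sketch below does that for a hypothetical data set; the counts and the per-type breakdown are illustrative assumptions, not measurements.

    # Approximate in-memory sizes taken from the examples above (bytes per triple).
    BYTES_PER_TRIPLE = {
        "integer_object": 12,
        "uri_or_udt_object": 18,
        "long_string_object": 700,
    }

    # Hypothetical data set; the counts are assumptions for illustration only.
    dataset = {
        "integer_object": 200_000_000,
        "uri_or_udt_object": 500_000_000,
        "long_string_object": 10_000_000,
    }

    data_at_rest_gb = sum(dataset[k] * BYTES_PER_TRIPLE[k] for k in dataset) / 1e9
    print(f"Estimated data at rest:      {data_at_rest_gb:.1f} GB")      # about 18.4 GB
    print(f"Suggested minimum total RAM: {2 * data_at_rest_gb:.0f} GB")  # keep data under 50% of RAM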

The estimates below show the number of triples that you can load and query with commonly configured amounts of available RAM. Each entry also lists the number of triples that could be stored if a data set comprised the example triples above.

Note: The examples below show the number of triples at rest and assume that the data does not consume more than 50% of the available RAM.

Available RAM: 16 GB
General estimate: Up to 100 million triples
Examples: Considering that the data at rest should use less than 8 GB RAM, a server with 16 GB total RAM could store:
  • About 11 million 700-byte triples like the Keanu Reeves example above.
  • About 400 million 18-byte URI triples like the example above.

Available RAM: 32 GB
General estimate: Up to 200 million triples
Examples: Considering that the data at rest should use less than 16 GB RAM, a server with 32 GB total RAM could store:
  • About 22 million 700-byte triples like the Keanu Reeves example above.
  • About 800 million 18-byte URI triples like the example above.

Available RAM: 64 GB
General estimate: Up to 400 million triples
Examples: Considering that the data at rest should use less than 32 GB RAM, a server with 64 GB total RAM could store:
  • About 45 million 700-byte triples like the Keanu Reeves example above.
  • About 1.7 billion 18-byte URI triples like the example above.

Available RAM: 120 GB
General estimate: Up to 800 million triples
Examples: Considering that the data at rest should use less than 60 GB RAM, a server with 120 GB total RAM could store:
  • About 85 million 700-byte triples like the Keanu Reeves example above.
  • About 3 billion 18-byte URI triples like the example above.

Available RAM: 240 GB
General estimate: Up to 1.5 billion triples
Examples: Considering that the data at rest should use less than 120 GB RAM, a server with 240 GB total RAM could store:
  • About 172 million 700-byte triples like the Keanu Reeves example above.
  • About 6.5 billion 18-byte URI triples like the example above.

Available RAM: 480 GB
General estimate: Up to 3 billion triples
Examples: Considering that the data at rest should use less than 240 GB RAM, a server with 480 GB total RAM could store:
  • About 340 million 700-byte triples like the Keanu Reeves example above.
  • About 13 billion 18-byte URI triples like the example above.
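
The example figures above follow from simple division: half of the total RAM, divided by the approximate per-triple size. The sketch below reproduces that arithmetic so you can estimate other server sizes; note that the figures above are rounded more coarsely than what this prints.

    def triples_at_rest(total_ram_gb, bytes_per_triple):
        """Rough number of triples that fit in the data-at-rest budget (50% of total RAM)."""
        return (total_ram_gb * 0.5 * 1e9) / bytes_per_triple

    for ram_gb in (16, 32, 64, 120, 240, 480):
        long_strings_m = triples_at_rest(ram_gb, 700) / 1e6  # millions of 700-byte triples
        uri_triples_b = triples_at_rest(ram_gb, 18) / 1e9    # billions of 18-byte triples
        print(f"{ram_gb:>3} GB: ~{long_strings_m:.0f}M long-string triples, "
              f"~{uri_triples_b:.1f}B URI triples")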

Memory usage can temporarily double during loads

During the load process, memory usage temporarily increases and can roughly double before the data is compressed and moved to its final storage blocks, particularly when the data includes many string values. For example, although you can expect to store and query about one billion triples on a server with 200 GB of available RAM, loading all one billion triples with a single load command might require more memory than the server has. For optimal performance and to minimize memory usage, break the data into several smaller load files and use multiple load statements to load the files in batches.
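
One straightforward way to follow this advice is to split a large load file into smaller files on disk and then load them with separate load statements. The sketch below is a minimal Python example; the file name and chunk size are assumptions, and it does not call AnzoGraph itself. It splits an N-Triples file on line boundaries, which is safe because N-Triples stores one triple per line.

    # Split a large N-Triples file into smaller files that can be loaded in separate batches.
    # "big.nt" and the 50-million-triple chunk size are illustrative assumptions.
    LINES_PER_CHUNK = 50_000_000

    with open("big.nt", "r", encoding="utf-8") as src:
        part, lines_in_part = 0, 0
        dst = open(f"part-{part:03d}.nt", "w", encoding="utf-8")
        for line in src:
            if lines_in_part >= LINES_PER_CHUNK:
                dst.close()                      # finish the current chunk file
                part, lines_in_part = part + 1, 0
                dst = open(f"part-{part:03d}.nt", "w", encoding="utf-8")
            dst.write(line)
            lines_in_part += 1
        dst.close()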

Memory Usage Example

The AnzoGraph installation includes two sample data sets. Combined, the data sets have about 110 million triples. The data originates from structured data sources and contains many URIs, short strings, and numeric values, which are highly compressible in memory. After the load, the data consumes about 3 GB of available memory. While the load is running, however, AnzoGraph uses up to 7 GB of memory. Accounting for the 20% of memory reserved for the operating system, loading the 110 million sample triples requires an instance with at least 9 GB of memory available at load time. Considering that a user might run very complex queries or advanced analytics against all of the data, an instance with 16 – 32 GB of available memory is the ideal size when working with the sample data or similar data sets.
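
For reference, the minimum instance size in this example follows from the peak memory needed during the load and the 20% of RAM reserved for the operating system; a quick check of the arithmetic:

    peak_load_gb = 7             # approximate peak memory while loading the sample data sets
    os_reserve_fraction = 0.20   # AnzoGraph leaves 20% of total RAM for the operating system

    minimum_ram_gb = peak_load_gb / (1 - os_reserve_fraction)
    print(f"Minimum RAM at load time: {minimum_ram_gb:.2f} GB")  # 8.75 GB, so at least 9 GB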

Cluster Size Guidelines

When your workload requires a cluster, do not create a cluster with fewer than 4 nodes. On a single node, data is redistributed entirely in memory without touching the network. Adding 1 or 2 nodes to create a 2- or 3-node cluster forces that redistribution to happen over the network, and the CPU gained from the extra 1 or 2 nodes does not outweigh the resulting network overhead. Using at least 4 nodes significantly reduces the relative network degradation and provides a near-linear performance benefit compared to a single node.
