Sizing Guidelines for In-Memory Storage

This topic provides guidance on determining the server and cluster size that is ideal for hosting AnzoGraph, depending on the characteristics of your data.

Memory Sizing Guidelines

Since AnzoGraph is a high-performance, in-memory database, it is important to consider the amount of memory needed to store the data that you plan to load. Estimating the amount of memory your workload requires can help you decide what size server to use and whether to use multiple servers. The sections below describe the key points to consider about AnzoGraph memory usage.

Data at rest should remain below 50% of the total memory

The data loaded into memory should not consume more than 50% of the total available memory on the instance or across a cluster. Ideally, the data at rest should use only 25%-30% of the available memory because query processing and intermediate results can temporarily consume a very large amount of RAM.
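
For example, on a hypothetical server with 100 GB of total RAM, these guidelines translate to the following limits:

100 GB x 0.50 = 50 GB maximum data at rest
100 GB x 0.25 = 25 GB ideal data at rest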

AnzoGraph reserves 20% of the memory for the OS

To avoid unexpected shutdowns by the Linux operating system, the default AnzoGraph configuration leaves 20% of memory available for the OS; AnzoGraph will not use more than 80% of the total available memory. Account for this memory buffer in sizing calculations.
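
To illustrate with a hypothetical 64 GB server:

64 GB x 0.80 = 51.2 GB maximum memory available to AnzoGraph

The 50% data-at-rest guideline above applies to the total RAM, so the same server should hold no more than about 32 GB of data at rest.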

Memory usage can be high during loads

During the load streaming process, before duplicates are pruned and triples are moved to their final storage blocks, memory usage temporarily increases and potentially doubles, particularly if the data includes many string values.
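
For example, if a data set occupies about 10 GB at rest, it is prudent to assume that the load may briefly require on the order of 10 GB x 2 = 20 GB of free memory before the data settles into its final storage blocks.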

Memory usage depends on data characteristics

Memory usage varies significantly depending on the makeup of the data, such as the data types and sizes of literal values, and the complexity of the queries that you run. Triple storage ranges anywhere from 12 bytes per triple to 1 megabyte for a triple that stores pages of text from an unstructured document. For example:

  • Triples with integer objects like the following example require about 16 bytes to store in memory.
    <http://csi.com/resource/person1> <http://csi.com/resource/age> 50
  • Triples made up of URIs like the following example require about 18 bytes to store in memory.
    <http://csi.com/resource/person1> <http://csi.com/resource/friend> <http://csi.com/resource/person100>
  • Triples with user-defined data types (UDTs) like the following example also require about 18 bytes to store in memory.
    <http://csi.com/resource/person1> <http://csi.com/resource/height> "5'8\""^^height
  • Triples with dateTime values like the following example require about 20 bytes to store in memory.
    <http://www.wikidata.org/entity/Q65949130> 
    <http://www.wikidata.org/prop/direct/P585> 
    "1995-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
  • Triples with long strings like the following example require about 700 bytes to store in memory.
    <http://dbpedia.org/resource/Keanu_Reeves> <http://dbpedia.org/ontology/abstract> "Keanu Charles Reeves
    (/keɪˈɑːnuː/ kay-AH-noo; born September 2, 1964) is a Canadian actor, producer, director and musician.
    Reeves is best known for his acting career, beginning in 1985 and spanning more than three decades.
    He gained fame for his starring role performances in several blockbuster films including comedies
    from the Bill and Ted franchise (1989–1991), action thrillers Point Break (1991) and Speed (1994),
    and the science fiction-action trilogy The Matrix (1999–2003). He has also appeared in dramatic
    films such as Dangerous Liaisons (1988), My Own Private Idaho (1991), and Little Buddha (1993),
    as well as the romantic horror Bram Stoker's Dracula (1992)."

The table below provides estimates for the number of triples that you can load and query with commonly configured amounts of available RAM. For each configuration, the table also lists the number of triples that could be stored if a data set comprised the example triples above.

The estimates describe data at rest and assume that the data should not consume more than 50% of the available RAM.

16 GB RAM: Up to about 100 million triples
  Considering that the data at rest should use less than 8 GB of RAM, a server with 16 GB of total RAM could store:
  • About 12 million 700-byte triples like the Keanu Reeves example above.
  • About 475 million 18-byte URI triples like the example above.

32 GB RAM: Up to about 200 million triples
  Considering that the data at rest should use less than 16 GB of RAM, a server with 32 GB of total RAM could store:
  • About 24 million 700-byte triples like the Keanu Reeves example above.
  • About 850 million 20-byte triples like the dateTime example above.

64 GB RAM: Up to about 400 million triples
  Considering that the data at rest should use less than 32 GB of RAM, a server with 64 GB of total RAM could store:
  • About 48 million 700-byte triples like the Keanu Reeves example above.
  • About 1.7 billion 20-byte triples.

128 GB RAM: Up to about 800 million triples
  Considering that the data at rest should use less than 64 GB of RAM, a server with 128 GB of total RAM could store:
  • About 96 million 700-byte triples like the Keanu Reeves example above.
  • About 3.4 billion 20-byte triples.

256 GB RAM: Up to about 1.5 billion triples
  Considering that the data at rest should use less than 128 GB of RAM, a server with 256 GB of total RAM could store:
  • About 192 million 700-byte triples like the Keanu Reeves example above.
  • About 6.8 billion 20-byte triples.

480 GB RAM: Up to about 3 billion triples
  Considering that the data at rest should use less than 240 GB of RAM, a server with 480 GB of total RAM could store:
  • About 368 million 700-byte triples like the Keanu Reeves example above.
  • About 12 billion 20-byte triples.
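
Working backward from these guidelines, a rough rule of thumb (implied by the 50% and 25%-30% targets above) for sizing a server or cluster for a known data set is:

data_at_rest_size(bytes) x 2 = minimum_total_RAM(bytes)
data_at_rest_size(bytes) x 3.3 to x 4 = ideal_total_RAM(bytes)

You can estimate data_at_rest_size with the file analysis method described in the sections below.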

Analyzing Data Characteristics in Load Files

AnzoGraph enables you to perform pre-load analysis on file-based linked data sets without actually loading the data into memory. You can use this method to run statistical queries, such as counting the number of triples or returning a list of the unique subjects and predicates. Performing a "dry run" of a data load enables you to analyze data set characteristics to help with tasks such as memory sizing. Since the data remains on disk, you can use this method to capture statistics about a large data set without having to deploy an AnzoGraph cluster that has enough memory to store all of the data.

Important Considerations for Analyzing Load Files

  • Since AnzoGraph scans the files on disk, queries run much slower than they do when run against data in memory. Consider performance when deciding how many files to query at once and how complex to make the queries.
  • Though the pre-load feature does not use memory for storing data, queries that you run against files do consume memory. The server must have sufficient memory available to use for these intermediate query results.
  • Unlike loads into the database, pre-load analysis does not prune duplicate triples. Statistics returned for load file queries may differ somewhat from the statistics returned after the data is loaded.

Analysis Query Syntax

Use the following query syntax to analyze load files:

SELECT <expression>
FROM EXTERNAL <URI>
[ FROM EXTERNAL <URI> ]
WHERE { <triple_patterns> }
SELECT <expression>
  The SELECT clause specifies an expression that returns statistical results, such as a count of the total number of triples or the number of distinct predicates. Queries that return values for a specific property may return an error.

FROM EXTERNAL <URI>
  The URI in the FROM clause specifies the location of the load file or directory of files. For example, this URI specifies a single file:

    <file:/data/load/values.ttl>

  And this URI specifies a directory of files:

    <dir:/data/store/LoadDBNorthwind/rdf.ttl.gz>

For example, the following query analyzes the files in the rdf.ttl.gz directory for a file-based linked data set (FLDS). The query counts the total number of triples in the files:

SELECT (count (*) as ?triples)
FROM EXTERNAL <dir:/nfs/data/store/LoadGHIB_f5886/rdf.ttl.gz>
WHERE { ?s ?p ?o . }
triples
-----------
143704445
1 rows
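
The same syntax supports other statistical expressions, and the FROM EXTERNAL clause can be repeated to analyze multiple sources in a single query. For example, the following sketch (the file paths are hypothetical) counts the distinct predicates across two load files:

SELECT (COUNT(DISTINCT ?p) AS ?predicates)
FROM EXTERNAL <file:/data/load/values1.ttl>
FROM EXTERNAL <file:/data/load/values2.ttl>
WHERE { ?s ?p ?o . }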

Assessing Memory Requirements Based on File Analysis

Although the memory required to load and query a specific data set varies based on the size and type of data that it contains, as well as the types of queries that you run, you can obtain a reasonable estimate of the amount of memory needed to store a data set by using the equation below:

total_triples x avg_triple_size + total_chars = size_estimate(bytes)

Follow the steps below to calculate the values to use in the equation:

  1. Count the total number of triples in the files
  2. Determine the average triple size
  3. Count the number of characters for all strings
  4. Calculate the size estimate

Count the total number of triples in the files

As shown in the example above, the following query counts the total number of triples in FLDS load files:

SELECT (count (*) as ?triples)
FROM EXTERNAL <dir:/nfs/data/store/LoadGHIB_f5886/rdf.ttl.gz>
WHERE { ?s ?p ?o . }
triples
-----------
143704445
1 rows

Determine the average triple size

The Memory usage depends on data characteristics section above shows some example triples and their estimated sizes. If you are familiar with the data in the files, you may be able to determine the average size based on those examples. For instance, a data set made up mostly of 18-byte URI triples with some 20-byte dateTime values would average roughly 19 bytes per triple. Otherwise, Cambridge Semantics recommends using 30 bytes as the average triple size.

Count the number of characters for all strings

For ASCII characters, AnzoGraph uses about 1 byte of memory to store each character; characters outside the ASCII range may require additional bytes. Counting the number of characters in the load files therefore provides a good estimate of the number of bytes required to store the strings in your data. The following query returns that count:

SELECT (SUM(IF(DATATYPE(?o)=<http://www.w3.org/2001/XMLSchema#string>,
       (STRLEN(?o)),0)) as ?char_count)
FROM EXTERNAL <URI> 
WHERE {?s ?p ?o}

For example, the following query returns the number of characters in the strings for the FLDS referenced above:

SELECT (SUM(IF(DATATYPE(?o)=<http://www.w3.org/2001/XMLSchema#string>,
       (STRLEN(?o)),0)) as ?char_count)
FROM EXTERNAL <dir:/nfs/data/store/LoadGHIB_f5886/rdf.ttl.gz> 
WHERE {?s ?p ?o}
char_count
------------
684348190
1 rows

Calculate the size estimate

Once you have counted the triples, determined the average triple size, and counted the characters, use the formula below to estimate the amount of memory needed to store the data at rest:

total_triples x avg_triple_size + total_chars = size_estimate(bytes)

For example:

143,704,445 x 30 + 684,348,190 = 4,995,481,540 bytes

This example FLDS requires roughly 5 GB of memory to store the data.
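
Applying the sizing guidelines from earlier in this topic, a hypothetical minimum server size for this data set would therefore be about 5 GB x 2 = 10 GB of total RAM, with roughly 16-20 GB preferred so that the data at rest stays in the ideal 25%-30% range.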

Cluster Sizing Guidelines

When your workload size requires using a cluster, do not create clusters with fewer than 4 nodes. When using a single node, data gets redistributed in memory without using the network. If you add 1 or 2 more nodes to create a 2- or 3-node cluster, data then gets distributed over the network. The CPU gain from the additional 1 or 2 nodes does not outweigh the performance degradation from the network. Using at least 4 nodes significantly reduces the network degradation and provides a near-linear performance benefit when compared to a single node.
