Sizing Guidelines for In-Memory Storage

This topic provides guidance on determining the server and cluster size that is ideal for hosting Graph Lakehouse, depending on the characteristics of your data and Graph Lakehouse use case.

Memory and Cluster Size Guidelines
Analyzing Data Characteristics in Load Files
Estimating Memory Requirements Based on Data Characteristics

Memory and Cluster Size Guidelines

Since Graph Lakehouse is a high-performance, in-memory database, it is important to consider the amount of memory needed to store the data that you plan to load. Estimating the amount of memory your workload requires can help you decide what size server to use and whether to use multiple servers. The sections below describe the key points to consider about memory usage and Graph Lakehouse.

Data at rest should use less than 50% of the total memory
Graph Lakehouse reserves 20% of the memory for the OS
Memory usage depends on data characteristics
Memory usage can be high during loads
Memory Usage Examples

Data at rest should use less than 50% of the total memory

The data loaded into memory should not consume more than 50% of the total available memory on the instance or across a cluster. Preserve at least 50% of the memory for server processes, query processing, and storing intermediate results.

Altair recommends that you allocate 3 to 4 times as much RAM as the planned data size, especially if the planned workload includes running complex analytic queries. There is no hard-wired limit on the number of queries you can run concurrently, however, you can set a limit, configured by the user_queues setting, that determines how many queries may be started before additional queries are placed in a queue.

Graph Lakehouse reserves 20% of the memory for the OS

To avoid unexpected shutdowns by the Linux operating system, the default Graph Lakehouse configuration leaves 20% of memory available for the OS; Graph Lakehouse will not use more than 80% of the total available memory. Account for this memory buffer in sizing calculations.

Memory usage depends on data characteristics

Memory usage varies significantly depending on the makeup of the data, such as the data types and sizes of literal values, and the complexity of the queries that you run. Data is loaded into Graph Lakehouse as triples, and the storage required for each triple ranges anywhere from 12 bytes per triple to 1 megabyte, for a triple that stores pages of text from an unstructured document.

Triples with integer objects like the following example require about 16 bytes to store in memory.
```
<http://anzograph.com/resource/person1> <http://anzograph.com/resource/age> 50
```

Triples made up of URIs like the following example require about 18 bytes to store in memory.

<http://anzograph.com/resource/person1> <http://anzograph.com/resource/friend> <http://anzograph.com/resource/person100>

Triples with user-defined data types (UDTs) like the following example also require about 18 bytes to store in memory.

<http://anzograph.com/resource/person1> <http://anzograph.com/resource/height> "5'8""^^height

Triples with dateTime values like the following example require about 20 bytes to store in memory.

<http://www.wikidata.org/entity/Q65949130> 
<http://www.wikidata.org/prop/direct/P585> 
"1995-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

Triples with long strings like the following example require about 700 bytes to store in memory.

<http://dbpedia.org/resource/Keanu_Reeves> <http://dbpedia.org/ontology/abstract> "Keanu Charles Reeves
(/keɪˈɑːnuː/ kay-AH-noo; born September 2, 1964) is a Canadian actor, producer, director and musician.
Reeves is best known for his acting career, beginning in 1985 and spanning more than three decades.
He gained fame for his starring role performances in several blockbuster films including comedies
from the Bill and Ted franchise (1989–1991), action thrillers Point Break (1991) and Speed (1994),
and the science fiction-action trilogy The Matrix (1999–2003). He has also appeared in dramatic
films such as Dangerous Liaisons (1988), My Own Private Idaho (1991), and Little Buddha (1993),
as well as the romantic horror Bram Stoker's Dracula (1992)."

The following table provides estimates for the number of triples that you can load and query with specific amounts of available RAM. The table also lists the number of triples that could be stored in given amounts of memory, using the triples described in the previous examples.

The estimates listed in the table represent the number of triples at rest and take into consideration that the data should not consume more than 50% of all available RAM.

Available RAM	General Estimate	Examples
16 GB	Up to about 100 million triples	Considering that the data at rest should use less than 8 GB RAM, a server with 16 GB total RAM could store: About 12 million 700-byte triples like the Keanu Reeves example above. About 475 million 18-byte URI triples like the example above.
32 GB	Up to about 200 million triples	Considering that the data at rest should use less than 16 GB RAM, a server with 32 GB total RAM could store: About 24 million 700-byte triples like the Keanu Reeves example above. About 850 million 20-byte triples like the dateTime example above.
64 GB	Up to about 400 million triples	Considering that the data at rest should use less than 32 GB RAM, a server with 64 GB total RAM could store: About 48 million 700-byte triples like the Keanu Reeves example above. About 1.7 billion 20-byte URI triples.
128 GB	Up to about 800 million triples	Considering that the data at rest should use less than 64 GB RAM, a server with 128 GB total RAM could store: About 96 million 700-byte triples like the Keanu Reeves example above. About 3.4 billion 20-byte URI triples.
256 GB	Up to about 1.5 billion triples	Considering that the data at rest should use less than 128 GB RAM, a server with 256 GB total RAM could store: About 192 million 700-byte triples like the Keanu Reeves example above. About 6.8 billion 20-byte URI triples.
512 GB	Up to about 3 billion triples	Considering that the data at rest should use less than 256 GB RAM, a server with 512 GB total RAM could store: About 390 million 700-byte triples like the Keanu Reeves example above. About 13 billion 20-byte URI triples.

Memory usage can be high during loads

During the load process, before the data can be moved to its final storage block, memory usage temporarily increases, particularly if the data includes many string values.

Memory Usage Examples

The following table provides memory requirement estimates for public or commercial data sets that users may already use or be familiar with:

Data Set	Memory Requirements	Description
Graph-500 (.csv)	2 GB	Graph search data set created by graph500.org to facilitate benchmark testing of large data and CPU intensive computing.
WikiData (.nt)	340 GB at rest; 900+ GB to load and run queries	Large downloadable data set reflecting contents of various Wikimedia projects.

Analyzing Data Characteristics in Load Files

Graph Lakehouse allows you to perform pre-load analysis on load files without actually loading the data into memory. You can use this method to run statistical queries, such as counting the number of triples, getting to know the data, or returning a list of the nodes or subjects and predicates. Performing a "dry run" of a data load, beforehand, enables you to analyze data set characteristics to help with tasks such as memory sizing and overall capacity planning. You can use this method to capture statistics about a large data set on a smaller system than what would actually be required to load the data in memory.

Important Considerations for Analyzing Load Files

Since Graph Lakehouse scans the files on-disk, queries run much slower than they do when run against data in memory. Consider performance when deciding how many files to query at once and how complex to make queries.
Though the pre-load feature does not use memory for storing data, queries that you run against files do consume some memory. The server must have sufficient memory available to use for these intermediate query results.
Unlike loads into the database, pre-load analysis does not prune duplicate triples. Statistics returned for load file queries may differ somewhat from the statistics returned after the data is loaded.

Analysis Query Syntax

The syntax that you use to query load files depends on the file type. For example, for files in triple or quad format, like Turtle (.ttl), N-Triple (.n3 and .nt), N-Quad (.nq and .quads), and TriG (.trig) files, you can use the following syntax:

SELECT <expression>
FROM EXTERNAL <dir:/path/dir_or_file_name>
[ FROM EXTERNAL <dir:/path/dir_or_file_name> ]
WHERE { <triple_patterns> }

Option Description

SELECT <expression> In the SELECT clause, specifies an expression that returns statistical results such as a count of the total number of triples or the number of distinct predicates. Queries that return values for a specific property may return an error.

Option	Description
`SELECT <expression>`	In the SELECT clause, specifies an expression that returns statistical results such as a count of the total number of triples or the number of distinct predicates. Queries that return values for a specific property may return an error.
`FROM EXTERNAL <dir:/path/dir_or_file_name>`	The URI in the FROM clause specifies the location of the load file or directory of files. For example, this URI specifies a single file on the local file system: <file:/home/user/data/tickit.ttl> This example specifies a directory of files: <dir:/data/load-files/tickit.ttl.gz>

FROM EXTERNAL 
<dir:/path/dir_or_file_name>

The URI in the FROM clause specifies the location of the load file or directory of files. For example, this URI specifies a single file on the local file system:

<file:/home/user/data/tickit.ttl>

This example specifies a directory of files:

<dir:/data/load-files/tickit.ttl.gz>

For example, the following query analyzes the tickit.ttl.gz directory to count the total number of triples in the files:

SELECT (count (*) as ?triples)
FROM EXTERNAL <dir:/opt/anzograph/etc/tickit.ttl.gz>
WHERE { ?s ?p ?o . }

triples
--------
5368800
1 rows

The example below analyzes the tickit.ttl.gz directory to count the total number of triples and the number of distinct subjects and predicates:

SELECT
  (count (*) as ?triples)
  (count(distinct ?s) as ?subjects)
  (count(distinct ?p) as ?preds)
FROM EXTERNAL <dir:/opt/anzograph/etc/tickit.ttl.gz>
WHERE { ?s ?p ?o . }

triples | subjects | preds
--------+----------+-------
5368800 | 424319   | 45
1 rows

Estimating Memory Requirements Based on Data Characteristics

Although the memory required to load and perform queries on specific data sets will vary based on the size and type of data contained in a data set, you can still obtain a reasonable estimate or starting point for the amount of memory you will need to load any specific data set. Using the method of pre-load analysis of load files described earlier (see Analyzing Data Characteristics in Load Files), you can query the data set to calculate a rough estimate of the memory required to load the data set in memory.

Calculate the number of triples the data set will generate when stored in Graph Lakehouse.
Multiply the number of triples by an average triple size.
Add the number of characters stored in all of the character strings contained in the data set.

Using the example of the Tickit data set provided with Graph Lakehouse, you can perform a query like the following to calculate the number of triples the Tickit data set will contain when loaded into memory:

SELECT (COUNT(*) as ?triple_count)
FROM EXTERNAL <dir:/opt/anzograph/etc/tickit.ttl.gz>
WHERE {?s ?p ?o}

triple_count
------------
7696012
1 rows

Queries run against files on disk will run significantly slower than they do when run against data in memory. Also, note that pre-load analysis of data sets does not prune duplicate triples, unlike data sets loaded in memory, so the calculation of the number of triples may differ somewhat from the number reported after the data set is loaded in memory.

Once you know the total number of triples, multiply the value by the average triple storage size. The Memory usage depends on data characteristics section above shows some example triples and their estimated size. If you are familiar with the data in the files, you may be able to determine the average size based on the examples. Otherwise, Altair recommends using 30 bytes as the average triple size. For example, using the triple count above and an average triple size of 30 bytes:

7696012 x 30 = 230,880,360 bytes

To calculate the additional memory required for in-memory storage of character string data, you can run a query like the following:

SELECT
  (SUM(IF(DATATYPE(?o)=<http://www.w3.org/2001/XMLSchema#string>,(STRLEN(?o)),0)) AS ?char_count)
FROM EXTERNAL <dir:/opt/anzograph/etc/tickit.ttl.gz>
WHERE {?s ?p ?o.}
 
char_count 
----------- 
4893660 
1 rows

For ASCII characters, Graph Lakehouse requires a single byte of memory for each character, so adding the total number of characters to the previous memory calculation for storing triples, the result is the following:

230,880,360 + 4,893,660 = 235,774,020 total bytes

Note that the calculation of 235,774,020 total bytes (0.24 GB) provides an estimate for data set storage "at rest" and takes into account only one data set stored in memory. When coming up with a final recommendation for total memory requirements of an Graph Lakehouse deployment, account for any other data sets you may want to load in memory at the same time. You also need to keep in mind other memory sizing guidelines, for example, that all loaded data should not consume more than 50% of all available RAM.