Analyzing Load Files Without Loading Data

AnzoGraph enables you to perform pre-load analysis on load files before loading the data into memory. You can use the pre-load feature to run statistical queries, such counting the number of triples, or getting to know the data, such as returning a list of the predicates or classes. Performing a "dry run" of the load enables you to analyze data set characteristics to help with tasks such as capacity planning. For example, you can use a single AnzoGraph server with limited memory to capture statistics about a potentially huge data set.

This topic provides instructions for performing pre-load analysis of your load files. For information about load file directory requirements and load architecture, see Load Requirements and Recommendations.

Important Considerations

  • Since AnzoGraph scans the files on-disk, queries run much slower than they do when run against data in memory. Consider performance when deciding how many files to query at once and how complex to make the queries.
  • Though the pre-load feature does not use memory for storing data, queries that you run against files do consume memory. The server must have memory available to use for intermediate query results.
  • Unlike loads into the database, pre-load analysis does not prune duplicate triples. Statistics returned for load file queries may differ from the statistics returned after the data is loaded.

Query Syntax

The syntax that you use to query load files depends on the file type:

  • For files in triple or quad format, like Turtle (.ttl), N-Triple (.n3 and .nt), N-Quad (.nq and .quads), and TriG (.trig) use the syntax in Triple or Quad Format.
  • For analyzing files in tabular format, such as CSV files, follow the syntax in Tabular Format.

AnzoGraph does not support pre-load analysis of XML or JSON files at this time.

Triple or Quad Format

Use the following syntax to analyze files in triple or quad format:

SELECT expression
FROM EXTERNAL <dir:/path/dir_or_file_name>
[ FROM EXTERNAL <dir:/path/dir_or_file_name> ]
WHERE { triple_patterns }
Option Description
SELECT expression In the SELECT clause, specify expressions that return statistical results such as a count of the total number of triples or the number of distinct predicates. Queries that return values for a specific property may return an error.
FROM EXTERNAL <dir:/path/dir_or_file_name> The URI in the FROM clause specifies the location of the load file or directory of files.

For example, this URI specifies a single file on the local file system:

<file:/home/user/data/tickit.ttl>

This example specifies a directory of files:

<dir:/data/load-files/tickit_all.ttl>

Tabular Format

Use the following syntax to analyze CSV files:

SELECT expression
WHERE {
TABLE <dir:/path/dir_or_file_name>
  (materialType,fileSystem,delimiter,hasHeaders,columnMappingString)
  quad_or_triple_patterns
}
Option Description
SELECT expression In the SELECT clause, specify expressions that return statistical results such as a count of the total number of triples or the number of distinct predicates. Queries that return values for a specific property may return an error.
TABLE <dir:/path/dir_or_file_name> The URI in the TABLE clause specifies the location of the load file or directory of files.

For example, this URI specifies a single file on the local file system:

<file:/home/user/data/tickit.csv>

This example specifies a directory of files:

<dir:/data/load-files/tickit_all.csv>
materialType The Multipurpose Internet Mail Extensions (MIME) type of the files. Specify csv for uncompressed files or csvgz for gzipped CSV files.
fileSystem This option specifies which AnzoGraph servers can access and load the CSV files. Specify global if all servers have access to the files or if you are loading from S3. Specify leader if only the leader server has access. Or specify compute if each compute server has a subset of the files on its local filesystem.
delimiter Specify the character that is used as the delimiter in the CSV file, such as a pipe (|), comma (,), asterisk (*), etc.
hasHeaders Boolean value that specifies whether the file has headers. Specify true if headers exist or false if the file does not contain headers.
columnMappingString Defines the mapping for the file. The value lists each column name and data type in a comma-separated list. For example, for a CSV file with four columns (ID, Title, Category ID, and Category Name):
ID:long,Title:char,CategoryID:long,CategoryName:char

Examples

The following example analyzes the tickit.ttl load file to count the total number of triples in the file:

SELECT (count (*) as ?triples)
FROM EXTERNAL <file:/opt/gqe/etc/tickit.ttl>
WHERE { ?s ?p ?o . }
triples
--------
5368800
1 rows

This example analyzes a directory of tickit.ttl load files to count the total number of triples and the number of distinct subjects and predicates:

SELECT
  (count (*) as ?triples)
  (count(distinct ?s) as ?subjects)
  (count(distinct ?p) as ?preds)
FROM EXTERNAL <dir:/opt/gqe/etc/tickit.ttl>
WHERE { ?s ?p ?o . }
triples | subjects | preds
--------+----------+-------
5368800 | 424319   | 45
1 rows

This example analyzes a MovieActors.csv file to count the total number of records in the file:

SELECT (count(*) as ?records)
WHERE {
TABLE <dir:/home/user/movie.csv/MovieActors1.csv> ('csv','leader',',',true,'MovieID:long,MovieTitle:char,ActorID:long,ActorName:char')
}
records
---------
10000
1 rows

This example analyzes the MovieActors.csv file to display the column headers in the file and 9 of the records.

SELECT *
WHERE {
TABLE <dir:/home/user/movie.csv/MovieActors1.csv> ('csv','leader',',',true,'MovieID:long,MovieTitle:char,ActorID:long,ActorName:char') } LIMIT 10
field_0  | field_1                               | field_2  | field_3
---------+---------------------------------------+----------+------------------
MovieID  | MovieTitle                            |  ActorID | ActorName
15400287 | Mission: Impossible (film series)     |   689567 | Ving Rhames
15400287 | Mission: Impossible (film series)     |    31460 | Tom Cruise
30271424 | Alvin and the Chipmunks (film series) |  1582246 | Kimberly Williams-Paisley
30271424 | Alvin and the Chipmunks (film series) |   708892 | Jason Lee (actor)
30271424 | Alvin and the Chipmunks (film series) |   876916 | Kaley Cuoco
30271424 | Alvin and the Chipmunks (film series) |  4995125 | Matthew Gray Gubler
30271424 | Alvin and the Chipmunks (film series) |   921466 | Tony Hale
30271424 | Alvin and the Chipmunks (film series) |  2647835 | Zachary Levi
30271424 | Alvin and the Chipmunks (film series) | 24202290 | Jenny Slate
10 rows
Related Topics