Analyzing Load Files Without Loading Data
AnzoGraph enables you to perform pre-load analysis on load files before loading the data into memory. You can use the pre-load feature to run statistical queries, such counting the number of triples, or getting to know the data, such as returning a list of the predicates or classes. Performing a "dry run" of the load enables you to analyze data set characteristics to help with tasks such as capacity planning. For example, you can use a single AnzoGraph server with limited memory to capture statistics about a potentially huge data set.
This topic provides instructions for performing pre-load analysis of your load files. For information about load file directory requirements and load architecture, see Load Requirements and Recommendations.
Important Considerations
- Since AnzoGraph scans the files on-disk, queries run much slower than they do when run against data in memory. Consider performance when deciding how many files to query at once and how complex to make the queries.
- Though the pre-load feature does not use memory for storing data, queries that you run against files do consume memory. The server must have memory available to use for intermediate query results.
- Unlike loads into the database, pre-load analysis does not prune duplicate triples. Statistics returned for load file queries may differ from the statistics returned after the data is loaded.
Query Syntax
The syntax that you use to query load files depends on the file type:
- For files in triple or quad format, like Turtle (.ttl), N-Triple (.n3 and .nt), N-Quad (.nq and .quads), and TriG (.trig) use the syntax in Triple or Quad Format.
- For analyzing files in tabular format, such as CSV files, follow the syntax in Tabular Format.
AnzoGraph does not support pre-load analysis of XML or JSON files at this time.
Triple or Quad Format
Use the following syntax to analyze files in triple or quad format:
SELECT expression FROM EXTERNAL <dir:/path/dir_or_file_name> [ FROM EXTERNAL <dir:/path/dir_or_file_name> ] WHERE { triple_patterns }
Option | Description |
---|---|
SELECT expression | In the SELECT clause, specify expressions that return statistical results such as a count of the total number of triples or the number of distinct predicates. Queries that return values for a specific property may return an error. |
FROM EXTERNAL <dir:/path/dir_or_file_name> | The URI in the FROM clause specifies the location of the load file or directory of files. For example, this URI specifies a single file on the local file system: <file:/home/user/data/tickit.ttl> This example specifies a directory of files: <dir:/data/load-files/tickit_all.ttl> |
Tabular Format
Use the following syntax to analyze CSV files:
SELECT expression WHERE { TABLE <dir:/path/dir_or_file_name> (materialType,fileSystem,delimiter,hasHeaders,columnMappingString) quad_or_triple_patterns }
Option | Description |
---|---|
SELECT expression | In the SELECT clause, specify expressions that return statistical results such as a count of the total number of triples or the number of distinct predicates. Queries that return values for a specific property may return an error. |
TABLE <dir:/path/dir_or_file_name> | The URI in the TABLE clause specifies the location of the load file or directory of files. For example, this URI specifies a single file on the local file system: <file:/home/user/data/tickit.csv> This example specifies a directory of files: <dir:/data/load-files/tickit_all.csv> |
materialType | The Multipurpose Internet Mail Extensions (MIME) type of the files. Specify csv for uncompressed files or csvgz for gzipped CSV files. |
fileSystem | This option specifies which AnzoGraph servers can access and load the CSV files. Specify global if all servers have access to the files or if you are loading from S3. Specify leader if only the leader server has access. Or specify compute if each compute server has a subset of the files on its local filesystem. |
delimiter | Specify the character that is used as the delimiter in the CSV file, such as a pipe (|), comma (,), asterisk (*), etc. |
hasHeaders | Boolean value that specifies whether the file has headers. Specify true if headers exist or false if the file does not contain headers. |
columnMappingString | Defines the mapping for the file. The value lists each column name and data type in a comma-separated list. For example, for a CSV file with four columns (ID, Title, Category ID, and Category Name):ID:long,Title:char,CategoryID:long,CategoryName:char |
Examples
The following example analyzes the tickit.ttl load file to count the total number of triples in the file:
SELECT (count (*) as ?triples) FROM EXTERNAL <file:/opt/gqe/etc/tickit.ttl> WHERE { ?s ?p ?o . }
triples -------- 5368800 1 rows
This example analyzes a directory of tickit.ttl load files to count the total number of triples and the number of distinct subjects and predicates:
SELECT (count (*) as ?triples) (count(distinct ?s) as ?subjects) (count(distinct ?p) as ?preds) FROM EXTERNAL <dir:/opt/gqe/etc/tickit.ttl> WHERE { ?s ?p ?o . }
triples | subjects | preds --------+----------+------- 5368800 | 424319 | 45 1 rows
This example analyzes a MovieActors.csv file to count the total number of records in the file:
SELECT (count(*) as ?records)
WHERE {
TABLE <dir:/home/user/movie.csv/MovieActors1.csv> ('csv','leader',',',true,'MovieID:long,MovieTitle:char,ActorID:long,ActorName:char')
}
records --------- 10000 1 rows
This example analyzes the MovieActors.csv file to display the column headers in the file and 9 of the records.
SELECT *
WHERE {
TABLE <dir:/home/user/movie.csv/MovieActors1.csv> ('csv','leader',',',true,'MovieID:long,MovieTitle:char,ActorID:long,ActorName:char') } LIMIT 10
field_0 | field_1 | field_2 | field_3 ---------+---------------------------------------+----------+------------------ MovieID | MovieTitle | ActorID | ActorName 15400287 | Mission: Impossible (film series) | 689567 | Ving Rhames 15400287 | Mission: Impossible (film series) | 31460 | Tom Cruise 30271424 | Alvin and the Chipmunks (film series) | 1582246 | Kimberly Williams-Paisley 30271424 | Alvin and the Chipmunks (film series) | 708892 | Jason Lee (actor) 30271424 | Alvin and the Chipmunks (film series) | 876916 | Kaley Cuoco 30271424 | Alvin and the Chipmunks (film series) | 4995125 | Matthew Gray Gubler 30271424 | Alvin and the Chipmunks (film series) | 921466 | Tony Hale 30271424 | Alvin and the Chipmunks (film series) | 2647835 | Zachary Levi 30271424 | Alvin and the Chipmunks (film series) | 24202290 | Jenny Slate 10 rows