Analyzing Load Files Without Loading Data
AnzoGraph enables you to perform pre-load analysis on load files before loading the data into memory. You can use the pre-load feature to run statistical queries, such as counting the number of triples, or to get to know the data, for example by returning a list of the predicates or classes. Performing a "dry run" of the load enables you to analyze data set characteristics to help with tasks such as capacity planning. For example, you can use a single AnzoGraph server with limited memory to capture statistics about a potentially huge data set.
This topic provides instructions for performing pre-load analysis of your load files. For information about load file directory requirements and load architecture, see Load File Requirements.
Important Considerations
- Because AnzoGraph scans the files on disk, queries run much more slowly than they do against data in memory. Consider performance when deciding how many files to query at once and how complex to make the queries.
- Though the pre-load feature does not use memory for storing data, queries that you run against files do consume memory. The server must have memory available to use for intermediate query results.
- Unlike loads into the database, pre-load analysis does not prune duplicate triples. Statistics returned for load file queries may differ from the statistics returned after the data is loaded.
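To illustrate why statistics can differ before and after a load, the following Python sketch counts the triples in a small N-Triples sample with and without de-duplication. It is only an illustration of the effect, not how AnzoGraph works internally: the sample data and the `count_triples` helper are hypothetical, and the sketch assumes one triple per non-blank line, which holds for N-Triples but not for Turtle or TriG.

```python
# Hypothetical N-Triples sample that contains one duplicate statement.
SAMPLE_NT = """\
<http://example.org/e1> <http://example.org/name> "Event 1" .
<http://example.org/e1> <http://example.org/venue> <http://example.org/v1> .
<http://example.org/e1> <http://example.org/name> "Event 1" .
"""


def count_triples(text):
    """Return (raw_count, distinct_count) for N-Triples text.

    Assumes one triple per non-blank, non-comment line.
    """
    lines = [ln.strip() for ln in text.splitlines()]
    triples = [ln for ln in lines if ln and not ln.startswith("#")]
    return len(triples), len(set(triples))


raw, distinct = count_triples(SAMPLE_NT)
print(f"raw={raw} distinct={distinct}")  # raw=3 distinct=2
```

In this sketch the raw count corresponds to what a pre-load query would report, and the distinct count corresponds to what remains after duplicates are pruned during a load.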
Query Syntax
Use the following syntax to analyze files in triple or quad format, such as Turtle (.ttl), N-Triples (.n3 and .nt), N-Quads (.nq and .quads), and TriG (.trig). AnzoGraph does not support pre-load analysis of XML or JSON files at this time.
SELECT expression FROM EXTERNAL <dir:/path/dir_or_file_name> [ FROM EXTERNAL <dir:/path/dir_or_file_name> ] WHERE { triple_patterns }
Option | Description |
---|---|
SELECT expression | In the SELECT clause, specify expressions that return statistical results such as a count of the total number of triples or the number of distinct predicates. Queries that return values for a specific property may return an error. |
FROM EXTERNAL <dir:/path/dir_or_file_name> | The URI in the FROM EXTERNAL clause specifies the location of the load file or directory of files. In the examples in this topic, the file: prefix identifies a single file and the dir: prefix identifies a directory. For example, this URI specifies a single file on the local file system: <file:/home/user/data/tickit.ttl> This example specifies a directory of files: <dir:/data/load-files/tickit_all.ttl> |
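When pre-load queries are generated from a script, the syntax above can be assembled with a small helper. The following Python sketch is one possible way to do that. The function names `external_uri` and `build_preload_query` are hypothetical and not part of any AnzoGraph client library, and combining a file: source with a dir: source in a single query is an assumption based on the optional second FROM EXTERNAL clause in the syntax.

```python
def external_uri(path, is_directory=False):
    """Wrap a file-system path in the file: or dir: URI form used in this topic."""
    scheme = "dir" if is_directory else "file"
    return f"<{scheme}:{path}>"


def build_preload_query(select_expr, sources, pattern="?s ?p ?o ."):
    """Assemble a pre-load query with one FROM EXTERNAL clause per source.

    `sources` is a list of (path, is_directory) pairs.
    """
    if not sources:
        raise ValueError("at least one source is required")
    from_clauses = " ".join(
        f"FROM EXTERNAL {external_uri(path, is_dir)}" for path, is_dir in sources
    )
    return f"SELECT {select_expr} {from_clauses} WHERE {{ {pattern} }}"


# Paths are the sample locations from the option table above.
query = build_preload_query(
    "(count(*) as ?triples)",
    [
        ("/home/user/data/tickit.ttl", False),
        ("/data/load-files/tickit_all.ttl", True),
    ],
)
print(query)
```

The helper only builds the query text. Submitting it to the server is left to whatever client you already use.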
Examples
The following example analyzes the tickit.ttl load file to count the total number of triples in the file:
SELECT (count (*) as ?triples) FROM EXTERNAL <file:/opt/gqe/etc/tickit.ttl> WHERE { ?s ?p ?o . }
    triples
    --------
    5368800
    1 rows
This example analyzes a directory of tickit.ttl load files to count the total number of triples and the number of distinct subjects and predicates:
SELECT (count (*) as ?triples) (count(distinct ?s) as ?subjects) (count(distinct ?p) as ?preds) FROM EXTERNAL <dir:/opt/gqe/etc/tickit.ttl> WHERE { ?s ?p ?o . }
    triples | subjects | preds
    --------+----------+-------
    5368800 |   424319 |    45
    1 rows
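If you collect these statistics in a script, for example as input to capacity planning, the text output can be converted into structured values. The following Python sketch parses a result table of the shape shown in these examples. It assumes the result is printed one row per line, that columns are separated by a pipe character, that all values are integers, and that the output ends with a row-count line. The `parse_result_table` helper is hypothetical.

```python
import re


def parse_result_table(text):
    """Parse a pipe-delimited result table into a list of dicts.

    Expects a header line, a dashed separator line, data rows, and an
    optional trailing line such as "1 rows".
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = [h.strip() for h in lines[0].split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header and the dashed separator
        if re.fullmatch(r"\d+ rows?", ln):
            continue  # skip the row-count trailer
        cells = [c.strip() for c in ln.split("|")]
        rows.append(dict(zip(header, (int(c) for c in cells))))
    return rows


# Sample text copied from the second example in this topic.
OUTPUT = """\
triples | subjects | preds
--------+----------+-------
5368800 |   424319 |    45
1 rows
"""

stats = parse_result_table(OUTPUT)
print(stats)
```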