SPARQL Best Practices
This topic provides some background information on query execution and gives general guidelines to follow to help ensure that your SPARQL queries are optimized and do not exhaust Anzo or AnzoGraph resources.
Query Execution Overview
When you run a new query against a graphmart, AnzoGraph parses the text and then plans the query execution strategy. The planner determines the steps that are required to compute the results in an optimal way. The plan is based on the statistics about your data that is gathered during graphmart activation. After the plan is determined, code for the plan is generated, compiled, and then distributed throughout the cluster to be executed. The plan XML file and compiled code are both saved to disk so that they can be reused when the query is run again. That is why the first run of a query takes longer than subsequent runs.
In addition to the caching that AnzoGraph performs, Anzo caches the query results. If you enable the Skip server cache option when running a query in the Query Builder, it means that Anzo will not access the cache that is available from a previous run of the query.
SPARQL Clause Execution Order and Optimization
The following image shows the execution order of clauses in a read (SELECT or CONSTRUCT) query. The list below describes the order.
- First, any subselects or subqueries are performed. Note that the ordering presented in the image also applies to the clauses included in subqueries.
- Then the graph patterns that are not included in a subquery are scanned.
- Next, filtering is applied if a FILTER clause is included.
- Once the WHERE clause has been processed, grouping is performed if a GROUP BY clause is included.
- Next, the result clause is processed, including keywords like DISTINCT and any aggregation or other functions.
- Lastly, the results are ordered when ORDER BY is specified. The ORDER BY clause is more complex when GROUP BY and LIMIT are included:
Nlog(M)
where N is the number of grouped solutions and M is the LIMIT.
The GROUP BY and result clauses (steps 4 and 5) add linear complexity on top of the complexity in the WHERE clause. Query execution time increases proportionally to the total number of solutions found by the WHERE clause. To reduce complexity, trim the WHERE clause to have as few solutions as possible so that the expensive clauses are executed against as few solutions as possible.
General Guidelines
This section offers guidance on developing performant queries and avoiding unexpected results.
- Limit the Results
- Replace FILTER with VALUES or Triple Patterns
- Avoid Cross-Product Joins
- Use Subqueries when Querying Large Amounts of Data
Limit the Results
The easiest way to reduce query execution time in some cases is to apply a LIMIT statement to limit the result set to a specific number of solutions. Limiting the number of results improves performance for cases where query results are calculated and returned in a streaming fashion. Limiting results is particularly useful when results need to be ordered so that the first group of results are the only ones of interest.
Example: Find the sample ID and binding density for the top 10 most dense samples
PREFIX biom: <http://identifiers.csi.com/pharmakg/def/biomarker#> SELECT ?sampleId ?bindingDensity WHERE { ?sample a biom:Sample ; biom:sampleId ?sampleId ; biom:bindingDensity ?bindingDensity . } ORDER BY DESC(?bindingDensity) LIMIT 10
Replace FILTER with VALUES or Triple Patterns
While a FILTER clause is useful for narrowing down selected data per a set of requirements, only use FILTER when the logic does not lend to other operations. In many cases, replacing FILTER with a VALUES clause or a well-organized set of triple patterns increases query performance. When processing a FILTER statement, all non-filtered data must be retrieved before the FILTER can be applied. Using a VALUES clause or triple pattern, however, reduces the amount of data that is retrieved and processed after the retrieval.
Example 1: Inappropriate use of FILTER for a value-driven SELECT query
PREFIX uriRoot: <http://example.com/rootOfUris#> SELECT ?var1 ?var2 WHERE { ?instanceOfClass a uriRoot:Class ; uriRoot:varName1 ?var1 ; uriRoot:varName2 ?var2 ; uriRoot:filteredVar ?filteredVar . FILTER(?filteredVar = 'COMPAREDDATA1' || ?filteredVar = ‘COMPAREDDATA2’ || ?filteredVar = ‘COMPAREDDATA3’)
# filteredVar is first retrieved and then run through several comparisons }
Solution 1: Use VALUES to select certain values
PREFIX uriRoot: <http://example.com/rootOfUris#> SELECT ?var1 ?var2 WHERE { ?instanceOfClass a uriRoot:Class ; uriRoot:varName1 ?var1 ; uriRoot:varName2 ?var2 ; uriRoot:filteredVar ?valueVar . VALUES (?valueVar) { (‘COMPAREDDATA1’) (‘COMPAREDDATA2’) (‘COMPAREDDATA3’) } # selection is performed once for each entry in the VALUES clause, # retrieving no more data than necessary }
Example 2: Inappropriate use of FILTER for a value-driven SELECT query
PREFIX uriRoot: <http://example.com/rootOfUris#> SELECT ?var1 ?filteredVar WHERE { ?instanceOfClass a uriRoot:ClassName ; uriRoot:varName1 ?var1 ; uriRoot:varName2 ?var2 ; uriRoot:filteredVar ?filteredVar . FILTER(?filteredVar = 'COMPAREDDATA1')
# filteredVar is first retrieved and then compared }
Solution 2: Use a triple literal to select certain values
PREFIX uriRoot: <http://example.com/rootOfUris#> SELECT ?var1 ?filteredVar WHERE { ?instanceOfClass a uriRoot:Class ; uriRoot:varName1 ?var1 ; uriRoot:filteredVar 'COMPAREDDATA' . # data is only retrieved if filteredVar matches desired compared data upon initial retrieval }
Avoid Cross-Product Joins
When trying to gather data from multiple classes at once, it is possible to accidentally create a cross-product join, a selection that combines the selected data in a hyper-linear way rather than simply assembling the data and returning an unprocessed set.
Example: Accidental cross-product query
PREFIX uriRoot: <http://example.com/rootOfUris#> SELECT ?var1 ?var2 WHERE { ?instanceOfClass1 a uriRoot:Class1 ; uriRoot:varName1 ?var1 . ?instanceOfClass2 a uriRoot:Class2 ; uriRoot:varName2 ?var2 . }
In the above example, the goal may have been to retrieve IDs from all instances of Class1 and all instances of Class2, for example, all of the participants and all of the subjects. However, the result of the query would be every combination of participant and subject. If there are 10 participants and 5 subjects, there would be 50 results rather than 15. In large data sets, this severely affects performance and puts the system under unnecessary strain.
There are two straightforward ways to separate or parameterize data to write a more performant query:
Solution 1: Use UNION to replace the cross-product
PREFIX uriRoot: <http://example.com/rootOfUris#> SELECT ?commonVar WHERE { { ?instanceOfClass1 a uriRoot:Class1 ; uriRoot:varName1 ?var1 . BIND(?var1 as ?commonVar) } UNION { ?instanceOfClass2 a uriRoot:Class2 ; uriRoot:varName2 ?var2 . BIND(?var2 as ?commonVar) } # this creates a temporary graph that is a union of two graphs # in each of the two graphs, the desired data is saved under the same name }
Solution 2: Use VALUES to replace the cross-product
PREFIX uriRoot: <http://example.com/rootOfUris#> SELECT ?commonVar WHERE { ?instanceOfClass a ?classURI ; ?propertyURI ?commonVar . VALUES (?classURI ?propertyURI) { (uriRoot:Class1 uriRoot:varName1) (uriRoot:Class2 uriRoot:varName2) } }
Use Subqueries when Querying Large Amounts of Data
When analyzing data, there may be a need to aggregate data and then perform a selection or derivation on the resulting aggregate. In this case, it is advisable to use one or more subselects or subqueries, where a SELECT query is included inside the WHERE clause and the remainder of the WHERE clause operates on the results of that SELECT as though the data were immediately available in the graph.
Example: Aggregate a variable and then process the aggregation
PREFIX uriRoot: <http://example.com/rootOfUris#> SELECT ?var1 ?var2Aggregation WHERE { { SELECT ?var1 (GROUP_CONCAT(?var2) as ?var2Aggregation) WHERE { ?instanceOfClass1 a uriRoot:Class1 ; uriRoot:varName1 ?var1 . ?instanceOfClass2 a uriRoot:Class2 ; uriRoot:varName2 ?var2 . } GROUP BY ?var1 } # var1 and var2Aggregation are now available for the usual processing # while var2 is no longer available as it only existed within the subselect FILTER(regex(?var2Aggregation, ‘DESIREDVAR2VAL’)) # FILTER is used for illustrative purposes, but any processing would work }