SPARQL Query Templates and Best Practices
This topic provides query templates and best practices that you can use as a starting point for writing performant SPARQL queries in Anzo, such as in Data Layers, dashboard query lenses, and the Query Builder, and for avoiding unexpected results.
SPARQL Query Templates
This section provides templates that you can use as a starting point for writing SPARQL queries.
- Basic Data Selection
- Graph Traversal Data Selection
- Text Cleanup with REGEX
- Data Aggregation
- Applying a Filter to Selected Data
- Creating or Deriving New Variables
Basic Data Selection
The most fundamental use case for a SPARQL query is to select property values from a collection of instances. The following template and example query illustrate how to access a class in a model and return the properties on that class using their URIs.
Abstracted Query Template – Replace the placeholder names to modify the query
PREFIX uriRoot: <http://example.com/rootOfUris#>

# select the variables that are populated in the WHERE clause
SELECT ?var1 ?var2
WHERE {
  ?instanceOfClass a uriRoot:ClassName ;
    uriRoot:varName1 ?var1 ;
    # use a prefix to abbreviate a property URI as shown above
    # or use the full URI as shown below
    <http://example.com/rootOfUris#varName2> ?var2 .
}
Example Query – Get Sample ID and Anatomical Location for each Sample
PREFIX bm: <http://identifiers.csi.com/pharmakg/def/biomarker#>

SELECT ?sampleId ?anatomicalLocation
WHERE {
  ?sample a bm:Sample ;
    bm:sampleId ?sampleId ;
    <http://identifiers.csi.com/pharmakg/def/biomarker#fmi_anatomicalLocation> ?anatomicalLocation .
}
Graph Traversal Data Selection
The graph model makes it easy to combine data from different classes. The following template illustrates how to traverse between classes in the data model and access data from properties on multiple classes.
Abstracted Query Template – Replace the placeholder names to modify the query
PREFIX uriRoot: <http://example.com/rootOfUris#>

# select the variables that are populated in the WHERE clause
SELECT ?var1 ?var2 ?varFromOtherClass
WHERE {
  ?instanceOfClass a uriRoot:ClassName ;
    uriRoot:varName1 ?var1 ;
    # use a prefix to abbreviate a property URI as shown above
    # or use the full URI as shown below
    <http://example.com/rootOfUris#varName2> ?var2 ;
    # getting data from other classes requires traversing per the model
    uriRoot:pointerToOtherClass ?instanceOfOtherClass .
  ?instanceOfOtherClass a uriRoot:OtherClassName ;
    uriRoot:varName3 ?varFromOtherClass .
}
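As a concrete illustration of the template above, the following sketch returns each Sample ID together with the ID of the Subject that the sample came from. Only bm:Sample and bm:sampleId appear elsewhere in this topic. The bm:Subject class and the bm:subject and bm:subjectId properties are hypothetical names assumed for this example, so substitute the classes and properties from your own model.

```sparql
PREFIX bm: <http://identifiers.csi.com/pharmakg/def/biomarker#>

SELECT ?sampleId ?subjectId
WHERE {
  ?sample a bm:Sample ;
    bm:sampleId ?sampleId ;
    # hypothetical property that points from Sample to Subject
    bm:subject ?subject .
  # hypothetical class and property on the other side of the traversal
  ?subject a bm:Subject ;
    bm:subjectId ?subjectId .
}
```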
Text Cleanup with REGEX
Once data is onboarded to Anzo, it is common to encounter string values that include issues such as unintended characters, missing spaces, and inconsistent formatting. You can use regular expressions in a Data Layer query to manipulate those values so that they are consistent and readable in analytics against the Graphmart.
The BIND clause in the Data Layer query below removes all characters that are neither alphanumeric nor white space, converts the remaining characters to upper case, and trims any white space from before and after the string.
Replace the placeholder names as needed
PREFIX : <http://csi.com/>

DELETE {
  GRAPH ${targetGraph} { ?s ?pred ?old_val }
}
INSERT {
  GRAPH ${targetGraph} { ?s ?pred ?new_val }
}
${usingSources}
WHERE {
  ?s a :Class ;
    ?pred ?old_val .
  VALUES (?pred) { (:property) }
  # clean the existing value and bind the result to the replacement value
  BIND(TRIM(UPPER(REPLACE(?old_val, "[^a-zA-Z0-9[:space:]]", ""))) AS ?new_val)
}
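Before running an update like the one above, it can help to preview the cleanup on a few inline strings. The following sketch assumes only standard SPARQL 1.1 functions, so UCASE stands in for UPPER, a second REPLACE stands in for TRIM, and \s stands in for [:space:]. The sample strings are made up for illustration.

```sparql
SELECT ?raw ?clean
WHERE {
  # made-up sample values to test the expression against
  VALUES ?raw { "  ab-c, d!  " "Lung (left)" }
  # 1. drop every character that is not a letter, digit, or white space
  BIND(REPLACE(?raw, "[^a-zA-Z0-9\\s]", "") AS ?stripped)
  # 2. convert to upper case
  BIND(UCASE(?stripped) AS ?upper)
  # 3. remove leading and trailing white space
  BIND(REPLACE(?upper, "^\\s+|\\s+$", "") AS ?clean)
}
```

With these inputs, ?clean should come back as ABC D and LUNG LEFT. If the preview looks right, the same expression can be moved into the BIND clause of the Data Layer query.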
Data Aggregation
Grouping data selections around a central property yields a more complete representation or summary of the data available. The following template illustrates how to use one property to act as a pivot point for collecting all the data from another property.
Abstracted Query Template – Replace the placeholder names to modify the query
PREFIX pref: <http://example.com/rootOfUris#>

# data can be aggregated to yield counts, concatenations of data, etc.
SELECT ?instanceId (GROUP_CONCAT(DISTINCT ?instanceDetail) AS ?instanceDetails)
WHERE {
  # apply selection/filtering logic to narrow the aggregation
  # or get summaries of total data by applying only simple restrictions
  ?instance a pref:Class ;
    pref:instanceId ?instanceId ;
    pref:instanceDetail ?instanceDetail .
}
# all non-aggregated variables must be grouped in GROUP BY
GROUP BY ?instanceId
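As a sketch of the same pattern with concrete names, the following query pivots on anatomical location and summarizes the samples found at each one. It reuses the class and properties from the Sample example earlier in this topic and assumes that each sample carries both properties. The ", " separator is an arbitrary choice.

```sparql
PREFIX bm: <http://identifiers.csi.com/pharmakg/def/biomarker#>

SELECT
  ?anatomicalLocation
  (COUNT(DISTINCT ?sample) AS ?sampleCount)
  (GROUP_CONCAT(DISTINCT ?sampleId; SEPARATOR=", ") AS ?sampleIds)
WHERE {
  ?sample a bm:Sample ;
    bm:sampleId ?sampleId ;
    bm:fmi_anatomicalLocation ?anatomicalLocation .
}
# the only non-aggregated variable in SELECT is the grouping key
GROUP BY ?anatomicalLocation
ORDER BY DESC(?sampleCount)
```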
Applying a Filter to Selected Data
Filtering the results for a query gives the ability to focus on specific aspects of the data. The following template illustrates how to restrict the total selected result set by including a filter on a variable.
Abstracted Query Template – Replace the placeholder names to modify the query
PREFIX pref1: <http://example.com/rootOfUris1#>
PREFIX pref2: <http://example.com/rootOfUris2#>

SELECT ?varFromClass1 ?varFromClass2 ?varFromClass3 ?filteredVar
WHERE {
  ?instance1 a pref1:Class1 ;
    pref1:varName1 ?varFromClass1 ;
    # the path on the model points from Class1 to Class2
    pref1:pointerToClass2 ?instance2 .
  ?instance2 a pref1:Class2 ;
    pref1:varName2 ?varFromClass2 .
  # models with different prefixes can still be joined
  ?instance3 a pref2:Class3 ;
    # the path on the model points from Class3 to Class2
    pref2:pointerToClass2 ?instance2 ;
    pref2:filteredVarName ?filteredVar .
  # filters use comparisons to scope the selected data
  # they can use existence checks or other boolean expressions as well
  FILTER(?filteredVar = 'COMPAREDDATA')
}
For optimal query performance, replace FILTER clauses with VALUES clauses or triple patterns where possible. See Replace FILTER with VALUES or Triple Patterns when Possible below for more information.
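The template's comments note that filters can also use existence checks. As a sketch of that case, the following query reuses the bm:Sample names from the earlier example and returns samples that have no anatomical location recorded. This kind of condition cannot be rewritten as a VALUES clause or a plain triple pattern, so a FILTER is a reasonable choice here.

```sparql
PREFIX bm: <http://identifiers.csi.com/pharmakg/def/biomarker#>

SELECT ?sampleId
WHERE {
  ?sample a bm:Sample ;
    bm:sampleId ?sampleId .
  # keep only samples for which the inner pattern has no match
  FILTER NOT EXISTS {
    ?sample bm:fmi_anatomicalLocation ?anatomicalLocation .
  }
}
```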
Creating or Deriving New Variables
Storing intermediate or derived data within a query enables a single query to answer more complex questions. The following template illustrates how to bind a derived value to a variable. That variable is then available for selection or further manipulation.
Abstracted Query Template – Replace the placeholder names to modify the query
PREFIX pref1: <http://example.com/rootOfUris1#>
PREFIX pref2: <http://example.com/rootOfUris2#>
PREFIX pref3: <http://example.com/rootOfUris3#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?var1 ?filterVar ?var2AndVar3
WHERE {
  ?instance1 a pref1:Class1 ;
    pref1:varName1 ?var1 .
  ?filterInstance a pref2:MedicalHistory ;
    pref2:filterVarName ?filterVar ;
    # multiple traversals between classes may be necessary to link appropriate data
    pref2:pointerToIntermediateClass ?intermediateInstance .
  ?intermediateInstance a pref2:IntermediateClass ;
    pref2:pointerToClass1 ?instance1 .
  ?instance2 a pref3:Class2 ;
    # forward traversals tend to be more performant
    # it is still possible to identify a later class and do a backward traversal
    pref3:pointerToClass1 ?instance1 ;
    pref3:varName2 ?var2 .
  ?instance3 a pref3:Class3 ;
    pref3:pointerToClass2 ?instance2 ;
    pref3:varName3 ?var3 .
  # filters can be executed on various data types
  FILTER(?filterVar < "filterData"^^xsd:filterDataType)
  # binding allows population of new/derived variables
  BIND(CONCAT(?var2, "--", ?var3) AS ?var2AndVar3)
}
SPARQL Best Practices
To ensure that your SPARQL queries perform well and do not overtax Anzo, Cambridge Semantics recommends that you follow these guidelines when writing and testing your queries:
- Limit Results when Developing and Testing Queries
- Replace FILTER with VALUES or Triple Patterns when Possible
- Beware of Cross-Product Joins
- Use Subqueries when Querying Large Amounts of Data
Limit Results when Developing and Testing Queries
In some cases, the easiest way to reduce query execution time is to add a LIMIT clause, which caps the result set at a specific number of solutions. Limiting the number of results improves performance when query results are calculated and returned in a streaming fashion. It is also particularly useful when results are ordered and only the first group of results is of interest.
Example Solution – Get Sample ID and the Binding Density for the top 10 most dense Samples
PREFIX bm: <http://identifiers.csi.com/pharmakg/def/biomarker#>

SELECT ?sampleId ?bindingDensity
WHERE {
  ?sample a bm:Sample ;
    bm:sampleId ?sampleId ;
    bm:bindingDensity ?bindingDensity .
}
ORDER BY DESC(?bindingDensity)
LIMIT 10
Replace FILTER with VALUES or Triple Patterns when Possible
While a FILTER clause is useful for narrowing down selected data to meet a set of requirements, use FILTER only when the logic does not lend itself to other operations. In many cases, replacing FILTER with a VALUES clause or a well-organized set of triple patterns increases query performance. When a FILTER is processed, all of the data that matches the rest of the pattern must be retrieved before the FILTER can be applied. A VALUES clause or triple pattern, by contrast, restricts the match itself, which reduces the amount of data that is retrieved and then processed.
Example – Inappropriate use of FILTER for value-driven SELECT
PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?var1 ?var2
WHERE {
  ?instanceOfClass a uriRoot:ClassName ;
    uriRoot:varName1 ?var1 ;
    uriRoot:varName2 ?var2 ;
    uriRoot:filteredVar ?filteredVar .
  FILTER(?filteredVar = 'COMPAREDDATA1' || ?filteredVar = 'COMPAREDDATA2' || ?filteredVar = 'COMPAREDDATA3')
  # filteredVar is first retrieved, then run through several comparisons
}
Solution – VALUES used to select data of certain values
PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?var1 ?var2
WHERE {
  ?instanceOfClass a uriRoot:ClassName ;
    uriRoot:varName1 ?var1 ;
    uriRoot:varName2 ?var2 ;
    uriRoot:filteredVar ?valueVar .
  VALUES (?valueVar) {
    ('COMPAREDDATA1')
    ('COMPAREDDATA2')
    ('COMPAREDDATA3')
  }
  # selection is performed once for each entry in the VALUES clause,
  # retrieving no more data than necessary
}
Example – Inappropriate use of FILTER for value-driven SELECT
PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?var1 ?filteredVar
WHERE {
  ?instanceOfClass a uriRoot:ClassName ;
    uriRoot:varName1 ?var1 ;
    uriRoot:varName2 ?var2 ;
    uriRoot:filteredVar ?filteredVar .
  FILTER(?filteredVar = 'COMPAREDDATA1')
  # filteredVar is first retrieved, then compared
}
Solution – Literal in a triple pattern used to select data of a certain value
PREFIX uriRoot: <http://example.com/rootOfUris#>

# filteredVar is no longer selected because its value is the constant in the pattern
SELECT ?var1
WHERE {
  ?instanceOfClass a uriRoot:ClassName ;
    uriRoot:varName1 ?var1 ;
    uriRoot:filteredVar 'COMPAREDDATA1' .
  # data is only retrieved if filteredVar matches the desired value upon initial retrieval
}
Beware of Cross-Product Joins
When gathering data from multiple classes at once, it is possible to accidentally create a cross-product join: a selection that pairs every result from one pattern with every result from another, so that the result count is the product of the two sets rather than their sum.
Example – Accidental cross-product query
PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?var1 ?var2
WHERE {
  ?instanceOfClass1 a uriRoot:ClassName1 ;
    uriRoot:varName1 ?var1 .
  ?instanceOfClass2 a uriRoot:ClassName2 ;
    uriRoot:varName2 ?var2 .
}
In the above example, the goal may have been to retrieve IDs from all instances of ClassName1 and all instances of ClassName2, for example, all of the Participants and all of the Subjects. However, the result of the query would be every combination of Participant and Subject. If there are 10 Participants and 5 Subjects, there would be 50 results rather than 15. In large data sets, this severely affects performance and puts the system under unnecessary strain.
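One inexpensive way to anticipate the problem is to count the instances of each class before combining them. The sketch below uses the same placeholder names as the example above. Assuming that each instance has a single value for the selected property, multiplying the two counts gives the number of rows that the unlinked patterns would produce (10 × 5 = 50 in the Participant and Subject case).

```sparql
PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?classURI (COUNT(DISTINCT ?instance) AS ?instanceCount)
WHERE {
  # one result row per class listed here
  VALUES ?classURI { uriRoot:ClassName1 uriRoot:ClassName2 }
  ?instance a ?classURI .
}
GROUP BY ?classURI
```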
There are two straightforward ways to separate or parameterize data to write a more performant query.
Solution 1 – Use UNION to replace the cross-product
PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?commonVar
WHERE {
  {
    ?instanceOfClass1 a uriRoot:ClassName1 ;
      uriRoot:varName1 ?var1 .
    BIND(?var1 AS ?commonVar)
  }
  UNION
  {
    ?instanceOfClass2 a uriRoot:ClassName2 ;
      uriRoot:varName2 ?var2 .
    BIND(?var2 AS ?commonVar)
  }
  # UNION combines the solutions of the two graph patterns into one result set
  # in each branch, the desired data is bound to the same variable name
}
Solution 2 – Use VALUES to replace the cross-product
PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?commonVar
WHERE {
  ?instanceOfClass a ?classURI ;
    ?propertyURI ?commonVar .
  VALUES (?classURI ?propertyURI) {
    (uriRoot:ClassName1 uriRoot:varName1)
    (uriRoot:ClassName2 uriRoot:varName2)
  }
}
Use Subqueries when Querying Large Amounts of Data
When analyzing data, you may need to aggregate data and then perform a selection or derivation on the resulting aggregate. In this case, it is advisable to use one or more subqueries (also called subselects): a SELECT query is nested inside the WHERE clause, and the remainder of the WHERE clause operates on the results of that SELECT as though the data were immediately available in the graph.
Example Solution – Aggregate a variable and then process the aggregation
PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?var1 ?var2Aggregation
WHERE {
  {
    SELECT ?var1 (GROUP_CONCAT(?var2) AS ?var2Aggregation)
    WHERE {
      ?instanceOfClass1 a uriRoot:ClassName1 ;
        uriRoot:varName1 ?var1 ;
        # placeholder link between the classes so the patterns do not form a cross-product
        uriRoot:pointerToClass2 ?instanceOfClass2 .
      ?instanceOfClass2 a uriRoot:ClassName2 ;
        uriRoot:varName2 ?var2 .
    }
    GROUP BY ?var1
  }
  # var1 and var2Aggregation are now available for the usual processing
  # while var2 is no longer available as it only existed within the subselect
  FILTER(REGEX(?var2Aggregation, 'DESIREDVAR2VAL'))
  # FILTER is used for illustrative purposes, but any processing would work
}
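Subqueries can also narrow a large data set before more properties are joined in. The sketch below reuses the bm:Sample names from the earlier examples in this topic: the inner SELECT keeps only the 10 samples with the highest binding density, and the outer pattern then looks up the anatomical location for just those samples. It assumes that the same samples carry both bm:bindingDensity and bm:fmi_anatomicalLocation.

```sparql
PREFIX bm: <http://identifiers.csi.com/pharmakg/def/biomarker#>

SELECT ?sampleId ?bindingDensity ?anatomicalLocation
WHERE {
  {
    # inner query: reduce the working set to 10 samples first
    SELECT ?sample ?sampleId ?bindingDensity
    WHERE {
      ?sample a bm:Sample ;
        bm:sampleId ?sampleId ;
        bm:bindingDensity ?bindingDensity .
    }
    ORDER BY DESC(?bindingDensity)
    LIMIT 10
  }
  # outer pattern: runs only against the samples projected by the subquery
  ?sample bm:fmi_anatomicalLocation ?anatomicalLocation .
}
ORDER BY DESC(?bindingDensity)
```

Note that ?sample must be projected by the inner SELECT for the outer pattern to join on it. Variables that the subquery does not project are not visible outside it.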