SPARQL Query Templates and Best Practices

To provide guidance on developing performant SPARQL queries and avoiding unexpected results, this topic offers SPARQL best practices and query templates that you can use as a starting point for writing SPARQL queries in Anzo, such as in data layer steps, dashboard query lenses, and the Query Builder.

SPARQL Query Templates
SPARQL Best Practices

SPARQL Query Templates

This section provides templates that you can use as a starting point for writing SPARQL queries.

Basic Data Selection
Graph Traversal Data Selection
Text Cleanup with REGEX
Data Aggregation
Applying a Filter to Selected Data
Creating or Deriving New Variables

Basic Data Selection

The most fundamental use case for writing SPARQL queries is to select data from properties from a collection of instances. The following template and example query illustrate how to access a class in a model and return the properties on that class using their URIs.

Abstracted Query Template – Replace the bold text to modify the query

PREFIX uriRoot: <http://example.com/rootOfUris#>
			
# select the variables that are populated in the WHERE clause
SELECT ?var1 ?var2
WHERE {
  ?instanceOfClass a uriRoot:ClassName ;
      uriRoot:varName1 ?var1 ;
      # use a prefix to abbreviate a property URI as shown above
      # or use the full URI as shown below			
      <http://example.com/rootOfUris#varName2> ?var2 .
}

Example Query – Get Sample ID and Anatomical Location for each Sample

PREFIX bm: <http://identifiers.csi.com/pharmakg/def/biomarker#>
			
SELECT ?sampleId ?anatomicalLocation
WHERE {
  ?sample a bm:Sample ;
      bm:sampleId ?sampleId ;
      <http://identifiers.csi.com/pharmakg/def/biomarker#fmi_anatomicalLocation> ?anatomicalLocation .
}

Graph Traversal Data Selection

The graph model enables the flexibility to combine data from different classes. The following template illustrates how to traverse between classes in the data model and access data from properties on multiple classes.

Abstracted Query Template – Replace the bold text to modify the query

PREFIX uriRoot: <http://example.com/rootOfUris#>
# select the variables that are populated in the WHERE clause
SELECT ?var1 ?var2 ?varFromOtherClass
WHERE {	
  ?instanceOfClass a uriRoot:ClassName ;
      uriRoot:varName1 ?var1 ;
      # use a prefix to abbreviate a property URI as shown above
      # or use the full URI as shown below
      <http://example.com/rootOfUris#varName2> ?var2 ;
      # getting data from other classes requires traversing per the model
      uriRoot:pointerToOtherClass ?instanceOfOtherClass .
			
  ?instanceOfOtherClass a uriRoot:OtherClassName ;
      uriRoot:varName3 ?varFromOtherClass .
}

Text Cleanup with REGEX

Once data is onboarded to Anzo, it is common to encounter string values that include issues such as unintended characters, missing spaces, and inconsistent formatting. You can use regular expressions in a Data Layer query to manipulate those values so that they are consistent and readable in analytics against the Graphmart.

The BIND clause in the Data Layer query below trims any white space from before and after the string, converts the characters to upper case, and removes all non-alphanumeric characters and non-spaces.

Replace the bold text as needed

PREFIX : <http://csi.com/>
DELETE {
  GRAPH ${targetGraph}{
    ?s ?pred ?old_val
  }
}
INSERT {
  GRAPH ${targetGraph}{
   ?s ?pred ?new_val
  }
}
${usingSources}
WHERE { 
  ?s a :Class ;
    ?pred ?old_val .
  
  VALUES (?pred) {
    (:property)
  }
BIND(TRIM(UPPER(REPLACE(?val, "[^a-zA-Z0-9[[:space:]]", ""))) as ?new_val)
}

Data Aggregation

Grouping data selections around a central property yields a more complete representation or summary of the data available. The following template illustrates how to use one property to act as a pivot point for collecting all the data from another property.

Abstracted Query Template – Replace the bold text to modify the query

PREFIX pref: <http://example.com/rootOfUris#>
				
SELECT
# data can be aggregated to yield counts, concatenations of data, etc.
  ?instanceId GROUP_CONCAT(DISTINCT(?instanceDetail) as ?instanceDetails)
WHERE {
  # apply selection/filtering logic to narrow the aggregation
  # or get summaries of total data by applying only simple restrictions
  ?instance a pref:Class ;
      pref:instanceId ?instanceId ;
      pref:instanceDetail ?instanceDetail .
}
GROUP BY ?instanceId
# all non-aggregated variables must be grouped in GROUP BY

Applying a Filter to Selected Data

Filtering the results for a query gives the ability to focus on specific aspects of the data. The following template illustrates how to restrict the total selected result set by including a filter on a variable.

Abstracted Query Template – Replace the bold text to modify the query

PREFIX pref1: <http://example.com/rootOfUris1#>
PREFIX pref2: <http://example.com/rootOfUris2#>
			
SELECT ?varFromClass1 ?varFromClass2 ?varFromClass3 ?filteredVar
WHERE {
  ?instance1 a pref1:Class1 ;
      pref1:varName1 ?varFromClass1 ;
      # the path on the model points from Class1 to Class2
      pref1:pointerToClass2 ?instance2 .
			
  ?instance2 a pref1:Class2 ;
      pref1:varName2 ?varFromClass2 .

  # models with different prefixes can still be joined
  ?instance3 a pref2:Class3 ;
      # the path on the model points from Class3 to Class2
      pref2:pointerToClass2 ?instance2 ;
      pref2:filteredVarName ?filteredVar .
			
  # filters use comparisons to scope the selected data
  # they can use existence checks or other boolean expressions as well
  FILTER(?filteredVar = 'COMPAREDDATA')
}

For optimal query performance, replace FILTER clauses. See Replace FILTER with VALUES or Triple Patterns when Possible below for more information.

Creating or Deriving New Variables

Storing intermediate or derived data within a query enables a single query to answer more complex questions. The following template illustrates how to bind a derived value to a variable. That variable is then available for selection or further manipulation.

Abstracted Query Template – Replace the bold text to modify the query

PREFIX pref1: <http://example.com/rootOfUris1#>
PREFIX pref2: <http://example.com/rootOfUris2#>
PREFIX pref3: <http://example.com/rootOfUris3#>
			
SELECT ?var1 ?filterVar ?var2AndVar3
WHERE {
  ?instance1 a pref1:Class1 ;
      pref1:varName1 ?var1 .

  ?filterInstance a pref2:MedicalHistory ;
      pref2:filterVarName ?filterVar ;
      # multiple traversals between classes may be necessary to link appropriate data
      pref2:pointerToIntermediateClass ?intermediateInstance .
			
  ?intermediateInstance a pref2:IntermediateClass ;
      pref2:pointerToClass1 ?instance1 .

  ?instance2 a pref3:Class2 ;
      # forwards traversals tend to be more performant 
      # it is still possible to identify a latter class and do a backwards traversal
      pref3:pointerToClass1 ?instance1 ;
      pref3:varName2 ?var2 .

  ?instance3 a pref3:Class3 ;
      pref3:pointerToClass2 ?instance2 ;
      pref3:varName3 ?var3 .

  # filters can be executed on various data types
  FILTER(?filterVar < "filterData"^^xsd:filterDataType)
			
  # binding allows population of new/derived variables
  BIND(CONCAT(?var2, "--", ?var3) as ?var2AndVar3)
}

SPARQL Best Practices

To ensure that your SPARQL queries perform well and do not overtax Anzo, Cambridge Semantics recommends that you follow these guidelines when writing and testing your queries:

Limit Results when Developing and Testing Queries
Replace FILTER with VALUES or Triple Patterns when Possible
Beware of Cross-Product Joins
Use Subqueries when Querying Large Amounts of Data

Limit Results when Developing and Testing Queries

The easiest way to reduce query execution time in some cases is to apply a LIMIT statement to limit the result set to a specific number of solutions. Limiting the number of results improves performance for cases where query results are calculated and returned in a streaming fashion. Limiting results is particularly useful when results need to be ordered so that the first group of results are the only ones of interest.

Example Solution – Get Sample ID and the Binding Density for the top 10 most dense Samples

PREFIX bm: <http://identifiers.csi.com/pharmakg/def/biomarker#>
			
SELECT ?sampleId ?bindingDensity
WHERE {
  ?sample a bm:Sample ;
      bm:sampleId ?sampleId ;
      bm:bindingDensity ?bindingDensity .
}
ORDER BY DESC(?bindingDensity)
LIMIT 10

Replace FILTER with VALUES or Triple Patterns when Possible

While a FILTER clause is useful for narrowing down selected data per a set of requirements, only use FILTER when the logic does not lend to other operations. In many cases, replacing FILTER with a VALUES clause or a well-organized set of triple patterns increases query performance. When processing a FILTER statement, all non-filtered data must be retrieved before the FILTER can be applied. Using a VALUES clause or triple pattern, however, reduces the amount of data that is retrieved and processed after the retrieval.

Example – Inappropriate use of FILTER for value-driven SELECT

PREFIX uriRoot: <http://example.com/rootOfUris#>
			
SELECT ?var1 ?var2
WHERE {
  ?instanceOfClass a uriRoot:ClassName ;
      uriRoot:varName1 ?var1 ;
      uriRoot:varName2 ?var2 ;
      uriRoot:filteredVar ?filteredVar .
  FILTER(?filteredVar = 'COMPAREDDATA1' || ?filteredVar = ‘COMPAREDDATA2’ || ?filteredVar = ‘COMPAREDDATA3’)
  # filteredVar is first retrieved, then run through several comparisons
}

Solution – VALUES used to select data of certain values

PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?var1 ?var2
WHERE {
  ?instanceOfClass a uriRoot:ClassName ;
      uriRoot:varName1 ?var1 ;
      uriRoot:varName2 ?var2 ;
      uriRoot:filteredVar ?valueVar .
		
  VALUES (?valueVar) {
      (‘COMPAREDDATA1’)
      (‘COMPAREDDATA2’)
      (‘COMPAREDDATA3’)
  }
  # selection is performed once for each entry in the VALUES clause,
  # retrieving no more data than necessary
}

Example – Inappropriate use of FILTER for value-driven SELECT

PREFIX uriRoot: <http://example.com/rootOfUris#>
			
SELECT ?var1 ?filteredVar
WHERE {
  ?instanceOfClass a uriRoot:ClassName ;
      uriRoot:varName1 ?var1 ;
      uriRoot:varName2 ?var2 ;
      uriRoot:filteredVar ?filteredVar .
  FILTER(?filteredVar = 'COMPAREDDATA1')
  # filteredVar is first retrieved, then compared
}

Solution – Triple literal used to select data of a certain value

PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?var1 ?filteredVar
WHERE {
  ?instanceOfClass a uriRoot:ClassName ;
      uriRoot:varName1 ?var1 ;
      uriRoot:filteredVar 'COMPAREDDATA' .
      # data is only retrieved if filteredVar matches desired compared data upon initial retrieval
}

Beware of Cross-Product Joins

When trying to gather data from multiple classes at once, it is possible to accidentally create a cross-product join, a selection that combines the selected data in a hyper-linear way rather than simply assembling the data and returning an unprocessed set.

Example – Accidental cross-product query

PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?var1 ?var2
WHERE {
  ?instanceOfClass1 a uriRoot:ClassName1 ;
      uriRoot:varName1 ?var1 .
  ?instanceOfClass2 a uriRoot:ClassName2 ;
      uriRoot:varName2 ?var2 .
}

In the above example, the goal may have been to retrieve IDs from all instances of ClassName1 and all instances of ClassName2, for example, all of the Participants and all of the Subjects. However, the result of the query would be every combination of Participant and Subject. If there are 10 Participants and 5 Subjects, there would be 50 results rather than 15. In large data sets, this severely affects performance and puts the system under unnecessary strain.

There are two straightforward ways to separate or parameterize data to write a more performant query.

Solution 1 – Use UNION to replace the cross-product

PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?commonVar
WHERE {
  {
    ?instanceOfClass1 a uriRoot:ClassName1 ;
        uriRoot:varName1 ?var1 .
    BIND(?var1 as ?commonVar)
  } 
  UNION 
  {
    ?instanceOfClass2 a uriRoot:ClassName2 ;
        uriRoot:varName2 ?var2 .
    BIND(?var2 as ?commonVar)
  }
  # this creates an ephemeral graph that is a union of two graphs
  # in each of the two graphs, the desired data is saved under the same name
}

Solution 2 – Use VALUES to replace the cross-product

PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?commonVar
WHERE {
  ?instanceOfClass a ?classURI ;
      ?propertyURI ?commonVar .

  VALUES (?classURI ?propertyURI) {
      (uriRoot:ClassName1 uriRoot:varName1)
      (uriRoot:ClassName2 uriRoot:varName2)
  }
}

Use Subqueries when Querying Large Amounts of Data

When analyzing data, there may be a need to aggregate data and then perform a selection or derivation on the resulting aggregate. In this case, it is advisable to use one or more subselects or subqueries, where a SELECT query is included inside the WHERE clause and the remainder of the WHERE clause operates on the results of that SELECT as though that data were immediately available in the graph.

Example Solution – Aggregate a variable and then process the aggregation

PREFIX uriRoot: <http://example.com/rootOfUris#>

SELECT ?var1 ?var2Aggregation
WHERE {
  {
    SELECT ?var1 (GROUP_CONCAT(?var2) as ?var2Aggregation)
    WHERE {
        ?instanceOfClass1 a uriRoot:ClassName1 ;
            uriRoot:varName1 ?var1 .
        ?instanceOfClass2 a uriRoot:ClassName2 ;
            uriRoot:varName2 ?var2 .
    } 
    GROUP BY ?var1
  }
  # var1 and var2Aggregation are now available for the usual processing
  # while var2 is no longer available as it only existed within the subselect

  FILTER(regex(?var2Aggregation, ‘DESIREDVAR2VAL’))
  # FILTER is used for illustrative purposes, but any processing would work
}