Querying an Elasticsearch Source

This topic provides details about the structure to use when writing GDI queries to read or ingest data from Elasticsearch data sources. It also includes example queries that may be useful as a starting point for writing your own GDI queries.

Query Syntax

The following query syntax shows the structure of a GDI query for Elasticsearch sources. The clauses, patterns, and placeholders are described below.

# PREFIX Clause
PREFIX s:     <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX es:    <http://elastic.co/search/>
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:   <http://www.w3.org/2002/07/owl#>
PREFIX anzo:  <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:  <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:    <http://purl.org/dc/elements/1.1/>

# Result Clause
{ 
   [ GRAPH ${targetGraph} { ] 
   triple_patterns
 [ } ]
}
[ ${usingSources} ]

WHERE
{
   # SERVICE Clause: Include the following service call when reading or inserting data.
    SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit>

   # View SERVICE Clause: Or use the service call below when constructing a view.
    SERVICE <http://cambridgesemantics.com/services/DataToolkitView>(${targetGraph})

    { 
      ?data a s:ElasticSource ;
        s:url "string" ;
        [ s:username "string" ; ]
        [ s:password "string" ; ]
        [ s:property [ s:name "string" ; s:value "string" ] ; ]
        [ es:aggregations [ rdf_list ] ; ]
        [ es:config "string" ; ]
        [ es:document "string" ; ]
        [ es:field "string" | ?variable ; ]
        [ es:highlight [ rdf_list ] ; ]
        [ es:html boolean ; ]
        [ es:index "string" ; ]
        [ es:minScore float ; ]
        [ es:query "string" | [ rdf_list ] ; ]
        [ es:routing "string" ; ]
        [ es:searchAfter [ rdf_list ] ; ]
        [ es:size int ; ]
        [ es:source boolean | [ rdf_list ] ; ]
        [ s:timeout int ; ]
        [ s:batching boolean | int ; ]
        [ s:paging [ pagination_options ] ; ]
        [ s:concurrency int | [ list_of_properties ] ; ]
        [ s:rate int | "string" ; ]
        [ s:partitionBy "string" | ?variable ; ]
        [ s:locale "string" ; ]
        [ s:sampling int ; ]
        [ s:selector "string" | [ list ] ; ]
        [ s:model "string" ; ]
        [ s:key ("string") ; ]
        [ s:reference [ s:model "string" ; s:using ("string") ] ; ]
        [ s:formats [ datatype_formatting_options ] ; ]
        [ s:normalize boolean | [ normalization_rules ] ; ]
        [ s:count ?variable ; ]
        [ s:offset int ; ]
        [ s:orderBy "string" | ?variable ; ]
        [ s:limit int ; ]
        # Mapping variables
        ?mapping_variable ( [ "binding" ] [ datatype ] [ "datetime_format" ] ) ;
        ... ;
        .
     # Additional clauses such as BIND, VALUES, FILTER
   }
}

For readability, the parameters below exclude the base URIs <http://cambridgesemantics.com/ontologies/DataToolkit#> and <http://elastic.co/search/> as well as the s: and es: prefixes. As shown in the examples, however, the prefixes or full property URIs do need to be included in queries.

Option Data Type Description
PREFIX Clause N/A The PREFIX clause declares the standard and custom prefixes for GDI service queries against Elasticsearch. Generally, queries include the following prefixes (or a subset of them) plus any data-specific declarations:
PREFIX s:    <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX es:   <http://elastic.co/search/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX anzo: <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl: <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:   <http://purl.org/dc/elements/1.1/>
Result Clause N/A The result clause defines the type of SPARQL query to run and the set of results to return, i.e., whether you want to read (SELECT or CONSTRUCT) from the source or ingest the data into Anzo (INSERT).
GRAPH ${targetGraph} N/A Include the GRAPH keyword and target graph parameter ${targetGraph} when writing an INSERT query to ingest data into a graphmart. Anzo automatically populates the query with the appropriate target URIs when the query runs.
${usingSources} N/A Include the source graph parameter ${usingSources} when writing a "topdown" query that passes values from the data that is in the graphmart to the data source. Anzo automatically populates the query with the appropriate FROM clauses when the query runs. When passing literal values to the remote source, you do not need to include the source graph parameter. The SERVICE Clause description below includes more information about passing input to data sources.
SERVICE Clause N/A
Include the SERVICE call SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit> to invoke the GDI service when you are running a SELECT, INSERT, or CONSTRUCT query that is not creating a view. When writing a CONSTRUCT query in a View Step, use the DataToolkitView service call, as described below in View SERVICE Clause.

Include the optional TOPDOWN keyword when you want to pass input values from the graphmart to the data source. When you include TOPDOWN in the service call, it indicates that the rest of the query produces values to send to the source. In this case, the GDI makes repeated calls to pass in each of the specified values and retrieve the data that is based on those values.

View SERVICE Clause N/A
When writing a CONSTRUCT query that creates a view of the data (usually in a View Step), include the following SERVICE call: SERVICE <http://cambridgesemantics.com/services/DataToolkitView>(${targetGraph}). Using the DataToolkitView call optimizes query execution because it tells the GDI to inspect the query and determine which filters to push to the data source. It also limits the result set and retrieves only the data that is needed, i.e., the source data is fully mapped, but not all of the mapped data is necessarily returned.
url string This property specifies the URL to use to access the source.

For security, it is a best practice to reference connection information (such as the url, username, and password) from a Query Context so that the sensitive details are abstracted from any requests. In addition, using a Query Context makes connection details reusable across queries. See Using Query Contexts in Queries for more information. For example:

?data a s:ElasticSource ;
  s:url "{{@es.hostname}}:{{@es.port}}" ; 
  s:username "{{@es.username}}" ;
  s:password "{{@es.password}}" ;
username string This property lists the user name to use for the connection to Elasticsearch.

If you want to group the username and password properties, you can wrap them with s:credentials [ ]. For example:

s:credentials [
  s:username "username" ;
  s:password "password" ;
] ;
password string
This property lists the password for the given username.
property RDF list This property can be included to list any source-specific configuration values.
s:property [ s:name "custom_property_name" ; s:value "custom_value" ]
aggregations RDF list
You can include this property to calculate aggregations over the specified bindings. For information about aggregations, see Aggregations in the Elasticsearch documentation.
config string
To enable you to use explicit mappings, you can include this property to specify the URL to the index configuration file to employ. For example, es:config "/opt/shared/elastic/mapping.json".
document string
This property lists the document(s) to search.
field string or variable
This property defines the field to operate on. The value can be a string or bound variable.
highlight RDF list
You can include this property to define how results are highlighted. For information about the available properties, see Highlighting Elasticsearch Results.
html boolean
This property controls whether to output HTML for highlighted results. Defaults to true.
index string
This property can be included to specify the index to search.
minScore float
This property defines the minimum score for matching documents. Documents with a lower score are not included in the search results.
query string or RDF list This property defines the query to execute. The value can be a string or a query object that maps to the Elasticsearch Query DSL. To generate the final query, the GDI combines es:query with any filters it can push to the Elasticsearch DSL. For more information about the query property and mapping Elasticsearch filters to SPARQL FILTER clauses, see Query DSL and Filter Mapping below.
routing string
This property can be included to route a document to a specific shard or to limit the search to a particular shard.
searchAfter RDF list
You can include this property to define the key values to start searching from.
size int
This property maps to the size parameter in the Elasticsearch Search API and configures the batch size or maximum number of hits to return in a single call. Defaults to 10 and typically does not need to be changed.
source boolean or RDF list
This property can be included to specify the source data to include in results. The value can be a boolean, list of fields, or a list of variable bindings. When true, all source data is returned. When false, no source data is returned.
timeout int
This property can be used to specify the timeout (in milliseconds) to use for requests against the source. For example, s:timeout 5000 configures a 5-second timeout.
batching boolean or int
This property can be used to disable batching or to change the default batch size. By default, batching is set to 5000 (s:batching 5000). To disable batching, include s:batching false in the query. Users typically do not change the batch size, but controlling it can be useful when performing updates. To configure the size, include s:batching int in the query. For example, s:batching 3000.
paging RDF list
This property can be used to configure paging so that the GDI can access large amounts of data across a number of smaller requests. For details about the paging property, see Pagination Options.
concurrency int or RDF list
This property can be included to configure the maximum level of concurrency for the query. The value can be an integer, such as s:concurrency 8. If the value is an integer, it configures a maximum limit on the number of slices that can execute the query. For finer-grained control over the number of nodes and slices to use, concurrency can also be included as an object with limit, nodes, and/or executorsPerNode properties. For example, the following object configures a concurrency model that allows a maximum of 24 executors distributed across 4 nodes with 8 executors per node:
s:concurrency [
  s:limit 24 ;
  s:nodes 4 ;
  s:executorsPerNode 8 ;
] ;
rate int or string
This property can be included to control the frequency with which a request is sent to the source. The limit applies to the number of requests a single slice can make. If you specify an integer for the rate, then the value is treated as the maximum number of requests to issue per minute. If you specify a string, you have more flexibility in configuring the rate. The sample values below show the types of values that are supported:
s:rate "90/minute" ;
s:rate "90 per minute" ;
s:rate "200000 every week" ;
s:rate "10000 every 6 hours" ;

To enforce the rate limit, the GDI introduces a sleep between requests that is equal to the rate delay. The more executing slices, the longer the rate delay needs to be to enforce the limit in aggregate.

Given the example of s:rate "90/minute", the GDI would optimize the concurrency and use only 1 slice for execution, with a rate delay of about 666 ms between requests. With s:rate "240/minute", the GDI would use 3 executors with a rate delay of 750 ms between requests.
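
The arithmetic behind the two examples above can be sketched in Python. This is only an illustration of the relationship between the rate, the number of executing slices, and the delay; the function name rate_delay_ms is hypothetical, and how the GDI actually chooses the number of slices is not modeled here.

```python
def rate_delay_ms(requests_per_minute: int, slices: int) -> float:
    """Delay each slice waits between requests so that all slices
    together stay at or below the per-minute limit (assumed model)."""
    if requests_per_minute <= 0 or slices <= 0:
        raise ValueError("rate and slice count must be positive")
    # Each slice gets an equal share of the overall request budget.
    per_slice_rate = requests_per_minute / slices
    # 60,000 ms in a minute, divided by the requests one slice may send.
    return 60_000 / per_slice_rate


print(int(rate_delay_ms(90, 1)))   # 666
print(int(rate_delay_ms(240, 3)))  # 750
```

Under this model, adding slices lengthens the per-slice delay, which matches the statement that more executing slices require a longer rate delay to enforce the limit in aggregate.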

partitionBy string, variable, list
The GDI attempts to partition queries automatically across the available cores (slices) in AnzoGraph. To determine how to partition the query, the GDI uses metadata from the source database. It looks for any column in an index, preferring the primary key column if it is interpolable. However, it only considers the first column in any index on the table. After determining the partition column, the GDI does a MIN/MAX on the column as well as a basic sizing query. To specify which column or columns the GDI should partition on, you can include the partitionBy property in the query. The property supports a list of source field names, bound variables, or the object s:auto, which forces the GDI to partition the data when the source does not define partitioning metadata.
locale string
This property can be used to specify the locale to use when parsing locale-dependent data such as numbers, dates, and times.
sampling int
This property can be used to configure the number of records in the source to examine for data type inferencing.
selector string or RDF list
This property can be used as a binding component to identify the path to the source objects. For example, s:selector "Sales.SalesOrderHeader" targets the SalesOrderHeader table in the Sales schema. For more information about binding components and the selector property, see Using Binding Trees and Selector Paths.
model string
This property defines the class (or table) name for the type of data that is generated from the specified data source. For example, s:model "employees". Model is optional when querying a single source. If your query targets multiple sources, however, and you want to define resource templates (primary keys) and object properties (foreign keys), you must specify the model value for each source.
key string
This property can be used to define the primary key column for the source file or table. This column is leveraged in a resource template for the instances that are created from the source. For example, s:key ("EMPLOYEE_ID"). For more information about key, see Data Linking Options.
reference RDF list
This property can be used to specify a foreign key column. The reference property is an RDF list that includes the model property to list the target table and a using property that defines the foreign key column. For more information about reference, see Data Linking Options.
formats RDF list
To give users control over the data types that are used when coercing strings to other types, this property can be included in GDI queries to define the desired types. In addition, it can be used to describe the formats of date and time values in the source to ensure that they are recognized and parsed to the appropriate date, time, and/or dateTime values. For details about the formats property, see Data Type Formatting Options.
normalize boolean or RDF list
To give users control over the labels and URIs that are generated, the GDI offers several options for normalizing the model and/or the fields that are created from the specified data source(s). For details about the normalize property, see Normalization Options.
count variable If you want to turn the query into a COUNT query, you can include this property with a ?variable to perform a count. For example, s:count ?count. The GDI runs an Elasticsearch value count aggregation.
offset int
This property can be used to offset the data that is returned by a number of rows.
orderBy string, variable, list
You can include this property to order the result set by a field name, a bound variable, or a list of names or bound variables.
limit int
You can include this property to limit the number of results that are returned. s:limit maps to the SPARQL LIMIT clause.
mapping_variable variable
The mapping variables, in ?mapping_variable (["binding"] [datatype] ["datetime_format"]) format, define the triple patterns to output. When the specified ?variable matches the source column name, the GDI uses the variable as the source data selector. If you specify an alternate variable name, a binding needs to be specified to map the new variable to the source. You also have the option to transform the data using the datatype and datetime_format options.

The parentheses around the binding, data type, and format specifications are not required but are included in this document for readability.

binding string
The binding is a literal value that binds a ?mapping_variable to a source column. If you specify a ?variable that matches the source column name, then that variable name is the data selector and it is not necessary to specify a binding. If you specify an alternate variable name or there is a hierarchical path to the source column, then the binding is needed to map the new variable to that source column.
datatype URI
The datatype is the data type to convert the column to. If you do not specify a data type, the GDI infers the type. The GDI supports the following types:
  • xsd:int
  • xsd:long
  • xsd:float
  • xsd:double
  • xsd:boolean
  • xsd:time
  • xsd:dateTime
  • xsd:date
  • xsd:gMonthDay
  • xsd:gYearMonth
  • xsd:duration
  • xsd:dayTimeDuration
  • xsd:yearMonthDuration
  • xsd:gMonth
  • xsd:anyURI
datetime_format string
This option is used to specify the format to use for date and time data types. The GDI supports Java date and time formats. Specify days as "d," months as "M," and years as "y." For the time, specify "H" for hours, "m" for minutes, and "s" for seconds. For example, "yyyyMMdd HH:mm:ss" or "ddMMMyy" to display date values such as "01JAN19."

The GDI's default base year is 2000. If the source data has years with only two digits, such as 02-04-99, the GDI prepends 20 to the digits. The value 02-04-99 is parsed to 02-04-2099. To specify an alternate base year to use for two-digit values, you can include the notation ^nnnn (e.g., ^1900) in the format value. For example, to set the base year to 1900 instead of 2000, use a format value such as xsd:date "dd-MMM-yy^1900" or xsd:date "dd-MMM-yy^1990". When one of those values is specified, 02-04-99 is parsed to 02-04-1999.
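
The base-year behavior described above can be sketched as follows. This is a minimal illustration, assuming that the base year defines a 100-year window that starts at the base year (the same idea as a reduced-value year field in Java date formatting); the source only confirms the results for the value 99, and the function name resolve_two_digit_year is hypothetical.

```python
def resolve_two_digit_year(two_digit: int, base_year: int = 2000) -> int:
    """Return the year in [base_year, base_year + 99] that ends in two_digit.

    Assumption: the ^nnnn notation selects the start of a 100-year window.
    """
    if not 0 <= two_digit <= 99:
        raise ValueError("expected a two-digit year")
    # Start from the century of the base year, then add the two digits.
    candidate = base_year - base_year % 100 + two_digit
    # If that falls before the base year, move to the next century.
    if candidate < base_year:
        candidate += 100
    return candidate


print(resolve_two_digit_year(99))                  # 2099 (default base year 2000)
print(resolve_two_digit_year(99, base_year=1900))  # 1999
print(resolve_two_digit_year(99, base_year=1990))  # 1999
```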

Query DSL and Filter Mapping

The vocabulary used in GDI queries against an ElasticSource closely mimics the Elasticsearch Query DSL. The example below shows a DSL query, followed by the same query mapped to SPARQL using the es:query property:

{
 "query": {
  "bool" : {
   "must" : {
    "term" : { "user.id" : "kimchy" }
   },
   "filter": { 
    "term" : { "tags" : "production" }
   },
   "must_not" : {
    "range" : {
     "age" : { "gte" : 10, "lte" : 20 }
    }
   },
   "should" : [
    { "term" : { "tags" : "env1" } },
    { "term" : { "tags" : "deployed" } }
   ],
   "minimum_should_match" : 1,
   "boost" : 1.0
  }
 }
}
es:query [
  a es:BoolQuery ;
  es:must [
   a es:TermQuery ;
   es:field "user.id" ;
   es:value "kimchy" ;
  ] ;
  es:filter [
   a es:TermQuery ;
   es:field "tags" ;
   es:value "production" ;
  ] ;
  es:mustNot [ 
   a es:RangeQuery ;
   es:field "age" ;
   es:gte 10 ;
   es:lte 20 ;
  ] ;
  es:should (
   [ a es:TermQuery ; es:field "tags" ; es:value "env1" ]
   [ a es:TermQuery ; es:field "tags" ; es:value "deployed" ]
  ) ;
  es:minimumShouldMatch 1 ;
  es:boost 1.0 ;
] ;
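
Comparing the two forms above, the keys of the bool query appear to follow a simple naming pattern: snake_case DSL keys become camelCase properties with the es: prefix. The Python sketch below illustrates that pattern only. It is inferred from this one example rather than from a documented rule, the function name dsl_key_to_property is hypothetical, and leaf queries such as term and range are not covered because they map to typed objects (es:TermQuery, es:RangeQuery) with es:field and es:value properties instead.

```python
def dsl_key_to_property(key: str) -> str:
    """Convert a snake_case DSL key to the camelCase es: property name
    used in the example above (naming pattern is an assumption)."""
    head, *rest = key.split("_")
    return "es:" + head + "".join(part.capitalize() for part in rest)


for key in ("must", "filter", "must_not", "should", "minimum_should_match", "boost"):
    print(key, "->", dsl_key_to_property(key))
```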

The following example SERVICE clause with comments provides details about how the GDI es:query property can be mapped to DSL:

SERVICE <http://cambridgesemantics.com/services/DataToolkit> {
  ?data a es:ElasticSource ;
    s:url "http://localhost:9200/" ;
# When the value of es:query is a simple literal,
# it is mapped to an Elastic query string query.
    es:query "literal" ;

# When the value of es:query is an RDF list,
# you can specify other query types,
# such as a match query:
    es:query [ 
      a es:MatchQuery ;
      es:field "title" | ?title ; # field can be a literal or bound variable
      es:query "moby dick" ;
    ] ;
# or a boolean query:
    es:query [ 
      a es:BoolQuery ;
      es:should ([
        a es:RangeQuery ;
        es:field ?amount ;
        es:gt 500 ;
        es:lt 1000 ;
      ] [
        a es:TermQuery ;
        es:field ?status ;
        es:value 'late' ;
      ]) ;
    ] ;
}

Filter Mapping

Filtering can be performed inside the es:query list, or you can add a FILTER clause to the query. For example, the SPARQL snippet above is repeated below, followed by the same logic expressed as a FILTER clause.

es:query [
  a es:BoolQuery ;
  es:must [
   a es:TermQuery ;
   es:field "user.id" ;
   es:value "kimchy" ;
  ] ;
  es:filter [
   a es:TermQuery ;
   es:field "tags" ;
   es:value "production" ;
  ] ;
  es:mustNot [ 
   a es:RangeQuery ;
   es:field "age" ;
   es:gte 10 ;
   es:lte 20 ;
  ] ;
  es:should (
   [ a es:TermQuery ; es:field "tags" ; es:value "env1" ]
   [ a es:TermQuery ; es:field "tags" ; es:value "deployed" ]
  ) ;
  es:minimumShouldMatch 1 ;
  es:boost 1.0 ;
] ;
FILTER(?user_id = "kimchy" &&
       ?tags = "production" &&
       !(?age >= 10 && ?age <= 20) &&
       (?tags = "env1" || ?tags = "deployed"))

The table below shows each of the supported ElasticSource FILTER translations. The GDI translates only expressions that match one of the forms listed. If an expression has the form value <= ?field, the inequality is flipped to ?field >= value before translating.

es:query Expression FILTER Clause Expression
es:query [ a es:BoolQuery ; es:mustNot expr ] !expr
es:query [ a es:BoolQuery ; es:must (left right) ] left && right
es:query [ a es:BoolQuery ; es:should (left right) ] left || right
es:query [ a es:RangeQuery ; es:field ?field ; es:lt value ] ?field < value
es:query [ a es:RangeQuery ; es:field ?field ; es:lte value ] ?field <= value
es:query [ a es:TermQuery ; es:field ?field ; es:value value ] ?field = value
es:query [ a es:BoolQuery ; es:mustNot [ a es:TermQuery ; es:field ?field ; es:value value ] ] ?field != value
es:query [ a es:RangeQuery ; es:field ?field ; es:gte value ] ?field >= value
es:query [ a es:RangeQuery ; es:field ?field ; es:gt value ] ?field > value
es:query [ a es:QueryStringQuery ; es:field ?field ; es:query pattern ; es:defaultOperator "AND" ] REGEX(?field, pattern, "q")
es:query [ a es:TermsQuery ; es:field ?field ; es:value value, ... ] ?field IN (value, ...)
es:query [ a es:BoolQuery ; es:mustNot [ a es:TermsQuery ; es:field ?field ; es:value value, ... ] ] ?field NOT IN (value, ...)
es:query [ a es:MatchQuery ; es:field ?field ; es:query value ; es:lenient true ] CONTAINS(?field, value)
es:query [ a es:PrefixQuery ; es:field ?field ; es:value value ] STRSTARTS(?field, value)
es:query [ a es:ExistsQuery ; es:field ?field ] BOUND(?field)
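
To make the comparison rows of the table concrete, the Python sketch below normalizes a simple comparison so that the variable is on the left, and then builds the Elasticsearch Query DSL fragment that corresponds to the matching es:query expression. It is an illustration of the mapping, not the GDI's implementation. The function names are hypothetical, the field name is assumed to be the variable name without the leading question mark, and only the comparison operators are covered.

```python
import json

# Operator to use after swapping the two sides of a comparison.
FLIPPED = {"<": ">", "<=": ">=", ">": "<", ">=": "<=", "=": "=", "!=": "!="}
# Range operators and their Elasticsearch range query keys.
RANGE_KEYS = {"<": "lt", "<=": "lte", ">": "gt", ">=": "gte"}


def is_variable(term) -> bool:
    """Treat strings that start with '?' as SPARQL variables (sketch only)."""
    return isinstance(term, str) and term.startswith("?")


def normalize(left, op, right):
    """Put the ?variable on the left, flipping the operator if needed."""
    if is_variable(right) and not is_variable(left):
        return right, FLIPPED[op], left
    return left, op, right


def to_dsl(left, op, right) -> dict:
    """Build the DSL fragment for a single comparison."""
    var, op, value = normalize(left, op, right)
    field = var.lstrip("?")
    if op in RANGE_KEYS:
        return {"range": {field: {RANGE_KEYS[op]: value}}}
    if op == "=":
        return {"term": {field: value}}
    if op == "!=":
        return {"bool": {"must_not": {"term": {field: value}}}}
    raise ValueError(f"unsupported operator: {op}")


print(json.dumps(to_dsl(10, "<=", "?age")))
print(json.dumps(to_dsl("?tags", "!=", "env1")))
```

The first call shows the flip: 10 <= ?age is rewritten as ?age >= 10 and becomes a range query with gte.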

Query Examples

General Query

The following example queries any Elasticsearch indexes that are loaded in the graphmart for which you run the query. No configuration is needed because Anzo manages the indexes that it loads and uses predictable naming conventions and aliases.

PREFIX docm: <http://cambridgesemantics.com/ontologies/2011/07/DocumentMetadata#>
PREFIX es: <http://elastic.co/search/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX s: <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX services: <http://cambridgesemantics.com/services/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?title
WHERE {
  SERVICE <http://cambridgesemantics.com/services/DataToolkit> {
    ?data a es:ElasticSource;
      s:selector "unstructuredfile" ;
      es:query "string" ; # input the text search string
      es:field "fullText" ;
      ?title (xsd:string) .
  }
}

Aggregations

The following example query performs terms aggregations.

PREFIX es: <http://elastic.co/search/>
PREFIX s: <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT *
WHERE {
  SERVICE <http://cambridgesemantics.com/services/DataToolkit> {
    ?data a es:ElasticSource ;
      s:url "https://{{@es.hostname}}:{{@es.port}}/" ;
      s:username "{{@es.username}}" ;
      s:password "{{@es.password}}" ;
      es:index "templated_consumption_es" ;
      es:query "*ELM*" ;
      ?instance () ;
      es:aggregations [
        ?artifactTypes [
          a es:TermsAggregation ;
          es:field ?artifactType ;
          es:meta [
            ?label "artifactType" ;
          ] ;
          ?value () ;
          ?count () ;
        ] ;
        ?fileTypes [
          a es:TermsAggregation ;
          es:field ?fileType ;
          es:meta [
            ?label "fileType" ;
          ] ;
          ?value () ;
          ?count () ;
        ] ;
        ?managedBys [
          a es:TermsAggregation ;
          es:field ?managedBy ;
          es:meta [
            ?label "managedBy" ;
          ] ;
          ?value () ;
          ?count () ;
        ] ;
      ] .
  }
}

Highlighting

The following example configures highlighting for fragments from the actor field.

PREFIX s: <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX es: <http://elastic.co/search/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT *
WHERE {
  SERVICE TOPDOWN <http://cambridgesemantics.com/services/DataToolkit>
  {
    ?data a es:ElasticSource ;
      s:url "http://localhost:9200/" ;
      es:index "films" ;
      es:html false ;
      es:query "Clint" ;
      es:field ?actor, ?director ;
      es:highlight [
        es:field ?actor ;
        es:type "plain" ;
        es:fragmentSize 200 ;
        es:numberOfFragments 10 ;
        es:preTags "<mark hit='true'>" ;
        es:postTags "</mark>" ;
      ] ;
      s:selector "film" ;
      ?actor (xsd:string) ;
      ?awards (xsd:string) ;
      ?director (xsd:string) ;
      ?image (xsd:string) ;
      ?length (xsd:long) ;
      ?popularity (xsd:long) ;
      ?subject (xsd:string) ;
      ?title (xsd:string) ;
      ?year (xsd:long) ;
      ?score () ;
      ?id () ;
      ?highlights [
        ?field () ;
        ?fragment () ;
      ] .
    FILTER(?year = 1990 || ?length > 103)
    FILTER(REGEX(?title, "Manhattan", "q") || REGEX(?subject, "Comedy", "q") || REGEX(?subject, "Drama", "q"))
  }
}