Reading or Ingesting Remote Instance Data

Depending on the type of SPARQL query that you write, the Graph Data Interface (GDI) service can be used to ingest instance data or it can be used to analyze source data without updating the database.

Graph Data Interface service calls are processed by AnzoGraph using a Java plugin. Before running GDI service queries, make sure that the AnzoGraph cluster is configured to use the plugin. For more information, see Graph Data Interface Setup.

This topic provides details about the syntax to use when writing GDI queries and includes examples that demonstrate the data integration capabilities for different types of data sources.

GDI Query Syntax

To invoke the GDI service in a data layer, add a Query Step or View Step to the layer. For instructions on creating a Query Step, see Adding a Query Step. For instructions on creating a View Step, see Adding a View Step.

The following query syntax shows the structure of a GDI query. The clauses, patterns, and placeholders in blue are described below.

# PREFIX Clause
PREFIX s:     <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:   <http://www.w3.org/2002/07/owl#>
PREFIX anzo:  <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:  <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:    <http://purl.org/dc/elements/1.1/>

# Result Clause
[ { GRAPH ${targetGraph} { ... } ]
[ ${usingSources} ]
WHERE
{
	SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit>
    {
	   ?data a s:source_type ;
        s:connection_parameters ;
        s:input_parameters ;
        # output_parameters
        ?variable (["binding"] [datatype] ["format"]) ;
        ... ;
        .
    }
  # Additional clauses such as BIND, VALUES, FILTER
}

PREFIX Clause

The PREFIX clause declares the prefixes that are standard for all GDI service queries. You can declare additional prefixes to use in the query, but the PREFIX clause must include the following statements:

PREFIX s:      <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:    <http://www.w3.org/2002/07/owl#>
PREFIX anzo:   <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:   <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:     <http://purl.org/dc/elements/1.1/>

Result Clause

The result clause defines the type of SPARQL query to run and the set of results to return. This clause defines whether you want to read (SELECT, CONSTRUCT, or CREATE VIEW) from the remote source or ingest the data into Anzo (INSERT or CREATE  MATERIALIZED VIEW).

GRAPH ${targetGraph}

Include the GRAPH keyword and target graph parameter ${targetGraph} when writing an INSERT query to ingest data into a graphmart. Anzo automatically populates the query with the appropriate target URIs when the query runs.

${usingSources}

Include the source graph parameter ${usingSources} when writing a query that passes values from the data that is in the graphmart to the remote data source. Anzo automatically populates the query with the appropriate FROM clauses and graph URIs when the query runs. When passing literal values to the remote source, you do not need to include the source graph parameter. The SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit> description below includes more information about passing input to remote sources.

WHERE

The WHERE clause includes the required GDI SERVICE call and defines the data source, input and output parameters, and any additional clauses for ingesting or analyzing the data. The descriptions below describe the patterns in the WHERE clause.

SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit>

The SERVICE call invokes the GDI service. The optional TOPDOWN keyword is used to pass input values to the data source. When you include TOPDOWN in the service call, it indicates that the rest of the query produces values to send to the source. The GDI service makes repeated calls to pass in each of the specified values and retrieve the data that is based on those values.

source_type

The ?data a s:source_type triple pattern specifies the type of data source that the query will run against. For example, ?data a s:DbSource. The list below describes the commonly used types:

  • DbSource for any type of database.
  • FileSource for flat files. The supported file types are CSV and TSV, JSON, XML, Parquet, and SAS (SAS Transport XPT and SAS7BDAT formats) .
  • HttpSource for HTTP endpoints.

For a complete list of the supported source types, view the GDI ontology. Each data source type is represented by an owl:Class. See Getting Familiar with the Graph Data Interface for more information.

connection_parameters

The source connection parameters are the values that are required for accessing the source, such the database connection URL, path to a file source, username, password, key, token, etc. For example, the pattern below specifies the connection details for a database source:

?data a s:DbSource ;
   s:url "jdbc:postgresql://10.100.2.9:5555/kl_hosp_db?user=postgres&password=postgres123"

The example below specifies the connection details for a file-based source, a directory of CSV files:

?data a s:FileSource ;
   s:url "/opt/shared-files/sales-csv"

For a single file, specify the file name in the URL. For example, "/opt/shared-files/sas/edu_inc.sas7bdat".

input_parameters

In addition to the source connection parameters, you can include other input parameters based on the data source type. For example, the list below describes the commonly used input parameters:

  • timeout ms: Specifies the timeout (in milliseconds) to use for requests against the source. For example, s:timeout 5000.
  • refresh s: The number of seconds to wait before refreshing the source. If unspecified no caching will be performed and the source will be queried each time.
  • limit n: The maximum number of results to retrieve from the source. For example, s:limit 1000.
  • debug true: (For View Steps) Indicates that the GDI service should not cache the resulting view definition when the view is created. By default, to avoid running expensive metrics-gathering queries against the source system more than once, the GDI service caches the view definition (s:debug is false) when a view is first created. That means if you debug, modify, and re-run the View Step, the original view definition remains cached instead of being updated. Including s:debug true as part of the service call instructs the GDI service not to cache the definition.
  • selector "path": The selector is a source-specific binding component that identifies the path to the source object. For example, s:selector "currently" targets the "currently" class of data from a source. And s:selector "hourly.data" specifies a hierarchical path. For more information about the selector property and bindings, see the Deep-Dive into Hierarchical Bindings section below.

For a complete list of the supported input parameters by data source, view the GDI ontology. Each data source type is represented by an owl:Class and the related input parameters are properties in the class. See Getting Familiar with the Graph Data Interface for more information.

output_parameters

The output parameters, in ?variable (["binding"] [datatype] ["format"]) format, define the triple patterns to output. When the specified ?variable matches the source column name, the GDI uses the variable as the source data selector. If you specify an alternate variable name, then a binding needs to be specified to map the new variable to the source.

In the object position in parenthesis, you also have the option to transform the data using the datatype and format options. The list below describes each option:

  • binding: The binding is a literal value that binds a ?variable to a source column. If you specify a ?variable that matches the source column name, then that variable name is the data selector and it is not necessary to specify a binding. If you specify an alternate variable name or there is a hierarchical path to the source column, then the binding is needed to map the new variable to that source column.

    For example for a flat source like CSV, the following pattern simply binds the source column AIRLINE to the lowercase variable ?airline: ?airline ("AIRLINE"). For a database source, this example binds the ?subject variable by navigating to the SUBJECT column in the FILM table in the dbo schema: ?subject ("dbo.FILM.SUBJECT"). And for an HTTP source, this example binds the ?time variable to the time object under the minutely data path: ?time ("minutely.data.time").

    For FileSource and HttpSource, periods (.), forward slashes (/), and brackets ([ ]) are parsed as path notation. Therefore, if a source column name includes any of those characters, they must be escaped in the binding. Use two backslashes (\\) as an escape character. For example, if a column name is average/day, the variable and binding pattern could be written as ?averagePerDay ("average\\/day").

    For DbSource, database, schema, and table names in bindings are parsed according to the specific rules for that database type. You do not need to escape characters in database names. However, database names with characters that do not match (_|A-Z|a-z)(_|A-Z|a-z|0-9)* should be quoted, such as ("'Adventure.Works'.Sales.'Daily.Totals'").

  • datatype: The datatype is the data type to convert the column to. If you do not specify a data type, the GDI infers the type. See Property Range Guidelines for information about data types.
  • format: The format option is used to specify the format to use for data types such as xsd:date or xsd:dateTime. Specify days as "d," months as "M," and years as "y." For the time, specify "H" for hours, "m" for minutes, and "s" for seconds. For example, "yyyyMMdd HH:mm:ss" or "ddMMMyy" to display date values such as "01JAN19."

Deep-Dive into Hierarchical Bindings

As part of the GDI's flexibility, there are multiple ways to express binding hierarchies in GDI queries. One way is to use brackets ( [ ] ) to group triple patterns into binding trees. For example, the WHERE clause snippet below organizes output parameters into an hourly.data hierarchy:

WHERE
{
  SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit>
    {
      ?data a s:HttpSource;
        s:url "https://api.darksky.net/forecast/bdbe3f638eb908c9b94919537dad5945/30.374563,-97.975892" ;
        ?latitude (xsd:double) ;
        ?longitude (xsd:double) ;
        ?timezone (xsd:string) ;
        ?hourly [
          s:selector "hourly" ;
          ?data [
            s:selector "data" ;
            ?time (xsd:long) ;
            ?summary (xsd:string) ;
            ?nearestStormDistance (xsd:int) ;
            ?rainIntensity ("precipIntensity" xsd:double) ;
            ?rainProbability ("precipProbability" xsd:double) ;
            ?temperature (xsd:double) ;
            ?feelsLike ("apparentTemperature" xsd:double) ;
            ?humidity (xsd:double) ;
            ?pressure (xsd:double) ;
            ?windSpeed (xsd:double) ;
        ] ;
    ] .
  }
}

Each level of the binding hierarchy above has a s:selector property that is used to navigate the data. The selector is optional, however. If a selector is not specified, the output triple patterns default to the name of the variable that introduces that level of the hierarchy.

As an alternative to grouping triple patterns in bracketed trees, the s:selector property can specify a path. For example, the WHERE clause snippet below rewrites the example above to express the same hierarchy by specifying a path as the value for s:selector:

WHERE
{
  SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit>
    {
      ?data a s:HttpSource;
        s:url "https://api.darksky.net/forecast/bdbe3f638eb908c9b94919537dad5945/30.374563,-97.975892" ;
        ?latitude (xsd:double) ;
        ?longitude (xsd:double) ;
        ?timezone (xsd:string) ;
        ?hourly [
          s:selector "hourly.data" ;
          ?time (xsd:long) ;
          ?summary (xsd:string) ;
          ?nearestStormDistance (xsd:int) ;
          ?rainIntensity ("precipIntensity" xsd:double) ;
          ?rainProbability ("precipProbability" xsd:double) ;
          ?temperature (xsd:double) ;
          ?feelsLike ("apparentTemperature" xsd:double) ;
          ?humidity (xsd:double) ;
          ?pressure (xsd:double) ;
          ?windSpeed (xsd:double) ;
        ] .
   }
}

When working with schema-less (or schema-flexible) sources like JSON, you can also capture a tree of data as a JSON string. For example, using the query snippet above, if the properties under hourly were unknown, the snippet could be rewritten as follows. This query would bind all of the data below hourly to the ?hourly variable and return a JSON string representation of the properties and instance data:

WHERE
{
  SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit>
    {
      ?data a s:HttpSource;
        s:url "https://api.darksky.net/forecast/bdbe3f638eb908c9b94919537dad5945/30.374563,-97.975892" ;
        ?latitude (xsd:double) ;
        ?longitude (xsd:double) ;
        ?timezone (xsd:string) ;
        ?hourly () .
   }
}  

For example, the results look like this:

 latitude  | longitude  | timezone        | hourly                                     
-----------+------------+-----------------+----------------------------
30.374563  | -97.975892 | America/Chicago | {"summary":"\"Humid and partly cloudy
throughout the day.\"","icon":"\"partly-cloudy-day\"","data":[{"time":"1595559600",
summary":"\"Clear\"","icon":"\"clear-night\"","precipIntensity":"0",
"precipProbability":"0","temperature":"88.39","apparentTemperature":"91.72",
"dewPoint":"67.42","humidity":"0.5","pressure":"1011.7","windSpeed":"7.48",
"windGust":"16.71","windBearing":"109","cloudCover":"0.06","uvIndex":"0",
"visibility":"10","ozone":"285.2"},{"time":"1595563200","summary":"\"Clear\"",
"icon":"\"clear-night\"","precipIntensity":"2.0E-4","precipProbability":"0.01",
"precipType":"\"rain\"","temperature":"86.69","apparentTemperature":"90.1",
"dewPoint":"67.84","humidity":"0.54","pressure":"1012","windSpeed":"7.05",
"windGust":"17.56","windBearing":"110","cloudCover":"0.12","uvIndex":"0",
"visibility":"10","ozone":"284.9"},...

Similar to the example above, you can write a query that specifically captures some of the properties in a hierarchy and then returns the rest of the properties and their values as a JSON string representation. To do so, use @ as the binding path. For example:

WHERE
{
  SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit>
    {
      ?data a s:HttpSource;
      s:url "https://api.darksky.net/forecast/bdbe3f638eb908c9b94919537dad5945/30.374563,-97.975892" ;
      ?latitude (xsd:double) ;
      ?longitude (xsd:double) ;
      ?timezone (xsd:string) ;
      ?hourly [
        s:selector "hourly.data" ;
        ?time (xsd:long) ;
        ?summary (xsd:string) ;
        ?hourly_data ("@") ;
      ] .
   }
}

The results look like this:

 latitude  | longitude  | timezone        | time       | summary          | hourly_data  
-----------+------------+-----------------+------------+------------------+---------------
 30.374563 | -97.975892 | America/Chicago | 1595559600 | Clear            | {"time":"1595559600","summary":"\"Clear\"",
"icon":"\"clear-night\"","precipIntensity":"0","precipProbability":"0","temperature":"88.39",
"apparentTemperature":"91.72","dewPoint":"67.42","humidity":"0.5","pressure":"1011.7","windSpeed":"7.48",
"windGust":"16.71","windBearing":"109","cloudCover":"0.06","uvIndex":"0","visibility":"10","ozone":"285.2"}
                                       
 30.374563 | -97.975892 | America/Chicago | 1595563200 | Clear            | {"time":"1595563200","summary":"\"Clear\"",
"icon":"\"clear-night\"","precipIntensity":"2.0E-4","precipProbability":"0.01","precipType":"\"rain\"","temperature":"86.69",
"apparentTemperature":"90.1","dewPoint":"67.84","humidity":"0.54","pressure":"1012","windSpeed":"7.05","windGust":"17.56",
"windBearing":"110","cloudCover":"0.12","uvIndex":"0","visibility":"10","ozone":"284.9"}
          
 30.374563 | -97.975892 | America/Chicago | 1595566800 | Partly Cloudy    | {"time":"1595566800","summary":"\"Partly Cloudy\"",
"icon":"\"partly-cloudy-night\"","precipIntensity":"3.0E-4","precipProbability":"0.01",
"precipType":"\"rain"","temperature":"85.63","apparentTemperature":"89.21",
"dewPoint":"68.33","humidity":"0.56","pressure":"1012.6","windSpeed":"6.48","windGust":"17.92","windBearing":"110",
"cloudCover":"0.34","uvIndex":"0","visibility":"10","ozone":"284.5"}
...

GDI Query Examples

The SELECT query below also reads data from an HTTP source, the Dark Sky API, which compiles worldwide weather statistics. The API has several models available for retrieving data that is current, daily, historical, etc. To target current data, the query includes s:selector "currently" as an input parameter. In addition, the query demonstrates the use of the "topdown" functionality, where the query includes values to be input to the source to narrow the results to specific locations. The query includes the TOPDOWN keyword in the GDI service call, and the VALUES clause specifies the latitude and longitude values for the cities to return data for. In addition, since this API service requires parameters to be specified in the connection URL, the s:url value includes ?lat and ?long as parameters for the value.

PREFIX s:      <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:    <http://www.w3.org/2002/07/owl#>
PREFIX anzo:   <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:   <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:     <http://purl.org/dc/elements/1.1/>
PREFIX ex:     <http://example.org/ontologies/City#>

SELECT  
	?city ?state ?summary ?temp ?rainChance
	?humidity ?pressure ?windSpeed
WHERE
{
	SERVICE TOPDOWN <http://cambridgesemantics.com/services/DataToolkit>
    {
	  ?data a s:HttpSource ;
	  s:url "https://api.darksky.net/forecast/bdbe3f638eb908c9b94919537dad5945/{{?lat}},{{?long}}" ;
	  s:selector "currently" ;
	  ?lat ("latitude") ;
	  ?long ("longitude") ;
	  ?time () ;
	  ?summary () ;
	  ?temp ("temperature") ;
	  ?rainChance ( "precipProbability" ) ;
	  ?rainIntensity ( "precipIntensity" ) ;
	  ?humidity () ;
	  ?pressure () ;
	  ?windSpeed () ;
	  ?windGust () ;
	  ?windBearing () ;
	  ?nearestStorm ("nearestStormDistance") ;
	  ?stormBearing ("nearestStormBearing") ;
	  ?visibility () .
    }
    VALUES( ?city ?state ?lat ?long )
    {
	  ( "Lakeway" "TX" 30.374563 -97.975892 )
	  ( "Boston" "MA" 42.358043 -71.060415 )
	  ( "Seattle" "WA" 47.590720 -122.307053 )
	  ( "Chicago" "IL" 41.837741 -87.823296 )
	  ( "Hilo" "HI" 19.702040 -155.090312 )
    }
}
ORDER BY ?city

The query returns the following results:

city    | state | summary       | temp  | rainChance | humidity | pressure | windSpeed
--------+-------+---------------+-------+------------+----------+----------+-----------
Boston  | MA    | Overcast      | 79.81 |          0 |      0.6 |   1018.7 |      7.71
Chicago | IL    | Clear         |  81.7 |          0 |     0.52 |   1021.1 |      5.13
Hilo    | HI    | Partly Cloudy |  72.6 |       0.13 |     0.79 |   1018.6 |      4.86
Lakeway | TX    | Partly Cloudy | 92.43 |          0 |     0.48 |   1013.3 |     10.85
Seattle | WA    | Mostly Cloudy | 61.82 |          0 |     0.76 |   1018.2 |      4.57
5 rows

The example below ingests data into a data layer from a database source using an INSERT query in a Query Step.

PREFIX s:       <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX anzo:    <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:    <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX :        <http://example.com/ontologies/kl_hosp#>

INSERT
{
	GRAPH ${targetGraph}
    {
	?InputEvent_cv a :InputEvent_cv ;
	    :row_id ?row_id ;
	    :subject_id ?subject_id ;
	    :hadm_id ?hadm_id ;
	    :icustay_id ?icustay_id ;
	    :charttime ?charttime ;
	    :itemid ?itemid ;
	    :amount ?amount ;
	    :amountuom ?amountuom ;
	    :rate ?rate ;
	    :rateuom ?rateuom ;
	    :storetime ?storetime ;
	    :cgid ?cgid ;
	    :orderid ?orderid ;
	    :linkorderid ?linkorderid ;
	    :stopped ?stopped ;
	    :newbottle ?newbottle ;
	    :originalamount ?originalamount ;
	    :originalamountuom ?originalamountuom ;
	    :originalroute ?originalroute ;
	    :originalrate ?originalrate ;
	    :originalrateuom ?originalrateuom ;
	    :originalsite ?originalsite .
    }
}
WHERE
{
	SERVICE <http://cambridgesemantics.com/services/DataToolkit>
    {
	?data a s:DbSource ;
	    s:url "jdbc:postgresql://10.10.5.3:5555/kl_hosp_db?user=postgres&password=postgres123" ;
	    s:selector "kl_hosp_schema.inputevents_cv" ;
	    ?row_id (xsd:int) ;
	    ?subject_id (xsd:int) ;
	    ?hadm_id (xsd:int) ;
	    ?icustay_id (xsd:int) ;
	    ?charttime (xsd:dateTime) ;
	    ?itemid (xsd:int) ;
	    ?amount (xsd:float) ;
	    ?amountuom (xsd:string) ;
	    ?rate (xsd:float) ;
	    ?rateuom (xsd:string) ;
	    ?storetime (xsd:dateTime) ;
	    ?cgid (xsd:int) ;
	    ?orderid (xsd:int) ;
	    ?linkorderid (xsd:int) ;
	    ?stopped (xsd:string) ;
	    ?newbottle (xsd:int) ;
	    ?originalamount (xsd:float) ;
	    ?originalamountuom (xsd:string) ;
	    ?originalroute (xsd:string) ;
	    ?originalrate (xsd:float) ;
	    ?originalrateuom (xsd:string) ;
	    ?originalsite (xsd:string) ;
	BIND(IRI("http://example.com/inputevent_cv/{{?row_id}}") AS ?InputEvent_cv)
	BIND(IRI("http://example.com/patients/{{?subject_id}}") AS ?patient)
	BIND(IRI("http://example.com/admissions/{{?hadm_id}}") AS ?admission) 
    }
}

The following query ingests airport-related data from a CSV file.

PREFIX s:       <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX anzo:    <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:    <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>

INSERT 
{
  GRAPH ${targetGraph}
  { 
	?code a <http://anzograph.com/airport> ;
	  <http://anzograph.com/airport/name> ?name ;
	  <http://anzograph.com/airport/city> ?city ;
	  <http://anzograph.com/airport/state> ?state ;
	  <http://anzograph.com/airport/latitude> ?lat;
	  <http://anzograph.com/airport/longitude> ?long.
  }
}
WHERE
{
	SERVICE <http://cambridgesemantics.com/services/DataToolkit>
    {
	?data a s:FileSource ;
		s:url "/opt/shared-files/airports.csv" ;
		?iata_code ("IATA_CODE" xsd:string) ;
		?name ("AIRPORT" xsd:string) ;
		?city ("CITY" xsd:string) ;
		?state ("STATE" xsd:string) ;
		?lat ("LATITUDE" xsd:double) ;
		?long ("LONGITUDE" xsd:double).
	BIND(IRI("http://anzograph.com/airport/{{?iata_code}}") as ?code)
    }
}
Related Topics