Reading or Ingesting Instance Data

Depending on the type of SPARQL query that you write, the Graph Data Interface (GDI) service can be used to ingest instance data through an extract, load, and transform process or it can be used to analyze source data without updating the database. When you query a source such as a database, the GDI service leverages that source to retrieve only the data that it needs for the query. Unlike a JDBC driver, the GDI service does not need to retrieve all values and then complete an often time-consuming step to filter the results.

This topic provides details about the syntax to use when writing GDI queries and includes examples that demonstrate the data integration capabilities for different types of data sources.

Graph Data Interface Query Syntax

The following query syntax shows the structure of a GDI query. The table below the query describes the parts of the query and defines the placeholder elements in bold.

PREFIX s:      <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:	<http://www.w3.org/2002/07/owl#>
PREFIX anzo:	<http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:	<http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:	<http://purl.org/dc/elements/1.1/>

# the result clause: the type of query to run
# and set of results to return
SELECT | CONSTRUCT | INSERT | CREATE [OR REPLACE] [MATERIALIZED] VIEW ...
WHERE
{
	SERVICE [TOPDOWN] <http://cambridgesemantics.com/services/DataToolkit>
    {	

	?data a s:source_type ;
		s:connection_parameters ;
		s:input_parameters ;
		# the output parameters: variable bindings and transformations,
		# such as column name or data type conversions
		?variable (["binding"] [datatype] ["format"]) ;
		... ;
		.
	# additional clauses such as VALUES or FILTER
	...
    }
}
Clause or Option Description
PREFIX Clause The PREFIX clause shown above declares the prefixes that are common or standard for all GDI service queries. You can declare additional prefixes to use in the query.
Result Clause The result clause defines the type of SPARQL query to run and the set of results to return. This clause defines whether you want to read (SELECT or CONSTRUCT) from the remote source, ingest the data into AnzoGraph (INSERT or CREATE MATERIALIZED VIEW), or create a "virtual" data source or non-materialized view (CREATE VIEW).

The Graph Data Interface Query Examples section below includes more information and examples of each type of query.

WHERE Clause The WHERE clause includes the required GDI SERVICE call and defines the data source, input and output parameters, and any additional clauses for ingesting or analyzing the data. The rows below describe the patterns in the WHERE clause.
SERVICE Call The following SERVICE call invokes the GDI service:
SERVICE [TOPDOWN] <http://cambridgesemantics.com/services/DataToolkit>

The optional TOPDOWN keyword is used to pass input values to the data source. When you include TOPDOWN in the service call, it indicates that the rest of the query produces values to send to the source. The GDI service makes repeated calls to pass in each of the specified values and retrieve the data that is based on those values.

source_type The ?data a s:source_type triple pattern specifies the type of data source that the query will run against. For example, ?data a s:DbSource. The list below describes the commonly used types:
  • DbSource for any type of database.
  • FileSource for flat files.
  • HttpSource for HTTP endpoints.

For a complete list of the supported source types, view the GDI ontology. Each data source type is represented by an owl:Class. See Graph Data Interface Model for more information.

connection_parameters The source connection parameters are the values that are required for accessing the source, such the database connection URL, path to a file source, username, password, key, token, etc. The connection details can be referenced from a Query Context or specified in the query. For more information about contexts, see Using Query Contexts.

For example, the pattern below specifies the connection details for a database source:

?data a s:DbSource ;
   s:url "jdbc:postgresql://10.100.2.9:5555/kl_hosp_db?
user=postgres&password=postgres123"

The example below specifies the connection details for a file-based source, a directory of CSV files:

?data a s:FileSource ;
   s:url "/opt/shared-files/sales-csv"

For a single file, specify the filename in the URL. For example, "/opt/shared-files/sas/edu_inc.sas7bdat".

input_parameters In addition to the source connection parameters, you can include other input parameters based on the data source type. For example, the list below describes the commonly used input parameters:
  • timeout ms: Specifies the timeout (in milliseconds) to use for requests against the source. For example, s:timeout 5000.
  • refresh s: The number of seconds to wait before refreshing the source. If unspecified no caching will be performed and the source will be queried each time.
  • limit n: The maximum number of results to retrieve from the source. For example, s:limit 1000.
  • debug true: (For CREATE VIEW queries) Indicates that the GDI service should not cache the resulting view definition when the view is created. By default, to avoid running expensive metrics-gathering queries against the source system more than once, the GDI service caches the view definition (s:debug is false) when a view is first created. That means if you debug, modify, and re-run the view creation query, the original view definition remains cached instead of being updated. Including s:debug true as part of the service call instructs the GDI not to cache the definition.
  • selector "path": The selector is a source-specific binding component that identifies the path to the source object. For example, s:selector "currently" targets the "currently" class of data from a source. And s:selector "hourly.data" specifies a hierarchical path. For more information about the selector property and bindings, see the Deep-Dive into Hierarchical Bindings section below.

For a complete list of the supported input parameters by data source, view the GDI ontology. Each data source type is represented by an owl:Class and the related input parameters are properties in the class. See Graph Data Interface Model for more information.

?variable () The output parameters, in ?variable (["binding"] [datatype] ["format"]) format, define the triple patterns to output. When the specified ?variable matches the source column name, the GDI uses the variable as the source data selector. If you specify an alternate variable name, then a binding needs to be specified to map the new variable to the source.

If you are unsure about the exact column names or other source details, you can run a query that returns metadata information. For details, see Reading Data Source Metadata.

In the object position in parenthesis, you also have the option to transform the data using the datatype and format options. The list below describes each option:

  • binding: The binding is a literal value that binds a ?variable to a source column. If you specify a ?variable that matches the source column name, then that variable name is the data selector and it is not necessary to specify a binding. If you specify an alternate variable name or there is a hierarchical path to the source column, then the binding is needed to map the new variable to that source column.

    For example for a flat source like CSV, the following pattern simply binds the source column AIRLINE to the lowercase variable ?airline: ?airline ("AIRLINE"). For a database source, this example binds the ?subject variable by navigating to the SUBJECT column in the FILM table in the dbo schema: ?subject ("dbo.FILM.SUBJECT"). And for an HTTP source, this example binds the ?time variable to the time object under the minutely data path: ?time ("minutely.data.time").

    For FileSource and HttpSource, periods (.), forward slashes (/), and brackets ([ ]) are parsed as path notation. Therefore, if a source column name includes any of those characters, they must be escaped in the binding. Use two backslashes (\\) as an escape character. For example, if a column name is average/day, the variable and binding pattern could be written as ?averagePerDay ("average\\/day").

    For DbSource, database, schema, and table names in bindings are parsed according to the specific rules for that database type. You do not need to escape characters in database names. However, database names with characters that do not match (_|A-Z|a-z)(_|A-Z|a-z|0-9)* should be quoted, such as ("'Adventure.Works'.Sales.'Daily.Totals'").

  • datatype: The datatype is the data type to convert the column to. If you do not specify a data type, the GDI infers the type. See AnzoGraph Data Type Handling for information about data types.
  • format: The format option is used to specify the format to use for data types such as xsd:date or xsd:dateTime. Specify days as "d," months as "M," and years as "y." For the time, specify "H" for hours, "m" for minutes, and "s" for seconds. For example, "yyyyMMdd HH:mm:ss" or "ddMMMyy" to display date values such as "01JAN19."

Deep-Dive into Hierarchical Bindings

As part of the GDI's flexibility, there are multiple ways to express binding hierarchies in GDI queries. One way is to use brackets ( [ ] ) to group triple patterns into binding trees. For example, the query snippet below organizes output parameters into an hourly.data hierarchy:

?data a s:HttpSource;
    s:url "https://api.darksky.net/forecast/bdbe3f638eb908c9b94919537dad5945/30.374563,-97.975892" ;
    ?latitude (xsd:double) ;
    ?longitude (xsd:double) ;
    ?timezone (xsd:string) ;
    ?hourly [
        s:selector "hourly" ;
        ?data [
            s:selector "data" ;
            ?time (xsd:long) ;
            ?summary (xsd:string) ;
            ?nearestStormDistance (xsd:int) ;
            ?rainIntensity ("precipIntensity" xsd:double) ;
            ?rainProbability ("precipProbability" xsd:double) ;
            ?temperature (xsd:double) ;
            ?feelsLike ("apparentTemperature" xsd:double) ;
            ?humidity (xsd:double) ;
            ?pressure (xsd:double) ;
            ?windSpeed (xsd:double) ;
        ] ;
    ] .

Each level of the binding hierarchy above has a s:selector property that is used to navigate the data. The selector is optional, however. If a selector is not specified, the output triple patterns default to the name of the variable that introduces that level of the hierarchy.

As an alternative to grouping triple patterns in bracketed trees, the s:selector property can specify a path. For example, the snippet below rewrites the example above to express the same hierarchy by specifying a path as the value for s:selector:

?data a s:HttpSource;
    s:url "https://api.darksky.net/forecast/bdbe3f638eb908c9b94919537dad5945/30.374563,-97.975892" ;
    ?latitude (xsd:double) ;
    ?longitude (xsd:double) ;
    ?timezone (xsd:string) ;
    ?hourly [
        s:selector "hourly.data" ;
        ?time (xsd:long) ;
        ?summary (xsd:string) ;
        ?nearestStormDistance (xsd:int) ;
        ?rainIntensity ("precipIntensity" xsd:double) ;
        ?rainProbability ("precipProbability" xsd:double) ;
        ?temperature (xsd:double) ;
        ?feelsLike ("apparentTemperature" xsd:double) ;
        ?humidity (xsd:double) ;
        ?pressure (xsd:double) ;
        ?windSpeed (xsd:double) ;
    ] .

When working with schema-less (or schema-flexible) sources like JSON, you can also capture a tree of data as a JSON string. For example, using the query snippet above, if the properties under hourly were unknown, the snippet could be rewritten as follows. This query would bind all of the data below hourly to the ?hourly variable and return a JSON string representation of the properties and instance data:

?data a s:HttpSource;
    s:url "https://api.darksky.net/forecast/bdbe3f638eb908c9b94919537dad5945/30.374563,-97.975892" ;
    ?latitude (xsd:double) ;
    ?longitude (xsd:double) ;
    ?timezone (xsd:string) ;
    ?hourly () .  

For example, the results look like this:

 latitude  | longitude  | timezone        | hourly                                     
-----------+------------+-----------------+----------------------------
30.374563  | -97.975892 | America/Chicago | {"summary":"\"Humid and partly cloudy
throughout the day.\"","icon":"\"partly-cloudy-day\"","data":[{"time":"1595559600",
summary":"\"Clear\"","icon":"\"clear-night\"","precipIntensity":"0",
"precipProbability":"0","temperature":"88.39","apparentTemperature":"91.72",
"dewPoint":"67.42","humidity":"0.5","pressure":"1011.7","windSpeed":"7.48",
"windGust":"16.71","windBearing":"109","cloudCover":"0.06","uvIndex":"0",
"visibility":"10","ozone":"285.2"},{"time":"1595563200","summary":"\"Clear\"",
"icon":"\"clear-night\"","precipIntensity":"2.0E-4","precipProbability":"0.01",
"precipType":"\"rain\"","temperature":"86.69","apparentTemperature":"90.1",
"dewPoint":"67.84","humidity":"0.54","pressure":"1012","windSpeed":"7.05",
"windGust":"17.56","windBearing":"110","cloudCover":"0.12","uvIndex":"0",
"visibility":"10","ozone":"284.9"},...

Similar to the example above, you can write a query that specifically captures some of the properties in a hierarchy and then returns the rest of the properties and their values as a JSON string representation. To do so, use @ as the binding path. For example:

?data a s:HttpSource;
    s:url "https://api.darksky.net/forecast/bdbe3f638eb908c9b94919537dad5945/30.374563,-97.975892" ;
    ?latitude (xsd:double) ;
    ?longitude (xsd:double) ;
    ?timezone (xsd:string) ;
    ?hourly [
        s:selector "hourly.data" ;
        ?time (xsd:long) ;
        ?summary (xsd:string) ;
        ?hourly_data ("@") ;
    ] .

The results look like this:

 latitude  | longitude  | timezone        | time       | summary          | hourly_data  
-----------+------------+-----------------+------------+------------------+---------------
 30.374563 | -97.975892 | America/Chicago | 1595559600 | Clear            | {"time":"1595559600","summary":"\"Clear\"",
"icon":"\"clear-night\"","precipIntensity":"0","precipProbability":"0","temperature":"88.39",
"apparentTemperature":"91.72","dewPoint":"67.42","humidity":"0.5","pressure":"1011.7","windSpeed":"7.48",
"windGust":"16.71","windBearing":"109","cloudCover":"0.06","uvIndex":"0","visibility":"10","ozone":"285.2"}
                                       
 30.374563 | -97.975892 | America/Chicago | 1595563200 | Clear            | {"time":"1595563200","summary":"\"Clear\"",
"icon":"\"clear-night\"","precipIntensity":"2.0E-4","precipProbability":"0.01","precipType":"\"rain\"","temperature":"86.69",
"apparentTemperature":"90.1","dewPoint":"67.84","humidity":"0.54","pressure":"1012","windSpeed":"7.05","windGust":"17.56",
"windBearing":"110","cloudCover":"0.12","uvIndex":"0","visibility":"10","ozone":"284.9"}
          
 30.374563 | -97.975892 | America/Chicago | 1595566800 | Partly Cloudy    | {"time":"1595566800","summary":"\"Partly Cloudy\"",
"icon":"\"partly-cloudy-night\"","precipIntensity":"3.0E-4","precipProbability":"0.01",
"precipType":"\"rain"","temperature":"85.63","apparentTemperature":"89.21",
"dewPoint":"68.33","humidity":"0.56","pressure":"1012.6","windSpeed":"6.48","windGust":"17.92","windBearing":"110",
"cloudCover":"0.34","uvIndex":"0","visibility":"10","ozone":"284.5"}
...

Graph Data Interface Query Examples

This section includes several sample queries that demonstrate how to use the Graph Data Interface (GDI) to read or ingest data from different types of data sources.

Reading Data from Remote Sources

The examples in this section read data from remote data sources by using the GDI service to run SELECT queries against the sources.

The following query reads Covid statistics for the United States from the publicly available Data API Covid Tracking Project. The query simply retrieves data from the HTTP source without passing values as input to the endpoint:

PREFIX s:      <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:    <http://www.w3.org/2002/07/owl#>
PREFIX anzo:   <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:   <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:     <http://purl.org/dc/elements/1.1/>

SELECT *
WHERE
{
	SERVICE <http://cambridgesemantics.com/services/DataToolkit>
    {
		?data a s:HttpSource ;
		s:url "https://covidtracking.com/api/v1/us/current.csv" ;
		?positive (xsd:int) ;
		?negative (xsd:int) ;
		?totalTests ("totalTestResults" xsd:int) ;
		?hospitalized ("hospitalizedCurrently" xsd:int) ;
		?inICU ("inIcuCurrently" xsd:int) ;
		?updated ("date" xsd:date "yyyyMMdd") .
    }
}

The query returns the following results:

positive | negative | totalTests | hospitalized | inICU | updated
---------+----------+------------+--------------+-------+------------
 4017735 | 44777235 |   48794970 |        59846 | 10466 | 2020-07-23
1 rows

The SELECT query below also reads data from an HTTP source, the Dark Sky API, which compiles worldwide weather statistics. The API has several models available for retrieving data that is current, daily, historical, etc. To target current data, the query includes s:selector "currently" as an input parameter. In addition, the query demonstrates the use of the "topdown" functionality, where the query includes values to be input to the source to narrow the results to specific locations. The query includes the TOPDOWN keyword in the GDI service call, and the VALUES clause specifies the latitude and longitude values for the cities to return data for. In addition, since this API service requires parameters to be specified in the connection URL, the s:url value includes ?lat and ?long as parameters for the value.

PREFIX s:      <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:    <http://www.w3.org/2002/07/owl#>
PREFIX anzo:   <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:   <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:     <http://purl.org/dc/elements/1.1/>
PREFIX ex:     <http://example.org/ontologies/City#>

SELECT  
	?city ?state ?summary ?temp ?rainChance
	?humidity ?pressure ?windSpeed
WHERE
{
	SERVICE TOPDOWN <http://cambridgesemantics.com/services/DataToolkit>
    {
	  ?data a s:HttpSource ;
	  s:url "https://api.darksky.net/forecast/bdbe3f638eb908c9b94919537dad5945/{{?lat}},{{?long}}" ;
	  s:selector "currently" ;
	  ?lat ("latitude") ;
	  ?long ("longitude") ;
	  ?time () ;
	  ?summary () ;
	  ?temp ("temperature") ;
	  ?rainChance ( "precipProbability" ) ;
	  ?rainIntensity ( "precipIntensity" ) ;
	  ?humidity () ;
	  ?pressure () ;
	  ?windSpeed () ;
	  ?windGust () ;
	  ?windBearing () ;
	  ?nearestStorm ("nearestStormDistance") ;
	  ?stormBearing ("nearestStormBearing") ;
	  ?visibility () .
    }
    VALUES( ?city ?state ?lat ?long )
    {
	  ( "Lakeway" "TX" 30.374563 -97.975892 )
	  ( "Boston" "MA" 42.358043 -71.060415 )
	  ( "Seattle" "WA" 47.590720 -122.307053 )
	  ( "Chicago" "IL" 41.837741 -87.823296 )
	  ( "Hilo" "HI" 19.702040 -155.090312 )
    }
}
ORDER BY ?city

The query returns the following results:

city    | state | summary       | temp  | rainChance | humidity | pressure | windSpeed
--------+-------+---------------+-------+------------+----------+----------+-----------
Boston  | MA    | Overcast      | 79.81 |          0 |      0.6 |   1018.7 |      7.71
Chicago | IL    | Clear         |  81.7 |          0 |     0.52 |   1021.1 |      5.13
Hilo    | HI    | Partly Cloudy |  72.6 |       0.13 |     0.79 |   1018.6 |      4.86
Lakeway | TX    | Partly Cloudy | 92.43 |          0 |     0.48 |   1013.3 |     10.85
Seattle | WA    | Mostly Cloudy | 61.82 |          0 |     0.76 |   1018.2 |      4.57
5 rows

The following SELECT query reads film data from a database source. The connection parameters reference a context to obscure the sensitive information from the request. For information about contexts, see Using Query Contexts.

PREFIX s:      <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:	<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:	<http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:	<http://www.w3.org/2001/XMLSchema#>
PREFIX owl:	<http://www.w3.org/2002/07/owl#>
PREFIX anzo:	<http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:	<http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:	<http://purl.org/dc/elements/1.1/>

​SELECT DISTINCT *
​WHERE
{
	SERVICE <http://cambridgesemantics.com/services/DataToolkit>
    {
	?data a s:DbSource ;
	  s:url "{{@AdventureWorks2012:db.url}}" ;
	  s:username "{{@AdventureWorks2012:db.username}}" ;
	  s:password "{{@AdventureWorks2012:db.password}}" ;
	  s:selector "[dbo].[FILM]" ;
	  ?year ("[YEAR]" xsd:int);
	  ?length (xsd:int) ;
	  ?title (xsd:string) ;
	  ?subject ("[dbo].[FILM].[SUBJECT]" xsd:string) ;
	  ?actor ("[ACTOR]" xsd:string) ;
	  ?actress (xsd:string) ;
	  ?director (xsd:string) ;
	  ?popularity (xsd:int) ;
	  ?awards (xsd:string) ;
	  ?image (xsd:string) .
	FILTER(?year >= 1990 && ?year < 2000)
	FILTER(?subject = "Drama" || ?subject = "Action")
	FILTER(?length <= 90)
    }
}

The example query below reads data from a CSV file source that contains data about airports. The query uses bindings to map the upper case column names in the file to lowercase variables. The result list includes an expression to concatenate the city and state values, and the WHERE clause includes a filter to narrow the results to two states: CA and MA.

PREFIX s:      <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:    <http://www.w3.org/2002/07/owl#>
PREFIX anzo:   <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:   <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:     <http://purl.org/dc/elements/1.1/>

SELECT ?code ?name (CONCAT(?city,", ", ?state) as ?location)
WHERE
{
	SERVICE <http://cambridgesemantics.com/services/DataToolkit>
    {
		?data a s:FileSource ;
		s:url "/opt/shared-files/airports.csv" ;
		?code ("IATA_CODE") ;
		?name ("AIRPORT") ;
		?city ("CITY") ;
		?state ("STATE") ;
		?country ("COUNTRY") ;
		?lat ("LATITUDE") ;
		?long ("LONGITUDE").
		FILTER (?state="CA" || ?state="MA") .
    }
}
ORDER BY ?state

The query returns the following results:

code | name                                                      | location
-----+-----------------------------------------------------------+-------------------
ACV  | Arcata Airport                                            | Arcata/Eureka, CA
BFL  | Meadows Field                                             | Bakersfield, CA
BUR  | Bob Hope Airport (Hollywood Burbank Airport)              | Burbank, CA
CEC  | Del Norte County Airport (Jack McNamara Field)            | Crescent City, CA
CLD  | McClellan-Palomar Airport                                 | San Diego, CA
FAT  | Fresno Yosemite International Airport                     | Fresno, CA
LAX  | Los Angeles International Airport                         | Los Angeles, CA
LGB  | Long Beach Airport (Daugherty Field)                      | Long Beach, CA
MMH  | Mammoth Yosemite Airport                                  | Mammoth Lakes, CA
MRY  | Monterey Regional Airport (Monterey Peninsula Airport)    | Monterey, CA
OAK  | Oakland International Airport                             | Oakland, CA
ONT  | Ontario International Airport                             | Ontario, CA
PSP  | Palm Springs International Airport                        | Palm Springs, CA
RDD  | Redding Municipal Airport                                 | Redding, CA
SAN  | San Diego International Airport (Lindbergh Field)         | San Diego, CA
SBA  | Santa Barbara Municipal Airport (Santa Barbara Airport)   | Santa Barbara, CA
SBP  | San Luis Obispo County Regional Airport (McChesney Field) | San Luis Obispo, CA
SFO  | San Francisco International Airport                       | San Francisco, CA
SJC  | Norman Y. Mineta San José International Airport           | San Jose, CA
SMF  | Sacramento International Airport                          | Sacramento, CA
SMX  | Santa Maria Public Airport (Capt G. Allan Hancock Field)  | Santa Maria, CA
SNA  | John Wayne Airport (Orange County Airport)                | Santa Ana, CA
ACK  | Nantucket Memorial Airport                                | Nantucket, MA
BOS  | Gen. Edward Lawrence Logan International Airport          | Boston, MA
HYA  | Barnstable Municipal Airport                              | Hyannis, MA
MVY  | Martha's Vineyard Airport                                 | Marthas Vineyard, MA
ORH  | Worcester Regional Airport                                | Worcester, MA
27 rows

And the example query below reads data from Parquet files.

PREFIX s:	<http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:	<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:	<http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:	<http://www.w3.org/2001/XMLSchema#>
PREFIX owl:	<http://www.w3.org/2002/07/owl#>
PREFIX anzo:	<http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:	<http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:	<http://purl.org/dc/elements/1.1/>

SELECT *
WHERE
{
	SERVICE <http://cambridgesemantics.com/services/DataToolkit>
    { 
		?data a s:FileSource ;
		s:url "/opt/shared-files/data/" ;
		s:pattern "*.parquet" ;
		?bammUid {xsd:string) ;
		?eventHeader [
			?eventId (xsd:string) ;
			?eventName (xsd:string) ;
			?eventVersion (xsd:string) ;
			?eventCreatedAt (xsd:string) ;
		] ;
		?Location [
			?LocationId (xsd:string) ;
			?PlantId (xsd:string) ;
			?LineNo (xsd:long) ;
			?StationNo (xsd:long) ;
			?StationIdx (xsd:long) ;
			?FunctionNo (xsd:long) ;
			?WorkPos (xsd:long) ;
			?ToolPos (xsd:long) ;
			?ProcessNo (xsd:long) ;
		] ;
		?ProcessedPanel [
			?aeMachineId (xsd:string) ;
			?panelId (xsd:string) ;
			?productPartNo (xsd:string) ;
			?productPartVariant (xsd:string) ;
			?workingCode (xsd:long) ;
			?Prog1Name (xsd:string) ;
			?TestStartedAt (xsd:string) ;
			?NumOfFailures (xsd:long) ;
			?PanelProcessResult (xsd:long) ;
			?PanelProcessResultBit (xsd:long) ;
			?PanelTargetIdx (xsd:long) ;
			?PanelState (xsd:long) ;
		] ;
		?ProcessedBoards [
			?boardNo (xsd:long) ;
			?boardId (xsd:string) ;
			?BoardProcessResult (xsd:long) ;
			?BoardState (xsd:long) ;
		] .
	}
}

Ingesting Data from Remote Sources

The examples in this section ingest data from remote data sources using the extract, load, and transform (ELT) method of data integration. The GDI service ingests remote data and then transforms it into a graph structure in AnzoGraph based on the provided SPARQL INSERT or CREATE MATERIALIZED VIEW queries.

The following query ingests airport-related data from a CSV file. The INSERT clause creates a graph called <http://anzograph.com/airports> and inserts the specified triple patterns based on the GDI service call in the WHERE clause. The BIND statement creates a unique subject URI for each airport using the ?iata_code from the source file.

PREFIX s:       <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX anzo:    <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:    <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>

INSERT 
{
  GRAPH <http://anzograph.com/airports>
  { 
	?code a <http://anzograph.com/airport> ;
	  <http://anzograph.com/airport/name> ?name ;
	  <http://anzograph.com/airport/city> ?city ;
	  <http://anzograph.com/airport/state> ?state ;
	  <http://anzograph.com/airport/latitude> ?lat;
	  <http://anzograph.com/airport/longitude> ?long.
  }
}
WHERE
{
	SERVICE <http://cambridgesemantics.com/services/DataToolkit>
    {
	?data a s:FileSource ;
		s:url "/opt/shared-files/airports.csv" ;
		?iata_code ("IATA_CODE") ;
		?name ("AIRPORT") ;
		?city ("CITY") ;
		?state ("STATE") ;
		?lat ("LATITUDE") ;
		?long ("LONGITUDE").
	BIND(IRI("http://anzograph.com/airport/{{?iata_code}}") as ?code)
    }
}

Another way to ingest data into AnzoGraph is to create a materialized view. When you create a materialized view, you write a CREATE MATERIALIZED VIEW statement with a CONSTRUCT query that specifies the triple patterns for the new graph. The GDI retrieves the data from the remote source and creates a new named graph in the database. When creating views, View needs to be added to the GDI service URI, as shown below:

SERVICE <http://cambridgesemantics.com/services/DataToolkitView>

In addition, the name of the view needs to be added in parenthesis at the end of the URI. The view name in the service call must match the name of the view in the CREATE VIEW statement. For example:

CREATE OR REPLACE MATERIALIZED VIEW <http://example.com/kl_hosp/patients_m> AS
CONSTRUCT
{ ... }
WHERE
{ 
   SERVICE <http://cambridgesemantics.com/services/DataToolkitView>(<http://example.com/kl_hosp/patients_m>)
    { ... }
}

The example below creates a materialized view from data in a database source. Running this query inserts a graph called <http://example.com/kl_hosp/patients_m> in AnzoGraph. When running analytics on the data in the view, the view is treated like any other named graph.

PREFIX s:       <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX anzo:    <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:    <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX :        <http://example.com/ontologies/kl_hosp#>

CREATE OR REPLACE MATERIALIZED VIEW <http://example.com/kl_hosp/patients_m> AS
CONSTRUCT
{
	?patient a :Patient ;
		:rowid ?row_id ;
		:subjectid ?subject_id;
		:gender ?gender;
		:dob ?dob ;
		:dod ?dod ;
		:dod_hosp ?dod_hosp .
}
WHERE
{
	SERVICE <http://cambridgesemantics.com/services/DataToolkitView>(<http://example.com/kl_hosp/patients_m>)
    {
	?data a s:DbSource ;
	    s:url "jdbc:postgresql://10.10.5.3:5555/kl_hosp_db?user=postgres&password=postgres123" ;
	    s:debug true ;
	    s:selector "kl_hosp_schema.patients" ;
	    ?row_id xsd:int;
	    ?subject_id xsd:int;
	    ?gender xsd:string;
	    ?dob xsd:dateTime;
	    ?dod xsd:dateTime;
	    ?dod_hosp xsd:dateTime ;
	BIND(IRI("http://example.com/patients/{{?subject_id}}") AS ?patient)
    }
}

For more information about creating views, see Named Views. For guidance on creating non-materialized views, also called virtual graphs or data virtualization, see the Creating Virtual Graphs from Remote Sources section below.

The next example below ingests data from a database source using an INSERT query.

PREFIX s:       <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX anzo:    <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:    <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX :        <http://example.com/ontologies/kl_hosp#>

INSERT
{
	GRAPH <http://example.com/kl_hosp/inputevents_cv_i>
    {
	?InputEvent_cv a :InputEvent_cv ;
	    :row_id ?row_id ;
	    :subject_id ?subject_id ;
	    :hadm_id ?hadm_id ;
	    :icustay_id ?icustay_id ;
	    :charttime ?charttime ;
	    :itemid ?itemid ;
	    :amount ?amount ;
	    :amountuom ?amountuom ;
	    :rate ?rate ;
	    :rateuom ?rateuom ;
	    :storetime ?storetime ;
	    :cgid ?cgid ;
	    :orderid ?orderid ;
	    :linkorderid ?linkorderid ;
	    :stopped ?stopped ;
	    :newbottle ?newbottle ;
	    :originalamount ?originalamount ;
	    :originalamountuom ?originalamountuom ;
	    :originalroute ?originalroute ;
	    :originalrate ?originalrate ;
	    :originalrateuom ?originalrateuom ;
	    :originalsite ?originalsite .
    }
}
WHERE
{
	SERVICE <http://cambridgesemantics.com/services/DataToolkit>
    {
	?data a s:DbSource ;
	    s:url "jdbc:postgresql://10.10.5.3:5555/kl_hosp_db?user=postgres&password=postgres123" ;
	    s:selector "kl_hosp_schema.inputevents_cv" ;
	    ?row_id (xsd:int) ;
	    ?subject_id (xsd:int) ;
	    ?hadm_id (xsd:int) ;
	    ?icustay_id (xsd:int) ;
	    ?charttime (xsd:dateTime) ;
	    ?itemid (xsd:int) ;
	    ?amount (xsd:float) ;
	    ?amountuom (xsd:string) ;
	    ?rate (xsd:float) ;
	    ?rateuom (xsd:string) ;
	    ?storetime (xsd:dateTime) ;
	    ?cgid (xsd:int) ;
	    ?orderid (xsd:int) ;
	    ?linkorderid (xsd:int) ;
	    ?stopped (xsd:string) ;
	    ?newbottle (xsd:int) ;
	    ?originalamount (xsd:float) ;
	    ?originalamountuom (xsd:string) ;
	    ?originalroute (xsd:string) ;
	    ?originalrate (xsd:float) ;
	    ?originalrateuom (xsd:string) ;
	    ?originalsite (xsd:string) ;
	BIND(IRI("http://example.com/inputevent_cv/{{?row_id}}") AS ?InputEvent_cv)
	BIND(IRI("http://example.com/patients/{{?subject_id}}") AS ?patient)
	BIND(IRI("http://example.com/admissions/{{?hadm_id}}") AS ?admission) 
    }
}

Creating Virtual Graphs from Remote Sources

The examples in this section demonstrate "data virtualization" by using the GDI service to run CREATE VIEW queries that create virtual graphs. Virtual graphs are non-materialized views whose view definition is stored in AnzoGraph but the instance data remains in the remote source. This enables you to analyze data in a remote source without first ingesting all of the data into the database.

When you create a virtual graph, you write a CREATE VIEW statement with a CONSTRUCT query that specifies the triple patterns for the virtual graph. The GDI creates a view definition in the database. When creating views, View needs to be added to the GDI service URI, as shown below:

SERVICE <http://cambridgesemantics.com/services/DataToolkitView>

In addition, the name of the view needs to be added in parenthesis at the end of the URI. The view name in the service call must match the name of the view in the CREATE VIEW statement. For example:

CREATE OR REPLACE VIEW <http://example.com/kl_hosp/transfers> AS
CONSTRUCT
{ ... }
WHERE
{ 
   SERVICE <http://cambridgesemantics.com/services/DataToolkitView>(<http://example.com/kl_hosp/transfers>)
    { ... }
}

By default, to avoid running expensive metrics-gathering queries against the source system more than once, the GDI service caches the virtual graph (view) definition when the view is created. That means if you modify and re-run the view creation query, the original view definition remains cached instead of being updated. You can include the s:debug true parameter to avoid caching the definition. If you need to drop the cache after a virtual graph is created, you can run the following query:

PREFIX s:  <http://cambridgesemantics.com/ontologies/DataToolkit#>
SELECT *
WHERE
{
  SERVICE <http://cambridgesemantics.com/management/DataToolkit>
  {
    # to refresh all definitions (clear all caches)
    # include/uncomment the following line
    # ( s:refresh ) .
    # to refresh a specific view, include the view URI as follows
    ( s:refresh <view_URI> ) .
  }
}

The example below creates a virtual graph from data in a database source. Running this query inserts a virtual graph called <http://example.com/kl_hosp/transfers> in AnzoGraph but the instance data is not inserted into the database. When running analytics on the data in the view, the view is treated like any other named graph, but AnzoGraph retrieves the instance data from the data source for which the view was created.

PREFIX s:       <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX anzo:    <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:    <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX :        <http://example.com/ontologies/kl_hosp#>

CREATE OR REPLACE VIEW <http://example.com/kl_hosp/transfers> AS
CONSTRUCT
{
	?transfer a :Transfer ;
	    :rowid ?row_id ;
	    :subjectid ?patient;
	    :hadmid ?hadm_id ;
	    :eventtype ?eventtype .
}
WHERE
{
	SERVICE <http://cambridgesemantics.com/services/DataToolkitView>(<http://example.com/kl_hosp/transfers>)
    {
	?data a s:DbSource ;
	    s:url "jdbc:postgresql://10.10.5.3:5555/kl_hosp_db?user=postgres&password=postgres123" ;
	    s:selector "kl_hosp_schema.transfers" ; # selects the transfers table in the kl_hosp_schema
	    s:debug true ; # turns off view query caching
	    ?row_id (xsd:int) ;
	    ?subject_id (xsd:int) ;
	    ?hadm_id (xsd:int) ;
	    ?eventtype (xsd:string) .
	BIND(IRI("http://example.com/transfers/{{?row_id}}") AS ?transfer)
	BIND(IRI("http://example.com/patients/{{?subject_id}}") AS ?patient) 
    }
}

The following example creates a virtual graph from a file source.

PREFIX s:       <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX anzo:    <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:    <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>

CREATE OR REPLACE VIEW  <http://anzograph.com/airports> AS
CONSTRUCT
{ 
	?code a <http://anzograph.com/airport> ;
	  <http://anzograph.com/airport/name> ?name ;
	  <http://anzograph.com/airport/city> ?city ;
	  <http://anzograph.com/airport/state> ?state ;
	  <http://anzograph.com/airport/latitude> ?lat;
	  <http://anzograph.com/airport/longitude> ?long.
  }
}
WHERE
{
	SERVICE <http://cambridgesemantics.com/services/DataToolkitView>(<http://anzograph.com/airports>)
    {
	?data a s:FileSource ;
		s:url "/opt/shared-files/airports.csv" ;
		s:debug true ;
		?iata_code ("IATA_CODE") ;
		?name ("AIRPORT") ;
		?city ("CITY") ;
		?state ("STATE") ;
		?lat ("LATITUDE") ;
		?long ("LONGITUDE").
	BIND(IRI("http://anzograph.com/airport/{{?iata_code}}") as ?code)
    }
}

For more information about creating views, see Named Views. For guidance on creating materialized views, see the Ingesting Data from Remote Sources section above.

Related Topics