Query an HTTP Source
This topic provides details about the structure to use when writing GDI queries to read or ingest data from HTTP data sources. It also includes example queries that may be useful as a starting point for writing your own GDI queries.
Query Syntax
The following query syntax shows the structure of a GDI query for HTTP sources. The clauses, patterns, and placeholders that are links are described below.
    # PREFIX Clause
    PREFIX s:    <http://cambridgesemantics.com/ontologies/DataToolkit#>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    PREFIX anzo: <http://openanzo.org/ontologies/2008/07/Anzo#>
    PREFIX zowl: <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
    PREFIX dc:   <http://purl.org/dc/elements/1.1/>

    # Result Clause
    {
      [ GRAPH <target_graph> { ]
        triple_patterns
      [ } ]
    }
    [ FROM Clause ]
    WHERE
    {
      # SERVICE Clause: Include the following service call when reading or inserting data.
      SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit>
      # View SERVICE Clause: Or use the service call below when constructing a view.
      SERVICE <http://cambridgesemantics.com/services/DataToolkitView>(<target_graph>)
      {
        ?data a s:HttpSource ;
          s:url "string" ;
          [ s:authorization [
              a s:BearerToken ;
                s:token "string" ;
            | a s:AWSSignature ;
                s:accessKey "string" ;
                s:region "string" ;
                s:secretKey "string" ;
                s:serviceName "string" ;
                s:sessionToken "string" ;
            | a s:BasicAuth ;
                s:username "string" ;
                s:password "string" ;
          ] ; ]
          [ s:trust "string" ; ]
          [ s:proxy "string" | [ s:host "string" ; s:port int ] ; ]
          [ s:header [ s:name "string" ; s:value "string" ] ; ]
          [ s:mimetype "string" ; ]
          [ s:contentType "string" ; ]
          [ s:content """string""" ; ]
          [ s:parameter [ s:name "string" ; s:value "string" ] ; ]
          [ s:method "string" ; ]
          [ s:encoding "string" ; ]
          [ s:form [ s:name "string" ; s:value "string" ] ; ]
          [ s:format [ source_format_options ; ] ; ]
          [ s:timeout int ; ]
          [ s:batching boolean | int ; ]
          [ s:concurrency int | [ list_of_properties ] ; ]
          [ s:rate int | "string" ; ]
          [ s:partitionBy "string" | ?variable ; ]
          [ s:locale "string" ; ]
          [ s:sampling int ; ]
          [ s:selector "string" | [ list ] ; ]
          [ s:model "string" ; ]
          [ s:key ("string") ; ]
          [ s:reference [ s:model "string" ; s:using ("string") ] ; ]
          [ s:formats [ datatype_formatting_options ] ; ]
          [ s:normalize boolean | [ normalization_rules ] ; ]
          [ s:count ?variable ; ]
          [ s:offset int ; ]
          [ s:orderBy "string" | ?variable ; ]
          [ s:limit int ; ]
          # Mapping variables
          ?mapping_variable ( [ "binding" ] [ datatype ] [ "datetime_format" ] ) ;
          ... ;
          .
        # Additional clauses such as BIND, VALUES, FILTER
      }
    }
For readability, the parameters below exclude the base URI <http://cambridgesemantics.com/ontologies/DataToolkit#> as well as the s: prefix. As shown in the examples, however, the s: prefix or full property URI does need to be included in queries.
Option | Type | Description |
---|---|---|
PREFIX Clause | N/A | The PREFIX clause declares the standard and custom prefixes for GDI service queries. Generally, queries include the prefixes from the query template (or a subset of them) plus any data-specific declarations. |
Result Clause | N/A | The result clause defines the type of SPARQL query to run and the set of results to return, i.e., whether you want to read (SELECT or CONSTRUCT) from the source or ingest the data into AnzoGraph DB (INSERT). |
SERVICE Clause | N/A | Include the service call SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit> to invoke the GDI service when you are running a SELECT, INSERT, or CONSTRUCT query that is not creating a view. When creating a view, use the DataToolkitView service call, as described below in View SERVICE Clause. Include the optional TOPDOWN keyword when you want to pass input values from AnzoGraph DB to the data source. TOPDOWN indicates that the rest of the query produces values to send to the source. In this case, the GDI makes repeated calls to pass in each of the specified values and retrieve the data that is based on those values. |
View SERVICE Clause | N/A | When writing a CONSTRUCT query that creates a view of the data, include the following service call: SERVICE <http://cambridgesemantics.com/services/DataToolkitView>(<target_graph>). Using the DataToolkitView call optimizes query execution because it tells the GDI to inspect the query and determine which filters to push to the data source. It also limits the result set and retrieves only the data that is needed; that is, the source data is fully mapped, but not all of the mapped data is necessarily returned. |
url | string | This property specifies the URL to use to access the source. Query binding variables can be inserted into the url string by surrounding the variable name with double curly braces, for example "{{?name}}". For security, it is a best practice to reference connection information (such as the url, username, and password) from a Query Context so that the sensitive details are abstracted from any requests. In addition, using a Query Context makes connection details reusable across queries. See Use a Query Context for more information. |
authorization | RDF list | This property specifies the type of authorization to use and the values for authentication. The options are BearerToken, AWSSignature, or BasicAuth. |
BearerToken | string | Specify this property when a bearer token is used for authentication, and include the token property. |
AWSSignature | RDF list | For authorization to AWS service endpoints, specify this property and include the appropriate authentication properties: accessKey, region, secretKey, serviceName, and sessionToken. |
BasicAuth | RDF list | Specify this property when basic authentication is used, and include the username and password properties. |
trust | string | Include this property to set the level of trust for the source's SSL certificate. The value can be either "system" or "all". |
proxy | string or RDF list | Include this property to specify proxy information if a proxy is used. The value can be a string, such as s:proxy "host_url:port_number", or an RDF list that includes host and port properties, such as s:proxy [ s:host "host_url" ; s:port port_number ]. |
header | RDF list | You can use this property to specify name-value pairs to include as headers in the request. For example, s:header [ s:name "Accept" ; s:value "application/json" ]. If you are creating a view, you can include variables in the |
mimetype | string | You can include this property to specify the MIME type of the source. For example, s:mimetype "text/html". |
contentType | string | Include this property to specify the content type of the body of the request. For example, s:contentType "application/sparql-query" or s:contentType "application/json". |
content | string or RDF list | This property can be included to send content to the source in the body of the request. For example, content can be a SPARQL query, JSON arrays, or a list of key-value pairs. Content can also be configured with an inline object (blank node) that gets translated to JSON. For more information, see Mapping the Content Property to JSON below. |
parameter | RDF list | You can include this property to list any URL parameters as name-value pairs. For example, the s:parameter property below adds format to return results in CSV format and the named-graph-uri parameter to target a specific layer in a graphmart. If you are creating a view, you can include variables in the |
method | string | You can include this property to specify the HTTP method. For example, s:method "GET" or s:method "POST". |
encoding | string | When targeting a file, you can include this property to specify the character encoding used by the file. The default value is s:encoding "utf8". |
form | RDF list | To send data to the HTTP endpoint, you can use this property to post the data. Form is a list of name-value pairs. When including s:form, you must also include s:contentType "multipart/form-data". The GDI sends the form object as an application/x-www-form-urlencoded string that contains the specified parameters. See Query an HTTP Source below for sample usage. |
format | RDF list | If the data is file-based, you can include the format property to add parameters that describe the source. See File Source Format Options for details about the supported parameters. |
timeout | int | This property can be used to specify the timeout (in milliseconds) to use for requests against the source. For example, s:timeout 5000 configures a 5-second timeout. |
batching | boolean or int | This property can be used to disable batching or to change the default batch size. By default, batching is set to 5000 (s:batching 5000). To disable batching, you can include s:batching false in the query. Typically users do not change the batch size. However, it can be useful to control the batch size when performing updates. To configure the size, include s:batching int in the query. For example, s:batching 3000. |
concurrency | int or RDF list | This property can be included to configure the maximum level of concurrency for the query. The value can be an integer, such as s:concurrency 8. If the value is an integer, it configures a maximum limit on the number of slices that can execute the query. For finer-grained control over the number of nodes and slices to use, concurrency can also be included as an object with limit, nodes, and/or executorsPerNode properties. For example, the object s:concurrency [ s:limit 24 ; s:nodes 4 ; s:executorsPerNode 8 ] configures a concurrency model that allows a maximum of 24 executors in total, distributed across 4 nodes with up to 8 executors per node. |
rate | int or string | This property can be included to control the frequency with which a request is sent to the source. The limit applies to the number of requests a single slice can make. If you specify an integer for the rate, then the value is treated as the maximum number of requests to issue per minute. If you specify a string, you have more flexibility in configuring the rate. The sample values below show the types of values that are supported: To enforce the rate limit, the GDI introduces a sleep between requests that is equal to the rate delay. The more executing slices, the longer the rate delay needs to be to enforce the limit in aggregate. Given the example of |
partitionBy | string, variable, list | The GDI attempts to partition queries automatically across the available cores (slices) in AnzoGraph DB. To determine how to partition the query, the GDI uses metadata from the source. It looks for any column in an index, preferring the primary key column if it is interpolable. However, it only considers the first column in any index on the table. After determining the partition column, the GDI does a MIN/MAX on the column as well as a basic sizing query. To specify which column or columns the GDI should partition on, you can include the partitionBy property in the query. The property supports a list of source field names, bound variables, or the object s:auto , which forces the GDI to partition the data when the source does not define partitioning metadata. |
locale | string | This property can be used to specify the locale to use when parsing locale-dependent data such as numbers, dates, and times. |
sampling | int | This property can be used to configure the number of records in the source to examine for data type inferencing. |
selector | string or RDF list | This property can be used as a binding component to identify the path to the source objects. For example, s:selector "Sales.SalesOrderHeader" targets the SalesOrderHeader table in the Sales schema. For more information about binding components and the selector property, see Using Binding Trees and Selector Paths. |
model | string | This property defines the class (or table) name for the type of data that is generated from the specified data source. For example, s:model "employees". Model is optional when querying a single source. If your query targets multiple sources, however, and you want to define resource templates (primary keys) and object properties (foreign keys), you must specify the model value for each source. |
key | string | This property can be used to define the primary key column for the source file or table. This column is leveraged in a resource template for the instances that are created from the source. For example, s:key ("EMPLOYEE_ID"). For more information about key, see Data Linking Options. |
reference | RDF list | This property can be used to specify a foreign key column. The reference property is an RDF list that includes the model property to list the target table and a using property that defines the foreign key column. For more information about reference, see Data Linking Options. |
formats | RDF list | To give users control over the data types that are used when coercing strings to other types, this property can be included in GDI queries to define the desired types. In addition, it can be used to describe the formats of date and time values in the source to ensure that they are recognized and parsed to the appropriate date, time, and/or dateTime values. For details about the formats property, see Data Type Formatting Options. |
normalize | RDF list | To give users control over the labels and URIs that are generated, the GDI offers several options for normalizing the model and/or the fields that are created from the specified data source(s). For details about the normalize property, see Model Normalization Options. |
count | variable | If you want to turn the query into a COUNT query, you can include this property with a ?variable to perform a count. For example, s:count ?count. |
offset | int | This property can be used to offset the data that is returned by a number of rows. |
orderBy | string, variable, list | You can include this property to order the result set by a field name, a bound variable, or a list of names or bound variables. |
limit | int | You can include this property to limit the number of results that are returned. s:limit maps to the SPARQL LIMIT clause. |
mapping_variable | variable | The mapping variables, in ?mapping_variable (["binding"] [datatype] ["datetime_format"]) format, define the triple patterns to output. When the specified ?variable matches the source column name, the GDI uses the variable as the source data selector. If you specify an alternate variable name, a binding needs to be specified to map the new variable to the source. You also have the option to transform the data using the datatype and datetime_format options. The parentheses around the binding, data type, and format specifications are not required but are included in this document for readability. |
binding | string | The binding is a literal value that binds a ?mapping_variable to a source column. If you specify a ?variable that matches the source column name, then that variable name is the data selector and it is not necessary to specify a binding. If you specify an alternate variable name or there is a hierarchical path to the source column, then the binding is needed to map the new variable to that source column. For example, for CSV, the following pattern simply binds the source column AIRLINE to the lowercase variable ?airline: For FileSource, periods (.), forward slashes (/), and brackets ([ ]) are parsed as path notation. Therefore, if a source column name includes any of those characters, they must be escaped in the binding. Use two backslashes (\\) as an escape character. For example, if a column name is average/day, the variable and binding pattern could be written as |
datatype | URI | The datatype is the data type to convert the column to. If you do not specify a data type, the GDI infers the type. The GDI supports the following types: |
datetime_format | string | This option is used to specify the format to use for date and time data types. The GDI supports Java date and time formats. Specify days as "d," months as "M," and years as "y." For the time, specify "H" for hours, "m" for minutes, and "s" for seconds. For example, "yyyyMMdd HH:mm:ss" or "ddMMMyy" to display date values such as "01JAN19." The GDI's default base year is 2000. If the source data has years with only two digits, such as |
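
As a rough illustration of how several of the options in the table can be combined, the sketch below issues a GET request with a header, a URL parameter, a timeout, a request rate, and a concurrency object, and then maps three columns using a binding, a data type, and a date format. The endpoint URL, the parameter value, and the column names (AIRLINE, DEP_DATE, DELAY_MIN) are hypothetical and do not come from a real source, so treat this as a starting point rather than a tested query.

```sparql
PREFIX s:   <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?airline ?departed ?delay
WHERE
{
  SERVICE <http://cambridgesemantics.com/services/DataToolkit>
  {
    ?data a s:HttpSource ;
      # Hypothetical endpoint that is assumed to return CSV.
      s:url "https://example.com/api/flights" ;
      s:method "GET" ;
      s:header [ s:name "Accept" ; s:value "text/csv" ] ;
      s:parameter [ s:name "format" ; s:value "csv" ] ;
      # 5-second timeout, at most 60 requests per minute per slice.
      s:timeout 5000 ;
      s:rate 60 ;
      # Up to 2 nodes with up to 4 executors on each node.
      s:concurrency [ s:nodes 2 ; s:executorsPerNode 4 ] ;
      # Mapping variables: binding, optional data type, optional date format.
      ?airline  ("AIRLINE") ;
      ?departed ("DEP_DATE" xsd:date "yyyyMMdd") ;
      ?delay    ("DELAY_MIN" xsd:int) .
  }
}
```

Because ?airline, ?departed, and ?delay do not match the assumed source column names, each one carries an explicit binding.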
Mapping the Content Property to JSON
The s:content property can be configured with an inline object (blank node) that gets translated to JSON in the request body. This mapping allows for creation of embedded objects and arrays as well as a mechanism for iterating over all available input so that HTTP endpoints that support batching can be used more effectively.
Using Blank Nodes
Blank nodes are used to create an object in the output JSON. The local name of any predicate used within content becomes a key in the generated JSON object. Blank nodes can be embedded within each other, allowing the hierarchical nature of JSON to be represented. For example:
s:content [ ex:firstName "Mary" ; ex:lastName "Barry" ] ;
Or
s:content [ ex:person [ ex:firstName "Mary" ] ] ;
Using Variables
Variables can also be used in the object position to construct a request from input at runtime. For example:
s:content [ ex:firstName ?firstName ; ex:lastName ?lastName ] ;
The values for the variables can come from a TOPDOWN variable, a VALUES clause in the SERVICE block, or another data source. Any unbound variables in the input will not be added to the generated JSON object.
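
To make the unbound-variable rule concrete, the sketch below feeds s:content from a VALUES clause in which the second row leaves ?lastName unbound (UNDEF). Based on the rule above, the request body for that row would be expected to contain only a firstName key. The ex: namespace and the output variable names ?first and ?last are placeholders chosen for illustration; the endpoint is the echo service used in the example later in this topic.

```sparql
PREFIX s:  <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX ex: <http://example.org/person#>

SELECT *
WHERE
{
  SERVICE TOPDOWN <http://cambridgesemantics.com/services/DataToolkit>
  {
    VALUES (?firstName ?lastName)
    {
      ("Mary" "Barry")
      # ?lastName is unbound here, so no lastName key is expected in the JSON.
      ("Ana"  UNDEF)
    }
    ?data a s:HttpSource ;
      s:url "https://postman-echo.com/post" ;
      s:header [ s:name "Accept" ; s:value "application/json" ] ;
      s:content [ ex:firstName ?firstName ; ex:lastName ?lastName ] ;
      s:selector "data" ;
      ?first ("firstName") ;
      ?last  ("lastName") .
  }
}
```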
Using RDF Lists
An RDF list can also be used to create an array in the output JSON. For example:
s:content [ ex:allKnownNames ( ?firstName ?lastName ?nickName ) ]
An RDF list can also be embedded inside another list to create an array in the output JSON and populate it with items evaluated against a repeating pattern across all available input rows for a slice. That pattern can be a variable, which generates an array of primitive values, or a blank node, which generates an array of mapped JSON objects. For example:
s:content [ ex:documents ((?id)) ] ;
Or
s:content [ ex:documents (([ ex:id ?id ; ex:title ?title ])) ] ;
Example
The following example query demonstrates the use of s:content to generate JSON. The query also includes the s:concurrency property to restrict execution to a single slice. Without limiting execution when there are a small number of inputs (as in the VALUES clause), each input row gets executed on its own. As the inputs increase, each slice operates over a larger number of inputs until the default s:batching 5000 is applied.
    PREFIX s:   <http://cambridgesemantics.com/ontologies/DataToolkit#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX api: <http://contoso.com/api/>

    SELECT *
    WHERE
    {
      SERVICE TOPDOWN <http://cambridgesemantics.com/services/DataToolkit>
      {
        VALUES (?firstName ?lastName ?dob ?email)
        {
          ("Gray" "Hay" "1978-03-18"^^xsd:date "gray@abc.com")
          ("Ana" "Bana" "1974-10-20"^^xsd:date "ana@abc.com")
          ("George" "Forge" "1975-08-13"^^xsd:date "george@abc.com")
          ("Miles" "Giles" "1977-04-12"^^xsd:date "miles@abc.com")
        }
        ?data a s:HttpSource ;
          s:url "https://postman-echo.com/post" ;
          s:header [ s:name "Accept" ; s:value "application/json" ] ;
          s:concurrency 1 ;
          s:content (([
            api:dateOfBirth ?dob ;
            api:email ?email ;
            api:year 2020 ;
            api:person [
              api:firstName ?firstName ;
              api:lastName ?lastName
            ] ;
          ])) ;
          s:selector "data" ;
          ?firstName ("person.firstName" xsd:string) ;
          ?lastName ("person.lastName" xsd:string) ;
          ?dob ("dateOfBirth" xsd:date) ;
          ?email ("email" xsd:string) ;
          ?year ("year" xsd:int) .
      }
    }
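The content portion of the request that the query generates is shown below. Because firstName and lastName are declared inside the nested api:person blank node, they appear under a person object, which is also why the response mappings use the person.firstName and person.lastName paths:

    [
      {
        "dateOfBirth": "1978-03-18",
        "email": "gray@abc.com",
        "year": 2020,
        "person": { "firstName": "Gray", "lastName": "Hay" }
      },
      {
        "dateOfBirth": "1974-10-20",
        "email": "ana@abc.com",
        "year": 2020,
        "person": { "firstName": "Ana", "lastName": "Bana" }
      },
      { ... }
    ]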
Query Examples
Topdown Query with URL Parameters
The query below reads data from a sample HTTP source that compiles worldwide weather statistics. The source has several models available for retrieving data that is current, daily, historical, etc. To target current data, the query includes s:selector "currently". In addition, the query demonstrates the use of the "topdown" functionality, where the query sends values to the source to narrow the results. The VALUES clause specifies the latitude and longitude values for the cities to return data for. Because this sample source requires those values in the connection URL, the s:url value embeds the ?lat and ?long variables in double curly braces.
    PREFIX s:    <http://cambridgesemantics.com/ontologies/DataToolkit#>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
    PREFIX ex:   <http://example.org/ontologies/City#>

    SELECT ?city ?state ?temp ?rainChance ?humidity ?pressure ?windSpeed
    WHERE
    {
      SERVICE TOPDOWN <http://cambridgesemantics.com/services/DataToolkit>
      {
        ?data a s:HttpSource ;
          s:url "https://sampleEndpoint.com/forecast/{{?lat}},{{?long}}" ;
          s:selector "currently" ;
          ?lat ("latitude") ;
          ?long ("longitude") ;
          ?temp ("temperature") ;
          ?rainChance ( "precipProbability" ) ;
          ?humidity () ;
          ?pressure () ;
          ?windSpeed () .
      }
      VALUES ( ?city ?state ?lat ?long )
      {
        ( "Lakeway" "TX" 30.374563 -97.975892 )
        ( "Boston" "MA" 42.358043 -71.060415 )
        ( "Seattle" "WA" 47.590720 -122.307053 )
        ( "Chicago" "IL" 41.837741 -87.823296 )
        ( "Hilo" "HI" 19.702040 -155.090312 )
      }
    }
    ORDER BY ?city
Generator Query against a SPARQL Endpoint
The example below is a GDI Generator query that retrieves data from a remote SPARQL endpoint.
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX s:    <http://cambridgesemantics.com/ontologies/DataToolkit#>
    PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>

    INSERT
    {
      GRAPH <http://anzograph.com/something>
      {
        ?s ?p ?o
      }
    }
    WHERE
    {
      SERVICE <http://cambridgesemantics.com/services/DataToolkit>
      {
        ?data a s:HttpSource ;
          s:url "https://10.10.0.10/sparql/http%3A%2F%2Fsomething.com%2Fdata" ;
          s:trust "all" ;
          s:username "user" ;
          s:password "pass" ;
          s:contentType "application/sparql-query" ;
          s:header [ s:name "Accept" ; s:value "text/csv" ] ;
          s:content """
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?s ?p ?o
            WHERE {
              ?s ?p ?o .
              FILTER(ISLITERAL(?o))
            }
          """ .
        ?rdf a s:RdfGenerator, s:OntologyGenerator ;
          s:as (?s ?p ?o) ;
          s:ontology <http://anzograph.com/ontologies/TopMovies> ;
          s:base <http://anzograph.com/data> .
      }
    }
API Queries
The following example queries the Google Cloud Speech-to-Text API (the speech:recognize method) to request transcriptions for voice recordings that are stored in a Google Cloud Storage bucket.
    PREFIX s:    <http://cambridgesemantics.com/ontologies/DataToolkit#>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    PREFIX anzo: <http://openanzo.org/ontologies/2008/07/Anzo#>
    PREFIX zowl: <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
    PREFIX dc:   <http://purl.org/dc/elements/1.1/>

    INSERT
    {
      GRAPH <http://anzograph.com/transcriptions>
      {
        ?record <http://google.com/transcript> ?transcript .
        ?record <http://google.com/confidence> ?confidence .
        ?record <http://google.com/file> ?file .
      }
    }
    WHERE
    {
      BIND(<gs://csi-se/demo/emergency-test.mp3> as ?file)
      BIND(UUID() as ?record)
      {
        SERVICE <http://cambridgesemantics.com/services/DataToolkit>
        {
          ?data a s:HttpSource ;
            s:selector "results.alternatives" ;
            s:url "https://speech.googleapis.com/v1p1beta1/speech:recognize" ;
            s:authorization [ a s:BearerToken ; s:token """ya29...""" ] ;
            s:content """
              {
                "config": {
                  "encoding":"MP3",
                  "sampleRateHertz": 16000,
                  "languageCode": "en-US",
                  "enableWordTimeOffsets": false
                },
                "audio": { "uri":"gs://csi-se/demo/emergency-test.mp3"}
              }
            """ ;
            ?confidence ("confidence") ;
            ?transcript ("transcript") .
        }
      }
    }
The example below includes the header and content properties to send a request that contains small text snippets for sentiment analysis.
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX s:    <http://cambridgesemantics.com/ontologies/DataToolkit#>
    PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
    PREFIX ont:  <http://cambridgesemantics.com/ontologies/Sentiment_Analysis#>

    INSERT
    {
      GRAPH <http://anzograph.com/sentiment>
      {
        ?requirement a ont:Sentiment ;
          ont:p_Sentiment_Type ?sentiment ;
          ont:p_Sentiment_Score ?polarity .
      }
    }
    WHERE
    {
      ?requirement a <http://cambridgesemantics.com/Requirements> ;
        <http://cambridgesemantics.com/Requirements.reqText> ?requirement_text .
      SERVICE TOPDOWN <http://cambridgesemantics.com/services/DataToolkit>
      {
        ?data a s:HttpSource ;
          s:url "https://text-analysis12.p.rapidapi.com/sentiment-analysis/api/v1.1" ;
          s:method "POST" ;
          s:header [ s:name "Accept" ; s:value "application/json" ] ,
                   [ s:name "X-RapidAPI-Key" ; s:value "key" ] ,
                   [ s:name "X-RapidAPI-Host" ; s:value "text-analysis12.p.rapidapi.com" ] ;
          s:contentType "application/json" ;
          s:content """{ "text": "{{?requirement_text}}" , "language": "english" }""" ;
          ?polarity ("aggregate_sentiment/compound" xsd:double) ;
          ?sentiment () .
      }
    }
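
None of the examples above exercise the s:form property described in the options table. The following is a minimal sketch of a form post, assuming an echo-style endpoint that returns the submitted fields under a top-level form key. The field names, the selector value, and the response layout are assumptions made for illustration and have not been verified against a live service.

```sparql
PREFIX s: <http://cambridgesemantics.com/ontologies/DataToolkit#>

SELECT ?firstName ?lastName
WHERE
{
  SERVICE <http://cambridgesemantics.com/services/DataToolkit>
  {
    ?data a s:HttpSource ;
      s:url "https://postman-echo.com/post" ;
      s:method "POST" ;
      # The options table states that s:contentType must accompany s:form.
      s:contentType "multipart/form-data" ;
      # Each name-value pair becomes one form field.
      s:form [ s:name "firstName" ; s:value "Mary" ] ,
             [ s:name "lastName" ; s:value "Barry" ] ;
      # Assumed response layout: { "form": { "firstName": ..., "lastName": ... } }
      s:selector "form" ;
      ?firstName ("firstName") ;
      ?lastName ("lastName") .
  }
}
```

Multiple form fields are written here as a comma-separated list of blank nodes, mirroring how the sentiment analysis example lists several s:header values.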