Query XML Files

This topic provides details about the structure to use when writing GDI queries to read or ingest data from XML files. It also includes example queries that may be useful as a starting point for writing your own GDI queries.

Query Syntax
Hierarchical Bindings and Arrays
Query Examples

Query Syntax

The following query syntax shows the structure of a GDI query for XML sources. The clauses, patterns, and placeholders that are links are described below.

# PREFIX Clause
PREFIX s:     <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:   <http://www.w3.org/2002/07/owl#>
PREFIX anzo:  <http://openanzo.org/ontologies/2008/07/Anzo#>
PREFIX zowl:  <http://openanzo.org/ontologies/2009/05/AnzoOwl#>
PREFIX dc:    <http://purl.org/dc/elements/1.1/>

# Result Clause
{ 
   [ GRAPH ${targetGraph} { ] 
   triple_patterns
 [ } ]
}
[ ${usingSources} ]

WHERE
{
   # SERVICE Clause: Include the following service call when reading or inserting data.
    SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit>

   # View SERVICE Clause: Or use the service call below when constructing a view.
    SERVICE <http://cambridgesemantics.com/services/DataToolkitView>(${targetGraph})

    { 
      ?data a s:FileSource ;
        s:url "string" ;
        [ s:options [ file_storage_connection_options ] ; ]
        [ s:pattern "string" ; ]
        [ s:maxDepth int ; ]
        [ s:format [ source_format_options ; ] ; ]
        [ s:mimetype "string" ; ]
        [ s:username "string" ; ]
        [ s:password "string" ; ]
        [ s:timeout int ; ]
        [ s:batching boolean | int ; ]
        [ s:paging [ pagination_options ; ]
        [ s:concurrency int | [ list_of_properties ] ; ]
        [ s:rate int | "string" ; ]
        [ s:locale "string" ; ]
        [ s:sampling int ; ]
        [ s:selector "string" | [ list ] ; ]
        [ s:model "string" ; ]
        [ s:key ("string") ; ]
        [ s:reference [ s:model "string" ; s:using ("string") ]
        [ s:formats [ datatype_formatting_options ] ; ]
        [ s:normalize boolean | [ normalization_rules ] ; ]
        [ s:count ?variable ; ]
        [ s:offset int ; ]
        [ s:limit int ; ]
        # Mapping variables and hierarchical bindings
        ?mapping_variable ( [ "binding" ] [ datatype ] [ "datetime_format" ] ) ;
        ... ;
        .
     # Additional clauses such as BIND, VALUES, FILTER
   }
}

For readability, the parameters below exclude the base URI <http://cambridgesemantics.com/ontologies/DataToolkit#> as well as the s: prefix. As shown in the examples, however, the s: prefix or full property URI does need to be included in queries.

Option	Type	Description
PREFIX Clause	N/A	The PREFIX clause declares the standard and custom prefixes for GDI service queries. Generally, queries include the prefixes from the query template (or a subset of them) plus any data-specific declarations.
Result Clause	N/A	The result clause defines the type of SPARQL query to run and the set of results to return, i.e., whether you want to read (SELECT or CONSTRUCT) from the source or ingest the data into Graph Lakehouse (INSERT).
GRAPH ${targetGraph}	N/A	Include the `GRAPH` keyword and target graph parameter `${targetGraph}` when writing an INSERT query to ingest data into a graphmart. Graph Lakehouse automatically populates the query with the appropriate target URIs when the query runs.
${usingSources}	N/A	Include the source graph parameter `${usingSources}` when writing a "topdown" query that passes values from the data that is in the graphmart to the data source. Graph Lakehouse automatically populates the query with the appropriate FROM clauses when the query runs. When passing literal values to the remote source, you do not need to include the source graph parameter. The SERVICE Clause description below includes more information about passing input to data sources.
SERVICE Clause	N/A	Include the SERVICE call `SERVICE [ TOPDOWN ] <http://cambridgesemantics.com/services/DataToolkit>` to invoke the GDI service when you are running a SELECT, INSERT, or CONSTRUCT query that is not creating a view. When creating a view, use the `DataToolkitView` service call, as described below in View SERVICE Clause. Include the optional TOPDOWN keyword when you want to pass input values from Graph Lakehouse to the data source. When you include TOPDOWN in the service call, it indicates that the rest of the query produces values to send to the source. In this case, the GDI makes repeated calls to pass in each of the specified values and retrieve the data that is based on those values.
View SERVICE Clause	N/A	When writing a CONSTRUCT query that creates a view of the data, include the following SERVICE call: `SERVICE <http://cambridgesemantics.com/services/DataToolkitView>(<target_graph>)`. Using the `DataToolkitView` call optimizes query execution because it tells the GDI to inspect the query and determine which filters to push to the data source. It also limits the result set and retrieves only the data that is needed, i.e., the source data is fully mapped but all of the mapped data is not necessarily returned.
url	string	This property specifies the file system location of the source file or directory of files. When specifying a directory (such as `s:url "/opt/shared-files/loads/"`), the GDI loads all of the file formats it recognizes. To specify a directory but limit the number or type of files that are read, you can include the pattern and/or maxDepth properties.
options	RDF list	If additional connection information needs to be provided to access the file storage system, include the `options` property to list any storage-specific connection parameters. See File Storage Connection Options for information about the supported properties for each storage type.
pattern	string	This property is used to specify a wildcard pattern for matching file names. For example, `s:pattern "common_prefix.xml"`. You can include one `s:pattern` property per FileSource. The GDI supports Unix file globbing syntax outside of parentheses. Within parentheses, full Java regular expression language is supported. For example, including `s:pattern "data//customer_.xml"` tells the GDI to load all files that match the pattern "customer_.xml" from any number of subdirectories under the `data` directory. Similarly `s:pattern "(\d+)/transaction_.xml"` tells the GDI to load all files that match the pattern "transaction_*.xml" in all subdirectories.
maxDepth	int	This property can be used to limit the directory traversal depth. By default, when `s:url` specifies a directory (and a `s:pattern` that limits that traversal depth is not specified), all subdirectories are processed. To process only the files in the top level directory, set maxDepth to 0 (`s:maxDepth 0`). To process the files in the top level directory plus the first-level subdirectories, set maxDepth to 1 (`s:maxDepth 1`), and so on.
format	RDF list	You can include the `format` property to add parameters that describe the source files. See File Source Format Options for details about the supported parameters.
mimetype	string	This property can be included to specify the MIME type of the data.
username	string	If authentication is required to access the source, include this property to specify the user name.
password	string	This property lists the password for the given username.
timeout	int	This property can be used to specify the timeout (in milliseconds) to use for requests against the source. For example, `s:timeout 5000` configures a 5 second timeout.
batching	boolean or int	This property can be used to disable batching, or it can be used to change the default the batch size. By default, batching is set to 5000 (`s:batching 5000`). To disable batching, you can include `s:batching false` in the query. Typically users do not change the batching size. However, it can be useful to control the batch size when performing updates. To configure the size, include `s:batching int` in the query. For example, `s:batching 3000`.
paging	RDF list	This property can be used to configure paging so that the GDI can access large amounts of data across a number of smaller requests. For details about the `paging` property, see Pagination Options.
concurrency	int or RDF list	This property can be included to configure the maximum level of concurrency for the query. The value can be an integer, such as `s:concurrency 8`. If the value is an integer, it configures a maximum limit on the number of slices that can execute the query. For finer-grained control over the number of nodes and slices to use, concurrency can also be included as an object with `limit`, `nodes`, and/or `executorsPerNode` properties. For example, the following object configures a concurrency model that allows a maximum of 24 executors distributed across 4 nodes with 8 executors per node: `s:concurrency [ s:limit 24 ; s:nodes 4 ; s:executorsPerNode 8 ; ] ;`
rate	int or string	This property can be included to control the frequency with which a request is sent to the source. The limit applies to the number of requests a single slice can make. If you specify an integer for the rate, then the value is treated as the maximum number of requests to issue per minute. If you specify a string, you have more flexibility in configuring the rate. The sample values below show the types of values that are supported: `s:rate "90/minute" ; s:rate "90 per minute" ; s:rate "200000 every week" ; s:rate "10000 every 6 hours" ;` To enforce the rate limit, the GDI introduces a sleep between requests that is equal to the rate delay. The more executing slices, the longer the rate delay needs to be to enforce the limit in aggregate. Given the example of `s:rate "90/minute"`, the GDI would optimize the concurrency and only use 1 slice for execution with a rate delay of 666ms between requests. If `s:rate "240/minute"`, the GDI would use 3 executors with a rate delay of 750ms between requests.
locale	string	This property can be used to specify the locale to use when parsing locale-dependent data such as numbers, dates, and times.
sampling	int	This property can be used to configure the number of records in the source to examine for data type inferencing.
selector	string or RDF list	This property can be used for XML path extraction to traverse nested structures and target specific data. For example, `s:selector "projects"` targets the `projects` class of data. To express a hierarchy, use dot notation. For example, `s:selector "region.state.city"` navigates a hierarchy to target `city` data. For more information about binding components and the selector property, see Hierarchical Bindings and Arrays below.
model	string	This property defines the class (or table) name for the type of data that is generated from the specified data source. For example, `s:model "employees"`. Model is optional when querying a single source. If your query targets multiple sources, however, and you want to define resource templates (primary keys) and object properties (foreign keys), you must specify the model value for each source.
key	string	This property can be used to define the primary key column for the source file or table. This column is leveraged in a resource template for the instances that are created from the source. For example, `s:key ("EMPLOYEE_ID")`. For more information about `key`, see Data Linking Options.
reference	RDF list	This property can be used to specify a foreign key column. The reference property is an RDF list that includes the `model` property to list the target table and a `using` property that defines the foreign key column. For more information about `reference`, see Data Linking Options.
formats	RDF list	To give users control over the data types that are used when coercing strings to other types, this property can be included in GDI queries to define the desired types. In addition, it can be used to describe the formats of date and time values in the source to ensure that they are recognized and parsed to the appropriate date, time, and/or dateTime values. For details about the `formats` property, see Data Type Formatting Options.
normalize	RDF list	To give users control over the labels and URIs that are generated, the GDI offers several options for normalizing the model and/or the fields that are created from the specified data source(s). For details about the `normalize` property, see Model Normalization Options.
count	variable	If you want to turn the query into a COUNT query, you can include this property with a `?variable` to perform a count. For example, `s:count ?count`.
offset	int	This property can be used to offset the data that is returned by a number of rows.
limit	int	You can include this property to limit the number of results that are returned. `s:limit` maps to the SPARQL LIMIT clause.
mapping_variable	variable	The mapping variables, in `?mapping_variable (["binding"] [datatype] ["datetime_format"])` format, define the triple patterns to output. When the specified `?variable` matches the source column name, the GDI uses the variable as the source data selector. If you specify an alternate variable name, a binding needs to be specified to map the new variable to the source. You also have the option to transform the data using the datatype and datetime_format options. See Hierarchical Bindings and Arrays below for more information about configuring mapping variables and unpacking JSON files with nested objects and arrays.
binding	string	The `binding` is a literal value that binds a ?mapping_variable to a source column. If you specify a ?variable that matches the source column name, then that variable name is the data selector and it is not necessary to specify a binding. If you specify an alternate variable name or there is a hierarchical path to the source column, then the binding is needed to map the new variable to that source column. For example for CSV, the following pattern simply binds the source column AIRLINE to the lowercase variable ?airline: `?airline ("AIRLINE")`. For FileSource, periods (.), forward slashes (/), and brackets ([ ]) are parsed as path notation. Therefore, if a source column name includes any of those characters they must be escaped in the binding. Use two backslashes (\\) as an escape character. For example, if a column name is average/day, the variable and binding pattern could be written as `?averagePerDay ("average\\/day")`.
datatype	URI	The `datatype` is the data type to convert the column to. If you do not specify a data type, the GDI infers the type. The GDI supports the following types: `xsd:int, xsd:long, xsd:float, xsd:double, xsd:boolean, xsd:time, xsd:dateTime, xsd:date, xsd:duration, xsd:dayTimeDuration, xsd:yearMonthDuration, xsd:gMonthDay, xsd:gMonth, xsd:gYearMonth, xsd:anyURI`
datetime_format	string	This option is used to specify the format to use for date and time data types. The GDI supports Java date and time formats. Specify days as "d," months as "M," and years as "y." For the time, specify "H" for hours, "m" for minutes, and "s" for seconds. For example, `"yyyyMMdd HH:mm:ss"` or `"ddMMMyy"` to display date values such as "01JAN19." The GDI's default base year is 2000. If the source data has years with only two digits, such as `02-04-99`, the GDI prepends 20 to the digits. The value `02-04-99` is parsed to `02-04-2099`. To specify an alternate base year to use for two-digit values, you can include the notation `^nnnn (e.g., ^1900)` in the format value. For example, to set the base year to 1900 instead of 2000, use a format value such as `xsd:date "dd-MMM-yy^1900"` or `xsd:date "dd-MMM-yy^1990"`. When one of those values is specified, `02-04-99` is parsed to `02-04-1999`.

Hierarchical Bindings and Arrays

When configuring the mapping variables in a query, the GDI provides syntax for unpacking XML files with nested objects and arrays. One way to express hierarchies in queries is to use brackets ( [ ] ) to group objects into binding trees. For example, the WHERE clause snippet below organizes mapping variable objects into an hourly/data hierarchy by nesting the ?data patterns inside the ?hourly [ ] tree:

WHERE
{
  SERVICE <http://cambridgesemantics.com/services/DataToolkit>
    {
      ?data a s:FileSource;
        s:url "/mnt/data/xml/weather.xml" ;
        ?latitude (xsd:double) ;
        ?longitude (xsd:double) ;
        ?timezone (xsd:string) ;
        ?hourly [
          ?data [
            ?time (xsd:long) ;
            ?rainProbability ("precipProbability" xsd:double) ;
            ?temperature (xsd:double) ;
            ?feelsLike ("apparentTemperature" xsd:double) ;
            ?windSpeed (xsd:double) ;
        ] ;
    ] .
  }
}

When constructing object binding trees, if you choose to introduce the hierarchy with a variable name that is not an exact match to the source label, include a selector property to list the value from the source. For example, in the WHERE clause snippet below, s:selector is included to select eventHeader in the source as ?event in the query and statLocation as ?location.

WHERE 
{
   SERVICE <http://cambridgesemantics.com/services/DataToolkit>
  {
      ?data a s:FileSource ;
      s:url "/mnt/data/xml/part_1.xml" ;
      ?event [
         s:selector "eventHeader" ;
           ?eventId (xsd:string) ;
           ?eventName (xsd:string) ;
           ?eventVersion (xsd:string) ;
           ?eventTime (xsd:dateTime) ;
      ] ;
      ?location [
         s:selector "statLocation" ;
           ?locationId (xsd:string) ;
           ?lineNo (xsd:int) ;
           ?statNo (xsd:int) ;
           ?statId (xsd:int) ;
      ] .
  }
}

As an alternative to grouping objects in binding trees, the selector property also supports using dot notation to specify paths. For example, the WHERE clause snippet below rewrites the first example query to express the same hourly/data hierarchy as a path in the s:selector value:

WHERE
{
  SERVICE <http://cambridgesemantics.com/services/DataToolkit>
    {
      ?data a s:FileSource;
        s:url "/mnt/data/xml/weather.xml" ;
        ?latitude (xsd:double) ;
        ?longitude (xsd:double) ;
        ?timezone (xsd:string) ;
        s:selector: "hourly.data" ;
        ?time (xsd:long) ;
        ?rainProbability ("precipProbability" xsd:double) ;
        ?temperature (xsd:double) ;
        ?feelsLike ("apparentTemperature" xsd:double) ;
        ?windSpeed (xsd:double) .
  }
}

Query Examples

The following example query ingests data from an XML file that contains hierarchies.

PREFIX s: <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX fmcsa: <http://census.gov/ontologies/FMCSA#>

INSERT {
  GRAPH <http://anzograph.com/define> {
    ?s ?p ?o
  }
}
WHERE {
  SERVICE <http://cambridgesemantics.com/services/DataToolkit> {
      ?data a s:FileSource ;
        s:url "file:///opt/shared/data/xml/define.xml" ;
 
        ?ItemGroupDef [
           ?OID (xsd:string) ;
           ?Name (xsd:string) ;
           ?Repeating (xsd:string) ;
           ?IsReferenceData (xsd:string) ;
           ?Purpose (xsd:string) ;
           ?Label (xsd:string) ;
           ?Structure (xsd:string) ;
           ?DomainKeys (xsd:string) ;
           ?Class (xsd:string) ;
           ?ArchiveLocationID (xsd:string) ;
           ?Comment (xsd:string) ;
           ?ItemRef [ 
              ?ItemOID (xsd:string) ;
              ?OrderNumber (xsd:int) ;
              ?Mandatory (xsd:string) ;
           ] ;
       ] .
   }
}