Generating a Graph and Ontology with a Direct Load Step

With no mapping required, Anzo's Direct Load Step functionality automatically generates a graph and an ontology (model) for a data source. Using a relatively simple SPARQL query, the direct load option invokes the Graph Data Interface (GDI) RDF and Ontology Generators. The GDI Generators recognize the structure of a data source and automatically generate the necessary statements.

Invoking the Generators is preferable to producing a hand-written query, especially when the structure of the data is very complex, such as a JSON data source with many inner repeating structures or a database with many tables and keys. When the source contains complex structures, the GDI will generate only the required statements and avoid cross-products, optimizing query execution and memory usage. In addition, the GDI Generator parallelizes the load across the AnzoGraph cluster so that a data source (such as a database) can be ingested with a single query.

This topic provides details about invoking the GDI RDF and Ontology Generators. The Generators can be used with all of the supported data source types.

How to Use the GDI Generator in a Graphmart

To invoke the GDI Generator in a data layer, you add a Direct Load Step to the layer. In the Direct Load Step, you compose a SPARQL query that incorporates the GDI Generator parameters as detailed below in GDI Generator Query Syntax.

For instructions on adding steps to layers, see Adding Steps to Data Layers.

Why Use a Direct Load Step?

It is important to use a Direct Load Step with the RDF and Ontology Generators because it is the only step type that can manage the generated ontologies (models). An ontology generated in a Direct Load Step is automatically registered in Anzo, and the registered model is linked to and managed by the data layer that contains the step. If an Ontology Generator query changes, additional Direct Load Steps are added to the same layer, or the underlying source schema changes, the managed model is automatically updated when the graphmart is reloaded or refreshed. See Managed Model Details below for important information about layer-managed models.

Managed Model Details

Though an ontology that is generated in a Direct Load Step is registered in Anzo and is available for viewing in the Model editor, the model is owned and managed by the data layer that contains the Direct Load Step. That means any manual changes made to the model outside of the step, such as from the Model editor, will be overwritten any time the graphmart or layer is refreshed or reloaded. Do not modify generated managed models except by editing (or adding) Direct Load Step queries.

There is only one managed model per layer. If you include multiple Direct Load Steps in the same layer, they will all update the same ontology. This functionality can be useful if you want to align the data and generated model across multiple steps. If you have multiple sources that are not intended to align or update the same model, create separate layers.

If you delete a layer that includes a managed model, the model is also deleted. Use caution when referencing a managed model outside of a graphmart. For example, if you create a dataset and reference a managed model when you select the ontology, the reference will break if the data layer that manages the model is deleted.

GDI Generator Query Syntax

The following query syntax shows the structure of a GDI Generator query. The clauses, patterns, and placeholders are described in the reference that follows the syntax.

# PREFIX Clause
PREFIX s:    <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

# Result Clause
INSERT {
   GRAPH ${targetGraph} {
     ?s ?p ?o .
  }
}
WHERE
{
	SERVICE <http://cambridgesemantics.com/services/DataToolkit>
    {
      ?data a s:source_type ;
        # Based on the source_type, additional connection and input parameters
        # are available. The options below are valid for all sources. For
        # source-specific options, see Source-Specific Properties in the GDI
        # property reference.
        s:url "string" ;
        [ s:model "class_name_for_this_source" ; ]
        [ s:username "string" ; ]
        [ s:password "string" ; ]
        [ s:timeout int ; ]
        [ s:maxConnections int ; ]
        [ s:batching boolean | int ; ]
        [ s:paging [ pagination_options ] ; ]
        [ s:concurrency int | [ list_of_properties ] ; ]
        [ s:rate int | "string" ; ]
        [ s:locale "string" ; ]
        [ s:sampling int ; ]
        [ s:selector "string" | [ list ] ; ]
        [ s:key ("string") ; ]
        [ s:reference [ s:model "string" ; s:using ("string") ] ; ]
        [ s:formats [ datatype_formatting_options ] ; ]
        [ s:normalize boolean | [ source_normalization_rules ] ; ]
        [ s:count ?variable ; ]
        [ s:offset int ; ]
        [ s:limit int ; ] .

      # Multiple data sources can be merged if they project a similar set
      # of output variables. Make sure each source has a unique subject variable.
  
        [ ?unique_variable a s:source_type ;
          ...
        . ]

      ?rdf a s:RdfGenerator, s:OntologyGenerator ;
        s:as (?s ?p ?o);
        s:ontology ontology_uri ;
        s:base base_uri ;
        [ s:normalize boolean | [ global_normalization_rules ] ; ]
        .
      # Additional clauses such as BIND, FILTER
   }
}

For readability, the parameter names below omit the s: prefix and its base URI <http://cambridgesemantics.com/ontologies/DataToolkit#>. As the examples show, however, queries must include the s: prefix or the full property URI.

Each entry below lists the option name and its data type, followed by a description.
PREFIX Clause N/A The PREFIX clause declares the standard and custom prefixes for GDI Generator queries. Generally, queries include the following prefixes (or a subset of them) plus any data-specific declarations:
PREFIX s:    <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
Result Clause N/A The result clause for Direct Load Steps is typically an INSERT query with the following graph pattern:
INSERT {
   GRAPH ${targetGraph} {
     ?s ?p ?o .
  }
}

It is important to include the GRAPH keyword and the target graph parameter ${targetGraph} when writing an INSERT query. Anzo automatically replaces the ${targetGraph} parameter with the appropriate target URI(s) when the query runs.

source_type object The ?data a s:source_type triple pattern specifies the type of data source that the query runs against. For example, ?data a s:DbSource specifies that the source type is a database. The list below describes the available types:
  • DbSource to connect to any type of database.
  • FileSource for flat files. The supported file types are CSV and TSV, JSON, NDJSON, XML, Parquet, and SAS (SAS Transport XPT and SAS7BDAT formats). The GDI automatically determines the file type from the file extension. When querying file sources, make sure that the files are accessible to both Anzo and AnzoGraph.
  • HttpSource to connect to HTTP endpoints.
  • ElasticSource to connect to Elasticsearch indexes on an Elasticsearch server.
  • KafkaSource to connect to Kafka streaming sources.

Certain connection and input parameters are available based on the specified source type. For details about the options for your source, see Source-Specific Properties.

url string
This property specifies the URL for the data source, such as the database URL, Elasticsearch URL, or HTTP endpoint URL. For file-based sources, the url property specifies the file system location of the source file or directory of files. When specifying a directory (such as s:url "/opt/shared-files/loads/"), the GDI loads all of the file formats it recognizes. To specify a directory but limit the number or type of files that are read, you can include the pattern and/or maxDepth properties described in FileSource Properties.
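For illustration, the fragment below is a sketch (the directory path and glob value are assumptions) that limits a directory load to CSV files found at most two levels below the specified directory, using the pattern and maxDepth properties:

?data a s:FileSource ;
  # Hypothetical path; load only files that match the *.csv glob,
  # descending at most 2 directory levels below the url path.
  s:url "/opt/shared-files/loads/" ;
  s:pattern "*.csv" ;
  s:maxDepth 2 .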

For security, it is a best practice to reference connection information (such as the url, username, and password) from a Query Context so that the sensitive details are abstracted from any requests. In addition, using a Query Context makes connection details reusable across queries. See Using Query Contexts in Queries for more information. For example, the triple patterns below reference keys from a Query Context:

?data a s:DbSource ;
  s:url "{{@db.eca4bfa83481f3638b93ab5fdf93ff9a.url}}" ;
  s:username "{{@db.eca4bfa83481f3638b93ab5fdf93ff9a.user}}" ;
  s:password "{{@db.eca4bfa83481f3638b93ab5fdf93ff9a.password}}" ;
model string
This property defines the class (or table) name for the type of data that is generated from the specified data source. For example, s:model "employees". Model is optional when querying a single source. If your query targets multiple sources, however, and you want to define resource templates (primary keys) and object properties (foreign keys), you must specify the model value for each source.
username string
If authentication is required to access the source, include this property to specify the user name.
password string
This property lists the password for the given username.
timeout int
This property can be used to specify the timeout (in milliseconds) for requests against the source. For example, s:timeout 5000 configures a 5-second timeout.
maxConnections int
For database sources, this property can be used to set a limit on the maximum number of active connections to the source. For example, s:maxConnections 16 sets the limit to 16 connections. The default value is 10.
batching boolean or int
This property can be used to disable batching or to change the default batch size. By default, batching is set to 5000 (s:batching 5000). To disable batching, include s:batching false in the query. Users typically do not change the batch size; however, controlling it can be useful when performing updates. To configure the size, include s:batching int in the query. For example, s:batching 3000.
paging RDF list
This property can be used to configure paging so that the GDI can access large amounts of data across a number of smaller requests. For details about the paging property, see Pagination Options.
concurrency int or RDF list
This property can be included to configure the maximum level of concurrency for the query. The value can be an integer, such as s:concurrency 8. If the value is an integer, it configures a maximum limit on the number of slices that can execute the query. For finer-grained control over the number of nodes and slices to use, concurrency can also be included as an object with limit, nodes, and/or executorsPerNode properties. For example, the following object configures a concurrency model that allows a maximum of 24 executors distributed across 4 nodes with 8 executors per node:
s:concurrency [
  s:limit 24 ;
  s:nodes 4 ;
  s:executorsPerNode 8 ;
] ;
rate int or string
This property can be included to control the frequency with which a request is sent to the source. The limit applies to the number of requests a single slice can make. If you specify an integer for the rate, then the value is treated as the maximum number of requests to issue per minute. If you specify a string, you have more flexibility in configuring the rate. The sample values below show the types of values that are supported:
s:rate "90/minute" ;
s:rate "90 per minute" ;
s:rate "200000 every week" ;
s:rate "10000 every 6 hours" ;

To enforce the rate limit, the GDI introduces a sleep between requests that is equal to the rate delay. The more executing slices, the longer the rate delay needs to be to enforce the limit in aggregate.

Given the example of s:rate "90/minute", the GDI would optimize the concurrency and only use 1 slice for execution with a rate delay of 666ms between requests. If s:rate "240/minute", the GDI would use 3 executors with a rate delay of 750ms between requests.

locale string
This property can be used to specify the locale to use when parsing locale-dependent data such as numbers, dates, and times.
sampling int
This property can be used to configure the number of records in the source to examine for data type inferencing.
selector string or RDF list
This property can be used as a binding component to identify the path to the source objects. For example, s:selector "Sales.SalesOrderHeader" targets the SalesOrderHeader table in the Sales schema. For more information about binding components and the selector property, see Using Binding Trees and Selector Paths.
key string
This property can be used to define the primary key column for the source file or table. This column is leveraged in a resource template for the instances that are created from the source. For example, s:key ("EMPLOYEE_ID"). For more information about key, see Data Linking Options.
reference RDF list
This property can be used to specify a foreign key column. The reference property is an RDF list that includes the model property to list the target table and a using property that defines the foreign key column. For more information about reference, see Data Linking Options.
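As a sketch of how key and reference work together (the file paths and column names are hypothetical), the fragment below links an orders source to a customers source through a foreign key column:

?customer a s:FileSource ;
  s:model "customer" ;
  s:url "/opt/shared-files/csv/customers.csv" ;
  # CUSTOMER_ID is the primary key for customer instances.
  s:key ("CUSTOMER_ID") .

?order a s:FileSource ;
  s:model "order" ;
  s:url "/opt/shared-files/csv/orders.csv" ;
  s:key ("ORDER_ID") ;
  # CUSTOMER_ID in orders.csv is a foreign key into the customer model,
  # so the generator creates an object property from order to customer.
  s:reference [ s:model "customer" ; s:using ("CUSTOMER_ID") ] .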
formats RDF list
To give users control over the data types that are used when coercing strings to other types, this property can be included in GDI queries to define the desired types. In addition, it can be used to describe the formats of date and time values in the source to ensure that they are recognized and parsed to the appropriate date, time, and/or dateTime values. For details about the formats property, see Data Type Formatting Options.
normalize RDF list
To give users control over the labels and URIs that are generated, the GDI offers several options for normalizing the model and/or the fields that are created from the specified data source(s). For details about the normalize property, see Normalization Options.
count variable
If you want to turn the query into a COUNT query, you can include this property with a ?variable to perform a count. For example, s:count ?count.
offset int
This property can be used to offset the data that is returned by a number of rows.
limit int
You can include this property to limit the number of results that are returned. s:limit maps to the SPARQL LIMIT clause.
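As a sketch (the file path is hypothetical), the fragment below combines offset and limit to read a window of 1,000 rows starting after row 5,000:

?data a s:FileSource ;
  s:url "/opt/shared-files/csv/sales.csv" ;
  # Skip the first 5000 rows, then return at most 1000 rows.
  s:offset 5000 ;
  s:limit 1000 .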
RdfGenerator object Include this property to invoke the RDF Generator. If you only want to generate a model without RDF, you can exclude RdfGenerator.
OntologyGenerator object Include this property to invoke the Ontology Generator. If you only want to generate RDF without a model, you can exclude OntologyGenerator.
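The two generators can be invoked independently. In the minimal sketch below (the base URI is hypothetical), only the RDF Generator is listed, so instance data is generated without registering a model:

# Omitting s:OntologyGenerator produces RDF but no ontology.
?rdf a s:RdfGenerator ;
  s:as (?s ?p ?o) ;
  s:base <http://cambridgesemantics.com/data/> .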
as N/A This property provides the variable bindings for the RDF Generator's projection to RDF. Typically the value is s:as (?s ?p ?o) to match the variables in the result clause.
ontology URI This property specifies the URI to use as the base URI for any generated ontology artifacts. For example, s:ontology <http://abc.com/ontologies/MyOntology>.

In the graphmart, the data layer ID is appended to the ontology URI that is generated. The complete URI is based on the layer and cannot be customized.

base URI This property specifies the base URI for instance data. The base value should NOT end in #. The Generator will add a trailing slash (/) if one does not exist. For example, s:base <http://abc.com/>.

GDI Generator Example Queries

This section includes sample queries that may be useful as a starting point for writing your own RDF and Ontology Generator queries.

Basic Query that Generates RDF and Ontology for a JSON File

PREFIX s: <http://cambridgesemantics.com/ontologies/DataToolkit#>

INSERT {
   GRAPH ${targetGraph} {
     ?s ?p ?o .
  }
}
WHERE {
   SERVICE <http://cambridgesemantics.com/services/DataToolkit> {

      ?data a s:FileSource ;
         s:model "People" ;
         s:url "/opt/shared-files/json/people.json" .

      ?rdf a s:RdfGenerator , s:OntologyGenerator ;
         s:as (?s ?p ?o) ;
         s:ontology <http://cambridgesemantics.com/ontologies/People> ;
         s:base <http://cambridgesemantics.com/data/> .
  }
}

Basic Query that Generates an Ontology for a Directory of CSV Files

PREFIX s:  <http://cambridgesemantics.com/ontologies/DataToolkit#>

INSERT {
   GRAPH ${targetGraph} {
     ?s ?p ?o .
  }
}
WHERE {
   SERVICE <http://cambridgesemantics.com/services/DataToolkit> {

      ?data a s:FileSource ;
         s:model "Sales" ;
         s:url "/opt/shared-files/csv/sales" ;
         s:format [
            s:delimiter "," ;
            s:headers true ;
            s:comment "#" ;
            s:quote "\"" ;
            s:maxColumns 22 ;
         ] .

      ?rdf a s:OntologyGenerator ;
         s:as (?s ?p ?o) ;
         s:ontology <http://cambridgesemantics.com/ontologies/Sales> ;
         s:base <http://cambridgesemantics.com/data/> .
  }
}

Query that Normalizes and Generates RDF and Ontology for a Database

PREFIX s: <http://cambridgesemantics.com/ontologies/DataToolkit#>

INSERT {
   GRAPH ${targetGraph} {
      ?s ?p ?o .
  }
}
WHERE {
   SERVICE <http://cambridgesemantics.com/services/DataToolkit> {

      ?data a s:DbSource ;
      s:url "jdbc:mysql://10.11.12.9/emrdbbig" ;
      s:username "root" ;
      s:password "sql1@#" ;
      s:normalize [
         s:model [
            s:removeStart "emr_" ;
            s:words "activity 'patient complaint' medication observation patient specialty study" ;
         ] ;
         s:field [ 
            s:removePartialPrefix true ;
            s:words "provider description start end drug complaint date medication normal code
                     observation product active dose generic route admin strength collection
                     activity home first last status first year birth death directed complex
                     period age flag gender language" ;
         ] ;
      ] .

      ?rdf a s:RdfGenerator , s:OntologyGenerator ;
      s:as (?s ?p ?o) ;
      s:ontology <http://cambridgesemantics.com/ontologies/EMR> ;
      s:base <http://cambridgesemantics.com/EMR> .
  }
}

Query with Query Context that Normalizes and Generates RDF and Ontology for a Database

The query below references a Query Context to supply the username and password for the database connection.

PREFIX s: <http://cambridgesemantics.com/ontologies/DataToolkit#>

INSERT {
   GRAPH ${targetGraph} {
      ?s ?p ?o .
  }
}
WHERE {
   SERVICE <http://cambridgesemantics.com/services/DataToolkit> {

      ?data a s:DbSource ;
         s:url "jdbc:sqlserver://localhost;databaseName=AdventureWorks2012" ;
         s:username "{{@db.username}}" ;
         s:password "{{@db.password}}" ;
         s:schema "Production", "HumanResources", "Person", "Sales", "Purchasing" ;
         s:normalize [ 
            s:model [
               s:localNamePrefix "C_" ;
               s:localNameSeparator "_" ;
               s:match [ s:pattern "(.+)Enlarged" ; s:replace "$1" ] ;
            ] ;
            s:field [
               s:localNamePrefix "P_" ;
               s:localNameSeparator "_" ;
               s:ignore "rowguid ModifiedDate" ;
               s:match (
                  [ s:pattern "(.+)GUID$" ; s:replace "$1" ]
                  [ s:pattern "(.+)ID$" ; s:replace "$1" ]
               ) ;
            ] ;
         ] .

      ?rdf a s:RdfGenerator, s:OntologyGenerator ;
         s:as (?s ?p ?o) ;
         s:ontology <http://cambridgesemantics.com/ontologies/AdventureWorks> ;
         s:base <http://cambridgesemantics.com/AdventureWorks> .
  }
}

Query for Multiple Sources that Generates RDF and Ontology with Resource Templates and Object Properties

This query also includes global normalization rules for normalizing the data across all Data Sources.

PREFIX s: <http://cambridgesemantics.com/ontologies/DataToolkit#>

INSERT {
   GRAPH ${targetGraph} {
      ?s ?p ?o .
  }
}
WHERE { 
   SERVICE <http://cambridgesemantics.com/services/DataToolkit> {

      ?event a s:FileSource ;
         s:model "event" ;
         s:url  "/opt/shared-files/csv/events.csv" ;
         s:key ("EVENT_ID") .

      ?listing a s:FileSource ;
         s:model "listing" ;
         s:url "/opt/shared-files/csv/listings.csv" ;
         s:key ("LIST_ID") ;
         s:reference [ s:model "event" ; s:using ("EVENT_ID") ; s:key ("EVENT_ID") ] .

      ?date a s:FileSource ;
         s:model "date" ;
         s:url  "/opt/shared-files/csv/event_dates.csv" ;
         s:key ("DATE_ID") ;
         s:reference [ s:model "event" ; s:using ("EVENT_ID") ; s:key ("EVENT_ID") ] .

      ?venue a s:FileSource ;
         s:model "venue" ;
         s:url "/opt/shared-files/csv/venues.csv" ;
         s:key ("VENUE_ID") ;
         s:reference [ s:model "event" ; s:using ("EVENT_ID") ; s:key ("EVENT_ID") ] .
     
      ?sale a s:FileSource ;
         s:model "sale" ;
         s:url "/opt/shared-files/csv/sales.csv" ;
         s:key ("SALE_ID") ;
         s:reference [ s:model "event" ; s:using ("EVENT_ID") ; s:key ("EVENT_ID") ] ;
         s:reference [ s:model "listing" ; s:using ("LIST_ID") ; s:key ("LIST_ID") ] .

      ?rdf a s:RdfGenerator, s:OntologyGenerator ;
         s:as (?s ?p ?o) ;
         s:ontology <http://cambridgesemantics.com/tickets> ;
         s:base <http://cambridgesemantics.com/data> ;
         s:normalize [ 
            s:all [
               s:casing s:UPPER ;
               s:localNameSeparator "_" ;
            ] ;
         ] .
  }
}
