Directly Loading a Data Source (Direct Load Step)

With no mapping required, a Direct Load Step can be used to automatically generate a graph and model for a data source. The Direct Load Step is the only type of step with the ability to manage generated models. A model that is generated by a Direct Load Step is automatically registered in Anzo and is linked to and managed by the layer that contains the step. If a query is changed, additional Direct Load Steps are added to the same layer, or the underlying source schema changes, the managed model is automatically updated when the graphmart is reloaded or refreshed. Follow the steps below to create a Direct Load Step.

  1. Go to the graphmart for which you want to add a step and then click the Data Layers tab.
  2. On the Data Layers tab, find the layer that you want to add the step to. Click the menu icon for that layer and select Add Step/View. The Add Step/View dialog box is displayed with the New tab selected.

  3. To create a new Direct Load Step, select Direct Load Step and then click OK. If you want to clone an existing step, click the Existing Steps tab, select the step that you want to clone, and then click OK. Anzo creates or clones the step and displays the Details tab.

  4. On the Details tab, configure the following options as needed:
    • Title: The required name of the step.
    • Description: An optional short description of the step.
    • Enabled: When creating a new step, the Enabled option is selected by default, indicating that the step is enabled and will run when the layer is loaded or refreshed. If you want to disable the step so that it is not processed, clear the Enabled checkbox.
    • Source: The source data that this step should act upon. Steps can build upon the data generated by steps in other layers or can be self-contained, applying changes that relate only to the data defined in the layer that contains this step. You can select any number of the following options:
      • Self: This option is selected by default and means that the step runs against only the data that is generated in the layer this step belongs to.
      • All Previous Layers Within Graphmart: This option means that the step runs against the data that is generated by all of the successful layers that precede the layer this step is in. Any failed layers are ignored.
      • Previous Layer Within Graphmart: This option means that the step runs against only the data that is generated by the one layer that immediately precedes the layer this step is in.
      • Layer Name: The Source drop-down list also includes options for specific layer names. You can choose a specific layer to act upon the data in that layer only.
    • Data Models: This optional field specifies the model or models to associate with this step. By default, Managed is selected. If you are onboarding a source that does not have a model, make sure Managed remains specified so that the step generates a model. See Important Notes about Managed Models below for more information about managed models.

      The Data Models list displays all of the available models. By default, the field is set to Exclude System Data. If you want to choose a system model, click the toggle button on the right side of the field to change it to Include System Data. When system data is included, the drop-down list displays the system models in addition to the user-generated models.

    • Pre-Run Generate Statistics: This option controls whether to initiate AnzoGraph's internal statistics gathering queries before running the query to pre-compile. The statistics gathering helps ensure that the AnzoGraph query planner generates ideal query execution plans for queries that are run against the graphmart.
  5. When you have finished configuring the Details tab, click the Query tab. This tab defines the query that this step should run.

  6. Write the query for the step. Typically, Direct Load Step queries are GDI RDF and Ontology Generator queries. Using a relatively simple SPARQL query, the GDI Generators recognize the structure of a data source and automatically generate the necessary statements. Invoking the Generators is preferable when the structure of the data is very complex, such as a JSON data source with many inner repeating structures or a database with many tables and keys. When the source contains complex structures, only the required statements are generated, avoiding cross-products and optimizing query execution and memory usage. For details about writing GDI Generator queries, see GDI Generator Query Syntax.

    If your query connects to a source that requires input of connection and authorization information, Cambridge Semantics recommends that you do not include the connection and authorization values directly in the query. Instead, replace those values with Context Variables from a Query Context. You can access Context Providers for each data source from the step's Query Context tab. For detailed information about query contexts and referencing variables in a query, see Using Query Contexts in Queries.

  7. When you have finished writing the query, click Save to save the step configuration.

Once the Details tab is configured and the query is written, the step can be run. For information about running this step conditionally by setting up an execution condition, see Defining Execution Conditions.
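
The Source options described in step 4 can be pictured with a small Python sketch. This is an illustrative model only, not Anzo code: the `Layer` class, the `resolve_sources` function, and the layer names are hypothetical. The sketch also assumes that Previous Layer Within Graphmart selects the immediately preceding layer regardless of whether it succeeded, which the description above does not state.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for a data layer in a graphmart."""
    name: str
    succeeded: bool = True


def resolve_sources(layers, current, options):
    """Return the names of the layers whose data a step would read.

    layers  -- layers of the graphmart, in load order
    current -- index of the layer that contains the step
    options -- selected Source options; any value other than the three
               built-in option names is treated as a specific layer name
    """
    chosen = set()
    for option in options:
        if option == "Self":
            chosen.add(current)
        elif option == "All Previous Layers Within Graphmart":
            # Failed layers are ignored.
            chosen.update(i for i in range(current) if layers[i].succeeded)
        elif option == "Previous Layer Within Graphmart":
            # Assumption: the preceding layer is used even if it failed.
            if current > 0:
                chosen.add(current - 1)
        else:
            chosen.update(
                i for i, layer in enumerate(layers) if layer.name == option
            )
    return [layers[i].name for i in sorted(chosen)]


layers = [
    Layer("Raw"),
    Layer("Cleanup", succeeded=False),
    Layer("Enrich"),
    Layer("Report"),
]

# A step in the "Report" layer (index 3) with different Source selections.
print(resolve_sources(layers, 3, ["Self"]))
print(resolve_sources(layers, 3, ["All Previous Layers Within Graphmart"]))
print(resolve_sources(layers, 3, ["Self", "Previous Layer Within Graphmart"]))
```

Because any number of options can be selected, the sketch combines the selections into one set of layers rather than treating them as mutually exclusive.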

Important Notes about Managed Models

Though an ontology that is generated in a Direct Load Step is registered in Anzo and is available for viewing in the Model editor, the model is owned and managed by the data layer that contains the Direct Load Step. That means any manual changes made to the model outside of the step, such as from the Model editor, will be overwritten any time the graphmart or layer is refreshed or reloaded. Do not modify generated managed models except by editing (or adding) Direct Load Step queries.

There is only one managed model per layer. If you include multiple Direct Load Steps in the same layer, they will all update the same ontology. This functionality can be useful if you want to align the data and generated model across multiple steps. If you have multiple sources that are not intended to align or update the same model, create separate layers.
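
The two notes above (manual edits are overwritten, and all Direct Load Steps in a layer share one model) can be illustrated with a minimal Python sketch. It is a conceptual model only, not how Anzo is implemented: the `ManagedLayer` class, its methods, and the class and property names are hypothetical, and combining step output as a simple union is an assumption made for illustration.

```python
class ManagedLayer:
    """Hypothetical stand-in for a data layer that owns one managed model."""

    def __init__(self, name):
        self.name = name
        self.step_outputs = {}   # step title -> {class name: set of properties}
        self.managed_model = {}  # class name -> set of properties

    def set_step(self, title, generated):
        """Record what a Direct Load Step would generate for the model."""
        self.step_outputs[title] = generated

    def refresh(self):
        """Rebuild the single managed model from every step's output.

        The model is rebuilt from scratch, so anything that did not come
        from a step (for example, a manual edit) is lost.
        """
        model = {}
        for generated in self.step_outputs.values():
            for cls, props in generated.items():
                model.setdefault(cls, set()).update(props)
        self.managed_model = model
        return model


layer = ManagedLayer("Onboarding")
layer.set_step("Load customers", {"Customer": {"id", "name"}})
layer.set_step("Load orders", {"Order": {"id", "total"}, "Customer": {"email"}})
layer.refresh()

# Both steps contributed to the same model.
print(sorted(layer.managed_model["Customer"]))

# A manual change made outside the steps does not survive a refresh.
layer.managed_model["Customer"].add("nickname")
layer.refresh()
print("nickname" in layer.managed_model["Customer"])
```

In this model, the only durable way to change the managed model is to change what a step generates, which mirrors the guidance to edit or add Direct Load Step queries rather than editing the model directly.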

If you delete a layer that includes a managed model, the model is also deleted. Use caution when referencing a managed model outside of a graphmart. For example, if you create a dataset and reference a managed model when you select the ontology, the reference will break if the data layer that manages the model is deleted.