Ingesting a New Data Source

Follow the instructions below to set up the Ingest workflow for a new Data Source that has not been previously ingested. The procedure focuses on configuring the workflow to generate a new Model in addition to the Mappings, ETL jobs, and Dataset Pipeline that are needed to ingest the data into Anzo, convert it to the graph data model, and make it available for inclusion in a Graphmart.

For information about initial Data Source creation, see Adding Data Sources and Schemas.

For instructions on ingesting an updated Data Source, see Re-Ingesting an Updated Data Source. If the Data Source has an associated Metadata Dictionary that you want to apply to the workflow, see Ingesting a Data Source with a Metadata Dictionary.

  1. In the Anzo application, expand the Onboard menu and click Structured Data. Anzo displays the Data Sources screen, which lists any existing Data Sources. For example:

  2. On the Data Sources screen, click the name of the Data Source for which you want to ingest data. Anzo displays the Tables screen for the source. For example:

  3. Click the Ingest button. If the source has more than one Schema, Anzo displays a dialog box that prompts you to select the schema. In the drop-down list, select the schema to use, and then click OK. For example:

    Anzo opens the Ingest dialog box and automatically populates the Data Source Connection value. If there is only one configured Data Store, the Anzo Data Store value is also auto-populated. In addition, if a default ETL Engine is configured for the system, the Auto Map Engine Config field is also populated (see Configure the Default ETL Engine). For example, in the image below, the Anzo Data Store field is not populated because there are multiple available choices, and the Auto Map Engine Config field is populated because the Local Sparkler Engine is configured as the default ETL Engine:

  4. If necessary, click the Anzo Data Store field and select the Data Store for this pipeline. For information about creating an Anzo Data Store, see Creating an Anzo Data Store.
  5. If necessary, click the Auto Map Engine Config field and select the ETL engine to use for this pipeline.
  6. By default, the Select all tables radio button is enabled to ingest the data for all tables in the Schema. If you do not want to add all tables, click the Custom select radio button and then select each of the tables to add.
  7. By default, the Ingest workflow is configured to generate a new Model in addition to the Mappings and jobs that are needed to onboard the data. You can click Save to save the configuration and proceed with the Model and Pipeline generation. If you want to customize the URI that is generated for the new Model or the class and property URIs in the Model, click Advanced to expand the screen and view the following options:

    The list below describes the options:

    • Schema Ontology URI: The URI for the Model. When this field is blank, Anzo generates the Model URI with the following format:
      http://cambridgesemantics.com/ont/autogen/xx/<schema_name>

      Where xx is a hash snippet based on the model's globally unique identifier (GUID). If you want to specify a different format, you can type that URI into the Schema Ontology URI field. For example, a URI such as http://mycompany.com.ontology/movies results in a model URI of http://mycompany.com.ontology/movies.

      Make sure that the Schema Ontology URI is unique. If the URI is not unique, this Model will overwrite any existing Model that uses the same URI.

    • Schema Class Prefix: The URI prefix format to use for classes in the Model. When this field is blank, Anzo generates class URIs using the following format:
      http://cambridgesemantics.com/ont/autogen/xx/<schema_name>#<class_name>

      Where xx is a hash snippet based on the model's GUID. If you want to specify a different format for class URIs, type the prefix to use in this field. For example, a prefix such as http://mycompany.com.ontology/class results in class URIs like http://mycompany.com.ontology/class#<class_name>.

      Since you are specifying a prefix format, and the class name will be appended to the prefix, it is permissible to set Schema Class Prefix to the same value across schemas.

    • Schema Property Prefix: The URI prefix format to use for properties in the Model. When this field is blank, Anzo generates property URIs using the following format:
      http://cambridgesemantics.com/ont/autogen/xx/<schema_name>#<class_name>_<property_name>

      Where xx is a hash snippet based on the model's GUID. If you want to specify a different format for property URIs, type the prefix to use in the Schema Property Prefix field. For example, a prefix such as http://mycompany.com.ontology/property results in property URIs like http://mycompany.com.ontology/property#<class_name>_<property_name>.

      Since you are specifying a prefix format, and the property name will be appended to the prefix, it is permissible to set Schema Property Prefix to the same value across schemas.

    • Transform Property Names: Converts property names to all uppercase or all lowercase letters. To transform names, select the Transform Property Names checkbox, and then select the To lowercase radio button to convert property names to lowercase or the To UPPERCASE radio button to convert them to uppercase. For an illustration of how the URI and transform options above combine, see the sketch that follows this list.
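
    The following Python sketch is a minimal illustration of how the default and custom URI options described above combine. It is not Anzo code: how Anzo derives the xx hash snippet from the model's GUID is not documented here, so the hash_snippet function, like every function and parameter name in the sketch, is an assumption made for illustration only.

      import hashlib
      import uuid

      AUTOGEN_BASE = "http://cambridgesemantics.com/ont/autogen"

      def hash_snippet(model_guid):
          # Stand-in for the "xx" snippet based on the model's GUID (assumption).
          return hashlib.sha1(model_guid.encode("utf-8")).hexdigest()[:2]

      def default_prefix(schema_name, model_guid):
          # Default autogen prefix: http://cambridgesemantics.com/ont/autogen/xx/<schema_name>
          return f"{AUTOGEN_BASE}/{hash_snippet(model_guid)}/{schema_name}"

      def model_uri(schema_name, model_guid, ontology_uri=None):
          # Schema Ontology URI: a custom value is used verbatim when supplied.
          return ontology_uri or default_prefix(schema_name, model_guid)

      def class_uri(schema_name, model_guid, class_name, class_prefix=None):
          # Schema Class Prefix: <prefix>#<class_name>
          prefix = class_prefix or default_prefix(schema_name, model_guid)
          return f"{prefix}#{class_name}"

      def property_uri(schema_name, model_guid, class_name, property_name,
                       property_prefix=None, transform=None):
          # Schema Property Prefix: <prefix>#<class_name>_<property_name>
          # transform models the Transform Property Names option: "lower" or "upper".
          if transform == "lower":
              property_name = property_name.lower()
          elif transform == "upper":
              property_name = property_name.upper()
          prefix = property_prefix or default_prefix(schema_name, model_guid)
          return f"{prefix}#{class_name}_{property_name}"

      guid = str(uuid.uuid4())
      print(model_uri("movies", guid, ontology_uri="http://mycompany.com.ontology/movies"))
      # http://mycompany.com.ontology/movies
      print(class_uri("movies", guid, "Film"))
      # http://cambridgesemantics.com/ont/autogen/<xx>/movies#Film
      print(property_uri("movies", guid, "Film", "ReleaseYear", transform="lower"))
      # http://cambridgesemantics.com/ont/autogen/<xx>/movies#Film_releaseyear
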
  8. If you changed any advanced options, click Save. Anzo creates a Pipeline and generates the Model and Mappings according to the options that you specified.
  9. In the main navigation menu under Onboard, click Structured Data. Then click the Pipelines tab.
  10. Click the name of the Pipeline that you created. Anzo displays the Pipeline Overview screen. For example:

  11. If you would like to see the jobs that Anzo created for this Pipeline, click the Jobs tab. The jobs are listed on the left side of the screen. A job exists for each of the tables that were imported. If this Pipeline has not been published previously, the right side of the screen remains blank. After the jobs are run, selecting a job from the list displays its history on the right. For example, the image below shows a new pipeline that has not been published:

    This image shows an example of a pipeline that has been published previously and has job history:

  12. To run all of the jobs, click the Publish All button at the top of the screen. To publish a subset of the jobs, select the checkbox next to each job that you want to run, and then click the Publish button above the list of jobs. Anzo runs the Pipeline and generates the resulting file-based linked data set in a new subdirectory under the specified Anzo Data Store.

When the Pipeline finishes, this run of the Pipeline becomes the Managed Edition. The Managed Edition always contains the latest successfully published data for all of the jobs in the Pipeline. If one or more of the jobs failed, those jobs are excluded from the Edition. If you publish the failed jobs at a later date or you create and publish additional jobs in the Pipeline, the data from those jobs is also added to the Managed Edition. For more information about Editions, see Managing Dataset Editions.

The new Dataset also becomes available in the Dataset catalog. From the catalog, you can generate graph Data Profiles and create Graphmarts. See Blending Data for next steps.

Related Topics