Onboarding Structured Data
There are three ways to onboard structured and semi-structured data to Anzo:
- Automated ETL Pipeline Workflow
- Automated Direct Data Load Workflow
- Manual Ingestion with the Graph Data Interface
Automated ETL Pipeline Workflow
If the Spark component is installed, you can onboard data with Anzo's built-in pipelines, which automate the traditional extract, transform, and load (ETL) process. When a pipeline is generated, Anzo automatically creates the data model, mappings, and ETL jobs required to ingest the source. These pipelines natively support CSV, JSON, XML, SAS, and Parquet files, as well as all common database connections, including SQL databases such as Oracle, MySQL, Hive, and others.
How to get started with onboarding data via ETL pipelines
- The first step in onboarding data using the automated ETL workflow is to connect Anzo to data sources. See Adding Data Sources.
- Then see Ingesting Data Sources via ETL Pipelines for next steps.
Automated Direct Data Load Workflow
If Spark is not installed or you prefer not to use the ETL pipeline workflow, you can use another automated workflow that follows an extract, load, and transform (ELT) process. In the ELT workflow, data sources are onboarded directly to graphmarts, and data layers with SPARQL queries are automatically generated to transform and blend the data into an analytics-ready knowledge graph. The AnzoGraph Graph Data Interface (GDI) Java plugin (sometimes called the Data Toolkit) connects to the sources, creates a model, and generates the data layer queries. The direct data load workflow supports all of the data sources that the automated ETL ingestion process supports.
How to get started with the direct data load workflow
- The first step in onboarding data using the automated direct data load workflow is to connect Anzo to data sources. See Adding Data Sources.
- Then see Directly Loading Data Sources via Graphmarts for next steps.
Manual Ingestion with the Graph Data Interface
For advanced users who are familiar with SPARQL, the GDI can also be invoked by writing queries from scratch. The GDI is extremely flexible, allowing you to connect directly to sources in queries and control all aspects of the extract, load, and transform process. In addition to the data sources that the two automated workflows support, you can also onboard raw data and data from HTTP/REST endpoints with manually written GDI queries.
How to get started onboarding data manually with GDI queries
To get started writing GDI queries for manual data onboarding, see Onboarding or Virtualizing Data with the Graph Data Interface.
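A manually written GDI query typically wraps the source connection in a SPARQL SERVICE call and maps the extracted values in an INSERT template. The sketch below is hypothetical and for illustration only: the file path, column names, model URIs, and target graph URI are invented, and the ontology prefix and service URI should be confirmed against the GDI documentation for your Anzo version.

```sparql
# Hypothetical sketch: onboard a CSV file with a manually written GDI query.
PREFIX s:   <http://cambridgesemantics.com/ontologies/DataToolkit#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

INSERT {
  GRAPH <http://example.com/graphs/employees> {   # hypothetical target graph
    ?subject a <http://example.com/model/Employee> ;
      <http://example.com/model/name>     ?Name ;
      <http://example.com/model/hireDate> ?HireDate .
  }
}
WHERE {
  # The GDI is invoked as a SPARQL SERVICE call.
  SERVICE <http://cambridgesemantics.com/services/DataToolkit> {
    ?data a s:FileSource ;
      s:url "/opt/shared-files/employees.csv" ;   # hypothetical file path
      ?Name     (xsd:string) ;                    # hypothetical source columns
      ?HireDate (xsd:date) .
    # Mint a stable subject IRI for each row.
    BIND(IRI(CONCAT("http://example.com/employees/",
                    ENCODE_FOR_URI(?Name))) AS ?subject)
  }
}
```

Connecting to a database or an HTTP/REST endpoint follows the same pattern with a different source type and connection properties in place of s:FileSource and s:url.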
For instructions on importing files that are in RDF format (Turtle or N-Triple), see Creating a Dataset from RDF Files.