Adding a Parquet Data Source

Follow the instructions below to create a Parquet data source. Each data source can onboard a single file or multiple files that share an identical format (schema).

If your Parquet source is regularly updated with new or changed files, you can configure the source to process the data incrementally. For details, see Configuring a CSV or Parquet Source for Incremental Processing.

  1. In the Anzo application, expand the Onboard menu and click Structured Data. Anzo displays the Data Sources screen, which lists any existing data sources.

  2. Click the Add Data Source button and select File > Parquet Data Source. Anzo opens the Create Parquet Data Source screen.

  3. Specify a name for the source in the Title field, and type an optional description in the Description field. Then click Save. Anzo saves the source and displays the Overview tab.

  4. On the Overview tab, click in the Parquet File field to make the value editable. Then click Browse to open the File Location dialog box and select the file to import.
  5. In the File Location dialog box on the left side of the screen, select the file store for the Parquet file. On the right side of the screen, navigate to the directory that contains the file to import. The screen displays the list of files in the directory.

  6. Select the file that you want to import. If you have multiple files with an identical format, you can select the Insert Wildcard option and then type a string that uses asterisks as wildcard characters to match the file names. Anzo imports all files that match the string as one file, and the pipeline creates a single job to ingest them. A wildcard can select up to 16,000 files. After typing the string, click Apply to add it to the Selected list.

    The image below shows a directory with multiple Parquet files. The events.parquet and events-2.parquet files have an identical format and can be imported as one file. The Insert Wildcard option is selected, and event* is specified to identify the two files.

  7. After selecting the file, click OK to close the File Location dialog box. Then click the checkmark icon to save the change to the Parquet File field. Anzo imports the file and generates a data model.
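
Before typing a wildcard string in step 6, it can help to preview which file names the string would select. The following is a minimal sketch in Python, run outside Anzo, and not part of the product. It assumes that Anzo's asterisk wildcard behaves like shell-style matching as implemented by the standard library's `fnmatchcase`; Anzo's actual matching rules may differ. The `preview_wildcard` helper and the `users.parquet` and `orders.parquet` file names are hypothetical and used only for illustration.

```python
from fnmatch import fnmatchcase

# Limit stated in step 6: a wildcard can select up to 16,000 files.
MAX_WILDCARD_FILES = 16_000


def preview_wildcard(filenames, pattern):
    """Return the sorted file names that a shell-style pattern would match.

    Hypothetical helper: it approximates the wildcard with fnmatchcase,
    which is an assumption about how Anzo interprets asterisks.
    """
    matches = sorted(name for name in filenames if fnmatchcase(name, pattern))
    if len(matches) > MAX_WILDCARD_FILES:
        raise ValueError(
            f"pattern {pattern!r} matches {len(matches)} files; "
            f"the limit is {MAX_WILDCARD_FILES}"
        )
    return matches


# Example directory listing; only the two events files come from the text above.
directory_listing = [
    "events.parquet",
    "events-2.parquet",
    "users.parquet",   # hypothetical
    "orders.parquet",  # hypothetical
]

selected = preview_wildcard(directory_listing, "event*")
print(selected)
```

In this sketch, `event*` selects only the two events files and leaves the others out. Matching file names does not guarantee matching schemas, so the files should still be confirmed to have an identical format before they are imported as one file.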

For information about creating or changing primary keys and foreign keys, see Assigning Primary and Foreign Keys in a Schema.

When you are ready to onboard the data to Anzo, see Onboarding Data with the Automated Workflow for next steps. Or, if you want to onboard or virtualize the source by manually writing SPARQL queries against the Graph Data Interface service, see Onboarding or Virtualizing Data with SPARQL Queries.