Schedule Automated Data Updates

Some data update operations must be performed on a regular or periodic basis, such as retrieving updates from external data sources or exporting data. Graph Lakehouse provides a CRON-like mechanism to perform these repetitive operations automatically. The operations are managed entirely within the database rather than through external control files.

There are two primary aspects to creating and configuring automated or scheduled operations within Graph Lakehouse:

Create and define the contents of one or more Cron graphs, each of which specifies the database operations to perform for one or more Cron jobs. Each Cron graph is defined as a collection of RDF triples, with each triple specifying a particular scheduled job attribute or parameter. The Cron graph also includes configuration settings that control other aspects of each scheduled job, such as the job's scheduled execution time (particular dates and times or intervals), retry options, and error handling policies.
Update the scheduled Cron graph job settings in the Graph Lakehouse settings.conf file to include the Cron graphs you want to execute. The settings.conf file contains two settings to control the scheduling and execution of Cron graphs, cron_graphs and cron_graphs_recheck.

This topic provides instructions for setting up automated database operations and describes the configuration options and best practices available to control the scheduling, prioritization, error handling, and other aspects of running jobs.

Create a Cron Graph
Load a Cron Graph
Configure Graph Lakehouse to Run Cron Jobs
Monitor Job Execution and Errors

Create a Cron Graph

A Cron graph is defined in a TTL file that contains a collection of RDF triples that define configuration and scheduling information for one or more Cron jobs. A Cron graph can contain any number of Cron jobs, and each job can have custom scheduling and error-handling policies.

The following shows the syntax for a Cron graph file. Each job parameter is described in the table that follows the syntax.

# filename.ttl

PREFIX azg: <http://www.anzograph.com/> .
 
<job_name> azg:Schedule | Delay "<duration_value>"^^xsd:duration ;
           azg:Statement "<statement>" ;
           azg:ErrorPolicy "<policy>" ;
           [ azg:BaseTime "<datetime_value>"^^xsd:dateTime ; ]
           [ azg:RetryInterval "<duration_value>"^^xsd:duration ; ]
           [ azg:RetryCount <integer_value> ; ]
           [ azg:RunAfterStartup "true | false" ] .
 
[ <job2>   azg:Schedule | Delay "<duration_value>"^^xsd:duration ;
           azg:Statement "<statement>" ;
           azg:ErrorPolicy "<policy>" ;
           ...
]

If any required triples are missing or invalid, the associated Cron graph job is rejected and returns an error. (See Monitor Job Execution and Errors.)

Parameter	Description
Schedule | Delay	Each job is required to include either azg:Schedule or azg:Delay. Both parameters accept an `xsd:duration` data type value, as described in xsd:duration in the W3C specification. If BaseTime is also specified, the azg:Schedule duration is added to the BaseTime value to produce the job's next scheduled execution time. If azg:Delay is specified, the execution of the associated job is delayed by the specified interval (in seconds) from the time the job last completed. These options set "scheduled request times," not guaranteed start times. If the system is busy enough that a given job has multiple outstanding requested start times, only the last one is executed. If Graph Lakehouse is stopped and subsequently restarted, Cron jobs return to their normal scheduled interval times. For example, for nightly jobs scheduled for execution at midnight, a skipped midnight job is not performed until midnight of the next day.
Statement	This required parameter defines the database operation (any valid SPARQL statement) to perform when the corresponding job is executed. You can specify multiple statements by separating them with a double semicolon (;;), e.g., `statement1;;statement2`. Each statement is executed as a separate transaction following ACID principles. Specifying SPARQL statements in a separate file rather than as a text string (for example, <file:/path/job1.rq>) is not currently supported.
ErrorPolicy	All jobs require an azg:ErrorPolicy parameter, which specifies what happens when Graph Lakehouse encounters an error while processing the job. Each job must contain exactly one error policy. The valid policy values are as follows. AbortDatabase: The most conservative policy. Any critical job failure produces a crash-dump Xray. Ignore: The most liberal policy. The error information is recorded in the `sth_errors` system table and included in manually generated Xrays, but no feedback is returned while the job is processed. Disable: If an error occurs, this value directs Graph Lakehouse not to attempt to run the Cron job again. BlockUsers: Similar to AbortDatabase, this policy causes all subsequent user-issued SELECT queries to error out with a "Cron failed, contact your system administrator" message. To unblock users from running SELECT queries, you can restart the database or run a `SET selects_blocked TO false` query, for example: `azgi -c "SET selects_blocked TO false"`. If RetryInterval is specified for a job, error policy actions are postponed until the number of retries reaches the RetryCount value.
BaseTime	This optional parameter specifies the time to use as the base for other timing-related settings. If azg:BaseTime is not specified, Graph Lakehouse's start time is used to determine the Cron job's first start time. For example, if Graph Lakehouse was started at 2 PM on Sunday, May 12, and `azg:Schedule "1Day"^^xsd:duration` was set, then the job would run every day at 2 PM.
RetryInterval	This optional parameter specifies the duration to wait before retrying the job if the job errors out. The job will be continually retried until the first success or until the RetryCount value is reached. Afterwards, the job returns to its normal scheduled time.
RetryCount	This optional parameter specifies the number of times to retry a job if it errors out. If azg:RetryCount is specified, RetryInterval must also be specified. When the number of retries reaches the retry count, the specified ErrorPolicy for this job is performed. If the Ignore error policy is specified, the associated job resumes its normal schedule time.
RunAfterStartup	This optional parameter accepts a "true" or "false" value that indicates whether the associated job should run shortly after Graph Lakehouse startup. If azg:RunAfterStartup is set to "true", the azg:Schedule value is ignored.
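
The interaction between BaseTime, Schedule, and the restart behavior described in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Graph Lakehouse's implementation: the function name `next_scheduled_time` is hypothetical, durations are modeled as Python `timedelta` values rather than parsed from `xsd:duration` literals, and requested times are assumed to fall at base + k * schedule for k = 1, 2, and so on. The year 2019 is also an assumption, chosen because May 12 fell on a Sunday that year.

```python
from datetime import datetime, timedelta
from typing import Optional


def next_scheduled_time(now: datetime,
                        schedule: timedelta,
                        base_time: Optional[datetime] = None,
                        startup_time: Optional[datetime] = None) -> datetime:
    """Return the first requested run time strictly after `now`.

    Assumption: requested times are base + k * schedule (k >= 1), where base
    is azg:BaseTime if given, otherwise the server start time.
    """
    if schedule <= timedelta(0):
        raise ValueError("Schedule must be greater than 0")
    base = base_time if base_time is not None else startup_time
    if base is None:
        raise ValueError("either base_time or startup_time is required")
    if now < base:
        # Assumption: the first run happens one full interval after the base.
        return base + schedule
    elapsed_periods = (now - base) // schedule  # whole intervals already passed
    return base + (elapsed_periods + 1) * schedule


# Server started Sunday, May 12 at 2 PM; the job is scheduled daily.
started = datetime(2019, 5, 12, 14, 0)
print(next_scheduled_time(datetime(2019, 5, 14, 9, 30),
                          timedelta(days=1),
                          startup_time=started))
```

Because the result is always aligned to the base time, a run that was missed while the server was down is not made up later; the job simply resumes at its next aligned slot, which matches the skipped-midnight example in the table.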

Example Cron Graph File

The following content provides a simple example of a Cron graph file, named cron1.ttl, which schedules two jobs in the same graph:

PREFIX azg: <http://www.anzograph.com/> .
<job1> azg:BaseTime "2020-04-07:11:32"^^xsd:dateTime .
<job1> azg:Schedule "1Day"^^xsd:duration .
<job1> azg:ErrorPolicy "AbortDatabase" .
<job1> azg:RetryInterval "1Hour"^^xsd:duration .
<job1> azg:RetryCount 23 .
<job1> azg:Statement "REFRESH VIEW <testView1>" .
<job2> azg:BaseTime "2020-07-08:00:00"^^xsd:dateTime .
<job2> azg:Schedule "1Day"^^xsd:duration .
<job2> azg:ErrorPolicy "Ignore" .
<job2> azg:Statement "REFRESH VIEW <testView2>" .

In this example, the subject of each triple identifies the job: job1 holds the scheduling and configuration of one scheduled job, and job2 holds the scheduling and configuration of a second job. Each predicate specifies a particular attribute or parameter of the job.
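
If Cron graph files are generated by a script rather than written by hand, the one-triple-per-line layout of the example is easy to produce. The following Python sketch is illustrative only: `render_job` is a hypothetical helper, and it simply reproduces the literal forms used in the example file (typed literals such as `"1Day"^^xsd:duration`, plain string literals, and bare integers) without validating them.

```python
from typing import Dict, List, Tuple, Union

# A value is a plain string, an integer, or a (lexical form, datatype) pair.
Value = Union[str, int, Tuple[str, str]]


def render_job(name: str, attributes: Dict[str, Value]) -> List[str]:
    """Render one triple per attribute, in the style of the example file."""
    lines = []
    for predicate, value in attributes.items():
        if isinstance(value, tuple):
            lexical, datatype = value
            obj = f'"{lexical}"^^{datatype}'
        elif isinstance(value, int):
            obj = str(value)
        else:
            # Escape backslashes and double quotes inside string literals.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{name}> azg:{predicate} {obj} .")
    return lines


job2 = render_job("job2", {
    "BaseTime": ("2020-07-08:00:00", "xsd:dateTime"),
    "Schedule": ("1Day", "xsd:duration"),
    "ErrorPolicy": "Ignore",
    "Statement": "REFRESH VIEW <testView2>",
})
print("\n".join(job2))
```

The printed lines match the job2 triples in the example file; the PREFIX line still has to be written at the top of the file.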

Each Cron graph is assigned a different Cron thread. The Cron thread acts as a "virtual user" that evaluates when to run the next job defined within the same graph. Each Cron thread runs only one job at a time per graph. If two jobs are scheduled for the same time, they are run sequentially. To execute Cron jobs concurrently, you can define Cron jobs in separate graphs, since jobs in different graphs are run using different Cron threads. For example, you could create one graph named "quickjobs" that defines many shorter jobs and create another graph that runs longer-executing jobs. Then the jobs from the two graphs could be run concurrently.
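
The threading model described above can be pictured with a small Python analogy. It is a sketch of the idea only, not of Graph Lakehouse internals: `run_graphs` is a hypothetical function, jobs are modeled as plain callables, and scheduling times are ignored so that only the ordering behavior is visible.

```python
import threading
from typing import Callable, Dict, List


def run_graphs(graphs: Dict[str, List[Callable[[], None]]]) -> None:
    """Run each graph's jobs on its own thread, one job at a time per graph."""

    def worker(jobs: List[Callable[[], None]]) -> None:
        for job in jobs:  # jobs in the same graph never overlap
            job()

    threads = [
        threading.Thread(target=worker, args=(jobs,), name=f"cron-{graph}")
        for graph, jobs in graphs.items()
    ]
    for thread in threads:
        thread.start()  # different graphs proceed concurrently
    for thread in threads:
        thread.join()
```

In this picture, a long-running job delays only the jobs that share its graph, which is why splitting quick and long-running jobs into separate graphs keeps the quick ones on schedule.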

Load a Cron Graph

Once you have created a Cron graph file, you load the Cron graph into Graph Lakehouse using the following LOAD command:

LOAD <file:/<path>/<filename>.ttl> INTO GRAPH <graph_name>

For example:

LOAD <file:/tmp/cron1.ttl> INTO GRAPH <CronGraph1>

In this example, the triples stored in the cron1.ttl file are loaded into a graph named CronGraph1. It is this name, CronGraph1, that is added to the cron_graphs setting in <install_path>/config/settings.conf to run the scheduled jobs defined in CronGraph1. More details about configuring Graph Lakehouse to run Cron jobs are included in Configure Graph Lakehouse to Run Cron Jobs.

As an alternative to specifying the graph name as part of the LOAD query, you can specify the name of the Cron graph within the triples file. For example:

PREFIX azg: <http://www.anzograph.com/> .

GRAPH <CronGraph1> {

  <job1> azg:BaseTime "2020-04-07:11:32"^^xsd:dateTime .
  ...
  <job1> azg:Statement "REFRESH VIEW <testView1>" .
  <job2> azg:BaseTime "2020-07-08:00:00"^^xsd:dateTime .
  ...
  <job2> azg:Statement "REFRESH VIEW <testView2>" .
}

You could then load the Cron graph using the following LOAD command:

LOAD <file:/path/cron1.ttl>

Configure Graph Lakehouse to Run Cron Jobs

To configure Graph Lakehouse to run the jobs within Cron graphs, edit the <install_path>/config/settings.conf configuration file to specify values for the following two settings:

cron_graphs: A comma-separated list of the Cron graph names to enable. For example, cron_graphs=CronGraph1, CronGraph2.
cron_graphs_recheck: The interval (in seconds) to wait before rechecking the cron_graphs value for changes, that is, new or deleted graph names. For example, cron_graphs_recheck=300.

If a Cron graph is non-existent or empty, the associated Cron thread periodically checks at the specified interval whether the named graph is now loaded and has new jobs.

After changing settings.conf, restart Graph Lakehouse to apply the configuration changes.
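
When the cron_graphs value is maintained by a deployment script, adding a graph name without duplicating it is a small string operation. The Python sketch below assumes the `key=value` form shown in the examples above; `add_cron_graph` is a hypothetical helper, and it works on a single line of text rather than reading or writing settings.conf itself.

```python
def add_cron_graph(setting_line: str, graph_name: str) -> str:
    """Return the cron_graphs line with graph_name added if it is missing."""
    key, _, value = setting_line.partition("=")
    if key.strip() != "cron_graphs":
        raise ValueError("expected a cron_graphs setting line")
    # Split the comma-separated list and drop empty entries.
    names = [name.strip() for name in value.split(",") if name.strip()]
    if graph_name not in names:
        names.append(graph_name)
    return "cron_graphs=" + ", ".join(names)


print(add_cron_graph("cron_graphs=CronGraph1", "CronGraph2"))
```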

Monitor Job Execution and Errors

Details about scheduled job runs are logged to the following Graph Lakehouse system tables:

System Table	Logging Details
sth_query	SPARQL statements executed from jobs are logged to this table. To identify Cron job queries, look for the text cron: in the `label` column.
sth_cron_events	Activities related to execution of Cron jobs by their associated Cron threads are logged to this table. You can monitor this table for failed entries in the `event` column and take corrective action based on the failures.
sth_cron_graph	All scans of the Cron graphs (including Cron graph refreshes) are logged to this table.
sth_errors	All errors arising from scheduled job execution are logged to this table. If an error is caused by one of a Cron graph's job configuration settings, the `basic_text` column value will begin with Cron:. The Cron Graph Errors section below includes a list of Cron graph related errors.

You can query Graph Lakehouse's system tables using SPARQL queries in the following format:

SELECT * | list_of_variables
WHERE { table 'table_name' }

For example:

SELECT *
WHERE { table 'sth_cron_events' }
LIMIT 100
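
For routine monitoring, the query text can be assembled by a small script and handed to the `azgi -c` command shown earlier in this topic. The Python sketch below is illustrative: `system_table_query` and `azgi_command` are hypothetical helpers, the variable list is passed through exactly as given, and the command is only constructed, not executed, so any connection options your environment needs are left out.

```python
import re
from typing import List, Optional, Sequence


def system_table_query(table: str,
                       variables: Optional[Sequence[str]] = None,
                       limit: Optional[int] = None) -> str:
    """Build a system table query in the format shown above."""
    # Accept only simple identifiers so the table name cannot break the query.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    projection = " ".join(variables) if variables else "*"
    lines = [f"SELECT {projection}", f"WHERE {{ table '{table}' }}"]
    if limit is not None:
        lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


def azgi_command(query: str) -> List[str]:
    """Return an argument list that could be passed to subprocess.run."""
    return ["azgi", "-c", query]


print(system_table_query("sth_cron_events", limit=100))
```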

Entries in the sth_cron_events and sth_errors system tables are, by default, also spooled to disk so that they are incorporated into Crashdumps and Xrays.

Cron Graph Errors

The table below lists the errors that can be returned during Cron job processing.

Cron Graph Error	Error Message
CronInvalidPredicate	"Cron: Invalid predicate"
CronOneDuration	"Cron: Multiple durations are being requested"
CronOneStatement	"Cron: Multiple statements are being requested"
CronOneFirstTime	"Cron: Multiple base/first times are being requested"
CronMissingStatement	"Cron: Missing Statement to execute"
CronMissingErrorPolicy	"Cron: Missing ErrorPolicy to execute"
CronConflictingErrorPolicy	"Cron: ErrorPolicy must be 'Ignore' if no RetryCount is specified"
CronUnknownErrorPolicy	"Cron: Unknown ErrorPolicy"
CronSingleErrorPolicy	"Cron: Only a single ErrorPolicy allowed per subject"
CronSingleStatement	"Cron: Only a single Statement allowed per subject"
CronSingleFirstTime	"Cron: Only a single FirstTime allowed per subject"
CronMustBeLiteralNotIRI	"Cron: Object must be a literal, cannot be an IRI"
CronWrongType	"Cron: Object wrong type"
CronStatementFailed	"Cron: Statement failed to execute, see system table sth_errors for more information"
CronRetryCountIncons	"Cron: Specifying an RetryCount requires a RetryInterval"
CronRetryCountPos	"Cron: RetryCount must be greater than 0"
CronIntervalPos	"Cron: Schedule, Delay, RetryInterval must be greater than 0"
CronMissingRetryCount	"Cron: Retry requires a RetryCount unless ErrorPolicy is Ignore"
CronBlockingUsers	"All SELECTS blocked, contact your system administrator" To unblock users from running SELECT queries, you can restart the database or run a `SET selects_blocked TO false` query.
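
Several of these errors can be caught before a Cron graph is loaded by checking each job's parameters in advance. The Python sketch below covers only a subset of the table, and the condition attached to each error name is inferred from its message text and the parameter descriptions in this topic rather than taken from Graph Lakehouse itself. `check_job` is a hypothetical helper that models a job as a dictionary of parameter names and values.

```python
from typing import Dict, List

VALID_PREDICATES = {"Schedule", "Delay", "Statement", "ErrorPolicy",
                    "BaseTime", "RetryInterval", "RetryCount", "RunAfterStartup"}
VALID_POLICIES = {"AbortDatabase", "Ignore", "Disable", "BlockUsers"}


def check_job(job: Dict[str, object]) -> List[str]:
    """Return the names of the errors this job definition would likely raise."""
    errors = []
    if any(predicate not in VALID_PREDICATES for predicate in job):
        errors.append("CronInvalidPredicate")
    if "Schedule" in job and "Delay" in job:
        errors.append("CronOneDuration")  # inferred: only one duration allowed
    if "Statement" not in job:
        errors.append("CronMissingStatement")
    policy = job.get("ErrorPolicy")
    if policy is None:
        errors.append("CronMissingErrorPolicy")
    elif policy not in VALID_POLICIES:
        errors.append("CronUnknownErrorPolicy")
    has_count = "RetryCount" in job
    has_interval = "RetryInterval" in job
    if has_count and not has_interval:
        errors.append("CronRetryCountIncons")
    if has_count and int(job["RetryCount"]) <= 0:
        errors.append("CronRetryCountPos")
    if has_interval and not has_count and policy != "Ignore":
        errors.append("CronMissingRetryCount")
    return errors


job1 = {"BaseTime": "2020-04-07:11:32", "Schedule": "1Day",
        "ErrorPolicy": "AbortDatabase", "RetryInterval": "1Hour",
        "RetryCount": 23, "Statement": "REFRESH VIEW <testView1>"}
print(check_job(job1))
```

A pre-check like this does not replace the sth_errors table, which remains the authoritative record of what the server rejected.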