Complete the Post-Installation Configuration

This topic provides instructions for completing the required and optional post-installation configuration of Graph Lakehouse.

Adding Drivers for Custom Database Sources

Graph Lakehouse uses the Graph Data Interface (GDI) Java plugin to connect directly to data sources. The GDI plugin is included in the Graph Lakehouse installation. Also included in the installation are JDBC drivers for the following databases:

  • Databricks
  • H2
  • IBM DB2
  • Microsoft SQL Server
  • MariaDB
  • Oracle
  • PostgreSQL
  • SAP Sybase (jTDS)
  • Snowflake

To extend the GDI to access custom database sources, JDBC drivers can be added to Graph Lakehouse. To add a driver, follow the steps below.

  1. Copy the .jar file to the <install_path>/lib/udx directory on the leader server.
  2. Restart the database by running the following command. When the database is restarted, the leader broadcasts any new .jar files to the compute servers.
    sudo systemctl restart anzograph

The <install_path>/lib/udx directory on the leader node is a user-managed directory, unlike Graph Lakehouse-managed directories such as <install_path>/bin and <install_path>/internal. You can place JDBC drivers and Java or C++ extensions in the lib/udx directory at any time. Each time the database is restarted, Graph Lakehouse scans that directory, saves a copy of its contents to the <install_path>/internal/extensions directory, and then broadcasts the internal/extensions contents from the leader node to the compute nodes. Because each restart clears internal/extensions and rescans lib/udx, internal/extensions is always reloaded with the latest plugins.
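The restart-time sync described above can be illustrated with a short shell sequence. This is a simulation against throwaway directories, not the product's actual code; the driver file name is hypothetical:

```shell
# Simulate the restart-time sync between lib/udx and internal/extensions
# using throwaway directories (illustration only, not the real product code).
install_path=$(mktemp -d)
mkdir -p "$install_path/lib/udx" "$install_path/internal/extensions"

# A user drops a JDBC driver into the user-managed directory.
touch "$install_path/lib/udx/custom-jdbc-driver.jar"

# On restart: internal/extensions is cleared, lib/udx is rescanned, and the
# contents are staged for broadcast to the compute nodes.
rm -rf "$install_path/internal/extensions"
mkdir -p "$install_path/internal/extensions"
cp "$install_path/lib/udx/"*.jar "$install_path/internal/extensions/"

ls "$install_path/internal/extensions"
```

Because lib/udx is the source of truth, removing a .jar file from it and restarting also removes that plugin from internal/extensions.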

Installing the C++ Dependencies

Dependencies must be installed on all servers in the cluster to support the C++ extensions that Graph Lakehouse offers, including the remote read (load) and write service, the Data Science functions, and the integration with Apache Arrow. The installer provides a .repo file to help you configure the Cambridge Semantics repository and install the required software packages, with or without internet access.

Writing to the /etc/yum.repos.d directory requires root access permissions. See Adding, Enabling, and Disabling a YUM Repository for more information on defining and using yum repositories. The following packages are required:

libarchive
libarmadillo12
libboost_filesystem1_80_0
libboost_iostreams1_80_0
libboost_system1_80_0
libflatbuffers2
libhdfs3
libnfs13
libserd-0-0
libsmb2
shadow-utils

Install Dependencies via Internet Access to the Cambridge Semantics Repository

Follow the steps below if the Graph Lakehouse servers have external internet access and you want to install the dependencies directly from the Cambridge Semantics repository.

  1. Copy the csi-obs-cambridgesemantics-udxcontrib.repo file from the <install_path>/pre-req/yum.repos.d directory to the /etc/yum.repos.d directory. For example, the following command copies the file from the default installation path to /etc/yum.repos.d:
    sudo cp /opt/cambridgesemantics/pre-req/yum.repos.d/csi-obs-cambridgesemantics-udxcontrib.repo /etc/yum.repos.d
  2. Next, run the following command to enable the repository and install the required packages:
    sudo dnf install --enablerepo=crb $(cat <install_path>/pre-req/rh9-anzograph-requirements.txt)

    For example, on a server where Graph Lakehouse is installed in the default location:

    sudo dnf install --enablerepo=crb $(cat /opt/cambridgesemantics/pre-req/rh9-anzograph-requirements.txt)
  3. Repeat these steps on all servers in the cluster.

Install Dependencies without Internet Access via the Repository Mirror (tarball)

Follow the steps below if the Graph Lakehouse servers do not have external internet access and you want to install the dependencies from the mirrored Cambridge Semantics repository. The steps below describe how to copy the repository to each Graph Lakehouse host server and configure the .repo file accordingly. You can also choose to set up the mirror on a remote server that each of the Graph Lakehouse servers can access.

  1. From a computer that does have internet access, download the dependency tarball, csi-obs-cambridgesemantics-udxcontrib.rocky9.tar.xz, from the following Cambridge Semantics Google Cloud Storage location: https://storage.googleapis.com/csi-anzograph/udx/csi-os-contrib/rocky9/2023-03/20230321945/csi-obs-cambridgesemantics-udxcontrib.rocky9.tar.xz.

    You can run the following cURL command to download the tarball (append .sha512 to the URL to download the accompanying checksum file as well):

    curl -OL https://storage.googleapis.com/csi-anzograph/udx/csi-os-contrib/rocky9/2023-03/20230321945/csi-obs-cambridgesemantics-udxcontrib.rocky9.tar.xz
  2. Also from the computer that has internet access, download the repomd.xml.key from the following Cambridge Semantics Google Cloud Storage location: https://storage.googleapis.com/csi-rpmmd-pd/CambridgeSemantics:/UDXContrib/ubi-9/repodata/repomd.xml.key.

    You can run the following cURL command to download the file:

    curl -OL https://storage.googleapis.com/csi-rpmmd-pd/CambridgeSemantics:/UDXContrib/ubi-9/repodata/repomd.xml.key
    
  3. On each of the Graph Lakehouse servers, create a directory called /tmp/repo.
  4. Copy csi-obs-cambridgesemantics-udxcontrib.rocky9.tar.xz to the /tmp/repo directory on each server.
  5. Then run the following command to unpack the tarball in the /tmp/repo directory:
    tar -xvf csi-obs-cambridgesemantics-udxcontrib.rocky9.tar.xz

    The files are unpacked into subdirectories under /tmp/repo/dl/rocky9/csi-obs-cambridgesemantics-udxcontrib.

  6. Next, copy the repomd.xml.key file to the /tmp/repo/dl/rocky9/csi-obs-cambridgesemantics-udxcontrib directory on each of the Graph Lakehouse servers.
  7. Now, open the csi-obs-cambridgesemantics-udxcontrib.repo file in the <install_path>/examples/yum.repos.d directory. The contents of the file are shown below:
    [csi-obs-cambridgesemantics-udxcontrib]
    name=Contrib directory for CambridgeSemantics AnzoGraph UDX dependencies
    baseurl=https://storage.googleapis.com/csi-rpmmd-pd/CambridgeSemantics:/UDXContrib/ubi-9
    gpgkey=https://storage.googleapis.com/csi-rpmmd-pd/CambridgeSemantics:/UDXContrib/ubi-9/repodata/repomd.xml.key
    gpgcheck=1
    enabled=1
  8. Edit the csi-obs-cambridgesemantics-udxcontrib.repo file contents to replace the baseurl and gpgkey values so that they point to the repo files that you unpacked in the /tmp/repo directory. In addition, change the gpgcheck and enabled values from 1 to 0. The contents of the updated file are shown below:
    [csi-obs-cambridgesemantics-udxcontrib]
    name=Contrib directory for CambridgeSemantics AnzoGraph UDX dependencies
    baseurl=file:///tmp/repo/dl/rocky9/csi-obs-cambridgesemantics-udxcontrib
    gpgkey=file:///tmp/repo/dl/rocky9/csi-obs-cambridgesemantics-udxcontrib/repomd.xml.key
    gpgcheck=0
    enabled=0
  9. Save and close the file.
  10. Copy the edited csi-obs-cambridgesemantics-udxcontrib.repo file from the <install_path>/examples/yum.repos.d directory to the /etc/yum.repos.d directory. For example, the following command copies the file from the default installation path to /etc/yum.repos.d:
    sudo cp /opt/cambridgesemantics/examples/yum.repos.d/csi-obs-cambridgesemantics-udxcontrib.repo /etc/yum.repos.d
  11. Next, run the following command to enable the repository and install the required packages:
    sudo dnf install --enablerepo=crb $(cat <install_path>/pre-req/rh9-anzograph-requirements.txt)

    For example, on a server where Graph Lakehouse is installed in the default location:

    sudo dnf install --enablerepo=crb $(cat /opt/cambridgesemantics/pre-req/rh9-anzograph-requirements.txt)

Repeat the steps above as needed to install the dependencies on all servers in the cluster.
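The manual .repo edits in steps 8 and 9 above can also be scripted if you are configuring many servers. The following sed sketch edits a temporary copy seeded with the original contents for illustration; to edit the real file, point repo_file at the copy in <install_path>/examples/yum.repos.d instead:

```shell
# Sketch: apply the step 8 edits with sed instead of a text editor.
# This edits a temporary copy seeded with the original contents; set
# repo_file to the file in <install_path>/examples/yum.repos.d to edit
# the real one.
repo_file=$(mktemp)
cat > "$repo_file" <<'EOF'
[csi-obs-cambridgesemantics-udxcontrib]
name=Contrib directory for CambridgeSemantics AnzoGraph UDX dependencies
baseurl=https://storage.googleapis.com/csi-rpmmd-pd/CambridgeSemantics:/UDXContrib/ubi-9
gpgkey=https://storage.googleapis.com/csi-rpmmd-pd/CambridgeSemantics:/UDXContrib/ubi-9/repodata/repomd.xml.key
gpgcheck=1
enabled=1
EOF

mirror=file:///tmp/repo/dl/rocky9/csi-obs-cambridgesemantics-udxcontrib
sed -i \
    -e "s|^baseurl=.*|baseurl=$mirror|" \
    -e "s|^gpgkey=.*|gpgkey=$mirror/repomd.xml.key|" \
    -e 's|^gpgcheck=1|gpgcheck=0|' \
    -e 's|^enabled=1|enabled=0|' \
    "$repo_file"

grep -E '^(baseurl|gpgkey|gpgcheck|enabled)=' "$repo_file"
```

If your mirror is unpacked somewhere other than /tmp/repo, adjust the mirror variable to match.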

Optimizing the Linux Kernel Configuration for Graph Lakehouse

To streamline the configuration of the operating system for peak Graph Lakehouse performance, the installer includes a tuned profile that you can activate. Tuned is a daemon program that uses the udev device monitor to statically and dynamically tune operating system settings based on the specified profile. It is highly recommended that you activate the Graph Lakehouse tuned profile.

For an overview of the tuned daemon and more information on using the tuned service to improve the performance of specific workloads, see Tuned and Performance Tuning with Tuned and Tuned-ADM in the Red Hat Performance Tuning Guide.

Activating the Tuned Profile

The profile, called azg, is in the <install_path>/examples/tuned-profile directory and consists of two files: tuned.conf and additional-tuneables.sh. For details about the files, see Tuned Profile Reference below.

To activate the azg profile, follow the steps below. Complete these steps on all servers in the cluster:

  1. If you ran the installer in sudo mode, you can skip this step. The installer copied the tuned profile to the /etc/tuned directory but did not automatically activate the profile. If you ran the installer as a non-root user, copy the azg directory from <install_path>/examples/tuned-profile to the /etc/tuned directory. For example, the following command copies azg from the default installation path to /etc/tuned:
    sudo cp -r /opt/cambridgesemantics/examples/tuned-profile/azg /etc/tuned
  2. The tuned package is installed by default on RHEL 9. If you are using Rocky Linux 9, you may need to install it. You can run the following command to install the program:
    sudo dnf install -y tuned
  3. Run the following command to activate the azg profile:
    sudo tuned-adm profile azg

The host servers are now configured to use the tuned profile that is optimal for Graph Lakehouse.

To disable tuned profiles, you can run sudo tuned-adm off. After running the command, no tuned profiles will be active.

Tuned Profile Reference

This section describes the tuned profile files and the kernel configuration changes that they apply.

tuned.conf

The table below describes the Linux kernel configuration settings that are modified by tuned.conf.

vm.dirty_ratio
    Specifies the percentage of system memory that can be occupied by "dirty" data before flushing the cache to disk. Dirty data are pages in memory that have been updated and do not match what is stored on disk.
    AZG profile change: Reduces vm.dirty_ratio to 2% to increase the frequency with which the system cache is flushed.

vm.swappiness
    Controls the tendency of the kernel to move processes out of physical memory and onto the swap disk. A value of 0 means the kernel avoids swapping processes out of physical memory for as long as possible. A value of 100 tells the kernel to aggressively swap processes out of physical memory to the swap disk.
    AZG profile change: Sets vm.swappiness to 30.

vm.max_map_count
    Sets the limit on the maximum number of memory map areas a process can use. Since Graph Lakehouse is memory intensive, it may reach the default maximum map count of 65535 and be shut down by the operating system.
    AZG profile change: Increases vm.max_map_count to 2097152.

net.ipv4.tcp_rmem
    Controls the size of the receive buffer for TCP connections by setting the minimum, default, and maximum sizes of the buffer in bytes.
    AZG profile change: Sets tcp_rmem to "4096 87380 16777216".

net.ipv4.tcp_wmem
    Controls the size of the send buffer for TCP connections by setting the minimum, default, and maximum sizes of the buffer in bytes.
    AZG profile change: Sets tcp_wmem to "4096 16384 16777216".

net.ipv4.udp_mem
    Controls the amount of memory that can be allocated for the kernel's UDP buffer by setting the minimum, default, and maximum sizes of the buffer in bytes.
    AZG profile change: Sets udp_mem to "3145728 4194304 16777216".

transparent_hugepages
    Controls whether Transparent Huge Pages (THP) is enabled or disabled system-wide. When THP is enabled system-wide, it can dramatically degrade Graph Lakehouse performance.
    AZG profile change: Disables THP by setting transparent_hugepages to never.
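Collected into tuned.conf syntax, the settings above correspond to a profile along these lines. This is a sketch for orientation only; the shipped azg profile in <install_path>/examples/tuned-profile is authoritative and may differ in detail:

```ini
[main]
summary=Kernel settings tuned for Graph Lakehouse (AnzoGraph)

[sysctl]
vm.dirty_ratio=2
vm.swappiness=30
vm.max_map_count=2097152
net.ipv4.tcp_rmem="4096 87380 16777216"
net.ipv4.tcp_wmem="4096 16384 16777216"
net.ipv4.udp_mem="3145728 4194304 16777216"

[vm]
transparent_hugepages=never

[script]
script=${i:PROFILE_DIR}/additional-tuneables.sh
```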

additional-tuneables.sh

The additional-tuneables.sh script is called by tuned.conf and configures the following Linux kernel configuration settings so that they are optimal for Graph Lakehouse.

overcommit_memory
    Controls whether obvious overcommits of the address space are allowed.
    AZG profile change: Sets overcommit_memory to 0 to ensure that very large overcommits are not allowed but some overcommitting can be used to reduce swap usage.

overcommit_ratio
    Controls the percentage of memory that is allowed to be used for overcommits.
    AZG profile change: Sets overcommit_ratio to 50%.

transparent_hugepage/defrag
    Though the tuned profile disables Transparent Huge Pages (THP) system-wide, this setting controls whether huge pages can still be enabled on a per-process basis (inside MADV_HUGEPAGE madvise regions).
    AZG profile change: Sets transparent_hugepage/defrag to madvise so that the kernel only assigns huge pages to individual process memory regions that are specified with the madvise() system call.

tcp_timestamps
    Controls whether TCP timestamps are enabled or disabled.
    AZG profile change: Sets tcp_timestamps to 0, disabling TCP timestamps to reduce performance spikes related to timestamp generation.
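The effect of the script can be sketched as plain writes to /proc/sys and /sys. This is an illustration of what the settings above amount to, not the shipped additional-tuneables.sh; the SYS_PREFIX indirection exists only so the sketch can be dry-run against a scratch directory instead of the live kernel:

```shell
# Illustrative equivalent of additional-tuneables.sh (a sketch, not the
# shipped script). It dry-runs against a scratch directory; dropping the
# prefix (and running as root) would apply the values for real.
SYS_PREFIX=$(mktemp -d)

write_setting() {
    path="$SYS_PREFIX$1"
    mkdir -p "$(dirname "$path")"
    echo "$2" > "$path"
}

write_setting /proc/sys/vm/overcommit_memory 0             # modest overcommit only
write_setting /proc/sys/vm/overcommit_ratio 50             # cap overcommits at 50%
write_setting /sys/kernel/mm/transparent_hugepage/defrag madvise
write_setting /proc/sys/net/ipv4/tcp_timestamps 0          # disable TCP timestamps

cat "$SYS_PREFIX/proc/sys/vm/overcommit_ratio"   # 50
```

Changes made this way do not persist across reboots on their own; activating the tuned profile is what reapplies them at boot.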

Non-Root Installs: Configuring and Starting the Graph Lakehouse Services

If the installer was run in sudo mode, it automatically created the Graph Lakehouse systemd services in the /etc/systemd/system directory. If Graph Lakehouse is already running, see Connecting to Graph Lakehouse for next steps.

If the installer was run as the platform service user and not with sudo privileges, the last step in the post-installation configuration is to implement the Graph Lakehouse systemd services and start the database. It is important to set up the services to run as the service user so that Graph Lakehouse can access files on the shared file system. In addition, the services are configured to tune user resource limits (ulimits) as well as set $JAVA_HOME so that Graph Lakehouse can find the OpenJDK installation.

The service files are included in the <install_path>/examples/systemd-services directory. Follow the instructions below to configure and start the services.

  1. Configure the System Management Service
  2. Configure the Database Service on the Leader Server (and Single-Servers)

Configure the System Management Service

The system management daemon, azgmgrd, is a very lightweight program that runs on all Graph Lakehouse servers and manages communication between the system manager and the database as well as between the nodes in a cluster. Follow the steps below to configure and start the service that runs the azgmgrd process.

  1. Open the azgmgrd.service file in the <install_path>/examples/systemd-services directory. The contents of the file are shown below.

    The following contents are from an installation that used the default installation path, /opt/cambridgesemantics. The contents of your file may differ. Also, note the User=anzograph line; the value needs to be edited to replace anzograph with the platform service user name.

    [Unit]
    Description=AnzoGraph communication service
    # depends on NetworkManager-wait-online.service enabled
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    Type=forking
    # The PID file is optional, but recommended in the manpage
    # "so that systemd can identify the main process of the daemon"
    #PIDFile=/var/run/azgmgrd.pid
    WorkingDirectory=/opt/cambridgesemantics/anzograph
    StandardOutput=syslog
    StandardError=syslog
    LimitCPU=infinity
    LimitNOFILE=4096
    LimitAS=infinity
    LimitNPROC=infinity
    LimitMEMLOCK=infinity
    LimitLOCKS=infinity
    LimitFSIZE=infinity
    User=anzograph
    UMask=007
    Environment=PATH=$PATH:/opt/cambridgesemantics/anzograph/bin:/opt/cambridgesemantics/anzograph/tools/bin
    Environment=JAVA_HOME=/usr/lib/jvm/java-21-openjdk-21.0.1.0.12-3.el9.x86_64
    Environment=UDX_LOGS=/opt/cambridgesemantics/anzograph/internal/logs
    Environment=HYPER_PATH=/opt/cambridgesemantics/anzograph/vendor/com.tableau/hyper/lib/hyper
    ExecStart=/opt/cambridgesemantics/anzograph/bin/azgmgrd /opt/cambridgesemantics/anzograph
    
    CPUAccounting=false
    MemoryAccounting=false
    [Install]
    WantedBy=multi-user.target
    Alias=sbxmgrd.service
  2. In the following line of the file, replace anzograph with the name of the platform service user.
    User=anzograph

    For example:

    User=anzo
  3. Save and close the file.
  4. Copy azgmgrd.service from the <install_path>/examples/systemd-services directory to the /usr/lib/systemd/system directory. For example, the following command copies azgmgrd.service from the default installation path to /usr/lib/systemd/system:
    sudo cp /opt/cambridgesemantics/examples/systemd-services/azgmgrd.service /usr/lib/systemd/system
  5. Run the following commands to start and enable the service:
    sudo systemctl start azgmgrd.service
    sudo systemctl enable azgmgrd.service
  6. Repeat this process on all servers in the cluster.

The azgmgrd daemon must be running in order to start the database, but it typically does not need to be restarted unless you are upgrading Graph Lakehouse or the host servers are rebooted. It does not need to be stopped and started each time the database is restarted.
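If you are configuring many servers, the User= edit in step 2 can be scripted. The following sed sketch works on a temporary file seeded with a minimal [Service] section for illustration; to edit the real file, point service_file at azgmgrd.service in <install_path>/examples/systemd-services:

```shell
# Sketch: set the User= value in a systemd service file with sed.
# Edits a temporary stand-in here; set service_file to the real
# azgmgrd.service in <install_path>/examples/systemd-services to apply.
service_file=$(mktemp)
printf '[Service]\nUser=anzograph\nUMask=007\n' > "$service_file"

service_user=anzo    # replace with your platform service user name
sed -i "s|^User=.*|User=$service_user|" "$service_file"

grep '^User=' "$service_file"   # User=anzo
```

The same one-line sed substitution applies to the anzograph.service file described in the next section.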

Configure the Database Service on the Leader Server (and Single-Servers)

The anzograph service runs the database process and is configured to start after azgmgrd. Start the database only on the leader server; the leader connects to the system managers on the compute servers and starts the database across the cluster.

  1. Open the anzograph.service file in the <install_path>/examples/systemd-services directory. The contents of the file are shown below.

    The following contents are from an installation that used the default installation path, /opt/cambridgesemantics. The contents of your file may differ. Also, note the User=anzograph line; the value needs to be edited to replace anzograph with the platform service user name.

    [Unit]
    Description=AnzoGraph database service
    After=azgmgrd.service
    Wants=azgmgrd.service
    
    [Service]
    Type=oneshot
    # The PID file is optional, but recommended in the manpage
    # "so that systemd can identify the main process of the daemon"
    #PIDFile=/var/run/azg.pid
    WorkingDirectory=/opt/cambridgesemantics/anzograph
    StandardOutput=syslog
    StandardError=syslog
    User=anzograph
    UMask=027
    RemainAfterExit=yes
    Environment=PATH=$PATH:/opt/cambridgesemantics/anzograph/bin:/opt/cambridgesemantics/anzograph/tools/bin
    ExecStart=/opt/cambridgesemantics/anzograph/bin/azgctl -start
    ExecStop=/opt/cambridgesemantics/anzograph/bin/azgctl -stop
    
    [Install]
    WantedBy=multi-user.target
    Alias=gqe.service
  2. In the following line of the file, replace anzograph with the name of the platform service user.
    User=anzograph

    For example:

    User=anzo
  3. Save and close the file.
  4. Copy anzograph.service from the <install_path>/examples/systemd-services directory to the /usr/lib/systemd/system directory. For example, the following command copies anzograph.service from the default installation path to /usr/lib/systemd/system:
    sudo cp /opt/cambridgesemantics/examples/systemd-services/anzograph.service /usr/lib/systemd/system
  5. Run the following commands to start and enable the new service:
    sudo systemctl start anzograph.service
    sudo systemctl enable anzograph.service

Once the services are in place and enabled, Graph Lakehouse should be running. To stop and start the database from the command line, run the following systemctl commands on the leader node (you do not need to stop and start azgmgrd):

sudo systemctl stop anzograph
sudo systemctl start anzograph 

For instructions on configuring the connection to Graph Lakehouse in Graph Studio, see Connecting to Graph Lakehouse.

See Securing an AnzoGraph Environment for recommendations on securing Graph Lakehouse environments.