Complete the Post-Installation Configuration

This topic provides instructions for completing the required and optional post-installation configuration of Graph Lakehouse.

Adding Drivers for Custom Database Sources

Graph Lakehouse uses the Graph Data Interface (GDI) Java plugin to connect directly to data sources. The GDI plugin is included in the Graph Lakehouse installation. Also included in the installation are JDBC drivers for the following databases:

  • Databricks
  • H2
  • IBM DB2
  • Microsoft SQL Server
  • MariaDB
  • Oracle
  • PostgreSQL
  • SAP Sybase (jTDS)
  • Snowflake

To extend the GDI to access custom database sources, JDBC drivers can be added to Graph Lakehouse. To add a driver, follow the steps below.

  1. Copy the .jar file to the <install_path>/lib/udx directory on the leader server.
  2. Restart the database by running the following command. When the database is restarted, the leader broadcasts any new .jar files to the compute servers.
    sudo systemctl restart anzograph

The <install_path>/lib/udx directory on the leader node is a user-managed directory, unlike Graph Lakehouse-managed directories such as <install_path>/bin and <install_path>/internal. You can place JDBC drivers and Java or C++ extensions in the lib/udx directory at any time. Each time the database is restarted, Graph Lakehouse scans that directory, saves a copy of its contents to the <install_path>/internal/extensions directory, and then broadcasts the internal/extensions contents from the leader node to the compute nodes. Because each restart clears internal/extensions and rescans lib/udx, internal/extensions is always reloaded with the latest plugins.
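The restart-time sync described above can be illustrated with a short shell sequence. This is a simulation against throwaway directories, not the product's actual code; the driver file name is hypothetical:

```shell
# Simulate the restart-time sync between lib/udx and internal/extensions
# using throwaway directories (illustration only, not the real product code).
install_path=$(mktemp -d)
mkdir -p "$install_path/lib/udx" "$install_path/internal/extensions"

# A user drops a JDBC driver into the user-managed directory.
touch "$install_path/lib/udx/custom-jdbc-driver.jar"

# On restart: internal/extensions is cleared, lib/udx is rescanned, and the
# contents are staged for broadcast to the compute nodes.
rm -rf "$install_path/internal/extensions"
mkdir -p "$install_path/internal/extensions"
cp "$install_path/lib/udx/"*.jar "$install_path/internal/extensions/"

ls "$install_path/internal/extensions"
```

Because lib/udx is the source of truth, removing a .jar file from it and restarting also removes that plugin from internal/extensions.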

Installing the C++ Dependencies

Dependencies must be installed on all servers in the cluster to support the C++ extensions that Graph Lakehouse offers, including the remote read (load) and write service, the Data Science functions, and the integration with Apache Arrow. The installer provides a .repo file to help you configure the Cambridge Semantics repository and install the required software packages, with or without internet access.

Writing to the /etc/yum.repos.d directory requires root access permissions. See Adding, Enabling, and Disabling a YUM Repository for more information on defining and using yum repositories. The following packages are required:

libarchive
libarmadillo12
libboost_filesystem1_80_0
libboost_iostreams1_80_0
libboost_system1_80_0
libflatbuffers2
libhdfs3
libnfs13
libserd-0-0
libsmb2
shadow-utils

Install Dependencies via Internet Access to the Cambridge Semantics Repository

Follow the steps below if the Graph Lakehouse servers have external internet access and you want to install the dependencies directly from the Cambridge Semantics repository.

  1. Copy the csi-obs-cambridgesemantics-udxcontrib.repo file from the <install_path>/pre-req/yum.repos.d directory to the /etc/yum.repos.d directory. For example, the following command copies the file from the default installation path to /etc/yum.repos.d:
    sudo cp /opt/cambridgesemantics/pre-req/yum.repos.d/csi-obs-cambridgesemantics-udxcontrib.repo /etc/yum.repos.d
  2. Next, run the following command to enable the repository and install the required packages:
    sudo dnf install --enablerepo=crb $(cat <install_path>/pre-req/rh9-anzograph-requirements.txt)

    For example, on a server where Graph Lakehouse is installed in the default location:

    sudo dnf install --enablerepo=crb $(cat /opt/cambridgesemantics/pre-req/rh9-anzograph-requirements.txt)
  3. Repeat these steps on all servers in the cluster.

Install Dependencies without Internet Access via the Repository Mirror (tarball)

Follow the steps below if the Graph Lakehouse servers do not have external internet access and you want to install the dependencies from the mirrored Cambridge Semantics repository. The steps below describe how to copy the repository to each Graph Lakehouse host server and configure the .repo file accordingly. You can also choose to set up the mirror on a remote server that each of the Graph Lakehouse servers can access.

  1. From a computer that does have internet access, download the dependency tarball, csi-obs-cambridgesemantics-udxcontrib.rocky9.tar.xz, from the following Cambridge Semantics Google Cloud Storage location: https://storage.googleapis.com/csi-anzograph/udx/csi-os-contrib/rocky9/2023-03/20230321945/csi-obs-cambridgesemantics-udxcontrib.rocky9.tar.xz.

    You can run the following cURL command to download the tarball (append .sha512 to the URL to download the accompanying checksum file as well):

    curl -OL https://storage.googleapis.com/csi-anzograph/udx/csi-os-contrib/rocky9/2023-03/20230321945/csi-obs-cambridgesemantics-udxcontrib.rocky9.tar.xz
  2. Also from the computer that has internet access, download the repomd.xml.key from the following Cambridge Semantics Google Cloud Storage location: https://storage.googleapis.com/csi-rpmmd-pd/CambridgeSemantics:/UDXContrib/ubi-9/repodata/repomd.xml.key.

    You can run the following cURL command to download the file:

    curl -OL https://storage.googleapis.com/csi-rpmmd-pd/CambridgeSemantics:/UDXContrib/ubi-9/repodata/repomd.xml.key
    
  3. On each of the Graph Lakehouse servers, create a directory called /tmp/repo.
  4. Copy csi-obs-cambridgesemantics-udxcontrib.rocky9.tar.xz to the /tmp/repo directory on each server.
  5. Then run the following command to unpack the tarball in the /tmp/repo directory:
    tar -xvf csi-obs-cambridgesemantics-udxcontrib.rocky9.tar.xz

    The files are unpacked into subdirectories under /tmp/repo/dl/rocky9/csi-obs-cambridgesemantics-udxcontrib.

  6. Next, copy the repomd.xml.key file to the /tmp/repo/dl/rocky9/csi-obs-cambridgesemantics-udxcontrib directory on each of the Graph Lakehouse servers.
  7. Now, open the csi-obs-cambridgesemantics-udxcontrib.repo file in the <install_path>/examples/yum.repos.d directory. The contents of the file are shown below:
    [csi-obs-cambridgesemantics-udxcontrib]
    name=Contrib directory for CambridgeSemantics AnzoGraph UDX dependencies
    baseurl=https://storage.googleapis.com/csi-rpmmd-pd/CambridgeSemantics:/UDXContrib/ubi-9
    gpgkey=https://storage.googleapis.com/csi-rpmmd-pd/CambridgeSemantics:/UDXContrib/ubi-9/repodata/repomd.xml.key
    gpgcheck=1
    enabled=1
  8. Edit the csi-obs-cambridgesemantics-udxcontrib.repo file contents to replace the baseurl and gpgkey values so that they point to the repo files that you unpacked in the /tmp/repo directory. In addition, change the gpgcheck and enabled values from 1 to 0. The contents of the updated file are shown below:
    [csi-obs-cambridgesemantics-udxcontrib]
    name=Contrib directory for CambridgeSemantics AnzoGraph UDX dependencies
    baseurl=file:///tmp/repo/dl/rocky9/csi-obs-cambridgesemantics-udxcontrib
    gpgkey=file:///tmp/repo/dl/rocky9/csi-obs-cambridgesemantics-udxcontrib/repomd.xml.key
    gpgcheck=0
    enabled=0
  9. Save and close the file.
  10. Copy the edited csi-obs-cambridgesemantics-udxcontrib.repo file from the <install_path>/examples/yum.repos.d directory to the /etc/yum.repos.d directory. For example, the following command copies the file from the default installation path to /etc/yum.repos.d:
    sudo cp /opt/cambridgesemantics/examples/yum.repos.d/csi-obs-cambridgesemantics-udxcontrib.repo /etc/yum.repos.d
  11. Next, run the following command to enable the repository and install the required packages:
    sudo dnf install --enablerepo=crb $(cat <install_path>/pre-req/rh9-anzograph-requirements.txt)

    For example, on a server where Graph Lakehouse is installed in the default location:

    sudo dnf install --enablerepo=crb $(cat /opt/cambridgesemantics/pre-req/rh9-anzograph-requirements.txt)

Repeat the steps above as needed to install the dependencies on all servers in the cluster.
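The manual .repo edits in steps 8 and 9 above can also be scripted if you are configuring many servers. The following sed sketch edits a temporary copy seeded with the original contents for illustration; to edit the real file, point repo_file at the copy in <install_path>/examples/yum.repos.d instead:

```shell
# Sketch: apply the step 8 edits with sed instead of a text editor.
# This edits a temporary copy seeded with the original contents; set
# repo_file to the file in <install_path>/examples/yum.repos.d to edit
# the real one.
repo_file=$(mktemp)
cat > "$repo_file" <<'EOF'
[csi-obs-cambridgesemantics-udxcontrib]
name=Contrib directory for CambridgeSemantics AnzoGraph UDX dependencies
baseurl=https://storage.googleapis.com/csi-rpmmd-pd/CambridgeSemantics:/UDXContrib/ubi-9
gpgkey=https://storage.googleapis.com/csi-rpmmd-pd/CambridgeSemantics:/UDXContrib/ubi-9/repodata/repomd.xml.key
gpgcheck=1
enabled=1
EOF

mirror=file:///tmp/repo/dl/rocky9/csi-obs-cambridgesemantics-udxcontrib
sed -i \
    -e "s|^baseurl=.*|baseurl=$mirror|" \
    -e "s|^gpgkey=.*|gpgkey=$mirror/repomd.xml.key|" \
    -e 's|^gpgcheck=1|gpgcheck=0|' \
    -e 's|^enabled=1|enabled=0|' \
    "$repo_file"

grep -E '^(baseurl|gpgkey|gpgcheck|enabled)=' "$repo_file"
```

If your mirror is unpacked somewhere other than /tmp/repo, adjust the mirror variable to match.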

Optimizing the Linux Kernel Configuration for Graph Lakehouse

To streamline the configuration of the operating system for peak Graph Lakehouse performance, the installer includes a tuned profile that you can activate. Tuned is a daemon program that uses the udev device monitor to statically and dynamically tune operating system settings based on the specified profile. It is highly recommended that you activate the Graph Lakehouse tuned profile.

For an overview of the tuned daemon and more information on using the tuned service to improve the performance of specific workloads, see Tuned and Performance Tuning with Tuned and Tuned-ADM in the Red Hat Performance Tuning Guide.

Activating the Tuned Profile

The profile, called azg, is in the <install_path>/examples/tuned-profile directory and consists of two files: tuned.conf and additional-tuneables.sh. For details about the files, see Tuned Profile Reference below.

To activate the azg profile, follow the steps below. Complete these steps on all servers in the cluster:

  1. If you ran the installer in sudo mode, you can skip this step. The installer copied the tuned profile to the /etc/tuned directory but did not automatically activate the profile. If you ran the installer as a non-root user, copy the azg directory from <install_path>/examples/tuned-profile to the /etc/tuned directory. For example, the following command copies azg from the default installation path to /etc/tuned:
    sudo cp -r /opt/cambridgesemantics/examples/tuned-profile/azg /etc/tuned
  2. The tuned package is installed by default on RHEL 9. If you are using Rocky Linux 9, you may need to install it. You can run the following command to install the program:
    sudo dnf install -y tuned
  3. Run the following command to activate the azg profile:
    sudo tuned-adm profile azg

The host servers are now configured to use the tuned profile that is optimal for Graph Lakehouse.

To disable tuned profiles, you can run sudo tuned-adm off. After running the command, no tuned profiles will be active.

Tuned Profile Reference

This section describes the tuned profile files and the kernel configuration changes that they apply.

tuned.conf

The table below describes the Linux kernel configuration settings that are modified by tuned.conf.

vm.dirty_ratio
    Specifies the percentage of system memory that can be occupied by "dirty" data before flushing the cache to disk. Dirty data are pages in memory that have been updated and do not match what is stored on disk.
    AZG profile change: Reduces vm.dirty_ratio to 2% to increase the frequency with which the system cache is flushed.

vm.swappiness
    Controls the tendency of the kernel to move processes out of physical memory and onto the swap disk. A value of 0 means the kernel avoids swapping processes out of physical memory for as long as possible. A value of 100 tells the kernel to aggressively swap processes out of physical memory to the swap disk.
    AZG profile change: Sets vm.swappiness to 30.

vm.max_map_count
    Sets the limit on the maximum number of memory map areas a process can use. Since Graph Lakehouse is memory intensive, it may reach the default maximum map count of 65535 and be shut down by the operating system.
    AZG profile change: Increases vm.max_map_count to 2097152.

net.ipv4.tcp_rmem
    Controls the size of the receive buffer for TCP connections by setting the minimum, default, and maximum sizes of the buffer in bytes.
    AZG profile change: Sets tcp_rmem to "4096 87380 16777216".

net.ipv4.tcp_wmem
    Controls the size of the send buffer for TCP connections by setting the minimum, default, and maximum sizes of the buffer in bytes.
    AZG profile change: Sets tcp_wmem to "4096 16384 16777216".

net.ipv4.udp_mem
    Controls the amount of memory that can be allocated for the kernel's UDP buffer by setting the minimum, default, and maximum sizes of the buffer in bytes.
    AZG profile change: Sets udp_mem to "3145728 4194304 16777216".

transparent_hugepages
    Controls whether Transparent Huge Pages (THP) is enabled or disabled system-wide. When THP is enabled system-wide, it can dramatically degrade Graph Lakehouse performance.
    AZG profile change: Disables THP by setting transparent_hugepages to never.
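Collected into tuned.conf syntax, the settings above correspond to a profile along these lines. This is a sketch for orientation only; the shipped azg profile in <install_path>/examples/tuned-profile is authoritative and may differ in detail:

```ini
[main]
summary=Kernel settings tuned for Graph Lakehouse (AnzoGraph)

[sysctl]
vm.dirty_ratio=2
vm.swappiness=30
vm.max_map_count=2097152
net.ipv4.tcp_rmem="4096 87380 16777216"
net.ipv4.tcp_wmem="4096 16384 16777216"
net.ipv4.udp_mem="3145728 4194304 16777216"

[vm]
transparent_hugepages=never

[script]
script=${i:PROFILE_DIR}/additional-tuneables.sh
```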

additional-tuneables.sh

The additional-tuneables.sh script is called by tuned.conf and configures the following Linux kernel configuration settings so that they are optimal for Graph Lakehouse.

overcommit_memory
    Controls whether obvious overcommits of the address space are allowed.
    AZG profile change: Sets overcommit_memory to 0 to ensure that very large overcommits are not allowed but some overcommitting can be used to reduce swap usage.

overcommit_ratio
    Controls the percentage of memory that is allowed to be used for overcommits.
    AZG profile change: Sets overcommit_ratio to 50%.

transparent_hugepage/defrag
    Though the tuned profile disables Transparent Huge Pages (THP) system-wide, this setting controls whether huge pages can still be enabled on a per-process basis (inside MADV_HUGEPAGE madvise regions).
    AZG profile change: Sets transparent_hugepage/defrag to madvise so that the kernel only assigns huge pages to individual process memory regions that are specified with the madvise() system call.

tcp_timestamps
    Controls whether TCP timestamps are enabled or disabled.
    AZG profile change: Sets tcp_timestamps to 0, disabling TCP timestamps to reduce performance spikes related to timestamp generation.
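The effect of the script can be sketched as plain writes to /proc/sys and /sys. This is an illustration of what the settings above amount to, not the shipped additional-tuneables.sh; the SYS_PREFIX indirection exists only so the sketch can be dry-run against a scratch directory instead of the live kernel:

```shell
# Illustrative equivalent of additional-tuneables.sh (a sketch, not the
# shipped script). It dry-runs against a scratch directory; dropping the
# prefix (and running as root) would apply the values for real.
SYS_PREFIX=$(mktemp -d)

write_setting() {
    path="$SYS_PREFIX$1"
    mkdir -p "$(dirname "$path")"
    echo "$2" > "$path"
}

write_setting /proc/sys/vm/overcommit_memory 0             # modest overcommit only
write_setting /proc/sys/vm/overcommit_ratio 50             # cap overcommits at 50%
write_setting /sys/kernel/mm/transparent_hugepage/defrag madvise
write_setting /proc/sys/net/ipv4/tcp_timestamps 0          # disable TCP timestamps

cat "$SYS_PREFIX/proc/sys/vm/overcommit_ratio"   # 50
```

Changes made this way do not persist across reboots on their own; activating the tuned profile is what reapplies them at boot.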

Non-Root Installs: Configuring and Starting the Graph Lakehouse Services

If the installer was run in sudo mode, it automatically created the Graph Lakehouse systemd services in the /etc/systemd/system directory. If Graph Lakehouse is already running, see Connecting to Graph Lakehouse for next steps.

If the installer was run as the platform service user and not with sudo privileges, the last step in the post-installation configuration is to implement the Graph Lakehouse systemd services and start the database. It is important to set up the services to run as the service user so that Graph Lakehouse can access files on the shared file system. In addition, the services are configured to tune user resource limits (ulimits) as well as set $JAVA_HOME so that Graph Lakehouse can find the OpenJDK installation.

The service files are included in the <install_path>/examples/systemd-services directory. Follow the instructions below to configure and start the services.

  1. Configure the System Management Service
  2. Configure the Database Service on the Leader Server (and Single-Servers)

Configure the System Management Service

The system management daemon, azgmgrd, is a very lightweight program that runs on all Graph Lakehouse servers and manages communication between the system manager and the database as well as between the nodes in a cluster. Follow the steps below to configure and start the service that runs the azgmgrd process.

  1. Open the azgmgrd.service file in the <install_path>/examples/systemd-services directory. The contents of the file are shown below.

    The following contents are from an installation that used the default installation path, /opt/cambridgesemantics. The contents of your file may differ. Also, note the User=anzograph line; the value needs to be edited to replace anzograph with the platform service user name.

    [Unit]
    Description=AnzoGraph communication service
    # depends on NetworkManager-wait-online.service enabled
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    Type=forking
    # The PID file is optional, but recommended in the manpage
    # "so that systemd can identify the main process of the daemon"
    #PIDFile=/var/run/azgmgrd.pid
    WorkingDirectory=/opt/cambridgesemantics/anzograph
    StandardOutput=syslog
    StandardError=syslog
    LimitCPU=infinity
    LimitNOFILE=4096
    LimitAS=infinity
    LimitNPROC=infinity
    LimitMEMLOCK=infinity
    LimitLOCKS=infinity
    LimitFSIZE=infinity
    User=anzograph
    UMask=007
    Environment=PATH=$PATH:/opt/cambridgesemantics/anzograph/bin:/opt/cambridgesemantics/anzograph/tools/bin
    Environment=JAVA_HOME=/usr/lib/jvm/java-21-openjdk-21.0.1.0.12-3.el9.x86_64
    Environment=UDX_LOGS=/opt/cambridgesemantics/anzograph/internal/logs
    Environment=HYPER_PATH=/opt/cambridgesemantics/anzograph/vendor/com.tableau/hyper/lib/hyper
    ExecStart=/opt/cambridgesemantics/anzograph/bin/azgmgrd /opt/cambridgesemantics/anzograph
    
    CPUAccounting=false
    MemoryAccounting=false
    [Install]
    WantedBy=multi-user.target
    Alias=sbxmgrd.service
  2. In the following line of the file, replace anzograph with the name of the platform service user.
    User=anzograph

    For example:

    User=anzo
  3. Save and close the file.
  4. Copy azgmgrd.service from the <install_path>/examples/systemd-services directory to the /usr/lib/systemd/system directory. For example, the following command copies azgmgrd.service from the default installation path to /usr/lib/systemd/system:
    sudo cp /opt/cambridgesemantics/examples/systemd-services/azgmgrd.service /usr/lib/systemd/system
  5. Run the following commands to start and enable the service:
    sudo systemctl start azgmgrd.service
    sudo systemctl enable azgmgrd.service
  6. Repeat this process on all servers in the cluster.

The azgmgrd daemon must be running in order to start the database, but it typically does not need to be restarted unless you are upgrading Graph Lakehouse or the host servers are rebooted. It does not need to be stopped and started each time the database is restarted.
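If you are configuring many servers, the User= edit in step 2 can be scripted. The following sed sketch works on a temporary file seeded with a minimal [Service] section for illustration; to edit the real file, point service_file at azgmgrd.service in <install_path>/examples/systemd-services:

```shell
# Sketch: set the User= value in a systemd service file with sed.
# Edits a temporary stand-in here; set service_file to the real
# azgmgrd.service in <install_path>/examples/systemd-services to apply.
service_file=$(mktemp)
printf '[Service]\nUser=anzograph\nUMask=007\n' > "$service_file"

service_user=anzo    # replace with your platform service user name
sed -i "s|^User=.*|User=$service_user|" "$service_file"

grep '^User=' "$service_file"   # User=anzo
```

The same one-line sed substitution applies to the anzograph.service file described in the next section.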

Configure the Database Service on the Leader Server (and Single-Servers)

The anzograph service runs the database process and is configured to start after azgmgrd. Start the database only on the leader server; the leader connects to the system managers on the compute servers and starts the database across the cluster.

  1. Open the anzograph.service file in the <install_path>/examples/systemd-services directory. The contents of the file are shown below.

    The following contents are from an installation that used the default installation path, /opt/cambridgesemantics. The contents of your file may differ. Also, note the User=anzograph line; the value needs to be edited to replace anzograph with the platform service user name.

    [Unit]
    Description=AnzoGraph database service
    After=azgmgrd.service
    Wants=azgmgrd.service
    
    [Service]
    Type=oneshot
    # The PID file is optional, but recommended in the manpage
    # "so that systemd can identify the main process of the daemon"
    #PIDFile=/var/run/azg.pid
    WorkingDirectory=/opt/cambridgesemantics/anzograph
    StandardOutput=syslog
    StandardError=syslog
    User=anzograph
    UMask=027
    RemainAfterExit=yes
    Environment=PATH=$PATH:/opt/cambridgesemantics/anzograph/bin:/opt/cambridgesemantics/anzograph/tools/bin
    ExecStart=/opt/cambridgesemantics/anzograph/bin/azgctl -start
    ExecStop=/opt/cambridgesemantics/anzograph/bin/azgctl -stop
    
    [Install]
    WantedBy=multi-user.target
    Alias=gqe.service
  2. In the following line of the file, replace anzograph with the name of the platform service user.
    User=anzograph

    For example:

    User=anzo
  3. Save and close the file.
  4. Copy anzograph.service from the <install_path>/examples/systemd-services directory to the /usr/lib/systemd/system directory. For example, the following command copies anzograph.service from the default installation path to /usr/lib/systemd/system:
    sudo cp /opt/cambridgesemantics/examples/systemd-services/anzograph.service /usr/lib/systemd/system
  5. Run the following commands to start and enable the new service:
    sudo systemctl start anzograph.service
    sudo systemctl enable anzograph.service

Once the services are in place and enabled, Graph Lakehouse should be running. To stop and start the database from the command line, run the following systemctl commands on the leader node (you do not need to stop and start azgmgrd):

sudo systemctl stop anzograph
sudo systemctl start anzograph 

For instructions on configuring the connection to Graph Lakehouse in Graph Studio, see Connecting to Graph Lakehouse.

See Securing an AnzoGraph Environment for recommendations on securing Graph Lakehouse environments.