diff --git a/docs/developer/develop-background-process-worker.md b/docs/developer/develop-background-process-worker.md new file mode 100644 index 0000000000..7d3820eb9f --- /dev/null +++ b/docs/developer/develop-background-process-worker.md @@ -0,0 +1,61 @@ +--- +title: Develop a Background Worker Process +--- + +# Develop a Background Worker Process + +Apache Cloudberry can be extended to run user-supplied code in separate processes. Such processes are started, stopped, and monitored by `postgres`, which permits them to have a lifetime closely linked to the server's status. These processes can attach to Apache Cloudberry's shared memory area and connect to databases internally; they can also run multiple transactions serially, just like a regular client-connected server process. Also, by linking to `libpq`, they can connect to the server and behave like a regular client application. + +:::caution +There are considerable robustness and security risks in using background worker processes because, being written in the `C` language, they have unrestricted access to data. Administrators wishing to enable modules that include background worker processes should exercise extreme caution. Only carefully audited modules should be permitted to run background worker processes. +::: + +Background workers can be initialized at the time that Apache Cloudberry is started by including the module name in the `shared_preload_libraries` server configuration parameter. A module wishing to run a background worker can register it by calling `RegisterBackgroundWorker(BackgroundWorker *worker)` from its `_PG_init()`. Background workers can also be started after the system is up and running by calling `RegisterDynamicBackgroundWorker(BackgroundWorker *worker, BackgroundWorkerHandle **handle)`. Unlike `RegisterBackgroundWorker`, which can only be called from within the `postmaster`, `RegisterDynamicBackgroundWorker` must be called from a regular backend. 
+ +The structure `BackgroundWorker` is defined thus: + +```c +typedef void (*bgworker_main_type)(Datum main_arg); +typedef struct BackgroundWorker +{ + char bgw_name[BGW_MAXLEN]; + int bgw_flags; + BgWorkerStartTime bgw_start_time; + int bgw_restart_time; /* in seconds, or BGW_NEVER_RESTART */ + bgworker_main_type bgw_main; + char bgw_library_name[BGW_MAXLEN]; /* only if bgw_main is NULL */ + char bgw_function_name[BGW_MAXLEN]; /* only if bgw_main is NULL */ + Datum bgw_main_arg; + int bgw_notify_pid; +} BackgroundWorker; +``` + +`bgw_name` is a string to be used in log messages, process listings and similar contexts. + +`bgw_flags` is a bitwise-or'd bit mask indicating the capabilities that the module wants. Possible values are `BGWORKER_SHMEM_ACCESS` (requesting shared memory access) and `BGWORKER_BACKEND_DATABASE_CONNECTION` (requesting the ability to establish a database connection, through which it can later run transactions and queries). A background worker using `BGWORKER_BACKEND_DATABASE_CONNECTION` to connect to a database must also attach shared memory using `BGWORKER_SHMEM_ACCESS`, or worker start-up will fail. + +`bgw_start_time` is the server state during which `postgres` should start the process; it can be one of `BgWorkerStart_PostmasterStart` (start as soon as `postgres` itself has finished its own initialization; processes requesting this are not eligible for database connections), `BgWorkerStart_ConsistentState` (start as soon as a consistent state has been reached in a hot standby, allowing processes to connect to databases and run read-only queries), and `BgWorkerStart_RecoveryFinished` (start as soon as the system has entered normal read-write state). Note the last two values are equivalent in a server that's not a hot standby. Note that this setting only indicates when the processes are to be started; they do not stop when a different state is reached. 
+ +`bgw_restart_time` is the interval, in seconds, that `postgres` should wait before restarting the process if it crashes. It can be any positive value, or `BGW_NEVER_RESTART` to indicate that the process should not be restarted after a crash. + +`bgw_main` is a pointer to the function to run when the process is started. This function must take a single argument of type `Datum` and return `void`. `bgw_main_arg` will be passed to it as its only argument. Note that the global variable `MyBgworkerEntry` points to a copy of the `BackgroundWorker` structure passed at registration time. `bgw_main` may be NULL; in that case, `bgw_library_name` and `bgw_function_name` will be used to determine the entry point. This is useful for background workers launched after postmaster startup, where the postmaster does not have the requisite library loaded. + +`bgw_library_name` is the name of a library in which the initial entry point for the background worker should be sought. It is ignored unless `bgw_main` is NULL. But if `bgw_main` is NULL, then the named library will be dynamically loaded by the worker process and `bgw_function_name` will be used to identify the function to be called. + +`bgw_function_name` is the name of a function in a dynamically loaded library which should be used as the initial entry point for a new background worker. It is ignored unless `bgw_main` is NULL. + +`bgw_notify_pid` is the PID of an Apache Cloudberry backend process to which the postmaster should send `SIGUSR1` when the process is started or exits. It should be 0 for workers registered at postmaster startup time, or when the backend registering the worker does not wish to wait for the worker to start up. Otherwise, it should be initialized to `MyProcPid`. + +Once running, the process can connect to a database by calling `BackgroundWorkerInitializeConnection(char *dbname, char *username)`. This allows the process to run transactions and queries using the `SPI` interface. 
If `dbname` is NULL, the session is not connected to any particular database, but shared catalogs can be accessed. If `username` is NULL, the process will run as the superuser created during `initdb`. `BackgroundWorkerInitializeConnection` can only be called once per background process; it is not possible to switch databases. + +Signals are initially blocked when control reaches the `bgw_main` function, and must be unblocked by it; this is to allow the process to customize its signal handlers, if necessary. Signals can be unblocked in the new process by calling `BackgroundWorkerUnblockSignals` and blocked by calling `BackgroundWorkerBlockSignals`. + +If `bgw_restart_time` for a background worker is configured as `BGW_NEVER_RESTART`, or if it exits with an exit code of 0 or is terminated by `TerminateBackgroundWorker`, it will be automatically unregistered by the postmaster on exit. Otherwise, it will be restarted after the time period configured via `bgw_restart_time`, or immediately if the postmaster reinitializes the cluster due to a backend failure. Backends which need to suspend execution only temporarily should use an interruptible sleep rather than exiting; this can be achieved by calling `WaitLatch()`. Make sure the `WL_POSTMASTER_DEATH` flag is set when calling that function, and verify the return code for a prompt exit in the emergency case that `postgres` itself has terminated. + +When a background worker is registered using the `RegisterDynamicBackgroundWorker` function, it is possible for the backend performing the registration to obtain information regarding the status of the worker. Backends wishing to do this should pass the address of a `BackgroundWorkerHandle *` as the second argument to `RegisterDynamicBackgroundWorker`. 
If the worker is successfully registered, this pointer will be initialized with an opaque handle that can subsequently be passed to `GetBackgroundWorkerPid(BackgroundWorkerHandle *, pid_t *)` or `TerminateBackgroundWorker(BackgroundWorkerHandle *)`. `GetBackgroundWorkerPid` can be used to poll the status of the worker: a return value of `BGWH_NOT_YET_STARTED` indicates that the worker has not yet been started by the postmaster; `BGWH_STOPPED` indicates that it has been started but is no longer running; and `BGWH_STARTED` indicates that it is currently running. In this last case, the PID will also be returned via the second argument. `TerminateBackgroundWorker` causes the postmaster to send `SIGTERM` to the worker if it is running, and to unregister it as soon as it is not. + +In some cases, a process which registers a background worker may wish to wait for the worker to start up. This can be accomplished by initializing `bgw_notify_pid` to `MyProcPid` and then passing the `BackgroundWorkerHandle *` obtained at registration time to the `WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle, pid_t *pid)` function. This function will block until the postmaster has attempted to start the background worker, or until the postmaster dies. If the background worker is running, the return value will be `BGWH_STARTED`, and the PID will be written to the provided address. Otherwise, the return value will be `BGWH_STOPPED` or `BGWH_POSTMASTER_DIED`. + +The `worker_spi` contrib module contains a working example, which demonstrates some useful techniques. + +The maximum number of registered background workers is limited by the `max_worker_processes` server configuration parameter. 
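Tying the pieces together, a module's `_PG_init()` might register a worker as in the following sketch. This is an illustrative outline only: the worker name, flags, and entry point are example choices, and the file must be compiled as a server extension against the Apache Cloudberry headers, not as a standalone program.

```c
/* Illustrative sketch: registering a background worker at load time.
 * All names here are examples; build as a server extension. */
#include "postgres.h"
#include "fmgr.h"
#include "postmaster/bgworker.h"

PG_MODULE_MAGIC;

void _PG_init(void);
static void example_worker_main(Datum main_arg);

static void
example_worker_main(Datum main_arg)
{
    /* Signals start out blocked; unblock them before doing any work. */
    BackgroundWorkerUnblockSignals();

    /* Connect to a database so the worker can use the SPI interface. */
    BackgroundWorkerInitializeConnection("postgres", NULL);

    /* ... main loop, e.g. WaitLatch() with WL_POSTMASTER_DEATH set ... */
}

void
_PG_init(void)
{
    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    snprintf(worker.bgw_name, BGW_MAXLEN, "example worker");
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                       BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    worker.bgw_main = example_worker_main;
    worker.bgw_main_arg = (Datum) 0;
    worker.bgw_notify_pid = 0;   /* registered at startup: no notification */

    RegisterBackgroundWorker(&worker);
}
```

A dynamically registered worker would instead set `bgw_notify_pid` to `MyProcPid` and wait with `WaitForBackgroundWorkerStartup`.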
diff --git a/docs/developer/functions-and-procedures/overview.md b/docs/developer/functions-and-procedures/overview.md new file mode 100644 index 0000000000..4f14b3709c --- /dev/null +++ b/docs/developer/functions-and-procedures/overview.md @@ -0,0 +1,35 @@ +--- +title: Stored Procedures and User-Defined Functions +--- + +# Stored Procedures and User-Defined Functions + +Apache Cloudberry provides powerful capabilities for extending the database functionality through Stored Procedures (SPs) and User-Defined Functions (UDFs). + +## User-Defined functions (UDFs) + +User-Defined Functions (UDFs) return values and can be used in queries. They allow you to bundle complex logic and calculations into reusable components. + +Apache Cloudberry supports several procedural languages for writing UDFs: + +- **PL/Python**: Write functions using Python 3. With the `plpython3u` untrusted language, you can access system calls and external libraries. +- **[PL/Java](https://github.com/cloudberry-contrib/pljava)**: Write functions using Java. Suitable for complex computations and integration with existing Java libraries. +- **PL/Perl**: Write functions using Perl, leveraging its strong string manipulation capabilities. +- **[PL/Container](https://github.com/cloudberry-contrib/plcontainer)**: Run Python and R functions securely inside Docker containers. This provides isolation and security for running untrusted code. +- **[PL/R](https://github.com/cloudberry-contrib/plr)**: Write functions using the R statistical computing language. Ideal for advanced data analysis and statistical modeling. + +## Stored procedures + +Stored Procedures are similar to functions but do not return a value. They are invoked using the `CALL` command and can handle transaction control (e.g., `COMMIT`, `ROLLBACK`) within the procedure body, which is not allowed in functions. 
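For example, a procedure can commit completed work in batches, which a function cannot do. The table and procedure names below are illustrative only:

```sql
-- Illustrative example: a procedure that commits each batch.
-- A function attempting COMMIT in its body would raise an error.
CREATE TABLE batch_log (id int, note text);

CREATE PROCEDURE load_batches()
LANGUAGE plpgsql
AS $$
BEGIN
    INSERT INTO batch_log VALUES (1, 'first batch');
    COMMIT;   -- allowed in a procedure, not in a function
    INSERT INTO batch_log VALUES (2, 'second batch');
END;
$$;

-- Procedures are invoked with CALL rather than SELECT.
CALL load_batches();
```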
+ +## Supported languages + +| Language | Extension | Trusted or not | Description | +| :--- | :--- | :--- | :--- | +| **SQL** | Built-in | Yes | Write functions using standard SQL queries. | +| **PL/pgSQL** | Built-in | Yes | The procedural language for the PostgreSQL database system. | +| **PL/Python** | `plpython3u` | No | Python 3 procedural language. | +| **PL/Java** | `pljava` | Yes | Java procedural language. | +| **PL/Perl** | `plperl` | Yes | Perl procedural language. | +| **PL/R** | `plr` | No | R procedural language. | +| **PL/Container** | `plcontainer` | Yes (Safe) | Run Python/R in Docker containers. | diff --git a/docs/developer/functions-and-procedures/use-pl-container.md b/docs/developer/functions-and-procedures/use-pl-container.md new file mode 100644 index 0000000000..b04e3540c8 --- /dev/null +++ b/docs/developer/functions-and-procedures/use-pl-container.md @@ -0,0 +1,612 @@ +# Use PL/Container + +PL/Container allows you to run procedural language functions inside Docker containers, mitigating the security risks associated with running Python or R code directly on Apache Cloudberry segment hosts. This document introduces the architecture, configuration, usage, and advanced topics of PL/Container. + +## About the PL/Container language extension + +The PL/Container language extension allows you to securely create and run PL/Python or PL/R user-defined functions (UDFs) inside Docker containers. Docker provides the ability to package and run an application in a loosely isolated environment called a container. + +Running UDFs inside a Docker container ensures that: + +- The function executes in an isolated environment, decoupling data processing from the database host. +- The user code cannot access the host's operating system or file system. +- The user code cannot compromise the database host, which reduces the security risks of running untrusted code. +- If the container is started with limited or no network access, the function cannot connect back to the database. 
+ +### PL/Container architecture + +Example of the process flow: + +1. When a query calls a PL/Container function, the Query Executor (QE) on a segment host starts a container and communicates with it to get results. The container might call back to the database to execute SQL queries via the Server Programming Interface (SPI) and then returns the final results to the QE. + +2. A container in standby mode waits for a socket connection without consuming any CPU resources. When the Apache Cloudberry database session that started it is closed, the container connection is also closed, and the container shuts down. + +## Configure and use PL/Container + +You do not need to install an extension package to use PL/Container in Apache Cloudberry. You only need to configure the Docker environment and enable the extension in the database. + +### Prerequisites + +- Docker is installed on all Apache Cloudberry database hosts (coordinator and all segments). +- The minimum Linux kernel version is 3.10. +- The minimum Docker version is 19.03. + +### Step 1: Install Docker + +You need to install Docker on all Apache Cloudberry database hosts. The following is an installation example on CentOS 7: + +1. Ensure the current user has `sudo` privileges or is the `root` user. + +2. Install the dependencies required for Docker: + + ```bash + sudo yum install -y yum-utils device-mapper-persistent-data lvm2 + ``` + +3. Add the Docker software repository: + + ```bash + sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo + ``` + +4. Install Docker: + + ```bash + sudo yum -y install docker-ce + ``` + +5. Start the Docker service: + + ```bash + sudo systemctl start docker + ``` + +6. Add the Apache Cloudberry database administrator user (usually `gpadmin`) to the `docker` group so that it can manage Docker images and containers: + + ```bash + sudo usermod -aG docker gpadmin + ``` + +7. 
Log out of the current session and log back in for the permissions to take effect. + +8. Configure Docker to start on boot: + + ```bash + sudo systemctl enable docker.service + ``` + +9. Test whether Docker is installed successfully. This command lists the currently running containers (which should be empty at this point): + + ```bash + docker ps + ``` + +10. After installing Docker on all hosts, restart the Apache Cloudberry database for the changes to take effect: + + ```bash + gpstop -ra + ``` + +### Step 2: Enable PL/Container + +PL/Container is a built-in extension in Apache Cloudberry. You only need to run one command in the database where you want to use it to enable it. + +1. Connect to the target database using `psql`: + + ```bash + psql -d your_database_name + ``` + +2. Run the `CREATE EXTENSION` command: + + ```sql + CREATE EXTENSION plcontainer; + ``` + + After successful execution, PL/Container is activated in that database. + +### Step 3: Install and configure Docker images + +PL/Container requires specific Docker images to create containers for running UDFs. + +1. Obtain the PL/Container Docker image file (for example, `plcontainer-python3-image-*.tar.gz`) from trusted sources. + +2. Use the `plcontainer image-add` command to install the image on all Apache Cloudberry hosts. Use the `-f` option to specify the path to the image file. + + :::note + Make sure that your system has the `rsync` dependency; otherwise, the following commands will fail. + ::: + + ```bash + # Installs a Python 3 based Docker image. + plcontainer image-add -f /path/to/your/plcontainer-python3-image.tar.gz + + # Installs an R based Docker image. + plcontainer image-add -f /path/to/your/plcontainer-r-image.tar.gz + ``` + +3. Use the `plcontainer image-list` command to view the installed images: + + ```bash + plcontainer image-list + ``` + +4. Use the `plcontainer runtime-add` command to register the installed image as a "runtime" so that it can be called in functions. 
+ + - `-r`: Specify the name of the runtime (custom). + - `-i`: Specify the associated Docker image. + - `-l`: Specify the language supported by this runtime (`python`, `python3`, or `r`). + + ```bash + # Adds a Python 3 runtime. + plcontainer runtime-add -r plc_python3_shared -i pivotaldata/plcontainer_python3_shared:devel -l python3 + + # Adds an R runtime. + plcontainer runtime-add -r plc_r_shared -i pivotaldata/plcontainer_r_shared:devel -l r + ``` + +### Step 4: Test the installation + +1. Connect to the database where PL/Container is enabled using `psql`. + +2. Create a simple function to test the Python runtime. Note the `# container:` comment, which specifies the runtime ID used by the function. + + ```sql + CREATE FUNCTION dummy_python() RETURNS text AS $$ + # container: plc_python3_shared + return 'Hello from Python in a container' + $$ LANGUAGE plcontainer; + ``` + +3. Call the function to test it: + + ```sql + SELECT dummy_python(); + ``` + +4. Similarly, you can create a function to test the R runtime: + + ```sql + CREATE FUNCTION dummy_r() RETURNS text AS $$ + # container: plc_r_shared + return ('Hello from R in a container') + $$ LANGUAGE plcontainer; + ``` + +5. Call the function to perform a test: + + ```sql + SELECT dummy_r(); + ``` + +## Write PL/Container functions + +When you enable the PL/Container extension in a database, the `plcontainer` language is registered. You can write functions in PL/Container-supported languages, such as Python or R, by specifying `LANGUAGE plcontainer` in a user-defined function (UDF). + +The PL/Container configuration file is read only when the first PL/Container function is called in each database session. If you modify the configuration externally, you can force a reload by running `SELECT * FROM plcontainer_refresh_config;` within the session. 
+ +### Function definition + +To create a PL/Container function, the UDF you define must include the following elements: + +- **Container declaration**: The first line of the function body must be `# container: ID`, where `ID` is the runtime ID you defined with `plcontainer runtime-add`. +- **Language declaration**: The `LANGUAGE` attribute of the function must be `plcontainer`. + +If the `# container` line in a UDF specifies an ID that does not exist in the PL/Container configuration file, the database returns an error when trying to run the UDF. + +### Function examples + +Here are some basic function examples. Ensure that the `# container` ID in the examples matches the runtime ID you have configured. + +**PL/Python example:** + +```sql +CREATE OR REPLACE FUNCTION pylog100() RETURNS double precision AS $$ + # container: plc_python3_shared + import math + return math.log10(100) +$$ LANGUAGE plcontainer; + +SELECT pylog100(); +``` + +**PL/R example:** + +```sql +CREATE OR REPLACE FUNCTION rlog100() RETURNS text AS $$ +# container: plc_r_shared +return(log10(100)) +$$ LANGUAGE plcontainer; + +SELECT rlog100(); +``` + +### About PL/Python functions + +#### PL/Python 2 vs. PL/Python 3 + +- PL/Container supports Python 2.7 and Python 3.6+. +- If you want to use PL/Container to run functions for both Python 2 and Python 3, you need to create two different user-defined functions for them. +- Note that UDFs written for Python 2 might not run directly in Python 3. + +#### Database interaction (`plpy` module) + +The `plpy` module provides a set of methods for interacting with the database: + +- `plpy.execute(stmt)`: Executes a SQL query string and returns the result as a list of dictionaries. +- `plpy.prepare(stmt[, argtypes])`: Prepares a query plan. +- `plpy.execute(plan[, arguments])`: Executes a prepared query plan with the given argument values. 
+- `plpy.debug(msg)`, `plpy.log(msg)`, `plpy.info(msg)`, `plpy.notice(msg)`, `plpy.warning(msg)`, `plpy.error(msg)`, `plpy.fatal(msg)`: Sends log messages of different levels to the database logging system. An `ERROR` or `FATAL` level error aborts the transaction. +- `plpy.subtransaction()`: Manages explicit subtransactions. + +#### String quoting + +The `plpy` module supports several useful string quoting functions for building dynamic queries: + +- `plpy.quote_literal(string)`: Quotes a string as a literal in a SQL statement, correctly handling single quotes and backslashes. +- `plpy.quote_nullable(string)`: Same as above, but returns `NULL` if the input is NULL. +- `plpy.quote_ident(string)`: Quotes a string as an identifier in a SQL statement, adding quotes only when necessary. + +#### Global dictionaries + +The `plpy` module provides two special global dictionaries for persisting data between function calls: + +- **GD (Global Dictionary)**: This dictionary is shared among all function calls within the same container. The data persists as long as the container instance is alive. +- **SD (Static Dictionary)**: This dictionary shares data only between multiple calls to the same function. + +:::note +Note that the lifecycle of a container is associated with a session. When an idle session is terminated by the database, the related containers are destroyed, and the data in GD and SD will be lost. +::: + +### About PL/R functions + +The `pg.spi` module in PL/Container provides methods for the R language to interact with the database: + +- `pg.spi.exec(stmt)`: Executes a SQL query and returns an R `data.frame`. +- `pg.spi.prepare(stmt[, argtypes])`: Prepares a query plan. +- `pg.spi.execp(plan[, arguments])`: Executes a prepared query plan with the given argument values. +- `pg.spi.debug(msg)`, `pg.spi.log(msg)`, `pg.spi.info(msg)`, `pg.spi.notice(msg)`, `pg.spi.warning(msg)`, `pg.spi.error(msg)`, `pg.spi.fatal(msg)`: Sends log messages of different levels to the database logging system. 
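To make the quoting rules above concrete, the following standalone Python mimics the behavior of the `plpy` quoting helpers. This is not the `plpy` implementation (and the real `quote_ident` also quotes reserved SQL keywords); it is only a sketch of the same SQL quoting rules:

```python
# Pure-Python illustration of the rules behind plpy.quote_literal,
# plpy.quote_nullable, and plpy.quote_ident. NOT the plpy module itself.

def quote_literal(s: str) -> str:
    # Double embedded single quotes; if a backslash is present, use the
    # E'...' form so backslashes are interpreted literally.
    escaped = s.replace("'", "''")
    if "\\" in s:
        return "E'" + escaped.replace("\\", "\\\\") + "'"
    return "'" + escaped + "'"

def quote_nullable(s):
    # Same as quote_literal, but maps a missing value to SQL NULL.
    return "NULL" if s is None else quote_literal(s)

def quote_ident(s: str) -> str:
    # Quote only when the name is not a plain lower-case identifier;
    # embedded double quotes are doubled.
    if s.isidentifier() and s == s.lower():
        return s
    return '"' + s.replace('"', '""') + '"'

print(quote_literal("O'Reilly"))   # 'O''Reilly'
print(quote_nullable(None))        # NULL
print(quote_ident("My Table"))     # "My Table"
```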
+ +### Function limitations + +Note the following limitations when using PL/Container: + +- Database domains are not supported. +- Multidimensional arrays are not supported. +- Call stack information for Python and R is not displayed when debugging UDFs. +- The `nrows()` and `status()` methods of `plpy.execute()` are not supported. +- The `plpy.SPIError()` method of PL/Python is not supported. +- Running the `SAVEPOINT` command in `plpy.execute()` is not supported. +- Container flow control is not supported. +- Triggers are not supported. +- `OUT` parameters are not supported. +- PL/Python functions cannot directly return a Python `dict` type, but a `dict` can be converted to a database user-defined type (UDT) before returning. + +## Manage PL/Container + +### Manage configurations and containers in a session + +PL/Container provides some built-in views and functions to help you view and manage configurations within a database session. + +- **Refresh configuration**: If you modify the `plcontainer_configuration.xml` file externally, you can run the following command in a session to force a configuration reload without restarting the database. + + ```sql + SELECT * FROM plcontainer_refresh_config; + ``` + + This command returns the status of the configuration refresh on the coordinator and all segment instances. + +- **View current configuration**: + + ```sql + SELECT * FROM plcontainer_show_config; + ``` + + This command executes a PL/Container function to display configuration information from the coordinator and all segment instances. + +- **View running containers**: + + ```sql + SELECT * FROM plcontainer_containers_summary(); + ``` + + If run by a regular user, this function only displays the containers created by that user. If run by a superuser, it displays containers created by all users. The output includes segment ID, container ID, runtime, owner, and memory usage. 
+ +## Advanced topics + +### Resource management + +Docker containers share CPU and memory resources with the Apache Cloudberry database service on the same host. By default, the database is unaware of the resources consumed by PL/Container instances. You can use the database's resource group feature to control the total CPU resource usage of PL/Container instances. + +PL/Container manages resources at two levels: container level and runtime level. You can control container-level CPU and memory resources by configuring `memory_mb` and `cpu_share` settings for a runtime. `memory_mb` controls the memory resources available to each container instance, while `cpu_share` defines the CPU usage weight of a container relative to others. + +:::note +If you do not explicitly configure a resource group for a PL/Container runtime, its container instances will only be limited by system resources. This can lead to containers consuming excessive resources, thus affecting the performance of the database server. +::: + +#### Configuration process + +To use resource groups to manage PL/Container's CPU resources, you need to explicitly configure both resource groups and PL/Container. + +1. **Plan resource allocation**: + + - Analyze the resource usage of your deployment environment to determine the percentage of CPU resources to allocate to PL/Container Docker containers. + - Decide how to distribute these resources among different PL/Container runtimes. Clarify the required number of resource groups, the CPU percentage for each group, and the mapping between resource groups and runtimes. + +2. **Create resource groups**: Create resource groups according to your plan. For example, assume you decide to allocate 25% of CPU resources to PL/Container and distribute it between two different resource groups in a 60/40 ratio: + + ```sql + -- Creates a resource group for the R runtime, allocating 15% CPU. 
+ CREATE RESOURCE GROUP plr_run1_rg WITH (CONCURRENCY=0, CPU_MAX_PERCENT=15); + + -- Creates a resource group for the Python runtime, allocating 10% CPU. + CREATE RESOURCE GROUP plpy_run1_rg WITH (CONCURRENCY=0, CPU_MAX_PERCENT=10); + ``` + +3. **Get resource group IDs**: Query the `gp_toolkit.gp_resgroup_config` view to get the `groupid` of the resource groups you created. + + ```sql + SELECT groupname, groupid FROM gp_toolkit.gp_resgroup_config + WHERE groupname IN ('plpy_run1_rg', 'plr_run1_rg'); + + -- Example output: + -- groupname | groupid + -- --------------+---------- + -- plpy_run1_rg | 16391 + -- plr_run1_rg | 16393 + -- (2 rows) + ``` + +4. **Assign resource groups to runtimes**: Use the `plcontainer runtime-add` (for a new runtime) or `plcontainer runtime-replace` (for an existing runtime) command to assign a resource group via the `-s resource_group_id=` parameter. + + ```bash + # Assigns a resource group to the new python_run1 runtime. + plcontainer runtime-add -r python_run1 -i pivotaldata/plcontainer_python_shared:devel -l python -s resource_group_id=16391 + + # Replaces and assigns a resource group to the existing r_run1 runtime. + plcontainer runtime-replace -r r_run1 -i pivotaldata/plcontainer_r_shared:devel -l r -s resource_group_id=16393 + ``` + + You can also use the `plcontainer runtime-edit` command to manually edit the configuration file to assign a resource group. + +After assigning a resource group to a runtime, all container instances that share this runtime configuration will be subject to the CPU limits of that group's configuration. If you delete a PL/Container resource group that is in use, the database will terminate the running containers. + +### Logging + +When the logging feature of PL/Container is enabled, you can set the log level (default is `warning`) through the database's `log_min_messages` parameter. This parameter controls the log level for both the database and PL/Container. 
+ +- **Enable logging**: PL/Container logging is enabled per runtime ID and controlled by the `use_container_logging` setting, which defaults to no logging. +- **Log content and location**: PL/Container log messages originate from UDFs running in Docker containers. On Red Hat 8 systems, log messages are sent to the `journald` service by default. Database log messages are sent to the log files on the coordinator node. +- **Dynamically adjust log level**: When testing or troubleshooting, you can use the `SET` command in a session to temporarily change the log level, for example, to `debug1`. + + ```sql + SET log_min_messages='debug1'; + ``` + +:::note +The `log_min_messages` parameter affects logging for both the database and PL/Container. Increasing the log level might affect database performance, even if no PL/Container functions are running. +::: + +### Use CUDA for GPU acceleration + +PL/Container supports using NVIDIA GPUs for computational acceleration. This requires you to prepare a custom Docker image containing the CUDA Toolkit and corresponding Python libraries (such as PyCUDA) and configure it accordingly. + +#### Prerequisites + +- Docker engine version is not lower than v19.03. +- PL/Container version is not lower than 2.2.0. +- At least one NVIDIA GPU is on the host, and the corresponding GPU driver is installed. +- NVIDIA Container Toolkit is installed, and it is verified that the `nvidia-docker` image can successfully use the GPU. + +#### Install and customize the PL/Container image + +1. **Load the base image**: Obtain the PL/Container Python 3 image from official channels and load it into Docker. + + ```bash + docker image load < plcontainer-python3-image-*.tar.gz + ``` + +2. **Customize the image**: Create a Dockerfile to add the CUDA runtime and the `pycuda` library to the base image. 
The following is an example Dockerfile content for adding CUDA 11.7 and `pycuda` 2021.1: + + ```dockerfile + FROM pivotaldata/plcontainer_python3_shared:devel + + ENV XKBLAYOUT=en + ENV DEBIAN_FRONTEND=noninteractive + + # Install CUDA from https://developer.nvidia.com/cuda-downloads + # By downloading and using the software, you agree to fully comply with the terms and conditions of the CUDA EULA. + RUN true &&\ + wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin && \ + mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600 && \ + wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu1804-11-7-local_11.7.0-515.43.04-1_amd64.deb && \ + dpkg -i cuda-repo-ubuntu1804-11-7-local_11.7.0-515.43.04-1_amd64.deb && \ + cp /var/cuda-repo-ubuntu1804-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/ && \ + apt-get update && \ + apt-get -y install cuda && \ + rm cuda-repo-ubuntu1804-11-7-local_11.7.0-515.43.04-1_amd64.deb &&\ + rm -rf /var/lib/apt/lists/* + + ENV PATH="/usr/local/cuda-11.7/bin/:${PATH}" + ENV LD_LIBRARY_PATH="/usr/local/cuda-11.7/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" + ENV CUDA_HOME="/usr/local/cuda-11.7" + + RUN true && \ + python3.7 -m pip --no-cache-dir install typing-extensions==3.10.0.0 && \ + python3.7 -m pip --no-cache-dir install Mako==1.2.0 && \ + python3.7 -m pip --no-cache-dir install platformdirs==2.5.2 && \ + python3.7 -m pip --no-cache-dir install pytools==2022.1.2 && \ + python3.7 -m pip --no-cache-dir install pycuda==2021.1 + ``` + +3. **Build the custom image**: + + ```bash + docker build . -t localhost/plcontainer_python3_cuda_shared:latest + ``` + +4. **Add and configure the runtime**: + + - Add a new runtime and associate it with the image you just built. 
+ + ```bash + plcontainer runtime-add -r plc_python_cuda_shared -I localhost/plcontainer_python3_cuda_shared:latest -l python3 + ``` + + - Edit the runtime configuration to assign a GPU device to it. + + ```bash + plcontainer runtime-edit + ``` + + In the XML configuration, add a `<device_request>` section for this runtime. + + ```xml + <runtime> + <id>plc_python_cuda_shared</id> + <image>localhost/plcontainer_python3_cuda_shared:latest</image> + ... + <device_request type="gpu"> + <deviceid>0</deviceid> + </device_request> + </runtime> + ``` + +#### Create and run a CUDA function example + +1. Connect to the database and create a function that uses the CUDA runtime. + + ```sql + CREATE FUNCTION hello_cuda() RETURNS float4[] AS $$ + # container: plc_python_cuda_shared + + import pycuda.driver as drv + import pycuda.tools + import pycuda.autoinit + import numpy + import numpy.linalg as la + from pycuda.compiler import SourceModule + + mod = SourceModule(""" + __global__ void multiply_them(float *dest, float *a, float *b) + { + const int i = threadIdx.x; + dest[i] = a[i] * b[i]; + } + """) + + multiply_them = mod.get_function("multiply_them") + + a = numpy.random.randn(400).astype(numpy.float32) + b = numpy.random.randn(400).astype(numpy.float32) + + dest = numpy.zeros_like(a) + multiply_them( + drv.Out(dest), drv.In(a), drv.In(b), + block=(400,1,1)) + + return [float(i) for i in (dest-a*b)] + + $$ LANGUAGE plcontainer; + ``` + +2. Execute the function and verify the result. + + ```sql + -- Calculates the sum of the resulting array, which is expected to be 0.0. + WITH a AS (SELECT unnest(hello) AS cuda FROM hello_cuda() AS hello) SELECT sum(cuda) FROM a; + ``` + +### Configure remote PL/Container + +You can configure one or more remote hosts outside the cluster to execute PL/Container workloads, thereby reducing the computational overhead on database hosts. + +#### Prerequisites + +- PL/Container version is not lower than 2.4.0. +- Docker engine (version not lower than v19.03) is installed on the remote host, and you have `root` or `sudo` permissions.
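The version prerequisites above can be checked from a shell before you start configuring hosts. The following is a minimal sketch, not part of the official tooling: the helper `version_ge` and the sample version values are illustrative, and on a real host you would substitute the output of the commented commands.

```shell
#!/bin/sh
# Returns success (0) when dotted version string $1 is >= $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Minimum versions from the prerequisites above.
docker_min="19.03"
plcontainer_min="2.4.0"

# On a real remote host you would capture the installed versions, for example:
#   docker_ver=$(docker version --format '{{.Server.Version}}')
docker_ver="20.10.7"      # illustrative value
plcontainer_ver="2.4.0"   # illustrative value

version_ge "$docker_ver" "$docker_min" || echo "Docker engine too old: $docker_ver"
version_ge "$plcontainer_ver" "$plcontainer_min" || echo "PL/Container too old: $plcontainer_ver"
```

Note that `sort -V` (natural version sort) is a GNU coreutils feature, so this check assumes a GNU userland on the host.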
+ +#### Step 1: Configure the remote host + +1. **Install Docker**: Install Docker on the remote host. + +2. **Enable remote API**: Edit Docker's service file to make it listen on a TCP port (for example, 2375). + + ```bash + sudo systemctl edit docker.service + + # Adds the following content to the beginning of the file: + [Service] + ExecStart= + ExecStart=/usr/bin/dockerd -H fd:// -H tcp://0.0.0.0:2375 + + # Restarts the Docker service. + sudo systemctl restart docker + ``` + +3. **Prepare the remote environment**: Ensure that the remote host has the `gpadmin` user, passwordless SSH access, `python3`, and `rsync`. Then, perform the remote setup from the coordinator node. + + ```bash + # Creates a directory on the remote host. + ssh gpadmin@<remote_host> "sudo mkdir $GPHOME && sudo chown gpadmin:gpadmin $GPHOME" + + # Copies the plcontainer client from the coordinator to the remote host (multiple hosts can be specified). + plcontainer remote-setup --hosts <remote_host1>,<remote_host2> + ``` + +#### Step 2: Load the Docker image to the remote host + +From the coordinator node, load the required Docker image onto one or more remote hosts. + +```bash +plcontainer image-add --hosts <remote_host1>,<remote_host2> -f <image_file> +``` + +#### Step 3: Configure the backend node + +1. On the coordinator node, edit the PL/Container configuration file. + + ```bash + plcontainer runtime-edit + ``` + +2. In the XML file, add a new `<backend>` section to define the remote host cluster. Then, modify the existing `<runtime>` section to use this newly defined backend via `<backend>`. + + ```xml + <configuration> + <backend name="remote_backend_1"> + <address>REMOTE_HOST_IP</address> + <port>2375</port> + </backend> + <runtime> + <id>plc_python_remote</id> + <image>your_image:tag</image> + <command>/clientdir/py3client.sh</command> + <backend name="remote_backend_1"/> + </runtime> + </configuration>
+ ``` + + If you have multiple remote hosts, you need to create different backend and runtime configurations for them. + +#### Step 4: Verify the configuration + +Create a function that uses the remote runtime and execute it. If it succeeds, it means the function has run on the remote host. + +```sql +CREATE FUNCTION dummy_remote_python() RETURNS text AS $$ +# container: plc_python_remote +return 'hello from a remote Python container' +$$ LANGUAGE plcontainer; + +SELECT * from dummy_remote_python(); +``` + +## Notes + +- If a PL/Container Docker container exceeds the allowed maximum memory, it will be terminated with an out-of-memory warning. +- PL/Container does not limit the base device size of Docker containers. In some cases, the Docker daemon controls this size. For example, if the Docker storage driver is `devicemapper`, its base size defaults to 10GB. You can use the `docker info` command to view the storage driver and related configurations. +- When a PL/Container UDF is executed, the Query Executor (QE) process starts and reuses Docker containers as needed. After being idle for a period, the QE process exits and destroys its Docker containers. You can control this idle time through the `gp_vmem_idle_resource_timeout` parameter, thereby affecting the container reuse policy. However, note that modifying this parameter also affects the recycling of other database resources, which might impact performance. \ No newline at end of file diff --git a/docs/developer/functions-and-procedures/use-pl-java.md b/docs/developer/functions-and-procedures/use-pl-java.md new file mode 100644 index 0000000000..48b62cf7a6 --- /dev/null +++ b/docs/developer/functions-and-procedures/use-pl-java.md @@ -0,0 +1,155 @@ +--- +title: Use PL/Java +--- + +# Use PL/Java + +PL/Java is an embedded trusted procedural language that allows you to write PostgreSQL functions and triggers using the Java programming language. 
+ +With PL/Java, you can: + +- Write functions in Java and call them from SQL. +- Use the Java Standard Library. +- Access the database via JDBC from within the Java function. + +## Enable PL/Java + +To use PL/Java, you must enable it in your database. + +1. Connect to your database using `psql`. + + ```bash + psql -d <dbname> + ``` + +2. Create the `pljava` extension. + + ```sql + CREATE EXTENSION pljava; + ``` + + This registers both the trusted (`java`) and untrusted (`javau`) languages. + +3. Configure the classpath. + + You need to set the `pljava_classpath` configuration parameter to include the JAR files containing your Java classes. + + ```sql + SET pljava_classpath = 'examples.jar:myclasses.jar'; + ``` + + To make this change permanent for the database: + + ```sql + ALTER DATABASE <dbname> SET pljava_classpath = 'examples.jar:myclasses.jar'; + ``` + +## Write PL/Java functions + +To create a PL/Java function, you write a Java class with static methods, compile it into a JAR file, and then declare the function in SQL. + +### SQL declaration + +A Java function is declared with the name of a class and a static method on that class. + +```sql +CREATE FUNCTION getsysprop(VARCHAR) +RETURNS VARCHAR +AS 'java.lang.System.getProperty' +LANGUAGE java; +``` + +You can then call the function just like any other SQL function: + +```sql +SELECT getsysprop('user.home'); +``` + +### Type mapping + +PL/Java automatically maps PostgreSQL types to Java types.
+ +| PostgreSQL type | Java type | +| :--- | :--- | +| `boolean` | `boolean` | +| `smallint` | `short` | +| `integer` | `int` | +| `bigint` | `long` | +| `real` | `float` | +| `double precision` | `double` | +| `text`, `varchar`, `char` | `java.lang.String` | +| `date` | `java.sql.Date` | +| `time` | `java.sql.Time` | +| `timestamp` | `java.sql.Timestamp` | +| `bytea` | `byte[]` | + +### NULL handling + +Primitive Java types (like `int`, `double`, `boolean`) cannot be `NULL`. If you pass a SQL `NULL` to a Java function expecting a primitive, it will result in an error unless you map it to the corresponding wrapper class (for example, `java.lang.Integer`). + +Example of handling NULLs: + +1. Create the Java method. + + ```java + package com.example; + public class Utils { + public static boolean trueIfEvenOrNull(Integer value) { + return (value == null) ? true : (value.intValue() % 2) == 0; + } + } + ``` + +2. Declare the function in SQL using the full class name. + + ```sql + CREATE FUNCTION true_if_even_or_null(integer) + RETURNS boolean + AS 'com.example.Utils.trueIfEvenOrNull(java.lang.Integer)' + LANGUAGE java; + ``` + +### Complex types + +You can pass complex types and rows to Java functions. PL/Java supports `Complex` types and `ResultSet` for handling rows and sets. + +## Use JDBC + +PL/Java includes a JDBC driver that allows your Java code to access the database where the function is running. + +To establish a connection to the current database session: + +```java +import java.sql.Connection; +import java.sql.DriverManager; + +Connection conn = DriverManager.getConnection("jdbc:default:connection"); +``` + +### Limitations + +When using the internal JDBC driver (`jdbc:default:connection`): +- You cannot manage transactions explicitly (`commit()`, `rollback()`, etc. are not allowed). The function runs within the transaction context of the calling SQL statement. +- `Savepoint` operations are allowed but must be released within the same function. 
+- `ResultSet` from `executeQuery()` is always `FETCH_FORWARD` and `CONCUR_READ_ONLY`. + +## Handle exceptions + +PL/Java maps database errors to `java.sql.SQLException`. You can catch standard SQL exceptions in your Java code. If your Java code throws an exception, it is reported as an error to the PostgreSQL client, and the transaction is aborted (unless caught by a savepoint). + +## Log messages + +PL/Java uses the standard `java.util.logging.Logger`. Messages logged via this logger are redirected to the PostgreSQL log system (`elog`). + +```java +import java.util.logging.Logger; + +Logger.getAnonymousLogger().info("Time is " + new java.util.Date()); +``` + +The database configuration `log_min_messages` determines which log levels are actually sent to the client or written to the server log. + +## Security + +- **Trusted Language (`java`)**: When using the trusted `java` language, the Java security manager prevents access to the file system and other sensitive system resources using standard Java security policies. Any user can create functions in the trusted language. +- **Untrusted Language (`javau`)**: The untrusted `javau` language allows unrestricted access (for example, file system access), similar to a standalone Java application. Only superusers can create functions using `javau`. diff --git a/docs/developer/functions-and-procedures/use-pl-perl.md b/docs/developer/functions-and-procedures/use-pl-perl.md new file mode 100644 index 0000000000..2f76e14c8f --- /dev/null +++ b/docs/developer/functions-and-procedures/use-pl-perl.md @@ -0,0 +1,94 @@ +--- +title: Use PL/Perl +--- + +# Use PL/Perl + +PL/Perl is an embedded procedural language that allows you to write PostgreSQL functions using the Perl programming language. + +With PL/Perl, you can: + +- Write functions in Perl and call them from SQL. +- Use the powerful string manipulation features of Perl. +- Use available Perl modules. 
+ +## Enable PL/Perl + +To use PL/Perl, you must enable it in your database. + +1. Connect to your database using `psql`. + + ```bash + psql -d <dbname> + ``` + +2. Create the `plperl` extension. + + ```sql + CREATE EXTENSION plperl; + ``` + + This registers the trusted (`plperl`) language. If you want the untrusted language, use `CREATE EXTENSION plperlu`. + +## Write PL/Perl functions + +You define a PL/Perl function using the standard SQL `CREATE FUNCTION` syntax. The body of the function is ordinary Perl code. + +```sql +CREATE FUNCTION perl_max (integer, integer) +RETURNS integer +AS $$ + if ($_[0] > $_[1]) { return $_[0]; } + return $_[1]; +$$ LANGUAGE plperl; +``` + +### Arguments and results + +- Arguments are accessed via the `@_` array. +- You return a result value with the `return` statement or as the last evaluated expression. +- To return a SQL `NULL`, return the Perl `undef`. + +### Strict functions + +By default, a PL/Perl function is called even when some of its arguments are SQL `NULL`; each `NULL` argument appears in `@_` as the Perl `undef`. If you declare the function `STRICT`, the server skips the call whenever any argument is `NULL` and automatically returns `NULL` instead. + +```sql +CREATE FUNCTION perl_max_strict (integer, integer) +RETURNS integer +AS $$ + if ($_[0] > $_[1]) { return $_[0]; } + return $_[1]; +$$ LANGUAGE plperl STRICT; +``` + +## Built-in functions + +PL/Perl provides access to the database via built-in functions. + +- `spi_exec_query(query [, limit])`: Executes a query and returns the result. +- `elog(level, msg)`: Emits a log message. + +Example of using `spi_exec_query`, assuming a table `test` with a `varchar` column `v`: + +```sql +CREATE OR REPLACE FUNCTION return_match(varchar) RETURNS SETOF test AS $$ + my $rv = spi_exec_query('select * from test;'); + my $nrows = $rv->{processed}; + foreach my $rn (0 .. $nrows - 1) { + my $row = $rv->{rows}[$rn]; + if (index($row->{v}, $_[0]) != -1) { + return_next($row); + } + } + return undef; +$$ LANGUAGE plperl; +``` + +## Security and limitations + +- **Trusted Language (`plperl`)**: Restricts file system operations and other potentially unsafe operations.
Any user can create functions in `plperl`. +- **Untrusted Language (`plperlu`)**: Allows unrestricted access to the system. Only superusers can create functions in `plperlu`. +- **Limitations**: + - PL/Perl triggers are not supported in Apache Cloudberry. + - PL/Perl functions cannot call each other directly. diff --git a/docs/developer/functions-and-procedures/use-pl-pgsql.md b/docs/developer/functions-and-procedures/use-pl-pgsql.md new file mode 100644 index 0000000000..024e88c9c5 --- /dev/null +++ b/docs/developer/functions-and-procedures/use-pl-pgsql.md @@ -0,0 +1,108 @@ +--- +title: Use PL/pgSQL +--- + +# Use PL/pgSQL + +PL/pgSQL is a loadable procedural language that is installed and registered by default with Apache Cloudberry. It adds the ability to perform complex computations and usage of control structures to standard SQL. + +## Benefits of PL/pgSQL + +- **SQL integration**: It is designed to be easy to use with SQL. +- **Portability**: Functions written in PL/pgSQL can be used on any platform where Apache Cloudberry runs. +- **Performance**: It reduces the communication overhead between the client and the server by grouping multiple SQL statements into a single block. + +## Structure of PL/pgSQL functions + +A PL/pgSQL function differs from a standard SQL function in that the function body is organized into blocks. + +```sql +[ <