Load data into a Redshift database with a Singer target.
Full documentation can be found on the GitHub Repo.
type: "io.kestra.plugin.singer.targets.pipelinewiseredshift"

Dynamic: YES · Min length: 1. Name of the schema where the tables will be created.
If schema_mapping is not defined, every stream sent by the tap is loaded into this schema.
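For orientation, here is a minimal sketch of a flow using this target. The tap task type, the output reference, and several property names (from, host, port, username, password, dbName, defaultTargetSchema, s3Bucket, accessKeyId, secretKeyId) are illustrative assumptions inferred from the property descriptions below, not verified against the plugin schema; check the plugin reference for the exact names.

```yaml
id: redshift_load
namespace: company.team

tasks:
  - id: tap
    # Any Singer tap task producing raw Singer messages (illustrative type).
    type: io.kestra.plugin.singer.taps.PipelinewiseMysql

  - id: target
    type: "io.kestra.plugin.singer.targets.pipelinewiseredshift"
    from: "{{ outputs.tap.raw }}"    # the raw data from a tap (assumed output reference)
    host: my-cluster.example.redshift.amazonaws.com
    port: "5439"
    username: admin                  # the database user
    password: "{{ secret('REDSHIFT_PASSWORD') }}"
    dbName: analytics
    defaultTargetSchema: public      # schema where the tables will be created
    s3Bucket: my-singer-staging-bucket
    accessKeyId: "{{ secret('AWS_ACCESS_KEY_ID') }}"
    secretKeyId: "{{ secret('AWS_SECRET_ACCESS_KEY') }}"
```

The S3 credentials are used for staging files and for the Redshift COPY operation, per the property descriptions below.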
Dynamic: YES. The raw data from a tap.
Dynamic: YES · Min length: 1. The database hostname.
Dynamic: NO. The database port.
Dynamic: YES · Min length: 1. The S3 bucket name.
Dynamic: YES · Min length: 1. The database user.
Dynamic: YES. S3 Access Key ID.
Used for S3 and Redshift COPY operations.
Dynamic: NO · Default: false. Add metadata columns.
Metadata columns add extra row-level information about data ingestion (e.g. when the row was read from the source, or when it was inserted or deleted in Redshift). Metadata columns are created automatically by adding extra columns, prefixed with _SDC, to the tables. The available metadata columns are documented here. Enabling metadata columns will flag deleted rows by setting the _SDC_DELETED_AT metadata column. Without the addMetadataColumns option, rows deleted by singer taps will not be recognizable in Redshift.
Dynamic: NO · Default: 100000. Maximum number of rows in each batch.
At the end of each batch, the rows in the batch are loaded into Redshift.
Dynamic: YES. Override the default Singer command.
Dynamic: YES · Default: bzip2 · Allowed values: gzip, bzip2. The compression method to use when writing files to S3 and running the Redshift COPY.
Dynamic: YES · Default: python:3.10.12. The task runner container image; only used if the task runner is container-based.
Dynamic: YES. COPY options.
Parameters to use in the COPY command when loading data into Redshift. Some basic file-formatting parameters are fixed values (CSV GZIP DELIMITER ',' REMOVEQUOTES ESCAPE), and overriding them with custom values is not recommended.
Dynamic: NO · Default: 0. Object-type RECORD items from taps can be transformed into flattened columns by creating columns automatically.
Dynamic: YES. The database name.
Dynamic: YES. Grant USAGE privilege on newly created schemas and grant SELECT privilege on newly created tables to a specific list of users or groups.
If schemaMapping is not defined, every stream sent by the tap is granted accordingly.
Dynamic: NO · Default: false. Disable table cache.
By default, the connector caches the available table structures in Redshift at startup, so it doesn't need to run additional queries while ingesting data to check whether altering the target tables is required. The disable_table_cache option turns this caching off: you will always see the most recent table structures, at the cost of extra query runtime.
Dynamic: NO. Deprecated; use 'taskRunner' instead.
Dynamic: NO · Default: false. Flush and load every stream into Redshift when one batch is full.
Warning: this may trigger the COPY command to use files with a low number of records.
Dynamic: NO · Default: false. Delete rows on Redshift.
When the hardDelete option is true, DELETE SQL commands are performed in Redshift to delete rows in tables. This is achieved by continuously checking the _SDC_DELETED_AT metadata column sent by the singer tap. Because deleting rows requires metadata columns, the hardDelete option automatically enables the addMetadataColumns option as well.
Dynamic: NO · Default: 16. Maximum number of parallel threads to use when flushing tables.
Dynamic: NO · Default: 0. The number of threads used to flush tables.
0 creates a thread for each stream, up to parallelism_max. -1 creates a thread for each CPU core. Any other positive number creates that number of threads, up to parallelism_max.
Dynamic: YES. The database user's password.
Dynamic: YES. Override default pip packages to use a specific version.
Dynamic: NO · Default: true. Log-based and incremental replications on tables with no primary key cause duplicates when merging UPDATE events.
When set to true, stop loading data if no primary key is defined.
Dynamic: YES. AWS Redshift COPY role ARN.
AWS role ARN to be used for the Redshift COPY operation. If provided, it is used instead of the given AWS keys for the COPY operation; the keys are still used for other S3 operations.
Dynamic: YES. AWS S3 ACL.
S3 object ACL.
Dynamic: YES. S3 key prefix.
A static prefix prepended to the generated S3 key names. Using prefixes, you can upload files into specific directories in the S3 bucket. Default: none.
Dynamic: YES. Schema mapping.
Useful if you want to load multiple streams from one tap into multiple Redshift schemas. If the tap sends the stream_id in <schema_name>-<table_name> format, this option overwrites the default_target_schema value. Note that using schema_mapping, you can also overwrite the default_target_schema_select_permissions value to grant SELECT permissions to different groups per schema, or optionally create indices automatically for the replicated tables.
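As a sketch, a mapping for one stream might look like the following. The structure follows the pipelinewise-target-redshift schema_mapping convention; the stream, schema, and group names are made up, and the camelCase property name is an assumption.

```yaml
schemaMapping:
  my_tap_stream:                    # stream id sent by the tap (illustrative)
    target_schema: sales            # overrides default_target_schema for this stream
    target_schema_select_permissions:
      - analytics_readers           # group granted SELECT on the replicated tables
```

Each stream key maps to its own target schema, which is how one tap can feed multiple Redshift schemas.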
Dynamic: YES. S3 Secret Access Key.
Used for S3 and Redshift COPY operations.
Dynamic: YES. AWS S3 session token.
S3 AWS STS token for temporary credentials.
Dynamic: NO · Default: false. Do not update existing records when a primary key is defined.
Useful to improve performance when records are immutable, e.g. events.
Dynamic: NO · Default: 1. Number of slices to split files into prior to running COPY on Redshift.
This should be set to the number of Redshift slices. The number of slices per node depends on the node size of the cluster; run SELECT COUNT(DISTINCT slice) slices FROM stv_slices to calculate it.
Dynamic: YES · Default: singer-state. The name of the Singer state file stored in the KV Store.
Dynamic: NO. The task runner to use.
Task runners are provided by plugins; each has its own properties.
Dynamic: NO · Default: false. Validate every single RECORD message against the corresponding JSON schema.
This option is disabled by default, so invalid RECORD messages fail only at load time in Redshift. Enabling this option detects invalid records earlier but can cause performance degradation.
Key of the state in the KV Store.
Dynamic: YES · Default: busybox. The image used for the file sidecar container.
Dynamic: NO. The maximum amount of CPU resources a container can use.
Make sure to set this to a numeric value, e.g. cpus: "1.5" or cpus: "4". For instance, if the host machine has two CPUs and you set cpus: "1.5", the container is guaranteed at most one and a half of the CPUs.
Dynamic: YES. The registry authentication.
The auth field is a base64-encoded authentication string of username:password, or a token.
Dynamic: YES. The identity token.
Dynamic: YES. The registry password.
Dynamic: YES. The registry URL.
If not defined, the registry will be extracted from the image name.
Dynamic: YES. The registry token.
Dynamic: YES. The registry username.
Dynamic: YES. The ARM resource ID of the user-assigned identity.
Dynamic: YES. Extra boot disk size for each task.
Dynamic: YES. The milliCPU count.
Defines the amount of CPU resources per task in milliCPU units. For example, 1000 corresponds to 1 vCPU per task. If undefined, the default value is 2000.
If you also define the VM's machine type using the machineType property in the InstancePolicy field, or inside the instanceTemplate in the InstancePolicyOrTemplate field, make sure the CPU resources for both fields are compatible with each other and with how many tasks you want to allow to run on the same VM at the same time.
For example, if you specify the n2-standard-2 machine type, which has 2 vCPUs, you can set the cpu to no more than 2000. Alternatively, you can run two tasks on the same VM if you set the cpu to 1000 or less.
Dynamic: YES. Memory in MiB.
Defines the amount of memory per task in MiB units. If undefined, the default value is 2048. If you also define the VM's machine type using the machineType in the InstancePolicy field, or inside the instanceTemplate in the InstancePolicyOrTemplate field, make sure the memory resources for both fields are compatible with each other and with how many tasks you want to allow to run on the same VM at the same time.
For example, if you specify the n2-standard-2 machine type, which has 8 GiB of memory, you can set the memory to no more than 8192.
Dynamic: NO. The configuration of the target Kubernetes cluster.
Dynamic: YES. Additional YAML spec for the container.
Dynamic: NO · Default: true. Whether the pod should be deleted upon completion.
Dynamic: YES. Additional YAML spec for the sidecar container.
Dynamic: NO · Default: {"image": "busybox"}. The configuration of the file sidecar container that handles download and upload of files.
Dynamic: YES. The pod custom labels.
Kestra will add default labels to the pod with execution and flow identifiers.
Dynamic: YES · Default: default. The namespace where the pod will be created.
Dynamic: YES. Node selector for pod scheduling.
Kestra will assign the pod to the nodes you want (see Assign Pod Nodes).
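For example, a node selector pinning pods to SSD-backed amd64 nodes might look like this. The label keys follow standard Kubernetes conventions; the nodeSelector property name is assumed from the description above.

```yaml
nodeSelector:
  disktype: ssd              # custom node label (illustrative)
  kubernetes.io/arch: amd64  # well-known Kubernetes node label
```

The pod is then scheduled only onto nodes carrying all of the listed labels.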
Dynamic: YES. Additional YAML spec for the pod.
Dynamic: YES · Default: ALWAYS · Allowed values: IF_NOT_PRESENT, ALWAYS, NEVER. The image pull policy for a container image and the tag of the image, which affect when Docker attempts to pull (download) the specified image.
Dynamic: NO. The pod custom resources.
Dynamic: NO · Default: true. Whether to reconnect to the current pod if it already exists.
Dynamic: YES. The name of the service account.
Dynamic: NO · Validation regex: \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+). The version of the plugin to use.
Dynamic: YES · Default: PT5S · Format: duration. The additional duration to wait for logs to arrive after pod completion.
As logs are not retrieved in real time, we cannot guarantee that all logs have been fetched when the pod completes; we therefore wait for a fixed amount of time to fetch late logs.
Dynamic: YES · Default: PT1H · Format: duration. The maximum duration to wait for pod completion, unless the task timeout property is set, which takes precedence over this property.
Dynamic: YES · Default: PT10M · Format: duration. The maximum duration to wait until the pod is created.
This timeout is the maximum time that the Kubernetes scheduler can take to:
- schedule the pod
- pull the pod image
- start the pod.
Dynamic: YES. The Batch account name.
Dynamic: YES. The blob service endpoint.
Dynamic: YES. Id of the pool on which to run the job.
Dynamic: YES. The Batch access key.
Dynamic: YES · Default: PT5S · Format: duration. Determines how often Kestra should poll the container for completion. By default, the task runner checks every 5 seconds whether the job is completed. You can set this to a lower value (e.g. PT0.1S = every 100 milliseconds) for quick jobs and to a higher value (e.g. PT1M = every minute) for long-running jobs. Setting this property to a higher value will reduce the number of API calls Kestra makes to the remote service; keep that in mind in case you see API rate limit errors.
Dynamic: NO · Default: true. Whether the job should be deleted upon completion.
Warning: if the job is not deleted, a retry of the task could resume an old failed attempt of the job.
Dynamic: NO. The private registry which contains the container image.
Dynamic: NO · Default: true. Whether to reconnect to the current job if it already exists.
Dynamic: NO · Default: false. Enable log streaming during task execution.
This property is useful for capturing logs from tasks that have a timeout. If a task with a timeout is terminated, this property makes sure all logs up to that point are retrieved.
Dynamic: NO · Validation regex: \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+). The version of the plugin to use.
Dynamic: YES · Default: PT1H · Format: duration. The maximum duration to wait for job completion, unless the task timeout property is set, which takes precedence over this property.
Azure Batch will automatically time out the job upon reaching this duration, and the task will fail.
Dynamic: YES. Exit codes of a task execution.
If more than one exit code is listed, the condition is met when the task exits with any of the listed codes, and the action will be executed.
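A sketch of such a condition, retrying only when the container exits with an OOM-kill code. The surrounding property names (lifecyclePolicies, action, actionCondition, exitCodes) are illustrative assumptions based on the descriptions in this section, not taken from the plugin schema.

```yaml
lifecyclePolicies:           # assumed property name
  - action: RETRY_TASK       # one of the actions listed below
    actionCondition:
      exitCodes:             # exit codes of a task execution
        - 137                # e.g. container killed by the OOM killer
```

With several codes listed, any matching exit code triggers the action.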
Dynamic: YES. The GCP region.
Dynamic: YES. Google Cloud Storage bucket to use to upload (inputFiles and namespaceFiles) and download (outputFiles) files.
Providing a bucket is mandatory if you want to use such properties.
Dynamic: YES · Default: PT5S · Format: duration. Determines how often Kestra should poll the container for completion. By default, the task runner checks every 5 seconds whether the job is completed. You can set this to a lower value (e.g. PT0.1S = every 100 milliseconds) for quick jobs and to a higher value (e.g. PT1M = every minute) for long-running jobs. Setting this property to a higher value will reduce the number of API calls Kestra makes to the remote service; keep that in mind in case you see API rate limit errors.
Dynamic: NO · Default: true. Whether the job should be deleted upon completion.
Dynamic: YES. The GCP service account to impersonate.
Dynamic: NO · Default: 3. The maximum number of retries for the Cloud Run job. By default, the task runner retries the job up to 3 times.
Dynamic: YES. The GCP project ID.
Dynamic: NO · Default: true. Whether to reconnect to the current job if it already exists.
Dynamic: YES · Default: ["https://www.googleapis.com/auth/cloud-platform"]. The GCP scopes to be used.
Dynamic: YES. The GCP service account key.
Dynamic: NO · Validation regex: \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+). The version of the plugin to use.
Dynamic: YES. The full resource name of the VPC Access Connector to route egress traffic through.
Example: projects/my-project/locations/europe-west1/connectors/my-connector
Dynamic: YES · Default: VPC_EGRESS_UNSPECIFIED · Allowed values: ALL_TRAFFIC, PRIVATE_RANGES_ONLY, UNRECOGNIZED. The VPC egress setting for the Cloud Run job.
Must be PRIVATE_RANGES_ONLY or ALL_TRAFFIC (case-insensitive). Requires vpcAccessConnector to be set.
Dynamic: YES · Default: PT5S · Format: duration. Additional time after the job ends to wait for late logs.
Dynamic: YES · Default: PT1H · Format: duration. The maximum duration to wait for job completion, unless the task timeout property is set, which takes precedence over this property.
Google Cloud Run will automatically time out the job upon reaching this duration, and the task will fail.
Dynamic: YES · Default: ACTION_UNSPECIFIED · Allowed values: RETRY_TASK, FAIL_TASK, UNRECOGNIZED. Action on task failures based on different conditions.
Dynamic: NO. Conditions for actions to deal with task failures.
Dynamic: YES. Network identifier, in the format projects/HOST_PROJECT_ID/global/networks/NETWORK.
Dynamic: YES. Subnetwork identifier, in the format projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNET.
Dynamic: YES · Default: v1. The API version.
Dynamic: YES. CA certificate as data.
Dynamic: YES. CA certificate as a file path.
Dynamic: YES. Client certificate as data.
Dynamic: YES. Client certificate as a file path.
Dynamic: YES · Default: RSA. Client key encryption algorithm.
The default is RSA.
Dynamic: YES. Client key as data.
Dynamic: YES. Client key as a file path.
Dynamic: YES. Client key passphrase.
Dynamic: NO. Disable hostname verification.
Dynamic: YES. Key store file.
Dynamic: YES. Key store passphrase.
Dynamic: YES · Default: https://kubernetes.default.svc. The URL to the Kubernetes API.
Dynamic: YES. The namespace used.
Dynamic: YES. OAuth token.
Dynamic: NO. OAuth token provider.
Dynamic: YES. Password.
Dynamic: NO. Trust all certificates.
Dynamic: YES. Truststore file.
Dynamic: YES. Truststore passphrase.
Dynamic: YES. Username.
Dynamic: YES. The URL of the blob container the compute node should use.
Mandatory if you want to use the namespaceFiles, inputFiles, or outputFiles properties.
Dynamic: YES. Connection string of the Storage Account.
Dynamic: YES. The blob service endpoint.
Dynamic: NO · Validation regex: \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+). The version of the plugin to use.
Dynamic: YES. The GCP region.
Dynamic: YES. Google Cloud Storage bucket to use to upload (inputFiles and namespaceFiles) and download (outputFiles) files.
Providing a bucket is mandatory if you want to use such properties.
Dynamic: YES · Default: PT5S · Format: duration. Determines how often Kestra should poll the container for completion. By default, the task runner checks every 5 seconds whether the job is completed. You can set this to a lower value (e.g. PT0.1S = every 100 milliseconds) for quick jobs and to a higher value (e.g. PT1M = every minute) for long-running jobs. Setting this property to a higher value will reduce the number of API calls Kestra makes to the remote service; keep that in mind in case you see API rate limit errors.
Dynamic: NO. Compute resource requirements.
ComputeResource defines the amount of resources required for each task. Make sure your tasks have enough compute resources to run successfully. If you also define the types of resources for a job to use with the InstancePolicyOrTemplate field, make sure both fields are compatible with each other.
Dynamic: NO · Default: true. Whether the job should be deleted upon completion.
Warning: if the job is not deleted, a retry of the task could resume an old failed attempt of the job.
Dynamic: YES. Container entrypoint to use.
Dynamic: YES. The GCP service account to impersonate.
Dynamic: NO. Lifecycle management schema to apply when any task in a task group fails.
Currently, only one lifecycle policy is supported. When the lifecycle policy condition is met, the action in the policy executes. If the task execution result does not match any defined lifecycle policy, the default policy applies: if the exit code is 0, exit the task; if the task ends with a non-zero exit code, retry it up to max_retry_count times.
Dynamic: YES · Default: e2-medium. The GCP machine type.
Dynamic: NO · Default: 2 · Range: >= 0, <= 10. Maximum number of retries on failures.
A value of 0 means never retry.
Dynamic: YES. The GCP project ID.
Dynamic: YES. Compute reservation.
Dynamic: NO · Default: true. Whether to reconnect to the current job if it already exists.
Dynamic: YES · Default: ["https://www.googleapis.com/auth/cloud-platform"]. The GCP scopes to be used.
Dynamic: YES. The GCP service account key.
Dynamic: NO · Validation regex: \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+). The version of the plugin to use.
Dynamic: YES · Default: PT5S · Format: duration. Additional time after the job ends to wait for late logs.
Dynamic: YES · Default: PT1H · Format: duration. The maximum duration to wait for job completion, unless the task timeout property is set, which takes precedence over this property.
Google Cloud Batch will automatically time out the job upon reaching this duration, and the task will fail.
Dynamic: YES. The maximum amount of kernel memory the container can use.
The minimum allowed value is 4MB. Because kernel memory cannot be swapped out, a container which is starved of kernel memory may block host machine resources, which can have side effects on the host machine and on other containers. See the kernel-memory docs for more details.
Dynamic: YES. The maximum amount of memory resources the container can use.
Make sure to use the format number + unit (regardless of case) without any spaces.
The unit can be KB (kilobytes), MB (megabytes), GB (gigabytes), etc.
Given that it's case-insensitive, the following values are equivalent:
"512MB", "512Mb", "512mb", "512000KB", "0.5GB"
It is recommended that you allocate at least 6MB.
Dynamic: YES. Allows you to specify a soft limit smaller than memory, which is activated when Docker detects contention or low memory on the host machine.
If you use memoryReservation, it must be set lower than memory for it to take precedence. Because it is a soft limit, it does not guarantee that the container doesn't exceed the limit.
Dynamic: YES. The total amount of memory and swap that can be used by a container.
If memory and memorySwap are set to the same value, this prevents containers from using any swap. This is because memorySwap includes both the physical memory and swap space, while memory is only the amount of physical memory that can be used.
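For instance, to cap a container at 512 MB with swap disabled, the two limits can be set to the same value. The nesting under a memory object and the exact property names follow the descriptions above and are assumptions; verify them against the task runner reference.

```yaml
memory:
  memory: "512MB"       # hard physical memory limit
  memorySwap: "512MB"   # equal to memory, so no swap can be used
```

Setting memorySwap above memory would instead allow the difference to be used as swap.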
Dynamic: YES. A setting which controls the likelihood of the kernel swapping memory pages.
By default, the host kernel can swap out a percentage of anonymous pages used by a container. You can set memorySwappiness to a value between 0 and 100 to tune this percentage.
Dynamic: NO. By default, if an out-of-memory (OOM) error occurs, the kernel kills processes in a container.
To change this behavior, use the oomKillDisable option. Only disable the OOM killer on containers where you have also set the memory option. If the memory flag is not set, the host can run out of memory, and the kernel may need to kill the host system's processes to free memory.
Dynamic: YES. The reference to the user-assigned identity to use to access the Azure Container Registry, instead of a username and password.
Dynamic: YES. The password to log into the registry server.
Dynamic: YES. The registry server URL.
If omitted, the default is "docker.io".
Dynamic: YES. The user name to log into the registry server.
Dynamic: YES · Min length: 1. Docker image to use.
Dynamic: YES. Docker configuration file.
Docker configuration file that can set access credentials to private container registries. Usually located in ~/.docker/config.json.
Dynamic: NO. Limits the CPU usage to a given maximum threshold value.
By default, each container's access to the host machine's CPU cycles is unlimited. You can set various constraints to limit a given container's access to the host machine's CPU cycles.
Dynamic: YES. Docker entrypoint to use.
Dynamic: YES. Extra hostname mappings to the container network interface configuration.
Dynamic: YES. Docker API URI.
Dynamic: NO. Limits memory usage to a given maximum threshold value.
Docker can enforce hard memory limits, which allow the container to use no more than a given amount of user or system memory, or soft limits, which allow the container to use as much memory as it needs unless certain conditions are met, such as when the kernel detects low memory or contention on the host machine. Some of these options have different effects when used alone or when more than one option is set.
Dynamic: YES. Docker network mode to use, e.g. host, none, etc.
Dynamic: NO. Give extended privileges to this container.
Dynamic: YES · Default: IF_NOT_PRESENT · Allowed values: IF_NOT_PRESENT, ALWAYS, NEVER. The image pull policy for a container image and the tag of the image, which affect when Docker attempts to pull (download) the specified image.
Dynamic: YES. Size of /dev/shm in bytes.
The size must be greater than 0. If omitted, the system uses 64MB.
Dynamic: YES. User in the Docker container.
Dynamic: YES. List of volumes to mount.
Must be a valid mount expression as a string, for example: /home/user:/app.
Volume mounts are disabled by default for security reasons; you must enable them in the server configuration by setting kestra.tasks.scripts.docker.volume-enabled to true.
Dynamic: YES. Docker configuration file.
Docker configuration file that can set access credentials to private container registries. Usually located in ~/.docker/config.json.
Dynamic: NO. Limits the CPU usage to a given maximum threshold value.
By default, each container's access to the host machine's CPU cycles is unlimited. You can set various constraints to limit a given container's access to the host machine's CPU cycles.
Dynamic: NO · Default: true. Whether the container should be deleted upon completion.
Dynamic: YES · Default: [""]. Docker entrypoint to use.
Dynamic: YES. Extra hostname mappings to the container network interface configuration.
Dynamic: YES · Default: VOLUME · Allowed values: MOUNT, VOLUME. File handling strategy.
How to handle local files (input files, output files, namespace files, ...).
By default, we create a volume and copy the files into the volume bind path.
Configuring it to MOUNT will mount the working directory instead.
Dynamic: YES. Docker API URI.
Dynamic: NO · Default: PT0S · Format: duration. When a task is killed, this property sets the grace period before killing the container.
By default, we kill the container immediately when a task is killed. Optionally, you can configure a grace period so the container is stopped gracefully instead.
Dynamic: NO. Limits memory usage to a given maximum threshold value.
Docker can enforce hard memory limits, which allow the container to use no more than a given amount of user or system memory, or soft limits, which allow the container to use as much memory as it needs unless certain conditions are met, such as when the kernel detects low memory or contention on the host machine. Some of these options have different effects when used alone or when more than one option is set.
Dynamic: YES. Docker network mode to use, e.g. host, none, etc.
Dynamic: YES. List of port bindings.
Corresponds to the --publish (-p) option of the docker run CLI command, using the format ip:dockerHostPort:containerPort/protocol.
Possible examples: 8080:80/udp, 127.0.0.1:8080:80, 127.0.0.1:8080:80/udp.
Dynamic: NO. Give extended privileges to this container.
Dynamic: YES · Default: IF_NOT_PRESENT · Allowed values: IF_NOT_PRESENT, ALWAYS, NEVER. The pull policy for a container image.
Use the IF_NOT_PRESENT pull policy to avoid pulling already existing images. Use the ALWAYS pull policy to pull the latest version of an image even if an image with the same tag already exists.
Dynamic: YES. Size of /dev/shm in bytes.
The size must be greater than 0. If omitted, the system uses 64MB.
Dynamic: YES. User in the Docker container.
Dynamic: NO · Validation regex: \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+). The version of the plugin to use.
Dynamic: YES. List of volumes to mount.
Make sure to provide a map of a local path to a container path in the format /home/local/path:/app/container/path.
Volume mounts are disabled by default for security reasons; if you are sure you want to use them, enable the feature in the plugin configuration by setting volume-enabled to true.
Here is how you can add that setting to your Kestra configuration:
kestra:
  plugins:
    configurations:
      - type: io.kestra.plugin.scripts.runner.docker.Docker
        values:
          volume-enabled: true
Dynamic: NO · Default: true. Whether to wait for the container to exit.
Dynamic: YES. A list of capabilities; an OR list of AND lists of capabilities.
Dynamic: YES. Driver-specific options, specified as key/value pairs.
These options are passed directly to the driver.
Dynamic: YES. Compute environment in which to run the job.
Dynamic: YES. AWS region with which the SDK should communicate.
Dynamic: YES. Access Key Id in order to connect to AWS.
If no credentials are defined, we will use the default credentials provider chain to fetch credentials.
Dynamic: YES. S3 bucket to use to upload (inputFiles and namespaceFiles) and download (outputFiles) files.
Providing a bucket is mandatory if you want to use such properties.
Dynamic: YES · Default: PT5S · Format: duration. Determines how often Kestra should poll the container for completion. By default, the task runner checks every 5 seconds whether the job is completed. You can set this to a lower value (e.g. PT0.1S = every 100 milliseconds) for quick jobs and to a higher value (e.g. PT1M = every minute) for long-running jobs. Setting this property to a higher value will reduce the number of API calls Kestra makes to the remote service; keep that in mind in case you see API rate limit errors.
Dynamic: NO · Default: true. Whether the job should be deleted upon completion.
Warning: if the job is not deleted, a retry of the task could resume an old failed attempt of the job.
Dynamic: YES. The endpoint with which the SDK should communicate.
This property allows you to use a different S3-compatible storage backend.
Dynamic: YES. Execution role for the AWS Batch job.
Mandatory if the compute environment is ECS Fargate. See the AWS documentation for more details.
Dynamic: YES. Job queue to use to submit jobs (ARN). If not specified, the task runner will create a job queue; keep in mind that this can lead to a longer execution.
Dynamic: NO · Default: {"request": {"memory": "2048", "cpu": "1"}}. Custom resources for the ECS Fargate container.
See the AWS documentation for more details.
Dynamic: NO · Default: true. Whether to reconnect to the current job if it already exists.
Dynamic: YES. Secret Key Id in order to connect to AWS.
If no credentials are defined, we will use the default credentials provider chain to fetch credentials.
Dynamic: YES. AWS session token, retrieved from an AWS token service, used for authenticating that this user has received temporary permissions to access a given resource.
If no credentials are defined, we will use the default credentials provider chain to fetch credentials.
Dynamic: YES. The AWS STS endpoint with which the SDK client should communicate.
Dynamic: YES. AWS STS role.
The Amazon Resource Name (ARN) of the role to assume. If set, the task will use the StsAssumeRoleCredentialsProvider. If no credentials are defined, we will use the default credentials provider chain to fetch credentials.
Dynamic: YES. AWS STS external ID.
A unique identifier that might be required when you assume a role in another account. This property is only used when an stsRoleArn is defined.
Dynamic: YES · Default: PT15M · Format: duration. AWS STS session duration.
The duration of the role session (default: 15 minutes, i.e. PT15M). This property is only used when an stsRoleArn is defined.
Dynamic: YES. AWS STS session name.
This property is only used when an stsRoleArn is defined.
Dynamic: YES. Task role to use within the container.
Needed if you want to authenticate with the AWS CLI within your container.
Dynamic: NO · Validation regex: \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+). The version of the plugin to use.
Dynamic: YES · Default: PT1H · Format: duration. The maximum duration to wait for job completion, unless the task timeout property is set, which takes precedence over this property.
AWS Batch will automatically time out the job upon reaching that duration, and the task will be marked as failed.