GCP Dataproc Stream Cluster

Available in Syntasa environments installed in GCP, the GCP Dataproc Stream Cluster runtime type uses the cloud service to run jobs, execute code in notebooks, and so on, and exposes the various settings found in the cloud service.

This runtime type is also a streaming runtime and should only be used for streaming jobs, i.e. jobs that run continuously until they are manually stopped. Non-streaming runtimes, by contrast, are used for batch jobs and notebooks and are shut down once the job completes, the inactivity limit is hit, or the maximum timeout is reached.

The basic runtime attributes required for all runtime types are detailed in Creating Runtime Templates; the settings available for this runtime type are detailed below. Most fields are similar to those found in the GCP Dataproc Cluster runtime type, and the streaming-specific differences are noted in each section.

Instance type and options

The GCP Dataproc Stream Cluster runtime type enables several fields related to the master and worker instance types required for the runtime. The various machine families and machine types can be reviewed in Google's Supported Machine Types article.

The fields are the same as those found in the GCP Dataproc Cluster runtime type, except that the "max uptime" fields are excluded here because this runtime type is intended for streaming jobs that run until manually stopped.
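When choosing master and worker instance types, it can help to read the size straight off the machine type name. The sketch below is illustrative only and is not part of Syntasa: it assumes the common predefined Compute Engine naming pattern of series, profile, and vCPU count (for example n1-standard-4), and the names MachineType and parse_machine_type are hypothetical helpers. Names that do not follow that pattern, such as shared-core or custom types, are returned as None rather than guessed at.

```python
import re
from typing import NamedTuple, Optional


class MachineType(NamedTuple):
    """Parts of a predefined machine type name (hypothetical helper type)."""
    series: str   # e.g. "n1", "n2", "n2d"
    profile: str  # "standard", "highmem", or "highcpu"
    vcpus: int    # number of virtual CPUs encoded in the name


# Assumption: only the <series>-<profile>-<vCPUs> pattern is handled here.
_PATTERN = re.compile(r"^([a-z]\d+[a-z]?)-(standard|highmem|highcpu)-(\d+)$")


def parse_machine_type(name: str) -> Optional[MachineType]:
    """Split a machine type name into its parts, or return None if it
    does not match the simple predefined pattern assumed above."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    return MachineType(match.group(1), match.group(2), int(match.group(3)))


if __name__ == "__main__":
    for candidate in ("n1-standard-4", "n2-highmem-8", "e2-medium"):
        print(candidate, "->", parse_machine_type(candidate))
```

A helper like this is only a convenience for comparing master and worker sizes; Google's machine types documentation remains the reference for what each type actually provides.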

Configuration options

Spark configurations are also available. Key settings related to the number of cores and memory have default values, which can be adjusted as needed. Other values available for configuration are detailed in the Apache Spark documentation on Spark Configuration and Running Spark on YARN.

The default configurations include those set as defaults in the GCP Dataproc Cluster runtime type, along with many additional settings that support the streaming use case.
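To give a sense of what such settings look like, here is a minimal sketch of core, memory, and streaming-oriented Spark properties expressed as key/value pairs. The property names are standard Apache Spark and Spark-on-YARN settings, but the values are illustrative assumptions, not the defaults shipped with this runtime type, and the names STREAMING_SPARK_CONF and to_submit_args are hypothetical.

```python
from typing import Dict, List

# Illustrative values only; these are NOT the Syntasa defaults.
STREAMING_SPARK_CONF: Dict[str, str] = {
    # Cores and memory per executor, and memory for the driver.
    "spark.executor.cores": "4",
    "spark.executor.memory": "8g",
    "spark.driver.memory": "4g",
    # Allow a long-running YARN application to be re-attempted after failure.
    "spark.yarn.maxAppAttempts": "4",
    # Only count application master failures that occur within this window.
    "spark.yarn.am.attemptFailuresValidityInterval": "1h",
    # Let a streaming job finish in-flight work when it is stopped.
    "spark.streaming.stopGracefullyOnShutdown": "true",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a property map as repeated '--conf key=value' arguments,
    in the form accepted by spark-submit, sorted by key for stable output."""
    args: List[str] = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(STREAMING_SPARK_CONF)))
```

In practice these values would be entered in the runtime template's configuration fields rather than passed on a command line; the helper only shows how each key maps to a single Spark property.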

