Available in Syntasa environments installed in GCP, the GCP Dataproc Cluster runtime type uses the Google Cloud Dataproc service to run jobs, execute code in notebooks, and so on, and exposes the settings available in the cloud service.
The basic runtime attributes required for all runtime types are detailed in Creating Runtime Templates; the settings specific to this runtime type are described below.
Instance type and options
The GCP Dataproc Cluster runtime type enables several fields related to the master and worker instance types required for the runtime. The available machine families and machine types can be reviewed in Google's Machine Types article.
Here's a brief description of each field of the GCP Dataproc Cluster (an illustrative mapping of these fields to the Dataproc API follows the list):
- Cluster Base Name: A unique identifier for the GCP Dataproc Cluster, aiding in easy management and identification.
- Cluster Release Label: Specifies the Dataproc image version to use, ensuring compatibility and access to specific features. (2.1 is selected by default)
- Cluster Network Tags: Network tags are assigned to the cluster for network firewall rules and traffic control within GCP.
- Runtime Max Uptime: Sets the maximum duration the cluster can remain active before automatically shutting down. (12 by default)
- Runtime Max Uptime Unit: Defines the time unit (minutes, hours, days) for the 'Runtime Max Uptime' setting. ('hours' by default)
- Terminate on Completion (Toggle): When enabled, the cluster will automatically terminate once the job or notebook execution is complete.
- Use Private IP (Toggle): Determines whether the cluster nodes communicate using private IP addresses for enhanced security.
- Enable GPU (Toggle): Enables GPU resources within the cluster. Enabling this option reveals additional fields for GPU configuration.
  - GPU Count: Defines the number of GPUs to allocate when GPU is enabled.
- Zeppelin (Toggle): Enables the Zeppelin notebook for interactive data analytics and visualization.
- Jupyter Notebook (Toggle): Enables the Jupyter Notebook for interactive programming and data exploration.
- Master Instance Type: Defines the machine type for the master node of the cluster.
- Worker Instance Type: Defines the machine type for the worker nodes within the cluster.
- HDD in GB: Specifies the size of the hard disk drive allocated to each node, defaulting to 500GB.
- Worker Instance Count: Determines the number of worker nodes to be included in the cluster.
- Idle Time Deletion Interval in Secs: Sets the period of inactivity after which the idle cluster is automatically deleted.
- Cluster Mode: Defines the mode for cluster deployment:
  - High Availability: Ensures continuous availability with multiple master nodes.
  - Single: Single-node deployment; one node acts as both master and worker.
  - Standard: Basic deployment with one master node and separate worker nodes.
- Enable Autoscale: Automatically adjusts the number of worker nodes based on workload demand.
  - Min Node Count: Defines the minimum number of worker nodes when Autoscale is enabled.
  - Max Node Count: Defines the maximum number of worker nodes when Autoscale is enabled.
- Deploy Mode (client/cluster): Determines where the Spark driver runs:
  - Client Mode: The driver runs on a client machine outside the cluster.
  - Cluster Mode: The driver runs on a cluster node for optimized resource usage.
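For illustration only, the sketch below shows how several of these fields map onto the underlying Dataproc cluster definition when a cluster is created with the google-cloud-dataproc Python client. The project ID, region, cluster name, machine types, and network tag are placeholder values, and Syntasa applies its own naming and defaults; this is not the platform's actual provisioning code.

```python
# Minimal sketch: placeholder values, not Syntasa defaults.
from google.cloud import dataproc_v1

project_id = "my-gcp-project"   # placeholder
region = "us-central1"          # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "syn-runtime-example",                  # Cluster Base Name
    "config": {
        "gce_cluster_config": {
            "internal_ip_only": True,                        # Use Private IP
            "tags": ["syntasa-runtime"],                     # Cluster Network Tags
        },
        "master_config": {
            "num_instances": 1,                              # 3 for High Availability cluster mode
            "machine_type_uri": "n2-standard-4",             # Master Instance Type
            "disk_config": {"boot_disk_size_gb": 500},       # HDD in GB
        },
        "worker_config": {
            "num_instances": 2,                              # Worker Instance Count
            "machine_type_uri": "n2-standard-4",             # Worker Instance Type
            "disk_config": {"boot_disk_size_gb": 500},       # HDD in GB
            # GPU fields correspond to worker accelerators, e.g.:
            # "accelerators": [{"accelerator_type_uri": "nvidia-tesla-t4",
            #                   "accelerator_count": 1}],    # Enable GPU / GPU Count
        },
        "software_config": {
            "image_version": "2.1",                          # Cluster Release Label
            "optional_components": ["JUPYTER", "ZEPPELIN"],  # Jupyter / Zeppelin toggles
        },
        "lifecycle_config": {
            "idle_delete_ttl": {"seconds": 1800},            # Idle Time Deletion Interval in Secs
            "auto_delete_ttl": {"seconds": 12 * 3600},       # Runtime Max Uptime of 12 hours
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```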
Configuration options
Spark configurations are also available for this runtime. Key settings related to the number of cores and memory are defaulted but can be adjusted as needed. Other configurable values are detailed in the Apache Spark documentation on Spark Configuration.
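As a hedged illustration, the snippet below shows how a few standard Spark properties could be expressed and passed into a Dataproc cluster definition (such as the sketch above) through the "spark:" prefix of SoftwareConfig properties, which writes them to spark-defaults.conf. The values are placeholders, not Syntasa's defaults.

```python
# Standard Spark properties (see the Apache Spark configuration documentation).
# Values are placeholders for illustration, not recommended defaults.
spark_properties = {
    "spark.executor.cores": "4",
    "spark.executor.memory": "8g",
    "spark.driver.cores": "2",
    "spark.driver.memory": "4g",
}

# In a Dataproc cluster definition, Spark settings are supplied via
# SoftwareConfig.properties using the "spark:" prefix, which places them
# in spark-defaults.conf on the cluster.
software_config = {
    "image_version": "2.1",
    "properties": {f"spark:{key}": value for key, value in spark_properties.items()},
}
```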