Available in October 2022, Syntasa 7.0 delivers a major under-the-hood update with the upgrade to Spark 3, notable improvements for notebooks, additional options and updates for runtimes, and bells and whistles that improve the user experience.
Under the hood
As the Syntasa platform utilizes many ever-changing cloud functions and technologies, we keep the platform updated to take advantage of the latest and greatest versions and the newest features available.
Spark 3 - Before Syntasa 7.0, the platform utilized versions of Spark 2. With this version, Syntasa is able to take advantage of all the performance improvements that Spark 3 offers. AWS-installed Syntasa environments use EMR 6.6.0+, which internally uses Spark 3.2; GCP-installed environments use Dataproc 2.0+, which internally uses Spark 3.1.3.
Spark 3 introduces Adaptive Query Execution (AQE), which re-optimizes query plans at runtime using statistics gathered during execution. This helps in particular with choosing join strategies, handling skewed data, and sizing shuffle partitions. Spark 3 also brings better support for Kubernetes auto-scaling and complete SQL support for the Delta Lake format.
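As a rough illustration of the Adaptive Query Execution settings involved, the sketch below collects the core Spark 3 AQE properties and renders them as `spark-submit` style `--conf` flags. The property names are Spark's own; the helper names `aqe_conf` and `to_conf_flags` are hypothetical, and how a Syntasa runtime actually accepts Spark properties is not covered here. AQE is off by default in Spark 3.1 and on by default from Spark 3.2, so setting it explicitly keeps behavior consistent between the GCP and AWS environments described above.

```python
from typing import Dict, List


def aqe_conf(enabled: bool = True) -> Dict[str, str]:
    """Return the core Adaptive Query Execution properties as strings."""
    flag = "true" if enabled else "false"
    return {
        # Master switch for AQE.
        "spark.sql.adaptive.enabled": flag,
        # Merge small shuffle partitions once a stage has completed.
        "spark.sql.adaptive.coalescePartitions.enabled": flag,
        # Split oversized partitions in sort-merge joins on skewed keys.
        "spark.sql.adaptive.skewJoin.enabled": flag,
    }


def to_conf_flags(conf: Dict[str, str]) -> List[str]:
    """Render properties as spark-submit style arguments (hypothetical helper)."""
    flags: List[str] = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    print(" ".join(to_conf_flags(aqe_conf())))
```

The same key/value pairs could equally be passed to a Spark session builder; the flag form is used here only because it needs no Spark installation to run.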
Runtime improvements - Several improvements and additional options for runtimes expand the control and features available to you when creating and starting clusters:
- GPU runtimes - GPU-enabled runtimes added to reduce cost and execution times for jobs
- New Spark image - A Spark image can now be enabled for Kubernetes container runtimes in AWS
- Updated images for Java 11 - New runtime images are available for Java 11; all Container and Spark Kubernetes runtimes now support Java 11
- Dataproc version updated - Runtimes have been updated to support the latest version of Dataproc
- Postgres as Hive Metastore - Postgres can now be used as the Hive Metastore for EMR clusters, in addition to RDS-MySQL, which is supported by default
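For context on the Hive Metastore item, an external metastore on EMR is typically declared through a `hive-site` configuration classification. The sketch below builds such a classification for a Postgres database. The `javax.jdo.option.*` property names and the `org.postgresql.Driver` class are standard Hive and PostgreSQL JDBC names; the function name, host, database, and credentials are placeholders, and this is not necessarily how Syntasa configures the cluster internally.

```python
import json
from typing import Any, Dict, List


def postgres_metastore_config(host: str, database: str, user: str,
                              password: str, port: int = 5432) -> List[Dict[str, Any]]:
    """Build an EMR-style 'hive-site' classification (hypothetical helper)."""
    return [{
        "Classification": "hive-site",
        "Properties": {
            # JDBC URL of the Postgres database that holds the metastore.
            "javax.jdo.option.ConnectionURL":
                f"jdbc:postgresql://{host}:{port}/{database}",
            # PostgreSQL JDBC driver class.
            "javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
            "javax.jdo.option.ConnectionUserName": user,
            "javax.jdo.option.ConnectionPassword": password,
        },
    }]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    config = postgres_metastore_config(
        "metastore.example.internal", "hive", "hive_user", "change-me")
    print(json.dumps(config, indent=2))
```

In practice the password would come from a secrets store rather than being passed in plain text.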
Notebooks - Since integrated notebooks were introduced in Syntasa 6.2, we have continued to enhance and improve the feature. Syntasa 7.0 includes the following changes:
- Base notebook functionality expanded - The functionality available when launching a notebook has been expanded so there is less of a need to attach a runtime to the notebook.
- SparkMonitor - SparkMonitor integration provides several features to monitor and debug a Spark job from within the notebook interface
- Custom script initialization - You can now write a custom initialization script for the notebook kernel
- Notebook collaboration awareness - A warning is shown if the notebook you have open has been updated in another tab or by another user
Bells and whistles
- Job cost estimates - The cost estimate for each job step and the total for the job have been added in various places, e.g. activity details of a job, grids in the job tracker, execution, and task screens
- Troubleshooting improvement - We've added the ability to download the full logs of an individual step of a job
- User experience improvements - The left-hand navigation bars no longer expand automatically; expanding and collapsing them is now a manual action, and tooltips have been added for the icons shown when the bars are collapsed