The Spark Processor is a Synthesizer, Composer, and Orchestrator process that builds and manages user-defined analytics datasets using pre-written code that has already been verified to work in a query editor window. Some of its high-level capabilities are:
- Ability to run custom Scala, SQL, R, and Python code
- Support for parameters within the code to pull in custom dates and database names (see the sketch after this list)
- Scheduled recurring runs of the code
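For example, a parameterized query might look like the sketch below. The `${...}` placeholder style and all table and parameter names here are illustrative assumptions, not confirmed Syntasa syntax; consult the process's parameter configuration for the exact token format.

```sql
-- Minimal parameterized sketch. ${target_db}, ${from_date}, and ${to_date}
-- are assumed placeholder tokens filled in by the process parameters;
-- the table name is hypothetical.
SELECT
  visit_date,
  COUNT(*) AS events
FROM ${target_db}.web_events
WHERE visit_date BETWEEN '${from_date}' AND '${to_date}'
GROUP BY visit_date;
```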
Example use cases
- Union multiple data sources and aggregate to a chosen level (see the sketch after this list)
- Refresh custom datasets that feed a business dashboard automatically on a daily schedule
- Manage user-validated SQL code to minimize engineer involvement
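As a sketch of the union-and-aggregate use case above, the following Spark SQL combines two hypothetical event sources and aggregates them to a daily level; all table and column names are illustrative.

```sql
-- Hypothetical sketch: union two event sources, then aggregate per day
-- and per channel. Table and column names are illustrative only.
SELECT
  event_date,
  channel,
  COUNT(*)     AS events,
  SUM(revenue) AS total_revenue
FROM (
  SELECT event_date, 'web'    AS channel, revenue FROM analytics_db.web_events
  UNION ALL
  SELECT event_date, 'mobile' AS channel, revenue FROM analytics_db.mobile_events
) AS unioned
GROUP BY event_date, channel;
```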
Please note: this process should only be used by advanced Spark SQL users. If there are any questions about how it works, please contact Syntasa support for assistance.
Process Configuration
The Spark Processor has very few parameters to set; the key is to verify that the code works in a query editor before attempting to deploy it.
- Drag the Spark Processor process type onto the canvas
- Drag and connect a dataset to the Spark Processor (this must be done before the process can be edited)
- Click the Spark Processor node
- Unlock the canvas
- Start by filling in the Parameters screen, either pasting in the Spark SQL code or uploading a file containing it
- Click Output
- Provide:
- Table Name - the name of the table as it will appear in storage; it must be distinct so it does not overwrite other datasets
- Display Name - name of the output table node on the canvas
- Event Store Name - not editable; refers to the Event Store selected when the app was created
- Database - not editable; relates to the selected Event Store
- Location - not editable; relates to the selected Event Store
- Load to BQ or Redshift - availability depends on the environment Syntasa is configured for; allows the table to also be built in Google BigQuery or Amazon Redshift
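As a sketch of the kind of code pasted on the Parameters screen, the query below writes into the same table declared on the Output tab. The database and table names are hypothetical; the important point is that the write target matches the configured Table Name.

```sql
-- Minimal sketch: the target table must match the Table Name on the
-- Output tab. analytics_db.daily_summary is an illustrative name.
INSERT OVERWRITE TABLE analytics_db.daily_summary
SELECT
  event_date,
  COUNT(*) AS events
FROM analytics_db.web_events
GROUP BY event_date;
```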
[Screenshot: Parameters tab]
[Screenshot: Output tab]
Warnings
- The database and table must exist before this process runs. Alternatively, syntax can be added to the beginning of the query to check whether the database and table exist and to define how the table should be created if it does not (see the sketch after this list).
- The table name in the query must match the Table Name on the Output tab. Otherwise the query itself will run as expected, but the job will fail, and the app will not be able to manage state or process automatically.
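The existence check mentioned in the first warning can be written in plain Spark SQL DDL at the top of the query. This is a minimal sketch under assumed names, schema, and storage path:

```sql
-- Hypothetical sketch: ensure the database and table exist before the
-- main query writes to them. Names, columns, and the LOCATION path are
-- illustrative assumptions.
CREATE DATABASE IF NOT EXISTS analytics_db;

CREATE TABLE IF NOT EXISTS analytics_db.daily_summary (
  event_date DATE,
  events     BIGINT
)
USING parquet
LOCATION 's3://example-bucket/analytics_db/daily_summary';
```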
Expected Output
The expected output of the Spark Processor is the result set of the Spark SQL query, stored in the location defined within the query and on the Output tab. Depending on the environment Syntasa is installed in, the data can be queried with the available query editor, such as BigQuery, Redshift, Athena, Hive, or Impala.
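For example, a quick sanity check against the output table from the sketches above (hypothetical names) might be:

```sql
-- Illustrative check that the processor's output landed as expected.
SELECT *
FROM analytics_db.daily_summary
ORDER BY event_date DESC
LIMIT 10;
```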