Spark applications often involve working with data in DataFrames. Spark Utils provides helper functions to simplify these tasks, focusing specifically on DataFrame creation and table creation. This article explores two such functions: createDataFrame and createTable.
1. createDataFrame(inputDataSet)
This function creates a Spark DataFrame from a provided dataset path. It automatically applies date filters based on pre-defined process dates if the input dataset is partitioned by date.
Parameters:
- @inputDataSet1 (String): Path to the input dataset.
Python:
/**
 * @param inputDataSet1 is a user parameter
 * @throws com.syntasa.lib.exception.SyntasaException
 * @return DataFrame
 */
createDataFrame(inputDataSet1: String)
Example:
df = createDataFrame("@inputDataSet1")
df.show()
This code snippet creates a DataFrame df from the dataset located at the path @inputDataSet1. If the dataset is partitioned by date and process dates are configured, createDataFrame automatically filters the data to those dates.
Scala:
/**
 * @param inputDataSet is a user parameter
 * @throws com.syntasa.lib.exception.SyntasaException
 * @return DataFrame
 */
createDataFrame(inputDataSet: String)
Example:
val df = createDataFrame("@inputDataSet1")
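For intuition, the automatic date filtering is roughly equivalent to reading the dataset yourself and filtering on its date partition column. The following is a minimal plain-PySpark sketch of that manual alternative, not the helper's actual implementation; the Parquet format, the event_date partition column, and the literal date range are all illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Manual equivalent (sketch only): read the dataset, then restrict it to the
# process-date window. "event_date" and the date range are assumptions.
df = (
    spark.read.parquet("/path/to/input_dataset")
         .where("event_date BETWEEN '2024-01-01' AND '2024-01-07'")
)
df.show()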
2. createTable(dataFrame, outputDataSet, partitionedDateColumn=None)
This function creates a table from a DataFrame. It offers options to create either a non-partitioned or a partitioned table.
Parameters:
- dataFrame (DataFrame): The DataFrame to create the table from.
- @outputDataSet1 (String): Path to store the output table.
- partitionedDateColumn (String, optional): Name of the column to partition the table by (default: None, which produces a non-partitioned table).
Examples:
Non-partitioned Table:
Python:
/**
 * @param dataFrame
 * @param outputDataSet1 is a user parameter
 */
createTable(dataFrame: DataFrame, outputDataSet1: String)
Example:
createTable(df, "@outputDataSet1")
This code creates a non-partitioned table from the DataFrame df at the specified path @outputDataSet1.
Scala:
/**
 * @param dataFrame
 * @param outputDataSet is a user parameter
 */
createTable(dataFrame: DataFrame, outputDataSet: String)
Example:
createTable(df, "@outputDataSet1")
Partitioned Table:
Python:
/**
 * @param dataFrame
 * @param outputDataSet1 is a user parameter
 * @param partitionedDateColumn
 */
createTable(dataFrame: DataFrame, outputDataSet1: String, partitionedDateColumn: String)
Example:
createTable(df, "@outputDataSet1", "event_partition")
This code creates a partitioned table from df at @outputDataSet1, using the column named "event_partition" to partition the data.
Scala:
/**
 * @param dataFrame
 * @param outputDataSet is a user parameter
 * @param partitionedDateColumn
 */
createTable(dataFrame: DataFrame, outputDataSet: String, partitionedDateColumn: String)
Example:
createTable(df, "@outputDataSet1", "event_partition")
In Conclusion
These Spark Utils functions provide convenient ways to create DataFrames from datasets and write them back out as tables. The createDataFrame function streamlines data loading with automatic date filtering, while createTable offers the flexibility to create both non-partitioned and partitioned tables to match your data structure. By using these utilities, you can simplify your Spark development workflow and focus on data analysis.
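To see how the two helpers fit together, here is a minimal end-to-end Python sketch; the status filter and the event_timestamp/event_partition columns are illustrative assumptions, and @inputDataSet1/@outputDataSet1 stand for configured dataset parameters:

from pyspark.sql import functions as F

# Load the input dataset (date filtering applied automatically if configured).
df = createDataFrame("@inputDataSet1")

# Hypothetical transformation and partition-column derivation.
df = df.where(F.col("status") == "complete")
df = df.withColumn("event_partition", F.date_format("event_timestamp", "yyyy-MM-dd"))

# Write the result as a partitioned table.
createTable(df, "@outputDataSet1", "event_partition")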