This tutorial introduces Spark SQL programming in a JupyterLab environment using Python. It demonstrates how to:
- Query Hive Tables: Use Spark SQL queries to retrieve data from tables stored in Hive.
- Filter and Aggregate Data: Filter results based on specific conditions and perform aggregations like counting occurrences within a partition range.
- Work with Spark DataFrames: Store the queried data in Spark DataFrames, which offer powerful functionality for manipulating and analyzing data (a short filtering sketch appears after the show() example below).
- Convert to Pandas DataFrames (Optional): Convert Spark DataFrames to Pandas DataFrames to take advantage of Pandas' extensive data analysis capabilities (see the sketch after the grouping example).
import pandas as pd  # used later for the optional conversion to a Pandas DataFrame
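In many managed JupyterLab setups (including Syntasa notebooks) a spark session with Hive support is often already available, so no extra setup is needed. If it is not, the snippet below is a minimal sketch of how one could be created manually; the application name is purely illustrative.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session with Hive support so spark.sql()
# can read tables registered in the Hive metastore.
spark = SparkSession.builder \
    .appName("spark-sql-tutorial") \
    .enableHiveSupport() \
    .getOrCreate()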
The Spark SQL code below queries a Hive table, counts records within a date partition range, and stores the results (partition name and record count) in a Spark DataFrame for further analysis. Replace DATASET_NAME.TABLE_NAME with your own database and table names.
phrase_df = spark.sql("""
    SELECT event_partition, COUNT(*) AS records
    FROM DATASET_NAME.TABLE_NAME
    WHERE event_partition BETWEEN '2024-01-01' AND '2024-01-30'
    GROUP BY event_partition
    ORDER BY 2 DESC
""")
The code below displays the contents of the Spark DataFrame phrase_df with the following options:
- Number of Rows: It shows a maximum of 10 rows.
- Truncation: It displays column values in full, without truncation.
phrase_df.show(10, truncate=False)
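As noted in the introduction, the queried data lives in a Spark DataFrame, so it can be manipulated further before any conversion. The snippet below is a minimal sketch of filtering the results with the DataFrame API; the variable name busy_partitions_df and the threshold value are illustrative only.

from pyspark.sql import functions as F

# Keep only partitions with more than 1,000 records (illustrative threshold)
busy_partitions_df = phrase_df.filter(F.col("records") > 1000)
busy_partitions_df.show(10, truncate=False)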
The code snippet below demonstrates how to group data by a specific column using the Spark DataFrame API after retrieving data from Hive with Spark SQL:
group_df = phrase_df.groupBy("records").count()
group_df.show()
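To use Pandas' analysis capabilities, as mentioned in the introduction, a Spark DataFrame can be converted with toPandas(). The sketch below assumes the result set is small enough to fit in driver memory; the variable name phrase_pdf is illustrative.

# Collect the aggregated results to the driver as a Pandas DataFrame
phrase_pdf = phrase_df.toPandas()
print(phrase_pdf.head())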
Cache the results in memory so that repeated actions on phrase_df reuse the computed data instead of re-running the query:
phrase_df.cache()
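Note that cache() is lazy: the data is only materialized the next time an action runs. The sketch below shows one way to force materialization and later release the memory.

# Trigger an action so the cached data is actually materialized
phrase_df.count()

# Release the cached data when it is no longer needed
phrase_df.unpersist()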
The best way to learn these techniques is through hands-on experience. Follow the steps below to create the sample notebook in your Syntasa environment:
- Download the sample notebook .ipynb file from this article.
- Create a new notebook in your Syntasa environment using the import notebook option.