{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "aa9528c7-2546-44de-9bb3-c8a9beff708e",
   "metadata": {},
   "source": [
    "This tutorial introduces Spark SQL programming within a JupyterLab environment, using Python. It demonstrates how to:\n",
    "\n",
    "- **Query Hive tables:** use Spark SQL queries to retrieve data from tables stored in Hive.\n",
    "- **Filter and aggregate data:** filter results on specific conditions and perform aggregations, such as counting records per partition within a date range.\n",
    "- **Work with Spark DataFrames:** store the queried data in Spark DataFrames, which offer powerful functionality for manipulating and analyzing data.\n",
    "- **(Optional) Convert to pandas DataFrames:** convert Spark DataFrames to pandas DataFrames to use pandas' extensive data-analysis capabilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2e8dc9d-e390-4df1-ae72-a297badd253e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Different packages can be imported depending on your requirements - below we use pandas\n",
    "import pandas as pd"
   ]
  },
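  {
   "cell_type": "markdown",
   "id": "b4e2f1d0-5a6c-4b7e-9d8f-0a1b2c3d4e5f",
   "metadata": {},
   "source": [
    "Many managed JupyterLab/Spark environments preconfigure a `spark` session for you. If yours does not, a session with Hive support can be created as sketched below (the application name is an arbitrary placeholder):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5f3a2e1-6b7d-4c8f-a09e-1b2c3d4e5f60",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create (or reuse) a SparkSession with Hive support if `spark` is not predefined\n",
    "from pyspark.sql import SparkSession\n",
    "\n",
    "spark = SparkSession.builder.appName(\"spark-sql-tutorial\").enableHiveSupport().getOrCreate()"
   ]
  },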
  {
   "cell_type": "markdown",
   "id": "a0190fcb-4e1a-44cb-a24c-92a1442c20d5",
   "metadata": {},
   "source": [
    "This Spark SQL code queries a Hive table, counts occurrences within a date partition range, and stores the results (partition name and record count) in a Spark DataFrame for further analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2174095c-2805-4e52-875b-4734b6a1733e",
   "metadata": {},
   "outputs": [],
   "source": [
    "phrase_df = spark.sql(\"\"\"\n",
    "SELECT event_partition, COUNT(*) AS records\n",
    "FROM DATASET_NAME.TABLE_NAME\n",
    "WHERE event_partition BETWEEN '2024-01-01' AND '2024-01-30'\n",
    "GROUP BY event_partition\n",
    "ORDER BY records DESC\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11107555-8fe5-41f2-b580-51a25145d6c8",
   "metadata": {},
   "source": [
    "This displays the contents of the Spark DataFrame `phrase_df` with the following options:\n",
    "\n",
    "- **Number of rows:** show up to 100 rows.\n",
    "- **Truncation:** display column values without truncation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3cfc4373-263f-4e26-92b2-505ebb3ca67c",
   "metadata": {},
   "outputs": [],
   "source": [
    "phrase_df.show(100, truncate=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff87fccb-5a9d-4610-a8b1-f13ab4a11802",
   "metadata": {},
   "source": [
    "This code snippet demonstrates how to group data by a specific column using the Spark DataFrame API (rather than in the SQL query itself) after retrieving data from Hive with Spark SQL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0328d39f-2235-4a01-9c52-67f93624d9d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# How to group by using the DataFrame API instead of in the SQL query itself\n",
    "group_df = phrase_df.groupBy(\"records\").count()\n",
    "group_df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2665fa6a-f4a5-4a00-884a-d17d982e19d0",
   "metadata": {},
   "source": [
    "Cache the results in memory so that subsequent actions on `phrase_df` reuse them instead of re-running the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff9b995f-0b63-47b3-92f4-877e66930d63",
   "metadata": {},
   "outputs": [],
   "source": [
    "phrase_df.cache()"
   ]
  },
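  {
   "cell_type": "markdown",
   "id": "d6a4b3f2-7c8e-4d9a-b1af-2c3d4e5f6071",
   "metadata": {},
   "source": [
    "(Optional) Convert the Spark DataFrame to a pandas DataFrame. Note that `toPandas()` collects every row to the driver, so it is only appropriate for small results such as this aggregate (`pdf` is an arbitrary variable name):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7b5c4a3-8d9f-4eab-92b0-3d4e5f607182",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Collect the (small) aggregated result into a pandas DataFrame on the driver\n",
    "pdf = phrase_df.toPandas()\n",
    "pdf.head()"
   ]
  },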
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec40eeaf-2b25-492b-9466-0ebb481626dd",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.10",
   "language": "python",
   "name": "syntasa_kernel_python_310"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
