This article guides you through getting started with reading data from a CSV file using Spark.
The Python code snippet below uses Spark's CSV reader to load a file from an Amazon S3 bucket and display its first row. The file location is passed as a string literal (written as an f-string here, although no variables are interpolated). Replace "bucket", "Path_to", and "my_file" with the actual bucket name, path within the bucket, and filename of your CSV file on S3. The header=True argument indicates that the CSV file has a header row containing column names.
test_csv = spark.read.csv(f"s3://bucket/Path_to/my_file/TEST_DATA.csv", header=True)
test_csv.show(1)
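Before filtering on a column name, it can help to confirm which columns Spark detected. A minimal sketch using the test_csv DataFrame loaded above:
# Print the column names and types Spark detected. By default every column is read
# as a string; pass inferSchema=True to spark.read.csv if you need typed columns.
test_csv.printSchema()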
The line of code below filters the test_csv Spark DataFrame to keep only the rows where the value in the 'Family' column is equal to "Screw & Bolt Kits".
#Filter results:
test_csv = test_csv.filter(test_csv['Family'] == 'Screw & Bolt Kits')
The line of code below displays the contents of the test_csv DataFrame: the first argument (10) is the number of rows to show, and the second (False) tells Spark not to truncate long values.
# 10 is the number of rows to show; False prevents the results from being truncated.
test_csv.show(10, False)
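If you need more than one condition, filters can be combined with & and | (with each condition wrapped in parentheses). A minimal sketch, assuming a hypothetical second condition on the 'TEST_COL' column used later in this article:
from pyspark.sql import functions as F

# Hypothetical example: keep kit rows that also have a non-null TEST_COL value.
filtered_df = test_csv.filter(
    (F.col('Family') == 'Screw & Bolt Kits') & F.col('TEST_COL').isNotNull()
)
filtered_df.show(10, False)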
The code below registers the test_csv DataFrame as a temporary view named test, which allows it to be queried with Spark SQL. The results of the query are returned as a new DataFrame.
# This creates a temp view so the Spark DataFrame can be used in Spark SQL
test_csv.createOrReplaceTempView('test')
# Run Spark SQL on the CSV file that we just loaded in
csv_sql = spark.sql(""" select * from test """)
csv_sql.show()
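Any Spark SQL statement can now be run against the test view, not just select *. As an illustrative sketch, here is a simple aggregation query (assuming the 'Family' column from the filter step above):
# Example only: count the rows for each Family value using the same temp view.
family_counts = spark.sql("""
    SELECT Family, COUNT(*) AS row_count
    FROM test
    GROUP BY Family
""")
family_counts.show()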
The code below creates a new DataFrame named joined_df that combines data from test_csv and csv_sql. The join matches rows on the "TEST_COL" column and uses a left join so that all rows from test_csv are kept in the result.
# Doing a left join on the data
joined_df = test_csv.join(csv_sql, test_csv['TEST_COL'] == csv_sql['TEST_COL'], 'left')
Display the first row of the joined_df DataFrame:
joined_df.show(1)
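Note that because test_csv and csv_sql were loaded from the same file, every column name appears in both DataFrames, so the joined result contains duplicate columns. One common way to handle this is to alias each side and drop the redundant copy of the join key; a minimal sketch under that assumption:
# Alias each DataFrame so duplicated column names can be referenced unambiguously.
left_df = test_csv.alias('l')
right_df = csv_sql.alias('r')

# Left join on TEST_COL, then drop the right-hand copy of the join key.
joined_df = left_df.join(right_df, left_df['TEST_COL'] == right_df['TEST_COL'], 'left')
joined_df = joined_df.drop(right_df['TEST_COL'])
joined_df.show(1)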
The code below writes the test_csv DataFrame to the specified S3 location as a single CSV file. coalesce(1) reduces the DataFrame to a single partition so only one file is produced, and write.option("header", "true") includes the header row in the CSV output.
# Without coalesce, Spark writes multiple output files, one per partition.
test_csv = test_csv.coalesce(1)
output_path = "s3://bucket/Pathto/my_file/output"
test_csv.write.option("header","true").csv(output_path)
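After the write finishes, a quick way to confirm the result is to read the output back from the same path. A minimal sketch, assuming the output_path used above (note that rerunning the write against an existing folder will fail unless you add mode("overwrite") to the writer):
# Optional check: read the written CSV back in and look at a few rows.
check_df = spark.read.csv(output_path, header=True)
check_df.show(5)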
The best way to understand this workflow is through hands-on experience. Follow the steps below to create the sample notebook in your Syntasa environment:
- Download the sample notebook .ipynb file from this article.
- Create a new notebook in your Syntasa environment using the import notebook option.