{ "cells": [ { "cell_type": "markdown", "id": "ac2db00b", "metadata": {}, "source": [ "### Reference Documents:\n", "SparkMagic
\n", "Spark Doc
" ] }, { "cell_type": "markdown", "id": "23eb97e1", "metadata": {}, "source": [ "# Downloading Files into local Notebook Environment from S3 (For Amazon AWS Environments)\n", "### In this notebook we'll be showing you examples on how to download files from S3 to your local notebook environment\n", "### Downloading files from other sources will be done in similar fashion." ] }, { "cell_type": "markdown", "id": "f3f6526a", "metadata": {}, "source": [ "#### How to download files from s3 using the aws cli -- Installing aws cli pre-requisites" ] }, { "cell_type": "code", "execution_count": 68, "id": "88363583", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: The directory '/home/jovyan/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.\u001b[0m\u001b[33m\n", "\u001b[0mRequirement already satisfied: awscli in /opt/conda/lib/python3.7/site-packages (1.25.71)\n", "Requirement already satisfied: PyYAML<5.5,>=3.10 in /opt/conda/lib/python3.7/site-packages (from awscli) (5.3.1)\n", "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /opt/conda/lib/python3.7/site-packages (from awscli) (0.6.0)\n", "Requirement already satisfied: colorama<0.4.5,>=0.2.5 in /opt/conda/lib/python3.7/site-packages (from awscli) (0.4.4)\n", "Requirement already satisfied: botocore==1.27.70 in /opt/conda/lib/python3.7/site-packages (from awscli) (1.27.70)\n", "Requirement already satisfied: docutils<0.17,>=0.10 in /opt/conda/lib/python3.7/site-packages (from awscli) (0.16)\n", "Requirement already satisfied: rsa<4.8,>=3.1.2 in /opt/conda/lib/python3.7/site-packages (from awscli) (4.7.2)\n", "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.7/site-packages (from botocore==1.27.70->awscli) (1.25.9)\n", "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from botocore==1.27.70->awscli) (1.0.1)\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.7/site-packages (from botocore==1.27.70->awscli) (2.8.1)\n", "Requirement already satisfied: pyasn1>=0.1.3 in /opt/conda/lib/python3.7/site-packages (from rsa<4.8,>=3.1.2->awscli) (0.4.8)\n", "Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore==1.27.70->awscli) (1.15.0)\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m" ] } ], "source": [ "!pip3 install awscli --upgrade" ] }, { "cell_type": "markdown", "id": "fd7ae5c5", "metadata": {}, "source": [ "#### How to download files from s3 using the aws cli" ] }, { "cell_type": "code", "execution_count": 69, "id": "e5f3aff3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 42M\r\n", "drwxr-xr-x 2 root root 4.0K Sep 10 02:53 .\r\n", "drwxrwxrwt 1 root root 4.0K Sep 10 02:53 ..\r\n", "-rw-r--r-- 1 root root 42M Jul 12 16:51 20220701.export.CSV\r\n" ] } ], "source": [ "import subprocess\n", "import shutil\n", "\n", "bucket_name = 'syntasa-gov-sandbox-01'\n", "object_prefix = 'other/sample-data/csv_test'\n", "object_key = '20220701.export.CSV'\n", "obj_in_s3_full_path = f's3://{bucket_name}/{object_prefix}/{object_key}'\n", "local_des_path = '/tmp/my_files'\n", "\n", "# First lets delete all files that exist\n", "shutil.rmtree(local_des_path, ignore_errors=True, onerror=None)\n", "\n", "# Lets create a temporary folder locally to hold our files\n", "os.makedirs(local_des_path, exist_ok=True)\n", "\n", "# Now download the file from s3\n", "command = subprocess.check_output(f'aws s3 cp {obj_in_s3_full_path} {local_des_path}/', shell=True)\n", "\n", "# Lets validate the file was downloaded by printing the contents of the folder\n", "!ls -lah {local_des_path}" ] }, { "cell_type": "markdown", "id": "73e4869e", "metadata": {}, "source": [ "#### How to download a file using boto3 -- Installing pre-requisites (upgrading pip and downloading boto3)" ] }, { "cell_type": "code", "execution_count": 70, "id": "143389fe", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: The directory '/home/jovyan/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.\u001b[0m\u001b[33m\n", "\u001b[0mRequirement already satisfied: pip in /opt/conda/lib/python3.7/site-packages (22.2.2)\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: The directory '/home/jovyan/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.\u001b[0m\u001b[33m\n", "\u001b[0mRequirement already satisfied: boto3 in /opt/conda/lib/python3.7/site-packages (1.24.24)\n", "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /opt/conda/lib/python3.7/site-packages (from boto3) (0.6.0)\n", "Requirement already satisfied: botocore<1.28.0,>=1.27.24 in /opt/conda/lib/python3.7/site-packages (from boto3) (1.27.70)\n", "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3) (1.0.1)\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.7/site-packages (from botocore<1.28.0,>=1.27.24->boto3) (2.8.1)\n", "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.7/site-packages (from botocore<1.28.0,>=1.27.24->boto3) (1.25.9)\n", "Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.28.0,>=1.27.24->boto3) (1.15.0)\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: The directory '/home/jovyan/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.\u001b[0m\u001b[33m\n", "\u001b[0mRequirement already satisfied: cloudpathlib in /opt/conda/lib/python3.7/site-packages (0.10.0)\n", "Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from cloudpathlib) (1.6.1)\n", "Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->cloudpathlib) (3.1.0)\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m" ] } ], "source": [ "#First we'll install all the dependencies we'll use after this cell\n", "!pip3 install --upgrade pip #lets upgrade pip\n", "!pip3 install boto3\n", "!pip3 install cloudpathlib" ] }, { "cell_type": "markdown", "id": "faadc26d", "metadata": {}, "source": [ "#### How to download a single file to your local notebook environment using Boto3" ] }, { "cell_type": "code", "execution_count": 71, "id": "03182a57", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 42M\r\n", "drwxr-xr-x 2 root root 4.0K Sep 10 02:53 .\r\n", "drwxrwxrwt 1 root root 4.0K Sep 10 02:53 ..\r\n", "-rw-r--r-- 1 root root 42M Sep 10 02:53 20220701.export.CSV\r\n" ] } ], "source": [ "# Define your imports and you object location (bucket, object, destination path)\n", "import os\n", "import boto3\n", "import shutil\n", "\n", "bucket_name = 'syntasa-gov-sandbox-01'\n", "object_prefix = 'other/sample-data/csv_test'\n", "object_key = '20220701.export.CSV'\n", "local_des_path = '/tmp/my_files'\n", "\n", "# First lets delete all files that exist\n", "shutil.rmtree(local_des_path, ignore_errors=True, onerror=None)\n", "\n", "# Lets create a temporary folder locally to hold our files\n", "os.makedirs(local_des_path, exist_ok=True)\n", "\n", "# Create a boto3 client and download the specified object\n", "s3_client = boto3.client('s3')\n", "s3_client.download_file(bucket_name, f'{object_prefix}/{object_key}', f'{local_des_path}/{object_key}')\n", "\n", "# Now lets validate that the file exists in the local path we specified above\n", "!ls -lah /tmp/my_files/" ] }, { "cell_type": "markdown", "id": "abddf0dc", "metadata": {}, "source": [ "#### How to download an entire folder to your local notebook environment using CloudPathLib\n", "#### Cloudpathlib is a library that has built in functions on top of boto3 to help download/upload files easily" ] }, { "cell_type": "code", "execution_count": 72, "id": "e54ffa38", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 438M\r\n", "drwxr-xr-x 3 root root 4.0K Sep 10 02:53 .\r\n", "drwxr-xr-x 3 root root 4.0K Sep 10 02:53 ..\r\n", "drwxr-xr-x 3 root root 4.0K Sep 10 02:53 2022\r\n", "-rw-r--r-- 1 root root 42M Sep 10 02:53 20220701.export.CSV\r\n", "-rw-r--r-- 1 root root 6.9M Sep 10 02:53 20220701.export.CSV.zip\r\n", "-rw-r--r-- 1 root root 25M Sep 10 02:53 20220702.export.CSV\r\n", "-rw-r--r-- 1 root root 4.0M Sep 10 02:53 20220702.export.CSV.zip\r\n", "-rw-r--r-- 1 root root 23M Sep 10 02:53 20220703.export.CSV\r\n", "-rw-r--r-- 1 root root 3.5M Sep 10 02:53 20220703.export.CSV.zip\r\n", "-rw-r--r-- 1 root root 32M Sep 10 02:53 20220704.export.CSV\r\n", "-rw-r--r-- 1 root root 5.1M Sep 10 02:53 20220704.export.CSV.zip\r\n", "-rw-r--r-- 1 root root 41M Sep 10 02:53 20220705.export.CSV\r\n", "-rw-r--r-- 1 root root 6.7M Sep 10 02:53 20220705.export.CSV.zip\r\n", "-rw-r--r-- 1 root root 43M Sep 10 02:53 20220706.export.CSV\r\n", "-rw-r--r-- 1 root root 7.2M Sep 10 02:53 20220706.export.CSV.zip\r\n", "-rw-r--r-- 1 root root 46M Sep 10 02:53 20220707.export.CSV\r\n", "-rw-r--r-- 1 root root 7.6M Sep 10 02:53 20220707.export.CSV.zip\r\n", "-rw-r--r-- 1 root root 44M Sep 10 02:53 20220708.export.CSV\r\n", "-rw-r--r-- 1 root root 7.0M Sep 10 02:53 20220708.export.CSV.zip\r\n", "-rw-r--r-- 1 root root 26M Sep 10 02:53 20220709.export.CSV\r\n", "-rw-r--r-- 1 root root 4.1M Sep 10 02:53 20220709.export.CSV.zip\r\n", "-rw-r--r-- 1 root root 24M Sep 10 02:53 20220710.export.CSV\r\n", "-rw-r--r-- 1 root root 3.5M Sep 10 02:53 20220710.export.CSV.zip\r\n", "-rw-r--r-- 1 root root 37M Sep 10 02:53 20220711.export.CSV\r\n", "-rw-r--r-- 1 root root 5.9M Sep 10 02:53 20220711.export.CSV.zip\r\n", "-rw-r--r-- 1 root root 61 Sep 10 02:53 my_sample_file.csv\r\n" ] } ], "source": [ "import os\n", "import pathlib\n", "import shutil\n", "from cloudpathlib import CloudPath\n", "\n", "bucket_name = 'syntasa-gov-sandbox-01'\n", "object_prefix = 'other/sample-data/csv_test'\n", "local_des_path = '/tmp/my_files/my_downloaded_folder/'\n", "\n", "# First lets delete all files that exist\n", "shutil.rmtree(local_des_path, ignore_errors=True, onerror=None)\n", "\n", "# Lets create a local folder where we will download all the files from our s3 bucket\n", "os.makedirs(local_des_path, exist_ok=True)\n", "\n", "# Lets get a list of all the objects in the s3 folder we're trying to download (please note that if you have more than 5000 files or objects, you will need to use multi-threading or paginators)\n", "cloud_path = CloudPath(f's3://{bucket_name}/{object_prefix}/')\n", "cloud_path.download_to(local_des_path)\n", "\n", "# Now lets validate that the folder and the files were downloaded\n", "!ls -lah {local_des_path}" ] }, { "cell_type": "markdown", "id": "e4c8e2f5", "metadata": {}, "source": [ "# Uploading Files to an S3 bucket (For Amazon AWS Environments)\n", "### In this notebook we'll be showing you examples on how to upload files from your local notebook environment to an s3 bucket\n", "### Uploading files from other sources will be done in similar fashion." ] }, { "cell_type": "markdown", "id": "bb39255b", "metadata": {}, "source": [ "#### Lets create a dummy CSV File so that we can upload it to Amazon S3" ] }, { "cell_type": "code", "execution_count": 73, "id": "7cb91036", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Files in path /tmp/my_files\n", "total 12K\n", "drwxr-xr-x 2 root root 4.0K Sep 10 02:53 .\n", "drwxrwxrwt 1 root root 4.0K Sep 10 02:53 ..\n", "-rw-r--r-- 1 root root 61 Sep 10 02:53 my_sample_file.csv\n", "\n", "Contents of File :: my_sample_file.csv\n", "name,area,country_code2,country_code3\n", "Mexico,758400,MX,MEX\n" ] } ], "source": [ "import csv\n", "\n", "local_folder_path = '/tmp/my_files'\n", "csv_file_name = 'my_sample_file.csv'\n", "\n", "# First lets delete all files that exist\n", "shutil.rmtree(local_folder_path, ignore_errors=True, onerror=None)\n", "\n", "# Lets create a local folder where we will download all the files from our s3 bucket\n", "os.makedirs(local_folder_path, exist_ok=True)\n", "\n", "\n", "header = ['name', 'area', 'country_code2', 'country_code3']\n", "data = ['Mexico', 758400, 'MX', 'MEX']\n", "\n", "with open(f'{local_folder_path}/{csv_file_name}', 'w', encoding='UTF8') as f:\n", " writer = csv.writer(f)\n", " writer.writerow(header)\n", " writer.writerow(data)\n", "\n", "# Verify file exists and that we can see the contents of the file\n", "print(f'Files in path {local_folder_path}')\n", "!ls -lah {local_folder_path}\n", "print(f'\\nContents of File :: {csv_file_name}')\n", "!cat {local_folder_path}/{csv_file_name}" ] }, { "cell_type": "markdown", "id": "9a0ec1c5", "metadata": {}, "source": [ "#### Uploading a file to Amazon S3 using AWS CLI" ] }, { "cell_type": "code", "execution_count": 74, "id": "3487336c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-09-10 02:54:03 61 my_sample_file.csv\r\n" ] } ], "source": [ "import subprocess\n", "import shutil\n", "\n", "remote_bucket_name = 'syntasa-gov-sandbox-01'\n", "object_prefix = 'other/sample-data/csv_test'\n", "csv_file_name = 'my_sample_file.csv'\n", "obj_in_s3_full_path = f's3://{remote_bucket_name}/{object_prefix}/{csv_file_name}'\n", "local_folder_path = '/tmp/my_files'\n", "\n", "# Now download the file from s3\n", "command = subprocess.check_output(f'aws s3 cp {local_folder_path}/{csv_file_name} {obj_in_s3_full_path}', shell=True)\n", "\n", "# Lets validate the file was uploaded to s3\n", "!aws s3 ls s3://{remote_bucket_name}/{object_prefix}/ | grep {csv_file_name}" ] }, { "cell_type": "markdown", "id": "2c5447aa", "metadata": {}, "source": [ "#### Uploading a file to Amazon S3 using Boto3" ] }, { "cell_type": "code", "execution_count": 75, "id": "f69c07f4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-09-10 02:54:07 61 my_sample_file.csv\r\n" ] } ], "source": [ "import os\n", "import boto3\n", "\n", "remote_bucket_name = 'syntasa-gov-sandbox-01'\n", "object_prefix = 'other/sample-data/csv_test'\n", "csv_file_name = 'my_sample_file.csv'\n", "local_folder_path = '/tmp/my_files'\n", "\n", "# Create a boto3 client and download the specified object\n", "s3_client = boto3.client('s3')\n", "s3_client.upload_file(f'{local_folder_path}/{csv_file_name}', remote_bucket_name, f'{object_prefix}/{csv_file_name}')\n", "\n", "# Lets validate the file was uploaded to s3\n", "!aws s3 ls s3://{remote_bucket_name}/{object_prefix}/ | grep {csv_file_name}" ] }, { "cell_type": "markdown", "id": "001cbc27", "metadata": {}, "source": [ "#### Uploading a file to Amazon S3 using Cloudpathlib" ] }, { "cell_type": "code", "execution_count": 76, "id": "c76b39f8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-09-10 02:54:11 61 my_sample_file.csv\r\n" ] } ], "source": [ "import os\n", "import pathlib\n", "import shutil\n", "from cloudpathlib import CloudPath\n", "\n", "remote_bucket_name = 'syntasa-gov-sandbox-01'\n", "object_prefix = 'other/sample-data/csv_test'\n", "csv_file_name = 'my_sample_file.csv'\n", "local_folder_path = '/tmp/my_files'\n", "\n", "# Create a cloud path and upload from the local file\n", "cloud_path = CloudPath(f's3://{remote_bucket_name}/{object_prefix}/{csv_file_name}')\n", "cloud_path.upload_from(f'{local_folder_path}/{csv_file_name}', force_overwrite_to_cloud=True)\n", "\n", "# Lets validate the file was uploaded to s3\n", "!aws s3 ls s3://{remote_bucket_name}/{object_prefix}/ | grep {csv_file_name}" ] }, { "cell_type": "code", "execution_count": null, "id": "2e675be7", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Syntasa Kernel", "language": "python", "name": "syntasa_kernel" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "syn_metadata": { "spark_lang_type": "python" } }, "nbformat": 4, "nbformat_minor": 5 }