As noted in the Syntasa Quick Start Guide, installing the Syntasa platform so that it can process data in your cloud takes three steps: build the foundation within your virtual private cloud (VPC), install the Syntasa software on that foundation, and finally connect the software to the foundation.
This article details the first of those steps, Preparing the cloud environment, specifically for AWS:
- Installation and configuration options
- Policies and user accounts
- Connectivity settings
- Compute and storage setup
- Installing the Syntasa software
Installation and configuration options
Running the Syntasa platform within your AWS VPC involves several connectivity options and alternatives for your output database. Review these options and decide on them before starting the configuration steps.
The sections immediately below review the choices for each category; the steps required for each option are described later in this article.
The AWS infrastructure requires two subnets, one for the Syntasa application node running on EC2 and another for the Amazon EMR cluster nodes. The Syntasa application node will reside in a public subnet (with access to an IGW), but the Amazon EMR cluster nodes can reside in either a public subnet or a private subnet. Depending on the security policies implemented by your IT department, please choose the correct subnet implementation below.
Syntasa supports a variety of output database locations as alternatives to Amazon Athena. If you choose Amazon Redshift instead of Athena, the resulting configuration, shown here with the private subnet option from above, will be as follows:
Policies and user accounts
For the Syntasa application to function properly in an AWS environment, the following roles, policies, and users must be created. The full list and details of the required selections will be provided by the Syntasa services team.
- IAM Policy for S3 - A policy, suggested name SYNTASA-S3-Access-Policy, needs to be added to access S3 resources.
- IAM Role for S3 - A new role, suggested name SYNTASA-EC2-ACCESS-ROLE, needs to be created. This role will be attached to the EC2 and EMR instances so that they can reach S3 buckets and other resources. The policy from the previous step needs to be attached to this role. Also, if the Syntasa application is to be configured using Instance Profile, then a number of additional AWS policies need to be added. Details will be provided by the Syntasa services team.
- IAM Roles for EMR - To successfully create a cluster, three roles need to be provided: EMR_AutoScaling_DefaultRole, EMR_DefaultRole, and EMR_EC2_DefaultRole. Details for each, including the policies that need to be assigned to each role, will be provided by the Syntasa services team.
- IAM User for Athena - An IAM user such as SYNTASA-User-Account needs to be created with a number of policies, details provided by the Syntasa services team, for querying Athena as well as viewing S3 resources.
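As an illustration of what the S3 policy in the first item might look like, the sketch below builds an IAM policy document in Python. The bucket name syntasa-bucket comes from later in this article; the specific actions (list, get, put, delete) and the function name build_s3_access_policy are assumptions for illustration only. The authoritative list of permissions is the one provided by the Syntasa services team.

```python
import json


def build_s3_access_policy(bucket: str) -> dict:
    """Return an IAM policy document scoped to a single S3 bucket.

    The actions listed here are an assumed, typical read/write set,
    not the official Syntasa requirement.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level permission: listing applies to the bucket ARN.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object-level permissions apply to keys under the bucket.
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Print the JSON that could be pasted into a policy such as
    # SYNTASA-S3-Access-Policy.
    print(json.dumps(build_s3_access_policy("syntasa-bucket"), indent=2))
```

Splitting the bucket-level and object-level statements keeps each action paired with the resource ARN form it applies to.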
Connectivity settings
A number of settings related to connectivity need to be created and configured. Details of each setting and rule can be provided by the Syntasa services team. Also, as noted above, additional settings are needed if the environment is to be configured with a private subnet.
The following items are required regardless of the option chosen:
- Internet gateway - An internet gateway, SYNTASA-IGW, needs to be created in order for the Syntasa application to be accessed.
- VPC - The Syntasa application requires a custom VPC to be created, suggested name SYNTASA-VPC.
- Subnets - The Syntasa deployment is split across two subnets, one for the application node, SYNTASA-EC2-SUBNET, and another for the EMR cluster nodes, SYNTASA-EMR-SUBNET.
- Route tables - Each of the two subnets created in the previous step needs a custom route table: SYNTASA-EC2-RT and SYNTASA-EMR-RT.
- Elastic IPs - An elastic IP is required to access the Syntasa application node. Either allocate a new IP or, if you maintain a custom pool of elastic IPs, take one from that pool. The IP will be associated with the Syntasa application EC2 instance in a later step.
- Security groups - A number of security groups need to be created for access to the EC2 instance and EMR cluster instances, SYNTASA-APP-NODE-SG, SYNTASA-EMR-MASTER-SG, SYNTASA-EMR-SLAVE-SG, SYNTASA-EMR-SERVICE-SG.
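To make the VPC and subnet layout above concrete, the following sketch carves two non-overlapping subnets out of a VPC address range using Python's standard ipaddress module. The resource names come from this article; the 10.0.0.0/16 range, the /24 subnet size, and the function name plan_subnets are assumptions chosen only for illustration, so substitute the ranges approved for your environment.

```python
import ipaddress

# Assumed example range; the article does not prescribe a CIDR block.
VPC_CIDR = "10.0.0.0/16"


def plan_subnets(vpc_cidr: str, new_prefix: int = 24) -> dict:
    """Return CIDR blocks for the VPC and its two subnets.

    The first two blocks of size `new_prefix` inside the VPC range are
    assigned to the application-node subnet and the EMR subnet.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of sub-blocks
    ec2_subnet = next(blocks)
    emr_subnet = next(blocks)
    return {
        "SYNTASA-VPC": str(vpc),
        "SYNTASA-EC2-SUBNET": str(ec2_subnet),
        "SYNTASA-EMR-SUBNET": str(emr_subnet),
    }


if __name__ == "__main__":
    for name, cidr in plan_subnets(VPC_CIDR).items():
        print(f"{name}: {cidr}")
```

Deriving both subnets from the VPC range in code guarantees that they sit inside the VPC and do not overlap, which is easy to get wrong when typing CIDR blocks by hand.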
If choosing to configure the environment using a private subnet then one of the following needs to be set up:
- NAT gateway - Another elastic IP needs to be allocated to use for the NAT gateway. Using the AWS VPC NAT Gateway Service, the NAT gateway can be created utilizing the SYNTASA-EC2-SUBNET and the new elastic IP.
- NAT instance - For the NAT instance, another elastic IP needs to be allocated, an EC2 instance (t2.medium recommended) needs to be created, and a security group that allows access to all local subnets needs to be created. The instance should then be placed in the SYNTASA-EC2-SUBNET.
Once the above is complete the SYNTASA-EMR-RT route table needs to be adjusted to point to the NAT gateway or instance.
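The route table adjustment described above can be sketched as a small planning function. It returns the default route each table would carry, depending on whether the EMR subnet is public or private and whether a NAT gateway or NAT instance is used. The dictionary keys mirror the parameter names of the EC2 CreateRoute API (DestinationCidrBlock, GatewayId, NatGatewayId, InstanceId); the function name plan_default_routes and the example resource IDs are hypothetical, and the sketch only builds a plan without calling AWS.

```python
from typing import Optional

DEFAULT_ROUTE = "0.0.0.0/0"  # destination that matches all internet-bound traffic


def plan_default_routes(
    igw_id: str,
    emr_private: bool,
    nat_kind: Optional[str] = None,  # "gateway" or "instance" when emr_private
    nat_id: Optional[str] = None,
) -> dict:
    """Return the default route for each Syntasa route table."""
    # The application node always sits in a public subnet behind the IGW.
    routes = {
        "SYNTASA-EC2-RT": {"DestinationCidrBlock": DEFAULT_ROUTE, "GatewayId": igw_id}
    }
    if not emr_private:
        # Public EMR subnet: route straight to the internet gateway.
        routes["SYNTASA-EMR-RT"] = {
            "DestinationCidrBlock": DEFAULT_ROUTE,
            "GatewayId": igw_id,
        }
        return routes

    # Private EMR subnet: outbound traffic goes through the NAT target.
    target_keys = {"gateway": "NatGatewayId", "instance": "InstanceId"}
    if nat_kind not in target_keys or not nat_id:
        raise ValueError("a private EMR subnet needs a NAT gateway or NAT instance")
    routes["SYNTASA-EMR-RT"] = {
        "DestinationCidrBlock": DEFAULT_ROUTE,
        target_keys[nat_kind]: nat_id,
    }
    return routes


if __name__ == "__main__":
    # Hypothetical IDs, for illustration only.
    print(plan_default_routes("igw-0example", True, "gateway", "nat-0example"))
```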
Compute and storage setup
The Syntasa application node is an EC2 instance that will reside in the public subnet SYNTASA-EC2-SUBNET with Docker and a few other technologies installed on it. The following outlines the requirements for the AWS infrastructure:
- EC2 key pairs - A key pair needs to be created to be able to SSH to the EC2 and EMR instances. The suggested name for this is SYNTASA-SSH-KeyPair.
- EC2 instance - The Syntasa application node, SYNTASA-APPLICATION-NODE, will need a minimum of 8 cores and 52 GB of memory, for example r5a.2xlarge or r4.2xlarge. This instance will utilize the resources created in previous steps, e.g. VPC, subnet, IAM role, security group, etc.
- S3 configuration bucket - An S3 bucket, syntasa-bucket, is needed before the application is installed for configuration and other files. The S3 bucket will utilize the policy created previously and contain the following folders: syn-cluster-config, syn-cluster-data, syn-cluster-logs.
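The sizing and bucket requirements above can be checked with a short script before the installation starts. In the sketch below, the minimums (8 cores, 52 GB) and the folder names come from this article. The small table of instance specifications is an assumption copied from AWS's published figures and should be verified against current AWS documentation, and the function names are hypothetical. S3 has no real folders, so the sketch represents each folder as a key ending in a slash, which is the usual convention.

```python
MIN_VCPUS = 8
MIN_MEMORY_GB = 52

# (vCPUs, memory in GiB); assumed values, verify against AWS documentation.
INSTANCE_SPECS = {
    "r5a.2xlarge": (8, 64),
    "r4.2xlarge": (8, 61),
    "t2.medium": (2, 4),
}

CONFIG_FOLDERS = ["syn-cluster-config", "syn-cluster-data", "syn-cluster-logs"]


def meets_minimum(instance_type: str) -> bool:
    """Return True if the instance type satisfies the application node minimums."""
    if instance_type not in INSTANCE_SPECS:
        raise ValueError(f"no specification recorded for {instance_type}")
    vcpus, memory = INSTANCE_SPECS[instance_type]
    return vcpus >= MIN_VCPUS and memory >= MIN_MEMORY_GB


def folder_keys(folders=CONFIG_FOLDERS) -> list:
    """Return the S3 object keys that represent the required folders."""
    return [f"{name}/" for name in folders]


if __name__ == "__main__":
    for itype in INSTANCE_SPECS:
        print(itype, "ok" if meets_minimum(itype) else "too small")
    print(folder_keys())
```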
Syntasa supports a variety of output database locations as alternatives to Amazon Athena. If you choose Amazon Redshift instead of Athena, there are two options for configuring a cluster:
- A Redshift cluster in the same VPC as the Syntasa components. This option is presented in the previous diagram.
- A Redshift cluster in another, external VPC.
For either option, the Redshift cluster needs to be set up so that the SYNTASA-VPC and subnets SYNTASA-EMR-SUBNET and SYNTASA-EC2-SUBNET have access to the cluster via firewall rules. This is usually done by attaching a security group to the Redshift cluster.
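As a sketch of the firewall rule described above, the function below builds one inbound permission entry, in the shape the EC2 API uses for security group ingress rules (IpProtocol, FromPort, ToPort, IpRanges), that admits both Syntasa subnets. Port 5439 is Redshift's default and is assumed here; the subnet CIDR blocks in the usage example and the function name redshift_ingress_rule are hypothetical.

```python
import ipaddress

REDSHIFT_PORT = 5439  # Redshift default; assumes the cluster keeps it


def redshift_ingress_rule(subnet_cidrs: dict, port: int = REDSHIFT_PORT) -> dict:
    """Build an inbound rule allowing the given subnets to reach Redshift.

    `subnet_cidrs` maps a subnet name to its CIDR block.
    """
    ranges = []
    for name, cidr in sorted(subnet_cidrs.items()):
        # ip_network raises ValueError on a malformed CIDR block.
        network = ipaddress.ip_network(cidr)
        ranges.append({"CidrIp": str(network), "Description": f"Allow {name}"})
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": ranges,
    }


if __name__ == "__main__":
    # Example CIDR blocks are assumptions, not values from this article.
    rule = redshift_ingress_rule(
        {"SYNTASA-EC2-SUBNET": "10.0.0.0/24", "SYNTASA-EMR-SUBNET": "10.0.1.0/24"}
    )
    print(rule)
```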
The minimum and recommended sizing of the cluster are as follows:
- Instance type - Minimum: dc2.large. Recommended: dc2.8xlarge.
- Instance count - Minimum: 1 node. Recommended: 3 nodes.
- Cluster permissions - An IAM role that has permissions to read/write from the database that the Syntasa application will use to write output tables.
Lastly, the cluster requires the following parameters:
- Hostname - For example, SYNTASA-REDSHIFT. If the cluster is created within the Syntasa VPC, use the private IP; if it is in an external VPC, use the public IP.
- Database - The database within the Redshift cluster where the Syntasa application will create output tables. If one is not available, create a database called 'syntasa', or keep the default 'dev' database if nothing else is running on the cluster.
- Username - A username that has access to the above database and that the Syntasa application will use to write to the cluster.
- Password - Password for the above user.
- IAM Role - An IAM role to use for Redshift COPY commands. Please make sure this role has read/write access to the database created for the Syntasa output tables.
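The parameters listed above can be gathered into a single structure and checked for completeness before they are handed over for installation. The sketch below is one way to do that; the class name RedshiftSettings, its field names, the default port 5439, and the example values are assumptions for illustration and are not part of the Syntasa software. The JDBC URL follows the standard jdbc:redshift://host:port/database form.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RedshiftSettings:
    """Connection parameters required for the Redshift output database."""

    hostname: str
    database: str
    username: str
    password: str = field(repr=False)  # keep the password out of printed output
    iam_role_arn: str = ""
    port: int = 5439  # Redshift default; an assumption

    def missing_fields(self) -> list:
        """Return the names of required parameters that are empty."""
        required = {
            "hostname": self.hostname,
            "database": self.database,
            "username": self.username,
            "password": self.password,
            "iam_role_arn": self.iam_role_arn,
        }
        return [name for name, value in required.items() if not value.strip()]

    def jdbc_url(self) -> str:
        """Return a JDBC URL without credentials."""
        return f"jdbc:redshift://{self.hostname}:{self.port}/{self.database}"


if __name__ == "__main__":
    # Placeholder values only; the account ID and role name are hypothetical.
    settings = RedshiftSettings(
        hostname="10.0.0.25",
        database="syntasa",
        username="syntasa_writer",
        password="change-me",
        iam_role_arn="arn:aws:iam::123456789012:role/EXAMPLE-REDSHIFT-COPY-ROLE",
    )
    print(settings.missing_fields())
    print(settings.jdbc_url())
```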
Installing the Syntasa software
After the foundation within your VPC has been established, the Syntasa software can now be installed on top of that foundation. Typically the installation of the software will be performed by the Syntasa services team.