Struggling to get your data from Google Cloud Storage (GCS) into SYNTASA? This guide simplifies the process! Learn how to build a basic pipeline that ingests data from a text file stored in GCS. Follow these steps, and you'll have your data loaded and ready for further analysis within the SYNTASA Ecosystem.
- Pre-Requisites: What You Need to Know Before Building Your App
- Building Your First App: A Step-by-Step Guide
- Accessing the workspace for app creation
- Creating a new App
- Entering the development workflow canvas
- Building the App's components
- Setting Up the Connection Details
- Configuring the "From File" Process Details of the App
- Saving and Locking Your Work
- Running the App
- Monitoring the App's Progress
- Reviewing the Results
- Deploying the App to Production
Pre-requisites for creating an App:
Before diving into app creation, ensure you are ready with the following:
Permissions:
You'll need the appropriate permissions to create new connections, data stores, runtimes, and Apps, or to access existing ones. These permissions depend on your user profile's role.
GCS Connection:
Data pipelines (Apps) rely on an external source for data intake. For this guide, you'll need a Google Cloud Storage (GCS) connection set up before creating your app. This connection acts as the input source for your pipeline.
Runtime:
Our data pipeline (App) needs an environment to run and process data. This environment is determined by the runtime you choose. Ensure a suitable runtime exists before running your app.
Datastore:
Apps process data from input sources and write the output to datasets within the SYNTASA ecosystem. These output datasets are the app's data stores. A datastore must exist before you create an app, as you'll need to assign one to it.
Step-by-Step Guide
1. Login and Workspace Navigation
- Launch and Login: Open SYNTASA and enter your username and password to sign in.
- Workspace Welcome: Upon successful login, you'll land on the workspace screen. This is the central hub for managing your apps and data pipelines.
- Organizing Your Apps: While the workspace allows you to create Folders for better organization, we'll be using the pre-existing "Demo App Folder" for this guide.
- Initiate App Creation: Click the "Create New" button located in the corner of the workspace screen.
- Select App Type: From the available options, choose "App" to begin building your new data processing application.
2. Creating a New App [Configuration and Saving]
- Name Your App: Give your app a clear and descriptive name that reflects its purpose.
- Copy/Import (Optional): You'll see options to copy or import an existing app, but for this guide, we'll choose "Create New App" to build from scratch. (Refer to the image below.)
- Describe Your App: Briefly explain what your new app does in the description section.
- Organize Your App using Tag/Folder (Optional): Add relevant tags and choose a folder to organize your apps within the Workspace.
- Select Template and Datastore: This guide uses the "Synthesizer Free Form Template" and a pre-configured "Event Store".
- Set Sharing Permissions: Choose who can access your app. Here, "Everyone (Public)" is selected for complete access.
- Create Your App: Click the "Save" button to finalize and create your new app.
3. Accessing the Development Canvas
- Welcome to the Canvas: After creating your app, you'll land on the canvas, the main workspace where you'll visually build the functionalities of your app.
- Unlocking for Action: The canvas is initially locked to prevent accidental edits. Click the unlock button to gain editing permissions and start building your data pipeline.
- Building Your Pipeline: Data pipelines are constructed on the canvas using a drag-and-drop approach. In this guide, you'll drag connections and processes onto the canvas to define how your app handles data.
4. Building the Pipeline
- Drag and Drop Essentials: Locate the element named "GCS Connection" and drag it onto the canvas. Repeat this for the process called "FromFile".
- Connecting the Dots: Look for corresponding icons on both elements. These icons represent connection points for data flow. Drag your cursor from the output end on the "GCS Connection" element and connect it to the input end on the "FromFile" process. This establishes the data flow between these elements in your pipeline.
5. Configuring the GCS Connection
- Configure the GCS Connection: With the "GCS Connection" element on your canvas, click its icon to open its configuration settings.
- Choose Your Connection: From the available options, we selected "Demo Connection". This establishes the connection your app will use to access Google Cloud Storage.
- Save and See the Change: Once you've chosen the connection, click the "Save" button within the configuration panel. This confirms your selection and visually indicates a successful connection by turning the "GCS Connection" icon blue.
6. Configuring the "FromFile" Process
Input Section
- Dive into the File Details: Double-click the "FromFile" process icon to open the Input details screen. This is where you'll define exactly which file your app will process.
- Specifying the Source: Here, you'll need to provide the source path and the source file name pattern. Think of the source path as the address within your Google Cloud Storage bucket, and the pattern helps identify the specific file(s) you want to use. Refer back to the GCS connection video for details on how to construct these.
- Autoconfiguration and Validation: This screen might offer options for autoconfiguring the source file based on its location or even validating its structure. Take advantage of these features if available.
- Data Preview with the Magic Pen under Event: Use the "Magic Pen" tool, which allows you to preview a sample of your data structure (delimited values, e.g. comma-separated values) directly from the source file.
- Validation Confirmation: Click the "Validate" button to confirm that SYNTASA can successfully access and understand the structure of your chosen file.
- Success Message: If the validation is successful, you might see a message like "No Errors" displayed.
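Conceptually, the source path plus file name pattern acts as a filter over the objects in your bucket. The sketch below illustrates that idea with Python's standard `fnmatch` shell-style matching; the bucket layout, path, and pattern are illustrative examples, not values taken from SYNTASA.

```python
from fnmatch import fnmatch

# Hypothetical source path and file name pattern, similar to what you
# might enter on the "FromFile" Input screen (names are illustrative).
source_path = "landing/events/"
file_pattern = "events_*.csv"

# Object keys as they might appear in a GCS bucket listing.
object_keys = [
    "landing/events/events_20240101.csv",
    "landing/events/events_20240102.csv",
    "landing/events/readme.txt",
]

def matches(key: str) -> bool:
    """True if the object sits under the source path and its
    file name matches the wildcard pattern."""
    if not key.startswith(source_path):
        return False
    filename = key[len(source_path):]
    return fnmatch(filename, file_pattern)

selected = [k for k in object_keys if matches(k)]
# Only the two CSV files match; readme.txt is filtered out.
```

Whatever the exact wildcard syntax SYNTASA supports, the net effect is the same: only objects under the source path whose names fit the pattern are picked up by the process.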
Other relevant input details:
- Contains Header: If your source file includes a header row for records, ensure the "Contain Header" toggle is switched "ON."
- Incremental Load: Select this option if you're loading files containing fresh data regularly (daily or hourly).
- Date Parsing: Configure values in this section if you want to pick files based on the date included in their filename.
- Data Manipulation: This section helps if you need to handle backdated data within your current files.
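To see why date parsing matters for incremental loads, consider pulling the date out of a daily file's name. This is a minimal sketch of the idea, assuming a hypothetical `YYYYMMDD` stamp in the filename; the actual pattern SYNTASA expects is set in the Date Parsing section.

```python
import re
from datetime import date

# Hypothetical daily drop file whose name carries its load date.
filename = "events_20240102.csv"

# Extract a YYYYMMDD stamp from the filename (pattern is illustrative).
match = re.search(r"(\d{4})(\d{2})(\d{2})", filename)
if match:
    year, month, day = (int(g) for g in match.groups())
    file_date = date(year, month, day)
# An incremental load can then pick only files whose parsed date
# falls after the last date already processed.
```

With a parsed date per file, a regular (daily or hourly) load only needs to ingest files newer than the last successful run.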
Schema Configuration
- Defining Partition Columns: The schema section allows you to configure details related to partitioning your data. By default, SYNTASA might use a field named "file_date" as the partition column. Partitioning helps organize your data based on specific criteria for easier retrieval and analysis.
- Smart Schema Autofill: Since your file has a header row containing column names, take advantage of the "Autofill" button. Clicking this button instructs SYNTASA to automatically extract all the column names and data types from the header row of your source file. This eliminates the need for manual configuration, saving you time and effort.
- Handling Files Without Headers: If your source file doesn't have a header row, you'll need to add the columns manually. SYNTASA might offer additional options for importing header fields from your data. Explore these options if manual configuration is necessary.
- Schema Management Tools: The schema section might also provide shortcut buttons for managing your schema definition. These buttons could include options to clear all defined fields or export the current schema configuration for future reference.
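The "Autofill" behavior described above amounts to reading the header row for column names and sampling a data row to guess types. Here is a rough sketch of that idea using Python's `csv` module; the sample data and the type-inference rules are illustrative assumptions, not SYNTASA's actual algorithm.

```python
import csv
import io

# A tiny delimited sample with a header row (illustrative data).
sample = "user_id,event_name,revenue\n101,page_view,0.0\n102,purchase,19.99\n"

def infer_type(value: str) -> str:
    """Very rough type inference for illustration only."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"

rows = list(csv.reader(io.StringIO(sample)))
header, first_row = rows[0], rows[1]

# Pair each header name with a type guessed from the first data row.
schema = [(name, infer_type(val)) for name, val in zip(header, first_row)]
```

This is why autofill only works when a header row is present: without one, there are no names to pair with the inferred types, and you fall back to defining columns manually.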
Output Configuration
- Choosing Your Output Destination: This screen lets you define where your processed data will be stored. By default, SYNTASA sets the destination as a Hive table. However, you have the flexibility to choose the "Load To BQ" option if you prefer storing your data in a BigQuery table instead.
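The practical difference between the two destinations shows up in how the output table is addressed afterwards: Hive tables are referenced as `database.table`, while BigQuery tables are fully qualified as `project.dataset.table`. The dataset, table, and project names below are hypothetical placeholders.

```python
# Illustrative only: how a destination choice might map to a fully
# qualified table reference (all names are hypothetical).
dataset, table = "demo_event_store", "gcs_events"

def output_reference(destination: str) -> str:
    if destination == "hive":
        # Hive convention: database.table
        return f"{dataset}.{table}"
    if destination == "bq":
        project = "my-gcp-project"  # hypothetical GCP project id
        # BigQuery convention: project.dataset.table
        return f"{project}.{dataset}.{table}"
    raise ValueError(f"unknown destination: {destination}")
```

Knowing which convention applies helps when you later query the output table from outside SYNTASA.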
7. Saving and Locking Changes
- Save Your Work: Once you've configured the "FromFile" process and other elements to your satisfaction, click the "Save" button. This confirms and applies the changes you've made within the process settings. As a visual confirmation, the process icon color might change after saving.
- Locking in Your Changes: When you're confident about all the configurations within your app, click the "Save and Lock" button. This action finalizes all the adjustments you've made, essentially locking the app canvas to prevent accidental edits and ensuring your work is preserved.
8. Running the App
- Initiate Job Creation: Click the "Create Job" button to launch the job creation process.
- Meaningful Job Name: Assign a clear and descriptive name to your job that reflects its purpose. This will help you identify and manage your jobs easily.
- New or Copied? Decide if you want to create a brand-new job from scratch or base it on an existing one. If you're starting fresh, select "Add New".
- Including the Process: Choose the option "Add New" to incorporate the configured process (likely the "FromFile" process) into your newly created job.
- Selecting the Process: From the available options, identify and select the specific process you've configured for this job.
- Runtime Selection: Choose a suitable runtime. This essentially creates a cluster that will provide the computing resources necessary to execute your app.
- Process Mode: Depending on your workflow needs, you might be able to define how the output data is handled. Common options include "overwrite" (replacing existing data), "drop" (deleting existing data), or "add new data" (appending new data). Choose the most suitable option for your specific scenario.
- Confirmation: Click the "Apply" button to finalize your job configuration and initiate the job execution process.
- Scheduling for a Specific Date: While configuring the job, you also have the option to schedule its execution for a specific date, ensuring the app processes the relevant data. (This step might depend on the specific capabilities of SYNTASA's job scheduling functionality.)
Note: Remember to check SYNTASA's documentation or consult relevant resources to understand how to configure job scheduling for a specific date within the platform.
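The process modes mentioned above differ in how they treat data already in the output table. The toy model below sketches one plausible reading of the three modes against a date-partitioned table; the exact semantics (in particular whether "overwrite" replaces the whole table or only matching partitions) should be confirmed in SYNTASA's documentation.

```python
# Toy model of process modes against a date-partitioned output table.
# Partition date -> list of rows (data is illustrative).
existing = {"2024-01-01": ["row_a"], "2024-01-02": ["row_b"]}
incoming = {"2024-01-02": ["row_c"]}

def apply_mode(table: dict, new_data: dict, mode: str) -> dict:
    if mode == "overwrite":
        # Replace partitions that arrive in the new data; keep the rest.
        return {**table, **new_data}
    if mode == "drop":
        # Discard the old table entirely; keep only the new data.
        return dict(new_data)
    if mode == "append":
        # "Add new data": new rows sit alongside the old ones.
        merged = {k: list(v) for k, v in table.items()}
        for k, rows in new_data.items():
            merged.setdefault(k, []).extend(rows)
        return merged
    raise ValueError(f"unknown mode: {mode}")
```

For a daily incremental feed, an append-style mode is the usual fit; overwrite suits reprocessing a partition, and drop suits a full rebuild.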
9. Monitoring the Job
- Save and Run the Job: Once you've configured everything for the job, click the "Save and Execute" button. This action triggers the job execution process.
- Monitoring Progress: The activity area on the right-hand side of the screen will transform into a live job status monitor. This panel will display the progress of your app as it goes through three distinct stages:
- Create Cluster: During this initial phase, SYNTASA allocates resources and sets up a cluster environment to execute your app.
- Process: Once the cluster is ready, the actual processing of your data commences according to the configurations you defined within your app.
- Terminate Cluster: After successful execution, SYNTASA terminates the cluster, releasing the allocated resources.
- Wait for Completion: It's crucial to wait for all three steps (Create Cluster, Process, and Terminate Cluster) to finish before proceeding with any further actions within SYNTASA. This ensures your app has completed its task successfully.
10. Verifying the App Output
- Examining the Output: Once your job finishes running successfully, you have the option to delve into the processed data. Click the "Output" icon to access the details.
- Output Details Page: This dedicated page offers valuable information about the processing that just occurred. You'll likely find details regarding:
- Input File: Information about the source file that was processed by your app.
- Output Details: This section sheds light on the final location of your processed data. Expect to see details like the dataset, database (if applicable), and the specific table name where the results reside within your data storage system.
- Schema: The schema section displays column details such as order, data type, and partitioned status.
- State: The state section provides a summary of your output table, including the partition date, the number of entries, the total size, and the last time the table was updated. This snapshot of the table's health and activity helps you monitor the efficiency of your output and identify any potential issues.
- Preview: Finally, the preview page allows you to preview and download a sample of the processed data.
11. Deployment
- Deploying to Production: If you're satisfied with the app's output and functionality, you can proceed with deploying it to a production environment, where it can be used by end users. To initiate deployment, click the "Deploy" button.