"Orchestrator Apps" refer to applications that create integrations with enterprise systems, enabling them to deliver data points. Orchestrator Free Form is an Orchestrator app that provides a blank workflow for the ability to create custom apps by using pre-built processes like To DB, Post, and From File.
Prerequisites
The following screens need to be populated before configuring this app:
- Infrastructure - All required fields populated with the environment details of your on-premise, Google Cloud Platform, or AWS environment.
- Event Store - Where data will reside in your cloud environment or HDFS cluster.
- Connection - Where data will be ingested for processing by Syntasa.
App creation
- Go to Workspace -> Folders -> Create New and select the App option. You are presented with a new screen to provide the details of the app.
- Fill in the New App screen:
Name - The name of your new Orchestrator Free Form application.
Key - This will automatically populate based on the name you enter.
App Prefix - This will automatically populate based on the name you enter. However, you can change this if you prefer something else.
Copy/Import App - Toggling ON allows you to create a new app from an existing one instead of starting from scratch. You are then presented with two radio buttons:
Copy App - Replicate an app within the Syntasa platform that you have access to, based on the sharing settings of the source app.
Import App - Upload a .zip file from your local machine to replicate an external Syntasa app.
Description - Purely informational text field.
Tags - Users can tag apps, resources, and notebooks by creating new tags or selecting from existing ones in a custom text field.
Folder - Select the desired folder path where the app needs to be saved.
Template - Choose your app template, in this case, Orchestrator Free Form.
Event Store - Dropdown where you can choose your pre-configured event store.
Override Icon - Toggle button; turn it on only if you would like to use a custom icon.
Pick a sharing option - The app can be shared as Private, Public, or Group. However, the System Admin has access to all apps.
Private - The owner or system administrator can limit access to components by setting them to private. Only the owner and system administrators can access private components.
Public - The default sharing option for all apps, notebooks, and resources when creating something new.
Group - Limits component access to the specific user groups assigned to it. The system administrator can view these components regardless of group membership.
Owner - By default, the owner of the app is the user who creates it. If needed, the owner or the system administrator can change this after the app is created.
- Click 'Create'.
Configure Orchestrator Free Form
- Find your new app and click on it to open it.
- The workflow will look like the screenshot below. We will then drag in our data connection sources, which will be the tables generated by other apps.
- Click the lock icon on the top-left to unlock the workflow.
- From the left-side menu, under Stores, drag a connection onto the workflow.
- Click on the new node you've dragged on and, from the dropdown, select your connection. Save the changes by clicking the tick on the right.
- Now let's drag on a From File process and configure it. The purpose of the From File process is to bring in partitioned or non-partitioned data (like a CSV file) by defining the file pattern and data schema. A Connection will need to be dragged onto the canvas, selected, and connected to the From File process.
Process Configuration of From File
Configuration of this process includes three screens.
Input
This section provides the information Syntasa needs to understand the source connection path of the files and details of the files that need to be ingested.
- Source Path - Folder within the source connection, where the files reside. Do not include the bucket name or directory name specified in the Connection.
- Source File Pattern - File name pattern of the raw files to pull from the connection. Keep in mind that the files may be .tar or .zip archives containing multiple files, such as the raw data and supporting enrichment data; this pattern provides the ability to pull specific files from a Connection where multiple source files may exist.
- Event File Pattern - File name of the separate raw events file, if one exists; if not, use the same pattern as the source file. For example, Adobe provides a .tar file named with the report suite and date, and inside that .tar file is the hit_data.tsv file. In this example, the user would enter hit_data.tsv in this field because it is the event file within the source file.
- File Type - The type of file: Tar, Text, or Zip.
- Compression Type - Specify whether the file is compressed and, if so, the compression type.
- Incremental Load - Provides the option to keep previous files or overwrite them with a one-time lookup load.
- File Name Has Date - Specify whether the filenames include a date; required if the data source needs to be partitioned by date.
- Date Pattern - The pattern of the date in the filename, which can be either yyyyMMdd or yyyy-MM-dd.
- Date Extraction Type - Regex or Index. Regex example: .*sample_data_(.*)-.*.tsv.gz (replace sample_data with the actual name of your file). See the sketch after this list.
- Group Number - The regex capture group containing the date; in the example above, this is 1.
- Index Start/Index End - For Index extraction, provide the start index (for example, 14) and end index (for example, 22). Please note the positions may vary depending on how your filename and date are formatted; see above.
- Date Manipulation - Provides the means to shift the partition date by a positive or negative number of days relative to the file date (e.g., the file date is 2018-08-01, but the contents of the file are for 2018-07-31). If no shift is needed, leave this as is.
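The following is a minimal Python sketch (not Syntasa code) of how these Input settings fit together: picking the event file out of an archive by pattern, extracting the date from the file name by regex or by index, and applying a Date Manipulation offset. The file names, archive member names, index positions, and the -1 day offset are all assumptions for illustration.

```python
import re
from datetime import datetime, timedelta

# Hypothetical source file pulled from the Connection (Source File Pattern).
archive_name = "sample_data_20180801-export.tar"

# Members you might find inside an Adobe-style .tar export (assumed names);
# the Event File Pattern picks the raw events file out of the archive.
archive_members = ["hit_data.tsv", "browser.tsv", "column_headers.tsv"]
event_files = [m for m in archive_members if m.endswith("hit_data.tsv")]

# Date Extraction Type = Regex: capture group 1 holds the date in the name.
match = re.match(r".*sample_data_(.*)-.*\.tar", archive_name)
date_str = match.group(1)                   # "20180801"

# Date Extraction Type = Index: slice the same characters by position.
# The start/end positions depend entirely on your file-name layout.
date_str_by_index = archive_name[12:20]     # also "20180801"

# Date Pattern yyyyMMdd corresponds to %Y%m%d in Python.
file_date = datetime.strptime(date_str, "%Y%m%d")

# Date Manipulation: shift the partition date relative to the file date,
# e.g. the file is named 2018-08-01 but holds 2018-07-31 data -> offset -1.
partition_date = file_date + timedelta(days=-1)
print(event_files, partition_date.date())   # ['hit_data.tsv'] 2018-07-31
```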
Schema
This section provides the ability to define the column names of the input file and the time column.
- Time Source - The column designating a timestamp; required for incremental loading.
- Time Format Type - Specify how the time source field is formatted via a dropdown offering Epoch or Timestamp. If Timestamp is selected, a new text field is presented:
- Time Format - Enter the format, for example yyyyMMdd or yyyy-MM-dd (see the sketch below).
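As a rough illustration (plain Python, not Syntasa's implementation), the difference between the Epoch and Timestamp options comes down to how the time source value is parsed; the sample values and the yyyy-MM-dd format below are assumptions.

```python
from datetime import datetime, timezone

# Epoch: the time source column holds seconds since 1970-01-01 UTC.
epoch_value = 1533081600
ts_from_epoch = datetime.fromtimestamp(epoch_value, tz=timezone.utc)

# Timestamp: the column holds a formatted string, parsed with the configured
# Time Format (yyyy-MM-dd in Java-style notation, i.e. %Y-%m-%d in Python).
ts_from_string = datetime.strptime("2018-08-01", "%Y-%m-%d")

print(ts_from_epoch.date(), ts_from_string.date())  # 2018-08-01 2018-08-01
```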
Output
The Outputs tab provides the ability to name tables and their display names on the graph canvas, along with selecting whether to load to BigQuery (BQ) if in Google Cloud Platform (GCP), load to Redshift or RDS if in Amazon Web Services (AWS), or simply write to HDFS if using on-premise Hadoop.
Expected Output
Data from the source connection file defined in the Input tab of this process is loaded to Hive. This also allows Syntasa to be aware of the state and schema for downstream usage, and optionally to write to other environment-specific query engines.