Generic Input Adapter is an app that ingests files containing generic event (time-based) data (e.g. transactional credit card, CRM, call center, or other enterprise data). Leveraging generic data in this way provides an understanding of a customer's behavior across different systems.
Prerequisites
The following screens need to be populated before configuring this app:
- Infrastructure - All required fields populated with the environment details of your on-premise, Google Cloud Platform, or AWS environment.
- Event Store - Where data will reside in your cloud environment or HDFS cluster.
- Connection - The source from which data will be ingested for processing by SYNTASA.
App creation
- Click on the menu icon and under Platform select "Workspace" from the sub-menu.
- Then click the green "+" button at the top right of your screen.
- Fill in the New App screen.
- Name = display name of your new Generic Input Adapter application.
- Key = will automatically populate based on the name you enter.
- App prefix = will automatically populate based on the name you enter; you can change this if you prefer something else.
- Template = Choose your app template, in this case Generic Input Adapter.
- Description = Purely informational text field.
- Event Store = Drop down where you can choose your pre-configured event store.
- Override Icon = Toggle button (don't change unless you would like to use a custom icon).
- Click 'Create'.
Configure Generic Input Adapter
- Now find your new app and click on it to open it.
- Click the lock icon on the top-left to unlock the workflow.
- From the left side menu, under Stores drag a Connection onto the workflow.
- Click on the new node you've dragged in and select your connection from the drop-down.
- Regardless of which connection you're using, configuration options are the same.
- Now save the changes by clicking the tick on the right.
- Select the edge of your connection and drag it to From File and snap it into place. This will enable From File for editing.
- Let's configure From File by clicking on it.
- Input is where you configure how the files are picked up and read.
- Source Path: Location to the files (e.g. /generic/data/daily_files/)
- Source File Pattern:
- For daily files, you may have the date at the end of your file name, e.g. file_name.*.tsv. Your file extension may differ from our example.
- For static files, just write the name of your file.
- Text File Type is a drop-down; select the option which applies to your data.
- Your selection may change the options available under Events; that's okay, we will cover this further down.
- Compression Type is also a drop-down; select the option which applies to your data. If no compression is being used, select None.
- Now select the Event Data Format from the drop-down. The different options are presented below; use the one applicable to your data.
- Ours is delimited, so we will use Delimited.
- Select the field delimiter; ours is ",".
- Leave the remaining two as default (Quote Character, Escape Character).
- Avro: select this if your files are in Avro format.
- Apache Log (a sketch of this format appears after these options):
- If Apache Log Format: Common, leave Regex Pattern with the auto-completed text.
- If Apache Log Format: Combined, leave Regex Pattern with the auto-completed text.
- If Apache Log Format: Custom, you must write your own regex logic for Regex Pattern.
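As orientation for the Apache Log option, here is a minimal Python sketch. The sample line and patterns are illustrative assumptions, not the app's auto-completed text; Combined format is simply Common format plus the referrer and user-agent fields.

```python
import re

# Illustrative patterns only; the app's auto-completed patterns may differ.
COMMON = r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)'
COMBINED = COMMON + r' "([^"]*)" "([^"]*)"$'

# Hypothetical Combined-format access log line.
line = ('203.0.113.7 - frank [10/Oct/2023:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://example.com/start" "Mozilla/5.0"')

host, ident, user, ts, request, status, size, referer, agent = re.match(COMBINED, line).groups()
print(status, request)  # 200 GET /index.html HTTP/1.0
```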
- If your data contains headers, toggle Contains Headers ON.
- Incremental Load is toggled ON; if your data is not daily, toggle this OFF.
- If Incremental Load is ON, select the Load Type for your data; this will be either Daily or Hourly.
- Date Parsing: for this we will need to define the Date Pattern, e.g. yyyyMMdd.
- For Regex Extraction Type, choose Regex or Index (a sketch of both approaches appears after this list of Input settings).
- If Regex, use a pattern such as .*daily_crm(.*).tsv.gz
- Group Number is 1, as we want the first capture group in the regex returned.
- If Index, select the start and end positions of where the date sits in the file name (remember the index starts from 1).
- Start Index: enter your number here.
- End Index: enter your number here.
- Under Date Manipulation, leave the defaults.
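Here is a minimal Python sketch of the two extraction approaches above. The file name, positions, and pattern are assumptions for illustration; the platform performs this extraction for you.

```python
import re
from datetime import datetime

# Hypothetical daily file name used only for illustration.
file_name = "daily_crm20240115.tsv.gz"

# Regex extraction: capture group 1 holds the date portion of the name.
date_text = re.match(r".*daily_crm(.*)\.tsv\.gz", file_name).group(1)  # "20240115"

# Index extraction: 1-based, inclusive start/end positions of the date.
start_index, end_index = 10, 17
date_text_by_index = file_name[start_index - 1:end_index]  # "20240115"

# The Date Pattern yyyyMMdd corresponds to %Y%m%d in Python's strptime.
event_date = datetime.strptime(date_text, "%Y%m%d").date()
print(event_date)  # 2024-01-15
```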
- Now we're finished with Input; from the left-side tab, select Schema.
- You will need to locate your column headers; these may be at the top of your data or in a separate file.
- If they are in the same file as your data, open the file and copy your column headers. We recommend you sanity-check the headings by pasting them into a text editor first to ensure they are as you expect, and once ready, copy them again.
- Now click on Add Columns and paste in your headers, making sure you select the appropriate delimiter.
- If they are in a separate file, you can upload your column headers using the Import button, or alternatively add them by repeating the previous step.
- Click on the Time Source drop-down and select a header from your data.
- Click Time Format Type and again select from the drop-down; we're using epoch (see the sketch below).
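To illustrate what the epoch time format implies, here is a small Python sketch; the column value is a made-up example. If your time source holds epoch milliseconds rather than seconds, scale it down first.

```python
from datetime import datetime, timezone

# Hypothetical value from the time source column, in epoch seconds.
event_ts = 1697034936
print(datetime.fromtimestamp(event_ts, tz=timezone.utc))  # 2023-10-11 14:35:36+00:00

# If the column holds epoch milliseconds, divide by 1000 first.
event_ts_ms = 1697034936123
print(datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc))
```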
- Click on Output; we recommend you update the Table Name and Display Name to ones in line with your naming convention.
- Now save the changes by clicking the tick on the right.
- From the left side menu, under Stores drag an Event Store onto the workflow.
- Click on the Event Store on the workflow and populate the two drop-downs:
- Select the Event Store; if you have a long list, start typing the name to filter it.
- Select the Dataset; again, if you have a long list, start typing the name to filter it.
- Click on the tick box on the right to save.
- From the edge of the Event Store click and snap to the edge of the Generic Event Enrich node.
- Now click on Generic Event Enrich.
- You are now asked to alias your primary data source; this alias is what the app will expect when mapping columns in the field titled "Function" and writing enrichment. We recommend a simple but consistent table naming convention that everyone in your organization will know.
- Click on "Mapping" on the left menu on the new window being displayed.
- Edit the Mapping screen. You only need to edit one mapping screen in a Generic Event Enrich process; the mapping screens will be identical, and any changes made to one will also show in the other.
- All changes can also be made via Excel by clicking "Export" to generate and download a CSV.
- The Name is a default schema header in the table (non-editable).
- The Labels autocomplete, populated with the fields from the input tables.
- Function is where we write our enrichment:
- Pull in the exact value from the primary input source by selecting from the drop-down options.
- Write custom enrichment, for example case statements, regular expressions (regex), etc. (see the sketch below).
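To give a feel for the kind of logic a mapping Function expresses, here is a minimal PySpark sketch. It is an illustration only, not SYNTASA's mapping syntax; the crm alias, columns, and values are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enrich_sketch").getOrCreate()

# Hypothetical primary input aliased as "crm".
crm = spark.createDataFrame(
    [("C001", "WEB"), ("C002", "TEL"), ("C003", "BR")],
    ["customer_id", "channel_code"],
).alias("crm")

enriched = crm.select(
    # Pull the exact value straight through from the primary source.
    F.col("crm.channel_code"),
    # A case statement deriving a friendlier label.
    F.expr("""
        CASE crm.channel_code
            WHEN 'WEB' THEN 'Online'
            WHEN 'TEL' THEN 'Call Center'
            ELSE 'Branch'
        END AS channel_label
    """),
    # A regex pulling the numeric part out of the customer id.
    F.regexp_extract(F.col("crm.customer_id"), r"C(\d+)", 1).alias("customer_num"),
)
enriched.show()
```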
- Once done with mapping, click on Filter in the left-side menu; we don't need to edit this unless we identify something during testing that we would like to filter out.
- Avoids reading unnecessary data from the lookup table.
- Beneficial only when the lookup (secondary) table is big.
- It helps Spark perform some optimization while doing product/event enrich.
- It extracts from the lookup table only the values (like Product ID) that are present in the main table and then performs the join (see the sketch below).
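A rough PySpark sketch of the idea behind this filter follows; the table names and values are hypothetical, and the platform's actual implementation may differ. Only the lookup rows whose keys appear in the main table are carried into the join.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lookup_filter_sketch").getOrCreate()

# Hypothetical main event table and large product lookup table.
events = spark.createDataFrame(
    [("e1", "P10"), ("e2", "P11"), ("e3", "P10")],
    ["event_id", "product_id"],
)
products = spark.createDataFrame(
    [("P10", "Card A"), ("P11", "Card B"), ("P99", "Never referenced")],
    ["product_id", "product_name"],
)

# Keep only lookup rows whose product_id appears in the main table,
# then join; when the lookup is big, this avoids reading rows that
# could never match.
needed_ids = events.select("product_id").distinct()
trimmed_lookup = products.join(needed_ids, on="product_id", how="left_semi")

enriched = events.join(trimmed_lookup, on="product_id", how="left")
enriched.show()
```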
- Click on the Output tab on the left, and check:
- The Table name; you can update this if you have a specific naming convention.
- The View name; you can update this if you have a specific naming convention.
- The Display name on the workflow.
- Under Configurations (these are default settings; please don't modify them without consulting Syntasa):
- Partition Schema: Daily.
- File Format: Parquet.
- Load to Big Query: toggled ON (by default).
- Now save the changes by clicking the tick on the right.
- Finally, click on Save and Lock.
Test in development
Now you're ready to test your configuration.
- To test the Generic Input Adapter, click on the nodes one by one while holding the Shift key. The nodes will be highlighted in grey with a tick to indicate they have been selected.
- Now click on the "Job" button on the top-right and then click "New Job" from the drop-down.
- You will now be presented with a window for configuring your job; let's populate the following:
- The job Name.
- The job Description is informational for the user.
- Tag the job.
- Process name: auto-populated, non-editable.
- Runtime is a drop-down; this allows you to choose the type/size of the cluster you want to use for processing.
- Process Mode: for the first run, Replace Date Range is sufficient; however, if you're running the job multiple times you may opt for Drop and Replace or Add New and Modified. (It's worth noting Drop and Replace is not advised for production, as this will drop your data in BigQuery.)
- Date Range is a drop-down with a number of options. Custom allows you to select the dates you want to process (From Date / To Date); alternatives like "Last N Days" allow you to select a relative date range, e.g. the last 2 days, with an optional offset (a sketch of how such a range resolves appears after this list). For our purposes, we're using Custom, so we will enter two dates.
- For Preview Record Limit and Default Test File Limit, leave these as default.
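For clarity on how a relative range such as "Last N Days" with an offset resolves to concrete dates, here is a small Python sketch. The exact resolution rules (whether today is included, how the offset is applied) are assumptions for illustration; the platform computes the range for you.

```python
from datetime import date, timedelta

def last_n_days(n, offset=0, today=None):
    """Resolve an n-day range ending `offset` days before `today` (inclusive)."""
    today = today or date.today()
    to_date = today - timedelta(days=offset)
    from_date = to_date - timedelta(days=n - 1)
    return from_date, to_date

# "Last 2 days" with a 1-day offset, relative to 2024-03-10.
print(last_n_days(2, offset=1, today=date(2024, 3, 10)))
# (datetime.date(2024, 3, 8), datetime.date(2024, 3, 9))
```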
- Now click on "Save & Execute" and the job will start.
- Now click on Activity to expand and show the job details in the right-side menu.
- Activity will have a 1 to indicate a running job.
- Once the job completes, you can click on the output preview.
- The window will have a menu on the left; click on "Preview".
Run job in production
- Deploy your development workflow to production. From the development workflow, click the "Deploy" button, and you will be presented with the screen for initial deployment.
- After the initial deployment, you will be required to create a snapshot name.
- Snapshot is a feature that saves the state of the app, so you can track changes over time.
- Now open the production workflow.
- Highlight the nodes you want to include in the job by holding "Shift" and then clicking on the Generic Event Enrich processes.
- Click on the Job button on the top-right menu, and choose "New Job" in the sub-menu.
- Fill in the name, description, and date range, then click "Save and Execute", which will start the job.
- Now click on Activity to expand and show the job details in the right-side menu.
- Activity will have a 1 to indicate a running job.
- Once the job completes, you can click on the output preview.
- The window will have a menu on the left; click on "Preview".