Adobe Feed Watcher events are specialized user-defined events within Syntasa that monitor directories for the arrival of new Adobe clickstream data files. The files contain detailed logs of user interactions on a website, collected by Adobe Analytics. When a new clickstream data file that matches a specified pattern is detected, the event triggers downstream processes such as data ingestion and processing workflows. This automation is vital for organizations that rely on timely and accurate web analytics to drive business decisions.
Creating an Adobe Feed Watcher Event
Follow this step-by-step guide to create a new Adobe Feed Watcher Event:
- Click on the hamburger menu and select "Events" under Resources.
- Click the "+" icon. This takes you to the "Create User-Defined Event" screen.
- In the "Event Type" field, select "Adobe Feed Watcher."
- Fill out the fields according to your requirements. Please refer to the details provided for each field below.
- Once you have filled in all the information, click "Save."
Configurable Fields
Let's break down the purpose and function of each field available on the 'Create User-Defined Event' screen when 'Adobe Feed Watcher' is chosen as the event type:
Basic Details
- Name: This field allows you to assign a name to the Adobe Feed watcher. This name helps identify the file watcher easily and understand its purpose.
- Description: In this field, you can describe the file watcher. This description can explain what the file watcher does and why it was created.
- Tags: You can use tags to categorize the file watcher. Tags help filter and find specific file watchers. To know more about this please refer to this article.
- Active/Inactive: When the toggle is set to active (usually indicated by a green button or checkbox), the file watcher monitors the specified file path and file pattern for changes.
When the toggle is set to inactive (usually indicated by a red button or unchecked checkbox), the file watcher is paused. It won't monitor the file for changes and won't trigger any events. This is useful when you want to temporarily stop the file watcher without having to delete the entire configuration.
Type
- Event Type: This field lets you select the event type. For this example, we'll choose 'Adobe Feed' since we are setting up an 'Adobe Feed' event. This option is pre-selected by default.
- Connection: Choose an existing connection that directs to the cloud storage location (like an S3 bucket or GCS bucket) where the directory to be monitored is located. Currently, events support connections of types including GCS, S3, AWS S3, GCP GCS, FTP, SFTP, and ONPREM (HDFS).
- Poll Interval (Minutes): This setting determines how often the event should check for new files, with a value of 1 indicating a check every minute.
Note: It's recommended to start with 30 to 40-minute intervals to balance performance and cost efficiency, as frequent polling may increase costs with your cloud provider. Note that there is an associated cost with events, which is influenced by the data volumes in the specified connection location.
Adobe Feed Details
- File Path: This field specifies the file's location that the Adobe Feed watcher will monitor for changes.
- Manifest File Pattern: The file name or regex file pattern to be monitored.
- Feed Frequency: This field specifies the frequency of your Adobe clickstream data feed. This setting aligns the feed watcher’s polling schedule with the delivery frequency of your data, ensuring that the event checks for new files at appropriate intervals. There are two options:
- Daily: Select this if your clickstream data is delivered once per day.
- Hourly: Select this if your data is delivered in hourly increments.
- Trigger after a complete day (Toggle): This toggle option is used if you select Hourly as the feed frequency but want to trigger the event only after the entire day’s worth of hourly data has been delivered. This is useful for scenarios where you need to process data in complete daily batches, even if it’s delivered hourly.
- When ON: The event waits until all hourly files for the day are received before triggering the downstream processes.
- When OFF: The event triggers as soon as each hourly file is received.
When you set up an Adobe Feed Watcher with an hourly feed frequency and enable the "Trigger after a complete day" option, three more fields appear: Event Timeout (Hours), Event Timeout Behavior, and Time Zone. Let’s explore each of these in detail.
- Event Timeout (Hours): Specifies how long the system should wait for all expected hourly files to be received before marking the event as timed out.
This setting provides a buffer period to accommodate delays in file delivery, ensuring that the event only triggers when either all files are received or the waiting period has expired.
For example, imagine you are expecting 24 hourly files for a complete day, but only 20 files have been received by the end of the day. By setting the Event Timeout (hours) to 2, you instruct the system to wait an additional 2 hours for the remaining 4 files. There are two possible outcomes:
- If the remaining files arrive within the specified timeout: The event will trigger as soon as the last expected file is received within this period.
- If the remaining files do not arrive within the specified timeout: The event is marked as timed out.
- Event Timeout Behavior: Determines the action the system should take if the expected number of files is not received within the Event Timeout period. This field allows flexibility in handling incomplete data, giving you control over whether to proceed with partial data or wait for a complete set. It has two options:
- Ignore: If selected, the system will do nothing if the event times out, meaning no event will be triggered. This is useful when processing should only happen with a complete set of files.
- Process Partial: If selected, the system will trigger the event even with the partial set of files that were received. This allows processing to continue with the available data, which is beneficial in scenarios where some data is better than none.
- Time Zone: Specifies the time zone in which the event should operate. This is crucial for accurately interpreting the timestamps of the files being monitored and aligning the event schedule with the time zone where the data is generated or stored.
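The interaction between Event Timeout (Hours) and Event Timeout Behavior described above can be sketched as a small decision function. This is only an illustration of the documented behavior, not Syntasa's implementation: the function name `decide`, its parameters, and the return labels `"trigger"`, `"wait"`, and `"skip"` are all hypothetical.

```python
from enum import Enum


class TimeoutBehavior(Enum):
    IGNORE = "Ignore"
    PROCESS_PARTIAL = "Process Partial"


def decide(received_hours, hours_past_day_end, timeout_hours, behavior, expected=24):
    """Return 'trigger', 'wait', or 'skip' for one feed day (hypothetical labels)."""
    if len(set(received_hours)) >= expected:
        return "trigger"  # all hourly files arrived: fire the event
    if hours_past_day_end < timeout_hours:
        return "wait"  # still inside the timeout buffer
    # Timed out with an incomplete set of hourly files.
    if behavior is TimeoutBehavior.PROCESS_PARTIAL:
        return "trigger"  # proceed with the files that did arrive
    return "skip"  # Ignore: no event is triggered


# 20 of 24 files received, timeout of 2 hours.
print(decide(range(20), 1, 2, TimeoutBehavior.IGNORE))
print(decide(range(20), 2, 2, TimeoutBehavior.IGNORE))
print(decide(range(20), 2, 2, TimeoutBehavior.PROCESS_PARTIAL))
```

With 20 of 24 files and a 2-hour timeout, the sketch waits during the buffer, then either skips the day (Ignore) or triggers on the partial set (Process Partial).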
Date Parsing
- Date Pattern: Specifies the format of the date embedded in your file name. This ensures that the system can correctly identify and parse the date portion of the file name to determine its recency and whether it fits the expected delivery pattern.
Examples:
- yyyyMMdd: For daily files with a date format like “20240614” (June 14, 2024).
- yyyyMMddHH: For hourly files with a date format like “2024061415” (3 PM, June 14, 2024).
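To check a Date Pattern against a sample value before saving the event, the two patterns above can be reproduced in a few lines of Python. This is a rough sketch for local testing only: the mapping table and the `parse_feed_date` helper are hypothetical and are not part of Syntasa.

```python
from datetime import datetime

# Hypothetical mapping from the two date patterns above to Python strptime formats.
PATTERN_TO_STRPTIME = {
    "yyyyMMdd": "%Y%m%d",
    "yyyyMMddHH": "%Y%m%d%H",
}


def parse_feed_date(text, date_pattern):
    """Parse the extracted date text using one of the supported patterns."""
    return datetime.strptime(text, PATTERN_TO_STRPTIME[date_pattern])


daily = parse_feed_date("20240614", "yyyyMMdd")
hourly = parse_feed_date("2024061415", "yyyyMMddHH")
print(daily.date(), hourly.hour)
```

If the extracted text does not fit the pattern, `datetime.strptime` raises a `ValueError`, which is a quick way to spot a pattern that does not match your file names.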
- Extraction Type (Index / Regex): Determines the method used to extract the date portion from the file name. It provides flexibility in handling different file naming conventions, whether the date is in a fixed position (Index) or follows a variable pattern (Regex).
- Index: Use character positions to extract the date. If this option is enabled, the following four fields are displayed:
- Start Index: The starting character position where the date portion begins in the file name.
- Example: In a file named “clickstream_20240613.csv”, the date begins at position 12 when counting from 0, so the start index would be 12.
- End Index: The ending character position where the date portion ends in the file name.
- Example: For the same file “clickstream_20240613.csv”, the date ends at position 19 when counting from 0, so the end index would be 19.
- Hourly Start Index: Specifies the starting position of the hour portion in hourly file names. This option is available only when the 'Hourly' feed frequency is selected.
- Example: In a file named “clickstream_20240613-15.csv” (where “15” is the hour), the hyphen sits at position 20 when counting from 0, so the start index for the hour would be 21.
- Hourly End Index: Specifies the ending position of the hour portion in hourly file names. This option is available only when the 'Hourly' feed frequency is selected.
- Example: For the same file “clickstream_20240613-15.csv”, the hour ends at position 22 when counting from 0, so the end index would be 22.
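Index values are easy to get off by one, so it can help to test them against a sample file name first. The sketch below assumes zero-based positions with an inclusive end index, which is inferred from the date example (12 to 19) rather than confirmed product behavior. The helper `extract_by_index` is hypothetical.

```python
def extract_by_index(file_name, start, end):
    """Return the characters from start to end.

    Assumes zero-based positions and an inclusive end index.
    """
    return file_name[start:end + 1]


name = "clickstream_20240613-15.csv"
date_part = extract_by_index(name, 12, 19)  # date portion
hour_part = extract_by_index(name, 21, 22)  # hour portion, after the hyphen at 20
print(date_part, hour_part)
```

Under these assumptions the date is "20240613" and the hour is "15". If the extracted text looks shifted by one character, the event may count positions differently than this sketch does.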
- Regex (Regular Expression): Use a pattern to match and extract the date. If this option is enabled, three additional fields are displayed, as described below. These fields allow for flexible date and time extraction from complex or variable file names, ensuring the Feed Watcher can correctly identify and process new data files.
- Regex Pattern: A regular expression that matches the date portion in the file name.
- Daily Example: .*sample-file_(\d{8}).*.tsv.gz for files like “sample-file_20240613.tsv.gz”.
- Hourly Example: .*sample-file_(\d{8})-(\d{2}).tsv.gz for files like “sample-file_20240613-15.tsv.gz”.
- Group Number: Indicates which group in the regex pattern contains the date portion.
- Example: If the pattern is .*sample-file_(\d{8})-(\d{2}).tsv.gz, group 1 captures the date (yyyyMMdd).
- Hour Group Number: (For hourly files) Specifies which group in the regex contains the hour portion. This option is available only when the 'Hourly' feed frequency is selected.
- Example: Using the same pattern, group 2 captures the hour (HH).
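The regex fields can be tried out locally before they are entered in the event. The following is a minimal sketch using Python's `re` module with the hourly and daily example patterns from above. The helper `extract_groups` is hypothetical, and matching against the whole file name is an assumption about how the event applies the pattern.

```python
import re

HOURLY_PATTERN = r".*sample-file_(\d{8})-(\d{2}).tsv.gz"
DAILY_PATTERN = r".*sample-file_(\d{8}).*.tsv.gz"


def extract_groups(file_name, pattern, date_group, hour_group=None):
    """Return (date, hour) captured by the given group numbers, or None if no match."""
    match = re.fullmatch(pattern, file_name)  # assumption: the whole name must match
    if match is None:
        return None
    date_part = match.group(date_group)
    hour_part = match.group(hour_group) if hour_group is not None else None
    return date_part, hour_part


print(extract_groups("sample-file_20240613-15.tsv.gz", HOURLY_PATTERN, 1, 2))
print(extract_groups("sample-file_20240613.tsv.gz", DAILY_PATTERN, 1))
```

Here Group Number 1 yields the date and Hour Group Number 2 yields the hour. Note that an unescaped `.` in these patterns matches any character; writing `\.` would restrict it to a literal dot.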
Note: Once an event is created, the value of the 'Event Type' field cannot be modified. To alter fields in the 'Type' and 'File Details' sections, you must first set the event to the 'Inactive' state.