The Adobe Analytics (AA) Loader process brings raw Adobe Analytics clickstream data into a customer environment from the location where Adobe places the files (e.g. an S3 bucket or SFTP folder). Once the hit_data and lookup files are copied to the specified location in the environment, the data can be used by the Syntasa Event Enrich process, and external tables can be built so the data is queryable.
Process Configuration
The AA Loader Process Configuration screen has three tabs that define where the files reside, the structure of the filename, the structure of the schema, and the storage rules for the output. Click the AA Loader node to access the editor.
Below are details of each screen and descriptions of each of the fields.
Input
This section provides the information Syntasa needs to understand the files to process, along with the location of the files and the structure of the date string in the filename.
- Source Path - directory within the source files connection (e.g. S3 bucket, GCS bucket, SFTP directory) where the data resides. It is common for Adobe to place files only at the root directory of the source connection; in that case, configure this field with a /, which tells Syntasa the files are at the top-level (root) directory.
Events
Adobe provides one or more sets of files where the collected event (aka hit) data is stored, along with a set of lookups. This section defines the location and filename structure of these files for Syntasa.
- Source File Pattern - filename pattern of the report suite source files. This is needed because it is common for Adobe to place files for multiple report suites into the same directory of the source connection; specifying the filename structure ensures the correct report suite files are picked up (see the sketch after this list).
- Event File Pattern - name of the file where the raw event (or hit) records exist. For Adobe Analytics files this will typically be the hit_data.tsv file.
- File Type - specify whether the files are Tar, Textfile, or Zip. Adobe typically provides the files in Tar format or TSV (Tab Separated Values) format. If the file is in TSV format, select "Textfile" for this field.
- Compression Type - specify the type of compression used on the files. Adobe will typically use Gzip compression, and this can easily be determined by the .gz extension in the file pattern.
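As a rough illustration of how these four settings fit together, the following Python sketch (not Syntasa's implementation; the paths, patterns, and filenames are hypothetical) matches report suite archives against a source file pattern and locates the event file inside each Tar/Gzip archive:

import fnmatch
import os
import tarfile

# Hypothetical values mirroring the fields above.
source_path = "landing"                       # local copy of the source connection root
source_file_pattern = "syntasademo_*.tar.gz"  # report suite filename pattern
event_file_pattern = "hit_data.tsv"           # raw event records inside each archive

for name in os.listdir(source_path):
    if not fnmatch.fnmatch(name, source_file_pattern):
        continue  # skip files belonging to other report suites
    # File Type = Tar, Compression Type = Gzip: tarfile reads .tar.gz directly.
    with tarfile.open(os.path.join(source_path, name), "r:gz") as archive:
        events = [m.name for m in archive.getmembers()
                  if fnmatch.fnmatch(m.name, event_file_pattern)]
        print(name, "->", events)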
Lookups
This section defines the file pattern of the Adobe-provided lookup files.
- Lookup File Pattern - filename pattern of the lookup files. Adobe provides the option to package the lookup files in a separate set of files. Sometimes these files are a different file type than the source files, so pay close attention to the extensions on both sets of files provided by Adobe.
- Lookup Files - list of lookup files packaged with the Adobe clickstream files. Syntasa automatically configures these lookups by default, but also provides the customer with the ability to remove them.
Date Settings
This section defines the filename date structure that Syntasa uses to build partitions when processing the files.
- Date Pattern - defines the pattern of the date within the Adobe source filename. For Adobe, this date is typically in the yyyyMMdd or yyyy-MM-dd format.
- Date Extraction Type - specifies the method of extracting the date via a dropdown menu with options 'Regex' and 'Index'. Regex is the recommended method because Adobe may deliver over 100 files for a single day, and the regex extraction type provides the flexibility to pick up all of them. Index specifies a start position in the filename where the date begins and an end position where it ends, with indexing starting at 0. For example, the filename syntasademo_20180801.tar.gz would have a start index of 12 and an end index of 20.
- Regex Pattern - pattern defining where the file date exists in the filename. For example, syntasademo_(.*).tar.gz creates a group between the underscore and the first dot of the extension.
- Group Number - the regex group (text between parentheses) that Syntasa should use to locate the file date. In the regex pattern example above, only one group is defined, so the group number should be set to 1. A worked example of both extraction types follows this list.
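To make the two extraction types concrete, here is a short Python sketch (illustrative only, not Syntasa's code) that resolves the same date from the example filename using both the documented regex and the documented index positions:

import re
from datetime import datetime

filename = "syntasademo_20180801.tar.gz"

# Regex extraction: group 1 captures the text between the underscore
# and the first dot of the extension (dots escaped for strictness).
date_str = re.match(r"syntasademo_(.*)\.tar\.gz", filename).group(1)
print(date_str)              # 20180801

# Index extraction: start index 12, end index 20 (0-based, end-exclusive).
assert filename[12:20] == date_str

# Parse using the configured date pattern (yyyyMMdd maps to %Y%m%d).
partition_date = datetime.strptime(date_str, "%Y%m%d").date()
print(partition_date)        # 2018-08-01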
Date Manipulation (Optional)
This setting is rarely used for Adobe clickstream files and can usually be ignored for this process. It is only needed if the event data in the files is for a different date than the one in the filename.
- Days - the number of days of difference; this can be a positive or negative value.
- Chronology - in most cases where this configuration is needed, the selected option will be 'Days'. A brief sketch of applying the offset follows.
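For illustration, applying a hypothetical Days value of -1 with a Chronology of 'Days' would shift the partition date like this (a sketch, not Syntasa's implementation):

from datetime import date, timedelta

filename_date = date(2018, 8, 1)   # date extracted from the filename
days = -1                          # Days setting; positive or negative
partition_date = filename_date + timedelta(days=days)
print(partition_date)              # 2018-07-31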
Schema
This section defines the schema, or column headers, of the source files.
- Log Format - defines the label of each column within the source files. Adobe typically provides this as a lookup named "column_headers.tsv"; to use the text within the file, Syntasa requires the tabs to be replaced with commas (as shown in the sketch below). Columns can also be added manually by clicking the plus button, but this isn't recommended for Adobe clickstream, which typically has 1,000+ columns.
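One quick way to produce that comma-separated text is to replace the tabs in the header row of column_headers.tsv, for example with a small Python one-off (this assumes the headers sit on the first line of the file):

# Convert Adobe's tab-separated header row into the comma-separated
# list expected by the Log Format field.
with open("column_headers.tsv") as f:
    header_row = f.readline().rstrip("\n")

log_format = header_row.replace("\t", ",")
print(log_format)   # e.g. hitid_high,hitid_low,date_time,...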
Outputs
This section provides the ability to name the output tables and define how the process output is labeled on the app graph.
Datasets
- Table Name - defines the name of the database table where the output data will be written. Ensure the table name is unique among all other tables within the defined Event Store; otherwise, data previously written by another process will be overwritten.
- Display Name - label of the process output icon displayed on the app graph canvas.
- Load To BQ - this option is only relevant to Google Cloud Platform deployments. BQ stands for BigQuery, and this option creates a BigQuery table from the output. On AWS deployments, the option appears as Load To Redshift; on-premise installations normally write data to HDFS and do not display a Load To option.
- Compression - option to compress the files, reducing the amount of storage required. Compression adds some processing overhead, but if the raw files will be stored indefinitely, compressing them is recommended. If the raw files will be removed after Event Enrich processing, it is recommended to turn Compression off.
- Event Store Name - name of the Event Store selected when initially creating the app. This option is not configurable; if any of the Event Store Name, Database, or Location details are incorrect, back out of the app and make the changes in the Event Stores settings screen.
- Database - name of the database in the Event Store to which data will be written.
- Location - storage bucket or HDFS location where the raw source files will be stored for the Syntasa Event Enrich process to use.
Expected Output
For the AA Loader process, the expected output is that the Adobe clickstream source files are copied from their source location into the environment where Syntasa will process the data. For example, on Google Cloud Platform the files might go from an S3 bucket to a GCS bucket. Once the files are copied over, they are unpacked and, if selected, tables are built for users to query against. These tables can be extremely large, and the data is as-is from Adobe. This is the first step of the Adobe Analytics Input Adapter pipeline and is typically not intended for end users; it prepares the files so the Syntasa Event Enrich process can use them. It is the Event Enrich process that builds the first usable dataset for standard end users and other Syntasa processes to build on.