In the previous article, we learned how to configure the fields that locate the file for ingestion, set the data type (such as Delimited or Apache Logs), and use the incremental load feature to process files uploaded on an hourly or daily basis. In this article, we will focus on configuring the schema.
The Schema screen of the From File process allows you to define the column names and data types of the input file. In this article, we will cover the following topics:
- How to auto-fill the column names in the list
- How to add columns manually
- How to export the column name list
- How to import the column name list
- The relevance of the file_date column
- Validations performed on the column list
- The Time Source column
Let's go through each option available on the Schema screen of the From File process one by one:
Auto Fill
The Auto Fill feature automatically detects and adds columns by reading the ingested file. Based on the file type, it identifies the column names and populates the list accordingly.
For delimited files such as CSV, ensure that the Contains Header toggle on the Input screen is enabled. This setting tells the system that the first row of the incoming file contains column names, allowing accurate column detection.
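For instance, given a hypothetical CSV whose first row is a header, Auto Fill would detect three columns named visitor_id, last_name, and visit_date:

```
visitor_id,last_name,visit_date
101,Smith,2025-01-15
```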
Add Columns
The Add Columns option allows you to manually enter column names. This is useful when you need to add or modify column names based on specific requirements.
When you click Add Columns, a window will appear where you can define your column names and their order. Here’s how it works:
- Enter columns in the exact order you want them, separated by a delimiter (e.g., a comma).
- The system will split your input into individual columns using the delimiter you specify.
- To assign a data type (like number, date, or string), add a space after the column name followed by the type. Example: visitor_id string creates a column named "visitor_id" with a string data type.
- If you don’t specify a data type, the system will default to "string" (text). Example: last_name becomes a column named "last_name" with a string type automatically.
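For example, a hypothetical comma-delimited entry defining four columns, with explicit types on some and the default string type on the others, might look like this:

```
visitor_id string, last_name, signup_date date, page_views number
```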
Once you click 'Apply', these four columns will be added to the list as shown below:
After adding columns, either manually or via the Auto Fill feature, you can make inline edits to modify the order, column name, and data type. Simply click the 'Pencil' icon next to the value you want to update. You can also delete columns if needed.
Understanding the file_date Column and Partitioning
When Incremental Load is enabled on the Input screen, the system automatically adds a file_date column to the schema. This column corresponds to the timestamp embedded in the file name and is essential for incremental loading; the system partitions the data based on this column. For example, a file named web_logs_2025-01-15.csv (a hypothetical name) would yield a file_date value of 2025-01-15. You can also rename the column if you wish.
However, if Incremental Load is disabled, meaning only a single file is used for processing, the file_date column will not be added to the schema. In addition, the Partition column (a Yes/No selection) will no longer be available on the Schema screen.
Any changes made to the schema after running the job will require the Drop & Replace process mode to ensure the output functions correctly.
Export
To export the available column names into an Excel sheet, simply click the Export option. This will download an Excel file containing the complete column list.
Import
If you have a column list in an Excel file that you want to import, click the Import option. This will allow you to upload the Excel file and add the columns to the schema.
To ensure the correct format, you can first export the column list as a template, update the necessary data in the sheet, and then import it back into the system.
The Import screen includes a toggle that automatically replaces special characters in column names with an underscore (_). For example, a hypothetical column named page-views would be imported as page_views. This keeps column names consistent and prevents issues with naming conventions.
Clear
When you click the 'Clear' option, all columns in the list are removed except the file_date column. Because the Incremental Load toggle is enabled on the Input screen in our example, file_date remains mandatory for partitioning the data and cannot be cleared.
Validations
Before saving the column list, the system performs several validations to ensure data integrity. Some of these validations are highlighted in the screenshot below.
Time Source Column
If your input source contains an Epoch or Timestamp column that you plan to use for partitioning later in subsequent processes, the Schema screen in the From File process allows you to define its details. These details are stored as metadata in the output table for future use but do not impact the actual output data of the From File process.
Fields on the Schema screen:
- Time Source – Specifies the column designated for timestamps. This column will be used for partitioning in another process.
- Time Format Type – Defines the format of the Time Source field. You can choose either 'Timestamp' or 'Epoch' from the dropdown.
- Time Format – Allows you to specify the format when Timestamp is selected as the Time Format Type (e.g., dd/MM/yyyy:hh:mm:ss). Ensure that the format follows Java date-pattern conventions; pattern letters are case-sensitive, so dd (day of month) and DD (day of year) mean different things, as the short Java sketch below illustrates.
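The following minimal Java sketch illustrates the convention (the platform does not run this code; it only shows why pattern-letter case matters):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class PatternCaseDemo {
    public static void main(String[] args) {
        Date now = new Date();
        // dd = day of month (01-31), hh = 12-hour clock
        System.out.println(new SimpleDateFormat("dd/MM/yyyy:hh:mm:ss").format(now));
        // DD = day of year (001-366), HH = 24-hour clock: same letters, different meaning
        System.out.println(new SimpleDateFormat("DD/MM/yyyy:HH:mm:ss").format(now));
    }
}
```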
Example Use Case:
- If you use the Transform process with the output of the From File process as input, you can retrieve the partition values using the getPartition() method.
- If the Time Source column details are already defined in the Schema screen, calling getPartition() will automatically fetch the values.
- If the Time Source column details are not defined, you will need to explicitly pass arguments to getPartition(), such as:
getPartition(tablename.timesource_column, timestamp_format, partition_type)
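For illustration, assuming a hypothetical upstream table web_logs with a timestamp column event_time, an explicit call following the pattern above might look like this (partition_type is left as a placeholder, since the accepted values are platform-specific):

```
getPartition(web_logs.event_time, dd/MM/yyyy:hh:mm:ss, partition_type)
```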
Using the Schema screen to define time-related details simplifies partitioning in downstream processes, eliminating the need to manually specify timestamp formats or column names when working with transformations.
Now that we have successfully configured the Input and Schema screens, we can save the settings and run a job to process the data. In the next article, we will explore the fields available on the Output screen.