The From BQ process provides the ability to load data from a BigQuery table into a Hive-structured storage location in the Syntasa environment. Sample use cases include:
- Loading data from BigQuery in one GCP project to another cloud or non-cloud environment for use by Syntasa processes.
- Loading data from a Syntasa BQ process into storage for use by non-BQ Syntasa processes (e.g. Transform, Lookback, Lookahead)
Requirements for using the BQ Loader process:
- A connection to an existing BigQuery connector
- BigQuery table details (e.g. dataset name, table name)
- Schema information of the input table (input columns); the sketch after this list shows one way to confirm these details
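Before configuring the loader, it can help to verify the dataset name, table name, and input columns directly against BigQuery. Below is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery

# Hypothetical project, dataset, and table names for illustration.
client = bigquery.Client(project="my-gcp-project")
table = client.get_table("my-gcp-project.web_analytics.events")

# Confirm the table details and input columns needed by the BQ Loader.
print(f"Table: {table.full_table_id}")
for field in table.schema:
    print(f"{field.name}: {field.field_type}")
```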
Process Configuration
The BQ Loader contains three screens for defining the input parameters, mapping the schema, and specifying where the data is written. Below are details of each screen and descriptions of each of the fields.
- Drag the BQ Loader onto the canvas
- Drag and connect a dataset to the BQ Loader (this is required before the loader can be edited)
- Click on the BQ Loader node to access the editor.
Input
This section provides the information Syntasa needs to configure the BQ Loader.
- Big Query Dataset - defines the BigQuery dataset to be used
- Big Query Table - defines the BigQuery table to be used as the input data source
- Incremental Load - when set to true (green), the incoming data is appended to the table rather than overwritten
- Sharded - when set to true (green), the input table is sharded
Tables in BigQuery are generally sharded using a naming convention such as PREFIX_YYYYMMDD.
| Input Configuration | Interpretation |
| --- | --- |
| Incremental Load = TRUE and Sharded = FALSE | Table is partitioned |
| Incremental Load = TRUE and Sharded = TRUE | Table is sharded |
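To check which case applies to a given source table, a quick inspection with the BigQuery Python client can help. This is a minimal sketch with hypothetical project, dataset, and table names.

```python
from google.cloud import bigquery

# Hypothetical project and dataset names for illustration.
client = bigquery.Client(project="my-gcp-project")

# A date-partitioned table exposes its partitioning spec directly.
table = client.get_table("my-gcp-project.web_analytics.events")
if table.time_partitioning:
    print(f"Partitioned on: {table.time_partitioning.field}")

# A sharded table is really a family of tables named PREFIX_YYYYMMDD;
# listing the dataset shows one table per day instead.
for item in client.list_tables("web_analytics"):
    if item.table_id.startswith("events_"):
        print(item.table_id)  # e.g. events_20240101, events_20240102, ...
```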
Schema
This section is where the time format type is selected, the time format is defined, the log format is described, and the columns and data types are declared and then mapped into the Syntasa schema.
- Time Format Type - either Epoch or Timestamp
- Time Format - the format of the timestamp (e.g. yyyy-mm-dd); this field is only displayed when the Time Format Type is Timestamp
- Log Format - defines the label of each column within the source file; Syntasa requires a comma-delimited format with no whitespace
- Time Source - the column from the BigQuery table that contains the timestamp (e.g. file_date)
The Column and Data Type fields are populated from the Log Format definition, with the default data type set to string, as illustrated below.
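To show how a Log Format definition maps to the Column and Data Type fields, here is a minimal sketch; the column names and the parsing logic are illustrative assumptions, not Syntasa's internal implementation.

```python
# Hypothetical Log Format value: comma delimited, no whitespace.
log_format = "visitor_id,page_url,referrer,file_date"

# Each label becomes a column; the data type defaults to string
# and can be adjusted per column in the Schema screen.
columns = [(name, "string") for name in log_format.split(",")]
for name, dtype in columns:
    print(f"{name}: {dtype}")
```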
Output
This section provides the ability to name the output table and define how the process output should be labeled on the app graph.
Datasets
- Table Name - defines the name of the database table where the output data will be written. Please ensure that the table name is unique among all other tables within the defined Event Store; otherwise, data previously written by another process will be overwritten.
- Display Name - label of the process output icon displayed on the app graph canvas.
- Compression - option to compress the output files, reducing the amount of storage required. Compression adds some processing overhead, but if the raw files will be stored indefinitely, compressing them is recommended. If the raw files will be removed after Event Enrich processing, it is recommended to turn Compression off.
- Event Store Name - name of the Event Store selected when the app was initially created. This option is not configurable; if any of the Event Store Name, Database, or Location details are incorrect, back out of the app and make the changes in the Event Stores settings screen.
- Database - name of the database in the Event Store to which the data will be written.
- Location - storage bucket or HDFS location where the source raw files will be stored for the Syntasa Event Enrich process to use.
Expected Output
For the BQ Loader process, the expected output is the data loaded into the storage of the environment where Syntasa will process it, along with the output table created over that data. Additionally, the table can serve as the foundation for building other datasets.
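Once the process has run, downstream non-BQ processes can read the output table from Hive. Below is a minimal PySpark sketch; the database and table names are hypothetical, and it assumes a Spark session with Hive support rather than any specific Syntasa API.

```python
from pyspark.sql import SparkSession

# Hypothetical database and table names for illustration.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

events = spark.table("event_store_db.bq_loader_output")
events.printSchema()
print(events.count())
```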