This guide is intended to help you expand schemas for the Adobe Adapter to meet your evolving requirements.
It walks you through the milestones of the process so that each step is verified, and so that if the schema expansion doesn't work as expected you have fail-safe options: HIVE schema backups, app schema backups, and the ability to keep querying your data on Big Query.
Article Summary:
- Out of Scope - content not covered by this article
- Prerequisites - what is required to perform a schema expansion
- Excel & Big Query - using Excel to make the required changes and Big Query to verify the current production schema
- Development and Deploy - test in development and deploy changes to production
- Processing & Validation - verifying that production data is still correct and as expected
Out of scope for this guide:
- This guide assumes you are familiar with:
- Apache HIVE and SQL querying.
- GCP Big Query
- Microsoft Excel formulas.
- Google Cloud SDK shell / gcloud commands and SSH clients.
- A text editor such as Notepad++.
Prerequisites:
- Backup (HIVE): Using a gcloud command, SSH into a running master node in the cluster and take a backup of the HIVE production tables using the command below, saving the results to a text file (a per-table sketch follows the command). We used Notepad++ to save our schemas, and we backed up the Event, Product, Session, and Daily Visitor tables.

SHOW CREATE TABLE table_name;
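For reference, a sketch of the backup statements we ran, one per table. The exact table names in your cluster may differ: tb_event and tb_product appear later in this guide, while tb_session and tb_daily_visitor are assumed names.

-- run from the HIVE CLI on the master node; copy each result into Notepad++
SHOW CREATE TABLE tb_event;
SHOW CREATE TABLE tb_product;
SHOW CREATE TABLE tb_session;
SHOW CREATE TABLE tb_daily_visitor;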
- Backup (app): Navigate to your app and click on the production workflow to take a backup of your production schemas (we took a backup of Event, Product, Session, and Daily Visitor). To do the same:
- Click on Mapping
- Click on Export (this will generate an Excel file with your table schema). The file will be saved to your local Downloads folder; we recommend moving the files to the same location as the HIVE backups.
- Repeat steps 1 & 2 for the other processes.
- Rename all of the exported Excel schemas, for example: "Original Production Event Enrich.xlsx", etc.
- Compare HIVE Schemas:
- Once we had the HIVE production schemas, we ran SHOW CREATE TABLE against the development database, copied the output into Notepad++, and did a quick compare against the production backups for datatype differences using its compare capability.
- Compare in-app schemas for source and datatype differences: Once we had the app production schema backups, we repeated the export process for the development workflow. We ran this exercise for Event, Product, Session, and Daily Visitor Enrich.
- Click on the process node you want to back up (we started with Event Enrich), then click on Mapping.
- Click on Export (the file will be saved to your local Downloads folder; we recommend moving the files to the same location as the HIVE backups).
- Repeat steps 1 & 2 for the other processes.
Note: Now, using Excel, we will compare the production (backup) and development schemas (dev Event Enrich vs. prod Event Enrich, etc.), as we assume development may contain un-deployed schema changes which we want to capture in the production schema for use later.
- Using Excel, open Event Enrich for development and production.
- We used VLOOKUP from production, selecting column B (name), looking into development, and pulling in column D (source) to see if there was any enrichment yet to be deployed.
- We then repeated the step, this time pulling in column E (datatype).
- Scroll down the document to visually check for any differences (we found 1 un-deployed source enrichment).
- Next, we created two IF statements to check whether our original columns matched the pulled-in columns (TRUE/FALSE); the IF statements confirmed our visual check findings (see the formula sketch after this list).
- We then cut the text value of our un-deployed change and pasted it into the production schema's source column (column D) for the cust_field it related to.
- Now delete the 4 additional columns we created for our VLOOKUPs and IF statements.
- Finally, we saved our schema as a new copy in the same location as our HIVE exports, named "Deploy Production Event Enrich.xlsx", etc.
- Repeat steps 4.1 to 4.7 for the other schemas.
Note: If other schemas contain no differences between production and development, you don't need to save new copies for them (step 4.7).
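For reference, a minimal sketch of the comparison formulas described above. The development sheet name (Dev) and the helper columns the lookups land in (M and N) are assumptions; the column layout (B = name, D = source, E = datatype) follows the export described in this guide, so adjust the ranges to match your own workbook:
- Pull the development source (column D) for the matching name: =VLOOKUP($B2, Dev!$B:$E, 3, FALSE)
- Pull the development datatype (column E): =VLOOKUP($B2, Dev!$B:$E, 4, FALSE)
- Flag source mismatches (assuming the first lookup landed in column M): =IF(D2=M2, TRUE, FALSE)
- Flag datatype mismatches (assuming the second landed in column N): =IF(E2=N2, TRUE, FALSE)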
- Big Query sanity check: We want to validate that data is present before processing, in both development and production.
- We ran the SQL query below twice: once against the development database and once against production.

SELECT * FROM table_name WHERE event_partition = 'date_here' LIMIT 10;
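As an illustration, assuming hypothetical dataset names analytics_dev and analytics_prod and the tb_event table, the two sanity checks might look like:

-- development
SELECT * FROM analytics_dev.tb_event WHERE event_partition = 'date_here' LIMIT 10;
-- production
SELECT * FROM analytics_prod.tb_event WHERE event_partition = 'date_here' LIMIT 10;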
Update Excel & Big Query Production schemas
Excel:
- Open the Excel file which contains the production schema where we want to add additional columns to the table (for us, this was our "Original Production Product Enrich.xlsx" schema).
- Insert new rows: we inserted 5 after the last cust_field and 5 after the last cust_metric (each row in the schema file represents one table column).
- In column B (name), select the 2 rows above where you inserted your rows, drag your selection across to column K (columntype), and then drag the fill handle down; this will auto-fill the fields and keep the cust_field names sequential.
- Repeat step 3 if needed for cust_metrics.
- Update the order id (column A: Order). Select A2 and A3, and drag down to auto-fill the order all the way to the end.
- Now you can add enrichment in the newly added columns. For us, we will be pulling a cust_field from tb_event (cust_field1 - login_status) into cust_field142.
- Finally, save your changes as a new copy, similar to "Deploy Production Product Enrich.xlsx".
- Repeat 1 to 7 for other schemas if required.
Big Query:
- Navigate to the production database and click on your table (for us it's tb_product).
- Now you should see the schema; scroll to the bottom and click Edit Schema.
- Click Add Field and add your fields.
Note: cust_fields must be of type STRING and cust_metrics of type FLOAT.
- Repeat step 3 to add more columns.
- Once finished, click Save.
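If you prefer SQL over the console UI, newer BigQuery releases also let you add nullable columns with DDL. A sketch, assuming the production dataset is named analytics_prod and using one hypothetical field of each type (FLOAT64 is the standard SQL name for the type the console displays as FLOAT):

-- add one new string field and one new metric field to the product table
ALTER TABLE analytics_prod.tb_product
  ADD COLUMN cust_field143 STRING,
  ADD COLUMN cust_metric48 FLOAT64;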
Update Development
Note: For us, we had to upload schemas for two processes: Event Enrich ("Deploy Production Event Enrich.xlsx", which contained the un-deployed schema changes) and Product Enrich ("Deploy Production Product Enrich.xlsx", where we expanded our schema).
- Go back to your app and unlock the workflow
- Select your process (for us it's Event Enrich first) and click on Mapping.
- Click on Import, find your Excel schema file, and click ‘Apply’.
- Once ready, tick the box on the popup window and click "Save & Lock".
- Repeat steps 3 & 4 for other processes.
- We verified our uploads were successful by opening the processes again, visually inspecting the mapping to check everything looked good, and searching for our new cust_fields/cust_metrics.
Development Process & Deploy
Note: For this to work, we must now process our development workflow with drop and replace; this clears the previous schema in development.
- Select your job if you have one; if not, hold Shift and select your nodes, then click Job and ‘Create New’. Ensure the process mode is drop and replace and process only 1 day; we just want to show that the pipeline has created the new table with the new schema configuration.
- Once the job has completed, select the process preview and then Schema, and search for the new columns; if you created enrichments you should see data in the Preview tab.
Note: Before we can deploy to production, we need to go to Big Query to verify our schema is present with our new changes and check that data has been loaded correctly.

SELECT * FROM table_name WHERE event_partition = 'date_here' LIMIT 10;
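If you'd rather inspect the schema itself than sample rows, BigQuery's INFORMATION_SCHEMA can list the columns and datatypes directly; a sketch, assuming the development dataset is named analytics_dev:

-- new fields should appear at the end with the expected types
SELECT column_name, data_type
FROM analytics_dev.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'tb_product'
ORDER BY ordinal_position;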
We're ready to deploy...
- Click Deploy and provide a snapshot name; ours is "Product Schema expansion".
- Review which processes are highlighted as altered in the "Copy Mode" column (Alter); for those where we didn't change anything, revert Copy Mode from Altered to None.
- Click on Copy Mode and select None from the drop-down.
- Click Deploy, then check the development operations screen for deployment completion.
Production Process & Validate
- Navigate to the production workflow in your app
- Select a historic job to process the full pipeline with process mode replace date range; we want to process 1 day only.
- Once the job is complete, we need to verify the schema changes in HIVE for the production Event & Product tables, as these are the only two tables whose schemas we've changed; we do this using our SHOW CREATE TABLE statement.
- Then compare them to our original HIVE schema backups using Notepad++.
- Next, we verify data has been loaded to Big Query correctly: we need to ensure our new columns are null where we didn't add enrichment. For us, we need to check this in the product table (see the null-check sketch after the query below).
- Then we need to verify the event table is also loading correctly, given we deployed new enrichments, by simply updating the statement below.
- Lastly, we need to verify that data processed before our schema changes returns results correctly when queried; again, we use the SQL statement below to visually verify our results.
Note: This query allows us to inspect a small subset of the table while also checking that our new enrichment is working and populating data correctly.

SELECT * FROM table_name WHERE event_partition = 'date_here' AND cust_field142 != '' LIMIT 10;
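For the null check on the product table mentioned above, a minimal sketch; the dataset name analytics_prod and the un-enriched field cust_field143 are assumptions:

-- rows where a new, un-enriched column is unexpectedly populated; expect 0
SELECT COUNT(*) AS populated_rows
FROM analytics_prod.tb_product
WHERE event_partition = 'date_here'
  AND cust_field143 IS NOT NULL;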
Note: Finally, we recommend you also process in production any other apps that depend on the output of the tables whose schemas we've updated, to ensure scheduled jobs have not been impacted.