In data science, large datasets are often split into smaller, more manageable chunks based on a specific column—most commonly a date column. This approach makes it easier to store, process, and query data efficiently.
In Syntasa native apps, processes are designed to handle both partitioned and non-partitioned data, and can generate outputs in either format. Depending on the app process (e.g., From File, From DB, Spark Processor), the way partitioned or non-partitioned data is handled may differ, but the underlying concept remains the same.
By default, Syntasa native apps partition data by date.
What is Non-Partitioned Data?
Non-partitioned data is stored as a single table or dataset without separation by date or any other column.
Example: A flat table containing all records together, such as one large CSV file with no folder structure.
What is Partitioned Data?
Partitioning means dividing a large dataset into smaller chunks based on a column (typically a date or timestamp). This makes queries and transformations more efficient because only relevant partitions are scanned instead of the entire dataset.
Example: Data organized into folders such as dt=2025-01-01, dt=2025-01-02, and so on.
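To make this concrete, here is a minimal PySpark sketch showing how such a folder structure is produced. The paths, data, and column names are illustrative, not Syntasa-specific:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Illustrative sample data; in practice this comes from your source system.
df = spark.createDataFrame(
    [("2025-01-01", 101, 49.99),
     ("2025-01-01", 102, 15.00),
     ("2025-01-02", 103, 22.50)],
    ["dt", "order_id", "amount"],
)

# partitionBy("dt") creates one sub-directory per distinct dt value,
# e.g. /tmp/orders/dt=2025-01-01/ and /tmp/orders/dt=2025-01-02/.
df.write.mode("overwrite").partitionBy("dt").parquet("/tmp/orders")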
Please note that the data is still stored within the same table, just under a different folder structure. To fetch all the data, you simply run a general query; to fetch data for a specific date range only, add a WHERE condition on the partition column.
Partitioning is particularly useful when you want to:
- Query data for specific dates without scanning the full dataset.
- Organize large volumes of data in a structured way.
- Improve performance in downstream processing and analytics.
Please note that partitioned data is still part of the same logical table, but under the hood it is stored in different folder structures based on the partition column.
For example, suppose your dataset is partitioned by the column order_date. Physically, the data may be stored like this:
/orders/order_date=2025-01-01/part-0000.snappy.parquet
/orders/order_date=2025-01-02/part-0001.snappy.parquet
/orders/order_date=2025-01-03/part-0002.snappy.parquet
From a user’s perspective, you don’t need to worry about these folders—you can simply query the table as a whole:
SELECT * FROM orders;
This query will return all records across all partitions. But if you only want data for a specific date range, you can filter on the partition column (order_date) in your query:
SELECT *
FROM orders
WHERE order_date BETWEEN '2025-01-01' AND '2025-01-03';
In this case, the system will only read the relevant partitions (2025-01-01 to 2025-01-03) instead of scanning the entire dataset. This selective read is what makes partitioning so efficient—it reduces unnecessary I/O and speeds up query execution.
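If you work with the same data through the DataFrame API (for example in a Spark Processor step) rather than SQL, the equivalent partition-pruned read looks like the sketch below, assuming the orders layout shown above is stored at /orders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pruning-demo").getOrCreate()

# Filtering on the partition column lets Spark skip whole directories
# (partition pruning): only order_date=2025-01-01 .. 2025-01-03 are read.
orders = spark.read.parquet("/orders")
subset = orders.filter(
    (orders.order_date >= "2025-01-01") & (orders.order_date <= "2025-01-03")
)
subset.show()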
How Partitioned and Non-Partitioned Data Work in Syntasa
In any Syntasa app process, data is handled in two stages:
- Input Data
Input can come from various sources such as an Event Store, the output of another process, a cloud storage system (e.g., GCS, S3, Azure), or a database connection (e.g., Snowflake, Postgres). This input may itself be partitioned (already divided by date) or non-partitioned (a single dataset).
- Output Data
After the process runs its logic or transformations, the results are written to an output Event Store. The output can be generated as either partitioned or non-partitioned, depending on how you configure the process.
A key point is that partitioned output can still be generated even if the input is non-partitioned—as long as the input data contains a suitable date or timestamp column. For example, if your input CSV file includes an order_date column, you can use it to partition the output.
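A minimal PySpark sketch of that pattern, assuming a flat CSV with an order_date column (the file path and names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-partitioned").getOrCreate()

# Non-partitioned input: a single flat CSV that contains order_date.
raw = spark.read.option("header", True).csv("/tmp/input/orders.csv")

# Derive a partition column from order_date (assumed yyyy-MM-dd), then
# write the output partitioned even though the input was one flat file.
out = raw.withColumn("dt", F.to_date("order_date"))
out.write.mode("overwrite").partitionBy("dt").parquet("/tmp/orders_partitioned")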
Why Partitioned Data Matters: A Practical Use Case
Understanding the difference between partitioned and non-partitioned data in Syntasa is important because it directly impacts:
- Performance - Faster queries and reduced processing costs.
- Organization - Cleaner, date-wise storage of large datasets.
- Flexibility - Ability to choose whether your outputs should be flat (non-partitioned) or structured (partitioned) depending on downstream needs.
To see the value of partitioned data in action, let’s take an e-commerce store example. Every day, your platform generates massive amounts of data from user events, product views, and purchases. All this information is continuously written into your database or Event Store.
Now, as a data scientist, you often don’t need to analyze all historical data at once. Instead, you might want to:
- Run a daily transformation on just yesterday’s data.
- Train a model on the last 7 days of transactions.
- Monitor real-time trends by processing only the most recent partitions.
This is where partitioning becomes extremely powerful.
In Syntasa, you can configure your app processes (e.g., Spark Processor, From File, From DB) to recognize whether input data is already partitioned, or to partition it during output based on a date column in the input source (such as order_date or event_date).
Here’s how it works in practice:
- Suppose your input table contains data from January 1st to January 31st.
- You set up a scheduled job in Syntasa to process only the last one day of data, with the Process mode that suits your requirements.
- At runtime, the system automatically reads only the required partition (e.g., January 30th if the job runs on January 31st), applies your transformations, and writes the transformed subset into an output dataset that is part of the Event Store (a Hive table), as shown in the sketch below.
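A minimal PySpark sketch of that daily pattern, assuming the illustrative /tmp/orders_partitioned layout from earlier and a dt partition column:

from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-job").getOrCreate()

# A daily job only needs yesterday's partition; computing its value and
# filtering on the partition column means only that directory is read.
yesterday = (date.today() - timedelta(days=1)).isoformat()  # e.g. '2025-01-30'

orders = spark.read.parquet("/tmp/orders_partitioned")
daily = orders.filter(orders.dt == yesterday)

# ...apply your transformations here, then write the result back,
# still partitioned by dt, so downstream jobs can prune the same way.
daily.write.mode("append").partitionBy("dt").parquet("/tmp/orders_daily_out")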
This means you avoid scanning unnecessary data, which:
- Saves time (faster queries).
- Reduces compute cost (only relevant partitions are processed).
- Keeps outputs manageable (organized by date for downstream workflows).
The transformed data can then be:
- Used in further Syntasa processes (e.g., feature engineering for ML models).
- Sent back to external storage systems (like GCS, S3, or Snowflake) for analytics, reporting, or sharing with other teams (see the sketch below).
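Handing the transformed output to external storage is then a matter of pointing the writer at a bucket path. A sketch assuming a GCS bucket (the bucket name is hypothetical, and the appropriate cloud connector must be available on the cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("export-demo").getOrCreate()

# Read the transformed output and export it; s3a:// paths work the same
# way for S3 when the corresponding connector is configured.
result = spark.read.parquet("/tmp/orders_daily_out")
result.write.mode("overwrite").partitionBy("dt").parquet(
    "gs://example-analytics-bucket/orders_transformed"
)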
How to Identify Partitioned vs Non-Partitioned Data in the Application?
In the Syntasa application, you can verify whether a dataset is partitioned by navigating to the State screen of the Event Store or output node. If the dataset is partitioned, the State screen will display the data organized by date, as illustrated in the screenshot below.
Under the Details tab, you can view the column that was used for partitioning the data and check whether the data is partitioned on a daily or hourly basis, as shown in the screenshot below.
If the dataset is not partitioned—meaning the data is stored as a single table without being separated by date or other columns—the State screen will display a message at the top stating "Dataset is not partitioned", as shown in the screenshot below.
Hourly Partitions
Partitioning isn’t limited to daily data; in Syntasa, datasets can also be organized at an hourly level. This is especially useful when dealing with high-volume, time-sensitive data such as clickstream logs, IoT sensor readings, or transaction events that arrive continuously throughout the day.
How Hourly Partitioning Works
Input Data:
Your source dataset may already be partitioned hourly, or you may choose to generate hourly partitions at the output stage.
Process Handling:
- Some app processes expect a default Syntasa timestamp format like yyyy-MM-dd-HH.
- Others let you define the format of your timestamp column (e.g., the From File process).
- In a code process like the Spark Processor, you can explicitly transform your timestamp column into the accepted format before reading the input or writing to the Event Store (see the sketch below).
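As an illustration of that last point, here is a minimal PySpark sketch that derives an hourly partition value in the yyyy-MM-dd-HH format from a timestamp column (the paths and column names are assumptions, not Syntasa defaults):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hourly-demo").getOrCreate()

# Assumes the input has a timestamp column named event_time.
events = spark.read.parquet("/tmp/events")

# date_format() renders the timestamp as yyyy-MM-dd-HH; partitioning on
# the derived column yields folders such as dt=2025-01-01-00.
hourly = events.withColumn("dt", F.date_format("event_time", "yyyy-MM-dd-HH"))
hourly.write.mode("overwrite").partitionBy("dt").parquet("/tmp/events_hourly")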
To configure hourly partitioning:
- Go to the Output tab of the process.
- Select Partition Scheme = Hourly.
On execution, the system automatically creates output directories per hour within each date.
For example:
dt=2025-01-01-00
dt=2025-01-01-01
dt=2025-01-01-02
...
dt=2025-01-01-23
Each folder contains only the records generated in that specific hour of that day.
Validating Hourly Partitions
Once the job completes:
- Navigate to the State screen of the Event Store or output node.
- Alongside the date column, you’ll see an additional hour column, confirming that data has been correctly partitioned at an hourly level.