Data Versioning in SQL - MySQL #36

akash-coded · 2023-08-07T16:19:14Z

akash-coded
Aug 7, 2023
Maintainer

Data versioning, also referred to as data version control or data version history, is a mechanism that helps to keep track of different versions of data. It's like 'track changes' in Microsoft Word or 'version history' in Google Docs, but for databases. In a data versioning system, all changes to the data are recorded with information like when the change was made and who made the change. This provides a way to revert back to any previous state of the data, which is very important when dealing with complex databases and systems.

Why is Data Versioning Important?

Data Recovery: If any data gets deleted or corrupted accidentally, data versioning allows us to revert back to the correct version. It acts as a safety net, protecting against data loss.
Audit Trail: It helps in maintaining an audit trail for data. An audit trail is crucial in many industries for compliance with regulations, accountability, and transparency.
Historical Analysis: It provides the ability to analyse historical data, see how data has changed over time and find trends and patterns.
Conflict Resolution: In multi-user environments, data versioning helps to resolve conflicts when multiple users are editing the same data concurrently.
Reproducibility: In data analysis or data science projects, reproducibility is crucial. With data versioning, you can always reproduce the results because you can recreate the exact state of the data at any point in time.

Here are some examples where data versioning can be very useful:

Example Scenario - 1: Managing Product Prices

In an e-commerce company, product prices often change over time due to discounts, promotions, changes in cost, etc. When analysing sales data, it is important to know the price of the product at the time of each sale. If you only keep the latest price in the database, you lose the historical price information.

Here, data versioning can be very useful. We can maintain a product_prices table with columns product_id, price, valid_from, and valid_to. When a price changes, instead of updating the price, we insert a new row with the new price and the current date as valid_from. The valid_to of the previous price is updated to the current date.

This way, for each sale, we can find the product price at the time of the sale by joining the sales table and the product_prices table on product_id and the date of sale between valid_from and valid_to.

Scenario 38: Tracking Changes in Customer Information

Consider a CRM system where customer information like name, address, contact number, etc. is stored. This information can change over time. For example, a customer may move to a new address or change their contact number.

If we simply update the customer information in the database, we lose the previous information. But with data versioning, we can maintain the history of changes. We can have a customer_versions table with a similar structure to the customers table but with additional version, created_at, and created_by columns. Every time a customer's information is updated, we insert a new row into the customer_versions table with the new information and the current date and user as created_at and created_by.

This way, we can see the full history of changes for each customer and can also revert back to any previous state if needed.

These are basic examples. In real-world scenarios, data versioning can be more complex and involve advanced SQL techniques like triggers, stored procedures, and partitioning. However, the basic concept remains the same: instead of deleting or updating data, insert a new version and keep the old versions. This ensures that no data is lost and the full history is available for analysis and auditing.

Data versioning can be applied in more complex scenarios where not only a single table, but multiple related tables are involved.

Example Scenario - 2: Managing Order Status Changes

In a typical e-commerce company, orders pass through several states: placed, confirmed, shipped, delivered, etc. The change in state is often accompanied by other changes. For example, when an order is confirmed, it is assigned to a warehouse; when it is shipped, a tracking number is generated.

A simplistic way to manage this is to have an orders table with status, warehouse_id, tracking_number, and other columns, and update these columns as the order progresses. However, this approach loses the history of changes.

A better approach using data versioning would be to have an order_versions table with the same columns as the orders table plus additional version, changed_at, and changed_by columns. Every time an order's state changes, we insert a new row into the order_versions table with the new state and associated data and the current date and user as changed_at and changed_by.

This way, we have a complete history of each order's state changes. We can see when each change happened, what other data changed at the same time, and who made the change. This is very useful for troubleshooting and auditing. It also allows for more detailed analysis, like calculating the average time an order spends in each state, or identifying bottlenecks in the order processing workflow.

Let's see how we can leverage SQL techniques in data versioning:

CREATE TABLE `order_versions` (
  `order_id` int NOT NULL,
  `version` int NOT NULL,
  `status` varchar(20) NOT NULL,
  `warehouse_id` int DEFAULT NULL,
  `tracking_number` varchar(20) DEFAULT NULL,
  `changed_at` datetime NOT NULL,
  `changed_by` varchar(20) NOT NULL,
  PRIMARY KEY (`order_id`,`version`)
);

-- Insert a new version when order is confirmed
INSERT INTO `order_versions` (`order_id`, `version`, `status`, `warehouse_id`, `changed_at`, `changed_by`)
SELECT `order_id`, MAX(`version`) + 1, 'Confirmed', 123, NOW(), 'User1'
FROM `order_versions`
WHERE `order_id` = 101;

-- Insert a new version when order is shipped
INSERT INTO `order_versions` (`order_id`, `version`, `status`, `tracking_number`, `changed_at`, `changed_by`)
SELECT `order_id`, MAX(`version`) + 1, 'Shipped', 'ABCD1234', NOW(), 'User2'
FROM `order_versions`
WHERE `order_id` = 101;

In this manner, using versioning in more complex and intricate scenarios, such as multi-table or multi-step operations, can be incredibly beneficial. Not only does it provide a solid structure for tracking changes, but it can be the backbone of a powerful audit system, making it a must-know concept for any data engineer.

Sure, let's consider a scenario where we have multiple related tables and we need to keep track of changes across all of them.

Example Scenario - 3: Managing Project Management System

In a project management system, a project involves multiple tasks, and each task is assigned to a user. Each task also goes through different statuses - open, in progress, under review, closed, etc.

In such a system, we have three tables projects, tasks, and task_assignments. The projects table stores project details, the tasks table stores task details and their current status, and the task_assignments table stores which user is assigned to which task.

Now, let's say we want to track all changes to task statuses and task assignments, including who made the change and when. We could do this by creating two versioned tables: task_versions and assignment_versions.

CREATE TABLE `task_versions` (
  `task_id` int NOT NULL,
  `version` int NOT NULL,
  `status` varchar(20) NOT NULL,
  `changed_at` datetime NOT NULL,
  `changed_by` varchar(20) NOT NULL,
  PRIMARY KEY (`task_id`,`version`)
);

CREATE TABLE `assignment_versions` (
  `assignment_id` int NOT NULL,
  `version` int NOT NULL,
  `user_id` int NOT NULL,
  `changed_at` datetime NOT NULL,
  `changed_by` varchar(20) NOT NULL,
  PRIMARY KEY (`assignment_id`,`version`)
);

Every time a task status or a task assignment changes, we insert a new row into the corresponding versioned table with the new status or assigned user, and the current date and user as changed_at and changed_by.

This way, we have a complete history of task status changes and task assignments. We can see when each change happened, who made the change, and even track how long a task stayed in a particular status or how long a user worked on a task.

This is a complex scenario that involves changes in multiple related tables and uses data versioning to keep track of these changes. It is very common in real-world applications, especially in systems that require auditing or detailed analysis of changes.

Sure, let's consider a real-world scenario from the domain of financial services. This is an industry where data versioning is of paramount importance due to regulatory requirements and the need for accurate historical data analysis.

Example Scenario - 4: Financial Transaction History

Imagine you're a data engineer at a large bank. Your bank offers a wide range of financial products to its customers including checking accounts, savings accounts, credit cards, loans, etc. Each of these products has a balance associated with it and transactions that modify that balance.

Regulations require banks to not only keep track of current balances, but also to retain an entire history of all transactions and balance changes over time.

Let's consider the Accounts and Transactions tables. The Accounts table keeps track of the current balance for each account while the Transactions table stores the details of all the transactions.

Now, the ask is to maintain a complete history of the balance of all accounts after every transaction. This is where data versioning comes in.

We could create a versioned table called Account_Balance_History:

CREATE TABLE `Account_Balance_History` (
  `account_id` int NOT NULL,
  `version` int NOT NULL AUTO_INCREMENT,
  `balance` decimal(10,2) NOT NULL,
  `changed_at` datetime NOT NULL,
  `transaction_id` int,
  PRIMARY KEY (`account_id`,`version`)
);

Here, account_id is the id of the account, version is the version number of the balance (incremented each time a transaction is made), balance is the account balance after the transaction, changed_at is the transaction time, and transaction_id is the id of the transaction that resulted in the balance change.

Every time a transaction is posted, besides inserting a row in Transactions table, you would also insert a new row into Account_Balance_History table with the new balance, the transaction time, and transaction id.

This would provide a full audit trail of the balance of each account, which is very useful for both the bank and the regulators. It provides transparency, accountability, and makes debugging easier in case of discrepancies.

Also, this versioned data can be used for time-series analysis, like identifying seasonal trends in customer behavior, analyzing the effect of interest rate changes on account balances, and many more such business-critical analyses.

Remember, though, that this kind of data versioning at such a scale will bring its own challenges such as managing storage and ensuring the performance of the database. Therefore, while designing such a system, data archival strategies and data partitioning might also need to be considered.

Data Archival and Data Partitioning Strategies

Dealing with large amounts of data in a versioned manner can present challenges. Two key strategies for managing these challenges are data archival and data partitioning.

Data Archival

Data archival refers to the process of moving data that is no longer actively used to a separate storage for long-term retention.

In the context of our banking scenario, we might decide to archive transaction history and balance history that is older than a certain number of years, say seven years. This is because this old data is likely accessed infrequently, but still needs to be stored due to regulatory requirements. By moving it to a cheaper, slower storage system, we can save costs and keep our main database lean and performant.

Here's how we might move older data to an archive:

CREATE TABLE `Account_Balance_History_Archive` LIKE `Account_Balance_History`;

INSERT INTO `Account_Balance_History_Archive`
SELECT * FROM `Account_Balance_History`
WHERE `changed_at` < DATE_SUB(NOW(), INTERVAL 7 YEAR);

DELETE FROM `Account_Balance_History`
WHERE `changed_at` < DATE_SUB(NOW(), INTERVAL 7 YEAR);

The above statements create an archive table, move the old data to the archive, and then delete the old data from the original table. This is a simple example; a real-world archival strategy might involve more complexity, like moving the data to a different storage system or compressing the data for storage.

Data Partitioning

Data partitioning is another useful strategy when dealing with large volumes of data. Partitioning refers to splitting a table into smaller, more manageable pieces, while still allowing the database to treat the data as a single table when querying.

Partitioning can be done in several ways, such as range partitioning, list partitioning, and hash partitioning. In our banking scenario, a good candidate for partitioning might be the changed_at column in the Account_Balance_History table, using range partitioning. We might decide to create a new partition for each month of data.

Here's how we might partition this table:

ALTER TABLE `Account_Balance_History`
PARTITION BY RANGE( YEAR(`changed_at`)*100 + MONTH(`changed_at`) ) (
    PARTITION p0 VALUES LESS THAN (202201),  -- data from before February 2022
    PARTITION p1 VALUES LESS THAN (202202),  -- data from February 2022
    PARTITION p2 VALUES LESS THAN (202203),  -- data from March 2022
    PARTITION p3 VALUES LESS THAN MAXVALUE    -- data from April 2022 and onwards
);

With this setup, MySQL will automatically store rows in the appropriate partition based on the changed_at column. This can significantly improve the performance of queries that filter on the changed_at column, as MySQL will only need to scan the relevant partition(s), rather than the entire table.

It's important to note, however, that maintaining partitioned tables can add to administrative overhead, as partitions will need to be added, dropped, or reorganized over time as data grows and evolves. In many databases, this can be automated with scripts or scheduled tasks.

Combining data versioning, archival, and partitioning can help build robust, scalable, and efficient data infrastructures, capable of handling complex, real-world use cases in a performant manner.

Further to archival and partitioning strategies, another strategy to consider for large scale, versioned data is Data Sharding.

Data Sharding

Sharding is a type of database partitioning that separates large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.

For example, if we had a table Account_Balance_History with millions of records, we could shard this data based on account_number. All records for a particular account_number could be stored in a specific shard.

This approach can improve performance by allowing us to query a smaller subset of data and hence reduce the search space. Also, if the shards are set up on different servers, we can leverage parallel processing to further speed up query execution.

Assuming we have already created databases for each shard, here is how you might distribute the data:

-- Suppose shard_1 and shard_2 are our two databases created for sharding

INSERT INTO shard_1.Account_Balance_History
SELECT * FROM Account_Balance_History
WHERE account_number % 2 = 0;  -- even account numbers go to shard_1

INSERT INTO shard_2.Account_Balance_History
SELECT * FROM Account_Balance_History
WHERE account_number % 2 <> 0;  -- odd account numbers go to shard_2

However, sharding can add a layer of complexity to your application, as the application logic needs to know which shard to query for a given account_number. Also, operations that need to span multiple shards (like a join operation across shards, or transactions that involve multiple shards) can be challenging.

Overall, when it comes to dealing with large volumes of data and complex scenarios in a versioned manner, a combination of data versioning strategies, archival strategies, partitioning and sharding can be used. All of these strategies should be tailored to fit the specific requirements and constraints of your application.

While it's important to understand and consider these strategies, always ensure to test them thoroughly in a development or staging environment before deploying to production, and monitor their effects to ensure they're providing the intended benefits.

Examples

Here are examples of how you might implement archival, partitioning, and sharding strategies in the classicmodels database:

Archival

Assume we want to archive orders that are older than 5 years. We could create an archive table and move the old data there:

CREATE TABLE orders_archive AS 
SELECT * 
FROM orders 
WHERE DATEDIFF(CURDATE(), orderDate) > 365 * 5;  -- More than 5 years old

DELETE FROM orders 
WHERE DATEDIFF(CURDATE(), orderDate) > 365 * 5;  -- More than 5 years old

Partitioning

Suppose we want to partition the orders table based on the orderDate. This could be helpful to speed up queries for specific date ranges:

ALTER TABLE orders 
PARTITION BY RANGE (YEAR(orderDate)) (
    PARTITION p0 VALUES LESS THAN (2003),
    PARTITION p1 VALUES LESS THAN (2004),
    PARTITION p2 VALUES LESS THAN (2005),
    PARTITION p3 VALUES LESS THAN MAXVALUE
);

Sharding

In this example, we'll use a sharding approach based on customerNumber. We'll create two new databases classicmodels_shard1 and classicmodels_shard2, and distribute the customers between the shards based on whether their customerNumber is even or odd:

INSERT INTO classicmodels_shard1.customers 
SELECT * 
FROM classicmodels.customers 
WHERE customerNumber % 2 = 0;  -- even customerNumbers go to shard 1

INSERT INTO classicmodels_shard2.customers 
SELECT * 
FROM classicmodels.customers 
WHERE customerNumber % 2 <> 0;  -- odd customerNumbers go to shard 2

Then the application will need to know which shard to query based on the customerNumber.

Please note that these are just examples. The specifics of these strategies would need to be adapted to fit the specific needs and constraints of your database and application, and they should be thoroughly tested to ensure they provide the intended benefits.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Data Versioning in SQL - MySQL #36

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Data Versioning in SQL - MySQL #36

Uh oh!

Uh oh!

akash-coded Aug 7, 2023 Maintainer

Data Archival and Data Partitioning Strategies

Data Sharding

Examples

Replies: 0 comments

akash-coded
Aug 7, 2023
Maintainer