In this practice we are going to explore SQL syntax to insert and update records in Delta tables.
Remember Delta technology provides ACID compliant updates to Delta tables.
For this demonstration we will continue working with our bookstore dataset.
Let us first run our helping notebook to copy the dataset.
And then we will use a CTAS statement to create orders delta table As Select statement from parquet files.
Great! table has been created.
Let’s query this table.
As you can see, parquet files have a well-defined schema, so we managed to extract the data correctly.
When writing to tables, we could be interested by completely overwriting the data in the table.
In fact, there are multiple benefits to overwriting tables instead of deleting and recreating tables.
For example, the old version of the table still exists and can easily retrieve all data using Time Travel.
.
In addition, overwriting a table is much faster because it does not need to list the directory recursively
or delete any files.
In addition, it’s an atomic operation.
Concurrent queries can still read the table while you are overwriting it.
And of course due to the ACID transaction guarantees, if overwriting the table fails, the table will
be in its previous state.
The first method to accomplish complete owerwrite is to use CREATE OR REPLACE TABLE.
Also known as CRAS statement.
CREATE OR REPLACE TABLE statements fully replace the content of a table each time they execute.
Great.
Let us now check our table history.
As you can see, the version 0 is a CREATE TABLE AS SELECT statement.
While, CREATE OR REPLACE statement has generated a new table version.
Great.
The second method to overwrite table data is to use INSERT OVERWRITE statement.
It provides a nearly identical output as above.
It means data in the target table will be replaced by data from the query.
However, INSERT OVERWRITE statement has some differences.
For example, it can only overwrite an existing table and not creating a new one like our CREATE OR
REPLACE statement.
And it can override only the new records that match the current table schema, which means that it is
a safer technique for overwriting an existing table without the risk of modifying the table schema.
Let us run this command.
As you can see here, we have successfully overwriting the table data and rewriting 2150 records.
And we can see our table history again.
As you can see here, the INSERT OVERWRITE operation has been recorded as a new version in the table
as WRITE operation.
And it has the mood “Overwrite”.
And if you try to insert overwrite the data with different schema, for example, here we are adding
a new column of the data for the current timestamp.
By running this command, we see that it generates an exception.
And the exception says a schema mismatch detected when writing to the Delta table.
So the way how they enforce schema on-write is the primary difference between INSERT OVERWRITE and
CREATE OR REPLACE TABLE statements.
Let us now talk about appending records to tables.
The easiest method is to use INSERT INTO statement.
Here we are inserting a new data using an input query that query the parquet files in the orders-new
directory.
Great.
We have successfully added 700 new records to our table.
And we can check the new number of orders.
Yes, indeed.
Now we have 2850 records in the orders table.
The INSERT INTO statement is a simple and efficient operation for inserting new data.
However, it does not have any built in guarantees to prevent inserting the same records multiple times.
It means re-executing the query
will write the same records to the target table resulting in duplicate records.
To resolve this issue,
we can use our second method, which is MERGE INTO statement.
With the merge statement, you can upsert data from a source table, view, or dataframe into the target data table.
It means you can insert, update and delete using the MERGE INTO statements.
Here, we will use the merge operation to update the customer data with updated emails and adding new
customers.
We are creating a temporary view of the new customer data.
And now we can apply the merge operation that says MERGE INTO customers
the new changes coming from customer_updates temp view on the customer ID key.
And we have two actions here.
When match, we do an update and when not match, we do an insert.
In addition, we add extra conditions.
In this case, we are checking that the current row has a null email while the new record does not.
In such a case, we update the email and we also update the last updated timestamp.
And again, if the new record does not match any existing customers based on the customer ID, in this
case, we will insert this new record.
Great.
As we can see here, we have updated 100 records and we have inserted 201 records.
And no records have been deleted.
So in a merge operation, updates, inserts and deletes are completed in a single atomic transaction.
In addition, merge operation is a great solution for avoiding duplicates when inserting records.
Let us see another example.
Here we have new books to be inserted and they are coming in CSV format.
We will create this temporary view against this new data.
Here we have five new books and we are only interested by inserting the computer science books in our
database.
Let us now use the Merge Into statement to update the table books with the data coming from the temporary
view books_updates.
And now we can use the Merge Into statement where we provide only the not match condition.
It means we are only inserting new data if they are not already exist based on our key, which is the
book_id and the title.
In addition, we are specifying the category of the new record to be inserted is only computer science.
Great.
As expected, we are only inserting three new records, which are the three computer science books.
And as we said, one of the main benefits of the merge operation is to avoid duplicate.
So if we try to rerun this statement, it will not reinsert those records as they are already on the
table.
Yes, indeed.
Zero record has been inserted.
Great.
That’s it for this notebook.
See you in the next video.