What is Data Masking?

Data masking is the process of hiding or obfuscating sensitive data so that unauthorized users cannot see the actual values, while authorized users can still access them. It is commonly used to protect Personally Identifiable Information (PII), financial data, or any sensitive business information.

In Databricks, data masking is typically implemented using Unity Catalog, which provides column masks and row filters for dynamic masking at query time. Additionally, you can use Attribute-Based Access Control (ABAC) policies for centralized, scalable enforcement.


Types of Masking in Databricks

  1. Dynamic Masking (recommended): Sensitive data is masked at query runtime. The original values are preserved in storage.
  2. Static Masking / Pseudonymization: Data is transformed and stored in masked form, often for exports or test environments.

Dynamic Column Masking with Unity Catalog

Step 1: Create a Masking Function (UDF)

You first create a SQL UDF that defines how the column should be masked. The function can check attributes such as user group membership.

CREATE OR REPLACE FUNCTION main.email_mask(email STRING)
RETURNS STRING
RETURN CASE
         WHEN is_account_group_member('analyst') THEN email
         ELSE concat(substr(email, 1, 3), '*****@***.com')
       END;

This function shows the real email to users in the analyst group, and a masked version to everyone else.
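
You can sanity-check the function before attaching it to a table by calling it directly; the address below is just a placeholder:

SELECT main.email_mask('jane.doe@example.com');
-- analysts see the full address; everyone else sees jan*****@***.com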


Step 2: Apply the Mask to a Column

Once the function is created, attach it to a column:

ALTER TABLE main.customers
  ALTER COLUMN email
  SET MASK main.email_mask;

Now, whenever a user queries the email column, Databricks will dynamically apply the mask.


Step 3: Testing the Mask

Query the table as two different users:

SELECT customer_id, email FROM main.customers;
  • Analysts will see real email addresses.
  • Other users will see masked values like abc*****@***.com.
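
If you are not sure which result to expect for your own session, you can check your group membership directly:

SELECT is_account_group_member('analyst');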

Step 4: Removing or Updating a Mask

To remove a mask:

ALTER TABLE main.customers
  ALTER COLUMN email
  DROP MASK;

To update a mask, recreate the function and re-apply it.
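
For example, to tighten the mask so that non-analysts see no part of the address (this variant is purely illustrative):

CREATE OR REPLACE FUNCTION main.email_mask(email STRING)
RETURNS STRING
RETURN CASE
         WHEN is_account_group_member('analyst') THEN email
         ELSE '*****@***.com'
       END;

ALTER TABLE main.customers
  ALTER COLUMN email
  SET MASK main.email_mask;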


Row-Level Security with Row Filters

Sometimes you want to restrict rows instead of just masking columns. Row filters make this possible.

CREATE OR REPLACE FUNCTION main.us_only_filter(region STRING)
RETURNS BOOLEAN
RETURN IF(is_account_group_member('admin'), TRUE, region = 'US');

ALTER TABLE main.sales
  SET ROW FILTER main.us_only_filter ON (region);
  • Admins see all regions.
  • Other users only see rows where region = 'US'.
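
As with column masks, a row filter can be removed when it is no longer needed:

ALTER TABLE main.sales
  DROP ROW FILTER;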

Generic Masking for Multiple Fields

You can reuse the same masking function for many columns.

CREATE OR REPLACE FUNCTION main.generic_mask(value STRING)
RETURNS STRING
RETURN CASE
         WHEN is_account_group_member('analyst') THEN value
         ELSE concat(substr(value, 1, 3), '*****MASKED*****')
       END;

ALTER TABLE main.customers
  ALTER COLUMN email SET MASK main.generic_mask;

ALTER TABLE main.customers
  ALTER COLUMN phone SET MASK main.generic_mask;

This ensures consistent masking across multiple sensitive fields.


Attribute-Based Access Control (ABAC) Policies

For large environments, ABAC (currently in preview) allows you to define policies at the catalog or schema level and apply them automatically to tagged columns.

Example: Catalog-Level Policy

CREATE POLICY mask_pii
  ON CATALOG main
  COLUMN MASK main.generic_mask
  TO `account users`
  FOR TABLES
  MATCH COLUMNS hasTagValue('pii', 'true') AS pii_col
  ON COLUMN pii_col;

Now any column in the catalog tagged with pii=true is automatically masked with generic_mask. No need to manually alter every table.
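
For the policy to match anything, the relevant columns need to be tagged with pii, for example:

ALTER TABLE main.customers
  ALTER COLUMN email
  SET TAGS ('pii' = 'true');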


Static Masking with PySpark

If you want to create a permanently masked dataset, you can use PySpark transformations.

from pyspark.sql.functions import col, concat, lit, substring

df = spark.table("main.customers")

masked = (df
  .withColumn("email_masked", concat(substring(col("email"), 1, 3), lit("*****@***.com")))
  .withColumn("phone_masked", concat(substring(col("phone"), 1, 2), lit("XXXXXX")))
  .drop("email", "phone"))

masked.write.format("delta").mode("overwrite").saveAsTable("main.customers_masked")

This produces a new, physically separate table in which only the masked values are stored.
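
If the masked copy still needs to join back to other datasets, pseudonymization with a salted hash is a common alternative. A minimal SQL sketch, where main.customers_pseudo and the salt value are placeholders you would choose yourself:

CREATE OR REPLACE TABLE main.customers_pseudo AS
SELECT
  customer_id,
  sha2(concat(email, 'my-secret-salt'), 256) AS email_hash,  -- stable, non-reversible pseudonym
  sha2(concat(phone, 'my-secret-salt'), 256) AS phone_hash
FROM main.customers;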


How to Check if Masking is Applied

Option 1: Describe the Table

DESCRIBE TABLE EXTENDED main.customers;

Look for the MASK metadata in column definitions.

Option 2: Query Information Schema

SELECT column_name, mask_name
FROM main.information_schema.column_masks
WHERE table_name = 'customers';

If a column appears in this view, it has a mask applied.
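
Row filters can be verified in the same way through the row_filters view, assuming it is available in your metastore:

SELECT table_name, filter_name
FROM main.information_schema.row_filters
WHERE table_name = 'sales';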


Best Practices

  • Prefer dynamic masks for governance; use static masks only when a separate physical copy is required.
  • Centralize control with ABAC policies when possible.
  • Keep masking functions simple to avoid query performance issues.
  • Tag sensitive data consistently (pii, confidential, etc.).
  • Test with different user groups to confirm enforcement.

Conclusion

Data masking in Databricks helps protect sensitive data while still allowing users to access datasets for analytics. Using Unity Catalog’s column masks, row filters, and ABAC policies, you can implement fine-grained, dynamic masking at scale. For scenarios where static copies are required, PySpark-based pseudonymization provides an additional option.

By combining these techniques, you can ensure that your Databricks Lakehouse remains both secure and compliant, without slowing down analytics.
