What is Data Masking?
Data masking is the process of hiding or obfuscating sensitive data so that unauthorized users cannot see the actual values, while authorized users still can. It is commonly used to protect Personally Identifiable Information (PII), financial data, or any other sensitive business information.
In Databricks, data masking is typically implemented using Unity Catalog, which provides column masks and row filters for dynamic masking at query time. Additionally, you can use Attribute-Based Access Control (ABAC) policies for centralized, scalable enforcement.
Types of Masking in Databricks
- Dynamic Masking (recommended): Sensitive data is masked at query runtime. The original values are preserved in storage.
- Static Masking / Pseudonymization: Data is transformed and stored in masked form, often for exports or test environments.
Dynamic Column Masking with Unity Catalog
Step 1: Create a Masking Function (UDF)
You first create a SQL UDF that defines how the column should be masked. The function can check attributes such as user group membership.
CREATE OR REPLACE FUNCTION main.email_mask(email STRING)
RETURNS STRING
RETURN CASE
  WHEN is_account_group_member('analyst') THEN email
  ELSE concat(substr(email, 1, 3), '*****@***.com')
END;
This function shows the real email to users in the analyst group, and a masked version to everyone else.
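Before attaching it to a column, you can sanity-check the function by calling it directly. A quick check (the sample address below is just an illustration):
-- Members of 'analyst' get the address back unchanged; everyone else gets the masked form
SELECT main.email_mask('jane.doe@example.com');
-- e.g. jan*****@***.com for users outside the analyst group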
Step 2: Apply the Mask to a Column
Once the function is created, attach it to a column:
ALTER TABLE main.customers
ALTER COLUMN email
SET MASK main.email_mask;
Now, whenever a user queries the email column, Databricks will dynamically apply the mask.
Step 3: Testing the Mask
Query the table as two different users:
SELECT customer_id, email FROM main.customers;
- Analysts will see real email addresses.
- Other users will see masked values like abc*****@***.com.
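If you are unsure which result applies to your session, you can check your own group membership with the same built-in function the mask uses:
-- TRUE means the first CASE branch applies and you see unmasked emails
SELECT is_account_group_member('analyst');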
Step 4: Removing or Updating a Mask
To remove a mask:
ALTER TABLE main.customers
ALTER COLUMN email
DROP MASK;
To change the masking logic, redefine the function with CREATE OR REPLACE FUNCTION; the column references the function by name, so the new logic applies on subsequent queries. To switch to a different function entirely, run SET MASK again with the new function name.
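For example, here is a sketch of tightening the existing mask for non-analysts (the fully redacted placeholder is just an illustration):
CREATE OR REPLACE FUNCTION main.email_mask(email STRING)
RETURNS STRING
RETURN CASE
  WHEN is_account_group_member('analyst') THEN email
  ELSE '***REDACTED***'   -- stricter mask for everyone else
END;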
Row-Level Security with Row Filters
Sometimes you want to restrict rows instead of just masking columns. Row filters make this possible.
CREATE OR REPLACE FUNCTION main.us_only_filter(region STRING)
RETURNS BOOLEAN
RETURN IF(is_account_group_member('admin'), TRUE, region = 'US');
ALTER TABLE main.sales
SET ROW FILTER main.us_only_filter ON (region);
- Admins see all regions.
- Other users only see rows where region = 'US'.
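A quick way to confirm the filter is to aggregate by region and compare what each group sees (assuming the main.sales table referenced above):
-- Admins get counts for every region; other users only see the 'US' row
SELECT region, count(*) AS row_count
FROM main.sales
GROUP BY region;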
Generic Masking for Multiple Fields
You can reuse the same masking function for many columns.
CREATE OR REPLACE FUNCTION main.generic_mask(value STRING)
RETURNS STRING
RETURN CASE
  WHEN is_account_group_member('analyst') THEN value
  ELSE concat(substr(value, 1, 3), '*****MASKED*****')
END;
ALTER TABLE main.customers
ALTER COLUMN email SET MASK main.generic_mask;
ALTER TABLE main.customers
ALTER COLUMN phone SET MASK main.generic_mask;
This ensures consistent masking across multiple sensitive fields.
Attribute-Based Access Control (ABAC) Policies
For large environments, ABAC allows you to define policies at the catalog or schema level and apply them automatically to tagged columns.
Example: Catalog-Level Policy
CREATE POLICY mask_pii
ON CATALOG main
COMMENT 'Mask any column tagged as PII'
COLUMN MASK main.generic_mask
TO `account users`
FOR TABLES
MATCH COLUMNS hasTagValue('pii', 'true') AS pii_col
ON COLUMN pii_col;
Now any column in the catalog tagged with pii=true is automatically masked with generic_mask for the principals named in the TO clause, with no need to manually alter every table.
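The policy only fires for columns that actually carry the tag, so tagging becomes the switch that turns masking on. Assuming the customers table from earlier:
-- Tag the column; the ABAC policy then masks it automatically
ALTER TABLE main.customers
ALTER COLUMN email
SET TAGS ('pii' = 'true');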
Static Masking with PySpark
If you want to create a permanently masked dataset, you can use PySpark transformations.
from pyspark.sql.functions import col, concat, lit, substring

df = spark.table("main.customers")

# Keep only the first characters and replace the rest with fixed placeholders
masked = (df
    .withColumn("email_masked", concat(substring(col("email"), 1, 3), lit("*****@***.com")))
    .withColumn("phone_masked", concat(substring(col("phone"), 1, 2), lit("XXXXXX")))
    .drop("email", "phone"))
masked.write.format("delta").mode("overwrite").saveAsTable("main.customers_masked")
This produces a separate table in which the masked values are physically stored.
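If the static copy also needs to support joins or deduplication on the masked fields, a hashing-based variant is an option. Here is a minimal sketch in SQL using the built-in sha2 function (customers_pseudo is just an illustrative table name):
-- Hashes are irreversible but deterministic, so equal inputs still match across tables
CREATE OR REPLACE TABLE main.customers_pseudo AS
SELECT
  customer_id,
  sha2(email, 256) AS email_hash,
  sha2(phone, 256) AS phone_hash
FROM main.customers;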
How to Check if Masking is Applied
Option 1: Describe the Table
DESCRIBE TABLE EXTENDED main.customers;
Look for the MASK metadata in column definitions.
Option 2: Query Information Schema
SELECT *
FROM main.information_schema.column_masks
WHERE table_name = 'customers';
Any rows returned here tell you which columns have a mask attached.
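Row filters can be checked the same way; assuming the corresponding information schema view is available in your workspace:
-- Lists row filters attached to the table
SELECT *
FROM main.information_schema.row_filters
WHERE table_name = 'sales';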
Best Practices
- Prefer dynamic masks for governance; use static masks only when you genuinely need a stored, masked copy.
- Centralize control with ABAC policies when possible.
- Keep masking functions simple to avoid query performance issues.
- Tag sensitive data consistently (pii, confidential, etc.).
- Test with different user groups to confirm enforcement.
Conclusion
Data masking in Databricks helps protect sensitive data while still allowing users to access datasets for analytics. Using Unity Catalog’s column masks, row filters, and ABAC policies, you can implement fine-grained, dynamic masking at scale. For scenarios where static copies are required, PySpark-based pseudonymization provides an additional option.
By combining these techniques, you can ensure that your Databricks Lakehouse remains both secure and compliant, without slowing down analytics.