Most modern data pipelines ingest semi-structured data — REST API responses, event tracking payloads, webhook bodies, and Kafka messages all arrive as JSON. Knowing how to parse, flatten, and query nested JSON directly in SQL is a fundamental skill for anyone building pipelines on Databricks, Snowflake, or PostgreSQL.
This tutorial covers the core JSON operations with real examples, mapped to the syntax differences across the three most common platforms.
The Dataset
Raw JSON data as it arrives from an API — stored as a VARCHAR/STRING column in your Bronze table.
CREATE TABLE bronze.raw_events (
event_id INT,
payload VARCHAR(5000), -- raw JSON string
ingested_at TIMESTAMP
);
INSERT INTO bronze.raw_events VALUES
(1, '{"user_id": 101, "event": "purchase", "amount": 250.00, "currency": "USD",
"product": {"id": 10, "name": "Laptop", "category": "Hardware"},
"tags": ["premium", "new_customer"],
"metadata": {"source": "mobile_app", "version": "3.2.1"}}',
CURRENT_TIMESTAMP),
(2, '{"user_id": 102, "event": "view", "amount": null, "currency": "USD",
"product": {"id": 11, "name": "Phone", "category": "Electronics"},
"tags": ["returning_customer"],
"metadata": {"source": "web", "version": "3.2.0"}}',
CURRENT_TIMESTAMP),
(3, '{"user_id": 103, "event": "purchase", "amount": 89.99, "currency": "AED",
"product": {"id": 10, "name": "Laptop", "category": "Hardware"},
"tags": ["premium", "bulk_buyer", "vip"],
"metadata": {"source": "mobile_app", "version": "3.2.1"}}',
CURRENT_TIMESTAMP);
PostgreSQL — JSON Operations
PostgreSQL has two JSON types: JSON (stored as text, validated) and JSONB (stored in binary, indexable and faster for queries). Use JSONB for any column you will query frequently.
Extracting scalar values
-- -> operator: returns JSON object
-- ->> operator: returns text (use for scalar values)
SELECT
event_id,
payload::JSONB ->> 'user_id' AS user_id,
payload::JSONB ->> 'event' AS event_type,
(payload::JSONB ->> 'amount')::DECIMAL AS amount,
payload::JSONB -> 'product' ->> 'name' AS product_name,
payload::JSONB -> 'product' ->> 'category' AS category,
payload::JSONB -> 'metadata' ->> 'source' AS source
FROM bronze.raw_events;
Output:
| event_id | user_id | event_type | amount | product_name | category | source |
|---|---|---|---|---|---|---|
| 1 | 101 | purchase | 250.00 | Laptop | Hardware | mobile_app |
| 2 | 102 | view | NULL | Phone | Electronics | web |
| 3 | 103 | purchase | 89.99 | Laptop | Hardware | mobile_app |
-> 'key' returns the JSON sub-object. ->> 'key' returns the value as text. For nested paths: -> 'product' ->> 'name' goes into product, then gets name as text.
Extracting array elements (PostgreSQL)
-- Get the first tag in the tags array
SELECT
event_id,
payload::JSONB -> 'tags' ->> 0 AS first_tag,
jsonb_array_length(payload::JSONB -> 'tags') AS tag_count
FROM bronze.raw_events;
Output:
| event_id | first_tag | tag_count |
|---|---|---|
| 1 | premium | 2 |
| 2 | returning_customer | 1 |
| 3 | premium | 3 |
Explode array to rows (PostgreSQL jsonb_array_elements)
-- One row per tag
SELECT
event_id,
tag_value::TEXT AS tag
FROM bronze.raw_events,
jsonb_array_elements_text(payload::JSONB -> 'tags') AS tag_value;
Output:
| event_id | tag |
|---|---|
| 1 | premium |
| 1 | new_customer |
| 2 | returning_customer |
| 3 | premium |
| 3 | bulk_buyer |
| 3 | vip |
Databricks / Spark SQL — JSON Operations
Spark SQL uses from_json() to parse JSON strings into a structured type, then dot notation to access fields.
Define the schema and parse
-- In Databricks: parse JSON string using from_json with an explicit schema
SELECT
event_id,
from_json(payload, 'STRUCT<
user_id:INT,
event:STRING,
amount:DOUBLE,
currency:STRING,
product:STRUCT<id:INT, name:STRING, category:STRING>,
tags:ARRAY<STRING>,
metadata:STRUCT<source:STRING, version:STRING>
>') AS parsed
FROM bronze.raw_events;
Access nested fields with dot notation
WITH parsed_events AS (
SELECT
event_id,
from_json(payload, 'STRUCT<
user_id:INT, event:STRING, amount:DOUBLE, currency:STRING,
product:STRUCT<id:INT, name:STRING, category:STRING>,
tags:ARRAY<STRING>,
metadata:STRUCT<source:STRING, version:STRING>
>') AS p
FROM bronze.raw_events
)
SELECT
event_id,
p.user_id,
p.event AS event_type,
p.amount,
p.currency,
p.product.name AS product_name,
p.product.category AS category,
p.metadata.source AS source,
SIZE(p.tags) AS tag_count
FROM parsed_events;
Output:
| event_id | user_id | event_type | amount | currency | product_name | category | source | tag_count |
|---|---|---|---|---|---|---|---|---|
| 1 | 101 | purchase | 250.00 | USD | Laptop | Hardware | mobile_app | 2 |
| 2 | 102 | view | NULL | USD | Phone | Electronics | web | 1 |
| 3 | 103 | purchase | 89.99 | AED | Laptop | Hardware | mobile_app | 3 |
Explode array to rows (Databricks EXPLODE)
WITH parsed_events AS (
SELECT event_id,
from_json(payload, 'STRUCT<user_id:INT, tags:ARRAY<STRING>>') AS p
FROM bronze.raw_events
)
SELECT
event_id,
p.user_id,
tag
FROM parsed_events
LATERAL VIEW EXPLODE(p.tags) tags_table AS tag;
Or using the SQL EXPLODE function directly:
WITH parsed AS (
SELECT event_id,
from_json(payload, 'STRUCT<user_id:INT, tags:ARRAY<STRING>>') AS p
FROM bronze.raw_events
)
SELECT event_id, p.user_id, EXPLODE(p.tags) AS tag
FROM parsed;
Output:
| event_id | user_id | tag |
|---|---|---|
| 1 | 101 | premium |
| 1 | 101 | new_customer |
| 2 | 102 | returning_customer |
| 3 | 103 | premium |
| 3 | 103 | bulk_buyer |
| 3 | 103 | vip |
Infer schema automatically (Databricks shortcut)
-- Let Spark infer the schema from the data (useful for exploration)
SELECT
event_id,
get_json_object(payload, '$.user_id') AS user_id,
get_json_object(payload, '$.event') AS event_type,
get_json_object(payload, '$.product.name') AS product_name,
get_json_object(payload, '$.metadata.source') AS source
FROM bronze.raw_events;
get_json_object() uses JSONPath syntax — a quick way to extract individual fields without defining a full schema first. Use this for exploration; use from_json() with an explicit schema in production.
Snowflake — JSON Operations
Snowflake uses the VARIANT type for semi-structured data and uses colon : notation to access fields.
-- Snowflake: parse string to VARIANT
SELECT
event_id,
PARSE_JSON(payload) AS parsed_payload
FROM bronze.raw_events;
-- Access fields with colon notation
SELECT
event_id,
PARSE_JSON(payload):user_id::INT AS user_id,
PARSE_JSON(payload):event::STRING AS event_type,
PARSE_JSON(payload):amount::DECIMAL(10,2) AS amount,
PARSE_JSON(payload):product:name::STRING AS product_name,
PARSE_JSON(payload):product:category::STRING AS category,
PARSE_JSON(payload):metadata:source::STRING AS source
FROM bronze.raw_events;
Output: Same as above — identical results, different syntax.
The ::TYPE casting is important in Snowflake — without it, all extracted values are VARIANT type, which behaves unexpectedly in comparisons and aggregations.
Flatten arrays in Snowflake
-- FLATTEN is Snowflake's equivalent of EXPLODE
SELECT
r.event_id,
PARSE_JSON(r.payload):user_id::INT AS user_id,
f.value::STRING AS tag
FROM bronze.raw_events r,
LATERAL FLATTEN(INPUT => PARSE_JSON(r.payload):tags) f;
Output: Same one-row-per-tag output as the PostgreSQL and Databricks examples.
Building a Silver Table from JSON Bronze
This is the complete pipeline step — parsing the JSON Bronze table into a clean, typed Silver table.
-- Insert parsed, flattened events into Silver
INSERT INTO silver.events (
event_id, user_id, event_type, amount, currency,
product_id, product_name, category,
source, app_version, tags,
ingested_at
)
-- Databricks version:
WITH parsed AS (
SELECT
event_id,
ingested_at,
from_json(payload, 'STRUCT<
user_id:INT, event:STRING, amount:DOUBLE, currency:STRING,
product:STRUCT<id:INT, name:STRING, category:STRING>,
tags:ARRAY<STRING>,
metadata:STRUCT<source:STRING, version:STRING>
>') AS p
FROM bronze.raw_events
WHERE ingested_at >= '2024-03-01'
)
SELECT
event_id,
p.user_id,
p.event AS event_type,
p.amount,
p.currency,
p.product.id AS product_id,
p.product.name AS product_name,
p.product.category AS category,
p.metadata.source AS source,
p.metadata.version AS app_version,
ARRAY_JOIN(p.tags, ',') AS tags, -- store as comma-separated string
ingested_at
FROM parsed
WHERE p.user_id IS NOT NULL
AND p.event IS NOT NULL;
Common Mistakes
Mistake 1 — Not casting extracted JSON values
In Snowflake and Spark, payload:amount returns a VARIANT/ANY type. SUM(payload:amount) or WHERE payload:amount > 100 will either error or produce wrong results without explicit casting: payload:amount::DECIMAL.
Mistake 2 — Using get_json_object in production pipelines
get_json_object(payload, '$.product.name') parses the entire JSON string for every column you extract. If you need 10 fields, the JSON is parsed 10 times. Use from_json() once and access all fields from the resulting struct.
Mistake 3 — Storing JSON as STRING in Silver
Silver should have proper typed columns, not JSON strings. The Silver table is where you parse and flatten — downstream Gold queries should never need to call from_json().
Mistake 4 — EXPLODE without knowing cardinality
If your tags array has 1,000 items per row and you EXPLODE it before filtering, you multiply your data by 1,000 immediately. Filter first, then EXPLODE.
Quick Reference — JSON Syntax Across Platforms
| Operation | PostgreSQL (JSONB) | Databricks (Spark SQL) | Snowflake (VARIANT) |
|---|---|---|---|
| Parse string to JSON | col::JSONB | from_json(col, schema) | PARSE_JSON(col) |
| Get string field | col ->> 'key' | col.key | col:key::STRING |
| Get nested field | col -> 'a' ->> 'b' | col.a.b | col:a:b::STRING |
| Get array element | col -> 'arr' ->> 0 | col.arr[0] | col:arr[0]::STRING |
| Explode array | jsonb_array_elements_text() | EXPLODE(col.arr) | LATERAL FLATTEN(col:arr) |
| Extract via path | jsonb_extract_path_text() | get_json_object(col, '$.path') | col:path::STRING |
What to Learn Next
Parsing JSON is usually the first step when onboarding a new data source. The next tutorial covers the full SQL workflow for onboarding a new source table into your pipeline — from schema inspection and profiling through to writing the first Silver transformation.