SQL for JSON and Semi-Structured Data — Parsing, Flattening, and Exploding in Databricks, Snowflake, and PostgreSQL

Most modern data pipelines ingest semi-structured data — REST API responses, event tracking payloads, webhook bodies, and Kafka messages all arrive as JSON. Knowing how to parse, flatten, and query nested JSON directly in SQL is a fundamental skill for anyone building pipelines on Databricks, Snowflake, or PostgreSQL.

This tutorial covers the core JSON operations with real examples, mapped to the syntax differences across the three most common platforms.

The Dataset

Raw JSON data as it arrives from an API — stored as a VARCHAR/STRING column in your Bronze table.

CREATE TABLE bronze.raw_events (
    event_id   INT,
    payload    VARCHAR(5000),  -- raw JSON string
    ingested_at TIMESTAMP
);

INSERT INTO bronze.raw_events VALUES
(1, '{"user_id": 101, "event": "purchase", "amount": 250.00, "currency": "USD",
      "product": {"id": 10, "name": "Laptop", "category": "Hardware"},
      "tags": ["premium", "new_customer"],
      "metadata": {"source": "mobile_app", "version": "3.2.1"}}',
   CURRENT_TIMESTAMP),

(2, '{"user_id": 102, "event": "view", "amount": null, "currency": "USD",
      "product": {"id": 11, "name": "Phone", "category": "Electronics"},
      "tags": ["returning_customer"],
      "metadata": {"source": "web", "version": "3.2.0"}}',
   CURRENT_TIMESTAMP),

(3, '{"user_id": 103, "event": "purchase", "amount": 89.99, "currency": "AED",
      "product": {"id": 10, "name": "Laptop", "category": "Hardware"},
      "tags": ["premium", "bulk_buyer", "vip"],
      "metadata": {"source": "mobile_app", "version": "3.2.1"}}',
   CURRENT_TIMESTAMP);

PostgreSQL — JSON Operations

PostgreSQL has two JSON types: JSON (stored as text, validated) and JSONB (stored in binary, indexable and faster for queries). Use JSONB for any column you will query frequently.

Extracting scalar values

-- -> operator: returns JSON object
-- ->> operator: returns text (use for scalar values)
SELECT
    event_id,
    payload::JSONB ->> 'user_id'                                AS user_id,
    payload::JSONB ->> 'event'                                  AS event_type,
    (payload::JSONB ->> 'amount')::DECIMAL                      AS amount,
    payload::JSONB -> 'product' ->> 'name'                      AS product_name,
    payload::JSONB -> 'product' ->> 'category'                  AS category,
    payload::JSONB -> 'metadata' ->> 'source'                   AS source
FROM bronze.raw_events;

Output:

event_id	user_id	event_type	amount	product_name	category	source
1	101	purchase	250.00	Laptop	Hardware	mobile_app
2	102	view	NULL	Phone	Electronics	web
3	103	purchase	89.99	Laptop	Hardware	mobile_app

-> 'key' returns the JSON sub-object. ->> 'key' returns the value as text. For nested paths: -> 'product' ->> 'name' goes into product, then gets name as text.

Extracting array elements (PostgreSQL)

-- Get the first tag in the tags array
SELECT
    event_id,
    payload::JSONB -> 'tags' ->> 0                    AS first_tag,
    jsonb_array_length(payload::JSONB -> 'tags')      AS tag_count
FROM bronze.raw_events;

Output:

event_id	first_tag	tag_count
1	premium	2
2	returning_customer	1
3	premium	3

Explode array to rows (PostgreSQL jsonb_array_elements)

-- One row per tag
SELECT
    event_id,
    tag_value::TEXT AS tag
FROM bronze.raw_events,
    jsonb_array_elements_text(payload::JSONB -> 'tags') AS tag_value;

Output:

event_id	tag
1	premium
1	new_customer
2	returning_customer
3	premium
3	bulk_buyer
3	vip

Databricks / Spark SQL — JSON Operations

Spark SQL uses from_json() to parse JSON strings into a structured type, then dot notation to access fields.

Define the schema and parse

-- In Databricks: parse JSON string using from_json with an explicit schema
SELECT
    event_id,
    from_json(payload, 'STRUCT<
        user_id:INT,
        event:STRING,
        amount:DOUBLE,
        currency:STRING,
        product:STRUCT<id:INT, name:STRING, category:STRING>,
        tags:ARRAY<STRING>,
        metadata:STRUCT<source:STRING, version:STRING>
    >') AS parsed
FROM bronze.raw_events;

Access nested fields with dot notation

WITH parsed_events AS (
    SELECT
        event_id,
        from_json(payload, 'STRUCT<
            user_id:INT, event:STRING, amount:DOUBLE, currency:STRING,
            product:STRUCT<id:INT, name:STRING, category:STRING>,
            tags:ARRAY<STRING>,
            metadata:STRUCT<source:STRING, version:STRING>
        >') AS p
    FROM bronze.raw_events
)
SELECT
    event_id,
    p.user_id,
    p.event        AS event_type,
    p.amount,
    p.currency,
    p.product.name     AS product_name,
    p.product.category AS category,
    p.metadata.source  AS source,
    SIZE(p.tags)       AS tag_count
FROM parsed_events;

Output:

event_id	user_id	event_type	amount	currency	product_name	category	source	tag_count
1	101	purchase	250.00	USD	Laptop	Hardware	mobile_app	2
2	102	view	NULL	USD	Phone	Electronics	web	1
3	103	purchase	89.99	AED	Laptop	Hardware	mobile_app	3

Explode array to rows (Databricks EXPLODE)

WITH parsed_events AS (
    SELECT event_id,
           from_json(payload, 'STRUCT<user_id:INT, tags:ARRAY<STRING>>') AS p
    FROM bronze.raw_events
)
SELECT
    event_id,
    p.user_id,
    tag
FROM parsed_events
LATERAL VIEW EXPLODE(p.tags) tags_table AS tag;

Or using the SQL EXPLODE function directly:

WITH parsed AS (
    SELECT event_id,
           from_json(payload, 'STRUCT<user_id:INT, tags:ARRAY<STRING>>') AS p
    FROM bronze.raw_events
)
SELECT event_id, p.user_id, EXPLODE(p.tags) AS tag
FROM parsed;

Output:

event_id	user_id	tag
1	101	premium
1	101	new_customer
2	102	returning_customer
3	103	premium
3	103	bulk_buyer
3	103	vip

Infer schema automatically (Databricks shortcut)

-- Let Spark infer the schema from the data (useful for exploration)
SELECT
    event_id,
    get_json_object(payload, '$.user_id')           AS user_id,
    get_json_object(payload, '$.event')             AS event_type,
    get_json_object(payload, '$.product.name')      AS product_name,
    get_json_object(payload, '$.metadata.source')   AS source
FROM bronze.raw_events;

get_json_object() uses JSONPath syntax — a quick way to extract individual fields without defining a full schema first. Use this for exploration; use from_json() with an explicit schema in production.

Snowflake — JSON Operations

Snowflake uses the VARIANT type for semi-structured data and uses colon : notation to access fields.

-- Snowflake: parse string to VARIANT
SELECT
    event_id,
    PARSE_JSON(payload)                          AS parsed_payload
FROM bronze.raw_events;

-- Access fields with colon notation
SELECT
    event_id,
    PARSE_JSON(payload):user_id::INT            AS user_id,
    PARSE_JSON(payload):event::STRING           AS event_type,
    PARSE_JSON(payload):amount::DECIMAL(10,2)   AS amount,
    PARSE_JSON(payload):product:name::STRING    AS product_name,
    PARSE_JSON(payload):product:category::STRING AS category,
    PARSE_JSON(payload):metadata:source::STRING  AS source
FROM bronze.raw_events;

Output: Same as above — identical results, different syntax.

The ::TYPE casting is important in Snowflake — without it, all extracted values are VARIANT type, which behaves unexpectedly in comparisons and aggregations.

Flatten arrays in Snowflake

-- FLATTEN is Snowflake's equivalent of EXPLODE
SELECT
    r.event_id,
    PARSE_JSON(r.payload):user_id::INT AS user_id,
    f.value::STRING AS tag
FROM bronze.raw_events r,
    LATERAL FLATTEN(INPUT => PARSE_JSON(r.payload):tags) f;

Output: Same one-row-per-tag output as the PostgreSQL and Databricks examples.

Building a Silver Table from JSON Bronze

This is the complete pipeline step — parsing the JSON Bronze table into a clean, typed Silver table.

-- Insert parsed, flattened events into Silver
INSERT INTO silver.events (
    event_id, user_id, event_type, amount, currency,
    product_id, product_name, category,
    source, app_version, tags,
    ingested_at
)
-- Databricks version:
WITH parsed AS (
    SELECT
        event_id,
        ingested_at,
        from_json(payload, 'STRUCT<
            user_id:INT, event:STRING, amount:DOUBLE, currency:STRING,
            product:STRUCT<id:INT, name:STRING, category:STRING>,
            tags:ARRAY<STRING>,
            metadata:STRUCT<source:STRING, version:STRING>
        >') AS p
    FROM bronze.raw_events
    WHERE ingested_at >= '2024-03-01'
)
SELECT
    event_id,
    p.user_id,
    p.event            AS event_type,
    p.amount,
    p.currency,
    p.product.id       AS product_id,
    p.product.name     AS product_name,
    p.product.category AS category,
    p.metadata.source  AS source,
    p.metadata.version AS app_version,
    ARRAY_JOIN(p.tags, ',') AS tags,   -- store as comma-separated string
    ingested_at
FROM parsed
WHERE p.user_id IS NOT NULL
  AND p.event   IS NOT NULL;

Common Mistakes

Mistake 1 — Not casting extracted JSON values

In Snowflake and Spark, payload:amount returns a VARIANT/ANY type. SUM(payload:amount) or WHERE payload:amount > 100 will either error or produce wrong results without explicit casting: payload:amount::DECIMAL.

Mistake 2 — Using get_json_object in production pipelines

get_json_object(payload, '$.product.name') parses the entire JSON string for every column you extract. If you need 10 fields, the JSON is parsed 10 times. Use from_json() once and access all fields from the resulting struct.

Mistake 3 — Storing JSON as STRING in Silver

Silver should have proper typed columns, not JSON strings. The Silver table is where you parse and flatten — downstream Gold queries should never need to call from_json().

Mistake 4 — EXPLODE without knowing cardinality

If your tags array has 1,000 items per row and you EXPLODE it before filtering, you multiply your data by 1,000 immediately. Filter first, then EXPLODE.

Quick Reference — JSON Syntax Across Platforms

Operation	PostgreSQL (JSONB)	Databricks (Spark SQL)	Snowflake (VARIANT)
Parse string to JSON	`col::JSONB`	`from_json(col, schema)`	`PARSE_JSON(col)`
Get string field	`col ->> 'key'`	`col.key`	`col:key::STRING`
Get nested field	`col -> 'a' ->> 'b'`	`col.a.b`	`col:a:b::STRING`
Get array element	`col -> 'arr' ->> 0`	`col.arr[0]`	`col:arr[0]::STRING`
Explode array	`jsonb_array_elements_text()`	`EXPLODE(col.arr)`	`LATERAL FLATTEN(col:arr)`
Extract via path	`jsonb_extract_path_text()`	`get_json_object(col, '$.path')`	`col:path::STRING`

What to Learn Next

Parsing JSON is usually the first step when onboarding a new data source. The next tutorial covers the full SQL workflow for onboarding a new source table into your pipeline — from schema inspection and profiling through to writing the first Silver transformation.

The Dataset

PostgreSQL — JSON Operations

Extracting scalar values

Extracting array elements (PostgreSQL)

Explode array to rows (PostgreSQL jsonb_array_elements)

Databricks / Spark SQL — JSON Operations

Define the schema and parse

Access nested fields with dot notation

Explode array to rows (Databricks EXPLODE)

Infer schema automatically (Databricks shortcut)

Snowflake — JSON Operations

Flatten arrays in Snowflake

Building a Silver Table from JSON Bronze

Common Mistakes

Quick Reference — JSON Syntax Across Platforms

What to Learn Next

Leave a Comment Cancel reply