SQL for JSON and Semi-Structured Data — Parsing, Flattening, and Exploding in Databricks, Snowflake, and PostgreSQL

Most modern data pipelines ingest semi-structured data — REST API responses, event tracking payloads, webhook bodies, and Kafka messages all arrive as JSON. Knowing how to parse, flatten, and query nested JSON directly in SQL is a fundamental skill for anyone building pipelines on Databricks, Snowflake, or PostgreSQL.

This tutorial covers the core JSON operations with real examples, mapped to the syntax differences across the three most common platforms.


The Dataset

Raw JSON data as it arrives from an API — stored as a VARCHAR/STRING column in your Bronze table.

CREATE TABLE bronze.raw_events (
    event_id   INT,
    payload    VARCHAR(5000),  -- raw JSON string
    ingested_at TIMESTAMP
);

INSERT INTO bronze.raw_events VALUES
(1, '{"user_id": 101, "event": "purchase", "amount": 250.00, "currency": "USD",
      "product": {"id": 10, "name": "Laptop", "category": "Hardware"},
      "tags": ["premium", "new_customer"],
      "metadata": {"source": "mobile_app", "version": "3.2.1"}}',
   CURRENT_TIMESTAMP),

(2, '{"user_id": 102, "event": "view", "amount": null, "currency": "USD",
      "product": {"id": 11, "name": "Phone", "category": "Electronics"},
      "tags": ["returning_customer"],
      "metadata": {"source": "web", "version": "3.2.0"}}',
   CURRENT_TIMESTAMP),

(3, '{"user_id": 103, "event": "purchase", "amount": 89.99, "currency": "AED",
      "product": {"id": 10, "name": "Laptop", "category": "Hardware"},
      "tags": ["premium", "bulk_buyer", "vip"],
      "metadata": {"source": "mobile_app", "version": "3.2.1"}}',
   CURRENT_TIMESTAMP);

PostgreSQL — JSON Operations

PostgreSQL has two JSON types: JSON (stored as text, validated) and JSONB (stored in binary, indexable and faster for queries). Use JSONB for any column you will query frequently.

Extracting scalar values

-- -> operator: returns JSON object
-- ->> operator: returns text (use for scalar values)
SELECT
    event_id,
    payload::JSONB ->> 'user_id'                                AS user_id,
    payload::JSONB ->> 'event'                                  AS event_type,
    (payload::JSONB ->> 'amount')::DECIMAL                      AS amount,
    payload::JSONB -> 'product' ->> 'name'                      AS product_name,
    payload::JSONB -> 'product' ->> 'category'                  AS category,
    payload::JSONB -> 'metadata' ->> 'source'                   AS source
FROM bronze.raw_events;

Output:

event_iduser_idevent_typeamountproduct_namecategorysource
1101purchase250.00LaptopHardwaremobile_app
2102viewNULLPhoneElectronicsweb
3103purchase89.99LaptopHardwaremobile_app

-> 'key' returns the JSON sub-object. ->> 'key' returns the value as text. For nested paths: -> 'product' ->> 'name' goes into product, then gets name as text.

Extracting array elements (PostgreSQL)

-- Get the first tag in the tags array
SELECT
    event_id,
    payload::JSONB -> 'tags' ->> 0                    AS first_tag,
    jsonb_array_length(payload::JSONB -> 'tags')      AS tag_count
FROM bronze.raw_events;

Output:

event_idfirst_tagtag_count
1premium2
2returning_customer1
3premium3

Explode array to rows (PostgreSQL jsonb_array_elements)

-- One row per tag
SELECT
    event_id,
    tag_value::TEXT AS tag
FROM bronze.raw_events,
    jsonb_array_elements_text(payload::JSONB -> 'tags') AS tag_value;

Output:

event_idtag
1premium
1new_customer
2returning_customer
3premium
3bulk_buyer
3vip

Databricks / Spark SQL — JSON Operations

Spark SQL uses from_json() to parse JSON strings into a structured type, then dot notation to access fields.

Define the schema and parse

-- In Databricks: parse JSON string using from_json with an explicit schema
SELECT
    event_id,
    from_json(payload, 'STRUCT<
        user_id:INT,
        event:STRING,
        amount:DOUBLE,
        currency:STRING,
        product:STRUCT<id:INT, name:STRING, category:STRING>,
        tags:ARRAY<STRING>,
        metadata:STRUCT<source:STRING, version:STRING>
    >') AS parsed
FROM bronze.raw_events;

Access nested fields with dot notation

WITH parsed_events AS (
    SELECT
        event_id,
        from_json(payload, 'STRUCT<
            user_id:INT, event:STRING, amount:DOUBLE, currency:STRING,
            product:STRUCT<id:INT, name:STRING, category:STRING>,
            tags:ARRAY<STRING>,
            metadata:STRUCT<source:STRING, version:STRING>
        >') AS p
    FROM bronze.raw_events
)
SELECT
    event_id,
    p.user_id,
    p.event        AS event_type,
    p.amount,
    p.currency,
    p.product.name     AS product_name,
    p.product.category AS category,
    p.metadata.source  AS source,
    SIZE(p.tags)       AS tag_count
FROM parsed_events;

Output:

event_iduser_idevent_typeamountcurrencyproduct_namecategorysourcetag_count
1101purchase250.00USDLaptopHardwaremobile_app2
2102viewNULLUSDPhoneElectronicsweb1
3103purchase89.99AEDLaptopHardwaremobile_app3

Explode array to rows (Databricks EXPLODE)

WITH parsed_events AS (
    SELECT event_id,
           from_json(payload, 'STRUCT<user_id:INT, tags:ARRAY<STRING>>') AS p
    FROM bronze.raw_events
)
SELECT
    event_id,
    p.user_id,
    tag
FROM parsed_events
LATERAL VIEW EXPLODE(p.tags) tags_table AS tag;

Or using the SQL EXPLODE function directly:

WITH parsed AS (
    SELECT event_id,
           from_json(payload, 'STRUCT<user_id:INT, tags:ARRAY<STRING>>') AS p
    FROM bronze.raw_events
)
SELECT event_id, p.user_id, EXPLODE(p.tags) AS tag
FROM parsed;

Output:

event_iduser_idtag
1101premium
1101new_customer
2102returning_customer
3103premium
3103bulk_buyer
3103vip

Infer schema automatically (Databricks shortcut)

-- Let Spark infer the schema from the data (useful for exploration)
SELECT
    event_id,
    get_json_object(payload, '$.user_id')           AS user_id,
    get_json_object(payload, '$.event')             AS event_type,
    get_json_object(payload, '$.product.name')      AS product_name,
    get_json_object(payload, '$.metadata.source')   AS source
FROM bronze.raw_events;

get_json_object() uses JSONPath syntax — a quick way to extract individual fields without defining a full schema first. Use this for exploration; use from_json() with an explicit schema in production.


Snowflake — JSON Operations

Snowflake uses the VARIANT type for semi-structured data and uses colon : notation to access fields.

-- Snowflake: parse string to VARIANT
SELECT
    event_id,
    PARSE_JSON(payload)                          AS parsed_payload
FROM bronze.raw_events;
-- Access fields with colon notation
SELECT
    event_id,
    PARSE_JSON(payload):user_id::INT            AS user_id,
    PARSE_JSON(payload):event::STRING           AS event_type,
    PARSE_JSON(payload):amount::DECIMAL(10,2)   AS amount,
    PARSE_JSON(payload):product:name::STRING    AS product_name,
    PARSE_JSON(payload):product:category::STRING AS category,
    PARSE_JSON(payload):metadata:source::STRING  AS source
FROM bronze.raw_events;

Output: Same as above — identical results, different syntax.

The ::TYPE casting is important in Snowflake — without it, all extracted values are VARIANT type, which behaves unexpectedly in comparisons and aggregations.

Flatten arrays in Snowflake

-- FLATTEN is Snowflake's equivalent of EXPLODE
SELECT
    r.event_id,
    PARSE_JSON(r.payload):user_id::INT AS user_id,
    f.value::STRING AS tag
FROM bronze.raw_events r,
    LATERAL FLATTEN(INPUT => PARSE_JSON(r.payload):tags) f;

Output: Same one-row-per-tag output as the PostgreSQL and Databricks examples.


Building a Silver Table from JSON Bronze

This is the complete pipeline step — parsing the JSON Bronze table into a clean, typed Silver table.

-- Insert parsed, flattened events into Silver
INSERT INTO silver.events (
    event_id, user_id, event_type, amount, currency,
    product_id, product_name, category,
    source, app_version, tags,
    ingested_at
)
-- Databricks version:
WITH parsed AS (
    SELECT
        event_id,
        ingested_at,
        from_json(payload, 'STRUCT<
            user_id:INT, event:STRING, amount:DOUBLE, currency:STRING,
            product:STRUCT<id:INT, name:STRING, category:STRING>,
            tags:ARRAY<STRING>,
            metadata:STRUCT<source:STRING, version:STRING>
        >') AS p
    FROM bronze.raw_events
    WHERE ingested_at >= '2024-03-01'
)
SELECT
    event_id,
    p.user_id,
    p.event            AS event_type,
    p.amount,
    p.currency,
    p.product.id       AS product_id,
    p.product.name     AS product_name,
    p.product.category AS category,
    p.metadata.source  AS source,
    p.metadata.version AS app_version,
    ARRAY_JOIN(p.tags, ',') AS tags,   -- store as comma-separated string
    ingested_at
FROM parsed
WHERE p.user_id IS NOT NULL
  AND p.event   IS NOT NULL;

Common Mistakes

Mistake 1 — Not casting extracted JSON values

In Snowflake and Spark, payload:amount returns a VARIANT/ANY type. SUM(payload:amount) or WHERE payload:amount > 100 will either error or produce wrong results without explicit casting: payload:amount::DECIMAL.

Mistake 2 — Using get_json_object in production pipelines

get_json_object(payload, '$.product.name') parses the entire JSON string for every column you extract. If you need 10 fields, the JSON is parsed 10 times. Use from_json() once and access all fields from the resulting struct.

Mistake 3 — Storing JSON as STRING in Silver

Silver should have proper typed columns, not JSON strings. The Silver table is where you parse and flatten — downstream Gold queries should never need to call from_json().

Mistake 4 — EXPLODE without knowing cardinality

If your tags array has 1,000 items per row and you EXPLODE it before filtering, you multiply your data by 1,000 immediately. Filter first, then EXPLODE.


Quick Reference — JSON Syntax Across Platforms

OperationPostgreSQL (JSONB)Databricks (Spark SQL)Snowflake (VARIANT)
Parse string to JSONcol::JSONBfrom_json(col, schema)PARSE_JSON(col)
Get string fieldcol ->> 'key'col.keycol:key::STRING
Get nested fieldcol -> 'a' ->> 'b'col.a.bcol:a:b::STRING
Get array elementcol -> 'arr' ->> 0col.arr[0]col:arr[0]::STRING
Explode arrayjsonb_array_elements_text()EXPLODE(col.arr)LATERAL FLATTEN(col:arr)
Extract via pathjsonb_extract_path_text()get_json_object(col, '$.path')col:path::STRING

What to Learn Next

Parsing JSON is usually the first step when onboarding a new data source. The next tutorial covers the full SQL workflow for onboarding a new source table into your pipeline — from schema inspection and profiling through to writing the first Silver transformation.

Leave a Comment