SQL Databricks Data Warehouse

Role: Data Engineer
Year: 2026

About

Github
Tech Stack
  • Databricks
  • dbt
  • SQL

Introduction

Hi, I’m Mai, welcome to this project!
In this project, I build a Data Warehouse in Databricks using the Medallion architecture (bronze-silver-gold). This architecture ensures clear separation of data layers and traceability. It is accompanied by data tests in dbt, which perform automated data quality checks so errors are caught early.
Each step in the process is documented with descriptions, diagrams and notes on the techniques used.
Check out the Github link above for the related files.
*This project inherits many concepts from the course by Data with Baraa. See the References section to learn how to use the project, and the Contributions section for the modifications I’ve made to the original project to align it with modern data workflows.

Requirement Analysis

Context

Flexi is a global fashion brand with existing operational departments such as Finance, Inventory and Marketing.
It uses two management systems:
  • ERP - SAP: for internal operations management
  • CRM - Salesforce: for customer interaction management
These systems operate independently, with limited integration and data sharing.
The company’s CEO wants to optimize performance in various areas to cut costs and increase revenue. A new Data division is founded to deliver Business Intelligence along with custom and advanced analytics.
🎯 Hence, the first task of the Data Team is to build a Data Warehouse, incorporating data from ERP and CRM sources.

Project Requirements

💡
Objective
Develop a Data Warehouse using Databricks to consolidate sales data, enabling analytical reporting and informed decision-making.
💡
Specifications
  • Data Sources: Import data from two source systems (ERP and CRM), provided as CSV files.
  • Data Pre-processing: Clean and resolve data quality issues prior to analysis.
  • Integration: Combine both data sources into a single, user-friendly data model designed for analytical queries.
  • Scope: Focus on the latest dataset only; historization of data is not required.
  • Documentation: Provide clear documentation of the data model to support analytics teams and business stakeholders.

Data Architecture Design

This project uses a Data Warehouse with a Medallion Architecture for the following reasons:
  • Data Warehouse
    • Suitable for structured, relational data and analytical workloads.
  • Medallion Architecture
    • Provides clear separation of concerns and progressive data refinement across layers.

Choose Architecture Type

There are four common types of data architecture, each with its own pros and cons:
  • Data Warehouse
    • Description: Structured data storage (SQL, tables, schemas).
    • Pros ✅: High performance for analytics; solid foundation for BI and business reporting.
    • Cons ❌: Rigid schema requirements.
  • Data Lake
    • Description: Structured and unstructured/semi-structured raw data storage (JSON, logs, etc.).
    • Pros ✅: Flexible schema-on-read; handles mixed data types (logs, images, videos, etc.).
    • Cons ❌: Hard to manage and can become a data swamp; no ACID transactions.
  • Data Lakehouse
    • Description: Hybrid (structured + unstructured data, with warehouse features).
    • Pros ✅: Combines data warehouse & data lake benefits; supports ACID transactions.
    • Cons ❌: Complex to manage.
  • Data Mesh
    • Description: Decentralized data storage; each department builds its own data product and shares it with other teams.
    • Pros ✅: Suitable for large organizations where each team has its own use cases for data.
    • Cons ❌: Complex governance.

Choose Data Warehouse Approach

Here are 4 common approaches to building a Data Warehouse:
  • Inmon Approach
    • Sources → Staging → Enterprise Data Warehouse (3NF) → Data Marts → BI Layers
    • EDW: A central, enterprise-wide warehouse modelled in third normal form (3NF).
    • Data Marts: Small subsets of the EDW, ready to be consumed for reporting. Ex: data marts for customers, sales, products, etc.
    • Slower and more complex to model.
  • Kimball Approach
    • Sources → Staging → Data Marts → BI Layers
    • Faster shipping speed (no central enterprise warehouse).
    • Tradeoff in data consistency.
  • Data Vault
    • Sources → Staging → Raw Vault → Business Vault → Data Marts → BI Layers
    • Splits the central layer into more layers.
    • Raw Vault: Original data.
    • Business Vault: All business rules and transformations that prepare data for the data marts.
  • Medallion Architecture
    • Sources → Bronze → Silver → Gold → BI Layers
    • Bronze: As-is data, similar to a staging layer.
    • Silver: Clean and transformed data.
    • Gold: Business-level data models (similar to data marts); different business-ready objects can be built here. Used for reporting, ML, AI, etc.

Design Data Layers

Each layer has its own definition, objective, load method and audience:
  • Bronze Layer
    • Definition: Raw, untouched data
    • Objective: Traceability & debugging
    • Object Type: Tables
    • Load Method: Full load (truncate & insert)
    • Data Transformation: None
    • Data Modelling: None
    • Target Audience: Data Engineers
  • Silver Layer
    • Definition: Clean, standardized data
    • Objective: Prepare data for analysis
    • Object Type: Tables
    • Load Method: Full load (truncate & insert)
    • Data Transformation: Data cleaning, data standardization, data normalization, derived columns, data enrichment
    • Data Modelling: None
    • Target Audience: Data Engineers, Data Analysts
  • Gold Layer
    • Definition: Business-ready data
    • Objective: Provide data to be consumed by reporting & analytics
    • Object Type: Views (virtual and dynamic, so they are fast to build and change)
    • Load Method: None
    • Data Transformation: Business transformations: data integration, data aggregation, business logic & rules
    • Data Modelling: Star schema, aggregated objects, flat tables
    • Target Audience: Data Analysts, Business Users
Separation of Concerns
  • Break down complex systems into smaller parts
  • Each part is responsible for specific tasks
  • Responsibilities must not be duplicated across components

Draw the Data Architecture

[Data architecture diagram]

Project Start

General Principles

  • Naming convention: snake_case (lowercase letters, with underscores separating words)
  • Language: English
  • Avoid reserved words
Naming Conventions
Set of rules or guidelines for naming anything in the project.
  • Database
  • Schema
  • Tables
  • Stored procedures

Catalog Naming Conventions

  • All catalog names must start with their primary function and be suffixed with the environment type.
  • <function>_<environment>
    • Ex: core_dev
      • The data warehouse for core operations, used in development.

Table Naming Conventions

Bronze Rules

  • All names must start with the source system name.
  • Table names must correspond to their original name.
  • Entity names are not abbreviated, are nouns, and are in singular forms.
  • <source_system>_<entity>
    • Ex: crm_cust_info

Silver Rules

  • All names must start with the source system name.
  • Table names must correspond to their original name.
  • Entity names are not abbreviated, are nouns, and are in singular forms.
  • <source_system>_<entity>
    • Ex: crm_cust_info

Gold Rules

  • All names must use meaningful, business-aligned names for tables, starting with the category prefix.
  • Entity names are not abbreviated, are nouns, and are in singular forms.
  • <category>_<entity>
    • Ex: dim_customer

Glossary of Category Patterns

  • dim_ (Dimension table): dim_customer, dim_product
  • fact_ (Fact table): fact_sales
  • agg_ (Aggregated table): agg_customer, agg_sales_monthly

Column Naming Conventions

Surrogate Keys

  • All primary keys in dimension tables must use the suffix _key
  • <entity_name>_key
    • Ex: customer_key → surrogate key in the dim_customer table
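For illustration, a hedged sketch of how such a surrogate key could be generated later in a gold dimension view (table and column names are assumptions, not the project’s exact code):

```sql
-- Sketch: generate a surrogate key with ROW_NUMBER() inside a gold dimension view.
CREATE OR REPLACE VIEW core_dev.gold.dim_customer AS
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key,   -- surrogate key (_key suffix)
    ci.cst_id  AS customer_id,                                 -- original business identifier
    ci.cst_key AS customer_number
FROM core_dev.silver.crm_cust_info AS ci;
```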

Technical Columns

  • All technical columns must start with the prefix dwh_, followed by a descriptive name indicating the column’s purpose.
  • dwh_<column_name>
    • Ex: dwh_load_date → Column used to store the date when the record was loaded, generated by the system.

Stored Procedure

All stored procedures used for loading data must follow the naming pattern:
  • load_<layer>
    • Ex: load_bronze → stored procedure for loading data into the bronze layer.

Build the Data Warehouse

Create the Volumes, Catalog and Schemas
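A minimal sketch of this step in Databricks SQL, assuming the source volume lives under the bronze schema (the project only specifies the names core_dev and source):

```sql
-- Create the development catalog, the three medallion schemas and the landing volume.
CREATE CATALOG IF NOT EXISTS core_dev;

CREATE SCHEMA IF NOT EXISTS core_dev.bronze;
CREATE SCHEMA IF NOT EXISTS core_dev.silver;
CREATE SCHEMA IF NOT EXISTS core_dev.gold;

-- Managed volume that receives the raw CSV extracts before ingestion.
CREATE VOLUME IF NOT EXISTS core_dev.bronze.source;
```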

Build Bronze Layer

Workflow

  • Analyze: Interview source system experts
  • Code: Ingest data
    • Load source files
    • Create data schema
    • Insert records (full load) from source files
  • Code: Data completeness & schema checks
  • Docs: Document & do versioning in git

Analyze: Source Systems

Asking the right questions helps avoid mistakes.
  • Business Context & Ownership
    • Who owns the data?
      • Understand the source of the data
    • What business process does it support?
      • Understand the importance of the data
    • System & Data documentation
      • Save time when building data models
    • Data Model & Data Catalog
      • Save time when building data models by reading existing table and column descriptions.
  • Architecture & Technology Stack
    • How is data stored? (SQL Server, Oracle, AWS, Azure, etc.)
    • What are the integration capabilities? (API, Kafka, File Extract, Direct DB, etc.)
  • Extract & Load
    • Incremental vs. Full Loads?
    • Data Scope & Historical Needs
      • Whether we need all data or just a subset of the entire data history.
      • Whether we need to build data history.
    • What is the expected size of the extracts? (MB, GB, TB)
      • Understand whether we have the right tools
    • Are there any data volume limitations?
      • Conform to limitations and capability of the source system.
    • How to avoid impacting the source system’s performance?
    • Authentication and Authorization (tokens, SSH keys, VPN, IP whitelisting, …)
Understanding the data structure is done via two primary methods:
  • Doing interviews with source system owners (developers)
  • Doing data profiling on your own
Since the source data files are available, I head straight to the ingestion method.

Code: Ingest data

Load source files
  • Create the core_dev catalog
  • Create a new volume called source
Create schema and tables
  • Create bronze, silver and gold schemas
  • Create 6 tables based on the original source file names
    • crm_cust_info
    • crm_prd_info
    • crm_sales_details
    • erp_cust_az12
    • erp_loc_a101
    • erp_px_cat_g1v2
Insert records from source files
Methods to ingest
  • Bulk Insert
    • Massively load raw files (txt, csv) directly into the database.
  • Insert
    • Load data row-by-row into the database.
  • Full Load
    • Rewrite all old data with the newest version of the data.
  • Incremental Load
    • Incremental batch load
      • Inserts data into the database at regular intervals.
    • Incremental stream load
      • Inserts data into the database when new data emerges or old data is updated.
This project uses a bulk insert, full load approach.
  • New versions of the data (in .csv format) are pulled into the source/ volume in Databricks.
  • The current tables are truncated and the newest version of the data is loaded into the tables in the bronze schema.
Stored Procedure
Precompiled SQL code that can be saved and reused.
  • Unlike a saved query: Stored procedures
    • can be reused with new parameter inputs.
  • Unlike a UDF (user-defined function): Stored procedures
    • can return 0, 1 or many values
    • accept DML operations
    • can manage transactions
    • can call other stored procedures
COPY INTO in Databricks
This is a bulk-load statement that ingests files directly into a table (comparable to BULK INSERT in SQL Server). The operation is:
  • retryable
  • idempotent (applying it multiple times does not change the result beyond the initial application)
This means files in the source location that have already been loaded are skipped by default (even if they have been modified since).
In this project, the load operation is wrapped in a stored procedure built around the COPY INTO statement.
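As an illustration, a hedged sketch of a full load for one bronze table with COPY INTO (the file path, options and table name are assumptions, not the project’s exact code):

```sql
-- Full-load sketch for one bronze table: empty it, then re-ingest the source file.
TRUNCATE TABLE core_dev.bronze.crm_cust_info;

COPY INTO core_dev.bronze.crm_cust_info
FROM '/Volumes/core_dev/bronze/source/cust_info.csv'   -- hypothetical path inside the source volume
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('force' = 'true');  -- reload the file even if it was ingested before (full load)
```
Leaving 'force' at its default of 'false' gives the idempotent behaviour described above: files that were already loaded are skipped.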
Alternative ingestion methods
This project adopts a simple ingestion method. I think there are more efficient ways to approach this, but I will adopt them in a different project. Some changes I can think of as of now:
  • Skip the source volume and pull data directly into bronze. This halves the storage needed for raw data.
  • Use incremental load instead of full load. It is much faster and more storage efficient (it comes with some downsides, though).

Docs: Data Flow

Data Flow
Describes the data lineage (data dependency through each layer).
[Data flow diagram]

Build Silver Layer

Workflow

  • Analyze: Explore & understand data
    • Find issues in bronze data tables and log them in the bronze/profile_bronze script.
    • Find table relationships and document in an integration diagram.
  • Code: Ingest data
    • Define silver tables’ schema.
    • Standardize silver data and insert into tables based on findings from bronze/profile_bronze.
  • Code: Data quality checks
    • Perform manual data quality checks on the newly created silver tables, store scripts in scripts/silver/test_silver.
  • Code: Automate data quality checks
    • Create automated tests on silver data in dbt.
  • Docs: Document & do versioning in git

Analyze: Explore & Understand Data

bronze Table Investigations
In this part, I carefully investigate each table, column and column combination to understand:
  • Data Uniqueness Constraint
  • Data Validity
  • Table Relationships
Main insights are described below. Read the scripts/bronze/profile_bronze script for more details.
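As an example of the kind of check kept in that script, here is a sketch of a uniqueness/null-key query (column names taken from the findings below):

```sql
-- Expect no rows if cst_id is a valid, unique key; duplicates or nulls flag a data issue.
SELECT cst_id, COUNT(*) AS record_count
FROM core_dev.bronze.crm_cust_info
GROUP BY cst_id
HAVING COUNT(*) > 1 OR cst_id IS NULL;
```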

crm_cust_info
  • About: Customer demographic data from the CRM system.
  • Structure: Non-standard, append-only structure (pseudo SCD Type 2 behavior).
    • No exact duplicate rows.
    • cst_id and cst_key are not unique due to multiple records over time.
    • Multiple records exist for the same (cst_id, cst_key) combination, differentiated by cst_create_date.
    • Changes across versions are incremental: null values are progressively filled with valid values (no overwrites).
    • No explicit versioning columns (e.g., start/end dates).
  • Issues:
    • Rows with cst_id IS NULL contain invalid and inconsistent data.
    • Redundant historical records where earlier versions contain mostly null values.
  • Conclusion:
    • Remove rows with null cst_id (data errors).
    • Deduplicate by keeping only the latest record per (cst_id, cst_key) based on cst_create_date.
    • Treat the table as SCD Type 0 in the Silver layer (no history retained).

crm_prd_info
  • About: Product master data with historical tracking.
  • Structure: Proper SCD Type 2 implementation.
    • prd_key represents the business entity (product).
    • prd_id represents a specific version of the product over time.
    • prd_start_dt and prd_end_dt define the validity period.
    • Active records have prd_end_dt IS NULL.
    • One prd_key can have multiple prd_id values (multiple versions).
  • Data Quality Notes:
    • No duplicate rows or duplicate (prd_id, prd_key) combinations.
    • Some invalid date ranges (prd_start_dt > prd_end_dt).
    • Some null or missing values in attributes (e.g., prd_cost).
  • Conclusion:
    • Retain full history (valid SCD Type 2 design).
    • Handle invalid date ranges and null values during transformation.

crm_sales_details
  • About: Sales transaction data.
  • Structure: Flattened fact table.
    • No duplicate rows.
    • No single-column primary key.
    • Grain: one row per product per order per customer.
    • Composite uniqueness: (sls_ord_num, sls_cust_id, sls_prd_key).
  • Relationships:
    • One customer → many orders (1:M).
    • One order → many products (1:M).
    • Products ↔ orders form an M:N relationship (resolved in this table).
  • Conclusion:
    • Acts as a fact table with defined grain.
    • Suitable for direct transformation into a star schema fact table.
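The composite grain identified above can be verified with a query along these lines (a sketch, not the project’s exact script):

```sql
-- Expect zero rows: each (order, customer, product) combination should appear only once.
SELECT sls_ord_num, sls_cust_id, sls_prd_key, COUNT(*) AS record_count
FROM core_dev.bronze.crm_sales_details
GROUP BY sls_ord_num, sls_cust_id, sls_prd_key
HAVING COUNT(*) > 1;
```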

erp_cust_az12
  • About: Customer attributes (e.g., gender, birthdate) from ERP.
  • Structure: Normalized table with unique cid.
  • Issues:
    • cid format is inconsistent:
      • Contains varying prefixes.
      • Does not directly match cst_id or cst_key in CRM.
    • Matching requires transformation (e.g., substring extraction).
    • Partial overlaps and mismatches exist between ERP and CRM identifiers.
    • Duplicate cid values exist under different formats.
  • Conclusion:
    • Normalize cid by removing prefixes to align with CRM identifiers.
    • Expect partial mismatches during integration.
    • Use transformation logic to standardize joins with CRM data.
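A hedged sketch of what that transformation logic could look like; the 'NAS' prefix is a hypothetical example, since the notes above only state that prefixes vary:

```sql
-- Strip an assumed prefix from cid so it can be joined against the CRM keys.
SELECT
    CASE
        WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LENGTH(cid))  -- drop the prefix
        ELSE cid
    END AS cid
FROM core_dev.bronze.erp_cust_az12;
```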

erp_loc_a101
  • About: Customer location data (country).
  • Structure: Normalized table with unique cid.
  • Issues:
    • cid contains formatting inconsistencies (e.g., dashes).
    • Requires cleaning (e.g., removing the dashes) before joining.
    • Matches with CRM data are mostly consistent after normalization, with minor exceptions due to upstream data issues.
  • Conclusion:
    • Clean cid (remove special characters) before integration.
    • Join primarily with CRM (cst_key) after normalization.
    • Treat unmatched records as data quality issues.

erp_px_cat_g1v2
  • About: Product category and subcategory reference data.
  • Structure: Normalized lookup table.
    • id represents a subcategory (derived from category + subcategory codes).
    • No duplicate rows or IDs.
  • Relationships:
    • id maps to prefixes of prd_key in crm_prd_info.
  • Issues:
    • Some products (e.g., category CO-PE) do not have matching category records.
    • Indicates that this table may be incomplete or outdated.
  • Conclusion:
    • Use as a category lookup where matches exist.
    • Handle unmatched products separately (data gap).
    • Consider this table as potentially stale reference data.
Data Integration Diagram
After investigating the table relationships, I made the diagram below to reference when doing data cleaning for this layer.
The diagram uses the Crow’s Foot notation to describe relationships between tables.
[Data integration diagram]

Code: Ingest data

Define silver tables’ schema.
The tables in this layer normalize values from the bronze layer, so column formats and types change. I decide on these changes based on the findings from the bronze profiling and communication with stakeholders.
Furthermore, to better track when data was loaded, I added a metadata column, dwh_create_date, which records the date when each record was created in the silver layer.
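A sketch of what one silver table definition could look like with that metadata column; the business columns are assumed from the profiling notes, not copied from the project:

```sql
-- Silver table: same grain as bronze, but with standardized types and a metadata column.
CREATE TABLE IF NOT EXISTS core_dev.silver.crm_cust_info (
    cst_id             INT,
    cst_key            STRING,
    cst_firstname      STRING,
    cst_lastname       STRING,
    cst_marital_status STRING,
    cst_gndr           STRING,
    cst_create_date    DATE,
    dwh_create_date    DATE   -- metadata column, populated with current_date() at load time
);
```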

Data Cleansing

Types of data transformations
This diagram summarizes the key operations in data transformation.
Source: Data with Baraa.
crm_cust_info
  • Remove invalid data
    • Remove null cst_id
    • Remove duplicates (historical customer records)
    • Handle missing data (using a default value)
  • Standardize data
    • Map coded values to user-friendly descriptions
    • Remove unwanted spaces in string values
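Taken together, a hedged sketch of this cleansing logic (column names and code mappings are assumptions based on the profiling notes above):

```sql
-- Full load of silver.crm_cust_info: keep only the latest, valid record per customer.
TRUNCATE TABLE core_dev.silver.crm_cust_info;

INSERT INTO core_dev.silver.crm_cust_info
SELECT
    cst_id,
    cst_key,
    TRIM(cst_firstname) AS cst_firstname,        -- remove unwanted spaces
    TRIM(cst_lastname)  AS cst_lastname,
    CASE UPPER(TRIM(cst_marital_status))
        WHEN 'S' THEN 'Single'
        WHEN 'M' THEN 'Married'
        ELSE 'n/a'                               -- default value for missing data
    END AS cst_marital_status,                   -- map coded values to descriptions
    CASE UPPER(TRIM(cst_gndr))
        WHEN 'F' THEN 'Female'
        WHEN 'M' THEN 'Male'
        ELSE 'n/a'
    END AS cst_gndr,
    cst_create_date,
    current_date() AS dwh_create_date
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS rn
    FROM core_dev.bronze.crm_cust_info
    WHERE cst_id IS NOT NULL                     -- remove invalid rows
) AS latest
WHERE rn = 1;                                    -- keep only the latest record per customer
```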

crm_prd_info
  • Standardize data
    • Map coded values to user-friendly descriptions
    • Cast to a suitable type
  • Remove invalid data
    • Handle missing data (using a default value)
  • Data enrichment (add new values to the dataset)

crm_sales_details
  • Standardize data
    • Convert to standard convention
    • Replace invalid values with placeholder
    • Cast to a suitable type
    • Derive valid values from upstream columns
  • Remove invalid data
    • Handle missing data (using a default value)

erp_cust_az12
  • Standardize data
    • Convert to standard convention
    • Map coded values to user-friendly descriptions
  • Remove invalid data

erp_loc_a101
  • Standardize data
    • Map coded values to user-friendly descriptions
    • Remove unwanted spaces
Business Rules
  • Total Sales = Sum of (Quantity * Price)
  • Negatives, zeros and nulls are not allowed.
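A hedged sketch of how this rule could be applied while standardizing crm_sales_details (column names are assumptions):

```sql
-- Recompute the sales amount whenever it is null, non-positive or inconsistent with quantity * price.
SELECT
    sls_ord_num,
    sls_quantity,
    sls_price,
    CASE
        WHEN sls_sales IS NULL OR sls_sales <= 0
             OR sls_sales != sls_quantity * ABS(sls_price)
        THEN sls_quantity * ABS(sls_price)       -- derive a valid value from quantity and price
        ELSE sls_sales
    END AS sls_sales
FROM core_dev.bronze.crm_sales_details;
```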

Code: Automate Data Quality Checks

dbt
An open-source tool that brings software engineering best practices, such as testing, versioning and documentation, to transforming structured data.
Benefits of using dbt
  • Data platform agnostic: dbt combines SQL with Jinja templating and connects to different data platforms through adapters, so any script (such as a test) can be executed regardless of the platform.
  • Automated testing: avoids faults and saves time when creating models. A table will not be created if it does not pass its mandatory tests, which ensures the business does not make decisions based on erroneous data.
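As an example, a dbt singular test can encode the business rule above as plain SQL; this is a sketch that assumes the silver tables are declared as a dbt source called silver:

```sql
-- tests/assert_sales_positive.sql (hypothetical file name)
-- dbt marks the test as failed if this query returns any rows.
SELECT *
FROM {{ source('silver', 'crm_sales_details') }}
WHERE sls_sales IS NULL
   OR sls_sales <= 0
```
Running dbt test then executes every such check automatically as part of the workflow.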

Build Gold Layer

Workflow

  • Analyze: Explore & Understand the Business Object
  • Code: Data Integration
    • Build the Business Object
    • Choose type: Dimension vs. Fact
    • Rename to friendly names
  • Validate: Data Integration Checks
  • Docs: Document & do versioning in git
    • Data Model
    • Data Catalog
    • Data Flow

Analyze

Data Modeling
The process of taking raw data and processing and structuring it in a meaningful way.
Types of Data Models
  • Conceptual Model: Represents core entities in the business and how they relate to one another. Info: entities, relationships.
  • Logical Model: Represents data constraints, entity names and relationships in a diagram (platform-agnostic). Info: all of the previous level + attributes, keys, cardinality.
  • Physical Model: Implement the logical model over a specific database technology. Info: all in previous level + data types, SQL tables, indexes, constraints, etc.
Source: Data with Baraa.
Types of Data Schemas
  • Star Schema: Has a central fact table surrounded by denormalized dimension tables.
    • Pros: Simple and easy to understand and analyze.
    • Cons: Dimensions may contain duplicates, so data size grows faster → slower queries.
  • Snowflake Schema: Has a central fact table surrounded by dimensions that are normalized into multiple related tables.
    • Pros: Optimizes storage by reducing data redundancy.
    • Cons: More complex to understand and analyze.
Table Types
  • Dimension tables: Descriptive information that gives context to data.
  • Fact tables: Quantitative information that represents events.
This project chooses a Star Schema because the data size is small, so the storage tradeoff for ease of use is inexpensive.
Data Integration Model
[Data integration model diagram]
Data Model
[Data model diagram]

Code

Data lookup
Joining the fact table with the dimension tables to look up the relevant surrogate keys (sketched below).
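A hedged sketch of that lookup for the sales fact view (column names are assumptions consistent with the naming conventions above, not the project’s exact code):

```sql
-- Replace source identifiers with surrogate keys from the dimension views.
CREATE OR REPLACE VIEW core_dev.gold.fact_sales AS
SELECT
    sd.sls_ord_num  AS order_number,
    pr.product_key,                      -- surrogate key looked up from dim_product
    cu.customer_key,                     -- surrogate key looked up from dim_customer
    sd.sls_order_dt AS order_date,
    sd.sls_sales    AS sales_amount,
    sd.sls_quantity AS quantity,
    sd.sls_price    AS price
FROM core_dev.silver.crm_sales_details AS sd
LEFT JOIN core_dev.gold.dim_product  AS pr ON sd.sls_prd_key = pr.product_number
LEFT JOIN core_dev.gold.dim_customer AS cu ON sd.sls_cust_id = cu.customer_id;
```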

Docs: Data Catalog

Data Catalog
A centralized inventory and management system for metadata. Stakeholders can use it for better understanding of an organization’s data.
A data catalog is a must-have for efficient work with shared data within an organization.
In this project, the data catalog is written as a .md (markdown) file, clarifying the:
  • purpose and structure of each table
  • column names, types and description

Docs: Data Flow Diagram

Data Flow Diagram
Represents the dependency between data collections (tables) in different layers within a system. It shows how data moves within a system.
The data flow diagram below shows how data moves within the warehouse. Each subsequent layer is directly dependent on the previous layer and indirectly dependent on the layers before that. This improves traceability when facing errors in a layer.
[Data flow diagram]

References

Sources

This project builds upon materials from the course by Data with Baraa, including datasets, code patterns, and architectural concepts. Refer to the repository’s license file for details on usage and distribution.

Overall Review

Overall, I think this is a wonderful course that gives a hands-on overview of the process of creating a Data Warehouse. Baraa did an excellent job tying the related concepts together in an intuitive way.

Highlights

I have done similar database and data warehouse projects in the past, but the thing that stands out to me is Baraa’s presentation skills. Every step is visually or textually described with a high level of detail, just like in an industry project. I have learned a lot and will definitely apply this mindset of clear communication in my next projects.

Next Steps

I’m excited to build more complex data systems. For future projects, I will adopt a different:
  • Loading technique
  • SCD handling technique
  • Workflow orchestration technique
  • Performance optimization technique

Contributions

I have made several changes to the original project to match my learning goals & industry needs. The changes are described below.

Tech Stack Changes

  • Data Platform
    • Migrated from SQL Server to Databricks for scalability and modern data workflows.
  • Data Quality & Testing
    • Integrated dbt to automate data validation and enforce data quality constraints.
  • Version Control
    • Combined Git integration in Databricks and VSCode:
      • Databricks for data processing and SQL development
      • VSCode for dbt workflows and project management
  • Environment Management
    • Introduced uv for Python dependency management to ensure reproducibility.

Data Engineering Changes

  • Added data profiling scripts to systematically analyze source data.
  • Introduced automated testing via dbt and environment management via uv.
  • Identified additional data quality issues beyond the original pipeline.

Documentation Enhancements

  • Expanded documentation across all layers:
    • Data architecture
    • Data flow
    • Data modeling
  • Added detailed explanations for:
    • Data engineering concepts
    • Transformation logic
    • Design decisions
    • Workflow structure
That concludes the project for now. Happy learning!
