SQL Databricks Data Warehouse

Role: Data Engineer
Year: 2026

About

Github
Tech Stack
  • Databricks
  • dbt
  • SQL

Introduction

Hi, I’m Mai, welcome to this project!
In this project, I build a Data Warehouse in Databricks using the Medallion architecture (bronze-silver-gold). This architecture ensures clear separation of data layers and traceability. It is accompanied by data tests in dbt, which perform automated data quality checks so errors are caught early.
Each step in the process is documented with descriptions, diagrams and notes on the techniques used.
Check out the Github link above for the related files.
*This project inherits many concepts from the course by Data with Baraa. See the References section to learn how to use the project, and the Contributions section for the modifications I’ve made to the original project to align it with modern data workflows.

Requirement Analysis

Context

Flexi is a global fashion brand with existing operational departments such as Finance, Inventory and Marketing.
It uses two management systems:
  • ERP - SAP: for internal operations management
  • CRM - Salesforce: for customer interaction management
These systems operate independently, with limited integration and data sharing.
The company’s CEO wants to optimize performance in various areas to cut costs and increase revenue. A new Data division is founded to deliver Business Intelligence along with custom and advanced analytics.
🎯 Hence, the first task of the Data Team is to build a Data Warehouse, incorporating data from ERP and CRM sources.

Project Requirements

💡
Objective
Develop a Data Warehouse using Databricks to consolidate sales data, enabling analytical reporting and informed decision-making.
💡
Specifications
  • Data Sources: Import data from two source systems (ERP and CRM), provided as CSV files.
  • Data Pre-processing: Clean and resolve data quality issues prior to analysis.
  • Integration: Combine both data sources into a single, user-friendly data model designed for analytical queries.
  • Scope: Focus on the latest dataset only; historization of data is not required.
  • Documentation: Provide clear documentation of the data model to support analytics teams and business stakeholders.

Data Architecture Design

This project uses a Data Warehouse with a Medallion Architecture for the following reasons:
  • Data Warehouse
    • Suitable for structured, relational data and analytical workloads.
  • Medallion Architecture
    • Provides clear separation of concerns and progressive data refinement across layers.

Choose Architecture Type

There are four common types of data architecture, each with its own pros and cons:
  • Data Warehouse
    • Description: Structured data storage (SQL, tables, schemas).
    • Pros ✅: High performance for analytics; solid foundation for BI and business reporting.
    • Cons ❌: Rigid schema requirements.
  • Data Lake
    • Description: Structured and unstructured/semi-structured raw data storage (JSON, logs, etc.).
    • Pros ✅: Flexible schema-on-read; handles mixed data types (logs, images, videos, etc.).
    • Cons ❌: Hard to manage and can become a data swamp; no ACID transactions.
  • Data Lakehouse
    • Description: Hybrid (structured + unstructured data, with warehouse features).
    • Pros ✅: Combines data warehouse & data lake benefits; supports ACID transactions.
    • Cons ❌: Complex to manage.
  • Data Mesh
    • Description: Decentralized data storage; each department builds its own data product and shares it with other teams.
    • Pros ✅: Suitable for large organizations where each team has its own use cases for data.
    • Cons ❌: Complex governance.

Choose Data Warehouse Approach

Here are 4 common approaches to building a Data Warehouse:
  • Inmon Approach
    • Sources → Staging → Enterprise Data Warehouse (3NF) → Data Marts → BI Layers
    • EDW: A central, enterprise-wide warehouse modelled in third normal form (3NF).
    • Data Marts: Small subsets of the EDW, ready to be consumed for reporting. Ex: data marts for customers, sales, products, etc.
    • Slower and more complex to model.
  • Kimball Approach
    • Sources → Staging → Data Marts → BI Layers
    • Faster shipping speed (no central enterprise warehouse).
    • Tradeoff in data consistency.
  • Data Vault
    • Sources → Staging → Raw Vault → Business Vault → Data Marts → BI Layers
    • Splits the central layer into more layers.
    • Raw Vault: Original data.
    • Business Vault: All business rules and transformations that prepare data for the data marts.
  • Medallion Architecture
    • Sources → Bronze → Silver → Gold → BI Layers
    • Bronze: As-is data, similar to a staging layer.
    • Silver: Clean and transformed data.
    • Gold: Business-level data models (similar to data marts); different business-ready objects can be built here. Used for reporting, ML, AI, etc.

Design Data Layers

Each layer has its own definition, objective, load method and audience:
  • Bronze Layer
    • Definition: Raw, untouched data
    • Objective: Traceability & debugging
    • Object Type: Tables
    • Load Method: Full load (truncate & insert)
    • Data Transformation: None
    • Data Modelling: None
    • Target Audience: Data Engineers
  • Silver Layer
    • Definition: Clean, standardized data
    • Objective: Prepare data for analysis
    • Object Type: Tables
    • Load Method: Full load (truncate & insert)
    • Data Transformation: Data cleaning, data standardization, data normalization, derived columns, data enrichment
    • Data Modelling: None
    • Target Audience: Data Engineers, Data Analysts
  • Gold Layer
    • Definition: Business-ready data
    • Objective: Provide data to be consumed by reporting & analytics
    • Object Type: Views (virtual and dynamic, so they are fast to build and change)
    • Load Method: None
    • Data Transformation: Business transformations: data integration, data aggregation, business logic & rules
    • Data Modelling: Star schema, aggregated objects, flat tables
    • Target Audience: Data Analysts, Business Users
Separation of Concerns
  • Break down complex systems into smaller parts
  • Each part is responsible for specific tasks
  • Responsibilities must not be duplicated across components

Draw the Data Architecture

[Data architecture diagram]

Project Start

General Principles

  • Naming convention: snake_case (lowercase letters, with underscores separating words)
  • Language: English
  • Avoid reserved words
Naming Conventions
Set of rules or guidelines for naming anything in the project.
  • Database
  • Schema
  • Tables
  • Stored procedures

Catalog Naming Conventions

  • All catalog names must start with their primary function and be suffixed with the environment type.
  • <function>_<environment>
    • Ex: core_dev
      • The data warehouse for core operations, used in development.

Table Naming Conventions

Bronze Rules

  • All names must start with the source system name.
  • Table names must correspond to their original name.
  • Entity names are not abbreviated, are nouns, and are in singular forms.
  • <source_system>_<entity>
    • Ex: crm_cust_info

Silver Rules

  • All names must start with the source system name.
  • Table names must correspond to their original name.
  • Entity names are not abbreviated, are nouns, and are in singular forms.
  • <source_system>_<entity>
    • Ex: crm_cust_info

Gold Rules

  • All names must use meaningful, business-aligned names for tables, starting with the category prefix.
  • Entity names are not abbreviated, are nouns, and are in singular forms.
  • <category>_<entity>
    • Ex: dim_customer

Glossary of Category Patterns

  • dim_ (Dimension table): dim_customer, dim_product
  • fact_ (Fact table): fact_sales
  • agg_ (Aggregated table): agg_customer, agg_sales_monthly

Column Naming Conventions

Surrogate Keys

  • All primary keys in dimension tables must use the suffix _key
  • <entity_name>_key
    • Ex: customer_key → surrogate key in the dim_customer table
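For illustration, a hedged sketch of how such a surrogate key could be generated later in a gold dimension view (table and column names are assumptions, not the project’s exact code):

```sql
-- Sketch: generate a surrogate key with ROW_NUMBER() inside a gold dimension view.
CREATE OR REPLACE VIEW core_dev.gold.dim_customer AS
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key,   -- surrogate key (_key suffix)
    ci.cst_id  AS customer_id,                                 -- original business identifier
    ci.cst_key AS customer_number
FROM core_dev.silver.crm_cust_info AS ci;
```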

Technical Columns

  • All technical columns must start with the prefix dwh_, followed by a descriptive name indicating the column’s purpose.
  • dwh_<column_name>
    • Ex: dwh_load_date → Column used to store the date when the record was loaded, generated by the system.

Stored Procedure

All stored procedures used for loading data must follow the naming pattern:
  • load_<layer>
    • Ex: load_bronze → stored procedure for loading data into the bronze layer.

Build the Data Warehouse

Create the Volumes, Catalog and Schemas
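A minimal sketch of this step in Databricks SQL, assuming the source volume lives under the bronze schema (the project only specifies the names core_dev and source):

```sql
-- Create the development catalog, the three medallion schemas and the landing volume.
CREATE CATALOG IF NOT EXISTS core_dev;

CREATE SCHEMA IF NOT EXISTS core_dev.bronze;
CREATE SCHEMA IF NOT EXISTS core_dev.silver;
CREATE SCHEMA IF NOT EXISTS core_dev.gold;

-- Managed volume that receives the raw CSV extracts before ingestion.
CREATE VOLUME IF NOT EXISTS core_dev.bronze.source;
```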

Build Bronze Layer

Workflow

  • Analyze: Interview source system experts
  • Code: Ingest data
    • Load source files
    • Create data schema
    • Insert records (full load) from source files
  • Code: Data completeness & schema checks
  • Docs: Document & do versioning in git

Analyze: Source Systems

Asking the right questions helps avoid mistakes.
  • Business Context & Ownership
    • Who owns the data?
      • Understand the source of the data
    • What business process does it support?
      • Understand the importance of the data
    • System & Data documentation
      • Save time when building data models
    • Data Model & Data Catalog
      • Save time when building data models by reading existing table and column descriptions.
  • Architecture & Technology Stack
    • How is data stored? (SQL Server, Oracle, AWS, Azure, etc.)
    • What are the integration capabilities? (API, Kafka, File Extract, Direct DB, etc.)
  • Extract & Load
    • Incremental vs. Full Loads?
    • Data Scope & Historical Needs
      • Whether we need all data or just a subset of the entire data history.
      • Whether we need to build data history.
    • What is the expected size of the extracts? (MB, GB, TB)
      • Understand whether we have the right tools
    • Are there any data volume limitations?
      • Conform to limitations and capability of the source system.
    • How to avoid impacting the source system’s performance?
    • Authentication and Authorization (tokens, SSH keys, VPN, IP whitelisting, …)
Understanding the data structure is done via two primary methods:
  • Doing interviews with source system owners (developers)
  • Doing data profiling on your own
Since the source data files are available, I head straight to the ingestion method.

Code: Ingest data

Load source files
  • Create the core_dev catalog
  • Create a new volume called source
Create schema and tables
  • Create bronze, silver and gold schemas
  • Create 6 tables based on the original source file names
    • crm_cust_info
    • crm_prd_info
    • crm_sales_details
    • erp_cust_az12
    • erp_loc_a101
    • erp_px_cat_g1v2
Insert records from source files
Methods to ingest
  • Bulk Insert
    • Massively load raw files (txt, csv) directly into the database.
  • Insert
    • Load data row-by-row into the database.
  • Full Load
    • Rewrite all old data with the newest version of the data.
  • Incremental Load
    • Incremental batch load
      • Inserts data into the database at regular intervals.
    • Incremental stream load
      • Inserts data into the database when new data emerges or old data is updated.
This project uses a bulk insert, full load approach.
  • New versions of the data (in .csv format) are pulled into the source/ volume in Databricks.
  • The current tables are truncated and the newest version of the data is loaded into the tables in the bronze schema.
Stored Procedure
Precompiled SQL code that can be saved and reused.
  • Unlike a saved query: Stored procedures
    • can be reused with new parameter inputs.
  • Unlike a UDF (user-defined function): Stored procedures
    • can return 0, 1 or many values
    • accept DML operations
    • can manage transactions
    • can call other stored procedures
COPY INTO in Databricks
This is a bulk-load statement that ingests files directly into a table (comparable to BULK INSERT in SQL Server). The operation is:
  • retryable
  • idempotent (applying it multiple times does not change the result beyond the initial application)
This means files in the source location that have already been loaded are skipped by default (even if they have been modified since).
In this project, the load operation is wrapped in a stored procedure built around the COPY INTO statement.
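As an illustration, a hedged sketch of a full load for one bronze table with COPY INTO (the file path, options and table name are assumptions, not the project’s exact code):

```sql
-- Full-load sketch for one bronze table: empty it, then re-ingest the source file.
TRUNCATE TABLE core_dev.bronze.crm_cust_info;

COPY INTO core_dev.bronze.crm_cust_info
FROM '/Volumes/core_dev/bronze/source/cust_info.csv'   -- hypothetical path inside the source volume
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('force' = 'true');  -- reload the file even if it was ingested before (full load)
```
Leaving 'force' at its default of 'false' gives the idempotent behaviour described above: files that were already loaded are skipped.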
Alternative ingestion methods
This project adopts a simple ingestion method. I think there are more efficient ways to approach this, but I will adopt them in a different project. Some changes I can think of as of now:
  • Skip the source volume and pull data directly into bronze. This halves the storage needed for raw data.
  • Use incremental load instead of full load. It is much faster and more storage efficient (it comes with some downsides, though).

Docs: Data Flow

Data Flow
Describes the data lineage (data dependency through each layer).
[Data flow diagram]

Build Silver Layer

Workflow

  • Analyze: Explore & understand data
    • Find issues in bronze data tables and log them in the bronze/profile_bronze script.
    • Find table relationships and document in an integration diagram.
  • Code: Ingest data
    • Define silver tables’ schema.
    • Standardize silver data and insert into tables based on findings from bronze/profile_bronze.
  • Code: Data quality checks
    • Perform manual data quality checks on the newly created silver tables, store scripts in scripts/silver/test_silver.
  • Code: Automate data quality checks
    • Create automated tests on silver data in dbt.
  • Docs: Document & do versioning in git

Analyze: Explore & Understand Data

bronze Table Investigations
In this part, I carefully investigate each table, column and column combination to understand:
  • Data Uniqueness Constraint
  • Data Validity
  • Table Relationships
Main insights are described below. Read the scripts/bronze/profile_bronze script for more details.
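As an example of the kind of check kept in that script, here is a sketch of a uniqueness/null-key query (column names taken from the findings below):

```sql
-- Expect no rows if cst_id is a valid, unique key; duplicates or nulls flag a data issue.
SELECT cst_id, COUNT(*) AS record_count
FROM core_dev.bronze.crm_cust_info
GROUP BY cst_id
HAVING COUNT(*) > 1 OR cst_id IS NULL;
```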

crm_cust_info
  • About: Customer demographic data from the CRM system.
  • Structure: Non-standard, append-only structure (pseudo SCD Type 2 behavior).
    • No exact duplicate rows.
    • cst_id and cst_key are not unique due to multiple records over time.
    • Multiple records exist for the same (cst_id, cst_key) combination, differentiated by cst_create_date.
    • Changes across versions are incremental: null values are progressively filled with valid values (no overwrites).
    • No explicit versioning columns (e.g., start/end dates).
  • Issues:
    • Rows with cst_id IS NULL contain invalid and inconsistent data.
    • Redundant historical records where earlier versions contain mostly null values.
  • Conclusion:
    • Remove rows with null cst_id (data errors).
    • Deduplicate by keeping only the latest record per (cst_id, cst_key) based on cst_create_date.
    • Treat the table as SCD Type 0 in the Silver layer (no history retained).

crm_prd_info
  • About: Product master data with historical tracking.
  • Structure: Proper SCD Type 2 implementation.
    • prd_key represents the business entity (product).
    • prd_id represents a specific version of the product over time.
    • prd_start_dt and prd_end_dt define the validity period.
    • Active records have prd_end_dt IS NULL.
    • One prd_key can have multiple prd_id values (multiple versions).
  • Data Quality Notes:
    • No duplicate rows or duplicate (prd_id, prd_key) combinations.
    • Some invalid date ranges (prd_start_dt > prd_end_dt).
    • Some null or missing values in attributes (e.g., prd_cost).
  • Conclusion:
    • Retain full history (valid SCD Type 2 design).
    • Handle invalid date ranges and null values during transformation.

crm_sales_details
  • About: Sales transaction data.
  • Structure: Flattened fact table.
    • No duplicate rows.
    • No single-column primary key.
    • Grain: one row per product per order per customer.
    • Composite uniqueness: (sls_ord_num, sls_cust_id, sls_prd_key).
  • Relationships:
    • One customer → many orders (1:M).
    • One order → many products (1:M).
    • Products ↔ orders form an M:N relationship (resolved in this table).
  • Conclusion:
    • Acts as a fact table with defined grain.
    • Suitable for direct transformation into a star schema fact table.
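The composite grain identified above can be verified with a query along these lines (a sketch, not the project’s exact script):

```sql
-- Expect zero rows: each (order, customer, product) combination should appear only once.
SELECT sls_ord_num, sls_cust_id, sls_prd_key, COUNT(*) AS record_count
FROM core_dev.bronze.crm_sales_details
GROUP BY sls_ord_num, sls_cust_id, sls_prd_key
HAVING COUNT(*) > 1;
```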

erp_cust_az12
  • About: Customer attributes (e.g., gender, birthdate) from ERP.
  • Structure: Normalized table with unique cid.
  • Issues:
    • cid format is inconsistent:
      • Contains varying prefixes.
      • Does not directly match cst_id or cst_key in CRM.
    • Matching requires transformation (e.g., substring extraction).
    • Partial overlaps and mismatches exist between ERP and CRM identifiers.
    • Duplicate cid values exist under different formats.
  • Conclusion:
    • Normalize cid by removing prefixes to align with CRM identifiers.
    • Expect partial mismatches during integration.
    • Use transformation logic to standardize joins with CRM data.
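A hedged sketch of what that transformation logic could look like; the 'NAS' prefix is a hypothetical example, since the notes above only state that prefixes vary:

```sql
-- Strip an assumed prefix from cid so it can be joined against the CRM keys.
SELECT
    CASE
        WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LENGTH(cid))  -- drop the prefix
        ELSE cid
    END AS cid
FROM core_dev.bronze.erp_cust_az12;
```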

erp_loc_a101
  • About: Customer location data (country).
  • Structure: Normalized table with unique cid.
  • Issues:
    • cid contains formatting inconsistencies (e.g., dashes).
    • Requires cleaning (e.g., removing the dashes) before joining.
    • Matches with CRM data are mostly consistent after normalization, with minor exceptions due to upstream data issues.
  • Conclusion:
    • Clean cid (remove special characters) before integration.
    • Join primarily with CRM (cst_key) after normalization.
    • Treat unmatched records as data quality issues.

erp_px_cat_g1v2
  • About: Product category and subcategory reference data.
  • Structure: Normalized lookup table.
    • id represents a subcategory (derived from category + subcategory codes).
    • No duplicate rows or IDs.
  • Relationships:
    • id maps to prefixes of prd_key in crm_prd_info.
  • Issues:
    • Some products (e.g., category CO-PE) do not have matching category records.
    • Indicates that this table may be incomplete or outdated.
  • Conclusion:
    • Use as a category lookup where matches exist.
    • Handle unmatched products separately (data gap).
    • Consider this table as potentially stale reference data.
Data Integration Diagram
After investigating the table relationships, I made the diagram below to reference when doing data cleaning for this layer.
The diagram uses the Crow’s Foot notation to describe relationships between tables.
[Data integration diagram]

Code: Ingest data

Define silver tables’ schema.
The tables in this layer normalize values from the bronze layer, so column formats and types change. I decide on these changes based on the findings from the bronze profiling and communication with stakeholders.
Furthermore, to better track when data was loaded, I added a metadata column, dwh_create_date, which records the date when each record was created in the silver layer.
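A sketch of what one silver table definition could look like with that metadata column; the business columns are assumed from the profiling notes, not copied from the project:

```sql
-- Silver table: same grain as bronze, but with standardized types and a metadata column.
CREATE TABLE IF NOT EXISTS core_dev.silver.crm_cust_info (
    cst_id             INT,
    cst_key            STRING,
    cst_firstname      STRING,
    cst_lastname       STRING,
    cst_marital_status STRING,
    cst_gndr           STRING,
    cst_create_date    DATE,
    dwh_create_date    DATE   -- metadata column, populated with current_date() at load time
);
```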

Data Cleansing

Types of data transformations
This diagram summarizes the key operations in data transformation.
Source: Data with Baraa.
crm_cust_info
  • Remove invalid data
    • Remove null cst_id
    • Remove duplicates (historical customer records)
    • Handle missing data (using a default value)
  • Standardize data
    • Map coded values to user-friendly descriptions
    • Remove unwanted spaces in string values
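Taken together, a hedged sketch of this cleansing logic (column names and code mappings are assumptions based on the profiling notes above):

```sql
-- Full load of silver.crm_cust_info: keep only the latest, valid record per customer.
TRUNCATE TABLE core_dev.silver.crm_cust_info;

INSERT INTO core_dev.silver.crm_cust_info
SELECT
    cst_id,
    cst_key,
    TRIM(cst_firstname) AS cst_firstname,        -- remove unwanted spaces
    TRIM(cst_lastname)  AS cst_lastname,
    CASE UPPER(TRIM(cst_marital_status))
        WHEN 'S' THEN 'Single'
        WHEN 'M' THEN 'Married'
        ELSE 'n/a'                               -- default value for missing data
    END AS cst_marital_status,                   -- map coded values to descriptions
    CASE UPPER(TRIM(cst_gndr))
        WHEN 'F' THEN 'Female'
        WHEN 'M' THEN 'Male'
        ELSE 'n/a'
    END AS cst_gndr,
    cst_create_date,
    current_date() AS dwh_create_date
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS rn
    FROM core_dev.bronze.crm_cust_info
    WHERE cst_id IS NOT NULL                     -- remove invalid rows
) AS latest
WHERE rn = 1;                                    -- keep only the latest record per customer
```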

crm_prd_info
  • Standardize data
    • Map coded values to user-friendly descriptions
    • Cast to a suitable type
  • Remove invalid data
    • Handle missing data (using a default value)
  • Data enrichment (add new values to the dataset)

crm_sales_details
  • Standardize data
    • Convert to standard convention
    • Replace invalid values with placeholder
    • Cast to a suitable type
    • Derive valid values from upstream columns
  • Remove invalid data
    • Handle missing data (using a default value)

erp_cust_az12
  • Standardize data
    • Convert to standard convention
    • Map coded values to user-friendly descriptions
  • Remove invalid data

erp_loc_a101
  • Standardize data
    • Map coded values to user-friendly descriptions
    • Remove unwanted spaces
Business Rules
  • Total Sales = Sum of (Quantity * Price)
  • Negatives, zeros and nulls are not allowed.
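A hedged sketch of how this rule could be applied while standardizing crm_sales_details (column names are assumptions):

```sql
-- Recompute the sales amount whenever it is null, non-positive or inconsistent with quantity * price.
SELECT
    sls_ord_num,
    sls_quantity,
    sls_price,
    CASE
        WHEN sls_sales IS NULL OR sls_sales <= 0
             OR sls_sales != sls_quantity * ABS(sls_price)
        THEN sls_quantity * ABS(sls_price)       -- derive a valid value from quantity and price
        ELSE sls_sales
    END AS sls_sales
FROM core_dev.bronze.crm_sales_details;
```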

Code: Automate Data Quality Checks

dbt
An open-source tool that brings software engineering best practices, such as testing, versioning and documentation, to transforming structured data.
Benefits of using dbt
  • Data platform agnostic: dbt combines SQL with Jinja templating and connects to different data platforms through adapters, so any script (such as a test) can be executed regardless of the platform.
  • Automated testing: avoids faults and saves time when creating models. A table will not be created if it does not pass its mandatory tests, which ensures the business does not make decisions based on erroneous data.
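As an example, a dbt singular test can encode the business rule above as plain SQL; this is a sketch that assumes the silver tables are declared as a dbt source called silver:

```sql
-- tests/assert_sales_positive.sql (hypothetical file name)
-- dbt marks the test as failed if this query returns any rows.
SELECT *
FROM {{ source('silver', 'crm_sales_details') }}
WHERE sls_sales IS NULL
   OR sls_sales <= 0
```
Running dbt test then executes every such check automatically as part of the workflow.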

Build Gold Layer

Workflow

  • Analyze: Explore & Understand the Business Object
  • Code: Data Integration
    • Build the Business Object
    • Choose type: Dimension vs. Fact
    • Rename to friendly names
  • Validate: Data Integration Checks
  • Docs: Document & do versioning in git
    • Data Model
    • Data Catalog
    • Data Flow

Analyze

Data Modeling
The process of taking raw data and processing and structuring it in a meaningful way.
Types of Data Models
  • Conceptual Model: Represents core entities in the business and how they relate to one another. Info: entities, relationships.
  • Logical Model: Represents data constraints, entity names and relationships in a diagram (platform-agnostic). Info: all of the previous level + attributes, keys, cardinality.
  • Physical Model: Implement the logical model over a specific database technology. Info: all in previous level + data types, SQL tables, indexes, constraints, etc.
Source: Data with Baraa.
Types of Data Schemas
  • Star Schema: Has a central fact table surrounded by denormalized dimension tables.
    • Pros: Simple and easy to understand and analyze.
    • Cons: Dimensions may contain duplicates, so data size grows faster → slower queries.
  • Snowflake Schema: Has a central fact table surrounded by dimensions that are normalized into multiple related tables.
    • Pros: Optimizes storage by reducing data redundancy.
    • Cons: More complex to understand and analyze.
Table Types
  • Dimension tables: Descriptive information that gives context to data.
  • Fact tables: Quantitative information that represents events.
This project chooses a Star Schema because the data size is small, so the storage tradeoff for ease of use is inexpensive.
Data Integration Model
[Data integration model diagram]
Data Model
[Data model diagram]

Code

Data lookup
Joining the fact table with the dimension tables to look up the relevant surrogate keys (sketched below).
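A hedged sketch of that lookup for the sales fact view (column names are assumptions consistent with the naming conventions above, not the project’s exact code):

```sql
-- Replace source identifiers with surrogate keys from the dimension views.
CREATE OR REPLACE VIEW core_dev.gold.fact_sales AS
SELECT
    sd.sls_ord_num  AS order_number,
    pr.product_key,                      -- surrogate key looked up from dim_product
    cu.customer_key,                     -- surrogate key looked up from dim_customer
    sd.sls_order_dt AS order_date,
    sd.sls_sales    AS sales_amount,
    sd.sls_quantity AS quantity,
    sd.sls_price    AS price
FROM core_dev.silver.crm_sales_details AS sd
LEFT JOIN core_dev.gold.dim_product  AS pr ON sd.sls_prd_key = pr.product_number
LEFT JOIN core_dev.gold.dim_customer AS cu ON sd.sls_cust_id = cu.customer_id;
```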

Docs: Data Catalog

Data Catalog
A centralized inventory and management system for metadata. Stakeholders can use it for better understanding of an organization’s data.
A data catalog is a must-have for efficient work with shared data within an organization.
In this project, the data catalog is written as a .md (markdown) file, clarifying the:
  • purpose and structure of each table
  • column names, types and description

Docs: Data Flow Diagram

Data Flow Diagram
Represents the dependency between data collections (tables) in different layers within a system. It shows how data moves within a system.
The data flow diagram below shows how data moves within the warehouse. Each subsequent layer is directly dependent on the previous layer and indirectly dependent on the layers before that. This improves traceability when facing errors in a layer.
[Data flow diagram]

References

Sources

This project builds upon materials from the course by Data with Baraa, including datasets, code patterns, and architectural concepts. Refer to the repository’s license file for details on usage and distribution.

Overall Review

Overall, I think this is a wonderful course that gives a hands-on overview of the process of creating a Data Warehouse. Baraa did an excellent job tying the related concepts together in an intuitive way.

Highlights

I have done similar database and data warehouse projects in the past, but the thing that stands out to me is Baraa’s presentation skills. Every step is visually or textually described with a high level of detail, just like in an industry project. I have learned a lot and will definitely apply this mindset of clear communication in my next projects.

Next Steps

I’m excited to build more complex data systems. For future projects, I will adopt a different:
  • Loading technique
  • SCD handling technique
  • Workflow orchestration technique
  • Performance optimization technique

Contributions

I have made several changes to the original project to match my learning goals & industry needs. The changes are described below.

Tech Stack Changes

  • Data Platform
    • Migrated from SQL Server to Databricks for scalability and modern data workflows.
  • Data Quality & Testing
    • Integrated dbt to automate data validation and enforce data quality constraints.
  • Version Control
    • Combined Git integration in Databricks and VSCode:
      • Databricks for data processing and SQL development
      • VSCode for dbt workflows and project management
  • Environment Management
    • Introduced uv for Python dependency management to ensure reproducibility.

Data Engineering Changes

  • Added data profiling scripts to systematically analyze source data.
  • Introduced automated testing via dbt and environment management via uv.
  • Identified additional data quality issues beyond the original pipeline.

Documentation Enhancements

  • Expanded documentation across all layers:
    • Data architecture
    • Data flow
    • Data modeling
  • Added detailed explanations for:
    • Data engineering concepts
    • Transformation logic
    • Design decisions
    • Workflow structure
That concludes the project for now. Happy learning!
