SQL Databricks Data Warehouse
Role
Data Engineer
Keywords
Year
2026

Table of Contents
- About
- Introduction
- Requirement Analysis
- Data Architecture Design
- Project Start
- Build the Data Warehouse
- References
- Contributions
- Other Projects
About
Github
Tech Stack
- Databricks
- dbt
- SQL
Introduction
Hi, I’m Mai, welcome to this project!
In this project, I will be building a Data Warehouse in Databricks using a Medallion architecture (bronze-silver-gold). This architecture ensures clear separation of data layers and traceability. It is accompanied by data tests in dbt, which performs automated data quality checks for prompt error handling.
Each step in the process is well-documented with descriptions, diagrams and more info on the technique used.
Check out the Github link for related files above.
*This project inherits many concepts from the course by Data with Baraa. Go to the References section to learn more about how to use the project, and to the Contributions section to see what modifications I’ve made to the original project to align it with modern data workflows.
Requirement Analysis
Context
Flexi is a global fashion brand with existing operational departments such as Finance, Inventory and Marketing.
It uses two management systems:
- ERP - SAP: for internal operations management
- CRM - Salesforce: for customer interactions management
These systems operate independently, with limited integration and data sharing.
The company’s CEO wants to optimize company performance in various areas to cut costs and increase revenue. A new Data division is founded with the purpose of doing Business Intelligence, with custom & advanced analytics.
🎯 Hence, the first task of the Data Team is to build a Data Warehouse, incorporating data from ERP and CRM sources.
Project Requirements
Objective
Develop a Data Warehouse using Databricks to consolidate sales data, allowing analytical reporting and informed decision-making.
Specifications
- Data Sources: Import data from two source systems (ERP and CRM), provided as CSV files.
- Data Pre-processing: Clean and resolve data quality issues prior to analysis.
- Integration: Combine both data sources into a single, user-friendly data model designed for analytical queries.
- Scope: Focus on the latest dataset only, historization of data is not required.
- Documentation: Provide clear documentation of the data model to support analytics teams and business stakeholders.
Data Architecture Design
This project chooses a Data Warehouse with a Medallion Architecture for the following reasons:
- Data Warehouse
- Suitable for structured, relational data and analytical workloads.
- Medallion Architecture
- Provides clear separation of concerns and progressive data refinement across layers.
Choose Architecture Type
There are several common types of data architectures, each with its own pros and cons:

| Type | Description | Pros ✅ | Cons ❌ | Notes |
| --- | --- | --- | --- | --- |
| Data Warehouse | Structured data storage (SQL, tables, schemas) | High performance for analytics | Rigid schema requirements | Solid foundation for BI and business reporting. |
| Data Lake | Raw storage for structured and unstructured/semi-structured data (JSON, logs, etc.) | Flexible schema-on-read | Hard to manage (can become a data swamp); no ACID transactions | More flexible; used when we have mixed data types (logs, images, videos, etc.). |
| Data Lakehouse | Hybrid (structured + unstructured, with warehouse features) | Combines data warehouse and data lake benefits; supports ACID transactions | Complex to manage | |
| Data Mesh | Decentralized data storage | Suitable for large organizations where each team has its own data use cases | Complex governance | Each department builds its own data product and shares it with other teams. |
Choose Data Warehouse Approach
Here are 4 common types of approaches to building a Data Warehouse:
- Inmon Approach
- Sources → Staging → Enterprise Data Warehouse (3NF) → Data Marts → BI Layers
- EDW: the central Enterprise Data Warehouse, modeled in normalized (3NF) form.
- Data Marts: Small subsets of the EDW, ready to be consumed for reporting. Ex: Data marts for customers, sales, products, etc.
- Slower, more complex to model
- Kimball Approach
- Sources → Staging → Data Marts → BI Layers
- Faster shipping speed
- Tradeoff in data consistency
- Data Vault
- Sources → Staging → Raw Vault → Business Vault → Data Marts → BI Layers
- Splits the central layer into more layers
- Raw Vault: Original data
- Business Vault: All business rules and transformations to prepare for the data marts.
- Medallion Architecture
- Sources → Bronze → Silver → Gold → BI Layers
- Bronze: As-is data, similar to staging layer.
- Silver: Clean and transformed data.
- Gold: Business-level data models (similar to data marts), can build different business-ready objects. Used for reporting, ML, AI, etc.
Design Data Layers
| Aspect | Bronze Layer | Silver Layer | Gold Layer |
| --- | --- | --- | --- |
| Definition | Raw, untouched data | Clean, standardized data | Business-ready data |
| Objective | Traceability & debugging | Prepare data for analysis | Provide data to be consumed by reporting & analytics |
| Object Type | Tables | Tables | Views (virtual, dynamic, faster speed) |
| Load Method | Full load (truncate & insert) | Full load (truncate & insert) | None |
| Data Transformation | None | Data cleaning, data standardization, data normalization, derived columns, data enrichment | Business transformations: data integration, data aggregation, business logic & rules |
| Data Modelling | None | None | Star schema, aggregated objects, flat tables |
| Target Audience | Data engineers | Data analysts, data engineers | Data analysts, business users |
Separation of Concerns
- Break complex systems down into smaller parts
- Each part is responsible for a specific task
- Responsibilities must not be duplicated across components
Draw the Data Architecture

Project Start
General Principles
- Naming convention: snake_case (lowercase letters with underscores separating words).
- Language: English
- Avoid reserved words
Naming Conventions
Set of rules or guidelines for naming anything in the project.
- Database
- Schema
- Tables
- Stored procedures
Catalog Naming Conventions
- All catalog names must start with the catalog’s primary function and be suffixed with the environment type.
- <function>_<environment>
- Ex: core_dev
- The data warehouse for core operations, used in development.
Table Naming Conventions
Bronze Rules
- All names must start with the source system name.
- Table names must correspond to their original name.
- Entity names are not abbreviated, are nouns, and are in singular forms.
- Ex: crm_cust_info
Silver Rules
- All names must start with the source system name.
- Table names must correspond to their original name.
- Entity names are not abbreviated, are nouns, and are in singular forms.
- <source_system>_<entity>
- Ex: crm_cust_info
Gold Rules
- Use meaningful, business-aligned names for tables, starting with the category prefix.
- Entity names are not abbreviated, are nouns, and are in singular forms.
- <category>_<entity>
- Ex: dim_customer
Glossary of Category Patterns
Pattern | Meaning | Examples |
dim_ | Dimension table | dim_customer, dim_product |
fact_ | Fact table | fact_sales |
agg_ | Aggregated table | agg_customer, agg_sales_monthly |
Column Naming Conventions
Surrogate Keys
- All primary keys in dimension tables must use the suffix _key
- <entity_name>_key
- Ex: customer_key → surrogate key in the dim_customer table
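To illustrate the convention, here is a minimal sketch of how a surrogate key such as customer_key could be minted in the gold layer (ROW_NUMBER() is one simple option; the source and ordering columns are assumptions):

```sql
-- Minting a surrogate key for dim_customer
SELECT
  ROW_NUMBER() OVER (ORDER BY cst_id) AS customer_key,  -- surrogate key (suffix _key)
  cst_id                              AS customer_id,   -- business key from the source
  cst_firstname                       AS first_name
FROM core_dev.silver.crm_cust_info;
```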
Technical Columns
- All technical columns must start with the prefix dwh_, followed by a descriptive name indicating the column’s purpose.
- dwh_<column_name>
- Ex: dwh_load_date → Column used to store the date when the record was loaded, generated by the system.
Stored Procedure
All stored procedures used for loading data must follow the naming pattern:
- load_<layer>
- Ex: load_bronze → stored procedure for loading data into the bronze layer.
Build the Data Warehouse
Create the Volumes, Catalog and Schemas
Build Bronze Layer
Workflow
- Analyze: Interview source system experts
- Code: Ingest data
- Load source files
- Create data schema
- Insert records (full load) from source files
- Code: Data completeness & schema checks
- Docs: Document & do versioning in git
Analyze: Source Systems
Asking the right questions helps avoid mistakes.
- Business Context & Ownership
- Who owns the data?
- Understand the source of the data
- What business process does it support?
- Understand the importance of the data
- System & Data documentation
- Save time when building data models
- Data Model & Data Catalog
- Save time when building data models by reading existing table and column descriptions.
- Architecture & Technology Stack
- How is data stored? (SQL Server, Oracle, AWS, Azure, etc.)
- What are the integration capabilities? (API, Kafka, File Extract, Direct DB, etc.)
- Extract & Load
- Incremental vs. Full Loads?
- Data Scope & Historical Needs
- Whether we need all data or just a subset of the entire data history.
- Whether we need to build data history.
- What is the expected size of the extracts? (MB, GB, TB)
- Understand whether we have the right tools
- Are there any data volume limitations?
- Conform to the limitations and capabilities of the source system.
- How to avoid impacting the source system’s performance?
- Authentication and Authorization (tokens, SSH keys, VPN, IP whitelisting, …)
Understanding the data structure is done by 2 primary methods:
- Doing interviews with source system owners (developers)
- Doing data profiling on your own
Since the source data files are available, I head straight to the ingestion method.
Code: Ingest data
Load source files
- Create the core_dev catalog
- Create a new volume called source
Create schema and tables
- Create the bronze, silver and gold schemas
- Create 6 tables based on the original source file names: crm_cust_info, crm_prd_info, crm_sales_details, erp_cust_az12, erp_loc_a101, erp_px_cat_g1v2
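A minimal setup sketch in Databricks SQL, assuming the names above (core_dev, source) and illustrative column types for one of the CRM tables:

```sql
-- One-time setup: catalog, schemas and the raw-file volume
CREATE CATALOG IF NOT EXISTS core_dev;
CREATE SCHEMA  IF NOT EXISTS core_dev.bronze;
CREATE SCHEMA  IF NOT EXISTS core_dev.silver;
CREATE SCHEMA  IF NOT EXISTS core_dev.gold;
CREATE VOLUME  IF NOT EXISTS core_dev.bronze.source;  -- holds the raw CSV extracts (placement is an assumption)

-- One of the six bronze tables; the column list is an assumption based on the CRM extract
CREATE TABLE IF NOT EXISTS core_dev.bronze.crm_cust_info (
  cst_id             INT,
  cst_key            STRING,
  cst_firstname      STRING,
  cst_lastname       STRING,
  cst_marital_status STRING,
  cst_gndr           STRING,
  cst_create_date    DATE
);
```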
Insert records from source files
Methods to ingest
- Bulk Insert
- Massively load raw files (txt, csv) directly into the database.
- Insert
- Load data row-by-row into the database.
- Full Load
- Rewrite all old data with the newest version of the data.
- Incremental Load
- Incremental batch load
- Inserts data into the database at regular intervals.
- Incremental stream load
- Inserts data into the database when new data emerges or old data is updated.
This project uses a bulk insert, full load approach.
- New versions of the data (in .csv format) are pulled into the source volume in Databricks.
- Current tables are dropped and the newest version of the data is rewritten into the tables in the bronze schema.
Stored Procedure
A precompiled block of SQL code that can be saved and reused.
- Unlike a saved query: Stored procedures
- can be reused with new parameter inputs.
- Unlike a UDF (user-defined function): Stored procedures
- can return 0, 1 or many values
- accept DML operations
- can manage transactions
- can call other stored procedures
COPY INTO in Databricks
This is a statement that bulk-loads files into a table (Databricks’ equivalent of a bulk insert). However, this operation is:
- retryable
- idempotent (the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application, by Wikipedia)
This means that files which have already been loaded from the source location are skipped on subsequent runs (even if the files have been modified since).
The operation that loads the bronze tables is saved as a stored procedure built around the COPY INTO statement.
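As a rough sketch of what the bronze load does for one table (the file path, header option and force flag are assumptions about this setup):

```sql
-- Full load pattern: empty the bronze table, then bulk-load the CSV from the source volume
TRUNCATE TABLE core_dev.bronze.crm_cust_info;

COPY INTO core_dev.bronze.crm_cust_info
FROM '/Volumes/core_dev/bronze/source/cust_info.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('force' = 'true');  -- reload the file even if COPY INTO has seen it before
```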
Alternative ingestion methods
This project adopts a simple ingestion method. I think there are more efficient ways to approach this, but I will adopt them in a different project. Some changes I can think of as of now:
- Skip the source volume and pull data directly into bronze. This halves the storage needed for raw data.
- Use incremental load instead of full load. It is much faster and more storage-efficient (though it comes with some downsides).
Docs: Data Flow
Data Flow
Describes the data lineage (data dependencies through each layer).

Build Silver Layer
Workflow
- Analyze: Explore & understand data
- Find issues in the bronze data tables and log them in the bronze/profile_bronze script.
- Find table relationships and document them in an integration diagram.
- Code: Ingest data
- Define the silver tables’ schema.
- Standardize the silver data and insert it into tables based on findings from bronze/profile_bronze.
- Code: Data quality checks
- Perform manual data quality checks on the newly created silver tables; store scripts in scripts/silver/test_silver.
- Code: Automate data quality checks
- Create automated tests on silver data in dbt.
- Docs: Document & do versioning in git
Analyze: Explore & Understand Data
bronze Table Investigations
In this part, I carefully investigate each table, column and column combination to understand their:
- Data Uniqueness Constraint
- Data Validity
- Table Relationships
- …
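As an example of the kind of check logged during profiling, a uniqueness query might look like this (catalog and column names as described below):

```sql
-- Profiling sketch: is cst_id a reliable unique identifier in bronze.crm_cust_info?
SELECT cst_id, COUNT(*) AS record_count
FROM core_dev.bronze.crm_cust_info
GROUP BY cst_id
HAVING COUNT(*) > 1 OR cst_id IS NULL;
```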
Main insights are described below. Read the scripts/bronze/profile_bronze script for more details.
crm_cust_info
- About: Customer demographic data from the CRM system.
- Structure: Non-standard, append-only structure (pseudo SCD Type 2 behavior).
- No exact duplicate rows.
- cst_id and cst_key are not unique due to multiple records over time.
- Multiple records exist for the same (cst_id, cst_key) combination, differentiated by cst_create_date.
- Changes across versions are incremental: null values are progressively filled with valid values (no overwrites).
- No explicit versioning columns (e.g., start/end dates).
- Issues:
- Rows with cst_id IS NULL contain invalid and inconsistent data.
- Redundant historical records where earlier versions contain mostly null values.
- Conclusion:
- Remove rows with null cst_id (data errors).
- Deduplicate by keeping only the latest record per (cst_id, cst_key) based on cst_create_date.
- Treat the table as SCD Type 0 in the Silver layer (no history retained).
crm_prd_info
- About: Product master data with historical tracking.
- Structure: Proper SCD Type 2 implementation.
- prd_key represents the business entity (product).
- prd_id represents a specific version of the product over time.
- prd_start_dt and prd_end_dt define the validity period.
- Active records have prd_end_dt IS NULL.
- One prd_key can have multiple prd_id values (multiple versions).
- Data Quality Notes:
- No duplicate rows or duplicate (prd_id, prd_key) combinations.
- Some invalid date ranges (prd_start_dt > prd_end_dt).
- Some null or missing values in attributes (e.g., prd_cost).
- Conclusion:
- Retain full history (valid SCD Type 2 design).
- Handle invalid date ranges and null values during transformation.
crm_sales_details
- About: Sales transaction data.
- Structure: Flattened fact table.
- No duplicate rows.
- No single-column primary key.
- Grain: one row per product per order per customer.
- Composite uniqueness: (sls_ord_num, sls_cust_id, sls_prd_key).
- Relationships:
- One customer → many orders (1:M).
- One order → many products (1:M).
- Products ↔ orders form an M:N relationship (resolved in this table).
- Conclusion:
- Acts as a fact table with defined grain.
- Suitable for direct transformation into a star schema fact table.
erp_cust_az12
- About: Customer attributes (e.g., gender, birthdate) from ERP.
- Structure: Normalized table with unique cid.
- Issues:
- cid format is inconsistent:
- Contains varying prefixes.
- Does not directly match cst_id or cst_key in CRM.
- Matching requires transformation (e.g., substring extraction).
- Partial overlaps and mismatches exist between ERP and CRM identifiers.
- Duplicate cid values exist under different formats.
- Conclusion:
- Normalize cid by removing prefixes to align with CRM identifiers.
- Expect partial mismatches during integration.
- Use transformation logic to standardize joins with CRM data.
erp_loc_a101
- About: Customer location data (country).
- Structure: Normalized table with unique cid.
- Issues:
cidcontains formatting inconsistencies (e.g., dashes ).- Requires cleaning (e.g., removing ) before joining.
- Matches with CRM data are mostly consistent after normalization, with minor exceptions due to upstream data issues.
- Conclusion:
- Clean cid (remove special characters) before integration.
- Join primarily with CRM (cst_key) after normalization.
- Treat unmatched records as data quality issues.
erp_px_cat_g1v2
- About: Product category and subcategory reference data.
- Structure: Normalized lookup table.
- id represents a subcategory (derived from category + subcategory codes).
- No duplicate rows or IDs.
- Relationships:
- id maps to prefixes of prd_key in crm_prd_info.
- Issues:
- Some products (e.g., category CO-PE) do not have matching category records.
- This indicates that the table may be incomplete or outdated.
- Conclusion:
- Use as a category lookup where matches exist.
- Handle unmatched products separately (data gap).
- Consider this table as potentially stale reference data.
Data Integration Diagram
After investigating the table relationships, I made the diagram below as a reference for data cleansing in this layer.
The diagram uses the Crow’s Foot notation to describe relationships between tables.

Code: Ingest data
Define the silver tables’ schema
The tables in this layer normalize values from the bronze layer, so column formats and types will change. I figure out the changes I want based on the findings from the bronze tables and communication with stakeholders.
Furthermore, to better track the version of the tables, I added a metadata column, dwh_create_date. It describes the date when the schema was created.
Data Cleansing
Types of data transformations
This diagram summarizes the key operations in data transformation.

crm_cust_info
- Remove invalid data
- Remove null cst_id
- Remove duplicates (historical customer records)
- Handle missing data (using a default value)
- Standardize data
- Map coded values to user-friendly descriptions
- Remove unwanted spaces in string values
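Put together, the crm_cust_info cleansing could look roughly like this (the raw code values 'M', 'S', 'F' and the 'n/a' default are assumptions):

```sql
INSERT INTO core_dev.silver.crm_cust_info
SELECT
  cst_id,
  cst_key,
  TRIM(cst_firstname) AS cst_firstname,          -- remove unwanted spaces
  TRIM(cst_lastname)  AS cst_lastname,
  CASE UPPER(TRIM(cst_marital_status))           -- map coded values to friendly descriptions
    WHEN 'M' THEN 'Married'
    WHEN 'S' THEN 'Single'
    ELSE 'n/a'                                   -- default value for missing data
  END AS cst_marital_status,
  CASE UPPER(TRIM(cst_gndr))
    WHEN 'M' THEN 'Male'
    WHEN 'F' THEN 'Female'
    ELSE 'n/a'
  END AS cst_gndr,
  cst_create_date,
  current_timestamp() AS dwh_create_date
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS rn
  FROM core_dev.bronze.crm_cust_info
  WHERE cst_id IS NOT NULL                       -- drop invalid rows
) AS latest
WHERE rn = 1;                                    -- keep only the latest record per customer
```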
crm_prd_info
- Standardize data
- Map coded values to user-friendly descriptions
- Cast to a suitable type
- Remove invalid data
- Handle missing data (using a default value)
- Data enrichment (add new values to the dataset)
crm_sales_details
- Standardize data
- Convert to standard convention
- Replace invalid values with placeholder
- Cast to a suitable type
- Derive valid values from upstream columns
- Remove invalid data
- Handle missing data (using a default value)
erp_cust_az12
- Standardize data
- Convert to standard convention
- Map coded values to user-friendly descriptions
- Remove invalid data
erp_loc
- Standardize data
- Map coded values to user-friendly descriptions
- Remove unwanted spaces
Business Rules
- Total Sales = Sum of (Quantity * Price)
- Negatives, zeros and nulls are not allowed.
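During the crm_sales_details load, the rule can be enforced by recomputing the value whenever it is violated. A sketch, assuming sls_sales, sls_quantity and sls_price column names:

```sql
SELECT
  sls_ord_num,
  sls_cust_id,
  sls_prd_key,
  CASE
    WHEN sls_sales IS NULL OR sls_sales <= 0
         OR sls_sales != sls_quantity * ABS(sls_price)
      THEN sls_quantity * ABS(sls_price)   -- recompute from quantity and price
    ELSE sls_sales
  END AS sls_sales
FROM core_dev.bronze.crm_sales_details;
```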
Code: Automate Data Quality Checks
dbt
An open-source tool for data engineering that brings software engineering best practices such as testing, versioning and documentation to structured data.
Benefits of using dbt
- Data platform agnostic: dbt templates SQL and compiles it for the target platform, and can connect to different data platforms, so the same scripts (such as tests) can be executed regardless of the platform.
- Automated testing: avoid faults and save time when creating models. A table will not be created if it does not pass mandatory tests, which ensures the business does not make decisions based on erroneous data.
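For illustration, a dbt singular test is just a SQL file under tests/ that returns the offending rows; the source configuration name below is an assumption:

```sql
-- tests/silver_crm_cust_info_cst_id_unique_not_null.sql (hypothetical file name)
-- dbt fails the run if this query returns any rows
select
    cst_id,
    count(*) as record_count
from {{ source('silver', 'crm_cust_info') }}
group by cst_id
having count(*) > 1 or cst_id is null
```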
Build Gold Layer
Workflow
- Analyze: Explore & Understand the Business Object
- Code: Data Integration
- Build the Business Object
- Choose type: Dimension vs. Fact
- Rename to friendly names
- Validate: Data Integration Checks
- Docs: Document & do versioning in git
- Data Model
- Data Catalog
- Data Flow
Analyze
Data Modeling
The process of taking raw data and processing and structuring it in a meaningful way.
Types of Data Models
- Conceptual Model: Represents core entities in the business and how they relate to one another. Info: entities, relationships.
- Logical Model: Represents data constraints, entity names and relationships in a diagram (platform-agnostic). Info: all in previous level + attributes, keys, cardinality.
- Physical Model: Implement the logical model over a specific database technology. Info: all in previous level + data types, SQL tables, indexes, constraints, etc.

Types of Data Schemas
- Star Schema: Has a central fact table surrounded by denormalized dimension tables.
- Pros: Simple and easy to understand and analyze.
- Cons: Dimensions may contain duplicates; data size rises faster → slower queries.
- Snowflake Schema: Has a central fact table surrounded by dimensions that are normalized into multiple related tables.
- Pros: Optimizes storage by reducing data redundancy.
- Cons: More complex to understand and analyze.
Table Types
- Dimension tables: Descriptive information that gives context to data.
- Fact tables: Quantitative information that represents events.
This project chooses a Star Schema because the data size is small and the storage tradeoff for ease of use is not expensive.
Data Integration Model

Data Model

Code
Data lookup
Joining the fact table with the dimension tables to look up the relevant surrogate keys.
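A sketch of that lookup for fact_sales (dimension and source column names are assumptions; the point is replacing source identifiers with the dimensions’ surrogate keys):

```sql
CREATE OR REPLACE VIEW core_dev.gold.fact_sales AS
SELECT
  dc.customer_key,                      -- surrogate key looked up from dim_customer
  dp.product_key,                       -- surrogate key looked up from dim_product
  s.sls_ord_num  AS order_number,
  s.sls_sales    AS sales_amount,
  s.sls_quantity AS quantity,
  s.sls_price    AS price
FROM core_dev.silver.crm_sales_details AS s
LEFT JOIN core_dev.gold.dim_customer AS dc
  ON s.sls_cust_id = dc.customer_id
LEFT JOIN core_dev.gold.dim_product AS dp
  ON s.sls_prd_key = dp.product_number;
```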


Docs: Data Catalog
Data Catalog
A centralized inventory and management system for metadata. Stakeholders can use it for a better understanding of an organization’s data.
A data catalog is a must-have for efficient work with shared data within an organization.
In this project, the data catalog is written as a .md (markdown) file, clarifying the:
- purpose and structure of each table
- column names, types and descriptions
Docs: Data Flow Diagram
Data Flow Diagram
Represents the dependency between data collections (tables) in different layers within a system. It shows how data moves within a system.
The data flow diagram below shows how data moves within the warehouse. Each subsequent layer is directly dependent on the previous layer and indirectly dependent on the layers before. This improves traceability when facing errors in a layer.

References
Sources
This project builds upon materials from the course by Data with Baraa, including datasets, code patterns, and architectural concepts. Refer to the repository’s license file for details on usage and distribution.
Overall Review
Overall, I think this is a wonderful course that gives a hands-on overview of the process of creating a Data Warehouse. Baraa did an excellent job tying the related concepts together in an intuitive way.
Highlights
I have done similar database and data warehouse projects in the past, but the thing that stands out to me is Baraa’s presentation skills. Every step is visually or textually described with a high level of detail, just like in an industry project. I have learned a lot and will definitely apply this mindset of clear communication in my next projects.
Next Steps
I’m excited to build more complex data systems. For future projects, I will adopt a different:
- Loading technique
- SCD handling technique
- Workflow orchestration technique
- Performance optimization technique
- …
Contributions
I have made several changes to the original project to match my learning goals & industry needs. The changes are described below.
Tech Stack Changes
- Data Platform
- Migrated from SQL Server to Databricks for scalability and modern data workflows.
- Data Quality & Testing
- Integrated dbt to automate data validation and enforce data quality constraints.
- Version Control
- Combined Git integration in Databricks and VSCode:
- Databricks for data processing and SQL development
- VSCode for dbt workflows and project management
- Environment Management
- Introduced uv for Python dependency management to ensure reproducibility.
Data Engineering Changes
- Added data profiling scripts to systematically analyze source data.
- Introduced automated testing via dbt and environment management via uv.
- Identified additional data quality issues beyond the original pipeline.
Documentation Enhancements
- Expanded documentation across all layers:
- Data architecture
- Data flow
- Data modeling
- …
- Added detailed explanations for:
- Data engineering concepts
- Transformation logic
- Design decisions
- Workflow structure
That concludes the project for now. Happy learning!
Other Projects

Automated Spreadsheet Dashboards
Google Sheets
Analytics Dashboard
Data Analyst
Automate Google Sheets to create dashboards that track teams’ sales leads, Marketing campaigns, sales overview and more.

Sequential Recommender System for Video Recommendations
Recommender Systems
Machine Learning
pytorch
API
FastAPI
Data Scientist
AI Engineer

Employee Attrition Dashboard & Analytics
Power BI
Python
People Analytics
Competition
Analytics Dashboard
Data Analyst
Dashboard that analyzes employee attrition root causes by leveraging a detailed analysis process, advanced Power BI features, and Machine Learning for Churn Prediction.

RAG Chatbot for Flower Shop Recommendations
RAG
MongoDB
Vector search
Flask
API
Python
AI Engineer
A Naive RAG chatbot that recommends flower shop products using semantic search on embedded product data. It features a Streamlit interface, a Scrapy-based data scraper; a backend including MongoDB vector search for retrieval, an LLM via OpenRouter; and a Flask API that connects the frontend with the backend.


Customer Growth using AI (Segmentation, Churn Prediction, CLV Prediction)
Machine Learning
Statistical Modelling
Python
Streamlit
Flask
Data Scientist
Perform RFM customer segmentation and model churn, survival, and customer lifetime value (CLV) using transactional data to identify high-priority customer groups.

TPC-H Data Warehouse Pipeline
Airflow
dbt
Docker
uv
Python
astronomer
Snowflake
Data Engineer
Modern ELT pipeline using Airflow, dbt, and Snowflake built on the TPC-H benchmark dataset (SF1).
