---
layout: handbook-page-toc
title: "dbt Guide"
description: "data build tool (dbt) Guide"
---

## On this page
{:.no_toc .hidden-md .hidden-lg}

- TOC
{:toc .toc-list-icons .hidden-md .hidden-lg}

{::options parse_block_html="true" /}

----

## Quick Links

[Primary Project](https://gitlab.com/gitlab-data/analytics/){:.btn .btn-purple-inv}
[dbt docs](https://dbt.gitlabdata.com/){:.btn .btn-purple-inv}

## What and why

[🌎💡](/handbook/business-ops/data-team/documentation)

dbt, short for [data build tool](https://www.getdbt.com/), is an [open source project](https://github.com/fishtown-analytics/dbt) for managing data transformations in a data warehouse. Once data is loaded into a warehouse, dbt enables teams to manage all of the data transformations required for driving analytics. It also comes with built-in testing and documentation, so we can have a high level of confidence in the tables we're generating and analyzing.

The following links give an excellent overview of what dbt is:

* [What, exactly, is dbt?](https://blog.getdbt.com/what--exactly--is-dbt-/) - A less technical overview for understanding the tool
* [What is dbt?](https://docs.getdbt.com/docs/introduction) - A bit more technical and straight from the docs

But why do we use dbt? There are several reasons. First, it is an open source tool with a vibrant community. Choosing an open source tool enables us to collaborate with the larger data community and solve problems faster than we could with a proprietary solution. Second, it was built with version control in mind. For GitLab, this is essential since we use the product for building and running the company. Third, it speaks the language of analysts: SQL. This increases the number of people who can contribute, since SQL is becoming a critical part of many people's jobs. Finally, it enables teams to move faster by integrating [testing and documentation](https://docs.getdbt.com/docs/testing-and-documentation) from the start.

For even more information about the basics of dbt, see our [data analyst onboarding issue template](https://gitlab.com/gitlab-data/analytics/blob/master/.gitlab/issue_templates/Data%20Onboarding.md#what-is-dbt).

At times, we rely on dbt packages for some data transformations. [Package management](https://docs.getdbt.com/docs/package-management) is built into dbt. A full list of available packages is on the [dbt Hub site](https://hub.getdbt.com).

## Running dbt

If you're interested in using dbt, the [dbt documentation has a great tutorial](https://tutorial.getdbt.com/tutorial/setting-up/) on getting set up to work on data from a fictional business called Jaffle Shop.

If you wish to use dbt and contribute to the data team project, you'll need access to our Snowflake instance, which can be requested via an [access request](/handbook/business-ops/it-ops-team/access-requests/).

### Configuration

* Ensure you have access to our Snowflake instance
* Ensure you have [Docker installed](https://docs.docker.com/docker-for-mac/)
* Ensure you have [Make](https://en.wikipedia.org/wiki/Make_(software)) installed (it should already be present on new Macs and with Xcode)
* Create a folder in your home directory called `.dbt`
* In the `~/.dbt/` folder there should be a `profiles.yml` file that looks like this [sample profile](https://gitlab.com/gitlab-data/analytics/blob/master/admin/sample_profiles.yml)
* The smallest possible warehouse should be stored as an environment variable. Our dbt jobs use `SNOWFLAKE_TRANSFORM_WAREHOUSE` as the variable name to identify the warehouse. The environment variable can be set in your `.bashrc` or `.zshrc` file as follows:
  * `export SNOWFLAKE_TRANSFORM_WAREHOUSE="ANALYST_XS"`
  * In cases where more compute is required, the variable can be overridden by adding `--vars '{warehouse_name: analyst_xl}'` to the dbt command
* Clone the [analytics project](https://gitlab.com/gitlab-data/analytics/)

Note that many of these steps are done in the [onboarding script](https://gitlab.com/gitlab-data/analytics/-/blob/master/admin/onboarding_script.sh) we recommend new analysts run.
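For orientation, a trimmed-down `profiles.yml` might look something like the sketch below. It is only an illustration: the profile name, account, user, role, and schema values are placeholders, and the [sample profile](https://gitlab.com/gitlab-data/analytics/blob/master/admin/sample_profiles.yml) linked above is the authoritative template. It shows one way the warehouse can be read from the `SNOWFLAKE_TRANSFORM_WAREHOUSE` environment variable using dbt's `env_var` function.

```yaml
# ~/.dbt/profiles.yml -- illustrative sketch only; copy the real sample profile from the analytics project
gitlab_snowflake:                # placeholder; must match the profile referenced in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your_account       # placeholder
      user: your_user             # placeholder
      authenticator: externalbrowser  # placeholder; see the sample profile for the actual auth setup
      role: YOUR_ROLE             # placeholder
      database: ANALYTICS         # placeholder
      warehouse: "{{ env_var('SNOWFLAKE_TRANSFORM_WAREHOUSE') }}"
      schema: yourname_scratch    # used as the prefix for your dev schemas
      threads: 8
```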
### Docker Workflow
{: #docker-workflow}

[🛠🐁](/handbook/business-ops/data-team/documentation)

To facilitate an easier workflow for analysts, and to abstract away some of the complexity around handling `dbt` and its dependencies locally, the main [analytics project](https://gitlab.com/gitlab-data/analytics/) supports using `dbt` from within a `Docker` container. We build the container from the [`data-image`](https://gitlab.com/gitlab-data/data-image) project. There are commands within the `Makefile` to facilitate this. If you ever have questions about the various `make` commands, run `make help` to get a list of them and what each one does.

Before your initial run (and whenever the containers get updated), make sure to run the following commands:

1. `make update-containers`
1. `make cleanup`

These commands ensure you get the newest versions of the containers and generally clean up your local `Docker` environment.

#### Using dbt

* Ensure you have the `DBT_PROFILE_PATH` environment variable set. This should already be set if you've used the [onboarding_script.sh](https://gitlab.com/gitlab-data/analytics/blob/master/admin/onboarding_script.sh), but if not, you can set it in your `.bashrc` or `.zshrc` by adding `export DBT_PROFILE_PATH="/`

## Style and Usage Guide

### Model Structure

As we transition to a more Kimball-style warehouse, we are improving how we organize models in the warehouse and in our project structure. The following sections are all top-level directories under the `models` directory, which is a dbt default. This structure is inspired by how Fishtown Analytics [structures their projects](https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355).
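As a rough sketch of the layout these sections describe (directory names follow the section headings below; the real project contains additional directories and files):

```
models/
├── sources/   # thin source models reading from the RAW database
├── staging/   # business-specific transformations, including fct_* and dim_* models
└── marts/     # OLAP cubes grouped by business unit
```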
**Legacy Structure**
{: .panel-heading}
Prior to our focus on Kimball dimensional modeling, we took inspiration from the BEAM* approach to modeling introduced in ["Agile Data Warehouse Design" by Corr and Stagnitto](https://books.google.com/books/about/Agile_Data_Warehouse_Design.html?id=TRWFmnv8jP0C&source=kp_book_description). Many of the existing models still follow that pattern. The information in this section is from previous iterations of the handbook.

* The goal of a (final) `_xf` dbt model should be a `BEAM*` table, which means it follows the business event analysis & modeling structure and answers the who, what, where, when, how many, why, and how question combinations that measure the business.
* `base models` - the only dbt models that reference source tables. Base models have minimal transformational logic (usually limited to filtering out rows with data integrity issues or rows actively flagged as not for analysis, and renaming columns for easier analysis). They live in the `analytics_staging` schema and are used in `ref` statements by `end-user models`.
* `end-user models` - dbt models used for analysis. The final version of a model is usually indicated with an `_xf` suffix when its goal is to be a `BEAM*` table, as described above. End-user models are found in the `analytics` schema.
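As a purely illustrative sketch of this legacy pattern (the model and column names here are hypothetical), an `_xf` model references base models with `ref` and rolls them up into an analysis-ready table answering the who/what/when/how many questions:

```sql
-- hypothetical legacy end-user model, e.g. subscriptions_xf.sql (analytics schema)
WITH accounts AS (

    SELECT * FROM {{ ref('zuora_account') }}        -- base model

), subscriptions AS (

    SELECT * FROM {{ ref('zuora_subscription') }}   -- base model

)

SELECT
  accounts.account_name,                                                   -- who
  subscriptions.subscription_status,                                       -- what
  DATE_TRUNC('month', subscriptions.created_date) AS subscription_month,   -- when
  COUNT(*)                                        AS subscription_count    -- how many
FROM subscriptions
INNER JOIN accounts
  ON subscriptions.account_id = accounts.account_id
GROUP BY 1, 2, 3
```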
#### Sources

All raw data will still be in the `RAW` database in Snowflake. These raw tables are referred to as `source tables` or `raw tables`. They are typically stored in a schema that indicates their original data source, e.g. `netsuite`.

Sources are defined in dbt using a `sources.yml` file.

* We use a variable to reference the database in dbt sources, so that if we're testing changes in a Snowflake clone, the reference can be set programmatically
* When a source table's name doesn't meet our usual conventions or has an unclear meaning, use an identifier to override the source table name ([docs on using identifiers](https://docs.getdbt.com/docs/using-sources#section-configuring-sources))

```yaml
# Good
tables:
  - name: bizible_attribution_touchpoint
    identifier: bizible2__bizible_attribution_touchpoint__c

# Bad
tables:
  - name: bizible2__bizible_attribution_touchpoint__c
```

##### Source Models

We are enforcing a very thin source layer on top of all raw data. This directory is where the majority of source-specific transformations will be stored. These are "base" models that pull directly from the raw data and do the prep work required to make facts and dimensions. They should do _only_ the following:

* Rename fields to user-friendly names
* Cast columns to appropriate types
* Minimal transformations that are 100% guaranteed to be useful for the foreseeable future. An example of this is parsing out the Salesforce ID from a field known to have messy data.
* Placement in a logically named schema

Even in cases where the underlying raw data is perfectly cast and named, there should still be a source model that enforces the formatting. This is for the convenience of end users, so they only have one place to look, and it makes permissioning cleaner in situations where this otherwise perfect data is sensitive.

The following should not be done in a source model:

* Removing data
* Joining to other tables
* Transformations that fundamentally alter the meaning of a column

For all intents and purposes, the source models should be considered the "raw" data for the vast majority of users.

Key points to remember:

* These models are written to a logically named schema in the `analytics` database based on the data source type. For example:
  * Zuora data stored in `raw.zuora` would have source models in `analytics.zuora`
  * GitLab.com data with tables stored in `raw.tap_postgres.gitlab_db_*` would have source models in `analytics.gitlab_dotcom`
  * Customers.gitlab.com data with tables stored in `raw.tap_postgres.customers_db_*` would have source models in `analytics.customers_db`
* These models should be organized by source - this will usually map to a schema in the raw database
* Only source models should select from source/raw tables
* Source models should not select from the `raw` database directly. Instead, they should reference sources with jinja, e.g. `FROM {{ source('bamboohr', 'job_info') }}`
* Only a single source model should be able to select from a given source table
* Source models should be placed in the `/models/sources/` directory

A diagram here illustrated this layering: raw tables (in schemas such as `version_db` and `zuora`) feed their respective source models (for example, a GitLab Users source model, a Version DB Pings source model, and a Zuora Account source model), which in turn feed further dbt models in the `analytics_staging` and `analytics` schemas.
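Putting that together, a source model is typically little more than a rename-and-recast pass over a single raw table. The sketch below is illustrative only (the file path and column names are hypothetical, loosely based on the Zuora account example above):

```sql
-- hypothetical source model, e.g. models/sources/zuora/zuora_account_source.sql
WITH source AS (

    SELECT *
    FROM {{ source('zuora', 'account') }}

), renamed AS (

    SELECT
      id                     AS account_id,
      accountnumber          AS account_number,
      name                   AS account_name,
      currency               AS account_currency,
      createddate::TIMESTAMP AS created_date
    FROM source

)

SELECT *
FROM renamed
```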
#### Staging

Prior to our implementation of Kimball modeling, almost all of our models would have fallen into the staging category. This directory is where the majority of business-specific transformations will be stored. This layer of modeling is considerably more complex than creating source models, and the models are highly tailored to the analytical needs of the business. This includes:

* Filtering irrelevant records
* Choosing the columns required for analytics
* Renaming columns to represent abstract business concepts
* Joining to other tables
* Executing business logic
* Modeling of fct_* and dim_* tables following the Kimball methodology

##### Use of Kimball Methodology

Fact and dimension tables are abstracted from the source data and represent entities and processes relevant to the business. The naming convention for fact and dimension tables:

* `fct_<verb>`, where the verb represents a real-world process that is occurring, e.g. sessions, transactions
* `dim_<noun>`, where the noun can be a person, place, or thing, e.g. customers, employees
* The plural form should be used

Conventions that should be followed for dimensional modeling:

* fct_ and dim_ models should be materialized as tables to improve query performance
* Models are tested and documented in a `.yml` file in the same directory as the models
* A CTE duplicated across two models often qualifies to become a separate fct_/dim_ table

Dimensional models are likely to be denormalized, as opposed to source models, making them easier to read and interpret, as well as allowing more efficient querying by reducing the number of joins.

#### Marts

Marts models describe business entities and processes. They are often grouped by business unit: marketing, finance, product. This directory is where OLAP cubes will be stored. When a model is in this directory, it communicates to business stakeholders that the data is cleanly modeled and ready for querying.

### General

* Model names should be as obvious as possible and should use full words where possible, e.g. `accounts` instead of `accts`.
* Documenting and testing new data models is part of the process of creating them. A new dbt model is not complete without tests and documentation.
* Follow the naming convention of `analysis type, data source (in alpha order, if multiple), thing, aggregation`

```sql
-- Good
retention_sfdc_zuora_customer_count.sql

-- Bad
retention.sql
```

* All `{{ ref('...') }}` statements should be placed in CTEs at the top of the file. (Think of these as import statements; see the sketch after this list.)
  * This does not imply that all CTEs that have a `{{ ref('...') }}` should be `SELECT *` only. It is OK to do additional manipulations in a CTE with a `ref` if it makes sense for the model
* If you want to separate out some complex SQL into a separate model, you absolutely should, to keep things DRY and easier to understand. The config setting `materialized='ephemeral'` is one option, which essentially treats the model like a CTE. However, if you need to materialize it as a table for performance reasons, consider setting the schema config to `temporary`. This builds the table in a temporary schema which is then dropped at the end of the run. This way you get the performance benefits of a table, but it won't be available for querying by end users. Note that you cannot use `temporary` if there are view models downstream. Also, you will not be able to test the model because it won't exist.
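As referenced above, a sketch of the "`ref` statements as import CTEs" pattern might look like this (the model and column names are hypothetical):

```sql
-- import CTEs at the top of the file, one per ref
WITH zuora_accounts AS (

    SELECT * FROM {{ ref('zuora_account') }}

), sfdc_accounts AS (

    -- additional manipulation in an import CTE is fine when it makes sense for the model
    SELECT
      account_id,
      account_owner
    FROM {{ ref('sfdc_account') }}
    WHERE is_deleted = FALSE

), joined AS (

    SELECT
      zuora_accounts.account_id,
      zuora_accounts.account_name,
      sfdc_accounts.account_owner
    FROM zuora_accounts
    LEFT JOIN sfdc_accounts
      ON zuora_accounts.crm_id = sfdc_accounts.account_id

)

SELECT *
FROM joined
```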
#### Schema References (aka What goes where)

| Purpose | Production | Dev | Config |
|---|---|---|---|
| For querying & analysis | analytics | emilie_scratch_analytics | None |
| For modeling marts | analytics_staging | emilie_scratch_staging | `{{ config({ "schema": "staging"}) }}` |
| For modeling, but SENSITIVE | analytics_sensitive | emilie_scratch_analytics_sensitive | `{{ config({ "schema": "sensitive"}) }}` |
| Intermediate tables that are dropped | analytics_temporary | emilie_scratch_temporary | `{{ config({ "schema": "temporary"}) }}` |
| Source models used to build other models | analytics_source | emilie_scratch_temporary | `{{ config({ "schema": "source"}) }}` |

#### Model Configuration

There are multiple ways to provide configuration definitions for models. The [dbt docs for configuring models](https://docs.getdbt.com/docs/configuring-models) provide a concise explanation of the options.

Our guidelines for configuring models:

* The default materialization is `view`
* The default schema is `analytics`
* Disabling any model should always be done in the `dbt_project.yml` via the `enabled: false` declaration
* Configs should be applied in the smallest number of locations:
  * If <50% of the models in a directory require the same configuration, configure the individual models
  * If >=50% of the models in a directory require the same configuration, strongly consider setting a default in the `dbt_project.yml`, but think about whether that setting is a sensible default for any new models in the directory

#### Schema Name Generation

Within dbt it is possible to define custom schemas. The default behavior is documented in the [Using custom schemas section of the dbt documentation](https://docs.getdbt.com/docs/using-custom-schemas). In brief, the schema name is determined by a macro called `generate_schema_name`. We override the behavior of this macro with our own [`generate_schema_name` definition](https://gitlab.com/gitlab-data/analytics/-/blob/master/transform/snowflake-dbt/macros/utils/generate_schema_name.sql).

This macro takes the configuration supplied in the `profiles.yml` (target name and schema), as well as the schema configuration provided in the model config, to determine the final schema. Generally, if you're working locally, all schemas are prefixed with the schema defined in your `profiles.yml` (typically `yourname_scratch`). In production, only 4 names are given the `analytics_` prefix: `meta`, `sensitive`, `staging`, and `temporary`. Everything else is rendered as given. This means that if you provide `zuora` as the schema, the rendered schema will be `zuora`.
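To make that concrete, consider a hypothetical model configured with a custom schema (the model name is made up for illustration):

```sql
-- hypothetical model file: prep_subscription.sql
{{ config({ "schema": "staging" }) }}

SELECT *
FROM {{ ref('zuora_subscription') }}
```

In production this would be built as `analytics_staging.prep_subscription`, while locally (with a `profiles.yml` schema of `yourname_scratch`) it would be built as `yourname_scratch_staging.prep_subscription`.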
### Seeds
{: #seeds}

Seeds are a way to load data from csv files into our data warehouse ([dbt documentation](https://docs.getdbt.com/docs/seeds)). Because these csv files are located in our dbt repository, they are version controlled and code reviewable. This method is appropriate for loading static data which changes infrequently. A csv file that's up to ~1k lines long and less than a few kilobytes is probably a good candidate for use with the `dbt seed` command.

#### Organizing columns

When writing a base model, columns should have some logical ordering to them. We encourage these 4 basic groupings:

- Primary data
- Foreign keys
- Logical data
  - This group can be subdivided further if needed
- Metadata

Primary data is the key information describing the table. The primary key should be in this group, along with other relevant unique attributes such as name.

Foreign keys should be all the columns which point to another table.

Logical data is for additional data dimensions that describe the object in reference. For a Salesforce opportunity this would be the opportunity owner or contract value. Further logical groupings are encouraged if they make sense. For example, having a group of all the variations of contract value would make sense.

Within any group, the columns should be alphabetized by alias.

An exception to the grouping recommendation is when we control the extraction via a defined manifest file. A perfect example of this is our [gitlab.com manifest](https://gitlab.com/gitlab-data/analytics/blob/master/extract%2Fpostgres_pipeline%2Fmanifests%2Fgitlab_com_db_manifest.yaml), which defines which columns we extract from our application database. The base models for these tables can be ordered identically to the manifest, as it's easier to compare diffs and ensure accuracy between files.

- Ordered alphabetically by alias within groups

```sql
-- Good
SELECT
  id                   AS account_id,
  name                 AS account_name,

  -- Foreign Keys
  ownerid              AS owner_id,
  pid                  AS parent_account_id,
  zid                  AS zuora_id,

  -- Logical Info
  opportunity_owner__c AS opportunity_owner,
  account_owner__c     AS opportunity_owner_manager,
  owner_team_o__c      AS opportunity_owner_team,

  -- metadata
  isdeleted            AS is_deleted,
  lastactivitydate     AS last_activity_date
FROM table
```

- Ordered alphabetically by alias without groups

```sql
-- Less Good
SELECT
  id                   AS account_id,
  name                 AS account_name,
  isdeleted            AS is_deleted,
  lastactivitydate     AS last_activity_date,
  opportunity_owner__c AS opportunity_owner,
  account_owner__c     AS opportunity_owner_manager,
  owner_team_o__c      AS opportunity_owner_team,
  ownerid              AS owner_id,
  pid                  AS parent_account_id,
  zid                  AS zuora_id
FROM table
```

- Ordered alphabetically by original name

```sql
-- Bad
SELECT
  account_owner__c     AS opportunity_owner_manager,
  id                   AS account_id,
  isdeleted            AS is_deleted,
  lastactivitydate     AS last_activity_date,
  name                 AS account_name,
  opportunity_owner__c AS opportunity_owner,
  owner_team_o__c      AS opportunity_owner_team,
  ownerid              AS owner_id,
  pid                  AS parent_account_id,
  zid                  AS zuora_id
FROM table
```

### Testing

- Every model should be tested in a `schema.yml` file
- At a minimum, unique fields, not-null fields, and foreign key constraints should be tested (if applicable)
- The output of `dbt test` should be pasted into MRs
- Any failing tests should be fixed or explained prior to requesting a review
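For reference, a sketch of what such a `schema.yml` entry could look like (the model and column names are hypothetical; the `relationships` test covers the foreign key constraint):

```yaml
version: 2

models:
  - name: zuora_account
    description: Accounts from Zuora, one row per account
    columns:
      - name: account_id
        tests:
          - unique
          - not_null
      - name: crm_id
        tests:
          - relationships:
              to: ref('sfdc_account')
              field: account_id
```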
### Snapshots
{: #snapshots}

Snapshots are a way to take point-in-time copies of source tables. dbt has [excellent documentation](https://docs.getdbt.com/docs/snapshots) on how snapshots work. Take note of how we [talk about and define snapshots](https://about.gitlab.com/handbook/business-ops/data-team/#snapshots-definition).

#### Create snapshot tables with `dbt snapshot`

Snapshot definitions are stored in the [snapshots folder](https://gitlab.com/gitlab-data/analytics/tree/master/transform/snowflake-dbt/snapshots) of our dbt project. We have organized the different snapshots by data source for easy discovery.

The following is an example of how we implement a snapshot:

```sql
{% snapshot sfdc_opportunity_snapshots %}

    {{
        config(
          unique_key='id',
          strategy='timestamp',
          updated_at='systemmodstamp'
        )
    }}

    SELECT *
    FROM {{ source('salesforce', 'opportunity') }}

{% endsnapshot %}
```

Key items to note:

* The database and schema are configured in `dbt_project.yml`. The database is an environment variable, while the schema is set to `snapshots`.
* _Always_ select from a source table. Even if some deduplication is required, a source table must be selected from, as selecting from a downstream dbt model is prone to failure
* As snapshots are stored in `RAW`, your role will need to be explicitly granted access to the schema or tables
* Follow the naming convention `{source_name}_{source_table_name}_snapshots` for your file name
* Avoid any transformations in snapshots aside from deduplication efforts. Always clean data downstream
* Always prefer the `timestamp` strategy (over `check`) unless you don't have a reliable `updated_at` field. See the [documentation about strategies](https://docs.getdbt.com/docs/snapshots#section-how-does-dbt-know-which-rows-have-changed-)

##### Testing Snapshots

Testing of a snapshot can be done in a merge request using the `snapshots` CI job. Engineers should test locally using Airflow, as the proper environment variables are handled based on the git branch. Testing should NOT be done while on the master branch. It is not recommended to test locally by setting the `SNOWFLAKE_SNAPSHOT_DATABASE` environment variable. This should never be set to `RAW`, as it will overwrite production data.

#### Make snapshot tables available in the analytics data warehouse

Once a snapshot is taken, it becomes, and should be treated as, a [data source](/handbook/business-ops/data-team/dbt-guide/#sources). We currently follow the legacy method for generating models based on snapshots, which means we don't have source models for them. Base models for snapshots are available in the `/models/snapshots` folder of our dbt project.

Key items to note:

* Before writing a snapshot base model, don't forget to add it to the [sources.yml file](https://gitlab.com/gitlab-data/analytics/blob/master/transform/snowflake-dbt/models/snapshots/sources.yml) (keep this file sorted)
* The name of the table in the data warehouse should be consistent with our data warehouse design guidelines. Ideally we would like to stick to `{source_name}_{source_table_name}_snapshots` as our naming convention, but dbt doesn't allow duplicate file names in a project. In order to avoid the snapshot and the snapshot base model having the same name, we follow this pattern:
  * The name of the base model file is the name of the source snapshot table suffixed with `_base`. For example, we have a `gitlab_dotcom_members_snapshots` snapshot file [here](https://gitlab.com/gitlab-data/analytics/blob/master/transform/snowflake-dbt/snapshots/gitlab_dotcom_members_snapshots.sql) and a base model of this snapshot [here](https://gitlab.com/gitlab-data/analytics/blob/master/transform/snowflake-dbt/models/snapshots/gitlab_dotcom_members_snapshots_base.sql) named `gitlab_dotcom_members_snapshots_base`.
  * We use the [dbt config alias argument](https://docs.getdbt.com/docs/using-custom-aliases#section-usage) to rename the table by removing the `_base` suffix, keeping the table name clean
* If a base model built upon the snapshotted source table already exists, please re-use the query that has already been written and apply the following modifications:
  * Remove the deduplication process; it is not necessary
  * Always add `dbt_scd_id` as a primary key to your snapshot base model and rename it to something more explicit (documentation about snapshot meta-fields can be found [here](https://docs.getdbt.com/docs/snapshots#section-snapshot-meta-fields))
  * Add the `dbt_valid_from` and `dbt_valid_to` columns to your query
  * Good example: [the snapshot base model](https://gitlab.com/gitlab-data/analytics/blob/master/transform/snowflake-dbt/models/snapshots/gitlab_dotcom_gitlab_subscriptions_snapshots_base.sql) and [the original base model](https://gitlab.com/gitlab-data/analytics/blob/master/transform/snowflake-dbt/models/gitlab_dotcom/base/gitlab_dotcom_gitlab_subscriptions.sql)
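Putting those points together, a snapshot base model might look roughly like the following sketch. The column names and the `snapshots` source name are assumptions for illustration; the linked `gitlab_dotcom_gitlab_subscriptions_snapshots_base` model is the real reference.

```sql
-- hypothetical snapshot base model, e.g. models/snapshots/gitlab_dotcom_members_snapshots_base.sql
{{ config(alias='gitlab_dotcom_members_snapshots') }}

WITH source AS (

    -- 'snapshots' as the source name is an assumption; use whatever sources.yml defines
    SELECT *
    FROM {{ source('snapshots', 'gitlab_dotcom_members_snapshots') }}

)

SELECT
  dbt_scd_id            AS member_snapshot_id,   -- snapshot meta-field, renamed to something more explicit
  id::INTEGER           AS member_id,
  user_id::INTEGER      AS user_id,
  access_level::INTEGER AS access_level,
  dbt_valid_from,
  dbt_valid_to
FROM source
```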