--- layout: handbook-page-toc title: "SQL Style Guide" description: "GitLab Data Team Handbook" --- ## On this page {:.no_toc .hidden-md .hidden-lg} - TOC {:toc .toc-list-icons .hidden-md .hidden-lg} ---- ## SQL Style Guide **Since we don't have a linter, it is *our collective responsibility* to enforce this Style Guide.** #### Field Naming and Reference Conventions * Field names should all be lowercased. * An `id`, `name`, or generally ambiguous value such as `type` should always be prefixed by what it is identifying or naming ```sql -- Good SELECT id AS account_id, name AS account_name, type AS account_type, ... -- Bad SELECT id, name, type, ... ``` * When joining to any data from a different source, a field should be prefixed with the data source, e.g. `sfdc_account_id`, to avoid ambiguity ```sql -- Good SELECT sfdc_account.account_id AS sfdc_account_id, zuora_account.account_id AS zuora_account_id FROM sfdc_account LEFT JOIN zuora_account ON ... -- Bad SELECT sfdc_account.account_id, zuora_account.account_id AS zuora_id FROM sfdc_account LEFT JOIN zuora_account ON ... ``` * When joining tables and referencing columns from both, strongly prefer to reference the full table name instead of an alias. When the table name is long (~20), try to rename the CTE if possible, and lastly consider aliasing to something descriptive. ```sql -- Good SELECT budget_forecast_cogs_opex.account_id, -- 15 more columns date_details.fiscal_year, date_details.fiscal_quarter, date_details.fiscal_quarter_name, cost_category.cost_category_level_1, cost_category.cost_category_level_2 FROM budget_forecast_cogs_opex LEFT JOIN date_details ON date_details.first_day_of_month = budget_forecast_cogs_opex.accounting_period LEFT JOIN cost_category ON budget_forecast_cogs_opex.unique_account_name = cost_category.unique_account_name -- Ok, but not preferred. Consider renaming the CTE in lieu of aliasing SELECT bfcopex.account_id, -- 15 more columns date_details.fiscal_year, date_details.fiscal_quarter, date_details.fiscal_quarter_name, cost_category.cost_category_level_1, cost_category.cost_category_level_2 FROM budget_forecast_cogs_opex bfcopex LEFT JOIN date_details ON date_details.first_day_of_month = bfcopex.accounting_period LEFT JOIN cost_category ON bfcopex.unique_account_name = cost_category.unique_account_name -- Bad SELECT a.*, -- 15 more columns b.fiscal_year, b.fiscal_quarter, b.fiscal_quarter_name, c.cost_category_level_1, c.cost_category_level_2 FROM budget_forecast_cogs_opex a LEFT JOIN date_details b ON b.first_day_of_month = a.accounting_period LEFT JOIN cost_category c ON b.unique_account_name = c.unique_account_name ``` * All field names should be [snake-cased](https://en.wikipedia.org/wiki/Snake_case) ```sql -- Good SELECT dvcecreatedtstamp AS device_created_timestamp FROM table -- Bad SELECT dvcecreatedtstamp AS DeviceCreatedTimestamp FROM table ``` * Boolean field names should start with `has_`, `is_`, or `does_` ```sql -- Good SELECT deleted AS is_deleted, sla AS has_sla FROM table -- Bad SELECT deleted, sla FROM table ``` **Dates** - Timestamps should end with `_at`, e.g. `deal_closed_at`, and should always be in UTC - Dates should end with `_date`, e.g. `deal_closed_date` - Months should be indicated as such and should always be truncated to a date format, e.g. `deal_closed_month` - Always avoid key words like `date` or `month` as a column name - Prefer the explicit date function over `date_part`, but prefer `date_part` over `extract`, e.g. `DAYOFWEEK(created_at)` > `DATE_PART(dayofweek, 'created_at')` > `EXTRACT(dow FROM created_at)` - Note that selecting a date's part is different from truncating the date. `date_trunc('month', created_at)` will produce the calendar month ('2019-01-01' for '2019-01-25') while `SELECT date_part('month', '2019-01-25'::date)` will produce the number 1 - Be careful using [DATEDIFF](https://docs.snowflake.net/manuals/sql-reference/functions/datediff.html), as the results are often non-intuitive. - For example, `SELECT DATEDIFF('days', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000')` returns `1` even though the timestamps are different by one millisecond. - Similarly, `SELECT DATEDIFF('days', '2001-12-01 00:00:00.001', '2001-12-01 23:59:59.999')` return `0` even though the timestamps are nearly an entire day apart. - Using the appropriate interval with the `DATEDIFF` function will ensure you are getting the right results. For example, `DATEDIFF('days', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000')` will provide a `1 day interval` and `DATEDIFF('ms', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000')` will provide a `1 millisecond interval`. #### Use CTEs (Common Table Expressions), not subqueries - [CTEs make SQL more readable and are more performant](https://www.alisa-in.tech/post/2019-10-02-ctes/) - Use CTEs to reference other tables. Think of these as import statements - CTEs should be placed at the top of the query - Where performance permits, CTEs should perform a single, logical unit of work - CTE names should be as concise as possible while still being clear - Avoid long names like `replace_sfdc_account_id_with_master_record_id` and prefer a shorter name with a comment in the CTE. This will help avoid table aliasing in joins - CTEs with confusing or noteable logic should be commented in file and documented in dbt docs - CTEs that are duplicated across models should be pulled out into their own models - Leave an empty row above and below the query statement - CTEs should be formatted as follows: ``` sql WITH events AS ( -- think of these select statements as your import statements. ... ), filtered_events AS ( -- CTE comments go here ... ) SELECT * -- you should always aim to "select * from final" for your last model FROM filtered_events ``` #### General - Indents for the main part of a query should be 4 spaces (i.e. when in a CTE) - Further indents should be two spaces (except for predicates, which should line up with the `WHERE` keyword) - No tabs should be used - only spaces. Your editor should be setup to convert tabs to spaces - see our [onboarding template](https://gitlab.com/gitlab-data/analytics/-/blob/master/.gitlab/issue_templates/Data%20Onboarding.md#data-grip-configuration) for more details - Lines of SQL should be no longer than 80 characters - Commas should be at the end-of-line (EOL) as a right comma, with the exception of temporary filters in the `WHERE` clause for specific values. ```sql -- Good SELECT deleted AS is_deleted, -- EOL right comma accountId AS account_id FROM table WHERE is_deleted = false AND account_id NOT IN ( '232' , '234' -- left comma , '425' ) -- Bad SELECT deleted AS is_deleted, -- EOL right comma accountId AS account_id FROM table WHERE is_deleted = false AND account_id NOT IN ('232', '234', '425') ``` - When `SELECT`ing, always give each column its own row, with the exception of `SELECT *` which can be on a single row - `DISTINCT` should be included on the same row as `SELECT` - The `AS` keyword should be used when projecting a field or table name - When aliasing use `AS`, strive to align the original column names on a single vertical line and the `AS` keyword on a separate vertical line - Fields should be stated before aggregates / window functions - Ordering and grouping by a number (eg. group by 1, 2) is preferred - Prefer `WHERE` to `HAVING` when either would suffice - Prefer accessing JSON using the bracket syntax, e.g. `data_by_row['id']::bigint as id_value` - **Never** use `USING` in joins. On Snowflake, it will produce inaccurate results - Prefer `UNION ALL` to `UNION`. This is because a `UNION` could indicate upstream data integrity issue that are better solved elsewhere. - Prefer `!=` to `<>`. This is because `!=` is more common in other programming languages and reads like "not equal" which is how we're more likely to speak - Consider performance. Understand the difference between `LIKE` vs `ILIKE`, `IS` vs `=`, and `NOT` vs `!` vs `<>`. Use appropriately - Prefer `lower(column) LIKE '%match%'` to `column ILIKE '%Match%'`. This lowers the chance of stray capital letters leading to an unexpected result - Prefer using `VARCHAR` when casting this data type and not specifying a length. A column only consumes storage for the amount of actual data stored. - Familiarize yourself with [the DRY Principal](https://docs.getdbt.com/docs/design-patterns). Leverage CTEs, jinja and macros in dbt, and snippets in Sisense. If you type the same line twice, it needs to be maintained in two places - **DO NOT OPTIMIZE FOR A SMALLER NUMBER OF LINES OF CODE. NEWLINES ARE CHEAP. [BRAIN TIME IS EXPENSIVE.](https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/)** #### Functions - Function names and keywords should all be capitalized - Prefer `IFNULL` TO `NVL` - Prefer `IFF` to a single line `CASE WHEN` statement - Prefer `IFF` to selecting a boolean statement `(amount < 10) AS is_less_than_ten` #### JOINs - Be explicit when joining, e.g. use `LEFT JOIN` instead of `JOIN`. (Default joins are `INNER`) - Prefix the table name to a column when joining, otherwise omit - Specify the order of a join with the FROM table first and JOIN table second: ```sql -- Good FROM source LEFT JOIN other_source ON source.id = other_source.id WHERE ... -- Bad FROM source LEFT JOIN other_source ON other_source.id = source.id WHERE ... ``` #### Example Code * Putting it all together: ```sql WITH my_data AS ( SELECT * FROM analytics.my_data WHERE filter = 'my_filter' ), some_cte AS ( SELECT DISTINCT id, other_field_1, other_field_2 FROM analytics.my_other_data ), final AS ( SELECT data_by_row['id']::bigint AS id_field, field_1 AS detailed_field_1, field_2 AS detailed_field_2, detailed_field_3, CASE WHEN cancellation_date IS NULL AND expiration_date IS NOT NULL THEN expiration_date WHEN cancellation_date IS NULL THEN start_date+7 ELSE cancellation_date END AS cancellation_date, LAG(detailed_field_3) OVER ( PARTITION BY id_field, detailed_field_1 ORDER BY cancellation_date ) AS previous_detailed_field_3, SUM(field_4) AS field_4_sum, MAX(field_5) AS field_5_max FROM my_data LEFT JOIN some_cte ON my_data.id = some_cte.id WHERE field_1 = 'abc' AND (field_2 = 'def' OR field_2 = 'ghi') GROUP BY 1, 2, 3, 4, 5 HAVING COUNT(*) > 1 ORDER BY 4 DESC ) SELECT * FROM final ``` #### Commenting * When making single line comments in a model use the `--` syntax * When making multi-line comments in a model use the `/* */` syntax * Respect the character line limit when making comments. Move to a new line or to the model documentation if the comment is too long * dbt model comments should live in the model documentation * Calculations made in SQL should have a brief description of what's going on and a link to the handbook defining the metric (and how it's calculated) * Instead of leaving `TODO` comments, create new issues for improvement ### Other SQL Style Guides - [Fishtown Analytics](https://github.com/fishtown-analytics/corp/blob/master/dbt_coding_conventions.md#sql-style-guide) - [Matt Mazur](https://github.com/mattm/sql-style-guide) - [Kickstarter](https://gist.github.com/fredbenenson/7bb92718e19138c20591)