--- layout: handbook-page-toc title: "Data For Product Managers" --- ## On this page {:.no_toc .hidden-md .hidden-lg} - TOC {:toc .toc-list-icons .hidden-md .hidden-lg} {::options parse_block_html="true" /} ---- This page is intended to help Product Managers at GitLab understand what data is available to them and how they can use it to understand how their product is used. This page primarily covers two topics: *how* to consume data, and *what* data is available. ## How to Consume Data at Gitlab The user-facing end of GitLab's data stack is comprised of our BI Tool, Sisense (formerly known as Periscope) which is connected to our Snowflake data warehouse. The [Sisense page](/handbook/business-ops/data-team/periscope/) of the data team handbook has general information about Sisense aimed for a wider GitLab audience. #### Product Managers toolkit Some useful links that we recommend for you to bookmark: * [Usage Pings documentation](https://docs.gitlab.com/ee/user/admin_area/settings/usage_statistics.html) * Snowplow [backend](https://docs.gitlab.com/ee/development/event_tracking/backend.html) and [frontend](https://docs.gitlab.com/ee/development/event_tracking/frontend.html) documentation * [DBT documentation](https://dbt.gitlabdata.com/#!/overview) where most of the tables are documented * [Sisense link](https://app.periscopedata.com/app/gitlab/403199/Welcome-Dashboard-%F0%9F%91%8B) #### Getting Access * Everybody at GitLab should automatically have view access granted through Okta. * To write your own queries and create your own charts + dashboard, you'll need to create an [access request](https://gitlab.com/gitlab-com/team-member-epics/access-requests/issues/new?issuable_template=New+Access+Request) asking for the Editor role. These access requests can be assigned to a data team manager. #### The 2 basic building blocks in Sisense are charts and dashboards. * Charts are queries to the data warehouse, materialized into visualizations. * Dashboards are collections of charts and have a unique URL (like a "page"). * If you have Editor access, you'll be able to create both of these. New dashboards can be created on the left-hand nav bar and charts can be created by clicking "New Chart" in the top right on any dashboard where you have sufficient permissions. #### Creating Charts * Clicking *"New Chart"* will take you to the Sisense SQL editor. This is a safe environment, meaning you don't need to worry about breaking any of the data infrastructure. The Sisense "user" has read access only and will kill all queries lasting longer than 5 minutes so you can't rack up any massive bills. * The variant of SQL used in the data warehouse is [Snowflake](https://docs.snowflake.net/manuals/index.html). * A great way to learn about building charts is to watch this 10-minute [Data Onboarding](https://www.youtube.com/watch?v=F4FwRcKb95w&feature=youtu.be) video from Sisense. * If you do happen to experience timeouts in Sisense (you'll see an error message saying "query killed"), reach out to the data team and we might be able to help you optimize your query or run it as a one-off with more computational resources. * Sisense dashboards can be designated as "Official", meaning it has been reviewed by a data team member. If you're interested in making your dashboard official, the process is documented [here](/handbook/business-ops/data-team/periscope/#official-badge-for-a-dashboard). * Sisense has a concept of "Topics" which organizes dashboards into topical areas. Check out the [Product Topic](https://app.periscopedata.com/app/gitlab/topic/Product/ab707846c91f4d30b1c1ca0399803d67) to see what's already available in Periscope and to get inspiration for future charts. * One limitation of Sisense is that it is a BI tool, not a product analytics tool. We've received feedback that it's difficult to do any product analytics without a strong knowledge of SQL. To meet this need, the product team is exploring alternative analytics tools in [this epic](https://gitlab.com/gitlab-org/telemetry/issues/303) #### How do I know what tables are available? * The data team uses a tool called [dbt](https://www.getdbt.com/) for our data transformation layer. A nice feature of dbt is dbt docs, which automatically creates documentation for all of the models in our schema. Our dbt docs instance can be found [here](https://dbt.gitlabdata.com/#!/overview). * Navigating to `Projects > gitlab_snowflake > models` will give a list of all of the models (think of these as tables) that exist in the data warehouse. Models are organized in directories according to their data source. * ![](/handbook/business-ops/data-team/data-for-product-managers/projects_periscope.png) * Clicking on a model will reveal more details including documentation, a column list, and the underlying SQL. * Also, Sisense will show a list of all tables available for querying on the left-hand side while in the SQL chart editor. * ![](/handbook/business-ops/data-team/data-for-product-managers/schemas.png) * Table names are always prefixed by their source name. So the table that stores the ngroups table from the gitlab dotcom database is available at `analytics_staging.gitlab_dotcom_groups` #### What's the difference between analytics_staging and analytics? * These are two different schemas inside of our `analytics` data warehouse. * analytics_staging **does NOT** have anything to do with the software engineering concept of a staging environment. * The difference between the two relates to the amount of data transformation that has been done on the tables: * Tables in the `analytics_staging` schema have undergone little to no data transformation, meaning they're basically clones of the raw data source. Example: the `analytics_staging.gitlab_dotcom_merge_requests` table is a copy of the `merge_requests` table from the GitLab.com Postgres database (except it includes rows that were hard deleted on the latter). * Tables in `analytics` have typically undergone more data transformation. They will typically end in `_xf` to represent the fact that they are transformed. This is done as a convenience to the end-user who won't have to do the common joins and transformations themselves. For example: the `analytics.gitlab_dotcom_merge_requests_xf` table includes columns about a merge request's labels, milestone and author, which come from joining to other tables. * Some tables will have both `analytics_staging` and `analytics` models, whereas others might only have an `analytics_staging` model. * You do not need to specify the **warehouse** name (ANALYTICS) when writing a query. `analytics_staging.gitlab_dotcom_groups` will function the same as `ANALYTICS.analytics_staging.gitlab_dotcom_groups`. #### How does data get into the warehouse? * Through a couple different ways which are detailed [here](/handbook/business-ops/data-team/#extract-and-load). #### How should I interact with the data team? * If you ever get stuck or have a question, please create an issue in the [data team project](https://gitlab.com/gitlab-data/analytics/issues/new) with the Product label and assign it to a [product data analyst](/handbook/business-ops/data-team/#product). There are issue templates that are useful for specifying your request. * It's helpful for us to know the context behind your question. Don't just say *what* you want to know but also *why* you want to know it because we might be able to point you to a more efficient way to get your answer. ## Key Data Sources for Product Managers at GitLab The first question we on the data team ask product managers is usually "are you interested in knowing this for self-managed or GitLab.com?" Our approach to answering your question differs greatly between the two. Although our self-managed offering has many more active customers, our GitLab.com offering has much more data available to analyze. * See this [Sisense chart](https://app.periscopedata.com/app/gitlab/602123/Data-For-Product-Managers:-Supporting-Dashboard?widget=7940644&udv=0) which shows the differences between the 2 product offerings from a financial perspective. * We have three primary data sources that are useful from a product perspective. * **Usage Ping** (for Self-Managed and GitLab.com) * **GitLab.com Postgres Database** (for GitLab.com) * **Snowplow** (for GitLab.com) ### Usage Ping (Version App) * [Usage Ping](https://docs.gitlab.com/ee/user/admin_area/settings/usage_statistics.html) is a custom tool that GitLab built to deal with the problem of collecting weekly aggregate information from our customers who host our product on their own hardware. * It is optional but defaults to being on. * 30% of licensed instances successfully send usage ping every month. We don't know about unlicensed (core) users but assume the same rate. Here is more [information](https://gitlab.com/gitlab-org/telemetry/issues/291#note_276741996) on why some instances block data from being sent. * Usage ping is supposed to send regularly at the instance-level for GitLab.com as well, but it hasn't done so successfully in some time. See [this issue](https://gitlab.com/gitlab-org/telemetry/issues/299). GitLab.com usage pings can be filtered out of your query as they always come from the same UUID. **Query Example filtering out Gitlab.com:** ```sql SELECT * FROM analytics.version_usage_data_unpacked WHERE uuid <> 'ea8bf810-1d6f-4a6a-b4fd-93e8cbd8b57f' --GitLab.com UUID ``` #### Snippets Snippets are great ways to allow Periscope users to build charts without writing any SQL. Anyone with editor access can write their own snippets. The data team has created several snippets that have the official badge. To find a list of available snippets, click on the scissors in the left menu. We created a snippet called `feature_usage_this_agg_period_and_growth(feature)` ([SQL source](https://app.periscopedata.com/app/gitlab/snippet/feature_usage_this_agg_period_and_growth/5564f7aad54d4044834b48167e6befd8/edit)) that allows you to get quickly without any SQL writing some feature usage from the Usage Pings data source. Here is a list of the different fields that : * feature_name: name of the pings you want to track. To get the exact name of the pings, please refer to [this model](https://dbt.gitlabdata.com/#!/model/model.gitlab_snowflake.version_usage_data_unpacked). * feature_usage_sum: Total count for the specific feature. * feature_usage_sum_change: count for the specific feature at a give period P. * instance_count: Total count of instances that have been using this specific feature at least once before a given period P. In technical terms, it counts all instances which have a `feature_usage_sum` greater than 0 for this period. * instance_count_change: count of instances that have been using this specific feature at a given period P. In technical terms, it counts all instances which have a `feature_usage_sum_change` greater than 0 for this period. * average_per_12: average `feature_usage_sum` value over the previous 12 months. * percent_growth: POP growth. This snippet can be used with the date-range and aggregation filters. It has been used in several dashboards, one of them is the [Secure Metrics dashboard](https://app.periscopedata.com/app/gitlab/410654/Secure-Metrics) We are planning to build a second snippet to report on the adoption rate per [product tier](https://about.gitlab.com/handbook/marketing/product-marketing/tiers/). This snippet will allow to easily calculate among the instances that send us usage pings in a given period, the number of instances that use a specific feature. #### Examples * **Example 1**: [What percent of self-managed instances sending usage pings have a project with an active Jira integration?](https://app.periscopedata.com/app/gitlab/602123/Data-For-Product-Manager:-Supporting-doc?widget=7886001) * **Example 2**: [What are the salesforce accounts of the 100 instances that use issues the most?](https://app.periscopedata.com/app/gitlab/602123/Data-For-Product-Managers:-Supporting-Doc?widget=7932958) * **Example 3**: [How many SAST jobs have been created in the last 12 months by self-managed instances sending usage pings ?](https://app.periscopedata.com/app/gitlab/602123/Data-For-Product-Manager:-Supporting-doc?widget=7886026) ### GitLab.com Postgres Database * Because GitLab.com is a GitLab instance hosted by GitLab, we have access to the instance's postgres database and can load parts of it into our snowflake data warehouse. This means we can get a very detailed look into how our product is used on GitLab.com. * This is largely an **untapped resource** as historically not many PMs have known that this data was consistently available in Sisense. * Any part of the product that creates a table on the backend (see the [schema](https://gitlab.com/gitlab-org/gitlab/blob/master/db/schema.rb) file) can be added to the ELT job which will sync 3x a day to the warehouse. From there, all we need to do is build a dbt base model to make them accessible in Sisense. #### What if the table or column I want isn't in the data warehouse? * Our ELT process works by explicitly stating which columns and tables we want to import into the data warehouse. This means we might be missing a column or whole table that you want to have in the data warehouse for analysis. * When this is the case, please create a data issue letting us know what you want us to import. Before doing so, please confirm that the table/column is truly part of the [production schema](https://gitlab.com/gitlab-org/gitlab/blob/master/db/schema.rb). #### Replicating Usage Ping using GitLab.com Data * Because usage ping only aggregates data at an `instance` level, it is not super useful for GitLab.com since we often want to see information at the `namespace` level. For example, knowing that 40K people used your stage on GitLab.com is somewhat useful, but you'll probably want to know more context (Are they free or paid? What plan are they on? Do I have any power users or is usage equally distributed?) * But since we have access to the GitLab.com postgres database, we are capable of replicating any part of usage ping at the namespace level or even the user level. * All usage ping does under the hood is execute some `SELECT COUNT(*) FROM x` statements, making it trivial to replicate. * [This model](https://dbt.gitlabdata.com/analytics/dbt/snowflake/#!/model/model.gitlab_snowflake.gitlab_dotcom_usage_data_events) exemplifies hows usage ping could be replicated for GitLab.com at the namespace level. This model is available as a discovery dataset in Periscope. [More information is available here](/handbook/business-ops/data-team/periscope/#discovery-datasets) * We created [this dashboard](https://app.periscopedata.com/app/gitlab/576469/WIP:-Namespace-Overview) to demonstate what data is available and which type of drilldowns can be done. #### Examples * [How many snippets are created every day?](https://app.periscopedata.com/app/gitlab/602123/Data-For-Product-Manager:-Supporting-doc?widget=7886091&udv=953103) * [How many issues with the label "group:: source code" are opened each month?](https://app.periscopedata.com/app/gitlab/602123/Data-For-Product-Managers:-Supporting-Doc?widget=7933301&udv=0) #### Snowplow * Snowplow Analytics is an open-source enterprise event-level analytics platform that enables data collection from multiple platforms for advanced data analytics. * GitLab.com currently uses two of these snowplow tracking libraries: Javascript and Ruby. * With the JS library, we track a lot of front-end activity automatically (All page views, sessions, link clicks, some button clicks, etc.) We also utilize structured events to capture specific events with defined business logic. * We do not track `user_id` on any of the snowplow events, making all events functionally anonymous. This severely limits the utility of these events. * We have found the [Snowplow Inspector](https://chrome.google.com/webstore/detail/snowplow-inspector/maplkdomeamdlngconidoefjpogkmljm?hl=en) to be very useful in tracing and debugging events being sent through the browser. * We do not use snowplow on our self-managed instances, only on GitLab.com * **Examples** * [How many GitLab.com users visit the preferences page every day?](https://app.periscopedata.com/app/gitlab/602123/Data-For-Product-Managers:-Supporting-Dashboard?widget=7940849) * [How many "users" convert from one trial form to another in the last 30 days?](https://app.periscopedata.com/app/gitlab/602123/Data-For-Product-Managers:-Supporting-Dashboard?widget=7940858) * [Which "status" tabs ('Open', 'Merged', etc) get clicked on the /merge_requests page?](https://app.periscopedata.com/app/gitlab/602123/Data-For-Product-Managers:-Supporting-Dashboard?widget=7940862) ###### What can Product Managers do? As mentioned, the anonymization of any snowplow events is a severe limitation for Snowplow data instrumentation. Nonetheless, you can still add some Snowplow to measure feature adoption and usage. We recommend Product Managers and their teams use [Snowplow custom structured events](https://github.com/snowplow/snowplow/wiki/canonical-event-model#customstruct), which are Snowplow's canonical events. We have built [`Tracking`](https://docs.gitlab.com/ee/development/event_tracking/frontend.html) and [`Gitlab::Tracking`](https://docs.gitlab.com/ee/development/event_tracking/backend.html), 2 wrappers for Snowplow Javascript and Ruby Trackers. * Javascript trackers can be used to collect users' frontend action, especially clicks. * Ruby trackers can be used to collect any event happening in the backend. For example, we currently use them for monitor:APM. More information in this [issue](https://gitlab.com/gitlab-org/gitlab/issues/121724) ###### Taxonomy When adding new click events, we should add them in a way that's internally consistent. If we don't, it'll be very painful to perform analysis across features since each feature will be capturing events differently. The current method provides 5 attributes that are sent on each click event. Please try to follow these guidelines when specifying events to capture: | attribute | type | required | description| |-----------|---------|----------|------------| | category | text | true | The page or backend area of the application. Unless infeasible, please use the Rails page attribute by default in the frontend, and namespace + classname on the backend.| | action | text | true | The action the user is taking, or aspect that's being instrumented. The first word should always describe the action or aspect: clicks should be `click`, activations should be `activate`, creations should be `create`, etc. Use underscores to describe what was acted on; for example, activating a form field would be `activate_form_input`. An interface action like clicking on a dropdown would be `click_dropdown`, while a behavior like creating a project record from the backend would be `create_project` | | label | text | false | The specific element, or object that's being acted on. This is either the label of the element (e.g. a tab labeled 'Create from template' may be `create_from_template`) or a unique identifier if no text is available (e.g. closing the Groups dropdown in the top navbar might be `groups_dropdown_close`), or it could be the name or title attribute of a record being created. | | property | text | false | Any additional property of the element, or object being acted on.| | value | decimal | false | Describes a numeric value or something directly related to the event. This could be the value of an input (e.g. `10` when clicking `internal` visibility). | ###### Testing your events Testing your events can be tricky. Snowplow doesn't have a proper testing interface. However, several tools can help you debug, test, and validate your events implementation: * When working on your local instance: you can use [Snowplow Micro](https://snowplowanalytics.com/blog/2019/07/17/introducing-snowplow-micro/). [This video is a nice tutorial on getting started with Snowplow Micro](https://www.youtube.com/watch?v=OX46fo_A0Ag). * In the near future, we plan to implement [Snowplow Mini](https://github.com/snowplow/snowplow-mini) as a way to QA our events on our staging environment. You can follow the progress [in this issue](https://gitlab.com/gitlab-org/telemetry/issues/266). ###### Visualize your events in Periscope * After the events are released, you can first check if they are correctly stored in Snowflake in the [Snowplow Event Exploration Dashboard](https://app.periscopedata.com/app/gitlab/539181/Snowplow-Event-Exploration---last-30-days). You can use the filters to find your events. If you are not sure of the value of the different attributes, please check with your engineering manager. * Once you see your events, that means they are ready to be queried and visualized in Periscope. We are collecting several millions of events (pageviews, structured events) per month, so the whole dataset is quite slow to query. In order to make it easy to explore this data source we have created several smaller and more compact tables: * [`analytics.snowplow_structured_events_all`](https://dbt.gitlabdata.com/#!/model/model.gitlab_snowflake.snowplow_structured_events_all): contains ALL structured events (but no pageviews). * [`analytics.snowplow_page_views_30`](https://dbt.gitlabdata.com/#!/model/model.gitlab_snowflake.snowplow_page_views_30): contains ALL pageviews * [`analytics.snowplow_unstructured_events_30`](https://dbt.gitlabdata.com/#!/model/model.gitlab_snowflake.snowplow_unstructured_events_30): contains ALL unstructured events (especially click events, submit_form).
**PRO TIP: Optimizing queries** {: .panel-heading}
To make your query faster, use a date filter in your `WHERE` statement. Example query: ```sql SELECT event_action, COUNT(*) FROM analytics.snowplow_structured_events_all WHERE derived_tstamp > ' 2020-01-01' GROUP BY 1 ORDER BY 2 DESC ```
###### Some Issues and Merge Requests examples TODO #### Other data sources** * **Sheetload** * You can load your own Google Sheets into the data warehouse. Read more [here](/handbook/business-ops/data-team/#using-sheetload). ## Instrumentation / Telemetry * [Telemetry](https://about.gitlab.com/direction/telemetry/) is part of the product org and is completely separate from the data team. * The telemetry team members are the [DRI](https://about.gitlab.com/handbook/people-group/directly-responsible-individuals/)s for data collection across both GitLab.com and Self-Managed. They own Usage Ping and Snowplow. They are the ones to go to for questions like: * How do I instrument a new statistic for self-managed? * What are the best practices for adding to usage ping? * How can I use snowplow to track a frontend interaction on GitLab.com? * Can I utilize snowplow to track events on the server-side?