---
layout: handbook-page-toc
title: "Monitor Stage"
---

## On this page
{:.no_toc .hidden-md .hidden-lg}

- TOC
{:toc .hidden-md .hidden-lg}

# Direction

## Vision

Using GitLab, you automatically get broad and deep insight into the health of your deployment.

## Mission

We provide a robust monitoring solution to give GitLab users insight into the performance and availability of their deployments and alert them to problems as soon as they arise. We provide data that is easy to digest and to relate to other features in GitLab. With every piece of the DevOps lifecycle integrated into GitLab, we have a unique opportunity to closely tie our monitoring features to all of the other pieces of the DevOps flow. We work collaboratively and transparently, and we will contribute as much of our work as possible back to the open source community.

## Responsibilities

The monitoring team is responsible for:

- Providing the tools required to enable monitoring of GitLab.com
- Packaging these tools to enable all customers to manage their instances easily and completely
- Building integrated monitoring solutions for customers' apps into GitLab, including metrics, logging, and tracing

This stage consists of the following groups:

- [APM](apm/)
- [Health](health/)

These groups map to the [Monitor Stage product category](/handbook/product/categories/#monitor-stage).

## How to be successful in this stage

Team members who are successful in this stage typically demonstrate **stakeholder mentality**. There are many ways to demonstrate this, but examples include:

- Actively contributing to the discussion and direction of where the product is headed
- Actively reaching out for help when stuck
- Actively finding ways to make the team successful

This stage is only successful when each team member collaborates to make one another successful.

# Rhythms

## Monthly Cadence

Since GitLab releases on a monthly basis, we have supporting activities that also take place on monthly rhythms.
In addition, since our releases take place on the 22nd of each month, our monthly cadence does not line up with calendar months.

- We manually create our **planning issue** for the next milestone as soon as we start a milestone
  - Everyone in the team is encouraged to participate in this issue
  - Our planning issue is a conversation about the planning priorities of the team. The PM, EM and UX are responsible for narrowing down the scope based on the capacity of the team.
  - It is an opportunity for the team to bring up any issues they would like to work on, even if they aren’t directly related to the stage, so that the PM can determine how to balance work that benefits the entire company against work that benefits the stage (as well as technical debt).
- One week prior to the last date of the milestone, the PM will record a kickoff video covering the issues intended to be completed in the next milestone (as prescribed in the planning issue).
- We automatically create a **retrospective issue** for the current milestone when we are halfway through a milestone using the async-retrospective.
  - Everyone in the team is encouraged to participate in this issue
  - Issues are confidential to the team
  - Engineering Managers for both backend and frontend are responsible for bubbling up content and themes to the main [public retrospective document](https://docs.google.com/document/d/1nEkM_7Dj4bT21GJy0Ut3By76FZqCfLBmFQNVThmW2TY/edit)
- The PM creates a **release post item merge request** for each feature that will be available in the next release so that these features can be displayed in the release post on the 22nd.
  - Team members working on a feature are tagged in this MR but are not required to take any action
  - Engineering Managers follow the checklist in the MR template and are responsible for merging the MR once the feature has been merged to master and (ideally) verified in production
- We consider the 18th of the month (or, if that day falls on a weekend, the last weekday before it) to be the **last date of the milestone**.
  - Deliverables that are not closed by the 18th will be moved to the next milestone and labeled `missed-deliverable`
  - Assigned filler issues that are `workflow::in dev`, `workflow::in review` or `workflow::verification` will be moved to the next milestone
  - Filler issues that do not fall into the above categories will be moved to the backlog
  - After all the issues associated with that milestone are moved, we remove that milestone column from the planning boards ([APM](https://gitlab.com/groups/gitlab-org/-/boards/1065731), [Health](https://gitlab.com/groups/gitlab-org/-/boards/1131777))
- A few days prior to the last date of the milestone, **engineers are assigned to Deliverables**.
  - Engineering managers will assign engineers to deliverables
  - Assigning deliverables helps set a clear DRI
  - If engineers notice that their deliverables are unlikely to make the milestone, they are responsible for communicating that in the issue and to their EM and the PM
  - We highly encourage engineers who finish their deliverables to look for ways to help other engineers with deliverables that would otherwise slip

## Meetings

Meetings are not required, but attending the important ones or reviewing their recordings will generally help team members be successful. They are listed in order of importance and are all stored in the [Monitor Stage Calendar](https://calendar.google.com/calendar?cid=Z2l0bGFiLmNvbV8xbGMyZHFpbjFoMXQ2MHFoNnJmcjJjZTE5OEBncm91cC5jYWxlbmRhci5nb29nbGUuY29t) (viewable to all GitLab team members).

1. Group Weekly Meeting (Monitor:Health and Monitor:APM each have their own)
1. Monitor Stage Demo Hour (bi-weekly)
1. Monitor Social Hour (weekly)

## Async Daily Standups

Groups in this stage also participate in async daily standups. The purpose is to give every team member insight into what others are working on so that we can identify ways to collaborate and unblock one another, as well as foster relationships within the team. We use the [Geekbot Slack plugin](https://geekbot.com/) to automate our async standup, following the guidelines outlined in the [Geekbot commands guide](https://geekbot.com/guides/commands/). Our questions change depending on the day of the week. Participation is optional but encouraged.

### Monday

| Question | Why we ask it |
|---|---|
| Do you need help from anyone to unblock you this week? | One of our main goals with our standups is to help ensure that we are unblocking one another as a top priority. We ask this first because we think it's the question that other team members can take action on. |
| What do you plan on working on this week? | We want to understand how our daily actions drive us toward our weekly goals. This question provides broader context for our daily work, but also helps us hold ourselves accountable to maintaining proper scopes for our tasks, issues, merge requests, etc. This answer may stay the same for a week, which would mean things are progressing on schedule. Alternatively, seeing this answer change throughout the week is also okay. Maybe we got sidetracked helping someone get unblocked. Maybe new blockers came up. The intention is not to have to justify our actions, but to keep a running record of how our work is progressing or evolving. |
| Any personal tidbits you'd like to share? | This question is intentionally open-ended. You might want to share how you feel, a personal anecdote, a funny joke, or simply let the team know that you will have limited availability that afternoon. All of these answers are welcome. |

### Tuesday/Wednesday/Thursday

| Question | Why we ask it |
|---|---|
| Are you facing any blockers requiring action from others? | Same reason as Monday's first question |
| Are you on track with your plan for the week? | We want to understand how each team member is doing on achieving our weekly goal(s). It is meant to highlight progress while also identifying if there are things getting in the way. This could also be used to update the plan for the week as things change. |
| What will be your primary focus for today? | This question is aimed at the most impactful task for the day. We aren't trying to account for the entire day's worth of work. Highlighting only a primary task keeps our answers concise and provides insight into each team member's most important priority. This doesn't necessarily mean sharing the task that will take the most time. We focus on results over input. Typically this will mean highlighting the task that is most impactful in closing the gap between today and our end-of-week goal(s). |
| Any personal tidbits you'd like to share? | Same reason as Monday's last question |

### Friday

| Question | Why we ask it |
|---|---|
| What went well this week? What did you enjoy? | The end of the week is a good time to reflect on our goals, and this question is meant to be a short retrospective of the week. This one focuses on things that went well during the week. |
| What didn’t go so well? What caused you to slow down? | Like the previous question, this question is a way to review our week. This one is a way to surface things that did not go so well or things that got in the way of meeting our weekly goal(s). |
| What have you learned? | This could be something work-related or personal. We hope that by sharing things we have learned, others can also learn from us. |
| Any plans for the weekend you'd like to share? | Like the "personal tidbit" question we ask other days of the week, this one is very open-ended. You can share as much or as little as you want and all answers are welcome. |

# Initiatives

## SRE Shadow Program

With the support of GitLab's SRE team, we implemented the SRE shadow program as a means of improving the team's understanding of our ideal user personas so that we can build a better product.

In this program, engineers are expected to devote 1 entire week to shadowing SREs. There is no expectation for the engineer to complete their assigned issues during this time. Engineers are added to PagerDuty and will follow the [existing SRE shadow format of interning](https://about.gitlab.com/handbook/engineering/infrastructure/career/#interning-with-infrastructure--reliability-engineering) (except scaled down to a shorter duration of 1 week). Although SREs are typically on call for multiple days at a time, shadows are only expected to shadow during their regular business hours. This can be set as a preference in PagerDuty.

### Objectives

- Gain empathy for our user persona
- Observe pain points in the current SRE workflow so that we can improve it
- Observe ways the SRE team can dogfood more features
- Document observations in some medium that allows non-shadows to learn (e.g., blog post, Q&A session, etc.)

### Outcomes

- Engineers gain a better understanding of the users we build features for.
- Engineers become better stakeholders for the stage.
  - They are able to create more feature proposals to help the stage build features that improve the life of our user persona.
  - They are better equipped to influence product direction based on their observations of what is better for our user persona.
- Engineers develop stronger relationships with the SRE team.
  - This enables improved collaboration and efficiency in dogfooding features and getting faster feedback cycles for our features.

### How to participate

Engineers interested in the program should notify their respective frontend/backend engineering managers.
Managers should collaborate and determine an optimal schedule in the Slack channel `#monitor-sre-shadow` and create an access request for PagerDuty. Assign the access request to the SRE manager (this is a departure from [established processes](https://about.gitlab.com/handbook/business-ops/it-ops-team/access-requests/#bulk-access-request)).

We are currently limited to a maximum of 2 shadows per release so that we do not overload the SRE team. If you are shadowing during the same release as another engineer, coordinate to create a combined access request for the duration of the release.

Before starting your rotation, coordinate with the SRE(s) who will be on call to determine which areas it makes sense for you to shadow (incidents, other on-call tasks, SRE daily tasks, etc.). You can either check PagerDuty or coordinate with the SRE manager to figure out who you'll be shadowing.

### Alumni

Alumni of the program are encouraged to add themselves to this list and document/link to the observations/outcomes they were able to share with the wider team.

| Name | Outcomes |
|---|---|
| Tristan Read | [My week shadowing a GitLab Site Reliability Engineer](https://about.gitlab.com/blog/2019/12/16/sre-shadow/) |
| Sarah Yasonik | [Created 4 issues for the team to consider adding to the product](https://gitlab.com/gitlab-org/monitor/health/-/issues/12#related-issues) |

# Resources

## Demo Environments

To make it more efficient to verify changes and demonstrate our product features to customers and other stakeholders, the engineers in this stage maintain a few demo environments.


| Use Case | URL |
|---|---|
| Customer simulation environment | [tanuki-inc](https://gitlab.com/gitlab-org/monitor/tanuki-inc) |
| Verifying features in Staging | [monitor-sandbox (Staging)](https://staging.gitlab.com/gitlab-org/monitor/monitor-sandbox) |
| Verifying features in Production | [monitor-sandbox (Production)](https://gitlab.com/gitlab-org/monitor/monitor-sandbox) |

## Video and Tutorials

- [Introduction to Prometheus (Video)](https://www.youtube.com/watch?v=8Ai55-sYJA0)
- [Setting up Kubernetes cluster for local development (Video)](https://www.youtube.com/watch?v=dFIlml7O2go)
- [How to run GitLab Omnibus in docker with common monitor-related features enabled (Tutorial)](https://gitlab.com/snippets/1892700)
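
## Example: end-of-milestone rules as code

The end-of-milestone rules in the Monthly Cadence section (the 18th, or the last weekday before it when the 18th falls on a weekend, plus what happens to open issues on that date) can be sketched in a few lines of Python. This is only an illustration of the rules as written on this page, not a tool the team is known to use. The function names `last_date_of_milestone` and `next_step_for_open_issue`, their parameters, and the returned strings are hypothetical.

```python
from datetime import date, timedelta


def last_date_of_milestone(year: int, month: int) -> date:
    """Return the 18th of the month, or the last weekday before it
    when the 18th falls on a Saturday or Sunday."""
    day = date(year, month, 18)
    # date.weekday(): Monday is 0, ..., Saturday is 5, Sunday is 6.
    while day.weekday() >= 5:
        day -= timedelta(days=1)
    return day


# Workflow labels that keep an assigned filler issue on the next milestone.
IN_PROGRESS_LABELS = {
    "workflow::in dev",
    "workflow::in review",
    "workflow::verification",
}


def next_step_for_open_issue(is_deliverable: bool, assigned: bool, labels: set) -> str:
    """Hypothetical helper: what happens to an issue still open on the last date."""
    if is_deliverable:
        return "move to next milestone and label missed-deliverable"
    if assigned and labels & IN_PROGRESS_LABELS:
        return "move to next milestone"
    return "move to backlog"


if __name__ == "__main__":
    # 18 January 2020 was a Saturday, so the last date is Friday the 17th.
    print(last_date_of_milestone(2020, 1))  # 2020-01-17
    print(next_step_for_open_issue(False, True, {"workflow::in review"}))
```

The sketch treats only Saturday and Sunday as non-working days; public holidays are not mentioned in the rule and are therefore not handled.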