---
title: "Introducing Token-Hunter"
author: Greg Johnson
author_gitlab: codeEmitter
author_twitter: code_emitter
categories: security
image_title: '/images/blogimages/lightscape-Bsw6l6e01Rw-unsplash.jpg'
description: "Our red team has created a new tool to find sensitive data in the vast, wide-open."
tags: security, security research, open source
merch_banner: merch_four
twitter_text: ".@GitLab introduces a new #opensource #infosec tool to help teams find sensitive data shared in assets unintentionally. Community contributions are welcome!"
postType: content marketing
---

We operate our business at GitLab [“public by default”](/handbook/values/#public-by-default) so that other people can benefit from our transparent business practices. Defaulting to public sharing also means we store massive amounts of data in a public format by design. Much of what we do as a company takes the form of a GitLab issue and is open for the world to see, including by individuals with nefarious goals. Naturally, as a [Red Team](/handbook/engineering/security/red-team/), we’re curious about how all of this public information could aid someone intent on attacking GitLab.

We started our investigation by identifying secrets that are unintentionally shared across the different parts of GitLab and the assets we make public, like issues, issue discussions, and snippets. No existing tooling accomplished what we set out to do, so we developed it ourselves and just released it: [Token-Hunter](https://gitlab.com/gitlab-com/gl-security/gl-redteam/token-hunter).

### Background

API tokens are a keystone of the development world. They facilitate important functionality not only in the software itself, but also in the deployment, maintenance, integration, and security of both closed and open source projects. Many companies providing services on the internet offer API tokens in multiple flavors that allow interaction with their systems, as does GitLab.
These tokens offer configurable access to otherwise closed systems, often allowing a client to act as a user’s session and access raw data. Developers, DevOps professionals, infrastructure professionals, and the like often depend on API tokens to do their jobs. It’s a common and understandable mistake to commit one of these tokens to a Git repository when building software in a shared environment. Moving quickly, trying to support a fellow developer, and pushing to get things done efficiently can all lead to mistakes made under pressure, which can happen to any of us. Popular tools that search for these commits, like [gitrob](https://github.com/michenriksen/gitrob), [TruffleHog](https://github.com/dxa4481/truffleHog), [gitleaks](https://github.com/zricethezav/gitleaks), and even GitLab’s own [SAST project](https://docs.gitlab.com/ee/user/application_security/sast/), can find leaked tokens given proper configuration. Our Red Team had early success leveraging these known techniques, tactics, and procedures (TTPs).

The tools referenced above are fantastic at finding secrets unintentionally left in source code. However, it's also a common mistake to submit sensitive data like API tokens, usernames, and passwords to public forums like [GitLab snippets](https://docs.gitlab.com/ee/user/snippets.html), [issues](https://docs.gitlab.com/ee/user/project/issues/), and [issue discussions](https://docs.gitlab.com/ee/api/discussions.html). Sharing this type of information by accident can happen easily when attempting to share relevant details to facilitate a public support request, as we often do at GitLab for many different products. Though most people know not to post sensitive information in a public place directly, mistakes do happen: shortcuts are taken, logs get shared, configuration files get dropped, and information inadvertently gets leaked and leveraged.
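At their core, tools like these match regular expressions against text. A minimal sketch of the idea in Python (the patterns and the `find_secrets` helper below are illustrative only, not Token-Hunter's actual implementation):

```python
import re

# Illustrative patterns only -- real scanners ship far more thorough regex sets.
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"]?[0-9a-zA-Z_\-]{16,}"),
}

def find_secrets(text):
    """Return (pattern_name, matched_text) pairs for every hit in `text`."""
    findings = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((name, match.group(0)))
    return findings
```

Run over the body of an issue comment or snippet, a function like this flags anything that looks token-shaped; the hard part, as discussed later, is tuning the patterns to your environment.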
### Exploring the wide-open

Token-Hunter is intended to complement tools like gitrob, gitleaks, TruffleHog, and others. It can be used whether you host your groups and projects on GitLab.com or on a self-hosted GitLab instance of your own. We created Token-Hunter to support the following features:

- **Search GitLab issues and the related discussions for sensitive data.** GitLab issues and comments are a primary method of sharing information and resolving support issues. They typically contain shared log data, configuration files, copy/pasted source code examples, and discussions by both GitLab employees and customers, and are therefore likely to contain sensitive data.
- **Search GitLab snippets for sensitive data.** These are small, URL-addressable chunks of code or text intended to be shared between GitLab users or served directly in source code. They are most often used to share small bits of configuration data, JavaScript source code, example code in any language, or log data, and so are likely to contain sensitive information like usernames and passwords, API tokens, etc.
- **List all of the projects associated with a group.** This is helpful for quantifying the problem and understanding where the search will start. Optionally, you can include members’ projects in the search to expand the organizational scope, similar to gitrob. Starting at different points in the project after you understand your target more completely can yield very different results.
- **Proxy all traffic from the tool.** Token-Hunter accepts arguments for an HTTP proxy server and a self-signed certificate to decrypt TLS traffic. GitLab’s Red Team used this feature to record example traffic patterns for the Security Operations team in support of defensive strategy development. It is also handy for debugging, letting you examine the traffic the tool generates.
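The proxying behavior in that last bullet can be reproduced with any HTTP client. A hedged sketch using Python's `requests` library, where the proxy address and CA-certificate path are placeholders you would point at your own intercepting proxy:

```python
import requests

# Placeholder values: point these at your local intercepting proxy and its CA cert.
PROXY_URL = "http://127.0.0.1:8080"
CA_CERT_PATH = "/path/to/proxy-ca.pem"

def make_proxied_session(proxy_url=PROXY_URL, ca_cert=CA_CERT_PATH):
    """Build a requests Session that routes all traffic through an HTTP proxy."""
    session = requests.Session()
    # Route both plain and TLS traffic through the proxy.
    session.proxies = {"http": proxy_url, "https": proxy_url}
    # Trust the proxy's self-signed CA so it can decrypt and re-encrypt TLS traffic.
    session.verify = ca_cert
    return session
```

Every request made through a session configured this way passes through the proxy, so the tool's traffic can be recorded and inspected.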
[Burp Suite](https://portswigger.net/burp/communitydownload) and [OWASP Zap](https://www.owasp.org/index.php/OWASP_Zed_Attack_Proxy_Project) are two popular choices for proxying traffic locally, and both can be configured with a self-signed certificate to decrypt TLS traffic. For full details on using the tool and each of its available arguments, visit [the Token-Hunter project page](https://gitlab.com/gitlab-com/gl-security/gl-redteam/token-hunter/tree/master) on GitLab.

### Taming the wild... mostly

Hitting an API to gather large amounts of raw data is daunting. Internet connections flake out, servers need maintenance, rate limits get hit, WiFi drops, performance degrades, timeouts happen, and you end up with a headache simply trying to get the data you’d like to analyze. To counter some of these issues as pragmatically as possible, we applied two simple algorithms: request retries and dynamic page-size reduction.

The retry algorithm simply retries a failed request after a few seconds. The tool will retry a failed request twice, each time after a four-second delay with a four-second backoff. In other words, the first retry occurs four seconds after the initial failed request, and the second retry occurs eight seconds after the first failed retry attempt. If both retry attempts fail, the tool then reduces the paging size in order to complete the request. Reducing the page size reduces the number of records the request needs to return, lessening the likelihood of a timeout.

*Though simple, these two algorithms allowed the tool to reliably pull data for nearly 1.3 million individual GitLab assets with only three recorded request errors, resulting in over 1,600 pattern matches.*

### More to explore

The ability to search discussions and other popular channels where sensitive data is likely to be shared is the key benefit of Token-Hunter over other related tooling.
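The retry-and-backoff scheme described in the previous section can be sketched as follows. This is a simplified illustration, not Token-Hunter's actual code; `fetch_page` is an injected callable standing in for the real API request:

```python
import time

def fetch_with_retry(fetch_page, page, per_page=100,
                     max_retries=2, base_delay=4, min_per_page=25):
    """Fetch one page of results, retrying with a growing delay and then
    shrinking the page size once retries are exhausted."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return fetch_page(page=page, per_page=per_page)
        except IOError:
            if attempt < max_retries:
                time.sleep(delay)   # first retry after 4s, second after 8s
                delay += base_delay
    # Both retries failed: halve the page size and try again, down to a floor.
    if per_page > min_per_page:
        return fetch_with_retry(fetch_page, page, per_page // 2,
                                max_retries, base_delay, min_per_page)
    raise IOError("request failed after retries and page-size reduction")
```

Because a smaller page asks the server for fewer records per request, each halving makes a timeout less likely, at the cost of more round trips.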
The Red Team plans to continue iterating to support our operations, including adding support for more assets such as [merge requests](https://docs.gitlab.com/ee/user/project/merge_requests/), commit discussions, and [epics](https://docs.gitlab.com/ee/user/group/epics/).

We learned during our operation that our level of success was largely determined by how well our regular expressions fit the context in which we were searching (posted log data formats, configuration file formats, code structure, etc.). It can be necessary to tune these expressions for your environment and context. To start, we made a few adjustments to [TruffleHog’s regular expressions](https://github.com/dxa4481/truffleHogRegexes) to add coverage for GitLab-specific token formats. However, there’s still much room for improvement depending on your environment and objective. Looking for a specific password for a username? Trying to find all mentions of a specific server DNS name or IP? Expecting a specific log format that has the potential to contain an API token? Tune [the regular expressions](https://gitlab.com/gitlab-com/gl-security/gl-redteam/token-hunter/blob/master/regexes.json), and you just may find what you’re looking for.

### We want your ideas and contributions

There is still plenty to be done, and we welcome community contributions and ideas. If the tool is helpful to you in defending your infrastructure and you’d like to contribute, [there are instructions in the README.md](https://gitlab.com/gitlab-com/gl-security/gl-redteam/token-hunter#contributing) on how to get started. If you’re not sure what to do, pick an issue from [our issue list](https://gitlab.com/gitlab-com/gl-security/gl-redteam/token-hunter/issues) or add to the existing discussions. Some of the ideas we’re currently pursuing are:

- **Better output formatting:** We’d like to standardize output to an industry-accepted format that supports findings verification.
  A simple CSV file might be the first step.
- **Real-time reporting of findings:** Currently, the tool gathers data first, then reports on the findings, leaving you in way too much suspense for way too long. Reporting findings as they are found would allow verification to begin earlier during a long-running execution.
- **Data persistence:** Querying the API is the costliest part of inspecting GitLab assets for sensitive data. Persisting the data from an execution would:
  - Reduce the need to query the API again after tuning your regular expressions. During our operation, we often needed to change the regular expressions based on what we were seeing in the matches, and re-querying was virtually impossible given the amount of data we had to pull.
  - Allow long-running executions to be paused and resumed. Executions against larger groups can take several hours and sometimes required a restart during our operation.
  - Maintain a permanent record of findings should the asset be edited after a match is found. During our exercise, there were a few occasions where matches looked legitimate but could not be verified because the asset was modified post-discovery.

We have learned a lot from this initial attempt at gathering OSINT from rather unique and unorthodox locations, but this exercise was just a start. We hope you find the tooling useful, and if you have questions or ideas to share, please reach out through [email](mailto:redteam@gitlab.com), through our [issue board](https://gitlab.com/gitlab-com/gl-security/gl-redteam/token-hunter/-/boards), or [on Twitter](https://twitter.com/code_emitter). Happy hacking!

Photo by [Lightscape](https://unsplash.com/@lightscape?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/photos/Bsw6l6e01Rw).
{: .note}