RippleNet Engineering's Inclusive Language Initiative: Part 2

Welcome back to the second post of this Inclusive Language blog series! Previously, we contextualized the importance of eliminating terms with problematic and racist origins, such as “master” and “slave” or “blacklist” and “whitelist”, from our codebase. We then suggested replacing them with equally clear, more agreeable words such as “primary” and “secondary”, and “denylist” and “allowlist”.

In this post, we’ll walk through the second step of creating a more inclusive codebase—replacing each term across all our primary branches and repositories of code. It was paramount that we implement a solution that not only removes and replaces these problematic words, but keeps them out of the codebase for good.

For this task, we split our efforts into the following activities:

  1. Sizing the problem
  2. Fixing the files
  3. Creating a repeatable process
  4. Renaming the Git branches

Sizing the problem

The first step in tackling any problem is figuring out its scope. This was difficult at first because it seemed like we were dealing with a constantly moving target: our entire codebase is dynamic due to the changes our engineers constantly introduce, so working around this was not an easy task.

To scan our codebase, we started with a simple script that finds problematic words in the files of a specified project. Below is an excerpt of the code we used to find the terms within a file, along with each match's line number and file name. In this example, keyword represents a problematic word. We iterate through our keywords dictionary to ensure that every instance of every keyword is found.

keyword_hits = []  # collected matches for the report
with open(file_name, 'r') as source_file:
    for line_number, line in enumerate(source_file, start=1):
        for keyword in keywords_dict.keys():
            if keyword and keyword in line and not is_ignored_keyword(line, ignored_keywords):
                keyword_hit = {
                    # Strip the local clone path prefix from the file name
                    "file_name": str(file_name)[len(project) - 1:],
                    "line_number": line_number,
                    "line": html_escape(line),
                }
                keyword_hits.append(keyword_hit)
                print("Found '" + keyword + "' in " + str(file_name) + " on line " + str(line_number))

After verifying that this simple script could locate keywords in a given project, we focused on expanding the search to every project in a given GitLab group. It’s a general priority here at RippleNet Engineering to test and refine new processes iteratively, so we constrained our search script to repositories that were members of a specified GitLab group rather than every repository under the sun. Our inventory of repositories is always changing, with engineers deprecating old projects and adding new ones, so we chose to fetch an up-to-date list of repositories in a specified GitLab group and clone them, ensuring that each repository was frozen while we applied our search. Below is an excerpt showing how we fetched the list of repositories in a specified GitLab group, using the python-gitlab package:

import datetime
import gitlab

def get_repo_names():
    gl = gitlab.Gitlab.from_config('somewhere')

    groups = gl.groups.list(search=groupName)
    group = next((group for group in groups if group.name == groupName), None)

    date = (datetime.date.today() - datetime.timedelta(days=numDays)).isoformat()

    # ProjectMap is a singleton class that stores info about the chosen repos
    project_map = ProjectMap.getInstance()

    if group is not None:

        projects = group.projects.list(all=True, include_subgroups=True, last_activity_after=date)
        # Sometimes repositories contained 3rd party libraries that we chose to
        #    ignore, as we couldn’t change their source code
        repos_to_ignore = config["ignored_repos"]
        counter = 0
        for project in projects:
            if config["skip_archived"] and project.attributes["archived"]:
                continue
            if project.name not in repos_to_ignore:
                counter = counter + 1
                project_map.addToProjectMap(project)

    return project_map
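
The ProjectMap singleton referenced above keeps track of the repositories selected for scanning. Its exact implementation isn't important; a minimal sketch of the idea looks something like this (the stored fields are illustrative):

# Simplified sketch of the ProjectMap singleton used above. The real class
# stores more project metadata, but the idea is the same.
class ProjectMap:
    _instance = None

    def __init__(self):
        self.projects = {}

    @classmethod
    def getInstance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def addToProjectMap(self, project):
        # Keyed by project ID, which doubles as the clone directory name
        self.projects[str(project.id)] = {
            "project_id": project.id,
            "git_url": project.ssh_url_to_repo,
        }

    def getFromProjectMap(self, key, attribute):
        return self.projects[key][attribute]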

Here’s an excerpt of how we cloned these repositories:

import git
def clone_repo(project_id, repo_git_url):
    destination_folder_name = "./repositories/" + project_id
    print("Cloning repo from " + repo_git_url + " into " + destination_folder_name)
    try:
        git.Git().clone(repo_git_url, destination_folder_name, depth=1)
        return True
    except git.exc.GitCommandError as err:
        print("Error cloning repo: {0}".format(err))
        return False
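
Putting the two together, the scanning flow looks roughly like this (scan_project is a hypothetical stand-in for the keyword search shown earlier, and the projects attribute comes from the ProjectMap sketch above):

# Illustrative driver loop: fetch the project list, clone each repository
# into ./repositories/<project_id>, then scan it for problematic terms.
project_map = get_repo_names()
for project_id, info in project_map.projects.items():
    if clone_repo(project_id, info["git_url"]):
        scan_project("./repositories/" + project_id)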

We packaged the search script and the repository cloning process into a Docker container for a number of reasons:

  1. It was easy to pass this process onto another engineer,
  2. It was easy to wipe our state clean by recreating the container, and
  3. It was easy to ensure our scripts worked as expected in a variety of engineering environments.

Below is an excerpt of our Dockerfile. As you can infer, the aforementioned scripts were placed in the scripts/ folder.

FROM alpine:3.12.0

RUN apk update && \
  apk add python3=3.8.5-r0

RUN apk add git py3-pip openssh gcc python3-dev musl-dev; \
  pip3 install --upgrade gitpython Jinja2 python-gitlab; \
  pip3 install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib

# so someone that only has the container image can see how it was created
COPY Dockerfile /home/ruser/Dockerfile

COPY etc/python-gitlab.cfg /etc/python-gitlab.cfg

COPY scripts/ /scripts
COPY configuration/ /configuration

To summarize and propagate our findings, we used the Jinja2 templating library to format a report as an HTML file. We then used the Gmail API’s Python client library to email the generated report to engineers who were appointed in their respective teams to monitor inclusive language changes. The Gmail API Python Quickstart guide is a great summary of how we utilized this library.
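
As a rough sketch, rendering the report with Jinja2 looks something like this (the template path and field names are illustrative):

from jinja2 import Environment, FileSystemLoader

# Render the collected keyword hits into an HTML report
env = Environment(loader=FileSystemLoader("/scripts/templates"))
template = env.get_template("keyword-report.html.j2")

with open("keyword-report.html", "w") as report:
    report.write(template.render(keyword_hits=keyword_hits))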

At this point, we had a script that generated a report containing words that needed to be changed in every project of a specified GitLab group, which was then emailed to a group of accountable engineers.

Fixing the files

The next step was to replace the words we wanted to change. Automating these fixes is an important part of minimizing developer time required to change numerous terms in several projects. To start, we created a static map that contained problematic words as keys and the words we wanted to replace them with as their values. For example, “master” mapped to “primary” and “blacklist” mapped to “denylist”. We amended our search script from the previous part to also replace the found words with their corresponding alternatives. For example, every instance of the word “blacklist” changed to “denylist”.
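
At its core, the fix is a straightforward substitution pass over each file, driven by that map. A simplified sketch of the idea:

def fix_file(file_name):
    # Rewrite the file in place, substituting each problematic term
    # (a key in keywords_dict) with its replacement (the value).
    with open(file_name, 'r') as source:
        contents = source.read()
    for keyword, replacement in keywords_dict.items():
        contents = contents.replace(keyword, replacement)
    with open(file_name, 'w') as target:
        target.write(contents)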

We added a configuration option that allowed engineers to toggle whether or not the found words would be automatically replaced. In either case, a report on problematic word occurrences would still be generated and emailed to engineers. We chose to make this configurable because the order of changes matters when one repository is a dependency of another. For instance, if repository A imports problematic variables from repository B, then repository B needs to fix its keywords before keywords are changed in repository A.
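
The configuration keys used throughout the excerpts in this post look roughly like this (shown here as a Python dictionary; the repository name under ignored_repos is a made-up example):

# Illustrative configuration, loaded into the `config` dictionary used
# in the excerpts above.
config = {
    "ignored_repos": ["some-third-party-fork"],  # repos we cannot change
    "skip_archived": True,   # don't scan archived projects
    "fix_files": True,       # replace terms instead of only reporting them
    "auto_open_mr": True,    # push a branch and open a merge request
}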

Should a developer choose to automatically replace problematic words in a project, a new Git branch would be created to capture these changes safely. The changes would be added, committed to the branch, and pushed upstream to create a new merge request. From this point, the proposed merge request would follow our standard code review process before getting merged to main. As engineers become more accustomed to our solution, the process will only get smoother.

Here is an excerpt showing how we used the GitPython and python-gitlab libraries to push a branch with our keyword fixes and open a merge request:

    def initialize_git_head(self):
        if self.config["fix_files"] and self.config["auto_open_mr"]:
            self.repo = git.Repo.init(self.project_dir)
            self.empty_file = os.path.join(self.repo.working_tree_dir, 'empty_file.txt')

            # Create a placeholder file so the new branch has an initial commit
            with open(self.empty_file, 'w') as fp:
                pass

            self.repo.index.add([self.empty_file])
            self.repo.index.commit("initial commit")
            new_branch = self.repo.create_head(self.branch_name).checkout()
            self.repo.head.reference = new_branch

    def generate_mr(self, keyword_hits):
        if self.config["fix_files"] and self.config["auto_open_mr"] and len(keyword_hits) > 0:
            origin = self.repo.remotes.origin
            origin.fetch()
            # Stage the replaced files, drop the placeholder, and push the branch
            self.repo.git.add('--all')
            self.repo.index.remove([self.empty_file])
            self.repo.index.commit("Proposed changes by Inclusive Language Initiative")
            self.repo.git.push("--set-upstream", origin, self.repo.head.ref)

            # Open a merge request against main for the standard review process
            project_map = ProjectMap.getInstance()
            project_id = project_map.getFromProjectMap(self.project_dir.rsplit('/', 1)[-1], 'project_id')
            gl = gitlab.Gitlab.from_config('somewhere', ['/home/ruser/.python-gitlab.cfg', '/etc/python-gitlab.cfg'])
            project_obj = gl.projects.get(project_id)
            print("Creating merge request...")
            new_mr = project_obj.mergerequests.create({
                    'source_branch': self.branch_name,
                    'target_branch': 'main',
                    'title': 'WIP: Proposed changes for Inclusive Language Project'
                })
            print("Merge request URL: " + new_mr.web_url)

Creating a repeatable process

Finally, we wanted to establish a new norm of using more inclusive language. Like any shift in habit, this takes practice, so our solution includes a repeatable workflow that we ported to a GitLab pipeline job. Below is an example of what we added to our project’s GitLab CI configuration file (.gitlab-ci.yml) to schedule regular reporting of problematic terms. Whether the tool automatically creates merge requests with proposed changes is controlled in the job’s configuration.

Scheduled Report:
  image: $REPORT_IMAGE
  stage: build
  variables:
    GROUP_NAME: group_name
  rules:
    - if: '$CI_PIPELINE_SOURCE != "schedule"'
      when: never
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
      when: always
  script:
    - /scripts/parse_gitlab_url.py
    - python3 /scripts/scan-repos.py ./repositories
    - python3 /scripts/send-email.py $CI_SERVER_URL $CI_PROJECT_NAMESPACE $CI_PROJECT_NAME $CI_JOB_ID
  artifacts:
    paths:
      - $CI_PROJECT_DIR/keyword-report.html
    expire_in: 1 month

We rolled out the first phase of changes to a major GitLab group as a proof of concept and to demonstrate the new process to our fellow engineers. Although we triggered the tool manually that first time, we established a scheduled job on the project’s GitLab CI pipeline to run the task automatically at a monthly cadence. Within the configuration of this now-automated job, we constrained the repository fetch-and-clone process to active repositories that engineers had interacted with within a specific number of days. For instance, if we only wanted to monitor repositories touched within the last 100 days, then any repository whose latest commit was more than 100 days old would not be included in the search.
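
We configured the schedule in GitLab; for reference, an equivalent schedule can also be created programmatically with python-gitlab, roughly like this (the project ID, cron expression, and description are illustrative):

# Create a monthly schedule for the reporting pipeline (illustrative values)
project = gl.projects.get(report_project_id)
schedule = project.pipelineschedules.create({
    'ref': 'main',
    'description': 'Monthly inclusive language report',
    'cron': '0 6 1 * *',   # 06:00 on the first day of every month
})
schedule.variables.create({'key': 'GROUP_NAME', 'value': 'group_name'})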

Renaming Git branches

In addition to replacing terms in our codebase, we also renamed our main Git branches from “master” to “main.” We devoted a couple of days to running the following script, which changed upstream Git branch names in batches in order to minimize the impact on our fellow engineers. Before and after every batch, we sent an update and instructions to engineers to fetch from the new main upstream branch. It’s human nature to take time to adapt, so although this change may have felt awkward for our engineers at first, it is now widely known within Ripple that our upstream branches are called “main.”

    # Excerpt from our batch-rename shell function: create a local "main"
    # branch in each cloned repository that does not already have one.
    cd "$HOME/repositories"
    local counter=0

    for dir in ./*; do
        cd "$dir"
        git pull

        local exists=$(git show-ref refs/heads/main)

        if [ -n "$exists" ]; then
            echo "Branch exists. Skipping."
        else
            counter=$((counter+1))
            echo "Branch does not exist. Creating main branch. count = $counter"
            git checkout -b main
        fi
        echo "$dir"
        cd ..

        if [ $counter -ge $MAX_BRANCHES ]; then
           echo "Found $counter branches to change. Stopping."
           break
        fi
    done
    cd "$HOME"
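
Creating the branch locally is only half of the rename; the rest (pushing main upstream and flipping each project’s default branch on the server) can be handled with GitPython and python-gitlab along the lines of the sketch below, which is illustrative rather than the exact script we ran:

import git
import gitlab

def promote_main_branch(repo_dir, project_id):
    # Push the newly created main branch upstream...
    repo = git.Repo(repo_dir)
    repo.git.push("--set-upstream", "origin", "main")

    # ...and make it the default branch for the GitLab project.
    gl = gitlab.Gitlab.from_config('somewhere')
    project = gl.projects.get(project_id)
    project.default_branch = "main"
    project.save()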

In summary

This is how we created an automated and scalable method for both identifying and replacing problematic terms in our codebase. Beyond the intended outcomes, we also identified some positive, unforeseen ones: the reports surfaced projects that were unmaintained and could be archived, and the tool could be reused to replace words across repositories in future rebranding situations. We realize there may be other changes we want to make to align our ideals with our actions, so we prioritized making our tool as flexible and extensible as possible.

We hope that by sharing our process for making our coding language more inclusive, we will encourage other forward-looking companies to recognize the value of updating some of our industry’s outdated terminology. Every behavioral change starts small.

If you are interested in projects like this one or want to learn more about Ripple Engineering, check out our team page!
