Plumbee engineering

Building a Bot


Problem

One problem I’ve encountered in many of my jobs is keeping the builds green. Even intelligent and well-intentioned developers can sometimes drop the ball on this.

Context

At Plumbee, we’ve aimed for modularity in our code, and have a strong focus on automated test suites to validate our work. We use Jenkins to manage our (often sizeable) build pipelines. While this approach has many benefits, it means that a single commit can trigger dozens of builds, and a single mistake can cause failures far downstream.

One of our dashboards displaying a tiny fraction of our Jenkins jobs.

In the past we have set up dashboards around the office to add visibility to builds that nearby teams might care about. However, we now have a Jenkins job count well into the hundreds, and it is often difficult to tell at a glance who might be responsible for a failure and who might be looking at it. While some dashboards are often all green, a red build on a dashboard is a textbook invitation to the bystander effect: everyone can see it, so everyone assumes that someone else is dealing with it.

Opportunity

As a key part of our approach to openness and engineer satisfaction, as well as reducing technical debt in the places that actually bother people the most, we have a stream of work focused on “Engineering Goals”. Each project within the stream is tackled by a small self-organising team.

In the last round of Engineering Goals projects, a team of three of us built a Slack bot to see if we could improve the build situation. The aim was to pull data from the Jenkins API and use Slack to provide targeted information to the people most likely to be in a position to fix any failures that were occurring.

One member of our team had already built a tool that scrapes the Jenkins API to show build information in the OS taskbar. I had already built a Slack bot to do matchmaking for our Football Table.

Design

The main role of the BuildBot is to monitor our Jenkins builds and to let us know when they are broken. To maximise the signal-to-noise ratio, we wanted to target users as precisely as we could.

The bot polls Jenkins and finds views that mention a Slack channel in their description. If any build in such a view is newly broken, the relevant Slack channel is informed. We walk the upstream job tree of the failing build to find people who may have caused the failure and mention them by name in the Slack message.

A failure notification showing that @tom and one of our robot users may have triggered the failure. Convenient links take people straight to the relevant Jenkins pages.
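The channel lookup and upstream walk described above could be sketched roughly as follows. This is an illustration in Python, not the bot's Clojure code. The `#channel` convention in view descriptions, the function names and the `fetch_build` lookup are all assumptions made for the example; the `actions`, `causes`, `userId`, `upstreamProject` and `upstreamBuild` keys are assumed to follow the shape of the build JSON returned by the Jenkins API.

```python
import re

# Assumed convention: a view's description names its channel as "#channel-name".
CHANNEL_PATTERN = re.compile(r"#([a-z0-9_-]+)")


def channel_for_view(description):
    """Return the first Slack channel named in a view description, or None."""
    match = CHANNEL_PATTERN.search(description or "")
    return match.group(1) if match else None


def possible_culprits(job, number, fetch_build, seen=None):
    """Walk the upstream causes of a build and collect the users who started them.

    fetch_build(job, number) is a hypothetical lookup that returns a dict
    shaped like a Jenkins build: {"actions": [{"causes": [...]}]}.
    """
    seen = set() if seen is None else seen
    if (job, number) in seen:  # do not revisit a shared upstream build
        return set()
    seen.add((job, number))

    users = set()
    build = fetch_build(job, number)
    for action in build.get("actions", []):
        for cause in (action or {}).get("causes", []):
            if "userId" in cause:
                # A person triggered this build directly.
                users.add(cause["userId"])
            elif "upstreamProject" in cause:
                # Another build triggered this one, so keep walking upwards.
                users |= possible_culprits(
                    cause["upstreamProject"], cause["upstreamBuild"], fetch_build, seen
                )
    return users


# In-memory stand-in for the Jenkins API, with made-up job names.
BUILDS = {
    ("integration-tests", 7): {
        "actions": [{"causes": [{"upstreamProject": "core-lib", "upstreamBuild": 42}]}]
    },
    ("core-lib", 42): {"actions": [{}, {"causes": [{"userId": "tom"}]}]},
}
culprits = possible_culprits("integration-tests", 7, lambda job, n: BUILDS[(job, n)])
```

Passing the lookup in as a function keeps the walk itself free of network calls, so it can be exercised against canned data like the dictionary above.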

While a build remains broken, half-hourly reminders are sent to the channel.

A failure reminder.

If someone reacts to a build-broken message, they are considered to have claimed that failure. From then on, only they receive the reminders, and the frequency drops to thrice daily.

A less frequent reminder for a claimed build failure.
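The reminder rules above (half-hourly to the channel while unclaimed, thrice daily to the claimants once claimed) can be sketched as a small pure function. This is a Python illustration under stated assumptions: the `failure` dictionary and its keys are hypothetical, and "thrice daily" is interpreted as one reminder every eight hours.

```python
from datetime import datetime, timedelta

UNCLAIMED_INTERVAL = timedelta(minutes=30)  # "half-hourly"
CLAIMED_INTERVAL = timedelta(hours=8)       # "thrice daily", assumed evenly spaced


def reminder_due(failure, now):
    """Return (recipients, due) for a broken build.

    failure is a hypothetical dict with the keys "channel", "claimed_by"
    (a set of user names) and "last_reminded" (a datetime).
    """
    claimed = bool(failure["claimed_by"])
    interval = CLAIMED_INTERVAL if claimed else UNCLAIMED_INTERVAL
    # Claimed failures go only to the claimants; unclaimed ones go to the channel.
    recipients = sorted(failure["claimed_by"]) if claimed else [failure["channel"]]
    due = now - failure["last_reminded"] >= interval
    return recipients, due
```

Taking `now` as an argument, rather than reading the clock inside the function, makes the schedule easy to check without waiting for real time to pass.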

Once a build passes, the channel is notified. Those users who had claimed an interest in the build are mentioned by name.

Approach

The design evolved iteratively. Once we had a skeleton up and running, we dogfooded the bot by setting it up to report on its own builds. As we became happier with the level of functionality it provided, we rolled it out to report on more builds in more Slack channels.

Our existing Slack bot code had started as a 5% time project that I had written in Clojure. While the other team members were enthusiastic about trying Clojure, they were much less experienced with it than I was. I’m lucky to work with open-minded peers.

When Jon Pither from JUXT came to give us a talk about Clojure, one of the points he made was that a Clojure REPL is a great tool for exploring an API. We certainly found this to be true. The Jenkins API is fairly straightforward, and we were able to iterate quickly on some queries just by manually rewriting the URLs in a browser. When it came to manipulating the data, though, Clojure’s data-first approach and a running REPL sped things up enormously.

Having found the data that we needed, we extended the REPL-driven approach into the main development effort. This gave me an excuse to play with Stuart Sierra’s component library and reloaded workflow and to learn about writing reloadable code, a concept that was unfamiliar to me after years of writing Java/Spring applications. The upshot was that we could call (reset) at the REPL and have a cleanly refreshed server running our latest code in less time than it takes to start a JVM. While the server was running, we could interrogate (and occasionally change) the mutable parts of the application state, as well as update function definitions as we developed the code base.

Another new approach for us was to code the BuildBot together. We ordinarily use a mix of code reviews and occasional ad-hoc pairing to keep code quality up, share knowledge and keep our code style consistent. For this project we extended the pairing approach to a ‘small-mob’ of three. This meant that all three of us on the project knew all the code. We also found it to be very productive, as we kept each other focused and rarely reached a point where no one had any ideas about where to go next.

Structure

Besides letting us quickly refresh our development server, the component library gave some structure to our code and provided a point of familiarity to those unfamiliar with Clojure. Once we described it as a library for dependency injection and lifecycle management, Java developers knew exactly what that meant, even though it is implemented in only about 200 lines of fairly straightforward code.

Additionally, it made clear to me which parts of the code needed to be managed, and which could be implemented as pure functions. In the current code, much of the complexity has been pulled into the managed code, leaving the bots that we have currently implemented as pure functions. Each bot exposes a handler function and an initial state. Each bot’s handler function is passed its own state and an event (representing, for example, a message posted to Slack) and returns a new version of its state. Conceptually, it’s like reducing a stream of Slack events into a state object. Output from the bots is handled by an :outbox entry in the state map.
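To make the handler-as-reducer idea concrete, here is a minimal sketch in Python rather than the Clojure the bots are written in. The greeter bot, its state keys other than the outbox, and the event dictionaries are all invented for illustration; only the overall shape (initial state, a pure handler, an outbox entry in the state) comes from the description above.

```python
from functools import reduce

# Hypothetical initial state for a bot that greets each user once.
INITIAL_STATE = {"greeted": frozenset(), "outbox": ()}


def greeter_handler(state, event):
    """Pure handler: takes the bot's state and one event, returns a new state."""
    if event.get("type") != "message" or event["user"] in state["greeted"]:
        return state  # nothing to do, so the state is returned unchanged
    reply = {"channel": event["channel"], "text": f"Hello, <@{event['user']}>!"}
    return {
        "greeted": state["greeted"] | {event["user"]},
        # Output is queued in the outbox rather than sent from the handler.
        "outbox": state["outbox"] + (reply,),
    }


events = [
    {"type": "message", "user": "U1", "channel": "C1", "text": "hi"},
    {"type": "message", "user": "U1", "channel": "C1", "text": "hi again"},
    {"type": "reaction_added", "user": "U2"},
]

# Folding the event stream through the handler yields the final state.
final_state = reduce(greeter_handler, events, INITIAL_STATE)
```

In this sketch the surrounding managed code would be responsible for sending whatever is in the outbox and clearing it afterwards; how the real bot framework drains the outbox is not described here, so that part is an assumption.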

There were two main advantages to this approach. Firstly, writing tests for a simple bot is incredibly easy. You pass in two maps, and receive a map in return to run assertions against. If your bot has no side effects then it really is that simple. Secondly, I wanted to make it easy and fun for others at Plumbee to write Slack bots. By removing the need to deal with mutable state or external resources, I hoped to achieve this.

The downside is that some of the bots we have are not true pure functions. A SupportBot was recently added that interacts with the PagerDuty API. This is probably fine, and was the same approach we took when talking to Jenkins for our BuildBot. The bots are only pure with regard to Slack, but so far they treat these external resources as read-only, and the cost of moving bot-specific impure functionality into common code does not seem worthwhile. Of course, it does make testing them a little harder…

Results

We finished the project on time and had a great deal of fun doing it. We also learned a lot in the process. We learned that we like mob-programming and have since been mobbing on other projects. We learned that we can iterate quickly with a REPL. We all improved our Clojure skills. We were also impressed by the Jenkins API, which packs a lot of functionality and at times feels like a nicer interface than the UI!

Reception

Given that the BuildBot is designed to hassle busy people until they take responsibility for another task, we were pleasantly surprised by people’s initial reactions to it. We received some positive feedback and people began to use it.

A few of the messages our BuildBot received.

Aftermath

After an initial flurry of bot/human interaction, the novelty seemed to wear off. Some builds that apparently no one cared about remained broken for days on end. While some of our builds are inevitably less important than others, it seemed like a good idea to try to reduce the amount of spam that the bot was producing.

To that end, I used some more 5% time to add an exponential backoff to the reminders for unclaimed build failures. I was concerned that broken builds might then drop off the radar and never be fixed, so I also added a daily leaderboard message with the top three most broken builds. Now the level of unwanted messages has gone down, but everyone still knows when we have broken builds.
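The backoff and leaderboard could look something like the sketch below, written in Python for illustration. The doubling factor, the one-day cap and the reading of "most broken" as "broken for the longest time" are assumptions of the example rather than details of the bot.

```python
from datetime import timedelta

BASE_INTERVAL = timedelta(minutes=30)  # first gap, matching the half-hourly reminder
MAX_INTERVAL = timedelta(days=1)       # assumed cap on the gap between reminders


def backoff_interval(reminders_sent):
    """Gap before the next reminder: doubles with each reminder sent, up to the cap."""
    return min(BASE_INTERVAL * (2 ** reminders_sent), MAX_INTERVAL)


def leaderboard(broken_for, size=3):
    """Return the `size` builds that have been broken longest, longest first.

    broken_for maps a job name to how long it has been broken (a timedelta).
    """
    ranked = sorted(broken_for.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:size]]
```

With these assumed values the gaps grow as 30 minutes, 1 hour, 2 hours and so on, until they level off at one reminder per day.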
