Toolmaking in Software Engineering
Mike Romero
Operations Tool Making

Introduction
When I was young and early in my work career, between ski seasons when I was an instructor and high school semesters, I was lucky enough to work in a machine shop as a machinist assistant. Over the summers, I worked in loud, hot shops, loading metal into machines and watching it turn into aircraft wheel and brake systems for Matco Mfg. It was an enlightening time, and one that I think about frequently in my career as a Software Engineer, SRE, and Platform Engineer.
Of the many lessons I learned there, one comes up often in the work I do now, especially as a Platform Engineer working on an Internal Developer Platform (IDP) while also maintaining the infrastructure automation and engineering tooling that support our legacy on-prem and Azure environments. That lesson is toolmaking.
Toolmaking
At first, toolmaking might seem like a simple concept: making tools like wrenches, screwdrivers, and spanners. But it is actually more nuanced than building common tools.
Toolmaking is the process of creating a tool for a specific use case. For example, if you have a part that doesn’t fit into a vice or grip within a machine, toolmaking might be the process of building a custom fixture to hold the part in the machine. It is a purpose-built tool to achieve a unique function for a job.
A real-life example from my past job was custom grips for wheels in standard machine vices. The wheels, made of aircraft-grade aluminum, could easily be crushed or damaged by the steel grip of the vices. To prevent this, the shop made clamp inserts shaped to hold the specific wheel designs we manufactured. They were stored in the tool closet and reused whenever that wheel was made.

These grips were part of a two-step job. Raw aluminum stock and the grips were both loaded into the same vice. When the raw material completed step one, the part was flipped into the grips, which would hold the now half-finished wheel, and new raw stock was loaded. The machine would then perform step one on the new raw aluminum while simultaneously completing step two on the half-finished wheel. The brilliance of the grips was not just in protecting the part, but in their role as a passive quality control mechanism. The engineer who designed this process built in a mindless QA step. If something went wrong in step one, the part wouldn’t seat correctly in the grips, and the operator would know immediately that the dimensions were off.
These simple steps in toolmaking allow for increased production and increased quality with very little impact on the workflow of operators and machinist assistants. There is a catch, though.
Toolmaking is disruptive and expensive. It can take a machine offline from revenue production and halt or delay production of parts. This translates to a very real impact on the manufacturing process.
So how would they decide when to make a tool?
Toolmaking Decision Making, and Toolmaking in Software Engineering
Toolmaking in the manufacturing world is decided by reviewing the tool’s use case and determining whether it justifies the cost. Often this is done via a decision matrix, where the usefulness and maintenance burden of the new tool are weighed against the time and cost to create it, and against the option of re-engineering the part to better fit existing tooling and equipment.
In software engineering, we are confronted by the same question. When is a Terraform module worth creating over using a provider’s default resources? How useful is managing your own IDP or provisioning tools compared with out-of-the-box or SaaS solutions? When is owning a Kubernetes cluster more valuable than using a managed container service such as Azure Container Apps or Amazon Elastic Container Service (ECS)? Each of these questions can be answered either way, depending on the use case.
Determining this might seem like an easy task. I mean, why not have ultimate customization? But disregarding a decision matrix can, and often has, led to headache and heartbreak.

A recent employer of mine had what I called an “Orphanage of Applications,” a term I stole from an executive at a different company who dealt with the same issue. It was a large collection of applications that no engineering team owned, often parked for long-term maintenance in whatever DevOps-adjacent organization existed within the company. Many of these applications were one-off tools bought or built for one-off use cases that became so ingrained in the primary application of the company that they couldn’t be removed.
10,000-line PowerShell super scripts. Chef cookbooks and recipes on systems so deprecated that they had to be held within exclusive security boundaries with special permissions. Windows Server 2003 instances operating in the 2020s, running critical financial processes and software. Deployment scripts so complicated that terms like “wizards” and “magicians” were used to describe those who operated them. All manner of custom scripts, jobs, and pipelines running on a large variety of applications, stored everywhere from shared drives and ancient TFS servers to GitHub and GitLab repos.
All of it degrading daily under the entropy our industry calls bitrot. Each tool, abandoned by its creators, is critical and yet so poorly maintained that it languishes. In Site Reliability Engineering, we call these “Tools of Extreme Toil.”
So how do you prevent this from happening, or how do you fix the situation if you are in it? By applying the machinist’s lesson and using a decision matrix.
All machinists and mechanical engineers weigh the following when deciding to make a tool. Is it worth the effort of automation? Does the tool have more than one or two uses? What does maintenance look like with the tool? Can it be kept sharp and accurate?
Personally, I use a matrix to help me decide if I should advocate for or against a tool’s use or creation:
- Is the tool reusable and portable? Can I use it for more than “x”?
- Is the tool maintainable? Can I keep the tool “sharp” and “accurate”?
- Does the tool’s technology have a feasible future? What does long-term support look like for the tech?
- Does the tool return its time investment? Does it take three weeks to save three hours a year?
- How is the tool operated? Is it in the cloud, on-prem, or in a container? What are the VM or OS dependencies?
- What is the risk of the tool? Security implications? Consequences of a bad run? Does the tool have rollback or redundancy requirements?
- Does anyone want this tool?
These are just a few of the things I consider. The tools I advocate for, whether purchased or built, must meet a high bar. Toolmaking requires critical self-reflection and design.
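As a rough illustration, the questions above could be turned into a small weighted score, with the time-investment question worked as plain arithmetic. This is only a sketch: the criterion names, the weights, and the 0 to 5 rating scale are hypothetical choices made for the example, not a prescribed method, and the payback math assumes a 40-hour work week.

```python
# Hypothetical criteria and weights, loosely mirroring the questions in the list.
# Higher weight means the question matters more to the final score.
WEIGHTS = {
    "reusable": 2,
    "maintainable": 3,
    "tech_has_future": 2,
    "returns_time": 3,
    "operable": 1,
    "low_risk": 2,
    "wanted": 3,
}

MAX_RATING = 5  # assumed scale: each criterion is rated from 0 (no) to 5 (yes)


def score_tool(ratings: dict[str, int]) -> float:
    """Return a weighted score between 0.0 and 1.0 for a proposed tool."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= ratings[name] <= MAX_RATING:
            raise ValueError(f"rating for {name!r} must be 0..{MAX_RATING}")
    total = sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)
    return total / (MAX_RATING * sum(WEIGHTS.values()))


def payback_years(build_hours: float, hours_saved_per_year: float) -> float:
    """Return how many years it takes for the tool to repay its build time."""
    if hours_saved_per_year <= 0:
        return float("inf")  # a tool that saves nothing never pays back
    return build_hours / hours_saved_per_year


if __name__ == "__main__":
    # Three weeks of work (assuming 40-hour weeks) to save three hours a year.
    build_hours = 3 * 40
    print(payback_years(build_hours, 3))  # 120 / 3 = 40.0 years
```

A tool that takes 40 years to pay for itself fails the time-investment question on its own, no matter how well it scores elsewhere. The score is a conversation aid, not a verdict; the value is in being forced to answer every question.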
Toolmaking, Standardized
Software toolmaking, like all software engineering, should follow set standards. Not just an Architecture Decision Record (ADR), but a standard.
In a machine shop, the shop has standard bits and heads, vices of set sizes, and common materials and designs. There is a tooling cabinet. Everything needed is there, and things can be ordered if additional items are needed, but size is a constraint, as are the machines for manufacturing goods and tools. In software engineering, we call this our “stack.”
Toolmaking follows this same set of limitations. If you are a C# shop, your tools, if possible, should be written in C#. If an exception is made for, say, Go or Python, then all subsequent tools should match that choice if possible. If it is not possible, you might not need the tool and should reevaluate your decision.
Standards should also include execution environments. If you deploy with GitHub Actions, then tool execution should be done via GitHub Actions. I have personally created deployment pipelines that run terraform init && terraform validate, followed by a terraform plan step, with a release gate in front of the terraform apply step. This keeps runs repeatable and keeps troubleshooting in a shared location available to all other engineers. Standards prevent the “it works on my machine” excuse.
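The gated sequence described above can be sketched in a few lines of Python. In practice this logic would live in the workflow definition of whatever CI system the team has standardized on; the function names, the tfplan file name, and the approval callback here are hypothetical, chosen only to show the ordering and the gate.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes one command (as a list of arguments) and executes it.
Runner = Callable[[Sequence[str]], None]


def run_command(cmd: Sequence[str]) -> None:
    """Run one command; raises CalledProcessError if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def run_pipeline(approved: Callable[[], bool], run: Runner = run_command) -> bool:
    """Run init, validate, and plan, then apply only if the gate approves.

    Returns True if apply ran, False if the release gate held it back.
    """
    run(["terraform", "init", "-input=false"])
    run(["terraform", "validate"])
    # Save the plan so that apply executes exactly what was reviewed.
    run(["terraform", "plan", "-input=false", "-out=tfplan"])
    if not approved():  # hypothetical stand-in for a CI release gate
        return False
    run(["terraform", "apply", "-input=false", "tfplan"])
    return True
```

Passing the runner in as a parameter is a deliberate choice for the sketch: the ordering and the gate can be exercised with a fake runner, without Terraform installed, which is the same repeatability the shared pipeline is meant to provide.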
By creating standards, it becomes easier to build tools and easier for those tools to be run. It also prevents bitrot from tools being made for one-offs or in unique environments and then not being maintained, sharpened, and lubed.
Unused Tools Should Not Have Been Made
In closing, I want to make one other point. If a tool is going unused, it frankly should not have been made. Toolmaking is about repeatability and usefulness. If you create a tool without asking in your matrix, “Is this needed?”, you have failed at toolmaking.
Tools, like any software, need a customer. Sometimes that customer is us and our team. Sometimes that customer is external to the team, like a dev team or even end users. But toolmaking should always be an answer to a need.
The best tools are used, sharp, and useful.