TLDR: I made a Claude Code plugin to measure coding productivity.
This helped us measure a productivity gain of over 70% in iubenda's dev team.
Rationale and details follow.
For the past year I've been pretty obsessed with using AI to improve productivity, and I've been running initiatives to increase AI adoption within iubenda and team.blue, particularly amongst developers.
The challenge was to measure the results, but I saw problems with the most commonly used methods:
- % of developers using Claude Code: not very informative; it just tells you who is using it. Fine for an initial rollout, but it doesn't really give you a sense of what the productivity gain is. It's the kind of "tick the box" approach that leaves many companies with very superficial AI adoption
- Number of MRs / PRs: not the worst metric, but very unreliable, as different teams and developers have different contribution styles (few large changes vs. many small ones), which means that more or fewer MRs / PRs doesn't necessarily mean a more or less productive team
- Story points: not all teams use story points, and story point scoring is a qualitative, subjective process. It also requires tracking story points across MRs / PRs / commits, which is very complex: very few teams have a truly deterministic connection between their git repo and their task management tool, so gaps in data coverage make this method unreliable even for teams that actually use story points
- Lines of code changed: I really like the objectivity of this metric. If we assume that a given team's code verbosity and mix of change types (tests, translations, dependency updates, comments, refactors, actual new code) stay roughly constant, it's not bad at all. But in tests it still showed huge variability, with large refactors or wide but low-value changes skewing the numbers completely
Several weeks into the rabbit hole, I landed on using lines of code changed, BUT scoring them using Haiku. In essence, the plugin will:
- Download all diffs from all repos you select, across all branches, and deduplicate them to avoid double-counting merge commits
- Score each file diff with Haiku, assigning it a weight: e.g. zero for a trivial file change, low for a translation change, low or zero for a library update, high for an actual genuine code change or refactor, etc. (this can also act as a code verbosity index)
- Calculate a sort of "weighted lines of code" metric that you can plot over time to measure productivity improvements (a rough sketch of the flow follows below)
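To make that flow concrete, here's a minimal sketch in Python, assuming the git CLI and the official anthropic SDK are available. The prompt, the model alias, the 0-1 weight scale, and all function names are illustrative assumptions, not the plugin's actual implementation.

```python
import subprocess
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def dedup_commits(repo: str) -> dict[str, str]:
    """Map stable patch-ids to commit shas, so the same diff reachable from
    several branches (or re-applied by a cherry-pick) is only counted once."""
    shas = subprocess.run(
        ["git", "-C", repo, "rev-list", "--all", "--no-merges"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    unique: dict[str, str] = {}
    for sha in shas:
        patch = subprocess.run(
            ["git", "-C", repo, "show", "--patch", "--format=", sha],
            capture_output=True, text=True, check=True,
        ).stdout
        if not patch.strip():
            continue  # empty commits have nothing to score
        pid = subprocess.run(
            ["git", "-C", repo, "patch-id", "--stable"],
            input=patch, capture_output=True, text=True, check=True,
        ).stdout.split()[0]
        unique.setdefault(pid, sha)
    return unique


def score_file_diff(diff: str) -> float:
    """Ask Haiku for a 0-1 weight: ~0 for generated/lock files, low for
    translations or dependency bumps, high for genuine code or refactors."""
    msg = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed model alias
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                "Rate the engineering value of this file diff from 0 to 1 "
                "(0 = generated/lock file, low = translation or library bump, "
                "high = real code change or refactor). Reply with the number only.\n\n"
                + diff
            ),
        }],
    )
    return float(msg.content[0].text.strip())


def weighted_loc(repo: str) -> float:
    """Sum changed lines per file diff, each multiplied by its Haiku weight."""
    total = 0.0
    for sha in dedup_commits(repo).values():
        files = subprocess.run(
            ["git", "-C", repo, "show", "--format=", "--name-only", sha],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        for path in filter(None, files):
            diff = subprocess.run(
                ["git", "-C", repo, "show", "--format=", sha, "--", path],
                capture_output=True, text=True, check=True,
            ).stdout
            changed = sum(
                1 for line in diff.splitlines()
                if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
            )
            total += changed * score_file_diff(diff)
    return total
```

Deduplicating on `git patch-id --stable` is what keeps the same change from being counted once per branch it is reachable from.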
Scoring is very cheap at around $7 per K commits.
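As a rough sanity check of that figure, here's a back-of-envelope calculation with assumed (not measured) numbers: diffs per commit, tokens per diff, and the Haiku price are all guesses you should replace with your own before relying on the result.

```python
# Back-of-envelope cost estimate; every figure below is an assumption.
commits = 1_000
diffs_per_commit = 5              # assumed average scored file diffs per commit
in_tokens, out_tokens = 1_500, 10  # assumed tokens per scoring request/response
in_price, out_price = 1 / 1e6, 5 / 1e6  # assumed USD per token; check current Haiku rates

cost = commits * diffs_per_commit * (in_tokens * in_price + out_tokens * out_price)
print(f"${cost:.2f} per {commits} commits")  # about $7.75 with these assumptions
```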
The plugin also has a number of other features, like creating reports, anonymizing developers with local hashing, and the option to use BigQuery to share the database across a team.
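As an illustration of the anonymization idea, the sketch below hashes author emails locally with a secret salt (HMAC-SHA256), so only stable pseudonyms ever reach a shared database such as BigQuery. The environment variable name and pseudonym format are made up for the example and are not necessarily what the plugin does.

```python
import hashlib
import hmac
import os

# Assumed: the salt lives only on the local machine and is never uploaded.
SALT = os.environ["ANONYMIZATION_SALT"].encode()


def anonymize_author(email: str) -> str:
    """Return a stable pseudonym for a developer, computed locally."""
    digest = hmac.new(SALT, email.lower().encode(), hashlib.sha256).hexdigest()
    return f"dev_{digest[:12]}"  # "dev_" plus the first 12 hex chars of the HMAC


print(anonymize_author("alice@example.com"))  # same input + salt -> same pseudonym
```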
I'm publishing it so you can grill me on the methodology, cross-check it, find bugs, you name it. All contributions welcome.
Not really, actually. I took a quick glance at their homepage during research, but seeing that they're commercial, not a free project, and require sign-ups and upgrades for most things, I wasn't really interested in exploring further.
By the way, I thought they would mostly do terms for social integrations, such as Facebook like buttons, widgets, etc. That's the area that got the least focus in the project I did, so I didn't realize they have 600 modules (whatever those are and do, exactly), as you say.
It was cheap enough and comprehensive enough that I’ve used it in the past without even thinking twice. Far cheaper than a lawyer, and if it’s <$100/yr that counts effectively as free for any real business.
I want it to be something that I pay for – I’d expect quality, updates as laws change, support if anything goes wrong, ideally some kind of risk sharing, etc.
Can understand that reasoning, and often do the same. Though it's important to keep in mind that quality is not something that can only be achieved by paying any random party some amount of money.
Developing and drafting things in the open can produce the same or even higher levels of quality.
Since the project is not a commercial product, though, I don't care what solution you use :)
Ye this was a WE project that just looked so cool, and some of the libraries needed were already out there. Opensourcing & moving everything to telnet/ssh is probably the next step. For now it was just fun :P
That would be amazing! There's a real dearth of terminal-based amusement on the internet these days - last cool thing I remember was that telnet server that would play the first third of Star Wars in ASCII art. Being able to buy Unix swag entirely over SSH would definitely take the crown.
Good stuff though :) I see similar initial reactions to mine elsewhere in the thread, but let's not make the perfect the enemy of the good, eh?