Products have defects. That doesn't mean software bugs, though that is one source. A Defect is simply something that doesn't work like it should. A defect is a fault.
Some folk turn this into a semantic quagmire by insisting that any Feature that the customer would like but doesn't have is therefore a defect. No, this is not the case. If every Feature the customer is paying for is working as intended, there are no Defects from a development perspective, no matter how much additional or different functionality the customer may wish to have. Perhaps some folk in sales or marketing will slap their furrowed brows and say 'damn, we shoulda thunk of that !'. Maybe. In which case, they should have a word with the Product Owner about adding a new feature or changing an existing one.
Therefore I'm concerned here not with new things, and not with change requests, both of which are perfectly valid types of product backlog item, but with existing things that must be altered to make them work like they should. The question is, how to manage the work required.
And what is the work required ? In fact, that is not a question that many folk bother to ask. It's code, and you change it, yes ? This work actually has several different components :
Diagnosis is an understanding of the erroneous behaviour that is sufficiently detailed and confident to allow a specific remedy to be designed. The problem for planning is that the effort needed to reach a Diagnosis is highly variable.
The purpose of root cause analysis is to understand not what the defect is ( this is the purpose of Diagnosis ), but why the defect occurred and why it was not detected prior to release of the product. Performing this analysis is a distinctly different task to Diagnosis.
Design and Implementation is the remedy itself. Knowing the Diagnosis, a fix can be designed and implemented.
In fact, the dependency relationship between these things is not linear. Design and Implementation depends on Diagnosis, but not strictly on Root Cause Analysis, which can nonetheless inform it.

The point here is that having obtained a Diagnosis, a Root Cause Analysis may not be essential in order to design and implement a change that will fix the immediate problem. However, it could be the case that having understood the root cause, a different and more robust fix could be devised. Or possibly this understanding will not affect the fix, but would cause some alteration to the Definition of Done that would prevent similar defects, or would even affect higher level strategic planning.
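The non-linear relationship described above can be written down as a small dependency graph. The following is only a minimal Python sketch, not a prescribed tool: the function name `plan_defect_work` and the flag `rca_before_fix` are hypothetical, and it assumes that Root Cause Analysis itself starts from a completed Diagnosis.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_defect_work(rca_before_fix: bool) -> list[str]:
    """Return one valid ordering of the defect-handling phases.

    rca_before_fix is a hypothetical planning flag: True means the team
    wants the root cause understood before designing the fix.
    """
    # Map each phase to the set of phases it depends on.
    depends_on = {
        "Diagnosis": set(),
        # Assumption: root cause analysis builds on a completed Diagnosis.
        "Root Cause Analysis": {"Diagnosis"},
        # The fix strictly needs only the Diagnosis.
        "Design and Implementation": {"Diagnosis"},
    }
    if rca_before_fix:
        # Optional edge: let the root cause inform a more robust fix.
        depends_on["Design and Implementation"].add("Root Cause Analysis")
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(plan_defect_work(rca_before_fix=True))
    print(plan_defect_work(rca_before_fix=False))
```

With the flag set, the only valid order is Diagnosis, then Root Cause Analysis, then Design and Implementation. Without it, Diagnosis still comes first but the other two may be scheduled in either order, which is exactly the planning choice at issue.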
The planning conundrum is clear - should a root cause analysis be performed before Design and Implementation, or not ? Indeed, should a root cause analysis be performed at all ? And if it is to be done, how should it be done within the Scrum framework ?
From a simple engineering perspective, root cause analysis should always be done because there's a good chance it will de-risk future development in some way. That, of course, is not quite the same question that the business must ask, because this analysis does come with some cost attached.
Therefore this is essentially not a call the engineering function can make. It's probably not a call that individual Product Owners should be making either, because there will always be considerable pressure on Product Owners not to incur costs which do not deliver product. It really requires a strategic commitment from the business, possibly under the umbrella of a continuous improvement strategy or a QA framework.
In terms of the development team and planning, one point to note about root cause analysis is that it does not have to be done immediately, because getting a fix is not dependent on it. The downside of deferring the work is that any benefit will also be deferred, but that trade-off may be considered worthwhile in order to get a fix sooner - this is a matter of judgement for the Product Owner.
As regards how to do it in a Scrum context, because it constitutes effort expended on a well defined, narrowly focused task, it should be treated as a backlog item. Then its business value can be assessed, the work can be scheduled as normal, transparency is improved and productivity is not seriously impacted ( because the work is made explicit and has business value ). The concept of a Spike may be relevant here, although it isn't quite the same thing.
I don't support the notion that the Sprint retrospective is the right time to do root cause analysis. Using the retrospective in this way buries the effort in overhead, and a root cause analysis is rarely a trivial thing. It may require a series of tasks including discussion with stakeholders or even coding and testing work. It's a true story, and should be handled as such.
I've already explained that I see root cause analysis as a backlog item. What, then, of Diagnosis ?
Diagnosis and Design and Implementation are distinct phases of defect handling. Each has value in and of itself, but there is a strict dependency between them : Design and Implementation cannot start until the Diagnosis is known. Even so, I believe treating these phases as distinct is justified. The reason is simple - only when the Diagnosis is known can the size of the Design and Implementation task be estimated.
How, then, to deal with the open ended nature of Diagnosis ? This is really the thorniest question from the point of view of a Product Owner trying to manage defects via the Product backlog.
Well, one very general approach to the problem of managing open ended problems is to timebox them. This works if the range of possible solutions is sufficiently broad - a simple solution can often be found quickly within the timebox, even if it is clear that more complex solutions with various advantages may exist. However, in this case the range of possible solutions is unlikely to be broad - quite the opposite. Timeboxing Diagnosis, per the Spike concept, runs a serious risk of getting no answer at all.
In some environments, a dedicated team may triage defects and even perform initial Diagnosis, in which case the problem for the feature team is simply one of managing the Design and Implementation of the fix. From a Product Owner perspective this certainly seems beneficial. However, I share the opinion of those who say that a Scrum team owns their work from start to finish, and if there is a defect, the work was not finished.
It may be that Diagnosis is trivial - perhaps even that it can be subsumed into Product Backlog refinement as a part of the effort that Developers must contribute to understand the item. In this case, its existence as a PBI would be ephemeral.
Nonetheless, I would rather see a Diagnosis item on the product backlog whose estimate had considerable uncertainty, followed later by a Design and Implementation item with considerably less uncertainty, than one larger composite item that had all the uncertainty of the Diagnosis.
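To make the comparison concrete, here is a minimal Python sketch of the two ways of putting the work on the backlog. Everything in it is hypothetical: the `BacklogItem` class, the `combine` helper, and above all the point ranges, which are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacklogItem:
    title: str
    low: int   # optimistic estimate, in story points
    high: int  # pessimistic estimate, in story points

    @property
    def spread(self) -> int:
        """Width of the estimate range, used here as a crude uncertainty measure."""
        return self.high - self.low


def combine(title: str, *items: BacklogItem) -> BacklogItem:
    """Merge several items into one composite item by summing their ranges."""
    return BacklogItem(
        title,
        sum(item.low for item in items),
        sum(item.high for item in items),
    )


# Illustrative numbers only. The fix can only be estimated this tightly
# once the Diagnosis is known.
diagnosis = BacklogItem("Diagnose defect", low=1, high=13)
fix = BacklogItem("Design and implement fix", low=3, high=5)
composite = combine("Fix defect", diagnosis, fix)

if __name__ == "__main__":
    for item in (diagnosis, fix, composite):
        print(f"{item.title}: {item.low}-{item.high} points, spread {item.spread}")
```

In this sketch the composite item carries the whole spread of both parts, whereas once the Diagnosis item is done the only thing left to plan is the narrowly estimated fix.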
One important question about defect fixing relates to story points and velocity. The gist of the matter is simple. Should defect fixing work count towards velocity ?
The developer in me would certainly prefer it if it did, because defect fixing is hard work. But if I'm a Product Owner, business value is what I care about. Fixing a defect might mean something works that didn't work before - but I can't count the business value twice, once when it was first delivered, and again when it is fixed.
On balance I'm more inclined to believe that Scrum is concerned with delivery of business value than it is with lines of code.
It isn't that I believe teams delivering defects deserve punishment. But I do believe that the feedback loop that velocity represents does not operate correctly if it only counts in one direction - there has to be a way to deliver negative feedback to teams if their quality is not as good as it could be. That's the only way to trigger improvement. So on balance, I would argue that defect fixing should not count towards velocity.
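The arithmetic behind that position is simple enough to sketch. The Python below is only an illustration under assumed names: `DoneItem`, `velocity`, the `count_defect_fixes` switch, and the sample Sprint contents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DoneItem:
    title: str
    points: int
    is_defect_fix: bool = False


def velocity(done_items: list[DoneItem], count_defect_fixes: bool = False) -> int:
    """Sum the story points of completed items.

    By default defect fixes are left out, following the argument that
    their business value was already counted when the feature shipped.
    """
    return sum(
        item.points
        for item in done_items
        if count_defect_fixes or not item.is_defect_fix
    )


# An invented Sprint: two new features and one defect fix.
sprint = [
    DoneItem("Export report as PDF", 5),
    DoneItem("Search by customer name", 3),
    DoneItem("Fix rounding error in invoice totals", 8, is_defect_fix=True),
]

if __name__ == "__main__":
    print(velocity(sprint))                           # features only
    print(velocity(sprint, count_defect_fixes=True))  # everything completed
```

Here the team did 16 points of work but only 8 count towards velocity. The gap between the two figures is the negative feedback : effort spent on defects shows up as velocity that was not earned.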
© Mark de Roussier 2021, all rights reserved.