Risk Management

All management is the management of risk

Apr 12, 2021


Back in the day…

…of unenlightened project management, managers used to concentrate on timelines. The project would be described in great detail and broken down into minute tasks; each task would be estimated, and the sum of all these estimates would be the estimate for the project. Savvy managers would add some buffer for contingencies and voila, we have a beautiful Gantt chart for the project. But alas, as the Bard once said:

There are more things in heaven and earth, Horatio,
Than are dreamt of in your philosophy.
- Hamlet (1.5.167-8), Hamlet to Horatio

Which our savvy manager also soon finds out. No sooner has the project started than it is behind schedule. A task which was estimated to be two days took four. And so did another and so did another. Pretty soon the project is so far behind schedule that leadership is questioning whether it’s even worth doing. Why does this happen? In a word - variability.

The book “The Principles of Product Development Flow” by Donald G Reinertsen talks about this in some detail. The problem is that product development tasks suffer from high variability, and this variability only increases as development proceeds due to incalculable combinations of complexity. The book then talks about how the various trade-offs made to ‘meet the deadline’ are all focussing on the wrong thing. The issue is not that variability is bad. The issue is that your system assumes zero variability and collapses on the first contact with the ‘things in heaven and earth’ i.e. reality.
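To make the point about variability concrete, here is a minimal simulation sketch. It is not from Reinertsen's book. It assumes, purely for illustration, that each task's actual duration is its estimate multiplied by a log-normally distributed factor whose median is 1 (so every estimate is "right" half the time), and the spread `sigma`, the forty two-day tasks and the function name `simulate_project` are all made-up values.

```python
import random


def simulate_project(estimates, sigma=0.6, runs=10_000, seed=42):
    """Simulate total project duration many times and return the sorted totals.

    Assumption (illustrative only): actual duration = estimate * X, where X is
    log-normal with median 1. Overruns are unbounded, underruns are not.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(runs):
        total = sum(est * rng.lognormvariate(0.0, sigma) for est in estimates)
        totals.append(total)
    totals.sort()
    return totals


# Hypothetical plan: forty tasks, each estimated at two days.
estimates = [2.0] * 40
planned = sum(estimates)

totals = simulate_project(estimates)
median = totals[len(totals) // 2]
p90 = totals[int(len(totals) * 0.9)]
late_share = sum(t > planned for t in totals) / len(totals)

print(f"planned: {planned:.0f} days")
print(f"median simulated total: {median:.0f} days")
print(f"90th percentile total: {p90:.0f} days")
print(f"share of runs that overshoot the plan: {late_share:.0%}")
```

Even though each individual estimate is a fair median guess under these assumptions, the sum overshoots the plan in the large majority of runs, because a two-day task can take ten days but can never take minus six.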

Reinertsen then goes on to tell us not to worry. There’s a whole branch of mathematics called Queueing Theory which gives us robust answers to the question - how do I run a predictable execution in the face of massive variability? The answer is simple - look at the queues. Where a queue is backed up, find out the cost of delay. If the cost of delay is unsustainable then clear that backlog first. A robust process will also not worry about 100% utilisation. In fact it builds resilience to variability by leaving some slack in the system. (Recommended reading - The Goal by Eliyahu Goldratt, perhaps the only management book that is a gripping read start to finish)
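The slack argument can be illustrated with the simplest textbook queue. This sketch assumes an M/M/1 queue (random arrivals, random service times, a single server), for which the average number of items in the system is utilisation / (1 - utilisation). Real teams are not M/M/1 queues, so treat the numbers as a shape rather than a forecast; the function name is mine.

```python
def mean_items_in_system(utilisation):
    """Average number of items queued or in service in an M/M/1 queue."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1) for a stable queue")
    return utilisation / (1 - utilisation)


for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    items = mean_items_in_system(rho)
    print(f"{rho:.0%} utilisation -> about {items:.0f} items in the system")
```

Under this model, going from 80% to 95% utilisation takes the average backlog from about 4 items to about 19, which is why a process that chases 100% utilisation has no room left to absorb variability.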

Now, like most people who have a copy of The Principles of Product Development Flow on their nightstand, I have only read the first few chapters, but I do feel there is another way of dealing with this problem: through the lens of ‘risk’.

Risk

Software projects abound with risks. Existential risk - that the project will fail completely - time risks, budget risks, quality risks and so on. The goal of a software project should be to pay down existential risk as quickly and cheaply as possible, then deal with risks to time, budget and quality. If there is a date beyond which the project can be deemed a failure, i.e. if you are preparing for an expo or have to deal with regulators and so on, then time risk becomes an existential risk. In all other cases, time risk is not usually as harmful as our obsession with deadlines would lead us to believe. (With the leverage that modern software applications provide to businesses, the successful completion of a project is an order of magnitude more important than time and budget constraints, but this is a controversial topic and I’ll cover it in more detail in another post on Deadlines.)

The traditional waterfall method did not take any view on risk. It viewed all tasks in the development phase as equally risky, assuming that risk has been paid down in the planning and requirements gathering phases. This, as we know, is far from the truth. For the night is ….

https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2c92090-b2d6-449d-a90d-9d7ddd2e86cf_500x250.png

Lady Melisandre explains the problem with Waterfall.

Risk can only meaningfully be paid off through execution, not before. The difference between Waterfall and Agile is this - Waterfall plans to finish, Agile merely plans to start (read more on this here -> https://breakingsmart.substack.com/p/planning-to-start-planning-to-finish). And planning to start is a much more robust way of peeling back the layers of risk. First the existential risks, then the critical risks, then the execution risks.

There are many different moving parts here so I'd like to better explain the concept of managing projects through the lens of risk using a real live example. After that, we'll quickly review everything we've learned.

An Example

At InVideo, we are rewriting the core rendering engine of our browser-based video editor to use modern tech like HTML Canvas so we can leverage the GPU and so on. This is an extremely risky project because the original editor grew up in an HTML world with the browser doing all the heavy lifting around text, color correction and so on. Recreating all of these features on Canvas is a non-trivial task to say the least. The original estimate came out to be 8 months to recreate InVideo's browser editor using Canvas as a base. So let's see how this project is approached using the lens of Risk.

Do we even need to do this?

Diverting significant resources from feature development to the 'big rewrite' is a big question and the first question we need to ask is - do we even need to do this? Let's see what answer we get if we look through the Risk lens.

The risk of not doing this project is that you remain forever limited by what the browser, HTML and CSS can do. This kind of risk is called platform risk. Every platform gets you off to the races quickly but eventually you mature enough to start hitting the limits of the platform. At this point, if your platform is outside your control and opinionated about things in ways that don't work for you, the platform turns out to be an existential risk. Choosing the wrong platform is a sure path to failure of the entire project.

In our case we realised that there are WebGL shaders and effects that can only be done on a Canvas. Continuing on the DOM meant that our product would always be hamstrung in unacceptable ways and we'd be hobbling along as our Canvas based competitors stole march after march on us. This risk was deemed to be not even a risk but a certainty. The Canvas based editor is a necessity.

How long should it take to build this?

Of course the question is badly structured. It takes as long as it takes. There are inherent limits to resources and parallelisation that set a lower bound on the time taken for such a project. Known and unknown unknowns add an element of variability that makes the upper bound hard to calculate. So we can rephrase the question like this - how much risk can we tolerate while this project continues?

The other major existential risk facing InVideo was product velocity. This was another rationale for the rewrite (but of lower intensity than platform risk). The current editor was showing all the hallmarks of a product that's been through a winding journey to Product Market Fit. Badly written code by young developers, dead code that never got removed and multiple sources of truth for everything. Therefore, one way to derisk the rewrite was to achieve product velocity in all other parts of the system. That way, the total amount of risk in the system remains within limits.

If the total system risk is within limits, then elements of the system become free of time risk. Of course, time is our only currency in startup land, so time risk is never zero, but it is no longer the overriding, all-encompassing risk.

The original estimate for the entire rewrite was 8 months. This estimate was, even then, known to be unreliable. However, it was essential to arrive at an estimate because, as Eisenhower said - plans are worthless, but planning is everything.

Managing the risk of time and cost overruns

So, after 8 months we would have a completely rewritten editor. That would be sweet! But again, this is not the outcome that requires planning for. We need to plan for that other outcome - what happens if 8 months later we do not have a new editor? What happens if at that point we are still in the dark about how long it's going to take? Here we meet a new risk - management risk. The risk that management declares the project a failure with catastrophic consequences for the organisation. How to deal with this?

The reason management would do such a thing is that they still see too much risk, so one way to avoid that outcome is to progressively derisk the project as often as possible. This is a technique called risk decomposition and it plays a pivotal part in all Agile methodologies. The diktat to ship something every sprint is a diktat to decompose risk into digestible pieces. All work in progress (WIP) represents risk, so keeping WIP to a minimum is a risk management strategy. More on this in another post.

The more of your project is in production, the more protected it is against cancellation. Let's see how we did this.

Decomposing Risk

So we came up with a plan to decompose the risk and ship something to production as soon as possible. In the case of the editor, there are two components - the canvas, which is where you build the video, and the renderer (the previewer), which is the part that shows you your beautiful video. Our plan was to ship the renderer to production first.

This is a conscious strategy to front load the riskiest parts of the project - the renderer is a very complex piece of technology. Text rendering alone has so many cases - line heights, stems, LTR text, unicode, kerning, the list goes on. So we chunked up the risk into buckets - text, video, masks, overlays, filters, animations and so on and went to work on each one of them separately.

Text is hard…

…very hard…

However, when you decompose risks into small pieces it gives rise to a new risk - integration risk. Integrating a canvas-based previewer alone into the existing editor represented additional work, in fact significant additional work. However, to address the risk of product velocity, we were already planning to split the editor into microfrontends, and once that was done we'd be able to slot the previewer into the front end with a minimum of fuss. Risk composes, just like functions do, and decomposing risk allows us to refactor the risks until they start to play well with each other.

A separate team was working on rearchitecting the front end so the rewrite team could focus on just building out the previewer. Risks had been well allocated across the teams and each team could focus on paying down their respective risks.

The re-lite team (why it's called that is another story) then went to work at the lowest level: simply learning how to render things correctly on the canvas. As they paid off the risk of the individual elements, they encountered the risks of making them work together - e.g. a filter and an animation at the same time. These issues were complicated but not overly risky - we could see that others had managed to achieve these things and were confident we would as well.

So we see, just like architects decompose a system into modules and functions, project management is about decomposing risk to the point where each individual item, like a function, is trivial to solve.

Once we could render text, animations and all components correctly, we had to integrate them together - not risky. Then we had to ensure that the performance was adequate.

Again for performance it's easy to say 'this has to work at 60FPS', but does it? The question to ask is, 'at what FPS does performance cease to be a risk'. Turns out it's 30 FPS. This answer alone significantly reduced the amount of work we had to do on performance.
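The arithmetic behind that relief is worth spelling out. This tiny sketch only converts a frame rate into the time available to render one frame; the function name is mine, and the code says nothing about how the InVideo renderer actually measures performance.

```python
def frame_budget_ms(fps):
    """Milliseconds available to produce a single frame at the given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


for fps in (60, 30):
    print(f"{fps} FPS leaves {frame_budget_ms(fps):.2f} ms per frame")
```

Targeting 30 FPS instead of 60 doubles the per-frame budget from roughly 16.7 ms to roughly 33.3 ms, so every filter, mask and animation gets twice as long to do its work.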

Performance risk was real - InVideo offers many features that other platforms do not, so our benchmarking was incomplete. However, we managed to achieve 30 FPS and we still have potential for further upside.

Another risk was fidelity risk - we tried various approaches to automatically compare the fidelity of the new previewer against the old but in the end there were pixel differences that were invisible to the eye but made any automated approach infeasible. So we did the best our engineer eyes could and then prepped for release.

In parallel, another team had refactored the original editor, so we could slot in the new previewer quite easily. Voila - time to release. But wait - not so fast. InVideo has thousands of templates. How could we be sure that we were rendering all of them correctly? The answer - phased releases. Many eyes make bugs shallow. Two weeks ago, the preview launched to internal template creators. The video design head did a check first and handed over a list of 20-odd bugs. Since then his bug reports have slowed to a trickle and we feel we're in pretty good shape.

The rest of the editor consists of dropping elements on the canvas and giving the user handles to resize, drag and group them. This is a longish exercise given the number of elements but it represents very little risk.

The 'so what?'

Focussing on the riskiest bits first meant that in 4 months we could reduce the risk of the project to zero. Contrast this with the other approach - had we taken a risk-blind approach to things we might have ended up in the other situation 4 months later - with the not-risky parts of the system in good shape but with significant risks still remaining in the system. Management would be getting nervous and no one would be enjoying the slick new GPU driven animations.
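Here is a toy model of that contrast. Every number in it is invented for illustration: the work items, their durations in months and their "risk points" are hypothetical and are not InVideo's actual plan or figures. The only thing the sketch shows is how the order of work changes the risk still open at a given date.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    name: str
    months: float  # hypothetical duration
    risk: int      # hypothetical risk points retired when the item ships


def remaining_risk_after(items, elapsed_months):
    """Risk points still open after `elapsed_months`, doing items in the given order."""
    remaining = sum(item.risk for item in items)
    clock = 0.0
    for item in items:
        clock += item.months
        if clock > elapsed_months:
            break  # this item has not shipped yet, so its risk is still open
        remaining -= item.risk
    return remaining


# Made-up backlog: 8 months of work carrying 100 risk points in total.
ITEMS = [
    WorkItem("drag and resize handles", 2.0, 5),
    WorkItem("text rendering", 1.5, 40),
    WorkItem("grouping elements", 2.0, 5),
    WorkItem("filters and animations", 1.5, 30),
    WorkItem("performance", 1.0, 20),
]

riskiest_first = sorted(ITEMS, key=lambda item: item.risk, reverse=True)
safest_first = sorted(ITEMS, key=lambda item: item.risk)

print("riskiest first, month 4:", remaining_risk_after(riskiest_first, 4))
print("safest first, month 4:", remaining_risk_after(safest_first, 4))
```

Both orderings have consumed the same four months, but in this toy backlog one has retired nearly all the risk while the other has retired almost none. That gap is what management sees when it decides whether to keep funding the project.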

Code in production represents value because now we can take calls like - “hey, instead of building out the rest of the editor let’s instead add fancy new design elements because the renderer can handle them” (this is just an example, don't @ me). In other words, the 'editor' project can now take as long as it takes because its chances of failure are now zero and a large part of the value it represented has already been put into production.

The reason timelines became the de facto way to manage a project is that they don't in fact measure time but risk. Hitting a milestone is a de-risking, and we needn't be oblivious to this. We can focus on risk and cut out the middleman. Time is, in most cases, merely a proxy for risk.

