## Saturday, November 21, 2015

### Watch metrics, drive performance

Great athletic coaches watch metrics carefully, but they really focus on improving the things that produce them.  They drive athletes to improve mechanics, form, effort and endurance - the things that move the metrics.  In business, we sometimes focus too much on the metrics themselves and not enough on improving core performance.

A great example is managing problem backlogs.  When problem backlogs grow, businesses suffer.  The natural response is to start tracking and setting goals against problem aging metrics.  This is a good thing to do; but if too much focus is put on the metrics themselves, you end up just moving problems around or resolving things with incomplete "solutions."  It's better to focus on the underlying mechanics: Where are the problems coming from? Why are some of them taking so long to resolve?  How might you cut out large classes of them with some targeted product improvements?  Then, like the athlete who makes some targeted changes to training regimen and form, you get to watch the metrics improve.

## Wednesday, November 18, 2015

### I can see it from where I live

After seeing it recommended by Dan Pink, I started reading Studs Terkel's classic, Working.  The following quote brought back old memories for me:
"There’s not a house in this country that I haven’t built that I don’t look at every time I go by. (Laughs.) I can set here now and actually in my mind see so many that you wouldn’t believe. If there’s one stone in there crooked, I know where it’s at and I’ll never forget it. Maybe thirty years, I’ll know a place where I should have took that stone out and redone it but I didn’t. I still notice it. The people who live there might not notice it, but I notice it. I never pass that house that I don’t think of it."
I was lucky to have as my first boss a man who really valued workmanship.  His landscape construction and maintenance business was hard and I could see every day the pressure to cut corners.  But he never did and he got very, very mad when he saw any of us doing shoddy work.

I remember a few years later, I was working on an interstate highway construction project.  My job was to break off concrete pipes and parge cement around the gaps between them and the junction boxes where they came together.  I always tried to do a nice job, leaving the box looking like it had been cast as a single piece of concrete.  I remember once having a hard time with one of the pipes and struggling with my coworkers to get the finish smooth.  One of them said, "I can't see it from where I live."  I immediately thought of my first boss, yelling at me once for justifying a sloppy joint by saying that no one would notice it because it was going to be backfilled.  He said, "but I just noticed it and you saw it yourself.  When you go home, you will see it again.  And if you don't see it again, you haven't learned anything from me."

## Wednesday, November 11, 2015

### A very crowded corner

OK, time for a little math walk.  Imagine that Bolzano's grocery is running a special on Weierstrass' Premium Vegan Schnitzel.  People start converging on the corner in front of Bolzano's from all around.  Based on counts using awesome new really big data technology, the local news media makes the amazing announcement that there are infinitely many people in the city block around Bolzano's.  The subject of this walk is showing that there must be at least one location in that block where you can't move even the slightest distance without bumping into someone.

To simplify things, let's smash everything down into one dimension and pretend that the city block above is the closed interval $[0, 1]$ on the real number line.  Let's represent the infinite set of people as points in this interval.  Now consider the subintervals $(0, .1), (.1, .2), ... (.9, 1).$  At least one of these intervals must contain infinitely many people (the 11 endpoints can account for at most 11 of them).  Suppose, for example, that the interval $(.5, .6)$ contains infinitely many people.  Then split that interval into 10 segments, as shown in the picture below.   At least one of these has to contain infinitely many people; suppose, again for example, that it is $(.53, .54)$.  Split once more, and suppose that the subinterval we land on this time is $(.537, .538)$.

Now consider the number .537. We know that there are infinitely many people within .001 of .537.  There is nothing stopping us from continuing this process indefinitely, finding smaller and smaller subintervals with left endpoints $.5, .53, .537...$ each containing infinitely many people.  Let $r$ be the number whose infinite decimal expansion is what we end up with when we continue this process ad infinitum.  To make $r$ well-defined, let's say that in each case we choose the left-most subinterval that contains infinitely many people.  Depending on how the people are distributed, $r$ might be boring and rational or something exotic like the decimal expansion of $\pi$.  The point is that it is a well-defined real number and it has the property that no matter how small an interval you draw around it, that interval includes infinitely many people.   This is true because for each $n$, the interval starting at $r$ truncated to $n$ decimal digits and ending $1 / 10^n$ higher than that contains both $r$ and infinitely many other people by construction.  In the example above, for $n = 3$, this interval starts at $.537$ and ends at $.538$.
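The digit-by-digit construction can be sketched in a few lines of Python.  A program cannot count an infinite crowd, so this sketch assumes a hypothetical oracle, `has_infinitely_many(lo, hi)`, that answers the question for any open subinterval.  The example oracle is for an invented crowd standing at the points $c + 1/n$ for $n = 3, 4, 5, ...$, which piles up just to the right of $c = .537$.  Exact fractions are used so that rounding does not blur the interval endpoints.

```python
from fractions import Fraction


def accumulation_digits(has_infinitely_many, n_digits):
    """Return the first n_digits decimal digits of r and the truncated value.

    At each step the current interval is cut into 10 pieces and the left-most
    piece that the oracle reports as infinitely crowded is kept.
    """
    left = Fraction(0)
    width = Fraction(1)
    digits = []
    for _ in range(n_digits):
        width /= 10
        for d in range(10):
            lo = left + d * width
            if has_infinitely_many(lo, lo + width):
                digits.append(d)
                left = lo
                break
        else:
            raise ValueError("oracle found no infinitely crowded subinterval")
    return digits, left


# Hypothetical crowd: people at c + 1/n for n = 3, 4, 5, ...
# The open interval (lo, hi) holds infinitely many of them exactly when
# lo <= c < hi, because the points approach c from the right.
c = Fraction(537, 1000)


def crowd_oracle(lo, hi):
    return lo <= c < hi


digits, truncated = accumulation_digits(crowd_oracle, 5)
print(digits, truncated)
```

For this made-up crowd the construction recovers the digits of $.537$ followed by zeros, which is the boring rational case.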

Now let's remove the simplification, one step at a time.  First, let's see how the same construction works if in place of $[0, 1]$ we use any bounded interval $[a, b]$.  Consider the function $f(x) = (x - a) / (b - a)$.  That function maps $[a, b]$ onto $[0, 1]$.   Its graph is a straight line with slope $1/(b - a)$.  If $b - a$ is larger than 1, points get closer together when you do this mapping; otherwise they get further apart.  But the expansion or contraction is by a constant factor, so the picture above looks exactly the same, just with different values for the interval endpoints.  So if we do the construction inside $[0, 1]$ using the mapped points, then the pre-image of the point $r$ we end up with will be an accumulation point for the set in $[a, b]$.

OK, now let's pick up our heads and get out of Flatland.  Imagine that the square block around Bolzano's is the set of points in the x-y plane with both x and y coordinates between 0 and 1.  Divide up the square containing those points into 100 equal-sized squares, using a 10-by-10 grid.  One of those pieces has to contain infinitely many people. Suppose it is the square with bottom-left coordinates (.5, .2).   Now divide that little square into 100 subsquares in the same way.  Again, one of these has to contain infinitely many people.  Say it is the one with lower-left coordinates (.53, .22).   The picture below shows these points and a next one, say, (.537, .226).  Just like the one-dimensional case, this sequence of points converges to an accumulation point (x,y) that has infinitely many people within even the smallest distance from it.

The ideas presented above are the core of one proof of the Bolzano-Weierstrass Theorem, a beautiful and very useful result in Real Analysis.  The existence of the limiting values is guaranteed by the Least Upper Bound Axiom of the real numbers.

## Monday, November 25, 2013

### Fully solving problems

Bryan Pendleton's great post, "Anatomy of a bug fix" suggests some basic principles that apply to all kinds of problem resolution.  The attributes that he calls out as separating "great developers" from the not-so-great apply in lots of other contexts, distinguishing the people who you really want to have on your team from those who you can't really count on.   Bryan's conclusion:
I often say that one of the differences between a good software engineer and a great one is in how they handle bug fixes. A good engineer will fix a bug, but they won't go that extra mile:
• They won't narrow the reproduction script to the minimal case
• They won't invest the time to clearly and crisply state the code flaw
• They won't widen the bug, looking for other symptoms that the bug might have caused, and other code paths that might arrive at the problematic code
• They won't search the problem database, looking for bug reports with different symptoms, but the same underlying cause.
The scenario above applies whenever there is a problem to be resolved.  I once led a great team responsible for resolving operational problems at a bank.  The "great ones" on that team always performed analogs of all of the things Bryan mentions above.  They always got to a very precise problem statement and recipe to reproduce and (sometimes painfully) a really exhaustive explication of impacts (what Bryan calls "widening the bug") as well as understanding of the relation between the current problem and any others that may have been related.

I have seen this separation of talent in lots of business domains - operations, engineering, finance, marketing - even business strategy and development.  The great ones don't stop until they have a really satisfying understanding of exactly what an anomaly is telling them.  The not-so-great are happy just seeing it go away.

The difference is not really "dedication" or "diligence" per se - i.e.,  it's not just that the great ones "do all the work" while the not-so-great are lazier.  The great ones are driven by the desire to understand and to avoid needless rework later.  They tend to be less "interrupt-driven" and may actually appear to be less responsive or "dedicated" in some cases.  They solve problems all the way because they can't stop thinking about them until they have really mastered them.  I always look for this quality when I am hiring people.

## Sunday, April 14, 2013

We get the word "integrity" from the same root that gives us "integer" or "whole number."  To have integrity is first and foremost to be one thing.  Kant built his entire theory of knowledge on the premise that experience has to make sense - the world has to be one thing in this sense.   We have to be able to say "I think..." before every perception that we have about the world.

The core of all effective leadership really comes down to this.  It all has to make sense.  Everyone has to be able to start with "I think...".  Not "Mr. X says..."  Not "policy is.."  Not "I was told..." but "I think..."

For this to work, leaders have to be firmly grounded in a shared vision and they have to be committed to maintaining integrity in the sense above.  Values, principles, objectives, strategies, communications, performance evaluations, policies, processes, commitments all have to be constantly integrated.  Leaders who force themselves to be able to say "I think..." before a comprehensive view of all of these things can lead from the core.  Just as it is painful to do the exercises to strengthen your physical core, so it can be painful to maintain core leadership strength in this sense.  It is very easy to get "out of shape" by neglecting core values, objectives, strategy and execution alignment.  But without a strong core, none of the most important leadership attributes - authenticity, inspiration, strategic vision, followership, transformational impact - are possible.

Leaders who "skip the abs work" can get some things done and, depending on their good fortune and / or cleverness, some achieve material success.  But no one remembers them.  No great change is ever led by them.  No great leaders are ever developed by them.  Leading durable transformational change and developing great leaders requires core strength.

So how do you develop core strength?  A great mentor and an already established values-based vision and strategy can help get you started, but you always end up having to do the work to build your own core yourself.  Here are some little exercises that can help.  There is nothing particularly deep here and there are lots of variations on these practices.  The point is to regularly and critically focus on core integrity.

Look-back sit-ups. Starting once a week and working up to once a day, look back on all of the decisions, communications and interactions that you had and explain how it is possible that one person did all of these things.  I guarantee that if you are really observant and critical, you will find lots of little inconsistencies - things that in retrospect you can't say, "I think..." in front of.  For each of these, you have two choices: either come up with an alternative course of action that, had you done it, would have made sense; or modify whatever aspects of your vision, strategy or values it is inconsistent with (or more precisely, resolve yourself to conceive and align the necessary changes with your team, your peers and your leadership).  Done honestly, this is painful.  Think of each example as a little integrity sit-up.  Here are a couple of concrete examples.
1. Suppose that last week you negotiated an extension to a service contract. In exchange for a healthy rate reduction, you doubled the term length and added minimums to the contract. This will help achieve your annual opex reduction goal; but your agreed upon strategy is to ensure supplier flexibility and aggressively manage demand in the area covered by the contract. Your decision basically said near-term opex reduction was more important than flexibility or demand management. Either your strategy was wrong or your decision was wrong. To be one person, you need to either acknowledge the mistake or harmonize the decision with the strategy.
2. Last week you agreed with your leader and peers in a semi-annual performance ratings alignment meeting that one of your direct reports was not fully meeting expectations in some key areas.  You agreed to deliver the "needs improvement" message in these areas in his performance appraisal and to adjust his overall rating downward.  You did change the rating and some of the verbiage in the assessment; but when you delivered the review and he challenged the overall rating, you were swayed by his arguments and in the end you admitted that you had been told to adjust the rating downward.   Here either you failed to consider everything when agreeing to the rating adjustment or you were overly influenced by the feedback.
In some cases, the look-back exercise can and should lead you to take some remediating actions; but that is not the point of the exercise.  The point is to do a little "root cause analysis" of what caused the integrity breakdown.  In the first example, it may have been extreme near-term financial pressure causing things to get out of focus, or possibly just lack of clarity in the relative importance of the different factors in the strategy.  In the second example, the feedback may have pushed some "hot buttons" causing you to temporarily lose some core strength.  The key is to face these integrity gaps directly and honestly by yourself.  First think clearly and honestly about what went wrong and why.  Then think about how to "fix things."

Virtual 360 crunchies.  Again starting once a week and working up to daily, imagine you are a specific person on your team, in your company or a partner (alternate among randomly chosen people from these groups) and respond to the question, "What is most important to X?"  where X is you.  Don't just repeat goals or big initiative names or repeat your own communications.  Actually try to imagine what it would be like being the selected person and what they really think is important to you and how that relates to what they do on a day to day basis.  Think about how they would say it in their own words, not yours.  If you can't do it, or what naturally comes out is far from what you see as your core,  you have two options.  Either you have a communication problem - i.e. there is no way this person can have a clear understanding of what is important to you because you have failed to communicate it - or you don't make sense from their vantage point.  In the first case, you need to work on communication and in the second, you need to patch whatever holes exist in your vision, strategy or values that make you incomprehensible to this person.  Here are some examples.
2. You are a well-respected leader of a software engineering team.  There are 10 products in your portfolio.  You have developed a robust set of goals for the organization, cascaded effectively through the team.  There are 5 top-level goals, each of which has been broken down by each of the teams in your group.  You know every product and team inside out and you regularly do "deep dives" at the product level.  Your communications are detailed, often focusing on knowledge-sharing across the group and encouraging collaboration among teams.  Imagine that the virtual 360 candidate is an engineer working on one of the products.  He might say something like, "Let me go look at the goals statement.  I know the features we are working on are important because she mentioned them in the deep dive last week."
The examples above show opposite extremes.  In the first case, the leader has a big gap in communicated - possibly even conceived - vision and values.  In the second case, the team has lost sight of the forest through the trees.  Both require not just communication, but critical assessment of core leadership.  In the first case, there is a hole.  In the second case, the core appears to be disintegrating.  The basic problem is the same.  It all has to make sense to all stakeholders all the time.

I have never met a leader who did not have small or large "core integrity" problems to deal with from time to time.  The great ones recognize them quickly and get whatever help they need to build and maintain a strong core.

## Saturday, November 19, 2011

### Thinking Together

"Nestor, gladly will I visit the host of the Trojans over against us, but if another will go with me I shall do so in greater confidence and comfort. When two men are together, one of them may see some opportunity which the other has not caught sight of; if a man is alone he is less full of resource, and his wit is weaker."  Homer, Iliad 10  (Diomedes about to go out spying on the Trojans)
I have always loved the simplicity in the image above, which Plato paraphrased simply as "when two go together, one sees before the other."  Real collaboration works this way - like Diomedes and Odysseus, we think together, and do more, better, faster than we could alone.  For this to work, we have to be thinking about the same thing.  Rather than just periodically sharing fully working ideas, we need to think out loud, sharing the not-yet-sense that eventually workable ideas come from.

I remember when I was a mathematics grad student I thought I would never be capable of producing the brilliantly incomprehensible work that I saw published in professional journals.  It took me forever to read papers and there appeared to be so many little miracles embedded in them that there had to be some magical wizardry involved.  All of that changed when I started thinking together with some real mathematicians.  Seeing the thought process, the not-quite-sense at the base of the ideas, the stops and starts, the workarounds, let me see that while I would never be Diomedes, I could at least have a little fun as Odysseus.

The key here is to actually externalize and share the thought process underneath ideas in the making, rather than just handing them back and forth as "products."  I am belaboring this point because it bears on a dynamic in the open source world that I find alternately exciting and troubling - the rise of distributed version control systems (DVCS).  Thanks to Git and GitHub, the process of contributing to open source has become much easier and the number of projects and contributors is exploding.  That is the exciting part.  The thing that I sometimes worry about is that the "forking is a feature" meme that can result from DVCS can take us away from thinking together and more toward "black box" exchange of completed ideas.  The traditional, central VCS, central mailing list model works more or less like this:
1. Someone has an "itch" (idea for enhancement, bug fix, idea for a change of some kind)
2. Post to the project development mailing list, talking about the idea
3. Patch, maybe attached to an issue in the project issue tracker, maybe directly committed
4. Commit diff from the central repo triggers an email to the list - everyone sees each small change
5. Idea evolves via discussion and commits, all visible to all following the project
Using DVCS, the process can work differently:
1. Someone has an itch
2. Clone the repo, creating a local branch
3. Experiment with new ideas in the local branch, possibly publishing it so others can see and play with it
4. Submit a request for the changes in the local branch to be incorporated in the "mainline"
5. Changes are eventually integrated in a batch
Now there is no reason that the second flow can't involve the full community and be done in a thinking together way in small, reversible steps; but it is also easy for it to just happen in relative isolation and to end with a big batch of "black box" changes integrated in step 5.  If this becomes the norm, then the advantages of OSS communities thinking about the same thing as they evolve the code are significantly diminished.  It's also a lot less fun and interesting to review large, unmotivated pulls than to participate in the development of ideas and work collaboratively on the code expressing them.  But most importantly, if the primary means of communication becomes patch sets developed by individuals or small subcommunities, and the only thinking together that the community does is after-the-fact review, we lose one of the biggest benefits of open source - the collaborative development of ideas by a diverse community influencing their origin and direction.

The DVCS train has left the station and, as I said above, tools by themselves don't determine community dynamics.  It's up to those of us who participate in OSS communities to keep engaging in the real collaboration that created the great software and communities-around-code that have made OSS what it is.  So let's take a lesson from Diomedes.  Great warrior that he was, he still thought it best to bring "the wily one" with him when he went out into new territory.

## Friday, October 28, 2011

### Correlated failure in distributed systems

Every time I start to get mad at Google for being closed and proprietary, they release something really interesting.  The paper, "Availability in Globally Distributed Storage Systems," is loaded with interesting data on component failures and it presents a nice framework for analyzing failure data.  The big takeaway is that naive reasoning about "HA" deployments can lead to inflated expectations about overall system availability.

Suppose you have a clustered pair of servers and each is 99% available.  What availability can you expect from the clustered pair?  "Obviously," $99.99\%.$

Unfortunately, that expectation assumes that the servers fail independently - i.e., component failures are not correlated.  The Google research shows that this can be a bad assumption in practice.  Referring to storage systems designed with good contemporary architecture and replication schemes, the authors say, "failing to account for correlation of node failures typically results in overestimating availability by at least two orders of magnitude."  They go on to report that due to high failure correlation, things like increasing the number of replicas or decreasing individual component failure probabilities have much weaker effects when you take correlation into account.
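To make the arithmetic concrete, here is a minimal sketch of the two-server case.  It assumes a toy model that is not taken from the Google paper: both servers have the same availability, the pair is up when at least one server is up, and the dependence between the two "down" states is summarized by a single Pearson correlation coefficient.  For two such indicator variables, the probability that both are down is $p^2 + \rho p(1 - p)$, where $p$ is the probability that one server is down.

```python
def pair_availability(node_availability, correlation=0.0):
    """Availability of a two-node cluster that is up when either node is up.

    Illustrative model only: `correlation` is the Pearson correlation between
    the two nodes' down-state indicators (0.0 means independent failures).
    """
    p_down = 1.0 - node_availability
    # P(both down) for two identically distributed, correlated indicators.
    p_both_down = p_down ** 2 + correlation * p_down * (1.0 - p_down)
    return 1.0 - p_both_down


for rho in (0.0, 0.01, 0.1, 1.0):
    print(f"correlation={rho:<5} availability={pair_availability(0.99, rho):.6f}")
```

With no correlation this gives the "obvious" 99.99%.  In this toy model a correlation of only 0.1 raises the expected downtime of the pair to roughly eleven times the independent estimate, and perfect correlation leaves the pair no better than a single server.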

Steve Loughran does a nice job summarizing the practical implications of the specific area covered by this research.  What is interesting to me is the model proposed for quantifying correlation in component failure and estimating its impact on overall system availability.   It's easy to come up with scenarios that can lead to correlated component failures, e.g. servers on a rack served by a single power source that fails; switch failures; bad OS patches applied across a cluster.  What is not as obvious is how to tell from component-level availability data what counts as a correlated failure and how to adjust expectations of overall system availability based on the extent of correlation.

The first question you have to answer to get to a precise definition of correlation in component failure is what does it mean for two components to fail "at the same time" or equivalently what does it mean for two observed component failures to be part of a single failure event?  The Google authors define a "failure burst" to be a maximal sequence of node failures, all of which start within a given window-size, $w$, of one another.   They use $w=120$ seconds for their analysis, as this matches their internal polling interval and it also corresponds to an inflection point on the curve formed when they plot window size against percentage of failures that get clubbed into bursts.
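A small sketch of burst grouping follows.  It assumes one reading of the definition above: a burst is a maximal run of failures in which each failure starts within $w$ seconds of the previous one.  The function name, the input format (a plain list of start times in seconds) and the choice to treat a gap of exactly $w$ as "within" are all assumptions made for illustration.

```python
def failure_bursts(start_times, window=120.0):
    """Group failure start times (in seconds) into bursts.

    Assumption: consecutive failures belong to the same burst when the gap
    between their start times is at most `window` seconds.
    """
    bursts = []
    for t in sorted(start_times):
        if bursts and t - bursts[-1][-1] <= window:
            bursts[-1].append(t)   # close enough to the previous failure
        else:
            bursts.append([t])     # gap too large: start a new burst
    return bursts


# Made-up failure start times, in seconds.
times = [0, 50, 160, 400, 1000, 1100, 1250]
print(failure_bursts(times))
```

Rerunning this with different values of `window` and counting how many failures land in multi-node bursts is one way to look for the kind of inflection point the authors describe.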

We can define correlation among node failures by looking at how the nodes affected by bursts are distributed.  The practically relevant thing to look at is how often nodes from system architecture domains fail together - for example, to what extent do node failures occur together in the same rack.  If failures are highly rack-concentrated, for example, having system redundancy only within-rack is a bad idea.

Given a failure burst consisting of a set $N = \{f_0, ..., f_n\}$ of failing nodes and a partition $D = \{d_0, ... , d_m\}$ of $N$ into domains, we will define the $D$-affinity of $N$ to be the probability that a random assignment of failing nodes across domains will look less concentrated than what we are observing.  High $D$-affinity means correlation, low means dispersion or anti-correlation.  If domains are racks, high rack-affinity means failures are concentrated within-rack.

To make the above definition precise, we need a measure of domain concentration.  The Google paper proposes a definition equivalent to the following.  For each $i = 0, ..., m$ let $k_i$ be the number of nodes in $N$ included in $d_i$.  So for example if the $d_i$ are racks, then $k_0$ is the number of nodes in rack $0$ that fail, $k_1$ counts the failures in rack $1$, etc.   Then set $x = \sum_{i=0}^{m}{k_i \choose 2}$.  This makes $x$ the number of "failure pairs" that can be defined by choosing pairs of failing nodes from the same domain.  Clearly this is maximized when all of the failures are in the same domain (every pairing is possible) and minimized when all failing nodes are isolated in different domains.  Increasing domain concentration of failures increases $x$ and disaggregating failing nodes decreases it.
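The concentration measure is easy to compute directly from per-domain failure counts.  This is a minimal sketch; the function name and the sample counts are made up for illustration.

```python
from math import comb


def concentration(failures_per_domain):
    """Number of same-domain failure pairs: the sum of C(k_i, 2) over domains."""
    return sum(comb(k, 2) for k in failures_per_domain)


# Four failing nodes spread over racks in different ways.
print(concentration([4, 0, 0]))     # all in one rack: every pairing counts
print(concentration([2, 2, 0]))     # two racks with two failures each
print(concentration([2, 1, 1]))     # one pair sharing a rack
print(concentration([1, 1, 1, 1]))  # every failure isolated
```

The four cases give 6, 2, 1 and 0 pairs respectively, matching the claim that the measure is largest when everything fails in one domain and zero when every failing node is alone in its domain.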

Now let $X$ be a random variable whose values are the values of $x$ above.  For each possible value $x$ define $Pr(X = x)$ to be the likelihood that $X$ will take this value when failing nodes are randomly distributed across domains.  Then for each value $x$, define $r_x = Pr(X < x) + \frac{1}{2}Pr(X = x)$.  Then $r_x$ measures the likelihood that a random assignment of failing nodes to domains will result in concentration smaller than $x$, with ties counted at half weight.  The $\frac{1}{2}$ is to prevent the measure from being biased, as we will see below.  A value of $r$ close to $1$ means that failures are highly correlated with respect to domain, while values close to $0$ indicate dispersion.  With domains equal to racks and $r$ called rack-affinity, the Google paper reports:
We find that, in general, larger failure bursts have higher rack affinity. All our failure bursts of more than 20 nodes have rack affinity greater than 0.7, and those of more than 40 nodes have affinity at least 0.9. It is worth noting that some bursts with high rack affinity do not affect an entire rack and are not caused by common network or power issues. This could be the case for a bad batch of components or new storage node binary or kernel, whose installation is only slightly correlated with these domains.
The authors point out that it can be shown that the expected value of $r$ is $.5$.  To see this, let $x_0 < x_1 < ... < x_t$ be the values of $X$ as defined above, in increasing order, and for each $i = 0, ..., t$ let $p_i = Pr(X = x_i)$.  Then the expected value of $r$ is $$E(r) = \sum_{i=0}^{t}\left\{p_i \left(\sum_{j=0}^{i-1}p_j + \frac{1}{2}p_i\right)\right\}.$$ Since $\sum p_i = 1$, we must have $(\sum p_i)^2 = 1$.  Expanding this square gives $\sum_i p_i^2 + 2\sum_{j < i} p_i p_j$, which is exactly twice the sum for $E(r)$, so $E(r) = \frac{1}{2}(\sum p_i)^2 = \frac{1}{2}$.  Note that this applies to any discrete probability distribution - i.e., $r$ as above could be defined for any discrete distribution and its expectation will always be $.5$.  Note also that $r$ never quite reaches $0$ or $1$: its minimum value is $\frac{1}{2}p_0$ and its maximum value is $1 - \frac{1}{2}p_t.$  For $X$ as defined above, $p_t$ is the probability that all failures are in the same domain.  If every partition of the failing nodes is taken to be equally likely, this is $1/B_{|N|}$, where $|N|$ is the number of failing nodes and $B_{|N|},$ the $|N|$th Bell number, is the number of ways that those nodes can be partitioned.
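The claim about the expectation is easy to check numerically for any particular distribution.  The sketch below uses exact fractions and an arbitrary made-up distribution; the function name is hypothetical.

```python
from fractions import Fraction


def expected_r(probabilities):
    """E(r) for a discrete distribution listed in increasing order of value.

    For the i-th value, r is Pr(X < x_i) + Pr(X = x_i) / 2.
    """
    below = Fraction(0)   # running total of Pr(X < x_i)
    total = Fraction(0)
    for p in probabilities:
        r = below + p / 2
        total += p * r
        below += p
    return total


# An arbitrary distribution over three values.
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
print(expected_r(probs))
```

For this distribution the three values of $r$ are $1/4$, $2/3$ and $11/12$, and their probability-weighted average is exactly $1/2$.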

Computing the value of $r$ given counts $k_0, k_1, ..., k_m$ of failing nodes by domain is non-trivial.  According to the Google authors,
It is possible to approximate the metric using simulation of random bursts. We choose to compute the metric exactly using dynamic programming because the extra precision it provides allows us to distinguish metric values very close to 1.
I have not been able to figure out a straightforward way to do this computation.  Maybe the Googlers will release some code to do the computation on Google Code.  The only way that I can see to do it is to fully enumerate partitions over the node set, compute $x$ for each partition and build the distribution of $X$ using frequency counts.  Patches welcome :)
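For what it is worth, here is a rough sketch of the brute-force enumeration just described.  It is not the dynamic programming approach the Google authors mention, and it makes a strong simplifying assumption: every partition of the failing nodes is treated as equally likely, ignoring how many domains there actually are and how large each one is.  The number of partitions grows as the Bell numbers, so this is only usable for bursts of perhaps a dozen nodes.  All function names are made up.

```python
from collections import Counter
from fractions import Fraction
from math import comb


def block_size_lists(n):
    """Yield the block sizes of every set partition of n labeled nodes."""
    def extend(sizes, remaining):
        if remaining == 0:
            yield tuple(sizes)
            return
        # Put the next node into each existing block in turn...
        for i in range(len(sizes)):
            sizes[i] += 1
            yield from extend(sizes, remaining - 1)
            sizes[i] -= 1
        # ...or let it start a new block.
        sizes.append(1)
        yield from extend(sizes, remaining - 1)
        sizes.pop()
    yield from extend([], n)


def affinity(failures_per_domain):
    """r = Pr(X < x) + Pr(X = x) / 2 under the uniform-partition assumption."""
    n = sum(failures_per_domain)
    observed = sum(comb(k, 2) for k in failures_per_domain)
    # Frequency count of the concentration measure over all partitions.
    counts = Counter(
        sum(comb(size, 2) for size in sizes) for sizes in block_size_lists(n)
    )
    total = sum(counts.values())
    less = sum(c for x, c in counts.items() if x < observed)
    return Fraction(2 * less + counts[observed], 2 * total)


print(affinity([3]))        # three failures, all in one domain
print(affinity([2, 1]))     # two together, one apart
print(affinity([1, 1, 1]))  # all isolated
```

Three nodes can be partitioned in five ways, so the three cases above come out to $9/10$, $1/2$ and $1/10$.  The first of these equals $1 - \frac{1}{2} \cdot \frac{1}{5}$, the maximum possible value for a burst of three nodes under this assumption.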

The Google paper stops short of developing a framework for using estimates of node failure correlation in end-to-end system availability modelling.  That would be an interesting thing to do.  Here are some simple observations that might be useful in this regard and that also illustrate some of the practical implications.

Correlation cuts both ways - i.e., it is possible to do better than independence if a system's deployment architecture splits over domains with high failure affinity.  Consider, for example, an application that requires at least one database node to be available for it to provide service.  Suppose that database node failures are perfectly rack-correlated (i.e., all database node failures are concentrated on single racks).  Then if the application splits database nodes over racks (i.e. has at least one node in each of two different racks) it can deliver continuous availability (assuming the database is the only thing that can fail).

End-to-end HA requires splitting over all domains with high failure correlation. Suppose that in the example above, database node failures also show high switch affinity.  Then to deliver HA at the application level, you need database nodes that are not only in two different racks but also connected to at least two different switches.

As always, correlation does not imply causation.  The Google paper makes this point in a couple of places.  Suppose that in our simple example all database failures are in fact due to database upgrades and the operational practice is to apply these upgrades one rack at a time.  That will result in high rack affinity among failures, but the failures have nothing to do with the physical characteristics or failure modes of the racks or their supporting infrastructure.

The observations above are basic and consistent with the conventional wisdom applied by operations engineers every day.  In an ideal world, HA systems would be designed to split over every possible failure domain (rack, switch, power supply, OS image, data center...).  This is never practical and rarely cost-effective.  What is interesting is how quantitative measurements of failure correlation can be used to help estimate the benefit of splitting over failure domains.  Just measuring correlation as defined above is a good start.