## Monday, November 25, 2013

### Fully solving problems

Bryan Pendleton's great post, "Anatomy of a bug fix," suggests some basic principles that apply to all kinds of problem resolution.  The attributes that he calls out as separating "great developers" from the not-so-great apply in lots of other contexts, distinguishing the people you really want on your team from those you can't really count on.  Bryan's conclusion:
I often say that one of the differences between a good software engineer and a great one is in how they handle bug fixes. A good engineer will fix a bug, but they won't go that extra mile:
• They won't narrow the reproduction script to the minimal case
• They won't invest the time to clearly and crisply state the code flaw
• They won't widen the bug, looking for other symptoms that the bug might have caused, and other code paths that might arrive at the problematic code
• They won't search the problem database, looking for bug reports with different symptoms, but the same underlying cause.
The scenario above applies whenever there is a problem to be resolved.  I once led a great team responsible for resolving operational problems at a bank.  The "great ones" on that team always performed analogs of all of the things Bryan mentions above.  They always got to a very precise problem statement and recipe to reproduce and (sometimes painfully) a really exhaustive explication of impacts (what Bryan calls "widening the bug") as well as understanding of the relation between the current problem and any others that may have been related.

I have seen this separation of talent in lots of business domains - operations, engineering, finance, marketing - even business strategy and development.  The great ones don't stop until they have a really satisfying understanding of exactly what an anomaly is telling them.  The not-so-great are happy just seeing it go away.

The difference is not really "dedication" or "diligence" per se - i.e., it's not just that the great ones "do all the work" while the not-so-great are lazier.  The great ones are driven by the desire to understand and to avoid needless rework later.  They tend to be less "interrupt-driven" and may actually appear to be less responsive or "dedicated" in some cases.  They solve problems all the way because they can't stop thinking about them until they have really mastered them.  I always look for this quality when I am hiring people.

## Sunday, April 14, 2013

We get the word "integrity" from the same root that gives us "integer" or "whole number."  To have integrity is first and foremost to be one thing.  Kant built his entire theory of knowledge on the premise that experience has to make sense - the world has to be one thing in this sense.   We have to be able to say "I think..." before every perception that we have about the world.

The core of all effective leadership really comes down to this.  It all has to make sense.  Everyone has to be able to start with "I think..."  Not "Mr. X says..."  Not "policy is..."  Not "I was told..." but "I think..."

For this to work, leaders have to be firmly grounded in a shared vision and they have to be committed to maintaining integrity in the sense above.  Values, principles, objectives, strategies, communications, performance evaluations, policies, processes, commitments all have to be constantly integrated.  Leaders who force themselves to be able to say "I think..." before a comprehensive view of all of these things can lead from the core.  Just as it is painful to do the exercises to strengthen your physical core, so it can be painful to maintain core leadership strength in this sense.  It is very easy to get "out of shape" by neglecting core values, objectives, strategy and execution alignment.  But without a strong core, none of the most important leadership attributes - authenticity, inspiration, strategic vision, followership, transformational impact - are possible.

Leaders who "skip the abs work" can get some things done and, depending on their good fortune and / or cleverness, some achieve material success.  But no one remembers them.  No great change is ever led by them.  No great leaders are ever developed by them.  Leading durable transformational change and developing great leaders requires core strength.

So how do you develop core strength?  A great mentor and an already established values-based vision and strategy can help get you started, but you always end up having to do the work to build your own core yourself.  Here are some little exercises that can help.  There is nothing particularly deep here and there are lots of variations on these practices.  The point is to regularly and critically focus on core integrity.

Look-back sit-ups. Starting once a week and working up to once a day, look back on all of the decisions, communications and interactions that you had and explain how it is possible that one person did all of these things.  I guarantee that if you are really observant and critical, you will find lots of little inconsistencies - things that in retrospect you can't say, "I think..." in front of.  For each of these, you have two choices: either come up with an alternative course of action that, had you done it, would have made sense; or modify whatever aspects of your vision, strategy or values it is inconsistent with (or more precisely, resolve yourself to conceive and align the necessary changes with your team, your peers and your leadership).  Done honestly, this is painful.  Think of each example as a little integrity sit-up.  Here are a couple of concrete examples.
1. Suppose that last week you negotiated an extension to a service contract. In exchange for a healthy rate reduction, you doubled the term length and added minimums to the contract. This will help achieve your annual opex reduction goal; but your agreed upon strategy is to ensure supplier flexibility and aggressively manage demand in the area covered by the contract. Your decision basically said near-term opex reduction was more important than flexibility or demand management. Either your strategy was wrong or your decision was wrong. To be one person, you need to either acknowledge the mistake or harmonize the decision with the strategy.
2. Last week you agreed with your leader and peers in a semi-annual performance ratings alignment meeting that one of your direct reports was not fully meeting expectations in some key areas.  You agreed to deliver the "needs improvement" message in these areas in his performance appraisal and to adjust his overall rating downward.  You did change the rating and some of the verbiage in the assessment; but when you delivered the review and he challenged the overall rating, you were swayed by his arguments and in the end you admitted that you had been told to adjust the rating downward.   Here either you failed to consider everything when agreeing to the rating adjustment or you were overly influenced by the feedback.
In some cases, the look-back exercise can and should lead you to take some remediating actions; but that is not the point of the exercise.  The point is to do a little "root cause analysis" of what caused the integrity breakdown.  In the first example, it may have been extreme near-term financial pressure causing things to get out of focus, or possibly just lack of clarity in the relative importance of the different factors in the strategy.  In the second example, the feedback may have pushed some "hot buttons" causing you to temporarily lose some core strength.  The key is to face these integrity gaps directly and honestly by yourself.  First think clearly and honestly about what went wrong and why.  Then think about how to "fix things."

Virtual 360 crunchies.  Again starting once a week and working up to daily, imagine you are a specific person on your team, in your company or at a partner (alternate among randomly chosen people from these groups) and respond to the question, "What is most important to X?" where X is you.  Don't just repeat goals or big initiative names or parrot your own communications.  Actually try to imagine what it would be like to be the selected person, what they really think is important to you and how that relates to what they do on a day-to-day basis.  Think about how they would say it in their own words, not yours.  If you can't do it, or what naturally comes out is far from what you see as your core, you have two options.  Either you have a communication problem - i.e., there is no way this person can have a clear understanding of what is important to you because you have failed to communicate it - or you don't make sense from their vantage point.  In the first case, you need to work on communication and in the second, you need to patch whatever holes exist in your vision, strategy or values that make you incomprehensible to this person.  Here are some examples.
2. You are a well-respected leader of a software engineering team.  There are 10 products in your portfolio.  You have developed a robust set of goals for the organization, cascaded effectively through the team.  There are 5 top-level goals, each of which has been broken down by each of the teams in your group.  You know every product and team inside out and you regularly do "deep dives" at the product level.  Your communications are detailed, often focusing on knowledge-sharing across the group and encouraging collaboration among teams.  Imagine that the virtual 360 candidate is an engineer working on one of the products.  He might say something like, "Let me go look at the goals statement.  I know the features we are working on are important because she mentioned them in the deep dive last week."
The examples above show opposite extremes.  In the first case, the leader has a big gap in communicated - possibly even conceived - vision and values.  In the second case, the team has lost sight of the forest for the trees.  Both require not just communication, but critical assessment of core leadership.  In the first case, there is a hole.  In the second case, the core appears to be disintegrating.  The basic problem is the same.  It all has to make sense to all stakeholders all the time.

I have never met a leader who did not have small or large "core integrity" problems to deal with from time to time.  The great ones recognize them quickly and get whatever help they need to build and maintain a strong core.

## Saturday, November 19, 2011

### Thinking Together

"Nestor, gladly will I visit the host of the Trojans over against us, but if another will go with me I shall do so in greater confidence and comfort. When two men are together, one of them may see some opportunity which the other has not caught sight of; if a man is alone he is less full of resource, and his wit is weaker."  Homer, Iliad 10 (Diomedes, about to go out spying on the Trojans)
I have always loved the simplicity in the image above, which Plato paraphrased simply as "when two go together, one sees before the other."  Real collaboration works this way - like Diomedes and Odysseus, we think together, and do more, better, faster than we could alone.  For this to work, we have to be thinking about the same thing.  Rather than just periodically sharing fully working ideas, we need to think out loud, sharing the not-yet-sense that eventually workable ideas come from.

I remember when I was a mathematics grad student I thought I would never be capable of producing the brilliantly incomprehensible work that I saw published in professional journals.  It took me forever to read papers and there appeared to be so many little miracles embedded in them that there had to be some magical wizardry involved.  All of that changed when I started thinking together with some real mathematicians.  Seeing the thought process, the not-quite-sense at the base of the ideas, the stops and starts, the workarounds, let me see that while I would never be Diomedes, I could at least have a little fun as Odysseus.

The key here is to actually externalize and share the thought process underneath ideas in the making, rather than just handing them back and forth as "products."  I am belaboring this point because it bears on a dynamic in the open source world that I find alternately exciting and troubling - the rise of distributed version control systems (DVCS).  Thanks to Git and GitHub, the process of contributing to open source has become much easier and the number of projects and contributors is exploding.  That is the exciting part.  The thing that I sometimes worry about is that the "forking is a feature" meme that DVCS encourages can take us away from thinking together and toward "black box" exchange of completed ideas.  The traditional central-VCS, central-mailing-list model works more or less like this:
1. Someone has an "itch" (idea for enhancement, bug fix, idea for a change of some kind)
2. Post to the project development mailing list, talking about the idea
3. Patch, maybe attached to an issue in the project issue tracker, maybe directly committed
4. Commit diff from the central repo triggers an email to the list - everyone sees each small change
5. Idea evolves via discussion and commits, all visible to all following the project
Using DVCS, the process can work differently:
1. Someone has an itch
2. Clone the repo, creating a local branch
3. Experiment with new ideas in the local branch, possibly publishing it so others can see and play with it
4. Submit a request for the changes in the local branch to be incorporated in the "mainline"
5. Changes are eventually integrated in a batch
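Steps 2 through 4 of the DVCS flow above might look like the following in Git.  This sketch runs in a throwaway local repository so the commands work as-is; a real flow would start from `git clone` of the project, and all names here are invented:

```shell
# Steps 2-3 of the DVCS flow above, in a throwaway repo (a real flow
# would start from `git clone`; repo path and branch names are invented).
rm -rf /tmp/itch-demo && mkdir /tmp/itch-demo && cd /tmp/itch-demo
git init -q -b main
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "mainline"
git checkout -q -b fix-itch                # step 2: local branch for the itch
echo "experiment" > patch.txt              # step 3: experiment in the branch
git add patch.txt
git -c user.name=demo -c user.email=demo@example.org \
    commit -q -m "Narrow the reproduction case"
git log --oneline fix-itch                 # step 4: request a merge from here
```

Whether the commits on `fix-itch` are small, story-telling steps or one opaque batch is exactly the community-dynamics question discussed below.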
Now there is no reason that the second flow can't involve the full community and be done in a thinking together way in small, reversible steps; but it is also easy for it to just happen in relative isolation and to end with a big batch of "black box" changes integrated in step 5.  If this becomes the norm, then the advantages of OSS communities thinking about the same thing as they evolve the code is significantly diminished.  It's also a lot less fun and interesting to review large, unmotivated pulls than to participate in the development of ideas and work collaboratively on the code expressing them.  But most importantly, if the primary means of communication becomes patch sets developed by individuals or small subcommunities, and the only thinking together that the community does is after the fact review, we lose one of the biggest benefits of open source - the collaborative development of ideas by a diverse community influencing their origin and direction.

The DVCS train has left the station and, as I said above, tools by themselves don't determine community dynamics.  It's up to those of us who participate in OSS communities to keep engaging in the real collaboration that created the great software and communities-around-code that have made OSS what it is.  So let's take a lesson from Diomedes.  Great warrior that he was, he still thought it best to bring "the wily one" with him when he went out into new territory.

## Friday, October 28, 2011

### Correlated failure in distributed systems

Every time I start to get mad at Google for being closed and proprietary, they release something really interesting.  The paper "Availability in Globally Distributed Storage Systems" is loaded with interesting data on component failures and it presents a nice framework for analyzing failure data.  The big takeaway is that naive reasoning about "HA" deployments can lead to inflated expectations about overall system availability.

Suppose you have a clustered pair of servers and each is 99% available.  What availability can you expect from the clustered pair?  "Obviously," $99.99\%.$

Unfortunately, that expectation assumes that the servers fail independently - i.e., component failures are not correlated.  The Google research shows that this can be a bad assumption in practice.  Referring to storage systems designed with good contemporary architecture and replication schemes, the authors say, "failing to account for correlation of node failures typically results in overestimating availability by at least two orders of magnitude."  They go on to report that due to high failure correlation, things like increasing the number of replicas or decreasing individual component failure probabilities have much weaker effects when you take correlation into account.
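To make the effect concrete, here is a toy calculation.  The half-and-half split between shared and independent downtime is invented for illustration; it is not a number from the paper:

```python
# Two servers, each 99% available (1% downtime).
p_down = 0.01

# Naive estimate: failures independent, so both are down together
# only with probability 0.01 * 0.01.
naive_unavail = p_down ** 2                  # 0.0001 -> "99.99%" available

# Now suppose half of each server's downtime comes from a shared cause
# (same rack power feed, same bad patch) -- an invented split.
p_common = 0.005                             # takes both servers down at once
p_indep = 0.005                              # residual independent downtime
correlated_unavail = p_common + (1 - p_common) * p_indep ** 2

print(naive_unavail)                         # 0.0001
print(correlated_unavail)                    # ~0.005: roughly 50x more downtime
```

Half a percent of shared downtime is enough to wipe out most of the benefit of the second server - the paper's "two orders of magnitude" point in miniature.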

Steve Loughran does a nice job summarizing the practical implications of the specific area covered by this research.  What is interesting to me is the model proposed for quantifying correlation in component failures and estimating its impact on overall system availability.  It's easy to come up with scenarios that can lead to correlated component failures, e.g. servers on a rack served by a single power source that fails; switch failures; bad OS patches applied across a cluster.  What is not as obvious is how to tell from component-level availability data what counts as a correlated failure and how to adjust expectations of overall system availability based on the extent of correlation.

The first question you have to answer to get to a precise definition of correlation in component failure is what does it mean for two components to fail "at the same time" or equivalently what does it mean for two observed component failures to be part of a single failure event?  The Google authors define a "failure burst" to be a maximal sequence of node failures, all of which start within a given window-size, $w$, of one another.   They use $w=120$ seconds for their analysis, as this matches their internal polling interval and it also corresponds to an inflection point on the curve formed when they plot window size against percentage of failures that get clubbed into bursts.
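A minimal sketch of clubbing failure start times into bursts, reading "maximal sequence" as chaining - a new burst opens whenever a failure starts more than $w$ after the previous one.  This is one plausible reading of the definition, not code from the paper:

```python
def cluster_bursts(start_times, w=120.0):
    """Group failure start times (seconds) into bursts: a failure joins
    the current burst if it starts within w of the previous failure,
    otherwise it opens a new burst."""
    bursts = []
    for t in sorted(start_times):
        if bursts and t - bursts[-1][-1] <= w:
            bursts[-1].append(t)   # within the window: extend current burst
        else:
            bursts.append([t])     # gap exceeds w: start a new burst
    return bursts

print(cluster_bursts([0, 30, 500, 560, 1000]))  # [[0, 30], [500, 560], [1000]]
```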

We can define correlation among node failures by looking at how the nodes affected by bursts are distributed.  The practically relevant thing to look at is how often nodes from system architecture domains fail together - for example, to what extent do node failures occur together in the same rack.  If failures are highly rack-concentrated, for example, having system redundancy only within-rack is a bad idea.

Given a failure burst consisting of a set $N = \{f_0, \ldots, f_n\}$ of failing nodes and a partition $D = \{d_0, \ldots, d_m\}$ of $N$ into domains, we will define the $D$-affinity of $N$ to be the probability that a random assignment of failing nodes across domains will look less concentrated than what we are observing.  High $D$-affinity means correlation, low means dispersion or anti-correlation.  If domains are racks, high rack-affinity means failures are concentrated within-rack.

To make the above definition precise, we need a measure of domain concentration.  The Google paper proposes a definition equivalent to the following.  For each $i = 0, ..., m$ let $k_i$ be the number of nodes in $N$ included in $d_i$.  So for example if the $d_i$ are racks, then $k_0$ is the number of nodes in rack $0$ that fail, $k_1$ counts the failures in rack $1$, etc.   Then set $x = \sum_{i=0}^{m}{k_i \choose 2}$.  This makes $x$ the number of "failure pairs" that can be defined by choosing pairs of failing nodes from the same domain.  Clearly this is maximized when all of the failures are in the same domain (every pairing is possible) and minimized when all failing nodes are isolated in different domains.  Increasing domain concentration of failures increases $x$ and disaggregating failing nodes decreases it.

Now let $X$ be a random variable whose values are the values of $x$ above.  For each possible value $x$ define $Pr(X = x)$ to be the likelihood that $X$ will take this value when failing nodes are randomly distributed across domains.  Then for each value $x$, define $r_x = Pr(X < x) + \frac{1}{2}Pr(X = x)$.  Then $r_x$ measures the likelihood that a random assignment of failing nodes to domains will result in concentration smaller than the observed $x$, counting ties at half weight.  The $\frac{1}{2}$ is there to keep the measure unbiased, as we will see below.  A value of $r$ close to $1$ means that failures are highly correlated with respect to domain, while values close to $0$ indicate dispersion.  With domains equal to racks and $r$ called rack-affinity, the Google paper reports:
We find that, in general, larger failure bursts have higher rack affinity. All our failure bursts of more than 20 nodes have rack affinity greater than 0.7, and those of more than 40 nodes have affinity at least 0.9. It is worth noting that some bursts with high rack affinity do not affect an entire rack and are not caused by common network or power issues. This could be the case for a bad batch of components or new storage node binary or kernel, whose installation is only slightly correlated with these domains.
The authors point out that it can be shown that the expected value of $r$ is $.5$.  To see this, let $x_0, x_1, \ldots, x_t$ be the values of $X$ as defined above and for each $i = 0, \ldots, t$ let $p_i = Pr(X = x_i)$.  Then the expected value of $r$ is $$E(r) = \sum_{i=0}^{t}\left\{p_i \left(\sum_{j=0}^{i-1}p_j + \frac{1}{2}p_i\right)\right\}.$$Since $\sum p_i = 1$, we must have $(\sum p_i)^2 = 1$.  Expanding this last sum and the sum for $E(r)$, it is easy to see that $E(r) = \frac{1}{2}(\sum p_i)^2$.  Note that this applies to any discrete probability distribution - i.e., $r$ as above could be defined for any discrete distribution and its expectation will always be $.5$.  Note also that while $r$ can take the value $0$, its maximum value is $1 - \frac{1}{2}p_t.$  For $X$ as defined above, $p_t$ is the probability that all failures are in the same domain, which is $1/B_N$ where $N$ is the total number of nodes and $B_N$, the $N$th Bell number, is the number of ways that the $N$ nodes can be partitioned.

Computing the value of $r$ given counts $c_0, c_1, \ldots, c_m$ of failing nodes by domain is non-trivial.  According to the Google authors,
It is possible to approximate the metric using simulation of random bursts. We choose to compute the metric exactly using dynamic programming because the extra precision it provides allows us to distinguish metric values very close to 1.
I have not been able to figure out a straightforward way to do this computation.  Maybe the Googlers will release some code to do the computation on Google Code.  The only way that I can see to do it is to fully enumerate partitions over the node set, compute $x$ for each partition and build the distribution of $X$ using frequency counts.  Patches welcome :)
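Lacking their dynamic program, here is a brute-force sketch of the enumeration approach just described: generate every partition of the failing nodes, tabulate $x = \sum {k_i \choose 2}$ over the partitions, and score an observed $x$ against that distribution.  This treats all partitions as equally likely and is only feasible for small bursts, since the number of partitions is the Bell number $B_n$:

```python
from math import comb

def set_partitions(elements):
    """Recursively generate all partitions of a list (B_n of them)."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for partition in set_partitions(rest):
        # put `first` into each existing block...
        for i in range(len(partition)):
            yield partition[:i] + [partition[i] + [first]] + partition[i + 1:]
        # ...or into a block of its own
        yield partition + [[first]]

def concentration(partition):
    """x = sum over domains of C(k_i, 2): same-domain failure pairs."""
    return sum(comb(len(block), 2) for block in partition)

def affinity(n, x_observed):
    """r = Pr(X < x_obs) + 0.5 * Pr(X = x_obs), with X taken over
    uniformly random partitions of n failing nodes."""
    below = ties = total = 0
    for p in set_partitions(list(range(n))):
        total += 1
        x = concentration(p)
        below += x < x_observed
        ties += x == x_observed
    return (below + 0.5 * ties) / total

# 3 failing nodes all in one rack: x = C(3,2) = 3, the maximum.
print(affinity(3, 3))  # 0.9
```

The result matches the maximum value formula above: with $B_3 = 5$ partitions, $1 - \frac{1}{2}p_t = 1 - \frac{1}{2}\cdot\frac{1}{5} = 0.9$.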

The Google paper stops short of developing a framework for using estimates of node failure correlation in end-to-end system availability modelling.  That would be an interesting thing to do.  Here are some simple observations that might be useful in this regard and that also illustrate some of the practical implications.

Correlation cuts both ways - i.e., it is possible to do better than independence if a system's deployment architecture splits over domains with high failure affinity.  Consider, for example, an application that requires at least one database node to be available for it to provide service.  Suppose that database node failures are perfectly rack-correlated (i.e., all database node failures are concentrated on single racks).  Then if the application splits database nodes over racks (i.e. has at least one node in each of two different racks) it can deliver continuous availability (assuming the database is the only thing that can fail).

End-to-end HA requires splitting over all domains with high failure correlation. Suppose that in the example above, database node failures also show high switch affinity.  Then to deliver HA at the application level, you need to ensure that in addition to having database nodes in two different racks, you also need nodes connected to at least two different switches.

As always, correlation does not imply causation.  The Google paper makes this point in a couple of places.  Suppose that in our simple example all database failures are in fact due to database upgrades and the operational practice is to apply these upgrades one rack at a time.  That will result in high rack affinity among failures, but the failures have nothing to do with the physical characteristics or failure modes of the racks or their supporting infrastructure.

The observations above are basic and consistent with the conventional wisdom applied by operations engineers every day.  In an ideal world, HA systems would be designed to split over every possible failure domain (rack, switch, power supply, OS image, data center...).  This is never practical and rarely cost-effective.  What is interesting is how quantitative measurements of failure correlation can be used to help estimate the benefit of splitting over failure domains.  Just measuring correlation as defined above is a good start.

## Monday, October 17, 2011

### Open source community economics

Much has been written about how to make money from open source and the impact of open source on commercial software markets.  The economic agents in these analyses are “open source companies,” traditional technology companies leveraging open source and individuals trying to make a living writing software.  I have not found much that looks at open source communities as agents.  Here are some pretty obvious things that call out the contrast between the interest of an open source community and the more traditional economic agents that engage it.  Throughout, I am assuming a real open development, open source community - not a corporate fishbowl or commercial consortium.

Membership is the bottom line
Companies eventually go out of business if they do not make money - i.e., if their revenues do not exceed their costs over time.  Open source communities go out of business if they do not “make members” over time - i.e., if they do not succeed in attracting and retaining volunteers faster than people leave to do other things with their time.  Just as commercial companies need to relentlessly focus on making sure they remain profitable, OSS communities need to constantly strive to remain interesting, attractive and welcoming.  Infusing paid volunteers is one way to keep “the books balanced” but it is analogous to borrowing capital in the commercial world - the result better be a healthier, more sustainable community; otherwise the “cash infusion” is just postponing demise.

While commercial interests and committers’ individual career goals may be enhanced by focusing on achieving the highest possible levels of downloads and industry buzz, this by itself does nothing for the community.  The community indirectly benefits as a result of users who come to know about it via the “buzz” and later decide to engage.  But if maximizing sheer numbers of “free beer consumers” in any way reduces user engagement or discourages new volunteers from getting involved, it does harm to the community.  A critical juncture is what happens when an OSS project becomes commercially important.  Inevitably, the need for “stability” starts popping up in community discussions and someone proposes that the project should move to “review then commit” (nothing gets committed until *after* it has been reviewed by the project committers).  Then comes the decision to have a “high bar” for commit.  This will “stabilize” the code in the short term, allowing vast hordes of free beer drinkers to download and use it “with confidence” and generate ever more positive industry buzz.  But it will kill the community over time.  I am not suggesting that this *has* to happen.  The Apache httpd and Tomcat projects, and many others, have managed to have it both ways - lots of downloads and “buzz” and healthy communities.  But they have had to work at it and stay focused on maintaining an environment where new volunteers are welcomed into the community, processes are transparent, there is genuine openness to new ideas and it is not hard for new contributors to learn about the project and find useful things to work on.

Problems are worth more than solutions
At Apache, we often repeat the mantra, “community before code.”  That means that if you ever have to decide between the interests of the community and the most expedient way to ship working software, the community wins.  We take time to talk about things and come to consensus and we make technical decisions based on consensus - even if that sometimes takes longer or results in less elegant code committed to the repository.  From the standpoint of the community as economic agent, its ability to attract and retain volunteers is paramount, so it makes sense that we *use* the code to feed the community, and not vice-versa.  An interesting consequence of this is that problems - whether they be bug reports or ideas for enhancements - are more valuable to the community than “donations” of code.  Large “code dumps” of finished code actually have negative value, because they distract the community with generally boring things like licensing, repackaging, and documentation and add an additional support burden.  Small contributions of code or ideas that raise interesting problems are sustaining food for the community.  Here again, the actual interests of the community as an economic agent do not correspond exactly to the interests of those consuming its “product.”  This is not surprising, because the real “customers” of an OSS community are the members of the community itself, who are typically a tiny subset of the user base of the software that is produced.

Diversity is the risk management strategy
Just as corporations have to worry about mismanagement, market forces or other externalities destroying their business models, OSS communities have to worry about internal or external problems forcing them to lose their volunteers.  Just as businesses diversify to spread risk, OSS communities “diversify” - but in the case of OSS communities, diversification means cultivating a diverse membership.  Having all key contributors work for the same company, for example, represents a material risk (assuming they are all paid to contribute).  Or having them all share a narrow view of *the one right way* to do whatever the code does.  Diversity in open source communities is a natural hedge against technological obsolescence and collective loss of interest.  Software technology is faddish and talented developers are fickle - easily attracted to the next new shiny object.  To survive in the market for eyeballs and interest, OSS communities have to be places where new ideas can happen.  This means they have to attract people who can have different ideas than what the community already has - i.e., they have to be constantly diversifying.

## Monday, October 10, 2011

### Rethinking authentication

Authentication may end up being the next big thing for mobile devices after the "killer app" that led to us all walking around with them - i.e., voice.  The whole concept of "user authentication" is due for an uplift and all of us effectively becoming net POPs via our mobile devices opens up some interesting possibilities in this area.

Traditionally, authentication is something that punctuates and intrudes on our experience as we do things that only we should be able to do - e.g. access financial accounts, use credit cards, check in to hotels, flights, exclusive events, etc.  A person or automated system gating our access to something has to get to a sufficiently high level of confidence that we are who we say we are in order to let us in.  Authentication puts gates in front of us and we present credentials to get the gates to open.

Having an "always on" POP attached to us allows us to think about the problem differently.  Instead of authenticating at experience-interrupting gates, we can think about continuously updating our estimate of the probability that the person attached to a mobile device is the person who should be attached to that device (call this the right binding probability).  As I walk around and do stuff, take calls, get visually identified, etc., my device can provide a stream of "naturally authenticating information" (eventually based on biometrics, but also including behavioral information as well as the outcome of authentication / identification events).   When I want to do something that only I can do, my authentication state can be pushed in front of me, opening gates and eventually even eliminating most of them altogether in favor of challenges based on thresholds of the right binding probability.

There are obviously privacy considerations to think about here and at the end of the day, it will come down to how much "observation" we are going to allow in order to make authentication more convenient for us and our identities more secure.  Just allowing the phone to identify us via voiceprint and to report this event to an authentication service that we opt in to could provide a convenient second factor for financial transactions - again, without interrupting experience.

Updating right binding probabilities based on authenticating events presents an interesting mathematical modelling problem.  Each event should have an immediate impact, but its effect should decay over time.  A relatively strong event like voiceprint identification should create a significant bump, while a weaker event like crossing a geo fence into a common haunt at a regular time should contribute less.  But how, if at all, should the second event affect the decay of the first event's effect?  It seems we need to keep a rolling window of recent events, including their times, and an updating algorithm that looks at both the existence of event types over backward-looking time intervals and their sequencing.
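One simple starting point is exponential decay with independent combination of evidence.  This is a sketch under strong assumptions - the weights, half-lives and prior below are invented, and treating events as independent sidesteps exactly the interaction question raised above:

```python
# All weights, half-lives and the prior are made-up illustrative values;
# a real system would fit them to observed account-takeover data.
EVENT_WEIGHTS = {"voiceprint": 0.60, "geo_fence": 0.10}
HALF_LIVES_S = {"voiceprint": 3600.0, "geo_fence": 1800.0}

def right_binding_probability(events, now, prior=0.05):
    """Estimate P(device is bound to its rightful owner).

    `events` is a list of (event_type, timestamp_seconds) pairs.  Each
    event contributes its weight, decayed exponentially with a per-type
    half-life, and contributions combine as independent evidence:

        P = 1 - (1 - prior) * prod(1 - decayed_weight_i)
    """
    p_none = 1.0 - prior
    for etype, ts in events:
        age = max(0.0, now - ts)
        decayed = EVENT_WEIGHTS[etype] * 0.5 ** (age / HALF_LIVES_S[etype])
        p_none *= 1.0 - decayed
    return 1.0 - p_none
```

Under this model a fresh voiceprint produces a large bump that fades over a couple of hours, and a geo-fence crossing nudges the estimate up without ever clearing a high threshold on its own.  Making one event's arrival alter another's decay rate would require replacing the simple product with something sequence-aware.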

## Tuesday, September 27, 2011

### Client Oriented Architecture

For no particular reason, I have been thinking a fair amount recently about the CAP theorem and how the basic problem that it presents is worked around in various ways by contemporary and even ancient systems.

I remember years ago as a freshly minted SOA zealot, I was confused by the pushback that I got from mainframe developers who insisted that client applications needed to have more control over how services were activated and how they worked.  I always thought that good, clean service API design and "separation of concerns," along with developer education and evangelism would make this resistance go away.  I was wrong.

I still think the basic idea of SOA (encapsulation and loose coupling) is correct; but once you shatter the illusion of the always-available, always-consistent central data store, you need to let the client do what it needs to do.  The whole system has to be a little more "client-oriented."

The Dynamo paper provides a great example of what I am talking about here.  I am not sure it is still an accurate description of how Amazon's applications work, but the practical issues and approaches described in the paper are really instructive.  According to the paper, Dynamo is a key-value store designed to deliver very high availability but only "eventual consistency" (i.e., at any given time, there may be multiple, inconsistent versions of an object in circulation, and the system provides mechanisms to resolve conflicts over time).  For applications that require it, Dynamo lets clients decide how to resolve version conflicts.  To do that, services maintain vector clocks of version information and surface what would in a "pure" SOA implementation be "service-side" concerns to the client.  To add even more horror for SOA purists, the paper also reports that in some cases, applications with very stringent performance demands can bypass the normal service location and binding infrastructure - again, letting clients make their own decisions.  Finally, they even mention the ability of clients to tune the "sloppy quorum" parameters that determine the effective durability of writes, the availability of reads and the incidence of version conflicts.
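To make the client-side reconciliation idea concrete, here is a toy sketch - not Dynamo's actual code - of comparing vector clocks and handing genuinely concurrent versions to an application-supplied merge function:

```python
def compare(a, b):
    """Compare two vector clocks (dicts mapping node id -> counter).

    Returns "before" if a causally precedes b, "after" if b precedes a,
    "equal", or "concurrent" (conflicting siblings that the client
    application must reconcile)."""
    nodes = set(a) | set(b)
    a_le = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le and b_le:
        return "equal"
    if a_le:
        return "before"
    if b_le:
        return "after"
    return "concurrent"

def prune_and_reconcile(versions, merge):
    """Drop versions dominated by a causally later one, then hand any
    remaining concurrent siblings to an application-supplied merge
    function - the client-side decision that Dynamo surfaces."""
    survivors = [
        (vc, val) for vc, val in versions
        if not any(compare(vc, other) == "before"
                   for other, _ in versions if other is not vc)
    ]
    if len(survivors) == 1:
        return survivors[0][1]
    return merge([val for _, val in survivors])
```

A shopping-cart-style client, for example, might pass a merge function that unions the item sets of concurrent cart versions, which is the semantic decision no generic service layer could make on its behalf.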

Despite the catchy title for this post, I don't mean to suggest that SOA was a bad idea or that we should all go back to point-to-point interfaces and tight coupling everywhere.  What I am suggesting is that just having clean service APIs at the semantic, or "model," level and counting on the infrastructure to make all decisions on behalf of the client doesn't cut it in the post-CAP world.  Clients need to be allowed to be intelligent and engaged in managing their own QoS.  The examples above illustrate some of the ways that can happen, and I am sure there are lots of others.  An interesting question is how much of this it makes sense to standardize and how much ends up as part of service API definitions.  Dynamo's version context is a concrete example: it just rides along in the service payloads, so it is effectively standardized into the infrastructure.