We live in a constantly changing world. Updates happen all the time. It's counter-intuitive to say that we need to somehow avoid mutability. But we absolutely do in order to manage scaling and to be able to reason effectively about distributed information systems. Here are some motivational examples:
- Who saw what: A system giving diagnostic tests to students maintains a database of students and tests. The testing organization wants to generate reports showing how many students in each state took each test version. When they run the reports, the counts don't match the counts taken when the tests were administered. The problem is that students move, and their addresses are overwritten when that happens.
- Patching: A large company has a massive fleet of bespoke servers and workstations hosting diverse system and application software. These systems need to be patched to address security vulnerabilities. Every patch deployment carries incompatibility risk and has to be tested individually on each host.
- Replication: A social media company has a core application to manage posts and other activities. Over time, marketing, analytics, and all kinds of other applications in the company want to maintain their own copies of portions of the data in the main app. Every time there is an update in the main app, a wild mess of spaghetti code propagates updates to the other application databases.
- Distributed processing: A system for processing financial transactions distributes transactions to processing services based on attributes in the transaction. Processors can only handle transactions with the attributes that they expect. Sometimes transactional attributes change after distribution.
The problem in each of these examples comes down to mutability in the distributed system design, or, more precisely, to the lack of an immutable substrate on which to ground information distribution. In the first case, storing student addresses in a single, mutable field makes accurate point-in-time reporting impossible. In the second, mutability of running systems forces operators to treat them as "pets." In the third, individually propagating updates becomes intractable at scale. In the fourth, mutability in transactional attributes makes correct distribution impossible.
The third example is what motivated the invention of Apache Kafka by Jay Kreps and his then-colleagues at LinkedIn. The core idea is very simple: make an immutable log the definitive record of distributed state change, and let clients consume the events in the log that they are interested in at their own pace.
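That core idea can be sketched in a few lines. This is not Kafka's API, just a hypothetical toy model of the principle: an append-only sequence of immutable events, with each consumer tracking its own read position (offset) independently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """An immutable record of a state change."""
    key: str
    value: str

class Log:
    """Append-only: events are never updated or deleted in place."""
    def __init__(self):
        self._events = []

    def append(self, event: Event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read(self, offset: int, max_events: int = 10):
        return self._events[offset:offset + max_events]

class Consumer:
    """Each consumer advances through the log at its own pace."""
    def __init__(self, log: Log):
        self._log = log
        self.offset = 0

    def poll(self, max_events: int = 10):
        batch = self._log.read(self.offset, max_events)
        self.offset += len(batch)
        return batch

log = Log()
log.append(Event("user1", "posted"))
log.append(Event("user2", "liked"))

analytics = Consumer(log)
marketing = Consumer(log)
assert [e.value for e in analytics.poll()] == ["posted", "liked"]
assert marketing.offset == 0  # one consumer's progress never affects another's
```

Because the log never changes once written, any number of downstream applications can read it concurrently, replay it from the beginning, or fall behind and catch up later, all without coordination.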
Using change data capture or a separate time-stamped address table to solve the first example comes down to the same thing. Grounding infrastructure upgrades on versioned, immutable images and redeploying instead of "patching" comes down to the same thing. In each of the first three examples, scaled and consistent behavioral mutability is enabled by grounding state changes in an immutable event log.
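The time-stamped address table mentioned above can be sketched as follows. The data and names are hypothetical; the point is that recording each move as an immutable event, rather than overwriting a single address field, lets a report ask "what state was this student in on the test date?"

```python
from datetime import date

# Each address change is an immutable (student, effective_date, state) event.
address_events = [
    ("alice", date(2020, 1, 1), "OH"),
    ("alice", date(2021, 6, 1), "TX"),  # Alice moved after taking the test
    ("bob",   date(2020, 1, 1), "OH"),
]

def state_on(student: str, as_of: date) -> str:
    """Return the student's state as of a given date:
    the latest address event at or before that date."""
    history = [(d, s) for (who, d, s) in address_events
               if who == student and d <= as_of]
    return max(history)[1]

test_date = date(2021, 3, 15)
assert state_on("alice", test_date) == "OH"   # matches the administration-time count
assert state_on("alice", date(2021, 7, 1)) == "TX"  # current address still available
```

With a single mutable field, the first query is simply unanswerable: the information needed to reconstruct the past has been destroyed by the update.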
The problem in the fourth example is that mutable objects are concurrency-hostile. You can't allow concurrent access to them or use them in scatter-gather architectures without constantly worrying about how their attributes might change or how views of their state might be inconsistent. Reasoning about distributed or concurrent processing systems that manage mutable objects is very hard. Almost always, these problems can be avoided by favoring immutable objects, creating new instances instead of mutating existing ones.
If you want your distributed systems to scale and to be easy to reason about, don't look at them as networks of mutable objects. Instead, model state changes as immutable sequences of events and try to limit the need to share event information across context boundaries.