Scaling and the Friction of Dimension
If you’ve been in software development for a while, you know that small web applications are ill-suited for massive load. I’m writing this a few days after Black Friday in the US. This year, another large retailer had a site outage, losing tens of millions of dollars in expected sales. The retailer definitely didn’t have a small system, but it is easy to imagine that if their system's possible load had been much smaller, the system could have been simpler and everything would have been ok.
Over the past few decades, the industry has gone through a number of cycles of innovation aimed directly at the problem of how to scale systems “up.” There are many approaches, but it’s worth asking why there’s a problem at all. Why don’t things in the small just work when scaled up?
The most immediate answer is: physics. Code that runs in the physical world has a relationship with the physical world. It’s easy to miss this fact in the small, much like we can ignore relativistic effects when apples fall from trees, but interactions with the world are inevitable when we scale. Memory size, latency, computational speed, synchronization across distributed nodes... There are many problems that can be solved with a screenful of code, but once N grows large you have to alter your algorithm (or its packaging) to deal with all of the scaling issues imposed by atoms, photons and electrons.
This, in a nutshell, is why scaling is a problem in software, but we can go deeper.
One of the most important insights about scaling came from Galileo Galilei in the 17th century. While under house arrest at the end of his life he wrote a book called Two New Sciences that contained the following observation (sometimes called Galileo’s Scaling Law or the Square-Cube Law):
the surface of a small solid is comparatively greater than that of a large one because the surface goes like the square of a linear dimension, but the volume goes like the cube
It’s such a simple statement but it is profound. In the world of physical objects we can’t build a skyscraper using the same materials, supports and ratios that we would use when building a cottage. It would collapse because weight scales with volume, while the strength of a support scales only with its cross-sectional area. The structure must be different.
The same is true, by the way, of biological structures. Cheap sci-fi movies show us ants the size of elephants terrorizing cities, but if an ant actually were the size of an elephant it would collapse and become a puddle of goo. The material of the ant’s exoskeleton simply isn’t strong enough to support the weight of the enclosed volume at that scale.
What does this have to do with software? Well, I think we deal with the same problem in software development, but one dimension down. In physical systems, volume grows faster than surface area but, in networked systems, there’s a tendency for the number of edges to grow faster than the number of nodes. For a graph of N nodes, the number of edges tends toward N² as it becomes more connected: a fully connected graph has N(N-1)/2 edges.
Galileo’s Scaling Law is about the tension between N² and N³. In networks, scaling is about the tension between N and N².
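To make the parallel concrete, here is a small worked comparison. It assumes the simplest possible shapes, which are my choice for illustration: a cube of side L for the physical case and a fully connected graph of N nodes for the network case.

```latex
% Physical case: a cube of side L
S = 6L^2, \qquad V = L^3, \qquad \frac{S}{V} = \frac{6}{L}

% Network case: a fully connected graph with N nodes and E edges
E = \frac{N(N-1)}{2}, \qquad \frac{N}{E} = \frac{2}{N-1}
```

In both cases the lower-dimensional quantity shrinks relative to the higher-dimensional one roughly as one over the size of the system.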
Where do we see this tension in software development? I can think of two obvious places. One is the tension of team size implied by Brooks’s Law and the other is the tension of dependency in architecture.
Let’s look at Brooks’s Law first.
In The Mythical Man-Month, Fred Brooks pointed out that adding people to a late project makes it later. It’s a useful observation, but the reasoning behind it is the important part. Brooks realized that the number of communication paths in a team grows as the square of the number of team members. Each time an additional person is added, potentially N-1 new relationships form, and that can be costly. Worse, the cost of adding new team members grows each time we add one. The costs accelerate.
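The arithmetic behind that acceleration is easy to sketch. The snippet below is only an illustration, assuming the worst case where every team member may need to talk to every other; the function names are mine, not Brooks’s.

```python
def communication_paths(team_size: int) -> int:
    """Distinct pairs in a team where everyone may talk to everyone."""
    return team_size * (team_size - 1) // 2


def added_paths(team_size: int) -> int:
    """New pairwise paths created when a team of team_size gains one member."""
    return communication_paths(team_size + 1) - communication_paths(team_size)


if __name__ == "__main__":
    # Print team size, total paths, and the paths the next hire would add.
    for n in (3, 5, 10, 30):
        print(n, communication_paths(n), added_paths(n))
```

A team of 3 has 3 possible paths while a team of 30 has 435, and each new hire adds as many paths as there are people already on the team.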
Outside of software development, we can see how communication costs can grow excessively as N grows large. It’s often quicker to reach consensus with smaller groups of people than with larger groups of people. Three people deciding where to go for dinner usually takes far less time than thirty people deciding. This is the Universal Scalability Law in a social context. It’s also why smaller teams tend to be better. Amazon’s two-pizza team model is a good example of advice that aligns with these insights.
Architecture is another place where N² has brutal effects. In code, we know that circular dependencies are bad, but aside from the directionality of dependencies, it’s better on balance for a component to have fewer dependencies on other components. When systems go bad, every piece starts to depend on every other piece. N components start to develop N² connections between them. I call this bad but I think it’s important to realize that it is not a sinister sort of bad, it’s just the natural way that systems grow. No developer is saying “oh, I want to mung up the system today” but there is a tendency toward connection in systems. You are working in component A and you find it needs something from library B. You get the benefit of having that capability from B but now you have a dependency on B. Connection is attractive.
We can see this tendency in social systems too. Adding a person to a team gives the team the benefit of new skills, new perspectives, and another set of hands. But when you add a person, you have one more person to coordinate with. As N grows, you slow down. Then, you start to think about splitting the team. Even if you don’t consciously split teams, they tend to divide internally on an ad-hoc basis. In sociology, these subgroups are called cliques. Dunbar’s Number is another example of the same tension of connection at the organizational level.
There’s a tension that grows as the number of pieces in a system grows if the pieces can connect. This doesn’t put an absolute bound on the number of pieces but it does increase costs as systems grow. At a certain point, the costs outweigh the benefits. In economics, this is Ronald Coase’s transaction cost theory of the firm. In software, it’s why we modularize. The seven, plus or minus two, items that fit in our working memory form an area of reduced cost. The N things that can fit in your mind at once can be connected in N² ways at low cost. As N grows, you wish for functions, classes and services to bound the scope of what you need to be aware of while you do localized work. The generic term for this is modularity.
All of these tensions can be modeled as functions of the costs of edges in a graph. When the costs are zero and the benefits of connection are positive, N becomes N² very quickly. As costs grow, systems break apart. They federate or form local hubs, creating the structure that we see.
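Here is a toy sketch of why breaking apart pays off. It rests on simplifying assumptions of my own: pieces are split into equal-sized modules, every piece inside a module connects to every other piece in it, and each pair of modules shares a single connection. Real systems are messier, so treat the numbers as an illustration rather than a model of any particular architecture.

```python
def full_mesh_edges(n: int) -> int:
    """Edges when all n pieces connect directly to one another."""
    return n * (n - 1) // 2


def modular_edges(n: int, module_size: int) -> int:
    """Edges when n pieces are split into equal modules.

    Assumes a full mesh inside each module and exactly one edge
    between each pair of modules (a hypothetical, idealized layout).
    """
    if module_size <= 0 or n % module_size != 0:
        raise ValueError("n must be a positive multiple of module_size")
    modules = n // module_size
    inside = modules * full_mesh_edges(module_size)
    between = full_mesh_edges(modules)
    return inside + between


if __name__ == "__main__":
    print(full_mesh_edges(100))    # every piece talks to every piece
    print(modular_edges(100, 10))  # ten modules of ten pieces each
```

Under these assumptions, 100 fully connected pieces carry 4950 edges, while ten modules of ten carry 495. If each edge has a cost, the modular structure is roughly ten times cheaper to maintain.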
Let’s go back to Galileo for a moment.
Galileo showed a relationship between N² and N³. As N grows, structure needs to change when there are costs. The same relationship seems to hold for N and N².
Maybe we can generalize this and say that there’s a friction between adjacent dimensions that generates structure.