Wednesday
May 16, 2012

Lean Networking

I have been fortunate to have many great colleagues and mentors throughout my career. One of them is David Anderson, with whom I have worked at Oracle, at Sprint and now while running my own company. David is a leading light in the agile software development movement, and has written a number of books on the topic. His latest one (which is being revised for a new edition) is Kanban, subtitled Successful evolutionary change for your technology business.

The essence of the Kanban methodology is to surface how value flows through a system (in this case software development), and to do it in a way that makes it manageable. Many of the ideas are taken from lean manufacturing, agile management techniques, plus the theory of constraints and throughput accounting, as espoused by Eli Goldratt in his famous book The Goal.

I was fortunate enough to go on David’s Kanban course in Barcelona earlier this year, and commend attendance at a future event as essential basic literacy in managing software development. What I noticed were great similarities between the issues of value flow in manufacturing and software development and those in packet networking and telecoms. I would like to share with you one critical idea that I feel will become a central theme in telecoms over the next decade and more.

I call this idea ‘lean networking’, based on the insight that there are two kinds of efficiency in any system of value-flow -- including networks.

Understanding lean

The idea of ‘lean’ has gotten a bad name in recent times due to misunderstanding and misapplication. It became synonymous with outsourcing, cost reduction and management short-sightedness. This is diametrically opposed to its origins and true meaning, which were a ruthless focus on value to the customer, agility and low defect rates. The means of achieving these ends are restricted work-in-progress, low cycle times, flexible resources, and highly standardised and optimised work.

What the Toyota Production System demonstrated was that ‘lean’ is not just a system of practices, but rather a set of attitudes and beliefs. Central is the idea of respect for people, and that everyone is part of the system and responsible for the outcomes of the system. This rejects the idea of workers as machine-parts that occasionally malfunction and produce poor quality that has to then be ‘managed’ by professional managers using systems of carrots and sticks. Instead, it sees that quality is not opposed to efficiency, but rather central to it, and that it is the duty of everyone to ensure it.

Famously, in Toyota plants you are a hero if you spot a quality issue and stop the production line. Defects cause re-work, and that is waste. Defects that reach the customer cause a loss of satisfaction, and that is unacceptable.

Flow vs resource efficiency

Contrast this with the traditional view of manufacturing, where what matters is keeping the expensive plant busy. Stopping the production line reduces resource usage, and therefore loses some perceived efficiency. Whilst quality is not discounted, it is secondary to meeting targets of efficient production and low unit cost.

That traditional worldview suffers from several critical flaws:
  • It cannot cope well with variation in demand and supply capacity, since variation requires slack, which is at odds with high utilisation.
  • It results in long lead times and high inventory costs, as work is stockpiled to ensure no resource is starved of work.
  • Furthermore, the focus on quantity over quality allows re-work into the system to deal with those quality issues, which further increases lead time.
  • Because of the long lead times, some work needs to be expedited, which incurs set-up and task switching costs, which further increases lead times and decreases ability to respond to variable demand.
These two approaches can be thought of as two different beliefs of what matters: the ‘traditional’ view that resource efficiency is primary, and the ‘lean’ view that flow efficiency (low lead time, high throughput) works best.

Ideally we want to have both: no wastage of production capacity and short lead times, which together make for high throughput. The same ideas are pursued in greater depth in the excellent blog posts by Håkan Forss (part 1 and part 2).

The right kind of efficiency

These ideas can be summarised in the chart below. This is a critical chart, and one worth remembering.

The crux is that there is no path from high resource efficiency and low flow efficiency to the nirvana of having both. It is mathematically impossible, because the variation in presented load means you have no slack with which to achieve flow efficiency. This is the stuck position that the telecoms industry finds itself in, since the core belief of telcos is that they are in the business of delivering pipes and filling those pipes with bits.
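To make the tension between the two efficiencies concrete, here is a minimal sketch using the textbook M/M/1 queue (random arrivals, random service times, one server). The model choice and the function names are my assumptions for illustration, not something taken from the Kanban material. In this simple model the share of lead time spent actually being worked on is exactly one minus the utilisation, so pushing resource efficiency towards 100% drives flow efficiency towards zero.

```python
def mean_lead_time(utilisation: float, service_time: float = 1.0) -> float:
    """Mean time in system (queueing plus service) for an M/M/1 queue."""
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in [0, 1) for a stable queue")
    return service_time / (1.0 - utilisation)


def flow_efficiency(utilisation: float) -> float:
    """Share of the lead time spent being served rather than waiting."""
    return 1.0 / mean_lead_time(utilisation, 1.0)


if __name__ == "__main__":
    # Illustrative utilisation levels only; they are not measurements.
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(
            f"utilisation {rho:.2f}: "
            f"lead time {mean_lead_time(rho):6.1f}x service time, "
            f"flow efficiency {flow_efficiency(rho):.2f}"
        )
```

Real plants and real networks are not M/M/1 queues, so the exact numbers should not be taken literally; the point is only the shape of the curve as slack disappears.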

Telecoms networks: variation, quality, defects and re-work

The traditional bandwidth-based view of networks is one that focuses on resource utilisation:
  • When we think of bandwidth as the resource, we imagine ‘bandwidth caps’ as being a way of preventing over-exploitation of that resource.
  • We put queues in front of network links to ensure they remain fully loaded, and have lots of ‘work in progress’ as packets sit in queues.
  • QoS offers priority, but can’t cope with differential needs of loss or delay between flows, or assure any particular outcome. In other words, QoS doesn’t actually deliver quality, which is an absence of excessive loss and delay; every failed Skype call or delayed Web page is a defect, and having to hit 'redial' or 'refresh' is re-work.
  • QoS is also a form of expediting that comes at a high cost to other flows; we not only rob Peter to pay Paul, but at the cost of leaving Peter’s bodily organs so impoverished as to be unable to function, and Paul suffering from obesity from consuming too much cream and lard.
  • As we load up networks to their point of saturation, we see an increasing failure load on the networks of lost packets (which generate no value, despite consuming resources and delaying other packets), and re-sent packets.
  • As an added bonus, as we approach full resource utilisation, phasing effects in the network cause correlated flow patterns that tend to push us into chaotic behaviour and network collapse.
  • What we ignore at our peril is that load is highly variable, and flows have different loss, jitter and throughput needs. Value to the customer comes from meeting both their quantity and quality needs. Those quality needs may be weak or highly stringent, depending on the application’s performance aspirations.
The PSTN is like a cottage craft business: one product, made to perfection, at high expense. The Internet is like a traditional manufacturing plant: fill her up, damn the quality.

We can do better than both of these.

Limit work-in-progress

The basic problem is that we ignore the central idea of the Theory of Constraints: we have to limit work-in-progress, and match the work admitted at ingress to the capacity of the bottleneck in the system. In telecoms terms, that means you need to accept some basic realities:
  • To get to both high flow and resource efficiency, you need to work with flow efficiency first.
  • The resource you need to manage for flow efficiency is not bandwidth, but contention (i.e. loss and delay) along the path.
  • The place to manage it is at the point of ingress to the network.
This is counter-intuitive, as it requires having networks that run idle more of the time, and drop packets more often. However, the result is a network which can deliver both flow and resource efficiency – the quality outcomes of the PSTN, and the flexibility and cost of the Internet.
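The effect of limiting work-in-progress at ingress can be sketched with a toy discrete-time model. Everything here is an illustrative assumption of mine: the function name, the tick-based service, and the burst pattern are hypothetical and do not describe any real network equipment. A burst arrives at a bottleneck that serves a fixed number of packets per tick; with a small WIP limit more packets are dropped at the door, but the ones admitted see a short, bounded delay.

```python
from collections import deque


def simulate_ingress(arrivals, service_per_tick, wip_limit):
    """Toy model of a WIP-limited ingress in front of one bottleneck.

    arrivals: packets offered at each tick.
    Returns (delivered, dropped, max_delay_in_ticks).
    """
    in_progress = deque()  # arrival tick of every admitted packet
    delivered = dropped = max_delay = 0
    tick = 0
    while tick < len(arrivals) or in_progress:
        offered = arrivals[tick] if tick < len(arrivals) else 0
        for _ in range(offered):
            if len(in_progress) < wip_limit:
                in_progress.append(tick)  # admit: there is room in the system
            else:
                dropped += 1  # refuse at ingress instead of queueing
        for _ in range(min(service_per_tick, len(in_progress))):
            delay = tick - in_progress.popleft()
            max_delay = max(max_delay, delay)
            delivered += 1
        tick += 1
    return delivered, dropped, max_delay


if __name__ == "__main__":
    burst = [10, 0, 0, 0, 0]  # hypothetical burst of 10 packets
    for limit in (4, 100):
        print(limit, simulate_ingress(burst, service_per_tick=2, wip_limit=limit))
```

In this sketch the worst-case delay is set by the WIP limit divided by the bottleneck's service rate, which is why the limit belongs at ingress: it is the one place where delay can be bounded before the work has consumed any downstream resource.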

The journey to lean networking

The telecoms industry is caught up in a trap of its own making driven by a failed mental model of networks as pipes, bandwidth as the resource that is to be packaged and sold, and resource efficiency as the driver of profit. The inevitable result is that the typical telco resembles a Model T car plant (complete with QoS mine, network policy plantation and IMS steelworks), rather than a modern lean flexible manufacturing system matching resources to needs quickly and efficiently.
Thursday
Apr 12, 2012

The Postage and Packaging Problem

A number of conversations have recently converged on a single problem: how to match applications to network access. Let’s unpeel this issue.

Post and Telegraph 2.0

When I was Chief Analyst at Telco 2.0, we proposed there was a significant untapped market opportunity for network operators to bundle together access with content, applications or services. The revenue opportunity is to charge the providers of those services for delivering fit-for-purpose data at bulk wholesale prices. This is the “postage problem” – who pays for delivery of digital goods and services?

A way of viewing this is “freephone for data”, or the equivalent of pre-paid bulk mail in business. It’s the bread-and-butter of most distribution businesses, with large enterprises paying to get their goods to customers.

My colleague Dean Bubley has long pointed out a parallel issue, which we can call the “packaging problem”. In a nutshell, this can be thought of as “if a user embeds a YouTube video inside of a Facebook page, which enterprise picks up the tab?”. It’s hard to allocate traffic to paying parties once it has hit the network.

A physical envelope acts as packaging, and has two functions: keeping what you want sent inside and not spilling out, and keeping things outside from getting in. The same problem exists with networks: how to pay for only the data you should be liable for? You don’t want anyone else tunnelling their traffic via your app with you paying the bill.

Postage Problem

When application providers want to pay telcos for sender-pays data, they typically fall into one of two cases: they want premium assured quality service (e.g. home worker telepresence), or they want bulk low-cost low-quality service (e.g. video advert pre-caching).

As Dean has drolly noted on a number of occasions, there are two problems with this: enterprises don’t know how to buy it, and telcos don’t know how to sell it. The whole market is based on quantity, not quality.

In fact, it’s worse than this. The vendors who sell to telcos don’t even know how to make it. What they get telcos to implement is “quality of service”, or priority. However, what premium applications need is predictable, assured bounds on loss and delay, particularly for the infrequent “worst case” that users may perceive as an application failure. The type of quality on sale is next-to-useless, as it still has unstable and unbounded loss and delay. Furthermore, the lower-class traffic tiers have such poor and unstable loss and delay characteristics that you can’t even begin to sell them.

Think of telcos like factories that take in raw milk. If they try to separate out the cream and whey, you find the cream is still sometimes milky and won’t whip, and the whey often goes rotten before it leaves the factory.

Packaging Problem

The other half of the problem is making a strong and secure packaging. Cream, yogurt, cheese, milk and curd all need different packaging. So do the different types of data, and that partly depends on where you need to deliver it.

There is a fundamental shift going on in the industry. We’ve seen the basic network end-point unit drop in size over the decades: a town had a town crier; then a neighbourhood had a telegraph office; a block a telephone booth; followed by telephones to houses and Internet down to single devices like PCs.

A common pattern is for each new generation to use some elements of the communication solution of the previous generation, before coming up with its own version. So telephones used telegraph wires, before coming up with twisted pairs. PCs used dial-up ISPs before broadband.

The next shift in delivery point is from PCs down to Apps. We’re at the phase where PC-level broadband is going to get the “dial-up ISP treatment”, and start to be broken down into smaller chunks. Apps are the missing Tyvek envelopes into which to put the data.

The initial need will not, however, be driven by sender-pays data. We will see simpler examples where corporates and schools want to ensure their employees and pupils can use approved apps in approved ways, without the host having to pay for the data costs of every app that crosses into the building.

What next?
Here is my thesis on what happens next.

We will see the beginnings of a solution to the Postage & Packaging problem in the context of enterprise use. Trends like Bring Your Own Device will force enterprises to turn their functions into managed apps.
New business models will start to spring up. For example, a school may charge pupils a fee if they want to use non-educational apps on the school network since that drives a real cost of network upgrade and operation.

Over time, virtualisation technology and on-device management features will make the envelopes more robust. Apps will understand networks better. In parallel, new network enablers will solve more of the ‘postage problem’ offering a richer set of quantities and qualities than one-size-fits-all best-effort broadband. Aggregators will begin to bridge together the supply and demand sides.

Eventually, the current broadband ISP model may look as antiquated as the dial-up ISP does today. The industry will polarise into two extremes. One is highly-packaged experiences, like a ready-made meal, which come as part of a device or an e-commerce service. The other is totally unpackaged ‘grow and cook it yourself’, just as a home network today doesn’t need any kind of service provider.

Neither is intrinsically good or bad, but either way the system we currently think of as ‘the Internet’ will be tomorrow’s cold leftovers.
Monday
Apr 2, 2012

I, Network

In the March newsletter, I shared the idea that there is a new category of network architecture, the Network of Probabilities. This differs from classical circuits (Network of Promises) or best-effort packet data (Network of Possibilities). I personally believe it's the next revolution in telecoms. What’s new is that it provides a trading space for allocating contention between flows, and does this with some novel applied mathematics.

A bit like the progression from 2G to 3G to 4G wireless data encoding, better mathematics can squeeze a lot more out of fixed networks. Indeed, we could say there is an equivalent generational progression in multiplexed fixed networks, from TDM to ATM to IP. In this newsletter, I’d like to lead you a little further along my own journey of enlightenment to the fourth generation of fixed networking, called Contention Management. It’s feeling lonely out here right now.

The hard thing to do is to let go of your intuitive beliefs about ‘bandwidth’ in networks. Packet networks do not have ‘bandwidth’, just like the sun is not made of ‘shine’. We mustn’t mistake metaphors for reality. Indeed, in order to get more out of networks, we must transcend the approximation of bandwidth-based thinking to network reality, and adopt a more robust model that is inclusive of both quantity and quality effects.

When better bandwidth is bad

Here are three examples of what happens in real networks when you apply naïve bandwidth thinking to packet networks like the Internet.

Example 1: Your network is working fine, has lots of bandwidth available, but the users keep reporting short outages and poor bandwidth. What’s going on?

Example 2: Imagine you have a standard 20 Mbit/sec DSL line from a central exchange to your home. One day, your telco comes along and ‘upgrades’ you. Now you have a 1 Gbit/sec fibre to a street cabinet, and then say 50 Mbit/sec copper onwards to your home. The fibre is fast, and your copper loop is shorter, so bandwidth goes up. But customers are complaining, and you notice that your online gaming has worse performance than before. What’s going on?

Example 3: To speed up application performance to your holiday cottage, you bond together two links, say 3G and a satellite link. Bandwidth goes up. When you test what happens to the applications, you find terrible performance problems. What’s going on?

Buffers badly batter bandwidth

Let’s see what really happens in networks.

The first is a well-known phenomenon called bufferbloat. When networks saturate, it disrupts the control loops that TCP uses to say ‘faster!’ and ‘slower!’ to the end points of the flows. This can lead to all the queues filling up, multiple packets getting lost in a row, and sudden collapses in transmission speed that are experienced as transient outages by users. The network recovers, but only slowly. As fast memory has become cheaper, the queues in routers have become longer, driven by the mistaken belief that it is always better to delay a packet than to drop it. This just makes the collapse bigger, and recovery slower. And more bandwidth makes the collapses more sudden.
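A back-of-envelope calculation shows why long queues hurt. The sketch below is my own illustration with made-up buffer sizes and link rates: it computes how long a packet arriving at a full buffer waits before it reaches the wire, which is also how long TCP's ‘slower!’ signal is held up.

```python
def full_buffer_delay_seconds(buffer_bytes: int, link_bits_per_second: float) -> float:
    """Time for a full buffer to drain onto the link it feeds."""
    if link_bits_per_second <= 0:
        raise ValueError("link rate must be positive")
    return buffer_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the scale of the effect.
    uplink_bps = 1_000_000  # a 1 Mbit/sec uplink
    for buffer_kb in (16, 64, 256, 1024):
        delay = full_buffer_delay_seconds(buffer_kb * 1024, uplink_bps)
        print(f"{buffer_kb:5d} KB buffer: up to {delay:.2f} s of queueing delay")
```

The delay grows in direct proportion to the buffer, so every extra increment of cheap memory adds waiting time without adding any capacity.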

When the telco upgraded from a single copper loop to fibre plus copper, it inserted an extra queue. This added new delay effects that undid all the benefits of additional bandwidth for delay-sensitive applications. Furthermore, it allowed ‘greedy’ applications to stuff that queue with pulsed traffic, which raised loss and delay for better-behaved applications. Hence customer experience got worse, despite more ‘bandwidth’.

If you take two network links and bond them, you can run into trouble in multiple ways. For a start, you have done nothing to improve the delay characteristics of the new ‘synthetic’ combined link. If you fire packets randomly down one or the other link, you get order-reversal, which TCP treats as a loss, and slows down. If you send packets from the same flow down the same link, they still self-contend, but the application may face unexpectedly different characteristics for each flow. So the different loss and delay for audio and video of a Skype call may seriously confuse the encoding algorithm. Furthermore, any outage or transient saturation effect, even momentary, may cause odd oscillations in the traffic that create poor user experience.
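The order-reversal effect is easy to reproduce in a few lines. This is a deliberately simplified sketch under my own assumptions: packets alternate between the links, each link has a fixed one-way delay, and the delay figures are illustrative rather than measured. It counts how many packets arrive after a packet with a higher sequence number has already been seen.

```python
def count_reordered(num_packets, send_interval_ms, link_delays_ms):
    """Count packets that arrive after a later-numbered packet.

    Packets are spread round-robin over links with fixed one-way delays.
    """
    arrivals = []
    for seq in range(num_packets):
        delay = link_delays_ms[seq % len(link_delays_ms)]
        arrivals.append((seq * send_interval_ms + delay, seq))
    arrivals.sort()  # order in which the receiver sees the packets

    reordered = 0
    highest_seen = -1
    for _, seq in arrivals:
        if seq < highest_seen:
            reordered += 1  # a later packet overtook this one
        else:
            highest_seen = seq
    return reordered


if __name__ == "__main__":
    # Hypothetical one-way delays: two similar links, then a fast and a slow one.
    print("matched links:   ", count_reordered(1000, 1, [50, 50]))
    print("mismatched links:", count_reordered(1000, 1, [50, 600]))
```

With matched delays nothing is overtaken; with a large mismatch nearly every packet sent down the slower link arrives out of order, and that is the pattern the transport protocol misreads as loss.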

As you can see, the explanations require looking at the properties of the queues over short time periods; none of the problems were as a result of a lack of bandwidth. These aren’t isolated edge cases, as the problems are endemic.

Networks all have failure modes.

The questions are: how big are they? how to manage them? and at what cost?

Consider contention before bandwidth

We get failure modes as a result of a lack or misallocation of resources to perform the functions the user desires. The fundamental resource of a network is not bandwidth, but rather contention. Contention is what happens when a packet sits around waiting for other packets, or has nowhere to sit and is lost.

We name this composite of loss and delay as quality attenuation. It’s analogous to noise in a transmission link, but is defined as a meaningful concept for multiplexed networks instead. There is an algebra of how loss and delay of packet flows can be decomposed, and this is not the place to describe it. Just accept there’s a nice formula to define and describe quality attenuation.

Now we’re in a position to move packet networking from alchemy to chemistry by laying down some basic principles.

The three laws of networking

Rather like the laws of thermodynamics, there are three fundamental laws of networks.

  • Quality attenuation exists: Statistically-multiplexed networks are systems with three parameters (load, loss and delay) and two degrees of freedom (typically loss and delay, with load being exogenous). A network is a bit like a piston with pressure, volume and temperature. Set any two, and the third is set for you.
  • Quality attenuation is conserved: Loss and delay are conserved. In other words, we can’t un-delay a queued packet, or un-lose a dropped one. This conservation works in two ways: attenuation is conserved along any one path, and also at any piece of equipment.
  • Quality attenuation is tradable: There is a trading space for loss and delay. At every edge node and transmission link they can be exchanged without increasing overall quality attenuation. However, any other form of trading – between different places, or at different times – inevitably does increase it.
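As a hedged illustration of the first and third laws, the sketch below uses the textbook M/M/1/K queue: a single link with room for at most K packets, random arrivals and random service times. This is a stand-in model that I have chosen, not the quality attenuation algebra referred to above, and the function name is hypothetical. Holding the offered load fixed, the only dial is the buffer size, and turning it exchanges loss for delay.

```python
def loss_and_delay(load: float, capacity: int, service_time: float = 1.0):
    """Loss probability and mean delay of admitted packets in an M/M/1/K queue.

    load: offered load (arrival rate times mean service time), must be > 0.
    capacity: maximum packets in the system, including the one being sent.
    """
    if load <= 0 or capacity < 1:
        raise ValueError("need load > 0 and capacity >= 1")
    # Stationary distribution is proportional to load**n for n = 0..capacity.
    weights = [load ** n for n in range(capacity + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]

    loss = probs[-1]  # an arrival is lost when the system is full
    mean_in_system = sum(n * p for n, p in enumerate(probs))
    admitted_rate = (load / service_time) * (1.0 - loss)
    delay = mean_in_system / admitted_rate  # Little's law
    return loss, delay


if __name__ == "__main__":
    for k in (1, 2, 5, 10, 50):
        loss, delay = loss_and_delay(load=0.9, capacity=k)
        print(f"buffer {k:3d}: loss {loss:.3f}, mean delay {delay:5.2f}x service time")
```

At a given load a smaller buffer gives more loss and less delay, and a larger one the reverse; in this model there is no setting that lowers both at once.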

These are not opinions, but are provable matters of mathematical fact.

The trouble with telecom

The telecoms industry is in trouble, because it is fighting all three fundamental laws. Needless to say, in a fight between management and mathematics, the latter always wins.

  • Telcos have no idea where quality attenuation is happening. They don’t look for it, don’t model it right when they do, and don’t know what to do when they see quality problems. All too often, the prescription is ‘more bandwidth’. It’s the equivalent of medieval leeches, but attached to your capex budget.
  • Telcos try to put quality back into the system when it’s already lost. Quality is a bit like darkness or quiet. You can’t go out and get a box of dark to make it un-light, or a spray can of quiet for a noisy place. Likewise, you can’t put quality back in once you’ve lost it. Techniques to hide poor quality, like anti-jitter buffers, just raise quality attenuation. Network compression boxes add more queues and quality attenuation. Content delivery networks have unexpected quality side-effects. Application-layer cleverness to adapt to quality issues just sets off oscillations that push networks into overdrive failure modes.
  • Telcos trade quality badly. They can’t sell quality of service, because they don’t know how to make it. Priority is about giving ‘more of the bandwidth’ to an application, as opposed to trading loss and delay between flows. The problem is, when you use priority, you typically give too much quality attenuation to the prioritised flows. Given the law of conservation, you’ve got less to give to other flows. What typically happens is that the non-priority flows enter failure modes and collapse easily. So telco QoS doesn’t work at any sensible cost.
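The point about priority and conservation can be illustrated with a standard result from queueing theory for a non-preemptive priority queue. The assumptions are mine: Poisson arrivals, exponentially distributed service with the same mean for both classes, and only two classes. In that model priority does not reduce the total waiting in the system; it only moves it from one class to the other, and the load-weighted sum of waits equals what plain first-come-first-served would give.

```python
def fifo_wait(load: float, service_time: float = 1.0) -> float:
    """Mean queueing delay in an M/M/1 queue served first-come-first-served."""
    return load * service_time / (1.0 - load)


def priority_waits(load_high: float, load_low: float, service_time: float = 1.0):
    """Mean queueing delays (high, low) under non-preemptive priority."""
    load = load_high + load_low
    if not 0.0 < load < 1.0:
        raise ValueError("total load must be in (0, 1)")
    residual = load * service_time  # mean work already in service on arrival
    wait_high = residual / (1.0 - load_high)
    wait_low = residual / ((1.0 - load_high) * (1.0 - load))
    return wait_high, wait_low


if __name__ == "__main__":
    high, low = 0.3, 0.6  # hypothetical loads for the two classes
    w_high, w_low = priority_waits(high, low)
    print(f"FIFO wait for everyone: {fifo_wait(high + low):.2f}")
    print(f"priority class wait:    {w_high:.2f}")
    print(f"other class wait:       {w_low:.2f}")
    print(f"load-weighted total:    {high * w_high + low * w_low:.2f}")
```

The prioritised class waits far less than it would under first-come-first-served, and the other class pays for all of it, which matches the observation above that the non-priority flows are the ones pushed into failure.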

This is not the network you are looking for

It’s worse than you think.

Telcos all over the world are splurging capex on unnecessary network upgrades to paper over what are often quality issues. So the first thing they do is build out fast, fat pipes, and sell them.

And sell them. And sell them. They are then over-selling the capacity, in the mistaken belief that they sell bandwidth. But you run out of quality a long time before you run out of bandwidth. Applications collapse and customers complain – and the network doesn’t yet appear to be ‘full’. That was never in the business case. Are you sure this is a safe utility stock still?

And when you do try to explicitly package and sell quality, to mitigate the collapse effects, you get an effect called ‘quality inversion’. It’s cheaper for customers to buy a fatter, faster pipe with lower packet service times than to buy the ‘quality-assured’ one. That’s a by-product of seeing quality through a bandwidth lens, and mispricing it as a result.

Bye-bye to bandwidth

The bandwidth approach has no means for modelling or managing the failure modes of multiplexed networks. Indeed, it takes infinite bandwidth at infinite cost to have no failure modes. The contention model lets you manage the failures, at a finite cost. Sounds like a good alternative, no?

In a world where capex is constrained, and demand is not, we’re going to see an inevitable shift towards getting more out of what we have. The financial and network maths tells us we must manage the true fundamental resources of the network, not fantasy ones.

At the end of the day, there’s no contention. Bandwidth is bust.

Sunday
Apr 01 2012

The Network of Probabilities

I first met Dr Neil Davies and his crew from Predictable Network Solutions Ltd back in 2008, and wrote a blog post on their technology when I was Chief Analyst at Telco 2.0. In the intervening years, my deepening understanding makes me believe that their ideas and technology are the single biggest paradigm shift I have witnessed in my entire technology career – one which now spans three full decades. I’d like to help you make that same journey of understanding over the coming months.

It’s a journey worth making, because it shows us how the telecoms industry is misallocating tens of billions of dollars of capital, over-spending on network operational cost, and under-serving users through poor experiences.

The challenge I have is providing a path for others to also see how and why bandwidth-based thinking fails us, and contention-based thinking is a superior alternative. There are so many parts to the puzzle, so many things to un-learn, and so many counter-intuitive new principles to adopt. So if you’ll forgive me, I’ll pick a few simple highlights.

The essential problem is that the ‘pipes’ model you have in your head of packet data networks fails to match the fundamental reality of what goes on. That is because networks are not pipes, along which packets flow. Indeed, no packet has ever ‘flowed’ outside of the mind of a human. Networks don’t even have ‘bandwidth’ – that’s at best a property of individual transmission links. Instead, networks are large distributed supercomputers that take waiting packets and copy them. It sounds the same, but the combination of queues and copying makes for mind-warping results.

Applications need both quantity and quality

We want fat pipes, yes? Indeed, fatter pipes are better pipes, aren’t they?

Wrong! Here’s why. The most basic assumption of the bandwidth model in your head is wrong.

When you increase the speed of a network link, you are increasing the quantity of packets for delivery in a way that can degrade the quality that user applications experience. Indeed, more bandwidth can paradoxically make networks unusable some of the time. How come?

Well, imagine you have many users, with many devices, running many applications. Some of these applications will be sensitive to the quantity of packets, say a large file download. Others will be sensitive to the quality, and performance will drop when they experience bursts of jitter, loss and delay. This is typical of voice, video, interactive web applications and online gaming.

All these applications in turn start lots of connections. Some applications by their nature pulse traffic, and those pulses become correlated in time. For example, when you open a web page it typically initiates several simultaneous connections. These slam packets into the queues in the network, which fill up.

What happens next is that control loops like TCP detect packet loss, and try to slow down the rate of sending. And here’s the problem: those control loops can end up setting up a kind of “resonance” in the network, forcing ever more queues to fill up, especially as the ‘slow down!’ signals get lost too.

In other words, ordinary every-day traffic generates statistical patterns of flow that resemble low-bandwidth denial-of-service attacks.
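A small sketch shows how much the pulsing matters on its own, before any TCP feedback is involved. It assumes a link that serves one packet per tick and a queue that holds five packets (both numbers are invented), then offers exactly the same average load in two shapes: evenly spaced, and in correlated bursts. It deliberately leaves out the control loops described above.

```python
def run_queue(arrivals_per_tick, buffer_size):
    """Feed a one-packet-per-tick link; return (dropped, mean wait in ticks).

    Hypothetical toy model: tail-drop queue, equal-sized packets.
    """
    queue = dropped = accepted = total_wait = 0
    for arriving in arrivals_per_tick:
        for _ in range(arriving):
            if queue < buffer_size:
                total_wait += queue   # one tick of waiting per packet ahead
                queue += 1
                accepted += 1
            else:
                dropped += 1          # queue full: the packet is lost
        if queue:
            queue -= 1                # the link sends one packet this tick
    mean_wait = total_wait / accepted if accepted else 0.0
    return dropped, mean_wait


# Both patterns offer 100 packets over 200 ticks, i.e. 50% average load.
smooth = [1, 0] * 100
bursty = ([10] + [0] * 19) * 10

print("smooth:", run_queue(smooth, buffer_size=5))
print("bursty:", run_queue(bursty, buffer_size=5))
```

The smooth pattern loses nothing and never waits. The bursty pattern, at the same 50% average load, loses half its packets and makes the survivors queue.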

Bandwidth is bad

The faster you make the network, the worse this phenomenon is, because the easier it becomes for ‘badly behaved’ applications with pulse-like traffic to crowd out all other traffic. You can get into trouble faster, but can’t recover faster. So your network collapses, and round-trip times can become anything from 500ms to 30 seconds. Yes, you read that right.

It reminds me of the old joke about slow postal services: When they charge 46p for a stamp, it’s only 10p for delivery, and 36p for unwanted storage on the way.

The underlying reality of a network is that it is like a microphone on a stage. When you increase the volume past a certain point you get nasty feedback effects. That takes you from a predictable region of operation into a chaotic one. This destroys the performance of your applications. It’s OK for your phone to buzz, but it’s not good news when your whole network rings.

This phenomenon is so counter-intuitive, it feels hard to believe. So why don’t we hear more about it? One reason is that nobody bothers to look. But in the real world, it happens all the time – especially in places like households with children or shared student accommodation, which tend to mix more and different types of traffic together. The bufferbloat phenomenon is just a special case of a problem endemic to all packet networks today.
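For the bufferbloat special case, the arithmetic behind multi-second delays is simple: a packet that joins the back of a full buffer has to wait for everything ahead of it to be sent. The buffer size and link speed below are hypothetical, chosen only to show the scale.

```python
def buffer_drain_seconds(buffer_bytes, link_mbps):
    """Time for a full buffer to empty onto a link, in seconds."""
    return buffer_bytes * 8 / (link_mbps * 1_000_000)


# Hypothetical home router: 256 KiB of buffer in front of a 1 Mbit/s uplink.
delay = buffer_drain_seconds(256 * 1024, link_mbps=1)
print(f"a full buffer adds about {delay:.1f} s of queueing delay")
```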

It also doesn’t get noticed because when you upgrade bandwidth, traffic loads temporarily stay the same, so you tend to keep away from the unpredictable region of operation. But over time, load rises back up as the extra bandwidth (quantity) attracts an increased number of users, devices and applications. This heterogeneity in turn generates more of that pulse-feedback effect, and you can end up with worse application performance than before you started.

So how to fix it? You need to think very differently about networks.

Think of trading, not transmission

Networks can be thought of as systems that trade space for time. By that, we mean they provide the illusion of collapsing the world to a single point, but at the cost of smearing the traffic with delay, and in extremis with loss. Stop thinking of networks as systems for transmitting data. Networks are systems for trading load, loss and delay. This worldview is not optional: that’s the fundamental, mathematical reality.

There are three basic reasons for networks behaving badly when we saturate queues with traffic:

  • Trade-offs are done at different or inappropriate layers (including between sub-classes of traffic within layers).
  • Trade-offs are dispersed physically across the network.
  • Trade-offs are time-lagged.

The diagram below captures these three effects.

Networks are trading spaces between classes of traffic across space and time

For instance, when TCP detects a packet loss, it assumes the network is overloaded, and backs off the rate of sending. This happens in isolation for one flow, and as a control loop it works a thousand times slower than the phenomenon it is trying to manage (a single contended link).
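To get a feel for that 'thousand times slower' gap, compare how long one packet occupies a link with how long a loss signal takes to get back to the sender. The figures below are illustrative assumptions, not measurements: a 100 Mbit/s link, 1500-byte packets and a 120 ms round trip.

```python
def timescale_ratio(link_mbps, packet_bytes, rtt_ms):
    """How many packet-times fit into one round trip."""
    # Time one packet occupies the link: bits / (bits per microsecond).
    packet_time_us = packet_bytes * 8 / link_mbps
    return rtt_ms * 1000 / packet_time_us


# Hypothetical numbers chosen for illustration.
ratio = timescale_ratio(link_mbps=100, packet_bytes=1500, rtt_ms=120)
print(f"the queue can change {ratio:.0f} times before the sender reacts once")
```

With these numbers, the queue at the contended link can fill and empty on a 120 microsecond timescale, while the sender hears about it no sooner than one round trip later.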

No ‘get out of jail free’ card

Applications at the edge can’t adapt to transient effects that are momentary at far-distant places, or buried at layers of the stack they can’t access. The information about the effect cannot travel fast enough. This is a fundamental limit to the design idea that brought us the Internet in the first place, the ‘end to end principle’.

No amount of clever software at the edge can get you out of the problem of your network going from its predictable to chaotic behaviour patterns. That clever software can even induce new resonance effects and chaotic failure modes.

No amount of extra bandwidth can save you either. Indeed, that approach is going to drive the industry to bankruptcy. Bandwidth is not free, it takes energy, and we can’t afford the electricity bills for unlimited bandwidth.

Better than ‘best effort’

To escape from this problem, you need to make a simple but significant change: ideally do all the loss and delay trading at a single layer, place and time. In practice on real networks with multiple attachments to backbones, plus intermediate content delivery systems, you need to do it at 3-4 places along the end-to-end path. With some clever mathematics, you can make this process compositional, and control the end-to-end loss and delay. Then you don’t have the nasty delay-feedback loops, chaotic behaviour, and cost of upgrades using the failed bandwidth metaphor.
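As a flavour of what 'compositional' can mean, here is a minimal sketch. It is not the mathematics referred to above, just the most basic version of the idea, and it assumes each hop behaves independently of the others. Each hop is described by a table of delay in milliseconds against probability, where whatever probability is missing from the table is loss. Composing two hops then means adding delays and multiplying probabilities. The hop names and figures are made up.

```python
from functools import reduce


def compose(hop_a, hop_b):
    """Combine two independent hops into one end-to-end description.

    Each hop is a dict {delay_ms: probability}; probabilities may sum to
    less than 1, and the shortfall is that hop's loss probability.
    """
    combined = {}
    for delay_a, prob_a in hop_a.items():
        for delay_b, prob_b in hop_b.items():
            total = delay_a + delay_b
            combined[total] = combined.get(total, 0.0) + prob_a * prob_b
    return combined


def loss_probability(hop):
    """Probability that a packet never arrives."""
    return 1.0 - sum(hop.values())


# Hypothetical hops: a busy access link, a quiet core, and a far-end access link.
access = {5: 0.90, 20: 0.09}   # 1% loss, occasional 20 ms queueing
core = {2: 0.999}              # 0.1% loss, steady 2 ms
path = [access, core, access]

end_to_end = reduce(compose, path)
print("end-to-end delays (ms):", sorted(end_to_end))
print(f"end-to-end loss: {loss_probability(end_to_end):.4f}")
```

The design point is that the end-to-end figure is built only from per-hop figures, so each of the 3-4 trading points can be given its own loss and delay budget.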

What we get is a fundamentally new category of network architecture.

We can think of old-fashioned circuit networks as the Network of Promises. You get a fixed loss and delay profile, and a guaranteed load limit. The downside is that there is no graceful degradation as the system saturates, just a sudden cliff as new traffic is rejected. Furthermore, all traffic must pay for premier first-class delivery, whether it needs it or not, and any idle capacity cannot be resold.

The Internet can be thought of as the Network of Possibilities. Nothing is guaranteed, bar the chaos at saturation, and counter-intuitive results from adding bandwidth in the wrong way. Failure to properly understand how loss and delay accrue leads to effects like in the example at the start of this essay.

The Network of Probabilities

We now have a third option that gives us the best of both worlds: the generativity of the Internet, plus the determinism of circuit networks.

Three kinds of network: promises, probabilities and possibilities

The Network of Probabilities works on different principles. It manages aggregates of flows, not just individual packets. It does trading, not prioritisation. It also works with the fundamental mathematical resource of the network, which is contention, not bandwidth. In doing so, it allows us to match supply and demand in ways that were previously unthinkable, as well as to fix many of the core design and operational issues of the Internet.

In the next newsletter, you can look forward to the three fundamental laws of networking, and to more examples and elaboration on the failures of bandwidth thinking and the benefits of contention-based thinking.

Together, we’re going to build a better kind of Internet. The current prototype has done its job.