HOW DO YOU TELL THE TELCO THE PROBLEM IS IN THEIR NETWORK?
Most large distributed systems depend on software, hardware, and human operators and maintainers all functioning correctly. Failure of any one of these elements can disrupt or bring down an entire system. One such distributed system, the US Public Switched Telephone Network (PSTN), is the US portion of possibly the largest distributed system in existence.[1] Like all telephone switching networks, the PSTN performs a fairly simple task: it connects point A with point B. Paradoxically, this seemingly trivial task requires some of the most complex and sophisticated computing systems in existence. Software for a switch with even a relatively small set of features may comprise several million lines of code. The PSTN contains thousands of switches. Switches include redundant hardware and extensive self-checking and recovery software. For several decades, AT&T has expected its switches to experience not more than two hours of failure in 40 years [2], a failure rate of 5.7 x 10^-6.
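The failure-rate figure above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the figure means the fraction of time a switch is unavailable and that a year is 365 days (leap days ignored); the variable names are illustrative only.

```python
# Assumption: "failure rate" is read here as the fraction of total
# service time during which a switch is down.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap days ignored

downtime_hours = 2   # allowed failure time stated in the text
service_years = 40   # service period stated in the text

failure_rate = downtime_hours / (service_years * HOURS_PER_YEAR)
print(f"{failure_rate:.1e}")  # prints 5.7e-06
```

Two hours out of 350,400 hours of service matches the 5.7 x 10^-6 quoted in the text.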
The PSTN's dependability stems from a design that successfully exploits the loose coupling of system components. Because the PSTN has many similarities with other types of distributed systems, the analysis may suggest factors to consider in the design of distributed systems in general. Major sources of failure were human error (on the part of both telephone company personnel and others), acts of nature, and overloads. Overloads caused nearly half of all downtime (44 percent) in terms of outage minutes. An unexpected finding, given the complexity of the PSTN and its heavy reliance on software, was that software errors caused less system downtime (2 percent) than any other source of failure except vandalism. Hardware and software failures were similar in terms of average number of customers affected (96,000 and 118,000) and duration of outage (160 and 119 minutes). Errors on the part of telephone company personnel and acts of nature caused similar amounts of downtime (14 and 18 percent).
Usually we can ignore the network when considering telephone problems. In
Figure 1, we see the usual simplified view of the network. Normally, this
simplistic view is sufficient to allow us to solve our difficulties. After all,
everything in the cloud is the Telco's responsibility, right? True enough, and
network problems will frequently resolve of their own accord. However, sometimes
we can't wait for someone else to discover and fix the problem. Put on your
Sherlock Holmes hat and let’s investigate!
The key to troubleshooting network problems is persistence. If we make enough
calls, and we eventually get one that does not fail, this tells us several
things:
• The problem is not our equipment. Terminal equipment (such as a telephone or
codec) should not care how many calls you make; it should act the same way on
every call.
• The problem is “acting like” a network (e.g. trunk) problem in that it is not
absolute; rather, it is probabilistic.
Generally, we will want to make 15 calls, carefully keeping track of the number
of calls where the problem occurs (we can then calculate a “success rate” from
this raw data). Next, we reverse the direction of the call and place another 15
calls. If the success rates are markedly different, we can be very suspicious
that this is a network problem. The logic for this conclusion is as follows: on
each call the same customer equipment and the same Central Office switches will
be used. However, as we have seen, trunk selection is dynamic, so a difference
between the two directions points at the trunks rather than at the fixed
equipment. Another clue that the problem may be network related is a success
rate that varies substantially with the time of day. You will also sometimes
note that the success rate of Circuit Switched Data (CSD) calls at 56 kbps
differs from that of CSD calls at 64 kbps, and both will usually behave
differently from voice calls.
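The tally described above can be sketched in a few lines of Python. This is only an illustration: the function names, the sample outcomes, and the 25-point threshold for "markedly different" are all assumptions, since the text gives no numeric cutoff.

```python
def success_rate(outcomes):
    """Return the fraction of calls that completed without the problem."""
    if not outcomes:
        raise ValueError("need at least one call")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def looks_like_network_problem(a_to_b, b_to_a, threshold=0.25):
    """Flag a marked difference between the two call directions.

    The threshold is an illustrative assumption, not a figure from the text.
    """
    return abs(success_rate(a_to_b) - success_rate(b_to_a)) >= threshold


# Hypothetical tallies: True = call succeeded, False = problem occurred.
a_to_b = [True] * 14 + [False] * 1  # 14 of 15 calls succeeded
b_to_a = [True] * 8 + [False] * 7   # 8 of 15 calls succeeded

print(f"A to B success rate: {success_rate(a_to_b):.0%}")
print(f"B to A success rate: {success_rate(b_to_a):.0%}")
print("Suspect the network:", looks_like_network_problem(a_to_b, b_to_a))
```

With only 15 calls in each direction the rates are rough estimates, so a small difference should not be over-interpreted. Repeating the tally at another time of day, or for 56 kbps and 64 kbps CSD calls, gives further data points to compare.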
THE NETWORK - THE BIG PICTURE
Before we go on to more detailed troubleshooting, let’s examine the network in
greater detail. We will look at the USA network, but you will find a similar
topology in other parts of the world.
Long distance access Tandem switches and trunks. Figure 5 adds the network
facilities that allow A to make long distance calls from CO1. USA
telecommunications policy requires that users be permitted “equal access” to
various competing long distance carriers. Therefore, the local Telcos have
something called an “Access Tandem Switch” that allows for this flexibility.
This means that you may observe the somewhat paradoxical situation where a
problem that occurs only with long distance calls is actually due to the local
Telco.