Quality of Experience A strategic competitive advantage of Microsoft Unified Communications

download.101com.com/techlibrary/vts/ocs_qoe.pdf



Quality of Experience

A strategic competitive advantage of Microsoft Unified Communications


The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication. This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation. Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property. Unless otherwise noted, the companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted in examples herein are fictitious. No association with any real company, organization, product, domain name, e-mail address, logo, person, place, or event is intended or should be inferred. © 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows Vista, Active Directory, Outlook, PowerPoint, SQL Server, Visual C++, and Visual J# are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. All other trademarks are property of their respective owners.



Contents

Executive Summary

1 Voice Quality is key to the success of IP telephony implementations

1.1 Voice Quality: the subjective and elusive goal

1.2 Assessing Voice Quality: a complex task

2 The traditional approach to Voice Quality on IP networks: managing Network Service Quality with QoS and CAC

2.1 The challenge of Voice Quality on IP networks

2.2 Network Service Quality, a required condition for Voice Quality with traditional IP telephony

2.3 Traditional techniques for management of Network Service Quality

3 The traditional approach is complex and increasingly ineffective

3.1 Traditional IP telephony solutions are challenged to deliver Voice Quality

3.2 QoS techniques have limits, risks and costs

3.3 The traditional approach has a limited scope

3.4 Providing Network Service Quality is not sufficient to ensure Voice Quality

4 Microsoft UC Quality of Experience: a comprehensive approach to Quality

4.1 Quality of Experience, a new approach

4.2 A comprehensive, user-focused commitment to perceived quality

4.3 Intelligent, adaptive end-points

4.4 Measuring and monitoring the user experience in real time

4.5 A media stack designed for quality on IP networks

4.6 Video: the next frontier

4.7 Evidence of the superior Quality of Experience of Microsoft UC

5 Network design and management for Microsoft UC

5.1 Right-provisioning for Microsoft UC

5.2 Using Microsoft UC on a QoS enabled network

Conclusion


Executive Summary

Voice Quality, as measured by the subjective perception of users, is key to the success of IP telephony implementations. Managing it, however, is a very complex task, because traditional enterprise voice solutions lack comprehensive real-time measurement of Voice Quality. As a substitute for direct management of Voice Quality, traditional IP telephony providers (in particular those with a technical and cultural heritage in network solutions) have emphasized network-level management of Network Service Quality, relying in particular on QoS techniques.

Traditional IP telephony solutions depend on QoS to a large extent because they rely on media capabilities inherited from the world of circuit-switched digital telephony. QoS is indeed required on IP networks for those solutions to provide acceptable Voice Quality. That network-centric approach, however, is showing its limits. Network-wide QoS techniques tend to be complex and require ongoing management. Even then, there is increasing evidence that they fail to guarantee that Voice Quality will match expectations. And as users become increasingly mobile, solutions that require them to be on the QoS-enabled network in order to have any chance of experiencing Voice Quality leave more and more users underserved.

We introduce Microsoft UC Quality of Experience, an innovative approach that aims to provide all users with the best possible quality, anytime and anywhere. It combines: a comprehensive, user-focused quality program incorporating all significant influencing parameters; intelligent, adaptive end-points with the real-time capability to monitor, pilot, optimize and deliver the Microsoft UC Quality of Experience; real-time metrics of users' perceived quality of the actual call experience for all calls, collected in Metrics CDRs and processed by a UC Monitoring Server; and a new media stack optimized for unmanaged IP networks, capable of real-time adaptive and corrective actions that continuously optimize the user's subjective experience.

There is already ample evidence of the superior Quality of Experience of Microsoft UC. Besides the use of its key components in the recent versions of Windows Live Messenger, now serving in excess of 1 billion minutes of voice calls per month, we present the results of a third-party benchmarking study. That study found that the Microsoft solution provides better Voice Quality than the Cisco CallManager version tested in virtually all the conditions enterprise users will likely encounter.

We provide guidelines for network design and management for Microsoft UC, emphasizing the need for a base level of network capacity through right-provisioning of the network. Microsoft UC's voice stack uses bandwidth more efficiently than traditional solutions: right-provisioning for Microsoft UC voice scenarios generally requires less bandwidth than for those traditional solutions. Where right-provisioning cannot be fully achieved, however, local network-layer management may be considered. If implemented, such network management should be done at the class-of-traffic level rather than on a per-flow basis.

Microsoft UC provides an excellent experience on good networks, satisfactory quality on mediocre networks, and successful calls on very aggressive, unmanaged networks where traditional solutions would not work at all, all with superior economics.


1 Voice Quality is key to the success of IP telephony implementations

1.1 Voice Quality: the subjective and elusive goal

The ultimate measure of the performance of any service is determined by its users. In the case of voice[1], that ultimate measure is the subjective, in-context perception of Voice Quality[2] by the listener.

Such subjective perception incorporates and reflects intelligibility, clarity and pleasantness of speech, absence of defects, and overall conformity as perceived by the listener. This goes beyond faithful reproduction of the literal content to also include appropriate perception of speaker identity, emotion, intonation and timbre, as well as the absence of annoying effects. Indeed, perception can be affected by an almost limitless variety of effects (delay, background noise and interference, clipping, distortion, echo, pops and clicks, signal cuts or drops…).

Opinions of what constitutes good or bad voice quality are user-dependent and vary with the customer's context and expectations (business requirements, cultural differences, local environment, hardware and software…). For example, customers are probably more willing to accept poor quality in a remote undeveloped area than in a modern city. Customers also routinely experience and accept lower quality on mobile phones than they typically would on a landline, but that difference in performance (and in expectations) is progressively narrowing[3].

In the enterprise voice market, where voice solutions have existed for a long time, there are well established, rather stringent Voice Quality expectations. As traditional IP telephony solutions (i.e. IP-PBX from the leading vendors) gain in market share, the ability of these vendors to meet enterprises’ Voice Quality requirements and the complexity of their technical solutions aimed at providing Voice Quality for IP telephony are topics for debate.

While everybody agrees that Voice Quality should be a fundamental goal of any voice solution, the complexities of measuring Voice Quality and the limitations of traditional IP telephony systems have led to indirect and complex approaches to the management of Voice Quality, with mixed results on actual performance. In the following chapters, we will explain why these approaches, which consist of managing network-level parameters through Quality of Service techniques, can be both overly expensive and frequently ineffective.

We will introduce later in this document Microsoft's comprehensive approach to Quality of Experience, based on proactive design for quality, direct monitoring and management of the end-user experience, and an advanced media stack running on adaptive end-points. We will show why this approach is a more comprehensive, more effective and more economical alternative.

[1] Approaches to Voice Quality scoring and management are more broadly used than those dealing with multimedia quality. This document addresses Quality of Experience as a concept applicable to both audio and video media, but it concentrates its explanations on voice.
[2] The term "Voice" is used here interchangeably with the term "Speech".
[3] http://www.jdpower.com/global/press-releases/pressrelease.asp?StudyID=1108


1.2 Assessing Voice Quality: a complex task

1.2.1 Subjective evaluation and MOS

To provide a common reference framework for user perception, the International Telecommunication Union (ITU) standardized, following extensive series of subjective tests, a methodology for measuring several aspects of Voice Quality in its recommendations P.800/P.830. These recommendations define several scales known as Mean Opinion Scores (MOS). Each score is generated by arithmetic averaging of the results of many standardized subjective tests in which panels of listeners rate their perception of aspects of the communication.

For assessment of Voice Quality in IP telephony, the most commonly used MOS scale is the Listening Quality (LQ) scale, which measures how end-users perceive the quality of speech presented for listening. In an LQ test, listeners are presented with a range of test sentences, read aloud by both male and female speakers over the communications medium being evaluated. Each listener rates each sentence using the following scheme:

MOS   Quality of Speech
5     Excellent
4     Good
3     Fair
2     Poor
1     Bad

The use of the LQ scale for assessing voice quality is so common that many people use MOS interchangeably with LQ MOS. There are additional MOS scales defined by the ITU, including Conversational Quality, Listening Effort and Conversational Effort. Unless otherwise specified, in the remainder of this document the term MOS refers to the LQ MOS.
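As described above, a panel MOS is simply the arithmetic mean of the individual 1-to-5 listener ratings. A minimal sketch, using hypothetical panel data (the listener scores below are invented for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def lq_mos(ratings):
    """Arithmetic mean of 1-5 listener ratings, as in an ITU-T P.800
    listening-quality test. Returns (MOS, 95% confidence half-width)."""
    m = mean(ratings)
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return m, half_width

# Hypothetical scores from a ten-listener panel for one test condition
panel = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4]
mos, ci = lq_mos(panel)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")  # MOS = 4.00 +/- 0.41
```

The confidence interval is a reminder that a MOS is a statistical estimate, not an exact measurement, which is why comparisons across experiments are delicate.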

It is important to understand that a MOS resulting from a subjective experiment depends on the range of experiences presented to the listeners in that experiment. It is generally statistically meaningful to compare absolute MOS values taken from the same experiment (assuming it was conducted properly), but it is much more difficult to compare absolute MOS values resulting from different experiments, especially if the methodology and experimental conditions (even while conforming to the ITU recommendations) were not identical. For example, it is often said that a MOS of 4.0 or greater is considered "toll quality". That figure can typically be obtained by running subjective experiments that only expose listeners to traditional circuit-switched quality calls and to impairments encountered in circuit-switched calls. If, however, a test sample that received a MOS of 4.0 in those experiments were presented to listeners as part of a new experiment that also included richer test samples than traditional toll voice (such as wideband audio), its MOS in the new experiment would be significantly lower than 4.0. In that light, frequently quoted absolute characterizations of common experiences (such as "PSTN toll calls have a MOS of 4.0 or more" or "GSM mobile calls have MOS ranging from 2.9 to 3.9") should be treated more as a convenient (but misleading) marketing summary than as a statistically meaningful metric.

One of the most complex aspects of subjective perception is its dynamic nature: it is not only the occurrence of changes during the call that impacts perceived quality, but how and when those changes occur. Users are more prone to notice and penalize quality decreases than quality increases, and they penalize sudden drops in quality (even brief ones) more than gradual reduction. Similarly, a period of bad quality near the end of a call lowers the MOS for overall call quality much more than the same period earlier in the call (the "recency" effect).

1.2.2 From subjective evaluation to objective estimates

While subjective testing is meaningful for statistically comparing user perceptions of various media, the operational complexity of panel testing of many different, standardized samples makes it impractical as a real-time, non-intrusive measurement tool suitable for IP telephony. To obtain automated estimates of MOS without requiring a panel of users for scoring, various algorithms have been developed that provide estimates and predictors of how users would score a specific experience.

For example, one of the most accurate standard algorithms is called the Perceptual Evaluation of Speech Quality (PESQ) algorithm (ITU-T Rec. P.862). PESQ and similar algorithms such as its predecessor PAMS (Perceptual Analysis Measurement System) measure one-way quality and are designed for use with active tests where a speech signal is injected into the system under test, and the degraded or rendered output is compared with the input signal to predict the MOS. PESQ automates the evaluation of sound transmission quality using repeatable, objective calculations that attempt to incorporate the necessary subjectivity of the human factor.

These algorithms first time-align the clips, and then map different parameters of the clips onto psychophysical speech (psychoacoustic) representations or, in other words, how speech is perceived by the human ear and brain. Once the mapping of signals is complete, a “cognitive subtraction” is performed and a quantitative measurement of the results is produced. The results are then algorithmically correlated to benchmark MOS.


1.2.3 Experimental measurement

Actual experimental measurement of voice quality in traditional telephony and IP telephony networks is very complex and typically only done to investigate an already suspected issue. The most common form of experimental measurement is active measurement, which injects a known reference signal and compares it to the degraded signal to predict perceived quality, using algorithms such as the ones described above. That approach generally requires setting up and running tests separately from actual use of the system, by inserting measuring hardware into the network and sending test data across it. Most common algorithms, such as PESQ, require active measurement.

Passive measurement is a newer, more complex and less commonly used technique. Its advantage is that it allows in-place assessment of actual quality of the live traffic. In passive measurement, no reference signal is used; instead, the degraded signal is inspected as received, and a model is used to predict perceived quality.

Most current passive measurement models only consider transport-layer effects (loss, jitter, and delay for the session) to estimate a MOS. Looking at the transport layer can provide a rough understanding of the quality of a call, but it does not take into account other important aspects that can be discovered only by examining the payload, i.e. the actual speech data being transmitted. Payload examination can reveal important information such as noise level, echo, gain and talk-over (or double-talk). Without this type of information, a call could receive a very high MOS based on network conditions even though problems like echo and noise make the communication unacceptable. Passive payload measurement approaches are more algorithmically and computationally complex than passive network measurement approaches.
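To make the transport-layer-only estimation concrete, the sketch below implements a simplified E-model (ITU-T G.107) of the kind such passive monitors typically use, following the well-known Cole-Rosenbluth simplification. The impairment constants are illustrative defaults, not values from any particular product:

```python
def r_to_mos(r):
    """ITU-T G.107 mapping from transmission rating R to estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

def passive_mos_estimate(one_way_delay_ms, loss_pct, ie=0.0, bpl=4.3):
    """Transport-layer-only MOS estimate via a simplified E-model.

    Follows the Cole-Rosenbluth simplification of ITU-T G.107.
    `ie` (equipment impairment) and `bpl` (loss robustness) are
    codec-dependent constants; the defaults are illustrative values
    roughly matching G.711 with packet-loss concealment."""
    r = 93.2
    # Delay impairment Id: small linear term, plus a penalty past ~177 ms
    r -= 0.024 * one_way_delay_ms
    if one_way_delay_ms > 177.3:
        r -= 0.11 * (one_way_delay_ms - 177.3)
    # Effective equipment impairment Ie-eff under random packet loss
    r -= ie + (95 - ie) * loss_pct / (loss_pct + bpl)
    return r_to_mos(r)

print(round(passive_mos_estimate(50, 0.0), 2))   # healthy LAN: ~4.4
print(round(passive_mos_estimate(300, 5.0), 2))  # congested path: ~1.3
```

Note how the model sees only delay and loss: the echo or noise problems described above would leave these inputs, and therefore the estimated MOS, completely unchanged.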

Voice Quality measurement for traditional IP telephony solutions is too heavy and technically complex to allow permanent and ubiquitous measurement of live traffic[4] in typical enterprise environments. Because of the cost and complexity involved, those measurements are rare and mostly done for remedial purposes on the subset of the network where a problem is encountered, for a short campaign and often by a specialized consulting firm.

[4] Systematic measurement of Voice Quality of an IP-PBX implementation would generally involve inserting specialized hardware elements (sensors) in the network to intercept all relevant flows. These sensors and solutions are generally provided by specialized vendors. Even then, the measurement would generally not occur at the actual end-points, but rather at specific nodes of the network, and could miss a number of issues downstream from the sensors. Testing can also be difficult when information travels from an enterprise to a carrier's network.

2 The traditional approach to Voice Quality on IP networks: managing Network Service Quality with QoS and CAC

2.1 The challenge of Voice Quality on IP networks

Internet Protocol networks (IP networks) provide best-effort data delivery by default. Best effort keeps the complexity in the end-hosts, so the network can remain relatively simple and economical. The fundamental principle of IP networks is to "leave complexity at the edges and keep the network core simple". This scales well, as evidenced by the ability of the Internet to support its host count and traffic growth without any significant change in the principles on which it operates. If and when network service requests from hosts exceed network capacity, the network does not abruptly deny service to some users but instead degrades progressively in its performance for all users[5], by delaying the delivery of packets or even by dropping some of them.

[5] This is very different from traditional, circuit-switched network behavior, where the experience of users granted access to the network resources would be virtually unchanged but users in excess of the network capacity would be denied service altogether.

The resulting variability in packet delivery does not adversely affect typical Internet applications (bursty and sometimes bandwidth-intensive but not very delay-sensitive "elastic" applications such as email, file transfer and the Web) until network performance degrades very severely: if data packets arrive within a reasonable amount of time and in almost any order, both the application and the user are satisfied. Delayed packets are likely to eventually arrive because these applications typically use TCP at the transport layer. TCP is a connection-oriented protocol with built-in adaptation mechanisms that ensure error-free, ordered data transfer, detection and retransmission of lost packets, discarding of duplicate packets, and flow control, also known as congestion throttling.

On the other hand, real-time applications and frameworks that use traditional codecs (in particular traditional IP telephony delivered by current IP-PBX) cannot withstand any but the slightest degradation in packet delivery; they are said to be "inelastic". They do not work well with protocols that throttle bandwidth, nor do they have any use for resent packets: a packet that does not arrive in time is worthless as a carrier of real-time communication and is effectively lost, degrading performance. These applications generally use UDP, a connectionless protocol that is faster and more efficient than TCP for time-sensitive purposes, but that does not provide the same richness: there is no mechanism to ensure that data packets are delivered, or that they are delivered in sequential order. The receiving node must restructure IP packets that may be out of order, delayed or missing, while ensuring the proper time consistency of the audio stream, attempting to reconstruct the flow on the receiving end[6]. To minimize the impact of packet delivery effects, those applications and frameworks rely on network-layer management and, wherever possible, on provisioned (reserved) bandwidth.
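The receiving-end reconstruction described above is typically handled by a jitter buffer. Below is a minimal illustrative sketch (not the implementation of any particular product): packets arriving out of order over UDP are held briefly and released in sequence, while packets arriving after their playout slot are discarded rather than replayed.

```python
import heapq

class JitterBuffer:
    """Minimal reordering jitter buffer (illustrative sketch only).

    Out-of-order packets are held until the buffer exceeds `depth`
    packets, then released in sequence-number order; late packets
    that arrive after playout has moved past them are discarded."""

    def __init__(self, depth=3):
        self.depth = depth
        self.heap = []       # min-heap ordered by sequence number
        self.next_seq = 0    # next sequence number expected by playout

    def push(self, seq, payload):
        if seq < self.next_seq:
            return           # too late: real-time playout has moved on
        heapq.heappush(self.heap, (seq, payload))

    def pop_ready(self):
        """Release in-order packets once the playout depth is exceeded."""
        out = []
        while len(self.heap) > self.depth:
            seq, payload = heapq.heappop(self.heap)
            out.append((seq, payload))
            self.next_seq = seq + 1   # a gap (lost packet) is skipped
        return out

    def flush(self):
        """Drain everything remaining, in order (end of stream)."""
        out = []
        while self.heap:
            out.append(heapq.heappop(self.heap))
        return out
```

Feeding sequence numbers 2, 0, 1, 4, 3, 5 through a depth-3 buffer releases 0, 1, 2 during the stream and 3, 4, 5 on flush. The depth trades delay for loss: a deeper buffer absorbs more jitter but adds directly to the processing delay discussed later in this document.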

In summary, in traditional IP telephony, Voice Quality is very sensitive to any variation in packet delivery by the network; it is also very impractical to assess and manage directly. This has led to the traditional approach of concentrating on actively managing the performance of the network[7] as a proxy for Voice Quality, to a large extent because network performance can be more readily measured, quantified and managed than Voice Quality itself.

2.2 Network Service Quality, a required condition for Voice Quality with traditional IP telephony

Network Service Quality (NSQ) is a network[8] engineering and operations concept describing the performance of a network in transporting selected network flows within specified objective metrics. While the management of NSQ originated on traditional networks such as ATM, it is now almost exclusively focused on packet-switched networks, and in particular on IP networks, both because of the significant growth of Voice over IP and because of the nature of IP networks.

NSQ is generally measured and managed against the traditional network behavior metrics that most impact traditional IP telephony applications: real-time effective bandwidth, delay (or latency), packet loss and jitter[9].

2.2.1 Real time effective bandwidth

Real time effective bandwidth is the bandwidth (generally expressed in kbps) of an end-to-end network path that is actually available at a given point in time to applications or network flows. On a shared network, this measure fluctuates under the influence of flows generated by other applications, of flows of the same application between other users, and of up- and downtime of network elements and links. In most cases it is driven by a few congestion points on the network, often on the first or last mile for consumers and on a WAN link for corporate networks.

[6] This functionality is usually accomplished by means of a jitter buffer, and will be discussed in more detail.
[7] For example, QoS is the first best practice quoted by Cisco in http://www.cisco.com/en/US/tech/tk652/tk701/technologies_white_paper09186a00800cb7fd.shtml: "Cisco ensured high quality voice calls by accommodating QoS features on its entire network".
[8] For simplicity, in this document we will generally use the terms "network" and "low level" to describe activities occurring at layers 1 through 3 of the OSI model.
[9] There is one type of network event, network re-routing, that generates extremely disruptive but infrequent network impairment. Re-routing occurs upon link or node failure. Seen by the end-user, its effect can resemble a complete loss of connection during the period of disruption to the delivery of traffic, until the network re-converges on the new topology. Packets for destinations previously reached by traversing the failed component may be dropped or may suffer looping. Traditionally such disruptions have lasted at least several seconds. Recent advances in routers have reduced this interval to under a second for carefully configured networks using link-state Interior Gateway Protocols (IGP). Fast re-routing frameworks are also being developed; see http://www.ietf.org/internet-drafts/draft-ietf-rtgwg-ipfrr-framework-06.txt.


In traditional IP telephony, real time effective bandwidth should always exceed the actual rate of the payload across the network, which is the set, constant throughput any given flow will require. Therefore ensuring availability of effective bandwidth at all times is a key concern of network engineers in traditional IP telephony deployments.

Most traditional voice codecs’ throughput requirements range between 35 kbps and 100 kbps per stream on a fully loaded basis, depending on the codec.
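The "fully loaded" figures above include per-packet header overhead, which dominates at low codec rates. A rough sketch of the arithmetic (the header sizes are the standard RTP/UDP/IPv4 values; the layer-2 figure assumes plain Ethernet and is illustrative):

```python
def voip_stream_kbps(codec_kbps, ptime_ms, l2_overhead_bytes=18):
    """Fully loaded one-way bandwidth of a VoIP stream.

    Adds RTP (12 B) + UDP (8 B) + IPv4 (20 B) headers and a layer-2
    overhead (default 18 B, plain Ethernet II frame) to the codec
    payload carried in each packet."""
    payload_bytes = codec_kbps * ptime_ms / 8          # kbps * ms / 8 = bytes
    packets_per_s = 1000 / ptime_ms
    per_packet = payload_bytes + 12 + 8 + 20 + l2_overhead_bytes
    return per_packet * packets_per_s * 8 / 1000       # back to kbps

print(round(voip_stream_kbps(64, 20), 1))  # G.711 at 20 ms packets: 87.2
print(round(voip_stream_kbps(8, 20), 1))   # G.729 at 20 ms packets: 31.2
```

Both results fall inside the 35-100 kbps range quoted above once overhead is counted (G.729 lands just below it on plain Ethernet; heavier link layers such as tunnels push it higher), and they show why an 8 kbps codec does not consume anywhere near eight times less network bandwidth than a 64 kbps one.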

2.2.2 Delay (or latency)

Delay is the measure of the time required for a voice signal to traverse the network. It is called one-way delay when measured end-point to end-point. Round-trip delay, also called Round Trip Time (RTT) is measured end-to-end and back. Delay is generally expressed in milliseconds.

Delay results from the time it takes the system or network to digitize, encrypt where appropriate, packetize, transmit, route, buffer (often several times), de-packetize, recombine, decrypt and render a voice signal. These sources of IP telephony delay can be grouped into four main categories:

Processing delay includes the time required to collect a frame of voice samples before the speech encoder can process it, the actual encoding, encrypting if appropriate, and packetizing for transmission, plus the corresponding reverse process on the receiving end, including the jitter buffer used to compensate for varying packet arrival delay. The complete end-to-end processing delay is often in the 60 ms to 120 ms range when all contributing factors are taken into account. The processing delay is essentially within a fixed range determined by the vendor's technology and implementation choices. Encoding and decoding may be repeated several times, however, if there is any inline transcoding from one codec to another (for example for a hand-off between networks), in which case the accumulated processing delay can become disruptive.

Serialization delay is a fixed delay required to clock a voice or data frame onto a network interface, placing the bits onto the wire for transmission. The delay will vary based on the clocking speed of the interface. A lower speed circuit (such as a modem interface or smaller transmission circuit) will have a higher serialization delay than a higher speed circuit. It can be quite significant on low-speed links and occurs on every single link of a multi-hop network.
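Serialization delay is simply packet size divided by line rate, which is why it matters only on slow links. A quick illustration (the 218-byte packet size is an assumed example, corresponding roughly to a G.711 frame with headers):

```python
def serialization_delay_ms(packet_bytes, link_kbps):
    """Time to clock one packet onto a link: packet bits / line rate."""
    return packet_bytes * 8 / link_kbps

# A 218-byte VoIP packet on links of increasing speed (delay per hop):
for kbps in (128, 1_544, 100_000):
    print(f"{kbps:>7} kbps: {serialization_delay_ms(218, kbps):.3f} ms")
```

On a 128 kbps circuit each hop adds over 13 ms, while on Fast Ethernet the contribution is negligible; since the delay recurs on every link, multi-hop paths over slow circuits accumulate it quickly.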

Network delay is mostly caused by inspecting, queuing and buffering of packets, which can occur at traffic shaping buffers10 (such as “leaky bucket” buffers) sometimes encountered at various network ingress points, or at various router hops encountered by the packet along the way. Network delay on the internet generally averages less than 40 ms when there is no major congestion. Modernization of routers has contributed to reducing this delay over time.

Propagation delay is the distance traveled by the packet divided by the speed of signal propagation (i.e. the speed of light in the medium). Propagation delay on transcontinental routes is relatively small (typically less than 40 ms), but propagation delay across complex intercontinental paths can be much larger. This is especially true when satellite circuits are involved, or on very long routes such as Australia to South Africa via Europe, for example, which might incur up to 500 ms of one-way propagation delay. Propagation delay can only be optimized by designing the shortest possible path links.

[10] Traffic shaping will be described in the section on QoS.

The sum of these four components creates the total delay. The ITU-T has recommended 150 ms total one-way delay (including endpoints) as the upper limit for “excellent” Voice Quality. Longer delays can be disruptive to the conversation, with the risk of talk-over effects and echo. When the one-way delay exceeds 250 ms it is likely that talkers will step over each other’s speech.

In the case of a transcontinental route with well sized links, total delay in non-congested conditions might, for example, equal 70 ms (processing) + 10 ms (serialization) + 30 ms (network) + 40 ms (propagation) = 150 ms total. Therefore IP telephony calls will frequently operate close to the limit, where even small incremental delays could impact Voice Quality.
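
The example can be written as a simple budget check (values copied from the text; the 150 ms ceiling is the ITU-T recommendation cited above):

```python
# One-way delay budget for the transcontinental example (values in ms).
budget = {
    "processing": 70,     # encode, packetize, jitter buffer
    "serialization": 10,  # clocking frames onto each link
    "network": 30,        # queuing/buffering at router hops
    "propagation": 40,    # distance / speed of signal propagation
}
total_ms = sum(budget.values())
assert total_ms <= 150, "over the ITU-T ceiling for excellent quality"
```

Any congestion spike that adds even a few tens of milliseconds pushes this budget over the ceiling.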

Network delay is the one component over which the system administrator has the most control. It can be reduced through a variety of network engineering means [11]. However the first priority of network delay engineering is often avoidance of spikes and limitation of variability [12] (i.e. jitter) due to congestion, ahead of reduction in normal delay. Of all the delay components, queuing at router hops is the most variable and unpredictable, especially in situations of congestion, and this is one of the domains where Quality of Service techniques are most frequently used.

2.2.3 Packet-loss

Packet-loss occurs when packets are sent, but not received at the final destination due to some network problem. Packet-loss is the proportion (in %) of packets lost en-route across the end-to-end network.

Packets can be designated as lost for a variety of reasons: actual errors in transmission, corruption, packets discarded from overflowing buffers or for having stayed too long in a buffer, and packets arriving too late or too far out of order to still be usable. However, the main reason by far for packet loss is packets discarded in congested routers, either because a buffer was full and overflowing, or due to methods such as Random Early Detection (RED) or Weighted Random Early Detection (WRED), which proactively drop packets to avoid congestion [13].
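
The drop decision RED makes can be sketched as follows. This is a simplified model with illustrative thresholds; real RED implementations use an exponentially weighted moving average of the queue length and adjust the probability based on the count of packets since the last drop:

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Simplified RED: never drop below min_th, always drop above max_th,
    and ramp the drop probability linearly in between (up to max_p)."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)

# With thresholds of 20 and 80 packets and max_p = 0.1, a half-full
# queue of 50 packets drops roughly 5% of arriving packets.
p = red_drop_probability(50, min_th=20, max_th=80, max_p=0.1)
```

Dropping early, before the buffer overflows, signals TCP senders to slow down; real-time UDP flows, however, simply lose the packets.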

11Such as improvements in the network architecture when possible to reduce the number of hops, avoidance of en-route encoding-decoding or transcoding processes, appropriate sizing of network elements to avoid lengthy buffering and queuing… There are also new advanced queuing technologies such as Class Based Weighted Fair Queuing/Priority Queuing that can reduce network delay and its variability in conjunction with QoS technologies. 12See for example http://www.isoc.org/inet99/proceedings/4h/4h_2.htm 13In QoS enabled network, RED is often by-passed by the Priority Queuing and applies only to lower queues; therefore higher classes may not be affected, whereas lower classes might be strongly penalized.


Well sized and managed IP backbones and LANs are designed to operate at better than 0.5% average packet loss [14]. Packet loss on end-to-end Internet routes, however, can occasionally reach 5% to 10% [15]. Wi-Fi connections can experience well in excess of 10% loss.

Several factors make packet loss requirements somewhat variable. Even with the same average packet loss, the way the packets are lost influences the impact on Voice Quality:

• There are two types of packet loss: random packet loss over time (where single packets might be dropped every so often during the call) and “bursty” packet loss (where several, contiguous packets can be lost in a short time window). Losing ten contiguous packets is worse than losing ten packets evenly spaced over an hour time span.

• Packet loss may also be more noticeable for larger voice payloads (i.e. packets representing a longer time sample) than for smaller ones, because more voice is lost in a larger payload.

• Packet loss may be more tolerable for one codec over another, because some codecs have some loss concealment capabilities.

• Packet loss requirements are tighter for tones (other than DTMF) than for voice. The ear is less able to detect packet loss during speech (variable pitch) than during a tone (constant pitch).

• Even small amounts of packet loss can greatly affect traditional TTY [16] devices' ability to work properly, as well as transmission of faxes using the usual fax protocol T.30 over IP networks; standards such as T.38 have been developed to reduce the impact of network impairments on the reliability of faxing over IP, but in practice they are not always supported, or the IP network may not be detected.
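
The difference between random and bursty loss in the first bullet can be made concrete with a toy comparison (entirely illustrative): two streams with the same 1% average loss, one spread out and one concentrated:

```python
def longest_loss_burst(received):
    """Length of the longest run of consecutive lost packets
    (True = received, False = lost)."""
    longest = run = 0
    for ok in received:
        run = 0 if ok else run + 1
        longest = max(longest, run)
    return longest

# Both streams lose 10 of 1000 packets (1% average loss) ...
random_loss = [i % 100 != 0 for i in range(1000)]          # isolated losses
bursty_loss = [not (500 <= i < 510) for i in range(1000)]  # one 10-packet burst
# ... but the bursts differ: concealment that hides a single missing frame
# cannot bridge a 10-packet gap (200 ms of audio at 20 ms per packet).
```

Average loss rate alone therefore underspecifies the impact on Voice Quality; burst length matters.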

Experienced IP telephony players state that "the default G.729 codec requires packet loss far less than 1% to avoid audible errors" [17] and that "even a 1% loss can significantly degrade the user experience with the ITU-T G.711 codec, which is considered the standard for toll quality" [18].

Figure: Estimated MOS (average and range) from the Perceptual Analysis Measurement System vs. packet loss rate for G.729 [19]
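
Curves of this kind are often approximated with the ITU-T G.107 E-model. The sketch below is a heavily simplified version: delay impairments are ignored, and the Ie/Bpl planning values are commonly cited figures for G.729 used here as assumptions, not values taken from this paper:

```python
def emodel_mos(packet_loss_pct, Ie=11.0, Bpl=19.0):
    """Map random packet loss to an estimated MOS via the E-model:
    loss raises the effective equipment impairment Ie_eff, which lowers
    the R-factor, which maps non-linearly onto the 1-5 MOS scale."""
    Ie_eff = Ie + (95.0 - Ie) * packet_loss_pct / (packet_loss_pct + Bpl)
    R = max(0.0, min(100.0, 93.2 - Ie_eff))
    return 1.0 + 0.035 * R + R * (R - 60.0) * (100.0 - R) * 7e-6

# With these assumed values, the loss-free estimate is ~4.1 and the
# estimate at 5% loss falls to roughly 3.3.
```

The qualitative shape, a steep initial drop followed by a slower decline, matches the sensitivity to small loss rates quoted above.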

[14] Verizon and AT&T indicate that average Internet backbone loss in Europe is 0.9% and in the US 0.7%.
[15] See for example http://www.isoc.org/inet99/proceedings/4h/4h_2.htm
[16] TTY: teletypewriter, traditionally used for transmission of text over voice channels in applications such as TDD (telecommunications device for the deaf).
[17] Cisco: Quality of Service for Voice over IP.
[18] Intel: Overcoming Barriers to High-Quality Voice over IP Deployments.
[19] Reynolds & Rix, BT Technology Journal Vol. 19 No 2, April 2001.


2.2.4 Jitter

Jitter is a measure of the time variability in arrival of successive packets. It is generally expressed in milliseconds.

Jitter can result from packets taking different routes (for a variety of reasons including load balancing or re-routing due to congestion) and experiencing different propagation delays on those routes, or from differences in the effects of congestion, where some packets may have to wait for long buffer queues to be emptied whereas other packets may not. Jitter may also result in packets arriving out-of-order. Typically, the more network delay, the more jitter because each processing step is likely to add jitter.
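
In practice, endpoints usually estimate jitter with the RFC 3550 (RTP) interarrival formula: a running average of transit-time differences with a gain of 1/16. A sketch:

```python
def update_jitter(jitter, transit_prev, transit_now):
    """RFC 3550 interarrival jitter estimator: exponentially smoothed
    average of the difference in transit times, with gain 1/16."""
    d = abs(transit_now - transit_prev)
    return jitter + (d - jitter) / 16.0

# Transit time (arrival minus RTP timestamp) of four successive packets, ms.
transits = [10, 12, 11, 15]
j = 0.0
for prev, now in zip(transits, transits[1:]):
    j = update_jitter(j, prev, now)
# j now holds a smoothed estimate of the 1-4 ms variation seen above.
```

The 1/16 gain makes the estimate robust to a single outlier while still tracking sustained changes, which is why RTCP reports use it.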

Effects of jitter, if untreated, are similar to the effects of very severe packet loss at the endpoint, because the packets arrive too late to be rendered to the end user. Therefore the impact of jitter is reduced through the use of a jitter buffer. That buffer, located at the receiving end of the voice connection, intentionally delays arriving packets by more than the typical jitter value in order to receive most jitter-affected packets, reordering [20] and retiming them so that the end user hears the signal as intended.

Unfortunately, jitter buffers introduce incremental delay which itself can negatively impact the experience. Therefore, jitter buffers typically contain only about 20 to 40 ms of voice [21]. Values of jitter in excess of the buffer length will lead to packets being discarded [22].
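
A fixed-depth jitter buffer can be caricatured as a playout deadline per packet: anything later than its deadline is treated as lost. In this toy model (illustrative only), sender and receiver clocks are assumed synchronized and the nominal network delay is normalized out, so the arrival time is just the jitter experienced by the packet:

```python
class FixedJitterBuffer:
    """Toy fixed jitter buffer: each packet must be played out at
    send_ms + buffer_ms; a packet arriving after that slot is discarded,
    so jitter beyond the buffer depth behaves exactly like packet loss."""
    def __init__(self, buffer_ms=40):
        self.buffer_ms = buffer_ms

    def classify(self, send_ms, arrival_ms):
        return "play" if arrival_ms <= send_ms + self.buffer_ms else "discard"

buf = FixedJitterBuffer(buffer_ms=40)
buf.classify(send_ms=0, arrival_ms=35)  # 35 ms of jitter is absorbed
buf.classify(send_ms=0, arrival_ms=60)  # 60 ms exceeds the buffer: lost
```

Enlarging `buffer_ms` rescues more late packets but adds exactly that much end-to-end delay, which is the trade-off described above.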

Traditional codecs which do not benefit from advanced adaptive jitter buffers are very sensitive to jitter. For example, they behave poorly under typical Internet conditions.

[20] Reordering is a complex task that only advanced buffers can perform. Traditional approaches make it very difficult to correct for jitter that exceeds packet length, due to the need for reordering.
[21] Avaya: IP Voice Quality Network Requirements.
[22] Most jitter buffers in devices and appliances have a fixed length; very few are dynamic, with a variable length capable of adapting to actual jitter conditions (reducing buffer length when there is little jitter, to limit the impact on delay, and increasing it when there is a lot of jitter, to avoid losing too much of the signal).


In summary, the sensitivity of traditional codecs to even moderate impairments in bandwidth, packet loss or jitter limits the applicability of traditional IP telephony solutions to networks where the NSQ is explicitly and tightly managed end-to-end, in particular through the use of Quality of Service techniques. For traditional IP telephony solutions, active NSQ management techniques are generally required to achieve end-user satisfaction.

2.3 Traditional techniques for management of Network Service Quality

Practically, NSQ is quantified against the key metrics described above (bandwidth, delay, packet loss and jitter). Target values are determined for each of those metrics and captured in requirements.

Network and IP telephony engineers in the enterprise traditionally have access to five families of tools to attempt to deliver NSQ and meet their requirements against the NSQ metrics.

2.3.1 Right-provisioning

Right-provisioning [23] (which is different from indiscriminate "over-provisioning") aims to limit traffic congestion on the network to extreme cases, by provisioning the network to support the applications it is designed to support in the first place, designing it to have "head room" in most actual traffic conditions, and planning and sizing for actual peak usage (in a manner similar to the "busy hour" Erlang-type sizing common in traditional telephony).

Right-provisioning is not solely about bandwidth. Avoiding congestion minimizes loss and jitter that would be produced in the queue buffers of a congested network. Right-provisioning also includes good architectural design of networks to reduce hops and avoid bottlenecks.
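
What "right" means per call can be estimated from the codec rate and packetization interval. The sketch below is illustrative (layer-2 overheads vary by link type and are omitted here) and reproduces commonly quoted per-leg figures at the IP layer:

```python
def voip_bandwidth_kbps(codec_kbps, frame_ms, overhead_bytes=40):
    """Per-leg IP bandwidth of a VoIP stream: codec payload plus the
    40 bytes of RTP/UDP/IP headers carried by every packet.  Layer-2
    framing (e.g. ~38 bytes per Ethernet frame) would add more."""
    pps = 1000.0 / frame_ms                    # packets per second
    payload_bytes = codec_kbps * 1000 / 8 / pps
    return (payload_bytes + overhead_bytes) * 8 * pps / 1000.0

# G.711 at 64 kbps with 20 ms packets -> 80 kbps at the IP layer
# G.729 at  8 kbps with 20 ms packets -> 24 kbps at the IP layer
```

Note how the fixed per-packet header overhead dominates for low-bit-rate codecs: G.729 carries only 8 kbps of speech but consumes three times that on the wire.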

This is a typical enterprise approach on Local Area Networks (LANs), which now typically operate at gigabit-per-second speeds and where average and peak utilization are generally low. Right-provisioning on the LAN is the simplest and cheapest way to provide NSQ. It is also what carrier grade IP Service Providers do on their commercial backbone, typically operating at less than 10% average utilization on the backbone, thus avoiding congestion at peak.

The approach is less frequently applied to enterprise WAN links, where the company has to purchase access or links priced by peak throughput, which creates a perceived [24] economic incentive to limit bandwidth. Because bandwidth over the WAN can be orders of magnitude more expensive than over the LAN, many wide-area networks operate at speeds as low as T1/E1 speeds (1.544 Mbps and 2.048 Mbps respectively) and slower, potentially creating a bottleneck at the LAN/WAN interface. For applications using TCP, this LAN/WAN bottleneck is a nuisance, but generally not an application killer.

[23] "Right-provisioning" may mean different things for different solutions; for example, for the most frequently used G.711 codec, it generally implies between 80 kbps and 110 kbps per call leg; for G.729, it generally means between 24 kbps and 45 kbps per call leg.
[24] Whether the economic incentive is real or just perceived is subject to much debate; only a case-by-case analysis including a fully loaded business case (where the cost of managing scarcity and the secondary effects of poor Voice Quality are included) can provide the answer. CIOs should require such analysis prior to any decision. Global companies often have activities in territories where bandwidth simply cannot be acquired at any meaningful price, and as such will point out the need for managed solutions not relying on right-provisioning. That legitimate need should not be confused with the general case.

However, when voice and video packets must compete at that bottleneck with regular data packets for transmission over a bandwidth-constrained WAN, traditional voice and video applications may be rendered useless. To compensate for that, enterprises will deploy network management techniques and solutions such as Quality of Service to prioritize or control access to that link.

These network engineering techniques amount to managing bandwidth scarcity. As the average cost of bandwidth continues to decrease and eventually declines below the fully loaded cost of managing bandwidth scarcity, right-provisioning will become more prevalent as the primary mechanism to deliver good NSQ, and scarcity itself will become scarce: an exception that can be addressed in isolation rather than the general case requiring a network-wide solution.

2.3.2 Service Provider managed solution with guaranteed SLA

Customers may decide to outsource their WAN to Service Providers, in which case they negotiate Service Level Agreements (SLA) with their providers, describing the NSQ the providers commit to deliver for specific traffic flows and the penalties attached to failing to deliver that NSQ. In exchange, service providers can charge significant price premiums for “premium service plans”. In turn the providers will use a combination of the techniques described in this chapter on the managed network (or variations of these techniques, such as MPLS).

2.3.3 Quality of Service techniques [25]

Quality of Service (QoS) is a set of network engineering tools used to proactively manage the metrics of Network Service Quality [26] (bandwidth, delay, packet loss and jitter). QoS does not manage any of the metrics of Voice Quality, but can indirectly affect them. QoS consists of control mechanisms and traffic management methods that aim to transport selected network flows across the network within specified NSQ metrics. While QoS originated on ATM, it is now mostly applied to packet-switched IP networks.

QoS is based on providing deliberate preferential treatment for a specific traffic flow or a prioritized class of traffic, at the expense of course of the other network traffic flows. Preferential treatment can be done in many ways that typically fall within two main categories: Differentiated Services which is a coarse-grained, stateless, class-based mechanism for traffic management, and Integrated Services which is a fine-grained, stateful, flow-based mechanism.

2.3.3.1 Differentiated services

[25] This chapter aims to provide an introduction to the concepts and techniques for non-specialists. It does not pretend to capture the full richness and complexity of the domain, or to be of much value to specialists. Readers with intimate knowledge of the field, as well as those without direct interest in the actual techniques of QoS, may want to skip it altogether.
[26] There is often confusion in the use of the term QoS between the means and the ends, where the term QoS is used instead of Network Service Quality; in our taxonomy QoS is the set of means and NSQ is the resulting performance.


Differentiated services prioritize packets from one flow or class of flows above other packets at each hop that recognizes a specific marking of the packets. While there are several mechanisms available, DiffServ [27] is the most commonly used. In DiffServ, packets from various applications are marked as belonging to specific traffic classes through a Differentiated Services Code Point (DSCP). The IETF DIFFSERV working group redefined the semantics of the Type of Service (TOS) octet in the IP header, now called the DS field. The 6-bit codepoint (DSCP) portion of the DS field provides for 64 different packet treatments for the implementation of differentiated network services. DiffServ-aware routers implement per-hop behaviors (PHB): packet queue management, prioritization and forwarding properties associated with each specific traffic class, with higher priority forwarding for higher priority classes, at the expense of lower priority classes. Two standard per-hop behaviors are available:

• Expedited Forwarding (EF)

Initially defined in RFC 2598 (replaced by RFC 3246), EF includes only one DSCP (recommended codepoint 101110, i.e. decimal 46). DiffServ identifies and manages one class of traffic for expedited forwarding and expedites the traffic in that class to minimize delay and jitter for that class. It provides the highest available NSQ to the one expedited flow. Traffic that exceeds the traffic profile may be de-prioritized or discarded. EF is similar to IP Precedence 5 in the old Type of Service approach. EF traffic is often given strict priority queuing above all other traffic classes. Because an overload of EF traffic will cause queuing delays and affect the jitter and delay tolerances within the class, EF traffic is often strictly controlled through admission control, policing and other mechanisms. Typical networks will try to keep EF traffic to no more than 30% (and often much less) of the capacity of a link.

• Assured Forwarding (AF)

Defined in RFC 2597, AF includes 12 DSCPs in 4 classes, with 3 drop precedences within each class. AF is similar to IP Precedence 1 to 4 in the old Type of Service approach (IP Precedence 0 being pure best effort). When coexisting with the EF PHB, EF traffic represents a super-class prioritized above the 4 AF classes. With AF, DiffServ attempts to manage multiple types of traffic simultaneously, for example differentiating between call signaling, voice media traffic and video media traffic. Some measure of priority and proportional fairness is defined between traffic in different classes. Should congestion occur between classes, the traffic in the higher class is given priority. Traffic that exceeds the traffic profile may be demoted or dropped. If congestion occurs within a class, the packets with the higher drop precedence are discarded first. Rather than strict priority queuing, more balanced queue servicing algorithms such as fair queuing or weighted fair queuing are likely to be used. That aspect of implementation is generally vendor specific. In the same manner that EF traffic needs to be strictly controlled to maintain NSQ, it is also necessary to carefully manage and limit the amount of traffic admitted in each class. In that regard, the richness of the many classes may be somewhat misleading: Assured Forwarding works best if there is little traffic in each class.
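
An application can request EF treatment by setting the DSCP on its media socket. A minimal sketch follows; the codepoint value is standard, but whether the mark survives end-to-end depends on OS privileges and network policy:

```python
import socket

# DSCP EF = 46 (binary 101110).  The DSCP occupies the upper six bits of
# the old TOS octet, so the value passed to IP_TOS is shifted left by two.
DSCP_EF = 46
TOS_EF = DSCP_EF << 2          # 0xB8

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_EF)
# ...sendto() as usual: DiffServ-aware routers along the path can now
# place this flow in the expedited queue, or re-mark it at the edge.
sock.close()
```

As noted in the footnotes, many enterprise switches are configured to distrust application-set marks and re-mark them at the network edge, so this request is exactly that: a request.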

[27] DiffServ was initially defined in IETF RFC 2474 and RFC 2475 (1998).


DiffServ is a best effort mechanism [28]. There is no guarantee that any specific stream in the entire traffic class will have enough bandwidth, unless the amount of traffic to be managed is much less than the capacity of the link. DiffServ can turn out to be worse than best effort wherever there is no unique authority owning traffic classification (such as networks not managed end-to-end, public networks, or networks where "rogue applications" might try to claim precedence by marking their traffic in the highest classes, actually flooding the routers with competing traffic), as there is no policy enforcement in the DiffServ standard [29]. Therefore, on public networks such as the Internet, DiffServ is not implemented. At hand-off points between private networks, it is necessary to negotiate SLAs to avoid conflicting policies.

DiffServ works best within a network where all routers honor it, but it does not require that to be the case (DiffServ-unaware routers simply pass the traffic through). DiffServ may provide benefits even where only some of the routers recognize it. However, the details of how individual routers handle the type of service field are somewhat arbitrary, and it is difficult to predict end-to-end behavior. This is complicated further if a packet crosses two or more DiffServ clouds before reaching its destination.

DiffServ is available economically on virtually all modern switches/routers. Its implementation is standard (even if the actual details of queue management algorithms tend to be vendor specific). DiffServ does not require routers to maintain per-flow state, and as a result it requires less setup and ongoing management overhead than Integrated Services.

2.3.3.2 Integrated services

Integrated Services (IntServ) attempts to emulate properties of a circuit-switched network over a packet-switched network for a specific flow: "circuit over packet". In particular, IntServ aims at providing an end-to-end reserved and guaranteed resource: a "virtual circuit" where all resources are sequestered at call set-up time.

IntServ requires end-to-end management of the network: it needs every router in the end-to-end path to support it. IntServ interacts with applications by responding to appropriate requests ("traffic contracts") and providing feedback to the application. Every application that requires some kind of guarantee makes an individual reservation for each flow; "Flow Specs" describe what the reservation is for. When possible, the network reserves capacity at network nodes during session establishment. The Resource ReSerVation Protocol (RSVP) [30] is the most common underlying mechanism to signal the reservation across the network [31]. During the session the network may monitor the achieved level of NSQ, and dynamically control scheduling priorities in the network nodes. It is expected to release the reserved capacity during a tear-down phase. RSVP enables two types of service: (a) Guaranteed, which behaves as closely as possible to a dedicated virtual circuit, and (b) Controlled Load, which is equivalent to best-effort service under unloaded conditions for the prioritized flows.

[28] Some priority queuing mechanisms implemented in addition to DiffServ enable assigning a specific amount of bandwidth to specific classes. Each such class can enjoy guaranteed bandwidth as a class in aggregate. Of course, that does not ensure that any individual flow marked within such a class would benefit from guaranteed bandwidth without limitation.
[29] Most enterprise-class layer 3 switches may be configured to discard all application-set values and reset the DSCP for these packets to a default value of 0, or best effort. Requiring a trusted proxy to set these values instead of the applications is often considered "best practice" by network managers. While the practice takes care of rogue applications, it may also prevent legitimate applications from taking advantage of DiffServ, and makes end-to-end support of DiffServ across network boundaries even less likely.
[30] RSVP was initially defined in IETF RFC 2205 (1997).
[31] RSVP may also be used in conjunction with some rare and complex DiffServ implementations; it is however virtually always part of any IntServ implementation.

The routers store the nature of the flow and police it. This requires state to be maintained at all routers for all flows in their path. It is done in soft (transitory) state, so if nothing is heard for a certain length of time, the state will time out and the reservation will be cancelled. However, the amount of effort required for that is non-trivial. The RSVP signaling is quite complex and chatty, with exchanges of PATH and RESV messages spreading throughout the network every 30 seconds or so. As a result, scalability can be an issue.

RSVP provides the highest level of all current QoS technologies in terms of service guarantees; when implemented on all nodes in the path, and especially in conjunction with Call Admission Control (or when using RSVP Sync directly as CAC), RSVP can guarantee bandwidth for the call along the entire path for the entire duration of the call. Note that while end-to-end bandwidth guarantee should help improve all other NSQ metrics, RSVP does not guarantee delay, jitter or packet loss, because packets still need to be identified as belonging to the flow and forwarded at each hop, even in an expedited manner.

RSVP is the biggest departure from “best-effort” IP service, the most complex and management intensive of all QoS technologies, and creates the closest emulation of a circuit on IP networks. It also is the technology that creates the most difference between the traffic that benefits and the traffic that does not, because the resources are not shared but sequestered (whether they are entirely needed or not); in some cases, the entire resource set can be sequestered, leading to all other services being incapacitated and blocking for any further call attempts.

While IntServ and RSVP are standardized, actual implementation and application integration for IP telephony (such as flow requirements specification) tend to be quite complex and can be proprietary for advanced functions. In practice, those deployments are most often done on a homogeneous, single-vendor network, and the IP telephony solution (all the way to the end-points) is also often provided by the same vendor [32] to maximize compatibility end-to-end. As a result traditional IP telephony tends to foster vertical integration and vendor lock-in between the network, the IP telephony application and devices. This also makes it more difficult in practice for multiple IP telephony solutions to coexist and interoperate in an enterprise, as compared to traditional TDM voice solutions.

Because of the complexity of the design and the integration, and also because the network needs to be designed for peak usage regardless of how often peak conditions occur, RSVP is also generally the most costly to purchase, and most complex to setup and manage, especially as traffic patterns fluctuate seasonally and/or change structurally on the network or new resources or nodes are added.

2.3.3.3 Traffic shaping

[32] There are of course examples of implementations in multi-vendor environments. Those are however a minority, for the reasons stated.


Traffic Shaping provides mechanisms to control the volume of traffic being sent into a network (bandwidth throttling), and the rate at which the traffic is being sent (rate limiting). It is typically used at specific network ingress points (such as hosts running bursty applications, and third party network interconnects) to avoid flooding of the network by uncontrolled spikes of the incoming traffic.

It consists of deliberately delaying packets to smooth out the peaks and troughs of traffic being inserted on the managed network, to avoid situations where spikes briefly collide with real time traffic, generating jitter, packet loss or delay. Of course, if there is real time traffic in that influx, it cannot be delayed without making it unusable. Therefore traffic shaping is rarely used alone, but rather added to DiffServ or IntServ, as well as to CAC, to shape the various flows differently.

Traffic shaping is generally implemented through various "leaky bucket" techniques: large queuing buffers that let through at most a specified, capped throughput. There are other techniques such as the "token bucket", primarily used in association with IntServ, which differ from the leaky bucket in that they can allow some amount of short-term burstiness.
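
The token bucket just mentioned can be sketched in a few lines (all parameters illustrative): tokens accrue at a steady rate up to a burst ceiling, so short bursts pass but the long-run average rate is capped:

```python
class TokenBucket:
    """Token-bucket shaper sketch: tokens accrue at `rate` bytes/s up to
    `burst` bytes; a packet passes only if enough tokens remain, which
    caps the average rate while allowing limited short-term burstiness."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, nbytes, now):
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

tb = TokenBucket(rate=10_000, burst=1_500)  # 10 kB/s, one MTU of burst credit
tb.allow(1500, now=0.0)   # True:  the burst credit covers one full frame
tb.allow(1500, now=0.01)  # False: only ~100 bytes refilled in 10 ms
tb.allow(1500, now=0.2)   # True:  the bucket has fully refilled
```

A leaky bucket differs only in that it drains at a constant rate with no burst credit, which is why it smooths traffic more aggressively.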

As examined above, QoS is designed to protect real time traffic from other data traffic contending for the same network resources and to deal with traffic already present on the network. The other common method used to manage Network Service Quality for voice or video traffic is Call Admission Control (CAC), which determines whether new traffic from a specific application should be admitted onto the network or not.

2.3.4 Call Admission Control

Call Admission Control (CAC) is a preventive application layer management capability that intervenes proactively during call set-up to authorize or limit which flows are allowed to be set up, based on parameters such as available bandwidth, number and type of calls in progress, or possibly other parameters such as available CPU at end-points. In almost all implementations, CAC is used to avoid congestion and over-subscription of network resources, in particular on congested WAN links. From a user point of view, CAC can translate into some calls being (apparently randomly) blocked (fast busy signals, for example).

Typically, CAC is used to address two main objectives. The most common objective is to protect the quality of the real time media. The less frequent objective is to protect other critical payloads from the real time media.

• Objective 1: protect the quality of the real time media to which it is applied

Because traditional IP telephony codecs require a fixed bit rate and cannot adapt to congestion, the observed quality of all flows on a link would degrade very rapidly when link utilization approaches capacity as more media sessions are requested. If implemented, RSVP might help prevent that to some extent for already established flows, but if incremental flows were established they would likely be of very poor quality. Eventually, even flows with reserved resources could be affected at least in terms of increased delays, jitter and packet loss.


CAC controls which flows are authorized on the link and proactively limits the flows so that each authorized flow theoretically enjoys acceptable NSQ. The flipside is that new media session requests are denied outright or, if available, sent to a backup solution (PSTN fall-back). The application may or may not provide feedback to the user as to why the request failed (a fast busy signal, for example) or how it is handled (e.g. incremental cost).

Variations on this need can include: (a) proactive media type management: when video and audio coexist, audio is generally considered more important than video, which also has a heavier payload; upon congestion, the network manager may prefer to deny video sessions rather than lose audio sessions; and (b) break-through priority: some flows may be given the right to "break through" and grab resources away from other flows; in that case, if CAC assessed that such a top-priority user or call (such as an emergency call) needed more resources than available, it could force existing flows to relinquish resources (essentially kill the call) and make them available to the prioritized flow.

• Objective 2: protect other critical payloads on the network from being preempted

by prioritized real time media

Enterprise networks generally carry not only real time data such as IP telephony and video but also data that is business critical without being real time (e.g. financial transaction data, HR data). Traditional IP telephony vendors often prescribe QoS to prioritize real time media over any other traffic (including business critical data), which creates a secondary effect of resource preemption by the real time media if too many IP telephony or video sessions are requested – whereas if no QoS were applied, best effort transport layer mechanisms would likely enable the business critical data to get through, even with delay.

In such a case, CAC complements QoS to ensure that the class prioritized for quality does not take all the resources away from the other classes; a CAC threshold may be set, for example, such that real time media never uses more than 50% of the available bandwidth, leaving headroom for other, non-expedited traffic.
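Such a threshold reduces to a simple admission predicate (an illustrative sketch; the 50% figure is the example from the text, not a recommendation, and the function name is ours):

```python
def admit_call(current_media_kbps, call_kbps, link_kbps, media_share=0.5):
    """Admit a new media session only while real time media stays under its
    configured share of the link, leaving headroom for other traffic."""
    return current_media_kbps + call_kbps <= link_kbps * media_share

# On a 2 Mbps link capped at 50%, a 64 kbps call fits until
# real time media traffic approaches 1 Mbps:
print(admit_call(900, 64, 2000))   # True
print(admit_call(960, 64, 2000))   # False
```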

Overall, CAC is mostly useful on under-provisioned links; on right-provisioned links it should only rarely be triggered. It is therefore a solution for managing scarcity, and as such often works together with QoS in traditional IP telephony: QoS is applied to prioritize real time data over other payloads, while CAC is used to prevent congestion on under-provisioned links. In that context CAC is an optional but important complement to DiffServ (ensuring that flow prioritization is not applied to more flows than the resources can support and that queues are not oversubscribed, which would lead to selective discarding of packets). It is virtually always associated with IntServ (as part of the resource reservation traffic contract, ensuring the application does not attempt to reserve more resources than are available). CAC also cannot on its own prevent the effects of rapid bursts and increases in the other traffic that coexists on the network after calls are established; it is therefore typically combined with traffic shaping of that other traffic.
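The traffic shaping mentioned above is commonly implemented as a token bucket: packets are released only while tokens are available, and tokens refill at the contracted rate. A minimal illustrative sketch (not any particular vendor's implementation; the class name is ours):

```python
class TokenBucket:
    """Classic token bucket shaper: a packet is forwarded only if enough
    tokens are available; tokens refill at `rate` units per second, up to
    a maximum burst size of `burst` units."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst     # start with a full bucket
        self.last = 0.0

    def allow(self, now, packet_size):
        # Refill tokens for the time elapsed since the last packet.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_size <= self.tokens:
            self.tokens -= packet_size
            return True
        return False            # packet is delayed or dropped by the shaper
```

A burst up to the bucket size passes immediately; sustained traffic is held to the configured rate, which is what keeps the "other" traffic from swamping established calls.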

CAC requires the application to be informed of the network resources. In most cases that is done statically, for simplicity; static configuration however requires a separate process to keep the network configuration up-to-date. Alternatively, it is done dynamically, either through “measurement-based CAC”, which sends probe traffic (pings) before establishing the call, or through topology-aware methods such as “resource-reservation based techniques”, which rely on typically proprietary integration with RSVP/IntServ and interaction between the network and the application.
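The measurement-based variant can be sketched as follows (hypothetical and simplified: `probe_rtt_ms` stands in for a real ping and is simulated here, and the thresholds are illustrative):

```python
import random

def probe_rtt_ms():
    """Simulated probe: returns an RTT in ms, or None for a lost probe."""
    if random.random() < 0.02:          # 2% simulated loss
        return None
    return random.gauss(40.0, 5.0)      # roughly a 40 ms path

def measurement_based_cac(n_probes=20, max_loss=0.05, max_rtt_ms=150.0):
    """Send a short burst of probes before establishing the call and admit
    only if observed loss and average RTT stay within thresholds."""
    results = [probe_rtt_ms() for _ in range(n_probes)]
    received = [r for r in results if r is not None]
    loss = 1.0 - len(received) / n_probes
    if loss > max_loss or not received:
        return False                    # path too lossy: deny or fall back to PSTN
    avg_rtt = sum(received) / len(received)
    return avg_rtt <= max_rtt_ms
```

Note the inherent limitation the text alludes to: probes only sample the path at call setup, so conditions can still degrade after the call is admitted.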

CAC generally also requires an understanding of network flows. Depending on the CAC method, and in particular for methods called “resource calculation”, that required understanding can be very comprehensive – state, network path, bandwidth requirements, etc. That can be especially complex with roaming endpoints and point-to-point flows – modeling user behavior and the resulting bandwidth usage for an increasingly mobile workforce is probably more an art than a science.

CAC implementations range from relatively localized – such as monitoring discrete spokes in a hub-and-spoke network, making a one-time decision on whether to establish or reject the call based on the current number of calls on the link, and not tracking the call afterwards – to very complex gatekeeper “zone” functionality integrated with RSVP over distributed networks, with break-through rights for some users and with PSTN fallback or alternate path routing mechanisms. In either case, to ensure CAC is effective while neither unduly limiting the number of calls that can be authorized nor penalizing the other traffic, it may require significant setup and ongoing management effort as network conditions, required traffic patterns or the needs of other critical payloads evolve over time.

2.3.5 VLAN and DVLAN

The last traditional approach is the deployment of dynamic and static VLAN for traffic engineering, security33 and to a lesser extent quality (typically in combination with differentiated services such as per-port per-VLAN QoS). In this approach, voice devices are hard-mapped to a VLAN, or can use proprietary discovery protocols to request creation of a DVLAN. These VLAN form a second parallel network that carries only the voice traffic from end to end. This is in effect a new kind of circuit-switched network overlaying the data network. VLAN boundaries however typically end at the WAN boundaries (router edges), where network management problems are the most acute. Therefore VLAN alone are typically not sufficient to handle NSQ, and some of the other techniques described above are also used in addition.

Because a VLAN is a parallel network, service sharing is problematic and infrastructure and management costs can be greatly increased. Take, for example, the deployment of DHCP: a VLAN network may require a separate DHCP server (or at least a subnet scope) and IP address space, with a substantial effect on IP management cost. Similarly, keeping large VLAN up-to-date under evolving network and traffic conditions presents a very complex challenge. The overall complexity and cost of VLAN management, in addition to QoS, may end up challenging the theoretical cost of ownership savings that are often cited for moving to traditional IP telephony.

33Traditional IP telephony solutions are generally not intrinsically secure at the application layer (typically they do not include strong authentication, encryption of signaling and media, etc). For those solutions, deploying a VLAN for the purpose of improving the security of the voice application (rather than as a way to provide Voice Quality) is typically required. There is however a significant cost to that incremental security, which solutions providing strong authentication and encryption natively would not require.


3 The traditional approach is complex and increasingly ineffective

3.1 Traditional IP telephony solutions are challenged to deliver Voice Quality

Any modern Enterprise Voice solution should be designed explicitly to manage and deliver the best possible Voice Quality. This however is not always the case with the current generation of IP-PBX, which do not commonly result from a proactive, comprehensive design for Voice Quality, nor include capabilities for real time in situ monitoring and management of Voice Quality, and are therefore structurally challenged, by design, to deliver Voice Quality.

3.1.1 Lack of comprehensive design for Voice Quality

Rather than being redesigned from scratch for IP networks, traditional IP telephony borrowed much of its technology from the world of traditional digital telephony. From that point of view, IP-PBX are much more a continuation of the PBX than natives of the new world of IP. As a result, some of the main challenges of Voice Quality on IP networks were not addressed natively in the IP telephony solution itself – leading to corrective approaches, in particular in the network.

For example, the codecs almost universally used in traditional IP telephony were developed prior to the rise of IP networks, and in the case of G.711 as early as the 1970s, when the PSTN was introducing digital source encoding. Those are Constant Bit Rate (CBR) codecs that were designed for guaranteed bandwidth, switched networks. In the context of such networks they do provide very satisfactory Voice Quality. They were however not initially designed and optimized for packet switched, best effort networks, and in particular for the Internet.

3.1.2 Lack of real time in situ evaluation and management of Voice Quality

The complexity of subjective Voice Quality measurement in traditional IP telephony (as described earlier) has prevented direct management of Voice Quality in virtually all of those systems. Because traditional IP phones are not designed to run advanced perceptual algorithms, traditional commercially available solutions generally do not measure and track the end-to-end user experience for all calls in real time. The Voice Quality of virtually all calls remains unmeasured, unmonitored and unknown – and as a result unmanaged.

3.2 QoS techniques have limits, risks and costs

Voice Quality in traditional IP telephony is dependent on an indirect approach. Rather than directly managing Voice Quality, it is Network Service Quality that is measured, monitored and managed, with the network engineering techniques of Quality of Service. These techniques (QoS and CAC34) are applied end-to-end on the managed network to specific classes of traffic or specific flows, with the objective that those flows receive the bandwidth they need and are expedited at the various hops, keeping delay, packet loss and jitter low, and as a result transporting the flows in a manner that does not damage their quality. However, in spite of being broadly – almost systematically – implemented for traditional IP telephony, QoS presents multiple issues and limitations.

34Thereafter we’ll just say QoS generically

3.2.1 QoS is not a substitute for right-provisioning

At most, QoS only guarantees throughput. Specifically, QoS either expedites forwarding with DiffServ (preferentially directing throughput toward the prioritized flow, still as a best effort flow without guarantees on the other NSQ metrics), or allocates throughput resources with IntServ (guaranteeing end-to-end network bandwidth but not the other NSQ metrics). QoS does not directly control packet loss and jitter but rather aims to minimize them through sufficient bandwidth allocation or through prioritization and expediting.
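On the DiffServ side, prioritization relies on the DSCP code point carried in the IP header; an application can request Expedited Forwarding by setting the TOS byte on its socket via the standard BSD socket API (a sketch; whether routers honor the marking depends entirely on network policy):

```python
import socket

EF_DSCP = 46                 # Expedited Forwarding code point (RFC 3246)
TOS_BYTE = EF_DSCP << 2      # DSCP occupies the upper 6 bits of the TOS byte -> 0xB8

# Outgoing datagrams on this socket will carry DSCP EF (on platforms that
# honor IP_TOS); routers may still remark or ignore the value.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_BYTE)
print(hex(sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)))
sock.close()
```

This also illustrates footnote 38 below: nothing in DiffServ itself stops any application from marking its own traffic EF unless the switches reset application-applied markings.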

QoS works best when only a small percentage of the traffic is priority – which is rarely the case with real time media traffic. If a link is completely overloaded, IntServ may be able to sequester bandwidth (and cause some existing traffic to be discarded) but jitter and packet loss will likely still be an issue on a clogged network.


Chart: When QoS works best (illustrative)35

QoS (and especially DiffServ) is only helpful in a narrow range of traffic conditions36. It is unnecessary when there is little traffic (hence little congestion). It works well at supporting prioritized traffic when there is moderate congestion (at the expense, of course, of the other traffic). And it works less well (but probably better than nothing) when there is heavy congestion, as illustrated in the chart above. Even IntServ/RSVP and CAC eventually become ineffective and meaningless if there is too much traffic to be prioritized and too many calls end up being rejected by the system.

3.2.2 QoS can do more harm than good if not managed right

QoS often creates detrimental secondary effects on the network, such as:

• The risk of preemption and disruption of other, sometimes business critical flows; whereas a best effort network would give them equal opportunity, QoS networks often end up treating them worse than best effort; prioritization and preemption presuppose that the network knows what is important at all times, which is not always the case37

• The risk of “rogue applications” taking advantage of QoS at the expense of the traffic that should have benefited from it38

• The risk of mishandling network interconnections39

35Source: http://www.bricklin.com/qos.htm 36See also slide #9 in http://www.cisco.com/application/pdf/en/us/guest/tech/tk759/c1482/cdccont_0900aecd8019f3e0.pdf 37Even the common assumption that voice deserves a high priority is very arbitrary – many phone calls might have limited or no business value compared to some TCP flows. 38For example, there is no standard policy enforcement in DiffServ: if it is implemented on the network with the goal of supporting one application, all applications can take advantage of it as they choose unless the switches reset all application-applied markings. 39The worst case scenario is for IP telephony flows to have to traverse third party networks where QoS is implemented but where those flows are not given preferential treatment – and hence are in practice


• IntServ scalability limitations that may impact the actual usability of the application

• And most importantly the risk of not being able to keep up with and manage the complexity of ongoing QoS. In many cases, poor or stale QoS implementations actually end up working against the traffic they are trying to favor.

3.2.3 Advanced QoS is costly

While DiffServ is generally available on most off-the-shelf routers, IntServ/RSVP and dynamic CAC are more complex and costly to procure, implement and manage.

IntServ involves support for RSVP signaling and for network feedback to the application, and a more complex, often single vendor40 network (because implementation, monitoring and management of advanced features are often vendor specific). Advanced solutions also often require compatible end-points (generally from the same vendor), adding procurement cost.

Most importantly, deploying and managing IntServ and/or CAC is costly and complex. It takes skilled engineers to plan, set up and optimize for all traffic patterns in light of typically highly variable or seasonal traffic. Ongoing monitoring and optimization is typically needed to keep up with actual traffic patterns. Significant modifications to the network, such as new sites or new services, will likely require reviewing the configuration. QoS is not a “set and forget” technology.

3.3 The traditional approach has a limited scope

By requiring QoS, traditional IP telephony solutions make implicit assumptions that limit the scope of the calls for which the system is capable of providing Network Service Quality.

3.3.1 The entire network is rarely under the administrator’s control

For QoS and CAC techniques to work, the administrator has to be in a position to provision and manage, on an ongoing basis, all the network elements so that they respect the QoS policies and controls. Alternatively the administrator must be in control of very well defined and well managed SLA over a service provider’s network.

Most large corporate networks however are in practice still heterogeneous and complex. They are often at various stages of roll-out, sourced from different service providers with incompatible SLA, inherited from M&A with different architectures, or delegated to local control with different policies. Coordination of QoS across private networks is extremely complex, and rarely implemented.

3.3.2 The users are no longer at well identified points on the managed network

given detrimental treatment, for example by having their packets dropped by RED in low priority queues. The result is much worse than if they had to traverse best effort networks only.
40In order to certify their IP telephony deployments, some vendors require that the underlying network run on their gear exclusively.


QoS only works on the managed network. If the end-point is not on it, the flow will be exposed to NSQ impairments in the non-managed part of the network, and in all likelihood the Voice Quality will be affected.

As more and more users collaborate across a variety of networks, they escape the ivory tower of an end-to-end owned and managed network, in many cases using the Internet to connect back to the enterprise network. QoS is not supported on the Internet. Therefore a solution which requires QoS is very significantly limited from its inception.

3.3.3 The traffic patterns are increasingly difficult to predict

QoS manages scarcity at the margin in a narrow range of traffic conditions, and therefore the traffic patterns need to be well understood. If there is much more traffic than planned, the service will end up poor no matter what. If there is much less traffic, there is no need for any QoS or CAC.

As new services such as videoconferencing are introduced, and as employees roam within the enterprise from hot-desk to hot-desk or connect from outside, traffic may no longer be easily predictable, and often falls in ranges where QoS is no longer efficient.

3.4 Providing Network Service Quality is not sufficient to ensure Voice Quality

Poor Voice Quality is the most common source of user dissatisfaction with current IP telephony implementations41 and is difficult and costly to address. Even well designed and managed IP telephony deployments typically provide lower Voice Quality to their end-users than the circuit switched solutions they replaced42. This is becoming a more visible and contentious issue.

Due to the convergence of cultural factors (SLA orientation inherited from the world of operators and TDM), technical factors (fragility of traditional real time applications to NSQ impairments, difficulties in managing Voice Quality directly), and market factors (strong reliance of the networking industry on QoS for market differentiation and premium revenues), broad QoS has become the de facto automatic pre-requisite to any traditional IP telephony deployment.

While the intent of QoS is of course to try and provide quality as perceived by the end user, QoS is not comprehensive and does not examine the actual end-to-end user experience. It solely works on managing NSQ parameters on the part of the end-to-end path that is actually managed.

41See for example the following articles and sources:
http://www.vonmag-digital.com/vonmag/200603/?pg=34
http://www.technewsworld.com/story/89p8MqTcCNwR8Q/Is-VoIP-Call-Quality-Enterprise-Ready.xhtml
http://www.convergedigest.com/bp-ttp/bp1.asp?ID=404&ctgy=Home
http://www.networkworld.com/video/081406hs-qovia.html?tab=recent
http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9026541
42See for example: http://www.psytechnics.com/site/sections/downloads/download.php?fileID=42 where a PSTN solution (with an average MOS of 4.1) was replaced by a traditional IP telephony solution which initially provided unacceptable quality on many calls; Psytechnics was brought in to conduct a world-class quality improvement program which ensured virtually all calls now deliver a MOS of 3.7 or better – still a significant reduction however from the PSTN solution.


The intended result (or hope) is that Network Service Quality will directly translate into Voice Quality experienced by the end users.

There is increasing recognition however that implementing QoS with traditional IP telephony is far from ensuring that end-user quality. While good Network Service Quality is a requirement for traditional IP telephony to deliver Voice Quality, it is not a sufficient condition43.

Therefore there is a considerable need for alternate strategies to providing Voice Quality, strategies that cannot rely on network management as a required condition, because the network will often be a public, best-effort network, or out of the control of the enterprise’s network engineer. The new strategies must work on any network, enabling users to experience real time media with the best possible Voice Quality in any situation. What is needed is a new, comprehensive approach to Quality of Experience, where the bulk of the management is at the application layer rather than almost exclusively at the network layer.

4 Microsoft UC Quality of Experience: a comprehensive approach to Quality

4.1 Quality of Experience, a new approach

Quality of Experience (QoE) is a new approach to quality for media communication, reflecting the emerging reality that what ultimately matters in moving to the next generation of services is how users perceive them to perform. Quality of Experience is starting to emerge as a topic in the literature, in industry circles and at select customers, in recognition that the traditional methodology of QoS (which manages network configuration and performance) is often ineffective in ensuring the desired user experience. QoE examines all elements that influence a user's perception of the interaction and makes use of many relevant metrics and technologies (including, where appropriate, specific network layer QoS technologies) to deliver the best possible experience. The concept applies to any kind of interaction, not just voice.

Microsoft UC Quality of Experience is based on optimizing and monitoring the actual user experience through all relevant means. Specifically, Microsoft UC Quality of Experience comprises:

• Comprehensive, user-focused approach to perceived quality:

Adopting a new, comprehensive approach to perceived quality, centered on the actual users, and incorporating all significant influencing parameters in optimizing the user experience.

• Intelligent, adaptive end-points including an advanced media stack:

43“You can build an exceptionally robust network but that still does not guarantee that you’ll get anything close to what the end-user sees getting from the traditional TDM technology with respect to (Voice Quality)” – Stephen Mank, Qovia in http://www.vonmag-digital.com/vonmag/200603/?pg=34


Using smart adaptive end-points with the real time capability to monitor, pilot, optimize and deliver the UC Quality of Experience, in particular by running an advanced set of applications such as a "media stack" that takes real time adaptive and corrective actions to continuously optimize the user’s subjective experience on any network.

• Real time metrics of the actual experience:

Measuring, quantifying and monitoring at all times the actual experience using real time metrics of the user's perceived subjective quality of the media experience.

Because those three components are needed to ensure consistent delivery of end user Quality of Experience, Microsoft UC is unique in providing a complete, comprehensive solution to the Quality of Experience needs of Enterprise Voice; this is a key competitive differentiation of Microsoft UC.

4.2 A comprehensive, user-focused commitment to perceived quality

4.2.1 Microsoft UC addresses root causes of poor user experience

Microsoft UC QoE is making significant departures from the traditional network-centered methodology by taking a comprehensive, user-focused approach to perceived quality. The QoE approach attempts to incorporate all significant influencing parameters (network parameters of course, but also hardware, application, psychological, physical parameters, etc) in optimizing the user experience in real life situations. Often it is the non-network effects which are the most harmful to the experience.

Extensive subjective and objective testing programs44 helped identify and address the many root causes of poor user experience. There are three families of root causes:

• Network effects:

Network effects are impairments in the end to end transmission of packets (Network Service Quality impairments such as bandwidth restrictions, packet loss or jitter). That is the domain on which traditional network based methodologies have concentrated their efforts.

• Payload effects:

Payload effects are quality impairments to the payload content (what is “in the packet”) which can result from the environment, from poor devices that fail to capture the quality of the signal or generate acoustic echo, or from a host of other issues at the end-points. The factors that are not seen with a network-only model, such as noise level, echo, gain, and double talk, often have a greater effect on perceived quality than the random variations of the network layer. A good example is a conversation where every word is echoed.

44Both internally and with vendors and partners.


It is subjective perception of quality by the end users that has been driving technical choices in Microsoft UC, even when those choices were difficult and required significant investment. This has led to a very wide reaching effort – from the design of voice devices with simpler setup, improved ergonomics and improved acoustics, to working with PC manufacturers in identifying improvements in design (such as avoiding placing the microphone close to the fan or the hard drive), to working with peripheral vendors to ensure that the device is an integral part of delivering QoE45.

4.2.2 Wideband voice: an example of experience that exceeds PSTN capabilities

In taking this new approach, Microsoft UC is not limiting itself even to what constitutes the best available experience in traditional telephony. Microsoft UC does not just aim at matching traditional experiences but at significantly improving on them. For example, Microsoft UC made the choice of wideband voice by default, rather than the narrowband voice approach of both traditional telephony and IP telephony.

PSTN networks and traditional IP telephony solutions are virtually all narrowband: they sample speech at 8 kHz, providing (according to the Nyquist theorem) a usable frequency range of 4000 Hz; that range is then traditionally bounded on both ends through the application of guard bands to avoid interference and aliasing. The resulting range of 200 Hz to 3400 Hz net of guard bands (occasionally also described as 300 Hz to 3400 Hz) is termed the “telephony band” and has been essentially unchanged since the establishment of the first transcontinental telephony service between New York and San Francisco in 1915. Compressed speech in the telephony band is sufficient for intelligibility but compromises on subjective speech quality. For example, the fundamental frequency of voiced speech for both typical males and typical females is generally below 200 Hz and as such is not transmitted by narrowband systems (adult males have a fundamental frequency in the 85 Hz to 155 Hz range, adult females in the 165 Hz to 255 Hz range). In narrowband systems, it is the transmission of enough of the harmonic series that provides the perception of hearing the fundamental tone, which is itself not transmitted. Similarly, speech (in particular consonants and fricatives) contains a lot of energy in the 4000 Hz to 7000 Hz high frequency band, which narrowband transmission does not convey.

The wideband RTAudio codecs in Microsoft UC sample speech at 16 kHz, with a total frequency range of 8000 Hz and a usable frequency range of about 50 Hz to 7000 Hz. This provides accurate transmission and reproduction of the low and high sub-bands, adding to the quality of experience of the speech signal. Wideband not only improves the intelligibility and naturalness of speech, but also adds a feeling of transparent communication and eases speaker recognition.
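The band arithmetic above can be checked directly (an illustrative sketch; band edges are the ones given in the text):

```python
def nyquist_hz(sample_rate_hz):
    """Highest representable frequency is half the sampling rate."""
    return sample_rate_hz / 2

narrowband = (200, 3400)   # telephony band, net of guard bands
wideband = (50, 7000)      # RTAudio wideband usable range

print(nyquist_hz(8000))    # 4000.0 -> narrowband ceiling before guard bands
print(nyquist_hz(16000))   # 8000.0 -> wideband ceiling before guard bands

# Fundamental frequencies of adult speech (85-255 Hz): most fall below the
# narrowband floor, so listeners reconstruct the pitch from harmonics,
# while wideband transmits the fundamentals directly.
for f0 in (85, 155, 165, 255):
    in_nb = narrowband[0] <= f0 <= narrowband[1]
    in_wb = wideband[0] <= f0 <= wideband[1]
    print(f0, in_nb, in_wb)
```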

Because Microsoft UC QoE is comprehensive, it did not stop with the design of superior wideband codecs. It also ensured there would be end-points that could reproduce the superior quality of those codecs (such as support for the wideband acoustic frequency range and reduced echo). Microsoft UC applied the same concepts of QoE not only to person to person voice and non-voice calls but also to multi-user calls using the MCU server roles in Office Communications Server 2007.

45Example: poorer quality webcams may generate pixelization and “noisy” video images which in turn will unduly consume bandwidth, possibly impacting the voice quality; Microsoft UC engineers have worked with webcam vendors on precise specifications and extensive testing to minimize that effect.


4.2.3 Microsoft UC supports users anytime anywhere

This comprehensive approach can be illustrated in another important dimension: providing optimal quality of experience for users anywhere and anytime. For an increasing number of people, the workplace today is no longer the office in which their PBX phone is connected. Their workplace is wherever they are, whether roaming on the lossy public Wi-Fi connection of a busy coffee shop or a noisy airport, with little time or opportunity for complex operations such as setting up a VPN, or consulting at a customer site, or teleworking and connecting over their home broadband connection. Mobile worker programs are a strategic imperative that enable business continuity, an improved ability to access and retain top talent, and far greater organizational agility in a knowledge economy. Teleworking will be practiced as a primary work arrangement by more than 60 million people worldwide in 200746, and we should be anticipating the day when more than half of the Information Worker workforce is able to function productively outside of traditional facilities47. Users can no longer be tied to a desk to maintain communication with the enterprise and its customers, and technology needs to adapt.

Of course, enterprise network administrators do not control the networks on which nomadic users operate, and those networks are out of reach of end-to-end network layer management. Therefore any solution in which quality and security are predicated on end-to-end management of the network and/or on better NSQ than what users get on the Internet would under-serve that significant and growing segment of users and would not serve the needs of a modern enterprise.

Microsoft UC was designed from the start to support those users across the most aggressive of unmanaged networks. It is differentiated by the use of intelligent end-points which enable secure authentication and encrypted media sessions without the need for a VPN, and by a rich adaptive and resilient media stack. Microsoft UC works under high loss, jitter, and delay, under constrained bandwidth, and can overcome such non-network impairments as echo and poor signal/noise ratio. To meet these requirements, a robust system was created that could correct all but the most extreme network and non-network effects.
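One building block of such an adaptive, resilient media stack is an adaptive jitter buffer, which trades a small amount of playout delay for smooth audio under variable network conditions. The sketch below is a simplified, hypothetical illustration of the general technique (an EWMA-tracked jitter estimate plus a safety margin), not the actual Office Communicator implementation:

```python
class AdaptiveJitterBuffer:
    """Keeps the playout delay a few standard deviations above a running
    (EWMA) estimate of interarrival jitter: the buffer grows under jitter
    spikes and shrinks back when the network calms down."""

    def __init__(self, alpha=0.1, safety=4.0):
        self.alpha, self.safety = alpha, safety
        self.mean_ms = 0.0   # EWMA of observed interarrival jitter
        self.var_ms = 0.0    # EWMA of its variance

    def on_packet(self, interarrival_jitter_ms):
        d = interarrival_jitter_ms - self.mean_ms
        self.mean_ms += self.alpha * d
        self.var_ms = (1 - self.alpha) * (self.var_ms + self.alpha * d * d)
        return self.playout_delay_ms()

    def playout_delay_ms(self):
        return self.mean_ms + self.safety * self.var_ms ** 0.5
```

Under a steady 10 ms of jitter the target delay settles near 10 ms; a sudden spike immediately raises it, which is the kind of real time corrective action the text describes.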

4.3 Intelligent, adaptive end-points

While limiting the occurrence of network effects is the domain of management of the network, correcting those network effects as well as non-network effects cannot be done through management in the network. It is at the application layer that the solution must be implemented. And in order to enable point-to-point media, to provide end-to-end quality while tracking the user experience most closely, and to be scalable that application must run autonomously on the users’ end-points. Therefore two key pillars of this comprehensive approach are a rich software application and end-points that have the capability to run that rich application.

Traditional IP telephony solutions are not designed to do that. Their end-points are generally very simple appliances with few on-board capabilities and a low rate of innovation over time. They typically have limited processing power and memory, with Digital Signal Processing (DSP) processes and codecs embedded once and for all in silicon, and they are not suited to running rich applications or to keeping up with the pace of software innovation.

46Gartner Management Update 2004. 47Heidi Skatrud, Vice President, Runzheimer International, quoted in http://www.runzheimer.com/web/all/news.2006.06.01.aspx

Microsoft UC is based on Microsoft Office Communicator 2007 as the rich software application running on intelligent end-points (PCs, Windows CE devices with phone form factors, Windows Mobile devices, or other devices running partner platforms with similar capabilities). These end-points have enough processing power, memory, versatility, flexibility, and core platform capability to host rich media applications such as the new UC media stack in Office Communicator. They can also support a high rate of software innovation.

Office Communicator as a rich software application running on an intelligent device and platform brings three key new capabilities over traditional IP telephony:

• Authenticated, encrypted VPN-less communication:

Provide strong authentication and non-repudiation and encrypt all signaling and media by default, which, together with other capabilities of Microsoft UC such as support for ICE, enables VPN-less anytime, anywhere access. Since VPNs contribute to network effects that negatively affect Voice Quality (adding overhead, i.e. increasing bandwidth consumption, and adding delay and jitter), removing traditional IP telephony's requirement for a VPN provides multiple benefits to the Quality of Experience.

• Quantifying the user experience at all times:

Measure, quantify, and monitor the actual experience at all times, using real-time metrics of the user's perceived subjective quality of the media experience48.

• An advanced media stack designed for quality on IP networks:

Central to the approach is a brand new UC media stack for Office Communicator 2007, a media stack that includes modern codecs and capabilities to overcome and hide the effects of network and non-network impairments.

4.4 Measuring and monitoring the user experience in real time

4.4.1 Microsoft UC generates rich metrics of the user experience

Quantifying and monitoring the Quality of Experience of all users in all calls is one of the unique differentiators of Microsoft UC. Office Communicator measures the actual experience and generates all relevant metrics for all calls, including payload effects, which Office Communications Server collects and aggregates in the Call Detail Records (CDR) as Metrics CDR. These capabilities represent a substantial advancement over traditional IP telephony solutions.

Metrics CDR include:

48Rather than just monitoring network behavior using low level NSQ metrics such as packet loss, jitter or latency.

• Rich estimate of MOS:

Microsoft UC employs advanced real time passive algorithms examining both network and payload effects to produce key perceptive quality metrics such as estimates of MOS. These algorithms run on the end-points to measure the actual end-to-end user experience on a continuous basis, and report on all end-points for all calls. Quality metrics can be used for example for accurate trending and exception reporting.

• Network Service Quality parameters:

Microsoft UC produces metrics of all relevant NSQ parameters for all calls, providing the network administrator with unique insights into network performance and facilitating resolution of potential network issues.

• ICE and relevant transmission information:

Microsoft UC uses IETF’s Interactive Connectivity Establishment (ICE) for NAT traversal to allow media to successfully traverse the variety of firewalls that may exist between users. It is one of the key innovations of Microsoft UC, supporting nomadic users’ access without VPN. Metrics CDR contain relevant information on the operation of ICE.

• In total more than 30 parameters that pertain to quality are logged by each end-point in a call.
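To illustrate how a perceptive quality metric such as a MOS estimate can be derived from low-level NSQ measurements, the following sketch uses a simplified form of the ITU-T G.107 E-model. This is a generic, public technique; it is not Microsoft's proprietary estimation algorithm, and the loss-impairment coefficients are illustrative.

```python
import math

def r_factor(loss_pct, one_way_delay_ms):
    """Simplified E-model: map packet loss and delay to an R factor."""
    r = 93.2 - 0.024 * one_way_delay_ms      # base R minus linear delay term
    if one_way_delay_ms > 177.3:             # extra penalty for large delay
        r -= 0.11 * (one_way_delay_ms - 177.3)
    # Loss impairment, logarithmic in loss rate (illustrative coefficients).
    r -= 30 * math.log(1 + 15 * loss_pct / 100)
    return max(0.0, r)

def mos_from_r(r):
    """Standard R-to-MOS mapping used by the E-model."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

# A clean call scores near toll quality; 10% loss degrades it sharply.
print(round(mos_from_r(r_factor(0, 50)), 2))
print(round(mos_from_r(r_factor(10, 50)), 2))
```

Such a per-call estimate is the kind of value that could populate a Metrics CDR alongside the raw NSQ parameters.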

4.4.2 Introducing the Microsoft UC Monitoring Server

The Metrics Call Detail Records can be routed to the Microsoft UC Monitoring Server (a new Office Communications Server 2007 server role, shipping as a Tech Preview at RTM). The Monitoring Server collects the metrics, provides reporting interfaces, and runs analytics on the data. It can provide root cause analysis and alarms.

As an example of how it might help an administrator, suppose that, as adoption and usage of Microsoft UC grow, a specific network link becomes progressively saturated. With traditional IP telephony, there might be little advance warning before calls actually start to block or bypass to the PSTN, leaving the network administrator scrambling to augment resources.

With Microsoft UC, as the link progressively saturates, the media stack adapts so that all calls can still be served. Call quality might be slightly reduced for all calls as a result of the dynamic adaptation, but calls will not block or need to be sent to the PSTN. Metrics will show bandwidth reduction, a stepwise degradation in quality, and other tangible signs that the link is congested, providing advance notice that it will need to be augmented. There remains ample time to do so before calls are blocked, enabling a better process.
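A monitoring system could surface such a trend with a simple heuristic: compare the recent mean per-call audio bit rate against a long-run baseline, and alert when the stack appears to be adapting down. The sketch below is illustrative only; the field names and thresholds are hypothetical, not those of the Monitoring Server.

```python
def saturation_alert(bandwidth_samples, window=20, drop_ratio=0.8):
    """Flag a link when the recent mean per-call audio bit rate falls
    well below the long-run baseline - a sign the media stack is
    adapting down because the link is congesting (hypothetical rule)."""
    baseline = sum(bandwidth_samples[:window]) / window
    recent = sum(bandwidth_samples[-window:]) / window
    return recent < drop_ratio * baseline

# Simulated per-call bit rates: calls start near 45 kbps, then step down.
history = [45000] * 30 + [38000] * 10 + [30000] * 20
print(saturation_alert(history))  # True: the downtrend trips the alert
```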

4.5 A media stack designed for quality on IP networks

Windows Live Messenger and other modern consumer voice solutions have clearly demonstrated over the past two years or so that advanced voice solutions can operate successfully on the most aggressive of unmanaged networks. That success requires a new kind of media stack. The one in Windows Live Messenger has already proven itself at scale in the real world, providing well in excess of 1 billion minutes of voice per month to consumers throughout the world.

While those new capabilities are now broadly available to consumers, they had yet to be introduced in the domain of Enterprise Voice solutions. The new media stack in Office Communicator 2007 is an evolution of the voice stack in recent versions of Windows Live Messenger, and was developed by the same media team in the Microsoft UC Group.

The Microsoft UC media stack incorporates numerous innovative capabilities:

• Manage and optimize the quality and make the best possible use of the existing network resources:

RTAudio (Microsoft Real Time Audio Codec)49 is a modern wideband speech codec designed for real-time, two-way Voice over IP applications. RTAudio is the preferred Microsoft® real-time audio codec and the default codec for Microsoft's Unified Communications platforms. Legacy codecs are also supported by the platform for interoperability50.

The encoder encodes single-channel (mono), 16-bit-per-sample audio signals. It can operate in narrowband (8 kHz sampling rate) or wideband (16 kHz sampling rate), and can run in Constant Bit Rate (CBR) mode or, for maximal efficiency, in Variable Bit Rate (VBR) mode. VBR mode's efficiency results from real-time adaptation of the codec and media stack to the richness and complexity of the speech signal. Because some parts of speech (vowels in particular) are less complex than others (such as consonants and fricatives), they require less information to be sent; in VBR mode RTAudio instantly adapts the bit rate to that complexity, gaining efficiency over a fixed-bit-rate codec. The media stack regulates itself to a mean rate over any period of a few seconds or more.

RTAudio Encoder: RTAudio is a sub-band coder. The number of sub-bands that it uses is dependent on the sampling frequency used. For a sampling frequency of 8 kHz it uses a single band. For sampling frequencies greater than 8 kHz it uses multiple bands, divided unequally or equally within the full signal bandwidth. Most information for voice data is contained in the lower bands. Therefore more bits are allocated for the lower band, with bit allocation progressively decreasing for the higher bands. The figure below provides a high level representation of the RTAudio encoder structure.

49RTAudio licensing: RTAudio is distributed via a source code Porting Kit that includes support for both wideband and narrowband modes and both encoding and decoding functionality. Details are available at: http://www.microsoft.com/downloads/details.aspx?FamilyID=5d79b584-79c9-42a8-90c4-4ab3f03d19c4&DisplayLang=en 50G.711, G.722.1/SIREN, G.723.1, G.726 and GSM are supported; G.729 is not supported at this time.

High-level Overview of the RTAudio Encoder

The input signal is split into sub-band signals using sub-band filters if the number of bands is more than one. A rate control module determines the encoding modes for each sub-band based on several factors including the signal characteristics of each sub-band, the bit stream buffer history and the target bit rate. Generally fewer bits are needed for “simple” frames, such as unvoiced and silent frames, and more bits are needed for “complex” frames, such as transition frames.

The encoder structure for each sub-band consists of one or more code-book blocks, a Linear Prediction Coefficients (LPC) analysis block, and a synthesis filter. There are several pre-defined mode sets (each defined by a combination of different code books) for each sampling rate, corresponding to different coding bit rates. The rate control module determines the mode set for each frame.

The encoder also includes a unit that can optionally be used to embed error recovery data as well as redundant information in the RTAudio bitstream. This additional information is inserted in the bitstream when the codec operates with Forward Error Correction mode enabled as described later.

RTAudio Decoder: the RTAudio decoder operates in pull mode, as illustrated in the figure below. At the front end of the decoder, a jitter-control module actively manages the packet jitter and packet loss that typically occur in IP networks.

High-level Overview of the RTAudio Decoder

The decoder also includes the capability for error concealment. Error concealment along with jitter control improves the overall audio quality experienced by the end user under packet loss conditions.

Multi-rate codec: the codec is multi-rate and the bit rate is driven by the logic of the Quality Controller, enabling real time adaptation to the actual measured conditions, including NSQ conditions. RTAudio features multi-rate adaptation with 6 wideband and 3 narrowband rates.

The default mean bit rate in wideband mode is about 45 kbps (fully loaded with RTP/UDP/IP overhead). The mean bit rate in that wideband mode ranges between 24 kbps and 45 kbps (fully loaded) in the normal operating mode of the codec (i.e. without the FEC redundancy that is only used in case of extreme packet loss in conditions where bandwidth is readily available). With the redundancy, the mean bit rate in wideband mode ranges between 43 kbps and 74 kbps.

The default mean bit rate in narrowband mode is about 28 kbps (fully loaded). The mean bit rate in that narrowband mode ranges between 15 kbps and 28 kbps (fully loaded) in the normal operating mode of the codec without redundancy. With the redundancy, the mean bit rate in narrowband mode ranges between 25 kbps and 40 kbps.

RTAudio bit rates

Sampling type        Frame size   Total payload target     Fully loaded target bit rate (bits / sec)
                     (ms)         bit rate (bits / sec)    No redundancy     With redundancy
Wideband (16 kHz)    20           29000                    45000             74000
Wideband (16 kHz)    40           26500                    34500             61000
Wideband (16 kHz)    60           25666                    31000             56667
Wideband (16 kHz)    20           21000                    37000             58000
Wideband (16 kHz)    40           19500                    27500             47000
Wideband (16 kHz)    60           19000                    24333             43333
Narrowband (8 kHz)   20           11800                    27800             39600
Narrowband (8 kHz)   40           10300                    18300             28600
Narrowband (8 kHz)   60           9800                     15133             24933
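The "fully loaded" figures follow from the payload rates by adding packet header overhead: each RTP/UDP/IP packet carries roughly 40 bytes (320 bits) of headers, and the packet rate is one packet per frame. The sketch below is a back-of-the-envelope check under that assumption, not official sizing guidance.

```python
def fully_loaded_bps(payload_bps, frame_ms, header_bytes=40):
    """Add RTP/UDP/IP header overhead to a payload bit rate,
    assuming one packet per audio frame."""
    packets_per_sec = 1000.0 / frame_ms
    return payload_bps + header_bytes * 8 * packets_per_sec

# 29000 bps payload at 20 ms frames -> 29000 + 320*50 = 45000 bps
print(int(fully_loaded_bps(29000, 20)))   # 45000
print(int(fully_loaded_bps(11800, 20)))   # 27800
```

Note how the overhead share shrinks with larger frames (fewer packets per second), which is visible in the 40 ms and 60 ms rows of the table.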

Bandwidth efficiency: RTAudio is generally more efficient than traditional codecs, consuming significantly less bandwidth to deliver equivalent audio quality, or delivering significantly higher quality at similar bandwidth consumption than codecs such as G.711, G.729 or G.726 – while supporting a much wider range of network conditions.

Bit rate comparison with traditional codecs

Codec               Frame size   Total payload target     Fully loaded target bit rate,
                    (ms)         bit rate (bits / sec)    no redundancy (bits / sec)
RTAudio (16 kHz)    20           29000                    45000
RTAudio (8 kHz)     20           11800                    27800
G.711               20           64000                    80000
G.729               20           32000                    48000
G.726               20           8000                     24000

Quality Controller (QC) is a sophisticated software component that concurrently uses all data inputs from the transport and payload layers of the media stack to adapt, dynamically and in real time, the behavior of the sending end-point so as to optimize the perceptible experience at the receiving end-points (users). The QC takes into consideration a variety of parameters, including the number and types of media streams, the current estimated available bandwidth, the available negotiated codecs, and the NSQ conditions. This optimization includes the enablement and management of media redundancy in the form of Forward Error Correction.

The QC also manages the dynamic selection of the highest-quality codec, adjusting the nominal bit rates in real time during the call as conditions change. The media stack dynamically (over a few seconds) adapts to actual real-time conditions whenever a link gets temporarily saturated, regardless of whether the congestion results from too many real-time media sessions or from other IP traffic. In that case, the media stack at each end-point involved in a session detects that network conditions are deteriorating toward congestion and adapts dynamically, reducing its bit rate progressively in several steps, if needed all the way down to about one third (audio) or one quarter (video) of the initial bit rate. Quality is of course progressively reduced as well, but even at the lowest bit rate it remains better than with a G.729 codec at about the same bit rate per stream; meanwhile, no session has been dropped. This capability is unique in Enterprise IP telephony.
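The stepwise adaptation described above can be sketched as a simple rate ladder: step down a rung when loss indicates congestion, step back up when the network is clean. This is a hedged illustration in the spirit of the QC; the rate values and loss thresholds are illustrative, not Microsoft's actual logic.

```python
# Illustrative wideband rate ladder (bps), highest quality first.
WIDEBAND_LADDER = [45000, 37000, 31000, 27500, 24333, 21000]

def adapt_rate(current_bps, loss_pct, ladder=WIDEBAND_LADDER):
    """Step down one rung under congestion, back up when clean."""
    i = ladder.index(current_bps)
    if loss_pct > 3 and i < len(ladder) - 1:
        return ladder[i + 1]      # congestion: reduce bit rate
    if loss_pct < 1 and i > 0:
        return ladder[i - 1]      # clean network: recover quality
    return current_bps

rate = 45000
for loss in [5, 5, 5, 0, 0]:      # a congestion episode, then recovery
    rate = adapt_rate(rate, loss)
print(rate)                       # 37000: partway back up the ladder
```

The key property is that calls are never dropped; they trade bit rate for continuity and climb back as conditions improve.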

Voice Activity Detection/Silence Suppression (VAD/SS) detects the presence or absence of speech from the user's microphone and stops packets from being sent over the network when the user is not talking (in tandem with the AGC, which ensures there is no clipping of the audio). VAD/SS ensures no "empty" packets are sent, which reduces the total number of packets sent during a call and optimizes transmission bandwidth, making it available for other calls. The primary function of the Voice Activity Detection component is to classify the captured audio as speech or non-speech; the Silence Suppression module uses this information to filter out the silence packets.
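The classification step can be illustrated with a minimal energy-based detector. Real VAD implementations use far more sophisticated spectral features; the threshold here is arbitrary and the example is only a sketch of the principle.

```python
def is_speech(frame, threshold=0.01):
    """Classify a frame of normalized samples by mean energy
    (a deliberately simple stand-in for real VAD features)."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold

silence = [0.001] * 160                 # a 20 ms frame at 8 kHz, near-silent
speech = [0.3, -0.25, 0.28, -0.3] * 40  # a loud, active frame
# Silence suppression would drop the first frame and send the second.
print(is_speech(silence), is_speech(speech))  # False True
```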

Noise Suppression (NS) isolates and removes non-speech signals from the signal captured by the microphone to produce a cleaner representation of the speech signal. It has the side effect of decreasing the amount of information that the speech codec must compress, resulting in additional savings on the transmission bandwidth needed for speech signals.

Automatic Gain Control (AGC) detects the energy level of the speech input signal. The digital element of AGC adjusts the level of the digital signal to provide a comfortable listening level for the user. The analog element of AGC is designed to prevent clipping of the audio from the user’s microphone.

• Proactively work to prevent and compensate the effects of impairments:

Forward Error Correction (FEC): under control of the QC, the RTAudio encoder can provide additional protection against data loss during transmission using Forward Error Correction. This is over and above the built-in error concealment scheme in the audio codec core, which is useful for mitigating degradation in voice quality due to packet loss but proves less useful under very high packet loss.

With FEC, redundant information can be optionally embedded in the RTAudio bitstream to supplement the error concealment module with additional recovery data, which is used by the receiving end-point to reconstruct media packets lost during transmission.

The use of FEC increases the data rate of the media stream, so the application must decide whether the increase is worthwhile and possible. In general, FEC is most beneficial at high loss rates, starting around 10%. At loss rates of 20% and above the concealment algorithm alone will likely be insufficient to obtain reasonable quality, and FEC may prove necessary.
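The recovery principle can be illustrated with a generic XOR-parity scheme: one parity packet protects a group of media packets, so any single loss in the group can be rebuilt by the receiver. RTAudio's actual FEC embeds redundancy in the bitstream itself; this is a generic stand-in, not the RTAudio algorithm.

```python
def xor_parity(packets):
    """Compute a parity packet over equal-length byte strings."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """Rebuild the single missing packet (None) from the survivors:
    XOR of the survivors and the parity yields the lost packet."""
    survivors = [p for p in received if p is not None]
    return xor_parity(survivors + [parity])

group = [b"pkt1", b"pkt2", b"pkt3"]
parity = xor_parity(group)
lost = [b"pkt1", None, b"pkt3"]   # packet 2 lost in transit
print(recover(lost, parity))      # b'pkt2'
```

The bandwidth cost is visible directly: one parity packet per group, which is why FEC is reserved for conditions where loss is high and bandwidth is available.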

Acoustic Echo Cancellation (AEC) is a time-varying, adaptive digital filter responsible for eliminating the echo signal resulting from a user’s speaker sounds being fed back into the user’s microphone. The Acoustic Echo Cancellation filter effectively prevents these echoes from reaching the other participants in the call.

• Work on the receiving end to provide the best experience from the signal received:

Reconstruction of the signal to which FEC was applied by the sending end-point.

Time Warping Jitter Buffer: the warping jitter buffer dynamically adjusts the audio playout speed to optimize both quality and latency as a function of the actual jitter conditions. Its dynamic capabilities minimize the buffering impact on latency in low-jitter conditions, and it transitions smoothly to and from high-jitter conditions by varying the buffer length and the playing speed in a manner that is barely noticeable.
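The sizing side of such a buffer can be sketched with a percentile heuristic: target a depth that covers most of the recently observed jitter, so latency stays low on a calm network and grows only when jitter demands it. This illustrates the trade-off only; the time-warping playout itself, and the actual Office Communicator algorithm, are not modeled here.

```python
def target_buffer_ms(jitter_samples_ms, percentile=0.95, floor_ms=20):
    """Pick a buffer depth covering `percentile` of observed jitter,
    never going below a small floor (illustrative heuristic)."""
    ordered = sorted(jitter_samples_ms)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return max(floor_ms, ordered[idx])

calm = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2]       # jitter in ms, quiet network
bursty = calm + [60, 80, 75]                # a jitter burst arrives
print(target_buffer_ms(calm), target_buffer_ms(bursty))  # 20 80
```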

Packet Loss Concealment (PLC): a module that re-samples the speech signal to provide near-seamless recovery of audio packets that have been lost, making low or moderate packet loss imperceptible to the user. PLC is most beneficial at low loss rates.

It is the unique combination of intelligent end-points and an advanced media stack that enables Microsoft UC to provide superior QoE over traditional IP telephony solutions.

Microsoft UC Media Stack map

4.6 Video: the next frontier

4.6.1 The challenges of video

The contents of this paper have primarily been focused on measuring the quality of voice calls and on demonstrating how the Microsoft Unified Communication platforms and solutions are designed to optimize the quality of this experience. In time, however, it can be expected that video interactions will increasingly complement voice interactions. Microsoft UC solutions offer advanced video capabilities that make those interactions possible.

Unlike PSTN networks that have been designed for the delivery of voice signals, IP networks allow concurrent transmission of digital audio and video. While the addition of video enhances the quality of the interaction, it also brings new and challenging elements to the domain of Quality of Experience. These new challenges arise both from the complexity of video signals and from the interaction of video and audio streams. The complexity of digital video signals results from their three-dimensional nature (picture width, picture height and frame rate). Each dimension interacts with the other two in a time-varying fashion as objects and persons in the video are in motion. Requirements on interaction of audio and video include in particular maintaining their synchronization from capture to playback to achieve the highest Quality of Experience.

4.6.2 Factors affecting perceived Video Quality

As is the case for speech, network factors such as limited transmission bandwidth, packet loss, jitter, and delay (round trip time, RTT) can all generate Video Quality impairments. Bandwidth limitations in particular can have dramatic effects. Large delays can also render recovery mechanisms ineffective (such as the receiving endpoint sending recovery parameters for the sending endpoint to act on).

Beyond network factors and as was the case for audio signals, managing Video Quality requires an in-depth understanding of payload factors, in particular the video content, channel coding (or Error Correction), bit rate, and a new, video specific parameter called Group of Pictures. These factors interact constantly, adding to the complexity.

Content, and especially its rate of change or motion, obviously has a significant influence on quality. For example, finely detailed, fast moving content is more challenging than uniform, static content, and would require a higher bit rate. If reduced network bandwidth or some other reason forced the video encoder to reduce the bit rate and to quantize the video more coarsely, the resulting decoded video would show obvious blocking artifacts as illustrated below.

The picture on the left displays the experience of a video stream that has been encoded at 20 kbps; the one on the right shows the same video stream encoded at 400 kbps. The blocking artifacts of the picture on the left are the result of the aggressive compression applied to the video source to honor the lower bit rate. In most usage cases the resulting quality would not be acceptable to the end-users.

A Group of Pictures (GoP) is a fundamental and independent structure in a compressed video bit-stream, designed to reduce the average bandwidth requirement at the high frame rates required to provide an acceptable experience. The first frame, called the key frame, carries all the information required to reconstruct it. All other video frames in the GoP are coded differentially from the first frame. Both the length of a GoP and its algorithmic make-up (frame dependency rules) are of prime importance to the robustness of video against network factors such as packet loss and packet delay.
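The robustness point can be made concrete with a toy model: under simple forward prediction, losing a frame corrupts every later frame in the GoP until the next key frame resets the dependency chain. This assumes each frame depends only on its predecessor, which is one common (but not the only) GoP structure.

```python
def frames_corrupted(gop_length, lost_frame_index):
    """Frames rendered unusable in one GoP when the frame at
    `lost_frame_index` is lost (0 = key frame), assuming each
    frame is predicted from its immediate predecessor."""
    return gop_length - lost_frame_index

# In a 30-frame GoP, losing the key frame ruins the whole group,
# while losing the last frame costs only that single frame.
print(frames_corrupted(30, 0), frames_corrupted(30, 29))  # 30 1
```

This is why both GoP length and dependency rules matter: shorter GoPs bound the damage from any single loss, at the cost of sending expensive key frames more often.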

Endpoint factors also have an impact on Video Quality. Local hardware resources such as CPU, GPU, and dynamic memory dictate the computing power available for real-time video encoding and decoding. Camera resolution and display addressability play a significant role in perceived quality. Error concealment and error recovery techniques, designed into receivers to hide or recover from packet loss, are perhaps the most important factors in Video Quality.

Factors impacting Video Quality

4.6.3 Measuring the video experience

For historical and technical reasons, measurement of video experiences has trailed measurement of voice experiences. Over the past few years, however, the industry has embarked on establishing Video Quality evaluation methodologies that quantify the impact of the factors listed above; several solutions are already becoming available. As was done for speech, the aim is to derive a Video MOS that quantifies the overall perceived quality of a video call. In parallel, the ITU Video Quality Experts Group has undertaken the task of deriving an objective perceptual quality model, as was done for speech. It is still too early to determine with certainty what the final solution will be, but the research and development investments already made in this area have uncovered its main foundational elements.

4.6.4 Microsoft UC video

Microsoft UC RTVideo codec

The video transmission capability in the Microsoft Unified Communications platforms is designed to operate efficiently with respect to the network, payload, and endpoint factors listed above. The compression engine used to compress video in real time, RTVideo, is based on the new SMPTE (Society of Motion Picture and Television Engineers) 421M standard51 and, as shown below, includes a system-level enhancement allowing packet loss recovery. It has an operational range of 15 kbps to 350 kbps per stream for CIF at 15 frames per second. The combination of this state-of-the-art video codec and a system-level error recovery mechanism at the transport layer makes the Microsoft Unified Communications platforms a robust solution tuned to maximize end-to-end Video Quality of Experience on any network.

4.7 Evidence of the superior Quality of Experience of Microsoft UC

4.7.1 Evidence from the consumer space

Beyond the theory, there is factual evidence of the superior Quality of Experience of Microsoft UC. The broad usage of Windows Live Messenger is a first, scale-tested demonstration of the capabilities of the stack: Windows Live Messenger now serves in excess of 1 billion voice minutes per month using Microsoft UC's voice media capabilities. Independent reviews have picked the new Windows Live Messenger Voice Quality as the best in its class: "Call quality was superb--the best experience of any IM application we've recently tested. Our voices sounded crystal clear, without any detectable echo or choppiness. In fact, we completely forgot that we were talking over the Internet"52.

51"Quality and Compression: The Proposed SMPTE Video Compression Standard VC-1" by Shankar L. Regunathan, Ann Marie Rohaly, Regis Crinon and Patrick Griffis, SMPTE Motion Imaging Journal, Vol. 114, No. 5, May 2005, pages 194-201.

4.7.2 Evidence from third-party benchmarking in enterprise-like setting

But how does Microsoft UC perform in enterprise-like settings? In order to provide an expert’s answer to that question, Microsoft commissioned a third-party benchmarking study, which was prepared by Psytechnics Limited53. Psytechnics evaluated the speech quality performance of Microsoft’s Office Communicator 2007 client (using a pre-Beta 3 version of Office Communicator with prototype wideband USB handsets designed by Microsoft UC) in comparison with Cisco CallManager version 5.0 using 7961 IP phones.

The study was conducted under controlled test conditions representing a wide range of operational scenarios using both subjective and objective experiments. The subjective evaluation portion of the study included the following controlled levels of Network Service Quality impairments representing the most common conditions encountered by IP telephony solutions54:

52http://www.pcworld.com/article/id,124187/article.html and http://www.pcworld.com/article/id,123954-page,1/article.html 53Psytechnics (http://www.psytechnics.com) is an independent Microsoft UCG QoE partner which owns patented technology at the heart of 5 ITU-T standards including PESQ and which has developed an especially effective set of algorithms to provide real time estimates of user’s subjective perception (typically expressed as MOS) across a very wide range of conditions, on the basis of its own extensive subjective testing. Psytechnics performed this study as part of an ongoing benchmarking and performance analysis consulting program. 54The IP impairment conditions used for the subjective experiments were derived from the ITU-T G.1050 model (http://www.itu.int/itudocr/itu-t/aap/sg12aap/history/g1050/g1050_ww9.doc). Actual measured average packet loss ranged 0% to 25%, mean absolute jitter 0 ms to about 45 ms (with min/max jitter ranging -3 ms/3 ms to about -500 ms/500 ms), base delay was 50 ms one way and packet delay (90th percentile) ranged 0 ms to about 600 ms. This subjective experimentation was conducted both using North-American English (clean speech) and using British English with office “babble” at 25dB signal to noise ratio. Only main results of the North-American English experiment are shown here. British English experiment led to very similar relative results.

Main Scenarios                   Packet loss   Burstiness   Jitter
"Perfect NSQ" network            Zero          Zero         Zero
Typical enterprise network       Low           Low          Medium
Overloaded enterprise network    Medium        High         Medium/High
Multisite connectivity           Medium        High         High
Internet conditions              High          Medium       High

4.7.3 Results of the subjective benchmarking study

As illustrated in the graphs below, the study concluded that the one-way listening speech quality provided by the combination of Microsoft UC's client and USB handset was consistently better than that provided by the Cisco CallManager and IP phones tested, across the main scenarios used for subjective testing. That was the case both when the Microsoft UC client was used in its wideband mode and when it was constrained to operate in narrowband mode only, in comparison with the Cisco solution operating with the G.711 codec or the G.729 codec respectively.

When comparing Microsoft UC in wideband mode with G.711 across the main scenarios described in the table above, Microsoft UC averaged a 0.98 MOS-point advantage (ranging between 0.81 and 1.39 MOS points). When comparing Microsoft UC in narrowband mode with G.729 across the same scenarios, Microsoft UC averaged a 0.86 MOS-point advantage (ranging between 0.41 and 1.42 MOS points).

4.7.4 Other results of the benchmarking study

The study also included 16 objective experiments using the PESQ algorithm. Those experiments enabled the exploration of a wider set of IP impairments than could be accommodated in the subjective experiments. The results of the objective experiments confirmed and reinforced the results of the subjective experiments.

Overall, the study clearly confirms the superior Quality of Experience of Microsoft UC over the traditional IP telephony solution tested, in virtually all likely NSQ conditions.

4.7.5 Microsoft UC delivers Quality of Experience

Of course, Microsoft UC tends to deliver better Quality of Experience when the NSQ is better; well-designed, right-provisioned networks (possibly with well-administered class-based QoS) will more closely resemble the "perfect" or "typical" conditions modeled above. But even in extreme cases of very poor NSQ, such as saturated links, best-effort networks, or the Internet, Microsoft UC is still capable of delivering acceptable Voice Quality where traditional IP telephony solutions can become practically unusable.

Microsoft UC provides an excellent experience on well-designed and well-managed networks that deliver good NSQ, acceptable quality on networks with mediocre NSQ, and can still deliver successful calls in many conditions, such as very aggressive or best-effort networks with very poor NSQ, where traditional IP telephony solutions would no longer work.

5 Network design and management for Microsoft UC

5.1 Right-provisioning for Microsoft UC

5.1.1 Good network design matters more than ever!

Because the media stack tolerates more NSQ variations, the network design requirements to enable Microsoft UC are less stringent than those of traditional IP telephony solutions. Microsoft UC is designed to work on a wide range of networks and Network Service Quality conditions, so as to support users on all the networks on which they work rather than only on the company’s managed network.

That is not to say that Microsoft UC does not benefit from networks with great NSQ. Those are the networks on which users can best enjoy the full breadth of the Microsoft UC Quality of Experience, providing not only Enterprise Grade voice but also going beyond its traditional definition with the richness of the wideband audio codecs and the multimedia experience of combined voice, video and data.


Whenever possible, Microsoft recommends the use of the best possible network in support of Microsoft UC. Therefore it is important to consider a few simple steps with respect to corporate network design to enable Microsoft UC.

• A base level of network capacity is mandatory

There is no free lunch: neither QoE nor QoS is a panacea for slow, under-provisioned networks. Right-provisioning the enterprise network and optimizing its topology for delay and flow is a key step in enabling any IP solution or service, including Microsoft UC.

• Some base level of QoS constructs may be necessary in some parts of the network

Right-provisioning should be the norm, but there will occasionally be portions of the network where it is not a practical answer for either cost or availability reasons. Satellite uplinks, expensive and slow WAN links especially to international locations, and similar circumstances represent exceptions to that norm. In those circumstances, it may be necessary to implement network level management of traffic with QoS techniques.

• Where QoS is necessary, use class based prioritization rather than flow based reservation

Where implemented, network-level management should be done at a gross level, using class-of-traffic prioritization techniques such as DiffServ, rather than on a per-flow, per-endpoint, per-application reservation basis. Class-based approaches are more cost-effective, easier to configure and manage, more scalable, and sufficient to support the Microsoft QoE approach.

Where all the steps above are implemented (by qualified network professionals or systems integrators), Microsoft UC QoE can help deliver an industry-leading experience, providing a very effective and economical way of delivering Enterprise-grade voice quality.

Whenever these steps are not possible, such as on unmanaged networks, Microsoft UC QoE can help deliver the “best possible” solution.

5.1.2 RTAudio, a very efficient codec

The most important aspect of designing for UC is to ensure there is enough bandwidth. Scarce bandwidth and congestion are key causes of other NSQ impairments (jitter, packet loss and delay) which might not only affect the UC payload but all other traffic on the network as well. The second most important aspect is to minimize delay (processing time in particular) through appropriate network design.

RTAudio is a more efficient codec than traditional IP telephony codecs. It generally takes less bandwidth to enable Microsoft UC than to enable traditional IP telephony and the resulting Voice Quality is generally superior.


As compared to traditional codecs, the Microsoft UC media stack, with its ability to dynamically adapt to network conditions, provides a very substantial efficiency and headroom advantage.

Specifically, RTAudio in its default wideband mode, fully loaded, has a mean bandwidth consumption of 45 kbps per stream55 (without the redundancy of Forward Error Correction, which is only triggered on an as-needed basis in extreme packet loss conditions when bandwidth is available). That 45 kbps mean bandwidth per stream compares favorably with the traditional G.711 codec, which requires a fixed bit rate of circa 80 kbps per stream fully loaded56.

Furthermore, G.711 frequently does not have silence suppression enabled57 (because silence suppression tends to induce clipping in traditional implementations and to damage the subjective experience), and therefore requires 160 kbps across both channels for a full-duplex session. Meanwhile, the Microsoft UC media stack has a very effective VAD/SS capability that reduces the overall bandwidth consumption of a full-duplex session without damaging the subjective experience. Such a session would need 90 kbps of mean duplex bandwidth without VAD/SS, but much less with it, say on average 63 kbps depending on conversation pattern. In those conditions, Microsoft UC in its wideband mode would require about 2.5 times less bandwidth per session on average than G.711, while generally delivering markedly superior Voice Quality.
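The bandwidth comparison above reduces to simple arithmetic; as an illustration only, using the per-stream figures quoted in this section (the 63 kbps mean with VAD/SS is the document's own indicative figure, not a measurement):

```python
# Illustrative arithmetic only, using the per-stream figures quoted above.
RTAUDIO_WB_KBPS = 45   # RTAudio wideband, mean per stream, fully loaded
G711_KBPS = 80         # G.711, fixed per stream, fully loaded

# A full-duplex session consists of two streams.
g711_duplex = 2 * G711_KBPS                  # no silence suppression: 160 kbps
rtaudio_duplex_no_vad = 2 * RTAUDIO_WB_KBPS  # 90 kbps
rtaudio_duplex_vad = 63                      # indicative mean with VAD/SS, per the text

ratio = g711_duplex / rtaudio_duplex_vad
print(f"G.711 duplex: {g711_duplex} kbps")                          # 160 kbps
print(f"RTAudio duplex with VAD/SS: ~{rtaudio_duplex_vad} kbps")    # ~63 kbps
print(f"Bandwidth advantage: ~{ratio:.1f}x")                        # ~2.5x
```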

In summary, networks designed for Microsoft UC will require less bandwidth for voice than networks designed for traditional telephony using the most common G.711 codec (both with or without VAD/SS implemented).

5.1.3 Elasticity: the Microsoft UC media stack's ability to adapt seamlessly to saturation

When a link gets saturated, the media stack adapts dynamically by reducing the mean bit rate of the codec over a few seconds, if needed down to a mean bit rate of about 15 kbps per stream. At that rate with VAD/SS the full duplex session might require on average about 21 kbps. Under similar circumstances, traditional IP telephony solutions would generally use a G.729 codec without silence suppression, which would typically require about 24 kbps per stream or about 48 kbps for the full duplex session. Here too RTAudio has a significant advantage on bandwidth per session while generally delivering significantly superior Voice Quality58.

55 In this document we call each half of a full-duplex call (such as the incoming half or the outgoing half) a stream, or call leg.
56 http://www.cisco.com/en/US/tech/tk652/tk698/technologies_tech_note09186a0080094ae2.shtml
57 Many traditional IP telephony solutions have some Voice Activity Detection/Silence Suppression (VAD/SS) capability, but those are typically rudimentary and often induce clipping and other quality issues; therefore they are rarely turned on. Microsoft UC has a very effective VAD/SS capability that does not materially impact Voice Quality.
58 Another advantage of the dynamic adaptation is that it happens on a session-by-session basis: only sessions whose media traverses a saturated link adapt, while all other sessions not going through that link enjoy the best possible quality. In traditional IP telephony solutions, the choice of codec often has to be made as a system-wide default rather than dynamically, so a single congested link can force the selection of a codec that penalizes the quality of all conversations on the entire network.


With that ability to dynamically adapt, the Microsoft UC media stack provides dynamic elasticity: up to about three times as many total sessions can be supported as before the onset of congestion. If a link was right-sized to support a specific number of RTAudio sessions with the best possible wideband experience, and more than that number of sessions were requested, the media stack would not reject the incremental sessions, but rather dynamically accommodate them by renegotiating the bit rate of all sessions as needed. This can enable up to about three times as many sessions, at which point all sessions would use the smallest narrowband bit rate of RTAudio, with a progressive reduction in the quality of experience for all calls.
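The roughly threefold headroom follows directly from the rates quoted in this section; a minimal sketch, where the link size is a hypothetical example and the 45 kbps wideband mean and ~15 kbps narrowband floor are the figures stated above:

```python
# Sketch of the elasticity headroom described above. LINK_KBPS is a
# hypothetical link sized for ten wideband streams; the codec rates are
# the figures quoted in this section.
LINK_KBPS = 450
WIDEBAND_KBPS = 45          # RTAudio default wideband mean, per stream
NARROWBAND_FLOOR_KBPS = 15  # approximate lowest adapted mean, per stream

designed_sessions = LINK_KBPS // WIDEBAND_KBPS         # streams at design quality
max_sessions = LINK_KBPS // NARROWBAND_FLOOR_KBPS      # streams at the floor rate

print(designed_sessions, max_sessions)  # 10 30: about three times the design load
```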

In traditional IP telephony implementations, where the choice of codec is static and CAC is required, once the link is full calls are either blocked (a source of end-user frustration) or sent to the PSTN via a CAC/PSTN fallback solution (at incremental cost). With Microsoft UC, the probability of blocking and the need for PSTN fallback are virtually eliminated by the dynamic elasticity.

5.1.4 Right provisioning for Microsoft UC

In light of the above discussion, Microsoft UC's recommendation is to right-provision all links to a throughput of 45 kbps per audio stream (and 300 kbps per video stream, if video is enabled) for busy-hour traffic.

The codecs' ability to dynamically adapt to congestion (elasticity) should not be misconstrued as an opportunity to under-provision the network, and it is not advisable to size any link against the lower bit rates of the codec, as doing so could affect the ability of the media stack to deal dynamically with varying network conditions, such as temporary high packet loss. This design guideline is what ensures the best Quality of Experience in all circumstances. Provisioning the network in this manner enables it to absorb a sudden peak of incremental sessions (possibly up to about three times the initial design load) through dynamic bit rate adaptation; it also enables some sessions (for example sessions terminated on the Internet) to temporarily consume more bandwidth by sending redundant data to compensate for high packet loss under the FEC mechanism.
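The right-provisioning guideline above can be sketched as a simple calculation; the per-stream targets are those quoted in the recommendation, while the function name and the concurrency figures are hypothetical illustrations:

```python
# Busy-hour provisioning sketch using the per-stream targets quoted above.
# The function name and the example concurrency figures are hypothetical.
AUDIO_KBPS_PER_STREAM = 45
VIDEO_KBPS_PER_STREAM = 300

def required_kbps(audio_streams: int, video_streams: int = 0) -> int:
    """Bandwidth needed to right-provision a link for busy-hour traffic."""
    return (audio_streams * AUDIO_KBPS_PER_STREAM
            + video_streams * VIDEO_KBPS_PER_STREAM)

# e.g. a WAN link expected to carry 20 concurrent audio streams and 4 video
# streams at the busy hour:
print(required_kbps(20, 4))  # 900 + 1200 = 2100 kbps
```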

For links where right-provisioning is extremely costly or impractical, it may exceptionally be possible to provision for a lesser volume of traffic (such as the next-busiest hour) and let the elasticity of Microsoft UC absorb the difference between that traffic volume and the peak, at the cost of some reduction in Voice Quality and of a decrease in the headroom otherwise available to absorb sudden traffic peaks. Those links are also candidates for DiffServ-type class prioritization schemes.

With respect to delay, it is desirable to keep delay under 150 ms and to identify problem spots generating excessive delay. Delay is the one NSQ impairment that Microsoft UC cannot reduce, so it is important to find and eliminate the weak points. Propagation delay is rarely improvable, but processing delay can often be reduced, and right-provisioning is one of the most effective tools in that regard. Serialization delay can also be improved substantially by using faster links.

5.1.5 As a last resort: policing


Some enterprises may not be in a position to change their network in the short term, may be durably starved for bandwidth on some WAN links, and may be looking for more design advice.

As mentioned before, DiffServ can help on those links and should be considered locally. It cannot, however, hide or compensate for significant structural under-sizing. Deploying new capabilities that cannot function for lack of resources is bound to generate end-user frustration and disaffection with the solution. In such cases, it is advisable to use policies to control users' rights and the modalities they have access to.

In particular, it may be appropriate to disable video for users of sites served by very poor WAN links if those links cannot be improved, so as to preserve the audio – or even to only turn on audio for a subset of users.

5.2 Using Microsoft UC on a QoS enabled network

Many companies have already implemented QoS mechanisms and need to understand how Microsoft UC will integrate on a network with a QoS framework. Generally speaking, Microsoft UC can coexist harmoniously on networks where QoS is implemented as long as it is not unduly penalized by the QoS scheme59. It can also benefit from the local use of class based prioritization techniques such as DiffServ on specific network pain points. QoS should not however be considered a viable general alternative to right-sizing the network.

When DiffServ is already deployed, companies can, if they so desire, take advantage of the DSCP marking provided by the application. On networks that have multiple VLANs, Microsoft UC can be deployed on the default VLAN that carries the undifferentiated TCP and UDP traffic, as long, of course, as it is sized appropriately.

5.2.1 Microsoft UC natively supports DiffServ

Microsoft UC natively supports DiffServ through DSCP marking by the end-points, which can easily be turned on or off and modified through Group Policies.

DSCP marking can be enabled or disabled at the application level, should the network administrator so desire. For example, the network administrator might have implemented DiffServ for another business-critical application and not want to extend the benefits of prioritization to Microsoft UC, in which case there is nothing to do: by default, neither audio nor video packets are marked (Best Effort DSCP).

59Deploying Microsoft UC on a network where QoS is implemented for the benefit of a small volume of business critical traffic (where for example that traffic would benefit from Expedited Forwarding while all remaining traffic, including Microsoft UC, would coexist as Best Effort) represents a good example of situations where Microsoft UC would operate satisfactorily within a QoS scheme. On the other hand, situations where QoS is already implemented to prioritize a large volume of other traffic representing a significant proportion of the overall capacity (such as another IP telephony solution on an undersized network) while giving Microsoft UC a lower priority (where for example Microsoft UC traffic would be selectively dropped by RED or similar technologies) would constitute an extreme case of aggressive networks as seen from Microsoft UC traffic, and can lead to unpredictable experiences. Such configurations have not been tested for Quality of Experience.


When enabled, Microsoft Office Communicator 2007 instructs the transport layer of the operating system to set, by default, the Differentiated Services Code Point (DSCP) for voice media to the Expedited Forwarding class marking, and the DSCP for video media to Assured Forwarding Class 3. This means that on networks honoring the DSCP marking, voice is prioritized higher than video by default. These default settings ensure that on any network honoring end-point DiffServ marking, both voice and video media take advantage of the network's DiffServ prioritization as soon as the administrator enables packet marking, provided the ingress network switch is not set to overwrite end-point markings. At this time, Microsoft UC does not natively provide trusted-proxy interfaces for DSCP marking of Microsoft UC traffic by network elements.
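End-point DSCP marking of this kind is done through the operating system's socket layer. As a generic, hedged sketch (this is not the actual Communicator implementation), an application can mark outgoing UDP media by writing the DSCP into the top six bits of the IP ToS byte:

```python
import socket

# Generic sketch of end-point DSCP marking on a UDP socket; not the actual
# Communicator code path. The DSCP occupies the top six bits of the IP ToS
# byte, so DSCP 40 (the audio default quoted in this document) becomes
# 40 << 2 = 0xA0 on the wire.
DSCP_AUDIO = 40

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_AUDIO << 2)
print(hex(sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)))  # 0xa0
sock.close()
```

On networks honoring the marking, routers then queue these packets according to the configured DiffServ classes; no per-flow signaling is involved.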

Fine tuning of the DSCP setting is possible and documented for Windows 200060, Windows XP61, and Windows Vista62 to enable administrator determined DiffServ settings as an alternative to the default settings. The following paragraphs provide a brief summary of packet marking on Windows Vista for Microsoft UC. The OCS Documentation provides more detail.

• Guidelines for DSCP marking:

DiffServ QoS DSCP marking of packets by the end-point can be turned on by creating or modifying the following registry key with a REG_DWORD value of 1:

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\RTC\Transport\QoSEnabled

With QoSEnabled set to 1 as described above, packets are marked by default as:

    Audio: SERVICETYPE_GUARANTEED (DSCP 40, 0x28)
    Video: SERVICETYPE_CONTROLLEDLOAD (DSCP 24, 0x18)

On Windows Vista, to use values different from the defaults listed above, run gpedit.msc as Administrator to modify them. As shown in the illustration, navigate to:

60http://www.microsoft.com/technet/prodtechnol/windows2000serv/plan/qosover8.mspx 61http://www.microsoft.com/technet/network/qos/default.mspx 62http://technet.microsoft.com/en-us/windowsvista/aa905087.aspx#ERH


gpedit.msc > Computer Configuration > Administrative Templates > Network > QoS Packet Scheduler > DSCP values for conforming packets

If necessary, enable the Guaranteed service type and/or Controlled load service type settings and modify them to the desired DSCP value, expressed in DiffServ Byte Format.
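The relationship between a 6-bit DSCP value and the full DiffServ/ToS byte is a frequent source of confusion when filling in these fields (the exact format each tool expects should be checked against the OCS documentation); the byte form is simply the DSCP shifted left by two bits:

```python
# Helper illustrating the DSCP <-> DiffServ byte relationship: the 6-bit
# DSCP occupies the upper six bits of the ToS byte (the lower two are ECN).
def dscp_to_diffserv_byte(dscp: int) -> int:
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit value (0-63)")
    return dscp << 2

# The defaults quoted in the guidelines above:
print(hex(dscp_to_diffserv_byte(40)))  # audio: 0xa0
print(hex(dscp_to_diffserv_byte(24)))  # video: 0x60
```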

• Marking on servers:

In addition to the registry key settings described above, administrators who want to mark packets sent by the OCS Mediation Server and MCU server roles must turn on the Windows Server 2003 Packet Scheduler. The Packet Scheduler (Psched) is not running by default on Windows Server 2003.

5.2.2 Microsoft UC neither requires nor supports IntServ

Microsoft UC neither supports nor requires IntServ. The issues with IntServ have been addressed elsewhere in this document. In addition, IntServ is only ever deployed in support of, integrated with, and as part of a specific application that requires it, such as traditional IP telephony. Because of the intricate and often proprietary interoperation between network and application, a company's investment in IntServ can be specific to that application and very difficult to transfer to or share with another. Supporting IntServ in Microsoft UC would therefore likely require significant rework of the existing IntServ implementation, which would make no economic sense.

In addition, the very principle of reserving fixed resources is suboptimal when applied to the Microsoft UC media stack, where the use of a dynamically adaptable, multi-rate, variable-bit-rate codec with high-quality VAD/SS means that the bandwidth requirement of each stream varies (a) instantaneously around a target value according to speech patterns, (b) over time as the mean value adjusts to network conditions, and (c) between zero and the target bit rate depending on which party is talking.


Reserving resources in those conditions could be done either by reserving for peak bandwidth (thus reserving significantly more resources than needed on average, and as a result supporting fewer users and sessions than Microsoft UC otherwise could), or by reserving a lesser amount of bandwidth, thus preventing the media stack from fully using its intelligence to compensate for network and non-network impairments. There is no available standards-based network QoS solution that would enable dynamically evolving bandwidth reservations to stay in sync with the dynamically variable flows of Microsoft UC.

On the other hand, Microsoft UC can operate on a network where IntServ is deployed if enough resources are left unreserved. If another application on the network uses RSVP, it should not affect Microsoft UC as long as it leaves enough unallocated bandwidth for all other applications, including Microsoft UC, and if delay is kept within the recommended values. Of course if almost all resources are reserved it makes no sense to deploy any incremental service on the network in the first place.

5.2.3 Microsoft UC neither requires nor supports CAC

Microsoft UC natively neither implements nor requires CAC. As stated previously, there are two main reasons companies implement CAC on weak WAN links: (a) to protect the real-time media from itself, and (b) to enable coexistence with traffic from other applications.

• Protection from itself with dynamic adaptation:

Because of the dynamic adaptability of the media stack, when a new media session is requested on a link approaching saturation, all existing media sessions adapt and reduce their bit rate requirements, making room for the new session. That elasticity creates substantial headroom (as discussed, up to about triple the number of sessions). If that does not suffice, the problem is more fundamental and should be addressed by policies, i.e. not conferring on users rights that they cannot reasonably use.

For the same reason, Microsoft UC does not at this time need or support PSTN fallback as a substitute for a congested link – dynamic adaptation will work to enable the incremental session on the link instead.

• Coexisting with other traffic:

Typically, Microsoft UC will coexist with other traffic on the network. Generally that traffic will be either UDP or TCP.

UDP traffic such as traditional IP telephony (or any connectionless traffic behaving like UDP) lacks a built-in regulation mechanism and is generally subject to disruption from any other traffic (not just Microsoft UC). Such traffic may or may not need to be actively protected from the other flows on the network (including Microsoft UC). Whether to do so should be treated as an independent decision based on that application's business criticality and sensitivity to congestion alone, and the protection method (DiffServ, IntServ, VLAN…) should be chosen in light of the specific application.


TCP traffic (or traffic behaving similarly) has built-in correction and adaptation mechanisms: TCP rate-adjusts to congestion and ensures delivery even in congested situations, while Microsoft UC also dynamically adapts to the same congestion. TCP flows and Microsoft UC will both adapt to coexist within the network's resources, without the need for CAC.

If another application on the link already uses CAC, it should not affect Microsoft UC as long as the session cap set for that application in CAC leaves enough unallocated bandwidth headroom for reasonable operation, per the recommendations above.

Just as resource reservation is suboptimal with Microsoft UC, so is CAC, unless it dynamically adjusted to the actual bandwidth consumed at any point in time by Microsoft UC; we are not aware of any existing commercial solution that can do that.

5.2.4 On saturated WAN links, Microsoft UC may benefit from a combination of DiffServ and traffic shaping

On highly utilized, under-provisioned WAN links, TCP adaptation may be too slow to prevent occasional traffic spikes, which can lead to short-term, transient NSQ issues until the self-adjusting mechanisms in Microsoft UC and in TCP kick in. On rare occasions, oscillatory behavior can occur. In those extreme conditions, if actual quality issues are encountered that are deemed unacceptable, it may be useful to implement a combination of DiffServ and traffic shaping (such as a leaky bucket) for the TCP traffic on the link. This recommendation applies only to situations where right-provisioning is not possible and transient Voice Quality problems are encountered due to spikes in TCP traffic.
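The leaky-bucket shaping mentioned above admits traffic at a fixed drain rate while bounding bursts. A minimal sketch of the idea (the rates and bucket size are arbitrary examples, not recommendations, and a real shaper would delay rather than simply refuse packets):

```python
# Minimal leaky-bucket shaper sketch illustrating the burst-limiting idea.
# Rates and sizes are arbitrary examples; a production shaper would queue
# and delay non-conforming packets rather than just report them.
class LeakyBucket:
    def __init__(self, rate_bytes_per_s: float, capacity_bytes: float):
        self.rate = rate_bytes_per_s      # sustained drain rate
        self.capacity = capacity_bytes    # maximum tolerated burst
        self.level = 0.0                  # current fill level
        self.last = 0.0                   # time of last offered packet

    def offer(self, size: int, now: float) -> bool:
        """Return True if a packet of `size` bytes conforms at time `now`."""
        # Drain the bucket at the configured rate since the last packet.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + size > self.capacity:
            return False  # burst would exceed capacity: delay or drop
        self.level += size
        return True

bucket = LeakyBucket(rate_bytes_per_s=10_000, capacity_bytes=5_000)
print(bucket.offer(4_000, now=0.0))  # True: fits in an empty bucket
print(bucket.offer(4_000, now=0.0))  # False: back-to-back burst exceeds capacity
print(bucket.offer(4_000, now=1.0))  # True: the bucket has drained meanwhile
```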


6 Conclusion

In this document, we explained how Microsoft UC Quality of Experience can be used to manage and deliver the best possible Voice Quality to the end user in any circumstance, and how it differs from the traditional approach of managing network behavior with QoS.

In summary, the Quality of Experience approach combines the use of adaptive end-points that measure and monitor the actual experience for all calls at all times with an advanced media stack capable of correcting both network and non-network impairments. This forward-looking technology uniquely enables Microsoft UC to provide overall superior Quality of Experience compared to traditional IP telephony solutions in identical network conditions, as demonstrated by the results of an independent study.

Microsoft UC is designed to deliver an optimal Quality of Experience given the actual network conditions: an excellent experience on networks that deliver good Network Service Quality (through a combination of right-provisioning, good design and, where appropriate, class-of-traffic prioritization), acceptable quality on networks with mediocre NSQ, and successful calls in many conditions, such as very aggressive or best-effort networks with very poor NSQ, where traditional IP telephony solutions would no longer work.