<?xml version="1.0"?>
<!-- $Id$ -->
<!-- Time-stamp: <07/09/21 13:52:11 yusuke> -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
 <!ENTITY rfc2629 PUBLIC '' './bibxml/reference.RFC.2629.xml'>
]>
<rfc ipr="full3978" category="std" xml:lang="en">
<?rfc toc="yes" ?>
<?rfc strict="yes" ?>
<?rfc compact="no" ?>

<front>
  <title abbrev="RTP Payload Format for UEMCLIP">
  RTP payload format for UEMCLIP speech codec
  </title>
  <author initials="Y.H." surname="Hiwasaki"
    fullname="Yusuke Hiwasaki">
    <organization>NTT Corporation</organization>
    <address>
      <postal>
        <street>3-9-11 Midori-cho,</street>
        <street>Musashino-shi</street>
	<city>Tokyo</city>
	<code>180-8585</code>
	<country>Japan</country>
      </postal>
      <phone>+81(422)59-4815</phone>
      <email>hiwasaki.yusuke@lab.ntt.co.jp</email>
    </address>
  </author>
  <author initials="H.O." surname="Ohmuro"
  fullname="Hitoshi Ohmuro">
    <organization>NTT Corporation</organization>
    <address>
      <postal>
        <street>3-9-11 Midori-cho,</street>
        <street>Musashino-shi</street>
	<city>Tokyo</city>
	<code>180-8585</code>
	<country>Japan</country>
      </postal>
      <phone>+81(422)59-2151</phone>
      <email>ohmuro.hitoshi@lab.ntt.co.jp</email>
    </address>
  </author>
  <date month="September" year="2007" />
  <area>Real-time Applications</area>
  <workgroup>Audio/Video Transport</workgroup>
  <keyword>RTP Payload type</keyword>
  <keyword>MIME</keyword>
  <keyword>UEMCLIP</keyword>
  <keyword>PCMU</keyword>
  <keyword>Speech Coding</keyword>
  <abstract>
    <t>This document describes the RTP payload format of UEMCLIP, an
    enhanced speech codec based on ITU-T G.711. The bitstream has a
    scalable structure with an embedded u-law bitstream, also known as
    PCMU, which allows simple transcoding between narrowband and
    wideband speech.</t> </abstract>

</front>

<middle>

<section anchor="sec_intro" title="Introduction">

<t>
This document specifies the payload format for sending UEMCLIP encoded
speech using the Real-time Transport Protocol (RTP) <xref
target="RFC3550"/>. UEMCLIP is an enhanced version of u-law ITU-T
G.711. It is designed to help the market make a smooth transition
towards the forthcoming wideband communication environment, while
maintaining interoperability with existing terminals, in which the
implementation of G.711 is mandatory, and keeping the transcoding load
low.
</t>

<t>
The background and the basic idea of the media format are described in
<xref target="sec_background" />. The details of the payload format
are given in <xref target="sec_bitformat" />. Interoperability issues
with G.711 are discussed in <xref target="sec_g711interoperability"
/>, and considerations for congestion control are given in <xref
target="sec_congestion" />. In <xref target="sec_mediatype" />, a
media type registration for the UEMCLIP RTP payload format and the SDP
mappings are provided.
</t>

<section anchor="sec_terminology" title="Terminology">

<t>
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <xref
target="RFC2119"/>.</t>

</section>

</section>

<section anchor="sec_background" title="Media Format Background">

<t>
UEMCLIP stands for "U-law EMbedded Coder for Low-delay IP
communication", and is basically an enhanced version of u-law ITU-T
G.711, otherwise known as PCMU <xref target="RFC4856"/>. It is
developed for VoIP (Voice over Internet Protocol) applications, and is
especially suitable for wideband multi-point conferencing. The main
goal of this codec is to provide a wideband communication platform
that is highly interoperable with existing terminals equipped with
G.711, and to stimulate the market to gradually shift to the wideband
communication. Because the G.711 bitstream is embedded in the
bitstream, costly transcoding can be avoided especially when
interoperating with narrowband terminals.</t>

<t>
This document does not discuss the implementation details of the
encoder and decoder, but only describes the bitstream format. The
implementation details will be available by other means.</t>

<t>
Because of its scalable nature, there are a number of sub-bitstreams
(sub-layers) within a UEMCLIP bitstream. By choosing appropriate
sub-layers, the codec can adapt to the following requirements:

<list style="symbols">
<t>Sampling frequency,</t>
<t>Number of channels,</t>
<t>Speech quality, and</t>
<t>Bit-rate.</t>
</list>

The current implementation of the UEMCLIP codec includes three
sub-coders, as shown in <xref target="tab_sublayer"/>. The core layer
is the G.711 core, and the other two are quality and bandwidth
enhancement layers with a bit-rate of 16 kbit/s each.</t>

<texttable anchor="tab_sublayer" title="Sub-layer description">
<ttcol align='center'>Layer</ttcol>
<ttcol align='left'>Description</ttcol>
<ttcol align='right'>Bit-rate [kbit/s]</ttcol>
<ttcol align='left'>Coding algorithm</ttcol>
<c>a</c><c>G.711 core</c><c>64</c><c>u-law PCM</c>
<c>b</c><c>Lower-band enhancement</c><c>16</c><c>Time domain block quantization</c>
<c>c</c><c>Higher-band</c><c>16</c><c>MDCT block quantization</c>
</texttable>

<t>
Based on these sub-layers, the UEMCLIP codec operates in four modes,
as shown in <xref target="tab_modes"/>. Here, "Ch" is the number of
channels and "Fs" is the sampling frequency in kHz. Modes 2 and 5 are
absent because they are reserved for a future extension to 32 kHz
sampling. Because the mode definition is expected to grow, modes not
defined in this table MUST NOT be used, for compatibility and
interoperability reasons.</t>

<texttable anchor="tab_modes" title="Mode description">
<ttcol align='center'>Mode</ttcol>
<ttcol align='center'>Ch</ttcol>
<ttcol align='right'>Fs</ttcol>
<ttcol align='center'>Layer a</ttcol>
<ttcol align='center'>Layer b</ttcol>
<ttcol align='center'>Layer c</ttcol>
<ttcol align='right'>Bit-rate w/o headers [kbps]</ttcol>
<ttcol align='right'>Total bit-rate [kbps]</ttcol>
<c>0</c><c>1</c><c> 8</c><c>x</c><c>-</c><c>-</c><c>64</c><c>68.8</c>
<c>1</c><c>1</c><c>16</c><c>x</c><c>-</c><c>x</c><c>80</c><c>85.6</c>
<c>2</c><c>-</c><c> -</c><c>-</c><c>-</c><c>-</c><c>-</c><c>-</c>
<c>3</c><c>1</c><c> 8</c><c>x</c><c>x</c><c>-</c><c>80</c><c>85.6</c>
<c>4</c><c>1</c><c>16</c><c>x</c><c>x</c><c>x</c><c>96</c><c>102.4</c>
<c>5</c><c>-</c><c> -</c><c>-</c><c>-</c><c>-</c><c>-</c><c>-</c>
</texttable>

<t>
A UEMCLIP bitstream contains internal headers and other side
information in addition to the layer data.  As a result, the total
bit-rate is larger than the sum of the layer bit-rates shown in the
above table.  The details of the internal headers and auxiliary
information are described in <xref
target="sec_mainheader"/>.</t>
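
<t>
As an informal check, the totals in <xref target="tab_modes"/> can be
reproduced from the frame layout described later in this document. The
following Python sketch is illustrative only. It assumes a 20-ms
frame, a 10-byte main header with no enhanced header (ES = 0), and a
2-byte sub-header per sub-layer. The names frame_bytes,
total_bitrate_kbps, and the constants are hypothetical and are not
part of this specification.
</t>

<figure>
<artwork><![CDATA[
```python
# Illustrative sketch; all names are hypothetical.
FRAME_MS = 20           # frame length in milliseconds
MAIN_HEADER_BYTES = 10  # main header, assuming ES = 0
SUB_HEADER_BYTES = 2    # layer index byte plus SB byte

# Sub-layer bit-rates in kbit/s (sub-layer table).
LAYER_KBPS = {"a": 64, "b": 16, "c": 16}

# Sub-layers carried by each defined mode (mode table).
MODE_LAYERS = {0: "a", 1: "ac", 3: "ab", 4: "abc"}


def frame_bytes(mode):
    """Bytes in one UEMCLIP frame, internal headers included."""
    size = MAIN_HEADER_BYTES
    for layer in MODE_LAYERS[mode]:
        # kbit/s times ms gives bits; divide by 8 for bytes.
        size += SUB_HEADER_BYTES + LAYER_KBPS[layer] * FRAME_MS // 8
    return size


def total_bitrate_kbps(mode):
    """Total bit-rate in kbit/s, internal headers included."""
    return frame_bytes(mode) * 8 / FRAME_MS
```
]]></artwork>
</figure>

<t>
Under these assumptions, a Mode 0 frame occupies 172 bytes per 20 ms,
which corresponds to the 68.8 kbit/s listed in the table.
</t>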

<t>
Specifying the sampling frequency and the number of channels does not
identify a single mode, i.e., there can be multiple modes for the same
sampling frequency and number of channels. The supported modes may
differ between implementations, thus the sender and the receiver must
agree on which mode to use for transmission.
</t>

</section>

<section anchor="sec_bitformat" title="Payload Format">

<t>
As an RTP payload, UEMCLIP bitstream can contain one or more frames as
shown in <xref target="fig_rtpformat"/>.</t>

<figure anchor="fig_rtpformat" title="RTP payload format">
<artwork>
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                      RTP Header                               |
 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
 |                                                               |
 |                 one or more frames of UEMCLIP                 |
 |                                                               |
 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
</artwork>
</figure>

<t>
The UEMCLIP bitstream has a scalable structure, so it is possible to
reconstruct the signal by decoding only a part of it. A UEMCLIP frame
is composed of a main header (MH) followed by one or more (up to
three) sub-layers (SL), as shown in <xref target="fig_bitstream"/>.</t>

<figure anchor="fig_bitstream" title="A UEMCLIP frame (bitstream format)">
<artwork>
                         +--+-------+//-+
                         |MH| SL #1 |...|
                         |  |       |   |
                         +--+-------+//-+
</artwork>
</figure>

<t>
Among the sub-layers, the core layer, i.e., "Layer a", MUST always be
included. Note that the core layer is not necessarily the first
sub-layer in the frame. The decoder MUST always refer to the layer
index for proper decoding.</t>

<t>
The UEMCLIP bitstream does not include the following information: Mode
and sampling frequency (Fs). As described before, this information
SHOULD be exchanged while establishing a connection, for example, by
means of SDP.</t>

<section anchor="sec_rtphdrusg" title="RTP Header Usage">
<t>
Each RTP packet starts with a fixed RTP header, as explained in <xref
target="RFC3550"/>. The fields of the RTP fixed header that have a
specific usage for UEMCLIP streams are as follows:

<list style="hanging">

<t hangText="Payload type:">The assignment of an RTP payload type for
this packet format is outside the scope of this document; however, it
is expected that a payload type in the dynamic range shall be
assigned.</t>

<t hangText="Timestamp:">This encodes the sampling instant of the
first speech sample in the RTP data packet. For UEMCLIP streams, the
RTP timestamp MUST advance at a clock rate that is a multiple of 8
kHz. If the sampling rate can change during a session, the clock rate
should be equal to the maximum sampling rate (in Hz) given in the mode
range (see <xref target="sec_dyntx" />). For example, during an 8 kHz
session, if a transition to a 16 kHz mode is allowed, the timestamp
SHOULD advance using a 16 kHz clock rate. For fixed modes, the clock
rate should be either 8 or 16 kHz, according to the sampling rate of
the mode.</t>

<t hangText="Marker bit:">If the codec is used for applications with
discontinuous transmission (DTX, or silence compression), the first
packet after a silence period during which packets have not been
transmitted contiguously SHOULD have the marker bit in the RTP data
header set to one. The marker bit in all other packets MUST be
zero. Applications without DTX MUST set the marker bit to zero.</t>

</list>
</t>
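
<t>
The timestamp rule above can be sketched as follows. This Python
fragment is illustrative only; the names rtp_clock_rate and
timestamp_step are hypothetical, and the sketch assumes that the set
of modes allowed in the session is already known from the session
setup.
</t>

<figure>
<artwork><![CDATA[
```python
# Illustrative sketch; all names are hypothetical.
FRAME_MS = 20  # frame length in milliseconds

# Sampling frequency in Hz of each defined mode (mode table).
MODE_FS_HZ = {0: 8000, 1: 16000, 3: 8000, 4: 16000}


def rtp_clock_rate(allowed_modes):
    """Clock rate in Hz: the maximum rate in the mode range."""
    return max(MODE_FS_HZ[mode] for mode in allowed_modes)


def timestamp_step(allowed_modes, frames_per_packet=1):
    """Timestamp increment between consecutive packets."""
    ticks = rtp_clock_rate(allowed_modes) * FRAME_MS // 1000
    return ticks * frames_per_packet
```
]]></artwork>
</figure>

<t>
For a session fixed to Mode 0, this gives a step of 160 per
single-frame packet. If a transition to a 16 kHz mode is allowed, the
step is 320 per frame even while 8 kHz frames are being sent.
</t>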
</section>

<section anchor="sec_multiframe" title="Multiple frames in an RTP packet">

<t>
More than one UEMCLIP frame may be included in a single RTP packet by
a sender. However, senders have the following additional restrictions:

<list style="symbols">
<t> A single RTP packet SHOULD NOT include more UEMCLIP frames than
will fit in the MTU of the RTP transport protocol.</t>
<t> All frames contained in a single RTP packet MUST be of the same
mode.</t>
<t>Frames MUST NOT be split between RTP packets.</t>
</list>

It is RECOMMENDED that the number of frames contained within an RTP
packet be consistent with the application.  Since UEMCLIP is designed
for telephony applications, where delay has a great impact on
quality, fewer frames per packet, and hence lower delay, are
preferable.</t>

</section>

<section anchor="sec_payloaddata" title="Payload Data">

<section anchor="sec_mainheader" title="Main Header">

<t>
The main header (MH) is placed at the top of a frame and has a size of
10 bytes, plus the size of the optional enhanced header. The content
of the main header is defined in <xref target="fig_mainheader"/>.</t>

<figure anchor="fig_mainheader" title="UEMCLIP main header format (MH)">
<artwork>
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     ID        |             BS                |      MX       |
|               |                               |               |
|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5|0 1 2 3 4 5 6 7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              PC                               |
|                                                               |
|0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   PC(cont'd)  |      ES       |             EH                |
|               |               |         (if exists)           |
|2 3 4 5 6 7 8 9|0 1 2 3 4 5 6 7|                             ...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-//+-+-+-+
</artwork>
</figure>

<t>
<list style="hanging">
<t hangText="Identification (ID):">8 bits</t>
<t>The value should be "0x95".</t>

<t hangText="Byte size (BS):">16 bits</t>
<t>
Indicates the size in bytes of the rest of the UEMCLIP frame, i.e.,
the frame size minus 3 bytes (of ID and BS). It is encoded in network
byte-order.
</t>

<t hangText="Mixing information (MX):">8 bits</t>
<t>Mixing information field.</t>

<t hangText="Packet-loss Concealment information (PC):">40 bits</t>
<t>Packet-loss concealment (PLC) information field.</t>

<t hangText="Enhanced-header Size (ES):">8 bits</t>
<t>Size of EH (enhanced header) in bytes.</t>

<t hangText="Enhanced header (EH):">8*ES bits</t>
<t>Content of the enhanced header.
When ES is 0, the enhanced header is non-existent.
</t>
</list>
</t>
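
<t>
A receiver could split the main header along the lines of the
following Python sketch. It is illustrative only: the function name
parse_main_header and the dictionary keys are hypothetical, and
rejecting a frame whose ID differs from 0x95 is a choice made for this
example rather than a requirement stated above. BS is read in network
byte order and is used to find the end of the frame, so the same
function can be applied to each frame of a multi-frame payload.
</t>

<figure>
<artwork><![CDATA[
```python
import struct

# Illustrative sketch; function and key names are hypothetical.
UEMCLIP_ID = 0x95
FIXED_MH_BYTES = 10  # ID + BS + MX + PC + ES


def parse_main_header(buf, offset=0):
    """Parse the main header of the frame starting at offset.

    Returns (fields, body_start, frame_end); the sub-layers
    occupy buf[body_start:frame_end].
    """
    if len(buf) - offset < FIXED_MH_BYTES:
        raise ValueError("truncated main header")
    # ID (8 bits), BS (16 bits, network order), MX (8 bits).
    ident, bs, mx = struct.unpack_from("!BHB", buf, offset)
    if ident != UEMCLIP_ID:
        raise ValueError("unexpected ID 0x%02X" % ident)
    pc = bytes(buf[offset + 4:offset + 9])  # 40-bit PLC field
    es = buf[offset + 9]                    # enhanced-header size
    body_start = offset + FIXED_MH_BYTES + es
    frame_end = offset + 3 + bs             # BS excludes ID, BS
    if frame_end > len(buf) or body_start > frame_end:
        raise ValueError("inconsistent BS or ES")
    eh = bytes(buf[offset + FIXED_MH_BYTES:body_start])
    fields = {"bs": bs, "mx": mx, "pc": pc, "es": es, "eh": eh}
    return fields, body_start, frame_end
```
]]></artwork>
</figure>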

<section anchor="sec_mixinfo" title="Mixing information field">

<figure anchor="fig_mixinfo" title="Mixing information field (MX)">
<artwork>
                         0 1 2 3 4 5 6 7 
                        +-+-+-+-+-+-+-+-+
                        |C|R|V|   PW1   |
                        |1|1|1|         |
                        | | | |0 1 2 3 4|
                        +-+-+-+-+-+-+-+-+
</artwork>
</figure>
<t>
<list style="hanging">

<t hangText="Check bit #1 (C1):">1 bit</t>
<t>Validity flag of V1 and PW1. This bit being "1" indicates that both
parameters are valid, and "0" indicates that the parameters should be
ignored.</t>

<t hangText="Reserved bit #1 (R1):">1 bit</t>
<t>This bit should be ignored.</t>

<t hangText="VAD flag #1 (V1):">1 bit</t>
<t>Voice activity detection flag of the current frame. This flag being
"1" indicates that the frame is an active (voice) segment, and "0"
indicates that it is an inactive (non-voice) or a silent segment. This
flag is specifically designed for mixing information, and DTX
judgement based on this flag is not recommended.
</t>

<t hangText="Power #1 (PW1):">5 bits</t>
<t>Signal power code of the current frame. <!-- The power value is
quantized to 32 levels, i.e., 5 bits. -->
</t>

</list>
</t>
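
<t>
The mixing information field can be unpacked as in the following
Python sketch. It is illustrative only; the function name decode_mx
and the dictionary keys are hypothetical. The sketch assumes that bit
0 in <xref target="fig_mixinfo"/> is the most significant bit of the
byte, following the usual convention for such diagrams.
</t>

<figure>
<artwork><![CDATA[
```python
# Illustrative sketch; names are hypothetical. Bit 0 of the
# figure is assumed to be the most significant bit.
def decode_mx(mx):
    """Decode the 8-bit mixing information field."""
    return {
        "c1": bool(mx & 0x80),  # V1 and PW1 are valid if set
        "v1": bool(mx & 0x20),  # VAD flag of the current frame
        "pw1": mx & 0x1F,       # 5-bit signal power code
    }                           # R1 (0x40) is ignored
```
]]></artwork>
</figure>

<t>
A receiver would use "v1" and "pw1" only when "c1" is set.
</t>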
</section>

<section anchor="sec_plcinfo" title="PLC information field">
<figure anchor="fig_plcinfo" title="PLC information field (PC)">
<artwork>
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|C|C|R|V|   K   |R|     P1      |R|     P2      |      PW2      |
|2|3|2|2|       |3|             |4|             |               |
| | | | |0 1 2 3| |0 1 2 3 4 5 6| |0 1 2 3 4 5 6|0 1 2 3 4 5 6 7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      R5       |
|               |
|0 1 2 3 4 5 6 7|
+-+-+-+-+-+-+-+-+
</artwork>
</figure>
<t>
<list style="hanging">

<t hangText="Check bit #2 (C2):">1 bit</t>
<t>Validity flag of V2, K, P1, P2, and PW2. If the flag is
"1", it means that all these parameters are valid, and "0" means that
the parameters should be ignored.  If any of these parameters is
invalid, C2 should be set to "0".</t>

<t hangText="Check bit #3 (C3):">1 bit</t>
<t>Payload validity indicator. This flag is normally set to "0". If a
received packet has this flag set to "1", the payload data should be
ignored and packet-loss concealment should be performed by the
receiver. This flag is used in multi-point conferencing, when the
upstream packet was lost and the mixing server did not execute
packet-loss concealment.</t>

<t hangText="Reserved bit #2 (R2):">1 bit</t>
<t>This bit should be ignored.</t>

<t hangText="VAD flag #2 (V2):">1 bit</t>
<t>Voice activity detection flag of the current frame. This may be the
same as V1 in the mixing information, and may not be synchronous with
the marker bit in the RTP header.  This flag is specifically designed
for packet-loss concealment, and DTX judgement based on this flag is
not recommended.</t>

<t hangText="Frame indicator (K):">4 bits</t>
<t>This value indicates the frame offset of P2 and PW2. Because it is
preferable to carry the pitch and power parameters used for PLC in a
different frame, this offset identifies the frame with which those
parameters are to be associated. Since 4 bits are allocated, it ranges
between "0" and "15".
</t>

<t hangText="Reserved bit #3 (R3):">1 bit</t>
<t>This bit should be ignored.</t>

<t hangText="Pitch lag #1 (P1):">7 bits</t>
<t>Pitch code of the current frame. The actual pitch lag is calculated
as P1+20 samples at an 8-kHz sampling rate. The pitch lag must satisfy
20 &lt;= pitch lag &lt;= 120. Codes ranging between "0x65" and "0x7F"
are not used.
</t>

<t hangText="Reserved bit #4 (R4):">1 bit</t>
<t>This bit should be ignored.</t>

<t hangText="Pitch lag #2 (P2):">7 bits</t>
<t>Pitch code of the offset frame. The actual pitch lag is calculated
as P2+20 samples at an 8-kHz sampling rate. The pitch lag must satisfy
20 &lt;= pitch lag &lt;= 120. Codes ranging between "0x65" and "0x7F"
are not used. The offset value is defined as K.</t>

<t hangText="Power #2 (PW2):">8 bits</t>
<t>Signal power code of the offset frame. The offset value is defined
as K.</t>

<t hangText="Reserved bits #5 (R5):">8 bits</t>
<t>These bits should be ignored.</t>

</list>
</t>
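
<t>
The following Python sketch shows one way to unpack the 40-bit PLC
information field and to convert the pitch codes to lags. It is
illustrative only; the names decode_pc and pitch_lag and the
dictionary keys are hypothetical. As an assumption, bit 0 in <xref
target="fig_plcinfo"/> is taken to be the most significant bit of the
first byte, and returning None for an unused pitch code is a choice
made for this example.
</t>

<figure>
<artwork><![CDATA[
```python
# Illustrative sketch; names are hypothetical. Bit 0 of the
# figure is assumed to be the MSB of the first byte.
PITCH_OFFSET = 20      # lag = code + 20 samples at 8 kHz
MAX_PITCH_CODE = 0x64  # codes 0x65 to 0x7F are not used


def pitch_lag(code):
    """Convert a 7-bit pitch code to a lag, or None."""
    if code > MAX_PITCH_CODE:
        return None
    return code + PITCH_OFFSET


def decode_pc(pc):
    """Decode the 5-byte PLC information field."""
    if len(pc) != 5:
        raise ValueError("PC must be 5 bytes (40 bits)")
    b0, b1, b2, b3 = pc[0], pc[1], pc[2], pc[3]
    return {
        "c2": bool(b0 & 0x80),  # V2, K, P1, P2, PW2 valid
        "c3": bool(b0 & 0x40),  # payload to be ignored if set
        "v2": bool(b0 & 0x10),  # VAD flag
        "k": b0 & 0x0F,         # frame offset of P2 and PW2
        "lag1": pitch_lag(b1 & 0x7F),
        "lag2": pitch_lag(b2 & 0x7F),
        "pw2": b3,              # power code of offset frame
    }                           # R2 to R5 are ignored
```
]]></artwork>
</figure>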

</section>

</section>

<section anchor="sec_sublayer" title="Sub-layer">

<t>
A sub-layer (SL) consists of a sub-header followed by the layer
bitstream, as shown in <xref target="fig_sublayer"/>.  The sub-header
identifies the layer by its channel, frequency, and quality indices,
and gives the number of bytes of layer data.
</t>

<figure anchor="fig_sublayer" title="Sub-layer format (SL)">
<artwork>
  0                   1                   2                   
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7   . . .
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+//-+-+-+
 | CI| FI| QI| R6|      SB       |               LD         ...  |
 |   |   |   |   |               |                               |
 |0 1|0 1|0 1|0 1|0 1 2 3 4 5 6 7|                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+//-+-+-+
</artwork>
</figure>

<t>
<list style="hanging">
<t hangText="Channel index (CI):">2 bits</t>
<t>Indicates the channel number. For all modes given in <xref
target="tab_modes"/>, this should be "0". The detail is given in
<xref target="tab_layeridx"/>.</t>

<t hangText="Frequency index (FI):">2 bits</t>
<t>Indicates the frequency number. "0" means that the layer is in the
base frequency band, and a higher number means that the layer is in
the corresponding frequency band. The detail is given in <xref
target="tab_layeridx"/>.</t>

<t hangText="Quality index (QI):">2 bits</t>
<t>Indicates the quality layer number. "0" means that the layer is the
base layer, and a higher number means that the layer is the
corresponding quality layer. The detail is given in <xref
target="tab_layeridx"/>.</t>

<t hangText="Reserved #6 (R6):">2 bits</t>
<t>Not used (reserved). The value must be "0".</t>

<t hangText="Sub-layer Size (SB):">8 bits</t>
<t>Indicates the byte size of the following sub-layer data.</t>

<t hangText="Layer Data (LD):">SB*8 bits</t>
<t>The actual sub-layer data.</t>
</list>
</t>
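
<t>
The sub-layers of a frame can be walked as in the following Python
sketch. It is illustrative only; the function name iter_sublayers is
hypothetical, and its input is assumed to be the part of the frame
that follows the main header (including any enhanced header). As
before, bit 0 of <xref target="fig_sublayer"/> is assumed to be the
most significant bit of the first sub-header byte.
</t>

<figure>
<artwork><![CDATA[
```python
# Illustrative sketch; names are hypothetical.
def iter_sublayers(body):
    """Yield (ci, fi, qi, data) for each sub-layer in body."""
    pos = 0
    while pos < len(body):
        if len(body) - pos < 2:
            raise ValueError("truncated sub-header")
        index, sb = body[pos], body[pos + 1]
        ci = (index >> 6) & 0x3  # channel index
        fi = (index >> 4) & 0x3  # frequency index
        qi = (index >> 2) & 0x3  # quality index
        start = pos + 2          # R6 (low 2 bits) is ignored
        end = start + sb         # SB counts layer data bytes
        if end > len(body):
            raise ValueError("truncated layer data")
        yield ci, fi, qi, bytes(body[start:end])
        pos = end
```
]]></artwork>
</figure>

<t>
Because the core layer is not necessarily the first sub-layer, a
decoder would select layers by the yielded indices rather than by
position.
</t>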

<section anchor="sec_idxencoding" title="Layer index encoding">

<t>
The layer index is encoded using the values of the channel number, the
frequency-band number, and the quality number, encoded with 2 bits
each, in the order of appearance. The last 2 bits are reserved for
future use, and all implementations should ignore this field. For all
the layers shown in <xref target="tab_sublayer"/>, the layer indices
are shown in <xref target="tab_layeridx"/>.</t>

<texttable anchor="tab_layeridx" title="Layer indices">
<ttcol align='center'>Layer</ttcol>
<ttcol align='center'>CI</ttcol>
<ttcol align='center'>FI</ttcol>
<ttcol align='center'>QI</ttcol>
<c>a</c><c>0</c><c>0</c><c>0</c>
<c>b</c><c>0</c><c>0</c><c>1</c>
<c>c</c><c>0</c><c>1</c><c>0</c>
</texttable>

</section>

</section>

</section>

</section>

<section anchor="sec_g711interoperability" title="G.711 interoperability">

<t>
As described in <xref target="sec_background"/>, the u-law encoded
G.711 bitstream (Layer a) is the core layer of a UEMCLIP bitstream,
and is always embedded. This means that transcoding from a UEMCLIP
bitstream to G.711 does not have to undergo decoding and re-encoding
procedures; simple extraction suffices. However, this does not apply
to the reverse procedure, i.e., transcoding from G.711 to UEMCLIP,
because the side information in the main header must be assigned
separately.</t>

<t>
The transcoding from UEMCLIP to u-law G.711 can be done easily by
finding the appropriate sub-layer. Within a frame, the transcoder
should look for a sub-layer with layer index "0x00"; the subsequent
LD, which has a size of SB*8 bits (a UEMCLIP frame is 20 ms long, thus
SB=160), is the actual G.711 bitstream data. Note that the transcoder
should not expect the core layer to always be located right after the
main header.
</t>
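
<t>
The extraction described above can be sketched in Python as follows.
The sketch is illustrative only; the function name extract_g711 is
hypothetical. It skips the main header and any enhanced header, walks
the sub-layers, and returns the data of the first sub-layer whose
channel, frequency, and quality indices are all zero, ignoring the
reserved bits.
</t>

<figure>
<artwork><![CDATA[
```python
# Illustrative sketch; names are hypothetical.
def extract_g711(frame):
    """Return the u-law core layer of one UEMCLIP frame."""
    if len(frame) < 10:
        raise ValueError("truncated main header")
    bs = int.from_bytes(frame[1:3], "big")  # network order
    end = 3 + bs                 # BS excludes ID and BS
    pos = 10 + frame[9]          # skip MH and optional EH
    if end > len(frame):
        raise ValueError("frame shorter than BS indicates")
    while pos + 2 <= end:
        index, sb = frame[pos], frame[pos + 1]
        data_end = pos + 2 + sb
        if data_end > end:
            raise ValueError("truncated sub-layer")
        # Layer index 0x00: CI = FI = QI = 0 (R6 ignored).
        if index >> 2 == 0:
            return bytes(frame[pos + 2:data_end])
        pos = data_end
    raise ValueError("core layer not found")
```
]]></artwork>
</figure>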

<t>
On the other hand, the transcoding from G.711 to UEMCLIP is not
entirely straightforward. Since there are no means to generate
enhancement sub-layers, a G.711 bitstream can only be converted to a
UEMCLIP Mode 0 bitstream. If the original G.711 bitstream is encoded
in A-law, it should first be converted to u-law to become the core
layer. Because the UEMCLIP frame size is 20 ms, the u-law encoded
G.711 bitstream MUST be handled in 160-sample chunks, each of which
becomes a core layer. When a UEMCLIP encoder is not available, the
main header contents should follow these guidelines:

<list style="symbols">
<t> ID must be set to "0x95".</t>
<t> Byte size (BS) should be set to the remaining 7 bytes of the main
header, plus the sub-header size (2), plus the number of samples in
G.711 (SB), i.e., 169 for a 20-ms frame.</t>
<t> The enhanced-header size (ES) should be set to "0x00".</t>
<t> The check bits for mixing and PLC (C1 and C2) should be set to 0.</t>
<t> The payload validity indicator (C3) should be set to 0.</t>
</list>

The core layer (i.e., the u-law G.711 bitstream) should have the
following sub-layer header:

<list style="symbols">
<t> All CI, FI, QI, R6 MUST be 0.</t>
<t> Sub-layer size (SB) MUST be 160 for 20 ms frame.</t>
</list>

</t>
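
<t>
The guidelines above can be sketched in Python as follows. The sketch
is illustrative only; the function name g711_to_uemclip_mode0 is
hypothetical. The guidelines do not give values for the remaining
bits of MX and PC, so setting them to zero is an assumption of this
example, made on the basis that C1 = 0 and C2 = 0 mark those
parameters as ignorable.
</t>

<figure>
<artwork><![CDATA[
```python
# Illustrative sketch; names are hypothetical.
SAMPLES_PER_FRAME = 160  # 20 ms of 8-kHz u-law samples


def g711_to_uemclip_mode0(ulaw):
    """Wrap 160 u-law bytes into a Mode 0 UEMCLIP frame."""
    if len(ulaw) != SAMPLES_PER_FRAME:
        raise ValueError("expected a 160-sample chunk")
    # Sub-header: CI = FI = QI = R6 = 0, then SB = 160.
    sub_layer = bytes([0x00, SAMPLES_PER_FRAME]) + bytes(ulaw)
    mx = bytes([0x00])       # C1 = 0; other bits assumed 0
    pc = bytes(5)            # C2 = 0, C3 = 0; rest assumed 0
    es = bytes([0x00])       # no enhanced header
    bs = 7 + len(sub_layer)  # 7 + 2 + 160 = 169
    header = bytes([0x95]) + bs.to_bytes(2, "big")
    return header + mx + pc + es + sub_layer
```
]]></artwork>
</figure>

<t>
The resulting frame is 172 bytes long, which matches the Mode 0 frame
size implied by the main header and sub-header sizes.
</t>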

</section>

<section anchor="sec_congestion" title="Congestion Control Considerations">

<t>
The general congestion control considerations for transporting RTP
data apply to UEMCLIP over RTP <xref target="RFC3550"/> as well as any
applicable RTP profile like AVP <xref target="RFC3551"/>. UEMCLIP does
not have any built-in mechanism for reducing the bandwidth. Packing
more frames in each RTP payload can reduce the number of packets sent,
and hence the overhead from IP/UDP/RTP headers, at the expense of
increased delay and reduced robustness against packet losses. Such
packing should be used with care, because increased delay means
reduced quality.</t>

</section>

<section anchor="sec_payloadformatparam" title="Payload Format Parameters">

<section anchor="sec_mediatype" title="Media type registration">
<t>
This registration is done using the template defined in <xref
target="RFC4288"/> and following <xref target="RFC4855"/>.

<list style="hanging">

<t hangText="Media type name:">audio</t>

<t hangText="Media subtype name:">UEMCLIP</t>

<t hangText="Required parameters:">
   <list style="hanging">
   <t hangText="Mode:">This defines the bit-rate, the sampling
   frequency, and the layer structure of the bitstream. This parameter
   is necessary because this information is not signaled within the
   bitstream. Allowed values are 0, 1, 3, and 4.</t></list></t>

<t hangText="Optional parameters:"> <!-- None.</t>-->
   <list style="hanging">
   <t hangText="ptime:">See RFC 4566 <xref target="RFC4566"/>.</t>
   <t hangText="maxptime:">See RFC 4566 <xref target="RFC4566"/>.</t>

<!-- -->
   <t hangText="dynmode:">Indicates that dynamic mode change is
   allowed. The value is a comma-separated list of modes from the
   supported mode set: 0, 1, 3, and 4. This option MUST NOT be used
   together with "fixmode". When neither is specified, the mode
   transmission defaults to "fixmode" with the default modes specified
   in <xref target="tab_defaultmodes"/>. See <xref
   target="sec_dyntx"/> "Dynamic transmission specification" for
   details.</t>

   <t hangText="fixmode:">Indicates that dynamic mode change is
   prohibited. The value is a comma-separated list of modes from the
   supported mode set: 0, 1, 3, and 4. This option MUST NOT be used
   together with "dynmode". See <xref
   target="sec_dyntx"/> "Dynamic transmission specification" for
   details.</t>
<!-- -->
   </list></t>

<t hangText="Encoding considerations:">This type is defined for
   transferring UEMCLIP-encoded data via RTP using the payload format
   specified in <xref target="sec_bitformat"/> "Payload Format". This
   media type is framed and binary.</t>

<t hangText="Security considerations:">See <xref
   target="sec_security"/> "Security Considerations".</t>

<t hangText="Interoperability considerations:"> This media type is
   interoperable with u-law encoded ITU-T G.711. See <xref
   target="sec_g711interoperability"/> "G.711 interoperability".</t>

<t hangText="Published specification:">RFC xxxx (This RFC)</t>

<t hangText="Applications that use this media type:">Audio and video
streaming and conferencing tools.</t>

<t hangText="Additional information:">None</t>

<t hangText="Intended usage:">COMMON</t>

<t hangText="Person &amp; email address to contact for further
information:">Yusuke Hiwasaki
&lt;hiwasaki.yusuke@lab.ntt.co.jp&gt;</t>

<t hangText="Author:">
<list style="hanging">
<t hangText="Author:">Yusuke Hiwasaki</t>

<t hangText="Change Controller:">IETF Audio/Video Transport Working
Group delegated from the IESG</t>
</list>
</t>

</list>
</t>
</section>

<section anchor="sec_sdpparam" title="Mapping to SDP Parameters">
<t>
The media type audio/UEMCLIP is mapped to fields in the Session
Description Protocol (SDP) <xref target="RFC4566"/> as follows:

<list style="hanging">
<t hangText="Payload type:">Since UEMCLIP has no static payload type
assigned in <xref target="RFC3551"/>, the payload type of any RTP
packets that carry UEMCLIP MUST be treated as a dynamic payload
type.</t>

<t hangText="Media name:">The media name in the "m=" line of SDP MUST
be "audio".</t>

<t hangText="Encoding name:">The registered media subtype name should
be used for the "a=rtpmap" line.</t>

<t hangText="Sampling Frequency:">Depending on the mode to be used,
the clock rate (sampling frequency) specified in "a=rtpmap" MUST be
selected from the ones defined in <xref
target="tab_modes"/>.</t>

<t hangText="Encoding parameters:">Since this is an audio stream, the
encoding parameters indicate the number of audio channels, and this
SHOULD default to "1", as selected from the ones defined in <xref
target="tab_modes"/>. This is OPTIONAL.</t>

<t hangText="Packet time:">The frame length of any UEMCLIP mode is 20
ms, thus the argument of "a=ptime" MUST be a multiple of "20". When
not listed in SDP, it should default to the minimum size: "20".</t>

<t hangText="Bandwidth:">As described in <xref target="RFC4566"/>, the
bandwidth line is OPTIONAL.  When there are no bandwidth restrictions,
the value MUST be the largest value in <xref
target="tab_modes"/>, expressed in "kbit/s" with any fraction rounded
up, and including header overheads down to Layer 3. If any
restrictions apply, the value MUST be the largest one in <xref
target="tab_modes"/> that satisfies the restriction, computed by the
same procedure. An encoder MUST NOT encode at a bit-rate larger than
the answered bandwidth.</t>

<t hangText="UEMCLIP specific:">Any descriptions specific to UEMCLIP
are defined in the format-specific parameters ("a=fmtp"). Parameters
MUST be separated with ";", and if a parameter has an attribute
(value), it MUST be given with "=". For compatibility reasons, any
application/terminal MUST ignore any parameters that do not appear
below. This ensures upward compatibility with parameters added later
for future enhancements. The dynamic transmission specification
parameters should be defined here (see <xref
target="sec_dyntx"/>).</t>

</list>
</t>

<section anchor="sec_dyntx" title="Dynamic transmission specification">
<t>
Since the UEMCLIP codec can operate in a number of modes, it is
desirable to specify the range of modes in which an encoder or a
decoder can operate.</t>

<t>
UEMCLIP decoders SHOULD accept bitstreams in any mode. However, some
implementations may be unable to adapt to a dynamic bit-rate change
because of their limitations. Two concepts are therefore introduced
here: "dynamic mode" (denoted as "dynmode"), where a dynamic mode
(bit-rate) change is allowed, and "fixed mode" (denoted as "fixmode"),
where the change is not permitted. The two MUST NOT be used
together.</t>

<t>
"fixmode" is used to specify that the operating mode (bit-rate) is not
modified during the session. It MUST NOT be used together with
"dynmode". It should specify the possible mode numbers, delimited by
commas ",". When offering a "fixmode", the offerer SHOULD list the
mode numbers in descending priority order. The answerer MUST select a
single suitable mode number and reply as "fixmode" with one
argument.</t>

<t>
On the other hand, "dynmode" is used to allow modification of the
operating mode during the session. It MUST NOT be used together with
"fixmode". The offerer should specify the possible mode numbers,
delimited by commas ",". The answerer can either select a number of
suitable modes and reply as "dynmode" in the same manner, or select a
single suitable mode number and reply as "fixmode" with one
argument.</t>

<t>
The mode numbers that can be specified as arguments to "fixmode" or
"dynmode" are restricted by the combination of sampling frequency and
number of audio channels, as shown in <xref
target="tab_modes"/>. This is because SDP binds a payload type to a
combination of a sampling frequency and a number of audio
channels. When neither "fixmode" nor "dynmode" is given, the session
MUST be interpreted as defaulting to the fixed mode ("fixmode") and
MUST use the default value specified in <xref
target="tab_defaultmodes"/>.</t>

<texttable anchor="tab_defaultmodes" title="Default modes">
<ttcol align='right'>Fs [Hz]</ttcol>
<ttcol align='center'>Channels</ttcol>
<ttcol align='center'>Selectable modes</ttcol>
<ttcol align='center'>Default mode</ttcol>
<c>8000</c><c>1</c><c>0,3</c><c>0</c>
<c>16000</c><c>1</c><c>1,4</c><c>1</c>
</texttable>
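
<t>
The interpretation of the mode parameters can be sketched in Python as
follows. The sketch is illustrative only; the function name
parse_mode_params is hypothetical, its input is assumed to be the
parameter part of an "a=fmtp" line with the payload type number
already removed, and raising an error for a mode outside the
selectable set is a choice made for this example.
</t>

<figure>
<artwork><![CDATA[
```python
# Illustrative sketch; names are hypothetical.
# Keys are (sampling frequency in Hz, number of channels).
SELECTABLE = {(8000, 1): {0, 3}, (16000, 1): {1, 4}}
DEFAULT_MODE = {(8000, 1): 0, (16000, 1): 1}


def parse_mode_params(fmtp, clock_rate, channels=1):
    """Return (kind, modes) from the fmtp parameter string."""
    key = (clock_rate, channels)
    if key not in SELECTABLE:
        raise ValueError("unsupported rate or channel count")
    params = {}
    for item in fmtp.split(";"):
        name, _, value = item.strip().partition("=")
        if name:
            params[name] = value
    if "dynmode" in params and "fixmode" in params:
        raise ValueError("dynmode and fixmode used together")
    # Parameters other than these two are ignored.
    for kind in ("dynmode", "fixmode"):
        if kind in params:
            modes = [int(m) for m in params[kind].split(",")]
            if any(m not in SELECTABLE[key] for m in modes):
                raise ValueError("mode not selectable here")
            return kind, modes
    # Neither given: fixed mode with the default mode.
    return "fixmode", [DEFAULT_MODE[key]]
```
]]></artwork>
</figure>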
</section>

</section>

<section anchor="sec_offeranswer" title="Offer-Answer Model Considerations">

<section anchor="sec_offeranswer_guide" title="Offer-Answer Guidelines">
<t>
The procedures related to exchanging SDP messages MUST follow <xref
target="RFC3264"/>. In addition, the following are guidelines for
establishing a session using an offer-answer model.

<list style="symbols">
<t>
When multiple UEMCLIP dynamic payload type numbers are offered, an
answerer SHOULD select a single payload type number, i.e., one
sampling frequency and channel condition.</t>

<!--
<t>The ptime SHOULD be 20.</t>
-->

<t>
An offerer SHOULD offer every possible combination of sampling
frequency, channel number, and fmtp parameters, including dynamic/fixed
mode. When the transmission bandwidth is restricted, the offer MUST
conform to the restriction.</t>

<t>
When offering/answering SDP, any fmtp parameters that are undefined
MUST be ignored. If any unknown/undefined parameters are offered, an
answerer MUST delete those entries from the answer message. In this
case, the offerer MUST use the default value for any deleted
parameters.</t>

<t>
If a dynamic mode ("dynmode") is offered, an answerer MUST select
either "dynmode" or "fixmode", according to its capabilities. When
a fixed mode ("fixmode") is offered, an answerer MUST answer only
with "fixmode". When answering with "fixmode", the answerer MUST
select a single mode out of the offered modes, regardless of whether
the offer specified dynamic or fixed mode. If no mode is offered at
all, the session MUST default to fixed mode, and the default mode
value shown in <xref target="tab_defaultmodes"/> MUST be used, based
on the sampling frequency and number of channels specified elsewhere.</t>

<t>
When an offered condition does not fit an answerer's capabilities,
the answerer MUST NOT include that condition in its answer, and the
session MAY proceed to re-INVITE, if possible. Once a condition (mode)
is decided upon, both the offerer and the answerer MUST transmit
using this condition.</t> </list>
</t>
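
<t>
The mode-selection guidelines above can be sketched, in a
non-normative way, as the following Python function. The function
name and its arguments are hypothetical. The sketch further assumes
that the order of the offered modes expresses preference, that a
"dynmode" answer lists the modes common to both sides, and that a
single common mode is answered as "fixmode"; none of these choices is
mandated by this document.
</t>

<figure>
<artwork><![CDATA[
```python
def answer_modes(offer_kind, offered, supported, can_switch):
    """Return (kind, modes) for the answer, or None to decline.

    offer_kind -- "dynmode" or "fixmode" (already defaulted to
                  "fixmode" if the offer had no mode parameter)
    offered    -- list of offered mode numbers
    supported  -- set of modes the answerer can operate in
    can_switch -- True if the answerer can change modes mid-session
    """
    # Keep the offerer's order (assumed to express preference).
    common = [m for m in offered if m in supported]
    if not common:
        # The offered condition does not fit: do not answer it.
        return None
    if offer_kind == "dynmode" and can_switch and len(common) > 1:
        return "dynmode", common
    # A "fixmode" offer is always answered with "fixmode", and a
    # "fixmode" answer carries exactly one of the offered modes.
    return "fixmode", common[:1]
```
]]></artwork>
</figure>

<t>
For instance, an offer of "dynmode=4,1" to an answerer that supports
only modes 1 and 0 and cannot switch modes results in "fixmode" with
Mode 1.
</t>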
</section>

<section anchor="sec_sdpexample" title="Examples">
<t>When an offerer indicates that it wishes to switch dynamically
between modes (0, 1, 3, and 4) during a session, an example of an
offered SDP is:

<figure>
<artwork>
  m=audio 5004 RTP/AVP 96 97
  a=rtpmap:96 UEMCLIP/16000/1
  a=fmtp:96 dynmode=4,1
  a=rtpmap:97 UEMCLIP/8000/1
  a=fmtp:97 dynmode=3,0
</artwork>
</figure>

When an answerer can only operate in modes 1 and 0, and cannot
dynamically switch between modes during a session, an example
answer is as follows:

<figure>
<artwork>
  m=audio 5004 RTP/AVP 96
  a=ptime:20
  a=rtpmap:96 UEMCLIP/16000/1
  a=fmtp:96 fixmode=1
</artwork>
</figure>

As a result, both parties start communicating in Mode 1. Note that
the mode MUST NOT be changed during this session, because
the answerer's response is "fixmode".

If an offerer wants to use a single mode throughout a session but
can receive either Mode 4 or Mode 1 bitstreams, the SDP should be:

<figure>
<artwork>
  m=audio 5004 RTP/AVP 96
  a=rtpmap:96 UEMCLIP/16000/1
  a=fmtp:96 fixmode=4,1
</artwork>
</figure>

The "ptime" attribute is used to denote the packetization
interval. When not specified, it SHOULD default to 20. Since UEMCLIP
uses 20 msec frames, ptime values that are multiples of 20 imply
multiple frames per packet.

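In the example below, the ptime is set to 60, which means that
there are 3 frames in each packet. When the fmtp line is not present,
the mode defaults to "fixmode" with the default mode for the given
sampling frequency and number of channels, i.e., Mode 1 for this
example (see <xref target="tab_defaultmodes"/> and <xref
target="sec_dyntx"/>).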

<figure>
<artwork>
  m=audio 5004 RTP/AVP 96
  a=ptime:60
  a=rtpmap:96 UEMCLIP/16000/1
</artwork>
</figure>

</t>
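
<t>
The handling of the "ptime" and fmtp lines in the examples above can
be sketched as follows. This is a non-normative Python illustration:
the function names are hypothetical, the use of ";" as the separator
between fmtp parameters follows common SDP practice rather than a
statement in this section, and rejecting ptime values that are not
multiples of 20 is an assumption of the sketch.
</t>

<figure>
<artwork><![CDATA[
```python
FRAME_MS = 20  # UEMCLIP frame length in msec, as stated above


def frames_per_packet(sdp_text):
    """Return the number of frames per packet implied by a=ptime."""
    ptime = FRAME_MS  # default when no a=ptime line is present
    for line in sdp_text.splitlines():
        line = line.strip()
        if line.startswith("a=ptime:"):
            ptime = int(line[len("a=ptime:"):])
    # Assumption: only positive multiples of 20 msec are accepted.
    if ptime <= 0 or ptime % FRAME_MS != 0:
        raise ValueError("unsupported ptime: %d" % ptime)
    return ptime // FRAME_MS


def parse_mode_fmtp(line):
    """Parse 'a=fmtp:96 dynmode=4,1' into (96, 'dynmode', [4, 1])."""
    head, _, params = line.strip().partition(" ")
    if not head.startswith("a=fmtp:"):
        raise ValueError("not an fmtp line: %r" % line)
    payload_type = int(head[len("a=fmtp:"):])
    for param in params.split(";"):
        name, _, value = param.strip().partition("=")
        if name in ("fixmode", "dynmode"):
            modes = [int(v) for v in value.split(",")]
            return payload_type, name, modes
    # No mode parameter found; other parameters are ignored here.
    return payload_type, None, []
```
]]></artwork>
</figure>

<t>
Applied to the last example above, frames_per_packet() returns 3; for
an SDP without an "a=ptime" line, it returns 1.
</t>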

</section>

</section>

</section>

<section anchor="sec_security" title="Security Considerations">
<t>
RTP packets using the payload format defined in this specification are
subject to the security considerations discussed in the RTP
specification <xref target="RFC3550"/> and any appropriate
profiles. This implies that confidentiality of the media streams is
achieved by encryption by other means.</t>

<t>
A potential denial-of-service threat exists for data encodings that
use compression techniques with non-uniform receiver-end
computational load. An attacker can inject pathological datagrams
into the stream that are complex to decode and cause the receiver
to become overloaded. However, the UEMCLIP codec covered in this
document does not exhibit any significant non-uniformity.</t>

<t>
Another potential threat is a memory attack using illegal layer
indices or byte numbers. Decoder implementors should always be aware
that the indicated numbers may be corrupted and may not point to the
correct sub-layer, which may force reading beyond the bitstream
boundaries.</t>
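
<t>
One way to guard against such reads is to validate every indicated
index and byte number before touching the payload. The following
non-normative Python sketch assumes that the header has already been
parsed into a hypothetical list of (layer index, byte count) pairs;
the function name and arguments are illustrative only, and the actual
header layout is the one defined elsewhere in this document.
</t>

<figure>
<artwork><![CDATA[
```python
def check_sublayers(payload_len, header_len, entries, known):
    """Validate sub-layer entries; return (index, start, end) spans.

    entries -- hypothetical list of (layer_index, n_bytes) pairs
    known   -- set of layer indices that are legal in this mode
    """
    offset = header_len
    spans = []
    for index, n_bytes in entries:
        if index not in known:
            raise ValueError("illegal layer index: %r" % (index,))
        # Reject sizes that would run past the end of the payload.
        if n_bytes < 0 or offset + n_bytes > payload_len:
            raise ValueError("sub-layer exceeds payload boundary")
        spans.append((index, offset, offset + n_bytes))
        offset += n_bytes
    return spans
```
]]></artwork>
</figure>

<t>
A decoder following this sketch would discard a packet as soon as a
ValueError is raised, instead of reading the indicated bytes.
</t>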

</section>

<section anchor="sec_iana" title="IANA Considerations">
<t>
IANA is requested to register one new media subtype
(audio/UEMCLIP). For details, see <xref
target="sec_mediatype"/>.</t>

</section>
</middle>

<back>

<references title='Normative References'>

<reference anchor='RFC2119'>
  <front>
    <title abbrev='RFC Key Words'>Key words for use in RFCs to Indicate Requirement Levels</title>
    <author initials='S.' surname='Bradner' fullname='Scott Bradner'>
      <organization>Harvard University</organization></author>
    <date year='1997' month='March' />
  </front>
  <seriesInfo name='BCP' value='14' />
  <seriesInfo name='RFC' value='2119' />
</reference>

<reference anchor='RFC3264'>
  <front>
    <title>An Offer/Answer Model with Session Description Protocol (SDP)</title>
    <author initials='J.' surname='Rosenberg' fullname='J. Rosenberg'>
      <organization /></author>
    <author initials='H.' surname='Schulzrinne' fullname='H. Schulzrinne'>
      <organization /></author>
    <date year='2002' month='June' />
  </front>
  <seriesInfo name='RFC' value='3264' />
</reference>

<reference anchor="RFC3550">
  <front>
    <title>RTP: A Transport Protocol for Real-Time Applications</title>
    <author initials="H." surname="Schulzrinne" fullname="H. Schulzrinne">
      <organization/></author>
    <author initials="S." surname="Casner" fullname="S. Casner">
      <organization/></author>
    <author initials="R." surname="Frederick" fullname="R. Frederick">
      <organization/></author>
    <author initials="V." surname="Jacobson" fullname="V. Jacobson">
      <organization/></author>
    <date year="2003" month="July"/>
  </front>
  <seriesInfo name="STD" value="64"/>
  <seriesInfo name="RFC" value="3550"/>
</reference>

<reference anchor='RFC3551'>
  <front>
    <title>RTP Profile for Audio and Video Conferences with Minimal Control</title>
    <author initials='H.' surname='Schulzrinne' fullname='H. Schulzrinne'>
      <organization /></author>
    <author initials='S.' surname='Casner' fullname='S. Casner'>
      <organization /></author>
    <date year='2003' month='July' />
  </front>
  <seriesInfo name='STD' value='65' />
  <seriesInfo name='RFC' value='3551' />
</reference>

<!--
<reference anchor='RFC3555'>
  <front>
    <title>MIME Type Registration of RTP Payload Formats</title>
    <author initials='S.' surname='Casner' fullname='S. Casner'>
      <organization /></author>
    <author initials='P.' surname='Hoschka' fullname='P. Hoschka'>
      <organization /></author>
    <date year='2003' month='July' />
  </front>
  <seriesInfo name='RFC' value='3555' />
</reference>
-->
<reference anchor='RFC4288'>
  <front>
    <title>Media Type Specifications and Registration Procedures</title>
    <author initials='N.' surname='Freed' fullname='N. Freed'>
      <organization /></author>
    <author initials='J.' surname='Klensin' fullname='J. Klensin'>
      <organization /></author>
    <date year='2005' month='December' />
  </front>
  <seriesInfo name='BCP' value='13' />
  <seriesInfo name='RFC' value='4288' />
</reference>

<reference anchor='RFC4566'>
  <front>
    <title abbrev='SDP'>SDP: Session Description Protocol</title>
      <author initials='M.' surname='Handley' fullname='M. Handley'>
        <organization/></author>
      <author initials='V.' surname='Jacobson' fullname='V. Jacobson'>
        <organization/></author>
      <author initials='C.' surname='Perkins' fullname='C. Perkins'>
        <organization/></author>
      <date year='2006' month='July' />
  </front>
  <seriesInfo name='RFC' value='4566' />
</reference>

<reference anchor='RFC4855'>
  <front>
    <title>Media Type Registration of RTP Payload Formats</title>
      <author initials='S.' surname='Casner' fullname='S. Casner'>
        <organization /></author>
      <date year='2007' month='February' />
  </front>
  <seriesInfo name='RFC' value='4855' />
</reference>

<reference anchor='RFC4856'>
  <front>
    <title>Media Type Registration of Payload Formats in the RTP Profile for Audio and Video Conferences</title>
      <author initials='S.' surname='Casner' fullname='S. Casner'>
        <organization /></author>
      <date year='2007' month='February' />
  </front>
  <seriesInfo name='RFC' value='4856' />
</reference>

</references>

<!--
<references title="Informative References">
<reference anchor='ieice_uemclip'>
  <front>
    <title>A G.711 Embedded Wideband Speech Coding for VoIP Conferences</title>
      <author initials='Y.' surname='Hiwasaki' fullname='Yusuke Hiwasaki'>
	<organization /></author>
      <author initials='H.' surname='Ohmuro' fullname='Hitoshi Ohmuro'>
	<organization /></author>
      <author initials='T.' surname='Mori' fullname='Takeshi Mori'>
	<organization /></author>
      <author initials='S.' surname='Kurihara' fullname='Sachiko Kurihara'>
	<organization /></author>
      <author initials='A.' surname='Kataoka' fullname='Akitoshi Kataoka'>
	<organization /></author>
      <date month='September' year='2006' />
    </front>
    <seriesInfo name="IEICE Trans. Inf. &amp; Syst., vol.E89-D" value="no. 9"/>
    </reference>
</references>
-->
</back>
</rfc>
