Video streaming algorithms with Kotlin: H.264 NAL unit parsing
How Kotlin made writing a video streaming algorithm a little easier
Jul 19 2023 · 8 min read
Rotating video gif via Erik on Dribbble.
This post is part of a series of video streaming algorithms I've implemented in Kotlin.
I've come a long way from my first impression of Kotlin 4 years ago. I'm in love with the language and its features, from extension functions to coroutines and hopefully soon, context receivers. It improves the developer experience so much that I jump at any opportunity to think differently in Kotlin.
3 years ago I led the integration for the pivot to video streaming at GameChanger during the pandemic. Apart from helping select the streaming vendor (at the time Amazon's IVS was nascent, so we ended up choosing Mux as the streaming platform), I had to choose an RTMP library for video streaming on Android. I evaluated a few:
The first two were heavily based on the legacy Android Camera APIs, not the newer Camera2 APIs. Not quite satisfied with them, and wanting a little more control over the streaming pipeline, I decided to build our own Camera2-based RTMP streaming stack. I'm certain the streaming stack has changed in the time since I built it, but here's a bit about the H.264 NAL unit extraction pipeline that powered version 1.0 of the GameChanger live streaming experience.
The H.264 bitstream
The H.264 bitstream consists of a sequence of bytes chunked into descriptive pieces called NAL (Network Abstraction Layer) units. This bitstream can come in a variety of containers: it may be in a ByteArray, or, in the case of Android's MediaCodec, in ByteBuffer instances. Regardless of the enclosing container, they all offer constant-time random access to a byte at an index. The first piece of abstraction, therefore, is a ByteRandomAccess<T> that captures exactly that capability.
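A sketch of what that abstraction might look like (an illustrative reconstruction; beyond the ByteRandomAccess name, the details here are mine):

```kotlin
// Illustrative reconstruction: the abstraction only needs a size and
// constant-time access to a byte at an index, regardless of the
// container T that holds the bitstream.
interface ByteRandomAccess<T> {
    val size: Int
    operator fun get(index: Int): Byte

    companion object {
        // Wraps a ByteArray without copying it.
        fun of(bytes: ByteArray): ByteRandomAccess<ByteArray> =
            object : ByteRandomAccess<ByteArray> {
                override val size: Int get() = bytes.size
                override fun get(index: Int): Byte = bytes[index]
            }
    }
}
```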
Each NAL unit represents a different piece of the encoded stream, from picture parameters to full-blown pictures. Examples of these NAL units include:
- SPS and PPS: Sequence parameter set and pictures parameter set contain metadata relevant for decoding the pictures in the bit stream; for example, picture resolution.
- IDR frame: Instantaneous decoder refresh frames represent an image in the video stream that can be decoded independently. If you've ever wondered why seeking in a video always lands on particular frames, this is why: the decoder has a seed frame to which it applies diffs for consecutive frames; this is how the video is compressed.
These NAL units can be represented in an H.264 bitstream in two formats:
- The Annex B format
- The AVCC format
The Annex B format will be the focus of this blog post, for parsing the raw camera output from an Android phone. The AVCC format will be covered in the next post, which muxes the stream into FLV and sends it to the server over RTMP.
The Annex B format
The Annex B format bitstream uses start code prefixes (a sequence of distinct bytes that are never part of a NAL unit) to separate each NAL unit. Android's MediaCodec writes to a ByteBuffer in this format. Therefore, to decode the bitstream, one first needs to detect whether a byte is part of a NAL unit or of a start code.
The objective is the following API:
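A sketch of what that API could look like (the exact signature is a reconstruction; the lambda here receives each NAL unit's bounds):

```kotlin
// Minimal redefinition of the abstraction from the previous section.
interface ByteRandomAccess<T> {
    val size: Int
    operator fun get(index: Int): Byte
}

// Hypothetical shape of the target API: [action] is invoked once per NAL
// unit with its bounds in the bitstream (start inclusive, end exclusive).
// The body is built up over the sections that follow.
inline fun <T> ByteRandomAccess<T>.onEachNalUnit(
    action: (start: Int, end: Int) -> Unit
) {
    // Implementation sketched in the following sections.
}
```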
That is, an extension function on a ByteRandomAccess<T> is defined that invokes a callback on each NAL unit in the bitstream. Let's define this algorithm in Kotlin.
Determining the NAL start code
In Annex B formatted bitstreams, the NAL start codes may either be:
- A 3 byte sequence of [0, 0, 1]
- A 4 byte sequence of [0, 0, 0, 1]
This works because this sequence of bytes is forbidden inside a NAL unit unless properly escaped. Given a ByteRandomAccess<T>, determining whether an index in the bitstream marks a start code is then a matter of testing for either byte sequence.
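One way to write that check (a sketch; the function name startCodeLength is mine), using a when expression to distinguish the two prefix lengths:

```kotlin
// Minimal redefinition of the abstraction from earlier sections.
interface ByteRandomAccess<T> {
    val size: Int
    operator fun get(index: Int): Byte
}

private const val ZERO: Byte = 0
private const val ONE: Byte = 1

// Returns the length of the start code beginning at [index] (3 or 4),
// or 0 if the bytes at [index] are not a start code.
fun <T> ByteRandomAccess<T>.startCodeLength(index: Int): Int = when {
    index + 2 < size &&
        this[index] == ZERO && this[index + 1] == ZERO && this[index + 2] == ONE -> 3
    index + 3 < size &&
        this[index] == ZERO && this[index + 1] == ZERO &&
        this[index + 2] == ZERO && this[index + 3] == ONE -> 4
    else -> 0
}
```

Note that the 3-byte check safely runs first: a 4-byte start code fails it (its third byte is 0, not 1), so it falls through to the 4-byte branch.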
Determining the end of a NAL unit
After determining the delimiting start code in the bitstream, the next step is extracting the NAL unit. This is done by finding the next delimiting start code in the bitstream, which can be done by testing the bitstream for start codes after the index of the previous start code. Once the consecutive start code is found, the NAL unit is the byte sequence delimited by the two start codes.
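That scan can be sketched as follows (reconstructed names), reusing the start-code check from the previous section:

```kotlin
// Minimal redefinition of the abstraction from earlier sections.
interface ByteRandomAccess<T> {
    val size: Int
    operator fun get(index: Int): Byte
}

// Start-code check from the previous section (3- or 4-byte prefix, else 0).
fun <T> ByteRandomAccess<T>.startCodeLength(index: Int): Int = when {
    index + 2 < size && this[index] == 0.toByte() &&
        this[index + 1] == 0.toByte() && this[index + 2] == 1.toByte() -> 3
    index + 3 < size && this[index] == 0.toByte() && this[index + 1] == 0.toByte() &&
        this[index + 2] == 0.toByte() && this[index + 3] == 1.toByte() -> 4
    else -> 0
}

// Given [start], the first byte after a delimiting start code, scan forward
// until the next start code (or the end of the data). The NAL unit is then
// the byte range [start, returned index).
fun <T> ByteRandomAccess<T>.findNalUnitEnd(start: Int): Int {
    var i = start
    while (i < size && startCodeLength(i) == 0) i++
    return i
}
```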
Accessing the NAL units
Now that consecutive NAL start codes and the NAL unit they delimit can be found, the final step is putting together the onEachNalUnit extension defined above.
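Assembled from the pieces above, a complete sketch of the extension might look like this (again with reconstructed helper names):

```kotlin
// Minimal redefinition of the abstraction from earlier sections.
interface ByteRandomAccess<T> {
    val size: Int
    operator fun get(index: Int): Byte
}

// Start-code check from the earlier section.
fun <T> ByteRandomAccess<T>.startCodeLength(index: Int): Int = when {
    index + 2 < size && this[index] == 0.toByte() &&
        this[index + 1] == 0.toByte() && this[index + 2] == 1.toByte() -> 3
    index + 3 < size && this[index] == 0.toByte() && this[index + 1] == 0.toByte() &&
        this[index + 2] == 0.toByte() && this[index + 3] == 1.toByte() -> 4
    else -> 0
}

// Invokes [action] with the bounds (start inclusive, end exclusive) of each
// NAL unit found between start codes in the Annex B bitstream.
inline fun <T> ByteRandomAccess<T>.onEachNalUnit(action: (start: Int, end: Int) -> Unit) {
    var i = 0
    while (i < size) {
        val codeLength = startCodeLength(i)
        if (codeLength == 0) {
            i++ // not positioned at a start code yet; keep scanning
            continue
        }
        val start = i + codeLength
        var end = start
        while (end < size && startCodeLength(end) == 0) end++
        if (end > start) action(start, end)
        i = end
    }
}
```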
Consuming NAL Units
With the above, extracting NAL units from an Android MediaCodec is straightforward. The following snippets are edited for brevity and show part of the video streaming pipeline. First, a ByteRandomAccess for a ByteBuffer is created using a static method reference.
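A sketch of such a wrapper (the factory name here is illustrative); using ByteBuffer's absolute get avoids disturbing the buffer's position:

```kotlin
import java.nio.ByteBuffer

// Minimal redefinition of the abstraction from earlier sections.
interface ByteRandomAccess<T> {
    val size: Int
    operator fun get(index: Int): Byte
}

// Hypothetical factory: wraps a ByteBuffer using absolute get(index),
// which does not advance the buffer's position, leaving the codec's
// buffer untouched for downstream consumers.
fun byteBufferRandomAccess(buffer: ByteBuffer): ByteRandomAccess<ByteBuffer> =
    object : ByteRandomAccess<ByteBuffer> {
        override val size: Int get() = buffer.limit()
        override fun get(index: Int): Byte = buffer.get(index)
    }
```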
Next, a MediaCodec.Callback is created to be notified when a ByteBuffer is filled.
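An illustrative sketch of the shape of that callback (not the original code; the muxer hand-off is stubbed as a comment since FLV muxing is covered in the next post):

```kotlin
import android.media.MediaCodec
import android.media.MediaFormat

// Illustrative sketch: react to filled output buffers and hand them off.
val callback = object : MediaCodec.Callback() {
    override fun onOutputBufferAvailable(
        codec: MediaCodec,
        index: Int,
        info: MediaCodec.BufferInfo
    ) {
        val buffer = codec.getOutputBuffer(index) ?: return
        // Hand the Annex B data to the muxer here, e.g. (hypothetical call):
        // flvMuxer.writeVideoFrame(buffer, info.presentationTimeUs)
        codec.releaseOutputBuffer(index, /* render = */ false)
    }

    override fun onInputBufferAvailable(codec: MediaCodec, index: Int) = Unit

    override fun onError(codec: MediaCodec, e: MediaCodec.CodecException) = Unit

    override fun onOutputFormatChanged(codec: MediaCodec, format: MediaFormat) = Unit
}
```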
After the buffer is filled, it's passed to the FLVMuxer to insert the video frame, after which it is streamed via RTMP.
The NAL units in the ByteBuffer instances from the Android media codec class are Annex B formatted. In FLV, however, the NAL units are AVCC formatted. The FLVMuxer therefore needs to extract each NAL unit and re-encode it in the AVCC format.
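As a preview of that conversion, a simplified sketch (the function name is hypothetical, and this ignores how SPS/PPS are carried separately in the real container): AVCC framing replaces each start code with the unit's length as a 4-byte big-endian prefix.

```kotlin
// Rough sketch: convert a single Annex B NAL unit payload (start code
// already stripped) to AVCC framing by prefixing its length as a 4-byte
// big-endian integer.
fun toAvcc(nalUnit: ByteArray): ByteArray {
    val len = nalUnit.size
    val out = ByteArray(4 + len)
    out[0] = (len ushr 24).toByte()
    out[1] = (len ushr 16).toByte()
    out[2] = (len ushr 8).toByte()
    out[3] = len.toByte()
    nalUnit.copyInto(out, destinationOffset = 4)
    return out
}
```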
The next post in this series will expand on the FLVMuxer and its Kotlin-specific algorithms.
The algorithm above used the following Kotlin features:
- Type aliases: allowed for better communication of intent around existing types.
- Extension functions: allowed for easier-to-read call sites for onEachNalUnit.
- Inline functions: allowed for wrapping ByteArray containers with no object allocation overhead.
- when statements: more readable, exhaustive control flow and pattern matching relative to if/else or switch statements.
The algorithm can still be written in Java without these features, but it is much more concise in Kotlin. Consider the equivalent Java findNalUnit method that the algorithm above was adapted from: functionally equivalent, but the Kotlin version is a bit easier on the eyes.
This along with other initiatives like Kotlin Multiplatform make me excited for the future of Kotlin for software development. I can't wait to build my next Kotlin thing.