Video streaming algorithms with Kotlin: H.264 NAL unit parsing
How Kotlin made writing a video streaming algorithm a little easier
Jul 19 2023 · 8 min read
Rotating video gif via Erik on Dribbble.
This post is part of a series of video streaming algorithms I've implemented in Kotlin.
I've come a long way from my first impression of Kotlin 4 years ago. I'm in love with the language and its features, from extension functions to coroutines and hopefully soon, context receivers. It improves the developer experience so much that I jump at any opportunity to think differently in Kotlin.
3 years ago I led the integration for the pivot to video streaming at GameChanger during the pandemic. Apart from helping select the streaming vendor (at the time Amazon's IVS was nascent, so we ended up choosing Mux as the streaming platform), I had to choose an RTMP library for video streaming on Android. I evaluated a few:
The first two were heavily based on the legacy Android Camera APIs, not the newer Camera2 APIs. Not quite satisfied with them, and wanting a little more control over the streaming pipeline, I decided to build our own Camera2-based RTMP streaming stack. I'm certain the streaming stack has changed in the time since I built it, but here's a bit about the H.264 NAL unit extraction pipeline that powered version 1.0 of the GameChanger live streaming experience.
The H.264 bitstream
The H.264 bitstream consists of a sequence of bytes chunked into descriptive pieces called NAL (Network Abstraction Layer) units. This bitstream can come in a variety of containers: it may be in a ByteArray, or, in the case of Android's MediaCodec, in ByteBuffer instances. Regardless of the enclosing container, they all offer constant-time random access to a byte at an index. The first piece of abstraction, therefore, is a ByteRandomAccess<T> that captures exactly that capability.
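A sketch of what that abstraction might look like (an illustrative reconstruction; beyond the ByteRandomAccess name, the details here are mine):

```kotlin
// Illustrative reconstruction: the abstraction only needs a size and
// constant-time access to a byte at an index, regardless of the
// container T that holds the bitstream.
interface ByteRandomAccess<T> {
    val size: Int
    operator fun get(index: Int): Byte

    companion object {
        // Wraps a ByteArray without copying it.
        fun of(bytes: ByteArray): ByteRandomAccess<ByteArray> =
            object : ByteRandomAccess<ByteArray> {
                override val size: Int get() = bytes.size
                override fun get(index: Int): Byte = bytes[index]
            }
    }
}
```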
Each NAL unit represents a different piece of the encoded stream, from picture parameters to full-blown pictures. Examples of these NAL units include:
- SPS and PPS: Sequence parameter set and pictures parameter set contain metadata relevant for decoding the pictures in the bit stream; for example, picture resolution.
- IDR frame: Instantaneous decoder refresh frames represent an image in the video stream that can be decoded independently. If you've ever wondered why seeking in a video always lands on particular frames, this is why: the decoder has a seed frame to which it applies diffs for consecutive frames; this is how the video is compressed.
These NAL units can be represented in an H.264 bitstream in two formats:
- The Annex B format
- The AVCC format
The Annex B format will be the focus of this blog post, for parsing the raw camera output from an Android phone. The AVCC format will be covered in the next post, which muxes the stream into FLV and sends it to the server over RTMP.
The Annex B format
The Annex B format bitstream uses start code prefixes (a sequence of distinct bytes that are never part of a NAL unit) to separate each NAL unit. Android's MediaCodec writes to a ByteBuffer in this format. Therefore, to decode the bitstream, one first needs to detect whether a byte is part of a NAL unit or of a start code.
The objective is the following API:
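A sketch of what that API could look like (the exact signature is a reconstruction; the lambda here receives each NAL unit's bounds):

```kotlin
// Minimal redefinition of the abstraction from the previous section.
interface ByteRandomAccess<T> {
    val size: Int
    operator fun get(index: Int): Byte
}

// Hypothetical shape of the target API: [action] is invoked once per NAL
// unit with its bounds in the bitstream (start inclusive, end exclusive).
// The body is built up over the sections that follow.
inline fun <T> ByteRandomAccess<T>.onEachNalUnit(
    action: (start: Int, end: Int) -> Unit
) {
    // Implementation sketched in the following sections.
}
```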
That is, an extension function on a ByteRandomAccess<T> is defined that invokes a callback on each NAL unit in the bitstream. Let's define this algorithm in Kotlin.
Determining the NAL start code
In Annex B formatted bitstreams, the NAL start codes may either be:
- A 3 byte sequence of [0, 0, 1]
- A 4 byte sequence of [0, 0, 0, 1]
This works because this sequence of bytes is forbidden inside a NAL unit unless properly escaped. Given a ByteRandomAccess<T>, determining whether an index in the bitstream marks a start code is then a matter of testing for either byte sequence.
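One way to write that check (a sketch; the function name startCodeLength is mine), using a when expression to distinguish the two prefix lengths:

```kotlin
// Minimal redefinition of the abstraction from earlier sections.
interface ByteRandomAccess<T> {
    val size: Int
    operator fun get(index: Int): Byte
}

private const val ZERO: Byte = 0
private const val ONE: Byte = 1

// Returns the length of the start code beginning at [index] (3 or 4),
// or 0 if the bytes at [index] are not a start code.
fun <T> ByteRandomAccess<T>.startCodeLength(index: Int): Int = when {
    index + 2 < size &&
        this[index] == ZERO && this[index + 1] == ZERO && this[index + 2] == ONE -> 3
    index + 3 < size &&
        this[index] == ZERO && this[index + 1] == ZERO &&
        this[index + 2] == ZERO && this[index + 3] == ONE -> 4
    else -> 0
}
```

Note that the 3-byte check safely runs first: a 4-byte start code fails it (its third byte is 0, not 1), so it falls through to the 4-byte branch.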
Determining the end of a NAL unit
After determining the delimiting start code in the bitstream, the next step is extracting the NAL unit. This is done by finding the next delimiting start code in the bitstream, which can be done by testing the bitstream for start codes after the index of the previous start code. Once the consecutive start code is found, the NAL unit is the byte sequence delimited by the two start codes.
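That scan can be sketched as follows (reconstructed names), reusing the start-code check from the previous section:

```kotlin
// Minimal redefinition of the abstraction from earlier sections.
interface ByteRandomAccess<T> {
    val size: Int
    operator fun get(index: Int): Byte
}

// Start-code check from the previous section (3- or 4-byte prefix, else 0).
fun <T> ByteRandomAccess<T>.startCodeLength(index: Int): Int = when {
    index + 2 < size && this[index] == 0.toByte() &&
        this[index + 1] == 0.toByte() && this[index + 2] == 1.toByte() -> 3
    index + 3 < size && this[index] == 0.toByte() && this[index + 1] == 0.toByte() &&
        this[index + 2] == 0.toByte() && this[index + 3] == 1.toByte() -> 4
    else -> 0
}

// Given [start], the first byte after a delimiting start code, scan forward
// until the next start code (or the end of the data). The NAL unit is then
// the byte range [start, returned index).
fun <T> ByteRandomAccess<T>.findNalUnitEnd(start: Int): Int {
    var i = start
    while (i < size && startCodeLength(i) == 0) i++
    return i
}
```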
Accessing the NAL units
Now that consecutive NAL start codes and the NAL unit they delimit can be found, the final step is putting together the onEachNalUnit extension defined above.
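Assembled from the pieces above, a complete sketch of the extension might look like this (again with reconstructed helper names):

```kotlin
// Minimal redefinition of the abstraction from earlier sections.
interface ByteRandomAccess<T> {
    val size: Int
    operator fun get(index: Int): Byte
}

// Start-code check from the earlier section.
fun <T> ByteRandomAccess<T>.startCodeLength(index: Int): Int = when {
    index + 2 < size && this[index] == 0.toByte() &&
        this[index + 1] == 0.toByte() && this[index + 2] == 1.toByte() -> 3
    index + 3 < size && this[index] == 0.toByte() && this[index + 1] == 0.toByte() &&
        this[index + 2] == 0.toByte() && this[index + 3] == 1.toByte() -> 4
    else -> 0
}

// Invokes [action] with the bounds (start inclusive, end exclusive) of each
// NAL unit found between start codes in the Annex B bitstream.
inline fun <T> ByteRandomAccess<T>.onEachNalUnit(action: (start: Int, end: Int) -> Unit) {
    var i = 0
    while (i < size) {
        val codeLength = startCodeLength(i)
        if (codeLength == 0) {
            i++ // not positioned at a start code yet; keep scanning
            continue
        }
        val start = i + codeLength
        var end = start
        while (end < size && startCodeLength(end) == 0) end++
        if (end > start) action(start, end)
        i = end
    }
}
```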
Consuming NAL Units
With the above, extracting NAL units from an Android MediaCodec is straightforward. The following snippets are edited for brevity and show part of the video streaming pipeline. First, a ByteRandomAccess for a ByteBuffer is created using a static method reference.
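A sketch of such a wrapper (the factory name here is illustrative); using ByteBuffer's absolute get avoids disturbing the buffer's position:

```kotlin
import java.nio.ByteBuffer

// Minimal redefinition of the abstraction from earlier sections.
interface ByteRandomAccess<T> {
    val size: Int
    operator fun get(index: Int): Byte
}

// Hypothetical factory: wraps a ByteBuffer using absolute get(index),
// which does not advance the buffer's position, leaving the codec's
// buffer untouched for downstream consumers.
fun byteBufferRandomAccess(buffer: ByteBuffer): ByteRandomAccess<ByteBuffer> =
    object : ByteRandomAccess<ByteBuffer> {
        override val size: Int get() = buffer.limit()
        override fun get(index: Int): Byte = buffer.get(index)
    }
```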
Next, a MediaCodec.Callback is created to be notified when a ByteBuffer is filled.
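An illustrative sketch of the shape of that callback (not the original code; the muxer hand-off is stubbed as a comment since FLV muxing is covered in the next post):

```kotlin
import android.media.MediaCodec
import android.media.MediaFormat

// Illustrative sketch: react to filled output buffers and hand them off.
val callback = object : MediaCodec.Callback() {
    override fun onOutputBufferAvailable(
        codec: MediaCodec,
        index: Int,
        info: MediaCodec.BufferInfo
    ) {
        val buffer = codec.getOutputBuffer(index) ?: return
        // Hand the Annex B data to the muxer here, e.g. (hypothetical call):
        // flvMuxer.writeVideoFrame(buffer, info.presentationTimeUs)
        codec.releaseOutputBuffer(index, /* render = */ false)
    }

    override fun onInputBufferAvailable(codec: MediaCodec, index: Int) = Unit

    override fun onError(codec: MediaCodec, e: MediaCodec.CodecException) = Unit

    override fun onOutputFormatChanged(codec: MediaCodec, format: MediaFormat) = Unit
}
```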
After the buffer is filled, it's passed to the FLVMuxer to insert the video frame, after which it is streamed via RTMP.
The NAL units in the ByteBuffer instances from the Android media codec class are Annex B formatted. In FLV, however, the NAL units are AVCC formatted. The FLVMuxer therefore needs to extract each NAL unit and re-encode it in the AVCC format.
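As a preview of that conversion, a simplified sketch (the function name is hypothetical, and this ignores how SPS/PPS are carried separately in the real container): AVCC framing replaces each start code with the unit's length as a 4-byte big-endian prefix.

```kotlin
// Rough sketch: convert a single Annex B NAL unit payload (start code
// already stripped) to AVCC framing by prefixing its length as a 4-byte
// big-endian integer.
fun toAvcc(nalUnit: ByteArray): ByteArray {
    val len = nalUnit.size
    val out = ByteArray(4 + len)
    out[0] = (len ushr 24).toByte()
    out[1] = (len ushr 16).toByte()
    out[2] = (len ushr 8).toByte()
    out[3] = len.toByte()
    nalUnit.copyInto(out, destinationOffset = 4)
    return out
}
```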
The next post in this series will expand on the FLVMuxer and its Kotlin-specific algorithms.
The algorithm above used the following Kotlin features:
- Type aliases: allowed for better communication of intent around existing types.
- Extension functions: allowed for easier-to-read call sites for onEachNalUnit.
- Inline functions: allowed for wrapping ByteArray containers with no object allocation overhead.
- when statements: more readable, exhaustive control flow and pattern matching relative to if/else or switch statements.
The algorithm can still be written in Java without these features, but it is much more concise in Kotlin. Consider the equivalent Java findNalUnit method that the algorithm above was adapted from: functionally equivalent, but the Kotlin version is a bit easier on the eyes.
This along with other initiatives like Kotlin Multiplatform make me excited for the future of Kotlin for software development. I can't wait to build my next Kotlin thing.