Contents

Containers, Streams and Codecs Explained

If you want to work with multimedia files, you need to understand the concepts of containers, streams and codecs. This post will explain what they are and how they work.

Containers

A container is a file format that can hold one or more streams. The most common container formats are MP4, MKV, AVI, MOV, FLV and WebM. The container format is usually indicated by the file extension. For example, a file with the extension .mp4 is most likely an MP4 container.

Streams

A stream is a continuous flow of data that represents a single track within a container, such as an audio track, a video track, or a subtitle track. A container can contain one or more streams. For example, an MP4 container can contain an audio stream, a video stream, and a subtitle stream, each of which can be decoded and played independently or extracted into its own file.

./container.png
Structure of a multimedia container
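
You can inspect the streams inside a container with ffprobe, a companion tool to ffmpeg (mentioned at the end of this post). The sketch below calls it from Python; the input file name is hypothetical:

```python
import json
import subprocess

def list_streams(path):
    """Print the index, type and codec of every stream in a container."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    for stream in json.loads(result.stdout)["streams"]:
        # codec_type is "video", "audio" or "subtitle"
        print(stream["index"], stream["codec_type"], stream.get("codec_name"))

list_streams("movie.mp4")  # hypothetical input file
```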

Multiplexing

Multiplexing (muxing) is the process of combining streams into a container. For example, you can multiplex an audio stream, a video stream, and a subtitle stream into an MP4 container.

Demuxing

Demuxing is the process of extracting streams from a container. For example, you can demux an MP4 container to extract the audio stream, the video stream, and the subtitle stream.
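
As a minimal sketch, both operations can be done with the ffmpeg command line (introduced at the end of this post). The file names below are hypothetical, and -c copy tells ffmpeg to copy the streams as-is rather than re-encode them:

```python
import subprocess

# Muxing: combine a raw H.264 video stream and an AAC audio stream into one MP4 container.
subprocess.run(
    ["ffmpeg", "-i", "video.h264", "-i", "audio.aac", "-c", "copy", "movie.mp4"],
    check=True,
)

# Demuxing: extract just the audio stream from the container (-vn drops the video).
subprocess.run(
    ["ffmpeg", "-i", "movie.mp4", "-vn", "-c:a", "copy", "audio.aac"],
    check=True,
)
```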

Codecs

A codec is an algorithm (typically implemented as a software library, sometimes in hardware) that can encode and decode streams. For example, the H.264 codec can encode and decode video streams. Some common audio codecs include MP3 and AAC, and some popular video codecs are H.264, MPEG-2 and HEVC.

Channels

The channel count of an audio stream is the number of audio channels it contains. Each channel usually corresponds to one microphone input or one speaker output. For example, a stereo audio stream has two channels, one for the left speaker and one for the right speaker.

Resolution

The resolution of a video stream is the number of pixels in each frame. For example, a video stream with a resolution of 1920x1080 has 1920 pixels in the horizontal direction and 1080 pixels in the vertical direction.

Frame Rate

The frame rate of a video stream is the number of frames displayed per second. For example, a video stream with a frame rate of 30 frames per second shows a new frame roughly every 33 milliseconds.
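
To get a feel for how resolution and frame rate add up, here is a rough calculation of the raw (uncompressed) data rate of such a stream, assuming 3 bytes per pixel (24-bit RGB):

```python
width, height = 1920, 1080   # resolution in pixels
fps = 30                     # frames per second
bytes_per_pixel = 3          # assuming uncompressed 24-bit RGB

bytes_per_frame = width * height * bytes_per_pixel   # 6,220,800 bytes (~6.2 MB) per frame
bytes_per_second = bytes_per_frame * fps             # ~186.6 MB per second

print(f"{bytes_per_second / 1_000_000:.1f} MB/s uncompressed")
```

Numbers like these are why video is almost always compressed, as discussed further below.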

Sampling Rate

Audio in the physical world is analog, but on a computer, audio is represented as a series of samples taken at regular intervals. The rate at which the samples are taken is called the sampling rate. The higher the sampling rate, the more accurately the digital audio represents the original analog audio. A typical sampling rate is 44.1 kHz, or 44,100 samples per second. This is because the human ear can only hear sounds between 20 Hz and 20 kHz, and according to the Nyquist–Shannon sampling theorem the sampling rate should be at least twice the highest frequency you want to reproduce.

./sampling.webp
Effect of increasing sampling rate
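
A minimal sketch of sampling in code: generate one second of a 440 Hz sine wave (an arbitrary tone chosen for illustration) sampled at 44.1 kHz:

```python
import math

sample_rate = 44_100   # samples per second
frequency = 440        # Hz, an arbitrary tone for illustration
duration = 1.0         # seconds

# One amplitude value per sample: the higher the sample rate,
# the more closely these values trace the original analog wave.
samples = [
    math.sin(2 * math.pi * frequency * n / sample_rate)
    for n in range(int(sample_rate * duration))
]
print(len(samples))  # 44100 samples for one second of audio
```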

Sample Size

The sample size of an audio stream is the size of each sample. For example, an audio stream with a sample size of 16 bits uses 16 bits to represent each sample.

Consider a stereo audio clip (two audio channels) with a sample size of 16 bits (2 bytes), recorded at 48 kHz. You can calculate the rate of data produced by this audio clip as follows:

$$ 2\ \text{channels} × 2 \tfrac{\text{bytes}}{\text{sample}} × 48000 \tfrac{\text{samples}}{\text{second}} = 192000 \tfrac{\text{bytes}}{\text{second}} = 192\ \text{kB/s} $$
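
The same calculation in code, using the assumptions above (2 channels, 16-bit samples, 48 kHz):

```python
channels = 2           # stereo
bytes_per_sample = 2   # 16-bit samples
sample_rate = 48_000   # samples per second, per channel

bytes_per_second = channels * bytes_per_sample * sample_rate
print(bytes_per_second)  # 192000 bytes per second = 192 kB/s
```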

Sample Format

The sample format of an audio stream is the format of each sample. For example, some audio streams use 16-bit signed integer for their sample format.

Duration

The duration is the length of the stream in time, typically expressed in seconds or as hours:minutes:seconds.

Bit Depth

The bit depth can be mentioned in the context of images or audio. The bit depth of an image is the number of bits used to represent each pixel, often quoted per color channel. For example, an image with 8 bits per channel uses 8 bits (256 possible values) for each of the red, green and blue components, which means you can represent 256 × 256 × 256 = over 16 million possible RGB colors. Increasing this to 10 bits per channel gives you 1024 possibilities for each color, which means you can represent 1024 × 1024 × 1024 = over 1 billion possible RGB colors.
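
A quick check of those color counts in code:

```python
for bits_per_channel in (8, 10):
    levels = 2 ** bits_per_channel   # possible values per R, G or B channel
    colors = levels ** 3             # every combination of the three channels
    print(f"{bits_per_channel} bits per channel: {levels} levels, {colors:,} RGB colors")

# 8 bits per channel:  256 levels, 16,777,216 RGB colors
# 10 bits per channel: 1024 levels, 1,073,741,824 RGB colors
```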

The bit depth of an audio stream is the number of bits used to represent each sample. For example, an audio stream with a bit depth of 16 bits per sample uses 16 bits to represent each sample.

Changing the bit depth can affect the quality of the image or audio but it also affects the size of the file produced, the data transferred over the network, and the processing power required to encode and decode the stream.

Bitrate

The bitrate is the number of bits used to represent one second of an audio or video stream. The size of a video file can be estimated by multiplying the bitrate by the duration of the video. For example, if a video file has a bitrate of 10 Mbit per second and a duration of 1 hour, then its size is roughly 10,000,000 × 3600 = 36,000,000,000 bits = 4,500,000,000 bytes = 4.5 GB.
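
The same calculation in code:

```python
bitrate = 10_000_000   # bits per second (10 Mbit/s)
duration = 3600        # seconds (1 hour)

size_bits = bitrate * duration
size_bytes = size_bits // 8
print(f"{size_bits:,} bits = {size_bytes / 1_000_000_000:.1f} GB")
# 36,000,000,000 bits = 4.5 GB
```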

Compression

Compression is the process of reducing the size of a file by removing redundant data. A compression algorithm can be lossy or lossless. A lossy compression algorithm discards some of the original data, resulting in a significantly smaller file at the cost of some quality. A lossless compression algorithm, on the other hand, preserves the original data exactly, but does not reduce the size nearly as much. Examples of lossy audio codecs are MP3 and AAC, while FLAC is a lossless audio codec (WAV files typically store uncompressed audio rather than compressed audio). Examples of lossy video codecs are H.264 and HEVC, while FFV1 is a lossless video codec.

Color Space

A color space is a model for representing colors as numbers so that they can be stored and manipulated by a computer.

The most commonly used color space is RGB, an additive color model in which red, green, and blue light are added together to produce a broad range of colors: the absence of all three is black, and all three at full intensity is white.

./rgb.jpeg
RGB Color Model. Full brightness of all three colors is represented by white in the middle

Another common color space is CMYK, which is commonly used in printing. It is a subtractive color model in which cyan, magenta, and yellow inks are laid on white paper to produce the different colors. Combining all three inks produces black in theory (in practice a dark, muddy brown, which is why printers also use a dedicated black ink, the K in CMYK).

./cmyk.jpeg
CMYK Color Model. Full saturation of all inks plus black is represented by black in the middle.

Another common color space is HSB, also known as HSV (HSL is a closely related but slightly different model). It is a cylindrical color model in which hue, saturation, and brightness are used to represent colors. The hue is the color itself, the saturation is how pure or intense the color is (low saturation tends toward gray), and the brightness is how light or dark the color is (low brightness tends toward black).

Chroma Subsampling

Chroma subsampling is a compression technique used to reduce the bandwidth required to transmit or store a video stream. It is used in video compression standards such as H.264 and MPEG-2, and in image formats such as JPEG. The idea is to store the color (chroma) information at a lower resolution than the brightness (luma) information, for example keeping one chroma sample for every 2x2 block of pixels. This works because the human eye is more sensitive to changes in brightness than to changes in color. The most common chroma subsampling schemes are 4:4:4, 4:2:2, and 4:2:0, where 4:4:4 is the highest quality (no subsampling) and 4:2:0 is the lowest quality, cutting the raw image data roughly in half.
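
A rough sketch of how much raw data each scheme stores per frame, assuming 8 bits per sample and a 1920x1080 frame (the samples-per-pixel averages follow from how each scheme shares chroma samples between neighbouring pixels):

```python
width, height = 1920, 1080
pixels = width * height

# Average number of 8-bit samples stored per pixel (one luma plus two chroma components).
samples_per_pixel = {
    "4:4:4": 3.0,   # chroma at full resolution
    "4:2:2": 2.0,   # chroma halved horizontally
    "4:2:0": 1.5,   # chroma halved horizontally and vertically
}

for scheme, spp in samples_per_pixel.items():
    megabytes = pixels * spp / 1_000_000
    print(f"{scheme}: {megabytes:.1f} MB per raw frame")

# 4:4:4: 6.2 MB, 4:2:2: 4.1 MB, 4:2:0: 3.1 MB (a 50% saving relative to 4:4:4)
```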

Manipulating Media files

If you want to manipulate audio or video files, you can use the ffmpeg software. You can learn how to use it here.

If you only want to edit audio files and prefer a visual tool, you can use Audacity, a great open source application for manipulating audio files.