A starting point for learning about file formats

This page is intended for students, especially homeschoolers, to get started in learning about data file formats. The important part of it is the links which I’ve selected. Now I don’t claim to be an education expert, just a file format expert, but I’ve tried to select material suitable for high school students. If you know some basics about computers, you should find these pages useful for expanding your knowledge. You just need to understand what bits, bytes, and files are.

File format basics

Data files usually consist of a sequence of bytes. Some file systems support multiple sequences or “forks,” but most formats don’t use them. The bytes in a file are numbered from 0 to n – 1, where n is the file size in bytes. A format can use either sequential access, where software processes the bytes in order, or random access, where the software jumps around to the bytes it needs.
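If you'd like to see the two access styles in action, here is a minimal Python sketch. It isn't tied to any real format. It writes a ten-byte scratch file (the variable names and the file contents are my own inventions for illustration), reads it in order, and then jumps straight to one offset.

```python
import os
import tempfile

# Create a small scratch file holding the bytes 0, 1, 2, ..., 9.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(10)))
    sample_path = f.name

# Sequential access: read the bytes in order, from offset 0 to n - 1.
with open(sample_path, "rb") as f:
    sequential = list(f.read())

# Random access: jump straight to offset 7 and read two bytes from there.
with open(sample_path, "rb") as f:
    f.seek(7)
    jumped = list(f.read(2))

file_size = os.path.getsize(sample_path)
os.remove(sample_path)  # clean up the scratch file

print(sequential)  # all ten bytes, in order
print(jumped)      # [7, 8]
print(file_size)   # 10
```

The seek call is what makes random access possible. The software doesn't have to read bytes 0 through 6 to get at byte 7.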

Formats represent different types of content. Common categories are:

  • Plain text files
  • Formatted text files
  • Image files
  • Audio files
  • Video files
  • Design files
  • Presentation and e-book files
  • Structured data files
  • Executable files
  • Compression and packaging

Another way to classify formats is by whether they compress their data. Image, audio, and video data tend to be huge. Compressing the data brings the size down. There are three categories:

  • No compression
  • Lossless compression
  • Lossy compression

Lossless compression is reversible. You can reconstruct the uncompressed data precisely. Lossy compression throws away the information that matters least to how the content looks or sounds, allowing a better compression ratio, but the original data can’t be recovered exactly. Lossless compression algorithms can’t squeeze audio, image, or video files down very much. Lossy methods can shrink the files more, with little or no effect on the way you see and hear the content.
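Here is a small Python sketch of the difference. The lossless half uses the standard zlib module. The "lossy" half is only a toy I made up to show the idea: it drops the low four bits of each sample value. Real lossy codecs are far more sophisticated about choosing what to discard.

```python
import zlib

# Sample data with a lot of repetition, which compresses well.
original = b"ABABABABAB" * 100

# Lossless: compress, then decompress, and get exactly the original back.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Toy "lossy" step: keep only the top 4 bits of each sample value.
# The low 4 bits are thrown away and cannot be recovered afterward.
samples = bytes([200, 201, 203, 90, 91, 17])
lossy = bytes(s & 0xF0 for s in samples)

print(len(original), len(compressed))  # compressed is much smaller
print(restored == original)            # True
print(list(lossy))                     # [192, 192, 192, 80, 80, 16]
```

Notice that 200, 201, and 203 all turn into 192. The values are close to the originals, but there is no way to tell afterward which one was which.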

Formats can be proprietary or non-proprietary. The distinction is sometimes tricky. The most thoroughly non-proprietary (or “open”) formats renounce all royalty and patent claims and document everything. A format that a company created and holds patents on is usually considered non-proprietary if the company promises anyone can use it without payments or restrictions. Proprietary formats are encumbered by patents or trade secrets, and a license is required for at least some uses.

Sometimes anyone is free to render files in the format, but a license is needed to create files. MP3 was outrageously proprietary, demanding license payments even to create an open-source player, but its patents have expired. Once a format’s patents expire, it’s open to anyone to implement.

The most annoying aspect of proprietary files often isn’t the license requirement but the lack of free documentation.

Metadata

Many file formats provide a way to include metadata, “data about data.” Putting metadata into a file lets others find out information such as when and how the file was made, what its content represents, and what restrictions there are on its use. Some popular metadata formats are used across different file formats.

Metadata is often represented using XML (eXtensible Markup Language). It’s a structured text format similar to HTML. It can represent hierarchies of information and associate data labels with values. XML has many uses within file formats, so it’s worth learning about.
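To give a feel for how XML pairs labels with values, here is a short Python sketch using the standard xml.etree.ElementTree module. The metadata record and its element names are made up for illustration. They don't follow any particular metadata standard.

```python
import xml.etree.ElementTree as ET

# A made-up metadata record; the element names are for illustration only.
xml_text = """
<metadata>
  <title>Sunset over the lake</title>
  <creator>Pat Example</creator>
  <created>2020-06-01</created>
  <rights>
    <license>CC BY 4.0</license>
  </rights>
</metadata>
""".strip()

root = ET.fromstring(xml_text)

# Each element's tag is the label; its text is the value.
title = root.findtext("title")
# Nested elements form a hierarchy, addressed with a path.
license_name = root.findtext("rights/license")
# The labels directly under the root element, in document order.
labels = [child.tag for child in root]

print(title)         # Sunset over the lake
print(license_name)  # CC BY 4.0
print(labels)        # ['title', 'creator', 'created', 'rights']
```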

Character encoding

Probably the large majority of file formats make some use of text. They use it for metadata, if nothing else. Text has to be represented as bits. The most common text encoding is ASCII, the American Standard Code for Information Interchange. It stores each character in seven bits, which allows 128 different characters. 33 of these (codes 0 through 31, plus 127) are reserved for “control” characters for historical reasons. Most of them are no longer used, but a few, such as tab and carriage return, still are.

Notice the word “American.” The encoding includes the 26 uppercase and lowercase letters of the English alphabet, but it doesn’t include any characters with accent marks. It doesn’t work very well for most European languages, and not at all for Chinese or Korean. Various extended encodings have been used over the years. The most popular encoding today is Unicode, which can use up to 32 bits per character. That would allow 4,294,967,296 different characters to be encoded. (For various technical reasons, the actual limit is much smaller: the highest code is 1,114,111, or 10FFFF in hexadecimal.) Unicode covers just about all the major languages in the world (but not Klingon). It also includes a ton of emoji. The codes 0 through 127 are the same as in ASCII.

But what I just said isn’t quite right. Unicode isn’t an encoding. Officially, it’s a “standard for digital representation of the characters used in writing all of the world’s languages.” It assigns a numeric “code point” to each character, but it doesn’t say how to represent the code point. The most common Unicode encoding is UTF-8. It represents characters with 1 to 4 bytes of data. ASCII characters require just one byte, so UTF-8 files are reasonably economical in size. In fact, an ASCII text file is a perfectly good UTF-8 file.
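You can check the 1-to-4-byte claim yourself. This Python sketch encodes four sample characters of my own choosing and counts the bytes each one needs in UTF-8. It also confirms that plain ASCII text comes out byte-for-byte identical in both encodings.

```python
# Four sample characters, written as escapes so the code itself is pure ASCII.
samples = {
    "letter A": "A",                 # U+0041, in the ASCII range
    "e with acute": "\u00e9",        # U+00E9
    "Chinese zhong": "\u4e2d",       # U+4E2D
    "grinning face": "\U0001F600",   # U+1F600, an emoji
}

utf8_lengths = {}
for name, ch in samples.items():
    encoded = ch.encode("utf-8")
    utf8_lengths[name] = len(encoded)
    print(f"{name}: code point U+{ord(ch):04X}, {len(encoded)} byte(s)")

# ASCII text is encoded identically in ASCII and in UTF-8.
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")

# ASCII has no way to represent an accented letter.
try:
    "\u00e9".encode("ascii")
    ascii_handles_accent = True
except UnicodeEncodeError:
    ascii_handles_accent = False

print(same_bytes)            # True
print(ascii_handles_accent)  # False
```

The four characters need 1, 2, 3, and 4 bytes respectively, which is why UTF-8 is called a variable-length encoding.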

Containers and codecs

Video and audio formats usually consist of two pieces: the container and the codec. The container organizes the data and metadata, and the codec (coder-decoder) defines how the video or audio tracks are encoded and decoded. The reason for the separation is to support multiple compression methods in one format. If an exciting new compression method comes out, it’s only necessary to create a new codec for it, not an entire file format.

File packages

Some formats use a different kind of container. They use multiple files to hold all the relevant information, aggregating them into a single file. You’ve probably run into the Zip compression file format. It can hold a collection of files using lossless compression. Formats such as OpenDocument and Office Open XML use a Zip container to group XML-based data files. Their files are technically Zip files, but calling them that would miss the point.
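Here is a rough Python sketch of the packaging idea, using the standard zipfile module. It bundles two tiny XML files into one Zip container held in memory, then opens the container again. The member names and contents are only illustrative, and the result is not a valid OpenDocument or Office Open XML file. Those formats require specific members that I've left out.

```python
import io
import zipfile

# Build a Zip container in memory holding two small XML files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as package:
    package.writestr("content.xml", "<doc><p>Hello</p></doc>")
    package.writestr("meta.xml", "<meta><title>Example</title></meta>")

package_bytes = buffer.getvalue()

# Reopen the container and look inside.
with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
    names = package.namelist()
    content = package.read("content.xml").decode("utf-8")

print(names)              # ['content.xml', 'meta.xml']
print(content)            # <doc><p>Hello</p></doc>
print(package_bytes[:2])  # b'PK', the start of every non-empty Zip file
```

Because the compression is lossless, the XML comes back out exactly as it went in.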

File format identification and validation

Software needs to identify a file’s format to decide what to do with it. The most obvious clue is the file’s extension, e.g., “docx” in Document.docx (Microsoft Word file) or “jpg” in Image.jpg (JPEG image file). There are a few problems with this approach, though.

  1. The extension could be a lie. If there was an error in naming the file, it might have a different format.
  2. Extensions sometimes conflict. Well-known extensions are normally unique, but an obscure format could adopt the same extension as an already existing obscure format.
  3. The file might be an invalid instance of the format. It might be corrupted so that it can’t be completely read, or the software that created it could be buggy.
  4. Some formats have incompatible variations or versions. For example, MP4 is a container format that can hold different codecs. The application that creates a file might use an unusual codec, and the application which tries to read it might not have that codec.

Many applications specialize in identifying and validating files. They go beyond the extension and examine the inside of the file. They vary greatly in thoroughness. Some just look at the first few bytes. Others analyze the whole file. Some are so fussy that they insist that perfectly usable files are invalid. (I’ve written some in the last category.)
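The "first few bytes" approach is easy to sketch in Python. Many formats begin with a fixed signature, often called a "magic number." The table below lists a handful of well-known signatures. The function name identify and the descriptions are my own, and a real identification tool checks far more than this.

```python
# A few well-known file signatures ("magic numbers") and what they suggest.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "Zip container (possibly docx, odt, ...)"),
]

def identify(data: bytes) -> str:
    """Guess a format from the first bytes of a file's content."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"

# The guess ignores the file name entirely, so a wrong extension can't fool it.
print(identify(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))  # PNG image
print(identify(b"%PDF-1.7\n"))                       # PDF document
print(identify(b"Just some plain text"))             # unknown
```

A match only says what the file claims to be at its start. It says nothing about whether the rest of the file is valid, which is why the more thorough tools analyze the whole thing.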

Videos!

You may prefer learning from videos. Here are a few you might like.

That should be enough to keep you busy for a while. I may update this page from time to time.