Tag Archives: file identification

Identifying files by programming language

Most of today’s programming languages look vaguely similar. They’re derived from the C syntax, with similar ways of expressing assignments, arithmetic, conditionals, nested expressions, and groups of statements. If the files have their original extension and it’s accurate, format identification software should be able to classify them correctly.

The software should do some basic checks to make sure it wasn’t handed a binary file with a false extension, which could be dangerous. A code file should be a text file. regardless of the language. (This isn’t strictly true, but non-text languages like Piet and Velato are just obscure for the sake of obscurity.) The UK National Archive recognizes XML and JSON (which is a subset of JavaScript) but doesn’t talk about programming languages as file formats. Exiftool identifies lots of formats but makes no attempt to discern programming languages.
Continue reading