Identifying files by programming language

Most of today’s programming languages look vaguely similar. They’re derived from the C syntax, with similar ways of expressing assignments, arithmetic, conditionals, nested expressions, and groups of statements. If the files have their original extension and it’s accurate, format identification software should be able to classify them correctly.

The software should do some basic checks to make sure it wasn’t handed a binary file with a false extension, which could be dangerous. A code file should be a text file, regardless of the language. (This isn’t strictly true, but non-text languages like Piet and Velato are just obscure for the sake of obscurity.) The UK National Archives recognizes XML and JSON (which is a subset of JavaScript) but doesn’t talk about programming languages as file formats. ExifTool identifies lots of formats but makes no attempt to discern programming languages.
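A basic text-file sanity check of the kind described above might look like the following. This is only a rough sketch: the function name, the 8 KB sample size, and the 5% threshold are my own assumptions, not taken from any existing tool.

```python
def looks_like_text(data: bytes, sample_size: int = 8192) -> bool:
    """Rough check that a byte string is plausibly text rather than binary.

    The sample size and the 5% threshold are arbitrary assumptions.
    """
    sample = data[:sample_size]
    if not sample:
        return True  # an empty file is trivially "text"
    if b"\x00" in sample:
        # NUL bytes are a strong sign of binary content.
        # (This also rejects UTF-16 text, which is a known limitation.)
        return False
    # Count control characters other than tab, newline, form feed,
    # and carriage return.
    allowed = {9, 10, 12, 13}
    control = sum(1 for byte in sample if byte < 32 and byte not in allowed)
    return control / len(sample) < 0.05
```

A check like this only says that the file is probably text. It says nothing about whether the text is source code, let alone which language it is in.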

How to tell languages apart

Identifying a file as being in a particular programming language is tricky. Some have self-identifying markers; PHP files, for example, should start with “<?php”. Most languages aren’t so cooperative.
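Checking for self-identifying markers is the easy part. The sketch below looks for the PHP opening tag and, as an extension of my own, for a Unix "#!" interpreter line at the top of a script. The function name and the small interpreter table are illustrative assumptions; a real tool would need a longer list.

```python
import re
from typing import Optional

# Hypothetical, deliberately short table of interpreter names.
SHEBANG_INTERPRETERS = {
    "python": "Python",
    "ruby": "Ruby",
    "perl": "Perl",
    "node": "JavaScript",
    "php": "PHP",
    "sh": "Shell",
    "bash": "Shell",
}


def language_from_marker(text: str) -> Optional[str]:
    """Return a language name if the file starts with a known marker."""
    stripped = text.lstrip("\ufeff")  # ignore a leading byte-order mark
    if stripped.lstrip().startswith("<?php"):
        return "PHP"
    first_line = stripped.split("\n", 1)[0].strip()
    if first_line.startswith("#!"):
        # e.g. "#!/usr/bin/env python3" or "#!/bin/sh"
        parts = first_line[2:].split()
        if not parts:
            return None
        program = parts[0].rsplit("/", 1)[-1]
        if program == "env" and len(parts) > 1:
            program = parts[1]
        # Drop a trailing version number such as "3" or "2.7".
        name = re.sub(r"[\d.]+$", "", program)
        return SHEBANG_INTERPRETERS.get(name)
    return None
```

A result of None just means no marker was found, which will be the usual case.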

One approach is to compile the code with several different compilers (or syntax checkers, in the case of interpreted languages). The problem is that source files often need to include other files to compile properly. If those files aren’t available, the compiler will issue lots of error messages even if the code is syntactically perfect.
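For a language whose parser is available as a library, the syntax-checker variant is easy to try. As a minimal sketch, Python's standard ast module can report whether a piece of text parses as Python. The function name is my own, and checking other languages would mean calling out to their own tools, which I haven't shown here.

```python
import ast


def parses_as_python(source: str) -> bool:
    """Return True if the text is syntactically valid Python.

    Only the syntax is checked. Imports are not resolved, so missing
    modules do not cause a failure.
    """
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # ValueError covers source containing NUL bytes on some versions.
        return False
    return True
```

A successful parse is weak evidence on its own. An empty file parses as Python, and a line like "x = 1" is valid in several languages, so a check like this is better at ruling a language out than at confirming it.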

Another approach would be to search the files for features which are peculiar to one language. It might take several feature checks to identify a file's language; languages share features, and a particular file might not use all its language's capabilities. For instance, a file with package and import statements ending in a semicolon and public class or public interface declarations is likely to be Java. These are just heuristics, though, and oddly written source code can make even a software engineer wonder what language it is.
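The Java example can be turned into a handful of regular expressions. This is a sketch under my own assumptions: the patterns are simplified, and the threshold of two matching features is an arbitrary choice.

```python
import re

# One pattern per feature mentioned above: a package statement ending in
# a semicolon, an import statement ending in a semicolon, and a public
# class or interface declaration.
JAVA_FEATURES = [
    re.compile(r"^\s*package\s+[\w.]+\s*;", re.MULTILINE),
    re.compile(r"^\s*import\s+(?:static\s+)?[\w.]+(?:\.\*)?\s*;", re.MULTILINE),
    re.compile(
        r"^\s*public\s+(?:(?:abstract|final)\s+)?(?:class|interface)\s+\w+",
        re.MULTILINE,
    ),
]


def java_feature_count(source: str) -> int:
    """Count how many of the Java-like features appear in the text."""
    return sum(1 for pattern in JAVA_FEATURES if pattern.search(source))


def is_probably_java(source: str, threshold: int = 2) -> bool:
    """Heuristic guess only; the threshold is an arbitrary assumption."""
    return java_feature_count(source) >= threshold
```

Requiring more than one feature reduces the chance that a single coincidental match, such as an import line in some other language, decides the result.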

A Web search turns up software in Python and Ruby that claims to identify source files by language. I don’t know how well they work, or even if it’s safe to use them. A lot depends on your standards. If you use a small list of languages and assume the file belongs to one of them — let’s say C, Python, Java, JavaScript, PHP, and Ruby — then the job shouldn’t be too hard.

However, it might get false positives if a file is written in a language which is similar to one of them. For example, Groovy is intentionally similar to Java, and a Groovy file might be mistaken for a Java file. There are large numbers of niche languages, and it would be unrealistic to include modules to identify all of them.

There don’t seem to be any tools suitable for libraries and archives that try to identify a file’s programming language. I’m not sure whether this is because of lack of demand or because it’s too hard a problem. Archiving source code seems to be a largely unexplored area. If anyone needs to do it, the best alternative for now may be to run sanity checks for a text file and hope that the extension is correct.

Addendum: Johan van der Knijff pointed me at an article by Dr. Santhilata Kuppili Venkata on some work on machine learning to identify the type of a text file.

One response to “Identifying files by programming language”

  1. The following comment was submitted by email by Kevin Ashley, who said he had trouble posting a comment to this site:

    Thanks for an interesting post. I might take issue with your opening statement that most of today’s languages look a bit like C, not because it is untrue but because anyone interested in preserving software must also be interested in preserving software of the past, much of which does not look like C. And today, although many of us may not think of it this way, we are often using PostScript, which is definitely a programming language although very few people are able to program in it with any degree of fluency.

    Your post reminded me of a paper that won the best paper award at IDCC in 2011. Whilst it doesn’t propose a direct solution to your problem, it does extend a technique used to validate program-language text to validate arbitrary binary file formats, using formal grammars to generate syntax validators. It’s by Bill Underwood, titled “Grammar-Based Specification and Parsing of Binary File Formats”, and available from https://doi.org/10.2218/ijdc.v7i1.217 . This technique doesn’t do identification, but validation. However it does remind us that programming languages don’t have to be compiled in order to validate them, but merely subject to syntax checking. (This isn’t strictly true, particularly for some older languages where there are semantic constraints that aren’t picked up at the syntax-checking stage such as calling a function with the wrong number of arguments.) But if you are just trying to identify things, then syntax-checking is enough. In any event, archives will sometimes need to preserve code which is faulty, so you would rather say “This looks a lot like a FORTRAN 77 program” than “This is a valid FORTRAN 77 program which will definitely produce the right results when it is run.”

    But your comment about things like include files brings us to another problem. In languages like C with a macro pre-processor it is impossible to even do syntactic validation without all the include files since macros can easily be used to write code which, on the surface, doesn’t look valid and isn’t valid. Only when the macros are expanded does the code become so.

    I think heuristic identification is the best you can hope for. But Bill Underwood’s approach allows you to use a single type of tool to check all types of file, binary or text, which is a big step along the road.