Most of today’s programming languages look vaguely similar. Many borrow from C’s syntax, with similar ways of expressing assignments, arithmetic, conditionals, nested expressions, and groups of statements. If the files have their original extension and it’s accurate, format identification software should be able to classify them correctly.
How to tell languages apart
Identifying a file as being in a particular programming language is tricky. Some have self-identifying markers; PHP files, for example, should start with “<?php”. Most languages aren't so cooperative.
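A marker check like this is easy to write. The following is only a minimal sketch in Python; the function name is made up for illustration, and the decision to skip a byte order mark and leading whitespace is an assumption about how real files look, not a rule from any specification.

```python
def starts_with_php_tag(data: bytes) -> bool:
    """Return True if the content begins with the PHP opening tag.

    Hypothetical helper: a quick hint, not proof that the file is PHP.
    """
    # Assumption: tolerate a UTF-8 byte order mark that some editors add.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    # Assumption: tolerate leading whitespace before the tag.
    return data.lstrip().startswith(b"<?php")


if __name__ == "__main__":
    print(starts_with_php_tag(b"<?php echo 'hello'; ?>"))
    print(starts_with_php_tag(b"import java.util.List;"))
```

A file that opens with HTML and only later switches into PHP would be missed by this check, so a negative result doesn't rule PHP out.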
One approach is to compile the code with several different compilers (or syntax checkers, in the case of interpreted languages). The problem is that source files often need to include other files to compile properly. If those files aren't available, the compiler will issue lots of error messages even if the source file itself is syntactically perfect.
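To make the idea concrete, here is a rough sketch in Python of trying several syntax checkers on one file. It assumes Python's own parser for Python source, plus the syntax-only modes of two external tools ("ruby -c" and "node --check") if they happen to be installed. The function names and the choice of tools are illustrative assumptions, and a real setup would need to confirm how each tool behaves on unfamiliar input.

```python
import ast
import shutil
import subprocess
from pathlib import Path

# Assumption: these tools may or may not be installed; missing ones are skipped.
EXTERNAL_CHECKERS = {
    "ruby": ["ruby", "-c"],
    "javascript": ["node", "--check"],
}


def parses_as_python(source: str) -> bool:
    """Syntax-only check: parses the text without resolving any imports."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def candidate_languages(path) -> list:
    """Return the languages whose syntax checker accepts the file."""
    path = Path(path)
    candidates = []
    source = path.read_text(encoding="utf-8", errors="replace")
    if parses_as_python(source):
        candidates.append("python")
    for language, command in EXTERNAL_CHECKERS.items():
        if shutil.which(command[0]) is None:
            continue  # Tool not available on this machine.
        try:
            result = subprocess.run(
                command + [str(path)], capture_output=True, timeout=30
            )
        except (OSError, subprocess.TimeoutExpired):
            continue
        if result.returncode == 0:
            candidates.append(language)
    return candidates
```

Syntax-only checks sidestep part of the missing-include problem, since they don't try to resolve other files. They can still accept more than one language for the same file: a short line such as "x = 1" is valid in several languages at once.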
Another approach would be to search the files for features which are peculiar to one language. It might take several feature checks to identify a file's language; languages share features, and a particular file might not use all its language's capabilities. For instance, a file with “import” statements ending in a semicolon and “public class” or “public interface” declarations is likely to be Java. These are just heuristics, though, and oddly written source code can make even a software engineer wonder what language it is.
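The Java heuristic just described can be sketched with a couple of regular expressions. This is only an illustration in Python; the patterns and the function name are assumptions, and they deliberately cover just the two features mentioned, not the full Java grammar.

```python
import re

# An import statement that ends in a semicolon, e.g. "import java.util.List;".
IMPORT_RE = re.compile(
    r"^\s*import\s+(?:static\s+)?[\w.]+(?:\.\*)?\s*;", re.MULTILINE
)

# A "public class" or "public interface" declaration at the start of a line.
TYPE_DECL_RE = re.compile(
    r"^\s*public\s+(?:(?:abstract|final)\s+)?(?:class|interface)\s+\w+",
    re.MULTILINE,
)


def looks_like_java(source: str) -> bool:
    """Heuristic only: True when both features are present in the text."""
    return bool(IMPORT_RE.search(source)) and bool(TYPE_DECL_RE.search(source))


if __name__ == "__main__":
    java_source = "import java.util.List;\n\npublic class Foo {\n}\n"
    python_source = "import os\n\nclass Foo:\n    pass\n"
    print(looks_like_java(java_source))
    print(looks_like_java(python_source))
```

Requiring both features cuts down on accidental matches, at the cost of missing Java files that have no imports or no public type.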
However, a feature-based checker might get false positives if a file is written in a language which is similar to one the checker knows about. For example, Groovy is intentionally similar to Java, and a Groovy file might be mistaken for a Java file. There are large numbers of niche languages, and it would be unrealistic for any tool to include modules to identify all of them.
There don’t seem to be any tools suitable for libraries and archives that try to identify a file’s programming language. I’m not sure whether this is because of lack of demand or because it’s too hard a problem. Archiving source code seems to be a largely unexplored area. If anyone needs to do it, the best alternative for now may be to run sanity checks confirming that the file is plain text and hope that the extension is correct.
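That fallback might look something like the following Python sketch. The extension table is a small hypothetical sample, and treating "no NUL bytes and valid UTF-8" as the definition of a text file is an assumption; files in older encodings would need a different test.

```python
from pathlib import Path
from typing import Optional

# Hypothetical, deliberately incomplete mapping from extension to language.
EXTENSION_LANGUAGES = {
    ".c": "C",
    ".groovy": "Groovy",
    ".java": "Java",
    ".php": "PHP",
    ".py": "Python",
}


def looks_like_text(data: bytes) -> bool:
    """Sanity check: no NUL bytes, and the content decodes as UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def guess_language(path) -> Optional[str]:
    """Trust the extension, but only if the content passes the text check."""
    path = Path(path)
    if not looks_like_text(path.read_bytes()):
        return None
    return EXTENSION_LANGUAGES.get(path.suffix.lower())
```

A result of None here means either that the content doesn't look like text or that the extension isn't in the table; a caller that cares about the difference would want to report the two cases separately.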
Addendum: Johan van der Knijff pointed me at an article by Dr. Santhilata Kuppili Venkata on some work on machine learning to identify the type of a text file.