
File identification tools, part 7: Apache Tika

Apache Tika is a Java-based open source toolkit for identifying files and extracting metadata and text content. I don’t have much personal experience with it, apart from having used it with FITS. The Apache Software Foundation is actively maintaining it, and version 1.9 just came out on June 23, 2015. It can identify a wide range of formats and report metadata for a smaller but still impressive set. You can use Tika as a command line utility, a GUI application, or a Java library. You can find its source code on GitHub, or you can get its many components from the Maven Repository.

Tika isn’t designed to validate files. If it encounters a broken file, it won’t tell you much about how it violates the format’s expectations.

Originally it was a subproject of Lucene; it became a standalone project in 2010. It builds on existing parser libraries for various formats where possible. For some formats it uses its own libraries because nothing suitable was available. In most cases it relies on signatures or “magic numbers” to identify formats. While it identifies lots of formats, it doesn’t distinguish variants in as much detail as some other tools, such as DROID. Andy Jackson has written a document that sheds light on the comparative strengths of Tika and DROID. Developers can add their own plugins for unsupported formats. Solr and Lucene have built-in Tika integration.

Prior to version 1.9, Tika didn’t have support for batch processing. Version 1.9 has a tika-batch module, which is described in the change notes as “experimental.”

The book Tika in Action is available as an e-book (apparently DRM free, though it doesn’t specifically say so) or in a print edition. Anyone interested in using its API or building it should look at the detailed tutorial on tutorialspoint.com. The Tika facade serves basic uses of the API; more adventurous programmers can use the lower-level classes.
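To show how little code the facade needs, here’s a minimal sketch. It assumes Tika 1.x’s org.apache.tika.Tika class, whose detect() and parseToString() methods cover the two most common tasks; check the current API docs before building on it.

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaFacadeDemo {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        File file = new File(args[0]);
        // Identify the MIME type from the file's leading bytes, falling back on its name
        System.out.println(tika.detect(file));
        // Extract the plain text content, up to the facade's default length limit
        System.out.println(tika.parseToString(file));
    }
}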

Next: NLNZ Metadata Extraction Tool. To read this series from the beginning, start here.

File identification tools, part 6: FITS

FITS is the File Information Tool Set, a “Swiss army knife” aggregating results from several file identification tools. The Harvard University Libraries created it, and though it was technically open-source from the beginning, it wasn’t very convenient for anyone outside Harvard at first. Other institutions showed interest, its code base moved from Google Code to GitHub, and now it’s used by a number of digital repositories to identify and validate ingested documents. Don’t confuse it with the FITS (Flexible Image Transport System) data format.

It’s a Java-based application requiring Java 7 or higher. Documentation is found on Harvard’s website. It wraps Apache Tika, DROID, ExifTool, FFIdent, JHOVE, the National Library of New Zealand Metadata Extractor, and four Harvard native tools. Work is currently under way to add the MediaInfo tool to enhance video file support. It’s released as open source software under the GNU LGPL license. The release dates show there’s been a burst of activity lately, so make sure you have the latest and best version.

FITS is tailored for ingesting files into a repository. In its normal mode of operation, it processes whole directories, including all nested subdirectories, and produces a single XML output file, which can be in the FITS schema or in other standard schemas such as MIX. You can run it as a standalone application or as a library. It’s possible to add your own tools to FITS.
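For library use, the core calls are simple. The following is a rough sketch, assuming the Fits and FitsOutput classes in edu.harvard.hul.ois.fits behave as in recent releases; check the Javadoc for the version you’re using, and note the output path here is just a made-up example.

import java.io.File;
import java.io.IOException;

import edu.harvard.hul.ois.fits.Fits;
import edu.harvard.hul.ois.fits.FitsOutput;
import edu.harvard.hul.ois.fits.exceptions.FitsException;

public class FitsDemo {
    public static void main(String[] args) throws FitsException, IOException {
        // Reads fits.xml and the tool configuration from the FITS home directory
        Fits fits = new Fits();
        // Run all configured tools against one file and consolidate their output
        FitsOutput result = fits.examine(new File(args[0]));
        // Write the combined FITS XML report to disk
        result.saveToDisk("fits-report.xml");
    }
}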

File identification tools, part 5: JHOVE

In 2004, the Harvard University Libraries engaged me as a contractor to write the code for JHOVE under Stephen Abrams’ direction. I stayed around as an employee for eight more years. I mention this because I might be biased about JHOVE: I know about its bugs, how hard it is to install, what design decisions could have been better, and how spotty my support for it has been. Still, people keep downloading it, using it, and saying good things about it, so I must have done something right. Do any programmers trust the code they wrote ten years ago?

The current home of JHOVE is on GitHub under the Open Preservation Foundation, which has taken over maintenance of it from me. Documentation is on the OPF website. I urge people not to download it from SourceForge; it’s out of date there, and there have been reports of questionable practices by SourceForge’s current management. The latest version as of this writing is 1.11.

JHOVE stands for “JSTOR/Harvard Object Validation Environment,” though neither JSTOR nor Harvard is directly involved with it any longer. It identifies and validates files in a small set of formats, so it’s not a general-purpose identification tool, but it does a fairly thorough job on the formats it knows. The formats it validates are AIFF, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, WAV, XML, ASCII, and UTF-8. If it doesn’t recognize a file as any of those formats, it will call it a “Bytestream.” You can use JHOVE as a GUI or command line application, or as a Java library. If you’re going to use the library or otherwise do complicated things, I recommend downloading my payment-optional e-book, JHOVE Tips for Developers. Installation and configuration are tricky, so follow the instructions carefully and take your time.
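For the curious, here’s the rough shape of a library call, sketched from memory of the 1.x API (JhoveBase, App, Module, and OutputHandler in edu.harvard.hul.ois.jhove). The e-book goes into the details, and the exact signatures may vary between releases, so treat this as a starting point only.

import edu.harvard.hul.ois.jhove.App;
import edu.harvard.hul.ois.jhove.JhoveBase;
import edu.harvard.hul.ois.jhove.Module;
import edu.harvard.hul.ois.jhove.OutputHandler;

public class JhoveDemo {
    public static void main(String[] args) throws Exception {
        JhoveBase base = new JhoveBase();
        // Path to your JHOVE configuration file (adjust to your installation)
        base.init("conf/jhove.conf", null);
        // Pick one module; passing a null module to dispatch lets JHOVE try each in turn
        Module module = base.getModule("PDF-hul");
        OutputHandler handler = base.getHandler("XML");
        App app = new App("JhoveDemo", "1.0", new int[] {2015, 7, 1}, "usage", "no rights statement");
        // Validate the files named on the command line and write an XML report
        base.dispatch(app, module, null, handler, "report.xml", args);
    }
}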

JHOVE shouldn’t be confused with JHOVE2, which has similar aims to JHOVE but has a completely different code base, API, and user interface. It didn’t get as much funding as its creators hoped, so it doesn’t cover all the formats that JHOVE does.

Key concepts in JHOVE are “well-formed” and “valid.” When allowed to run all modules, it will always report that a file is a valid instance of something; it’s a valid bytestream if it’s not anything else. This has confused some people; a valid bytestream is nothing more than a sequence of zero or more bytes. Everything is a valid bytestream.

The concept of well-formed and valid files comes from XML. A well-formed XML file obeys the syntactic rules; a valid one conforms to a schema or DTD. JHOVE applies this concept to other formats, but it’s generally not as good a fit. Roughly, a file which is “well-formed but not valid” has errors, but not ones that should prevent rendering.

JHOVE doesn’t examine all aspects of a file. It doesn’t examine data streams within files or deal with encryption. It focuses on the structure of a file rather than its content. However, it’s very aggressive in what it does examine, so sometimes it will declare a file not valid when nearly all rendering software will process it correctly. If there’s a conflict between the format specification and generally accepted practice, it usually goes by the specification.

It checks for profiles within a format, such as PDF/A and TIFF/IT. It only reports full conformance to a profile, so if a file is intended to be PDF/A but fails any test for the profile, JHOVE will simply not list PDF/A as a profile. It won’t tell you why it fell short.

The PDF module has been the biggest adventure; PDF is really complicated, and its complexity has increased with each release. Bugs continue to turn up, and it covers PDF only through version 1.6. It needs to be updated for 1.7, which is equivalent to ISO 32000.

Sorry, I warned you that I’m JHOVE’s toughest critic. But I wouldn’t mind a chance to improve it a bit, through the funding mechanism I mentioned earlier in the blog.

Next: FITS. To read this series from the beginning, start here.

Funding for preservation software development

The Open Preservation Foundation (formerly the Open Planets Foundation) is launching a new model for funding the development of preservation-related software. Quoting from the announcement:

‘Over the last year the OPF has established a solid foundation for ensuring the sustainability of digital preservation technology and knowledge,’ explains Dr. Ross King, Chair of the OPF Board. ‘Our new strategic plan was introduced in November 2014 along with community surveys to establish the current state of the art. We developed our annual plan in consultation with our members and added JHOVE to our growing software portfolio. The new membership and software supporter models are the next steps towards realising our vision and mission.’ …

The software supporter model allows organisations to support individual digital preservation software products and ensure their ongoing sustainability and maintenance. We are launching support for JHOVE based on its broad adoption and need for active stewardship. It is also a component in several leading commercial digital preservation solutions. While it remains fully open source, supporters can steer our stewardship and maintenance activities and receive varying levels of technical support and training.

I have a selfish personal interest in spreading the word. At the moment, I’m between contracts, and I wouldn’t mind getting some funding from OPF to resume development work on JHOVE. I know its code base better than anyone else, I worked on it without pay as a hobby for a year or so after leaving Harvard, and I’d enjoy working on it some more if I could just get some compensation. This is possible, but only if there’s support from outside.

US libraries have been rather insular in their approach to software development. They’ll use free software if it’s available, but they aren’t inclined to help fund it. If they could each set aside some money for this purpose, it would help assure the continued creation and maintenance of the open source software which is important to their mission.

How about it, Harvard?

File identification tools, part 3: DROID and PRONOM

The last installment in this series looked at file, a simple command line tool available with Linux and Unix systems for determining file types. This one looks at DROID (Digital Record Object IDentification), a Java-based tool from the UK National Archives, focused on identifying and verifying files for the digital repositories of libraries and archives. It’s available as open source software under the New BSD License. Java 7 or 8 is needed for the current release (6.1.5). It relies on PRONOM, the National Archives’ registry of file format information.

Like file, DROID depends on files that describe distinctive data values for each format. It’s designed to process large batches of files and compiles reports in a much more useful way than file’s output. Reports can include total file counts and sizes by various criteria.

To install DROID, you have to download and expand the ZIP file for the latest version. On Windows, you run droid.bat; on sensible operating systems, run droid.sh. You may first have to make it executable:

chmod +x droid.sh
./droid.sh

Running droid.sh with no arguments launches the GUI application. If there are any command line arguments, it runs as a command line tool. You can type

./droid.sh --help

to see all the options.

The first time you run it as a GUI application, it may ask if you want to download some signature file updates from PRONOM. Let it do that.

It’s also possible to use DROID as a Java library in another application. FITS, for example, does this. There isn’t much documentation to help you, but if you’re really determined to try, look at the FITS source code for an example.
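To save the determined some digging, here’s roughly what the wiring looks like, pieced together from how FITS calls droid-core 6.1. Treat all the class names and signatures here as assumptions to verify against the FITS source and the DROID Javadoc; the signature file name is a placeholder for one downloaded from PRONOM.

import java.io.File;
import java.io.FileInputStream;

import uk.gov.nationalarchives.droid.core.BinarySignatureIdentifier;
import uk.gov.nationalarchives.droid.core.interfaces.IdentificationResult;
import uk.gov.nationalarchives.droid.core.interfaces.IdentificationResultCollection;
import uk.gov.nationalarchives.droid.core.interfaces.RequestIdentifier;
import uk.gov.nationalarchives.droid.core.interfaces.resource.FileSystemIdentificationRequest;
import uk.gov.nationalarchives.droid.core.interfaces.resource.RequestMetaData;

public class DroidDemo {
    public static void main(String[] args) throws Exception {
        // Load a binary signature file previously downloaded from PRONOM
        BinarySignatureIdentifier droid = new BinarySignatureIdentifier();
        droid.setSignatureFile("DROID_SignatureFile.xml");
        droid.init();

        File file = new File(args[0]);
        RequestMetaData meta =
                new RequestMetaData(file.length(), file.lastModified(), file.getName());
        FileSystemIdentificationRequest request =
                new FileSystemIdentificationRequest(meta, new RequestIdentifier(file.toURI()));
        request.open(new FileInputStream(file));

        // Match the file's bytes against the loaded signatures
        IdentificationResultCollection results = droid.matchBinarySignatures(request);
        for (IdentificationResult result : results.getResults()) {
            System.out.println(result.getPuid() + "  " + result.getName());
        }
        request.close();
    }
}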

DROID will report file types by extension if it can’t find a matching signature. This isn’t a very reliable way to identify a file, and you should examine any files matched only by extension to see what they really are and whether they’re broken. It may report more than one matching signature; this is very common with files that match more than one version of a format.

It isn’t possible to cover DROID in any depth in a blog post. The document DROID: How to use it and how to interpret your results is a useful guide to the software. It’s dated 2011, so some things may have changed.

Next: ExifTool. To read this series from the beginning, start here.

File identification tools, part 2: file

A widely available file identification tool is simply called file. It comes with nearly all Linux and Unix systems, including Macintosh computers running OS X. Detailed “man page” documentation is available. It requires using the command line shell, but its basic usage is simple:

file [filename]

file starts by checking for some special cases, such as directories, empty files, and “special files” that aren’t really files but ways of referring to devices. Next, it checks for “magic numbers,” identifiers near the beginning of the file that are (hopefully) unique to the format. If it doesn’t find a “magic” match, it checks whether the file looks like a text file, trying a variety of character encodings, including the ancient and obscure EBCDIC. Finally, if it does look like a text file, file will attempt to determine whether it’s in a known computer language (such as Java) or natural language (such as English). The identification of file types is generally good, but the language identification is very erratic.

The identification of magic numbers uses a set of magic files, and these vary among installations, so running the same version of file on different computers may produce different results. You can specify a custom set of magic files with the -m flag. If you want a file’s MIME type, you can specify --mime, --mime-type, or --mime-encoding. For example:

file --mime xyz.pdf

will tell you the MIME type of xyz.pdf. If it really is a PDF file, the output will be something like

xyz.pdf: application/pdf; charset=binary

If instead you enter

file --mime-type xyz.pdf

you’ll get

xyz.pdf: application/pdf

If some tests aren’t working reliably on your files, you can use the -e option to suppress them. If you don’t trust the magic files, you can enter

file -e soft xyz.pdf

But then you’ll get the uninformative

xyz.pdf: data

The -k option tells file not to stop with the first match but to apply additional tests. I haven’t found any cases where this is useful, but it might help to identify some weird files. It can slow down processing if you’re running it on a large number of files.

As with many other shell commands, you can type file --help to see all the options.

file can easily be fooled and won’t tell you if a file is defective, but it’s a very convenient quick way to query the type of a file.

Windows has a roughly similar command line tool called FTYPE, but its syntax is completely different.

Next: DROID and PRONOM. To read this series from the beginning, start here.

File identification tools, part 1

This is the start of a series on software for file identification. I’ll be exploring as broad a range as I reasonably can within the blog format, covering a variety of uses. I’m most familiar with the tools for preservation and archiving, but I’ll also look at tools for the end user and at digital forensics (in the proper sense of the word, the resolution of controversies).

We have to start with what constitutes “identifying” a file. For our purposes here, it means at least identifying its type. It can also include determining its subtype and telling you whether it’s a valid instance of the type. You can choose from many options. The simplest approach is to look at the file’s extension and hope it isn’t a lie. A little better is to use software that looks for a “magic number.” This gives a better clue but doesn’t tell you if the file is actually usable. Many tools are available that will look more rigorously at the file. Generally the more thorough a tool is, the narrower the range of files it can identify.
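To make the “magic number” idea concrete, here’s a toy checker in Java. The PDF and PNG signatures are real; everything else is stripped down for illustration, and real tools like file consult large databases of such signatures rather than two hard-coded ones.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

public class MagicCheck {
    // A PDF file starts with "%PDF"; a PNG file starts with these eight bytes
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};
    private static final byte[] PNG =
            {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    public static void main(String[] args) throws IOException {
        byte[] head = new byte[8];
        int n;
        try (FileInputStream in = new FileInputStream(args[0])) {
            n = in.read(head);   // read the first few bytes of the file
        }
        if (n >= PNG.length && Arrays.equals(Arrays.copyOf(head, PNG.length), PNG)) {
            System.out.println("image/png");
        } else if (n >= PDF.length && Arrays.equals(Arrays.copyOf(head, PDF.length), PDF)) {
            System.out.println("application/pdf");
        } else {
            System.out.println("unknown");
        }
    }
}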

Identification software can be too lax or too strict. If it’s too lax, it can give broken files, perhaps even malicious ones, its stamp of approval. If it’s too strict, it can reject files that deviate from the spec in harmless and commonly accepted ways. Some specifications are ambiguous, and an excessively strict checker might rely on an interpretation which others don’t follow. A format can have “dialects” which aren’t part of the official definition but are widely used. TIFF, to name one example, is open to all of these problems.

Some files can be ambiguous, corresponding to more than one format. Here’s a video with some head-exploding examples. It’s long but worth watching if you’re a format junkie.

The examples in the video may seem far-fetched, but there’s at least one commonly used format that has a dual identity: Adobe Illustrator files. Illustrator knows how to open a .ai file and get the application-specific data, but most non-Adobe applications will see it as a PDF file. Ambiguity can be a real problem when file readers are intentionally lax and try to “repair” a file. Different applications may read entirely different file types and content from the same file, or the same file may have different content on the screen and when printed. So even if an identification tool tells you correctly what the format is, that may not be the whole story. I don’t know of any tool that tries to identify multiple formats for the same file.

Knowing the version and subtype of a file can be important. When an application reads a file in a newer version than it was written for, it may fail unpredictably, and it’s likely to lose some information. Some applications limit their backward compatibility and may be unable to read old versions of a format. Subtypes can indicate a file’s suitability for purposes such as archiving and prepress.

I’ll use the tag “fident” for all posts in this series, to make it easy to grab them together.

Next: The shell file command line tool.

Update on the JHOVE handover

There’s a brief piece by Becky McGuinness in D-Lib Magazine on the handover of JHOVE to the Open Preservation Foundation. It describes upcoming plans:

During March the OPF will be working with Portico and other members to complete the transfer of JHOVE to its new home. The latest code base will move to the OPF GitHub organisation page. All documentation, source code files, and full change history will be publicly available, alongside other OPF supported software projects, including JHOVE2, Fido, jpylyzer, and the SCAPE project tools.

Once the initial transfer is complete the next step will be to set up a continuous integration (CI) build on Travis, an online CI service that’s integrated with GitHub. This will ensure that all new code submissions are built and tested publicly and automatically, including all external pull requests. This will establish a firm foundation for future changes based on agile software development best practises.

With this foundation in place OPF will test and incorporate JHOVE fixes from the community into the new project. Several OPF members have already developed fixes based on their own automated processes, which they will be releasing to the community. Working as a group these fixes will be examined and tested methodically. At the same time the OPF’s priority will be to produce a Debian package that can be downloaded and installed from its apt repository.

Following the transfer OPF will gather requirements from its members and the wider digital preservation community. The OPF aims to establish and oversee a self-sustaining community around JHOVE that will take these requirements forward, carrying out roadmapping exercises for future development and maintenance. The OPF will also assess the need for specific training and support material for JHOVE such as documentation and online or virtual machine demonstrators.

It’s great to know that JHOVE still has a future a decade after its birth, but what boggles my mind is the next sentence:

The transfer of JHOVE is supported by its creators and developers: Harvard Library, Portico, the California Digital Library, and Gary McGath.

I never expected to see my name in a list like that!

A new home for JHOVE

Over a decade ago, the Harvard University Libraries took me on as a contractor to start work on JHOVE. Later I became an employee, and JHOVE formed an important part of my work. When I left Harvard, I asked for continued “custody” of JHOVE so I could keep maintaining it, and got it. Over time it became less of a priority for me; there’s only so much time you can devote to something when no one’s paying you to do it.

After a long period of discussion, the Open Preservation Foundation (formerly the Open Planets Foundation) has taken up support of JHOVE. In addition to picking up the open source software, it’s resolved copyright issues in the documentation with Harvard, really over boilerplate that no one intended to enforce, but still an issue that had to be cleared.

Stephen Abrams, who was the real father of JHOVE, said, “We’re very pleased to see this transfer of stewardship responsibility for JHOVE to the OPF. It will ensure the continuity of maintenance, enhancement, and availability between the original JHOVE system and its successor JHOVE2, both key infrastructural components in wide use throughout the digital library community.”

JHOVE2 was originally supposed to be the successor to JHOVE, but it didn’t get enough funding to cover all the formats that JHOVE covers, so both are used, and the confusion of names is unfortunate. OPF has both in its portfolio. It doesn’t appear to have forked JHOVE to its GitHub repository yet, but I’m sure that’s coming soon.

My own GitHub repository for JHOVE should now be considered archival. Go forth and prosper, JHOVE.