Monthly Archives: July 2015

Photographic forensics

The FotoForensics site can be a valuable tool in checking the authenticity of an image. It’s easy to alter images with software and try to fool people with them. FotoForensics uses a technique called Error Level Analysis (ELA) to identify suspicious areas and highlight them visually. Playing with it a bit shows me that it takes practice to know what you’re seeing, but it’s worth knowing about if you ever have suspicions about an image.

New Horizons Pluto picture with cartoon PlutoLet’s start with an obvious fake, the iconic New Horizons image of Pluto with the equally iconic Disney dog superimposed on it. The ELA analysis shows a light-colored boundary around most of the cartoon, and the interior has very dark patches. The edge of the dwarf planet has a relatively dim boundary. According to the ELA tutorial, “Similar edges should have similar brightness in the ELA result. All high-contrast edges should look similar to each other, and all low-contrast edges should look similar. With an original photo, low-contrast edges should be almost as bright as high-contrast edges.” So that’s a confirmation that the New Horizons picture has been subtly altered.

Let’s compare that to an analysis of the unaltered image. The “heart” stands out as a dark spot on the ELA image, but its edges aren’t noticeably brighter than the edges of the planet’s (OK, “dwarf planet”) image. The tutorial says that “similar textures should have similar coloring under ELA. Areas with more surface detail, such as a close-up of a basketball, will likely have a higher ELA result than a smooth surface,” so it seems to make sense that the smooth heart (which is something like an ice plain) looks different.
Continue reading

PDF 2.0

As most people who read this blog know, the development of PDF didn’t end with the ISO 32000 (aka PDF 1.7) specification. Adobe has published three extensions to the specification. These aren’t called PDF 1.8, but they amount to a post-ISO version.

The ISO TC 171/SC 2 technical committee is working on what will be called PDF 2.0. The jump in major revision number reflects the change in how releases are being managed but doesn’t seem to portend huge changes in the format. PDF is no longer just an Adobe product, though the company is still heavily involved in the spec’s continued development. According to the PDF Association, the biggest task right now is removing ambiguities. The specification’s language will shift from describing conforming readers and writers to describing a valid file. This certainly sounds like an improvement. The article mentions that several sections have been completely rewritten and reorganized. What’s interesting is that their chapter numbers have all been incremented by 4 over the PDF 1.7 specification. We can wonder what the four new chapters are.

Leonard Rosenthol gave a presentation on PDF 2.0 in 2013.

As with many complicated projects, PDF 2.0 has fallen behind its original schedule, which expected publication in 2013. The current target for publication is the middle of 2016.

veraPDF validator

The veraPDF Consortium has announced a public prototype of its PDF validation software.

It’s ultimately intended to be “the definitive open source, file-format validator for all parts and conformance levels of ISO 19005 (PDF/A)”; however, it’s “currently more a proof of concept than a usable file format validator.”

New developments in JPEG

A report from the 69th meeting of the JPEG Committee, held in Warsaw in June, mentions several recent initiatives. The descriptions have a rather high buzzword-to-content ratio, but here’s my best interpretation of what I think they mean. What’s usually called “JPEG” is one of several file formats supported by the Joint Photographic Experts Group, and JFIF would be a more precise name. Not every format name that starts with JPEG refers to “JPEG” files, but if I refer to JPEG without further qualification here, it means the familiar format.
Continue reading

File identification tools, part 9: JHOVE2

The story of JHOVE2 is a rather sad one, but I need to include it in this series. As the name suggests, it was supposed to be the next generation of JHOVE. Stephen Abrams, the creator of JHOVE (I only implemented the code), was still at Harvard, and so was I. I would have enjoyed working on it, getting things right that the first version got wrong. However, Stephen accepted a position with the California Digital Library (CDL), and that put an end to Harvard’s participation in the project. I thought about applying for a position in California but decided I didn’t want to move west. I was on the advisory board but didn’t really do much, and I had no involvement in the programming. I’m not saying I could have written JHOVE2 better, just explaining my relationship to the project. JHOVE2 logo

The institutions that did work on it were CDL, Portico, and Stanford University. There were two problems with the project. The big one was insufficient funding; the money ran out before JHOVE2 could boast a set of modules comparable to JHOVE. A secondary problem was usability. It’s complex and difficult to work with. I think if I’d been working on the project, I could have helped to mitigate this. I did, after all, add a GUI to JHOVE when Stephen wasn’t looking.

JHOVE has some problems that needed fixing. It quits its analysis on the first error. It’s unforgiving on identification; a TIFF file with a validation error simply isn’t a TIFF file, as far as it’s concerned. Its architecture doesn’t readily accommodate multi-file documents. It deals with embedded formats only on a special-case basis (e.g., Exif metadata in non-TIFF files). Its profile identification is an afterthought. JHOVE2 provided better ways to deal with these issues. The developers wrote it from scratch, and it didn’t aim for any kind of compatibility with JHOVE.
Continue reading

Redecoration in progress

I’m working on some changes in the appearance of this blog. You may see weird things while this is happening. Sorry for the inconvenience.


TIFF has been around for a long time. Its latest official specification, TIFF 6.0, dates from 1992. The format hasn’t held still for 23 years, though. Adobe has issued several “technical notes” describing important changes and clarifications. Software developers, by general consensus, have ignored the requirement that value offsets have to be on a word boundary, since it’s a pointless restriction with modern computers. Private tags are allowed, and lots of different sources have defined new tags. Some of them have achieved wide acceptance, such as the TIFFTAG_ICCPROFILE tag (34675), which fills the need to associate ICC color profiles with images. Many applications use the EXIF tag set to specify metadata, but this isn’t part of the “standard” either.

In other words, TIFF today is the sum of a lot of unwritten rules.

It’s generally not too hard to deal with the chaos and produce files that all well-known modern applications can handle. On the other hand, it’s easy to produce a perfectly legal TIFF file that only your own custom application will handle as you intended. People putting files into archives need some confidence in their viability. Assumptions which are popular today might shift over a decade or two. Variations in metadata conventions might cause problems.
Continue reading

File identification tools, part 7: Apache Tika

Apache Tika is a Java-based open source toolkit for identifying files and extracting metadata and text content. I don’t have much personal experience with it, apart from having used it with FITS. Apache Software Foundation is actively maintaining it, and version 1.9 just came out on June 23, 2015. It can identify a wide range of formats and report metadata from a smaller but still impressive set. You can use Tika as a command line utility, a GUI application, or a Java library. You can find its source code on GitHub, or you can get its many components from the Maven Repository.

Tika isn’t designed to validate files. If it encounters a broken file, it won’t tell you much about how it violates the format’s expectations.

Originally it was a subproject of Lucene; it became a standalone project in 2010. It builds on existing parser libraries for various formats where possible. For some formats it uses its own libraries because nothing suitable was available. In most cases it relies on signatures or “magic numbers” to identify formats. While it identifies lots of formats, it doesn’t distinguish variants in as much detail as some other tools, such as DROID. Andy Jackson has written a document that sheds light on the comparative strengths of Tika and DROID. Developers can add their own plugins for unsupported formats. Solr and Lucene have built-in Tika integration.

Prior to version 1.9, Tika didn’t have support for batch processing. Version 1.9 has a tika-batch module, which is described in the change notes as “experimental.”

The book Tika in Action is available as an e-book (apparently DRM free, though it doesn’t specifically say so) or in a print edition. Anyone interested in using its API or building it should look at the detailed tutorial on The Tika facade serves basic uses of the API; more adventurous programmers can use the lower-level classes.

Next: NLNZ Metadata Extraction Tool. To read this series from the beginning, start here.