Monthly Archives: March 2010

iPres 2009 proceedings available

The proceedings from iPres 2009 are now available online. Of particular interest in the area of file formats is “MIXED: Repository of Durable File Format Conversion.”

Thanks to Digitization 101 for the link.

New JHOVE2 alpha release v. 0.6.0

Forwarded from Stephen Abrams:

A new alpha release of JHOVE2 is now available for download and evaluation (v. 0.6.0, 2010-03-17). Distribution packages (in zip and tar.gz form) are available on the JHOVE2 public wiki.

The new JHOVE2 architecture reflected in this prototype is described in the architectural overview.

The distribution package contains two driver scripts in the JHOVE2 home directory: a DOS shell script (jhove2.bat) for Windows and a Bourne shell script for Unix/Linux. Please see the download page for instructions on any modifications that need to be made to these scripts to run in your environment.

You can verify the installation with the command (for Unix):

      % ./ test.xml -o test.xml.out

This command should produce results similar to this.

The prototype supports the following features:

  • Format identification, validation, feature extraction, and message digest.
  • Appropriate recursive processing of directories, file sets, clumps, and container files (see the architectural overview for the definition of file sets and clumps).
  • High performance buffered I/O using the Java NIO package.
  • Integration with DROID for file identification.
  • Message digesting for the following algorithms: Adler-32, CRC-32, MD2, MD5, SHA-1, SHA-256, SHA-384, SHA-512.
  • Results formatted as text (name/value pairs), JSON, and XML.
  • Use of the Spring Framework v2.5.6.
  • Inversion-of-Control (IOC) container for flexible application and module configuration using dependency injection.
  • Complete modules:
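
To give a sense of what the message digesting feature listed above involves, here is a minimal Java sketch that computes several of the named algorithms with the standard library alone (java.util.zip for Adler-32 and CRC-32, java.security.MessageDigest for the rest). It does not use the JHOVE2 API. The class name DigestSketch and its two helper methods are hypothetical and exist only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical illustration class; not part of JHOVE2.
final class DigestSketch {

    /** Returns the lowercase hex digest of data for a MessageDigest algorithm name. */
    static String hexDigest(String algorithm, byte[] data) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            // Availability depends on the installed security providers.
            throw new IllegalArgumentException("Unsupported algorithm: " + algorithm, e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Feeds data through a java.util.zip checksum and returns its value. */
    static long checksum(Checksum c, byte[] data) {
        c.update(data, 0, data.length);
        return c.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "abc".getBytes(StandardCharsets.UTF_8);

        System.out.println("Adler-32: " + Long.toHexString(checksum(new Adler32(), data)));
        System.out.println("CRC-32: " + Long.toHexString(checksum(new CRC32(), data)));

        String[] names = {"MD2", "MD5", "SHA-1", "SHA-256", "SHA-384", "SHA-512"};
        for (String name : names) {
            try {
                System.out.println(name + ": " + hexDigest(name, data));
            } catch (IllegalArgumentException e) {
                System.out.println(name + ": not available in this runtime");
            }
        }
    }
}
```

For the input "abc", the MD5 line should read 900150983cd24fb0d6963f7d28e17f72, the standard test vector from RFC 1321. A real tool would stream file contents through the digests rather than hold them in memory.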

Please be aware of the following limitations and caveats:

  • JHOVE2 requires a Java 1.6 JRE.
  • This prototype is being made available to provide an early look at the new JHOVE2 architecture and APIs. While the full processing model is demonstrated, there is limited format support at this time.
  • The aggregate-level identification module (i.e. the “aggrefier” module) has been configured by the Spring configuration files in this distribution to recognize a Shapefile formed by the files with the extensions “.shp”, “.shx”, and “.dbf”. The Shapefile module itself, however, is minimally functional.
  • There is no assessment module available for review at this time.
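
The aggregate-level identification described above (treating “.shp”, “.shx”, and “.dbf” files as one Shapefile) can be illustrated with a short Java sketch. This is not the aggrefier's actual implementation, which is driven by Spring configuration. It assumes, purely for illustration, that the three files share a base name and that extensions are matched case-insensitively. ShapefileAggregateSketch and findShapefiles are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration class; not part of JHOVE2.
final class ShapefileAggregateSketch {

    // The three extensions named in the announcement.
    private static final Set<String> REQUIRED =
            new HashSet<>(Arrays.asList("shp", "shx", "dbf"));

    /** Returns, in sorted order, the base names that have all three required files. */
    static List<String> findShapefiles(List<String> fileNames) {
        // Assumption: files belonging to one Shapefile share a base name.
        Map<String, Set<String>> extensionsByBase = new TreeMap<>();
        for (String name : fileNames) {
            int dot = name.lastIndexOf('.');
            if (dot <= 0) {
                continue; // no extension, or a name that starts with a dot
            }
            String base = name.substring(0, dot);
            String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
            extensionsByBase.computeIfAbsent(base, k -> new HashSet<>()).add(ext);
        }

        List<String> complete = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : extensionsByBase.entrySet()) {
            if (entry.getValue().containsAll(REQUIRED)) {
                complete.add(entry.getKey());
            }
        }
        return complete;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "roads.shp", "roads.shx", "roads.dbf",
                "rivers.shp", "rivers.dbf", // missing rivers.shx
                "readme.txt");
        System.out.println(findShapefiles(files)); // prints [roads]
    }
}
```

Here “rivers” is skipped because its “.shx” file is missing, so only “roads” is reported as a complete aggregate.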

The project team is now working on additional format modules. These will be added to the public distribution as they become available.

Utility scripts are also included in the JHOVE2 installation directory to support Windows (.bat) and Unix/Linux (.sh):

  • jhove2_doc – JHOVE2 Reportable documentation utility.
  • jhove2_upfg – JHOVE2 utility to generate editable Java properties file for units of measure settings for Reportable features that have a Numeric type.
  • jhove2_dpfg – JHOVE2 utility to generate editable Java properties file for Displayer settings for Reportable features.
Please see the download page for instructions on running these scripts in your environment.

We would very much like to receive your feedback on the new code. While the current state of the code is the product of much internal review and refactoring, your evaluations and suggestions, based on a wide diversity of experience and needs, will be welcome as we continue to move forward with our work.

Please direct your comments and suggestions to the “JHOVE2-TechTalk-L” mailing list for community discussion.

Thank you,

Stephen Abrams / California Digital Library
Tom Cramer / Stanford University
Sheila Morrissey / Portico
On behalf of the JHOVE2 project team

PDF/A Seminar in Washington

A seminar on PDF/A will be held in Washington, DC, on March 26. The registration fee is $125. PDF/A is a restricted subset of PDF designed to promote long-term data viability for the purpose of preservation.

The press release contains a bizarre statement:

“At this time, the use of PDF/A is not mandatory in the United States,” said Betsy Fanning, Director, Standards and Member Services, AIIM, “however, that is changing.” “We are learning of draft legislation that is being debated that will make the use of PDF/A mandatory for preserving electronic documents.”

Congress has neither the right nor the technical competence to order us to use particular file formats. Hopefully this was an out-of-context quote about the government’s own use of PDF/A, though even there legislation requiring a specific subset of a specific format would be very strange.

So what is HTML 5 exactly?

Paul Cotton, co-chair from Microsoft on the W3C HTML Working Group, has some interesting comments on exactly what people mean by “HTML 5.” This may help explain some odd statements about “HTML video” which I’ve commented on in recent posts. The interview includes other remarks on the status of HTML 5. Cotton says:

First, I believe that most people use the term “HTML 5” to refer to the HTML 5 specification currently being worked on by the HTML WG. The HTML 5 specification defines the syntax and the semantics of the elements and attributes in the HTML markup language and several of the APIs that are used to process HTML documents. Recently the HTML WG has started to break the HTML 5 specification into more modular and separate Working Drafts e.g. HTML+RDFa, HTML Microdata, and HTML Canvas 2D Context. The HTML WG is also publishing two additional documents to aid users of HTML 5: the HTML 5 differences from HTML4 specification and HTML: The Markup Language which is aimed at developers that produce HTML 5 output.

Each of these additional Working Drafts are still part of “HTML 5” and are all on track to become separate but related W3C Recommendations or Working Group Notes. I believe that the content of these WDs taken together will define the part of “HTML 5” being worked on by the HTML WG.

But I believe that some use the term “HTML 5” to refer also to the important related API specifications being worked on by the WebApps WG. The WebApps WG is chartered to create client-side APIs that can be used with the HTML markup language – in fact some of its specifications started as part of the HTML 5 specification but were migrated over to be separate modular specifications managed by the WebApps WG. In addition there are some very interesting APIs under development by the Device APIs and Policy Working Group which are related to HTML 5 since they can be used with the HTML language and in user agents.

Others use the term “HTML 5” to also include the ECMAScript-262 Language which defines the programming language that we use today to build dynamic web applications.