Category Archives: News

JHOVE2 poll

An online poll lets you tell the JHOVE2 developers what plans you have for it. It takes just a couple of minutes to fill out and doesn’t even require JavaScript.

UDFR job openings

I’ve been informed that there are two new contract openings at the Universal Digital Format Registry (UDFR), for a project developer and a project architect. I’d be tempted myself if it didn’t mean moving to California.

The California Digital Library should have the job announcements online shortly, though they aren’t posted as I write this.

Preservation week

The American Library Association has announced Preservation Week, May 9-15, 2010. Announced events so far concentrate mostly on preservation of physical materials, but I’m hoping digital preservation has a prominent role as well.

PDF exploit

A number of web sites are talking about a vulnerability in PDF. So far I haven’t found an exact description; anyone who explained it in detail would get the blame for everyone who uses it for malicious purposes. But the idea seems simple enough that anyone with the necessary technical knowledge (including me) could work it out given a little time. Apparently it’s a means by which the user can be presented with a legitimate-looking dialog and tricked into approving the launching of arbitrary executable code. The exploit can be added to an existing PDF without changing its appearance. JavaScript isn’t required. The vulnerability is in the format specification, not in a software bug. This is the really nasty kind of vulnerability that designers have nightmares about.

Here’s an article on CNET on the issue. There seems to be substantive discussion of the root of the problem here. I’ve got to get to work now. I’ll post something more later.


 
Update: OK, it’s not as bad as it sounded. Here’s the real account, which doesn’t say exactly how to do it but gives enough clues that the rest isn’t too hard to figure out.

As you might have guessed if you know PDF, it uses the PDF Launch Action. The PDF specification actually doesn’t mandate any safety features in the Launch Action; if you implemented a PDF reader that automatically launched anything a PDF document told you to, you’d be within the spec. But Adobe Reader, exercising normal common sense, prompts the user for permission to launch. The trick is just that the text which describes the application to be launched can be modified. The user still gets a stern warning not to launch anything untrusted.

This trick will doubtless catch some people, as even simpler tricks do (just saying “don’t worry, it’s safe” in the document itself will trick a rather large number of fools). But it isn’t really anything to get hugely worried about.

iPres 2009 proceedings available

The proceedings from iPres 2009 are now available online. Of particular interest in the area of file formats is “MIXED: Repository of Durable File Format Conversion.”

Thanks to Digitization 101 for the link.

New JHOVE2 alpha release v. 0.6.0

Forwarded from Stephen Abrams:

A new alpha release of JHOVE2 is now available for download and evaluation (v. 0.6.0, 2010-03-17). Distribution packages (in zip and tar.gz form) are available on the JHOVE2 public wiki at https://confluence.ucop.edu/display/JHOVE2Info/Jhove2-0.6.0+Download.

The new JHOVE2 architecture reflected in this prototype is described in the architectural overview.

The distribution package contains two driver scripts in the JHOVE2 home directory: a DOS shell script (jhove2.bat) for Windows and a Bourne shell script (jhove2.sh) for Unix/Linux. Please see the download page for instructions on any modifications that need to be made to these scripts to run in your environment.

You can verify the installation with the command (for Unix):


      % ./jhove2.sh test.xml -o test.xml.out

This command should produce results similar to this.

The prototype supports the following features:

  • Format identification, validation, feature extraction, and message digest.
  • Appropriate recursive processing of directories, file sets, clumps, and container files (see the architectural overview for the definition of file sets and clumps).
  • High performance buffered I/O using the Java NIO package.
  • Integration with DROID for file identification.
  • Message digesting for the following algorithms: Adler-32, CRC-32, MD2, MD5, SHA-1, SHA-256, SHA-384, SHA-512.
  • Results formatted as text (name/value pairs), JSON, and XML.
  • Use of the Spring Framework v2.5.6.
  • Inversion-of-Control (IOC) container for flexible application and module configuration using dependency injection.
  • Complete modules:
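As an aside on the message digest feature listed above, here is a small Python sketch of computing several of those digests in a single pass over chunked input. It is not JHOVE2 code (JHOVE2 is written in Java), and the function name `digest_chunks` is invented for illustration. MD2 is left out because Python's standard `hashlib` does not guarantee it.

```python
import hashlib
import zlib


def digest_chunks(chunks):
    """Compute several digests over an iterable of byte chunks in one pass.

    Returns a dict mapping algorithm name to a lowercase hex string.
    """
    hashers = {
        "MD5": hashlib.md5(),
        "SHA-1": hashlib.sha1(),
        "SHA-256": hashlib.sha256(),
        "SHA-384": hashlib.sha384(),
        "SHA-512": hashlib.sha512(),
    }
    adler = 1  # starting value of the Adler-32 checksum
    crc = 0    # starting value of the CRC-32 checksum
    for chunk in chunks:
        adler = zlib.adler32(chunk, adler)
        crc = zlib.crc32(chunk, crc)
        for hasher in hashers.values():
            hasher.update(chunk)
    results = {"Adler-32": format(adler, "08x"), "CRC-32": format(crc, "08x")}
    results.update({name: h.hexdigest() for name, h in hashers.items()})
    return results
```

To digest a file, the chunks could come from repeated `read` calls on a file opened in binary mode, so that large files never have to be held in memory at once.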

Please be aware of the following limitations and caveats:

  • JHOVE2 requires a 1.6 JRE.
  • This prototype is being made available to provide an early look at the new JHOVE2 architecture and APIs. While the full processing model is demonstrated, there is limited format support at this time.
  • The aggregate-level identification module (i.e. the “aggrefier” module) has been configured by the Spring configuration files in this distribution to recognize a Shapefile formed by the files with the extensions “.shp”, “.shx”, and “.dbf”. The Shapefile module itself, however, is minimally functional.
  • There is no assessment module available for review at this time.

The project team is now working on additional format modules. These will be added to the public distribution as they become available.

Utility scripts are also included in the JHOVE2 installation directory to support Windows (.bat) and Unix/Linux (.sh):

  • jhove2_doc – JHOVE2 Reportable documentation utility.
  • jhove2_upfg – JHOVE2 utility to generate editable Java properties file for units of measure settings for Reportable features that have a Numeric type
  • jhove2_dpfg – JHOVE2 utility to generate editable Java properties file for Displayer settings for Reportable features
Please see the download page for instructions on running these scripts in your environment.

We would very much like to receive your feedback on the new code. While the current state of the code is the product of much internal review and refactoring, your evaluations and suggestions, based on a wide diversity of experience and needs, will be welcome as we continue to move forward with our work.

Please direct your comments and suggestions to the “JHOVE2-TechTalk-L” mailing list for community discussion.

Thank you,

Stephen Abrams / California Digital Library
Tom Cramer / Stanford University
Sheila Morrissey / Portico
On behalf of the JHOVE2 project team

PDF/A Seminar in Washington

A seminar on PDF/A will be held in Washington, DC, on March 26. The registration fee is $125. PDF/A is a restricted subset of PDF designed to promote long-term data viability for the purpose of preservation.

The press release contains a bizarre statement:

“At this time, the use of PDF/A is not mandatory in the United States,” said Betsy Fanning, Director, Standards and Member Services, AIIM, “however, that is changing.” “We are learning of draft legislation that is being debated that will make the use of PDF/A mandatory for preserving electronic documents.”

Congress has neither the right nor the technical competence to order us to use particular file formats. Hopefully this was an out-of-context quote about the government’s own use of PDF/A, though even there legislation requiring a specific subset of a specific format would be very strange.

So what is HTML 5 exactly?

Paul Cotton, a Microsoft representative and co-chair of the W3C HTML Working Group, has some interesting comments on exactly what people mean by “HTML 5.” This may help explain some odd statements about “HTML video” which I’ve commented on in recent posts. The interview includes other remarks on the status of HTML 5. Cotton’s comments:

First, I believe that most people use the term “HTML 5” to refer to the HTML 5 specification currently being worked on by the HTML WG. The HTML 5 specification defines the syntax and the semantics of the elements and attributes in the HTML markup language and several of the APIs that are used to process HTML documents. Recently the HTML WG has started to break the HTML 5 specification into more modular and separate Working Drafts e.g. HTML+RDFa, HTML Microdata, and HTML Canvas 2D Context. The HTML WG is also publishing two additional documents to aid users of HTML 5: the HTML 5 differences from HTML4 specification and HTML: The Markup Language which is aimed at developers that produce HTML 5 output.

Each of these additional Working Drafts are still part of “HTML 5” and are all on track to become separate but related W3C Recommendations or Working Group Notes. I believe that the content of these WDs taken together will define the part of “HTML 5” being worked on by the HTML WG.

But I believe that some use the term “HTML 5” to refer also to the important related API specifications being worked on by the WebApps WG. The WebApps WG is chartered to create client-side APIs that can be used with the HTML markup language – in fact some of its specifications started as part of the HTML 5 specification but were migrated over to be separate modular specifications managed by the WebApps WG. In addition there are some very interesting APIs under development by the Device APIs and Policy Working Group which are related to HTML 5 since they can be used with the HTML language and in user agents.

Others use the term “HTML 5” to also include the ECMAScript-262 Language which defines the programming language that we use today to build dynamic web applications.