Tag Archives: XML

The quest for a 3D printing format

The 3-D printing industry has been moving toward 3MF as a standard file format. It’s an XML-based format that claims to offer extensibility, interoperability, and freedom from the problems of other formats. The specification includes an XSD schema. I’m no judge of how suitable it is for 3D modeling, but yes, it is extensible. In fact, it’s designed with a relatively lean core model, so additional features can be added as extensions.

A recent Fortune article, “Why These Big Companies Want a New 3D File Format”, discusses 3MF from a business standpoint.

The old STL format, based on tessellation, is widely used, but it’s been criticized for generating huge files and lacking features.

OOXML: The good and the bad

An article by Markus Feilner presents a very critical view of Microsoft’s Open Office XML as it currently stands. There are three versions of OOXML — ECMA, Transitional, and Strict. All of them use the same extensions, and there’s no easy way for the casual user to tell which variant a document is. If a Word document is created on one computer in the Strict format, then edited on another machine with an older version of Word, it may be silently downgraded to Transitional, with resulting loss of metadata or other features.

On the positive side, Microsoft has released the Open XML SDK as open source on Github. This is at least a partial answer to Feilner’s complaint that “there are no free and open source solutions that fully support OOXML.”

Incidentally, I continue to hate Microsoft’s use of the deliberately confusing term “Open XML” for OOXML.

Thanks to @willpdp for tweeting the links referenced here.

JHOVE and XHTML

I’m surprised I only got a complaint about this recently. Using JHOVE to validate XHTML files is often painfully slow. In fact, using anything to validate them without caching or redirection of DTDs would be painfully slow. The DOCTYPE declaration brings in the standard XHTML DTD, and it in turn brings in lots of other DTDs. These all have URLs on w3.org. As you can imagine, this is a lot of traffic converging in one place, and the response is often very slow.

JHOVE has a remedy, but it turns out not to work in this case. In the configuration file, you can declare local copies of schemas and DTDs to be loaded by the SAX entity resolver. This looks something like this:

 <module>
   <class>edu.harvard.hul.ois.jhove.module.XmlModule</class>
  <param>schema=http://www.w3.org/TR/REC-smil/SMIL10.dtd;/Users/gmcgath/schemas/SMIL10.dtd</param>
 </module>

Unfortunately, there are some problems in JHOVE 1.9. The HTML module processes XHTML files by passing them to the XML module. In this case, the module doesn’t get the parameters that the config file declared for it. In JHOVE 1.10, I’ll fix this by having the HTML module pass its own parameters to the XML module. At present, JHOVE’s processing of XHTML files makes no use of the configuration file’s instructions to the entity resolver.

There’s another complication. The XHTML DTD invokes other DTDs, and JHOVE has to get every one of those in turn. Some of them have relative URLs to other DTDs; these break when they’re redirected to local files. Even making local copies of all the files doesn’t work, as JHOVE doesn’t handle the relative URLs correctly within the file system, and making them work would require changing some existing assumptions. The best fix for the user is to get JHOVE 1.10 when it’s ready (version 1.10B2 doesn’t have the XHTML fix yet) edit all those files so that all the URLs are absolute.

This is a big chunk of work, and I haven’t tested the approach fully. Any ideas on how this might be better handled would be appreciated.

The URI namespace problem

Tying XML schemas to URIs was the worst mistake in the history of XML. Once you publish a schema URI and people start using it, you can’t change it without major disruption.

URIs aren’t permanent. Domains can disappear or change hands. Even subdomains can vanish with organizational changes. When I was at Harvard, I offered repeated reminders that hul.harvard.edu can’t go away with the deprecation of the name “Harvard University Library/Libraries,” since it houses schemas for JHOVE and other applications. Time will tell whether it will stay.

Strictly speaking, a URI is a Uniform Resource identifier and has no obligation to correspond to a web page; W3C says a URI as a schema identifier is only a name. In practice, treating it as a URL may be the only way to locate the XSD. When a URI uses the http scheme, it’s an invitation to use it as a URL.

Even if a domain doesn’t go away, it can be burdened with schema requests beyond its hosting capacity. The Harvard Library has been trying to get people to upgrade to the current version of JHOVE, which uses an entity resolver, but its server was, the last I checked, still heavily hit by three sites that hadn’t upgraded. They don’t pay anything, so there’s no money to put into more server capacity.

The best solution available is for software to resolve schema names to local copies (e.g. with Java’s EntityResolver). This solution often doesn’t occur to people until there’s a problem, though, and by then there may be lots of copies of the old software out in the field.

For archival storage, keeping a copy of any needed schema files should be a requirement. Resources inevitably disappear from the Web, including schemas. My impression is that a lot of digital archives don’t have such a rule and blithely assume that the resources will be available on the Web forever. This is a risk which could be eliminated at virtually zero cost, and it should be, but my impression is that a lot of archives don’t do this.

It’s legitimate to stop making a URI usable as a URL, though it may be rude. W3C’s Namespaces in XML 1.0 says: “The namespace name, to serve its intended purpose, SHOULD have the characteristics of uniqueness and persistence. It is not a goal that it be directly usable for retrieval of a schema (if any exists).” (Emphasis added) That implies that any correct application really should do its own URI resolution.

One thing that isn’t legitimate, but I’ve occasionally seen, is replacing a schema with a new and incompatible version under the same URI. That can cause serious trouble for files that use the old schema. A new version of a schema needs to have a new URI.

The schema situation creates problems for hosting sites, applications, and archives. It’s vital to remember that you can’t count on the URI’s being a valid URL in the long term.

If you’ve got one of those old versions of JHOVE (1.5 and older, I think), please upgrade. The new versions are a lot less buggy anyway.

IIIF Image API draft

The International Image Interoperability Framework (IIIF) has put a draft API for the delivery of images via a standard http request. It supports information requests as JSON or XML as well as image requests.

One of my first reactions is that it sticks to the letter of RESTful interfaces while doing things that would be more sensibly be expressed by URL parameters. The following are offered as example URLs:

That’s harder to understand than something like x=80&y=15&w=60&h=75.

A service must specify the level of compliance it provides, which may be different for different images; for instance, JPEG2000 images might be scalable but GIF images not.

If widely adopted, this API could simplify access to images spread across multiple repositories. I’ll be looking at it more carefully as soon as I find the time.

Undocumented “open” formats

Recently I learned that I can’t upgrade to a current version of Finale Allegro, a music entry program, except by getting the very expensive full version or taking a step downward to PrintMusic. Since I don’t want to lose all my files when some “upgrade” makes Allegro stop working, I’ve been looking for alternatives. MuseScore has its attractions; it’s open source, powerful, and generally well regarded. But I ran across this discussion on the MuseScore forum, which has me just a bit worried. According to “Thomas,” whose user ID is 1 and so probably speaks with authority, “As the MuseScore format is still being shaped on a daily basis, we haven’t put any effort yet to create a schema.”

This doesn’t encourage me to use MuseScore. Even though it’s an “open” application, its format isn’t open in any meaningful sense. You can download the code and reverse-engineer it, of course, but it’s going to change in the next version. While I’m sure the developers will try not to break files created with earlier versions, there’s no guarantee they’ll succeed, and they’re likely to be especially careless about compatibility with files that are more than a few versions old.

You can export files to MusicXML, which is standardized, but in trying this out I came upon a disturbing bug. If I edit the file and save the changes, they’re saved not to the .xml file but to a .mcsz file, MuseScore’s native format. If there’s already an older file with that name, it gets overwritten without warning.

The dichotomy between “open” and “proprietary” formats is the wrong one. There are many formats which are trademarked by a business and their documentation copyrighted, but if the documentation is public and the format not encumbered by patents, anyone can use it. Formats which are created by open-source code but are undocumented and subject to change might are effectively closed formats.

This post grew, in part, from my thoughts on avoiding data loss due to format obsolescence, which is this topic of this week’s post on Files That Last.

W3C link roundup

DOM4 draft updated.
First draft of CSS device adaptation.
Ink Markup Language (InkML) recommendation.
Widget Packaging and XML Configuration recommendation.
XSL-FO 2.0 updated.
Namespaces Module and Selectors Level 3; First Draft of Selectors Level 4.
CSS Fonts Module Level 3 Draft.