Potemkin scheduling

Let me begin my preparations for Day of DH 2010 by coming clean. One point of the exercise of taking an arbitrary day and asking everyone in the project to talk about how they spend that particular day is to get something like a representative sample of the activities digital humanists spend their time on. In the nature of things, some of those activities will be interesting and others will be, well, boring. That’s life; get over it, OK?

I’m going to try to play by the rules, and blog my actual day. I will not, for example, blog about the weekly XSLT working group call, although I had rather looked forward to doing so, because (to my chagrin) there is no WG call this week. (We had a face-to-face meeting last week, and experience teaches that it’s often helpful to cancel WG calls the week after a ftf.) But I have not been able to resist the temptation to game the system at least a little, in two ways that I’m conscious of (and in how many that elude my introspection?).

First, I have learned from reading some of last year’s blog posts that praeteritio is by no means a dead art. Melissa Terras began her day by thinking about all the places she could be that day, but would not in fact be. So about ten o’clock on Thursday I plan to blog about the fact that we are not having a WG call this week, and spend some time meditating on the work we are doing in the WG and what it might mean for digital humanities.

And second, the nature of my schedule this week is that while various things need to be done by the end of the day on Friday, very little of what needs to be done is tied to a particular day of the week. Much of my time this week will have been spent, and much of Friday will be spent, on bookkeeping and the preparation of materials for a meeting with an accountant. (This is the first tax return filed since I organized my consultancy, and professional advice is essential.) But I thought it would send the wrong message if my blog entries for the Day of Digital Humanities 2010 read:

  • 6:00 a.m. Starting to go through last year’s records of credit-card transactions to make sure they are all categorized correctly.
  • 8:30 a.m. Done with credit cards. Pause to feed the dogs.
  • 9:30 a.m. Now working on the checking-account statements.

So I have saved up some interesting tasks for Thursday, in order to ensure that I get to talk about them here. I’ll do some work connected with X, because I really want to talk about X, and of course Y is interesting so I’ll spend some time on Y. And so on. This has cost me the occasional twinge of guilty conscience over the duplicity involved, and the twinges have eventuated in this confession. If the reader were to get the impression that all my days are packed with such interesting projects, then (a) the careful efforts I have been making to give that impression will have worked perfectly, and (b) the impression given will not be quite accurate. Caveat lector.

Hmm. Perhaps I should spend a couple of hours on my bookkeeping, too, just to introduce a little tedium. It might add verisimilitude.

Plan for the day

6:41 Begin the day, a little later than expected. (The dogs hate daylight saving time and resist being told to wake up and go back outdoors when any fool can see it’s still dark and cold out there. And taking the puppy along means the trip out to the road for the newspaper takes twenty minutes, not ten.)

Glance at the new posts on the Day of Digital Humanities aggregation page and decide I can’t possibly afford to look at that page again today: too much to read, and far too intimidating. It will be a good idea to read through them all later, though: by intimidating me and by stimulating my envy, they make it easier to identify some things that worry me and some that I think it would be fun to do.

6:58 Realize I’ve now spent almost as much time blogging about my day as living it. Going to have to work on that. Time to get down to work on planning. I won’t bore the reader with the details; suffice it to say that I group tasks both by broad categories of activity (tasks directly related to clients; marketing work, which in my case is mostly network presence and visibility [freelancers don't get clients if they are invisible]; technical work, including work related to W3C working groups; and administrative overhead) and by importance and urgency (not the same thing, but using three dimensions to classify tasks quickly gets too complicated).
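
(For the reader who wants mechanics rather than details, here is a minimal sketch of the two-dimensional scheme in Python; the task names and category labels are invented for the occasion.)

    from dataclasses import dataclass

    @dataclass
    class Task:
        description: str
        category: str     # client / marketing / technical / administrative
        important: bool   # matters in the long run
        urgent: bool      # has a near-term deadline

    def triage(tasks):
        # Important-and-urgent first; False sorts before True, hence the "not"s.
        return sorted(tasks, key=lambda t: (not t.important, not t.urgent, t.category))

    tasks = [
        Task("prepare materials for the accountant", "administrative", True, True),
        Task("re-read XPath 1.0 formalization notes", "technical", True, False),
        Task("tidy the web site", "marketing", False, False),
    ]
    for t in triage(tasks):
        print(f"{t.category:15} {t.description}")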

Lots of bases to touch; I was away all last week and things have accumulated, as they do.

Notes for further reading

7:19 to 7:55 Begin working through the notes I made last week in Prague about topics to follow up on. Jan Hajič recommended some literature on probabilistic parsing for CFGs; look up Joakim Nivre and Michael Collins on the Web and print out several papers to read. Apart from the intrinsic interest of the topic, in the back of my mind is the hypothesis that techniques for probabilistic parsing of natural languages might lead to better error tolerance, simpler error recovery, and better error diagnosis if applied to formal languages such as XML vocabularies. It turns out that Nivre has been there before me.
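
The intuition is easy to sketch, even if the real systems are far more sophisticated. Here is a toy probabilistic CFG in Python (grammar and probabilities invented purely for illustration), showing how rule probabilities let a parser choose among competing analyses; the speculative part is the idea that low-probability ‘repair’ rules might do the same for damaged markup.

    # A toy probabilistic CFG. The probability of a parse tree is the
    # product of the probabilities of the rules it uses; a parser keeps
    # the likeliest tree. For markup, one can imagine low-probability
    # "repair" rules (insert a missing end-tag, say) so that damaged
    # input still gets a best-guess analysis instead of a fatal error.

    RULES = {            # (lhs, rhs) -> probability
        ("S",  ("NP", "VP")):      0.7,
        ("S",  ("NP",)):           0.3,
        ("NP", ("time",)):         0.5,
        ("NP", ("time", "flies")): 0.5,
        ("VP", ("flies",)):        1.0,
    }

    def tree_prob(tree):
        """A tree is (label, children); a leaf is a plain string."""
        if isinstance(tree, str):
            return 1.0
        label, children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = RULES[(label, rhs)]
        for c in children:
            p *= tree_prob(c)
        return p

    # Two competing analyses of "time flies":
    parse_a = ("S", [("NP", ["time"]), ("VP", ["flies"])])   # time | flies
    parse_b = ("S", [("NP", ["time", "flies"])])             # one noun phrase

    print(tree_prob(parse_a), tree_prob(parse_b))  # 0.35 vs 0.15: parse_a wins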

The XPath 1.0 data model, again

10:00 Usually at 10 on Thursdays I’m on the weekly conference call for the W3C working group responsible for XSLT. No call this week, since we had a face-to-face meeting last week, but I’ll use the time to try to get some work done on some of the WG tasks I’ve undertaken, the most important ones all relating to formalization problems.

Since the publication of XSLT 2.0 in January 2007, the working group has been spending its time on 2.1, the major changes of which will be related to the ‘streamability’ of XSLT processing. We would like to make it easier to use XSLT in situations where it’s not feasible to materialize the entire input document in memory before processing it; we are introducing some new constructs which make it easier for the processor to see that particular templates can be processed in a streaming way, and some which make it easier for the stylesheet author to signal that a particular input document can be streamed. We also specify an elaborate set of rules for determining whether a stylesheet (or a mode in the stylesheet) is guaranteed streamable: streaming processors are allowed to stream anything they like, if they can figure out how, including constructs not accepted by the prescribed analysis. But if a stylesheet mode is guaranteed streamable according to the analysis described in the spec, then any streaming processor is expected to stream it.
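
I can’t usefully quote the draft constructs here, but the underlying contrast between materializing a document and streaming through it is easy to show in miniature. A sketch in Python (element names invented; this illustrates streaming in general, not the XSLT 2.1 syntax):

    import io
    import xml.etree.ElementTree as ET

    # Ten thousand records, a stand-in for an input too big to materialize.
    DOC = "<log>" + "".join(f"<entry n='{i}'/>" for i in range(10000)) + "</log>"

    # Tree-building: the whole document sits in memory before we touch it.
    root = ET.fromstring(DOC)
    print("in-memory count:", len(root))

    # Streaming: handle each element as its end-tag arrives, then discard
    # it; memory use stays roughly constant however large the input grows.
    count = 0
    for event, elem in ET.iterparse(io.StringIO(DOC), events=("end",)):
        if elem.tag == "entry":
            count += 1
            elem.clear()   # drop the element's content once it's handled
    print("streamed count:", count)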

It’s easy to make mistakes in such an analysis, and it would be very helpful if we could use formal methods in some way to provide greater confidence in the correctness of the analysis. I’ve undertaken to see whether the formal tools I work with, Alloy and ACL2, could perhaps be used to help check the analysis for gaps and errors. Ideally, we’d like to have a proof that the analysis is sound; failing that, we’d like to have whatever the tools can provide by way of illumination of the problem area.
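
For readers who have not met Alloy: it searches exhaustively for counterexamples among all instances of a model up to a given size. The spirit of the exercise, transposed into a toy Python search (both predicates below are inventions of mine, for illustration only; the real analysis is vastly more involved):

    from itertools import product

    OPS = ("copy", "count", "sort")   # toy "constructs" a template might use

    def analysis_accepts(template):
        # Toy stand-in for the guaranteed-streamability analysis.
        return "sort" not in template

    def actually_streamable(template):
        # Toy ground truth: sorting needs the whole input; the rest does not.
        return "sort" not in template

    def find_counterexample(max_size=3):
        # Alloy in miniature: exhaust every instance up to a bound, looking
        # for a case the analysis accepts but which is not in fact streamable.
        for size in range(1, max_size + 1):
            for template in product(OPS, repeat=size):
                if analysis_accepts(template) and not actually_streamable(template):
                    return template
        return None

    print(find_counterexample() or "no counterexample up to size 3")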

In general, specs can be formalized at varying levels of abstraction, and one of the great advantages of formal methods over (say) just implementing the spec in a programming language is that by working at a higher level of abstraction it’s possible to get results with far less effort than a conventional implementation requires, and to get results which apply to all implementations, not just the one you happened to write. In particular, it would be very useful if we could shed light on the completeness and correctness of the streamability analysis without having to formalize all of XSLT. Several of the lines of attack which offer themselves for the soundness proof, however, take the form of arguing that every transformation described by a stylesheet which passes the analysis can be translated into a form which is manifestly streamable. (E.g. they can be compiled into code for a virtual machine which operates in constant memory.) It’s not enough, though, to say that the stylesheets can be translated into a manifestly streamable form: you have to argue that the manifestly streamable form produces the same results as the original. Which seems to entail the construction of arguments about the semantics of XSLT transformations. Which in turn entails the construction of formal characterizations of XSLT stylesheets, XML documents, and the transformation process.
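
The shape of that argument can be shown in miniature. A sketch (the ‘stylesheet’ here, renaming a elements to b, is my own absurdly simple invention; the point is only the form of the equivalence claim, checked by random testing rather than proof):

    import random
    import re

    def reference(doc):
        # The original, in-memory form: read everything, then rewrite.
        return doc.replace("<a>", "<b>").replace("</a>", "</b>")

    def streamed(doc):
        # The "manifestly streamable" form: one token at a time, constant state.
        out = []
        for token in re.findall(r"<[^>]+>|[^<]+", doc):
            out.append({"<a>": "<b>", "</a>": "</b>"}.get(token, token))
        return "".join(out)

    random.seed(0)
    for _ in range(1000):
        doc = "".join(random.choice(["<a>", "</a>", "<c>", "</c>", "text"])
                      for _ in range(20))
        assert reference(doc) == streamed(doc)
    print("streamed form agrees with the reference on 1000 random inputs")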

In January I spent time formalizing the XPath data model in Alloy, starting with XPath 1.0, because it’s simpler than XPath 2.0 in several ways. But I ran into a couple of unexpected problems. In particular, the notions of node identity and document order (which are crucial to XPath semantics and thus to XSLT) are underspecified in the XPath 1.0 spec, so a straightforward translation of the spec prose into Alloy notation does not produce a satisfactory result. I’m spending time this morning re-reading the current state of my formalization of the XPath 1.0 data model and seeking simple ways to fix the problems, preferably by layering additional constraints over what the spec now says, but possibly by reformulating the model from the ground up.
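
To make the gap concrete: any satisfactory model must say outright, for instance, that two nodes can be distinct while identical in content, and that document order is a strict total order over the nodes of a document. A toy illustration of the needed constraints, in Python rather than Alloy (in the Alloy model these become facts over the signatures):

    class Node:
        def __init__(self, name, children=()):
            self.name = name
            self.children = list(children)

    def doc_order(root):
        # Document order as preorder traversal: each node before its
        # children, siblings in order. The XPath 1.0 prose assumes this
        # but never quite pins it down as a strict total order.
        order = [root]
        for child in root.children:
            order.extend(doc_order(child))
        return order

    # Two <p> elements, indistinguishable by content...
    p1, p2 = Node("p"), Node("p")
    root = Node("doc", [p1, p2])
    order = doc_order(root)

    assert p1 is not p2                       # ...yet distinct as nodes
    assert order.index(p1) < order.index(p2)  # ...and comparable in order
    assert len(order) == len(set(map(id, order)))  # each node occurs once
    print("identity and total document order hold in this toy model")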

Help desk service

If, when I stopped working in academic computing centers, I thought I was getting out of help desk duty, I was wrong.

1:15 A board member from a local non-profit calls, nervous because they are having trouble reaching the board web site I’ve helped set up for them as a small pro bono project. Ten minutes talking them through the problem, which proves in the end to be one, two, or possibly three typos in the board member’s transcription of the URI.

(You mean there’s a difference between “http” and “https”? And Web addresses don’t always have to begin with “www.”? Actually in this case, the extra “www.” in the address should not have mattered: the server is set up to accept both forms of the address.)

I never did figure out what the other typo was; the board member kept erasing everything and starting again, and on the third try (not counting the attempts they made before calling) they typed the URI correctly.

For years, I heard people saying that computer applications in the humanities were too hard to use: you could not expect their target users to learn to use their interfaces. What, some of my friends and colleagues asked, was “the average humanist” to do when they discovered that their work required them to use a computer? — we must make it easier for them to cope! I agree that simple clean interfaces are better than impenetrable crufty ones (I also like apple pie and motherhood), but for the most part the plight of “the average humanist” always left me unmoved. (Does the average humanist even exist? How many people do you know who don’t respond with a frosty look when you call them average?)

What does the average humanist do when they discover that the work they are doing requires a good reading knowledge of Sanskrit? Well, where I come from they either settle down and learn what they need to do the work, or else they discover a compelling interest in some other topic. They don’t sit around complaining that Sanskrit is too hard to learn and that by golly if it’s that hard then the classics department must not be doing its job properly.

This board member is a salutary reminder to me that not everyone gets a choice between learning to use new technology and leaving it alone. Also a reminder that if one wants the things one builds to be readily usable by a really wide audience, even existing widespread technologies like the URI may already be too complicated. I plan to remember this board member, the next time I’m wondering just how complicated to make a user interface.

EADs and the Web

2:30 Phone call with a friend, who like me is an XML consultant, to talk about the Encoded Archival Description and XSLT stylesheets to handle it.

She is talking with a potential client about a situation that has a couple of interesting aspects.

The archive in question uses the Encoded Archival Description to describe their archival collections — or rather, they export EAD documents from their archive management software for publication on the Web. They are shifting to a new management system, however, and the shape and content of their EAD documents are changing dramatically at the same time. The old software exported into the version of EAD current when it was installed, and the new software exports into the now-current version.

The archive has a system of XSLT stylesheets in place to display the finding aids on the Web, but when they are applied to newly exported EAD documents, they display … nothing at all. “Namespace problem,” I muttered to myself, as my friend paused briefly before explaining that her immediate reaction had been to diagnose a missing namespace declaration. She then went on to say that the diagnosis was correct, but incomplete: even with the namespace problem fixed, the old stylesheets aren’t going to do the job, because the new EAD documents describe the collections at a much finer granularity than the old ones. The current stylesheets were written to handle the exports from the old software, and for understandable reasons they do not include templates for information never present in the documents created by the old archive software.
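
The failure mode deserves a moment, because it is so common. A reduced sketch using Python and lxml (the EAD fragment and stylesheets are pared to the bone; EAD 2002 puts its elements in the urn:isbn:1-931666-22-9 namespace):

    from lxml import etree

    DOC = etree.XML("<ead xmlns='urn:isbn:1-931666-22-9'><archdesc/></ead>")

    # Old stylesheet: matches un-namespaced elements, so no template ever
    # fires and the built-in rules copy out ... nothing.
    old = etree.XSLT(etree.XML("""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/ead/archdesc">
        <p>Found the archival description.</p>
      </xsl:template>
    </xsl:stylesheet>"""))
    print("without namespace:", repr(str(old(DOC))))

    # The fix: bind a prefix to the EAD 2002 namespace and match with it.
    fixed = etree.XSLT(etree.XML("""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:e="urn:isbn:1-931666-22-9">
      <xsl:template match="/e:ead/e:archdesc">
        <p>Found the archival description.</p>
      </xsl:template>
    </xsl:stylesheet>"""))
    print("with namespace:   ", repr(str(fixed(DOC))))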

As far as my friend and I can tell, the archive regards EAD as a kind of specialized write-only database dump format. The master description of their collection is not the one in the EAD but the one in the database; the EAD document is just a funny way that databases have of reporting on their contents. This fact is unexpected enough to be interesting; in my own work I’m almost always working with XML which is itself the master copy of the information it records. The XML-as-db-dump / DB-as-master-copy idea makes me slightly uneasy (over the decades the database industry has learned many subtle ways of locking users in), but it’s probably just my text-format biases showing.

If the client proves to be in a hurry, and my friend’s shop is already busy at the time, I may get to help with the new XSLT; that would be fun.

Balisage planning

3:00 Phone call with Tommie Usdin, who serves as chair of Balisage, an annual conference I help organize each August in Montréal for those interested in the theory and practice of markup (XML and other), to talk about recent developments and plans.

There may be a sponsor for an XML application contest modeled very loosely after the contests some governments and foundations have sponsored seeking ideas for interesting things one can do with public data. A lot of those data sources are in XML; it’s crazy, some people have argued, that some contestants spend so much effort shredding the XML into other formats instead of just using it natively. We (they continued) should have a contest for doing something interesting with government (or other) data in XML. It would be fun to organize this and present the prize to the winner at Balisage this year. It will be a challenge to get everything organized in time, but I think we’re going to give it a try.

Promoting XML-aware Web applications

4:30 Phone call with Daniela Florescu of Oracle, who serves as president of the FLWOR Foundation, which is dedicated to the development and promotion of XML-aware open-source software for Web applications, especially XQuery implementations. The FLWOR people were out in force at XML Prague last week, with demos of Zorba, their open-source XQuery implementation in C++ (for which they also have an online demo interface called Try Zorba), and browser plug-ins based on it (XQuery in the browser — “Just like JavaScript, only less code!”).

The easier it becomes to build cool web apps with XQuery and XSLT and other open standards, the easier it will be for those responsible for curating cultural heritage data to make the data accessible in useful ways (that’s the cool apps part) while also preserving the data usefully for the future (that’s the open standards part). So all of us who care about preservation of and access to culturally important materials should be wishing the FLWOR Foundation godspeed and success in their mission.

Re-hosting the Model Editions Partnership

8:00 p.m. Have spent some time today working through problem reports related to a web site I’m setting up for the material created by the Model Editions Partnership. Making progress, but am not finished.

What (you may be asking) is MEP? MEP was funded (from 1995 to 2004) by the (U.S.) National Historical Publications and Records Commission as a project to explore some of the issues involved with electronic publication of historical documentary editions. NHPRC has been funding projects to publish the papers of various prominent Americans for over fifty years now, originally mostly political and military figures (i.e. dead white males, or DWMs) and more recently (reflecting changes in historiography) a broader range of figures (not all W, not all M, but still mostly D). But historical documentary editions take a long time, and NHPRC has limited funding, so the overall balance changes rather slowly. (So the DWM predominance will continue for a while.)

In the MEP project, a number of partner projects contributed samples from their editions, which were transformed into a MEP customization of the TEI vocabulary mostly by a team at the University of South Carolina under the leadership of David Chesnutt. The resulting collection of ‘mini-editions’ was delivered on the web using DynaWeb, which provided web access to DynaText, an SGML-based hypertext system. At least, they were delivered on the web until sometime in 2009, when something in the server broke and it became clear that no one was in a position to fix it. (Electronic Book Technologies, which made both DynaText and DynaWeb, was acquired by Inso almost twenty years ago, and Inso itself disappeared a while back. And DynaWeb ran only under the Netscape Enterprise Server. Remember Netscape?) As they used to say (in slightly different terms) when I started with computing, when closed-source software breaks, you get to keep both the pieces.

Toward the end of the project, David Chesnutt and I also built a version of the site using Cocoon, an XML-oriented system for Web site creation. A handful of CDs containing this Cocoon-based site were distributed under the label “MEP Sampler”.

The work done in the MEP project is worth preserving, I think (and I don’t think so because I was involved in the project, at least not only for that reason), so I took some time earlier this year, installed the Cocoon-based version of the site on a Web server, and invited the editors of the partner editions to inspect it. We are still in the process of shaking down the new site.

The project is, I think, interesting for a number of reasons. First, the material itself is interesting: letters and papers ranging from military directives in a crucial campaign of the American Revolution to discussions of slavery during the debates over the ratification of the U.S. Constitution to Margaret Sanger’s dissident publication The Woman Rebel. Second, MEP marks a certain historical moment in the development of electronic editions in the U.S. in the twentieth century. It can’t hurt to document the work done then, which illustrates (if nothing else) some of the blind spots of some specific people at that specific time, as well as (I hope) some insight. Third, precisely because one of the goals is to record how things looked (to some people) in the late 1990s and early 2000s, the plan is to preserve the user interface of 2000/2004, warts and all, as an alternative to the main interface which will develop and mature over time. Fourth, because the original goal of the project was to develop a better understanding of the issues and opportunities of digital editions, I hope to involve digital humanists interested in user interface design in using the MEP samples as a kind of test bed. Here is a given body of materials, with given properties; what kind of user interface would you build for it, for the general public? For professional historians? For high-school students? For students of markup and markup theory?

The catch is that the site is not generally available yet, so I can’t point to it. In order to be sure we are clean as regards intellectual property rights, David and I have decided we really should ask the rights holders for fresh permissions. That takes time. Current intellectual property rights law has gotten in the way of many useful things and imposed many burdens upon society. The burden of delaying access to a revived MEP web site is not the heaviest among those burdens, but it is one of them. (I say, grant perpetual copyright to Mickey Mouse, to get Disney off Congress’s back, and then go back to twenty-eight plus twenty-eight.)

Oh, well. I hope to invite all those interested in digital humanities to visit the new MEP site, as soon as circumstances permit.

Things said and unsaid, work done and undone

10:00 p.m.

If we say that we are without sin, we deceive ourselves [check] and the truth is not in us [check].

We have done those things which we ought not to have done [well, that's a little harsh], and we have left undone those things which we ought to have done [check].

Among the things I ought to have done today (and thus ought to have blogged about) was preparing for a teleconference tomorrow with my collaborators Claus Huitfeldt and Yves Marcoux. Yves has done a lot of work recently on Alloy formalizations of the Goddag (generalized ordered-descendant directed acyclic graph) data structure, and I promised him faithfully that I would read it all on the plane to Prague (no, I re-read Jana Dvořáková’s work on streaming XSLT processing instead, in preparation for the XSLT WG meeting), or during idle time while in Prague (ha! where do I get these ideas that I’m going to have idle time during a ftf meeting and a conference? What kind of Lebenslüge (life-lie) is that?), or on the way home from Prague (sorry, a friend gave me a French translation of Stefan Zweig’s Schachnovelle and I could not resist). Yves, I admit it! I confess publicly! If we never get this paper finished, it will be all my fault.
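
For readers who have not met the Goddag: it relaxes the XML tree by allowing a node more than one parent, so that overlapping structures (verse lines and sentences, say) can share their children. A toy sketch of the basic shape, in Python (this is not Yves’s Alloy formalization, just the core idea, with invented sample text):

    class GNode:
        def __init__(self, label, children=()):
            self.label = label
            self.children = list(children)   # ordered, as the name requires

        def leaves(self):
            if not self.children:
                return [self.label]
            out = []
            for c in self.children:
                out.extend(c.leaves())
            return out

    # Overlapping verse-line and sentence markup over a shared word stream:
    words = [GNode(w) for w in "scorn not the sonnet critic".split()]
    line     = GNode("line", words[:4])   # the verse line ends at "sonnet"
    sentence = GNode("s",    words[1:])   # the sentence starts at "not"

    # The middle words have two parents -- forbidden in a tree, and the
    # whole point of a DAG: each view reads its children in order.
    print(line.leaves())       # ['scorn', 'not', 'the', 'sonnet']
    print(sentence.leaves())   # ['not', 'the', 'sonnet', 'critic']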

Oh, well. No one needs me to tell them that the study of overlapping structures and ways of dealing with them belongs to digital humanities. So that’s it, then, I’m knocking off for the night. I’m looking forward to reading the other reports from the Day of Digital Humanities, but it may have to wait: I have to spend all day tomorrow on my taxes.