Hello, Day of DH!

Welcome to my Day of DH 2010 blog. With this first introductory post, I invite you to explore the Who is… page to get to know me before reading on.

My Dog Joe

The project I’m working on at McMaster is the Voyeur Tools software, specifically its analytical module, Trombone. Voyeur Tools, pioneered by project leader Stéfan Sinclair, is about revealing your texts.

Voyeur Tools counts words, basically. It can run several analyses on your words, including frequencies, concordances and collocations. A picture being worth a thousand words, Voyeur Tools can render these analyses as visually attractive, easily understandable colour graphics. For a nice contextualized example, see the visualization of the early Day of DH 2010 blog posts, put online by Stéfan Sinclair.
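To make these three analyses concrete, here is a minimal Python sketch of word frequencies, a keyword-in-context concordance and window-based collocates. It is not Voyeur Tools or Trombone code; the function names, the simple regular-expression tokenizer and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequencies(tokens):
    """Count how many times each token occurs."""
    return Counter(tokens)

def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def collocates(tokens, keyword, window=2):
    """Count the tokens that appear within `window` positions of keyword."""
    counts = Counter()
    for left, _, right in concordance(tokens, keyword, window):
        counts.update(left + right)
    return counts

# Hypothetical sample text, for illustration only.
text = "The dog chased the cat. The cat ignored the dog."
tokens = tokenize(text)

print(frequencies(tokens).most_common(3))
for left, word, right in concordance(tokens, "cat"):
    print(" ".join(left), f"[{word}]", " ".join(right))
print(collocates(tokens, "cat").most_common(3))
```

A real tool would add stop-word handling, smarter tokenization and statistical scoring of collocates; the sketch only shows the basic counting behind each analysis.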

Many sources of textual data are supported, from files on your hard drive to files accessible through a URL: plain text files, PDF files, MS Word files, OpenOffice files, and archives containing other files. It’s really, really easy to feed data to Voyeur Tools.

Textual analyses have been operational for months. Given how easy it is to feed data to Voyeur Tools for analysis, the next step is to handle large amounts of data. This software engineering problem or design aspect is typically referred to as scalability.

The objective of my postdoctoral fellowship here at McMaster (primarily funded by SHARCNET) can be summarized as bringing scalability to Voyeur Tools. The current objective is to get Trombone to run on the Hadoop cloud middleware, so that it can process large-scale textual data, initially on the order of several hundred megabytes. Getting there requires some heavy lifting, namely rearchitecting the Trombone data model from a file-based model to a cloud-based model. This requires changes all over the code base.
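For readers unfamiliar with Hadoop, the following plain-Python sketch imitates the map, shuffle and reduce phases of the MapReduce programming model on a word count. It does not use the Hadoop API and says nothing about how Trombone is actually being redesigned; the function names and the two-document corpus are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(documents):
    """Run the three phases in sequence over a collection of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle(mapped))

# Hypothetical two-document corpus, for illustration only.
corpus = ["to be or not to be", "to read or not to read"]
totals = word_count(corpus)
print(totals)
```

The point of the model is that each document can be mapped independently, so the map step can be spread across many machines; only the grouped counts need to be brought together in the reduce step.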

Designing and redesigning Trombone calls for hard thinking. Stéfan Sinclair and I usually meet weekly to brainstorm about Trombone design. To sustain our creativity, we typically work from My Dog Joe. It’s a great place, relaxing and stimulating at the same time. It somewhat reminds me of Parisian cafés. We spent the second half of this sunny morning brainstorming at My Dog Joe. Today’s discussions centered on programmatic interfaces that make tokens easily taggable with different types, as well as on the interface between Trombone and the code that calls it. Read on here for further insights on this morning’s meeting.
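As a rough illustration of what "tokens taggable with different types" could look like, here is a small Python sketch. It is purely hypothetical and does not reflect Trombone's actual interfaces; the Token class, its methods, the tag types ("pos", "entity") and the sample sentence are all invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    """A word occurrence that can carry one value per tag type (hypothetical design)."""
    text: str
    position: int
    tags: dict = field(default_factory=dict)

    def tag(self, tag_type, value):
        """Attach a value for the given tag type and return self so calls can be chained."""
        self.tags[tag_type] = value
        return self

    def has_tag(self, tag_type, value=None):
        """Check for a tag type, optionally requiring a specific value."""
        if tag_type not in self.tags:
            return False
        return value is None or self.tags[tag_type] == value

def tokens_with(tokens, tag_type, value=None):
    """Select the tokens carrying the given tag type (and value, if one is given)."""
    return [t for t in tokens if t.has_tag(tag_type, value)]

# Hypothetical sample sentence and tags, for illustration only.
words = "Joe likes coffee".split()
tokens = [Token(w, i) for i, w in enumerate(words)]
tokens[0].tag("pos", "noun").tag("entity", "person")
tokens[1].tag("pos", "verb")
tokens[2].tag("pos", "noun")

nouns = tokens_with(tokens, "pos", "noun")
print([t.text for t in nouns])
```

Keeping tags in a per-token dictionary keyed by type means new tag types can be added without changing the Token class, which is one way to read the flexibility goal discussed at the meeting.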

Where science gets done

Sun is here. Spring is almost here.

No Apples falling from the trees at this time of year, only some sitting on desks as usual.

The pictures feature McMaster’s TSH (Togo Salmon Hall) as well as my TSH office. Besides meetings at My Dog Joe (see previous blog post), this is where science (eventually) gets done.

The tools I use on a daily basis for the software engineering of the Trombone module of Voyeur Tools (see previous blog post) are all Free software or Open Source software. They include the Firefox browser, the Eclipse integrated development environment, the Subversion revision control system, the Trac issue tracking software, the OpenOffice productivity suite, the Sunbird calendar, the Adium instant messenger and the Ubuntu Linux operating system. The OS X operating system, though proprietary software (alas, but it’s sooooo eye-pleasing, isn’t it), is actually built on Open Source software, too: among others, the FreeBSD operating system.

Not only is all of this software free (as in: it costs nothing to acquire, install, configure or use), it is also free (as in: freely available and modifiable), which means that I can tailor it to my specific needs if required. This is not just a theoretical possibility; it has already happened. I modified the JMeld visual diffing tool to suit my specific use case and subsequently offered the modified code (called a patch in technical jargon) to the project. I also added a specific feature to the Apache Commons Math toolkit (called a library in technical jargon) to ease its integration into the CanoPeer middleware that I introduced as part of my doctoral dissertation.

Of course, patching Free software and Open Source software requires coding skills. But there’s a good chance that someone among your colleagues, whether down the hall or over the web, could tailor software to your needs. Free and Open Source software is standing on the shoulders of giants, in action.


Digital Rhetoric and Communication

Thursday, for me this quarter, is teaching day. This afternoon and early evening were filled by teaching duties. This quarter, I am grateful to have the opportunity to teach the Digital Rhetoric and Communication course on behalf of Stéfan Sinclair. The tag line of the course is “how to be effective communicators in a digital society and in an information age”. A tutorial and student advising filled the first half of the afternoon, while the lecture filled the remainder. The early evening went to administrative work for the course and to preparing next week’s lecture (to be continued, though).

Semantic analysis, the egg and the hen

I’ll wrap up my Day of DH 2010 blog by sharing an open-ended thought on the interplay between:

  • exploratory analysis of a corpus of textual data
  • definition of semantic couples (synonyms, antonyms, …)
  • semantic analysis of the said corpus of textual data

Which one comes first? Is there a fixed sequence for these three steps? In theory, in a perfect world, there might be. But in practice, there may be loops in such a sequence. New insights at the end of one step may require an earlier step to be redone, at least partially.

Expanding from this example, my intuition is that flexibility should be paramount in DH tools. A very flexible, large-scale computational and data storage substrate… scripting interface to rapidly (i.e. without requiring advanced coding skills) configure tools for a particular analysis of textual data… Dream precedes action, that is for certain.