Bridging HPC and the Humanities

Teams

Improved Search for the Orlando Project and the Canadian Writing Research Collaboratory (CWRC)

The Orlando Project (www.ualberta.ca/~orlando) is a major digital humanities project that has produced an extensive semantically-encoded textbase of born-digital scholarship about women's writing in Britain. It has published a major scholarly resource online (http://orlando.cambridge.org/) and continues to be developed in terms of content, interface, and functionality. It is a significant experiment in the use of computers to support humanities scholarship. These materials are critically acclaimed by scholars in literary studies, both as a model for what second-generation digital resources should be and as a paradigm-shifting tool for the field ("It changes how you think", according to a recent review), and they are serving as a testbed for developing experimental tools and interfaces.

The Orlando Project's internal production processes, search and delivery system, and semantic markup will serve as the basis for the newly-funded Canadian Writing Research Collaboratory (CWRC; http://cwrc.cs.ualberta.ca/index.php/General:CWRC), which will provide a web-based platform for literary scholars of and in Canada. The CWRC scholarly platform will involve scholars from across Canada and around the world working on a wide range of digital materials: some richly structured like Orlando and others messy and unstructured; some audio and video with varying degrees of metadata and transcription; some remotely housed in extensive collections such as Peel's Prairie Provinces (University of Alberta library) or Early Canadiana Online (canadiana.org). We cannot yet anticipate the size of CWRC, but it is likely to be many times the size of Orlando.

The research challenge and its significance
Orlando Project material is composed of XML (SGML) encoded digital text files utilising both structural elements and a large number of custom "semantic" tags. These "content" tags associated with the knowledge domain of literary history provide a richly structured body of scholarly text.
The extent of Orlando materials (as of 2010-02-05):
elements: 2.4 million and growing
terms: 8.8 million and growing
raw data: 250 MB; with indices, 6 GB (size figures as of July)

At present, search and retrieval are based on a standard relational database, indexed to allow for retrieval of material based on the parent/child structures embedded in the hierarchical tagging. The current system is inefficient in handling complex XML-based queries, resulting in measurable delays in the return of results.
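To illustrate why parent/child indexing in a relational database becomes costly for deep queries, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the function names (load, texts_at), and the sample tags are assumptions made for illustration; they are not the Orlando schema. Each additional level in a path adds one more self-join to the query.

```python
import sqlite3
import xml.etree.ElementTree as ET


def load(conn, doc_id, xml_string):
    """Shred one XML document into rows that record each element's parent."""
    conn.execute("CREATE TABLE IF NOT EXISTS element ("
                 "id INTEGER PRIMARY KEY, doc TEXT, parent INTEGER, "
                 "tag TEXT, text TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS element_parent "
                 "ON element(parent, tag)")

    def insert(elem, parent_id):
        cur = conn.execute(
            "INSERT INTO element (doc, parent, tag, text) VALUES (?, ?, ?, ?)",
            (doc_id, parent_id, elem.tag, (elem.text or "").strip()))
        row_id = cur.lastrowid
        for child in elem:
            insert(child, row_id)

    insert(ET.fromstring(xml_string), None)


def texts_at(conn, path):
    """Return (doc, text) rows for elements at a path such as 'a/b/c'.

    The path is resolved from the leaf upwards: one self-join per ancestor.
    """
    tags = path.split("/")
    sql = "SELECT e0.doc, e0.text FROM element e0"
    where = ["e0.tag = ?"]
    params = [tags[-1]]
    for depth, tag in enumerate(reversed(tags[:-1]), start=1):
        sql += f" JOIN element e{depth} ON e{depth - 1}.parent = e{depth}.id"
        where.append(f"e{depth}.tag = ?")
        params.append(tag)
    # The outermost tag in the path must be the document root.
    where.append(f"e{len(tags) - 1}.parent IS NULL")
    sql += " WHERE " + " AND ".join(where) + " ORDER BY e0.id"
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical tags, chosen only for this example.
    load(conn, "a", "<entry><biography><birth>1775</birth></biography>"
                    "<writing><birth>other</birth></writing></entry>")
    print(texts_at(conn, "entry/biography/birth"))
```

A three-level path already needs two self-joins; deeply nested markup multiplies that cost, which is one plausible source of the delays described above.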

Our user studies, moreover, have suggested that we would serve scholars better by moving to a ranked search approach, enabling them to filter results based on the markup rather than beginning with the markup. The goal of the search system is to be able to access the information through information retrieval (i.e. full-text search with stemming, ranking, phrase searching, and the like) while retaining the ability to restrict the "text search" to specified branches of the hierarchy. We have created a very basic implementation of such a system but are experiencing scaling issues, and cannot extend the feature set due to performance problems.
We are seeking to merge an XML database engine and an information retrieval engine. There are existing XML database engines that work well with address-book-like XML but not Orlando-like XML. There are existing information retrieval engines (e.g. Lucene, PostgreSQL text search) that work well but only for unstructured text. An information retrieval engine allows for features such as ranking and stemming (i.e. "story" and "stories" are treated as the same term). What we are trying to do is marry the two. Ultimately, we are aiming at something like the W3C candidate recommendation XQuery and XPath Full Text 1.0 (January 2010). We want to integrate the XML querying and information retrieval functionality into one purpose-built system, thereby increasing both effectiveness and efficiency.
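As a rough illustration of what marrying the two might look like, the following sketch builds an inverted index whose postings record the element path of each term, so that a ranked, stemmed text search can be restricted to one branch of the hierarchy. Everything here is an assumption made for illustration: the toy stemmer, the TF-IDF weighting, the class and function names, and the sample tags. It is not the Orlando implementation and ignores phrase search and scaling.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def stem(word):
    """Toy stemmer (an assumption): folds 'stories' and 'story' together."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def tokens(text):
    return [stem(w) for w in re.findall(r"[A-Za-z]+", text or "")]


def under(path, branch):
    """True if path is the branch itself or lies beneath it ('' matches all)."""
    return branch == "" or path == branch or path.startswith(branch + "/")


class PathIndex:
    """Inverted index whose postings remember each term's element path."""

    def __init__(self):
        self.postings = defaultdict(Counter)  # term -> {(doc_id, path): tf}
        self.doc_ids = set()

    def add(self, doc_id, xml_string):
        self.doc_ids.add(doc_id)
        self._walk(doc_id, ET.fromstring(xml_string), "")

    def _walk(self, doc_id, elem, parent_path):
        path = parent_path + "/" + elem.tag
        # Text directly inside this element, plus the tails of its children.
        own_text = [elem.text or ""] + [child.tail or "" for child in elem]
        for term in tokens(" ".join(own_text)):
            self.postings[term][(doc_id, path)] += 1
        for child in elem:
            self._walk(doc_id, child, path)

    def search(self, query, within=""):
        """Rank documents by TF-IDF, counting only hits under `within`."""
        scores = Counter()
        for term in set(tokens(query)):
            per_doc = Counter()
            for (doc_id, path), tf in self.postings.get(term, {}).items():
                if under(path, within):
                    per_doc[doc_id] += tf
            if per_doc:
                idf = math.log(1 + len(self.doc_ids) / len(per_doc))
                for doc_id, tf in per_doc.items():
                    scores[doc_id] += tf * idf
        return scores.most_common()


if __name__ == "__main__":
    index = PathIndex()
    # Hypothetical tags, chosen only for this example.
    index.add("a", "<entry><biography>She wrote stories.</biography>"
                   "<writing>A novel of manners.</writing></entry>")
    index.add("b", "<entry><biography>Born in London.</biography>"
                   "<writing>Short stories and one story.</writing></entry>")
    print(index.search("story"))
    print(index.search("story", within="/entry/writing"))
```

The design choice worth noting is that the path is stored in the posting rather than in a separate structural index, so the branch restriction becomes a filter applied during ranking rather than a join performed beforehand.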
The significance of developing an effective system for combining powerful full-text searching with XML-aware searching is considerable. Such a system would be of broad interest and utility to the digital humanities community, which has invested greatly in text markup but for whose user communities full-text searching remains crucial, as well as to the XML community more generally.

Research goals
Our shorter-term goals are to 1) speed up real-time web-based queries; 2) speed up the production of batch/background indices; 3) experiment with new ways of improving the XML-sensitive searchability of extensively tagged XML collections, to produce an innovative combination of ranked search and XML-aware search. In the longer term we hope to improve scalability so as to handle increased content and an increased number of concurrent users.

How this research application could use HPC
By participating in this workshop we hope to explore the possibility of utilizing HPC for optimization via HPC-oriented algorithms, data structures, and disk/memory-based indices. At the moment we are interested in exploring parallel processing algorithms and data structures that would improve the efficiency and scalability of our search techniques. We would like to test the efficacy of much more extensive indices than we can currently support. We would also like to experiment with decomposing the XML content into "x" subsets and utilising "y" machines in parallel to answer a given XML query, either in real time, as in a web-based search, or in a batch process that could be used to populate indices.
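The decomposition idea can be sketched as follows: split the collection into x subsets, let y workers each answer the query on their subsets, and merge the partial results. This is only a sketch under stated assumptions: threads in a single process stand in for the y machines, the function names are hypothetical, and the query is a simple word count inside one tag rather than a full XML query.

```python
import re
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def partition(doc_ids, x):
    """Split document ids into x roughly equal subsets (round-robin)."""
    return [doc_ids[i::x] for i in range(x)]


def search_subset(docs, subset, tag, word):
    """Count occurrences of `word` inside <tag> elements for one subset.

    Assumes <tag> elements are not nested inside one another.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    results = {}
    for doc_id in subset:
        root = ET.fromstring(docs[doc_id])
        count = sum(len(pattern.findall("".join(el.itertext())))
                    for el in root.iter(tag))
        if count:
            results[doc_id] = count
    return results


def parallel_search(docs, tag, word, x=4, y=2):
    """Answer one query over x subsets using y workers, then merge."""
    subsets = partition(sorted(docs), x)
    merged = {}
    with ThreadPoolExecutor(max_workers=y) as pool:
        partials = pool.map(
            lambda subset: search_subset(docs, subset, tag, word), subsets)
        for partial in partials:
            merged.update(partial)  # subsets are disjoint, so no key clashes
    # Highest count first; ties broken by document id for a stable order.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical documents and tags, chosen only for this example.
    docs = {
        "a": "<entry><writing>story story</writing></entry>",
        "b": "<entry><biography>story</biography></entry>",
        "c": "<entry><writing>A story.</writing></entry>",
    }
    print(parallel_search(docs, "writing", "story", x=3, y=2))
```

Because the subsets are disjoint and each partial result is computed independently, the merged answer does not depend on the choice of x or y; only the elapsed time should change. On real hardware the per-subset work would run on separate processes or nodes rather than threads.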
We are interested in techniques to measure the efficiency of searching, and more specifically to develop a method for quantifying increases/decreases in performance and scalability. We are also interested in visualisation of indices so as to better understand how efficiently the index is storing material.
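One simple way to quantify changes in performance is to time a fixed set of queries repeatedly and compare latency summaries between two versions of the search system. The sketch below assumes a hypothetical search function that takes a query string; the names benchmark and speedup, and the choice of median and 95th-percentile latency as the reported figures, are illustrative assumptions rather than an established methodology for this project.

```python
import statistics
import time


def benchmark(search_fn, queries, repeats=5):
    """Time search_fn on each query; return a latency summary in ms."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            search_fn(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the end.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "runs": len(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def speedup(baseline, candidate):
    """Ratio of median latencies; above 1.0 means the candidate is faster."""
    return baseline["median_ms"] / candidate["median_ms"]


if __name__ == "__main__":
    # Stand-in workloads; real use would call the two search systems.
    def slow(query):
        return sum(range(200_000))

    def fast(query):
        return sum(range(20_000))

    queries = ["story", "novel", "London"]
    before = benchmark(slow, queries)
    after = benchmark(fast, queries)
    print(before, after, speedup(before, after))
```

Running the same harness while varying collection size or the number of concurrent callers would give a first, rough picture of scalability as well as raw speed.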
We are not sure about our needs or what is available, but would hope to clarify this in advance of the research through consultation with Westgrid advisors. We would like to be able to draw on Westgrid personnel with experience in parallel and distributed processing and data structuring. We will require storage for indexing, but it is likely to be relatively modest by Westgrid standards, although indexing of the most likely search permutations, if we were to go that route, could add up. The databases that support our current search system are 6 GB; databases that store graph data structures extrapolated from our data for a different, experimental search system that uses a modified breadth-first search algorithm are currently 8 GB. Although our storage needs cannot yet be determined, they are unlikely to exceed 2 TB.

Team members
Jeffery Antoniuk (University of Alberta) is a systems analyst / programmer involved with the Orlando Project and also with ARC (Arts Resource Centre). He holds an MSc in Computing Science from the University of Alberta and a BSc from Brandon University. As a student at the University of Alberta, Jeff was a member of the Database System Research Group, and his interests include information retrieval and data-mining (knowledge discovery in data). He will contribute to the workshop his expertise in the Orlando markup system, technical expertise with the Orlando Delivery System including the current design of the searching and indexing mechanisms, and programming experience.

Susan Brown (University of Alberta and University of Guelph) is the director of the Orlando Project and the leader of the interdisciplinary team that will produce the Canadian Writing Research Collaboratory. The ongoing interdisciplinary collaborative Orlando Project pioneers new applications of semantic markup to support digital literary history. The Canadian Writing Research Collaboratory is a CFI-funded virtual research environment designed to support scholarship on Canadian writing, to provide open access to a rich collection of resources on Canadian writing, and to foster the use of digital tools by literary scholars. She will contribute to the workshop her expertise in the Orlando markup system and its relation to scholarly inquiry in literary studies, and her understanding of the desiderata for searching and indexing that have emerged from recent user studies on the Orlando textbase.

Denilson Barbosa, PhD 2005 (Toronto), is an Assistant Professor and Ingenuity New Faculty at the Department of Computing Science (Alberta). His interests are in the management of Web Data, particularly XML, and in Computational Social Network Analysis.