PirateBay went down yesterday. Text analysts can take a page from pirates.

This post deserves an essay. I’m going to take big leaps with too little explanation, but it’s been rattling in my head for a while and yesterday’s bust of PirateBay compelled me to write something down.

PirateBay went down yesterday. Police in Sweden seized computers and the site went down. This is not the first time the site went down and people expect it to come back up. Torrent technology was invented for just this kind of event. A torrent only stores metadata about files available elsewhere. The entire PirateBay set of magnets can be stored on a USB disk. Cached versions of PirateBay still exist on the web and people can still download files.

One might dismiss torrent technology as a hack by pirates unwilling to pay for content, but torrents are driving real-world innovation. In earlier posts, I compared the classical “Hot Water Tank” architecture of a QA system with an alternative “Tank-less” architecture. The Tank approach is solid but cumbersome, while the Tank-less approach is deft. The idea is part of a larger shift in the world of big data processing and a demand for real-time stream processing. One of the technologies in play is the torrent.

The pirate flag flies in winter in Wakefield Quebec

Go ahead and question the motives of pirates, but their purpose overlaps with that of freedom of information advocates. Consider PirateBox. PirateBox is a do-it-yourself file sharing system, built with a cheap router and open source software. Bring it to a public space and anyone can anonymously upload and download content. It can be used to share movies. It could also be used to legally share health care information in the aftermath of a natural disaster when the internet is not available. It is no surprise that the technology has been taken up by librarians in the form of LibraryBox.

The fight for net neutrality does not seem to end. A two-tiered internet seems inevitable. Those who seek greater internet surveillance powers keep coming back. What can be done? In 2012 PirateBay experienced downtime. It came back online, announcing a plan to move its servers to the sky, tethered to drones. It got me thinking: strap a PirateBox to a drone from BestBuy, and you have a flying internet. The cost is low. Build a fleet. A flying internet would deftly sidestep unwanted controls, for geeks wanting the latest Marvel movie and for teachers in Syria alike.

PirateBay, PirateBox, a drone-based internet. It sounds fantastic but the driver is practical. People want agile access to content. If things get too boxed in then people will invent PirateBoxes to get out. It is the same challenge faced in big data and text analytics today. Faced with an ocean of unstructured content waiting to be mined, traditional database design and top-down programming are simply too rigid. New approaches with Natural Language Processing divide content into fragments and apply bottom-up pattern recognition to extract meaning. You can see the parallel with the pirates: the use of sophisticated techniques to preserve access to distributed content.

I think of Fahrenheit 451 and the character Granger, the leader of a group of exiled drifters. Each has memorized fragments of books in readiness for a time when society will be ready to discover them.

Genre, gender and agency analysis using Parts of Speech in Watson Content Analytics. A simple demonstration.

Genre is often applied as a static classification: fiction, non-fiction, mystery, romance, biography, and so on. But the edges of genre are “blurry” (Underwood). The classification of genre can change over time and situation. Ideally, genre and all classifications could be modeled dynamically during content analysis. How can IBM’s Watson Content Analytics (WCA) help analyze genre? Here is a simple demonstration.

In WCA I created a collection of 1368 public domain novels from Open Library. For this demonstration, I obtained author metadata and expressed it as a WCA facet. I did not obtain existing genre metadata. I will demonstrate that I can use author gender to dynamically classify genre for a specific analytical question. In particular, I follow the research of Matthew Jockers and the Nebraska Literary Lab. Can genre be distinguished by the gender of the author? How are action and agency treated differently in male and female genres? This simple demonstration does not answer these questions, but shows how WCA can be used to give insight into literature.

In Figure 1, the WCA Author facet is used to filter the collection to ten male authors: Walter Scott, Robert Louis Stevenson, and others. The idea is to dynamically generate a male genre by the selection of male authors. (Simple, but note that a complex array of facets could be used to quickly define a male genre.)

genre gender 1

In Figure 2, the WCA Parts-of-Speech analysis lists frequently used verbs in the collection subset, the male genre: tempt, condemn, struggle. Some values might be considered action verbs, but further analysis is required.

genre gender 2


In Figure 3, the verb “struggle” is seen in the context of its source, the Waverley novels: “the Bohemian struggled to detain Quentin”, “to struggle with the sea”. This view can be used to determine the gender of characters, identify the actions they are performing, and interpret agency.

genre gender 3


In Figure 4, a new search is performed, this time filtering for female authors: Jane Austen, Maria Edgeworth, Susan Ferrier, and others. In this case, the idea is to dynamically generate a female genre by selecting female authors.

genre gender 4


In Figure 5, the WCA Parts-of-Speech analysis lists frequently used verbs in the female genre: mix, soothe, furnish. At a glance, there is an obvious difference in quality from the verbs in the male genre.

genre gender 5

Finally, in Figure 6, the verb “furnish” is seen in the context of its source in Jane Austen’s Letters: “Catherine and Lydia … a walk to Meryton was necessary to amuse their morning hours and furnish conversation.” In this case, furnish does not refer to the literal furnishing of a house, but to the facilitation of dialog. As before, detailed content inspection is needed to analyze and interpret agency.

genre gender 6
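
The steps in this demonstration (filter by an author facet, list the verbs, then view a verb in context) can be sketched outside WCA. What follows is a minimal Python sketch and not WCA's implementation. The two-document corpus, the author_gender field, and the function names are hypothetical, and the part-of-speech tags are written by hand rather than produced by a tagger. It also counts surface forms ("struggled"), whereas WCA lists base forms ("struggle").

```python
from collections import Counter

# Hypothetical toy corpus. Each document carries facet metadata plus
# hand-tagged (token, part-of-speech) pairs taken from the quotes above.
DOCS = [
    {"author": "Walter Scott", "author_gender": "male",
     "tokens": [("the", "DET"), ("Bohemian", "NOUN"), ("struggled", "VERB"),
                ("to", "PART"), ("detain", "VERB"), ("Quentin", "PROPN")]},
    {"author": "Jane Austen", "author_gender": "female",
     "tokens": [("to", "PART"), ("amuse", "VERB"), ("their", "PRON"),
                ("morning", "NOUN"), ("hours", "NOUN"), ("and", "CCONJ"),
                ("furnish", "VERB"), ("conversation", "NOUN")]},
]

def filter_by_facet(docs, facet, value):
    """Keep only the documents whose facet has the given value."""
    return [d for d in docs if d.get(facet) == value]

def verb_counts(docs):
    """Count lowercased tokens tagged VERB across the given documents."""
    counts = Counter()
    for d in docs:
        for token, pos in d["tokens"]:
            if pos == "VERB":
                counts[token.lower()] += 1
    return counts

def keyword_in_context(docs, keyword, window=2):
    """Return each occurrence of keyword with `window` tokens on either side."""
    hits = []
    for d in docs:
        words = [token for token, _ in d["tokens"]]
        for i, word in enumerate(words):
            if word.lower() == keyword:
                start = max(0, i - window)
                hits.append(" ".join(words[start:i + window + 1]))
    return hits

male_genre = filter_by_facet(DOCS, "author_gender", "male")
print(verb_counts(male_genre).most_common())
print(keyword_in_context(DOCS, "furnish"))
```

The point of the sketch is that the "genre" is never stored anywhere. It is just the result of a facet filter, so a different question can define a different genre from the same collection.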

Using Orlando and Watson Named Entities to analyze literature from Open Library. A simple example.

Jane Austen’s Letters are a collection of Austen’s personal observations about her family, friends, and life. Great stuff for a literary researcher. The Letters are in the public domain. Public domain books provide a corpus of unstructured content for literary analysis. I am very grateful to Jessamyn West and Open Library for obliging my request for a download of public domain novels and related literary works, over 2100 titles. It allows this first simple example of how Orlando metadata and IBM Watson technology can work together to analyze literature.

In Figure 1, I observe in Watson Content Analytics (WCA) that there are 129 works from Open Library matching on the Orlando entry for Jane Austen. I could continue to explore the Orlando relationships available as facets here, but for this example I just add the Jane Austen entry to the search filter.

jane austen 1

In Figure 2, I look at the WCA Named Entity Recognition (NER) annotators for Person. NER is automatic annotation of content by Person, Location and Organization. It is enabled with a simple switch in WCA. In this view, I suppose I am interested in Austen’s publisher, Frank S. Holby, who matches on 28 of the 128 works. Note that this Person was not Orlando metadata but rather discovered from the body of works by NER. I add Holby’s name to my search criteria.

jane austen 2

In Figure 3, I switch to the WCA Documents view to begin inspecting the search results. I see a number of works, the Letters, highlighting the Orlando match on Jane Austen and the NER match on Frank S. Holby.

jane austen 3


Orlando and Watson Part II. Pseudonym as a simple illustration of semantic search.

A common problem with searching for information is that a concept can have many different surface forms. It is difficult for a researcher to know all the forms, let alone type them in for every search.

Orlando is a digital literary resource, a structured “textbase” about British women writers. This resource can be utilized by IBM’s Watson Content Analytics to provide semantic search and analysis. Here is a simple illustration.

In Figure 1, suppose I know the interesting pseudonym, “Will Chip, a Carpenter.” Must be a male writer, yes? Not so fast. I select the pseudonym for a search. There are sixteen matching documents in this small sample.

orlando synonym1

In Figure 2, I switch to Documents view. The name “Hannah More”, a female writer, is highlighted in the documents. Hannah More is Will Chip, a Carpenter. It is her pseudonym. This link was provided by Orlando. Semantic links like this can be applied to every concept in IBM’s Watson Content Analytics, facilitating literary research across millions of documents.

orlando synonym2
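
The idea behind this kind of search can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how WCA or Orlando store their links: the PSEUDONYMS table holds only the one link described above, and the function names are hypothetical.

```python
# Hypothetical name table. The single link is the one described in this post;
# a real table would be derived from the Orlando data.
PSEUDONYMS = {"Hannah More": ["Will Chip, a Carpenter"]}

def build_surface_index(table):
    """Map every known surface form (lowercased) to the full set of forms."""
    index = {}
    for canonical, aliases in table.items():
        forms = [canonical] + aliases
        for form in forms:
            index[form.lower()] = forms
    return index

def semantic_search(query, documents, index):
    """Expand the query to all linked forms, then match any of them."""
    forms = index.get(query.lower(), [query])
    results = []
    for doc in documents:
        text = doc.lower()
        matched = [form for form in forms if form.lower() in text]
        if matched:
            results.append((doc, matched))
    return results

index = build_surface_index(PSEUDONYMS)
documents = ["A note that mentions Hannah More.", "An unrelated note."]
for doc, matched in semantic_search("Will Chip, a Carpenter", documents, index):
    print(matched, "->", doc)
```

Searching on the pseudonym returns the document that only mentions the real name, which is the behaviour shown in Figure 2. The researcher types one form and the link table supplies the rest.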

Orlando and Watson demonstration. Analytics without metadata.

Orlando is a digital index of the lives and works of British women writers. I have the privilege of using the Orlando resource in collaboration with Susan Brown. For discussion in the context of NovelTM, I have put together a quick demo that integrates Orlando in IBM’s Watson Content Analytics.

Orlando is structured data, making associations between names, places and works. However, it is not precisely metadata. Metadata is “data about data”, and Orlando does not classify content directly. Not yet. I extracted a subset of the Orlando data and converted it into Natural Language Processing annotators. Annotators can be used to extract structure from unstructured content and make it analyzable. In this case, the content is a small set of about 300 biographical documents. The demo illustrates how analytics can be performed without the labour-intensive work of manual metadata classification.
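
To make the idea of an annotator concrete, here is a minimal Python sketch of a dictionary-based one. It is an assumption about the general technique, not the annotator format WCA uses: the name list is a three-entry stand-in for the Orlando extract, and build_annotator and the annotation fields are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for names extracted from the structured resource.
AUTHOR_NAMES = ["Maria Abdy", "Elizabeth Carter", "Horace Walpole"]

def build_annotator(names, facet):
    """Turn a list of names into a function that annotates raw text."""
    # Longest names first, so a longer name wins over a shorter overlapping one.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")

    def annotate(text):
        return [{"facet": facet, "value": m.group(1),
                 "start": m.start(), "end": m.end()}
                for m in pattern.finditer(text)]

    return annotate

def facet_counts(documents, annotate):
    """Count how many documents mention each facet value at least once."""
    counts = Counter()
    for doc in documents:
        counts.update({a["value"] for a in annotate(doc)})
    return counts

annotate = build_annotator(AUTHOR_NAMES, "Author (Orlando)")
docs = ["A letter from Horace Walpole mentions Elizabeth Carter.",
        "Another page about Horace Walpole."]
print(facet_counts(docs, annotate).most_common())
```

No document here was classified by hand. The structure comes from running the name list over the text, and the frequency counts fall out of the annotations.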

Figure 1. The Orlando extract has been mapped to facets in Watson Content Analytics. For example, the “Author (Orlando)” facet lists Maria Abdy, Elizabeth Carter, Horace Walpole, and many others, with their associated frequency counts. Horace Walpole has 20 hits.


Figure 2. Switching to the Documents view, an analyst discovers documents federated from multiple sources, in this case the 20 documents for Horace Walpole. The Documents view provides a single interface for in-depth document analysis.


Figure 3. A number of visualizations provide a quick way to analyze documents. In this view, the Birth Region facet is paired up with the Religion facet, both showing values from Orlando. The red square highlights a strong correlation between the Midlothian birth region and the Free Church of Scotland. It’s a jumping-off point to filter documents and discover additional patterns.
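
A facet pairing like this one can be approximated with a simple co-occurrence measure. The Python sketch below uses lift (how much more often two values appear together than chance would predict). That choice is my assumption for illustration and is not necessarily the correlation measure WCA computes. The four records are invented toy data, and the "Other" values are placeholders.

```python
from collections import Counter

# Invented toy records, for illustration only.
RECORDS = [
    {"birth_region": "Midlothian", "religion": "Free Church of Scotland"},
    {"birth_region": "Midlothian", "religion": "Free Church of Scotland"},
    {"birth_region": "Other region", "religion": "Other religion"},
    {"birth_region": "Other region", "religion": "Other religion"},
]

def pair_lift(records, facet_a, facet_b):
    """Lift for each observed value pair: P(a, b) / (P(a) * P(b)).

    A lift above 1 means the pair co-occurs more often than expected
    if the two facets were independent.
    """
    n = len(records)
    a_counts = Counter(r[facet_a] for r in records)
    b_counts = Counter(r[facet_b] for r in records)
    pair_counts = Counter((r[facet_a], r[facet_b]) for r in records)
    return {
        (a, b): (count / n) / ((a_counts[a] / n) * (b_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }

for pair, lift in pair_lift(RECORDS, "birth_region", "religion").items():
    print(pair, round(lift, 2))
```

Pairs with high lift are the ones worth clicking into, which is the role the red square plays in the visualization.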


There’s so much more to show and tell.

How Watson Works in Four Steps

A good overview of how IBM’s Watson works. When humans seek to understand something and to make a decision we go through four steps.

  1. Observe visible phenomena and bodies of evidence;
  2. Draw on what we know to interpret evidence and to generate hypotheses;
  3. Evaluate which hypotheses are right or wrong; and
  4. Decide the best option and act accordingly.

So does Watson. Key to the success is the ability to process unstructured inputs using Natural Language Processing.
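
The four steps can be pictured as a tiny pipeline. The Python sketch below is a toy illustration of the observe, interpret, evaluate, decide loop and makes no claim about how Watson is implemented. The passages, the word-overlap scoring, and all function names are assumptions chosen to keep the example short.

```python
def observe(question, passages):
    """Step 1: gather evidence, here any passage sharing a word with the question."""
    q_words = set(question.lower().split())
    return [p for p in passages if q_words & set(p["text"].lower().split())]

def hypothesize(evidence):
    """Step 2: each piece of evidence proposes a candidate answer."""
    return {p["answer"] for p in evidence}

def evaluate(question, candidates, evidence):
    """Step 3: score each candidate by word overlap with its supporting passages."""
    q_words = set(question.lower().split())
    scores = {c: 0 for c in candidates}
    for p in evidence:
        scores[p["answer"]] += len(q_words & set(p["text"].lower().split()))
    return scores

def decide(scores):
    """Step 4: pick the best-scoring candidate, or None if there is none."""
    return max(scores, key=scores.get) if scores else None

# Hypothetical passages paired with the answer each one would support.
PASSAGES = [
    {"text": "virginia woolf wrote the novel orlando a biography",
     "answer": "Virginia Woolf"},
    {"text": "hannah more used the pseudonym will chip",
     "answer": "Hannah More"},
]

question = "who wrote the novel orlando"
evidence = observe(question, PASSAGES)
scores = evaluate(question, hypothesize(evidence), evidence)
print(decide(scores), scores)
```

Real systems replace the word-overlap score with many Natural Language Processing features, but the shape of the loop is the same.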

Orlando: the lives and works of British women writers. Digital resources working together in unexpected and insightful ways.


Orlando is a digital resource, indexing the lives and works of British women writers.

The full name of the project is Orlando: Women’s Writing in the British Isles from the Beginnings to the Present. It is the work of scholars Susan Brown, Patricia Clements, and Isobel Grundy. The name of the work was inspired by Virginia Woolf’s 1928 novel, Orlando: A Biography. The project, like the novel, is an important resource in the history of women’s writing. It grew out of the limitations of a print-based publication, The Feminist Companion to Literature in English. The Companion presented a great deal of research on women writers but lacked an adequate index. The researchers decided to compile a digital index.

I have the good fortune to work with Susan Brown and the Orlando resource. I have extracted bibliographic and literary data from Orlando, and intend to integrate it with unstructured literary content using Natural Language Processing. The aim is a first demonstration of how digital resources like Orlando can provide new ways of reading and understanding literature. In particular I hope to show how digital resources can work together in unexpected and insightful ways.

More information:

The Orlando Project

Bigold, Melanie (2013) “Orlando: Women’s Writing in the British Isles from the Beginnings to the Present, edited by Susan Brown, Patricia Clements, and Isobel Grundy,” ABO: Interactive Journal for Women in the Arts, 1640-1830: Vol. 3: Iss. 1, Article 8.
DOI: http://dx.doi.org/10.5038/2157-7129.3.1.8
Available at: http://scholarcommons.usf.edu/abo/vol3/iss1/8

Orlando: A Biography. Wikipedia


Book Was There, by Andrew Piper. If we’re going to have ebooks that distract us, we might as well have ones that help us analyse too.

“I can imagine a world without books. I cannot imagine a world without reading” (Piper, ix). In these last few generations of print there is nothing keeping book lovers from reading print books. Yet with each decade the print book yields further to the digital. But there it is, we are the first few generations of digital, and we are still discovering what that means for reading. It is important to document this transition. In Book Was There: Reading in Electronic Times, Piper describes how the print book is shaping the digital screen and what it means for reading.

Book was there. It is a quote from Gertrude Stein, who understood that it matters deeply where one reads. Piper: “my daughter … will know where she is when she reads, but so too will someone else.” (128) It is a warm promise and an observation that could be ominous, but one whose possibilities are still being explored.

The differences between print and digital are complex, and Piper is not making a case for or against books. The book is a physical container of letters. The print book is “at hand,” a continuous presence, available for daily reference and so capable of reinforcing new ideas. The word “digital” comes from “digits” (at least in English), the fingers of the hand. Digital technology is ambient, but could allow for more voices, more debate. On the other hand, “For some readers the [print] book is anything but graspable. It embodies … letting go, losing control, handing over.” (12) And internet users are known to flock together, reinforcing what they already believe, ignoring dissent. Take another example. Some criticize the instability of the digital. Turn off the power and the text is gone. Piper counters that digital text is incredibly hard to delete, with immolation of the hard drive being the NSA recommended practice.

Other differences are still debated. There is a basic two-dimensional nature to the book, with pages facing one another and turned. One wonders if this duality affords reflection. Does the return to one-dimensional scrolling of the web page numb the mind? Writing used to be the independent act of one or two writers. Reading was a separate event. Digital works like Wikipedia are written by many contributors, organized into sections. Piper wonders if it is possible to have collaborative writing that is also tightly woven like literature. (There is the recent example of 10 PRINT, written by ten authors in one voice.) Books have always been shared, a verb that has its origins in “shearing … an act of forking.” (88) With digital, books can be shared more easily, and readers can publish endings of their own. Books are forked into different versions. Piper cautions that over-sharing can lead to the forking that ended the development of Unix. But we now have the successful Unix. Is there a downside?

Scrolling aside, digital is really a multidimensional medium. Text has been rebuilt from the ground up, with numbers first. New deep kinds of reading are becoming possible. Twenty-five years ago a professor of mine lamented that he could not read all the academic literature in his discipline. Today he can. Piper introduces what is being called “distant reading”: the use of big data technologies, natural language processing, and visualization, to analyze the history of literature at the granular level of words. In his research, he calculates how language influences the writing of a book, and how in turn the book changes the language of its time. It measures a book in a way that was never possible with disciplined close reading or speed reading. “If we’re going to have ebooks that distract us, we might as well have ones that help us analyse too.” (148)

Piper embraces the fact that we now have new kinds of reading. He asserts that these practices need not replace the old. Certainly there will always be print books for those of us who love a good slow read. I do think, however, that trade-offs are being made. Books born digital are measurably shorter than print, more suited to quick reading and analysis by numbers. New authors are writing to digital readers. Readers and reading are being shaped in turn. The reading landscape is changing. These days I am doubtful that traditional reading of print books, or even ebooks, will remain a common practice. There it is.

Wilson iteration plans: Topics on text mining the novel.

The Wilson iteration of my cognitive system will involve a deep dive into topics on text mining the novel. My overly ambitious plans are the following, roughly in order:

  • Develop a working code illustration of genre detection.
  • Develop another custom entity recognition model for literature, using an annotated corpus.
  • Visualization of literary concepts using time trends.
  • Collection of open data, open access articles, and open source tools for text analysis of literature.
  • Think about a better teaching tool for building models. Distinguish teaching computers from programming.

We’ll see where it goes.