How Watson Works in Four Steps

A good overview of how IBM’s Watson works. When humans seek to understand something and make a decision, we go through four steps:

  1. Observe visible phenomena and bodies of evidence;
  2. Draw on what we know to interpret evidence and to generate hypotheses;
  3. Evaluate which hypotheses are right or wrong; and
  4. Decide the best option and act accordingly.

So does Watson. Key to its success is its ability to process unstructured inputs using Natural Language Processing.

Orlando: the lives and works of British women writers. Digital resources working together in unexpected and insightful ways.


Orlando is a digital resource, indexing the lives and works of British women writers.

The full name of the project is Orlando: Women’s Writing in the British Isles from the Beginnings to the Present. It is the work of scholars Susan Brown, Patricia Clements, and Isobel Grundy. The name was inspired by Virginia Woolf’s 1928 novel, Orlando: A Biography. The project, like the novel, is an important resource in the history of women’s writing. It grew out of the limitations of a print-based publication, The Feminist Companion to Literature in English. The Companion presented a great deal of research on women writers but lacked an adequate index, so the researchers decided to compile a digital one.

I have the good fortune to work with Susan Brown and the Orlando resource. I have extracted bibliographic and literary data from Orlando, and intend to integrate it with unstructured literary content using Natural Language Processing. The aim is a first demonstration of how digital resources like Orlando can provide new ways of reading and understanding literature. In particular I hope to show how digital resources can work together in unexpected and insightful ways.
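To make the aim concrete, here is a minimal sketch of the kind of integration I mean: matching names from a structured index to mentions in unstructured text. The entries and passage below are invented for illustration, not actual Orlando data.

```python
# Toy structured index: writer name -> structured facts.
index_entries = {
    "Virginia Woolf": {"born": 1882, "works": ["Orlando: A Biography"]},
    "Aphra Behn": {"born": 1640, "works": ["Oroonoko"]},
}

def link_mentions(text, entries):
    """Return the index entries whose names are mentioned in the text."""
    return {name: data for name, data in entries.items() if name in text}

passage = "Virginia Woolf published Orlando: A Biography in 1928."
print(link_mentions(passage, index_entries))
```

A real pipeline would of course need entity recognition to handle name variants, but even exact-match linking shows how a structured resource can annotate raw text.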

More information:

The Orlando Project

Bigold, Melanie (2013) “Orlando: Women’s Writing in the British Isles from the Beginnings to the Present, edited by Susan Brown, Patricia
Clements, and Isobel Grundy,” ABO: Interactive Journal for Women in the Arts, 1640-1830: Vol. 3: Iss. 1, Article 8.
DOI: http://dx.doi.org/10.5038/2157-7129.3.1.8
Available at: http://scholarcommons.usf.edu/abo/vol3/iss1/8

Orlando: A Biography. Wikipedia

 

Book Was There, by Andrew Piper. If we’re going to have ebooks that distract us, we might as well have ones that help us analyse too.

“I can imagine a world without books. I cannot imagine a world without reading” (Piper, ix). In these last few generations of print, there is nothing keeping book lovers from reading print books. Yet with each decade the print book yields further to the digital. But there it is: we are the first few generations of digital, and we are still discovering what that means for reading. It is important to document this transition. In Book Was There: Reading in Electronic Times, Piper describes how the print book is shaping the digital screen and what it means for reading.

Book was there. It is a quote from Gertrude Stein, who understood that it matters deeply where one reads. Piper: “my daughter … will know where she is when she reads, but so too will someone else.” (128) It is a warm promise, and an observation that could be ominous; its possibilities are still being explored.

The differences between print and digital are complex, and Piper is not making a case for or against books. The book is a physical container of letters. The print book is “at hand,” a continuous presence, available for daily reference and so capable of reinforcing new ideas. The word “digital” comes from “digits” (at least in English), the fingers of the hand. Digital technology is ambient, but could allow for more voices, more debate. On the other hand, “For some readers the [print] book is anything but graspable. It embodies … letting go, losing control, handing over.” (12) And internet users are known to flock together, reinforcing what they already believe and ignoring dissent. Take another example. Some criticize the instability of the digital. Turn off the power and the text is gone. Piper counters that digital text is incredibly hard to delete, with immolation of the hard drive being the NSA-recommended practice.

Other differences are still debated. There is a basic two-dimensional nature to the book, with pages facing one another and turned. One wonders if this duality affords reflection. Does the return to one-dimensional scrolling of the web page numb the mind? Writing used to be the independent act of one or two writers. Reading was a separate event. Digital works like Wikipedia are written by many contributors, organized into sections. Piper wonders whether it is possible to have collaborative writing that is also tightly woven, like literature. (There is the recent example of 10 PRINT, written by ten authors in one voice.) Books have always been shared, a verb that has its origins in “shearing … an act of forking.” (88) With digital, books can be shared more easily, and readers can publish endings of their own. Books are forked into different versions. Piper cautions that over-sharing can lead to the kind of forking that ended the development of Unix. But we now have the successful Unix. Is there a downside?

Scrolling aside, digital is really a multidimensional medium. Text has been rebuilt from the ground up, numbers first. New, deep kinds of reading are becoming possible. Twenty-five years ago a professor of mine lamented that he could not read all the academic literature in his discipline. Today he can. Piper introduces what is being called “distant reading”: the use of big data technologies, natural language processing, and visualization to analyze the history of literature at the granular level of words. In his research, he calculates how language influences the writing of a book, and how in turn the book changes the language of its time. It measures a book in a way that was never possible with disciplined close reading or speed reading. “If we’re going to have ebooks that distract us, we might as well have ones that help us analyse too.” (148)
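Distant reading at the level of words can be sketched in a few lines. This is a toy illustration; the corpus, dates, and word choice are invented, and a real study would track thousands of texts.

```python
from collections import Counter

# Invented mini-corpus: publication year -> text.
corpus = {
    1810: "the heart is a lonely hunter and the heart wants",
    1860: "steam and iron and the engine of industry",
    1910: "the machine age and the engine and the motor",
}

def relative_frequency(word, text):
    """Fraction of tokens in the text that are the given word."""
    tokens = text.split()
    return Counter(tokens)[word] / len(tokens)

# Trace one word's frequency across the decades.
for year in sorted(corpus):
    print(year, round(relative_frequency("engine", corpus[year]), 3))
```

Plotting such frequencies over time is the simplest form of the trend analysis distant reading makes possible.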

Piper embraces the fact that we now have new kinds of reading. He asserts that these practices need not replace the old. Certainly there will always be print books for those of us who love a good slow read. I do think, however, that trade-offs are being made. Books born digital are measurably shorter than print books, more suited to quick reading and analysis by numbers. New authors are writing for digital readers. Readers and reading are being shaped in turn. The reading landscape is changing. These days I doubt that traditional reading of print books — or even ebooks — will remain a common practice. There it is.

Wilson iteration plans: Topics on text mining the novel.

The Wilson iteration of my cognitive system will involve a deep dive into topics on text mining the novel. My overly ambitious plans are the following, roughly in order:

  • Develop a working code illustration of genre detection.
  • Develop another custom entity recognition model for literature, using an annotated corpus.
  • Visualize literary concepts using time trends.
  • Collect open data, open access articles, and open source tools for text analysis of literature.
  • Think about a better teaching tool for building models, distinguishing teaching computers from programming them.

We’ll see where it goes.

Slow reading six years later. Digital technology has evolved, and so have I. There is a trade-off.

I was recently interviewed by The Wall Street Journal about slow reading. It has been a few years since I did one of these interviews. I wrote Slow Reading in 2008, six years ago. At the time, the Kindle had just been released and there was a surge of discussion about reading practices, to which I attribute the interest in my little book of research. The request for an interview suggests an ongoing interest in slow reading. So what do I have to say about the subject now?

I used to slow-read often. I would write book reviews, thinking myself progressive in a digital sense for blogging reviews in just four paragraphs. A shift began. My ongoing use of digital technology to read, write, and think pushed that shift along. I tried to write about that shift in a new online book project — I, Reader — but I failed. The shift was still in progress. I hit a wall at one point. I thought for a time I had reached the end of reading. In 2013, I stopped reading and writing. A year later I started again. I now have a good perspective on the shift, but no immediate plans to resume writing about it.

So what did I tell the interviewer about slow reading? I confessed that I slow-read print books less often. I re-asserted that “Slow reading is a form of resistance, challenging a hectic culture that requires speed reading of volumes of information fragments.” I admitted that my resistance is waning. Digital technology has evolved to allow for reading, not just scanning of information fragments but comprehension of complex and rich material. I was surprised and pleased to discover how digital technology has re-programmed my reading and writing skills to process information more quickly and deeply. I am smarter than I used to be.

I have resumed my writing of book reviews. I restored a selection of book reviews from the past, ones relevant to my current blogging purposes. I will be writing new reviews, probably less often, and I will be writing them differently. Currently I am reading Book Was There: Reading in Electronic Times by Andrew Piper. I no longer take notes on paper as I read; I have been tweeting notes instead. I like the way it is evolving. I use a hashtag for the title and author, and sometimes a reader joins in. When I am done, I will write a very short review, two paragraphs tops, and post it here.

That’s not all I said to the interviewer. I said there has been a trade-off because of digital technology. There is always a trade-off. We just have to decide whether the gains are more than the losses. What have we lost? I lingered on this question because the loss is less than I anticipated. We still read. We still read rich and complex material. Students still prefer print books for serious reading, but I expect they are going through the same transition I did. What is lost, I assert, is long-form writing. Books born print can be scanned and put online, but books born digital are getting shorter all the time. It is no coincidence that my book, Slow Reading, was short. I was already a reader in transition. Digital technology prefers shortness. It is one reason that many kinds of poetry will survive and thrive on the web. Things should be as short and simple as possible (but not simpler, per the quote attributed to Einstein). Long-form novels and textbooks will be lost in time. It is a loss. Is it worth it?

The four steps Watson uses to answer a question. An example from literature.

Check out this excellent video on the four steps Watson uses to answer a question. The Jeopardy-style question (i.e., an answer) comes from the topic of literature, so it is quite relevant here: “The first person mentioned by name in ‘The Man in the Iron Mask’ is this hero of a previous book by the same author.” This video is not sales material, but a good overview of the four (not so simple) steps: 1. Question Analysis, 2. Hypothesis Generation, 3. Hypothesis & Evidence Scoring, 4. Final Merging & Ranking. “Who is d’Artagnan?” I am so pleased that IBM is sharing its knowledge in this way. I had new insights watching it.
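The four steps can be caricatured in code. This is only a hedged sketch using the d’Artagnan clue: the candidate list is hard-coded, the answer-type rule is a single keyword test, and the evidence scoring is a bare mention count, where Watson’s real pipeline uses hundreds of scorers.

```python
def question_analysis(clue):
    # Step 1: decide what kind of answer the clue wants.
    return "person" if "this hero" in clue else "thing"

def generate_hypotheses(answer_type):
    # Step 2: generate candidate answers (hard-coded here for illustration).
    return ["d'Artagnan", "Aramis", "Jean Valjean"]

def score_hypotheses(candidates, evidence):
    # Step 3: score each candidate against the evidence text.
    return {c: evidence.count(c) for c in candidates}

def merge_and_rank(scores):
    # Step 4: merge scores and pick the top-ranked candidate.
    return max(scores, key=scores.get)

clue = ("The first person mentioned by name in 'The Man in the Iron Mask' "
        "is this hero of a previous book by the same author.")
evidence = ("d'Artagnan appears in The Three Musketeers, and d'Artagnan "
            "is named first in the later novel.")
scores = score_hypotheses(generate_hypotheses(question_analysis(clue)), evidence)
print("Who is " + merge_and_rank(scores) + "?")
```

Even this caricature shows why the steps compose: each stage narrows or re-weights what the previous stage produced.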

Physika, the next phase: Text analysis of the novel. Selected book reviews are back.

NovelTM is an international collaboration of academic and non-academic partners to produce the first large-scale quantitative history of the novel. It is a natural fit with my interests in cognitive technologies, text analytics, and literature. I am getting to know the players, and hope to contribute. Given that, I have reorganized things a bit here at my blog. The next “Wilson” iteration of my basement build of a cognitive system will focus on text analysis of the novel. Note too, I have brought back a number of book reviews related to text analysis of the novel. In particular, note my review of Orality and Literacy by Ong. In that review, back in 2012, I noted, “It blows my information technology mind to think how these properties might be applied to the task of structuring data in unstructured environments, e.g., crawling the open web. I have not stopped thinking about it. It may take years to unpack.” Two years later, I am slowly unpacking that insight at this blog. 

Genre detection. Natural language processing models ought to be trained by genre specific content.

Having completed the second iteration of Whatson, I am going to kick about for a bit, exploring special topics, before I take on another iteration. One topic of interest is genre detection. To start with the obvious, genre is a category of art, most commonly in reference to literature, characterized by similarities in form, style, technique, tone, content, and sometimes length. Genres include fiction and non-fiction, tragedy and comedy, satire and allegory, and many more refined classifications.

Why is it interesting for a cognitive system to detect genre? Clearly, my focus is on literature, and genre is a major category. However, it goes deeper. I’m reading through an article by Stamatatos:

Kessler gives an excellent summarization of the potential applications of a text genre detector. In particular, part-of-speech tagging, parsing accuracy and word-sense disambiguation could be considerably enhanced by taking genre into account since certain grammatical constructions or word senses are closely related to specific genres. Moreover, in information retrieval the search results could be sorted according to the genre as well.

Natural language processing depends on the choice of models for entity recognition. One major choice is language, e.g., English or another. Perhaps the very next choice is genre. Models really ought to be trained on genre-specific content.
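What a genre-first step might look like, as a minimal bag-of-words sketch. The training snippets and the two genres are toy examples I invented; a serious detector would use large labeled corpora and better features than raw word overlap.

```python
from collections import Counter

# Toy training data: one snippet per genre.
training = {
    "fiction": "once upon a time the hero crossed the dark forest",
    "news": "the minister announced the budget at a press conference",
}

def train(samples):
    """Build a word-count model for each genre."""
    return {genre: Counter(text.split()) for genre, text in samples.items()}

def detect_genre(text, models):
    """Score each genre by overlap with its training vocabulary."""
    tokens = text.split()
    scores = {g: sum(counts[t] for t in tokens) for g, counts in models.items()}
    return max(scores, key=scores.get)

models = train(training)
print(detect_genre("the hero entered the forest", models))
```

The detected genre could then select which entity recognition model to apply, which is the point of the paragraph above.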

Knowing the content’s genre can help with extended analytics. Take the analysis of people names as an example. If we know in advance that we are analyzing fiction, and know that names are selected based on character traits, then we might be able to say something interesting about a person by the “color” of their name. In non-fiction, “color” is incidental, and we might consider other classifications like genealogy to be more informative.

What is unstructured data? Anything not in a DBMS? Text? Non-repetitive data?

What is unstructured data? Anything not in a DBMS? Text? “Many English teachers would contend that the English language is in fact highly structured.” In IBM Data magazine, Inmon suggests distinguishing structured data from unstructured data based on the repetition of data occurrences.

“Data that occurs frequently, repetitive data, is data in a record that appears very similar to data in every other record. … Examples of repetitive data—and there are many—include metering data; click-stream data; telephone call records data, such as time of call, the caller’s telephone number, and the call’s length; analog data; and so on.”

“The converse of repetitive data, nonrepetitive data, is data in which each occurrence is unique in terms of content—that is, each nonrepetitive record is different from the others. … There are many different forms of nonrepetitive data, and examples include emails, call center conversations, corporate contracts, warranty claims, insurance claims, and so on.”
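Inmon’s distinction can be roughly operationalized: records are “repetitive” when each looks much like the others. Here is a sketch that approximates that with word overlap between consecutive records; the similarity measure and the 0.5 threshold are my arbitrary choices, and the records are invented.

```python
def similarity(a, b):
    """Jaccard similarity of the word sets of two records."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def is_repetitive(records, threshold=0.5):
    """True if every consecutive pair of records is sufficiently similar."""
    pairs = zip(records, records[1:])
    return all(similarity(a, b) >= threshold for a, b in pairs)

call_records = [
    "call from 5551234 duration 120 seconds",
    "call from 5559876 duration 45 seconds",
]
emails = [
    "meeting moved to thursday please confirm",
    "the warranty claim for the pump was denied",
]
print(is_repetitive(call_records))  # repetitive, structured-like
print(is_repetitive(emails))        # non-repetitive, unstructured-like
```

Call records share a template and score high; emails share almost nothing and score low, which matches Inmon’s examples in the quotes above.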

Or is the concept of “non-repetition” just another way of saying “meaning”? When events repeat they begin to blur. We look for discontinuities, change, or lack of repetition to indicate meaning. Especially when it comes to big data, we do not know in advance which cases of non-repetition are of value. That is what we seek to find out.