Going on a bike tour in Quebec with my kids. What will follow for Lila.

I'm going on a bike tour in Quebec with my kids. That and other priorities mean I need to shift
my focus away from Lila for a bit. But here's what's coming up when I get back.

  1. Convert unread content into notes. Lila will take unread articles and books and convert them into slips for embedded reading. I may have been over-thinking the technology required here. After all, most unread content is already organized into slip-sized shapes, short units of thought, i.e., paragraphs. (See the sketch after this list.)
  2. Compute association between notes. I have done a rough cut at calculating association between slips, based on keyword queries. Keyword queries are not enough; a cognitive technology should operate more on the level of questions, i.e., something that digs into meaning. I have been thinking about other methods of computing association based on word properties, most recently topic analysis and statistical clustering. I need to dive into this latter approach and pick the best method.
  3. Demonstrate the use of concreteness to order notes. I believe the use of word concreteness could become the most interesting feature of Lila. I did some tests back in December before I started blogging. Fascinating stuff. I will post some demonstrations.
  4. Draft the user interface. I have been cutting a few drafts of Lila's user interface, but I am a bit stuck as I wrestle with item 2.
  5. Post an updated and unified version of the solution architecture. I have been blogging my way through a solution architecture since January, but this process has been as much discovery for me as articulation for interested readers. Blogs are helpful that way. I have been reconsidering several things along the way, trimming here, extending there. Once I complete items 1-4, I will likely blow away everything I have written so far and post an updated and unified version of the solution architecture.
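
Since most unread content already arrives in slip-sized paragraphs, the conversion step in item 1 may be as simple as the minimal sketch below. It assumes plain text input and a blank-line split; real input would come from exported notes and articles.

```python
# Sketch: convert unread content into slip-sized notes by splitting on
# blank lines, assuming paragraphs are the natural slip unit.
import re

def to_slips(text, min_words=5):
    """Split text into paragraph slips, dropping fragments too short
    to carry a unit of thought."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in paragraphs if len(p.split()) >= min_words]

article = """Hierarchy gives a bird's eye view of a work.

A table of contents is a map for the whole book.

(c) 2014"""
print(to_slips(article))  # two slips; the stray copyright line is dropped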

When this is complete, I expect I will choose one piece of Lila to code, likely item 2. I am also considering a deep dive into Digital Humanities research. Stick around.

The cognitive computing features of Lila. Candidate technologies, limitations, and future plans.

Cognitive computing extends the range of knowledge tasks that can be performed by computers and humans. In the previous post I summarized the characteristics of a cognitive system. This post maps the characteristics to Lila features, along with candidate technology to deliver them. Limitations and future plans are also listed.

1. Life-world data

Lila features: Lila operates on unstructured data from multiple sources. Unstructured data includes author notes, digital articles, and books. Data is collected from many sources, including smartphone notes, email, web pages, documents, and PDFs. Lila operates on rapidly changing data, as is expected when writing a work; Lila's functions can be re-calculated on demand. Data volume is expected to be the size of an average non-fiction work (about 100,000 words), up to 1,000 full-length articles, and about 100 full-length books.

Candidate technology: There are existing tools for gathering content from different sources. Evernote, for example, is a candidate technology for a first version of Lila. Lila's cognitive functions can operate on data exported from Evernote.

Limitations and future plans: English only. Digital text only. Text must be analyzable as text, i.e., no locked formats. Table content can be analyzed, but no table look-up operations. Image analysis is limited to associated text labels.

2. Natural questions

Lila features: Lila analyzes author notes, treating them as questions to be asked of other notes and of unread articles and books. The following features combine to build meaningful queries on the content:

  • The finite size of the note itself helps capture the author's meaning.
  • Lila uses author-suggested categories, tags, and markup to understand what the author considers important.
  • Lila develops a model of the author's work, used to better understand the author's intent.

Candidate technology: New Lila technology will be built to create more meaningful structured queries. The structured queries will be performed using existing technology, Apache Solr.

Limitations and future plans: Questions are constructed implicitly from author notes, not from a voice or text question box. No direct dialog interface is provided, but see 6 and 7.
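
To illustrate the structured-query step, here is a minimal sketch that sends a note's extracted keywords to Solr's standard select handler. The core name, field names, and boost are hypothetical, and it assumes slips have already been indexed.

```python
# Sketch: turn a note's extracted keywords into a structured Solr query.
# The core name "slips" and fields "text" and "tags" are hypothetical.
import requests

def query_slips(keywords, tags, rows=10):
    # Boost matches on author-suggested tags over plain text matches.
    q = " OR ".join([f'text:"{k}"' for k in keywords] +
                    [f'tags:"{t}"^2' for t in tags])
    resp = requests.get("http://localhost:8983/solr/slips/select",
                        params={"q": q, "rows": rows, "wt": "json"})
    return resp.json()["response"]["docs"]

# Keywords would come implicitly from a note, not from a question box.
print(query_slips(["hierarchy", "concreteness"], ["pirsig"]))
```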

3. Reading and understanding

Lila features: Lila uses natural language processing (NLP) to read author notes and unread content. Language dictionaries provide an understanding of synonyms and parts of speech; this knowledge of language is an advance over simple keyword matching. Entity identification is performed automatically using machine learning. Identification includes person names, organizations, and locations. Lila can be extended to include custom entity identification models. Lila also uses additional input from the author to build a model of the author's work; this model is used to better understand the author's meaning when questioning content. See 6 and 7.

Candidate technology: Existing NLP technologies, e.g., OpenNLP, plus new Lila technology for the model.

Limitations and future plans: English only. Lila does not perform deep parsing of syntax.
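
OpenNLP is the candidate named above; as a stand-in illustration of the same entity-identification step, here is a sketch using spaCy (a substitution for illustration, not the plan).

```python
# Sketch: automatic entity identification (persons, organizations, locations).
# spaCy stands in for OpenNLP; run `python -m spacy download en_core_web_sm` first.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Robert Pirsig wrote Lila while IBM was building labs in New York.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., PERSON, ORG, GPE
```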

4. Analytics

Lila features: Lila calculates a correlation between author notes, and between author notes and unread content. Lila also calculates a suggested order for notes.

Candidate technology: The open source R tool can be used for statistical calculations. Language resources such as the MRC psycholinguistic database will be used to create new Lila technology for ordering notes.

Limitations and future plans: The calculations for suggesting order are experimental. It is likely that this function will need development over time.
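
The candidate technology named above is R; a comparable sketch in Python, assuming TF-IDF cosine similarity as the correlation measure, shows the shape of the calculation.

```python
# Sketch: correlation between slips as TF-IDF cosine similarity.
# scikit-learn stands in here for the R statistics named above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

slips = [
    "Hierarchy gives a bird's eye view of a work.",
    "A table of contents lets a reader survey the whole work.",
    "The grasshopper sat on the tomato.",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(slips)
sim = cosine_similarity(tfidf)   # sim[i][j] = association between slips i and j
print(sim.round(2))              # the first two slips should score highest
```
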
5. Answers are structured and ordered

Lila features: Lila provides two visualizations:

  • A connections view to visualize correlations between notes and unread content.
  • A suggested order for notes, as a visual hierarchy or a table of contents.

Candidate technology: New Lila technology for the visualizations, web-based. Lila will use open source add-ins to generate the visualizations.

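To illustrate the connections view, here is a sketch using networkx and matplotlib as the stand-in open source add-ins; the notes and scores are made up.

```python
# Sketch: a "connections view" drawn with stand-in open source libraries
# (networkx + matplotlib); the note names and scores are illustrative only.
import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical association scores, e.g., from the similarity sketch in 4.
edges = [("note-1", "note-2", 0.8), ("note-1", "note-3", 0.4),
         ("note-2", "article-7", 0.6)]

G = nx.Graph()
for a, b, score in edges:
    if score >= 0.5:             # show only strong correlations
        G.add_edge(a, b, weight=score)

nx.draw_networkx(G)
plt.show()
```
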
6 & 7. Taught rather than just programmed; learns from human interaction

Lila features: Lila's user interface provides the author with a simple and natural way to:

  • Classify content with categories and tags.
  • Mark up entities, concepts, and relations inline.

These inputs create the model used to question content and create correlations. The author can manually edit the model with improvements. The connections view will allow the author to "pin" correct relationships and delete incorrect ones.

Candidate technology: There are existing technologies for classifying content. Evernote, for example, is a candidate technology for a first version of Lila, and Lila's cognitive functions can operate on data exported from Evernote. New Lila technology for the model.

Limitations and future plans: The Evernote interface for collecting and editing notes has limitations. In the future, Lila will need its own interface to allow for advanced functions, e.g., inline markup and sorting of notes without numbered labels. Lila may also use the author's ordering of notes as a suggestion toward its calculated order.

Cognitive computing. Computers already know how to do math; now they can read and understand.

Cognitive computing extends the range of knowledge tasks that can be performed by computers and humans. It is characterized by the following:

  1. Life-world data. Operates on data that is large, varied, complex and dynamic, the stuff of daily human life.
  2. Natural questions. A question is more than a keyword query. A question embodies unstructured meaning. It may be asked in natural language. A dialog allows for refinement of questions.
  3. Reading and understanding. Computers already know how to do math. Cognitive computing provides the ability to read. Reading includes understanding context, nuance, and meaning.
  4. Analytics. Understanding is extended with statistics and reasoning. The system finds patterns and structures. It considers alternatives and chooses a best answer.
  5. Answers are structured and ordered. An answer is an “assembly,” a wiki-type summary, or a visualization such as a knowledge graph. It often includes references to additional information.

Cognitive computing is not artificial intelligence. Solutions are characterized by a partnership with humans:

  1. Taught rather than just programmed. Cognitive systems “borrow” from human intelligence. Computers use resources compiled from human knowledge and language.
  2. Learn from human interaction. A knowledge base is improved by feedback from humans. Feedback is ideally implicit in an interaction, or it may be explicit, e.g., thumbs up or down.

Lila “tears down” old categories and suggests new ways of looking at content. Word concreteness is a good candidate.

Many of the good things we love about language are essentially hierarchical. Narrative is linear: a beginning, middle, and end. Order shapes the story. Hierarchy gives a bird’s eye view, a table of contents, a summary that allows a reader to consider a work as a whole.

Lila will compute hierarchy by comparing passages on word qualities that suggest order. Concreteness is considered a good candidate. Passages with more abstract words express ideas and concepts, whereas passages with more concrete words express examples. Of the views that Lila can suggest, it is useful to have one that presents abstract concepts first and concrete examples second. I have listed four candidate qualities here, but in the posts that follow I will focus on concreteness.

| Quality | Description | Examples |
| --- | --- | --- |
| 1. Abstract | Intangible qualities, ideas, and concepts. Different from frequency of word usage; both academic terms and colorful prose can have low word frequency. | freedom (227*), justice (307), love (311) |
| 1. Concrete | Tangible examples, illustrations, and sensory experience. | grasshopper (660*), tomato (662), milk (670) |
| 2. General | Categories and groupings. Similar to 1, but 1 is more dichotomous and this one is more of a range. | furniture |
| 2. Specific | Particular instances. | La-Z-Boy rocker-recliner |
| 3. Logical | Analytical thinking, understatement, and fact. Note the conflict with 1 and 2: facts are both logical and concrete. | The fastest land-dwelling creature is the cheetah. |
| 3. Emotional/Sentimental | Feeling, emphasis, opinion. Can take advantage of the vast number of sentiment measures available. | The ugliest sea creature is the manatee. |
| 4. Static | Constancy and passivity. | It was earlier demonstrated that heart attacks can be caused by high stress. |
| 4. Dynamic | Change and activity. Energy. | Researchers earlier showed that high stress can cause heart attacks. |

* Concreteness index. MRC Psycholinguistic database. Grasshopper is a more concrete word than freedom. Indexes like the MRC can be used to compute concreteness for passages.
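
A minimal sketch of the idea: score each passage by mean word concreteness and present the most abstract first. The tiny lookup table stands in for a real MRC-style index; actual scores would be loaded from the database.

```python
# Sketch: order passages abstract-first by mean word concreteness.
# The scores below are illustrative MRC-style values, not real database entries.
import re

CONCRETENESS = {"freedom": 227, "justice": 307, "love": 311,
                "grasshopper": 660, "tomato": 662, "milk": 670}

def mean_concreteness(passage):
    words = re.findall(r"[a-z]+", passage.lower())
    scores = [CONCRETENESS[w] for w in words if w in CONCRETENESS]
    return sum(scores) / len(scores) if scores else 0

passages = ["Freedom and justice demand love.",
            "The grasshopper ate my tomato and drank the milk."]
# Lower mean concreteness = more abstract; abstract passages lead the view.
for p in sorted(passages, key=mean_concreteness):
    print(round(mean_concreteness(p)), p)
```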

Lila can compute hierarchy for individual passages and for groups of passages. Together these build a hierarchy, a view of how the content can be organized. Think of what this offers a writer. A writer stuck in his or her manually produced categories and views can ask Lila for alternate views. Lila "tears down" the old categories and suggests a new way of looking at the content. It is unlikely that the writer will stick exactly to Lila's view, but it could provide a fresh start or give new insight. And Lila can compute new views dynamically, on demand, as the content changes.

Hierarchy has a bad rap but language is infused with it. We must find ways to tear down hierarchy almost as quickly as we build it up.

Hierarchy has a bad rap. Hierarchy is a one-sided relation, one thing set higher than another. In society, hierarchy is the stage for abuse of power. The rich on the poor, white on black, men on women, straight on gay. In language too, hierarchy is problematic. Static labels are laden with power and stereotypes, favoring some over others. Aggressive language, too, can overshadow small worthy ideas.

I read Lila the year it was published, 1991. I have a special fondness for this book because my girlfriend bought it for me; she is now my wife. Lila is not a romantic book, and I don’t mean in the classic-romantic sense of Pirsig’s first famous book. I re-read Lila this year. Philosophy aside, I cringe at Pirsig’s portrayal of his central female character, Lila. She is a stereotype, a dumb blonde, operating only on the level of biology and sexuality, the subject of men’s debates about quality. Pirsig is more philosopher than storyteller.

We cannot escape that many of the good things we love about language are essentially hierarchical. Narrative is linear: a beginning, middle, and end. Order shapes the story. Hierarchy gives a bird's eye view, a table of contents, a summary that allows a reader to consider a work as a whole. For the reader's evaluation of a book, or for choosing to enter a work only at a particular door, the table of contents provides a map. Hierarchy is a tree, a trunk on which the reader can climb, and branches on which the reader can swing.

Granted, a hierarchy is just one view, an author’s take on how the work should be understood. There is merit in deconstructing the author’s take and analyzing the work in other ways. It is static hierarchy that is the problem.

Many writers are inspired to start a project with a vision of the whole, a view of how all the pieces hang together, as if only keystrokes were needed to fill in the details. The writer gets busy, happily tossing content into categories. Inevitably new material is acquired and new thinking takes place. Sooner or later a crisis occurs — the new ideas do not fit the original view. Either the writer does the necessary work to uproot the original categories and build a new better view, or the work will flounder. Again, it is static hierarchy that is the problem.

We must find ways to tear down hierarchy almost as quickly as we build it up. Pirsig’s metaphysics is all about the tension between static and dynamic quality. My writing technology, Lila, named after Pirsig’s book, uses word qualities to compute hierarchy. What word qualities measure hierarchy? I have several ideas. I propose that passages with abstract words are higher order than those with more concrete words. Closer to Pirsig’s view, passages that are dynamic — measured by agency, activity, and heat — are higher order than those that are static. Or does cool clear static logic trump heated emotion? There are several ways to measure it, and plenty of issues to work out. It will take more posts.

Evernote Random. A daily email link to a random note. Keep your content alive.

I write in bits and pieces. I expect most writers do. I think of things at the oddest moments. I surf the web and find a document that fits into a writing project. I have an email dialog and know it belongs with my essay. It is almost never a good time to write so I file everything. Evernote is an excellent tool for aggregating all of the bits in notebooks. I have every intention of getting back to them. Unfortunately, once the content is filed, it usually stays buried and forgotten.

I need a way to keep my content alive. The solution is a daily email, a link to a random Evernote note. I can read the note to keep it fresh in memory. I can edit the note, even just one change to keep it growing.

I looked around for a service but could not find one. I did find an IFTTT recipe for emailing a daily link to a random Wikipedia page; it relies on a Wikipedia page that automatically serves a random entry. In the end, I had to build a page that does a similar thing for Evernote.

You can set up Evernote Random too, but you need a few things.

  • An Evernote account, obviously.
  • A web host that supports PHP.
  • A bit of technical skill. I have already written the Evernote script that generates the random link. But you have to walk through some technical Evernote setup steps, like generating keys and testing your script in their sandbox.
  • The Evernote Random script. It has all the instructions.
  • An IFTTT recipe. That’s the easy part.

Take the script. Improve it. Share it. Sell it. Whatever you like. I would enjoy hearing about it.
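
If you would rather work in Python than PHP, a minimal sketch of the same idea against the Evernote SDK looks roughly like this. It assumes a developer token and the sandbox environment, and it is an illustration, not the script linked above.

```python
# Sketch: pick a random note via the Evernote API, in Python rather than
# the PHP of the actual Evernote Random script; the token is a placeholder.
import random
from evernote.api.client import EvernoteClient
from evernote.edam.notestore.ttypes import NoteFilter, NotesMetadataResultSpec

client = EvernoteClient(token="YOUR_DEV_TOKEN", sandbox=True)
store = client.get_note_store()

# Count the notes, then fetch the metadata of one at random.
note_filter = NoteFilter()
spec = NotesMetadataResultSpec(includeTitle=True)
total = store.findNotesMetadata(note_filter, 0, 1, spec).totalNotes
pick = store.findNotesMetadata(note_filter, random.randrange(total), 1, spec).notes[0]
print(pick.title, pick.guid)
```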

Categories? Tags? Pffft. Words are the true pit of chaos. Or not.

Categories? Very eighteenth century. Tags? So Web 2.0. Pretty cryptic stuff. What will Lila do differently? Let’s take another step.

Tags are messier than categories; I called tags evil. But tags are easier to manage than the next level down, the words themselves. Tags are messy when left to humans, but tags can be managed with automation. Many services auto-suggest tags, controlling the vocabulary. Lila will generate its own tags, refreshing them on demand. Tags can be managed.

Words are the true pit of chaos. People conform to the rules of language when they write, or they don't. People make up words on the fly. Down the rabbit hole. But is it so bad? It happens time and again that we think an information problem is too complex to be automated, only to analyze it and discover that we can do a good chunk of what we hoped by following a relatively simple set of rules. One mature technology is keyword search. Keyword search is so effective we take it for granted. Words can be managed with the right technologies.

Another mature technology is Natural Language Processing (NLP). Its history dates back to the 1950s. The field is enjoying a resurgence of interest in the context of cognitive computing. Consider that a person can learn basic capability in a second language with only a couple thousand words and some syntax for combining them. Words and syntax. Data and rules. Build dictionaries with words and their variant forms. Assign parts-of-speech. Use pattern recognition to pick out words occurring together. Run through many examples to develop context sensitivity. Shakespeare it is not, but human meaning can be extracted from unstructured text in this way for many useful purposes.

Lila's purpose is to make connections between passages of text ("slips") and to suggest hierarchical views, e.g., a table of contents. I've talked a lot about how Lila can compute connections. Keywords and NLP can be used effectively to find common subjects across passages. Hierarchy is something different. How can the words in a passage say something about how it should be ordered relative to other passages? We can go no deeper than the words. It's all we have to work with. To compute hierarchy, Lila needs something different, something special. Stay tuned.

Tags are the evil sisters of Categories. Surprising views, sour fast. Lila offers a different approach.

I'm a classification nut, as I told you. In the last post I told you about the way I organize files and emails into folders. Scintillating stuff, I know. But let's take a step deeper toward Lila by talking about tagging. Tags are the evil sisters of categories. Categories are top-down classification: someone on high has an idealized model of how everything fits into nice neat buckets. Tags are situational and bottom-up. In the heat of the moment, you decide that this file or that email is about some subject. Tags don't conform to a model, you make them up on the fly. You add many tags, as many as you like. Mayhem! I've tried 'em, I don't like 'em.

Tags do one thing very well, they let you create surprising views on your content. Categories suffer from the fact that they only provide one view, a hierarchical structured tree. Tags let you see the same content in many different ways. Oh! Look. There’s that short story I wrote tagged with “epic.” And there’s those awesome vacation pics tagged with the same. Hey, I could put those photos on that story and make it so much better. But the juice you get out of tags sours fast. The fact that they are situational and bottom-up causes their meaning to change. “Bad” and “sick” used to mean negative things. As soon as people get about a hundred tags they start refactoring them, merging and splitting them, using punctuation like underscores to give certain tags special meanings. Pretty soon they dump the whole lot of them and start over. Tags fail. What people really want is, yup, categories.

Lila is a new way to get the juice out of tags without going sour. Lila works collaboratively with the author to organize writing. Lila will let writers assign categories and tags, but treat them as mere suggestions. Lila knows the human is smart and that it needs his or her help, so it will use the author's suggestions to come up with its own set of categories and tags. Lila's technique will be based on natural language processing. Best of all, the tags can be regenerated at the click of a button, so they never sour. You get the surprising views and the tags maintain their freshness. Sweet.
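
A minimal sketch of the regeneration idea, assuming TF-IDF keyword extraction as a stand-in for Lila's NLP technique: take each note's highest-scoring terms as its fresh tags.

```python
# Sketch: regenerate tags on demand as each note's top TF-IDF terms.
# TF-IDF extraction is an assumed stand-in for Lila's NLP technique.
from sklearn.feature_extraction.text import TfidfVectorizer

notes = ["The grasshopper sat on the tomato in the summer heat.",
         "Freedom and justice are abstract ideas, not sensory things."]
vec = TfidfVectorizer(stop_words="english")
scores = vec.fit_transform(notes)
terms = vec.get_feature_names_out()

for i, note in enumerate(notes):
    row = scores[i].toarray().ravel()
    tags = [terms[j] for j in row.argsort()[::-1][:3]]  # top 3 terms as tags
    print(tags, "<-", note)
```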

I've been pretty down on tags in this post, so I will say there is one more thing that tags do quite well. They connect people, like hashtags on Twitter. They form loose groupings of content so that disparate folks can find each other. It doesn't apply so much to a solitary writing process, but it might fit a social writing process. I will think on that.

I’m a bit of a classification nut. It comes from my Dutch heritage. How do you organize files and emails into folders?

Dutch Efficiency

I’m a bit of a classification nut. It comes from my Dutch heritage — those Dutchies are always trying to be efficient with their tiny bits of land. It’s why I’m drawn to library science too. I think a lot about the way I organize computer files and emails into folders. It provides insight into the way all classification works, and of course ties into my Lila project. I’d really like to hear about your own practices. Here’s mine:

  1. Start with a root folder. When an activity starts, I put a bunch of files into a root folder (e.g., a Windows directory or a Gmail label).
  2. Sort files by subject or date. As the files start to pile up in a folder, I find stuff by sorting files by subject or date using application sorting functions (e.g., Windows Explorer).
  3. Group files into folders by subject. When there are a lot of files in a folder, I group files into different folders. The subject classification is low level, e.g., Activity 1, Activity 2. Activities that expire are usually grouped together into an 'archive' folder.
  4. Develop a model. Over time the folder and file structure can get complex, making it hard to find stuff. I often resort to search tools. What helps is developing a model that reflects my work, e.g., Client 1, Client 2. Different levels correspond to my workflow, e.g., 1. Discovery, 2. Scoping, 3. Estimation, etc. The model is really a taxonomy, an information architecture. I can use the same pattern for each new activity.
  5. Classification always requires tinkering. I’ve been slowly improving the way I organize files into folders for as long as I’ve been working. Some patterns get reused over time, others get improved. Tinkering never ends.

(I will discuss the use of tagging later. Frankly, I find manual tagging hopeless.)

Several methods to compute association between passages of text. Time to unpack Walter J. Ong.

What does it mean for two passages of text to be associated? How can association be computed? The answer always comes down to word matching. But not all words are of equal value. Some words say the same thing in different ways. And words mean different things depending on how they are combined. In the previous post I gave a rough cut at how Lila could compute association. A finer method would use more advanced techniques. Here is a list of several techniques. I briefly describe how each method can be used to select important words that limit a search, or expand meaning for a broader search. Once an adequate search query can be defined, a results ranking can be returned and a measure of association computed.

1. Parts of Speech

Nouns and noun phrases. Nouns and noun phrases are most important because they describe the subject of sentences — people, places, things. This is what the text is about, the focus. Noun and noun phrases can be extracted using Natural Language Processing (NLP) technology, ready to use as keywords in a search algorithm.

Nouns with Verbs. Verbs indicate change, perhaps the essence of meaning. Compare "the fox and the chicken" with "the fox ate the chicken." Text without verbs is a continuous stream with nothing happening. I think of verbs as heat, indicating an exchange of energy, a change of state. Verbs indicate something meaningful is going on with a noun, and are worth capturing for search.

Adjectives and Adverbs. Adjectives modify a noun, adverbs modify verbs. Staples of grammar. Extended parts-of-speech analysis can help decide which nouns and verbs are more important to the meaning of text.
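
A minimal sketch of this extraction step, with spaCy as an assumed stand-in NLP toolkit: pull out the noun phrases and verbs, ready for use as keywords.

```python
# Sketch: extract noun phrases and verbs as candidate search keywords.
# spaCy is an assumed stand-in; run `python -m spacy download en_core_web_sm` first.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The fox ate the chicken behind the old barn.")

noun_phrases = [chunk.text for chunk in doc.noun_chunks]
verbs = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]
print(noun_phrases)  # ['The fox', 'the chicken', 'the old barn']
print(verbs)         # ['eat']
```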

2. Normal Forms, Lemmas, and Synonyms

A normal form is the standard expression of multiple surface forms; e.g., $200 = “two hundred dollars”, IBM = “International Business Machines”.  A lemma is the canonical form of a word; e.g., “run” for “runs”, “ran”, and “running.” Synonyms are words of comparable meaning; e.g., tortoise = turtle. These variations can be handled by lookup in existing NLP dictionaries. Matching can be expanded on any terms declared equivalent in a dictionary.
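
A sketch of the dictionary lookup, with NLTK's WordNet as an assumed stand-in resource: lemmatize a term and expand it with synonyms before matching.

```python
# Sketch: expand a search term with its lemma and dictionary synonyms.
# NLTK's WordNet is an assumed stand-in; run nltk.download('wordnet') once.
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

print(WordNetLemmatizer().lemmatize("running", pos="v"))  # -> run

synonyms = {lemma.name() for synset in wordnet.synsets("tortoise")
            for lemma in synset.lemmas()}
print(synonyms)  # lemma names sharing a synset with "tortoise"
```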

3. Word Properties

Quantitative properties of words have been compiled by language researchers and scientists. These properties are available as lists for direct look-up. They can be used to decide which words are more important as keywords. The trick is to decide how the properties imply importance. Here are some possible applications.

Frequency. An infrequently used word is more important than a frequently used word. E.g., “tortoise” is less frequently used than “tower.” Infrequency implies more deliberate word selection.

Concreteness. Abstract words summarize many concrete examples. “Food” is more abstract than “apple.” An abstract word is virtually metadata, and can generally be considered more important. I studied Allan Paivio’s dual-coding theory as an undergrad psychology student. The theory seems to have receded, but his measures of word concreteness are begging to be tapped by Lila.
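
A sketch of the frequency property, assuming the wordfreq library as the source of frequency data: rank candidate keywords so that rarer words count as more important.

```python
# Sketch: treat rarer words as more important, using Zipf frequency scores.
# The wordfreq library is an assumed stand-in for a compiled frequency list.
from wordfreq import zipf_frequency

words = ["tower", "tortoise", "freedom", "grasshopper"]
# Lower Zipf score = rarer word = presumed more deliberate word choice.
for w in sorted(words, key=lambda w: zipf_frequency(w, "en")):
    print(w, zipf_frequency(w, "en"))
```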

4. Phrase and Sentence Properties

Sentiment and Emotion. Positive or negative regard for a thing is referred to as sentiment. "I like apples" is positive sentiment, and "I hate bananas" is negative sentiment. Sentiment is also an indicator of emotion. Generally, positive sentiment can be assumed to indicate a higher degree of association, unless of course you are looking for contrary views.
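
A sketch of the sentiment signal, with NLTK's VADER analyzer as an assumed stand-in:

```python
# Sketch: score sentiment as a coarse signal of regard and emotion.
# NLTK's VADER is an assumed stand-in; run nltk.download('vader_lexicon') once.
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
for s in ["I like apples", "I hate bananas"]:
    print(s, sia.polarity_scores(s)["compound"])  # > 0 positive, < 0 negative
```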

Idea Density. The number of ideas expressed in a span of words can be computed as idea density. "The old gray mare has a big nose" is short and choppy; it has low idea density. "The gray mare is very slightly older than …" has complex interrelationships; it has higher idea density. Idea density can be computed as Number of Ideas / Number of Words. Generally, texts with comparable idea density can be assumed to have a greater association. For example, an academic paper on a subject is more likely to be associated with another academic paper than, say, a blog post.
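
A rough sketch of the calculation. Counting verbs, adjectives, adverbs, prepositions, and conjunctions as propositions is a common approximation, and spaCy stands in as the tagger; both are assumptions, not settled choices.

```python
# Sketch: approximate idea density as propositions per word.
# Counting verbs, adjectives, adverbs, prepositions, and conjunctions as
# propositions is a rough heuristic; spaCy is an assumed stand-in tagger.
import spacy

nlp = spacy.load("en_core_web_sm")
PROPOSITION_POS = {"VERB", "ADJ", "ADV", "ADP", "CCONJ", "SCONJ"}

def idea_density(text):
    tokens = [t for t in nlp(text) if t.is_alpha]
    props = sum(1 for t in tokens if t.pos_ in PROPOSITION_POS)
    return props / len(tokens)

print(idea_density("The old gray mare has a big nose."))
print(idea_density("The gray mare is very slightly older than the colt."))
```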

5. Orality

In Orality and Literacy: The Technologizing of the Word, Walter J. Ong identifies properties of oral communication in societies in which literacy is unfamiliar. I read this book in 2012. I wrote,

One might think that oral culture could not engineer complex works, yet the Iliad and the Odyssey were oral creations. Ong explains the properties of orality that make this possible. Oral memory is achieved through repetition and cliche, for example. Also, phrasing is aggregative, e.g., "brave soldier" rather than analytic, e.g., "soldier". … Ong contrasts many more properties of oral memory. They define the lifeworld of thought prior to structuring through literacy. It is an architecture of implicit thought, of domain knowledge. It blows my information technology mind to think how these properties might be applied to the task of structuring data in unstructured environments, e.g., crawling the open web. I have not stopped thinking about it. It may take years to unpack.

The time has come to unpack these ideas. Oral phrasing uses sonorous qualities like repetition to increase emphasis, aid memory, and shape complexity. These oral patterns carry over to the way people write text today, and are one of the reasons computers have trouble analyzing unstructured text. But Ong catalogued these techniques and at least some of them can be used to select concepts of importance. A text algorithm can easily detect word repetition, for example.
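
A minimal sketch of that last point, detecting repeated phrases as candidate concepts of importance; counting word pairs is an assumed, deliberately simple approach, not Ong's method.

```python
# Sketch: detect repetition, an oral pattern, as a signal of importance.
# Counting repeated word pairs (bigrams) is an assumed, minimal approach.
import re
from collections import Counter

text = ("Sing, goddess, the rage of Achilles. The rage of Achilles "
        "brought pain to the Achaeans, and the rage did not relent.")
words = re.findall(r"[a-z']+", text.lower())
bigrams = Counter(zip(words, words[1:]))

for pair, count in bigrams.most_common(3):
    if count > 1:
        print(" ".join(pair), count)   # "the rage" and "rage of" repeat
```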