What does it mean for two passages of text to be associated? How can association be computed? The answer always comes down to word matching. But not all words are of equal value. Some words say the same thing in different ways. And words mean different things depending on how they are combined. In the previous post I gave a rough cut at how Lila could compute association. A finer method would use more advanced techniques. Below is a list of such techniques. For each, I briefly describe how it can be used to select important words that narrow a search, or to expand meaning for a broader one. Once an adequate search query is defined, a ranked list of results can be returned and a measure of association computed.
1. Parts of Speech
Nouns and noun phrases. Nouns and noun phrases are most important because they describe the subjects of sentences — people, places, things. This is what the text is about, the focus. Nouns and noun phrases can be extracted using Natural Language Processing (NLP) technology, ready to use as keywords in a search algorithm.
Nouns with Verbs. Verbs indicate change, perhaps the essence of meaning. Compare “the fox and the chicken” with “the fox ate the chicken.” Text without verbs is a continuous stream in which nothing happens. I think of verbs as heat, indicating an exchange of energy, a change of state. A verb signals that something meaningful is going on with a noun, and is worth capturing for search.
Adjectives and Adverbs. Adjectives modify nouns; adverbs modify verbs. Staples of grammar. Extended parts-of-speech analysis can help decide which nouns and verbs matter more to the meaning of a text.
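The selection step described above can be sketched in a few lines. A real pipeline would use an NLP library's part-of-speech tagger (spaCy, for instance, also exposes noun phrases directly via `Doc.noun_chunks`); the tiny hand-built lexicon and sentence below are stand-ins for illustration only.

```python
# Toy sketch: pick out nouns and verbs as search keywords using a
# hand-built POS lexicon. A real system would run a trained tagger;
# this lexicon is illustrative, not a real dictionary.
POS_LEXICON = {
    "fox": "NOUN", "chicken": "NOUN", "field": "NOUN",
    "ate": "VERB", "ran": "VERB",
}

def pos_keywords(text, keep=("NOUN", "VERB")):
    """Return (word, tag) pairs for words whose assumed tag is in `keep`."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [(w, POS_LEXICON[w]) for w in words if POS_LEXICON.get(w) in keep]

keywords = pos_keywords("The fox ate the chicken.")
```

With the lexicon above, the sentence yields the noun–verb–noun skeleton the section describes: the subject, the change, and its object, ready to feed into a search query.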
2. Normal Forms, Lemmas, and Synonyms
A normal form is the standard expression of multiple surface forms; e.g., $200 = “two hundred dollars”, IBM = “International Business Machines”. A lemma is the canonical form of a word; e.g., “run” for “runs”, “ran”, and “running.” Synonyms are words of comparable meaning; e.g., tortoise = turtle. These variations can be handled by lookup in existing NLP dictionaries. Matching can be expanded on any terms declared equivalent in a dictionary.
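Dictionary lookup of this kind is simple to sketch. The mapping below reuses the examples from the paragraph; a real system would load its equivalences from existing NLP dictionaries rather than hard-code them.

```python
# Minimal sketch of dictionary-based normalization: surface forms,
# inflections, and synonyms all map to one canonical term, so matching
# can treat declared equivalents alike. Entries are illustrative.
NORMAL_FORMS = {
    "two hundred dollars": "$200",
    "international business machines": "IBM",
    "runs": "run", "ran": "run", "running": "run",
    "turtle": "tortoise",
}

def normalize(term):
    """Map a term to its canonical form; unknown terms pass through."""
    return NORMAL_FORMS.get(term.lower(), term)
```

Two terms then match if their normalized forms are equal, which is how the expansion "on any terms declared equivalent" works in practice.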
3. Word Properties
Quantitative properties of words have been compiled by language researchers and scientists. These properties are available as lists for direct look-up. They can be used to decide which words are more important as keywords. The trick is to decide how the properties imply importance. Here are some possible applications.
Frequency. An infrequently used word is more important than a frequently used word. E.g., “tortoise” is less frequently used than “tower.” Infrequency implies more deliberate word selection.
Concreteness. Abstract words summarize many concrete examples. “Food” is more abstract than “apple.” An abstract word is virtually metadata, and can generally be considered more important. I studied Allan Paivio’s dual-coding theory as an undergrad psychology student. The theory seems to have receded, but his measures of word concreteness are begging to be tapped by Lila.
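As the paragraphs above note, the trick is deciding how properties imply importance. One hedged sketch: score a word by its rarity plus its abstractness, following the claims that infrequent words and abstract words are more important. The numeric tables below are made up for illustration; real values would come from published corpus-frequency lists and concreteness norms (Paivio-style ratings), and the way the two scores are combined is itself an open design choice.

```python
# Hedged sketch: rank candidate keywords by (in)frequency and
# abstractness. All numbers are invented for illustration; real systems
# would look up published frequency and concreteness norms.
FREQUENCY = {"tower": 0.9, "tortoise": 0.2, "food": 0.8, "apple": 0.5}
CONCRETENESS = {"tower": 0.9, "tortoise": 0.95, "food": 0.4, "apple": 0.95}

def importance(word):
    """Rarer words score higher; so do more abstract (less concrete) ones."""
    rarity = 1.0 - FREQUENCY.get(word, 0.5)
    abstractness = 1.0 - CONCRETENESS.get(word, 0.5)
    return rarity + abstractness

ranked = sorted(["tower", "tortoise", "food", "apple"],
                key=importance, reverse=True)
```

Under these invented numbers, the rare "tortoise" and the abstract "food" outrank the common, concrete "tower" — matching the intuitions in the text, though the weighting is arbitrary.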
4. Phrase and Sentence Properties
Sentiment and Emotion. Positive or negative regard for a thing is referred to as sentiment. “I like apples” is positive sentiment, and “I hate bananas” is negative sentiment. Sentiment is also an indicator of emotion. Generally, positive sentiment can be assumed to indicate a higher degree of association, unless of course you are looking for contrary views.
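The simplest form of sentiment analysis sums per-word scores from a lexicon. The sketch below uses the two example sentences from the paragraph; the four-word lexicon is a stand-in for the large curated sentiment lexicons real systems use.

```python
# Toy lexicon-based sentiment: sum per-word scores; positive totals
# suggest positive regard. The lexicon is illustrative only.
SENTIMENT = {"like": 1, "love": 1, "hate": -1, "dislike": -1}

def sentiment_score(text):
    """Return the summed sentiment of known words in the text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(SENTIMENT.get(w, 0) for w in words)
```

So `sentiment_score("I like apples")` is positive and `sentiment_score("I hate bananas")` is negative, which is all the association heuristic above requires.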
Idea Density. The number of ideas expressed per word can be computed as idea density. “The old gray mare has a big nose” — this sentence is short and choppy; it has low idea density. “The gray mare is very slightly older than …” — this sentence has complex interrelationships; it has higher idea density. Idea density can be computed as Number of Ideas / Number of Words. Generally, text with comparable idea density can be assumed to have a greater association. For example, an academic paper on a subject is more likely to be associated with another academic paper than, say, a blog post.
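The "Number of Ideas / Number of Words" ratio can be approximated by counting proposition-bearing words (verbs, modifiers, connectives), in the spirit of propositional-density measures. The sketch below is a crude stand-in: the word list is hand-built for the two mare sentences, where a real implementation would count propositions from POS tags.

```python
# Sketch of idea density = ideas / words, approximating "ideas" by
# counting proposition-bearing words (verbs, modifiers, connectives).
# The word set is illustrative, not a real tagger.
IDEA_WORDS = {"has", "is", "old", "gray", "big", "very", "slightly",
              "older", "than", "and", "but"}

def idea_density(text):
    """Return ideas per word, using the assumed idea-word list above."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    ideas = sum(1 for w in words if w in IDEA_WORDS)
    return ideas / len(words) if words else 0.0
```

Even this rough count separates the two example sentences: the comparative sentence about the slightly older mare scores higher than the choppy one, and two texts could then be compared on how close their densities are.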
In Orality and Literacy: The Technologizing of the Word, Walter J. Ong identifies properties of oral communication in societies in which literacy is unfamiliar. I read this book in 2012. I wrote,
One might think that oral culture could not engineer complex works, yet the Iliad and the Odyssey were oral creations. Ong explains the properties of orality that make this possible. Oral memory is achieved through repetition and cliche, for example. Also, phrasing is aggregative, e.g., “brave soldier,” rather than analytic, e.g., “soldier.” … Ong contrasts many more properties of oral memory. They define the lifeworld of thought prior to structuring through literacy. It is an architecture of implicit thought, of domain knowledge. It blows my information technology mind to think how these properties might be applied to the task of structuring data in unstructured environments, e.g., crawling the open web. I have not stopped thinking about it. It may take years to unpack.
The time has come to unpack these ideas. Oral phrasing uses sonorous qualities like repetition to increase emphasis, aid memory, and shape complexity. These oral patterns carry over to the way people write text today, and are one of the reasons computers have trouble analyzing unstructured text. But Ong catalogued these techniques and at least some of them can be used to select concepts of importance. A text algorithm can easily detect word repetition, for example.
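Detecting repetition really is easy. A minimal sketch, assuming an illustrative stopword list: count the words in a text and keep those that recur, treating them as candidates for the emphasized concepts Ong describes.

```python
from collections import Counter

# Simple repetition detector: words repeated across a text are
# candidates for emphasized concepts. The stopword list is a small
# illustrative stand-in for a real one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

def repeated_words(text, min_count=2):
    """Return non-stopwords that appear at least `min_count` times."""
    words = [w.strip(".,;!?").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return {w: c for w, c in counts.items() if c >= min_count}
```

Running it on an orality-flavored sentence like "The sea, the wine-dark sea, carried the ships to the sea." surfaces "sea" as the repeated, emphasized concept.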