Mark Gibbs shares Web site tips and provides advice on getting the most out of your apps.
Making sense of Web content is mostly easy for humans but rarely easy for computers. Part of the issue is that recognizing the "interesting" parts of online content involves what is mostly unstructured data, making the task very difficult.
For example, consider text that you, a human, have no problem understanding: "The man, who is 42, was charged with arson." You immediately understand the arsonist's age. Not so for a computer program, which would need a huge number of rules to interpret that sentence and discover the man's age. Even then, any one of the thousands of possible minor variants, such as "The man, 42, was charged with arson," would most likely cause the same program to make a mistake.
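To make the brittleness concrete, here is a minimal sketch of a single hand-written rule, using Python's standard `re` module. The rule and the function name `extract_age` are illustrative assumptions, not how any real product works. The rule handles the first sentence and silently misses the variant.

```python
import re

# One hand-written rule (an illustrative assumption): an age is a number
# that follows the words "who is".
AGE_RULE = re.compile(r"who is (\d+)")

def extract_age(sentence):
    """Return the age found by the single rule, or None if the rule does not match."""
    match = AGE_RULE.search(sentence)
    return int(match.group(1)) if match else None

print(extract_age("The man, who is 42, was charged with arson"))  # 42
print(extract_age("The man, 42, was charged with arson"))         # None
```

Each new phrasing needs another rule, which is why purely rule-based approaches grow so quickly.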
And those aren't even tricky sentence constructions. Consider a sentence like "You'll know the melon is ripe when you can smell it walking into your kitchen" -- it would take a truly stupendous program to have a "deep" understanding of what that means.
That's the problem the Semantic Web is intended to address: adding explicit structure to Web content so that its meaning and intent are clear and easily discovered. Of course, to date only a small percentage of Web content is architected that way, and even then, casual communications between people will always be unstructured.
And this brings me to today's topic: How can we programmatically interpret online content to, for example, determine the prevailing sentiment of Twitter users who mention "iran"? The answer is to turn to linguistics.
A new service called OpenAmplify, published by Hapax LLC, uses a "patented Natural Language Processing technology" that analyzes every word used in a piece of text to identify the "significant topics, brands, people, perspectives, emotions, actions and timescales". Requests are made via a RESTful API, and output is returned in XML, DoubleClick DART or JSON, which is perfect for programmatic analysis.
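As a rough sketch of what a request to a RESTful text-analysis service might look like, the snippet below builds a query URL with Python's standard library. The endpoint `api.example.com` and the parameter names `apikey`, `inputtext` and `outputformat` are placeholders assumed for illustration; they are not the documented OpenAmplify API, so check the service's own documentation for the real values.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption for illustration, not the real service URL.
BASE_URL = "https://api.example.com/analyze"

def build_request_url(text, api_key, output_format="json"):
    """Build a GET URL for a hypothetical text-analysis endpoint."""
    # Parameter names are hypothetical; urlencode escapes the text safely.
    params = {
        "apikey": api_key,
        "inputtext": text,
        "outputformat": output_format,
    }
    return BASE_URL + "?" + urlencode(params)

url = build_request_url("The man, 42, was charged with arson", "YOUR_KEY")
print(url)
```

The resulting URL could then be fetched with any HTTP client and the response body parsed as JSON or XML.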
The output enumerates "signals", structured representations of the meaning, intent, style and other characteristics of the text that are weighted, ranked and organized.
OpenAmplify provides four different analyses that can be requested separately or as a group. Topical signals include polarity (positive or negative perception of each topic) and guidance (the degree to which guidance is sought or offered about each topic), and also list proper names and referenced locations. Action signals carry a measure of decisiveness (how likely the action is to be taken), guidance (whether guidance is sought or offered on taking the action) and temporality (when the action may take place). Stylistic signals indicate flamboyancy (a measure of how "flowery" the writing style is) and use of slang (the degree to which slang vocabulary is used). Finally, demographic analysis covers the likely age, gender and education level of the text's author or audience.
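To tie this back to the Twitter question, here is a minimal sketch of how per-topic polarity could be averaged across many analyzed messages. The JSON shape, the field names `topics`, `name` and `polarity`, the numeric polarity scale and the sample values are all assumptions made up for illustration; the real response format may differ.

```python
import json
from statistics import mean

# Made-up sample responses in a hypothetical JSON shape (not real service output).
SAMPLE_RESPONSES = [
    '{"topics": [{"name": "iran", "polarity": -0.5}, {"name": "election", "polarity": -0.25}]}',
    '{"topics": [{"name": "Iran", "polarity": 0.25}]}',
    '{"topics": [{"name": "weather", "polarity": 0.75}]}',
]

def prevailing_polarity(responses, topic):
    """Average the polarity of one topic across responses; None if it never appears."""
    scores = []
    for raw in responses:
        data = json.loads(raw)
        for entry in data.get("topics", []):
            # Compare case-insensitively so "Iran" and "iran" count as one topic.
            if entry["name"].lower() == topic.lower():
                scores.append(entry["polarity"])
    return mean(scores) if scores else None

average = prevailing_polarity(SAMPLE_RESPONSES, "iran")
print(average)  # -0.125
```

A negative average would suggest the prevailing sentiment toward the topic is negative under this assumed scale.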
Mark Gibbs is a consultant, author, journalist, columnist and blogger.