As you massage your piles of Big Data, particularly when the data comes from social media, you’ll often come across tasks such as extracting “clean” text from HTML content, summarizing content, extracting concepts and entities, classifying the content, and identifying the contents of images.
And then there’s sophisticated advertising, where you want to target the user with the most appropriate content possible to maximize selling opportunities, and business intelligence or any other kind of research where you’re trying to mine news or documents for useful, actionable data. In every case, the only way to maximize the value of the staggering volume of available data is machine-driven analysis.
If these are the kinds of problems you’re struggling with, I have a key part of the solution for you: Aylien.
I’ve covered services that do linguistic analysis in the past, such as OpenAmplify, and Aylien is a really strong competitor in this market. They offer a set of REST APIs that return JSON-formatted data, and the services include:
- Article Extraction
- Concept Extraction
- Entity Extraction
- Semantic Labeling (new)
- Image Tagging (new)
- Sentiment Analysis
- Hashtag Suggestion
- Language Detection
- Microformat Extraction
Article extraction returns the text, embedded media (images and videos), author name, and embedded RSS feeds from an HTML document without any of the surrounding clutter such as ads and navigation.
Concept extraction is much more sophisticated. As Aylien describes it:
“Our Concept Extraction endpoint is a more accurate, more targeted and Linked Data-aware variation of our Entity Extraction endpoint. Using it, not only you can find out what topics are mentioned in a piece of text, but also you’ll be provided with their semantic types and URIs, which would allow you to tap into Linked Data to bring in additional, relevant content to your article. Under the hood, our Concept Extraction endpoint performs very accurate Word Sense Disambiguation to find out what the author meant by each mentioned topic. For instance, does “apple” refer to the fruit or the company?”
Entity extraction is really useful for building indexes of the people and organizations that content refers to. Consider a Network World story, “Many password strength meters are downright WEAK, researchers say”; Aylien’s Entity Extraction service produced:
Organization: Concordia University, University Concordia University Assistant Professor Mohammad, Institute for Information Systems Engineering, Ph.D, Microsoft/ University of California at Berkeley/ University, Carnegie Mellon University
Url: hear.That’s, Carnavalet.MORE, passwords.But, say.In, secure.Overall, issue.One, safety.MORE
Keyword: Password Meters on Password, passwords such as … “password, password strength meters, Password Strength Meters, password strength, password gauges do encourage users, researchers at Concordia University, password, study asserts that most of the meters, University Concordia University, researchers, users accessing many different websites, meters, strength, study, Website, Concordia, University, users, inconsistent
Person: Mohammad Mannan, Steve Jobs
This is very useful stuff, but it underlines a problem with all linguistic analysis systems: they can make mistakes, so double-checking for relevance is enormously important. Consider the extracted URLs: if you were using Aylien in a real system you’d be wise to check all returned URLs to see if they are properly formed and, if they are, reject them if they can’t be resolved. (By the way, the apparently random reference to Steve Jobs was from a link to another article.)
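That two-step check could look something like the following Python sketch. It is only an illustration: the function names are my own, "properly formed" is assumed to mean an absolute http(s) URL with a dotted host name, and "can be resolved" is assumed to mean the host name has a DNS entry. The lookup function is passed in as a parameter so the logic can be tried without touching the network.

```python
import socket
from urllib.parse import urlsplit


def is_well_formed(url: str) -> bool:
    """Accept only absolute http(s) URLs with a dotted, space-free host name."""
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname
    except ValueError:  # raised for some malformed inputs, e.g. bad IPv6 literals
        return False
    if parts.scheme not in ("http", "https") or not host:
        return False
    return "." in host and " " not in host


def resolves(url: str, lookup=socket.getaddrinfo) -> bool:
    """Return True if the URL's host name can be looked up.

    `lookup` defaults to a real DNS query but can be swapped for a stub.
    """
    host = urlsplit(url).hostname
    try:
        lookup(host, None)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def keep_usable_urls(candidates, lookup=socket.getaddrinfo):
    """Drop malformed candidates first, then those whose host does not resolve."""
    return [u for u in candidates if is_well_formed(u) and resolves(u, lookup)]


# The "Url" values from the entity extraction output above
extracted = ["hear.That’s", "Carnavalet.MORE", "passwords.But", "say.In",
             "secure.Overall", "issue.One", "safety.MORE"]
print([u for u in extracted if is_well_formed(u)])  # prints []
```

None of the extracted strings has a scheme, so all seven fail the first test and the resolver is never consulted for them.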
Image tagging is a very interesting Aylien service and they claim they can identify 6,000 classes of visual objects. Here are a few of the categories and their associated confidence values generated by my profile picture (it was taken after I shaved my head for charity a few months ago):
man 0.3513295010549713 (Really? Only 35%!)
male 0.33755465407483953 (Again!)
person 0.2929902361744316 (Under 30%?)
handsome 0.18064155897652662 (Oh, come on!)
oxygen mask 0.0828 (I only look 8% like an oxygen mask?)
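With confidence values this low you would probably want to apply a cutoff before using the tags. Here is a minimal Python sketch using the values above; the function name and the 0.25 threshold are arbitrary choices of mine for illustration, not anything Aylien recommends.

```python
# Labels and confidence values copied from the profile-picture result above
tags = {
    "man": 0.3513295010549713,
    "male": 0.33755465407483953,
    "person": 0.2929902361744316,
    "handsome": 0.18064155897652662,
    "oxygen mask": 0.0828,
}


def confident_tags(tags, threshold):
    """Return (label, confidence) pairs at or above the threshold, best first."""
    kept = [(label, conf) for label, conf in tags.items() if conf >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# 0.25 is an arbitrary cutoff chosen for this example
for label, conf in confident_tags(tags, threshold=0.25):
    print(f"{label}: {conf:.0%}")
```

At that cutoff only "man", "male", and "person" survive; "handsome" and "oxygen mask" are dropped.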
Hashtag Suggestion provides what I think of as potential hashtags; they're derived from the content but to be really useful you'd want to test them for popularity and relevance using Facebook's Hashtag Counter API or Twitter's Search API.
Aylien is offered as a set of straightforward REST APIs or, if you don’t want to get your hands dirty programming, you can use the Aylien Google Sheets add-on. The Google Sheets add-on is a really cool way to use the services because it makes them so simple to drive. You can either select spreadsheet cells containing text or URLs, select the destination for the results, and pick the function you want from a menu, or you can actually use Aylien functions in formulae; for example, here are a few of the functions available:
=Classification(value): Returns the classification of a text or URL.
=Hashtags(value): Returns a list of suggested hashtags for a text or URL.
=Language(value): Returns the main language of a text or URL in ISO 639-1 format.
=Concepts(value): Returns a list of concepts mentioned in a text or URL.
=SentimentPolarity(value): Returns the polarity (‘positive’ or ‘negative’) of a text or URL.
=SentimentSubjectivity(value): Returns the subjectivity (‘subjective’ or ‘objective’) of a text or URL.
=People(value): Returns a list of the people mentioned in a text or URL.
=Organizations(value): Returns a list of the organizations mentioned in a text or URL.
Aylien provides extensive documentation with tutorials for a wide range of languages and platforms including C#, Go, Java, node.js, PHP, Python, and Ruby.
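To give a feel for what driving a REST API that returns JSON involves, here is a generic Python sketch using only the standard library. Everything service-specific in it is a placeholder and not Aylien's real interface: the base URL, the credential header names, the "text" form parameter, and the endpoint name in the usage note are all assumptions, so take the actual values from Aylien's documentation.

```python
import json
import urllib.parse
import urllib.request

# Placeholders, NOT Aylien's real values: look up the base URL, endpoint
# names, parameter names, and credential headers in the official docs.
BASE_URL = "https://api.example.invalid/v1"
CREDENTIAL_HEADERS = {
    "X-Example-App-Id": "YOUR_APP_ID",
    "X-Example-App-Key": "YOUR_APP_KEY",
}


def build_request(endpoint: str, text: str) -> urllib.request.Request:
    """Build a form-encoded POST request asking for a JSON response."""
    body = urllib.parse.urlencode({"text": text}).encode("utf-8")
    headers = {"Accept": "application/json", **CREDENTIAL_HEADERS}
    return urllib.request.Request(
        f"{BASE_URL}/{endpoint}", data=body, headers=headers, method="POST"
    )


def call(endpoint: str, text: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body of the reply."""
    with urllib.request.urlopen(build_request(endpoint, text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With real values filled in, something like `call("sentiment", "I love this phone")` would return the decoded JSON as a Python dictionary.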
Using the Aylien APIs is free for up to 30,000 “hits” per month (a text analysis call is billed as one hit while an image analysis call is billed as two hits on the Small plan, and 10 hits on the other plans), rate limited to 60 hits per minute.
Premium plans start with Small at $199 per month for 180,000 hits limited to 120 hits per minute, Medium at $649 per month for 2.4 million hits, at 120 hits per minute, and Large at $1,399 per month for 5.4 million hits, rate limited to 180 hits per minute. Each premium plan has additional hits billed at different rates and SLAs are only available for the Large plan. A “pay as you go” plan is available for $0.01 per hit. On-premises solutions providing unlimited hits are also available.
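Because image calls are weighted differently from plan to plan, it's worth doing the arithmetic before picking one. The Python sketch below uses the figures quoted above and finds the cheapest plan whose monthly allowance covers a given workload. Two caveats: I'm assuming the free tier counts as one of the "other plans" that bill an image call as 10 hits, and overage charges are ignored because the per-plan rates aren't given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_usd: float
    included_hits: int
    image_hits: int  # hits billed per image analysis call


# Figures from the pricing described above, ordered by price.
# Assumption: the free tier bills image calls at 10 hits like Medium and Large.
PLANS = [
    Plan("Free", 0, 30_000, 10),
    Plan("Small", 199, 180_000, 2),
    Plan("Medium", 649, 2_400_000, 10),
    Plan("Large", 1_399, 5_400_000, 10),
]


def hits_needed(plan: Plan, text_calls: int, image_calls: int) -> int:
    """Total hits a workload consumes on a plan (a text call is one hit)."""
    return text_calls + image_calls * plan.image_hits


def cheapest_plan(text_calls: int, image_calls: int):
    """Return the cheapest plan whose allowance covers the workload, or None."""
    for plan in PLANS:
        if hits_needed(plan, text_calls, image_calls) <= plan.included_hits:
            return plan
    return None


plan = cheapest_plan(text_calls=100_000, image_calls=5_000)
print(plan.name, hits_needed(plan, 100_000, 5_000))  # prints Small 110000
```

The same workload would count as 150,000 hits on the free tier, five times its allowance, which is why it lands on Small.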
The text analysis add-on is free to use for up to 1,000 “credits.” A credit is a single unit of operation for performing one analysis task (say, a sentiment analysis) on a single value. When you’ve used up your credits you’re on a “pay as you go” basis and you can top up your credits in the Chrome store or from within a spreadsheet.
Aylien provides an outstanding set of services that will be invaluable in a wide range of Big Data, sales and marketing, and research projects.
Let me know what you wind up using Aylien for …