Long-time reader and old friend Jim Sterne recently wrote to me with a question:
I'd like to start publishing a newsletter about a specific area of interest, using the latest in feeds, bots, scrapers and content management organizers to make things as automated as possible but still being able to keep my eye on what gets posted, emailed, tweeted and projected directly into the corneas of avid, would-be readers.
What’s out there at the moment?
One more publisher on the InterWebs
Dear Mr. Sterne,
So, what you’re looking for is a publishing pipeline: Multiple incoming content feeds, presumably in multiple formats that need to be normalized to create an incoming queue that feeds an analysis and grading process that, in turn, feeds a draft queue where you can tweak and enhance the results and decide what gets published and when. As far as I can determine, there’s no solution out there on the Interwebs that can do everything I think you want so to achieve your vision you may have to bolt multiple subsystems together to create what we shall henceforth call the “FrankenPress.”
And the FrankenPress is not just a technical problem, there are also some serious legal issues involved. Let’s tackle the legal stuff first …
By ingesting content from various sources you’re likely to be toying with text and images that could wind up costing you a lot of money if you’re found guilty of copyright violation. For example, if the RSS feed your pipeline is ingesting is from, say, The Hollywood Reporter, and you don’t rewrite the content so that it is demonstrably your own (or, heaven forfend, republish it without rewriting at all), or you reuse a photo that they’ve licensed from, say, Getty Images, then it’s pretty much certain you’ll eventually get a letter from their dogs of law and it will cost you bigly. Large publishers who get hit with claims of copyright violation often just pay up to avoid the cost of litigation, so unless you have deep pockets, be very, very careful.
“Wait a minute!” you might be thinking, “then how does Google get away with it? Google indexes and summarizes everything including the Hollywood Reporter!” The answer, my friend, is that Google gets away with it because they are Google and the PR value of turning up in Google results is yuggge.
So, consider yourself warned; now, back to technology … among the many sources of content you might want to ingest are RSS feeds and scraped Web content. Each of these sources presents different challenges in parsing the content and then processing it to minimize the amount of work you’ll have to do before publishing.
If your readers would be happy with curation of sources rather than reworked descriptive text, then one strategy that might work, and that avoids both the legal problems and the effort of proofing and editing machine-created summaries, is to simply post source titles with links from RSS feeds, along with tagging and categorization. I’m guessing, though, that you’re looking for more depth than that.
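To make that curation strategy concrete, here’s a rough sketch using nothing but Python’s standard library and a made-up sample feed: pull titles and links out of an RSS 2.0 document and render them as an HTML link list ready to drop into a post.

```python
import xml.etree.ElementTree as ET

# A made-up sample feed; a real pipeline would fetch this over HTTP.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item><title>First story</title><link>https://example.com/1</link></item>
  <item><title>Second story</title><link>https://example.com/2</link></item>
</channel></rss>"""

def curate(feed_xml):
    """Normalize an RSS 2.0 feed into a list of (title, link) pairs."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items

def render_links(items):
    """Render curated items as a simple HTML link list for a blog post."""
    lines = ["<ul>"]
    for title, link in items:
        lines.append(f'  <li><a href="{link}">{title}</a></li>')
    lines.append("</ul>")
    return "\n".join(lines)

print(render_links(curate(SAMPLE_FEED)))
```

A production version would fetch feeds over HTTP and cope with the many RSS and Atom dialects (a library like feedparser earns its keep there), but the normalize-then-render shape stays the same.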
What you want is an “autoblogging” plugin, but while there are lots of these, most are rated as just okay and focus on more sales-oriented goals such as mining affiliate links. One plugin I haven’t used but that gets good reviews is WP Robot, which claims to support over 32 sources and is priced starting at $99 per year for three sites. If you just want to ingest RSS feeds, consider plugins such as RSSImport (free) or FEEDZY RSS Feeds (starts at $59 per year).
A problem you’ll face is how to make the extracted content yours rather than simply reposting the original (and potentially heading off to one of the outer rings of legal hell). There are plenty of “spinners” or “rewriters” for WordPress, but most take a very simplistic approach and just change the content by swapping phrases; for example, “smart decision” might be replaced at random from a list of synonyms such as “good move” or “smart move” and, as you might guess, the results will probably not impress you.
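In case you’re curious just how simplistic, here’s roughly what these spinners do under the hood: a toy sketch in Python, with a one-entry synonym table of my own invention. Real plugins ship bigger tables, but the mechanism is the same, which is why the output reads like it was written by a thesaurus.

```python
import random
import re

# A tiny, made-up synonym table; commercial spinners just have bigger ones.
SYNONYMS = {
    "smart decision": ["good move", "smart move", "wise choice"],
}

def spin(text, rng=random):
    """Replace each known phrase with a randomly chosen synonym."""
    for phrase, alternatives in SYNONYMS.items():
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        text = pattern.sub(lambda m: rng.choice(alternatives), text)
    return text

print(spin("Buying early was a smart decision."))
```

Note what’s missing: no grammar check, no sense of context, no awareness that the swapped phrase may not fit the sentence. That’s the whole problem.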
If you want to go to the next level of “spinning,” you could process the extracted content through a service such as Aylien, which offers really sophisticated article extraction, classification, and (this may make your final content editing easier) automatic summarization via straightforward REST APIs, but integrating those services with WordPress will require some engineering.
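I won’t guess at Aylien’s actual API endpoints here, but to give you a feel for what “automatic summarization” means in practice, here’s a crude frequency-based extractive summarizer in plain Python: score each sentence by how often its content words appear in the article, then keep the top-scoring sentences in their original order. Commercial services do far more, but this is the basic shape.

```python
import re
from collections import Counter

# A minimal stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "that"}

def summarize(text, max_sentences=2):
    """Crude extractive summarization: rank sentences by average
    content-word frequency, return the best in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        terms = [w for w in re.findall(r"[a-z']+", sentence.lower())
                 if w not in STOPWORDS]
        return sum(freq[t] for t in terms) / (len(terms) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return " ".join(s for s in sentences if s in ranked)

print(summarize("Feeds feed the pipeline. Feeds arrive daily. The cat sat quietly."))
# -> "Feeds feed the pipeline. Feeds arrive daily."
```

Even this toy version shows why you’d still want a human editor: the “summary” is just sentences lifted verbatim, with no rewriting at all.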
Web scraping? Well, there’s a big, messy set of problems involved with that, and services such as Aylien (starts at $49 per month), Grepsr ($129 per site), or Automate (p.o.a.) can do the heavy lifting for you. If you have very specific needs you might want to roll your own scrapers (check out Chapter 11, “Web Scraping,” of Automate the Boring Stuff with Python for an intro). In general, I wouldn’t recommend a do-it-yourself approach, because what you’ll wind up with is a scraper for each individual site, and every little site change will break your system … you’ll look like one of those plate spinners at the circus as you run around fixing code.
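To show why DIY scraping is so brittle, here’s a minimal scraper built on Python’s standard-library HTMLParser. Note that it’s welded to one hypothetical site’s markup (an `<h2 class="headline">` convention I made up): the moment that site redesigns, this code silently returns nothing, which is exactly the per-site fragility I’m warning about.

```python
from html.parser import HTMLParser

class HeadlineScraper(HTMLParser):
    """Collect the text of every <h2 class="headline"> element.
    The class name is specific to one imaginary site's markup."""
    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines[-1] += data

# A stand-in for a fetched page.
PAGE = """
<html><body>
  <h2 class="headline">Pipeline breaks again</h2>
  <h2>Navigation heading</h2>
  <h2 class="headline">Plates keep spinning</h2>
</body></html>"""

scraper = HeadlineScraper()
scraper.feed(PAGE)
print(scraper.headlines)
```

Multiply this by every source site, and every one of their redesigns, and you have the plate-spinning act.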
As for publicizing your newly created blog content, you might consider using the free WordPress Jetpack Publicize feature which automatically publicizes new blog content to Facebook, Twitter, LinkedIn, Google+, Tumblr, and Path.
So, let’s bottom-line this: there isn’t, as far as I know, a solution that will take multiple, diverse data sources and create really professional blog content as automatically as I think you’d like. As I’ve discussed above, there are lots of tools that get somewhere close to what you’re looking for, but the end result will probably be a home-grown collection of spare parts flying in formation (i.e. the FrankenPress), and true automation (hands-off content collection, analysis, and publishing) is highly unlikely to produce high-quality blog posts. At least for now, you’ll need a human to polish the final output if you want a professional blog. Give it a year or two and I’m sure there will be an A.I. to do the job for you, but for now it’s a human thing.