Technology would speed Clinton email investigation

With the right technology, 650,000 emails are not an obstacle to quickly completing the investigation


U.S. Democratic presidential nominee Hillary Clinton speaks during a visit to Borinquen Health Care Center in Miami, Florida, on Aug. 9, 2016.

Credit: REUTERS/Chris Keane

“If you don’t run an order-of-magnitude test on your thoughts before they come out of your mouth, I am going to have to fire you.” So ended an otherwise fantastic review with my boss, who had earned a Ph.D. in physics from MIT. What she really meant was that I should apply mathematical common sense to my ideas to check their feasibility before I discussed them. I applied her criticism immediately, because I enjoyed working for someone as gifted as she was and the world was in the middle of a recession. I never forgot her comment.

The 650,000 Anthony Weiner and Huma Abedin emails reported by The Wall Street Journal are not, in order-of-magnitude terms, a big number. It is roughly what 20 to 30 office workers deal with in a year. If someone made up the number 650,000 to make it appear an obstacle to quickly completing the investigation, they should have run it through an order-of-magnitude test first.

Manually read the emails

It is technically disturbing that the FBI does not know what is in these emails, which sit on a laptop that has been in its possession since July. Are they printing all the emails and sorting them by hand? Order-of-magnitude check: three months is about 66 work days, and at 100 emails read per day each FBI agent covers about 6,600 emails, so it takes about 100 FBI agents to review the email trove. That is not a big number.
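The check above can be written out as a few lines of arithmetic. This is only a sketch that reuses the figures from the paragraph; the variable names are illustrative.

```python
import math

TOTAL_EMAILS = 650_000           # figure reported for the laptop
WORK_DAYS = 66                   # roughly three months of working days
EMAILS_PER_AGENT_PER_DAY = 100   # assumed manual reading rate

# How many emails one agent can read in the period.
emails_per_agent = WORK_DAYS * EMAILS_PER_AGENT_PER_DAY

# Agents needed to cover the whole trove, rounded up to whole people.
agents_needed = math.ceil(TOTAL_EMAILS / emails_per_agent)

print(emails_per_agent)  # 6600
print(agents_needed)     # 99
```

Changing any one assumption, such as the reading rate, moves the answer by a small factor, not by an order of magnitude.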

Write a program to reduce the emails investigated

Emails are structured data, most likely stored in the RFC 822 message format with MIME, because the rest of the Clinton and DNC emails on WikiLeaks were in this format. Messages have to be formatted this way so that email servers can index and read them and email clients such as Outlook can display them.

Find one agent among the 100 reading through the paper copies of the emails who knows Python, Perl or JavaScript, and the job could be finished in an afternoon. There are many libraries that will speed the parsing of the emails, which would cut out some of the time spent writing code to index and read the key-value pairs in the emails. But RFC 822 emails were designed to be indexed and sorted, so the libraries would not save that much time.
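As a rough illustration of how little code the parsing step needs, here is a minimal sketch using Python's standard `email` package. The sample message, the addresses and the `summarize` helper are invented for this example and are not taken from any real email.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# A made-up RFC 822 style message: headers, a blank line, then the body.
RAW = """\
From: Alice Example <alice@example.gov>
To: bob@example.org
Subject: Lunch
Date: Mon, 05 Mar 2012 10:15:00 -0500
Message-ID: <abc123@example.gov>

See you at noon.
"""

def summarize(raw_text):
    """Pull the header fields an investigator would index on."""
    msg = message_from_string(raw_text)
    return {
        "from": parseaddr(msg["From"])[1].lower(),   # bare address only
        "to": parseaddr(msg["To"])[1].lower(),
        "date": parsedate_to_datetime(msg["Date"]),  # timezone-aware datetime
        "message_id": msg["Message-ID"],
        "subject": msg["Subject"],
    }

record = summarize(RAW)
print(record["from"], record["date"].date(), record["message_id"])
```

Real mailboxes add complications this sketch ignores, such as multiple recipients, missing headers and encoded subjects.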

First, sort the emails by date and discard any sent or received after Mrs. Clinton left the State Department. Then discard any emails that were not sent to or received from Mrs. Clinton’s email server, the State Department or any U.S. or foreign government, based on domain name. Next, compare the emails found on the Weiner and Abedin laptop against the emails previously cleared by the FBI, and discard the ones that match. Call the remaining emails the email dataset of interest.
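The three filters could look something like the sketch below. It assumes each email has already been reduced to a dictionary with `from`, `to`, `date` and `message_id` keys, which is a hypothetical record layout. The cutoff date, the domain list and the use of Message-ID to match previously cleared emails are also assumptions made for illustration, and the `.gov` test is a crude stand-in that does not cover foreign governments.

```python
from datetime import datetime, timezone

# Assumed cutoff: the date Mrs. Clinton left the State Department.
CUTOFF = datetime(2013, 2, 1, tzinfo=timezone.utc)

# Illustrative domains of interest; a real list would be much longer.
DOMAINS_OF_INTEREST = ("clintonemail.com", "state.gov")

def domain_matches(address):
    """True if the address belongs to a listed domain or any .gov domain."""
    domain = address.rsplit("@", 1)[-1].lower()
    if domain.endswith(".gov"):
        return True
    return any(domain == d or domain.endswith("." + d)
               for d in DOMAINS_OF_INTEREST)

def dataset_of_interest(records, cleared_ids):
    """Apply the date, domain and already-cleared filters in order."""
    kept = []
    for r in records:
        if r["date"] >= CUTOFF:
            continue  # sent or received after the cutoff
        if not (domain_matches(r["from"]) or domain_matches(r["to"])):
            continue  # no party of interest on either side
        if r["message_id"] in cleared_ids:
            continue  # already reviewed and cleared
        kept.append(r)
    return kept

def _dt(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)

sample = [
    {"from": "a@state.gov", "to": "b@example.org", "date": _dt(2012, 3, 5), "message_id": "<1>"},
    {"from": "a@state.gov", "to": "b@example.org", "date": _dt(2014, 3, 5), "message_id": "<2>"},
    {"from": "x@example.com", "to": "y@example.org", "date": _dt(2012, 3, 5), "message_id": "<3>"},
    {"from": "h@clintonemail.com", "to": "a@state.gov", "date": _dt(2011, 6, 1), "message_id": "<4>"},
]
remaining = dataset_of_interest(sample, cleared_ids={"<4>"})
print([r["message_id"] for r in remaining])  # ['<1>']
```

Each filter is a cheap header comparison, so the whole pass over 650,000 records is a matter of seconds on an ordinary laptop.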

Identify the email client used to create each email and compare it to the Message-ID to identify forged emails. Then parse the distilled emails for potentially incriminating email servers, IP addresses, senders, words and phrases. According to PBS, U.S. intelligence officials commissioned a system to do exactly that. Discard any emails that do not match those criteria.
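The word-and-phrase pass can be sketched with a single regular expression. The watch list below is purely hypothetical, as is the `subject`/`body` record layout; it illustrates the idea and is not the system PBS described.

```python
import re

# Hypothetical watch list; a real one would come from investigators.
WATCH_TERMS = ["classified", "top secret", "noforn"]

# One case-insensitive pattern that matches any term as a whole word or phrase.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in WATCH_TERMS) + r")\b",
    re.IGNORECASE,
)

def flag(messages):
    """Keep only messages whose subject or body contains a watch term."""
    return [m for m in messages
            if PATTERN.search(m["subject"] + "\n" + m["body"])]

inbox = [
    {"subject": "Lunch", "body": "See you at noon."},
    {"subject": "Briefing", "body": "This cable is Top Secret."},
    {"subject": "FYI", "body": "Nothing declassified here."},
]
flagged = flag(inbox)
print([m["subject"] for m in flagged])  # ['Briefing']
```

The word boundaries keep "declassified" from matching "classified", which shows how much the result depends on how the criteria are written.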

Use a content-understanding machine learning model

Machine learning models for various recognition and understanding tasks exist. That is why email accounts from Google, Microsoft, Yahoo and many other companies are free: the companies can apply machine learning to understand the user better and advertise to him or her better.

Pick a content-understanding artificial intelligence (AI) model and train it with the 30,000 emails the FBI has already cleared. Test the model with samples of classified emails. If the probability of recognizing a classified email is near 100 percent, then process the email dataset of interest through the model. Read any emails identified as potentially classified.

Alternatively, if the probability of identifying classified emails is too low with that approach, train the model with a very large dataset of classified emails until the probability of identifying a classified email is near 100 percent, then process the email dataset of interest through the model. Read any emails identified as potentially classified.
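The train, test and gate loop described above can be sketched with a toy classifier. The code below swaps in a tiny naive Bayes text classifier written from scratch as a stand-in for a real content-understanding model. The training sentences, the labels and the 0.99 threshold standing in for "near 100 percent" are all invented for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = {}
        for text, label in zip(texts, labels):
            self.word_counts.setdefault(label, Counter()).update(tokens(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in tokens(text):
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def recall(model, classified_samples):
    """Share of known classified samples the model recognizes."""
    hits = sum(model.predict(t) == "classified" for t in classified_samples)
    return hits / len(classified_samples)

# Invented training data: cleared emails plus known classified samples.
train_texts = [
    "lunch at noon tomorrow",
    "please print the schedule",
    "secret cable about embassy security",
    "classified briefing on embassy threat",
]
train_labels = ["cleared", "cleared", "classified", "classified"]
held_out_classified = ["classified cable on embassy security"]
emails_of_interest = ["print the lunch schedule", "embassy security briefing"]

RECALL_THRESHOLD = 0.99  # assumed meaning of "near 100 percent"

model = NaiveBayes().fit(train_texts, train_labels)
score = recall(model, held_out_classified)

# Only trust the model on the real dataset if it passes the gate.
to_read = []
if score >= RECALL_THRESHOLD:
    to_read = [e for e in emails_of_interest
               if model.predict(e) == "classified"]
print(score, to_read)
```

If the gate fails, the alternative in the text applies: add more labeled classified examples to the training set and measure again before running the dataset of interest.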

Ask for help

Ask Megan Smith, chief technology officer of the United States and a former vice president of Google, for help. Or ask her predecessor, Todd Park, who rebooted healthcare.gov. Either would have a solution before the day was out.
