A few years ago I was the CTO and co-founder of a startup in the medical practice management software space. One of the problems we were trying to solve was how medical office visit schedules can optimize *everyone’s* time. Too often, office visits are scheduled to optimize the physician’s time, and patients have to wait way too long in overcrowded waiting rooms in the company of people coughing contagious diseases out their lungs.

One of my co-founders, a hospital medical director, had a multivariate linear model that could predict the required length for an office visit based on the reason for the visit, whether the patient needs a translator, the average historical visit lengths of both doctor and patient, and other possibly relevant factors. One of the subsystems I needed to build was a monthly regression task to update all of the coefficients in the model based on historical data.

After exploring many options, I chose to implement this piece in R, taking advantage of the wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques implemented in the R system.

One of the attractions for me was the R scripting language, which makes it easy to save and rerun analyses on updated data sets; another attraction was the ability to integrate R and C++. A key benefit for this project was the fact that R, unlike Excel and other GUI analysis programs, is completely auditable.

Alas, that startup ran out of money not long after I implemented a proof-of-concept Web application, at least partially because our first hospital customer had to declare Chapter 7 bankruptcy. Nevertheless, I continue to favor R for statistical analysis and data science.

## Essential R scripting

Sharon Machlis of Computerworld wrote an excellent set of beginner tutorials on R for business intelligence in 2013. It would be silly for me to reinvent those six articles here, so feel free to go read them and come back. The TL;DR version is as follows.

Start by installing R and RStudio on your desktop. Both are free. RStudio is optional, but I like it, and you probably will, too. There are a half-dozen other R IDEs and a dozen editors with some R support, but don’t go crazy trying them all.

Try running R from a command shell (Figure 1), the R Console (Figure 2), and RStudio (Figure 3). Familiarize yourself with some of the R tutorials and demos.

The power of R is illustrated by the deceptively simple calls in Figure 3 to do statistical analysis. For example,

`fm1 <- lm(y ~ x, data=dummy, weight=1/w^2)`

`summary(fm1) `

This says “find the best fit coefficients, fitted values, and residuals for a linear model where `y`

varies with `x`

for the supplied data and weight vectors. Save them in object `fm1`

and then summarize the results.” Earlier in this session we had defined the following:

`w <- 1 + sqrt(x) / 2`

Reading this code is straightforward. Writing it takes a little study. But isn’t hard and there’s lots of free help available, not to mention dozens of books.

In addition to the R help available on the Web and from the Help menu items in the R Console and RStudio, you can get help from the R command line. For example:

`?functionName`

`help(functionName)`

`example(functionName)`

`args(functionName)`

`help.search("your search term")`

`??("my search term")`

To get data into R, either use its sample data, listed by the `data()`

function, or load it from a file:

`mydata <- read.csv("filename.txt")`

R is extremely extensible. The `library()`

and `require()`

functions load and attach add-on packages; `require()`

is designed for use inside other functions. Many add-on packages and the R distributions live in CRAN, the worldwide Comprehensive R Archive Network. The other two common R archives are Omegahat and Bioconductor. Additional packages live in R-Forge.

The R installation copies the base packages and the recommended packages from CRAN into a local library directory, which on a Mac is currently at /Library/Frameworks/R.framework/Versions/3.1/Resources/library/. Running the R `library()`

command without any arguments will list the local packages and the library location. RStudio will also generate the correct `library()`

command to install a listed package when you check the installation check mark in the Packages tab. The command `help(package = packageName)`

will display the functions in the specified package.

There are R packages and functions to load data from any reasonable source, not only CSV files. Beyond the obvious case of delimiters other than commas, which are handled using the `read.table()`

function, you can copy and paste data tables, read Excel files, connect Excel to R, bring in SAS and SPSS data, and access databases, Salesforce, and RESTful interfaces. See, for example, the `foreign`

package.

You don’t really need to learn the syntax for standard data imports, as the RStudio `Tools|Import Dataset`

menu item will help you generate the correct commands interactively by looking at the data from a text file or URL and setting the correct conversion options in drop-down lists based on what you see.

You can see a list of the currently available packages by name on CRAN; this list is much more extensive than the list of recommended packages downloaded to your desktop by default. To install a package from one of the default archives, use the `install.packages`

function:

`install.packages("ggplot2")`

Note that `ggplot2`

is a popular advanced graphics package that has more options than the standard `graphics`

package. Nevertheless, `graphics`

can do a lot. In addition to the graphics in Figures 2 and 3, consider Figures 4 and 5.

R can do much more in terms of graphics and statistical analysis. Do read Sharon Machlis’s tutorial and follow up with her links to additional information. At this point, I want to expand my discussion to how you can analyze big data in R.

## R in the cloud

When R programmers talk about “big data,” they don’t necessarily mean data that goes through Hadoop. They generally use “big” to mean data that can’t be analyzed in memory.

The fact is you can easily get 16GB of RAM in a desktop or laptop computer. R running in 16GB of RAM can analyze millions of rows of data with no problem. Times have changed quite a bit since the days when a database table with a million rows was considered big.

One of the first steps many developers take when their program needs more RAM is to run it on a bigger machine. You can run R on a server; a common 4U Intel server can hold up to 2TB of RAM. Of course, hogging an entire 2TB server for one personal R instance might be a bit wasteful. So people run large cloud instances for as long as they need them, run VMs on their server hardware, or run the likes of RStudio Server on their server hardware.

RStudio Server comes in Free and Pro editions. Both have the same features for individual analysts, but the Pro version offers more in the way of scale: authorization and security, management visibility, performance tuning, support, and a commercial license. According to Roger Oberg of RStudio, the company’s intent is *not* to create paid-only features for individuals.

RStudio Server Pro is integrated with several big data systems. For example, when I was reviewing the IBM Bluemix PaaS, I noticed that R and RStudio are part of IBM's DashDB service (Figure 6). In fact, this is an installation of RStudio Server Pro on Bluemix and SoftLayer, according to Oberg and Tareef Kawaf of RStudio.

There is an additional strategy for running R against big data: Bring down only the data that you need to analyze. In the spirit of MapReduce, Hadoop, Spark, and Storm, you want to winnow the data as you stream it to make in-memory analysis tractable on the reduced data set. To use Kawaf’s example, you may have 100TB of data but need “only” 5 columns and 20 million rows, a mere few hundred megabytes of reduced data.