**See Top 100 in Books**. In SAS, once we’ve found the line, we can look for the # symbol and extract the number between it and the text “in Books”.

    filename amazon url "http://tinyurl.com/cartoonguide";
    data grabit;
      infile amazon truncover;
      input @1 line $256.;
      if count(line, "See Top 100 in Books") gt 0 then do;
        rankchar = substr(line, find(line, "#")+1,
          find(line, "in Books") - find(line, "#") - 2);
        rank = input(rankchar, comma9.);
        output;
      end;
    run;

The code works by reading in the first 256 characters of each line as a single variable. Then we use the count function to determine whether this is the correct line. Next, we use the substr function to make a new variable containing the characters after the # symbol and before “in Books”. Finally, we convert that character string to a number with the input function, using the comma9. informat to accommodate the commas.

Our approach in R is similar, except that we’ll use some different functions to isolate the number, as annotated within the code below. To help in comprehending the code, readers are encouraged to run the commands line by line and look at each resulting value.

    > # grab contents of web page
    > urlcontents = readLines("http://tinyurl.com/cartoonguide")
    > # find line with sales rank
    > linenum = suppressWarnings(grep("See Top 100 in Books", urlcontents))
    > # split line into multiple elements
    > linevals = strsplit(urlcontents[linenum], ' ')[[1]]
    > # find element with sales rank number
    > entry = grep("#", linevals)
    > charrank = linevals[entry]                       # snag that entry
    > charrank = substr(charrank, 2, nchar(charrank))  # kill '#' at start
    > charrank = gsub(',' ,'', charrank)               # remove commas
    > salesrank = as.numeric(charrank)                 # make it numeric
    > cat("salesrank=", salesrank, "\n")

In our experience, the format of Amazon’s book pages changes often.
The code above may not work on current pages, but it can be tested on the example page mentioned above, at http://www.amherst.edu/~nhorton/sasr2/datasets/cartoon.html. More sophisticated approaches to web scraping can be found in Nolan and Temple Lang [127].

12.4.2 Reading data with two lines per observation

The code from 12.4.1 was run regularly on a server by calling R in batch mode (see B.2.2), with results stored in a cumulative file. While a date-stamp was added, it was written to the file on a separate line. The file (accessible at https://www.amherst.edu/~nhorton/sasr2/datasets/cartoon.txt) has the following form.

    Wed Oct 9 16:00:04 EDT 2013
    salesrank= 3269
    Wed Oct 9 16:15:02 EDT 2013
    salesrank= 4007

To read these data into SAS, we’ll check the first character of each line: if it is an “s”, we have a line with a rank, not a date, in it. We’ll read the two types of lines with different input statements.

    filename salesdat url
      "http://www.amherst.edu/~nhorton/sasr2/datasets/cartoon.txt";
    data sales;
      infile salesdat;
      retain dow month date time edt year;
      input @1 type $1 @;
      if type ne 's' then do;
        input @1 dow $ Month $ date time $ edt $ year;
      end;
      else do;
        input @12 rank;
        datetime = compress(date||month||year||"/"||time);
        salestime = input(datetime,datetime18.);
        if timepart(salestime) lt (8 * 60 * 60) or
           timepart(salestime) gt (18 * 60 * 60) then night=1;
        else night = 0;
        output;
      end;
    run;

To implement our plan, we use the advanced features of the input statement, as in 12.2. Later, we’ll use the time of day in our plot. For clarity, we first construct a character variable with the time, then convert it to a SAS date-time variable (see 2.4) using the input function. We define “night” as times between 6:00 PM and 8:00 AM, and calculate it by extracting just the time from the SAS date-time variable we construct from the information in the file.

We can then examine the first few lines of the file. First, let’s see how SAS handles the date-time variable.
    proc print data=sales (obs=4);
      var datetime salestime rank;
    run;

    Obs    datetime               salestime     rank
      1    30Sep2013/00:00:03    1696118403     5151
      2    30Sep2013/00:15:03    1696119303     5151
      3    30Sep2013/00:30:03    1696120203     4162
      4    30Sep2013/00:45:03    1696121103     4162

Those salestime values look like the counts of seconds they are supposed to be. But let’s also see whether they converted correctly.

    proc print data=sales (obs=4);
      var datetime salestime rank;
      format salestime datetime18.;
    run;

    Obs    datetime               salestime           rank
      1    30Sep2013/00:00:03    30SEP13:00:00:03     5151
      2    30Sep2013/00:15:03    30SEP13:00:15:03     5151
      3    30Sep2013/00:30:03    30SEP13:00:30:03     4162
      4    30Sep2013/00:45:03    30SEP13:00:45:03     4162

The character and date-time versions match.

In R, we begin by reading the file, then we calculate the number of entries by dividing the file’s length by two. Next, two empty vectors of the correct length and type are created to store the data. Once this preparatory work is completed, we loop (4.1.1) through the file, reading in the odd-numbered lines as date/time values from the Eastern U.S. time zone, with daylight saving time applied. The gsub() function (2.2.14) replaces matches determined by regular expression matching. In this situation, it is used to remove the time zone from the line before this processing. These date/time values are read into the timeval vector. Even-numbered lines are read into the rank vector, after removing the strings salesrank= and NA (again using two calls to gsub()). Finally, we make a dataframe (B.4.6) from the two vectors and display the first few lines using the head() function (1.2.1).
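The parsing applied to each pair of lines can be tried in isolation before looping over the whole file. The following is a minimal sketch, assuming the first two sample lines of the file excerpt shown earlier; the names stampline, rankline, stamp, and onerank are ours for illustration and do not appear in the original code.

```r
# Two sample lines in the format of cartoon.txt (copied from the file excerpt)
stampline = "Wed Oct 9 16:00:04 EDT 2013"
rankline = "salesrank= 3269"

# Remove the time zone abbreviation, then parse the remaining fields.
# The %a and %b fields assume an English locale for day and month names.
stamp = as.POSIXlt(gsub('EST', '', gsub('EDT', '', stampline)),
   tz="EST5EDT", format="%a %b %d %H:%M:%S %Y")

# Remove the label (and any NA marker) before converting to a number
onerank = as.numeric(gsub('NA', '', gsub('salesrank= ', '', rankline)))

print(stamp)
print(onerank)
```

Running the two steps on a single pair of lines makes it easier to check the format string before committing to the loop; a mismatch in the format string typically shows up as an NA date rather than an error.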
    > library(RCurl)
    > myurl = getURL("https://www3.amherst.edu/~nhorton/sasr2/datasets/cartoon.txt",
         ssl.verifypeer=FALSE)
    > file = readLines(textConnection(myurl))
    > n = length(file)/2
    > rank = numeric(n)
    > timeval = as.POSIXlt(rank, origin="1960-01-01")
    > for (i in 1:n) {
         timeval[i] = as.POSIXlt(gsub('EST', '', gsub('EDT', '', file[(i-1)*2+1])),
            tz="EST5EDT", format="%a %b %d %H:%M:%S %Y")
         rank[i] = as.numeric(gsub('NA', '', gsub('salesrank= ','', file[i*2])))
      }
    > timerank = data.frame(timeval, rank)

Note that the file is being read from an HTTPS (Hypertext Transfer Protocol Secure) connection (1.1.12) and string data is converted to date and time variables (2.4.6). The first four entries of the file are given below.

    > head(timerank, 4)
                  timeval rank
    1 2013-09-30 00:00:03 5151
    2 2013-09-30 00:15:03 5151
    3 2013-09-30 00:30:03 4162
    4 2013-09-30 00:45:03 4162

12.4.3 Plotting time series data

While it is straightforward to make a simple plot of the data from 12.4.2 using code discussed in 8.3.1, we’ll augment the display by indicating whether the rank was recorded in the nighttime (eastern U.S. time) or not. Then we’ll color the nighttime ranks differently from the daytime ranks.

In SAS, we already made the nighttime indicator. We’ll use the sgplot procedure (8.3.1) to make the plot.

    proc sgplot data=sales;
      series y=rank x=salestime / lineattrs=(thickness=2 color=black);
      scatter y=rank x=salestime / group=night grouporder=ascending
        markerattrs=(symbol=circlefilled) name="night";
      refline "30SEP13/23:59:59"dt / axis=x
        lineattrs=(thickness=2 color=black pattern=shortdash);
      keylegend "night" / title="Night" location=inside position=top;
      xaxis label=" ";
      format salestime datetime8.;
    run;

The plot is composed of several pieces. The series statement makes a time series plot, or a line plot, connecting the values in sequence.
The scatter statement adds symbols for the observed ranks, plotting day and night values in different colors using the group option. The refline statement adds a vertical line at the end of September. Note the use of the trailing dt to convert the text to the date-time format in which the x axis values are recorded. The keylegend statement improves the default legend by bringing it into the plot area. The remaining options are fairly self-explanatory.

For R, we begin by creating a new variable reflecting the date-time at the midnight before we started collecting data. We then coerce the time values to numeric values using the as.numeric() function (2.2.7) while subtracting that midnight value. Next, we call the hour() function in the lubridate package (2.4) to get the hour of measurement.

    > library(lubridate)
    > timeofday = hour(timeval)
    > night = rep(0,length(timeofday))   # vector of zeroes
    > night[timeofday < 8 | timeofday > 18] = 1

Then we can build the plot.

    > plot(timeval, rank, type="l", xlab="", ylab="Amazon Sales Rank")
    > points(timeval[night==1], rank[night==1], pch=20, col="black")
    > points(timeval[night==0], rank[night==0], pch=20, col="red")
    > legend(as.POSIXlt("2013-10-03 00:00:00 EDT"), 6000,
         legend=c("day","night"), col=c("red","black"), pch=c(20,20))
    > abline(v=as.numeric(as.POSIXlt("2013-10-01 00:00:00 EST")), lty=2)

The time series plot is requested by the type="l" option, and symbols for the ranks are added with calls to the points() function. The abline() function adds a reference line at the start of October. The results for both SAS and R are displayed in Figure 12.5.
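The night indicator does not depend on lubridate. A minimal sketch using only base R is given here, assuming a short made-up vector of times (demo_times, demo_hour, and demo_night are our names, for illustration only). Note that the condition timeofday > 18 in the R code above treats the whole 6 PM hour as daytime, whereas the SAS data step flags any time after 18:00:00 as night; the sketch uses >= 18 to match the SAS definition more closely, which is our choice rather than the original authors'.

```r
# Hypothetical times for illustration only (not taken from the data file)
demo_times = as.POSIXlt(c("2013-10-01 07:45:00", "2013-10-01 08:00:00",
   "2013-10-01 17:59:00", "2013-10-01 18:15:00"), tz="EST5EDT")

# Components of a POSIXlt object can be extracted directly; $hour runs 0-23
demo_hour = demo_times$hour

# Night: before 8:00 AM, or from 6:00 PM onward
demo_night = as.numeric(demo_hour < 8 | demo_hour >= 18)

print(data.frame(time=format(demo_times, "%H:%M"), night=demo_night))
```

Working from the hour alone is coarser than the SAS comparison on seconds, so the two definitions can still differ for an observation recorded at exactly 18:00:00.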
[Figure 12.5: Sales plot. Panel (a), SAS: rank against date, 30 Sep to 10 Oct 2013, with a Night legend. Panel (b), R: Amazon Sales Rank against date, with a day/night legend.]

12.4.4 URL APIs and truly random numbers

Usually, we’re content to use a pseudo-random number generator. But sometimes we may want numbers that are actually random. An example might be for randomizing treatment status in a randomized controlled trial.

The site Random.org provides truly random numbers based on radio static. For long simulations which need a huge number of random numbers, the quota system at Random.org may preclude its use. But for small to moderate needs, it can be used to provide truly random numbers. In addition, you can purchase larger quotas if need be.

The site provides application programming interfaces (APIs) for several types of information. We’ll demonstrate how to use these to pull vectors of uniform (0,1) random numbers (of 10^-9 precision) and to check the quota. To generate random variates from other distributions, you can use the inverse probability integral transform (3.1.10).

In SAS, we’ll make a macro to grab the desired number of random values and save them in a dataset. The challenging bit is to pass the desired number of random numbers off to the API, through the macro system. This is hard because the API includes the special characters ?, ", and, most notably, &. The ampersand is used by the macro system to denote the start of a macro variable and is used in APIs to indicate that an additional parameter follows. To avoid processing these characters as part of the macro syntax, we have to enclose them within the macro quoting function %nrstr(). We use this approach twice, for the fixed pieces of the API, and between them insert the macro variable that contains the number of random numbers desired. Also note that the sequence %" is used to produce the quotation mark.
Then, to unmask the resulting character string and use it as intended, we %unquote() it. Note that the line breaks printed below in the filename statement must be removed for the code to work.

    %macro rands (outds=ds, nrands=);
      filename randsite url %unquote(
        %nrstr(%"http://www.random.org/integers/?num=)
        &nrands
        %nrstr(&min=0&max=1000000000&col=1&base=10&
        format=plain&rnd=new%"));
      proc import datafile=randsite out=&outds dbms=dlm replace;
        getnames=no;
      run;
      data &outds;
        set &outds;
        var1 = var1 / 1000000000;
      run;
    %mend rands;

Running the macro and examining the output is trivial.

    %rands(nrands=7, outds=myrs);
    proc print data=myrs;
    run;

    Obs        VAR1
      1    0.502746213
      2    0.247134785
      3    0.620425172
      4    0.932627004
      5    0.266144436
      6    0.967032193
      7    0.751058844

The companion macro to find the quota is slightly simpler, since we don’t need to insert the number of random numbers in the middle of the URL. Here, we show the quota in the SAS log; the file print syntax, shown in http://tinyurl.com/sasrblog-robust, can be used to send it to the output instead.

    %macro quotacheck;
      filename randsite url %unquote(%nrstr(%"http://www.random.org/quota/?format=plain%"));
      proc import datafile=randsite out = __qc dbms = dlm replace;
        getnames = no;
      run;
      data _null_;
        set __qc;
        put "Remaining quota is " var1 "bytes";
      run;
    %mend quotacheck;

    %quotacheck;

    Remaining quota is 996040 bytes

Two R functions are shown below. While the problem isn’t as difficult as in SAS, it is necessary to enclose the character string for the URL in the as.character() function (1.1.4).
    > truerand = function(numrand) {
         read.table(as.character(paste("http://www.random.org/integers/?num=",
            numrand,
            "&min=0&max=1000000000&col=1&base=10&format=plain&rnd=new",
            sep="")))/1000000000
      }
    > quotacheck = function() {
         line = as.numeric(readLines(
            "http://www.random.org/quota/?format=plain"))
         return(line)
      }
    > truerand(7)
         V1
    1 0.780
    2 0.804
    3 0.502
    4 0.377
    5 0.537
    6 0.580
    7 0.135
    > quotacheck()
    [1] 1e+06

12.5 Manipulating bigger datasets

In this example, we consider analysis of the Data Expo 2009 commercial airline flight dataset [197], which includes details of n = 123,534,969 flights from 1987 to 2008. We consider the number of flights originating from Bradley airport (code BDL, serving Hartford, CT and Springfield, MA). Because of the size of the data, we will demonstrate use of a database system accessed using a structured query language (SQL) [172].

Full details are available on the Data Expo website (http://stat-computing.org/dataexpo/2009/sqlite.html) regarding how to download the Expo data as comma-separated files (1.6 gigabytes compressed, 12 gigabytes uncompressed through 2008), set up and index a database (19 gigabytes), then access it from within R.

In SAS, analysis can be undertaken on an internal or external database server running MySQL. The following code extracts the total number of flights from Bradley International Airport, by day, month, and year. Here the SELECT statement specifies the five variables to be included (one of which is the count of flights), the name of the table, what values to include (only BDL), and the level at which to aggregate (unique day).

    proc sql;
      connect to mysql (user="tlehrer" server="hen3ry.mgh.edu"
        password="FakePW1" dbname="airlines");
      execute (create table ds as
        SELECT DayofMonth, Month, Year, Origin, sum(1) as numFlights
        FROM ontime
        WHERE Origin='BDL'
        GROUP BY DayofMonth,Month,Year) by mysql;
      disconnect from mysql;
    quit;
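The aggregation requested by the SELECT statement (filter on origin, then count rows within each day) can be illustrated on a small scale without a database. The sketch below uses base R on a tiny made-up data frame; toy, bdl, and counts are hypothetical names, and the rows are invented rather than drawn from the Data Expo files.

```r
# Hypothetical miniature version of the ontime table (made-up rows)
toy = data.frame(
   DayofMonth = c(1, 1, 1, 2, 2, 1),
   Month      = c(1, 1, 1, 1, 1, 1),
   Year       = c(2008, 2008, 2008, 2008, 2008, 2008),
   Origin     = c("BDL", "BDL", "BOS", "BDL", "BOS", "BDL"))

# Counterpart of: WHERE Origin='BDL'
bdl = subset(toy, Origin == "BDL")

# Counterpart of: sum(1) ... GROUP BY DayofMonth, Month, Year
# (add a column of ones, then sum it within each day)
bdl$one = 1
counts = aggregate(one ~ DayofMonth + Month + Year + Origin, data=bdl, FUN=sum)
names(counts)[names(counts) == "one"] = "numFlights"

print(counts)
```

For data of the size considered in this section the database does this work instead, so that only the aggregated rows, rather than all of the individual flights, need to be brought into the analysis session.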
A similar system for allowing access to databases is SQLite, a self-contained, serverless, transactional SQL database engine. To use this with R, the analyst installs the sqlite software library (http://sqlite.org). Next, the input files must be downloaded to the local machine, a database set up (by running sqlite3 ontime.sqlite3 at the shell command line), a table created with the appropriate fields, the files loaded using a series of .import statements, and access sped up by adding indexing. Then the RSQLite package can be used to create a connection to the database.

    > library(RSQLite)
    > con = dbConnect(SQLite(), dbname = "/Home/Airlines/ontime.sqlite3")
    > ds = dbGetQuery(con, "SELECT DayofMonth, Month, Year, Origin,
         sum(1) as numFlights FROM ontime WHERE Origin='BDL'
         GROUP BY DayofMonth,Month,Year")
    > # returns a data frame with 7,763 rows and 5 columns
    > ds = transform(ds, date =
         as.Date(paste(Year, "-", Month, "-", DayofMonth, sep="")))
    > ds = transform(ds, weekday = weekdays(date))
    > ds = ds[order(ds$date),]
    > mondays = subset(ds, weekday=="Monday")
    > library(lattice)
    > xyplot(numFlights ~ date, xlab="", ylab="number of flights on Monday",
         type="l", col="black", lwd=2, data=mondays)

As in the SAS example, the SELECT statement specifies the five variables to be included (one of which is the count of flights), the name of the table (ontime), what flights to include (only those originating at BDL), and the level at which to aggregate (unique day). The results are plotted in Figure 12.6. Similar functionality is provided for MySQL databases using the RMySQL package.

12.6 Constrained optimization: the knapsack problem

The website http://rosettacode.org/wiki/Knapsack_Problem describes a fanciful trip by a traveler to Shangri La. Upon leaving, the traveler is allowed to take as much of three valuable items as they like, as long as they fit in a knapsack.
A maximum weight of 25 units can be taken, with a total volume of 25 cubic units. The weights, volumes, and values of the three items are given in Table 12.1. How can the traveler maximize the value of the items?

Table 12.1: Weights, volume, and values for the knapsack problem

    Item       Weight    Volume    Value
    Panacea      0.3       2.5      3000
    Ichor        0.2       1.5      1800
    Gold         2.0       0.2      2500

[Figure 12.6: Number of flights departing Bradley airport on Mondays over time. Vertical axis: number of flights on Monday; horizontal axis: year.]

It is straightforward to calculate the solutions using brute force, by iterating over all possible combinations and eliminating those that are overweight or too large to fit. In SAS, this task is undertaken as a data step.

    data one;
      wtpanacea=0.3;  wtichor=0.2;  wtgold=2.0;
      volpanacea=0.025;  volichor=0.015;  volgold=0.002;
      valpanacea=3000;  valichor=1800;  valgold=2500;
      maxwt=25;  maxvol=0.25;
      /* find upper bound for looping */
      maxpanacea = floor(min(maxwt/wtpanacea, maxvol/volpanacea));
      maxichor = floor(min(maxwt/wtichor, maxvol/volichor));
      maxgold = floor(min(maxwt/wtgold, maxvol/volgold));
      /* loop */
      do panacea = 0 to maxpanacea;
        do ichor = 0 to maxichor;
          do gold = 0 to maxgold;
            output;
          end;
        end;
      end;
    run;

Note that the SAS code expresses volumes in units 100 times larger than those of Table 12.1 (0.025 rather than 2.5, with a limit of 0.25 rather than 25), which leaves the solution unchanged.

The resulting dataset includes improper values: combinations with too much weight or volume. We prune them out in a separate data step where we also calculate the total weight and volume implied. We discard values over the limit using the subsetting if statement (2.3.1). Note that these statements could have been included in the innermost do loop above. We put them in a separate data step for clarity of presentation.
    data two;
      set one;
      totalweight = wtpanacea*panacea + wtichor*ichor + wtgold*gold;
      totalvolume = volpanacea*panacea + volichor*ichor + volgold*gold;
      if (totalweight le maxwt) and (totalvolume le maxvol);
      vals = valpanacea*panacea + valichor*ichor + valgold*gold;
    run;

To find the maximum value that fits in the knapsack, we can sort the dataset and print the results. Here we show the top five values.

    proc sort data=two;
      by descending vals;
    run;

    proc print data=two (obs=5) noobs;
      var panacea ichor gold vals totalweight;
    run;

    panacea    ichor    gold     vals    totalweight
       0         15      11     54500       25.0
       3         10      11     54500       24.9
       6          5      11     54500       24.8
       9          0      11     54500       24.7
       1         13      11     53900       24.9

We can maximize the value at the minimum weight with 9 panacea and 11 gold.

In R, we define a number of support functions, then run over all possible values of the knapsack contents (after expand.grid() generates the list). The findvalue() function checks the constraints and sets the value to 0 if they are not satisfied, and otherwise calculates the value of the set. The apply() function (see 2.6.4) is used to run a function on each row of the grid of combinations.

    > # Define constants and useful functions
    > weight = c(0.3, 0.2, 2.0)
    > volume = c(2.5, 1.5, 0.2)
    > value = c(3000, 1800, 2500)
    > maxwt = 25
    > maxvol = 25
    > # minimize the grid points we need to calculate
    > max.items = floor(pmin(maxwt/weight, maxvol/volume))
    > # useful functions
    > getvalue = function(n) sum(n*value)
    > getweight = function(n) sum(n*weight)
    > getvolume = function(n) sum(n*volume)
    > # main function: return 0 if constraints not met,
    > # otherwise return the value of the contents, and their weight
    > findvalue = function(x) {
         thisweight = apply(x, 1, getweight)
         thisvolume = apply(x, 1, getvolume)
         fits = (thisweight <= maxwt) & (thisvolume <= maxvol)
         vals = apply(x, 1, getvalue)
         return(data.frame(panacea=x[,1], ichor=x[,2], gold=x[,3],
            value=fits*vals, weight=thisweight, volume=thisvolume))
      }
    > # Find and evaluate all possible combinations
    > combs = expand.grid(lapply(max.items, function(n) seq.int(0, n)))
    > values = findvalue(combs)

Now we can display the solutions.

    > max(values$value)
    [1] 54500
    > values[values$value==max(values$value),]
         panacea ichor gold value weight volume
    2067       9     0   11 54500   24.7   24.7
    2119       6     5   11 54500   24.8   24.7
    2171       3    10   11 54500   24.9   24.7
    2223       0    15   11 54500   25.0   24.7

The first solution (with 9 panacea, no ichor, and 11 gold) satisfies the volume constraint, maximizes the value, and also minimizes the weight. More sophisticated approaches are available using the lpSolve package for linear/integer problems.

Appendix A
Introduction to SAS

The SAS™ system is a programming and data analysis package developed and marketed by SAS Institute, Cary NC (SAS). SAS markets many products which are modular parts of an integrated environment. In this book we address software available in the Base SAS, SAS/STAT, SAS/GRAPH, SAS/ETS, and SAS/IML products. Base SAS provides a wide range of data management and analysis tools, while SAS/STAT and SAS/GRAPH provide support for more sophisticated statistics and graphics, respectively.
We touch briefly on the IML (interactive matrix language) module, which provides extensive matrix functions and manipulation, and the SAS/ETS module, which supports time series tools and other specialized procedures.

All of these products are typically included in educational institution installations, for which SAS offers discounts. Another option is SAS “OnDemand” (http://support.sas.com/learn/ondemand), a cloud-based system. In general, pricing information can be obtained only by contacting SAS directly, as noted at http://www.sas.com/nextsteps. SAS does market a system limited to Base SAS, SAS/STAT, and SAS/GRAPH, the current licensing fee for which is $8,700.

A.1 Installation

Once licensed, a set of installation disks or a USB key is mailed; this package includes detailed installation instructions tailored to the operating system for which the license was obtained. Also necessary is a special “setinit” file sent from SAS which functions as a password allowing installation of licensed products. An updated setinit file is sent upon purchase of a license renewal.

A.2 Running SAS and a sample session

Once installed, a recommended step for a new user is to start SAS and run a sample session. Starting SAS in a GUI environment opens a SAS window, as displayed in Figure A.1. The window is divided into two panes. On the left is a navigation pane with Results and Explorer tabs, while on the right is an interactive windowing environment with Editor, Log, and Output windows.

Effectively, the right-hand pane is like a limited graphical user interface (GUI) in itself. There are multiple windows, any one of which may be maximized, minimized, or closed. Their contents can also be saved to the operating system or printed. Depending on the code submitted, additional windows may open in this area.
To open a window, click on its name at the bottom of the right-hand pane; to maximize or minimize within the SAS GUI, click on the standard icons your operating system uses for these actions.

[Figure A.1: SAS Windows interface]

On starting SAS, the cursor will appear in the Editor window. Commands such as those in the sample session which follows are typed there. They can also be read into the window from previously saved text files using File; Open Program from the menu bar. Typing the code doesn’t do anything, even if there are carriage returns in it. To run code, it must be submitted; this can be done by clicking the submit button in the GUI as in Figure A.2 or using keyboard shortcuts. After code is submitted, SAS processes the code. Results are not displayed in the Editor window, but in the Results window, and comments from SAS on the commands which were run are displayed in the Log window. If output lines (typically analytic results) are generated, the Output window will jump to the front.

In the left-hand pane, the Explorer tab can be used to display datasets created within the current SAS session or found in the operating system. The datasets are displayed in a spreadsheet-like format. Navigation within the Explorer pane uses idioms familiar to users of GUI-based operating systems. The Results tab allows users to navigate among the output generated during the current SAS session.

As a sample session, consider the following SAS code, which generates 100 normal variates (see 3.1.6) and 100 uniform variates (see 3.1.4), displays the first five of each (see 1.2.1), and calculates a series of summary statistics (see 5.1.1). These commands would be typed directly into the Editor window.
[Figure A.2: Running a SAS program]

    /* This is the sample session */
    data test;
      do i = 1 to 100;
        x1 = normal(0);
        x2 = uniform(0);
        output;
      end;
    run;

    proc print data=test (obs=5);
    run;

    ods select moments;
    proc univariate data=test;
      var x1 x2;
    run;

A user can run a section of code by selecting it using the mouse and clicking the “running figure” (submit) icon near the right end of the toolbar, as shown in Figure A.2. Clicking the submit button when no text is selected will run all of the contents of the window. As demonstrated, text typed between /* and */ is a comment, not interpreted by SAS.

This code is available for download from the book website: http://www.amherst.edu/~nhorton/sasr2/examples/sampsess.sas. Additional code from the text can be found at http://www.amherst.edu/~nhorton/sasr2/examples.

We discuss each block of code in the sample session.

    data test;
      do i = 1 to 100;
        x1 = normal(0);
        x2 = uniform(0);
        output;
      end;
    run;

After selecting and submitting the above code there is no output, since none was requested, but the Log window will contain some new information:

    1    data test;
    2      do i = 1 to 100;
    3        x1 = normal(0);
    4        x2 = uniform(0);
    5        output;
    6      end;
    7    run;

    NOTE: The dataset WORK.TEST has 100 observations and 3 variables.
    NOTE: DATA statement used (Total process time):
          real time           0.01 seconds
          cpu time            0.01 seconds

This indicates that the commands ran without incident, creating a dataset called WORK.TEST with 100 rows and 3 columns (one for i, one for x1, and one for x2). The line numbers shown at the left can be used in debugging code.

Next consider the proc print code.

    proc print data=test (obs=5);
    run;

When these commands are submitted, the Results window will show a table like the one shown in Figure A.3. Note that only five observations are shown because obs=5 was specified (A.6.1).
Omitting this will result in all 100 lines of data printing.

[Figure A.3: Results from proc print]

Finally, data are summarized by submitting the lines specifying the univariate procedure.

    ods select moments;
    proc univariate data=test;
      var x1 x2;
    run;
    ods select all;

The Results window will display the table shown in Figure A.4 (and a similar table will be created for x2).

[Figure A.4: Results from proc univariate]

As with the obs=5 specified in the proc print statement above, the ods select moments statement causes a subset of the default output to be printed. By default, SAS often generates voluminous output that can be hard for new users to digest and would take up many pages of a book. We use the ODS system (A.7) to select pieces of the output throughout the book.

For each of these submissions, additional information is presented in the Log window. While some users may ignore the Log window unless the code did not work as desired, it is always good practice to examine the log carefully, as it contains warnings about unexpected behavior as well as descriptions of errors which cause the code to execute incorrectly or not at all. In addition, it provides confirmation of expected behavior.

Note that the contents of the Editor, Log, and Results windows can be saved in typical GUI fashion by bringing the window to the front and using File; Save through the menus.

In addition to the HTML-formatted Results window, SAS can also print results to the Output window. Output in the Output window is text based, which may be advantageous in some settings. It is also faster. To have SAS send results to both the Output and Results windows, use the GUI menus: Tools; Options; Preferences; Results tab; then click the Listing box.

Figure A.5 shows the appearance of the SAS window after running the sample program and clicking the Output window.
The Output window can be scrolled through to find results, or the Results tab shown in the left-hand pane can be used to find particular pieces of output more quickly.

[Figure A.5: The SAS window after running the sample session code]

Figure A.6 shows the view of the dataset found through the Explorer window by clicking through Libraries; Work; Test. Datasets not assigned to permanent storage in the operating system (see 1.2.3) are kept in a temporary SAS library called the “Work” library.

A.3 Learning SAS and getting help

This book is not intended as an introduction to SAS. There are, however, numerous tools available for learning SAS. At least three of these are built into the program. Under the Help menu in the Menu bar are “Getting Started with SAS Software” and “Learning SAS Programming.” In the on-line help, under the Contents tab is “Learning to Use SAS.” For those interested in learning about SAS but without access to a working version, an on-line option is the excellent UCLA statistics website, which includes the “SAS Starter Kit” (http://www.ats.ucla.edu/stat/sas/sk/default.htm).

SAS Institute also offers several ways to get help. The central place to start is their web site, where the front page for support is currently http://support.sas.com/techsup, which has links to discussion forums, support documents, and instructions for submitting an e-mail or phone request for technical support.

Complete documentation is included with SAS installation by default. Clicking the icon of the book with a question mark in the GUI (Figure A.7) will open a new window with a tool for viewing the documentation (Figure A.8). While there are Contents, Index, Search, and Favorites tabs in the help tool, we generally use the Contents tab as a starting point.
Expanding the SAS Products folder here will open a list of SAS products (i.e., Base SAS, SAS/STAT, etc.), as well as a link to a list of SAS procedures. Detailed documentation for the desired procedure can be found under the product which provides i i “book” — 2014/5/24 — 9:57 — page 347 — #369 i i i i i i A.4. FUNDAMENTAL ELEMENTS OF SAS SYNTAX 347 Figure A.6: The SAS Explorer window access to that proc or directly in the list of procedures. In the text, we provide occasional pointers to the on-line help, using the folder structure of the help tool to provide directions to these documents. A.4 Fundamental elements of SAS syntax The SAS syntax can be broken into three main elements: the data step, procedures, and global statements. The data step is used to manage and manipulate data. Procedures are generally ways to do some kind of analysis and get results. Users of SAS refer to procedures as “procs.” Global statements are generally used to set parameters and make optional choices that apply to the output of one or more procedures. A typical data step might read as follows. data newtest; set test; logx = log(x); run; In this code a new variable named logx is created by taking the natural log of the variable x. The data step works by applying the instructions listed, sequentially, to each line of the input dataset, which in this case is named using the set statement, then writing that line of data out to the dataset named in the data statement. Data steps and procedures are typically multi-statement collections. Both are terminated with a run statement. As shown above, statements in SAS are separated by semicolons, meaning that carriage returns and line breaks are ignored. When SAS reads the run statement in the example (when it reaches the “;” after the word run), it writes out the processed line of data, then repeats i i “book” — 2014/5/24 — 9:57 — page 348 — #370 i i i i i i 348 APPENDIX A. 
the statements for each line of the input data. In this example, a line of data is read from the test dataset, the logx variable is generated, and the line of data (including logx, x, and any other data stored in test) is written to the new dataset newtest.

Figure A.7: Opening the on-line help

Figure A.8: The SAS Help and Documentation window

A typical procedure in SAS might read as follows.

    proc glm data=newtest;
      model y = logx / solution;
    run;

Many procedures require multiple statements to function. For example, the glm procedure requires both a proc glm statement and a model statement. Here, we show the two ways that options can be specified in SAS. One way is by simply listing optional syntax after the statement name. In the proc glm (6.1.1) statement above, we specify, using the data option, that the dataset that should be used is the newtest dataset. Without this option SAS defaults to using the most recently created dataset. As a matter of style, we always specify the dataset using the data option, which can be used with any and all procs. Naming datasets explicitly in each procedure minimizes errors and makes code clearer.

The model statement shown demonstrates another way that options are specified, namely, after a forward slash. In general, this syntax is used when the main body of the statement may include separate words. For example, the slash in the model statement above separates the model specification (y = logx) from the solution option that requests the parameter estimates in addition to the default ANOVA table.

We refer to any SAS code appearing between semicolons generically as “statements.” Most statements appear within data steps or procs. Global statements are special statements that need not appear within a data step or a proc. An example would be the following code.
    options ls=78 ps=60 nocenter;

This options statement affects the formatting of output pages, limiting output to 78 characters per line and 60 lines per page, while removing the default centering.

A.5 Work process: The cognitive style of SAS

A typical SAS work session involves first writing a data step or loading a saved command file (conventionally saved with a .sas extension) which might read in or perhaps modify a saved dataset. Then a proc is written to perform a desired analysis. The output is examined, and based on the results, the data step is modified to generate new variables, the proc is edited to choose new options, new procs are written, or some subset of these steps is repeated. At the end of the session, the dataset might be saved in the native SAS format, the commands saved in text format, and the results printed onto paper or saved (conventionally with a .lst extension). Alternatively, reproducible analysis tools (11.3) might be employed to keep the code, results, and interpretation together.

A.6 Useful SAS background

A.6.1 Dataset options

In addition to data steps for manipulating data, SAS allows on-the-fly modification of datasets. This approach, while less than ideal for documentation, can be a useful way to reduce code length: rather than create a new dataset with a subset of observations, or with a renamed variable, this can be done simultaneously with specifying the dataset to be used in a procedure. The syntax for these commands, called “dataset options” in SAS documentation, is to list them in parentheses after naming the dataset. So, for example, to exclude extraneous variables from an analysis, the following code could save time if the dataset were large.

    proc ttest data=test2 (keep=x y);
      class x;
      var y;
    run;

Another useful dataset option limits the number of observations used from the named dataset.
    proc ttest data=test2 (obs=60);
      class x;
      var y;
    run;

A full list of dataset options can be found in the on-line documentation: Contents; SAS Products; Base SAS; SAS Data Set options: Reference; Data Set Options Dictionary.

A.6.2 Subsetting

It is often convenient to restrict the membership in a dataset or run analyses on a subset of observations. There are three main ways we do this in SAS. One is through the use of a subsetting if statement in a data step. The syntax for this is simply

    data ...;
      set ...;
      if condition;
    run;

where condition is a logical statement such as x eq 2. (See 4.1.2 for a discussion of logical operators.) This includes only observations for which the condition is true, because when an if statement does not include a then, the implied then clause is interpreted as “then output this line to the dataset; otherwise do not output it.”

A second approach is a where statement. This can be used in a data step or in a procedure.

    proc ... data=ds;
      where condition;
      ...
    run;

Finally, there is also a where dataset option which can be used in a data step or in a procedure; the syntax here is slightly different.

    proc ... data=ds (where=(condition));
      ...
    run;

The differences between the where statement and the where dataset option are subtle and beyond our scope here. However, it is generally computationally cheaper to use a where approach than a subsetting if.

A.6.3 Formats and informats

SAS provides special tools for displaying variables or reading them in when they have complicated or unusual constructions in raw text. A good example for this is dates, for which June 27, 2009 might be written as, for example, 6-27-09, 27-6-09, 06/27/2009, and so on. SAS stores dates as the integer number of days since January 1, 1960. To convert one of the aforementioned expressions to the desired storage value, 18075, one could use an informat to describe the way the data is written. For example, if the data were stored as the above expressions, the informats mmddyy8., ddmmyy8., and mmddyy10., respectively, would read them correctly as 18075. An example of reading in dates is shown in 1.1.2.

In contrast, displaying data in styles other than that in which it is stored is done using the informat’s inverse, the format. The format for display can be specified within a proc. For example, if we plan a time series plot of x*time and want the x axis labeled in quarters (e.g., 2010Q3), we could use the following code, where the time variable is the integer-valued date.

    proc gplot data=ds;
      plot x*time;
      format time yyq6.;
    run;

Another example is deciding how many decimal digits to display. For example, if you want to display two decimal places for variable p and three for variable x, you could use the following code. This topic is also discussed in 1.2.1.

    proc print data=ds;
      var p x;
      format p 4.2 x 5.3;
    run;

More information on informats and formats can be found in the on-line documentation: Contents; SAS Products; Base SAS; SAS Formats and Informats.

A.7 Accessing and controlling SAS output: the Output Delivery System

SAS does not provide access to most of the internal objects used in calculating results. Instead, it provides specific access to many objects of interest through various procedure statements. The ways to find these objects can be idiosyncratic, and we have tried to highlight the most commonly needed objects in the text. This situation is roughly equivalent to the need in R to know the full name of an object before it can be accessed.

A general way to access and control output within SAS is through the output delivery system or (redundantly, as in “ATM machine”) the ODS system. This is a very powerful and flexible system for accessing procedure results and controlling printed output.
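For comparison, the R side of the analogy drawn above can be sketched as follows. This is a minimal illustration rather than code from the text: the simulated data, the seed, and the object name tt are assumptions made for the example. A fitted result in R is typically a list, and its pieces can be retrieved once their names are known.

```r
# Simulated two-group data (an assumption for illustration only)
set.seed(42)
x = rep(c(1, 0), each = 50)
y = rnorm(100) + x

# Pooled two-sample t test; the result is stored as an object
tt = t.test(y ~ x, var.equal = TRUE)

names(tt)       # component names that can be extracted from the object
tt$statistic    # the t statistic
tt$parameter    # degrees of freedom for the pooled test
tt$p.value      # the p-value
```

Running names() first plays roughly the role of discovering what a result is called before asking for it.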
We use the ODS system mainly for two tasks: 1) to save procedure output into explicitly named datasets and 2) to suppress some printed output from procedures which generate lengthy output. In addition, we discuss using the ODS system to save output in useful file formats such as portable document format (PDF), hypertext markup language (HTML), or the rich text format (RTF) which many word processing programs can read. We note that ODS has other uses beyond the scope of this book and encourage readers to spend time familiarizing themselves with it.

A.7.1 Saving output as datasets and controlling output

Using ODS to save output or control the printed results involves two steps: first, finding out the name by which the ODS system refers to the output, and second, requesting that the output be saved as a SAS dataset or including or excluding it as output. The names used by the ODS system can be most easily found by running an ods trace on / listing statement (later reversed using an ods trace off statement). The ODS outputname thus identified can be saved as a dataset using an ods output outputname=newname statement. Specific pieces of output can be excluded using an ods exclude outputname1 outputname2 ... outputnamek statement or all but desired pieces excluded using the ods select outputname1 outputname2 ... outputnamek statement. These statements are each submitted before the procedure code which generates the output concerned. The exclude and select statements can be reversed using an ods exclude none or ods select all statement.

For example, to save the result of the t test performed by proc ttest (5.4.2), the following code could be used. First, generate some data for the test.
    data test2;
      do i = 1 to 100;
        if i lt 51 then x=1;
        else x=0;
        y = normal(0) + x;
        output;
      end;
    run;

Then, run the t test, including the ods trace on / listing statement to learn the names used by the ODS system.

    ods trace on / listing;
    proc ttest data=test2;
      class x;
      var y;
    run;
    ods trace off;

This will create the following output.

    The TTEST Procedure
    Variable: y

    Output Added:
    -------------
    Name:       Statistics
    Label:      Statistics
    Template:   Stat.TTest.Statistics
    Path:       Ttest.y.Statistics
    -------------

    x             N      Mean   Std Dev   Std Err   Minimum   Maximum
    0            50   -0.0253    0.8473    0.1198   -1.7148    1.8998
    1            50    0.9700    0.9937    0.1405   -0.9282    3.0741
    Diff (1-2)        -0.9953    0.9234    0.1847

    Variable: y

    Output Added:
    -------------
    Name:       ConfLimits
    Label:      Confidence Limits
    Template:   Stat.TTest.ConfLimits
    Path:       Ttest.y.ConfLimits
    -------------

    x            Method              Mean       95% CL Mean      Std Dev
    0                             -0.0253   -0.2661    0.2155     0.8473
    1                              0.9700    0.6876    1.2524     0.9937
    Diff (1-2)   Pooled           -0.9953   -1.3618   -0.6288     0.9234
    Diff (1-2)   Satterthwaite    -0.9953   -1.3619   -0.6287

    x            Method             95% CL Std Dev
    0                              0.7078    1.0558
    1                              0.8301    1.2383
    Diff (1-2)   Pooled            0.8103    1.0736
    Diff (1-2)   Satterthwaite

    Variable: y

    Output Added:
    -------------
    Name:       TTests
    Label:      T-Tests
    Template:   Stat.TTest.TTests
    Path:       Ttest.y.TTests
    -------------

    Method          Variances        DF   t Value   Pr > |t|
    Pooled          Equal            98     -5.39     <.0001
    Satterthwaite   Unequal      95.612     -5.39     <.0001

    Variable: y

    Output Added:
    -------------
    Name:       Equality
    Label:      Equality of Variances
    Template:   Stat.TTest.Equality
    Path:       Ttest.y.Equality
    -------------

    Equality of Variances

    Method     Num DF   Den DF   F Value   Pr > F
    Folded F       49       49      1.38   0.2680

Note that failing to issue the ods trace off command will result in continued annotation of every piece of output.
Similarly, when using the ods exclude and ods select statements, it is good practice to conclude each procedure with an ods select all or ods exclude none statement so that later output will not be affected.

The previous output shows that the t test itself (including the tests assuming equal and unequal variances) appears in output which the ODS system calls ttests, so the following code demonstrates how the test results can be saved into a new dataset. Here we assign the new dataset the name appendixattest.

    ods output ttests=appendixattest;
    proc ttest data=test2;
      class x;
      var y;
    run;

    proc print data=appendixattest;
    run;

The proc print code results in the following output.

    Obs   Variable   Method          Variances   tValue       DF    Probt
      1   y          Pooled          Equal        -5.39       98   <.0001
      2   y          Satterthwaite   Unequal      -5.39   95.612   <.0001

To run the t test and print only these results, the following code would be used.

    ods select ttests;
    proc ttest data=test2;
      class x;
      var y;
    run;
    ods select all;

    Variable: y

    Method          Variances        DF   t Value   Pr > |t|
    Pooled          Equal            98     -5.39     <.0001
    Satterthwaite   Unequal      95.612     -5.39     <.0001

This application is especially useful when running simulations, as it allows the results of procedures to be easily stored for later analysis.

The foregoing barely scratches the surface of what is possible using ODS. For further information, refer to the on-line help: Contents; SAS Products; Base SAS; SAS Output Delivery System User’s Guide.

A.7.2 Output file types and ODS destinations

The other main use of the ODS system is to generate output in a variety of file types. By default, SAS output is printed in a Results window in the internal GUI. When run in batch mode, or when saving the contents of the output window using the GUI, this output is saved as a plain text file with a .lst extension. The ODS system provides a way to save SAS output in a more attractive form.
As discussed in 9.3, procedure output and graphics can be saved to named output files by using commands of the following form.

    ods destinationname file="filename.ext";

Valid destinationnames include pdf, rtf, latex, and others. SAS refers to these file types as “destinations.” It is possible to have multiple destinations open at the same time. For destinations other than listing (the Output window), the destination must be closed before the results can be seen. This is done using the following statement.

    ods destinationname close;

Note that the listing destination can also be closed. If there are no output destinations open, no results will be displayed.

A.8 SAS macro variables

SAS also includes what are known as macro variables. Unlike SAS macros, macro variables are values that exist during SAS runs and are not stored within datasets. One way to assign a value to a macro variable is the %let statement.

    %let macrovar=chars;

Note that the %let statement need not appear within a data step; it is a global statement. The value is stored as a string of characters, and can be referred to as &macrovar.

    data ds;
      newvar=&macrovar;
    run;

or

    title "This is the &macrovar";

In the above example, the double quotes in the title statement allow the text within to be processed to assess whether macro variables are present, and to replace them with text, if so. Enclosing the title text in single quotes will result in &macrovar appearing in the title, while the code above will replace &macrovar with the value of the macrovar macro variable.

While this basic application of macro variables is occasionally useful in coding, a more powerful application is to generate the macro variables within a SAS data step. This can be done using call symput, as shown in 5.7.4.

    data _null_;
      ...
      call symput('macrovar', x);
    run;

This makes a new macro variable named macrovar which has the value of the data set variable x. The _null_ dataset is a special SAS dataset which is not saved.
It is efficient to use it when there is no need for a stored dataset. The formulation here can be used, for example, to calculate a value during a data step and pass it to the title of a figure. We demonstrate the use of call symput in section 5.7.4.

A.9 Miscellanea

Official documentation provided by SAS uses upper case, for example, PROC GLM. However, SAS is not case sensitive, with a few exceptions. In this text we use lower case throughout. We find lower case easier to read, and prefer the ease of typing (both for coding and book composition) in lower case.

Since statements are separated by semicolons, multiple statements may appear on one line and statements may span lines. We usually type one statement per line in the text (and in practice), however. This prevents statements being overlooked among others appearing in the same line. In addition, we indent statements within a data step or proc, to clarify the grouping of related commands.

We prefer the fine control available through text-based commands. However, some people may prefer a point-and-click interface to the analytic tools available. SAS provides various applications for such an approach which can be accessed through the Solutions menu in the GUI.

SAS includes both run and quit statements. The run statement tells SAS to act on the code submitted since the most recent run statement (or since startup, if there has been no run statement submitted thus far). Some procedures allow multiple steps within the procedure without having to end it; key procedures which allow this are proc gplot and proc reg. This might be useful for model fitting and diagnostics with particularly large datasets in proc reg. In general we find it a nuisance in graphics procedures, because the graphics are sometimes not actually drawn until the quit statement is entered.
In the examples, we use the run statement in general and the quit statement when necessary, without further comment.

We find the SAS GUI to be a comfortable work environment and an aid to productivity. However, SAS can be easily run in batch mode. To use SAS this way, compose code in the text editor of your choice. Save the file (a .sas extension would be appropriate), then find it in the operating system. In Windows, a right-click on the file will bring up a list of potential actions, one of which is “Batch Submit with SAS”. If this option is selected, SAS will run the file without opening the GUI. The output will be saved in the same directory with the same name but with a .lst extension; the log will be saved in the same directory with the same name but with a .log extension. Both of these files are plain text files.

Appendix B
Introduction to R and RStudio

This chapter provides a (brief) introduction to R and RStudio. R is a free, open-source software environment for statistical computing and graphics [81, 135]. RStudio is an open-source integrated development environment for R that adds many features and productivity tools for R. The chapter includes a short history, installation information, a sample session, background on fundamental structures and actions, information about help and documentation, and other important topics.

R is a general purpose package that includes support for a wide variety of modern statistical and graphical methods (many of which have been contributed by users). It is available for most UNIX platforms, Windows, and MacOS. The R Foundation for Statistical Computing holds and administers the copyright of R software and documentation. R is available under the terms of the Free Software Foundation’s GNU General Public License in source code form.
RStudio facilitates use of R by integrating R help and documentation, providing a workspace browser and data viewer, and supporting syntax highlighting, code completion, and smart indentation. It integrates reproducible analysis with Sweave, knitr and R Markdown (see 11.3) as well as slide presentations, and includes a debugging environment (see 4.1.7). RStudio also provides support for multiple projects as well as an interface to source code control systems such as GitHub. It has become the default interface for many R users, including the authors.

RStudio is available as a client (standalone) for Windows, Mac OS X, and Linux, and there is also a server version. Commercial products and support are available in addition to the open-source offerings (see http://www.rstudio.com/ide for details).

The first versions of R were written by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, while current development is coordinated by the R Development Core Team, a group of international volunteers. As of January 2014 this group consisted of Douglas Bates, John Chambers, Peter Dalgaard, Seth Falcon, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Uwe Ligges, Thomas Lumley, Martin Maechler, Duncan Murdoch, Paul Murrell, Martyn Plummer, Brian Ripley, Deepayan Sarkar, Duncan Temple Lang, Luke Tierney, and Simon Urbanek. Many hundreds of other people have contributed to the development of R or developed add-on libraries and packages.

R is similar to the S language, a flexible and extensible statistical environment originally developed in the 1980s at AT&T Bell Labs (now Alcatel–Lucent). Insightful Corporation has continued the development of S in their commercial software package S-PLUS™.
New users are encouraged to download and install R from the Comprehensive R Archive Network (CRAN, http://www.r-project.org, see B.1) and install RStudio from http://www.rstudio.com/ide. The sample session in the appendix of the Introduction to R document, also available from CRAN (see B.2), is highly recommended.

Figure B.1: R Windows graphical user interface

B.1 Installation

The home page for the R project, located at http://r-project.org, is the best starting place for information about the software. It includes links to CRAN, which features precompiled binaries as well as source code for R, add-on packages, documentation (including manuals, frequently asked questions, and the R newsletter) as well as general background information. Mirrored CRAN sites with identical copies of these files exist all around the world. Updates to R and packages are regularly posted on CRAN. In addition to the instructions for installation under Windows and Mac OS X, R and RStudio are also available for multiple Linux implementations.

B.1.1 Installation under Windows

Versions of R for Windows XP and later, including 64-bit versions, are available at CRAN. The distribution includes Rgui.exe, which launches a self-contained windowing system that includes a command-line interface, Rterm.exe for a command-line interface only, Rscript.exe for batch processing only, and R.exe, which is suitable for batch or command-line use. A screenshot of the R graphical user interface (GUI) can be found in Figure B.1.

More information on Windows-specific issues can be found in the CRAN R for Windows FAQ (http://cran.r-project.org/bin/windows/base/rw-FAQ.html).

Figure B.2: R Mac OS X graphical user interface

B.1.2 Installation under Mac OS X

A version of R for Mac OS X 10.6 and higher is available at CRAN.
This is distributed as a disk image containing the installer. In addition to the graphical interface version, a command line version (particularly useful for batch operations) can be run as the command R. A screenshot of the graphical interface can be found in Figure B.2.

More information on Macintosh-specific issues can be found in the CRAN R for Mac OS X FAQ (http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html).

B.1.3 RStudio

RStudio for MacOS, Windows, or Linux can be downloaded from http://www.rstudio.com/ide. RStudio requires R to be installed on the local machine. A server version (accessible from web browsers) is also available for download. Documentation of the advanced features in the system is available on the RStudio web site. A screenshot of the RStudio interface can be found in Figure B.3.

B.1.4 Other graphical interfaces

Other graphical user interfaces for R include the R Commander project [45], Deducer (http://www.deducer.org), and the SOCR (Statistics Online Computational Resource) project (http://www.socr.ucla.edu).

Figure B.3: RStudio graphical user interface

B.2 Running R and sample session

Once installation is complete, the recommended next step for a new user would be to start R and run a sample session. An example from the command line interface within Mac OS X is given in Figure B.4.

The > character is the command prompt, and commands are executed once the user presses the RETURN or ENTER key. R can be used as a calculator (as seen from the first two commands on lines 1 and 3). New variables can be created (as on lines 5 and 8) using the assignment operator =. If a command generates output (as on lines 6 and 11), then it is printed on the screen, preceded by a number indicating place in the vector (this is particularly useful if output is longer than one line, e.g., lines 23–24).
Saved data (here assigned the name ds) is read into R on line 15, then summary statistics are calculated (lines 16–17) and individual observations are displayed (lines 23–24). The $ operator allows access to objects within a dataframe. Alternatively, the with() function can be used to access objects within a dataset.

Unlike SAS, R is case sensitive.

    > x = 1:3
    > X = seq(2, 4)
    > x
    [1] 1 2 3
    > X
    [1] 2 3 4

A very comprehensive sample session in R can be found in Appendix A of An Introduction to R [186] (http://cran.r-project.org/doc/manuals/R-intro.pdf).

    % R
    R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
    Copyright (C) 2013 The R Foundation for Statistical Computing
    Platform: x86_64-apple-darwin10.8.0 (64-bit)

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

      Natural language support but running in an English locale

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

     1 > 3 + 6
     2 [1] 9
     3 > 2 * 3
     4 [1] 6
     5 > x = c(4, 5, 3, 2)
     6 > x
     7 [1] 4 5 3 2
     8 > y = seq(1, 4)
     9 > y
    10 [1] 1 2 3 4
    11 > mean(x)
    12 [1] 3.5
    13 > sd(y)
    14 [1] 1.290994
    15 > ds = read.csv("http://www.amherst.edu/~nhorton/sasr2/datasets/help.csv")
    16 > mean(ds$age)
    17 [1] 35.65342
    18 > mean(age)
    19 Error in mean(age) : object "age" not found
    20 > with(ds, mean(age))
    21 [1] 35.65342
    22 > ds$age[1:30]
    23  [1] 37 37 26 39 32 47 49 28 50 39 34 58 53 58 60 36 28 35 29 27 27
    24 [22] 41 33 34 31 39 48 34 32 35
    25 > q()
    26 Save workspace image? [y/n/c]: n

Figure B.4: Sample session in R

B.2.1 Replicating examples from the book and sourcing commands

To help facilitate reproducibility, R commands can be bundled into a plain text file, called a “script” file, which can be executed using the source() command. The optional argument echo=TRUE for the source() command can be set to display each command and its output. The book web site cited above includes the R source code for the examples. The sample session in Figure B.4 can be executed by running:

    > source("http://www.amherst.edu/~nhorton/sasr2/examples/sampsess.R",
        echo=TRUE)

while most of the examples at the end of each chapter can be executed by running:

    > source("http://www.amherst.edu/~nhorton/sasr2/examples/chapterXX.R",
        echo=TRUE)

where XX is replaced by the desired chapter number. In many cases, add-on packages (see B.6.1) need to be installed prior to running the examples. To facilitate this process, we have created a script file to load them in one step.

    > source("http://www.amherst.edu/~nhorton/sasr2/examples/install.R",
        echo=TRUE)

If needed libraries are not installed (B.6.1), the example code will generate error messages.

B.2.2 Batch mode

In addition, R can be run in batch (noninteractive) mode from a command line interface:

    % R CMD BATCH file.R

This will run the commands contained within file.R and put all output into file.Rout. To use R in batch mode under Windows, users need to include R.exe in their path (see the Windows R FAQ and section B.1.1).

B.3 Learning R and getting help

An excellent starting point for new R users can be found in the Introduction to R, available from CRAN (r-project.org).

The system features extensive on-line documentation, though as with SAS, it can sometimes be challenging to comprehend.
Each command in R has an associated help file that describes usage, lists arguments, provides details of actions and references, lists other related functions, and includes examples of its use. The help system is invoked using the command:

    ?function

or

    help(function)

where function is the name of the function of interest. As an example, the help file for the mean() function is accessed by the command help(mean). The output from this command is provided in Figure B.5. It describes the mean() function as a generic function for the (trimmed) arithmetic mean, with arguments x (an R object), trim (the fraction of observations to trim, with default=0; setting trim=0.5 is equivalent to calculating the median), and na.rm (should missing values be deleted, default is na.rm=FALSE).

Some commands (e.g., if) are reserved, so ?if will not generate the desired documentation. Running ?"if" will work (see also ?Reserved and ?Control). Other reserved words include else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, and NA.

The RSiteSearch() function will search for key words or phrases in many places (including the search engine at http://search.r-project.org). A screenshot of the results of the command RSiteSearch("eta squared anova") can be found in Figure B.6. The RSeek.org site can also be helpful in finding more information and examples. Examples of many functions are available using the example() function.

    mean                  package:base                  R Documentation

    Arithmetic Mean

    Description:
         Generic function for the (trimmed) arithmetic mean.

    Usage:
         mean(x, ...)

         ## Default S3 method:
         mean(x, trim = 0, na.rm = FALSE, ...)

    Arguments:
           x: An R object. Currently there are methods for numeric/logical
              vectors and date, date-time and time interval objects.
              Complex vectors are allowed for 'trim = 0', only.
        trim: the fraction (0 to 0.5) of observations to be trimmed from
              each end of 'x' before the mean is computed. Values of trim
              outside that range are taken as the nearest endpoint.
       na.rm: a logical value indicating whether 'NA' values should be
              stripped before the computation proceeds.
         ...: further arguments passed to or from other methods.

    Value:
         If 'trim' is zero (the default), the arithmetic mean of the values
         in 'x' is computed, as a numeric or complex vector of length one.
         If 'x' is not logical (coerced to numeric), numeric (including
         integer) or complex, 'NA_real_' is returned, with a warning.

         If 'trim' is non-zero, a symmetrically trimmed mean is computed
         with a fraction of 'trim' observations deleted from each end
         before the mean is computed.

    References:
         Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
         Language_. Wadsworth & Brooks/Cole.

    See Also:
         'weighted.mean', 'mean.POSIXct', 'colMeans' for row and column
         means.

    Examples:
         x <- c(0:10, 50)
         xm <- mean(x)
         c(xm, mean(x, trim = 0.10))

Figure B.5: Documentation on the mean() function

Figure B.6: Display after running RSiteSearch("eta squared anova")

    > example(mean)

    mean> x <- c(0:10, 50)
    mean> xm <- mean(x)
    mean> c(xm, mean(x, trim = 0.10))
    [1] 8.75 5.50

Other useful resources are help.start(), which provides a set of online manuals, and help.search(), which can be used to look up entries by description. The apropos() command returns any functions in the current search list that match a given pattern (which facilitates searching for a function based on what it does, as opposed to its name).

Other resources for help available from CRAN include the R-help mailing list (see also B.7, support). New users are also encouraged to read the R FAQ (frequently asked questions) list.
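A few of the points made in this section can be checked directly at the prompt. The following is a minimal sketch using base R only; the object names trimmed_half, med, and mean_like are illustrative assumptions, not names used elsewhere in the text.

```r
# Reuse the vector from the example(mean) output
x = c(0:10, 50)

# Setting trim = 0.5 trims everything but the middle of the data,
# so the result should agree with the median
trimmed_half = mean(x, trim = 0.5)
med = median(x)
trimmed_half == med

# apropos() matches a pattern against object names on the search list;
# here, every name that begins with "mean"
mean_like = apropos("^mean")
mean_like

# help.search() matches against help-file descriptions rather than names.
# It is left commented out because it opens a results viewer.
# help.search("trimmed mean")
```

Searching by name with apropos() is quick when part of a function's name is known; help.search() is the better choice when only the task is known.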
B.4 Fundamental structures and objects

Here we provide a brief introduction to R data structures.

B.4.1 Objects and vectors

Almost everything in R is an object, which may be initially disconcerting to a new user. An object is simply something that R can operate on. Common objects include vectors, matrices, arrays, factors (see 2.2.19), dataframes (akin to datasets in SAS), lists, and functions.

The basic variable structure is a vector. Vectors can be created using the <- or = assignment operators (which assign the evaluated expression on the right-hand side of the operator to the object name on the left-hand side).

> x <- c(5, 7, 9, 13, -4, 8)
> x = c(5, 7, 9, 13, -4, 8) # equivalent

The above code creates a vector of length 6 using the c() function to concatenate scalars (2.2.10). The = operator must be used for the specification of options for functions. Other assignment operators exist, as well as the assign() function (see 4.1.5 or help("<-") for more information). The rm() command can be used to remove objects. The exists() function can be utilized to determine whether an object exists.

B.4.2 Indexing

Since vector operations are so common in R, it is important to be able to access (or index) elements within these vectors. Many different ways of indexing vectors are available. Here, we introduce several of these, using the above example. The command x[2] would return the second element of x (the scalar 7), and x[c(2,4)] would return the vector (7,13). The expressions x[c(T,T,T,T,T,F)], x[1:5], and x[-6] would all return a vector consisting of the first five elements in x; the last specifies all elements except the 6th. Knowledge and basic comfort with these approaches to vector indexing are important to effective use of R, as they can help with computational efficiency. Vectors are recycled if needed, for example, when comparing each of the elements of a vector to a scalar, as shown below.
> x>8
[1] FALSE FALSE  TRUE  TRUE FALSE FALSE

The above expression demonstrates the use of comparison operators (see ?Comparison). Only the third and fourth elements of x are greater than 8. Each comparison returns a logical value of either TRUE or FALSE (see ?Logic). A count of elements meeting the condition can be generated using the sum() function.

> sum(x>8)
[1] 2

The following commands create a vector of values greater than 8.

> largerthan8 = x[x>8]
> largerthan8
[1]  9 13

Here the expression x[x>8] can be interpreted as “the elements of x for which x is greater than 8.” This is a difficult construction for some new users. Examples of its application in the book can be found in 11.4.4 and 2.6.2.

Other comparison operators include == (equal), >= (greater than or equal), <= (less than or equal), and != (not equal). Care needs to be taken in the comparison using == if noninteger values are present (see 3.2.5).

B.4.3 Operators

There are many operators defined in R to carry out a variety of tasks. Many of these were demonstrated in the sample session (assignment, arithmetic) and above examples (comparison). Arithmetic operations include +, -, *, /, ^ (exponentiation), %% (modulus), and %/% (integer division). More information about operators can be found using the help system (e.g., ?"+"). Background information on other operators and precedence rules can be found using help(Syntax).

R supports Boolean operations (OR, AND, NOT, and XOR) using the |, ||, &, ! operators and the xor() function. The | is an “or” operator that operates on each element of a vector, while || is another “or” operator that stops evaluation the first time that the result is true (see ?Logic).

B.4.4 Lists

Lists in R are generic objects that can contain other objects. List members can be named, or referenced using numeric indices (using the [[ operator).
> newlist = list(x1="hello", x2=42, x3=TRUE)
> is.list(newlist)
[1] TRUE
> newlist
$x1
[1] "hello"

$x2
[1] 42

$x3
[1] TRUE

> newlist[[2]]
[1] 42
> newlist$x2
[1] 42

The unlist() function can be used to flatten (make a vector out of) the elements in a list (see also relist()).

> unlisted = unlist(newlist)
> unlisted
     x1      x2      x3
"hello"    "42"  "TRUE"

Note that unlisted objects are coerced (see 2.2.3) to a common type (in this case character).

B.4.5 Matrices

Matrices are rectangular objects with two dimensions (see 3.3). We can create a 2×3 matrix, display it, and test for its type.

> A = matrix(x, 2, 3)
> A
     [,1] [,2] [,3]
[1,]    5    9   -4
[2,]    7   13    8
> is.matrix(A) # is A a matrix?
[1] TRUE
> is.vector(A)
[1] FALSE
> is.matrix(x)
[1] FALSE

Note that comments are supported within R (any input given after a # character is ignored). Indexing for matrices is done in a similar fashion as for vectors, albeit with a second dimension (denoted by a comma).

> A[2,3]
[1] 8
> A[,1]
[1] 5 7
> A[1,]
[1]  5  9 -4

B.4.6 Dataframes

Analysis datasets are often stored in a dataframe, which is more general than a matrix. This rectangular object, similar to a dataset in SAS, can be thought of as a matrix with columns of vectors of different types (as opposed to a matrix, which consists of vectors of the same type). The functions read.csv() (see 1.1.5) and read.table() (see 1.1.2) return dataframe objects. A simple dataframe can be created using the data.frame() command. Access to sub-elements is achieved using the $ operator, as shown below (see also help(Extract)). In addition, operations can be performed by column (e.g., calculation of sample statistics).
> y = rep(11, length(x))
> y
[1] 11 11 11 11 11 11
> ds = data.frame(x, y)
> ds
   x  y
1  5 11
2  7 11
3  9 11
4 13 11
5 -4 11
6  8 11
> ds$x[3]
[1] 9

We can check to see if an object is a data frame with is.data.frame(). Note that the use of data.frame() differs from the use of cbind(), which yields a matrix object (unless cbind() is given data frames as inputs).

> newmat = cbind(x, y)
> newmat
      x  y
[1,]  5 11
[2,]  7 11
[3,]  9 11
[4,] 13 11
[5,] -4 11
[6,]  8 11
> is.data.frame(newmat)
[1] FALSE
> is.matrix(newmat)
[1] TRUE

Dataframes can be thought of as the equivalent of datasets in SAS. They can be created from matrices using as.data.frame(), while matrices can be constructed from dataframes using as.matrix().

Dataframes can be attached to the workspace using the attach(ds) command (see 2.1.1), though this is discouraged [56]. After this command, individual columns in ds can be referenced directly by name (e.g., x instead of ds$x). Name conflicts are a common problem with attach() (see conflicts()).

The search() function lists attached packages and objects. To avoid cluttering the name-space, the command detach(ds) should be used once a dataframe is no longer needed. The with() and within() commands (see 2.1.1) can be used to simplify reference to an object within a dataframe without attaching.

Sometimes it’s desirable to remove a package (B.6.1) from the workspace. For example, a package might define a function (4.2.2) with the same name as an existing function. Packages can be detached using the syntax detach("package:PKGNAME"), where PKGNAME is the name of the package (see, for example, 7.10.5).

The names of all variables within a given dataset (or more generally for sub-objects within an object) are provided by the names() command.
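The alternatives to attach() mentioned above can be sketched briefly. This is an illustrative example rather than code from the book: it rebuilds the small dataframe ds so that the block runs on its own, and the new column name total is a hypothetical choice.

```r
# Rebuild the example dataframe so that this block is self-contained
x <- c(5, 7, 9, 13, -4, 8)
y <- rep(11, length(x))
ds <- data.frame(x, y)

# names() lists the variables within the dataframe
varnames <- names(ds)
print(varnames)

# with() evaluates an expression using the columns of ds, without attach()
meanx <- with(ds, mean(x))
print(meanx)

# within() returns a modified copy of the dataframe; ds itself is unchanged
ds2 <- within(ds, total <- x + y)
print(names(ds2))

# Conversion between dataframes and matrices
m <- as.matrix(ds)
print(is.matrix(m))
print(is.data.frame(as.data.frame(m)))
```

Because with() and within() do not alter the search path, they avoid the name conflicts that attach() can introduce.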
The names of all objects defined within an R session can be generated using the objects() and ls() commands, which return a vector of character strings. RStudio includes an Environment tab that lists all the objects in the current environment.

The print() and summary() functions can be used to display simple and more complex descriptions, respectively, of an object. Running print(object) at the command line is equivalent to just entering the name of the object, i.e., object.

B.4.7 Attributes and classes

Objects have a set of associated attributes (such as names of variables, dimensions, or classes) which can be displayed or sometimes changed. While a powerful concept, this can often be initially confusing. For example, we can find the dimension of the matrix defined earlier:

> attributes(A)
$dim
[1] 2 3

Other types of objects within R include lists (ordered objects that are not necessarily rectangular), regression models (objects of class lm), and formulae (e.g., y ~ x1 + x2). Examples of the use of formulas can be found in 5.4.2 and 6.1.1.

R supports object-oriented programming (see help(UseMethod)). As a result, objects within R have an associated “Class” attribute, which affects default behaviors for some operations on that object. Many functions have special capabilities when operating on a particular class. For example, when summary() is applied to an lm object, the summary.lm() function is called, while summary.aov() is called when an aov object is given as argument.

The class() function returns the classes to which an object belongs, while the methods() function displays all of the classes supported by a function (e.g., methods(summary)). The attributes() command displays the attributes associated with an object, while the typeof() function provides information about the object (e.g., logical, integer, double, complex, character, and list).
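The functions described in this section can be tried on a few small objects. The sketch below is illustrative only: it recreates the matrix A from earlier in the appendix and, as an assumption made purely for demonstration, fits a regression to the built-in cars dataset, which is not used in the book's discussion.

```r
# Recreate the 2 x 3 matrix used earlier so that this block is self-contained
A <- matrix(c(5, 7, 9, 13, -4, 8), 2, 3)
dims <- attributes(A)$dim   # the dim attribute holds the matrix dimensions
print(dims)
print(class(A))             # the class labels reported depend on the R version
print(typeof(A))            # storage type of the elements

# A fitted regression model has class "lm", so summary() dispatches
# to summary.lm() (cars is a dataset shipped with R, used for illustration)
fit <- lm(dist ~ speed, data = cars)
print(class(fit))
fitsummary <- summary(fit)
print(class(fitsummary))

# methods() lists the class-specific versions of a generic function
print(methods(summary))
```

Comparing class(fit) with class(fitsummary) shows how the result of a generic function can itself carry a class that controls how it is printed.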
B.4.8 Options

The options() function in R can be used to change various default behaviors, for example, the default number of digits to display in output (options(digits=n) where n is the preferred number). Defaults described in the book include digits, show.signif.stars, and width. The previous options are returned when options() is called (see 8.7.7), to allow them to be restored. The command help(options) lists all of the settable options.

B.5 Functions

B.5.1 Calling functions

Fundamental actions within R are carried out by calling functions (either built-in or user defined). Multiple arguments may be given, separated by commas. The function carries out operations using the provided arguments then returns values (an object such as a vector or list) that are displayed (by default) or which can be saved by assignment to an object.

As an example, the quantile() function takes a vector and returns the minimum, 25th percentile, median, 75th percentile and maximum, though if an optional vector of quantiles is given, those are calculated instead.

> vals = rnorm(1000) # generate 1000 standard normals
> quantile(vals)
     0%     25%     50%     75%    100%
-2.8478 -0.6288  0.0802  0.7634  3.2597
> quantile(vals, c(.025, .975))
 2.5% 97.5%
-1.94  2.12

Return values can be saved for later use.

> res = quantile(vals, c(.025, .975))
> res[1]
 2.5%
-1.94

Options are available for many functions. These are named arguments for the function and are generally added after the other arguments, also separated by commas. The documentation specifies the default action if named arguments (options) are not specified. For the quantile() function, there is a type option which allows specification of one of nine algorithms for calculating quantiles. Setting type=3 specifies the “nearest even order statistic” option, which is the default for SAS.
> res = quantile(vals, c(.025, .975), type=3)

Some functions allow a variable number of arguments. An example is the paste() function (see usage in 2.2.10). The calling sequence is described in the documentation as follows.

> paste(..., sep=" ", collapse=NULL)

To override the default behavior of a space being added between elements output by paste(), the user can specify a different value for sep.

B.5.2 The apply family of functions

Operations within R are most efficiently carried out using vector or list operations rather than looping. The apply() function can be used to perform many actions that would be implemented within a data step (A.4) within SAS. While somewhat subtle, the power of the vector language can be seen in this example. The apply() command is used to calculate column means or row means of the previously defined matrix in one fell swoop:

> A
     [,1] [,2] [,3]
[1,]    5    9   -4
[2,]    7   13    8
> apply(A, 2, mean)
[1]  6 11  2
> apply(A, 1, mean)
[1] 3.33 9.33

Option 2 specifies that the mean should be calculated for each column, while option 1 calculates the mean of each row. Here we see some of the flexibility of the system, as functions in R (such as mean()) are also objects that can be passed as arguments to functions.

Other related functions include lapply(), which is helpful in avoiding loops when using lists, sapply() (see 2.1.2), mapply(), and vapply() to do the same for dataframes, matrices, and vectors, respectively, and tapply() (11.1), which performs an action on subsets of an object. The foreach and plyr packages provide equivalent formulations for parallel execution (see also the parallel package).

B.6 Add-ons: packages

B.6.1 Introduction to packages

Additional functionality in R is added through packages, which consist of libraries of bundled functions, datasets, examples, vignettes, and help files that can be downloaded from CRAN.
The function install.packages() or the windowing interface under Packages and Data must be used to download and install packages. RStudio provides an easy to use Packages tab to install and load packages. The library() function can be used to load a previously installed package (i.e., one that is included in the standard release of R or has been previously made available through use of the install.packages() function). As an example, to install and load Frank Harrell’s Hmisc package, two commands are needed:

> install.packages("Hmisc")
> library(Hmisc)

Once a package has been installed, it can be loaded for use in a session of R by executing the function library(libraryname). If a package is not installed, running the library() command will yield an error. Here we try to load the Zelig package (which had not yet been installed):

> library(Zelig)
Error in library(Zelig) : there is no package called 'Zelig'
> install.packages("Zelig")
trying URL 'ftp.osuosl.org/pub/cran/bin/macosx/contrib/Zelig_4.2-1.tgz'
Content type 'application/x-gzip' length 3374792 bytes (3.2 Mb)
opened URL
==================================================
downloaded 3.2 Mb

The downloaded binary packages are in
/var/folders/2j/RtmpXPJ4oO/downloaded_packages
> library(Zelig)
ZELIG (Versions 4.2-1, built: 2013-09-12)
+----------------------------------------------------------------+
| Please refer to http://gking.harvard.edu/zelig for full |
| documentation or help.zelig() for help with commands and |
| models support by Zelig. |
| Zelig project citations: |
| Kosuke Imai, Gary King, and Olivia Lau. (2009). |
| ``Zelig: Everyone's Statistical Software,'' |
| http://gking.harvard.edu/zelig |
| and |
| Kosuke Imai, Gary King, and Olivia Lau. (2008). |
| ``Toward A Common Framework for Statistical Analysis |
| and Development,'' Journal of Computational and |
| Graphical Statistics, Vol. 17, No.
4 (December) |
| pp. 892-913. |
+----------------------------------------------------------------+
Attaching package: 'Zelig'

A user can test whether a package is available by running require(packagename); this will load the library if it is installed, and generate a warning message if it is not (as opposed to library(), which will return an error, see 4.1.8). This is particularly useful in functions or reproducible analysis.

The update.packages() function should be run periodically to ensure that packages are up-to-date. The sessionInfo() command displays the version of R that is running as well as information on all loaded packages.

As of January 2014, there were more than 5,000 packages available from CRAN. This represents a tremendous investment of time and code by many developers [46]. While each of these has met a minimal standard for inclusion, it is important to keep in mind that packages within R are created by individuals or small groups, and not endorsed by the R core group. As a result, they do not necessarily undergo the same level of testing and quality assurance that the core R system does.

B.6.2 CRAN task views

The Task Views on CRAN (http://cran.r-project.org/web/views) are a very useful resource for finding packages. These are listings of relevant packages within a particular application area (such as multivariate statistics, psychometrics, or survival analysis). Table B.1 displays the task views available as of January 2014.
Table B.1: CRAN task views

Bayesian | Bayesian inference
ChemPhys | Chemometrics and computational physics
Clinical Trials | Design, monitoring, and analysis of clinical trials
Cluster | Cluster analysis & finite mixture models
DifferentialEquations | Differential equations
Distributions | Probability distributions
Econometrics | Computational econometrics
Environmetrics | Analysis of ecological and environmental data
Experimental Design | Design and analysis of experiments
Finance | Empirical finance
Genetics | Statistical genetics
Graphics | Graphic displays, devices, and visualization
gR | Graphical models in R
High Performance Computing | High-performance and parallel computing
Machine Learning | Machine and statistical learning
Medical Imaging | Medical image analysis
MetaAnalysis | Meta-analysis
Multivariate | Multivariate statistics
Natural Language Processing | Natural language processing
Numerical Mathematics | Numerical mathematics
Official Statistics | Official statistics & survey methodology
Optimization | Optimization and mathematical programming
Pharmacokinetics | Analysis of pharmacokinetic data
Psychometrics | Psychometric models and methods
Reproducible Research | Reproducible research
Robust | Robust statistical methods
Social Sciences | Statistics for the social sciences
Spatial | Analysis of spatial data
Spatio Temporal | Handling and analyzing spatio-temporal data
Survival | Survival analysis
Time Series | Time series analysis
Web Technologies | Web technologies and service

B.6.3 Installed libraries and packages

Running the command library(help="libraryname") will display information about an installed package. Entries in the book that utilize packages include a line specifying how to access that library (e.g., library(foreign)).
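A script can check for a package before relying on it, along the lines discussed in this section. The following is a minimal sketch, not the book's code: load_if_available() is a hypothetical helper written for this example, and notARealPackage123 is a made-up name assumed not to be installed.

```r
# Hypothetical helper (not part of R or of the book): try to load a package,
# returning TRUE or FALSE instead of stopping with an error as library() would
load_if_available <- function(pkg) {
  ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
  if (!ok) {
    message("Package '", pkg, "' is not installed; see install.packages()")
  }
  ok
}

has_stats <- load_if_available("stats")              # distributed with R
has_fake <- load_if_available("notARealPackage123")  # made-up package name
print(c(has_stats, has_fake))

# installed.packages() returns a matrix with one row per installed package
inst <- rownames(installed.packages())
print("base" %in% inst)

# sessionInfo() reports the running version of R and the loaded packages
print(sessionInfo()$R.version$version.string)
```

Returning a logical value rather than raising an error lets a longer analysis decide how to proceed when an optional package is missing.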
As of January 2014, the R distribution comes with the following packages:

base  Base R functions
compiler  R byte code compiler
datasets  Base R datasets
grDevices  Graphics devices for base and grid graphics
graphics  R functions for base graphics
grid  A rewrite of the graphics layout capabilities, plus some support for interaction
methods  Formally defined methods and classes for R objects, plus other programming tools
parallel  Support for parallel computation, including by forking and by sockets, and random-number generation
splines  Regression spline functions and classes
stats  R statistical functions
stats4  Statistical functions using S4 classes
tcltk  Interface and language bindings to Tcl/Tk GUI elements
tools  Tools for package development and administration
utils  R utility functions

These are installed with every R distribution and are effectively part of R; those attached at startup (such as base, stats, graphics, and utils) are available without having to run the library() command.

B.6.4 Packages referenced in this book

Other packages utilized in the book include:

biglm  Bounded memory linear and generalized linear models [112]
boot  Bootstrap functions [20] (recommended)
BRugs  R interface to the OpenBUGS MCMC software [175]
car  Companion to Applied Regression [47]
chron  Chronological objects [83]
circular  Circular statistics [3]
coda  Output analysis and diagnostics for Markov Chain Monte Carlo simulations [132]
coefplot  Plots coefficients from fitted models [92]
coin  Conditional inference procedures in a permutation test framework [79]
dispmod  Dispersion models [162]
doBy  Groupwise summary statistics, LSmeans, and general linear contrasts [70]
dplyr  Plyr specialized for dataframes: faster and with remote datastores [194]
ellipse  Functions for drawing ellipses and ellipse-like confidence regions [122]
elrm  Exact logistic regression via MCMC [205]
epitools  Epidemiology tools [9]
exactRankTests  Exact distributions for rank and permutation tests [78]
flexmix  Flexible mixture
modeling [98]
foreach  Foreach looping construct for R [139]
foreign  Read data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase [134] (recommended)
gam  Generalized additive models [64]
gdata  Various R programming tools for data manipulation [189]
gee  Generalized estimation equation solver [21]
GenKern  Functions for generating and manipulating binned kernel density estimates [109]
GGally  Extension to ggplot2 [158]
ggmap  A package for spatial visualization with Google Maps and OpenStreetMap [87]
ggplot2  An implementation of the Grammar of Graphics [196]
gmodels  Various R programming tools for model fitting [188]
gridExtra  Functions for grid graphics [10]
gtools  Various R programming tools [190]
hexbin  Hexagonal binning routines [22]
Hmisc  Harrell miscellaneous [62]
Hotelling  Hotelling’s T-squared test and variants [30]
hwriter  HTML writer: outputs R objects in HTML format [129]
irr  Various coefficients of interrater reliability and agreement [48]
knitr  A general-purpose package for dynamic report generation in R [202]
lars  Least angle regression, LASSO, and forward stagewise [65]
lattice  Lattice graphics [151] (recommended)
lawstat  An R package for biostatistics, public policy, and law [51]
lme4  Linear mixed-effects models [12]
lmtest  Testing linear regression models [206]
logistiX  Exact logistic regression including Firth correction for binary covariates [66]
lpSolve  Interface to Lp solve v.
5.5 to solve linear/integer programs [16]
lubridate  Makes dealing with dates a little easier [57]
maps  Draw geographical maps [15]
markdown  Markdown rendering for R [6]
MASS  Support functions and datasets for Venables and Ripley’s MASS [185] (recommended)
Matching  Multivariate and propensity score matching with balance optimization [164]
Matrix  Sparse and dense matrix classes and methods [11] (recommended)
MCMCpack  Markov chain Monte Carlo (MCMC) package [114]
memisc  Tools for survey data, graphics, programming, statistics, and simulation [36]
mice  Multivariate imputation by chained equations [184]
mitools  Tools for multiple imputation of missing data [111]
mix  Estimation/multiple imputation for mixed categorical and continuous data [155]
moments  Moments, cumulants, skewness, kurtosis, and related tests [91]
mosaic  Project MOSAIC statistics and mathematics teaching utilities [133]
MplusAutomation  Automating Mplus model estimation and interpretation [60]
muhaz  Hazard function estimation in survival analysis [68]
multcomp  Simultaneous inference in general parametric models [77]
multilevel  Multilevel functions [17]
nlme  Linear and nonlinear mixed effects models [130] (recommended)
nnet  Feed-forward neural networks and multinomial log-linear models [185] (recommended)
nortest  Tests for normality [58]
partykit  A toolkit for recursive partytioning [80]
plotrix  Various plotting functions [99]
plyr  Tools for splitting, applying, and combining data [198]
poLCA  Polytomous variable latent class analysis [106]
prettyR  Pretty descriptive stats [100]
pscl  Political science computational laboratory, Stanford University [82]
pwr  Basic functions for power analysis [23]
QuantPsyc  Quantitative psychology tools [44]
quantreg  Quantile regression [90]
R2jags  A package for running jags from R [170]
R2WinBUGS  Running WinBUGS and OpenBUGS from R [169]
randomLCA  Random effects latent class analysis [14]
RCurl  General network (HTTP/FTP) client interface for R [93]
reshape  Flexibly reshape data [195]
rjags  Bayesian graphical models using MCMC [131]
RMongo  MongoDB client for R [24]
rms  Regression modeling strategies [63]
RMySQL  R interface to the MySQL database [84]
ROCR  Visualizing the performance of scoring classifiers [168]
RODBC  An ODBC database interface [140]
rpart  Recursive partitioning [173] (recommended)
RSQLite  SQLite interface for R [85]
rtf  Rich text format (RTF) output [156]
runjags  Interface utilities, parallel computing methods, and additional distributions for MCMC models in JAGS [33]
sas7bdat  SAS database reader [166]
scatterplot3d  3D scatter plot [104]
sciplot  Scientific graphing functions for factorial designs [120]
simPH  Tools for simulating and plotting quantities of interest estimated from Cox proportional hazards models [49]
sqldf  Perform SQL selects on R dataframes [59]
survey  Analysis of complex survey samples [110]
survival  Survival analysis [174] (recommended)
tmvtnorm  Truncated multivariate normal and Student t distribution [199]
vcd  Visualizing categorical data [118]
VGAM  Vector generalized linear and additive models [204]
vioplot  Violin plot [2]
WriteXLS  Cross-platform Perl based R function to create Excel spreadsheets [160]
XML  Tools for parsing and generating XML [94]
xtable  Export tables to LaTeX or HTML [31]
Zelig  Everyone’s statistical software [128]

These must be downloaded, installed, and loaded prior to use (see install.packages(), require() and library()), though the recommended packages are included in most distributions of R. To facilitate the process of loading the other packages, we have created a script file to load these in one step (see B.2.1).

B.6.5 Datasets available with R

A number of datasets are available within the datasets package. The data() function lists these, while the optional package option can be used to list the datasets from within a specific package.

B.7 Support and bugs

Since R is a free software project written by volunteers, there are no paid support options available directly from the R Foundation. A number of groups provide commercial support for R and related systems, including Revolution Analytics and RStudio. Extensive resources are also available to help users.

In addition to the manuals, publications, FAQs, newsletter, task views, and books listed on the www.r-project.org web page, there are a number of mailing lists that exist to help answer questions. Because of the volume of postings, it is important to carefully read the posting guide at http://www.r-project.org/posting-guide.html prior to submitting a question. These guidelines are intended to help leverage the value of the list, to avoid embarrassment, and to optimize the allocation of limited resources to technical issues.

As in any general purpose statistical software package, some bugs exist. More information about the process of determining whether and how to report a problem can be found using help(bug.report) (please also review the R FAQ).
Appendix C

The HELP study dataset

C.1 Background on the HELP study

Data from the HELP (Health Evaluation and Linkage to Primary Care) study are used to illustrate many of the entries in R and SAS. The HELP study was a clinical trial for adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care. Funding for the HELP study was provided by the National Institute on Alcohol Abuse and Alcoholism (R01-AA10870, Samet PI) and the National Institute on Drug Abuse (R01-DA10019, Samet PI).

Eligible subjects were adults, who spoke Spanish or English, reported alcohol, heroin, or cocaine as their first or second drug of choice, and either resided in proximity to the primary care clinic to which they would be referred, or were homeless. Patients with established primary care relationships they planned to continue, significant dementia, specific plans to leave the Boston area that would prevent research participation, failure to provide contact information for tracking purposes, or pregnancy were excluded.

Subjects were interviewed at baseline during their detoxification stay, and follow-up interviews were undertaken every 6 months for 2 years. A variety of continuous, count, discrete, and survival time predictors and outcomes were collected at each of these five occasions. The details of the randomized trial along with the results from a series of additional analyses have been published [149, 138, 76, 103, 88, 148, 147, 165, 95, 201].

C.2 Roadmap to analyses of the HELP dataset

Table C.1 summarizes the analyses illustrated using the HELP dataset. These analyses are intended to help illustrate the methods described in the book.
Interested readers are encouraged to review the published data from the HELP study for substantive analyses.

Table C.1: Analyses undertaken using the HELP dataset

Description | section (page)
Data input and output | 2.6.1 (p. 39)
Summarize data contents | 2.6.1 (p. 39)
Data display | 2.6.2 (p. 43)
Derived variables and data manipulation | 2.6.3 (p. 45)
Sorting and subsetting | 2.6.4 (p. 51)
Summary statistics | 5.7.1 (p. 97)
Exploratory data analysis | 5.7.1 (p. 97)
Bivariate relationship | 5.7.2 (p. 101)
Contingency tables | 5.7.3 (p. 103)
Two-sample tests | 5.7.4 (p. 107)
Survival analysis (logrank test) | 5.7.5 (p. 112)
Scatterplot with smooth fit | 6.6.1 (p. 129)
Linear regression with interaction | 6.6.2 (p. 130)
Regression diagnostics | 6.6.3 (p. 135)
Fitting stratified regression models | 6.6.4 (p. 138)
Two-way analysis of variance (ANOVA) | 6.6.5 (p. 139)
Multiple comparisons | 6.6.6 (p. 144)
Contrasts | 6.6.7 (p. 146)
Logistic regression | 7.10.1 (p. 172)
Poisson regression | 7.10.2 (p. 176)
Zero-inflated Poisson regression | 7.10.3 (p. 178)
Negative binomial regression | 7.10.4 (p. 180)
Quantile regression | 7.10.5 (p. 181)
Ordinal logit | 7.10.6 (p. 182)
Multinomial logit | 7.10.7 (p. 183)
Generalized additive model | 7.10.8 (p. 185)
Reshaping datasets | 7.10.9 (p. 187)
General linear model for correlated data | 7.10.10 (p. 190)
Random effects model | 7.10.11 (p. 193)
Generalized estimating equations model | 7.10.12 (p. 197)
Generalized linear mixed model | 7.10.13 (p. 199)
Proportional hazards regression model | 7.10.14 (p. 200)
Cronbach α | 7.10.15 (p. 201)
Factor analysis | 7.10.16 (p. 202)
Recursive partitioning | 7.10.17 (p. 205)
Linear discriminant analysis | 7.10.18 (p. 206)
Hierarchical clustering | 7.10.19 (p. 208)
Scatterplot with multiple y axes | 8.7.1 (p. 230)
Conditioning plot | 8.7.2 (p. 232)
Scatterplot with marginal histogram | 8.7.3 (p. 232)
Kaplan–Meier plot | 8.7.4 (p. 234)
ROC curve | 8.7.5 (p. 235)
Pairs plot | 8.7.6 (p.
236)
Visualize correlation matrix | 8.7.7 (p. 238)
By group processing | 11.1 (p. 283)
Bayesian regression | 11.4.1 (p. 290)
Propensity score modeling | 11.4.2 (p. 296)
Multiple imputation | 11.4.4 (p. 306)

C.3 Detailed description of the dataset

The Institutional Review Board of Boston University Medical Center approved all aspects of the study, including the creation of the de-identified dataset. Additional privacy protection was secured by the issuance of a Certificate of Confidentiality by the Department of Health and Human Services.

A de-identified dataset containing the variables utilized in the end of chapter examples is available for download at the book web site: http://www.amherst.edu/~nhorton/sasr2/datasets/help.csv.

Variables included in the HELP dataset are described in Table C.2. A full copy of the study instruments can be found at http://www.amherst.edu/~nhorton/help.

Table C.2: Annotated description of variables in the HELP dataset

VARIABLE | DESCRIPTION | VALUES | NOTE
a15a | number of nights in overnight shelter in past 6 months | 0–180 | see also homeless
a15b | number of nights on the street in past 6 months | 0–180 | see also homeless
age | age at baseline (in years) | 19–60 |
anysubstatus | use of any substance post-detox | 0=no, 1=yes | see also daysanysub
cesd* | Center for Epidemiologic Studies Depression scale | 0–60 | higher scores indicate more depressive symptoms; see also f1a–f1t
d1 | how many times hospitalized for medical problems (lifetime) | 0–100 |
daysanysub | time (in days) to first use of any substance post-detox | 0–268 | see also anysubstatus
daysdrink | time (in days) to first alcoholic drink post-detox | 0–270 | see also drinkstatus
dayslink | time (in days) to linkage to primary care | 0–456 | see also linkstatus
drinkstatus | use of alcohol post-detox | 0=no, 1=yes | see also daysdrink
drugrisk* | Risk-Assessment Battery (RAB) drug risk score | 0–21 | higher scores indicate riskier behavior; see also
sexrisk e2b∗ number of times in past 6 months entered a detox program 1–21 f1a I was bothered by things that usually don’t bother me 0–3# f1b I did not feel like eating; my ap- petite was poor 0–3# f1c I felt that I could not shake oﬀ the blues even with help from my family or friends 0–3# f1d I felt that I was just as good as other people 0–3# i i “book” — 2014/5/24 — 9:57 — page 382 — #404 i i i i i i 382 APPENDIX C. THE HELP STUDY DATASET f1e I had trouble keeping my mind on what I was doing 0–3# f1f I felt depressed 0–3# f1g I felt that everything I did was an eﬀort 0–3# f1h I felt hopeful about the future 0–3# f1i I thought my life had been a fail- ure 0–3# f1j I felt fearful 0–3# f1k My sleep was restless 0–3# f1l I was happy 0–3# f1m I talked less than usual 0–3# f1n I felt lonely 0–3# f1o People were unfriendly 0–3# f1p I enjoyed life 0–3# f1q I had crying spells 0–3# f1r I felt sad 0–3# f1s I felt that people dislike me 0–3# f1t I could not get going 0–3# female gender of respondent 0=male, 1=female g1b∗ experienced serious thoughts of suicide (last 30 days) 0=no, 1=yes homeless∗ 1 or more nights on the street or shelter in past 6 months 0=no, 1=yes see also a15a and a15b i1∗ average number of drinks (stan- dard units) consumed per day (in the past 30 days) 0–142 see also i2 i2 maximum number of drinks (standard units) consumed per day (in the past 30 days) 0–184 see also i1 id random subject identiﬁer 1–470 indtot∗ Inventory of Drug Use Conse- quences (InDUC) total score 4–45 linkstatus post-detox linkage to primary care 0=no, 1=yes see also dayslink mcs∗ SF-36 Mental Component Score 7-62 higher scores indicate better functioning; see also pcs pcrec∗ number of primary care visits in past 6 months 0–2 see also linkstatus, not observed at base- line pcs∗ SF-36 Physical Component Score 14-75 higher scores indicate better functioning; see also mcs pss fr perceived social supports (friends) 0–14 satreat any BSAS substance abuse treat- ment at baseline 0=no, 
1=yes i i “book” — 2014/5/24 — 9:57 — page 383 — #405 i i i i i i C.3. DETAILED DESCRIPTION OF THE DATASET 383 sexrisk∗ Risk-Assessment Battery (RAB) sex risk score 0–21 higher scores indicate riskier behavior; see also drugrisk substance primary substance of abuse alcohol, cocaine, or heroin treat randomization group 0=usual care, 1=HELP clinic Notes: Observed range is provided (at baseline) for continuous variables. * denotes variables measured at baseline and followup (e.g., cesd is baseline measure, cesd1 is measured at 6 months, and cesd4 is measured at 24 months). #: For each of the 20 items in HELP section F1 (CESD), respondents were asked to indicate how often they behaved this way during the past week (0 = rarely or none of the time, less than 1 day; 1 = some or a little of the time, 1–2 days; 2 = occasionally or a moderate amount of time, 3–4 days; or 3 = most or all of the time, 5–7 days); items f1d, f1h, f1l, and f1p were reverse coded. i i “book” — 2014/5/24 — 9:57 — page 384 — #406 i i i i i i i i “book” — 2014/5/24 — 9:57 — page 385 — #407 i i i i i i References [1] D. Adams. The Hitchhiker’s Guide to the Galaxy. Pan Books, 1979. [2] D. Adler. vioplot: Violin plot, 2005. R package version 0.2. [3] C. Agostinelli and U. Lund. R package circular: Circular Statistics (version 0.4-7), 2013. [4] A. Agresti. Categorical Data Analysis. John Wiley & Sons, Hoboken, NJ, 2002. [5] J. Albert. Bayesian Computation with R. Springer, New York, 2008. [6] J. J. Allaire, J. Horner, V. Marti, and N. Porte. markdown: Markdown rendering for R, 2013. R package version 0.6.3. [7] P. D. Allison. Survival Analysis Using SAS: A Practical Guide (second edition). SAS Institute, 2010. [8] D. G. Altman and J.M. Bland. Measurement in medicine: the analysis of method comparison studies. The Statistician, 32:307–317, 1983. [9] T. J. Aragon. epitools: Epidemiology Tools, 2012. R package version 0.5-7. [10] B. Auguie. gridExtra: Functions in Grid Graphics, 2012. 
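The reverse coding described in the note on the CESD items can be sketched in R. This is only a minimal illustration, assuming the item columns are named f1a through f1t as in Table C.2, that each item is coded 0–3, and that the total is the sum of the 20 items after the four positively worded items are reversed (which matches the 0–60 range listed for cesd). The helper name `cesd_total` and the toy data frame are hypothetical, and whether the items in help.csv are stored before or after reverse coding should be checked against the study instruments.

```r
# Names of the 20 CESD items (f1a ... f1t) and the four reverse-coded ones
items = paste0("f1", letters[1:20])
reversed = c("f1d", "f1h", "f1l", "f1p")

# Hypothetical helper: total score after reverse coding (3 - x) the
# positively worded items; assumes every item is coded 0-3
cesd_total = function(df) {
  scored = df[, items]
  scored[, reversed] = 3 - scored[, reversed]
  rowSums(scored)
}

# Toy data (not from the HELP study): respondent 1 answers 0 to every
# item, respondent 2 answers 3 to every item
toy = as.data.frame(matrix(0, nrow = 2, ncol = 20,
                           dimnames = list(NULL, items)))
toy[2, ] = 3

cesd_total(toy)   # respondent 1 scores 12, respondent 2 scores 48

# With the real file one might instead start from (requires network access):
# ds = read.csv("http://www.amherst.edu/~nhorton/sasr2/datasets/help.csv")
```

The all-zero respondent scores 12 rather than 0 because answering "rarely" to the four positively worded items (f1d, f1h, f1l, f1p) counts toward depressive symptoms once those items are reversed.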
Retaining the same accessible format as the popular first edition, SAS and R: Data Management, Statistical Analysis, and Graphics, Second Edition explains how to easily perform an analytical task in both SAS and R, without having to navigate through the extensive, idiosyncratic, and sometimes unwieldy software documentation. The book covers many common tasks, such as data management, descriptive summaries, inferential procedures, regression analysis, and graphics, along with more complex applications.

This edition now covers RStudio, a powerful and easy-to-use interface for R. It incorporates a number of additional topics, including application program interfaces (APIs), database management systems, reproducible analysis tools, Markov chain Monte Carlo (MCMC) methods, and finite mixture models. It also includes extended examples of simulations and many new examples.

Through the extensive indexing and cross-referencing, users can directly find and implement the material they need. SAS users can look up tasks in the SAS index and then find the associated R code while R users can benefit from the R index in a similar manner. Numerous example analyses demonstrate the code in action and facilitate further exploration.
Features

- Presents parallel examples in SAS and R to demonstrate how to use the software and derive identical answers regardless of software choice
- Takes users through the process of statistical coding from beginning to end
- Contains worked examples of basic and complex tasks, offering solutions to stumbling blocks often encountered by new users
- Includes an index for each software, allowing users to easily locate procedures
- Shows how RStudio can be used as a powerful, straightforward interface for R
- Covers APIs, reproducible analysis, database management systems, MCMC methods, and finite mixture models
- Incorporates extensive examples of simulations
- Provides the SAS and R example code, datasets, and more online

Ken Kleinman and Nicholas J. Horton
