"R" is a free, open source computer package for doing scientific graphs and calculations. R was written by scientists, for scientists to use in their work. R is taking the scientific world by storm, and is making rapid inroads among corporations that depend on quantitative and analytical results.
R is incredibly powerful and amazingly easy to use. There is a myth that R has a "steep learning curve." In fact, for ordinary purposes such as graphs and straightforward scientific calculations, R is much easier to use than a graphing calculator or a spreadsheet. As well, the graphical displays that R can produce are of fantastic quality.
R is the closest thing to a free lunch that I can readily imagine.
Did I mention that R is free?
Versions of R are available for Windows, Mac OS, and Unix or Linux. If you have a computer, you can download and install R from the R Project website:
http://www.r-project.org/
Installation is easier than installing a computer game. R is free to institutions as well, so see your IT people at your school or business about making it available in your work environment. If your computer is really old, mouse around a bit at the R Project website: you likely will find that one of the older versions of R posted there is suitable for your operating system (R was first written in the 1990s).
So, it has occurred to me that R could be used to great advantage in science and math courses in high school and early college.
This is in part a self-help diary, in the vein of DKOS diaries here that have helped many of us enjoy breadmaking, photography, listening to classical music, etc. Here, I first provide a short tutorial to get you started using R, so that you can have an idea of what the hullabaloo is all about. Whether you are a teacher, or a student, or neither, I believe you will find something of interest in R. Next, after you are readied by the tutorial, comes the argumentative part of this diary. My proposition is that using R instead of calculators or spreadsheets in science and math classrooms might yield great advantages in student learning. I close the diary with a brief list of instructional resources about R (books and free online tutorials), for anyone who is inspired enough to explore the topic further.
And, because R can enhance the spread and understanding of quantitative evidence, R skills can help sharpen political discourse. But, like all knowledge worth learning at this site, good things happen only for those who will venture forward through the orange mist of curiosity.
A brief R tutorial.
Install R on your system and type along! If you do not want to take the step of installation right now, you can visit the online R website, where there is a box to enter R statements and see the results in your web browser. Yes, you can even use the online R site on your smart phone!
Know that if you install R, it is the real deal. It is not shareware, freetrialware, malware, secretly-sell-your-info-ware. All laws of economics say that R shouldn't exist. R is the internet, and the scientific world, at its very best.
Installing R will put an icon, a blue "R," on the computer desktop or in the program list. Find it and click it to start the program. The R program window will appear, and the “R console”, a window within the R program window, will pop up.
On the console you will see the prompt ">" followed by a cursor. R is waiting for your instructions! Just type the commands that appear at the prompt as you go through this tutorial.
If you are using the online R website, there will be a box but no prompt. Type an R statement into the box (without typing a prompt) and click the "submit" button to see the results. The online R site will be slow but serviceable; the site has to reload whenever you submit a statement or statements for processing.
Ready?
The simplest way to use R is as an ultra-super-duper-powerful calculator. At the prompt, type: 6+8 and hit the Enter key:
> 6+8
[1] 14
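The answer is 14. The [1] in front of it is a position marker: it says that 14 is the first number in the answer. Answers to complicated questions can contain many numbers (a long list of results, for instance) that wrap across several lines on the screen, and so R starts each printed line with the position of that line's first number.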
Try subtraction (each time in this tutorial, type the characters after the prompt and hit Enter; the answer you should see is then printed in this tutorial on the next line):
> 6-8
[1] -2
The minus sign can be used as a sign for negative numbers:
> 6+-3
[1] 3
Multiplication is an asterisk:
> 6*8
[1] 48
Division is a forward slash, so 6 divided by 8 is:
> 6/8
[1] 0.75
Raising to a power. Recall that “six to the eighth power”, written 6⁸, means 6×6×6×6×6×6×6×6. The caret symbol "^" denotes raising to a power:
> 6^8
[1] 1679616
You can put a string of calculations all in one statement. The calculations with * and / are done first, then + and -, and the calculations are done left to right:
> 6+8*2-12/4-1
[1] 18
Raising to a power is done first, even before multiplication and division:
> 2+5*3^2
[1] 47
The order of operations can be altered by using parentheses:
> (6+8)*2-(12/4-1)
[1] 26
Parentheses inside of parentheses will be done first. Just make sure that no right-wing parenthesis “)” is allowed to stand without first being countered by a progressive parenthesis “(”:
> (6+8)*(2-12/(4-1))
[1] -28
R will store everything for you. Just give your calculations names:
> liz=6+8
> scott=5-1
> liz-scott
[1] 10
R names are sensitive to lower and upper case. In R, the name liz is different from Liz.
Are liz and scott still there?
> liz
[1] 14
> scott
[1] 4
They will disappear when you exit the R program, if you do not save the “workspace” on exiting.
If you use the name for something different, R will erase the old value:
> scott=9
> scott
[1] 9
Interestingly, the symbol = in R (and in many computer programming languages) does not mean "equals". Instead, it means:
calculate what is on the right, and store the result using the name on the left.
This convention was designed originally by computer scientists seemingly to irritate their former math teachers. For example, liz=liz+1 is a statement that mathematicians do not like, because no number exists that, when you add one to it, you get the same number! However, the statement makes perfect sense to computer programmers. It means:
take the old value of liz, add 1, and store the result as the new value of liz.
A statement with an equals sign is called an "assignment statement," assigning the value resulting from the calculation on the right to the storage location on the left. Try it:
> liz=liz+1
> liz
[1] 15
Actually, the original R syntax (or typographic rule) for assignment statements is liz<-liz+1, and it is still in wide use. The "<-" (a "less than" symbol followed by a hyphen) is supposed to look like a little arrow pointing left. Many websites and books about R use this syntax, and it works fine in current versions of R. Try it:
> liz<-liz+scott
> liz
[1] 24
The assignment statement calculated and stored 24 as the new value of liz. The scientists responsible for R eventually gave up trying to be mathematical purists and allowed the equals sign as an alternative for assignment statements, in order to be consistent with most other computer languages.
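As a quick check that the two assignment styles behave the same way, here is a minimal sketch. The names a and b are made up for this illustration, and the lines are written as a script (no prompts), so they can be pasted in as they are. Anything after a # on a line is a comment that R ignores.

```r
# Two ways to write the same assignment (a and b are illustrative names)
a = 6 + 8      # equals-sign style
b <- 6 + 8     # arrow style
identical(a, b)    # TRUE when both names hold the same value

# Updating a stored value works the same way with either symbol
a = a + 1
b <- b + 1
c(a, b)            # both are now 15
```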
OK, now we are ready to unleash some of the power of R!
The first item of pure cool is this: R can work with whole "lists" of numbers. Try the following:
> x=c(4,-6,3,5,2,-1,0)
> y=4
> x+y
[1] 8 -2 7 9 6 3 4
The c() command in the first line above says "combine" the numbers 4, -6, 3, 5, 2, -1, and 0 into a list. We named the list x. R (and mathematics) has a special term for a list of numbers: a vector. Here, x is a vector with seven elements. The value of y is 4. The expression x+y added 4 to every value in x! But what if y were a vector like x?
> y=c(1,2,3,4,5,6,7)
> z=x+y
> z
[1] 5 -4 6 9 7 5 7
Each number in the vector y is added to the corresponding number in x! Try multiplication:
> x=c(3290,4388,-29074,23)
> y=c(100397236, -29883, 0.93772, 22)
> x*y
[1] 3.303069e+11 -1.311266e+08 -2.726327e+04 5.060000e+02
There are a few things to note here. First, when writing R statements, do not use commas within large numbers to group the digits in threes. Rather, commas are used in R for other things, such as to separate numbers in the c() (combine) command. Second, spaces in between the elements are fine, as long as the commas are there to separate them (spaces in R statements are mostly ignored, but do not separate the digits in a number). Third, answers with very large or very small numbers are printed by R in computer-scientific notation: 3.303069e+11 means 3.303069×10¹¹ (the "e" signifies "exponent"). Fourth, do not show this to your kid in sixth grade who is laboring through arithmetic drills!
All of the arithmetic operations, addition, subtraction, multiplication, division, and power, are "vectorized" in R. We have seen that if you operate with a single number and a vector, then the single number operates on each element in the vector. If you operate with two vectors of the same length, then every element of the first vector operates on the corresponding element of the second vector.
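What about two vectors of different lengths? R handles that case by reusing ("recycling") the shorter vector, which is really the same trick it uses for a single number. The following is a small sketch with made-up vector names, written as a script without prompts:

```r
# Same-length vectors: element-by-element arithmetic
long  <- c(1, 2, 3, 4)
same_length <- long * c(2, 2, 2, 2)    # 2 4 6 8

# Different lengths: the shorter vector is reused ("recycled")
short <- c(10, 20)
recycled <- long + short               # 11 22 13 24

same_length
recycled
```

If the longer length is not a multiple of the shorter one, R still does the calculation but prints a warning, which is usually a sign of a typing mistake.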
The order of operations is the same for vector arithmetic, and parentheses may be used in the usual way to indicate which calculations to perform first:
> mitt=c(1,2,3)
> barry=c(-1,1,.5)
> 2*(mitt+barry)
[1] 0 6 7
> 2*mitt+barry
[1] 1.0 5.0 6.5
If you make a mistake in typing, just type the line again. R will calculate and store the newer version. Also, if a line is long, you can continue it on the next line by hitting the Enter key at a place where the R command is obviously incomplete. R will respond with a different prompt that looks like a plus sign; just continue the R command at the new prompt and hit Enter when the command is complete:
> mitt=c(-1,1,
+ .5)
> mitt
[1] -1.0 1.0 0.5
A special vector can be built with a colon ":" in the following way:
> j=0:10
> j
[1] 0 1 2 3 4 5 6 7 8 9 10
Here j was defined as a vector consisting of all the integers from 0 to 10. One can go backwards if one wants:
> k=5:-5
> k
[1] 5 4 3 2 1 0 -1 -2 -3 -4 -5
Want to see the powers of 2 from 2⁰ to 2²⁰? Of course you do:
> j=0:20
> 2^j
[1] 1 2 4 8 16 32 64 128 256
[10] 512 1024 2048 4096 8192 16384 32768 65536 131072
[19] 262144 524288 1048576
You could write the previous two statements just as one statement, 2^(0:20), to get the same result.
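The colon always counts in steps of 1. For other step sizes, R has a built-in function called seq(). Here is a brief sketch; the vector names are just illustrations:

```r
# seq() builds a vector with a chosen step size or a chosen number of values
quarters <- seq(0, 1, by = 0.25)          # 0.00 0.25 0.50 0.75 1.00
evens    <- seq(0, 20, by = 2)            # 0 2 4 ... 20
five_pts <- seq(0, 10, length.out = 5)    # 5 evenly spaced values from 0 to 10

quarters
2^evens      # powers of 2 with even exponents only
five_pts
```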
Take note that the text syntax in R for writing math expressions forms a completely unambiguous way of communicating about math problems via instant messaging, text messaging, or email.
R has pretty much all scientific functions built in, just like a scientific calculator. Unlike a calculator, though, the functions in R accept vector arguments and return vector or single-number answers as appropriate:
> x=0:10
> sum(x)
[1] 55
> mean(x)
[1] 5
> sqrt(x)
[1] 0.000000 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[9] 2.828427 3.000000 3.162278
> x^(1/2)
[1] 0.000000 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[9] 2.828427 3.000000 3.162278
> sin(x)
[1] 0.0000000 0.8414710 0.9092974 0.1411200 -0.7568025 -0.9589243
[7] -0.2794155 0.6569866 0.9893582 0.4121185 -0.5440211
> exp(x)
[1] 1.000000 2.718282 7.389056 20.085537 54.598150
[6] 148.413159 403.428793 1096.633158 2980.957987 8103.083928
[11] 22026.465795
> log(x)
[1] -Inf 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595
[8] 1.9459101 2.0794415 2.1972246 2.3025851
> log10(x)
[1] -Inf 0.0000000 0.3010300 0.4771213 0.6020600 0.6989700 0.7781513
[8] 0.8450980 0.9030900 0.9542425 1.0000000
In the above, sin(x) is the trigonometric sine function, with the angle measure x in radians, where 1 radian is around 57 degrees (360 degrees is 2π radians), and exp(x) is the exponential function eˣ, where e is the special irrational number e=2.71828... from calculus. Also, log(x) and log10(x) are, respectively, the natural (base e) logarithm and the base 10 logarithm of x. The logarithm of zero is undefined; as x shrinks toward zero the logarithm heads toward minus infinity, which R prints as the code -Inf.
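If you think in degrees rather than radians, a one-line conversion does the job, since R has the number π built in under the name pi. The sketch below also shows logarithms in a base other than e or 10; the names degrees and radians are illustrative:

```r
# sin() expects radians, so convert first: radians = degrees * pi / 180
degrees <- c(0, 30, 90)
radians <- degrees * pi / 180
sin(radians)            # approximately 0, 0.5, and 1

# log() is the natural logarithm, so it undoes exp()
log(exp(2))             # 2

# Other bases are available through the base argument
log(8, base = 2)        # 3, because 2^3 is 8
```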
The arguments of functions can themselves be R calculations:
> y=.2*exp(-.2*x)
> y
[1] 0.20000000 0.16374615 0.13406401 0.10976233 0.08986579 0.07357589
[7] 0.06023884 0.04931939 0.04037930 0.03305978 0.02706706
The next item of pure cool is graphing. You might recognize the immediately preceding calculation as an exponential decay function. We can graph it easily, using the fact that many graphical displays are built into R as functions. For instance, the plot() statement can graph two vectors, here x and y, on horizontal and vertical axes. Type the following at the R prompt, being careful to note that type="l" in the statement uses a lower case letter L, not a one:
> plot(x,y,type="l")
Impressed yet?
A window with a graph should have appeared. The plot() command used the vector x on the horizontal axis and the vector y on the vertical axis. The type="l" part of the statement specified a line plot, connecting all the points with lines while not drawing any symbol for the points themselves.
If you click on the graph window to make it active, you can save the graph in any of various graphical formats (jpg, png, eps, etc.) using the pull-down menu at the top of the R window. You can also simply copy it to the clipboard and paste it into a document in your favorite word processor.
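The menu is not the only route. A graph can also be saved with R statements, which is handy once you start collecting statements into scripts. The sketch below rebuilds the same x and y and uses a made-up file name, decay.pdf; the file lands in whatever folder R is currently working in.

```r
# Recreate the exponential decay data from the tutorial
x <- 0:10
y <- 0.2 * exp(-0.2 * x)

# Open a PDF file in place of the graph window, draw into it, then close it
pdf("decay.pdf")           # "decay.pdf" is an illustrative file name
plot(x, y, type = "l")
dev.off()                  # closing the device finishes writing the file

getwd()                    # shows the folder where the file was saved
```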
Everything about this graph can be customized: the labels, the plotting symbols, the axes, the line thicknesses, the line types, colors, and so on. Customizations are usually entered as a list of options in the plot() statement, separated by commas. For instance, close the graph window and type the following at the prompt:
> plot(x,y,type="l",xlab="time in years",ylab="moles radioisotope remaining")
The options xlab= and ylab= give text strings for labelling the x and y axes.
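A few more of the common options, as a sketch; the particular title, color, and symbol choices here are only for illustration:

```r
x <- 0:10
y <- 0.2 * exp(-0.2 * x)

plot(x, y,
     type = "b",                             # both points and connecting lines
     main = "Exponential decay",             # title above the graph
     xlab = "time in years",
     ylab = "moles radioisotope remaining",
     col  = "blue",                          # color of the points and lines
     lwd  = 2,                               # line width (1 is the default)
     pch  = 19)                              # plotting symbol: filled circle
```

Notice that the statement is spread over several lines. R keeps reading until the closing parenthesis, which makes long lists of options much easier to edit.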
Let's do some politics! In 1999, a city council election took place in my home town. Candidates ran at large for two seats, with the top two vote-getters winning the seats. In the following R statements, dollars contains the declared amounts of money spent by each of the 7 candidates, and votes contains the numbers of votes obtained by each candidate.
> dollars=c(0,0,404,338,583,1992,1849)
> votes=c(159,305,706,912,1159,1228,1322)
> plot(dollars,votes,type="p")
The type="p" option gives us a point plot (known as a scatter plot), with points drawn without connecting lines. The resulting graph is a rather depressing although not entirely unexpected portrait of local politics. Yard signs evidently win local elections.
The graph window is actually "open" in R, waiting for you to add further points, curves, annotations, and so on. For example, we can superimpose a "diminishing returns" equation on the graph. There are only so many votes that can be garnered by spending money on yard signs, and every additional dollar spent will yield fewer and fewer votes on average.
An equation that expresses the notion of diminishing returns is
v = md/(k+d)
where v is votes received, d is dollars spent, and m and k are constants that differ from election to election (m is the maximum votes that can be garnered by money, and k is the amount of money that must be spent to get half of m). I had earlier "fitted" this equation to the data in the graph (using R, of course!). By "fitted" I mean finding the numerical values of m and k producing the best-fitting curve. The values were: m=1454.6 , k=249.5.
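For the curious, here is one way such a fit could be done. This is only a sketch: it assumes "best-fitting" means least squares (the smallest sum of squared differences between the curve and the data) and uses R's built-in nls() function, with starting guesses that I picked by eye. Under that assumption the estimates should land close to the values quoted above.

```r
# Campaign data from the tutorial
dollars <- c(0, 0, 404, 338, 583, 1992, 1849)
votes   <- c(159, 305, 706, 912, 1159, 1228, 1322)

# Least-squares fit of v = m*d/(k+d); nls() needs rough starting guesses
fit <- nls(votes ~ m * dollars / (k + dollars),
           start = list(m = 1400, k = 250))    # starting guesses are assumptions

coef(fit)    # estimated values of m and k
```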
Leave the graph window open, but make the R command window active by clicking on it. Type:
> d=0:2000
> m=1454.6
> k=249.5
> v=m*d/(k+d)
> points(d,v,type="l")
You should see the diminishing returns model curve superimposed on the vote data scatterplot. In the above statements, d is a vector containing numbers 0, 1, 2, ..., 2000 (encompassing the range of dollars spent by the candidates). Because d is a vector, v is a vector whose elements are calculated in the statement v=m*d/(k+d) for every element in d. The points() statement adds points to an open, existing graph, and its use is similar to a plot() statement. The data and the simple diminishing returns model show that the two winning candidates defeated the runner-up by spending three times as much money to gain the winning margin of just a handful of extra votes.
This tutorial has barely scratched the surface of the things available in R. Here are just a few highlights of what awaits if you choose to explore further: (1) You can write and edit long lists of R statements, called "scripts," and have R process the statements in sequence all at once (the box at the online web-based R site actually accepts scripts). (2) Data stored in files can be easily brought into R for analysis (the columns of a data file become vectors in R). (3) You can write your own functions for you and others to use. (4) Every imaginable kind of statistical analysis, graphical display, probability calculation, matrix calculation, and random number simulation is built into R as a special function. (5) R has the usual features of many programming languages: loops, logical statements, conditional statements, indexing of vectors, matrices, and data sets, etc. (6) R has a huge and ever-growing supply of routines contributed by working scientists for specialty analyses in many scientific fields.
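To give a taste of a couple of these, here is a short script that defines a function for the diminishing returns curve from the election example and then uses a loop with a conditional statement. The function name, the spending levels, and the 1000-vote threshold are all made up for this sketch.

```r
# A user-written function for the diminishing returns curve v = m*d/(k+d)
# (the function name and the default values of m and k are illustrative)
predicted_votes <- function(d, m = 1454.6, k = 249.5) {
  m * d / (k + d)
}

# Because the arithmetic is vectorized, the function accepts a whole vector
predicted_votes(c(0, 250, 500, 1000, 2000))

# A loop with a conditional statement: which spending levels are
# predicted to reach at least 1000 votes?
for (d in c(250, 500, 1000)) {
  v <- predicted_votes(d)
  if (v >= 1000) {
    cat(d, "dollars: predicted", round(v), "votes (1000 or more)\n")
  } else {
    cat(d, "dollars: predicted", round(v), "votes (under 1000)\n")
  }
}
```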
The Fourth R?
Why not use R instead of calculators and spreadsheets in high school and early college STEM classes? (STEM is education-ese for "science, technology, engineering, and mathematics")
The graphing calculators in particular are hard to learn. Each calculator brand is different, and a student rarely actually masters the daunting sequence of keystrokes needed to produce even the simplest graph (the student is always wedded to the user manual). The graphs themselves produced by a graphing calculator are low-resolution toys that are near-useless. At great cost of class time, an algebra class might go through the steps of graphing a quadratic equation or a sine wave, but the calculators are rarely used subsequently as real tools for real problems. The graphing calculators themselves are expensive and are a burden to home and school budgets.
The most widely used spreadsheet (let's call it X-Hell by Voldesoft) is of course proprietary. The quality of the graphical displays produced by the proprietary spreadsheet is reviled by many working scientists, because the common default graphing options are unacceptable in scientific journals (axis label sizes and fonts, line thicknesses, tic marks, plotting symbols, etc.). Also, for many years, the proprietary spreadsheet has had statistical routines that are known to be incompetently programmed.
One can use an open source spreadsheet, but then one must spend time dealing with installation, compatibility issues and uneven documentation. Most instructors would want class time to be spent on understanding the math/science concepts rather than on wrestling with software peculiarities.
The myth that R is difficult to learn stems greatly from the sophistication and complexity of the applications that are being tackled with R: data mining, molecular genetics, econometrics, spatial statistics, hierarchical statistical models... the list is endless. R has a well-deserved reputation for being a leading statistical analysis package, and people coming to R are often also in the throes of learning statistics for the first time. The introductions to R in such courses are laced with statistical inference concepts and try to do too much, too soon. To a student studying statistics, R can seem huge and insurmountable.
However, high school and college students in basic science and math courses need only a small portion of what is available in R for their computation tasks. Most such courses do not have statistics as a prerequisite and do not use much statistics at all! A few easily-learned things in R go a long way: graphing, calculations with functions, data input and output, simple summary statistics. Glance through the concepts in any high school algebra textbook, and if you know even a little R you will see at once how easy it is to use R to implement the concepts in real scientific examples.
There are five really intriguing aspects of R that to me seem to have almost revolutionary potential for high school and early college STEM education:
(1) R use could help greatly improve understanding of science and math among students. With R, scientific calculations and graphs are fun and easy to produce. A student using R is freed to focus on the scientific and mathematical concepts without having to pore through a manual of daunting lists of calculator keystroke instructions. The students would be analyzing data and depicting equations just as scientists are doing in labs all over the world.
(2) R could be used across a wide variety of STEM courses, promoting the integration of STEM subjects that has been much discussed in principle but elusive in practice. R skills would follow a student from course to course.
(3) R is probably the most universally available computational tool (aside from counting on fingers). Many students nowadays use Facebook somehow or other, and schools and colleges have institutional machines available to the students. Versions of R exist for most platforms, so R could be made instantly available to every student in every course. No additional proprietary software or calculators have to be purchased. Once R is installed on a machine, the internet is not needed to use R. Why not use all those school computers for actually, uh, computing stuff?
(4) R invites collaboration among students. Like scientists sharing their applications, students can work in groups to conduct projects in R, build R scripts (lists of R commands), and improve each others’ work, and they can collect, accumulate, and just plain show off exemplary graphical analyses. Results on a computer screen are much easier to view in groups than results on a calculator, and R scripts are much easier to save and alter cooperatively than are the calculator keystroke lists. At home, students can message each other about their latest R scripts, working cooperatively online. Computer-savvy students will turn up additional interesting and useful R resources in their work on class projects, and instructors can then have them share their discoveries with the class members! Every new class can take what previous classes have done and build new accomplishments upon the old. R builds on itself.
(5) R skills will follow a student to college and professional life. College statistics and advanced science courses are increasingly teaching R, and a student coming in with some R knowledge will have a head start. Many forward-looking technology companies are now using R, and R skills are becoming a valuable professional credential.
You will no doubt have recognized that writing lists of R statements suspiciously resembles computer programming. Indeed, R can be considered to be a computer programming language, like Basic, C or Java, except that most of the exasperating and painstakingly detailed tasks in programming (like coloring each pixel on the screen for a graph, or programming the mathematical calculations for the area under a bell shaped curve) are pre-built into R as functions. The built-in functions and vectorization of R reduce even massive computational tasks to one-line statements. The student is freed to focus on scientific and mathematical concepts, while at the same time getting exposed painlessly to a few of the essential ideas of computer programming (such as assignment statements, data stored under a "name", sequences of statements, loops, functions).
Certainly, today's science courses are so packed with subject matter, and the standardized testing regimes are so rigid, that there is hardly room left for any innovation in the classroom. There is no time for playing around with computers, and even if there were, students can't use computers anyway on their standardized tests. I have no answer to this criticism of the idea of using R in STEM education, other than to say that passivity means surrender.
Could R help test scores? I don't know. I would like to think so, and I would like to think that the notion is worthwhile enough to launch some serious experimentation.
We might hypothesize that R could help instructors do a few things really well in science courses, by making some serious examples of real scientific applications accessible for study. Simulations that generate millions of random numbers? Calculate the orbit of Earth around the Sun using Newton's laws and no calculus? Piece of cake if you have R.
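Here is what a million-random-number simulation can look like, as a sketch. The dice example and the variable names are my own choices for illustration: roll a pair of dice a million times and see how often the total is 7, then compare with the exact answer from counting (6 of the 36 possible outcomes).

```r
set.seed(1)                       # makes the "random" results repeatable
n <- 1000000                      # one million rolls of a pair of dice

die1 <- sample(1:6, n, replace = TRUE)
die2 <- sample(1:6, n, replace = TRUE)

# Fraction of rolls that total 7, next to the exact answer 6/36
simulated <- mean(die1 + die2 == 7)
exact     <- 6 / 36
c(simulated = simulated, exact = exact)
```

The two numbers should agree to about three decimal places. Changing n to 100 shows students how much a small simulation can wander from the exact value.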
As well, we might speculate that R could be a great common thread to help STEM faculty to work together to infuse and integrate math into science and vice versa, and to cooperate around a common theme. Wouldn't this improve student learning of both science and math? Unfortunately, the current standardized testing regimes are pitting teachers against each other rather than promoting cooperation.
Education researchers among us might be able to point us to evidence to support or contradict such hypotheses.
I will add that R might be a potentially valuable resource for the "enlightened" homeschooling community (as opposed to the subset of homeschoolers who are sheltering their offspring from the age of the earth and other scientific/biblical discordances, for whom R could prove quite subversive). As a free computational, analysis, and graphing tool that is used by scientists themselves and that performs those functions better than expensive commercial products, R could open the world of science and mathematics to a young Jeffersonian scholar like no other computer resource. While the public and private schools are frozen in place by political struggles, your young scholar could leapfrog his/her peers to leadership levels of STEM understanding.
I anticipate that some comments will bring up alternative computer products, which are numerous. Python, Matlab, Mathematica, Maple, and of course spreadsheets, among others, are seen in college quantitative science courses. Python is another free scripting language like R that is favored in the physical sciences (with R seen more in the life sciences), but Python's documentation right now is somewhat computer-sciencey and geeky. Matlab, Mathematica, and Maple are (expensive) proprietary products. The computer symbolic algebra of Mathematica and Maple is touted by mathematicians but is not really used much by working scientists. New products are under development (Julia, for instance) that could eventually replace R. But right now, nothing has the critical mass of users and documentation that R has developed.
Below are some URLs for online tutorials about R, as well as links to publishers' websites for some introductory-level books. By way of disclosure, I am the author of one of the books (last on the list), but a student or teacher hardly needs a book to get started using R. Simple online searches will turn up vast additional resources, most of them free. In particular, numerous college courses have posted various R introductions. There are some YouTube videos, and you might even find entire free courses in the growing cloud of MOOCs (massive open online courses offered by EdX, Coursera, Udacity, Khan, and so on) that deal with R in some capacity.
I have added at the end a few R scripts to provide examples of the many uses of R.
Happy R-ing! I will be especially interested in hearing about any positive or negative experiences STEM teachers have had with using R in their courses and about any ideas for R activities in science teaching. If readers who have lasted this far would like to see more, I would gladly do a sequel diary with more cool R stuff.
R Tutorials
Intro to R at the R Project website
Self-Learn Tutorial at the National Center for Ecological Analysis and Synthesis
R for Biologists at U Tennessee
R Tutorial at Illinois State U
R Resource Website at UCLA
R Tutorial (with videos) at U Colorado/Denver
Introductory-Level Instructional Books About R
R in Action
R in a Nutshell
R for Dummies
The R Student Companion
Some example R scripts
Download a script to your computer, and either: (1) open it with a text editor, copy it, and paste it into the R prompt, or (2) open it as a script from within R and run it. Each script should run as is without modification.
Bar graph of counts of M&M colors (suitable for grade school!)
Various types of graphs of US unemployment rate, produced in a figure with panels (reproduces one of the plots featured in this diary)
Matrix projection of the age classes of a threatened species (Northern Spotted Owl)
Calculation of the probability that one or more of the 5 conservative or 4 liberal Supreme Court justices will die by December of Obama's last term (updated from this diary)