The concepts of a variable, its type, and the structure of a data frame are useful because they help guide our thinking about the nature of data. But we need more than definitions. If our goal is to construct a claim with data, we need a tool to aid in the construction. Our tool must be able to do two things: it must be able to store the data and it must be able to perform computations on the data.

In high school, you gained experience with one such tool: the graphing calculator. This fits our needs: you can enter a list of numbers into a graphing calculator like the TI-84, it can store that list, and it can execute computations on it, such as taking its sum. But the types of data that a calculator can store are very limited, as is the volume, as are the options for computation.

In this class, we will use a tool that is far more powerful: the computer language called R. The TI-84 is to R what a tricycle is to a spaceship. One of these tools can bring you to the end of the block; the other to the moon.

R and RStudio

R is one of the most powerful languages for doing statistics and data science. One of the reasons for its power and popularity is that it is both free and open-source. This turns languages like R into something that resembles Wikipedia: a collaborative effort that is constantly evolving. Extensions to the R language have been authored by professional programmers, people working in industry and government, professors, and students like you.

You’ll be writing and running code through an app called RStudio. Beyond writing R code, RStudio allows you to manage your files and author polished documents that weave together code and text. RStudio can be run through a browser and we have set up an account for you that you can access by pointing a browser tab to https://stat20.datahub.berkeley.edu/ or clicking the link in the upper right corner of the course website.
When you log into RStudio, the place where you can type and run R code is called the console.

Figure 1: The R console in RStudio.

As you read through these notes, keep RStudio open in another window to code along at the console.

R as a Calculator

Although R is like a spaceship capable of going to the moon, it’s also more than able to go to the end of the block. Type a sum into the console and press Enter to see the result. All of the arithmetic operations work in R. Each line of code that you run is called a command and the response from R is the output.

Although it is easiest to read code when the numbers are separated from the operator by a single space, it’s not necessary. R ignores spaces around arithmetic operators, so the same commands work with no spaces or with extra spaces. You can add exponents by using ^.

Saving Objects

Whenever you want to save the output of an R command, add an assignment arrow, <-, preceded by a name for the object. When you run such a command, there are two things to notice: no output is printed at the console, and the name of the saved object appears in the Environment tab of RStudio.
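The commands described in this section might look like the following sketch. The particular numbers and the object name answer are illustrative assumptions, not values taken from the notes.

```r
# Arithmetic at the console (illustrative numbers)
1 + 2      # addition
5 - 3      # subtraction
2 * 4      # multiplication
8 / 2      # division

# Spacing around the operator does not change the result
1+2
1     +     2

# Exponents use ^; parentheses control the order of operations
2 ^ 3 + 1      # 2 cubed, then add 1
2 ^ (3 + 1)    # 2 raised to the 4th power

# Save the output with the assignment arrow; nothing is printed
answer <- 2 ^ (3 + 1)

# Typing the name prints what was saved
answer
```

Running the last line prints 16, the value that was stored under the name answer.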
Every time you run a command, you can ask yourself: do I want to just see the output at the console or do I want to save it for later? If the latter, you can always see the contents of what you saved by just typing its name at the console and pressing Enter.

There are a few rules around the names that R will allow for the objects that you’re saving. First, while all letters are fair game, special characters like ! are off-limits. Second, a name can contain numbers but cannot start with one. But just because I’ve told you that those names won’t work doesn’t mean you shouldn’t give it a try…
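Here is one hedged way to try an invalid name without halting a script. The name answer! is a hypothetical example, and wrapping the command in parse() and tryCatch() is an addition made here so that the error can be captured; at the console you would simply type the command and read the message.

```r
# A command that uses a name containing the special character !
# (the name is hypothetical). It is stored as text so that R only tries
# to read it when we ask it to.
bad_command <- "answer! <- 5"

# parse() reads the text as R code; tryCatch() captures the error
# message instead of stopping.
msg <- tryCatch(
  {
    parse(text = bad_command)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Print the captured error message
cat(msg, "\n")
```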
This is an example of an error message and, though they can be alarming, they’re also helpful in coaching you how to correct your code. Here, it’s telling you that you had an “unexpected !” and then it points out where in your code that character popped up.

Creating Vectors

While it is helpful to be able to store a single number as an R object, to store data sets we’ll need to store a series of numbers. You can combine multiple values by putting them inside c(), separated by commas.
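A minimal sketch of combining values with c(). The name my_vec is an assumption, and the four numbers were chosen to agree with the trimmed-mean arithmetic that appears later in these notes (9 and 28 trimmed, 11 and 19 averaged).

```r
# Combine four numbers into a single object (name and values assumed)
my_vec <- c(9, 11, 19, 28)

# Print the contents
my_vec

# Count how many values it holds
length(my_vec)
```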
This object is called a vector.

Vector (in R): A set of contiguous data values that are of the same type.

As the definition suggests, you can create vectors out of many different types of data. To store words as data, put each word in quotation marks inside c().
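For instance, a vector of words could be built as follows. The name fav_colors and the three colors are hypothetical.

```r
# Each word is wrapped in quotation marks (name and values are hypothetical)
fav_colors <- c("green", "orange", "purple")

# Print the contents
fav_colors
```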
As this example shows, R can store more than just numbers as data. Vectors are often called atomic vectors because, like atoms, they are the simplest building blocks in the R language. Most of the objects in R are, at the end of the day, constructed from a series of vectors.

Functions

While the vector will serve as our atomic method of storing data in R, how do we perform computations on it? That is the role of functions. Let’s use a function, mean(), to find the arithmetic mean of a vector. A function in R operates in a very similar manner to functions that you’re familiar with from mathematics.

Figure 2: A mathematical function as a box with inputs and outputs.

In math, you can think of a function, \(f()\), as a black box that takes the input, \(x\), and transforms it to the output, \(y\). You can think of R functions in a very similar way: the vector is the input, mean() is the function, and the number that R returns is the output.
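In code, the input, function, and output can be picked out as in this sketch, which assumes a vector named my_vec with illustrative values.

```r
# Input: a vector of numbers (name and values assumed)
my_vec <- c(9, 11, 19, 28)

# Function: mean() computes the arithmetic mean of its input
# Output: a single number, saved here under a hypothetical name
avg <- mean(my_vec)

# (9 + 11 + 19 + 28) / 4
avg
```

The output is 16.75.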
Help and Arguments

Every function in R has a built-in help file that tells you about how it works. It can be accessed by typing ? followed by the name of the function, as in ?mean. This will pop up the help file in a tab next to your Files tab in the lower right hand corner of RStudio. In addition to describing what the function does, the help file lists out its arguments. Arguments are the separate pieces of input that you can supply to a function and they can be named or unnamed. In the command that we entered above, we used a single unnamed argument: the vector. As the help file shows, mean() accepts other arguments as well.
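To test how this actually works, let’s add a second unnamed argument to our function. From reading the help file, you learn that you can supply it a proportion between 0 and .5 that tells mean() what fraction of the values to trim from each end of the vector before taking the mean.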
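A sketch of the trimmed mean with two unnamed arguments. The vector name and its values are assumptions, chosen so that trimming a quarter of the values from each end removes 9 and 28.

```r
# Vector of four values (name and values assumed)
my_vec <- c(9, 11, 19, 28)

# Second unnamed argument: trim 25% of the values from each end
trimmed <- mean(my_vec, 0.25)

# Mean of the two values that remain, 11 and 19
trimmed
```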
It worked! We trim off the 9 and the 28, then take \((11 + 19) / 2 = 15\). We can write the command using named arguments. The code will be a bit more verbose but the answer will be the same.
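With named arguments, the same command might be written as below, again assuming a vector called my_vec. The argument names x and trim are the ones listed in the help file for mean().

```r
# Vector of four values (name and values assumed)
my_vec <- c(9, 11, 19, 28)

# Named arguments: more verbose, same answer
named <- mean(x = my_vec, trim = 0.25)
named

# When arguments are named, their order no longer matters
reordered <- mean(trim = 0.25, x = my_vec)
reordered
```

Both commands return 15.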
What happens if we use unnamed arguments but change the order? Let’s find out.
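One way to run this experiment without stopping a script is sketched here. The vector is an assumed example, and tryCatch() is added only so that the error message can be captured and printed; at the console you would see the error directly.

```r
# Vector of four values (name and values assumed)
my_vec <- c(9, 11, 19, 28)

# Unnamed arguments in the wrong order: R treats 0.25 as the data
# and the whole vector as the trim proportion
msg <- tryCatch(
  mean(0.25, my_vec),
  error = function(e) conditionMessage(e)
)

# Print the captured error message
msg
```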
Since there are no names, R looks at the second argument and expects it to be a proportion between 0 and .5 that it will use to trim. You have passed it a vector of several numbers instead, so it’s justified in complaining.

Functions on Vectors
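Some functions, like mean(), take a vector as input and return a single number. Others, like the square root function sqrt(), operate on each element of the vector separately.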
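A small sketch of functions that work element by element. The vectors, their names, and the choice of log10() as a second function are assumptions made for illustration.

```r
# Four perfect squares (name and values assumed)
squares <- c(1, 4, 9, 16)

# sqrt() takes the square root of each value, one by one
roots <- sqrt(squares)
roots

# log10() behaves the same way on a vector of powers of ten
powers <- c(1, 10, 100, 1000)
logs <- log10(powers)
logs

# The output has as many values as the input
length(roots)
length(logs)
```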
Note that with functions like sqrt(), when the input is a vector of length four, the output is also a vector of length four. This is a distinctive aspect of the R language and it is helpful because it allows you to perform many separate operations (taking the square root of four numbers, one by one) with just a single command.

The Taxonomy of Data in R

In the last lecture notes, we introduced the Taxonomy of Data as a broad system to classify the different types of variables on which we can collect data. If you recall, a variable is a characteristic of an object that you can measure and record. When Dr. Gorman walked up to her first penguin (the unit of observation) and measured its bill length, she collected a single observation of that variable. She continued on to measure the next penguin, then the next, then the next… Instead of recording these as separate objects, it is more efficient to store them as a vector.
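As a sketch, several bill length measurements can be stored in one vector. The name bill_length_mm and the three values are made up for illustration and are not Dr. Gorman's recorded data.

```r
# One observation stored on its own (made-up value)
first_bill <- 39.1

# Several observations stored together as a vector (made-up values)
bill_length_mm <- c(39.1, 39.5, 40.3)
bill_length_mm
```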
This example shows that a vector in R is a natural way to store the observations of a variable, so in the same way that we have asked, “what is the type of that variable?” we can now ask “what is the class of that variable in R?”.

Class (R): A collection of objects, often vectors, that share similar attributes and behaviors.

While there are many classes in R, you can get a long way only knowing three. The first is represented by our vector of bill lengths: passing it to class() shows that it is a numeric vector. What about a vector of words? R stores that as a character vector. This is a very flexible class that can be used to store text as data. But what if there are only a few fixed values that a variable can take? In that case, you can do better than a character vector: you can use a factor. Factor is a very useful class in R because it encodes the notion of levels discussed in the last notes. To illustrate the difference, let’s make a character vector but then enrich it by turning it into a factor using factor().
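The three classes can be checked with class(), as in this sketch. All object names and values here are hypothetical.

```r
# A numeric vector (made-up measurements)
bill_length_mm <- c(39.1, 39.5, 40.3)
class(bill_length_mm)

# A character vector of three strings (hypothetical values)
pets <- c("cat", "dog", "cat")
class(pets)
pets

# Enrich the character vector by turning it into a factor
pets_fac <- factor(pets)
class(pets_fac)
pets_fac

# The factor also records the possible values the variable can take
levels(pets_fac)
```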
The original character vector stores the same three strings that we used as input. The factor adds some additional information: the possible values that this vector can take. This is particularly useful when you want to let R know that these levels have a natural ordering. If you have strong opinions about the relative merit of dogs over cats, you could specify that using the levels argument of factor().
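A sketch of setting the order of the levels by hand, using hypothetical names and values. The levels argument of factor() receives a character vector that is created right inside the function call.

```r
# Character vector of three strings (hypothetical values)
pets <- c("cat", "dog", "cat")

# List "dog" first to place it ahead of "cat" in the levels
pets_fac <- factor(pets, levels = c("dog", "cat"))

# The values appear in their original order
pets_fac

# The levels follow the order that was supplied
levels(pets_fac)
```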
This example also demonstrates that you can create a (character) vector inside a function. While this doesn’t change the way the levels are ordered in the vector itself, it will affect the way they behave when we use them to create plots, as we’ll do in the next set of notes.

These three vector classes do a good job of putting into flesh and bone (or at least silicon) the abstract types captured in the Taxonomy of Data.

Figure 4: The Taxonomy of Data with equivalent classes in R.

Data Frames in R

While vectors in R do a great job of capturing the notion of a variable, we will need more than that if we’re going to represent something like a data frame. Conveniently enough, R has a structure well-suited to this task called…(drumroll…) a data frame. Let’s use R to recreate the penguins data frame collected by Dr. Gorman.
Creating a data frame

In this data frame, there are three variables; the first two numeric continuous, the last one categorical nominal. Since R stores variables as vectors, we’ll need to create three vectors.
Check the class of these vectors by using each one as input to class(). With the three vectors stored in the Environment, all you need to do is staple them together with data.frame().
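Putting the steps together might look like the sketch below. The column names and every value are made up for illustration; they stand in for the penguin measurements and are not Dr. Gorman's actual data.

```r
# Three vectors, one per variable (names and values are made up)
bill_length_mm <- c(39.1, 45.2, 49.7)
bill_depth_mm  <- c(18.7, 14.8, 18.6)
species        <- c("Adelie", "Gentoo", "Chinstrap")

# Check the class of each vector
class(bill_length_mm)
class(bill_depth_mm)
class(species)

# Staple the vectors together, one column per variable
penguins_df <- data.frame(bill_length_mm, bill_depth_mm, species)
penguins_df

# Each row is one penguin, each column is one variable
nrow(penguins_df)
ncol(penguins_df)
```

Each vector becomes a column that keeps the vector's name, so the three vectors must all have the same length.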
Summary

This was our first introduction to R, a supercharged calculator for storing and computing on data. We learned how to do basic arithmetic, construct and save a vector, call functions, query the class of an object, and construct a data frame. This forms the foundation of our use of R. If that foundation feels shaky, don’t fret. Next class will be dedicated to a workshop on R.

Figure 5: The arc of learning R.