Subset Data in R

In the previous lesson, we learned about different data structures like vectors, factors and data frames.

In this lesson we are going to learn how to subset data.

Subsetting means pulling out only part of a data set. Let's say we have 10 observations and 4 variables, meaning 10 rows and 4 columns respectively, and we want to know what's in row 2, or in column 2, or in columns 3 and 4, or in rows 5 and 6.

Let's see how we can approach this.

1. Sub-setting in Vectors:

G <- c("M","M","F","F","F","M","M")
Here we created a variable G holding a bunch of elements.
If we look at the structure of G:

str(G)
chr [1:7] "M" "M" "F" "F" "F" "M" "M"

We see that G has 7 elements.

Let's write the code to subset G. Let's say we want the 4th element of G:

G[4]
[1] "F"

Notice that when we want elements of a variable we use square brackets [], and when we call a function we use round brackets ().

We can also use the colon operator : if we want consecutive elements.

G[1:5]
[1] "M" "M" "F" "F" "F"

If we want elements 2, 4 and 6 from G, we can write the following code:

G[c(2,4,6)]
[1] "M" "F" "M"

Notice how we used the c() function to combine the positions we wanted into one index.
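The vector examples above can be collected into one small script. This is only a sketch: the variable names fourth, first5, picked and females are made up for illustration, and the last step (subsetting with a logical condition) is an extra that the lesson itself has not covered.

```r
# Same vector as in the lesson
G <- c("M", "M", "F", "F", "F", "M", "M")

fourth <- G[4]           # a single position
first5 <- G[1:5]         # a run of consecutive positions
picked <- G[c(2, 4, 6)]  # arbitrary positions, combined with c()

# Extra, not covered above: keep only the elements that match a condition
females <- G[G == "F"]

print(fourth)
print(first5)
print(picked)
print(females)
```

Storing each result in its own variable makes it easy to reuse the subset later instead of only printing it.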

2. Sub-setting in Data Frames:

You guys remember the data frame we built in the previous section? Here is what we made:

name <- c("Vaibhav", "Bruno","Rocksy")
Passed <- c(TRUE, FALSE, TRUE)
age <- c(23,2,2)
Gender <- c("M","M","F")

Gender <- factor(Gender)

dataframe <- data.frame(name, Passed, age, Gender, stringsAsFactors = FALSE)

str(dataframe)
'data.frame': 3 obs. of 4 variables:
 $ name  : chr "Vaibhav" "Bruno" "Rocksy"
 $ Passed: logi TRUE FALSE TRUE
 $ age   : num 23 2 2
 $ Gender: Factor w/ 2 levels "F","M": 2 2 1

Here we have 3 rows and 4 columns of data.

Now, we want to subset the data frame and see only Bruno’s result:

dataframe[2,]
   name Passed age Gender
2 Bruno  FALSE   2      M

In dataframe[2,] above, the 2 is the row (observation), and the comma separates the row part from the column part.
If we leave the column part empty, we get all the columns there are.

Now we want only the age variable in the data frame:

dataframe[,3]
[1] 23  2  2

We did for columns what we did for rows. We left the row part empty and put 3 in the column part, so we get the third column, which is age.
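Counting column positions gets awkward as a data frame grows, so here is a hedged sketch of selecting the same column by name instead. The data frame is rebuilt so the block runs on its own; the variable names by_position, by_name and by_dollar are made up for illustration.

```r
# Rebuild the lesson's data frame so this block runs on its own
name   <- c("Vaibhav", "Bruno", "Rocksy")
Passed <- c(TRUE, FALSE, TRUE)
age    <- c(23, 2, 2)
Gender <- factor(c("M", "M", "F"))
dataframe <- data.frame(name, Passed, age, Gender, stringsAsFactors = FALSE)

by_position <- dataframe[, 3]      # third column, as in the lesson
by_name     <- dataframe[, "age"]  # same column, selected by its name
by_dollar   <- dataframe$age       # shorthand for a single column

# Check that the three forms give the same vector
print(identical(by_position, by_name))
print(identical(by_name, by_dollar))
```

Selecting by name keeps working even if someone later reorders the columns, which is why many people prefer it.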

Now, we want to know only the name, age and gender of Bruno and Rocksy.
name is column 1, age is column 3 and Gender is column 4, while Bruno and Rocksy are observations 2 and 3.
We can write the following code:

dataframe[c(2,3),c(1,3,4)]
    name age Gender
2  Bruno   2      M
3 Rocksy   2      F

We can also subset by excluding rows or columns, as shown:

dataframe[-1, -4]
    name Passed age
2  Bruno  FALSE   2
3 Rocksy   TRUE   2

-1 is the row part and -4 is the column part. We have excluded the 1st row and the 4th column.

To exclude any row or column, put a minus sign (-) in front of its index.

We can also exclude several rows and columns at once by combining the minus sign with c():

dataframe[-c(1,2), -c(3,4)]
    name Passed
3 Rocksy   TRUE
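Excluding rows and columns is just another way of saying which ones to keep. The sketch below, which rebuilds the lesson's data frame and uses the made-up names excluded and selected, compares the exclusion form with the equivalent selection form.

```r
# Rebuild the lesson's data frame so this block runs on its own
name   <- c("Vaibhav", "Bruno", "Rocksy")
Passed <- c(TRUE, FALSE, TRUE)
age    <- c(23, 2, 2)
Gender <- factor(c("M", "M", "F"))
dataframe <- data.frame(name, Passed, age, Gender, stringsAsFactors = FALSE)

excluded <- dataframe[-c(1, 2), -c(3, 4)]  # drop rows 1-2 and columns 3-4
selected <- dataframe[3, c(1, 2)]          # keep row 3 and columns 1-2

# Rows and columns left after the exclusion
print(dim(excluded))

# Compare the two results
print(all.equal(excluded, selected))
```

Which form to use is a matter of convenience: exclusion is shorter when you want to drop only a few rows or columns from a large data frame.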


3. Importing and Sub-setting a Data Set:

Let us import a data set from our system and do some more subsetting.

You can download this sample data set to your system: sample

To import this sample.csv file, it needs to be in our current working directory.
You can revisit the page where we covered working directories here.

First we will set the working directory to the folder where I downloaded the file:

setwd("C:/Users/Vaibhav/Desktop/Study Material - 2018/R Programming")

Now we can write the code to read our data set in R:

sample_data <- read.csv("sample.csv")

If we use the head() function, we can see the first 6 rows of our data set.
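For readers who do not have sample.csv at hand, here is a sketch of head() on a stand-in data frame. The values and the id column are made up and are not the contents of the real file; only the function call itself mirrors the lesson.

```r
# Stand-in for sample_data: made-up values, NOT the contents of sample.csv
sample_data <- data.frame(
  id  = 1:10,
  age = c(34, 61, 27, 72, 45, 58, 66, 19, 80, 50)
)

first_six   <- head(sample_data)         # default: the first 6 rows
first_three <- head(sample_data, n = 3)  # n controls how many rows come back

print(nrow(first_six))
print(first_three)
```

The n argument is handy when 6 rows are too many or too few for a quick look.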

The same goes for the tail() function, which shows the last 6 rows. Why don't you guys give tail() a try?

So, as we see, there are 6 variables.

Let's say we want the data of only females.
We can do this with the subset() function:

 

subset(sample_data, gender %in% "Female")

 

Here we pass 2 arguments to this function:

1. Name of the data set

2. Condition

This code will fill your screen with the data of all females. Look at the gender variable and you will see that there are only females. It would be better to assign the result to a variable, but for now let's press CTRL + L to clear the screen.
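Assigning the result to a variable, as suggested above, might look like the following sketch. The data frame is a stand-in with made-up values (only the column name gender comes from the lesson's code), and the variable name females is hypothetical.

```r
# Stand-in for sample_data: made-up values, NOT the contents of sample.csv
sample_data <- data.frame(
  gender = c("Female", "Male", "Female", "Male", "Female"),
  age    = c(34, 61, 72, 45, 66)
)

# Keep the result instead of just printing it
females <- subset(sample_data, gender %in% "Female")

print(nrow(females))            # how many rows matched
print(unique(females$gender))   # which gender values are left
```

Once the subset lives in females, it can be passed to head(), str() or further subsetting without repeating the condition.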

Now, let's play with one numeric variable: age. Let's say we want only the rows where age is over 60 (age > 60).

We can use the subset() function:

subset(sample_data, age > 60)
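To connect this back to the square-bracket subsetting from earlier, here is a sketch that filters on age > 60 in both styles. The data frame is again a stand-in with made-up values, and the names over_60_subset and over_60_bracket are hypothetical.

```r
# Stand-in for sample_data: made-up values, NOT the contents of sample.csv
sample_data <- data.frame(
  gender = c("Female", "Male", "Female", "Male", "Female"),
  age    = c(34, 61, 72, 45, 66)
)

# Filter with subset(), as in the lesson
over_60_subset <- subset(sample_data, age > 60)

# The same filter written with square brackets: condition in the row part,
# column part left empty so all columns are kept
over_60_bracket <- sample_data[sample_data$age > 60, ]

print(over_60_subset)
print(all.equal(over_60_subset, over_60_bracket))
```

With no missing ages in the data, the two forms agree; subset() is simply shorter to type because it lets us write age instead of sample_data$age.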

 

Congratulations on completing this tutorial. Go ahead and learn more.
