So, we have tons of data available, and it grows every day. We should be able to extract useful information from this data, so that we know how the values in it influence each other or whether they are related at all. We will come to the predictions part soon, but let's discuss some descriptive analysis first.
So, let's import our data set in R first.
This time I am going to use a data set from https://vincentarelbundock.github.io/Rdatasets/datasets.html
I downloaded the carprice data set from the page above. So let's import it in R:
sd <- read.csv("carprice.csv", stringsAsFactors = FALSE)
I hope the import worked for you. Running head(sd) shows the first 6 rows of our data.
We saved our data in a variable named sd (sample data) so that the whole data set does not flood the screen.
Now we want to know the structure of sd, which str(sd) prints:
We see that there are 10 variables, of which 1 is character, 3 are integer and 6 are numeric.
The difference between a numeric and an integer variable is that an integer cannot have decimals, while a numeric can.
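If you want to check the integer versus numeric distinction yourself, here is a minimal sketch. The data frame toy and all of its values are made up for illustration; they are not taken from the carprice data.

```r
# Toy data frame; the name and values are invented for illustration only.
toy <- data.frame(
  Type     = c("Small", "Van", "Small"),
  Price    = c(9.2, 19.5, 11.3),   # has decimals, so R stores it as numeric
  MPG.city = c(31L, 18L, 29L),     # the L suffix stores whole numbers as integer
  stringsAsFactors = FALSE
)

str(toy)                            # one line per column with its type

col_classes <- sapply(toy, class)   # the class of every column
print(col_classes)
```

The same two calls work on sd once it has been imported.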
Analysis on Categorical variable
We can see that the Type variable is character. Let's count how often each unique value occurs, using the table function: table(sd$Type).
We now know that Type is really a categorical variable that happens to be saved as character. Let's convert it to a factor:
sd$Type <- factor(sd$Type)
You might be asking what sd$Type is. It is a way to access a variable (column). sd is the name of our data set and Type is the variable, and the two are separated by a $ sign.
So, the table function counts each of the unique values stored in sd$Type. The inference we draw is that, out of all the cars, there are 7 Compact cars, 11 Large cars, 10 Midsize … etc.
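Here is a small sketch of counting and converting on a made-up character vector. The name type_chr and its values are invented, so the counts differ from the carprice ones.

```r
# Invented car types, stored as plain character strings.
type_chr <- c("Small", "Van", "Small", "Large", "Van", "Small")

counts <- table(type_chr)      # how often each unique value occurs
print(counts)

type_fct <- factor(type_chr)   # convert character to factor
print(levels(type_fct))        # the distinct categories
```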
Analysis on Numeric Variable
For now let's analyse the numeric variable Price using the summary function, summary(sd$Price):
We see that the minimum price is 7.40 and the maximum price is 40.10.
This gives us the range of the Price variable, meaning that every value of Price lies within this range.
We see that the mean is a little higher than the median. This suggests that our data is right skewed. But how?
Let's look at one more function, and then you will agree.
boxplot(sd$Price, horizontal = T)
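This function has 2 arguments:
1. The variable you want to plot, here sd$Price.
2. Whether you want the boxplot drawn horizontally or not. T means TRUE, so you can also write the 2nd argument as horizontal = TRUE, which is safer because T can be overwritten.
This function gives us the following boxplot of the Price variable: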
The middle line is the median (2nd quartile), and the box runs from the 1st quartile on the left to the 3rd quartile on the right. We can see that the median leans towards the 1st quartile.
This tells us that the data above the median is more spread out than the data below it. There are also a bunch of outliers in this data, shown as dots beyond the right whisker.
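To see the numbers behind a boxplot, here is a sketch using boxplot.stats on made-up prices. The vector price is invented and is not the carprice data. By default, points more than 1.5 box lengths beyond the box are reported as outliers, which is the same default rule boxplot uses.

```r
# Invented prices with one unusually expensive car.
price <- c(7, 8, 9, 10, 11, 12, 14, 16, 20, 40)

print(mean(price))     # pulled upwards by the expensive car
print(median(price))   # not affected by it

bs <- boxplot.stats(price)
print(bs$stats)        # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)          # the values that would be drawn as separate dots
```

A mean above the median together with outliers only on the high side is the pattern we called right skewed.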
A histogram of this data, drawn with hist(sd$Price), makes this easier to see:
The histogram also shows that the distribution has a long tail towards the right.
So these are some summaries you can use to get to know the kind of data you have.
Analysis on Multiple Variables
We can do analysis on multiple variables as well in R.
Numerical and Categorical variable analysis
We used the summary function on the Price variable earlier, but what if we want the summary of Price for each category of Type? We can do that using the tapply function.
tapply(sd$Price, sd$Type, FUN = summary)
The function has 3 arguments:
1. The variable to which the function in the 3rd argument is applied. Meaning, the summary function will be applied to the Price variable.
2. The variable that splits the data into categories. Meaning, the Price values will be grouped by the Type variable.
3. The function we want to apply, i.e. the summary function.
So the summary function is applied to the Price variable within each category of the Type variable.
Play with this function a little and see what happens if you write FUN = mean or FUN = max.
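If you want to try FUN = mean and FUN = max before touching the real data, here is a sketch on a made-up data frame. toy_cars and its values are invented for illustration.

```r
# Invented prices for three car types.
toy_cars <- data.frame(
  Type  = c("Small", "Small", "Van", "Van", "Large"),
  Price = c(8, 10, 19, 21, 30)
)

# Apply a function to Price within each category of Type.
mean_by_type <- tapply(toy_cars$Price, toy_cars$Type, FUN = mean)
max_by_type  <- tapply(toy_cars$Price, toy_cars$Type, FUN = max)

print(mean_by_type)
print(max_by_type)
```

Each result has one value per category, named after the category.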
We can also make a boxplot using the plot function. This works because we converted Type to a factor earlier.
plot(Price ~ Type , data = sd)
Here, in the 1st argument, Price is our numeric variable and Type is our categorical variable. The 2nd argument is the data set.
So the plot function gives us an idea of the Price of cars for each Type of car. Notice that Price is on the y axis and Type is on the x axis.
In the plot function we write the numeric variable first (y axis) and the categorical variable after ~ (x axis). If we swap their positions, the results are hilariously awkward.
What if we wanted to summarize two numeric variables?
Numerical and Integer variable analysis
So in our data set, we have another interesting variable, MPG.city, which is the Miles Per Gallon of the cars in the city.
It is an integer, as the structure from str(sd) showed earlier.
If we use the plot function with two numeric variables, we get a scatter plot.
plot(Price ~ MPG.city , data = sd)
It seems that as MPG increases, Price decreases. There might be a relationship between these variables.
We will discuss this more later.
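One way to put a number on such a downward trend is the correlation coefficient. This is only a sketch: cor is not used anywhere else in this tutorial, and the mpg and price vectors below are made up, not taken from carprice.

```r
# Invented values where price falls as mileage rises.
mpg   <- c(18, 20, 22, 25, 29, 33)
price <- c(34, 28, 22, 18, 12, 9)

r <- cor(mpg, price)   # Pearson correlation, always between -1 and 1
print(r)               # negative when one variable falls as the other rises
```

On the real data the equivalent call would be cor(sd$MPG.city, sd$Price).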
So far we have been doing analysis on 2 variables only. Let's take a step ahead and bring in some more variables.
Download the SandwichAnts data set and import it in R like we did above.
It is data about ants who like different kinds of sandwiches. Pretty funny!
Let's write our code to import it:
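sandwich <- read.csv("SandwichAnts.csv", stringsAsFactors = TRUE)
We see that there are 6 variables, of which 3 are factors and 3 are numeric. Note that since R 4.0 read.csv keeps text columns as character unless you pass stringsAsFactors = TRUE, as we do here.
It's good that our data has factors, because functions such as summary and plot treat a factor as a set of categories, which they do not do for plain character variables. A character variable can be converted to a factor like this: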
My_Variable <- factor(My_Variable)
We can apply the same analysis as we did above. Go ahead and experiment with this.
Let's say we want to know how many ant counts were recorded for each kind of bread.
We can do that with the aggregate function:
aggregate(Ants ~ Bread, data = sandwich, FUN = length)
       Bread Ants
1 MultiGrain   12
2        Rye   12
3      White   12
4 WholeWheat   12
So, the 1st argument is Ants ~ Bread, where Ants is numeric and Bread is categorical. The numeric variable comes first, followed by ~ and the factor variable, and the function in the 3rd argument is applied to the numeric variable within each category of that factor.
The 2nd argument is again the data set.
The 3rd is the function we want to apply. With FUN = length we get the number of recorded observations per bread (12 each), not the number of ants.
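To see what FUN = length really counts, here is a sketch on a made-up data frame. toy_ants and its values are invented and are not the SandwichAnts data.

```r
# Invented data: two observations per bread.
toy_ants <- data.frame(
  Bread = c("Rye", "Rye", "White", "White"),
  Ants  = c(10, 30, 20, 50)
)

# length counts the rows in each group; sum adds up the Ants values.
n_obs   <- aggregate(Ants ~ Bread, data = toy_ants, FUN = length)
n_total <- aggregate(Ants ~ Bread, data = toy_ants, FUN = sum)

print(n_obs)
print(n_total)
```

Both breads have the same number of rows, yet the totals differ, so the two functions answer different questions.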
But what if we also want to split the data by whether the sandwich had butter or not? We can use aggregate again in this case:
aggregate(Ants ~ Bread + Butter, data = sandwich, FUN = length)
Notice that we just added + Butter to the 1st argument. Butter is again a categorical variable.
Now we can add yet another factor, the Filling variable.
See what happens if we write the code below:
aggregate(Ants ~ Bread + Butter + Filling, data = sandwich, FUN = length)
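The above function gives us the number of observations for each combination of Bread, Butter (with or without) and Filling.
Go ahead and mess with the 3rd argument, FUN =, and compute the mean, standard deviation, variance, min or max. Since our car data frame is also named sd, writing stats::sd is the safer way to ask for the standard deviation.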
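As a starting point for that experiment, here is a sketch with two grouping variables and FUN = mean. Again, toy_ants and its values are made up for illustration.

```r
# Invented data with two categorical variables.
toy_ants <- data.frame(
  Bread  = c("Rye", "Rye", "White", "White", "Rye", "White"),
  Butter = c("no", "yes", "no", "yes", "yes", "no"),
  Ants   = c(20, 40, 30, 50, 60, 10)
)

# Mean number of ants for every Bread and Butter combination.
mean_ants <- aggregate(Ants ~ Bread + Butter, data = toy_ants, FUN = mean)
print(mean_ants)
```

Each row of the result is one combination of the grouping variables, with the summary in the Ants column.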
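A word of caution: with FUN = length we are counting observations, so equal counts for butter and no butter only tell us that the experiment recorded the same number of sandwiches of each kind. To say whether ants like butter or not, compare the mean or sum of Ants between the two groups instead.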
Congratulations on learning this tutorial. Go ahead and learn more.