Shapiro-Wilk Test for Normality in R

I think the Shapiro-Wilk test is a great way to check whether a variable is normally distributed. Normality is an assumption behind many parametric models and tests, so it is worth checking both when building a model and when evaluating one.

Let’s look at how to do this in R!

shapiro.test(data$CreditScore)


And here is the output:

Shapiro-Wilk normality test
data:  data$CreditScore
W = 0.96945, p-value = 0.2198

So how do we read this? The null hypothesis of the test is that the data come from a normal distribution. The p-value may look high, but that is not a problem: the usual threshold is 0.05, and 0.2198 is above it, so we fail to reject the null hypothesis. We don’t have enough evidence to say the population is not normally distributed.
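To make the decision rule concrete, here is a minimal sketch on simulated data. The vector `scores`, its mean and standard deviation, and the `alpha` variable are made up for illustration; they are not the credit score data used in this post.

```r
# Simulated stand-in for a numeric column (hypothetical values, not the post's data)
set.seed(42)
scores <- rnorm(50, mean = 680, sd = 50)

# shapiro.test() returns a list; the p-value is stored in $p.value
result <- shapiro.test(scores)
alpha <- 0.05

if (result$p.value > alpha) {
  message("Fail to reject the null: no evidence against normality")
} else {
  message("Reject the null: the data do not look normal")
}
```

Pulling `result$p.value` out of the result is handy when you want to run the same check across several columns instead of reading the printed output each time.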

Let’s make a histogram to take a look using base R graphics:

hist(data$CreditScore,
     main="Credit Score",
     xlab="Credit Score",
     border="light blue")

Our distribution looks nice here:

Great! I would feel comfortable moving on to tests that assume normality.
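If you want to compare the histogram against a normal curve directly, one option is to overlay a normal density on top of it. This is only a rough sketch on simulated data: `scores` and its parameters are hypothetical, and the curve simply uses the sample mean and standard deviation.

```r
# Simulated stand-in for the credit score column (hypothetical values)
set.seed(42)
scores <- rnorm(50, mean = 680, sd = 50)

# freq = FALSE puts density on the y-axis so the curve is on the same scale
h <- hist(scores,
          freq = FALSE,
          main = "Credit Score",
          xlab = "Credit Score",
          border = "lightblue")

# Overlay a normal density built from the sample mean and standard deviation
curve(dnorm(x, mean = mean(scores), sd = sd(scores)), add = TRUE)
```

If the bars follow the curve reasonably well, that agrees with what the Shapiro-Wilk result suggested.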

Dollar Signs and Percentages: 3 Different Ways to Convert Data Types in R

Working with percentages in R can be a little tricky, but it’s easy to convert them to integer or numeric values so you can run the right statistics on them, such as quartiles and means rather than frequencies.

data$column = as.integer(sub("%", "", data$column))

Essentially you are using the sub function to replace the “%” with an empty string. Note that as.integer drops any decimals, so use as.numeric instead if your percentages have fractional parts. In the end, just remember that those values are percentage amounts.
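Here is a small sketch of how the two conversions differ when a value has a decimal. The `pct` vector is made up for illustration.

```r
# Hypothetical percentage strings, including one with a decimal
pct <- c("25%", "70%", "12.5%")

# fixed = TRUE treats "%" as a literal character rather than a regex
stripped <- sub("%", "", pct, fixed = TRUE)

as_int <- as.integer(stripped)   # drops the fractional part of "12.5"
as_num <- as.numeric(stripped)   # keeps the fractional part
```

So if your column only ever holds whole percentages, either conversion gives the same values; otherwise `as.numeric` is the safer choice.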

The next example converts a column to a factor:

data$column = as.factor(data$column)

Now R treats the data as discrete. This is great for categorical (nominal-level) variables.
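As a quick illustration of what the factor conversion buys you, here is a sketch with a made-up `grade` vector (not a column from the data in this post):

```r
# Hypothetical categorical values stored as plain text
grade <- c("B", "A", "C", "A", "B", "A")

grade_f <- as.factor(grade)

# The distinct categories become the factor's levels
lv <- levels(grade_f)

# table() counts how many observations fall in each level
counts <- table(grade_f)
print(counts)
```

Once the column is a factor, `summary()` reports these per-level counts as well, instead of just telling you the column is character data.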

The last example converts to numeric. If you have a variable that contains a dollar sign, use this to change it to a number:

# Remove the thousands separators first
data$balance = gsub(",", "", data$balance)
# Then strip the dollar sign and convert to numeric
data$balance = as.numeric(gsub("\\$", "", data$balance))
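The two substitutions can also be folded into one by matching either character at once. This is just an alternative sketch, using a made-up `balance` factor rather than the real column:

```r
# Hypothetical currency values stored as a factor
balance <- factor(c("$1,000", "$10,000", "$7,200"))

# "[$,]" matches either a dollar sign or a comma; inside [] the $ needs no escape.
# gsub() works on the factor's labels, so the result is a character vector.
balance_num <- as.numeric(gsub("[$,]", "", balance))

# Caution: as.numeric(balance) on the factor itself would return the
# internal level codes, not the dollar amounts.
```

Whichever form you use, the cleanup has to happen before `as.numeric`, because text such as "$1,000" converts to NA.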

Check out the structure before the conversion:

Balance   : Factor w/ 40 levels "$1,000","$10,000",..: 
Utilization  : Factor w/ 31 levels "100%","11%","12%",

And after:

Balance      : num  11320 7200 20000 12800 5700 ...
Utilization  : int  25 70 55 65 75 

I hope this helps you with your data formatting! It’s simple and easy, and once the types are right you’ll be able to summarize your data properly.