Fraud Detection in Python

import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score, f1_score, recall_score
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
import seaborn as sns
from scipy.stats import norm

df = pd.read_csv('data.csv')
class_names = {0: 'Not Fraud', 1: 'Fraud'}
print(df.Class.value_counts().rename(index=class_names))
df.head()

corrmat = df.corr()
f, ax = pyplot.subplots(figsize=(9, 8))
sns.heatmap(corrmat, ax=ax, cmap="YlGnBu", linewidths=0.1)

array = df.values
X = array[:, 1:30]
y = array[:, 30]

# fit the model on the full dataset (no train/test split yet)
model = XGBClassifier()
model.fit(X, y)

# plot feature importance
plot_importance(model)
pyplot.show()

def PrintStats(cmat, y_test, pred):
    # separate out the confusion matrix components
    # (sklearn convention: rows are true labels, columns are predictions)
    tneg = cmat[0][0]
    fpos = cmat[0][1]
    fneg = cmat[1][0]
    tpos = cmat[1][1]

    # calculate F1 and Recall scores
    f1Score = round(f1_score(y_test, pred), 2)
    recallScore = round(recall_score(y_test, pred), 2)

    # calculate and display metrics
    print(cmat)
    print('Accuracy: ' + str(np.round(100 * float(tpos + tneg) / float(tpos + tneg + fpos + fneg), 2)) + '%')
    print('Cohen Kappa: ' + str(np.round(cohen_kappa_score(y_test, pred), 3)))
    print("Sensitivity/Recall for Model : {recall_score}".format(recall_score=recallScore))
    print("F1 Score for Model : {f1_score}".format(f1_score=f1Score))

def RunModel(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train.values.ravel())
    pred = model.predict(X_test)
    matrix = confusion_matrix(y_test, pred)
    return matrix, pred

feature_names = df.iloc[:, 1:30].columns
target = df.iloc[:, 30:].columns
data_features = df[feature_names]
data_target = df[target]

from sklearn.model_selection import train_test_split
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(data_features, data_target, train_size=0.70, test_size=0.30, random_state=1)

from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
cmat, pred = RunModel(lr, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)

Choropleth Map in ggplot2

Creating a map in ggplot2 can be surprisingly easy! This tutorial will map the US by state, using a dataset from 1970 with state-level statistics including population, income, life expectancy, and illiteracy.

I love making maps. While predictive statistics provide great insight, map making was one of the things that really got me interested in data science, and I’m glad R provides a great way to make them.

I’d also recommend the plotly package, which lets you make the map interactive as you hover over it, all within R (see the ggplotly sketch after the first map’s code below).

Here is the first map we will make:

This first map shows population by state in the 1970 US.

library(ggplot2)
library(dplyr)

states<-as.data.frame(state.x77)
states$region <- tolower(rownames(states))
states_map <- map_data("state")  # map_data() requires the maps package to be installed
fact_join <- left_join(states_map, states, by = "region")

ggplot(fact_join, aes(long, lat, group = group))+
geom_polygon(aes(fill = Population), color = "white")+
scale_fill_viridis_c(option = "C")+
theme_classic()
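
If you want the interactive version mentioned earlier, here is a minimal sketch (assuming the plotly package is installed): save the same ggplot object and wrap it with ggplotly().

library(plotly)

p <- ggplot(fact_join, aes(long, lat, group = group))+
geom_polygon(aes(fill = Population), color = "white")+
scale_fill_viridis_c(option = "C")+
theme_classic()
ggplotly(p)  # interactive map with hover tooltips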

For the next graph the code will be mostly the same, but I will change the fill = aesthetic.

Let’s try per capita income:

This is great. We’re able to see the income range through the color fill of each state.

ggplot(fact_join, aes(long, lat, group = group))+
geom_polygon(aes(fill = Income), color = "white")+
scale_fill_viridis_c(option = "C")+
theme_classic()

Last one we’ll make is life expectancy:

Great info here: life expectancy by state in the 1970s! That particular variable needed a little extra coding; see below:

fact_join$`Life Exp` <- as.numeric(fact_join$`Life Exp`)

ggplot(fact_join, aes(long, lat, group = group))+
geom_polygon(aes(fill = `Life Exp`), color = "white")+
scale_fill_viridis_c(option = "C")+
theme_classic()

Enjoy your maps! This dataset is publicly available, so feel free to recreate them.

Pros and Cons of Top Data Science Online Courses

There are a variety of data science courses online, but which one is the best? Find out the pros and cons of each!

Coursera, EdX, etc

These MOOCs have been around for several years now and continue to grow. But are they really the best option for learning online?

Pros:

  • Lots of Topics including R and Python
  • Affordable and even a free option
  • Well thought out curriculum from professors in great schools

Cons:

  • Not easily translatable to industry
  • Not taught by current industry professionals, but instead academics

Now, these MOOCs are still worth checking out to see if they work for you, but beware that you may get tired of analyzing the iris data set.

PluralSight

Pros:

  • Lots of Topics in R, Python, and databases
  • Easy to skip around through the user interface instead of going in order
  • Taught by industry veterans at top companies who know current trends and expectations
  • You can use your own tools, such as Anaconda and RStudio, on your computer rather than coding in the website itself

Cons:

  • Still just a bit limited on their data courses, but still growing quickly

DataCamp

Pros:

  • Great options for beginners to intermediate
  • Courses build on each other, fairly good examples
  • Most instructors have spent time in the industry

Cons:

  • You have to use their in-browser coding tool
  • Exercises are not always that clear
  • You never know if the code will work the same way on your own computer

So that’s a quick overview of options for learning online. Of course blogs are fantastic, too, and Stack Overflow can really be helpful!

Feel free to add your recommendations, too!

Check out PluralSight’s great offer today!

Shapiro-Wilk Test for Normality in R

I think the Shapiro-Wilk test is a great way to see if a variable is normally distributed. Normality is an important assumption when creating and evaluating many kinds of models.

Let’s look at how to do this in R!

shapiro.test(data$CreditScore)

And here is the output:

Shapiro-Wilk normality test
data:  data$CreditScore
W = 0.96945, p-value = 0.2198

So how do we read this? The usual threshold for the p-value is 0.05, and our p-value of 0.2198 is well above it, so we fail to reject the null hypothesis: we don’t have enough evidence to say the population is not normally distributed.
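
For contrast, here is a quick sketch of what a rejection would look like, using simulated right-skewed data (hypothetical values drawn from an exponential distribution, not a real credit score column):

set.seed(42)
skewed <- rexp(100)   # clearly non-normal, right-skewed data
shapiro.test(skewed)  # the p-value will typically fall well below 0.05,
                      # so we would reject the null hypothesis of normality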

Let’s make a histogram to take a look using base R graphics:

hist(data$CreditScore, 
     main="Credit Score", 
     xlab="Credit Score", 
     border="light blue", 
     col="blue", 
     las=1, 
     breaks=5)

Our distribution looks nice here:

Great! I would feel comfortable making more assumptions and performing some tests.

Dollar Signs and Percentages: 3 Different Ways to Convert Data Types in R

Working with percentages in R can be a little tricky, but it’s easy to convert them to an integer or numeric type and run the right statistics on them, such as quartiles and means instead of frequencies.

data$column = as.integer(sub("%", "",data$column))

Essentially you are using the sub function to substitute the “%” with a blank. Keep in mind that as.integer truncates any decimals, so use as.numeric instead if you need to keep them. In the end, just remember that those are still percentage amounts.
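
If your column does have decimals you want to keep, here is a sketch of the decimal-preserving version (assuming the same data$column of percentage strings):

data$column = as.numeric(sub("%", "", data$column))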

The next example is converting to a factor:

data$column = as.factor(data$column)

Now R will treat the data as discrete. This is great for categorical and nominal-level variables.

The last example is converting to numeric. If you have a variable that contains a dollar sign, use this to change it to a number.

data$balance = gsub(",", "", data$balance)
data$balance = as.numeric(gsub("\\$", "", data$balance))

Check out the before

Balance   : Factor w/ 40 levels "$1,000","$10,000",..: 
Utilization  : Factor w/ 31 levels "100%","11%","12%",

And after

Balance      : num  11320 7200 20000 12800 5700 ...
Utilization  : int  25 70 55 65 75 

I hope this helps you with your formatting tasks! It’s simple and easy, and you’ll be able to summarize your data properly.

Unsupervised Machine Learning in R: K-Means

K-Means clustering is unsupervised machine learning because there is no target variable. Clustering can be used to create a target variable, or simply to group data by certain characteristics.

Here’s a great and simple way to use R to find clusters, visualize them, and then tie them back to the data source to implement a marketing strategy.

# set your working directory first, e.g. setwd("...")
# import the dataset
ABC <- read.table("AbcBank.csv", header = TRUE,
                  sep = ",")

#choose variables to be clustered 
# make sure to exclude ID fields or Dates
ABC_num<- ABC[,2:5]
#scale the data! so they are all normalized 
ABC_scaled <-as.data.frame(scale(ABC_num))

#kmeans function
k3<- kmeans(ABC_scaled, centers=3, nstart=25)
#library with the visualization
library(factoextra)
fviz_cluster(k3, data=ABC_scaled,
             ellipse.type="convex",
             axes =c(1,2),
             geom="point",
             label="none",
             ggtheme=theme_classic())
#check out the centers 
# remember these are normalized but 
#higher values are higher values for the original data
k3$centers
#add the cluster to the original dataset!
ABC$Cluster <- as.numeric(k3$cluster)

Check out our awesome clusters:
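
To tie the clusters back to a marketing strategy, one simple sketch (assuming the ABC and ABC_num objects from above) is to profile each cluster on the original, unscaled variables:

#average of each original (unscaled) variable per cluster
aggregate(ABC_num, by = list(Cluster = ABC$Cluster), FUN = mean)
#how many customers fall into each cluster
table(ABC$Cluster)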

Repo here with dataset: https://github.com/emileemc/kmeans

Easy R: Summary statistics grouping by a categorical variable

Once I found this great R package that really improves on the base R summary() function, it was a game changer.

This approach gives you summary statistics for each variable, grouped by a categorical variable. The result can also be saved as a list with an assignment (see the sketch at the end of this post).

library(purrr)
credit %>% split(credit$Date) %>% map(summary)

Simply split on the dataframe$column that holds the categorical variable, then use the map function to run summary on each group. And that’s it! You’re all set to produce results like these:

$Aug
   Homeowner       Credit.Score   Years.of.Credit.History
 Min.   :0.0000   Min.   :485.0   Min.   : 2.00          
 1st Qu.:0.0000   1st Qu.:545.5   1st Qu.: 5.50          
 Median :0.0000   Median :591.0   Median : 9.00          
 Mean   :0.3704   Mean   :601.6   Mean   :10.33          
 3rd Qu.:1.0000   3rd Qu.:630.0   3rd Qu.:14.50          
 Max.   :1.0000   Max.   :811.0   Max.   :22.00          
                                                         
 Revolving.Balance Revolving.Utilization    Approval        Loan.Amount
 $2,000  : 2       100%   : 3            Min.   :0.0000   $11,855 : 1  
 $27,000 : 2       65%    : 2            1st Qu.:0.0000   $12,150 : 1  
 $29,100 : 2       70%    : 2            Median :0.0000   $13,054 : 1  
 $1,000  : 1       78%    : 2            Mean   :0.1481   $15,451 : 1  
 $10,500 : 1       79%    : 2            3rd Qu.:0.0000   $16,218 : 1  
 $12,050 : 1       85%    : 2            Max.   :1.0000   $17,189 : 1  
 (Other) :18       (Other):14                             (Other) :21  
   Date    Default
 Aug :27   0:14   
 July: 0   1:13   
                  
                                         
$July
   Homeowner       Credit.Score   Years.of.Credit.History
 Min.   :0.0000   Min.   :620.0   Min.   : 2.0           
 1st Qu.:0.5000   1st Qu.:682.5   1st Qu.: 8.0           
 Median :1.0000   Median :701.0   Median :12.0           
 Mean   :0.7391   Mean   :711.8   Mean   :12.3           
 3rd Qu.:1.0000   3rd Qu.:746.5   3rd Qu.:16.5           
 Max.   :1.0000   Max.   :802.0   Max.   :24.0           
                                                         
 Revolving.Balance Revolving.Utilization    Approval        Loan.Amount
 $11,200 : 2       11%    : 2            Min.   :0.0000   $3,614  : 2  
 $11,700 : 2       15%    : 2            1st Qu.:1.0000   $12,303 : 1  
 $6,100  : 2       20%    : 2            Median :1.0000   $12,338 : 1  
 $10,000 : 1       5%     : 2            Mean   :0.8261   $12,712 : 1  
 $10,500 : 1       7%     : 2            3rd Qu.:1.0000   $13,020 : 1  
 $11,320 : 1       70%    : 2            Max.   :1.0000   $17,697 : 1  
 (Other) :14       (Other):11                             (Other) :16  
   Date    Default
 Aug : 0   0:10   
 July:23   1:13   

You’ll have to do some formatting, or export to Excel! So fast and easy with this one.
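
As mentioned above, the result can also be saved with an assignment. Here is a minimal sketch, using a hypothetical by_month name and assuming the same credit data frame:

by_month <- credit %>% split(credit$Date) %>% map(summary)
names(by_month)  # the groups you have, e.g. "Aug" and "July"
by_month$July    # pull out just the July summaries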