--- title: "Summarize and Explore the Data" subtitle: "R package that automates most of exploratory analyses tasks in modeling" author: "Dayananda Ubrangala, Kiran R, Ravi Prasad Kondapalli, Sayan Putatunda" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: fig_caption: yes toc: true css: custom.css vignette: > %\VignetteIndexEntry{Vignette Title Subtitle} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\SweaveUTF8 %\VignetteIndexEntry{Information Value: Usage Examples} --- ```{r setup, include=FALSE} library(rmarkdown) library(SmartEDA) library(knitr) library(ISLR) library(scales) library(gridExtra) library(ggplot2) ``` ## 1. Introduction The document introduces the **SmartEDA** package and how it can help you to build exploratory data analysis. **SmartEDA** includes multiple custom functions to perform initial exploratory analysis on any input data describing the structure and the relationships present in the data. The generated output can be obtained in both summary and graphical form. The graphical form or charts can also be exported as reports. ``` सर्वस्य लोचनं शास्त्रं Science is the only eye अनेकसंशयोच्छेदि, परोक्षार्थस्य दर्शक| सर्वस्य लोचनं शास्त्रं, यस्य नास्त्यन्ध एव सः || It blasts many doubts, foresees what is not obvious | Science is the eye of everyone, one who hasnt got it, is like a blind || ``` **SmartEDA** package helps you to construct a good base of data understanding. The capabilities and functionalities are listed below 1. **SmartEDA** package will make you capable of applying different types of EDA without having to + remember the different R package names + write lengthy R scripts + manual effort to prepare the EDA report 2. No need to categorize the variables into Character, Numeric, Factor etc. SmartEDA functions automatically categorize all the features into the right data type (Character, Numeric, Factor etc.) based on the input data. 3. ggplot2 functions are used for graphical presentation of data 4. Rmarkdown and knitr functions were used for build HTML reports To summarize, SmartEDA package helps in getting the complete exploratory data analysis just by running the function instead of writing lengthy r code. ### Journal of Open Source Software Article An article describing SmartEDA package for exploratory data analysis approach has been published in [arxiv](https://arxiv.org/pdf/1903.04754.pdf) and Journal of Open Source Software [JOSS](https://joss.theoj.org/papers/10.21105/joss.01509). Please cite the paper if you use SmartEDA in your work! ## 2. Data In this vignette, we will be using a simulated data set containing sales of child car seats at 400 different stores. Data Source [ISLR package](https://www.rdocumentation.org/packages/ISLR/versions/1.2/topics/Carseats). Install the package "ISLR" to get the example data set. ```{r eda-c3-r, warning=FALSE,eval=F} #install.packages("ISLR") library("ISLR") #install.packages("SmartEDA") library("SmartEDA") ## Load sample dataset from ISLR pacakge Carseats= ISLR::Carseats ``` ### 2.1 Overview of the data Understanding the dimensions of the dataset, variable names, overall missing summary and data types of each variables ```{r od_1,warning=FALSE,eval=F,include=T} # Overview of the data - Type = 1 ExpData(data=Carseats,type=1) # Structure of the data - Type = 2 ExpData(data=Carseats,type=2) ``` ```{r od_2,warning=FALSE,eval=T,include=F} ovw_tabl <- ExpData(data=Carseats,type=1) ovw_tab2 <- ExpData(data=Carseats,type=2) ``` * Overview of the data ```{r od_3,warning=FALSE,eval=T,render=ovw_tabl,echo=F} kable(ovw_tabl, "html") ``` * Structure of the data ```{r od_31,warning=FALSE,eval=T,render=ovw_tab2,echo=F} kable(ovw_tab2, "html") ``` ### 2.2 Add summary statistics into Metadata ouput ```{r od_du2,warning=FALSE,eval=T,include=F} ovw_tabl_du <- ExpData(data=Carseats,type=2, fun = c("mean", "median", "var")) ``` * Metadata Information with additional statistics like mean, median and variance ```{r od_du1,warning=FALSE,eval=F,include=T} # Metadata Information with additional statistics like mean, median and variance ExpData(data=Carseats,type=2, fun = c("mean", "median", "var")) ``` ```{r od_du3,warning=FALSE,eval=T,render=ovw_tabl_du,echo=F} kable(ovw_tabl_du, "html") ``` * Adding custom function like 10th percentile or 90th percentile ```{r od_du4,warning=FALSE,eval=T,include=T} # Derive Quantile quantile_10 = function(x){ quantile_10 = quantile(x, na.rm = TRUE, 0.1) } quantile_90 = function(x){ quantile_90 = quantile(x, na.rm = TRUE, 0.9) } output_e1 <- ExpData(data=Carseats, type=2, fun=c("quantile_10", "quantile_90")) ``` ```{r od_du5,warning=FALSE,eval=T,render=output_e1,echo=F} kable(output_e1, "html") ``` ## 3. Exploratory data analysis (EDA) This function shows the EDA output for 3 different cases 1. **Target variable is not defined** 2. **Target variable is continuous** 3. **Target variable is categorical** ### 3.1 Example for case 1: Target variable is not defined #### 3.1.1 Summary of numerical variables Summary of all numerical variables ```{r c1.1,warning=FALSE,eval=T,include=F} ec1 = ExpNumStat(Carseats,by="A",gp=NULL,Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2,Nlim=3) rownames(ec1)<-NULL ``` ```{r c1.11, warning=FALSE,eval=F,include=T} ExpNumStat(Carseats,by="A",gp=NULL,Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2,Nlim=10) ``` ```{r c1.12,warning=FALSE,eval=T,render=ec1,echo=F} paged_table(ec1) ``` ##### Compute Weighted Summary Statistics for numerical variable ```{r ,warning=FALSE,eval=T,include=T} carseat = ISLR::Carseats ## Compute random weight carseat$wt = stats::runif( nrow(carseat), 0.5, 1.5 ) wt_summary = ExpNumStat(carseat,by="A",gp=NULL,round=2,Nlim=10, weight = "wt") wt_summary[,c("Vname","TN","W_count","mean", "W_Mean", "SD","W_Sd")] ``` ```{r ,warning=FALSE,eval=T,include=T} ## With group by statement wt_summary = ExpNumStat(carseat,by="GA",gp="ShelveLoc",round=2,Nlim=10, weight = "wt") wt_summary[,c("Vname","Group","TN","W_count","mean", "W_Mean", "SD","W_Sd")] ``` #### 3.1.2 Distributions of numerical variables Graphical representation of all numeric features * Density plot (Univariate) ```{r c1.2 ,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=7,fig.width=7} # Note: Variable excluded (if unique value of variable which is less than or eaual to 10 [nlim=10]) plot1 <- ExpNumViz(Carseats,target=NULL,nlim=10,Page=c(2,2),sample=4) plot1[[1]] ``` #### 3.1.3. Summary of categorical variables ```{r ec13, eval=T,include=F} et1 <- ExpCTable(Carseats,Target=NULL,margin=1,clim=10,nlim=5,round=2,bin=NULL,per=T) rownames(et1)<-NULL ``` * frequency for all categorical independent variables ```{r ec14, warning=FALSE,eval=F,include=T} ExpCTable(Carseats,Target=NULL,margin=1,clim=10,nlim=3,round=2,bin=NULL,per=T) ``` ```{r ec14.1,warning=FALSE,eval=T,render=et1,echo=F} kable(et1,"html") ``` >*`NA` is Not Applicable* #### 3.1.4. Distributions of categorical variables * Bar plots for all categorical variables ```{r bp1,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=7,fig.width=7} plot2 <- ExpCatViz(Carseats,target=NULL,col ="slateblue4",clim=10,margin=2,Page = c(2,2),sample=4) plot2[[1]] ``` ### 3.2 Example for case 2: Target variable is continuous #### 3.2.1. Target variable Summary of continuous dependent variable 1. Variable name - **Price** 2. Variable description - **Price company charges for car seats at each site** ```{r tbd0,warning=FALSE,eval=T,include=T} summary(Carseats[,"Price"]) ``` #### 3.2.2 Summary of numerical variables Summary statistics when dependent variable is continuous **Price**. ```{r con_1,warning=FALSE,eval=T,include=F} cpp = ExpNumStat(Carseats,by="A",gp="Price",Qnt=seq(0,1,0.1),MesofShape=1,Outlier=TRUE,round=2) rownames(cpp)<-NULL ``` ```{r con_2, warning=FALSE,eval=F,include=T} ExpNumStat(Carseats,by="A",gp="Price",Qnt=seq(0,1,0.1),MesofShape=1,Outlier=TRUE,round=2) ``` ```{r con_3,warning=FALSE,eval=T,render=cpp,echo=F} paged_table(cpp) ``` >*If Target variable is continuous, summary statistics will add the correlation column (Correlation between Target variable vs all independent variables)* #### 3.2.3 Distributions of numerical variables Graphical representation of all numeric variables * Scatter plot (Bivariate) - with continuous dependent variable Scatter plot between all numeric variables and target variable **Price**. This plot help to examine how well a target variable is correlated with dependent variables. Dependent variable is **Price** (continuous). ```{r snv1,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=7,fig.width=7} #Note: sample=8 means randomly selected 8 scatter plots #Note: nlim=4 means included numeric variable with unique value is more than 4 plot3 <- ExpNumViz(Carseats,target="Price",nlim=4,scatter=FALSE,fname=NULL,col="green",Page=c(2,2),sample=8) plot3[[1]] ``` * Scatter plot (Bivariate) - between all the numerical variables ```{r snv1_1,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=7,fig.width=7} #Note: sample=8 means randomly selected 8 scatter plots #Note: nlim=4 means included numeric variable with unique value is more than 4 plot31 <- ExpNumViz(Carseats,target="US",nlim=4,scatter=TRUE,fname=NULL,Page=c(2,1),sample=4) plot31[[1]] ``` #### 3.2.4. Summary of categorical variables Summary of categorical variables ```{r eda_41, eval=T,include=F} et11 <- ExpCTable(Carseats,Target="Price",margin=1,clim=10,round=2,bin=4,per=F) rownames(et11)<-NULL ``` * frequency for all categorical independent variables by descretized **Price** ```{r e4.2, warning=FALSE,eval=F,include=T} ##bin=4, descretized 4 categories based on quantiles ExpCTable(Carseats,Target="Price",margin=1,clim=10,round=2,bin=4,per=F) ``` ```{r e4.2.1,warning=FALSE,eval=T,render=et11,echo=F} paged_table(et11) ``` ##### Compute Weighted Summary Statistics for categorical variable ```{r ,warning=FALSE,eval=T,include=T} carseat = ISLR::Carseats ## Compute random weight carseat$wt = stats::runif( nrow(carseat), 0.5, 1.5 ) wt_summary = ExpCTable(carseat,margin=1,clim=10,round=2,bin=4,per=F, weight = "wt") wt_summary ``` ### 3.3 Example for case 3: Target variable is categorical #### 3.3.1. Summary of categorical dependent variable 1. Variable name - **Urban** 2. Variable description - **Whether the store is in an urban or rural location** ```{r dd,warning=FALSE,eval=T,include=F} tab_tar <- data.frame(table(Carseats[,"Urban"])) tab_tar$Descriptions <- "Store location" names(tab_tar) <- c("Urban","Frequency","Descriptions") rownames(tab_tar)<-NULL ``` ```{r dv-r,warning=FALSE,eval=T,render=tab_tar,echo=F} kable(tab_tar, "html") ``` #### 3.3.2 Summary of numerical variables Summary of all numeric variables ```{r snc1,warning=FALSE,eval=T,include=F} snc = ExpNumStat(Carseats,by="GA",gp="Urban",Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2) rownames(snc)<-NULL ``` ```{r snc2, warning=FALSE,eval=F,include=T} ExpNumStat(Carseats,by="GA",gp="Urban",Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2) ``` ```{r snc3,warning=FALSE,eval=T,render=snc,echo=F} paged_table(snc) ``` #### 3.3.3 Distributions of Numerical variables * Box plots for all numerical variables vs categorical dependent variable - Bivariate comparision only with categories Boxplot for all the numeric attributes by each category of **Urban** ```{r bp3.1,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=7,fig.width=7} plot4 <- ExpNumViz(Carseats,target="Urban",type=1,nlim=3,fname=NULL,col=c("darkgreen","springgreen3","springgreen1"),Page=c(2,2),sample=8) plot4[[1]] ``` #### 3.3.4 Summary of categorical variables ```{r ed3.3, eval=T,include=F} et100 <- ExpCTable(Carseats,Target="Urban",margin=1,clim=10,nlim=3,round=2,bin=NULL,per=F) rownames(et100)<-NULL et4 <- ExpCatStat(Carseats,Target="Urban",result = "Stat",clim=3,nlim=3,bins=10,Pclass="Yes",plot=FALSE,top=20,Round=2) rownames(et4)<-NULL et5 <- ExpCatStat(Carseats,Target="Urban",result = "IV",clim=10,nlim=5,bins=10,Pclass="Yes",plot=FALSE,top=20,Round=2) rownames(et5)<-NULL et5 <- et5[1:15,] ``` **Cross tabulation with target variable** * Custom tables between all categorical independent variables and target variable **Urban** ```{r ed3.4, warning=FALSE,eval=F,include=T} ExpCTable(Carseats,Target="Urban",margin=1,clim=10,nlim=3,round=2,bin=NULL,per=F) ``` ```{r ed3.5,warning=FALSE,eval=T,render=et100,echo=F,out.height=8,out.width=8} kable(et100,"html") ``` **Information Value** ```{r ed3.6, warning=FALSE,eval=F,include=T} ExpCatStat(Carseats,Target="Urban",result = "IV",clim=10,nlim=5,bins=10,Pclass="Yes",plot=FALSE,top=20,Round=2) ``` ```{r ed3.7,warning=FALSE,eval=T,render=et5,echo=F,out.height=8,out.width=8} kable(et5,"html") ``` **Statistical test** ```{r ed3.8, warning=FALSE,eval=F,include=T} et4 <- ExpCatStat(Carseats,Target="Urban",result = "Stat",clim=10,nlim=5,bins=10,Pclass="Yes",plot=FALSE,top=20,Round=2) ``` ```{r ed3.9,warning=FALSE,eval=T,render=et4,echo=F,out.height=8,out.width=8} kable(et4,"html") ``` **Variable importance based on Information value** ```{r ed3.91,warning=FALSE,eval=T,fig.align='center',fig.height=7,fig.width=7} varimp <- ExpCatStat(Carseats,Target="Urban",result = "Stat",clim=10,nlim=5,bins=10,Pclass="Yes",plot=TRUE,top=10,Round=2) ``` #### 3.3.5. Distributions of categorical variables Stacked bar plot with vertical or horizontal bars for all categorical variables ```{r ed3.10,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=7,fig.width=7} plot5 <- ExpCatViz(Carseats,target="Urban",fname=NULL,clim=5,col=c("slateblue4","slateblue1"),margin=2,Page = c(2,1),sample=2) plot5[[1]] ``` ## 4. Quantile-quantile plot for numeric variables Function definition: ``` ExpOutQQ (data,nlim=3,fname=NULL,Page=NULL,sample=NULL) data : Input dataframe or data.table nlim : numeric variable limit fname : output file name (Output will be in PDF format) Page : output pattern. if Page=c(3,2), It will generate 6 plots with 3 rows and 2 columns sample : random number of plots ``` Carseats data from ISLR package: ```{r,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=7,fig.width=7} options(width = 150) CData = ISLR::Carseats qqp <- ExpOutQQ(CData,nlim=10,fname=NULL,Page=c(2,2),sample=4) qqp[[1]] ``` ## 5. Parallel Co-ordinate plots Function definition: ``` ExpParcoord (data,Group=NULL,Stsize=NULL,Nvar=NULL,Cvar=NULL,scale=NULL) data : Input dataframe or data.table Group : stratification variables Stsize : vector of startum sample sizes Nvar : vector of numerice variables, default it will consider all the numeric variable from data Cvar : vector of categorical variables, default it will consider all the categorical variable scale : scale the variables in the parallel coordinate plot[Default normailized with minimum of the variable is zero and maximum of the variable is one] ``` ### 5.1 Defualt ExpParcoord funciton ```{r,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=3,fig.width=7} ExpParcoord(CData,Group=NULL,Stsize=NULL,Nvar=c("Price","Income","Advertising","Population","Age","Education")) ``` ### 5.2 With Stratified rows and selected columns only ```{r,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=3,fig.width=7} ExpParcoord(CData,Group="ShelveLoc",Stsize=c(10,15,20),Nvar=c("Price","Income"),Cvar=c("Urban","US")) ``` ### 5.3 Without stratification ```{r,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=3,fig.width=7} ExpParcoord(CData,Group="ShelveLoc",Nvar=c("Price","Income"),Cvar=c("Urban","US"),scale=NULL) ``` ### 5.4 Scale change `std: univariately, subtract mean and divide by standard deviation` ```{r,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=3,fig.width=7} ExpParcoord(CData,Group="US",Nvar=c("Price","Income"),Cvar=c("ShelveLoc"),scale="std") ``` ### 5.5 Selected numeric variables ```{r,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=3,fig.width=7} ExpParcoord(CData,Group="ShelveLoc",Stsize=c(10,15,20),Nvar=c("Price","Income","Advertising","Population","Age","Education")) ``` ### 5.6 Selected categorical variables ```{r,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=3,fig.width=7} ExpParcoord(CData,Group="US",Stsize=c(15,50),Cvar=c("ShelveLoc","Urban")) ``` ## 6. Customized Summary Statistics Used 'data.table' package functions Function definition: ``` ExpCustomStat(data,Cvar=NULL,Nvar=NULL,stat=NULL,gpby=TRUE,filt=NULL,dcast=FALSE) ``` **ExpCustomStat examples** ```{r dudu, eval=T,include=F} e1du <- ExpCustomStat(Carseats,Cvar="Urban",Nvar=c("Age","Price"),stat=c("mean","count"),gpby=TRUE,dcast=F) rownames(e1du)<-NULL e1du1 <- ExpCustomStat(Carseats,Cvar="Urban",Nvar=c("Age","Price"),stat=c("mean","count"),gpby=TRUE,dcast=T) rownames(e1du1)<-NULL e1du2 <- ExpCustomStat(Carseats,Cvar=c("Urban","ShelveLoc"),Nvar=c("Age","Price","Advertising","Sales"),stat=c("mean"),gpby=FALSE,dcast=T) rownames(e1du2)<-NULL ``` ```{r dud1, warning=FALSE,eval=F,include=T} ExpCustomStat(Carseats,Cvar="Urban",Nvar=c("Age","Price"),stat=c("mean","count"),gpby=TRUE,dcast=F) ``` ```{r dud12,warning=FALSE,eval=T,render=e1du,echo=F,out.height=8,out.width=8} kable(e1du,"html") ``` ```{r dud2, warning=FALSE,eval=F,include=T} ExpCustomStat(Carseats,Cvar="Urban",Nvar=c("Age","Price"),stat=c("mean","count"),gpby=TRUE,dcast=T) ``` ```{r dud21,warning=FALSE,eval=T,render=e1du1,echo=F,out.height=8,out.width=8} kable(e1du1,"html") ``` ```{r dud3, warning=FALSE,eval=F,include=T} ExpCustomStat(Carseats,Cvar=c("Urban","ShelveLoc"),Nvar=c("Age","Price","Advertising","Sales"),stat=c("mean"),gpby=FALSE,dcast=T) ``` ```{r dud31,warning=FALSE,eval=T,render=e1du2,echo=F,out.height=8,out.width=8} kable(e1du2,"html") ``` ## 7. Univariate outlier analysis In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.An outlier can cause serious problems in statistical analyses. Function `ExpOutliers` can run univariate outlier analysis based on boxplot or SD method. The function returns the summary of oultlier for selected numeric features and adding new features if there are any outlers Identifying outliers: There are several methods we can use to identify outliers. In `ExpOutliers` used two methods (1) Boxplot and (2) Standard Deviation ```{r ktana, eval=T,include=F} ana1 <- ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "boxplot", treatment = "mean", capping = c(0.1, 0.9)) outlier_summ <- ana1[[1]] outlier_data <- ana1[[2]] ana2 <- ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "3xStDev", treatment = "median", capping = c(0.1, 0.9)) outlier_summ1 <- ana2[[1]] outlier_data1 <- ana2[[2]] ``` ### 7.1 Identifying outliers using Boxplot method ```{r out1, warning=FALSE,eval=F,include=T} ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "boxplot", treatment = "mean", capping = c(0.1, 0.9)) ``` **Summary** ```{r out11,warning=FALSE,eval=T,render=outlier_summ,echo=F,out.height=8,out.width=8} kable(outlier_summ,"html") ``` **Output data head view** ```{r out12,warning=FALSE,eval=T,render=outlier_data,echo=F,out.height=8,out.width=8} kable(head(outlier_data),"html") ``` ### 7.2 Identifying outliers using 3 Standard Deviation method ```{r out2, warning=FALSE,eval=F,include=T} ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "3xStDev", treatment = "medain", capping = c(0.1, 0.9)) ``` **Summary** ```{r out21,warning=FALSE,eval=T,render=outlier_summ1,echo=F,out.height=8,out.width=8} kable(outlier_summ1,"html") ``` **Output data head view** ```{r out22,warning=FALSE,eval=T,render=outlier_data1,echo=F,out.height=8,out.width=8} kable(head(outlier_data1),"html") ```