#Libraries import section
library(ggplot2)
library(lattice)
library(corrplot)
red_wine_ds <- read.csv('wineQualityReds.csv') #read csv file
str(red_wine_ds)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The data set has 1599 observations and 13 variables. Eleven variables are numeric chemical measurements; X is an integer row index and quality is an integer score.
Let’s start exploring the dataset one variable at a time.
Quality is an integer-valued score. In this EDA, we want to find out which chemical properties affect the quality of wine.
So, let’s start our analysis with the quality variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
## [1] 0.8248906
I calculated the proportion of wines with quality 5 or 6 relative to all wines, which is the 0.8248906 shown above.
About 82.5% of the wines are of quality 5 or 6.
So most of the wines fall into just two quality levels.
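The proportion quoted above can be computed in one line. The following is a minimal sketch, assuming only the first ten quality values printed by str(); on the full data the same expression would presumably be applied to red_wine_ds$quality.

```r
# Stand-in data: the first ten quality scores printed by str() above.
# Ten rows are used only so the snippet runs on its own; they are not
# expected to reproduce the 0.8248906 obtained from all 1599 wines.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# %in% gives TRUE/FALSE per wine; the mean of a logical vector is the
# proportion of TRUE values.
share_5_or_6 <- mean(quality %in% c(5, 6))
print(share_5_or_6)

# Number of wines at each quality level.
quality_counts <- table(quality)
print(quality_counts)
```

The same pattern, mean() over a logical condition, also covers statements such as "share of values inside a range" for the other variables.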
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] 34.14634
The distribution has a few peaks near the center.
Around 34.15% of the fixed acidity values lie in the range [7, 8]. There is a peak in the graph at around 7.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The data shows that only a few wines have a volatile acidity above 1. These may be outliers.
Let’s draw the graph after removing the outliers.
The distribution is roughly normal with a few peaks. The graph shows that most volatile acidity values lie between 0.3 and 0.7.
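Dropping the high values before plotting could look like the sketch below. It assumes a cutoff of 1 (taken from the text) and a bin width of 0.05 (a hypothetical choice), and uses the first ten volatile.acidity values from str() plus the maximum reported by summary() as stand-in data. The plotting step is guarded so the snippet still runs where ggplot2 is not installed.

```r
# Stand-in data: first ten volatile.acidity values from str(), plus the
# maximum (1.58) from summary() so there is one value above the cutoff.
volatile <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5, 1.58)
wine_sample <- data.frame(volatile.acidity = volatile)

# Assumed cutoff: the text treats values above 1 as possible outliers.
upper_limit <- 1
kept <- subset(wine_sample, volatile.acidity <= upper_limit)
print(nrow(wine_sample) - nrow(kept))  # number of rows dropped

if (requireNamespace("ggplot2", quietly = TRUE)) {
  # binwidth = 0.05 is an illustrative choice, not taken from the report.
  p <- ggplot2::ggplot(kept, ggplot2::aes(x = volatile.acidity)) +
    ggplot2::geom_histogram(binwidth = 0.05) +
    ggplot2::labs(x = "volatile acidity", y = "count")
  # print(p) draws the histogram.
}
```

Filtering the rows first keeps the bin counts unaffected by the cutoff; setting axis limits on the plot instead is another option with slightly different behaviour at the edges.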
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] 78.61163
## [1] 132
132 wines have a citric acid value of 0.
Most of the wines (about 78.6%) have a citric acid value below 0.5.
The citric acid distribution has peaks at 0 and around 0.48; otherwise it is fairly even.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
After removing the outliers, the distribution of residual.sugar looks roughly normal. There is a peak around 2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
After removing the outliers, the distribution of chlorides looks roughly normal. There is a peak around 0.075.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
After removing the outliers from free.sulfur.dioxide, the graph looks evenly spread. The peak is around 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
In the total.sulfur.dioxide graph there are some peaks between 15 and 25.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Density varies only from 0.990 to 1.004, which is very little variation. Most of the wines have a density of around 0.997.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH varies only from 2.74 to 4.01.
The distribution is roughly normal, with some outliers at pH < 3 and pH > 3.7. Half of the wines have a pH between 3.21 and 3.40.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
After removing the outliers from sulphates, the graph looks roughly normal.
Most sulphate values are around 0.6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Only a few alcohol values lie below 9 or above 13.
After removing the outliers, the graph shows a peak at 9.5.
What is the structure of your dataset?
The red wine dataset contains 1599 observations and 13 variables.
What is/are the main feature(s) of interest in your dataset?
From the univariate analysis, we found that 34% of fixed.acidity values lie in [7, 8], 78% of citric acid values are below 0.5, and density varies only from 0.99 to 1.00, so these can be considered features of interest. We want to analyse quality, so quality is the main feature of interest. 82% of the wines are of quality 5 or 6.
In everyday drinking, a wine is usually judged by its taste, i.e. by its sugar and alcohol content.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
I think sulphates and chlorides may help to analyse the quality of wine.
Did you create any new variables from existing variables in the dataset?
No, I didn’t create any new variables.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
The chlorides and residual sugar distributions are right-skewed, with long tails of high values. Only after removing the outliers were we able to see a roughly normal shape. Restricting the x-axis helped to analyse the data: we used axis limits to narrow the range shown.
From the EDA done above, we can understand the variation of one chemical property at a time. But this doesn’t give a clear picture of its relationship with the quality of wines.
So, we will have to perform further analysis.
A correlation plot may help by showing the correlation among the different variables.
From the correlation plot, we can see that volatile.acidity (-0.4) and alcohol (0.48) have the strongest correlations with the quality of wines.
Citric acid and sulphates are also correlated with quality.
So, let’s analyse further using these variables.
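The correlations read off the plot can also be listed numerically. The sketch below is an illustration only: it assumes Pearson correlation (the default of cor()) and uses the first ten rows of five columns from the str() output as stand-in data, so its numbers will not match the -0.4 and 0.48 reported for the full dataset.

```r
# Stand-in data: first ten rows of four properties plus quality, copied
# from the str() output above.
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation matrix of all columns.
cor_matrix <- cor(wine_sample)

# Correlation of each property with quality, ordered from weakest to
# strongest by absolute value.
cor_with_quality <- cor_matrix["quality", colnames(cor_matrix) != "quality"]
cor_with_quality <- cor_with_quality[order(abs(cor_with_quality))]
print(round(cor_with_quality, 2))

# To draw the matrix with the corrplot package loaded at the top:
# corrplot::corrplot(cor_matrix, method = "number")
```

On the full data frame, the row index X would presumably be dropped before calling cor(), since its correlation with the other columns carries no meaning.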
Here, we analyse the dataset two variables at a time, i.e. we plot one variable on the x-axis and the other on the y-axis.
We know which variables are correlated with quality. So, let’s plot with quality on the x-axis and one of the correlated variables on the y-axis.
Box Plot: Volatile Acidity vs Quality
It shows that good-quality wines have low volatile acidity. Volatile acidity is negatively correlated with quality.
Box Plot: Alcohol vs Quality
It shows that good-quality wines have a high level of alcohol.
Box Plot: Citric Acid vs Quality
It shows that good-quality wines have a high level of citric acid.
Citric acid tends to increase with the quality of wine.
Box Plot: Sulphates vs Quality
It shows that good-quality wines have a high level of sulphates.
Sulphates tend to increase with the quality of wine.
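A box plot of this kind treats quality as a discrete grouping variable. The sketch below shows one way to build it for alcohol, again using the first ten rows from the str() output as stand-in data; the per-group medians printed first are the center lines such a box plot would draw. The axis labels are illustrative choices.

```r
# Stand-in data: first ten alcohol and quality values from str() above.
wine_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median alcohol per quality level.
median_alcohol <- tapply(wine_sample$alcohol, wine_sample$quality, median)
print(median_alcohol)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  # factor(quality) makes ggplot2 draw one box per quality level instead
  # of treating quality as a continuous axis.
  p <- ggplot2::ggplot(wine_sample,
                       ggplot2::aes(x = factor(quality), y = alcohol)) +
    ggplot2::geom_boxplot() +
    ggplot2::labs(x = "quality", y = "alcohol")
  # print(p) draws the box plot.
}
```

Swapping alcohol for volatile.acidity, citric.acid, or sulphates gives the other three plots.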
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
The correlation plot helped us understand the correlation among the different features. It shows that only volatile acidity, alcohol, citric acid, and sulphates are noticeably correlated with the quality of wine. Other features such as density, residual sugar, and chlorides show little correlation with it.
So, we have to change our features of interest. They are now {volatile acidity, alcohol, citric acid, sulphates}.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
Earlier we considered sulphates and chlorides as features of interest, but from the correlation plot we observed that, of the two, only sulphates is correlated with the quality of wine.
We observed that good-quality wines have a high level of sulphates.
What was the strongest relationship you found?
Alcohol and volatile acidity have the strongest correlations with the quality of wine. The box plots in the bivariate analysis show that:
good-quality wines have a high level of alcohol.
good-quality wines have low volatile acidity.
Bivariate analysis helped us understand how volatile acidity, alcohol, citric acid, and sulphates relate to quality. It suggests that these properties influence the quality of wine. To validate and refine this result, let’s perform a multivariate analysis.
From the correlation plot, we can see that fixed acidity is correlated with volatile acidity, citric acid, density, and pH.
So, let’s plot these variables to understand their relation to the quality of wines.
Here, we plot scatter plots of fixed acidity against the other 4 variables (volatile acidity, citric acid, pH, density).
From these graphs, the correlation between the x-axis and y-axis variables is visible from the straight-line behaviour, but their relationship with the quality of wine is not visible. The lines for quality 3, 4, 5, etc. lie over each other.
These 4 plots do not lead to any useful result.
Here, we convert the continuous variables into ranges. We cut each one into 5 buckets, using the values returned by summary() as the break points.
# Cut volatile acidity into 5 buckets using the values from summary().
# include.lowest = TRUE keeps the minimum value in the first bucket
# instead of turning it into NA.
red_wine_ds$volatile.acidity.bucket <-
  cut(red_wine_ds$volatile.acidity,
      breaks = c(0.12, 0.39, 0.52, 0.53, 0.64, 1.58),
      include.lowest = TRUE)
# Cut alcohol into 5 buckets using the values from summary()
red_wine_ds$alcohol.bucket <-
  cut(red_wine_ds$alcohol,
      breaks = c(8.4, 9.5, 10.2, 10.42, 11.1, 14.9),
      include.lowest = TRUE)
# Cut citric acid into 5 buckets using the values from summary()
red_wine_ds$citric.acid.bucket <-
  cut(red_wine_ds$citric.acid,
      breaks = c(0, 0.09, 0.26, 0.271, 0.42, 1),
      include.lowest = TRUE)
# Cut sulphates into 5 buckets using the values from summary()
red_wine_ds$sulphates.bucket <-
  cut(red_wine_ds$sulphates,
      breaks = c(0.33, 0.55, 0.62, 0.6581, 0.73, 2),
      include.lowest = TRUE)
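One detail of cut() matters when the first break point equals the minimum of the data: by default the intervals are open on the left, so a value equal to the lowest break falls outside every bucket and becomes NA. The sketch below illustrates this with the citric acid breaks and the first ten citric.acid values from the str() output, five of which are exactly 0.

```r
# Stand-in data: first ten citric.acid values printed by str() above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
breaks <- c(0, 0.09, 0.26, 0.271, 0.42, 1)

# Default behaviour: intervals are (a, b], so 0 is not in (0, 0.09].
naive <- cut(citric, breaks = breaks)
print(sum(is.na(naive)))  # wines silently dropped from every bucket

# include.lowest = TRUE closes the first interval on the left: [0, 0.09].
fixed <- cut(citric, breaks = breaks, include.lowest = TRUE)
print(table(fixed))
```

Since 132 wines in the full dataset have a citric acid value of 0, the choice affects a visible share of the rows in any plot that groups by this bucket.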
Let’s plot scatter plots of the above buckets against the variables that are correlated with quality.
Here we are plotting alcohol vs quality vs sulphates / citric acid / volatile acidity buckets. We use jitter to reduce overplotting and geom_smooth to add a smoothed mean.
These plots show that:
High sulphates combined with high alcohol are associated with higher wine quality. Both are positively correlated with quality.
High citric acid combined with high alcohol is also associated with higher wine quality.
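A plot of this kind could be assembled as in the sketch below. The axis assignment (alcohol on x, quality on y, bucket as colour), the jitter amounts, and the linear smoothing method are all assumptions made for illustration; the report does not state them. The data are the first ten rows from the str() output, so the plot itself would be too sparse to interpret and only the bucket counts are printed.

```r
# Stand-in data: first ten alcohol, sulphates and quality values from str().
wine_sample <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Same break points as the sulphates bucket defined in the report.
wine_sample$sulphates.bucket <- cut(
  wine_sample$sulphates,
  breaks = c(0.33, 0.55, 0.62, 0.6581, 0.73, 2),
  include.lowest = TRUE
)
print(table(wine_sample$sulphates.bucket))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    wine_sample,
    ggplot2::aes(x = alcohol, y = quality, colour = sulphates.bucket)
  ) +
    # Jitter spreads points that share the same integer quality score.
    ggplot2::geom_jitter(alpha = 0.5, width = 0.05, height = 0.2) +
    # One fitted line per bucket; method = "lm" is an assumed choice.
    ggplot2::geom_smooth(method = "lm", se = FALSE)
  # print(p) draws the plot; it is meaningful only on the full dataset.
}
```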
We are plotting volatile acidity vs quality vs sulphates / citric acid / alcohol buckets. We use jitter to reduce overplotting and geom_smooth to add a smoothed mean.
These plots show that:
We are plotting citric acid vs quality vs sulphates / alcohol / volatile acidity buckets. We use jitter to reduce overplotting and geom_smooth to add a smoothed mean.
These plots show no clear combined influence of citric acid and the other properties on the quality of wine.
We are plotting sulphates vs quality vs volatile acidity / citric acid / alcohol buckets. We use jitter to reduce overplotting and geom_smooth to add a smoothed mean.
These plots show that:
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
High alcohol contributes to good wine quality, and higher sulphates or citric acid improve it further.
Lower volatile acidity is also associated with better quality.
Were there any interesting or surprising interactions between features?
The correlation plot suggested that citric acid influences the quality of wine, but from these plots we can observe that citric acid alone doesn’t influence the quality. Only high alcohol combined with high citric acid positively influences the quality of wine.
This graph shows that about 82.5% of the wines in the dataset are of quality 5 or 6. Since we want to find the influence of the other variables on the quality of wine, this graph is very important.
We observed that alcohol (0.5) and volatile acidity (-0.4) have the strongest correlations with the quality of wine.
As these are the two strongest correlations, the scatter plot of alcohol vs volatile acidity vs quality is very important. From this scatter plot we can observe that high alcohol is positively correlated with the quality of wine.
A higher level of alcohol goes with better wine quality.
This scatter plot clearly illustrates the relationship of alcohol and sulphates with the quality of wine. It shows straight lines, similar to the lines in a graph of uniform motion. This shows that alcohol and sulphates are positively correlated with quality.
Higher levels of sulphates and alcohol go with better wine quality.
The red wine dataset contains 1599 observations with 13 variables. In this analysis, our main objective is to find out which chemical properties influence the quality of red wines.
In the univariate analysis, we plotted histograms of the individual variables. We observed that 82% of the wines are of quality 5 or 6. The other histograms in the analysis didn’t give a clear picture from which we could draw conclusions about wine quality.
We then plotted a correlation plot, which proved very helpful. Using this plot, we found that alcohol, volatile.acidity, citric.acid, and sulphates are correlated with the quality of wine.
In the bivariate analysis, we plotted box plots of the correlated variables against quality. These box plots showed that good-quality wines have low volatile acidity and high levels of alcohol, citric acid, and sulphates.
In the multivariate analysis, we plotted scatter plots.
We went back to the correlation plot, checked the correlation of fixed acidity with the other variables, and plotted scatter plots of fixed acidity against them. In these graphs the correlation between the x-axis and y-axis variables is visible, but their relationship with the quality of wine is not. They didn’t give good results.
The scatter plots of the variables correlated with quality gave a good result. They helped us see that high sulphates with high alcohol, high citric acid with high alcohol, low volatile acidity, and high sulphates all influence the quality of wine.
But the scatter plots of citric acid with the other properties didn’t give a good result.
How could the analysis be enriched in future work (e.g. additional data and analyses)?
The dataset is limited to 1599 observations, and 82% of those wines are of quality 5 or 6, so it is not well balanced. With a dataset of 10,000 or more wines spread approximately uniformly across the quality levels, we would be able to perform a better analysis.