#Libraries import section
library(ggplot2)
library(lattice)
library(corrplot)
red_wine_ds <- read.csv('wineQualityReds.csv') #read csv file
str(red_wine_ds)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The data set has 1599 observations and 13 variables. X is a row index and quality is an integer score; the other 11 variables are continuous chemical measurements.

Univariate Plots:

Let’s start exploring the dataset one variable at a time.

Red wine : Quality

Quality is a discrete variable recorded as an integer score. In this EDA, we want to find out which chemical properties affect the quality of wine.

So, let’s start our analysis with the quality variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

## [1] 0.8248906

I calculated the ratio of wines with quality 5 or 6 relative to all wines and found the following:

  1. The median quality is 6 and the mean is 5.64, so most wines sit in the middle of the scale

  2. 82.49% of wines are of quality 5 or 6
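
A minimal sketch of how such a ratio can be computed. The vector below holds only the first ten quality scores shown in the str() output, so its result differs from the 0.8248906 obtained on the full data; the names quality_counts and share_5_or_6 are illustrative.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Number of wines at each quality level.
quality_counts <- table(quality)

# mean() of a logical vector gives the proportion of TRUE values.
share_5_or_6 <- mean(quality %in% c(5, 6))

print(quality_counts)
print(share_5_or_6)
```

On the full data, the same expression would be applied to red_wine_ds$quality.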

Fixed Acidity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

## [1] 34.14634

The distribution has some peaks near the center.

Around 34.15% of fixed acidity values lie in the range [7,8]. There is a peak in the graph at around 7.5.
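
A percentage like this can be computed with a small helper. The sketch below is an assumption about the calculation, not the original code: pct_in_range is a hypothetical function, it treats both ends of the range as included, and it runs on only the first ten fixed.acidity values from the str() output.

```r
# Percentage of values of x that fall inside [lower, upper], both ends included.
pct_in_range <- function(x, lower, upper) {
  100 * mean(x >= lower & x <= upper)
}

# First ten fixed.acidity values, copied from the str() output above.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

pct_7_to_8 <- pct_in_range(fixed_acidity, 7, 8)
print(pct_7_to_8)
```

Whether the boundary values 7 and 8 are counted changes the result slightly, so the choice is worth stating next to the reported figure.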

Volatile Acidity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The data shows that only a few wines have a volatile acidity above 1. These may be outliers.

Let’s draw the graph after removing the outliers.

It is a roughly normal distribution with some peaks in between. The graph shows that most volatile acidity values lie between 0.3 and 0.7.

Citric Acid:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

## [1] 78.61163
## [1] 132

132 wines have a citric acid value of 0.0.

Most of the wines (78.6%) have a citric acid value below 0.5.

The citric acid distribution has peaks at 0 and 0.48; otherwise it is an even distribution.

Residual Sugar:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

After removing the outliers, the graph is roughly normal. There is a peak around 2 for residual.sugar.

Chlorides:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

After removing outliers from chlorides, the graph is roughly normal. There is a peak around 0.075 for chlorides.

Free Sulfur Dioxide:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

After removing outliers from free.sulfur.dioxide, the graph looks evenly spread. The peak is around 6.

Total Sulfur Dioxide:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

In the total.sulfur.dioxide graph there are some peaks between 15 and 25.

Density:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Density varies only from 0.99 to 1.004, which is very little variation. Most of the wines have a density of around 0.997.

pH values:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH varies only from 2.74 to 4.01.

It is a roughly normal distribution with some outliers at pH<3 and pH>3.7. Most of the wines have a pH value around 3.3, in line with the median of 3.31.

Sulphates:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

After removing outliers from sulphates, the graph looks roughly normal.

Most of the sulphates values are around 0.6.

Alcohol:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Only a few alcohol values lie below 9 or above 13.

After removing the outliers, the graph shows a peak at 9.5.

Univariate Analysis

What is the structure of your dataset?

Red Wine dataset contains 1599 observations and 13 variables.

What is/are the main feature(s) of interest in your dataset?

In the univariate analysis, we found that 34% of fixed.acidity values lie in [7,8], 78.6% of citric acid values are below 0.5, and density varies only from 0.99 to 1.004, so these can be considered features of interest. Our analysis is centred on quality, so quality is the main feature of interest. 82% of the wines are of quality 5 or 6.

In everyday drinking, a wine is generally judged good by its taste, i.e. by its sugar and alcohol content.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think sulphates and chlorides may help to analyse the quality of wine.

Did you create any new variables from existing variables in the dataset?

No, I did not create any new variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Chlorides and residual sugar look right-skewed, with a long tail of high values (in both, the mean is above the median and the maximum is far above the third quartile). Only after removing the outliers were we able to see a roughly normal graph. Restricting the x-axis range helped to analyse the data; we used limits to narrow the x-axis.
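
A minimal sketch of trimming high values before drawing a histogram. The cut-off used here (the 95th percentile) and the bin width are assumptions, since the limits actually used are not stated; the data are only the first ten residual.sugar values from the str() output, and the names trimmed and sugar_hist are illustrative.

```r
library(ggplot2)

# First ten residual.sugar values, copied from the str() output above.
wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
)

# A mean well above the median points to a long tail of high values.
sugar_mean <- mean(wine$residual.sugar)
sugar_median <- median(wine$residual.sugar)

# Assumed cut-off: the 95th percentile. The original limits are not stated.
upper_limit <- quantile(wine$residual.sugar, 0.95)
trimmed <- subset(wine, residual.sugar <= upper_limit)

# Histogram of the trimmed values; binwidth is an illustrative choice.
sugar_hist <- ggplot(trimmed, aes(x = residual.sugar)) +
  geom_histogram(binwidth = 0.2) +
  labs(x = "residual.sugar", y = "count")
```

Calling print(sugar_hist) renders the plot. Filtering rows, as here, removes the high values before binning; coord_cartesian(xlim = ...) would instead zoom in without dropping any data.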


From the EDA done above, we are able to understand the variation of one chemical property at a time. But this doesn’t give a clear picture of its relationship with the quality of wines.

So, we will have to perform further analysis.


Correlation Graph:

A correlation graph may help us by showing the correlation among the different variables.

From the correlation graph, we can see that volatile.acidity (-0.4) and alcohol (0.48) have the strongest correlations with the quality of wines.

Citric acid and sulphates are also correlated with quality.

So, let’s try to analyse further using these metrics.
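
The correlations behind such a graph can be computed with cor(). This is a sketch on the first ten rows shown in the str() output, so the coefficients will not match the full-data values quoted above; the names cor_matrix and quality_cor are illustrative, and the corrplot call is only run if that package is installed.

```r
# First ten rows of five columns, copied from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation matrix of all numeric columns.
cor_matrix <- cor(wine)

# Correlation of each property with quality, strongest first.
quality_cor <- cor_matrix[colnames(cor_matrix) != "quality", "quality"]
quality_cor <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(quality_cor, 2))

# Draw the matrix if corrplot is installed.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "number")
}
```

On the full data, the X index column would need to be dropped first, since it is a row number rather than a chemical property.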

Bivariate Plots:

Here, we will analyse the dataset with respect to two variables at a time, i.e. we will plot one variable on the x-axis and the other on the y-axis.

We know which metrics are correlated with quality. So, let’s plot with ggplot, with quality on the x-axis and one of the correlated metrics on the y-axis.

Box Plot: Volatile Acidity vs quality

It shows that good quality wines have low volatile acidity. It is negatively correlated with quality.

Box Plot: Alcohol vs quality

It shows that good quality wines have a high level of alcohol.

Box Plot: Citric Acid vs quality

It shows that good quality wines have a high level of citric acid.

Citric acid tends to increase with the quality of wine.

Box Plot: Sulphates vs quality

It shows that good quality wines have a high value of sulphates.

Sulphates tend to increase with the quality of wine.
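
The box plots in this section can be sketched as follows. This is an illustration under assumptions, not the original plotting code: it uses only the first ten rows from the str() output, and the names medians and acidity_box are hypothetical.

```r
library(ggplot2)

# First ten rows of two columns, copied from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median volatile acidity per quality level: the centre line of each box.
medians <- tapply(wine$volatile.acidity, wine$quality, median)
print(medians)

# factor(quality) makes ggplot2 draw one box per quality score.
acidity_box <- ggplot(wine, aes(x = factor(quality), y = volatile.acidity)) +
  geom_boxplot() +
  labs(x = "quality", y = "volatile.acidity")
```

Calling print(acidity_box) renders the plot. Swapping volatile.acidity for alcohol, citric.acid or sulphates gives the other three box plots.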

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The correlation plot helped to understand the correlation among the different features. It shows that only volatile acidity, alcohol, citric acid and sulphates are noticeably correlated with the quality of wine. Other features like density, residual sugar and chlorides show little correlation with it.

So, we have to change our features of interest. Our features of interest are now {volatile acidity, alcohol, citric acid, sulphates}.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Earlier we considered sulphates and chlorides as features of interest, but from the correlation graph we observed that, of the two, only sulphates is correlated with the quality of wine.

We observed that good quality wines have a high value of sulphates.

What was the strongest relationship you found?

Alcohol and volatile acidity have the strongest correlations with the quality of wine. The box plots in the bivariate analysis show:

  1. good quality wines have a high level of alcohol.

  2. good quality wines have low volatile acidity.


Bivariate analysis helped us to understand the correlation of volatile acidity, alcohol, citric acid and sulphates with quality. It suggests that these metrics influence the quality of wine. To further validate and improve our result, let’s perform multivariate analysis.


Multivariate Plots:

From the correlation graph, we can see that fixed acidity is correlated with volatile acidity, citric acid, density and pH.

So, let’s plot these to understand the relation of these variables with the quality of wines.

Here, we plot scatterplots of fixed acidity against the other 4 metrics (volatile acidity, citric acid, pH, density).

From these graphs, the correlation between the x-axis and y-axis variables is visible as straight-line behaviour, but their relationship with the quality of wine is not. The lines for quality 3, 4, 5 etc. lie over each other.

These 4 plots lead us to no useful result.

Cut metrics into buckets:

Here, we convert continuous variables into ranges. We cut each of them into 5 buckets, using the output of summary() to set the ranges.

#Cut volatile acidity into 5 parts using response of summary()
#include.lowest = TRUE keeps the minimum value in the first bucket instead of NA
red_wine_ds$volatile.acidity.bucket <- 
  cut(red_wine_ds$volatile.acidity,
      breaks = c(0.12, 0.39, 0.52, 0.53, 0.64, 1.58),
      include.lowest = TRUE)

#Cut alcohol into 5 parts using response of summary()
red_wine_ds$alcohol.bucket <- 
  cut(red_wine_ds$alcohol,
      breaks = c(8.4, 9.5, 10.2, 10.42, 11.1, 14.9),
      include.lowest = TRUE)

#Cut citric acid into 5 parts using response of summary()
red_wine_ds$citric.acid.bucket <- 
  cut(red_wine_ds$citric.acid,
      breaks = c(0, 0.09, 0.26, 0.271, 0.42, 1),
      include.lowest = TRUE)

#Cut sulphates into 5 parts using response of summary()
red_wine_ds$sulphates.bucket <- 
  cut(red_wine_ds$sulphates,
      breaks = c(0.33, 0.55, 0.62, 0.6581, 0.73, 2),
      include.lowest = TRUE)
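
One detail of cut() is worth checking whenever the first break equals the minimum of the data: intervals are open on the left by default, so the minimum value itself is assigned NA unless include.lowest = TRUE is set. The sketch below shows the difference on five illustrative values (the minimum and maximum from the summary() output plus three values from str()); the names default_bucket and closed_bucket are hypothetical.

```r
# Breaks taken from the summary() output of volatile.acidity above.
breaks <- c(0.12, 0.39, 0.52, 0.53, 0.64, 1.58)

# Illustrative values: the minimum, three values from str(), and the maximum.
x <- c(0.12, 0.7, 0.88, 0.28, 1.58)

# Default intervals are (lower, upper], so the minimum falls outside and becomes NA.
default_bucket <- cut(x, breaks = breaks)

# include.lowest = TRUE closes the first interval: [0.12, 0.39].
closed_bucket <- cut(x, breaks = breaks, include.lowest = TRUE)

print(data.frame(x, default_bucket, closed_bucket))
```

This matters most for citric acid, where 132 wines sit exactly on the lowest break of 0.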

Let’s plot scatterplots of the variables correlated with quality, using the above buckets.

Scatter plots:

Alcohol & quality with other properties

Here we plot alcohol vs quality vs sulphates/ citric acid/ volatile acidity buckets. We use jitter in the plot to reduce overplotting and geom_smooth to add a smoothed mean.

These plots show that:

  1. High sulphates together with high alcohol go with higher quality wine. They are positively correlated.

  2. High citric acid together with high alcohol goes with higher quality wine.
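
A sketch of one such jittered scatter plot, under stated assumptions: the axis assignment (alcohol on x, quality on y), the linear smoothing method and the jitter amounts are guesses, since the original plotting code is not shown. It uses the first ten rows from the str() output and, for brevity, two sulphates buckets split at the median instead of five.

```r
library(ggplot2)

# First ten rows of three columns, copied from the str() output above.
wine <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Two illustrative buckets split at the median (the analysis itself uses five).
wine$sulphates.bucket <- cut(
  wine$sulphates,
  breaks = c(-Inf, median(wine$sulphates), Inf),
  labels = c("low", "high")
)

# Jitter spreads out points that share the same integer quality score;
# geom_smooth(method = "lm") adds one fitted line per bucket.
alcohol_plot <- ggplot(wine, aes(x = alcohol, y = quality,
                                 colour = sulphates.bucket)) +
  geom_jitter(width = 0.05, height = 0.2, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "alcohol", y = "quality", colour = "sulphates bucket")
```

Calling print(alcohol_plot) renders the plot. Mapping the bucket to colour is what lets a third variable show up in a two-axis plot: each bucket gets its own points and its own smoothed line.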

Volatile acidity & quality with other properties

We plot volatile acidity vs quality vs sulphates/ citric acid/ alcohol buckets. We use jitter in the plot to reduce overplotting and geom_smooth to add a smoothed mean.

These plots show that:

  1. Lower volatile acidity goes with higher quality wine.

Citric Acid & quality with other properties

We plot citric acid vs quality vs sulphates/ citric acid/ volatile acidity buckets. We use jitter in the plot to reduce overplotting and geom_smooth to add a smoothed mean.

These plots show no clear influence of citric acid, combined with the other properties, on the quality of wine.

Sulphates & quality with other properties

We plot sulphates vs quality vs volatile acidity/ citric acid/ alcohol buckets. We use jitter in the plot to reduce overplotting and geom_smooth to add a smoothed mean.

These plots show that:

  1. Higher sulphates go with higher quality wine.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High alcohol contributes to good quality wine; high sulphates or high citric acid together with high alcohol influence the quality positively.

Lower volatile acidity also goes with higher quality wine.

Were there any interesting or surprising interactions between features?

The correlation plot showed that citric acid influences the quality of wine, but from these plots we can observe that citric acid alone doesn’t influence the quality. Only high alcohol and high citric acid together positively influence the quality of wine.


Final Plots and Summary

1. Quality of Wines

This graph shows that 82.49% of the wines in the dataset are of quality 5 or 6. As we have to find the influence of other metrics on the quality of wine, this graph is very important.

2. Scatter Plot of Alcohol Vs Volatile Acidity vs quality

We observed that alcohol (0.48) and volatile acidity (-0.4) have the strongest correlations with the quality of wine.

Because of this, a scatter plot of alcohol vs volatile acidity vs quality is very important. From this scatter plot we can observe that alcohol is positively correlated with the quality of wine.

Higher alcohol goes with better quality wine.

3. Scatter Plot of Alcohol Vs sulphates vs quality

This scatter plot clearly illustrates the relationship of alcohol and sulphates with the quality of wine. It shows straight lines similar to the lines in a graph of uniform motion. This shows that alcohol and sulphates are positively correlated with quality.

Higher values of sulphates and alcohol go with better quality wine.


Reflection

Red wine dataset contains 1599 observations with 13 variables. In this analysis, our main objective is to find out which chemical properties influence the quality of red wines.

In the univariate analysis, we plotted histograms of various metrics. We observed that 82% of wines are of quality 5 or 6. The other histograms in the analysis didn’t give a clear picture from which we could draw conclusions about wine quality.

We plotted a correlation plot. It worked like magic. Using this plot, we were able to see that alcohol, volatile.acidity, citric.acid and sulphates are correlated with the quality of wine.

In the bivariate analysis, we plotted box plots of the correlated variables with respect to quality. These box plots showed that good quality wines have:

  1. low volatile acidity
  2. high level of alcohol
  3. high level of citric acid
  4. high value of sulphates

In the multivariate analysis, we plotted scatterplots.

We went back to the correlation plot, checked the correlation of fixed acidity with other variables, and plotted scatter plots of fixed acidity against them. In these graphs the correlation between the x-axis and y-axis variables is visible, but their relationship with the quality of wine is not. This didn’t give good results.

Scatterplots of the variables correlated with quality gave a good result. They helped to understand that high sulphates with high alcohol, high citric acid with high alcohol, low volatile acidity, and high sulphates go with higher quality wine.

But the scatterplots of citric acid with other properties didn’t give a good result.

How could the analysis be enriched in future work (e.g. additional data and analyses)?

The above dataset is limited to 1599 observations, and 82% of its wines are of quality 5 or 6, so it is not well balanced. With a dataset of tens of thousands of wines spread approximately uniformly across quality levels, we would be able to perform a better analysis.