Wine Classification

Author

Prabuddha Tamhane, Harrison Li, Shihan Xu, Wesley Beard

1 Summary

This project illustrates how our team endeavoured to build a classification model to predict whether a wine was red or white based on a set of wine quality features (ex. pH, residual sugars, etc.). Four models were investigated with the best one being RBF SVM, and it performed extremely well on our test data. While there were still false positives and false negatives, indicated by the scores being less than one, this is not of great concern to us. Since we are not dealing with life-threatening or possible adverse outcomes, should a prediction be incorrect, the high level of precision in our model has given us confidence to use it in production.

2 Introduction

Red or white? This is a question that countless dinner guests are asked every evening. Some know exactly what they like and others are more open to exploring something new and exciting. And just like the consumer, wine producers have a vested interest in this question. However, they aren’t looking at the wine through a glass, but at the very chemistry that makes up each and every nuanced flavour. They are asking: What does the chemistry say this wine is? Our goal in this analysis is to see if we can develop a machine learning model to predict whether a wine will be classified as either white or red, based on a sample of wine chemical characteristics. This would allow the industry to be able to confirm their wines not only on a visual aspect but on a chemical level as well.

There may be a time that industry producers, small and large-alike, want to dive into how their wines measure up when compared to others. This could be confirming what is already known or helping to make that key distinction when a winemaker creates a blend of white and red and would like to know how best to market their product.

For example, it may be the case that a wine visually looks like a red, but that may not mean it chemically acts like one. Our model will be able to make that distinction and report back what the chemistry says.

Of course, there may be classifications that don’t align with what the wine is visually or the goal of the winemaker. In these cases, it may be due to variation in the model and the prediction probability not being reasonably high for either a red or white. While this may be seen as a disadvantage, it could also provide additional insights regarding the chemical quality of the wine in question. (I.e. It could be produced as a white wine, but with some underlying qualities of red wine.)

With a proper understanding of our model, the input data, and final prediction, we hope that it will provide the wine industry with another way of viewing and marketing their product to the next couple dining out.

3 Data

We are using a dataset of wine quality for wines from the northern region of Portugal (C. Cortez Paulo and Reis (2009), originally described in P. Cortez et al. (2009)) that consists of several chemical features: fixed acidity, pH, citric acid, sulphates, residual sugar and several others.

3.1 Data Wrangling

To prepare for data analysis, we:

  • combined the red and white wine datasets into a single dataframe,
  • remove the quality variable to avoid data leakage,
  • standardized column names,
  • created a categorical wine_type variable (red vs. white),
  • created a binary target variable (0 = red, 1 = white).

We confirmed that the dataset contains no missing values and falls within the expected measurement ranges described in P. Cortez et al. (2009). Finally, the data were randomly split into training and test sets.

3.2 Data Validation

Before conducting analysis, we performed a series of validation checks to ensure data quality:

  • verified the presence of all expected variables,
  • confirmed the absence of missing values,
  • checked for duplicate rows and removed them when present,
  • validated numerical ranges using domain knowledge from P. Cortez et al. (2009),
  • ensured that the wine_type and target variables contained only valid categories,
  • ensured that no feature-feature anomalous correlation and target-feature anomalous correlation exists.

These checks ensured the dataset was clean, consistent, and suitable for modelling.

4 Methods

All models were implemented using the scikit-learn library (Pedregosa et al. (2011)), and Altair (VanderPlas (2018)) was used for visualizations. The analysis makes use of NumPy (Harris et al. (2020)) for numerical computations.

5 Exploratory Data Analysis

This section summarizes the data structure and important statistical properties that are related to the features of this wine classification project.

5.1 Data Structure

From Table 1, there are some key observations:

  • Residual sugar, free sulfur dioxide, and total sulfur dioxide have wide ranges and right-skewed distributions with extreme outliers.
  • pH, alcohol, density, chlorides and critic acid have small standard deviations, showing relatively tight distributions.

These statistics provide insight into which variables may separate wine types well. From Table 2, 11 numerical features with no missing values, which confirms that the dataset is clean and ready for analysis.

Table 1: Summary statistics for all numerical features in the wine dataset.
Unnamed: 0 fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol target
count 4256 4256 4256 4256 4256 4256 4256 4256 4256 4256 4256 4256
mean 7.19754 0.345301 0.314866 5.06326 0.0567735 29.9641 113.751 0.994537 3.22858 0.533609 10.5558 0.742716
std 1.30927 0.168979 0.146301 4.5334 0.0364303 17.7682 56.7417 0.00296598 0.161483 0.148742 1.18807 0.437188
min 3.8 0.08 0 0.6 0.012 1 6 0.98711 2.72 0.22 8 0
25% 6.4 0.23 0.24 1.8 0.038 16 73 0.99226 3.12 0.43 9.5 0
50% 7 0.3 0.31 2.7 0.047 28 116 0.994685 3.22 0.51 10.4 1
75% 7.7 0.41 0.39 7.5 0.066 41 154 0.99675 3.33 0.6 11.4 1
max 15.9 1.58 1.66 65.8 0.611 289 440 1.03898 4.01 2 14.9 1
Table 2: Summary of dataset structure, including feature names, entries, null values, and data types.
Features Total Entries Null Values Data Type
target 4256 0 int64
alcohol 4256 0 float64
chlorides 4256 0 float64
citric_acid 4256 0 float64
density 4256 0 float64
fixed_acidity 4256 0 float64
free_sulfur_dioxide 4256 0 float64
pH 4256 0 float64
residual_sugar 4256 0 float64
sulphates 4256 0 float64
total_sulfur_dioxide 4256 0 float64
volatile_acidity 4256 0 float64

5.2 Overall Univariate Distributions

Figure 1: Histograms showing the univariate distributions of all numeric chemical features in the wine dataset. Each subplot illustrates the overall shape, skewness, and spread of a single variable.

We can see from Figure 1, many features (e.g. residual sugar, chlorides, sulphates) show right-skewness, which means most wines fall in lower ranges with a few extreme values, whereas alchohol is slightly right-skewed, the distribution of density and pH are similar to Normal distribution.

5.3 Distributions of Features by Wine Type

Figure 2: Kernel density estimates of each numerical feature, comparing distributions of red (0) and white (1) wines.

From Figure 2, we can see differences between red (0) and white (1) wine:

  • Overall, most features for both wine types show strongly right-skewed distributions. The distribution of pH looks close to bell-shaped. For pH, the red wines have higher pH than white wines, so the whites are more acidic.
  • For fixed_acidity, volatile_acidity, citric_acid, chlorides, and sulphates, the white wines have higher density at the lower end of the scale, so whites here look “lighter” and cleaner on these chemistry dimensions.
  • Alcohol levels tend to be higher in red wines, while white wines peak at slightly lower alcohol values. Combined with their higher pH, this suggests that whites are lighter but sweeter and more acidic.

5.4 Pairwise Correlations

Figure 3: Correlation heatmap showing pairwise relationships among the wine chemical features.

From Figure 3, we can observe some overall correlation patterns:

  • Density is strongly positively correlated with residual sugar, and free sulfur dioxide is strongly correlated with total sulfur dioxide.
  • Alcohol is negatively correlated with density, consistent with wine chemistry.
  • pH is negatively correlated with fixed acidity, showing the expected inverse relationship.

6 Modeling

6.1 Data Splitting

First, we splited the data into training and testing sets. Meanwhile, the wine_type column has been encoded into the binary target variable and was therefore removed. Then, the resulting dataset was separated into feature dataframe and the target vector for use in modeling the steps.

6.2 Pipeline

Since all predictors are numerical and models such as KNN and SVMs are distance-based, StandardScaler from sklearn was applied to all those features so as to ensure all features have equal weights in the model.

6.3 Model Comparison

Overall, from Table 3, most classifiers achieved strong predictive performance, except for the DummyClassifier, which is expected, since it is a baseline that does not use feature information.

Table 3: Cross-validated performance metrics for all evaluated classification models, including mean fit time, score time, training accuracy, and test accuracy with standard deviations across folds.
model fit_time score_time test_score train_score
dummy 0.0005 (+/- 0.0002) 0.0004 (+/- 0.0001) 0.7427 (+/- 0.0001) 0.7427 (+/- 0.0000)
Decision Tree 0.0124 (+/- 0.0002) 0.0008 (+/- 0.0002) 0.9810 (+/- 0.0061) 0.9998 (+/- 0.0002)
KNN 0.0021 (+/- 0.0001) 0.0105 (+/- 0.0003) 0.9930 (+/- 0.0037) 0.9947 (+/- 0.0009)
RBF SVM 0.0170 (+/- 0.0012) 0.0059 (+/- 0.0004) 0.9958 (+/- 0.0032) 0.9969 (+/- 0.0008)
Logistic Regression 0.0033 (+/- 0.0008) 0.0007 (+/- 0.0000) 0.9925 (+/- 0.0034) 0.9937 (+/- 0.0006)

Among all those evaluated models, the RBF-SVM achieved the highest validation performance and showed the smallest variability across folds, with mean CV train score of 0.9969, and test score of 0.9958, indicating a strong balance between accuracy and stability. This suggests that non-linear decision boundaries may better capture the structure of the one classification compared to linear methods.

7 Results

7.1 Confusion Matrix

To further evaluate model behavior, we examined the confusion matrices for each classifier. The matrix provided a detailed view of how well each model distinguishes between red wine (label = 0) and white wine (label = 1). Overall, we especially focus on the confusion matrix for RBF SVM model, and all models—except the DummyClassifier—achieve very strong separation between the two classes:

  • Decision Tree, KNN, RBF SVM, and Logistic Regression all show very low wrong counts for both classes.
  • RBF SVM performs especially well, from Figure 4 we can see that it produced near-perfect predictions with only a small number of false positives and false negatives.
Figure 4: Confusion matrix for the RBF SVM model.

These results show that those models are highly capable of distinguishing between wine types, with RBF SVM providing the most consistent and accurate classification.

7.2 Test Scoring

We next evaluated each model on the test set using accuracy, recall, precision, and F1 score. These metrics allow us to assess the generalization performance of each classifier beyond the training data and the result is shown in Table 4 below.

Table 4: Performance metrics (Accuracy, Recall, Precision, F1) for the best performing model (RBF SVM) on the test set.
model accuracy recall precision f1
dummy 0.75188 1 0.75188 0.858369
Decision Tree 0.976504 0.9875 0.981366 0.984424
KNN 0.986842 0.99125 0.99125 0.99125
RBF SVM 0.993421 0.99625 0.995006 0.995628
Logistic Regression 0.989662 0.9925 0.993742 0.993121

The results show a clear trend:

  • The DummyClassifier performs the worst, with accuracy around 0.7519 and substantially lower precision and F1.
  • Decision Tree, KNN, and Logistic Regression all perform well, achieving accuracy scores above 0.9765 and balanced precision/recall values.
  • RBF SVM outperforms all other models, achieving the highest scores in accuracy, recall, precision, and F1—demonstrating excellent predictive capability and strong generalization.

8 Discussion

We found that all of the models that we tested preformed much better than our Dummy Classifier. Subsequently, comparing our models individually, each one performed well. However, the model that had the highest cross-validation score was RBF SVM at 0.9958. Since all models scored high (>0.99), we included them in calculating accuracy, recall, precision and f1 on our test data. Here again, we see RBF SVM performing the best with the highest scores in each one of these metrics (0.9934, 0.9962, 0.995, and 0.9956 respectively).

The metric of focus for our analysis was accuracy. Our primary aim was to ensure that we were correctly predicting whether or not a wine was red or white. Although we were looking towards accuracy, we can also see from our confusion matrices that the test recall and precision scores were high resulting in the lowest false positives and false negatives. While our model was not being explicitly optimized regarding precision and recall, having these relatively small incorrect predictions from our training data and higher test scores adds to the robustness and our confidence of the model.

All four of our potential models exceeded our expectations. We expected them to do well, however, were pleased with the final results and scoring. There were some key features that led us to expect at least one well performing model. These were highlighted during our Exploratory Data Analysis and can be seen in our Distributions of Different Features by Wine Type and Pairwise Correlations visualizations. By analyzing the various density plots, we were able to see that there were several noticeable differences between red and white wines for several different features. There appeared to be several distributions with varying levels of overlap that increased our confidence that any potential model would be able to correctly predict red or white with a reasonable level of accuracy. For example, we saw different distributions for the volatile acidity and pH features. Our pairwise correlations showed a varying level of correlations among all the different features. This was an interesting part of the analysis as highly correlated free sulfur dioxide and sulfates made sense, while the negatively correlated alcohol and density got us to think deeper into the chemical nature of the wine.

These findings allow us to present a model that we can offer to winemakers and industry stakeholders to use in their business needs. Without having to spend the resources and time to create new wine products, a business can focus on the important features and from there decide on what wine to produce. For example, if a business was interested in making and marketing a wine that was low sugar and low in acidity (along with the other features), our model would help determine the type that should be produced. This is in addition to using the model to predict what type of wine an already produced wine is.

Our current analysis was only for the classification between two different types of wine: red and white. It would be interesting to see if we could expand this to include additional types of wine. We could look at including quality data for rose and sparkling wines. Additionally, if this new model worked well, it would be interesting to see if we could further generalize and make an all-purpose model that could be used to predict different types of alcohols based solely on chemical quality composition. This could possibly be useful, not only in the liquor industry, but in tangential industries that rely on models to predict unknown substances from a sample. (For example, forensics comes to mind.)

9 References

Cortez, Cerdeira, Paulo, and J. Reis. 2009. Wine Quality.” UCI Machine Learning Repository.
Cortez, Paulo, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4): 547–53. https://doi.org/https://doi.org/10.1016/j.dss.2009.05.016.
Harris, Charles R, K Jarrod Millman, Stéfan J van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, et al. 2020. Array programming with NumPy.” Nature 585 (7825): 357–62. https://doi.org/10.1038/s41586-020-2649-2.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.
VanderPlas, Jake. 2018. “Altair: Interactive Statistical Visualizations for Python.” Journal of Open Source Software 3 (7825, 32): 1057. https://doi.org/10.21105/joss.01057.