How To Import And Clean Wine Dataset
In this post we explore the wine dataset. First, we perform a descriptive and exploratory data analysis. Next, we run dimensionality reduction with the PCA and TSNE algorithms in order to check how they perform. Finally, a random forest classifier is implemented, comparing different parameter values in order to check how they affect the classifier results.
In [43]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
%matplotlib inline
Load the dataset
In [44]:
#Let's import the data from sklearn
from sklearn.datasets import load_wine
wine = load_wine()
#Convert to a pandas dataframe
data = pd.DataFrame(data=np.c_[wine['data'], wine['target']],
                    columns=wine['feature_names'] + ['target'])
#Check the data with the info function
data.info()
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
 13  target                        178 non-null    float64
dtypes: float64(14)
memory usage: 19.6 KB
In [45]:
# Search for missing, NA and null values.
(data.isnull() | data.isna()).sum()
Out[45]:
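As a compact sanity check (a sketch, not a cell from the original notebook), the per-column null counts can be collapsed into a single total:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
data = pd.DataFrame(np.c_[wine['data'], wine['target']],
                    columns=wine['feature_names'] + ['target'])

# Total count of missing cells across the whole frame.
total_missing = int(data.isna().sum().sum())
print(total_missing)  # 0 for this dataset
```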
Data analysis
Basic statistical analysis
In [46]:
#Let's see the frequency of the target variable.
#Convert the variable to categorical.
data.target = data.target.astype('int64').astype('category')
#Frequency.
freq = data['target'].value_counts()
freq
Out[46]:
In [47]:
#Let's check graphically.
freq.plot(kind='bar')
In [48]:
#Let's show a summary of the dataset where we can see
#the basic statistical information.
data.describe()
Out[48]:
In [49]:
#Let's show the histograms of the variables alcohol, magnesium and color_intensity.
data[['alcohol', 'magnesium', 'color_intensity']].hist()
Out[49]:
Analysis: comments on the results.
In the previous points we see that all the variables in the dataset, except the target variable, are continuous numerical. There are no missing values in any of the variables. From the basic statistical values we can see that none of the variables is standardized, since none has mean 0 and standard deviation 1. In the histograms we can observe that the alcohol variable has a more or less centered distribution, with most of the records having values between 12 and 14 degrees; as for color_intensity and magnesium, their distributions are skewed, with most of their mass toward the left and a long right tail.
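That visual impression of skew can be quantified; as a sketch (not a cell from the original notebook), pandas' `skew()` returns positive values for distributions with a long right tail:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
data = pd.DataFrame(np.c_[wine['data'], wine['target']],
                    columns=wine['feature_names'] + ['target'])

# Positive skew = long right tail; values near 0 = roughly symmetric.
print(data[['alcohol', 'magnesium', 'color_intensity']].skew())
```

Alcohol comes out close to symmetric, while magnesium and color_intensity show clearly positive skew, matching the histograms.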
Exploratory analysis
In [50]:
feats_to_explore = ['alcohol', 'magnesium', 'color_intensity']
In [51]:
# Alcohol histograms by wine type.
x1 = data.loc[data.target == 0, 'alcohol']
x2 = data.loc[data.target == 1, 'alcohol']
x3 = data.loc[data.target == 2, 'alcohol']
kwargs = dict(alpha=0.3, bins=25)
plt.hist(x1, **kwargs, color='g', label='Type 0')
plt.hist(x2, **kwargs, color='b', label='Type 1')
plt.hist(x3, **kwargs, color='r', label='Type 2')
plt.gca().set(title='Alcohol frequency by wine type', ylabel='Frequency')
plt.legend();
In [52]:
#Color_intensity histograms
x1 = data.loc[data.target == 0, 'color_intensity']
x2 = data.loc[data.target == 1, 'color_intensity']
x3 = data.loc[data.target == 2, 'color_intensity']
kwargs = dict(alpha=0.3, bins=25)
plt.hist(x1, **kwargs, color='g', label='Type 0')
plt.hist(x2, **kwargs, color='b', label='Type 1')
plt.hist(x3, **kwargs, color='r', label='Type 2')
plt.gca().set(title='Color intensity frequency by wine type', ylabel='Frequency')
plt.legend();
In [53]:
#Magnesium histograms
x1 = data.loc[data.target == 0, 'magnesium']
x2 = data.loc[data.target == 1, 'magnesium']
x3 = data.loc[data.target == 2, 'magnesium']
kwargs = dict(alpha=0.3, bins=25)
plt.hist(x1, **kwargs, color='g', label='Type 0')
plt.hist(x2, **kwargs, color='b', label='Type 1')
plt.hist(x3, **kwargs, color='r', label='Type 2')
plt.gca().set(title='Magnesium frequency by wine type', ylabel='Frequency')
plt.legend();
We can observe that the variable that best defines the type of wine is alcohol, since according to the graph the wine types show the least overlap along the alcohol axis; we see how types 0 and 1 are well differentiated in some ranges. As for color intensity, it would also allow us to obtain a classification, although perhaps a greater overlap of the graphs is observed. Magnesium seems to be the variable that least defines the type of wine, since its histograms overlap over almost the whole range.
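The overlap argument can also be checked numerically; as a sketch (not in the original notebook), per-class means show how far apart the wine types sit on each variable:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
data = pd.DataFrame(np.c_[wine['data'], wine['target']],
                    columns=wine['feature_names'] + ['target'])
data['target'] = data['target'].astype(int)

# Per-class means: well-separated class means hint at discriminative power.
means = data.groupby('target')[['alcohol', 'color_intensity', 'magnesium']].mean()
print(means.round(2))
```

Type 0 has the highest mean alcohol and type 1 the lowest, while for color intensity type 2 sits clearly above the other two, consistent with the histograms.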
Let's repeat the graphs, this time showing the mean and the standard deviation.
In [54]:
#Alcohol histograms with the mean and the standard deviation.
x1 = data.loc[data.target == 0, 'alcohol']
x2 = data.loc[data.target == 1, 'alcohol']
x3 = data.loc[data.target == 2, 'alcohol']
kwargs = dict(alpha=0.3, bins=25)
plt.hist(x1, **kwargs, color='g', label='Type 0' + " {:6.2f}".format(x1.std()))
plt.hist(x2, **kwargs, color='b', label='Type 1' + " {:6.2f}".format(x2.std()))
plt.hist(x3, **kwargs, color='r', label='Type 2' + " {:6.2f}".format(x3.std()))
plt.gca().set(title='Alcohol frequency by wine type', ylabel='Frequency')
plt.axvline(x1.mean(), color='g', linestyle='dashed', linewidth=1)
plt.axvline(x2.mean(), color='b', linestyle='dashed', linewidth=1)
plt.axvline(x3.mean(), color='r', linestyle='dashed', linewidth=1)
plt.legend();
In [55]:
#Color_intensity histograms with the mean and the standard deviation.
x1 = data.loc[data.target == 0, 'color_intensity']
x2 = data.loc[data.target == 1, 'color_intensity']
x3 = data.loc[data.target == 2, 'color_intensity']
kwargs = dict(alpha=0.3, bins=25)
plt.hist(x1, **kwargs, color='g', label='Type 0' + " {:6.2f}".format(x1.std()))
plt.hist(x2, **kwargs, color='b', label='Type 1' + " {:6.2f}".format(x2.std()))
plt.hist(x3, **kwargs, color='r', label='Type 2' + " {:6.2f}".format(x3.std()))
plt.gca().set(title='Color intensity frequency by wine type', ylabel='Frequency')
plt.axvline(x1.mean(), color='g', linestyle='dashed', linewidth=1)
plt.axvline(x2.mean(), color='b', linestyle='dashed', linewidth=1)
plt.axvline(x3.mean(), color='r', linestyle='dashed', linewidth=1)
plt.legend();
In [56]:
#Magnesium histograms with the mean and the standard deviation.
x1 = data.loc[data.target == 0, 'magnesium']
x2 = data.loc[data.target == 1, 'magnesium']
x3 = data.loc[data.target == 2, 'magnesium']
kwargs = dict(alpha=0.3, bins=25)
plt.hist(x1, **kwargs, color='g', label='Type 0' + " {:6.2f}".format(x1.std()))
plt.hist(x2, **kwargs, color='b', label='Type 1' + " {:6.2f}".format(x2.std()))
plt.hist(x3, **kwargs, color='r', label='Type 2' + " {:6.2f}".format(x3.std()))
plt.gca().set(title='Magnesium frequency by wine type', ylabel='Frequency')
plt.axvline(x1.mean(), color='g', linestyle='dashed', linewidth=1)
plt.axvline(x2.mean(), color='b', linestyle='dashed', linewidth=1)
plt.axvline(x3.mean(), color='r', linestyle='dashed', linewidth=1)
plt.legend();
Let's check the correlation among the variables and show scatterplots.
In [57]:
#Correlation table
df = data[['alcohol', 'magnesium', 'color_intensity']]
df.corr()
Out[57]:
In [58]:
#Scatter plots
df = data[['alcohol', 'magnesium', 'color_intensity', 'target']]
sns.pairplot(df, hue='target')
We can see that the correlation of alcohol with magnesium is low, 0.27, which shows in the low directionality of the points in that graph. We can also observe very little directionality in the plot of magnesium with color_intensity, which corresponds with the very low correlation index found previously (0.19). On the other hand, the correlation of alcohol with color_intensity is the highest of all (0.54), as can also be seen in the greater directionality of its scatter plot, although it is still not a high correlation.
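As a quick sketch to back the quoted coefficients (not a cell from the original notebook), the correlation matrix can be recomputed from scratch:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
data = pd.DataFrame(np.c_[wine['data'], wine['target']],
                    columns=wine['feature_names'] + ['target'])

# Pearson correlation among the three explored features.
corr = data[['alcohol', 'magnesium', 'color_intensity']].corr()
print(corr.round(2))
```

The alcohol/color_intensity pair should come out as the strongest of the three, with both magnesium pairs clearly weaker.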
Dimensionality reduction
Let's apply dimensionality reduction in order to reduce the data to two dimensions. We will use two different functions (PCA and TSNE) to check which of them yields better results.
In [59]:
#Import StandardScaler
from sklearn.preprocessing import StandardScaler
#Remove the target column.
x = data.loc[:, data.columns != 'target'].values
y = data.loc[:, ['target']].values
#Scale the data
x = pd.DataFrame(StandardScaler().fit_transform(x))
y = pd.DataFrame(y)
# Create the PCA object.
pca = PCA(n_components=2)
#Run PCA.
pComp = pca.fit_transform(x)
principalDf = pd.DataFrame(data=pComp, columns=['PC 1', 'PC 2'])
principalDf.head()
Out[59]:
In [60]:
# Join the target variable again
finalDf = pd.concat([principalDf, data[['target']]], axis=1)
finalDf.head()
Out[60]:
In [61]:
# Show the graphic.
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)
ax.set_title('PCA', fontsize=20)
targets = [0.0, 1.0, 2.0]
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'PC 1'],
               finalDf.loc[indicesToKeep, 'PC 2'],
               c=color, s=50)
ax.legend(targets)
ax.grid()
Let's apply TSNE to reduce dimensionality. This algorithm tries to minimize the divergence between the distribution of the pairwise similarities of the original objects and the same distribution in the low-dimensional data.
In [62]:
#Use the same variables as in the previous point, they are already standardized.
# Create the TSNE object and run it.
X_embedded = TSNE(n_components=2, perplexity=15, random_state=42).fit_transform(x)
tsneDf = pd.DataFrame(data=X_embedded, columns=['PC 1', 'PC 2'])
tsneDf.head()
Out[62]:
In [63]:
# Join the target variable
ftnseDf = pd.concat([tsneDf, data[['target']]], axis=1)
ftnseDf.head()
Out[63]:
In [64]:
# Show the graphic.
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)
ax.set_title('TSNE', fontsize=25)
targets = [0.0, 1.0, 2.0]
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = ftnseDf['target'] == target
    ax.scatter(ftnseDf.loc[indicesToKeep, 'PC 1'],
               ftnseDf.loc[indicesToKeep, 'PC 2'],
               c=color, s=50)
ax.legend(targets)
ax.grid()
It seems that the dimensionality reduction has worked well, since the classes are well separated by both methods. Both methods show a clear separation between the classes; perhaps PCA shows this separation more clearly, with less scatter and mixing at the group boundaries. The different results are due to the fact that the methods work differently: while PCA explains the variance of the data when projected onto an axis and looks for the components that explain the most variance, TSNE uses probability distributions to look for a similarity between the reduced-dimension space and the original-dimension space.
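The "explains the most variance" view of PCA can be inspected directly; a sketch (not a cell from the original notebook) using `explained_variance_ratio_`, which for the standardized wine data sums to roughly 0.55 over the first two components:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit a 2-component PCA as in the notebook.
X = StandardScaler().fit_transform(load_wine()['data'])
pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

So the 2D PCA plot shows only about half of the variance in the data, which is worth keeping in mind when reading the scatter plots.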
Predictions
In this last exercise we apply a supervised learning method, specifically the Random Forest classifier, to predict the class to which each wine belongs and to evaluate the accuracy obtained with the model.
Let's start applying a random forest: first with the original data, then with the reduced datasets (PCA and TSNE) to compare the results.
In [65]:
#Let's split the dataset using the scaled data
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
X_train.shape
In [67]:
#Create the classifier.
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(X_train, y_train.values.ravel())
Out[67]:
In [68]:
#Apply cross-validation to evaluate the results.
scores = cross_val_score(clf, X_train, y_train.values.ravel(), cv=5)
scores
Out[68]:
In [69]:
#Calculate the mean and the standard deviation of the validation scores
print("Mean: %0.2f ; Standard Dev.: %0.2f" % (scores.mean(), scores.std()))
Mean: 0.94 ; Standard Dev.: 0.03
Let's run the classifier with the PCA-reduced data
In [70]:
#Apply PCA.
# Create the PCA object.
pca = PCA(n_components=2)
#Fit PCA on the training data and transform it.
pComp = pca.fit_transform(X_train)
principalDf = pd.DataFrame(data=pComp, columns=['PC 1', 'PC 2'])
principalDf.head()
Out[70]:
In [71]:
#Create the classifier
pcaclf = RandomForestClassifier(n_estimators=10, random_state=42)
pcaclf.fit(principalDf, y_train.values.ravel())
Out[71]:
In [72]:
#Apply cross-validation
scores = cross_val_score(pcaclf, principalDf, y_train.values.ravel(), cv=5)
scores
Out[72]:
In [73]:
#Mean and standard deviation of the validation scores.
print("Mean: %0.2f ; Standard dev.: %0.2f" % (scores.mean(), scores.std()))
Mean: 0.96 ; Standard dev.: 0.05
Let's run the classifier with the TSNE-reduced data
In [74]:
#Run TSNE.
X_embedded = TSNE(n_components=2, perplexity=15).fit_transform(X_train)
tsneDf = pd.DataFrame(data=X_embedded, columns=['PC 1', 'PC 2'])
tsneDf.head()
Out[74]:
In [75]:
#Create the classifier
tclf = RandomForestClassifier(n_estimators=10, random_state=42)
tclf.fit(tsneDf, y_train.values.ravel())
Out[75]:
In [76]:
#Use cross-validation
scores = cross_val_score(tclf, tsneDf, y_train.values.ravel(), cv=5)
scores
Out[76]:
In [77]:
#Calculate the mean and standard deviation of the validation scores
print("Mean: %0.2f ; Standard dev.: %0.2f" % (scores.mean(), scores.std()))
Mean: 0.95 ; Standard dev.: 0.03
In the case shown, both PCA and TSNE show an improvement in the model, and both behave in a similar way, which is consistent with the graphs of exercise 3. It should be noted that this result has been obtained by repeatedly executing the TSNE algorithm, since it contains a random component, as do the random forests. The random_state parameter is used to be able to repeat the results across the different executions of the algorithm.
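A minimal sketch of that reproducibility point (not part of the original notebook): two TSNE runs with the same random_state produce the same embedding, whereas omitting the seed generally does not.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine()['data'])

# Same data, same parameters, same seed: the embeddings match exactly.
a = TSNE(n_components=2, perplexity=15, random_state=42).fit_transform(X)
b = TSNE(n_components=2, perplexity=15, random_state=42).fit_transform(X)
print(np.allclose(a, b))
```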
Let's predict with the PCA information.
In [79]:
#Let's transform the test data
PCA_test = pca.transform(X_test)
pcaTestDf = pd.DataFrame(data=PCA_test, columns=['PC 1', 'PC 2'])
pcaTestDf.shape
In [80]:
prediction = pcaclf.predict(pcaTestDf)
prediction
Out[80]:
In [81]:
#Accuracy metric.
acc_score = accuracy_score(y_test, prediction)
acc_score
In [82]:
#We get 98% accuracy, let's see the confusion matrix.
conf_matrix = confusion_matrix(y_test, prediction)
conf_matrix
Out[82]:
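The overall accuracy can be recovered from the confusion matrix diagonal; a sketch using a hypothetical 3-class matrix (the values below are illustrative, not the notebook's actual output):

```python
import numpy as np

# Hypothetical confusion matrix for a 3-class problem
# (rows: true class, columns: predicted class).
cm = np.array([[23, 0, 0],
               [1, 19, 0],
               [0, 0, 16]])

overall_acc = np.trace(cm) / cm.sum()          # correct predictions / total
per_class_acc = np.diag(cm) / cm.sum(axis=1)   # recall for each true class
print(overall_acc, per_class_acc)
```

With this matrix, 58 of 59 samples are on the diagonal, giving roughly 98% accuracy; the per-class values show exactly which class absorbs the errors.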
Next we are going to test the n_estimators, max_depth and min_samples_split parameters with different values, to clearly see their purpose and effect on the results. We are going to save all the prediction accuracy results on the train and test data and show a graph. To see the improvement more clearly we will test on the dataset without dimensionality reduction, since it is not the best model, so we can check how much the model improves with each parameter.
n_estimators: This parameter represents the number of trees used in the model. The first graph shows its effect: we can clearly see how the effectiveness of the model on new cases goes up to 16 trees, where it reaches its maximum; a higher number of trees does not yield an improvement of the model.
max_depth: It represents the depth of the trees in the model, i.e. the number of levels of each tree. In the example we show its effect with 4 trees (n_estimators); it can be observed how, from a certain depth on, overfitting occurs and the model does not generalize to new data.
min_samples_split: This parameter defines the minimum number of samples required to split a node. A larger value further restricts the tree by forcing it to use more data before splitting. In this case we use the parameter with percentages; we see how with values up to 40% the model stays above 90% effectiveness, and from there its effectiveness drops quite a lot.
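A detail worth noting: when min_samples_split is a float, scikit-learn treats it as a fraction and uses ceil(min_samples_split * n_samples) as the effective threshold. A quick sketch of that arithmetic (n_samples = 119 is an assumption matching the 0.33 test split on 178 rows used above):

```python
import math

n_samples = 119  # assumed training-split size (178 rows, test_size=0.33)
for frac in (0.1, 0.4, 1.0):
    # A float threshold of `frac` means: only split nodes that hold
    # at least ceil(frac * n_samples) samples.
    print(frac, math.ceil(frac * n_samples))
```

So 0.4 translates to requiring 48 samples in a node before it may split, which explains why large fractions quickly choke the trees.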
In [83]:
#Let's start with n_estimators
from matplotlib.legend_handler import HandlerLine2D
n_estimators = [1, 2, 4, 8, 16, 32, 64, 100, 200]
train_results = []
test_results = []
#Save accuracy data in arrays in order to show the graphic.
for estimator in n_estimators:
    clf = RandomForestClassifier(n_estimators=estimator, random_state=42)
    clf.fit(X_train, y_train.values.ravel())
    pred_train = clf.predict(X_train)
    acc_score_train = accuracy_score(y_train, pred_train)
    train_results.append(acc_score_train)
    pred_test = clf.predict(X_test)
    acc_score_test = accuracy_score(y_test, pred_test)
    test_results.append(acc_score_test)
line1, = plt.plot(n_estimators, train_results, 'b', label='Train accuracy')
line2, = plt.plot(n_estimators, test_results, 'r', label='Test accuracy')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('Accuracy')
plt.xlabel('n_estimators')
plt.show()
In [89]:
#Continue with max_depth
max_depths = np.linspace(1, 32, 32, endpoint=True)
train_results = []
test_results = []
#Save accuracy data in arrays in order to show the graphic
for max_depth in max_depths:
    clf = RandomForestClassifier(n_estimators=4, max_depth=max_depth, random_state=42)
    clf.fit(X_train, y_train.values.ravel())
    pred_train = clf.predict(X_train)
    acc_score_train = accuracy_score(y_train, pred_train)
    train_results.append(acc_score_train)
    pred_test = clf.predict(X_test)
    acc_score_test = accuracy_score(y_test, pred_test)
    test_results.append(acc_score_test)
line1, = plt.plot(max_depths, train_results, 'b', label='Train accuracy')
line2, = plt.plot(max_depths, test_results, 'r', label='Test accuracy')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('Accuracy')
plt.xlabel('max_depths')
plt.show()
In [87]:
#Finally, min_samples_split
min_samples_splits = np.linspace(0.1, 1.0, 10, endpoint=True)
test_results = []
train_results = []
#Save accuracy data in arrays in order to show the graphic
for min_samples_split in min_samples_splits:
    clf = RandomForestClassifier(n_estimators=4, max_depth=2,
                                 min_samples_split=min_samples_split, random_state=42)
    clf.fit(X_train, y_train.values.ravel())
    pred_train = clf.predict(X_train)
    acc_score_train = accuracy_score(y_train, pred_train)
    train_results.append(acc_score_train)
    pred_test = clf.predict(X_test)
    acc_score_test = accuracy_score(y_test, pred_test)
    test_results.append(acc_score_test)
line1, = plt.plot(min_samples_splits, train_results, 'b', label='Train accuracy')
line2, = plt.plot(min_samples_splits, test_results, 'r', label='Test accuracy')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('Accuracy')
plt.xlabel('min_samples_splits')
plt.show()
Source: https://www.alldatascience.com/classification/wine-dataset-analysis-with-python/