Notebook 1 : Getting familiar with scikit-learn

Notebook prepared by Chloé-Agathe Azencott, modified by Victor Laigle and by Giann Karlo.

This notebook will allow you to discover scikit-learn functionalities to:

# Load packages numpy, pandas and matplotlib (with aliases np, pd and plt respectively)
%matplotlib inline 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rc('font', **{'size': 12}) # set the global font size for plots (in pt)

1. Data Loading

In this notebook, we will work with the data contained in the data/auto-mpg.tsv file. This data, obtained from https://archive.ics.uci.edu/ml/datasets/Auto+MPG, describes cars using the following variables:

1. mpg:           continuous (consumption in miles per gallon)
2. cylinders:     discrete (number of cylinders)
3. displacement:  continuous (air volume moved by the pistons in the engine)
4. horsepower:    continuous
5. weight:        continuous
6. acceleration:  continuous
7. model year:    discrete
8. origin:        discrete (region of origin, 1=North America, 2=Europe, 3=Asia)
9. car name:      string

Our goal is to predict the consumption of a vehicle (mpg) from the other variables (excluding the car name, which is a unique identifier).

We will start by downloading the data from the github repository to your current working directory, and load the data into a pandas dataframe.

If you already downloaded the data, you can use the alternative suggested in the next cell.

# Download the data to the current directory
!wget https://raw.githubusercontent.com/CBIO-mines/fml-dassault-systems-en/main/data/auto-mpg.tsv

# Load the data
df = pd.read_csv("auto-mpg.tsv", delimiter='\t')
--2026-05-10 14:21:22--  https://raw.githubusercontent.com/CBIO-mines/fml-dassault-systems-en/main/data/auto-mpg.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17484 (17K) [text/plain]
Saving to: ‘auto-mpg.tsv’

auto-mpg.tsv        100%[===================>]  17.07K  --.-KB/s    in 0.001s  

2026-05-10 14:21:23 (18.0 MB/s) - ‘auto-mpg.tsv’ saved [17484/17484]

Alternatively : If you’re not on Colab and have already downloaded the file to a data folder, uncomment and run the following cell:

# df = pd.read_csv("data/auto-mpg.tsv", delimiter='\t')
# Check what the data looks like with the first 10 rows of the dataframe
df.head(10)
mpg cylinders displacement horsepower weight acceleration model year origin car name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
5 15.0 8 429.0 198 4341 10.0 70 1 ford galaxie 500
6 14.0 8 454.0 220 4354 9.0 70 1 chevrolet impala
7 14.0 8 440.0 215 4312 8.5 70 1 plymouth fury iii
8 14.0 8 455.0 225 4425 10.0 70 1 pontiac catalina
9 15.0 8 390.0 190 3850 8.5 70 1 amc ambassador dpl

Create data matrices X and y

  • X: predictive variables
  • y: true values (“ground truth”)
X = np.array(df.drop(columns=['mpg', 'car name']))
y = np.array(df['mpg'])
X.shape
(392, 7)
y.shape
(392,)

2. Data Visualization

We will now visualize the variables representing our vehicles. To do this, we will separate the continuous variables (which we will represent each with a histogram) from the discrete variables (which we will represent with bar charts).

Feel free to adjust the parameters of the matplotlib methods to produce more readable plots.

# Define continuous and discrete features
continuous_features = ['displacement', 'horsepower', 'weight', 'acceleration']
discrete_features = ['cylinders', 'model year', 'origin']

# Get feature names and their indices in the dataframe
features = list(df.drop(columns=['mpg', 'car name']).columns)

continuous_features_idx = [features.index(feat_name) for feat_name in continuous_features]
discrete_features_idx = [features.index(feat_name) for feat_name in discrete_features]

Histograms for the continuous variables

fig = plt.figure(figsize=(8, 6))

for (plot_idx, feat_idx) in enumerate(continuous_features_idx):
    # Create a graph at position (plot_idx+1) of a 2x2 grid
    ax = fig.add_subplot(2, 2, (plot_idx+1))
    # Display histogram for the variable at index feat_idx
    h = ax.hist(X[:, feat_idx], bins=30, edgecolor='none')
    # Use variable name as title
    ax.set_title(features[feat_idx])
# Handle spacing between graphs
fig.tight_layout(pad=1.0)
plt.show()

Bar charts for the discrete variables

fig = plt.figure(figsize=(12, 3))

for (plot_idx, feat_idx) in enumerate(discrete_features_idx):
    # Create a graph at position (plot_idx+1) of a 1x3 grid
    ax = fig.add_subplot(1, 3, (plot_idx+1))

    # Compute frequencies of each value of the variable at index feat_idx
    feature_values = np.unique(X[:, feat_idx])
    frequencies = [(float(len(np.where(X[:, feat_idx]==value)[0]))/X.shape[0]) \
                   for value in feature_values]
    
    # Display bar chart for the variable at index feat_idx
    b = ax.bar(range(len(feature_values)), frequencies, width=0.5, 
               tick_label=list([int(n) for n in feature_values]))
    
    # Set variable name as title
    ax.set_title(features[feat_idx])
fig.tight_layout(pad=1.0)
plt.show()

Question: Observe the orders of magnitude / value ranges of the different variables. What can you comment about them ?

Based on the visualizations of the continuous and discrete variables, we can observe significant differences in their orders of magnitude and value ranges:

  • Continuous variables:
    • displacement is in the hundreds.
    • horsepower ranges from tens to hundreds.
    • weight is in the thousands.
    • acceleration is in the tens.
  • Discrete variables:
    • cylinders takes small integer values (e.g., 4, 6, 8).
    • model year takes values in the 70s and 80s.
    • origin takes values 1, 2, or 3.

This means that the variables are on very different scales. For instance, weight values are typically much larger than acceleration or origin values. This difference in scale can impact machine learning models, as variables with larger magnitudes might inadvertently dominate the learning process. It also makes direct comparison of coefficients in models like linear regression difficult without proper scaling.

Histogram of the labels

plt.hist(y, bins=30, edgecolor='none')
plt.title('mpg')
plt.show()

3. Linear Regression

We will now use scikit-learn to train a linear regression on the data.

The linear models in scikit-learn are implemented in the sklearn.linear_model module.

Feel free to refer frequently to the scikit-learn documentation, which is very comprehensive.

The framework, which is very common, is the following: - initialize a model object - train the model on the data - use the model to predict new values

from sklearn import linear_model

Model training

# Initialize a LinearRegression object
predictor = linear_model.LinearRegression()
# Train this object on the data
predictor.fit(X, y)
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Predictions

We can now use this model to predict labels from the variables. Particularly, it’s possible to apply it to the data we’ve just used to train it:

y_pred = predictor.predict(X)

CAUTION In practice, what we are really interested in is a model’s ability to make good predictions on data that was not used to train it. A model’s performance on the data used to train it does not allow to determine whether it is a good model. We will discuss this in more detail later in the course.

Performance

We would now like to evaluate our model.

To do this, we will use the functionalities of the metrics module from scikit-learn.

As this is a regression problem, we will use the RMSE (Root Mean Squared Error) as a measure of the model’s performance: this is the square root of the mean of the squared errors. The square root is used for homogeneity reasons: the RMSE is expressed in the same unit as the label.

from sklearn import metrics
print("RMSE: %.2f" % metrics.root_mean_squared_error(y, y_pred))
RMSE: 3.29

Question: What do you think about this error ? Is it high ? Low ?

The RMSE (Root Mean Squared Error) of 3.29 should be evaluated in the context of the range of the target variable, ‘mpg’.

Looking at the distribution of ‘mpg’, the values generally range from around 9 to 46 mpg. An RMSE of 3.29 means that, on average, the predictions deviate from the actual values by about 3.29 mpg.

  • Is it high? It’s not extremely high, but it’s also not negligibly small. For instance, if the range of mpg was only 0-10, an RMSE of 3.29 would be considered very high. However, with a range of approximately 9-46, it represents a moderate error.
  • Is it low? It’s not extremely low, suggesting there’s room for improvement in the model’s accuracy. A lower RMSE would indicate more precise predictions.

In summary, an RMSE of 3.29 is a reasonable starting point for this dataset, indicating that the linear model has some predictive capability, but it also suggests that further optimizations or more complex models could potentially reduce this error. It’s important to note that without a baseline or domain-specific context, definitive judgment (high/low) can be subjective.

Visualization

We can also use a visual, and represent each individual from the test set by its predicted label vs. its true label.

fig = plt.figure(figsize=(5, 5))
plt.scatter(y, y_pred)

plt.xlabel("Real Consumption (mpg)")
plt.ylabel("Predicted Consumption (mpg)")
plt.title("Linear Regression")

# Same values on both axes
axis_min = np.min([np.min(y), np.min(y_pred)])-1
axis_max = np.max([np.max(y), np.max(y_pred)])+1
plt.xlim(axis_min, axis_max)
plt.ylim(axis_min, axis_max)
  
# Diagonal y=x
plt.plot([axis_min, axis_max], [axis_min, axis_max], 'k-')
plt.show()

Regression coefficients

To understand our model, we can look at the coefficients attributed to each variable in the learnt linear model.

# Display, for each variable, the absolute value of its coefficient in the model
num_features = X.shape[1]
feature_names = df.drop(columns=['mpg', 'car name']).columns
plt.scatter(range(num_features), np.abs(predictor.coef_))

plt.xlabel('Variable')
tmp = plt.xticks(range(num_features), feature_names, rotation=90)
tmp = plt.ylabel('Coefficient')
plt.show()

Question: Which variable has the highest coefficient (in absolute value) ? Do you think that it means this variable plays a very important part in the prediction ?

The variable with the highest coefficient in absolute value is origin.

Regarding whether this means origin plays a very important part in the prediction:

While origin has the largest absolute coefficient in this unscaled model, it does not necessarily mean it plays the most important part in the prediction and for the interpretations in a general sense. The fact that variables are on different scales makes the interpretation of the linear regression coefficients quite tricky. The origin variable is discrete with values 1, 2, or 3, which is a relatively small range. Its large coefficient might reflect the model trying to compensate for its categorical nature when treated as a continuous variable, or simply that its magnitude has a strong influence on mpg given its range compared to other variables.

To truly understand the relative importance of variables, especially when they are on different scales or are categorical, it’s crucial to either scale all variables or use appropriate encoding for categorical variables. A large coefficient in an unscaled model can be misleading about true predictive power.

4. Changing the scale of the variables

The fact that variables are on different scales makes the interpretation of the linear regression coefficients quite tricky.

Variables transformation

Centering (bringing to a mean of 0) and scaling (bringing to a standard deviation of 1) the variables helps to remedy this problem.

n_samples, n_features = X.shape
X_scaled_manual = np.zeros_like(X)  # Create an array with same shape as X to hold the standardized variables

# Compute the mean and standard deviation of each variable in X, and use them to standardize the variable
# Values from column i in the array X can be accessed with X[:, i]
for i in range(n_features):
    
    # Compute the mean
    mean = sum(X[:, i]) / n_samples
    # Compute the standard deviation
    variance = sum((X[:, i] - mean) ** 2) / n_samples
    std_dev = variance ** 0.5
    # Standardize the variable
    X_scaled_manual[:, i] = (X[:, i] - mean) / std_dev
print("Original data (first 5 rows and 5 columns):")
print(X[0:5, 0:5])

print("\nStandardized data (first 5 rows and 5 columns):")
print(X_scaled_manual[0:5, 0:5])
Original data (first 5 rows and 5 columns):
[[   8.   307.   130.  3504.    12. ]
 [   8.   350.   165.  3693.    11.5]
 [   8.   318.   150.  3436.    11. ]
 [   8.   304.   150.  3433.    12. ]
 [   8.   302.   140.  3449.    10.5]]

Standardized data (first 5 rows and 5 columns):
[[ 1.48394702  1.07728956  0.66413273  0.62054034 -1.285258  ]
 [ 1.48394702  1.48873169  1.57459447  0.84333403 -1.46672362]
 [ 1.48394702  1.1825422   1.18439658  0.54038176 -1.64818924]
 [ 1.48394702  1.04858429  1.18439658  0.53684535 -1.285258  ]
 [ 1.48394702  1.02944745  0.92426466  0.5557062  -1.82965485]]

Check that the means and standard deviations of your variables are indeed all set to 0 and 1 (or values very close to these, due to numeric approximations).

print(f"Means: {[f'{x:.3f}' for x in X_scaled_manual.mean(axis=0)]}")
print(f"Std: {[f'{x:.3f}' for x in X_scaled_manual.std(axis=0)]}")
Means: ['-0.000', '-0.000', '-0.000', '-0.000', '0.000', '-0.000', '0.000']
Std: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000']

Now that we’ve seen how it works, we can use the StandardScaler object of the sklearn.preprocessing module to do it for us automatically.

from sklearn import preprocessing
standard_scaler = preprocessing.StandardScaler()
standard_scaler.fit(X)
StandardScaler()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
X_scaled = standard_scaler.transform(X)

Note the use of the transform method here: the fit function is used to compute the values needed to center and scale the data in X, but it does not transform the data itself. We need to specifically ask to transform the data. Among other things, it allows to keep our original data X unchanged if we need to. There also exists a fit_transform method to do both in the same function call.

Visualization of the new variables

Histograms for continuous variables

We simply replace X by X_scaled in the code used previously.

fig = plt.figure(figsize=(8, 6))

for (plot_idx, feat_idx) in enumerate(continuous_features_idx):
    # Create a graph at position (plot_idx+1) of a 2x2 grid
    ax = fig.add_subplot(2, 2, (plot_idx+1))
    # Display histogram for the variable at index feat_idx
    h = ax.hist(X_scaled[:, feat_idx], bins=30, edgecolor='none')
    # Use variable name as title
    ax.set_title(features[feat_idx])
# Handle spacing between graphs
fig.tight_layout(pad=1.0)
plt.show()

Bar charts for discrete variables

Same here, we replace X by X_scaled in the previous code.

fig = plt.figure(figsize=(18, 4))

for (plot_idx, feat_idx) in enumerate(discrete_features_idx):
    # Create a graph at position (plot_idx+1) of a 1x3 grid
    ax = fig.add_subplot(1, 3, (plot_idx+1))

    # Compute frequencies of each value of the variable at index feat_idx
    feature_values = np.unique(X_scaled[:, feat_idx])
    frequencies = [(float(len(np.where(X_scaled[:, feat_idx]==value)[0]))/X_scaled.shape[0]) \
                   for value in feature_values]
    
    # Display bar chart for the variable at index feat_idx
    b = ax.bar(range(len(feature_values)), frequencies, width=0.5, 
               tick_label=list(['%.1f' % n for n in feature_values]))
    
    # Set variable name as title
    ax.set_title(features[feat_idx])
fig.tight_layout(pad=1.0)
plt.show()

Impact on the model

We can now train a model predictor_scaled on the centered and scaled data

# Create a new LinearRegression object
predictor_scaled = linear_model.LinearRegression()

# Train predictor_scaled on the new data
predictor_scaled.fit(X_scaled, y)
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

And create an array y_pred_scaled which contains the predictions of predictor_scaled on the data.

y_pred_scaled = predictor_scaled.predict(X_scaled)

RMSE

The RMSE of this new model is:

print("RMSE (scaled): %.2f" % metrics.root_mean_squared_error(y, y_pred_scaled))
RMSE (scaled): 3.29

Question: Compare it to the previous RMSE. Are the predictions any different ?

The RMSE values are identical. This indicates that scaling the features (centering and standardizing) does not change the predictive performance of a linear regression model itself, but it does change the interpretability of the coefficients.

fig = plt.figure(figsize=(5, 5))
plt.scatter(y_pred_scaled, y_pred)

plt.xlabel("Comsumption predicted on scaled data (mpg)")
plt.ylabel("Comsumption predicted from original data (mpg)")
plt.title("Comparison of linear regression predictions")

# Same values on both axes
axis_min = np.min([np.min(y), np.min(y_pred)])-1
axis_max = np.max([np.max(y), np.max(y_pred)])+1
plt.xlim(axis_min, axis_max)
plt.ylim(axis_min, axis_max)
  
# Diagonal y=x
plt.plot([axis_min, axis_max], [axis_min, axis_max], 'k-')
plt.show()

Comparison of regression coefficients

Finally, we can compare the regression coefficients from both models.

# Display, for each variable, the absolute value of its coefficient in the model
num_features = X.shape[1]
plt.scatter(range(num_features), np.abs(predictor.coef_), label='Original')

plt.scatter(range(num_features), np.abs(predictor_scaled.coef_), label='Scaled', marker='x')

plt.xlabel('Variable')
tmp = plt.xticks(range(num_features), feature_names, rotation=90)
tmp = plt.ylabel('Coefficient')
plt.legend(loc=(0.02, 0.75))
plt.show()

We can notice that, even if the RMSE remains identical, standardizing (= centering and scaling) the variables changed the values of the parameters learnt by the model. We can compare for example the values taken by the intercept (independent term in the linear model).

print("Intercept in the two models")
print(f"- Original data: {predictor.intercept_:.3f}")
print(f"- Scaled data  : {predictor_scaled.intercept_:.3f}")
Intercept in the two models
- Original data: -17.218
- Scaled data  : 23.446

Question: Which variables are now the most relevant to predict a vehicle’s consumption ? Does it seem sensible ?

After scaling the variables, we can now more accurately compare the influence of each feature on the prediction of vehicle consumption, as shown in the ‘Scaled’ coefficients plot in cell 70. The variables with the highest absolute coefficients are:

  1. Weight: This variable has a very high absolute coefficient. This is highly sensible, as heavier vehicles typically require more energy to move, leading to higher fuel consumption.
  2. Model Year: Also exhibiting a significant absolute coefficient, the model year often reflects technological advancements in engine efficiency, aerodynamics, and overall vehicle design. Newer cars (higher model year) tend to be more fuel-efficient, so its importance is sensible.
  3. Displacement: The engine’s displacement (volume) is directly related to its size and power output, which is a primary factor in fuel consumption. Larger displacement generally means more fuel consumption, making its relevance entirely sensible.

Other variables like cylinders and horsepower also have notable coefficients and are sensible predictors of fuel consumption. acceleration and origin appear to have less relative importance in this scaled model.

This revised interpretation, made possible by scaling the features, provides a much more intuitive and sensible understanding of which vehicle characteristics are most influential in predicting fuel consumption.

5. Encoding categorical variables

The origin variable is a qualitative (or categorical) variable: the 1-2-3 encoding is completely arbitrary. It implies, in particular, if we think in terms of distances, that Asia is twice as far from North America as it is from Europe, which does not make sense.

A more reasonable encoding for this kind of case is what is called one-hot, or dummy encoding: we represent the variable by as many binary variables as there are possible values (3 in the case of the origin variable: the first corresponds to North America, the second to Europe, the third to Asia), and we set to 1 the single binary variable corresponding to the value we are encoding.

Thus the single origin variable becomes 3 binary variables:

   North America --> 1, 0, 0
   Europe --> 0, 1, 0
   Asia --> 0, 0, 1

This representation has the disadvantage of increasing the number of variables, but the Euclidean distances are now more reasonable (they are 1 if the values are different and 0 if they are identical).

This functionality exists in pandas as well as in scikit-learn.

One-hot transformation

# Create a new dataframe where the 'origin' column is replaced by its 'one-hot' encoding
df_dummies = pd.get_dummies(df, columns=['origin'])
df_dummies.head()
mpg cylinders displacement horsepower weight acceleration model year car name origin_1 origin_2 origin_3
0 18.0 8 307.0 130 3504 12.0 70 chevrolet chevelle malibu True False False
1 15.0 8 350.0 165 3693 11.5 70 buick skylark 320 True False False
2 18.0 8 318.0 150 3436 11.0 70 plymouth satellite True False False
3 16.0 8 304.0 150 3433 12.0 70 amc rebel sst True False False
4 17.0 8 302.0 140 3449 10.5 70 ford torino True False False
# Extract the data once again
X_dummies = np.array(df_dummies.drop(columns=['mpg', 'car name']))

As previously, we normalize each of the variables.

# Initialize a StandardScaler object
standard_scaler_dummies = preprocessing.StandardScaler()
# Fit the object on the data with dummies
standard_scaler_dummies.fit(X_dummies)
# Scale the data with dummies
X_scaled_dummies = standard_scaler_dummies.transform(X_dummies)

Impact on the model

Let’s learn a linear regression on the data where the origin varibale has been replaced by its one-hot encoding.

To do so, we create an instance predictor_dummy from the LinearRegression class, trained on the data containing the one-hot version of the origin variable.

# Create a new LinearRegression object
predictor_dummy = linear_model.LinearRegression()

# Train predictor_dummy on the new data
predictor_dummy.fit(X_scaled_dummies, y)
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

We can now create an array y_pred_dummy which contains the predictions of the new linear regression on these data.

y_pred_dummy = predictor_dummy.predict(X_scaled_dummies)

RMSE for this new model is:

print("RMSE (one-hot encoding): %.2f" % metrics.root_mean_squared_error(y, y_pred_dummy))
RMSE (one-hot encoding): 3.27

Question: Compare it to the previous RMSE.

Comparing the RMSE for the one-hot encoded model (3.27) to the previous RMSE (from the scaled data model, 3.29), we observe a slight improvement.

This small reduction in RMSE indicates that representing the origin variable using one-hot encoding provides a slightly better fit for the linear regression model than treating it as a continuous numerical variable. By converting the categorical origin feature into a set of binary features, the model can capture the distinct effect of each origin category more accurately, leading to a marginal improvement in predictive performance.

Comparison to previous predictions

Are the performances really different ? We can compare the predictions directly:

fig = plt.figure(figsize=(5, 5))
plt.scatter(y_pred, y_pred_dummy)

plt.xlabel("Predicted consumption (mpg) (baseline)")
plt.ylabel("Predicted consumption (mpg) (one-hot)")
plt.title("Comparison of Linear Regression predictions")

# Same values on both axes
axis_min = np.min([np.min(y_pred), np.min(y_pred_dummy)])-1
axis_max = np.max([np.max(y_pred), np.max(y_pred_dummy)])+1
plt.xlim(axis_min, axis_max)
plt.ylim(axis_min, axis_max)
  
# Diagonal y=x
plt.plot([axis_min, axis_max], [axis_min, axis_max], 'k-')
plt.show()

Let’s see what the correlation is between the two predictions

import scipy.stats as st
r, pval = st.pearsonr(y_pred, y_pred_dummy)
print("Correlation between predictions : %.2f (p=%.2e)" % (r, pval))
Correlation between predictions : 1.00 (p=0.00e+00)

Comparison of the regression coefficients

Let’s now compare visually the two models:

# Display for each variable, the absolute value of its coefficient in the model
num_features = X.shape[1]
plt.scatter(range(num_features), np.abs(predictor_scaled.coef_), label='Scaled')

num_features2 = X_scaled_dummies.shape[1]
plt.scatter(range(num_features2), np.abs(predictor_dummy.coef_), label='One-hot', marker='x')
feature_names2 = df_dummies.drop(columns=['mpg', 'car name']).columns

plt.xlabel('Variable')
tmp = plt.xticks(range(num_features2), feature_names2, rotation=90)
tmp = plt.ylabel('Coefficient')

plt.legend(loc=(0.02, 0.75))
plt.show()

Conclusion

We reached the end of this practical. Here is a summary of what we have covered, with the key takeaways: - We used the scikit-learn library to predict a quantitative, continuous variable (car consumption, mpg) from continuous and discrete variables (cars features)

  • scikit-learn uses a general framework with

    • object initialization
    • Training the object on the data with fit
    • Predicting the output values with predict or transforming the data with transform
  • We tried a first predictive model: the linear regression from sklearn.linear_model

  • Scaling the data so that every variable has some similar range of values is (very) important.

    • For the linear regression model here, it allowed a much better interpretation
    • With other models, scaling the variables might be critical for the model to learn anything
    • Especially, it prevents the model to attribute too much importance to a variable solely because of its values range
  • We’ve used an example of scikit-learn’s object to scale the data: the StandardScaler

  • Categorical variables can be encoded with a one-hot encoding, which avoids arbitrary order or distances between the variables. A drawback of this is that we need to increase the number of variables used by the model, so it is often a trade-off between these two aspects.