Scaling for a double y axis plot with scikit-learn

Many machine learning algorithms work better when features are on a relatively similar scale and close to normally distributed. MinMaxScaler, RobustScaler, StandardScaler, and Normalizer are scikit-learn preprocessing classes for preparing data for machine learning. Which one you need, if any, depends on your model type and your feature values.

These scalers are valuable not only for modeling but also for plotting multiple y axes.

When plotting a double y axis comparing the Nasdaq stock index price with US COVID-19 case numbers, I chose MinMaxScaler. Because the analysis was about spotting radical shifts in price or case numbers, I kept the outliers, and I chose MinMaxScaler for the double y axis because:

It doesn't reduce the importance of outliers.

For each value in a feature, MinMaxScaler subtracts the minimum value in the feature and then divides by the range. The range is the difference between the original maximum and original minimum.

MinMaxScaler preserves the shape of the original distribution. It doesn’t meaningfully change the information embedded in the original data.

The default range for the feature returned by MinMaxScaler is 0 to 1.
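The arithmetic described above can be sketched by hand in a few lines of Python. This is a minimal illustration of the formula, not scikit-learn's implementation; the function name `min_max_scale`, its handling of a constant feature, and the sample values are assumptions made up for this example.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so the minimum maps to the lower bound
    of feature_range and the maximum maps to the upper bound."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    value_range = v_max - v_min  # original maximum minus original minimum
    if value_range == 0:
        # Assumption for this sketch: a constant feature maps to the lower bound.
        return [low for _ in values]
    # Subtract the minimum, divide by the range, then stretch to feature_range.
    return [low + (v - v_min) / value_range * (high - low) for v in values]


# Made-up values; 110.0 plays the role of an outlier.
prices = [10.0, 20.0, 25.0, 110.0]
scaled = min_max_scale(prices)
print(scaled)
```

The outlier still lands at the top of the range and the other points keep their relative spacing, which is why the shape of the distribution is preserved.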

Here’s the plot after MinMaxScaler has been applied:

Notice how the features are all on the same relative scale. The relative spaces between each feature’s values have been maintained.

To plot a DataFrame df that has a datetime column with MinMaxScaler, first set the date column as the df index, then apply MinMaxScaler, then convert the scaled result back to a DataFrame to plot. Here's an example of these steps:

Import the sklearn scalers.
[1] Set Date as the datetime index. [2] Call MinMaxScaler. [3] Convert the scaled result back to a pd.DataFrame to plot.
[4] Use the DataFrame to plot.
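A minimal sketch of these steps might look like the following. It is not the original code: the column names `nasdaq_close` and `us_covid_cases`, the sample values, and the helper `plot_scaled` are all made up for illustration. This version keeps the dates as the index of the scaled DataFrame so they can serve as the x axis; calling `reset_index()` afterwards would turn Date back into a regular column.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up illustrative values, not real Nasdaq or case data.
df = pd.DataFrame({
    "Date": ["2020-03-02", "2020-03-03", "2020-03-04", "2020-03-05"],
    "nasdaq_close": [8900.0, 8700.0, 9000.0, 8800.0],
    "us_covid_cases": [100.0, 120.0, 160.0, 220.0],
})

# [1] Parse the dates and move them to the index,
#     so only numeric columns are passed to the scaler.
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date")

# [2] Fit and apply MinMaxScaler. By default the result is a NumPy array,
#     so the column names and the date index are not carried over.
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(df)

# [3] Wrap the array in a DataFrame again, reusing the original labels.
scaled_df = pd.DataFrame(scaled_values, columns=df.columns, index=df.index)


# [4] Hypothetical helper that draws the two scaled series on twin y axes.
def plot_scaled(frame):
    import matplotlib.pyplot as plt  # imported here so steps 1-3 run without it

    fig, ax_left = plt.subplots()
    ax_right = ax_left.twinx()  # second y axis sharing the same x axis
    ax_left.plot(frame.index, frame["nasdaq_close"], color="tab:blue")
    ax_right.plot(frame.index, frame["us_covid_cases"], color="tab:red")
    ax_left.set_ylabel("nasdaq_close (scaled)")
    ax_right.set_ylabel("us_covid_cases (scaled)")
    return fig
```

Calling `plot_scaled(scaled_df)` and then `matplotlib.pyplot.show()` would display the figure, with both series running from 0 to 1.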

To compare the effect of different scalers on data with outliers, read more at: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html

Deep Diving with Data Science