Tuesday, March 30, 2021

Why did the Test data distribution change on applying Feature Scaling to the Training data?

I found a simple linear regression dataset on Kaggle. The problem is that without feature scaling the accuracy is 98%, but when feature scaling is applied to the training set, the accuracy drops to 72%.

Can someone explain why this is happening? Here is the code, followed by the training and test graphs.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

train_dataset = pd.read_csv('train.csv')
test_dataset = pd.read_csv('test.csv')

train_dataset.dropna(inplace=True)

X_train = train_dataset.iloc[:, :-1].values
y_train = train_dataset.iloc[:, -1].values

X_test = test_dataset.iloc[:, :-1].values
y_test = test_dataset.iloc[:, -1].values

sc = StandardScaler()
X_train[:, 0:] = sc.fit_transform(X_train[:, 0:])
X_test[:, 0:] = sc.transform(X_test[:, 0:])

regression = LinearRegression()
regression.fit(X_train, y_train)

y_pred = regression.predict(X_test)

plt.figure(figsize=(32, 16))
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, regression.predict(X_train), color='blue')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Experience')
plt.ylabel('Salary', rotation=0)
plt.show()

plt.figure(figsize=(32, 16))
plt.scatter(X_test, y_test, color='red')
plt.plot(X_test, regression.predict(X_test), color='blue')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Experience')
plt.ylabel('Salary', rotation=0)
plt.show()

r2_score(y_test, regression.predict(X_test))

Training Distribution

Test Distribution

Edit: Let me rephrase the question. Why did the distribution of my X_test collapse to just -1, 0, and 1 after applying feature scaling fitted on the training set?
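One way this collapse to -1, 0, 1 can happen: if `X_train`/`X_test` come from an integer column, `.values` yields an integer NumPy array, and assigning `StandardScaler`'s float output back into it in place truncates every value to an integer. A minimal sketch of that effect, using a hypothetical integer feature column (the values 1–10 are made up for illustration, not taken from the Kaggle data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical integer feature column, e.g. years of experience read from a CSV
X = np.arange(1, 11).reshape(-1, 1)   # dtype is int64

sc = StandardScaler()
scaled = sc.fit_transform(X)          # floats, roughly in the range -1.57 .. 1.57

# In-place assignment into the int64 array casts the floats back to integers,
# truncating toward zero.
X[:, 0:] = scaled

print(np.unique(X))                   # prints [-1  0  1]
```

Assigning to a fresh variable instead (`X = sc.fit_transform(X)`) keeps the float dtype and avoids the truncation.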

https://stackoverflow.com/questions/66873458/why-did-the-test-data-distribution-change-on-applying-feature-scaling-to-the-tra March 30, 2021 at 10:59PM
