XGBoost randomly giving a static prediction of 0.5
I am using a scikit-learn pipeline with XGBRegressor. The pipeline fits without any error. When I predict with this pipeline on the same data multiple times, the predictions occasionally and seemingly at random come out as 0.5, while the normal prediction range is 1,000 to 10,000.
e.g.: (1258.2, 1258.2, 1258.2, 1258.2, 1258.2, 1258.2, 0.5, 1258.2, 1258.2, 1258.2, 1258.2)
- Input data is exactly the same
- Environment is the same
```python
import numpy as np
import xgboost
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])

# Number of trees
n_estimators = [int(x) for x in np.linspace(start=50, stop=1000, num=10)]
# Maximum number of levels in a tree
max_depth = [int(x) for x in np.linspace(1, 32, 32, endpoint=True)]
# Booster
booster = ['gbtree', 'gblinear', 'dart']
# Gamma
gamma = [i / 10.0 for i in range(0, 5)]
# Learning rate
learning_rate = np.linspace(0.01, 0.2, 15)
# Evaluation metric
# eval_metric = ['rmse', 'mae']
# Regularization
reg_alpha = [1e-5, 1e-2, 0.1, 1, 100]
reg_lambda = [1e-5, 1e-2, 0.1, 1, 100]
# Minimum child weight
min_child_weight = list(range(1, 6, 2))
# Sampling
subsample = [i / 10.0 for i in range(6, 10)]
colsample_bytree = [i / 10.0 for i in range(6, 10)]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'booster': booster,
               'gamma': gamma,
               'learning_rate': learning_rate,
               # 'eval_metric': eval_metric,
               'reg_alpha': reg_alpha,
               'reg_lambda': reg_lambda,
               'min_child_weight': min_child_weight,
               'subsample': subsample,
               'colsample_bytree': colsample_bytree}

# Use the random grid to search for the best hyperparameters.
# First create the base model to tune.
rf = xgboost.XGBRegressor(objective='reg:squarederror', n_jobs=4)
# Random search of parameters, using 3-fold cross-validation,
# searching across 100 different combinations
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, cv=3, verbose=0,
                               random_state=42, n_jobs=4)

pipe = Pipeline(steps=[('preprocessor', preprocessor), ('regressor', rf_random)])
pipe.fit(X, y)
```
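One way to quantify the symptom is to collect the repeated predictions on the identical input and flag whichever values disagree with the majority. A minimal sketch, using the prediction values from the example above (the helper logic is illustrative, not part of the original pipeline):

```python
import numpy as np

# Predictions from repeated calls on the exact same input row
# (values taken from the example above)
preds = np.array([1258.2, 1258.2, 1258.2, 1258.2, 1258.2, 1258.2,
                  0.5, 1258.2, 1258.2, 1258.2, 1258.2])

# Find the value most calls agree on, then flag any deviations
values, counts = np.unique(preds, return_counts=True)
majority = values[np.argmax(counts)]
anomalies = preds[preds != majority]

print(majority)   # 1258.2
print(anomalies)  # [0.5]
```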
What could be the issue?
Solution 1:[1]
If you're getting some unusually low predictions, it probably indicates that the dependent variable contains outliers. I'd suggest reading about outlier detection and the different strategies for handling it.
Usually it's not a good idea to fit your model on all data samples without outlier removal; this leads to much worse and non-representative metrics.
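One common strategy is removing target values outside Tukey's 1.5 × IQR fences before fitting. A minimal sketch with made-up target values (remember to apply the same mask to X):

```python
import numpy as np

# Illustrative target values with one extreme outlier
y = np.array([1200.0, 1500.0, 980.0, 10000.0, 2300.0, 1800.0, 1400.0])

# Tukey's fences: 1.5 * IQR beyond the first and third quartiles
q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only in-range targets; apply the same mask to X as well
mask = (y >= lower) & (y <= upper)
y_clean = y[mask]
```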
Solution 2:[2]
It is probably because you have NaN or None values in your target (y).
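A quick way to check is counting missing values in the target before fitting and dropping the corresponding rows. A minimal sketch with illustrative values:

```python
import numpy as np

# Illustrative target with two missing values
y = np.array([1258.2, 2400.0, np.nan, 3100.5, np.nan])

# Count missing targets
n_missing = int(np.isnan(y).sum())
print(n_missing)  # 2

# Drop the offending rows; apply the same mask to X before pipe.fit(X, y)
valid = ~np.isnan(y)
y_clean = y[valid]
```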
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Dror Hilman |