scikit-learn GridSearchCV() fit() performance improvement

I am using GridSearchCV() and its fit() method to build a model. This works, but I would like to improve the model's accuracy by supplying more images to train on. Right now, fit() takes over an hour to complete with 500 images, and the processing time grows sharply every time the number of images doubles. Ultimately, I'd like to train on several thousand images and include more categories than the two in my proof of concept.

I have tried several ways to improve performance without success. The only thing that reduces processing time is significantly lowering train_size/test_size in train_test_split(), but that defeats the purpose of having a larger data set to train on. I'm a little stumped on this one. Below is the code I'm using, for reference. Thank you.

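import os

import numpy as np
import pandas as pd
from skimage.io import imread
from skimage.transform import resize
from sklearn.model_selection import train_test_split

categories = ['Cat', 'Dog']
flat_data_arr = []
target_arr = []
datadir = 'C:\\Users\\Name\\Python\\images'

# Load every image, resize it to 150x150x3 and store it as one flat row
for i in categories:
    path = os.path.join(datadir, i)
    for image in os.listdir(path):
        image_array = imread(os.path.join(path, image))
        image_resized = resize(image_array, (150, 150, 3))
        flat_data_arr.append(image_resized.flatten())
        target_arr.append(categories.index(i))  # label = index of the category

flat_data = np.array(flat_data_arr)
target = np.array(target_arr)
df = pd.DataFrame(flat_data)
df['Target'] = target
x = df.iloc[:,:-1]
y = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.75, test_size=0.25, shuffle=True, stratify=y)
# model is the GridSearchCV instance (its definition is not shown in this post)
model.fit(x_train,y_train) #this takes hours depending on number of images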

Try HalvingGridSearchCV - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html
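For reference, here is a minimal sketch of what swapping in HalvingGridSearchCV might look like. The SVC estimator, the parameter grid, and the synthetic data are assumptions made for illustration; the question does not show the actual model. Note that the class is still experimental in scikit-learn, so it has to be enabled with an explicit import first.

```python
from sklearn.datasets import make_classification
# HalvingGridSearchCV is experimental: this import must come first to enable it.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the flattened image matrix (assumption, not the real data).
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid: 4 x 3 x 1 = 12 candidate parameter combinations.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1e-4, 1e-3, 1e-2],
    "kernel": ["rbf"],
}

# Each round keeps roughly the best 1/factor of the candidates and gives the
# survivors more training samples, so weak candidates are dropped cheaply.
search = HalvingGridSearchCV(SVC(), param_grid, factor=3, cv=3, random_state=0)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

The search object is used like GridSearchCV afterwards (best_params_, best_estimator_, score()), so the rest of the code should not need to change.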


For computer vision it is probably best to use TensorFlow, Keras, or PyTorch, ideally with a GPU; training should then take a small fraction of the time, and even without a GPU you should see a significant speed-up.

However, if you decide to continue with scikit-learn, you could try the following (basically reducing dimensions and adding features):

Supporting libraries:

from PIL import Image

import numpy as np

from skimage.feature import hog
from skimage.color import rgb2gray

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

  1. I'm not sure why you go from np.array to pandas; you should be able to work directly with the NumPy matrix.

  1. Make sure you are using all cores/processors. Passing n_jobs=-1 to your grid search call should do it.

  1. You can also reduce the size of your images even further, say 100 x 100 instead of 150 x 150.

  1. Additionally, you could convert each image to grayscale (giving one channel per pixel instead of three):
grey_scaled = rgb2gray(imread(os.path.join(path, image)))

  1. If you are interested in experimenting, you could try using HOG features computed from the grey_scaled image of the previous step:
hog_features = hog(grey_scaled, block_norm='L2-Hys', pixels_per_cell=(10,10))

  1. You could then even try stacking the original image and the HOG features together:
color_features = imread(os.path.join(path, image)).flatten()
final_features = np.hstack((color_features, hog_features))

  1. Loop over all your images, append the result of this pipeline to a list, say final_features_list, and convert that list to a matrix with np.array(final_features_list).

  1. With so many features you can probably reduce the dimensionality, so standard-scale the matrix and then apply PCA:

standard_sc = StandardScaler()

matrix_scaled = standard_sc.fit_transform(np.array(final_features_list))

### read up on how to select the number of components;
### there are methods to help you with that
### (n_components cannot exceed min(n_samples, n_features))
pca = PCA(n_components=300)
matrix_scaled_pca = pca.fit_transform(matrix_scaled)

  1. Try running your grid search again on the matrix_scaled_pca matrix; it should go much faster. You could also try RandomizedSearchCV or, better yet, HalvingGridSearchCV, which should be faster still (roughly 10x faster than GridSearchCV) - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html
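Putting the steps above together, a rough end-to-end sketch could look like the following. Everything specific here is an assumption for illustration: the extract_features helper is hypothetical, the random arrays stand in for the real cat/dog images, and SVC with a tiny grid stands in for whatever model is actually being tuned. The scaler and PCA are placed inside a Pipeline so they are refitted on the training part of each cross-validation fold rather than on the whole matrix.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def extract_features(image, size=(100, 100)):
    """Hypothetical helper: shrink an RGB image, grayscale it, return HOG features."""
    small = resize(image, size + (3,), anti_aliasing=True)
    grey = rgb2gray(small)
    return hog(grey, block_norm="L2-Hys", pixels_per_cell=(10, 10))


# Synthetic stand-in for the image folder (assumption): 40 random RGB images.
rng = np.random.default_rng(0)
images = rng.random((40, 120, 120, 3))
labels = np.array([0, 1] * 20)

# Plain NumPy matrix, one row of features per image (no pandas needed).
features = np.array([extract_features(img) for img in images])

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

# Scale, reduce dimensionality, then classify.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),  # must not exceed the samples per training fold
    ("svc", SVC()),
])

# n_jobs=-1 spreads the candidate fits over all available cores.
search = GridSearchCV(pipeline, {"svc__C": [0.1, 1, 10]}, cv=3, n_jobs=-1)
search.fit(x_train, y_train)
print(features.shape, search.best_params_)
```

With real data, n_components and the grid would need tuning; the random images here only show that the pieces fit together, not what accuracy to expect.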

Best of luck,


