Difference between Shuffle and Random_State in train test split?

I tried both on a small dataset sample and they returned the same output. So the question is, what is the difference between the "shuffle" and the "random_state" parameters in scikit-learn's train_test_split function?

Code for MWE:

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)
train_test_split(y, shuffle=False)

Out: [[0, 1, 2], [3, 4]]

train_test_split(y, random_state=0)

Out: [[0, 1, 2], [3, 4]]



Solution 1:[1]

Sometimes experimenting may help understand how a function works.

Say you have a DataFrame like this:

   X  Y
0  A  2
1  A  3
2  A  2
3  B  0
4  B  0

We'll go over the different things that you can do with the function train_test_split:


  • if you input train, test = train_test_split(df, test_size=2/5, shuffle=False, random_state=None), you will always end up with:
# TRAIN
   X  Y
0  A  2
1  A  3
2  A  2

# TEST
   X  Y
3  B  0
4  B  0

  • if you input train, test = train_test_split(df, test_size=2/5, shuffle=False, random_state=1) or any other int for random_state, you will get the same:
# TRAIN
   X  Y
0  A  2
1  A  3
2  A  2

#TEST
   X  Y
3  B  0
4  B  0

This is because you chose not to shuffle your dataset, so random_state is ignored by the function.
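A minimal sketch to check this yourself. It assumes a plain list of row positions (the name rows is made up for illustration) standing in for the DataFrame index above:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the DataFrame index 0..4 used above.
rows = list(range(5))

# With shuffle=False the split is a plain cut: the first rows go to train,
# the remaining rows go to test, whatever random_state is.
splits = [
    train_test_split(rows, test_size=2/5, shuffle=False, random_state=seed)
    for seed in (None, 0, 1, 42)
]

for train, test in splits:
    print(train, test)
```

Every iteration should print the same train and test lists, in the original row order.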


  • Now, if you do train, test = train_test_split(df, test_size=2/5, shuffle=True, random_state=None), you might get a split that looks like this:
# TRAIN
   X  Y
4  B  0
0  A  2
1  A  3

# TEST
   X  Y
2  A  2
3  B  0

Note that entries have been shuffled. But note as well that if you run your code again, results might differ.


  • Finally, if you do train, test = train_test_split(df, test_size=2/5, shuffle=True, random_state=1) or any other int for random_state, you will get two datasets with shuffled entries as well:
# TRAIN
   X  Y
4  B  0
0  A  2
3  B  0

# TEST
   X  Y
2  A  2
1  A  3

Only, this time, if you run the code again with the same random_state, the output will always remain the same. You have set a seed, which is useful for reproducibility of the results!
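The reproducibility point can be sketched as follows. This is only an illustration: the helper split and the list rows are hypothetical names, and a plain list is used in place of the DataFrame:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the DataFrame index 0..4 used above.
rows = list(range(5))

def split(seed):
    # shuffle=True is the default; it is spelled out here for clarity.
    return train_test_split(rows, test_size=2/5, shuffle=True, random_state=seed)

first = split(1)
second = split(1)

# Same seed, same data: both calls return the identical split.
print(first == second)

# random_state=None draws from NumPy's global random state, so this split
# may change from one run of the script to the next.
unseeded = split(None)
print(unseeded)
```

Note that a fixed seed only guarantees the same split for the same input and the same library versions. It says nothing about how different seeds compare with each other.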

Solution 2:[2]

  • random_state controls the NumPy pseudo-random number generator used for the shuffling. For reproducible results, a random_state should be specified.

  • shuffle: if True, the data is shuffled before splitting.

More details:

random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used by np.random.

shuffle : boolean, optional (default=True)
    Whether or not to shuffle the data before splitting.
    If shuffle=False then stratify must be None.
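As a rough illustration of the two parameter descriptions above, the sketch below passes an int and a RandomState instance as random_state, and then combines shuffle=False with stratify. The variable names (data, classes, rng and so on) are made up for this example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = list(range(6))
classes = [0, 0, 0, 1, 1, 1]  # hypothetical class labels for stratify

# An int seeds a fresh generator on every call, so repeated calls agree.
by_int_1 = train_test_split(data, test_size=2, random_state=7)
by_int_2 = train_test_split(data, test_size=2, random_state=7)
print(by_int_1 == by_int_2)

# A RandomState instance is used as the generator itself. A freshly created
# RandomState(7) starts in the same state as the int seed 7.
rng = np.random.RandomState(7)
by_instance = train_test_split(data, test_size=2, random_state=rng)
print(by_instance == by_int_1)

# shuffle=False together with stratify is rejected with a ValueError.
try:
    train_test_split(data, test_size=2, shuffle=False, stratify=classes)
    stratify_error = None
except ValueError as exc:
    stratify_error = str(exc)
print(stratify_error)
```

One design consequence: because the instance keeps its internal state between calls, reusing the same rng for a second split will generally not repeat the first one, whereas passing the int again does.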

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 bglbrt
Solution 2