Difference between Shuffle and Random_State in train test split?
I tried both on a small sample dataset and they returned the same output. So the question is: what is the difference between the "shuffle" and "random_state" parameters of scikit-learn's train_test_split function?
Code for MWE:
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
train_test_split(y, shuffle=False)
Out: [[0, 1, 2], [3, 4]]
train_test_split(y, random_state=0)
Out: [[0, 1, 2], [3, 4]]
Solution 1:[1]
Sometimes experimenting may help understand how a function works.
Say you have a DataFrame like this:
X Y
0 A 2
1 A 3
2 A 2
3 B 0
4 B 0
We'll go over the different things that you can do with the function train_test_split:
- If you run
train, test = train_test_split(df, test_size=2/5, shuffle=False, random_state=None)
you will always end up with:
# TRAIN
X Y
0 A 2
1 A 3
2 A 2
# TEST
X Y
3 B 0
4 B 0
- If you run
train, test = train_test_split(df, test_size=2/5, shuffle=False, random_state=1)
or use any other int for random_state, you will get the same result:
# TRAIN
X Y
0 A 2
1 A 3
2 A 2
# TEST
X Y
3 B 0
4 B 0
This is because you chose not to shuffle the dataset, so random_state is not used by the function.
- Now, if you run
train, test = train_test_split(df, test_size=2/5, shuffle=True, random_state=None)
you will get a split that looks like this:
# TRAIN
X Y
4 B 0
0 A 2
1 A 3
# TEST
X Y
2 A 2
3 B 0
Note that the entries have been shuffled. Note as well that if you run your code again, the results might differ, because no seed has been set.
- Finally, if you run
train, test = train_test_split(df, test_size=2/5, shuffle=True, random_state=1)
or use any other int for random_state, you will get two datasets with shuffled entries as well:
# TRAIN
X Y
4 B 0
0 A 2
3 B 0
# TEST
X Y
2 A 2
1 A 3
Only this time, if you run the code again with the same random_state, the output will always remain the same. You have set a seed, which is useful for reproducibility of the results!
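The four cases above can be checked with a short script. This is a minimal sketch, assuming pandas and scikit-learn are installed; the helper name split_indices is hypothetical and only exists to keep the comparison compact. No output is shown for the shuffled cases, since the exact order depends on the library version and, without a seed, on the run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy DataFrame as in the walkthrough above.
df = pd.DataFrame({"X": ["A", "A", "A", "B", "B"], "Y": [2, 3, 2, 0, 0]})


def split_indices(**kwargs):
    """Hypothetical helper: return (train index, test index) as plain lists."""
    train, test = train_test_split(df, test_size=2/5, **kwargs)
    return train.index.tolist(), test.index.tolist()


# shuffle=False: the split is purely positional, so random_state is ignored.
no_shuffle_none = split_indices(shuffle=False, random_state=None)
no_shuffle_seed = split_indices(shuffle=False, random_state=1)
print(no_shuffle_none)                     # ([0, 1, 2], [3, 4])
print(no_shuffle_none == no_shuffle_seed)  # True

# shuffle=True with a fixed seed: shuffled, but identical on every call.
seeded_first = split_indices(shuffle=True, random_state=1)
seeded_second = split_indices(shuffle=True, random_state=1)
print(seeded_first == seeded_second)       # True

# shuffle=True with random_state=None: the split may change between runs.
unseeded = split_indices(shuffle=True, random_state=None)
print(unseeded)
```

Whatever the settings, the train and test parts together always contain each row exactly once; only the assignment of rows to parts changes.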
Solution 2:[2]
random_state: seeds the NumPy pseudo-random number generator used for shuffling. For reproducible code, a random_state should be specified.
shuffle: if True, the data is shuffled before splitting.
More details:
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.
shuffle : boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
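The two parameter descriptions can be illustrated with a small sketch, assuming NumPy and scikit-learn are installed. The variable names (y, labels, same_split, raised) are illustrative only. It shows that an int seed and a freshly created RandomState built from the same seed should produce the same split, and that combining shuffle=False with stratify is rejected.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.arange(5)

# An int seed and a fresh RandomState created from the same seed
# drive the shuffle identically, so the two splits match.
train_int, test_int = train_test_split(y, test_size=2/5, random_state=0)
train_rs, test_rs = train_test_split(
    y, test_size=2/5, random_state=np.random.RandomState(0)
)
same_split = np.array_equal(train_int, train_rs) and np.array_equal(test_int, test_rs)
print(same_split)  # True

# shuffle=False cannot be combined with stratify: a ValueError is raised.
labels = np.array([0, 0, 0, 1, 1])
try:
    train_test_split(y, test_size=2/5, shuffle=False, stratify=labels)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

Note that reusing one RandomState instance across several calls advances its internal state, so later calls are not guaranteed to repeat the first split; pass an int when the same split is needed every time.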
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | bglbrt |
Solution 2 |