Randomly split dataframe into groups with even distribution of values

I have a dataframe of two groups (A and B) and within those groups, 6 subgroups (a, b, c, d, e, and f). Example data below:

index   group    subgroup    value
0       A        a           1
1       A        b           1
2       A        c           1
3       A        d           1
4       A        e           1
5       A        f           1
6       B        a           1
7       B        b           1
8       B        c           1
9       B        d           1
10      B        e           1
11      B        f           1
...     ...      ...         ...

While I've listed only 12 rows here, all with a value of 1, the real dataset has 300 rows (with values of 2, 3, etc.). I'm trying to randomly split the dataframe into 6 batches of 50 values. However, I'd like each batch to contain an even distribution of group values (so 25 A's and 25 B's) and an approximately even distribution of subgroup values.

For example, batch_1 might contain:

25 A's which contain 4 a's, 5 b's, 4 c's, 4 d's, 5 e's and 3 f's. And 25 B's which contain 5 a's, 4 b's, 3 c's, 5 d's, 4 e's and 4 f's.

These 6 batches will be given to 1 user. (So I actually need to randomly split the dataframe into multiple different sets of 6 batches, one set per user.) But I can't tell whether this is a problem in which the dataframe should be randomly split or sampled from. Does anyone have advice on how to achieve this?

This may be helpful but doesn't ensure an even distribution of values: https://www.geeksforgeeks.org/break-list-chunks-size-n-python/
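One way to get the balance described above is to shuffle the rows, sort them by (group, subgroup) so each stratum is contiguous, and then deal rows round-robin into 6 batches. A minimal sketch with synthetic data standing in for the real 300-row dataset (the column names match the question; the data itself is made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for the real data: 300 rows, 150 per group,
# 25 per (group, subgroup) cell.
df = pd.DataFrame({
    "group": np.repeat(list("AB"), 150),
    "subgroup": np.tile(np.repeat(list("abcdef"), 25), 2),
    "value": rng.integers(1, 4, 300),
})

# Shuffle, then stably sort so each (group, subgroup) cell is contiguous;
# dealing rows round-robin then gives every batch ~1/6 of each cell,
# i.e. exactly 25 A's and 25 B's and 4-5 of each subgroup per batch.
shuffled = df.sample(frac=1, random_state=0).sort_values(
    ["group", "subgroup"], kind="stable"
)
shuffled["batch"] = np.arange(len(shuffled)) % 6

batches = [g.drop(columns="batch") for _, g in shuffled.groupby("batch")]
```

Because 150 rows per group divide evenly by 6, the group split is exactly 25/25 in every batch, while subgroup counts land within one of each other.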



Solution 1:[1]

Using a few techniques:

  1. use pd.factorize() to turn each categorical column into integer codes
  2. calculate a value/factor f that represents a pairing of group / subgroup
  3. randomise it a bit with np.random.uniform(), using min and max close to 1
  4. once every row has a value that represents its grouping, sort_values() and reset_index() give a clean ordered index
  5. finally, work out the batch assignment from the integer remainder of the index
import random
import numpy as np
import pandas as pd

group = list("ABCD")
subgroup = list("abcdef")
df = pd.DataFrame([{"group": random.choice(group),
                    "subgroup": random.choice(subgroup),
                    "value": random.randint(1, 3)} for i in range(300)])

bins=6
dfc = df.assign(
    # take into account concentration of group and subgroup
    # randomise a bit....
    f = ((pd.factorize(df["group"])[0] +1)*10 + 
            (pd.factorize(df["subgroup"])[0] +1) 
            *np.random.uniform(0.99,1.01,len(df))
        ),
).sort_values("f").reset_index(drop=True).assign(
    gc=lambda dfa: dfa.index%(bins)
).drop(columns="f")

# check distribution ... used plot for SO
dfc.groupby(["gc","group","subgroup"]).count().unstack(0).plot(kind="barh")
# every group same size...
# dfc.groupby("gc").count()

# now it's easy to get each of the cuts.... 0 through 5
# dfcut0 = dfc.query("gc==0").drop(columns="gc").copy().reset_index(drop=True)
# dfcut0
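Rather than querying each gc value by hand, all six cuts can be collected in one pass with groupby. A minimal sketch (the frame built below is a stand-in for dfc from the answer, with the gc batch label already assigned):

```python
import numpy as np
import pandas as pd

# Stand-in for dfc: 300 rows with a batch label gc in 0..5,
# assigned from the row position after sorting.
rng = np.random.default_rng(1)
dfc = pd.DataFrame({
    "group": rng.choice(list("AB"), 300),
    "subgroup": rng.choice(list("abcdef"), 300),
    "value": rng.integers(1, 4, 300),
}).sort_values(["group", "subgroup"]).reset_index(drop=True)
dfc["gc"] = dfc.index % 6

# Extract every cut at once; each gets 300 / 6 = 50 rows.
cuts = [d.drop(columns="gc").reset_index(drop=True)
        for _, d in dfc.groupby("gc")]
```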

Output: a bar chart showing that every batch (gc) has a near-even distribution over group and subgroup.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Rob Raymond