Pandas: Sampling from a DataFrame according to a target distribution
I have a Pandas DataFrame containing a dataset D of instances, each of which has a continuous value x. x is distributed in a certain way, say uniformly; it could be anything.
I want to draw n samples from D for which x follows a target distribution that I can sample or approximate. In reality the target comes from a dataset; here I just use a normal distribution.
How can I sample instances from D such that the distribution of x in the sample is equal/similar to an arbitrary distribution that I specify?
Right now, I sample a value x from the target distribution, subset D to all instances within x ± eps of that value, and sample one instance from the subset. But this gets quite slow as the datasets grow. People must have come up with a better solution. Maybe the approach is already good but could be implemented more efficiently?
I could split x into strata, which would be faster, but is there a solution without this? (A rough sketch of the stratified idea follows my code below.)
My current code, which works fine but is slow (about 1 minute for 30k samples from a 100k-row dataset, but I need something like 200k samples from 700k rows):
import numpy as np
import pandas as pd
import numpy.random as rnd
from matplotlib import pyplot as plt
from tqdm import tqdm
n_target = 30000
n_dataset = 100000
x_target_distribution = rnd.normal(size=n_target)
# In reality this would be x_target_distribution = my_dataset["x"].sample(n_target, replace=True)
df = pd.DataFrame({
'instances': np.arange(n_dataset),
'x': rnd.uniform(-5, 5, size=n_dataset)
})
# Compare the source and target distributions
plt.hist(df["x"], histtype="step", density=True)
plt.hist(x_target_distribution, histtype="step", density=True)
def sample_instance_with_x(x, eps=0.2):
    # Draw one instance whose x lies within eps of the target value
    try:
        return df.loc[abs(df["x"] - x) < eps].sample(1)
    except ValueError:  # fallback if no instance is close enough
        return df.sample(1)
# Draw one instance per target value, then compare sample and target
df_sampled_ = [sample_instance_with_x(x) for x in tqdm(x_target_distribution)]
df_sampled = pd.concat(df_sampled_)
plt.hist(df_sampled["x"], histtype="step", density=True)
plt.hist(x_target_distribution, histtype="step", density=True)
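For reference, here is a minimal sketch of the stratified variant mentioned above (the bin count and helper name are just illustrative, not part of my actual code):

# Bin x into strata once, then draw each target sample from its stratum
bins = np.linspace(df["x"].min(), df["x"].max(), 51)  # 50 strata, arbitrary
strata = pd.cut(df["x"], bins=bins, labels=False, include_lowest=True)
groups = {k: g for k, g in df.groupby(strata)}

def sample_instance_in_stratum(x):
    # Map the target value to its stratum; digitize returns 1-based bin indices
    stratum = np.digitize(x, bins) - 1
    group = groups.get(stratum)
    if group is None:  # target value falls outside the data range
        return df.sample(1)
    return group.sample(1)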
Solution 1:[1]
Rather than generating new points and finding their closest neighbor in df.x, define the probability that each point should be sampled according to your target distribution. You can use np.random.choice. A million points are sampled from df.x in a second or so for a gaussian target distribution like this:
x = np.sort(df.x)
f_x = np.gradient(x)*np.exp(-x**2/2)  # unnormalized gaussian target density times differential
sample_probs = f_x/np.sum(f_x)        # normalize to a probability vector
samples = np.random.choice(x, p=sample_probs, size=1000000)
sample_probs is the key quantity, as it can be joined back to the dataframe or used as an argument to df.sample, e.g.:
# sample df rows without replacement (weights are positional, so sort by x first)
df_samples = df.sort_values("x").sample(
    n=1000,
    weights=sample_probs,
    replace=False,
)
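And a sketch of the "joined back to the dataframe" option mentioned above (the column name sample_prob is illustrative, not from the original answer):

# Attach the per-point probability as a column; rows must be sorted by x
# so the positions line up with sample_probs
df_weighted = df.sort_values("x").assign(sample_prob=sample_probs)
df_samples = df_weighted.sample(n=1000, weights="sample_prob", replace=False)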
The result of plt.hist(samples, bins=100, density=True): [figure: histogram of the resampled points, matching the gaussian target]
Gaussian-distributed x, uniform target distribution
Let's see how this method performs when the original samples are drawn from a gaussian distribution and we wish to resample them to match a uniform target distribution:
x = np.sort(np.random.normal(size=100000))
f_x = np.gradient(x)*np.ones(len(x))  # uniform target: density is constant
sample_probs = f_x/np.sum(f_x)
samples = np.random.choice(x, p=sample_probs, size=1000000)
In the resulting histogram, the tails look jittery at this resolution, but with a larger bin size they would smooth out.
Method
Approximate probabilities are calculated for the samples in x in the form:

prob(x_i) ~ delta_x * rho(x_i)

where rho(x_i) is the target density function and np.gradient(x) supplies the differential value delta_x. If the differential weight is ignored, f_x will over-represent close points and under-represent sparse points in the resampling. I made this mistake initially; the effect is small if x is uniformly distributed, but in general it can be significant.
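To make that concrete, here is a small comparison sketch (my own illustration, not from the original answer) of resampling gaussian data toward a uniform target, with and without the differential weight:

x = np.sort(np.random.normal(size=100000))

# With the differential weight: accounts for the spacing between points
f_good = np.gradient(x) * np.ones(len(x))
good = np.random.choice(x, p=f_good/np.sum(f_good), size=1000000)

# Without it: dense regions (the gaussian peak) stay over-represented,
# so the resampled histogram simply reproduces the original gaussian
f_bad = np.ones(len(x))
bad = np.random.choice(x, p=f_bad/np.sum(f_bad), size=1000000)

plt.hist(good, bins=100, density=True, histtype="step", label="with np.gradient")
plt.hist(bad, bins=100, density=True, histtype="step", label="without")
plt.legend()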
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Stack Overflow