How to create a federated dataset from a CSV file?
I have selected this dataset: https://www.kaggle.com/karangadiya/fifa19
Now I would like to convert this CSV file into a federated dataset that I can fit a model on.
The TensorFlow tutorials on federated learning use a pre-defined dataset. How can I use this particular dataset in a federated learning scenario?
Solution 1:[1]
I'll use a different CSV dataset, but this should still address the core of the question, which is how to create a federated dataset from a CSV. Let's also assume that the dataset has a column whose values you would like to use as the client_ids for your data.
import pandas as pd
import tensorflow as tf
import tensorflow_federated as tff
csv_url = "https://docs.google.com/spreadsheets/d/1eJo2yOTVLPjcIbwe8qSQlFNpyMhYj-xVnNVUTAhwfNU/gviz/tq?tqx=out:csv"
df = pd.read_csv(csv_url, na_values=("?",))
client_id_colname = 'native.country' # the column that represents client ID
SHUFFLE_BUFFER = 1000
NUM_EPOCHS = 1
# split client id into train and test clients
# dropna() keeps missing values (read from "?") from becoming a client ID
client_ids = df[client_id_colname].dropna().unique()
# unique() returns a NumPy array, which has no sample(); wrap it in a Series first
train_client_ids = pd.Series(client_ids).sample(frac=0.5).tolist()
test_client_ids = [x for x in client_ids if x not in train_client_ids]
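The split above is random, so the train and test clients change on every run. As a minimal sketch of a reproducible alternative in plain Python, the helper below (`split_client_ids` is a hypothetical name, not part of pandas or TFF) assumes the client IDs are strings with missing values already removed:

```python
import random

def split_client_ids(client_ids, train_frac=0.5, seed=0):
    """Split unique client IDs into disjoint train and test lists."""
    # Sort first so the result depends only on the seed, not on input order.
    unique_ids = sorted(set(client_ids))
    rng = random.Random(seed)  # local RNG; leaves the global random state alone
    n_train = round(len(unique_ids) * train_frac)
    train_ids = rng.sample(unique_ids, n_train)
    train_set = set(train_ids)  # set membership is O(1) per lookup
    test_ids = [cid for cid in unique_ids if cid not in train_set]
    return train_ids, test_ids

train_ids, test_ids = split_client_ids(
    ["United-States", "Mexico", "India", "Cuba", "India"]
)
print(len(train_ids), len(test_ids))  # 2 2
```

The two returned lists can be passed as the `client_ids` arguments in the same way as `train_client_ids` and `test_client_ids`.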
There are a few ways to do this, but the way I'll illustrate here uses tff.simulation.ClientData.from_clients_and_fn, which requires that we write a function that accepts a client_id as input and returns a tf.data.Dataset. We can easily construct this from the dataframe.
def create_tf_dataset_for_client_fn(client_id):
    # a function which takes a client_id and returns a
    # tf.data.Dataset for that client
    client_data = df[df[client_id_colname] == client_id]
    dataset = tf.data.Dataset.from_tensor_slices(client_data.to_dict('list'))
    dataset = dataset.shuffle(SHUFFLE_BUFFER).batch(1).repeat(NUM_EPOCHS)
    return dataset
Now, we can use the function above to create a ConcreteClientData object for our training and test data:
train_data = tff.simulation.ClientData.from_clients_and_fn(
    client_ids=train_client_ids,
    create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
)
test_data = tff.simulation.ClientData.from_clients_and_fn(
    client_ids=test_client_ids,
    create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
)
To see one instance of the dataset, try:
example_dataset = train_data.create_tf_dataset_for_client(
    train_data.client_ids[0]
)
print(type(example_dataset))
# use the built-in next() rather than the iterator's .next() method
example_element = next(iter(example_dataset))
print(example_element)
# <class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'>
# {'age': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([37], dtype=int32)>, 'workclass': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Local-gov'], dtype=object)>, ...
Each element of example_dataset is a Python dictionary where the keys are strings representing feature names, and the values are tensors with one batch of those features. Now, you have a federated dataset that can be preprocessed and used for modeling.
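As a rough sketch of what that preprocessing step could look like, the snippet below maps each dictionary element to a (features, label) pair, which is the shape most Keras-style models expect. The helper name `make_preprocess_fn`, the toy columns, and the choice of label are all assumptions for illustration. String columns would need their own encoding first.

```python
import tensorflow as tf

def make_preprocess_fn(feature_keys, label_key):
    """Build a map function turning a dict of batched columns into (x, y)."""
    def to_features_and_label(element):
        # Stack the chosen numeric columns into one [batch, num_features] tensor.
        x = tf.stack(
            [tf.cast(element[key], tf.float32) for key in feature_keys], axis=-1
        )
        y = tf.cast(element[label_key], tf.float32)
        return x, y
    return to_features_and_label

# Toy stand-in for one client's dataset (hypothetical numeric columns).
toy_client_dataset = tf.data.Dataset.from_tensor_slices(
    {"age": [37, 50], "hours": [40, 13], "label": [0, 1]}
).batch(1)

preprocessed = toy_client_dataset.map(make_preprocess_fn(["age", "hours"], "label"))
features, label = next(iter(preprocessed))
print(features.shape, label.shape)
```

The same map function could be applied to the dataset returned by `create_tf_dataset_for_client` for each client.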
Solution 2:[2]
You can convert your CSV file to federated data by first creating an h5 (HDF5) file from it.
Background: An h5 file is a hierarchical file format that stores data together with its metadata. This works well here because the hierarchy maps naturally onto federated data, with one group per client ID.
Federated data is accessed through a client data object, and one implementation of client data (HDF5ClientData) is backed by an h5 file.
Federated source code for client data: https://github.com/tensorflow/federated/blob/master/tensorflow_federated/python/simulation/hdf5_client_data.py
Steps
- Create your h5 file
- In your federated experiment, create a client data object from that file, and then follow the Image Recognition tutorial on the federated main page
Creating h5 file
import h5py

# `train` and `dataY` are assumed to be per-client sequences of arrays
# (features and labels) prepared from the CSV beforehand.
with h5py.File("student31.h5", 'a') as hdf:
    # HDF5ClientData reads clients from a top-level "examples" group
    example = hdf.create_group("examples")
    for i in range(0, 20):
        # one subgroup per client, named by the client ID
        exampleGroup = example.create_group(str(i))
        exampleGroup.create_dataset('x', data=train[i])
        exampleGroup.create_dataset('y', data=dataY[i])
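The snippet above takes the per-client arrays `train[i]` and `dataY[i]` as given. One way they might be derived from a CSV is to group rows by a client column, sketched here with only the standard library. The function name `group_rows_by_client` and the column names (`Club`, `Age`, `Potential`, `Overall`) are assumptions for illustration, and all selected columns are assumed to be numeric.

```python
import csv
import io
from collections import defaultdict

def group_rows_by_client(csv_file, client_col, feature_cols, label_col):
    """Return {client_id: {"x": [[...], ...], "y": [...]}} from a CSV file object."""
    per_client = defaultdict(lambda: {"x": [], "y": []})
    for row in csv.DictReader(csv_file):
        client_id = row[client_col]
        if not client_id:  # skip rows that have no client ID
            continue
        per_client[client_id]["x"].append([float(row[c]) for c in feature_cols])
        per_client[client_id]["y"].append(float(row[label_col]))
    return dict(per_client)

# Tiny in-memory CSV standing in for the real file (hypothetical columns).
sample = io.StringIO(
    "Club,Age,Potential,Overall\n"
    "A,31,94,94\n"
    "B,33,94,94\n"
    "A,26,92,91\n"
)
clients = group_rows_by_client(sample, "Club", ["Age", "Potential"], "Overall")
print(sorted(clients))    # ['A', 'B']
print(clients["A"]["x"])  # [[31.0, 94.0], [26.0, 92.0]]
```

Each `clients[cid]["x"]` and `clients[cid]["y"]` could then be written with `create_dataset` inside the group for that client.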
Federated Client data Instantiation
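# import path taken from the source file linked above; it may differ between TFF versions
from tensorflow_federated.python.simulation.hdf5_client_data import HDF5ClientData
myclient = HDF5ClientData("student31.h5")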
Solution 3:[3]
If you're in 2022, try this updated code, based on @jpgard's answer:
!pip install tensorflow-federated==0.19.0
import pandas as pd
import tensorflow as tf
import tensorflow_federated as tff
csv_url = "https://docs.google.com/spreadsheets/d/1eJo2yOTVLPjcIbwe8qSQlFNpyMhYj-xVnNVUTAhwfNU/gviz/tq?tqx=out:csv"
df = pd.read_csv(csv_url, na_values=("?",))
client_id_colname = 'native.country' # the column that represents client ID
SHUFFLE_BUFFER = 1000
NUM_EPOCHS = 1
# split client id into train and test clients
# dropna() keeps missing values (read from "?") from becoming a client ID
client_ids = df[client_id_colname].dropna().unique()
# sample from a Series so the result is a flat list of IDs; a DataFrame's
# .values.tolist() would give a list of one-element lists and break the
# membership test on the next line
train_client_ids = pd.Series(client_ids).sample(frac=0.5).tolist()
test_client_ids = [x for x in client_ids if x not in train_client_ids]
def create_tf_dataset_for_client_fn(client_id):
    # a function which takes a client_id and returns a
    # tf.data.Dataset for that client
    client_data = df[df[client_id_colname] == client_id]
    dataset = tf.data.Dataset.from_tensor_slices(client_data.fillna('').to_dict("list"))
    dataset = dataset.shuffle(SHUFFLE_BUFFER).batch(1).repeat(NUM_EPOCHS)
    return dataset
train_data = tff.simulation.datasets.ClientData.from_clients_and_fn(
    client_ids=train_client_ids,
    create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
)
test_data = tff.simulation.datasets.ClientData.from_clients_and_fn(
    client_ids=test_client_ids,
    create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
)
example_dataset = train_data.create_tf_dataset_for_client(
    train_data.client_ids[0]
)
print(type(example_dataset))
# use the built-in next() rather than the iterator's .next() method
example_element = next(iter(example_dataset))
print(example_element)
# <class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'>
# {'age': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([37], dtype=int32)>, 'workclass': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Local-gov'], dtype=object)>, ...
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | jpgard |
Solution 2 | Ronak Pasricha |
Solution 3 | Lucas Emanuel |