How to explicitly split data for training and evaluation with BigQuery ML?
I understand there's already another post, but it's a bit old and it doesn't really answer the question.
I understand that we can use the DATA_SPLIT_METHOD parameter to separate the data set into training and evaluation sets. But how do I make sure that the two sets are actually different?
So for example, say I set DATA_SPLIT_METHOD to AUTO_SPLIT and my data set has between 500 and 50,000 rows, so 20% of the data will be used for evaluation. How do I make sure that the remaining 80% is what was used for training when I run my evaluation (ML.EVALUATE)?
Solution 1:[1]
The short answer is BigQuery does it for you.
The long answer is that DATA_SPLIT_METHOD is a parameter of CREATE MODEL. When that statement runs, it creates and trains the model on the training share of the data determined by DATA_SPLIT_METHOD and holds out the rest for evaluation.
When you run ML.EVALUATE, you run it against that model, which was created with DATA_SPLIT_METHOD as a parameter. Therefore, as long as you do not pass it your own input data, it already knows which part of the data set it has to evaluate on, and it uses the already trained model.
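As a minimal sketch of that flow, assuming a hypothetical dataset `mydataset`, a source table `mydataset.customers`, and a label column `churned` (none of these names come from the question), the split is declared once at training time and ML.EVALUATE is then called without an input table:

```sql
-- Hypothetical names: mydataset.churn_model, mydataset.customers, churned.
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churned'],
  data_split_method = 'AUTO_SPLIT'  -- BigQuery ML chooses the evaluation rows
) AS
SELECT *
FROM `mydataset.customers`;

-- No table or query is passed here, so the returned metrics are the ones
-- computed on the evaluation data that was held out during training.
SELECT *
FROM ML.EVALUATE(MODEL `mydataset.churn_model`);
```

If you pass a table or query as the second argument to ML.EVALUATE, the metrics are computed on that data instead, so it is then up to you to make sure it does not overlap with the training rows.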
Solution 2:[2]
Very interesting question I would say.
As stated in the BigQuery documentation for the CREATE MODEL parameters, DATA_SPLIT_METHOD is "the method to split input data into training and evaluation sets", so that is done for you.
But in case you would still like to split your data yourself, here is a way to do it with random sampling:
-- Create a new table with an extra column that assigns each row to
-- training (80%), evaluation (10%), or prediction (10%).
CREATE OR REPLACE TABLE `project_name.table_name` AS
SELECT *,
  CASE
    WHEN split_field < 0.8 THEN 'training'
    WHEN split_field < 0.9 THEN 'evaluation'
    ELSE 'prediction'
  END AS churn_dataframe
FROM (
  -- RAND() is already uniform in [0, 1). Rounding it to one decimal would
  -- skew the shares to roughly 75% / 10% / 15%, so it is used unrounded.
  SELECT *,
    RAND() AS split_field
  FROM `project_name.table_name`
)
split_field is a field that holds a random number between 0 and 1 for each row (the generated numbers are assumed to be uniformly distributed).
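One way the labelled table could then be used, sketched under the assumption of a hypothetical model name `mydataset.churn_model` and a hypothetical label column `churned`: train only on the 'training' rows with the built-in split turned off, then pass the 'evaluation' rows to ML.EVALUATE explicitly.

```sql
-- Hypothetical names: mydataset.churn_model and the label column churned.
-- NO_SPLIT stops BigQuery ML from holding out a second evaluation set,
-- so every row tagged 'training' is used for training.
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churned'],
  data_split_method = 'NO_SPLIT'
) AS
SELECT * EXCEPT (split_field, churn_dataframe)
FROM `project_name.table_name`
WHERE churn_dataframe = 'training';

-- Evaluate on the rows tagged 'evaluation', which were excluded above.
SELECT *
FROM ML.EVALUATE(
  MODEL `mydataset.churn_model`,
  (
    SELECT * EXCEPT (split_field, churn_dataframe)
    FROM `project_name.table_name`
    WHERE churn_dataframe = 'evaluation'
  )
);
```

The helper columns are dropped with EXCEPT so that they are not treated as features. Because the split is stored in the table, the same rows stay in the same set every time the model is retrained.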
Hope this could help.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | aemon4 |
Solution 2 | Ledian K. |