'How to use SageMaker Estimator for model training and saving

The documentations of how to use SageMaker estimators are scattered around, sometimes obsolete, incorrect. Is there a one stop location which gives the comprehensive views of how to use SageMaker SDK Estimator to train and save models?

amazon-sagemaker

Solution 1:^[1]

Answer

There is no one such resource from AWS that provides the comprehensive view of how to use SageMaker SDK Estimator to train and save models.

Alternative Overview Diagram

I put a diagram and brief explanation to get the overview on how SageMaker Estimator runs a training.

SageMaker sets up a docker container for a training job where:
- Environment variables are set as in SageMaker Docker Container. Environment Variables.
- Training data is setup under /opt/ml/input/data.
- Training script codes are setup under /opt/ml/code.
- /opt/ml/model and /opt/ml/output directories are setup to store training outputs.

/opt/ml
??? input
?   ??? config
?   ?   ??? hyperparameters.json  <--- From Estimator hyperparameter arg
?   ?   ??? resourceConfig.json
?   ??? data
?       ??? <channel_name>        <--- From Estimator fit method inputs arg
?           ??? <input data>
??? code
?   ??? <code files>              <--- From Estimator src_dir arg
??? model
?   ??? <model files>             <--- Location to save the trained model artifacts
??? output
    ??? failure                   <--- Training job failure logs

SageMaker Estimator fit(inputs) method executes the training script. Estimator hyperparameters and fit method inputs are provided as its command line arguments.
The training script saves the model artifacts in the /opt/ml/model once the training is completed.
SageMaker archives the artifacts under /opt/ml/model into model.tar.gz and save it to the S3 location specified to output_path Estimator parameter.
You can set Estimator metric_definitions parameter to extract model metrics from the training logs. Then you can monitor the training progress in the SageMaker console metrics.

I believe AWS needs to stop mass-producing verbose, redundant, wordy, scattered, and obsolete documents. AWS needs to understand A picture is worth thousand words.

Have diagrams and piece document parts together in a context with a clear objective to achieve.

Problem

AWS documentations need serious re-design and re-structuring. Just to understand how to train and save a model forces us going through dozens of scattered, fragmented, verbose, redundant documentations, which are often obsolete, incomplete, and sometime incorrect.

It is well-summarized in Why I think GCP is better than AWS:

It’s not that AWS is harder to use than GCP, it’s that it is needlessly hard; a disjointed, sprawl of infrastructure primitives with poor cohesion between them.

A challenge is nice, a confusing mess is not, and the problem with AWS is that a large part of your working hours will be spent untangling their documentation and weeding through features and products to find what you want, rather than focusing on cool interesting challenges.

Watch AI Simplified and see how simple and easy to understand the GCP AI stack is, and how ugly SageMaker is. I would strongly recommend moving to GCP and staying away from SageMaker. There will be no future for the technology which AWS cannot document by themselves.

Especially the SageMaker team keeps changing implementations without updating documents. Its roll-out was also inconsistent, e.g. SDK version 2 was rolled out in the SageMaker Studio making the AWS examples in Github incompatible without announcing it. Whereas SageMaker instance still had SDK 1, hence code worked in Instance but not in Studio.

It is mind-boggling, even insane, that we have to go through these many documents below to understand how to use the SageMaker SDK Estimator for training. How much time of developers does AWS want to waste?

Documents for Model Training

Train a Model with Amazon SageMaker

This document gives 20,000 feet overview of how SageMaker training but does not give any clue what to do.

Running a container for Amazon SageMaker training

This document gives an overview of how SageMaker training looks like. However, this is not up-to-date as it is based on SageMaker Containers which is obsolete.

WARNING: This package has been deprecated. Please use the SageMaker Training Toolkit for model training and the SageMaker Inference Toolkit for model serving.

Step 4: Train a Model

This document layouts the steps for training.

The Amazon SageMaker Python SDK provides framework estimators and generic estimators to train your model while orchestrating the machine learning (ML) lifecycle accessing the SageMaker features for training and the AWS infrastructures

Train a Model with the SageMaker Python SDK

To train a model by using the SageMaker Python SDK, you:

Prepare a training script

Create an estimator

Call the fit method of the estimator

Finally this document gives concrete steps and ideas. However still missing comprehensiv details about Environment Variables, Directory structure in the SageMaker docker container**, S3 for uploading code, placing data, S3 where the trained model is saved, etc.

Use TensorFlow with the SageMaker Python SDK

This documents is focused on TensorFlow Estimator implementation steps. Use Training a Tensorflow Model on MNIST Github example to accompany with to follow the actual implementation.

Documents for passing parameters and data locations

How Amazon SageMaker Provides Training Information

This section explains how SageMaker makes training information, such as training data, hyperparameters, and other configuration information, available to your Docker container.

This document finally gives the idea of how parameters and data are passed around but again, not comprehensive.

SageMaker Docker Container Environment Variables

This documentation is marked as deprecated but the only document which explains the SageMaker Environment Variables.

IMPORTANT ENVIRONMENT VARIABLES

SM_MODEL_DIR

SM_CHANNELS

SM_CHANNEL_{channel_name}

SM_HPS

SM_HP_{hyperparameter_name}

SM_CURRENT_HOST

SM_HOSTS

SM_NUM_GPUS

List of provided environment variables by SageMaker Containers

SM_NUM_CPUS

SM_LOG_LEVEL

SM_NETWORK_INTERFACE_NAME

SM_USER_ARGS

SM_INPUT_DIR

SM_INPUT_CONFIG_DIR

SM_OUTPUT_DATA_DIR

SM_RESOURCE_CONFIG

SM_INPUT_DATA_CONFIG

SM_TRAINING_ENV

Documents for SageMaker Docker Container Directory Structure

Running a container for Amazon SageMaker training

/opt/ml
??? input
?   ??? config
?   ?   ??? hyperparameters.json
?   ?   ??? resourceConfig.json
?   ??? data
?       ??? <channel_name>
?           ??? <input data>
??? model
?   ??? <model files>
??? output
    ??? failure

This document explains the directory structure and purpose of each directory.

The input

/opt/ml/input/config contains information to control how your program runs. hyperparameters.json is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. resourceConfig.json is a JSON-formatted file that describes the network layout used for distributed training. Since scikit-learn doesn’t support distributed training, we’ll ignore it here.

/opt/ml/input/data/<channel_name>/ (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it’s generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure.

/opt/ml/input/data/<channel_name>_<epoch_number> (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.

The output

/opt/ml/model/ is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the DescribeTrainingJob result.

/opt/ml/output is a directory where the algorithm can write a file failure that describes why the job failed. The contents of this file will be returned in the FailureReason field of the DescribeTrainingJob result. For jobs that succeed, there is no reason to write this file as it will be ignored.

However, this is not up-to-date as it is based on SageMaker Containers which is obsolete.

Documents for Model Saving

The information on where the trained model is saved and in what format are fundamentally missing. The training script needs to save the model under /opt/ml/model and the format and sub-directory structure depend on the frameworks e,g TensorFlow, Pytorch. This is because SageMaker deployment uses the Framework dependent model-serving, e,g. TensorFlow Serving for TensorFlow framework.

This is not clearly documented and causing confusions. The developer needs to specify which format to use and under which sub-directory to save.

To use TensorFlow Estimator training and deployment:

Deploy the trained model

Because we’re using TensorFlow Serving for deployment, our training script saves the model in TensorFlow’s SavedModel format.

amazon-sagemaker-examples/frameworks/tensorflow/code/train.py

    # Save the model
    # A version number is needed for the serving container
    # to load the model
    version = "00000000"
    ckpt_dir = os.path.join(args.model_dir, version)
    if not os.path.exists(ckpt_dir):
        os.makedirs(ckpt_dir)
    model.save(ckpt_dir)

The code is saving the model in /opt/ml/model/00000000 because this is for TensorFlow serving.

Using the SavedModel format

The save-path follows a convention used by TensorFlow Serving where the last path component (1/ here) is a version number for your model - it allows tools like Tensorflow Serving to reason about the relative freshness.

Train and serve a TensorFlow model with TensorFlow Serving

To load our trained model into TensorFlow Serving we first need to save it in SavedModel format. This will create a protobuf file in a well-defined directory hierarchy, and will include a version number. TensorFlow Serving allows us to select which version of a model, or "servable" we want to use when we make inference requests. Each version will be exported to a different sub-directory under the given path.

Documents for API

Basically the SageMaker SDK Estimator implements the CreateTrainingJob API for training part. Hence, better to understand how it is designed and what parameters need to be defined. Otherwise working on Estimators are like walking in the dark.

Example

Jupyter Notebook

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()
bucket = sagemaker_session.default_bucket()

metric_definitions = [
    {"Name": "train:loss", "Regex": ".*loss: ([0-9\\.]+) - accuracy: [0-9\\.]+.*"},
    {"Name": "train:accuracy", "Regex": ".*loss: [0-9\\.]+ - accuracy: ([0-9\\.]+).*"},
    {
        "Name": "validation:accuracy",
        "Regex": ".*step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_accuracy: ([0-9\\.]+).*",
    },
    {
        "Name": "validation:loss",
        "Regex": ".*step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: ([0-9\\.]+) - val_accuracy: [0-9\\.]+.*",
    },
    {
        "Name": "sec/sample",
        "Regex": ".* - \d+s (\d+)[mu]s/sample - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_accuracy: [0-9\\.]+",
    },
]

import uuid

checkpoint_s3_prefix = "checkpoints/{}".format(str(uuid.uuid4()))
checkpoint_s3_uri = "s3://{}/{}/".format(bucket, checkpoint_s3_prefix)

from sagemaker.tensorflow import TensorFlow

# --------------------------------------------------------------------------------
# 'trainingJobName' msut satisfy regular expression pattern: ^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}
# --------------------------------------------------------------------------------
base_job_name = "fashion-mnist"
hyperparameters = {
    "epochs": 2, 
    "batch-size": 64
}
estimator = TensorFlow(
    entry_point="fashion_mnist.py",
    source_dir="src",
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    role=role,
    input_mode='File',
    framework_version="2.3.1",
    py_version="py37",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    base_job_name=base_job_name,
    checkpoint_s3_uri=checkpoint_s3_uri,
    model_dir=False
)
estimator.fit()

fashion_mnist.py

import os
import argparse
import json
import multiprocessing

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, BatchNormalization
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from tensorflow.keras import backend as K

print("TensorFlow version: {}".format(tf.__version__))
print("Eager execution is: {}".format(tf.executing_eagerly()))
print("Keras version: {}".format(tf.keras.__version__))


image_width = 28
image_height = 28


def load_data():
    fashion_mnist = tf.keras.datasets.fashion_mnist
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

    number_of_classes = len(set(y_train))
    print("number_of_classes", number_of_classes)

    x_train = x_train / 255.0
    x_test = x_test / 255.0
    x_full = np.concatenate((x_train, x_test), axis=0)
    print(x_full.shape)

    print(type(x_train))
    print(x_train.shape)
    print(x_train.dtype)
    print(y_train.shape)
    print(y_train.dtype)

    # ## Train
    # * C: Convolution layer
    # * P: Pooling layer
    # * B: Batch normalization layer
    # * F: Fully connected layer
    # * O: Output fully connected softmax layer

    # Reshape data based on channels first / channels last strategy.
    # This is dependent on whether you use TF, Theano or CNTK as backend.
    # Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
    if K.image_data_format() == 'channels_first':
        x = x_train.reshape(x_train.shape[0], 1, image_width, image_height)
        x_test = x_test.reshape(x_test.shape[0], 1, image_width, image_height)
        input_shape = (1, image_width, image_height)
    else:
        x_train = x_train.reshape(x_train.shape[0], image_width, image_height, 1)
        x_test = x_test.reshape(x_test.shape[0], image_width, image_height, 1)
        input_shape = (image_width, image_height, 1)

    return x_train, y_train, x_test, y_test, input_shape, number_of_classes

# tensorboard --logdir=/full_path_to_your_logs

validation_split = 0.2
verbosity = 1
use_multiprocessing = True
workers = multiprocessing.cpu_count()


def train(model, x, y, args):
    # SavedModel Output
    tensorflow_saved_model_path = os.path.join(args.model_dir, "tensorflow/saved_model/0")
    os.makedirs(tensorflow_saved_model_path, exist_ok=True)

    # Tensorboard Logs
    tensorboard_logs_path = os.path.join(args.model_dir, "tensorboard/")
    os.makedirs(tensorboard_logs_path, exist_ok=True)

    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=tensorboard_logs_path,
        write_graph=True,
        write_images=True,
        histogram_freq=1,  # How often to log histogram visualizations
        embeddings_freq=1,  # How often to log embedding visualizations
        update_freq="epoch",
    )  # How often to write logs (default: once per epoch)

    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        metrics=['accuracy']
    )
    history = model.fit(
        x,
        y,
        shuffle=True,
        batch_size=args.batch_size,
        epochs=args.epochs,
        validation_split=validation_split,
        use_multiprocessing=use_multiprocessing,
        workers=workers,
        verbose=verbosity,
        callbacks=[
            tensorboard_callback
        ]
    )
    return history


def create_model(input_shape, number_of_classes):
    model = Sequential([
        Conv2D(
            name="conv01",
            filters=32,
            kernel_size=(3, 3),
            strides=(1, 1),
            padding="same",
            activation='relu',
            input_shape=input_shape
        ),
        MaxPooling2D(
            name="pool01",
            pool_size=(2, 2)
        ),
        Flatten(),  # 3D shape to 1D.
        BatchNormalization(
            name="batch_before_full01"
        ),
        Dense(
            name="full01",
            units=300,
            activation="relu"
        ),  # Fully connected layer
        Dense(
            name="output_softmax",
            units=number_of_classes,
            activation="softmax"
        )
    ])
    return model


def save_model(model, args):
    # Save the model
    # A version number is needed for the serving container
    # to load the model
    version = "00000000"
    model_save_dir = os.path.join(args.model_dir, version)
    if not os.path.exists(model_save_dir):
        os.makedirs(model_save_dir)
    print(f"saving model at {model_save_dir}")
    model.save(model_save_dir)


def parse_args():
    # --------------------------------------------------------------------------------
    # https://docs.python.org/dev/library/argparse.html#dest
    # --------------------------------------------------------------------------------
    parser = argparse.ArgumentParser()

    # --------------------------------------------------------------------------------
    # hyperparameters Estimator argument are passed as command-line arguments to the script.
    # --------------------------------------------------------------------------------
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=64)

    # /opt/ml/model
    # sagemaker.tensorflow.estimator.TensorFlow override 'model_dir'.
    # See https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/\
    # sagemaker.tensorflow.html#sagemaker.tensorflow.estimator.TensorFlow
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])

    # /opt/ml/output
    parser.add_argument("--output_dir", type=str, default=os.environ["SM_OUTPUT_DIR"])

    args = parser.parse_args()
    return args


if __name__ == "__main__":
    args = parse_args()
    print("---------- key/value args")
    for key, value in vars(args).items():
        print(f"{key}:{value}")

    x_train, y_train, x_test, y_test, input_shape, number_of_classes = load_data()
    model = create_model(input_shape, number_of_classes)

    history = train(model=model, x=x_train, y=y_train, args=args)
    print(history)
    
    save_model(model, args)
    results = model.evaluate(x_test, y_test, batch_size=100)
    print("test loss, test accuracy:", results)

SageMaker Console

Notebook output

2021-09-03 03:02:04 Starting - Starting the training job...
2021-09-03 03:02:16 Starting - Launching requested ML instancesProfilerReport-1630638122: InProgress
......
2021-09-03 03:03:17 Starting - Preparing the instances for training.........
2021-09-03 03:04:59 Downloading - Downloading input data
2021-09-03 03:04:59 Training - Downloading the training image...
2021-09-03 03:05:23 Training - Training image download completed. Training in progress.2021-09-03 03:05:23.966037: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-09-03 03:05:23.969704: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-09-03 03:05:24.118054: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-09-03 03:05:26,842 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
2021-09-03 03:05:26,852 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-09-03 03:05:27,734 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/usr/local/bin/python3.7 -m pip install -r requirements.txt
WARNING: You are using pip version 21.0.1; however, version 21.2.4 is available.
You should consider upgrading via the '/usr/local/bin/python3.7 -m pip install --upgrade pip' command.

2021-09-03 03:05:29,028 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-09-03 03:05:29,045 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-09-03 03:05:29,062 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-09-03 03:05:29,072 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {},
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "batch-size": 64,
        "epochs": 2
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {},
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "fashion-mnist-2021-09-03-03-02-02-305",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-east-1-316725000538/fashion-mnist-2021-09-03-03-02-02-305/source/sourcedir.tar.gz",
    "module_name": "fashion_mnist",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "fashion_mnist.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"batch-size":64,"epochs":2}
SM_USER_ENTRY_POINT=fashion_mnist.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=fashion_mnist
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-1-316725000538/fashion-mnist-2021-09-03-03-02-02-305/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch-size":64,"epochs":2},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"fashion-mnist-2021-09-03-03-02-02-305","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-316725000538/fashion-mnist-2021-09-03-03-02-02-305/source/sourcedir.tar.gz","module_name":"fashion_mnist","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"fashion_mnist.py"}
SM_USER_ARGS=["--batch-size","64","--epochs","2"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_HP_BATCH-SIZE=64
SM_HP_EPOCHS=2
PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python37.zip:/usr/local/lib/python3.7:/usr/local/lib/python3.7/lib-dynload:/usr/local/lib/python3.7/site-packages

Invoking script with the following command:

/usr/local/bin/python3.7 fashion_mnist.py --batch-size 64 --epochs 2


TensorFlow version: 2.3.1
Eager execution is: True
Keras version: 2.4.0
---------- key/value args
epochs:2
batch_size:64
model_dir:/opt/ml/model
output_dir:/opt/ml/output

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1