What is a Machine Learning CI/CD, and why should you use it?
Introduced in 2015, Jupyter Notebook is a powerful platform that enables data scientists to quickly test different approaches to their machine learning problems. Still, when it comes to keeping your project under source control (a.k.a. version control) or collaborating with other data scientists and ML engineers, Jupyter Notebook is no longer a good option.
Ideally, we need a platform that keeps track of changes and covers all steps of a machine learning problem, from preprocessing the data to training the model and finally deploying it to production. You can think of it as Google Docs vs. working in Microsoft Word on your computer (sorry Microsoft! We will use your Microsoft Edge later in the demo), where you can easily collaborate with others and keep track of the changes and who made them.
Here we describe an end-to-end machine learning framework that uses GitLab as the source control system and its powerful CI/CD tool to train and deploy our models on SageMaker.
Video version of this tutorial
The video version of this tutorial is available on our YouTube channel:
Overview of the platform
This platform is technically a GitLab repository with two branches, main and development. The development branch is where you try different approaches to solve your machine learning problem, and the main branch is only for the deployment of the selected model. Whenever you want to see how your development code performs, you start a merge request (a.k.a. pull request) from the development branch to the main branch.
The GitLab CI will build your training image according to your Dockerfile and push it to ECR. After that, it will start a training job using the created image, your training script, and the input data located in an S3 bucket.
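To make the training step concrete, here is a hedged sketch of the request the CI could assemble for SageMaker. The function name, instance type, and bucket/project arguments are illustrative placeholders, not the template's exact code; the dict shape follows the SageMaker `create_training_job` API.

```python
# Sketch of assembling a SageMaker training job request in the CI.
# All names (function, instance type, channel name) are illustrative.
from datetime import datetime, timezone

def build_training_job_request(project, image_uri, role_arn, bucket):
    """Build the kwargs for boto3's sagemaker create_training_job call."""
    # Unique job name: project name plus a UTC timestamp.
    job_name = f"{project}-{datetime.now(timezone.utc):%Y-%m-%d-%H-%M-%S}"
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,        # the image CI pushed to ECR
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                # Matches the bucket layout described below.
                "S3Uri": f"s3://{bucket}/{project}/",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/{project}/output"},
        "ResourceConfig": {"InstanceType": "ml.m5.large",
                           "InstanceCount": 1,
                           "VolumeSizeInGB": 10},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

# In the pipeline, this dict would be sent with something like:
# boto3.client("sagemaker").create_training_job(**build_training_job_request(...))
```

The actual training-job.py in the template may differ in details; the point is that the CI only needs the ECR image URI, an IAM role, and the S3 paths to kick off a job.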
You can see the training job description on the merge request page, and as soon as the training job is finished, you can also see the performance metrics there.
If you and your teammates are satisfied with the results, you can merge the changes into the main branch. This triggers the deploy stage of GitLab CI, which automatically creates a real-time inference endpoint from the latest training job.
The following picture shows the overview of branches:
The following picture shows the overview of the training pipeline:
S3 bucket diagram
This platform expects an S3 bucket with the following hierarchy:
```
sample-s3-bucket
┗ name-of-this-project
┃ ┗ input data file or files
```
After running the first training job, your S3 bucket will look like this, where reports.csv is the full history of your training job descriptions along with the performance metrics you defined, and the output folder is where SageMaker saves the training jobs' model artifacts:
```
sample-s3-bucket
┗ name-of-this-project
┃ ┣ input data file or files
┃ ┣ output/
┃ ┗ reports.csv
```
The initial repository is composed of these files:
- .gitlab-ci.yml: Configuration of the GitLab CI pipeline
- Dockerfile: Configuration of our training and serving Docker image
- training-job.py: The Python script that starts the training job
- training-script.py: Used by training-job.py as the training job's entry point
- serve-script.py: A simple Flask app that will be used for the endpoint
- deploy.py: The Python script that, by default, creates an endpoint from the latest training job
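The repository's actual .gitlab-ci.yml holds the full pipeline configuration; as a rough sketch of how these files could be wired together, it might look something like the following (stage names, rules, and variables such as `$ECR_REPO` are illustrative, not the template's exact contents):

```yaml
stages:
  - build-and-train
  - deploy

build-and-train:
  stage: build-and-train
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
  script:
    # ECR authentication (e.g. via `aws ecr get-login-password`) omitted here.
    - docker build -t $ECR_REPO:$CI_COMMIT_SHORT_SHA .
    - docker push $ECR_REPO:$CI_COMMIT_SHORT_SHA
    - python training-job.py

deploy:
  stage: deploy
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
  script:
    - python deploy.py
```

The key idea is the two triggers: the merge request event runs the build-and-train stage, while a commit landing on main (the accepted merge) runs the deploy stage.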
Example project to see how it works
Here we used a toy dataset, Boston housing, with the RandomForest algorithm from the scikit-learn package for training, and a Flask app for real-time prediction. If you want to use any other package for training or inference, include its pip installation in the Dockerfile.
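For orientation, here is a hedged sketch of what a training script can look like inside a SageMaker training container. The fixed `/opt/ml` paths are SageMaker's container contract; everything else (the function names, the hyperparameters, the assumed `target` column) is illustrative. Note that `load_boston` was removed from recent scikit-learn releases, so this sketch reads the dataset from a CSV in the input channel instead.

```python
# Sketch of a SageMaker-style training script (illustrative names).
# SageMaker mounts data and config at fixed paths inside the container:
#   /opt/ml/input/config/hyperparameters.json  - hyperparameters (as strings)
#   /opt/ml/input/data/<channel>/              - input data per channel
#   /opt/ml/model/                             - save artifacts here
import json
import pathlib
import pickle

PREFIX = pathlib.Path("/opt/ml")

def load_hyperparameters(path=PREFIX / "input/config/hyperparameters.json"):
    """SageMaker passes all hyperparameters as strings; cast what you need."""
    raw = json.loads(path.read_text()) if path.exists() else {}
    return {"n_estimators": int(raw.get("n_estimators", "100")),
            "max_depth": int(raw.get("max_depth", "10"))}

def train(params, data_dir=PREFIX / "input/data/training",
          model_dir=PREFIX / "model"):
    # Imported here so the script fails with a clear message if the
    # Dockerfile forgot to `pip install scikit-learn pandas`.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # Assumes a single CSV with a "target" column (an assumption of this sketch).
    df = pd.read_csv(next(data_dir.glob("*.csv")))
    X, y = df.drop(columns=["target"]), df["target"]
    model = RandomForestRegressor(**params).fit(X, y)

    # Everything written to /opt/ml/model ends up in the S3 model artifact.
    with open(model_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)

if __name__ == "__main__":
    train(load_hyperparameters())
```

Swapping in a different algorithm only means changing the body of `train`; the surrounding I/O contract stays the same.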
In this repository, you usually work on the development branch, and as soon as you start a merge request (from development to the main branch), a pipeline is started. This pipeline is composed of two steps: the first builds a Docker image according to the configuration we provided in the Dockerfile and pushes it to the Elastic Container Registry (ECR).
The second step submits a training job to SageMaker using the resulting image, the S3 data path we provided, and the training script that we have in our repository.
The training script is where we try different algorithms to solve our machine learning problem. Once the training job is submitted, the pipeline ends (to reduce pipeline usage time, which is usually limited), and a comment appears on the merge request page showing information about the submitted training job: its name, its artifact location, the hyperparameters used for the job, a CloudWatch link where you can see exactly what is happening in the training job right now, and finally the URL of the endpoint that will be created if you accept this merge request.
The duration of the training job depends on the size of your dataset and the algorithm you have chosen; once the training job is finished, a comment with the training job's metrics and a link to the entire history of performance metrics appears on the merge request page.
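That metrics comment is derived from SageMaker's training job description. A minimal sketch of the parsing involved is below; the function name is illustrative, while the response shape mirrors boto3's `describe_training_job` output.

```python
# Sketch of pulling final metrics out of a DescribeTrainingJob response.
# The response dict shape mirrors boto3's describe_training_job output.

def extract_metrics(description):
    """Map MetricName -> Value from SageMaker's FinalMetricDataList."""
    return {m["MetricName"]: m["Value"]
            for m in description.get("FinalMetricDataList", [])}

# In the pipeline, the description would come from something like:
# description = boto3.client("sagemaker").describe_training_job(
#     TrainingJobName=job_name)
```

Note that `FinalMetricDataList` is only populated if metric definitions (regular expressions matched against the job's logs) were configured when the training job was created.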
Merge requests are a powerful way to collaborate with others on your project: your teammates can see which commit, with which changes, started the merge request, and everyone can vote on whether to accept it based on the code and its resulting metrics.
Once a merge request is accepted, a pipeline is started that by default deploys the model from the latest training job. You can change this preference from the latest model to the one with the lowest or highest value of one of your metrics.
Note: Don’t forget to uncheck the “Delete source branch when merge request is accepted” option so you don’t lose your development branch. Alternatively, you can set main and development as protected branches so they can’t be deleted this way.
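The deploy step can be sketched as three SageMaker API requests: wrap the chosen training job's artifact in a model, create an endpoint configuration for it, and create the endpoint. The builder function, instance type, and variant name below are illustrative placeholders; the dict shapes follow the SageMaker `create_model`, `create_endpoint_config`, and `create_endpoint` APIs.

```python
# Hedged sketch of what a deploy script does. All names are placeholders.

def build_deploy_requests(job_name, image_uri, model_data_url, role_arn):
    """Return the three request dicts a deploy script would send to SageMaker."""
    model = {
        "ModelName": job_name,
        "PrimaryContainer": {
            "Image": image_uri,              # same image, run in serve mode
            "ModelDataUrl": model_data_url,  # the job's model.tar.gz in S3
        },
        "ExecutionRoleArn": role_arn,
    }
    config = {
        "EndpointConfigName": job_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": job_name,
            "InstanceType": "ml.t2.medium",
            "InitialInstanceCount": 1,
        }],
    }
    endpoint = {"EndpointName": job_name, "EndpointConfigName": job_name}
    return model, config, endpoint

# sm = boto3.client("sagemaker")
# model, config, endpoint = build_deploy_requests(...)
# sm.create_model(**model)
# sm.create_endpoint_config(**config)
# sm.create_endpoint(**endpoint)
```

Reusing the training job name for the model, endpoint config, and endpoint keeps the mapping between a deployed endpoint and the run that produced it obvious.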
Further development to this platform
This platform can be extended and personalized for your specific project. For instance, you can add a staging branch between the main and development branches, where you deploy your model to a staging environment to test it before it goes to production. Another possible improvement is adding an endpoint-update step before deployment, so that instead of creating a new URL for the new model, you reuse the previous URL with an updated configuration. Here we also did not cover the data processing step, but a potential solution is to use the S3 data versioning feature, so the S3 path to your data won’t change each time you upload a new version of the data or apply a new preprocessing method. For more information about S3 data versioning, see the official documentation.
Alternative solutions to the same problem
Here we used GitLab CI to build and push our Docker image, submit the training job, and deploy the final model. If you prefer another Git service provider, you just need to change the .yml file and the API calls according to that service. For a similar project on GitHub, check out Sung Kim’s repository. These goals can also be achieved in many other ways. One that is more tightly integrated with AWS is SageMaker Projects, where you set up a connection to your repository and SageMaker takes care of the remaining steps. For now, only GitHub, Bitbucket, and CodeCommit are available as options, but you can establish repository mirroring from GitLab to one of them to work around this limitation.
This post showed you how to set up your end-to-end machine learning CI/CD pipeline using GitLab as source control and SageMaker for storing data, training, and real-time prediction. The project template is available at our GitLab Repository. For a demo of this project and our other projects, check out our YouTube Channel.