Machine Learning with Julia on AWS SageMaker

June 16, 2021





Why should you run your Julia program on SageMaker?

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to quickly build, train, and deploy machine learning (ML) models without worrying about the infrastructure. Julia is a high-level, high-performance, dynamic programming language. While it is a general-purpose language and can be used to design various applications, many of its features are well suited for numerical analysis and computational science. By running Julia on SageMaker, you will be able to get the most out of this programming language as you can easily access high-performance instances.



Julia on AWS

How to use Julia in SageMaker Notebook Instances

An Amazon SageMaker notebook instance is a machine learning (ML) compute instance running the Jupyter Notebook App. SageMaker manages provisioning-related resources. You can use Jupyter notebooks in your notebook instance to prepare and process data, write code to train/validate models, and deploy them as SageMaker endpoints. You can create multiple notebooks within your notebook instance.

To create a notebook instance, go to Notebook instances in the AWS SageMaker console and click the Create notebook instance button:



On the following page, you should define the configuration of your notebook instance:

  1. Notebook instance name
  2. Notebook instance type: The instance types vary based on the hardware. For more information about the configuration and pricing, check out the AWS SageMaker Pricing page.
  3. Volume size in GB: By default, 5 GB is allocated to the instance. If you need more storage, you can increase it up to 16 TB.



For now, SageMaker Notebook Instances do not include Julia as an available kernel, so it must be installed manually and added to the notebook instance in the Lifecycle configuration part of this page. A lifecycle configuration provides shell scripts that run when the notebook instance is created and each time it starts. To add a lifecycle configuration:

  1. Click on Lifecycle Configuration and select Create a new lifecycle configuration.

  2. Choose a name for the lifecycle configuration.
  3. Paste this script in the Create notebook section. It installs and activates Julia the first time your notebook instance is created:
#!/bin/bash
set -e
sudo -u ec2-user -i <<EOF
echo ". /home/ec2-user/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc
conda create --yes --prefix ~/SageMaker/envs/julia
curl --silent https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.0-linux-x86_64.tar.gz | tar xzf -
cp -R julia-1.6.0/* ~/SageMaker/envs/julia/
mkdir -p ~/SageMaker/envs/julia/etc/conda/activate.d
echo 'export JULIA_DEPOT_PATH=~/SageMaker/envs/julia/depot' >> ~/SageMaker/envs/julia/etc/conda/activate.d/env.sh
echo -e 'empty!(DEPOT_PATH)\npush!(DEPOT_PATH,raw"/home/ec2-user/SageMaker/envs/julia/depot")' >> ~/SageMaker/envs/julia/etc/julia/startup.jl
conda activate /home/ec2-user/SageMaker/envs/julia
julia --eval 'using Pkg; Pkg.add("IJulia"); using IJulia'
EOF
  4. Paste this script in the Start notebook section. It registers the Julia kernel each time you start your notebook instance:
#!/bin/bash
set -e
sudo -u ec2-user -i <<EOF
echo ". /home/ec2-user/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc
conda run --prefix ~/SageMaker/envs/julia/ julia --eval 'using IJulia; IJulia.installkernel("Julia")'
EOF
  5. Click on the Create Configuration button.
  6. Now, you can leave other settings as default and click on Create notebook instance.

Your notebook instance will be ready to use in minutes. Starting an instance for the first time, or installing packages for the first time, takes a while longer.

Note: If you have a .jl file and you want to run it outside of a notebook, just open a terminal (via File->New->Terminal) and run the following command:

conda run --prefix ~/SageMaker/envs/julia/ julia SageMaker/path_to_jl_file
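
A small script is enough to confirm that the command above picks up the intended Julia installation. The sketch below is a hypothetical `check_env.jl` (the file name and the `environment_summary` helper are illustrative and not part of the original setup); it uses only the standard library.

```julia
# check_env.jl (hypothetical file name): report which Julia is running.

# Build a one-line summary of the running Julia session.
function environment_summary()
    return string(
        "Julia ", VERSION,
        " | threads: ", Threads.nthreads(),
        " | depot: ", first(DEPOT_PATH),
    )
end

println(environment_summary())
```

Saved somewhere under `~/SageMaker/`, it can be run with the same `conda run` command; if the lifecycle script ran as intended, the depot path should point into `~/SageMaker/envs/julia/depot`.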

Automation with CloudFormation

This whole process can be automated through an Amazon CloudFormation template. To create notebooks with a CloudFormation template:

  1. Click on this link while you are logged in to your AWS console.
  2. Choose a name for the stack and the notebook, and specify the volume size and the instance type.
  3. Check the box labeled "I acknowledge that AWS CloudFormation might create IAM resources."
  4. Click Create stack.

Note: This process creates the notebook in the eu-west-1 region. If you prefer another region, change it using the region list at the top right of the AWS console.

CloudFormation will make a notebook instance for you along with the Lifecycle configuration. You can always delete all the created resources with the Delete option of the stack.



Hands-on Data Science with Julia

Julia's main advantage over other languages used for machine learning is its speed. There are two main reasons for this: first, Julia is just-in-time (JIT) compiled to native machine code, and second, it has been designed for parallelism. Now that we have prepared our Julia environment on SageMaker, we can proceed to the next step and see Julia in action. For this, we are going to do a classification task on the Iris Dataset.
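
To make the parallelism point concrete before moving on, here is a minimal sketch of multi-threading with the standard library only. The function name `threaded_sum_of_squares` and its `ntasks` keyword are illustrative choices, not something from the original post.

```julia
using Base.Threads: nthreads, @spawn

# Sum of squares, split into chunks that are processed as separate tasks.
function threaded_sum_of_squares(xs::AbstractVector{<:Real}; ntasks::Int = nthreads())
    chunk_size = max(1, cld(length(xs), ntasks))
    tasks = map(Iterators.partition(xs, chunk_size)) do chunk
        @spawn sum(abs2, chunk)   # each chunk is summed on an available thread
    end
    # Combine the partial sums; `init` covers the empty-input case.
    return sum(fetch, tasks; init = zero(eltype(xs)))
end

println(threaded_sum_of_squares(1:1_000_000))
```

Julia uses a single thread unless told otherwise, so start it with `julia --threads 4` (or set the `JULIA_NUM_THREADS` environment variable) to see the tasks actually run in parallel.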

Iris Dataset

Here we use the Iris Dataset to try out Julia for machine learning. The dataset contains measurements of Iris flowers, such as petal length and width, together with the species of each flower. For more information about the data, see here.

Installing and using packages on Julia

Before we start, we need to install the packages required for the job. These packages don’t come with the default Julia installation, so they have to be installed separately. The installation may take a while, but it only has to be done once; the packages are available the next time you use the environment. To install and use packages in Julia, the following lines of code can be used:

import Pkg
Pkg.add(["LIBSVM", "RDatasets", "MLBase", "Plots", "StatsPlots"])

using RDatasets
using MLBase
using StatsPlots
using Random
using Statistics  # standard library; provides mean, used later for the accuracy
using LIBSVM
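
Since installed packages persist between sessions, it can be handy to check what is already available before calling `Pkg.add` again. The following is a small sketch under one assumption: it relies on `Base.find_package`, which is an unexported Base function, and the helper name `missing_packages` is made up for this example.

```julia
# Return the names from `names` that Julia cannot find in the current environment.
function missing_packages(names::AbstractVector{<:AbstractString})
    return [String(name) for name in names if Base.find_package(String(name)) === nothing]
end

required = ["LIBSVM", "RDatasets", "MLBase", "Plots", "StatsPlots"]
to_install = missing_packages(required)

if isempty(to_install)
    println("All required packages are already installed.")
else
    println("Missing packages: ", join(to_install, ", "))
end
```

When `to_install` is not empty, passing it to `Pkg.add` installs only what is missing.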

Loading the Dataset

After adding the necessary packages, we load the Iris dataset into a DataFrame object.

iris = dataset("datasets", "iris")

Visualizing the distribution of the features

First, let’s get an initial overview of our features by drawing a pair plot. Here we build the pair plot by hand:

plots = [[], [], [], []]
for (i, yaxis) in enumerate(["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"])
    for (j, xaxis) in enumerate(["PetalWidth", "PetalLength", "SepalWidth", "SepalLength"])
        if i == 5 - j
            push!(plots[i], histogram(iris[!, yaxis], label=yaxis, size=(2000, 1500), dpi=200, group =iris.Species))
        else
            push!(plots[i], scatter(iris[!, xaxis], iris[!, yaxis], group =iris.Species, size=(2000, 1500), markerstrokewidth=0, dpi=200, xlabel=xaxis, ylabel=yaxis))
        end
    end
end

l = @layout [a b c d ; e f g h ; i j k l ; m n o p]

plt =  plot(plots[4][1], plots[4][2], plots[4][3], plots[4][4], 
            plots[3][1], plots[3][2], plots[3][3], plots[3][4], 
            plots[2][1], plots[2][2], plots[2][3], plots[2][4], 
            plots[1][1], plots[1][2], plots[1][3], plots[1][4], layout = l)
mkpath("img")  # make sure the output directory exists before saving
savefig(plt, "img/pairplot_1.png")
display(plt)





Preprocessing data

Before training, the dataset has to be put into a form the model can use. For instance, here we encode the target variable (Setosa, Virginica, and Versicolor) into numeric values.

# Load features and labels into separate arrays
X = Matrix(iris[:,1:4])
irislabels = iris[:,5]

# Encode species names as the integer labels 1, 2, 3
irislabelsmap = labelmap(irislabels)
y = labelencode(irislabelsmap, irislabels)
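
To show what the encoding step amounts to, here is a hand-rolled sketch in plain Julia. It is not MLBase's implementation; `encode_labels` is a hypothetical helper that numbers labels in order of first appearance, which is assumed here to match what `labelmap` and `labelencode` produce.

```julia
# Assign the integers 1, 2, 3, ... to labels in order of first appearance.
function encode_labels(labels::AbstractVector)
    mapping = Dict{eltype(labels), Int}()
    codes = Vector{Int}(undef, length(labels))
    for (i, label) in enumerate(labels)
        # Reuse the existing code, or register the next unused integer.
        codes[i] = get!(mapping, label, length(mapping) + 1)
    end
    return codes, mapping
end

species = ["setosa", "setosa", "versicolor", "virginica", "versicolor"]
codes, mapping = encode_labels(species)
println(codes)
println(mapping)
```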

In a machine learning problem, a subset of the data is usually put aside as a test set. Once training is finished, the model's performance is evaluated on this unseen data to check that it generalizes. The code below puts roughly 70% of the rows in the training set (randsubseq keeps each index independently with probability 0.7) and the rest in the test set.

# Split the dataset into train and test sets
trainids = randsubseq(1:length(y), 0.7)
testids = setdiff(1:length(y), trainids)

X_train = X[trainids,:]
y_train = y[trainids]

X_test = X[testids,:]
y_test = y[testids]
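
Because `randsubseq` samples each index independently, the split sizes vary from run to run. When an exact and reproducible split is preferred, a seeded permutation is one alternative. This is a sketch, not the method used in this post; `split_indices` and the seed value 42 are arbitrary choices for illustration.

```julia
using Random

# Shuffle 1:n with a seeded generator and cut it at the requested fraction.
function split_indices(n::Integer, train_fraction::Real; rng = MersenneTwister(42))
    perm = randperm(rng, n)
    ntrain = round(Int, train_fraction * n)
    return sort(perm[1:ntrain]), sort(perm[ntrain+1:end])
end

train_idx, test_idx = split_indices(150, 0.7)
println(length(train_idx), " training rows, ", length(test_idx), " test rows")
```

The two index vectors can be used exactly like `trainids` and `testids` above.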

LIBSVM package

Now that we have prepared our train and test data, we will train a Support Vector Machine classifier on the training data using the LIBSVM package. X_train' is the transpose of X_train; it is needed because svmtrain expects one observation per column rather than per row.

model = svmtrain(X_train', y_train)

Report accuracy on the test set

Once the model is trained, we can predict the labels for the test set and evaluate its performance using the accuracy metric.

predictions_SVM, decision_values = svmpredict(model, X_test')
mean(predictions_SVM .== y_test) * 100
>>> 93.87755102040816

predictions_SVM, decision_values = svmpredict(model, X_train')
mean(predictions_SVM .== y_train) * 100
>>> 99.00990099009901

As you can see, the model scores higher on the training set than on the test set, which suggests slight overfitting, but its accuracy on the test set is still acceptable.
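
A single accuracy number hides which species get confused with each other. The sketch below counts predictions per class in plain Julia; the helper names are hypothetical, and the toy labels are made up for illustration (they are not results from the Iris model).

```julia
# Count how often each true class (rows) is predicted as each class (columns).
function confusion_matrix(truth::AbstractVector{<:Integer},
                          predicted::AbstractVector{<:Integer},
                          nclasses::Integer)
    length(truth) == length(predicted) || throw(ArgumentError("length mismatch"))
    counts = zeros(Int, nclasses, nclasses)
    for (t, p) in zip(truth, predicted)
        counts[t, p] += 1
    end
    return counts
end

# Fraction of correct predictions per true class (diagonal over row sums).
per_class_accuracy(counts::AbstractMatrix) =
    [counts[k, k] / sum(counts[k, :]) for k in 1:size(counts, 1)]

# Toy labels for illustration only.
truth     = [1, 1, 2, 2, 3, 3]
predicted = [1, 1, 2, 3, 3, 3]
counts = confusion_matrix(truth, predicted, 3)
println(counts)
println(per_class_accuracy(counts))
```

With the variables from this post, the call would be `confusion_matrix(y_test, predictions_SVM, 3)`, assuming the predictions come back as the same integer labels used for training.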



Monitor what percentage of provisioned resources have been utilized with CloudWatch

A list of available notebook instances and their prices can be seen on the Amazon SageMaker Pricing page. An exciting advantage of cloud computing is that you can choose from a variety of on-demand resources, but overspending on overprovisioned resources, or running into failures on underprovisioned ones, is always a concern. To select the optimal SageMaker instance, we can monitor hardware utilization (CPU, memory, disk) with CloudWatch. These metrics help us choose the best instance in terms of price and power. Please see our post on ‘How to choose the best training instance on SageMaker’ for more information.
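
As a quick complement to CloudWatch, Julia itself can report what hardware the current instance exposes. This is only a point-in-time snapshot, not monitoring over time; `instance_resources` is a hypothetical helper built on the standard `Sys` module.

```julia
# Summarize the hardware visible to the current Julia process.
function instance_resources()
    gib = 1024^3
    return (
        cpu_threads = Sys.CPU_THREADS,
        total_memory_gib = round(Sys.total_memory() / gib; digits = 2),
        free_memory_gib = round(Sys.free_memory() / gib; digits = 2),
    )
end

println(instance_resources())
```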

Setting Up Monitoring

By default, SageMaker Notebook Instances don’t send utilization metrics to CloudWatch, so this has to be configured manually. To do this, run the following commands in the notebook terminal, available via File->New->Terminal.

$ cd SageMaker

$ wget https://raw.githubusercontent.com/DataChefHQ/BlogProjects/main/julia_on_sagemaker/config.json

$ sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:config.json -s

Monitoring Usage Over Time

Now that the configuration is set up, we can see our resource usage in CloudWatch. To access the notebook metrics, go to CloudWatch from the AWS console. There, under Metrics, in the custom namespaces, you can find a metrics group called “JuliaAWS-notebook-metrics-1” that includes the desired metrics. Here we ran the code from the previous section (Iris dataset classification), and you can see the CPU usage metrics for each core (we used an ml.t2.medium instance, which has two CPU cores):



We can also check out our memory (RAM) usage:



Differential Equations in Julia

Solving differential equations is one of the important areas in scientific computing. Julia has many advantages over other options in this field, including high efficiency, built-in parallelism, and GPU compatibility. In fact, one comparative study found Julia well ahead of other options, including Matlab and Python.

Define the Lorenz attractor

To see how well Julia handles solving and visualizing differential equations, we visualize the famous Lorenz system. To get the most out of Julia in solving differential equations, you should check out the DifferentialEquations.jl package.

The Lorenz attractor can be defined as follows:

Base.@kwdef mutable struct Lorenz
    dt::Float64 = 0.02 # step size
    σ::Float64 = 10
    ρ::Float64 = 28
    β::Float64 = 8/3
    x::Float64 = 1 # initial x coordinate
    y::Float64 = 1 # initial y coordinate
    z::Float64 = 1 # initial z coordinate
end

Defining the Step function

This function computes the derivatives and advances each coordinate (i.e. x, y, z) in place by one time step.

function step!(l::Lorenz)
    dx = l.σ * (l.y - l.x);         l.x += l.dt * dx
    dy = l.x * (l.ρ - l.z) - l.y;   l.y += l.dt * dy
    dz = l.x * l.y - l.β * l.z;     l.z += l.dt * dz
end

attractor = Lorenz()
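
For comparison, here is a sketch of the same idea written without mutation: a textbook forward Euler integrator that returns the whole trajectory. The names `lorenz_rhs` and `euler_trajectory` are hypothetical. Note one difference from `step!` above: this version updates all three coordinates from the same old state, whereas `step!` reuses the freshly updated `x` and `y`, so the numbers differ slightly.

```julia
# Right-hand side of the Lorenz system for a state (x, y, z).
function lorenz_rhs(state; σ = 10.0, ρ = 28.0, β = 8 / 3)
    x, y, z = state
    return (σ * (y - x), x * (ρ - z) - y, x * y - β * z)
end

# Take `nsteps` forward Euler steps of size `dt`, returning every visited state.
function euler_trajectory(initial; dt = 0.02, nsteps = 1500)
    states = [Float64.(initial)]
    for _ in 1:nsteps
        current = states[end]
        derivative = lorenz_rhs(current)
        push!(states, current .+ dt .* derivative)
    end
    return states
end

trajectory = euler_trajectory((1.0, 1.0, 1.0); nsteps = 3)
println(trajectory[2])
```

Forward Euler is the simplest possible scheme and loses accuracy quickly on a chaotic system like this one; it is fine for a picture, while an adaptive solver is the better choice when the numbers matter.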

Plotting the Lorenz attractor as an animated GIF file

# initialize a 3D plot with 1 empty series
plt = plot3d(
    1,
    xlim = (-30, 30),
    ylim = (-30, 30),
    zlim = (0, 60),
    title = "Lorenz Attractor",
    marker = 2,
    show=true
)

# build an animated gif by pushing new points to the plot, saving every 5th frame
@gif for i=1:1500
    step!(attractor)
    push!(plt, attractor.x, attractor.y, attractor.z)
end every 5



Julia AWS SDK

The AWS Software Development Kit (SDK) for Julia simplifies the use of AWS services by providing a set of consistent and familiar libraries for Julia developers. It supports different AWS services such as Amazon S3, Amazon SageMaker, Amazon SES, and some others.

Here we show an example usage of AWS SDK for S3 buckets to download and upload files.

Note: For now, AWS doesn’t provide an official SDK for Julia, so these SDKs (that have been developed by Julia developers) may not be up-to-date for some AWS services.

S3 Buckets:

Amazon Simple Storage Service (Amazon S3) is storage for the Internet. Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web.

Install the AWS SDK

First, we need to install the AWS SDK package:

import Pkg
Pkg.add("AWS")

Download a CSV file from the S3 bucket

To download a file from an S3 bucket, the following script can be used:

using AWS: @service
@service S3

s3_bucket_name = "sample-s3-bucket-name"
file_path = "file.csv" 
output_file = S3.get_object(s3_bucket_name , file_path)

open("output.csv", "w") do file  # "w" overwrites; "a" would append on every run
    write(file, output_file)
end
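
Once the object body is on disk, it can be read back with the standard library alone. The sketch below uses a stand-in payload instead of a real S3 response and assumes the body arrives as a byte vector (the exact return type may depend on the SDK version); `save_bytes` and `read_simple_csv` are hypothetical helpers, and the parser handles only simple CSV files without quoted fields.

```julia
# Write raw bytes to `path`, replacing any existing file, and return the path.
function save_bytes(path::AbstractString, bytes::AbstractVector{UInt8})
    open(path, "w") do io
        write(io, bytes)
    end
    return path
end

# Split a simple CSV file (no quoted fields) into rows of string fields.
function read_simple_csv(path::AbstractString)
    return [split(line, ',') for line in eachline(path) if !isempty(strip(line))]
end

# Stand-in payload for illustration; in practice this would be the S3 object body.
payload = Vector{UInt8}("species,count\nsetosa,50\n")
path = save_bytes(joinpath(mktempdir(), "output.csv"), payload)
rows = read_simple_csv(path)
println(rows)
```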

Upload a CSV file to the S3 bucket

To upload a file from your notebook instance to an S3 bucket, the following code can be used:

using AWS: @service  # needed when this snippet is run on its own
@service S3

body = read("output.csv", String)
s3_bucket_name = "sample-s3-bucket-name"
destination_file_name = "output.csv"
output_file = S3.put_object(s3_bucket_name, destination_file_name , Dict("body" => body))



Conclusion

In this post, we set up a Julia environment on AWS SageMaker and used it for a hands-on classification task. Working in the cloud gave us access to a wide range of instance types and the ability to monitor our resource usage. The code and files associated with this post are available in this repository.
