Engineering Lab #1 — TEAM 1: An MLOps Tale about operationalising MLFlow and PyTorch

Ivan Nardini
Published in MLOps.community · 15 min read · Apr 7, 2021

written by John, Michel, Alexey and Varuna.

Premises

This article is part of the Engineering Labs series, a collection of write-ups in which each team reviews its work on the initiative. This time you will read about TEAM 1's News Classification solution.
If you are interested in learning more about the initiative and how to join, here you can find all the information you need.

HAPPY READING!

Introduction

Machine learning has moved from hype to providing real productivity for more and more companies. As this happens, these companies are running into the difficulties that come with putting machine learning into production settings. MLOps is a discipline and culture that has arisen in response; it aims to alleviate these difficulties by unifying the experimentation, development, and operations parts of a machine learning product as much as possible.

While most ML practitioners appreciate the need for MLOps, it can be difficult to learn how to do it, for multiple reasons:

  • It’s a new field, so there isn’t a huge amount of learning resources, and those that are out there tend to be high-level and expansive because…
  • The field is very broad, encompassing the entire lifecycle of an ML model from data, to training and experimentation, model management, serving, monitoring and much, much more. This means it’s difficult to know where to start learning, and equally difficult to know how the pieces fit together into a coherent whole.
  • The range of skills and experiences needed to cover the space is equally broad, meaning it’s very easy to be a data scientist with no experience of production, or a software engineer with no experience of model training. Since MLOps is such a practical field, hands-on projects are essential, but these skills gaps mean useful projects for learning can feel out of reach.

The MLOps.community set up the Engineering Labs to help with these difficulties:

  • It aims to be a learning resource not only for the participants but also for others, through documentation, coding in the open, and articles like this one.
  • Each lab tackles a well-defined, narrow part of the MLOps landscape, allowing participants to learn about that specific piece and how it fits into its wider context.
  • The labs bring together diverse teams of varying experience and skill sets to make sure the hands-on projects can be completed efficiently and so the participants can learn from each other.

The first of the MLOps labs investigated the use of MLflow to train and deploy a PyTorch model to TorchServe. For context, MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. PyTorch is well known as one of the two main deep learning frameworks, and TorchServe is the low-latency model serving framework for PyTorch. Recently, PyTorch and MLflow announced a complete integration, which allows:

  • Auto-logging for PyTorch Lightning models
  • Better model saving for PyTorch models (including TorchScript)
  • A plugin to allow deployment of PyTorch models to a TorchServe server
  • Example projects to help users learn how to train and deploy PyTorch models with MLflow
Figure 1. When MLflow meets PyTorch
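
As a flavour of the first two points, here is a minimal sketch of PyTorch Lightning auto-logging with MLflow. The `NewsClassifier` and `NewsDataModule` modules are hypothetical stand-ins for a team's own code:

```python
import mlflow
import mlflow.pytorch
import pytorch_lightning as pl

from news_classifier.model import NewsClassifier  # hypothetical LightningModule
from news_classifier.data import NewsDataModule   # hypothetical LightningDataModule

# One call enables auto-logging of parameters, metrics and the trained model
mlflow.pytorch.autolog()

model = NewsClassifier()
data = NewsDataModule()

with mlflow.start_run():
    trainer = pl.Trainer(max_epochs=3)
    trainer.fit(model, datamodule=data)
```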

This article describes our experiences taking part in the first MLOps Engineering lab, and we hope it can be a practical learning resource to other people trying to learn about putting ML models into production in an automated, cloud-native way.

This is the Table of Contents for the rest of the article:

  1. Who we are
  2. Our project
    The stages and components of the pipeline
    Model Versioning and Baselines
    Tools and mechanisms used in the solution
  3. Infrastructure
    Design
    Infrastructure as Code
  4. Workflows & Automation
    Training workflow outline
    Deployment workflow outline
    Other workflows
  5. What did we learn
    Technical
    Non-technical
  6. Summary
  7. Resources
  8. Notes

Who we are

The team was composed of:

Figure 2. Team 1 — members

Our project

Our goal was to deliver a pipeline to build and deploy PyTorch models into a production-like environment. A CI/CD pipeline is a standard tool when building traditional software, but it is not so trivial to set up when dealing with AI-enabled systems. When planning and designing the solution, our team tried to tackle ML-specific concerns (like model versioning, selecting the best model to deploy, etc.) as well as getting the best out of MLFlow and PyTorch. Our final pipeline was capable of addressing most of these concerns, with some left to future work. What follows are some of the concerns that drove our design.

The stages and components of the pipeline

Our work is heavily inspired by the maturity levels proposed by Google. Each level of maturity introduces new structures and a greater degree of automation. The first level (level 0) defines a manual process in which the data scientist iterates through data analysis and exploration to find the ML model that best fits the problem, which is then manually deployed to production in a separate ops system. Level 1 advocates for an automated pipeline and also proposes continuously training the model in production. We aimed for an intermediate level in which we would address automated model training, versioning, and deployment. We decided to skip data-oriented activities (extraction, analysis, and preparation) since there was no way to update the dataset during this lab.

Model Versioning and Baselines

Software developers are usually concerned with versioning and configuration management when developing traditional software. In general, the development team defines policies so they can evolve source code and other artifacts, deliver them, and roll back any release if needed. Releases are associated with baselines¹ to provide traceability and a means for a rollback process. Machine learning systems increase the complexity because configuration management processes must also account for data, hyperparameters, and all other relevant information needed to build the model.

Table 1. Resources, Tools and Capabilities

Besides that, our first design consisted of distributed nodes performing model training, tracking, and serving. That could lead to erroneous model packaging, that is, models associated with the wrong source code version. We discussed two possible solutions:

  • Apply a logical baseline across all resources. We could do that by obtaining the current version (or tags) of each resource during the training process and registering it in separate storage (e.g., a file in GCS). Deploying to the Production environment would then be a matter of inspecting the release baseline and getting the proper model version;
  • Simplify model training and registering by training and registering the model in the same node. Model registering must occur right after training without code or dataset updates. This is simpler but doesn’t scale very well when we think of many models being built and deployed.

For this lab, we chose to simplify the process. However, we plan to extend it and apply logical baselines in the future.
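
For illustration, here is a minimal sketch of what the first option (logical baselines) could look like. This is not the code we used, and the tag names are assumptions:

```python
import subprocess

import mlflow

def current_git_sha() -> str:
    # Assumes the training process runs inside a git checkout of the source repo
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

with mlflow.start_run():
    # Record everything needed to reproduce the model alongside the run itself,
    # so a release baseline can point back to code, data and environment versions
    mlflow.set_tags({
        "git_sha": current_git_sha(),
        "dataset_version": "v1",             # hypothetical dataset tag
        "training_image": "trainer:latest",  # hypothetical docker image tag
    })
    # ... training and model logging would happen here ...
```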

Tools and mechanisms used in the solution

Complex solutions demand a rationale for selecting the “best tool for the job” or which mechanism would address a problem. By mechanism, we mean a strategy, pattern, component, or any asset that could help solve a problem in the lab². While designing and implementing the solution, we found decisions we may classify as “easy to pick”; for instance, we decided to use Docker as our deployment environment right away (it’s the de facto tool for the job). At the same time, there were also “hard to pick” decisions, like choosing which set of tools we would use to control the CI/CD flow. In any case, we used the following rationale when selecting a tool or mechanism:

  • Does it fully address the problem?
  • Do we have previous experience with it?
  • Is it an industry standard or pattern?
  • Is there a learning curve?
  • How widespread is it?

You may check the complete list of design decisions in our repository.

Infrastructure

Designing system and software architecture is an exercise in addressing non-functional requirements while avoiding over-engineering and nice-to-have structures. Moreover, one must be aware that the “design” is a live asset that evolves as the technical team advances with the product implementation. Our draft design proposed a node for each role (e.g., training, tracking, serving, etc.) we would need in the pipeline. The design changed as we integrated cloud services. In the end, we adopted as many cloud-native elements as we could, removing the burden of managing those (virtual) machines.

Our team intended to orchestrate common Unix management tools (shell commands, ssh, etc.), the MLFlow CLI, and MLFlow components to build a seamless end-to-end pipeline. It should encompass Continuous Integration (CI), Continuous Delivery (CD), and Continuous Training (CT). The following figure depicts the proposed architecture.

Design

Figure 3. Proposed Architecture

Nodes are bare-metal or virtual machines that host a part of the pipeline process. ML components are ML-based code that should be trained, tested, and deployed by the end-to-end pipeline; they must be packaged as MLFlow projects. Training Nodes host the training process, which runs in Training Containers: dockerized images built with MLFlow and the libraries needed to train the ML components. Each ML component defines its own needs, that is, libraries and external dependencies. To ensure traceability, provenance, and model configuration management, hyperparameters, training steps, test results, extra data, and the model itself are tracked in the MLFlow Tracking Server hosted on the Tracking Node. All these items (metadata and trained models) are stored in external resources: an RDBMS and cloud storage services. Trained models are embedded into dockerized Serving Instances, containers running TorchServe that provide access to the models via HTTP.
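
As a minimal sketch of how a Training Container might report to the Tracking Node (the host name, metric values, and registered model name are assumptions):

```python
import mlflow
import mlflow.pytorch
import torch

mlflow.set_tracking_uri("http://tracking-node:5000")  # MLflow Tracking Server on the Tracking Node
mlflow.set_experiment("news-classification")

model = torch.nn.Linear(16, 4)  # stand-in for the real trained classifier

with mlflow.start_run():
    mlflow.log_params({"lr": 1e-3, "epochs": 3})  # hyperparameters
    mlflow.log_metric("val_accuracy", 0.91)       # test results
    # The model and its metadata go to the external stores (RDBMS + cloud storage);
    # registering it lets Serving Instances pull it by name and version.
    mlflow.pytorch.log_model(
        model,
        artifact_path="model",
        registered_model_name="news_classifier",  # hypothetical registry name
    )
```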

The Control (or Operator) Machine is the VM used by the Configuration Manager or Operations Engineer to follow the process and issue commands to the nodes in the pipeline. This process may run across many different environments; however, this solution comprises Development and Production stages only.

We thought about providing a feedback loop with a Monitoring Node. It would check status and performance metrics, issuing new training cycles or a full rebuild if necessary. However, we removed it from the scope to fit the lab time-box.

Infrastructure as Code

We are used to creating infrastructure manually. However, modern solutions may demand complex infrastructure, and it may drift and become outdated as your project evolves. Manually recreating an MLOps infrastructure can turn into a painful, error-prone task. In this lab, we used Terraform and Ansible to address configuration drift³ and infrastructure reproducibility issues. We organized the infrastructure workflow in three stages:

  • Provisioning: It’s about allocating resources in your machine or in the cloud;
  • Configuration: Setting up the Infra, that is, installing packages, creating users, updating configuration files, etc;
  • Destroying: Freeing your local and cloud resources.

Here is the list of some relevant resources created during Provisioning. You can find more details about infrastructure assets and how to create the infrastructure in our repository.

Table 2. Assets Description

Workflows & Automation

We designed our system to support two workflows:

  • Training a new model automatically and preparing it for deployment
  • Deploying a chosen model to production
Figure 4. Development and Production workflows

Training workflow outline

Our training workflow is the more complex of the two. It starts from the point where a data scientist has done enough experimentation to propose a change to production, e.g. a new model architecture.

  • A PR is created and, after review and merge to master, our automation takes over (1)
  • A training image is built, and an MLFlow training run is started using this training image and the current master code state (2)
  • MLFlow logs all metrics during the training run to the MLFlow tracking server and logs the model artifacts to the artifact store/model registry when training finishes (3)
  • Once training has ended, our deployment image is built, containing a TorchServe server and a copy of the trained PyTorch model
  • The image is tagged with the model name and version provided by the MLFlow model registry and is pushed to a docker registry (see the sketch after this list)
  • This image is deployed to a unique temporary Cloud Run instance for acceptance testing (4)
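
Here is a minimal sketch of how the image tag could be derived from the model registry once training has finished (the tracking URI and model name are assumptions):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://tracking-node:5000")  # assumed server address
latest = client.get_latest_versions("news_classifier", stages=["None"])[0]

# e.g. "news_classifier-3", used to tag and push the TorchServe deployment image
image_tag = f"{latest.name}-{latest.version}"
print(image_tag)
```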

We used the MLFlow Project format to keep our training runs reproducible, choosing docker as the environment solution instead of conda because it works better with our deployment workflow. This format is pretty lightweight and is basically a thin wrapper over python commands, so it doesn’t involve much vendor lock-in.
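
For example, a training run can be launched as an MLflow Project roughly like this (the entry point and parameter names are assumptions about the MLproject file):

```python
import mlflow

# Equivalent to `mlflow run . -P max_epochs=3` from the command line
submitted = mlflow.projects.run(
    uri=".",                       # repository root containing the MLproject file
    entry_point="main",
    backend="local",
    parameters={"max_epochs": 3},  # hypothetical project parameter
)
print(submitted.run_id)
```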

We used GitHub Actions as our automation tool, allowing us to build declarative pipelines in simple YAML files that run in close concert with our source control.

Deployment workflow outline

Our deployment workflow is very simple, as we’ve already built and tested our deployment image in the training workflow:

  • Once an admin member of the team wants to move a certain model to production, they create a release with a tag containing the chosen model name and version
  • The image with the corresponding tag is pulled from the docker registry and deployed to the production Cloud Run instance (replacing the image currently in use, though canary deployments etc are possible) (5)

Other workflows

We did not explicitly design this system to handle other workflows, though it’s interesting to think how they could be built on in the future, in particular experimentation and continuous training workflows.

As mentioned above, our workflows start once a data scientist has done enough experimentation to know what change they’d like to make in production. MLFlow started life as an experiment tracking solution, so the current setup would be perfect for a data scientist to track all experimentation using the MLFlow server. This could then be used to decide on the best model, which could be linked in the PR for review, and also used to compare against the results of the automated training workflow to catch any issues. However, this would still be a very manual experimentation workflow, so another tool would be needed to orchestrate the experiments and make them more automated. Ideally this tool would make the transition from experimentation to production seamless (known as experimental-operational symmetry), which may mean replacing some of the steps in our training workflow, e.g. running the training in Kubeflow Pipelines rather than on our training node if that were the orchestration tool chosen.

At the other end of the process, the current system doesn’t handle continuous training. A very basic form of continuous training could be implemented in this system if it were clear your model degraded in performance each day or each week: a scheduled GitHub Action could create a new master commit and release, which would trigger the training and deployment pipelines. However, the most useful forms of continuous training monitor the performance of models in production and only trigger retraining on degradation. This would, at the very least, require tracking the predictions and actuals from our model in a database, which could be evaluated on a schedule, triggering the training and deployment pipelines if the model showed degradation.
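
A rough sketch of what such a scheduled degradation check could look like; the database helper, repository path, and threshold are all assumptions:

```python
import requests

from monitoring.db import fetch_recent_predictions  # hypothetical helper returning [(predicted, actual), ...]

ACCURACY_THRESHOLD = 0.85  # hypothetical acceptable floor

rows = fetch_recent_predictions(days=7)
accuracy = sum(pred == actual for pred, actual in rows) / len(rows)

if accuracy < ACCURACY_THRESHOLD:
    # Trigger the GitHub Actions training workflow via a repository_dispatch event
    requests.post(
        "https://api.github.com/repos/<org>/<repo>/dispatches",  # placeholder repository
        headers={"Authorization": "token <GITHUB_TOKEN>"},        # placeholder token
        json={"event_type": "retrain-model"},
    )
```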

Hopefully both of these workflows will be tackled in future Engineering Labs so we can all learn good ways to tackle them!

What did we learn

Technical

Overall we were very impressed with the integration of MLflow and PyTorch/TorchServe. Although we had some difficulties getting started, these were mostly documentation issues that can easily be fixed in the example projects. The errors during deployment were also usually quite opaque, but this was mostly an issue on TorchServe’s end, possibly fixable with better sanity checking by the plugin. The payoff was huge once we were set up, with just a couple of MLflow commands allowing us to train a model reproducibly and deploy it to a TorchServe server.

The major struggles we had through the project were due to differences between our training and serving environments, in particular the python environments we had in each. We would train a model, then try to deploy it and get errors of different varieties. We eventually discovered the problem was due to difficulties in deserialising the pickled PyTorch model that had been uploaded to the registry. PyTorch recommends saving the model as a state_dict rather than a pickled model for this very reason, and the recently released MLFlow version 1.14 allows saving and loading state_dict versions. We were able to minimise the chances of problems like this by using a base docker image to install all common dependencies for the train and serve docker images. This was a great reminder, though, of the ever-present difficulties in production ML projects caused by differences between the offline environment and the online environment.
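
For reference, here is a minimal sketch of the state_dict support added in MLflow 1.14 (the model here is a stand-in for the real classifier):

```python
import mlflow
import mlflow.pytorch
import torch

model = torch.nn.Linear(16, 4)  # stand-in for the real classifier

with mlflow.start_run() as run:
    # Log only the weights, not a pickled model object
    mlflow.pytorch.log_state_dict(model.state_dict(), artifact_path="model")

# At serving time, the weights are loaded back into a freshly constructed model,
# so the serving environment never needs to unpickle a training-time object.
state_dict = mlflow.pytorch.load_state_dict(f"runs:/{run.info.run_id}/model")
model.load_state_dict(state_dict)
```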

We used two other tools on our journey toward MLOps level 1 that are worth calling out: GitHub Actions for automation and Google Cloud Run for container serving. As discussed in the design decisions above, GitHub Actions was the lowest-friction way to add automation to our project. Though there are tools like Kubeflow Pipelines or Airflow which would be best in class for this, we were able to create a relatively complex pipeline in a compact, declarative form that should be able to grow with this project for a long time without the overhead of those other tools.

We were also very impressed with the ease of deploying our serving docker image to Google Cloud Run. Managing a container execution environment can become troublesome, since the team must deal with container lifecycle, communication mechanisms, load balancing, etc., and the complexity can scale quickly when dealing with multiple images and containers. When we decided to host our serving image on this fully managed container service, we set aside a good bit of time for getting it working and maintaining it. We haven’t had to look at it since then: it scales to zero when it’s not being used and costs nothing.

Non-technical

As a team we could easily have struggled, as we were spread across four different time zones from Brazil to Sri Lanka, with different languages and different levels of experience with machine learning and software engineering. We were able to work well together using a lot of great tools for collaboration: our Slack channel was a hive of activity, we had weekly Zoom meetings for ideas and design, and we used GitHub PRs and issues to dig into the details of the work.

We followed a good iterative process too, with a viable project at each step as we moved from our laptops to the cloud, then to a containerised and finally an automated solution.

Summary

Overall we were really impressed with MLflow as a tool. We think it’s an ideal tool for ML teams that want to move their work from research to production effectively.

The experiment tracking and logging is excellent and has a really user-friendly UI. We liked how you could add the tracking and logging as a wrapper around your ML training code, which means you can easily start to track research code without worrying about vendor lock-in.

For the operations part, the model registry is really well done, particularly the standardisation framework they’ve put in place, which makes your models much more portable. For example, when a PyTorch model is saved by MLflow, it is packaged so that it’s easily usable by TorchServe, but also so that it can be loaded as a generic Python function for use with, e.g., SageMaker.
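
As a sketch of that portability, a registered model can be loaded back as a generic python function (the model name, stage, and input column are assumptions):

```python
import mlflow.pyfunc
import pandas as pd

model = mlflow.pyfunc.load_model("models:/news_classifier/Production")  # assumed registry entry
predictions = model.predict(pd.DataFrame({"text": ["Stocks rally on strong earnings"]}))
print(predictions)
```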

Deployment is also made easier with the MLflow deployments tool. Though this is definitely the least developed of the components, it allows users unfamiliar with different deployment environments to get their models into production.
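
Deploying a registered model with the mlflow-torchserve plugin looks roughly like the sketch below; the model, handler, and file names are assumptions:

```python
from mlflow.deployments import get_deploy_client

# Requires the mlflow-torchserve plugin and a running TorchServe instance
client = get_deploy_client("torchserve")
client.create_deployment(
    name="news_classifier",
    model_uri="models:/news_classifier/1",
    config={
        "MODEL_FILE": "news_classifier.py",  # model class definition
        "HANDLER": "news_handler.py",        # TorchServe request handler
    },
)
```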

While MLflow perfectly fits as the centrepiece of a level 0 MLOps maturity workflow, as we moved our project towards level 1, we needed more tools to take care of the orchestration and automation and MLflow started to take a back seat. However, we felt that it would work well to have MLflow as the model registry and experiment tracking pieces in a mature level 1 setup. It seems like the effort of learning MLflow and integrating it with your workflows can pay off as it’s a useful component throughout your journey through the maturity levels, first as a trellis around which to organise, and then as an individual brick in your architecture.

The engineering labs themselves were a great opportunity to learn different aspects of MLOps architectures. I think a lot of us have felt overwhelmed with the diagrams that show the MLOps tooling landscape with a menagerie of logos. I know I’ve found it hard to get a good understanding of where the workflows we use in our company fit into the ones described in those articles. This lab allowed us to take one small piece of the MLOps landscape and really get to know it, and we feel like we have a solid foundation to branch out from and learn the other important areas.

The diversity in skill sets was another amazing part of working in the lab. It would have been a very different project if we had all been data scientists or all software engineers, but the mix of skills allowed us each to bring the project to where it needed to go and allowed us all to learn from each other. I think we’ve all come away from the project learning a whole new set of skills and ways of thinking to bring to our next projects. We hope that our repository can help others to learn some of these skills and also to be an example of a somewhat end-to-end machine learning project developed out in the open.

Resources

Notes

  1. In this context, a baseline is a set of one or more tags that comprises all the artifacts needed to produce a release.
  2. For instance, one may use a combination of HTTP protocol and REST architectural style to address a requirement which demands remote updates in a hypermedia document.
  3. The phenomenon where infrastructure assets change and deviate from the initial setup over time.

Ivan Nardini
MLOps.community

Customer Engineer at @GoogleCloud who is passionate about Machine Learning Engineering. The Lead of MLOps.community’s Engineering Labs.