Life cycle of a machine learning software project

Whenever you are starting on a new project of any kind, it’s good to be aware of how that project is likely to pan out and what kind of milestones and challenges you should expect to meet. In the field of software engineering, we already have lots of different process models and methodologies that give a rough idea of what to expect, from waterfall to Scrum and so on. But once you add machine learning into the project, especially if your goal is to focus on the capabilities provided by machine learning, these old processes need to be adapted to the new scenario. In this blog post I’m going to provide an example of a life cycle for a machine learning software project to give you an idea of what to expect once you eventually launch your own projects.

Note! As usual, every software project is a beast of its own. While I’m giving you an example of a by-the-book project, it’s important to realize that not all projects are going to follow this model, nor should they. So take some tips from the parts that are useful for you, and adapt to your own liking where necessary.

Machine learning software projects are unique in two ways: First of all, due to what machine learning is, developing a machine learning project requires you to collect a large amount of useful historical data before starting. The purpose of this data is to enable the training of machine learning models, but this requirement means that the occasions where you can employ machine learning at all can be very limited. Sometimes, you might even find a use case where you’d like to implement machine learning, but don’t have the data required for training, so you’ll have to start a separate project for data collection first. Second, since machine learning only produces approximations, and the quality of these approximations depends on the quality and amount of training data you have as well as on the model training performed, it can be difficult to say beforehand whether your machine learning project is actually going to produce results that are useful. As such, machine learning projects should be approached with a proof-of-concept mindset: Rather than investing lots of money right away into something that may or may not work, create a few different prototypes, evaluate them and move on from there.

With that, I would outline the general life cycle of a machine learning project as follows:

  1. Identifying the project scope
  2. Gathering training data and prototyping models
  3. Developing a proof-of-concept solution with a prototype model
  4. Developing a production-ready solution with a production model
  5. Implementing re-training for the production model

It’s worth noting here that terminating the project at points 1, 2 or 3 is perfectly fine if it seems like the project is not going to produce good results. It’s always better to cut losses early than to force yourself through a project that’s guaranteed to fail. Also, the last step might not be necessary depending on the nature of your machine learning model. But let’s take a look at what these steps actually include!

1. Identifying the project scope

As is usual, before any software project can begin you need to first identify its scope: What problems is the software going to solve and what constraints may be placed on the project’s implementation? With machine learning there is an additional concern to think about: The training data. Training a good machine learning model is going to require a large amount of historical data. This can affect the project in two ways: First, your project may have started from identifying a problem area that could be solved with machine learning. Then you will have to ask yourself: Do you have the necessary training data available? If not, do you have capabilities for collecting that data? What does it take to implement data collection from scratch? Are you able to perform that within your budget and time constraints? For example, say you are working at a thermal power plant and you have found out that by monitoring pressure fluctuations in gas pipes you could perform planned maintenance before actual faults occur. But if you don’t yet have data on how these pressure fluctuations occur in your specific gas pipe system, you would first have to implement a process for capturing pressure data over some period of time before you can train a machine learning model that’s tailored to your real-world gas pipes. This would affect the scope of your project drastically.

Alternatively, your project may start from identifying that you have lots of good historical data that could be used for model training, but you are unsure whether any solvable problems actually exist. In such situations it’s best to involve domain experts who can tell you what potential use cases exist for the data, and how the data could be used to address these problems. Scenarios where projects can start like this include, for example, IoT systems or widely spread consumer applications where you have vast amounts of sensor or logging data stored just in case. For example, you can easily imagine Netflix’s movie recommendations having started like this: “Hmm, we have collected lots of data on what movies people have watched. I wonder if we could turn that into a system that tells people what other movies they might enjoy?” Here your project scope is likely to stay neat and tight since you already have all the data you need, but finding a solvable problem area – if one even exists – can be the challenging part.

2. Gathering training data and prototyping models

Once you have identified the scope of your project and the problem you will be trying to solve with machine learning, it’s time to gather the data needed for model training. In this step I’m assuming that you already have the data available in some data store, be it a relational database, a data lake, a data warehouse, or something else. Now you need to take that data and extract it in a format that is suitable for machine learning algorithms to process. You might also want to do some transformations on your data to make it more useful for training models for the problem at hand, for example, creating new columns from timestamps to contain years, months, and days separately, or converting the date into a single continuous integer, depending on what its function is in the machine learning process. The data should be converted into a tabular format (for example, into CSV files) and stored somewhere where it is easily accessible by both the people performing model training and the machine learning training platform itself. In the case of Azure Machine Learning, Azure Storage is a good candidate for storing training data.
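To make the timestamp transformation above a bit more concrete, here is a minimal sketch using only Python's standard library. The field names (timestamp, sales), the sample records and the helper functions are all made up for illustration; a real project would read its rows from its own data store.

```python
import csv
import io
from datetime import datetime


def add_date_features(rows, timestamp_field="timestamp"):
    """Return copies of the rows with date-derived columns added."""
    enriched = []
    for row in rows:
        ts = datetime.fromisoformat(row[timestamp_field])
        new_row = dict(row)
        new_row["year"] = ts.year
        new_row["month"] = ts.month
        new_row["day"] = ts.day
        new_row["weekday"] = ts.weekday()  # Monday is 0, Sunday is 6
        new_row["day_number"] = ts.toordinal()  # date as one continuous integer
        enriched.append(new_row)
    return enriched


def to_csv(rows):
    """Serialize the rows into CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample records standing in for data extracted from a data store.
raw_rows = [
    {"timestamp": "2021-06-14T12:00:00", "sales": 120},
    {"timestamp": "2021-06-15T12:00:00", "sales": 95},
]
training_rows = add_date_features(raw_rows)
print(to_csv(training_rows))
```

Whether the separate columns or the single integer is more useful depends on the problem: separate columns let a model pick up on weekly or seasonal patterns, while the continuous integer is better suited to capturing a long-term trend.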

With your training data in place, it’s time to start prototyping machine learning models. The goal here is to train a few different models and evaluate their results to see whether they produce good enough approximations to warrant continuing with the project. These evaluations can be performed in many ways. For example, you could run batch predictions with a Python script, store the predictions in an Excel sheet and have a business analyst compare them to real results: If your project aims to predict daily sales based on a set of variables such as number of customers, ongoing marketing campaigns and the day of week, you would want to verify that the estimates the prototype models produce are close enough to actual sales values to be useful for making business decisions. You would also want to make sure that the predictions are based on variables that make sense, such as sales increasing if the number of visitors in the store increases – and not the other way around!
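As a rough sketch of the two checks described above, the snippet below compares predictions to actual values with the mean absolute percentage error and verifies that a model reacts to more visitors in the expected direction. The sales figures, the toy model and the 10 % threshold are invented for illustration; what counts as "close enough" is a business decision, not something the code can tell you.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, expressed as a percentage."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


def responds_in_expected_direction(predict, low_input, high_input):
    """Sanity check: a higher input (e.g. more visitors) must not lower the prediction."""
    return predict(high_input) >= predict(low_input)


# Hypothetical daily sales and the predictions a prototype model gave for them.
actual_sales = [100.0, 200.0, 400.0]
predicted_sales = [110.0, 180.0, 400.0]
ACCEPTABLE_MAPE = 10.0  # assumed threshold, to be agreed with the business side

mape = mean_absolute_percentage_error(actual_sales, predicted_sales)
print(f"MAPE: {mape:.1f} % (acceptable: {mape <= ACCEPTABLE_MAPE})")


def toy_model(visitors):
    """Stand-in for a trained model that predicts sales from visitor count."""
    return 2.5 * visitors


print("More visitors, more sales:", responds_in_expected_direction(toy_model, 50, 100))
```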

Then what should you do if the prototype models don’t perform up to expectations? The first thing to do would be asking yourself the question “why?” Is there perhaps something wrong with your training data, or have you missed some critical variable that affects the daily sales, but which isn’t represented in your data at all? For example, if your shop is an ice cream parlor and you forgot to collect daily weather in your data set, it’s likely that your model won’t be giving as good sales predictions as you’d like. People love to eat ice cream when it’s warm and sunny, after all! Alternatively, you might simply have too little training data, or your data may contain biases that warp the predictions. Maybe you have sales figures only from sunny days and now your model over-estimates the sales for rainy days? Last, it’s also possible that you used machine learning algorithms that are simply unsuitable for your training data, or that the hyperparameters you used in the training process were chosen poorly. To avoid this, I recommend using automated machine learning solutions such as Azure’s AutoML for prototyping, since they automatically try various algorithms and hyperparameters and attempt to find the best combination for your training data.
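The core idea behind trying several algorithms and picking the best one can be sketched in a few lines. To be clear, this is not how Azure's automated ML works internally; it is a hand-rolled illustration with two deliberately simple candidates (a mean baseline and a one-variable least-squares line), and the temperature and sales numbers are made up.

```python
from statistics import mean


def fit_mean_baseline(xs, ys):
    """Candidate 1: always predict the average of the training targets."""
    average = mean(ys)
    return lambda x: average


def fit_linear(xs, ys):
    """Candidate 2: ordinary least squares with a single input variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mean_absolute_error(model, xs, ys):
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))


def pick_best(candidates, train, validation):
    """Fit every candidate on the training set and score it on held-out data."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mean_absolute_error(model, *validation)
    best_name = min(scores, key=scores.get)
    return best_name, scores


# Hypothetical data: daily maximum temperature (Celsius) and cones sold.
train = ([10, 15, 20, 25], [20, 30, 40, 50])
validation = ([12, 30], [24, 60])

candidates = {"mean_baseline": fit_mean_baseline, "linear": fit_linear}
best_name, scores = pick_best(candidates, train, validation)
print(best_name, scores)
```

The important detail is that candidates are scored on data they were not trained on. An automated tool does the same kind of comparison across many more algorithms and hyperparameter settings than you would want to try by hand.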

3. Developing a proof-of-concept solution with a prototype model

After developing a prototype model that produces good predictions it’s time to develop an application that utilizes the model. This is the proof-of-concept for a user interface that allows your end users to use the model. There are numerous ways this could be done: As a web application, a mobile app, a desktop application, or something completely different. The proof-of-concept needs to provide an interface for the users to provide the needed parameters to the model (e.g., “How many ice cream cones are we expected to sell on a Monday if the weather is cloudy and the temperature tops out at 20 degrees Celsius?”) and then display the generated predictions. This step does not differ much from usual proof-of-concept software development. However, with the introduction of machine learning into the mix the goal of the proof-of-concept now includes verifying that the machine learning model can be utilized effectively in practice. For example, “Can this application be used to generate sales predictions in a manner that is beneficial to making useful business decisions?” Once you have verified that, it’s time to move on to creating a finalized solution. Speaking of which…
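The simplest possible version of such an interface is a command-line tool that takes the parameters, calls the model and prints the prediction. The sketch below assumes exactly that; the predict_cones function is a placeholder with an invented formula standing in for the trained prototype model, and the argument names are hypothetical.

```python
import argparse

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
WEATHER_EFFECT = {"sunny": 1.3, "cloudy": 1.0, "rainy": 0.6}  # made-up factors


def predict_cones(weekday, weather, max_temp_c):
    """Placeholder for the prototype model; the formula is invented."""
    weekend_bonus = 1.5 if weekday in ("Saturday", "Sunday") else 1.0
    return round(5 * max(max_temp_c, 0) * WEATHER_EFFECT[weather] * weekend_bonus)


def build_parser():
    parser = argparse.ArgumentParser(description="Ice cream sales proof-of-concept")
    parser.add_argument("--weekday", required=True, choices=WEEKDAYS)
    parser.add_argument("--weather", required=True, choices=sorted(WEATHER_EFFECT))
    parser.add_argument("--max-temp-c", required=True, type=float)
    return parser


def main(argv=None):
    """Parse the user's parameters, query the model and show the prediction."""
    args = build_parser().parse_args(argv)
    cones = predict_cones(args.weekday, args.weather, args.max_temp_c)
    message = (
        f"Expected sales on a {args.weather} {args.weekday} "
        f"at {args.max_temp_c:g} degrees Celsius: {cones} cones"
    )
    print(message)
    return message


# Example call with explicit arguments instead of reading the real command line.
main(["--weekday", "Monday", "--weather", "cloudy", "--max-temp-c", "20"])
```

A web or mobile front end would replace the argument parser, but the shape stays the same: validate the inputs, call the model, present the result in terms the user cares about.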

4. Developing a production-ready solution with a production model

With a working prototype model and proof-of-concept solution implemented it’s time to bring them both into production. Depending on the scope of the whole project this can be either very simple and fast to do, or complicated and time-consuming. At this point you might want to develop the machine learning model further to provide more accurate predictions: For example, if 90 % accuracy was good enough for the prototype, maybe you would want to get it up to 92 % for the production version? A two percentage point increase might not sound like a lot but getting there can involve a lot of work – both in collecting additional training data and tuning your machine learning algorithms!

Similarly, with the development of production-ready software there comes all the little polish that you are often expected to implement: Logging, error handling, user interface and usability improvements, security hardening, testing, the list goes on. With machine learning you might also want to capture the predictions that users generate, and the parameters used to get those predictions, enabling you to further analyze the model’s performance later. This last part can be very important, since even though your model might perform well right now, it’s not guaranteed to work equally well in the future due to the world being an ever-changing place. Maybe your ice cream shop gets renovated a year down the line, and the new sparkly interior brings in even more customers, which isn’t reflected in your old training data at all…? If that were to happen your old model would no longer be useful, and as such it’s important to prepare for the inevitable changes down the line.
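Capturing predictions together with their input parameters does not need to be complicated. Below is one possible sketch that appends each prediction as a JSON line; the record fields and the model version label are assumptions, and the in-memory buffer stands in for whatever file, queue or database table your production system would actually write to.

```python
import io
import json
from datetime import datetime, timezone


def log_prediction(stream, model_version, inputs, prediction):
    """Append one prediction and the parameters behind it as a JSON line."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "inputs": inputs,
        "prediction": prediction,
    }
    stream.write(json.dumps(record) + "\n")
    return record


def read_predictions(stream):
    """Load all logged predictions back for later analysis."""
    return [json.loads(line) for line in stream if line.strip()]


# An in-memory buffer standing in for a real log file or table.
log = io.StringIO()
log_prediction(
    log,
    model_version="prototype-1",  # hypothetical version label
    inputs={"weekday": "Monday", "weather": "cloudy", "max_temp_c": 20},
    prediction=143.0,
)

log.seek(0)
records = read_predictions(log)
print(records[0]["inputs"], "->", records[0]["prediction"])
```

Storing the model version alongside each record pays off later: once you also know the real outcome for that day, you can tell which model made which mistake.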

5. Implementing re-training for the production model

Be it an increased number of customers due to a new, more appealing shop interior, or having replaced your gas pipes with new, sturdier pipes, or whatever else may apply to your project’s problem area, one thing is certain: Something is eventually bound to happen in the real world that makes your old machine learning models, trained on even older historical data, produce results that are no longer valid in the changed environment. During your machine learning software project, it’s good to list such potential factors and estimate how long your model is likely to remain valid. If it’s likely that your model is going to produce useful results at least for the next few years, then you might not want to do anything special for the time being. In such cases it’s probably going to be enough to manually update your model once the need arises. But if you are expecting that the model needs to be kept up to date often, it can be better to implement an automated re-training process for the machine learning model.
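One simple way to notice that a model is going stale is to compare its recent prediction errors with the errors it made when it was first validated. The function below is a minimal sketch of that idea; the tolerance factor of 1.5 and the error values are arbitrary assumptions, and real projects often use more robust drift checks.

```python
from statistics import mean


def needs_retraining(baseline_errors, recent_errors, tolerance=1.5):
    """Flag the model when recent mean absolute error exceeds the baseline by a factor."""
    if not baseline_errors or not recent_errors:
        raise ValueError("both error lists must be non-empty")
    baseline = mean(abs(e) for e in baseline_errors)
    recent = mean(abs(e) for e in recent_errors)
    return recent > tolerance * baseline


# Hypothetical prediction errors (predicted minus actual daily sales).
errors_at_launch = [5, -4, 6]
errors_after_renovation = [-20, -18, -25]  # model now under-estimates badly

print(needs_retraining(errors_at_launch, errors_at_launch))
print(needs_retraining(errors_at_launch, errors_after_renovation))
```

Running a check like this on a schedule gives you an evidence-based trigger for re-training instead of a guess about how long the model will stay valid.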

Such a re-training process would include automating the gathering of historical data in a format suitable for model training, which is the same task that was performed manually in step 2 above. After data gathering, the next step to automate would be the model training itself. If you are using a cloud-based PaaS solution for machine learning such as Azure Machine Learning, this automation might also include configuring your machine learning platform to enable running model training without human interaction. You should also modify the training process to capture logging data and send alerts in case of errors, so that you are informed if something goes wrong in an automated machine learning run performed on a Saturday night. Finally, you’ll want to decide whether the new re-trained models should be automatically deployed to production (for example, as an Azure App Service-based API) or whether you want to have business analysts or other human specialists perform validation on the model before it can be made public. This latter option can be useful if your model is used in scenarios that are business critical or otherwise very impactful to human life (e.g., healthcare, safety, or general infrastructure). Once the model passes any validations imposed on it you can deploy it to your application for users to enjoy, preferably through some kind of continuous deployment mechanism.
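The overall flow described above (gather data, train, alert on errors, validate, then deploy or wait for approval) can be sketched as a single function. Everything here is a hypothetical skeleton: the callables passed in stand for your own data gathering, training, evaluation and deployment steps, and no particular platform's API is being shown.

```python
import logging

logger = logging.getLogger("retraining")


def run_retraining(load_data, train, evaluate, deploy, max_error,
                   require_approval=False, alert=logger.error):
    """Run one re-training cycle and return a short status string."""
    try:
        data = load_data()
        model = train(data)
        error = evaluate(model)
    except Exception as exc:
        # Something broke in the unattended run: make sure a human hears about it.
        alert(f"Re-training failed: {exc}")
        return "failed"

    if error > max_error:
        alert(f"New model rejected: error {error:.2f} exceeds limit {max_error:.2f}")
        return "rejected"

    if require_approval:
        # Business-critical models wait for a human specialist before going live.
        logger.info("New model awaiting human validation")
        return "pending_approval"

    deploy(model)
    return "deployed"


# Hypothetical stand-ins for the real pipeline steps.
deployed_models = []
status = run_retraining(
    load_data=lambda: [120, 95, 143],
    train=lambda data: sum(data) / len(data),  # "model" is just the average
    evaluate=lambda model: 4.2,                # pretend validation error
    deploy=deployed_models.append,
    max_error=10.0,
)
print(status, deployed_models)
```

Passing the steps in as functions keeps the control flow testable on its own, and the require_approval switch maps directly onto the choice between automatic deployment and human validation.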

And with that the model reaches the never-ending re-training loop of its life cycle. In the big picture a machine learning software project does not differ that much from a typical software project. The main takeaway here is that the initial proof-of-concept phase includes more prototyping steps and checks where the project can (and should!) be terminated if it doesn’t seem to be worth the investment. After all, even though machine learning is cool and useful, there are still some situations where it isn’t worth the cost – and determining whether that’s the case without prototyping can be very difficult. If you have anything to add, feel free to drop a comment. Otherwise, until next time!
