Machine Learning Project Steps

Every Machine Learning project depends on data, and not just data of any quantity or quality: it needs a large amount of high-quality data. Since the success of the project depends on the data available, you must make a conscious effort to gather and work on your data to achieve the best output.

Another thing you must bear in mind is that incomplete or bad data can cause an ML project to fail. Let me share a story told by the data science consultant Martin Goodson. Some time ago, a healthcare project was executed to reduce the cost of treating pneumonia patients. To get accurate and fast results, ML was employed to analyze the patient records and decide two things:

  1. Which patients had the lowest risk of death and should be sent home with antibiotics.
  2. Which patients had a high risk of death and should be admitted to the hospital.

The team used the clinic's historical patient records as data and, of course, the Machine Learning algorithm was accurate in its output. However, there was an exception they overlooked: pneumonia and asthma go hand in hand, and doctors always admit asthmatic pneumonia patients to the intensive care unit to reduce the death rate.

Unfortunately, the algorithm didn't get any data about cases of asthmatic death, and as such, it concluded that asthma wasn't life-threatening during an attack of pneumonia. In every such case, the machine reached the wrong conclusion and recommended that patients with asthma be sent home while the others were admitted (AltexSoft, Inc. USA, 2017).

The lesson here is that Machine Learning needs a large amount of flawless data to produce effective results. Since any available data may have flaws, you must follow due process to ensure that the information you have is sensible. If not, your Machine Learning project may be unsuccessful.

Therefore, let’s consider the following ML project steps.

Data Collection

What separates those who are capable of getting the best output from ML projects from everyone else is the availability of data. Many organizations today have gathered decades' worth of records, far more than their conventional infrastructure can comfortably handle. New actors in the industry, however, are struggling to gather adequate data for their Machine Learning projects.

If you are still a new player with limited data for ML projects, you are actually fortunate, because you have good options for turning this disadvantage into an opportunity.

The first thing to do is consider open-source data for your Machine Learning work. Google and other companies make large amounts of data available that you can use for your project. Also, since you are new to data collection, you can get it right from the beginning. While the big companies that started collecting data in paper ledgers and .xlsx and .csv files will struggle to prepare their data, yours will be easier because you are starting with an ML-friendly dataset. What you should do now is tailor a mechanism to gather only the data that will suit your ML projects.

Also, even though the general recommendation is to aim at big data, make sure as a beginner that you can process the data when you need it. There is no sense in having big data that you can't utilize. Therefore, start small in your data collection so that you reduce the stress of data preparation and processing (AltexSoft, Inc. USA, 2017).

So, how do you succeed in your data collection stage?

  •  Eliminate gaps in your data

While collecting the dataset for your ML project, make sure that every variable is in place. If you allow gaps in your data, building a model will be very difficult. The best thing to do is to have a strategy for your data collection before embarking on the process; with a strategy in place, you can avoid collecting unnecessary data. However, if you have yet to develop a strategy, make sure that you collect complete data.

  • Maintain your raw data as it is

It is advisable to keep your data in its raw state because it speeds up the data analysis process. Don't start throwing away parts of your data, because you cannot know how beneficial they may be in the future. Don't waste time guessing which data may or may not be useful later. Keep the data as it is and avoid any subsequent embarrassment.

  • Document every missing value

When you realise that some values are missing while collecting the data, write them down. This will save time for the data scientists when they analyze the data for the project. Documenting the gaps can also prevent malfunctions later and help hasten the data analysis process.
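As a small sketch of this habit, assuming pandas and a made-up table, you can log how many values are missing per column as you collect:

```python
import pandas as pd

# Hypothetical collection snapshot with some gaps.
df = pd.DataFrame({"age": [25, None, 40], "city": ["Lagos", None, None]})

# Record the number of missing values per column for the data scientists.
missing_log = df.isna().sum().to_dict()
print(missing_log)  # {'age': 1, 'city': 2}
```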

  • Update the structure of your data storage

If the data storage structure is modified, make sure that you update the logs and adequately describe the changes. Also, ensure that you keep the original versions of all the documents that were modified.

  • Don’t lose your data points

Even as you collect data, take measures to keep all your data points safe and intact. Unforeseen issues, such as a lost connection, can lead to data loss, so take precautions against it.

  • Employ the services of a data officer

Hiring a data officer will go a long way towards speeding up your data collection process. There may be datasets that some teams do not know about, but with an in-house data officer, every dataset will be brought to light. At the very least, he or she will know about every dataset that exists in the company as gathered by the various teams (Shchutskaya, n.d.).

If you are still unsure where to find data for your ML project, you can start with these places: Kaggle, the UCI Machine Learning Repository, Quandl, etc.

Data Preparation

After collecting data from different sources, the next step is to clean the data so that analysts and data scientists can use it for ML projects. Although data preparation can be a do-it-yourself project, it is always advisable to involve a data scientist. However, you can assist the scientist and speed up the whole process. To be of any help during this stage, there are some techniques you need to know in order to obtain better results.

  • Determine the problem early

The first technique that will help you clean the data is knowing the outcome you want the ML project to predict. Knowing this helps you extract the valuable data from the large pool. This stage also calls for choosing among Machine Learning techniques such as classification, clustering, regression and ranking.

  • Format your data

After the first step above, it is time to format the data to fit the ML system you are using. The aim of presenting your data in a compatible format is to ensure that the records you have are consistent. Remember that you collected the data from various sources and that different people updated them. Therefore, instead of leaving discrepancies in the variables, use one format for the whole dataset so that you can be sure the data is consistent.
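As a minimal sketch, assuming pandas and invented records, formatting might look like mapping inconsistent spellings and units onto one convention:

```python
import pandas as pd

# Hypothetical records from two sources that spell the same category
# differently and mix units in one column.
raw = pd.DataFrame({
    "country": ["USA", "usa", "U.S.A.", "United States"],
    "height_cm": [180.0, 175.0, 1.68, 1.72],  # two rows were recorded in metres
})

# Map every spelling onto a single canonical label.
canon = {"USA": "US", "U.S.A.": "US", "UNITED STATES": "US"}
raw["country"] = raw["country"].str.upper().map(canon)

# Convert the metre-valued rows so the whole column shares one unit.
raw.loc[raw["height_cm"] < 3, "height_cm"] *= 100

print(raw)
```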

  • Reduce the data

Instead of aiming for big data like others do, aim for the data that is relevant to the outcome you want to predict in the project. Instead of creating complexity and excessive dimensions, reduce the data by using the attribute-sampling approach, the record-sampling approach, or by aggregating the data into larger records.
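The three reduction approaches above can be sketched with pandas on an invented dataset (the column and row counts are arbitrary):

```python
import pandas as pd

# Hypothetical wide dataset: 1000 records, 20 attributes plus a target.
df = pd.DataFrame({f"feat_{i}": range(1000) for i in range(20)})
df["target"] = range(1000)

# Record sampling: keep a random 10% of the rows.
rows = df.sample(frac=0.1, random_state=42)

# Attribute sampling: keep only the columns believed relevant to the target.
reduced = rows[["feat_0", "feat_1", "target"]]

# Aggregation: collapse every 100 detailed records into one summary record.
aggregated = df.groupby(df.index // 100)["feat_0"].mean()

print(reduced.shape, aggregated.shape)
```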

  • Clean the data thoroughly

If you have missing values in your dataset, the prediction accuracy will be lower. So, the best thing to do is to substitute dummy values for the missing ones. You can also use the column mean to substitute missing numerical values. If categorical values are missing as well, use the most frequent items to fill them.
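A minimal sketch of these substitutions, assuming pandas and a made-up table:

```python
import pandas as pd

# Hypothetical dataset with one numerical and one categorical gap.
df = pd.DataFrame({
    "age": [25, None, 40, 35],
    "city": ["Lagos", "Abuja", None, "Lagos"],
})

# Numerical gap: substitute the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical gap: substitute the most frequent value.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```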

Even though the data cleaning step can be done manually, you can choose to automate it with an ML service. Many systems such as Amazon ML, Azure ML and others are available for this task.

  • Decompose the data

At this step, you decompose some of the complex values in the dataset into several parts, which helps you identify relationships that are more specific. Decomposing data is not like reducing data but the opposite: here, you add new attributes to the dataset based on what you already have at hand.

  • Data Rescaling

What you are aiming for is to improve the quality of your dataset. To achieve this, you have to reduce the dimensions and avoid any situation where some of your data values are on a much larger scale than others. This process is a part of data normalization. Approaches you can use at this step include decimal scaling and min-max normalization.
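Min-max normalization, one of the approaches mentioned, can be written in a few lines of plain Python:

```python
# Min-max normalization squeezes every value into the range [0, 1] so
# that no feature dominates simply because its raw scale is larger.
values = [10, 20, 30, 40, 50]

lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]

print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```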

  • Data discretization

If you want your predictions to be more effective, you may need to turn numerical values into categorical ones. All you have to do is divide the values into groups (AltexSoft, Inc. USA, 2017).
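A quick sketch of this grouping with pandas, using invented ages and bin boundaries:

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 43, 68])

# Discretization: divide the numerical values into labelled groups.
groups = pd.cut(ages, bins=[0, 18, 40, 65, 120],
                labels=["child", "young adult", "middle-aged", "senior"])

print(groups.tolist())
```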

Importance of data preparation

Since you are using your dataset for a Machine Learning project, you have to bear in mind that data preparation is very important. Most of the available ML algorithms require data in a specific format if they are to deliver useful insights. Therefore, you must spend quality time preparing the available data to match the required format.

Missing, invalid or otherwise difficult values pose a big problem when the ML algorithm processes the data. Data with missing values will not work for many algorithms, and if your data is invalid, the algorithm will make predictions that are misleading or less accurate.

However, if you prepare your data very well before the algorithm processes it, you can be sure of accurate and more practical model outcomes (Data Robot Inc., 2018).

Data Preparation using DataRobot – DR

Let's say you are finding it difficult to make your data clean and compatible; you can get a little help from DataRobot. Apart from the complexity of data preparation, another reason to seek assistance may be time constraints. In either case, you can use the DR automated ML platform to prepare your data.

DataRobot works with Trifacta to prepare your data. All you have to do is assemble a Machine Learning dataset and import it into DataRobot, which is as easy as dragging and dropping a .csv file or connecting to an SQL database. It doesn't matter whether you are a geek or not; the platform has options that can accommodate your skill level.

As soon as you upload your data to the DataRobot platform, it analyzes it, identifies every variable and generates descriptive statistics such as the median, mean and standard deviation. After the preparation on the platform, any data scientist or analyst can easily work with the data instead of spending a lot of time trying to make sense of it.

Also, DataRobot can help you rectify the issue of missing values by filling them in wherever necessary (Data Robot Inc., 2018).

Model Selection

Selecting a model is always a challenge in a machine learning project. Since the training data used in ML consists of input-output pairs, a model is necessary to predict the output from the input once you fit its adjustable parameters. Prediction is usually done with predictive models such as neural networks, regression trees, kernel methods, linear models and classification models (Clopinet Enterprises, 2006). The issue then is how to select the optimal model, the one that will perform best on your test data.

As I said before, lots of models are available, but the task of choosing one is always intimidating. In some situations, providing a predictor does not end at presenting the best one: you must also be confident about its ability to handle unfamiliar data and produce the expected result. Before you select any model, therefore, make sure that the predictor will meet the specifications of the ML project. If not, you must spend extra time and resources to collect more data or create better models.

Therefore, instead of training one model, you can try many models and select the best of them. Yes, the task can be herculean, but it is better to be accurate the first time. Since you are handling a Machine Learning project, you should remember that the field depends more on empirical results than on theoretical ones. Also, don't forget that it is never easy to know in advance which model will predict the outcomes best (Koehrsen, 2018).

Some simple advice: if you want to try out several models, start with the simple ones you can interpret easily, like the linear regression model. If you are not satisfied with the output after the trial, move on to a more complex model. Complex models often give more accurate results.

To further help your selection process, you can use these statistical methods to interpret the skill of each model during the trial period: statistical hypothesis tests and estimation statistics. Finally, remember to validate the performance of all the models you train on the same sample.
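The try-several-and-compare idea can be sketched with scikit-learn (an assumed library here; the dataset and the two candidates are only illustrative), scoring every candidate on the same sample:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Score every candidate model on the same sample, as recommended above.
results = {}
for model in [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0)]:
    scores = cross_val_score(model, X, y, cv=5)
    results[type(model).__name__] = scores.mean()

print(results)
```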

Model Training

At this step, you have selected a model and want to equip it to make the final predictions on the new data you will input into the system. So how do you train the final model? The model training process involves providing the algorithm with a set of training data, and this training data must contain the target attribute, i.e. the correct answer.

When the algorithm gets the dataset, it processes it and produces a model that can make accurate predictions on new data. The two styles of model training popularly used in Machine Learning are supervised learning and unsupervised learning (Amazon Web Services Inc., 2018).

As I explained in Chapter One, supervised learning feeds labelled data to the system, which then uses it to determine patterns. In unsupervised learning, however, the algorithm learns from unlabelled data. While supervised learning solves regression and classification problems, unsupervised learning centres on clustering, dimensionality reduction and similar problems.
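To make the unsupervised side concrete, here is a small sketch with scikit-learn (assumed available), clustering a standard dataset without ever showing the algorithm its labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # discard the labels on purpose

# Unsupervised learning: the algorithm groups the rows into three
# clusters on its own, without ever seeing a label.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])
```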

Model training process

To get the best result out of your model training, you can follow this simple process;

  1. Input the data source for the training
  2. Name the attribute of the data containing the targeted predictions
  3. Give the instructions for the data transformation
  4. Add the training parameters (Amazon Web Services Inc., 2018).
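The steps above map loosely onto a typical training call; this sketch uses scikit-learn (an assumed library) and invented numbers, with the estimator's defaults standing in for the training parameters:

```python
from sklearn.linear_model import LinearRegression

# 1. The data source: inputs X paired with the target attribute y.
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

# 2-4. Fitting lets the algorithm learn the input-output pattern
# (here with the estimator's default training parameters).
model = LinearRegression().fit(X, y)
prediction = model.predict([[5]])
print(prediction)
```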

Model Training parameters

Some of the training parameters that will control the model learning algorithm are:

  1. Maximum size of the model
  2. Shuffle type
  3. Regularization amount
  4. Maximum passes
  5. Regularization type (Amazon Web Services Inc., 2018).

Model Evaluation

After training your final model, the next step is to evaluate it to be sure that it will give accurate predictions on a new dataset, or on future data as the case may be. Don't forget that the predictions you may require in future may not have a known target. As such, you must be sure to test your model's accuracy using data with already known target answers.

To evaluate your model properly, don't use the whole dataset for the model training exercise. The best thing to do is to split the available data into three parts: use 60% of the data for training, 20% as test data and 20% for validation.

Since you used 60% of the data for the training above, at this stage you will use 20% to test your model. One of the reasons to split data into train/test/validation sets is to make sure that the model does not over-fit the whole dataset. This is why you use 20% during the testing process and keep the remainder for validation, especially if you will use the hold-out method in your evaluation (Jordan, 2017).

One thing you must remember is to shuffle your data properly so that each split accurately represents the entire dataset.
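The shuffle-then-split procedure can be sketched in plain Python on 100 invented records:

```python
import random

# A 60/20/20 train/test/validation split of 100 hypothetical records.
records = list(range(100))
random.seed(0)
random.shuffle(records)  # shuffle so each split represents the whole set

train = records[:60]
test = records[60:80]
validation = records[80:]

print(len(train), len(test), len(validation))  # 60 20 20
```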

To evaluate your model properly, make use of cross-validation and the hold-out method (Sayad, 2018).

Model Tuning

Model tuning aims at maximizing the performance of a Machine Learning model without creating high variance or over-fitting it. One way to accomplish this objective is to select hyper-parameters (HPs) that are appropriate for the model.

Hyper-parameters are the knobs and dials of ML models. To get an accurate model, you must make sure that the HP set you choose is appropriate; the challenge lies in computing them. Bear in mind that hyper-parameters are different from model parameters: you must set the former manually because the model cannot learn them automatically.

To select appropriate hyper-parameters, use one of these three methods: Grid Search, Random Search or Bayesian Optimization (Brooks Anderson, 2017).

During tuning, you are expected to do some trial and error on the hyper-parameters. You keep changing some of the HPs, running the ML algorithm on your data, and comparing the performance on your validation dataset until you finally determine the hyper-parameters that produce the most accurate model.
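Grid Search, the first of the methods mentioned above, automates exactly this loop. A sketch with scikit-learn (an assumed library; the grid values are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try every hyper-parameter combination, compare performance on
# held-out folds, and keep the most accurate setting.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5]},
    cv=5,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```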

The importance of model tuning cannot be overemphasized. This stage helps in customizing the final model to give you an outcome and insight that are accurate and highly valuable for effective decisions.

If you don't want to go through coding and manual tweaking, you can use DataRobot (Data Robot Inc., 2018).


Prediction

At this stage in your Machine Learning project, it is time to use the model for predictions. What this involves is feeding data to the algorithm and allowing it to guess the best answer from the patterns it learned during training.

The approaches and methods used for prediction depend on the problem you aim to solve. Many prediction algorithms exist, but each has its own advantages and disadvantages. Among the available prediction algorithms are linear and logistic regression, tree-based algorithms and neural networks (Temboo Inc., 2017).

However, there are certain things you have to know about predictions. If you are using a simple model such as linear or logistic regression, you can expect it to perform very well on many problems (Magnos Technologies, 2017). Also, while making a prediction, make sure that your method is transparent and that you can interpret the results. The reason is that groups such as your industry, the scientific community and even the public will want to know how you reached your conclusion.
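Closing the loop, here is a minimal prediction sketch with scikit-learn (assumed library and dataset); the probability output is one simple way to keep the result interpretable:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Feed unseen measurements to the fitted model and let it answer.
new_flower = [[5.1, 3.5, 1.4, 0.2]]
label = model.predict(new_flower)[0]
proba = model.predict_proba(new_flower)[0]  # transparency: class probabilities
print(label, proba.round(3))
```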
