Machine Learning Workflow Made Simpler

AshutoshGhattikar
8 min read · Aug 16, 2021


Every day, tons of data are generated by almost every software and hardware product. Wouldn't it be great if something more could be achieved merely by using this data? From suggesting which shows or movies you'd like to watch to predicting whether a person has a disease, all of that can be done with Machine Learning and Deep Learning, which use programming and a whole lot of math under the hood to predict, automate, suggest, or even mimic human work. This article walks through the steps taken in a typical ML project using minimalistic, jargon-free language (and a bunch of relatable meme templates as well).

NOTE: There can be scenarios where certain steps aren't used, or where more steps than these are needed; it all depends on the data and the use case.

from CommonSense import sense_of_humor

To begin with, here are the most common steps which are taken while creating an ML project.

Step 1: Data Collection, also known as Data Mining or Data Scraping

Step 2: Data preprocessing or Data preparation

Step 3: Choosing the right algorithm

Step 4: Splitting the dataset

Step 5: Model evaluation

Step 6: Cross-validation and Hyperparameter tuning

Step 7: Model deployment

Step 1: Data Collection / Data Mining / Data Scraping

“Not all heroes wear capes”

If a person is working for a well-established organization, it's highly unlikely that they have to gather the data on their own; the data will be provided by another team in a form suitable for the next step. In cases where the data does have to be gathered from different sources like articles, websites, or sensors, it can be a tedious job, although tools like BeautifulSoup and Scrapy make it much easier. The data can then be stored in whatever format is required, such as a CSV file or a JSON file, or sent to a server or a database as needed. And this is exactly the initial step of our project (keeping aside the step where we have to understand the problem and its solution, this can be considered the first step).

The aim here is to get raw data from different sources to work with.
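
As a rough illustration, here is a minimal scraping sketch using requests and BeautifulSoup. The URL, the CSS selector, and the column name are hypothetical placeholders, not from any real project:

```python
# A minimal scraping sketch: fetch a page, pull out some text, save it to CSV.
# The URL and the "h2.title" selector are hypothetical placeholders.
import csv

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/articles")
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every matching element on the page
rows = [[tag.get_text(strip=True)] for tag in soup.select("h2.title")]

# Store the raw data as a CSV file for the next step
with open("raw_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    writer.writerows(rows)
```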

Step 2: Data Preprocessing / Data Preparation / Data Cleaning

Spice up the things

Now that we have some data, we need to shape it into a form we can feed to our ML algorithm. Many algorithms work best when the data is numerical, which is why this step is necessary. (Algorithms are essentially pre-written sets of rules that improve the efficiency of the approach we take to solve the problem.)

Processing the raw data involves various math concepts and visualizations that reveal hidden insights about our data. Based on those insights, we carry out practices like dealing with missing values, reducing the dimensions of features, normalizing an imbalanced dataset, finding the most essential features, and discarding the redundant ones. (Features are the factors that have a heavy impact on the output of the problem; for example, the price of a house depends on the number of rooms it has, the floor it is on, and other factors like that.)

Once our data is prepared well, we can expect the ML algorithm to run to its full potential. This step is the most time-consuming and yet the most important step in the project life cycle; if the later steps do not do well, we may have to trace back to this step and carry out the process again. The tools used here are mainly Pandas and NumPy for data manipulation and Matplotlib for visualization. There are other tools for manipulation and visualization as well, but for beginners these three are the common choice, and if we are using a powerful module like Scikit-Learn or TensorFlow, we can expect many of these functionalities in the same tool.

The main aim of this stage is to transform the raw data into reliable data that can be fed to the algorithm.
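
A tiny sketch of what this looks like in practice, assuming a hypothetical "raw_data.csv" with house-related columns like "rooms", "floor", and "listing_id":

```python
# A minimal preprocessing sketch with Pandas and Scikit-Learn.
# The file name and column names are assumptions made for illustration.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("raw_data.csv")

# Deal with missing values: fill numeric gaps with the column median
df["rooms"] = df["rooms"].fillna(df["rooms"].median())

# Discard a redundant feature that adds no information about the price
df = df.drop(columns=["listing_id"])

# Put the numeric features on a comparable scale
scaler = StandardScaler()
df[["rooms", "floor"]] = scaler.fit_transform(df[["rooms", "floor"]])

df.to_csv("clean_data.csv", index=False)
```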

Step 3: Choosing The Right Algorithm

The real unsettling

So far we’ve dealt with the data preparation part, but what are we even going to do with it? Well, we have to provide the cleaned data to the right algorithm, and algorithms vary depending on the problem you are trying to solve and the output you are expecting. Almost every problem can be categorized as either a Classification or a Regression problem, and the algorithms for each differ. The algorithm we use takes in the cleaned data, learns the patterns in that data, and then uses those patterns to predict a value or classify a new input. These algorithms come pre-written in most modules/packages, so we do not have to write a new one from scratch; only in rare cases may we have to write our own algorithm for a specific problem, and there are techniques to pick the right algorithm to work with. Most existing algorithms also expose default settings called hyperparameters, which can be tweaked to gain some extra accuracy.

We choose a suitable algorithm based on our data, the problem, and the output requirement.
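
To make the classification/regression distinction concrete, here is a small sketch with two Scikit-Learn algorithms; the choices themselves (linear and logistic regression) are just examples, not a recommendation:

```python
# Picking an algorithm by problem type (a sketch, not a prescription).
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: the target is a continuous number, e.g. the price of a house
price_model = LinearRegression()

# Classification: the target is a category, e.g. disease / no disease
# max_iter is one of the hyperparameters that can be tweaked later
disease_model = LogisticRegression(max_iter=1000)
```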

Step 4: Splitting The Dataset

I don't even have to caption it

After doing all the above steps, we split the entire dataset into two parts: training data and testing data. Splitting the data may seem unimportant, but it will soon make sense why we do this. We provide the training data to our algorithm, which learns the patterns from it, and then we use the testing data to check the predictions. That tells us how well our algorithm performs on data it wasn't trained on, giving us an accuracy value or a score for how well it can handle new data. There can be problems like underfitting or overfitting while training, which we want to avoid or minimize. The dataset is usually split in a 70–30 ratio (70% for training and 30% for testing), which can be changed as needed. Sometimes step 3 and the current step are swapped.

What more to say about it, you understood it already
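
With Scikit-Learn, the split itself is one function call. The sketch below assumes the hypothetical "clean_data.csv" from the preprocessing sketch, with a "price" column as the target:

```python
# A 70-30 train/test split (the file and column names are assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("clean_data.csv")
X = df.drop(columns=["price"])   # features
y = df["price"]                  # target

# 70% of the rows for training, 30% held back for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```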

Step 5: Model Evaluation

Time for some results!

This step is the most integral part of the cycle: it provides an estimate of how well our model/algorithm can generalize (work on unseen data and provide a reliable output). There are certain evaluation criteria that can tell us a lot about the model we are using; these are called evaluation metrics. Metrics change with the type of requirement: there are some metrics for classification and others for regression-type problems. These metrics are mathematical formulas that produce a score based on values such as the number of right and wrong predictions made by our model.

The most simple and straightforward stage: we want to know how well our model/algorithm can perform on new data (look at the accuracy score).
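
Here is a self-contained evaluation sketch on a synthetic regression dataset, just to show where the metric calls fit in:

```python
# Train a model, predict on unseen data, and score the predictions.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

print("MAE:", mean_absolute_error(y_test, predictions))
print("R2 :", r2_score(y_test, predictions))
# For a classification problem, metrics like accuracy_score or a
# confusion matrix from sklearn.metrics would be used instead.
```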

Step 6: Cross-Validation and Hyperparameter Tuning

well, “Never Settle”

This step is linked with Step 4, where we split the data into training and testing. If our dataset is large enough, and often the data can get really big ( ͡° ͜ ʖ ͡° ), then we can form several groups of training and testing data and carry out the process with different combinations of those sets. This method is referred to as Cross-validation. Sometimes cross-validation is also used with several algorithms at once to find the best-performing one, by comparing each model's evaluation metric score and carrying out the further steps on the winner. If we are working with a particular algorithm, we can also try different combinations of hyperparameters and settle on the combination that yields the best accuracy. This step serves as a backup if step 5 brings poor results (a poor metric score), and it can be skipped if the evaluation metrics from step 5 are satisfactory.

The aim here is to try out different splits of the data and different hyperparameters to gain more accuracy.
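
Scikit-Learn wraps both ideas in ready-made helpers; here is a minimal sketch on a built-in toy dataset (the algorithm and parameter grid are arbitrary examples):

```python
# Cross-validation and hyperparameter tuning in a few lines.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Cross-validation: score the model on 5 different train/test splits
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print("Mean CV accuracy:", scores.mean())

# Hyperparameter tuning: try several values and keep the best-scoring one
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}, cv=5)
grid.fit(X, y)
print("Best hyperparameters:", grid.best_params_)
```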

Step 7: Model Deployment

Endgame!

Finally, once our model has been trained well and has a good metric value, we can go ahead and deploy it for the end-users. This deployment can target a website, a cloud-based application, or a mobile app. Various tools are used in this stage besides the ones we used in the above steps. Once the deployment is done, the model is continuously maintained, and the cycle is repeated if there is any change in the requirements or the data.

Here, we look forward to enabling the end-user to make use of the ML application in an abstract way, hiding all the implementation details (productionizing the ML application).
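
One common (but by no means the only) way to productionize a model is to persist it and put it behind a small web endpoint. The sketch below assumes a model already saved to a hypothetical "model.joblib" file and uses Flask purely as an example:

```python
# A bare-bones serving sketch: load a saved model and expose a /predict endpoint.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical file saved after training

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]     # e.g. [[3, 2, 1200]]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run()
```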

A BIG NOTE: Each of the above steps unfolds a whole different story on its own; this was a simple description of each step without getting too much into the math and the programming part. Each step is pipelined with the next (read more about pipelines in my previous article), and we will look at each of the parts in more detail sometime later.

I have tried my best to condense all these complicated steps and put them in front of you in an easy-to-understand manner. If you really made it until here, a big thanks for going through each of the steps to learn something new or maybe gain some insights.

I really hope this article was good enough to push you toward a more detailed exploration of the topic.

Please feel free to suggest new ideas or correct anything in this article. We will be exploring a whole new topic in a different article. Keep learning something new every day, keep growing! Cheers. 🙌❤️
