Life Cycle of a Data Science Project
In this article, I will walk you through the most common steps that data scientists follow in a data science project. Having worked on a few projects, I will also plug in the methods I follow in each step.
There are many ways to approach a data science project; the steps I mention here are the common ones followed by most people.
1. Business Understanding
The main aim of a data scientist is to solve the business problem given by their manager. For example, you need to do customer segmentation of people visiting a mall.
It isn't easy to work on a dataset unless you have domain knowledge. This step involves asking more questions to understand the data better.
The data scientist needs to be clear about what they are actually trying to do and predict. You might also need to consult domain experts in this step.
2. Data Collection
Data collection is the process of gathering and measuring information on targeted variables in an established system, which then enables one to answer relevant questions and evaluate outcomes.
You need to identify appropriate data sources which will help you in solving your problem. Data engineers might help you in this step by building data warehouses and data marts.
There are many techniques for this, such as web scraping, calling third-party APIs, or reading data from a simple CSV file.
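As a minimal sketch of the simplest case, here is how a CSV could be loaded with pandas. The column names and the in-memory data are made up for illustration (loosely based on the mall example above); in a real project you would pass a file path or URL to `read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical mall-visitor data held in memory so the example is self-contained.
# In practice this would be a file path or URL.
raw_csv = io.StringIO(
    "customer_id,age,annual_income,spending_score\n"
    "1,19,15000,39\n"
    "2,35,54000,81\n"
    "3,52,,14\n"
)

df = pd.read_csv(raw_csv)

# Quick sanity checks right after loading.
print(df.shape)         # (rows, columns)
print(df.dtypes)        # inferred column types
print(df.isna().sum())  # missing values per column
```

Checking the shape, types, and missing values right away usually tells you how much cleaning the next step will need.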
3. Feature Engineering/Data Cleaning
Feature engineering is the process of using domain knowledge to extract features from raw data. The motivation is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to the machine learning process.
Here we also clean the data using various techniques, which makes it easier to choose and train a model later.
Some common steps here are handling missing values, handling imbalanced data, encoding categorical features, removing noise, formatting data, and detecting outliers.
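A few of these steps can be sketched with pandas as follows. The data is hypothetical, and the specific choices (median imputation, the 1.5 × IQR rule, one-hot encoding) are just common defaults, not the only reasonable options.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value, a categorical column, and an outlier.
df = pd.DataFrame({
    "age": [19, 35, 52, 41, 28],
    "annual_income": [48000, 54000, np.nan, 61000, 900000],
    "gender": ["F", "M", "F", "M", "F"],
})

# 1. Missing values: fill with the column median.
df["annual_income"] = df["annual_income"].fillna(df["annual_income"].median())

# 2. Outliers: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = df["annual_income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income_outlier"] = ~df["annual_income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3. Categorical features: one-hot encode the gender column.
df = pd.get_dummies(df, columns=["gender"])

print(df)
```

Here the outliers are only flagged, not dropped. Whether to remove, cap, or keep them depends on the business problem.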
4. Feature Selection and Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations.
Here we plot different kinds of graphs on the data and understand the relationships between different features.
Some common practices are using correlation to identify redundant columns, or applying dimensionality reduction to avoid the curse of dimensionality.
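Both ideas can be sketched in a few lines with pandas and scikit-learn. The data below is synthetic, and the 0.95 correlation threshold and the choice of two principal components are arbitrary values picked for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: "income_copy" is almost a duplicate of "income".
rng = np.random.default_rng(0)
n = 200
income = rng.normal(50, 15, n)
df = pd.DataFrame({
    "income": income,
    "income_copy": income * 1.02 + rng.normal(0, 0.5, n),
    "age": rng.normal(40, 12, n),
    "visits": rng.normal(10, 3, n),
})

# Correlation filter: look only at the upper triangle so each pair is seen once,
# then drop the later column of any pair with |correlation| above the threshold.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)

# Dimensionality reduction: standardize first, then project onto 2 components.
scaled = StandardScaler().fit_transform(reduced)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print("Dropped columns:", to_drop)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

A high correlation between two features means one of them adds little new information, which is why only one column of each such pair is kept.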
5. Model Building and Testing
Here we select the model that best suits our data and makes accurate predictions on it.
The data scientist should know a range of machine learning algorithms and should be able to pick the best one by comparing different models.
We evaluate the model in different ways, such as with a confusion matrix or the ROC AUC score. If the performance is not satisfactory, we need to go back to the earlier steps and try to improve it.
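As a rough sketch of this comparison loop with scikit-learn: the dataset is synthetic, and the two models are arbitrary candidates chosen for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Candidate models to compare.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)               # hard class labels
    y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }
    print(name)
    print(results[name]["confusion_matrix"])
    print("ROC AUC:", round(results[name]["roc_auc"], 3))
```

The confusion matrix is computed from the predicted labels, while ROC AUC needs the predicted probabilities, so both are kept. All models are scored on the same held-out test set so the numbers are comparable.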
6. Deployment and MLOps
Once the data scientist is satisfied with the model's performance, the next step is to deploy it. This step includes building an interface so that end users can use the model for predictions.
For example, a Flask server can be built and deployed to the cloud, and the resulting web service can be used to serve ready-made predictions.
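A bare-bones version of such a Flask server might look like the sketch below. The `/predict` route, the JSON payload shape, and the stand-in model trained on synthetic data are all assumptions made for illustration; a real service would load a saved model and add proper input validation and authentication.

```python
from flask import Flask, jsonify, request
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on synthetic data with 4 features.
# In a real project you would load a previously trained model from disk instead.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [0.1, 0.2, 0.3, 0.4]}.
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    if not isinstance(features, list) or len(features) != 4:
        return jsonify(error="expected 'features' as a list of 4 numbers"), 400

    # Probability of the positive class for this single row.
    probability = float(model.predict_proba([features])[0, 1])
    return jsonify(probability=probability)
```

Saved as `app.py`, this could be started locally with `flask --app app run` and called by sending a POST request with a JSON body to `/predict`.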
MLOps is a relatively new set of practices that helps us maintain the model by building a pipeline around it. It lets us go back and monitor the model, retrain it on new data, or change the algorithm.
Well, this is just an overview of how a data science project works. Every step is an ocean in itself and requires thorough mastery.
Hope you found the article useful. Consider clapping for the article below. 😉
Feel free to reach out for further discussions.