Hello everyone! Today I'll explain the CRISP-DM process model, an essential framework for data science projects.
What is CRISP-DM?
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is the de facto, industry-independent process model for carrying out data mining projects¹. It defines six phases that describe the complete data science lifecycle, from initial business understanding through to deployment.
The Six Phases of CRISP-DM
1. Business Understanding
This phase serves as the foundation for all subsequent phases. The primary goal is to establish a clear understanding of the problem and objectives that will guide all future work.
During this phase, we assess the business situation to gain an overview of available resources, constraints, and requirements. This understanding is critical because it informs all decisions made in the phases that follow.
2. Data Understanding
Before advancing to data preparation, we must thoroughly understand the data we have. This involves:
- Understanding the data structure (time series, dataframes, images, etc.)
- Identifying variable types (binary, continuous, integer, categorical, string, etc.)
- Assessing data quality and characteristics
Without a clear understanding of our data, we cannot effectively move to the next phase and prepare it for modeling.
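The checks listed above can be sketched in a few lines of plain Python. This is only an illustration: the sample `records`, the column names, and the helpers `infer_kind` and `profile` are hypothetical, and the type rules are deliberately coarse. A real project would more likely rely on a dataframe library's own profiling tools.

```python
# Hypothetical sample records; None marks a missing value.
records = [
    {"age": 34, "income": 52000.0, "segment": "retail", "churned": 0},
    {"age": 45, "income": None, "segment": "business", "churned": 1},
    {"age": 29, "income": 48000.5, "segment": "retail", "churned": 0},
    {"age": None, "income": 61000.0, "segment": "retail", "churned": 1},
]


def infer_kind(values):
    """Guess a coarse variable type from the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    if all(v in (0, 1) for v in present):
        return "binary"
    if all(isinstance(v, int) for v in present):
        return "integer"
    if all(isinstance(v, (int, float)) for v in present):
        return "continuous"
    if all(isinstance(v, str) for v in present):
        return "categorical"
    return "mixed"


def profile(rows):
    """Summarize each column: inferred type, missing count, distinct values."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        report[col] = {
            "kind": infer_kind(values),
            "missing": sum(v is None for v in values),
            "distinct": len({v for v in values if v is not None}),
        }
    return report


report = profile(records)
for col, summary in report.items():
    print(col, summary)
```

A summary like this covers two of the three points in the list: variable types and a first look at data quality through missing values.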
3. Data Preparation
Before data can be used for modeling, we must clean and prepare it so that it's ready to be consumed by machine learning algorithms. This is a critical phase because poor data quality directly leads to poor modeling results.
Data preparation involves activities such as handling missing values, removing outliers, feature engineering, and data transformation, all of which make the data fit for modeling.
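As a rough sketch of three of these activities, the snippet below imputes missing values, removes outliers, and rescales a single numeric column. The specific techniques (median imputation, the 1.5 × IQR rule, min-max scaling) are assumptions chosen for illustration, not something CRISP-DM prescribes, and the `raw` values are made up.

```python
import statistics

# Hypothetical raw measurements; None marks a missing value.
raw = [12.0, 15.0, None, 14.0, 13.0, 120.0, 16.0, None, 11.0]


def impute_median(values):
    """Replace missing entries with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def drop_iqr_outliers(values, k=1.5):
    """Keep values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


imputed = impute_median(raw)
cleaned = drop_iqr_outliers(imputed)
scaled = min_max_scale(cleaned)
print(scaled)
```

The order matters here: imputing first means the quartiles are computed on a complete column, and scaling last means the extreme value no longer stretches the range.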
4. Modeling
The modeling phase consists of three main components:
- Selecting the modeling technique: Choose appropriate machine learning algorithms or data mining techniques based on your problem
- Building the model: Train the model on the prepared data and set the parameters it requires
- Testing the model: Evaluate the model using appropriate metrics (accuracy, MSE, RMSE, F1-score, etc.), depending on your problem type
It's important to document and explain your choices throughout this phase.
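The three components can be illustrated with a very small regression example. It is a sketch under stated assumptions: the data is synthetic, the technique (a one-feature least-squares line) and the metric (RMSE) were picked only to keep the code short, and `fit_line` and `rmse` are hypothetical helper names.

```python
import math

# Hypothetical data: y follows 3x + 2 plus a small deterministic wobble.
xs = [float(i) for i in range(40)]
ys = [3.0 * x + 2.0 + ((2 * i) % 5 - 2) * 0.5 for i, x in enumerate(xs)]

# Hold out every fourth point for testing; train on the rest.
train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 != 0]
test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 == 0]


def fit_line(points):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(points, slope, intercept):
    """Root mean squared error of the line's predictions on the given points."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(squared))


slope, intercept = fit_line(train)        # building the model
test_rmse = rmse(test, slope, intercept)  # testing on held-out data
print(f"slope={slope:.2f} intercept={intercept:.2f} test RMSE={test_rmse:.2f}")
```

Testing on points the model never saw during fitting is what makes the reported error a fair estimate; for a classification problem the same structure would apply with accuracy or F1-score in place of RMSE.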
5. Evaluation
In the evaluation phase, we assess whether the model's results align with the business objectives defined in the first phase. Based on this assessment, we determine the next step: refine the model, try different techniques, or proceed to deployment.
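One way to make this decision explicit is to express the business objective as a target value for a metric and compare the model against it. The sketch below assumes a higher-is-better metric such as F1-score; the function name `next_step`, the `tolerance` band, and the numbers are all hypothetical, and in practice the decision usually involves more than a single threshold.

```python
def next_step(metric_value, target, tolerance=0.05):
    """Map an evaluation result to one of three possible next steps."""
    if metric_value >= target:
        return "proceed to deployment"
    if metric_value >= target - tolerance:
        # Close to the target: tuning the current model may be enough.
        return "refine the model"
    # Far from the target: the chosen technique may be a poor fit.
    return "try different techniques"


# Hypothetical business objective: an F1-score of at least 0.80.
for f1 in (0.85, 0.77, 0.60):
    print(f1, "->", next_step(f1, target=0.80))
```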
6. Deployment
Once the model has been validated and approved, it's deployed so that stakeholders and customers can use it. The deployment phase consists of several key steps²:
- Plan Deployment: Develop and document a detailed plan for implementing the model in production
- Plan Monitoring and Maintenance: Establish procedures for monitoring and maintaining the model to prevent issues during the operational phase
- Produce Final Report: Write a final report that summarizes the project's results and findings
- Review Project: Conduct a retrospective on what was accomplished, allowing the team to identify lessons learned and areas for improvement in future projects
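To make the monitoring step more concrete, here is a minimal sketch of one possible check: track the model's recent prediction errors and raise a flag when they grow well beyond the error measured before deployment. The class `ErrorMonitor`, its parameters (`baseline_rmse`, `window`, `factor`), and the sample values are hypothetical; a real monitoring plan would also cover input data drift, latency, and retraining triggers.

```python
import math
from collections import deque


class ErrorMonitor:
    """Track recent prediction errors and flag a sustained increase."""

    def __init__(self, baseline_rmse, window=5, factor=1.5):
        self.baseline_rmse = baseline_rmse
        self.factor = factor
        # Only the most recent `window` squared errors are kept.
        self.squared_errors = deque(maxlen=window)

    def record(self, y_true, y_pred):
        self.squared_errors.append((y_true - y_pred) ** 2)

    def rolling_rmse(self):
        if not self.squared_errors:
            return 0.0
        return math.sqrt(sum(self.squared_errors) / len(self.squared_errors))

    def needs_attention(self):
        # Wait for a full window so one bad prediction does not raise a flag.
        window_full = len(self.squared_errors) == self.squared_errors.maxlen
        return window_full and self.rolling_rmse() > self.factor * self.baseline_rmse


monitor = ErrorMonitor(baseline_rmse=1.0)

# Predictions close to the actual values: no alert expected.
for actual, predicted in [(10, 10.5), (12, 11.5), (9, 9.5), (11, 10.5), (10, 10.5)]:
    monitor.record(actual, predicted)
healthy = monitor.needs_attention()

# Predictions that are consistently off by 3: the alert should fire.
for actual, predicted in [(10, 13), (12, 15), (9, 12), (11, 14), (10, 13)]:
    monitor.record(actual, predicted)
degraded = monitor.needs_attention()

print(healthy, degraded)
```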
Conclusion
CRISP-DM is a comprehensive process model for data mining consisting of six interconnected phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. By following this framework, data science teams can take a structured, repeatable, and effective approach to their projects.
References
[1] Schroer, Christoph, et al. 2021. "A Systematic Literature Review on Applying CRISP-DM Process Model" https://www.sciencedirect.com/science/article/pii/S1877050921002416
[2] Data Science Process Alliance. "CRISP DM" https://www.datascience-pm.com/crisp-dm-2/