Data Validation for Data Science

Room:: Wicklow Hall 2A
Start (Dublin time):: 13:45 on 12 July 2022
Duration:: 180 minutes

Abstract

Have you ever worked really hard on choosing the best algorithm, tuned the parameters to perfection, and built awesome feature engineering methods, only to have everything break because of a null value? Then this tutorial is for you! Data validation is often neglected in data science projects. In this tutorial, we will demonstrate the importance of implementing data validation for data science in commercial, open-source, and even hobby projects. We will then dive into some of the open-source tools available for validating data in Python and learn how to use them so that edge cases do not break our models. The open-source Python community will come to our aid, and we will explore wonderful packages such as Pydantic for defining data models, Pandera for complementing the use of Pandas, and Great Expectations for diving deep into the data. This tutorial will benefit anyone working on data projects in Python who wants to learn about data validation. Some Python programming experience and an understanding of data science are required. The examples and the context of the discussion are centred on data science, but the knowledge can be applied in any Python-oriented project.

Tutorial
PyData: Data Engineering

Description

For this tutorial you will need either a Google account to use Google Colaboratory, or a Python 3.8 or later environment with Jupyter installed. We will go through the hands-on exercises together in Jupyter notebooks. The context of the tutorial is a standard data science project with the common architecture of data ingestion, feature engineering, model training, model serving, etc.

In the first part of the tutorial, we will go through the common pitfalls where unexpected data values can hurt model performance or, even worse, break the run altogether. In light of these potential consequences, we will discuss the importance of data validation.

In the second part of the tutorial, we will dive into some of the open-source tools in the Python community that can help us with the validation task:

Pydantic - for defining data models, types, and simple checks.
Pandera - used on top of Pandas DataFrames for schema validation.
Great Expectations - a framework for data testing, quality, and profiling.

GitHub repository: https://github.com/NatanMish/data_validation
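To make the motivation concrete, here is a minimal sketch of the kind of check the first part argues for, written with only the Python standard library. It is not taken from the tutorial repository: the `HouseRecord` class, its fields, and the `validate_rows` helper are hypothetical names chosen for illustration. Libraries such as Pydantic let you declare rules like these instead of writing them by hand.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class HouseRecord:
    """Hypothetical input row for a price model (illustrative names only)."""

    area_sqm: Optional[float]
    bedrooms: Optional[int]

    def __post_init__(self) -> None:
        # Reject missing values before they reach feature engineering.
        if self.area_sqm is None:
            raise ValueError("area_sqm is missing")
        if isinstance(self.area_sqm, float) and math.isnan(self.area_sqm):
            raise ValueError("area_sqm is NaN")
        if self.bedrooms is None:
            raise ValueError("bedrooms is missing")
        # Simple range checks catch impossible values early.
        if self.area_sqm <= 0:
            raise ValueError("area_sqm must be positive")
        if self.bedrooms < 0:
            raise ValueError("bedrooms must not be negative")


def validate_rows(rows):
    """Split raw dicts into valid records and (row index, error message) pairs."""
    valid, errors = [], []
    for i, row in enumerate(rows):
        try:
            valid.append(HouseRecord(**row))
        except (TypeError, ValueError) as exc:
            # TypeError covers rows with missing or unexpected keys.
            errors.append((i, str(exc)))
    return valid, errors


raw_rows = [
    {"area_sqm": 85.0, "bedrooms": 3},
    {"area_sqm": None, "bedrooms": 2},   # the null value that breaks a pipeline
    {"area_sqm": 40.0, "bedrooms": -1},  # an impossible value
]
valid, errors = validate_rows(raw_rows)
print(len(valid), "valid rows;", "rejected:", errors)
```

Collecting the errors instead of raising on the first one is a design choice: it lets an ingestion step report every bad row at once and decide whether to drop, repair, or halt.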

The speaker

Natan Mish

Senior Machine Learning Engineer at Zimmer Biomet, the world's leading orthopaedic medical devices company. London School of Economics graduate with an MSc in Applied Social Data Science. Passionate about using machine learning to solve complicated problems. I have experience analysing, researching and building data products in the financial, real estate, transportation and healthcare industries. Curious about (almost) everything and always happy to take on new experiences and challenges.