Steps for exploratory data analysis before feature engineering on a dataset given for machine learning modelling
In this post you will learn how to use the following Python utilities for exploratory data analysis.
- dataset.info() for missing data
- dataset.isnull().sum() for a per-column count of missing values
- sns.countplot() to plot a bar chart of counts showing the relation between 2 variables/factors
- sns.distplot() to plot the probability distribution of each variable in the dataset
- sns.FacetGrid() and sns.distplot() together to see the relationship among three or more variables/factors in one go
- sns.heatmap() to visualize the correlation between all the numeric variables/factors in the dataset
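As a minimal sketch of the two missing-data checks listed above, the snippet below runs them on a small toy DataFrame. The column names and values are invented purely for illustration and are not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, made up for illustration only
dataset = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "embarked": ["S", "C", "S", None, "Q"],
})

# info() prints each column's dtype and non-null count,
# so columns with fewer non-null entries than rows stand out
dataset.info()

# isnull().sum() gives the number of missing values per column
missing_counts = dataset.isnull().sum()
print(missing_counts)

# The same check expressed as a percentage of rows
missing_pct = dataset.isnull().mean() * 100
print(missing_pct.round(1))
```

Looking at the percentage alongside the raw count can make it easier to judge whether a column is worth imputing or should be dropped.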
For complex plots and visualization, Seaborn is well suited to data cleansing and exploratory analysis. Seaborn, like the plotting functions in Pandas, is built on top of matplotlib. In my experience, the Python utilities above are good enough to perform a minimal quality check and exploratory data analysis before embarking on feature engineering or building a machine learning model.
For more Seaborn example scripts, please visit: https://seaborn.pydata.org/examples/index.html
The example below shows how I used Seaborn in Python for visualization, data quality check and…