Exploratory Data Analysis
(EDA)
Every machine learning project begins with data.
Before training any model, the first essential step is:
Exploratory Data Analysis (EDA)
EDA is the process of understanding the dataset before making
modeling decisions.
Why EDA Comes First
In this project, datasets were already prepared.
In real-world scenarios, you typically:
- Download or collect a dataset
- Inspect its structure
- Clean and transform it
- Define the prediction target
- Choose a modeling approach
Modeling decisions should never come before understanding the
data.
What EDA Involves
EDA helps answer fundamental questions:
- What does the dataset contain?
- How many samples and features exist?
- Are there missing values?
- Are there outliers?
- What type of variables are present?
- Is the data balanced?
Without these answers, model selection is arbitrary.
Strategic Decisions During
EDA
EDA is not only descriptive — it is strategic.
During this phase, you may decide to:
- Remove irrelevant columns
- Create new features (feature engineering)
- Transform variables
- Encode categorical variables
- Handle missing values
- Detect and treat outliers
These decisions directly impact model performance.
Choosing the Target
Variable
EDA also helps determine:
- What should be predicted?
- Is the problem regression or classification?
- Is the target well defined?
- Is the dataset suitable for supervised learning?
Sometimes the label must be created from raw data.
Data Understanding
Before Model Choice
The type of data influences:
- The learning paradigm (supervised vs unsupervised)
- The evaluation metric
- The preprocessing strategy
- The model family to use
Example:
- Strong linear relationships → linear models may work well
- Complex nonlinear patterns → more flexible models may be
required
EDA vs Feature
Engineering
EDA focuses on understanding the dataset.
Feature engineering focuses on modifying the dataset to improve
learning.
In practice, these two processes often overlap.
Key Principle
Machine learning does not start with models.
It starts with understanding the data.
Well-executed EDA reduces:
- unnecessary experimentation
- incorrect model selection
- poor evaluation choices
A strong data foundation leads to stronger models.