Data Exploration and Cleaning

0. Learning Objectives

  • Understand the importance of exploring data and its structure before using it.
  • Explain the significance of understanding variable distribution and describe ways to visualize the distribution.
  • Look for outliers, correlated variables, and the overall distribution of data; describe why these factors are important.
  • Understand the differences between common dimensionality reduction strategies. Explain why we would use them.
  • Explain the concepts of data cleaning and imputation.
  • Explain the implications of using various imputation strategies.
  • Give examples of why health data may need cleaning.
  • Describe strategies for working with missing data, and identify the three main types of missing data.

1. Data Exploration

Initial Data Understanding

  • Assess whether the data are appropriate for the research question before starting
  • Consider three dimensions of variables:
    • Role (Predictor vs Target)
    • Data type (Strings, Dates, Numbers)
    • Nature (Categorical vs Continuous)

Data Structure Assessment

  • Check data point quantity
  • Verify variable data types
  • Validate value ranges
  • Assess data completeness
  • Use pandas for type conversion when needed (see the sketch below)
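
A minimal sketch of these checks with pandas (the file name and the columns 'age' and 'admit_date' are hypothetical):

    import pandas as pd

    df = pd.read_csv('data.csv')   # hypothetical input file

    df.shape                       # how many rows and columns
    df.dtypes                      # data type of each column
    df.describe()                  # value ranges for numeric columns
    df.isnull().sum()              # completeness: missing values per column

    # Type conversion when a column was read with the wrong dtype
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    df['admit_date'] = pd.to_datetime(df['admit_date'], errors='coerce')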

Univariate Analysis

  1. Categorical Variables
    • Use value_counts()
    • Visualize with bar plots
  2. Continuous Variables
    • Basic statistics (mean, median, mode, std, min, max)
    • Visualizations:
      • Histograms
      • Box plots (show median, quartiles, outliers)
      • Violin plots (distribution shape)
      • Q-Q plots (visual check for normality)
  3. Data Standardization
    • Z-score standardization: $z_i = \frac{x_i - \bar{x}}{s}$ (mean 0, std 1)
    • Min-max scaling: $x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}$ (range 0 to 1); both are sketched below
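
A minimal sketch of these univariate steps, continuing with the df from the sketch above (the columns 'sex' and 'age' are hypothetical; plotting assumes matplotlib is installed):

    # Categorical variable: frequency table and bar plot
    df['sex'].value_counts()
    df['sex'].value_counts().plot(kind='bar')

    # Continuous variable: basic statistics and distribution plots
    df['age'].describe()           # count, mean, std, min, quartiles, max
    df['age'].plot(kind='hist')    # histogram
    df['age'].plot(kind='box')     # box plot

    # Z-score standardization (mean 0, std 1)
    z = (df['age'] - df['age'].mean()) / df['age'].std()

    # Min-max scaling (range 0 to 1)
    mm = (df['age'] - df['age'].min()) / (df['age'].max() - df['age'].min())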

Bivariate/Multivariate Analysis

  • Scatter plots
  • Correlation analysis (Pearson correlation coefficient)
  • Watch for outliers, which can distort correlation estimates (see the sketch below)
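
A minimal sketch, continuing with hypothetical numeric columns 'age' and 'bmi':

    # Scatter plot of two continuous variables
    df.plot(kind='scatter', x='age', y='bmi')

    # Pearson correlation between two columns
    df['age'].corr(df['bmi'], method='pearson')

    # Pairwise correlation matrix over all numeric columns
    df.corr(numeric_only=True)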

Dimensionality Reduction

  1. Linear Methods
    • PCA (Principal Component Analysis): Unsupervised, linear
      • Idea: find directions of maximum variance in high-dimensional data and project data onto these directions
    • LDA (Linear Discriminant Analysis): Supervised, linear
      • Idea: find directions that maximize class separability
  2. Non-linear Methods
    • t-SNE: Unsupervised, non-linear
      • Idea: map high-dimensional data to a 2D space for visualization, preserving local neighborhood structure
    • UMAP: Unsupervised, non-linear
      • Idea: similar to t-SNE, but faster and more scalable (a minimal sketch of all four methods follows)
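
A minimal sketch of these methods on toy data, using scikit-learn (UMAP comes from the third-party umap-learn package):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))      # toy feature matrix
    y = rng.integers(0, 2, size=100)    # toy binary class labels

    X_pca = PCA(n_components=2).fit_transform(X)     # linear, unsupervised
    # LDA allows at most (number of classes - 1) components
    X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
    X_tsne = TSNE(n_components=2).fit_transform(X)   # non-linear, unsupervised

    # UMAP has a similar interface:
    # import umap
    # X_umap = umap.UMAP(n_components=2).fit_transform(X)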

2. Data Cleaning

Common Data Issues

  1. Non-standardized Data
    • Inconsistent formats
    • Different terminologies
    • Use standard ontologies (MeSH, RxNorm, UMLS, etc.)
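
    Example (a minimal sketch of basic label harmonization; the mapping is hypothetical and not a substitute for a real ontology lookup):

    # Normalize case and whitespace, then map variants to one term
    df['sex'] = df['sex'].str.strip().str.lower()
    df['sex'] = df['sex'].replace({'m': 'male', 'f': 'female'})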
  2. Structure Issues
    • Apply tidy data principles:
      • One variable per column
      • One observation per row
      • One value per cell
    • Use melting to reshape wide data into long (tidy) format

    Example:

    import pandas as pd

    df = pd.DataFrame({
        'id': [1, 1, 1],
        'name': ['Alice', 'Alice', 'Alice'],
        'math': [90, 85, 95],
        'science': [85, 90, 88],
        'english': [95, 92, 93]
    })
    # id name  math science english
    # 1  Alice 90   85      95
    # 1  Alice 85   90      92
    # 1  Alice 95   88      93
    df.melt(id_vars=['id', 'name'], value_vars=['math', 'science', 'english'])
    

    Output:

    id name  variable value
    1  Alice math     90
    1  Alice math     85
    1  Alice math     95
    1  Alice science  85
    1  Alice science  90
    1  Alice science  88
    1  Alice english  95
    1  Alice english  92
    1  Alice english  93
    
  3. Duplicate Data
    • Check for exact duplicates
    • Verify whether apparent duplicates are legitimate (e.g., repeated visits or measurements)
    • Use drop_duplicates() with appropriate criteria
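
    Example (a minimal sketch; the subset columns are hypothetical):

    df.duplicated().sum()       # count exact duplicate rows
    df = df.drop_duplicates()   # drop exact duplicates
    # Only treat rows as duplicates when key columns match
    df = df.drop_duplicates(subset=['patient_id', 'visit_date'], keep='first')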
  4. Incorrect Values
    • Define valid ranges
    • Check categorical values
    • Look for inconsistencies
    • Examine outliers
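
    Example (a minimal sketch; the columns and valid ranges are hypothetical):

    # Numeric values outside a plausible range
    bad_age = df[(df['age'] < 0) | (df['age'] > 120)]

    # Categorical values outside the allowed set
    bad_sex = df[~df['sex'].isin({'male', 'female'})]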

Missing Data

Types of Missing Data

  1. MCAR (Missing Completely At Random): Reason for missing data is unrelated to any variables in the dataset, observed or missing (e.g., a sample lost by accident).
  2. MAR (Missing At Random): Reason for missing data is related to other observed variables in the dataset, but not to the missing value itself.
  3. MNAR (Missing Not At Random): Reason for missing data is related to the missing value itself (e.g., higher values are more likely to go unreported).

Handling Missing Data

  1. Detection
    • Use isnull()/notnull()
    • Apply Little’s Test for MCAR
    • Check for MAR patterns
  2. Solutions
    • Drop rows (listwise/pairwise deletion)
    • Drop fields (columns)
    • Imputation methods (see the sketch after this list):
      • Deductive imputation
      • Multiple imputation
      • FIML (Full Information Maximum Likelihood)
      • EM (Expectation Maximization)
  3. Avoid (single-value fixes that distort variance or understate uncertainty)
    • Simple mean/median replacement
    • Single deterministic regression imputation
    • Single stochastic regression imputation
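
A minimal sketch of detection plus one model-based option, scikit-learn's IterativeImputer (an EM/MICE-style iterative approach; the columns are hypothetical, and Little's Test is not included in pandas or scikit-learn):

    import numpy as np
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    df = pd.DataFrame({'age': [34, np.nan, 51, 29],
                       'bmi': [22.1, 27.4, np.nan, 24.3]})

    # Detection: missing values per column
    df.isnull().sum()

    # Each feature with missing values is modeled from the other features;
    # sample_posterior=True draws stochastic imputations, so running this
    # several times and pooling the results approximates multiple imputation
    imputer = IterativeImputer(sample_posterior=True, random_state=0)
    df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)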