Lesson 3 of 4

Pandas Data Manipulation

Learn to load, clean, transform, and analyze data using Pandas DataFrames and Series for real-world data analysis.

35 minutes

Pandas Data Manipulation

Pandas is Python's premier library for data manipulation and analysis, providing powerful data structures and operations.

What is Pandas?

  • DataFrames: 2D labeled data structures (like spreadsheets)
  • Series: 1D labeled arrays
  • Data I/O: Read/write CSV, Excel, SQL, JSON
  • Data Cleaning: Handle missing data, duplicates, transformations

Creating DataFrames

Multiple ways to create DataFrames:

import pandas as pd

# From dictionary
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})

# From CSV
df = pd.read_csv('data.csv')

Data Selection

Access rows and columns efficiently:

  • df['column']: Select column
  • df[['col1', 'col2']]: Select multiple columns
  • df.loc[]: Label-based selection
  • df.iloc[]: Position-based selection
  • Boolean indexing: Filter rows by conditions

Data Cleaning

Handle real-world messy data:

  • dropna(): Remove missing values
  • fillna(): Fill missing values
  • drop_duplicates(): Remove duplicate rows
  • replace(): Replace values
  • astype(): Convert data types

Data Transformation

Modify and create new data:

  • apply(): Apply functions to rows/columns
  • map(): Transform values in a Series
  • groupby(): Group data for aggregation
  • merge(): Join DataFrames
  • pivot_table(): Reshape data

Common Operations

# Descriptive statistics
df.describe()

# Group and aggregate
df.groupby('category')['sales'].sum()

# Sort values
df.sort_values('date', ascending=False)

Best Practices

  • Use vectorized operations instead of loops
  • Check data types after loading
  • Handle missing data appropriately
  • Use method chaining for cleaner code
  • Set meaningful column names

Code Example

import pandas as pd
import numpy as np

# Creating a DataFrame
sales_data = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04'],
    'product': ['A', 'B', 'A', 'C'],
    'quantity': [10, 15, 12, 8],
    'price': [100, 150, 100, 200],
    'region': ['East', 'West', 'East', 'North']
})

# Convert date column to datetime
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Add calculated column
sales_data['revenue'] = sales_data['quantity'] * sales_data['price']

print("Sales Data:")
print(sales_data)

# Data selection
print("\nProduct A sales:")
print(sales_data[sales_data['product'] == 'A'])

# Aggregation
print("\nTotal revenue by product:")
print(sales_data.groupby('product')['revenue'].sum())

print("\nAverage quantity by region:")
print(sales_data.groupby('region')['quantity'].mean())

# Descriptive statistics
print("\nStatistics:")
print(sales_data[['quantity', 'price', 'revenue']].describe())

# Working with missing data
customer_data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, np.nan, 35, 28],
    'email': ['alice@email.com', 'bob@email.com', None, 'david@email.com'],
    'purchases': [5, 3, np.nan, 7]
})

print("\nMissing values:")
print(customer_data.isnull().sum())

# Fill missing values
customer_data['age'].fillna(customer_data['age'].mean(), inplace=True)
customer_data['purchases'].fillna(0, inplace=True)

print("\nAfter filling missing values:")
print(customer_data)