Course Lessons
Lesson 3 of 4
Pandas Data Manipulation
Learn to load, clean, transform, and analyze data using Pandas DataFrames and Series for real-world data analysis.
35 minutes
Pandas Data Manipulation
Pandas is Python's premier library for data manipulation and analysis, providing powerful data structures and operations.
What is Pandas?
- DataFrames: 2D labeled data structures (like spreadsheets)
- Series: 1D labeled arrays
- Data I/O: Read/write CSV, Excel, SQL, JSON
- Data Cleaning: Handle missing data, duplicates, transformations
Creating DataFrames
Multiple ways to create DataFrames:
import pandas as pd
# From dictionary
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['NYC', 'LA', 'Chicago']
})
# From CSV
df = pd.read_csv('data.csv')
Data Selection
Access rows and columns efficiently:
- df['column']: Select column
- df[['col1', 'col2']]: Select multiple columns
- df.loc[]: Label-based selection
- df.iloc[]: Position-based selection
- Boolean indexing: Filter rows by conditions
Data Cleaning
Handle real-world messy data:
- dropna(): Remove missing values
- fillna(): Fill missing values
- drop_duplicates(): Remove duplicate rows
- replace(): Replace values
- astype(): Convert data types
Data Transformation
Modify and create new data:
- apply(): Apply functions to rows/columns
- map(): Transform values in a Series
- groupby(): Group data for aggregation
- merge(): Join DataFrames
- pivot_table(): Reshape data
Common Operations
# Descriptive statistics
df.describe()
# Group and aggregate
df.groupby('category')['sales'].sum()
# Sort values
df.sort_values('date', ascending=False)
Best Practices
- Use vectorized operations instead of loops
- Check data types after loading
- Handle missing data appropriately
- Use method chaining for cleaner code
- Set meaningful column names
Code Example
import pandas as pd
import numpy as np
# Creating a DataFrame
sales_data = pd.DataFrame({
'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04'],
'product': ['A', 'B', 'A', 'C'],
'quantity': [10, 15, 12, 8],
'price': [100, 150, 100, 200],
'region': ['East', 'West', 'East', 'North']
})
# Convert date column to datetime
sales_data['date'] = pd.to_datetime(sales_data['date'])
# Add calculated column
sales_data['revenue'] = sales_data['quantity'] * sales_data['price']
print("Sales Data:")
print(sales_data)
# Data selection
print("\nProduct A sales:")
print(sales_data[sales_data['product'] == 'A'])
# Aggregation
print("\nTotal revenue by product:")
print(sales_data.groupby('product')['revenue'].sum())
print("\nAverage quantity by region:")
print(sales_data.groupby('region')['quantity'].mean())
# Descriptive statistics
print("\nStatistics:")
print(sales_data[['quantity', 'price', 'revenue']].describe())
# Working with missing data
customer_data = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, np.nan, 35, 28],
'email': ['alice@email.com', 'bob@email.com', None, 'david@email.com'],
'purchases': [5, 3, np.nan, 7]
})
print("\nMissing values:")
print(customer_data.isnull().sum())
# Fill missing values
customer_data['age'].fillna(customer_data['age'].mean(), inplace=True)
customer_data['purchases'].fillna(0, inplace=True)
print("\nAfter filling missing values:")
print(customer_data)