Table of contents
- 🔰Introduction
- 🔰What is a CSV File?
- 🔰Why use Pandas to Read CSV Files?
- 🔰CSV & Pandas
- 🔰 Conclusion
🔰Introduction
In the world of data analysis, CSV (comma-separated values) files are the bread and butter of data exchange. Whether you are dealing with financial records, survey results, or large datasets scraped from the web, CSV files provide a lightweight and structured way to store data.
If you are working with Python, the Pandas library is your go-to tool. In this guide, we will walk through how to handle CSV files efficiently with Pandas, along with tips to optimize the process.
🔰What is a CSV File?
A CSV file is essentially a text file where values are separated by commas (,) or other delimiters like semicolons (;) or tabs (\t). Each row in a CSV file represents a record, and each column represents a field of that record.
For example, consider a sample CSV file named students.csv :
ID,Name,Age,Grade
101,Bhuban Bam,18,A
102,Aditya Gupta,19,B
103,Viru Singh,20,A
Each line contains comma-separated values, making it easy to process and manipulate in Python.
🔰Why use Pandas to Read CSV Files?
While Python has a built-in csv module, Pandas provides a much more powerful and flexible way to read, process, and analyze CSV files, often with just a single line of code (a quick side-by-side sketch follows the checklist below). Pandas makes it easy to:
✅ Load large datasets efficiently
✅ Handle missing values automatically
✅ Select specific rows/columns effortlessly
✅ Apply transformations and filtering
✅ Export processed data to new CSV files
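For a quick feel of the difference, here is a minimal sketch (using the students.csv sample from above) that reads the same file with the built-in csv module and with Pandas:
import csv
import pandas as pd

# Built-in csv module: you get plain lists of strings and loop manually
with open('students.csv', newline='') as f:
    rows = list(csv.reader(f))
print(rows[1])           # ['101', 'Bhuban Bam', '18', 'A'] -- every value is a string

# Pandas: one call gives a typed, labeled DataFrame
df = pd.read_csv('students.csv')
print(df['Age'].mean())  # numeric operations work right away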
Now, let’s dive into how to import CSV files and work with them using Pandas like a pro!
🔰CSV & Pandas
1. Importing a CSV file using Pandas
Before we start, make sure you have Pandas installed in your Python environment:
pip install pandas
Now, let’s load a CSV file into a Pandas DataFrame (which is a table-like data structure used for analysis).
Basic CSV Import
import pandas as pd
# Read the CSV file
df = pd.read_csv('students.csv')
# Display the first 5 rows
print(df.head())
Explanation
pd.read_csv('students.csv') loads the CSV file into a Pandas DataFrame.
df.head() displays the first five rows to check the data.
Output
ID Name Age Grade
0 101 Bhuban Bam 18 A
1 102 Aditya Gupta 19 B
2 103 Viru Singh 20 A
Boom! 🚀 With just one line of code, you’ve successfully imported a CSV file and displayed its contents.
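A few quick inspection calls are also handy right after loading. A small sketch using the same df:
print(df.shape)    # (3, 4) -- number of rows and columns
print(df.dtypes)   # data type Pandas inferred for each column
df.info()          # column names, non-null counts, and memory usage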
2. Handling Different Delimiters in CSV Files
Not all CSV files use commas to separate values. Some use semicolons (;) or tabs (\t) instead.
Reading a CSV file with Semicolon delimiter and Tab delimiter
df = pd.read_csv('students_semicolon.csv', delimiter=';')
ab = pd.read_csv('students_tab.txt', delimiter='\t')
If you try reading a CSV with the wrong delimiter, Pandas might load each entire row into a single column instead of separating the values.
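If you are not sure which delimiter a file uses, Pandas can try to sniff it for you. A minimal sketch (students_unknown.csv is a hypothetical file name):
import pandas as pd

# sep=None asks Pandas to infer the delimiter automatically;
# this requires the slower Python parsing engine
df = pd.read_csv('students_unknown.csv', sep=None, engine='python')
print(df.head())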
3. Skipping Rows and Handling Headers
Sometimes, CSV files contain unnecessary metadata or descriptions in the first few rows. Pandas allows us to skip rows while reading data.
Skipping the First Two Rows
df = pd.read_csv('students.csv', skiprows=2)
If your file has no headers, you can manually define column names:
df = pd.read_csv('students.csv', header=None, names=['ID', 'Name', 'Age', 'Grade'])
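As a concrete sketch, suppose a hypothetical file students_with_notes.csv starts with two metadata lines before the real header:
import pandas as pd

# students_with_notes.csv (hypothetical) begins with:
#   Exported by the school system
#   Generated for the 2024 term
#   ID,Name,Age,Grade
#   101,Bhuban Bam,18,A

# Skip the first two lines; the third line then becomes the header
df = pd.read_csv('students_with_notes.csv', skiprows=2)

# skiprows also accepts a list of specific row indices to drop
df = pd.read_csv('students_with_notes.csv', skiprows=[0, 1])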
4. Selecting Specific Columns While Importing
Sometimes, CSV files contain a large number of columns, but we only need a few for analysis. Instead of loading the entire file (which can be slow and memory-intensive), we can select specific columns using the usecols
parameter in pd.read_csv()
.
Why is Selecting Columns Important?
Saves memory and processing time, especially for large datasets.
Avoids clutter by focusing only on relevant data.
Let’s say we have a CSV file, students.csv, with the following data:
ID,Name,Age,Grade,Address,Email,Phone
101,Kanika,18,A,New York,kanika@example.com,1234567890
102,Amiliya,19,B,Los Angeles,amiliya@example.com,9876543210
103,Deepika,20,A,Chicago,deepika@example.com,1122334455
If we only need the ID, Name, and Grade columns, we can ignore the rest while importing the data.
import pandas as pd
df = pd.read_csv('students.csv', usecols=['ID', 'Name', 'Grade'])
print(df.head())
OUTPUT
ID Name Grade
0 101 Kanika A
1 102 Amiliya B
2 103 Deepika A
Additional Notes
The usecols parameter accepts a list of column names or column indexes. If you have column indexes instead of names, you can use:
df = pd.read_csv('students.csv', usecols=[0, 1, 3]) # Selects ID, Name, and Grade
🔹 Best Practice: Always use usecols when dealing with large CSV files to reduce memory usage and improve speed.
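usecols also accepts a function that receives each column name and returns True for the ones to keep, which is handy when you know a pattern rather than exact positions. A small sketch against the students.csv shown above:
import pandas as pd

# Keep only the columns whose names appear in a whitelist
wanted = {'ID', 'Name', 'Grade'}
df = pd.read_csv('students.csv', usecols=lambda col: col in wanted)
print(df.columns.tolist())  # ['ID', 'Name', 'Grade']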
5. Handling Missing Values in CSV Files
Real-world datasets are rarely perfect. Missing values can occur due to human error, data corruption, or incomplete entries. Pandas provides several powerful ways to handle missing data.
Identifying Missing Values
Let’s assume we have a CSV file, students_missing.csv:
ID,Name,Age,Grade
101,Kanika,18,A
102,Puja,,B
103,David,20,
104,,21,C
When we load it into Pandas, we can check for missing values:
df = pd.read_csv('students_missing.csv')
print(df.isnull().sum()) # Count missing values per column
Output
ID 0
Name 1
Age 1
Grade 1
dtype: int64
Now, let’s explore different ways to handle missing values.
Method 1: Replacing Missing Values
We can replace missing values with a specific default value:
df.fillna("Unknown", inplace=True) # Replace NaN with 'Unknown'
🔹 Example Output:
ID Name Age Grade
0 101 Kanika 18 A
1 102 Puja Unknown B
2 103 David 20 Unknown
3 104 Unknown 21 C
Method 2: Removing Rows with Missing Values
If missing data is minimal, we can simply remove rows:
df.dropna(inplace=True) # Drops all rows with missing values
🔹 Output:
ID Name Age Grade
0 101 Kanika 18 A
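If you only care about missing values in particular columns, dropna can be narrowed down. A small sketch, assuming df has just been reloaded from students_missing.csv:
# Drop a row only when its 'Name' is missing; missing Age or Grade is kept
df = df.dropna(subset=['Name'])

# Or keep only rows that have at least 3 non-missing values
df = df.dropna(thresh=3)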
Method 3: Filling Missing Values with Mean/Median/Mode
If the missing values are numerical, we can use statistical imputation:
df['Age'].fillna(df['Age'].mean(), inplace=True) # Replace with column mean
df['Age'] → Accesses the "Age" column from the DataFrame.
df['Age'].mean() → Calculates the mean (average) of the "Age" column, ignoring NaN values.
fillna(df['Age'].mean()) → Replaces all NaN values with the calculated mean.
inplace=True → Modifies the DataFrame directly instead of returning a new one.
Before Handling Missing Values:
import pandas as pd
data = {'ID': [101, 102, 103, 104],
        'Name': ['Beladona', 'Archis', 'Eshan', 'Debika'],
        'Age': [18, None, 20, None]}  # Two missing values in 'Age'
df = pd.DataFrame(data)
print(df)
Output:
ID Name Age
0 101 Beladona 18.0
1 102 Archis NaN
2 103 Eshan 20.0
3 104 Debika NaN
After Filling Missing Values with the Mean:
df['Age'].fillna(df['Age'].mean(), inplace=True)
print(df)
Output:
ID Name Age
0 101 Beladona 18.0
1 102 Archis 19.0
2 103 Eshan 20.0
3 104 Debika 19.0
💡 The missing values (NaN) in "Age" are replaced by 19.0, which is the mean of the available values:
Mean Age = (18 + 20) / 2 = 19
🔹 Best Practice:
Use mean() for numerical data with continuous values.
Use median() for skewed data (e.g., income levels).
Use mode() for categorical data (e.g., city names, product categories) — see the sketches below.
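For completeness, here is what the median and mode variants look like in code (a sketch, assuming the same students_missing.csv columns):
# Skewed numeric data: fill with the median instead of the mean
df['Age'] = df['Age'].fillna(df['Age'].median())

# Categorical data: fill with the most frequent value
# (mode() returns a Series, so take its first element)
df['Grade'] = df['Grade'].fillna(df['Grade'].mode()[0])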
6. Reading Large CSV Files Efficiently
When working with small CSV files, pd.read_csv() loads data into memory without issues. However, if you’re dealing with huge datasets (gigabytes in size), you may run into performance problems.
Pandas provides several techniques to optimize large CSV imports.
Method 1: Reading Only a Subset of Rows
If you only need the first few rows to analyze the data, use the nrows parameter.
df = pd.read_csv('large_data.csv', nrows=100)  # Read first 100 rows
This is useful for previewing the data without loading the entire file.
Method 2: Loading Data in Chunks
Instead of loading the entire file at once, Pandas allows us to read data in chunks.
chunk_size = 5000 # Read 5000 rows at a time
for chunk in pd.read_csv('large_data.csv', chunksize=chunk_size):
    print(chunk.shape)  # Prints the number of rows and columns in each chunk
Why Use Chunks?
✅ Avoids memory crashes when handling huge files
✅ Allows incremental processing (useful for large-scale data analysis)
✅ Speeds up performance
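To make chunked processing concrete, here is a small sketch (assuming a hypothetical large_data.csv with an Age column) that computes an overall average without ever holding the full file in memory:
import pandas as pd

total, count = 0, 0
for chunk in pd.read_csv('large_data.csv', chunksize=5000):
    # Aggregate each chunk, then combine -- only 5000 rows are in memory at a time
    total += chunk['Age'].sum()
    count += chunk['Age'].count()

print("Overall mean age:", total / count)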
Method 3: Specifying Data Types for Efficient Loading
By default, Pandas assigns general-purpose data types, which can consume more memory. We can explicitly specify data types to optimize performance.
df = pd.read_csv('large_data.csv', dtype={'ID': 'int32', 'Age': 'int8', 'Grade': 'category'})
Benefits:
✅ int32 instead of int64 reduces memory usage
✅ int8 is sufficient for small numbers (like ages)
✅ category is ideal for categorical values like grades (A, B, C)
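You can verify the savings yourself with memory_usage(). A quick sketch (the column names are assumed to match the example above):
import pandas as pd

default = pd.read_csv('large_data.csv')
optimized = pd.read_csv('large_data.csv',
                        dtype={'ID': 'int32', 'Age': 'int8', 'Grade': 'category'})

# deep=True accounts for the actual memory used by string columns
print("Default:  ", default.memory_usage(deep=True).sum(), "bytes")
print("Optimized:", optimized.memory_usage(deep=True).sum(), "bytes")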
7. Writing Data to a CSV File Using Pandas
Once you've imported, processed, and analyzed your data using Pandas, the next step is often saving it back into a CSV file for sharing, storage, or further analysis. Pandas makes this incredibly easy with its built-in .to_csv() function, which allows you to export DataFrames into structured CSV files.
⏺️ Basic CSV Export
The simplest way to write a DataFrame to a CSV file is:
import pandas as pd
# Sample DataFrame
data = {
    'ID': [101, 102, 103],
    'Name': ['John Doe', 'Jane Smith', 'Emily Davis'],
    'Age': [18, 19, 20],
    'Grade': ['A', 'B', 'A']
}
df = pd.DataFrame(data)
# Save DataFrame to CSV
df.to_csv('students.csv')
print("CSV file has been saved!")
This will create a students.csv file with the following content:
,ID,Name,Age,Grade
0,101,John Doe,18,A
1,102,Jane Smith,19,B
2,103,Emily Davis,20,A
🤔 Why do we see an extra column (Unnamed: 0)?
By default, Pandas writes the row index (0, 1, 2, ...) as a column in the CSV. If you don’t want this, disable it using index=False.
df.to_csv('students.csv', index=False)
Now, the extra index column is removed. ✅ The updated output will be:
ID,Name,Age,Grade
101,John Doe,18,A
102,Jane Smith,19,B
103,Emily Davis,20,A
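to_csv can also append to an existing file instead of overwriting it, which is useful when exporting data in batches. A small sketch (new_students is a hypothetical second DataFrame):
# Append new rows to the existing students.csv;
# header=False avoids writing the column names a second time
new_students = pd.DataFrame({
    'ID': [104],
    'Name': ['Alex Turner'],
    'Age': [21],
    'Grade': ['B'],
})
new_students.to_csv('students.csv', mode='a', header=False, index=False)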
⏺️Customizing the CSV Output
Changing the Delimiter (Comma, Semicolon, Tab, etc.)
If you need a semicolon-separated CSV, use the sep parameter:
df.to_csv('students_semicolon.csv', sep=';', index=False)
Similarly, for a tab-separated CSV, use:
df.to_csv('students_tab.csv', sep='\t', index=False)
Saving Only Selected Columns
If you only need specific columns while saving:
df.to_csv('students_filtered.csv', columns=['ID', 'Name'], index=False)
Output (students_filtered.csv)
ID,Name
101,John Doe
102,Jane Smith
103,Emily Davis
Handling Missing Values in CSV Export
If your DataFrame has missing (NaN) values, Pandas writes them as empty cells by default. Example:
data = {
    'ID': [101, 102, 103],
    'Name': ['John Doe', 'Jane Smith', 'Emily Davis'],
    'Age': [18, None, 20],  # Missing age for ID 102
    'Grade': ['A', 'B', 'A']
}
df = pd.DataFrame(data)
df.to_csv('students_missing.csv', index=False)
Output (students_missing.csv)
ID,Name,Age,Grade
101,John Doe,18,A
102,Jane Smith,,B
103,Emily Davis,20,A
Ways to Handle Missing Values in Export:
Fill Missing Values Before Saving:
df.fillna("Unknown").to_csv('students_filled.csv', index=False)
Output
ID,Name,Age,Grade
101,John Doe,18,A
102,Jane Smith,Unknown,B
103,Emily Davis,20,A
Replace NaN with a Custom Symbol (e.g., NA):
df.to_csv('students_nan.csv', na_rep='NA', index=False)
Output:
ID,Name,Age,Grade
101,John Doe,18,A
102,Jane Smith,NA,B
103,Emily Davis,20,A
Writing a CSV Without a Header
By default, Pandas includes column names in the CSV. If you want to remove them:
df.to_csv('students_no_header.csv', header=False, index=False)
Output (students_no_header.csv)
101,John Doe,18,A
102,Jane Smith,19,B
103,Emily Davis,20,A
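A quick way to double-check any export is to read the file straight back in and compare it with the original DataFrame. A small sketch using the students.csv written earlier with index=False:
# Round-trip check: read the exported file back in
round_trip = pd.read_csv('students.csv')

# equals() compares values and dtypes; it should print True
# if the export preserved the data (dtype changes can make it stricter)
print(round_trip.equals(df))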
🔰 Conclusion
CSV files are everywhere, and knowing how to handle them efficiently with Pandas is a must-have skill for any data analyst or Python developer.
In this guide, we covered:
✅ Basic CSV importing
✅ Handling different delimiters
✅ Skipping unnecessary rows
✅ Selecting specific columns
✅ Handling missing values
✅ Optimizing performance for large files
✅ Writing data back to CSV
Now, you're ready to import, manipulate, and analyze CSV files like a pro! 🚀
What’s Next?
Try experimenting with different CSV datasets and explore Pandas' powerful data analysis capabilities. If you found this guide helpful, share it with your fellow data enthusiasts!
Happy Coding! ☕💻