Data analysis from a CSV file in Python
Learn to read and write CSV files in Python
Table of contents
- What is a CSV file?
- How to read CSV files using the csv module?
- How to write to CSV files using the csv module?
- Playing with additional parameters in csv module
- How to read CSV files using the pandas module?
- How to write to CSV files using the pandas module?
- Data Analyst Project: Analyze Titanic data from Kaggle
What is a CSV file?
CSV stands for comma-separated values.
You might have come across this file format while downloading data from an Excel spreadsheet or a database. CSV files are convenient for storing tabular data.
As the name suggests, values in a CSV file are separated by a comma (by default).
Below is an example of a CSV file containing information about a family.
my_family.csv
name,age,height(cm),weight(kg)
Lenin,30,188,90
Phil,42,178,76
Claire,40,165,54
Alex,18,140,46
Usually the first line in a CSV file is the header, which identifies the column names. Every row after the header is a data record.
From the above example, you can see that each value (whether part of the header or a data record) is separated by a comma. This separator character is called the delimiter. A CSV file may use delimiters other than a comma.
Examples of other delimiters:
- tab (\t)
- colon (:)
- semicolon (;)
- pipe (|)
In this article, you will learn to work with CSV files using the csv module and the pandas library.
How to read CSV files using the csv module?
Reading from a CSV file is done with the reader object returned by csv.reader(). You can open the CSV file as a text file with Python’s built-in open() function.
Example 1
import csv
# newline='' lets the csv module handle line endings itself
with open('my_family.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            # The first row holds the column names
            print(f'Header row - {", ".join(row)}')
            line_count += 1
        else:
            print(f'{row[0]} is {row[1]} years old, {row[2]} cm tall and {row[3]} kg heavy')
            line_count += 1
    print(f'Total: {line_count} lines')
Output
Header row - name, age, height(cm), weight(kg)
Lenin is 30 years old, 188 cm tall and 90 kg heavy
Phil is 42 years old, 178 cm tall and 76 kg heavy
Claire is 40 years old, 165 cm tall and 54 kg heavy
Alex is 18 years old, 140 cm tall and 46 kg heavy
Total: 5 lines
Since the first row is the header row (line_count will be 0), it is treated differently. You can also skip the header row while reading the CSV.
How to skip the header row in CSV with Python?
Since the csv.reader object is an iterator, you can call next(csv_reader, None) to return the header row and advance past it. The second argument, None, is returned if the file is empty, so no StopIteration exception is raised.
Example 2
import csv
with open('my_family.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    next(csv_reader, None)  # ignore the header
    for row in csv_reader:
        print(f'{row[0]} is {row[1]} years old, {row[2]} cm tall and {row[3]} kg heavy')
        line_count += 1
    print(f'Total: {line_count} lines')
Output
Lenin is 30 years old, 188 cm tall and 90 kg heavy
Phil is 42 years old, 178 cm tall and 76 kg heavy
Claire is 40 years old, 165 cm tall and 54 kg heavy
Alex is 18 years old, 140 cm tall and 46 kg heavy
Total: 4 lines
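The call to next() also returns the header, so it can be kept instead of discarded. The sketch below is a minimal illustration under stated assumptions: io.StringIO stands in for my_family.csv, and the shortened sample data and variable names are made up for this example.

```python
import csv
import io

# Stand-in for my_family.csv so the sketch runs without a file on disk
sample = "name,age\nLenin,30\nPhil,42\n"

csv_reader = csv.reader(io.StringIO(sample))
header = next(csv_reader, None)  # first row, or None if the input is empty
records = list(csv_reader)       # the reader continues after the header

print(header)
print(records)

# With empty input, the default (None) is returned instead of raising StopIteration
empty_header = next(csv.reader(io.StringIO("")), None)
print(empty_header)
```

Keeping the header in a variable is handy when you later need the column names, for example to look up a column position with header.index('age').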
Reading CSV files as a dictionary
You can read the CSV file as a dictionary by using the csv.DictReader object.
An advantage of using the DictReader object is that it turns each row into a dictionary, which makes accessing the fields by name easier.
Example 3
import csv
with open('my_family.csv', newline='') as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=',')
    for row in csv_reader:
        print(f'{row["name"]} is {row["age"]} years old, {row["height(cm)"]} cm tall and {row["weight(kg)"]} kg heavy')
    print(f'Total: {csv_reader.line_num} lines')
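The csv_reader.line_num attribute holds the number of lines read from the file so far. After the loop has consumed the whole file, that is the total number of lines, header included (5 for my_family.csv).
For the csv.DictReader object, Python uses the column names from the header row as keys. The header row itself is not returned as a data record.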
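As a small sketch of the points above, the snippet below inspects the header names and the line counter of a DictReader. It assumes a shortened, made-up version of the family data and uses io.StringIO in place of a real file.

```python
import csv
import io

# Made-up sample: one header line and two data records
sample = "name,age\nLenin,30\nPhil,42\n"
csv_reader = csv.DictReader(io.StringIO(sample))

rows = list(csv_reader)           # each row is a dictionary keyed by the header
columns = csv_reader.fieldnames   # header names taken from the first line
lines_read = csv_reader.line_num  # lines consumed so far, header included

print(columns)
print(rows[0]["name"], rows[0]["age"])
print(lines_read)
```

Note that every value is read as a string, so numeric fields such as age need an explicit int() conversion before doing arithmetic.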
How to write to CSV files using the csv module?
You can write to a CSV file using the csv.writer object. Be careful to open the file in write mode ('w'), which overwrites any existing file with the same name.
Example 1
import csv
header = ['Name', 'Age', 'Height(cm)', 'Weight(kg)']
data = [['Phil', 42, 178, 76],
        ['Alex', 18, 140, 46],
        ['Claire', 40, 165, 54]]
filename = "my_family.csv"
# newline='' stops Python from translating the writer's own line endings
with open(filename, 'w', newline='') as output:
    csvwriter = csv.writer(output)
    # Write a single row from a list
    csvwriter.writerow(header)
    # Write several rows from a list of lists
    csvwriter.writerows(data)
Output
Name,Age,Height(cm),Weight(kg)
Phil,42,178,76
Alex,18,140,46
Claire,40,165,54
The writerow method writes a list of values as a single row, whereas writerows writes multiple rows from an iterable of rows (here, a list of lists).
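To see exactly what the two methods emit, the sketch below writes into an in-memory io.StringIO buffer instead of a file. The buffer and the shortened data are assumptions made for illustration only.

```python
import csv
import io

buffer = io.StringIO()
csvwriter = csv.writer(buffer)

csvwriter.writerow(["Name", "Age"])                # one row
csvwriter.writerows([["Phil", 42], ["Alex", 18]])  # several rows at once

written = buffer.getvalue()
print(repr(written))  # repr() makes the line endings visible
```

By default the writer ends each row with \r\n. This is why the csv module documentation recommends opening output files with newline='': without it, some platforms translate the \n again and produce blank lines between rows.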
Using the delimiter parameter
Notice that no delimiter was specified when creating the csv.writer object. In such cases, a comma (,) is used as the default delimiter. You can use a different delimiter by passing the delimiter parameter.
Example 2
import csv
header = ['Name', 'Age', 'Height(cm)', 'Weight(kg)']
data = [['Phil', 42, 178, 76],
        ['Alex', 18, 140, 46],
        ['Claire', 40, 165, 54]]
filename = "my_family.csv"
with open(filename, 'w', newline='') as output:
    # Use a pipe instead of the default comma
    csvwriter = csv.writer(output, delimiter='|')
    # Write a single row from a list
    csvwriter.writerow(header)
    # Write several rows from a list of lists
    csvwriter.writerows(data)
Output
Name|Age|Height(cm)|Weight(kg)
Phil|42|178|76
Alex|18|140|46
Claire|40|165|54
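A file written with a custom delimiter has to be read back with the same one. The following sketch shows what happens with and without the matching delimiter; the pipe-separated text is a shortened, hypothetical stand-in for the file above.

```python
import csv
import io

# Hypothetical pipe-delimited content, held in memory instead of a file
pipe_text = "Name|Age|Height(cm)|Weight(kg)\nPhil|42|178|76\n"

# Matching delimiter: each line is split into four fields
with_pipe = list(csv.reader(io.StringIO(pipe_text), delimiter="|"))

# Default comma delimiter: each line comes back as one unsplit field
with_comma = list(csv.reader(io.StringIO(pipe_text)))

print(with_pipe[1])
print(with_comma[1])
```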
Writing a dictionary to a CSV file
You can write dictionaries to a CSV file using the csv.DictWriter class. The fieldnames parameter is required: it supplies the header names and the order of the columns.
Example 3
import csv
header = ['Name', 'Age', 'Height(cm)', 'Weight(kg)']
data = [
    {"Name": "Phil", "Age": 42, "Height(cm)": 178, "Weight(kg)": 76},
    {"Name": "Claire", "Age": 40, "Height(cm)": 165, "Weight(kg)": 54},
    {"Name": "Alex", "Age": 18, "Height(cm)": 140, "Weight(kg)": 46}
]
filename = "my_family.csv"
with open(filename, 'w', newline='') as output:
    csvwriter = csv.DictWriter(output, fieldnames=header)
    # Write the header row from fieldnames
    csvwriter.writeheader()
    # Write one dictionary per row
    for row in data:
        csvwriter.writerow(row)
Output
Name,Age,Height(cm),Weight(kg)
Phil,42,178,76
Claire,40,165,54
Alex,18,140,46
You can also use writerows to write all the dictionaries to the CSV file at once.
Example 4
import csv
header = ['Name', 'Age', 'Height(cm)', 'Weight(kg)']
data = [
    {"Name": "Phil", "Age": 42, "Height(cm)": 178, "Weight(kg)": 76},
    {"Name": "Claire", "Age": 40, "Height(cm)": 165, "Weight(kg)": 54},
    {"Name": "Alex", "Age": 18, "Height(cm)": 140, "Weight(kg)": 46}
]
filename = "my_family.csv"
with open(filename, 'w', newline='') as output:
    csvwriter = csv.DictWriter(output, fieldnames=header)
    csvwriter.writeheader()
    # Write every dictionary in one call
    csvwriter.writerows(data)
Output
Name,Age,Height(cm),Weight(kg)
Phil,42,178,76
Claire,40,165,54
Alex,18,140,46
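Real data is not always as tidy as the dictionaries above. As a hedged sketch, the snippet below uses two optional DictWriter parameters, restval and extrasaction, to handle a missing key and an unexpected key. The sample rows, the "NA" placeholder and the "Pet" key are made up for illustration.

```python
import csv
import io

header = ["Name", "Age", "Height(cm)"]
buffer = io.StringIO()  # in-memory stand-in for a file

csvwriter = csv.DictWriter(
    buffer,
    fieldnames=header,
    restval="NA",           # value written when a key is missing
    extrasaction="ignore",  # silently drop keys that are not in fieldnames
)
csvwriter.writeheader()
csvwriter.writerow({"Name": "Phil", "Age": 42})  # no Height(cm)
csvwriter.writerow({"Name": "Alex", "Age": 18, "Height(cm)": 140, "Pet": "cat"})

lines = buffer.getvalue().splitlines()
print(lines)
```

With the default extrasaction='raise', the second row would raise a ValueError because of the extra key, which is often the safer choice while developing.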
Playing with additional parameters in csv module
quotechar
It is the one-character string used to quote values that contain special characters such as the delimiter. It defaults to a double quote (").
For example, suppose the delimiter of your CSV file is a comma and you have an address column that may contain commas in its values, as in the file below.
my_family.csv
Name,Age,Height(cm),Weight(kg),Address
Phil,42,178,76,'Gryffindor room, Hogwarts'
Claire,40,165,54,'Snapes room, Hogwarts'
Alex,18,140,46,'4 Private Drive, Little Whinging'
The above CSV file uses single quotes to enclose the address field of each data record. You can pass this character as the quotechar value.
Example 1
import csv
filename = "my_family.csv"
with open(filename, 'r', newline='') as csv_file:
    # Fields wrapped in single quotes may contain the delimiter
    csvreader = csv.reader(csv_file, quotechar="'")
    for row in csvreader:
        print(row)
Output
['Name', 'Age', 'Height(cm)', 'Weight(kg)', 'Address']
['Phil', '42', '178', '76', 'Gryffindor room, Hogwarts']
['Claire', '40', '165', '54', 'Snapes room, Hogwarts']
['Alex', '18', '140', '46', '4 Private Drive, Little Whinging']
quoting
The quoting argument controls when quotes should be generated by the writer or recognized by the reader. The four most commonly used values are:
- csv.QUOTE_MINIMAL: quotes a field only when required, for example when it contains the delimiter or the quotechar (default).
- csv.QUOTE_ALL: quotes every field regardless of its type.
- csv.QUOTE_NONNUMERIC: quotes every field except integers and floats.
- csv.QUOTE_NONE: never quotes fields on output. While reading, quote characters are not treated specially and stay part of the field values.
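The differences between the quoting constants are easiest to see side by side. The sketch below writes the same row with three of them; write_row is a hypothetical helper defined only for this illustration, and the default double quote is used as the quotechar.

```python
import csv
import io

row = ["Phil", 42, "Gryffindor room, Hogwarts"]

def write_row(quoting):
    """Write the sample row with the given quoting mode and return the text."""
    buffer = io.StringIO()
    csv.writer(buffer, quoting=quoting).writerow(row)
    return buffer.getvalue().strip()

minimal = write_row(csv.QUOTE_MINIMAL)        # only the address is quoted
quote_all = write_row(csv.QUOTE_ALL)          # every field is quoted
nonnumeric = write_row(csv.QUOTE_NONNUMERIC)  # the integer 42 stays unquoted

print(minimal)
print(quote_all)
print(nonnumeric)
```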
Example 1
import csv
filename = "my_family.csv"
header = ['Name','Age','Height(cm)','Weight(kg)','Address']
data = [
    ['Phil', 42, 178, 76, 'Gryffindor room, Hogwarts'],
    ['Claire', 40, 165, 54, 'Snapes room, Hogwarts'],
    ['Alex', 18, 140, 46, '4 Private Drive, Little Whinging']
]
with open(filename, 'w', newline='') as output:
    # Quote every field, using a single quote as the quote character
    csvwriter = csv.writer(output, quotechar="'", quoting=csv.QUOTE_ALL)
    csvwriter.writerow(header)
    csvwriter.writerows(data)
The above code uses csv.QUOTE_ALL as the quoting argument. This ensures that every field is wrapped in single quotes (the quotechar) when written to the CSV.
my_family.csv
'Name','Age','Height(cm)','Weight(kg)','Address'
'Phil','42','178','76','Gryffindor room, Hogwarts'
'Claire','40','165','54','Snapes room, Hogwarts'
'Alex','18','140','46','4 Private Drive, Little Whinging'
escapechar
Let's say you don't want any quotes in your CSV file when running the above code. So you use csv.QUOTE_NONE as the quoting argument.
Example 1
import csv
filename = "my_family.csv"
header = ['Name','Age','Height(cm)','Weight(kg)','Address']
data = [
    ['Phil', 42, 178, 76, 'Gryffindor room, Hogwarts'],
    ['Claire', 40, 165, 54, 'Snapes room, Hogwarts'],
    ['Alex', 18, 140, 46, '4 Private Drive, Little Whinging']
]
with open(filename, 'w', newline='') as output:
    # Quoting is disabled and no escapechar is set
    csvwriter = csv.writer(output, quotechar="'", quoting=csv.QUOTE_NONE)
    csvwriter.writerow(header)
    csvwriter.writerows(data)
The above code will throw you an error.
Output
Traceback (most recent call last):
  File "main.py", line 16, in <module>
    csvwriter.writerows(data)
_csv.Error: need to escape, but no escapechar set
The problem is that the address field contains commas. Since the quoting argument is set to csv.QUOTE_NONE, the csv module cannot quote the field and has no other way to escape the commas.
For this purpose, you can use the escapechar argument. It takes a one-character string that is used to escape the delimiter when quoting is turned off.
The code below escapes the commas using a backslash (\).
Example 2
import csv
filename = "my_family.csv"
header = ['Name','Age','Height(cm)','Weight(kg)','Address']
data = [
    ['Phil', 42, 178, 76, 'Gryffindor room, Hogwarts'],
    ['Claire', 40, 165, 54, 'Snapes room, Hogwarts'],
    ['Alex', 18, 140, 46, '4 Private Drive, Little Whinging']
]
with open(filename, 'w', newline='') as output:
    # With quoting off, a backslash is placed before each comma inside a field
    csvwriter = csv.writer(output, quotechar="'", quoting=csv.QUOTE_NONE, escapechar='\\')
    csvwriter.writerow(header)
    csvwriter.writerows(data)
my_family.csv
Name,Age,Height(cm),Weight(kg),Address
Phil,42,178,76,Gryffindor room\, Hogwarts
Claire,40,165,54,Snapes room\, Hogwarts
Alex,18,140,46,4 Private Drive\, Little Whinging
Notice how the commas have been escaped with a backslash (\) and no error is thrown.
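An escaped file must also be read with the same escapechar, otherwise the backslashes and commas are misinterpreted. The round trip below is a minimal sketch that assumes an in-memory buffer and a shortened, two-column version of the data.

```python
import csv
import io

row = ["Phil", "Gryffindor room, Hogwarts"]

# Write with quoting disabled and a backslash as the escape character
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_NONE, escapechar="\\").writerow(row)
escaped_line = buffer.getvalue()
print(repr(escaped_line))

# Read it back with the same settings to recover the original fields
buffer.seek(0)
parsed = next(csv.reader(buffer, quoting=csv.QUOTE_NONE, escapechar="\\"))
print(parsed)
```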
skipinitialspace
When set to True, spaces immediately following the delimiter are ignored. It defaults to False.
my_family.csv
Name, Age, Height(cm), Weight(kg), Address
Phil, 42, 178, 76, 'Gryffindor room, Hogwarts'
Claire, 40, 165, 54, 'Snapes room, Hogwarts'
Alex, 18, 140, 46, '4 Private Drive, Little Whinging'
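The above CSV file has a space after every delimiter. If you read it without the skipinitialspace argument, the values will contain leading spaces. The addresses are also split in two, because a quote that follows a space is not recognized as the start of a quoted field.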
Example 1
import csv
with open('my_family.csv', 'r') as f:
    csv_reader = csv.reader(f, quotechar="'")
    for line in csv_reader:
        print(line)
Output
['Name', ' Age', ' Height(cm)', ' Weight(kg)', ' Address']
['Phil', ' 42', ' 178', ' 76', " 'Gryffindor room", " Hogwarts'"]
['Claire', ' 40', ' 165', ' 54', " 'Snapes room", " Hogwarts'"]
['Alex', ' 18', ' 140', ' 46', " '4 Private Drive", " Little Whinging'"]
To get rid of the leading spaces, set the skipinitialspace argument to True.
Example 2
import csv
with open('my_family.csv', 'r') as f:
    csv_reader = csv.reader(f, quotechar="'", skipinitialspace=True)
    for line in csv_reader:
        print(line)
Output
['Name', 'Age', 'Height(cm)', 'Weight(kg)', 'Address']
['Phil', '42', '178', '76', 'Gryffindor room, Hogwarts']
['Claire', '40', '165', '54', 'Snapes room, Hogwarts']
['Alex', '18', '140', '46', '4 Private Drive, Little Whinging']
How to read CSV files using the pandas module?
Reading CSV files into a pandas DataFrame is straightforward. A pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.
Example 1
import pandas as pd
df = pd.read_csv('my_family.csv')
print(df)
Output
     Name  Age  Height(cm)  Weight(kg)
0    Phil   42         178          76
1  Claire   40         165          54
2    Alex   18         140          46
Notice the following points:
- pandas used the first line of the CSV as the column names automatically.
- It displays the CSV data like a spreadsheet, which makes the data easy to inspect before analysis.
- pandas automatically inferred an integer data type for the Age, Height(cm) and Weight(kg) columns.
Example 2
import pandas as pd
df = pd.read_csv('my_family.csv')
print(type(df['Age'][0]))
print(type(df['Height(cm)'][0]))
print(type(df['Weight(kg)'][0]))
Output
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
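Instead of checking one value per column, the dtypes attribute lists the inferred type of every column at once, and the dtype argument of read_csv overrides the inference. The sketch below assumes pandas is installed and uses io.StringIO with a shortened copy of the family data in place of the file.

```python
import io

import pandas as pd

# In-memory stand-in for my_family.csv
sample = "Name,Age,Height(cm),Weight(kg)\nPhil,42,178,76\nClaire,40,165,54\n"

df = pd.read_csv(io.StringIO(sample))
print(df.dtypes)  # one inferred type per column

# Force Age to be read as text instead of an integer
as_text = pd.read_csv(io.StringIO(sample), dtype={"Age": str})
print(repr(as_text["Age"].iloc[0]))
```

Overriding the type is useful for columns such as ZIP codes or IDs, where leading zeros would be lost if the values were parsed as integers.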
Pandas trick to deal with CSVs without header
If your CSV is missing the header row, use the names argument of the pd.read_csv() function.
my_family.csv
Phil,42,178,76
Claire,40,165,54
Alex,18,140,46
Example 3
import pandas as pd
df = pd.read_csv('my_family.csv',
                 index_col='Name',
                 names=['Name', 'Age', 'Height(cm)', 'Weight(kg)']
                 )
print(df)
Output
        Age  Height(cm)  Weight(kg)
Name
Phil     42         178          76
Claire   40         165          54
Alex     18         140          46
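It helps to see why the names argument matters. In the sketch below, which assumes pandas is installed and uses an in-memory copy of a shortened headerless file, the first call silently treats the first record as the header, while header=None keeps every record and numbers the columns.

```python
import io

import pandas as pd

# Headerless data held in memory instead of my_family.csv
sample = "Phil,42,178,76\nClaire,40,165,54\nAlex,18,140,46\n"

# Pitfall: without names or header=None, the first record becomes the header
df_wrong = pd.read_csv(io.StringIO(sample))
print(list(df_wrong.columns), len(df_wrong))

# header=None keeps all three records and labels the columns 0, 1, 2, 3
df_numbered = pd.read_csv(io.StringIO(sample), header=None)
print(list(df_numbered.columns), len(df_numbered))
```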
How to write to CSV files using the pandas module?
To write a pandas DataFrame to a CSV file, use the df.to_csv method.
Example 1
import pandas as pd
df = pd.read_csv('my_family.csv',
                 index_col='Name',
                 names=['Name', 'Age', 'Height(cm)', 'Weight(kg)']
                 )
df.to_csv('my_new_family.csv')
my_new_family.csv
Name,Age,Height(cm),Weight(kg)
Phil,42,178,76
Claire,40,165,54
Alex,18,140,46
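By default, to_csv also writes the DataFrame index as the first column. When the index is just the automatic row numbering, you usually do not want it in the file. The sketch below assumes pandas is installed and uses a small made-up DataFrame; calling to_csv without a path returns the CSV text as a string, which makes the result easy to inspect.

```python
import pandas as pd

# Small made-up DataFrame with the default 0, 1, ... index
df = pd.DataFrame({"Name": ["Phil", "Alex"], "Age": [42, 18]})

with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # index left out

print(with_index)
print(without_index)
```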
Data Analyst Project: Analyze Titanic data from Kaggle
The famous Titanic challenge by Kaggle is to build a machine learning model that predicts which passengers survived the Titanic shipwreck.
However, in this section you are going to do a simple data analysis on the train.csv file and answer the following questions:
- How many male and female passengers were onboard the Titanic?
- How many male and female passengers survived the Titanic shipwreck?
- What is the median age of each sex?
Male to Female ratio on the Titanic
import pandas as pd
#load the csv file
df = pd.read_csv('train.csv')
# Column Names
print(df.columns)
# Count unique values in Sex column
print(df['Sex'].value_counts())
# Percentage of male and female passengers
print(df['Sex'].value_counts(normalize=True))
Output
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
male 577
female 314
Name: Sex, dtype: int64
male 0.647587
female 0.352413
Name: Sex, dtype: float64
The above analysis shows that about 65% of the passengers in train.csv were male and 35% were female.
Surviving male to female ratio on the Titanic
import pandas as pd
#load the csv file
df = pd.read_csv('train.csv')
# Column Names
print(df.columns)
# Count unique values in Sex column
print(df[df["Survived"] == 1]['Sex'].value_counts())
# Percentage of surviving male and female passengers
print(df[df["Survived"] == 1]['Sex'].value_counts(normalize=True))
Output
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
female 233
male 109
Name: Sex, dtype: int64
female 0.681287
male 0.318713
Name: Sex, dtype: float64
In the above code, you first filter the DataFrame for surviving passengers and then use the value_counts() method to count the surviving male and female passengers.
The above analysis shows that 68% of the surviving passengers in train.csv were female.
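The share of survivors who were female is not the same thing as the survival rate of female passengers. A related calculation is the survival rate within each sex, which can be sketched with groupby. The tiny DataFrame below is made-up toy data, not the Kaggle file, and only reuses the Sex and Survived column names from train.csv.

```python
import pandas as pd

# Made-up toy data with the same column names as train.csv
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})

# Survived is 0 or 1, so its mean per group is the survival rate of that group
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

Running the same two lines on the DataFrame loaded from train.csv would give the survival rate for each sex in that file.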
Median age of each sex
import pandas as pd
#load the csv file
df = pd.read_csv('train.csv')
# median age of each sex
median_age_men=df[df['Sex']=='male']['Age'].median()
median_age_women=df[df['Sex']=='female']['Age'].median()
print(f"The median age of men is {median_age_men}")
print(f"The median age of women is {median_age_women}")
Output
The median age of men is 29.0
The median age of women is 27.0
The above analysis shows that the median age of male passengers was 29, whereas the median age of female passengers was 27.
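The two filters above can be collapsed into a single groupby call. The sketch below uses a made-up toy DataFrame rather than train.csv, and includes one missing age to show that median() skips missing values by default.

```python
import pandas as pd

# Made-up toy data; None becomes NaN, like a missing age in a CSV file
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Age": [22.0, 30.0, None, 26.0, 28.0],
})

# One median per sex; missing ages are ignored
median_age = toy.groupby("Sex")["Age"].median()
print(median_age)
```

The same expression, df.groupby('Sex')['Age'].median(), should work on the DataFrame loaded from train.csv, since it only relies on the Sex and Age columns listed earlier.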