An End-to-End Data Cleaning Project on the UN Data with Python

Jeeho Bae
10 min readJul 18, 2019

--

Photo by NASA on Unsplash

The United Nations data is one of the great open data sources out there. However, because of the format of the dataset, the users need to clean and reorganize the data for further analysis. In this post, I used Total international migrant stock data in 2015 from the United Nations. Let’s start to work on this!

1. Read a created CSV file

First, read the dataset after downloading the file. Here, the file is uploaded as a CSV file on my Github page.

# import any necessary librariesimport pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
mst_df = pd.read_csv(' https://raw.githubusercontent.com/jeehobdata/projects/master/Data/UN_MigrantStockTotal_2015.csv')
mst_df.head(20)

So, there are many null values in various columns. Because the dataset is organized in Microsoft Excel format, it needs to be cleaned to be used in the python environment.

2. Tidy and transform the data

Within the table of data, the values are categorized in three sections: International migrant stock at mid-year (both sexes), International migrant stock at mid-year (male), and International migrant stock at mid-year (female). Due to the duplicated year columns by the three categories of gender, we need to clean and transform the dataset before analyzing it.

2.1 Clean the data

To begin with, let’s clean the dataset by setting the important parts of the data, replacing the null values, and resetting the indices.

#cut out unnecessaty parts of the table
mst_df = mst_df[mst_df.columns[1:23]].iloc[13:281]
mst_df = pd.DataFrame(mst_df)
#Set the first column as the headers
mst_df.columns = mst_df.iloc[0]
mst_df = mst_df.reindex(mst_df.index.drop(13))
mst_df.head()
#Drop the Notes column and Type of Data(a) column
mst_df = mst_df.drop(['Notes', 'Country code', 'Type of data (a)'], axis=1)
#Set a new index on the cleaned dataframe
mst_index = pd.Series(range(266))
mst_df = mst_df.set_index([mst_index])
mst_df.head()

It looks much readable than before. Now, since the dataset includes the information of migrants in three different categories as male, female, and both sexes, we should reorganize it to categorize the genders, the years, and the countries. To do that, let’s create three data frames by the genders, and then tidy and merge them.

2.2 Create three data frames by the genders

Let’s now create three data frames of the categories of the genders. To create the three data frames, we need to label all columns based on the categories.

# Create a copied dataset of the creaned dataset 
mst_df1 = mst_df
# Label the each columns
mst_df1.columns=['Destination', 'Both sexes','Both sexes',
'Both sexes', 'Both sexes','Both sexes',
'Both sexes', 'Male', 'Male', 'Male', 'Male',
'Male', 'Male', 'Female', 'Female', 'Female',
'Female', 'Female', 'Female']
# Replace the null value with 'Destination'
mst_df1["Destination"] = mst_df1["Destination"].fillna("Destination")
mst_df1.head()

The data shows which column is for which gender. However, the year is still duplicated three times by the genders. Now. let’s create data frames for each of the gender categories.

2.2.a Create a data frame of both sexes

#Create a table of the columns of both sexes
mst_bothsexes = mst_df1[['Destination', 'Both sexes']]
# Add a new column with a label, 'Gender'
# and fill the values with 'Both sexes' to categorize them later when we merge it with the other created dataframes
mst_bothsexes['Gender'] = 'Both sexes'
# Replace the value on the Gender column on the first row to reindex the table by the first row
mst_bothsexes.iloc[0,7] ='Gender'
# Set the first column as the headers
mst_bothsexes.columns = mst_bothsexes.iloc[0]
mst_bothsexes = mst_bothsexes.reindex(mst_bothsexes.index.drop(0))
# Reorder the columns
mst_bothsexes = mst_bothsexes[['Gender','Destination', '1990','1995','2000','2005','2010','2015']]
mst_bothsexes.head()

It looks clean! Let’s create other data frames for the two others.

2.2.b Create a data frame of male

#Create a table of data of the columns of Male
mst_male = mst_df1[['Destination', 'Male']]
# Add a new column with a label, 'Gender'
# and fill the values with 'Male'
mst_male['Gender'] = 'Male'
# Replace the value on the Gender column on the first row to reindex the table by the first row
mst_male.iloc[0,7] ='Gender'
#Set the first column as the headers
mst_male.columns = mst_male.iloc[0]
mst_male = mst_male.reindex(mst_male.index.drop(0))
#Reorder the columns
mst_male = mst_male[['Gender','Destination', '1990','1995','2000','2005','2010','2015']]
mst_male.head()

2.2.c Create a data frame of female

#Create a table of data of the columns of remale
mst_female = mst_df1[['Destination', 'Female']]
# Add a new column with a label, 'Gender'
# and fill the values with 'Female'
mst_female['Gender'] = 'Female'
# Replace the value on the Gender column on the first row to reindex the table by the first row
mst_female.iloc[0,7] ='Gender'
#Set the first column as the headers
mst_female.columns = mst_female.iloc[0]
mst_female = mst_female.reindex(mst_female.index.drop(0))
#Reorder the columns
mst_female = mst_female[['Gender','Destination', '1990','1995','2000','2005','2010','2015']]
mst_female.head()

2.3 Tidy and Merge the data frames

2.3.a Convert the format of data frames into a long format

We now have three data frames for each of the gender categories. Let’s merge those tables. To merge the table in better shape, it is better to be in a long format based on the year.

# Set index
mst_bothsexes.set_index(['Gender','Destination'], inplace=True)
mst_bothsexes.columns.names = ['Year']
# convert the dataframes into a long format
mst_both_l = mst_bothsexes.stack().to_frame()
# Name the column
mst_both_l.columns = ['Total']
# Remove the empty spaces between numbers by applying regular expression
mst_both_l.replace(regex=True,inplace=True,to_replace=r'\D',value=r'')
mst_both_l.head(10)
# Set index
mst_male.set_index(['Gender','Destination'], inplace=True)
mst_male.columns.names = ['Year']
# convert the dataframes into a long format
mst_male_l = mst_male.stack().to_frame()
# Name the column
mst_male_l.columns = ['Total']
# Remove the empty spaces between numbers by applying regular expression
mst_male_l.replace(regex=True,inplace=True,to_replace=r'\D',value=r'')
mst_male_l.head(10)
# Set index
mst_female.set_index(['Gender','Destination'], inplace=True)
mst_female.columns.names = ['Year']
# convert the dataframes into a long format
mst_female_l = mst_female.stack().to_frame()
# Name the column
mst_female_l.columns = ['Total']
# Remove the empty spaces between numbers by applying regular expression
mst_female_l.replace(regex=True,inplace=True,to_replace=r'\D',value=r'')
mst_female_l.head(10)

2.3.b Convert data type of the Total columns

Almost there! We are about to merge the tables. Before doing that, we need to convert the data type of ‘Total’ columns of the data frames from strings into numerics.

# data frame of Both sexes 
mst_both_l['Total'] = mst_both_l['Total'].apply(pd.to_numeric)
# data frame of Male
mst_male_l['Total'] = mst_male_l['Total'].apply(pd.to_numeric)
# data frame of Female
mst_female_l['Total'] = mst_female_l['Total'].apply(pd.to_numeric)

Pivot the ‘gender’ column of the three data frames to merge them.

# pivot gender column of the dataframe of Both sexes 
long_both = pd.pivot_table(mst_both_l, values='Total', index=['Destination', 'Year'], columns=['Gender'])
long_both.head()
# pivot gender column of the dataframe of Male 
long_male = pd.pivot_table(mst_male_l, values='Total', index=['Destination', 'Year'], columns=['Gender'])
long_male.head()
# pivot gender column of the dataframe of Female 
long_female = pd.pivot_table(mst_female_l, values='Total', index=['Destination', 'Year'], columns=['Gender'])
long_female.head()

2.3.c Merge the three data frames

After tidying and transforming the data frames, merge them into one data frame.

# Create a list of the name of dataframes 
dfs = [long_male, long_female, long_both]
# merge the dataframes by using pd.concat() function
destination_by_gen = pd.concat(dfs, axis =1)
destination_by_gen.head(20)

Finally! The merged data frame presents the total migrants by gender, countries, and years. Let’s do some quick analysis.

3. Analysis

Questions:

  • What is the total number of migrants by the year?
  • Where are the countries most migrants go to, by gender and by year?

3.1 The total number of migrants by the year

In the merged data frame, the total number of migrants across the world is found in the rows where the destination is WORLD. After selecting the rows and values in the column of ‘Both sexes,’ it shows the total number of migrants in the world increased from 152,563,212 in 1990 to 243,700,236 in 2015. Let’s create a barplot.

# Select the rows where Destination is WORLD
df_world = destination_by_gen.loc['WORLD']
# Select the 'Bothe sexes' column to see the total number of migrants across the world of the years.
df_world['Both sexes']
# Create a barplot
df_world.plot(kind='bar', figsize=(8,5),
title='Total number of migrants in the world(1990 - 2015)').tick_params(labelrotation=0)
plt.ticklabel_format(style='plain', axis='y')

The bar plot shows the consistently increasing number of migrants from 1990 to 2015.

3.2 Top destinations of migrants, by gender and by years?

In this given data, it is limited to analyze where the migrants come from because it only includes information of destination without the origin. However, explaining the migrants by gender and years is available.

3.2.a Top destinations of migrants by the year

Since the list of the top destinations varies each year, we need to create a table of data for each year. Because we need to know the list of destination countries by the total number of migrants, we can use ‘Both sexes’ only. To use the data, let’s unstack the column because the data of the years are duplicated for the countries.

# reshape the dataframe as a wide format with the value of Both sexes column
dest_year = destination_by_gen['Both sexes'].unstack()
dest_year.head()

Now, we have a total number of migrants for the destination countries each year. Let’s start from the year 1990.

# list the top 10 destination countries in 1990dest_1990 = dest_year['1990']
dest_1990.nlargest(10)
# list the top 10 destination countries in 1995dest_1995 = dest_year['1995']
dest_1995.nlargest(10)
# list the top 10 destination countries in 2000dest_2000 = dest_year['2000']
dest_2000.nlargest(10)
# list the top 10 destination countries in 2005dest_2005 = dest_year['2005']
dest_2005.nlargest(10)
# list the top 10 destination countries in 2010dest_2010 = dest_year['2010']
dest_2010.nlargest(10)
# list the top 10 destination countries in 2015dest_2015 = dest_year['2015']
dest_2015.nlargest(10)

The results show a list of the overall countries to regions to continents to individual countries. As a single nation, the United States is listed as a destination where the largest number of migrants go from 1990 to 2015.

3.2.b Top 10 destinations of migrants by gender

To find a more specific answer, let’s focus on which countries are the destinations where the percentage of female migrants is bigger than the male migrants on average from 1990 to 2015. To start, create a new column of the percentage of female migrants over the total number of people.

# Create a column for the pecentage of female migrants and label it as Female_percentagedestination_by_gen['Female_percentage'] = destination_by_gen['Female']/destination_by_gen['Both sexes']*100destination_by_gen.head(10)
# Pivot the year column with holding the value of Female_percentagedest_f_percentage = destination_by_gen['Female_percentage'].unstack()dest_f_percentage.head()
# Calculate the average of the percentage of the six years
dest_f_percentage_ave = (dest_f_percentage['1990']+dest_f_percentage['1995']+dest_f_percentage['2000']+dest_f_percentage['2005']+dest_f_percentage['2010']+dest_f_percentage['2015'])/6
# Present the top 10 countries with the largest number of the percentage
dest_f_percentage_ave.nlargest(10)
# Create a barplot
dest_f_percentage_ave.nlargest(10).plot(figsize=(7,3), kind='bar',
title='Top 10 countries where the percentage of female migrants higher than men on average (1990 - 2015)')
plt.ticklabel_format(style='plain', axis='y')
# Present the 10 destinations with the smallest average percentage of female migrant  
dest_f_percentage_ave.nsmallest(10)
# Create a barplot
dest_f_percentage_ave.nsmallest(10).plot(figsize=(7,3), kind='bar',
title='the 10 destinations with the smallest percentage of female migrants on average (1990 - 2015)')
plt.ticklabel_format(style='plain', axis='y')

Interestingly, based on comparing the two created bar charts, eight out of the top ten countries with the highest average percentage of female migrants are in Europe, and the other two countries are located in Middle Asia and South Asia. Meanwhile, in the list of the ten destinations with the smallest average percentage of female migrants, six out of ten countries are located in the Middle East region, two of them in Middle Asia and one in North Africa, and one in Greenland.

4. Conclusion

Data cleaning is tedious but inevitable work. Data Tidying is particularly one of the very tactical skills that can make messy data look much readable and help analysts to find the hidden information from the data.

--

--

Jeeho Bae
Jeeho Bae

No responses yet