DataraFlow Week 8: Advanced Data Analysis On World Bank Indicators
Last week focused on cleaning data, so this week went a step further: transforming and combining datasets and dataframes to suit the needs of a data scientist.
JOINING DATAFRAMES TOGETHER
To kick things off, we learnt about the different types of joins, such as inner, outer, left or right, and the effects they have on the resulting table or dataset.
pd.merge(gdp, life, on='Country name', how='left')
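The code above joins two dataframes, gdp and life, on their shared 'Country name' column using a left join. A left join keeps every row of the left dataframe (gdp) and adds the values from life wherever the country names match; rows without a match get missing values (NaN) in the columns that come from life.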
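To make the difference between the join types concrete, here is a minimal sketch on two toy tables. The country names and values are made up for illustration; only pd.merge() and its how argument come from the lesson.

```python
import pandas as pd

# Hypothetical toy tables: 'A' only has GDP data, 'D' only has life expectancy data.
gdp = pd.DataFrame({'Country name': ['A', 'B', 'C'], 'GDP': [100, 200, 300]})
life = pd.DataFrame({'Country name': ['B', 'C', 'D'], 'Life expectancy': [70, 75, 80]})

# Run each join type and record which countries end up in the result.
countries_kept = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(gdp, life, on='Country name', how=how)
    countries_kept[how] = sorted(merged['Country name'])
    print(how, countries_kept[how])

# In the left join, 'A' is kept but has no life expectancy value (NaN).
left = pd.merge(gdp, life, on='Country name', how='left')
print(left)
```

An inner join keeps only the countries present in both tables, while an outer join keeps every country from either table, so the choice decides how many missing values have to be handled afterwards.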
CONSTANTS IN DATA ANALYSIS
Constants represent values that are assigned only once, such as column names or indicator codes. Defining them in one place makes the code easier to read and to change for whoever might come across it in the future. By convention, constants are written in uppercase, with words separated by underscores.
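As a small sketch of the convention, the example below defines two column names as constants and reuses them. The names and numbers are hypothetical and not taken from the course data.

```python
import pandas as pd

# Constants: assigned once, uppercase, words separated by underscores.
# These column names are illustrative assumptions.
COUNTRY = 'Country name'
LIFE_EXPECTANCY = 'Life expectancy (years)'

# Made-up values, just to have something to index.
table = pd.DataFrame({COUNTRY: ['A', 'B'], LIFE_EXPECTANCY: [70, 80]})

# Using the constant avoids retyping (and mistyping) the long column name.
average = table[LIFE_EXPECTANCY].mean()
print(average)
```

If the column ever needs a different label, only the line that defines the constant has to change.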
CORRELATION AND THE SPEARMANR() FUNCTION
In data science, a statistical comparison is often carried out to check whether two factors are related. A simple way of doing that is with Spearman's rank correlation coefficient, which measures how consistently one indicator rises or falls as the other rises, on a scale of -1 to 1. A value of -1 denotes a perfect inverse relationship, 1 denotes a perfect direct relationship, and 0 denotes no rank relationship between the indicators. The coefficient comes with a p-value, and by convention the p-value needs to be lower than 0.05 for the coefficient to be considered statistically significant.
The spearmanr() function comes from the scipy.stats module and is used in this way to check the correlation between two columns, pop and gdp.
from scipy.stats import spearmanr
popCol = gdpLifePop[POP]
gdpCol = gdpLifePop[GDP]
(correlation, pvalue) = spearmanr(popCol, gdpCol)
print(correlation, pvalue)
if pvalue < 0.05:
    print('It is statistically significant.')
else:
    print('It is not statistically significant.')
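To see what the two ends of the scale look like, here is a minimal sketch with made-up numbers. The lists are illustrative assumptions; only spearmanr() itself comes from the lesson.

```python
from scipy.stats import spearmanr

# Made-up data: one series always rises with x (but not in a straight line),
# the other always falls.
x = [1, 2, 3, 4, 5]
increasing = [2, 4, 8, 16, 32]
decreasing = [50, 40, 30, 20, 10]

rho_up, p_up = spearmanr(x, increasing)
rho_down, p_down = spearmanr(x, decreasing)

print('rising series: ', rho_up, p_up)
print('falling series:', rho_down, p_down)
```

Because Spearman's coefficient works on ranks, the rising series scores a perfect direct relationship even though it doubles at every step rather than growing by a fixed amount.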
SCATTERPLOTS
Scatterplots are used to visualize the relationship between two indicators. They are available through the dataframe's plot() method with kind='scatter', and the logx and logy arguments set a logarithmic scale on the corresponding axis.
The code below draws a scatterplot of countries, with the LIFE column on the x-axis and the GDP column on a logarithmic y-axis.
%matplotlib inline
gdpVsLifeClean.plot(y=GDP, x=LIFE, kind='scatter', logy=True, grid=True, figsize=(6, 7))
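The same call can be tried outside a notebook with a small stand-in table. This is a sketch under assumptions: the column labels and values are made up, and the Agg backend is chosen only so that it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window or notebook is needed
import pandas as pd

# Hypothetical column labels and made-up values spanning several orders of magnitude.
GDP = 'GDP (million US$)'
LIFE = 'Life expectancy (years)'
data = pd.DataFrame({GDP: [1000, 10000, 100000, 1000000],
                     LIFE: [55, 63, 71, 80]})

# logy=True puts the y-axis on a logarithmic scale; the x-axis stays linear.
axes = data.plot(x=LIFE, y=GDP, kind='scatter', logy=True, grid=True, figsize=(6, 7))
print(axes.get_xscale(), axes.get_yscale())
```

A logarithmic axis is useful when the values cover a wide range, because it stops the largest values from squeezing all the other points into one corner of the plot.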
CAPSTONE ASSIGNMENT
As a capstone assignment, I looked at the correlation between the multidimensional poverty ratio and unemployment, using data from the World Bank.
The pandas_datareader module was installed through Anaconda so that I could download the data directly from the World Bank site by specifying the year and the indicator code.
import pandas as pd
from pandas_datareader.wb import download
from scipy.stats import spearmanr
POVERTY_INDICATOR = 'SI.POV.MPWB'
UNEMPLOYMENT_INDICATOR = 'SL.UEM.TOTL.NE.ZS'
YEAR = 2020
povWB = download(country='all', indicator=POVERTY_INDICATOR, start=YEAR, end=YEAR)
unempWB = download(country='all', indicator=UNEMPLOYMENT_INDICATOR, start=YEAR, end=YEAR)
# Drop rows with missing values, then turn the index into ordinary columns
povWB, unempWB = povWB.dropna(), unempWB.dropna()
povWB, unempWB = povWB.reset_index(), unempWB.reset_index()
I then carried on by cleaning, transforming, and combining the datasets.
povWBNew = povWB[7:]
unempWBNew = unempWB[30:]
povVsUnemp = pd.merge(povWBNew, unempWBNew, on='country', how='inner')
COUNTRY = 'country'
UNEMP = "Unemployment (%)"
POV = 'Multidimensional Poverty ratio (%)'
povVsUnemp[POV] = povVsUnemp[POVERTY_INDICATOR]
povVsUnemp[UNEMP] = povVsUnemp[UNEMPLOYMENT_INDICATOR]
povVsUnemp = povVsUnemp[[COUNTRY, POV, UNEMP]]
povVsUnemp
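The copy-then-select step above could also be written with rename(). The sketch below is an alternative under assumptions, not the method used in the assignment: the merged table is replaced by a made-up stand-in with two rows, and the year_x and year_y columns are a guess at what the merge leaves behind.

```python
import pandas as pd

POVERTY_INDICATOR = 'SI.POV.MPWB'
UNEMPLOYMENT_INDICATOR = 'SL.UEM.TOTL.NE.ZS'
COUNTRY = 'country'
UNEMP = 'Unemployment (%)'
POV = 'Multidimensional Poverty ratio (%)'

# Made-up stand-in for the merged table (values are not real World Bank data).
merged = pd.DataFrame({
    COUNTRY: ['A', 'B'],
    'year_x': ['2020', '2020'],
    POVERTY_INDICATOR: [10.0, 20.0],
    'year_y': ['2020', '2020'],
    UNEMPLOYMENT_INDICATOR: [5.0, 7.5],
})

# Rename the indicator-code columns, then keep only the columns of interest.
renamed = merged.rename(columns={POVERTY_INDICATOR: POV,
                                 UNEMPLOYMENT_INDICATOR: UNEMP})
tidy = renamed[[COUNTRY, POV, UNEMP]]
print(tidy)
```

Both approaches end with the same three columns; rename() simply avoids creating duplicate columns along the way.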
With the table ready, I computed the correlation between the two indicators and its p-value. I also went a step further and drew a scatterplot of the two indicators to visualize the relationship.
from scipy.stats import spearmanr
correlation, pvalue = spearmanr(povVsUnemp[POV], povVsUnemp[UNEMP])
print('The correlation is', correlation)
if pvalue < 0.05:
    print('It is statistically significant.')
else:
    print('It is not statistically significant.')
%matplotlib inline
povVsUnemp.plot(x=UNEMP, y=POV, kind='scatter', figsize=(6, 10), grid=True, logy=True)