Advanced Data Analysis with World Bank Indicators

Last week focused on cleaning data, so this week went deeper into transforming and combining datasets and dataframes to suit the needs of a data scientist.

JOINING DATAFRAMES TOGETHER

To kick things off, we learnt about the different types of joins, such as inner, outer, left or right, and the effects they have on the resulting table or dataset.

pd.merge(gdp, life, on='Country name', how='left')

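The code above joins two dataframes, gdp and life, on their shared 'Country name' column using a left join. A left join keeps every row of the dataframe on the left (gdp) and adds the matching values from life; any country in gdp with no match in life gets missing values (NaN) in the columns that come from life.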
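To make the difference between the four join types concrete, here is a minimal sketch on two tiny made-up tables. The country labels and the values are invented for illustration; only pd.merge and its how argument come from the example above.

```python
import pandas as pd

# Toy data, invented for illustration only.
gdp = pd.DataFrame({'Country name': ['A', 'B', 'C'], 'GDP': [100, 200, 300]})
life = pd.DataFrame({'Country name': ['B', 'C', 'D'], 'Life expectancy': [70, 75, 80]})

# Run the same merge with each join type and keep the results for comparison.
results = {}
for how in ['inner', 'outer', 'left', 'right']:
    results[how] = pd.merge(gdp, life, on='Country name', how=how)
    print(how, list(results[how]['Country name']))

# In the left join, country A has no match in life, so its value is NaN.
print(results['left'])
```

The inner join keeps only the countries found in both tables, the outer join keeps the countries found in either one, and the left and right joins keep every country of the left or right table respectively.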

CONSTANTS IN DATA ANALYSIS

Constants are names that are assigned only once and then reused, for ease of access and readability for whoever might come across the code in the future. By convention, constants are written in uppercase, with words separated by underscores. Python does not enforce this; the uppercase name simply signals that the value should not be reassigned.
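As a small sketch of the idea, the column names below are stored once as constants and then reused wherever the column is needed. The names COUNTRY, GDP and SIGNIFICANCE_LEVEL and the table values are made up for this illustration.

```python
import pandas as pd

# Constants: assigned once, written in uppercase with underscores.
COUNTRY = 'Country name'
GDP = 'GDP'
SIGNIFICANCE_LEVEL = 0.05

# Toy table, invented for illustration only.
table = pd.DataFrame({COUNTRY: ['A', 'B'], GDP: [100, 200]})

# The constant is reused instead of retyping the column name as a string.
total_gdp = table[GDP].sum()
print(total_gdp)
```

If a column name ever changes, only the line that defines the constant has to be edited.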

CORRELATION AND THE SPEARMANR() FUNCTION

In data science, a statistical comparison is often carried out to check whether two factors are related. One straightforward way of doing that is with Spearman's rank correlation coefficient, which measures the correlation between two indicators on a scale of -1 to 1. -1 denotes a perfect inverse relationship (as one indicator rises, the other falls), 1 denotes a perfect direct relationship, and 0 denotes no rank relationship between the indicators. The coefficient comes with a p-value: by convention, the p-value needs to be lower than 0.05 for the coefficient obtained to be considered statistically significant.

The spearmanr() function comes from the scipy.stats module and is used as follows to check the correlation between two columns, POP and GDP.

from scipy.stats import spearmanr

popCol = gdpLifePop[POP]
gdpCol = gdpLifePop[GDP]

(correlation, pvalue) = spearmanr(popCol, gdpCol)
print(correlation, pvalue)
if pvalue < 0.05:
    print('It is statistically significant.')
else:
    print('It is not statistically significant.')
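To see what the coefficient itself measures, here is a minimal pure-Python sketch that ranks each column and then takes the ordinary (Pearson) correlation of the ranks. The helper names average_ranks, pearson and spearman_rho are made up for this illustration, the data is invented, and the sketch does not compute a p-value; it is not how scipy implements spearmanr().

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Ordinary correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's coefficient: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented data: one column that always rises and one that always falls.
population = [1, 2, 3, 4, 5]
gdp_rising = [10, 100, 1000, 10000, 100000]
gdp_falling = [50, 40, 30, 20, 10]

rho_rising = spearman_rho(population, gdp_rising)
rho_falling = spearman_rho(population, gdp_falling)
print(rho_rising, rho_falling)
```

Because only the ranks matter, the rising column gives a coefficient of 1 even though it grows tenfold at each step rather than in a straight line, and the falling column gives -1.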

SCATTERPLOTS

Scatterplots are used to visualize the correlation between two indicators. They are available natively through the dataframe plot() method with kind='scatter', and the logx and logy arguments set a logarithmic scale on the corresponding axis.

The code below is used to plot a scatterplot of countries using the GDP and LIFE columns.

%matplotlib inline

gdpVsLifeClean.plot(y=GDP, x=LIFE, kind='scatter', logy=True, grid=True, figsize=(6, 7))
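For a version that can be run without the course data, here is a minimal sketch on a made-up table. The column names and values are invented, and the Agg backend is selected only so that the sketch also runs outside a notebook, where %matplotlib inline is not available.

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen so the sketch also runs outside a notebook
import pandas as pd

GDP = 'GDP'
LIFE = 'Life expectancy'

# Invented values; GDP spans several orders of magnitude on purpose.
toy = pd.DataFrame({GDP: [10, 1000, 100000, 10000000],
                    LIFE: [55, 65, 72, 80]})

# logy=True puts the y axis on a logarithmic scale; the x axis stays linear.
ax = toy.plot(x=LIFE, y=GDP, kind='scatter', logy=True, grid=True, figsize=(6, 7))
print(ax.get_xscale(), ax.get_yscale())
```

A logarithmic axis is useful when one indicator ranges over several orders of magnitude, since on a linear axis the smaller values would be squeezed together near zero.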

CAPSTONE ASSIGNMENT

As a capstone assignment, I looked for a correlation between the multidimensional poverty ratio and unemployment, using data from the World Bank.

The pandas_datareader module was installed through Anaconda so that I could download the data directly from the World Bank site by specifying the year and the indicator code.

from pandas_datareader.wb import download
from scipy.stats import spearmanr

POVERTY_INDICATOR = 'SI.POV.MPWB'
UNEMPLOYMENT_INDICATOR = 'SL.UEM.TOTL.NE.ZS'
YEAR = 2020

povWB = download(country='all', indicator=POVERTY_INDICATOR, start=YEAR, end=YEAR)
unempWB = download(country='all', indicator=UNEMPLOYMENT_INDICATOR, start=YEAR, end=YEAR)
povWB, unempWB = povWB.dropna(), unempWB.dropna()
povWB, unempWB = povWB.reset_index(), unempWB.reset_index()
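The effect of dropna() and reset_index() can be tried offline on a small stand-in table. This sketch assumes the downloaded data is indexed by country and year with one column per indicator; that shape, the indicator name SOME.INDICATOR and all the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

INDICATOR = 'SOME.INDICATOR'  # hypothetical indicator code

# Stand-in for a downloaded table: indexed by (country, year), one value column.
index = pd.MultiIndex.from_tuples(
    [('Region X', '2020'), ('Country A', '2020'), ('Country B', '2020')],
    names=['country', 'year'])
raw = pd.DataFrame({INDICATOR: [np.nan, 4.2, 7.1]}, index=index)

# dropna() removes rows with missing values; reset_index() turns the
# country and year index levels back into ordinary columns.
clean = raw.dropna().reset_index()
print(clean)
```

After reset_index(), country is a regular column, which is what allows the tables to be merged on 'country' in the next step.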

I then carried on by cleaning, transforming, and combining the datasets.

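import pandas as pd

# Skip the leading rows of each table (the offsets differ between the two).
povWBNew = povWB[7:]
unempWBNew = unempWB[30:]
# Keep only the countries that appear in both tables.
povVsUnemp = pd.merge(povWBNew, unempWBNew, on='country', how='inner')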

COUNTRY = 'country'
UNEMP = "Unemployment (%)"
POV = 'Multidimensional Poverty ratio (%)'
povVsUnemp[POV] = povVsUnemp[POVERTY_INDICATOR]
povVsUnemp[UNEMP] = povVsUnemp[UNEMPLOYMENT_INDICATOR]

povVsUnemp = povVsUnemp[[COUNTRY, POV, UNEMP]]
povVsUnemp
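As a possible alternative to copying each indicator column into a new, readable one, the columns can be renamed in a single step with rename(). This is only a sketch of that option, not the approach used above; the merged table and its values are invented, while the constants are the ones defined earlier.

```python
import pandas as pd

POVERTY_INDICATOR = 'SI.POV.MPWB'
UNEMPLOYMENT_INDICATOR = 'SL.UEM.TOTL.NE.ZS'
COUNTRY = 'country'
UNEMP = 'Unemployment (%)'
POV = 'Multidimensional Poverty ratio (%)'

# Invented values standing in for the merged table.
merged = pd.DataFrame({
    COUNTRY: ['Country A', 'Country B'],
    POVERTY_INDICATOR: [10.0, 20.0],
    UNEMPLOYMENT_INDICATOR: [5.0, 8.0],
})

# Rename the indicator codes to readable headings, then pick the column order.
renamed = merged.rename(columns={POVERTY_INDICATOR: POV,
                                 UNEMPLOYMENT_INDICATOR: UNEMP})
renamed = renamed[[COUNTRY, POV, UNEMP]]
print(renamed)
```

Either way, the final table holds the same three columns; rename() just avoids creating duplicate columns that are dropped straight afterwards.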

Once that was done, I computed the correlation between the indicators along with its p-value. I also went a step further and made a scatterplot of the two indicators to visualize the relationship.

from scipy.stats import spearmanr

correlation, pvalue = spearmanr(povVsUnemp[POV], povVsUnemp[UNEMP])
print('The correlation is', correlation)
if pvalue < 0.05:
    print('It is statistically significant.')
else:
    print('It is not statistically significant.')

%matplotlib inline

povVsUnemp.plot(x=UNEMP, y=POV, kind='scatter', figsize=(6, 10), grid=True, logy=True)

DataraFlow Week 8: Advanced Data Analysis On World Bank Indicators
