Introduction to Python
Short tutorial on pandas, seaborn, matplotlib, sklearn and statsmodels packages

FOREST FIRES IN BRAZIL
See https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil for a full description of the dataset.
Import packages
%matplotlib notebook
%matplotlib inline
import pandas as pd #Handle datasets
import seaborn as sns #Plots
import matplotlib.pyplot as plt #Plots
#Set some graphical parameters
rc={'axes.labelsize': 25, 'figure.figsize': (20,10),
'axes.titlesize': 25, 'xtick.labelsize': 18, 'ytick.labelsize': 18}
sns.set(rc=rc)
#Path data
path = 'C:/Users/Angela Andreella/Documents/Conference_Talk/Python_Psicostat1/Data'
df = pd.read_csv(path + '/amazon.csv')
First $3$ observations:
df.head(n=3)
year | state | month | number | date | |
---|---|---|---|---|---|
0 | 1998 | Acre | Janeiro | 0.0 | 1998-01-01 |
1 | 1999 | Acre | Janeiro | 0.0 | 1999-01-01 |
2 | 2000 | Acre | Janeiro | 0.0 | 2000-01-01 |
Some information about the variables:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6454 entries, 0 to 6453
Data columns (total 5 columns):
year 6454 non-null int64
state 6454 non-null object
month 6454 non-null object
number 6454 non-null float64
date 6454 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 252.2+ KB
We are interested about the number of forest fires in Brazil
df.number.describe()
count 6454.000000
mean 108.293163
std 190.812242
min 0.000000
25% 3.000000
50% 24.000000
75% 113.000000
max 998.000000
Name: number, dtype: float64
To have an simple plot, we take a subset of the dataset:
df1 = df[(df.year > 2010) & (df.year < 2019)]
df1 =df1.loc[(df1.state.str.startswith('M'))]
We do a boxplot about the number of fire by groups, i.e., the states and the years.
sns.boxplot(x = 'year', y = 'number', hue = "state", data = df1)
<matplotlib.axes._subplots.AxesSubplot at 0x4ccdfa90>
We do a timeseries plot with error bands:
sns.lineplot(x="year", y="number",hue="state", data=df1)
<matplotlib.axes._subplots.AxesSubplot at 0x4d887c88>
also we do a grouped violinplots:
sns.violinplot(x="state", y="number", data=df1)
<matplotlib.axes._subplots.AxesSubplot at 0x4d9a26a0>
For other plots, please see https://seaborn.pydata.org/examples/index.html.
ECONOMIC FREEDOM INDEX
See https://www.kaggle.com/lewisduncan93/the-economic-freedom-index for a full description of the dataset.
Load and preprocess data
dt = pd.read_csv(path + '/economic_freedom_index2019_data.csv')
dt.columns = dt.columns.str.replace(' ', '')
dt.columns = dt.columns.str.replace('2019', '')
dt.columns = dt.columns.str.replace('%', '')
dt.columns = dt.columns.str.replace('(', '')
dt.columns = dt.columns.str.replace(')', '')
dt = dt.dropna(axis = 0,how='any')
Basic info
dt.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 173 entries, 0 to 185
Data columns (total 34 columns):
CountryID 173 non-null int64
CountryName 173 non-null object
WEBNAME 173 non-null object
Region 173 non-null object
WorldRank 173 non-null float64
RegionRank 173 non-null float64
Score 173 non-null float64
PropertyRights 173 non-null float64
JudicalEffectiveness 173 non-null float64
GovernmentIntegrity 173 non-null float64
TaxBurden 173 non-null float64
Gov'tSpending 173 non-null float64
FiscalHealth 173 non-null float64
BusinessFreedom 173 non-null float64
LaborFreedom 173 non-null float64
MonetaryFreedom 173 non-null float64
TradeFreedom 173 non-null float64
InvestmentFreedom 173 non-null float64
FinancialFreedom 173 non-null float64
TariffRate 173 non-null float64
IncomeTaxRate 173 non-null float64
CorporateTaxRate 173 non-null float64
TaxBurdenofGDP 173 non-null float64
Gov'tExpenditureofGDP 173 non-null float64
Country 173 non-null object
PopulationMillions 173 non-null object
GDPBillions,PPP 173 non-null object
GDPGrowthRate 173 non-null float64
5YearGDPGrowthRate 173 non-null float64
GDPperCapitaPPP 173 non-null object
Unemployment 173 non-null object
Inflation 173 non-null float64
FDIInflowMillions 173 non-null object
PublicDebtofGDP 173 non-null float64
dtypes: float64(24), int64(1), object(9)
memory usage: 47.3+ KB
Some plots
Boxplot by group, i.e. region:
sns.boxplot(x = 'Region', y = 'Score', data = dt)
<matplotlib.axes._subplots.AxesSubplot at 0x4dc85eb8>
First scatter plot:
plt.scatter(x="PropertyRights", y="Score", data=dt)
<matplotlib.collections.PathCollection at 0x4bff0c18>
We can put directly the linear regression fitting:
sns.lmplot(x="PropertyRights", y="Score", data=dt)
<seaborn.axisgrid.FacetGrid at 0x4ddbc710>
Density plot of the score variable:
sns.distplot(dt.Score, color="r")
<matplotlib.axes._subplots.AxesSubplot at 0x4df73fd0>
Pair plot considering some variables, i.e. Property Rights, Labor Freedom, Government Integrity, Judical Effectiveness, Fiscal Health, Region and Score:
dt1 = dt[['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth', "Score", 'Region']]
matplotlib.rc_file_defaults()
sns.pairplot(dt1, hue="Region")
<seaborn.axisgrid.PairGrid at 0x4cbd9da0>
Linear regression
Import packages
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
Correlation matrix
corr = dt[['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth', "Score"]].corr()
corr
PropertyRights | LaborFreedom | GovernmentIntegrity | JudicalEffectiveness | FiscalHealth | Score | |
---|---|---|---|---|---|---|
PropertyRights | 1.000000 | 0.432746 | 0.866998 | 0.826805 | 0.329969 | 0.876601 |
LaborFreedom | 0.432746 | 1.000000 | 0.413794 | 0.421694 | 0.104431 | 0.512976 |
GovernmentIntegrity | 0.866998 | 0.413794 | 1.000000 | 0.888880 | 0.292240 | 0.818174 |
JudicalEffectiveness | 0.826805 | 0.421694 | 0.888880 | 1.000000 | 0.287380 | 0.805825 |
FiscalHealth | 0.329969 | 0.104431 | 0.292240 | 0.287380 | 1.000000 | 0.559395 |
Score | 0.876601 | 0.512976 | 0.818174 | 0.805825 | 0.559395 | 1.000000 |
Heatmap of the correlation matrix:
sns.heatmap(corr,
xticklabels=corr.columns,
yticklabels=corr.columns)
<matplotlib.axes._subplots.AxesSubplot at 0x508b0a58>
We split the dataset into training (0.8) and test set (0.2):
msk = np.random.rand(len(dt)) < 0.8
train = dt[msk]
test = dt[~msk]
Linear regression having as dependent variable the Score and PropertyRights, LaborFreedom and FiscalHealth as explicative variables:
results = smf.ols('Score ~ PropertyRights + LaborFreedom + FiscalHealth', data=train).fit()
results.summary()
Dep. Variable: | Score | R-squared: | 0.883 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.880 |
Method: | Least Squares | F-statistic: | 335.6 |
Date: | Tue, 26 Nov 2019 | Prob (F-statistic): | 4.18e-62 |
Time: | 14:50:56 | Log-Likelihood: | -368.66 |
No. Observations: | 138 | AIC: | 745.3 |
Df Residuals: | 134 | BIC: | 757.0 |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 27.2738 | 1.444 | 18.881 | 0.000 | 24.417 | 30.131 |
PropertyRights | 0.3903 | 0.019 | 20.505 | 0.000 | 0.353 | 0.428 |
LaborFreedom | 0.0997 | 0.024 | 4.084 | 0.000 | 0.051 | 0.148 |
FiscalHealth | 0.1062 | 0.011 | 10.062 | 0.000 | 0.085 | 0.127 |
Omnibus: | 4.354 | Durbin-Watson: | 1.962 |
---|---|---|---|
Prob(Omnibus): | 0.113 | Jarque-Bera (JB): | 4.082 |
Skew: | -0.420 | Prob(JB): | 0.130 |
Kurtosis: | 3.065 | Cond. No. | 514. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. We **predict** the score values using the test set: ```python pred = results.predict(test) plt.scatter(test.Score, pred, color='b') ```
Model: | MixedLM | Dependent Variable: | Score |
No. Observations: | 138 | Method: | REML |
No. Groups: | 5 | Scale: | 11.7894 |
Min. group size: | 12 | Likelihood: | -376.7216 |
Max. group size: | 34 | Converged: | Yes |
Mean group size: | 27.6 |
Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 26.215 | 1.628 | 16.103 | 0.000 | 23.025 | 29.406 |
PropertyRights | 0.402 | 0.021 | 18.871 | 0.000 | 0.361 | 0.444 |
LaborFreedom | 0.102 | 0.024 | 4.211 | 0.000 | 0.055 | 0.150 |
FiscalHealth | 0.114 | 0.011 | 10.372 | 0.000 | 0.092 | 0.135 |
Region Var | 1.455 | 0.456 |
See http://www.statsmodels.org/stable/index.html for other commands about the linear (mixed) model. Also, https://www.statsmodels.org/stable/examples/notebooks/generated/mixed_lm_example.html makes a comparison between R lmer and Statsmodels MixedLM.
Principal Component Analysis
Import packages:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np
Standardize data:
features = ['PropertyRights', 'LaborFreedom', 'GovernmentIntegrity', 'JudicalEffectiveness','FiscalHealth']
# Separating out the features
x = dt.loc[:, features].values
# Separating out the target
y = dt.loc[:,'Score'].values
# Standardizing the features
x = StandardScaler().fit_transform(x)
Perform PCA considering $2$ principal components:
pca = PCA(4)
projected = pca.fit_transform(x)
print(x.shape)
print(projected.shape)
(173L, 5L)
(173L, 4L)
Plot the first $2$ principal components:
plt.scatter(projected[:, 0], projected[:, 1],
c=y, edgecolor='none', alpha=0.5,
cmap=plt.cm.get_cmap('seismic', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x129a2be0>
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Number of component")
plt.ylabel("Variance explained")
plt.xticks(range(4), [1,2,3,4])
([<matplotlib.axis.XTick at 0x12ac31d0>,
<matplotlib.axis.XTick at 0x129e3b38>,
<matplotlib.axis.XTick at 0x12ab8860>,
<matplotlib.axis.XTick at 0x12b1a128>],
<a list of 4 Text xticklabel objects>)