{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Mashup datasets\n",
"\n",
"This Jupyter Notebook analyses the mashup datasets for [\"Blessed Be the Fruit\"](https://github.com/OrsolaMBorrini/blessedfruit), an open data project regarding the analysis of factors that might influence pregnancy rates in young women in Italy.\n",
"\n",
"From the starting 7 source datasets, three different mashup datasets have been created:\n",
"Id | Dataset | Description (factor of interest) | Original source datasets\n",
"--- | --- | --- | --- \n",
"MD1 | Religious observance in each region | RELIGION - % of religious observance in each region (over the total population) | D1, D2, D3 \n",
"MD2 | Pregnancy rates in young women in each region | PREGNANCY - % of pregnancies in young women (15-25) in each region (over the total population of young women aged 15-25) | D4, D5, D6\n",
"MD3 | (Higher) education rates in young women in each region | EDUCATION - % of women early leavers (18-24) in each region (over the total population) | D1, D2, D7\n",
"\n",
"### Disclaimer\n",
"This Jupyter Notebook is of informational nature only, it is not thought to be used for the clean up process for the datasets, but only for their analysis and the explanation of the process. The code actually used for the clean up can be found in `scripts > MASHUP.py`.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mashup phase\n",
"### MD1 - Religious observance in each region\n",
"To create a dataset showing the percentage of religious observance in each region, we need to compare the total general population of each region (with no distinction of gender) to the religious observance (in thousands value) in each region. To get this result we will use the following source datasets:\n",
"* D1 and D2 -> for the data regarding total general population in each region for each of the three years of interest (`data > cleanDS > Population2017_Generalclean.csv`, `Population2018_Generalclean.csv`, `Population2019_Generalclean.csv`)\n",
"* D3 -> for the data regarding religious observance (thousands value) in each region (`data > cleanDS > D3_clean.csv`)\n",
"\n",
"First of all the datasets regarding general population were stripped of any distinction between genders and the various ages were merged together. Additional columns to keep track of the time were added accordingly."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ITTER107 | \n",
" Region | \n",
" Population | \n",
" Time | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" ITC1 | \n",
" Piemonte | \n",
" 8622434 | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" ITC2 | \n",
" Valle d'Aosta / Vallée d'Aoste | \n",
" 250068 | \n",
" 2019 | \n",
"
\n",
" \n",
" 2 | \n",
" ITC3 | \n",
" Liguria | \n",
" 3049652 | \n",
" 2019 | \n",
"
\n",
" \n",
" 3 | \n",
" ITC4 | \n",
" Lombardia | \n",
" 20055204 | \n",
" 2019 | \n",
"
\n",
" \n",
" 4 | \n",
" ITD3 | \n",
" Veneto | \n",
" 9758266 | \n",
" 2019 | \n",
"
\n",
" \n",
" 5 | \n",
" ITD4 | \n",
" Friuli-Venezia Giulia | \n",
" 2412432 | \n",
" 2019 | \n",
"
\n",
" \n",
" 6 | \n",
" ITD5 | \n",
" Emilia-Romagna | \n",
" 8928238 | \n",
" 2019 | \n",
"
\n",
" \n",
" 7 | \n",
" ITDA | \n",
" Trentino Alto Adige / Südtirol | \n",
" 2156138 | \n",
" 2019 | \n",
"
\n",
" \n",
" 8 | \n",
" ITE1 | \n",
" Toscana | \n",
" 7385110 | \n",
" 2019 | \n",
"
\n",
" \n",
" 9 | \n",
" ITE2 | \n",
" Umbria | \n",
" 1740330 | \n",
" 2019 | \n",
"
\n",
" \n",
" 10 | \n",
" ITE3 | \n",
" Marche | \n",
" 3025344 | \n",
" 2019 | \n",
"
\n",
" \n",
" 11 | \n",
" ITE4 | \n",
" Lazio | \n",
" 11511400 | \n",
" 2019 | \n",
"
\n",
" \n",
" 12 | \n",
" ITF1 | \n",
" Abruzzo | \n",
" 2587882 | \n",
" 2019 | \n",
"
\n",
" \n",
" 13 | \n",
" ITF2 | \n",
" Molise | \n",
" 601032 | \n",
" 2019 | \n",
"
\n",
" \n",
" 14 | \n",
" ITF3 | \n",
" Campania | \n",
" 11424286 | \n",
" 2019 | \n",
"
\n",
" \n",
" 15 | \n",
" ITF4 | \n",
" Puglia | \n",
" 7906610 | \n",
" 2019 | \n",
"
\n",
" \n",
" 16 | \n",
" ITF5 | \n",
" Basilicata | \n",
" 1106508 | \n",
" 2019 | \n",
"
\n",
" \n",
" 17 | \n",
" ITF6 | \n",
" Calabria | \n",
" 3788220 | \n",
" 2019 | \n",
"
\n",
" \n",
" 18 | \n",
" ITG1 | \n",
" Sicilia | \n",
" 9750580 | \n",
" 2019 | \n",
"
\n",
" \n",
" 19 | \n",
" ITG2 | \n",
" Sardegna | \n",
" 3223242 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ITTER107 Region Population Time\n",
"0 ITC1 Piemonte 8622434 2019\n",
"1 ITC2 Valle d'Aosta / Vallée d'Aoste 250068 2019\n",
"2 ITC3 Liguria 3049652 2019\n",
"3 ITC4 Lombardia 20055204 2019\n",
"4 ITD3 Veneto 9758266 2019\n",
"5 ITD4 Friuli-Venezia Giulia 2412432 2019\n",
"6 ITD5 Emilia-Romagna 8928238 2019\n",
"7 ITDA Trentino Alto Adige / Südtirol 2156138 2019\n",
"8 ITE1 Toscana 7385110 2019\n",
"9 ITE2 Umbria 1740330 2019\n",
"10 ITE3 Marche 3025344 2019\n",
"11 ITE4 Lazio 11511400 2019\n",
"12 ITF1 Abruzzo 2587882 2019\n",
"13 ITF2 Molise 601032 2019\n",
"14 ITF3 Campania 11424286 2019\n",
"15 ITF4 Puglia 7906610 2019\n",
"16 ITF5 Basilicata 1106508 2019\n",
"17 ITF6 Calabria 3788220 2019\n",
"18 ITG1 Sicilia 9750580 2019\n",
"19 ITG2 Sardegna 3223242 2019"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import csv\n",
"from pandas import *\n",
"import functools as ft\n",
"\n",
"# ==== MD1 ====\n",
"\n",
"# GENERAL POPULATION FILES\n",
"pop17 = read_csv(\"../data/cleanDS/Population2017General_clean.csv\")\n",
"pop17 = pop17.drop(pop17[(pop17.Sex == \"Males\") | (pop17.Sex == \"Females\")].index)\n",
"gen_pop17 = (pop17.groupby([\"Region code\",\"Region\"])[\"Population\"].sum()).reset_index()\n",
"gen_pop17[\"Time\"] = 2017\n",
"\n",
"def generalPop(path):\n",
" file = read_csv(path)\n",
" file.drop([\"Males\",\"Females\"],axis=1,inplace=True)\n",
" file = (file.groupby([\"ITTER107\",\"Region\"])[\"Population\"].sum()).reset_index()\n",
" return file\n",
"\n",
"pop18 = \"../data/cleanDS/Population2018General_clean.csv\"\n",
"pop19 = \"../data/cleanDS/Population2019General_clean.csv\"\n",
"\n",
"gen_pop18 = generalPop(pop18)\n",
"gen_pop18[\"Time\"] = 2018\n",
"gen_pop19 = generalPop(pop19)\n",
"gen_pop19[\"Time\"] = 2019\n",
"\n",
"# Example\n",
"gen_pop19"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Two ancillary functions were created to first get the percentage values from the thousands values that we had in the tables (function `getPercentage`) and to add the column with the percentage value in the specific dataframe (`addPercentageColumn`)."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"d3_clean = read_csv(\"../data/cleanDS/D3_clean.csv\")\n",
"\n",
"def getPercentage(num,pop):\n",
" return num * 100 / pop\n",
"\n",
"def addPercentageColumn(yearlyDF):\n",
" for idx,row in yearlyDF.iterrows():\n",
" perc = getPercentage((row[\"Value\"]*1000),row[\"Population\"])\n",
" yearlyDF.loc[idx,\"Percentage\"] = perc\n",
" return yearlyDF"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As the process is very similar with regards for each year, we have created one single function to create (and then save as a csv file) all three mashup datasets: `createYearlyDF`.\n",
"\n",
"This function has two input parameters:\n",
"1. `genPopList` -> a list of the general population dataframes\n",
"2. `rightDF` -> the dataframe regarding religious observance (containing data for all three years, from 2017 to 2019)\n",
"\n",
"For each year (which is obtained by checking the first value under the column \"Time\" of whatever general population dataframe), we cleaned up the religious observance dataframe, keeping only the data needed for that specific year (see: if-elif structure in the code below + drop instruction).\n",
"\n",
"Then, we have merged the two input dataframes together using the unique code region and we subsequently dropped other unnecessary and repeated columns (namely \"Territory\",\"TIME\"...).\n",
"\n",
"Finally, we called the ancillary function to add a \"Percentage\" column to the resulting dataframe and we have saved it as a new csv file (this last step has been commented out after a first run in order to avoid interfeering with the rest of the code/project).\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def createYearlyDF(genPopList,rightDF):\n",
" for item in genPopList:\n",
" if item[\"Time\"][0] == 2017:\n",
" d3_2017 = rightDF.drop(rightDF[(rightDF.TIME == 2018) | (rightDF.TIME == 2019)].index)\n",
" MD1_17 = (merge(item,d3_2017,left_on=\"Region code\",right_on=\"ITTER107\")).drop([\"Territory\",\"ITTER107\",\"TIME\"],axis=1)\n",
" addPercentageColumn(MD1_17)\n",
" #MD1_17.to_csv(\"data/mashupDS/MD1_17.csv\")\n",
" elif item[\"Time\"][0] == 2018:\n",
" d3_2018 = rightDF.drop(rightDF[(rightDF.TIME == 2017) | (rightDF.TIME == 2019)].index)\n",
" MD1_18 = (merge(item,d3_2018,left_on=\"ITTER107\",right_on=\"ITTER107\")).drop([\"Territory\",\"TIME\"],axis=1)\n",
" addPercentageColumn(MD1_18)\n",
" #MD1_18.to_csv(\"data/mashupDS/MD1_18.csv\")\n",
" elif item[\"Time\"][0] == 2019:\n",
" d3_2019 = rightDF.drop(rightDF[(rightDF.TIME == 2017) | (rightDF.TIME == 2018)].index)\n",
" MD1_19 = (merge(item,d3_2019,left_on=\"ITTER107\",right_on=\"ITTER107\")).drop([\"Territory\",\"TIME\"],axis=1)\n",
" addPercentageColumn(MD1_19)\n",
" #MD1_19.to_csv(\"data/mashupDS/MD1_19.csv\")\n",
" return \n",
"\n",
"createYearlyDF([gen_pop17,gen_pop18,gen_pop19],d3_clean)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In the end, we have three mashup datasets (one for each year in our scope) regarding religious observance rates in each region in Italy."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" Region code | \n",
" Region | \n",
" Population | \n",
" Time | \n",
" TIPO_DATO_AVQ | \n",
" Value | \n",
" Percentage | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" ITC1 | \n",
" Piemonte | \n",
" 4349911 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 1105 | \n",
" 25.402819 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" ITC1 | \n",
" Piemonte | \n",
" 4349911 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 1007 | \n",
" 23.149899 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" ITC2 | \n",
" Valle D'Aosta/Vallée D'Aoste | \n",
" 126213 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 37 | \n",
" 29.315522 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" ITC2 | \n",
" Valle D'Aosta/Vallée D'Aoste | \n",
" 126213 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 22 | \n",
" 17.430851 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" ITC3 | \n",
" Liguria | \n",
" 1541541 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 252 | \n",
" 16.347278 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" ITC3 | \n",
" Liguria | \n",
" 1541541 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 536 | \n",
" 34.770402 | \n",
"
\n",
" \n",
" 6 | \n",
" 6 | \n",
" ITC4 | \n",
" Lombardia | \n",
" 9986962 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 2539 | \n",
" 25.423147 | \n",
"
\n",
" \n",
" 7 | \n",
" 7 | \n",
" ITC4 | \n",
" Lombardia | \n",
" 9986962 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 2586 | \n",
" 25.893760 | \n",
"
\n",
" \n",
" 8 | \n",
" 8 | \n",
" ITD3 | \n",
" Veneto | \n",
" 4880936 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 1302 | \n",
" 26.675211 | \n",
"
\n",
" \n",
" 9 | \n",
" 9 | \n",
" ITD3 | \n",
" Veneto | \n",
" 4880936 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 964 | \n",
" 19.750310 | \n",
"
\n",
" \n",
" 10 | \n",
" 10 | \n",
" ITD4 | \n",
" Friuli-Venezia Giulia | \n",
" 1211155 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 353 | \n",
" 29.145733 | \n",
"
\n",
" \n",
" 11 | \n",
" 11 | \n",
" ITD4 | \n",
" Friuli-Venezia Giulia | \n",
" 1211155 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 243 | \n",
" 20.063493 | \n",
"
\n",
" \n",
" 12 | \n",
" 12 | \n",
" ITD5 | \n",
" Emilia-Romagna | \n",
" 4445920 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 1520 | \n",
" 34.188649 | \n",
"
\n",
" \n",
" 13 | \n",
" 13 | \n",
" ITD5 | \n",
" Emilia-Romagna | \n",
" 4445920 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 834 | \n",
" 18.758772 | \n",
"
\n",
" \n",
" 14 | \n",
" 14 | \n",
" ITDA | \n",
" Trentino-Alto Adige/Südtirol | \n",
" 1068738 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 267 | \n",
" 24.982737 | \n",
"
\n",
" \n",
" 15 | \n",
" 15 | \n",
" ITDA | \n",
" Trentino-Alto Adige/Südtirol | \n",
" 1068738 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 214 | \n",
" 20.023617 | \n",
"
\n",
" \n",
" 16 | \n",
" 16 | \n",
" ITE1 | \n",
" Toscana | \n",
" 3712048 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 682 | \n",
" 18.372607 | \n",
"
\n",
" \n",
" 17 | \n",
" 17 | \n",
" ITE1 | \n",
" Toscana | \n",
" 3712048 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 1255 | \n",
" 33.808830 | \n",
"
\n",
" \n",
" 18 | \n",
" 18 | \n",
" ITE2 | \n",
" Umbria | \n",
" 876477 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 177 | \n",
" 20.194483 | \n",
"
\n",
" \n",
" 19 | \n",
" 19 | \n",
" ITE2 | \n",
" Umbria | \n",
" 876477 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 208 | \n",
" 23.731370 | \n",
"
\n",
" \n",
" 20 | \n",
" 20 | \n",
" ITE3 | \n",
" Marche | \n",
" 1526331 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 318 | \n",
" 20.834275 | \n",
"
\n",
" \n",
" 21 | \n",
" 21 | \n",
" ITE3 | \n",
" Marche | \n",
" 1526331 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 404 | \n",
" 26.468702 | \n",
"
\n",
" \n",
" 22 | \n",
" 22 | \n",
" ITE4 | \n",
" Lazio | \n",
" 5774606 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 1383 | \n",
" 23.949686 | \n",
"
\n",
" \n",
" 23 | \n",
" 23 | \n",
" ITE4 | \n",
" Lazio | \n",
" 5774606 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 1485 | \n",
" 25.716040 | \n",
"
\n",
" \n",
" 24 | \n",
" 24 | \n",
" ITF1 | \n",
" Abruzzo | \n",
" 1306059 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 310 | \n",
" 23.735528 | \n",
"
\n",
" \n",
" 25 | \n",
" 25 | \n",
" ITF1 | \n",
" Abruzzo | \n",
" 1306059 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 218 | \n",
" 16.691436 | \n",
"
\n",
" \n",
" 26 | \n",
" 26 | \n",
" ITF2 | \n",
" Molise | \n",
" 306564 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 90 | \n",
" 29.357655 | \n",
"
\n",
" \n",
" 27 | \n",
" 27 | \n",
" ITF2 | \n",
" Molise | \n",
" 306564 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 51 | \n",
" 16.636004 | \n",
"
\n",
" \n",
" 28 | \n",
" 28 | \n",
" ITF3 | \n",
" Campania | \n",
" 5762889 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 1742 | \n",
" 30.227894 | \n",
"
\n",
" \n",
" 29 | \n",
" 29 | \n",
" ITF3 | \n",
" Campania | \n",
" 5762889 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 780 | \n",
" 13.534878 | \n",
"
\n",
" \n",
" 30 | \n",
" 30 | \n",
" ITF4 | \n",
" Puglia | \n",
" 4000966 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 624 | \n",
" 15.596234 | \n",
"
\n",
" \n",
" 31 | \n",
" 31 | \n",
" ITF4 | \n",
" Puglia | \n",
" 4000966 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 1318 | \n",
" 32.942044 | \n",
"
\n",
" \n",
" 32 | \n",
" 32 | \n",
" ITF5 | \n",
" Basilicata | \n",
" 562968 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 164 | \n",
" 29.131318 | \n",
"
\n",
" \n",
" 33 | \n",
" 33 | \n",
" ITF5 | \n",
" Basilicata | \n",
" 562968 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 80 | \n",
" 14.210399 | \n",
"
\n",
" \n",
" 34 | \n",
" 34 | \n",
" ITF6 | \n",
" Calabria | \n",
" 1924257 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 263 | \n",
" 13.667613 | \n",
"
\n",
" \n",
" 35 | \n",
" 35 | \n",
" ITF6 | \n",
" Calabria | \n",
" 1924257 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 622 | \n",
" 32.324165 | \n",
"
\n",
" \n",
" 36 | \n",
" 36 | \n",
" ITG1 | \n",
" Sicilia | \n",
" 4942188 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 821 | \n",
" 16.612075 | \n",
"
\n",
" \n",
" 37 | \n",
" 37 | \n",
" ITG1 | \n",
" Sicilia | \n",
" 4942188 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 1492 | \n",
" 30.189058 | \n",
"
\n",
" \n",
" 38 | \n",
" 38 | \n",
" ITG2 | \n",
" Sardegna | \n",
" 1631040 | \n",
" 2017 | \n",
" 6_WEEK_RELIG | \n",
" 382 | \n",
" 23.420640 | \n",
"
\n",
" \n",
" 39 | \n",
" 39 | \n",
" ITG2 | \n",
" Sardegna | \n",
" 1631040 | \n",
" 2017 | \n",
" 6_NEVER_RELIG | \n",
" 398 | \n",
" 24.401609 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unnamed: 0 Region code Region Population Time \\\n",
"0 0 ITC1 Piemonte 4349911 2017 \n",
"1 1 ITC1 Piemonte 4349911 2017 \n",
"2 2 ITC2 Valle D'Aosta/Vallée D'Aoste 126213 2017 \n",
"3 3 ITC2 Valle D'Aosta/Vallée D'Aoste 126213 2017 \n",
"4 4 ITC3 Liguria 1541541 2017 \n",
"5 5 ITC3 Liguria 1541541 2017 \n",
"6 6 ITC4 Lombardia 9986962 2017 \n",
"7 7 ITC4 Lombardia 9986962 2017 \n",
"8 8 ITD3 Veneto 4880936 2017 \n",
"9 9 ITD3 Veneto 4880936 2017 \n",
"10 10 ITD4 Friuli-Venezia Giulia 1211155 2017 \n",
"11 11 ITD4 Friuli-Venezia Giulia 1211155 2017 \n",
"12 12 ITD5 Emilia-Romagna 4445920 2017 \n",
"13 13 ITD5 Emilia-Romagna 4445920 2017 \n",
"14 14 ITDA Trentino-Alto Adige/Südtirol 1068738 2017 \n",
"15 15 ITDA Trentino-Alto Adige/Südtirol 1068738 2017 \n",
"16 16 ITE1 Toscana 3712048 2017 \n",
"17 17 ITE1 Toscana 3712048 2017 \n",
"18 18 ITE2 Umbria 876477 2017 \n",
"19 19 ITE2 Umbria 876477 2017 \n",
"20 20 ITE3 Marche 1526331 2017 \n",
"21 21 ITE3 Marche 1526331 2017 \n",
"22 22 ITE4 Lazio 5774606 2017 \n",
"23 23 ITE4 Lazio 5774606 2017 \n",
"24 24 ITF1 Abruzzo 1306059 2017 \n",
"25 25 ITF1 Abruzzo 1306059 2017 \n",
"26 26 ITF2 Molise 306564 2017 \n",
"27 27 ITF2 Molise 306564 2017 \n",
"28 28 ITF3 Campania 5762889 2017 \n",
"29 29 ITF3 Campania 5762889 2017 \n",
"30 30 ITF4 Puglia 4000966 2017 \n",
"31 31 ITF4 Puglia 4000966 2017 \n",
"32 32 ITF5 Basilicata 562968 2017 \n",
"33 33 ITF5 Basilicata 562968 2017 \n",
"34 34 ITF6 Calabria 1924257 2017 \n",
"35 35 ITF6 Calabria 1924257 2017 \n",
"36 36 ITG1 Sicilia 4942188 2017 \n",
"37 37 ITG1 Sicilia 4942188 2017 \n",
"38 38 ITG2 Sardegna 1631040 2017 \n",
"39 39 ITG2 Sardegna 1631040 2017 \n",
"\n",
" TIPO_DATO_AVQ Value Percentage \n",
"0 6_NEVER_RELIG 1105 25.402819 \n",
"1 6_WEEK_RELIG 1007 23.149899 \n",
"2 6_NEVER_RELIG 37 29.315522 \n",
"3 6_WEEK_RELIG 22 17.430851 \n",
"4 6_WEEK_RELIG 252 16.347278 \n",
"5 6_NEVER_RELIG 536 34.770402 \n",
"6 6_WEEK_RELIG 2539 25.423147 \n",
"7 6_NEVER_RELIG 2586 25.893760 \n",
"8 6_WEEK_RELIG 1302 26.675211 \n",
"9 6_NEVER_RELIG 964 19.750310 \n",
"10 6_NEVER_RELIG 353 29.145733 \n",
"11 6_WEEK_RELIG 243 20.063493 \n",
"12 6_NEVER_RELIG 1520 34.188649 \n",
"13 6_WEEK_RELIG 834 18.758772 \n",
"14 6_WEEK_RELIG 267 24.982737 \n",
"15 6_NEVER_RELIG 214 20.023617 \n",
"16 6_WEEK_RELIG 682 18.372607 \n",
"17 6_NEVER_RELIG 1255 33.808830 \n",
"18 6_WEEK_RELIG 177 20.194483 \n",
"19 6_NEVER_RELIG 208 23.731370 \n",
"20 6_NEVER_RELIG 318 20.834275 \n",
"21 6_WEEK_RELIG 404 26.468702 \n",
"22 6_WEEK_RELIG 1383 23.949686 \n",
"23 6_NEVER_RELIG 1485 25.716040 \n",
"24 6_WEEK_RELIG 310 23.735528 \n",
"25 6_NEVER_RELIG 218 16.691436 \n",
"26 6_WEEK_RELIG 90 29.357655 \n",
"27 6_NEVER_RELIG 51 16.636004 \n",
"28 6_WEEK_RELIG 1742 30.227894 \n",
"29 6_NEVER_RELIG 780 13.534878 \n",
"30 6_NEVER_RELIG 624 15.596234 \n",
"31 6_WEEK_RELIG 1318 32.942044 \n",
"32 6_WEEK_RELIG 164 29.131318 \n",
"33 6_NEVER_RELIG 80 14.210399 \n",
"34 6_NEVER_RELIG 263 13.667613 \n",
"35 6_WEEK_RELIG 622 32.324165 \n",
"36 6_NEVER_RELIG 821 16.612075 \n",
"37 6_WEEK_RELIG 1492 30.189058 \n",
"38 6_WEEK_RELIG 382 23.420640 \n",
"39 6_NEVER_RELIG 398 24.401609 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"md1_17 = read_csv(\"../data/mashupDS/MD1_17.csv\")\n",
"md1_17"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### MD2 - Pregnancy rates in young women in each region\n",
"\n",
"#### MD2 - Absolute values\n",
"\n",
"This mashup dataset contains the total number of pregnancies in young women aged 15-25, divided per region. A dataset for each year (2017, 2018, 2019) has been created.\n",
"\n",
"The following source datasets have been used:\n",
"* D4 - \"Mother - Age and Citizenship\"\n",
"* D5 - \"Spontaneous abortions - resignation from the place of the event: Age of women - prov.\"\n",
"* D6 - \"Induced abortions - Migration: Events by region of residence of the woman and region of intervention\"\n",
"\n",
"The result dataset contains the following columns:\n",
"\n",
"\n",
"ITTER107 | \n",
"Region | \n",
"Live_births | \n",
"Miscarriages | \n",
"Abortions | \n",
"Total | \n",
"Time | \n",
"
\n",
"
\n",
"\n",
"Total
contains the data retreived by summing values of live_births, spontaneous abortions and induced abortions in D4, D5, D6.\n",
"\n",
"Disclaimer\n",
"\n",
"* Although our selected age group would be 15-25 years old, unfortunately, these istat datasets were organized in age classes of 15-24, 25-29. For this reason we have decided to keep only age class 15-24 since, including pregnancies data for women closer to 30 years old, hence closer to the average age of pregnancy in women, would have falsified our data too much\n",
"\n",
"* Dataset 5 does not have values for a class age of 15 -24 as the other two datasets have, instead, it gives values of live births aggregated for a class of women until 17 years old. However, we considered valid to aggregate numbers from this dataset without further distinction of age, since values for pregnancies under 15 years old are very close to 0, hence irrelevant.\n",
"\n",
"\n",
"For all datasets (D4, D5 and D6), the same process has been adopted. For the sake of documentation we will use D4 as an example.\n",
"\n",
"1. Age groups have been aggregated, cleaning eventual unwanted values from source datasets "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"pathD4 = \"../data/srcDS/D4Pregnancy/cleanedDS/cleanedD4-2019-Pregnancy.csv\"\n",
"live_births = read_csv(pathD4, keep_default_na=False,\n",
" dtype= {\n",
" \"RESIDENCE_TERR\":\"string\",\n",
" \"CITIZENSHIP_MOTHER\": \"string\",\n",
" \"MOTHER_AGE\" :\"string\",\n",
" \"OBS_VALUE\" : \"int64\"\n",
" })\n",
"\n",
"live_births = live_births.query(\"MOTHER_AGE != 'Y25'\") # create a df without data for Y25\n",
"live_births = live_births.query(\"RESIDENCE_TERR != 'ITD1'\")\n",
"live_births = live_births.query(\"RESIDENCE_TERR != 'ITD2'\") \n",
"# take out the column with mother_age so I can sum all of the values per region\n",
"live_births = live_births[[\"RESIDENCE_TERR\", \"OBS_VALUE\"]] \n",
"live_births = live_births.groupby(\"RESIDENCE_TERR\", as_index=False).sum()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Regional code and Region name have been added
\n",
"In D4, that only had regional codes:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"ID_name_dict = { \"ITC1\" : \"Piemonte\",\n",
" \"ITC2\" : \"Valle d'Aosta / Vallée d'Aoste\",\n",
" \"ITC3\" : \"Liguria\",\n",
" \"ITC4\" : \"Lombardia\",\n",
" \"ITDA\" : \"Trentino Alto Adige / Südtirol\",\n",
" \"ITD3\" : \"Veneto\",\n",
" \"ITD4\" : \"Friuli-Venezia Giulia\",\n",
" \"ITD5\" : \"Emilia-Romagna\",\n",
" \"ITE1\" : \"Toscana\",\n",
" \"ITE2\" : \"Umbria\",\n",
" \"ITE3\" : \"Marche\",\n",
" \"ITE4\" : \"Lazio\",\n",
" \"ITF1\" : \"Abruzzo\",\n",
" \"ITF2\" : \"Molise\",\n",
" \"ITF3\" : \"Campania\",\n",
" \"ITF4\" : \"Puglia\",\n",
" \"ITF5\" : \"Basilicata\",\n",
" \"ITF6\" : \"Calabria\",\n",
" \"ITG1\" : \"Sicilia\",\n",
" \"ITG2\" : \"Sardegna\"\n",
" }\n",
"\n",
"\n",
"i = 0\n",
"\n",
"# add NL Region name to dataset\n",
"\n",
"region_name = []\n",
"\n",
"for idx, row in live_births.iterrows():\n",
" region_name.append(ID_name_dict[live_births.at[idx, \"RESIDENCE_TERR\"]])\n",
"\n",
"live_births[\"Region\"] = region_name #maybe try inserting at idx 1 ??\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"2.2. In D5 and D5, that only had Region name:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"pathD5 = \"../data/srcDS/D5Pregnancy/cleanedDS/cleanedD5-2019-Pregnancy.csv\"\n",
"miscarriages = read_csv(pathD5, keep_default_na=False,\n",
" dtype= {\n",
" \"Territorio\":\"string\",\n",
" \"Classe di età\": \"string\",\n",
" \"Value\" : \"int64\"\n",
" })\n",
"\n",
"pathD6 = \"../data/srcDS/D6Pregnancy/cleanedDS/cleanedD6-2019-Pregnancy.csv\"\n",
"abortions = read_csv(pathD6, keep_default_na=False,\n",
" dtype= {\n",
" \"Territorio dell'evento\":\"string\",\n",
" \"Età e classe di età\": \"string\",\n",
" \"Value\" : \"int64\"\n",
" })\n",
" \n",
"region_code = []\n",
"region_name = []\n",
"\n",
"for key, value in ID_name_dict.items():\n",
" region_name.append(value)\n",
" region_code.append(key)\n",
"\n",
"DF_ID_name = DataFrame({\"ITTER107\" : Series(region_code, dtype=\"string\"), \"Region\": Series(region_name, dtype=\"string\")})\n",
"\n",
"miscarriages = merge(DF_ID_name, miscarriages, left_on=\"Region\", right_on=\"Territorio\")\n",
"miscarriages.drop([\"Territorio\"], axis=1, inplace=True)\n",
"\n",
"abortions = merge(DF_ID_name, abortions, left_on=\"Region\", right_on=\"Territorio dell'evento\")\n",
"abortions.drop([\"Territorio dell'evento\"], axis=1, inplace=True)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"3. If needed columns label have been renamed"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"live_births.rename(columns = {\"RESIDENCE_TERR\" : \"ITTER107\", \"OBS_VALUE\" : \"Live_births\"}, inplace=True)\n",
"miscarriages.rename(columns = {\"Value\" : \"Miscarriages\"}, inplace=True)\n",
"abortions.rename(columns = { \"Value\" : \"Abortions\"}, inplace=True)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"4. A column Time
specifying the year of the dataset has been added.
\n",
"For this purpose, a function addyear
has been implemented. It takes in imput pathyear
the path of the dataset of interest, and df
the Dataframe object of the dataset of interest.
\n",
"If Time
column is not already in the dataset, the function adds it, etracting the correct year from the name of the dataset found in the path. It returns yr
the year of the dataset as a string\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"def addyear(pathyear, df):\n",
" if \"Time\" not in df.columns[0]:\n",
" year2017 = re.search(\"2017\", pathyear)\n",
" year2018 = re.search(\"2018\", pathyear)\n",
" year2019 = re.search(\"2019\", pathyear)\n",
"\n",
" years = [year2017, year2018, year2019]\n",
" for el in years:\n",
" if el is not None:\n",
" yr = el.group()\n",
" df[\"Time\"] = yr\n",
" return yr\n",
" return False\n",
"\n",
"yr = addyear(pathD4, live_births)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"5. Finally, the resulting datasets are merged together based on Region's value and the total pregnancies are calculated summing each DS' value. They are exported as csv to create the final mashup dataset MD2."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"PregnancyDS = merge(live_births, miscarriages, left_on= \"ITTER107\", right_on=\"ITTER107\")\n",
"PregnancyDS = merge(PregnancyDS, abortions, left_on= \"ITTER107\", right_on=\"ITTER107\", how=\"left\") #how=\"left\" is because I don'7 have lazio's value rn\n",
"\n",
"total = PregnancyDS.sum(axis=1, numeric_only=True) # sums all pregnancies value per region\n",
"PregnancyDS[\"Total\"] = total\n",
"PregnancyDS = PregnancyDS[[\"ITTER107\", \"Region\", \"Live_births\", \"Miscarriages\", \"Abortions\", \"Total\", \"Time\"]]\n",
"PregnancyDS.to_csv(\"../data/mashupDS/MD2-ASS-\" + yr + \".csv\", index=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the visualisation of the 2017 mashup dataset with absolute values below as example.\n",
"All MD2 - Absolute values can be found in ` data > mashupDS `
\n",
"Full code is available at `scripts > MASHUP.py `"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ITTER107 | \n",
" Region | \n",
" Live_births | \n",
" Miscarriages | \n",
" Abortions | \n",
" Total | \n",
" Time | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" ITC1 | \n",
" Piemonte | \n",
" 5448 | \n",
" 309 | \n",
" 1670 | \n",
" 7427 | \n",
" 2017 | \n",
"
\n",
" \n",
" 1 | \n",
" ITC2 | \n",
" Valle d'Aosta / Vallée d'Aoste | \n",
" 142 | \n",
" 9 | \n",
" 46 | \n",
" 197 | \n",
" 2017 | \n",
"
\n",
" \n",
" 2 | \n",
" ITC3 | \n",
" Liguria | \n",
" 1794 | \n",
" 92 | \n",
" 664 | \n",
" 2550 | \n",
" 2017 | \n",
"
\n",
" \n",
" 3 | \n",
" ITC4 | \n",
" Lombardia | \n",
" 13468 | \n",
" 599 | \n",
" 3381 | \n",
" 17448 | \n",
" 2017 | \n",
"
\n",
" \n",
" 4 | \n",
" ITD3 | \n",
" Veneto | \n",
" 5944 | \n",
" 287 | \n",
" 1116 | \n",
" 7347 | \n",
" 2017 | \n",
"
\n",
" \n",
" 5 | \n",
" ITD4 | \n",
" Friuli-Venezia Giulia | \n",
" 1342 | \n",
" 75 | \n",
" 343 | \n",
" 1760 | \n",
" 2017 | \n",
"
\n",
" \n",
" 6 | \n",
" ITD5 | \n",
" Emilia-Romagna | \n",
" 6016 | \n",
" 234 | \n",
" 1644 | \n",
" 7894 | \n",
" 2017 | \n",
"
\n",
" \n",
" 7 | \n",
" ITDA | \n",
" Trentino Alto Adige / Südtirol | \n",
" 1776 | \n",
" 50 | \n",
" 326 | \n",
" 2152 | \n",
" 2017 | \n",
"
\n",
" \n",
" 8 | \n",
" ITE1 | \n",
" Toscana | \n",
" 4392 | \n",
" 221 | \n",
" 1310 | \n",
" 5923 | \n",
" 2017 | \n",
"
\n",
" \n",
" 9 | \n",
" ITE2 | \n",
" Umbria | \n",
" 1056 | \n",
" 41 | \n",
" 275 | \n",
" 1372 | \n",
" 2017 | \n",
"
\n",
" \n",
" 10 | \n",
" ITE3 | \n",
" Marche | \n",
" 1846 | \n",
" 96 | \n",
" 383 | \n",
" 2325 | \n",
" 2017 | \n",
"
\n",
" \n",
" 11 | \n",
" ITE4 | \n",
" Lazio | \n",
" 6770 | \n",
" 440 | \n",
" 2189 | \n",
" 9399 | \n",
" 2017 | \n",
"
\n",
" \n",
" 12 | \n",
" ITF1 | \n",
" Abruzzo | \n",
" 1698 | \n",
" 61 | \n",
" 368 | \n",
" 2127 | \n",
" 2017 | \n",
"
\n",
" \n",
" 13 | \n",
" ITF2 | \n",
" Molise | \n",
" 368 | \n",
" 20 | \n",
" 102 | \n",
" 490 | \n",
" 2017 | \n",
"
\n",
" \n",
" 14 | \n",
" ITF3 | \n",
" Campania | \n",
" 12216 | \n",
" 306 | \n",
" 1736 | \n",
" 14258 | \n",
" 2017 | \n",
"
\n",
" \n",
" 15 | \n",
" ITF4 | \n",
" Puglia | \n",
" 6508 | \n",
" 420 | \n",
" 1794 | \n",
" 8722 | \n",
" 2017 | \n",
"
\n",
" \n",
" 16 | \n",
" ITF5 | \n",
" Basilicata | \n",
" 598 | \n",
" 42 | \n",
" 124 | \n",
" 764 | \n",
" 2017 | \n",
"
\n",
" \n",
" 17 | \n",
" ITF6 | \n",
" Calabria | \n",
" 3442 | \n",
" 154 | \n",
" 449 | \n",
" 4045 | \n",
" 2017 | \n",
"
\n",
" \n",
" 18 | \n",
" ITG1 | \n",
" Sicilia | \n",
" 11906 | \n",
" 553 | \n",
" 1555 | \n",
" 14014 | \n",
" 2017 | \n",
"
\n",
" \n",
" 19 | \n",
" ITG2 | \n",
" Sardegna | \n",
" 1730 | \n",
" 80 | \n",
" 511 | \n",
" 2321 | \n",
" 2017 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ITTER107 Region Live_births Miscarriages \\\n",
"0 ITC1 Piemonte 5448 309 \n",
"1 ITC2 Valle d'Aosta / Vallée d'Aoste 142 9 \n",
"2 ITC3 Liguria 1794 92 \n",
"3 ITC4 Lombardia 13468 599 \n",
"4 ITD3 Veneto 5944 287 \n",
"5 ITD4 Friuli-Venezia Giulia 1342 75 \n",
"6 ITD5 Emilia-Romagna 6016 234 \n",
"7 ITDA Trentino Alto Adige / Südtirol 1776 50 \n",
"8 ITE1 Toscana 4392 221 \n",
"9 ITE2 Umbria 1056 41 \n",
"10 ITE3 Marche 1846 96 \n",
"11 ITE4 Lazio 6770 440 \n",
"12 ITF1 Abruzzo 1698 61 \n",
"13 ITF2 Molise 368 20 \n",
"14 ITF3 Campania 12216 306 \n",
"15 ITF4 Puglia 6508 420 \n",
"16 ITF5 Basilicata 598 42 \n",
"17 ITF6 Calabria 3442 154 \n",
"18 ITG1 Sicilia 11906 553 \n",
"19 ITG2 Sardegna 1730 80 \n",
"\n",
" Abortions Total Time \n",
"0 1670 7427 2017 \n",
"1 46 197 2017 \n",
"2 664 2550 2017 \n",
"3 3381 17448 2017 \n",
"4 1116 7347 2017 \n",
"5 343 1760 2017 \n",
"6 1644 7894 2017 \n",
"7 326 2152 2017 \n",
"8 1310 5923 2017 \n",
"9 275 1372 2017 \n",
"10 383 2325 2017 \n",
"11 2189 9399 2017 \n",
"12 368 2127 2017 \n",
"13 102 490 2017 \n",
"14 1736 14258 2017 \n",
"15 1794 8722 2017 \n",
"16 124 764 2017 \n",
"17 449 4045 2017 \n",
"18 1555 14014 2017 \n",
"19 511 2321 2017 "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"md2_ass_2017 = read_csv(\"../data/mashupDS/MD2-ASS-2017.csv\")\n",
"md2_ass_2017"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### MD2 - Percentage values\n",
"\n",
"This dataset contains the percentage values of total pregnancies, live births, spontaneous abortions and induced abortions calculated over the total female population aged 15-25.\n",
"\n",
"The following datasets have been used:
\n",
"D1-D2 selected Population cleaned datasets
\n",
"\n",
"* Population2017Selected_clean.csv\n",
"* Population2018Selected_clean.csv\n",
"* Population2019Selected_clean.csv\n",
"\n",
" MD2 with absolute values
\n",
"\n",
"* MD2-ASS-2017.csv\n",
"* MD2-ASS-2018.csv\n",
"* MD2-ASS-2019.csv\n",
"\n",
"\n",
"1. Males values + Total values have been taken out from selected population datasets whereas Female values have been summed and grouped by region\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# SELECTED POPULATION FILES\n",
"sel_pop17 = read_csv(\"../data/cleanDS/Population2017Selected_clean.csv\")\n",
"# only keep \"Females\" column\n",
"sel_pop17 = sel_pop17.drop(sel_pop17[(sel_pop17.Sex == \"Males\") | (sel_pop17.Sex == \"Total\")].index)\n",
"# Put together ages and sum value under column \"Population\"\n",
"sel_pop17 = (sel_pop17.groupby([\"Region code\",\"Region\"])[\"Population\"].sum()).reset_index()\n",
"sel_pop17[\"Time\"] = 2017\n",
"sel_pop17.rename(columns = {\"Region code\" : \"ITTER107\", \"Population\" : \"Females\"}, inplace=True)\n",
"\n",
"def selectedPop(path):\n",
" file = read_csv(path)\n",
" file.drop([\"Males\",\"Population\"],axis=1,inplace=True)\n",
" file = (file.groupby([\"ITTER107\",\"Region\"])[\"Females\"].sum()).reset_index()\n",
" return file \n",
"\n",
"pop18 = \"../data/cleanDS/Population2018Selected_clean.csv\"\n",
"pop19 = \"../data/cleanDS/Population2019Selected_clean.csv\"\n",
"\n",
"sel_pop18 = selectedPop(pop18)\n",
"sel_pop18[\"Time\"] = 2018 # adds the Time column\n",
"sel_pop19 = selectedPop(pop19)\n",
"sel_pop19[\"Time\"] = 2019\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"2. A simple function getPercentage
as been implemented.
It takes a num
numerator and a pop
denominator value and returns the result of the percentage equation."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"def getPercentage(num,pop):\n",
" return num * 100 / pop"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"3. Another function percentages
has been implemented.
It takes in input the dataset DStotal
of the total selected female population and the dataset DSpartial
of the mashup datasets MD2 with pregnancies absolute values.
\n",
"This function transforms absolute values of DSpartial
in percentage values and export the result dataset in csv format"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"# create empty lists to fill with % values\n",
"Total_perc = []\n",
"Live_births_perc = []\n",
"Miscarriages_perc = []\n",
"Abortions_perc = []\n",
"\n",
"def percentages(DStotal, DSpartial):\n",
" # merge total DS with partial DS so it's then possible to iterate over rows\n",
" mergedDF = merge(DStotal, DSpartial, left_on=\"ITTER107\", right_on=\"ITTER107\", how=\"left\")\n",
" mergedDF = mergedDF[[\"ITTER107\", \"Region_x\", \"Females\", \"Live_births\", \"Miscarriages\", \"Abortions\", \"Total\"]]\n",
"\n",
" for idx, row in mergedDF.iterrows():\n",
" year = str(DStotal.at[idx, \"Time\"]) # save year value to add later as a column\n",
"\n",
" # calculate percentages for each value\n",
" p_tot = getPercentage((mergedDF.at[idx, \"Total\"]), (mergedDF.at[idx, \"Females\"]))\n",
" p_lb = getPercentage((mergedDF.at[idx, \"Live_births\"]), (mergedDF.at[idx, \"Females\"]))\n",
" p_m = getPercentage((mergedDF.at[idx, \"Miscarriages\"]), (mergedDF.at[idx, \"Females\"]))\n",
" p_a = getPercentage((mergedDF.at[idx, \"Abortions\"]), (mergedDF.at[idx, \"Females\"]))\n",
"\n",
" Total_perc.append(p_tot)\n",
" Live_births_perc.append(p_lb)\n",
" Miscarriages_perc.append(p_m)\n",
" Abortions_perc.append(p_a)\n",
"\n",
" # change absolute values with percentage values\n",
" DSpartial[\"Live_births\"] = Live_births_perc\n",
" DSpartial[\"Miscarriages\"] = Miscarriages_perc\n",
" DSpartial[\"Abortions\"] = Abortions_perc\n",
" DSpartial[\"Total\"] = Total_perc\n",
" DSpartial[\"Time\"] = year\n",
"\n",
" DSpartial.to_csv(\"../data/mashupDS/MD2-PERC-\" + year + \".csv\", index=False)\n",
" return True\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the visualisation of the 2017 mashup dataset with percentage values below as example.\n",
"All MD2 - Absolute values can be found in ` data > mashupDS `
\n",
"Full code is available at `scripts > MASHUP.py `"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ITTER107 | \n",
" Region | \n",
" Live_births | \n",
" Miscarriages | \n",
" Abortions | \n",
" Total | \n",
" Time | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" ITC1 | \n",
" Piemonte | \n",
" 2.657263 | \n",
" 0.150715 | \n",
" 0.814543 | \n",
" 3.622520 | \n",
" 2017 | \n",
"
\n",
" \n",
" 1 | \n",
" ITC2 | \n",
" Valle d'Aosta / Vallée d'Aoste | \n",
" 2.308943 | \n",
" 0.146341 | \n",
" 0.747967 | \n",
" 3.203252 | \n",
" 2017 | \n",
"
\n",
" \n",
" 2 | \n",
" ITC3 | \n",
" Liguria | \n",
" 2.609758 | \n",
" 0.133834 | \n",
" 0.965931 | \n",
" 3.709523 | \n",
" 2017 | \n",
"
\n",
" \n",
" 3 | \n",
" ITC4 | \n",
" Lombardia | \n",
" 2.733171 | \n",
" 0.121560 | \n",
" 0.686134 | \n",
" 3.540865 | \n",
" 2017 | \n",
"
\n",
" \n",
" 4 | \n",
" ITD3 | \n",
" Veneto | \n",
" 2.409824 | \n",
" 0.116356 | \n",
" 0.452450 | \n",
" 2.978630 | \n",
" 2017 | \n",
"
\n",
" \n",
" 5 | \n",
" ITD4 | \n",
" Friuli-Venezia Giulia | \n",
" 2.431512 | \n",
" 0.135889 | \n",
" 0.621467 | \n",
" 3.188868 | \n",
" 2017 | \n",
"
\n",
" \n",
" 6 | \n",
" ITD5 | \n",
" Emilia-Romagna | \n",
" 2.912133 | \n",
" 0.113271 | \n",
" 0.795802 | \n",
" 3.821206 | \n",
" 2017 | \n",
"
\n",
" \n",
" 7 | \n",
" ITDA | \n",
" Trentino Alto Adige / Südtirol | \n",
" 2.891472 | \n",
" 0.081404 | \n",
" 0.530754 | \n",
" 3.503631 | \n",
" 2017 | \n",
"
\n",
" \n",
" 8 | \n",
" ITE1 | \n",
" Toscana | \n",
" 2.566290 | \n",
" 0.129133 | \n",
" 0.765446 | \n",
" 3.460869 | \n",
" 2017 | \n",
"
\n",
" \n",
" 9 | \n",
" ITE2 | \n",
" Umbria | \n",
" 2.557644 | \n",
" 0.099302 | \n",
" 0.666053 | \n",
" 3.322999 | \n",
" 2017 | \n",
"
\n",
" \n",
" 10 | \n",
" ITE3 | \n",
" Marche | \n",
" 2.472675 | \n",
" 0.128590 | \n",
" 0.513020 | \n",
" 3.114284 | \n",
" 2017 | \n",
"
\n",
" \n",
" 11 | \n",
" ITE4 | \n",
" Lazio | \n",
" 2.372483 | \n",
" 0.154194 | \n",
" 0.767115 | \n",
" 3.293792 | \n",
" 2017 | \n",
"
\n",
" \n",
" 12 | \n",
" ITF1 | \n",
" Abruzzo | \n",
" 2.580312 | \n",
" 0.092697 | \n",
" 0.559220 | \n",
" 3.232228 | \n",
" 2017 | \n",
"
\n",
" \n",
" 13 | \n",
" ITF2 | \n",
" Molise | \n",
" 2.290837 | \n",
" 0.124502 | \n",
" 0.634960 | \n",
" 3.050299 | \n",
" 2017 | \n",
"
\n",
" \n",
" 14 | \n",
" ITF3 | \n",
" Campania | \n",
" 3.333806 | \n",
" 0.083509 | \n",
" 0.473763 | \n",
" 3.891078 | \n",
" 2017 | \n",
"
\n",
" \n",
" 15 | \n",
" ITF4 | \n",
" Puglia | \n",
" 2.800429 | \n",
" 0.180728 | \n",
" 0.771968 | \n",
" 3.753125 | \n",
" 2017 | \n",
"
\n",
" \n",
" 16 | \n",
" ITF5 | \n",
" Basilicata | \n",
" 1.927664 | \n",
" 0.135388 | \n",
" 0.399716 | \n",
" 2.462768 | \n",
" 2017 | \n",
"
\n",
" \n",
" 17 | \n",
" ITF6 | \n",
" Calabria | \n",
" 3.106330 | \n",
" 0.138982 | \n",
" 0.405213 | \n",
" 3.650524 | \n",
" 2017 | \n",
"
\n",
" \n",
" 18 | \n",
" ITG1 | \n",
" Sicilia | \n",
" 4.053410 | \n",
" 0.188269 | \n",
" 0.529401 | \n",
" 4.771081 | \n",
" 2017 | \n",
"
\n",
" \n",
" 19 | \n",
" ITG2 | \n",
" Sardegna | \n",
" 2.258309 | \n",
" 0.104430 | \n",
" 0.667050 | \n",
" 3.029789 | \n",
" 2017 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ITTER107 Region Live_births Miscarriages \\\n",
"0 ITC1 Piemonte 2.657263 0.150715 \n",
"1 ITC2 Valle d'Aosta / Vallée d'Aoste 2.308943 0.146341 \n",
"2 ITC3 Liguria 2.609758 0.133834 \n",
"3 ITC4 Lombardia 2.733171 0.121560 \n",
"4 ITD3 Veneto 2.409824 0.116356 \n",
"5 ITD4 Friuli-Venezia Giulia 2.431512 0.135889 \n",
"6 ITD5 Emilia-Romagna 2.912133 0.113271 \n",
"7 ITDA Trentino Alto Adige / Südtirol 2.891472 0.081404 \n",
"8 ITE1 Toscana 2.566290 0.129133 \n",
"9 ITE2 Umbria 2.557644 0.099302 \n",
"10 ITE3 Marche 2.472675 0.128590 \n",
"11 ITE4 Lazio 2.372483 0.154194 \n",
"12 ITF1 Abruzzo 2.580312 0.092697 \n",
"13 ITF2 Molise 2.290837 0.124502 \n",
"14 ITF3 Campania 3.333806 0.083509 \n",
"15 ITF4 Puglia 2.800429 0.180728 \n",
"16 ITF5 Basilicata 1.927664 0.135388 \n",
"17 ITF6 Calabria 3.106330 0.138982 \n",
"18 ITG1 Sicilia 4.053410 0.188269 \n",
"19 ITG2 Sardegna 2.258309 0.104430 \n",
"\n",
" Abortions Total Time \n",
"0 0.814543 3.622520 2017 \n",
"1 0.747967 3.203252 2017 \n",
"2 0.965931 3.709523 2017 \n",
"3 0.686134 3.540865 2017 \n",
"4 0.452450 2.978630 2017 \n",
"5 0.621467 3.188868 2017 \n",
"6 0.795802 3.821206 2017 \n",
"7 0.530754 3.503631 2017 \n",
"8 0.765446 3.460869 2017 \n",
"9 0.666053 3.322999 2017 \n",
"10 0.513020 3.114284 2017 \n",
"11 0.767115 3.293792 2017 \n",
"12 0.559220 3.232228 2017 \n",
"13 0.634960 3.050299 2017 \n",
"14 0.473763 3.891078 2017 \n",
"15 0.771968 3.753125 2017 \n",
"16 0.399716 2.462768 2017 \n",
"17 0.405213 3.650524 2017 \n",
"18 0.529401 4.771081 2017 \n",
"19 0.667050 3.029789 2017 "
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"md2_perc_2017 = read_csv(\"../data/mashupDS/MD2-PERC-2017.csv\")\n",
"md2_perc_2017"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### MD3 - (Higher) education rates in young women in each region\n",
"As the source dataset D7 did not have, with absolute values, the subdivision in regions (but only had data regarding the whole national territory of Italy), we had to use the \"percentage values\".\n",
"The percentage is calculated on the total population of 18-24 years old in each region, without distinctions in terms of gender.\n",
"\n",
"However, given the assumptions of the project, we had to focus on the female population of this age range and we did so by employing the following proportion:\n",
"* % early leavers : total 18-24 = [% early leavers F] : total 18-24 F\n",
"\n",
"Where:\n",
"1. total 18-24 -> data to be gathered from the general population datasets (D1,D2), filtering them with ages 18-24\n",
"2. total 18-24 F -> data to be gathered from the selected population datasets (D1, D2), filtering them with ages 18-24\n",
"3. % early leavers -> data contained in the source dataset D7\n",
"4. [% early leavers F] -> data to be gathered following the proportion\n",
"\n",
"#### Step 1: Total population 18-24 for each region"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# === A) TOTAL 18-24\n",
"# Get data for ages 18-24 from the general population datasets\n",
"def selectAge(df, age1, age2):\n",
" df = df.drop(df[(df.Age == \"100 and over\") | (df.Age == \"Total\")].index)\n",
" df[\"Age\"] = df[\"Age\"].astype('int')\n",
" young_df = df.loc[(df[\"Age\"] >= age1) & (df[\"Age\"] <= age2)]\n",
" return young_df\n",
"\n",
"\n",
"def young_generalPop(df, age1, age2):\n",
" young_df = selectAge(df, age1, age2)\n",
" young_df = (df.groupby([\"ITTER107\", \"Region\"])\n",
" [\"Population\"].sum()).reset_index()\n",
" return young_df\n",
"\n",
"def young_femalePop(df, age1, age2):\n",
" young_Fdf = selectAge(df, age1, age2)\n",
" young_Fdf = (df.groupby([\"ITTER107\", \"Region\"])\n",
" [\"Females\"].sum()).reset_index()\n",
" return young_Fdf\n",
"\n",
"\n",
"# --- 2017\n",
"pop17 = read_csv(\"data/cleanDS/Population2017General_clean.csv\")\n",
"# Delete gender distinction\n",
"pop17 = pop17.drop(pop17[(pop17.Sex == \"Males\") |\n",
" (pop17.Sex == \"Females\")].index)\n",
"# Fixing the \"Age\" column (changing it to int values) to allow for a better condition for the creation of young_pop17\n",
"young_pop17 = selectAge(pop17, 18, 24)\n",
"young_pop17 = (pop17.groupby([\"Region code\", \"Region\"])[\n",
" \"Population\"].sum()).reset_index()\n",
"# Adding columns for easier identification\n",
"young_pop17[\"Time\"] = 2017\n",
"young_pop17[\"Age range\"] = \"18-24\"\n",
"\n",
"# --- 2018\n",
"pop18 = read_csv(\"data/cleanDS/Population2018General_clean.csv\")\n",
"pop18.drop([\"Males\", \"Females\"], axis=1, inplace=True)\n",
"young_pop18 = young_generalPop(pop18, 18, 24)\n",
"young_pop18[\"Time\"] = 2018\n",
"young_pop18[\"Age range\"] = \"18-24\"\n",
"\n",
"# --- 2019\n",
"pop19 = read_csv(\"data/cleanDS/Population2019General_clean.csv\")\n",
"pop19.drop([\"Males\", \"Females\"], axis=1, inplace=True)\n",
"young_pop19 = young_generalPop(pop19, 18, 24)\n",
"young_pop19[\"Time\"] = 2019\n",
"young_pop19[\"Age range\"] = \"18-24\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Step 2: Total female population 18-24 for each region\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# === B) TOTAL 18-24 F\n",
"# Same as before, just DO NOT delete the gender distinction\n",
"# --- 2017\n",
"f_pop17 = read_csv(\"data/cleanDS/Population2017General_clean.csv\")\n",
"f_pop17 = f_pop17.drop(f_pop17[(f_pop17.Sex == \"Males\") | (f_pop17.Sex == \"Total\")].index)\n",
"# Fixing the \"Age\" column (changing it to int values) to allow for a better condition for the creation of youngF_pop17\n",
"youngF_pop17 = selectAge(f_pop17, 18, 24)\n",
"youngF_pop17 = (f_pop17.groupby([\"Region code\", \"Region\"])[\"Population\"].sum()).reset_index()\n",
"# Adding columns for easier identification\n",
"youngF_pop17[\"Time\"] = 2017\n",
"youngF_pop17[\"Age range\"] = \"18-24\"\n",
"youngF_pop17[\"Gender\"] = \"Females\"\n",
"\n",
"# --- 2018\n",
"f_pop18 = read_csv(\"data/cleanDS/Population2018General_clean.csv\")\n",
"f_pop18.drop([\"Males\", \"Population\"], axis=1, inplace=True)\n",
"youngF_pop18 = young_femalePop(f_pop18, 18, 24)\n",
"youngF_pop18[\"Time\"] = 2018\n",
"youngF_pop18[\"Age range\"] = \"18-24\"\n",
"\n",
"# --- 2019\n",
"f_pop19 = read_csv(\"data/cleanDS/Population2019General_clean.csv\")\n",
"youngF_pop19 = young_femalePop(f_pop19, 18, 24)\n",
"youngF_pop19[\"Time\"] = 2019\n",
"youngF_pop19[\"Age range\"] = \"18-24\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Step 3: Percentage of early leavers for each region"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"d7_clean = read_csv(\"../data/cleanDS/D7_clean.csv\")\n",
"d7_2017 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2018) | (d7_clean.TIME == 2019)].index)\n",
"d7_2018 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2017) | (d7_clean.TIME == 2019)].index)\n",
"d7_2019 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2017) | (d7_clean.TIME == 2018)].index)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Step 4: Percentage of female early leavers for each region\n",
"\n",
"(this last step has been commented out after a first run in order to avoid interfeering with the rest of the code/project)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# === C) % EARLY LEAVERS F (PROPORTION)\n",
"# --- 2017\n",
"MD3_2017 = (merge(young_pop17,d7_2017,left_on=\"Region code\",right_on=\"ITTER107\")).drop([\"TIME\",\"Territory\"],axis=1)\n",
"MD3_2017 = (merge(youngF_pop17,MD3_2017,left_on=\"Region code\",right_on=\"ITTER107\",suffixes=('_GENERAL','_FEMALE'))).drop([\"Time_FEMALE\",\"Age range_FEMALE\",\"ITTER107\",\"Region code_FEMALE\",\"Region_FEMALE\"],axis=1)\n",
"for idx,row in MD3_2017.iterrows():\n",
" result = (row[\"Value\"] * row[\"Population_FEMALE\"])/(row[\"Population_GENERAL\"])\n",
" MD3_2017.loc[idx,\"Female Early Leavers\"] = result\n",
"#MD3_2017.to_csv(\"data/mashupDS/MD3_17.csv\")\n",
"\n",
"# row[\"Value\"] : row[\"Population_GENERAL\"] = % early leavers F : row[\"Population_FEMALE\"]\n",
"\n",
"# --- 2018\n",
"d7_2018 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2017) | (d7_clean.TIME == 2019)].index)\n",
"\n",
"MD3_2018 = (merge(young_pop18,d7_2018,left_on=\"ITTER107\",right_on=\"ITTER107\")).drop([\"TIME\",\"Territory\"],axis=1)\n",
"MD3_2018 = (merge(youngF_pop18,MD3_2018,left_on=\"ITTER107\",right_on=\"ITTER107\",suffixes=('_GENERAL','_FEMALE'))).drop([\"Time_FEMALE\",\"Age range_FEMALE\",\"Region_FEMALE\"],axis=1)\n",
"\n",
"for idx,row in MD3_2018.iterrows():\n",
" result = (row[\"Value\"] * row[\"Females\"])/(row[\"Population\"])\n",
" MD3_2018.loc[idx,\"Female Early Leavers\"] = result\n",
"#MD3_2018.to_csv(\"data/mashupDS/MD3_18.csv\")\n",
"\n",
"# --- 2019\n",
"d7_2019 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2017) | (d7_clean.TIME == 2018)].index)\n",
"MD3_2019 = (merge(young_pop19,d7_2019,left_on=\"ITTER107\",right_on=\"ITTER107\")).drop([\"TIME\",\"Territory\"],axis=1)\n",
"MD3_2019 = (merge(youngF_pop19,MD3_2019,left_on=\"ITTER107\",right_on=\"ITTER107\",suffixes=('_GENERAL','_FEMALE'))).drop([\"Time_FEMALE\",\"Age range_FEMALE\",\"Region_FEMALE\"],axis=1)\n",
"for idx,row in MD3_2019.iterrows():\n",
" result = (row[\"Value\"] * row[\"Females\"])/(row[\"Population\"])\n",
" MD3_2019.loc[idx,\"Female Early Leavers\"] = result\n",
"#MD3_2019.to_csv(\"data/mashupDS/MD3_19.csv\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0 (tags/v3.10.0:b494f59, Oct 4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "9e178ac2d01006df9e74389e28c68e3ef7cb95acb5e5b42727cfb2189b57440d"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}