{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Mashup datasets\n", "\n", "This Jupyter Notebook analyses the mashup datasets for [\"Blessed Be the Fruit\"](https://github.com/OrsolaMBorrini/blessedfruit), an open data project regarding the analysis of factors that might influence pregnancy rates in young women in Italy.\n", "\n", "From the starting 7 source datasets, three different mashup datasets have been created:\n", "Id | Dataset | Description (factor of interest) | Original source datasets\n", "--- | --- | --- | --- \n", "MD1 | Religious observance in each region | RELIGION - % of religious observance in each region (over the total population) | D1, D2, D3 \n", "MD2 | Pregnancy rates in young women in each region | PREGNANCY - % of pregnancies in young women (15-25) in each region (over the total population of young women aged 15-25) | D4, D5, D6\n", "MD3 | (Higher) education rates in young women in each region | EDUCATION - % of women early leavers (18-24) in each region (over the total population) | D1, D2, D7\n", "\n", "### Disclaimer\n", "This Jupyter Notebook is of informational nature only, it is not thought to be used for the clean up process for the datasets, but only for their analysis and the explanation of the process. The code actually used for the clean up can be found in `scripts > MASHUP.py`.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Mashup phase\n", "### MD1 - Religious observance in each region\n", "To create a dataset showing the percentage of religious observance in each region, we need to compare the total general population of each region (with no distinction of gender) to the religious observance (in thousands value) in each region. To get this result we will use the following source datasets:\n", "* D1 and D2 -> for the data regarding total general population in each region for each of the three years of interest (`data > cleanDS > Population2017_Generalclean.csv`, `Population2018_Generalclean.csv`, `Population2019_Generalclean.csv`)\n", "* D3 -> for the data regarding religious observance (thousands value) in each region (`data > cleanDS > D3_clean.csv`)\n", "\n", "First of all the datasets regarding general population were stripped of any distinction between genders and the various ages were merged together. Additional columns to keep track of the time were added accordingly." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ITTER107RegionPopulationTime
0ITC1Piemonte86224342019
1ITC2Valle d'Aosta / Vallée d'Aoste2500682019
2ITC3Liguria30496522019
3ITC4Lombardia200552042019
4ITD3Veneto97582662019
5ITD4Friuli-Venezia Giulia24124322019
6ITD5Emilia-Romagna89282382019
7ITDATrentino Alto Adige / Südtirol21561382019
8ITE1Toscana73851102019
9ITE2Umbria17403302019
10ITE3Marche30253442019
11ITE4Lazio115114002019
12ITF1Abruzzo25878822019
13ITF2Molise6010322019
14ITF3Campania114242862019
15ITF4Puglia79066102019
16ITF5Basilicata11065082019
17ITF6Calabria37882202019
18ITG1Sicilia97505802019
19ITG2Sardegna32232422019
\n", "
" ], "text/plain": [ " ITTER107 Region Population Time\n", "0 ITC1 Piemonte 8622434 2019\n", "1 ITC2 Valle d'Aosta / Vallée d'Aoste 250068 2019\n", "2 ITC3 Liguria 3049652 2019\n", "3 ITC4 Lombardia 20055204 2019\n", "4 ITD3 Veneto 9758266 2019\n", "5 ITD4 Friuli-Venezia Giulia 2412432 2019\n", "6 ITD5 Emilia-Romagna 8928238 2019\n", "7 ITDA Trentino Alto Adige / Südtirol 2156138 2019\n", "8 ITE1 Toscana 7385110 2019\n", "9 ITE2 Umbria 1740330 2019\n", "10 ITE3 Marche 3025344 2019\n", "11 ITE4 Lazio 11511400 2019\n", "12 ITF1 Abruzzo 2587882 2019\n", "13 ITF2 Molise 601032 2019\n", "14 ITF3 Campania 11424286 2019\n", "15 ITF4 Puglia 7906610 2019\n", "16 ITF5 Basilicata 1106508 2019\n", "17 ITF6 Calabria 3788220 2019\n", "18 ITG1 Sicilia 9750580 2019\n", "19 ITG2 Sardegna 3223242 2019" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import csv\n", "from pandas import *\n", "import functools as ft\n", "\n", "# ==== MD1 ====\n", "\n", "# GENERAL POPULATION FILES\n", "pop17 = read_csv(\"../data/cleanDS/Population2017General_clean.csv\")\n", "pop17 = pop17.drop(pop17[(pop17.Sex == \"Males\") | (pop17.Sex == \"Females\")].index)\n", "gen_pop17 = (pop17.groupby([\"Region code\",\"Region\"])[\"Population\"].sum()).reset_index()\n", "gen_pop17[\"Time\"] = 2017\n", "\n", "def generalPop(path):\n", " file = read_csv(path)\n", " file.drop([\"Males\",\"Females\"],axis=1,inplace=True)\n", " file = (file.groupby([\"ITTER107\",\"Region\"])[\"Population\"].sum()).reset_index()\n", " return file\n", "\n", "pop18 = \"../data/cleanDS/Population2018General_clean.csv\"\n", "pop19 = \"../data/cleanDS/Population2019General_clean.csv\"\n", "\n", "gen_pop18 = generalPop(pop18)\n", "gen_pop18[\"Time\"] = 2018\n", "gen_pop19 = generalPop(pop19)\n", "gen_pop19[\"Time\"] = 2019\n", "\n", "# Example\n", "gen_pop19" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Two ancillary functions were created to first get the percentage values from the thousands values that we had in the tables (function `getPercentage`) and to add the column with the percentage value in the specific dataframe (`addPercentageColumn`)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "d3_clean = read_csv(\"../data/cleanDS/D3_clean.csv\")\n", "\n", "def getPercentage(num,pop):\n", " return num * 100 / pop\n", "\n", "def addPercentageColumn(yearlyDF):\n", " for idx,row in yearlyDF.iterrows():\n", " perc = getPercentage((row[\"Value\"]*1000),row[\"Population\"])\n", " yearlyDF.loc[idx,\"Percentage\"] = perc\n", " return yearlyDF" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As the process is very similar with regards for each year, we have created one single function to create (and then save as a csv file) all three mashup datasets: `createYearlyDF`.\n", "\n", "This function has two input parameters:\n", "1. `genPopList` -> a list of the general population dataframes\n", "2. `rightDF` -> the dataframe regarding religious observance (containing data for all three years, from 2017 to 2019)\n", "\n", "For each year (which is obtained by checking the first value under the column \"Time\" of whatever general population dataframe), we cleaned up the religious observance dataframe, keeping only the data needed for that specific year (see: if-elif structure in the code below + drop instruction).\n", "\n", "Then, we have merged the two input dataframes together using the unique code region and we subsequently dropped other unnecessary and repeated columns (namely \"Territory\",\"TIME\"...).\n", "\n", "Finally, we called the ancillary function to add a \"Percentage\" column to the resulting dataframe and we have saved it as a new csv file (this last step has been commented out after a first run in order to avoid interfeering with the rest of the code/project).\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def createYearlyDF(genPopList,rightDF):\n", " for item in genPopList:\n", " if item[\"Time\"][0] == 2017:\n", " d3_2017 = rightDF.drop(rightDF[(rightDF.TIME == 2018) | (rightDF.TIME == 2019)].index)\n", " MD1_17 = (merge(item,d3_2017,left_on=\"Region code\",right_on=\"ITTER107\")).drop([\"Territory\",\"ITTER107\",\"TIME\"],axis=1)\n", " addPercentageColumn(MD1_17)\n", " #MD1_17.to_csv(\"data/mashupDS/MD1_17.csv\")\n", " elif item[\"Time\"][0] == 2018:\n", " d3_2018 = rightDF.drop(rightDF[(rightDF.TIME == 2017) | (rightDF.TIME == 2019)].index)\n", " MD1_18 = (merge(item,d3_2018,left_on=\"ITTER107\",right_on=\"ITTER107\")).drop([\"Territory\",\"TIME\"],axis=1)\n", " addPercentageColumn(MD1_18)\n", " #MD1_18.to_csv(\"data/mashupDS/MD1_18.csv\")\n", " elif item[\"Time\"][0] == 2019:\n", " d3_2019 = rightDF.drop(rightDF[(rightDF.TIME == 2017) | (rightDF.TIME == 2018)].index)\n", " MD1_19 = (merge(item,d3_2019,left_on=\"ITTER107\",right_on=\"ITTER107\")).drop([\"Territory\",\"TIME\"],axis=1)\n", " addPercentageColumn(MD1_19)\n", " #MD1_19.to_csv(\"data/mashupDS/MD1_19.csv\")\n", " return \n", "\n", "createYearlyDF([gen_pop17,gen_pop18,gen_pop19],d3_clean)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In the end, we have three mashup datasets (one for each year in our scope) regarding religious observance rates in each region in Italy." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0Region codeRegionPopulationTimeTIPO_DATO_AVQValuePercentage
00ITC1Piemonte434991120176_NEVER_RELIG110525.402819
11ITC1Piemonte434991120176_WEEK_RELIG100723.149899
22ITC2Valle D'Aosta/Vallée D'Aoste12621320176_NEVER_RELIG3729.315522
33ITC2Valle D'Aosta/Vallée D'Aoste12621320176_WEEK_RELIG2217.430851
44ITC3Liguria154154120176_WEEK_RELIG25216.347278
55ITC3Liguria154154120176_NEVER_RELIG53634.770402
66ITC4Lombardia998696220176_WEEK_RELIG253925.423147
77ITC4Lombardia998696220176_NEVER_RELIG258625.893760
88ITD3Veneto488093620176_WEEK_RELIG130226.675211
99ITD3Veneto488093620176_NEVER_RELIG96419.750310
1010ITD4Friuli-Venezia Giulia121115520176_NEVER_RELIG35329.145733
1111ITD4Friuli-Venezia Giulia121115520176_WEEK_RELIG24320.063493
1212ITD5Emilia-Romagna444592020176_NEVER_RELIG152034.188649
1313ITD5Emilia-Romagna444592020176_WEEK_RELIG83418.758772
1414ITDATrentino-Alto Adige/Südtirol106873820176_WEEK_RELIG26724.982737
1515ITDATrentino-Alto Adige/Südtirol106873820176_NEVER_RELIG21420.023617
1616ITE1Toscana371204820176_WEEK_RELIG68218.372607
1717ITE1Toscana371204820176_NEVER_RELIG125533.808830
1818ITE2Umbria87647720176_WEEK_RELIG17720.194483
1919ITE2Umbria87647720176_NEVER_RELIG20823.731370
2020ITE3Marche152633120176_NEVER_RELIG31820.834275
2121ITE3Marche152633120176_WEEK_RELIG40426.468702
2222ITE4Lazio577460620176_WEEK_RELIG138323.949686
2323ITE4Lazio577460620176_NEVER_RELIG148525.716040
2424ITF1Abruzzo130605920176_WEEK_RELIG31023.735528
2525ITF1Abruzzo130605920176_NEVER_RELIG21816.691436
2626ITF2Molise30656420176_WEEK_RELIG9029.357655
2727ITF2Molise30656420176_NEVER_RELIG5116.636004
2828ITF3Campania576288920176_WEEK_RELIG174230.227894
2929ITF3Campania576288920176_NEVER_RELIG78013.534878
3030ITF4Puglia400096620176_NEVER_RELIG62415.596234
3131ITF4Puglia400096620176_WEEK_RELIG131832.942044
3232ITF5Basilicata56296820176_WEEK_RELIG16429.131318
3333ITF5Basilicata56296820176_NEVER_RELIG8014.210399
3434ITF6Calabria192425720176_NEVER_RELIG26313.667613
3535ITF6Calabria192425720176_WEEK_RELIG62232.324165
3636ITG1Sicilia494218820176_NEVER_RELIG82116.612075
3737ITG1Sicilia494218820176_WEEK_RELIG149230.189058
3838ITG2Sardegna163104020176_WEEK_RELIG38223.420640
3939ITG2Sardegna163104020176_NEVER_RELIG39824.401609
\n", "
" ], "text/plain": [ " Unnamed: 0 Region code Region Population Time \\\n", "0 0 ITC1 Piemonte 4349911 2017 \n", "1 1 ITC1 Piemonte 4349911 2017 \n", "2 2 ITC2 Valle D'Aosta/Vallée D'Aoste 126213 2017 \n", "3 3 ITC2 Valle D'Aosta/Vallée D'Aoste 126213 2017 \n", "4 4 ITC3 Liguria 1541541 2017 \n", "5 5 ITC3 Liguria 1541541 2017 \n", "6 6 ITC4 Lombardia 9986962 2017 \n", "7 7 ITC4 Lombardia 9986962 2017 \n", "8 8 ITD3 Veneto 4880936 2017 \n", "9 9 ITD3 Veneto 4880936 2017 \n", "10 10 ITD4 Friuli-Venezia Giulia 1211155 2017 \n", "11 11 ITD4 Friuli-Venezia Giulia 1211155 2017 \n", "12 12 ITD5 Emilia-Romagna 4445920 2017 \n", "13 13 ITD5 Emilia-Romagna 4445920 2017 \n", "14 14 ITDA Trentino-Alto Adige/Südtirol 1068738 2017 \n", "15 15 ITDA Trentino-Alto Adige/Südtirol 1068738 2017 \n", "16 16 ITE1 Toscana 3712048 2017 \n", "17 17 ITE1 Toscana 3712048 2017 \n", "18 18 ITE2 Umbria 876477 2017 \n", "19 19 ITE2 Umbria 876477 2017 \n", "20 20 ITE3 Marche 1526331 2017 \n", "21 21 ITE3 Marche 1526331 2017 \n", "22 22 ITE4 Lazio 5774606 2017 \n", "23 23 ITE4 Lazio 5774606 2017 \n", "24 24 ITF1 Abruzzo 1306059 2017 \n", "25 25 ITF1 Abruzzo 1306059 2017 \n", "26 26 ITF2 Molise 306564 2017 \n", "27 27 ITF2 Molise 306564 2017 \n", "28 28 ITF3 Campania 5762889 2017 \n", "29 29 ITF3 Campania 5762889 2017 \n", "30 30 ITF4 Puglia 4000966 2017 \n", "31 31 ITF4 Puglia 4000966 2017 \n", "32 32 ITF5 Basilicata 562968 2017 \n", "33 33 ITF5 Basilicata 562968 2017 \n", "34 34 ITF6 Calabria 1924257 2017 \n", "35 35 ITF6 Calabria 1924257 2017 \n", "36 36 ITG1 Sicilia 4942188 2017 \n", "37 37 ITG1 Sicilia 4942188 2017 \n", "38 38 ITG2 Sardegna 1631040 2017 \n", "39 39 ITG2 Sardegna 1631040 2017 \n", "\n", " TIPO_DATO_AVQ Value Percentage \n", "0 6_NEVER_RELIG 1105 25.402819 \n", "1 6_WEEK_RELIG 1007 23.149899 \n", "2 6_NEVER_RELIG 37 29.315522 \n", "3 6_WEEK_RELIG 22 17.430851 \n", "4 6_WEEK_RELIG 252 16.347278 \n", "5 6_NEVER_RELIG 536 34.770402 \n", "6 6_WEEK_RELIG 2539 25.423147 \n", "7 6_NEVER_RELIG 2586 25.893760 \n", "8 6_WEEK_RELIG 1302 26.675211 \n", "9 6_NEVER_RELIG 964 19.750310 \n", "10 6_NEVER_RELIG 353 29.145733 \n", "11 6_WEEK_RELIG 243 20.063493 \n", "12 6_NEVER_RELIG 1520 34.188649 \n", "13 6_WEEK_RELIG 834 18.758772 \n", "14 6_WEEK_RELIG 267 24.982737 \n", "15 6_NEVER_RELIG 214 20.023617 \n", "16 6_WEEK_RELIG 682 18.372607 \n", "17 6_NEVER_RELIG 1255 33.808830 \n", "18 6_WEEK_RELIG 177 20.194483 \n", "19 6_NEVER_RELIG 208 23.731370 \n", "20 6_NEVER_RELIG 318 20.834275 \n", "21 6_WEEK_RELIG 404 26.468702 \n", "22 6_WEEK_RELIG 1383 23.949686 \n", "23 6_NEVER_RELIG 1485 25.716040 \n", "24 6_WEEK_RELIG 310 23.735528 \n", "25 6_NEVER_RELIG 218 16.691436 \n", "26 6_WEEK_RELIG 90 29.357655 \n", "27 6_NEVER_RELIG 51 16.636004 \n", "28 6_WEEK_RELIG 1742 30.227894 \n", "29 6_NEVER_RELIG 780 13.534878 \n", "30 6_NEVER_RELIG 624 15.596234 \n", "31 6_WEEK_RELIG 1318 32.942044 \n", "32 6_WEEK_RELIG 164 29.131318 \n", "33 6_NEVER_RELIG 80 14.210399 \n", "34 6_NEVER_RELIG 263 13.667613 \n", "35 6_WEEK_RELIG 622 32.324165 \n", "36 6_NEVER_RELIG 821 16.612075 \n", "37 6_WEEK_RELIG 1492 30.189058 \n", "38 6_WEEK_RELIG 382 23.420640 \n", "39 6_NEVER_RELIG 398 24.401609 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "md1_17 = read_csv(\"../data/mashupDS/MD1_17.csv\")\n", "md1_17" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### MD2 - Pregnancy rates in young women in each region\n", "\n", "#### MD2 - Absolute values\n", "\n", "This mashup dataset contains the total number of pregnancies in young women aged 15-25, divided per region. A dataset for each year (2017, 2018, 2019) has been created.\n", "\n", "The following source datasets have been used:\n", "* D4 - \"Mother - Age and Citizenship\"\n", "* D5 - \"Spontaneous abortions - resignation from the place of the event: Age of women - prov.\"\n", "* D6 - \"Induced abortions - Migration: Events by region of residence of the woman and region of intervention\"\n", "\n", "The result dataset contains the following columns:\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ITTER107RegionLive_birthsMiscarriagesAbortionsTotalTime
\n", "\n", "Total contains the data retreived by summing values of live_births, spontaneous abortions and induced abortions in D4, D5, D6.\n", "\n", "Disclaimer\n", "\n", "* Although our selected age group would be 15-25 years old, unfortunately, these istat datasets were organized in age classes of 15-24, 25-29. For this reason we have decided to keep only age class 15-24 since, including pregnancies data for women closer to 30 years old, hence closer to the average age of pregnancy in women, would have falsified our data too much\n", "\n", "* Dataset 5 does not have values for a class age of 15 -24 as the other two datasets have, instead, it gives values of live births aggregated for a class of women until 17 years old. However, we considered valid to aggregate numbers from this dataset without further distinction of age, since values for pregnancies under 15 years old are very close to 0, hence irrelevant.\n", "\n", "\n", "For all datasets (D4, D5 and D6), the same process has been adopted. For the sake of documentation we will use D4 as an example.\n", "\n", "1. Age groups have been aggregated, cleaning eventual unwanted values from source datasets " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "pathD4 = \"../data/srcDS/D4Pregnancy/cleanedDS/cleanedD4-2019-Pregnancy.csv\"\n", "live_births = read_csv(pathD4, keep_default_na=False,\n", " dtype= {\n", " \"RESIDENCE_TERR\":\"string\",\n", " \"CITIZENSHIP_MOTHER\": \"string\",\n", " \"MOTHER_AGE\" :\"string\",\n", " \"OBS_VALUE\" : \"int64\"\n", " })\n", "\n", "live_births = live_births.query(\"MOTHER_AGE != 'Y25'\") # create a df without data for Y25\n", "live_births = live_births.query(\"RESIDENCE_TERR != 'ITD1'\")\n", "live_births = live_births.query(\"RESIDENCE_TERR != 'ITD2'\") \n", "# take out the column with mother_age so I can sum all of the values per region\n", "live_births = live_births[[\"RESIDENCE_TERR\", \"OBS_VALUE\"]] \n", "live_births = live_births.groupby(\"RESIDENCE_TERR\", as_index=False).sum()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "2. Regional code and Region name have been added
\n", "
In D4, that only had regional codes:
" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "ID_name_dict = { \"ITC1\" : \"Piemonte\",\n", " \"ITC2\" : \"Valle d'Aosta / Vallée d'Aoste\",\n", " \"ITC3\" : \"Liguria\",\n", " \"ITC4\" : \"Lombardia\",\n", " \"ITDA\" : \"Trentino Alto Adige / Südtirol\",\n", " \"ITD3\" : \"Veneto\",\n", " \"ITD4\" : \"Friuli-Venezia Giulia\",\n", " \"ITD5\" : \"Emilia-Romagna\",\n", " \"ITE1\" : \"Toscana\",\n", " \"ITE2\" : \"Umbria\",\n", " \"ITE3\" : \"Marche\",\n", " \"ITE4\" : \"Lazio\",\n", " \"ITF1\" : \"Abruzzo\",\n", " \"ITF2\" : \"Molise\",\n", " \"ITF3\" : \"Campania\",\n", " \"ITF4\" : \"Puglia\",\n", " \"ITF5\" : \"Basilicata\",\n", " \"ITF6\" : \"Calabria\",\n", " \"ITG1\" : \"Sicilia\",\n", " \"ITG2\" : \"Sardegna\"\n", " }\n", "\n", "\n", "i = 0\n", "\n", "# add NL Region name to dataset\n", "\n", "region_name = []\n", "\n", "for idx, row in live_births.iterrows():\n", " region_name.append(ID_name_dict[live_births.at[idx, \"RESIDENCE_TERR\"]])\n", "\n", "live_births[\"Region\"] = region_name #maybe try inserting at idx 1 ??\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "
2.2. In D5 and D5, that only had Region name:
" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "pathD5 = \"../data/srcDS/D5Pregnancy/cleanedDS/cleanedD5-2019-Pregnancy.csv\"\n", "miscarriages = read_csv(pathD5, keep_default_na=False,\n", " dtype= {\n", " \"Territorio\":\"string\",\n", " \"Classe di età\": \"string\",\n", " \"Value\" : \"int64\"\n", " })\n", "\n", "pathD6 = \"../data/srcDS/D6Pregnancy/cleanedDS/cleanedD6-2019-Pregnancy.csv\"\n", "abortions = read_csv(pathD6, keep_default_na=False,\n", " dtype= {\n", " \"Territorio dell'evento\":\"string\",\n", " \"Età e classe di età\": \"string\",\n", " \"Value\" : \"int64\"\n", " })\n", " \n", "region_code = []\n", "region_name = []\n", "\n", "for key, value in ID_name_dict.items():\n", " region_name.append(value)\n", " region_code.append(key)\n", "\n", "DF_ID_name = DataFrame({\"ITTER107\" : Series(region_code, dtype=\"string\"), \"Region\": Series(region_name, dtype=\"string\")})\n", "\n", "miscarriages = merge(DF_ID_name, miscarriages, left_on=\"Region\", right_on=\"Territorio\")\n", "miscarriages.drop([\"Territorio\"], axis=1, inplace=True)\n", "\n", "abortions = merge(DF_ID_name, abortions, left_on=\"Region\", right_on=\"Territorio dell'evento\")\n", "abortions.drop([\"Territorio dell'evento\"], axis=1, inplace=True)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "3. If needed columns label have been renamed" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "live_births.rename(columns = {\"RESIDENCE_TERR\" : \"ITTER107\", \"OBS_VALUE\" : \"Live_births\"}, inplace=True)\n", "miscarriages.rename(columns = {\"Value\" : \"Miscarriages\"}, inplace=True)\n", "abortions.rename(columns = { \"Value\" : \"Abortions\"}, inplace=True)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "4. A column Time specifying the year of the dataset has been added.

\n", "For this purpose, a function addyear has been implemented. It takes in imput pathyear the path of the dataset of interest, and df the Dataframe object of the dataset of interest.
\n", "If Time column is not already in the dataset, the function adds it, etracting the correct year from the name of the dataset found in the path. It returns yr the year of the dataset as a string\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "import re\n", "def addyear(pathyear, df):\n", " if \"Time\" not in df.columns[0]:\n", " year2017 = re.search(\"2017\", pathyear)\n", " year2018 = re.search(\"2018\", pathyear)\n", " year2019 = re.search(\"2019\", pathyear)\n", "\n", " years = [year2017, year2018, year2019]\n", " for el in years:\n", " if el is not None:\n", " yr = el.group()\n", " df[\"Time\"] = yr\n", " return yr\n", " return False\n", "\n", "yr = addyear(pathD4, live_births)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "5. Finally, the resulting datasets are merged together based on Region's value and the total pregnancies are calculated summing each DS' value. They are exported as csv to create the final mashup dataset MD2." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "PregnancyDS = merge(live_births, miscarriages, left_on= \"ITTER107\", right_on=\"ITTER107\")\n", "PregnancyDS = merge(PregnancyDS, abortions, left_on= \"ITTER107\", right_on=\"ITTER107\", how=\"left\") #how=\"left\" is because I don'7 have lazio's value rn\n", "\n", "total = PregnancyDS.sum(axis=1, numeric_only=True) # sums all pregnancies value per region\n", "PregnancyDS[\"Total\"] = total\n", "PregnancyDS = PregnancyDS[[\"ITTER107\", \"Region\", \"Live_births\", \"Miscarriages\", \"Abortions\", \"Total\", \"Time\"]]\n", "PregnancyDS.to_csv(\"../data/mashupDS/MD2-ASS-\" + yr + \".csv\", index=False)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Find the visualisation of the 2017 mashup dataset with absolute values below as example.\n", "All MD2 - Absolute values can be found in ` data > mashupDS `

\n", "Full code is available at `scripts > MASHUP.py `" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ITTER107RegionLive_birthsMiscarriagesAbortionsTotalTime
0ITC1Piemonte5448309167074272017
1ITC2Valle d'Aosta / Vallée d'Aoste1429461972017
2ITC3Liguria17949266425502017
3ITC4Lombardia134685993381174482017
4ITD3Veneto5944287111673472017
5ITD4Friuli-Venezia Giulia13427534317602017
6ITD5Emilia-Romagna6016234164478942017
7ITDATrentino Alto Adige / Südtirol17765032621522017
8ITE1Toscana4392221131059232017
9ITE2Umbria10564127513722017
10ITE3Marche18469638323252017
11ITE4Lazio6770440218993992017
12ITF1Abruzzo16986136821272017
13ITF2Molise368201024902017
14ITF3Campania122163061736142582017
15ITF4Puglia6508420179487222017
16ITF5Basilicata598421247642017
17ITF6Calabria344215444940452017
18ITG1Sicilia119065531555140142017
19ITG2Sardegna17308051123212017
\n", "
" ], "text/plain": [ " ITTER107 Region Live_births Miscarriages \\\n", "0 ITC1 Piemonte 5448 309 \n", "1 ITC2 Valle d'Aosta / Vallée d'Aoste 142 9 \n", "2 ITC3 Liguria 1794 92 \n", "3 ITC4 Lombardia 13468 599 \n", "4 ITD3 Veneto 5944 287 \n", "5 ITD4 Friuli-Venezia Giulia 1342 75 \n", "6 ITD5 Emilia-Romagna 6016 234 \n", "7 ITDA Trentino Alto Adige / Südtirol 1776 50 \n", "8 ITE1 Toscana 4392 221 \n", "9 ITE2 Umbria 1056 41 \n", "10 ITE3 Marche 1846 96 \n", "11 ITE4 Lazio 6770 440 \n", "12 ITF1 Abruzzo 1698 61 \n", "13 ITF2 Molise 368 20 \n", "14 ITF3 Campania 12216 306 \n", "15 ITF4 Puglia 6508 420 \n", "16 ITF5 Basilicata 598 42 \n", "17 ITF6 Calabria 3442 154 \n", "18 ITG1 Sicilia 11906 553 \n", "19 ITG2 Sardegna 1730 80 \n", "\n", " Abortions Total Time \n", "0 1670 7427 2017 \n", "1 46 197 2017 \n", "2 664 2550 2017 \n", "3 3381 17448 2017 \n", "4 1116 7347 2017 \n", "5 343 1760 2017 \n", "6 1644 7894 2017 \n", "7 326 2152 2017 \n", "8 1310 5923 2017 \n", "9 275 1372 2017 \n", "10 383 2325 2017 \n", "11 2189 9399 2017 \n", "12 368 2127 2017 \n", "13 102 490 2017 \n", "14 1736 14258 2017 \n", "15 1794 8722 2017 \n", "16 124 764 2017 \n", "17 449 4045 2017 \n", "18 1555 14014 2017 \n", "19 511 2321 2017 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "md2_ass_2017 = read_csv(\"../data/mashupDS/MD2-ASS-2017.csv\")\n", "md2_ass_2017" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### MD2 - Percentage values\n", "\n", "This dataset contains the percentage values of total pregnancies, live births, spontaneous abortions and induced abortions calculated over the total female population aged 15-25.\n", "\n", "The following datasets have been used:

\n", "D1-D2 selected Population cleaned datasets
\n", "\n", "* Population2017Selected_clean.csv\n", "* Population2018Selected_clean.csv\n", "* Population2019Selected_clean.csv\n", "\n", " MD2 with absolute values
\n", "\n", "* MD2-ASS-2017.csv\n", "* MD2-ASS-2018.csv\n", "* MD2-ASS-2019.csv\n", "\n", "\n", "1. Males values + Total values have been taken out from selected population datasets whereas Female values have been summed and grouped by region\n" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "# SELECTED POPULATION FILES\n", "sel_pop17 = read_csv(\"../data/cleanDS/Population2017Selected_clean.csv\")\n", "# only keep \"Females\" column\n", "sel_pop17 = sel_pop17.drop(sel_pop17[(sel_pop17.Sex == \"Males\") | (sel_pop17.Sex == \"Total\")].index)\n", "# Put together ages and sum value under column \"Population\"\n", "sel_pop17 = (sel_pop17.groupby([\"Region code\",\"Region\"])[\"Population\"].sum()).reset_index()\n", "sel_pop17[\"Time\"] = 2017\n", "sel_pop17.rename(columns = {\"Region code\" : \"ITTER107\", \"Population\" : \"Females\"}, inplace=True)\n", "\n", "def selectedPop(path):\n", " file = read_csv(path)\n", " file.drop([\"Males\",\"Population\"],axis=1,inplace=True)\n", " file = (file.groupby([\"ITTER107\",\"Region\"])[\"Females\"].sum()).reset_index()\n", " return file \n", "\n", "pop18 = \"../data/cleanDS/Population2018Selected_clean.csv\"\n", "pop19 = \"../data/cleanDS/Population2019Selected_clean.csv\"\n", "\n", "sel_pop18 = selectedPop(pop18)\n", "sel_pop18[\"Time\"] = 2018 # adds the Time column\n", "sel_pop19 = selectedPop(pop19)\n", "sel_pop19[\"Time\"] = 2019\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "2. A simple function getPercentage as been implemented.

It takes a num numerator and a pop denominator value and returns the result of the percentage equation." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "def getPercentage(num,pop):\n", " return num * 100 / pop" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "3. Another function percentages has been implemented.

It takes in input the dataset DStotal of the total selected female population and the dataset DSpartial of the mashup datasets MD2 with pregnancies absolute values.

\n", "This function transforms absolute values of DSpartial in percentage values and export the result dataset in csv format" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# create empty lists to fill with % values\n", "Total_perc = []\n", "Live_births_perc = []\n", "Miscarriages_perc = []\n", "Abortions_perc = []\n", "\n", "def percentages(DStotal, DSpartial):\n", " # merge total DS with partial DS so it's then possible to iterate over rows\n", " mergedDF = merge(DStotal, DSpartial, left_on=\"ITTER107\", right_on=\"ITTER107\", how=\"left\")\n", " mergedDF = mergedDF[[\"ITTER107\", \"Region_x\", \"Females\", \"Live_births\", \"Miscarriages\", \"Abortions\", \"Total\"]]\n", "\n", " for idx, row in mergedDF.iterrows():\n", " year = str(DStotal.at[idx, \"Time\"]) # save year value to add later as a column\n", "\n", " # calculate percentages for each value\n", " p_tot = getPercentage((mergedDF.at[idx, \"Total\"]), (mergedDF.at[idx, \"Females\"]))\n", " p_lb = getPercentage((mergedDF.at[idx, \"Live_births\"]), (mergedDF.at[idx, \"Females\"]))\n", " p_m = getPercentage((mergedDF.at[idx, \"Miscarriages\"]), (mergedDF.at[idx, \"Females\"]))\n", " p_a = getPercentage((mergedDF.at[idx, \"Abortions\"]), (mergedDF.at[idx, \"Females\"]))\n", "\n", " Total_perc.append(p_tot)\n", " Live_births_perc.append(p_lb)\n", " Miscarriages_perc.append(p_m)\n", " Abortions_perc.append(p_a)\n", "\n", " # change absolute values with percentage values\n", " DSpartial[\"Live_births\"] = Live_births_perc\n", " DSpartial[\"Miscarriages\"] = Miscarriages_perc\n", " DSpartial[\"Abortions\"] = Abortions_perc\n", " DSpartial[\"Total\"] = Total_perc\n", " DSpartial[\"Time\"] = year\n", "\n", " DSpartial.to_csv(\"../data/mashupDS/MD2-PERC-\" + year + \".csv\", index=False)\n", " return True\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Find the visualisation of the 2017 mashup dataset with percentage values below as example.\n", "All MD2 - Absolute values can be found in ` data > mashupDS `

\n", "Full code is available at `scripts > MASHUP.py `" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ITTER107RegionLive_birthsMiscarriagesAbortionsTotalTime
0ITC1Piemonte2.6572630.1507150.8145433.6225202017
1ITC2Valle d'Aosta / Vallée d'Aoste2.3089430.1463410.7479673.2032522017
2ITC3Liguria2.6097580.1338340.9659313.7095232017
3ITC4Lombardia2.7331710.1215600.6861343.5408652017
4ITD3Veneto2.4098240.1163560.4524502.9786302017
5ITD4Friuli-Venezia Giulia2.4315120.1358890.6214673.1888682017
6ITD5Emilia-Romagna2.9121330.1132710.7958023.8212062017
7ITDATrentino Alto Adige / Südtirol2.8914720.0814040.5307543.5036312017
8ITE1Toscana2.5662900.1291330.7654463.4608692017
9ITE2Umbria2.5576440.0993020.6660533.3229992017
10ITE3Marche2.4726750.1285900.5130203.1142842017
11ITE4Lazio2.3724830.1541940.7671153.2937922017
12ITF1Abruzzo2.5803120.0926970.5592203.2322282017
13ITF2Molise2.2908370.1245020.6349603.0502992017
14ITF3Campania3.3338060.0835090.4737633.8910782017
15ITF4Puglia2.8004290.1807280.7719683.7531252017
16ITF5Basilicata1.9276640.1353880.3997162.4627682017
17ITF6Calabria3.1063300.1389820.4052133.6505242017
18ITG1Sicilia4.0534100.1882690.5294014.7710812017
19ITG2Sardegna2.2583090.1044300.6670503.0297892017
\n", "
" ], "text/plain": [ " ITTER107 Region Live_births Miscarriages \\\n", "0 ITC1 Piemonte 2.657263 0.150715 \n", "1 ITC2 Valle d'Aosta / Vallée d'Aoste 2.308943 0.146341 \n", "2 ITC3 Liguria 2.609758 0.133834 \n", "3 ITC4 Lombardia 2.733171 0.121560 \n", "4 ITD3 Veneto 2.409824 0.116356 \n", "5 ITD4 Friuli-Venezia Giulia 2.431512 0.135889 \n", "6 ITD5 Emilia-Romagna 2.912133 0.113271 \n", "7 ITDA Trentino Alto Adige / Südtirol 2.891472 0.081404 \n", "8 ITE1 Toscana 2.566290 0.129133 \n", "9 ITE2 Umbria 2.557644 0.099302 \n", "10 ITE3 Marche 2.472675 0.128590 \n", "11 ITE4 Lazio 2.372483 0.154194 \n", "12 ITF1 Abruzzo 2.580312 0.092697 \n", "13 ITF2 Molise 2.290837 0.124502 \n", "14 ITF3 Campania 3.333806 0.083509 \n", "15 ITF4 Puglia 2.800429 0.180728 \n", "16 ITF5 Basilicata 1.927664 0.135388 \n", "17 ITF6 Calabria 3.106330 0.138982 \n", "18 ITG1 Sicilia 4.053410 0.188269 \n", "19 ITG2 Sardegna 2.258309 0.104430 \n", "\n", " Abortions Total Time \n", "0 0.814543 3.622520 2017 \n", "1 0.747967 3.203252 2017 \n", "2 0.965931 3.709523 2017 \n", "3 0.686134 3.540865 2017 \n", "4 0.452450 2.978630 2017 \n", "5 0.621467 3.188868 2017 \n", "6 0.795802 3.821206 2017 \n", "7 0.530754 3.503631 2017 \n", "8 0.765446 3.460869 2017 \n", "9 0.666053 3.322999 2017 \n", "10 0.513020 3.114284 2017 \n", "11 0.767115 3.293792 2017 \n", "12 0.559220 3.232228 2017 \n", "13 0.634960 3.050299 2017 \n", "14 0.473763 3.891078 2017 \n", "15 0.771968 3.753125 2017 \n", "16 0.399716 2.462768 2017 \n", "17 0.405213 3.650524 2017 \n", "18 0.529401 4.771081 2017 \n", "19 0.667050 3.029789 2017 " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "md2_perc_2017 = read_csv(\"../data/mashupDS/MD2-PERC-2017.csv\")\n", "md2_perc_2017" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### MD3 - (Higher) education rates in young women in each region\n", "As the source dataset D7 did not have, with absolute values, the subdivision in regions (but only had data regarding the whole national territory of Italy), we had to use the \"percentage values\".\n", "The percentage is calculated on the total population of 18-24 years old in each region, without distinctions in terms of gender.\n", "\n", "However, given the assumptions of the project, we had to focus on the female population of this age range and we did so by employing the following proportion:\n", "* % early leavers : total 18-24 = [% early leavers F] : total 18-24 F\n", "\n", "Where:\n", "1. total 18-24 -> data to be gathered from the general population datasets (D1,D2), filtering them with ages 18-24\n", "2. total 18-24 F -> data to be gathered from the selected population datasets (D1, D2), filtering them with ages 18-24\n", "3. % early leavers -> data contained in the source dataset D7\n", "4. [% early leavers F] -> data to be gathered following the proportion\n", "\n", "#### Step 1: Total population 18-24 for each region" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# === A) TOTAL 18-24\n", "# Get data for ages 18-24 from the general population datasets\n", "def selectAge(df, age1, age2):\n", " df = df.drop(df[(df.Age == \"100 and over\") | (df.Age == \"Total\")].index)\n", " df[\"Age\"] = df[\"Age\"].astype('int')\n", " young_df = df.loc[(df[\"Age\"] >= age1) & (df[\"Age\"] <= age2)]\n", " return young_df\n", "\n", "\n", "def young_generalPop(df, age1, age2):\n", " young_df = selectAge(df, age1, age2)\n", " young_df = (df.groupby([\"ITTER107\", \"Region\"])\n", " [\"Population\"].sum()).reset_index()\n", " return young_df\n", "\n", "def young_femalePop(df, age1, age2):\n", " young_Fdf = selectAge(df, age1, age2)\n", " young_Fdf = (df.groupby([\"ITTER107\", \"Region\"])\n", " [\"Females\"].sum()).reset_index()\n", " return young_Fdf\n", "\n", "\n", "# --- 2017\n", "pop17 = read_csv(\"data/cleanDS/Population2017General_clean.csv\")\n", "# Delete gender distinction\n", "pop17 = pop17.drop(pop17[(pop17.Sex == \"Males\") |\n", " (pop17.Sex == \"Females\")].index)\n", "# Fixing the \"Age\" column (changing it to int values) to allow for a better condition for the creation of young_pop17\n", "young_pop17 = selectAge(pop17, 18, 24)\n", "young_pop17 = (pop17.groupby([\"Region code\", \"Region\"])[\n", " \"Population\"].sum()).reset_index()\n", "# Adding columns for easier identification\n", "young_pop17[\"Time\"] = 2017\n", "young_pop17[\"Age range\"] = \"18-24\"\n", "\n", "# --- 2018\n", "pop18 = read_csv(\"data/cleanDS/Population2018General_clean.csv\")\n", "pop18.drop([\"Males\", \"Females\"], axis=1, inplace=True)\n", "young_pop18 = young_generalPop(pop18, 18, 24)\n", "young_pop18[\"Time\"] = 2018\n", "young_pop18[\"Age range\"] = \"18-24\"\n", "\n", "# --- 2019\n", "pop19 = read_csv(\"data/cleanDS/Population2019General_clean.csv\")\n", "pop19.drop([\"Males\", \"Females\"], axis=1, inplace=True)\n", "young_pop19 = young_generalPop(pop19, 18, 24)\n", "young_pop19[\"Time\"] = 2019\n", "young_pop19[\"Age range\"] = \"18-24\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 2: Total female population 18-24 for each region\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# === B) TOTAL 18-24 F\n", "# Same as before, just DO NOT delete the gender distinction\n", "# --- 2017\n", "f_pop17 = read_csv(\"data/cleanDS/Population2017General_clean.csv\")\n", "f_pop17 = f_pop17.drop(f_pop17[(f_pop17.Sex == \"Males\") | (f_pop17.Sex == \"Total\")].index)\n", "# Fixing the \"Age\" column (changing it to int values) to allow for a better condition for the creation of youngF_pop17\n", "youngF_pop17 = selectAge(f_pop17, 18, 24)\n", "youngF_pop17 = (f_pop17.groupby([\"Region code\", \"Region\"])[\"Population\"].sum()).reset_index()\n", "# Adding columns for easier identification\n", "youngF_pop17[\"Time\"] = 2017\n", "youngF_pop17[\"Age range\"] = \"18-24\"\n", "youngF_pop17[\"Gender\"] = \"Females\"\n", "\n", "# --- 2018\n", "f_pop18 = read_csv(\"data/cleanDS/Population2018General_clean.csv\")\n", "f_pop18.drop([\"Males\", \"Population\"], axis=1, inplace=True)\n", "youngF_pop18 = young_femalePop(f_pop18, 18, 24)\n", "youngF_pop18[\"Time\"] = 2018\n", "youngF_pop18[\"Age range\"] = \"18-24\"\n", "\n", "# --- 2019\n", "f_pop19 = read_csv(\"data/cleanDS/Population2019General_clean.csv\")\n", "youngF_pop19 = young_femalePop(f_pop19, 18, 24)\n", "youngF_pop19[\"Time\"] = 2019\n", "youngF_pop19[\"Age range\"] = \"18-24\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 3: Percentage of early leavers for each region" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d7_clean = read_csv(\"../data/cleanDS/D7_clean.csv\")\n", "d7_2017 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2018) | (d7_clean.TIME == 2019)].index)\n", "d7_2018 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2017) | (d7_clean.TIME == 2019)].index)\n", "d7_2019 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2017) | (d7_clean.TIME == 2018)].index)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 4: Percentage of female early leavers for each region\n", "\n", "(this last step has been commented out after a first run in order to avoid interfeering with the rest of the code/project)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# === C) % EARLY LEAVERS F (PROPORTION)\n", "# --- 2017\n", "MD3_2017 = (merge(young_pop17,d7_2017,left_on=\"Region code\",right_on=\"ITTER107\")).drop([\"TIME\",\"Territory\"],axis=1)\n", "MD3_2017 = (merge(youngF_pop17,MD3_2017,left_on=\"Region code\",right_on=\"ITTER107\",suffixes=('_GENERAL','_FEMALE'))).drop([\"Time_FEMALE\",\"Age range_FEMALE\",\"ITTER107\",\"Region code_FEMALE\",\"Region_FEMALE\"],axis=1)\n", "for idx,row in MD3_2017.iterrows():\n", " result = (row[\"Value\"] * row[\"Population_FEMALE\"])/(row[\"Population_GENERAL\"])\n", " MD3_2017.loc[idx,\"Female Early Leavers\"] = result\n", "#MD3_2017.to_csv(\"data/mashupDS/MD3_17.csv\")\n", "\n", "# row[\"Value\"] : row[\"Population_GENERAL\"] = % early leavers F : row[\"Population_FEMALE\"]\n", "\n", "# --- 2018\n", "d7_2018 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2017) | (d7_clean.TIME == 2019)].index)\n", "\n", "MD3_2018 = (merge(young_pop18,d7_2018,left_on=\"ITTER107\",right_on=\"ITTER107\")).drop([\"TIME\",\"Territory\"],axis=1)\n", "MD3_2018 = (merge(youngF_pop18,MD3_2018,left_on=\"ITTER107\",right_on=\"ITTER107\",suffixes=('_GENERAL','_FEMALE'))).drop([\"Time_FEMALE\",\"Age range_FEMALE\",\"Region_FEMALE\"],axis=1)\n", "\n", "for idx,row in MD3_2018.iterrows():\n", " result = (row[\"Value\"] * row[\"Females\"])/(row[\"Population\"])\n", " MD3_2018.loc[idx,\"Female Early Leavers\"] = result\n", "#MD3_2018.to_csv(\"data/mashupDS/MD3_18.csv\")\n", "\n", "# --- 2019\n", "d7_2019 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2017) | (d7_clean.TIME == 2018)].index)\n", "MD3_2019 = (merge(young_pop19,d7_2019,left_on=\"ITTER107\",right_on=\"ITTER107\")).drop([\"TIME\",\"Territory\"],axis=1)\n", "MD3_2019 = (merge(youngF_pop19,MD3_2019,left_on=\"ITTER107\",right_on=\"ITTER107\",suffixes=('_GENERAL','_FEMALE'))).drop([\"Time_FEMALE\",\"Age range_FEMALE\",\"Region_FEMALE\"],axis=1)\n", "for idx,row in MD3_2019.iterrows():\n", " result = (row[\"Value\"] * row[\"Females\"])/(row[\"Population\"])\n", " MD3_2019.loc[idx,\"Female Early Leavers\"] = result\n", "#MD3_2019.to_csv(\"data/mashupDS/MD3_19.csv\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0 (tags/v3.10.0:b494f59, Oct 4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "9e178ac2d01006df9e74389e28c68e3ef7cb95acb5e5b42727cfb2189b57440d" } } }, "nbformat": 4, "nbformat_minor": 2 }