{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mashup datasets\n",
    "\n",
    "This Jupyter Notebook analyses the mashup datasets for [\"Blessed Be the Fruit\"](https://github.com/OrsolaMBorrini/blessedfruit), an open data project regarding the analysis of factors that might influence pregnancy rates in young women in Italy.\n",
    "\n",
    "From the starting 7 source datasets, three different mashup datasets have been created:\n",
    "Id | Dataset | Description (factor of interest) | Original source datasets\n",
    "--- | --- | --- | --- \n",
    "MD1 | Religious observance in each region | RELIGION - % of religious observance in each region (over the total population) | D1, D2, D3 \n",
    "MD2 | Pregnancy rates in young women in each region | PREGNANCY - % of pregnancies in young women (15-25) in each region (over the total population of young women aged 15-25) | D4, D5, D6\n",
    "MD3 | (Higher) education rates in young women in each region | EDUCATION - % of women early leavers (18-24) in each region (over the total population) | D1, D2, D7\n",
    "\n",
    "### Disclaimer\n",
    "This Jupyter Notebook is of informational nature only, it is not thought to be used for the clean up process for the datasets, but only for their analysis and the explanation of the process. The code actually used for the clean up can be found in `scripts > MASHUP.py`.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mashup phase\n",
    "### MD1 - Religious observance in each region\n",
    "To create a dataset showing the percentage of religious observance in each region, we need to compare the total general population of each region (with no distinction of gender) to the religious observance (in thousands value) in each region. To get this result we will use the following source datasets:\n",
    "* D1 and D2 -> for the data regarding total general population in each region for each of the three years of interest (`data > cleanDS > Population2017_Generalclean.csv`, `Population2018_Generalclean.csv`, `Population2019_Generalclean.csv`)\n",
    "* D3 -> for the data regarding religious observance (thousands value) in each region (`data > cleanDS > D3_clean.csv`)\n",
    "\n",
    "First of all the datasets regarding general population were stripped of any distinction between genders and the various ages were merged together. Additional columns to keep track of the time were added accordingly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ITTER107</th>\n",
       "      <th>Region</th>\n",
       "      <th>Population</th>\n",
       "      <th>Time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ITC1</td>\n",
       "      <td>Piemonte</td>\n",
       "      <td>8622434</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ITC2</td>\n",
       "      <td>Valle d'Aosta / Vallée d'Aoste</td>\n",
       "      <td>250068</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ITC3</td>\n",
       "      <td>Liguria</td>\n",
       "      <td>3049652</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ITC4</td>\n",
       "      <td>Lombardia</td>\n",
       "      <td>20055204</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ITD3</td>\n",
       "      <td>Veneto</td>\n",
       "      <td>9758266</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>ITD4</td>\n",
       "      <td>Friuli-Venezia Giulia</td>\n",
       "      <td>2412432</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>ITD5</td>\n",
       "      <td>Emilia-Romagna</td>\n",
       "      <td>8928238</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>ITDA</td>\n",
       "      <td>Trentino Alto Adige / Südtirol</td>\n",
       "      <td>2156138</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>ITE1</td>\n",
       "      <td>Toscana</td>\n",
       "      <td>7385110</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>ITE2</td>\n",
       "      <td>Umbria</td>\n",
       "      <td>1740330</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>ITE3</td>\n",
       "      <td>Marche</td>\n",
       "      <td>3025344</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>ITE4</td>\n",
       "      <td>Lazio</td>\n",
       "      <td>11511400</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>ITF1</td>\n",
       "      <td>Abruzzo</td>\n",
       "      <td>2587882</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>ITF2</td>\n",
       "      <td>Molise</td>\n",
       "      <td>601032</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>ITF3</td>\n",
       "      <td>Campania</td>\n",
       "      <td>11424286</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>ITF4</td>\n",
       "      <td>Puglia</td>\n",
       "      <td>7906610</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>ITF5</td>\n",
       "      <td>Basilicata</td>\n",
       "      <td>1106508</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>ITF6</td>\n",
       "      <td>Calabria</td>\n",
       "      <td>3788220</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>ITG1</td>\n",
       "      <td>Sicilia</td>\n",
       "      <td>9750580</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>ITG2</td>\n",
       "      <td>Sardegna</td>\n",
       "      <td>3223242</td>\n",
       "      <td>2019</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ITTER107                          Region  Population  Time\n",
       "0      ITC1                        Piemonte     8622434  2019\n",
       "1      ITC2  Valle d'Aosta / Vallée d'Aoste      250068  2019\n",
       "2      ITC3                         Liguria     3049652  2019\n",
       "3      ITC4                       Lombardia    20055204  2019\n",
       "4      ITD3                          Veneto     9758266  2019\n",
       "5      ITD4           Friuli-Venezia Giulia     2412432  2019\n",
       "6      ITD5                  Emilia-Romagna     8928238  2019\n",
       "7      ITDA  Trentino Alto Adige / Südtirol     2156138  2019\n",
       "8      ITE1                         Toscana     7385110  2019\n",
       "9      ITE2                          Umbria     1740330  2019\n",
       "10     ITE3                          Marche     3025344  2019\n",
       "11     ITE4                           Lazio    11511400  2019\n",
       "12     ITF1                         Abruzzo     2587882  2019\n",
       "13     ITF2                          Molise      601032  2019\n",
       "14     ITF3                        Campania    11424286  2019\n",
       "15     ITF4                          Puglia     7906610  2019\n",
       "16     ITF5                      Basilicata     1106508  2019\n",
       "17     ITF6                        Calabria     3788220  2019\n",
       "18     ITG1                         Sicilia     9750580  2019\n",
       "19     ITG2                        Sardegna     3223242  2019"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import csv\n",
    "from pandas import *\n",
    "import functools as ft\n",
    "\n",
    "# ==== MD1 ====\n",
    "\n",
    "# GENERAL POPULATION FILES\n",
    "pop17 = read_csv(\"../data/cleanDS/Population2017General_clean.csv\")\n",
    "pop17 = pop17.drop(pop17[(pop17.Sex == \"Males\") | (pop17.Sex == \"Females\")].index)\n",
    "gen_pop17 = (pop17.groupby([\"Region code\",\"Region\"])[\"Population\"].sum()).reset_index()\n",
    "gen_pop17[\"Time\"] = 2017\n",
    "\n",
    "def generalPop(path):\n",
    "    file = read_csv(path)\n",
    "    file.drop([\"Males\",\"Females\"],axis=1,inplace=True)\n",
    "    file = (file.groupby([\"ITTER107\",\"Region\"])[\"Population\"].sum()).reset_index()\n",
    "    return file\n",
    "\n",
    "pop18 = \"../data/cleanDS/Population2018General_clean.csv\"\n",
    "pop19 = \"../data/cleanDS/Population2019General_clean.csv\"\n",
    "\n",
    "gen_pop18 = generalPop(pop18)\n",
    "gen_pop18[\"Time\"] = 2018\n",
    "gen_pop19 = generalPop(pop19)\n",
    "gen_pop19[\"Time\"] = 2019\n",
    "\n",
    "# Example\n",
    "gen_pop19"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two ancillary functions were created to first get the percentage values from the thousands values that we had in the tables (function `getPercentage`) and to add the column with the percentage value in the specific dataframe (`addPercentageColumn`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "d3_clean = read_csv(\"../data/cleanDS/D3_clean.csv\")\n",
    "\n",
    "def getPercentage(num,pop):\n",
    "    return num * 100 / pop\n",
    "\n",
    "def addPercentageColumn(yearlyDF):\n",
    "    for idx,row in yearlyDF.iterrows():\n",
    "        perc = getPercentage((row[\"Value\"]*1000),row[\"Population\"])\n",
    "        yearlyDF.loc[idx,\"Percentage\"] = perc\n",
    "    return yearlyDF"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the process is very similar with regards for each year, we have created one single function to create (and then save as a csv file) all three mashup datasets: `createYearlyDF`.\n",
    "\n",
    "This function has two input parameters:\n",
    "1. `genPopList` -> a list of the general population dataframes\n",
    "2. `rightDF` -> the dataframe regarding religious observance (containing data for <b>all three years</b>, from 2017 to 2019)\n",
    "\n",
    "For each year (which is obtained by checking the first value under the column \"Time\" of whatever general population dataframe), we cleaned up the religious observance dataframe, keeping only the data needed for that specific year (see: if-elif structure in the code below + drop instruction).\n",
    "\n",
    "Then, we have merged the two input dataframes together using the unique code region and we subsequently dropped other unnecessary and repeated columns (namely \"Territory\",\"TIME\"...).\n",
    "\n",
    "Finally, we called the ancillary function to add a \"Percentage\" column to the resulting dataframe and we have saved it as a new csv file (this last step has been commented out after a first run in order to avoid interfeering with the rest of the code/project).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "def createYearlyDF(genPopList,rightDF):\n",
    "    for item in genPopList:\n",
    "        if item[\"Time\"][0] == 2017:\n",
    "            d3_2017 = rightDF.drop(rightDF[(rightDF.TIME == 2018) | (rightDF.TIME == 2019)].index)\n",
    "            MD1_17 = (merge(item,d3_2017,left_on=\"Region code\",right_on=\"ITTER107\")).drop([\"Territory\",\"ITTER107\",\"TIME\"],axis=1)\n",
    "            addPercentageColumn(MD1_17)\n",
    "            #MD1_17.to_csv(\"data/mashupDS/MD1_17.csv\")\n",
    "        elif item[\"Time\"][0] == 2018:\n",
    "            d3_2018 = rightDF.drop(rightDF[(rightDF.TIME == 2017) | (rightDF.TIME == 2019)].index)\n",
    "            MD1_18 = (merge(item,d3_2018,left_on=\"ITTER107\",right_on=\"ITTER107\")).drop([\"Territory\",\"TIME\"],axis=1)\n",
    "            addPercentageColumn(MD1_18)\n",
    "            #MD1_18.to_csv(\"data/mashupDS/MD1_18.csv\")\n",
    "        elif item[\"Time\"][0] == 2019:\n",
    "            d3_2019 = rightDF.drop(rightDF[(rightDF.TIME == 2017) | (rightDF.TIME == 2018)].index)\n",
    "            MD1_19 = (merge(item,d3_2019,left_on=\"ITTER107\",right_on=\"ITTER107\")).drop([\"Territory\",\"TIME\"],axis=1)\n",
    "            addPercentageColumn(MD1_19)\n",
    "            #MD1_19.to_csv(\"data/mashupDS/MD1_19.csv\")\n",
    "    return        \n",
    "\n",
    "createYearlyDF([gen_pop17,gen_pop18,gen_pop19],d3_clean)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the end, we have <b>three mashup datasets</b> (one for each year in our scope) regarding <b>religious observance rates in each region in Italy</b>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>Region code</th>\n",
       "      <th>Region</th>\n",
       "      <th>Population</th>\n",
       "      <th>Time</th>\n",
       "      <th>TIPO_DATO_AVQ</th>\n",
       "      <th>Value</th>\n",
       "      <th>Percentage</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>ITC1</td>\n",
       "      <td>Piemonte</td>\n",
       "      <td>4349911</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>1105</td>\n",
       "      <td>25.402819</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>ITC1</td>\n",
       "      <td>Piemonte</td>\n",
       "      <td>4349911</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>1007</td>\n",
       "      <td>23.149899</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>ITC2</td>\n",
       "      <td>Valle D'Aosta/Vallée D'Aoste</td>\n",
       "      <td>126213</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>37</td>\n",
       "      <td>29.315522</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>ITC2</td>\n",
       "      <td>Valle D'Aosta/Vallée D'Aoste</td>\n",
       "      <td>126213</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>22</td>\n",
       "      <td>17.430851</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>ITC3</td>\n",
       "      <td>Liguria</td>\n",
       "      <td>1541541</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>252</td>\n",
       "      <td>16.347278</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>ITC3</td>\n",
       "      <td>Liguria</td>\n",
       "      <td>1541541</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>536</td>\n",
       "      <td>34.770402</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>ITC4</td>\n",
       "      <td>Lombardia</td>\n",
       "      <td>9986962</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>2539</td>\n",
       "      <td>25.423147</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>ITC4</td>\n",
       "      <td>Lombardia</td>\n",
       "      <td>9986962</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>2586</td>\n",
       "      <td>25.893760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>ITD3</td>\n",
       "      <td>Veneto</td>\n",
       "      <td>4880936</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>1302</td>\n",
       "      <td>26.675211</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>ITD3</td>\n",
       "      <td>Veneto</td>\n",
       "      <td>4880936</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>964</td>\n",
       "      <td>19.750310</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>ITD4</td>\n",
       "      <td>Friuli-Venezia Giulia</td>\n",
       "      <td>1211155</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>353</td>\n",
       "      <td>29.145733</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>ITD4</td>\n",
       "      <td>Friuli-Venezia Giulia</td>\n",
       "      <td>1211155</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>243</td>\n",
       "      <td>20.063493</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>ITD5</td>\n",
       "      <td>Emilia-Romagna</td>\n",
       "      <td>4445920</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>1520</td>\n",
       "      <td>34.188649</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>ITD5</td>\n",
       "      <td>Emilia-Romagna</td>\n",
       "      <td>4445920</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>834</td>\n",
       "      <td>18.758772</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>ITDA</td>\n",
       "      <td>Trentino-Alto Adige/Südtirol</td>\n",
       "      <td>1068738</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>267</td>\n",
       "      <td>24.982737</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>ITDA</td>\n",
       "      <td>Trentino-Alto Adige/Südtirol</td>\n",
       "      <td>1068738</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>214</td>\n",
       "      <td>20.023617</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>ITE1</td>\n",
       "      <td>Toscana</td>\n",
       "      <td>3712048</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>682</td>\n",
       "      <td>18.372607</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>ITE1</td>\n",
       "      <td>Toscana</td>\n",
       "      <td>3712048</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>1255</td>\n",
       "      <td>33.808830</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>ITE2</td>\n",
       "      <td>Umbria</td>\n",
       "      <td>876477</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>177</td>\n",
       "      <td>20.194483</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>ITE2</td>\n",
       "      <td>Umbria</td>\n",
       "      <td>876477</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>208</td>\n",
       "      <td>23.731370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>20</td>\n",
       "      <td>ITE3</td>\n",
       "      <td>Marche</td>\n",
       "      <td>1526331</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>318</td>\n",
       "      <td>20.834275</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>21</td>\n",
       "      <td>ITE3</td>\n",
       "      <td>Marche</td>\n",
       "      <td>1526331</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>404</td>\n",
       "      <td>26.468702</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>22</td>\n",
       "      <td>ITE4</td>\n",
       "      <td>Lazio</td>\n",
       "      <td>5774606</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>1383</td>\n",
       "      <td>23.949686</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>23</td>\n",
       "      <td>ITE4</td>\n",
       "      <td>Lazio</td>\n",
       "      <td>5774606</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>1485</td>\n",
       "      <td>25.716040</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>24</td>\n",
       "      <td>ITF1</td>\n",
       "      <td>Abruzzo</td>\n",
       "      <td>1306059</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>310</td>\n",
       "      <td>23.735528</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>25</td>\n",
       "      <td>ITF1</td>\n",
       "      <td>Abruzzo</td>\n",
       "      <td>1306059</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>218</td>\n",
       "      <td>16.691436</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>26</td>\n",
       "      <td>ITF2</td>\n",
       "      <td>Molise</td>\n",
       "      <td>306564</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>90</td>\n",
       "      <td>29.357655</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>27</td>\n",
       "      <td>ITF2</td>\n",
       "      <td>Molise</td>\n",
       "      <td>306564</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>51</td>\n",
       "      <td>16.636004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>28</td>\n",
       "      <td>ITF3</td>\n",
       "      <td>Campania</td>\n",
       "      <td>5762889</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>1742</td>\n",
       "      <td>30.227894</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>29</td>\n",
       "      <td>ITF3</td>\n",
       "      <td>Campania</td>\n",
       "      <td>5762889</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>780</td>\n",
       "      <td>13.534878</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>30</td>\n",
       "      <td>ITF4</td>\n",
       "      <td>Puglia</td>\n",
       "      <td>4000966</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>624</td>\n",
       "      <td>15.596234</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>31</td>\n",
       "      <td>ITF4</td>\n",
       "      <td>Puglia</td>\n",
       "      <td>4000966</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>1318</td>\n",
       "      <td>32.942044</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>32</td>\n",
       "      <td>ITF5</td>\n",
       "      <td>Basilicata</td>\n",
       "      <td>562968</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>164</td>\n",
       "      <td>29.131318</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>33</td>\n",
       "      <td>ITF5</td>\n",
       "      <td>Basilicata</td>\n",
       "      <td>562968</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>80</td>\n",
       "      <td>14.210399</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>34</td>\n",
       "      <td>ITF6</td>\n",
       "      <td>Calabria</td>\n",
       "      <td>1924257</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>263</td>\n",
       "      <td>13.667613</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>35</td>\n",
       "      <td>ITF6</td>\n",
       "      <td>Calabria</td>\n",
       "      <td>1924257</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>622</td>\n",
       "      <td>32.324165</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>36</td>\n",
       "      <td>ITG1</td>\n",
       "      <td>Sicilia</td>\n",
       "      <td>4942188</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>821</td>\n",
       "      <td>16.612075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>37</td>\n",
       "      <td>ITG1</td>\n",
       "      <td>Sicilia</td>\n",
       "      <td>4942188</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>1492</td>\n",
       "      <td>30.189058</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>38</td>\n",
       "      <td>ITG2</td>\n",
       "      <td>Sardegna</td>\n",
       "      <td>1631040</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_WEEK_RELIG</td>\n",
       "      <td>382</td>\n",
       "      <td>23.420640</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>39</td>\n",
       "      <td>ITG2</td>\n",
       "      <td>Sardegna</td>\n",
       "      <td>1631040</td>\n",
       "      <td>2017</td>\n",
       "      <td>6_NEVER_RELIG</td>\n",
       "      <td>398</td>\n",
       "      <td>24.401609</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Unnamed: 0 Region code                        Region  Population  Time  \\\n",
       "0            0        ITC1                      Piemonte     4349911  2017   \n",
       "1            1        ITC1                      Piemonte     4349911  2017   \n",
       "2            2        ITC2  Valle D'Aosta/Vallée D'Aoste      126213  2017   \n",
       "3            3        ITC2  Valle D'Aosta/Vallée D'Aoste      126213  2017   \n",
       "4            4        ITC3                       Liguria     1541541  2017   \n",
       "5            5        ITC3                       Liguria     1541541  2017   \n",
       "6            6        ITC4                     Lombardia     9986962  2017   \n",
       "7            7        ITC4                     Lombardia     9986962  2017   \n",
       "8            8        ITD3                        Veneto     4880936  2017   \n",
       "9            9        ITD3                        Veneto     4880936  2017   \n",
       "10          10        ITD4         Friuli-Venezia Giulia     1211155  2017   \n",
       "11          11        ITD4         Friuli-Venezia Giulia     1211155  2017   \n",
       "12          12        ITD5                Emilia-Romagna     4445920  2017   \n",
       "13          13        ITD5                Emilia-Romagna     4445920  2017   \n",
       "14          14        ITDA  Trentino-Alto Adige/Südtirol     1068738  2017   \n",
       "15          15        ITDA  Trentino-Alto Adige/Südtirol     1068738  2017   \n",
       "16          16        ITE1                       Toscana     3712048  2017   \n",
       "17          17        ITE1                       Toscana     3712048  2017   \n",
       "18          18        ITE2                        Umbria      876477  2017   \n",
       "19          19        ITE2                        Umbria      876477  2017   \n",
       "20          20        ITE3                        Marche     1526331  2017   \n",
       "21          21        ITE3                        Marche     1526331  2017   \n",
       "22          22        ITE4                         Lazio     5774606  2017   \n",
       "23          23        ITE4                         Lazio     5774606  2017   \n",
       "24          24        ITF1                       Abruzzo     1306059  2017   \n",
       "25          25        ITF1                       Abruzzo     1306059  2017   \n",
       "26          26        ITF2                        Molise      306564  2017   \n",
       "27          27        ITF2                        Molise      306564  2017   \n",
       "28          28        ITF3                      Campania     5762889  2017   \n",
       "29          29        ITF3                      Campania     5762889  2017   \n",
       "30          30        ITF4                        Puglia     4000966  2017   \n",
       "31          31        ITF4                        Puglia     4000966  2017   \n",
       "32          32        ITF5                    Basilicata      562968  2017   \n",
       "33          33        ITF5                    Basilicata      562968  2017   \n",
       "34          34        ITF6                      Calabria     1924257  2017   \n",
       "35          35        ITF6                      Calabria     1924257  2017   \n",
       "36          36        ITG1                       Sicilia     4942188  2017   \n",
       "37          37        ITG1                       Sicilia     4942188  2017   \n",
       "38          38        ITG2                      Sardegna     1631040  2017   \n",
       "39          39        ITG2                      Sardegna     1631040  2017   \n",
       "\n",
       "    TIPO_DATO_AVQ  Value  Percentage  \n",
       "0   6_NEVER_RELIG   1105   25.402819  \n",
       "1    6_WEEK_RELIG   1007   23.149899  \n",
       "2   6_NEVER_RELIG     37   29.315522  \n",
       "3    6_WEEK_RELIG     22   17.430851  \n",
       "4    6_WEEK_RELIG    252   16.347278  \n",
       "5   6_NEVER_RELIG    536   34.770402  \n",
       "6    6_WEEK_RELIG   2539   25.423147  \n",
       "7   6_NEVER_RELIG   2586   25.893760  \n",
       "8    6_WEEK_RELIG   1302   26.675211  \n",
       "9   6_NEVER_RELIG    964   19.750310  \n",
       "10  6_NEVER_RELIG    353   29.145733  \n",
       "11   6_WEEK_RELIG    243   20.063493  \n",
       "12  6_NEVER_RELIG   1520   34.188649  \n",
       "13   6_WEEK_RELIG    834   18.758772  \n",
       "14   6_WEEK_RELIG    267   24.982737  \n",
       "15  6_NEVER_RELIG    214   20.023617  \n",
       "16   6_WEEK_RELIG    682   18.372607  \n",
       "17  6_NEVER_RELIG   1255   33.808830  \n",
       "18   6_WEEK_RELIG    177   20.194483  \n",
       "19  6_NEVER_RELIG    208   23.731370  \n",
       "20  6_NEVER_RELIG    318   20.834275  \n",
       "21   6_WEEK_RELIG    404   26.468702  \n",
       "22   6_WEEK_RELIG   1383   23.949686  \n",
       "23  6_NEVER_RELIG   1485   25.716040  \n",
       "24   6_WEEK_RELIG    310   23.735528  \n",
       "25  6_NEVER_RELIG    218   16.691436  \n",
       "26   6_WEEK_RELIG     90   29.357655  \n",
       "27  6_NEVER_RELIG     51   16.636004  \n",
       "28   6_WEEK_RELIG   1742   30.227894  \n",
       "29  6_NEVER_RELIG    780   13.534878  \n",
       "30  6_NEVER_RELIG    624   15.596234  \n",
       "31   6_WEEK_RELIG   1318   32.942044  \n",
       "32   6_WEEK_RELIG    164   29.131318  \n",
       "33  6_NEVER_RELIG     80   14.210399  \n",
       "34  6_NEVER_RELIG    263   13.667613  \n",
       "35   6_WEEK_RELIG    622   32.324165  \n",
       "36  6_NEVER_RELIG    821   16.612075  \n",
       "37   6_WEEK_RELIG   1492   30.189058  \n",
       "38   6_WEEK_RELIG    382   23.420640  \n",
       "39  6_NEVER_RELIG    398   24.401609  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "md1_17 = read_csv(\"../data/mashupDS/MD1_17.csv\")\n",
    "md1_17"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### MD2 - Pregnancy rates in young women in each region\n",
    "\n",
    "#### MD2 - Absolute values\n",
    "\n",
    "This mashup dataset contains the <b>total number of pregnancies</b> in young women aged 15-25, divided per region. A dataset for each year (2017, 2018, 2019) has been created.\n",
    "\n",
    "The following source datasets have been used:\n",
    "* D4 - \"Mother - Age and Citizenship\"\n",
    "* D5 -  \"Spontaneous abortions - resignation from the place of the event: Age of women - prov.\"\n",
    "* D6 - \"Induced abortions - Migration: Events by region of residence of the woman and region of intervention\"\n",
    "\n",
    "The result dataset contains the following columns:\n",
    "<table>\n",
    "<tr>\n",
    "<th>ITTER107</th>\n",
    "<th>Region</th>\n",
    "<th>Live_births</th>\n",
    "<th>Miscarriages</th>\n",
    "<th>Abortions</th>\n",
    "<th>Total</th>\n",
    "<th>Time</th>\n",
    "</tr>\n",
    "</table>\n",
    "\n",
    "<code>Total</code> contains the data retreived by summing values of live_births, spontaneous abortions and induced abortions in D4, D5, D6.\n",
    "\n",
    "<b>Disclaimer</b>\n",
    "\n",
    "* Although our selected age group would be 15-25 years old, unfortunately, these istat datasets were organized in age classes of 15-24, 25-29. For this reason we have decided to keep <b>only age class 15-24</b> since, including pregnancies data for women closer to 30 years old, hence closer to the average age of pregnancy in women, would have falsified our data too much\n",
    "\n",
    "* <span style=\"color: blue;\">Dataset 5</span> does not have values for a class age of 15 -24 as the other two datasets have, instead, it gives values of live births aggregated for a class of women <span style=\"color: blue;\">until 17 years old</span>. However, we considered valid to aggregate numbers from this dataset without further distinction of age, since values for pregnancies under 15 years old are very close to 0, hence irrelevant.\n",
    "\n",
    "\n",
    "For all datasets (D4, D5 and D6), the same process has been adopted. For the sake of documentation we will use D4 as an example.\n",
    "\n",
    "1. Age groups have been aggregated, cleaning eventual unwanted values from source datasets "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "pathD4 = \"../data/srcDS/D4Pregnancy/cleanedDS/cleanedD4-2019-Pregnancy.csv\"\n",
    "live_births = read_csv(pathD4, keep_default_na=False,\n",
    "            dtype= {\n",
    "                \"RESIDENCE_TERR\":\"string\",\n",
    "                \"CITIZENSHIP_MOTHER\": \"string\",\n",
    "                \"MOTHER_AGE\" :\"string\",\n",
    "                \"OBS_VALUE\" : \"int64\"\n",
    "            })\n",
    "\n",
    "live_births = live_births.query(\"MOTHER_AGE != 'Y25'\") # create a df without data for Y25\n",
    "live_births = live_births.query(\"RESIDENCE_TERR != 'ITD1'\")\n",
    "live_births = live_births.query(\"RESIDENCE_TERR != 'ITD2'\") \n",
    "# take out the column with mother_age so I can sum all of the values per region\n",
    "live_births = live_births[[\"RESIDENCE_TERR\", \"OBS_VALUE\"]] \n",
    "live_births = live_births.groupby(\"RESIDENCE_TERR\", as_index=False).sum()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. Regional code and Region name have been added<br>\n",
    "<center>In D4, that only had regional codes:</center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "ID_name_dict = { \"ITC1\" : \"Piemonte\",\n",
    "                \"ITC2\" : \"Valle d'Aosta / Vallée d'Aoste\",\n",
    "                \"ITC3\" : \"Liguria\",\n",
    "                \"ITC4\" : \"Lombardia\",\n",
    "                \"ITDA\" : \"Trentino Alto Adige / Südtirol\",\n",
    "                \"ITD3\" : \"Veneto\",\n",
    "                \"ITD4\" : \"Friuli-Venezia Giulia\",\n",
    "                \"ITD5\" : \"Emilia-Romagna\",\n",
    "                \"ITE1\" : \"Toscana\",\n",
    "                \"ITE2\" : \"Umbria\",\n",
    "                \"ITE3\" : \"Marche\",\n",
    "                \"ITE4\" : \"Lazio\",\n",
    "                \"ITF1\" : \"Abruzzo\",\n",
    "                \"ITF2\" : \"Molise\",\n",
    "                \"ITF3\" : \"Campania\",\n",
    "                \"ITF4\" : \"Puglia\",\n",
    "                \"ITF5\" : \"Basilicata\",\n",
    "                \"ITF6\" : \"Calabria\",\n",
    "                \"ITG1\" : \"Sicilia\",\n",
    "                \"ITG2\" : \"Sardegna\"\n",
    "                }\n",
    "\n",
    "\n",
    "i = 0\n",
    "\n",
    "# add NL Region name to dataset\n",
    "\n",
    "region_name = []\n",
    "\n",
    "for idx, row in live_births.iterrows():\n",
    "    region_name.append(ID_name_dict[live_births.at[idx, \"RESIDENCE_TERR\"]])\n",
    "\n",
    "live_births[\"Region\"] = region_name #maybe try inserting at idx 1 ??\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>2.2. In D5 and D5, that only had Region name:</center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "pathD5 = \"../data/srcDS/D5Pregnancy/cleanedDS/cleanedD5-2019-Pregnancy.csv\"\n",
    "miscarriages = read_csv(pathD5, keep_default_na=False,\n",
    "             dtype= {\n",
    "                \"Territorio\":\"string\",\n",
    "                \"Classe di età\": \"string\",\n",
    "                \"Value\" : \"int64\"\n",
    "            })\n",
    "\n",
    "pathD6 = \"../data/srcDS/D6Pregnancy/cleanedDS/cleanedD6-2019-Pregnancy.csv\"\n",
    "abortions = read_csv(pathD6, keep_default_na=False,\n",
    "             dtype= {\n",
    "                \"Territorio dell'evento\":\"string\",\n",
    "                \"Età e classe di età\": \"string\",\n",
    "                \"Value\" : \"int64\"\n",
    "            })\n",
    "            \n",
    "region_code = []\n",
    "region_name = []\n",
    "\n",
    "for key, value in ID_name_dict.items():\n",
    "    region_name.append(value)\n",
    "    region_code.append(key)\n",
    "\n",
    "DF_ID_name = DataFrame({\"ITTER107\" : Series(region_code,  dtype=\"string\"),  \"Region\": Series(region_name,  dtype=\"string\")})\n",
    "\n",
    "miscarriages = merge(DF_ID_name, miscarriages, left_on=\"Region\", right_on=\"Territorio\")\n",
    "miscarriages.drop([\"Territorio\"], axis=1, inplace=True)\n",
    "\n",
    "abortions = merge(DF_ID_name, abortions, left_on=\"Region\", right_on=\"Territorio dell'evento\")\n",
    "abortions.drop([\"Territorio dell'evento\"], axis=1, inplace=True)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. If needed columns label have been renamed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "live_births.rename(columns = {\"RESIDENCE_TERR\" : \"ITTER107\", \"OBS_VALUE\" : \"Live_births\"}, inplace=True)\n",
    "miscarriages.rename(columns = {\"Value\" : \"Miscarriages\"}, inplace=True)\n",
    "abortions.rename(columns = { \"Value\" : \"Abortions\"}, inplace=True)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. A column <code>Time</code> specifying the year of the dataset has been added. <br><br>\n",
    "For this purpose, a function <code>addyear</code> has been implemented. It takes in imput <code>pathyear</code> the path of the dataset of interest, and <code>df</code> the Dataframe object of the dataset of interest.<br>\n",
    "If <code>Time</code> column is not already in the dataset, the function adds it, etracting the correct year from the name of the dataset found in the path. It returns <code>yr</code> the year of the dataset as a string\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "def addyear(pathyear, df):\n",
    "    if \"Time\" not in df.columns[0]:\n",
    "        year2017 = re.search(\"2017\", pathyear)\n",
    "        year2018 = re.search(\"2018\", pathyear)\n",
    "        year2019 = re.search(\"2019\", pathyear)\n",
    "\n",
    "        years = [year2017, year2018, year2019]\n",
    "        for el in years:\n",
    "            if el is not None:\n",
    "                yr = el.group()\n",
    "                df[\"Time\"] = yr\n",
    "                return yr\n",
    "        return False\n",
    "\n",
    "yr = addyear(pathD4, live_births)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5. Finally, the resulting datasets are merged together based on Region's value and the total pregnancies are calculated summing each DS' value. They are exported as <b>csv</b> to create the final mashup dataset MD2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "PregnancyDS = merge(live_births, miscarriages, left_on= \"ITTER107\", right_on=\"ITTER107\")\n",
    "PregnancyDS = merge(PregnancyDS, abortions, left_on= \"ITTER107\", right_on=\"ITTER107\", how=\"left\") #how=\"left\" is because I don'7 have lazio's value rn\n",
    "\n",
    "total = PregnancyDS.sum(axis=1, numeric_only=True) # sums all pregnancies value per region\n",
    "PregnancyDS[\"Total\"] = total\n",
    "PregnancyDS = PregnancyDS[[\"ITTER107\", \"Region\", \"Live_births\", \"Miscarriages\", \"Abortions\", \"Total\", \"Time\"]]\n",
    "PregnancyDS.to_csv(\"../data/mashupDS/MD2-ASS-\" + yr + \".csv\", index=False)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find the visualisation of the 2017 mashup dataset with absolute values below as example.\n",
    "All MD2 - Absolute values can be found in ` data > mashupDS ` <br><br>\n",
    "Full code is available at `scripts > MASHUP.py `"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ITTER107</th>\n",
       "      <th>Region</th>\n",
       "      <th>Live_births</th>\n",
       "      <th>Miscarriages</th>\n",
       "      <th>Abortions</th>\n",
       "      <th>Total</th>\n",
       "      <th>Time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ITC1</td>\n",
       "      <td>Piemonte</td>\n",
       "      <td>5448</td>\n",
       "      <td>309</td>\n",
       "      <td>1670</td>\n",
       "      <td>7427</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ITC2</td>\n",
       "      <td>Valle d'Aosta / Vallée d'Aoste</td>\n",
       "      <td>142</td>\n",
       "      <td>9</td>\n",
       "      <td>46</td>\n",
       "      <td>197</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ITC3</td>\n",
       "      <td>Liguria</td>\n",
       "      <td>1794</td>\n",
       "      <td>92</td>\n",
       "      <td>664</td>\n",
       "      <td>2550</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ITC4</td>\n",
       "      <td>Lombardia</td>\n",
       "      <td>13468</td>\n",
       "      <td>599</td>\n",
       "      <td>3381</td>\n",
       "      <td>17448</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ITD3</td>\n",
       "      <td>Veneto</td>\n",
       "      <td>5944</td>\n",
       "      <td>287</td>\n",
       "      <td>1116</td>\n",
       "      <td>7347</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>ITD4</td>\n",
       "      <td>Friuli-Venezia Giulia</td>\n",
       "      <td>1342</td>\n",
       "      <td>75</td>\n",
       "      <td>343</td>\n",
       "      <td>1760</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>ITD5</td>\n",
       "      <td>Emilia-Romagna</td>\n",
       "      <td>6016</td>\n",
       "      <td>234</td>\n",
       "      <td>1644</td>\n",
       "      <td>7894</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>ITDA</td>\n",
       "      <td>Trentino Alto Adige / Südtirol</td>\n",
       "      <td>1776</td>\n",
       "      <td>50</td>\n",
       "      <td>326</td>\n",
       "      <td>2152</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>ITE1</td>\n",
       "      <td>Toscana</td>\n",
       "      <td>4392</td>\n",
       "      <td>221</td>\n",
       "      <td>1310</td>\n",
       "      <td>5923</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>ITE2</td>\n",
       "      <td>Umbria</td>\n",
       "      <td>1056</td>\n",
       "      <td>41</td>\n",
       "      <td>275</td>\n",
       "      <td>1372</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>ITE3</td>\n",
       "      <td>Marche</td>\n",
       "      <td>1846</td>\n",
       "      <td>96</td>\n",
       "      <td>383</td>\n",
       "      <td>2325</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>ITE4</td>\n",
       "      <td>Lazio</td>\n",
       "      <td>6770</td>\n",
       "      <td>440</td>\n",
       "      <td>2189</td>\n",
       "      <td>9399</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>ITF1</td>\n",
       "      <td>Abruzzo</td>\n",
       "      <td>1698</td>\n",
       "      <td>61</td>\n",
       "      <td>368</td>\n",
       "      <td>2127</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>ITF2</td>\n",
       "      <td>Molise</td>\n",
       "      <td>368</td>\n",
       "      <td>20</td>\n",
       "      <td>102</td>\n",
       "      <td>490</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>ITF3</td>\n",
       "      <td>Campania</td>\n",
       "      <td>12216</td>\n",
       "      <td>306</td>\n",
       "      <td>1736</td>\n",
       "      <td>14258</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>ITF4</td>\n",
       "      <td>Puglia</td>\n",
       "      <td>6508</td>\n",
       "      <td>420</td>\n",
       "      <td>1794</td>\n",
       "      <td>8722</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>ITF5</td>\n",
       "      <td>Basilicata</td>\n",
       "      <td>598</td>\n",
       "      <td>42</td>\n",
       "      <td>124</td>\n",
       "      <td>764</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>ITF6</td>\n",
       "      <td>Calabria</td>\n",
       "      <td>3442</td>\n",
       "      <td>154</td>\n",
       "      <td>449</td>\n",
       "      <td>4045</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>ITG1</td>\n",
       "      <td>Sicilia</td>\n",
       "      <td>11906</td>\n",
       "      <td>553</td>\n",
       "      <td>1555</td>\n",
       "      <td>14014</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>ITG2</td>\n",
       "      <td>Sardegna</td>\n",
       "      <td>1730</td>\n",
       "      <td>80</td>\n",
       "      <td>511</td>\n",
       "      <td>2321</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ITTER107                          Region  Live_births  Miscarriages  \\\n",
       "0      ITC1                        Piemonte         5448           309   \n",
       "1      ITC2  Valle d'Aosta / Vallée d'Aoste          142             9   \n",
       "2      ITC3                         Liguria         1794            92   \n",
       "3      ITC4                       Lombardia        13468           599   \n",
       "4      ITD3                          Veneto         5944           287   \n",
       "5      ITD4           Friuli-Venezia Giulia         1342            75   \n",
       "6      ITD5                  Emilia-Romagna         6016           234   \n",
       "7      ITDA  Trentino Alto Adige / Südtirol         1776            50   \n",
       "8      ITE1                         Toscana         4392           221   \n",
       "9      ITE2                          Umbria         1056            41   \n",
       "10     ITE3                          Marche         1846            96   \n",
       "11     ITE4                           Lazio         6770           440   \n",
       "12     ITF1                         Abruzzo         1698            61   \n",
       "13     ITF2                          Molise          368            20   \n",
       "14     ITF3                        Campania        12216           306   \n",
       "15     ITF4                          Puglia         6508           420   \n",
       "16     ITF5                      Basilicata          598            42   \n",
       "17     ITF6                        Calabria         3442           154   \n",
       "18     ITG1                         Sicilia        11906           553   \n",
       "19     ITG2                        Sardegna         1730            80   \n",
       "\n",
       "    Abortions  Total  Time  \n",
       "0        1670   7427  2017  \n",
       "1          46    197  2017  \n",
       "2         664   2550  2017  \n",
       "3        3381  17448  2017  \n",
       "4        1116   7347  2017  \n",
       "5         343   1760  2017  \n",
       "6        1644   7894  2017  \n",
       "7         326   2152  2017  \n",
       "8        1310   5923  2017  \n",
       "9         275   1372  2017  \n",
       "10        383   2325  2017  \n",
       "11       2189   9399  2017  \n",
       "12        368   2127  2017  \n",
       "13        102    490  2017  \n",
       "14       1736  14258  2017  \n",
       "15       1794   8722  2017  \n",
       "16        124    764  2017  \n",
       "17        449   4045  2017  \n",
       "18       1555  14014  2017  \n",
       "19        511   2321  2017  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "md2_ass_2017 = read_csv(\"../data/mashupDS/MD2-ASS-2017.csv\")\n",
    "md2_ass_2017"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### MD2 - Percentage values\n",
    "\n",
    "This dataset contains the <b>percentage values of total pregnancies, live births, spontaneous abortions and induced abortions</b> calculated over the total female population aged 15-25.\n",
    "\n",
    "The following datasets have been used:<br><br>\n",
    "<b>D1-D2 selected Population cleaned datasets</b><br>\n",
    "\n",
    "* Population2017Selected_clean.csv\n",
    "* Population2018Selected_clean.csv\n",
    "* Population2019Selected_clean.csv\n",
    "\n",
    "<b> MD2 with absolute values </b><br>\n",
    "\n",
    "* MD2-ASS-2017.csv\n",
    "* MD2-ASS-2018.csv\n",
    "* MD2-ASS-2019.csv\n",
    "\n",
    "\n",
    "1. Males values + Total values have been taken out from selected population datasets whereas Female values have been summed and grouped by region\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SELECTED POPULATION FILES\n",
    "sel_pop17 = read_csv(\"../data/cleanDS/Population2017Selected_clean.csv\")\n",
    "# only keep \"Females\" column\n",
    "sel_pop17 = sel_pop17.drop(sel_pop17[(sel_pop17.Sex == \"Males\") | (sel_pop17.Sex == \"Total\")].index)\n",
    "# Put together ages and sum value under column \"Population\"\n",
    "sel_pop17 = (sel_pop17.groupby([\"Region code\",\"Region\"])[\"Population\"].sum()).reset_index()\n",
    "sel_pop17[\"Time\"] = 2017\n",
    "sel_pop17.rename(columns = {\"Region code\" : \"ITTER107\", \"Population\" : \"Females\"}, inplace=True)\n",
    "\n",
    "def selectedPop(path):\n",
    "    file = read_csv(path)\n",
    "    file.drop([\"Males\",\"Population\"],axis=1,inplace=True)\n",
    "    file = (file.groupby([\"ITTER107\",\"Region\"])[\"Females\"].sum()).reset_index()\n",
    "    return file    \n",
    "\n",
    "pop18 = \"../data/cleanDS/Population2018Selected_clean.csv\"\n",
    "pop19 = \"../data/cleanDS/Population2019Selected_clean.csv\"\n",
    "\n",
    "sel_pop18 = selectedPop(pop18)\n",
    "sel_pop18[\"Time\"] = 2018   # adds the Time column\n",
    "sel_pop19 = selectedPop(pop19)\n",
    "sel_pop19[\"Time\"] = 2019\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. A simple function <code>getPercentage</code> as been implemented. <br><br> It takes a <code>num</code> numerator and a <code>pop</code> denominator value and returns the result of the percentage equation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "def getPercentage(num,pop):\n",
    "    return num * 100 / pop"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. Another function <code>percentages</code> has been implemented. <br><br> It takes in input the dataset <code>DStotal</code> of the total selected female population and the dataset <code>DSpartial</code> of the mashup datasets MD2 with pregnancies absolute values.<br><br>\n",
    "This function transforms absolute values of <code>DSpartial</code> in percentage values and export the result dataset in <b>csv</b> format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create empty lists to fill with % values\n",
    "Total_perc = []\n",
    "Live_births_perc = []\n",
    "Miscarriages_perc = []\n",
    "Abortions_perc = []\n",
    "\n",
    "def percentages(DStotal, DSpartial):\n",
    "    # merge total DS with partial DS so it's then possible to iterate over rows\n",
    "    mergedDF = merge(DStotal, DSpartial, left_on=\"ITTER107\", right_on=\"ITTER107\", how=\"left\")\n",
    "    mergedDF = mergedDF[[\"ITTER107\", \"Region_x\", \"Females\", \"Live_births\", \"Miscarriages\", \"Abortions\", \"Total\"]]\n",
    "\n",
    "    for idx, row in mergedDF.iterrows():\n",
    "        year = str(DStotal.at[idx, \"Time\"]) # save year value to add later as a column\n",
    "\n",
    "        # calculate percentages for each value\n",
    "        p_tot = getPercentage((mergedDF.at[idx, \"Total\"]), (mergedDF.at[idx, \"Females\"]))\n",
    "        p_lb = getPercentage((mergedDF.at[idx, \"Live_births\"]), (mergedDF.at[idx, \"Females\"]))\n",
    "        p_m = getPercentage((mergedDF.at[idx, \"Miscarriages\"]), (mergedDF.at[idx, \"Females\"]))\n",
    "        p_a = getPercentage((mergedDF.at[idx, \"Abortions\"]), (mergedDF.at[idx, \"Females\"]))\n",
    "\n",
    "        Total_perc.append(p_tot)\n",
    "        Live_births_perc.append(p_lb)\n",
    "        Miscarriages_perc.append(p_m)\n",
    "        Abortions_perc.append(p_a)\n",
    "\n",
    "    # change absolute values with percentage values\n",
    "    DSpartial[\"Live_births\"] = Live_births_perc\n",
    "    DSpartial[\"Miscarriages\"] = Miscarriages_perc\n",
    "    DSpartial[\"Abortions\"] = Abortions_perc\n",
    "    DSpartial[\"Total\"] = Total_perc\n",
    "    DSpartial[\"Time\"] = year\n",
    "\n",
    "    DSpartial.to_csv(\"../data/mashupDS/MD2-PERC-\" + year + \".csv\", index=False)\n",
    "    return True\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find the visualisation of the 2017 mashup dataset with percentage values below as example.\n",
    "All MD2 - Absolute values can be found in ` data > mashupDS ` <br><br>\n",
    "Full code is available at `scripts > MASHUP.py `"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ITTER107</th>\n",
       "      <th>Region</th>\n",
       "      <th>Live_births</th>\n",
       "      <th>Miscarriages</th>\n",
       "      <th>Abortions</th>\n",
       "      <th>Total</th>\n",
       "      <th>Time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ITC1</td>\n",
       "      <td>Piemonte</td>\n",
       "      <td>2.657263</td>\n",
       "      <td>0.150715</td>\n",
       "      <td>0.814543</td>\n",
       "      <td>3.622520</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ITC2</td>\n",
       "      <td>Valle d'Aosta / Vallée d'Aoste</td>\n",
       "      <td>2.308943</td>\n",
       "      <td>0.146341</td>\n",
       "      <td>0.747967</td>\n",
       "      <td>3.203252</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ITC3</td>\n",
       "      <td>Liguria</td>\n",
       "      <td>2.609758</td>\n",
       "      <td>0.133834</td>\n",
       "      <td>0.965931</td>\n",
       "      <td>3.709523</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ITC4</td>\n",
       "      <td>Lombardia</td>\n",
       "      <td>2.733171</td>\n",
       "      <td>0.121560</td>\n",
       "      <td>0.686134</td>\n",
       "      <td>3.540865</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ITD3</td>\n",
       "      <td>Veneto</td>\n",
       "      <td>2.409824</td>\n",
       "      <td>0.116356</td>\n",
       "      <td>0.452450</td>\n",
       "      <td>2.978630</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>ITD4</td>\n",
       "      <td>Friuli-Venezia Giulia</td>\n",
       "      <td>2.431512</td>\n",
       "      <td>0.135889</td>\n",
       "      <td>0.621467</td>\n",
       "      <td>3.188868</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>ITD5</td>\n",
       "      <td>Emilia-Romagna</td>\n",
       "      <td>2.912133</td>\n",
       "      <td>0.113271</td>\n",
       "      <td>0.795802</td>\n",
       "      <td>3.821206</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>ITDA</td>\n",
       "      <td>Trentino Alto Adige / Südtirol</td>\n",
       "      <td>2.891472</td>\n",
       "      <td>0.081404</td>\n",
       "      <td>0.530754</td>\n",
       "      <td>3.503631</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>ITE1</td>\n",
       "      <td>Toscana</td>\n",
       "      <td>2.566290</td>\n",
       "      <td>0.129133</td>\n",
       "      <td>0.765446</td>\n",
       "      <td>3.460869</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>ITE2</td>\n",
       "      <td>Umbria</td>\n",
       "      <td>2.557644</td>\n",
       "      <td>0.099302</td>\n",
       "      <td>0.666053</td>\n",
       "      <td>3.322999</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>ITE3</td>\n",
       "      <td>Marche</td>\n",
       "      <td>2.472675</td>\n",
       "      <td>0.128590</td>\n",
       "      <td>0.513020</td>\n",
       "      <td>3.114284</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>ITE4</td>\n",
       "      <td>Lazio</td>\n",
       "      <td>2.372483</td>\n",
       "      <td>0.154194</td>\n",
       "      <td>0.767115</td>\n",
       "      <td>3.293792</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>ITF1</td>\n",
       "      <td>Abruzzo</td>\n",
       "      <td>2.580312</td>\n",
       "      <td>0.092697</td>\n",
       "      <td>0.559220</td>\n",
       "      <td>3.232228</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>ITF2</td>\n",
       "      <td>Molise</td>\n",
       "      <td>2.290837</td>\n",
       "      <td>0.124502</td>\n",
       "      <td>0.634960</td>\n",
       "      <td>3.050299</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>ITF3</td>\n",
       "      <td>Campania</td>\n",
       "      <td>3.333806</td>\n",
       "      <td>0.083509</td>\n",
       "      <td>0.473763</td>\n",
       "      <td>3.891078</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>ITF4</td>\n",
       "      <td>Puglia</td>\n",
       "      <td>2.800429</td>\n",
       "      <td>0.180728</td>\n",
       "      <td>0.771968</td>\n",
       "      <td>3.753125</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>ITF5</td>\n",
       "      <td>Basilicata</td>\n",
       "      <td>1.927664</td>\n",
       "      <td>0.135388</td>\n",
       "      <td>0.399716</td>\n",
       "      <td>2.462768</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>ITF6</td>\n",
       "      <td>Calabria</td>\n",
       "      <td>3.106330</td>\n",
       "      <td>0.138982</td>\n",
       "      <td>0.405213</td>\n",
       "      <td>3.650524</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>ITG1</td>\n",
       "      <td>Sicilia</td>\n",
       "      <td>4.053410</td>\n",
       "      <td>0.188269</td>\n",
       "      <td>0.529401</td>\n",
       "      <td>4.771081</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>ITG2</td>\n",
       "      <td>Sardegna</td>\n",
       "      <td>2.258309</td>\n",
       "      <td>0.104430</td>\n",
       "      <td>0.667050</td>\n",
       "      <td>3.029789</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ITTER107                          Region  Live_births  Miscarriages  \\\n",
       "0      ITC1                        Piemonte     2.657263      0.150715   \n",
       "1      ITC2  Valle d'Aosta / Vallée d'Aoste     2.308943      0.146341   \n",
       "2      ITC3                         Liguria     2.609758      0.133834   \n",
       "3      ITC4                       Lombardia     2.733171      0.121560   \n",
       "4      ITD3                          Veneto     2.409824      0.116356   \n",
       "5      ITD4           Friuli-Venezia Giulia     2.431512      0.135889   \n",
       "6      ITD5                  Emilia-Romagna     2.912133      0.113271   \n",
       "7      ITDA  Trentino Alto Adige / Südtirol     2.891472      0.081404   \n",
       "8      ITE1                         Toscana     2.566290      0.129133   \n",
       "9      ITE2                          Umbria     2.557644      0.099302   \n",
       "10     ITE3                          Marche     2.472675      0.128590   \n",
       "11     ITE4                           Lazio     2.372483      0.154194   \n",
       "12     ITF1                         Abruzzo     2.580312      0.092697   \n",
       "13     ITF2                          Molise     2.290837      0.124502   \n",
       "14     ITF3                        Campania     3.333806      0.083509   \n",
       "15     ITF4                          Puglia     2.800429      0.180728   \n",
       "16     ITF5                      Basilicata     1.927664      0.135388   \n",
       "17     ITF6                        Calabria     3.106330      0.138982   \n",
       "18     ITG1                         Sicilia     4.053410      0.188269   \n",
       "19     ITG2                        Sardegna     2.258309      0.104430   \n",
       "\n",
       "    Abortions     Total  Time  \n",
       "0    0.814543  3.622520  2017  \n",
       "1    0.747967  3.203252  2017  \n",
       "2    0.965931  3.709523  2017  \n",
       "3    0.686134  3.540865  2017  \n",
       "4    0.452450  2.978630  2017  \n",
       "5    0.621467  3.188868  2017  \n",
       "6    0.795802  3.821206  2017  \n",
       "7    0.530754  3.503631  2017  \n",
       "8    0.765446  3.460869  2017  \n",
       "9    0.666053  3.322999  2017  \n",
       "10   0.513020  3.114284  2017  \n",
       "11   0.767115  3.293792  2017  \n",
       "12   0.559220  3.232228  2017  \n",
       "13   0.634960  3.050299  2017  \n",
       "14   0.473763  3.891078  2017  \n",
       "15   0.771968  3.753125  2017  \n",
       "16   0.399716  2.462768  2017  \n",
       "17   0.405213  3.650524  2017  \n",
       "18   0.529401  4.771081  2017  \n",
       "19   0.667050  3.029789  2017  "
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "md2_perc_2017 = read_csv(\"../data/mashupDS/MD2-PERC-2017.csv\")\n",
    "md2_perc_2017"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### MD3 - (Higher) education rates in young women in each region\n",
    "As the source dataset D7 did not have, with absolute values, the subdivision in regions (but only had data regarding the whole national territory of Italy), we had to use the \"percentage values\".\n",
    "The percentage is calculated on the total population of 18-24 years old in each region, without distinctions in terms of gender.\n",
    "\n",
    "However, given the assumptions of the project, we had to focus on the female population of this age range and we did so by employing the following proportion:\n",
    "* % early leavers : total 18-24 = [% early leavers F] : total 18-24 F\n",
    "\n",
    "Where:\n",
    "1. total 18-24 -> data to be gathered from the general population datasets (D1,D2), filtering them with ages 18-24\n",
    "2. total 18-24 F -> data to be gathered from the selected population datasets (D1, D2), filtering them with ages 18-24\n",
    "3.  % early leavers -> data contained in the source dataset D7\n",
    "4. [% early leavers F] -> data to be gathered following the proportion\n",
    "\n",
    "#### Step 1: Total population 18-24 for each region"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === A) TOTAL 18-24\n",
    "# Get data for ages 18-24 from the general population datasets\n",
    "def selectAge(df, age1, age2):\n",
    "    df = df.drop(df[(df.Age == \"100 and over\") | (df.Age == \"Total\")].index)\n",
    "    df[\"Age\"] = df[\"Age\"].astype('int')\n",
    "    young_df = df.loc[(df[\"Age\"] >= age1) & (df[\"Age\"] <= age2)]\n",
    "    return young_df\n",
    "\n",
    "\n",
    "def young_generalPop(df, age1, age2):\n",
    "    young_df = selectAge(df, age1, age2)\n",
    "    young_df = (df.groupby([\"ITTER107\", \"Region\"])\n",
    "                [\"Population\"].sum()).reset_index()\n",
    "    return young_df\n",
    "\n",
    "def young_femalePop(df, age1, age2):\n",
    "    young_Fdf = selectAge(df, age1, age2)\n",
    "    young_Fdf = (df.groupby([\"ITTER107\", \"Region\"])\n",
    "                [\"Females\"].sum()).reset_index()\n",
    "    return young_Fdf\n",
    "\n",
    "\n",
    "# --- 2017\n",
    "pop17 = read_csv(\"data/cleanDS/Population2017General_clean.csv\")\n",
    "# Delete gender distinction\n",
    "pop17 = pop17.drop(pop17[(pop17.Sex == \"Males\") |\n",
    "                   (pop17.Sex == \"Females\")].index)\n",
    "# Fixing the \"Age\" column (changing it to int values) to allow for a better condition for the creation of young_pop17\n",
    "young_pop17 = selectAge(pop17, 18, 24)\n",
    "young_pop17 = (pop17.groupby([\"Region code\", \"Region\"])[\n",
    "               \"Population\"].sum()).reset_index()\n",
    "# Adding columns for easier identification\n",
    "young_pop17[\"Time\"] = 2017\n",
    "young_pop17[\"Age range\"] = \"18-24\"\n",
    "\n",
    "# --- 2018\n",
    "pop18 = read_csv(\"data/cleanDS/Population2018General_clean.csv\")\n",
    "pop18.drop([\"Males\", \"Females\"], axis=1, inplace=True)\n",
    "young_pop18 = young_generalPop(pop18, 18, 24)\n",
    "young_pop18[\"Time\"] = 2018\n",
    "young_pop18[\"Age range\"] = \"18-24\"\n",
    "\n",
    "# --- 2019\n",
    "pop19 = read_csv(\"data/cleanDS/Population2019General_clean.csv\")\n",
    "pop19.drop([\"Males\", \"Females\"], axis=1, inplace=True)\n",
    "young_pop19 = young_generalPop(pop19, 18, 24)\n",
    "young_pop19[\"Time\"] = 2019\n",
    "young_pop19[\"Age range\"] = \"18-24\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 2: Total female population 18-24 for each region\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === B) TOTAL 18-24 F\n",
    "# Same as before, just DO NOT delete the gender distinction\n",
    "# --- 2017\n",
    "f_pop17 = read_csv(\"data/cleanDS/Population2017General_clean.csv\")\n",
    "f_pop17 = f_pop17.drop(f_pop17[(f_pop17.Sex == \"Males\") | (f_pop17.Sex == \"Total\")].index)\n",
    "# Fixing the \"Age\" column (changing it to int values) to allow for a better condition for the creation of youngF_pop17\n",
    "youngF_pop17 = selectAge(f_pop17, 18, 24)\n",
    "youngF_pop17 = (f_pop17.groupby([\"Region code\", \"Region\"])[\"Population\"].sum()).reset_index()\n",
    "# Adding columns for easier identification\n",
    "youngF_pop17[\"Time\"] = 2017\n",
    "youngF_pop17[\"Age range\"] = \"18-24\"\n",
    "youngF_pop17[\"Gender\"] = \"Females\"\n",
    "\n",
    "# --- 2018\n",
    "f_pop18 = read_csv(\"data/cleanDS/Population2018General_clean.csv\")\n",
    "f_pop18.drop([\"Males\", \"Population\"], axis=1, inplace=True)\n",
    "youngF_pop18 = young_femalePop(f_pop18, 18, 24)\n",
    "youngF_pop18[\"Time\"] = 2018\n",
    "youngF_pop18[\"Age range\"] = \"18-24\"\n",
    "\n",
    "# --- 2019\n",
    "f_pop19 = read_csv(\"data/cleanDS/Population2019General_clean.csv\")\n",
    "youngF_pop19 = young_femalePop(f_pop19, 18, 24)\n",
    "youngF_pop19[\"Time\"] = 2019\n",
    "youngF_pop19[\"Age range\"] = \"18-24\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 3: Percentage of early leavers for each region"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d7_clean = read_csv(\"../data/cleanDS/D7_clean.csv\")\n",
    "d7_2017 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2018) | (d7_clean.TIME == 2019)].index)\n",
    "d7_2018 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2017) | (d7_clean.TIME == 2019)].index)\n",
    "d7_2019 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2017) | (d7_clean.TIME == 2018)].index)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 4: Percentage of female early leavers for each region\n",
    "\n",
    "(this last step has been commented out after a first run in order to avoid interfeering with the rest of the code/project)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === C) % EARLY LEAVERS F (PROPORTION)\n",
    "# --- 2017\n",
    "MD3_2017 = (merge(young_pop17,d7_2017,left_on=\"Region code\",right_on=\"ITTER107\")).drop([\"TIME\",\"Territory\"],axis=1)\n",
    "MD3_2017 = (merge(youngF_pop17,MD3_2017,left_on=\"Region code\",right_on=\"ITTER107\",suffixes=('_GENERAL','_FEMALE'))).drop([\"Time_FEMALE\",\"Age range_FEMALE\",\"ITTER107\",\"Region code_FEMALE\",\"Region_FEMALE\"],axis=1)\n",
    "for idx,row in MD3_2017.iterrows():\n",
    "    result = (row[\"Value\"] * row[\"Population_FEMALE\"])/(row[\"Population_GENERAL\"])\n",
    "    MD3_2017.loc[idx,\"Female Early Leavers\"] = result\n",
    "#MD3_2017.to_csv(\"data/mashupDS/MD3_17.csv\")\n",
    "\n",
    "# row[\"Value\"] : row[\"Population_GENERAL\"] = % early leavers F : row[\"Population_FEMALE\"]\n",
    "\n",
    "# --- 2018\n",
    "d7_2018 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2017) | (d7_clean.TIME == 2019)].index)\n",
    "\n",
    "MD3_2018 = (merge(young_pop18,d7_2018,left_on=\"ITTER107\",right_on=\"ITTER107\")).drop([\"TIME\",\"Territory\"],axis=1)\n",
    "MD3_2018 = (merge(youngF_pop18,MD3_2018,left_on=\"ITTER107\",right_on=\"ITTER107\",suffixes=('_GENERAL','_FEMALE'))).drop([\"Time_FEMALE\",\"Age range_FEMALE\",\"Region_FEMALE\"],axis=1)\n",
    "\n",
    "for idx,row in MD3_2018.iterrows():\n",
    "    result = (row[\"Value\"] * row[\"Females\"])/(row[\"Population\"])\n",
    "    MD3_2018.loc[idx,\"Female Early Leavers\"] = result\n",
    "#MD3_2018.to_csv(\"data/mashupDS/MD3_18.csv\")\n",
    "\n",
    "# --- 2019\n",
    "d7_2019 = d7_clean.drop(d7_clean[(d7_clean.TIME == 2017) | (d7_clean.TIME == 2018)].index)\n",
    "MD3_2019 = (merge(young_pop19,d7_2019,left_on=\"ITTER107\",right_on=\"ITTER107\")).drop([\"TIME\",\"Territory\"],axis=1)\n",
    "MD3_2019 = (merge(youngF_pop19,MD3_2019,left_on=\"ITTER107\",right_on=\"ITTER107\",suffixes=('_GENERAL','_FEMALE'))).drop([\"Time_FEMALE\",\"Age range_FEMALE\",\"Region_FEMALE\"],axis=1)\n",
    "for idx,row in MD3_2019.iterrows():\n",
    "    result = (row[\"Value\"] * row[\"Females\"])/(row[\"Population\"])\n",
    "    MD3_2019.loc[idx,\"Female Early Leavers\"] = result\n",
    "#MD3_2019.to_csv(\"data/mashupDS/MD3_19.csv\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "9e178ac2d01006df9e74389e28c68e3ef7cb95acb5e5b42727cfb2189b57440d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}