We have gained some experience with how CSV datasets are made and how to explore their content. Let's move on to the more advanced practice of combining two different datasets. The full Python pandas code is at the end of the article.
The reasons for combining datasets vary from case to case; in general, it enables a cross-analysis that would otherwise be impossible. We attempted a geographical interpretation of health-related data, and we need demographic data to complement the analysis. One of the results so far is a plot of the number of hospital beds in each Italian region:
We had the impression of a strong imbalance among regions, so we decided to dig deeper by taking into account how many people actually live in each region. To compute a beds-per-citizen metric we have to retrieve the number of citizens per region from another dataset, because the first one does not contain this information.
The second dataset
The number of citizens in each municipality is available in this dataset from dati.gov.it. The process to parse it is similar to the one we have already seen, with the difference that this time we must tell the
read_csv command which character is used as the thousands separator, otherwise pandas will interpret numbers like “10.345” as 10 with some decimals instead of 10345. It's an awful idea to use a thousands separator in machine-readable data, if you ask me, but we know that data cleaning is a practice we cannot avoid.
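A minimal sketch of the parsing step, using a few inline sample rows; the column names and values here are illustrative, not the actual headers of the dati.gov.it file:

```python
import pandas as pd
from io import StringIO

# Sample rows mimicking the dataset's structure; column names are
# hypothetical, chosen only to illustrate the thousands-separator issue.
raw = StringIO(
    "Codice Regione;Comune;Popolazione\n"
    "03;Milano;1.345.851\n"
    "03;Bergamo;119.381\n"
)

# thousands="." tells pandas that the dot separates thousands,
# so "1.345.851" is parsed as the integer 1345851, not as a float.
df = pd.read_csv(raw, sep=";", thousands=".")
print(df["Popolazione"].tolist())
```

Without `thousands="."`, pandas would treat the column as strings (or misread the numbers), and any arithmetic on it would fail.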
After loading the data, we run
groupby to group the number of citizens in each municipality by region, then
sum to obtain the number of citizens per region.
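The aggregation step can be sketched as follows; the region codes and populations are made up for the example:

```python
import pandas as pd

# Hypothetical per-municipality rows: two municipalities in region "03",
# one in region "05".
municipalities = pd.DataFrame({
    "Codice Regione": ["03", "03", "05"],
    "Popolazione": [1345851, 119381, 260000],
})

# Group the rows by region and sum the populations within each group.
residents = municipalities.groupby("Codice Regione")["Popolazione"].sum()
print(residents)
```

The result is a Series indexed by region code, ready to be joined with the hospital-beds dataset.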
The first step to combine this new dataset with the one we already analyzed is to find a variable present in both (a key) that allows us to merge (join) the datasets as if they were one. Pandas gives us an easy
join command to merge two different dataframes, provided they share a column with the same name to use as the key (in our case
Codice Regione). There are many kinds of joins and a plethora of options, so it's worth reading the relevant documentation. The operation can be visualized as follows:
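In code, the join might look like this; both dataframes are hypothetical stand-ins for the two datasets, indexed by the shared key Codice Regione (the column names other than the key are made up):

```python
import pandas as pd

# Hospital beds per region (stand-in for the first dataset).
beds = pd.DataFrame(
    {"Posti Letto": [21000, 4000]},
    index=pd.Index(["03", "05"], name="Codice Regione"),
)

# Residents per region (stand-in for the aggregated second dataset).
residents = pd.DataFrame(
    {"Popolazione": [10060574, 1524826]},
    index=pd.Index(["03", "05"], name="Codice Regione"),
)

# join matches rows on the shared index; by default it performs
# a left join, keeping every row of `beds`.
merged = beds.join(residents)
print(merged)
```

Rows with the same Codice Regione end up side by side in a single dataframe.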
Now that we have all the data in place, we add a new column,
Letti per Cittadino (beds per citizen), by dividing the number of beds in a region by the number of residents in that region. Let's print some descriptive statistics and plot the new variable.
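The ratio is a simple element-wise division between two columns; the merged frame below is a hypothetical stand-in (only the key and the new column name follow the article):

```python
import pandas as pd

# Hypothetical merged frame: the numbers and the "Posti Letto" /
# "Popolazione" column names are made up for the example.
merged = pd.DataFrame(
    {
        "Posti Letto": [21000, 4000],
        "Popolazione": [10060574, 1524826],
    },
    index=pd.Index(["03", "05"], name="Codice Regione"),
)

# Beds per citizen: element-wise division of the two columns.
merged["Letti per Cittadino"] = merged["Posti Letto"] / merged["Popolazione"]

print(merged["Letti per Cittadino"].describe())
# merged["Letti per Cittadino"].plot.bar()  # bar chart, one bar per region
```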
count    21.000000
mean      0.003663
std       0.000365
min       0.002924
25%       0.003359
50%       0.003716
75%       0.004008
max       0.004245
Name: Letti per Cittadino, dtype: float64
As expected, correcting the numbers with demographic information reveals a different picture. Lombardia is no longer in first place, whereas in the previous plot it seemed unfairly advantaged. The differences between regions are less severe, with the last (Calabria: 0.002924) measuring ~69% of the first (Emilia: 0.004245).
This example confirms once again the importance of giving context to data, so we can calibrate our interpretations and ultimately make wiser decisions. Open data is a powerful tool, but we need to learn a set of skills (data literacy) to handle messy data, merge different datasets, and analyze, visualize, and interpret them.
In the last decade, the need for a more integrated data ecosystem has led to an interesting global debate among scientists, politicians, entrepreneurs, and activists. Powerful data-oriented technologies, like graph databases and linked data, are on their way to making our lives better. More on this in the next posts.