Combining two CSV

We made some experience on how CSV datasets are made and how to explore their content. Let’s move to the more advanced practice of combining two different datasets. The python pandas code is at the end of the article.

Introduction

The reasons we may need to combine datasets vary from case to case, in general it’s useful for a cross-analysis that would be otherwise impossible. We attempted a geographical interpretation of health-related data and we need demographic data to complement the analysis. One of the results so far is a plot of the number of hospital beds present in each italian region:

biggest_hospitals

We had the impression of a strong imbalance among regions, so we decided to investigate deeper by taking into account how many people actually live in each region. To compute a beds per citizen metric we are forced to retrieve the number of regional citizens from another dataset, because the first does not contain this information.

The second dataset

The number of citizens present in each municipality is available in this dataset from dati.gov.it. The process to parse it is similar to the one we already saw, with the difference that this time we must specify in the read_csv command what is the character used as a thousands separator, otherwise pandas will interpret numbers like “10.345” as 10 with some decimals instead of 10345. It’s an awful idea to use a thousands separator for machine readable data, if you ask me, but we know that data cleaning is a practice we cannot avoid.
After loading data, we run groupby to group the number of citizens in each municipality by region and sum to obtain the number of citizens per region.

Dataset integration

The first step to combine this new dataset to the one we already analyzed is to find a variable present in both (a key) that allows us to merge (join) the datasets as if they were one. Pandas gives us an easy join command to merge two different dataframes, at the condition that they share a namesake column as key (in our case  Codice Regione). There are many kinds of joins and a plethora of options so it’s worth reading the relative documentation. The operation can be visualized as follows:

JOIN

Analysis

Now that we have all data in place, we add a new column Letti per Cittadino (beds per citizen) by dividing the number of beds in a region for the number of residents in that region. Let’s print some descriptive statistics and plot the new variable.

count 21.000000
mean 0.003663
std 0.000365
min 0.002924
25% 0.003359
50% 0.003716
75% 0.004008
max 0.004245
Name: Letti per Cittadino, dtype: float64

beds_per_citizen

As expected, correcting the numbers with demographic information reveals another situation. The region of Lombardia is not even more in the first place, whereas in the previous plot seemed to be unfairly advantaged. The differences between regions are less severe, with the last (Calabria: 0.002924) measuring ~68% of the first (Emilia: 0.004245).

Molise seems to be the only southern region in the top part of the chart, while the bottom is populated by southern regions only (Sicilia, Puglia, Campania, Clabria). An interesting prosecution of our analysis would be a map visualization of these data, in which each region is colored in proportion with the numbers. The technical name of such a plot is choropleth and it’s rather difficult to realize in python, which is an amazing language for data exploration and analysis, but not for interactive visualization. In a future post we will learn how to make a choropleth for the browser, using more user-interface-oriented Javascript libraries like D3 or Leaflet.

Conclusion

This example confirmed once again the importance of giving context to data, to calibrate our interpretations and ultimately make wiser decisions. Open data are a powerful tool, but we need to learn a set of skills (data literacy) to handle messy data, merge different datasets and analyze, visualize, interpret them.
In the last decade the need for a more integrated data ecosystem led to an interesting global debate between scientists, politicians, entepreneurs and activists. Powerful data-oriented technologies, like graph databases and linked data, are on their way to make our life better. More on this in the next posts.

Code


One comment

Comments are closed.