This tutorial is focused on choosing, cleaning, analysing and plotting an open dataset with python pandas. We discuss the insights extracted from the dataset and show the importance of taking contextual information into account.
Choose a dataset
Let’s download an open dataset to familiarize with the CSV format and enjoy data exploration. An interesting one contains the number of hospitals’ beds in Italy and can be freely downloaded from dati.gov.it. It is a good dataset for two reasons:
- it contains down-to-earth health care information. Data science gives the most value when applied to data from the real world.
- it’s messy, full of missing values and not documented enough. The first thing to accept for a data professional is that most data is scattered and cluttered. Making it compact and clean is (the most important) part of the job.
As we saw in a precedent post, a CSV is just a table (grid) of data, one record per row, one variable per column. Pandas is a python library containing many useful functionalities to work with tabular data. In pandas’ terminology our table is called Dataframe and each column is called Series.
Load into pandas and clean
We first upload the data in memory with the command
read_csv, specifying the file name and the separator. With the
head(n) function we can inspect the first n rows of the file and with the
dtypes attribute we can check if pandas has deduced correctly the data type for each column.
Anno int64 Codice Regione int64 Descrizione Regione object Codice Azienda int64 Tipo Azienda int64 Codice struttura int64 Denominazione struttura object Indirizzo object Codice Comune int64 Comune object Sigla provincia object Codice tipo struttura float64 Descrizione tipo struttura object Tipo di Disciplina object Posti letto Ordinari object Posti letto Day Hospital object Posti letto Day Surgery object Totale posti letto object
The reason some of the numeric columns were considered text (
object) is the precence of values like “N.D.” in columns containing numbers. A fast solution to avoid correcting the columns’ dtype one by one is to force numeric conversion with the function
Anno int64 Codice Regione int64 Descrizione Regione object Codice Azienda int64 Tipo Azienda int64 Codice struttura int64 Denominazione struttura object Indirizzo object Codice Comune int64 Comune object Sigla provincia object Codice tipo struttura float64 Descrizione tipo struttura object Tipo di Disciplina object Posti letto Ordinari float64 Posti letto Day Hospital float64 Posti letto Day Surgery float64 Totale posti letto float64
The dataset contains information on the last few years, a good resource because it would allow us to see how the numbers evolved year by year. For now, let’s restrict our efforts to the year 2014. We consider for now only the most important variable in the dataset, “Totale posti letto” (the total number of beds available in the hospital). We first use the
describe function to obtain some descriptive statistics, then we plot a histogram to see how the beds are distributed.
The results of the code above is:
count 1926.000000 mean 114.462617 std 187.686820 min 1.000000 25% 20.000000 50% 48.000000 75% 119.750000 max 2169.000000 Name: Totale posti letto, dtype: float64
We are just looking at one Series (a column in the CSV) and we already obtained some insights. The average number of beds is ~114, so we may expect that each hospital contains around that number of beds. Actually, by looking at the distribution we see that most hospitals contain less than 100 beds. The high bar on the left means “there are more than 800 hospitals with just a few tens of beds”, while the small bars on the right mean “there are a few hospitals with thousands of beds”. The hospital system contains a few hubs (big hospitals) and many small nodes. This was expected because we know big cities tend to contain big health care centers. The lesson to learn here is: look at a variable’s distribution because average and variance alone are misleading. Let’s see which are the hospitals offering the most beds.
Denominazione struttura Totale posti letto 8131 AO CITTA' DELLA SALUTE E DELLA SCIENZA D 2169 9205 POLICLINICO A. GEMELLI E C.I.C. 1500 8879 AZIENDA OSPEDALIERO-UNIVERSITARIA DI BOL 1494 8659 AZ.OSP.UNIVERSITARIA INTEGRATA VERONA 1413 8657 AZIENDA OSPEDALIERA DI PADOVA 1399 8985 AZ. OSPEDALIERO - UNIVERSITARIA CAREGGI 1287 9207 POLICLINICO U. I 1273 8354 PRES.OSPEDAL.SPEDALI CIVILI BRESCIA 1236 8721 IRCCS AOU S.MARTINO - IST 1202 9530 AO UNIV. CONS. POLICLINICO BARI 1191 8978 AZIENDA OSPEDALIERO-UNIVERSITARIA PISANA 1181 8581 OSPEDALE DI TREVISO 1169 8786 PRESIDIO OSPEDALIERO UNICO - AZIENDA DI 1145 8693 AZIENDA OSPEDALIERO-UNIVERSITARIA 1063 8871 AZIENDA OSPEDALIERO-UNIVERSITARIA DI PAR 1031 8409 OSPEDALE CA' GRANDA-NIGUARDA - MILANO 1014 8301 FONDAZ.IRCCS CA' GRANDA - OSPEDALE MAGGI 976 8384 OSPEDALE PAPA GIOVANNI XXIII - BG 973 8768 PRESIDIO OSPEDALIERO PROVINCIALE NUOVO S 965 9197 AZ.OSP.SAN CAMILLO-FORLANINI 950
To go deeper we could plot the total number of beds available in each region. We group beds count by region using
sum the beds in each region.
It may seem that the region of Lombardia has an unfair amount of hospital beds. This is a typical trap for the beginner data scientist, comparing data without considering context. Would the story told by the above plot change, knowing that Lombardia is the most populated region in Italy? To take a better stance we should devide each bar in the plot by the number of citizens in that region. The real question to ask is: How many beds there are per citizen in each region? Another good question to ask from a geographical point of view is: which places are too far from any hospital? Both questions will be addressed in future tutorials by integrating this dataset with external data.
It is possible to extract valuable insights from an open dataset. The first difficult part regards cleaning the data, then it’s relatively easy to make statistics and plots. Always try to consider variables distribution and context, to avoid a superficial interpretation. The next difficult part comes with integrating the data with other datasets, a practice that leads to deeper insights. Some excersises on this dataset:
- EASY: plot the distribution of the variable “Posti letto Day Hospital”
- MEDIUM: group “Posti letto Day Hospital” by “Comune”
- HARD: find the biggest hospitals for each of the values of the variable “Tipo di Disciplina”