Explore a CSV dataset with pandas

This tutorial is focused on choosing, cleaning, analysing and plotting an open dataset with python pandas. We discuss the insights extracted from the dataset and show the importance of taking contextual information into account.

Choose a dataset

Let’s download an open dataset to familiarize with the CSV format and enjoy data exploration. An interesting one contains the number of hospitals’ beds in Italy and can be freely downloaded from dati.gov.it. It is a good dataset for two reasons:

  1. it contains down-to-earth health care information. Data science gives the most value when applied to data from the real world.
  2. it’s messy, full of missing values and not documented enough. The first thing to accept for a data professional is that most data is scattered and cluttered. Making it compact and clean is (the most important) part of the job.

As we saw in a precedent post, a CSV is just a table (grid) of data, one record per row, one variable per column. Pandas is a python library containing many useful functionalities to work with tabular data. In pandas’ terminology our table is called Dataframe and each column is called Series.

Load into pandas and clean

We first upload the data in memory with the command read_csv, specifying the file name and the separator. With the head(n) function we can inspect the first n rows of the file and with the dtypes attribute we can check if pandas has deduced correctly the data type for each column.

Anno                            int64
Codice Regione                  int64
Descrizione Regione            object
Codice Azienda                  int64
Tipo Azienda                    int64
Codice struttura                int64
Denominazione struttura        object
Indirizzo                      object
Codice Comune                   int64
Comune                         object
Sigla provincia                object
Codice tipo struttura         float64
Descrizione tipo struttura     object
Tipo di Disciplina             object
Posti letto Ordinari           object
Posti letto Day Hospital       object
Posti letto Day Surgery        object
Totale posti letto             object

The reason some of the numeric columns were considered text (object) is the precence of values like “N.D.” in columns containing numbers. A fast solution to avoid correcting the columns’ dtype one by one is to force numeric conversion with the function convert_objects.

Anno                            int64
Codice Regione                  int64
Descrizione Regione            object
Codice Azienda                  int64
Tipo Azienda                    int64
Codice struttura                int64
Denominazione struttura        object
Indirizzo                      object
Codice Comune                   int64
Comune                         object
Sigla provincia                object
Codice tipo struttura         float64
Descrizione tipo struttura     object
Tipo di Disciplina             object
Posti letto Ordinari          float64
Posti letto Day Hospital      float64
Posti letto Day Surgery       float64
Totale posti letto            float64

Analysis 

The dataset contains information on the last few years, a good resource because it would allow us to see how the numbers evolved year by year. For now, let’s restrict our efforts to the year 2014. We consider for now only the most important variable in the dataset, “Totale posti letto” (the total number of beds available in the hospital). We first use the describe function to obtain some descriptive statistics, then we plot a histogram to see how the beds are distributed.
The results of the code above is:

count    1926.000000
mean      114.462617
std       187.686820
min         1.000000
25%        20.000000
50%        48.000000
75%       119.750000
max      2169.000000
Name: Totale posti letto, dtype: float64

distribution_beds We are just looking at one Series (a column in the CSV) and we already obtained some insights. The average number of beds is ~114, so we may expect that each hospital contains around that number of beds. Actually, by looking at the distribution we see that most hospitals contain less than 100 beds. The high bar on the left means “there are more than 800 hospitals with just a few tens of beds”, while the small bars on the right mean “there are a few hospitals with thousands of beds”. The hospital system contains a few hubs (big hospitals) and many small nodes. This was expected because we know big cities tend to contain big health care centers. The lesson to learn here is: look at a variable’s distribution because average and variance alone are misleading. Let’s see which are the hospitals offering the most beds.

                       Denominazione struttura  Totale posti letto
8131  AO CITTA' DELLA SALUTE E DELLA SCIENZA D                2169
9205           POLICLINICO A. GEMELLI E C.I.C.                1500
8879  AZIENDA OSPEDALIERO-UNIVERSITARIA DI BOL                1494
8659     AZ.OSP.UNIVERSITARIA INTEGRATA VERONA                1413
8657             AZIENDA OSPEDALIERA DI PADOVA                1399
8985  AZ. OSPEDALIERO - UNIVERSITARIA CAREGGI                 1287
9207                          POLICLINICO U. I                1273
8354  PRES.OSPEDAL.SPEDALI CIVILI BRESCIA                     1236
8721                 IRCCS AOU S.MARTINO - IST                1202
9530           AO UNIV. CONS. POLICLINICO BARI                1191
8978  AZIENDA OSPEDALIERO-UNIVERSITARIA PISANA                1181
8581                       OSPEDALE DI TREVISO                1169
8786  PRESIDIO OSPEDALIERO UNICO - AZIENDA DI                 1145
8693         AZIENDA OSPEDALIERO-UNIVERSITARIA                1063
8871  AZIENDA OSPEDALIERO-UNIVERSITARIA DI PAR                1031
8409  OSPEDALE CA' GRANDA-NIGUARDA - MILANO                   1014
8301  FONDAZ.IRCCS CA' GRANDA - OSPEDALE MAGGI                 976
8384  OSPEDALE PAPA GIOVANNI XXIII - BG                        973
8768  PRESIDIO OSPEDALIERO PROVINCIALE NUOVO S                 965
9197              AZ.OSP.SAN CAMILLO-FORLANINI                 950

To go deeper we could plot the total number of beds available in each region. We group beds count by region using groupBy, then sum the beds in each region.
biggest_hospitals It may seem that the region of Lombardia has an unfair amount of hospital beds. This is a typical trap for the beginner data scientist, comparing data without considering context. Would the story told by the above plot change, knowing that Lombardia is the most populated region in Italy? To take a better stance we should devide each bar in the plot by the number of citizens in that region. The real question to ask is: How many beds there are per citizen in each region? Another good question to ask from a geographical point of view is: which places are too far from any hospital? Both questions will be addressed in future tutorials by integrating this dataset with external data.

Conclusions

It is possible to extract valuable insights from an open dataset. The first difficult part regards cleaning the data, then it’s relatively easy to make statistics and plots. Always try to consider variables distribution and context, to avoid a superficial interpretation. The next difficult part comes with integrating the data with other datasets, a practice that leads to deeper insights. Some excersises on this dataset:

  • EASY: plot the distribution of the variable “Posti letto Day Hospital”
  • MEDIUM: group “Posti letto Day Hospital” by “Comune”
  • HARD: find the biggest hospitals for each of the values of the variable “Tipo di Disciplina”

3 comments

Comments are closed.