One of the most common formats used to share open data is CSV (Comma Separated Values).
It is a simple text file with a data point in each row, where each row contains (as expected) comma separated values. The first row of the file is usually reserved for the variable names. A dummy CSV with all my friends’ names, phone numbers and addresses could be:
"Name", "Phone number", "Address" "Piero", 324325987, "Rome, Italy" "Dario", 345934859, "Rome, Italy" "Rick", 345934859, "Berlin, Germany" ...
It is good practice to wrap textual values in quotes (") because they may contain the character used as a separator. An address containing commas, like “Rome, Italy”, would mess up the separation between the different variables if not wrapped in quotes.
How to open a CSV file
The advantages of the CSV format are simplicity and portability. Any text editor can open a CSV file, but if you need to explore, analyze and plot the data you need proper tooling. Let’s see some of the tools available on desktop computers.
One possibility is to use a spreadsheet program, like Microsoft Excel, Apple Numbers or LibreOffice Calc. I suggest Calc because it’s open source. All these programs let you do basic data manipulation. If you drag a CSV file into Calc you will see a window like this:
Here you help the import process by indicating the CSV separator, which by definition should be the comma (,) but in many cases is the semicolon (;). You could also find tabs (\t) or pipes (|) used as separators. The other options can usually be left untouched.
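If you are not sure which separator a file uses, you can also ask Python to guess. The sketch below uses csv.Sniffer from the standard library on a small made-up sample (the sample data is an assumption of mine, not a real file), restricting the guess to the four separators mentioned above. The guess is a heuristic and can fail on unusual files, so it is worth checking the result by eye.

```python
import csv
import io

# Made-up sample that uses the semicolon as a separator.
sample = "name;city;age\nPiero;Rome;30\nDario;Rome;28\nRick;Berlin;35\n"

# Ask the sniffer to choose among comma, semicolon, tab and pipe.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print(repr(dialect.delimiter))

# Reuse the detected dialect to parse the data.
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows[1])
```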
If you are comfortable with the terminal on Linux or the Mac you can cd into the directory containing the CSV and run the command head -n lines file to preview the first n lines of the file.
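If you don't have the head command at hand (on Windows, for example), a rough equivalent takes a few lines of Python. The function name preview is hypothetical and I am assuming the file is UTF-8 encoded; adjust the encoding if your file says otherwise.

```python
from itertools import islice


def preview(path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so large files are not read in full.
        return [line.rstrip("\n") for line in islice(f, n)]
```

Calling preview("friends.csv", 3) would return the first three lines of a file called friends.csv, where the file name is just a placeholder.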
Data science requires the ability to manipulate the data in many different ways, so spreadsheets are often not enough. A simple task like plotting the distribution of a variable can get irritating without a scripting library. There are many choices; my favourite is pandas, a library for the Python programming language. To open a CSV, import the library and use the read_csv function, specifying the file name and the separator.
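As a minimal sketch of that last step, the snippet below loads a semicolon-separated version of the dummy data with read_csv. To keep it self-contained I read from an in-memory string rather than a real file; in practice you would pass a file name such as "friends.csv" (a placeholder of mine) as the first argument.

```python
import io

import pandas as pd

# Semicolon-separated variant of the dummy data, held in memory for the example.
raw = (
    "Name;Phone number;Address\n"
    "Piero;324325987;Rome, Italy\n"
    "Dario;345934859;Rome, Italy\n"
    "Rick;345934859;Berlin, Germany\n"
)

# sep tells pandas which character separates the values.
df = pd.read_csv(io.StringIO(raw), sep=";")

# First rows of the table, similar in spirit to the head command.
print(df.head())

# How many friends live at each address.
counts = df["Address"].value_counts()
print(counts)
```

Because the separator here is the semicolon, the commas inside the addresses cause no trouble even without quotes.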
I hope this post gave you an idea of what a CSV file is and how you can start looking into one.
Don’t let the numbers scare you. They can help you understand reality and support your decisions.