DataScienceInPractice

Chapter 3: Read.csv

Read data.csv

Use the read.csv function to read a CSV file into a data frame.

See the example below:

data <- read.csv('dat/data.csv', sep=",", header = T, stringsAsFactors=FALSE)

If your file uses a different separator, change the sep parameter. For example, use sep=";" for a semicolon-separated file.
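As a minimal sketch of changing the separator, the snippet below writes a small semicolon-separated file to a temporary location and reads it back. The file contents and the names tmp and demo are made up for illustration and are not part of the original example.

```r
# Hypothetical sample data, written to a temporary file for illustration
tmp <- tempfile(fileext = ".csv")
writeLines(c("name;score", "ana;7.5", "luis;9.0"), tmp)

# sep = ";" tells read.csv that fields are separated by semicolons
demo <- read.csv(tmp, sep = ";", header = TRUE, stringsAsFactors = FALSE)
print(demo)
```

Reading the same file with sep="," would return a single column, because no line contains a comma.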

Diagnosing problems with your data frame

After you read the csv file, check the dimensions of your data frame and the class of each column. The dim function gives the dimensions (rows, then columns), the class function gives the type of a column, and the str function gives a summary of the structure of the object.

dim(data)
lapply(data, class)
str(data)
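The three checks above can be tried on a small data frame without reading any file. This is only a sketch: the data frame demo and its values are invented, and sapply is used instead of lapply so that the classes come back as one named vector.

```r
# Hypothetical data frame standing in for the result of read.csv
demo <- data.frame(id = 1:3,
                   city = c("Madrid", "Sevilla", "Bilbao"),
                   amount = c(90, 200, 100.5),
                   stringsAsFactors = FALSE)

dims <- dim(demo)               # number of rows, then number of columns
classes <- sapply(demo, class)  # named character vector, one class per column
print(dims)
print(classes)
str(demo)                       # compact summary: type and first values of each column
```

A column that should be numeric but shows up as character is usually a sign of a wrong separator, a wrong decimal mark, or stray text in the data.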

Real example: Reading a csv file from the Ayuntamiento de Madrid.

The file is available at this link.

First try to read it:

multas <- read.csv('http://datos.madrid.es/datosabiertos/MULTAS/2015/04/201504_detalle.csv', sep=";", header = T, stringsAsFactors=FALSE)

Note that this reads the file directly from the website rather than from a file in a local folder.


> multas <- read.csv('http://datos.madrid.es/datosabiertos/MULTAS/2015/04/201504_detalle.csv', sep=";", header = T, stringsAsFactors=FALSE)
> dim(multas)
[1] 186619     12
> lapply(multas, class)
$CALIFICACION
[1] "character"

$LUGAR
[1] "character"

$MES
[1] "integer"

$ANIO
[1] "integer"

$HORA
[1] "numeric"

$IMP_BOL
[1] "numeric"

$DESCUENTO
[1] "character"

$PUNTOS
[1] "integer"

$DENUNCIANTE
[1] "character"

$HECHO.BOL
[1] "character"

$VEL_LIMITE
[1] "integer"

$VEL_CIRCULA
[1] "integer"

> str(multas)
'data.frame':    186619 obs. of  12 variables:
 $ CALIFICACION: chr  "GRAVE     " "LEVE      " "GRAVE     " "GRAVE     " ...
 $ LUGAR       : chr  "M 30 KM 29 CALZADA 1                    " "AV GLORIETAS-RDA SUR                    " "PO ALABARDEROS 24                       " "KM 12, M-30 CALZADA 1                   " ...
 $ MES         : int  4 4 4 4 4 4 4 4 4 4 ...
 $ ANIO        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ HORA        : num  9.45 13.35 13.5 16.35 12.05 ...
 $ IMP_BOL     : num  200 100 200 100 90 200 200 90 200 90 ...
 $ DESCUENTO   : chr  "SI" "SI" "SI" "SI" ...
 $ PUNTOS      : int  0 0 0 0 0 4 0 0 3 0 ...
 $ DENUNCIANTE : chr  "POLICIA MUNICIPAL   " "POLICIA MUNICIPAL   " "POLICIA MUNICIPAL   " "POLICIA MUNICIPAL   " ...
 $ HECHO.BOL   : chr  "CONDUCCION NEGLIGENTE: CIRCULAR POR ENCIMA DE VELOCIDAD REBASANDO VHOS                                                       " "CIRCULAR POR ZONA RESERVADA AL USO EXCLUSIVO DE PEATONES.                                                                    " "ESTACIONAR EN ZONA SE\xd1ALIZADA PARA USO EXCLUSIVO DE PERSONAS CON MOVILIDAD REDUCIDA.                                        "| __truncated__ "SOBREPASAR LA VELOCIDADM\xc1XIMA EN V\xcdAS LIMITADAS EN 60 km/h O M\xc1S.                                                     "| __truncated__ ...
 $ VEL_LIMITE  : int  NA NA NA 80 NA NA NA NA NA NA ...
 $ VEL_CIRCULA : int  NA NA NA 84 NA NA NA NA NA NA ...

Errors

Sometimes read.csv fails or returns an unexpected result, and you have to adjust the call.

Error: duplicate 'row.names' are not allowed

When you try to read a csv file and get the error below, you can add row.names=NULL to the read.csv call.

The error:

multas <- read.csv('~/git/github.com/it4dgroup/datosabiertosespana/dat/ayuntamiento-madrid/multas/201504_detalle.csv', sep=",", header = T, stringsAsFactors=FALSE)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  duplicate 'row.names' are not allowed

The solution:

multas <- read.csv('~/git/github.com/it4dgroup/datosabiertosespana/dat/ayuntamiento-madrid/multas/201504_detalle.csv', sep=",", header = T, stringsAsFactors=FALSE, row.names=NULL)

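In the case above, however, the real problem was the separator. The file is separated by semicolons, and some fields contain commas, so with sep="," R splits the lines in the wrong places and ends up using the first piece of each line as row names. Adding row.names=NULL only hides the symptom. Change sep="," to sep=";" and the file reads correctly.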
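The following sketch reproduces the error on a tiny file and shows one way to check the separator before reading. The file contents are invented to mimic the shape of the Madrid file (semicolon-separated, with commas inside a text field). Using readLines and count.fields for the diagnosis is a suggestion here, not something the original example does.

```r
# Hypothetical file: semicolon-separated, with commas inside the first field
tmp <- tempfile(fileext = ".csv")
writeLines(c("LUGAR;MES;PUNTOS",
             "KM 12, M-30;4;0",
             "KM 12, CALLE A;4;3"), tmp)

# Reading with the wrong separator raises the error; capture its message
msg <- tryCatch({
  read.csv(tmp, sep = ",", header = TRUE, stringsAsFactors = FALSE)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# row.names = NULL silences the error, but the lines are still split on commas
masked <- read.csv(tmp, sep = ",", header = TRUE, stringsAsFactors = FALSE,
                   row.names = NULL)
print(ncol(masked))

# Look at the raw lines and count the fields per line for each candidate separator
print(readLines(tmp, n = 2))
fields_comma <- count.fields(tmp, sep = ",")
fields_semicolon <- count.fields(tmp, sep = ";")
print(fields_comma)
print(fields_semicolon)

# With the right separator every line has the same number of fields
fixed <- read.csv(tmp, sep = ";", header = TRUE, stringsAsFactors = FALSE)
print(dim(fixed))
```

A separator that yields the same field count on every line, header included, is usually the right one.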

Skip row

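To skip lines at the top of the file, add the skip parameter with the number of lines to skip, for example skip=1. The lines are skipped before the header is read, so with header = T the first line after the skipped ones is used for the column names.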

multas <- read.csv('~/git/github.com/it4dgroup/datosabiertosespana/dat/ayuntamiento-madrid/multas/201504_detalle.csv', sep=";", header = T, stringsAsFactors=FALSE, row.names=NULL, skip=1)
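As a small sketch of when skip is useful, the file below has one line of free text before the real header. The file contents and the names tmp and demo are hypothetical.

```r
# Hypothetical file with one descriptive line before the header
tmp <- tempfile(fileext = ".csv")
writeLines(c("Exported report, not part of the data",
             "MES;PUNTOS",
             "4;0",
             "4;3"), tmp)

# skip = 1 drops the first line, so "MES;PUNTOS" is read as the header
demo <- read.csv(tmp, sep = ";", header = TRUE, stringsAsFactors = FALSE, skip = 1)
print(demo)
```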

Dataset without Header

If the file has no header line, change the parameter header = T to header = F.

T (TRUE) means the first line holds the column names. F (FALSE) means the first line is already data.

See the example below with header = F:

multas <- read.csv('~/git/github.com/it4dgroup/datosabiertosespana/dat/ayuntamiento-madrid/multas/201504_detalle.csv', sep=";", header = F, stringsAsFactors=FALSE)
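A minimal sketch with a headerless file follows. Without a header R names the columns V1, V2, and so on. Passing col.names to set your own names is an optional extra that the example above does not use, and the file contents are invented.

```r
# Hypothetical file with data only, no header line
tmp <- tempfile(fileext = ".csv")
writeLines(c("4;0", "4;3"), tmp)

# header = FALSE keeps the first line as data; columns get default names
no_names <- read.csv(tmp, sep = ";", header = FALSE, stringsAsFactors = FALSE)
print(names(no_names))

# Optionally supply the column names yourself with col.names
named <- read.csv(tmp, sep = ";", header = FALSE, stringsAsFactors = FALSE,
                  col.names = c("MES", "PUNTOS"))
print(named)
```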

Source:

http://www.r-bloggers.com/using-r-common-errors-in-table-import/

Other links:

http://www.statmethods.net/input/missingdata.html

http://science.nature.nps.gov/im/datamgmt/statistics/r/fundamentals/manipulation.cfm

http://www.cyclismo.org/tutorial/R/input.html