# R and SQL, and more

## A use case

Explaining a concept can sometimes be dull. Solving a real-world problem and using it as a vehicle to tell the reader about integrating R and SQL, as I do in this title, may be more interesting.

Looking for Dutch BNP figures (*bruto nationaal product*, the gross national product, comparable to **GDP**), I came
across this graph:

It appears on the CBS page about BNP and tells me about the recent
change in BNP, but nothing about its context, the past. Of course,
the past is not very interesting, but *not very* is still more than
nothing. I asked myself: how can I present the current BNP in full
context without discarding the relative importance of the most
recent years?

Normally one accomplishes this by using logarithmic scales. These
work fine for many natural phenomena: in human growth, for example,
the difference between 1 and 10 months of age feels as important as
the difference between 10 and 100 months. I am seeking the reverse:
the bigger the number, the more important the intervals between
successive steps. So instead of plotting *x* versus *log x*, I plot
*x* against *b^x* for some base *b* close to 1, e.g. 1.06:
*x* versus *1.06^x*.
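This reversed scaling can be illustrated with a few lines of R (a sketch with made-up sample points, not code from the article; the base 1.06 is the one mentioned above):

```r
# With a base close to 1, successive x values are pushed
# further and further apart as x grows.
base <- 1.06
x <- c(1, 10, 50, 100)
spacing <- base^x

# The gaps between successive transformed points keep growing:
diff(spacing)
```

Compare this with *log x*, where the gaps between successive points shrink as *x* grows.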

Note that it is common for graph-makers to choose an interval that suits the point they are trying to make. I don't think the Dutch CBS is trying to do this, but the graph above may feed arguments for attacking the policymakers who brought the year 2008 upon us. This graph will tell us that it was a terrible year indeed, but one needs more context in order to convince the critical reader. Below I will explain how to provide that context without losing detail.

## The graphs using R

R is a multi-purpose programming language designed by
statisticians. It is not necessarily the best language for
statistics or presentation, but it is very opportunistic and fits
the way statisticians work. By the way, this is not me saying
'statisticians are opportunistic people'; the language *R*, however,
is opportunistic: there are numerous quick wins that break elegance
along the way.

The R program uses a table with columns for *year* and
*bnp*.

The program to display a graph of BNP's through the years then reads along the lines of:

```
plot( mydata$year, mydata$bnp, type="l" )
```

We can now see that 2008 really was something exceptional.

This doesn't reach our goal though: we want all the information, but we want to be informed about recent BNP's in more detail than about older ones. The graph above gives the distant past far too much space.

For this I designed a simple re-do of the X-axis. Instead of using
`mydata$year`, I will use `base^mydata$rank`. The *rank* is the year,
but starting at 1 (i.e. the minimum year minus 1 is subtracted from
each year). I then raise some base to the power of this X-value, so
that the amount of space between the low X-es is small and the space
between the large X-es is big. The axis should of course be labelled
accordingly:

```
base = 1.04
nsamples = seq( 1, nrow( mydata ) )
plot( base^mydata$rank, mydata$bnp, type="l", xaxt="n" )
axis( 1, at=base^nsamples, labels=mydata$year, cex.axis=1.2, las=2 )
```
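The *rank* column itself is easy to derive from the *year* column in R (a minimal sketch with a hypothetical six-year `mydata`; in this title the ranks actually come from the SQL query):

```r
# rank restarts the years at 1: subtract (minimum year - 1).
mydata <- data.frame(year = 2005:2010)
mydata$rank <- mydata$year - (min(mydata$year) - 1)
mydata$rank  # 1 2 3 4 5 6
```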

This is all well and good, but when I discussed this with John
D. Cook (the one running John D. Cook Consulting) during SUE2014,
he said: 'nice idea, but will people be able to understand it
correctly?'. A graph is for quickly comprehending what's going on.
If this type of graph fails the *quickly* and also the *comprehend*,
it's not a good idea.

For me, the context of *all* historical data is needed for
*comprehension*. The lack of detail in the far past and the greater
detail in recent numbers feeds the *quick*.

There is a quirk though. Graphs like these are *differentiated*
while being watched: the reader mentally takes the mathematical
*derivative*, the direction of the lines in the graph. Everybody
wants to tell whether the trend is upwards or downwards, or whether
there is something cyclic going on. This is why I put straight lines
in the graph below, lines which indicate the *average* or *overall*
direction of the graph. Lines steeper than these grow quicker than
the average. The lightgrey line is for the overall growth; I put in
a lightgreen one for the average growth since 2000.

For the BNP's this looks like:

```
base = 1.04
nsamples = seq( 1, nrow( mydata ) )
plot( base^mydata$rank, mydata$bnp, type="l", xaxt="n" )
axis( 1, at=base^nsamples, labels=mydata$year, cex.axis=1.2, las=2 )

# overall growth line (lightgrey)
endX = nrow( mydata )
minY = min( mydata$bnp, na.rm=TRUE )
endY = mydata$bnp[ endX ]
straightx = seq( 1, endX )
straighty = minY + ((endY-minY) * (mydata$rank[1:endX]/mydata$rank[endX]))
lines( base^straightx, straighty, col="lightgrey" )

# growth since 2000 (lightgreen)
minY = mydata$bnp[mydata$year == 2000]
minX = mydata$rank[mydata$year == 2000]
endY = mydata$bnp[ endX ]
straightx = seq( minX, endX )
straighty = minY + ((endY-minY) * ((mydata$rank[minX:endX]-minX)/(mydata$rank[endX]-minX)))
lines( base^straightx, straighty, col="lightgreen" )
```

The effect of choosing a different base is shown by using 1.02 instead of 1.04:

On the other hand, 1.1 gives us:

The CBS graph is about the relative change in BNP; I will come back to that in another title, in which I will also attend to inflation correction.

## The source of the BNP numbers

BNP figures can be retrieved from StatLine and from the *200 years of time series* collection at CBS. After some ETL (extract, transform, load) cycles, all figures were in an RDBMS, ready to be used for presentation purposes.

The data-model is simple:

- a table called **bnp** with columns for *year*, *bnp* and *valuta* (the currency),
- a table called **valuta** for currency transformation (with columns *name*, *value* and *refvaluta*).

The data extracted, transformed and loaded delivers BNP figures from long ago until pretty recently:

```
select min(year), max(year) from bnp
```

min | max |
---|---|
1807 | 2016 |

Also (hence the currency table), figures before 2000 are in *guldens*
and after that in *euros*.
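The conversion the **valuta** table enables can be sketched in a few lines of R. The official fixed rate of 1 euro = 2.20371 gulden is a known constant; the variable names and the sample figure are my own:

```r
# Convert a hypothetical BNP figure from guldens to euros,
# mirroring the bnp / v.value * e.value arithmetic in the
# SQL queries in this title.
nlg_per_eur <- 2.20371   # official fixed conversion rate
bnp_gulden  <- 100       # hypothetical figure in guldens
bnp_euro    <- bnp_gulden / nlg_per_eur
```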

## Some SQL now

Through org-mode, R interacts with the SQL queries that retrieve the data.

The query to retrieve a time series of euro-converted BNP's is relatively straightforward, although the currency conversion adds some complexity:

```
select year,
       round( avg( bnp/1000000 / v.value * e.value ) ) as bnp,
       1 + year - (select min(year) from bnp) as rank,
       round( avg(bnp)/1000000 ) as bnpraw
from bnp
JOIN valuta v ON v.name = bnp.valuta
JOIN valuta e ON e.name = v.refvaluta
where e.name = 'EUR'
```

I chose to take the average, as there is some information duplication: some BNP's from before 2001 are reported in euros as well. There are two problems to solve: conversion to one single currency (we'll choose the €) and resolving possible conflicts between extracted BNP's for the same year:

```
select year,
       round( avg(bnp/1000000/v.value * e.value) ) as avg,
       round( min(bnp/1000000/v.value * e.value) ) as min,
       round( max(bnp/1000000/v.value * e.value) ) as max
from bnp
JOIN valuta v ON v.name = bnp.valuta
JOIN valuta e ON e.name = v.refvaluta
where e.name = 'EUR'
group by year
having count(*) > 1
order by year
```

year | avg | min | max |
---|---|---|---|
1969 | 51806 | 49680 | 53932 |
1970 | 58119 | 55747 | 60491 |
1971 | 65828 | 62917 | 68739 |
1972 | 74120 | 70894 | 77345 |
1973 | 84881 | 81286 | 88476 |
1974 | 96338 | 92217 | 100459 |
1975 | 106449 | 101778 | 111120 |
1976 | 121336 | 116050 | 126623 |
1977 | 131720 | 126546 | 136895 |
1978 | 142119 | 136397 | 147840 |
1979 | 151813 | 145182 | 158444 |
1980 | 162942 | 155048 | 170836 |
1981 | 171408 | 162544 | 180271 |
1982 | 177796 | 169287 | 186304 |
1983 | 184245 | 175772 | 192718 |
1984 | 193641 | 184099 | 203184 |
1985 | 201571 | 193102 | 210040 |
1986 | 207880 | 198692 | 217068 |
1987 | 209991 | 200044 | 219937 |
1988 | 218925 | 207686 | 230163 |
1989 | 231865 | 220061 | 243670 |
1990 | 246172 | 234400 | 257943 |
1991 | 259502 | 246208 | 272796 |
1992 | 270972 | 256885 | 285059 |
1993 | 278733 | 263855 | 293610 |
1994 | 293826 | 278744 | 308909 |
1995 | 313787 | 302234 | 325341 |
1996 | 328011 | 315059 | 340964 |
1997 | 349211 | 333462 | 364961 |
1998 | 370761 | 352207 | 389315 |

As I'm not an economist and because I'm kind of a *completionist*, I
went for the average. The actual graphs in this title may uncover
whether this was wise or not.

Testing this SQL-code can be done by inspecting the first 10 rows:

```
<<bnps>>
limit 10
```

year | bnp | rank | bnpraw |
---|---|---|---|
1807 | 222 | 1 | 490 |
1808 | 198 | 2 | 436 |
1809 | 185 | 3 | 408 |
1810 | | 4 | |
1811 | | 5 | |
1812 | | 6 | |
1813 | | 7 | |
1814 | | 8 | |
1815 | 214 | 9 | 471 |
1816 | 231 | 10 | 508 |

Some values of BNP are not available; we will have to make our R code cope with that.
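How R treats these gaps can be sketched with a small stand-in vector (the values mimic the table above; the behaviour of `min` and `plot` described in the comments is standard R):

```r
bnp <- c(222, 198, 185, NA, NA, 214)

# min() propagates NA unless told to drop missing values:
min(bnp)                 # NA
min(bnp, na.rm = TRUE)   # 185

# plot(..., type="l") and lines() simply leave a gap at NA
# points, so the series itself needs no special pre-processing.
```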

The query used by the R code is contained in *bnps*:

```
<<bnp>>
group by year
order by year
```

## About this title

The document used to generate the scripts has the same source as the document you are reading now.

Most scripts are bare-bones; the amount of fancy stuff is kept to an absolute minimum in order to present only the concepts at hand.

This title was written between the 3rd and 9th of August 2017.