
R and SQL, and more

A use case

Explaining a concept can be dull at times. Solving a real-world problem and using it as a vehicle to tell the reader about integrating R and SQL, as I do in this title, may be more interesting.

Looking for Dutch BNP (the gross national product, aka GDP) figures, I came across this graph:

[Figure: CBS graph of recent Dutch BNP figures]

It appears in BNP at cbs and it tells me about the recent change in BNP, but nothing about its context: the past. Of course, the past is not very interesting, but "not very" is still more than nothing. I asked myself: how can I present the current BNP in its full context without discarding the relative importance of the most recent years?

Normally one accomplishes this by using logarithmic scales. These work fine, as various natural phenomena behave exactly like that: with human weight related to age, the difference between 1 and 10 months feels as important as the difference between 10 and 100 months. I am seeking the reverse: the bigger the number, the more important the intervals between successive steps. So instead of plotting x versus log x, I plot x versus base^x for some base close to 1, e.g. 1.06.
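As a quick numeric sketch of the difference (Python here, purely for illustration): on a logarithmic axis the gaps between equally spaced x-values shrink as x grows, while with base^x they grow.

```python
# Compare axis spacing: log(x) compresses large values,
# base**x (base slightly above 1) expands them instead.
import math

base = 1.06
xs = [1, 10, 20, 30, 40]

log_gaps = [math.log(b) - math.log(a) for a, b in zip(xs, xs[1:])]
exp_gaps = [base**b - base**a for a, b in zip(xs, xs[1:])]

print([round(g, 2) for g in log_gaps])   # gaps get smaller
print([round(g, 2) for g in exp_gaps])   # gaps get bigger
```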

Note that it is common for graph-makers to choose an interval that suits the point they are trying to make. I don't think the Dutch CBS is trying to do this, but the graph above may feed arguments for attacking the policymakers who brought the year 2008 upon us. This graph tells us that it was a terrible year indeed, but one needs more context in order to convince the critical reader. Below I explain how to provide that context without losing detail.

The graphs using R

R is a multi-purpose programming language designed by statisticians. It is not necessarily the best language for statistics or presentation, but it is very opportunistic and fits the way statisticians work. By the way, this is not me saying 'statisticians are opportunistic people'; the language R, however, is: there are numerous quick wins that break elegance along the way.

The R program uses a table with columns for year and bnp.

The program to display a graph of BNP's through the years then reads along the lines of:

plot( mydata$year, mydata$bnp, 
      type="l" )

[Figure: BNP by year, plain linear axes]

Now we can see that 2008 really was something exceptional.

This doesn't bring us to our goal though: we want all the information, but we want to be informed about recent BNPs in more detail than older ones. The graph above spends too much space on the distant past.

For this I designed a simple redo of the x-axis. Instead of using mydata$year I use base^mydata$rank. The rank is the year shifted to start at 1 (i.e. the minimum year minus 1 is subtracted). Raising some base to the power of the x-value means that the space between low x-values is small, and between large x-values is big. The axis should be labelled accordingly, of course:

base = 1.04
nsamples = seq( 1, nrow( mydata ) )  # 1..n, matching mydata$rank
plot( base^mydata$rank, mydata$bnp, 
      type="l", xaxt="n" )           # suppress the default x-axis
axis( 1, at=base^nsamples,           # ticks at the exponential positions
      labels=mydata$year,            # but labelled with the plain years
      cex.axis=1.2, 
      las=2 )                        # rotate the labels
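The rank-and-spacing computation above can be sketched numerically as well (Python used here only to keep the illustration self-contained; the years shown are the document's own data range):

```python
# rank = year - (min(year) - 1), so the first year gets rank 1.
# The plotted x-position is then base**rank: early years sit close
# together, recent years get progressively more room.
base = 1.04
years = [1807, 1808, 2015, 2016]        # a few sample years
min_year = 1807                          # min(year) in the data
ranks = [y - (min_year - 1) for y in years]

positions = [base**r for r in ranks]
gap_early = positions[1] - positions[0]  # space between 1807 and 1808
gap_late  = positions[3] - positions[2]  # space between 2015 and 2016
print(gap_early < gap_late)              # → True
```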

[Figure: BNP on an exponential x-axis, base 1.04]

This is all well and good, but when I discussed it with John D. Cook (the one running John D. Cook Consulting) during SUE2014, he said: 'nice idea, but will people be able to understand it correctly?'. A graph is for quickly comprehending what's going on. If this type of graph fails the 'quickly' and also the 'comprehend', it's not a good idea.

For me, the context of all the historical data is needed for comprehension. The lack of detail in the far past and the greater detail in recent numbers serves the 'quickly'.

There is a quirk though. Graphs like these are differentiated while being watched: viewers read the mathematical derivative, the direction of the lines in the graph. Everybody wants to tell whether the trend is upwards, downwards, or whether there is a cyclic thing going on. This is why I put straight lines in the graph below, lines which indicate the average or overall direction of the graph; lines steeper than these grow quicker than the average. The light grey one is for the overall growth; I put in a light green one for the average growth since 2000.

For the BNP's this looks like:

base = 1.04
nsamples = seq( 1, nrow( mydata ) )
plot( base^mydata$rank, 
      mydata$bnp, 
      type="l", xaxt="n" )
axis( 1, at=base^nsamples, 
      labels=mydata$year, 
      cex.axis=1.2,
      las=2 ) 

# overall trend: a straight line from the minimum BNP to the last BNP
endX = nrow( mydata )
minY = min( mydata$bnp, 
            na.rm=TRUE )
endY = mydata$bnp[ endX ]
straightx = seq( 1, endX )
straighty = minY + 
     ((endY-minY) * 
      (mydata$rank[1:endX]/mydata$rank[endX]))
lines( base^straightx, 
       straighty, 
       col="lightgrey" )

# trend since 2000, drawn in light green
endX = nrow( mydata )
minY = mydata$bnp[mydata$year == 2000]
minX = mydata$rank[mydata$year == 2000]
endY = mydata$bnp[ endX ]
straightx = seq( minX, endX )
straighty = minY + 
     ((endY-minY) * 
      ((mydata$rank[minX:endX]-minX)/(mydata$rank[endX]-minX)))
lines( base^straightx, 
       straighty, 
       col="lightgreen" )

[Figure: BNP with trend lines, overall in light grey and since 2000 in light green]

The effect of choosing a different base is shown by using 1.02 instead of 1.04:

[Figure: the same graph with base 1.02]

On the other hand, 1.1 gives us:

[Figure: the same graph with base 1.1]

The CBS graph is about relative change in BNP; I will come back to that in another title. I will also attend to inflation correction then.

The source of the BNP numbers

BNP figures can be retrieved from statline and 200 years of time-series. After some ETL (extract, transform, load) cycles, all figures were in an RDBMS, ready to be used for presentation purposes.

The data-model is simple:

  • a table called bnp with columns for year, bnp and currency.
  • a table called valuta for currency-transformation (using name, value and refvaluta).

The data, extracted, transformed and loaded, delivers BNP figures from long ago until pretty recently:

select min(year), max(year) 
  from bnp
min max
1807 2016

Also, figures before 2000 are in guilders and after that in euros; hence the currency table.
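The conversion the valuta table enables can be sketched as follows (Python for illustration; the table contents are assumptions, except for the official fixed rate of 2.20371 guilders per euro):

```python
# Each currency row holds a value relative to its refvaluta.
# An amount is converted as amount / value(from) * value(to),
# mirroring the SQL expression bnp / v.value * e.value.
valuta = {
    # name: (value, refvaluta) -- EUR as its own reference (assumption)
    "EUR": (1.0,     "EUR"),
    "NLG": (2.20371, "EUR"),   # official fixed rate: 2.20371 NLG per EUR
}

def to_eur(amount, currency):
    value, ref = valuta[currency]
    assert ref == "EUR"        # this sketch handles one reference currency
    return amount / value

print(round(to_eur(220.371, "NLG"), 2))   # → 100.0
```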

Some SQL now

R interacts, through org-mode, with SQL queries to retrieve the data.

The query to retrieve a time-series of euro-converted BNPs is relatively straightforward (although the currencies make it quite complex):

select year, 
       round( avg( 
          bnp /1000000 / v.value * e.value)
       ) as bnp, 
       1 + year - 
          (select min(year) from bnp) 
         as rank,
       round(avg(bnp)/1000000) 
         as bnpraw
  from bnp 
       JOIN valuta v 
         ON v.name = bnp.valuta
       JOIN valuta e 
         ON e.name = v.refvaluta
 where e.name = 'EUR'

I chose to take the average, as there is some duplication of information: some BNPs from before 2001 are reported in euros as well. There are two problems to solve: conversion to one single currency (we'll choose the €) and resolving possible conflicts between extracted BNPs for the same year:

select year, 
       round(avg(bnp/1000000/v.value * e.value)
       ) as avg, 
       round(min(bnp/1000000/v.value * e.value)
       ) as min, 
       round(max(bnp/1000000/v.value * e.value)
       ) as max
  from bnp 
       JOIN valuta v 
         ON v.name = bnp.valuta
       JOIN valuta e 
         ON e.name = v.refvaluta
 where e.name = 'EUR'
 group by year
having count(*) > 1
 order by year
year avg min max
1969 51806 49680 53932
1970 58119 55747 60491
1971 65828 62917 68739
1972 74120 70894 77345
1973 84881 81286 88476
1974 96338 92217 100459
1975 106449 101778 111120
1976 121336 116050 126623
1977 131720 126546 136895
1978 142119 136397 147840
1979 151813 145182 158444
1980 162942 155048 170836
1981 171408 162544 180271
1982 177796 169287 186304
1983 184245 175772 192718
1984 193641 184099 203184
1985 201571 193102 210040
1986 207880 198692 217068
1987 209991 200044 219937
1988 218925 207686 230163
1989 231865 220061 243670
1990 246172 234400 257943
1991 259502 246208 272796
1992 270972 256885 285059
1993 278733 263855 293610
1994 293826 278744 308909
1995 313787 302234 325341
1996 328011 315059 340964
1997 349211 333462 364961
1998 370761 352207 389315

As I'm not an economist, and because I'm kind of a completionist, I went for the average. The actual graphs in this title may reveal whether that was wise or not.
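Resolving duplicates by averaging, as the query's avg() .. group by year does, can be sketched like this (Python for illustration, using the 1969 and 1970 min/max values from the table above):

```python
# Several sources may report a BNP for the same year; like the SQL
# avg() with group by year, we collapse duplicates to their mean.
from collections import defaultdict

reports = [(1969, 49680), (1969, 53932), (1970, 55747), (1970, 60491)]

by_year = defaultdict(list)
for year, bnp in reports:
    by_year[year].append(bnp)

averaged = {year: sum(v) / len(v) for year, v in by_year.items()}
print(averaged)   # → {1969: 51806.0, 1970: 58119.0}
```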

Testing this SQL-code can be done by inspecting the first 10 rows:

<<bnps>>
limit 10
year bnp rank bnpraw
1807 222 1 490
1808 198 2 436
1809 185 3 408
1810   4  
1811   5  
1812   6  
1813   7  
1814   8  
1815 214 9 471
1816 231 10 508

Some BNP values are not available; we will have to make our R code cater for that.

The query actually used by the R code is contained in the block bnps:

<<bnp>>
group by year
order by year

About this title

The document to generate the scripts has the same source as the document you are reading now.

Most scripts are bare-bones; the amount of fancy stuff is kept to an absolute minimum in order to present only the concepts at hand.

This title was written between the 3rd and 9th of August 2017.