R and SQL, and more
A use case
Explaining a concept can sometimes be dull. Solving a real-world problem and using it as a vehicle to tell the reader about integrating R and SQL, as I do in this title, may be more interesting.
Looking for Dutch BNP (the gross national product, aka GDP) figures, I came across this graph:
It appears in BNP at cbs and it tells me about the recent change in BNP, but nothing about its context, the past. Of course, the past is not very interesting, but 'not very' is still more than nothing. I asked myself: how can I represent the current BNP in its full context without discarding the relative importance of the most recent years?
Normally one accomplishes this by using logarithmic scales. These work fine, as various natural phenomena work exactly like that: in human weight related to age, the difference between 1 and 10 months feels as important as the difference between 10 and 100 months. I am seeking the reverse: the bigger the number, the more important the intervals between successive steps. So instead of plotting x versus log x, I plot x versus base^x with some base close to 1, e.g. 1.06.
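To see the effect of this transformation (a quick sketch with made-up ranks, not the actual BNP data): the first steps end up close together and the last steps far apart.

```r
# Sketch: horizontal spacing under x versus 1.06^x, for ranks 1..50.
base <- 1.06
x    <- 1:50
pos  <- base^x
# Width taken up by the first ten steps versus the last ten:
diff( range( pos[ 1:10] ) )   # small: about 0.73
diff( range( pos[41:50] ) )   # about ten times larger: about 7.5
```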
Note that it is common for graph-makers to choose an interval which suits the point they are trying to make. I don't think that the Dutch CBS is trying to do this, but the graph above may feed arguments for attacking the policymakers who brought the year 2008 upon us. This graph will tell us that it was a terrible year indeed, but one needs more context in order to convince the critical reader. Below I will explain how to provide that context without losing detail.
The graphs using R
R is a multi-purpose programming language designed by statisticians. It is not necessarily the best language for statistics or presentation, but it is very opportunistic and fits the way statisticians work. By the way, this is not me saying 'statisticians are opportunistic people'; the language R, however, is: there are numerous quick wins that break elegance along the way.
The R program uses a table with columns for year and bnp.
The program to display a graph of BNPs through the years then reads along the lines of:
plot( mydata$year, mydata$bnp, type="l" )
We can now see that 2008 really was something exceptional.
This doesn't bring us to our goal though: we want all the information, but we want to be informed about recent BNPs in more detail than about older ones. The graph above gives the distant past too much space.
For this I designed a simple re-do of the X-axis. Instead of using mydata$year I will use base^mydata$rank. The rank is the year, but starting at 1 (i.e. the minimum year minus 1 is subtracted). I raise some kind of base to the power of the X-value. This means that the amount of space between the lower X-es is small, and between large X-es it is big. The axis should be labelled accordingly, of course:
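The rank column itself comes out of the SQL query further down; computed in R it would be a one-liner (a sketch, assuming mydata$year holds the years as numbers):

```r
# rank 1 for the earliest year, counting up from there
mydata$rank = mydata$year - min( mydata$year, na.rm=TRUE ) + 1
```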
base = 1.04
nsamples = seq( 1, nrow( mydata ) )
plot( base^mydata$rank, mydata$bnp, type="l", xaxt="n" )
axis( 1, at=base^nsamples, labels=mydata$year, cex.axis=1.2, las=2 )
This is all well and good, but when I discussed it with John D. Cook (the one running John D. Cook Consulting) during SUE2014, he said: 'nice idea, but will people be able to correctly understand it?'. A graph is for quickly comprehending what's going on. If this type of graph fails the 'quickly' and also the 'comprehend', it's not a good idea.
For me, the context of all historical data is needed for comprehension. The lack of detail in the far past and the greater detail in recent numbers feeds the 'quickly'.
There is a quirk though. Graphs like these are differentiated while watching. This is about the mathematical derivative: the direction of the lines in the graph. Everybody wants to tell whether the trend is upwards, downwards or whether there is a cyclic thing going on. This is why I put a straight line in the graph below, a line which indicates the average or overall direction of the graph. Lines steeper than it grow quicker than the average. The light grey one is for the overall growth; I put in a light green one for the average growth since 2000.
For the BNPs this looks like:
base = 1.04
nsamples = seq( 1, nrow( mydata ) )
plot( base^mydata$rank, mydata$bnp, type="l", xaxt="n" )
axis( 1, at=base^nsamples, labels=mydata$year, cex.axis=1.2, las=2 )
# light grey line: overall growth, from the minimum BNP to the last one
endX = nrow( mydata )
minY = min( mydata$bnp, na.rm=TRUE )
endY = mydata$bnp[ endX ]
straightx = seq( 1, endX )
straighty = minY + ((endY-minY) * (mydata$rank[1:endX]/mydata$rank[endX]))
lines( base^straightx, straighty, col="lightgrey" )
# light green line: average growth since 2000
minY = mydata$bnp[ mydata$year == 2000 ]
minX = mydata$rank[ mydata$year == 2000 ]
straightx = seq( minX, endX )
straighty = minY + ((endY-minY) * ((mydata$rank[minX:endX]-minX)/(mydata$rank[endX]-minX)))
lines( base^straightx, straighty, col="lightgreen" )
The effect of choosing a different base is shown by using 1.02 instead of 1.04:
On the other hand, 1.1 gives us:
The CBS graph is about the relative change in BNP; I will come back to that in another title, where I will also attend to inflation correction.
The source of the BNP numbers
The data-model is simple:
- a table called bnp with columns for year, bnp and valuta (the currency).
- a table called valuta for currency transformation (with columns name, value and refvaluta).
The data, extracted, transformed and loaded, delivers BNP figures from long ago until pretty recently:
select min(year), max(year) from bnp
Also (hence the currency table) figures before 2000 are in guldens and after that in euros; the fixed conversion rate is 1 euro = 2.20371 guldens.
Some SQL now
R interacts, through org-mode, with SQL queries for retrieving the data.
The query to retrieve a time series of euro-converted BNPs is relatively straightforward (although the currencies make it quite complex):
select year,
       round( avg( bnp / 1000000 / v.value * e.value ) ) as bnp,
       1 + year - (select min(year) from bnp) as rank,
       round( avg( bnp ) / 1000000 ) as bnpraw
from bnp
JOIN valuta v ON v.name = bnp.valuta
JOIN valuta e ON e.name = v.refvaluta
where e.name = 'EUR'
I chose to take the average, as there is some information duplication: some BNPs from before 2001 are reported in euros as well. There are two problems to solve: conversion to one single currency (we'll choose the €) and resolving possible conflicts between extracted BNPs for the same year:
select year,
       round( avg( bnp/1000000/v.value * e.value ) ) as avg,
       round( min( bnp/1000000/v.value * e.value ) ) as min,
       round( max( bnp/1000000/v.value * e.value ) ) as max
from bnp
JOIN valuta v ON v.name = bnp.valuta
JOIN valuta e ON e.name = v.refvaluta
where e.name = 'EUR'
group by year
having count(*) > 1
order by year
As I'm not an economist and because I'm kind of a completionist, I went for the average. The actual graphs in this title may uncover whether this was wise or not.
This SQL code can be tested by inspecting the first 10 rows:
<<bnps>> limit 10
Some BNP values are not available; we will have to make our R code cater for that.
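What 'catering for that' amounts to, as a minimal sketch (mydata is the data frame used above; the rest is illustrative): plot() simply leaves a gap in the line for an NA, but aggregates have to be told to skip them.

```r
# Aggregates over a column containing NAs return NA unless told otherwise:
minY = min( mydata$bnp, na.rm=TRUE )
# Alternatively, drop the incomplete rows up front:
complete = mydata[ !is.na( mydata$bnp ), ]
```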
The query used by the R code is then contained in bnps:
<<bnp>> group by year order by year
About this title
The document to generate the scripts has the same source as the document you are reading now.
Most scripts are bare-bones; fancy stuff is kept to an absolute minimum in order to present only the concepts at hand.
This title was written between the 3rd and 9th of August 2017.