R and SQL, and more
A use case
Explaining a concept can sometimes be dull. Solving a real-world problem and using it as a vehicle to tell the reader about integrating R and SQL, as I do in this title, may be more interesting.
Looking for Dutch BNP (the gross national product, aka GDP) figures, I came across this graph:
It appears in BNP at cbs and it tells me about the recent change in BNP, but nothing about its context, the past. Of course, the past is not very interesting, but 'not very' is still more than nothing. I asked myself: how can I represent the current BNP in its full context without discarding the relative importance of the most recent years?
Normally one accomplishes this by using logarithmic scales. These work fine, as various natural phenomena work exactly like that: in human weight related to age, the difference between 1 and 10 months feels as important as the difference between 10 and 100 months. I am seeking the reverse: the bigger the number, the more important the intervals between successive steps. So instead of plotting x versus log x, I plot x versus base^x with some base close to 1, e.g. 1.06.
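To see the effect of this transformation (a quick sketch with made-up ranks, not the actual BNP data): the first steps end up close together and the last steps far apart.

```r
# Sketch: horizontal spacing under x versus 1.06^x, for ranks 1..50.
base <- 1.06
x    <- 1:50
pos  <- base^x
# Width taken up by the first ten steps versus the last ten:
diff( range( pos[ 1:10] ) )   # small: about 0.73
diff( range( pos[41:50] ) )   # about ten times larger: about 7.5
```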
Note that it is common for graph-makers to choose an interval which suits the point they are trying to make. I don't think that the Dutch CBS is trying to do this, but the graph above may feed arguments for attacking the policymakers who brought the year 2008 upon us. This graph will tell us that it was a terrible year indeed, but one needs more context in order to convince the critical reader. Below I will explain how to provide that context without losing detail.
The graphs using R
R is a multi-purpose programming language designed by statisticians. It is not necessarily the best language for statistics or presentation, but it is very opportunistic and fits the way statisticians work. By the way, this is not me saying 'statisticians are opportunistic people'; the language R, however, is: there are numerous quick wins that break elegance along the way.
The R program uses a table with columns for year and bnp.
The program to display a graph of BNPs through the years then reads along the lines of:
plot( mydata$year, mydata$bnp, type="l" )
We can now see that 2008 really was something exceptional.
This doesn't bring us to our goal though: we want all the information, but we want to be informed about recent BNPs in more detail than about older ones. The graph above gives the distant past too much space.
For this I designed a simple re-do of the X-axis. Instead of using mydata$year I will use base^mydata$rank. The rank is the year, but starting at 1 (i.e. the minimum year minus 1 is subtracted). I raise some kind of base to the power of the X-value. This means that the amount of space between the lower X-es is small, and between large X-es it is big. The axis should be labelled accordingly, of course:
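The rank column itself comes out of the SQL query further down; computed in R it would be a one-liner (a sketch, assuming mydata$year holds the years as numbers):

```r
# rank 1 for the earliest year, counting up from there
mydata$rank = mydata$year - min( mydata$year, na.rm=TRUE ) + 1
```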
base = 1.04
nsamples = seq( 1, nrow( mydata ) )
plot( base^mydata$rank, mydata$bnp, type="l", xaxt="n" )
axis( 1, at=base^nsamples, labels=mydata$year, cex.axis=1.2, las=2 )
This is all well and good, but when I discussed it with John D. Cook (the one running John D. Cook Consulting) during SUE2014, he said: 'nice idea, but will people be able to correctly understand it?'. A graph is for quickly comprehending what's going on. If this type of graph fails the 'quickly' and also the 'comprehend', it's not a good idea.
For me, the context of all historical data is needed for comprehension. The lack of detail in the far past and the greater detail in recent numbers feeds the 'quickly'.
There is a quirk though. Graphs like these are differentiated while watching. This is about the mathematical derivative: the direction of the lines in the graph. Everybody wants to tell whether the trend is upwards, downwards or whether there is a cyclic thing going on. This is why I put a straight line in the graph below, a line which indicates the average or overall direction of the graph. Lines steeper than it grow quicker than the average. The light grey one is for the overall growth; I put in a light green one for the average growth since 2000.
For the BNPs this looks like:
base = 1.04
nsamples = seq( 1, nrow( mydata ) )
plot( base^mydata$rank, mydata$bnp, type="l", xaxt="n" )
axis( 1, at=base^nsamples, labels=mydata$year, cex.axis=1.2, las=2 )
# light grey line: overall growth, from the minimum BNP to the last one
endX = nrow( mydata )
minY = min( mydata$bnp, na.rm=TRUE )
endY = mydata$bnp[ endX ]
straightx = seq( 1, endX )
straighty = minY + ((endY-minY) * (mydata$rank[1:endX]/mydata$rank[endX]))
lines( base^straightx, straighty, col="lightgrey" )
# light green line: average growth since 2000
minY = mydata$bnp[ mydata$year == 2000 ]
minX = mydata$rank[ mydata$year == 2000 ]
straightx = seq( minX, endX )
straighty = minY + ((endY-minY) * ((mydata$rank[minX:endX]-minX)/(mydata$rank[endX]-minX)))
lines( base^straightx, straighty, col="lightgreen" )
The effect of choosing a different base is shown by using 1.02 instead of 1.04:
On the other hand, 1.1 gives us:
The CBS graph is about the relative change in BNP; I will come back to that in another title, where I will also attend to inflation correction.
The source of the BNP numbers
The data-model is simple:
- a table called bnp with columns for year, bnp and valuta (the currency).
- a table called valuta for currency transformation (with columns name, value and refvaluta).
The data, extracted, transformed and loaded, delivers BNP figures from long ago until pretty recently:
select min(year), max(year) from bnp
Also (hence the currency table) figures before 2000 are in guldens and after that in euros; the fixed conversion rate is 1 euro = 2.20371 guldens.
Some SQL now
R interacts, through org-mode, with SQL queries for retrieving the data.
The query to retrieve a time series of euro-converted BNPs is relatively straightforward (although the currencies make it quite complex):
select year,
       round( avg( bnp / 1000000 / v.value * e.value ) ) as bnp,
       1 + year - (select min(year) from bnp) as rank,
       round( avg( bnp ) / 1000000 ) as bnpraw
from bnp
JOIN valuta v ON v.name = bnp.valuta
JOIN valuta e ON e.name = v.refvaluta
where e.name = 'EUR'
I chose to take the average, as there is some information duplication: some BNPs from before 2001 are reported in euros as well. There are two problems to solve: conversion to one single currency (we'll choose the €) and resolving possible conflicts between extracted BNPs for the same year:
select year,
       round( avg( bnp/1000000/v.value * e.value ) ) as avg,
       round( min( bnp/1000000/v.value * e.value ) ) as min,
       round( max( bnp/1000000/v.value * e.value ) ) as max
from bnp
JOIN valuta v ON v.name = bnp.valuta
JOIN valuta e ON e.name = v.refvaluta
where e.name = 'EUR'
group by year
having count(*) > 1
order by year
As I'm not an economist and because I'm kind of a completionist, I went for the average. The actual graphs in this title may uncover whether this was wise or not.
This SQL code can be tested by inspecting the first 10 rows:
<<bnps>> limit 10
Some BNP values are not available; we will have to make our R code cater for that.
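What 'catering for that' amounts to, as a minimal sketch (mydata is the data frame used above; the rest is illustrative): plot() simply leaves a gap in the line for an NA, but aggregates have to be told to skip them.

```r
# Aggregates over a column containing NAs return NA unless told otherwise:
minY = min( mydata$bnp, na.rm=TRUE )
# Alternatively, drop the incomplete rows up front:
complete = mydata[ !is.na( mydata$bnp ), ]
```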
The query used by the R code is then contained in bnps:
<<bnp>> group by year order by year
About this title
The document to generate the scripts has the same source as the document you are reading now.
Most scripts are bare-bones; fancy stuff is kept to an absolute minimum in order to present only the concepts at hand.
This title was written between the 3rd and 9th of August 2017.