PostgreSQL RULES for pseudonymisation

Pseudonymisation: a use case

Data has value and, as stated in GDPR Unplugged, can be quite risky to handle.

Keeping it safe by itself is not a problem, but for data to be valuable, it needs to be accessed. The process of accessing it introduces the risk: data may leak. Leaking riskful data is bad for continuity in many ways.

One way to make riskful data harmless is pseudonymisation. This term is used and defined in the GDPR.

What the GDPR means by this is a 1-to-1 function (a bijection in mathematical terms) where going one way is easy, but going back is terribly hard.¹ The GDPR makes "terribly hard" concrete: attributing the personal data to a specific data subject must require additional information, which is to be kept separately (see recital (29)).

So going back needs some extra information only authorised people have access to.

In Anonymise and tag simultaneously I describe how this could work. In this title I will show exactly how. The performance impact and the way to get back from a scrambled field will be dealt with in subsequent titles.

Riskful data displayed in a harmless way

The riskful items in employee data are names, dates of birth, account names, e-mail addresses and roles. Let's assume that the name of someone's department is not a risk.

To make the employees table harmless, I need a RULE that pseudonymises these columns based on the access level of the current user.

To keep things simple, I assume that for personnel data only HR has full access to the riskful data (with the exception of the director of the board, of course).

Before being able to present a new table for employees, I need to move the old one out of the way. It is called employeesold for now.

A rule then could be:

Listing 1: a rule for making data harmless

CREATE OR REPLACE RULE "_RETURN" AS
ON SELECT TO employees
DO INSTEAD
    SELECT id, 
           pn,
           pseudonymiseifarisk( name ) as name,  
           pseudonymiseifarisk( prefix ) as prefix,
           pseudonymiseifarisk( firstname ) as firstname,
           NULL::date as dateofbirth,
           sex, department, 
           primaryrole, ssn, 
           pseudonymiseifarisk( accountname ) as accountname,
           pseudonymiseifarisk( email ) as email
      FROM employeesold;

Here the function pseudonymiseifarisk returns the column untouched when the current user is not a risk, and the pseudonymised value when the current user is a risk.²

The predicate function to decide this is called isarisk and contains all business logic needed:

Listing 2: function isarisk

CREATE OR REPLACE FUNCTION isarisk( ) RETURNS boolean as
$$
   -- true when the current user is neither in HR nor the director of the board
   SELECT count(*) = 0
     FROM employeesold e 
     JOIN departments d ON d.id = e.department
     JOIN jobroles j ON j.id = e.primaryrole
    WHERE pseudonymise( accountname ) = pseudonymise( CURRENT_USER )
          AND (d.name = 'human resources' OR (d.name = 'board' AND j.name = 'director' ))
$$ LANGUAGE 'sql' STABLE;
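
The tests that follow switch users with SET ROLE, so a role has to exist for each account name being tested. That setup is not part of the listings in this title. The following is a minimal sketch under assumptions: that the roles are created without further options, and that the grants below are what the tests need. Because isarisk is an ordinary function (not SECURITY DEFINER), it reads employeesold, departments and jobroles with the privileges of the calling role.

```sql
-- Assumed setup, not part of the original scripts:
-- one role per accountname used in the tests.
CREATE ROLE jjansen;   -- finance controller, expected to be a risk
CREATE ROLE acompy;    -- director of the board, expected not to be a risk

-- The roles query the pseudonymising relation ...
GRANT SELECT ON employees TO jjansen, acompy;

-- ... and isarisk() runs with the privileges of the calling role,
-- so that role must also be able to read the tables it joins.
GRANT SELECT ON employeesold, departments, jobroles TO jjansen, acompy;
```

Note that the second GRANT also lets these roles read employeesold directly; closing that gap is outside the scope of this sketch.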

Now let's see whether this gives us what we need:

Listing 3: test one

set role jjansen;
select id, name from employees;

SET
id	name
1	1e3de977b838c5361b45f5da310103df
2	54de3e9b7d1ceead9523ab02b302306f
3	498c57aa172b093bf9984aa8a4bdd33b
4	1548cb8abd1a0fbce372a7da4c1acc94
5	8e4b6fe638e62db9c4efae442b80b1b2
6	b7af03c1e63f54a05377b0cbb3ddfae5

Listing 4: test two

set role acompy;
select id, name from employees;

SET
id	name
1	Harfort
2	Orquort
3	Compy
4	Jansen
5	Jooks
6	Müller
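
A relation with an ON SELECT rule named "_RETURN" is how PostgreSQL represents a view internally, so the rule of Listing 1 can also be written as a view. This matters on newer servers: as far as I know, PostgreSQL 16 removed the ability to turn a table into a view by creating such a rule by hand. A minimal sketch of the equivalent view, assuming that no table named employees exists when it runs (the empty employees table of Listing 7 is skipped or dropped first):

```sql
-- Sketch: the view equivalent of the "_RETURN" rule in Listing 1.
-- Assumes no relation named employees exists yet.
CREATE VIEW employees AS
    SELECT id,
           pn,
           pseudonymiseifarisk( name )        AS name,
           pseudonymiseifarisk( prefix )      AS prefix,
           pseudonymiseifarisk( firstname )   AS firstname,
           NULL::date                         AS dateofbirth,
           sex, department,
           primaryrole, ssn,
           pseudonymiseifarisk( accountname ) AS accountname,
           pseudonymiseifarisk( email )       AS email
      FROM employeesold;
```

The select list is the one from Listing 1; only the way the relation is created differs.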

This is only part of the work: rules must be created for INSERT and UPDATE as well, to prevent pseudonymised data from being written back to the database. These rules are pretty straightforward: they ignore all riskful columns.
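
As an illustration of what such rules could look like, here is a minimal sketch. The rule names employees_insert and employees_update are made up for this example, and the split between riskful and harmless columns simply follows Listing 1. It is an assumption about what "straightforward" means here, not a tested script.

```sql
-- Sketch: writes on employees are redirected to employeesold.
-- Only the columns that Listing 1 passes through untouched are written;
-- name, prefix, firstname, dateofbirth, accountname and email are ignored.

CREATE OR REPLACE RULE employees_insert AS
ON INSERT TO employees
DO INSTEAD
    INSERT INTO employeesold ( pn, sex, department, primaryrole, ssn )
    VALUES ( NEW.pn, NEW.sex, NEW.department, NEW.primaryrole, NEW.ssn );

CREATE OR REPLACE RULE employees_update AS
ON UPDATE TO employees
DO INSTEAD
    UPDATE employeesold
       SET pn          = NEW.pn,
           sex         = NEW.sex,
           department  = NEW.department,
           primaryrole = NEW.primaryrole,
           ssn         = NEW.ssn
     WHERE id = OLD.id;
```

With these rules in place, an UPDATE that sets name on employees is accepted but leaves employeesold.name as it was, so a pseudonymised value read from employees can never overwrite the real one.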

As this is a lot of work and error-prone, my proposal is to change column-level security to cater for this in a single statement.

The pseudonymise function

Listing 5: function pseudonymise

DROP FUNCTION IF EXISTS pseudonymise( text );
CREATE FUNCTION pseudonymise( text ) RETURNS text as
$$
   select case when length( $1 ) = 32
               then $1
               else md5( $1||to_char(CURRENT_DATE,'yyyymmdd')||CURRENT_USER )
           end
$$ LANGUAGE 'sql';

Note that I made this function idempotent. This is crucial because this particular table holds the information about the authorisation level and the riskful data at the same time.
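
The idempotency can be checked directly against the function of Listing 5. The sketch below uses only that function; the literal values are arbitrary examples.

```sql
-- Applying pseudonymise twice gives the same result as applying it once,
-- because the 32-character md5 output is returned untouched.
SELECT pseudonymise( pseudonymise( 'jjansen' ) ) = pseudonymise( 'jjansen' ) AS idempotent;

-- Side effect of the length test: any input that is exactly
-- 32 characters long is passed through unhashed.
SELECT pseudonymise( 'abcdefghijklmnopqrstuvwxyz012345' ) AS unchanged;
```

The first query should return true. The second shows the price of this shortcut: a real value of exactly 32 characters, which fits in the email column for instance, would be shown as it is.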

Listing 6: function pseudonymiseifarisk

DROP FUNCTION IF EXISTS pseudonymiseifarisk( text );
CREATE FUNCTION pseudonymiseifarisk( text ) RETURNS varchar as
$$
   select case when isarisk()
               then pseudonymise( $1 )
               else $1
           end
$$ LANGUAGE 'sql';

Used datamodel and content

Listing 7: create tables

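-- Note: once the "_RETURN" rule of Listing 1 exists, employees is a view
-- and has to be removed with DROP VIEW instead of DROP TABLE.
DROP TABLE IF EXISTS employees;
DROP TABLE IF EXISTS employeesold;
DROP TABLE IF EXISTS departments;
DROP TABLE IF EXISTS jobroles;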
CREATE TABLE departments
( id          serial PRIMARY KEY,
  name        varchar(32) UNIQUE
);
CREATE TABLE jobroles
( id          serial PRIMARY KEY,
  name        varchar(32) UNIQUE
);
CREATE TABLE employees
( id          serial,
  pn          integer,
  name        varchar(24),
  prefix      varchar(6),
  firstname   varchar(24),
  dateofbirth date,
  sex         char,
  department  integer,
  primaryrole integer,
  ssn         varchar(14),
  accountname varchar(14),
  email       varchar(40)
);
CREATE TABLE employeesold
( id          serial PRIMARY KEY,
  pn          integer,
  name        varchar(24),
  prefix      varchar(6),
  firstname   varchar(24),
  dateofbirth date,
  sex         char,
  department  integer NOT NULL REFERENCES departments( id ),
  primaryrole integer NOT NULL REFERENCES jobroles( id ),
  ssn         varchar(14),
  accountname varchar(14) UNIQUE,
  email       varchar(40)
);

Filling the tables is easy; the dataset for employees is small for now.

Listing 8: fill tables

DELETE FROM employees;
DELETE FROM employeesold;
DELETE FROM jobroles;
DELETE FROM departments;
INSERT INTO departments (name) VALUES ('finance'), ('human resources'), 
                                      ('operations'), ('board'), ('sales');
INSERT INTO jobroles ( name ) VALUES ('account manager'), ('systems integrator'),
                                     ('controller'), ('manager'), ('director');
INSERT INTO employeesold ( pn, name, prefix, firstname, dateofbirth, sex, 
                        department, primaryrole, ssn, accountname, email )
               VALUES ( 6, 'Harfort', null, 'Jane', '2001-01-02', 'f',
                        1, 4, '', 'jharfort', 'jane.harfort@thecomp.eu' ),
                      ( 8, 'Orquort', 'd''', 'Jeff', '1997-10-02', 'm',
                        2, 4, '', 'jdorquort', 'jeff.d.orquort@thecomp.eu' ),
                      ( 1, 'Compy', null, 'Anne', '1999-10-12', 'f',
                        4, 5, '', 'acompy', 'anne@thecomp.eu' ),
                      (14, 'Jansen', null, 'Jake', '2000-4-5', 'm',
                        1, 3, '', 'jjansen', 'jake.jansen@thecomp.eu' ),
                      (15, 'Jooks', null, 'Peter', '1999-04-05', 'm',
                        3, 4, '', 'pjooks', 'peter.jooks@thecomp.eu' ),
                      (16, 'Müller', null, 'Elke', '2001-07-05', 'f',
                        3, 2, '', 'emueller', 'elke.mueller@thecomp.eu' );

Listing 9: content

SELECT e.id, e.pn, 
       e.firstname||' '||coalesce(e.prefix, '')||e.name as fullname, 
       e.dateofbirth, e.sex, 
       d.name AS dep, j.name AS primaryrole, 
       e.ssn, e.accountname, e.email
  FROM employeesold e
  JOIN departments d ON d.id = e.department
  JOIN jobroles j ON j.id = e.primaryrole;

id	pn	fullname	dateofbirth	sex	dep	primaryrole	ssn	accountname	email
1	6	Jane Harfort	2001-01-02	f	finance	manager		jharfort	jane.harfort@thecomp.eu
2	8	Jeff d'Orquort	1997-10-02	m	human resources	manager		jdorquort	jeff.d.orquort@thecomp.eu
3	1	Anne Compy	1999-10-12	f	board	director		acompy	anne@thecomp.eu
4	14	Jake Jansen	2000-04-05	m	finance	controller		jjansen	jake.jansen@thecomp.eu
5	15	Peter Jooks	1999-04-05	m	operations	manager		pjooks	peter.jooks@thecomp.eu
6	16	Elke Müller	2001-07-05	f	operations	systems integrator		emueller	elke.mueller@thecomp.eu

About this title

The document to generate and run the scripts has the same source as the document you are reading now.

Most scripts are bare-bones: the amount of fancy stuff is kept to an absolute minimum in order to present only the concepts at hand.

This title was written between the 26th and 28th of September 2017.

Footnotes:

¹ Going back is only necessary for forensic purposes; it needs to be working for most data.

² A good rule of thumb: data can be riskful; a person is a risk or is not.