AboutGDPROthersSitemap

PostgreSQL RULES for pseudonymisation

Pseudonymisation; a use case

Data has value and, as stated in GDPR Unplugged, can be quite risky to handle.

Keeping it safe by itself is not a problem, but for data to be valuable, it needs to be accessed. The process of accessing it introduces the risk: data may leak. Leaking riskful data is bad for continuity in many ways.

One solution to make riskful data harmless, is pseudonymisation. This word is invented by the EU and mentioned in the GDPR document.

What they mean by this is a 1-to-1 function (a bijection in mathematical terms) where going one way is easy, but going back is terribly hard.1 The terribly then is documented by the EU to need additional information for attributing the personal data to a specific data subject which is to be kept separately (see pretext item (29)).

So going back needs some extra information only authorised people have access to.

In Anonymise and tag simultaneously I describe how this could work. In this title I will show exactly how. The performance impact and the way to get back from a scrambled field will be dealt with in subsequent titles.

Riskful data displayed in a harmless way

The riskfulness in employee data are names, date-of-births, acountnames, e-mailaddresses and roles. Let's assume that the name of someone's department is not a risk.

Making the employees table harmless, I need to provide for a RULE which pseudonymises these columns based on the access-level of the current user.

To make things simple, I assume that for personnel data, HR and no others have full access to the riskful data (except the director of the board of course).

Before being able to present a new table for employees, I need to move the old one out of the way. It is called employeesold for now.

A rule then could be:

CREATE OR REPLACE RULE "_RETURN" AS
ON SELECT TO employees
DO INSTEAD
    SELECT id, 
           pn,
           pseudonymiseifarisk( name ) as name,  
           pseudonymiseifarisk( prefix ) as prefix,
           pseudonymiseifarisk( firstname ) as firstname,
           NULL::date as dateofbirth,
           sex, department, 
           primaryrole, ssn, 
           pseudonymiseifarisk( accountname ) as accountname,
           pseudonymiseifarisk( email ) as email
      FROM employeesold;

Where the function pseudonymiseifarisk returns the column untouched in case the current user is not a risk and the pseudonymised one in case the current user is a risk.2

The predicate function to decide this is called isarisk and contains all business logic needed:

CREATE OR REPLACE FUNCTION isarisk( ) RETURNS boolean as
$$
   SELECT count(*) = 0
     FROM employeesold e 
     JOIN departments d ON d.id = e.department
     JOIN jobroles j ON j.id = e.primaryrole
    WHERE pseudonymise( accountname ) = pseudonymise( CURRENT_USER )
          AND (d.name = 'human resources' OR (d.name = 'board' AND j.name = 'director' ))
$$ LANGUAGE 'sql' STABLE;

Now let's see whether this gives us what we need:

set role jjansen;
select id, name from employees;
SET  
id name
1 1e3de977b838c5361b45f5da310103df
2 54de3e9b7d1ceead9523ab02b302306f
3 498c57aa172b093bf9984aa8a4bdd33b
4 1548cb8abd1a0fbce372a7da4c1acc94
5 8e4b6fe638e62db9c4efae442b80b1b2
6 b7af03c1e63f54a05377b0cbb3ddfae5
set role acompy;
select id, name from employees;
SET  
id name
1 Harfort
2 Orquort
3 Compy
4 Jansen
5 Jooks
6 Müller

This is only part of the work, for INSERT and UPDATE, rules must be created as well. This is to avoid pseudonymised data ends up being inserted in the database. These rules are pretty straightforward, ignoring all riskful columns.

As this is a lot of work and error-prone, my proposal is to change column-level security to cater for this in a single statement.

The function for pseudonymise

DROP FUNCTION pseudonymise( text );
CREATE FUNCTION pseudonymise( text ) RETURNS text as
$$
   select case when length( $1 ) = 32
               then $1
               else md5( $1||to_char(CURRENT_DATE,'yyyymmdd')||CURRENT_USER )
           end
$$ LANGUAGE 'sql';

Note that I make this function idempotent. This is crucial as this particular table contains information about the authorization level and has harmful data at the same time.

DROP FUNCTION pseudonymiseifarisk( text );
CREATE FUNCTION pseudonymiseifarisk( text ) RETURNS varchar as
$$
   select case when isarisk()
               then pseudonymise( $1 )
               else $1
           end
$$ LANGUAGE 'sql';

Used datamodel and content

DROP TABLE employees; 
DROP TABLE employeesold; 
DROP TABLE departments; 
DROP TABLE jobroles; 
CREATE TABLE departments
( id          serial PRIMARY KEY,
  name        varchar(32) UNIQUE
);
CREATE TABLE jobroles
( id          serial PRIMARY KEY,
  name        varchar(32) UNIQUE
);
CREATE TABLE employees
( id          serial,
  pn          integer,
  name        varchar(24),
  prefix      varchar(6),
  firstname   varchar(24),
  dateofbirth date,
  sex         char,
  department  integer,
  primaryrole integer,
  ssn         varchar(14),
  accountname varchar(14),
  email       varchar(40)
);
CREATE TABLE employeesold
( id          serial PRIMARY KEY,
  pn          integer,
  name        varchar(24),
  prefix      varchar(6),
  firstname   varchar(24),
  dateofbirth date,
  sex         char,
  department  integer NOT NULL REFERENCES departments( id ),
  primaryrole integer NOT NULL REFERENCES jobroles( id ),
  ssn         varchar(14),
  accountname varchar(14) UNIQUE,
  email       varchar(40)
);

Filling the tables is easy; the dataset for employees is small for now.

DELETE FROM employees;
DELETE FROM employeesold;
DELETE FROM jobroles;
DELETE FROM departments;
INSERT INTO departments (name) VALUES ('finance'), ('human resources'), 
                                      ('operations'), ('board'), ('sales');
INSERT INTO jobroles ( name ) VALUES ('account manager'), ('systems integrator'),
                                     ('controller'), ('manager'), ('director');
INSERT INTO employeesold ( pn, name, prefix, firstname, dateofbirth, sex, 
                        department, primaryrole, ssn, accountname, email )
               VALUES ( 6, 'Harfort', null, 'Jane', '2001-01-02', 'f',
                        1, 4, '', 'jharfort', 'jane.harfort@thecomp.eu' ),
                      ( 8, 'Orquort', 'd''', 'Jeff', '1997-10-02', 'm',
                        2, 4, '', 'jdorquort', 'jeff.d.orquort@thecomp.eu' ),
                      ( 1, 'Compy', null, 'Anne', '1999-10-12', 'f',
                        4, 5, '', 'acompy', 'anne@thecomp.eu' ),
                      (14, 'Jansen', null, 'Jake', '2000-4-5', 'm',
                        1, 3, '', 'jjansen', 'jake.jansen@thecomp.eu' ),
                      (15, 'Jooks', null, 'Peter', '1999-04-05', 'm',
                        3, 4, '', 'pjooks', 'peter.jooks@thecomp.eu' ),
                      (16, 'Müller', null, 'Elke', '2001-07-05', 'f',
                        3, 2, '', 'emueller', 'elke.mueller@thecomp.eu' );
SELECT e.id, e.pn, 
       e.firstname||' '||coalesce(e.prefix, '')||e.name as fullname, 
       e.dateofbirth, e.sex, 
       d.name AS dep, j.name AS primaryrole, 
       e.ssn, e.accountname, e.email
  FROM employeesold e
  JOIN departments d ON d.id = e.department
  JOIN jobroles j ON j.id = e.primaryrole
id pn fullname dateofbirth sex dep primaryrole ssn accountname email
1 6 Jane Harfort 2001-01-02 f finance manager   jharfort jane.harfort@thecomp.eu
2 8 Jeff d'Orquort 1997-10-02 m human resources manager   jdorquort jeff.d.orquort@thecomp.eu
3 1 Anne Compy 1999-10-12 f board director   acompy anne@thecomp.eu
4 14 Jake Jansen 2000-04-05 m finance controller   jjansen jake.jansen@thecomp.eu
5 15 Peter Jooks 1999-04-05 m operations manager   pjooks peter.jooks@thecomp.eu
6 16 Elke Müller 2001-07-05 f operations systems integrator   emueller elke.mueller@thecomp.eu

about this title

The document to generate and run the scripts has the same source as the document you are reading now.

Most scripts are bare bone, the amount of fancy stuff is kept to an absolute minimum in order to present only the concepts at hand and only that.

This title was written between 26th and 28th of September 2017

Footnotes:

1
going back is only necessary for forensic purposes, it needs to be working for most data.
2
good rule of thumb: data can be riskful, a person is a risk or is not.