Big Data’s 2nd Class Citizens – Security & Privacy!

Posted on July 1, 2013


For many years I have been pushing the concept of the hidden value on corporate balance sheets – data. The challenge has always been how to realise that value securely, without compromising the customer trust and privacy that is embodied in, and at the mercy of, the use of said data.

Barclays Bank is the latest in a long list of corporates looking to capitalise on the huge volumes of customer data that their services generate – ‘Bank tells 13 million customers it is to start selling information on their spending habits to other companies’. It has raised a number of eyebrows, particularly in the absence of any hard facts as to how they intend to protect their customers’ privacy.

Concerns have traditionally been allayed by the following:

  1. Data anonymising
  2. Implementation of security policies and procedures built on a strict risk management discipline that hardens systems security, using encryption to produce a highly compliant environment.

All well and good in the old world, but with the huge volumes of data now being processed using cloud technologies and solutions such as the current Big Data poster boy, open source Hadoop, the goalposts have moved dramatically.

Hadoop is, at its heart, an open source effort: a highly distributed storage and compute solution for mining real-time streams of data (Big Data) that is not readily compliant with traditional enterprise security management toolsets. It is largely dependent on its hosting environment, and on the practices and policies around that environment, for its security and compliance. The more components that are brought into play in any security context, the greater the risk of compromise, and this is the challenge.

But surely the anonymity of the data takes away these concerns, I hear you say? Hmmm… that would be all well and good IF the data was truly anonymous. Data sets, even with Personally Identifiable Information (PII) removed, still contain patterns embodied in the records; those patterns are, after all, where much of the value of the data lies. Standalone, these data sets pose little risk to privacy, BUT in very few cases are they mined in isolation. The very essence of data mining is to extract value from data, and to do so most data sets are combined with other data sets. Alone they hold negligible value, but when cross-referenced significant commercial insight can be gained to deliver real monetary gain or competitive advantage. Not to mention the more sinister dimension of covert surveillance, but that is a whole different ball game.

Fine so far, BUT the way this happens in the real world is often by merging or mining anonymous data sets alongside data sets that contain Personally Identifiable Information. This is where the balloon goes up on data anonymity: with increased compute power and advances in data mining algorithms, the anonymity of data can be reversed.

Take, for simplicity’s sake, two data sets. Company A has a customer relationship management system with customer identifiable information that it legitimately owns. Company A legitimately acquires data from Company B. The data from Company B is anonymised to protect its consumers’ information while still allowing it to commercially realise value from its data. Because both these companies operate in a similar marketplace, they are fundamentally pooling data from a common user base. Running new data mining algorithms, it is possible to match the anonymous records in Company B’s data to the identifiable user records of Company A with a 90% success rate. Data acquisitions like this happen all the time and the outcomes are worrying.
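To make the Company A and Company B scenario concrete, here is a minimal sketch of the simplest form of this matching, often called a linkage attack. It is an illustration only, not the algorithms referred to above: the field names (postcode, birth_year, gender), the sample records and the function link_records are all hypothetical, and it assumes both data sets happen to carry the same non-PII attributes. Real attacks use fuzzier, probabilistic matching over far more attributes.

```python
from collections import defaultdict

# Hypothetical attributes present in both data sets (an assumption for
# illustration). None of them is PII on its own, but together they can
# narrow a record down to a single person.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")


def link_records(identified, anonymised, keys=QUASI_IDENTIFIERS):
    """Return {anonymous record id: name} for every unambiguous match."""
    # Index Company A's customers by their combination of shared attributes.
    index = defaultdict(list)
    for person in identified:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = {}
    for record in anonymised:
        candidates = index.get(tuple(record[k] for k in keys), [])
        # Exactly one candidate means the "anonymous" record is re-identified.
        if len(candidates) == 1:
            matches[record["record_id"]] = candidates[0]
    return matches


# Company A: identifiable CRM data (invented sample records).
company_a = [
    {"name": "Alice", "postcode": "AB1 2CD", "birth_year": 1975, "gender": "F"},
    {"name": "Bob", "postcode": "AB1 2CD", "birth_year": 1975, "gender": "M"},
    {"name": "Carol", "postcode": "EF3 4GH", "birth_year": 1982, "gender": "F"},
    {"name": "Dave", "postcode": "EF3 4GH", "birth_year": 1982, "gender": "M"},
    {"name": "Dan", "postcode": "EF3 4GH", "birth_year": 1982, "gender": "M"},
]

# Company B: "anonymised" spending data with names removed (invented).
company_b = [
    {"record_id": "r1", "postcode": "AB1 2CD", "birth_year": 1975,
     "gender": "F", "monthly_spend": 420},
    {"record_id": "r2", "postcode": "EF3 4GH", "birth_year": 1982,
     "gender": "M", "monthly_spend": 130},
    {"record_id": "r3", "postcode": "EF3 4GH", "birth_year": 1982,
     "gender": "F", "monthly_spend": 990},
]

reidentified = link_records(company_a, company_b)
print(reidentified)
```

In this toy data, records r1 and r3 each match exactly one customer and so lose their anonymity, while r2 stays ambiguous because two customers share the same attributes. The more attributes the two data sets have in common, the fewer records stay ambiguous.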

Welcome to the world of Differential Privacy, a notion of privacy tailored to the problem of statistical disclosure control, or rather how to release statistical information about a set of people without compromising the privacy of any individual. For a wealth of information on this subject, head over to the Microsoft Research Database Privacy site.
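As a rough illustration of the idea, the sketch below applies the Laplace mechanism, one standard technique from the differential privacy literature, to a simple counting query. It is a minimal sketch under stated assumptions, not a production implementation: the function names (laplace_noise, private_count), the sample spending figures and the choice of epsilon are all hypothetical. The principle is that a count changes by at most 1 when any one person is added or removed, so adding Laplace noise with scale 1/epsilon masks the contribution of any single individual.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponential variables with mean
    # `scale` follows a Laplace distribution centred on zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    A counting query has sensitivity 1, so the noise scale is 1 / epsilon.
    A smaller epsilon gives stronger privacy and a noisier answer.
    """
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Invented monthly spending figures, one per customer.
monthly_spend = [120, 480, 950, 60, 700]

# How many customers spend more than 500? The true answer is 2, but the
# released figure is perturbed, so it differs from run to run.
noisy = private_count(monthly_spend, lambda s: s > 500, epsilon=0.5)
print(round(noisy, 2))
```

Averaged over many releases the noisy answers centre on the true count, which is why aggregate statistics remain useful. Each repeated query does, however, spend more of the privacy budget, so a real system has to limit how many answers it gives out.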

I question Barclays Bank’s, or ANY commercial entity’s, ability to address these issues convincingly. Organisations are caught between their own DNA, the inbuilt commercial objective to extract value from these data acquisitions, and the trust placed in them by the individuals they deal with.

As for Big Data, well it’s the new kid on the block, and it is heading for the inevitable Trough of Disillusionment, to use a Gartner term, which could provide a pause in which its security and privacy credentials mature. We can but hope.