Big Data Oversight or Persecution by Algorithm?

Posted on March 24, 2014


We are at the event horizon of yet another seismic shift in technology's progression and its impact on our everyday lives. A shift that on the surface is all but invisible and little understood by many, but which is already resonating deep into the core fabric of our freedom and liberty, as individuals and as a society at large.

In the over-simplistic, throwaway tone adopted by Zuckerberg, aka Facebook, challenges over data privacy are brushed off with the statement that users with nothing to hide have nothing to fear, a myth long since debunked! The sad reality of these words is their naivety, as we live in the shadow of multiple examples, going back many centuries, where at varying scales this attitude has undermined social freedoms.

The exponential rise of Cloud Computing, the on-demand utilisation of computing resources, with its lowering of the cost bar for data storage and its easy access to cheap computational systems, has opened up a veritable Pandora's Box. This comes in many facets, for example:

Data Volume

Where less than a decade ago organisations and individuals would diligently prune great volumes of data to retain just the bare essentials, today data storage is so cheap that such prudence has been swept aside and ALL data and information is being stored, leading to huge warehouses of information being retained indefinitely. Local jurisdictional laws are being sidelined as technological nuances outstrip regulators' ability to adapt and protect. The retention and use of Personally Identifiable Information (PII) beyond its original use case is now the norm. The anonymity mechanisms bandied about by companies as protection are facile, as new database techniques allow data sets to be re-identified with over 90% accuracy, rendering ANY data stored or held at best 'pseudo-anonymous'. That is, anonymous only at the discretion of the Data Controller (the holder of said data). Then there is the invidious class of corporation that attempts, through its terms and conditions, to contractually acquire and retain FOREVER, and for its own use, ANY data supplied: Facebook, Google and Amazon being the principal protagonists.

The conclusion is that ANY data ANY individual, organisation or third party may supply is almost guaranteed to be retained somewhere AND to be amalgamated and used outside the original scope or purpose for which it was given up. Examples, by no means exhaustive in detail or exclusivity, include:

  • UK National Health data – On the pretext of better diagnosis, this has already been demonstrated to be a commercial venture; the data is openly shared and poorly secured.
  • ANY online advertising entity that reads your web browser cookies – Aggregation of browsing behaviour occurs in real time and is pervasive, as browsing habits are shared between back-end marketing companies, retailers and search entities driving their relentless targeting of adverts at users.
  • Google Search, email and Google Apps usage and content scanning – On the pretext of service improvements no data is sacrosanct: emails and documents are scanned, and whilst these are professed to be anonymous machine activities rather than human ones, they expose more than is commonly confessed.
  • Facebook – Everything you submit, or anything another individual may post that relates to you, is forever at risk of exposure. Facebook has repeatedly adjusted its privacy policies, rendering content that was previously 'Private' public, and also reserves the right to use your images for its own advertising purposes.
  • Vodafone mobile tracking data, which is sold to advertising agencies – If you own a mobile phone you are one of the millions who have volunteered to participate in the biggest monitoring exercise the world has ever seen. Every move you make, every step you take, every connection is being monitored, recorded and made available.
  • Experian, the credit rating agency, selling its database to marketing companies – Dictators of who can and who cannot, the credit rating agencies shape our lives in hidden ways that risk severe fallout for individuals' liberty as data errors multiply and impact credit scores, errors that are very hard to get corrected as these behemoths profess to be holier than thou in judging our credibility.

In the UK the politicians are now getting in on the act, as government bureaucrats have floated the idea of selling individuals' tax information, albeit anonymised! Yes, you read that correctly: if you are a UK citizen, YOUR tax returns could be readily available in the public domain. As I have stated above and in earlier blogs, it is not hard for such data to be re-identified.

Quality & Accuracy

Data and statistical analysis research, for example, was historically confined by the real-world practical economics of compiling data sets. Data collection and storage were costly, which led to the practice of statistically analysing small data sets as representative of a larger body of data. Quality was preserved by the diligence applied in striving for accuracy and credibility of the data, as well as a representative spread across whatever criteria were appropriate for the scope of enquiry. Achieving the same levels of accuracy for statistical purposes across today's petabytes of data is an almost impossible exercise; data is therefore becoming 'dirty' and subject to inaccuracy.
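The old small-sample discipline described above can be illustrated with a toy sketch (the population and its numbers are entirely invented for illustration): a modest, carefully drawn random sample estimates a population statistic almost as well as scanning every record, which is the economy that made traditional statistics viable.

```python
import random

random.seed(42)

# Hypothetical "population" of one million measurements (illustrative only).
population = [random.gauss(100, 15) for _ in range(1_000_000)]

# The traditional approach: a small but carefully drawn random sample.
sample = random.sample(population, 1_000)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
# The two means land within a fraction of a unit of each other;
# diligent sampling, not brute volume, delivered the accuracy.
```

The catch, of course, is that this works only when the sample really is representative, which is exactly the diligence that petabyte-scale aggregation dispenses with.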

Today datasets are stupendously huge and so conveniently amalgamated that they demand a new approach, which has coined the term 'Big Data', where data quality has gone out of the window in favour of quantity. The principle being adopted is that completeness of data across all records relevant to a subject is no longer necessary, because of the sheer volume of records that can now be referenced. Analysis of these huge volumes is possible thanks to the cheap, conveniently available storage and computing power supplied by Cloud Computing and the development of dedicated 'Big Data' computational systems. Data supplied for one purpose now ends up being influenced by records from disparate sources, with questionable outcomes.

Facebook is the outstanding villain in this regard, continuing to flout any regard for Personally Identifiable Information as it harvests user data and markets it to advertising companies, as well as reserving the right to use this data WITHOUT users' direct consent.

Then there is the Google Flu predictor, the fallen poster boy. Google, in its adolescent rush for recognition in disciplines beyond its search capability, professed to be able to predict the annual flu outbreaks in the US, and fell afoul of its own hype. On the face of it, being able to predict the spread of flu was a fantastic proposition of huge value and benefit to society and to the health organisations that struggle annually to respond to flu outbreaks. Google's programmers asked questions of its huge data resources, compiled from years of monitoring users online through its own search engine as well as any website that subscribed to its 'free' analytics service, reading all emails that touch the Gmail service, monitoring Google Apps usage and scanning associated documents, and tracking and recording mobile activity on the Android platform. From these they professed to draw what were assumed to be consistent insights that paralleled the annual flu outbreaks. They heralded this as vindication of their voracious data appetite, a breakthrough, only to have their self-adorned laurels cast asunder as subsequent years' flu outbreaks failed to reflect the Google predictions. The Google programmers, with their marketing-hype-inflated egos, were found to be human after all: they did not foresee how unrelated trending data within large data sets could materially distort their analysis.

Large data sets are like oceans: they have hidden depths. To extend the ocean analogy, there are big currents, à la the Gulf Stream, and there are localised currents and tides, which in turn are influenced in unpredictable ways by wind and temperature at a macro level and, of course, by man at a micro level. Google, in their human, fallible haste, were in essence looking at something akin to a local tidal pattern when they thought they were tapped into the certainty of a data 'Gulf Stream'. The Google Flu predictor is little more than an exercise in why data quality is still relevant, and why 'Big Data' is still in its infancy and requires careful governance.

Transparency & Accountability

Data analysis no longer depends on man-made, diligently audited and qualified algorithms, but on algorithms that evolve dynamically as they become abstracted through machine learning and AI (Artificial Intelligence) programming techniques. Today, algorithms running against large data sets are no longer fully understood even by their developers and supervisors.

The aforementioned Google Flu predictor is an early example of this: advanced machine learning and AI programming techniques were deployed, and the algorithms evolved beyond the controlled understanding of their creators. Like a boy racer irresponsibly let loose in a Formula 1 car, accelerating off purposefully only to wake up in A&E (Accident and Emergency), thinking he was the driver only to find he was little more than a passenger. That is assuming he could even control the fickle balance of an F1 clutch and accelerator to get off the mark; OK then, he had automated launch control … enough of the F1 digression. The point being that even the big boys, Google, Amazon, Facebook et al, are still driving with L plates (learner plates) when it comes to Big Data, so be warned: those corporates thinking they have it sussed have some rude awakenings ahead.

Now let's combine this algorithmic alchemy with the blooming volumes of data available to organisations, extrapolate this to the nth dimension with mergers and acquisitions and operational memoranda of understanding that allow organisations to share and combine data, and the picture takes on an all too Orwellian perspective. A prospect too tempting to ignore for governments, amongst others; the NSA (US National Security Agency) springs to mind for some reason!

Don't get me wrong: Predictive Analysis has been around for as long as data has been compiled and analysed, with great corporate and social success and benefit. Logistics is a frequently quoted market sector that uses it with great accuracy to route deliveries, saving fuel and increasing efficiency. The key point is that they are working within a controlled data scope, albeit with huge data volumes.

The water starts to muddy as we move into the realms of Correlation Analysis. In summary, Correlation Analysis finds relationships in data. That by itself is nothing earth-shattering, but when those relationships are multiplied up across huge volumes of mixed data, they start to reveal occurrences of factors or attributes that do not directly relate to the original query, and this moves into the realms of probability theory. That is: if A, B and C occur together, and there then appears an associated relationship with, say, P, then X, Y and Z are likely. Such an associated factor or attribute becomes a 'Proxy' which, when a particular variable appears, dictates a high probability of a certain outcome. This 'Proxy', or associated relationship, takes on an altogether different class of data-insight extrapolation and has all kinds of implications.
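The mechanics can be sketched in a few lines of stdlib Python (all data here is invented): scan attribute columns for strong correlations with an outcome, and whichever column correlates strongly becomes a candidate 'Proxy', even though no causal link has ever been asserted.

```python
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented records: attribute P secretly drives outcome X; attribute A is noise.
n = 10_000
attribute_p = [random.random() for _ in range(n)]
attribute_a = [random.random() for _ in range(n)]
outcome_x = [p + random.gauss(0, 0.1) for p in attribute_p]

# Scanning columns for strong correlations surfaces P as a 'Proxy' for X,
# with no claim whatsoever about WHY the relationship exists.
corr_p = pearson(attribute_p, outcome_x)
corr_a = pearson(attribute_a, outcome_x)
print(f"corr(P, X) = {corr_p:.2f}")   # strong: P is a candidate proxy
print(f"corr(A, X) = {corr_a:.2f}")   # near zero: A is not
```

At real Big Data scale this scan runs across thousands of columns from disparate sources, which is precisely how spurious but plausible-looking proxies emerge.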

Applied within diligently defined data scopes, these Proxies can be hugely insightful. Airlines, for example, monitor vibrational and operational outputs from indirect and imperceptibly associated parts of an airliner to predict part failures and optimise preventative maintenance procedures, saving millions and raising safety standards in the process. This is achievable because the data sets are controlled in scope. Correlation Analysis allows Proxies to become consistent enough to flag up the probability of an occurrence, such as a part failure, that would have been impossible to extrapolate with more rigid traditional programming techniques. Machine learning and AI techniques allow the man-made algorithms to 'evolve' and produce correlations that extend in scope and complexity beyond the capability of the original programmer or programming team, the algorithms themselves no longer recognisable as they become exponentially complex and interwoven.

As these algorithms grow exponentially in scope and complexity, their output becomes almost impossible to validate. This drives an interpretative behavioural change: from questioning WHY an outcome has occurred to simply accepting WHAT is produced. In the context of the airline, the data is confined to the aircraft, albeit down to the most innocuous vibration. But in the context of our Google Flu predictor example, the data knows no bounds. The consequences are therefore unpredictable and subject to unknown 'currents' of influence, which means blindly accepting 'what is produced' while being unable to answer the 'why', a worrying regression as data influences more and more of our lives.

For example: if a middle-class individual, employed locally, of religion A, who shops at supermarket B and frequents websites C, suddenly starts travelling internationally to locations X, Y or Z, then there is a high probability that he is a terrorist. An extreme example, where the travel 'Proxy' matched with factors A, B and C equals a high probability that the individual could be a terrorist.
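The mechanism behind such pre-judgement can be caricatured in a few lines; every weight and threshold below is invented purely for illustration. Matching enough innocuous attributes plus the travel 'Proxy' pushes a score over a decision threshold, and the individual is flagged on probability alone, never on an act.

```python
def proxy_score(record, weights, threshold=0.8):
    """Sum the weights of matched attributes; flag if the score crosses the threshold."""
    score = min(sum(w for attr, w in weights.items() if record.get(attr)), 1.0)
    return score, score >= threshold

# Invented weights: the point is the mechanism, not the numbers.
weights = {
    "religion_a": 0.1,       # entirely innocuous on its own
    "shops_at_b": 0.1,
    "visits_c_sites": 0.2,
    "travels_to_xyz": 0.5,   # the 'Proxy' that tips the balance
}

individual = {"religion_a": True, "shops_at_b": True,
              "visits_c_sites": True, "travels_to_xyz": True}

score, flagged = proxy_score(individual, weights)
print(score, flagged)  # flagged on probability, not on any action
```

Note that no single attribute is remotely incriminating; it is the amalgamation across unrelated data sources that manufactures the verdict.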

This is called the 'Minority Report' syndrome, where individuals are pre-judged on probability outputs from Big Data Correlation Analytics rather than on their actual actions. Such scenarios warn of a future where individuals are judged and found guilty NOT on their actual intent and actions but on probability. A frightening prospect, and a real risk to freedom and liberty.

This is not far removed from what is already going on in reality. The Memphis, Tennessee Police use an algorithm called CRUSH (Criminal Reduction Utilising Statistical History) to extrapolate from criminal statistical history data the 'probability' of anti-social flare-ups in certain parts of the city.

Then there is the pernicious Google PageRank, the closely guarded secret sauce of the Google search engine that impacts commercial destinies every time Google chooses to tweak it.

If Big Data is to grow up, it will need to subject itself to checks and balances like any other facet of our lives. Organisations need to be accountable for their decisions, and Correlation Analysis of the type articulated above will require its algorithms to be transparent and the organisations behind them to be held accountable.

A good start would be the Google PageRank algorithm. This, I believe, has now reached a point in its maturity curve where, combined with the anti-trust practices of its owner (Google), it requires independent auditing. In an ideal world I would hope to see Google adopt its Open Source approach and allow the IT community to vet the algorithm; after all, RSA, amongst many other organisations, have done so with their encryption technology without too much loss of market share. In fact their openness has enhanced their credibility. I suspect not in this case. After all, Google is little more than a search engine company, and its only hold on value is the control it wields as the signpost and advertising hub of the Internet.

This, as you will be able to deduce, is not going to be straightforward; but then, I suggest, neither were many of the compliance and independent auditing practices we now regard as the norm when they were first postulated.