Public interest in data science and Big data is mounting as data-driven decision-making becomes part of everyday life. The combination of data availability, sophisticated analysis techniques, and scalable infrastructures is rapidly changing the way we do business, socialize, conduct research, and govern society. In just a few years, society has shifted from being predominantly analog to digital. Society, organizations, and people are always on. Data are collected on anything, at any time, and in any place. The spectacular growth of the digital universe, summarized by the overhyped term Big data, makes it possible to record, derive, and analyze the behavior of people, machines, and organizations. The Internet of Things (IoT) is rapidly expanding, with our homes, cars, and cities becoming smart by exploiting the collected data. These developments are also changing the way scientific research is performed. Model-driven approaches are supplemented with data-driven approaches. For example, genomics and evidence-based medicine are revolutionizing disease understanding and treatment.
Data science provides unprecedented possibilities and can be disruptive in both positive and negative ways. Data science enables new services and products, and allows processes to become more efficient and effective. Organizations in retail, banking, logistics, and tourism that do not exploit data smartly will probably not survive the next decade. The rise of organizations like Facebook, Google, Apple, Uber, Bol, BrandLoyalty, Twitter, and Amazon illustrates the economic value of data. More traditional organizations are also becoming data-driven; for example, in 2015 the Dutch Tax Administration announced its plan to replace 5,000 administrative positions with 1,500 data science positions. This illustrates the shift from human decision-making to data-driven decisions. Profiling and nudging of taxpayers will yield gains, but will also trigger legal and ethical questions (cf. Knowledge agenda Dutch tax department).
Data science techniques may invade someone’s privacy and can be used to favor one individual over another. Sophisticated algorithms are designed to pick up statistical patterns in training data. If the training data (i.e., the data used to learn a predictive model) reflect existing social biases against a segment of society, the algorithm is likely to incorporate these biases. In addition, there is a general tendency for automated decisions to favor and be more accurate towards those who belong to statistically dominant groups. Legitimacy of analysis results relies on transparency, but often the link between input data and outcomes can be misunderstood by stakeholders. Also, classic problems such as confusing correlation with causality and overfitting the data lead to inaccurate decisions and conclusions.
These challenges are not new. Around 1660, John Graunt studied London’s death records to predict life expectancy. At the end of the 19th century, Francis Galton introduced statistical concepts like regression and correlation. Over 50 years ago, John Tukey pioneered the field of exploratory data analysis and laid the foundations for data science. In statistics, different types of biases (e.g., selection bias) have been studied for decades. However, classic approaches do not consider legal and ethical aspects related to fairness and accuracy and are not appropriate for Big data settings.
The significant public interest was highlighted in the recent Dutch National Research Agenda (NWA), where hundreds of questions were related to data science and Big data (including questions from application domains like personalized medicine and smart industries), and many of these revealed valid concerns related to irresponsible data use.
The RDS program will address these problems and concerns by bringing together a nationwide team of experts from a range of data science disciplines. RDS is a response to forces that either try to use data in an irresponsible manner (e.g., leaking confidential information or making unfair/flawed decisions) or that try to circumvent the use of data altogether (e.g., through restrictive legislation). RDS will develop data science techniques, infrastructures and approaches that are responsible by design. This demands scientific breakthroughs that can only be achieved by a multi-disciplinary team.
Many definitions have been proposed for data science. Here, we use the following definition:
Data science is an interdisciplinary field aiming to turn data into real value. Data may be structured or unstructured, big or small, static or streaming. Value may be provided in the form of predictions, automated decisions, models learned from data, or any type of data visualization delivering insights. Data science includes data extraction, data preparation, data exploration, data transformation, storage and retrieval, computing infrastructures, various types of mining and learning, presentation of explanations and predictions, and the exploitation of results taking into account ethical, social, legal, and business aspects.
The figure provides an overview of a typical “data science pipeline”. People, machines, systems, organizations, communities, and societies are producing data. Data are collected when a citizen submits a tax declaration, a customer orders a book online, an X-ray machine is used to take a picture, a traveler sends a tweet, or a scientist conducts an experiment. One may need to extract, load, transform, and/or de-identify such data before they may be used for analysis. Analysis results include models (e.g., a decision tree or a process model), automated decisions, predictions, and recommendations, and outcomes that need to be interpreted by stakeholders.
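The de-identification step mentioned above can take many forms. As a minimal sketch, and not the approach prescribed by RDS, the following Python fragment replaces a direct identifier with a keyed hash and drops a directly identifying field. The record layout, the field names (`citizen_id`, `name`, `declared_income`), and the key are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac

# Hypothetical secret key (assumption); in practice it would be generated
# randomly and stored separately from the data set.
SECRET_KEY = b"replace-with-a-randomly-generated-key"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def de_identify(record: dict, id_field: str = "citizen_id",
                drop_fields: tuple = ("name",)) -> dict:
    """Return a copy of the record with the identifier pseudonymized
    and the directly identifying fields removed."""
    cleaned = {k: v for k, v in record.items() if k not in drop_fields}
    cleaned[id_field] = pseudonymize(str(record[id_field]))
    return cleaned


# Hypothetical tax-declaration record (illustrative values only).
record = {"citizen_id": "123456789", "name": "J. Jansen", "declared_income": 41000}
print(de_identify(record))
```

The same identifier always maps to the same pseudonym, so records can still be linked for analysis. Note that pseudonymization of this kind is weaker than anonymization: the remaining attributes may still allow re-identification when combined with other sources.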
... RDS aims to provide the technology to build in fairness, accuracy, confidentiality, and transparency in systems thus ensuring responsible use without inhibiting the power of data
The “four V’s of Big data” – Volume, Velocity, Variety, and Veracity – refer to widely acknowledged challenges in the context of data science and Big data. The first ‘V’ (Volume) refers to the massive scale of some data sources. For example, Facebook has over 1.6 billion active users and stores hundreds of petabytes of user data. The second ‘V’ (Velocity) refers to the frequency of incoming data that need to be processed. It may be impossible to store all data or the data may change so quickly that traditional batch processing approaches cannot cope with high-velocity streams of data. The third ‘V’ (Variety) refers to the different types of data coming from multiple sources. Structured data may be augmented by unstructured data (e.g., free text, audio, and video). Moreover, to derive maximal value, data from different sources need to be combined (correlating data is often a major challenge). The fourth ‘V’ (Veracity) refers to the trustworthiness of the data. Sensor data may be uncertain, multiple users may use the same account, tweets may be generated by software rather than people, etc.
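To illustrate the Velocity challenge, the following minimal sketch shows how summary statistics can be maintained in a single pass without storing the stream. It uses Welford's online algorithm, which is a standard technique and our choice for illustration; the class name `RunningStats` and the sensor readings are hypothetical.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's online algorithm)."""

    def __init__(self) -> None:
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Incorporate one new value; the value itself is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


# Hypothetical sensor readings arriving one at a time.
stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)
print(stats.n, stats.mean, stats.variance)
```

Memory use stays constant regardless of the length of the stream, which is the property that batch approaches lack when not all data can be stored.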
Instead of focusing on the “four V’s of Big data”, we focus on “FACT” – Fairness, Accuracy, Confidentiality, and Transparency (see figure) – thereby acknowledging the concerns in society. Topics like variety and veracity are clearly related to “FACT”. However, unlike most Big data approaches, responsible data science is driven by the ideal of incorporating social and ethical values or aspects when turning data into value. This necessitates a community effort and a multi-disciplinary approach as proposed for RDS.
Our notion of responsible is inspired by the emerging field of responsible innovation. What does “responsible” mean in this context? According to Koops, it refers to being (ethically) acceptable, sustainable, and socially desirable; leading to socially desirable outcomes; caring for the future; and taking account of social and ethical aspects while balancing economic, socio-cultural, and environmental aspects. Owen et al. state that “responsible innovation requires a continuous, iterative process of learning (…) which integrates expertise and understanding and invites in perspectives from stakeholders and publics”. This involves, according to the authors, an on-going process of anticipation, reflection, deliberation, and response. In RDS, we adopt the same notion of responsible, but tailor it towards data science (rather than innovation in general).
From the overall “responsibility” notion, we distill four main challenges specific to data science:
- T1 Fairness: Data science without prejudice – How to avoid unfair conclusions even if they are true?
- T2 Accuracy: Data science without guesswork – How to answer questions with a guaranteed level of accuracy?
- T3 Confidentiality: Data science that ensures confidentiality – How to answer questions without revealing secrets?
- T4 Transparency: Data science that provides transparency – How to clarify answers such that they become indisputable?
The RDS program revolves around these “FACT” challenges, with one track devoted to each. In each of the four tracks, RDS brings together researchers from different organizations and disciplines.
The four tracks are clearly related and all contribute to the overall goal to enable and ensure responsible data science. Often tradeoffs need to be made. For example, it may be impossible to achieve the highest level of accuracy, due to constraints related to fairness and confidentiality (e.g., the performance of a classifier may be impaired by removing sensitive attributes or adding fairness constraints). Transparency and confidentiality may also be conflicting. However, in RDS we also anticipate innovations that break tradeoffs between fairness, accuracy, confidentiality, and transparency, and create win-win situations. For example, ensuring the robustness of analysis results may help to protect privacy (because smaller changes to the input data do not influence the results noticeably). Polymorphic encryption may be used to strike a new balance between transparency and confidentiality, i.e., selected parts of the encrypted data become decryptable by specific users at controlled points in time.
To ensure collaboration not just within, but also between tracks, researchers collaborate in four thematic areas (see the figure above). In each of the thematic areas there will be multiple larger case studies to test and combine the results from various subprojects. Next to organizations involved in the case studies, organizations like Statistics Netherlands (CBS), WRR, and the Rathenau Institute support RDS. RDS fosters collaboration between the different disciplines and this is best achieved using concrete questions and data from these four areas. The tracks and thematic areas bring together researchers from different disciplines. Moreover, we aim for a common RDS infrastructure where data and software are shared. This includes shared platforms for experimentation and tool development.
Track T1 focuses on Fairness. Data science approaches learn from training data while maximizing an objective (e.g., the percentage of correctly classified instances). However, this does not imply that the outcome is fair. The training data may be biased or minorities may be underrepresented. Even if sensitive attributes are omitted, members of certain groups may still be systematically rejected. Profiling may lead to further stigmatization of certain groups.
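The point that omitting sensitive attributes does not guarantee fair outcomes can be made concrete with a small sketch. The data, the decision rule, and the use of the ratio of selection rates as a fairness measure are all assumptions for illustration; this is one of several possible measures and not a method specified by RDS.

```python
from collections import defaultdict

# Hypothetical applicants: (group, postcode_area). The sensitive attribute
# "group" is never passed to the decision rule below.
applicants = [
    ("A", "north"), ("A", "north"), ("A", "north"), ("A", "south"),
    ("B", "south"), ("B", "south"), ("B", "south"), ("B", "north"),
]


def decide(postcode_area: str) -> bool:
    """Hypothetical decision rule that looks only at the postcode area."""
    return postcode_area == "north"


def selection_rates(rows) -> dict:
    """Fraction of accepted applicants per group."""
    accepted = defaultdict(int)
    total = defaultdict(int)
    for group, area in rows:
        total[group] += 1
        accepted[group] += int(decide(area))
    return {group: accepted[group] / total[group] for group in total}


rates = selection_rates(applicants)
# Ratio of the lowest to the highest selection rate; 1.0 would mean equal rates.
ratio = min(rates.values()) / max(rates.values())
print(rates, ratio)
```

Because the postcode area is correlated with group membership in this toy data set, the rule accepts members of group A three times as often as members of group B, even though it never sees the group attribute.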
Track T2 focuses on Accuracy. The abundance of data suggests that we should let the data “speak for themselves”. Data science makes this possible, but at the same time analyses of data sets – large or small – often produce inaccurate results. In general, it is challenging to “let the data speak” in a reliable manner. If one tests enough hypotheses, one of them will eventually appear to be true purely by chance. Data science approaches should not just present results or make predictions, but also explicitly provide meta-information on the accuracy of the output.
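The effect of testing many hypotheses can be quantified under simplifying assumptions. The sketch below assumes independent tests in which no real effect exists and a conventional significance level of 0.05. It also shows the Bonferroni correction as one standard remedy; this is our illustrative choice and not a technique named in the program.

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among `num_tests`
    independent tests when no real effect exists in any of them."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise error
    rate at or below `alpha` (Bonferroni correction)."""
    return alpha / num_tests


for m in (1, 10, 100):
    print(m, round(family_wise_error_rate(m), 3), bonferroni_threshold(m))
```

Under these assumptions, the chance of at least one spurious "discovery" is already about 40% for 10 tests and exceeds 99% for 100 tests, which is why reporting the number of hypotheses tested is part of the meta-information on accuracy.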
Track T3 focuses on Confidentiality. Data science heavily relies on the sharing of data. If individuals do not trust the “data science pipeline” and worry about confidentiality, they will not share their data. The goal of this track is not to prevent data from being distributed and gathered, but to exploit data in a safe and controlled manner.
Track T4 concentrates on Transparency. Data science can only be effective if people trust the results and are able to correctly interpret the outcomes. Data science should not be viewed as a black box that magically transforms data into value. Many design choices need to be made in the “data science pipeline” shown in the figure below. The journey from raw data to meaningful conclusions involves multiple steps and actors, thus accountability and comprehensibility are key for transparency.
A1 Responsible Science
As science moves into the Big data age, there is a risk that it loses its status as the one provider of real knowledge, i.e., conclusions and inferences you can rely upon. Raw scientific data sets tend to be extremely noisy, and software provides “conclusions” such as p-values with a single mouse click. Thematic area A1 develops new statistical techniques that provide more reliable uncertainty assessments for scientific data, trade off inferential power against the (sometimes aligned, sometimes opposite) goals of fairness and confidentiality, and improve transparency through better communication of scientific results.
A2 Responsible Health
The abundance of medical and human behavior data deeply influences medical research and healthcare at all levels. Within thematic area A2, we will develop data science techniques that preserve privacy for highly sensitive health data, ensure fairness for minority patient groups, accurately optimize multiple treatment options, and allow us to draw medical conclusions (in the presence of noise and missing data).
A3 Responsible Business
Responsible companies should seek a balance between maintaining a competitive position using data science and also a responsible attitude towards their stakeholders (employees, customers and shareholders). Thematic area A3 will explore ways for businesses to strive for a fair treatment of their employees and customers; to maximize the accuracy of their processes and decisions; to preserve the confidentiality of their data; and to create a culture of transparency and accountability towards all stakeholders.
A4 Responsible Government
The government’s role is especially challenging in responsible data science as finding a balance between contrasting demands is key, e.g., using data science innovatively in policy implementation, and at the same time protecting citizens against both predictable and unforeseen effects. Unlike customers in the private sector, citizens cannot turn to another party when confronted with unfair conclusions generated by data science applications or with a lack of accuracy, confidentiality, and transparency.