Certificate Program Explores Ethical Questions in Data Science

At Georgetown University’s School of Continuing Studies, ethics is not simply an academic subject; it is a vital part of professional education. That’s why the new Certificate in Advanced Data Science, which debuts in spring 2020, will require a separate course in ethics and infuse discussions of ethical issues throughout the curriculum.

As Faculty Director Benjamin Bengfort, Ph.D., explains, the number and scope of ethical issues in data science are enormous, with questions of privacy, bias, and human autonomy arising regularly as the field changes from day to day.

An expert in machine learning and natural language processing, Bengfort also explores the myriad ways in which data science can solve problems and improve our quality of life. On March 28 and 29, he will lead the Georgetown delegation to the #ExpeditionHacks Combat Human Trafficking Challenge at George Mason University’s campus in Arlington, Virginia.

Here, Bengfort talks about why ethics is integral to the field of data science:

Why does the Certificate in Advanced Data Science devote so much time to ethics?

Ethics is infused into the curriculum for two reasons. First, data science and machine learning present unique ethical challenges that need to be well understood by the technical community. Second, the audience for an advanced data science certificate is generally well prepared to solve technical challenges, but less prepared to confront ethical ones. We are combining experience with self-knowledge to create a foundation for the next generation of data science leaders to do even more amazing things than we have done before.

The Certificate in Data Science also considers ethics a core part of its curriculum, but as we move toward more in-depth topics in the Advanced Data Science Certificate, the ethical challenges become more apparent and harder to deal with. By giving students an entire course to grapple with ethical issues, we provide them with a forum to dive deeply into the conflicts and trade-offs that are unique to algorithms and machine learning.

Why should ethics be the first consideration when working with machine learning and natural language processing?

There is a pervasive belief in the technical community that disciplines like engineering and mathematics are separate from the humanities: Whereas humans are subject to moral and ethical violations, algorithms and mathematics are, by their nature, pure and unbiased and cannot behave irrationally. Unfortunately, this belief is used to absolve data scientists of responsibility, in much the same way you might hear that “guns don't kill people, people kill people.”

Machine learning is algorithmic in nature, true, but the machine learns from data that is produced by humans, and therefore it learns those humans’ implicit biases. We see this everywhere: from sensors that don't detect black hands or faces, to ads for executive positions that target only men, to filters that censor books with homosexual characters. Unfortunately, there is no simple solution to this problem; the more a data scientist tweaks a model, the further that model gets from its unbiased, mathematical form. If taken too far, the model may simply become unusable since it either doesn't reflect reality, or worse, creates adverse effects that could not be considered ahead of time.

Ethics needs to be considered first because machine-learning algorithms are applied very quickly to a very large audience. For example, a model might influence tens of millions of decisions (e.g., loan decisions) before someone notices that it is biased in a way that could ruin people's lives. Ethical considerations are also the starting point to understanding the mathematical implications of the underlying behavior of the model.
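To make the loan example concrete, here is a minimal editorial sketch of one way a team might check approval decisions for group-level disparity before a model reaches a large audience. It is an illustration, not a method described by Bengfort. The group labels, the toy data, and the function names are all hypothetical, and the 0.8 cutoff mentioned in the comments is only a common rule of thumb, not a guarantee of fairness.

```python
from collections import defaultdict


def approval_rates(decisions):
    """Return the approval rate per group.

    `decisions` is an iterable of (group, approved) pairs, where
    `approved` is a boolean. The group labels are hypothetical.
    """
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {group: approved[group] / totals[group] for group in totals}


def disparate_impact_ratio(decisions):
    """Ratio of the lowest group approval rate to the highest.

    A value of 1.0 means every group is approved at the same rate.
    Values well below 1.0 (a common rule of thumb is below 0.8)
    suggest the decisions deserve a closer look.
    """
    rates = approval_rates(decisions)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved, so no group is favored
    return min(rates.values()) / highest


# Hypothetical toy data: group "A" is approved 80% of the time,
# group "B" only 50% of the time.
sample = (
    [("A", True)] * 80 + [("A", False)] * 20
    + [("B", True)] * 50 + [("B", False)] * 50
)
ratio = disparate_impact_ratio(sample)
print(f"disparate impact ratio: {ratio:.3f}")
```

A single ratio like this cannot say why the rates differ, and it is only one of several competing fairness measures. Its value is that it is cheap enough to compute on every batch of decisions, so a disparity can surface early rather than after millions of outcomes.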

Projects don't involve just one model; they involve multiple models that interact with each other in complex ways. And these models exist in time, learning and growing over the course of their deployment. If ethics is not considered first, then, as a project changes and becomes more rooted in applications, it becomes very difficult to unwind the effects.

How do you address the problem of unintended consequences?

Part of software engineering and machine learning is considering edge cases. It is impossible to capture them all, but we do our best to test our software and models to make sure nothing goes wrong when they are deployed in production. Often this takes the form of preventing exceptions or software crashes, but if we add ethics to the mix, we could probably include it in some of the testing and quality assurance we already do. For example, I can imagine stress-testing models to see how they behave when exposed to a suite of biases or unexpected environments. In the end, rather than just shipping a production release on deadline, the best thing we can do is take a day or two to fully validate our models and software with a group that includes all major stakeholders. This is a leadership and management challenge as much as a technical one, because the common practice is to iterate fast and push releases as soon as possible.
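The idea of stress-testing a model alongside ordinary quality assurance could be sketched roughly as follows. This is an editorial illustration under stated assumptions, not Bengfort's own procedure. The two toy models, the record fields, and the `flip_rate` helper are all hypothetical. The check swaps only a sensitive attribute on each record and counts how often the prediction changes.

```python
def flip_rate(predict, records, attribute, values):
    """Share of records whose prediction changes when only `attribute` varies.

    `predict` is any callable that maps a record (a dict) to a decision.
    Each record is copied with `attribute` set to every entry in `values`;
    the record counts as "flipped" if those copies get different decisions.
    """
    if not records:
        return 0.0
    flipped = 0
    for record in records:
        outcomes = {predict({**record, attribute: value}) for value in values}
        if len(outcomes) > 1:
            flipped += 1
    return flipped / len(records)


# Two hypothetical toy models standing in for a trained classifier.
def biased_model(record):
    return record["income"] >= 50_000 and record["group"] == "A"


def neutral_model(record):
    return record["income"] >= 50_000


# Hypothetical applicant records.
records = [
    {"income": 40_000, "group": "A"},
    {"income": 60_000, "group": "B"},
    {"income": 75_000, "group": "A"},
    {"income": 30_000, "group": "B"},
]

biased_rate = flip_rate(biased_model, records, "group", ["A", "B"])
neutral_rate = flip_rate(neutral_model, records, "group", ["A", "B"])
print(f"biased model flip rate:  {biased_rate:.2f}")
print(f"neutral model flip rate: {neutral_rate:.2f}")
```

A check like this can run in the same suite as crash and exception tests, failing the build when the flip rate exceeds an agreed threshold. It only catches direct dependence on the attribute being swapped; a model that infers the same information from correlated fields would pass, which is one reason review with stakeholders still matters.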