We inhabit a world where the boundaries between literacy and numeracy have become increasingly fuzzy, and where all media have become fundamentally multimedia. Likewise, as consumers of information, we have grown increasingly accustomed to the role of data visualization in everyday life. From complex maps of World Cup matchups, to responsive infographics that guide our home-buying and rental choices, to animated plots that teach us what we each can be doing to help flatten the curve, interactive data visualization has become a critical part of how we understand the world around us and make decisions.
As data visualization becomes more widespread, the ability to distill large amounts of data for a general audience has become one of the most valued skills a data scientist can have, and you don’t have to be a graphic designer to do it! Below is a Q&A I hosted with Laura Lorenz, Senior Software Engineer at Prefect, on how you can make rich, interactive, responsible visualizations part of your data engineering toolkit.
Bengfort: Why is interactivity important to visualizations?
Lorenz: Whenever I bring a static visualization to a meeting, usually for some first-pass analysis embedded into a PowerPoint (the old-school way), there are always more questions. There’s another viewpoint I missed. Answering those questions becomes so much more self-serve if the visualization is built interactively from the start. Sometimes there’s a whole other slice that shows a really critical distinction. Locking your data into one analysis isn’t necessary anymore with all the interactive libraries we have today. And sometimes, you can’t even really “see” the story of a visualization without the benefit of an animation. Time series data is a great example. There’s often a great benefit to animating time series data to showcase the passage of time. You recently recommended the late Hans Rosling’s TED Talk on this topic, which I would also recommend to everyone. He shows how adding interactivity to a few visualizations can reveal incredible insights.
Bengfort: What tools or libraries do data scientists need to learn to produce interactive visualizations?
Lorenz: Fundamentally the trend, and what we are teaching towards, is interactive visualizations that can be deployed in a web browser. They’re more portable, for one thing. There’s also a long-standing set of standards and tooling around front-end web development to lean on. But this usually means the visualizations must be implemented in JavaScript and HTML, which is all the browser can understand. Most data professionals don’t have any experience with web development, and possibly none with JavaScript. Most of them are coming from Python, right? But now there is a growing ecosystem built around Vega and Vega-Lite that allows engineers to declare their visualizations in Python and have them automatically transpiled at runtime to the JavaScript the browser can understand. This fall I’ll be teaching two libraries, one in JavaScript and one in Python, to come at it from both angles. Probably everyone has heard of, and is maybe scared of, D3.js, which is the industry standard for interactive visualizations on the web. It’s come a long way in terms of usability for new JavaScript users, and if you stick with it, it can be very powerful, since it exposes a very low-level API for binding data to the DOM that gives you a lot of flexibility for totally custom interactive visualizations. But I’ll also be going through the Python library Altair, which leverages Vega-Lite to allow visualization authors to implement in Python while still exporting in JavaScript. It’s not as flexible, but the API will be a lot more familiar to people coming from Python. And there are plenty of insights to derive from comparatively basic visualizations; going super custom is not really the way to go most of the time anyway.
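To make the declarative idea concrete, here is a minimal sketch of what a Vega-Lite chart looks like under the hood. Altair hands the browser a Vega-Lite specification, a JSON document that the Vega-Lite JavaScript runtime renders. The sketch below builds such a specification by hand with only the Python standard library; it is not Altair's own output, and the field names (hours, score, group) and the sample records are invented for illustration.

```python
import json

# Hypothetical sample data, invented purely for illustration.
records = [
    {"hours": 1.0, "score": 52, "group": "a"},
    {"hours": 2.5, "score": 61, "group": "a"},
    {"hours": 4.0, "score": 78, "group": "b"},
    {"hours": 5.5, "score": 90, "group": "b"},
]


def scatter_spec(values):
    """Return a minimal Vega-Lite v5 spec for a pan-and-zoom scatter plot."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        # Inline data; larger data sets would normally be loaded by URL.
        "data": {"values": values},
        "mark": "point",
        # Each encoding channel maps a data field to a visual property.
        "encoding": {
            "x": {"field": "hours", "type": "quantitative"},
            "y": {"field": "score", "type": "quantitative"},
            "color": {"field": "group", "type": "nominal"},
        },
        # An interval selection bound to the scales enables pan and zoom.
        "params": [{"name": "grid", "select": "interval", "bind": "scales"}],
    }


spec = scatter_spec(records)
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

The printed JSON can be pasted into any Vega-Lite renderer to see the chart. In Altair, a roughly equivalent chart would be declared by chaining mark_point(), encode(), and interactive() on a Chart object, which spares you from writing the dictionary by hand.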
Bengfort: The enthusiastic response of the data visualization community to the COVID crisis has drawn mixed feedback; how much do data scientists need to understand about the data they're plotting? How should we navigate the responsibilities of data visualization?
Lorenz: I’m glad you phrased it in terms of ‘responsibilities’. While we were brainstorming the curriculum for the Certificate in Advanced Data Science, we were coming up with the program outcomes we wanted students to walk away with. When we got to the visualization outcomes, we were brainstorming adjectives for good visualizations like “robust”, “rich”, “interactive”, “compelling”, and then we stopped and said, “None of these are about ethics”, at which point I suggested the word “responsible”. It really clicked. Your visualizations will be responsible for telling a story, for changing people’s minds, which changes their behavior. Visualizations are memorable and easy to share; they have a lot of power. Regarding the COVID crisis, the “flatten the curve” visualizations were the backbone of a public health communications strategy that depended on a visual snapshot someone could share easily on Facebook. That graph is going to be part of mainstream consciousness for a long time. There’s no one answer to this, but it’s something you have to be cognizant of all the time. There are some basic rules of thumb, you know, keeping your axis ranges similar, that type of thing, which can be memorized. But there is also an intuition you can build.
Bengfort: How much data is it possible to visualize? Is there such a thing as too much data?
Lorenz: There are really two questions here, one in terms of performance and one in terms of design. From a performance perspective, these are web visualizations, right? This is rendering in the client’s browser, in their JavaScript runtime. There is a hard limitation here imposed by the client machine. This is related to data size, but it’s also related to the performance of the rendering library and to whether you’re doing any runtime data manipulation. One of the skills visualization engineers working with bigger data sets inevitably must develop is enough data engineering know-how to abstract large raw source data sets so they don’t take forever to load. There are a number of techniques here: preprocessing, aggregation, caching. The design side of it is more of a human question. You need to be able to step back and say, “Can one person understand or use this much data at once?” There’s a trap when your visualizations are interactive: you think you can shove everything in there and the reader can slice it themselves. You still have to tell your story in a way that isn’t going to overwhelm them. Making a visualization interactive means that there’s more user experience to design for.
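As a rough illustration of the aggregation and caching techniques mentioned above, the sketch below collapses raw timestamped readings into one summary row per day before anything is sent to the browser, and memoizes the result so repeated requests do not recompute it. The data source name, the record layout, and the function names are all hypothetical, and a production pipeline would likely do this work in a database or a dataframe library rather than in plain Python.

```python
from collections import defaultdict
from functools import lru_cache
from statistics import mean

# Hypothetical raw data: (ISO 8601 timestamp, reading) pairs per source.
RAW_SOURCES = {
    "sensor_a": [
        ("2020-03-01T09:00:00", 10.0),
        ("2020-03-01T17:30:00", 14.0),
        ("2020-03-02T08:15:00", 20.0),
    ],
}


def aggregate_by_day(rows):
    """Collapse (timestamp, value) rows into one mean value per day."""
    buckets = defaultdict(list)
    for timestamp, value in rows:
        day = timestamp[:10]  # "YYYY-MM-DD" prefix of the timestamp
        buckets[day].append(value)
    return [
        {"day": day, "mean_value": mean(values), "count": len(values)}
        for day, values in sorted(buckets.items())
    ]


@lru_cache(maxsize=None)
def daily_summary(source_name):
    """Return the cached daily summary for a named (hypothetical) source."""
    # A tuple is returned so callers cannot reorder the cached sequence.
    return tuple(aggregate_by_day(RAW_SOURCES[source_name]))


summary = daily_summary("sensor_a")
for row in summary:
    print(row)
```

The browser then receives one row per day instead of every raw reading, which keeps both download size and client-side rendering work bounded as the raw data grows.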
Data science requires thoughtful consideration of how we use data to aid decision making, or even to automate it. Design and creativity, combined with rigorous and programmatic methods, are required to ensure that data visualization is not just successful, but also effective and responsible in communicating insights to information consumers.
If you’re interested in learning more about interactive data visualization, Laura is teaching a course on the subject in the Certificate in Advanced Data Science program this fall.