On June 12, Dr. Dewar, a data scientist at bit.ly with a PhD in systems engineering, talked to the hackNY Fellows about “Measuring Attention.” For those unfamiliar with bit.ly, it is a URL shortening service that’s very popular on micro-blogging sites like Twitter. A first impression might be that bit.ly shortens URLs and nothing more. But that would be too superficial a description, if not a false one.
Offline (and some online) surveys for research and analysis are usually conducted through questionnaires, opinion polls, and manual observation. But this slow and tedious way of collecting data can be scaled up, sped up, and made more accurate by making and sharing bit.ly links. Did you know that bit.ly collects detailed information, necessary for rigorous analysis and accurate redirection, whenever a user clicks on a bit.ly link? That information includes the user’s location, the IP address of the user’s computer, the probability that the user who clicked is not a robot (cool, right?), and other useful details. You might ask: what does bit.ly do with all of it? That’s where the data scientists come in. The data that bit.ly collects every second (or every fraction of a second) is mined, analyzed, and visualized by bit.ly’s data science team, or a subset of it (bit.ly has a 7-member data science team out of ~40 employees).
Dr. Dewar began his talk with an idea that he believes has revolutionized the way we mine and use data: the 4th paradigm of scientific discovery. The 4th paradigm was proposed by Jim Gray and described in the book “The Fourth Paradigm: Data-Intensive Scientific Discovery.” Gray predicted that scientific innovation will, in the future, be data driven. To support the idea of this paradigm, Dr. Dewar showed us a Venn diagram originally by Drew Conway that illuminates the relationship among Hacking Skills, Math & Stats Knowledge, and Substantive Expertise.
Dr. Dewar also noted that the task of “Measuring Attention” is twofold: Tracking Attention, and Being Responsible. The data collected from clicks on bit.ly links can be used to make histograms, line graphs, or “spike trains” (using the d3.js library or any other data visualization library) that give a clear visual indication of trends at a particular time and in a particular geographic region. For example, just before the Egyptian revolution, the State Security Intelligence Service of Egypt blocked Egyptian residents from accessing the Internet by shutting down a crucial data center in Cairo. Dr. Dewar showed us a plot of Internet usage over time in Cairo before, during, and after the shutdown. The “spike train” graph had a deep and isolated valley immediately after the shutdown. After Internet access was restored, the valley disappeared. Even if we hadn’t known about the shutdown in Egypt, we could have worked backwards from such an abnormal point in our data set, investigated the circumstances behind it, and eventually found out about the shutdown. In other words, data science helps us investigate irregular occurrences. This is just a small glimpse into how visualization of data sets can yield interesting and sometimes crucial results. For more data visualization plots, check out Dr. Dewar’s GitHub repository.
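To make the "work backwards from an abnormal point" idea a bit more concrete, here is a minimal Python sketch of how a valley like that could be flagged automatically. This is not bit.ly's method and not something Dr. Dewar showed. The function name `find_dips`, the threshold of 3.5, and the hourly click counts are all made up for illustration. It uses a robust z-score based on the median and the median absolute deviation (MAD), one common way to spot outliers.

```python
from statistics import median


def find_dips(counts, threshold=3.5):
    """Return the indices whose count sits far below the series median.

    Hypothetical helper: it scores each point with a robust z-score built
    from the median and the median absolute deviation (MAD), so the dip
    itself barely shifts the baseline it is compared against.
    """
    med = median(counts)
    mad = median(abs(c - med) for c in counts)
    if mad == 0:
        return []  # at least half the values equal the median; no usable scale
    dips = []
    for i, c in enumerate(counts):
        # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
        robust_z = 0.6745 * (c - med) / mad
        if robust_z < -threshold:
            dips.append(i)
    return dips


# Made-up hourly click counts; hours 5 to 7 imitate an outage.
hourly_clicks = [980, 1010, 995, 1023, 1002, 12, 8, 15, 990, 1005, 1011, 987]
print(find_dips(hourly_clicks))  # prints [5, 6, 7]
```

The flagged hours are only a starting point: the interesting part is still the human step of asking what happened at that time and place.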
The plots data scientists make speak louder than words and are, in fact, meant to be very convincing. As a result, data scientists are saddled with the responsibility of depicting information as accurately as possible, without deceiving viewers. In addition, Dr. Dewar believes that the data made available to data scientists shouldn’t be exploited for unsavory ends, such as selling intrusive ads.
Dr. Dewar is a renowned data scientist (and a really humble guy too). It was a pleasure listening to him (and, of course, asking him questions), and we hope to see him again.