The Case for Uber’s “Deceptive” Data

In late July, Data & Society researcher Alexander Rosenblat claimed that Uber misinforms users by showing “phantom cars” at locations on the app’s map where no rides are actually available. Uber representatives were quick to deny the allegation, providing a number of explanations for the inaccuracies cited, from inevitable delays in location updates to the intention of protecting drivers from possible threats in light of the recent riots in Paris.

Rosenblat also points out that competitors could take advantage of accurate location data by sending drivers to fill the gaps in Uber’s service.

While Rosenblat and fellow researcher Luke Stark have not yet published their findings, it is worth examining how Uber has attempted to strike a balance between customer satisfaction and protection of company interests. As it turns out, this tradeoff lies at the center of a fascinating area of research known as differential privacy.

Differential privacy is a mathematical approach to the issue of how to publish a large dataset of sensitive information—say, census data, medical records, or even customers’ product preferences—in a statistically relevant manner without compromising the privacy of individuals whose information was used in the dataset.

Privacy protection cannot be guaranteed with as much certainty as one might think. In 1997, Latanya Sweeney, then a graduate student pursuing a degree in computer science at MIT, showed just how easy it can be to re-identify individuals in an “anonymized” collection of data. Massachusetts had decided to publish a large insurance database for medical research; William Weld, then governor, promised that individuals’ identities would remain secure, as their health records had been stripped of names, Social Security numbers, and addresses.

Sweeney proved him dead wrong. By cross-referencing information that remained available in the database—date of birth, ZIP code, and gender—with public voter registration lists, she determined exactly which health records belonged to the governor himself. Anonymization hadn’t ensured anonymity after all.
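To make the cross-referencing step concrete, here is a minimal Python sketch of a linkage attack on two toy tables. Every record, name, and field below is invented for illustration and is not drawn from the Massachusetts data; the only idea taken from Sweeney’s demonstration is the join on date of birth, ZIP code, and gender.

```python
# Toy linkage attack. All records are fabricated for illustration only.

# "Anonymized" health records: names removed, quasi-identifiers left in.
health_records = [
    {"dob": "1950-01-15", "zip": "02101", "sex": "M", "diagnosis": "condition A"},
    {"dob": "1962-07-04", "zip": "02139", "sex": "F", "diagnosis": "condition B"},
    {"dob": "1975-03-22", "zip": "02139", "sex": "F", "diagnosis": "condition C"},
]

# Public voter roll: names together with the same quasi-identifiers.
voter_roll = [
    {"name": "Voter One", "dob": "1950-01-15", "zip": "02101", "sex": "M"},
    {"name": "Voter Two", "dob": "1962-07-04", "zip": "02139", "sex": "F"},
    {"name": "Voter Three", "dob": "1975-03-22", "zip": "02139", "sex": "F"},
    {"name": "Voter Four", "dob": "1975-03-22", "zip": "02139", "sex": "F"},
]

QUASI_IDENTIFIERS = ("dob", "zip", "sex")


def key(record):
    """Return the tuple of quasi-identifier values for a record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(health, voters):
    """Map each uniquely matching voter name to a diagnosis."""
    names_by_key = {}
    for voter in voters:
        names_by_key.setdefault(key(voter), []).append(voter["name"])

    matches = {}
    for row in health:
        names = names_by_key.get(key(row), [])
        # Only a single candidate counts as a re-identification;
        # two or more voters with the same key remain ambiguous.
        if len(names) == 1:
            matches[names[0]] = row["diagnosis"]
    return matches


reidentified = link(health_records, voter_roll)
for name, diagnosis in reidentified.items():
    print(name, "->", diagnosis)
```

In this toy data, two voters share one combination of the three fields, so that health record stays ambiguous, while the other two records each match exactly one name. The more unusual a person’s combination of attributes, the more likely the match is unique.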

This privacy breach was not an isolated event. When Netflix released anonymized data about users’ viewing habits, two computer scientists cross-referenced the “de-identified” information with public ratings on IMDb to pinpoint individual users’ identities. And in 2006, chaos ensued when AOL published users’ anonymized search histories to support research on information retrieval. Again, through cross-referencing and deduction, it became possible to re-identify individuals in the data.

Such threats to privacy pose a serious problem in a world increasingly driven by big data. Consider Massachusetts’ case for analyzing medical records. Access to this information surely provides a wide range of benefits: scientists can search for correlations between the diagnosis of an illness and a specific gene, behavior, or other trait; hospitals can draw conclusions about optimal forms of healthcare for patients; insurance companies can determine rates and premiums. But in practice, analyzing such sensitive data relies on the guarantee that the patients will not be putting their privacy at risk by contributing their records to the study.

So how can people feel safe having their personal information released to the public?

That’s where differential privacy, formally modeled by computer scientist Cynthia Dwork, comes into play.

The concept applies to the algorithms used to release data from a database of sensitive information. When such an algorithm is differentially private, someone can ask questions about a dataset, and the answers provided will be “blurred” with carefully calibrated random noise so that they remain approximately accurate while disclosing almost nothing about any one individual. In fact, the key to differential privacy is that the answers to queries are nearly the same whether or not a given individual’s record is included, so they reveal almost nothing about whether that person is in the dataset at all.
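One standard way to produce this kind of blurring is the Laplace mechanism, sketched below in Python for a simple counting query. This is an illustrative assumption, not a description of anything Uber does: the function names, the toy records, and the value of epsilon (the parameter that sets the strength of the privacy guarantee) are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centered at 0.

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Answer "how many records satisfy predicate?" with Laplace noise.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1 / epsilon. Smaller epsilon means more noise and
    stronger privacy.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy dataset: one record per person, ages 20 through 69.
records = [{"age": age} for age in range(20, 70)]

rng = random.Random()  # unseeded, so each run gives a different answer
noisy = private_count(records, lambda r: r["age"] >= 50, epsilon=0.5, rng=rng)
print("noisy count of people aged 50 or older:", round(noisy, 2))
```

The true answer here is 20, and each call returns that value plus a random offset that is usually within a few units of it. Because every answered query leaks a little, the guarantee weakens as more questions are asked, which is why practical systems track a total privacy budget across queries.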

Differential privacy thus aims to attain the desired level of privacy protection while giving up as little of the data’s utility as possible.

So it does make sense for Uber to “blur” its data, even in a form as obvious as that of “phantom cars”—but the addition of these inaccurate locations should be determined in a mathematical way. Perhaps Uber was seeking to balance protecting its drivers with providing accurate displays of driver availability. Rosenblat and Stark’s forthcoming study will hopefully shed some light on the issue. Still, whatever Uber’s reasons for “misleading” users may be, the company would certainly stand to benefit from putting differential privacy to use.

In fact, this goes for any business that relies on big data. While the concept remains largely theoretical, researchers hope that its use will become more widespread. Recent studies have examined how to apply these ideas to healthcare. Data collected from social media sites like Facebook could inform companies’ advertising strategies, or even enable researchers to draw conclusions about how people form social networks, without exposing users’ private information. At a time when large-scale data breaches seem to be occurring with alarming frequency, Uber may just have the right idea.