Fri, 14 Jul 2017

Surveillance, Big Data, and Big Stupidity

— SjG @ 4:21 pm

(This post was started in March of ’16, revised later.)

Recently, a friend I’ll call Cassie was on a trip abroad to a country I’ll call Absurdia. She went to access her Google mail account, and was promptly locked out by the clever security system. It had determined that someone was accessing the account from overseas. Presumably, she was asked one or more security questions that she couldn’t answer (“When did you first create this account?”) along with one or another of her own security questions. OK, bad on her, you might say, for not remembering the answers to her security questions, and hooray for adaptive security that protected her account from unauthorized access!

But let’s examine that for a moment. Adaptive security recognized that the access was from a new place: not merely a different computer or IP address, but a different country. Great, makes a lot of sense. But if we step back to the weeks before her departure, Cassie was being served ads for hotels around Absurdia. She was being served ads for taxi companies in Absurdia, airline bargains for nonstop flights to Absurdia, and online language courses in Absurdese. You see, Google processes Gmail messages, and extracts keywords and knowledge in order to serve ads that the user will find interesting.¹ When Cassie emailed people about her upcoming trip to Absurdia, Google’s algorithms understood enough to start serving travel-related ads for the place. Google “knew” that Cassie was going to Absurdia. But this knowledge was not propagated beyond the ad-serving system.

Back in the 80s, my sister did a semester abroad in Rostock, in what was then the German Democratic Republic — East Germany. There was a very limited exchange program between Brown University and the GDR, and she was one of a handful of American students who took advantage of it. We have some family history in Rostock. A great-aunt had lived there, and my sister wanted to do some research on what had become of her. This great-aunt had been elderly by the time of the Second World War, and my sister wanted to know if she had died of natural causes (sadly, it turns out that she had not).

Now, the reason I’m telling this seemingly unrelated story involves something that happened years later. After the reunification of Germany, and as part of the national reconciliation process, people could request their Stasi files. That’s the collection of data that had been accumulated by the Staatssicherheitsdienst — the secret police — gathered via informants, phone taps, reading mail, and so forth. Naturally, during the tense Cold-War Reagan years, the East German security apparatus assumed that any American who would study there was a CIA agent, so my sister’s file was extensive.

Her file was also slightly ridiculous: pages and pages of hand-written notes, filled with scuttlebutt and rumor. What was particularly enlightening was just how far off base the operatives had been. They missed critical details, and misinterpreted others. My sister’s attempts to track down our great-aunt became, in their notes, a frustrated attempt to make contact with a hitherto unknown agent. With all the data they gathered, with all the information they accumulated, there was no actual gain in knowledge. In fact, there could have been even greater costs: the incorrect assumptions and misunderstanding could have resulted in the agency siphoning off resources to pursue this phantom.

Now, you might suggest that I’m the one who is missing the point here. Perhaps, you could argue, this is the nature of bureaucracy. The agents monitoring my sister were obligated to report to their superiors, so they grasped at whatever straws were available, and willfully ignored clues that would get in the way of a narrative that would please the authorities.

But in a way, that is the point. Surveillance generally finds what it’s seeking and only utilizes it for the purpose at hand.

In this day where Big Data is a tech industry buzzword, we continuously see articles on “business intelligence” and adaptive systems. More data gathering will solve all kinds of business problems. We read that credit card companies can predict divorce, that Target Stores predict pregnancies, and so on.²

And there are other successes. In the last year, there was a fascinating article on how a programmer helped discover cheating in the crossword puzzle world. “I guess that’s the nature of any data set. You might find things you’d rather not see,” said one of the people who contributed data to the collection that ended up confirming the plagiarism.

But Amazon still serves me ads for, say, umbrellas for weeks after I actually buy an umbrella from them. Maybe they set the flag when I look at the products, but don’t unset it when I buy one. I do work on the site from a new computer, and suddenly I’m being served wedding ads. Ads are scattershot, and the only penalty for throwing stuff at the wall to see what sticks is that the lower-value ads could crowd out the higher-value ads.
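To make the umbrella guess concrete, here is a minimal sketch of what such a bug might look like. Everything in it is hypothetical (the `AdProfile` class, its method names, the category string); I have no idea how Amazon actually tracks interest, so treat this as an illustration of the "set on view, never unset on purchase" theory and nothing more.

```python
class AdProfile:
    """Hypothetical per-user ad-targeting state; not Amazon's actual design."""

    def __init__(self):
        # Product categories the user is assumed to be shopping for.
        self.interested_in = set()

    def on_view(self, category):
        # Looking at a product sets the interest flag.
        self.interested_in.add(category)

    def on_purchase_buggy(self, category):
        # The suspected bug: the purchase is handled elsewhere,
        # and the interest flag is never cleared.
        pass

    def on_purchase_fixed(self, category):
        # One possible fix: buying the item clears the flag.
        self.interested_in.discard(category)

    def ads(self):
        # Categories that will keep showing up in ads.
        return sorted(self.interested_in)


buggy = AdProfile()
buggy.on_view("umbrellas")
buggy.on_purchase_buggy("umbrellas")

fixed = AdProfile()
fixed.on_view("umbrellas")
fixed.on_purchase_fixed("umbrellas")
```

In the buggy version the umbrella ads keep coming after the purchase; in the fixed version they stop. Whether a real ad system is this simple is an open question, but the symptom fits.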

This kind of bad data processing is annoying, but not harmful. The same is not true with crime-prediction, voter targeting, insurance assessment, and other tasks upon which “deep learning” is being brought to bear. If the AI is built with bad assumptions, it can have serious effects on people. Training AI with “real world data” that’s been filtered by the status quo is equally dangerous. I think it’s obvious what happens. You can become un-insurable, denied loans, put on a no-fly list, and worse. “I do assure you, Mrs. Buttle, the Ministry is very scrupulous about following up and eradicating any error.”³

Whenever I fuck up something spectacularly in a complicated piece of code, I think of the Donald Fagen lyric:

A just machine to make big decisions
Programmed by fellows with compassion and vision

Unfortunately, as we see time and again, both of those attributes are often lacking. Stressed or overworked programmers, get-rich-quick VC and startup culture, bad assumptions, and a lack of examining the biases built into data sets all contribute to the failure of our machines to live up to that ideal.

1 Google issued a blog post at the end of June 2017, saying this practice would stop.

2 Interestingly, in the update to that article, Visa indignantly claims they do not track marital status, nor offer a service to predict divorces. Maybe the protest is carefully worded to hide their capabilities, or maybe it’s straightforward and honest. The fact remains that credit card companies know an enormous amount about their customers.

3 As Terry Gilliam, Tom Stoppard, and Charles McKeown captured so deftly in Brazil.
