Gary Smith, writing in Wired:
Nobel laureate Richard Feynman once asked his Caltech students to calculate the probability that, if he walked outside the classroom, the first car in the parking lot would have a specific license plate, say 6ZNA74. Assuming every number and letter is equally likely and determined independently, the students estimated the probability to be less than 1 in 17 million. When the students finished their calculations, Feynman revealed that the correct probability was 1: He had seen this license plate on his way into class. Something extremely unlikely is not unlikely at all if it has already happened.
The Feynman trap—ransacking data for patterns without any preconceived idea of what one is looking for—is the Achilles heel of studies based on data mining. Finding something unusual or surprising after it has already occurred is neither unusual nor surprising. Patterns are sure to be found, and are likely to be misleading, absurd, or worse.
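The students' estimate can be reproduced with a quick back-of-the-envelope calculation. A sketch, assuming the plate format implied by 6ZNA74 (one digit, three letters, two digits), with each character uniform and independent:

```python
# Probability that the first car carries one specific plate like "6ZNA74",
# under the students' assumptions: each character independent and uniform.
# Assumed format: 1 digit, 3 letters, 2 digits.
digits, letters = 10, 26

combinations = digits * letters**3 * digits**2   # 17,576,000 possible plates
probability = 1 / combinations

print(f"{combinations:,} plates -> p = {probability:.2e}")  # ~5.69e-08
```

That works out to roughly 1 in 17.6 million, consistent with the "less than 1 in 17 million" figure in the quote.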
Lots of other examples.
The moral? “Good research begins with a clear idea of what one is looking for and expects to find. Data mining just looks for patterns and inevitably finds some.”
From an O’Reilly newsletter:
In a recent O’Reilly survey, we found that the skills gap remains one of the key challenges holding back the adoption of machine learning. The demand for data skills (“the sexiest job of the 21st century”) hasn’t dissipated—LinkedIn recently found that demand for data scientists in the US is “off the charts,” and our survey indicated that the demand for data scientists and data engineers is strong not just in the US but globally.
With the average shelf life of a skill today at less than five years and the cost to replace an employee estimated at between six and nine months of the position’s salary, there’s increasing pressure on tech leaders to retain and upskill rather than replace their employees in order to keep data projects (such as machine learning implementations) on track. We’re also seeing more training programs aimed at executives and decision makers, who need to understand how these new ML technologies can impact their current operations and products.
Beyond investments in narrowing the skills gap, companies are beginning to put processes in place for their data science projects, for example creating analytics centers of excellence that centralize capabilities and share best practices. Some companies are also actively maintaining a portfolio of use cases and opportunities for ML.
Note the average shelf life of a skill and then ponder why the UK government is not boosting the Open University.
This morning’s Observer column:
The tech craze du jour is machine learning (ML). Billions of dollars of venture capital are being poured into it. All the big tech companies are deep into it. Every computer science student doing a PhD on it is assured of lucrative employment after graduation at his or her pick of technology companies. One of the most popular courses at Stanford is CS229: Machine Learning. Newspapers and magazines extol the wonders of the technology. ML is the magic sauce that enables Amazon to know what you might want to buy next, and Netflix to guess which films might interest you, given your recent viewing history.
To non-geeks, ML is impenetrable, and therefore intimidating…
This morning’s Observer column:
The question on everyone’s mind as Google hoovered up robotics companies was: what the hell was a search company doing getting involved in this business? Now we know: it didn’t have a clue.
Last week, Bloomberg revealed that Google was putting Boston Dynamics up for sale. The official reason for unloading it is that senior executives in Alphabet, Google’s holding company, had concluded (correctly) that Boston Dynamics was years away from producing a marketable product and so was deemed disposable. Two possible buyers have been named so far – Toyota and Amazon. Both make sense for the obvious reason that they are already heavy users of robots and it’s clear that Amazon in particular would dearly love to get rid of humans in its warehouses at the earliest possible opportunity…
The Economist has an interesting article on how major universities are now having trouble holding on to their machine-learning and AI academics. As the industrial frenzy about these technologies mounts, this is perfectly understandable, though it’s now getting to absurd proportions. The Economist claims, for example, that some postgraduate students are being lured away – by salaries “similar to those fetched by professional athletes” – even before they complete their doctorates. And Uber lured “40 of the 140 staff of the National Robotics Engineering Centre at Carnegie Mellon University, and set up a unit to work on self-driving cars”.
All of which is predictable: we’ve seen it happen before, for example, with researchers who have data-analytics skillsets. But it raises several questions.
The first is whether this brain drain will, in the end, turn out to be self-defeating. After all, the graduate students of today are the professors of tomorrow. And since most of the research and development done in companies tends to be applied, who will do the ‘pure’ research on which major advances in many fields depend?
Secondly, and related to that, since most industrial R&D is done behind patent and other intellectual-property firewalls, what happens to the free exchange of ideas on which intellectual progress ultimately depends? In that context, for example, it’s interesting to see the way in which Google’s ownership of Deepmind seems to be beginning to constrain the freedom of expression of its admirable co-founder, Demis Hassabis.
Thirdly, since these technologies appear to have staggering potential for increasing algorithmic power and perhaps even changing the relationship between humanity and its machines, the brain drain from academia – with its commitment to open enquiry, sensitivity to ethical issues, and so on – to the commercial sector (which traditionally has very little interest in any of these things) is worrying.
This morning’s Observer column:
Whenever regulators gather to discuss market failures, the cliche “level playing field” eventually surfaces. When regulators finally get around to thinking about what happens in the online world, especially in the area of personal data, then they will have to come to terms with the fact that the playing field is not just tilted in favour of the online giants, but is as vertical as that rockface in Yosemite that two Americans have finally managed to free climb.
The mechanism for rotating the playing field is our old friend, the terms and conditions agreement, usually called the “end user licence agreement” (EULA) in cyberspace. This invariably consists of three coats of prime legal verbiage distributed over 32 pages, which basically comes down to this: “If you want to do business with us, then you will do it entirely on our terms; click here to agree, otherwise go screw yourself. Oh, and by the way, all of your personal data revealed in your interactions with us belongs to us.”
The strange thing is that this formula applies regardless of whether you are actually trying to purchase something from the author of the EULA or merely trying to avail yourself of its “free” services.
When the history of this period comes to be written, our great-grandchildren will marvel at the fact that billions of apparently sane individuals passively accepted this grotesquely asymmetrical deal. (They may also wonder why our governments have shown so little interest in the matter.)…
Yesterday I gave a talk about so-called ‘Big Data’ to a group of senior executives. At one stage I used the famous Walmart pop-tart discovery as an example of how organisations sometimes discover things they didn’t know by mining their data. But now comes an equally intriguing data-mined discovery — from Alibaba:
Earlier this summer, a group of data crunchers looking at underwear sales at Alibaba came across a curious trend: women who bought larger bra sizes also tended to spend more (link in Chinese). Dividing intimate-apparel shoppers into four categories of spending power, analysts at the e-commerce giant found that 65% of women of cup size B fell into the “low” spend category, while those of a size C or higher mostly fit into the “middle” or higher group.
The explanation might be fairly straightforward: it could be that the data merely demonstrate that younger women have less spending power, for instance. But Alibaba is deep into this data-mining stuff. The report claims that last year the company set up a Big Data unit with 800 employees. It also quotes a Gartner factoid that currently less than 5% of ecommerce companies are using data analytics.
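The "younger women have less spending power" explanation is a classic confounder story, and it is easy to see how it could produce the Alibaba pattern. A hypothetical sketch (synthetic data, not Alibaba's; the age ranges and spending relationship are invented for illustration) in which a hidden variable, age, drives both cup size and spend, so the two correlate with no causal link between them:

```python
# Illustration of a confounder: age influences both reported size and
# spend, so size and spend correlate even though neither causes the other.
# All numbers here are made up for the sketch.
import random

random.seed(0)

rows = []
for _ in range(10_000):
    age = random.uniform(18, 60)
    # Assumed relationships, for illustration only:
    # older shoppers spend more and tend to report larger sizes.
    spend = 20 + 2 * age + random.gauss(0, 15)
    size = "C+" if age + random.gauss(0, 10) > 35 else "B"
    rows.append((size, spend))

avg = {s: sum(sp for sz, sp in rows if sz == s) /
          sum(1 for sz, _ in rows if sz == s)
       for s in ("B", "C+")}
print(avg)
```

In this toy setup the "C+" group's average spend comes out clearly higher than the "B" group's, driven entirely by age; which is exactly why a pattern mined from sales data, on its own, tells you nothing about why it holds.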
This morning’s Observer column about the Facebook ’emotional contagion’ experiment.
The arguments about whether the experiment was unethical reveal the extent to which big data is changing our regulatory landscape. Many of the activities that large-scale data analytics now make possible are undoubtedly “legal” simply because our laws are so far behind the curve. Our data-protection regimes protect specific types of personal information, but data analytics enables corporations and governments to build up very revealing information “mosaics” about individuals by assembling large numbers of the digital traces that we all leave in cyberspace. And none of those traces has legal protection at the moment.
Besides, the idea that corporations might behave ethically is as absurd as the proposition that cats should respect the rights of small mammals. Cats do what cats do: kill other creatures. Corporations do what corporations do: maximise revenues and shareholder value and stay within the law. Facebook may be on the extreme end of corporate sociopathy, but really it’s just the exception that proves the rule.
danah boyd has a typically insightful blog post about this.
She points out that there are all kinds of undiscussed contradictions in this stuff. Most if not all of the media business (off- and online) involves trying to influence people’s emotions, but we rarely talk about this. But when an online company does it, and explains why, then there’s a row.
Facebook actively alters the content you see. Most people focus on the practice of marketing, but most of what Facebook’s algorithms do involve curating content to provide you with what they think you want to see. Facebook algorithmically determines which of your friends’ posts you see. They don’t do this for marketing reasons. They do this because they want you to want to come back to the site day after day. They want you to be happy. They don’t want you to be overwhelmed. Their everyday algorithms are meant to manipulate your emotions. What factors go into this? We don’t know.
Facebook is not alone in algorithmically predicting what content you wish to see. Any recommendation system or curatorial system is prioritizing some content over others. But let’s compare what we glean from this study with standard practice. Most sites, from major news media to social media, have some algorithm that shows you the content that people click on the most. This is what drives media entities to produce listicles, flashy headlines, and car crash news stories. What do you think garners more traffic – a detailed analysis of what’s happening in Syria or 29 pictures of the cutest members of the animal kingdom? Part of what media learned long ago is that fear and salacious gossip sell papers. 4chan taught us that grotesque imagery and cute kittens work too. What this means online is that stories about child abductions, dangerous islands filled with snakes, and celebrity sex tape scandals are often the most clicked on, retweeted, favorited, etc. So an entire industry has emerged to produce crappy click bait content under the banner of “news.”
Guess what? When people are surrounded by fear-mongering news media, they get anxious. They fear the wrong things. Moral panics emerge. And yet, we as a society believe that it’s totally acceptable for news media – and its click bait brethren – to manipulate people’s emotions through the headlines they produce and the content they cover. And we generally accept that algorithmic curators are perfectly well within their right to prioritize that heavily clicked content over others, regardless of the psychological toll on individuals or the society. What makes their practice different? (Other than the fact that the media wouldn’t hold itself accountable for its own manipulative practices…)
Somehow, shrugging our shoulders and saying that we promoted content because it was popular is acceptable because those actors don’t voice that their intention is to manipulate your emotions so that you keep viewing their reporting and advertisements. And it’s also acceptable to manipulate people for advertising because that’s just business. But when researchers admit that they’re trying to learn if they can manipulate people’s emotions, they’re shunned. What this suggests is that the practice is acceptable, but admitting the intention and being transparent about the process is not.
An Observer essay on one of the obsessions of our times. Published today.
I’m collecting these for a talk I’m giving on Big Data’s obsession with correlation rather than causation.