Monday, November 12, 2012

A Practical Approach to Reading Signals in Data


Big data isn't new. A wealth of database information has been lying dormant in companies for years; only now do we have the technology to understand it. With new tools and methods, you can use that information to make better predictions (and decisions) about your business.

Here's one way to get started: Imagine a matrix (a spreadsheet, if you will). Each row captures what you know about one interaction. That interaction might be a sale or some other single activity (often referred to as a "case"). The columns are aspects of that interaction; sticking with the shopping analogy, columns might be how the customer paid, or at what time of day the sale occurred. These are known as "signals," because they may help predict some future target variable. That target might be, for example, what else a customer is likely to purchase, given a current purchase.

Depending on what you're measuring, it might look something like this:
[Image: datasignals580px.jpg]
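
For readers who think in code, here is a minimal sketch of such a matrix in Python. Every signal name and value below is made up for illustration; none of it comes from a real data set. Each dictionary is one case (a row), each key is a signal (a column), and one column is set aside as the target variable.

```python
# Hypothetical cases (rows) and signals (columns); all names and values are invented.
cases = [
    {"payment_method": "credit", "hour_of_day": 9,  "basket_total": 42.50, "bought_add_on": True},
    {"payment_method": "cash",   "hour_of_day": 13, "basket_total": 8.75,  "bought_add_on": False},
    {"payment_method": "credit", "hour_of_day": 20, "basket_total": 61.00, "bought_add_on": True},
]

# The column we would like to predict (an assumption made for this example).
TARGET = "bought_add_on"

# Every other column is a signal that may help predict the target.
signals = [name for name in cases[0] if name != TARGET]


def column(rows, name):
    """Return one signal (column) across all cases (rows)."""
    return [row[name] for row in rows]


def shape(rows):
    """Return (number of cases, number of signals)."""
    return len(rows), len(signals)


print(shape(cases))                  # cases x signals
print(column(cases, "hour_of_day"))  # one signal across all cases
```

Making the matrix "larger" then means one of two things: appending more dictionaries (cases) or adding more keys to each dictionary (signals).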

Your goal is to build this matrix and make it as large and as complete as possible. If you're like most companies, you likely already have some data in databases and in your web server logs. Where your data is incomplete is where the new methods can be applied. Finding, storing, and merging all this data requires computational horsepower, network capacity, and cheap storage; the rapid spread of smartphones is just one sign of how ubiquitous each of these has become over the past few years.

To get more accurate results, you'll need to expand your data set. There are two ways to scale up the amount of data you use to make better predictions:

First, you can add more cases. This is, for example, how retailers make sales inferences. Adding more cases (rows in your spreadsheet) reduces the influence of statistical outliers and random variance on your measurements, so you can be more confident in the outcome. A retailer typically has a lot of transaction data it can use to make inferences.
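
A rough way to see this effect is to simulate it. The sketch below assumes sale amounts are drawn from a normal distribution with a mean of 50 and a standard deviation of 20; those numbers are invented for the example. It estimates the average sale many times over and measures how much the estimates disagree with one another for a given number of cases.

```python
import random
import statistics


def spread_of_estimates(n_cases, n_trials=200, seed=0):
    """Standard deviation of the estimated average sale across repeated samples.

    Sale amounts are simulated (normal, mean 50, std dev 20); these
    parameters are assumptions chosen only for illustration.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        sales = [rng.gauss(50.0, 20.0) for _ in range(n_cases)]
        estimates.append(statistics.mean(sales))
    return statistics.stdev(estimates)


small = spread_of_estimates(10)    # few cases: estimates vary a lot
large = spread_of_estimates(1000)  # many cases: estimates cluster tightly

print(f"spread with 10 cases:   {small:.2f}")
print(f"spread with 1000 cases: {large:.2f}")
```

The spread should shrink roughly in proportion to the square root of the number of cases, which is why adding rows makes you more confident in what the existing columns tell you, but teaches you nothing that those columns do not contain.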

But the more powerful way is to add signals. Adding signals (columns) allows you to do two things: First, it can reveal new relationships, enabling new inferences; with a new variable, you may see a correlation in the data you never noticed before. Second, adding signals makes your inferences less subject to bias in any one individual signal. You add cases, keeping the same signals, to make your understanding of those variables better. In contrast, you add signals to make it possible to overcome errors in other signals you rely on.
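
The second point can also be sketched with a simulation. The toy model below is an assumption made for illustration, not a description of how any real scoring system works: each signal reports a hidden true value plus its own fixed bias plus random noise, and the estimate is a simple average of the signals.

```python
import random
import statistics


def mean_abs_error(n_signals, n_cases=500, seed=1):
    """Average error when estimating a hidden value from several flawed signals.

    Toy model (all assumptions): every signal has its own fixed bias and
    adds independent noise on each case; the estimate is a plain average.
    """
    rng = random.Random(seed)
    # One fixed, independent bias per hypothetical signal.
    biases = [rng.gauss(0.0, 1.0) for _ in range(n_signals)]
    errors = []
    for _ in range(n_cases):
        truth = rng.gauss(0.0, 1.0)
        readings = [truth + bias + rng.gauss(0.0, 1.0) for bias in biases]
        estimate = statistics.mean(readings)
        errors.append(abs(estimate - truth))
    return statistics.mean(errors)


one_signal = mean_abs_error(1)
many_signals = mean_abs_error(50)

print(f"error with 1 signal:   {one_signal:.2f}")
print(f"error with 50 signals: {many_signals:.2f}")
```

With a single signal, its bias never goes away no matter how many cases you collect. With many signals whose biases are independent, the errors tend to offset one another. If all the signals shared the same bias, averaging them would not help, so the independence assumption matters.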

Much of the discussion of big data has focused on adding cases; in fact, the common perception of "big data" is being able to track lots of transactions. But adding signals is more likely to transform a business. The more signals you have, the more new knowledge you can create. For example, Google uses hundreds of signals to rank web pages.

The evolution of underwriting (the process of judging loan eligibility) is another big data success story that's still being told. Historically, underwriting was done by someone who knew the applicant. Prototypically, a bank officer would make credit decisions based on an applicant's "character": which church they attended, which school their kids were in, and so on. Underwriting based on a credit officer's opinion drew on a rich perspective about the applicant, but it wasn't very scalable, since there are only so many loan officers in the world. And, of course, the officers were using a small number of signals, so there was systemic bias in the process.

In the early 1970s, Fair Isaac rose to global prominence as a provider of the standardized FICO score that supplanted much of the credit officers' role. The standardized score massively increased credit availability and thus lowered the cost of borrowing. However, FICO scores have their limits. The scores perform especially poorly for those without much information in their credit files, or those with relatively bad credit. It's not FICO's fault; it's the math they use. With fairly few signals in its models, the FICO score can't distinguish among levels of credit risk within a generally high-risk group.

The way to address this is to add more signals. For example, thousands of signals can be used to analyze an individual's credit risk. These can be anything from excess income available, to the time an applicant spent on the application, to whether an applicant's social security number shows up as associated with a dead person. The more signals used, the more accurate a financial picture a lender can get, particularly for thin-file applicants who need access to credit and likely don't have the traditional data points a lender analyzes.

Fair Isaac has millions of cases to use, but more signals give a better product. In this case, lenders that use more signals can produce a more thorough picture of an individual's credit risk than the industry-standard score does, and the result is lower-cost credit for a larger number of people.

As we used to say at Google, "Opinions are great, data is better." Big data is both cases and signals, but signals win in the end. Instead of spending your technology time and dollars to get additional cases, use these resources to get additional signals that allow you to find new relationships.

Douglas Merrill
