Monday, March 17, 2008

Premiums from Credit Data: Going Wrong

The modeling effort ran into trouble. The models were drastically underperforming from what was anticipated. The team tried every modeling approach they could think of, with little success. Eventually the whole project budget was used up in this first unsuccessful phase with little to show for it. I was brought in at the end but couldn't help much.

There's a long list of things that went wrong.

The team forgot the project they were on. They were using approaches appropriate to marketing response models and they were working in a different world. Doing 40% better than random doesn't work well for marketing response models but here it meant we could improve the insurance company rate models by 40% which is fairly impressive. Before the project started the team needed to put serious thought into what success would look like.

The team let an initial step in the project take over the project. At the least, that initial step should have been ruthlessly time-boxed. Since that initial step wasn't directly on the path towards the outcome it should not have been in the project.

The team didn't do any data exploration. When I was brought onto the project near the end, one of the first things that I did was to look closely at the data. What I found was that over 10% of the file had under $10 in six-month premiums, and many other records had extremely low six-month premiums. In other words, a large chunk of the data we were working with wasn't what we think of as insurance policies.

This goes to an earlier point, that often DBAs know the structure of their data very well but often have very little idea of the distribution and informational content of their data. Averages, minimums, maximums, most of what we can get easily through SQL don't tell the story. One has to look closely at all the values and usually this means using specialized software packages to analyze data.

We got a second chance later, fortunately.

No comments: