Saturday, November 14, 2015

Data Science - Live for the Learning Curve

I started doing data science in 1994. The tools I've used, in no particular order, are

 VSAM, JCL, SAS, SQL, PL/SQL, T-SQL, c-shell, Perl, R, Python, Java, Visual Basic, Tableau, Excel, flat files, Hadoop, HBase, Pig, Hive, DecisionSeries, AdminPortal

 That's a neat 21 tools in 21 years, and a fair number of them are kind of obsolete already. There's a whole new data science paradigm starting up: companies selling algorithms as APIs. Send your data off in a web call, get a score back. Algorithms As A Service. Microsoft and Algorithmia come to mind.
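
The pattern is simple enough that a sketch shows it. This is a minimal example in Python, with a made-up endpoint, API key, and response shape -- every vendor's actual API will differ:

```python
# Hypothetical "Algorithms As A Service" call: POST your features, get a score back.
# The URL, auth scheme, and response format here are invented for illustration.
import requests

API_URL = "https://api.example-scoring-vendor.com/v1/score"  # made-up endpoint
API_KEY = "YOUR_API_KEY"

features = {"tenure_months": 18, "monthly_spend": 42.50, "support_calls": 3}

response = requests.post(
    API_URL,
    json={"features": features},
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
response.raise_for_status()
print(response.json())  # e.g. {"score": 0.73} -- whatever the vendor sends back
```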

 Whatever you're using now, wait a year: you'll have something new in your toolkit. If you're just getting through a data science course with R, Python, and Hadoop, well, that should keep you going for a couple of years.

 Live for the learning curve.

Sunday, August 16, 2009

Bozoing Measurements VII

A while ago I saw a consultant give a presentation. He had been given 20 campaigns to analyze. He spent a lot of time discussing the one campaign that was significant at the 5% level. Which, for 20 campaigns with no real effects, is exactly the one false positive you'd expect from chance alone.

Sunday, July 12, 2009

Bozoing Campaign Measurements VI

And the hits keep coming.

This story involves a tracking database. The database tracked long-running campaigns, where the process was: a customer 1) contacted the company via customer care, and 2) at that point was randomized into test or control on a by-campaign basis. Once there was customer activity, that customer was tracked for three months.

Here's where it gets tricky. On the next customer contact, the treatment group was given the pitch again if they still qualified, whereas the control group was automatically not given the pitch. That means in the treatment group the next contact generates a campaign-relevant data point, whereas in the control group it doesn't. Remember the three-month tracking? After three months, control group customers are dropped out of the database, whereas treatment group customers that are still in contact with the company keep being tracked. These are long-running campaigns. So the control group was composed of customers that had at most a three-month window to take the offer, whereas the treatment group had a potentially unlimited time to take it. What a clever way to make sure the results are excellent!
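
To make the damage concrete, here's a small simulation sketch (my numbers, not the company's): both groups have exactly the same true monthly take rate, but the control group is only observed for three months while the treatment group is observed for a year. The "lift" that falls out is pure measurement artifact.

```python
# Same true take rate in both groups -- the only difference is how long each
# group is observed. All rates and sizes are invented for illustration.
import random

random.seed(0)
MONTHLY_TAKE_RATE = 0.02   # identical in test and control: no real campaign effect
N_PER_GROUP = 50_000

def takes(months_observed):
    """Count customers who take the offer within their observation window."""
    count = 0
    for _ in range(N_PER_GROUP):
        for _ in range(months_observed):
            if random.random() < MONTHLY_TAKE_RATE:
                count += 1
                break
    return count

control_takes = takes(months_observed=3)     # dropped from the database after 3 months
treatment_takes = takes(months_observed=12)  # tracked as long as they keep showing up

print(f"control take rate:   {control_takes / N_PER_GROUP:.3f}")
print(f"treatment take rate: {treatment_takes / N_PER_GROUP:.3f}")
# The gap is entirely an artifact of the unequal observation windows.
```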

I was once reviewing analysis of campaigns from this system. I was originally asked to make sure the T-Test formula was right, and poked around in the data a little. I saw a weird thing: the campaign results were a linear function of the control group size. The smaller the control group, the better the results. I commented that they really shouldn't publish results until they had figured out what the Weird Thing was. Looking back, I can see how the database anomaly above could account for the effect. As time goes on, customers are going to be dropped out of the control group. Also, the treatment group will be given longer and longer to take the offer. So as time goes on, the control group numbers will fall and treatment group takes will rise.

So all the positive results that were being ascribed to the marketing system could have been due to the reporting anomalies.

Saturday, June 27, 2009

Customers are Weird

Really, really weird.

Imagine a company with 2mm customers. Reasonable-sized, not huge.

How many people do you know well? Maybe 100 people? Think about the absolute weirdest person you know. That company has customers that are literally 100 times weirder than the weirdest person you know. In fact, they've got 200 of them.
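
Reading "100 times weirder" as "100 times rarer", the back-of-envelope arithmetic goes like this:

```python
# The weirdest of the ~100 people you know well is roughly a 1-in-100 weirdo.
# "100 times weirder" read as "100 times rarer" puts the bar at 1-in-10,000.
people_you_know = 100
rarity_multiplier = 100
customers = 2_000_000

rarity = people_you_know * rarity_multiplier   # 1 in 10,000
print(customers // rarity)                     # 200 customers at least that weird
```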

It's a bad idea to think you know what customers are going to do without testing, measuring, and finding out.

Bozoing Campaign Measurements V

Here's a classic: toss all negative results.

Clearly, everything we do is positive, right?

Nope. Anything that can have an effect can have a negative effect.

(I've met a number of marketing people that really truly believe that people wait at home looking forward to their telemarketing calls. And that calling something 'viral' in a powerpoint is enough to actually create a viral marketing campaign).

There's another factor. Depressingly, a lot of marketing campaigns do absolutely nothing. Random noise takes over; half will be a little positive and half will be a little negative. Toss the negative results and you're left with a bunch of positive results. Add them up and suddenly you've got significant positive results from random noise. This is bad.

I've seen an interesting variant on this technique from a very well-paid consultant. Said VWPC analyzed 20 different campaigns and reported extensively on the one campaign that had results that were significant at a 5% level.
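
Both moves are easy to reproduce with nothing but noise. Here's a simulation sketch (group sizes and rates are invented): 20 campaigns with zero true effect, then 1) count how many clear "significant at 5%", and 2) see what the average lift looks like after tossing the negatives.

```python
# 20 campaigns where test and control have identical true response rates.
# All sizes and rates are invented for illustration.
import math
import random
import statistics

random.seed(1)
N = 10_000          # customers per group, per campaign
BASE_RATE = 0.05    # same true response rate in test and control

def simulate_campaign():
    test = sum(random.random() < BASE_RATE for _ in range(N))
    control = sum(random.random() < BASE_RATE for _ in range(N))
    lift = (test - control) / N
    pooled = (test + control) / (2 * N)
    z = lift / math.sqrt(pooled * (1 - pooled) * 2 / N)
    return lift, z

results = [simulate_campaign() for _ in range(20)]
lifts = [lift for lift, _ in results]

significant = [lift for lift, z in results if z > 1.645]   # one-sided 5% cutoff
positive = [lift for lift in lifts if lift > 0]

print("campaigns 'significant at 5%':", len(significant))        # about 1, by chance
print("mean lift over all 20:        ", statistics.mean(lifts))  # about 0
print("mean lift, negatives tossed:  ", statistics.mean(positive) if positive else 0.0)
```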

Sunday, June 7, 2009

Bozoing Campaign Measurements IV

Another installment in the "How to Bozo Simple Campaign Analysis" series. I've got a lot of them. It's amazing how inventive people get when it comes to messing up data.

Anyway, this is from a customer onboarding program. When the company got a new customer, they would give them a call in a month to see how things were going. There was a carefully held-out control group. The reporting, needless to say, wasn't test vs. control. It was "total control" vs. "the test group that listened to the whole onboarding message". The goal was to enhance customer retention.

The program directors were convinced that the "receive the call or not" decision was completely random; and given that it was completely random, the reporting should concentrate on only those that were affected by the program (that again -- it's amazing how often the idea comes up).

Clearly, the decision to respond to telemarketing is a non-random decision, and I have no idea what lonely neurons fired in the directors' brains to make them think it was. To start with, someone who is at home to take a call during business hours is going to be a very different population than people who go to work. More importantly, a person that thinks highly of a company is much more likely to listen to a call than someone who isn't that fond of the company.

Unsurprisingly, the original reporting showed a strong positive result. When I finally did the test/control analysis, the result showed that there was no real effect from the campaign.
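
A quick simulation sketch shows how "total control vs. the listeners" manufactures a result when the call does nothing at all. Every rate below is invented; the only mechanism is that customers who already like the company are more likely to listen, and also more likely to stay.

```python
# The onboarding call has zero effect on retention. Retention is driven only by
# baseline loyalty, and loyal customers are more likely to listen to the call.
# All rates are invented for illustration.
import random

random.seed(2)
N = 100_000

def simulate_customer(in_test_group):
    loyal = random.random() < 0.5                            # baseline attitude
    retained = random.random() < (0.9 if loyal else 0.7)     # loyalty drives retention
    listened = in_test_group and random.random() < (0.4 if loyal else 0.1)
    return retained, listened

test = [simulate_customer(True) for _ in range(N)]
control = [simulate_customer(False) for _ in range(N)]

def retention(rows):
    return sum(retained for retained, _ in rows) / len(rows)

listeners = [row for row in test if row[1]]
print("control retention:          ", round(retention(control), 3))
print("test retention (everyone):  ", round(retention(test), 3))       # matches control
print("test retention (listeners): ", round(retention(listeners), 3))  # fake 'lift'
```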

Sunday, May 31, 2009

Statistics and DBAs

Statistics and DBA work really are two different disciplines, although from the outside we're both numbers people. I've learned the hard way that there's a lot that I don't know about how to set up a database. Likewise, I've had some database people push some very strange ideas about how to do analysis.

Take random samples. Unless I can actually see the code used to make random samples, I'd rather do random sampling myself. My favorite example of the problem was "we randomly gave you data from California".
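
When I do it myself it looks something like this -- a sketch with a hypothetical `all_customer_ids` list, the point being a fixed seed and code you can actually read:

```python
# Take the random sample yourself, reproducibly, instead of trusting an opaque
# "random" extract. `all_customer_ids` is a stand-in for the full ID list.
import random

def sample_customers(customer_ids, n, seed=42):
    """Simple random sample of n customer IDs, reproducible via the seed."""
    rng = random.Random(seed)
    return rng.sample(list(customer_ids), n)

# Usage: pull every ID (cheap), then sample locally where the code is visible.
# sample_ids = sample_customers(all_customer_ids, n=50_000)
```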

Time sensitivity is another issue. I was doing a customer attrition study for a cell phone company. We wanted to look at attrition over a year, so we needed customer data from the start of the year to see how it related to attrition over the rest of the year. What happened was that the database people, instead of following our instructions, gave us customer data from the end of the year instead of the start.

Why? "Don't you want the most current data possible?" It's the nature of reporting to get the most current data possible for the report, and understanding statistical analysis that will often require data from the past is a little alien to that way of thinking.