Friday, April 27, 2018

Setting Up Data Science Projects

At a high level, a data science project has six components in three classes:
  1. Objective 
    1. What is the intended benefit?
    2. How is the problem going to be solved?
  2. Data
    1. What data will be output?
    2. What is the data that will be coming in?
  3. Technical
    1. What will be the analytic approach?
    2. What code will be written?
These aren't steps. Projects often go back and forth between the various components. For instance, we'll realize that the code resources we have available won't make the solution we want possible, so we go all the way back to component 1.2 to see if there is a different way of solving the problem.

The components are ordered in terms of importance! Getting the right problem to solve is vastly more important than writing the best code.

However, almost all of the papers and talks and blogs concentrate on component 3.2. The least important component actually gets the most attention.

That's the problem. I don't know of a good way of getting better at the higher-value stages except to try to always understand what you are doing. For instance, radically changing component 3.1 may mean you are really changing the business problem being addressed; just make sure that is what you want to do.

Friday, April 20, 2018

Lehman's Law

JT Lehman has a great rule for setting up data science projects:

If we only knew _____, then we could do _____ and have impact ______. Fill in the blanks.

A lot of project problems take care of themselves if you've got the problem definition right.

Sunday, April 15, 2018

Estimating the 2018 House Elections

Nate Silver made a comment about the 2018 U.S. House of Representatives elections: he was expecting modest gains, but the long tail for the GOP was really bad. I want to investigate that idea. We don't have enough data to make a hard calculation, but what we can do is put some solidity around our intuitions.

I'm making a Monte Carlo simulation of the election. The natural way to think about the elections is in comparison with the 2016 Presidential election, so I'm going to start with the Clinton vote in each Congressional district. I'm using Clinton and not Trump because it is easier to think about things from the Democratic point of view.

In the recent special elections, the Democrats have been beating the 2016 Presidential vote by about 17 points on average. The Democrats have had an ~8 point lead over the GOP in the generic House tracker, so that leaves a bunch of gap to be explained.

Anecdotally, the Democrats have a marked enthusiasm gap. I have a vague memory of 4 percent, so let's go with that. Also, it seems to me that in the recent special elections the Democrats have been fielding pretty good, above-average candidates while the GOP has been fielding bad-to-terrible candidates. This makes sense to me; if I were an ambitious Democrat I would be looking to find a way to get into the game, whereas if I were an ambitious Republican I would be finding excuses to sit this one out. So let's give the Democrats a 4 point 'better candidates' boost. This won't apply to races where the GOP incumbent is staying in the race.

This gives us a baseline matching the recent special elections:

Democratic Advantages

Higher Approval Rating                   8 points
Higher Enthusiasm                        4 points
Better Candidates                        4 points
Total                                   16 points

which roughly matches the 17 points we are seeing from the special elections. Why am I breaking the 16 points down like this? Having three buckets makes it a lot easier to think about than having one big lump.

Other Effects

I'm adding a 6-point incumbent advantage. The quantity here is fairly arbitrary; I'd like a good way of getting a better handle on this number in this context. 

I also want to add uncertainty factors: one to represent uncertainty due to the national environment changing in the next seven months, and another for race-by-race factors. I'm treating both as normal effects with a mean of 0 and a standard deviation of 2.75. Why 2.75 (which is admittedly a weird number to use)? I want the Democrats to have a 99% chance of getting control of the House in the situation where the general election acts like the special elections, and calibrating the uncertainty to a standard deviation of 2.75 does that.

Incumbent Bonus                          6 points
National Uncertainty                     2.75 s.d.
By-District Uncertainty                  2.75 s.d.
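
As a rough sketch of how these pieces combine, here is one vectorized way to run the simulation. The district data is randomly generated purely for illustration (it is not the real 2016 data), and the function and argument names are assumptions made for this example. The key point is that the national shock is drawn once per simulated election and shared by every district, while the district shock is drawn separately for each race.

```python
import numpy as np

def simulate_dem_seats(clinton_pct, is_dem, retiring, dem_boost=8.0, dem_enthus=4.0,
                       candidate_factor=4.0, incumbent=6.0, spread=2.75,
                       sims=10000, seed=0):
    """Return an array holding the Democratic seat count of each simulated election."""
    rng = np.random.default_rng(seed)
    # The candidate bonus applies unless a GOP incumbent is staying in the race.
    cand = candidate_factor * np.where((is_dem == 1) | (retiring == 1), 1.0, 0.0)
    # Incumbency helps the sitting party: positive for Democrats, negative for the GOP.
    inc = (1 - retiring) * incumbent * (2 * is_dem - 1)
    base = clinton_pct + dem_boost + dem_enthus + cand + inc
    nat = rng.normal(0.0, spread, size=(sims, 1))                  # one shock per election
    dis = rng.normal(0.0, spread, size=(sims, clinton_pct.size))   # one shock per district
    result = base + nat + dis
    return (result > 50.0).sum(axis=1)

# Made-up districts standing in for the real data (an assumption for illustration only).
rng = np.random.default_rng(1)
clinton_pct = np.clip(rng.normal(48.0, 15.0, 435), 5.0, 95.0)
is_dem = (clinton_pct > 50.0).astype(int)
retiring = (rng.random(435) < 0.1).astype(int)

seats = simulate_dem_seats(clinton_pct, is_dem, retiring, sims=2000)
print("mean seats:", seats.mean(), "share of sims with a majority:", (seats >= 218).mean())
```

Because the national shock moves every district in the same direction at once, it is the main source of the wide tails in the seat count; the per-district shocks largely average out across 435 races.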

Mainline Results (2018 general matches special elections)

194 is the current number of Dem seats in the House; 218 is how many seats the Dems need to take control of the House.
So we are looking at about a 60-seat gain on average, and the long tail is pretty brutal for the GOP: a 100-seat loss is quite possible.

Other Options

An advantage of having a model like this is we can change the assumptions and see what the effect is.

Let's start by cutting the basic advantage in half.
Still pretty good: about an 80% chance of winning the House. Now let's cut the enthusiasm and candidate bonuses in half as well.
Not so good -- only a 40% chance of taking the House.
Let's try moving the base back up to 8 points, but keeping the enthusiasm and candidate factors at 2.
This is looking pretty good -- about an 88% chance of gaining the House.
This is telling us that it is all about keeping that base advantage; better candidates do not matter that much. 
Lastly, let's try to take away all the Dem advantages:
We get a House that looks a lot like what we got in 2016. This is a decent sanity check for the method.

The Actual Code

# coding: utf-8

# In[1]:

# The goal of this program is to get an idea of the possibilities #
# and spreads of the 2018 House election. We don't have enough    #
# data to make a real prediction, but we can make something that  #
# can give us an idea of the possibilities.                       #
# Recent special elections have been running in the Democrats'    #
# favor, typically doing 15-20 points better than the 2016        #
# Trump win. To get a better handle on the wins, we break the Dem #
# Advantage down into three chunks                                #
#     1) Basic favorability advantage; right now that is running  #
#        7-8 points in the Dems' favor                            #
#     2) Enthusiasm gap: It seems like Dems are getting to the    #
#        polls in much higher numbers; I have heard 4 points      #
#     3) Better candidates. It seems like the Dems have been      #
#        fielding much better candidates than the GOP in the      #
#        recent races; taking 4 points here makes the total       #
#        Dem advantage match their special election performance   #
# The 'better candidates' factor applies to campaigns where either#
# the Dem is the incumbent, or the GOP incumbent is not running   #
# for some reason.                                                #
# We also have a factor for incumbency. Right now it is set at 6  #
# points; this is out of thin air and a good way to improve       #
# the model is to get a better understanding of this factor.      #
# Lastly, we have two factors that represent uncertainty. One     #
# factor is a normal-distributed random number that applies to all#
# the races, representing changes in the national mood; the other #
# is a race-by-race random normal number. Right now both have mean#
# 0 and the same standard deviation. The spread was calibrated to #
# give the Democrats a 99% chance of winning the House under      #
# the most optimistic scenario I considered, which was            #
#     basic +8                                                    #
#     enthusiasm +4                                               #
#     candidate +4                                                #
# This corresponds to the results in the special elections so it  #
# is possible the GOP could do much worse.                        #

import pandas as pd
import numpy as np
import matplotlib as mp
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

# In[2]:


# In[3]:


# In[4]:

# This is where the parameters for the simulation get set
# spread = 3.25 goes with incumbent=4.0
# spread = 2.25 goes with incumbent=8.0
# spread = 2.75 goes with incumbent=6.0

demBoost = 8.0
demEnthus = 4.0
candidateFactor = 4.0
incumbent = 6.0
spread = 2.75
sims = 10000

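def demParty(x):
    # 1 if the seat is held by a Democrat, 0 otherwise
    if x=='Democratic Party':
        return 1
    return 0

def winf(x):
    # 1 if the Democratic result is above 50 percent, 0 otherwise
    if x>50.0:
        return 1
    return 0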

# In[5]:

# Running the election simulations
# candidateFactor does not apply if the incumbent is a Republican who is running.
# We are basing the results off of the Clinton percent in the 2016 election, so the incumbent
# effect for GOP candidates is negative.

winList = []
for j in range(sims):
    natRand = np.random.normal(0,spread,1)
    disRand = np.random.normal(0,spread,election.shape[0])
    # Second term assumed from the comment above: the factor also applies when the GOP incumbent is retiring
    election['candidateFactor'] = ( election['Party'].apply(demParty)*candidateFactor+
        (1-election['Party'].apply(demParty))*election['Retiring']*candidateFactor )
    election['incumbent'] = (1-election['Retiring'])*incumbent*(-1+2*election['Party'].apply(demParty))
    election['wins'] = election['result'].apply(winf)
    winList.append(election['wins'].sum())

# In[6]:

# Create the histograms of the simulations
font = {'weight' : 'bold',
        'size'   : 120}
mp.rc('font', **font)

plt.hist(winList, 20, density=False,facecolor='blue')
plt.xlabel('Democratic Wins; Green Line is 218, Red Line is 194',fontsize=200)
plt.ylabel('Simulations Out Of '+str(sims),fontsize=200)
title1='Democratic House Wins: Base Advantage '+str(demBoost)+' Enthusiasm \n'
title2= str(demEnthus)+' Candidate Factor '+str(candidateFactor)+' Incumbency ' + str(incumbent)+' Random '+str(spread)
plt.title(title1+title2)
plt.axvline(218,linewidth=40,color='xkcd:bright green')
plt.axvline(194,linewidth=40,color='red')
plt.show()


# In[7]:

# Democratic worst result

# In[8]:

# the fraction of simulations in which the Dems fall short of the 218 seats needed to take the House
len([x for x in winList if x < 218 ])/sims
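
For reading the simulation output without a chart, a small summary helper can be handy. This is only a sketch: the `summarize` name and the short list of seat counts are made up for illustration, standing in for the real `winList`.

```python
import numpy as np

def summarize(win_list, majority=218, current=194):
    """Summarize simulated Democratic seat counts."""
    wins = np.asarray(win_list)
    return {
        "mean_gain": float(wins.mean() - current),    # average seats gained over today
        "worst": int(wins.min()),                     # Democratic worst result
        "best": int(wins.max()),                      # Democratic best result
        "p5": float(np.percentile(wins, 5)),
        "p95": float(np.percentile(wins, 95)),
        "p_short": float((wins < majority).mean()),   # share of sims without a majority
    }

# Made-up seat counts, not real simulation output.
example = [200, 215, 218, 230, 254, 260, 271, 300]
summary = summarize(example)
print(summary)
```

The gap between the 5th and 95th percentiles is a simple way to put a number on how long the tails are.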

Saturday, November 14, 2015

Data Science - Live for the Learning Curve

I started doing data science in 1994. The tools I've used, in no particular order, are

 VSAM, JCL, SAS, SQL, PL/SQL, T-SQL, c-shell, Perl, R, Python, Java, Visual Basic, Tableau, Excel, flat files, Hadoop, HBase, Pig, Hive, DecisionSeries, AdminPortal

 That's a neat 21 tools in 21 years, and actually they are kind of obsolete already. There's a whole new data science paradigm emerging: companies selling algorithms as APIs. Send your data off in a web call, get a score back. Algorithms As A Service. Microsoft and Algorithmia come to mind.

 Whatever you're using now, wait a year: you'll have something new in your toolkit. If you're just getting through a data science course with R, Python, and Hadoop; well, that should keep you a couple of years.

 Live for the learning curve.

Sunday, August 16, 2009

Bozoing Measurements VII

A while ago I saw a consultant give a presentation. He had been given 20 campaigns to analyze. He spent a lot of time discussing the one campaign that was significant at the 5% level.
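
To put a number on why that is a problem, here is a quick back-of-the-envelope calculation. It assumes the 20 campaigns are independent and that none of them has any real effect; both are assumptions made for this sketch.

```python
def chance_of_false_positive(n_tests, alpha):
    """Probability that at least one of n independent null tests comes up significant."""
    return 1.0 - (1.0 - alpha) ** n_tests

n_campaigns = 20
alpha = 0.05

expected_false_positives = n_campaigns * alpha
p_at_least_one = chance_of_false_positive(n_campaigns, alpha)

print("expected significant campaigns by chance:", expected_false_positives)
print("chance of at least one:", round(p_at_least_one, 3))
```

Under those assumptions you expect about one "significant" campaign out of 20 from noise alone, and the chance of seeing at least one is roughly 64%.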

Sunday, July 12, 2009

Bozoing Campaign Measurements VI

And the hits keep coming.

This story involves a tracking database. The database was tracking long-running campaigns, where the process was that a customer 1) contacted the company via customer care and 2) at that point, was randomized on a by-campaign basis. Once there was customer activity, that customer was tracked for three months.

Here's where it gets tricky. On the next customer contact, the treatment group was given the pitch again if they still qualified, whereas the control group was automatically not given the pitch. That means in the treatment group the next contact generates a campaign-relevant data point, whereas in the control group it doesn't. Remember the three-month tracking? After three months, any control group customers are dropped out of the database, whereas treatment group customers that are still in contact with the company are still tracked. These are long-running campaigns. So the control group was composed of customers that had at most a three-month window to take the offer, whereas the treatment group had a potentially unlimited time to take the offer. What a clever way to make sure the results are excellent!

I was once reviewing an analysis of campaigns from this system. I was originally asked to make sure the t-test formula was right, and poked around in the data a little. I saw a weird thing: the campaign results were a linear function of the control group size. The smaller the control group, the better the results. I commented that they really shouldn't publish results until they had figured out what the Weird Thing was. Looking back, I can see how the database anomaly above could account for the effect. As time goes on, customers are going to be dropped out of the control group. Also, the treatment group will be given longer and longer to take the offer. So as time goes on, the control group numbers will fall and treatment group takes will rise.

So all the positive results that were being ascribed to the marketing system could have been due to the reporting anomalies.
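
A toy calculation shows how much lift the tracking windows alone can manufacture. This sketch assumes the campaign has no effect at all: every customer, treatment or control, takes the offer with the same small probability each month. The 2% monthly rate and the function name are made up for illustration.

```python
def take_rate(monthly_p, months):
    """Chance a customer takes the offer at some point within the tracking window."""
    return 1.0 - (1.0 - monthly_p) ** months

monthly_p = 0.02                      # hypothetical monthly take probability
control = take_rate(monthly_p, 3)     # control is dropped after three months

for months in (3, 6, 12, 24):
    treatment = take_rate(monthly_p, months)   # treatment keeps being tracked
    lift = treatment / control
    print(months, "months tracked: take rate", round(treatment, 4), "apparent lift", round(lift, 2))
```

With equal windows the apparent lift is exactly 1; the longer the treatment group is tracked, the better the campaign looks, even though nothing is happening.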

Saturday, June 27, 2009

Customers are Weird

Really, really weird.

Imagine a company with 2mm customers. Reasonable-sized, not huge.

How many people do you know well? Maybe 100 people? Think about the absolute weirdest person you know. That company has customers that are literally 100 times weirder than the weirdest person you know. In fact, they've got 200 of them.

It's a bad idea to think you know what customers are going to do without testing, measuring, and finding out.
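
The arithmetic behind the 200 figure, as a tiny sketch. It reads "100 times weirder" as "100 times rarer": the weirdest of 100 people is a one-in-100 person, so 100 times weirder is one in 10,000. That reading is an assumption made for this example.

```python
customers = 2_000_000        # the company's 2mm customers
people_known = 100           # people you know well

# Assumed reading: "100 times weirder" means one in (100 * 100) = 10,000.
rarity = people_known * 100
expected_extreme_customers = customers // rarity

print(expected_extreme_customers)
```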