Friday, April 27, 2018

Setting Up Data Science Projects

At a high level, a data science project has six components in three classes:
  1. Objective 
    1. What is the intended benefit?
    2. How is the problem going to be solved?
  2. Data
    1. What data will be output?
    2. What is the data that will be coming in?
  3. Technical
    1. What will be the analytic approach?
    2. What code will be written?
These aren't steps. Projects often go back and forth between the various components. For instance, we'll realize that the code resources we have available won't make the solution we want possible, so we go all the way back to component 1.2 to see if there is a different way of solving the problem.

The components are ordered in terms of importance! Getting the right problem to solve is vastly more important than writing the best code.

However, almost all of the papers and talks and blogs concentrate on component 3.2. The least important component actually gets the most attention.

That's the problem. I don't know of a good way of getting better at the higher-value stages except to try to always understand what you are doing. For instance, radically changing component 3.1 may mean you are really changing the business problem being addressed; just make sure that is what you want to do.

Friday, April 20, 2018

Lehman's Law

JT Lehman has a great rule for setting up data science projects:

If we only knew _____, then we could do _____ and have impact ______. Fill in the blanks.

A lot of project problems take care of themselves if you've got the problem definition right.

Sunday, April 15, 2018

Estimating the 2018 House Elections

Nate Silver (http://fivethirtyeight.com/) made a comment about the 2018 U.S. House of Representatives elections: he was expecting modest gains, but the long tail for the GOP was really bad. I want to investigate that idea. We don't have enough data to make a hard calculation, but what we can do is put some solidity around our intuitions.

I'm making a Monte Carlo simulation of the election. The natural way to think about the elections is in comparison with the 2016 Presidential election, so I'm going to start with the Clinton vote in each Congressional district. I'm using Clinton and not Trump because it is easier to think about things from the Democratic point of view.

In the recent special elections, the Democrats have been beating the 2016 Presidential vote by about 17 points on average. The Democrats have been holding an ~8 point lead over the GOP in the generic House tracker (https://projects.fivethirtyeight.com/congress-generic-ballot-polls), so that leaves a gap of about 9 points to be explained.

Anecdotally, the Democrats have a marked enthusiasm advantage. I have a vague memory of 4 percent, so let's go with that. Also, it seems to me that in the recent special elections the Democrats have been fielding pretty good, above-average candidates while the GOP has been fielding bad-to-terrible candidates. This makes sense to me; if I were an ambitious Democrat I would be looking to find a way to get into the game, whereas if I were an ambitious Republican I would be finding excuses to sit this one out. So let's give the Democrats a 4 point 'better candidates' boost. This won't apply to races where the GOP incumbent is staying in the race.

This gives us the following baseline

Democratic Advantages

Higher Approval Rating                   8 points
Higher Enthusiasm                        4 points
Better Candidates                        4 points
Total                                   16 points

which roughly matches the 17 points we are seeing from the special elections. Why am I breaking the 16 points down like this? Having three buckets makes it a lot easier to think about than having one big lump.

Other Effects

I'm adding a 6-point incumbent advantage. The quantity here is fairly arbitrary; I'd like a good way of getting a better handle on this number in this context. 

I also want to add uncertainty factors. I want one factor to represent uncertainty due to the national environment changing in the next seven months, and another uncertainty factor for race-by-race effects. I'm treating both as normal effects with a mean of 0 and a standard deviation of 2.75. Why 2.75 (which is admittedly a weird number to use)? I'm figuring I want the Democrats to have a 99% chance of getting control of the House in the situation where the general election acts like the special elections, and calibrating the uncertainty to a standard deviation of 2.75 does that.

Incumbent Bonus                          6 points
National Uncertainty                     2.75 s.d.
By-District Uncertainty                  2.75 s.d.
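
Before any randomness is added, each district's expected Democratic share is just the Clinton vote plus the adjustments above. Here is a minimal sketch of that arithmetic for a single district. The function name `expected_dem_share` and the 45% and 55% Clinton shares are made up for illustration; the point values are the ones from the tables.

```python
def expected_dem_share(clinton, dem_held, retiring,
                       base=8.0, enthusiasm=4.0, candidate=4.0, incumbent=6.0):
    """Expected Democratic vote share (in points) before any random noise."""
    share = clinton + base + enthusiasm
    # The 'better candidates' boost is skipped only when a GOP incumbent is running.
    if dem_held or retiring:
        share += candidate
    # Incumbency helps whichever party holds the seat, unless the incumbent retires.
    if not retiring:
        share += incumbent if dem_held else -incumbent
    return share


# Illustrative districts (hypothetical Clinton shares, not real data)
print(expected_dem_share(45.0, dem_held=False, retiring=False))  # GOP incumbent running
print(expected_dem_share(45.0, dem_held=False, retiring=True))   # open GOP seat
print(expected_dem_share(55.0, dem_held=True, retiring=False))   # Dem incumbent running
```

The three calls give 51, 61 and 77 points. A GOP incumbent who stays in the race is worth 10 points to the GOP in this setup: the 6-point incumbent bonus plus the 4-point candidate boost the Democrats do not get.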

Mainline Results (2018 general matches special elections)

The Democrats currently hold 194 seats in the House and need 218 to take control.
So we are looking at about a 60-seat gain on average, and the long tail is pretty brutal for the GOP: a 100-seat loss is quite possible.
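
Statements like "average gain" and "long tail" can be read straight off the list of simulated seat counts. A minimal sketch, assuming the seat counts are available as a plain list; the helper name `summarize_seats` and the four sample counts are hypothetical, not simulation output.

```python
import numpy as np


def summarize_seats(win_counts, current_seats=194, majority=218):
    """Summarize simulated Democratic seat counts."""
    wins = np.asarray(win_counts)
    return {
        "mean_gain": float(wins.mean() - current_seats),      # average seats gained
        "p_majority": float((wins >= majority).mean()),       # share of sims with control
        "worst_case": int(wins.min()),                        # worst Democratic result
        "p95_seats": float(np.percentile(wins, 95)),          # the GOP's bad tail
    }


# Made-up seat counts, only to show the shape of the output
print(summarize_seats([200, 220, 250, 294]))
```

The 95th percentile is one way to put a number on the long tail; any other high percentile would work just as well.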

Other Options

An advantage of having a model like this is we can change the assumptions and see what the effect is.
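
One way to make the scenarios easy to swap is to wrap the simulation in a function that takes the assumptions as arguments. The sketch below is a NumPy-only variant, not the notebook code shown later. The function name `simulate_dem_seats` is hypothetical, and the district table is synthetic (randomly generated, not real election data), so the probabilities it prints are not the ones quoted in this post.

```python
import numpy as np


def simulate_dem_seats(clinton, is_dem, retiring, base=8.0, enthusiasm=4.0,
                       candidate=4.0, incumbent=6.0, spread=2.75,
                       sims=10000, seed=0):
    """Return an array with the number of Democratic wins in each simulation."""
    rng = np.random.default_rng(seed)
    clinton = np.asarray(clinton, dtype=float)
    is_dem = np.asarray(is_dem, dtype=float)      # 1 if the seat is Dem-held
    retiring = np.asarray(retiring, dtype=float)  # 1 if the incumbent is not running
    # Candidate boost: Dem-held seats, plus GOP seats with a retiring incumbent
    candidate_adj = candidate * (is_dem + (1 - is_dem) * retiring)
    # Incumbency: + for Dem incumbents, - for GOP incumbents, 0 for open seats
    incumbent_adj = incumbent * (1 - retiring) * (2 * is_dem - 1)
    expected = clinton + base + enthusiasm + candidate_adj + incumbent_adj
    national = rng.normal(0.0, spread, size=(sims, 1))             # one draw per sim
    district = rng.normal(0.0, spread, size=(sims, clinton.size))  # one per race
    return ((expected + national + district) > 50.0).sum(axis=1)


# Synthetic stand-in for the district table (assumption: NOT real data)
data_rng = np.random.default_rng(42)
n = 435
clinton = np.clip(data_rng.normal(48.0, 15.0, n), 5.0, 95.0)
is_dem = (clinton > 50.0).astype(float)
retiring = (data_rng.random(n) < 0.1).astype(float)

scenarios = {
    "mainline": dict(base=8.0, enthusiasm=4.0, candidate=4.0),
    "half base": dict(base=4.0, enthusiasm=4.0, candidate=4.0),
    "half everything": dict(base=4.0, enthusiasm=2.0, candidate=2.0),
    "no advantages": dict(base=0.0, enthusiasm=0.0, candidate=0.0),
}
for name, params in scenarios.items():
    seats = simulate_dem_seats(clinton, is_dem, retiring, sims=2000, **params)
    print(name, "P(majority) =", (seats >= 218).mean())
```

Using the same seed for every scenario means they all see the same random draws, so differences between scenarios come only from the changed assumptions.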

Let's start by cutting the basic advantage in half.
Still pretty good, still about an 80% chance of winning the House. Now let's cut the enthusiasm and candidate bonus in half as well.
Not so good -- only a 40% chance of taking the House.
Let's try moving the base back up to 8 points, but keeping the enthusiasm and candidate factors at 2.
This is looking pretty good -- about an 88% chance of gaining the House.
This is telling us that it is all about keeping that base advantage; better candidates do not matter that much. 
Lastly, let's try to take away all the Dem advantages:
We get a House that looks a lot like what we got in 2016. This is a decent sanity check for the method.

The Actual Code


# coding: utf-8

# In[1]:


###################################################################
# The goal of this program is to get an idea of the possibilities #
# and spreads of the 2018 House election. We don't have enough    #
# data to make a real prediction, but we can make something that  #
# can give us an idea of the possibilities.                       #
# Recent special elections have been running in the Democrats'    #
# favor, typically doing 15-20 points better than the 2016        #
# presidential result. To get a better handle on the wins, we     #
# break the Dem advantage down into three chunks:                 #
#     1) Basic favorability advantage; right now that is running  #
#        7-8 points in the Dems' favor                            #
#     2) Enthusiasm gap: It seems like Dems are getting to the    #
#        polls in much higher numbers; I have heard 4 points      #
#     3) Better candidates. It seems like the Dems have been      #
#        fielding much better candidates than the GOP in the      #
#        recent races; taking 4 points here makes the total       #
#        Dem advantage match their special election performance   #
# The 'better candidates' factor applies to campaigns where either#
# the Dem is the incumbent or the GOP incumbent is not running for#
# some reason.                                                    #
# We also have a factor for incumbency. Right now it is set at 6  #
# points; this is out of thin air and a good way to improve       #
# the model is to get a better understanding of this factor.      #
# Lastly, we have two factors that represent uncertainty. One     #
# factor is a normal-distributed random number that applies to all#
# the races, representing changes in the national mood; the other #
# is a race-by-race random normal number. Right now both have mean#
# 0 and the same standard deviation. The spread was calibrated to #
# give the Democrats a 99% chance of winning the House under      #
# the most optimistic scenario I considered, which was            #
#     basic +8                                                    #
#     enthusiasm +4                                               #
#     candidate +4                                                #
# This corresponds to the results in the special elections so it  #
# is possible the GOP could do much worse.                        #
###################################################################

import pandas as pd
import numpy as np
import matplotlib as mp
import matplotlib.pyplot as plt


# In[2]:


election=pd.read_csv("C:/work/election2018/election2018v2.txt",sep='\t')


# In[3]:


election.head()


# In[4]:


# This is where the parameters for the simulation get set
# spread = 3.25 goes with incumbent=4.0
# spread = 2.25 goes with incumbent=8.0
# spread = 2.75 goes with incumbent=6.0

demBoost = 8.0
demEnthus = 4.0
candidateFactor=4.0
incumbent=6.0
spread = 2.75
sims = 10000

def demParty(x):
    if x=='Democratic Party':
        return 1
    else:
        return 0

def winf(x):
    if x>50.0:
        return 1
    else:
        return 0
    
winList=[]


# In[5]:


# Running the election simulations
# candidateFactor does not apply if the incumbent is a Republican who is running.
# We are basing the results off of the Clinton percent in the 2016 election, so the incumbent
# effect for GOP candidates is negative.

# Everything except the random draws is the same in every simulation, so compute it once.
isDem = election['Party'].apply(demParty)
election['demBoost']=demBoost
election['demEnthus']=demEnthus
election['candidateFactor'] = ( isDem*candidateFactor+
                                (1-isDem)*election['Retiring']*candidateFactor)
election['incumbent'] = (1-election['Retiring'])*incumbent*(-1+2*isDem)
election['baseline']=(election['Clinton']+election['demBoost']+election['demEnthus']+
                      election['candidateFactor']+election['incumbent'])

for j in range(sims):
    natRand = np.random.normal(0,spread)                    # one national swing shared by all races
    disRand = np.random.normal(0,spread,election.shape[0])  # an independent swing for each race
    election['result']=election['baseline']+natRand+disRand
    election['wins'] = election['result'].apply(winf)
    winList.append(election['wins'].sum())
    


# In[6]:


# Create the histograms of the simulations
font = {'weight' : 'bold',
        'size'   : 120}
mp.rc('font', **font)

plt.figure(figsize=(120,120))
plt.hist(winList, 20, density=False,facecolor='blue')
plt.xlabel('Democratic Wins; Green Line is 218, Red Line is 194',fontsize=200)
plt.ylabel('Simulations Out Of '+str(sims),fontsize=200)
title1='Democratic House Wins: Base Advantage '+str(demBoost)+' Enthusiasm \n'
title2= str(demEnthus)+' Candidate Factor '+str(candidateFactor)+' Incumbency ' + str(incumbent)+' Random '+str(spread)
plt.title(title1+title2,fontsize=200)
plt.axvline(218,linewidth=40,color='xkcd:bright green')
plt.axvline(194,linewidth=40,color='xkcd:red')

# Forward slashes avoid accidental backslash escapes in the Windows path
plt.savefig('C:/work/election2018/map5.png')


# In[7]:


# Democratic worst result
min(winList)


# In[8]:


# The fraction of simulations in which the Dems fall short of the 218 seats needed to take the House
len([x for x in winList if x < 218 ])/sims