Data science scratchpad

I spent the last computer-week working solely on data science, that mixed, applied practice of statistics, exploratory data analysis, machine learning, and so on. Despite my involvement in simple statistics, classification & prediction-oriented analysis, as well as the 'data science stack' - in Python or R - have been a blind spot.

A week is a tiny fraction of the time you'd actually need to spend to become competent at the tasks, but it was enough for me to get over some of the initial learning curve and get more familiar with the thought process.

I kept a journal, for myself, to solidify and experiment with ideas as well as vent without cluttering social media. Here's that journal, in unedited, super loose form. For the historical record, if nothing else: don't expect pithy or well-formed thoughts here. It's in chronological order, and I have a few thoughts at the end.

Day one, May 24

Finished a first tutorial on Kaggle. Quick observations:

Yep, there are a bunch of libraries I used, but they all played together really well.

Data Frames are the most unusual and new concept from a computer science standpoint - they're so aggressively Object Oriented and magical, and I'm surprised that they work across all of these libraries. seaborn, the charting library I used, was able to read them and create charts. Data Frames are basically:

Objects that contain typed data in table format
And let you manipulate that data both by column and row accessors, as well as slice it, get its head and tail
Each column also has fancy methods attached to the column object, so like:

# this replaces the 'Cabin' column with a version
# in which any N/A (missing) values are filled in with 'N'
df.Cabin = df.Cabin.fillna('N')
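To make the list above concrete, here's a minimal sketch of those data frame basics. The table is tiny and made up for illustration; it is not the real Titanic data, and the column names are just borrowed from it.

```python
import pandas as pd

# A tiny, made-up table; the tutorial loads train.csv instead.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "Age": [29, 41, 7, 63],
    "Cabin": ["C85", None, "E46", None],
})

first_two = df.head(2)     # the first rows
last_one = df.tail(1)      # the last rows
ages = df["Age"]           # column accessor (df.Age also works)
second_row = df.iloc[1]    # row accessor, by position
middle = df.iloc[1:3]      # slicing rows

# Column objects carry their own methods, like fillna:
df["Cabin"] = df["Cabin"].fillna("N")
print(df)
```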

I'm using Python for now. I have nothing more against R than most programmers have against R, and if faced with a problem for which there's a library in R, I'll do the reasonable thing and use R.

A pleasant surprise was how classifiers can be easily swapped in and out, with only changing parameters. I tried a Random Forest Classifier first, as was the default in the Python tutorial, and then tried a Bagging Classifier (not much better), SVM (worse), Gradient Boosting (better!), and Ada Boost (same-ish). I've been reading the documentation for all of these, and am kind of familiar with the concepts.
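Roughly what that swapping looks like, as a sketch: the data here is synthetic (from make_classification) rather than the Titanic set, and the parameters are arbitrary picks of mine, not the tutorial's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data; the journal used the Titanic training set.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

classifiers = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "bagging": BaggingClassifier(random_state=0),
    "svm": SVC(),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "ada boost": AdaBoostClassifier(random_state=0),
}

# Every classifier exposes the same fit/predict interface, so the
# evaluation code does not change when the model does.
scores = {
    name: cross_val_score(clf, X, y, cv=5).mean()
    for name, clf in classifiers.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The scores on synthetic data say nothing about which classifier wins on Titanic; the point is only that the loop body never changes.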

Brings me to one of my feeling thoughts: not quite a thought or a feeling! Most of the time in programming world, I spend a good amount of time reaching under the hood and hoping to understand how things work. So, writing simple statistics was really educational, and I spend a good amount of time diving into the internals of whatever I am relying on, like React or Node.

I'm sure that this happens in data science land too, but it seems less like a norm than in computer science. The language-switching is definitely a little odd, the fact that most R libraries are written in C or C++ or FORTRAN, thus kind of discouraging developers from reading them. For good reason they're written in faster languages - to be fast - but nonetheless the concept of a 'black box', in the bad way, comes to mind with a lot of the documentation I read and the way that people make decisions.

ANYWAY

So in terms of process I'm really trying to get a basic idea of the decision tree (get it) and standard workflow of a classifier problem. Maybe it's also the workflow of other data science problems, but anyway, let's get down to it - as far as I can tell:

Exploratory data analysis

You know, like d3. This usually happens first, because you're going to mess up the data and replace, for instance, human-readable string columns with machine-friendly numbers pretty soon. In Python land we (I) (maybe just me) do this with seaborn, a Python library that works like Matplotlib but gives a higher-level interface, so you can tell it 'bar chart' and it does it. I guess Matplotlib is like d3 and seaborn is like one of the 1,000 d3 abstraction modules.

Data munging (non-strings)

In the tutorial I followed, the data munging was for two reasons at least:

Jeff, in his explanation, says that he transforms the age column, which contains integers in the 2-100 (ish) range into age groups like 'Young Adult' in order to avoid overfitting. It looks like this is correct, but just barely.
- With age classes: 0.8286580594679187
- Without age classes: Mean Accuracy: 0.8258215962441314
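For reference, a sketch of how that kind of age grouping can be done with pandas' cut function. The bin edges and most of the labels are made up for illustration; I'm not claiming they match the tutorial's.

```python
import pandas as pd

ages = pd.Series([4, 15, 22, 31, 47, 70], name="Age")

# Bin edges and labels are assumptions, chosen for illustration.
bins = [0, 12, 18, 35, 60, 120]
labels = ["Child", "Teenager", "Young Adult", "Adult", "Senior"]

# pd.cut assigns each number to the interval it falls in;
# intervals are right-inclusive by default, e.g. (18, 35].
age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.tolist())
```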

From the tutorial:

"Each Cabin starts with a letter. I bet this letter is much more important than the number that follows, let's slice it off."

I wonder about bets like this: did they test both possibilities? Is it useful to?

Let's try the opposite:

Just cabin groups: 0.8286580594679187
Full cabin information: 0.8314162754303599

Hey, look at that extra decimal point! Chill.

Sidenote:

Jupyter notebooks are cool! It took me a while to realize that In[*] means it's working. You can scroll down and see all the cells being populated.

I don't really know about this generation of 'useful' and non-useful values: would love to have a more scientific way of saying, for instance, whether Embarked was important or not.

Anyway:

With Embarked: 0.8300665101721441
Without: 0.8314162754303599

I kind of thought that Embarked would give information about the economic class of passengers, making their survival status a little easier to decide. Anyway.

That ramble went on for a bit.

So you do a bunch of munging

Converting continuous variables like "all integers from 2-100" into groups, like quartiles or age groups
Filling in N/A values with other values so that they don't trip up algorithms
Dropping columns that aren't important

And then you do another stage of data munging: labelling.

Labelling

Maybe that isn't the right word: basically what d3.scaleOrdinal() does. You take a bunch of string values, and map each unique value to a number from 0 to ∞, and then you replace them. This makes the dataset more computer-friendly, and less human-friendly, but that should be okay because you did your EDA (Exploratory Data Analysis) previously.
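A small sketch of that mapping, assuming scikit-learn's LabelEncoder is the tool doing it. The values are made-up stand-ins for something like the Embarked column.

```python
from sklearn.preprocessing import LabelEncoder

embarked = ["S", "C", "Q", "S", "C"]

encoder = LabelEncoder()
codes = encoder.fit_transform(embarked)

# classes_ holds the sorted unique values; each value's code is its
# position in that sorted list.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the numbers back to the original strings.
restored = encoder.inverse_transform(codes)
```

Note that the numbers come from alphabetical order, not from any meaning in the data, so "C" < "Q" < "S" is an accident of spelling.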

Creating the classifier

Oh, man, classifiers. They're so 'easy to use' programming-wise, and I really wonder about their nuances, and how people reason about them. I tried:

GradientBoostingClassifier: since this was pointed out with the XDG
BaggingClassifier: idk
SVM: a support vector machine! Kind of familiar with the concept behind this one, though in a super high-level fashion.
RandomForestClassifier: the one that the tutorial had, the one I started out with. I sort of understand this one.

But I kind of missed something: in order to train these classifiers, you make a table of all the columns except the one you want to predict, and then a column of what you want to predict, and you use 'fit' with (all the columns that might predict a score, the column you want to predict).
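As a sketch, with a made-up and already-numeric table standing in for the munged Titanic data, that split-then-fit step looks something like this:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up rows; the values are invented for illustration.
data = pd.DataFrame({
    "Pclass":   [1, 3, 3, 1, 2, 3, 2, 1],
    "Sex":      [0, 1, 1, 0, 0, 1, 1, 0],
    "Age":      [38, 22, 35, 54, 27, 4, 30, 58],
    "Survived": [1, 0, 0, 1, 1, 0, 0, 1],
})

X = data.drop(columns=["Survived"])   # every column except the target
y = data["Survived"]                  # the column to predict

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X, y)

predictions = clf.predict(X)          # one 0/1 guess per row
```

Predicting on the same rows you trained on, as done here for brevity, tells you very little about how good the model is; that's what held-out data is for.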

From there on, it's all a numbers game: trying to figure out how good your prediction is on the test data, on similar-ish test data, basically, that.

Day two, May 25

I'm starting a house prices kaggle as #2. And this Python notebook as a start. This one's regression instead of classification: where the survival thing was an algorithm trying to guess 0 or 1 (categories), this is an algorithm trying to guess a value (linear) from a bunch of attributes.

This tutorial uses matplotlib to do a scatter plot. I'm going to use seaborn to make it harder. Python plots don't look very good.

The 'scatter plot' options in seaborn all look like they're for specialized other kinds of data, not the linear-in-two-dimensions kind I'm used to. Looks like lmplot is what I want.

Reconnecting to the server sucks. I will try to set up a Jupyter notebook locally soon. Since kaggle makes them Docker images, that'll be two learning-birds with one stone.

Well, got it to work, and it is, as expected, fewer lines than the matplotlib version and looks a little better

sns.lmplot(data=train, x="SalePrice", y="GrLivArea")

Damn, seaborn's ability to do small multiples is excellent. For the lmplot, you just add col= and it finds all the variations and displays them all. Is this what d3 abstractions are like? Well, no - I think that DataFrames make the Python world so much better than the convoluted JavaScript world, in this case.

Man, Data Frames are crazy, like

train[train.GrLivArea < 4000]

Breezily filters away buildings with more than 4,000 sq ft of living space, which in JavaScript would be like

train = train.filter(x => x.GrLivArea < 4000);

Even in the simplest case, and more convoluted if you don't want to be an arrow-function hipster.

Though the peace really falls apart in the data munging section here too, it has like

train.loc[:, "BsmtQual"] = train.loc[:, "BsmtQual"].fillna("No")
train.loc[:, "BsmtCond"] = train.loc[:, "BsmtCond"].fillna("No")
train.loc[:, "BsmtExposure"] = train.loc[:, "BsmtExposure"].fillna("No")
train.loc[:, "BsmtFinType1"] = train.loc[:, "BsmtFinType1"].fillna("No")
train.loc[:, "BsmtFinType2"] = train.loc[:, "BsmtFinType2"].fillna("No")
train.loc[:, "BsmtFullBath"] = train.loc[:, "BsmtFullBath"].fillna(0)
train.loc[:, "BsmtHalfBath"] = train.loc[:, "BsmtHalfBath"].fillna(0)

That whole 'many lines of repetitive imperative code' feeling. Which isn't great. For these sections, I'm using my refactoring impulses: basically sorting them by the things that they do "these 10 statements replace a variable with 0" and then considering how to shorten them by basically iterating through data rather than splitting lines over lines. It's probably silly to refactor here, but Tom wants to.

I switch to vim (neovim) for these parts. Refactoring without vim pains me. Also the Kaggle notebook has that thing where it autoinserts the paired ] when you type [, which is just a modern version of Clippy and I never ever want a computer to type for me.

Wait seriously, python for loop values really become semi-global like if you do

for myval in ['A']:
	print(myval)

print(myval)

It just escapes, it isn't scoped? JavaScript is only as bad as everything else out there.

Okay, refactored that into

string_replacements = {
  "TA": ["HeatingQC", "KitchenQual", "ExterCond", "ExterQual"],
  "No": ["GarageQual", "GarageCond", "GarageFinish", "GarageType",
         "MiscFeature", "PoolQC", "FireplaceQu", "Fence", "BsmtQual",
         "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"],
  0: ["GarageArea", "GarageCars", "BsmtFullBath", "BsmtHalfBath",
      "EnclosedPorch", "Fireplaces", "BsmtUnfSF", "HalfBath",
      "KitchenAbvGr", "MasVnrArea", "ScreenPorch",
      "TotRmsAbvGrd", "BedroomAbvGr", "OpenPorchSF", "PoolArea",
      "WoodDeckSF", "MiscVal", "LotFrontage"],
  "Norm": ["Condition1", "Condition2"],
  "N": ["CentralAir", "PavedDrive"],
  "Typ": ["Functional"],
  "Reg": ["LotShape"],
  "None": ["MasVnrType", "Alley"],
  "Normal": ["SaleCondition"],
  "AllPub": ["Utilities"]
}

for replacement, fields in string_replacements.items():
    for field in fields:
        train.loc[:, field] = train.loc[:, field].fillna(replacement)

Not super abstract, but clean enough to make me happy. I'm kind of surprised that there isn't an abstraction for this: it seems like the messiest part of most of these examples. But messy longform code is also pretty easy to tinker with and debug so many people don't want concise refactored code.
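For what it's worth, pandas' DataFrame.fillna does accept a dictionary of column name to fill value, which gets close to an abstraction for this. A sketch, using a made-up three-column slice of the housing data and a trimmed version of the dictionary:

```python
import numpy as np
import pandas as pd

# A made-up slice of the housing data with some holes in it.
train = pd.DataFrame({
    "BsmtQual": ["Gd", np.nan, "TA"],
    "KitchenQual": [np.nan, "Gd", "TA"],
    "GarageArea": [548.0, np.nan, 608.0],
})

# Same shape as the dictionary in the journal, trimmed to three columns.
string_replacements = {
    "TA": ["KitchenQual"],
    "No": ["BsmtQual"],
    0: ["GarageArea"],
}

# Invert {replacement: [columns]} into {column: replacement} ...
fill_values = {
    field: replacement
    for replacement, fields in string_replacements.items()
    for field in fields
}

# ... and let fillna handle every listed column in one call.
train = train.fillna(value=fill_values)
print(train)
```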

Super confused about this tutorial switching up 1 indexing with 0 indexing. I know that to the regression it all looks the same, it's like that terrible scene in the Matrix, but nonetheless why. Like why would January be 1 but GarageCondition=No be 0. Why. Also could we be using a LabelEncoder here instead of manually reassigning each column to a new value? Maybe since months are ordered it matters?

Debugging these Kaggle scripts is not great. The environment doesn't show line numbers, so you have to go on context, and the Pandas errors don't show you which column in a long list of .replace({}) column replacements is the one with the problem.

Okay lol what, when you run a table cell, it can run them semi-out-of-order, like if a table cell drops something 'in place' then running it multiple times can throw an error because it's already dropped. I was sort of imagining Jupyter notebooks as being basically immutable and in-order, but maybe totally not? This seems... well, wrong?

Okay, I think my approach for learning from this one will be

Stop simplifying and improving data at the point I'm at - basically the 'removing things that will totally break it' and before the 'improving its edibility' stage.
See how it goes
Add back those improvement steps one by one and see how much if any they improve the score.

New lingo discovered! One-hot encoding. Not much to this lingo: it's a binary value with one 1. That's it.
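A quick sketch of one-hot encoding using pandas' get_dummies, on a made-up column:

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# get_dummies makes one column per unique value; each row gets a
# single 1 in the column matching its original value.
one_hot = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)
print(one_hot)
```

Unlike mapping strings to 0, 1, 2, this doesn't imply any ordering between the values.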

Okay so this tutorial introduces some words and concepts that have been floating around for a while and I am very excited to finally understand!

Loss function
Regularization
Overfitting (I kind of know this one already)
Regression (I definitely know this but only really in the low-dimension simple variations, not the fancy ones)
Residuals (what are they!?)

It's funny how pretty much all of this tutorial was legible for a little while but then BAM in comes the TERMS

"I'm not getting nearly the same numbers in local CV compared to public LB." "Standardization cannot be done before the partitioning."

The level of wat for the non-educated really ramps up quickly. Moving on.

Running list of acronyms

RMSE: Root Mean Squared Error: the square root of the mean of the squared errors. The error is the difference between expected and actual values; each error is squared, the squares are averaged, and the result is square-rooted.
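Spelled out by hand as a sketch (the numbers are made up): square the errors, take the mean, then take the root.

```python
import numpy as np

def rmse(actual, predicted):
    """Square the errors, average them, then take the square root."""
    errors = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# errors are -1, 0, 2; their squares are 1, 0, 4; the mean is 5/3.
print(rmse(actual, predicted))
```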

May 31

BACK AT IT, after a lovely wedding weekend.

I had an AHA moment this weekend, the kind you have after learning two things and then kind of connecting them slowly.

So, I was pestering someone in the scipy IRC channel about why y_train was always written in lowercase and X_train was written in uppercase. That kind of little thing tends to burrow into my brain and make me waste time wondering.

Anyway, the idea is that y_train is a vector (a 1-dimensional array) and X_train is a data frame / matrix (a 2-dimensional, table-like array). Which makes sense to have a naming convention - data frames are much more advanced and have a much different interface than vectors.

But beneath that, the naming convention - y and x - finally locked in the idea that, even though multi-dimensional regression is so much more magic than linear regression, it's kind of, sort of, comparable. Which finally gives me something of a mental image:

Like: you can, if you really want to, sort of imagine multiple regression as having a really complicated X-axis but otherwise trying to do the same kind of thing as linear regression. Of course, this is wildly sort of inaccurate - multidimensional regression is really complicated and the algorithms we're using here are complicated enough that few folks are really trying to open up the black box.

But nevertheless some sort of mental picture of the input, mechanism, and desired result really helped me feel more grounded in this search.

I keep stumbling on the other tutorial about this dataset and the extended analogy... I can't really cope with that sort of extended analogy.

Ugh, the Python notebook really makes me kind of angry about how it violates programming ideas. Like the lines

# Handle remaining missing values for numerical features by using median as replacement
print("NAs for numerical features in train: %d" % train_num.isnull().values.sum())
train_num = train_num.fillna(train_num.median())
print("Remaining NAs for numerical features in train %s" % train_num.isnull().values.sum())

If you run it once, it produces

NAs for numerical features in train: 81
Remaining NAs for numerical features in train 0

Then if you re-run it, it produces:

NAs for numerical features in train: 0
Remaining NAs for numerical features in train 0

Even worse, if you write something that's not idempotent, like

numerical_features = numerical_features.drop("SalePrice")

Running it twice causes it to crash the second time.
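A sketch of the problem and one possible workaround. This assumes numerical_features is a pandas Index of column names, which is a guess from the drop call above; the two-column frame is made up.

```python
import pandas as pd

train = pd.DataFrame({"GrLivArea": [1710, 1262], "SalePrice": [208500, 181500]})
numerical_features = train.columns

# The first run works; running the same line again raises KeyError,
# because "SalePrice" is already gone.
numerical_features = numerical_features.drop("SalePrice")
try:
    numerical_features.drop("SalePrice")
    second_run_failed = False
except KeyError:
    second_run_failed = True

# errors="ignore" makes the drop safe to repeat.
numerical_features = numerical_features.drop("SalePrice", errors="ignore")
numerical_features = numerical_features.drop("SalePrice", errors="ignore")
print(list(numerical_features))
```

This only papers over the one cell, of course; the notebook's state is still mutable underneath.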

I'm a newcomer to these lands so I don't want to assume too much, and I also really appreciate the way that this implementation approach probably is memory-efficient and simple to implement, but:

Could it instead store a stack of environments, rather than one environment that changes mutably? That way you could 'pop' the stack to run each script on the state before it runs?
Or, ideally, in a fancy immutable world, the programming language itself would allow for these changes to be not-in-place by default?

Back to the issue at hand.

On the 'decoding jargon' front:

Loss function: sounds so simple. Also called cost function. As far as I can tell, it is the badness indicator: it gives you a number that says how far your estimate is from reality. In least squares linear regression, the loss function is the sum of the squared differences between data and the line.

I think. It's hard to tell, because the errors and residuals page doesn't mention 'loss function', and neither does simple linear regression.
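Here's a sketch of that reading, with made-up points: the residuals are the data minus the line, and the loss is the sum of their squares. np.polyfit with degree 1 finds the least-squares line, so nudging that line in any direction should only make the loss bigger.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def squared_loss(slope, intercept):
    """Sum of squared residuals for the line y = slope * x + intercept."""
    residuals = y - (slope * x + intercept)   # data minus the line
    return float(np.sum(residuals ** 2))

# Degree-1 polyfit returns the least-squares slope and intercept.
best_slope, best_intercept = np.polyfit(x, y, 1)

best_loss = squared_loss(best_slope, best_intercept)
worse_loss = squared_loss(best_slope + 0.5, best_intercept)
print(best_loss, worse_loss)
```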

Math Wikipedia, may I count the ways in which you infuriate me: you're reading the first paragraph of the article, which was copyedited by some kind soul, and then skip to the meat, and every sentence begins with "Consider that" or "If we assume" or "Formally" and you know that you're lost.

I know about that whole 'allergic to math' thing, and know so many people who claim it, or even if I mention that I work with computers, they immediately tell me that they're hopeless at numbers and such. Which, of course, is super sad - the learned helplessness, the gendered nature of the whole thing.

But gosh, math makes itself so easy to hate with its own special form of English and its obsession with symbols and rhetorical approach. Even if you like the substance, which I do - enough to do this all in my free time, if you're allergic to the cultural markers, you're going to have an allergy attack every time you try to Wikipedia-learn something, because there is no rest, no mercy granted to those who don't want to wade through the passive-voice trash that Wikipedia-about-math is written in.

End rant. Back to the Kernel.

So, the tutorial I'm following tries out three algorithms for the regression, but I am, honestly, sort of not interested in digging into three things - I'd rather learn one a little more deeply. Also, Kaggle just lost my work and doesn't have any sort of 'revision' saving.

In the interest of laziness, I also sorted through the forks of the juliencs regression to find one that created output - something the original doesn't do.

So, here's where I have something of a question: this 'exploratory data analysis' and all this data-munging really modified the training dataset significantly. And then LassoCV uses that munged data to figure out how much each feature (column, basically) affects the output and how. Will I have to do all the same preprocessing on the test data test.csv as I did with train.csv? The answer is.. yes.

Oh, and - hey, just found out about TPOT, which kind of answers my question from earlier about parameter optimization, which seemed totally fiddly and unscientific in the Kaggle examples I found. TPOT apparently automatically turns tons of knobs to choose parameters and algorithms that give good answers to certain problems. It's wild, and one of the first examples I've found of genetic programming that works. I guess pngcrush is kind of like that, in a way.

Back to the issue: actually generating output. So this is where people indent all of the data-munging code and turn it into functions.

Kernel stopped, restarted, lost work. This will be the last time I work in an online kernel.

Kernel stopped again. Definitely installing a local notebook next.

The Wikipedia page on Regularization is pretty good! The first example explains it really well and drives at one of the core things about these learning techniques:

You work with the training set, and want to create a model that describes the basic workings of the data, the real-world implication that, say, horsepower has on the mpg of engines.
Your loss function might encourage you to describe the training set too well - basically just model it, thus "overfitting", in ways that don't have any real-world function and will actually reduce the model's accuracy on other data sets, especially the test data set.
Regularization is something you add to encourage a less 'spiky' or detailed model, which tends to be a model that works better on real-world data. It's like an un-overfitting. It's a model tailor that lets out overtight curves.
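A sketch of the effect, on synthetic data where only one of ten features really matters. I'm using scikit-learn's Ridge here because its shrinking effect is easy to check; it's a swapped-in stand-in for the LassoCV that the tutorial uses, and alpha=10.0 is an arbitrary pick.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# 20 rows, 10 features, but only the first feature really matters.
X = rng.normal(size=(20, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=20)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # alpha sets the penalty strength

# The penalty pulls the coefficients toward zero, so the regularized
# model has less room to chase noise in the training rows.
plain_size = float(np.linalg.norm(plain.coef_))
ridge_size = float(np.linalg.norm(ridge.coef_))
print(plain_size, ridge_size)
```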

June 1

I've gotten halfway through installing the Kaggle docker image a few times, and am finally on the home stretch, I think. This is my first experience with Docker, so I'm a million miles away from forming any authoritative opinions, but I do have a sneaking suspicion that 'running IPython' with Docker is kind of using a Hasselblad to take a selfie. The download size alone is a lot: it's been so long since anything required >1GB of space on disk to run. Maybe LaTeX or Haskell did, but those are unusual.

I'm going to install locally, since I'm nearly 100% sure that a fraction of the Docker image's real size is required to just run IPython.

$ brew cask install anaconda

The Docker image is... 15GB. There's probably a bunch of Linux in there. Maybe LibreOffice? Just kidding, I hope.

Oookay, so Docker (VirtualBox) spiked memory up to 5GB, which increased memory pressure to 100%, causing my system to fall over. I'll have to wait until I buy a big scary desktop local NAS/processing machine, I guess.

So, instead I'm going to run virtualenv with Jupyter in it. Which was actually kind of a breeze. No big deal.

I wrote this tiny lil function to make data munging code cleaner:

def apply_munge(fn):
    global train
    global test
    train = fn(train)
    test = fn(test)

Basically it turns

train = foo_fn(train)
test = foo_fn(test)

Into

apply_munge(foo_fn)

Which has some refactoring advantages:

If there are >2 datasets, we can adjust that in one place.
If we want to compose methods, we can adjust apply_munge to take varargs.
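A sketch of what that varargs version could look like. This is a variation of mine, not the function above: it takes the datasets as an explicit list and returns new ones rather than reassigning globals, and the two munging steps are hypothetical.

```python
import pandas as pd

def apply_munge(datasets, *fns):
    """Run each function, in order, over every dataset; return the results."""
    results = []
    for df in datasets:
        for fn in fns:
            df = fn(df)
        results.append(df)
    return results

# Two hypothetical munging steps, each returning a new frame.
def fill_garage(df):
    return df.fillna({"GarageArea": 0})

def drop_id(df):
    return df.drop(columns=["Id"])

# Made-up stand-ins for train.csv and test.csv.
train = pd.DataFrame({
    "Id": [1, 2],
    "GarageArea": [548.0, float("nan")],
    "SalePrice": [208500, 181500],
})
test = pd.DataFrame({"Id": [3], "GarageArea": [float("nan")]})

train, test = apply_munge([train, test], fill_garage, drop_id)
```

Adding a third dataset is then just a longer list, and composing steps is just more arguments.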

Hm, interesting nitpicky thing: the kernels have two datasets:

train.csv - subset of data with results (in this case, home prices) so you can validate your algorithms
test.csv - set of data without results that you submit joined with your predictions

Kinda confusingly, the tutorial has a line like:

X_train, X_test, y_train, y_test = train_test_split(train, y, test_size = 0.3, random_state = 0)

In which X_test is a bit of data split out, kind of reserved off in a corner so you can train and test on different 'splits' of the data. But it isn't the test set.
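To see the distinction, a sketch with ten made-up labelled rows standing in for train.csv:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for train.csv: 10 labelled rows.
train = pd.DataFrame({"GrLivArea": np.arange(10) * 100 + 1000})
y = pd.Series(np.arange(10) * 10000 + 100000, name="SalePrice")

X_train, X_test, y_train, y_test = train_test_split(
    train, y, test_size=0.3, random_state=0
)

# 7 of the labelled rows are used for fitting and 3 are held out for
# checking the fit. Both pieces come from train.csv; test.csv is untouched.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

The shapes also show the naming convention: the uppercase X pieces are two-dimensional tables, the lowercase y pieces are one-dimensional.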

You know, it's kind of a bummer that Jupyter doesn't expose the current environment of variables so you can see them statically rather than inquiring about them through script snippets.

Man, Python is such a convenient language: to figure out the mismatch of columns between two dataframes, I guessed:

set(train) - set(test)

And it worked, perfectly.

But still, that leaves me with the error, and this error message:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Which really begs the question... which.
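One way to answer the 'which', sketched on a made-up frame. This assumes every column is already numeric, since np.isinf won't work on string columns.

```python
import numpy as np
import pandas as pd

# Made-up frame with one NaN and one infinity hiding in it.
X = pd.DataFrame({
    "GrLivArea": [1710.0, 1262.0, 1786.0],
    "LotFrontage": [65.0, np.nan, 68.0],
    "Ratio": [0.5, 0.7, np.inf],
})

# Columns containing at least one NaN.
nan_columns = X.columns[X.isna().any()].tolist()

# Columns containing at least one positive or negative infinity.
inf_columns = X.columns[np.isinf(X).any()].tolist()

print("NaN in:", nan_columns)
print("inf in:", inf_columns)
```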

I hate ragging on Jupyter so much: it's such an amazing project, with so much going for it, and I'm wildly biased because I wrote mistakes.io and still think it's sort of "the right way" of doing this. It has the logical model of "execution goes top to bottom" whereas Jupyter has code going top to bottom but execution going in any order.

TIL in Python that

"Extern""Qual" # evaluates to "ExternQual"

Pretty odd. If you forget a comma in a list of strings, like

cats = ["George"
"Fiona"]

You get a single-element list containing "GeorgeFiona". Interesting!

In a week, I got a slightly better understanding of the roles of the tools in data science, as well as the intent of each step and why each step is necessary. There's still 99.9% of the theory left to learn, and I'm likely 2-3 weeks away from being able to solve one of these kaggles without a tutorial to guide me.

Some things that were surprising for me:

I didn't like Jupyter's execution model nearly as much as I was expecting to, as I've reiterated to death in these notes.
The interoperability of these Python libraries is really, really impressive, and I was happy with the quality of all of them.
I probably need to tone down the refactoring instinct with data munging code, or look for libraries that express that code in a better way. The typical munging on Kaggle is less than great, in terms of code quality.
Python visualization libraries like seaborn in this case really do an excellent job: I wonder whether, if JavaScript had data frames, d3 abstractions could be as good.
Data frames are pretty wild and so darn convenient.
Kaggle is an impressive platform, but online notebooks proved to be pretty annoying, and I ran into a significant number of bugs on the site.
The tutorials written on Kaggle could be more newbie-friendly - all tutorials could be more newbie friendly - but they are so great for what they are. Big thanks to the folks who write them.
Parameterization seems fishy and I'm really interested in using TPOT next time to see if it can't be done with more scientific rigor.

June 1, 2017 Tom MacWright
@macwright.com on Bluesky, @tmcw@mastodon.social on Mastodon