Friday, March 17, 2017

Geographic Distances: A Quick Trip Around the Great Circle

Recently, I wanted to calculate the distance between locations on the Earth. Finding a handy solution, I thought some reader might be interested. In my situation, location data included ZIP codes (American postal codes). Also available to me is a look-up table of the latitude and longitude of the geometric centroid of each ZIP code. Since the areas identified by ZIP codes are usually geographical small, and making the "close enough" assumption that this planet is perfectly spherical, trigonometry will allow distance calculations which are, for most purposes, precise enough.

Given the latitude and longitude of cities 'A' and 'B', the following line of MATLAB code will calculate the distance between the two coordinates "as the crow flies" (technically, the "great circle distance"), in kilometers:

DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)));

Note that latitude and longitude are expected as decimal degrees. If your data is in degrees/minutes/seconds, a quick conversion will be needed.

I've checked this formula against a second source and quickly verified it using a few pairs of cities:

% 'A' = New York
% 'B' = Atlanta
% Random on-line reference: 1202km
LatA = 40.664274;
LongA =  -73.9385;
LatB = 33.762909;
LongB = -84.422675;
DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)))

DistanceKilometers =


% 'A' = New York
% 'B' = Los Angeles
% Random on-line reference: 3940km (less than 0.5% difference)<0 .5="" br="" difference="">
LatA = 40.664274;
LongA =  -73.9385;
LatB = 34.019394;
LongB = -118.410825;
DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)))

DistanceKilometers =



"How Far is Berlin?", by Alan Zeichick, published in the Sep-1991 issue of "Computer Language" magazine. Note that Zeichick credits as his source an HP-27 scientific calculator, from which  he reverse-engineered the formula above.

"Trigonometry DeMYSTiFieD, 2nd edition", by Stan Gibilisco (ISBN: 978-0-07-178024-7)

Friday, March 18, 2016

Four Books Worth Owning

Below are listed four books on statistics which I feel are worth owning. They largely take a "traditional" statistics perspective, as opposed to a machine learning/data mining one. With the exception of "The Statistical Sleuth", these are less like textbooks than guide-books, with information reflecting the experience and practical advice of their respective authors. Comparatively few of their pages are devoted to predictive modeling- rather, they cover a host of topics relevant to the statistical analyst: sample size determination, hypothesis testing, assumptions, sampling technique,  etc.

I have not given ISBNs since they vary by edition. Older editions of any of these will be valuable, so consider saving money by skipping the latest edition.

A final clarification: I am not giving a blanket endorsement to any of these books. I completely disagree with a few of their ideas. I see the value of such books in their use as "paper advisers", with different backgrounds and perspectives than my own.

Thursday, March 10, 2016

Re-Coding Categorical Variables

Categorical variables as candidate predictors pose a distinct challenge to the analyst, especially when they exhibit high cardinality (a large number of distinct values). Numerical models (for instance linear regression and most neural networks) cannot accept these variables directly as inputs, since operations between categories and numbers are not defined.

It is sometimes advantageous (even necessary) to re-code such variables as one or more numeric dummy variables, with each new variable containing a 0 or 1 value indicating the presence (1) or absence (0) of one distinct value. This often works well with very small numbers of distinct values. However, as cardinality increases, dummy variables quickly become inconvenient: they add potentially many parameters to the model, while reducing the average number of observations available for fitting each of those parameters.

One solution is to convert each individual categorical variable to a new, numeric variable representing a local summary (typically the mean) of the target variable. An example would be replacing the categorical variable, State, with a numeric variable, StateNumeric, representing the mean of the target variable within each state.

Unfortunately, there are often insufficient observations to reliably calculate summaries for all possible categories. To overcome this limitation, one might select either the n most frequent categories, or the n categories with the largest and smallest means. Though both of those techniques sometimes work well, they are, in reality, crude attempts to discover the categories with the most significant differences from the global mean.

A more robust method which I've found useful is to select only those categories whose mean target values are at least a minimum multiple of standard errors away from the global mean. The detailed procedure is as follows:

1. Calculate the global mean (the mean of the target variable, across all observations)
2. For each category…
  • a. Calculate the local mean and local standard error of the mean for the target variable (“local” meaning just for the observations in this category).
  • b. Calculate the absolute value of the difference between the local mean and global mean, and divide it by the local standard error.
  • c. If the calculation from B is greater than some chosen threshold (I usually use 3.0, but this may be adjusted to suit the context of the problem), then this category is replaced in the new variable by its local mean.
3. All variables not chosen in 2c to get their own value are collected in an “other” category, and all such observations are assigned the mean target for the “other” group.

The threshold can be judgmentally adjusted to suit the need of the problem, but I normally stay within the arbitrary range of 2.0 to 4.0. Missing values can be treated as their own category for the purposes of this algorithm.

Note that this procedure seeks the accumulation of two kinds of evidence that categories are significantly different from the global mean: category frequency and difference between category local mean and the global mean. Some categories may exhibit one and not the other. A frequent category may not get its own numeric value in the new variable in this procedure, if its local mean is too close to the global mean. At the same time, less frequent categories might be assigned their own numeric value, if their local mean is far enough away from the global mean. Notice that this procedure does not establish a single, minimum distance from the global mean for values to be broken away from the herd.

Though using a multiple of the standard error is only an approximation to a rigorous significance test, it is close enough for this purpose. Despite some limitations, this technique is very quick to calculate and I’ve found that it works well in practice: The number of variables remains the same (one new variable for each raw one) and missing values are handled cleanly.

Quick Refresher on Standard Errors

For numeric variables, the standard error of the mean of a sample of values is the standard deviation of those values divided by the square root of the number of values. In MATLAB, this would be:

StandardError = std(MyData) / sqrt(n)

For binary classification (0/1) variables, the standard error of a sample of values is the square root of this whole mess: the mean times 1 minus the mean, over the number of values. In MATLAB, this is:

p = mean(MyData); StandardError = sqrt( (p * (1 - p) ) / n)

Final Thoughts

Data preparation is an important part of this work, often being identified as the part of the process which takes the most time. This phase of the work offers tremendous opportunity for the analyst to be creative, and for the entire process to wring the most information out of the raw data.

This is why it is so strange that so little published material from experts is dedicated to it. Except in specialized fields, such as signal and image processing, one generally only finds bits of information on data preparation here and there in the literature. One exception is Dorian Pyle's book, "Data Preparation for Data Mining" (ISBN-13: 978-1558605299), which is entirely focused on this issue.

Monday, March 07, 2016

MATLAB as (Near-)PseudoCode

In "Teaching Data Science in English (Not in Math)", the Feb-08-2016 entry of his Web log, "The Datatist", Charles Givre criticizes the use of specialized math symbols (capital sigma for summation, etc.) and Greek letters as being confusing, especially to newcomers to the field. He offers, as an example, the following, traditional definition of sum of squared errors:

He suggests that "English" (pseudo-code) be used instead, such as the following:

residuals_squared = (actual_values - predictions) ^ 2
RSS = sum( residuals_squared )

Although there are some flaws in this particular comparison (1. the first example includes an unnecessary middle portion, 2. it also features the linear model definition, while the pseudo-code example does not, and 3. the indexing variable in the summation is straightforward in this case, but is not always so), I tend to agree. Despite my own familiarity with the math jargon, I agree with him that, in many cases, pseudo-code is easier to understand.

Pseudo-code is certainly simpler syntactically since it tends to make heavy use of (sometimes deeply-nested) functions, as opposed to floating subscripts, superscripts and baroque symbols. Pseudo-code often employs real names for parameters, variables and constants, as opposed to letters. It is also closer to computer code that one actually writes, which is the point of this post.

Given the convention in MATLAB to store variables as column vectors, here is my MATLAB implementation of Givre's pseudo-code:

residuals_squared = (actual_values - predictions) .^ 2;
RSS = sum( residuals_squared );

Only two minor changes were needed: 1. The MATLAB .^ operator raises each individual element in an input matrix to a power, while the MATLAB ^ operator raises the entire matrix to a power (through matrix multiplication). 2. The semicolons prevent echoing the result of each calculation to the console: They are not strictly necessary, but will typically be included in a MATLAB program. Note that, unlike some object-oriented languages we could name, no extra code is needed to lay the groundwork of overloaded operators.

Consider an example I have given in the past, for the forward calculation of a feedforward neural network using a hyperbolic tangent transfer function:

HiddenLayer = tanh(Input * HiddenWeights);
OutputLayer = tanh(HiddenLayer * OutputWeights);

Note that this code recalls the neural network over all observations stored in the matrix Input.That's it: Two lines of code, without looping, which execute efficiently (quickly) in MATLAB. The addition of bias nodes, jump connections and scaling of the output layer would require more code, but even such changes would leave this code easy to comprehend.

Saturday, April 12, 2014

Probability: A Halloween Puzzle


Though Halloween is months away, I found the following interesting and thought readers might enjoy examining my solution.

Recently, I was given the following probability question to answer:

Halloween Probability Puzzler

The number of trick-or-treaters knocking on my door in any five minute interval between 6 and 8pm on Halloween night is distributed as a Poisson with a mean of 5 (ignoring time effects). The number of pieces of candy taken by each child, in addition to the expected one piece per child, is distributed as a Poisson with a mean of 1. What is the minimum number of pieces of candy that I should have on hand so that I have only a 5% probability of running out?

I am not aware of the author of this puzzle nor its origin. If anyone cares to identify either, I'd be glad to give credit where it's due.


I interpret the phrase "ignoring time effects", above, to mean that there is no correlation among arrivals over time; in other words, the count of trick-or-treater arrivals during any five minute interval is statistically independent of the counts in the other intervals.

So, the described model of trick-or-treater behaviors is that they arrive in random numbers during each time interval, and each take a single piece of candy plus a random amount more. The problem, then, becomes finding the 95th percentile of the distribution of candy taken during the entire evening, since that's the amount beyond which we'd run out of candy 5% of the time.

Solution Idea 1 (A Dead End)

The most direct method of solving this problem, accounting for all possible outcomes, seemed hopelessly complex to me. Trying to match each possible number of pieces of candy taken over so many arrivals over so many intervals results in an astronomical number of combinations. Suspecting that someone with more expertise in combinatorics than myself might see a way through that mess, I quickly gave up on that idea and fell back to something I know better: the computer (leading to solution idea 2... ).

Solution Idea 2

The next thing which occurred to me was Monte Carlo simulation, which is way of solving problems by approximating probability distributions using many random samples from those distributions. Sometimes very many samples are required, but contemporary computer hardware easily accommodates this need (at least for this problem). With a bit of code, I could approximate the various distributions in this problem and their interactions and have plenty of candy come October 31.

While Monte Carlo simulation and its variants are used to solve very complex real-world problems, the basic idea is very simple: Draw samples from the indicated distributions, have them interact as they do in real life, and accumulate the results. Using the poissrnd() function provided in the MATLAB Statistics Toolbox, that is exactly what I did (apologies for the dreadful formatting):

% HalloweenPuzzler
% Program to solve the Halloween Probability Puzzler by the Monte Carlo method
% by Will Dwinnell
% Note: Requires the poissrnd() and prctile() functions from the Statistics Toolbox
% Last modified: Apr-02-2014

% Parameter
B = 4e5;   % How many times should we run the nightly simulation?

% Initialize
S = zeros(B,1);  % Vector of nightly simulation totals

% Loop over simulated Halloween evenings
for i = 1:B
  % Loop over 5-minute time intervals for this night
  for j = 1:24
    % Determine number of arrivals tonight
    Arrivals = poissrnd(5);
    % Are there any?
    if  (Arrivals > 0)
      % ...yes, so process them.
      % Loop over tonight's trick-or-treaters
      for k = 1:Arrivals
        % Add this trick-or-treater's candy to the total
        S(i) = S(i) + 1 + poissrnd(1);

% Determine the 95th percentile of our simulated nightly totals for the answer
disp(['You need ' int2str(prctile(S,95)) ' pieces of candy.'])

% Graph the distribution of nightly totals
grid on
title({'Halloween Probability Puzzler: Monte Carlo Solution',[int2str(B) ' Replicates']})
xlabel('Candy (pieces/night)')


Fair warning: This program takes a long time to run (hours, if run on a slow machine).

Hopefully the function of this simulation is adequately explained in the comments. To simulate a single Halloween night, the program loops over all time periods (there are 24 intervals, each 5-minutes long, between 6:00 and 8:00), then all arrivals (if any) within each time period. Inside the innermost loop, the number of pieces of candy taken by the given trick-or-treater is generated. The outermost loop governs the multiple executions of the nightly simulation.

The poissrnd() function generates random values from the Poisson distribution (which are always integers, in case you're fuzzy on the Poisson distribution). Its lone parameter is the mean of the Poisson distribution in question. A graph is generated at the end, displaying the simulated distribution for an entire night.

Important Caveat

Recall my mentioning that Monte Carlo is an approximate method, several paragraphs above. The more runs through this process, the closer it mimics the behavior of the probability distributions. My first run used 1e5 (one hundred thousand) runs, and I qualified my response to the person who gave me this puzzle that my answer, 282, was quite possibly off from the correct one "by a piece of candy or two". Indeed, I was off by one piece. Notice that the program listed above now employs 4e5 (four hundred thousand) runs, which yielded the exactly correct answer, 281, the first time I ran it.