Sunday, February 18, 2007

Dividing Values Into Equal-Sized Groups

This is just a quick tip for MATLAB users who need to divide collections of values into even-sized groups. If one is fortunate enough to have the MATLAB Statistics Toolbox available, the tiedrank function is very handy for this sort of thing. (It's not hard to build something similar to tiedrank, using sort anyway.)

This will best be explained via example, so let's generate some example data:


>> rand('twister',951); X = rand(8,1)

X =

0.5798
0.0504
0.0241
0.7555
0.6569
0.3020
0.2042
0.5651


It is desired that this data be divided into 4 equal-sized groups, by magnitude. The following line does this:



>> F = ceil(4 * tiedrank(X) / length(X))

F =

3
1
1
4
4
2
2
3


The variable F now contains an integer code representing the assigned group number, with 1 being the smallest group. Notice that there are two each of the values 1, 2, 3 and 4. Also notice how easy it is to extract all members of any given group. Here, for example, the members of the lowest-valued group are displayed:


>> X(F == 1)

ans =

0.0504
0.0241


Here is an example using a much larger number of items:


>> rand('twister',951); X = rand(10000,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2500 25.00%
2 2500 25.00%
3 2500 25.00%
4 2500 25.00%


The tabulate function is a convenient routine for generating frequency tables from the Statistics Toolbox. Again, notice the even distribution of values across bins. What happens if the total count cannot be divided evenly among the bins? Let's see:


>> rand('twister',951); X = rand(10001,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2500 25.00%
2 2500 25.00%
3 2500 25.00%
4 2501 25.01%


This procedure ensures that the counts in the smallest bin and the largest bin never differ by more than 1, assuming that this is possible. Observe the progression, as the total count is incremented:


>> rand('twister',951); X = rand(10002,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2500 25.00%
2 2501 25.00%
3 2500 25.00%
4 2501 25.00%

>> rand('twister',951); X = rand(10003,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2500 24.99%
2 2501 25.00%
3 2501 25.00%
4 2501 25.00%

>> rand('twister',951); X = rand(10004,1);
>> F = ceil(4 * tiedrank(X) / length(X));
>> tabulate(F)
Value Count Percent
1 2501 25.00%
2 2501 25.00%
3 2501 25.00%
4 2501 25.00%


The only catch is the case of multiple instances of the same value. The tiedrank function prevents the repeated values from being broken up. Here is an example with such data:


>> X = [1 2 2 2 5 6 7 8]'

X =

1
2
2
2
5
6
7
8

>> F = ceil(4 * tiedrank(X) / length(X))

F =

1
2
2
2
3
3
4
4


The distribution among bins is now uneven, but tiedrank is doing the best it can. Depending what is needed, this may be justified. If, however, if one needs to break these repeated values apart, even if arbitrarily, then tiedrank may be easily replaced with another procedure which does this.


This process is useful for assigning values to quantiles or n-iles, such as deciles or percentiles. Applied to randomly-generated values, it is also useful for assigning cases to folds for stratified k-fold cross-validation, or to strata for stratified sampling.


See Also

Feb-10-2007 posting, Stratified Sampling.

10 comments:

Anonymous said...

Excellent tutorial here. I develop in Java and sometimes use Matlab for fast proto-type of algorithms before coding it in Java for application deployment.

keep up the good work.

Will Dwinnell said...

Thanks, I appreciate that very much!

Can you share anything else about your work?

Anonymous said...

To calculate percentiles with Matlab I would better use the "percentile.m" funtion from Osprey toolbox (I get different results and I know percentile.m works fine). If you do not want to download the whole Osprey toolbox you can use the following *.m code:

============================
function y = percentile(x, pct)
% y = PERCENTILE(x, pct)
% Example: percentile(x, 0.10) is the value that is higher than 10%
% of the elements of x, and less than the remaining 90%.
% If the length of x is such that there is no element of x exactly
% corresponding to the 'pct' position, a weighted average of the two
% adjacent values is used. pct must be between 0 and 1 inclusive.
%
% percentile(x, 1) is a slow way to get max(x).
% percentile(x, 0.5) is the median of x.
% percentile(x, 0) is a slow way to get min(x).
%
% If x is a matrix, percentile operates on columns, returning multiple columns.
% If pct is a vector, multiple rows are returned, one per element of pct.
%
% See also median, sort, max, min.

x=x(:);
I=find(x>=0);
x=x(I);

[mm,nn] = size(x);
if (mm == 1), x = x.'; end
x = sort(x);
n = ((mm - 1) * pct(:) + 1);
r = rem(n, 1);
r1 = r * ones(1, nn);
y = (1-r1) .* x(n-r,:);
ix = find(r); % when n=mm, x(n+1,:) doesn't exist
y(ix,:) = y(ix,:) + r1(ix,:) .* x(n(ix)-r(ix)+1,:);
============================

Will Dwinnell said...

Just to be clear: The method I described was intended to divide a given data into equal sized-groups (to the extent that this is possible), not to determine percentile values.

To illustrate the difference (which is subtle), consider a data set with many duplicates of a single value. Many consecutive percentiles (i.e. percentiles 34, 35, 36 and 37) will share this duplicated value. Strictly speaking, a specific instance of those duplicate values does not reside in any particular one of those percentiles. Rather the percentiles are measures of the data set as a whole and more than one percentile may return the same value.

My article, however, was about dividing data into distinguishable segments (binning, as is used in machine learning or data mining). All of those duplicate values, for my purposes must fall into a single segment, so there will sometimes be discrepancies from the measured percentiles (or deciles, etc.), when observations are few enough or when the distribution of the data is sufficiently aberrant.

Anonymous said...

Is there a clever way to bin data based on value. Similar to calculating a histogram. I want to write a function like:

function binnedData = binData(data, numBins)

Suppose numBins = 5. Then the smallest 20% values will have bin = 1 and so on. I can't do it without looping which is suprising.

Anonymous said...

Thanks for this, I found it very useful.

Anonymous said...

just came across this post. Very elegant and simple solution. nice!

Anonymous said...

Thank you very much...It was very useful!!

Anonymous said...

I used the technique for a dataset having 2300 rows with normalized values. It did not evenly distribute the data. What was the problem?

Ayu said...

Hey, so this is pretty useful when you're only binning the train sample but then when you binning the test sample, it has to follow the binning on the train sample we have established. I'm not sure how to go about doing it. Do you have any suggestion?