Eve V. E. Kovacs

2016-05-20 22:51:34 UTC

Dear Scikit-learn gurus,

Sorry to spam the whole list but I couldn't find a better email for my

question regarding the results of the predict_proba method in the Random

Forest classifier.

I tried to reproduce the output of this method by following the description

given in the documentation: That is, I averaged over the class probabilities

for each tree in the forest. I computed the class probability

for each tree, for each object in my test data, by

first determining in which leaf of the tree my test datum landed. Then I set

the class probabilities equal to the fraction of objects in each class in the

training data that also landed in the same leaf.

For example, if my test datum landed in node 55 of tree #0,

and supposing that 10 objects from my training data also landed in node 55 of

tree #0, with 4 objects in the first class and 6 in the second, then the

probabilities for that tree would be [0.4, 0.6]. (And then I average these

probabilities over all the trees in the forest.)
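
The procedure described above can be sketched roughly as follows. This is only a minimal illustration, not the code actually used for the numbers in this message: the function name `leaf_fraction_proba` and its arguments are hypothetical, and it assumes the leaf index of every training and test object in every tree is already available as an integer array, with class labels encoded as 0..n_classes-1.

```python
import numpy as np


def leaf_fraction_proba(train_leaves, y_train, test_leaves, n_classes):
    """Average, over trees, the class fractions of the training objects
    that share a leaf with each test object.

    train_leaves: (n_train, n_trees) leaf index of each training object per tree
    y_train:      (n_train,) integer class labels in 0..n_classes-1
    test_leaves:  (n_test, n_trees) leaf index of each test object per tree
    """
    train_leaves = np.asarray(train_leaves)
    test_leaves = np.asarray(test_leaves)
    y_train = np.asarray(y_train)
    n_test, n_trees = test_leaves.shape
    proba = np.zeros((n_test, n_classes))
    for t in range(n_trees):
        for i in range(n_test):
            # Training objects that landed in the same leaf of tree t.
            same_leaf = train_leaves[:, t] == test_leaves[i, t]
            counts = np.bincount(y_train[same_leaf], minlength=n_classes)
            # Skip empty leaves (that tree then contributes nothing to row i).
            if counts.sum() > 0:
                proba[i] += counts / counts.sum()
    # Average the per-tree class fractions over all trees.
    return proba / n_trees


# The worked example from the text: one tree, test datum in node 55,
# 10 training objects in that node, 4 in the first class and 6 in the second.
example_train_leaves = np.full((10, 1), 55)
example_y_train = np.array([0] * 4 + [1] * 6)
example_test_leaves = np.array([[55]])
example_proba = leaf_fraction_proba(
    example_train_leaves, example_y_train, example_test_leaves, n_classes=2
)
```

With a fitted forest, the leaf arrays could presumably be obtained from `forest.apply(X)`, which returns one leaf index per sample per tree. Note that this sketch counts every training object passed down each tree; by default (`bootstrap=True`) each tree is fit on a bootstrap sample of the training data, so these counts may not be the ones each tree saw during fitting.
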

Unfortunately, the answers that I get for the probabilities from the above

algorithm and the results of predict_proba don't agree.

For example, for 4 objects in my test data I get the following probabilities:

[ 0.99718369 0.00281631]

[ 0.99711619 0.00288381]

[ 0.99680974 0.00319026]

[ 0.55153962 0.44846038]

but predict_proba gives

[1.0 0.0]

[1.0 0.0]

[1.0 0.0]

[0.4 0.6]

Can anyone please tell me what I am doing wrong? I have checked the source code

and the averaging step seems to be correct. I must be misinterpreting how to

compute the class probabilities.
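
The averaging step on its own can be checked in isolation, for instance with something like the sketch below. It uses synthetic data from `make_classification` (an assumption for illustration, not the data behind the numbers above) and compares the mean of each tree's own `predict_proba` against the forest's `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data, used here only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)

# Class probabilities reported by each individual tree, then averaged.
per_tree = np.array([tree.predict_proba(X_test) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# True if the plain average over trees reproduces the forest's output.
matches = np.allclose(averaged, forest.predict_proba(X_test))
```

If `matches` is true, any disagreement would have to come from how the per-tree class probabilities are computed rather than from the averaging.
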

Thanks

Eve

***************************************************************

Eve Kovacs

Argonne National Laboratory,

Room L-177, Bldg. 360, HEP

9700 S. Cass Ave.

Argonne, IL 60439 USA

Phone: (630)-252-6208

Fax: (630)-252-5047

email: ***@anl.gov

***************************************************************
