Friday, April 30, 2021

Pandas stratified sampling by count

I want to create a sample column that splits a fixed sample size as evenly as possible across the cIds within each sId, never taking more than a row's vcount:

    import pandas as pd

    df = pd.DataFrame({
        'sId': {0: 's0', 1: 's0', 2: 's1', 3: 's1', 4: 's2', 5: 's2', 6: 's2', 7: 's2', 8: 's3', 9: 's3', 10: 's3', 11: 's3', 12: 's3'},
        'cId': {0: 'c0', 1: 'c1', 2: 'c2', 3: 'c3', 4: 'c4', 5: 'c5', 6: 'c6', 7: 'c7', 8: 'c8', 9: 'c9', 10: 'c10', 11: 'c11', 12: 'c12'},
        'vcount': {0: 322, 1: 168, 2: 1818, 3: 81, 4: 13114, 5: 5, 6: 3, 7: 2, 8: 1979, 9: 1561, 10: 1548, 11: 1009, 12: 11},
    })

        sId  cId  vcount
    0    s0   c0     322
    1    s0   c1     168
    2    s1   c2    1818
    3    s1   c3      81
    4    s2   c4   13114
    5    s2   c5       5
    6    s2   c6       3
    7    s2   c7       2
    8    s3   c8    1979
    9    s3   c9    1561
    10   s3  c10    1548
    11   s3  c11    1009
    12   s3  c12      11

Right now I need it to work for a sample size of 100 per sId. Expected output:

        sId  cId  vcount  sample
    0    s0   c0     322      50
    1    s0   c1     168      50
    2    s1   c2    1818      50
    3    s1   c3      81      50
    4    s2   c4   13114      90
    5    s2   c5       5       5
    6    s2   c6       3       3
    7    s2   c7       2       2
    8    s3   c8    1979      22
    9    s3   c9    1561      22
    10   s3  c10    1548      22
    11   s3  c11    1009      23
    12   s3  c12      11      11

As you can see, sId s2 has 4 cIds, so ideally we would take 25 from each cId. However, only one of them (c4) has more than 25, so we have to take everything from the other cIds (5 + 3 + 2 = 10) and get the remaining 90 from c4. Similarly, s0 has 2 cIds, so we want 50 from each, and each cId has more than 50 available. For s3, c12 only has 11, so the remaining 89 are split across the other four cIds; it doesn't matter which one gets the largest sample, I just need the distribution to be as uniform as possible.

The goal is to select all of the cId for each sId and divide the 100 as evenly as possible.

I couldn't figure this out and manually typed in the sample column; however that isn't a reasonable solution when the list gets larger.
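One way to approach this, offered as a sketch rather than a definitive answer: within each sId, visit the cIds from smallest vcount to largest and give each one the smaller of its vcount and an equal share of what is still unallocated. The helper name allocate_evenly and its total parameter are made up for this illustration, and the DataFrame is rebuilt with plain lists so the block runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sId": ["s0", "s0", "s1", "s1", "s2", "s2", "s2", "s2",
            "s3", "s3", "s3", "s3", "s3"],
    "cId": [f"c{i}" for i in range(13)],
    "vcount": [322, 168, 1818, 81, 13114, 5, 3, 2,
               1979, 1561, 1548, 1009, 11],
})


def allocate_evenly(counts: pd.Series, total: int) -> pd.Series:
    """Hypothetical helper: split `total` across the rows of one group as
    evenly as possible without exceeding any row's available count."""
    values = counts.to_numpy()
    alloc = np.zeros(len(values), dtype="int64")
    remaining = total      # units not yet handed out
    left = len(values)     # rows that have not received a share yet
    # Smallest counts first, so rows that cannot fill an equal share
    # pass their shortfall on to the larger rows that follow.
    for pos in np.argsort(values, kind="stable"):
        take = min(int(values[pos]), remaining // left)
        alloc[pos] = take
        remaining -= take
        left -= 1
    return pd.Series(alloc, index=counts.index)


# Apply the helper to each sId group; 100 is the per-group sample size.
df["sample"] = df.groupby("sId")["vcount"].transform(
    lambda s: allocate_evenly(s, 100)
)
print(df)
```

Because the leftover unit from integer division ends up on the largest cId in a group, s3 comes out as 23 for c8 and 22 each for c9, c10 and c11, rather than 23 for c11 as in the hand-typed table; the question says either is acceptable. If a group's vcounts sum to less than the requested size, every row simply gets its full vcount.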

https://stackoverflow.com/questions/67342576/pandas-stratified-sampling-by-count May 01, 2021 at 11:16AM
