I want to create a sample column that will evenly select vcount by sId and cId:
    df = pd.DataFrame({
        'sId': {0: 's0', 1: 's0', 2: 's1', 3: 's1', 4: 's2', 5: 's2', 6: 's2', 7: 's2', 8: 's3', 9: 's3', 10: 's3', 11: 's3', 12: 's3'},
        'cId': {0: 'c0', 1: 'c1', 2: 'c2', 3: 'c3', 4: 'c4', 5: 'c5', 6: 'c6', 7: 'c7', 8: 'c8', 9: 'c9', 10: 'c10', 11: 'c11', 12: 'c12'},
        'vcount': {0: 322, 1: 168, 2: 1818, 3: 81, 4: 13114, 5: 5, 6: 3, 7: 2, 8: 1979, 9: 1561, 10: 1548, 11: 1009, 12: 11},
    })

        sId  cId  vcount
    0    s0   c0     322
    1    s0   c1     168
    2    s1   c2    1818
    3    s1   c3      81
    4    s2   c4   13114
    5    s2   c5       5
    6    s2   c6       3
    7    s2   c7       2
    8    s3   c8    1979
    9    s3   c9    1561
    10   s3  c10    1548
    11   s3  c11    1009
    12   s3  c12      11
Right now I need it to work for a sample size of 100 per sId. Expected output:

        sId  cId  vcount  sample
    0    s0   c0     322      50
    1    s0   c1     168      50
    2    s1   c2    1818      50
    3    s1   c3      81      50
    4    s2   c4   13114      90
    5    s2   c5       5       5
    6    s2   c6       3       3
    7    s2   c7       2       2
    8    s3   c8    1979      22
    9    s3   c9    1561      22
    10   s3  c10    1548      22
    11   s3  c11    1009      23
    12   s3  c12      11      11
As you can see, sId s2 has 4 cIds, so we would want 25 from each cId; however, only one of them (c4) has more than 25, so we have to select everything from the other cIds (5 + 3 + 2 = 10) and get the remaining 90 from c4. Similarly, s0 has 2 cIds, so we want 50 from each, and each cId has more than 50 available. For s3, c12 only has 11, so the remaining 89 is split across the other four cIds (22, 22, 22 and 23). It doesn't matter which one gets the largest sample; I just need the distribution to be as uniform as possible.
The goal is to select all of the cIds for each sId and divide the 100 as evenly as possible.
I couldn't figure this out and typed in the sample column manually; however, that isn't a reasonable solution when the list gets larger.
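One possible way to compute the column, offered as a sketch rather than a definitive answer: within each sId, visit the cIds from smallest to largest vcount and give each one either everything it has or an equal share of whatever is still unassigned, whichever is smaller. The helper name `allocate` and its `total` parameter are made up for illustration; only standard pandas and NumPy calls are used.

```python
import numpy as np
import pandas as pd


def allocate(counts: pd.Series, total: int = 100) -> pd.Series:
    """Split `total` across the rows of one group as evenly as the counts allow.

    `allocate` and `total` are hypothetical names, not part of pandas.
    """
    values = counts.to_numpy()
    out = np.zeros(len(values), dtype="int64")
    remaining = total
    # Smallest counts first, so rows that cannot fill an equal share
    # release their unused quota to the larger rows that follow.
    order = np.argsort(values, kind="stable")
    for i, pos in enumerate(order):
        rows_left = len(values) - i
        take = min(int(values[pos]), remaining // rows_left)
        out[pos] = take
        remaining -= take
    return pd.Series(out, index=counts.index)


df = pd.DataFrame({
    "sId": ["s0", "s0", "s1", "s1", "s2", "s2", "s2", "s2",
            "s3", "s3", "s3", "s3", "s3"],
    "cId": [f"c{i}" for i in range(13)],
    "vcount": [322, 168, 1818, 81, 13114, 5, 3, 2,
               1979, 1561, 1548, 1009, 11],
})

# Apply the allocation independently inside every sId group.
df["sample"] = df.groupby("sId")["vcount"].transform(
    lambda s: allocate(s, total=100)
)
print(df)
```

With this ordering, any leftover from integer division ends up on the largest cId in the group, so for s3 the 23 lands on c8 instead of c11; the totals per sId are the same as in the hand-typed column. If a group has fewer than `total` rows available overall, every row is simply taken in full.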
https://stackoverflow.com/questions/67342576/pandas-stratified-sampling-by-count May 01, 2021 at 11:16AM