2021年1月14日星期四

R data.table: optimize speed of row operations by (different) groups

So, I have this huge data set (possibly thousands of entries) that is received by my code as a data.table in long format, that looks like this:

#sample DT  sample_size = 8  DT0 <- data.table(    DATE = seq.Date(from = as.Date("2020/3/01"), by = "day", length.out = sample_size),    BANANA = seq(30, by=0.060, length.out = sample_size),    ORANGE = seq(5, by=0.035, length.out = sample_size),    APPLE = seq(12, by=0.6, length.out = sample_size),    LEMON = seq(10, by=0.01, length.out = sample_size),    GRAPE = seq(0.5, by=0.13, length.out = sample_size)  )    DT <- melt(DT0,              id.vars = c("DATE"),              variable.name = "FRUIT",             value.name = "PRICE")  setkeyv(DT, cols=c("FRUIT", "DATE"))    

DT is exactly how my data looks like right now:

> DT            DATE  FRUIT  PRICE   1: 2020-03-01 BANANA 30.000   2: 2020-03-02 BANANA 30.060   3: 2020-03-03 BANANA 30.120   4: 2020-03-04 BANANA 30.180   5: 2020-03-05 BANANA 30.240   6: 2020-03-06 BANANA 30.300   7: 2020-03-07 BANANA 30.360   8: 2020-03-08 BANANA 30.420   9: 2020-03-01 ORANGE  5.000  10: 2020-03-02 ORANGE  5.035  11: 2020-03-03 ORANGE  5.070  12: 2020-03-04 ORANGE  5.105  13: 2020-03-05 ORANGE  5.140  14: 2020-03-06 ORANGE  5.175  15: 2020-03-07 ORANGE  5.210  16: 2020-03-08 ORANGE  5.245  17: 2020-03-01  APPLE 12.000  18: 2020-03-02  APPLE 12.600  19: 2020-03-03  APPLE 13.200  20: 2020-03-04  APPLE 13.800  21: 2020-03-05  APPLE 14.400  22: 2020-03-06  APPLE 15.000  23: 2020-03-07  APPLE 15.600  24: 2020-03-08  APPLE 16.200  25: 2020-03-01  LEMON 10.000  26: 2020-03-02  LEMON 10.010  27: 2020-03-03  LEMON 10.020  28: 2020-03-04  LEMON 10.030  29: 2020-03-05  LEMON 10.040  30: 2020-03-06  LEMON 10.050  31: 2020-03-07  LEMON 10.060  32: 2020-03-08  LEMON 10.070  33: 2020-03-01  GRAPE  0.500  34: 2020-03-02  GRAPE  0.630  35: 2020-03-03  GRAPE  0.760  36: 2020-03-04  GRAPE  0.890  37: 2020-03-05  GRAPE  1.020  38: 2020-03-06  GRAPE  1.150  39: 2020-03-07  GRAPE  1.280  40: 2020-03-08  GRAPE  1.410            DATE  FRUIT  PRICE  

Now, let's say that I need to compute in a new column ("RESULT"), the difference ( or any other more complex operation) between the PRICE of each of the items in FRUIT and the price of one constant specified fruit (ex: GRAPE), for each day.

Just to help you visualize the idea, the RESULT column would be something like as the result of these operations, for each day:

result := PRICE BANANA - PRICE GRAPE
result := PRICE ORANGE - PRICE GRAPE
result := PRICE APPLE - PRICE GRAPE
result := PRICE BANANA - PRICE GRAPE
result := PRICE LEMON - PRICE GRAPE
result := PRICE GRAPE - PRICE GRAPE

And here's how I managed to do it, after many hours of trial and error (and still not quite understanding what I'm doing):

#my try:  chosen_fruit <- "GRAPE"    setkey(DT, DATE)  DT[DT[FRUIT == chosen_fruit], RESULTS := PRICE - i.PRICE]    
> DT            DATE  FRUIT  PRICE RESULTS   1: 2020-03-01 BANANA 30.000  29.500   2: 2020-03-01 ORANGE  5.000   4.500   3: 2020-03-01  APPLE 12.000  11.500   4: 2020-03-01  LEMON 10.000   9.500   5: 2020-03-01  GRAPE  0.500   0.000   6: 2020-03-02 BANANA 30.060  29.430   7: 2020-03-02 ORANGE  5.035   4.405   8: 2020-03-02  APPLE 12.600  11.970   9: 2020-03-02  LEMON 10.010   9.380  10: 2020-03-02  GRAPE  0.630   0.000  11: 2020-03-03 BANANA 30.120  29.360  12: 2020-03-03 ORANGE  5.070   4.310  13: 2020-03-03  APPLE 13.200  12.440  14: 2020-03-03  LEMON 10.020   9.260  15: 2020-03-03  GRAPE  0.760   0.000  16: 2020-03-04 BANANA 30.180  29.290  17: 2020-03-04 ORANGE  5.105   4.215  18: 2020-03-04  APPLE 13.800  12.910  19: 2020-03-04  LEMON 10.030   9.140  20: 2020-03-04  GRAPE  0.890   0.000  21: 2020-03-05 BANANA 30.240  29.220  22: 2020-03-05 ORANGE  5.140   4.120  23: 2020-03-05  APPLE 14.400  13.380  24: 2020-03-05  LEMON 10.040   9.020  25: 2020-03-05  GRAPE  1.020   0.000  26: 2020-03-06 BANANA 30.300  29.150  27: 2020-03-06 ORANGE  5.175   4.025  28: 2020-03-06  APPLE 15.000  13.850  29: 2020-03-06  LEMON 10.050   8.900  30: 2020-03-06  GRAPE  1.150   0.000  31: 2020-03-07 BANANA 30.360  29.080  32: 2020-03-07 ORANGE  5.210   3.930  33: 2020-03-07  APPLE 15.600  14.320  34: 2020-03-07  LEMON 10.060   8.780  35: 2020-03-07  GRAPE  1.280   0.000  36: 2020-03-08 BANANA 30.420  29.010  37: 2020-03-08 ORANGE  5.245   3.835  38: 2020-03-08  APPLE 16.200  14.790  39: 2020-03-08  LEMON 10.070   8.660  40: 2020-03-08  GRAPE  1.410   0.000            DATE  FRUIT  PRICE RESULTS  

And my final desired output:

setkey(DT, FRUIT)    > DT            DATE  FRUIT  PRICE RESULTS   1: 2020-03-01 BANANA 30.000  29.500   2: 2020-03-02 BANANA 30.060  29.430   3: 2020-03-03 BANANA 30.120  29.360   4: 2020-03-04 BANANA 30.180  29.290   5: 2020-03-05 BANANA 30.240  29.220   6: 2020-03-06 BANANA 30.300  29.150   7: 2020-03-07 BANANA 30.360  29.080   8: 2020-03-08 BANANA 30.420  29.010   9: 2020-03-01 ORANGE  5.000   4.500  10: 2020-03-02 ORANGE  5.035   4.405  11: 2020-03-03 ORANGE  5.070   4.310  12: 2020-03-04 ORANGE  5.105   4.215  13: 2020-03-05 ORANGE  5.140   4.120  14: 2020-03-06 ORANGE  5.175   4.025  15: 2020-03-07 ORANGE  5.210   3.930  16: 2020-03-08 ORANGE  5.245   3.835  17: 2020-03-01  APPLE 12.000  11.500  18: 2020-03-02  APPLE 12.600  11.970  19: 2020-03-03  APPLE 13.200  12.440  20: 2020-03-04  APPLE 13.800  12.910  21: 2020-03-05  APPLE 14.400  13.380  22: 2020-03-06  APPLE 15.000  13.850  23: 2020-03-07  APPLE 15.600  14.320  24: 2020-03-08  APPLE 16.200  14.790  25: 2020-03-01  LEMON 10.000   9.500  26: 2020-03-02  LEMON 10.010   9.380  27: 2020-03-03  LEMON 10.020   9.260  28: 2020-03-04  LEMON 10.030   9.140  29: 2020-03-05  LEMON 10.040   9.020  30: 2020-03-06  LEMON 10.050   8.900  31: 2020-03-07  LEMON 10.060   8.780  32: 2020-03-08  LEMON 10.070   8.660  33: 2020-03-01  GRAPE  0.500   0.000  34: 2020-03-02  GRAPE  0.630   0.000  35: 2020-03-03  GRAPE  0.760   0.000  36: 2020-03-04  GRAPE  0.890   0.000  37: 2020-03-05  GRAPE  1.020   0.000  38: 2020-03-06  GRAPE  1.150   0.000  39: 2020-03-07  GRAPE  1.280   0.000  40: 2020-03-08  GRAPE  1.410   0.000            DATE  FRUIT  PRICE RESULTS  

As you can see, I was, somehow, able to reach the target. But it doesn't seem ideal.

I'm learning R (and programming in general) now, so I am VERY new to all of this, especially in making the code as optimized as possible (consider memory constraints here).

Any suggestions on how to improve the speed/performance of operation or better ways to do that? Thank you very much in advance!

https://stackoverflow.com/questions/65730226/r-data-table-optimize-speed-of-row-operations-by-different-groups January 15, 2021 at 11:46AM

没有评论:

发表评论