So, I have this huge data set (possibly thousands of entries) that is received by my code as a data.table in long format, that looks like this:
#sample DT sample_size = 8 DT0 <- data.table( DATE = seq.Date(from = as.Date("2020/3/01"), by = "day", length.out = sample_size), BANANA = seq(30, by=0.060, length.out = sample_size), ORANGE = seq(5, by=0.035, length.out = sample_size), APPLE = seq(12, by=0.6, length.out = sample_size), LEMON = seq(10, by=0.01, length.out = sample_size), GRAPE = seq(0.5, by=0.13, length.out = sample_size) ) DT <- melt(DT0, id.vars = c("DATE"), variable.name = "FRUIT", value.name = "PRICE") setkeyv(DT, cols=c("FRUIT", "DATE")) DT is exactly how my data looks like right now:
> DT DATE FRUIT PRICE 1: 2020-03-01 BANANA 30.000 2: 2020-03-02 BANANA 30.060 3: 2020-03-03 BANANA 30.120 4: 2020-03-04 BANANA 30.180 5: 2020-03-05 BANANA 30.240 6: 2020-03-06 BANANA 30.300 7: 2020-03-07 BANANA 30.360 8: 2020-03-08 BANANA 30.420 9: 2020-03-01 ORANGE 5.000 10: 2020-03-02 ORANGE 5.035 11: 2020-03-03 ORANGE 5.070 12: 2020-03-04 ORANGE 5.105 13: 2020-03-05 ORANGE 5.140 14: 2020-03-06 ORANGE 5.175 15: 2020-03-07 ORANGE 5.210 16: 2020-03-08 ORANGE 5.245 17: 2020-03-01 APPLE 12.000 18: 2020-03-02 APPLE 12.600 19: 2020-03-03 APPLE 13.200 20: 2020-03-04 APPLE 13.800 21: 2020-03-05 APPLE 14.400 22: 2020-03-06 APPLE 15.000 23: 2020-03-07 APPLE 15.600 24: 2020-03-08 APPLE 16.200 25: 2020-03-01 LEMON 10.000 26: 2020-03-02 LEMON 10.010 27: 2020-03-03 LEMON 10.020 28: 2020-03-04 LEMON 10.030 29: 2020-03-05 LEMON 10.040 30: 2020-03-06 LEMON 10.050 31: 2020-03-07 LEMON 10.060 32: 2020-03-08 LEMON 10.070 33: 2020-03-01 GRAPE 0.500 34: 2020-03-02 GRAPE 0.630 35: 2020-03-03 GRAPE 0.760 36: 2020-03-04 GRAPE 0.890 37: 2020-03-05 GRAPE 1.020 38: 2020-03-06 GRAPE 1.150 39: 2020-03-07 GRAPE 1.280 40: 2020-03-08 GRAPE 1.410 DATE FRUIT PRICE Now, let's say that I need to compute in a new column ("RESULT"), the difference ( or any other more complex operation) between the PRICE of each of the items in FRUIT and the price of one constant specified fruit (ex: GRAPE), for each day.
Just to help you visualize the idea, the RESULT column would be something like as the result of these operations, for each day:
result := PRICE BANANA - PRICE GRAPE
result := PRICE ORANGE - PRICE GRAPE
result := PRICE APPLE - PRICE GRAPE
result := PRICE BANANA - PRICE GRAPE
result := PRICE LEMON - PRICE GRAPE
result := PRICE GRAPE - PRICE GRAPE
And here's how I managed to do it, after many hours of trial and error (and still not quite understanding what I'm doing):
#my try: chosen_fruit <- "GRAPE" setkey(DT, DATE) DT[DT[FRUIT == chosen_fruit], RESULTS := PRICE - i.PRICE] > DT DATE FRUIT PRICE RESULTS 1: 2020-03-01 BANANA 30.000 29.500 2: 2020-03-01 ORANGE 5.000 4.500 3: 2020-03-01 APPLE 12.000 11.500 4: 2020-03-01 LEMON 10.000 9.500 5: 2020-03-01 GRAPE 0.500 0.000 6: 2020-03-02 BANANA 30.060 29.430 7: 2020-03-02 ORANGE 5.035 4.405 8: 2020-03-02 APPLE 12.600 11.970 9: 2020-03-02 LEMON 10.010 9.380 10: 2020-03-02 GRAPE 0.630 0.000 11: 2020-03-03 BANANA 30.120 29.360 12: 2020-03-03 ORANGE 5.070 4.310 13: 2020-03-03 APPLE 13.200 12.440 14: 2020-03-03 LEMON 10.020 9.260 15: 2020-03-03 GRAPE 0.760 0.000 16: 2020-03-04 BANANA 30.180 29.290 17: 2020-03-04 ORANGE 5.105 4.215 18: 2020-03-04 APPLE 13.800 12.910 19: 2020-03-04 LEMON 10.030 9.140 20: 2020-03-04 GRAPE 0.890 0.000 21: 2020-03-05 BANANA 30.240 29.220 22: 2020-03-05 ORANGE 5.140 4.120 23: 2020-03-05 APPLE 14.400 13.380 24: 2020-03-05 LEMON 10.040 9.020 25: 2020-03-05 GRAPE 1.020 0.000 26: 2020-03-06 BANANA 30.300 29.150 27: 2020-03-06 ORANGE 5.175 4.025 28: 2020-03-06 APPLE 15.000 13.850 29: 2020-03-06 LEMON 10.050 8.900 30: 2020-03-06 GRAPE 1.150 0.000 31: 2020-03-07 BANANA 30.360 29.080 32: 2020-03-07 ORANGE 5.210 3.930 33: 2020-03-07 APPLE 15.600 14.320 34: 2020-03-07 LEMON 10.060 8.780 35: 2020-03-07 GRAPE 1.280 0.000 36: 2020-03-08 BANANA 30.420 29.010 37: 2020-03-08 ORANGE 5.245 3.835 38: 2020-03-08 APPLE 16.200 14.790 39: 2020-03-08 LEMON 10.070 8.660 40: 2020-03-08 GRAPE 1.410 0.000 DATE FRUIT PRICE RESULTS And my final desired output:
setkey(DT, FRUIT) > DT DATE FRUIT PRICE RESULTS 1: 2020-03-01 BANANA 30.000 29.500 2: 2020-03-02 BANANA 30.060 29.430 3: 2020-03-03 BANANA 30.120 29.360 4: 2020-03-04 BANANA 30.180 29.290 5: 2020-03-05 BANANA 30.240 29.220 6: 2020-03-06 BANANA 30.300 29.150 7: 2020-03-07 BANANA 30.360 29.080 8: 2020-03-08 BANANA 30.420 29.010 9: 2020-03-01 ORANGE 5.000 4.500 10: 2020-03-02 ORANGE 5.035 4.405 11: 2020-03-03 ORANGE 5.070 4.310 12: 2020-03-04 ORANGE 5.105 4.215 13: 2020-03-05 ORANGE 5.140 4.120 14: 2020-03-06 ORANGE 5.175 4.025 15: 2020-03-07 ORANGE 5.210 3.930 16: 2020-03-08 ORANGE 5.245 3.835 17: 2020-03-01 APPLE 12.000 11.500 18: 2020-03-02 APPLE 12.600 11.970 19: 2020-03-03 APPLE 13.200 12.440 20: 2020-03-04 APPLE 13.800 12.910 21: 2020-03-05 APPLE 14.400 13.380 22: 2020-03-06 APPLE 15.000 13.850 23: 2020-03-07 APPLE 15.600 14.320 24: 2020-03-08 APPLE 16.200 14.790 25: 2020-03-01 LEMON 10.000 9.500 26: 2020-03-02 LEMON 10.010 9.380 27: 2020-03-03 LEMON 10.020 9.260 28: 2020-03-04 LEMON 10.030 9.140 29: 2020-03-05 LEMON 10.040 9.020 30: 2020-03-06 LEMON 10.050 8.900 31: 2020-03-07 LEMON 10.060 8.780 32: 2020-03-08 LEMON 10.070 8.660 33: 2020-03-01 GRAPE 0.500 0.000 34: 2020-03-02 GRAPE 0.630 0.000 35: 2020-03-03 GRAPE 0.760 0.000 36: 2020-03-04 GRAPE 0.890 0.000 37: 2020-03-05 GRAPE 1.020 0.000 38: 2020-03-06 GRAPE 1.150 0.000 39: 2020-03-07 GRAPE 1.280 0.000 40: 2020-03-08 GRAPE 1.410 0.000 DATE FRUIT PRICE RESULTS As you can see, I was, somehow, able to reach the target. But it doesn't seem ideal.
I'm learning R (and programming in general) now, so I am VERY new to all of this, especially in making the code as optimized as possible (consider memory constraints here).
Any suggestions on how to improve the speed/performance of operation or better ways to do that? Thank you very much in advance!
https://stackoverflow.com/questions/65730226/r-data-table-optimize-speed-of-row-operations-by-different-groups January 15, 2021 at 11:46AM
没有评论:
发表评论