Tuesday, March 2, 2021

R - identify sequences in a vector

Suppose I have a vector ab containing A's and B's. I want to identify sequences (runs of identical consecutive values) and create a vector v of length(ab) that holds the sequence length at the first and last position of each sequence and NA everywhere else.

However, there is a restriction: a second vector x of 0/1 values marks the positions where a sequence must end, even if the value in ab does not change there.

So for example:

    rep("A", 6)
    "A" "A" "A" "A" "A" "A"

    x <- c(0,0,1,0,0,0)
    0 0 1 0 0 0

should give

    v <- c(3, NA, 3, 3, NA, 3)

An example could be the following:

    ab <- c(rep("A", 5), "B", rep("A", 3))
    "A" "A" "A" "A" "A" "B" "A" "A" "A"

    x <- c(rep(0,3),1,0,1,rep(0,3))
    0 0 0 1 0 1 0 0 0

Here the output should be:

    4 NA NA 4 1 1 3 NA 3

(without the restriction it would be)

    5 NA NA NA 5 1 3 NA 3
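To pin down the rule the two examples imply, here is a minimal loop-based sketch. It assumes a sequence closes at position i when i is the last element, when the next value of ab differs, or when x[i] is 1, and that neither vector contains NA. The function name mark_runs_loop is hypothetical and only used for illustration; the loop is meant as a readable reference for small inputs, not as a fast solution.

```r
# Hypothetical reference implementation: slow, intended for checking small inputs.
# Assumes ab and x have the same length and contain no NA values.
mark_runs_loop <- function(ab, x) {
  n <- length(ab)
  v <- rep(NA_integer_, n)
  start <- 1L                       # first position of the currently open sequence
  for (i in seq_len(n)) {
    # A sequence closes at i if i is the last element, the next value differs,
    # or x flags i as an end. `||` short-circuits, so ab[n + 1] is never read.
    closes <- i == n || ab[i + 1L] != ab[i] || x[i] == 1
    if (closes) {
      len <- i - start + 1L
      v[start] <- len               # mark the start of the sequence
      v[i] <- len                   # mark the end of the sequence
      start <- i + 1L
    }
  }
  v
}

ab <- c(rep("A", 5), "B", rep("A", 3))
x  <- c(rep(0, 3), 1, 0, 1, rep(0, 3))
mark_runs_loop(ab, x)
```

Under these assumptions the call reproduces the expected output for the example above, and passing a vector of zeros for x gives the result without the restriction.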

So far, my code without the restriction looks like this:

    ab <- c(rep("A", 5), "B", rep("A", 3))
    x <- c(rep(0,3),1,0,1,rep(0,3))

    cng <- ab[-1L] != ab[-length(ab)] # is there a change in A and B w.r.t. the previous value?
    idx <- which(cng) # where do the changes take place?
    idx <- c(idx,length(ab)) # include the last value
    seq_length <- diff(c(0, idx)) # how long are the sequences?

    # create v
    v <- rep(NA, length(ab))
    v[idx] <- seq_length # sequence end
    v[idx-(seq_length-1)] <- seq_length # sequence start
    v

Does anyone have an idea how I can implement the restriction? Since my vector has 2 million observations, I also wonder whether there is a more efficient way than my approach. I would appreciate any comments. Many thanks in advance!
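One possible direction, sketched under assumptions rather than offered as a definitive answer: since the code above already derives sequence ends from a logical "change" vector, the restriction might be folded in by also treating x == 1 as a break. The sketch assumes ab and x have equal length, contain no NA, and that x[i] == 1 means a sequence ends at position i. The function name mark_runs is hypothetical.

```r
# Hypothetical vectorized helper; the name and the NA-free assumption are illustrative.
mark_runs <- function(ab, x) {
  n <- length(ab)
  if (n == 0L) return(integer(0))
  # Position i (i < n) ends a sequence if the value changes at i + 1
  # or if x flags position i as an end.
  brk <- ab[-1L] != ab[-n] | x[-n] == 1
  idx <- c(which(brk), n)           # sequence ends, including the last element
  len <- diff(c(0L, idx))           # sequence lengths
  v <- rep(NA_integer_, n)
  v[idx] <- len                     # mark sequence ends
  v[idx - len + 1L] <- len          # mark sequence starts
  v
}

ab <- c(rep("A", 5), "B", rep("A", 3))
x  <- c(rep(0, 3), 1, 0, 1, rep(0, 3))
mark_runs(ab, x)
```

Everything here is a single pass of vectorized base R operations (comparison, which, diff, indexed assignment) with no explicit loop, so it should scale to vectors of a few million elements, though that has not been benchmarked here.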

https://stackoverflow.com/questions/66443043/r-identify-sequences-in-a-vector March 03, 2021 at 12:15AM
