Tuesday, March 30, 2021

Printing lines of file2 when two fields from file1 match substrings of a single field in file2

Goal: To print lines of File2 when field 1 ($1) and field 4 ($4) of File1 both match a substring in field 4 ($4) on lines beginning with ">" in File2.

Important Note #1: The output should include each matching ">" line and all the lines that follow it, up to (but not including) the next line beginning with ">".

Example: When fields 1 and 4 of File1 are 2776 and 2968 respectively, these should be searched against field 4 of File2 to eventually find the match 2776-2968(+) (because both numbers from File1 match a substring in field 4 of File2). The order of the numbers in the string does not matter: 2968-2776(+) should also be considered a match. Since they match, that line of File2 is printed along with all lines below it until another line beginning with ">" is encountered.
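The matching rule in the example can be sketched for a single header line as below. This is only an illustration: the helper name match_pair is hypothetical, and it assumes the two numbers should be compared as whole coordinates rather than as loose substrings, so that 776 does not accidentally match 2776.

```shell
# Hypothetical header line, taken from the example in the question.
header='>-::NC_013316.1:2776-2968(+)'

# match_pair is an illustrative helper, not part of the original post.
# $1 = first number, $2 = second number, $3 = header line
match_pair() {
    printf '%s\n' "$3" | awk -F: -v a="$1" -v b="$2" '
        /^>/ {
            # Field 4 looks like "2776-2968(+)"; split it on "-", "(" and ")".
            split($4, p, /[-()]/)
            # Accept the two coordinates in either order.
            if ((p[1] == a && p[2] == b) || (p[1] == b && p[2] == a))
                print "match"
            else
                print "no match"
        }'
}

r1=$(match_pair 2776 2968 "$header")   # same order as in the header
r2=$(match_pair 2968 2776 "$header")   # reversed order
r3=$(match_pair 776 2968 "$header")    # 776 is only a substring of 2776
printf '%s\n' "$r1" "$r2" "$r3"
```

Comparing whole coordinates is a design choice made here for safety; if true substring matching is wanted, index($4, a) could be used instead, at the cost of false positives on shorter numbers.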

Important Note #2: File1 is tab-delimited (\t). File2 is colon-delimited (:).


File1:

Transcription_Start     Translation_Start       Translation_Stop        Transcription_Stop      Strand  Expression
2776                    2968    +       920
17374                   17563   +       1959
2968                    2786    -       802
17563                   17375   -       1694
19606                   19395   -       1914

File2:

>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA

Desired Output:

>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC

This is what I've tried so far (it outputs the full contents of File2, thus failing to produce the desired output):

$ awk -F"\t|:" 'NR==FNR{a[$4]; next} ($1 in a) || ($4 in a)' File1 File2 > Output
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA

How can I process my files with awk (or similar) to achieve my goal?
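One possible approach, offered as a sketch rather than a definitive answer, is to store each (field 1, field 4) pair from File1 and then toggle a print flag on every ">" line of File2. The script below recreates the sample data in a temporary directory so it can be run as is. It assumes that fields 2 and 3 of File1 are empty (which is what makes the second number field 4 in a tab-delimited file), that the coordinates are compared as whole numbers rather than loose substrings, and that the array name want and the flag keep are illustrative choices.

```shell
# Recreate the sample files in a scratch directory (sample data from the question).
tmp=$(mktemp -d)

# File1: tab-delimited; fields 2 and 3 are assumed to be empty.
printf 'Transcription_Start\tTranslation_Start\tTranslation_Stop\tTranscription_Stop\tStrand\tExpression\n' > "$tmp/File1"
printf '%s\t\t\t%s\t%s\t%s\n' \
    2776 2968 + 920 \
    17374 17563 + 1959 \
    2968 2786 - 802 \
    17563 17375 - 1694 \
    19606 19395 - 1914 >> "$tmp/File1"

# File2: colon-delimited header lines followed by sequence lines.
cat > "$tmp/File2" <<'EOF'
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA
EOF

out=$(awk '
    # First file (File1): remember each (field 1, field 4) pair.
    NR == FNR {
        if (FNR > 1) want[$1, $4]    # skip the column-name row
        next
    }
    # Second file (File2): on each header line, decide whether to print the block.
    /^>/ {
        # Field 4 looks like "2776-2968(+)"; split it on "-", "(" and ")".
        split($4, p, /[-()]/)
        # Accept the pair in either order.
        keep = ((p[1], p[2]) in want) || ((p[2], p[1]) in want)
    }
    # Print the header and every following line while the flag is set.
    keep
' FS='\t' "$tmp/File1" FS=':' "$tmp/File2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Setting FS separately before each file name lets each file be split by its own delimiter, instead of the combined "\t|:" separator used in the attempt above. On real data the same awk program would be run directly on File1 and File2 with the output redirected to a file.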

https://stackoverflow.com/questions/66881030/printing-lines-of-file2-when-two-fields-from-file1-match-substrings-of-a-single March 31, 2021 at 10:57AM
