Goal: To print lines of File2 when field 1 ($1) and field 4 ($4) of File1 both match a substring in field 4 ($4) on lines beginning with ">" in File2.
Important Note #1: The output includes each matching line and all the lines following it, up to (but not including) the next line beginning with ">".
Example: When fields 1 and 4 of File1 are 2776 and 2968 respectively, these should be searched against field 4 of File2 to eventually find the match 2776-2968(+), because both numbers from File1 match a substring of field 4 in File2. The order of the numbers in the string does not matter: 2968-2776(+) should also be considered a match. Since they match, that line of File2 is printed with all lines below it until another line with ">" is encountered.
Important Note #2: File1 is tab-delimited (\t). File2 is colon-delimited (:).
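The matching rule described above can be tried on a single header line first. This is only a minimal sketch, not a full solution. It assumes field 4 always has the form start-stop(strand), and it compares the two numbers exactly rather than as arbitrary substrings (so 277 would not match 2776). The variables `a` and `b` are illustrative placeholders for fields 1 and 4 of one File1 row.

```shell
# Check one File2 header against one pair of File1 values, in either order.
header='>-::NC_013316.1:2968-2776(+)'
result=$(printf '%s\n' "$header" | awk -F: -v a=2776 -v b=2968 '
{
    # Field 4 looks like "2968-2776(+)". Splitting on "-" and "(" puts
    # the two coordinates in c[1] and c[2]; the rest is strand residue.
    split($4, c, "[-(]")
    if ((c[1] == a && c[2] == b) || (c[1] == b && c[2] == a))
        print "match"
    else
        print "no match"
}')
echo "$result"
```

Here the header lists the numbers in the reverse order of `a` and `b`, so the second half of the condition is the one that applies.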
File1:
Transcription_Start  Translation_Start  Translation_Stop  Transcription_Stop  Strand  Expression
2776                                                      2968                +       920
17374                                                     17563               +       1959
2968                                                      2786                -       802
17563                                                     17375               -       1694
19606                                                     19395               -       1914

File2:
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA

Desired Output:
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC

This is what I've tried so far (it outputs the full contents of File2, thus failing to produce the desired output):
$ awk -F"\t|:" 'NR==FNR{a[$4]; next} ($1 in a) || ($4 in a)' File1 File2 > Output
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA

How can I process my files with awk (or similar) to achieve my goal?
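One possible approach is sketched below. It is not a definitive answer and rests on several assumptions: File1 has a single header row, fields 2 and 3 of its data rows may be empty, and the two coordinates are compared as exact strings rather than as arbitrary substrings. The sample files are shortened copies of the question's data, and the names `pairs`, `key`, and `keep` are illustrative. Because the two files use different delimiters, the sketch leaves `FS` alone and calls `split()` on each line.

```shell
# Build shortened sample files in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# File1: tab-delimited, header row first; fields 2 and 3 are empty here.
printf 'Transcription_Start\tTranslation_Start\tTranslation_Stop\tTranscription_Stop\tStrand\tExpression\n' > File1
printf '2776\t\t\t2968\t+\t920\n' >> File1
printf '2968\t\t\t2786\t-\t802\n' >> File1
printf '19606\t\t\t19395\t-\t1914\n' >> File1

# File2: colon-delimited headers, each followed by sequence lines.
cat > File2 <<'EOF'
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGC
TCTTGAGAGCGGCGGACGGG
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTT
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTC
EOF

awk '
# First file: remember each (field 1, field 4) pair in both orders.
NR == FNR {
    if (FNR > 1) {                  # skip the header row
        split($0, f, "\t")
        pairs[f[1] "-" f[4]]
        pairs[f[4] "-" f[1]]
    }
    next
}
# Second file, header line: decide whether this record is wanted.
/^>/ {
    split($0, h, ":")               # h[4] looks like "2776-2968(+)"
    split(h[4], c, "[-(]")          # c[1] and c[2] are the coordinates
    key = c[1] "-" c[2]
    keep = (key in pairs)
}
# Print the header and the sequence lines after it while keep is set.
keep
' File1 File2 > Output

cat Output
```

With this sample data, the first two records are printed together with their sequence lines, and the antisense record is skipped. Storing both orderings of each pair is what makes 2968-2786 in File1 match 2786-2968(-) in File2.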
https://stackoverflow.com/questions/66881030/printing-lines-of-file2-when-two-fields-from-file1-match-substrings-of-a-single March 31, 2021 at 10:57AM