1) Write a program in Python, R, or C (or any other language you like; these are just the ones I prefer) that takes a FASTQ file as input and prints the distribution of quality scores across all reads. Summarize the distribution of Q scores at each base position with a statistic of your choice (e.g. mean, mode, median, or quantiles). If you'd like, you can also plot the distribution of Q scores as a box plot, much like the one generated by FastQC. Then run your program on the provided FASTQ file and obtain its output.
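Note that a FASTQ file stores each read as a record of four lines:
Line 1: the read identifier, beginning with @
Line 2: the nucleotide sequence
Line 3: a separator line, beginning with +
Line 4: the quality score line, with one character per base in Line 2
This format is repeated for each read.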
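One possible way to walk through the records, sketched in Python. It assumes the common four-line layout (identifier line starting with @, sequence, + separator, quality line) with the sequence and quality each on a single unwrapped line. The function name read_fastq and the sample record are made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) for each four-line record."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        # Drop the leading '@' from the identifier line
        yield header.rstrip("\n")[1:], sequence, quality

# Made-up record, for illustration only
example = io.StringIO("@read1\nGATTACA\n+\nIIIIHH#\n")
records = list(read_fastq(example))
print(records)
```

For a real file, you would pass an open file handle, e.g. the result of open() on your FASTQ path, in place of the io.StringIO object.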
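The idea is simple: for each character in the quality score line, Q = (ASCII value of that character) - 33. In turn, Q = -10 * log10(Pe), where Pe is the probability that the nucleotide base was called incorrectly, so Pe = 10^(-Q/10). For example, the character I has ASCII value 73, which gives Q = 40 and Pe = 0.0001.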
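As a minimal sketch of this conversion in Python, assuming the offset of 33 described above (the function names phred_scores and error_probability are hypothetical, chosen only for this example):

```python
def phred_scores(quality_line, offset=33):
    """Convert each quality character to its Q score: ASCII value minus offset."""
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q):
    """Invert Q = -10 * log10(Pe) to get Pe = 10^(-Q/10)."""
    return 10 ** (-q / 10)

scores = phred_scores("I5#")
print(scores)  # [40, 20, 2]
for q in scores:
    print(q, error_probability(q))
```

Higher Q means a lower chance of a miscalled base: Q = 20 corresponds to a 1 in 100 error rate, and Q = 40 to 1 in 10,000.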
Here are ways to get the ASCII value of a character in various languages:
Python: ord()
R: utf8ToInt(), or as.integer(charToRaw())
C: When you scanf() a character with %c, it is stored in a char, which is already an integer holding the character's code (ASCII on virtually all systems), so you can use it directly in arithmetic, e.g. c - 33
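Putting these pieces together, one way to summarize Q scores at each base position could look like the sketch below. It uses Python's ord() and the offset of 33, and it allows reads of different lengths. The function name per_position_scores and the quality lines are made up for illustration; mean and median are just two of the statistics you might choose.

```python
from statistics import mean, median

def per_position_scores(quality_lines, offset=33):
    """Group Q scores by base position across reads."""
    columns = []
    for line in quality_lines:
        for i, ch in enumerate(line):
            if i == len(columns):
                columns.append([])  # first read long enough to reach this position
            columns[i].append(ord(ch) - offset)
    return columns

# Made-up quality lines, for illustration only
quality_lines = ["II5", "I5#", "5I"]
columns = per_position_scores(quality_lines)
for pos, qs in enumerate(columns, start=1):
    print(f"position {pos}: n={len(qs)} mean={mean(qs):.2f} median={median(qs)}")
```

The per-position lists in columns are also the natural input for a box plot, with one box per base position.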