8 Loops, Logic and Functions: Diving Deeper into R
If you are currently participating in a timetabled BIOS103 QS workshop, please ensure that you cover all of this section’s content and complete this week’s formative and summative assessments in the BIOS103 Canvas Course.
By now, you can read files, generate summary tables, visualise data, and even perform some statistical tests in R. That’s an incredible foundation to build upon. When it comes to QS, you’ve come further and faster than any first year student before you. I salute you.
This week, we’re stepping outside the lab’s core content to focus on key coding fundamentals. Why? These skills are vital for bioinformatics, a field you’ll explore in depth next semester. Our focus in this chapter on FASTA files - a universal format for genomic and proteomic data - will introduce you to the power of automating complex tasks with code.
You might find this week challenging, and that’s okay! Growth happens when we push beyond our comfort zones. Remember, support is always available - engage with the workshop channel on Microsoft Teams. If you feel you want in-person support then the demo team and I will be in the Rendall Building PC Teaching Centre between 9 and 11am on Thursday. Drop-in whenever.
So, take a deep breath. You’ve got this. Let’s unlock the true potential of coding together! Here are this weeks learning objectives:
- Understand how to use for loops to repeat tasks efficiently.
- Learn how to apply if statements to control the flow of your code with logical conditions.
- Gain skills in building functions to simplify complex tasks and increase code reusability.
- Learn how to organise your scripts effectively so you can call your pre-built functions whenever needed and minimise clutter.
- Build confidence in self-directed learning to solve coding challenges independently.
I Don’t Want to Read This Boring Chapter!
Fair enough. I understand that many of you will want to get stuck in immediately with your RStudio and complete your formative/summative tasks for this week. If this is the case then you should do the following first:
- Start a new R project for this week.
- Download the genomic_data_100.fasta example sequence file and move it to your project directory.
- Watch video 8.4 Counting Motifs to learn how to complete question 1 on this week’s quizzes.
- Watch video 8.5 The Challenge to learn how to complete question 2.
For those who have a bit more patience, I strongly recommend taking the time to read and understand the following sections. It will deepen your understanding of coding and help you build stronger problem-solving and analytical skills!
8.1 What is FASTA?
FASTA is a standard file format used to store nucleotide or protein sequences, widely employed in bioinformatics and genomics. Each entry in a FASTA file begins with a header line, starting with a >
character, followed by a description or identifier for the sequence. The lines that follow contain the actual sequence, represented by letters corresponding to nucleotides(e.g. A, T, G, C for DNA) or amino acids (e.g., M, P, L, K for proteins).
FASTA files are compact, easy to parse, and compatible with a wide range of bioinformatics tools, making them idea for sharing and analysing genomic or proteomic data. They can store single or multiple sequences in a simple, human-readable format.
8.2 Extracting Sequences from a FASTA File
If you haven’t done so already, start a new R project and create a new R script called “fasta-tools.R”. Download the genomic_data_100.fasta example file and move it to your project directory.
Copy and paste the code block below into your script and run each line in turn.
# FILE: fasta-tools.R
# The following code will read the sequences from the example FASTA file
# Specify filename of example FASTA file
fasta_file <- "genomic_data_100.fasta"
# Read lines from file to a vector
fasta_lines <- readLines(fasta_file)
# Define an empty vector to hold the sequences in the file
sequences <- c()
# Loop over every line in fasta_lines vector
for (line in fasta_lines) {
# Logic to ignore any line that starts with a ">" character
if (!startsWith(line, ">")) {
# If line doesn't start with ">" then add it as an item to sequences vector
sequences <- c(sequences, line)
}
}
# Inspect your sequences vector
print(paste0("Number of sequences in file: ", length(sequences),
"; First sequence: ", sequences[1]))
## [1] "Number of sequences in file: 100; First sequence: TGGGATAAAAACGAACGACTCGAGTCAAACGACAACAGTTATAAAGCACCCTCGGGTCTACTGATACTTCTGCTTGTTGCTACCAAATCCCAACTTTGCGTAAAGATAGACTCTGCTACTACGAACGTGGGTTTTGTACAGTGTGCCAGAGACCTCAGGTAAGGCCAGTAACTGTGTGTGCCATAAGGGGGCTTCTAAGAACTACACCTGCGCGTTAACGGTATCATATTGCCTGACAAGACACTCAATCTACACCAACGAGGGGTAACTGATCAACGAATTTGTCTCGCGTATCGGGAGAGCATCCACCCCTATGGGCGAATAACTAGATAGAGATCCTCAGAAACCGTTCCATCCCACTAACGGACAACCGATCACGGATATCTAATGGGCGGATACCCTGGACTTCCCCTGTATGACGGTCTGGAGTCAGAGTGAAGCCGGGCGCTTGAGTTTCTCAGGTTGCTCGGGTATTGATGTCCGGGCCACAAGCACGTATAATGGAGTGATCGTGCTGTTCTCGCAACTACCAAACTTGATTTACTTCAGGAGTCTAACGATACCTCGCAGTCACTGGTATCAATGCTCCGTCTAAAACAGTCGTTTTCAACCCGAACGATAATCATATAGCGGACCTTCGCAGGTGTCCGACACCTTTATGAAGGGGTCTGCGACGGAATCTGGGACACTTGGAGTGCATCGAACGCATGATCATTCCAACAACCTTGCAGGCGGTGAGCCGGCCCGGGCATATCAGGTCAAGCTTTGCGCGATTGAAATAGCGCAGAAATCCACTATGGATGTTACCTCGATTCACCTTTTCCATGACCCATATTCCCATGAGGGTGGAGGCAAACCTGGACATCATAGCCCCACGGTTAGAGACCTGCTACCATAAATGATCTTCGAACGATCCTCTAAGAGGCCATTGAACCTATCAACTTCGCAGAGTAGTGCGGGCTTTCTGGTGCAAGCTTAACTCAGTCTTATTGGTAATAAT"
Let’s break this code down in detail:
1. Specify the File Name
The script begins with fasta_file <- "genomic_data_100.fasta"
.
- This step defines the name of the file we want to work with. The variable
fasta_file
acts as a placeholder for the file name, so if the file name changes later, we only need to update it here. - Keeping the file name as a variable makes the script more flexible and reusable.
2. Read the File into Memory
The line fasta_lines <- readLines(fasta_file)
reads the contents of the FASTA file into a vector called fasta_lines
.
- The function
readLines()
reads each line of the file as a separate element in the vector. For example, the vector might look like this:- “>sequence_1”
- “ATGCGTACGTA”
- “>sequence_2”
- “GCTAGCTAGCT”
This raw input includes both header lines (starting with >
) and sequence lines. At this point, nothing has been filtered yet.
3. Create a Place to Store Sequences
Next, we initialise an empty vector with sequences <- c()
.
- This step prepares a container to store the biological sequences. Since we only want the actual genetic data (not the headers), we need this separate vector.
- Starting with an empty vector ensures that we don’t accidentally mix in data from other parts of the script.
4. Process Each Line in the File
Now comes the heart of the script: a loop that processes every line of the file one by one. The loop starts with for (line in fasta_lines)
.
- What does this mean?
- The for loop iterates through each element in
fasta_lines
. - During each iteration, the current line is stored in the variable line.
- The for loop iterates through each element in
- For example, on the first iteration,
line
might be">sequence_1"
. On the second iteration, it might be"ATGCGTACGTA"
, and so on.
5. Use an if Statement to Filter Lines
Inside the loop, the script uses if (!startsWith(line, ">"))
to decide whether to process the current line.
- The function
startsWith(line, ">")
checks if the line begins with the>
character (indicating a header). If it does, theif
condition evaluates toFALSE
, and the line is ignored. - The
!
symbol means “not.” So!startsWith(line, ">")
evaluates toTRUE
if the line does not start with>
. These are the sequence lines we want.
6. Add Sequences to the Vector
When the condition is TRUE
(i.e., the line is a sequence, not a header), the line is added to the sequences vector using sequences <- c(sequences, line)
.
- The
c()
function combines the existingsequences
vector with the new line, effectively appending the sequence. - This ensures that all sequence lines are collected in one place.
7. What’s With the Curly Brackets {}?
In the script, the block of code inside the loop and the if
statement is enclosed in curly brackets {}
. For example:
- What do the curly brackets do?
- They group the code together, indicating what should run when the
if
condition isTRUE
. Without curly brackets, only the first line after if would execute, which can lead to errors in more complex scripts.
- They group the code together, indicating what should run when the
- Why are they needed here?
- Since we’re using both a condition (
if
) and a loop (for
), the brackets ensure the script knows which pieces of code belong together and when they should run.
- Since we’re using both a condition (
- Do I need to indent my code inside the curly brackets?
- While indentation is not strictly required for the code to run, it is highly recommended. Indentation helps improve readability, making it clear what code belongs inside the curly brackets. Proper indentation makes it easier for others (and for you) to understand the logic and flow of the script, especially as your code grows more complex. In some languages, e.g. Python, indentation and the use of white space is not optional!
8. Check the Results
The script ends with a single line to check the results:
print(paste0("Number of sequences in file: ", length(sequences), "; First sequence: ", sequences[1]))
- What does this do?
length(sequences)
counts the total number of sequences extracted.sequences[1]
retrieves the first sequence.- The
paste0()
function combines these values into a single message for output.
Why is This Script Useful?
The script above serves as a foundational tool for working with FASTA files in R, efficiently extracting sequence data while ignoring headers. Using a loop, conditional statement, and vectors ensures it can scale to larger datasets. The final output provides a quick summary for easy verification.
8.3 Writing Functional Code
In programming, a function is a reusable block of code that performs a specific task. Functions take inputs, process them, and return outputs. Using functions helps avoid repetitive code, making your scripts more modular and easier to maintain.
For example, in previous sections, we’ve used functions like print()
to display output. The print()
function takes an argument (the thing you want to display) and prints it to the console. Functions can be as simple as this or as complex as needed, depending on the task.
Now, let’s explore how to write your own functions by transforming the code we’ve been working with into a reusable function. Remove all the code in your exiting fasta-tools.R
file, then copy and paste the following code.
# FILE: fasta-tools.R
# Define a FUNCTION to extract all sequences from ANY FASTA file
extract_sequences_from_fasta <- function(fasta_file) {
# Read lines from file to a vector
fasta_lines <- readLines(fasta_file)
# Define an empty vector to hold the sequences in the file
sequences <- c()
# Loop over every line in fasta_lines vector
for (line in fasta_lines) {
# Logic to ignore any line that starts with a ">" character
if (!startsWith(line, ">")) {
# If line doesn't start with ">" then add it as an item to sequences vector
sequences <- c(sequences, line)
}
}
return(sequences)
}
Try running the code above by clicking the source button in the top-right of your script window. What happens? Anything?
You’ll probably see some blue text in your console output, but you shouldn’t see any outputs in black. That’s because right now, you’re script isn’t actually doing anything. All you’re doing by running the script is storing the function into your computer’s memory fo use later. You should see the function extract_sequences_from_data
appear in an area called Functions in your environment window in the top-right of R Studio.
Here’s what we’ve done to “functionify” the code:
- Encapsulation of Logic into a Function
The main change is that we’ve wrapped the entire process inside a function definition:
This function now acts as a stand-alone unit, which can be used whenever we need to extract sequences from a FASTA file, with the fasta_file
parameter acting as an input to specify which file to read.
2. Parameterisation
In the original script, the file path was hard-coded directly into the script (fasta_file <- "genomic_data_100.fasta"
). In the function, we’ve replaced this with a parameter:
This allows us to pass in any file path when calling the function, making it much more flexible. Instead of being tied to a specific file, the function now works for any FASTA file. How cool is that?
3. Returning the Result
In the original code, the sequences were processed and printed directly within the script. In the function version, we use the return(sequences)
statement at the end of the function to output the extracted sequences. This allows the function to be used in any context where we want to capture the output, such as storing it in a variable or processing it further.
So I’ve Functionified My Code. Now What?
In the example below, we’ll use the extract_sequences_from_fasta()
function that we’ve just defined in our fasta-tools.R
script file in a new script called main.R
.
Create a new script called main.R
and copy and paste the following block of code into it. Both fasta-tools.R
and main.R
scripts must sit next to each other in your project directory.
# FILE: main.R
# Import your fasta-tools.R file
source("fasta-tools.R")
# Use the extract_sequences_from_data function that you have defined in your fasta-tools.R file
sequences <- extract_sequences_from_fasta("genomic_data_100.fasta")
# Inspect your sequences vector
print(paste0("Number of sequences in file: ", length(sequences)))
## [1] "Number of sequences in file: 100"
What’s Happening Here?
Source Your Function: By using
source("fasta-tools.R")
, we bring in theextract_sequences_from_fasta
function into the current script. This makes the function available for use without having to redefine it.Calling the Function: The function is called with
extract_sequences_from_fasta("genomic_data_100.fasta")
. We pass in the FASTA file name, and the function does all the work of extracting the sequences. We then move on to inspect and analyse the results.Keeping It Clean: The
main.R
script remains clean and uncluttered, focusing on high-level tasks like extracting and analysing the sequences. The function handles the details of working with the FASTA file.
Why Is This Useful?
Reusability: We can reuse the
extract_sequences_from_fasta
function in any future script. Instead of repeating the code for sequence extraction each time, we simply call the function and pass in the file name. This reduces redundancy and saves time.Clarity and Maintainability: By placing the logic for sequence extraction in its own file and function, the main script (
main.R
) is easier to read and understand. If something goes wrong with the sequence extraction, we’ll know exactly where to look—insidefasta-tools.R
based on the error message that is output to our console.Scalability: As your projects grow, you’ll often need to extract sequences from different FASTA files or perform additional processing. With functions, it’s easy to add new functionality without cluttering your main scripts.
8.4 Counting Motifs
A motif is a recurring sequence pattern in DNA, often associated with important biological functions such as transcription factor binding or gene regulation. Counting motifs provides a foundation for understanding genomic organisation and function, revealing insights into regulatory mechanisms, evolutionary patterns, and functional elements. By analysing motif frequencies, researchers can uncover critical information about gene expression, epigenetics, and genetic variation, making it an essential step in exploring the complexity of genomes.
With all that in mind, lets build on our fasta-tools.R
script by adding functions that:
- Count the number of times a specific motif occurs in a sequence.
- Calculate the average frequency that a specific motif occurs in ALL sequences in our FASTA FILE.
If we do that, then we can complete our first formative quiz question this week. Hooray!
Writing a count_motif
function
Update your fasta-tools.R
script with the following function. Click the source button to update your computer’s memory with the additional function.
# FILE: fasta-tools.R
# Extend the file above with the following ...
# Define a function that can count the number of times a motif appears in a sequence
count_motif <- function(sequence, motif="AAA") {
# Set counter to zero
count <- 0
# Loop through sequence character by character
for (i in seq(1, nchar(sequence), by=nchar(motif))) {
# Slice your sequence to create a substring that is the same size as your motif
chunk <- substr(sequence, i, i + nchar(motif)-1)
# if your chunk and motif are the same add 1 to the count
if (motif == chunk) {
count <- count + 1
}
}
# Return the final count value for further processing or storage to a variable
return(count)
}
Yikes! What an earth is going on here? Hold on, I’ll explain in detail very shortly. But first, let’s update your main.R
to test the new function. Add the following code to main.R
and then click source to run all lines.
# FILE: main.R
# Extend the file above with the following ...
# Count the number of times "ACG" occurs in the first sequence
print(paste0("Number of times 'ACG' occurs in the first sequence: ", count_motif(sequences[1], "ACG")))
## [1] "Number of times 'ACG' occurs in the first sequence: 3"
You should see the output above. If yes, well done! If no, check that you’ve not made any copy/paste mistakes and that you’ve definitely recompiled your fast-tools.R
script first.
Let’s now return to that count_motif
function we added to fasta-tools.R
and explain what is happening.
1. Function Definition
count_motif
is the name of the function- It takes two arguments:
sequence
: a string representing the DAN sequence to search within.mofit
: a string representing the patter to count. If no motif is provided, it defaults to"AAA
.
2. Initialise Counter
- A variable
count
is set to0
to keep track of how many times the motif is found.
3. Loop Through the Sequence
- A
for
loop iterates through the sequence. seq(1, nchar(sequence), by=nchar(motif))
generates a sequence of starting positions, advancing by the length of the motif each time.- The function
nchar
simply returns the number of characters in the sequence (which I know is 1000).
- The function
4. Extract a Substring
substr(sequence, i, i + nchar(motif) - 1)
extracts a substring from the sequence starting at positioni
and spanning the length of the motif. For example, ifi=1
to begin,sequence = "ATGCTT ..." and motif="ACG":
subs(“ATGCTT …”, 1, 3) will return the substring “ATG” and save it as the firstchunk
.
5. Compare Substring to the Motif
- The extracted substring (
chunk
) is compared to the motif. - If they match,
count
is incremented by 1. - Note: The
==
is very important! conditional statements likeif
andwhile
rely and logical conditions (TRUE
andFALSE
) to determine the flow of a program. Using==
ensures that the condition evaluates whether the two expressions are equal before deciding the next action. The use of a single=
when making these conditional statements will throw an error - yuck!
6. Return the Result.
- The function returns the total count of motif occurrences for further use.
7. Using the count_motif
Function
We call the count_motif
in our main.R
script within a print
function:
print(paste0("Number of times 'ACG' occurs in the first sequence: ", count_motif(sequences[1], "ACG")))
- This generates the output that you should see below the function definition code block above.
We’ve now got nearly everything we need to answer this week’s first quiz question:
Calculate the average number of times that the motif “AAT” appears in each sequence within the FASTA file.
Hmmm, how might you go about this? Here’s what I’d do:
- Write a new function, e.g.
calculate_mean_motif_freq
in yourfasta-tools.R
file. - The function will take two arguments:
sequences
: a vector containing all of your extracted sequences.motif
: the string of characters you’re interested in searching for
- I’d start my function my creating an empty vector called
motif_freq
to store the number of times the motif is found in each sequence. - Then, I’d create a loop that started like this:
for (sequence in sequences) {
- During the loop I can use my
count_motif
function on each sequence and append the returned value to mymotif_freq
vector.
- I’d finish my function by using use the
mean()
function onmotif_freq
vector and returning the value. - Lastly, I’ll call this function in my
main.R
script and print the output.
I’d really love you to have a go at this yourself. Go on! You can do it. Remember, you can ask for help in the Workshop channel on Microsoft Teams. I’m also extremely happy for you to collaborate with each other and share your solutions. So that you know that you’re correct, the answer you’re expecting for a motif="AAT"
and using the example dataset above is 5.16 to 2 d.p.
If you get really stuck, that’s OK. I’ve included a full fasta-tools.R
file at the end of this chapter that includes a calculate_mean_motif_freq
function that you can use. Also, here is a 10 min video on how to answer the first quiz question:
8.5 The Fun Bit (AKA - The Last Horrible Bit)
As mentioned above, you’ll find a complete fasta-tools.R
script at the end of this chapter. The script contains some additional helper functions:
reverse_complement
: generates the reverse complement of a DNA sequence.find_most_common_triplet
: identifies the most frequent three letter “triplet” in a given sequence.
With these tools in hand, we can start building workflows in a main.R
script to solve more complex problems. Here’s one for you to tackle:
The Challenge
You have a FASTA file full of DNA sequences. You need to:
- Extract all the sequences from the file.
- Generate a reverse complement for every sequence.
- Combine all the reverse complements into a single string.
- Identify the most common three-letter “triplet” in that string.
Sounds fun right? Below is an example of a main.R
script that implements the above workflow. Here’s the twist: the script isn’t complete! It’s up to you to figure out the missing pieces! This is an exercise in self-directed learning. You should do whatever it takes to solve the problem. This may include doing things that you’re not comfortable with, including:
- Googling.
- ChatGPTing.
- Asking for help in the Workshop channel in Teams (it will always be given!).
- Asking for help from your peers.
Here’s the partially written script:
# Load in function from fasta-tools.R
source("fasta-tools.R")
# Define file_path variable
file_path <- "genomic_data_100.fasta"
# Extract sequences from fasta file
sequences <- extract_sequences_from_fasta(file_path)
# Generate vector of reverse_complements
reverse_complements <-
# Join all reverse complements into a single string
joined_sequence <-
# Get most common triplet
most_common_triplet <- find_most_common_triplet(joined_sequence)
# Output the result
print(most_common_triplet)
Hint: For the example dataset “genomic_data_100.fasta”, the most common triplet is "GAT"
.
Why This Challenge Matters
The real skill of coding isn’t just cutting and pasting little snippets that I feed you. It’s about learning how to problem-solve, think critically, and figure out how to bridge the gaps in a workflow.
This challenge gives you the chance to:
- Develop problem-solving skills by working through an incomplete script.
- Collaborate with others to find solutions—you’re not in this alone!
- Practice researching and applying new concepts independently.
Don’t be afraid to get stuck—it’s part of the learning process. Push through, and you’ll gain a deeper understanding of how to work with real-world data and scripts. The skills you build here will serve you far beyond this exercise.
8.6 fasta-tools.R
# Define a function to extract sequences from a FASTA file
extract_sequences_from_fasta <- function(fasta_file) {
# Read all lines from the specified FASTA file into a vector
fasta_lines <- readLines(fasta_file)
# Create an empty vector to store DNA sequences
sequences <- c()
# Loop through each line in the file
for (line in fasta_lines) {
# Check if the line does NOT start with '>', which marks header lines in FASTA format
if (!startsWith(line, ">")) {
# If it's a sequence line, add it to the 'sequences' vector
sequences <- c(sequences, line)
}
}
# Return the vector of sequences
return(sequences)
}
# Define a function to count the occurrences of a motif in a sequence
count_motif <- function(sequence, motif="AAA") {
# Initialise a counter variable to track motif occurrences
count <- 0
# Loop through the sequence in chunks matching the motif's length
for (i in seq(1, nchar(sequence), by=nchar(motif))) {
# Extract a substring (chunk) from the sequence starting at position 'i'
chunk <- substr(sequence, i, i+nchar(motif)-1)
# Compare the chunk to the motif; if they match, increment the counter
if (motif == chunk) {
count <- count + 1
}
}
# Return the total count of motif occurrences
return(count)
}
# Define a function to calculate the mean frequency of a motif across multiple sequences
calculate_mean_motif_freq <- function(sequences, motif = "AAA") {
# Create an empty vector to store motif counts for each sequence
motif_frequency <- c()
# Loop through each sequence in the input list
for (sequence in sequences) {
# Use the 'count_motif' function to count occurrences of the motif in the current sequence
count <- count_motif(sequence, motif)
# Append the count to the 'motif_frequency' vector
motif_frequency <- c(motif_frequency, count)
}
# Calculate and return the mean (average) motif frequency
return(mean(motif_frequency))
}
# Define a function to compute the reverse complement of a DNA sequence
reverse_complement <- function(sequence) {
# Create a named vector mapping each nucleotide to its complement
complement_map <- c("A" = "T", "T" = "A", "C" = "G", "G" = "C")
# Split the sequence into individual nucleotides (characters)
nucleotides <- strsplit(sequence, split = "")[[1]]
# Replace each nucleotide with its complement using the mapping vector
complement <- complement_map[nucleotides]
# Reverse the complemented sequence and join it back into a single string
reverse_complement_sequence <- paste(rev(complement), collapse = "")
# Return the reverse complement sequence
return(reverse_complement_sequence)
}
# Define a function to find the most common triplet (3-nucleotide substring) in a sequence
find_most_common_triplet <- function(sequence) {
# Check if the input sequence is long enough to contain at least one triplet
if (nchar(sequence) < 3) {
stop("Sequence must be at least 3 nucleotides long.") # Stop with an error message if too short
}
# Generate all triplets (3-character substrings) from the sequence
triplets <- sapply(1:(nchar(sequence) - 2), function(i) {
substr(sequence, i, i + 2) # Extract triplet starting at position 'i'
})
# Count the frequency of each unique triplet using the 'table' function
triplet_counts <- table(triplets)
# Identify the most common triplet(s) by finding the maximum count
most_common <- names(triplet_counts[triplet_counts == max(triplet_counts)])
# Return the most common triplet(s) as a vector
return(most_common)
}