](https://galaxyproject.github.io/training-material/templates/slides/)
- Slide deck related to the tutorials:
- - [Tutorial name](http://galaxyproject.github.io/training-material/templates/slides/tutorial.html)
+ - [Tutorial name](https://github.com/galaxyproject/training-material/templates/tutorials)
# Tutorials
@@ -19,10 +19,10 @@ Several tutorial with hands-on are available for this topic:
## Input datasets
-*Zenodo DOI badge, e.g. [](http://dx.doi.org/10.5281/zenodo.60520)*
+*Zenodo DOI badge, e.g. [](https://dx.doi.org/10.5281/zenodo.60520)*
The input datasets for the tutorials are available on
-[Zenodo with a dedicated DOI](http://dx.doi.org/10.5281/zenodo.60520).
+[Zenodo with a dedicated DOI](https://dx.doi.org/10.5281/zenodo.60520).
## Galaxy instance
diff --git a/topics/usegalaxy/docker/README.md b/topics/usegalaxy/docker/README.md
index 61483432..a9e0f7e2 100644
--- a/topics/usegalaxy/docker/README.md
+++ b/topics/usegalaxy/docker/README.md
@@ -24,4 +24,4 @@ For more details about this command line or specific usage, please consult the
# Support & Bug Reports
-You can file an [GitHub issue](https://github.com/bgruening/training-material/issues) or ask us on the [Galaxy development list](http://lists.bx.psu.edu/listinfo/galaxy-dev).
+You can file a [GitHub issue](https://github.com/bgruening/training-material/issues) or ask us on the [Galaxy development list](https://lists.bx.psu.edu/listinfo/galaxy-dev).
diff --git a/topics/usegalaxy/slides/index.html b/topics/usegalaxy/slides/index.html
index 2c2df12e..f49bebd0 100644
--- a/topics/usegalaxy/slides/index.html
+++ b/topics/usegalaxy/slides/index.html
@@ -10,7 +10,7 @@
### Topic
-Blabla
+Blabla
- Blabla
- Blabla
@@ -26,7 +26,7 @@

-[*Zang and Mortazavi, Nature, 2012*](http://www.nature.com/ni/journal/v13/n9/full/ni.2407.html)
+[*Zang and Mortazavi, Nature, 2012*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4138050/)
---
diff --git a/topics/usegalaxy/slides/tutorial.html b/topics/usegalaxy/slides/tutorial.html
index 2d368109..195fb511 100644
--- a/topics/usegalaxy/slides/tutorial.html
+++ b/topics/usegalaxy/slides/tutorial.html
@@ -11,7 +11,7 @@
### tutorial
-Blabla
+Blabla
- Blabla
- Blabla
@@ -27,7 +27,7 @@

-[*Zang and Mortazavi, Nature, 2012*](http://www.nature.com/ni/journal/v13/n9/full/ni.2407.html)
+[*Zang and Mortazavi, Nature, 2012*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4138050/)
---
diff --git a/topics/usegalaxy/tutorials/collections/tutorial.md b/topics/usegalaxy/tutorials/collections/tutorial.md
index fd04d6a1..c7c698ee 100644
--- a/topics/usegalaxy/tutorials/collections/tutorial.md
+++ b/topics/usegalaxy/tutorials/collections/tutorial.md
@@ -18,71 +18,71 @@ Here we will show Galaxy features designed to help with the analysis of large nu
- `M117C1-ch_1`- family 117, child, 1-st (**F**) read from **cheek**
- `M117C1-ch_2`- family 117, child, 2-nd (**R**) read from **cheek**
-These datasets represent genomic DNA (enriched for mitochondria via a long range PCR) isolated from blood (`bl`) and cheek (buccal swab, `ch`) of mother (`M117`) and her child (`M117C1`) that was sequenced on an Illumina miSeq machine as paired-read library (250-bp reads; see our [2014](http://www.pnas.org/content/111/43/15474.abstract) manuscript for **Methods**).
+These datasets represent genomic DNA (enriched for mitochondria via a long range PCR) isolated from blood (`bl`) and cheek (buccal swab, `ch`) of mother (`M117`) and her child (`M117C1`) that was sequenced on an Illumina miSeq machine as paired-read library (250-bp reads; see our [2014](https://www.pnas.org/content/111/43/15474.abstract) manuscript for **Methods**).
## Load data from Galaxy library
Right click (or Ctrl-click) on [this link](https://usegalaxy.org/library/list#folders/Fab5f788f07073c11) to open a new browser window and position this window side-by-side with the window displaying this tutorial. You will see this:
-
+
{: .img-responsive}
Select all datasets and click the **to History** button. This will import all datasets into a history. Follow the directions and you will see a screen like this:
-
+
{: .img-responsive}
## Creating a paired dataset collection
-Now click the checkbox in
and you will see your history changing like this:
+Now click the checkbox in  and you will see your history changing like this:
-
+
{: .img-responsive}
Let's click **All**, which will select all datasets in the history, then click **For all selected...** and finally select **Build List of Dataset Pairs** from the following menu:
-
+
{: .img-responsive}
The following wizard will appear:
-
+
{: .img-responsive}
In this case Galaxy automatically assigned pairs using the `_1` and `_2` endings of dataset names. Let's however pretend that this did not happen. Click the **Unpair all** link (highlighted in red in the figure above) and then the **Clear** link (highlighted in blue in the figure above). The interface will return to its initial state:
-
+
{: .img-responsive}
Hopefully you remember that we have paired-end data in this scenario. Datasets containing the first (forward) and the second (reverse) read are differentiated by having `_1` and `_2` in the filename. We can use this feature in the dataset collection wizard to pair our datasets. Type `_1` in the left **Filter this list** text box and `_2` in the right:
-
+
{: .img-responsive}
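The wizard's filter-and-pair behavior can be sketched in a few lines of Python. This is a hypothetical illustration, not Galaxy's actual code; the function name and matching rule are assumptions:

```python
# Hypothetical sketch of the pairing wizard's logic: filter names on the
# forward/reverse tags, then match names that differ only by that tag.
def auto_pair(names, fwd_tag="_1", rev_tag="_2"):
    forward = [n for n in names if fwd_tag in n]   # left "Filter this list" box
    reverse = [n for n in names if rev_tag in n]   # right "Filter this list" box
    pairs = []
    for f in forward:
        base = f.replace(fwd_tag, "")              # shared part of the pair's name
        for r in reverse:
            if r.replace(rev_tag, "") == base:
                pairs.append((base, f, r))         # (pair name, forward, reverse)
    return pairs

datasets = ["M117-bl_1", "M117-bl_2", "M117-ch_1", "M117-ch_2"]
print(auto_pair(datasets))
```

Running this on the dataset names above yields two pairs, `M117-bl` and `M117-ch`, mirroring what **Auto pair** does in the interface.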
You will see that the dataset collection wizard will automatically filter lists on each side of the interface:
-
+
{: .img-responsive}
Now you can either click **Auto pair** if pairs look good to you (proper combinations of datasets are listed in each line) or pair each forward/reverse group individually by pressing the **Pair these datasets** button separating each pair:
-
+
{: .img-responsive}
Now it is time to name the collection:
-
+
and create the collection by clicking **Create list**. A new item will appear in the history as you can see on the panel **A** below. Clicking on collection will expand it to show four pairs it contains (panel **B**). Clicking individual pairs will expand them further to reveal **forward** and **reverse** datasets (panel **C**). Expanding these further will enable one to see individual datasets (panel **D**).
-
+
{: .img-responsive}
## Using collections
By now we see that a collection can be used to bundle a large number of items into a single history item. This means that many Galaxy tools will be able to process all datasets in a collection transparently to you. Let's try to map these datasets to the human genome using the `bwa-mem` mapper:
-
+
{: .img-responsive}
Here is what you need to do:
@@ -95,17 +95,17 @@ Here is what you need to do:
You will see jobs being submitted and new datasets appearing in the history. In particular, below you can see that Galaxy has started four jobs (two yellow and two gray). This is because our eight datasets form four pairs, and each pair is processed separately by `bwa-mem`. As a result we have four `bwa-mem` runs:
-
+
{: .img-responsive}
Once these jobs are finished they will disappear from the history and all results will be represented as a new collection:
-
+
{: .img-responsive}
Let's look at this collection by clicking on it (panel **A** in the figure below). You can see that now this collection is no longer paired (compared to the collection we created at the beginning of this tutorial). This is because `bwa-mem` takes forward and reverse data as input, but produces only a single BAM dataset as the output. So what we have as the result is a *list* of four datasets (BAM files; panels **B** and **C**).
-
+
{: .img-responsive}
## Processing collection as a single entity
@@ -116,12 +116,12 @@ Now that `bwa-mem` has finished and generated a collection of BAM datasets we ca
Let's perform cleanup of our BAM files with `cleanSam` utility from the **Picard** package:
-
+
{: .img-responsive}
If you look at the picture above carefully, you will see that the **Select SAM/BAM dataset or dataset collection** parameter is empty (it says `No sam or bam datasets available.`). This is because we do not have single SAM or BAM datasets in the history. Instead we have a collection. So all you need to do is to click on the **folder**
button and you will get our BAM collection selected:
-
+
{: .img-responsive}
Click **Execute**. As an output this tool will produce a collection containing cleaned data.
@@ -130,24 +130,24 @@ Click **Execute**. As an output this tool will produce a collection contained cl
Now let's clean the dataset further by only preserving truly paired reads (reads satisfying two requirements: (1) the read is paired, and (2) it is mapped as a proper pair). For this we will use the `Filter SAM or BAM` tool from the **SAMTools** collection:
-
+
{: .img-responsive}
Parameters should be set as shown below. By setting mapping quality to `20` we avoid reads mapping to multiple locations, and by using the **Filter on bitwise flag** option we ensure that the resulting dataset will contain only properly paired reads. This operation will produce yet another collection containing the filtered datasets.
-
+
{: .img-responsive}
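The two filtering requirements map directly onto bits of the SAM flag field (`0x1` = read paired, `0x2` = mapped in a proper pair, per the SAM specification). Below is a minimal sketch of the criterion itself, not the SAMTools implementation:

```python
# Sketch of the filtering criterion; SAM flag bit names follow the SAM spec.
FLAG_PAIRED      = 0x1   # template has multiple segments (read is paired)
FLAG_PROPER_PAIR = 0x2   # each segment properly aligned (proper pair)

def keep_read(flag, mapq, min_mapq=20):
    """Keep reads that are paired, properly mapped, and have MAPQ >= min_mapq."""
    return bool(flag & FLAG_PAIRED) and bool(flag & FLAG_PROPER_PAIR) and mapq >= min_mapq

print(keep_read(flag=99, mapq=60))   # flag 99: paired + proper pair -> True
print(keep_read(flag=99, mapq=5))    # low MAPQ (likely multi-mapper) -> False
print(keep_read(flag=4, mapq=60))    # flag 4: unmapped, not proper pair -> False
```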
### Merging collection into a single dataset
The beauty of BAM datasets is that they can be combined into a single entity using so-called *read groups*. This allows bundling reads from multiple experiments into a single dataset where read identity is maintained by labelling every sequence with *read group* tags. So let's finally reduce this collection to a single BAM dataset. For this we will use the `MergeSamFiles` tool from the `Picard` suite:
-
+
{: .img-responsive}
Here we select the collection generated by the filtering tool described above:
-
+
{: .img-responsive}
This operation will **not** generate a collection. Instead, it will generate a single BAM dataset containing mapped reads from our four samples (`M117-bl`, `M117-ch`, `M117C1-bl`, and `M117C1-ch`).
@@ -156,21 +156,21 @@ This operation will **not** generate a collection. Instead, it will generate a s
So we have one BAM dataset combining everything we've done so far. Let's look at the contents of this dataset using a genome browser. First, we will need to downsample the dataset to avoid overwhelming the browser. For this we will use the `Downsample SAM/BAM` tool:
-
+
{: .img-responsive}
Set **Probability (between 0 and 1) that any given read will be kept** to roughly `5%` (or `0.05`) using the slider control:
-
+
This will generate another BAM dataset containing only 5% of the original reads and much smaller as a result. Click on this dataset and you will see links to various genome browsers:
-
+
{: .img-responsive}
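Conceptually, downsampling keeps each read independently with the chosen probability; a short sketch (the tool's exact sampling strategy may differ in detail):

```python
import random

# Sketch: keep each read independently with probability p (seeded for reproducibility).
def downsample(reads, p=0.05, seed=42):
    rng = random.Random(seed)
    return [r for r in reads if rng.random() < p]

reads = [f"read_{i}" for i in range(100_000)]
kept = downsample(reads, p=0.05)
print(f"kept {len(kept)} of {len(reads)} reads")   # roughly 5%
```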
Click the **Human hg38** link in the **display with IGV** line as highlighted above (to learn more about displaying Galaxy data in IGV, watch this [movie](https://vimeo.com/123442619#t=4m16s)). Below is an example generated with IGV on these data. In this screenshot reads are colored by read group (four distinct colors). A yellow inset displays additional information about a single read. One can see that this read corresponds to read group `M117-bl`.
-
+
{: .img-responsive}
## We did not fake this:
diff --git a/topics/usegalaxy/tutorials/dip/tutorial.md b/topics/usegalaxy/tutorials/dip/tutorial.md
index 25c1f002..4ae14459 100644
--- a/topics/usegalaxy/tutorials/dip/tutorial.md
+++ b/topics/usegalaxy/tutorials/dip/tutorial.md
@@ -36,16 +36,16 @@ MathJax.Hub.Config({
});
-Today we hear a lot about personalized medicine. Yet the *personalization* is defined by the genetic make up of the individual. Today we will discuss how this information can be uncovered from the genomic sequencing data. The figure above shows distribution of rare and common variants in 1,092 human genomes described by the [1000 Genome Consortium](http://www.nature.com/nature/journal/v491/n7422/abs/nature11632.html).
+Today we hear a lot about personalized medicine. Yet the *personalization* is defined by the genetic makeup of the individual. Today we will discuss how this information can be uncovered from genomic sequencing data. The figure above shows the distribution of rare and common variants in 1,092 human genomes described by the [1000 Genome Consortium](https://www.nature.com/nature/journal/v491/n7422/abs/nature11632.html).
# Calling variants
-Variant calling is a complex field that was significantly propelled by advances in DNA sequencing and efforts of large scientific consortia such as the [1000 Genomes](http://www.1000genomes.org). Here we summarize basic ideas central to Genotype and Variant calling. First, let's contrast the two things although they often go together:
+Variant calling is a complex field that was significantly propelled by advances in DNA sequencing and efforts of large scientific consortia such as the [1000 Genomes](https://www.1000genomes.org). Here we summarize basic ideas central to Genotype and Variant calling. First, let's contrast the two things although they often go together:
* **Variant calling** - identification of positions where the sequenced sample is different from the reference sequence (or [reference genome graph](https://github.com/vgteam/vg));
* **Genotype calling** - identifying individual's genotype at variable sites.
-A typical workflow for variation discovery involves the following steps (e.g., see Nielsen et al. [2011](http://www.nature.com/nrg/journal/v12/n6/full/nrg2986.html)):
+A typical workflow for variation discovery involves the following steps (e.g., see Nielsen et al. [2011](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3593722/)):
1. Mapping reads against the reference genome
2. Thresholding BAM datasets by, for example, retaining paired, properly mapped reads
@@ -55,7 +55,7 @@ A typical workflow for variation discovery involves the following steps (e.g., s
6. Performing filtering and genotype quality score recalibration
7. Annotating variants and performing downstream analyses
-However, continuing evolution of variant detection methods has made some of these steps obsolete. For instance, omitting quality score recalibration and re-alignment (steps 3 and 4 above) when using haplotype-aware variant callers such as [FreeBayes](https://github.com/ekg/freebayes) does not have an effect on the resulting calls (see Brad Chapman's methodological comparisons at [bcbio](http://bit.ly/1S9kFJN)). Before going forward with an actual genotype calling in Galaxy let's take a look as some basic ideas behind modern variant callers.
+However, the continuing evolution of variant detection methods has made some of these steps obsolete. For instance, omitting quality score recalibration and re-alignment (steps 3 and 4 above) when using haplotype-aware variant callers such as [FreeBayes](https://github.com/ekg/freebayes) does not have an effect on the resulting calls (see Brad Chapman's methodological comparisons at [bcbio](https://bit.ly/1S9kFJN)). Before going forward with actual genotype calling in Galaxy let's take a look at some basic ideas behind modern variant callers.
## How does SNP calling and genotyping work?
@@ -92,7 +92,7 @@ Now, the probability of having a variant and it being observed in our sequencing
| | | |
|:----------------------------------:|:----------------------------------:|:-----------------------------------:|
-|  |  |  |
+|  |  |  |
| $P(A)$
Polymorphisms | $P(B)$
Variant calls | $P(AB)$
Polymorphisms + Variant calls |
Now we can ask the following question: *What is the probability of having a real polymorphism* $A$ *given our observation of variants in reads* $B$? In other words *what is the probability of* $A$ *given* $B$? Or, as stated in the original [blog](https://oscarbonilla.com/2009/05/visualizing-bayes-theorem/): "*given that we are in region $B$ what is the probability that we are in the region $AB$*?":
@@ -128,10 +128,10 @@ In the simplest case we can estimate these as follows:
Suppose $S_i$ is a base in read $i$ corresponding to a genome position with genotype $G$. The probability of seeing $S_i$ given $G$, or $P(S_i|G)$, is given by the quality score of $S_i$ (the quality scores are given by base calling software and reported as [phred scores](https://en.wikipedia.org/wiki/Phred_quality_score)). Thus the genotype likelihood $P(S|G)$ is the product of $P(S_i|G)$ over all $i$. In reality, however, there are many other sources of uncertainty (in addition to base qualities) that are incorporated into the calculation of data likelihoods, including NGS technology-related issues, dependency of error rates on substitution type (e.g., transitions versus transversions), sequencing context, etc.
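For a homozygous genotype this product can be written out directly. The sketch below is deliberately simplified (it ignores the additional uncertainty sources just mentioned, and splits a wrong call evenly among the three other bases):

```python
# Simplified likelihood for a homozygous genotype: each matching base
# contributes (1 - e), each mismatching base e/3, where e is the
# error probability implied by the phred score.
def phred_to_error(q):
    return 10 ** (-q / 10)          # e.g. Q30 -> 0.001

def likelihood(bases, quals, allele):
    p = 1.0
    for base, q in zip(bases, quals):
        e = phred_to_error(q)
        p *= (1 - e) if base == allele else e / 3
    return p

# Five reads cover a site; one disagrees with the tested allele:
print(likelihood("GGGGT", [30] * 5, allele="G"))
print(likelihood("GGGGG", [30] * 5, allele="G"))   # the all-match likelihood is far higher
```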
### $P(G)$ - a single sample case
-One can assign an equal probability to all possible genotypes, or to source this information based on previously obtained knowledge containing in a database, such as [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/). In this case (as exemplified in [Nielsen et al. 2011](http://www.nature.com/nrg/journal/v12/n6/full/nrg2986.html)) we may, for instance, have a site with a **G/T** polymorphism and genotypes **GG**, **TT**, and **GT** having frequencies of 0.45, 0.45, 0.09, respectively. We will use these values as priors.
+One can assign an equal probability to all possible genotypes, or to source this information based on previously obtained knowledge contained in a database, such as [dbSNP](https://www.ncbi.nlm.nih.gov/SNP/). In this case (as exemplified in [Nielsen et al. 2011](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3593722/)) we may, for instance, have a site with a **G/T** polymorphism and genotypes **GG**, **TT**, and **GT** having frequencies of 0.45, 0.45, 0.09, respectively. We will use these values as priors.
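Combining such priors with data likelihoods is a direct application of Bayes' theorem. In the sketch below the priors are the frequencies quoted above, while the likelihood values are invented purely for illustration:

```python
# Bayes' theorem with the priors quoted above (GG: 0.45, TT: 0.45, GT: 0.09).
# The likelihood values are hypothetical, not taken from any real dataset.
priors = {"GG": 0.45, "TT": 0.45, "GT": 0.09}
likelihoods = {"GG": 1e-6, "TT": 1e-12, "GT": 1e-4}    # hypothetical P(S|G)

unnormalized = {g: priors[g] * likelihoods[g] for g in priors}
total = sum(unnormalized.values())                      # the normalizer P(S)
posterior = {g: unnormalized[g] / total for g in priors}

for g, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"P({g}|S) = {p:.4f}")
```

Even though **GT** has the smallest prior, its much larger likelihood makes it the most probable genotype.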
### $P(G)$ - a multi-sample case
-Genotype calling reliability can be significantly improved when analyzing multiple samples jointly. In this case genotype frequencies can be inferred from allele frequencies using Hardy-Weinberg equilibrium ([HWE](https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle)). The following example (again from [Nielsen et al. 2011](http://www.nature.com/nrg/journal/v12/n6/full/nrg2986.html)) illustrates this idea: suppose you are calling genotypes for a single individual using a combination of multiple samples. There are two genotypes, **AT** and **AA**, with equally large genotype likelihoods. If, however, in our collection of multiple samples the frequency of **A** is 1% ($p = 0.01$; $q = 1 - p = 0.99$), then from the HWE we have:
+Genotype calling reliability can be significantly improved when analyzing multiple samples jointly. In this case genotype frequencies can be inferred from allele frequencies using Hardy-Weinberg equilibrium ([HWE](https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle)). The following example (again from [Nielsen et al. 2011](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3593722/)) illustrates this idea: suppose you are calling genotypes for a single individual using a combination of multiple samples. There are two genotypes, **AT** and **AA**, with equally large genotype likelihoods. If, however, in our collection of multiple samples the frequency of **A** is 1% ($p = 0.01$; $q = 1 - p = 0.99$), then from the HWE we have:
| | | |
|---------|---------|--------|
@@ -149,25 +149,25 @@ This makes it highly unlikely that **AA** is a true genotype of this individual.
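The HWE arithmetic behind this conclusion is short enough to write out, using the allele frequency from the example above:

```python
# HWE genotype priors for allele frequency p(A) = 0.01:
p = 0.01
q = 1 - p
prior_AA = p * p        # homozygous A
prior_AT = 2 * p * q    # heterozygous
prior_TT = q * q        # homozygous T

print(f"P(AA) = {prior_AA:.6f}, P(AT) = {prior_AT:.6f}, P(TT) = {prior_TT:.6f}")
# With equal likelihoods for AA and AT, the posteriors are proportional to
# these priors, so AT is ~198 times more probable than AA:
print(prior_AT / prior_AA)
```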
* **Variant quality recalibration is avoided** by incorporating a number of metrics, such as read placement bias and allele balance, directly into the Bayesian model;
* **Ability to incorporate non-diploid cases** such as pooled datasets or data from polyploid samples.
-Freebayes is a *haplotype-based* variant caller. This implies that instead of looking at an individual positions within an alignment of reads to the reference genome, it looks at a haplotype window, length of which is dynamically determined (see section 3.2. in [FreeBayes manuscript](http://arxiv.org/pdf/1207.3907v2.pdf)):
+Freebayes is a *haplotype-based* variant caller. This implies that instead of looking at individual positions within an alignment of reads to the reference genome, it looks at a haplotype window whose length is dynamically determined (see section 3.2 in the [FreeBayes manuscript](https://arxiv.org/pdf/1207.3907v2.pdf)):
| |
|-------------------------------------------|
-|  |
+|  |
|Looking at a haplotype window makes misalignments tolerable. In this case a low complexity poly(A) stretch is misaligned. As a result, looking at individual positions will result in calling multiple spurious variants. In the case of FreeBayes, looking at a haplotype identifies two alleles (this is a diploid example), `A(7)` and `A(6)`, while `A(8)` is likely an error. Image by [Erik Garrison](https://github.com/ekg/freebayes)|
# Let's try it
## The data
-In this example we will perform variant calling and annotation using [genome in the bottle data](http://jimb.stanford.edu/giab/). Specifically, we will use Ashkenazim Father-Mother-Son trio data from the Personal Genome Project:
+In this example we will perform variant calling and annotation using [genome in the bottle data](https://jimb.stanford.edu/giab/). Specifically, we will use Ashkenazim Father-Mother-Son trio data from the Personal Genome Project:
* HG002 - NA24385 - huAA53E0 (son)
* HG003 - NA24149 - hu6E4515 (father)
* HG004 - NA24143 - hu8E87A9 (mother)
-Yet for a quick tutorial these datasets are way too big, so we created a downsampled (watered down) dataset. This dataset was produced by mapping the trio reads against the `hg19` version of the human genome, merging the resulting bam files together (we use readgroups to label individual reads so they can be traced to each of the original individuals), and restricting alignments to a small portion of chromosome 19 containing the [*POLRMT*](http://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=Graphics&list_uids=5442) gene.
+Yet for a quick tutorial these datasets are way too big, so we created a downsampled (watered down) dataset. This dataset was produced by mapping the trio reads against the `hg19` version of the human genome, merging the resulting bam files together (we use readgroups to label individual reads so they can be traced to each of the original individuals), and restricting alignments to a small portion of chromosome 19 containing the [*POLRMT*](https://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=Graphics&list_uids=5442) gene.
Here is what to do to load the data:
@@ -175,15 +175,15 @@ Here is what to do to load the data:
>
>Go to the [data library](https://usegalaxy.org/library/list#folders/F9ff2d127cd7ed6bc) and select both BAM and PED datasets. Then Click **to History** button:
>
->
+>
>
>Galaxy will ask you if you want to import these data into a new history, which you might want (in the case below I called this history `genotyping try`):
>
->
+>
>
>The datasets will appear in your history:
>
->
+>
>
{: .hands_on}
@@ -194,13 +194,13 @@ Here is what to do to load the data:
>Select **FreeBayes** from **NGS: Variant Analysis** section of the tool menu (left pane of Galaxy's interface).
>Make sure the top part of the interface looks as shown below. Here we selected `GIAB-Ashkenazim-Trio-hg19` as input and set **Using reference genome** to `hg19` and **Choose parameter selection level** to `5`. The interface should look like this:
>
->
+>
>
>Scrolling down to **Tweak algorithmic features?** click `Yes` and set **Calculate the marginal probability of genotypes and report as GQ in each sample field in the VCF output** to `Yes`. This will help us evaluate the quality of genotype calls.
>
->
+>
>
->Depending on how busy Galaxy is this may take a little bit of time (coffee break?). Eventially this will produce a dataset in [VCF](http://www.1000genomes.org/wiki/Analysis/variant-call-format) format containing 35 putative variants. Before we can continue we need to post-process this dataset by breaking compound variants into multiple independent variants with **VcfAllelicPrimitives** tool found within **NGS: VCF Manipulation** section. This is necessary for ensuring the smooth sailing through downstream analyses:
+>Depending on how busy Galaxy is this may take a little bit of time (coffee break?). Eventually this will produce a dataset in [VCF](https://www.1000genomes.org/wiki/Analysis/variant-call-format) format containing 35 putative variants. Before we can continue we need to post-process this dataset by breaking compound variants into multiple independent variants with the **VcfAllelicPrimitives** tool found within the **NGS: VCF Manipulation** section. This is necessary for ensuring smooth sailing through downstream analyses:
>
{: .hands_on}
@@ -208,7 +208,7 @@ Here is what to do to load the data:
>
>Select FreeBayes output as the input for this tool and make sure **Maintain site and allele-level annotations when decomposing** and **Maintain genotype-level annotations when decomposing** are set to `Yes`:
>
->
+>
>
{: .hands_on}
@@ -229,27 +229,27 @@ chr19 618854 . G A 81.7546
### Annotating variants with SnpEff
-At this point we are ready to begin annotating variants using [SnpEff](http://snpeff.sourceforge.net/SnpEff.html). SnpEff, a project maintained by [Pablo Cingolani](https://www.linkedin.com/in/pablocingolani) "*...annotates and predicts the effects of variants on genes (such as amino acid changes)...*" and so is critical for functional interpretation of variation data.
+At this point we are ready to begin annotating variants using [SnpEff](https://snpeff.sourceforge.net/SnpEff.html). SnpEff, a project maintained by [Pablo Cingolani](https://www.linkedin.com/in/pablocingolani) "*...annotates and predicts the effects of variants on genes (such as amino acid changes)...*" and so is critical for functional interpretation of variation data.
>### Running SNPeff
>
>Select **NGS: Variant Analysis** → **SnpEff**. Select the latest version of annotation database matching genome version against which reads were mapped and VCF produced. In this case it is `GRCh37.75: hg19`:
>
->
+>
>
>SnpEff will generate two outputs: (1) an annotated VCF file and (2) an HTML report. The report contains a number of useful metrics such as distribution of variants across gene features:
>
->
+>
>
>or changes to codons:
>
->
+>
>
{: .hands_on}
## Manipulating variation data with GEMINI
-Now that we have an annotated VCF file it is time to peek inside our variation data. [Aaron Quinlan](http://quinlanlab.org/), creator of [GEMINI](http://gemini.readthedocs.org/en/latest/index.html), calls it *Detective work*.
+Now that we have an annotated VCF file it is time to peek inside our variation data. [Aaron Quinlan](https://quinlanlab.org/), creator of [GEMINI](https://gemini.readthedocs.org/en/latest/index.html), calls it *Detective work*.
### Loading data into GEMINI
@@ -267,11 +267,11 @@ The first step is to convert a VCF file we would like to analyze into a GEMINI d
>
>So let's load data into GEMINI. Set VCF and PED inputs:
>
->
+>
>
>This creates a sqlite database. To see the content of the database use **GEMINI_db_info**:
>
->
+>
>
>This produces a list of [all tables and fields](https://github.com/nekrut/galaxy/wiki/datasets/gemini_tables.txt) in the database.
>
@@ -279,9 +279,9 @@ The first step is to convert a VCF file we would like to analyze into a GEMINI d
### Querying GEMINI database
-GEMINI database is queried using the versatile SQL language (more on SQL [here](http://swcarpentry.github.io/sql-novice-survey)). In Galaxy's version of GEMINI this is done using **GEMINI_query** tool. Within this tool SQL commands are typed directly into the **The query to be issued to the database** text box. Let's begin getting information from some of the tables we discovered with **GEMINI_db_info** tool above.
+GEMINI database is queried using the versatile SQL language (more on SQL [here](https://swcarpentry.github.io/sql-novice-survey)). In Galaxy's version of GEMINI this is done using **GEMINI_query** tool. Within this tool SQL commands are typed directly into the **The query to be issued to the database** text box. Let's begin getting information from some of the tables we discovered with **GEMINI_db_info** tool above.
-The examples below are taken from "[Intro to Gemini](https://s3.amazonaws.com/gemini-tutorials/Intro-To-Gemini.pdf)" tutorial. For extensive documentation see "[Querying GEMINI](http://gemini.readthedocs.org/en/latest/content/querying.html)".
+The examples below are taken from "[Intro to Gemini](https://s3.amazonaws.com/gemini-tutorials/Intro-To-Gemini.pdf)" tutorial. For extensive documentation see "[Querying GEMINI](https://gemini.readthedocs.org/en/latest/content/querying.html)".
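Because GEMINI databases are plain SQLite, the query pattern can be reproduced locally. The toy `variants` table below uses column names echoing the tutorial's examples and invented data; the real schema is far richer (see the GEMINI documentation):

```python
import sqlite3

# Toy "variants" table; the column names mirror the tutorial's queries and the
# rows are invented for illustration (GEMINI's real schema has many more columns).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE variants (chrom TEXT, start INT, ref TEXT, alt TEXT, gene TEXT)")
db.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
    [("chr19", 618854, "G", "A", "POLRMT"),
     ("chr19", 620000, "C", "T", "POLRMT"),
     ("chr19", 700000, "A", "G", "OTHER")],
)

# The same kind of SQL you would type into GEMINI_query's text box:
rows = db.execute(
    "SELECT chrom, start, ref, alt FROM variants WHERE gene = 'POLRMT'"
).fetchall()
print(rows)
```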
> ### Are there "novel" varinats that are not annotated in dbSNP database?
>
@@ -294,7 +294,7 @@ The examples below are taken from "[Intro to Gemini](https://s3.amazonaws.com/ge
>
>into **The query to be issued to the database** field of the interface:
>
->
+>
>
>As we can see from [output (Click this link to see it)](https://usegalaxy.org/datasets/bbd44e69cb8906b51bb37b9032761321/display/?preview=True) there are 21 variants that are not annotated in dbSNP.
>
@@ -314,7 +314,7 @@ The examples below are taken from "[Intro to Gemini](https://s3.amazonaws.com/ge
>SELECT rs_ids, aaf_esp_ea, impact, clinvar_disease_name, clinvar_sig FROM variants WHERE filter is NULL and gene = 'POLRMT'
>```
>
->(column definitions can be found [here](http://gemini.readthedocs.org/en/latest/content/database_schema.html))
+>(column definitions can be found [here](https://gemini.readthedocs.org/en/latest/content/database_schema.html))
>
>[Output](https://usegalaxy.org/datasets/bbd44e69cb8906b540d65297cd1d26bb/display/?preview=True) shows variants found within the *POLRMT* gene.
>
@@ -345,7 +345,7 @@ GEMINI provides access to genotype, sequencing depth, genotype quality, and geno
>gt_types.HG002_NA24385_son <> HOM_REF
>```
>
->
+>
>
>This produces [a list of sites](https://usegalaxy.org/datasets/bbd44e69cb8906b560921700703d0255/display/?preview=True)
>
@@ -423,7 +423,7 @@ Let's try a few examples.
>
> * `(gt_types).(*).(==HET).(all)`
>
->the [all operator](http://gemini.readthedocs.org/en/latest/content/querying.html#the-all-operator) implies that want results for **all** afftected individuals). Output will look like [this](https://usegalaxy.org/datasets/bbd44e69cb8906b5819e1404b5e127d1/display/?preview=True).
+>the [all operator](https://gemini.readthedocs.org/en/latest/content/querying.html#the-all-operator) implies that we want results for **all** affected individuals). Output will look like [this](https://usegalaxy.org/datasets/bbd44e69cb8906b5819e1404b5e127d1/display/?preview=True).
>
{: .question}
@@ -444,18 +444,18 @@ This short tutorial should give you an overall idea on how generate variant data
* Right click on **Galaxy history** link and open Galaxy history in another new browser tab
* When Galaxy history interface opens you will need to click **Import history** link highlighted within a red outline in the following figure:
-
+
* If you have a wide screen arrange browsers tabs side by side:
-
+
* Proceed with tutorial. For example, to repeat the following command from GEMINI tutorial:
-
+
* Use Galaxy's **GEMINI_load** tool:
-
+
* and so on....
diff --git a/topics/usegalaxy/tutorials/dunovo/tutorial.md b/topics/usegalaxy/tutorials/dunovo/tutorial.md
index ad7827d2..a78156ba 100644
--- a/topics/usegalaxy/tutorials/dunovo/tutorial.md
+++ b/topics/usegalaxy/tutorials/dunovo/tutorial.md
@@ -4,25 +4,25 @@ topic_name: usegalaxy
tutorial_name: dunovo
---
-This page explains how to perform discovery of low frequency variants from duplex sequencing data. As an example we use the _ABL1_ dataset published by [Schmitt and colleagues](http://www.nature.com/nmeth/journal/v12/n5/full/nmeth.3351.html) (SRA accession [SRR1799908](http://www.ncbi.nlm.nih.gov/sra/?term=SRR1799908)).
+This page explains how to perform discovery of low frequency variants from duplex sequencing data. As an example we use the _ABL1_ dataset published by [Schmitt and colleagues](https://www.ncbi.nlm.nih.gov/pubmed/25849638) (SRA accession [SRR1799908](https://www.ncbi.nlm.nih.gov/sra/?term=SRR1799908)).
# Background
-Calling low frequency variants from next generation sequencing (NGS) data is challenging due to significant amount of noise characteristic of these technologies. [Duplex sequencing](http://www.pnas.org/content/109/36/14508.short) (DS) was designed to address this problem by increasing sequencing accuracy by over four orders of magnitude. DS uses randomly generated barcodes to uniquely tag each molecule in a sample. The tagged fragments are then PCR amplified prior to the preparation of a sequencing library, creating fragment families characterized by unique combination of barcodes at both 5’ and 3’ ends:
+Calling low frequency variants from next generation sequencing (NGS) data is challenging due to the significant amount of noise characteristic of these technologies. [Duplex sequencing](https://www.pnas.org/content/109/36/14508.short) (DS) was designed to address this problem by increasing sequencing accuracy by over four orders of magnitude. DS uses randomly generated barcodes to uniquely tag each molecule in a sample. The tagged fragments are then PCR amplified prior to the preparation of a sequencing library, creating fragment families characterized by a unique combination of barcodes at both 5’ and 3’ ends:
->[](http://www.pnas.org/content/109/36/14508/F1.expansion.html)
+>[](https://www.pnas.org/content/109/36/14508/F1.expansion.html)
>
->The logic of duplex sequencing. From [Schmitt:2012](http://www.pnas.org/content/109/36/14508.short).
+>The logic of duplex sequencing. From [Schmitt:2012](https://www.pnas.org/content/109/36/14508.short).
The computational analysis of DS data (Part `C` in the figure above) produces two kinds of output:
* Single Strand Consensus Sequences (SSCS; panel `iv` in the figure above);
* Duplex Consensus Sequences (DCS; panel `v` in the figure above).
-The DCSs have the ultimate accuracy, yet the SSCSs can also be very useful when ampliconic DNA is used as an input to a DS experiment. Let us illustrate the utility of SSCSs with the following example. Suppose one is interested in quantifying variants in a virus that has a very low titer in body fluids. Since DS procedure requires a substantial amount of starting DNA (between [between 0.2 and 3 micrograms](http://nature.com/nprot/journal/v9/n11/full/nprot.2014.170.html)) the virus needs to be enriched. This can be done, for example, with a PCR designed to amplify the entire genome of the virus. Yet the problem is that during the amplification heterologous strands will almost certainly realign to some extent forming hetoroduplex molecules:
+The DCSs have the ultimate accuracy, yet the SSCSs can also be very useful when ampliconic DNA is used as an input to a DS experiment. Let us illustrate the utility of SSCSs with the following example. Suppose one is interested in quantifying variants in a virus that has a very low titer in body fluids. Since the DS procedure requires a substantial amount of starting DNA ([between 0.2 and 3 micrograms](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4271547/)) the virus needs to be enriched. This can be done, for example, with a PCR designed to amplify the entire genome of the virus. Yet the problem is that during the amplification heterologous strands will almost certainly realign to some extent, forming heteroduplex molecules:
->
+>
>
>Heteroduplex formation in ampliconic templates. Image by Barbara Arbeithuber from [Stoler:2016](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1039-4). Here there are two distinct types of viral genomes: carrying `A` and `G`. Because the population of genomes is enriched via PCR, heteroduplex formation takes place, skewing frequency estimates performed using DCSs.
@@ -32,9 +32,9 @@ In the image above there are two alleles: green (A) and red (G). After PCR a fra
The entire analysis described here is accessible as a [Galaxy history](https://usegalaxy.org/u/aun1/h/duplex-analysis-abl1) (by clicking on this link you can create your own copy and play with it).
->
+>
>
->Each history item has a Rerun  button. Clicking this button will show you how this tool was run with all parameters filled in exactly.
+>Each history item has a Rerun  button. Clicking this button will show you how this tool was run with all parameters filled in exactly.
This analysis (and consequently the Galaxy history) can be divided into three parts:
1. Consensus generation from initial sequencing reads;
@@ -42,7 +42,7 @@ This analysis (and consequently the Galaxy's history) can be divided into three
3. Analysis of Single Strand Consensus Sequences (SSCS):
->
+>
>
>Analysis outline
@@ -52,11 +52,11 @@ The starting point of the analyses are sequencing reads (usually in [fastq](http
## Getting data in and assessing quality
-We uploaded [Schmitt:2015](http://www.nature.com/nmeth/journal/v12/n5/full/nmeth.3351.html)) data directly from SRA as shown in [this screencast](https://vimeo.com/121187220). This created two datasets in our galaxy history: one for forward reads and one for reverse. We then evaluated the quality of the data by running FastQC on both datasets (forward and reverse) to obtain the following plots:
+We uploaded [Schmitt:2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4414912/) data directly from SRA as shown in [this screencast](https://vimeo.com/121187220). This created two datasets in our Galaxy history: one for forward reads and one for reverse. We then evaluated the quality of the data by running FastQC on both datasets (forward and reverse) to obtain the following plots:
|
:--|:---
- | 
+ | 
**A**. Forward | **B**. Reverse
One can see that these data are of excellent quality and no additional processing is required before we can start the actual analysis.
@@ -71,7 +71,7 @@ From tool section **NGS: Du Novo** we ran:
This is the exact image of the **Make consensus reads** interface:
->
+>
>
>Making DCS and SSCS. **Note** that **Output single-strand consensus sequences** is set to `Yes`. [Above](#background) we explained why single-strand consensus sequences (SSCS) may be important in some applications. [Below](#analysis-of-single-strand-consensus-data) we show how they can be used.
@@ -81,7 +81,7 @@ This is the exact image of the **Make consensus reads** interface:
The _Du Novo_ algorithm occasionally inserts `N` and/or [IUPAC notations](https://en.wikipedia.org/wiki/Nucleic_acid_notation) at sites where a definitive base cannot be identified according to the majority-rule consensus. However, we do not want such bases when we call variants. The tool **Sequence Content Trimmer** will help with filtering these out. Here are the parameters we used:
->
+>
>
>Sequence Content Trimmer settings, where:
- `Paired reads = Paired` (because DCSs are reported as forward and reverse)
- `Bases to filter on = NRYSWKMBDHV` (all ambiguous nucleotides)
- `Frequency threshold = 0.2` (A window /see the next parameter below/ cannot have more than 20% of ambiguous bases)
- `Size of the window = 10` (size of the sliding window)
- `Invert filter bases = No`
- `Set a minimum read length = 50` (We do not want _very_ short reads)
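The windowed filter described by these settings can be sketched in a few lines (an illustrative re-implementation under the parameters above, not the Sequence Content Trimmer's actual code; the function name is invented):

```python
# Illustrative sketch of windowed ambiguous-base trimming, mirroring the
# Sequence Content Trimmer settings above (not the tool's actual code).
AMBIGUOUS = set("NRYSWKMBDHV")  # bases to filter on

def trim_ambiguous(seq, threshold=0.2, window=10, min_len=50):
    """Truncate seq at the first window whose fraction of ambiguous
    bases exceeds threshold; return None if the result is too short."""
    for start in range(max(1, len(seq) - window + 1)):
        win = seq[start:start + window]
        if not win:
            break
        if sum(b in AMBIGUOUS for b in win) / len(win) > threshold:
            seq = seq[:start]
            break
    return seq if len(seq) >= min_len else None

print(trim_ambiguous("ACGT" * 20))            # clean read is kept as-is
print(trim_ambiguous("ACGT" * 5 + "N" * 40))  # trimmed below 50 bp -> None
```

Paired-mode handling (keeping forward and reverse reads in sync) is omitted here for brevity.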
@@ -89,9 +89,9 @@ The _Du Novo_ algorithm occasionally inserts`N`and/or [IUPAC notations](https://
[The previous step](#filtering-consensuses) filters forward and reverse DCSs and reports them in [FASTA](https://en.wikipedia.org/wiki/FASTA_format) format. Yet the downstream tools require [fastq](https://en.wikipedia.org/wiki/FASTQ_format) format. To address this we convert FASTA into fastq using **Combine FASTA and QUAL** from tool section **NGS: QC and manipulation**. In this case the quality values are filled in with the maximum allowed value of 93 (essentially we fake them here), which is fine as we will not rely on quality scores in the rest of the analysis.
->
+>
>
->Combine FASTA and QUAL. **Note** that here two datasets (#8 and #9) are selected simultaneously because we clicked the multiple datasets button the left of the **FASTA File** dropdown:

+>Combine FASTA and QUAL. **Note** that here two datasets (#8 and #9) are selected simultaneously because we clicked the multiple datasets button to the left of the **FASTA File** dropdown:

## Calling variants
@@ -106,13 +106,13 @@ At this point we have trimmed DCSs in fastq format. We can now proceed to callin
Here we use two mappers for added reliability (this is not necessary in most situations as long as you use the right mapper for input data). To differentiate between results produced by each mapper we assign readgroups (this is done by clicking on **Set read groups information** dropdown). For example, for **BWA-MEM** you would set parameters like this:
->
+>
>
>Running BWA-MEM. **Note** that we are comparing DCSs against human genome version `hg38` and use the forward and reverse DCSs as the `first` and `second` set of reads. The readgroup **SM** and **ID** tags are set to `bwa-mem`.
We then repeat essentially the same with **BWA**:
->
+>
>
>Running BWA. **Note** here we use `bwa` as the readgroup **ID** and **SM** tags.
@@ -120,7 +120,7 @@ We then repeat essentially the same with **BWA**:
Since we have used two mappers we have two BAM datasets. Yet because we have set readgroups we can now merge them into a single BAM dataset: the individual reads will be labelled with readgroups (you will see how this helps later). To merge we use **MergeSamFiles** from tool section **NGS: Picard**:
->
+>
>
>Merging BAM datasets.
@@ -128,7 +128,7 @@ Since we have used two mappers - we have two BAM datasets. Yet because we have s
To normalize the positional distribution of indels we use the **Left Align** utility (**NGS: Variant Analysis**) from the [FreeBayes](https://github.com/ekg/freebayes#indels) package. This is necessary to avoid erroneous polymorphisms flanking regions with indels (e.g., in low complexity loci):
->
+>
>
>Left aligning indels. **Note** that here we use `hg38` as well. Obviously, one must use the same genome build you have aligned against with **BWA-MEM** and **BWA**.
@@ -136,21 +136,21 @@ To normalize the positional distribution of indels we use **Left Align** utility
To identify sites containing variants we use the **Naive Variant Caller (NVC)** (tool section **NGS: Variant Analysis**), which produces a simple count of differences given coverage and base quality per site (remember that our qualities were "faked" during the conversion from FASTA to fastq and cannot be used here). So in the case of _ABL1_ we set parameters as follows:
->
+>
>
>Finding variants with NVC. Here:
- `Using reference genome = hg38` (As mentioned above, this needs to be set to the same genome one has mapped against.)
- `Restrict to regions: Chromosome = chr9` (_ABL1_ is on chromosome 9. We set this to prevent **NVC** from wandering across the genome to save time.)
- `Minimum number of reads needed to consider a REF/ALT = 0` (Trying to maximize the number of sites. We can filter later.)
- `Minimum base quality = 20` (This is the default and is irrelevant here because quality scores were "faked" during the conversion from FASTA to fastq.)
- `Minimum mapping quality = 20` (This is helpful because it prevents reads mapping to multiple locations from being included in the tabulation. Such reads will have mapping quality of 0.)
- `Ploidy = 1` (Ploidy is irrelevant here as it is a mixture of multiple genomes)
- `Only write out positions with possible alternate alleles = No` (We can filter later)
- `Report counts by strand = Yes` (This will be helpful to gauge the strand bias).
The **NVC** generates a [VCF](https://en.wikipedia.org/wiki/Variant_Call_Format) file that can be viewed at genome browsers such as [IGV](https://www.broadinstitute.org/igv/). Yet one rarely finds variants by looking at genome browsers. The next step is to generate a tab-delimited dataset of nucleotide counts using **Variant Annotator** from tool section **NGS: Variant Analysis**. We ran it with the following parameters:
->
+>
>
>Annotating variable sites. Here `Coverage threshold = 10` (To reduce noise) and `Output stranded base counts = Yes` (to see strand bias)
There are 3,264 lines in the output, which is clearly too much. So we use the **Filter** tool (tool section **Filter and Sort**) with the expression `c16 >= 0.01` (column 16 contains the minor allele frequency, MAF, and we are interested in those sites where MAF >= 1%):
->
+>
>
>Filtering variable sites.
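Outside Galaxy, the same `c16 >= 0.01` filter could be sketched like this (illustrative only; Galaxy columns are 1-based, so `c16` is Python index 15, and the sample rows below are made up):

```python
# Rough stand-in for Galaxy's Filter tool with expression `c16 >= 0.01`
# on tab-delimited Variant Annotator output (column 16 = MAF, 1-based).
def filter_by_maf(lines, column=16, threshold=0.01):
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            maf = float(fields[column - 1])
        except (IndexError, ValueError):
            continue  # skip header or malformed lines
        if maf >= threshold:
            kept.append(line)
    return kept

# Two made-up rows with the MAF value in column 16:
rows = ["chr9\t130872141" + "\tx" * 13 + "\t0.013",
        "chr9\t130880000" + "\tx" * 13 + "\t0.001"]
print(filter_by_maf(rows))  # only the 1.3% site survives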
@@ -165,7 +165,7 @@ bwa-mem | 130,880,141 | A | G | 0.479 |
We can see that the results of both mappers agree very well. The reason we see these numbers grouped by mapper is that we have set the readgroups while [mapping](#align-against-genome-with-bwa-and-bwa-mem).
-The polymorphism we are interested in (and the one reported by [Schmitt:2015] (http://www.nature.com/nmeth/journal/v12/n5/full/nmeth.3351.html)) is at the position 130,872,141 and has a frequency of 1.3%. The other site (position 130,880,141) is a known common variant [rs2227985](http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?type=rs&rs=rs2227985), which is heterozygous in this sample.
+The polymorphism we are interested in (and the one reported by [Schmitt:2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4414912/)) is at position 130,872,141 and has a frequency of 1.3%. The other site (position 130,880,141) is a known common variant [rs2227985](https://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?type=rs&rs=rs2227985), which is heterozygous in this sample.
# Analysis of single strand consensus data
@@ -186,12 +186,12 @@ The analysis described above can be rerun using a workflow. Workflow combined al
* _Du Novo_ analysis from reads (import from [here](https://usegalaxy.org/u/aun1/w/duplex-analysis-from-reads)). This workflow uses fastq reads as input. It should be used if you are analyzing the data for the first time.
* _Du Novo_ analysis from aligned families (import from [here](https://usegalaxy.org/u/aun1/w/copy-of-duplex-analysis-from-reads)). This workflow starts with aligned families. It should be used for re-analysis of already generated DCS and SSCS data.
->[](https://galaxyproject.org/duplex/fromReads.png)
+>[](https://galaxyproject.org/duplex/fromReads.png)
>
>Starting from Reads
->[](https://galaxyproject.org/duplex/fromDCS.png)
+>[](https://galaxyproject.org/duplex/fromDCS.png)
>
>Starting from DCS/SSCS data
diff --git a/topics/usegalaxy/tutorials/history/tutorial.md b/topics/usegalaxy/tutorials/history/tutorial.md
index 00ba0e35..c71b50d8 100644
--- a/topics/usegalaxy/tutorials/history/tutorial.md
+++ b/topics/usegalaxy/tutorials/history/tutorial.md
@@ -31,7 +31,7 @@ clear you sessions - that history will be lost!** We can not recover it for you
### Current history controls
-
+
Above the current history panel are three buttons: the refresh, history options, and 'view all histories' buttons.
@@ -42,7 +42,7 @@ The history options button opens the history options menu which allows you to pe
The 'view all histories' button sends you to the interface for
-[managing multiple histories](../images/index.md#managing_multiple_histories).
+[managing multiple histories](../../images/index.md#managing_multiple_histories).
## History Information
@@ -58,7 +58,7 @@ All histories begin with the name 'Unnamed history'. Non-anonymous users can ren
3. Press 'Enter' to save the new name. The input field will disappear and the new name will display.
4. To cancel renaming, press 'Esc' or click outside the input field.
-
+
### Tagging a history
@@ -76,7 +76,7 @@ To tag a history:
3. Press enter or select one of the previous tags with your arrow keys or mouse.
4. To remove an existing tag, click the small 'X' on the tag or use the backspace key while in the input field.
-
+
### Annotating a history
@@ -94,7 +94,7 @@ To annotate a history:
entered since the 'Tab' button is used to switch between controls on the page - tabs can be pasted in however).
4. To save the annotation, click the 'Done' button.
-
+
### History size
@@ -122,7 +122,7 @@ There are several different 'states' a dataset can be in:
1. If a previously running or queued job has been paused by Galaxy, the dataset will be in the **paused** state.
You can re-start/resume paused jobs using the options menu above the history panel and selecting 'Resume Paused Jobs'.
-
+
Datasets in the panel are initially shown in a 'summary' view that only displays:
@@ -137,7 +137,7 @@ Datasets in the panel are initially shown in a 'summary' view, that only display
action. For example, the 'edit' button is disabled for datasets that are still queued or running.
{: .alert .alert-warning}
-
+
You can click the dataset name and the view will expand to show more details:
@@ -148,7 +148,7 @@ You can click the dataset name and the view will expand to show more details:
1. a row of buttons that allow further actions on the dataset
1. a **peek** of the data: a couple of rows of data with the column headers (if available)
-
+
**Note:** many of these details are only displayed if the dataset has finished running, is in the 'ok' state, and
@@ -167,7 +167,7 @@ history has hidden datasets, the number will appear there (e.g. '3 hidden') as a
the hidden datasets are shown. Each hidden dataset has a link in the top of the summary view that allows you to unhide
it. You can click that link again (which will now be 'hide hidden') to hide them again.
-
+
### Deleting and undeleting datasets
@@ -178,7 +178,7 @@ link. Clicking this link (e.g. '3 deleted') will make the deleted datasets visib
link for manually undeleting it above its title. You can click that link again (which will now be 'hide deleted') to
hide them again.
-
+
### Purging datasets and removing them permanently from Galaxy
@@ -211,7 +211,7 @@ You can also hide, delete, and purge multiple datasets at once by **multi-select
an action doesn't apply to a selected dataset - like deleting a deleted dataset - nothing will happen.)
1. You can click the multiselect button again to hide the checkboxes again.
-
+
### Searching for datasets
@@ -234,7 +234,7 @@ For example:
**Note:** searches are case-insensitive. For example, `VCF` and `vcf` are equivalent.
{: .alert .alert-warning}
-
+
### Clearing a search
@@ -263,7 +263,7 @@ You can enclose text and include spaces using double quotes: `name="My Dataset"
If you find normal searching is showing too many datasets, and not what you're looking for, try the advanced keyword
search.
-
+
### Search and multiselect
@@ -293,22 +293,22 @@ history method is presented here:
Click the multi-history icon at the top right of the 'Analyze Data' (home) page. Note: you must be logged in to
see the icon and use the multi-history page. You should see all the (non-deleted) histories that you've created.
-
+
Click the '...' icon button in the grey header at the top of the page. You should see a dialog that presents some options for viewing the histories. Click the 'include deleted histories' option.
-
+
The page should reload and now both non-deleted and deleted histories will be displayed. Deleted histories will
have a small message under their titles stating 'This history has been deleted'.
-
+
Now click the small button with the down arrow just above the deleted history you want to undelete. Then click
the 'Undelete' option there. Your history should now be undeleted.
-
+
Click the 'Switch to' button at the top of that history and then click 'done' at the very top left to return to
the 'Analyze Data' page.
-
+
## Dataset Collections
@@ -342,7 +342,7 @@ the forward reads and one file contains the reverse reads. Many bioinformatic to
further simplify this by placing both files into one 'Dataset Pair' collection. Only two files will be added to the
collection: forward and reverse.
-
+
### Dataset list
@@ -350,7 +350,7 @@ Choose 'Dataset List' when you have a set of files that are of the same type and
analysis. The datasets in a dataset list must have unique names (e.g. you cannot have two datasets in a dataset list
with the name '1.bed').
-
+
### List of dataset pairs
@@ -359,4 +359,4 @@ to create this is currently the most flexible and potentially most complicated.
datasets sent to the interface based on the dataset names. You are free to select your own pairs, however, and change
the order of the collection. Click the help text at the top of the interface to see more information.
-
+
diff --git a/topics/usegalaxy/tutorials/ngs/tutorial.md b/topics/usegalaxy/tutorials/ngs/tutorial.md
index bf15697a..7b13fd2e 100644
--- a/topics/usegalaxy/tutorials/ngs/tutorial.md
+++ b/topics/usegalaxy/tutorials/ngs/tutorial.md
@@ -42,10 +42,10 @@ Finally, datasets can be uploaded directly from NCBI's short read archive:
### Try it yourself
-- Create a new Galaxy history at http://usegalaxy.org (don't forget to log in).
+- Create a new Galaxy history at https://usegalaxy.org (don't forget to log in).
- Import the following two datasets (for help see the above video):
- - [A set of Forward reads](http://www.bx.psu.edu/~anton/share/ng_test_data/var/raw_mother-ds-1.fq.gz)
- - [A set of Reverse reads](http://www.bx.psu.edu/~anton/share/ng_test_data/var/raw_mother-ds-2.fq.gz)
+ - [A set of Forward reads](https://www.bx.psu.edu/~anton/share/ng_test_data/var/raw_mother-ds-1.fq.gz)
+ - [A set of Reverse reads](https://www.bx.psu.edu/~anton/share/ng_test_data/var/raw_mother-ds-2.fq.gz)
These are paired-end data (see below for an explanation of what paired-end means) from a single Illumina run. Keep the Galaxy history for later. We will need it again in a few minutes.
@@ -53,7 +53,7 @@ These are paired end data (see below for explanation of what paired-end is) for
## What is Fastq?
-[FastQ](http://en.wikipedia.org/wiki/FASTQ_format) is not a very well defined format. In the beginning various manufacturers of sequencing instruments were free to interpret fastq as they saw fit, resulting in a multitude of fastq flavors. This variation stemmed primarily from different ways of encoding quality values as described [here](http://en.wikipedia.org/wiki/FASTQ_format) (below you will explanation of quality scores and their meaning). Today, [fastq Sanger](http://www.ncbi.nlm.nih.gov/pubmed/20015970) version of the format is considered to be the standard form of fastq. Galaxy is using fastq sanger as the only legitimate input for downstream processing tools and provides [a number of utilities for converting fastq files](http://www.ncbi.nlm.nih.gov/pubmed/20562416) into this form (see **NGS: QC and manipulation** section of Galaxy tools).
+[FastQ](https://en.wikipedia.org/wiki/FASTQ_format) is not a very well defined format. In the beginning various manufacturers of sequencing instruments were free to interpret fastq as they saw fit, resulting in a multitude of fastq flavors. This variation stemmed primarily from different ways of encoding quality values as described [here](https://en.wikipedia.org/wiki/FASTQ_format) (below you will find an explanation of quality scores and their meaning). Today, the [fastq Sanger](https://www.ncbi.nlm.nih.gov/pubmed/20015970) version of the format is considered to be the standard form of fastq. Galaxy uses fastq sanger as the only legitimate input for downstream processing tools and provides [a number of utilities for converting fastq files](https://www.ncbi.nlm.nih.gov/pubmed/20562416) into this form (see the **NGS: QC and manipulation** section of Galaxy tools).
Fastq format looks like this:
@@ -88,7 +88,7 @@ It is common to prepare pair-end and mate-pair sequencing libraries. This is hig
| |
|----|
-|  |
+|  |
|**Paired-end and mate-pair reads**. In paired end sequencing (left) the actual ends of rather short DNA molecules (less than 1kb) are determined, while for mate pair sequencing (right) the ends of long molecules are joined and prepared in special sequencing libraries. In these mate pair protocols, the ends of long, size-selected molecules are connected with an internal adapter sequence (i.e. linker, yellow) in a circularization reaction. The circular molecule is then processed using restriction enzymes or fragmentation. Fragments are enriched for the linker and outer library adapters are added around the two combined molecule ends. The internal adapter can then be used as a second priming site for an additional sequencing reaction in the same orientation or sequencing can be performed from the second adapter, from the reverse strand. (From Ph.D. dissertation by [Martin Kircher](https://core.ac.uk/download/pdf/35186947.pdf))|
@@ -168,9 +168,9 @@ The base qualities allow us to judge how trustworthy each base in a sequencing r
-Illumina sequencing is based on identifying the individual nucleotides by the fluorescence signal emitted upon their incorporation into the growing sequencing read. Once the fluorescence intensities are extracted and translated into the four letter code. The deduction of nucleotide sequences from the images acquired during sequencing is commonly referred to as base calling. Due to the imperfect nature of the sequencing process and limitations of the optical instruments, base calling will always have inherent uncertainty. This is the reason why FASTQ files store the DNA sequence of each read together with a position-specific quality score that represents the error probability, i.e., how likely it is that an individual base call may be incorrect. The score is called [Phred score](http://www.phrap.com/phred/), $Q$, which is proportional to the probability $p$ that a base call is incorrect, where $Q = −10lg(p)$. For example, a Phred score of 10 corresponds to one error in every ten base calls ($Q = −10lg(0.1)$), or 90% accuracy; a Phred score of 20 corresponds to one error in every 100 base calls, or 99% accuracy. A higher Phred score thus reflects higher confidence in the reported base. To assign each base a unique score identifier (instead of numbers of varying character length), Phred scores are typically represented as ASCII characters. At http://ascii-code.com/ you can see which characters are assigned to what number. For raw reads, the range of scores will depend on the sequencing technology and the base caller used (Illumina, for example, used a tool called Bustard, or, more recently, RTA). Unfortunately, Illumina has been anything but consistent in how they calculated and ASCII-encoded the Phred score (see below)! In addition, Illumina now allows Phred scores for base calls with as high as 45, while 41 used to be the maximum score until the HiSeq X. This may cause issues with downstream sapplications that expect an upper limit of 41.
+Illumina sequencing is based on identifying the individual nucleotides by the fluorescence signal emitted upon their incorporation into the growing sequencing read. Once the fluorescence intensities are extracted, they are translated into the four-letter code. The deduction of nucleotide sequences from the images acquired during sequencing is commonly referred to as base calling. Due to the imperfect nature of the sequencing process and limitations of the optical instruments, base calling will always have inherent uncertainty. This is the reason why FASTQ files store the DNA sequence of each read together with a position-specific quality score that represents the error probability, i.e., how likely it is that an individual base call may be incorrect. The score is called the [Phred score](https://www.phrap.com/phred/), $Q$, which is proportional to the probability $p$ that a base call is incorrect, where $Q = −10lg(p)$. For example, a Phred score of 10 corresponds to one error in every ten base calls ($Q = −10lg(0.1)$), or 90% accuracy; a Phred score of 20 corresponds to one error in every 100 base calls, or 99% accuracy. A higher Phred score thus reflects higher confidence in the reported base. To assign each base a unique score identifier (instead of numbers of varying character length), Phred scores are typically represented as ASCII characters. At http://ascii-code.com/ you can see which characters are assigned to what number. For raw reads, the range of scores will depend on the sequencing technology and the base caller used (Illumina, for example, used a tool called Bustard, or, more recently, RTA). Unfortunately, Illumina has been anything but consistent in how they calculated and ASCII-encoded the Phred score (see below)! In addition, Illumina now allows Phred scores for base calls as high as 45, while 41 used to be the maximum score until the HiSeq X. This may cause issues with downstream applications that expect an upper limit of 41.
-
+
@@ -180,19 +180,19 @@ Sanger/Phred format that is also used by other sequencing platforms and the sequ
| |
|--------------------------------------------------------------|
-|  |
+|  |
| The ASCII interpretation and ranges of the different Phred score notations used by Illumina and the original Sanger interpretation. Although the Sanger format allows a theoretical score of 93, raw sequencing reads typically do not exceed a Phred score of 60. In fact, most Illumina-based sequencing will result in maximum scores of 41 to 45 (image from [Wikipedia](https://en.wikipedia.org/wiki/FASTQ_format)) |
## Assessing data quality
-One of the first steps in the analysis of NGS data is seeing how good the data actually is. [FastqQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a fantastic tool allowing you to gauge the quality of fastq datasets (and deciding whether to blame or not to blame whoever has done sequencing for you).
+One of the first steps in the analysis of NGS data is seeing how good the data actually is. [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a fantastic tool allowing you to gauge the quality of fastq datasets (and to decide whether or not to blame whoever has done the sequencing for you).
| | |
|:---------------------------------------|:-----------------------------------|
-|  |  |
+|  |  |
|**A.** Excellent quality | **B.** Hmmm...OK |
-Here you can see FastQC base quality reports (the tools gives you many other types of data) for two datasets: **A** and **B**. The **A** dataset has long reads (250 bp) and very good quality profile with no qualities dropping below [phred score](http://www.phrap.com/phred/) of 30. The **B** dataset is significantly worse with ends of the reads dipping below phred score of 20. The **B** reads may need to be trimmed for further processing.
+Here you can see FastQC base quality reports (the tool gives you many other types of data) for two datasets: **A** and **B**. The **A** dataset has long reads (250 bp) and a very good quality profile with no qualities dropping below a [phred score](https://www.phrap.com/phred/) of 30. The **B** dataset is significantly worse, with the ends of the reads dipping below a phred score of 20. The **B** reads may need to be trimmed for further processing.
@@ -204,11 +204,11 @@ QC datasets you have uploaded before.
Mapping of NGS reads against reference sequences is one of the key steps of the analysis. Now it is time to see how this is done in practice. Below is a list of key publications highlighting mainstream mapping tools:
-- 2009 Bowtie 1 - [Langmead et al.](http://genomebiology.com/content/10/3/R25)
+- 2009 Bowtie 1 - [Langmead et al.](https://genomebiology.com/content/10/3/R25)
- 2012 Bowtie 2 - [Langmead and Salzberg](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3322381/)
-- 2009 BWA - [Li and Durbin](http://bioinformatics.oxfordjournals.org/content/25/14/1754.long)
-- 2010 BWA - [Li and Durbin](http://bioinformatics.oxfordjournals.org/content/26/5/589)
-- 2013 BWA-MEM - [Li](http://arxiv.org/abs/1303.3997)
+- 2009 BWA - [Li and Durbin](https://academic.oup.com/bioinformatics/article/25/14/1754/225615/Fast-and-accurate-short-read-alignment-with)
+- 2010 BWA - [Li and Durbin](https://academic.oup.com/bioinformatics/article/26/5/589/211735/Fast-and-accurate-long-read-alignment-with-Burrows)
+- 2013 BWA-MEM - [Li](https://arxiv.org/abs/1303.3997)
## Mapping against a pre-computed genome index
@@ -216,7 +216,7 @@ Mappers usually compare reads against a reference sequence that has been transfo
| |
|--------------------------------------------------------------|
-|  |
+|  |
|Mapping against a pre-computed index in Galaxy.|
For example, the image above shows indexes for `hg38` version of the human genome. You can see that there are actually three choices: (1) `hg38`, (2) `hg38 canonical` and (3) `hg38 canonical female`. The `hg38` contains all chromosomes as well as all unplaced contigs. The `hg38 canonical` does not contain unplaced sequences and only consists of chromosomes 1 through 22, X, Y, and mitochondria. The
@@ -236,7 +236,7 @@ If Galaxy does not have a genome you need to map against, you can upload your ge
| |
|--------------------------------------------------------------|
-|  |
+|  |
|Mapping against a pre-computed index in Galaxy |
In this case Galaxy will first create an index from this dataset and then run mapping analysis against it. The following video shows how this works in practice:
@@ -253,7 +253,7 @@ As shown below, SAM files typically contain a short header section and a very lo
| |
|--------------------------------------------------------------|
-|  |
+|  |
|**Schematic representation of a SAM file**. Each line of the optional header section starts with “@”, followed by the appropriate abbreviation (e.g., SQ for the sequence dictionary, which lists all chromosome names (SN) and their lengths (LN)). The vast majority of lines within a SAM file typically correspond to read alignments, where each read is described by the 11 mandatory entries (black font) and a variable number of optional fields (grey font). From [tutorial](http://chagall.med.cornell.edu/RNASEQcourse/Intro2RNAseq.pdf) by Friederike Dündar, Luce Skrabanek, and Paul Zumbo.|
## SAM Header
@@ -282,7 +282,7 @@ ERR458493 .552967 16 chrI 140 255 12 M61232N37M2S * 0 0 CCACTCGTTCACCAGGGCCGGCGG
The following table explains the format and content of each field. The `FLAG`, `CIGAR`, and the optional fields (marked in blue) are explained in more detail below. The number of optional fields can vary widely between different SAM files and even between reads within the same file.
-
+
### `FLAG` field
@@ -292,7 +292,7 @@ The following table gives an overview of the different properties that can be en
| |
|--------------------------------------------------------------|
-|  |
+|  |
+|The `FLAG` field of SAM files stores information about the respective read alignment in one single decimal number. The decimal number is the sum of all the answers to the Yes/No questions associated with each binary bit. The hexadecimal representation is used to refer to the individual bits (questions). A bit is set if the corresponding state is true. For example, if a read is paired, `0x1` will be set, returning the decimal value of 1. Therefore, all `FLAG` values associated with paired reads must be odd decimal numbers. Conversely, if the `0x1` bit is unset (= read is not paired), no assumptions can be made about `0x2`, `0x8`, `0x20`, `0x40` and `0x80` because they refer to paired reads. From [tutorial](http://chagall.med.cornell.edu/RNASEQcourse/Intro2RNAseq.pdf) by Friederike Dündar, Luce Skrabanek, and Paul Zumbo|
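The bit arithmetic described in the caption is easy to verify programmatically. A small sketch that decodes a decimal `FLAG` value into the properties it encodes (bit meanings are from the SAM specification):

```python
# SAM FLAG bits (from the SAM specification)
FLAG_BITS = {
    0x1: "read is paired",
    0x2: "pair mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
}

def decode_flag(flag):
    """Return the list of properties encoded in a decimal FLAG value."""
    return [desc for bit, desc in FLAG_BITS.items() if flag & bit]

# 16 = 0x10: an unpaired read mapped to the reverse strand
print(decode_flag(16))  # ['read on reverse strand']
# 99 is odd because the 0x1 (paired) bit is set
print(decode_flag(99))
```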
In a run with single reads, the flags you most commonly see are:
@@ -339,7 +339,7 @@ The sum of lengths of the **M**, **I**, **S**, **=**, **X** operations must equa
| |
|---------------------------------|
-||
+||
|From [tutorial](http://chagall.med.cornell.edu/RNASEQcourse/Intro2RNAseq.pdf) by Friederike Dündar, Luce Skrabanek, and Paul Zumbo.|
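The rule stated above — the **M**, **I**, **S**, **=**, and **X** operation lengths must sum to the read length — can be checked with a few lines of Python (a sketch; real pipelines would use htslib/pysam for this):

```python
import re

def cigar_ops(cigar):
    """Parse a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def read_length(cigar):
    """Read length implied by a CIGAR string (sum of M, I, S, =, X lengths)."""
    return sum(n for n, op in cigar_ops(cigar) if op in "MIS=X")

# CIGAR from the SAM record shown earlier: 12M61232N37M2S.
# N (skipped reference bases) does not consume the read, so 12 + 37 + 2 = 51.
print(read_length("12M61232N37M2S"))  # 51
```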
### Optional fields
@@ -369,13 +369,13 @@ Thus, for example, we can use the NM:i:0 tag to select only those reads which ma
One of the key features of the SAM/BAM format is the ability to label individual reads with readgroup tags. This allows pooling the results of multiple experiments into a single BAM dataset, which significantly simplifies downstream logistics: instead of dealing with multiple datasets one can handle just one. Many downstream analysis tools, such as variant callers, are designed to recognize readgroup data and output results on a per-readgroup basis.
-One of the best descriptions of BAM readgroups is on [GATK support site](http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files). We have gratefully stolen two tables describing the most important readgroup tags - `ID`, `SM`, `LB`, and `PL` - from GATK forum and provide them here:
+One of the best descriptions of BAM readgroups is on the [GATK support site](https://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files). We have gratefully stolen two tables describing the most important readgroup tags - `ID`, `SM`, `LB`, and `PL` - from the GATK forum and provide them here:
-
+
The GATK forum also provides the following example:
-
+
To see an example of read group manipulation in Galaxy see the following video:
@@ -386,9 +386,9 @@ To see an example of read group manipulation in Galaxy see the following video:
We support four major toolsets for processing of SAM/BAM datasets:
 * [DeepTools](https://deeptools.github.io/) - a suite of user-friendly tools for the visualization, quality control and normalization of data from deep-sequencing experiments.
- * [SAMtools](http://www.htslib.org/) - various utilities for manipulating alignments in the SAM/BAM format, including sorting, merging, indexing and generating alignments in a per-position format.
+ * [SAMtools](https://www.htslib.org/) - various utilities for manipulating alignments in the SAM/BAM format, including sorting, merging, indexing and generating alignments in a per-position format.
* [BAMtools](https://github.com/pezmaster31/bamtools/wiki/Tutorial_Toolkit_BamTools-1.0.pdf) - a toolkit for reading, writing, and manipulating BAM (genome alignment) files.
- * [Picard](http://broadinstitute.github.io/picard/) - a set of Java tools for manipulating high-throughput sequencing data (HTS) data and formats.
+ * [Picard](https://broadinstitute.github.io/picard/) - a set of Java tools for manipulating high-throughput sequencing (HTS) data and formats.
The following video highlights de-duplication, filtering, and cleaning of a BAM dataset using BAMtools and Picard tools:
@@ -402,12 +402,12 @@ Perform a similar analyses with your own data.
## PCR duplicates
-Preparation of sequencing libraries (at least at the time of writing) for technologies such as Illumina (used in this examples) involves PCR amplification. It is required to generate sufficient number of sequencing templates so that a reliable detection can be performed by base callers. Yet PCR has it's biases, which are especially profound in cases of multitemplate PCR used for construction of sequencing libraries (Kanagawa et al. [2003](http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16233530)).
+Preparation of sequencing libraries (at least at the time of writing) for technologies such as Illumina (used in this example) involves PCR amplification, which is required to generate a sufficient number of sequencing templates so that reliable detection can be performed by base callers. Yet PCR has its biases, which are especially profound in the case of the multitemplate PCR used for construction of sequencing libraries (Kanagawa et al. [2003](https://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16233530)).
| |
|----------------------------------------------|
-|  |
-|Analyzing molecules aligning with the same outer coordinates, a mapping quality of at least 30 and a length of at least 30nt, resulted in an average coverage of 12.9 per PCR duplicate and an empirical coverage distribution similar to an exponential/power law distribution (left upper panel). This indicates that many molecules are only observed for deeper sequencing while other molecules are available at higher frequencies. Analyzing length (left middle panel) and GC content (left lower panel) patterns as well as the combination (right panel) shows higher PCR duplicate counts for a GC content between 30% to 70% as well as for shorter molecules compared to longer molecules. This effect may be due to an amplification bias from the polymerase or the cluster generation process necessary for Illumina sequencing. From Ph.D. dissertation of [Martin Kircher](http://www.qucosa.de/fileadmin/data/qucosa/documents/7110/pflichtexemplar_final.pdf)).|
+|  |
+|Analyzing molecules aligning with the same outer coordinates, a mapping quality of at least 30 and a length of at least 30nt, resulted in an average coverage of 12.9 per PCR duplicate and an empirical coverage distribution similar to an exponential/power law distribution (left upper panel). This indicates that many molecules are only observed for deeper sequencing while other molecules are available at higher frequencies. Analyzing length (left middle panel) and GC content (left lower panel) patterns as well as the combination (right panel) shows higher PCR duplicate counts for a GC content between 30% to 70% as well as for shorter molecules compared to longer molecules. This effect may be due to an amplification bias from the polymerase or the cluster generation process necessary for Illumina sequencing. From Ph.D. dissertation of [Martin Kircher](https://www.qucosa.de/fileadmin/data/qucosa/documents/7110/pflichtexemplar_final.pdf)).|
Duplicates can be identified based on their outer alignment coordinates or using sequence-based clustering. One of the common ways of identifying duplicate reads is the `MarkDuplicates` utility from the [Picard](https://broadinstitute.github.io/picard/command-line-overview.html) package. It is designed to identify both PCR and optical duplicates:
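The coordinate-based part of this idea can be sketched in a few lines of Python: reads sharing a reference, strand, and outer coordinates form a duplicate group, and all but one representative are marked. This is a deliberately simplified sketch, not Picard's actual algorithm (which also scores base qualities and detects optical duplicates):

```python
from collections import defaultdict

def mark_duplicates(reads):
    """reads: list of dicts with 'chrom', 'strand', 'start', 'end', 'qual_sum'.
    Sets a 'duplicate' key on each read; the read with the highest base-quality
    sum in each coordinate group is kept as the non-duplicate representative."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["strand"], read["start"], read["end"])].append(read)
    for group in groups.values():
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            read["duplicate"] = read is not best
    return reads

reads = [
    {"chrom": "chrM", "strand": "+", "start": 100, "end": 350, "qual_sum": 900},
    {"chrom": "chrM", "strand": "+", "start": 100, "end": 350, "qual_sum": 870},
    {"chrom": "chrM", "strand": "+", "start": 200, "end": 450, "qual_sum": 880},
]
print([r["duplicate"] for r in mark_duplicates(reads)])  # [False, True, False]
```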
@@ -423,5 +423,5 @@ However, one has to be careful when removing duplicates in cases when the sequen
| |
|----------------------------------------------|
-|  |
-| The Variant Allele Frequency (VAF) bias determined by coverage and insert size variance. Reads are paired-end and read length is 76. The insert size distribution is modeled as a Gaussian distribution with mean at 200 and standard deviation shown on the x-axis. The true VAF is 0.05. The darkness at each position indicates the magnitude of the bias in the VAF. (From Zhou et al. [2013](http://bioinformatics.oxfordjournals.org/content/30/8/1073)). |
+|  |
+| The Variant Allele Frequency (VAF) bias determined by coverage and insert size variance. Reads are paired-end and read length is 76. The insert size distribution is modeled as a Gaussian distribution with mean at 200 and standard deviation shown on the x-axis. The true VAF is 0.05. The darkness at each position indicates the magnitude of the bias in the VAF. (From Zhou et al. [2013](https://bioinformatics.oxfordjournals.org/content/30/8/1073)). |
diff --git a/topics/usegalaxy/tutorials/non-dip/tutorial.md b/topics/usegalaxy/tutorials/non-dip/tutorial.md
index c0cec573..81ac1634 100644
--- a/topics/usegalaxy/tutorials/non-dip/tutorial.md
+++ b/topics/usegalaxy/tutorials/non-dip/tutorial.md
@@ -9,8 +9,8 @@ tutorial_name: non-dip
The majority of life on Earth is non-diploid and represented by prokaryotes, viruses and their derivatives, such as our own mitochondria or plants' chloroplasts. In non-diploid systems allele frequencies can range anywhere between 0 and 100% and there can be multiple (not just two) alleles per locus. The main challenge associated with non-diploid variant calling is the difficulty of distinguishing between sequencing noise (abundant in all NGS platforms) and true low-frequency variants. Some of the early attempts to do this well were performed on human mitochondrial DNA, although the same approaches work equally well on viral and bacterial genomes:
-* 2014 - [Maternal age effect and severe germ-line bottleneck in the inheritance of human mitochondrial DNA](http://www.pnas.org/content/111/43/15474.abstract)
-* 2015 - [Extensive tissue-related and allele-related mtDNA heteroplasmy suggests positive selection for somatic mutations](http://www.pnas.org/content/112/8/2491.abstract).
+* 2014 - [Maternal age effect and severe germ-line bottleneck in the inheritance of human mitochondrial DNA](https://www.pnas.org/content/111/43/15474.abstract)
+* 2015 - [Extensive tissue-related and allele-related mtDNA heteroplasmy suggests positive selection for somatic mutations](https://www.pnas.org/content/112/8/2491.abstract).
As an example of a non-diploid system we will use the human mitochondrial genome. However, this approach will also work for most bacterial and viral genomes.
@@ -21,22 +21,22 @@ There are two ways one can call variants:
| |
|--------------------------|
-|  |
+|  |
| This figure from a manuscript by [Olson:2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4493402/) contrasts the two approaches. |
In this tutorial we will take the *first* path, in which we map reads against an existing assembly. Later in the course (after we learn about assembly approaches) we will try the second approach as well.
-The goal of this example is to detect heteroplasmies (variants within mitochondrial DNA). Mitochondria is transmitted maternally and heteroplasmy frequencies may change dramatically and unpredictably during the transmission, due to a germ-line bottleneck [Cree:2008](http://www.nature.com/ng/journal/v40/n2/abs/ng.2007.63.html). As we mentioned above the procedure for finding variants in bacterial or viral genomes will be essentially the same.
+The goal of this example is to detect heteroplasmies (variants within mitochondrial DNA). Mitochondrial DNA is transmitted maternally, and heteroplasmy frequencies may change dramatically and unpredictably during transmission due to a germ-line bottleneck [Cree:2008](https://www.nature.com/ng/journal/v40/n2/abs/ng.2007.63.html). As we mentioned above, the procedure for finding variants in bacterial or viral genomes will be essentially the same.
[A Galaxy Library](https://usegalaxy.org/library/list#folders/Fe4842bd0c37b03a7) contains datasets representing a child and a mother. These datasets are obtained by paired-end Illumina sequencing of human genomic DNA enriched for mitochondria. The enrichment was performed using long-range PCR with two primer pairs that amplify the entire mitochondrial genome. This means that these samples still contain a lot of DNA from the nuclear genome, which, in this case, is a contaminant.
# Importing example datasets
-For this tutorial we have prepared a subset of data previously [published](http://www.pnas.org/content/111/43/15474.abstract) by our group. Let's import these data into Galaxy.
+For this tutorial we have prepared a subset of data previously [published](https://www.pnas.org/content/111/43/15474.abstract) by our group. Let's import these data into Galaxy.
> ### Data upload from a Galaxy Library
>
-> 
+> 
>
> * Go to this [this Galaxy library](https://usegalaxy.org/library/list#folders/Fe4842bd0c37b03a7)
> * You will see a screen like the one shown above
@@ -46,7 +46,7 @@ For this tutorial we have prepared a subset of data previously [published](http:
> * click **Import**
> * A green message will appear once the import is done. Click on it and you will see the history you have just created. It will be populated with the four datasets as shown below:
>
-> 
+> 
>
{: .hands_on}
@@ -57,13 +57,13 @@ Before proceeding with the analysis, we need to find out how good the data actua
> ### Quality Control of the data
>
>
-> 
+> 
>
->QC'ing reads using [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Note that we selected all four datasets at once by pressing the middle button  adjacent to the **Short read data from your current history** widget. Once `FastQC` job runs, you will be able to look at the HTML reports generated by this tool.
+>QC'ing reads using [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Note that we selected all four datasets at once by pressing the middle button  adjacent to the **Short read data from your current history** widget. Once the `FastQC` job runs, you will be able to look at the HTML reports generated by this tool.
>
>The data have generally high quality in this example:
>
->
+>
>
>FastQC plot for one of the mitochondrial datasets shows that qualities are acceptable for 250 bp reads (mostly in the green, which is at or above [phred score](https://en.wikipedia.org/wiki/Phred_quality_score) of 30).
{: .hands_on}
@@ -75,12 +75,12 @@ Our reads are long (250 bp) and as a result we will be using [bwa mem](https://a
> ### Mapping with `bwa mem`
>
->
+>
>
>Running `bwa mem` on our datasets. Look **carefully** at parameter settings:
>
> * We select `hg38` version of the human genome as the reference
-> * By using the middle button again  we select datasets 1 and 3 as **Select the first set of reads** and datasets 2 and 4 as **Select the second set of reads**. Galaxy will automatically launch two bwa-mem jobs using datasets 1,2 and 3,4 generating two resulting BAM files.
+> * By using the middle button again  we select datasets 1 and 3 as **Select the first set of reads** and datasets 2 and 4 as **Select the second set of reads**. Galaxy will automatically launch two bwa-mem jobs using datasets 1,2 and 3,4 generating two resulting BAM files.
> * By setting **Set read groups information** to `Set read groups (SAM/BAM specifications)` and clicking **Auto-assign** we will ensure that the read groups in the resulting BAM dataset are properly set.
{: .hands_on}
@@ -92,14 +92,14 @@ We can BAM dataset using **NGS: Picard** → **MergeSAMFiles** tool:
> ### Merging multiple datasets into one
>
->
+>
>
>Merging two BAM datasets into one. Note that two inputs are highlighted.
{: .hands_on}
## Removing duplicates
-Preparation of sequencing libraries (at least at the time of writing) for technologies such as Illumina (used in this example) involves PCR amplification. It is required to generate sufficient number of sequencing templates so that a reliable detection can be performed by base callers. Yet PCR has it's biases, which are especially profound in cases of multitemplate PCR used for construction of sequencing libraries (Kanagawa et al. [2003](http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16233530)).
+Preparation of sequencing libraries (at least at the time of writing) for technologies such as Illumina (used in this example) involves PCR amplification, which is required to generate a sufficient number of sequencing templates so that reliable detection can be performed by base callers. Yet PCR has its biases, which are especially profound in the case of the multitemplate PCR used for construction of sequencing libraries (Kanagawa et al. [2003](https://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16233530)).
Duplicates can be identified based on their outer alignment coordinates or using sequence-based clustering. One of the common ways for identification of duplicate reads is the `MarkDuplicates` utility from [Picard](https://broadinstitute.github.io/picard/command-line-overview.html) package. It is designed to identify both PCR and optical duplicates (the following is an excerpt from Picard documentation):
@@ -109,7 +109,7 @@ Let's use **NGS: Picard** → **MarkDuplicates** tool:
> ### De-duplicating mapped data
>
->
+>
>
>De-duplicating the merged BAM dataset
{: .hands_on}
@@ -137,7 +137,7 @@ In other words the two datasets had ~6% and ~9% duplicates, respectively.
# Left-aligning indels
-Left aligning of indels (a variant of re-aligning) is extremely important for obtaining accurate variant calls. This concept, while not difficult, requires some explanation. For illustrating how left-aligning works we expanded on an example provided by [Tan:2015](http://bioinformatics.oxfordjournals.org/content/31/13/2202.abstract). Suppose you have a reference sequence and a sequencing read:
+Left-aligning indels (a variant of re-aligning) is extremely important for obtaining accurate variant calls. This concept, while not difficult, requires some explanation. To illustrate how left-aligning works we expanded on an example provided by [Tan:2015](https://academic.oup.com/bioinformatics/article/31/13/2202/196142/Unified-representation-of-genetic-variants). Suppose you have a reference sequence and a sequencing read:
```
@@ -160,13 +160,13 @@ GGGCACACACAGGG Ref: GCA
GGG--CACACAGGG Alt: G
```
-The last of these is *left-aligned*. In this case gaps (dashes) as moved as far left as possible (for a formal definition of left-alignment and variant normalization see [Tan:2015](http://bioinformatics.oxfordjournals.org/content/31/13/2202.abstract)).
+The last of these is *left-aligned*. In this case gaps (dashes) are moved as far left as possible (for a formal definition of left-alignment and variant normalization see [Tan:2015](https://bioinformatics.oxfordjournals.org/content/31/13/2202.abstract)).
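The left shift itself can be sketched in a few lines: keep moving the deleted bases one position to the left as long as the base entering the deletion window equals the base leaving it. This is a simplified, deletion-only sketch of the normalization idea, not the actual `BamLeftAlign` implementation:

```python
def left_align_deletion(ref, del_start, del_len):
    """Shift a deletion of del_len bases at 0-based position del_start in ref
    as far left as possible without changing the resulting alternate sequence."""
    # Moving the window one base left preserves the alternate sequence exactly
    # when the base entering it on the left equals the base leaving it on the right.
    while del_start > 0 and ref[del_start - 1] == ref[del_start + del_len - 1]:
        del_start -= 1
    return del_start

# The GGGCACACACAGGG example above: deleting any "CA" unit yields the same
# alternate sequence, but the normalized (left-aligned) deletion starts at index 3.
ref = "GGGCACACACAGGG"
print(left_align_deletion(ref, 7, 2))  # 3
```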
Let's perform left alignment using **NGS: Variant Analysis** → **BamLeftAlign**:
> ### Left-aligning indels
>
->
+>
>
>Left-aligning a de-duplicated BAM dataset
{: .hands_on}
@@ -177,7 +177,7 @@ Remember that we are trying to call variants in mitochondrial genome. Let focus
> ### Filtering BAM data
>
->
+>
>
>Filtering reads. There are several important points to note here:
>
@@ -193,19 +193,19 @@ FreeBayes is widely used for calling variants in diploid systems. However, it ca
> ### Running `FreeBayes`
>
->
+>
>
>Set genome to `hg38` (the latest version)
>
->
+>
>
>Set regions to `chrM` from `1` to `16000`. This will simply save us time since we are only interested in mitochondrial variants anyway
>
->
+>
>
>Choose `Complete list of all samples` from **Choose parameter selection level** drop down.
>
->
+>
>
>This is one of the most important parameter choices one needs to make when calling variants in non-diploid systems. Here set **Set population model** to `Yes` and then:
>
@@ -213,7 +213,7 @@ FreeBayes is widely used for calling variants in diploid systems. However, it ca
>* Set **Assume that samples result from pooled sequencing** to `Yes`
>* Set **Output all alleles which pass input filters, regardless of genotyping outcome or model** to `Yes`
>
->
+>
>
>We will also set **Allelic scope** to `Yes` and restrict variant types to single nucleotide polymorphisms only by:
>
@@ -222,7 +222,7 @@ FreeBayes is widely used for calling variants in diploid systems. However, it ca
>
>The mitochondrial genome has a number of low-complexity regions (mononucleotide repeats). Setting these parameters as described above will decrease noise from these regions.
>
->
+>
>
>Finally, let's set **Input filters** to `Yes` and set:
>
@@ -330,8 +330,8 @@ Even though we selected somewhat stringent input parameters (restricting base qu
| |
|----------------------------|
-||
-|Here you can see that in an ideal case (indicated with a green star) a variant is evenly represent by different areas of sequencing reads (cycle and placement biases) and is balanced across the two strands (strand bias). Allele imbalance is not applicable in our case as it reflects significant deviation from the diploid (50/50) expectation (see [here](../images/freebayes.pdf) for more details).|
+||
+|Here you can see that in an ideal case (indicated with a green star) a variant is evenly represented in different areas of sequencing reads (cycle and placement biases) and is balanced across the two strands (strand bias). Allele imbalance is not applicable in our case as it reflects significant deviation from the diploid (50/50) expectation (see [here](../../images/freebayes.pdf) for more details).|
A robust tool set for processing VCF data is provided by [vcflib](https://github.com/vcflib/vcflib), developed by Erik Garrison, the author of FreeBayes. One way to filter a VCF is using the `INFO` fields of the VCF dataset. If you look at the VCF dataset shown above you will see comment lines beginning with `##INFO`. Each VCF record contains a list of `INFO` tags describing a wide range of properties of that record. You will see that FreeBayes and NVC differ significantly in the number and types of `INFO` fields each of these callers generates. This is why the two require different filtering strategies.
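Mechanically, an `INFO` column is just a semicolon-separated list of `TAG=value` pairs, which is what tag-based filtering relies on. A minimal parsing sketch, using an abbreviated FreeBayes-style `INFO` string of the kind shown in this tutorial:

```python
def parse_info(info):
    """Split a VCF INFO column into a dict mapping tag -> value."""
    fields = {}
    for item in info.split(";"):
        tag, _, value = item.partition("=")
        fields[tag] = value if value else True  # flag-type tags carry no value
    return fields

# Abbreviated INFO column from the chrM:8557 record in this tutorial
info = parse_info("AB=0.267066;ABP=790.051;AC=2;AF=0.5;AN=4;AO=446")
print(info["AF"])  # 0.5
# e.g. require at least 20 alternate-allele observations
print(float(info["AO"]) > 20)  # True
```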
@@ -346,7 +346,7 @@ Among numerous types of data generated by FreeBayes let's consider the following
To perform filtering we will use **NGS: VCF Manipulation** → **VCFfilter**):
> ### Filtering VCF data
>
->
+>
>
>Filtering the FreeBayes VCF for strand bias (`SRP` and `SAP`), placement bias (`EPP`), variant quality (`QUAL`), and depth of coverage (`DP`).
{: .hands_on}
@@ -363,7 +363,7 @@ chrM 8557 . G C 2590.97 . AB=0.267066;ABP=790.051;AC=2;AF=0.5;AN=4;AO=446;CIGAR=
```
# Looking at the data
-For visalizaning VCFs Galaxy relies on the two external tools. The first is called [VCF.IOBIO](http://vcf.iobio.io/) and is developed by [Gabor Marth's group](http://marthlab.org/) at the University of Utah. The second is called [IGV](http://software.broadinstitute.org/software/igv/) developed by Broad Institute.
+For visualizing VCFs Galaxy relies on two external tools. The first is called [VCF.IOBIO](https://vcf.iobio.io/) and is developed by [Gabor Marth's group](http://marthlab.org/) at the University of Utah. The second is called [IGV](http://software.broadinstitute.org/software/igv/), developed by the Broad Institute.
## VCF.IOBIO
@@ -371,16 +371,16 @@ VCF.IOBIO can be invoked by expanding a VCF dataset in Galaxy's history by click
> ### Displaying data in VCF.IOBIO
>
->
+>
>
>Clicking on the dataset above will expand it as shown below:
>
->
+>
>
>At the bottom there is a link "display at vcf.iobio"
>Clicking on this link will start indexing of VCF datasets, which is required to display them. After indexing VCF.IOBIO will open:
>
->
+>
>
>Of course there are not that many variants to look at in this example. Nevertheless there are helpful statistics, such as the Transition/Transversion (Ts/Tv) ratio.
{: .hands_on}
@@ -391,7 +391,7 @@ Similarly to VCF.BIOIO expanding a history item representing a VCF dataset will
> ### Displaying data in IGV
>
->
+>
>
>At the bottom there is a link "display at IGV: local Human hg38"
>The difference between "local" and "Human hg38" links is explained in the following video:
@@ -400,7 +400,7 @@ Similarly to VCF.BIOIO expanding a history item representing a VCF dataset will
>
>Visualizing our FreeBayes dataset will produce this:
>
->
+>
>
>Here we focus on one particular variant at position 3,243 for reasons that will become apparent in the next section.
{: .hands_on}
@@ -413,13 +413,13 @@ Using **NGS: VCF Manipulation** → **VCFtoTab-delimited** on the filtered V
> ### From VCF to Tab-delimited data
>
->
+>
>
>Make sure **Report data per sample** is set to `Yes`
>
>This will produce a dataset with *very* many columns:
>
->
+>
>
>There are 53 columns in this dataset (not all are shown here).
{: .hands_on}
@@ -437,7 +437,7 @@ To cut these columns out we will use **Text Manipulation** → **Cut**
> ### Cutting columns from a file
>
->
+>
>
>Note that column names are preceded with `c`
>
@@ -486,14 +486,14 @@ Time to really do it yourself. Please, complete the following exercise:
>
>Suppose you obtained a virus from some source and you would like to see how it is different from its published reference sequence. You have sequenced the virus and obtained two Illumina files (these files are large, so don't open them; rather, copy their addresses (right click) and use them to upload into Galaxy as explained in the *Hints* section below):
>
->- [Forward reads](http://www.bx.psu.edu/~anton/share/ng_test_data/bmmb554/hw4/f.fq.gz)
->- [Reverse reads](http://www.bx.psu.edu/~anton/share/ng_test_data/bmmb554/hw4/r.fq.gz)
+>- [Forward reads](https://www.bx.psu.edu/~anton/share/ng_test_data/bmmb554/hw4/f.fq.gz)
+>- [Reverse reads](https://www.bx.psu.edu/~anton/share/ng_test_data/bmmb554/hw4/r.fq.gz)
>
->Analyze these files using Galaxy as was explained in this lesson by mapping them against [this reference genome](http://www.bx.psu.edu/~anton/share/ng_test_data/bmmb554/hw4/phix.fa) (again right click to copy the address); see *Tips*).
+>Analyze these files using Galaxy as was explained in this lesson by mapping them against [this reference genome](https://www.bx.psu.edu/~anton/share/ng_test_data/bmmb554/hw4/phix.fa) (again, right click to copy the address; see *Tips*).
>
> > ### :bulb: Tips
> >
-> > - You need to upload reads and the reference genome into Galaxy (http://usegalaxy.org) as shown in [this video](https://vimeo.com/120973708)
+> > - You need to upload reads and the reference genome into Galaxy (https://usegalaxy.org) as shown in [this video](https://vimeo.com/120973708)
> > - You will be mapping reads against an uploaded reference genome as shown in [this video](https://vimeo.com/123108417).
> {: .tip}
{: .comment}
diff --git a/topics/usegalaxy/tutorials/rb-rnaseq/tutorial.md b/topics/usegalaxy/tutorials/rb-rnaseq/tutorial.md
index 33e909ad..7f322f60 100644
--- a/topics/usegalaxy/tutorials/rb-rnaseq/tutorial.md
+++ b/topics/usegalaxy/tutorials/rb-rnaseq/tutorial.md
@@ -19,7 +19,7 @@ In this lesson we will focus on the **Reference genome-based** type of RNA seq.
The *Everything's connected* slide by Dündar et al. (2015) explains the overall idea:
-
+
There is a variety of ways in which RNA is treated during its conversion to cDNA and eventual preparation of sequencing libraries. In general the experimental workflow includes the following steps:
@@ -28,7 +28,7 @@ There is a variety of ways in which RNA is treated during its conversion to cDNA
* Second strand synthesis using DNA polymerase;
* Library preparation for sequencing.
-In listing these basic steps we are ignoring a vast amount of details such as, for example, normalization strategies and procedures needed to deal with rare RNAs or degraded samples (see [Adiconis:2013](http://nature.com/nmeth/journal/v10/n7/full/nmeth.2483.html)). Yet, there are two important experimental considerations that would effect the ways in which one analyses data and interprets the results. These are:
+In listing these basic steps we are ignoring a vast number of details such as, for example, normalization strategies and procedures needed to deal with rare RNAs or degraded samples (see [Adiconis:2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3821180/)). Yet there are two important experimental considerations that affect the way one analyzes the data and interprets the results. These are:
* Priming for the first cDNA strand synthesis;
* Stranded versus Non-stranded libraries.
@@ -38,7 +38,7 @@ In listing these basic steps we are ignoring a vast amount of details such as, f
Reverse Transcriptase (RT) requires a primer. One can leverage the fact that the majority of processed mRNAs are polyadenylated and use an oligo-dT primer to (mostly) restrict cDNA synthesis to fully processed mRNAs. Alternatively one can use a mix of random oligonucleotides to prime RT at a multitude of internal sites irrespective of RNA type and maturation status:
->
+>
>
>**Oligo-dT vs. random priming**
>Oligo-dT (**A**) and random priming (**B**)
@@ -49,24 +49,24 @@ Depending on the choice of the approach one would have different types of RNAs i
RNAs that are typically targeted in RNAseq experiments are single stranded (e.g., mRNAs) and thus have polarity (5' and 3' ends that are functionally distinct):
->
+>
>
>**Relationship between DNA and RNA orientation**
-During a typical RNAseq experiment the information about strandedness is lost after both strands of cDNA are synthesized, size selected, and converted into sequencing library. However, this information can be quite useful for various aspects of RNAseq analysis such as transcript reconstruction and quantification. There is a number of methods for creating so called *stranded* RNAseq libraries that preserve the strand information (for an excellent overview see Levin et al. [2010](http://www.nature.com/nmeth/journal/v7/n9/full/nmeth.1491.html)):
+During a typical RNAseq experiment the information about strandedness is lost after both strands of cDNA are synthesized, size selected, and converted into a sequencing library. However, this information can be quite useful for various aspects of RNAseq analysis, such as transcript reconstruction and quantification. There are a number of methods for creating so-called *stranded* RNAseq libraries that preserve the strand information (for an excellent overview see Levin et al. [2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3005310/)):
->[](http://www.nature.com/nmeth/journal/v7/n9/fig_tab/nmeth.1491_F1.html)
+>
>
>**Generation of stranded RNAseq libraries**
->Different types of stranded library generation protocols from [Levin:2010](http://www.nature.com/nmeth/journal/v7/n9/full/nmeth.1491.html)
+>Different types of stranded library generation protocols from [Levin:2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3005310/)
Depending on the approach and whether one performs single- or paired-end sequencing there are multiple possibilities on how to interpret the results of mapping of these reads onto genome/transcriptome:
->[](http://sailfish.readthedocs.org/en/master/library_type.html)
+>[](https://sailfish.readthedocs.org/en/master/library_type.html)
>
>**Effects of RNAseq library types**
->Image and description below is from [Sailfish documentation](http://sailfish.readthedocs.org/en/master/library_type.html)
+>Image and description below is from [Sailfish documentation](https://sailfish.readthedocs.org/en/master/library_type.html)
The relative orientation of the reads is only relevant if the library is pair-ended. The possible options are:
@@ -90,11 +90,11 @@ So by combining the relative orientation of reads is I, O, or M (if reads are pa
However, in practice, if you use Illumina paired-end RNAseq protocols you are unlikely to uncover many of these possibilities. You will either deal with:
* unstranded RNAseq data (**IU** type from above. Also called **fr-unstranded** in TopHat/Cufflinks jargon);
- * stranded RNAseq data produced with Illumina TrueSeq RNAseq kits and [dUTP tagging](http://nar.oxfordjournals.org/content/37/18/e123) (**ISR** type from above or **fr-firststrand** in TopHat/Cufflinks nomenclature).
+ * stranded RNAseq data produced with Illumina TruSeq RNAseq kits and [dUTP tagging](https://nar.oxfordjournals.org/content/37/18/e123) (**ISR** type from above or **fr-firststrand** in TopHat/Cufflinks nomenclature).
The implication of stranded RNAseq is that you can distinguish whether the reads are derived from forward- or reverse-encoded transcripts:
->
+>
>
>**Stranded RNAseq data look like this**
>This example contrasts unstranded and stranded RNAseq experiments. Red transcripts are from the + strand and blue are from the - strand. In the stranded example reads are clearly stratified between the two strands. A small number of reads from the opposite strand may represent anti-sense transcription. The image is from GATC Biotech.
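A quick way to see this difference in practice is to tally how often read strands agree with the annotated transcript strand (a minimal sketch with made-up strand calls; on real BAM + BED data this is what tools such as RSeQC's `infer_experiment.py` report):

```python
# Rough strandedness estimate: fraction of mapped reads whose strand agrees
# with the annotated transcript strand. Inputs here are hypothetical.

def strandedness(read_strands, transcript_strand):
    agree = sum(1 for s in read_strands if s == transcript_strand)
    return agree / len(read_strands)

# Stranded library over a "-" transcript: reads are heavily skewed to one strand
stranded = ["-"] * 95 + ["+"] * 5
print(strandedness(stranded, "-"))  # 0.95

# Unstranded library: roughly 50/50
unstranded = ["+"] * 50 + ["-"] * 50
print(strandedness(unstranded, "+"))  # 0.5
```

A value near 1.0 (or near 0.0) suggests a stranded protocol, while values near 0.5 suggest an unstranded one.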
@@ -108,7 +108,7 @@ An RNAseq experiment without a sufficient number of replicates will be a waste o
>* **Biological replicates**. There is an on-going debate over what kinds of samples represent true biological replicates. Obviously, the variability between different samples will be greater between RNA extracted from two unrelated humans than between RNA extracted from two different batches of the same cell line. In the latter case, most of the variation that will eventually be detected was probably introduced by the experimenter (e.g., slightly differing media and plating conditions). Nevertheless, this is variation the researcher is typically not interested in assessing, therefore the ENCODE consortium defines biological replicates as RNA from an independent growth of cells/tissue (ENCODE [2011](https://genome.ucsc.edu/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf)).
->The number of replicates should be as high as practically possible. Most RNAseq experiments include three replicates and some have as many as 12 (see Schurch et al. [2015](http://arxiv.org/abs/1505.02017)).
+>The number of replicates should be as high as practically possible. Most RNAseq experiments include three replicates and some have as many as 12 (see Schurch et al. [2015](https://arxiv.org/abs/1505.02017)).
## Read mapping
@@ -121,57 +121,57 @@ After sequencing is performed you have a collection of sequencing reads for each
### TopHat, TopHat2, and HiSat
-[Tophat](http://bioinformatics.oxfordjournals.org/content/25/9/1105.abstract) was one of the first tools designed specifically to address this problem by identifying potential exons using reads that do map to the genome, generating possible splices between neighboring exons, and comparing reads that did not initially map to the genome agaisnt these *in silico* created junctions:
+[Tophat](https://bioinformatics.oxfordjournals.org/content/25/9/1105.abstract) was one of the first tools designed specifically to address this problem by identifying potential exons using reads that do map to the genome, generating possible splices between neighboring exons, and comparing reads that did not initially map to the genome against these *in silico* created junctions:
->[](http://bioinformatics.oxfordjournals.org/content/25/9/1105/F1.expansion.html)
+>[](https://bioinformatics.oxfordjournals.org/content/25/9/1105/F1.expansion.html)
>
>**TopHat and TopHat2: Mapping RNAseq regions to genome**
->In TopHat reads are mapped against the genome and are separated into two categories: (1) those that map, and (2) those that initially unmapped (IUM). "Piles" of reads representing potential exons are extended in search of potential donor/acceptor splice sites and potential splice junctions are reconstructed. IUMs are then mapped to these junctions. Image from [Trapnell:2009](http://bioinformatics.oxfordjournals.org/content/25/9/1105.full).
+>In TopHat reads are mapped against the genome and are separated into two categories: (1) those that map, and (2) those that are initially unmapped (IUM). "Piles" of reads representing potential exons are extended in search of potential donor/acceptor splice sites and potential splice junctions are reconstructed. IUMs are then mapped to these junctions. Image from [Trapnell:2009](https://bioinformatics.oxfordjournals.org/content/25/9/1105.full).
->[](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-4-r36)
+>[](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-4-r36)
>
>**TopHat has been subsequently improved with the development of TopHat2**
>Image from [Kim:2012](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-4-r36) summarizes steps involved in aligning of RNAseq reads with TopHat2
-To further optimize and speed up spliced read alignment Kim at al. [2015](http://www.nature.com/nmeth/journal/v12/n4/full/nmeth.3317.html) developed [HISAT](http://ccb.jhu.edu/software/hisat2/index.shtml). It uses a set of [FM-indices](https://en.wikipedia.org/wiki/FM-index) consisting one global genome-wide index and a collection of ~48,000 local overlapping 42 kb indices (~55,000 56 kb indices in HiSat2). This allows to find initial seed locations for potential read alignments in the genome using global index and to rapidly refine these alignments using a corresponding local index:
+To further optimize and speed up spliced read alignment Kim et al. [2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4655817/) developed [HISAT](https://ccb.jhu.edu/software/hisat2/index.shtml). It uses a set of [FM-indices](https://en.wikipedia.org/wiki/FM-index) consisting of one global genome-wide index and a collection of ~48,000 local overlapping 42 kb indices (~55,000 56 kb indices in HiSat2). This allows finding initial seed locations for potential read alignments in the genome using the global index and rapidly refining these alignments using a corresponding local index:
->
+>
>
>**Hierarchical Graph FM index in HiSat/HiSat2**
->A part of the read (blue arrow) is first mapped to the genome using the global FM index. The HiSat then tries to extend the alignment directly utilizing the genome sequence (violet arrow). In (**a**) it succeeds and this read aligned as it completely resides within an exon. In (**b**) the extension hits a mismatch. Now HiSat takes advantage of the local FM index overlapping this location to find the appropriate matting for the remainder of this read (green arrow). The (**c**) shows a combination these two strategies: the beginning of the read is mapped using global FM index (blue arrow), extended until it reaches the end of the exon (violet arrow), mapped using local FM index (green arrow) and extended again (violet arrow). Image from [Kim:2015](http://www.nature.com/nmeth/journal/v12/n4/full/nmeth.3317.html)
+>A part of the read (blue arrow) is first mapped to the genome using the global FM index. HiSat then tries to extend the alignment directly utilizing the genome sequence (violet arrow). In (**a**) it succeeds and the read is aligned, as it completely resides within an exon. In (**b**) the extension hits a mismatch. Now HiSat takes advantage of the local FM index overlapping this location to find the appropriate matching for the remainder of this read (green arrow). The (**c**) shows a combination of these two strategies: the beginning of the read is mapped using the global FM index (blue arrow), extended until it reaches the end of the exon (violet arrow), mapped using the local FM index (green arrow) and extended again (violet arrow). Image from [Kim:2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4655817/)
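The backward search that FM-indices support can be illustrated with a toy Burrows-Wheeler transform (a minimal sketch with a made-up sequence and naive quadratic construction; HISAT's hierarchical index is far more elaborate):

```python
# Toy FM-index exact matching via BWT backward search.
# Naive construction -- fine only for tiny illustrative sequences.

def bwt(text):
    text += "$"  # unique terminator, lexicographically smallest
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def fm_index(text):
    last = bwt(text)
    first = sorted(last)
    # C[c]: count of characters in the text that sort before c
    C = {c: first.index(c) for c in set(last)}
    # occ[c][i]: occurrences of c in last[:i]
    occ = {c: [0] for c in set(last)}
    for ch in last:
        for c in occ:
            occ[c].append(occ[c][-1] + (ch == c))
    return C, occ

def count_matches(pattern, C, occ):
    """Backward search: number of occurrences of pattern in the text."""
    lo, hi = 0, len(occ[next(iter(occ))]) - 1
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

C, occ = fm_index("ACGTACGTAC")
print(count_matches("ACGT", C, occ))  # 2 occurrences
```

HiSat's trick is to run this kind of search once against a global index to seed an alignment and then against a small local index to resolve the rest of the read.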
### STAR mapper
-[STAR aligner](https://github.com/alexdobin/STAR) is a fast alternative for mapping RNAseq reads against genome utilizing uncompressed [suffix array](https://en.wikipedia.org/wiki/Suffix_array). It operates in [two stages](http://bioinformatics.oxfordjournals.org/content/early/2012/10/25/bioinformatics.bts635.abstract). In the first stage it performs seed search:
+[STAR aligner](https://github.com/alexdobin/STAR) is a fast alternative for mapping RNAseq reads against genome utilizing uncompressed [suffix array](https://en.wikipedia.org/wiki/Suffix_array). It operates in [two stages](https://academic.oup.com/bioinformatics/article/29/1/15/272537/STAR-ultrafast-universal-RNA-seq-aligner). In the first stage it performs seed search:
->
+>
>
>**STAR's seed search**
->Here a read is split between two consecutive exons. STAR starts to look for a *maximum mappable prefix* (MMP) from the beginning of the read until it can no longer match continuously. After this point it start to MMP for the unmatched portion of the read (**a**). In the case of mismatches (**b**) and unalignable regions (**c**) MMPs serve as anchors from which to extend alignments. Image from [Dobin:2013](http://bioinformatics.oxfordjournals.org/content/early/2012/10/25/bioinformatics.bts635.full.pdf+html).
+>Here a read is split between two consecutive exons. STAR starts to look for a *maximum mappable prefix* (MMP) from the beginning of the read until it can no longer match continuously. After this point it starts a new MMP search for the unmatched portion of the read (**a**). In the case of mismatches (**b**) and unalignable regions (**c**) MMPs serve as anchors from which to extend alignments. Image from [Dobin:2013](https://bioinformatics.oxfordjournals.org/content/early/2012/10/25/bioinformatics.bts635.full.pdf+html).
At the second stage STAR stitches MMPs to generate read-level alignments that (contrary to MMPs) can contain mismatches and indels. A scoring scheme is used to evaluate and prioritize stitching combinations and to evaluate reads that map to multiple locations. STAR is extremely fast but requires a substantial amount of RAM to run efficiently.
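The MMP seed search described above can be sketched as follows (a toy substring scan with hypothetical sequences; STAR itself performs this step with an uncompressed suffix array):

```python
# Toy maximal-mappable-prefix (MMP) search in the spirit of STAR's seed stage.
# Illustrative only; not STAR's actual implementation.

def mmp(read, genome):
    """Length of the longest read prefix occurring anywhere in the genome."""
    best = 0
    for length in range(1, len(read) + 1):
        if read[:length] in genome:
            best = length
        else:
            break
    return best

def seed_read(read, genome):
    """Split the read into consecutive MMP seeds, as in STAR's first stage."""
    seeds, pos = [], 0
    while pos < len(read):
        length = mmp(read[pos:], genome)
        if length == 0:       # unalignable base; skip it
            pos += 1
            continue
        seeds.append((pos, read[pos:pos + length]))
        pos += length
    return seeds

# A read spanning a "junction": first half from one exon, second from another.
genome = "AAAACCCCGGGGTTTT"
read = "CCCCTTTT"
print(seed_read(read, genome))  # [(0, 'CCCC'), (4, 'TTTT')]
```

The two seeds anchor the read on either side of the junction; STAR's second stage then stitches such seeds into a full spliced alignment.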
## Transcript reconstruction
-The previous step - mapping - assigns RNAseq reads to genomic locations and identifies splice junctions from reads that originate from different exons. At transcript reconstruction step this information is taken further in attempt to build transcript models. There is a number of tools for performing this task. A benchmarking paper by [Hayer:2015](http://bioinformatics.oxfordjournals.org/content/early/2015/09/03/bioinformatics.btv488.full.pdf+html) attempted to compare performance of existing approaches with one of the outcomes shown below:
+The previous step - mapping - assigns RNAseq reads to genomic locations and identifies splice junctions from reads that originate from different exons. At the transcript reconstruction step this information is taken further in an attempt to build transcript models. There are a number of tools for performing this task. A benchmarking paper by [Hayer:2015](https://bioinformatics.oxfordjournals.org/content/early/2015/09/03/bioinformatics.btv488.full.pdf+html) attempted to compare the performance of existing approaches with one of the outcomes shown below:
->[](http://bioinformatics.oxfordjournals.org/content/early/2015/09/08/bioinformatics.btv488/F5.large.jpg)
+>[](https://bioinformatics.oxfordjournals.org/content/early/2015/09/08/bioinformatics.btv488/F5.large.jpg)
>
>**Comparison of transcript reconstruction approaches**
->Here *recall* (the number of correctly constructed forms divided by the total number of real forms) versus *precision* (true positives divided by the sum of true positives and false positives) are plotted for seven transcript assemblers tested on two simulated datasets: *EnsemblPerfect* and *EnsemblRealistic*. The shaded region is indicating suboptimal performance (i.e., the white, unshaded region is "good"). The figure is from [Hayer:2015](http://bioinformatics.oxfordjournals.org/content/early/2015/09/03/bioinformatics.btv488.full.pdf+html).
+>Here *recall* (the number of correctly constructed forms divided by the total number of real forms) versus *precision* (true positives divided by the sum of true positives and false positives) are plotted for seven transcript assemblers tested on two simulated datasets: *EnsemblPerfect* and *EnsemblRealistic*. The shaded region indicates suboptimal performance (i.e., the white, unshaded region is "good"). The figure is from [Hayer:2015](https://bioinformatics.oxfordjournals.org/content/early/2015/09/03/bioinformatics.btv488.full.pdf+html).
-Based on these results [Cufflinks](http://cole-trapnell-lab.github.io/cufflinks/) and [StringTie](https://ccb.jhu.edu/software/stringtie/) have satisfactory performence. The following discussion is based on inner workings of StringTie.
+Based on these results [Cufflinks](https://cole-trapnell-lab.github.io/cufflinks/) and [StringTie](https://ccb.jhu.edu/software/stringtie/) have satisfactory performance. The following discussion is based on the inner workings of StringTie.
### Transcriptome assembly with StringTie
[StringTie](https://ccb.jhu.edu/software/stringtie/) assembles transcripts from spliced read alignments produced by tools such as STAR, TopHat, or HISAT and simultaneously estimates their abundances using counts of reads assigned to each transcript. The following image illustrates details of the StringTie workflow:
->
+>
>
>**StringTie workflow**
->Image from [Pertea:2015](http://www.nature.com/nbt/journal/v33/n3/full/nbt.3122.html)
+>Image from [Pertea:2015](https://www.nature.com/nbt/journal/v33/n3/full/nbt.3122.html)
In essence StringTie builds an alternative splice graph from overlapping reads in a given locus. In such a graph nodes correspond to exons (or, rather, contiguous regions of genome covered by reads; colored regions on the figure above), while edges are represented by reads connecting these exons. Next, it identifies a path within the splice graph that has the highest weight (largest number of reads on edges). Such a path corresponds to an assembled transcript at this iteration of the algorithm. Because the edge weight is equal to the number of reads, StringTie estimates the coverage level for this transcript (see below), which can be used to estimate the transcript's abundance. Reads that are associated with the transcript that was just assembled are then removed and the graph is updated to perform the next iteration of the algorithm.
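The "highest-weight path" step can be illustrated on a toy splice graph (hypothetical exon names and read counts; StringTie's real procedure also removes used reads and updates the graph between iterations):

```python
# Heaviest-path sketch over a toy splice graph: nodes are exons, edge weights
# are supporting read counts. Illustrative only, not StringTie's algorithm.

# Hypothetical splice graph: exon -> {next_exon: supporting_read_count}
graph = {
    "e1": {"e2": 10, "e3": 4},
    "e2": {"e4": 9},
    "e3": {"e4": 3},
    "e4": {},
}

def heaviest_path(node):
    """Return (total_weight, path) of the heaviest path starting at node."""
    best = (0, [node])
    for nxt, w in graph[node].items():
        sub_w, sub_path = heaviest_path(nxt)
        if w + sub_w > best[0]:
            best = (w + sub_w, [node] + sub_path)
    return best

print(heaviest_path("e1"))  # (19, ['e1', 'e2', 'e4'])
```

The winning path e1-e2-e4 corresponds to the transcript assembled in this iteration, and its edge weights feed the coverage estimate.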
@@ -181,38 +181,38 @@ Transcriptome quantification attempts to estimate expression levels of individua
### Assigning reads to transcripts
-To associate reads with transcripts they (the reads) need to be aligned to the transcriptome. Tools like Cufflinks and StringTie reconstruct transcripts from spliced read alignments generated by other programs (TopHat, HISAT, STAR), so they already have the information about which reads belong to each reconstructed transcript. Other tools such as [Sailfish](http://www.cs.cmu.edu/~ckingsf/software/sailfish/), [Kallisto](http://pachterlab.github.io/kallisto/), and [Salmon](http://combine-lab.github.io/salmon/) perform *lightweight* alignment of RNAseq reads against existing transcriptome sequences. The goal of lightweight alignment is to quickly distribute the reads across transcripts they likely originate from without worrying too much about producing high quality alignments. The upside of this is that the entire procedure can be performed very quickly. The downside is that these tools require high quality transcriptome as input, which is not a problem if you work with humans or mice, but is a problem if you are studying Hyacinth macaw or any other brilliantly colored creatures.
+To associate reads with transcripts they (the reads) need to be aligned to the transcriptome. Tools like Cufflinks and StringTie reconstruct transcripts from spliced read alignments generated by other programs (TopHat, HISAT, STAR), so they already have the information about which reads belong to each reconstructed transcript. Other tools such as [Sailfish](https://www.cs.cmu.edu/~ckingsf/software/sailfish/), [Kallisto](https://pachterlab.github.io/kallisto/), and [Salmon](https://combine-lab.github.io/salmon/) perform *lightweight* alignment of RNAseq reads against existing transcriptome sequences. The goal of lightweight alignment is to quickly distribute the reads across transcripts they likely originate from without worrying too much about producing high quality alignments. The upside of this is that the entire procedure can be performed very quickly. The downside is that these tools require a high quality transcriptome as input, which is not a problem if you work with humans or mice, but is a problem if you are studying Hyacinth macaw or any other brilliantly colored creatures.
#### Lightweight alignment
-[Sailfish](http://www.cs.cmu.edu/~ckingsf/software/sailfish/) has been initially designed to utilize [*k*-mer](https://en.wikipedia.org/wiki/K-mer) matching for finding association between reads and corresponding transcripts:
+[Sailfish](https://www.cs.cmu.edu/~ckingsf/software/sailfish/) was initially designed to utilize [*k*-mer](https://en.wikipedia.org/wiki/K-mer) matching for finding associations between reads and corresponding transcripts:
->
+>
>
>**Assigning reads to transcripts: Sailfish**
->Sailfish indexes input transcriptome for a fixed *k*-mer length and compares *k*-mers derived from RNAseq reads against this index. Image from [Patro:2014](http://www.nature.com/nbt/journal/v32/n5/full/nbt.2862.html)
+>Sailfish indexes input transcriptome for a fixed *k*-mer length and compares *k*-mers derived from RNAseq reads against this index. Image from [Patro:2014](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4077321/)
-The current version of Sailfish uses [quasi-alignment](http://biorxiv.org/content/biorxiv/early/2015/10/22/029652.full.pdf) to extend exact matches found with *k*-mers:
+The current version of Sailfish uses [quasi-alignment](https://biorxiv.org/content/biorxiv/early/2015/10/22/029652.full.pdf) to extend exact matches found with *k*-mers:
->
+>
>
>**Quasi-alignment of reads in Sailfish**
->In Sailfish version [0.7.0](https://github.com/kingsfordgroup/sailfish/releases/tag/v0.7.0) and up transcriptome is concatenated into a single sequence using `$` separators from which a [suffix array](https://en.wikipedia.org/wiki/Suffix_array) and a [hash table](https://en.wikipedia.org/wiki/Hash_table) are constructed. A *k*-mer from an RNAseq read (green) is looked up in the hash table, which immediately gives its position in the suffix array allowing to extend the march as described in the legend and the [paper](http://biorxiv.org/content/biorxiv/early/2015/10/22/029652.full.pdf). Image from [Srivastava:2015](http://biorxiv.org/content/biorxiv/early/2015/10/22/029652.full.pdf)
+>In Sailfish version [0.7.0](https://github.com/kingsfordgroup/sailfish/releases/tag/v0.7.0) and up the transcriptome is concatenated into a single sequence using `$` separators from which a [suffix array](https://en.wikipedia.org/wiki/Suffix_array) and a [hash table](https://en.wikipedia.org/wiki/Hash_table) are constructed. A *k*-mer from an RNAseq read (green) is looked up in the hash table, which immediately gives its position in the suffix array allowing to extend the match as described in the legend and the [paper](https://biorxiv.org/content/biorxiv/early/2015/10/22/029652.full.pdf). Image from [Srivastava:2015](https://biorxiv.org/content/biorxiv/early/2015/10/22/029652.full.pdf)
-[Kallisto](http://pachterlab.github.io/kallisto/) also utilizes *k*-mer matching but uses a different data structure. It constructs a [De Bruijn graph](https://en.wikipedia.org/wiki/De_Bruijn_graph) from transcriptome input (pane **b** of the figure below). This graph is different from De Bruijn graphs used for genome assembly in that its nodes are *k*-mers and transcripts correspond to paths through the graph. To accommodate multiple transcripts that can lay along the same path (or sub-path) the paths are "colored" with each transcript given a distinct "color" (in genome assembly the graph is built from the reads and nodes usually correspond to overlaps between *k*-mers forming incoming and outgoing edges). Non-branching sections of the graph that have identical coloring are "glued" into contigs. Finally a [hash table](https://en.wikipedia.org/wiki/Hash_table) is built that stores the position of each transcriptome *k*-mer within the graph:
+[Kallisto](https://pachterlab.github.io/kallisto/) also utilizes *k*-mer matching but uses a different data structure. It constructs a [De Bruijn graph](https://en.wikipedia.org/wiki/De_Bruijn_graph) from transcriptome input (panel **b** of the figure below). This graph is different from De Bruijn graphs used for genome assembly in that its nodes are *k*-mers and transcripts correspond to paths through the graph. To accommodate multiple transcripts that can lie along the same path (or sub-path) the paths are "colored" with each transcript given a distinct "color" (in genome assembly the graph is built from the reads and nodes usually correspond to overlaps between *k*-mers forming incoming and outgoing edges). Non-branching sections of the graph that have identical coloring are "glued" into contigs. Finally a [hash table](https://en.wikipedia.org/wiki/Hash_table) is built that stores the position of each transcriptome *k*-mer within the graph:
->
+>
>
>**Assigning reads to transcripts: Kallisto**
->Here a black read is being associated with a set consisting of red, blue, and green transcripts (**a**). First, a graph is built from transcriptome (**b**). Next, by finding common *k*-mers between the read and the graph the read is "threaded" along a path (**c** and **d**). The colors along that path would indicate which transcripts it is likely derived from. Specifically, this is done by taking intersection of "colors" (**c**). It this case the read is assigned to two transcripts: red and blue. Image from [Bray:2015](http://arxiv.org/pdf/1505.02710v2.pdf)
+>Here a black read is being associated with a set consisting of red, blue, and green transcripts (**a**). First, a graph is built from the transcriptome (**b**). Next, by finding common *k*-mers between the read and the graph the read is "threaded" along a path (**c** and **d**). The colors along that path indicate which transcripts it is likely derived from. Specifically, this is done by taking the intersection of "colors" (**c**). In this case the read is assigned to two transcripts: red and blue. Image from [Bray:2015](https://arxiv.org/pdf/1505.02710v2.pdf)
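The color-intersection idea can be sketched with a plain *k*-mer hash instead of a De Bruijn graph (toy transcripts and *k*; this illustrates the set-intersection step of pseudoalignment, not Kallisto's actual data structure):

```python
# Pseudoalignment sketch: intersect the transcript sets ("colors") of a
# read's k-mers. Toy hash-based index with hypothetical transcripts.

K = 5

def build_index(transcripts):
    """Map each k-mer to the set of transcripts containing it."""
    index = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - K + 1):
            index.setdefault(seq[i:i + K], set()).add(name)
    return index

def pseudoalign(read, index):
    """Intersect the color sets of all of the read's k-mers found in the index."""
    colors = None
    for i in range(len(read) - K + 1):
        hit = index.get(read[i:i + K])
        if hit is None:
            continue
        colors = set(hit) if colors is None else colors & hit
    return colors or set()

transcripts = {
    "red":  "ACGTACGTGG",
    "blue": "ACGTACGTCC",
}
index = build_index(transcripts)
print(sorted(pseudoalign("ACGTACGT", index)))   # ['blue', 'red']
print(sorted(pseudoalign("TACGTGGAA", index)))  # ['red']
```

The first read's *k*-mers are shared by both transcripts, so it is assigned to both; the second read contains *k*-mers unique to the red transcript, so the intersection narrows to red alone.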
[Salmon](https://combine-lab.github.io/salmon/about/) does not use the *k*-mer matching approach. Instead it creates a [bwa](https://github.com/lh3/bwa)-like [FM-index](https://en.wikipedia.org/wiki/FM-index) and uses it to find chains of *Maximal Exact Matches* (MEMs) and *Super Maximal Exact Matches* (SMEMs) between a read and the transcriptome.
-[Patro:2015](http://biorxiv.org/content/biorxiv/early/2015/06/27/021592.full.pdf) define a MEM as "*a substring that is shared by the query (read) and reference (transcript) that cannot be extended in either direction without introducing a mismatch*". Similraly, a SMEM is defined as a "*MEM that is not contained within any other MEM on the query.*" One of the advantages of utilizing the FM-index is that a new index does not need to re-generated for a search with different set of parameters. In the case of Sailfish and Kallisto an index is dependent on *k*-mer length and has to be recomputed every time the *k* is changed. The overall schematics of Salmon operation is as follows:
+[Patro:2015](https://biorxiv.org/content/biorxiv/early/2015/06/27/021592.full.pdf) define a MEM as "*a substring that is shared by the query (read) and reference (transcript) that cannot be extended in either direction without introducing a mismatch*". Similarly, a SMEM is defined as a "*MEM that is not contained within any other MEM on the query.*" One of the advantages of utilizing the FM-index is that a new index does not need to be re-generated for a search with a different set of parameters. In the case of Sailfish and Kallisto an index is dependent on *k*-mer length and has to be recomputed every time the *k* is changed. The overall schematic of Salmon's operation is as follows:
->
+>
>
>**Assigning reads to transcripts: Salmon**
->Image from [Patro:2015](http://biorxiv.org/content/biorxiv/early/2015/06/27/021592.full.pdf)
+>Image from [Patro:2015](https://biorxiv.org/content/biorxiv/early/2015/06/27/021592.full.pdf)
### Estimating transcript levels
@@ -222,10 +222,10 @@ Once reads are apportioned across individual transcripts they can be quantified.
StringTie, which performs assembly and quantification simultaneously, converts the splice graph into a flow network for which it solves [the maximum flow problem](https://en.wikipedia.org/wiki/Maximum_flow_problem). The maximum flow in such a network represents the expression level for a given transcript:
->
+>
>
>**StringTie flow network**
->Here each exon node from the splice graph is split into *in* and *out* nodes connected with an edge weighted by the number of reads corresponding to that exon. For example, the first exon is covered by seven reads and so the edge between 1-in and 1-out has a weight of 7. Expression level would correspond to the maximum flow through a path representing a given transcript. Image from [Pertea:2015](http://www.nature.com/nbt/journal/v33/n3/full/nbt.3122.html)
+>Here each exon node from the splice graph is split into *in* and *out* nodes connected with an edge weighted by the number of reads corresponding to that exon. For example, the first exon is covered by seven reads and so the edge between 1-in and 1-out has a weight of 7. Expression level would correspond to the maximum flow through a path representing a given transcript. Image from [Pertea:2015](https://www.nature.com/nbt/journal/v33/n3/full/nbt.3122.html)
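The maximum-flow computation itself can be illustrated with a small Edmonds-Karp sketch on a two-exon in/out network (capacities are made-up read counts, not values from the paper):

```python
# Toy max-flow (Edmonds-Karp) on a small exon in/out network like the one
# described above. Illustrative only, not StringTie's implementation.
from collections import deque

def max_flow(cap, s, t):
    """cap: dict-of-dicts of edge capacities. Returns the maximum s->t flow."""
    flow = 0
    # Residual capacities, including zero-capacity reverse edges
    res = {u: dict(v) for u, v in cap.items()}
    for u, nbrs in cap.items():
        for v in nbrs:
            res.setdefault(v, {}).setdefault(u, 0)
    while True:
        # BFS for a shortest augmenting path
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        # Find the bottleneck capacity and update residuals along the path
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck

# source -> 1in -> 1out -> 2in -> 2out -> sink; exon read counts as capacities
cap = {
    "s": {"1in": 7},
    "1in": {"1out": 7},
    "2in": {"2out": 5},
    "1out": {"2in": 5},
    "2out": {"t": 5},
    "t": {},
}
print(max_flow(cap, "s", "t"))  # 5
```

The flow is limited by the least-covered exon on the path, which is why the maximum flow serves as a coverage (and hence expression) estimate for the transcript.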
#### Expectation Maximization
@@ -245,10 +245,10 @@ The Expectation/Maximization framework (EM) is utilized in a number of tools suc
During the next expectation stage reads are re-apportioned across transcripts and the procedure is repeated until convergence:
->
+>
>
>**Expectation Maximization (EM)**
->Image from [Pacher:2011](http://arxiv.org/pdf/1104.3889v2.pdf)
+>Image from [Pachter:2011](https://arxiv.org/pdf/1104.3889v2.pdf)
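The expectation/maximization loop described above can be sketched for the simplest case of two transcripts sharing some multi-mapping reads (toy counts; real implementations also model fragment lengths, positional bias, and more transcripts):

```python
# Minimal EM sketch for apportioning multi-mapping reads between two
# transcripts. All counts are hypothetical.

def em(unique_a, unique_b, shared, iters=50):
    """unique_a/unique_b: reads unique to each transcript; shared: ambiguous reads."""
    theta_a = theta_b = 0.5  # start with equal abundances
    for _ in range(iters):
        # E-step: split shared reads proportionally to current abundances
        a = unique_a + shared * theta_a / (theta_a + theta_b)
        b = unique_b + shared * theta_b / (theta_a + theta_b)
        # M-step: re-estimate abundances from the apportioned counts
        total = a + b
        theta_a, theta_b = a / total, b / total
    return theta_a, theta_b

print(em(unique_a=30, unique_b=10, shared=20))  # converges to ~(0.75, 0.25)
```

The fixed point makes intuitive sense: transcript A holds 30 of the 40 unique reads, so at convergence it also claims three quarters of the shared reads.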
### Understanding quantification metrics
@@ -259,7 +259,7 @@ As we've seen above quantification for a transcript is estimated using the numbe
In their [tutorial](http://chagall.med.cornell.edu/RNASEQcourse/) Dündar et al. have compiled a table summarizing various metrics. Below is a description of normalization techniques for within-sample comparisons (between-sample comparisons can be found in the next section on differential expression analysis):
->
+>
>
>**RNAseq normalization metrics: Within sample comparisons**
>Table from Dündar et al. [2015](http://chagall.med.cornell.edu/RNASEQcourse/)
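The within-sample metrics from the table can be computed directly from counts and feature lengths. A minimal sketch of the standard RPKM and TPM formulas with toy numbers:

```python
# CPM/RPKM/TPM-style within-sample normalization with standard formulas.
# Counts and lengths below are made up for illustration.

def rpkm(counts, lengths_bp):
    """Reads per kilobase of feature per million mapped reads."""
    total = sum(counts)
    return [c / (l / 1e3) / (total / 1e6) for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]

counts = [100, 200, 700]      # reads per gene
lengths = [1000, 2000, 1000]  # gene lengths in bp
print([round(x, 1) for x in rpkm(counts, lengths)])
print([round(x, 1) for x in tpm(counts, lengths)])  # TPMs sum to 1e6 (up to rounding)
```

Note the difference in the order of operations: TPM length-normalizes before scaling, which is why TPM values always sum to one million within a sample while RPKM values do not.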
@@ -275,12 +275,12 @@ The goal of differential expression analysis (DE) is to find gene (DGE) or trans
* Estimate the *magnitude* of expression differences;
* Estimate the *significance* of expression differences.
-For this expression is estimated from read counts and attempts are made to correct for variability in measurements using replicates that are absolutely essential accurate results (see below). We begin our short discussion on DE by reproducing a figure from [Trapnell:2013](http://www.nature.com/nbt/journal/v31/n1/abs/nbt.2450.html) highlighting some of the challenges associated with judging expression differences from read counts:
+For this expression is estimated from read counts and attempts are made to correct for variability in measurements using replicates, which are absolutely essential for accurate results (see below). We begin our short discussion on DE by reproducing a figure from [Trapnell:2013](https://www.nature.com/nbt/journal/v31/n1/abs/nbt.2450.html) highlighting some of the challenges associated with judging expression differences from read counts:
->
+>
>
>**Differential expression: Read counts and Expression levels**
->**Change in fragment count for a gene does not necessarily equal a change in expression**. (**a**) Simple read-counting schemes sum the fragments incident on a gene’s exons. The exon-union model counts reads falling on any of a gene’s exons, whereas the exon-intersection model counts only reads on constitutive exons. (**b**) Both of the exon-union and exon intersection counting schemes may incorrectly estimate a change in expression in genes with multiple isoforms. The true expression is estimated by the sum of the length-normalized isoform read counts. The discrepancy between a change in the union or intersection count and a change in gene expression is driven by a change in the abundance of the isoforms with respect to one another. In the top row, the gene generates the same number of reads in conditions A and B, but in condition B, all of the reads come from the shorter of the two isoforms, and thus the true expression for the gene is higher in condition B. The intersection count scheme underestimates the true change in gene expression, and the union scheme fails to detect the change entirely. In the middle row, the intersection count fails to detect a change driven by a shift in the dominant isoform for the gene. The union scheme detects a shift in the wrong direction. In the bottom row, the gene’s expression is constant, but the isoforms undergo a complete switch between conditions A and B. Both simplified counting schemes register a change in count that does not reflect a change in gene expression. Figure from [Trapnell:2013] (http://www.nature.com/nbt/journal/v31/n1/abs/nbt.2450.html)
+>**Change in fragment count for a gene does not necessarily equal a change in expression**. (**a**) Simple read-counting schemes sum the fragments incident on a gene’s exons. The exon-union model counts reads falling on any of a gene’s exons, whereas the exon-intersection model counts only reads on constitutive exons. (**b**) Both of the exon-union and exon intersection counting schemes may incorrectly estimate a change in expression in genes with multiple isoforms. The true expression is estimated by the sum of the length-normalized isoform read counts. The discrepancy between a change in the union or intersection count and a change in gene expression is driven by a change in the abundance of the isoforms with respect to one another. In the top row, the gene generates the same number of reads in conditions A and B, but in condition B, all of the reads come from the shorter of the two isoforms, and thus the true expression for the gene is higher in condition B. The intersection count scheme underestimates the true change in gene expression, and the union scheme fails to detect the change entirely. In the middle row, the intersection count fails to detect a change driven by a shift in the dominant isoform for the gene. The union scheme detects a shift in the wrong direction. In the bottom row, the gene’s expression is constant, but the isoforms undergo a complete switch between conditions A and B. Both simplified counting schemes register a change in count that does not reflect a change in gene expression. Figure from [Trapnell:2013](https://www.nature.com/nbt/journal/v31/n1/abs/nbt.2450.html)
The following discussion of DGE logic is reproduced from [Dündar:2015](http://chagall.med.cornell.edu/RNASEQcourse/).
@@ -297,13 +297,13 @@ gene **i** is quite small.
>In contrast to the Poisson distribution, we now need to estimate two parameters from the read counts: the mean as well as the dispersion. The precision of these estimates strongly depends on the number (and variation) of replicates – the more replicates, the better the grasp on the underlying mean expression values of unchanged genes and the variance that is due to biological variation rather than the experimental treatment. For most RNA-seq experiments, only two to three replicates are available, which is not enough for reliable mean and variance estimates. Some tools therefore compensate for the lack of replication by borrowing information across genes with similar expression values and shrink a given gene’s variance towards the regressed values. These fitted values of the mean and dispersion are then used instead of the raw estimates
to test for differential gene expression.
>
->The best performing tools tend to be [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html), [DESeq/DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html), and [limma-voom](https://www.bioconductor.org/packages/release/bioc/html/limma.html) (see Rapaport et al. ([2013](http://link.springer.com/article/10.1186/gb-2013-14-9-r95)); Soneson and Delorenzi ([2013](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-91)); Schurch et al. ([2015](http://arxiv.org/abs/1505.02017)) for reviews of DGE tools). DESeq and limma-voom tend to be more conservative than edgeR (better control of false positives), but edgeR is recommended for experiments with fewer than 12 replicates (Schurch et al., [2015](http://arxiv.org/abs/1505.02017)).
+>The best performing tools tend to be [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html), [DESeq/DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html), and [limma-voom](https://www.bioconductor.org/packages/release/bioc/html/limma.html) (see Rapaport et al. ([2013](https://link.springer.com/article/10.1186/gb-2013-14-9-r95)); Soneson and Delorenzi ([2013](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-91)); Schurch et al. ([2015](http://arxiv.org/abs/1505.02017)) for reviews of DGE tools). DESeq and limma-voom tend to be more conservative than edgeR (better control of false positives), but edgeR is recommended for experiments with fewer than 12 replicates (Schurch et al., [2015](http://arxiv.org/abs/1505.02017)).
## Let's try it
### The data
-In this example we will use a downsampled version of simulated *Drosophila melanogaster* RNA-seq data used by Trapnell et al. [2012](http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html). These include two conditions (**C1** and **C2**), each containing three replicates (**R1**, **R2**, and **R3**) sequenced as a paired end library. Thus in total there are 12 fastq datasets.
+In this example we will use a downsampled version of simulated *Drosophila melanogaster* RNA-seq data used by Trapnell et al. [2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3334321/). These include two conditions (**C1** and **C2**), each containing three replicates (**R1**, **R2**, and **R3**) sequenced as a paired-end library. Thus in total there are 12 fastq datasets.
Here is what to do to load the data:
@@ -311,11 +311,11 @@ Here is what to do to load the data:
>Go to the [data library](https://usegalaxy.org/library/list#folders/Ff4ce53393dae30ee) and select all fastq files. Then click the `to History` button:
>
->
+>
>
>The datasets will appear in your history:
>
->
+>
>
>Twelve datasets make a lot of clicking necessary. To avoid this annoyance we will combine them into two collections - **c1** and **c2** as shown in the video below. Also, see this [tutorial](collections.html) for yet another explanation of dataset collections.
>
@@ -334,29 +334,29 @@ We will map the reads with TopHat2. Select **TopHat** from **NGS: RNA Analysis**
>* **TopHat settings to use** = `Full parameter list` This is done to be able to specify the strandedness of the library.
>* **Library Type** = `FR First Strand`
>
->
+>
>
>The same procedure is then repeated for collection **c2**. In the end it generates a lot of datasets in the history, resulting in something resembling the image below. TopHat produces five types of output, and because we started with dataset collections, every one of the green boxes shown below is actually a collection of outputs for **c1** and **c2**, respectively.
>
->
+>
Let's now take a look at some of the alignments. We will use IGV for this purpose.
>First, let's drill down to actual alignments produced by TopHat. For example, in the figure shown above simply click on **TopHat on collection 14: accepted_hits** and you will see a list of datasets corresponding to alignments of reads derived from each condition:
>
->
+>
>
>Now, click on **c2-r1x** and the following will appear:
>
->
+>
>
>Finally, use the **D. melanogaster** link (highlighted above) and follow the on-screen instructions. By focusing IGV on genomic position `chrX:11,897,111-11,920,446` you will be able to see spliced alignments produced by TopHat:
>
->
+>
>
->and [sashimi plots](http://software.broadinstitute.org/software/igv/Sashimi) highlighting potential splice junctions:
+>and [sashimi plots](https://software.broadinstitute.org/software/igv/Sashimi) highlighting potential splice junctions:
>
->
+>
### Performing differential expression analysis
@@ -364,21 +364,21 @@ Using mapped reads produced by TopHat we will perform analysis of differential g
#### Gene-based read counting with HTseq-count
-[`HTSeq-count`](http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) is one of the most popular tools for gene quantification. `HTseq-count` gives you multiple choices on how to handle read mapping to multiple locations, reads overlapping introns, or reads that overlap more than one genomic feature:
+[`HTSeq-count`](https://www-huber.embl.de/users/anders/HTSeq/doc/count.html) is one of the most popular tools for gene quantification. `HTseq-count` gives you multiple choices on how to handle reads mapping to multiple locations, reads overlapping introns, or reads that overlap more than one genomic feature:
->[](http://www-huber.embl.de/users/anders/HTSeq/doc/count.html)
+>[](https://www-huber.embl.de/users/anders/HTSeq/doc/count.html)
>
>**`HTseq-count` read/feature overlap modes**
->The `htseq-count` script of the HTSeq suite offers three different modes to handle details of read–feature overlaps that are depicted here. The default of featureCounts is the behavior of the union option. Image is from [HTseq documentation](http://www-huber.embl.de/users/anders/HTSeq/doc/count.html); Caption by [Dündar:2015](http://chagall.med.cornell.edu/RNASEQcourse/)
+>The `htseq-count` script of the HTSeq suite offers three different modes to handle details of read–feature overlaps that are depicted here. The default of featureCounts is the behavior of the union option. Image is from [HTseq documentation](https://www-huber.embl.de/users/anders/HTSeq/doc/count.html); Caption by [Dündar:2015](http://chagall.med.cornell.edu/RNASEQcourse/)
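The three overlap modes can be illustrated with a toy sketch (the `assign_feature` helper, the position-set representation, and the coordinates are invented for illustration; the real `htseq-count` works on BAM and GTF records):

```python
def assign_feature(read_positions, features, mode="union"):
    """Toy illustration of the htseq-count overlap modes.

    read_positions: iterable of genomic positions covered by a read.
    features: dict mapping feature name -> set of positions it covers.
    Returns the assigned feature name, or None (no_feature/ambiguous).
    """
    # For each base of the read, collect the features overlapping it.
    per_base = []
    for pos in read_positions:
        hits = {name for name, span in features.items() if pos in span}
        per_base.append(hits)

    if mode == "union":
        # Union mode: combine feature sets over all bases of the read.
        combined = set().union(*per_base)
    elif mode == "intersection-strict":
        # Strict intersection: every base must fall inside the same feature.
        combined = set.intersection(*per_base)
    elif mode == "intersection-nonempty":
        # Intersect only the per-base sets that are non-empty.
        nonempty = [h for h in per_base if h] or [set()]
        combined = set.intersection(*nonempty)
    else:
        raise ValueError(mode)

    if len(combined) == 1:
        return combined.pop()   # counted for this feature
    return None                 # 'no_feature' or 'ambiguous'

# A read that half-overlaps gene_A and extends past its last exon:
features = {"gene_A": set(range(100, 200))}
read = range(150, 250)
print(assign_feature(read, features, "union"))                # gene_A
print(assign_feature(read, features, "intersection-strict"))  # None
```

The example shows why the mode matters: union still counts a read that hangs over a feature boundary, while strict intersection discards it.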
Before we can use `HTseq-count` we need to download gene annotations for version `dm3` of the *Drosophila melanogaster* genome. We use version `dm3` because it is the same genome we have mapped reads against during the TopHat step.
#### Getting *Drosophila melanogaster* gene annotation from UCSC
>Select **UCSC Main** from the **Get Data** section of the menu. Within the UCSC Genome Browser interface set parameters as shown below. In particular, make sure that **assembly** is set to `dm3` and **output format** is set to `GTF`. Click **get output**.
->[](../images/ucsc_dm3.png)
+>[](../../images/ucsc_dm3.png)
>
->This [GTF](http://www.ensembl.org/info/website/upload/gff.html) dataset will be used one of the input for HTseq-count.
+>This [GTF](https://www.ensembl.org/info/website/upload/gff.html) dataset will be used as one of the inputs for HTseq-count.
`HTseq-count` takes two inputs: (1) mapped reads in BAM format and (2) a GTF dataset containing annotation of genes. Using these inputs it will compute the number of reads per gene.
@@ -386,10 +386,10 @@ Before we can use `HTseq-count` we need to download gene annotations for version
>`htseq-count` needs strand information to proceed. The strand is specified as `+`, `-`, or `.` (unknown) in a GTF dataset. `htseq-count` does not accept `.` and will generate an error if such unstranded features appear in the data. To prevent these errors we will filter them out of the GTF file using the **Filter** tool from the **Filter and Sort** section of the tool menu. Here `c7 != "."` means that we keep only rows whose strand column (column 7) does not contain a dot:
>
->
+>
>
>Select **htseq-count** from the **NGS: RNA analysis** section on the left side of the menu. Set parameters as shown below. The red arrow indicates that, to enable `htseq-count` to see collections, you need to select the 'folder' button. In the case of this particular Galaxy [history](https://usegalaxy.org/u/aun1/h/rna-seq-tutorial) we will need to run `htseq-count` twice: once on TopHat alignments for collection **c1** (dataset #37; shown below) and then on alignments for collection **c2** (dataset #57).|
-|
+|
>
>This will generate [read counts per gene](https://usegalaxy.org/datasets/bbd44e69cb8906b5d1e80eae6d363142/display/?preview=True).
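The strand-filtering step above can be sketched in a few lines (the `keep_stranded` helper and the GTF rows are invented for illustration; Galaxy's Filter tool applies the same `c7 != "."` test to each row):

```python
def keep_stranded(gtf_lines):
    """Keep only GTF records whose strand column (column 7) is '+' or '-'.

    Mirrors the Galaxy Filter expression c7 != "." used before htseq-count.
    """
    kept = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # GTF columns are 1-based; the strand is column 7 (index 6).
        if len(fields) >= 7 and fields[6] != ".":
            kept.append(line)
    return kept

gtf = [
    'chr2L\trefGene\texon\t100\t200\t.\t+\t.\tgene_id "CG1";',
    'chr2L\trefGene\texon\t300\t400\t.\t.\t.\tgene_id "CG2";',  # unstranded
]
print(len(keep_stranded(gtf)))  # 1
```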
@@ -407,24 +407,24 @@ Before we can use `HTseq-count` we need to download gene annotations for version
>The Galaxy interface for `DESeq2` is shown below. `DESeq2` allows you to incorporate multiple *factors* in the analysis. In our case we have only one factor, which we call **Conditions**, because we are trying to find genes that are differentially expressed between two conditions. The first condition will be the first **factor level**, while condition 2 will be the second **factor level**. Here the input for the first factor level is set to the collection `84: htseq-count on collection 37` and the input for the second factor level is set to `92: htseq-count on collection 57`. Make sure that **Visualising the analysis results** is set to `Yes`:
>
->
+>
>
>This will produce [output](https://usegalaxy.org/datasets/bbd44e69cb8906b5d648fe21c36ac662/display/?preview=True) as shown below. The columns are: (**1**) gene identifier, (**2**) mean normalised counts, averaged over all samples from both conditions, (**3**) logarithm (base 2) of the fold change, (**4**) the standard error estimate for the log2 fold change estimate, (**5**) [Wald test](https://en.wikipedia.org/wiki/Wald_test) statistic, (**6**) *p*-value for the statistical significance of this change, and (**7**) *p*-value adjusted for multiple testing with the Benjamini-Hochberg procedure, which controls the false discovery rate ([FDR](https://en.wikipedia.org/wiki/False_discovery_rate)). There is only one gene with a significant change in gene expression between conditions: `CG1803-RC` with *p*-value = 1.6×10^-5.
>
->
+>
>
>In addition to the [list of genes](https://usegalaxy.org/datasets/bbd44e69cb8906b5d648fe21c36ac662/display/?preview=True) DESeq2 outputs a graphical summary of the result. It includes a number of plots that should be used to evaluate the quality of the experiment. The histogram of *p*-values below shows that in our sample there is in fact just one instance of a significant *p*-value:
>
->
+>
>
>The [MA plot](https://en.wikipedia.org/wiki/MA_plot) below shows the relationship between the expression change (M) and average expression strength (A). Genes with adjusted *p*-value < 0.1 are in red (there is only one such gene in this sample, at the bottom of the graph):
>
->
+>
>
>The Principal Component Analysis ([PCA](https://en.wikipedia.org/wiki/Principal_component_analysis)) shows the separation between Condition 1 and 2. This type of plot is useful for visualizing the overall effect of experimental covariates and batch effects (each replicate is plotted as an individual data point):
>
->
+>
>
>A heatmap of the sample-to-sample distance matrix gives us an overview of similarities and dissimilarities between samples:
>
->
+>
diff --git a/topics/variant-analysis/README.md b/topics/variant-analysis/README.md
index 9c13eb9b..b09cb223 100644
--- a/topics/variant-analysis/README.md
+++ b/topics/variant-analysis/README.md
@@ -1,4 +1,4 @@
Variant analysis
================
-Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) before adding or updating any material
\ No newline at end of file
+Please refer to the [CONTRIBUTING.md](../../CONTRIBUTING.md) before adding or updating any material
diff --git a/topics/variant-analysis/slides/index.html b/topics/variant-analysis/slides/index.html
index dfdb4805..f165a680 100644
--- a/topics/variant-analysis/slides/index.html
+++ b/topics/variant-analysis/slides/index.html
@@ -72,5 +72,5 @@
### 2 tutorials
-- [Introductory tutorial](http://galaxyproject.github.io/training-material/Exome-Seq/tutorials/Exome-Seq.html)
-- [Detailed tutorial](http://galaxyproject.github.io/training-material/Exome-Seq/tutorials/Diploid-variant-calling.html)
+- [Introductory tutorial](https://galaxyproject.github.io/training-material/topics/variant-analysis/tutorials/exome-seq/tutorial.html)
+- [Detailed tutorial](https://galaxyproject.github.io/training-material/topics/variant-analysis/tutorials/diploid-variant-calling/tutorial.html)
diff --git a/topics/variant-analysis/tutorials/diploid-variant-calling/tutorial.md b/topics/variant-analysis/tutorials/diploid-variant-calling/tutorial.md
index 3b7dff1a..a7475483 100644
--- a/topics/variant-analysis/tutorials/diploid-variant-calling/tutorial.md
+++ b/topics/variant-analysis/tutorials/diploid-variant-calling/tutorial.md
@@ -8,12 +8,12 @@ tutorial_name: diploid-variant-calling
# Introduction
-Variant calling is a complex field that was significantly propelled by advances in DNA sequencing and efforts of large scientific consortia such as the [1000 Genomes](http://www.1000genomes.org). Here we summarize basic ideas central to Genotype and Variant calling. First, let's contrast the two things although they often go together:
+Variant calling is a complex field that was significantly propelled by advances in DNA sequencing and the efforts of large scientific consortia such as the [1000 Genomes Project](https://www.1000genomes.org). Here we summarize basic ideas central to genotype and variant calling. First, let's contrast the two, although they often go together:
* **Variant calling** - identification of positions where the sequenced sample is different from the reference sequence (or [reference genome graph](https://github.com/vgteam/vg));
* **Genotype calling** - identifying an individual's genotype at variable sites.
-A typical workflow for variation discovery involves the following steps (e.g., see Nielsen et al. [2011](http://www.nature.com/nrg/journal/v12/n6/full/nrg2986.html)):
+A typical workflow for variation discovery involves the following steps (e.g., see Nielsen et al. [2011](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3593722/)):
1. Mapping reads against the reference genome;
2. Thresholding BAM datasets by, for example, retaining paired, properly mapped reads;
@@ -23,7 +23,7 @@ A typical workflow for variation discovery involves the following steps (e.g., s
6. Performing filtering and genotype quality score recalibration;
7. Annotating variants and performing downstream analyses.
-However, continuing evolution of variant detection methods has made some of these steps obsolete. For instance, omitting quality score recalibration and re-alignment (steps 3 and 4 above) when using haplotype-aware variant callers such as [FreeBayes](https://github.com/ekg/freebayes) does not have an effect on the resulting calls (see Brad Chapman's methodological comparisons at [bcbio](http://bit.ly/1S9kFJN)
+However, continuing evolution of variant detection methods has made some of these steps obsolete. For instance, omitting quality score recalibration and re-alignment (steps 3 and 4 above) when using haplotype-aware variant callers such as [FreeBayes](https://github.com/ekg/freebayes) does not have an effect on the resulting calls (see Brad Chapman's methodological comparisons at [bcbio](https://bit.ly/1S9kFJN)
> ### Agenda
>
@@ -102,11 +102,11 @@ Suppose *Ri* is a base in read *i* corresponding to a genome position
#### *P(G)* - a single sample case
-One can assign an equal probability to all possible genotypes, or to source this information based on previously obtained knowledge containing in a database, such as [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/). In this case (as exemplified in [Nielsen et al.](http://www.nature.com/nrg/journal/v12/n6/full/nrg2986.html) we may, for instance, have a site with a **G/T** polymorphism and genotypes **GG**, **TT**, and **GT** having frequencies of 0.45, 0.45, 0.09, respectively. We will use these values as priors.
+>One can assign an equal probability to all possible genotypes, or source this information from previously obtained knowledge contained in a database such as [dbSNP](https://www.ncbi.nlm.nih.gov/SNP/). In this case (as exemplified in [Nielsen et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3593722/)) we may, for instance, have a site with a **G/T** polymorphism and genotypes **GG**, **TT**, and **GT** having frequencies of 0.45, 0.45, and 0.09, respectively. We will use these values as priors.
#### *P(G)* - a multi-sample case
-Genotype calling reliability can be significantly improved when analyzing multiple samples jointly. In this case genotype frequencies can be inferred from allele frequencies using Hardy-Weinberg equilibrium ([HWE](https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle)). The following example (again from [Nielsen et al.](http://www.nature.com/nrg/journal/v12/n6/full/nrg2986.html)) illustrates this idea: suppose you are calling genotypes for a single individual using a combination of multiple samples. There are two genotypes, **AT** and **AA**, with equally large genotype likelihoods. If, however, in our collection of multiple samples the frequency of **A** is 1% (*p* = 0.01; *q* = 1 - *p* = 0.99), then from the HWE we have:
+Genotype calling reliability can be significantly improved when analyzing multiple samples jointly. In this case genotype frequencies can be inferred from allele frequencies using Hardy-Weinberg equilibrium ([HWE](https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle)). The following example (again from [Nielsen et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3593722/)) illustrates this idea: suppose you are calling genotypes for a single individual using a combination of multiple samples. There are two genotypes, **AT** and **AA**, with equally large genotype likelihoods. If, however, in our collection of multiple samples the frequency of **A** is 1% (*p* = 0.01; *q* = 1 - *p* = 0.99), then from the HWE we have:
| AA (*p2*) | AT (2*pq*) | TT (*q2*)|
|---------|---------|--------|
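The HWE priors and their effect on the call can be reproduced in a few lines (the likelihood values below are hypothetical, as in the example; only their ratio matters):

```python
def genotype_posteriors(likelihoods, priors):
    """Posterior P(G|D), proportional to P(D|G) * P(G), normalised over genotypes."""
    unnorm = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}

p = 0.01           # frequency of allele A in the multi-sample collection
q = 1.0 - p
priors = {"AA": p * p, "AT": 2 * p * q}   # HWE: p^2 and 2pq

# Two genotypes with equally large likelihoods, as in the example above:
post = genotype_posteriors({"AA": 1.0, "AT": 1.0}, priors)
print(post["AT"])   # the prior pushes the call strongly toward AT
```

Because the HWE prior for **AA** (p² = 0.0001) is nearly 200 times smaller than for **AT** (2pq = 0.0198), the posterior for **AT** exceeds 0.99 even though the data alone cannot distinguish the two genotypes.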
@@ -123,7 +123,7 @@ This makes it highly unlikely that **AA** is a true genotype of this individual.
> * **Variant quality recalibration is avoided** by incorporating a number of metrics, such as read placement bias and allele balance, directly into the Bayesian model;
> * **Ability to incorporate non-diploid cases** such as pooled datasets or data from polyploid samples.
-Freebayes is a *haplotype-based* variant caller. This implies that instead of looking at an individual positions within an alignment of reads to the reference genome, it looks at a haplotype window, length of which is dynamically determined (see section 3.2. in [FreeBayes manuscript](http://arxiv.org/pdf/1207.3907v2.pdf)):
+FreeBayes is a *haplotype-based* variant caller. This means that instead of looking at individual positions within an alignment of reads to the reference genome, it looks at a haplotype window whose length is dynamically determined (see section 3.2 in the [FreeBayes manuscript](https://arxiv.org/pdf/1207.3907v2.pdf)):
|Haplotype-based calling |
|------------------------|
@@ -140,7 +140,7 @@ In this example we will perform variant calling and annotation using [genome in
* HG003- NA24149 - hu6E4515 (father)
* HG004- NA24143 - hu8E87A9 (mother)
-Yet for a quick tutorial these datasets are way too big, so we created a [downsampled dataset](http://dx.doi.org/10.5281/zenodo.60520). This dataset was produced by mapping the trio reads against `hg19` version of the human genome, merging the resulting bam files together (we use readgroups to label individual reads so they can be traced to each of the original individuals), and restricting alignments to a small portion of chromosome 19 containing the [*POLRMT*](http://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=Graphics&list_uids=5442) gene.
+Yet for a quick tutorial these datasets are way too big, so we created a [downsampled dataset](https://dx.doi.org/10.5281/zenodo.60520). This dataset was produced by mapping the trio reads against `hg19` version of the human genome, merging the resulting bam files together (we use readgroups to label individual reads so they can be traced to each of the original individuals), and restricting alignments to a small portion of chromosome 19 containing the [*POLRMT*](http://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=Graphics&list_uids=5442) gene.
> ### :pencil2: Hands-on: Variant calling
>
@@ -173,7 +173,7 @@ Yet for a quick tutorial these datasets are way too big, so we created a [downsa
> 
{: .hands_on}
-This produces a dataset in [VCF](http://www.1000genomes.org/wiki/Analysis/variant-call-format) format containing 35 putative variants. Before we can continue we need to post-process this dataset by breaking compound variants into multiple independent variants with **VcfAllelicPrimitives** tool found within **VCF Tools** section. This is necessary for ensuring the smooth sailing through downstream analyses:
+This produces a dataset in [VCF](https://www.1000genomes.org/wiki/Analysis/variant-call-format) format containing 35 putative variants. Before we can continue we need to post-process this dataset by breaking compound variants into multiple independent variants with the **VcfAllelicPrimitives** tool found within the **VCF Tools** section. This is necessary to ensure smooth sailing through downstream analyses:
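For same-length alleles, the decomposition performed by VcfAllelicPrimitives can be sketched as follows (the `decompose_mnp` helper and coordinates are hypothetical; the real tool also handles indels and complex alleles via realignment):

```python
def decompose_mnp(pos, ref, alt):
    """Split a same-length compound variant (MNP) into primitive SNPs.

    Returns a list of (position, ref_base, alt_base) tuples, one per
    mismatching base; matching bases produce no variant.
    """
    assert len(ref) == len(alt), "this sketch handles same-length alleles only"
    return [(pos + i, r, a)
            for i, (r, a) in enumerate(zip(ref, alt)) if r != a]

# A compound variant ATG->GTC at position 1000 becomes two independent SNPs:
print(decompose_mnp(1000, "ATG", "GTC"))  # [(1000, 'A', 'G'), (1002, 'G', 'C')]
```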
> ### :pencil2: Hands-on: Post-processing
>
@@ -192,7 +192,7 @@ This produces a dataset in [VCF](http://www.1000genomes.org/wiki/Analysis/varian
## Annotating variants with SnpEff
-At this point we are ready to begin annotating variants using [**SnpEff**](http://snpeff.sourceforge.net/SnpEff.html). **SnpEff**, a project maintained by [Pablo Cingolani](https://www.linkedin.com/in/pablocingolani) "*...annotates and predicts the effects of variants on genes (such as amino acid changes)...*" and so is critical for functional interpretation of variation data.
+At this point we are ready to begin annotating variants using [**SnpEff**](https://snpeff.sourceforge.net/SnpEff.html). **SnpEff**, a project maintained by [Pablo Cingolani](https://www.linkedin.com/in/pablocingolani) "*...annotates and predicts the effects of variants on genes (such as amino acid changes)...*" and so is critical for functional interpretation of variation data.
### :pencil2: Annotating variants
@@ -209,7 +209,7 @@ or changes to codons
## Manipulating variation data with GEMINI
-Now that we have an annotated VCF file it is time to peek inside our variation data. [Aaron Quinlan](http://quinlanlab.org/), creator of [GEMINI](http://gemini.readthedocs.org/en/latest/index.html), calls it *Detective work*.
+Now that we have an annotated VCF file it is time to peek inside our variation data. [Aaron Quinlan](https://quinlanlab.org/), creator of [GEMINI](http://gemini.readthedocs.org/en/latest/index.html), calls it *Detective work*.
### Loading data into GEMINI
@@ -239,11 +239,11 @@ This produces a list of all tables and fields in the database.
### Querying GEMINI database
-GEMINI database is queried using the versatile SQL language (more on SQL [here](http://swcarpentry.github.io/sql-novice-survey)). In Galaxy's version of GEMINI this is done using **GEMINI_query** tool. Within this tool SQL commands are typed directly into the **The query to be issued to the database** text box. Let's begin getting information from some of the tables we discovered with **GEMINI_db_info** tool above.
+The GEMINI database is queried using the versatile SQL language (more on SQL [here](https://swcarpentry.github.io/sql-novice-survey)). In Galaxy's version of GEMINI this is done using the **GEMINI_query** tool, where SQL commands are typed directly into the **The query to be issued to the database** text box. Let's begin by getting information from some of the tables we discovered with the **GEMINI_db_info** tool above.
> ### :bulb: Tip: Gemini tutorials
>
-> The examples below are taken from "[Intro to Gemini](https://s3.amazonaws.com/gemini-tutorials/Intro-To-Gemini.pdf)" tutorial. For extensive documentation see "[Querying GEMINI](http://gemini.readthedocs.org/en/latest/content/querying.html)".
+> The examples below are taken from "[Intro to Gemini](https://s3.amazonaws.com/gemini-tutorials/Intro-To-Gemini.pdf)" tutorial. For extensive documentation see "[Querying GEMINI](https://gemini.readthedocs.org/en/latest/content/querying.html)".
{: .tip}
> ### :pencil2: Hands-on: Selecting "novel" variants that are not annotated in dbSNP database
@@ -258,7 +258,7 @@ GEMINI database is queried using the versatile SQL language (more on SQL [here](
> ### :pencil2: Find variants in POLRMT gene
>
-> The query `SELECT * FROM variants WHERE filter is NULL and gene = 'POLRMT'` will produce [output](https://usegalaxy.org/datasets/bbd44e69cb8906b5a0bb5b2cc0695697/display/?preview=True) with very large number of columns. To restrict the number of columns to a manageable set let's use this command: `SELECT rs_ids, aaf_esp_ea, impact, clinvar_disease_name, clinvar_sig FROM variants WHERE filter is NULL and gene = 'POLRMT'` (column definitions can be found [here](http://gemini.readthedocs.org/en/latest/content/database_schema.html))
+> The query `SELECT * FROM variants WHERE filter is NULL and gene = 'POLRMT'` will produce [output](https://usegalaxy.org/datasets/bbd44e69cb8906b5a0bb5b2cc0695697/display/?preview=True) with a very large number of columns. To restrict the number of columns to a manageable set, let's use this command: `SELECT rs_ids, aaf_esp_ea, impact, clinvar_disease_name, clinvar_sig FROM variants WHERE filter is NULL and gene = 'POLRMT'` (column definitions can be found [here](https://gemini.readthedocs.org/en/latest/content/database_schema.html))
[Output](https://usegalaxy.org/datasets/bbd44e69cb8906b540d65297cd1d26bb/display/?preview=True) shows variants found within the *POLRMT* gene.
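The same style of query can be tried outside Galaxy against a toy SQLite stand-in for the GEMINI `variants` table (the rows below are invented for illustration; a real GEMINI database has many more columns):

```python
import sqlite3

# A tiny in-memory stand-in for a GEMINI 'variants' table.
# "end" and "filter" are quoted because they are SQL keywords.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE variants (
    chrom TEXT, start INTEGER, "end" INTEGER,
    gene TEXT, impact TEXT, rs_ids TEXT, "filter" TEXT)""")
db.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?)",
    [("chr19", 617350, 617351, "POLRMT", "missense_variant", "rs11666", None),
     ("chr19", 618200, 618201, "POLRMT", "synonymous_variant", None, None),
     ("chr19", 700000, 700001, "OTHER", "intron_variant", None, "LowQual")])

# The same restriction used in the hands-on step above:
rows = db.execute(
    'SELECT rs_ids, impact FROM variants '
    'WHERE "filter" IS NULL AND gene = ?', ("POLRMT",)).fetchall()
print(len(rows))  # 2
```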
@@ -301,7 +301,7 @@ Wildcards simplify writing SQL expressions when searching across multiple terms. T
>
>
> Click to view answer
-> Type `SELECT chrom, start, end, ref, alt, gene, impact, (gts).(*) FROM variants` into The query to be issued to the database and `(gt_types).(*).(==HET).(all)` into Restrictions to apply to genotype values. Here we use wildcards for the query (`(gts.*)` = get genotypes for all samples) and genotype filtering (`(gt_types).(*).(==HET).(all)`, the all operator implies that want results for all afftected individuals). It will generate this output.
+> Type `SELECT chrom, start, end, ref, alt, gene, impact, (gts).(*) FROM variants` into **The query to be issued to the database** and `(gt_types).(*).(==HET).(all)` into **Restrictions to apply to genotype values**. Here we use wildcards for the query (`(gts).(*)` = get genotypes for all samples) and for genotype filtering (`(gt_types).(*).(==HET).(all)`; the `all` operator implies that we want results for all affected individuals). It will generate this output.
>
{: .question}
diff --git a/topics/variant-analysis/tutorials/exome-seq/tutorial.md b/topics/variant-analysis/tutorials/exome-seq/tutorial.md
index 6e778401..bed3a53f 100644
--- a/topics/variant-analysis/tutorials/exome-seq/tutorial.md
+++ b/topics/variant-analysis/tutorials/exome-seq/tutorial.md
@@ -179,7 +179,7 @@ to simplify the variant representation.
## Annotate your variants
-To annotate the variants, we use the [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/),
+To annotate the variants, we use the [dbSNP](https://www.ncbi.nlm.nih.gov/SNP/),
the NCBI database of genetic variation and then `hg19` database with **SnpEff**.
> ### :pencil2: Hands-on: Annotating variants
@@ -262,7 +262,7 @@ relations, additional annotations and, most importantly, it is fast to search.
# Variant analysis
**GEMINI query** is the most versatile of all the GEMINI tools. You can use it to
-ask 'interesting' questions in simple SQL (see the GEMINI [handbook](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003153) on its usage).
+ask 'interesting' questions in simple SQL (see the GEMINI [handbook](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003153) on its usage).
### **GEMINI query** examples:
- `select chrom, start, end from variants` will show you some information on all