91407 - Laboratory of Comparative Genomics

Academic Year 2022/2023

  • Teaching Mode: Traditional lectures
  • Campus: Bologna
  • Degree programme: Second cycle degree programme (LM) in Biodiversity and Evolution (code 5824)

Learning outcomes

By the end of the course, students will master the basic computational and bioinformatics methods needed to handle and analyze genomic data. The course provides theoretical and practical skills for working with High-Throughput Sequencing data through a pipeline that includes quality check and filtering, de novo assembly of genomes/transcriptomes, mapping and variant discovery, RNA-Seq, normalization, transcript quantification, identification of differentially expressed genes, annotation of coding and non-coding elements, and orthology identification. Everything will be presented in a comparative framework, to identify the evolutionary components that contributed to shaping extant organisms.

Course contents

1. Technologies, methods, and applications

Technologies: Sanger sequencing vs massive parallel sequencing; Reversible Terminator Sequencing (Illumina); Single-Molecule Real-Time (SMRT) Sequencing (PacBio); Nanopore Sequencing; Comparison of sequencing platforms.

Sequencing methods: directed sequencing, hierarchical shotgun sequencing, whole-genome shotgun sequencing; genetic maps; short reads, paired-end sequencing, mate-pair sequencing, long reads, chromosome conformation technologies (e.g., Hi-C).

Assembly strategies: greedy approach; Overlap-Layout-Consensus (OLC) approach; de Bruijn Graph approach; Comparison of assembly strategies.
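As an illustration of the de Bruijn Graph approach listed above, the following is a minimal sketch rather than a real assembler. The function name `de_bruijn_graph` and the toy reads are invented for this example; the sketch assumes error-free reads and ignores reverse complements.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn multigraph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Toy, error-free reads (hypothetical data for illustration only).
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Repeated edges are kept on purpose: in this representation, the number of times an edge appears reflects how often the corresponding k-mer was observed in the reads.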

Applications: RAD-Seq; RNA-Seq; bisulfite sequencing; ChIP-Seq; Iso-Seq; Single-Cell genomics and transcriptomics.

Genomics, big data, computational biology and bioinformatics.


2. Databases, algorithms for alignment and sequence similarity search

Overview of public databases (GenBank, EMBL, RefSeq, EggNOG, Ensembl, UniProtKB, InterPro, Pfam, PROSITE, Swiss-Prot, Gene Ontology); gene identification: gene prediction, ORFs; sequence identity/similarity; substitution matrices; sequence comparison statistics; compositional bias; global vs local alignments; search algorithms: BLAST+, Diamond; motifs, domains and profiles: sequence motifs, sequence logo, protein domains and domain architectures; HMM profiles, HMMER.
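The difference between global and local alignment mentioned above can be sketched with a small dynamic-programming example. This is an illustrative sketch only: the function name `alignment_score` and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for the example, not parameters used by BLAST+ or Diamond, and a simple match/mismatch score stands in for a substitution matrix.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the optimal alignment score of sequences a and b.

    local=False: global alignment (Needleman-Wunsch style recurrence).
    local=True:  local alignment (Smith-Waterman style recurrence).
    """
    # First row: leading gaps are penalized globally, free locally.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    # Global: score of aligning both sequences end to end.
    # Local: best score found anywhere in the matrix.
    return best if local else prev[-1]


print(alignment_score("ACGT", "AGT"))                         # global score
print(alignment_score("TTACGTTT", "GGACGTGG", local=True))    # local score
```

The local variant rewards the shared core of two otherwise unrelated sequences, which is why similarity search tools build on local rather than global alignment.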


3. Comparative Genomics

The comparative method; model organisms; the inductive method and the problem with generalization in biology; the importance of basic research. Examples of comparative genomics.


4. Introduction to the Unix Shell

Biology and “Big Data”, robust and reproducible research, experimental design, management and organization of data and online documentation, online resources.

Bash: logging in to the workstation using Guacamole, paths, files, directories, permissions; text editors (vim, nano); download and data transfer: wget, curl, scp; data integrity check: md5sum; data compression/decompression: zip, gzip, tar; merging, sorting and comparing files: cat, join, sort, diff, comm; text file manipulation: grep, AWK, sed.

Bash scripting: background, screen; concatenation, pipe, semicolon, &&; standard output and standard error; variables; command substitution; loops: for, until, while.

Conda environments.


5. Comparative Genomics Project

  • K-mer-based genome characterization: FASTQ and FASTA formats; quality check; k-mer frequency calculation; estimation of genome size, repetitive content, and heterozygosity.
  • De novo assembly: draft assembly; genome polishing; evaluation of assembly quality; search and elimination of contaminants; reference-based scaffolding; evaluation and comparison by whole-genome alignment.
  • Annotation: annotation of transposons and other repetitive elements; evidence-based gene prediction; ab initio gene prediction; introduction to gene predictor training (machine learning); GO annotation and GO enrichment.
  • Search for ortholog genes: concepts of molecular orthology; Orthofinder.
  • Molecular evolution: dN/dS computation.
  • RNA-Seq and comparative transcriptomics: read mapping; filtering and count of mapped reads; normalization; differential transcription.
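
The k-mer frequency calculation listed in the first project step can be sketched as follows. This is a toy illustration, not the tooling used in the course: the function names and the embedded FASTQ records are invented for the example, and the sketch assumes four-line FASTQ records and does not merge reverse-complement k-mers.

```python
from collections import Counter
import io


def read_fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:
            yield line.strip().upper()


def count_kmers(sequences, k):
    """Count every k-mer occurrence, skipping k-mers with ambiguous bases."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


# Hypothetical reads embedded as a string so the example is self-contained.
toy_fastq = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nCGTACGTA\n+\nIIIIIIII\n"
)
counts = count_kmers(read_fastq_sequences(toy_fastq), k=4)
spectrum = kmer_spectrum(counts)
for multiplicity, n_kmers in sorted(spectrum.items()):
    print(multiplicity, n_kmers)
```

On real data, the resulting histogram (the k-mer spectrum) is the input from which genome size, repetitive content, and heterozygosity are estimated; a common first approximation of genome size divides the total number of k-mers by the depth of the main peak.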

Readings/Bibliography

  • Vince Buffalo “Bioinformatics Data Skills”, O’Reilly.

  • Arthur M. Lesk “Introduction to Bioinformatics” (Fifth Edition), Oxford University Press

  • Arthur M. Lesk “Introduction to Genomics” (Third Edition), Oxford University Press

  • Scientific papers and online material (including a dedicated GitHub website) and Teams.

Teaching methods

The course will alternate theoretical lectures with practical, hands-on sessions during which students will have the opportunity to perform analyses on a dataset of their choice.
Each student will choose a biological problem and a publicly available dataset on which to develop a project, which will be evaluated for the final grade.

Before taking this course, it is highly recommended to have attended the following courses:

  • 91400 - Biometria Evoluzionistica ed Ecologica
  • 91360 - Genetica di Popolazione ed Evoluzione Molecolare
  • 91789 - Evoluzione e Filogenesi (C.I.)
  • 91399 - Evoluzione del Genoma

Given the teaching methods of this course unit, all students must attend Modules 1 and 2 on Health and Safety online.

Assessment methods

Evaluation of the project and brief interview (focused on the project).

The requirements for the projects and the submission procedures will be explained during the introductory lesson.

Teaching tools

Slides, scientific papers, online material (including a dedicated GitHub website), hands-on sessions on the PC, use of a high performance workstation.

Office hours

See the website of Fabrizio Ghiselli