Mini-workshop 2007


ICIS mini-workshop

Date  : March 19-23, 2007
Venue  : University of Queensland, Brisbane, Australia
Agenda : pdf

Participants

  • UQ
    • Ian DeLacy
    • Sandra Micallef
    • Vivi Arief
  • IRRI
    • Graham McLaren
    • Thomas Metz
    • Rowena Valerio
  • Triticarte
    • Grzegorz Uszynski
  • CRCVAW
    • Clare Johnson
  • ACPFG
    • Dave Edwards

Summary of Requirements for the Project

Basic Queries for Genotype, Phenotype and characterization data

  1. Given germplasm list and clone list, produce genotype data
  2. Given germplasm list, trait list and study list, produce phenotypic data
    1. Merge 1 and 2
  3. List Management
    1. Access germplasm list from central
    2. Access germplasm list from text file
    3. Access germplasm list from local
    4. Appending and merging of germplasm list
    5. Management of study list (folder system for studies and naming system or description for datasets)
    6. Management of trait list
  4. Dataset download/ description of datasets
  5. Content Description
    1. Given a germplasm list, in what studies were these germplasms evaluated?
    2. Given a germplasm list, what clones have been genotyped?
    3. Given a germplasm list, what traits have been measured?
    4. Tabulate co-occurrence of genotypes/traits in studies. How many genotypes are evaluated in each pair of studies?
  6. A pedigree view/download tool - will be discussed with the team



Monday 19th March 2007

Visualisation tools for genotype and phenotype data.

Rowena Valerio gave a demonstration of the Genotype Visualisation Tool (GVT) for marker data, from David Marshall (SCRI, UK) (http://cropforge.org/projects/genotypevisual/). Graham McLaren gave an overview of a presentation given by Casper aan den Boom at ICIS 2006 about the Graphical Genotype Tool (GGT), developed by The Netherlands Plant Genomics Network. This software is freely available but is not open source (www.pbr.wwr.nl/uk/resources). IRRI staff are working with the GGT developers to integrate it with ICIS.

Dave Edwards, the newly appointed leader for the Brisbane node of the Australian Centre for Plant Functional Genomics (ACPFG), talked about his work with Marker QTL databases at DPI (Department of Primary Industries), and possible future collaborations with Dave Mathews of Grain Genes (http://wheat.pw.usda.gov).

Graham McLaren summarised another presentation, originally given by Kyle Braak (CIMMYT) at ICIS2006, about the Comparative Map and Trait Viewer (CMTV). This tool uses the ISYS Integrated System (http://www.ncgr.org/isys/), which was developed by NCGR in collaboration with 4 CGIAR centres and which allows construction of consensus maps. The CMTV tool uses ISYS to connect to other tools and data sources.

Another visualisation tool is the Peditree software, which is also planned to be integrated with ICIS (http://www.dpw.wageningen-ur.nl/pv/pub/Peditree/index.htm).

COP (Coefficient of Parentage) matrices: Ian DeLacy and Graham McLaren discussed the specifications this visualisation tool should meet to be useful and efficient when handling large amounts of data.
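As a rough illustration of what a COP matrix contains, the sketch below computes pairwise coefficients with the standard tabular (recursive) kinship method. The pedigree, the line names and the treatment of founders (unrelated and not inbred) are hypothetical assumptions. It is not the ICIS implementation, and breeding conventions such as treating lines as fully inbred would change the diagonal.

```python
from functools import lru_cache

# Hypothetical pedigree: line -> (parent 1, parent 2). Founders have no entry.
PEDIGREE = {"C": ("A", "B"), "D": ("A", "C")}
# Lines ordered oldest first, so parents always precede their progeny.
ORDER = ["A", "B", "C", "D"]
RANK = {name: i for i, name in enumerate(ORDER)}


@lru_cache(maxsize=None)
def cop(x, y):
    """Coefficient of parentage between lines x and y (tabular method)."""
    if x == y:
        parents = PEDIGREE.get(x)
        if parents is None:
            return 0.5  # assumption: founders are not inbred
        return 0.5 * (1.0 + cop(*parents))
    if RANK[x] < RANK[y]:
        x, y = y, x  # always expand the younger line through its parents
    parents = PEDIGREE.get(x)
    if parents is None:
        return 0.0  # assumption: distinct founders are unrelated
    return 0.5 * (cop(parents[0], y) + cop(parents[1], y))


cop_matrix = [[cop(a, b) for b in ORDER] for a in ORDER]
for name, row in zip(ORDER, cop_matrix):
    print(name, row)
```

For large pedigrees the full matrix grows with the square of the number of lines, which is one reason the storage and display specifications matter.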

IRRI recommended lines web interface: This tool was presented at ICIS2006 by Alex Cosico from IRRI.

Input: a set of pedigrees with a large amount of clean phenotypic data (multiple observations for each trait).

Aim: to be able to query the dataset to select the best lines for seed.

  1. select traits of interest
  2. specify criteria

Values for criteria selection can be chosen from histograms and sliders – these give a graphical view of the search space in the database, so users can narrow down what they want out of the database. (http://www.iris.irri.org/nursery/views/nursery.jsp)

Use of DArT data in selection and breeding

Ian DeLacy gave a talk about Marker Assisted Breeding and integrating pedigree based whole genome analysis. QTL detection and validation has limited power, is expensive and time consuming.

  • Genotypic data: Pedigree data, phenotype data, gene data
  • Genotypic information: Graphical genotypes, information on relatives
  • What is needed to do the pedigree association analysis?
  • Marker system
  • Information management system
  • Bioinformatics system (analysis)
  • Decision support system for breeding (visualisation decision tools)
  • And most importantly integration of systems

Overall gain is genetic gain

  • IBS: Identity by State
  • IBD: Identity by descent
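To make the IBS term concrete, here is a minimal sketch that scores identity by state between two lines as the fraction of markers with the same state. The 1/0 coding, the use of `None` for missing scores and the function name are assumptions for illustration. IBD cannot be read off marker states alone, since it also needs pedigree information.

```python
def ibs_similarity(line_a, line_b):
    """Fraction of markers scored in both lines that share the same state."""
    shared = compared = 0
    for a, b in zip(line_a, line_b):
        if a is None or b is None:
            continue  # skip markers missing in either line
        compared += 1
        if a == b:
            shared += 1
    if compared == 0:
        return None  # no marker was scored in both lines
    return shared / compared


# Hypothetical presence/absence scores for six markers.
line_1 = [1, 0, 1, 1, None, 0]
line_2 = [1, 1, 1, 0, 0, 0]
print(ibs_similarity(line_1, line_2))
```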

Association Mapping and Genetic Diversity for wheat improvement:

Prerequisites:

  1. Sufficient markers for structured characterization and for markers in specific genes/regions. Road marker generation: maize and barley are much ahead of wheat. DArT: genome specific. ~1000 markers expected. It is low cost ($36/sample), but there is no complete map so far. Lack of D genome markers and a question of reliability.
  2. Diverse material
  3. Determine ID - LD: Non random association of alleles at different loci. Physical distance: Evolutionary forces affecting LD are drift, admixture, population size and selection. Not much is known about LD in wheat. LD in other crops : Barley, Arabidopsis, Sorghum and Maize.
  4. Assess population structure
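The LD prerequisite can be illustrated with the usual two-locus statistics D and r². The sketch below assumes biallelic markers scored 1/0 on inbred (homozygous) lines, so that each line contributes one haplotype. The function name and the example scores are hypothetical.

```python
def ld_stats(locus_a, locus_b):
    """Return (D, r_squared) for two biallelic loci scored on the same lines."""
    pairs = [(a, b) for a, b in zip(locus_a, locus_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no line is scored at both loci")
    n = len(pairs)
    p_a = sum(a for a, _ in pairs) / n          # frequency of allele 1 at locus A
    p_b = sum(b for _, b in pairs) / n          # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in pairs if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                        # non-random association
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else float("nan")
    return d, r_squared


print(ld_stats([1, 1, 0, 0], [1, 1, 0, 0]))  # complete association
print(ld_stats([1, 1, 0, 0], [1, 0, 1, 0]))  # no association
```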

Quantification by determining genetic diversity (not pedigree based) (see paper by Gael Pressoir and Jianming Yu)


Tuesday 20th March 2007

Storing DArT data in ICIS

Rowena Valerio gave a demonstration of how to store DArT data in ICIS. A tutorial for this exercise can be found on the ICISWIKI site : http://cropwiki.irri.org/icis/index.php/Loading_of_DArT_Data_into_ICIS

Components of ICIS include:

  • GMS: Genealogy Management System
  • DMS: Data Management System
  • IMS: Inventory Management System
  • GRIMS: Genetic Resources Information Management System
  • GEMS: Gene Management System

To load the DArT data into GEMS and DMS, the workbook is used. In the workbook, the Factors are GID, Marker ID and Clone ID. The marker ID is generated by GEMS when loading: if it does not already exist for that GID, it is created (similar to how GIDs are generated in GMS). The variate in the workbook is Allele.

The first step is to transfer the data in the spreadsheet (given by breeders) from ‘crosstab’ format into serial format. The second step is to get the marker IDs (look them up in the database and create them if they do not exist). (Note for Rowena: newly assigned IDs should be clearly distinguishable from ones already in the database, to make sure typing or spelling errors do not result in duplicate markers.)

The current limitation is that Excel spreadsheets are limited to about 64,000 rows (65,536 in the .xls format). Possible solutions are either to use more than one spreadsheet in the workbook, or to load data straight from the matrix instead of creating the serial format and then loading.

Explanation of how data is stored in the DMS and GEMS schema: Graham gave a presentation prepared by Arllet Portugal.
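The crosstab-to-serial step, and the idea of spreading the result over several sheets to stay under a row limit, can be sketched as follows. The CSV layout, the clone names and the function names are assumptions for illustration. The marker ID lookup in GEMS is not shown, and this is not the ICIS workbook code.

```python
import csv
import io

# Hypothetical crosstab export: one row per GID, one column per clone.
CROSSTAB = "GID,cloneA,cloneB,cloneC\n101,1,0,X\n102,0,1,1\n"


def crosstab_to_serial(text):
    """Turn a GID x clone matrix into (GID, clone, allele) rows."""
    reader = csv.reader(io.StringIO(text))
    clones = next(reader)[1:]  # header row, minus the GID column
    serial = []
    for record in reader:
        gid, scores = record[0], record[1:]
        serial.extend((gid, clone, allele) for clone, allele in zip(clones, scores))
    return serial


def split_rows(rows, max_rows):
    """Split serial rows into sheets of at most max_rows rows each."""
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]


serial = crosstab_to_serial(CROSSTAB)
sheets = split_rows(serial, max_rows=4)
print(serial[0], [len(sheet) for sheet in sheets])
```

Because serial format has one row per GID and clone combination, a few hundred lines scored on about a thousand markers already exceed a single worksheet.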

Discussion followed on what DArT data values should be, e.g.:

  • 1 or P for presence
  • 0 or A for absence
  • X or M or NULL for missing or cannot be determined
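Whichever codes are agreed, a loader can map the variants discussed onto one internal representation. The sketch below is an assumption about how that might look: 1 for presence, 0 for absence, `None` for missing. Treating an empty cell as missing is an additional assumption not taken from the discussion.

```python
PRESENT = {"1", "P"}
ABSENT = {"0", "A"}
MISSING = {"X", "M", "NULL", ""}  # "" (empty cell) is an added assumption


def normalise_score(raw):
    """Map a raw DArT score onto 1 (present), 0 (absent) or None (missing)."""
    token = "" if raw is None else str(raw).strip().upper()
    if token in PRESENT:
        return 1
    if token in ABSENT:
        return 0
    if token in MISSING:
        return None
    raise ValueError(f"unrecognised DArT score: {raw!r}")


print([normalise_score(v) for v in ["1", "p", "0", "A", "x", "NULL", None]])
```

Rejecting unknown codes, rather than silently treating them as missing, makes typing errors in the source spreadsheet visible at load time.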


DArT: Raw data production and processing

– Grzegorz Uszynski

Grzegorz Uszynski from Diversity Array Technologies gave a presentation on how the DArT data is produced.

Definition of some terms used in the presentation:

  • Library : a collection of clones already in their database
  • Targets: Clones which are prepared in the same way as the library ones.
  • Spots on image indicate level of hybridization
  • System based on LAMP architecture (Linux, Apache, MySql, PHP)

LIMS system at Diversity Array Technologies is called DArT.db and is web enabled

  • It ensures proper clone and sample tracking
  • Barcodes utilities
  • Covers all lab technologies
  • All major tasks are described as protocols
  • Integrated image analysis component so analysis can be defined and results viewed online
  • Includes support for client/order management

DArT soft is a toolbox for data analysis

  • 2 modes of data analysis: local and network
  • Local analysis: image analysis and polymorphism analysis
  • A sample is a DNA extract described by plate location
  • Organism/species, genotype/tissuename combination
  • Genotype(name) is a synonym for sample
  • Primary aim is to identify sample NOT to invent new terminology (ontology)

Marker identification

  • Marker is a DNA fragment identified as polymorphic in an analysed group of samples

  • Always identified by clone name and clone ID
  • Relations between these identifiers are 1:1:[0/1]

DArTsoft development and future

  • Working with Brian Cullis on data detrending and gradients corrections
  • Working with NICTA on machine learning technologies

Discussion on data output format: XML vs XLS vs CSV

  • Excel most popular but limitations exist in terms of number of rows and columns
  • XML – not normally readable (needs converters to read on its own), very large files but perfect for multidimensional approach
  • CSV is easily portable and readable in other software packages such as Excel. However multidimensional is problematic. Quite compact in size and expansion means changes in format.

Discussion followed with Grzegorz about how markers are detected – discussion to be taken further with Andrzej Kilian tomorrow.


Wednesday 21st March 2007

Pre-structuring of data for data warehousing to increase performance on data querying

Thomas Metz and Rowena Valerio

In the test reported, PostgreSQL was much faster than MySQL. Converting a spreadsheet from serial format to matrix format took 2.5 seconds in PostgreSQL and 68 seconds in MySQL. The reason given is that in PostgreSQL the dataset returned by a query is kept in memory and looped through by the next query, instead of querying the whole database again and again.
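The serial-to-matrix conversion itself can be sketched independently of either database. The example below holds the serial rows in memory and builds the matrix from them without going back to the source for each cell. The row layout (GID, clone, allele) and the function name are assumptions, and this is not the code that was benchmarked.

```python
def serial_to_matrix(rows, missing=""):
    """Pivot (GID, clone, allele) rows into a header row plus one row per GID."""
    gids = list(dict.fromkeys(gid for gid, _, _ in rows))      # keep first-seen order
    clones = list(dict.fromkeys(clone for _, clone, _ in rows))
    cells = {(gid, clone): allele for gid, clone, allele in rows}
    header = ["GID"] + clones
    body = [[gid] + [cells.get((gid, clone), missing) for clone in clones]
            for gid in gids]
    return [header] + body


# Hypothetical serial rows; GID 102 has no score for cloneB.
rows = [("101", "cloneA", "1"), ("101", "cloneB", "0"), ("102", "cloneA", "0")]
for line in serial_to_matrix(rows):
    print(line)
```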

The phenotype and genotype data are in two separate tables. These tables need to be joined by linking the GID fields.
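A join on GID can be sketched with an in-memory SQLite database. The table names, column names and values below are hypothetical and do not reflect the actual DMS or GEMS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genotype  (gid INTEGER, clone TEXT, allele TEXT);
CREATE TABLE phenotype (gid INTEGER, trait TEXT, value REAL);
INSERT INTO genotype  VALUES (101, 'cloneA', '1'), (101, 'cloneB', '0'), (102, 'cloneA', '0');
INSERT INTO phenotype VALUES (101, 'Yield (t/ha)', 3.1), (103, 'Yield (t/ha)', 2.8);
""")

# Inner join: only GIDs present in both tables are returned.
joined = conn.execute("""
SELECT g.gid, g.clone, g.allele, p.trait, p.value
FROM genotype AS g
JOIN phenotype AS p ON p.gid = g.gid
ORDER BY g.gid, g.clone
""").fetchall()

for row in joined:
    print(row)
```

With an inner join, GID 102 (genotype only) and GID 103 (phenotype only) drop out. A LEFT JOIN would keep lines that lack phenotype data, if that is what the query needs.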

Discussion on how to represent a GID which has been genotyped more than once and has different alleles each time.

Data structure: 3 separate tables

  • List of clones
  • List of GIDs
  • Phenotype data
  1. Get genotype data for a specified list of GIDs and clones
    1. Specifying a GID list either
      • by filtering or
      • by choosing a germplasm list
    2. Selecting clones
      • Filtering by
        1. chromosome
        2. quality value
        3. PIC value
        4. by non missing %
      • by input list
  2. Get phenotype data for a specified list of germplasm and traits
    • germplasm specified in the same way as in query 1
    • traits specified mostly through an input list (need to make a list of traits that are available in database, with a check box next to it)
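Two of the clone filters listed above, PIC value and non-missing %, can be computed directly from presence/absence scores. The sketch below uses a simplified PIC for a two-state marker, 1 - p² - q², which is an assumption: the quality and PIC values supplied with DArT data may be calculated differently. All names and scores are hypothetical.

```python
def call_rate(scores):
    """Fraction of lines with a non-missing score for this clone."""
    return sum(s is not None for s in scores) / len(scores)


def pic(scores):
    """Simplified PIC for a presence/absence marker: 1 - p^2 - q^2."""
    called = [s for s in scores if s is not None]
    if not called:
        return 0.0
    p = sum(called) / len(called)  # frequency of presence
    return 1.0 - p * p - (1.0 - p) ** 2


def select_clones(scores_by_clone, min_call_rate, min_pic):
    """Names of clones passing both thresholds, in input order."""
    return [clone for clone, scores in scores_by_clone.items()
            if call_rate(scores) >= min_call_rate and pic(scores) >= min_pic]


# Hypothetical scores for four lines: 1 = present, 0 = absent, None = missing.
SCORES = {
    "cloneA": [1, 0, 1, 0],
    "cloneB": [1, 1, 1, None],   # monomorphic, so PIC is 0
    "cloneC": [1, None, None, 0],  # only half the lines scored
}
print(select_clones(SCORES, min_call_rate=0.75, min_pic=0.3))
```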

TRAITS              %PRESENT  VALUES
Yield (t/ha)        70%       > 2.5 < 3.7 t/ha
Plant height (cm)   63%       > 10 < 20 cm
Disease A           50%
Yield (grams/plot)  80%
Yield (Kg/ha)       20%

Values can be specified through an interface similar to the “Recommended lines” web interface designed by Alex Cosico (http://www.iris.irri.org/nursery/views/nursery.jsp)
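The two pieces of information in the trait list, how much data is present for a trait and which value range the user selects, can be sketched as below. The observations are invented for illustration, and the exclusive bounds follow the "> low < high" form used in the trait list.

```python
# Hypothetical observations: GID -> {trait: value}; absent keys mean no data.
OBSERVATIONS = {
    101: {"Yield (t/ha)": 3.0, "Plant height (cm)": 15.0},
    102: {"Yield (t/ha)": 4.1, "Plant height (cm)": 12.0},
    103: {"Yield (t/ha)": 2.9},
    104: {"Plant height (cm)": 18.0},
}


def percent_present(observations, trait):
    """Percentage of lines that have a value for the trait."""
    have = sum(1 for traits in observations.values() if trait in traits)
    return 100.0 * have / len(observations)


def select_lines(observations, criteria):
    """GIDs whose values fall strictly inside every (low, high) range."""
    selected = []
    for gid, traits in observations.items():
        if all(trait in traits and low < traits[trait] < high
               for trait, (low, high) in criteria.items()):
            selected.append(gid)
    return selected


print(percent_present(OBSERVATIONS, "Yield (t/ha)"))
print(select_lines(OBSERVATIONS, {"Yield (t/ha)": (2.5, 3.7),
                                  "Plant height (cm)": (10, 20)}))
```

In this sketch a line with no value for a selected trait is excluded. Whether missing values should exclude a line is a design decision for the interface.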

Query Design and User Interface Refinement

Crosstab queries required to summarise data by germplasm and study/trial

By trait

                      Trial ID
Trait name            1    2    3    4    5
Grain Yield (t/ha)    Y    Y    Y         Y
Grain Yield (kg/ha)             Y    Y
Plant height (cm)     Y    Y    Y    Y
Disease A                  Y    Y         Y
Disease B             Y         Y         Y

By germplasm

            Trial ID
GID         1    2    3    4    5
Hartog      Y    Y    Y         Y
Sunco                 Y    Y
Baviacora   Y    Y    Y    Y
Yallaroi         Y    Y         Y
Tamaroi     Y         Y         Y
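A presence crosstab of this kind can be built from simple (name, trial) records. The sketch below reproduces the Hartog and Sunco rows of the by-germplasm table above. The function name and the record layout are assumptions.

```python
def presence_crosstab(records):
    """Return (trial ids, {row name: ['Y' or '' per trial]}) from (name, trial) pairs."""
    trials = sorted({trial for _, trial in records})
    names = list(dict.fromkeys(name for name, _ in records))  # keep first-seen order
    seen = set(records)
    table = {name: ["Y" if (name, trial) in seen else "" for trial in trials]
             for name in names}
    return trials, table


records = [("Hartog", 1), ("Hartog", 2), ("Hartog", 3), ("Hartog", 5),
           ("Sunco", 3), ("Sunco", 4)]
trials, table = presence_crosstab(records)
print(trials)
for name, cells in table.items():
    print(name, cells)
```

The same function serves the by-trait summary if the first element of each record is a trait name instead of a germplasm name.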

Occurrence table

Indicates what proportion of one particular study (lines or germplasm) occurs in another study.

                Year or study
Year or study   1     2     3     4     5
1               --
2               10%   --
3               10%   14%   --
4                     7%    15%   --
5               2%    5%    21%   3%    --
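An occurrence table can be derived from the set of GIDs evaluated in each study. The sketch below computes, for each ordered pair of studies, the percentage of the first study's lines that also occur in the second, together with the raw co-occurrence count per pair. The study contents are invented, and the choice of the first study's size as denominator is an assumption, since the notes do not say how the proportion is normalised.

```python
from itertools import combinations

# Hypothetical studies: study id -> set of GIDs evaluated in it.
STUDIES = {1: {101, 102, 103, 104}, 2: {103, 104, 105}, 3: {109}}


def occurrence_table(study_lines):
    """(a, b) -> percentage of study a's lines that also occur in study b."""
    table = {}
    for a, lines_a in study_lines.items():
        for b, lines_b in study_lines.items():
            if a != b and lines_a:
                table[(a, b)] = 100.0 * len(lines_a & lines_b) / len(lines_a)
    return table


def co_occurrence_counts(study_lines):
    """(a, b) -> number of lines evaluated in both studies, for each pair a < b."""
    return {(a, b): len(study_lines[a] & study_lines[b])
            for a, b in combinations(sorted(study_lines), 2)}


print(occurrence_table(STUDIES)[(1, 2)])
print(co_occurrence_counts(STUDIES))
```

Note that the percentage is not symmetric: half of study 1 occurs in study 2, but two thirds of study 2 occurs in study 1. A triangular table therefore needs a stated convention.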

Features required

  • SETGEN: Merge lists
  • WORKBOOK: Download a whole study (or dataset within study)
  • WEB INTERFACE: download whole study (or dataset within study)
  • DMS: Give names to datasets for more meaningful descriptions to help downloads.

Thursday 22nd March 2007

Recap of decisions taken so far:

Queries needed for GP and Characterisation data.

  1. Given a Germplasm List and Clone list -> produce genotype data
  2. Given a germplasm list and trait list and study list -> produce phenotype data
  3. Merge 1 and 2 on GID?
  4. § List management
    1. Access germplasm list from Central
    2. Access germplasm list from text
    3. Access germplasm list from local
    4. Appending and merging (distinct values only) germplasm lists
    5. Management of study lists – introduce folder system for studies
    6. Management of trait lists (ontology)
  5. § Data set download and description of datasets
  6. Content description
    1. Given a germplasm list –
      1. What studies have evaluated lines?
      2. What clones have been genotyped?
      3. What traits have been measured?
    2. Tabulate GID vs study
    3. Tabulate co-occurrence – how many genotypes are evaluated in each pair of studies (for a trait or any or all of a set of traits)
  7. A pedigree view and Download Tool

§ These are a priority for CAGE project at UQ.
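For the list management items, appending and merging with distinct values only can be sketched as an order-preserving union. The one-GID-per-line text format and the function names are assumptions for illustration.

```python
def read_gid_list(text):
    """Parse a text list with one GID per line, ignoring blank lines."""
    return [int(line) for line in text.splitlines() if line.strip()]


def merge_germplasm_lists(*lists):
    """Append lists in order, keeping only the first occurrence of each GID."""
    return list(dict.fromkeys(gid for gids in lists for gid in gids))


central = [101, 102, 103]                  # e.g. a list taken from Central
from_text = read_gid_list("103\n104\n\n101\n")  # e.g. a list read from a text file
print(merge_germplasm_lists(central, from_text))
```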

Molecular selection

Graham McLaren : MOSEL – molecular selection tool to facilitate marker assisted breeding. It will access pedigree, genotype and phenotype data for test and reference lines (ancestors). There are 3 types of neighbourhoods

  1. Progenitors, derivatives and maintenance
  2. Founders: earliest genotyped ancestors from which all other ancestors and test lines have been produced
  3. Genetic info: about loci, provided through a map dataset. Map data may specify QTL and gene information.

Allele Index

LOCUS  ALLELE  FOUNDER  TARGET
1      A       1        Y
1      B       1        N
1      A       2        Y
1      C       2        N
2      M       1        N
2      N       1        N
2      P       2        Y
….
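One way to read the allele index is as a lookup from locus to target alleles and to the founders carrying them. The sketch below uses the rows shown in the index. Interpreting TARGET = Y as "this allele is the target allele at this locus" is an assumption, and the function names are hypothetical rather than part of MOSEL.

```python
# Rows of the allele index as (locus, allele, founder, target flag).
ALLELE_INDEX = [
    (1, "A", 1, "Y"), (1, "B", 1, "N"), (1, "A", 2, "Y"), (1, "C", 2, "N"),
    (2, "M", 1, "N"), (2, "N", 1, "N"), (2, "P", 2, "Y"),
]


def target_alleles(index):
    """locus -> set of alleles flagged as target."""
    targets = {}
    for locus, allele, _founder, flag in index:
        if flag == "Y":
            targets.setdefault(locus, set()).add(allele)
    return targets


def founders_with_target(index):
    """locus -> set of founders carrying a target allele at that locus."""
    carriers = {}
    for locus, _allele, founder, flag in index:
        if flag == "Y":
            carriers.setdefault(locus, set()).add(founder)
    return carriers


print(target_alleles(ALLELE_INDEX))
print(founders_with_target(ALLELE_INDEX))
```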

 

 

 


Selection table (layout not recoverable from the original page). Column headings: GermPID, GID, EntNo, Selection; Target Loci: loci in founder 2 which carry the target (L F2, Lr, Ls, Lt); Founder 3: L1, L2, L3, L4. Row: Test line, marked X.

X = distance of the line to the target.

Additional information: R project (http://cran.at.r-project.org/), Heatplus package.

Graphviz and GGobi (http://www.ggobi.org) – tools for visualisation



Friday 23rd March 2007

Meeting with ITS representative Gavin Fuller

Gavin Fuller from ITS (Information Technology Services) joined the meeting to answer some questions from Thomas about the ITS server environment.

ITS prefer UNIX to Windows and MySQL to PostgreSQL. Gavin is to get back to us on the questions raised.

Additional discussions covered how to synchronise the different levels of Global/Central/Provincial Central/Local databases. We need a system in which differences between databases are detected, extracted and sent to the administrator for review.
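The detect-and-extract step can be sketched as a keyed comparison of two copies of the same table. The record layout (ID mapped to a name) and the function name are hypothetical. A real synchronisation would also need to handle ID ranges, timestamps and conflicting edits, which are not covered here.

```python
def diff_databases(central, local):
    """Compare two {id: record} mappings and report differences for review."""
    added = {k: local[k] for k in local.keys() - central.keys()}
    removed = {k: central[k] for k in central.keys() - local.keys()}
    changed = {k: (central[k], local[k])
               for k in central.keys() & local.keys()
               if central[k] != local[k]}
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical germplasm name records in a central and a local database.
central_db = {1: "Hartog", 2: "Sunco", 3: "Baviacora"}
local_db = {1: "Hartog", 2: "SUNCO", 4: "Yallaroi"}
report = diff_databases(central_db, local_db)
print(report)
```

The resulting report is what would be sent to the administrator, who decides which differences to apply to the central database.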

More on this discussion to be included in ICIS 2007?


R-script for preliminary data analysis
