01&02 Introduction
USING ICIS TO MANAGE INFORMATION IN BREEDING PROGRAMS
INTRODUCTION
International germplasm exchange was the engine of the Green Revolution. In the past, however, much of the important information generated from this exchange was only accessible locally, i.e. in field books or researchers’ files. Although major international initiatives for germplasm collection and conservation followed the Green Revolution, many collected materials are still not used because information related to their characterization remains difficult to access. As a result, the potential impact upon agriculture has not yet been fully realized.
The free exchange of information, through international crop information systems, should now provide the foundation for a Second Green Revolution that adds value to germplasm by seamlessly uniting its conservation, evaluation, utilization and exchange. Furthermore, new technologies in molecular biology and genomics mean that traditional phenotypic information must be linked to large quantities of sequence and genetic information so that functional genomics and allele mining activities can speed up germplasm enhancement.
CIMMYT devised an information strategy and developed software on a mainframe computer during the 1980s to facilitate the unambiguous identification of wheat germplasm, thereby establishing links between information coming from different sources. The read-only International Wheat Information System (IWIS) compact disk (Fox et al., 1996) duplicated data querying capabilities and some of the genealogical diagnostics of the mainframe version. In 1995, CIMMYT and IRRI surveyed other CGIAR centers to establish a project to develop an International Crop Information System (Fox and Skovmand, 1996) applicable to a wide range of crops. Extensive communication among CGIAR centers highlighted the economies to be gained by collaborating on the development of an information system that could be used for many crops.
THE INTERNATIONAL CROP INFORMATION SYSTEM (ICIS)
Several CGIAR centers, National Agricultural Research Systems and Advanced Research Institutes are collaborating to develop ICIS as a generic system that will accommodate all data sources for any crop and breeding system. The vision of ICIS is to integrate different data types in both private and public datasets into a single information system and provide specialist views and applications that operate on the single integrated data platform. After all phases of development are complete, ICIS will support a range of activities, from germplasm conservation, evaluation, functional genomics and allele mining to breeding, testing and release. Data will be accessible from CD-ROM or the WWW, and users may either adopt the complete system or link only to its innovative genealogical features.
The driving force behind ICIS is accessing and sharing data rather than providing analytical and statistical tools. This is because the major bottleneck to intelligent data integration and utilization is not statistical software, but rather the drudgery of finding, extracting, preparing and managing the data. ICIS exports managed data in formats designed to make full use of external statistical software.
ICIS currently comprises:
- a genealogy management component to capture and process historical genealogies as well as to maintain evolving pedigrees, and to provide the basis for unique identification and internationally accepted nomenclature conventions for each crop;
- a data management component for genetic, phenotypic, and environmental data generated by evaluation and testing, as well as for providing links to genomic maps;
- links to Geographical Information Systems (GIS) that can manipulate all data associated with latitude and longitude, e.g. international, regional and national testing programs;
- applications for maintaining, updating and correcting genealogy records and tracking changes and updates;
- applications for producing field books and managing sets of breeding material and for diagnostics such as coefficients of parentage and genetic profiles for planning crosses;
- tools to add new breeding methods, new data fields and new traits;
- tools for submitting data to crop curators and for distributing data updates via CD-ROM and electronic networking;
- project management capabilities, including basic experimental designs, and links to program management and monitoring systems;
- (RICE VERSION) integrated interfaces to a growing range of structural and functional genomics datasets.
Remote Users
One of the innovative features of ICIS is that it permits independent users to integrate their own local data with public central data. ICIS does this by allowing read-only access to the central database for a particular crop and supporting a local copy of the ICIS data model where the local data is stored. Apart from providing user-friendly access to the data, so that crop scientists can make informed decisions, the system provides a local data management system for the user and captures relevant data for the crop. Periodic updates to the central database by users make their data available to all other users as well as browsers of the central database.
The data model and database system of ICIS are designed for maximum flexibility to cater for as wide a range of crops as possible. The model must be independently adopted for a specific crop and data entered to create an independent system for that crop.
Data Model for the Genealogy Management System
The core of ICIS is a common genealogical data model called the Genealogy Management System (GMS) which is generically designed to accommodate a wide range of crops. The functions of GMS are to:
- assign and maintain unique germplasm identification,
- retain and manage information on genealogy and
- manage nomenclature and chronology of germplasm development.
Each germplasm entity is identified by a GERMPLASM_ID (GID). The logical connection between a GID and a packet of seeds or other propagating materials is that different packets which germplasm specialists would not mix get different GIDs. Information on method, location, date of genesis and other attributes is managed through the data model shown in Figure 1. Each germplasm record is linked to its progenitors through their GIDs.
Germplasm is divided into two categories – generative and derivative. Generative germplasm is produced by methods such as crossing or mutation that tend to increase and combine genetic variation. Derivative germplasm is produced by methods such as selection which tend to refine, target and reduce genetic variation or maintain genetic status through management methods such as seed increase or conservation. Germplasm produced by generative methods may have any number of progenitors. Derivative germplasm is derived from a single germplasm source.
Each instance of germplasm, whether generative or derivative, falls into a single germplasm group identified by a group source. For germplasm produced by a generative process (such as an F1 from a cross) or for germplasm of unknown genesis (such as a land race) this group is defined by its own GID. Germplasm produced by derivative or management methods retains the group ID of its source, although the group is often known when the source is not. For example, the cross may be known even when the line source is missing.
Figure 1. Data Model for the ICIS Genealogy Management System (GMS)
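The generative and derivative categories and the group rule described above can be illustrated with a minimal Python sketch. It is not the ICIS schema: the class `Germplasm`, its field names, the `kind` labels and the function `group_id` are hypothetical names chosen for illustration, and walking back through sources when no group is stored is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Germplasm:
    gid: int
    kind: str                                  # "generative", "derivative" or "unknown" (hypothetical labels)
    progenitors: List[int] = field(default_factory=list)  # any number of parent GIDs (generative)
    source: Optional[int] = None               # single source GID (derivative), if known
    group: Optional[int] = None                # group GID, which may be known even when the source is not


def group_id(gid: int, records: Dict[int, Germplasm]) -> Optional[int]:
    """Return the GID that defines the group of `gid`, or None if it cannot be determined."""
    seen = set()
    current: Optional[int] = gid
    while current is not None and current not in seen:
        seen.add(current)                      # guard against accidental cycles in the data
        rec = records.get(current)
        if rec is None:
            return None
        if rec.kind in ("generative", "unknown"):
            return rec.gid                     # a cross or a land race defines its own group
        if rec.group is not None:
            return rec.group                   # group recorded directly on the derivative record
        current = rec.source                   # otherwise inherit the group from the source
    return None


records = {
    1: Germplasm(1, "unknown"),                          # land race
    2: Germplasm(2, "unknown"),                          # land race
    3: Germplasm(3, "generative", progenitors=[1, 2]),   # F1 from a cross
    4: Germplasm(4, "derivative", source=3),             # selection from the F1
    5: Germplasm(5, "derivative", source=4),             # further selection
    6: Germplasm(6, "derivative", group=3),              # cross known, line source missing
}
```

In this sketch, `group_id(5, records)` and `group_id(6, records)` both resolve to the cross with GID 3, while a land race such as GID 1 is its own group.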
Method definitions are stored with complete documentation, including bibliographic references. Some methods depend on parameters, which may vary each time the method is used, such as the number and mixing proportions of parents in the generation of a population. These parameters can be defined by assigning an attribute to the GID of the germplasm being produced. Attributes are flexible and user-definable data fields.
Germplasm gets a multitude of labels during the development and release process. These are all tracked as NAMES in GMS. One name must be identified as the preferred name for display purposes. For a given genetic entity, different preferred names can be used in local and central applications. Names may contain embedded information, and this can be made accessible to application programs for specific name types by specifying a format for the name.
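As an illustration of preferred names and name formats, the following Python sketch stores several names for one GID, picks the preferred one, and uses a per-type format to extract information carried inside a name. The class `Name`, the name types, the regular expression and the example names are all hypothetical; actual ICIS name types and formats are not specified here.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Name:
    gid: int
    name_type: str
    value: str
    preferred: bool = False


# Hypothetical format for one name type: a cross identifier followed by a selection number.
NAME_FORMATS = {
    "LINE_DESIGNATION": re.compile(r"^(?P<cross>[A-Z]+\d+)-(?P<selection>\d+)$"),
}


def preferred_name(gid: int, names: List[Name]) -> Optional[str]:
    """Return the name flagged as preferred for this GID, if any."""
    for name in names:
        if name.gid == gid and name.preferred:
            return name.value
    return None


def parse_name(name: Name) -> Dict[str, str]:
    """Extract the sub-fields of a name whose type has a declared format."""
    pattern = NAME_FORMATS.get(name.name_type)
    if pattern is None:
        return {}                      # no format declared for this name type
    match = pattern.match(name.value)
    return match.groupdict() if match else {}


names = [
    Name(101, "LINE_DESIGNATION", "XX1234-5"),
    Name(101, "RELEASE_NAME", "EXAMPLE 1", preferred=True),
]
```

A local installation could keep its own list of `Name` records with a different preferred flag, which is one simple way to model different preferred names in local and central applications.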
Attributes are text variables used to store information about the genesis, genealogy, nomenclature or chronology of germplasm. Attribute types are defined and described as USER_DEFINED_FIELDS. Like names, attributes may contain embedded information in the form of sub-fields or variables within the attribute text.
Location information is stored to record the origin or destination of germplasm or the location of sites where information or data on germplasm was collected. Locations can be as precise as fields or plots or as large as countries or even regions. Locations are defined by name and can be associated with latitude and longitude points or polygons to allow spatial analysis.
Data Model for the Data Management System
The functions of the Data Management System (DMS) are to:
- store and manage documented and structured data from genetic resource, variety evaluation and crop improvement studies;
- link data to specialized data sources such as GMS, soil and climate databases; and
- facilitate inquiries, searches and data extraction across studies according to structured criteria for data selection.
DMS can accommodate raw observed data, derived data and summary statistics. Data may have continuous or discrete numeric values, text or categorical character values. For example, observations on disease resistance or nutrient efficiency of a genotype can be numerical measurements, scored or calculated indices or text data. More complex forms of data, such as pictures or documents, will also be considered. The basic data model of DMS is shown in Figure 2, which shows the linkages between the entities described below.
A STUDY is the basic, reportable unit of research. It is synonymous with the notions of experiment, nursery or survey. Since DMS must deal with any of these, we will use the term study. A study is characterized by a set of scientific objectives and testable hypotheses and results in the collection of one or more data sets. The division of data into sets is usually motivated by convenience. For example, data collected from different sampling scales is most conveniently treated in different data sets. Similarly, data collected at different times or from different locations are also often treated as different data sets, although it is feasible to treat these divisions in a single data set. The point is that DMS is flexible enough to manage data in all the ways that researchers require.
Figure 2. Data Model for the ICIS Data Management System (DMS)
FACTORS are classifying variables in a study that take values from finite sets of discrete LEVELS. These levels are usually labeled in some way to document the source and context of the data by expressing the conditions under which the data were collected or derived. Examples of this are the names of treatments or design structures applying to the unit or units from which data are recorded, or conditions such as the time and location of measurement. These LABELS are usually listed in columns in the data set. The study itself is treated in the data model as a factor with exactly one level: the study name. Hence, every study has at least one factor. A single factor is often represented by more than one set of LABELS.
Factors are named and described in each study. They have three main attributes – the PROPERTY of the experimental material or survey units being manipulated or stratified, the METHOD or procedure by which the levels are applied, and the SCALE or measurement units in which the levels are expressed. All levels of a particular factor are expressed in the same scale. Names of factors are consistent within studies and equivalent factors are linked across studies through common PROPERTIES. PROPERTY entries are subject to a controlled vocabulary to facilitate this linkage.
Data sources such as field objects or sampling units are identified by combinations of levels of design or sampling factors. Data values such as treatment means are associated with level combinations of treatment factors which do not correspond to field objects but which can be thought of as data sources. Both types of data sources, field objects and treatment combinations, are referred to as OBSERVATION UNITS. Observation units are conceptually equivalent to rows in a serially structured spreadsheet.
Each study involves the recording of data for one or more properties of some observation units. The data being recorded are described by VARIATES and are often represented as data columns in spreadsheets. Each VARIATE has the same three attributes as a FACTOR, i.e. the PROPERTY or trait being measured, the METHOD or procedure by which the value is observed or derived, and the SCALE or measurement units in which the value is expressed. VARIATES are named and described within studies and the name should be consistent throughout a study. The common vocabulary of PROPERTY links VARIATES across studies in the same way that FACTORS are linked across studies.
The central entity in the DMS data model is the DATUM which links to exactly one OBSERVATION UNIT and exactly one VARIATE. The DATUM is conceptually equivalent to a cell in a variate column of a spreadsheet or field book. The most important attribute of a datum is its VALUE, i.e. the recorded value of the associated VARIATE for the associated OBSERVATION UNIT.
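The spreadsheet analogy can be made concrete with a small Python sketch: observation units are rows identified by factor level labels, variates are columns, and each datum fills one cell. The class and field names below are hypothetical and the values are invented for illustration; this is not the ICIS table layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Variate:
    name: str
    property: str   # trait being measured
    method: str     # how the value is observed or derived
    scale: str      # units in which the value is expressed


@dataclass(frozen=True)
class ObservationUnit:
    unit_id: int
    levels: Tuple[Tuple[str, str], ...]   # (factor name, level label) pairs


@dataclass(frozen=True)
class Datum:
    unit_id: int    # links to exactly one observation unit
    variate: str    # links to exactly one variate, by name
    value: float


def to_rows(units: List[ObservationUnit], data: List[Datum]) -> List[Dict[str, object]]:
    """Rebuild spreadsheet-like rows: factor labels first, then variate values."""
    rows = {u.unit_id: dict(u.levels) for u in units}
    for d in data:
        rows[d.unit_id][d.variate] = d.value   # one datum fills one cell
    return [rows[u.unit_id] for u in units]


yield_variate = Variate("YIELD", "grain yield", "plot harvest", "t/ha")
units = [
    ObservationUnit(1, (("STUDY", "DEMO"), ("REP", "1"), ("ENTRY", "A"))),
    ObservationUnit(2, (("STUDY", "DEMO"), ("REP", "1"), ("ENTRY", "B"))),
]
data = [Datum(1, yield_variate.name, 4.2), Datum(2, yield_variate.name, 3.9)]
rows = to_rows(units, data)
```

Note that the study appears here as a factor with a single level, as described for the data model.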
Effects are sets of observation units in a study that are indexed by levels of subsets of the FACTORS in the study. Effects form natural hierarchies according to the nesting of their index-factor subsets. Data associated with different effects may result from data collection at different sampling scales or on different field objects, but they often arise by statistical amalgamation of data values over related units in lower level effects.
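A minimal Python sketch of statistical amalgamation, assuming plot-level data indexed by the hypothetical factors `REP` and `ENTRY`: averaging over replicates yields the observation units of the higher-level `ENTRY` effect, and averaging over everything yields a single study-level unit. The function name `amalgamate` and the data values are illustrative only.

```python
from collections import defaultdict
from statistics import mean

# Plot-level effect: observation units indexed by REP and ENTRY (invented values).
plots = [
    {"REP": 1, "ENTRY": "A", "YIELD": 4.0},
    {"REP": 2, "ENTRY": "A", "YIELD": 5.0},
    {"REP": 1, "ENTRY": "B", "YIELD": 3.0},
    {"REP": 2, "ENTRY": "B", "YIELD": 3.5},
]


def amalgamate(rows, index_factors, variate):
    """Average `variate` over all units sharing the same levels of `index_factors`."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in index_factors)   # level combination of the index factors
        groups[key].append(row[variate])
    return {key: mean(values) for key, values in groups.items()}


entry_means = amalgamate(plots, ["ENTRY"], "YIELD")   # effect indexed by ENTRY only
study_mean = amalgamate(plots, [], "YIELD")           # effect indexed by no factor below the study
```

The nesting of index-factor subsets (`[]` within `["ENTRY"]` within `["REP", "ENTRY"]`) mirrors the hierarchy of effects described above.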
Correcting Data
Corrections and changes will inevitably occur in any database. Only authorised users can make these changes, and all changes are logged so that the sequence of changes can be traced and can be ‘undone’ if required. Such logged functions are increasingly necessary to comply with requirements associated with establishing intellectual property rights over germplasm.
Changes commonly occur when new information about an existing germplasm record is entered into a local database. If the existing record is in the local database, then the local user can complete the changes. But if it is in the central database, then changes cannot be completed until the central database is updated. Verifying and completing requested changes is part of the update process, and sufficient information and justification needs to be recorded in the changes table to allow the process to be completed.
Verification and completion of changes to the central database may take some time, but local users would like to see their changes reflected immediately. This is achieved by software that always checks the local CHANGES table for central changes and applies them at run time for the specific installation where they are recorded.
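The run-time behaviour can be sketched in Python as an overlay: the central record is read but never modified, and any pending local changes for that record are applied to a copy before it is shown to the user. The `Change` fields, the dictionary layout and the function `read_record` are assumptions made for this sketch and do not reflect the actual structure of the CHANGES table.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Change:
    table: str        # table the change applies to
    record_id: int    # record in the central database
    column: str       # field being corrected
    new_value: Any    # value requested by the local user


def read_record(table: str, record_id: int,
                central: Dict[str, Dict[int, Dict[str, Any]]],
                changes: List[Change]) -> Dict[str, Any]:
    """Return the central record with local pending changes applied on a copy."""
    record = dict(central[table][record_id])      # copy, so the central data stays read-only
    for change in changes:                        # apply logged changes in the order recorded
        if change.table == table and change.record_id == record_id:
            record[change.column] = change.new_value
    return record


central = {"GERMPLASM": {1001: {"method": "unknown", "location": "SITE_A"}}}
local_changes = [Change("GERMPLASM", 1001, "method", "single cross")]
view = read_record("GERMPLASM", 1001, central, local_changes)
```

Because the change lives only in the local log, discarding the entry restores the original view, and other installations continue to see the central value until the update is verified and completed.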
User defined data
The ICIS data model is extremely flexible and can accommodate new data types as required. Users are able to define new relationships between germplasm by specifying new breeding methods. They can specify new attributes of the germplasm to be stored, types of names, location descriptors and traits, scales and methods for characterization and evaluation data.
Stand Alone Software Modules
Components of ICIS include a Genealogy Management System (GMS), Set Generation Module (SETGEN) including the External Pedigree Input Tool (EPIT), Field Book Module (FLDBK), Trait Management System (TMS), Data Management System (DMS), a Work Book (WRKBK) for data input and query and a Data Retriever for cross-study data queries (RTV). The first four modules focus on germplasm and management of genealogy and nomenclature. The next three handle the management of evaluation and characterization data, and the Retriever provides access to both raw and processed data.
AN ICIS EXAMPLE: THE INTERNATIONAL RICE INFORMATION SYSTEM (IRIS)
The GMS of the rice implementation of ICIS – the International Rice Information System (IRIS) – stores information on more than one million varieties, breeding lines, and accessions of rice. This allows pedigree analysis to trace germplasm flows and relationships between lines that can be used to improve evaluation estimates or plan crop improvement programs. The IRIS DMS contains 5 million data values from over 500 studies from breeding, screening and international testing trials. This allows integrative analysis over different environments.
Access via the World Wide Web
A WWW interface for ICIS has been developed and deployed for IRIS (http://www.iris.irri.org). This will extend the functionality of ICIS into a broader range of biological datasets including molecular genetic (QTL), genome sequence, mutant, gene expression array, proteomics and allele mining data. Each class of data is accessed through a specialized “view” which queries and displays the target data type in its biological context and cross-references it to related data including traditional germplasm and evaluation data.
Special effort is being made to incorporate controlled vocabularies and ontologies, in particular those from the Gene Ontology Consortium (www.geneontology.org) and Plant Ontology Consortium (www.plantontology.org), to capture traits and phenotyping. One particular context for exploiting such ontologies is the “MutantView”, with which researchers can specify mutant phenotypes as a means of identifying specific mutant stocks. Entries in the IRIS GeneView are cross-indexed to GO ontologies where feasible.
LINKS TO GLOBAL CROP INFORMATICS RESOURCES
Other databases of agricultural information have close links with IRIS (or ICIS in general). The System-wide Information Network for Genetic Resources (SINGER; http://www.singer.cgiar.org) aims to provide global access to genetic resources data across all CGIAR mandated crops and commodities. Germplasm records in ICIS that relate to accessions in the CGIAR collections are linked with SINGER so that ICIS can provide information on the utilization and deployment of those genetic resources. Each individual ICIS database manages data for a particular crop and generally does not share data with other crops. The integrating role for genetic resources information across crops is played by SINGER. IRIS also collaborates with the USDA-sponsored GrainGenes and Gramene database initiatives, with the International Rice Genome Sequencing Project (RGP, TIGR and other partners), with Oryzabase (Japanese National Institute of Genetics) and with the Beijing Genome Institute. Linking these implementations to ICIS will provide access to sequence information and molecular maps, which will facilitate functional genomics and allele mining and integrate information across crops at the genomic end of the spectrum.
ICIS SOFTWARE AVAILABILITY
ICIS is freely available and the ICIS System, including WWW site components, is being developed under the open source software model. Where possible, the IRIS genomic and WWW interface components are being developed by the adoption of public open source database schemata and software components such as those from the GO consortium (e.g. the GO database schemata and the Amigo browser), GMOD (gbrowse and cmap at http://www.gmod.org) and the Open Bioinformatics Foundation (www.open-bio.org).
To promote the open and collaborative spirit of the project, a community project site was established for the ICIS project at the bioinformatics.org WWW site, which also provides email list services for discussion groups about ICIS. Community participation is also promoted by open annual ICIS workshops. The latest information about the project may be accessed at www.icis.cgiar.org.
REFERENCES
Fox, P.N., Lopez, C., Skovmand, B., Sanchez, H., Herrera, R., White, J.W., Duveiller, E. and van Ginkel, M. 1996. International Wheat Information System (IWIS), Version 1. Mexico, D.F.: CIMMYT. On Compact Disk.
Fox, P.N. and Skovmand, B. 1996. The International Crop Information System (ICIS) - Connects Genebank to Breeder to Farmer’s Field. In Plant Adaptation and Crop Improvement (eds M. Cooper and G.L. Hammer), CAB International.
McLaren, C.G., Bruskiewich, R.M., Portugal, A.M. and Cosico, A.B. 2005. The International Rice Information System: a platform for meta-analysis of rice crop data. Plant Physiol. 139 (2): 637-642.