Indian Genome Variation Consortium: Phase 1


In the first phase of validation, we determined the extent of genetic diversity and heterogeneity prevalent in the Indian population. For this purpose, populations were categorized by geographical zone and linguistic category, and at least two populations of contrasting size were selected and prioritized from each category wherever available (See Population details). We identified 55 subpopulations and decided to collect, on average, 40 samples from each in the first phase. Individual samples from 31 of these are also represented in our discovery panel (See Composition of Discovery panel). In addition, 8 large, 1 isolated and 2 special populations are included in this validation panel to maximize representation of the people of India. The diversity information generated from the first phase of validation will be used as a criterion for further collection of samples to attain the target of 15,000.

Identification of populations

The project aims to provide a SNP database from well-defined ethnic groups chosen to represent the entire spectrum of diversity within the Indian population. Considering the population diversity (See Diversity of Indian Population), two issues had to be addressed: first, defining a population substructure that captures the entire genetic diversity; and second, composing a small panel of samples for SNP discovery that would ensure representation of SNPs from the entire Indian population. For this purpose the project has been carried out in two steps: discovery of SNPs on a small panel of 43 samples (See Discovery panel), followed by estimation of their frequency in a larger set of samples, which constitutes the validation panel (See Description of Validation Panel). This, we felt, would give an estimate of the genetic heterogeneity and so help us in further substructuring of the population.

Composition of the Discovery Panel

With a view to discovering novel SNPs as well as determining the presence and frequency of reported SNPs in the Indian population, an initial panel was assembled comprising representatives drawn from 43 different subpopulations (See Map of Discovery Panel). This discovery panel included samples, both tribal and non-tribal, from diverse geographical zones and linguistic backgrounds to maximize novel SNP discovery. Though 43 individuals do not by themselves represent the entire Indian population, such a diverse set does increase the heterogeneity available for SNP discovery compared with a set of samples from a single subpopulation.

Composition of the validation panel

The populations for the validation panel have been identified based on the following criteria: geographical zone, linguistic group, practice of endogamy, presence of minority communities from different religious groups, and existence of populations of different sizes. Four major linguistic lineages, namely Indo-European, Dravidian, Tibeto-Burman and Austro-Asiatic, have been considered. We also categorized populations as small if their size was <1 million and large if >10 million. This strategy has been followed to ensure maximum coverage of the Indian population as well as to capture minor alleles in large outbred populations.
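
The size categories can be written as a small rule. The following is a minimal sketch in Python, not the consortium's own tooling. The function name `size_category` and the label "intermediate" for populations between 1 and 10 million are assumptions, since the text names only the small and large classes.

```python
def size_category(population_size: int) -> str:
    """Classify a population by size using the thresholds given in the text."""
    if population_size < 1_000_000:
        return "small"
    if population_size > 10_000_000:
        return "large"
    # Assumption: the source does not name this middle band.
    return "intermediate"
```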

Sample Collection

The identification of populations and the collection of samples have been carried out with the help of trained anthropologists, social workers and community health workers, as their participation is essential for establishing rapport with the general public. Individuals fluent in the local language of each population have also been consulted and actively involved in the study, both to obtain the fullest and most authentic information from the donors and to help donors better understand the purpose of the investigation. Endogamy of the populations has been established from extensive information on marriage patterns, gathered through pedigrees, interviews with family members of the donor, and published literature. A general template is used to obtain informed consent from donors; where the donor is illiterate, a thumb impression is taken. In addition, verbal tape-recorded consent of the donors has been taken. It has been ensured that the individuals are unrelated at least to the first cousin level. All the institutes have participated in the collection of samples, with three nodal centres, IGIB, CCMB and IICB, connected to the other centres. IGIB, CDRI, IMTECH and ITRC have ensured collection of samples from the northern and central parts of India, IGIB and CCMB from the western part, IICB from the eastern part and CCMB from the southern part of the country.


Ethical Clearance

Each institute has obtained prior ethical clearance from the Institutional Bioethics Committee (IBC) for the collection of samples, following the guidelines of the Indian Council of Medical Research (ICMR), for the complete period of 5 years. A uniform bar-coded detailed questionnaire has been developed, containing information on ethnicity, family history of diseases and other phenotypic traits of the sample donor. Prior to sample collection, it is explained to the participants that the personal identifiers in the questionnaire are confidential and are not available to the researchers, and that the samples are irretrievably coded. It is also explained to the volunteers that the project aims at understanding the extent of variability and diversity in different subpopulations and that the basal data generated in this study would be used for disease-specific association studies. We also ensure that participation is entirely voluntary and that no material promises are made to the donors. Nor is any promise of a genetic test made.

Managing ethical issues

Although the project will include no personal identifiers, each sample is identifiable through a sample code as well as a population code. There is a provision for the volunteers to withdraw from the study at will. Naming the population with a particular set of tag SNPs would allow better interpretation of biological significance in future studies of association, population history and population relatedness, but it also has important ethical and social ramifications. To avoid any social backlash that could destabilize the very fabric of Indian society, i.e. unity in diversity, a decision was taken against disclosing the identity of the populations. This is because the way a population is labeled in this project and described in publications will have implications for all members of the population, as all of them (and all members of closely related populations) might be affected by the interpretation and use of findings of future studies. The samples collected from different populations are bar-coded, with each population given a specific code revealing its linguistic affinity, the geographic zone to which it belongs, and the type of population, viz. large endogamous population, isolated population or special population.
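
To make the coding idea concrete, here is a hypothetical sketch. The source says only that a code reveals linguistic affinity, geographic zone and population type. The hyphen-separated layout, the abbreviations and the `PopulationCode` class below are illustrative assumptions, not a scheme stated in this document.

```python
from dataclasses import dataclass

# Hypothetical abbreviations, chosen for illustration only.
LINGUISTIC = {"IE": "Indo-European", "DR": "Dravidian",
              "TB": "Tibeto-Burman", "AA": "Austro-Asiatic"}
POP_TYPE = {"LP": "large endogamous population",
            "IP": "isolated population",
            "SP": "special population"}

@dataclass(frozen=True)
class PopulationCode:
    linguistic: str
    zone: str
    pop_type: str
    serial: int

def parse_population_code(code: str) -> PopulationCode:
    """Split an assumed code such as 'IE-N-LP1' into its descriptive parts."""
    lang, zone, tail = code.split("-")   # raises ValueError if not three parts
    kind, serial = tail[:2], tail[2:]
    if lang not in LINGUISTIC or kind not in POP_TYPE or not serial.isdigit():
        raise ValueError(f"unrecognised population code: {code!r}")
    return PopulationCode(LINGUISTIC[lang], zone, POP_TYPE[kind], int(serial))
```

A code of this kind describes a population's category without naming the population itself, which is the property the decision above relies on.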

Strategy and methods for marker discovery and validation

In the first phase of the project, screening for novel SNPs is carried out in 75 genes on the discovery panel of 43 samples. For this purpose, amplicons were generated in exonic regions spanning nearly the entire gene. This not only provides data on the SNPs shared among the different Indian subpopulations and between Indian and other world populations, but also reveals population-specific indigenous SNPs. In addition, the 43 diploid samples provide data on 86 chromosomes, enabling identification of SNPs with an overall minor allele frequency (MAF) >0.05 for further validation.
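
The arithmetic behind the 86 chromosomes and the 0.05 cut-off can be checked with a short sketch. It assumes biallelic autosomal SNPs and unphased diploid genotypes written as two-letter strings; the function names and the example genotypes are hypothetical.

```python
import math
from collections import Counter

def minor_allele_frequency(genotypes):
    """Return (minor_allele, frequency) for diploid genotypes such as 'AG'."""
    counts = Counter(allele for g in genotypes for allele in g)
    if len(counts) == 1:          # monomorphic site: no minor allele observed
        return None, 0.0
    if len(counts) != 2:
        raise ValueError("sketch assumes a biallelic SNP")
    minor, n_minor = min(counts.items(), key=lambda kv: kv[1])
    return minor, n_minor / sum(counts.values())

def min_copies_for_maf(n_samples, threshold=0.05):
    """Smallest minor-allele count whose frequency strictly exceeds threshold."""
    n_chromosomes = 2 * n_samples   # two chromosomes per diploid sample
    return math.floor(threshold * n_chromosomes) + 1
```

With 43 samples, `min_copies_for_maf(43)` gives 5: an allele must appear on at least 5 of the 86 chromosomes (5/86 ≈ 0.058) to pass the MAF >0.05 filter, whereas 4 copies (4/86 ≈ 0.047) would not.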

For the discovery of novel SNPs, bi-directional sequencing of the 43 samples of the discovery panel was carried out. A few selection criteria were developed for prioritizing SNPs for validation, based on data on novel and putatively functional SNPs as well as the minor allele frequencies of the SNPs in the discovery panel. Information on SNP frequencies in databases such as dbSNP, Celera, RealSNP and HapMap, along with information on haplotype block structures and tag SNPs, was taken into consideration during selection. The spacing between selected SNPs within a gene, though flexible, was adjusted according to the size of the gene so as to cover the entire gene uniformly. After this series of filters, remaining gaps were filled, if required, with SNPs reported in the databases, chosen on validation criteria such as multiple submissions.

In the first phase of the project, novel and reported SNPs from 75 genes are being validated on 1,871 samples collected from different populations using the Sequenom MassARRAY system. Validation is being carried out in 2 steps: initial confirmation of SNPs in population pools, followed by estimation of frequency in the individual samples. These data would give us insights for further identification of informative SNPs and substructuring of the population, which would enable a judicious collection of the samples.
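
The two-step flow can be sketched as follows, under stated assumptions: a SNP counts as confirmed when the alternate allele is detected in at least one population pool, and frequencies are then estimated by allele counting in individual genotypes. The data structures, function names and the detection rule are hypothetical and do not describe the MassARRAY software.

```python
from collections import defaultdict

def confirmed_in_pools(pool_calls):
    """Step 1: keep a SNP if any population pool shows the alternate allele.

    pool_calls maps a pool name to True/False (alternate allele detected).
    """
    return any(pool_calls.values())

def allele_frequency_by_population(records, allele):
    """Step 2: estimate the frequency of `allele` per population.

    records is an iterable of (population, genotype) pairs, where genotype
    is a two-letter string such as 'AG'; missing calls are given as None.
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for population, genotype in records:
        if genotype is None:      # skip failed genotype calls
            continue
        hits[population] += genotype.count(allele)
        total[population] += len(genotype)
    return {pop: hits[pop] / total[pop] for pop in total}
```

Screening pools first limits individual genotyping to SNPs that are actually polymorphic in the sampled populations, which is presumably why the confirmation step comes first.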

In the second phase of the project, which involves genotyping a large number of SNPs in the reference population, Affymetrix and Illumina platforms are being used.


Data release policy of the Indian Genome Variation project

It is envisaged that the Indian Genome Variation project would eventually be useful for identifying predisposing haplotypes for common and complex disorders, or common functional polymorphisms that might be useful for pharmacogenomic studies. It would be a resource that catalogues the common patterns of genetic variation in important complex disease candidate genes. There is provision for widening the scope of the project as more information on human genome variation becomes available, together with additional information on patterns of linkage disequilibrium and the development of cost-effective high-throughput technologies. Though nearly 11 million SNPs have been released in the public database and the HapMap data are available, the selection of the appropriate set of markers for identifying susceptibility haplotypes for different complex genetic diseases is still debated. Moreover, these databases do not include Indian samples. In the Indian Genome Variation consortium, there is also a provision for parallel research to determine factors that could lead to the generation of informative repeats and SNPs suitable for designing case-control association studies. These inputs can be incorporated during the development of the SNP database of the Indian population.

The portal will be freely available to all academic users around the world. However, the discoveries arising out of the IGV project will be IPR protected and will be licensed for commercial exploitation.


Website copyright Institute of Genomics and Integrative Biology. All Rights Reserved. The servers are free for academic use. Please contact IPR Cell for commercial use. No part of this should be downloaded or used in any way without prior permission of the Director, IGIB.