Log-transformation and GWAS

Before doing any GWAS (genome-wide association study) it is necessary to check for the normality of the phenotypic distribution. If the phenotype is normally distributed only once it is log-transformed, what phenotypic data do I have to use while doing the GWAS? The non-transformed one or the log-transformed?

Yes, use the log-transformed phenotype. If you want to model a normally distributed error term, and normality holds only after log-transformation, then you must log-transform the phenotype.

The phenotype should be distributed in the same manner as the error term in your regression model of the GWAS.

$pheno = \beta_1 X_D + \beta_2 X_A + \epsilon$

$\epsilon \sim \mathcal{N}(\mu, \sigma)$

By the central limit theorem, many phenotypes are approximately normally distributed. However, there are cases where a Bernoulli distribution is correct (for example, a binary case/control phenotype), and a logistic regression should be used instead.
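The advice above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the phenotype is simulated as log-normal with a single additive SNP effect (the effect size 0.3, noise level 0.5 and sample size are arbitrary choices, not values from the question), and skewness of the regression residuals is used as a crude stand-in for a formal normality test.

```python
import math
import random


def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x, y):
    """Residuals of the OLS fit of y on x."""
    slope, intercept = ols_slope_intercept(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def skewness(values):
    """Sample skewness (third standardized moment); near 0 for normal data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(1)
# Hypothetical additive genotypes (0, 1 or 2 minor alleles).
genotypes = [rng.choice([0, 1, 1, 2]) for _ in range(2000)]
# Simulated phenotype that is normal only on the log scale.
pheno = [math.exp(0.3 * g + rng.gauss(0.0, 0.5)) for g in genotypes]
log_pheno = [math.log(p) for p in pheno]

skew_raw = skewness(residuals(genotypes, pheno))      # strongly right-skewed
skew_log = skewness(residuals(genotypes, log_pheno))  # close to symmetric
beta_log, _ = ols_slope_intercept(genotypes, log_pheno)

print(f"residual skewness, raw phenotype: {skew_raw:.2f}")
print(f"residual skewness, log phenotype: {skew_log:.2f}")
print(f"estimated additive effect on the log scale: {beta_log:.2f}")
```

In this setting the residuals of the raw-phenotype regression are clearly skewed, while those of the log-phenotype regression are not, which is the situation where the log-transformed phenotype matches the normal error term of the model.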


MF techniques infer low-dimensional structure from high-dimensional omics data to enable visualization and inference of complex biological processes (CBPs).

Different MFs applied to the same data will learn different factors. Exploratory data analysis should employ multiple MFs, whereas a specific biological question should employ a specific MF tailored to that problem.

MFs learn two sets of low-dimensional representations (in each matrix factor) from high-dimensional data: one defining molecular relationships (amplitude) and another defining sample-level relationships (pattern).

Data-driven functional pathways, biomarkers, and epistatic interactions can be learned from the amplitude matrix.

Clustering, subtype discovery, in silico microdissection, and timecourse analysis are all enabled by analysis of the pattern matrix.

MF enables both multi-omics analyses and analyses of single-cell data.

Omics data contain signals from the molecular, physical, and kinetic inter- and intracellular interactions that control biological systems. Matrix factorization (MF) techniques can reveal low-dimensional structure from high-dimensional data that reflect these interactions. These techniques can uncover new biological knowledge from diverse high-throughput omics data in applications ranging from pathway discovery to timecourse analysis. We review exemplary applications of MF for systems-level analyses. We discuss appropriate applications of these methods, their limitations, and focus on the analysis of results to facilitate optimal biological interpretation. The inference of biologically relevant features with MF enables discovery from high-throughput data beyond the limits of current biological knowledge – answering questions from high-dimensional data that we have not yet thought to ask.
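To make the amplitude/pattern terminology concrete, here is a minimal sketch of one MF variant, non-negative matrix factorization with multiplicative updates, written in plain Python. The choice of this particular algorithm, the toy data matrix, the rank and the iteration count are all assumptions made for illustration; the text above does not prescribe a specific method.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def squared_error(data, amplitude, pattern):
    """Squared Frobenius distance between data and amplitude x pattern."""
    approx = matmul(amplitude, pattern)
    return sum((data[i][j] - approx[i][j]) ** 2
               for i in range(len(data)) for j in range(len(data[0])))


def nmf(data, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor data (genes x samples) into amplitude (genes x rank)
    and pattern (rank x samples) using multiplicative updates."""
    rng = random.Random(seed)
    n_rows, n_cols = len(data), len(data[0])
    amplitude = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    pattern = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # pattern <- pattern * (A^T D) / (A^T A P)
        a_t = transpose(amplitude)
        numer = matmul(a_t, data)
        denom = matmul(matmul(a_t, amplitude), pattern)
        pattern = [[pattern[k][j] * numer[k][j] / (denom[k][j] + eps)
                    for j in range(n_cols)] for k in range(rank)]
        # amplitude <- amplitude * (D P^T) / (A P P^T)
        p_t = transpose(pattern)
        numer = matmul(data, p_t)
        denom = matmul(amplitude, matmul(pattern, p_t))
        amplitude = [[amplitude[i][k] * numer[i][k] / (denom[i][k] + eps)
                      for k in range(rank)] for i in range(n_rows)]
    return amplitude, pattern


# Toy "expression" matrix: 4 genes x 6 samples with two sample groups.
data = [
    [5.0, 5.0, 5.0, 0.0, 0.0, 0.0],
    [4.0, 4.0, 4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0, 3.0, 3.0],
    [0.0, 0.0, 0.0, 6.0, 6.0, 6.0],
]
init_amplitude, init_pattern = nmf(data, rank=2, n_iter=0)
amplitude, pattern = nmf(data, rank=2, n_iter=200)
error_init = squared_error(data, init_amplitude, init_pattern)
error_final = squared_error(data, amplitude, pattern)
print(f"squared error before/after fitting: {error_init:.3f} / {error_final:.6f}")
```

Rows of `amplitude` describe how strongly each gene contributes to each factor, and rows of `pattern` describe how each factor varies across samples. A different MF method, or even a different random seed, can yield different factors.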


  • 1 Department of Cell Biology, 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Tianjin Medical University, Tianjin, China
  • 2 Center for Applied Genomics, Children’s Hospital of Philadelphia, Philadelphia, PA, United States
  • 3 Division of Human Genetics, Children’s Hospital of Philadelphia, Philadelphia, PA, United States
  • 4 Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States

Juvenile idiopathic arthritis (JIA) is the most common chronic rheumatic disease among children and can cause severe disability. Genomic studies have discovered a substantial number of risk loci for JIA; however, the mechanism by which these loci affect JIA development is not fully understood. Neutrophils are an important cell type involved in autoimmune diseases. To better understand the biological function of genetic loci in neutrophils during JIA development, we took an integrated multi-omics approach to identify target genes at JIA risk loci in neutrophils and constructed a protein-protein interaction network via a machine learning approach. We identified genes that are likely targets of JIA risk loci in neutrophils and could contribute to JIA development.



We thank M. Picciotto, C. Pittenger and A. Che for critically reading the manuscript. We thank R. Terwilliger for technical assistance. We are grateful to the families who donated to this research. This work was supported with resources and use of facilities at the VA Connecticut Health Care System, West Haven, CT, the Central Texas Veterans Health Care System, Temple, TX, the Durham VA Healthcare System, Durham, NC, the VA San Diego Healthcare System, La Jolla, CA, the VA Boston Healthcare System, Boston, MA, USA, and the National Center for PTSD, US Department of Veterans Affairs. The research reported here was supported by the Department of Veterans Affairs, Veteran Health Administration, VISN1 Career Development Award and a Brain and Behavior Research Foundation Young Investigator Award to M.J.G. and by NIMH grants MH093897 and MH105910 to R.S.D. The views expressed here are those of the authors and do not necessarily reflect the position or policy of the Department of Veterans Affairs (VA) or the US government.

Materials and Methods

Parasite collection and culture adaption

Plasmodium falciparum-infected blood samples were collected during the years 2004–2011 from febrile patients attending local clinics located along the China-Myanmar border. Uncomplicated P. falciparum malaria was identified by microscopy. The study was approved by the Institutional Review Boards of Penn State University, Kunming Medical University and Department of Health of Kachin, and informed consent was obtained from all adult participants and from the parent/guardian of children. The collection and processing of venous blood samples were performed in accordance with the Appropriate Technology in Health (PATH) guidelines. A total of 94 parasite isolates (2 in 2004, 8 in 2007, 15 in 2008, 58 in 2009, 8 in 2010 and 4 in 2011), determined to be monoclonal infections by genotyping three polymorphic antigen markers (msp1, msp2 and glutamate-rich protein), were retrieved from an archive of culture-adapted parasites and used for in vitro drug assay and genotyping. Routine cultures of the parasites were maintained in type O+ human red blood cells in complete medium supplemented with 6% human AB serum under an atmosphere of 90% N2/5% O2/5% CO2.

In vitro drug assay and IC50 calculation

The standard SYBR Green I-based fluorescence assay was used to assess parasite susceptibilities to 10 antimalarial drugs: DHA, AS, AM, PND, LMF, PPQ, MQ, CQ, QN and SP (S:P = 20:1, w/w), of which CQ and SP served as internal controls to validate the analysis. DHA, AS, AM, MQ, CQ, QN, and SP were purchased from Sigma (St. Louis, MO, USA), while PND, LMF and PPQ were obtained from Kunming Pharmaceutical Co. (Kunming, Yunnan, China). The stock solution of AM was dissolved in dimethyl sulfoxide (DMSO), and SP in 40% DMSO. Others were prepared as previously described 10. In vitro cultures were synchronized with two rounds of 5% D-sorbitol treatment, and ring-stage parasites were assayed for drug sensitivity in 96-well microtiter plates at 1% hematocrit and 0.5% parasitemia. Three biological replicates were tested for each isolate and each drug concentration was repeated twice. To reduce the variation between plates, the standard laboratory clone 3D7 was always included as a reference. RSA was performed to measure susceptibility of ring-stage parasites to DHA following the INV10 Standard Operating Procedure 10. IC50 was measured using a non-linear regression model in GraphPad Prism 5 (GraphPad Software, Inc., La Jolla, CA, USA). Normal distribution of the assay values was tested by the Shapiro-Wilk test. Geometric mean of the IC50 and 95% confidence interval were calculated for normally distributed data. Median and interquartile range were calculated if the data were not normally distributed. The correlation between drug assays was assessed by Spearman's correlation coefficients in R. Student's t-test was applied to investigate potentially significant differences in mean assay values between the field isolates and 3D7.

SNP array and data imputation

Parasite genomic DNA was extracted from cultured isolates using the Wizard® Genomic DNA Purification Kit (Promega, WI, USA). Genomic DNA was genotyped using a high-density Affymetrix SNP array that allows interrogation of over 17,000 SNPs; genotyping was performed at the RML Genomics Unit, Research Technologies Branch, NIAID. The BRLMM-P algorithm was used to call the SNPs as previously reported 32. Genotypes were transferred to a spreadsheet for further analyses. SNPs within var, rifin and stevor genes were excluded to reduce artifacts due to duplicated sequences. SNPs with call rates below 90% were also removed. The SNP data were then imputed using the software BEAGLE v3.3.2 33. This dataset was further filtered in PLINK 1.9 to remove multi-allelic SNPs and low-frequency SNPs with MAF <2%.

LD measurement and population structure determination

The genome-wide pairwise LD was measured by PLINK 1.9, setting the --ld-window-r2 flag to 0 to include all pairs of SNPs. Setting this flag to 0.3 was used to report only the SNP pairs with R² > 0.3. LD values were plotted using the R software suite. Population structure was investigated by PCA based on the variance-standardized relationship matrix using PLINK 1.9.
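As an illustration of the quantity being thresholded, the following sketch computes the squared correlation between two biallelic SNPs from haploid calls coded 0/1, then applies a 0.3 cutoff. It is a simplified assumption-laden example: the function name and toy data are hypothetical, missing calls are not handled, and it is not a reimplementation of how PLINK computes LD.

```python
def haploid_r2(snp_a, snp_b):
    """Squared correlation between two biallelic SNPs.

    Each argument is a list of 0/1 allele calls, one per isolate,
    in the same isolate order. Returns NaN if either SNP is monomorphic.
    """
    if len(snp_a) != len(snp_b) or not snp_a:
        raise ValueError("SNP vectors must be non-empty and of equal length")
    n = len(snp_a)
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


# Hypothetical calls for three SNPs across eight isolates.
calls = {
    "snp1": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp2": [0, 0, 0, 0, 1, 1, 1, 1],  # identical to snp1
    "snp3": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of snp1
}
names = sorted(calls)
pairs_above_threshold = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if haploid_r2(calls[a], calls[b]) > 0.3
]
print(pairs_above_threshold)
```

With a threshold of 0 every pair would be reported, which mirrors the distinction between the two settings described above.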

Phenotypes were transformed by natural logarithm and by rank-based inverse-normal transformation. To account for zeros in the RSA results, 0.5% was added to all values before log-transformation. GWAS was performed using multiple software packages: GEMMA, PLINK 1.9, and linear and nonparametric regression in R. In addition, the newly developed WarpedLMM, which estimates an optimal transformation of the phenotypes, was also applied to the analysis. For PLINK, R and WarpedLMM, the top three principal components were included as covariates to correct the inflation and reduce the influence of population structure. GEMMA estimates genetic relatedness from genotypes and automatically adjusts for the inflation. P values were obtained from each model, and a threshold after Bonferroni correction (0.05/number of SNPs analyzed) was used to assess genome-wide significance. Q-Q plots of P values were used to evaluate the robustness of different models in minimizing inflation due to population stratification. The genomic inflation factor lambda (λ) was calculated for each drug susceptibility phenotype with the R package snpStats. Manhattan plots were generated by modified scripts from the R package CMplot.
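The transformations and summary statistics in this paragraph can be sketched in a few lines of Python. Several details are assumptions, since the text does not give them: the rank-based inverse-normal transform uses the Blom offset c = 3/8 with average ranks for ties, and λ is taken as the median 1-df chi-square statistic divided by 0.4549364 (the median of that distribution). The example values are hypothetical, and this is not the code of snpStats or any of the packages named above.

```python
import math
from statistics import NormalDist, median


def log_transform(values, offset=0.0):
    """Natural log after adding an offset (e.g. 0.5 for RSA percentages with zeros)."""
    return [math.log(v + offset) for v in values]


def rank_inverse_normal(values, c=3.0 / 8.0):
    """Rank-based inverse-normal transform; ties receive their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]


def bonferroni_threshold(n_tests, alpha=0.05):
    """Significance threshold alpha / number of SNPs analyzed."""
    return alpha / n_tests


def genomic_inflation(p_values):
    """Lambda: median observed 1-df chi-square over its expected median."""
    normal = NormalDist()
    chi2 = [normal.inv_cdf(p / 2) ** 2 for p in p_values]  # requires 0 < p <= 1
    return median(chi2) / 0.4549364


rsa_percent = [0.0, 1.2, 3.5, 0.0, 10.0]          # hypothetical RSA survival values
logged = log_transform(rsa_percent, offset=0.5)
normal_scores = rank_inverse_normal([3.0, 1.0, 2.0])
threshold = bonferroni_threshold(10000)           # hypothetical SNP count
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lam = genomic_inflation(uniform_p)
print(logged, normal_scores, threshold, round(lam, 3))
```

For P values that follow the null (uniform) distribution, λ should be close to 1; values well above 1 indicate inflation of the kind the principal-component covariates are meant to reduce.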

SNP diversity analysis

Genetic diversity at resistance loci was evaluated by the average heterozygosity. We calculated the expected heterozygosity using the equation $H_e = \frac{n}{n-1}\left(1 - \sum_i p_i^2\right)$, where $H_e$ is the expected heterozygosity, $n$ is the sample size and $p_i$ is the frequency of the $i$th allele. The heterozygosity values were averaged by gene; if a gene contained only one SNP, or a SNP had no annotation, the SNP was grouped with the neighboring gene within 10 kb.
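A minimal sketch of the per-SNP calculation follows. It assumes the common sample-size-corrected estimator, He = n/(n − 1) × (1 − Σ p_i²), which matches the symbols defined above (He, n, p_i); the function name and the allele counts are hypothetical.

```python
def expected_heterozygosity(allele_counts):
    """Expected heterozygosity from allele counts at one SNP.

    allele_counts: number of isolates carrying each allele.
    Uses the correction factor n / (n - 1), with n the sample size.
    """
    n = sum(allele_counts)
    if n < 2:
        raise ValueError("need at least two observations")
    sum_sq_freq = sum((count / n) ** 2 for count in allele_counts)
    return n / (n - 1) * (1.0 - sum_sq_freq)


# Hypothetical SNPs in one gene, each typed in 94 isolates.
snp_counts = [[47, 47], [90, 4], [94, 0]]
per_snp = [expected_heterozygosity(counts) for counts in snp_counts]
gene_average = sum(per_snp) / len(per_snp)  # values averaged by gene
print(per_snp, gene_average)
```

A SNP with two equally common alleles gives a value just above 0.5, while a monomorphic SNP gives 0.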

Recent directional selection

Positive selection at resistance loci was examined by EHH in the R package rehh 60, a method that detects the transmission of an extended haplotype without recombination. The SNPs at S220A in pfcrt, C59R in pfdhfr and T38N in pfatg18 served as the focal SNPs in the EHH analysis. The iHS, which is also implemented in rehh, was used to identify genome-wide long-range directional selection. The iHS is the standardized log-ratio of the integrated extended-haplotype homozygosity for the ancestral and derived alleles at a core SNP, with large positive values indicating long haplotypes carrying the ancestral allele and large negative values indicating long haplotypes carrying the derived allele. Both extreme positive and extreme negative iHS scores are potentially interesting, since ancestral alleles that hitchhike with a selected site can also have large positive scores 45. Therefore, we used the absolute values of iHS to capture unusually long haplotypes surrounding both types of alleles.

4 Genome-wide association analysis

4.1 Association analysis of typed single nucleotide polymorphisms (step 7)

Association analysis typically involves regressing a given trait on each SNP separately, adjusted for patient-level clinical, demographic, and environmental factors. The assumed underlying genetic model of association for each SNP (e.g., dominant, recessive, or additive) will impact the resulting findings; however, because of the large number of SNPs and the generally uncharacterized relationships to the outcome, a single additive model is typically selected. In this case, and as illustrated in the code provided, each SNP is represented as the corresponding number of minor alleles (0, 1, or 2). Notably, coding SNP variables based on alternative models (e.g., dominant or recessive) is straightforward, and the association analysis described proceeds identically 26, 27. In practice, a Bonferroni-corrected genome-wide significance threshold of 5 × 10⁻⁸ is used for control of the family-wise error rate. This cutoff is based on research suggesting approximately one million independent SNPs across the genome (e.g., 28), so it tends to be applied regardless of the actual number of typed or imputed SNPs under investigation.
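The genotype codings and the origin of the 5 × 10⁻⁸ cutoff can be sketched as follows. The function name is hypothetical, and the snippet is written in Python purely for illustration; it is not the code that accompanies the tutorial.

```python
def encode_genotype(minor_allele_count, model="additive"):
    """Recode a genotype given as the number of minor alleles (0, 1 or 2)."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1 or 2")
    if model == "additive":
        return minor_allele_count                    # 0, 1, 2
    if model == "dominant":
        return 1 if minor_allele_count >= 1 else 0   # one copy is enough
    if model == "recessive":
        return 1 if minor_allele_count == 2 else 0   # two copies required
    raise ValueError(f"unknown model: {model}")


genotypes = [0, 1, 2, 1, 0]
additive = [encode_genotype(g, "additive") for g in genotypes]
dominant = [encode_genotype(g, "dominant") for g in genotypes]
recessive = [encode_genotype(g, "recessive") for g in genotypes]

# Bonferroni correction of alpha = 0.05 over roughly one million independent SNPs.
genome_wide_threshold = 0.05 / 1_000_000
print(additive, dominant, recessive, genome_wide_threshold)
```

Whichever coding is chosen, the recoded vector enters the regression in the same way, which is why the downstream analysis is unchanged.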

In our setting, two genotyped SNPs in the cholesteryl ester transfer protein (CETP) gene region, rs1532625 and rs247617, are suggestive of association (p < 5 × 10⁻⁶) with respective p-values of 8.92 × 10⁻⁸ and 1.25 × 10⁻⁷. CETP is a well-characterized gene that has been associated previously with HDL-C (e.g., 26). More information on these SNPs and the process of post-analytic interrogation is provided in steps 9 and 10 later.

4.2 Association analysis of imputed data (step 8)

At this stage, we map 70 imputed SNPs to the CETP region, of which 16 are significant at the suggestive association threshold of 5 × 10⁻⁶.

SM designed the study. SM, HN, and DI supervised the study. TVH, JDB, HN, DH, and SM performed the field trial and generated data. DFC, SDM, JA, HS, DH, and SM analyzed data. SM, DFC, and SDM wrote the manuscript with input from the other authors.

The authors thank Alex de Vliegher and his team from the Flanders Research Institute for Agriculture, Fisheries and Food (ILVO) for field trial management, members of various laboratories at the VIB-UGent Center for Plant Systems Biology for assistance with harvesting, Karl Kremling for advice on analysis of the diversity panel data, Ethalinda Cannon for greatly facilitating data retrieval from MaizeGDB, and three anonymous reviewers for very helpful comments. Funding for the RNA-seq and metabolomics data generation and funding for the work of DH and TVH were provided by Syngenta Crop Protection, LLC. SDM is a fellow of the Research Foundation-Flanders (FWO, grant 1146319N).

Jeroen van Rooij and Pooja R. Mandaviya contributed equally as first authors.

Peter A. C. ’t Hoen, Bas Heijmans, and Joyce B. J. van Meurs contributed equally as last authors.


Department of Internal Medicine, Erasmus Medical Center, Rotterdam, the Netherlands

Jeroen van Rooij, Pooja R. Mandaviya & Joyce B. J. van Meurs

Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, Maastricht, the Netherlands

Faculty of Medical Sciences, University of Groningen, Groningen, the Netherlands

The Generation R Study Group, Department of Epidemiology, Erasmus Medical Center, Rotterdam, the Netherlands

The Generation R Study Group, Department of Pediatrics, Erasmus Medical Center, Rotterdam, the Netherlands

Department of Biological Psychology, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands

Department of Psychiatry, VU University Medical Center, Amsterdam, the Netherlands

Department of Genetics, University of Groningen, Groningen, the Netherlands

Department of Human Genetics, Leiden University Medical Center, Leiden, the Netherlands

Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center Nijmegen, Nijmegen, the Netherlands

Molecular Epidemiology, Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, the Netherlands


Genome-wide association analyses

To detect novel loci conferring susceptibility to persistent HBV infection, we carried out a two-stage GWAS (Supplementary Fig. 1). In the discovery GWAS stage, we used genotypes from 12,027 individuals generated on various genotyping platforms providing genome-wide coverage (Table 1 and Supplementary Note) 12,13,14,15. With the plasma/serum of these subjects available, we determined which of them were PIs (cases) or SRs (controls) by screening for hepatitis B surface antigen (HBsAg), and antibodies against HBsAg (anti-HBs) and hepatitis B core antigen (anti-HBc). In total, 1,251 cases and 1,057 controls were involved in the GWAS stage, all of whom are of Chinese ancestry and were recruited from Guangxi, Guangdong and Jiangsu provinces (Table 1, Supplementary Table 1a and Supplementary Note). In the replication stage, four independent sample sets of Chinese ancestry, recruited from Jiangsu, Guangxi, Guangdong and Beijing, respectively, were included (Supplementary Note). With the same sample inclusion and exclusion criteria as those used in the discovery GWAS stage, the replication stage comprised a total of 3,905 cases and 3,356 controls (Table 1 and Supplementary Table 1a) 16,17.

To extend the coverage of the genomic region in the GWAS stage, we used genotypes of autosomal SNPs that passed strict quality checks to impute genotypes of SNPs across the chromosomes for all subjects (Methods section and Supplementary Table 2). We performed three rounds of imputation using the data from the HapMap project phase II, HapMap project phase III and the 1000 Genomes Project as references, respectively, and generated genotypes of 2,177,782, 1,059,015 and 4,494,311 SNPs, respectively (Supplementary Table 3). To assess the accuracy of genotyping and imputation, we resequenced a ∼127-kb genomic region at 1p36.22 in 274 subjects randomly selected from the GWAS stage (Supplementary Note). Excellent concordance between the array genotyping and sequencing was observed in these individuals (98.6%; P<2.2 × 10⁻¹⁶, Kappa test; Supplementary Table 4). A high consistency between the imputation and sequencing was also observed (Pearson's correlation r=0.94, P<2.2 × 10⁻¹⁶; Supplementary Fig. 2a). Moreover, we noted that the SNPs with high imputation quality (imputation r²>0.8) showed a higher consistency between the imputation and sequencing than those with low imputation quality (Pearson's correlation r=0.95 and 0.86, respectively; Supplementary Fig. 2b,c).

Having shown the validity of array genotyping and imputation data in the GWAS stage, we then carried out genotype–phenotype association analyses using non-integer allele numbers in a logistic regression model, with adjustment for age, sex and principal components-based correction for population stratification (Methods section and Supplementary Fig. 3). A quantile–quantile plot showed a good match between the distributions of observed P values and those expected by chance (inflation factor λ=1.05; Supplementary Fig. 4), indicating minimal overall inflation of the genome-wide statistical results.

Several previously reported SNPs were replicated

Recent GWASs have identified a number of SNPs that were significantly associated with the risk of persistent HBV infection. In this study, we confirmed the genetic effects of HLA-DP (index rs9277535, P=3.8 × 10⁻⁶ and rs3077, P=2.3 × 10⁻³), HLA-DQ (rs2856718, P=1.8 × 10⁻³ and rs7453920, P=5.5 × 10⁻⁶), CFB (rs12614, P=4.0 × 10⁻³) and CD40 (rs1883832, P=6.9 × 10⁻³) (Supplementary Table 5), which have been identified in previous GWASs (refs 6, 7, 10). However, four other SNPs (rs652888, rs1419881, rs3130542 and rs4821116) in or near the EHMT2, TCF19, HLA-C and UBE2L3 loci 8,9 failed to be replicated in this study (all P>0.05; Supplementary Table 5). These results were unlikely to be caused by imputation error, as these four SNPs were either directly genotyped or imputed with high imputation quality (imputation r²>0.96). We also reviewed the previous candidate gene-based association studies of persistent HBV infection. In addition to the SNPs in HLA-DP and HLA-DQ, the SNPs in the microRNA gene MIR219A1 at 6p21.32 (ref. 18) were also replicated (P=2.6 × 10⁻⁶ and 1.3 × 10⁻⁶ for rs421446 and rs107822, respectively; Supplementary Table 6). However, the other previously reported SNPs did not show any consistent associations in this study (Supplementary Table 6). These inconsistent associations between our study and the previous studies may be due to differences in study design or to racial diversity.

To further explore the associations between the HLA classical alleles and persistent HBV infection, we performed HLA allele genotyping in silico on the basis of the known SNP genotypes (Supplementary Note) using the R package HIBAG (ref. 19). Four previously identified alleles (HLA-DPB1*201, HLA-DQA1*301, HLA-DQB*301 and HLA-DQB*302) were replicated in the present study (all P<5 × 10⁻³; Supplementary Table 7) (ref. 9). In addition, we newly identified the allele HLA-DPB1*501 (P=4.0 × 10⁻⁴), which was in moderate linkage disequilibrium (LD) with the previously identified SNPs rs3077 (r²=0.30) and rs9277535 (r²=0.56) (Supplementary Table 7).

Recent GWASs have also identified several SNPs that are associated with HBV-related liver phenotypes. Among those SNPs, several in HLA-DP and HLA-DQ that were significantly associated with hepatitis B vaccine response or HBV-related HCC also showed suggestive associations with persistent HBV infection (Supplementary Table 8), reflecting shared genetic risk factors among the HBV-related phenotypes. However, none of the other SNPs showed an association with persistent HBV infection in our GWAS data (Supplementary Table 8), suggesting that the molecular mechanisms underlying these phenotypes are largely different.

A new susceptibility locus at 8p21.3 was identified

In addition to the previously reported SNPs in HLA-DP, HLA-DQ and MIR219A1, seventy-two loci showed significant associations with P≤1 × 10⁻⁴ in the discovery GWAS stage in this study. We then selected all of these top 72 signals for replication (Supplementary Data 1, Supplementary Table 9 and Methods section) in an independent sample set (replication stage 1, Jiangsu population). Of these 72 tested SNPs, 6 SNPs showed significant associations in the same direction as observed in the GWAS stage (Supplementary Data 1). These 6 SNPs were further genotyped in another sample set (replication stage 2, Guangxi population), and only rs7000921 at 8p21.3 was replicated (Supplementary Data 1). Consistently, rs7000921 showed evidence of association in replication stage 3 (Guangdong population) and stage 4 (Beijing population; Supplementary Data 1). In the combined analyses, rs7000921 (odds ratio (OR)=0.78, Pmeta=3.2 × 10⁻¹²) reached genome-wide significance for association with persistent HBV infection (Fig. 1a, Table 2 and Supplementary Fig. 5). No evidence of heterogeneity for OR values of rs7000921 was observed among all these sample sets (Pheterogeneity=0.29; Table 2).

(a) The genetic association results are shown for SNPs in the region 1 Mb up- or downstream of the index SNP rs7000921. Genomic positions are based on NCBI Build 36. In the meta-analysis, the P value of rs7000921 is shown as purple diamonds, with their initial P value in the GWAS stage shown as purple dots. The LD values (r²) to rs7000921 for the other SNPs are indicated by marker colour. Red signifies r²≥0.8, orange 0.6≤r²<0.8, green 0.4≤r²<0.6, light blue 0.2≤r²<0.4 and blue r²<0.2. Estimated recombination rates (from the HapMap project phase II) are plotted in light blue. Genes within the region surrounding rs7000921 are annotated, with the positions of transcripts shown by arrows. (b,c) The mRNA expression levels of nearby genes in subjects with different rs7000921 genotypes (CC, CT and TT) are shown. The mRNA expression levels were log2 transformed. Expression levels of each gene were normalized to the mean level of homozygotes for the major allele of rs7000921 (TT genotype) in 31 (b) or 88 (c) human liver tissue samples. Among the 31 liver samples, one sample failed to be genotyped for rs7000921, thus the analyses were restricted to the remaining 30 samples. Among the 88 liver samples, three subjects were considered outliers (their mRNA levels of INTS10 >mean+3 s.d. or <mean−3 s.d.), thus the analyses were restricted to the remaining 85 samples. P values were derived from linear regression analyses, and were considered significant when below 0.05 after Bonferroni correction by multiplying by the number of comparisons. Error bars indicate s.e.m.

We further investigated the effect of rs7000921 on persistent HBV infection using stratification by sex and age. In the pooled case–control samples, we found no appreciable variation in the effect of rs7000921 across the subgroups stratified by age or sex (Pheterogeneity=0.088 and 0.26, respectively; Supplementary Table 10). The interaction effects between rs7000921 and viral factors (for example, HBV genotypes and mutations, and viral load) were not assessed because these data were not fully available in our samples. Therefore, the possibility that the association signal detected at rs7000921 reflects some other aspect of disease biology related to persistent HBV infection risk cannot be completely ruled out.

INTS10 was identified as the causative gene at 8p21.3

The SNP rs7000921 is located in an intergenic region on chromosome 8p21.3. Six genes (CSGALNACT1, INTS10, LPL, SLC18A1, ATP6V1B2 and LZTS1) are located within 1 Mb of this SNP (Fig. 1a). To identify potentially causative gene(s) at 8p21.3, we performed eQTL analyses based on liver tissues from 31 patients with persistent HBV infection (Methods section). We found that the protective minor allele C of rs7000921 was significantly associated with elevated transcript levels of INTS10 (P=6.8 × 10⁻³; Fig. 1b and Supplementary Data 2). This liver eQTL finding was then replicated in an independent sample set of 88 human liver tissues (P=3.1 × 10⁻³; Fig. 1c, Supplementary Data 2 and Methods section) 20,21 . In these two sample sets, the expression of INTS10 in protective allele carriers (TC or CC of rs7000921) showed a 22–31% elevation compared with that in risk allele homozygotes (TT; Fig. 1b,c). The associations remained significant even after Bonferroni correction for multiple comparisons. When these two sample sets were pooled, we obtained a more significant eQTL signal (Fisher’s combined P=2.5 × 10⁻⁴). No significant eQTL signals were found between rs7000921 and the other five genes at 8p21.3. Taken together, these results suggest a potential role for INTS10 in persistent HBV infection. However, the allele-specific changes in INTS10 expression seen in liver tissues were not seen in lymphocytes of HapMap populations, suggesting that the underlying regulatory mechanism is tissue-specific.
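The combined eQTL P value can be checked with Fisher's method, which sums −2 ln(P) across independent tests and compares the total with a chi-square distribution on 2k degrees of freedom. The sketch below is an illustrative re-computation from the two P values quoted above, not the authors' own code; it uses the closed form of the chi-square survival function for even degrees of freedom so that only the standard library is needed.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of
    freedom the survival function has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# P values of the two liver eQTL sample sets quoted in the text
combined = fisher_combined_p([6.8e-3, 3.1e-3])
print(f"Fisher's combined P = {combined:.2e}")
```

With the two reported values the result is about 2.5 × 10⁻⁴, which agrees with the combined P value given in the text.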

To investigate candidate causative variants, we performed functional annotation of the genetic variants tagged by the index SNP rs7000921 (r²>0.7) on the basis of publicly available data sets and tools (Supplementary Note and Supplementary Table 11). All the SNPs highly correlated with rs7000921 are located in intergenic regions, of which rs11991803 (r²=0.739) and rs4922214 (r²=0.729) lie in conserved regions predicted to have high regulatory potential scores (Supplementary Table 11). The eQTL analyses of rs11991803 and rs4922214 showed genotype-specific expression of INTS10, similar to the results for rs7000921 (Supplementary Fig. 6). We further checked data from the Encyclopedia of DNA Elements (ENCODE) database and found that rs11991803 lies within a binding site for the transcriptional repressor CCCTC-binding factor detected in multiple cell types, including the human hepatoma cell line HepG2, suggesting that this variant might be involved in gene regulation (Supplementary Fig. 7). Taken together, these observations suggest that the causative variants at 8p21.3 may act over a long range to regulate INTS10 expression in certain cell types and merit further investigation.

INTS10 suppresses HBV replication

INTS10 is a subunit of the integrator complex, which can interact with RNA polymerase II to mediate 3′ end processing of the small nuclear RNAs U1 and U2, the core components of the spliceosome 22,23,24 . In addition, the integrator complex mediates transcriptional initiation, pause release and transcriptional termination at diverse classes of gene targets, including host small nuclear RNAs, coding genes and viral microRNAs 25,26,27 . INTS10 is expressed in a wide range of tissue types, including liver, according to the RNA-Seq Atlas database 28 . However, the specific roles of INTS10 in diseases, for example, in persistent HBV infection, remain unclear.

To investigate whether INTS10 plays a role in HBV replication, we used in vitro cell culture assay systems. The immortalized human hepatocyte cell line L02 was transfected with pAAV-HBV1.2 vectors, together with either pLV-EGFP-INTS10 or pLV-EGFP control vectors (Fig. 2a). Compared with cells expressing control vectors, significantly decreased levels of HBV markers, including replicative intermediates of HBV DNAs, HBV RNAs, HBsAg and hepatitis B e antigen (HBeAg), were observed in cells stably expressing exogenous INTS10 (all P<0.01; Fig. 2b–e and Supplementary Table 12). Consistent with these findings, knockdown by two independent siRNAs targeting INTS10 led to significantly elevated levels of HBV DNAs, HBV RNAs, HBsAg and HBeAg (all P<0.05; Fig. 2a,f–i). The antiviral activity of INTS10 against HBV is not limited to L02 cells. INTS10 can also efficiently reduce the HBV markers in the human hepatoma cell line HepG2.2.15, which constitutively produces HBV (Fig. 2j–r), and in the human hepatoma cell line HepG2 co-transfected with the pAAV-HBV1.2 vectors (Supplementary Fig. 8). Taken together, these results suggest that INTS10 plays a role in suppressing HBV replication in vitro.

(a) Protein levels of INTS10 in cellular lysates of L02 cells. L02 cells (∼2 × 10⁵) were transfected with pAAV-HBV1.2 vectors, together with pLV-EGFP-INTS10 vector (INTS10) or pLV-EGFP control vector (Vector) (up), or with INTS10-specific siRNAs (Si-INTS10#1 and Si-INTS10#2) or non-targeting scrambled siRNA (Si-Ctrl) (down). (b,c) Levels of HBV DNAs (b), 3.5 Kb pregenomic RNAs (pgRNAs) and 2.4/2.1 Kb Pre-S/S RNAs (c) in L02 cells with INTS10 overexpression. (d,e) Levels of HBsAg (d) and HBeAg (e) in supernatants of L02 cells with INTS10 overexpression. (f,g) Levels of HBV DNAs (f), pgRNAs and Pre-S/S RNAs (g) in L02 cells with INTS10 knockdown. (h,i) Levels of HBsAg (h) and HBeAg (i) in supernatants of L02 cells with INTS10 knockdown. (j) Protein levels of INTS10 in cellular lysates of HepG2.2.15 cells. HepG2.2.15 cells (∼2 × 10⁵) were transfected with INTS10 or control vectors (up), or with INTS10-specific or control siRNAs (down). (k,l) Levels of HBV DNAs (k), pgRNAs and Pre-S/S RNAs (l) in HepG2.2.15 cells with INTS10 overexpression. (m,n) Levels of HBsAg (m) and HBeAg (n) in supernatants of HepG2.2.15 cells with INTS10 overexpression. (o,p) Levels of HBV DNAs (o), pgRNAs and Pre-S/S RNAs (p) in HepG2.2.15 cells with INTS10 knockdown. (q,r) Levels of HBsAg (q) and HBeAg (r) in supernatants of HepG2.2.15 cells with INTS10 knockdown. All supernatants and cells were collected 72 h post-transfection. Protein levels of INTS10 were examined by western blot analyses. HBV DNA levels in cells were measured by Southern blot analysis (left) and quantitative real-time PCR (qRT-PCR; right). The pgRNAs and Pre-S/S RNAs of HBV in cells were measured by northern blot analysis, with 18S ribosomal RNA (rRNA) indicating RNA loading in each lane (left), and by qRT-PCR normalized to the human β-actin gene ACTB (right). The levels of HBsAg and HBeAg in supernatants were measured by enzyme-linked immunosorbent assays (ELISA). Error bars indicate s.d. P values were determined using a two-tailed unpaired t-test. *P<0.05, **P<0.01 and ***P<0.001. rcDNA, relaxed circular DNA; dsDNA, double-stranded DNA; ssDNA, single-stranded DNA.

INTS10 suppresses HBV replication in an IRF3-dependent manner

We then sought to explore the mechanisms by which INTS10 suppresses HBV replication by analysing mRNA expression profiles of liver tissues from 31 HBV carriers (Supplementary Note). Comparing samples with high INTS10 expression to samples with low INTS10 expression, we identified 402 differentially expressed genes (false discovery rate Q value<0.01 and fold change>1.2; Supplementary Data 3a), which we used to determine the biological pathways that are altered after INTS10 dysregulation. Intriguingly, we observed significant enrichment and activation of the spliceosome (Pnominal=1.5 × 10⁻⁵, ranked first) and the retinoic acid-inducible gene-I-like receptor (RLR) signalling pathway (Pnominal=1.8 × 10⁻³, ranked second; Supplementary Data 3b) in samples with high INTS10 expression. Given the important roles of the integrator complex in the spliceosome, the enrichment of the term spliceosome may reflect the intrinsic physiological function of INTS10. Notably, however, RLR members such as retinoic acid-inducible gene-I (RIG-I) and melanoma differentiation-associated gene 5 (MDA5) have been shown to sense HBV and activate innate immune signalling in hepatocytes to suppress virus replication 29,30,31 . To determine whether the RLR signalling pathway was regulated by INTS10 in other independent samples, we performed similar analyses in two data sets from the Gene Expression Omnibus (GEO) database (accession numbers GSE25097 and GSE22058), which contain 289 and 96 liver tissues, respectively (Supplementary Data 3c,d). Again, the term spliceosome ranked first in both data sets (Pnominal=1.0 × 10⁻⁸ and 2.6 × 10⁻³, respectively), and the RLR-related pathway ranked third (Pnominal=2.6 × 10⁻²) and ninth (Pnominal=0.20), respectively (Supplementary Data 3b). Taken together, these results suggest that INTS10 may be involved in inhibition of HBV replication through the RLR pathway.

Binding of RLRs to virus-derived nucleic acids activates the downstream signalling pathways in a manner dependent on the adaptor protein mitochondrial antiviral signalling protein (also known as IPS-1, VISA or Cardif), leading to the activation of IRF3 and NF-κB and the subsequent production of type I interferons (IFNs, including IFN-α and IFN-β), type III IFNs (that is, IFN-λ, including IFNL1 (also known as IL29), IFNL2 (IL28A) and IFNL3 (IL28B)) and inflammatory cytokines 32 . Thus, we examined whether IRF3 and NF-κB are activated in HepG2.2.15 cells transfected with the INTS10 expression plasmid. We found that overexpression of INTS10 increased IRF3 phosphorylation (p-IRF3), whereas NF-κB was not activated (Fig. 3a). Consistent with these findings, knockdown of INTS10 by siRNAs led to significantly decreased levels of p-IRF3 without influencing the activity of NF-κB (Fig. 3b). Furthermore, we found that overexpression of INTS10 potently activated the IFN-stimulated response element (ISRE) in reporter assays (P<0.01; Fig. 3c) and elevated mRNA levels of type III IFNs (IFNL1 and IFNL2/3; P<0.05; Fig. 3d and Supplementary Table 12), but did not influence type I IFNs. Consistent with these findings, knockdown of INTS10 significantly reduced the activity of the ISRE reporter and the mRNA levels of IFNL1 and IFNL2/3 (Fig. 3e,f). To ensure that these observations apply to other types of hepatocytes, we repeated the experiments in L02 and HepG2 cells co-transfected with HBV1.2 vectors and obtained identical results (Supplementary Figs 9 and 10). Next, we investigated whether the IRF3 pathway is required for the INTS10-elicited immunity against HBV infection. Indeed, we found that the activation of the ISRE reporter, the elevation of mRNA levels of type III IFNs and the reduction of HBV markers by enforced INTS10 expression were weakened when cells were transfected with siRNAs targeting IRF3 (Fig. 3g–m and Supplementary Fig. 11). Accordingly, in the liver tissues of patients persistently infected with HBV, we observed that the protein levels of INTS10 were positively correlated with those of p-IRF3 (ρ=0.38, P=0.015), but not with those of p-p65 (Fig. 4a–c and Supplementary Table 13). Taken together, these results suggest that INTS10 suppresses HBV replication in an IRF3-dependent manner.

(a,b) Levels of phosphorylated (p-) or total proteins in lysates of HepG2.2.15 cells measured by western blot analyses, with GAPDH or Tubulin indicating protein loading in each lane. The cells were transfected with pLV-EGFP-INTS10 vector (INTS10) or pLV-EGFP control vector (Vector) (a), or with INTS10-specific siRNAs (Si-INTS10#1 and Si-INTS10#2) or non-targeting scrambled siRNA controls (Si-Ctrl) (b). (c) Luciferase activity of ISRE reporter plasmids 48 h after co-transfection into cells with INTS10 or control vectors. RLU, relative luciferase units. (d) The mRNA levels of IFNL1 and IFNL2/3 measured by quantitative real-time PCR (qRT-PCR) normalized to the human β-actin gene ACTB in cells transfected with INTS10 or control vectors. (e) Luciferase activity of ISRE reporter plasmids 48 h after co-transfection into cells with INTS10-specific or control siRNAs. (f) The mRNA levels of IFNL1 and IFNL2/3 measured by qRT-PCR normalized to ACTB in cells transfected with INTS10-specific or control siRNAs. (g) Luciferase activity of ISRE reporter plasmids 48 h after co-transfection into cells with INTS10 or control vectors and an IRF3-specific siRNA (Si-IRF3#1) or non-targeting scrambled siRNA controls (Si-Ctrl). (h) Levels of INTS10 and IRF3 measured by western blot analyses, with GAPDH indicating protein loading in each lane. (i) The mRNA levels of IFNL1 (left) and IFNL2/3 (right) measured by qRT-PCR normalized to ACTB in cells co-transfected with INTS10 or control vectors and IRF3-specific or control siRNAs. (j,k) Levels of HBV DNAs (j), 3.5 Kb pregenomic RNAs (pgRNAs) and 2.4/2.1 Kb Pre-S/S RNAs (k) in cells co-transfected with INTS10 or control vectors and IRF3-specific or control siRNAs, measured by qRT-PCR normalized to ACTB. (l,m) Levels of HBsAg (l) and HBeAg (m) in supernatants of cells co-transfected with INTS10 or control vectors and IRF3-specific or control siRNAs, measured by enzyme-linked immunosorbent assays (ELISA). All histograms show mean values from three independent experiments; error bars indicate s.d. P values were determined using a two-tailed unpaired t-test. *P<0.05, **P<0.01 and ***P<0.001.

(a) Representative immunohistochemistry images of liver tissues stained for INTS10, p-p65 and p-IRF3, respectively. The scale bar represents 50 μm. (b,c) The correlation of protein levels between INTS10 and p-IRF3 (b), or between INTS10 and p-p65 (c). Protein levels of INTS10, p-p65 and p-IRF3 were measured in the non-tumour liver tissues of patients with HBV-related HCC by immunohistochemistry staining (n=40). The size of each circle is proportional to the number of samples. Spearman’s rank correlation test was used, and the correlation coefficient (ρ) and the two-tailed P values are shown. (d) The concentration of plasma INTS10 in persistently HBV-infected subjects (PIs) and spontaneously recovered subjects (SRs). The plasma INTS10 levels were measured by enzyme-linked immunosorbent assays (ELISA) in 216 PIs and 80 SRs. Horizontal bars indicate the mean value of each subset. Significance was calculated using a two-tailed unpaired t-test. (e) Correlation between plasma INTS10 and HBV DNA load in PIs with positive HBeAg (left) and PIs with negative HBeAg (right). The plasma INTS10 and HBV DNA load values were log10 transformed. The correlation coefficient (r) and the two-tailed P values were then evaluated by Pearson’s test. P values were considered significant when below 0.05.

INTS10 correlates with the persistence of HBV infection

To further validate the role of INTS10 in facilitating HBV clearance, we investigated plasma INTS10 levels in subjects persistently infected with HBV and in those who had spontaneously recovered from HBV infection. Consistent with the eQTL result in liver tissues, we observed elevated INTS10 protein levels in the plasma of rs7000921 C allele carriers (P=0.020, unpaired t-test; Supplementary Fig. 12). Furthermore, the levels of plasma INTS10 in 216 PIs were significantly lower than those in 80 SRs (P=2.0 × 10⁻¹⁹, fold change=2.0; Fig. 4d and Supplementary Table 1b). In addition, we found a significant negative correlation between INTS10 levels and HBV DNA load in the plasma of PIs with positive HBeAg (Pearson correlation coefficient r=−0.41, P=2.5 × 10⁻³) and of those with negative HBeAg (r=−0.17, P=0.028; Fig. 4e). Taken together, these results further support our genetic and functional findings, indicating that insufficiency of INTS10 may contribute to the persistence of HBV infection.
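The correlation between plasma INTS10 and HBV DNA load was computed on log10-transformed values, which matters because Pearson's r measures linear association and both quantities span several orders of magnitude. The sketch below shows the effect of the transformation on invented values (hypothetical numbers, not study data); the function names are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def log10_pearson(xs, ys):
    """Pearson correlation after log10-transforming both variables."""
    return pearson_r([math.log10(x) for x in xs],
                     [math.log10(y) for y in ys])

# Hypothetical values following an inverse power-law relationship
protein_level = [1.0, 10.0, 100.0]
viral_load = [1000.0, 100.0, 10.0]
print("raw r   =", round(pearson_r(protein_level, viral_load), 3))
print("log10 r =", round(log10_pearson(protein_level, viral_load), 3))
```

On the raw scale the relationship is curved and r understates it; after the log10 transformation the same data lie on a straight line.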


This work was supported by the National Institutes of Health (grant numbers R01 EB022574, R01 LM011360, U01 AG024904, R01 AG19771, P30 AG10133, R01 CA129769, UL1 TR001108 and K01 AG049050), the Department of Defense (grant numbers W81XWH-14-2-0151, W81XWH-13-1-0259 and W81XWH-12-2-0012) and the National Collegiate Athletic Association (grant number 14132004), as well as by the Indiana University Network Science Institute (IUNI), the Alzheimer’s Association, the Indiana Clinical and Translational Science Institute and the Indiana University/IU Health Strategic Neuroscience Research Initiative (in part).

Jingwen Yan is an assistant professor in the Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University Indianapolis, USA.

Shannon L. Risacher is an assistant professor in the Department of Radiology and Imaging Sciences, Indiana University School of Medicine, USA.

Li Shen is an associate professor in the Department of Radiology and Imaging Sciences, Indiana University School of Medicine, USA.

Andrew J. Saykin is a professor in the Department of Radiology and Imaging Sciences, Indiana University School of Medicine, USA.
