- The platform Hail depends on, for parallel processing
- Cloud platform
A simple wrapper around the Google Cloud SDK that makes it easier to operate.
cloudtools is a small collection of command line tools intended to make using Hail on clusters running in Google Cloud's Dataproc service simpler.
These tools are written in Python and mostly function as wrappers around the gcloud suite of command line tools included in the Google Cloud SDK.
Basic Google Cloud usage
gcloud dataproc clusters delete name
gcloud datastore create-indexes index.yaml
f1 = hc.read("gs://somewhere")
So far this uses only a single VM. To use Google Cloud VMs in parallel at scale, you need a distributed management system such as Spark, and Hail has Spark integrated.
This snippet starts a cluster named "testcluster" with 1 master machine, 2 worker machines (the minimum/default), and 6 additional preemptible worker machines. Then, after the cluster is started (this can take a few minutes), a Hail script is submitted to the cluster "testcluster".
1. Run the wrapper locally to create the Google Cloud virtual machines
cluster start testcluster \
    --master-machine-type n1-highmem-8 \
    --worker-machine-type n1-standard-8 \
    --num-workers 8 \
    --version devel \
    --spark 2.2.0 \
    --zone asia-east1-a
2. Start the notebook
cluster connect testcluster notebook
3. Submit a script from the local machine to Google Cloud
cluster submit testcluster myhailscript.py
4. Log in to Google Cloud and install the required software
gcloud compute ssh testcluster-m --zone asia-east1-a
5. Install sklearn
sudo su # to be root and install packages
/opt/conda/bin/conda install scikit-learn
/opt/conda/bin/pip install findspark
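The `cluster start` command from step 1 can also be assembled from a script. The following is a minimal sketch and not part of cloudtools: `build_cluster_start` is a hypothetical helper that only builds the argument list from the flags shown above and prints it. Actually running the result (for example with `subprocess.run`) assumes the `cluster` command from cloudtools is installed and on the PATH.

```python
import shlex


def build_cluster_start(name, **options):
    """Build the argv for `cluster start` (hypothetical helper, not part of cloudtools).

    Keyword names are mapped to flags by replacing underscores with hyphens,
    so num_workers=8 becomes `--num-workers 8`.
    """
    argv = ["cluster", "start", name]
    for key, value in options.items():
        argv.append("--" + key.replace("_", "-"))
        argv.append(str(value))
    return argv


# The same flags as in step 1 above.
argv = build_cluster_start(
    "testcluster",
    master_machine_type="n1-highmem-8",
    worker_machine_type="n1-standard-8",
    num_workers=8,
    version="devel",
    spark="2.2.0",
    zone="asia-east1-a",
)

# Print a copy-pasteable command line instead of executing it.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the flags in one place like this makes it easier to start the same cluster again later with a different worker count or zone.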
Depression is more frequently observed among individuals exposed to traumatic events. The relationship between trauma exposure and depression, including the role of genetic variation, is complex and poorly understood. The UK Biobank concurrently assessed depression and reported trauma exposure in 126,522 genotyped individuals of European ancestry. We compared the shared aetiology of depression and a range of phenotypes, contrasting individuals reporting trauma exposure with those who did not (final sample size range: 24,094- 92,957). Depression was heritable in participants reporting trauma exposure and in unexposed individuals, and the genetic correlation between the groups was substantial and not significantly different from 1. Genetic correlations between depression and psychiatric traits were strong regardless of reported trauma exposure, whereas genetic correlations between depression and body mass index (and related phenotypes) were observed only in trauma exposed individuals. The narrower range of genetic correlations in trauma unexposed depression and the lack of correlation with BMI echoes earlier ideas of endogenous depression.
Major depressive disorder (MDD) is a common illness accompanied by considerable morbidity, mortality, costs, and heightened risk of suicide. We conducted a genome-wide association meta-analysis based in 135,458 cases and 344,901 controls and identified 44 independent and significant loci. The genetic findings were associated with clinical features of major depression and implicated brain regions exhibiting anatomical differences in cases. Targets of antidepressant medications and genes involved in gene splicing were enriched for smaller association signal. We found important relationships of genetic risk for major depression with educational attainment, body mass, and schizophrenia: lower educational attainment and higher body mass were putatively causal, whereas major depression and schizophrenia reflected a partly shared biological etiology. All humans carry lesser or greater numbers of genetic risk factors for major depression. These findings help refine the basis of major depression and imply that a continuous measure of risk underlies the clinical phenotype.
The Neale Lab at the Broad Institute used Hail to perform QC and genome-wide association analysis of 2419 phenotypes across 10 million variants and 337,000 samples from the UK Biobank in 24 hours.
Hail’s functionality is exposed through Python and backed by distributed algorithms built on top of Apache Spark to efficiently analyze gigabyte-scale data on a laptop or terabyte-scale data on a cluster.
- a library for analyzing structured tabular and matrix data
- a collection of primitives for operating on data in parallel
- a suite of functionality for processing genetic data
- not an acronym
# conda env create -n hail -f $HAIL_HOME/python/hail/environment.yml
source activate hail
cd $HAIL_HOME/tutorials
jhail
Sample   Population  SuperPopulation  isFemale  PurpleHair  CaffeineConsumption
HG00096  GBR         EUR              False     False       77.0
HG00097  GBR         EUR              True      True        67.0
HG00098  GBR         EUR              False     False       83.0
HG00099  GBR         EUR              True      False       64.0
HG00100  GBR         EUR              True      False       59.0
HG00101  GBR         EUR              False     True        77.0
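To get a feel for what the annotation columns contain, the rows above can be parsed with plain Python. This is only a sketch: it assumes the fields are whitespace-separated and that the column types are as they appear (two booleans and one float). `parse_annotations` is a hypothetical helper, not a Hail function. In Hail itself a file like this would normally be loaded with `hl.import_table`.

```python
# The sample annotations shown above, assumed to be whitespace-separated.
TABLE = """\
Sample Population SuperPopulation isFemale PurpleHair CaffeineConsumption
HG00096 GBR EUR False False 77.0
HG00097 GBR EUR True True 67.0
HG00098 GBR EUR False False 83.0
HG00099 GBR EUR True False 64.0
HG00100 GBR EUR True False 59.0
HG00101 GBR EUR False True 77.0
"""


def parse_annotations(text):
    """Parse the table into a list of dicts with typed values (hypothetical helper)."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        record = dict(zip(header, line.split()))
        record["isFemale"] = record["isFemale"] == "True"
        record["PurpleHair"] = record["PurpleHair"] == "True"
        record["CaffeineConsumption"] = float(record["CaffeineConsumption"])
        rows.append(record)
    return rows


samples = parse_annotations(TABLE)
n_female = sum(r["isFemale"] for r in samples)
mean_caffeine = sum(r["CaffeineConsumption"] for r in samples) / len(samples)
print(n_female, round(mean_caffeine, 2))
```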
.
├── _SUCCESS
├── cols
│   ├── _SUCCESS
│   ├── metadata.json.gz
│   └── rows
│       ├── metadata.json.gz
│       └── parts
│           └── part-0
├── entries
│   ├── _SUCCESS
│   ├── metadata.json.gz
│   └── rows
│       ├── metadata.json.gz
│       └── parts
│           ├── part-00-2-0-0-6886f608-afb6-1e68-684b-3c5920e7edd5
│           ├── part-01-2-1-0-3d30160f-dba0-16f4-e898-4e7c30148855
│           ├── part-02-2-2-0-1051da4b-6799-6074-7d32-9bd7fa9ed9af
├── globals
│   ├── _SUCCESS
│   ├── globals
│   │   ├── metadata.json.gz
│   │   └── parts
│   │       └── part-0
│   ├── metadata.json.gz
│   └── rows
│       ├── metadata.json.gz
│       └── parts
│           └── part-0
├── metadata.json.gz
├── references
└── rows
    ├── _SUCCESS
    ├── metadata.json.gz
    └── rows
        ├── metadata.json.gz
        └── parts
            ├── part-00-2-0-0-6886f608-afb6-1e68-684b-3c5920e7edd5
            ├── part-01-2-1-0-3d30160f-dba0-16f4-e898-4e7c30148855
            ├── part-02-2-2-0-1051da4b-6799-6074-7d32-9bd7fa9ed9af
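The directory listing above suggests that each component (`cols`, `entries`, `globals`, `rows`) keeps its data as part files under `<component>/rows/parts`. Here is a rough sketch for counting those part files, which gives a quick view of how the data is partitioned. The layout is taken only from the listing and may well differ between Hail versions. `count_part_files` is a hypothetical helper, and the demo builds a mock directory rather than reading a real dataset.

```python
import tempfile
from pathlib import Path


def count_part_files(mt_root):
    """Return {component: number of files in <component>/rows/parts}.

    Hypothetical helper; the layout is assumed from the listing above.
    """
    counts = {}
    for component in ("cols", "entries", "globals", "rows"):
        parts_dir = Path(mt_root) / component / "rows" / "parts"
        if parts_dir.is_dir():
            counts[component] = sum(1 for p in parts_dir.iterdir() if p.is_file())
    return counts


# Demo on a mock layout with shortened, made-up part file names.
with tempfile.TemporaryDirectory() as tmp:
    mock_files = [
        "cols/rows/parts/part-0",
        "entries/rows/parts/part-00",
        "entries/rows/parts/part-01",
        "entries/rows/parts/part-02",
        "globals/rows/parts/part-0",
        "rows/rows/parts/part-00",
        "rows/rows/parts/part-01",
        "rows/rows/parts/part-02",
    ]
    for rel in mock_files:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    demo_counts = count_part_files(tmp)

print(demo_counts)
```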
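What is an LD block?
A GWAS does not require that the causal site itself be among the tested SNPs, but the tested set must include SNPs near that site. Genetically, the genome is made up of haplotype blocks (a few kb to several hundred kb each), and within a block, SNPs that are in high LD with one another can substitute for each other in a GWAS test. This is why, in a well-done GWAS paper, the SNPs neighboring a causal site usually all reach GWAS significance, which is one criterion that helps judge whether a result is credible. Detecting a nonsynonymous SNP merely offers a shortcut for interpreting the result. But if only exonic SNPs are selected for testing, I believe it would be hard to cover all haplotype blocks.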
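The idea that SNPs in high LD can stand in for one another can be made concrete with the r² statistic. The sketch below computes r² as the squared Pearson correlation of alt-allele counts (0, 1, or 2), which is one common way of measuring LD from unphased genotypes. The genotype vectors are made-up toy data, and `r_squared` is a name chosen here for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two vectors of alt-allele counts."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A monomorphic SNP has no variance; the correlation is undefined,
        # and this sketch reports it as 0.
        return 0.0
    return cov * cov / (var_x * var_y)


# Toy alt-allele counts for 8 samples (made-up data).
causal = [0, 1, 2, 0, 1, 2, 0, 1]
tag = [0, 1, 2, 0, 1, 2, 0, 2]  # differs from `causal` in one sample
far = [1, 0, 1, 2, 0, 1, 2, 0]  # unrelated pattern

print("causal vs tag:", round(r_squared(causal, tag), 3))
print("causal vs far:", round(r_squared(causal, far), 3))
```

A tag SNP with r² close to 1 carries nearly the same association signal as the causal site, while a SNP with low r² does not.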
#!/bin/sh
/opt/conda/bin/pip install scikit-learn
/opt/conda/bin/pip install numpy
/opt/conda/bin/pip install scipy
cluster start testcluster \
    --master-machine-type n1-highmem-8 \
    --worker-machine-type n1-standard-8 \
    --num-workers 2 \
    --num-preemptible-workers 2 \
    --version devel \
    --spark 2.2.0 \
    --zone asia-east1-a \
    --pkgs scikit-learn \
    --init gs://ukb_testdata/additional_init.sh
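import hail as hl
from pyspark.sql.functions import udf

# Simulate a small dataset: 3 populations, 100 samples, 100 variants.
mt = hl.balding_nichols_model(3, 100, 100)

# Per variant: the mean alt-allele count and the list of per-sample counts.
gts_as_rows = mt.annotate_rows(
    mean = hl.agg.mean(hl.float(mt.GT.n_alt_alleles())),
    genotypes = hl.agg.collect(hl.float(mt.GT.n_alt_alleles()))
).rows()

# Group variants into windows of 10 positions, used here as "LD blocks".
groups = gts_as_rows.group_by(
    ld_block = gts_as_rows.locus.position // 10
).aggregate(
    genotypes = hl.agg.collect(gts_as_rows.genotypes),
    ys = hl.agg.collect(gts_as_rows.mean)
)
df = groups.to_spark()

def get_intercept(X, y):
    # Imported inside the function so that it is resolved on the Spark workers.
    from sklearn import linear_model
    clf = linear_model.Lasso(alpha=0.1)
    clf.fit(X, y)
    return float(clf.intercept_)

# Fit one Lasso model per block and show its intercept.
get_intercept_udf = udf(get_intercept)
df.select(get_intercept_udf("genotypes", "ys").alias("intercept")).show()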
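The grouping key in the script above, `locus.position // 10`, assigns every variant to a fixed-width window of 10 positions. The pure-Python sketch below shows what that integer division does to a handful of made-up positions. `group_by_window` is a hypothetical name used only for illustration. Fixed windows are a simple stand-in: real LD blocks vary in size and would normally be derived from the data.

```python
from collections import defaultdict


def group_by_window(positions, width=10):
    """Group positions into fixed-width windows keyed by position // width."""
    blocks = defaultdict(list)
    for pos in positions:
        blocks[pos // width].append(pos)
    return dict(blocks)


# Made-up variant positions.
positions = [1, 4, 9, 10, 15, 27, 29, 30]
blocks = group_by_window(positions)

for block_id, members in sorted(blocks.items()):
    print(block_id, members)
```

Positions 9 and 10 end up in different windows even though they are adjacent, which is the main weakness of fixed-width grouping.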