• 免费服务热线
  • 400-065-6886
  • 电话:86(0)512-6295 9990
  • 传真:86(0)512-6295 9995
新闻中心

新闻媒体-太阳成tyc7111cc

发稿时间:2020-06-16来源:天昊生物





geo数据库目前收录了4348个数据集记录,包含人的数据1772个,小鼠的数据1642个,大鼠的数据360个,其中属于组织样品有1183个,细胞品系有857个。下面,小编就跟大家详细讲解如何利用geo表达谱数据进行差异表达分析,期待您的评论沟通和转发哦~



geo数据下载


打开ncbi太阳成集团tyc7111cc官网,选择geo datasets,这里我们随便搜一个转录组的数据,如上图所示,由于前面几个gse所提供的表达量文件不规范,这里我们选择登录号为gse132287,点击进入。 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse132287



下载文件打开后,如图所示:



数据准备

1、表达量数据读取

in [1]:

gene <- read.table('gse132287_gene-count-table.xls',header = t, row.names = 1, sep = 't', check.names = false)
head(gene,10)

out[1]:

 gene_name gene_type chr start end strand length mda-a1 mda-a2 mda-a3 mda-g1 mda-g2 mda-g3
ensg00000000003.14_2 tspan6 protein_coding chrx 99882106  99894988 - 4535  2704  3694  2946  3204  2435  3498
ensg00000000005.5_2 tnmd protein_coding chrx 99839799  99854882     1610  0  0  0  0  0  0
ensg00000000419.12_2 dpm1 protein_coding chr20 49551404  49575092 - 1207  4045  5254  4031  4740  3872  4867
ensg00000000457.13_2 scyl3 protein_coding chr1 169818772  169863408 - 6883  1352  1849  1431  1546  1291  1752
ensg00000000460.16_4 c1orf112 protein_coding chr1 169631245  169823221     5967  3490  4828  4071  3885  3376  4586
ensg00000000938.12_2 fgr protein_coding chr1 27938575  27961788 - 3474  1  0  0  1  0  3
ensg00000000971.15_2 cfh protein_coding chr1 196621008  196716634     8145  1620  2346  2001  2148  1863  2507
ensg00000001036.13_2 fuca2 protein_coding chr6 143815948  143832827 - 2793  9471  12928  10236  11808  10084  13539
ensg00000001084.10_3 gclc protein_coding chr6 53362139  53481768 - 8463  2461  3384  2572  2761  2392  3046
ensg00000001167.14_2  nfya  protein_coding  chr6  41040684  41067715     3811  4549  4770  3828  4146  4264  4997


2. 分组数据读取

in [2]:

info= read.table('gse132287_sample_group.txt',header = f,col.names = c('sample','group'))
info

out[2]:

sample group
mda-a1  mda-a
mda-a2  mda-a
mda-a3  mda-a
mda-g1  mda-g
mda-g2  mda-g
mda-g3  mda-g

in [3]:

df = gene[as.character(info$sample)]
head(df)

out[3]:

 mda-a1 mda-a2 mda-a3 mda-g1 mda-g2 mda-g3
ensg00000000003.14_2 2704  3694  2946  3204  2435  3498
ensg00000000005.5_2 0  0  0  0  0  0
ensg00000000419.12_2 4045  5254  4031  4740  3872  4867
ensg00000000457.13_2 1352  1849  1431  1546  1291  1752
ensg00000000460.16_4 3490  4828  4071  3885  3376  458
6ensg00000000938.12_2  1  0  0  1  0  3

in [4]:

coldata <- data.frame(group = info$group )
coldata

out[4]:

group
mda-a
mda-a
mda-a
mda-g
mda-g
mda-g


deseq数据分析

1、安装和加载deseq2包

in [5]:

install.packages('biocmanager')
biocmanager::install('deseq2')

in [6]:

library(deseq2)


2、deseq2分析

构建数据集并标准化数据集

in [7]:

dds <- deseqdatasetfrommatrix(df, dataframe(coldata), design= ~ group )
dds <- deseq(dds,betaprior=false)
dds

out[7]:

class: deseqdataset
dim: 60461 6
metadata(1): version
assays(4): counts mu h cooks
rownames(60461): ensg00000000003.14_2 ensg00000000005.5_2 ...ensg00000284747.1_1 ensg00000284748.1_1
rowdata names(22): basemean basevar ... deviance maxcooks
colnames(6): mda-a1 mda-a2 ... mda-g2 mda-g3
coldata names(2): group sizefactor

表达量数据归一化

in [8]:

df <- as.data.frame(counts(dds, normalized=true))

in [9]:

df['mda-a_mean'] = apply(
df[as.character(info[info$group=='mda-a',1])],1,mean)
df['mda-g_mean'] = apply(
df[as.character(info[info$group=='mda-g',1])],1,mean)
df['gene'] = rownames(df)head(df)

out[9]:

 mda-a1 mda-a2 mda-a3 mda-g1 mda-g2 mda-g3 mda-a_mean mda-g_mean gene
ensg00000000003.14_2 2994.378835 3147.518 3196.921 3101.6740411 2819.888 2992.716062 3112.9393704 2971.425929 
ensg00000000003.14_2ensg00000000005.5_2 0.000000 0.000 0.000 0.0000000 0.000 0.000000 0.0000000 0.000000 
ensg00000000005.5_2ensg00000000419.12_2 4479.386978 4476.735 4374.335 4588.6188998 4484.027 4163.964858 4443.4855603 4412.203499 
ensg00000000419.12_2ensg00000000457.13_2 1497.189418 1575.463 1552.883 1496.6254893 1495.062 1498.924683 1541.8452966 1496.870591 
ensg00000000457.13_2ensg00000000460.16_4 3864.786292 4113.757 4417.742 3760.9249843 3909.627 3923.555134 4132.0948079 3864.702246 
ensg00000000460.16_4ensg00000000938.12_2  1.107389  0.000  0.000  0.9680631  0.000  2.566652  0.3691295  1.178238  ensg00000000938.12_2


差异分析

通过result()可获得最终计算的log2倍数变化和校正前后p值等信息。contrast参数用于指定比较的分组顺序,即谁相对于谁的表达量上调/或下调;padjustmethod设定p值校正方法;alpha为显著性水平,这里0.05为校正后p值小于0.05即为显著。in [10]:
res <- results(dds, contrast = c('group', 'mda-a', 'mda-g'), padjustmethod = 'fdr', alpha = 0.05)
res = as.data.frame(res)
head(res)

out[10]:

basemean log2foldchange lfcse stat pvalue padj
ensg00000000003.14_2 3042.1826497 0.06681719 0.07222184 0.9251661 0.3548795 0.7074537
ensg00000000005.5_2 0.0000000 na na na na nae
nsg00000000419.12_2 4427.8445298 0.01055749 0.06762388 0.1561207 0.8759379 0.9691177
ensg00000000457.13_2 1519.3579439 0.04308650 0.07852322 0.5487103 0.5832043 0.8550993
ensg00000000460.16_4 3998.3985272 0.09654288 0.07227280 1.3358121 0.1816107 0.5297536
ensg00000000938.12_2  0.7736839  -1.73212025  2.76415576  -0.6266363  0.5308977  na

in [11]:

res['type']='not deg'
res[which(res$log2foldchange >= 1 & res$pvalue < 0.05),'type'] <- 'up'
res[which(res$log2foldchange <= 1 & res$pvalue < 0.05),'type'] <- 'down'
res['gene'] = rownames(res)head(res)


out[11]:

 basemean log2foldchange lfcse stat pvalue padj type gene
ensg00000000003.14_2 3042.1826497 0.06681719 0.07222184 0.9251661 0.3548795 0.7074537 not deg e
nsg00000000003.14_2ensg00000000005.5_2 0.0000000 na na na na na not deg ensg00000000005.5_2
ensg00000000419.12_2 4427.8445298 0.01055749 0.06762388 0.1561207 0.8759379 0.9691177 not deg 
ensg00000000419.12_2ensg00000000457.13_2 1519.3579439 0.04308650 0.07852322 0.5487103 0.5832043 0.8550993 not deg
ensg00000000457.13_2ensg00000000460.16_4 3998.3985272 0.09654288 0.07227280 1.3358121 0.1816107 0.5297536 not deg 
ensg00000000460.16_4ensg00000000938.12_2  0.7736839  -1.73212025  2.76415576  -0.6266363  0.5308977  na  not deg 
ensg00000000938.12_2


合并数据

in [12]:

result_merge = merge(df,res,by = 'gene')
result_merge = result_merge[order(result_merge$pvalue),]
head(result_merge)


out[12]:

gene mda-a1 mda-a2 mda-a3 mda-g1 mda-g2 mda-g3 mda-a_mean mda-g_mean basemean log2foldchange lfcse stat pvalue padj type1875
 ensg00000090339.8_2 4852.57694 4787.73773 5220.7701 27261.624 26204.689 25266.976 4953.69492 26244.430 15599.062 -2.405646 0.06409772 -37.53091 0.000000e 00  0.000000e 00 down11138 
ensg00000163739.4_2 193.79301 177.22895 210.5237 3221.714 3519.359 3184.359 193.84854 3308.477 1751.163 -4.097951 0.10170786 -40.29139 0.000000e 00  0.000000e 00 down11703 
ensg00000165795.23_3 17.71822 28.11805 41.2366 45895.870 41106.667 43346.472 29.02429 43449.669 21739.347 -10.548113 0.16963433 -62.18148 0.000000e 00  0.000000e 00 down12624 
ensg00000169429.10_2 2099.60883 2407.92789 2647.8235 28439.757 28773.277 27150.899 2385.12008 28121.311 15253.215 -3.559289 0.07909170 -45.00205 0.000000e 00  0.000000e 00 down10535
ensg00000160710.15_3 85045.23143 85230.08188 83116.6997 21604.263 22847.460 22927.901 84464.00435 22459.875 53461.939 1.911000 0.05663594 33.74183 1.409012e-249 4.237464e-246 up11364  
ensg00000164400.5_2  759.66860  834.16893  959.2934  6727.070  5916.553  6392.674  851.04366  6345.432  3598.238  -2.898827  0.08922127  -32.49032  1.460968e-231  3.661428e-228  down


写入文件保存

in [13]:

write.csv(result_merge,file = 'differential_expression_genes_summary.csv', quote=f,row.names =f)


往期相关链接:







如果您对本文案介绍的方法或代码有疑问,

请扫码添加qq群沟通

【本群将为大家提供】

分享生信分析方案

提供数据素材及分析软件支持

定期开展生信分析线上讲座

qq号:1040471849

太阳成tyc7111cc copyright © 2012-2020 天昊基因科技(苏州)有限公司    all rights reserved   
网站地图