A big collection of bioinformatics links, plus learning and technical exchange!

云贵浪子 · Posted 2004-11-23 10:22:02

RNAi technical handbook and the US bioinformatics training program: please share your thoughts on these two topics!

云贵浪子 · Posted 2004-11-26 17:57:11

Everyone is invited to share resources, study notes and technical experience on these two areas, RNAi and bioinformatics!

云贵浪子 · Posted 2004-11-26 18:36:48

1. When designing an RNAi experiment, you can first screen for target sequences on the following sites:
  http://www.ambion.com/techlib/misc/siRNA_finder.html
  http://katahdin.cshl.org:9331/RNAi/
  http://www.ic.sunysb.edu/Stu/shilin/rnai.html

2. Rules for choosing an RNAi target sequence:
（1）siRNAs with lower G/C content (35-55%) are more active than those with G/C content higher than 55%；
（2）Beginning with the AUG start codon of your transcript, scan downstream for AA dinucleotide sequences. Record the occurrence of each AA and the 3' adjacent 19 nucleotides as potential siRNA target sites. Tuschl, et al. recommend against designing siRNA to the 5' and 3' untranslated regions (UTRs) and regions near the start codon (within 75 bases) as these may be richer in regulatory protein binding sites. UTR-binding proteins and/or translation initiation complexes may interfere with binding of the siRNP endonuclease complex.
（3）Compare the potential target sites to the appropriate genome database (human, mouse, rat, etc.) and eliminate from consideration any target sequences with significant homology to other coding sequences.
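
Rules (1) and (2) above can be sketched as a small scan. This is only an illustration, not the algorithm used by any of the sites listed in item 1. The function name `find_sirna_sites` and its parameters are made up for this example, and the input is assumed to be the transcript written as DNA, starting at the start codon.

```python
def gc_fraction(seq):
    """Fraction of G/C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_sirna_sites(cds, min_offset=75, gc_min=0.35, gc_max=0.55):
    """Scan a sequence that starts at the start codon for AA + 19 nt sites.

    Sites that begin within `min_offset` bases of the start codon are
    skipped (rule 2), and a site is kept only if the G/C fraction of its
    19-nt core lies in [gc_min, gc_max] (rule 1).
    Returns a list of (position, 21-nt site, gc_fraction) tuples.
    """
    cds = cds.upper().replace("U", "T")
    sites = []
    for i in range(min_offset, len(cds) - 20):
        if cds[i:i + 2] == "AA":
            core = cds[i + 2:i + 21]  # the 19 nt 3' of the AA dinucleotide
            gc = gc_fraction(core)
            if gc_min <= gc <= gc_max:
                sites.append((i, cds[i:i + 21], gc))
    return sites
```

Rule (3), the homology check against the genome database, is not covered here and still has to be done separately.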

3. Synthesis of siRNA:
(1) In vitro: siRNAs generated in vitro by transcription with T7 RNA polymerase;
(2) In vivo: short hairpin RNAs that are transcribed in vivo from vectors containing the human U6 promoter.

4. siRNAs that have already been validated can be found on the following page:
http://www.dharmacon.com/4DCGI/W ... 7/4DWPG_11054394851

云贵浪子 · Posted 2004-11-26 18:37:09

Selection of siRNA duplexes from the target mRNA sequence

Using Drosophila melanogaster lysates (Tuschl et al. 1999), we have systematically analyzed the silencing efficiency of siRNA duplexes as a function of the length of the siRNAs, the length of the overhang and the sequence in the overhang (Elbashir et al. 2001c). The most efficient silencing was obtained with siRNA duplexes composed of 21-nt sense and 21-nt antisense strands, paired in a manner to have a 2-nt 3' overhang. The sequence of the 2-nt 3' overhang makes a small contribution to the specificity of target recognition restricted to the unpaired nucleotide adjacent to the first base pair. 2'-Deoxynucleotides in the 3' overhangs are as efficient as ribonucleotides, but are often cheaper to synthesize and probably more nuclease resistant. We used to select siRNA sequences with TT in the overhang.

The targeted region is selected from a given cDNA sequence beginning 50 to 100 nt downstream of the start codon. Initially, 5' or 3' UTRs and regions nearby the start codon were avoided assuming that UTR-binding proteins and/or translation initiation complexes may interfere with binding of the siRNP or RISC endonuclease complex. More recently, however, we have targeted 3'-UTRs and have not experienced any problems in knocking down the targeted genes. In order to design a siRNA duplex, we search for the 23-nt sequence motif AA(N19)TT (N, any nucleotide) and select hits with approx. 50% G/C-content (30% to 70% has also worked in our hands). If no suitable sequences are found, the search is extended using the motif NA(N21). The sequence of the sense siRNA corresponds to (N19)TT or N21 (position 3 to 23 of the 23-nt motif), respectively. In the latter case, we convert the 3' end of the sense siRNA to TT. The rationale for this sequence conversion is to generate a symmetric duplex with respect to the sequence composition of the sense and antisense 3' overhangs. The antisense siRNA is synthesized as the complement to position 1 to 21 of the 23-nt motif. Because position 1 of the 23-nt motif is not recognized sequence-specifically by the antisense siRNA, the 3'-most nucleotide residue of the antisense siRNA, can be chosen deliberately. However, the penultimate nucleotide of the antisense siRNA (complementary to position 2 of the 23-nt motif) should always be complementary to the targeted sequence. For simplifying chemical synthesis, we always use TT. More recently, we preferentially select siRNAs corresponding to the target motif NAR(N17)YNN, where R is purine (A, G) and Y is pyrimidine (C, U). 
The respective 21-nt sense and antisense siRNAs therefore begin with a purine nucleotide and can also be expressed from pol III expression vectors without a change in targeting site; expression of RNAs from pol III promoters is only efficient when the first transcribed nucleotide is a purine.
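
As a rough sketch of the AA(N19)TT procedure described above: the code below finds each 23-nt motif hit and writes out the two strands. The function name `design_duplexes`, its parameters and the dictionary keys are invented for this illustration. It assumes the cDNA is given as DNA letters, and "TT" at the 3' ends stands for the deoxythymidine overhang. The fallback motifs NA(N21) and NAR(N17)YNN are not handled.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def design_duplexes(cdna, start_codon_pos=0, min_downstream=50,
                    gc_min=0.30, gc_max=0.70):
    """Yield one dict per AA(N19)TT hit at least `min_downstream` nt
    downstream of the start codon and within the G/C window."""
    cdna = cdna.upper()
    # Lookahead so that overlapping motif hits are all reported.
    motif = re.compile(r"(?=(AA([ACGT]{19})TT))")
    for m in motif.finditer(cdna):
        pos = m.start()
        if pos < start_codon_pos + min_downstream:
            continue
        core = m.group(2)  # positions 3 to 21 of the 23-nt motif
        gc = (core.count("G") + core.count("C")) / 19
        if not gc_min <= gc <= gc_max:
            continue
        # Sense strand: (N19)TT, i.e. positions 3 to 23 of the motif.
        sense = core.replace("T", "U") + "TT"
        # Antisense strand: complement of positions 1 to 21, with TT overhang.
        antisense = reverse_complement(core).replace("T", "U") + "TT"
        yield {"pos": pos, "gc": gc, "sense": sense, "antisense": antisense}
```

Both strands are written 5' to 3', which is also the order format the text recommends.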

We always design siRNAs with symmetric 3' TT overhangs, believing that symmetric 3' overhangs help to ensure that the siRNPs are formed with approximately equal ratios of sense and antisense target RNA-cleaving siRNPs (Elbashir et al. 2001b; Elbashir et al. 2001c). Please note that the modification of the overhang of the sense sequence of the siRNA duplex is not expected to affect targeted mRNA recognition, as the antisense siRNA strand guides target recognition. In summary, no matter what you do to your overhangs, siRNAs should still function to a reasonable extent. However, using TT in the 3' overhang will always help your RNA synthesis company to let you know when you accidentally order siRNA sequences written 3' to 5' rather than in the recommended format of 5' to 3'. You may think this is funny, but it has happened quite a lot.

Compared to antisense or ribozyme technology, the secondary structure of the target mRNA does not appear to have a strong effect on silencing. We say that, because we have already knocked-down more than 20 genes using a single, essentially randomly chosen siRNA duplex (Harborth et al. 2001). Only 3 siRNA duplexes have been ineffective so far. In one or two other cases, we have found siRNAs to be inactive because the targeting site contained a single-nucleotide polymorphism. We were also able to knock-down two genes simultaneously (e.g. lamin A/C and NuMA) by using equal concentrations of siRNA duplexes.

We recommend to blast-search (NCBI database) the selected siRNA sequence against EST libraries to ensure that only one gene is targeted. In addition, we also recommend to knock-down your gene with two independent siRNA duplexes to control for specificity of the silencing effect. If selected siRNA duplexes do not function for silencing, please check for sequencing errors of the gene, polymorphisms, and whether your cell line is really from the expected species. Our initial studies on the specificity of target recognition by siRNA duplexes indicate that a single point mutation located in the paired region of an siRNA duplex is sufficient to abolish target mRNA degradation (Elbashir et al. 2001c). Furthermore, it is unknown if targeting of a gene by two different siRNA duplexes is more effective than using a single siRNA duplex. We think that the amount of siRNA-associating proteins is limiting for silencing rather than the target accessibility.
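
Since a single mismatch in the paired region is reported to abolish cleavage, one quick sanity check before ordering is to look for other sequences that match the chosen 19-nt core exactly or with one mismatch. The sketch below is a naive stand-in and not a replacement for the BLAST search recommended above. The function names and the `transcripts` dictionary (name to sequence) are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def near_matches(core, transcripts, max_mismatches=1):
    """Return (name, position, mismatches) for every window of every
    transcript that differs from `core` at no more than
    `max_mismatches` positions."""
    core = core.upper()
    hits = []
    for name, seq in transcripts.items():
        seq = seq.upper()
        for i in range(len(seq) - len(core) + 1):
            d = hamming(core, seq[i:i + len(core)])
            if d <= max_mismatches:
                hits.append((name, i, d))
    return hits
```

A hit in any transcript other than the intended target is a reason to pick a different site or to confirm the result with a second, independent duplex.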

云贵浪子 · Posted 2004-11-26 18:38:45

RNAi resources:
   http://immuneweb.xxmc.edu.cn/RNAi.htm
   http://www.orbigen.com/RNAi_Orbigen.html
   http://www.imb-jena.de/RNA.html
Commercial RNAi vectors:
   http://www.invivogen.com/siRNA/psiRNA.htm
   http://www.ambion.com/catalog/CatNum.php?7209
   http://www.imgenex.com/products_genesuppressor.html
   http://www.oligoengine.com/Home/mid_prodPSUPER.html

Recently published reviews and selected articles (those marked OK have already been downloaded)
1: Chiu YL, Rana TM.
RNAi in Human Cells. Basic Structural and Functional Features of Small
Interfering RNA
Mol Cell. 2002 Sep;10(3):549-61. (OK)

2: Mailand N, Podtelejnikov AV, Groth A, Mann M, Bartek J, Lukas J.
Regulation of G(2)/M events by Cdc25A through phosphorylation-dependent
modulation of its stability.
EMBO J. 2002 Nov 1;21(21):5911-5920.

3: Zhang H, Kolb FA, Brondani V, Billy E, Filipowicz W.
Human Dicer preferentially cleaves dsRNAs at their termini without a requirement
for ATP.
EMBO J. 2002 Nov 1;21(21):5875-5885. (OK)

4: Provost P, Dishart D, Doucet J, Frendewey D, Samuelsson B, Radmark O.
Ribonuclease activity and RNA binding of recombinant human Dicer.
EMBO J. 2002 Nov 1;21(21):5864-5874. (OK)

5: Dernburg AF, Karpen GH.
A Chromosome RNAissance.
Cell. 2002 Oct 18;111(2):159-62. (OK)

6: Carmell MA, Xuan Z, Zhang MQ, Hannon GJ.
The Argonaute family: tentacles that reach into RNAi, developmental control,
stem cell maintenance, and tumorigenesis.
Genes Dev. 2002 Nov 1;16(21):2733-2742. (OK)

7: Schwarz DS, Hutvagner G, Haley B, Zamore PD.
Evidence that siRNAs Function as Guides, Not Primers, in the Drosophila and
Human RNAi Pathways.
Mol Cell. 2002 Sep;10(3):537-48. (OK)

8: Ramaswamy G, Slack FJ.
siRNA. A Guide for RNA Silencing.
Chem Biol. 2002 Oct;9(10):1053-5. (OK)

9: Calegari F, Haubensak W, Yang D, Huttner WB, Buchholz F.
Tissue-specific RNA interference in postimplantation mouse embryos with
endoribonuclease-prepared short interfering RNA.
Proc Natl Acad Sci U S A. 2002 Oct 29;99(22):14236-40. (free)

10: Capodici J, Kariko K, Weissman D.
Inhibition of HIV-1 Infection by Small Interfering RNA-Mediated RNA
Interference.
J Immunol. 2002 Nov 1;169(9):5196-201. (OK)

11: Timms MW, Van Deursen FJ, Hendriks EF, Matthews KR.
Mitochondrial Development during Life Cycle Differentiation of African
Trypanosomes: Evidence for a Kinetoplast-dependent Differentiation Control
Point.
Mol Biol Cell. 2002 Oct;13(10):3747-59.

12: Zhang Y, Chalfie M.
MTD-1, a touch-cell-specific membrane protein with a subtle effect on touch
sensitivity.
Mech Dev. 2002 Nov;119(1):3.

13: Fukumoto H, Deng A, Irizarry MC, Fitzgerald ML, Rebeck GW.
Induction of the cholesterol transporter ABCA1 in CNS cells by LXR agonists
increases secreted A{beta} levels.
J Biol Chem. 2002 Oct 15 [epub ahead of print]

14: Wojtkowiak A, Siek A, Alejska M, Jarmolowski A, Szweykowska-Kulinska Z,
Figlerowicz M.
RNAi And Viral Vectors As Useful Tools In The Functional Genomics Of Plants.
Construction Of BMV-Based Vectors For RNA Delivery Into Plant Cells.
Cell Mol Biol Lett. 2002;7(2A):511-22. (OK)

15: Sun Y, Cheng Z, Ma L, Pei G.
beta-arrestin2 is critically involved in CXCR4-mediated chemotaxis and this is
mediated by its enhancement of p38 MAPK activation.
J Biol Chem. 2002 Oct 4 [epub ahead of print]

16: Caudy AA, Myers M, Hannon GJ, Hammond SM.
Fragile X-related protein and VIG associate with the RNA interference machinery.
Genes Dev. 2002 Oct 1;16(19):2491-6. (OK)

17: Reichhart JM, Ligoxygakis P, Naitza S, Woerfel G, Imler JL, Gubb D.
Splice-activated UAS hairpin vector gives complete RNAi knockout of single or
double target transcripts in Drosophila melanogaster.
Genesis. 2002 Sep-Oct;34(1-2):160-4. (OK)

18: Enerly E, Larsson J, Lambertsson A.
Reverse genetics in Drosophila: from sequence to phenotype using UAS-RNAi
transgenic flies.
Genesis. 2002 Sep-Oct;34(1-2):152-5. (OK)

19: Borkhardt A.
Blocking oncogenes in malignant cells by RNA interference-New hope for a highly
specific cancer treatment?
Cancer Cell. 2002 Sep;2(3):167. (OK)

20: Gaudilliere B, Shi Y, Bonni A.
RNA interference reveals a requirement for MEF2A in activity-dependent neuronal
survival.
J Biol Chem. 2002 Sep 13 [epub ahead of print](free)

21: Martinez J, Patkaniowska A, Urlaub H, Luhrmann R, Tuschl T.
Single-Stranded Antisense siRNAs Guide Target RNA Cleavage in RNAi.
Cell. 2002 Sep 6;110(5):563. (OK)

22: Tijsterman M, Okihara K, Thijssen K, Plasterk R.
PPW-1, a PAZ/PIWI Protein Required for Efficient Germline RNAi, Is Defective in
a Natural Isolate of C. elegans.
Curr Biol. 2002 Sep 3;12(17):1535. (OK)

23: Xu F, Gaggero C, Cohen S.
Polyadenylation can regulate ColE1 type plasmid copy number independently of any
effect on RNAI decay by decreasing the interaction of antisense RNAI with its
RNAII target.
Plasmid. 2002 Jul;48(1):49. (OK)

24: Hall IM, Shankaranarayana GD, Noma K, Ayoub N, Cohen A, Grewal SI.
Establishment and maintenance of a heterochromatin domain.
Science. 2002 Sep 27;297(5590):2232-7.(free)

25: Zentella R, Yamauchi D, Ho TH.
Molecular dissection of the gibberellin/abscisic Acid signaling pathways by
transiently expressed RNA interference in barley aleurone cells.
Plant Cell. 2002 Sep;14(9):2289-301. (OK)

26: Negeri D, Eggert H, Gienapp R, Saumweber H.
Inducible RNA interference uncovers the Drosophila protein Bx42 as an essential
nuclear cofactor involved in Notch signal transduction.
Mech Dev. 2002 Sep;117(1-2):151.

27: Lassus P, Rodriguez J, Lazebnik Y.
Confirming specificity of RNAi in mammalian cells.
Sci STKE. 2002 Aug 27;2002(147)L13. (OK)

28: Morris JC, Wang Z, Drew ME, Englund PT.
Glycolysis modulates trypanosome glycoprotein expression as revealed by an RNAi
library.
EMBO J. 2002 Sep 2;21(17):4429-38. (OK)

29: Allshire R.
Molecular biology. RNAi and heterochromatin--a hushed-up affair.
Science. 2002 Sep 13;297(5588):1818-9. (free)

30: Volpe TA, Kidner C, Hall IM, Teng G, Grewal SI, Martienssen RA.
Regulation of heterochromatic silencing and histone H3 lysine-9 methylation by
RNAi.
Science. 2002 Sep 13;297(5588):1833-7.(free)

31: Krichevsky AM, Kosik KS.
RNAi functions in cultured mammalian neurons.
Proc Natl Acad Sci U S A. 2002 Sep 3;99(18):11926-9.(free)

32: Oshiumi H, Begum NA, Matsumoto M, Seya T.
[RNA interference for mammalian cells]
Nippon Yakurigaku Zasshi. 2002 Aug;120(2):91-5. Japanese.

33: Coburn GA, Cullen BR.
Potent and specific inhibition of human immunodeficiency virus type 1
replication by RNA interference.
J Virol. 2002 Sep;76(18):9225-31. (OK)

34: Urwin PE, Lilley CJ, Atkinson HJ.
Ingestion of double-stranded RNA by preparasitic juvenile cyst nematodes leads
to RNA interference.
Mol Plant Microbe Interact. 2002 Aug;15:747-52.

云贵浪子 · Posted 2004-11-26 18:39:29

RNAi - a review

Discovery of RNA Interference (RNAi)

Recently scientists working in different research fields observed a phenomenon they could not immediately understand. Plant biologists were attempting to boost the activity of the gene for chalcone synthase, an enzyme involved in the production of anthocyanin pigments, by introducing a powerful promoter sequence into their petunias. However, instead of a deep purple colour, many of the flowers grew variegated, or virgin white. The researchers concluded that the introduced chalcone synthase gene had somehow muted both itself and a normal petunia gene. Jorgensen et al. termed this phenomenon of gene silencing "cosuppression" (1).

Their discoveries were supported by another group studying plant RNA viruses. Baulcombe et al (2) were expressing genes from the potato virus X in tobacco plants. The researchers hoped that viral proteins produced by the plants would stimulate its defence allowing the plants to resist subsequent attack by the virus itself. To their surprise the plants with the strongest resistance were those in which the introduced gene was silent. The researchers concluded that the introduced gene was co-suppressing both itself and the same gene in the virus.

In fungi, gene silencing was observed during attempts to boost the production of an orange pigment by the mould Neurospora crassa. Macino and Cogoni introduced extra copies of a gene involved in making a carotenoid pigment. In their experiments a third of the engineered mould bleached out, rather than turning to a deeper orange. Something had suppressed the pigment gene. They termed the observed phenomenon of gene silencing "quelling" (3,4).

Other scientists working with Caenorhabditis elegans obtained strange results in their antisense experiments. The theory behind the antisense approach is to inject complementary RNA sequences into the target organism to block the targeted mRNA. The two sequences should then hybridize, stopping the production of the encoded protein. To Guo's surprise even the injected sense strand was active (5). This was later explained by contamination of the sense strand with very small amounts of the corresponding antisense strand. In a classic antisense approach these small contaminations would have shown no effect at all.

In 1998 Fire et al. suggested a new mechanism for the phenomenon of gene silencing. In their experiments using Caenorhabditis elegans they showed that double stranded RNA (dsRNA) was even more effective in gene silencing than either the sense or the antisense strand alone (7). They found that only a few molecules of injected dsRNA were required per affected cell. Fire et al. described this mechanism as extremely gene specific and suggested that the dsRNA mediated silencing was part of a complex biological regulation system. Fire et al. named the phenomenon of gene silencing RNA interference (RNAi).

RNAi Mechanism and Short Interfering RNA (siRNA)

Consistent with gene silencing by dsRNA, Hamilton et al. described the existence of small (about 25 nt) RNAs that correspond to the gene that has been silenced in plants (8).

While looking for a common principle Hammond et al., detected similar short RNAs in Drosophila. They suggested that these are incorporated into a RNA induced silencing complex (RISC) and then are used as a guide in the RNAi mechanism, which then leads to degradation of the corresponding mRNA (9).

Today the basic mechanism of RNA interference (as it has been shown for Drosophila) can be understood as a two step process (10).

First, the dsRNA is cleaved to yield short interfering RNAs (siRNAs) of about 21-23 nt length (8, 9, 11-13) with a 5' terminal phosphate and short 3' overhangs (~2 nt) (12). Then the siRNAs target the corresponding mRNA sequence-specifically for destruction (Fig. 1) (7-9,11,13,14).

Fig. 1: After dsRNA (of >30 nt) is transferred into the cellular system, Dicer (Drosophila; 13) or another RNase III-like enzyme breaks the dsRNA into shorter RNA sequences (about 21-23 nt). These short sequences are called short interfering RNA (siRNA). siRNAs direct target specific mRNA degradation in the RNA induced silencing complex (RISC) (9).

Hammond et al., concluded that the identical size of RNA fragments in plants and animals must be the result of a highly conserved mechanism in nature (9). This theory has been supported by many studies showing that dsRNA induced gene silencing can be found in a number of different species (7, 15-24).

Non-Specific Response of Mammalian Cells

Even though it has been shown that dsRNA can mediate gene-specific interference in early mouse embryos and in mouse oocytes (25, 26), the use of dsRNA in somatic mammalian cells is limited. Instead of triggering RNAi, the introduced dsRNA generates a general, non-specific decrease of mRNA, often followed by cell death. One response to dsRNA in mammalian cells is mediated by the dsRNA-dependent protein kinase (PKR), which phosphorylates and inactivates the translation factor eIF2α, leading to a generalized suppression of protein synthesis, and in some cases apoptosis (27).

siRNA as a Tool

Several recent discoveries have begun to overcome the difficulties of unspecific responses and the cell death of mammalian cells in RNAi experiments.

Elbashir et al., analyzed the rate of 21-23 nt fragment formation after successfully triggering RNAi by several dsRNA in their described Drosophila lysate in vitro system (28). The authors then triggered RNAi efficiently using chemically synthesized siRNA duplexes of the same structure with 3'-overhang ends (11).

In a following study Tuschl et al. demonstrated that chemically synthesized 21 nt siRNA duplexes specifically suppress expression of endogenous and heterologous genes in different mammalian cell lines, including human kidney (293) and HeLa cells (29). A key discovery of these studies was that no unspecific effects occurred in mammalian cells by transfection of short sequences (<30 nt). The authors suggested that 21 nt siRNA duplexes provide a new tool for studying gene function in mammalian cells and may eventually be used as gene-specific therapeutics.

Caplen et al., supported these discoveries on siRNAs mediating RNAi in cell extracts and presented data that synthetic siRNAs can induce gene-specific inhibition of expression in Caenorhabditis elegans and in cell lines from humans and mice (30). This study also presents evidence that siRNAs can have direct effects on gene expression in C. elegans and mammalian cell culture in vivo.

siRNA: an Outlook

Today many researchers are excited about the huge potential of siRNA. In Nature Structural Biology Zamore wrote (10): "The ability to initiate RNAi in cultured mammalian cells using siRNA duplexes should dramatically accelerate the pace of reverse genetic analysis of the human genome. It will not be surprising if in five years the loss of function phenotype of virtually every human gene will have been examined in cultured cells using siRNA-mediated RNAi. In fact, new technologies may soon make it possible to fabricate RNAi chips - arrays of siRNAs on which cultured cells of many types can be grown and scored for the effects of suppressing expression of every gene in the genome, one-by-one."

We are committed to focusing our resources on this fast growing field of research. We strongly believe that siRNA technology is the most powerful tool to unravel the function of genes. In combination with our TOM-Chemistry and our expertise in RNA synthesis we hope to contribute significantly to the growth of this exciting technology. In our eyes siRNA will soon be used in a variety of applications such as high throughput target validation and gene therapy.

Literature and further reading

(1) Jorgensen RA., Cluster PD, English J., Que Q., Napoli CA., "Chalcone synthase cosuppression phenotypes in petunia flowers: comparison of sense vs. antisense constructs and single-copy vs. complex T-DNA sequences", Plant Mol. Biol., 31 (5), 957-73, (1996)

(2) Baulcombe DC., "Fast forward genetics based on virus-induced gene silencing", Curr. Opin. Plant. Biol., 2, 109-113, (1999)

(3) Cogoni C, Irelan JT., Schumacher M., Schmidhauser TJ., Selker EU., Macino G., "Transgene silencing of the al-1 gene in vegetative cells of Neurospora is mediated by a cytoplasmic effector and does not depend on DNA-DNA interactions or DNA methylation", EMBO J., 15 (12), 3153-63, (1996)

(4) Cogoni C., Macino G.,Proc. Natl Acad. Sci. USA 94, 10233-10238, (1997)

(5) Guo S. & Kemphues K., "par-1, a gene required for establishing polarity in C. elegans embryos, encodes a putative Ser/Thr kinase that is asymmetrically distributed", Cell, 81, 611-620 (1995)

(6) Susan Parrish, Jamie Fleenor, SiQun Xu, Craig Mello and Andrew Fire, "Functional Anatomy of a dsRNA Trigger: Differential Requirement for the Two Trigger Strands in RNA Interference", Molecular Cell, 6, 1077-87, (2000)

(7) Fire A., Xu S., Montgomery M.K., Kostas S.A., Driver S.E., Mello C.C., "Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans", Nature, Vol 391, (1998)

(8) Hamilton AJ. and Baulcombe DC., "A species of small antisense RNA in posttranscriptional gene silencing in plants", Science, 286, 950-952, (1999)

(9) Hammond SM., Bernstein E., Beach D., Hannon GJ., An RNA-directed nuclease mediates post-transcriptional gene silencing in Drosophila cells., Nature 404, 293-296, (2000)

(10) Zamore P.D., "RNA interference: listening to the sound of silence", Nature Structural Biology, 8, 9, 746-750, (2001)

(11) Zamore PD., Tuschl T., Sharp PA.& Bartel DP., "RNAi: Double stranded RNA directs the ATP-dependent cleavage of mRNA at 21 to 23 nucleotide intervals", Cell, 101, 25-33, (2000)

(12) Elbashir SM., Lendeckel W. & Tuschl T., "RNA interference is mediated by 21- and 22-nucleotide RNAs", Genes & Development, 15, 188-200, (2001)

(13) Bernstein E., Caudy A.A., Hammond S.M., Hannon G.J., "Role for a bidentate ribonuclease in the initiation step of RNA interference", Nature 409, 363-366, 2001

(14) Yang D., Lu H., Erickson J.W.,"Evidence that processed small dsRNAs may mediate sequence-specific mRNA degradation during RNAi in Drosophila embryos", Curr. Biol., 10, 1191-1200, (2000)

(15) Kennerdell J.R., Carthew R.W., "Use of dsRNA-mediated genetic interference to demonstrate that frizzled and frizzled 2 act in the wingless pathway", Cell, 95, 1017-1026, (1998)

(16) Montgomery M.K., Xu S., Fire A., "RNA as a target of double-stranded RNA-mediated genetic interference in Caenorhabditis elegans", Proc. Natl. Acad. Sci. USA 95, 15502-15507, (1998)

(17) Ngo H., Tschudi C., Gull K., Ullu E., "Double-stranded RNA induces mRNA degradation in Trypanosoma brucei", Proc. Natl. Acad. Sci. USA 95, 14687-14692, (1998)

(18) Timmons L., Fire A., "Specific interference by ingested dsRNA", Nature 395, 854, (1998)

(19) Bahramian M.B., Zarbl H., "Transcriptional and post-transcriptional silencing of rodent α1(I) collagen by a homologous transcriptionally self-silenced transgene", Mol. Cell. Biol., 19, 274-283, (1999)

(20) Lohmann J.U., Endl I., Bosch T.C., "Silencing of developmental genes in hydra", Dev. Biol., 214, 211-214, (1999)

(21) Misquitta L., Paterson B.M., "Targeted disruption of gene function in Drosophila by RNA interference (RNA-I): a role for nautilus in embryonic somatic muscle formation", Proc. Natl. Acad. Sci. USA 96, 1451-1456, (1999)

(22) Sanchez A.A., Newmark P.A., "Double-stranded RNA specifically disrupts gene expression during planarian regeneration", Proc. Natl. Acad. Sci. USA 96, 5049-5054, (1999)

(23) Wargelius A., Ellingsen S., Fjose A., "Double-stranded RNA induces specific developmental defects in zebrafish embryos", Biochem. Biophys. Res. Commun., 263, 156-161, (1999)

(24) Li Y.X., Farrell M.J., Liu R., Mohanty N., Kirby M.L., Double-stranded RNA injection produces null phenotypes in zebrafish, Dev. Biol., 217, 394-405, (2000)

(25) Wianny F., Zernicka-Goetz M., Specific interference with gene function by double-stranded RNA in early mouse development, Nature Cell Biol., 2, 70-75, (2000)

(26) Svoboda P., Stein P., Hayashi H. and Schultz R.M., "Selective reduction of dormant maternal mRNAs in mouse oocytes by RNA interference", Development (Cambridge, U.K.) 127, 4147-4156, (2000)

(27) Clemens M.J., Elia A., "The double stranded RNA-dependent protein kinase PKR: structure and function", J. Interferon Cytokine Res., 17, 503-524, (1997)

(28) Tuschl T., Zamore PD., Lehmann R., Bartel DP. & Sharp PA., "Targeted mRNA degradation by double-stranded RNA in vitro", Genes & Dev., 13, 3191-3197, (1999)

(29) Elbashir SM. et al., "Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells", Nature, 411, 494-498, (2001)

(30) Natasha J. Caplen, Susan Parrish, Farhad Imani, Andrew Fire and Richard A. Morgan, "Specific inhibition of gene expression by small double-stranded RNAs in invertebrate and vertebrate systems", PNAS early Edition, 171251798, 1-6, (2001)

(31) Fay D.S., Stanley H.M., Han M., Wood W.B., "A Caenorhabditis elegans homologue of hunchback is required for late stages of development but not early embryonic patterning", Dev. Biol., 205, 240-253, (1999)

云贵浪子 · Posted 2004-11-26 18:41:50

Characteristics of RNAi

• uses double-stranded RNA (dsRNA) as interfering agent (as opposed to single-stranded)
• high degree of specificity for a gene
• remarkably potent (only a few dsRNA molecules per cell are required for effective interference)
• the interfering activity can be transported across cell boundaries (allows RNAi uptake from the gut and distribution to the somatic tissues and germ lines)
• highly robust interference effect
• RNAi is remarkably long lived and is inheritable (to its progeny)
• RNAi works in planaria, trypanosomes, flies, mice and plants (although still some limitations, see Future uses of RNAi below)

Mechanism of RNAi

RNAi experiments involve three major steps:

1) Production of dsRNA matching gene of interest
• RNA is synthesized in a test tube by adding phage RNA polymerases that recognize phage promoters housed in expression vectors
• T7 and T3 promoters enable transcription of individual strands of the gene when T7 or T3 polymerases are provided
• Both promoters flank the cloned gene
• Usually T7 polymerase drives expression of one strand (antisense) and T3 the complementary strand (sense)
• The two strands are annealed together to make dsRNA
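
As a toy illustration of step 1, the snippet below writes out the two transcripts of a cloned insert and checks that they can pair along their full length. The function names are invented for this example. Promoter sequences are ignored (which polymerase produces which strand depends on the orientation of the insert in the vector), and only Watson-Crick pairs are considered.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe_sense(insert_dna):
    """Sense RNA: same sequence as the coding strand, with U in place of T."""
    return insert_dna.upper().replace("T", "U")


def transcribe_antisense(insert_dna):
    """Antisense RNA: reverse complement of the coding strand, as RNA."""
    return insert_dna.upper().translate(COMPLEMENT)[::-1].replace("T", "U")


def can_anneal(rna_a, rna_b):
    """True if the two RNAs form a perfect antiparallel Watson-Crick duplex."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return len(rna_a) == len(rna_b) and all(
        pairs[x] == y for x, y in zip(rna_a, reversed(rna_b))
    )
```

For an insert `ATGGCC` the two transcripts are `AUGGCC` and `GGCCAU`, and `can_anneal` returns `True` for that pair.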

2) Initiation step

• Input of dsRNA - RNAs could be injected directly into the target tissue.
C. elegans and other nematodes can take up dsRNA when soaked in dsRNA solution or fed bacteria carrying plasmids that make dsRNA, and consequently exhibit RNAi effects (Tabara et al., 1998). This is the special characteristic of RNAi where dsRNAs can be transported across cell boundaries and be distributed to target tissues.
• Inside the organism, the dsRNA is processed into 21-23 nucleotide siRNAs. Multiple siRNA duplexes are produced by successive cleavage by a nuclease complex at the 3' end of the dsRNA
• RNA amplification (the mechanism is still unclear. An RNase III-like activity may be involved in the RNAi mechanism although the evidence for this hypothesis remains inconclusive)

3) Formation of RNA-induced silencing complex (RISC)
• siRNAs are incorporated into the RISC nuclease complex
• RISC targets the homologous transcript by base pairing interactions between one of the siRNA strands and the endogenous mRNA. It then cleaves the mRNA at the 3' terminus of the siRNA
• Destruction of the target mRNA silences the gene
• Phenotypic analysis - phenotypes were tested differently at different time periods according to the phenotypic class. Observations include brood size and progeny viability

Limitations
• RNAi does not efficiently inhibit all genes, so an RNAi-based screen will miss some relevant genes (Fraser et al., 2000)
• RNAi may be ineffective against the targeted gene
• RNAi may not accurately phenocopy the null phenotype of all genes and may result in either partial or no loss of function.
• RNAi cannot detect subtle or conditional phenotypes (Fraser et al., 2000)
• Genes encoding neural functions are particularly resistant to RNAi (Fraser et al., 2000)

Advantages of RNAi
• easy to introduce to organism - by feeding or direct injection
• possible to make a library of dsRNA-expressing bacteria that can be used for high-throughput genome-wide RNAi screens at very low cost
• vs. the knockout technique - RNAi is faster and less labour intensive. Genes that are lethal when knocked out in embryos can be analyzed with RNAi in cell culture

云贵浪子 · Posted 2004-11-26 18:42:50

miRNAs : Discovery of Tiny Regulatory RNAs

What are miRNAs ?
Two distinct pathways exist in animals and plants in which 21- to 23-nt RNAs function as post-transcriptional regulators of gene expression. Small interfering RNAs (siRNAs) act as mediators of sequence-specific mRNA degradation in RNA interference (RNAi), whereas stRNAs regulate developmental timing by mediating sequence-specific repression of mRNA translation.
The Tuschl group developed a directional cloning procedure to isolate siRNAs after processing of long dsRNAs in Drosophila melanogaster embryo lysate, and simultaneously identified 16 novel 20- to 23-nt short RNAs, which are encoded in the D. melanogaster genome and are expressed in 0- to 2-hour embryos. These small RNAs are called micro-RNAs (miRNAs), with potential regulatory roles.

They
• all contain ~22 nucleotides;
• are cleaved from somewhat larger RNA precursors - probably by Dicer;
• are found in humans, Drosophila, mice, frogs, fish as well as in C. elegans;
• may be expressed in only certain cell types and at only certain times in the differentiation of a particular cell type.
The new molecules are reported in the Oct. 26 issue of Science magazine by Victor Ambros, professor of genetics, and Rosalind C. Lee, a research associate, the same team who identified the first of these little RNAs in a microscopic roundworm a decade ago. Until recently, that original small RNA, the product of the worm lin-4 gene, was the only example. Now Lee and Ambros report that "there are dozens, probably hundreds of little genes like lin-4." Two other research groups who independently had similar results have published their work in the same issue.

Abbreviations for different classes of RNA

fRNA: Functional RNA — essentially synonymous with non-coding RNA
miRNA: MicroRNA — putative translational regulatory gene family
ncRNA: Non-coding RNA — all RNAs other than mRNA
rRNA: Ribosomal RNA
siRNA: Small interfering RNA — active molecules in RNA interference
snRNA: Small nuclear RNA — includes spliceosomal RNAs
snmRNA: Small non-mRNA — essentially synonymous with small ncRNAs
snoRNA: Small nucleolar RNA — most known snoRNAs are involved in rRNA modification
stRNA: Small temporal RNA (for example, lin-4 and let-7 in Caenorhabditis elegans)
mRNA: Messenger RNA
tRNA: Transfer RNA
Note: The Tuschl group abbreviated their discovery of miRNA as miR-1 to miR-33, and the genes encoding miRNAs are named mir-1 to mir-33. Highly homologous miRNAs are referred to by the same gene number, but followed by a lowercase letter; multiple genomic copies of a mir gene are annotated by adding a dash and a number.
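
The naming convention in the note above can be sketched as a small parser. This is only an illustration under assumptions: the function name `parse_mir_name`, the `MirName` record and the exact pattern are hypothetical and not part of any official nomenclature tool; the pattern simply encodes "miR-N" for the RNA, "mir-N" for the gene, an optional lowercase letter for highly homologous miRNAs, and an optional "-number" for multiple genomic copies.

```python
import re
from typing import NamedTuple, Optional


class MirName(NamedTuple):
    is_gene: bool            # True for gene names (mir-...), False for the RNA (miR-...)
    number: int              # the gene number, e.g. 1 in miR-1
    variant: Optional[str]   # lowercase letter marking a highly homologous miRNA
    copy: Optional[int]      # number marking one of several genomic copies


# Hypothetical pattern encoding the convention described in the note.
_MIR_PATTERN = re.compile(r"^(mir|miR)-(\d+)([a-z])?(?:-(\d+))?$")


def parse_mir_name(name: str) -> MirName:
    """Split a name such as 'miR-2a' or 'mir-2a-1' into its parts."""
    match = _MIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised miRNA name: {name!r}")
    prefix, number, variant, copy = match.groups()
    return MirName(
        is_gene=(prefix == "mir"),
        number=int(number),
        variant=variant,
        copy=int(copy) if copy is not None else None,
    )


for example in ("miR-1", "mir-2a-1"):
    print(example, parse_mir_name(example))
```

A real registry may allow further forms (species prefixes, for instance); those are deliberately left out of this sketch.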

Why are miRNAs important?
RNA, short for ribonucleic acid, comes in numerous forms. Each RNA is a copy of a gene, part of the cell's DNA (deoxyribonucleic acid). Organisms have thousands of genes that collectively hold information for all the components of an organism. The Dartmouth researchers found 15 new genes in the worm, C. elegans, which all fit this microRNA family and document evidence for many more. Moreover, analysis showed that two of these particular small RNAs are also found in humans, including one that could play a role in the development of heart tissue.
The work demonstrates that these microRNAs, also called miRNAs, are an extensive and diverse class of regulators. "Each miRNA is probably matched to one or more other genes whose expression it controls. Their potential importance to control development or physiology is really enormous. If there are hundreds of these in humans and each has two or three targets that it regulates, then there could be many hundreds of genes whose activity is being regulated this way," said Ambros. "It's important to find all the human miRNA genes and understand what they do."
Ambros, a geneticist, adapted traditional gene discovery approaches to identifying these little genes. Such studies only became possible since the genomes (the total package of hereditary information) of humans and other species have been sequenced, and since bioinformatics advances have facilitated computer analysis of vast genetic data stores. In the commonality of life, C. elegans, with its relatively simple genetic apparatus, is a stepping stone to discovering important gene products that are probably performing similar functions in humans. By sequencing tiny RNAs found in C. elegans and comparing their sequences with genome databases of other worms, as well as with insects, mice and humans, the researchers identified the new genes.
"These little RNAs are unusual; they don't make protein. What they actually do is interfere with the messenger RNAs that do make protein. The key is that there is a match between the little RNA and its target, and the microRNA binds to the target and makes it incompetent to translate its message into protein," Ambros said.
Genomes contain sequences that are important for what a gene does, as well as other, less important regions; the important sequences are often similar or identical across species. By looking for identical sequences in different genomes, scientists can zero in on those that are functionally important. Ambros first compared the genomes of C. elegans, sequenced in 1999, and a related worm, C. briggsae, completed in June. The work illustrates how quickly genetics is moving. "Suddenly we could compare these two genomes, and that broke a logjam in gene discovery." The two worms, 10 million years apart, are close enough in evolutionary time to develop the same way, he continued, "so when we see identical DNA sequences, we infer that this signifies genetic machinery doing the same thing."
The findings built on work over the past decade in roundworm mutants with striking developmental defects. In 1991, Lee and Ambros found lin-4, a surprisingly small gene that produced a particular hook-shaped RNA. This little gene seemed to be a temporal switch in development, but unlike most timing controls, instead of synthesizing protein, it repressed protein production from certain other genes. lin-4 was the first gene of its kind identified, but it was the only example, in just one organism. "So initially, we worked on something novel, a gene important for C. elegans development, but whether or not it was important for other animals, let alone medical science, was in doubt," Ambros recalled. Things moved forward after a second, different small RNA also identified in the roundworm was found throughout the evolutionary tree, from sea urchins to insects and mice, to humans.
Several of the small RNAs identified in the current work are also evolutionarily ancient. Among these is mir-1, found in worms, flies and humans, which appears relatively specific to heart tissue. Ambros speculates that mir-1 may play a role in heart development and holds importance in some diseases of the heart. The next step is to identify all the new RNAs and determine how they function. Roundworms alone contain at least 100, Ambros estimates conservatively.

How can informatics techniques help in unraveling the mystery of miRNAs?
"The nice thing about uncovering these many microRNAs is that it places before us a buffet of questions and research projects larger than any one group could possibly pursue." The excitement lies in these unanswered questions and new opportunities. The convergence of researchers in different disciplines sparks unanticipated collaborations to explore intriguing possibilities, and the outcomes can be useful and far ranging. An area rich for collaborative investigation is the field of genomics, which encompasses volumes of information and poses computational challenges. "We've developed and applied a way of finding many of a new class of genes and think that in a year we may have found all of the C. elegans miRNAs. If we do, we're not focusing on one gene like we used to, but a whole class of genes," Ambros said. "We'd like to answer broad questions about how they work as well as specific questions about individual genes. The challenges escalate as the number of genes increases and the analyses become more complex. With lots of data we have to be able to weed out what is garbage and what is real: so computer tools and bioinformatics expertise are essential."

云贵浪子 · posted on 2004-11-26 18:43:37

Setting up controls in RNAi experiment design:
The choice of the right controls makes the whole difference between a good and a bad experiment.
1 Always test the sense and antisense single strands in separate experiments.
2 Try to use a scrambled siRNA duplex. This should have the same nucleotide
  composition as your siRNA but lack significant sequence homology to any other gene
  (including yours).
3 If possible, knock down your gene with two independent siRNA duplexes to control
  the specificity of the silencing process.

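
For control 2, a scrambled sequence with the same nucleotide composition can be produced by shuffling the bases. The sketch below is only a starting point under stated assumptions: the function name `scramble` and the example sequence are made up for illustration, and the code does not check homology, so the result would still have to be compared against the genome database (for example by a BLAST search) before use.

```python
import random
from collections import Counter


def scramble(seq: str, seed: int = 0, max_tries: int = 100) -> str:
    """Return a shuffled copy of seq with the same base composition.

    The seed makes the shuffle reproducible. No homology check is done here.
    """
    rng = random.Random(seed)
    bases = list(seq)
    for _ in range(max_tries):
        rng.shuffle(bases)
        candidate = "".join(bases)
        if candidate != seq:  # reject the rare shuffle that reproduces the input
            return candidate
    raise ValueError("could not produce an ordering different from the input")


# Hypothetical 21-nt target, used only to show the call.
target = "AAGCUGACCUGAAGUUCAUCU"
control = scramble(target, seed=1)
print(control, Counter(control) == Counter(target))
```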
Current Considerations for siRNA Design

1.Choose a 21 or a 23 nt sequence in the coding region of the mRNA with a GC ratio as close to 50% as possible. Ideally the GC ratio will be between 45% and 55%. An siRNA with 60% GC content has worked in many cases, however when an siRNA with 70% GC content is used for RNAi typically a sharp decrease in the level of silencing is observed. Avoid regions within 50-100 nt of the AUG start codon or 50-100 nt of the termination codon.

2.Avoid more than three guanosines in a row. Poly G sequences can hyperstack and therefore form agglomerates that potentially interfere in the siRNA silencing mechanism.

3.Preferentially choose target sequences that start with two adenosines. This will make synthesis easier, more economical, and create siRNA that is potentially more resistant to nucleases. When a sequence that starts with AA is used, siRNA with dTdT overhangs can be produced.
If it is not possible to find a sequence that starts with AA and matches rules 1 and 2, choose any 23 nt region of the coding sequence with a GC content between 45 and 55% that does not have more than three guanosines in a row.

4.Ensure that your target sequence is not homologous to any other genes. It is strongly recommended that a BLAST search of the target sequence be performed to prevent the silencing of unwanted genes with a similar sequence.

5.Based on feedback from various customers, labelling the 3'-end of the sense strand gives the best results with respect to not interfering with the gene silencing mechanism of siRNA. Consult our custom siRNA page for available modifications.

   When these rules are used for siRNA target sequence design RNAi effectively silences genes in more than 80% of cases. Current data indicate that there are regions of some mRNAs where gene silencing does not work. To help ensure that a given target gene is silenced, it is advised that at least two target sequences as far apart on the gene as possible be chosen.
   This method does not take into account mRNA secondary structure. At present it does not appear that mRNA secondary structure has a significant impact on gene silencing.
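
Rules 1 to 3 above can be sketched as a simple scan over a coding sequence. This is a minimal sketch, not a validated design tool: the names `gc_fraction` and `candidate_sites` are hypothetical, the default `margin` of 75 nt is just one value picked from the 50-100 nt range given in rule 1, and rule 4 (the BLAST check) is not covered.

```python
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def candidate_sites(
    cds: str,
    length: int = 21,      # rule 1: 21 or 23 nt
    margin: int = 75,      # assumed value within the 50-100 nt range of rule 1
    gc_min: float = 0.45,
    gc_max: float = 0.55,
) -> List[Tuple[int, str, bool]]:
    """Return (start, site, starts_with_AA) for windows passing rules 1 and 2.

    cds is the coding sequence from the start codon to the stop codon.
    Sites starting with AA (rule 3) are listed first.
    """
    cds = cds.upper().replace("U", "T")
    sites = []
    for start in range(margin, len(cds) - margin - length + 1):
        site = cds[start:start + length]
        if "GGGG" in site:                      # rule 2: more than three G in a row
            continue
        if not gc_min <= gc_fraction(site) <= gc_max:
            continue
        sites.append((start, site, site.startswith("AA")))
    sites.sort(key=lambda item: not item[2])    # stable sort: AA sites first
    return sites
```

Each surviving site would then go through the homology check of rule 4, and, as the text advises, at least two sites far apart on the gene would be chosen.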

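云贵浪子 · posted on 2004-11-26 18:51:58

Overview of molecular biology information databases
Molecular biology information databases come in many kinds. Broadly, they fall into four classes: genome databases; primary-structure sequence databases for nucleic acids and proteins; three-dimensional structure databases for biological macromolecules (mainly proteins); and secondary databases built on the first three classes together with the literature. Genome databases come from genome mapping, sequence databases from sequencing, and structure databases from structure determination by X-ray diffraction and nuclear magnetic resonance. These databases are the basic data resources of molecular bioinformatics and are usually called basic databases, initial databases, or primary databases. Secondary databases with particular biological significance and specialized uses are built by analyzing, curating, summarizing, and annotating genome maps, nucleic acid and protein sequences, protein structures, and the literature according to the practical needs of different fields of life science research; this is an effective route for database development. In recent years, biologists and computer scientists around the world have collaborated to develop several hundred secondary and composite databases, also called specialized, professional, or special-purpose databases.

Generally speaking, primary databases hold large volumes of data, are updated quickly, and serve a broad user base; they usually require high-performance computer hardware, large disk capacity, and a dedicated database management system. For example, the European Bioinformatics Institute uses Oracle database software to manage and maintain the EMBL nucleic acid database, while the genome database GDB is managed and run on the Sybase database system, so that even installing a mirror of it requires Sybase. Oracle and Sybase are both popular commercial database management software. Secondary databases are much smaller, are updated less frequently than primary databases, and can run without large commercial database software. Many secondary databases are built for Web browsers, with graphical interfaces written in the hypertext language HTML and in Java, and some include search programs. The greatest strength of these secondary databases, each developed for a different problem, is ease of use, which makes them especially suitable for biologists with limited computing experience.

Secondary databases are numerous. Those built on nucleic acid databases include the gene-regulation transcription factor database TransFac, the eukaryotic promoter database EPD, the cloning vector database Vector, and the codon usage table database CUTG. Those built on protein sequence databases include the protein functional site database Prosite, the protein functional site sequence fragment database Prints, the homologous protein family database Pfam, and the homologous protein domain database Blocks. Those built around proteins with special functions include the immunoglobulin database Kabat and the protein kinase database PKinase. Databases built on three-dimensional atomic coordinates provide effective tools for structural molecular biology, such as DSSP (protein secondary-structure conformational parameters), FSSP (families of proteins with known spatial structure), and HSSP (proteins of known spatial structure and their homologs). The protein loop classification database is a specialized database for research on protein structure, function, and molecular design. In addition, databases of enzymes, restriction endonucleases, radiation hybrids, amino acid property tables, and sequence-analysis literature also count as secondary or specialized databases.

DBCat, the catalog of biological information databases at Infobiogen, the French bioinformatics research center, collects detailed information (name, content, data format, contact address, URL, and so on) on more than 400 major databases, giving users a thorough picture of current biological information databases. DBCat is itself a database with a defined data format. It is organized into categories such as DNA, RNA, protein, gene maps, structure, and literature, and most of the databases it lists are public databases that can be downloaded free of charge. [Link 1.2.1-1] lists the names and types of the biological information databases installed on the Web server of the Center for Bioinformatics at Peking University, with brief descriptions.

In addition, many bioinformatics centers around the world have built network navigation systems for bioinformatics and genome information resources [Link 1.2.1-2]. Among them, the human genome information resource navigation system of Oak Ridge National Laboratory in the United States (Figure 1.3) and GenomeWeb at the UK Human Genome Mapping Resource Centre (HGMP for short) list the most comprehensive sets of URLs [Link 1.2.1-3], collecting several hundred addresses for genome centers worldwide, genome databases, genome maps, genomic experimental materials, gene mutations, genetic diseases, biotechnology companies, laboratory protocols, online tutorials, and user manuals.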

云贵浪子 · posted on 2004-11-26 18:59:06

Database classification by NAR: Database Categories List
Genomics Databases (non-vertebrate)
Human and other Vertebrate Genomes
Human Genes and Diseases
Metabolic and Signaling Pathways
Microarray Data and other Gene Expression Databases
Nucleotide Sequence Databases
Other Molecular Biology Databases
Protein sequence databases
Proteomics Resources
RNA sequence databases
Structure Databases
1 The Molecular Biology Database Collection: 2004 update
2 GenBank: update
3 The EMBL Nucleotide Sequence Database
4 DDBJ in the stream of various biological data
5 Database resources of the National Center for Biotechnology Information: update
6 MIPS: analysis and annotation of proteins from whole genomes
7 ACLAME: A CLAssification of Mobile genetic Elements
8 HERVd: the Human Endogenous RetroViruses Database: update
9 IMGT/GeneInfo: enhancing V(D)J recombination database accessibility
10 Islander: a database of integrative islands in prokaryotic genomes, the associated integrases and their DNA site specificities
11 Xpro: database of eukaryotic protein-encoding genes
12 ASD: the Alternative Splicing Database
13 EASED: Extended Alternatively Spliced EST Database
14 DBTBS: database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics
15 DBTSS, DataBase of Transcriptional Start Sites: progress report 2004
16 The Eukaryotic Promoter Database EPD: the impact of in silico primer extension
17 HemoPDB: Hematopoiesis Promoter Database, an information resource of transcriptional regulation in blood cell development
18 JASPAR: an open-access database for eukaryotic transcription factor binding profiles
19 Aptamer Database
20 The European ribosomal RNA database
21 The tmRNA website: reductive evolution of tmRNA in plastids and other endosymbionts
22 The microRNA Registry
23 PIRSF: family classification system at the Protein Information Resource
24 UniProt: the Universal Protein knowledgebase
25 ProTherm, version 4.0: thermodynamic database for proteins and mutants
26 DBSubLoc: database of protein subcellular localization
27 THGS: a web-based database of Transmembrane Helices in Genome Sequences
28 The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data
29 Recent improvements to the PROSITE database
30 The Pfam protein families database
31 SMART 4.0: towards genomic data integration
32 ESTHER, the database of the α/β-hydrolase fold superfamily of proteins
33 EyeSite: a semi-automated database of protein families in the eye
34 KinG: a database of protein kinases in genomes
35 The KNOTTIN website and database: a new information system dedicated to the knottin scaffold
36 MEROPS: the peptidase database
37 Update of NUREBASE: nuclear hormone receptor functional genomics
38 RPG: the Ribosomal Protein Gene database
39 TrSDB: a proteome database of transcription factors
40 AANT: the Amino Acid–Nucleotide Interaction Database
41 SCOR: Structural Classification of RNA, version 2.0
42 ArchDB: automated protein loop classification as a tool for structural genomics
43 The ASTRAL Compendium in 2004
44 DomIns: a web resource for domain insertions in known protein structures
45 The Genomic Threading Database: a comprehensive resource for structural annotations of the genomes from key organisms
46 DSDBASE: a consortium of native and modelled disulphide bonds in proteins
47 HOMSTRAD: recent developments of the Homologous Protein Structure Alignment Database
48 IMGT/3Dstructure-DB and IMGT/StructuralQuery, a database and a tool for immunoglobulin, T cell receptor and MHC structural data
49 E-MSD: an integrated data resource for bioinformatics
50 MODBASE, a database of annotated comparative protein structure models, and associated resources
51 The distribution and query systems of the RCSB Protein Data Bank
52 SCOP database in 2004: refinements integrate structure and sequence family data
53 The SWISS-MODEL Repository of annotated three-dimensional protein structure homology models
54 The SUPERFAMILY database in 2004: additions and improvements
55 SURFACE: a database of protein surface regions for functional annotation
56 3D-GENOMICS: a database to compare structural and functional annotations of proteins between sequenced genomes
57 TOPS: an enhanced database of protein structural topology
58 Genew: the Human Gene Nomenclature Database, 2004 updates
59 The Gene Ontology (GO) database and informatics resource
60 The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology
61 The Unified Medical Language System (UMLS): integrating biomedical terminology
62 DEG: a database of essential genes
63 FusionDB: a database for in-depth analysis of prokaryotic gene fusion events
64 The KEGG resource for deciphering the genome
65 The ORFanage: an ORFan database
66 TransportDB: a relational database of cellular membrane transport systems
67 VirGen: a comprehensive viral genome resource
68 The CyberCell Database (CCDB): a comprehensive, self-updating, relational database to coordinate and facilitate in silico modeling of Escherichia coli
69 coliBASE: an online database for Escherichia coli, Shigella and Salmonella comparative genomics
70 GenProtEC: an updated and improved analysis of functions of Escherichia coli K-12 proteins
71 RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12
73 MolliGen, a database dedicated to the comparative genomics of Mollicutes
74 Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms
75 Génolevures: comparative genomics and molecular evolution of hemiascomycetous yeasts
76 SCMD: Saccharomyces cerevisiae Morphological Database
77 yMGV: a cross-species expression data mining tool
78 ApiEST-DB: analyzing clustered EST data of the apicomplexan parasites
79 CryptoDB: the Cryptosporidium genome resource
80 dictyBase: a new Dictyostelium discoideum genome database
81 Full-malaria 2004: an enlarged database for comparative studies of full-length cDNAs of malaria parasites, Plasmodium species
82 GeneDB: a resource for prokaryotic and eukaryotic organisms
83 TcruziDB: an integrated Trypanosoma cruzi genome resource
84 FLAGdb++: a database for the functional analysis of the Arabidopsis genome
85 PHYTOPROT: a database of clusters of plant proteins
86 PlantGDB, plant genome database and analysis tools
87 The TIGR Plant Repeat Databases: a collective resource for the identification of repetitive sequences in plants
88 TropGENE-DB, a multi-tropical crop information system
89 AthaMap: an online resource for in silico transcription factor binding sites in the Arabidopsis thaliana genome
90 MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource for plant genomics
91 BGI-RIS: an integrated information resource and comparative analysis workbench for rice genomics
92 The Rice PIPELINE: a unification tool for plant functional genomics
93 Rice Proteome Database based on two-dimensional polyacrylamide gel electrophoresis: its status in 2003
94 MaizeGDB, the community database for maize genetics and genomics
95 SGMD: the Soybean Genomics and Microarray Database
96 CADRE: the Central Aspergillus Data REpository
97 RNAiDB and PhenoBlast: web tools for genome-wide phenotypic mapping projects
98 WormBase: a multi-species resource for nematode biology and genomics
99 Flytrap, a database documenting a GFP protein-trap insertion screen in Drosophila melanogaster
100 AppaDB: an AcedB database for the nematode satellite organism Pristionchus pacificus
101 Nematode.net: a tool for navigating sequences from parasitic and free-living nematodes
102 NEMBASE: a resource for parasitic nematode ESTs
103 BRENDA, the enzyme database: updates and major new developments
104 IntEnz, the integrated relational enzyme database
105 MetaCyc: a multiorganism database of metabolic pathways and enzymes
106 The aMAZE LightBench: a web interface to a relational database of cellular processes
107 The Database of Interacting Proteins: 2004 update
108 IntAct: an open source molecular interaction database
109 STCDB: Signal Transduction Classification Database
110 MitoP2, an integrated database on mitochondrial proteins in yeast and man
111 MitoProteome: mitochondrial protein sequence database and annotation system
112 Ensembl 2004
113 FREP: a database of functional repeats in mouse cDNAs
114 The Mouse Genome Database (MGD): integrating biology with the genome
115 The Mouse SAGE Site: database of public mouse SAGE libraries
116 PEDE (Pig EST Data Explorer): construction of a database for ESTs derived from porcine full-length cDNA libraries
117 AluGene: a database of Alu elements incorporated within protein-coding genes
118 The UCSC Table Browser data retrieval tool
119 Human protein reference database as a discovery resource for proteomics
120 HUGE: a database for human KIAA proteins, a 2004 update integrating HUGEppi and ROUGE
121 LIFEdb: a database for functional genomics experiments integrating information from external sources, and serving as a sample tracking system
122 trome, trEST and trGEN: databases of predicted protein sequences
123 Pathbase: a database of mutant mouse pathology
124 HGVbase: a curated resource describing human DNA variation and phenotype relationships
125 topoSNP: a topographic database of non-synonymous single nucleotide polymorphisms with and without known disease association
126 RTCGD: retroviral tagged cancer gene database
127 SNP500Cancer: a public resource for sequence validation and assay development for genetic variation in candidate genes
128 ERGDB: Estrogen Responsive Genes Database
129 Improvements in the HbVar database of human hemoglobin variants and thalassemia mutations for population and sequence variation studies
130 CleanEx: a database of heterogeneous gene expression data based on a consistent gene nomenclature
131 EICO (Expression-based Imprint Candidate Organizer): finding disease-related imprinted genes
132 GenePaint.org: an atlas of gene expression patterns in the mouse embryo
133 The Centre for Modeling Human Disease Gene Trap resource
134 GermOnline, a cross-species community knowledgebase on germ cell differentiation
135 The mouse Gene Expression Database (GXD): updates and enhancements
136 Hembase: browser and genome portal for hematology and erythroid biology
137 NASCArrays: a repository for microarray data generated by NASC’s transcriptomics service
138 The PEPR GeneChip data warehouse, and implementation of a dynamic time series query tool (SGQT) with graphical interface
139 GELBANK: a database of annotated two-dimensional gel electrophoresis patterns of biological systems with completed genomes
140 ANTIMIC: a database of antimicrobial sequences
141 APD: the Antimicrobial Peptide Database
142 The Peptaibol Database: a database for sequences and structures of naturally occurring peptaibols
143 ORFDB: an information resource linking scientific content to a high-quality Open Reading Frame (ORF) collection

Five additional databases:
144 SWISS-2DPAGE: Two-dimensional polyacrylamide gel electrophoresis database
145 PAD - Proteome Analysis Database
146 AAindex is a database of amino acid indices and amino acid mutation matrices
147 PEP: Predictions for Entire Proteomes
148 RESID

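云贵浪子 · posted on 2004-11-26 18:59:37

Functional genomics and proteomics resources (databases, analysis software, etc.)
BioSino Navigator
http://www.biosino.org/pages/gateways.htm

Functional genomics resources: proteome
Includes commonly used resources, protein-protein interaction databases, protein family databases, analysis software, etc.
http://www.biosino.org/bioinformatics/010929-6.htm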
Metabolic Pathways
ExPASy Biochemical Pathways: digitized version of wall charts courtesy of Boehringer Mannheim et al., divided into Metabolic Pathways and Cellular and Molecular Processes, maintained by the Swiss Institute of Bioinformatics, Geneva, Switzerland. Diagram of steps in Krebs tricarb ...
http://www.maninnet.com/phod.php ... Metabolic_Pathways/

Chemical Network -- Chemical Industry Directory
Biochemicals: A company operating in the forefront of genetic engineering as well as DNA, RNA and protein research. This is the ExPASy Expert Protein Analysis System proteomics server of t ...
http://www.chemnet.com/dir/Europe/Pharmaceuticals,_biochemicals_and_intermediates/Biochemicals/index3.html

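Proteomics
The Human Genome Project, launched in the early 1990s, has achieved a great deal after nearly ten years of effort by scientists in many countries. Not only have the complete genome sequences of more than ten model organisms (from Escherichia coli and Saccharomyces cerevisiae to the nematode) been determined, but the full sequence of all human genes is expected to be completed ahead of schedule in 2003. So, once we know the entire human genetic code, that is, the genome sequence, can we control human birth, aging, illness, and death at will? It is not that simple. Although genomics ...
http://www.newlifebp.org.cn/shidian/zhuanti/proteomics.htm
Online proteomics resources
Because proteomics research depends on bioinformatics and network technology, the Internet hosts many dedicated sites offering technical support for proteome research, as well as expert systems and databases for protein data analysis.
·  www.proteomeworks.bio-rad.com Proteome research techniques, information, etc.
·  www.proteinpathways.com Discovery of new proteins and new functions from a drug-development perspective.
·  www.proteome.med.umich.edu Proteome-related companies, instrument sources, news, etc.
·  www.proteome.co.uk Assorted information related to proteome research
·  www.micromass.co.uk Advanced instruments for protein identification
·  http://www.expasy.ch Protein identification expert system, Swiss Institute of Bioinformatics
·  http://expasy.proteome.org.au Protein identification expert system, Australian Proteome Analysis Facility
·  http://expasy.cbr.nrc.ca Protein identification expert system, Canadian Bioinformatics Resource
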
Desktop Sequence Analysis Software
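This article, written by a columnist abroad, gives a detailed introduction to sequence analysis tools. It is fluently written and clearly organized, and the software it covers includes not only the free programs commonly used in academia but also powerful commercial packages. The article introduces the software by category and describes the features of each program well. It is a good reference for beginners and professionals alike.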
http://www.biosino.org/bioinformatics/011205-3.htm

Database Mining in the Human Genome Initiative
http://www.biosino.org/bioinformatics/0902-6.htm

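BioPath: visualizing biochemical networks
Large numbers of biochemical reactions form a vast and complex network, and it is this system that carries out physiological functions. A comprehensive visualization of the subnetworks and pathways in the network can help biochemists understand the relationships among reaction substrates and supports research. Traditionally these network diagrams were drawn by hand. However, as knowledge of biochemical networks accumulates and deepens, automated generation and visualization have become necessary. BioPath is a software system that meets these requirements. BioPath requires a special database containing the reactions, their hierarchical classifications, and the reaction networks. BioPath can automatically generate networks from the database and produce high-quality visualizations of them. The article below describes the software's algorithms, usage, and possible applications.
http://www.biosino.org/bioinformatics/20020417-1.htm

Protein databases
1. PIR and PSD: The PIR-International Protein Sequence Database (PSD) is the largest public protein sequence database in the world, jointly maintained by the Protein Information Resource (PIR), the Munich Information Center for Protein Sequences (MIPS), and the Japan International Protein Information Database (JIPID). It is a comprehensive, annotated, non-redundant protein sequence database containing more than 142,000 protein sequences as of September 1999, including protein sequences from several dozen complete genomes. All sequence data have been curated, and more than 99% of the sequences have been grouped by protein ...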
http://www.bioisland.com.cn/ibidb/html/prodb.htm

Testing and laboratory services
http://www.chemnet.com/dir/Indus ... rvices/index18.html

Chemical Network -- Chemical Industry Directory
http://www.chemnet.com/dir/Regional/Switzerland/index3.html

The guide:
Step 1 obtaining a sequence of interest If you have a sequence of interest proceed to step 2.Once you obtained your sequence of interest YSI save it in a file using the Save As command of your browser.Search a public version of full Medline for topics of interest US Search for sequences of interest US Search protein and nucleic acid databas
http://www.biosino.org/mirror/bi ... .il/gdp/gdlist.html
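Bioinformatics and XML
XML separates the storage and relationships of the data itself from its presentation, is highly extensible, has a clearly layered tree structure, and works across platforms and languages. This has made it a good language for the Internet and very useful in all kinds of data storage and access work. The development of bioinformatics has likewise brought in XML technology. The article below is a specialist review that analyzes XML technology in bioinformatics today ..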
http://www.biosino.org/bioinformatics/011212-2.htm

Online Services
http://www.maninnet.com/phod.php ... cs/Online_Services/

Proteins
http://www.maninnet.com/phod.php ... Structure/Proteins/

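云贵浪子 · posted on 2004-11-26 19:00:48

New developments in bioinformatics databases
1 New developments in bioinformatics database services

The HGP, whose defining feature is the large-scale production of sequence information, has developed in step with database technology from the very beginning. The rise of bioinformatics has brought a steady stream of new bioinformatics databases; their size has grown rapidly and their data structures have become increasingly complex. Bioinformatics database services are now highly computerized and networked. Advances in algorithms and software, the integration of data, and the establishment of the client-server model have made them a powerful tool for biology, medicine, agriculture, and other disciplines.

1.1 Types of bioinformatics databases

1.1.1 Primary databases: contain one kind of information drawn from many sources, for example large-scale sequencing projects, individual submissions, the literature, and other databases.
1.1.2 Derived databases: contain one kind of information, but only (or mainly) from other databases or subsets of other databases.
1.1.3 Knowledge databases: contain related information from multiple sources, such as the literature, experience, or other databases.
1.1.4 Integrated databases: integrations of primary or derived databases.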

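1.2 Common bioinformatics databases

Nucleic acid databases

EMBL
The nucleic acid sequence database of the European Molecular Biology Laboratory; the principal nucleic acid sequence database in Europe and one of the world's two major nucleic acid databases. It is currently maintained by its outstation, the EBI (European Bioinformatics Institute).

GenBank
The gene sequence database of NCBI, the US National Center for Biotechnology Information. The principal nucleic acid sequence database in the United States and one of the world's two major nucleic acid databases.

DDBJ
DNA Database of Japan, the nucleic acid sequence database located in Japan and the main nucleic acid sequence database in Asia.

HIV
HIV sequence database

IMGT
The ImMunoGeneTics database, containing nucleic acid sequence data related to the immune system

dbEST
Expressed Sequence Tags database

BERLIN
5S rRNA database

EPD
Eukaryotic Promoter Database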

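Protein databases

SWISS-PROT
Jointly maintained by the Department of Medical Biochemistry of the University of Geneva and EMBL; the principal protein sequence database in Europe and one of the world's two major protein sequence databases.

PIR
Protein Identification Resource, a protein sequence database maintained by the US National Biomedical Research Foundation. The principal protein sequence database in the United States and one of the world's two major protein sequence databases.

PDB
Brookhaven database of three-dimensional protein structures

PROSITE
Dictionary of protein signature sequences

ENZYME
Enzyme database

REBASE
Restriction enzyme database

HSSP
Homology-derived Secondary Structure of Proteins database

BLOCKS
Protein Blocks Database

KABAT
Database of Proteins of Immunological Interest

OMEGA
Protein structural information database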

Genome databases

DICTYDB
Dictyostelium discoideum genome database

EcoGene
Escherichia coli K12 genome database

FLYBASE
Drosophila genome database

MAIZEDB
Maize genome database

STYGENE
Salmonella typhimurium LT2 genome database

SUBTILIST
Bacillus subtilis 168 genome database

WORMPEP
Protein database of the Caenorhabditis elegans genome project

SGD
Saccharomyces (yeast) genome database

Other databases

ECO2DBASE
Escherichia coli gene-protein two-dimensional gel database

GCRDB
G-protein-coupled receptor database

MIM
Mendelian Inheritance in Man database

PHDP
The Radiation Hybrid Database

AARHUS/GHENT-2DPAGE
Human keratinocyte two-dimensional protein gel database (Aarhus and Ghent universities)

SWISS-2DPAGE
Human 2D Gel Protein Database of the University of Geneva

TRANSFAC
Transcription factor database

YEPD
Yeast electrophoresis protein database

SRPDB
Signal recognition particle database

EMP
Database of Enzymes and Metabolic Pathways

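2 New developments in bioinformatics database services in China

International bioinformatics has developed extremely rapidly in recent years. To drive the development of bioinformatics in China, the Institute of Physical Chemistry at Peking University began in 1996 to set up the country's first bioinformatics network server, providing through
WWW: (http://www.ipc.pku.edu.cn)
FTP: (ftp://ftp.ipc.pku.edu.cn)
E-mail: (liwz@ipc.pku.edu.cn)
database and bioinformatics resource queries, software, e-mail, and other services to scientists in China and around the world.

The main tasks for China in the field of bioinformatics databases at present are:
Database management: information-based management of laboratory data, database sharing, database standardization, database integration, and building systems for evaluating and testing gene information.
Research on visualization of genome information and on expert systems.
Development of secondary and specialized databases.
Building China's own databases, with effective links to and timely updates from the commonly used international databases.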

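3 Using online bioinformatics resources

3.1 Analyzing nucleic acid and protein sequences over the Internet: As most bioinformatics databases have been connected to the Internet, it has become possible to analyze nucleic acid and protein sequences online, and the range of available analyses keeps growing. Examples include similarity searches against sequence databases, open reading frame (ORF) prediction, protein secondary structure analysis, and RNA secondary structure prediction.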
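
As an illustration of what ORF prediction means at its simplest, the sketch below scans the three forward reading frames of a DNA string for an ATG followed in frame by a stop codon. It is a deliberately naive sketch: the function name `find_orfs` and the `min_codons` threshold are assumptions made here, the reverse strand is ignored, and the online services mentioned above use far more elaborate methods.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> List[Tuple[int, int, str]]:
    """Return (start, end, sequence) for ORFs on the forward strand.

    start is the 0-based index of the A in ATG, end is one past the stop codon.
    min_codons counts the codons before the stop codon, including ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a new reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # look for the next ATG in this frame
    return sorted(orfs)


print(find_orfs("CCATGAAATAGGG"))
```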

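3.2 Analyzing nucleic acid and protein sequences by e-mail: The major bioinformatics databases have not only connected to the Internet but also set up e-mail servers that provide sequence analysis to users free of charge. A user only needs to send a sequence analysis request to the e-mail server in the prescribed format to have the analysis carried out, without needing to know the specific steps of the programs involved.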

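4 Database searching and sequence comparison

For many newly obtained sequences we do not know the corresponding biological function. Researchers hope to search sequence databases for known sequences homologous to the new sequence and to infer its biological function from that homology. To a certain extent, searching for homologous sequences means finding similar sequences by similarity comparison. In molecular biology, the similarity of DNA or proteins has several aspects: it may be similarity of nucleic acid or amino acid sequence, similarity of structure, or similarity of function. A general rule is that sequence determines structure and structure determines function. So when we study sequence similarity, we ultimately hope to use this rule to infer the structure or function of the new sequence, that is, to uncover the meaning of new biomolecular data.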
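
To make "similarity comparison" concrete, here is a minimal sketch of a global alignment score computed by dynamic programming (the Needleman-Wunsch recurrence). The post does not name any particular algorithm, so this choice, the function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -2) are all assumptions made for illustration. Database search tools such as BLAST use faster heuristic methods instead.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning the current prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                       # a[:i] aligned against an empty prefix
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap                 # gap in b
            left = curr[j - 1] + gap           # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


# A higher score means the two sequences are more similar under this scoring.
query = "ACGT"
for candidate in ("ACGT", "ACGA", "AGT"):
    print(candidate, global_alignment_score(query, candidate))
```

Ranking database sequences by such a score against the new sequence is the basic idea behind finding its closest known relatives.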

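5 Outlook

Through network software such as WWW, Gopher, and FTP, molecular biology databases on the Internet have made querying nucleic acid and protein sequences far more convenient, and the research methods have been greatly improved and refined. The Human Genome Project has given nucleic acid and protein sequence analysis ample room to prove its worth. There is good reason to believe that in the near future an Internet-based nucleic acid and protein sequence analysis system will bring a great leap forward in biological research, especially in molecular biology.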

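云贵浪子 · posted on 2004-11-26 19:03:07

Biological information databases
Biological information databases are the starting point and working platform for all bioinformatics work. They come in many kinds and broadly fall into four classes: genome databases; primary-structure sequence databases for nucleic acids and proteins; three-dimensional structure databases for biological macromolecules (mainly proteins); and secondary databases built on the first three classes together with the literature. Genome databases come from genome mapping, sequence databases from sequencing, and structure databases from structure determination by X-ray diffraction and nuclear magnetic resonance. These databases are the basic data resources of molecular bioinformatics and are usually called basic databases, initial databases, or primary databases. Secondary databases with particular biological significance and specialized uses are built by analyzing, curating, summarizing, and annotating genome maps, nucleic acid and protein sequences, protein structures, and the literature according to the practical needs of different fields of life science research. In recent years, biologists and computer scientists around the world have collaborated to develop several hundred secondary and composite databases, also called specialized, professional, or special-purpose databases. Generally speaking, primary databases hold large volumes of data, are updated quickly, and serve a broad user base; they usually require high-performance computer hardware, large disk capacity, and a dedicated database management system. Secondary databases are much smaller, are updated less frequently than primary databases, and can run without large commercial database software. There is no sharp boundary between secondary and primary databases.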

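3.1 Important bioinformatics centers in China and abroad

Databases are maintained and served by bioinformatics centers. There are many such centers; a number of important ones in China and abroad are listed here.
· NCBI, the US National Center for Biotechnology Information
NCBI manages databases such as GenBank, UniGene, and dbSNP and provides database search tools such as Entrez and BLAST. URL: http://www.ncbi.nlm.nih.gov.
· EBI, the European Bioinformatics Institute
Founded in 1994 in Cambridge, UK; its predecessor was the informatics department of the European Molecular Biology Laboratory in Heidelberg, Germany. EBI took over the management and maintenance of the original EMBL database and is a special node of the European Molecular Biology Network (EMBnet). Main page: http://www.ebi.ac.uk/.
· EMBnet, the European Molecular Biology Network
Established in 1988 and registered in the Netherlands. URL: http://www.embnet.org/. China joined as a member country in 1996; the Chinese node of EMBnet is at the Peking University Center for Bioinformatics (PKUCBI).
· EMBL, the European Molecular Biology Laboratory
The main laboratory is in Heidelberg, Germany. URL: http://www.embl-heidelberg.de.
· NIG, the National Institute of Genetics (Japan)
Maintains and manages the DNA database of Japan, DDBJ. URL: http://www.ddbj.nig.ac.jp. The database primarily reflects data produced in Japan and cooperates with EMBL and GenBank.
· BioSino, the website of the Bioinformation Center of the Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences
Its main task is to maintain China's public nucleic acid sequence database and to provide biology navigation information including various links, in Chinese and English. The home page is shown in Figure 3.1. URL: http://www.biosino.org.
· CBI or PKUCBI, the Peking University Center for Bioinformatics
CBI was founded in March 1997. It is the Chinese node of EMBnet and also the Chinese node of the Asia-Pacific Bioinformatics Network, APBionet. From PKUCBI one can go directly to the EMBnet home page and to mirrors of several important biological information databases. URL: http://www.cbi.pku.edu.cn.

Figure 3.1 Home page of the Chinese BioSino website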

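3.2 Classification of biological information databases

To date there are more than 500 biological information databases. Since 1994, the first issue each year of Nucleic Acids Research (http://www.oup.co.uk/nar/), published in Oxford Journals online by Oxford University Press, has been a special issue on biological databases (Figure 3.2), with articles giving an overall description of the nature, content, and update status of each database. Readers can follow links by category to find the material they need. The journal's classification of databases is summarized as follows:
· Comprehensive databases: including the three major nucleic acid databases (EMBL/GenBank/DDBJ), GSDB, TDB, UniGene
· DNA sequence databases: data related to gene structure and identification, such as codon usage tables, eukaryotic promoter databases, and intron and exon databases
· RNA sequence and ribosome databases
· Gene map databases
· Human genome databases
· Genome databases of other species
· Gene expression databases
· Gene mutation, pathology, and immunology databases
· Protein sequence databases: SWISS-PROT, PIR
· Protein structure databases: PDB
· Comparative genomics and proteomics databases: GDB
· Metabolic pathway and cellular regulation databases
· Databases related to agriculture, forestry, and animal husbandry
· Medical databases: MEDLINE
· Other databases

Each database has its own characteristics, and one database may span two or more categories, for example holding both protein sequences and gene sequences. There are quite a few search engines available, but the same query can return very different results from different search engines. Specialized biological search engines are more reliable and convenient.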

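3.3 The three major nucleic acid databases

GenBank, EMBL, and DDBJ are the three main international nucleic acid sequence databases. In the early 1980s the US National Institutes of Health (NIH) commissioned Los Alamos National Laboratory to set up GenBank, which was later transferred to the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM) under NIH. EMBL was created in 1982 by the European Molecular Biology Laboratory, from which it takes its name, and is now managed by the European Bioinformatics Institute. DDBJ, short for DNA Data Bank of Japan, was created in 1986 and is managed by Japan's National Institute of Genetics. In 1988, GenBank, EMBL, and DDBJ jointly formed an international joint center for nucleic acid sequence databases and established a cooperative relationship. Under the agreement, each of the three data centers collects sequence data released by laboratories and sequencing institutions around the world and exchanges newly found or updated data over the computer network every day, to keep the sequence information in the three databases complete (Figure 3.3). This collaboration has the following features:
· authoritative, broad, and cooperative;
· nucleic acid sequences use the same record standard;
· each center accepts data submissions independently, and the three work in sync;
· data are exchanged and content is updated daily.
Table 3.1 lists some biological databases by name, grouped by category with brief descriptions.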

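Figure 3.2 Home page of Nucleic Acids Research

Figure 3.3 Information exchange among the three major nucleic acid databases (diagram labels: BIBI, EMBL, GenBank, DDBJ)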

Table 3.1 Overview of biological databases
Database: Description

Nucleic acid databases
GenBank: gene sequence database of the US National Center for Biotechnology Information (NCBI). The principal US nucleic acid sequence database and one of the world's major nucleic acid databases.
EMBL Database: nucleic acid sequence database of the European Molecular Biology Laboratory. The principal European nucleic acid sequence database and one of the world's major nucleic acid databases. It is currently maintained by EMBL's outstation, the European Bioinformatics Institute (EBI).
DDBJ: DNA Data Bank of Japan (Mishima), the nucleic acid sequence database located in Japan and the principal one in Asia.
HIV Database: HIV sequence database.
IMGT: the ImMunoGeneTics database, containing nucleic acid sequence data related to the immune system.
dbEST: database of expressed sequence tags (ESTs).
BERLIN: 5S rRNA database.
EPD: Eukaryotic Promoter Database.

Protein databases
SWISS-PROT: protein sequence database maintained jointly by the Department of Medical Biochemistry of the University of Geneva and EMBL. The principal European protein sequence database and one of the world's two major protein sequence databases.
PIR: Protein Identification Resource, a protein sequence database maintained by the US National Biomedical Research Foundation. The principal US protein sequence database and one of the world's two major protein sequence databases.
PDB: the Brookhaven database of three-dimensional protein structures.
PROSITE: dictionary of protein sites and sequence patterns.
ENZYME: enzyme database.
REBASE: restriction enzyme database.
HSSP: Homology-derived Secondary Structure of Proteins database.
BLOCKS: Protein Blocks Database.
KABAT: Database of Proteins of Immunological Interest.
OMEGA: protein structural information database.
TMBASE: transmembrane protein database, including some prediction tools.

Genome databases
GDB: human genome database.
DICTYDB: Dictyostelium discoideum genome database.
EcoGene: Escherichia coli K12 genome database.
FLYBASE: Drosophila genome database.
MAIZEDB: maize genome database.
SGD: Saccharomyces (yeast) genome database.
STYGENE: Salmonella typhimurium LT2 genome database.
SUBTILIST: Bacillus subtilis 168 genome database.
WORMPEP: protein database of the Caenorhabditis elegans genome project.

Other databases
ECO2DBASE: Escherichia coli gene-protein two-dimensional gel database.
GCRDB: G protein-coupled receptor database.
MIM: Mendelian Inheritance in Man database.
PHDP: The Radiation Hybrid Database.
AARHUS/GHENT-2DPAGE: human keratinocyte two-dimensional protein gel database (Aarhus and Ghent universities).
SWISS-2DPAGE: Human 2D Gel Protein Database of the University of Geneva.
TRANSFAC: transcription factor database.
YEPD: Yeast Electrophoresis Protein Database.
SRPDB: Signal Recognition Particle Database.
EMP: Database of Enzymes and Metabolic Pathways.
Chinese microbial resource database cluster: catalogue of Chinese microbial strains, multimedia database of economic fungi, coded identification database for Gram-negative bacilli, microbial species catalogue database, the international nucleic acid sequence databases DDBJ/GenBank/EMBL, the Chinese national node of the Microbial Strain Data Network (MSDN), international microbial character coding database, coded identification database for Vibrio, culture media database, Asian Umbilicaria (rock tripe) database, and database of new fungal species.

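The three major nucleic acid databases grow rapidly every year, and their successful collaboration plays an increasingly important role in the worldwide exchange of biological information.

3.3.1 The GenBank database

GenBank is the NIH genetic sequence database (http://www.ncbi.nlm.nih.gov/genbank/). It collects publicly available DNA sequences and their annotation. The database grows exponentially, with the number of bases doubling roughly every 14 months. It currently holds 3 billion bases from 47,000 species.
The GenBank nucleic acid sequence database covers sequence data ranging from complete genomes to single genes, together with some annotation, and is called a primary database. There are also more targeted genome resources, or specialized databases. These contain part of the data from the primary database as well as information or cross-links drawn from other resources. They fall into two main classes: model-organism genome databases and databases tied to particular sequencing technologies. Although the former also contain sequence data, their main purpose is to provide a complete data resource for a specific model organism, such as yeast (Saccharomyces cerevisiae), the nematode (Caenorhabditis elegans), the fruit fly (Drosophila melanogaster), Arabidopsis thaliana or Helicobacter pylori. They gather and organize information at many levels to give a more complete picture of the organism's whole genome. The home page is shown in Figure 3.4.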

Figure 3.4 NCBI home page

3.3.2 The EMBL database

Maintained by the European Bioinformatics Institute (EBI). URL: http://www.ebi.ac.uk/embl/; the home page is shown in Figure 3.5. The database is divided into the following divisions:
• ESTs (Expressed Sequence Tags)
• Viruses
• Bacteriophage
• Prokaryotes
• Fungi
• Plants
• Invertebrates
• Vertebrates
• Rodents
• Mammals
• Human
• Organelles
• GSSs (Genome Survey Sequences)
• HTG (High Throughput Genomic sequencing)
• Patents
• STSs (Sequence Tagged Sites)
• Synthetic
• Unclassified

Figure 3.5 Home page of the EMBL database

3.3.3 The DDBJ database

Maintained by the National Institute of Genetics of Japan. URL: http://www.ddbj.nig.ac.jp. The home page is shown in Figure 3.6.

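3.4 Protein sequence databases

The most important protein amino acid sequence databases are SWISS-PROT, maintained by EBI, and the international PIR database, a collaboration among the United States, Germany and Japan.

3.4.1 SWISS-PROT

Maintained by EBI. URL: http://www.expasy.ch/sprot. The Peking University Center for Bioinformatics hosts a SWISS-PROT mirror.
SWISS-PROT is a database with very strict manual curation. Only proteins that actually exist are included, and every entry carries detailed annotation, including function, domains and post-translational modifications, together with complete citations and related links. The home page is shown in Figure 3.7.

Figure 3.6 DDBJ home page

Figure 3.7 Home page of the SWISS-PROT database

3.4.2 PIR, Protein Information Resource

Maintained by the National Biomedical Research Foundation (NBRF) in Washington, USA, the Munich Information Center for Protein Sequences (MIPS) of the Max Planck Society in Germany, and the Japan International Protein Information Database (JIPID).
URL: http://www-nbrf.georgetown.edu/pir; the home page is shown in Figure 3.8.
The PIR database contains information on all naturally occurring wild-type proteins of known sequence. Its main aim is to provide a comprehensive, non-redundant database organized by homology and taxonomy. It is updated weekly, with a new release every quarter. The content is divided into four levels: PIR1 (fully classified), PIR2 (checked and classified), PIR3 (unchecked) and PIR4 (not encoded or not translated).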

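3.5 Sequence formats used by the databases

Every database uses a fixed format so that records can be read by computer. For historical reasons the formats differ between databases, but they fall broadly into two families, EMBL and GenBank. Many European databases, such as SWISS-PROT, use the EMBL format. FASTA is a very simple format that is convenient for sequence comparison.

Figure 3.8 Home page of the PIR database

In the three major nucleic acid databases, the feature table (Features Table, FT) uses a set of keywords whose definitions have been unified across the three databases. The line codes of the two formats are compared in Table 3.2.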

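Table 3.2 Line codes in the EMBL and GenBank databases
EMBL   GenBank      Meaning
ID     LOCUS        identifier string and short description
AC     ACCESSION    unique accession number
DE     DEFINITION   brief description
OS     SOURCE       source organism
OC     ORGANISM     taxonomic lineage of the source organism
DT     -            creation date
KW     KEYWORDS     keywords
RN     REFERENCE    reference number
RA     AUTHORS      reference authors
RT     TITLE        reference title
RL     JOURNAL      reference location
RX     -            reference cross-reference
RP     -            sequence positions covered by the reference
DR     -            references to other databases
-      MEDLINE      MEDLINE identifier of the reference
XX     -            spacer line added for readability
CC     COMMENT      comments
NI     VERSION      updatable sequence version number
FH     FEATURES     feature table header
FT     FEATURES     feature table
SQ     -            start of sequence (EMBL)
-      BASE COUNT   base counts (GenBank)
-      ORIGIN       start of sequence (GenBank); the rest of the line is blank
//     //           end-of-entry marker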
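
The correspondence in Table 3.2 can be put to work when reading EMBL-style lines. The following is a minimal sketch, not an official parser: the dictionary covers only the pairs listed in the table, and the function name `group_embl_lines` is made up for illustration.

```python
# Pairs taken from Table 3.2; codes without a GenBank counterpart are left out.
EMBL_TO_GENBANK = {
    "ID": "LOCUS", "AC": "ACCESSION", "DE": "DEFINITION",
    "OS": "SOURCE", "OC": "ORGANISM", "KW": "KEYWORDS",
    "RN": "REFERENCE", "RA": "AUTHORS", "RT": "TITLE",
    "RL": "JOURNAL", "CC": "COMMENT", "FT": "FEATURES",
}


def group_embl_lines(lines):
    """Group EMBL lines under the matching GenBank keyword.

    Codes that are not in the table keep their two-letter EMBL code.
    Spacer lines (XX) and the terminator (//) are skipped.
    """
    grouped = {}
    for line in lines:
        code, _, text = line.partition(" ")
        if not code or code in ("XX", "//"):
            continue
        key = EMBL_TO_GENBANK.get(code, code)
        grouped.setdefault(key, []).append(text.strip())
    return grouped


example = [
    "ID AF111847 standard; RNA; HUM; 2788 BP.",
    "XX",
    "AC AF111847",
    "OC Eukaryota; Metazoa;",
    "OC Mammalia;",
]
print(group_embl_lines(example))
```

Continuation lines that repeat the same code (such as the two OC lines) end up in one list, which mirrors how multi-line fields are read in both formats.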

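A sequence database entry generally has three parts. The first contains naming and literature information. The sequence itself forms the third part. Between them sits the feature table, which holds computer-readable biological annotation of the sequence; this information is indispensable for sequence analysis. For a molecular biology database, both the quality of its annotation and the amount of data it collects are critical, and at present annotation lags far behind the growth in data volume.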

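3.5.1 GenBank format

Each GenBank entry is a plain-text file. Every line begins either with spaces or with a keyword, and keywords are complete English words rather than abbreviations.
A GenBank sequence file is made up of individual sequence entries. An entry consists of fields; each field starts with a keyword followed by the content of that field. Some fields are divided into subfields that start with a subkeyword or a feature key. Each entry ends with a double slash "//". The layout matters: keywords start in column 1, subkeywords in column 3 and feature keys in column 5 (column 6, after five leading spaces, in NCBI's current flat-file layout). A field may occupy one line or several; when a line overflows, the continuation line starts with spaces.
The keywords of an entry include LOCUS (locus name), DEFINITION (description), ACCESSION (accession number), NID (nucleotide identifier), KEYWORDS, SOURCE (origin of the data), REFERENCE (literature), FEATURES (feature table), BASE COUNT (base composition) and ORIGIN (the base sequence). Recent releases of the nucleic acid databases introduce a sequence version (SV), written as "accession.version", which replaces the NID keyword.
The feature keys that appear in GenBank feature tables and their meanings are listed in Table 3.3.

Table 3.3 Feature keys in GenBank annotation
Key               Meaning
3'UTR             3' untranslated region
5'UTR             5' untranslated region
-10_signal        -10 signal
-35_signal        -35 signal
CAAT_signal       CAAT signal
CDS               coding sequence, including the stop codon
enhancer          enhancer
exon              exon
GC_signal         GC signal
gene              named gene sequence
intron            intron
LTR               long terminal repeat
mat_peptide       mature peptide coding sequence, excluding the stop codon
misc_binding      miscellaneous binding site
misc_feature      miscellaneous feature
misc_RNA          miscellaneous RNA
misc_signal       miscellaneous signal
modified_base     modified base
mRNA              messenger RNA
mutation          mutation
rRNA              ribosomal RNA
tRNA              transfer RNA
polyA_signal      polyadenylation signal
polyA_site        polyadenylation site
prim_transcript   primary transcript
promoter          promoter
protein_bind      protein binding site
rep_origin        origin of replication
repeat_region     repeat region
repeat_unit       repeat unit
satellite         satellite sequence
sig_peptide       signal peptide
TATA_signal       TATA signal
terminator        terminator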

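We retrieve the GenBank record with accession number AJ245946 and use it to explain the GenBank format:

1. Header. The header is the part of the record most closely tied to the database itself. It begins with the LOCUS line.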

LOCUS HSA245946 518 bp mRNA linear PRI 04-OCT-2000

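LOCUS line: the label, or identifier, of the entry, which may hint at the function of the sequence. Here HSA245946 is the locus name; it is unique and never repeated. The line also carries other information: the sequence length (518 bp), the molecule type (mRNA), the topology (linear), the division code (PRI) and the entry date (04-OCT-2000).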
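
As a rough illustration of how the LOCUS fields described above can be pulled apart, here is a small sketch. It assumes the eight whitespace-separated tokens shown in this example line; real flat files may omit or add fields, and the `Locus` type and `parse_locus` name are invented for this example.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length: int
    molecule: str
    topology: str
    division: str
    date: str


def parse_locus(line: str) -> Locus:
    """Split a LOCUS line laid out like the example in the text."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("unexpected LOCUS layout: " + line)
    _, name, length, _, molecule, topology, division, date = fields
    return Locus(name, int(length), molecule, topology, division, date)


locus = parse_locus("LOCUS HSA245946 518 bp mRNA linear PRI 04-OCT-2000")
print(locus.name, locus.length, locus.division)
```

Splitting on whitespace is enough here because none of the fields in this line contain spaces; a column-based reader would be needed for stricter parsing.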

DEFINITION Homo sapiens mRNA for neuroglobin (NGB gene)

The DEFINITION line summarizes the biological meaning of the record. Here it gives the species, Homo sapiens (human), and the molecule, the mRNA for neuroglobin (NGB gene).

ACCESSION AJ245946

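ACCESSION (accession number): unique and permanent. Here AJ245946 denotes the human neuroglobin mRNA sequence above, and this is the number to use when citing the sequence in the literature.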

VERSION AJ245946.1 GI:10639033

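VERSION (sequence version): contains the accession.version identifier AJ245946.1 and the GI number GI:10639033. Each GI number corresponds to exactly one nucleotide sequence. When the sequence changes it receives a new GI number and the version suffix increases, but the accession number itself never changes.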
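
The split between the stable accession and the changing version suffix can be shown with a short sketch. The regular expression below is an assumption based only on the example line in this post (the GI part is treated as optional), and `parse_version` is a hypothetical helper name.

```python
import re

# accession, dot, version number, then an optional "GI:<digits>" token
_VERSION = re.compile(r"VERSION\s+(\w+)\.(\d+)(?:\s+GI:(\d+))?")


def parse_version(line: str):
    """Return (accession, version, gi) from a VERSION line; gi may be None."""
    match = _VERSION.fullmatch(line.strip())
    if match is None:
        raise ValueError("not a VERSION line: " + line)
    accession, version, gi = match.groups()
    return accession, int(version), int(gi) if gi is not None else None


print(parse_version("VERSION AJ245946.1 GI:10639033"))
```

Keeping the accession and the version as separate values makes it easy to check whether two records refer to the same entry at different revisions.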

KEYWORDS neuroglobin; NGB gene.

KEYWORDS field: supplied by the submitter; it names the gene product and other related information, here neuroglobin and NGB gene.

SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

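SOURCE field: states the organism, and where given the tissue, from which the sequence was obtained; in this example the source is Homo sapiens (human). The subkeyword ORGANISM gives the taxonomic position of the organism, here Homo sapiens followed by its lineage (Eukaryota, Metazoa and so on).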

REFERENCE 1
AUTHORS Burmester,T., Weich,B., Reinhardt,S. and Hankeln,T.
TITLE A vertebrate globin expressed in the brain
JOURNAL Nature 407 (6803), 520-523 (2000)
MEDLINE 20479975
PUBMED 11029004

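REFERENCE lines: list the literature related to the sequence, with AUTHORS, TITLE and JOURNAL given as subkeywords. The field also lists the identifiers in the MEDLINE and PUBMED literature databases. In the web view these identifiers are hyperlinks, and clicking one brings up the abstract. A sequence may have several references, numbered in order, each indicating which part of the sequence it relates to.
2. Feature table (Features). The feature table conveys the biological background of the record directly, and a full set of annotations makes it quick to extract the relevant biological information. The feature table definition spells out the legal features (the annotations that may be used) and the qualifiers allowed for each. Annotations that are only inferred or computed are less reliable.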

FEATURES Location/Qualifiers
source 1..518
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"

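FEATURES (feature table): has a fixed layout and describes the properties of the sequence in detail. Values carrying the /db_xref qualifier link to other databases, such as the taxonomy database in this example (taxon:9606). The table also briefly describes translated products such as signal peptides and the final protein product. The source feature is the only feature that must appear in every GenBank record; in most cases a record has exactly one source feature, carrying the /organism qualifier.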
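
A qualifier line such as /db_xref="taxon:9606" has a simple key and value structure. The sketch below shows one way to read it; it handles only single-line qualifiers like those in this record, and both function names are invented for illustration.

```python
def parse_qualifier(line: str):
    """Split '/key="value"' or '/key=value' into (key, value).

    A qualifier without '=' is returned with value None.
    """
    text = line.strip()
    if not text.startswith("/"):
        raise ValueError("not a qualifier line: " + line)
    key, sep, value = text[1:].partition("=")
    if not sep:
        return key, None
    return key, value.strip('"')


def split_db_xref(value: str):
    """Split a db_xref value such as 'taxon:9606' into (database, identifier)."""
    database, _, identifier = value.partition(":")
    return database, identifier


key, value = parse_qualifier('/db_xref="taxon:9606"')
print(key, split_db_xref(value))
```

The database name before the colon tells a program which external resource to query, and the part after it is the identifier in that resource.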

BASE COUNT 87 a 161 c 163 g 107 t

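The BASE COUNT line gives the base composition of the sequence: in this example 87 a, 161 c, 163 g and 107 t, which add up to the 518 bp stated on the LOCUS line.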
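
The base composition can be checked against the sequence length with a few lines of code. This is a sketch under the assumption that the line has the "BASE COUNT n a n c n g n t" layout shown above; `parse_base_count` and `base_count_line` are illustrative names.

```python
from collections import Counter


def parse_base_count(line: str) -> dict:
    """Read a 'BASE COUNT 87 a 161 c ...' line into {'a': 87, 'c': 161, ...}."""
    fields = line.split()
    if fields[:2] != ["BASE", "COUNT"]:
        raise ValueError("not a BASE COUNT line: " + line)
    values = fields[2:]
    return {base: int(count) for count, base in zip(values[0::2], values[1::2])}


def base_count_line(sequence: str) -> str:
    """Build a BASE COUNT line from a nucleotide sequence."""
    counts = Counter(sequence.lower())
    return "BASE COUNT " + " ".join(f"{counts[b]} {b}" for b in "acgt")


counts = parse_base_count("BASE COUNT 87 a 161 c 163 g 107 t")
print(sum(counts.values()))  # compare with the length on the LOCUS line
```

A mismatch between the summed counts and the stated length is a quick sign that a record was truncated or edited by hand.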
3. Sequence. A GenBank record introduces the sequence with the ORIGIN line, followed by the bases, and ends with a double-slash line "//".

ORIGIN
1 atggagcgcc cggagcccga gctgatccgg cagagctggc gggcagtgag ccgcagcccg
61 ctggagcacg gcaccgtcct gtttgccagg ctgtttgccc tggagcctga cctgctgccc
…
481 catccatctg tgtctgtctg ttggcctgta tctgttgt
//
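
The sequence block between ORIGIN and "//" can be read back into a single string by dropping the position numbers and the spaces between the ten-base groups. The sketch below assumes that layout and nothing more; `read_origin` is a hypothetical helper.

```python
def read_origin(lines) -> str:
    """Concatenate the bases found between the ORIGIN line and '//'."""
    chunks = []
    in_sequence = False
    for line in lines:
        if line.startswith("ORIGIN"):
            in_sequence = True
            continue
        if line.startswith("//"):
            break
        if in_sequence:
            # Keep only alphabetic tokens; the leading number is the position.
            chunks.extend(tok for tok in line.split() if tok.isalpha())
    return "".join(chunks)


record = [
    "ORIGIN",
    "1 atggagcgcc cggagcccga gctgatccgg",
    "31 cagagctggc gggcagtgag ccgcagcccg",
    "//",
]
sequence = read_origin(record)
print(len(sequence), sequence[:10])
```

Lines before ORIGIN and after the terminator are ignored, so the same function can be fed a whole record.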

Appendix 3.1: Format of a GenBank record

LOCUS HSA245946 518 bp mRNA linear PRI 04-OCT-2000
DEFINITION Homo sapiens mRNA for neuroglobin (NGB gene).
ACCESSION AJ245946
VERSION AJ245946.1 GI:10639033
KEYWORDS neuroglobin; NGB gene.
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE 1
AUTHORS Burmester,T., Weich,B., Reinhardt,S. and Hankeln,T.
TITLE A vertebrate globin expressed in the brain
JOURNAL Nature 407 (6803), 520-523 (2000)
MEDLINE 20479975
PUBMED 11029004
REFERENCE 2 (bases 1 to 518)
AUTHORS Hankeln,T.
TITLE Direct Submission
JOURNAL Submitted (06-SEP-1999) Hankeln T., Institute of Molecular
Genetics, University of Mainz, Becherweg 32, Mainz, D-55099,
GERMANY
FEATURES Location/Qualifiers
source 1..518
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/tissue_type="brain"
/dev_stage="adult"
gene 1..518
/gene="NGB"
CDS 1..456
/gene="NGB"
/function="oxygen binding protein"
/codon_start=1
/product="neuroglobin"
/protein_id="CAC11133.1"
/db_xref="GI:10639034"
/db_xref="GOA:Q9NPG2"
/db_xref="SPTREMBL:Q9NPG2"
/translation="MERPEPELIRQSWRAVSRSPLEHGTVLFARLFALEPDLLPLFQY
NCRQFSSPEDCLSSPEFLDHIRKVMLVIDAAVTNVEDLSSLEEYLASLGRKHRAVGVK
LSSFSTVGESLLYMLEKCLGPAFTPATRAAWSQLYGAVVQAMSRGWDGE"
BASE COUNT 87 a 161 c 163 g 107 t
ORIGIN
1 atggagcgcc cggagcccga gctgatccgg cagagctggc gggcagtgag ccgcagcccg
61 ctggagcacg gcaccgtcct gtttgccagg ctgtttgccc tggagcctga cctgctgccc
121 ctcttccagt acaactgccg ccagttctcc agcccagagg actgtctctc ctcgcctgag
181 ttcctggacc acatcaggaa ggtgatgctc gtgattgatg ctgcagtgac caatgtggaa
241 gacctgtcct cactggagga gtaccttgcc agcctgggca ggaagcaccg ggcagtgggt
301 gtgaagctca gctccttctc gacagtgggt gagtctctgc tctacatgct ggagaagtgt
361 ctgggccctg ccttcacacc agccacacgg gctgcctgga gccaactcta cggggccgta
421 gtgcaggcca tgagtcgagg ctgggatggc gagtaagagg cgaccccgcc cggcagcccc
481 catccatctg tgtctgtctg ttggcctgta tctgttgt
//
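
Table 3.3 notes that a CDS includes the stop codon, so the CDS span and the /translation qualifier of a record should agree: the number of codons in the span, minus one for the stop, equals the number of amino acids. The following sketch checks this for simple "start..end" locations only; joins, complements and partial locations are not handled, and the function names are invented.

```python
def parse_location(location: str):
    """Parse a simple 'start..end' location into two integers."""
    start, sep, end = location.partition("..")
    if not sep:
        raise ValueError("only simple start..end locations are supported")
    return int(start), int(end)


def cds_is_consistent(location: str, translation: str, includes_stop: bool = True) -> bool:
    """Check that the CDS length matches the length of its translation."""
    start, end = parse_location(location)
    bases = end - start + 1
    if bases % 3 != 0:
        return False
    codons = bases // 3
    expected = codons - 1 if includes_stop else codons
    return expected == len(translation)


# The record above has CDS 1..456 and a 151-residue translation:
# 456 / 3 = 152 codons, one of which is the stop codon.
print(cds_is_consistent("1..456", "M" * 151))
```

The placeholder string stands in for the real /translation value; only its length matters for this check.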

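3.5.2 EMBL format

Each EMBL entry is a plain-text file in which every line begins with a two-letter uppercase line code. Table 3.4 explains the fields on the first line of an EMBL entry.

Table 3.4 Meaning of the fields on the first line of an EMBL entry
Entry name   Data class   Molecule type   Division   Sequence length (base pairs)
MMFOSB       standard     RNA             MUS        4145 BP

The line codes of an entry include ID (entry name), DE (brief description of the sequence), AC (accession number), SV (sequence version), KW (keywords related to the sequence), OS (species from which the sequence comes), OC (scientific name and taxonomic position of the source species), RN (reference number, or registration information for a submitted sequence), RA (authors of the reference or submitters of the sequence), RT (reference title), RL (journal of the reference, or the submitter's institution), RX (MEDLINE citation identifier of the reference), RC (reference comment), RP (sequence positions covered by the reference), CC (comments on the sequence), DR (cross-references to other databases), FH (feature table header), FT (feature table entries) and SQ (sequence header with base composition).
A typical EMBL sequence record looks like this: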

ID AF111847 standard; RNA; HUM; 2788 BP.  [entry name and basic properties]
XX  [field separator]
AC AF111847  [accession number]
XX
SV AF111847.1  [sequence version]
XX
DT 14-MAR-2000 (Rel. 63, Created)  [submission and update dates]
DT 09-MAY-2001 (Rel. 67, Last updated, Version 3)
XX
DE Homo sapiens ARFGAP1 protein (ARFGAP1) mRNA, complete cds.  [brief description of the sequence]
XX
KW  [keywords]
XX
OS Homo sapiens (human)  [source species]
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;  [taxonomic lineage]
OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN [1]  [reference entry]
RP 1-2788  [sequence positions covered by the reference]
RX MEDLINE; 20171380.  [literature cross-reference]
RX PUBMED; 10704287.
RA Zhang C., Yu Y., Zhang S., Liu M., Xing G., Wei H., Bi J., Liu X.,  [reference authors]
RA Zhou G., Dong C., Hu Z., Zhang Y., Luo L., Wu C., Zhao S., He F.;
RT "Characterization, chromosomal assignment, and tissue expression  [reference title]
RT of a novel human gene belonging to the ARF GAP family";
RL Genomics 63(3):400-408(2000).  [reference location]
XX
RN [2]
RP 1-2788
RA Zhang C., Yu Y., Zhang S., Ouyang S., Luo L., Wei H., Zhou G.,
RA Zhang Y., Liu M., He F.;
RT
RL Submitted (06-AUG-1999) to the EMBL/GenBank/DDBJ databases.
RL Dept. of Genomics and Proteomics, Institute of Radiation Medicine,
RL Beijing Taiping Road 27, Beijing, Beijing 100850, P. R. China
XX
DR ENSEMBL; ENSG00000100262; ENST00000263245.  [cross-references to other databases]
DR GOA; Q9NP61.
DR SWISS-PROT; Q9NP61; ARG3_HUMAN.
XX
FH Key Location/Qualifiers  [feature table header]
FH
FT source 1..2788  [feature table data]
FT /chromosome="22"
FT /db_xref="taxon:9606"
FT /mol_type="mRNA"
FT /organism="Homo sapiens"
FT /map="22q13.2"
FT /clone="FLB2127"
FT 5'UTR 1..57
FT /gene="ARFGAP1"
FT CDS 58..1608
FT /codon_start=1
FT /db_xref="GOA:Q9NP61"
FT /db_xref="SWISS-PROT:Q9NP61"
FT /evidence=NOT_EXPERIMENTAL
FT /gene="ARFGAP1"
FT /product="ARFGAP1 protein"
FT /protein_id="AAF40310.1"
FT /translation="MGDPSKQDILTIFKRLRSVPTNKVCFDCGAKNPSWASITYGVFLC
FT IDCSGSHRSLGVHLSFIRSTELDSNWSWFQLRCMQVGGNASASSFFHQHGCSTNDTNAK
FT YNSRAAQLYREKIKSLASQATRKHGTDLWLDSCVVPPLSPPPKEEDFFASHVSPEVSDT
FT AWASAIAEPSSLTSRPVETTLENNEGGQEQGPSVEGLNVPTKATLEVSSIIKKKPNQAK
FT KGLGAKKGSLGAQKLANTCFNEIEKQAQAADKMKEQEDLAKVVSKEESIVSSLRLAYKD
FT LEIQMKKDEKMNISGKKNVDSDRLGMGFGNCRSVISHSVTSDMQTIEQESPIMAKPRKK
FT YNDDSDDSYFTSSSSYFDEPVELRSSSFSSWDDSSDSYWKKETSKDTETVLKTTGYSDR
FT PTARRKPDYEPVENTDEAQKKFGNVKAISSDMYFGRQSQADYETRARLERLSASSSISS
FT ADLFEEPRKQPAGNYSLSSVLPNAPDMAQFKQGVRSVAGKLSVFANGVVTSIQDRYGS"
FT 3'UTR 1609..2788
FT /gene="ARFGAP1"
XX
SQ Sequence 2788 BP; 914 A; 531 C; 602 G; 741 T; 0 other;  [start of sequence]
ttttcgtcga ctcttaccgg ttggctgggc cagctgcgcc gcggctcaca gctgacgatg 60
ggggacccca gcaagcagga catcttgacc atcttcaagc gcctccgctc ggtgcccact 120
(omitted)
aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa 2760
aaaaaaaaaa aaaaaaaaaa aaaaaaaa 2788
//  [end of entry]
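
The ID and SQ lines of the record above carry the same length twice, and the SQ line also lists the base composition. A minimal sketch for reading both follows; the regular expressions are assumptions based on this one example (the older ID layout with four semicolon-separated fields), and the function names are illustrative.

```python
import re

_ID = re.compile(r"ID\s+(\S+)\s+(\w+);\s*(\w+);\s*(\w+);\s*(\d+)\s+BP\.")


def parse_embl_id(line: str) -> dict:
    """Parse an ID line laid out like 'ID NAME class; molecule; division; N BP.'"""
    match = _ID.fullmatch(line.strip())
    if match is None:
        raise ValueError("unexpected ID line: " + line)
    name, data_class, molecule, division, length = match.groups()
    return {"name": name, "data_class": data_class, "molecule": molecule,
            "division": division, "length": int(length)}


def parse_embl_sq(line: str):
    """Return (length, composition) from an SQ line."""
    match = re.match(r"SQ\s+Sequence\s+(\d+)\s+BP;", line.strip())
    if match is None:
        raise ValueError("unexpected SQ line: " + line)
    composition = {label: int(count)
                   for count, label in re.findall(r"(\d+)\s+(A|C|G|T|other);", line)}
    return int(match.group(1)), composition


entry = parse_embl_id("ID AF111847 standard; RNA; HUM; 2788 BP.")
length, composition = parse_embl_sq(
    "SQ Sequence 2788 BP; 914 A; 531 C; 602 G; 741 T; 0 other;")
print(entry["length"] == length == sum(composition.values()))
```

The annotations added at the ends of the lines in this post are not part of the format and should be removed before parsing an ID line this way.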

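3.5.3 FASTA format

This is a fairly simple format and the most widely used. It has only two parts. The first line of the file starts with a greater-than sign ">" followed by the sequence name and a short description of its basic properties; the sequence itself begins on the second line. The leading ">" is what distinguishes FASTA from other database formats, and there is no special end-of-sequence marker. Only the standard nucleotide symbols or the standard one-letter amino acid codes are allowed. Nucleotide symbols are usually accepted in either case and amino acids are normally written in uppercase, although some programs have strict case requirements. No line should exceed 80 characters, and lines contain no spaces. FASTA format can also be used for multiple sequence alignments.
Typical FASTA records look like this:
1. Single sequence (the header gives the sequence name, a description and the sequence length)
>FOSB_MOUSE Protein fosB. 338 aa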
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
2. Multiple sequences
>gi|114736|sp|P22063|AXO1_RAT Contactin 2 precursor (Axonin-1) (Axonal glycoprotein TAG-1) (Transient axonal glycoprotein 1) (TAX-1)
MGTHARKKASLLLLVLATVALVSSPGWSFAQGTPATFGPIFEEQPIGLLFPEESAEDQVTLACRARASPP
ATYRWKMNGTDMNLEPGSRHQLMGGNLVIMSPTKTQDAGVYQCLASNPVGTVVSKEAVLRFGFLQEFSKE
……
KPPPRRPPGNISWTFSSSSLSLKWDPVVPLRNESTVTGYKMLYQNDLHPTPTLHLTSKNWIEIPVPEDIG
HALVQIRTTGPGGDGIPAEVHIVRNGGTSMMVESAAARPAHPGPAFSCMVILMLAGYQKL
>gi|127857|sp|P13592|NCA2_HUMAN Neural cell adhesion molecule 1, 120 kDa isoform precursor (N-CAM 120) (NCAM-120) (CD56 antigen)
MLQTKDLIWTLFFLGTAVSLQVDIVPSQGEISVGESKFFLCQVAGDAKDKDISWFSPNGEKLTPNQQRIS
VVWNDDSSSTLTIYNANIDDAGIYKCVVTGEDGSESEATVNVKIFQKLMFKNAPTPQEFREGEDAVIVCD
……
EPAKGEPSAPKLEGQMGEDGNSIKVNLIKQDDGGSPIRHYLVRYRALSSEWKPEIRLPSGSDHVMLKSLD
WNAEYEVYVVAENQQGKSKAAHFVFRTSAQPTAIPATLGGNSASYTFVSLLFSAVTLLLLC
>gi|14286138|sp|P20241|NRG_DROME Neuroglian precursor
MWRQSTILAALLVALLCAGSAESKGNRPPRITKQPAPGELLFKVAQQNKESDNPFIIECEADGQPEPEYS
WIKNGKKFDWQAYDNRMLRQPGRGTLVITIPKDEDRGHYQCFASNEFGTATSNSVYVRKAELNAFKDEAA
KTLEAVEGEPFMLKCAAPDGFPSPTVNWMIQESIDGSIKSINNSRMTLDPEGNLWFSNVTREDASSDFYY
……
NKSAGRQSVSSANKPGVESDTDSMAEYGDGDTGQFTEDGSFIGQYVPGKLQPPVSPQPLNNSAAAHQAAP
TAGGSGAAGSAAAAGASGGASSAGGAAASNGGAAAGAVATYV
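
Because FASTA has only a header line and sequence lines, a reader and a writer fit in a few lines. The following is a plain-Python sketch rather than a library implementation; the 60-character line width is an arbitrary choice that stays under the 80-character limit mentioned above.

```python
def parse_fasta(text: str):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Write one record with sequence lines of at most `width` characters."""
    lines = [">" + header]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines)


text = ">seq1 first example\nMFQAFP\nGDYDSG\n>seq2\nACGT\n"
for header, sequence in parse_fasta(text):
    print(header, len(sequence))
```

A new ">" line closes the previous record, which is how several sequences share one file without any end marker.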

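云贵浪子 · posted on 2004-11-26 19:05:10

Well-known and distinctive bioinformatics databases, by category
Years of accumulated experimental data have produced the hundreds of bioinformatics databases in use today. Each collects and organizes experimental data toward its own goal and offers related query and data-processing services. With the spread of the Internet, most of these databases can be accessed or downloaded over the network.

Broadly speaking, bioinformatics databases can be divided into primary and secondary databases. The data in a primary database come directly from experiments and receive only simple classification, organization and annotation. A secondary database is derived from primary databases, experimental data and theoretical analysis for a specific purpose, and represents a further organization of biological knowledge. Well-known primary nucleic acid databases include GenBank, the EMBL nucleotide database and DDBJ; protein sequence databases include SWISS-PROT and PIR; protein structure databases include PDB. There are a great many secondary databases, each with its own character depending on the research questions it serves, such as the human genome map database GDB, the transcription factor and binding site database TRANSFAC, and the protein structure classification database SCOP.

Some well-known and distinctive bioinformatics databases are introduced briefly below.

2.1 Gene and genome databases

1. GenBank

GenBank contains all known nucleic acid and protein sequences together with the associated literature and biological annotation. It is built and maintained by the US National Center for Biotechnology Information (NCBI). Its data come directly from sequences submitted by researchers, from large batches of EST and other sequencing data submitted by sequencing centers, and from exchange with other data centers. GenBank exchanges data every day with the European Molecular Biology Laboratory (EMBL) database and the DNA Data Bank of Japan (DDBJ), keeping the three databases synchronized. By August 1999 GenBank held 4.6 million sequences and 3.4 billion bases, and the growth rate is still increasing. GenBank can be downloaded free of charge from the NCBI FTP server, either as the complete database or as cumulative updates. NCBI also provides a wide range of query, sequence similarity search and other analysis services, all reachable from the NCBI home page.

The data in GenBank come from about 55,000 species; 56% are human genomic sequences (34% of all sequences are human ESTs). Each GenBank record contains a brief description of the sequence, its scientific name and taxonomy, references, a feature table and the sequence itself. The feature table annotates biological features of the sequence such as coding regions, transcription units, repeat regions, and mutation or modification sites. All records are divided among a number of files in 16 divisions, such as bacteria, viruses, primates and rodents, as well as EST data, genome survey data and large-scale genomic sequence data; divisions such as EST are further split into several files.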

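(1) Searching GenBank

NCBI's retrieval system is Entrez, an integrated Web-based search system for biological databases. With Entrez, users can search not only GenBank nucleic acid data but also protein sequences from GenBank and other databases, genome map data, three-dimensional protein structures from the Molecular Modeling Database (MMDB), population sequence data sets, and MEDLINE literature via PubMed.

Entrez offers convenient, practical searching, and everything can be done in a web browser. Users can build complex queries with the Limits, Index, History and Clipboard functions on the Entrez interface. For the records retrieved, users can choose which data to display, save the results, and even view sequences graphically. More detailed instructions are available on the Entrez home page.

(2) Submitting sequence data to GenBank

Researchers can submit new sequences from their own work to NCBI for inclusion in GenBank, using either the Web-based BankIt or the stand-alone program Sequin.

BankIt is a series of forms covering contact information, release requirements, citation information, source information and the sequence itself. After submission the user receives by e-mail the automatically generated entry, the new GenBank accession number and, once annotation is finished, the complete record. Users can also revise information on released sequences through the BankIt pages. BankIt suits individual researchers submitting a few sequences; it is not suitable for large numbers of sequences or for very long ones, and EST and GSS sequences should not be submitted through it. Instructions and sequence requirements are given on the BankIt home page.

Large submissions can be made with Sequin. Sequin makes it easy to edit and handle complex annotation and includes a set of built-in validation functions that improve sequence quality. It is also designed for submitting sequences from phylogenetic, population and mutation studies, and can include alignment data. Besides editing and revising records, Sequin can be used for sequence analysis: any analysis program that takes FASTA or ASN.1 sequences as input can be integrated into it. Versions of Sequin for different operating systems can be found at ftp://ncbi.nlm.nih.gov/sequin/, and instructions are on its web page.

NCBI: http://www.ncbi.nlm.nih.gov/

Entrez: http://www.ncbi.nlm.nih.gov/entrez/

BankIt: http://www.ncbi.nlm.nih.gov/BankIt

Sequin: http://www.ncbi.nlm.nih.gov/Sequin/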

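2. The EMBL nucleotide sequence database

The EMBL nucleotide sequence database consists of nucleic acid sequence data maintained by the European Bioinformatics Institute (EBI). Because it exchanges data with GenBank and DDBJ, it too is a comprehensive nucleic acid sequence database. The database is managed with the Oracle database system, and it can be searched over the Internet through the Sequence Retrieval System (SRS). Sequences can be submitted to EMBL with the Web-based WEBIN tool or with Sequin.

Database: http://www.ebi.ac.uk/embl/

SRS: http://srs.ebi.ac.uk/

WEBIN: http://www.ebi.ac.uk/embl/Submission/webin.html

3. DDBJ

The DNA Data Bank of Japan (DDBJ) is also a comprehensive nucleic acid sequence database and exchanges data with GenBank and EMBL. The SRS tool on its home page can be used for retrieval and sequence analysis, and sequences can be submitted with Sequin.

DDBJ: http://www.ddbj.nig.ac.jp/

4. GDB

The Genome Database (GDB) stores and processes genome mapping data for the Human Genome Project (HGP). Its goal is to build an encyclopedia of the human genome. Besides building genome maps, it has developed ways to describe genome content at the sequence level, including sequence variation and other descriptions of function and phenotype. GDB currently holds: regions of the human genome (genes, clones, amplimers (PCR markers), breakpoints, cytogenetic markers, fragile sites, ESTs, syndromic regions, contigs and repeats); maps of the human genome (cytogenetic maps, linkage maps, radiation hybrid maps, content contig maps, integrated maps and others); and variation within the human genome (mutations and polymorphisms, plus allele frequency data). GDB stores its data in an object model and provides Web-based retrieval of data objects; users can search for objects of every type and view genome maps graphically.

GDB: http://www.gdb.org/

GDB mirror in China: http://gdb.pku.edu.cn/gdb/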

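2.2 Protein databases

1. PIR and PSD

The PIR-International Protein Sequence Database (PSD) is the largest public protein sequence database in the world, maintained jointly by the Protein Information Resource (PIR), the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID). It is a comprehensive, annotated, non-redundant protein sequence database containing more than 142,000 protein sequences (as of September 1999), including sequences from several dozen complete genomes. All sequence data are curated; more than 99% of the sequences are classified by protein family, and over half are also classified by superfamily. PSD annotation includes cross-references to many sequence, structure, genome and literature databases, as well as links between entries within the database. These internal links help users move easily among entries for complexes, enzyme-substrate interactions, activation and regulatory cascades, and entries that share common features. A complete release is issued every quarter, and updates are available weekly.

PSD has several auxiliary databases, such as a superfamily-based non-redundant database. PIR offers three kinds of sequence search: interactive text-based retrieval; standard sequence similarity searches, including BLAST and FASTA; and advanced searches that combine sequence similarity, annotation and protein family information, including annotation-sorted similarity search, domain search and GeneFIND.

PIR and PSD: http://pir.georgetown.edu/

Database download: ftp://nbrfa.georgetown.edu/pir/

2. SWISS-PROT

SWISS-PROT is an annotated protein sequence database maintained by the European Bioinformatics Institute (EBI). It consists of protein sequence entries, each containing the protein sequence, citation information, taxonomic information and annotation. The annotation covers protein function, post-translational modifications, special sites and regions, secondary structure, quaternary structure, similarity to other sequences, the relationship between sequence defects and disease, sequence variants and conflicts. SWISS-PROT keeps redundancy to a minimum and is cross-referenced to more than 30 other databases, including nucleic acid sequence, protein sequence and protein structure databases.

The Sequence Retrieval System (SRS) makes it easy to search SWISS-PROT and other EBI databases.

SWISS-PROT accepts only protein sequences obtained by direct sequencing; submissions can be made on its web pages.

SWISS-PROT: http://www.ebi.ac.uk/swissprot/

3. PROSITE

PROSITE collects biologically significant protein sites and sequence patterns, and can use them to identify quickly and reliably which protein family a sequence of unknown function belongs to. In some cases a protein has very low overall sequence similarity to proteins of known function but, because of functional constraints, retains a sequence pattern closely tied to that function. A PROSITE search can then uncover the hidden functional motif, which makes it an effective tool for sequence analysis. The patterns in PROSITE include enzyme catalytic sites, ligand binding sites, metal-binding residues, cysteines involved in disulfide bonds, and regions that bind small molecules or other proteins. Besides patterns, PROSITE also includes profiles built from multiple sequence alignments, which detect similarity between a sequence and a profile more sensitively. The PROSITE home page offers the related search services.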
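
To make the idea of a sequence pattern concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a short sequence with it. It assumes the usual pattern notation (elements joined by "-", "x" for any residue, square brackets for allowed residues, curly braces for excluded residues, and repeat counts in parentheses). Terminal anchors and profiles are not handled, and `prosite_to_regex` is an illustrative name, not part of any PROSITE software.

```python
import re

# One element: x, a single residue, [allowed], or {excluded},
# optionally followed by (n) or (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into regular expression syntax."""
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + repr(element))
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high is not None else "") + "}"
        pieces.append(core + repeat)
    return "".join(pieces)


# N-glycosylation site pattern: N, not P, S or T, not P.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
hit = re.search(regex, "AANGTA")
print(regex, hit.start() if hit else None)
```

A short pattern like this matches by chance quite often, which is one reason the text above also mentions profiles as a more sensitive alternative.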

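PROSITE: http://www.expasy.ch/prosite/

4. PDB

The Protein Data Bank (PDB) is the single international archive of structural data for biological macromolecules, established by Brookhaven National Laboratory in the United States. The data in PDB come from X-ray crystallography and nuclear magnetic resonance (NMR), and are archived after curation and validation. PDB is now maintained by the Research Collaboratory for Structural Bioinformatics (RCSB). The main RCSB server and mirror servers around the world provide search and download services, along with documentation on the PDB file format and other topics; PDB data are also distributed on CD-ROM. Programs such as RasMol can display the three-dimensional structure of a macromolecule from a PDB file.

RCSB PDB: http://www.rcsb.org/pdb/

5. SCOP

The Structural Classification of Proteins (SCOP) database describes in detail the relationships among known protein structures. The classification has several levels: family, describing close evolutionary relationships; superfamily, describing distant evolutionary relationships; fold, describing relationships of spatial geometry; and class, in which all folds are grouped into a few broad classes such as all-α, all-β, α/β, α+β and multi-domain. SCOP also provides the non-redundant ASTRAL sequence library, which is commonly used to evaluate sequence alignment algorithms. In addition, SCOP provides the PDB-ISL library of intermediate sequences; pairwise comparison against this library can find sequences of known structure that are distantly related to a sequence of unknown structure.

SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/

6. COG

The Clusters of Orthologous Groups of proteins (COGs) database was built by classifying, according to phylogenetic relationships, the proteins encoded in 21 complete genomes of bacteria, archaea and eukaryotes. COG is useful for predicting the function of individual proteins and of the proteins in a whole new genome. With the COGNITOR program, a protein can be compared against all proteins in the COGs and assigned to the appropriate cluster. The COG site offers search and query of the classification data, a Web-based COGNITOR service, and queries on phylogenetic patterns.

COG: http://www.ncbi.nlm.nih.gov/COG

Download COG and COGNITOR: ftp://ncbi.nlm.nih.gov/pub/COG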

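2.3 Functional databases

1. KEGG

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a knowledge base for the systematic analysis of gene function, linking genomic information with functional information. Genomic information is stored in the GENES database, which covers completely and partially sequenced genomes. Higher-level functional information is stored in the PATHWAY database, which contains diagrams of cellular processes such as metabolism, membrane transport, signal transduction and the cell cycle, as well as information on conserved sub-pathways. A further KEGG database, LIGAND, holds information on chemical compounds, enzyme molecules and enzymatic reactions. KEGG provides Java graphical tools for browsing genome maps, comparing genome maps and manipulating expression maps, along with other tools for sequence comparison, graph comparison and pathway computation, all freely available.

KEGG: http://www.genome.ad.jp/kegg/

2. DIP

The Database of Interacting Proteins (DIP) collects experimentally verified protein-protein interactions. It has three parts: information on the proteins, information on the interactions, and the experimental techniques used to detect them. Users can query DIP by protein, species, protein superfamily, keyword, experimental technique or literature reference.

DIP: http://dip.doe-mbi.ucla.edu/

3. ASDB

The Alternative Splicing Database (ASDB) has a protein part and a nucleic acid part. ASDB (proteins) is derived from the SWISS-PROT protein sequence database by selecting sequences annotated with alternative splicing, searching for related alternatively spliced sequences, and then aligning, filtering and classifying them. ASDB (nucleotides) consists of complete genes in GenBank that are mentioned or annotated as alternatively spliced. The database provides convenient search services.

ASDB: http://cbcg.nersc.gov/asdb

4. TRRD

The Transcription Regulatory Regions Database (TRRD) is built on the growing body of information about the structure and function of eukaryotic gene regulatory regions. Each TRRD entry describes the structural and functional features of a particular gene: transcription factor binding sites, promoters, enhancers, silencers, and patterns of gene expression regulation. TRRD comprises five related tables: TRRDGENES (basic information and regulatory unit information for all genes in TRRD); TRRDSITES (details of regulatory factor binding sites); TRRDFACTORS (details of the regulatory factors that bind each site); TRRDEXP (detailed descriptions of gene expression patterns); and TRRDBIB (all references cited in the annotation). The TRRD home page provides search services for these tables.

TRRD: http://wwwmgs.bionet.nsc.ru/mgs/dbases/trrd4/

5. TRANSFAC

TRANSFAC is a database of transcription factors, their binding sites in the genome, and their DNA-binding profiles. It consists of the tables SITE, GENE, FACTOR, CLASS, MATRIX, CELLS, METHOD and REFERENCE. Several extension databases are closely related to TRANSFAC: PATHODB collects mutated transcription factors and binding sites that may cause disease; S/MART DB collects information on protein factors and sites associated with changes in chromosome structure; TRANSPATH describes the signal transduction networks involved in regulating transcription factors; and CYTOMER shows the expression of human transcription factors across organs, cell types, physiological systems and developmental stages. TRANSFAC and its related databases can be downloaded free of charge or searched over the Web.

TRANSFAC: http://transfac.gbf.de/TRANSFAC/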

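2.4 Other database resources

1. DBCat

DBCat is a catalogue of bioinformatics databases. It collects information on more than 500 bioinformatics databases and classifies them by field of application, with basic types including DNA, RNA, protein, genome, map, protein structure and literature. The catalogue can be downloaded free of charge or searched online.

DBCat: http://www.infobiogen.fr/services/dbcat/

Download DBCat: ftp://ftp.infobiogen.fr/pub/db/dbcat

2. PubMed

PubMed is the literature citation database maintained by NCBI. It provides citation searching of MEDLINE, Pre-MEDLINE and other literature databases, and links to a large number of online scientific journals. PubMed can be searched conveniently through Entrez.

PubMed: http://www.ncbi.nlm.nih.gov/

Beyond the databases mentioned above there are many more specialized bioinformatics databases, covering every level and field of current biological research, which cannot all be described here for reasons of space. China also has mirror sites of some large databases and distinctive databases developed locally, such as the Peking University molecular bioinformatics mirror system, which is the Chinese node of the European Molecular Biology Network (EMBnet), and a Chinese-language gene database with an analysis and management system developed jointly by Shanghai Borong Gene (上海博容基因公司) and Shanghai Jiarui Software (上海嘉瑞软件公司). A national bioinformatics center is also being planned. We look forward to more high-quality, easy-to-use database resources in China to advance bioinformatics and the life sciences as a whole.

Institute of Bioinformatics, Tsinghua University: http://bioinfo.tsinghua.edu.cn/

Peking University bioinformatics mirror system: http://cbi.pku.edu.cn/

云贵浪子 · posted on 2004-11-26 19:07:29

Protein interaction databases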
【BIND】 - Biomolecular Interaction Network Database
[Introduction]
We have designed and implemented a new database encompassing the growing network of protein and other biomolecular interactions, called BIND (Biomolecular Interaction Network Database).

This database will span the complexity of interaction information gathered through experimental studies of biomolecular interactions. Interaction information will come from the literature, submitters and other databases.

BIND contains interaction, molecular complex and pathway records.
[Link]http://www.bind.ca

--------------------------------------------------------------------------------

【DIP】 - Database of Interacting Proteins
[Introduction]
DIP (Database of Interacting Proteins) is a database of protein pairs that are known to interact with each other. By interact we mean two amino acid chains that bind to each other for a function. The idea is to provide well defined links between proteins that interact. The database is publicly available on the web and is intended to aid those studying protein-protein interactions, signaling pathways, multiple interactions and complex systems.
[Link]http://dip.doe-mbi.ucla.edu/

--------------------------------------------------------------------------------

【PIM】 - Hybrigenics
[Introduction]
Protein Interaction Maps (PIMs) for whole microbial pathogen genomes or selected cDNA libraries. First PIMs available:
Helicobacter pylori (example below)
Hepatitis C Virus
Saccharomyces cerevisiae
[Link]http://www.hybrigenics.fr/

--------------------------------------------------------------------------------

【PathCalling Yeast Interaction Database 】
[Introduction]
PathCalling is CuraGen's validated, industrial scale proteomic technology designed to functionally analyze important disease related or drug response proteins. PathCalling is an automated and fully operational technology powered by GeneScape, CuraGen's Internet-based operating portal. PathCalling enables scientists to:

Identify protein-protein interactions on a genome-wide scale - Knowledge of interacting proteins provides insight into the function of important genes, elucidates relevant pathways, and facilitates the identification of potential drug targets for use in developing novel therapeutics

Rapidly interpret protein-protein interactions - Powerful bioinformatics software enables comparative cross-species genomic analysis, accelerating functional assignment and drug target discovery

Traverse up and down pathways - Expanding biological pathways involved in disease and drug response increases knowledge of the system, and enables scientists to identify both genes and proteins not previously associated with disease or drug response
[Link]http://portal.curagen.com/extpc/com.curagen.portal.servlet.Yeast

--------------------------------------------------------------------------------

【MINT】 - a Molecular Interactions Database
[Introduction]
MINT is a database designed to store functional interactions between biological molecules (proteins, RNA, DNA). Beyond cataloguing the formation of binary complexes, MINT was conceived to store other type of functional interactions namely enzymatic modifications of one of the partners. Presently MINT focuses on experimentally verified protein-protein interactions. Both direct and indirect relationships are considered.

MINT consists of entries extracted from the scientific literature by expert curators assisted by 'MINT Assistant', a software tool that targets abstracts containing interaction information and presents them to the curator in a user friendly format.

The interaction data can be easily extracted and viewed graphically by 'MINT Viewer'.
[Link]http://cbm.bio.uniroma2.it/mint/

--------------------------------------------------------------------------------

【GRID】 - The General Repository for Interaction Datasets
[Introduction]
The General Repository for Interaction Datasets (GRID) is a database of genetic and physical interactions developed in The Tyers Group at the Samuel Lunenfeld Research Institute at Mount Sinai Hospital. It contains interaction data from many sources, including several genome/proteome-wide studies, the MIPS database, and BIND.
[Link]http://biodata.mshri.on.ca/grid/servlet/Index

--------------------------------------------------------------------------------

【InterPreTS】 - protein interaction prediction through tertiary structure
[Introduction]
InterPreTS (Interaction Prediction through Tertiary Structure) is a web-based tool for predicting protein-protein interactions. Given a pair of query sequences, it first searches for homologues in a database of interacting domains (DBID) of known three-dimensional complex structures. Pairs of sequences homologous to a known interacting pair are scored for how well they preserve the atomic contacts at the interaction interface.
[Link]http://www.russell.embl.de/interprets/

--------------------------------------------------------------------------------

【STRING】 - predicted functional associations among genes/proteins
[Introduction]
STRING is a database of predicted functional associations among genes/proteins.
Genes of similar function tend to be maintained in close neighborhood, tend to be present or absent together, i.e. to have the same phylogenetic occurrence, and can sometimes be found fused into a single gene encoding a combined polypeptide. STRING integrates this information from as many genomes as possible to predict functional links between proteins.
[Link]http://www.bork.embl-heidelberg.de/STRING/

--------------------------------------------------------------------------------

【BID】
[Introduction]
An overflow of information on the characterization of protein-protein interactions at the amino-acid level is continuing to develop with the goal of better understanding protein interfaces. For this reason it is necessary to acquire a protein-protein interaction database in which an enormous number of interactions can be easily accessed.
[Link]http://tsailab.tamu.edu/BID/

--------------------------------------------------------------------------------

【PPID】

[Introduction]
The Protein-Protein Interaction Database was originally one person's attempt to integrate a gamut of biological, bibliographical and molecular data and build a framework which might help in understanding how cells orchestrate their protein content in order to become what they are: machines with a purpose. This is based on the simple paradigm that functional units such as signal cascades are held together in a close space, thereby allowing specific events to occur without the necessity of passive diffusion and random events. The PPID database arose from the need to interpret proteomic datasets, which were generated by analysing the NMDA-receptor complex (see H. Husi, M. A. Ward, J. S. Choudhary, W. P. Blackstock and S. G. Grant (2000). Proteomic analysis of NMDA receptor-adhesion protein signaling complexes. Nat Neurosci 3, 661-669.). Studying these clusters of proteins unavoidably requires handling large datasets, which PPID is aimed at and tailored for. The database unifies molecular entries across three species, namely human, rat and mouse, and is based on sequence databases such as SwissProt, EMBL, TrEMBL (translated EMBL sequences) and Unigene and on the literature database PubMed. A typical entry in PPID holds up to three general entries for the three species, all protein and gene accession numbers associated with them (assembled from Blast2 searches of the databases) and the OMIM entry as maintained by Johns Hopkins University. Protein sequence information is also included, together with known and novel splice variants of each molecule as found by ClustalW sequence alignments. Entries also include protein-binding information together with the literature reference. The whole database is curated manually to ensure accuracy and quality. Querying the database will be possible by online browsing and by batch submission of large datasets holding accession number information, as can be generated using software like Mascot for mass spectrometry.
Cluster analysis of the submitted datasets in the form of a graphical output will be developed, as well as an easy-to-use web interface.

An interface is currently being built in collaboration with the Department of Informatics (T. Theodosiou and D. Armstrong) and will be deployed soon.

[Link]http://www.anc.ed.ac.uk/mscs/PPID/

----------------------------------------------------------------------------------------------

【C. elegans Protein Interactome 】
[Introduction]
This website is now unavailable! --June 13, 2003 Biozy

------------------------------------------------------------------------------------------------

【SPID】Subtilis Protein interaction Database
[Introduction]
To date, the Database Contains 95 Interactions between 68 Proteins.
[Link]http://www-mig.jouy.inra.fr/bdsi/SPiD/

-------------------------------------------------------------------------------------------------

【Mouse Protein-Protein interactions 】Just some datasets

[Introduction]
Just some datasets about mouse protein-protein interaction
[Link]http://genome.gsc.riken.go.jp/ppi/

-------------------------------------------------------------------------------------------------

【Human herpesvirus 1 Protein-Protein interactions 】Just some datasets

[Introduction]
Protein-Protein Interactions Table for Human herpesvirus 1
[Link]http://www.stdgen.lanl.gov/cgi-bin/pp.cgi?dbname=hhv1/

-------------------------------------------------------------------------------------------------

【MDSP 】

[Introduction]
Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry

[Link]http://www.mdsp.com/yeast/

-------------------------------------------------------------------------------------------------

【COMBASE 】

[Introduction]
This database is under construction. At the moment, only the tables and figures are available.

[Link]http://salilab.org/sub-pages/combase.html

-----------------------------------------------------------------------------------------------

【Interact】

[Introduction]
Protein-protein interactions are intrinsic to every cellular process. They form the basis of phenomena such as DNA replication and transcription, metabolism, signal transduction, and cell cycle control. Understanding the role of a protein within a cell relies on discovering the biological context in which it performs its tasks [1]; knowing the interactions it makes is therefore vital.

Information relating to protein-protein interactions is spread across many disparate databases. Even sequence databases such as SwissProt [2] have interaction information hidden in the free-text annotation associated with function. Other databases, such as DIP [3] and the Yeast Protein Database [4], take a more structured approach to storing interaction data. The MIPS [5] protein database links to a table of the binary protein interactions of yeast and a list of the 230 multi-protein complexes.

Object-oriented database technology provides a means to fully accommodate and query the data associated with protein interactions. An object-oriented database management system (OODBMS) combines object-oriented methodology with database theory, an alliance that integrates the programming language with the database. This resolves the impedance mismatch of relational database systems, in which the programming language and the database use incompatible models. An OODBMS can express complex and hierarchical relationships via collections and inheritance, and provides flexible data types in the form of objects. This makes it a good model for abstracting real-life things such as proteins, experiments and references.
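
To make the object model idea concrete, here is a minimal sketch in plain Python. It is not the Interact schema: every class name, field and data value below is made up for illustration, and a list in memory stands in for the database. It only shows how collections and inheritance can describe proteins, experiments and references.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reference:
    """A literature citation supporting an interaction."""
    citation: str


@dataclass
class Protein:
    """A protein, identified here only by a name (hypothetical field)."""
    name: str


@dataclass
class Experiment:
    """Base class for any experiment that reports an interaction."""
    reference: Reference

    def method(self) -> str:
        return "unspecified"


@dataclass
class TwoHybridExperiment(Experiment):
    """Inheritance: a two-hybrid screen is a kind of experiment."""

    def method(self) -> str:
        return "yeast two-hybrid"


@dataclass
class Interaction:
    """Collections: an interaction holds its participants and its evidence."""
    participants: List[Protein]
    evidence: List[Experiment] = field(default_factory=list)

    def methods(self) -> List[str]:
        return sorted({e.method() for e in self.evidence})


def interactions_of(protein_name: str, store: List[Interaction]) -> List[Interaction]:
    """Query by following object references instead of joining tables."""
    return [i for i in store
            if any(p.name == protein_name for p in i.participants)]


# Made-up example data
ref = Reference("Example et al. (hypothetical citation)")
a, b, c = Protein("A"), Protein("B"), Protein("C")
store = [
    Interaction([a, b], [TwoHybridExperiment(ref)]),
    Interaction([b, c], [Experiment(ref)]),
]
```

Calling interactions_of("B", store) returns both interactions, and each one carries its own evidence objects, so the experimental method and the citation are reached by attribute access rather than by a separate lookup.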

[Link]http://www.bioinf.man.ac.uk/resources/interactpr.shtml

-------------------------------------------------------------------------------------------------

【ProChart Protein-Protein Interaction Database】

[Introduction]
Researchers can do it all with AxCell Bioscience's ProChart protein-protein interaction database, coupled with the GenoMax Protein-Protein Interaction Analysis module.

[Link]http://www.informaxinc.com/solutions/genomax/prochart.html

-------------------------------------------------------------------------------------------------

【BRITE】

[Introduction]
Biomolecular Relations in Information Transmission and Expression
BRITE is a database of binary relations for computation and comparison of graphs involving genes and proteins. It contains diverse sets of binary relations, including the generalized protein interactions that underlie the KEGG pathway diagrams, systematic experimental data on protein-protein interactions by yeast two-hybrid systems, sequence similarity relations by SSEARCH, expression similarity relations by microarray gene expression profiles, and the cross-reference links between database entries. This is a preliminary version of BRITE for simple retrieval of partners.
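
For anyone who has not worked with binary relations, "simple retrieval of partners" can be pictured roughly as below. This is only a sketch under my own assumptions, not how BRITE is implemented: the identifiers and relation labels are invented, and every relation is treated as symmetric.

```python
from collections import defaultdict

# Each binary relation is (entry_a, entry_b, relation_type).
# Identifiers and relation labels are made up for illustration.
relations = [
    ("geneA", "geneB", "two-hybrid"),
    ("geneA", "geneC", "sequence-similarity"),
    ("geneB", "geneC", "pathway"),
    ("geneC", "geneD", "expression-similarity"),
]


def build_index(relations):
    """Index every relation under both of its entries (assumed symmetric)."""
    index = defaultdict(set)
    for a, b, kind in relations:
        index[a].add((b, kind))
        index[b].add((a, kind))
    return index


def partners(index, entry, kind=None):
    """Return the sorted partners of an entry, optionally for one relation type."""
    return sorted(p for p, k in index.get(entry, ()) if kind is None or k == kind)


index = build_index(relations)
print(partners(index, "geneC"))
print(partners(index, "geneA", kind="two-hybrid"))
```

Keeping the relation type next to each pair is what lets different kinds of graph (interaction, similarity, co-expression) be compared over the same set of entries.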

[Link]http://www.genome.ad.jp/brite/brite.html

-------------------------------------------------------------------------------------------------

【Allfuse Database】

[Introduction]
Functional Associations of Proteins in Complete Genomes
References:
Enright A.J., Ouzounis C.A.; Genome Biology 2001 2(9):341-347
'Functional Associations of proteins in entire genomes via exhaustive detection of gene fusion'

Enright A.J., Iliopoulos I., Kyrpides N., Ouzounis C.A.; Nature 402, 86-90 (1999)
'Protein interaction maps for complete genomes based on gene fusion events'

[Link]http://maine.ebi.ac.uk:8000/services/allfuse/

-------------------------------------------------------------------------------------------------

【PIMdb v0.1】

[Introduction]
As part of our project to construct a two-hybrid-generated protein interaction map for most of the ~13,600 proteins encoded by the Drosophila genome, we are building a protein interactions database. In collaboration with Drs. Farshad Fotouhi and Bill Grosky in the Department of Computer Sciences at Wayne State University, we are developing an Oracle 8i database that will facilitate extraction of functional information from the protein interaction data. A link to this database will appear here as soon as it is available. In the meantime, excerpts of our current interaction database (PIMdb v0.1) can be viewed here in tabular form.

[Link]http://cmmg.biosci.wayne.edu/finlab/PIMdbv01.htm
-----------------------------------------------------------------------------------------------

【PREDICTOME】

[Introduction]
... is a tool for visualizing the predicted functional associations among genes and proteins in many different organisms. Associations, or gene links, are created using a variety of techniques, both experimental (yeast two-hybrid, immuno-coprecipitation, correlated expression) and computational (gene fusion, chromosomal proximity, gene co-evolution).
... is based on the premise that genes, or their protein products, can be linked using both experimental and computational techniques. Functional information about individual proteins is then assessed in a network context, where characteristics about a protein can be inferred using the functional traits of neighbors, the neighbors of neighbors, etc.
... is intended as a central repository of the predicted links between proteins. The interface also includes modules that facilitate browsing and interpreting these links.
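
The "neighbors, the neighbors of neighbors" idea can be sketched as follows. This is not Predictome's actual method: the link data and annotations are invented, and weighting each vote by 1/distance is my own assumption, chosen only to show that closer neighbors can count for more.

```python
from collections import Counter, deque

# Made-up link network and made-up functional annotations.
links = {
    "P1": {"P2", "P3"},
    "P2": {"P1", "P4"},
    "P3": {"P1"},
    "P4": {"P2"},
}
known_function = {"P2": "transport", "P3": "transport", "P4": "signaling"}


def neighbors_within(links, start, max_depth):
    """Breadth-first search; returns {protein: distance} for distances 1..max_depth."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in links.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[start]
    return dist


def infer_function(links, known_function, protein, max_depth=2):
    """Pick the function with the highest distance-weighted vote among neighbors."""
    votes = Counter()
    for other, d in neighbors_within(links, protein, max_depth).items():
        if other in known_function:
            votes[known_function[other]] += 1.0 / d  # assumed weighting
    return votes.most_common(1)[0][0] if votes else None


print(infer_function(links, known_function, "P1"))
```

Here P1 has no annotation of its own. Its direct neighbors P2 and P3 each vote "transport" with weight 1, while P4, two steps away, votes "signaling" with weight 0.5.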

[Link]http://predictome.bu.edu/l

-----------------------------------------------------------------------------------------------

【YEAST PROTEIN COMPLEX DATABASE】

[Introduction]
YEAST PROTEIN COMPLEX DATABASE

[Link]http://yeast.cellzome.com/

-----------------------------------------------------------------------------------------------
【Protein Interaction Facility 】

[Introduction]
The Protein Interaction Facility provides expertise in characterizing macromolecular interactions. Analytical ultracentrifugation, titration calorimetry and optical biosensors are used to define the assembly state, thermodynamics and kinetics of a reaction. The facility operates on a fee-for-service basis for University of Utah scientists as well as external academic and industrial collaborators. Send us an email message if you are interested in using the facility.

[Link]http://www.cores.utah.edu/interaction/

-----------------------------------------------------------------------------------------------

【ProMesh 】

[Introduction]
N/A

[Link]http://www.bit.uq.edu.au/ProMesh/

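cake · posted 2004-12-16 11:19:18

I have the bioinformatics lectures that Professor Zhong Yang of Fudan University gave at the Chinese Academy of Sciences. They are in PPT format and come to 60 MB compressed. How can I upload them?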

Posted 2005-1-24 02:28:45

Bump.

Posted 2005-1-24 02:34:13

Bump.

elvias · posted 2005-3-23 12:42:12

Bump.
