
The legumin gene family: structure and evolutionary implications of Vicia faba B-type genes and pseudogenes


Plant Molecular Biology 13: 653-663, 1989. © 1989 Kluwer Academic Publishers. Printed in Belgium.


Ute Heim, Roland Schubert, Helmut Bäumlein and Ulrich Wobus
Akademie der Wissenschaften der DDR, Zentralinstitut für Genetik und Kulturpflanzenforschung, DDR-4325 Gatersleben, German Democratic Republic

Received 21 March 1989; accepted in revised form 6 July 1989

Key words: deletion formation, legumin gene evolution, pseudogenes, repetitive elements, Vicia faba

Abstract

We have characterized several Vicia faba genes encoding methionine residue-free group B subunits of the 11S or legumin storage proteins. The respective gene subfamily consists of 10 to 15 members, six of which have been studied by DNA sequence analysis. Four functional genes (LeB2, LeB4, LeB6, LeB7) are highly homologous in their coding region and 0.3 kb of their 3' flanking sequences. On the other hand, two pseudogenes (ψLeB1, ψLeB5) have accumulated a large number of mutations including an identical 0.7 kb internal deletion; they are both flanked by a repetitive element. Analysis of sequence changes shows that transitions are nearly twice as frequent as transversions. CpG is the most infrequent dinucleotide whereas TpA is significantly underrepresented in exon sequences. End points of deletions are correlated with short direct repeats and preferentially found in the two introns. Our studies indicate that the Vicia faba legumin B gene subfamily contains a group of expressed, highly homologous genes as well as more diverged pseudogenes.

Introduction

Multigene families are a characteristic feature of eukaryotic genomes. Extensive studies of globin [17], immunoglobulin [22] and haptoglobin [29] gene families have revealed a wealth of data on the complex genetic events that affect multigene families during evolution. In plants, seed storage proteins and their genes have been studied intensively [7] because these proteins are of great economic importance and their genes are intensively transcribed only in developing seeds in a precisely controlled manner. Two major protein classes have been distinguished, the 7S or vicilin class and the 11S or legumin class (see [13]). Legumins account for about 70% of field bean seed protein [32] and consist of two major types of subunit containing (type A) or lacking (type B) methionine [23]. Type A and B subunits are encoded by two distinct gene subfamilies, designated A and B [44, 45], and homologous subfamilies have been described in other legumes such as Glycine max [35] and Pisum sativum [15]. Non-leguminous plants, such as Gossypium hirsutum [8], Oryza sativa [42] and Arabidopsis thaliana [36], also contain clearly distinguished legumin-like gene subfamilies but they cannot be directly related to the major legume subfamilies [36].

We have previously described the primary structure of a field bean legumin gene coding for a B-type subunit [2]. In this paper we present sequence data from five other members of the same type B subfamily including two pseudogenes. Four of the genes, probably all functional, are nearly identical in their coding region and 0.3 kb of their 3' flanking region (5' region sequences are incomplete) in a manner hitherto not described for any other plant gene family. The homology observed is best explained by a conversion-like sequence homogenization mechanism. On the other hand, the two pseudogenes have accumulated a considerable number of mutations indicating their exclusion from the homogenization process. Close inspection of deletion end points revealed their correlation with short direct repeats.

Material and methods

The construction of a field bean (Vicia faba var. minor cv. Fribo) genomic library has been described [2]. Approximately 750 000 phages were screened for homology to either an internal Hinf I-Pst I fragment of cDNA clone pVfc70 [45] or a 1.95 kb Sph I fragment containing the legumin B gene LeB4 [2]. Relevant DNA fragments from hybridization-positive plaques were subcloned into M13mp vectors and their sequences analyzed as detailed in the Amersham 'M13 Cloning and Sequencing Handbook 1985'. 35S-dATP and the universal sequencing primer as well as specific primers were used. Sequence data were processed using an adapted version of a published program [37]. Hybridization was done by standard procedures [30].

Results

The legumin B gene subfamily

Figure 1 shows that the field bean genome contains a number of different-sized DNA fragments hybridizing with a legumin B gene probe. These fragments represent legumin B sequences because experiments described earlier have shown that under our hybridization conditions cross-reaction with legumin A sequences does not occur [44]. From the number and intensity of hybridizing fragments we estimate that there are approximately 10 to 15 members in the legumin B subfamily. This number is in agreement with the results of our library screening because at least 10 of the 14 isolated phages contain different genome segments as demonstrated by DNA blot and sequence analysis. Hybridization experiments with cotyledon poly(A) RNA indicated that only one gene expressed at high levels in embryos was represented on each phage. Taking into account the mean insert length and the location of four of the six characterized genes at the 5' end of the respective insert (see below), we calculate that in each case at least 13 kb of downstream DNA is free of related sequences; that is, there is no tight clustering of legumin B type genes as reported for some other seed storage protein genes [5, 9, 20, 35, 38, 41]. Figure 2 provides a schematic view of the determined sequence structure of six legumin B-type genes. Four of the six genes lack their 5' sequences due to cloning at the gene-internal Eco RI site. We chose the complete sequence of the earlier described gene LeB4 [2] as reference for comparisons and indicated in Fig. 2 each structural difference between this and any other gene; that is, point mutations, insertions, duplications and deletions. It is important to note that gene LeB4 is expressed in vivo since transfer of a 4.7 kb Bgl II fragment carrying the gene to tobacco plants resulted in seed-specific expression and correct processing of the prepropolypeptide [3]. The genes depicted in Fig. 2 can be separated into two groups: genes LeB4, LeB2, LeB6 and LeB7 on the one hand and LeB1 and LeB5 on the other.

Fig. 1. Southern analysis of Vicia faba genomic DNA. 10 µg of V. faba genomic DNA (lanes 2, 3, 4) or 10 pg of LeB5 DNA mixed with 10 µg of Nicotiana tabacum genomic DNA (lane 1) digested with either Eco RI (lanes 1, 2, 3) or Bam HI (lane 4) were separated in agarose gels at different resolution, blotted on either Hybond N (lanes 1, 2) or GeneScreen (lanes 3, 4) and hybridized with the 32P-labelled 1.95 kb Sph I fragment of LeB4. Identical positions in the two blots are connected by lines, and fragment sizes in kb derived from different marker fragments (not shown) are given at the right margin. The apparently single-copy 1.6 kb Eco RI fragment, expected to arise from a genomic LeB5 copy, is marked by a black triangle.

Fig. 2. Structural organization of legumin B genes of Vicia faba. The functional gene LeB4 is taken as reference gene and all changes in the other genes relative to LeB4 are indicated: base substitutions (dots), small deletions (downward open arrows) and insertions (upward open arrows). Two relevant restriction sites (E = Eco RI, B = Bgl II) are marked. Exon, intron, untranslated and flanking regions are shown by different-sized open bars. The end of the 3' untranslated region is defined by a cDNA sequence (pVfc70), seemingly a transcript of LeB6. The sequence probably inserted immediately after the translational stop codon of LeB6 is stippled. Sequences not found elsewhere in the analysed genes are differentially hatched.

Genes LeB2/4/6/7 are highly homologous and probably all functional

As evident from Fig. 2, genes LeB2 and LeB7 are highly homologous to LeB4 even within 0.3 kb of 3' flanking sequence (5' flanking sequences are lacking). The few mutations are almost restricted to the two introns, the last exon (only in third codon positions) and the 3' untranslated and flanking region. LeB6 is more distinct owing to a 'double' 3' end: the 257 bp sequence following the TGA stop codon in LeB6 is clearly homologous to the respective region of LeB4 (apart from the last 30 bp) but 45 out of 207 positions are different plus two insertions of 9 and 11 bp (Fig. 3). Surprisingly, the following sequence stretch is nearly identical to the 3' untranslated and flanking region of LeB4 up to the Bgl II site (see Fig. 2). The intron duplication/deletions should not prevent correct splicing, and the 18 bp deletion in the middle exon leaves the reading frame intact. The determined sequences for four of the six genes end at the 3' Bgl II site (see Figs. 2 and 3). This site also marks nearly exactly the end of the sequence homology between at least some of the genes. We derive this conclusion from two sets of data. First, sequences of ψLeB5 and LeB2 diverge completely 6 bp downstream of that site. Second, a 1.4 kb Bgl II restriction fragment of the LeB2 3' flanking region (see Fig. 2) does not hybridize with any of the other phages including λLeB1 and λLeB5. The region upstream of the discussed Bgl II site (positions 2002 to 2131 in Fig. 3) of all genes can be folded into an extended palindromic structure (not shown) of unknown significance.

As already mentioned, expression has been proven for gene LeB4 by transformation experiments [3]. Expression of gene LeB6 is very likely, since the diverged 3' untranslated region is exactly represented in cDNA clone pVfc70 [45]. In addition, oligonucleotides specific for genes LeB6 or LeB2 + LeB7 hybridized with middle cotyledonary poly(A) RNA under discriminating conditions (not shown). Taken together, the available structural and hybridization data lead us to assume that not only gene LeB4 but also genes LeB2, LeB6 and LeB7 are transcriptionally active genes.

[Fig. 3: nucleotide sequence figure for the legumin B genes; figure content and caption not recoverable.]

Genes LeB1 and LeB5 represent pseudogenes

Restriction and sequence analyses of genes LeB1 and LeB5 revealed the complete lack of the middle exon and most of the two flanking introns (Fig. 2), thus preventing the synthesis of a functional polypeptide; genes LeB1 and LeB5 are therefore pseudogenes. The deletion leads to indicative Eco RI fragments seen in genomic blots (Fig. 1). In addition to the 0.7 kb deletion, there are only two CACTTCA repeats in the cap-site region of LeB1 instead of three repeats in the LeB4 promoter, and a 16 bp deletion destroys the legumin box (see Fig. 3), a highly conserved sequence element in front of legume legumin genes [2]. Both pseudogenes are closely related but through a seemingly complicated evolutionary history. In the remaining gene bodies and 3' flanking regions they share 6 small deletions and insertions and 21 base substitutions (always in comparison to LeB4) but have accumulated 45 additional unique point mutations (25 in ψLeB1 and 20 in ψLeB5).

Repetitive elements in the flanking regions of ψLeB1 and ψLeB5

Sequence comparison between the flanking regions of ψLeB5 and ψLeB1 revealed 85% homology between the Bgl II-Eco RI fragment 3' of ψLeB5 and the reverse complement of a sequence in the 5' flanking region of ψLeB1 (Fig. 4a). At the break points of homology of both sequences palindromic structures of high potential stability can be formed (Fig. 4c). The described structures are most likely distal parts of repetitive elements since, if the λLeB5 Bgl II-Eco RI fragment (Fig. 4a) is used to probe a genomic Southern blot, a hybridization pattern typical of a middle repetitive sequence is found (not shown). The same probe detects several copies already within the cloned DNA segments carrying genes ψLeB1 and ψLeB5 but none in λLeB4 (Fig. 4b). Therefore, both pseudogenes are probably flanked by the described element in inverse orientation, an assumption supported by preliminary electron microscopic observations (R. Panitz, unpublished).

Fig. 4. A repetitive element in the flanking regions of pseudogenes ψLeB1 and ψLeB5. a (top). Location of the repetitive element (RE) relative to the two pseudogenes. The different-sized black rectangles represent exon, intron and untranslated sequences, respectively (compare Fig. 2), the thin line sequenced flanking regions. The two elements (double-lined) are inversely oriented and end in a potential palindromic sequence. b (bottom left). Southern hybridization analysis. DNA of phages λLeB1, λLeB4 and λLeB5 was double-digested with Eco RI and Bgl II, the fragments separated in a 0.8% agarose gel, transferred to nitrocellulose and hybridized with the 441 bp fragment of ψLeB5 32P-labelled by random priming. Numbers refer to fragment sizes in bp. c (bottom right). Potential palindromic structures at the known termini of the two elements. Numbers evaluate the influence of base pairing on the stability of the structure according to Krawinkel et al. [27]. In addition, the relative stability of the whole structure and the relation of stem length to Watson/Crick base pairs is given.

Table 1. Evolutionary distance of different Vicia faba legumin gene sequences to gene LeB4 according to Kimura [26].

          K1               K2               K3               K3'UT
ψLeB1     0.012 ± 0.005    0.002 ± 0.002    0.022 ± 0.007    0.045 ± 0.01
ψLeB5     0.026 ± 0.007    0.001 ± 0.002    0.025 ± 0.007    0.043 ± 0.01
LelB3     0.098 ± 0.030    0.052 ± 0.022    0.370 ± 0.072
LEA77/30  0.508 ± 0.088    0.433 ± 0.076    1.038 ± 0.202

K1, K2, K3 and K3'UT represent K values ± standard error for codon positions 1, 2, 3 and the 3' untranslated region, respectively. For genes ψLeB1 and ψLeB5 codon positions of exons 1 and 3 were used for the calculations; note that for ψLeB5 the first exon is only partially known (see Figs. 3 and 5). For LelB3, coding for an 80 kDa legumin subunit, and the cDNA sequence LEA77/30 representing an A-type legumin gene [45] only the 119 (LelB3) and 132 (LEA77/30) N-terminal codons of the β-chain coding regions were compared.

Analysis of base substitutions and dinucleotide frequencies

Only the two pseudogenes are different enough from the reference gene LeB4 to allow a meaningful analysis. Based on the LeB4 reading frame the ratio of silent to replacement substitutions is 1.32 for the ψLeB1/LeB4 gene pair and 0.92 for the ψLeB5/LeB4 gene pair. Transitions are nearly twice as frequent as transversions. Whereas among the transitions G to A changes predominate, clear preferences are less evident within the transversions. Generally, in the pseudogenes and their flanking sequences substitutions leading to an increased A + T content predominate (altogether 41 out of 60 cases).
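The classification used in this paragraph can be illustrated with a minimal Python sketch. It assumes two already aligned sequences of equal length; the function names and the toy sequences are illustrative only and are neither the program cited as [37] nor data from the legumin genes.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in "ACGT" or alt not in "ACGT" or len(ref) != 1 or len(alt) != 1:
        raise ValueError("expected single bases A, C, G or T")
    if ref == alt:
        raise ValueError("identical bases are not a substitution")
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(reference: str, other: str) -> dict:
    """Count substitution classes between two aligned sequences.

    Columns containing gaps or ambiguous symbols are skipped.
    'at_increasing' counts G or C in the reference replaced by A or T.
    """
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to equal length")
    counts = {"transition": 0, "transversion": 0, "at_increasing": 0}
    for r, o in zip(reference.upper(), other.upper()):
        if r == o or r not in "ACGT" or o not in "ACGT":
            continue
        counts[classify_substitution(r, o)] += 1
        if r in "GC" and o in "AT":
            counts["at_increasing"] += 1
    return counts


# Toy example (invented sequences): G->A, C->T and C->G changes.
print(count_substitutions("GGCATCGA", "AGTATGGA"))
# {'transition': 2, 'transversion': 1, 'at_increasing': 2}
```

Dividing the transition count by the transversion count gives the kind of ratio quoted above; the silent versus replacement classification would additionally require the reading frame and a codon table, which this sketch leaves out.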

K values (Table 1) calculated according to Kimura [26] show the expected relatively high variability in the third position, whereas the second position is the most invariable one. Dinucleotide frequencies (Table 2) deviate from statistically expected values for reasons partially understood [see 4, 39]. However, we note that CpG is not enriched in the 3' flanking sequences as reported for a number of leguminous plant genes [31], and that clear differences between dinucleotide frequencies of the functional gene LeB4 and the nonfunctional pseudogenes are only seen in the CpG/3'UT values, whereas we would expect to see this effect over the whole pseudogene region since CpG-TpG changes should no longer be subject to functional constraints after silencing of a gene.

Table 2. Selected dinucleotide frequencies in different parts of the functional gene LeB4 and the two pseudogenes ψLeB1 and ψLeB5.

         CpG                    TpG                    TpA                    CpA
         LeB4   LeB1   LeB5     LeB4   LeB1   LeB5     LeB4   LeB1   LeB5     LeB4   LeB1   LeB5
5'UT     0.573  0.586  -        1.217  1.227  -        0.933  0.923  -        1.239  1.607
Exons    0.510  0.458  0.421    1.185  1.343  1.400    0.548  0.565  0.586    1.370  1.422  1.402
Introns  0.743                  1.491                  1.125                  1.480
3'UT     0.462  0.127  0.231    1.415  1.521  1.492    0.820  0.800  0.765    0.961  1.052  (1.075)

The values represent the ratio between observed and theoretical (expected) values calculated from base composition. All values >1.2 or <0.8 deviate significantly from the expected values.
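An observed-to-expected dinucleotide ratio of the kind reported in Table 2 can be sketched in Python as follows. The paper does not spell out the formula, so the sketch assumes the common definition in which the expected count is the product of the two base frequencies times the number of dinucleotide positions; the function name and the toy sequence are hypothetical.

```python
from collections import Counter


def observed_expected_ratio(seq: str, dinucleotide: str) -> float:
    """Observed count of a dinucleotide divided by the count expected
    from base composition alone (assumed formula: f(X) * f(Y) * (N - 1))."""
    seq = seq.upper()
    dinucleotide = dinucleotide.upper()
    if len(dinucleotide) != 2:
        raise ValueError("dinucleotide must have exactly two bases")
    n = len(seq)
    if n < 2:
        raise ValueError("sequence too short")
    base_counts = Counter(seq)
    # Count overlapping occurrences position by position.
    observed = sum(1 for i in range(n - 1) if seq[i:i + 2] == dinucleotide)
    first, second = dinucleotide
    expected = (base_counts[first] / n) * (base_counts[second] / n) * (n - 1)
    if expected == 0:
        raise ValueError("a base of the dinucleotide is absent from the sequence")
    return observed / expected


# Toy sequence (invented): CG occurs twice in 7 dinucleotide positions.
ratio = observed_expected_ratio("ACGTACGT", "CG")
print(round(ratio, 3))  # 4.571
```

With this definition a value near 1 means the dinucleotide occurs about as often as base composition predicts, which is how the thresholds of 0.8 and 1.2 in the table note are read here.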

Fig. 5. Short repeats correlated with deletions and a duplication in legumin B genes. Each sequence surrounding a deletion is compared to the respective LeB4 sequence. Exon sequences are indicated by underlining, identical nucleotides in genes carrying the same deletion by dots. Nucleotide numbers above the LeB4 sequence mark the deletion break points. Short direct repeats correlated with the deletions are boxed and nucleotide mismatches between the two repeats indicated by underlaid dots. Other direct repeats and dyad symmetries are indicated by arrows. The duplicated sequence in the LeB6 intron is written so as to stress the possible significance of the boxed direct repeat. Introns and exons were numbered according to Bäumlein et al. [2].

Short direct repeats are present at deletion termini

A close look at the nucleotide sequences near all deletion end points within the transcribed region shows the presence of short direct repeats. Direct repeats also flank the duplication in the second intron of gene LeB6 (Fig. 5). The repeat size varies from 5 to 10 bp, including 1-2 mismatches. Statistical analysis [21] proved the significance of the correlation between deletions and their terminal repeats. The distribution of deletions within the genes is highly uneven and defines two hot spots, located in the first and second intron, both characterized by a high number of direct and inverted repeats.
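How a direct repeat at deletion end points can be recognized is sketched below in Python. The sketch assumes that the undeleted reference sequence and the deletion coordinates are known, and it only finds exact repeats, whereas the repeats described above tolerate 1-2 mismatches. The function name and the toy sequence are invented for illustration and do not reproduce the analysis cited as [21].

```python
def flanking_direct_repeat(ref: str, start: int, end: int) -> str:
    """Return the exact direct repeat shared by both end points of the
    deletion ref[start:end] (0-based, end exclusive); '' if there is none.

    One copy of the repeat spans the left break point and the other copy
    spans the right break point, so the deletion removes one copy plus
    the sequence between the two copies.
    """
    if not 0 <= start < end <= len(ref):
        raise ValueError("invalid deletion coordinates")
    ref = ref.upper()
    # Extend to the right: first deleted bases vs. bases after the deletion.
    right = 0
    while (start + right < end and end + right < len(ref)
           and ref[start + right] == ref[end + right]):
        right += 1
    # Extend to the left: bases before the deletion vs. last deleted bases,
    # without letting the two repeat copies overlap.
    left = 0
    while (start - 1 - left >= 0 and end - 1 - left >= start + right
           and ref[start - 1 - left] == ref[end - 1 - left]):
        left += 1
    return ref[start - left:start + right]


# Toy reference (invented): GGC + CTGTT + AAAAAA + CTGTT + CCG
toy = "GGCCTGTTAAAAAACTGTTCCG"
print(flanking_direct_repeat(toy, 3, 14))  # CTGTT
print(flanking_direct_repeat(toy, 5, 16))  # CTGTT (same deletion, shifted coordinates)
```

Because a deletion between two repeat copies can be described by several equivalent coordinate pairs, the sketch extends the match in both directions from the break points and therefore reports the same repeat for each of them.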

Discussion

The legumin gene family

11S globulins or legumins are a polymorphic class of proteins encoded by a multigene family which consists of two major subfamilies, called A and B in V. faba beans [44]. According to Southern analysis, gene subfamily B of the field bean V. faba var. minor consists of about 12 members (Fig. 1), the subfamily A of fewer than 10 members (our unpublished data). B type genes of V. faba ([2] and this paper) and Pisum sativum [7] are different from those of Glycine max [35] in that they possess only two introns, a feature certainly achieved by intron loss in direct ancestors of Pisum and Vicia. A third type of Vicia legumin genes is represented by the single-copy gene LelB3, coding for an 80 kDa legumin subunit which exhibits in the sequenced beta-chain region a higher homology to B-type genes than to A-type genes (U. Heim and U. Wobus, in preparation). Protein analysis provides evidence for a few more genes coding for minor legumin subunits [23], suggesting a total of 20 to 25 legumin genes in the V. faba genome, a number well above those determined for pea [10, 14] and soybean [35] and in sharp contrast to Arabidopsis with only 3 to 4 genes representing 3 subfamilies [36].

Homogenizing processes shape the structure of legumin B genes

Whereas sequence divergence between members of both subfamilies amounts to about 50%, cDNA sequences suggest that (transcribed) members of each of the two subfamilies are nearly identical in sequence ([45], R. Jung, unpublished data). Sequences of several paralogous B-type genes presented in this paper confirm this suggestion, since the four functional genes LeB2, LeB4, LeB6 and LeB7 exhibit an extremely high degree of homology (Figs. 2 and 3) which extends 0.3 kb into the 3' flanking region (5' flanking sequences are only known for one of the four genes, LeB4). The long-ranging homology of genes LeB2, LeB4, LeB6 and LeB7 is only interrupted in gene LeB6 by a diverged 257 bp sequence stretch inserted behind the stop codon; we have no indication as to its origin. Of crucial importance for the explanation of the high homology of genes LeB2, LeB4, LeB6 and LeB7 and their 3' flanking regions is the question as to whether these genes represent different loci and not allelic or alloallelic variants. First, there is good evidence that Vicia faba is a diploid species (I. Schubert, personal communication) and not of allotetraploid origin like soybean (see [19]). Second, even the highly homologous genes LeB2, LeB4 and LeB7 differ by several mutations and their extended flanking regions are different as determined by restriction and hybridization analysis. These data argue against the possibility that the genes are allelic variants. The different restriction and hybridization patterns of the recombinants also make very recent amplification events unlikely, which could otherwise easily account for the observed homology. The most likely mechanism to explain the structure of genes LeB2, LeB4, LeB6 and LeB7 is repeated conversion-like homogenization (see [29] for a definition related to higher eukaryotes). There is some evidence, recently summarized by Walsh [43], that conversion is initiated at a specific site and may proceed for up to a few kilobases. The presented sequence and hybridization data suggest that the 3' end of the LeB2/LeB4/LeB6/LeB7 gene conversion tract lies immediately downstream of a Bgl II site (2157-2162 of Fig. 3). Due to the lack of 5' sequences of genes LeB2, LeB6 and LeB7 we only know that the 5' flanks of ψLeB1 and LeB4 suddenly diverge around the legumin box (Fig. 3). Thus, the total length of the homogenized region between genes LeB2, LeB4, LeB6 and LeB7 may be around 2 kb. The unevenly distributed mutations within the LeB2/LeB4/LeB6/LeB7 gene region (Fig. 2) would be best explained by multiple conversions, each involving only part of the whole region. Recognizable sequence elements such as dyad symmetries and palindromic sequences found in the legumin B genes have been discussed as being involved in the conversion process [43], but a clear correlation is lacking [12]. Extended sequence homologies between members of other seed storage protein gene families suggestive of gene conversion events have not been described, but the exceptional 99% identity between the second introns of soybean glycinin genes Gy1 and Gy2 was discussed as a likely vestige of a recent gene conversion between the two linked genes [35].

Pseudogene structure suggests a complex evolutionary history

The large number of mutations in genes ψLeB1 and ψLeB5 (Fig. 2) is evidence for the exclusion of the pseudogenes from the sequence homogenizing processes shaping the functional genes. We envisage as the most plausible reason the removal of sequences necessary for conversion initiation (see [43]) by the large 0.7 kb deletion. The present structure of the two pseudogenes seems to be best explained by assuming that a gene, inactivated by the 0.7 kb deletion and carrying a number of mutations, was duplicated by unequal but homologous crossing-over between two flanking copies of the described repetitive element in a way discussed for mammalian gene duplications [24, 28, 40]. After duplication both genes accumulated more base substitutions independently. However, the structural details indicate a more complex evolutionary history which we are unable to reconstruct from the available data.

Deletions are correlated with short repeats

Several pathways probably exist for the generation of deletions, one of which is based on short direct repeats. Extensive sequence analyses of the termini of naturally occurring and induced deletions in both bacterial [1] and animal [17, 25, 33, 34] genomes verified the close correlation between direct repeats and deletions and led the authors to favour 'slipped mispairing' as a mechanism to explain deletion formation. Our studies add an example from the plant kingdom, for which data are still scarce [6]. Besides direct repeats, palindromic sequences may be involved in deletion formation [11, 18], either by forming enzyme recognition sites or by stabilizing nucleotides looped out by slipped mispairing of direct repeats [33, 34]. The favoured regions for deletion formation in the first and second introns of Vicia faba legumin B genes contain a complex pattern of not only direct repeats but also dyad symmetries, and the same observation was made in two other plant gene families, the 7S vicilin genes [16] and the leghemoglobin genes of legumes [6].
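The correlation between deletion end points and short direct repeats can be checked with a minimal sketch like the one below. It assumes the undeleted reference sequence and the deletion coordinates (0-based, half-open) are known. The function name `flanking_direct_repeat` and the toy sequence are hypothetical illustrations, not the procedure or data used in this study.

```python
def flanking_direct_repeat(ref: str, start: int, end: int) -> str:
    """Return the direct repeat shared by both junctions of the deletion ref[start:end].

    A deletion generated between two copies of a direct repeat removes one copy
    plus the intervening sequence, so the bases at the start of the deleted
    segment match the bases that follow its end.
    """
    if not 0 <= start < end <= len(ref):
        raise ValueError("deletion coordinates out of range")
    span = end - start
    # Extend the match rightwards from both junctions.
    right = 0
    while right < span and end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1
    # Extend leftwards, because the same deletion product can be described
    # by several equivalent coordinate pairs within the repeat.
    left = 0
    while left < span and start - left - 1 >= 0 and ref[start - left - 1] == ref[end - left - 1]:
        left += 1
    return ref[start - left:start + right]


# Toy sequence: CC + GATTC + AAAGGG + GATTC + TT (hypothetical, not legumin data).
reference = "CCGATTCAAAGGGGATTCTT"
repeat = flanking_direct_repeat(reference, 2, 13)
print(repeat)
```

Because the left and right extensions are combined, shifted but equivalent coordinates (here 4 and 15) report the same repeat. An empty string means that no direct repeat flanks the deletion.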

Acknowledgements

We thank Elsa Fessel and Andreas Czihal for excellent technical assistance and Angela Stegmann for typing this manuscript. We are indebted to Dr E. Birch-Hirschfeld (ZIMET, Jena) and Dr G. Herrmann (ZIM, Berlin-Buch) for oligonucleotide synthesis and to Prof. J.G. Reich (Berlin-Buch) for helping us with the statistical analysis. We especially appreciate the critical and helpful comments of Dr R.B. Goldberg on the first version of the manuscript.

References

1. Albertini A, Hofer M, Calos M, Miller J: On the formation of spontaneous deletions: The importance of short sequence homologies in the generation of large deletions. Cell 24:319-328 (1982).

2. Bäumlein H, Wobus U, Pustell J, Kafatos FC: The legumin gene family: structure of a B type gene of Vicia faba and a possible legumin gene specific element. Nucleic Acids Res 6:2707-2720 (1986).

3. Bäumlein H, Müller A, Schiemann J, Helbing D, Manteuffel R, Wobus U: A legumin B gene of Vicia faba is expressed in developing seeds of transgenic tobacco. Biol Zentralblatt 106:569-575 (1987).

4. Boudraa M, Perrin P: CpG and TpA frequencies in the plant system. Nucleic Acids Res 14:5729-5737 (1987).

5. Bown D, Levasseur M, Croy RRD, Boulter D, Gatehouse JA: Sequence of a pseudogene in the legumin gene family of pea (Pisum sativum L.). Nucleic Acids Res 13:4527-4538 (1985).

6. Brown GG, Lee JS, Brisson N, Verma DPS: The evolution of a plant globin gene family. J Mol Evol 21:19-32 (1984).

7. Casey R, Domoney C, Ellis N: Legume storage proteins and their genes. Oxford Surv Plant Mol Cell Biol 3:1-95 (1986).

8. Chlan CA, Pyle JB, Legocki AB, Dure III L: Developmental biochemistry of cottonseed embryogenesis and germination. XVIII. cDNA and amino acid sequences of members of the storage protein families. Plant Mol Biol 7:475-489 (1986).

9. Cho T-J, Davies CS, Nielsen NC: Inheritance and organization of glycinin genes in soybean. Plant Cell 1:329-337 (1989).

10. Croy RRD, Evans IM, Yarwood JN, Harris N, Gatehouse JA, Shirsat AH, Kang A, Ellis JR, Thompson A, Boulter D: Expression of pea legumin sequences in pea, Nicotiana and yeast. Biochem Physiol Pfl 183:183-197 (1988).

11. Dasgupta U, Weston-Hafer K, Berg D: Local DNA sequence control of deletion formation in Escherichia coli plasmid pBR322. Genetics 115:41-49 (1986).

12. Den Dunnen J, Moormann R, Lubsen N, Schoenmakers J: Concerted and divergent evolution within the rat gamma-crystallin gene family. J Mol Biol 189:37-46 (1986).

13. Derbyshire E, Wright DJ, Boulter D: Legumin and vicilin, storage proteins of legume seeds. Phytochemistry 15:3-24 (1976).

14. Domoney C, Casey R: Measurement of gene number for seed storage proteins in Pisum. Nucleic Acids Res 13:687-699 (1985).

15. Domoney C, Ellis THN, Davies DR: Organization and mapping of legumin genes in Pisum. Mol Gen Genet 202:280-285 (1986).

16. Doyle JJ, Schuler MA, Godette WD, Zenger V, Beachy RN: The glycosylated seed storage proteins of Glycine max and Phaseolus vulgaris. J Biol Chem 261:9228-9238 (1986).

17. Efstratiadis A, Posakony JW, Maniatis T, Lawn RM, O'Conell C, Spitz RA, Riel JK, Forget BJ, Weissmann SM, Slightom JL, Blechl AE, Smithies O, Baralle F, Shoulders CC, Proudfoot NJ: The structure and evolution of the human β-globin gene family. Cell 21:653-668 (1980).

18. Glickman B, Ripley L: Structural intermediates of deletion mutagenesis: a role for palindromic DNA. Proc Natl Acad Sci USA 81:512-516 (1984).

19. Grandbastian M, Berry-Lowe S, Shirley B, Meagher R: Two soybean ribulose-1,5-biphosphate carboxylase small subunit genes share extensive homology even in distant flanking sequence. Plant Mol Biol 7:451-465 (1986).

20. Harada JH, Barker SJ, Goldberg RB: Soybean β-conglycinin genes are clustered in several DNA regions and are regulated by transcriptional and posttranscriptional processes. Plant Cell 1:415-425 (1989).

21. Heim U: Struktur und Evolution der Legumin B-Genfamilie von Vicia faba. Dissertation, AdW der DDR, Fob Biologie/Medizin (1988).

22. Hood L, Kronenberg M, Hunkapiller T: T cell antigen receptors and the immunoglobulin supergene family. Cell 40:225-229 (1985).

23. Horstmann C: Specific subunit pairs of legumin from Vicia faba. Phytochemistry 22:1861-1866 (1983).

24. Jeffreys AJ, Harris S: Processes of gene duplication. Nature 296:9-10 (1982).

25. Jones CW, Kafatos FC: Accepted mutations in a gene family: Evolutionary diversification of duplicated DNA. J Mol Evol 19:87-103 (1982).

26. Kimura M: Estimation of evolutionary distance between homologous nucleotide sequences. Proc Natl Acad Sci USA 78:454-458 (1981).

27. Krawinkel U, Zoebelein G, Bothwell A: Palindromic sequences are associated with sites of DNA breakage during gene conversion. Nucleic Acids Res 14:3871-3882 (1986).

28. Lehrmann MA, Goldstein JL, Russell DW, Brown MS: Duplication of seven exons in LDL receptor gene caused by Alu-Alu recombination in a subject with familial hypercholesterolemia. Cell 48:827-835 (1987).

29. Maeda N, Smithies O: The evolution of multigene families: human haptoglobin genes. Ann Rev Genet 20:81-108 (1987).

30. Maniatis T, Fritsch EF, Sambrook J: Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory, Cold Spring Harbor, NY (1982).

31. McClelland M: The frequency and distribution of methylable DNA sequences in leguminous plant protein coding genes. J Mol Evol 19:346-354 (1983).

32. Müntz K, Horstmann C, Schlesier B: Seed proteins and their genetics in Vicia faba L. Biol Zentralblatt 105:107-120 (1986).

33. Nalbantoglu J, Hartley D, Phear G, Tear G, Meuth M: Spontaneous deletion formation at the aprt locus of hamster cells: the presence of short sequence homologies and dyad symmetries at deletion termini. EMBO J 5:1199-1204 (1986).

34. Nalbantoglu J, Phear G, Meuth M: DNA sequence analysis of spontaneous mutations at the aprt locus of hamster cells. Mol Cell Biol 4:1445-1449 (1987).

35. Nielsen NC, Dickinson CD, Cho T-J, Thanh VH, Scallon BJ, Fischer RL, Sims TL, Drews GN, Goldberg RB: Characterization of the glycinin gene family in soybean. Plant Cell 1:313-328 (1989).

36. Pang PP, Pruitt RE, Meyerowitz EM: Molecular cloning, genome organization, expression and evolution of 12S seed storage protein genes of Arabidopsis thaliana. Plant Mol Biol 11:805-820 (1988).

37. Pustell J, Kafatos FC: A convenient and adaptable microcomputer environment for DNA and protein sequence manipulation and analysis. Nucleic Acids Res 14:479-488 (1986).

38. Rafalsky JA: Structure of wheat gamma-gliadin genes. Gene 43:221-229 (1986).

39. Salser W: Globin mRNA sequences: analysis of base pairing and evolutionary implications. Cold Spring Harbor Symp Quant Biol 42:985-1002 (1977).

40. Shen S, Slightom JL, Smithies O: A history of the human fetal globin gene duplication. Cell 26:191-203 (1981).

41. Spena A, Viotti A, Pirrotta V: Two adjacent genomic zein sequences: structure, organization and tissue-specific restriction patterns. J Mol Biol 169:799-811 (1983).

42. Takaiwa F, Kikuchi S, Oono K: A rice glutelin gene family - A major type of glutelin mRNAs can be divided into two classes. Mol Gen Genet 208:15-22 (1987).

43. Walsh JB: Sequence-dependent gene conversion: can duplicated genes diverge fast enough to escape conversion. Genetics 117:543-557 (1987).

44. Wobus U, Bäumlein H, Bassüner R, Heim U, Jung R, Müntz K, Panitz R, Saalbach G, Weschke W: Molecular characterization of Vicia faba storage protein specific DNA. Kulturpflanze (Bln.) 32:5117-5126 (1984).

45. Wobus U, Bäumlein H, Bassüner R, Heim U, Jung R, Müntz K, Saalbach G, Weschke W: Characteristics of two types of legumin genes in the field bean (Vicia faba L. var. minor) genome as revealed by cDNA analysis. FEBS Lett 201:74-80 (1986).