
Hierarchical Coherence Modeling for Document Quality Assessment

Dongliang Liao, Jin Xu, Gongfu Li, Yiru Wang
Data Quality Team, WeChat, Tencent Inc., China.

{brightliao, jinxxu, gongfuli, dorisyrwang}@tencent.com

Abstract

Text coherence plays a key role in document quality assessment. Most existing text coherence methods only focus on the similarity of adjacent sentences. However, local coherence exists between sentences and broader contexts with diverse rhetoric relations, rather than just adjacent-sentence similarity. Besides, high-level text coherence is also an important aspect of document quality. To this end, we propose a hierarchical coherence model for document quality assessment. In our model, we implement a local attention mechanism to capture location semantics, a bilinear tensor layer to measure coherence, and max-coherence pooling to acquire high-level coherence. We evaluate the proposed method on two realistic tasks: news quality judgement and automated essay scoring. Experimental results demonstrate the validity and superiority of our work.

Introduction

Document quality assessment is academically valuable and industrially applicable in many tasks, such as online news recommendation, automated essay scoring and readability assessment (Chen et al. 2010; Li and Hovy 2014; Dasgupta et al. 2018). Document quality is not only affected by the semantics, but also significantly affected by the text coherence (Petersen et al. 2015). Coherent documents are readable and attractive to readers, while documents with poor coherence are boring to read and may lead to misunderstanding. Several approaches have been designed for modeling coherence, such as entity-based methods, lexical methods and neural coherence models. Entity-based methods relate adjacent sentences by means of entities in sentences (Li and Hovy 2014; Nguyen and Joty 2017). Lexical methods connect sentences based on word co-occurrence (Louis and Nenkova 2012; Chen et al. 2010). Neural coherence models learn vector representations of words and sentences, and capture local coherence by measuring the similarity of word or sentence vectors (Mesgar and Strube 2018; Moon et al. 2019).

However, most existing local coherence models only focus on the similarity of adjacent sentences, which is inadequate for modeling the full coherence of documents. Two attributes of text coherence are ignored in existing methods.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: An example of high-level coherence. Sentences A1-A4 discuss Chinese cuisine and are mutually coherent; sentences B1-B3 form an inserted McDonald's advertisement. Only the transition from A4 to B1 is locally incoherent, yet the advertisement breaks the high-level coherence of the document.

Firstly, there are various rhetorical relations between a sentence and its broader context, beyond the similarity of adjacent sentences. For example, in parallel structures, several consecutive sentences form a coherent and consistent discourse unit, but are relatively independent of each other. An example of this circumstance is shown as follows:

A: Let's see the new features in iPhone 11.
B: It's powered by the latest proprietary chip, the A13 Bionic processor.
C: New tri-lens camera array combines a 12-megapixel main lens, an ultra-wide angle lens and a telephoto lens.
D: The iPhone 11 also features a new anodized aluminum finish, which Apple says is more durable.

where sentences A, B, C and D form a coherent part about the new features of the iPhone 11, but each sentence describes an independent feature. Therefore, the first challenge is how to capture the various rhetorical relations between sentences and their broader contexts.

Secondly, document quality is impacted not only by sentence-level coherence, but also by the hierarchical coherence of high-level discourse units. The high-level coherence reflects the organizational structure and logicality of a document. Figure 1 shows an example of this circumstance, where the first four sentences are about Chinese cuisine, while the next three sentences constitute an advertisement. The inserted advertisement hurts the reading experience and reduces the document quality. In terms of sentence-level coherence, only A4 and B1 are incoherent, yet the entire text is discontinuous. Thus the second challenge is how to model the hierarchical coherence of documents.


Discourse analysis is a related topic focused on studying text coherence and structure. Unfortunately, the reliance on labeled data and error propagation are obstacles to the use and performance of discourse parsers in document modeling (Liu and Lapata 2018).

Our motivation is to overcome these limitations, inducing document representations and coherence representations directly with end-to-end neural networks. First of all, we build the sentence representations and treat sentences as the basic discourse units. Then we decompose hierarchical coherence modeling into three steps: merging low-level discourse units into local context blocks, capturing the coherence of each local context block, and detecting coherent blocks as high-level discourse units. This process can be stacked hierarchically to generate multi-level discourse units and hierarchical coherence. Concretely, we adopt a Transformer encoder to build the sentence vectors. We combine the convolution operation and the attention mechanism to generate local context blocks of low-level discourse units. We employ bilinear layers to capture the multi-dimension rhetoric relations of each unit with its local context, and merge the coherence of all units as the context coherence. Then, we propose max-coherence pooling to detect coherent context blocks as high-level discourse units. At last, we aggregate the coherence vectors of all discourse units at different levels by attention pooling, representing the overall coherence of the document.

We summarize our contributions as follows.

• We capture various rhetorical relations in a broader context rather than the similarity of adjacent sentences for modeling text coherence, and introduce hierarchical coherence for document quality assessment.

• We propose local attention pooling to generate local context blocks, capture multi-dimension coherence with a bilinear layer, and propose max-coherence pooling to detect high-level discourse units.

• We evaluate the proposed method on two realistic tasks: online news quality judgement and automated essay scoring. The experimental results demonstrate the validity and superiority of our approach.

Related Work

Coherence Modeling

Text coherence modeling has attracted many researchers' attention. Early works leveraged lexical and syntactic features to capture coherence. Barzilay and Lapata (2008) proposed the entity-based model, extracting entities from sentences and measuring coherence by the occurrence frequency of entities. Nguyen and Joty (2017) adopted a CNN on entity grids to improve coherence modeling performance. However, entity-based models rely on lexical and syntactic analysis for entity extraction, which may suffer from error propagation. Recently, end-to-end neural coherence models have achieved state-of-the-art performance on related tasks. Li and Hovy (2014) proposed a neural method, modeling sentences with recurrent and recursive neural networks and estimating the coherence probability with locally connected layers over windows of three sentences. Nadeem and Ostendorf (2018) proposed a hierarchical RNN with bidirectional context to capture the coherence of sentences with their adjacent sentences. Moon et al. (2019) captured text coherence with a bilinear layer for discourse relations and lightweight convolution-pooling for attention and topic structures. All these methods only model the similarity of adjacent sentences. Our approach captures multi-dimension coherence in broader contexts and supplements it with high-level coherence for a better understanding of text coherence.

Our work is partly inspired by another related research topic, discourse analysis. Rhetorical Structure Theory (RST) is the most widely accepted discourse structure theory, providing a systematic way to analyze text (Marcu 2000; Taboada and Mann 2006). Scholars have developed automatic discourse parsers (Ji and Eisenstein 2014; Chuan-An et al. 2018) for generating RST-style trees from documents, benefiting applications such as text classification (Ji and Smith 2017) and sentiment analysis (Bhatia, Ji, and Eisenstein 2015). However, discourse parsing is a difficult task, and current parsers are still far from perfect (Chuan-An et al. 2018); errors in the discourse tree significantly affect document modeling performance. In this paper, we do not try to detect the discourse structure precisely, but only focus on capturing the hierarchical coherence of a document. We propose an approximate but valid way to aggregate coherent consecutive sentences into high-level discourse units, which is efficient and can be implemented in end-to-end neural networks.

Document Quality Assessment

Document quality assessment is a subtopic of document modeling. Two typical categories of document quality assessment tasks are automated essay scoring (AES) (Dasgupta et al. 2018; Alikaniotis, Yannakoudakis, and Rei 2016) and readability assessment (Li and Hovy 2014; Petersen et al. 2015; Todirascu et al. 2016). Early studies mainly focused on feature engineering such as bag-of-words, N-grams and other linguistic features (Chen et al. 2010; Feng et al. 2010; Todirascu et al. 2016). More recently, many scholars have followed popular neural approaches for document quality assessment, such as CNNs, RNNs and attention mechanisms (Dong, Zhang, and Yang 2017; Dasgupta et al. 2018; Moon et al. 2019). However, these methods fail to capture discourse coherence, which significantly affects document quality. Louis and Nenkova (2012) and Miltsakaki and Kukich (2004) showed that discourse coherence features significantly improve readability assessment and essay scoring performance. Mesgar and Strube (2018) proposed an end-to-end neural coherence model which captures coherence vectors of adjacent sentences and aggregates them into text coherence with convolutional operators. Nadeem and Ostendorf (2018) and Tay et al. (2018) also proposed neural coherence methods for readability assessment and essay scoring, with a CNN and a SkipFlow RNN for coherence modeling, respectively. Compared with these existing approaches, we propose a novel coherence model for capturing text coherence, achieving considerable performance improvements on document quality assessment tasks.


Figure 2: (a) Framework of the proposed HierCoh model, consisting of sentence layers, hierarchical coherence layers, document layers and output layers. (b) Illustration of the coherence layers, consisting of local coherence and max-coherence pooling. (c) Illustration of the max-coherence pooling process (k = 3, p = 2) on the example of Fig. 1: given the local coherence vector c_i of sentence s_i generated by Eq. (6), we compute ĉ_i with Eq. (8) and select the largest ĉ_i to form the approximate high-level discourse unit.

Model

We propose a novel coherence model named HierCoh, shown in Fig. 2(a). We implement a hierarchical architecture consisting of sentence layers, hierarchical coherence layers, document layers and output layers.

Let D represent a document consisting of a sentence sequence D = (s_1, s_2, ···, s_{|D|}), where each sentence is a sequence of one-hot words s = (w_1, w_2, ···, w_{|s|}). Sentence layers take each sentence s as input and generate the sentence vector s. Hierarchical coherence layers capture the hierarchical text coherence as a vector C. Document layers aggregate the sentence vectors into the document semantics vector D_s and integrate D_s with the coherence vector C to generate the document vector D. Then the document vector D is fed into task-specific output layers for document quality assessment.

Sentence Layers

As mentioned before, we treat sentences as the basic discourse units. In the sentence layers, we embed sentences into representation vectors. We map one-hot word vectors to dense embeddings, employ a Transformer encoder (Sutskever, Vinyals, and Le 2014) to update the word vectors with the sentence context, and then adopt attention pooling to aggregate word vectors into sentence vectors.
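For concreteness, the following is a minimal numpy sketch of the attention pooling step that turns the contextualized word vectors into one sentence vector. The parameters V and W are hypothetical stand-ins for the learned attention weights, and the Transformer encoding is assumed to have happened upstream.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(word_vecs, V, W):
    """Aggregate word vectors (n, d) into a sentence vector (d,).

    V (d_a,) and W (d_a, d) are learned parameters: each word is
    scored, the scores are normalized into attentive weights, and
    the sentence vector is the weighted sum of word vectors."""
    scores = V @ np.tanh(W @ word_vecs.T)  # one score per word, shape (n,)
    alpha = softmax(scores)                # attentive weights over words
    return alpha @ word_vecs               # weighted sum, shape (d,)

# toy usage: 6 words with 64-dim contextualized embeddings
rng = np.random.default_rng(0)
words = rng.normal(size=(6, 64))           # output of the Transformer encoder
V, W = rng.normal(size=32), rng.normal(size=(32, 64))
s = attention_pool(words, V, W)            # sentence vector, shape (64,)
```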

Hierarchical Coherence Layers

In the hierarchical coherence layers, we generate the local context blocks of low-level discourse units, capture multi-dimension coherence, and detect high-level discourse units for capturing the hierarchical coherence.

Local Context Block Convolutional neural networks (CNNs) are a powerful tool for modeling local structures in computer vision tasks (Zeiler and Fergus 2014). The fixed-weight filters in a CNN are designed to capture the contours or textures of image pixels, and stacked CNNs can extract hierarchical structures efficiently.

However, the relations between sentence semantics are more complex and flexible. In this work, we integrate the convolution operation with the attention mechanism, named Local Attention Pooling (LAP), to tackle this challenge. LAP replaces the linear transformation in CNN with multi-head attention pooling. Attention pooling generates attentive weights α_i according to the input sequence S_i, and merges the input sequence S_i by multiplying it with the attentive weights α_i. Multi-head attention is an extension of the attention mechanism which calculates multiple attentive weights to construct multiple attention "heads"; the output of multi-head attention is the concatenation of all attention "heads". The calculation of LAP can be formulated as follows:

S_i^k = [s_{i−k/2}, ···, s_i, ···, s_{i+k/2}]^T    (1)

a_i^m = V_lap^m tanh(W_lap^m S_i^k)    (2)

α_i^m = softmax(a_i^m)    (3)

head_i^m = α_i^m S_i^k    (4)

t_i = LAP_k(s_i) = [head_i^1, ···, head_i^M]    (5)

Compared with the fixed-weight linear transformation in CNN, the multi-head attention pooling in LAP aggregates the context semantics with flexible weights depending on the semantics of each sentence. We treat t_i as the local context block of sentences s_{i−k/2}, ···, s_i, ···, s_{i+k/2}.

Multi-Dimension Coherence Given the context vector t_i of sentence s_i, an intuitive way to measure local coherence is vector similarity, such as cosine similarity. As discussed above, however, the relation between a sentence and its context block is not just similarity but diverse rhetoric relations (Taboada and Mann 2006). Inspired by the neural tensor layer (Socher et al. 2013; Tay et al. 2018), we employ a bilinear tensor layer to model the relations of the two vectors across multiple dimensions.

c_i = σ(s_i^T W_b t_i + b)    (6)


Here, W_b ∈ R^{d_s×d_c×d_t} is called the relation tensor, and b ∈ R^{d_c} is the bias vector. d_s and d_t are the dimensions of the sentence vector s and the context vector t, σ denotes the sigmoid function, and d_c is the dimension of the output coherence vector c_i.

We treat each slice W_b^(r) ∈ R^{d_s×d_t} of W_b as the representation of a rhetoric relation r. Thus, each element c_i^(r) = σ(s_i^T W_b^(r) t_i) of c_i represents the probability of relation r holding between sentence s_i and local context block t_i. With the bilinear tensor layer, we capture the local coherence of all sentences as C^(0) = [c_1, c_2, ···, c_{|D|}].

Hierarchical Coherence In this work, we propose an approximate approach called Max-Coherence Pooling (MCP), leveraging low-level coherence (i.e., level l−1) to detect high-level discourse units for measuring high-level coherence (i.e., level l). The intuition behind it is that if several consecutive low-level discourse units are coherent with each other, they can be treated as a coherent high-level discourse unit.

Concretely, after extracting the context blocks and local coherence of the low-level discourse units with LAP and bilinear tensor layers, MCP proceeds in three steps. First, we apply mean pooling to the coherence vectors c_{i−k/2}^(l−1), ···, c_i^(l−1), ···, c_{i+k/2}^(l−1) in the local context block, yielding the overall multi-dimension coherence c̄_i^(l−1). Second, we treat the maximum dimension ĉ_i^(l−1) of c̄_i^(l−1) as the coherence probability of the local context block t_i^(l−1). Third, we adopt "argmax pooling" to select the low-level context m_i with the maximum coherence score ĉ_i in a window of size k with stride p.

c̄_i = (1/k)(c_{i−k/2} + ··· + c_i + ··· + c_{i+k/2})    (7)

ĉ_i = max{c̄_ij | c̄_ij ∈ c̄_i}    (8)

m_i = argmax_j {ĉ_j | j ∈ [i−k/2, i+k/2)}    (9)

By these three steps, we have selected the consecutive coherent low-level discourse units in a window of size k, i.e., the local context block t_{m_i}^(l−1). Thus we treat them as the approximation of a high-level discourse unit.

h_i^(l) = MCP(h_i^(l−1)) = t_{m_i}^(l−1)    (10)

The context window size k and stride p are hyper-parameters controlling the coherence scope and the selection rate of max-coherence pooling. Figure 2(b) illustrates the coherence layer, and Fig. 2(c) presents the max-coherence pooling process on the example of Fig. 1.
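The following is a minimal numpy sketch of MCP (Eqs. 7-10); border handling is simplified to clipping, and the stride-p windows start at multiples of p, which are assumptions the paper does not pin down.

```python
import numpy as np

def max_coherence_pool(T, C, k, p):
    """Select the most coherent context blocks as high-level units.

    T: local context blocks, shape (n, d_t)
    C: local coherence vectors, shape (n, d_c)
    Returns roughly n/p high-level discourse units."""
    n = C.shape[0]
    half = k // 2
    # Eq. (7): mean coherence over each window; Eq. (8): max dimension
    c_hat = np.zeros(n)
    for i in range(n):
        window = C[max(0, i - half): i + half + 1]
        c_hat[i] = window.mean(axis=0).max()
    # Eq. (9): argmax pooling over windows of size k with stride p
    units = []
    for start in range(0, n, p):
        scores = c_hat[start: min(start + k, n)]
        m = start + int(np.argmax(scores))
        units.append(T[m])  # h_i = t_{m_i}, Eq. (10)
    return np.stack(units)
```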

Then, given the high-level discourse units, we can again adopt local attention pooling and bilinear tensor layers to measure the high-level coherence. Our model captures the hierarchical coherence by stacking multiple layers of max-coherence pooling, local attention pooling and bilinear layers. At last, we adopt attention pooling to aggregate the coherence vectors of all levels into the document coherence vector C. The entire process is shown in Alg. 1, where L is the number of coherence layers.

Algorithm 1: Text Coherence Modeling
Input: L, {s_1, s_2, ···, s_{|D|}}
Result: Text coherence vector C
H^(0) ← {s_1, s_2, ···, s_{|D|}};
Generate T^(0) = {t_1, t_2, ···, t_{|D|}} by (5);
Generate C^(0) = {c_1, c_2, ···, c_{|D|}} by (6);
C ← C^(0);
for l ∈ [1, L] do
    Generate H^(l) by (7)(8)(9)(10);
    Generate T^(l) by (5);
    Generate C^(l) by (6);
    C ← C ∪ C^(l);
end
C ← AttentionPool(C)

Document Layers

Document quality is determined by both text coherence and semantics. We integrate all the sentence vectors with another Transformer and attention pooling layer as the document semantics vector D_s for document semantics modeling. We concatenate the document semantics vector D_s and the coherence vector C into the final document vector D = [D_s, C], so that D captures both the semantics and the hierarchical coherence of the document. Then D can be fed into output layers for document quality assessment.

Evaluation Tasks

We evaluate our model on two document quality assessment tasks: automated essay scoring and online news quality judgement.

Automated Essay Scoring

Automated essay scoring (AES) is a classic document quality assessment problem, since the score of a student essay is determined by the essay's quality. The most popular AES dataset is the Automated Student Assessment Prize (ASAP) corpus. In total, ASAP consists of eight sets of essays, each associated with one prompt. The essays were originally written by students between Grade 7 and Grade 10; for the statistics of ASAP, refer to Taghipour and Ng (2016). AES is usually treated as a regression task to predict the essay score. Each essay is scored by domain experts, and the score ranges of the different essay sets are scaled to [0, 1]. We implement a sigmoid output layer, taking the document vector D as input to produce the predicted essay score y, and the mean squared error is employed as the loss function.
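A minimal sketch of this output layer, with w and b as hypothetical output-layer parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aes_head(D, w, b, y_true=None):
    """Sigmoid regression head for AES: map the document vector D
    to a score in [0, 1]; squared error is the training loss."""
    y_pred = sigmoid(D @ w + b)
    loss = None if y_true is None else (y_pred - y_true) ** 2
    return y_pred, loss
```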

Online News Quality Judgement

We collect a Chinese online news dataset for document quality assessment. Similar to some readability assessment datasets such as De Clercq et al. (2014), we define the quality assessment task as a pair-wise ranking task, judging which item of a news pair has better quality. We collect two news articles describing the same topic or event as a news pair.


Category | #Pairs | 1st News Len. | 2nd News Len.
0 | 4214 | 515 | 522
1 | 3908 | 642 | 427
2 | 3840 | 476 | 589

Table 1: Statistics of Online News Dataset.

We employ human annotators to annotate pair-wise quality rank labels of three categories: {0, 1, 2}. Category 0 indicates that the news pair has equal quality, category 1 that the first news article has better quality, and category 2 that the second has better quality. The quality labels follow the criteria of a clear theme, coherent sentences and a well-organized structure. We employed 8 annotators and 1 checker for the Chinese news dataset. We shuffled all the news pairs into batches and handed them out to the 8 annotators. The checker is an experienced annotator who also participated in developing the annotation criteria. The checker randomly checked 20% of each batch by annotating the news pairs himself and comparing with the annotators' results; a batch with more than 15% divergence on the checked examples was rejected and re-annotated. In the end, we obtained 11,962 valid news pairs. The dataset statistics are shown in Table 1.

Online news quality judgement is a three-class classification problem. We adopt a Siamese architecture to model the news pair and obtain two document vectors D1 and D2. We concatenate the document vectors and adopt a softmax output layer to predict the probability of the news pair's label. The cross-entropy loss function is employed as the minimization objective.
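A minimal sketch of this Siamese output layer, with W and b as hypothetical parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pairwise_head(D1, D2, W, b, label=None):
    """Three-way classification over a news pair: 0 = equal quality,
    1 = first is better, 2 = second is better. Cross entropy is the
    training loss."""
    z = W @ np.concatenate([D1, D2]) + b  # logits, shape (3,)
    p = softmax(z)
    loss = None if label is None else -np.log(p[label])
    return p, loss
```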

Experiments

Baselines

For the AES task, we report the following approaches as baselines.

• CNN+LSTM. Taghipour and Ng (2016) proposed a neural method for the AES task, ensembling a CNN and an LSTM for essay scoring.

• LSTM-CNN-att. Dong, Zhang, and Yang (2017) proposed a hierarchical structure with an attention CNN for sentence modeling and an LSTM for document modeling.

• SkipFlow. Tay et al. (2018) adopted tensor compositionsto model the text coherence for essay scoring.

For the online news quality judgement task, we employ state-of-the-art coherence models as baselines.

• CNNCoh. Farag, Yannakoudakis, and Briscoe (2018) adopted a CNN to capture the local coherence of sentences as a coherence score.

• CohLSTM. Mesgar and Strube (2018) proposed a model, named CohLSTM, that encodes the perceived coherence of a text as a vector. We integrate CohLSTM with HAN to compare it with HierCoh.

• UNCM. Moon et al. (2019) proposed a unified neural coherence model for local and global coherence discrimination, implementing BiLSTM, bilinear and CNN layers.

Besides the task-specific methods, we take several commonly used document modeling methods for comparison.

• HAN. The hierarchical attention network (Yang et al. 2016) adopts a 2-layer BiGRU and attention mechanism for document modeling.

• Bert. We finetune the pretrained Bert-base, taking the [CLS] vector of the last layer as the document representation for document quality assessment (Devlin et al. 2019).

• ToBert. Transformer over Bert is a hierarchical architecture, encoding sentences with Bert and merging the sentence vectors with a Transformer (Pappagari et al. 2019; Wang et al. 2020).

We also report simplified versions of the proposed method for ablation analysis.

• H-Trans. A hierarchical network that adopts stacked Transformer and attention pooling layers to generate document vectors, without coherence modeling.

• H-Trans+Cos. We replace bilinear tensor layers with co-sine similarity for coherence modeling.

• H-Trans+LC. We integrate local coherence with H-Trans, without MCP layers.

• OnlyMCP. We model the document using only the coherence vector C, without document layers.

Implementation Settings

For the AES task, the hidden sizes of H-Trans and the proposed methods are empirically set to 64 for the word embeddings, Transformers and attention layers. For the news quality judgement task, the hidden size is set to 128. The coherence layers have several hyper-parameters. We set the coherence vector size (i.e., the hidden size of the bilinear layer) to 5, following Tay et al. (2018). The window size k and the number of max-coherence pooling layers L are tuned over {3, 5, 7, 11} and {1, 2, 4, 8} respectively. The stride p is set to half the window size, p = k/2, empirically. We adopt Adam with a 0.0005 learning rate for training and apply dropout to the input word embeddings with rate 0.5. All experiments are conducted with TensorFlow on a Tesla P40 GPU.[1]

Automated Essay Scoring Results

In the following experiments, we follow the 5-fold evaluation method of Taghipour and Ng (2016)[2] and reuse the data preprocessing code of Dong, Zhang, and Yang (2017)[3]. The metric is the quadratic weighted Cohen's κ (QWK).
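For reference, QWK can be computed with scikit-learn. This sketch assumes predictions in [0, 1] are rescaled to the prompt's integer score range [lo, hi] and rounded before scoring, which is a common convention rather than the paper's documented protocol:

```python
from sklearn.metrics import cohen_kappa_score

def qwk(y_true, y_pred, lo, hi):
    """Quadratic weighted Cohen's kappa between gold and predicted
    essay scores, both given as normalized values in [0, 1]."""
    rescale = lambda ys: [int(round(lo + y * (hi - lo))) for y in ys]
    return cohen_kappa_score(rescale(y_true), rescale(y_pred),
                             weights="quadratic")
```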

Table 2 shows the overall performance comparison on AES. The proposed HierCoh achieves the best results on essay sets 1, 2 and 7, and raises the QWK on sets 3 and 6 to approach the performance of Bert-based models. However, our method performs poorly on essay set 8, so its average QWK over all essay sets is similar to LSTM-CNN-att and SkipFlow, and clearly worse than Bert-based methods.

[1] Code and datasets will be made publicly available if accepted.
[2] https://github.com/nusnlp/nea/tree/master/data
[3] https://github.com/feidong1991/aes/


Models | #Params | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Set 6 | Set 7 | Set 8 | Avg. 1-8 | Avg. 1-7
CNN+LSTM § | – | 0.821 | 0.688 | 0.694 | 0.805 | 0.807 | 0.819 | 0.808 | 0.644 | 0.761 | 0.777
LSTM-CNN-att § | 326K | 0.822 | 0.682 | 0.672 | 0.814 | 0.803 | 0.811 | 0.801 | 0.705 | 0.764 | 0.772
SkipFlow § | – | 0.832 | 0.684 | 0.695 | 0.788 | 0.815 | 0.810 | 0.800 | 0.697 | 0.764 | 0.775
HAN † | 449K | 0.825 | 0.693 | 0.691 | 0.798 | 0.798 | 0.808 | 0.809 | 0.637 | 0.757 | 0.775
Bert † | 110M | 0.821 | 0.678 | 0.717 | 0.815 | 0.803 | 0.813 | 0.802 | 0.718 | 0.772 | 0.780
ToBert † | 119M | 0.823 | 0.702 | 0.711 | 0.806 | 0.808 | 0.832 | 0.806 | 0.657 | 0.769 | 0.784
H-Trans | 363K | 0.818 | 0.699 | 0.703 | 0.788 | 0.798 | 0.806 | 0.811 | 0.639 | 0.758 | 0.775
H-Trans+Cos (k=5, L=1) | 367K | 0.827 | 0.699 | 0.707 | 0.792 | 0.798 | 0.821 | 0.812 | 0.637 | 0.762 | 0.778
H-Trans+Cos (k=5, L=4) | 370K | 0.829 | 0.700 | 0.713 | 0.807 | 0.804 | 0.816 | 0.792 | 0.647 | 0.764 | 0.781
H-Trans+LC (k=5, L=1) | 385K | 0.837 | 0.696 | 0.710 | 0.798 | 0.812 | 0.826 | 0.799 | 0.633 | 0.763 | 0.782
OnlyMCP (k=5, L=4) | 455K | 0.828 | 0.684 | 0.699 | 0.775 | 0.803 | 0.819 | 0.771 | 0.594 | 0.747 | 0.768
HierCoh (k=5, L=4) | 464K | 0.839 | 0.702 | 0.711 | 0.809 | 0.801 | 0.827 | 0.820 | 0.631 | 0.763 | 0.786

Table 2: Automated essay scoring performance. We mark the best result in bold and the second best with underlines. Methods with superscript § are results reported in related works; methods with superscript † are our reproduced versions.

This is because set 8 has the longest essays (average length 650) but the fewest examples (only 723 in total), while the other sets contain more than 1,500 essays each with average lengths of 150-350.

Modeling the document with the coherence vector alone (i.e., OnlyMCP) gives the worst performance on most essay sets. This is not surprising, since the scores of student essays depend not only on text coherence, but also on semantic features such as topics and arguments. HAN, LSTM-CNN-att and H-Trans have similar hierarchical structures but different neural units in the sentence and document layers. HAN and H-Trans perform slightly better on sets 1-7, while LSTM-CNN-att is much better on set 8 with simpler layers and fewer parameters to fine-tune. H-Trans has fewer parameters than HAN and trains much faster thanks to the parallelizable Transformer architecture, so we adopt H-Trans to model the document semantics.

H-Trans+Cos (k=5, L=1) and H-Trans+Cos (k=5, L=4) clearly improve the QWK on sets 1-7, with only a slight parameter increase from LAP. LAP shares the parameter-sharing characteristics of CNN, extracting useful features with a small number of parameters. To some extent, we can view the deep neural layers as feature extractors; the coherence layers are then heuristic high-order interactions of sentence semantic features. Cosine similarity can be seen as a non-parametric feature interaction, adding no model complexity to the feature extraction layers while supplementing effective features. In this perspective, the bilinear tensor layer is a parametric feature interaction, generating higher-dimensional, more effective and more flexible feature interactions. Thus H-Trans+LC (k=5, L=1) and HierCoh (k=5, L=4) further improve the performance on sets 1-7. Nevertheless, bilinear tensor layers also increase the model complexity, making the model harder to train well on set 8. Compared with Bert-based methods, the proposed methods have far fewer parameters yet achieve similar performance, demonstrating that the improvement comes not only from model complexity but also from reasonable architectural innovations.

Models | Acc | 0-F1 | 1-F1 | 2-F1
CNNCoh † | 0.576 | 0.516 | 0.586 | 0.602
CohLSTM † | 0.572 | 0.404 | 0.625 | 0.614
UNCM † | 0.547 | 0.465 | 0.506 | 0.551
HAN † | 0.551 | 0.505 | 0.549 | 0.591
Bert † | 0.596 | 0.545 | 0.629 | 0.643
ToBert † | 0.607 | 0.565 | 0.627 | 0.651
H-Trans | 0.553 | 0.517 | 0.559 | 0.586
H-Trans+Cos (k=7, L=1) | 0.595 | 0.481 | 0.595 | 0.619
H-Trans+Cos (k=7, L=4) | 0.625 | 0.507 | 0.621 | 0.661
H-Trans+LC (k=7, L=1) | 0.621 | 0.489 | 0.644 | 0.649
OnlyMCP (k=7, L=4) | 0.611 | 0.487 | 0.617 | 0.634
HierCoh (k=7, L=4) | 0.644 | 0.532 | 0.687 | 0.688

Table 3: News quality judgement comparison. We mark the best result in bold. Methods with superscript † are our reproduced versions.

Online News Quality Judgement Results

We sample 80% of the news pairs as the training set, 10% as the validation set and 10% as the test set. We measure the online news quality judgement task by classification accuracy and the F1 score of each category. For a category c, we treat the examples of the other two categories as negative examples to calculate the F1 score. The overall prediction performance is shown in Table 3. The proposed approach HierCoh (k=7, L=4) achieves the best performance.
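These metrics correspond directly to scikit-learn's accuracy_score and per-class f1_score; a minimal sketch:

```python
from sklearn.metrics import accuracy_score, f1_score

def news_metrics(y_true, y_pred):
    """Accuracy plus one-vs-rest F1 for each category, as in Table 3:
    for category c, examples of the other two categories count as
    negatives."""
    acc = accuracy_score(y_true, y_pred)
    f1_per_class = f1_score(y_true, y_pred, average=None, labels=[0, 1, 2])
    return acc, f1_per_class
```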

Among all baselines, UNCM gets the worst result. This may be because UNCM is designed for discriminating sentence order, rather than the quality of an entire document. Unlike the AES task, existing document modeling methods (i.e., HAN, Bert and ToBert) perform poorly on news quality judgement. These methods are effective at capturing the semantics of documents, but the two news articles of a pair cover the same topic or event and thus have similar semantics; coherence features play a more important role in quality judgement in this situation. CNNCoh and CohLSTM capture the coherence between sentences and their adjacent sentences, so integrating them with H-Trans improves the news quality judgement capability. In this work, we argue that coherence with the broader context should be considered, rather than just the similarity of adjacent sentences.


Figure 3: Validation losses with different context window sizes and coherence layer numbers on (a) AES and (b) news quality judgement.

Figure 4: Coherence score distribution.

Thus H-Trans+Cos (k=7, L=1) and H-Trans+LC (k=7, L=1) perform better than existing coherence models. What's more, HierCoh captures the hierarchical coherence of documents by max-coherence pooling, which further improves the document quality judgement performance significantly. Note that the OnlyMCP method also achieves a considerable result, demonstrating the benefit of capturing hierarchical multi-dimension coherence.

Further Analysis

Parameter Effects There are two important hyper-parameters in the HierCoh model: the context window size k in Eq. (5) and the number of coherence layers L in Alg. 1. Figure 3 shows the average validation loss on the two tasks with k ∈ {3, 5, 7, 11} and L ∈ {1, 2, 4, 8}. The model with L = 1 and window size k = 3 is similar to existing coherence models that only capture the local coherence of adjacent sentences, which is insufficient for document quality assessment. Models with multiple coherence layers achieve clearly lower validation loss. In fact, the coherence between sentences and their broader context can also be captured by high-level coherence modeling, so models with multiple coherence layers still achieve an appreciable improvement even with window size k = 3. Even so, the window size k still affects the scope of high-level coherence. Comparing validation losses, we select window size k = 5 and coherence layer number L = 4 for automated essay scoring.

Figure 5: Coherence vector visualization. Each sentence of a translated news segment about the iPhone 11 is shown with its 5-dimensional coherence vector.

The validation loss curve for news quality judgement differs slightly from AES, achieving the best loss when k = 7, since documents in the online news dataset are longer. Thus we set k = 7 and L = 4 for the news dataset.

Hierarchical Coherence Distribution To illustrate the effect of hierarchical coherence, we output the coherence vectors C^(l) at each level for student essays and calculate the coherence probability ĉ by Eq. (8) at level l of document D. We select high-score essays (normalized score > 0.8) as good essays and low-score essays (normalized score < 0.3) as bad essays, and observe the coherence score distribution of each level by kernel density estimation, illustrated in Fig. 4. The observation demonstrates that high-level coherence takes effect in document quality assessment. Moreover, the high-level coherence scores have a more stable distribution with fewer outliers, which makes the high-level coherence distribution easier for neural models to capture.
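A minimal sketch of the kernel density estimation step using scipy, assuming the per-level coherence scores ĉ have been collected into a 1-D array per essay group:

```python
import numpy as np
from scipy.stats import gaussian_kde

def coherence_density(scores, num_points=200):
    """Estimate the density of coherence scores (values in [0, 1])
    for one level and one essay group (good or bad essays)."""
    kde = gaussian_kde(scores)
    grid = np.linspace(0.0, 1.0, num_points)
    return grid, kde(grid)  # x-grid and estimated density values
```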

Case Study of Multi-Dimension Coherence Figure 5 illustrates the coherence vectors of a translated news segment from the online news dataset. In the proposed model, each dimension of the coherence vector represents the probability of one kind of relation between discourse units. We use different colors to denote different dimensions of the coherence vector; deeper colors denote larger values. The news segment is the first paragraph of an online news article about the Apple smartphone, from which we drew the parallel structure example in the Introduction. The first four sentences are in a consequent relation and the latter four sentences form a coordinate/parallel structure. We can observe that the green dimension reflects the consequent relation while the blue and yellow dimensions capture the parallel structure. This observation shows that the bilinear layers can capture various relations in multi-dimension coherence vectors.

Conclusion

In this paper, we argue that text coherence exists both within broader local contexts and between high-level discourse units. Based on this, we develop a hierarchical coherence model for document quality assessment. Experimental results on two realistic tasks demonstrate the superiority of our method.


References

Alikaniotis, D.; Yannakoudakis, H.; and Rei, M. 2016. Automatic text scoring using neural networks. arXiv preprint arXiv:1606.04289.

Barzilay, R.; and Lapata, M. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics 34(1): 1–34.

Bhatia, P.; Ji, Y.; and Eisenstein, J. 2015. Better document-level sentiment analysis from RST discourse parsing. arXiv preprint arXiv:1509.01599.

Chen, Y.-Y.; Liu, C.-L.; Lee, C.-H.; Chang, T.-H.; et al. 2010. An unsupervised automated essay-scoring system. IEEE Intelligent Systems 25(5): 61–67.

Chuan-An, L.; Huang, H.-H.; Chen, Z.-Y.; and Chen, H.-H. 2018. A unified RvNN framework for end-to-end Chinese discourse parsing. In Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, 73–77.

Dasgupta, T.; Naskar, A.; Dey, L.; and Saha, R. 2018. Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, 93–102.

De Clercq, O.; Hoste, V.; Desmet, B.; Van Oosten, P.; De Cock, M.; and Macken, L. 2014. Using the crowd for readability prediction. Natural Language Engineering 20(3): 293–325.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional Transformers for language understanding. In NAACL-HLT (1).

Dong, F.; Zhang, Y.; and Yang, J. 2017. Attention-based recurrent convolutional neural network for automatic essay scoring. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), 153–162.

Farag, Y.; Yannakoudakis, H.; and Briscoe, T. 2018. Neural automated essay scoring and coherence modeling for adversarially crafted input. arXiv preprint arXiv:1804.06898.

Feng, L.; Jansche, M.; Huenerfauth, M.; and Elhadad, N. 2010. A comparison of features for automatic readability assessment. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, 276–284. Association for Computational Linguistics.

Ji, Y.; and Eisenstein, J. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 13–24.

Ji, Y.; and Smith, N. 2017. Neural discourse structure for text categorization. arXiv preprint arXiv:1702.01829.

Li, J.; and Hovy, E. 2014. A model of coherence based on distributed sentence representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2039–2048.

Liu, Y.; and Lapata, M. 2018. Learning structured text representations. Transactions of the Association for Computational Linguistics 6: 63–75.

Louis, A.; and Nenkova, A. 2012. A coherence model based on syntactic patterns. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 1157–1168. Association for Computational Linguistics.

Marcu, D. 2000. The theory and practice of discourse parsing and summarization. MIT Press.

Mesgar, M.; and Strube, M. 2018. A neural local coherence model for text quality assessment. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4328–4339.

Miltsakaki, E.; and Kukich, K. 2004. Evaluation of text coherence for electronic essay scoring systems. Natural Language Engineering 10(1): 25–55.

Moon, H. C.; Mohiuddin, M. T.; Joty, S.; and Xu, C. 2019. A unified neural coherence model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2262–2272.

Nadeem, F.; and Ostendorf, M. 2018. Estimating linguistic complexity for science texts. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, 45–55.

Nguyen, D. T.; and Joty, S. 2017. A neural local coherence model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1320–1330.

Pappagari, R.; Zelasko, P.; Villalba, J.; Carmiel, Y.; and Dehak, N. 2019. Hierarchical Transformers for long document classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 838–844. IEEE.

Petersen, C.; Lioma, C.; Simonsen, J. G.; and Larsen, B. 2015. Entropy and graph based modelling of document coherence using discourse entities: An application to IR. In Proceedings of the 2015 International Conference on The Theory of Information Retrieval, 191–200. ACM.

Socher, R.; Chen, D.; Manning, C. D.; and Ng, A. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, 926–934.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.

Taboada, M.; and Mann, W. C. 2006. Applications of rhetorical structure theory. Discourse Studies 8(4): 567–588.

Taghipour, K.; and Ng, H. T. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1882–1891.

Tay, Y.; Phan, M. C.; Tuan, L. A.; and Hui, S. C. 2018. SkipFlow: Incorporating neural coherence features for end-to-end automatic text scoring. In Thirty-Second AAAI Conference on Artificial Intelligence.


Todirascu, A.; Francois, T.; Bernhard, D.; Gala, N.; and Ligozat, A.-L. 2016. Are cohesive features relevant for text readability evaluation?

Wang, Y.; Huang, S.; Li, G.; Deng, Q.; Liao, D.; Si, P.; Yang, Y.; and Xu, J. 2020. Cognitive representation learning of self-media online article quality. arXiv preprint arXiv:2008.05658.

Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; and Hovy, E. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480–1489.

Zeiler, M. D.; and Fergus, R. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 818–833. Springer.