A corpus linguistics based approach for estimating online content


A Corpus Linguistics Based Approach for Estimating Arabic Online Content

[Figure: chart values only, labels not recoverable: 5,340,000; 1,950,000; 0.5 %; 1 %; 1.4 %; 3 %]

Zipf's Law

Corpora Building

Dmoz corpus: 75,560 pages, 530.1 MB, 659,756 unique words

Wikipedia corpus: 95,140 pages, 213.3 MB, 760,690 unique words

CCA corpus: 377 pages, 82,878 unique words

Common

Word (gloss)    Documents    Frequency
أو (or)    26,769    105,754
هذه (this, fem.)    29,289    97,964
بين (between)    32,662    84,535
الله (God)    26,803    84,216
أخبار (news)    30,010    81,894
كل (all, every)    30,277    81,224
الرئيسية (main)    41,000    80,161
بعد (after)    32,370    78,317
الصفحة (the page)    27,837    66,944
لم (did not)    25,403    64,251
كان (was)    23,316    63,318
العالم (the world)    23,287    60,468

Word (gloss)    Documents    Frequency
في (in)    60,218    1,077,288
من (from, of)    61,949    860,052
على (on)    56,846    498,694
إلى (to)    48,599    278,315
أن (that)    40,439    277,465
عن (about)    50,736    241,824
التي (which, fem.)    35,437    166,002
لا (no, not)    40,122    153,887
مع (with)    38,797    130,157
ما (what)    33,637    129,304
هذا (this, masc.)    31,363    109,125
الذي (which, masc.)    32,474    108,844
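
The two columns in the tables above (number of documents containing a word, and total occurrences of the word) could be computed along the following lines. This is a minimal sketch, not the authors' implementation: the function names `word_stats`, `top_words`, and `zipf_products` are hypothetical, and plain whitespace tokenization is an assumption (real Arabic text would likely need normalization first). The last helper multiplies each frequency by its rank, which under an idealized Zipf's law with exponent 1 would give roughly constant values.

```python
from collections import Counter


def word_stats(documents):
    """Map each word to (document_count, total_frequency).

    Assumption: each document is a string and tokens are separated
    by whitespace; no normalization or stemming is applied.
    """
    doc_count = Counter()
    frequency = Counter()
    for text in documents:
        tokens = text.split()
        frequency.update(tokens)       # every occurrence counts
        doc_count.update(set(tokens))  # each document counts once per word
    return {word: (doc_count[word], frequency[word]) for word in frequency}


def top_words(documents, n=12):
    """Return the n most frequent words with their (documents, frequency)."""
    stats = word_stats(documents)
    ranked = sorted(stats.items(), key=lambda item: item[1][1], reverse=True)
    return ranked[:n]


def zipf_products(frequencies):
    """Return rank * frequency for frequencies sorted in descending order."""
    ordered = sorted(frequencies, reverse=True)
    return [rank * freq for rank, freq in enumerate(ordered, start=1)]
```

For example, `top_words(pages, n=12)` on a list of page texts would yield rows in the same shape as the tables: word, number of documents, frequency.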
