A Corpus Linguistics Based Approach for Estimating Arabic Online Content
Figure values (labels lost in extraction): 5,340,000; 1,950,000; 0.5 %; 1 %; 1.4 %; 3 %
Zipf's Law
Corpora Building
Dmoz corpus: 75,560 pages, 530.1 MB, 659,756 unique words
Wikipedia corpus: 95,140 pages, 213.3 MB, 760,690 unique words
CCA corpus: 377 pages, 82,878 unique words
Common
Word Document Frequency
أو 26,769 105,754
هذه 29,289 97,964
بين 32,662 84,535
الله 26,803 84,216
أخبار 30,010 81,894
كل 30,277 81,224
الرئيسية 41,000 80,161
بعد 32,370 78,317
الصفحة 27,837 66,944
لم 25,403 64,251
كان 23,316 63,318
العالم 23,287 60,468
Word Document Frequency
في 60,218 1,077,288
من 61,949 860,052
على 56,846 498,694
إلى 48,599 278,315
أن 40,439 277,465
عن 50,736 241,824
التي 35,437 166,002
لا 40,122 153,887
مع 38,797 130,157
ما 33,637 129,304
هذا 31,363 109,125
الذي 32,474 108,844