A Corpus Linguistics Based Approach for Estimating Arabic Online Content


5,340,000


1,950,000


0.5 %


1 %


1.4 %


3 %


0.5 %, 1.4 %, 1 %


Zipf's Law
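Zipf's law states that a word's frequency is roughly inversely proportional to its frequency rank, f(r) ≈ C / r^s, with s close to 1 for natural-language text. The slides do not show how the law was applied, so the following is only a minimal sketch of one common way to check it: fit the exponent s as the negated least-squares slope of log(frequency) against log(rank). The function name `zipf_exponent` and the synthetic input are illustrative assumptions, not taken from the presentation.

```python
import math


def zipf_exponent(frequencies):
    """Estimate s in f(r) ~ C / r**s by a least-squares fit in log-log space.

    Hypothetical helper for illustration; not the presenters' method.
    """
    # Rank words by descending frequency; rank 1 is the most frequent word.
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Ordinary least-squares slope of log(frequency) on log(rank).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic check: frequencies that follow f(r) = 1000 / r exactly.
ideal = [1000 / r for r in range(1, 51)]
exponent = zipf_exponent(ideal)
```

On real corpus counts the fitted exponent will deviate from 1, and the fit is usually less reliable at the very top and the long tail of the ranking.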


Corpora Building


Dmoz corpus: 75,560 pages, 530.1 MB, 659,756 unique words


Wikipedia corpus: 95,140 pages, 213.3 MB, 760,690 unique words


CCA corpus: 377 pages, 82,878 unique words


Common


Word        Documents   Frequency
أو          26,769      105,754
هذه         29,289      97,964
بين         32,662      84,535
الله        26,803      84,216
أخبار       30,010      81,894
كل          30,277      81,224
الرئيسية    41,000      80,161
بعد         32,370      78,317
الصفحة      27,837      66,944
لم          25,403      64,251
كان         23,316      63,318
العالم      23,287      60,468

Word        Documents   Frequency
في          60,218      1,077,288
من          61,949      860,052
على         56,846      498,694
إلى         48,599      278,315
أن          40,439      277,465
عن          50,736      241,824
التي        35,437      166,002
لا          40,122      153,887
مع          38,797      130,157
ما          33,637      129,304
هذا         31,363      109,125
الذي        32,474      108,844
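The table columns appear to report, for each word, the number of pages that contain it and its total number of occurrences. That reading is an assumption, since the slides do not define the columns. Under that assumption, both counts could be computed from a collection of page texts roughly as follows. The function name `word_statistics`, the whitespace tokenization, and the sample pages are all illustrative; the presentation does not state how text was tokenized or normalized.

```python
from collections import Counter


def word_statistics(pages):
    """Return (word, document_count, frequency) rows, most frequent first.

    Hypothetical helper: splits on whitespace only, with no Arabic-specific
    normalization, which a real pipeline would likely need.
    """
    frequency = Counter()   # total occurrences of each word across all pages
    documents = Counter()   # number of pages in which each word appears

    for text in pages:
        tokens = text.split()
        frequency.update(tokens)
        documents.update(set(tokens))  # count each word once per page

    return [(word, documents[word], count)
            for word, count in frequency.most_common()]


# Tiny made-up sample, not data from the corpora described above.
sample_pages = ["من على من", "من إلى", "على"]
rows = word_statistics(sample_pages)
```

By construction the document count of a word can never exceed its frequency, which matches every row in the tables.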
