Machine Verification and Identification of Telugu Metrical Poetry (Chandassu) Dileep Miriyala


Machine Verification and Identification of Telugu Metrical Poetry (Chandassu)

Dileep Miriyala

Agenda

Chandassu and padyam

Need for machine-based verification and identification of Chandassu

How it works

Demo of Chandam ©

Case studies

What's next?

Chandassu

Chandas Sastra is a literary framework: a set of rules to be followed when writing a poem or prose.

Chandassu was first used in the Vedas.

Telugu Chandassu was derived from Sanskrit, but it has its own set of rules.

A literary work in Telugu that follows the chandassu framework is referred to as a padyam.

Features of a Chandassu

The features that define a chandassu are

gana structure

yati

prAsa

prAsa yati

Guru and Laghu

Syllables are classified as guru or laghu.

The symbols associated with guru and laghu are U and I, respectively.

Other symbols in use include the dot, the dash, and the inverted U.

A syllable is classified by the time it takes to pronounce:

A laghu syllable takes 1 unit.

A guru syllable takes 2 units.

The classification also depends on the position of the syllable within a word.

Ex: On its own, అ is laghu (I), but the word అమ్మ has the sequence UI.

Chandas Sastra defines these rules.

gaNa

A sequence of guru and laghu symbols forms a gaNa.

gaNas are classified as follows.

Named gaNa (akshara)

Symbol sequences of length 1, 2, or 3 are each given a name. Ex: la, ga, va, ha, ya, ma, ta, ra, ja, bha, na, sa

Compound gaNa

A sequence of named gaNas forms a compound gaNa. Ex: la-la means II

Grouped gaNa

Sets of named gaNas and compound gaNas are classified as grouped gaNas.

Matra gaNa

gaNas are classified by the total time they take to pronounce.

Ex: la-la takes 2 units of time.
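The naming scheme above can be sketched as a small lookup table. This is only an illustration: the U/I patterns are the conventional prosody definitions of the named gaNas rather than values taken from the slides, and the helper names `compound_pattern` and `matra_count` are hypothetical.

```python
# Conventional guru (U) / laghu (I) patterns of the named gaNas.
# Assumption: standard prosody definitions, not taken from the slides.
NAMED_GANAS = {
    "la": "I", "ga": "U",
    "va": "IU",   # also called laga
    "ha": "UI",   # also called gala
    "ya": "IUU", "ma": "UUU", "ta": "UUI", "ra": "UIU",
    "ja": "IUI", "bha": "UII", "na": "III", "sa": "IIU",
}

MATRAS = {"I": 1, "U": 2}  # laghu = 1 unit, guru = 2 units


def compound_pattern(name):
    """Expand a compound gaNa name such as 'la-la' into its symbols."""
    return "".join(NAMED_GANAS[part] for part in name.split("-"))


def matra_count(pattern):
    """Total pronunciation time of a symbol pattern, in matra units."""
    return sum(MATRAS[symbol] for symbol in pattern)


print(compound_pattern("la-la"))               # II
print(matra_count(compound_pattern("la-la")))  # 2
```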

yati, prAsa, and prAsa yati

yati

In Sanskrit Chandassu, yati is the position at which a word break takes place.

In Telugu Chandassu, it is a syllable that is the same as, or a yati friend of, the 1st syllable of a pada.

prAsa

It is usually the 2nd or the last syllable of a pada.

The same syllable, or a prAsa friend of it, must be maintained in every line.

prAsa yati

It is a relaxation available in some chandassus when the poet is not able to apply prAsa.

The yati and prAsa syllables form a group, and a prAsa-yati-friendly group must appear at the yati position.

Classification of Chandassus

Class | gaNa structure | Features applied (yati, prAsa, prAsa yati)

Jati | Grouped gaNa | ✓ ✓

upaJati | Grouped gaNa | ✓ ✓ ✓

vRutta | Named gaNa | ✓ ✓

matrA | Matra gaNa | ✓* ✓*

*Optional in many cases

How many?

Chandassu uses binary symbols (U, I) to represent a sequence of syllables.

There is no restriction on the number of syllables per line or per poem (padyam).

The number of sequences that can be formed with up to n syllables is

$2^1 + 2^2 + 2^3 + \dots + 2^n = 2^{n+1} - 2$

Chandas Sastra named chandassus of up to 26 syllables.

Ex: gayatri Chandassu means any sequence of 6 syllable symbols*

udduramala Chandassu is the name given to a chandassu with more than 26 syllables.

Put simply, the total number of possible sequences is infinite.

How many?

If n = 26, the total number of possible sequences is $2^{27} - 2$ = 13,42,17,726 (134,217,726).

Chandas Sastra defined very few of them: only around 2000 sequences.

Ancient Telugu poets frequently used 30 Telugu chandassus.

Ancient Sanskrit poets used more than 1200 sequences but fewer than 1500.

* Many poets create their own chandassus in our time too.
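As a quick check of the counting argument, the sum of 2^k for k = 1..n can be computed directly and compared with the closed form 2^(n+1) - 2. The function name `sequences_up_to` is a hypothetical helper used only for this illustration.

```python
def sequences_up_to(n):
    """Number of distinct guru/laghu sequences of length 1 through n."""
    return sum(2 ** k for k in range(1, n + 1))


n = 26
# The direct sum agrees with the closed form 2^(n+1) - 2.
assert sequences_up_to(n) == 2 ** (n + 1) - 2
print(sequences_up_to(n))  # 134217726
```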

Why?

The quantity of literature written under the Chandassu framework has fallen sharply.

This is the era of digitization.

Tools for publishers: to ensure quality.

Tools for learners: to learn in an easy and interactive way.

Tools for professional poets: to experiment with new sequences and to reduce the effort of computing and validating against the rules.

Tools for language study and analysis: to understand and distinguish a poet's style, language, vocabulary, etc. in their time.

What if one tool could do all of this for padyam?

Chandam

Chandam © (ఛందం ©) is one such first-generation chandassu tool.

http://chandassu.org

http://chandam.apphb.com

Basic objectives

It should verify a given padyam against a given chandassu.

It should identify the chandassu of a given padyam, along with the errors (i.e. violated rules).

Matching Engine

Raw Input → White Listing → Syllable Chunks → Prepare Laghu-Guru Stream → Extract Features → Match Features → Match Result

Inputs: Raw Input, Chandassu

Components: Text Analyzer, Matcher


White Listing

Digitized text may contain references to sources and, sometimes, foreign characters.

Only characters in the Telugu (target language) Unicode sub-range, along with a few identified punctuation marks, are kept; everything else is filtered out.
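A minimal sketch of such a white-listing step is shown below. The Telugu Unicode block is U+0C00 to U+0C7F; the punctuation set and the function name `white_list` are assumptions made for this illustration, since the slides do not list the punctuation marks the tool keeps.

```python
TELUGU_START, TELUGU_END = "\u0c00", "\u0c7f"   # Telugu Unicode block

# Assumed set of retained punctuation and whitespace (not from the slides).
ALLOWED_PUNCTUATION = set(" \n.,;:!?-'\"()")


def white_list(raw):
    """Keep Telugu characters and the allowed punctuation; drop the rest."""
    return "".join(
        ch for ch in raw
        if TELUGU_START <= ch <= TELUGU_END or ch in ALLOWED_PUNCTUATION
    )


sample = "abc\u0c15\u0c4a\u0c24\u0c4d\u0c24 [12]!"   # Latin text around కొత్త
print(white_list(sample))
```

Latin letters, digits, and brackets in the sample are dropped, while the Telugu word, the space, and the exclamation mark survive.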


Syllable Chunks

Building syllable chunks is necessary to create the Laghu-Guru stream.

Any syllable extraction mechanism can be used.

Ex: కొత్త → కొ, త్త
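One possible extraction mechanism is sketched below; it is not necessarily the one the tool uses. It assumes that a chunk is an optional conjunct (consonants joined by the virama) followed by its vowel sign and modifiers, and that a vowel-less final consonant (consonant + virama) attaches to the preceding chunk. The code-point ranges come from the Telugu Unicode block; the function names are hypothetical.

```python
VIRAMA = "\u0c4d"


def is_consonant(ch):
    return "\u0c15" <= ch <= "\u0c39"


def is_independent_vowel(ch):
    return "\u0c05" <= ch <= "\u0c14" or ch in "\u0c60\u0c61"


def is_dependent_sign(ch):
    # Vowel signs, length marks, and candrabindu/anusvara/visarga.
    return ("\u0c3e" <= ch <= "\u0c4c" or ch in "\u0c55\u0c56"
            or "\u0c00" <= ch <= "\u0c03")


def syllable_chunks(text):
    chunks, current = [], ""

    def flush():
        nonlocal current
        if not current:
            return
        # Assumption: a bare consonant + virama joins the previous chunk.
        if current.endswith(VIRAMA) and chunks:
            chunks[-1] += current
        else:
            chunks.append(current)
        current = ""

    for ch in text:
        if ch == VIRAMA or is_dependent_sign(ch):
            current += ch                      # belongs to the open chunk
        elif is_consonant(ch) and current.endswith(VIRAMA):
            current += ch                      # conjunct continues
        elif is_consonant(ch) or is_independent_vowel(ch):
            flush()                            # a new chunk starts here
            current = ch
        else:
            flush()                            # spaces etc. end the chunk
    flush()
    return chunks


WORD = "\u0c15\u0c4a\u0c24\u0c4d\u0c24"        # కొత్త
print(syllable_chunks(WORD))                   # ['కొ', 'త్త']
```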


Laghu-Guru Stream

Each syllable chunk is assigned a symbol (U or I).

Every laghu syllable is checked to see whether the next syllable influences it.

Ex: కొత్త

కొ [I, 1, 0], త్త [I, 1, 1]

[Current symbol, check for next-syllable influence, can influence previous syllable]

i.e. కొ, త్త → U, I
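The three-field record above can be sketched as follows. This is a simplified illustration under stated assumptions: a chunk is guru on its own if it has a long vowel, an anusvara or visarga, or a closing vowel-less consonant, and a laghu chunk becomes guru when the next chunk begins with a conjunct. Chandas Sastra has special cases that this sketch ignores, and the names `describe` and `laghu_guru_stream` are hypothetical.

```python
VIRAMA = "\u0c4d"
# Long independent vowels (ఆ ఈ ఊ ౠ ఏ ఐ ఓ ఔ) and long vowel signs (ా ీ ూ ౄ ే ై ో ౌ).
LONG_VOWELS = set("\u0c06\u0c08\u0c0a\u0c60\u0c0f\u0c10\u0c13\u0c14"
                  "\u0c3e\u0c40\u0c42\u0c44\u0c47\u0c48\u0c4b\u0c4c")
CLOSING_MARKS = set("\u0c02\u0c03")            # anusvara, visarga


def is_consonant(ch):
    return "\u0c15" <= ch <= "\u0c39"


def describe(chunk):
    """Return [symbol, check_next, can_influence_prev] for one chunk."""
    own_guru = (any(ch in LONG_VOWELS or ch in CLOSING_MARKS for ch in chunk)
                or chunk.endswith(VIRAMA))
    starts_with_conjunct = (len(chunk) >= 3 and is_consonant(chunk[0])
                            and chunk[1] == VIRAMA and is_consonant(chunk[2]))
    return ["U" if own_guru else "I",
            0 if own_guru else 1,
            1 if starts_with_conjunct else 0]


def laghu_guru_stream(chunks):
    info = [describe(c) for c in chunks]
    stream = []
    for i, (symbol, check_next, _) in enumerate(info):
        # A laghu is promoted to guru when the next chunk can influence it.
        if check_next and i + 1 < len(info) and info[i + 1][2]:
            symbol = "U"
        stream.append(symbol)
    return "".join(stream)


chunks = ["\u0c15\u0c4a", "\u0c24\u0c4d\u0c24"]   # కొ, త్త
print([describe(c) for c in chunks])              # [['I', 1, 0], ['I', 1, 1]]
print(laghu_guru_stream(chunks))                  # UI
```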


Extract Features

Feature extraction uses two parsers: the gaNa parser and the pairs parser.

gaNa Parser

The symbol stream is parsed according to the characteristics of the target gaNa.

Named gaNa

Ex: UIIUIIUIIU

The above sequence can be parsed as bha-bha-bha-ga, as gala-laga-lala-gala-laga, or in many other ways.

While parsing the symbols, the threshold (length) of the next expected gaNa is taken into account.

If the feature for the above sequence is defined as bha-bha-bha-ga, the thresholds are 3-3-3-1.

UII-UII-UII-U → bha-bha-bha-ga
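Threshold-driven parsing of named gaNas might look like the sketch below. It cuts the stream into pieces whose lengths are the thresholds of the expected gaNas and reports whether each piece matches. The table holds only the gaNas named in this example, and `parse_named` is a hypothetical helper, not the tool's API.

```python
# Only the gaNas used in this example; patterns are the conventional ones.
GANAS = {"ga": "U", "la": "I", "gala": "UI", "laga": "IU",
         "lala": "II", "bha": "UII"}


def parse_named(stream, expected):
    """Cut `stream` using each expected gaNa's length as the threshold.

    Returns (pieces, leftover); each piece is (name, symbols, matched).
    """
    pieces, pos = [], 0
    for name in expected:
        size = len(GANAS[name])                # threshold of this gaNa
        symbols = stream[pos:pos + size]
        pieces.append((name, symbols, symbols == GANAS[name]))
        pos += size
    return pieces, stream[pos:]


pieces, leftover = parse_named("UIIUIIUIIU", ["bha", "bha", "bha", "ga"])
print([symbols for _, symbols, _ in pieces])   # ['UII', 'UII', 'UII', 'U']
print(all(ok for _, _, ok in pieces))          # True
```

The same stream also parses cleanly against gala-laga-lala-gala-laga, which is why the expected feature has to drive the cut points.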

gaNa Parser

Grouped gaNa

In the case of grouped gaNas, the expected threshold is not constant.

The gaNa is taken to be the shortest symbol sequence that matches a member of the expected group or, failing that, the sequence ending at the symbol where the maximum threshold is reached.

Ex: UIUUI → UI-UUI (Surya-Indra)

Group | Min threshold | Max threshold

Surya (Brahma) | 2 | 3

Indra (Vishnu) | 3 | 4

Chandra (Rudra) | 4 | 5
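The variable-threshold rule can be sketched like this. The member lists for the Surya and Indra groups are the conventional ones from Telugu prosody, not taken from the slides; the Chandra group is omitted for brevity, and `parse_grouped` is a hypothetical helper.

```python
# Conventional group members (assumption: not listed in the slides).
GROUPS = {
    "Surya": {"UI", "III"},                                   # gala, na
    "Indra": {"IIII", "IIIU", "IIUI", "UII", "UIU", "UUI"},   # nala, naga, sala, bha, ra, ta
}


def parse_grouped(stream, expected_groups):
    """Return (pieces, leftover); each piece is (group, symbols, matched)."""
    pieces, pos = [], 0
    for group in expected_groups:
        members = GROUPS[group]
        lo = min(len(m) for m in members)      # min threshold
        hi = max(len(m) for m in members)      # max threshold
        symbols, matched = stream[pos:pos + hi], False
        for size in range(lo, hi + 1):
            if stream[pos:pos + size] in members:
                symbols, matched = stream[pos:pos + size], True
                break                          # shortest match wins
        pieces.append((group, symbols, matched))
        pos += len(symbols)
    return pieces, stream[pos:]


print(parse_grouped("UIUUI", ["Surya", "Indra"]))
# ([('Surya', 'UI', True), ('Indra', 'UUI', True)], '')
```

When no member matches, the piece is cut at the maximum threshold and flagged as a mismatch, so parsing can continue with the next expected group.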

gaNa Parser

Matra gaNa

The gaNa is the shortest symbol sequence at which the expected matra count is reached or exceeded.

Ex: UIUUUI can be parsed as UIU-UUI when the expected matra gaNa is 5-5.
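A sketch of matra-based cutting, assuming laghu counts as 1 matra and guru as 2: symbols are consumed until the expected count is reached or exceeded. `parse_matra` is a hypothetical helper; the six-symbol stream UIUUUI is used because UIU and UUI each add up to 5 matras.

```python
MATRAS = {"I": 1, "U": 2}   # laghu = 1, guru = 2


def parse_matra(stream, expected_counts):
    """Return (pieces, leftover); each piece is (symbols, matras, exact)."""
    pieces, pos = [], 0
    for target in expected_counts:
        start, total = pos, 0
        # Consume symbols until the expected count is reached or exceeded.
        while pos < len(stream) and total < target:
            total += MATRAS[stream[pos]]
            pos += 1
        pieces.append((stream[start:pos], total, total == target))
    return pieces, stream[pos:]


print(parse_matra("UIUUUI", [5, 5]))
# ([('UIU', 5, True), ('UUI', 5, True)], '')
```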

Pairs Parser

yati, prAsa, and prAsa yati are pairs of syllables.

They are extracted based on the position of the yati.

Position of yati

Usually an absolute number in the case of vRuttas.

Ex: 10th place means the 10th syllable in each line.

Otherwise, a position relative to a given gaNa.

Ex: 1st syllable of the 3rd gaNa.

Ex: Last syllable of the 3rd gaNa.

For each line, a pair of the 1st and nth syllables is created, along with their previous syllables.

prAsa: prAsa is usually the 2nd or the last syllable of each line.

Hence an array of prAsa syllables is created, again with the previous syllable.

The previous syllable plays an important role, since it can influence the validity of the yati, prAsa, and prAsa-yati pairs.
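The pair extraction can be illustrated with a small sketch that works on a line already split into syllable chunks. It assumes an absolute, 1-based yati position and keeps only the syllable preceding the yati or prAsa syllable; the function names are hypothetical and Latin letters stand in for Telugu syllables.

```python
def yati_pair(syllables, yati_position):
    """(1st syllable, yati syllable, syllable before the yati syllable).

    `yati_position` is 1-based; returns None if the line is too short.
    """
    if yati_position < 2 or yati_position > len(syllables):
        return None
    idx = yati_position - 1
    return (syllables[0], syllables[idx], syllables[idx - 1])


def prasa_entry(syllables, prasa_position=2):
    """(prAsa syllable, previous syllable) for one line, or None."""
    idx = prasa_position - 1
    if idx < 1 or idx >= len(syllables):
        return None
    return (syllables[idx], syllables[idx - 1])


line = ["a", "b", "c", "d", "e"]     # placeholder syllables
print(yati_pair(line, 4))            # ('a', 'd', 'c')
print(prasa_entry(line))             # ('b', 'a')
```

Collecting `prasa_entry` over every line gives the prAsa array, which can then be checked for the same (or a friendly) syllable across lines.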


Match Features

The extracted features (gaNas and pairs) are matched against the expected features.

A scoring system is defined to find the match percentage.

-1 → Key feature not found or mismatched.

0 → Feature not found or mismatched.

+1 → Feature found and matched.

+2 → Key feature found and matched exactly.

Customised scoring systems are open for experiments.

Percentage of match, or confidence:

$\text{Confidence} = \dfrac{(\text{sum of the scores gained by all features}) \times 100}{(2 \times \text{no. of key features}) + \text{no. of normal features}}$
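The scoring rule and the confidence formula can be written out as a short sketch. Each feature is represented as an `(is_key, matched)` pair, which is an assumed representation chosen for this illustration; `score` and `confidence` are hypothetical names.

```python
def score(is_key, matched):
    """Score one feature: key +2 / -1, normal +1 / 0."""
    if is_key:
        return 2 if matched else -1
    return 1 if matched else 0


def confidence(features):
    """Match percentage for a list of (is_key, matched) pairs."""
    total = sum(score(is_key, matched) for is_key, matched in features)
    keys = sum(1 for is_key, _ in features if is_key)
    normal = len(features) - keys
    return total * 100 / (2 * keys + normal)


# Two key features (one mismatched) and two normal features (one mismatched).
features = [(True, True), (True, False), (False, True), (False, False)]
print(round(confidence(features), 2))   # 33.33
```

A mismatched key feature costs three points relative to a match, so key features dominate the confidence value.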


Match Results

Match Results may be delivered based on the user needs

HTML, PDF, Excel, TEXT etc.

Mismatches will be presented as Errors

Match score will be presented as Confidence of Matching.


Sample Result [HTML]

Chandassu Identification

Why?

To determine the chandassu of an unknown padyam.

To find multiple matches, if any.

To resolve conflicts.

Mechanism

Match the given padyam against each and every chandassu.

Identify the chandassu for which the maximum score is obtained.

This can be applied only to known chandassus.

Determining sama/vishama pada chandassus is also possible [not in scope].
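The mechanism amounts to scoring every known candidate and sorting by score. The sketch below shows that loop with a deliberately simple placeholder matcher; the real Matching Engine compares gaNas and syllable pairs, and the names `identify`, `toy_match`, and the candidates "A" and "B" are hypothetical.

```python
def toy_match(stream, pattern):
    """Placeholder matcher: percentage of positions where symbols agree."""
    length = max(len(stream), len(pattern))
    same = sum(1 for a, b in zip(stream, pattern) if a == b)
    return same * 100 / length


def identify(stream, candidates, match):
    """Score every candidate and return (name, score) sorted best first."""
    results = [(name, match(stream, pattern))
               for name, pattern in candidates.items()]
    return sorted(results, key=lambda r: r[1], reverse=True)


candidates = {"A": "UIIUIIUIIU", "B": "IIIIIIIIII"}   # made-up patterns
ranked = identify("UIIUIIUIIU", candidates, toy_match)
print(ranked)   # [('A', 100.0), ('B', 60.0)]
```

Returning the full sorted list rather than only the top entry makes multiple matches and near-ties visible.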

Chandassu Identification

Known Candidates → each candidate → Matching Engine → Sorted Results

Identification Engine

Need for optimization

Running the Matching Engine on all known chandassus could take a long time.

Ex:

Consider 400 known chandassus (in the case of Telugu).

The average time taken to match the features of one chandassu is 40-100 milliseconds.

The total time to identify is 40*400 to 100*400 ms, i.e. 16 sec. to 40 sec.

Language | Size | Min time | Max time

Telugu/Kannada | 400 | 16 sec. | 40 sec.

Sanskrit | 1200 | 48 sec. | 120 sec.

Identification Engine

Eliminating redundant steps and caching the results

The results of the text analysis are cached.

Determining the eligible candidates

Find the syllable count of each line. Ex: 7, 12, 8, 15

Find the range of the syllable counts, i.e. the min and max values. Ex: 7-15

Find all the candidates that fall within this range. Ex: 7-15

# If the digitization has errors, the syllable count may not match the actual count.

An extended range is calculated, assuming, say, t% digitization errors:

Floor(X1*((100-t)/100)) to Ceil(X2*((100+t)/100)), where X1 = min value and X2 = max value.

Ex: the extended range would be 6-16.
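The range computation can be sketched with integer arithmetic so that floor and ceiling are exact. With the line counts 7, 12, 8, 15, a tolerance of t = 5 reproduces the 6-16 range of the example; the slides do not state which t was used, so that value is an assumption. The candidate table maps hypothetical chandassu names to syllables per line.

```python
def extended_range(line_counts, t):
    """Floor(X1*(100-t)/100) to Ceil(X2*(100+t)/100) for t% error."""
    x1, x2 = min(line_counts), max(line_counts)
    low = (x1 * (100 - t)) // 100              # floor
    high = -((-x2 * (100 + t)) // 100)         # ceiling via negated floor
    return low, high


def eligible(candidates, low, high):
    """Names of candidates whose syllable count lies within [low, high]."""
    return [name for name, count in candidates.items()
            if low <= count <= high]


low, high = extended_range([7, 12, 8, 15], t=5)
print(low, high)                               # 6 16

candidates = {"A": 6, "B": 11, "C": 20}        # hypothetical candidates
print(eligible(candidates, low, high))         # ['A', 'B']
```

Only the eligible candidates are then passed to the Matching Engine, which is where the time saving comes from.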

Identification Engine

Raw Input → White Listing → Syllable Chunks → Prepare Laghu-Guru Stream → Range Extractor → Eligible Candidates (drawn from the Available Candidates) → for each candidate: Extract Features → Match Features → Match Result → Match Results → Result

Components: Text Analyzer, Range Extractor, Matcher, Identifier


Sample Evaluation (Sorted Scores)

Performance

Padyam | Non-optimized Identification Engine | Optimized Identification Engine

Padyam-1 | 523 ms | 56 ms

Padyam-2 | 186 ms | 82 ms

Padyam-3 | 346 ms | 116 ms


Demo Of Chandam

The Power of Chandam ©

దెప్పరమగు కాలముచే

నెప్పుడు దేవతల కెల్ల నష్టంబగు నీ

యొప్పిదముఁ గృష్ణుఁ డరిగినఁ

దప్పెఁ గదా! తల్లి! నీవు తల్లడపడఁగన్.

The above padyam (#397) is from the Go-Vrishabha Samvadam (episode 33) in the first skandha of the Bhagavatam.

The Power of Chandam ©

The result obtained when this padyam was evaluated with Chandam ©

The Power of Chandam ©

Here, Chandam © reported that in the second pada there is no yati maitri (yati friendship) between 'నె' and 'న'.

Suspecting a typing mistake, we checked the book, but the Telugu Sahitya Akademi edition has the same reading.

So does another book.

The Power of Chandam ©

Nor did it appear to be any other special kind of yati.

On consulting another published edition, that of Tirumala Tirupati Devasthanam, it turned out that 'నష్టం' is not the correct reading and 'నిష్టం' is.

The Power of Chandam ©

The result obtained when the padyam with the corrected reading was evaluated with Chandam ©

Chandam © home screen

Chandassu computation

Computation results

Computation results

Here we can see Chandam © correctly pointing out that the second pada is one gaNa short.

Computation results

Here we can see that the first pada is faulty.

Computation results

Note also that the prAsa yati in the second pada is identified correctly.

Note that the previous padyam had 'ఛాయననొసగు' instead of 'ఛాయనొసగు'.

A kanda padyam written by the computer

Chandam © is able to write padyams not only in kanda but in all Telugu padyam chandassus, although only with the syllables sa, ri, ga, ma, pa, da, ni.

Chandassu search

Search results for padyam chandassus that can be written in the six-matra (shaNmAtrA) series

Evaluation is the method Chandam © follows to identify the chandassu.

It evaluates the padyam against every chandassu and identifies, as the chandassu of the padyam, the one whose rules are satisfied to the highest percentage.

# The evaluation view shows which chandassus were selected for computation and how many marks, or what percentage, each chandassu scored.

Creating a new chandassu

See how easily a chandassu named 'Govinda' can be constructed.

It was designed by Sri Bejjala Mohana Rao.

Creating a new chandassu

See how thoroughly Chandam © understands the features of the constructed chandassu and presents them in a form that can be shared with others.

It shows the features of all Telugu and Sanskrit chandassus in the same detail.

Creating a new chandassu

We can see that Chandam © also evaluates a padyam written in the new chandassu correctly.

Case Study 1: Telugu Bhagavatam

Sri U. Samba Siva Rao digitized the Telugu Bhagavatam: http://telugubhaghavatam.org

Total padyams: 10061 (7400), with 900K words and 16K unique words.

Total time taken to run 7400 padyams: 18 min, i.e. about 150 ms per padyam.

Results after running these on Chandam, at 70% confidence:

Percentage | Result | Examples

40 | Human errors | Spelling mistakes or misplaced punctuation

30 | Human errors | Misplaced spaces and punctuation; treating compound words as independent words and vice versa

30 | False alarm | Due to limitations of the tool

Case Study 2: Poets and Usage

Poets with primary or intermediate skills in writing padyams

Credited the tool on various forums

Improved quality

Can focus more on the creative and literary part

Computation is outsourced to the tool

Poets who have mastered writing padyams

Experiment with and practice (or learn) new patterns

Around 30-40 Telugu poets use Chandam regularly.

Avg. computations per poet per day: 3-5

Case Study 3: Research

Mr. M. Narasimha Rao and I have started analyzing Annamaya kIrtaNas

To find whether there is any influence of Chandassu in his writings

To compute a statistical analysis of Chandassu patterns

Mr. Sri Ganesh T is working on determining author style in metrical poetry.

Limitations

Handling special rules

In determining symbols. Ex: అద్రుచు, కద్రుచు

yati matching when there is a sandhi

Ex: ఆటగా ఛందాలనల్లించి, యలరించి, where ఛందాలనల్లించి = ఛందాలను + అల్లించి

# yati matching based on achchu

What's Next?

Resolving the lines

Ancient poets used to write the whole padyam in a single line.

In some cases there is no indicator of the line breaks or of the chandassu name.

This makes it difficult to determine the chandassu.

With a little customization of the Identification Engine, this can be resolved easily.

Discovering the art forms

Bandha or citra kavitvas

Configured for Kannada chandassus too [alpha version].

Technologies

Runs as a web client and server, and as a Windows client application.

A JS API is available for integration with external sites.

http://chandam.apphb.com/?qpi

Technologies used

JavaScript via Script#

HTML5

MongoDB

C#

ASP.NET

Anyone interested in Java/PHP ports can contact me to collaborate.


Dileep Miriyala

Contribution to Indic Languages:

Indic PDF http://indicpdf.apphb.com

Telugu Bhaghavatam http://telugubhaghavatam.org

Chandam : http://chandam.apphb.com

7 Keyboard Layouts for Telugu [Web/Windows/Mac]

Importable Mac Keyboard Layout on Windows

ASCII to Unicode Fonts (Not TEXT Conversion)

Some Works in progress

Sandhi Merger and Identifier

Spell checker

Content Clustering


Questions?

Contact

[email protected]

Phone +91-8978559072

http://chandassu.org

http://chandam.apphb.com

http://indicpdf.apphb.com