LEADING EDGE FORUM CSC PAPERS Copyright © 2013 Computer Sciences Corporation. All rights reserved.
NoSQL for Next Gen CMS
NOSQL FOR NEXT GEN CONTENT MANAGEMENT SYSTEM
K V S Ranga Prasad
J Deepika
M Ravi Chandrasekhar
R Ravi Sudharson
Naren Raghavendra Suri
CSC Papers 2013
1. INTRODUCTION
For users, a Content Management System [CMS] comprises three processes:
Collecting relevant content and storing it in a database
Managing and processing the content
Publishing the content in different forms as per end-user needs
In the real world, most domains, systems and applications already have a CMS in place. Electronic media, health care [for example, code books], complex dictionaries, online reservation systems and logistics, to name a few, are all big businesses in the market.
Big Data is a buzzword today, and NoSQL databases are already capturing the market. To stay ahead of the race, we should seize the business opportunity before clients identify the need for change. High-performance open source technologies such as Big Data and NoSQL platforms definitely deserve attention.
To assist clients and organizations that use a legacy Content Management System, we carried out a case study / POC on one such system. As part of this, we studied the current system end to end and identified the areas that need to be addressed in order to transform the current, marginally efficient system into a Real-Time Content Management System.
To address these areas, we looked for a new model that can enhance the current system without affecting the structure or functionality expected of a Content Management System, i.e. Content Generation, Work-Flow, Authorization, Validation, Publishing etc. This thought process led us to evaluate NoSQL capabilities as a replacement for our current RDBMS [Oracle].
The proposed new model / solution is expected to bring in the benefits of using NoSQL in place of a traditional RDBMS. Benefits include:
Real-time Content Management System
Operational Efficiency
High Performance
Cost Benefits [open-source]
The goal of this paper is to describe the solution / new model and showcase the results.
1.1 BRIEF NOTE ON NOSQL DATABASES
A NoSQL [in short: Not Only SQL] database provides a mechanism for storing and retrieving data that uses looser consistency models than traditional relational databases in order to achieve horizontal scalability with ease and provide higher availability. In general, NoSQL databases are classified as:
Key / Value – Voldemort, SimpleDB, Memcached, Amazon’s Dynamo
Document – MongoDB, CouchDB
Column – HBase, Cassandra
Graph – Neo4j, InfiniteGraph
Others – Geospatial, File System, Object
1.2 NOSQL CAPABILITIES
While creating the new model for the legacy CMS, we held brainstorming sessions to identify the technologies best suited to a model that brings more benefits to the customer. The outcome was a decision to switch the CMS data storage from RDBMS [Oracle] to a NoSQL database because of NoSQL capabilities, which include:
High Performance
High Availability [In-built Caching Mechanism]
Handle Huge Volume of Data
Elastic DB Scaling-Out [adding nodes to the existing set-up]
2. EVALUATION OF CURRENT SYSTEM
As described in the previous section, we performed an end-to-end study of the current CMS, a legacy system. We evaluated the system phase by phase to identify the areas where the current process consumes the most time or delays content being published to end clients / downstream applications.
2.1 CURRENT SCENARIO
Our client’s legacy Content Management System [CMS] is designed to publish content to downstream applications in the desired formats [including .txt, .dat, .xml, .html, .sql etc.]. During content publishing, the CMS fetches the content from Oracle [where it is stored in XML format as CLOB] and processes it in stages. The design of the current system is not suitable for real-time content publishing [which was expected of the system] because the content generation process itself takes a very long time. As a result, content is published at intervals: daily, weekly, monthly or quarterly. Here is a diagram of the current system:
[Figure: current system diagram. Components: Data Layer (RDBMS - Oracle), Validation Layer, Content Generation, Content Processing, Content Publishing, File Server, Applications]
2.2 CURRENT SYSTEM BASE-LINE METRICS
We collected statistics on the current system to have a basis for comparison. These statistics help us understand how far our proposed solution / new model can improve the system and achieve our goal of a Real-Time Content Management System.
2.2.1 TIME CONSUMED BY THE SYSTEM TO PROCESS THE CONTENT
This metric depicts the time consumed by the CMS in processing the content, making it ready for publishing and finally delivering it to the customer. In our study we segregated the processed content by complexity and volume. The figure for this metric [segments marked in green] shows that most of the content requires a very long processing time.
2.2.2 TIME CONSUMED FOR CMS PROCESS IN A QUARTER
This metric depicts the effort, in hours, consumed by the different phases of the current CMS in generating, processing and publishing the content.
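[Figure for Section 2.2.1: share of content by processing time. Categories: < 1 hr, 1 hr to 3 hrs, > 3 hrs. Values shown: 10%, 23%, 67%]
[Figure for Section 2.2.2: effort in hours per CMS phase in a quarter. Values shown: 441, 1012, 552]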
2.2.3 FREQUENCY OF CONTENT DELIVERED TO ONLINE CUSTOMERS
This metric depicts how frequently content is published to end customers. Our study of the current legacy CMS shows that 74% of the content is not published and delivered to end customers on a daily basis.
2.3 CHALLENGES IN CURRENT SYSTEM
The current system faces the following challenges:
The system stores its content in an RDBMS [Oracle] and fetches it from there for processing. The RDBMS response time for content queries [returning more than 100K records] adds considerable delay.
The system model / architecture / process adds a large delay because of the many layers between content generation and publishing and delivery to the end customer.
The system is not elastic in nature because it delivers content periodically.
The system uses XML as the medium for content generation and processing. Processing huge XML files using XSLT requires additional processing time.
During content generation, searching for the appropriate lexicon content [concepts / terms] in the healthcare dictionary is a tedious activity.
[Figure for Section 2.2.3: frequency of content delivery. Annually 1%, Quarterly 36%, Monthly 23%, Bi-Weekly 1%, Weekly 12%, Daily 26%]
3. OUR APPROACH
We looked for a solution that can pave the path to our goal: a Real-Time Content Management System. The outcome of our thought process is a new model that enhances the current system.
We strongly recommend an in-house solution that can be delivered phase by phase rather than in a big-bang manner.
We have plotted different paths [based on our analysis, research and estimation techniques] to reach our final goal: a Real-Time Content Management System using a NoSQL DB.
This path-plotted graph, along with the break-even analysis, is the heart and soul of our new model / solution for transforming the existing system.
Let us take a close look at the graph below. The X-axis shows Operational Cost + Business Benefits, while the Y-axis shows Performance at different stages. The paths are plotted considering the medium in which data is stored and processed using a NoSQL DB and published as per end-user requirements, i.e. as html, xml, text, pdf, word etc.
PATH-PLOTTED GRAPH TO CHOOSE THE BEST PROJECT EXECUTION
BREAK-EVEN ANALYSIS FOR THE SUGGESTED MODEL
The path-plotted graph above shows how the project can be handled depending on the path chosen by the client. Each path is estimated in person-years.
Whichever of the three paths the client chooses, the benefits can be reaped ≈ 3 years after the project starts.
Break-even data (cumulative cost):
Year        Cumulative for existing system    Cumulative for new system
1st Year    $839,560                          $1,372,800
2nd Year    $1,679,120                        $2,082,300
3rd Year    $2,518,680                        $2,494,800
4th Year    $3,358,240                        $2,907,300
5th Year    $4,197,800                        $3,319,800
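The break-even point can be checked directly from the cumulative figures in the chart. The sketch below assumes the first series (starting at $839,560) is the existing system and the second (starting at $1,372,800) is the new system, following the legend order. The class and method names are illustrative only.

```java
import java.util.OptionalInt;

public class BreakEven {
    // Cumulative cost in USD for years 1 to 5, as read from the break-even chart.
    static final long[] EXISTING = {839_560, 1_679_120, 2_518_680, 3_358_240, 4_197_800};
    static final long[] PROPOSED = {1_372_800, 2_082_300, 2_494_800, 2_907_300, 3_319_800};

    /** First 1-based year in which the new system's cumulative cost is at or below the existing system's. */
    static OptionalInt breakEvenYear(long[] existing, long[] proposed) {
        if (existing.length != proposed.length) {
            throw new IllegalArgumentException("series must have the same length");
        }
        for (int i = 0; i < existing.length; i++) {
            if (proposed[i] <= existing[i]) {
                return OptionalInt.of(i + 1);
            }
        }
        return OptionalInt.empty(); // no break-even within the years provided
    }

    /** Cumulative saving (existing minus new) at the end of the given 1-based year; negative before break-even. */
    static long cumulativeSaving(long[] existing, long[] proposed, int year) {
        return existing[year - 1] - proposed[year - 1];
    }

    public static void main(String[] args) {
        OptionalInt year = breakEvenYear(EXISTING, PROPOSED);
        System.out.println("Break-even year: "
                + (year.isPresent() ? String.valueOf(year.getAsInt()) : "none within range"));
        for (int y = 1; y <= EXISTING.length; y++) {
            System.out.println("Year " + y + " cumulative saving: $" + cumulativeSaving(EXISTING, PROPOSED, y));
        }
    }
}
```

With these figures the new system first becomes cheaper in the 3rd year, which agrees with the ≈ 3 year estimate stated above.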
3.1 ENHANCE THE DATA LAYER / FILE SERVER LAYER WITH NOSQL DB
We took our initial steps towards enhancing the Data Layer of our system. As discussed in the Introduction, we were drawn to NoSQL by its capabilities and business benefits.
Of all the NoSQL databases, we opted for the document-oriented database MongoDB [named from "huMONGOus," meaning "extremely large"] because of its usability and ease of installation, dynamic schema, cost-effective open-source licensing, Replica-Sets for Master / Slave replication and automatic fail-over, and Sharding for elastic DB scale-out.
In our current system, the Content Generation, Content Processing and Content Publishing processes are not effectively coupled. Content Generation may add, update or delete a piece of content, and the RDBMS is updated with the change straight away. Content Processing and Content Publishing, however, are coupled only to each other and run periodically.
Because Content Processing is periodic, some content is processed daily and other content weekly, monthly or quarterly. As a result, the latest content changes are not processed and are not available to the Content Publishing process. To achieve a Real-Time Content Management System, we need to couple all the required processes so that any change to the CMS database triggers the downstream processes that publish the content to end-user applications.
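The coupling described here, where a change to the content store triggers the downstream steps, can be illustrated with a small in-memory sketch. `ContentRepository`, `onChange` and `save` are hypothetical names, and the map is only a stand-in for the database. This is not the mechanism used in the POC, just one simple way to express the idea.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

public class ContentRepository {
    // In-memory stand-in for the central content store (hypothetical; not MongoDB itself).
    private final Map<String, String> documents = new LinkedHashMap<>();
    // Downstream steps (for example processing or publishing) to run on every change.
    private final List<BiConsumer<String, String>> listeners = new ArrayList<>();

    /** Registers a downstream step that receives the id and content of each changed document. */
    public void onChange(BiConsumer<String, String> listener) {
        listeners.add(listener);
    }

    /** Stores or updates a document, then notifies every registered downstream step immediately. */
    public void save(String id, String content) {
        documents.put(id, content);
        for (BiConsumer<String, String> listener : listeners) {
            listener.accept(id, content);
        }
    }

    /** Number of distinct documents currently stored. */
    public int size() {
        return documents.size();
    }

    public static void main(String[] args) {
        ContentRepository repository = new ContentRepository();
        List<String> published = new ArrayList<>();
        // The "publishing" step here only records what it would publish.
        repository.onChange((id, content) -> published.add(id + " -> " + content));
        repository.save("term-1", "first draft");
        repository.save("term-1", "approved text");
        System.out.println(published);
    }
}
```

Each call to `save` reaches the publishing step at once, in contrast to the periodic batch runs of the current system.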
MongoDB capabilities can help us enhance our system by improving the Content Processing and Content Publishing layers, thereby reducing the overall time consumed. The model looks like this:
Implementing this model / solution eliminates the need for the RDBMS [Oracle], file servers, some background copy jobs, etc. MongoDB acts as the central repository for the Content Processing and Content Publishing layers. Once content is processed, it is stored in the central repository, and the publishing layer can pick up the approved content and start publishing, which saves a considerable amount of time.
The new model allows the system to publish the latest content changes to end customers as they occur instead of publishing periodically.
As part of this phase, we migrated part of the existing content [270,000 xml files and 1,500,000 lexicon terms]. After migration, we took a process that runs for < 1 hour in our existing system. We first base-lined the content generation processing time on the current system and then measured it with the new system in place.
HUGE CUT-DOWN IN CONTENT PROCESSING TIME
The metrics below show the time consumed at each stage of Content Processing before and after replacing the RDBMS with NoSQL [MongoDB].
MongoDB clearly improved system performance, cutting the total processing time by a factor of ≈ 5.
Based on the results obtained, the overall system performance is improved by 60%.
Content processing time for the sample process (in minutes):
Stage                 Before    After
Extraction Time       21        1.22
Mastering Time        5         1.35
Total Process Time    28        5
[Figure: effort in hours per CMS phase. Before: 441, 1012, 552. After: 300, 409, 300]
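The speed-up factors behind the processing-time chart can be worked out from the measured minutes. The sketch below assumes the chart values pair up as Extraction 21 to 1.22, Mastering 5 to 1.35 and Total 28 to 5, in the order they appear. `SpeedUp` and `factor` are illustrative names.

```java
public class SpeedUp {
    /** Returns how many times faster the "after" run is: before / after. */
    static double factor(double beforeMinutes, double afterMinutes) {
        if (beforeMinutes <= 0 || afterMinutes <= 0) {
            throw new IllegalArgumentException("durations must be positive");
        }
        return beforeMinutes / afterMinutes;
    }

    public static void main(String[] args) {
        String[] stages = {"Extraction", "Mastering", "Total process"};
        double[] before = {21, 5, 28};     // minutes on the existing system (from the chart)
        double[] after = {1.22, 1.35, 5};  // minutes with MongoDB in place (from the chart)
        for (int i = 0; i < stages.length; i++) {
            System.out.printf("%-14s %5.2f -> %5.2f min, speed-up x%.1f%n",
                    stages[i], before[i], after[i], factor(before[i], after[i]));
        }
    }
}
```

For the total process time this gives 28 / 5 = 5.6, in line with the ≈ 5 times figure quoted for this sample process.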
VERY QUICK IN FETCHING CONTENT
We also studied the data fetch process. We wrote a query for the RDBMS [Oracle] and one for MongoDB that simply fetch the content from the database and count the number of records. Here is the result:
Oracle Query:
SELECT B.XML_DATA FROM TABLE B WHERE B.NODE_ID=? AND B.CURRENTPUB_FLAG=? AND B.ACTIVITY_STATUS=?
MongoDB Query:
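// The field-name constants (AU_TYPE etc.) and `collection` are defined elsewhere in the application.
BasicDBObject queryObject = new BasicDBObject(); // com.mongodb.BasicDBObject (type assumed)
queryObject.put(AU_TYPE, "xxxxx");
queryObject.put(ACTIVITY_STATUS, "ACTIVE");
queryObject.put(CURRENTPUB_FLAG, "T");
DBCursor cursor = collection.find(queryObject).addOption(Bytes.QUERYOPTION_NOTIMEOUT).addOption(Bytes.QUERYOPTION_AWAITDATA);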
This query [in both SQL and NoSQL] is expected to fetch ≈ 110k records from the database.
Our study showed that MongoDB is ≈ 16 times faster than Oracle in fetching the content from the database.
[Figure: query response time in minutes. Before (Oracle): 19. After (MongoDB): 1.2]
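One way a fetch-and-count comparison of this kind could be timed is sketched below. `FetchTimer` and `Result` are hypothetical names, and the in-memory list in `main` only stands in for a database result. This is not the harness used in the study. In a real comparison the supplier would wrap the Oracle or the MongoDB call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class FetchTimer {
    /** Outcome of one timed fetch: how many records came back and how long it took. */
    static final class Result {
        final int recordCount;
        final long elapsedNanos;

        Result(int recordCount, long elapsedNanos) {
            this.recordCount = recordCount;
            this.elapsedNanos = elapsedNanos;
        }
    }

    /** Runs the fetch, iterates over every record so that lazy results are fully read, and times the whole run. */
    static Result time(Supplier<Iterable<?>> fetch) {
        long start = System.nanoTime();
        int count = 0;
        for (Object ignored : fetch.get()) {
            count++;
        }
        return new Result(count, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in data set of roughly the size mentioned in the text (about 110k records).
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 110_000; i++) {
            records.add("record-" + i);
        }
        Result result = time(() -> records);
        System.out.println(result.recordCount + " records in " + result.elapsedNanos / 1_000_000.0 + " ms");
    }
}
```

Counting inside the timed section matters because a result that is read lazily does no real work until it is iterated.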
LOW OPERATIONAL-COSTS
To compare the operational costs of MongoDB and RDBMS [Oracle], we gathered metrics from 10gen [the MongoDB company]. The cost comparison graph shows that we can save ≈ 70% of the up-front cost and ≈ 64% of the ongoing cost in comparison with the RDBMS [Oracle].
Source: http://info.10gen.com/rs/10gen/images/10gen.TCO%20-%20MongoDB%20vs.%20Oracle.pdf
3.2 COMPARISON BETWEEN CURRENT CMS AND NOSQL IN CMS
As part of this case study / POC, we used our research and the results obtained to prepare a chart comparing the legacy CMS and NoSQL in CMS factor by factor. NoSQL in CMS outperformed the existing CMS and will be able to achieve our goal: a Real-Time Content Management System.
Cost comparison from the LOW OPERATIONAL-COSTS graph (in $k):
Cost                     RDBMS    MongoDB
Up Front Cost            820      166
Ongoing Cost 1st Year    287      106
Ongoing Cost 3rd Year    860      317
4. BENEFITS
The benefits of implementing the solution / new model are:
Better performance, with no additional caching mechanism required
Lower operational / maintenance costs, as the NoSQL DB is open-source
Real-time content delivery from the CMS rather than periodic delivery
Elastic DB Scaling-Out
A smooth transition from RDBMS to MongoDB
A good fit for an Agile development process
5. CONCLUSION
While the IT world is focused on Big Data, analytics and intelligent systems, and Big Data has achieved a remarkable breakthrough in real-time data processing, we must evolve to see the opportunities within the applications of existing customers and provide consultancy to those customers. It is crucial to understand the customer's business and the impact of Big Data on it, since Big Data technologies are soon going to sweep any market that deals with bulk content.
A capable NoSQL database is required to scale with present market requirements, and integrating MongoDB with analytics tools would increase its value further. Expanding this analysis into other domains under the Big Data umbrella would therefore bring further business benefits.
We carried out an extensive study on a real-time product, migrating the content of a legacy Content Management System, which is already voluminous and will obviously grow at an accelerating rate. MongoDB, an open source NoSQL database maintained by 10gen, is a front-runner among Big Data technologies and has already established a major footprint. After evaluating Hadoop, Cassandra, CouchDB, VoltDB, Hive and MongoDB, MongoDB came out as the best fit for our customer's business requirements. The results are very impressive: the performance improvement is substantial, and the break-even analysis in Section 3 projects that the investment will be recovered in ≈ 3 years. The long-term cost benefits are considerable.
Aligned with the CSC business strategy to market Big Data consultancy and expertise, we explored one branch of it: the NoSQL database MongoDB. Reduced infrastructure cost, increased profitability, improved performance and high scalability are direct benefits that address the challenges posed by the accelerating growth in data. The massively parallel processing capabilities of NoSQL put information at customers' fingertips. The need of the hour is to give customers what they want rather than asking them to find something in what vendors can sell. Big Data is there to make this happen.
The risks that stand in the way of a customer changing technology are reliability, scalability, upfront investment and the time needed to set up the environment. These risks can be turned into opportunities by any service-based company. To change customer perception of reliability and scalability, numerous case studies and POCs tested on real-time data are now readily available, letting customers experience Big Data results for themselves. As for investment concerns, MongoDB has almost zero upfront costs. Moreover, developing a reusable or customizable framework that readily sets up NoSQL environments for customers brings more business and hence revenue. It is another great opportunity: developing the framework is a one-time investment, but it becomes a product to sell and a source of further business.
Big Data is a buzzword today, and NoSQL databases are already capturing the market. To stay ahead of the race, we should seize the business opportunity before clients identify the need for change. High-performance open source technologies such as Big Data and NoSQL platforms definitely deserve attention.