CIRCUIT – An Adobe Developer Event Presented by ICF Interactive
Monitoring AEM - Going above and beyond CPU,
Disk, and Memory
Michael Chan ICFI Interactive
Introduction – Who Am I
• Michael Chan, Systems Engineer & Architect for ICFI Interactive Managed Services
• Former Java & C developer
• With past experience in
  – Unix security
  – Network monitoring
  – Systems (network, storage, server) integration
  – Ecommerce
• Primary responsibilities at ICFI (among others)
  – Build out systems infrastructure, including systems automation, logging, and monitoring
  – Enable engineers to quickly assess and respond to systems issues
Purpose of session
Session will cover:
• Systems monitoring concepts
• Practical ideas and examples on how to monitor your website and AEM stack
• Using data to make correlations for root-cause analysis
Session will not cover:
• Which monitoring software to use
• How to implement x or y feature in your monitoring software
• What alerting strategies you should use
Goals of systems monitoring
• Maintain site availability
  – Can users access the site?
• Identify performance issues
  – Are users waiting too long?
• Troubleshoot problems
  – How do I identify root cause?
• Identify long-term trends
  – Is the application slowing down?
  – Do we need faster hardware?
Monitoring tools out there (not exhaustive)
Open Source (free!)
• Nagios
• Icinga
• Zabbix
SaaS
• Application-performance focused
  – AppDynamics
  – New Relic
• Boundary
• Datadog
Monitoring software considerations
What I have found most important
• Easy to use
  – Has a convenient GUI
  – Easy to add servers, applications
• Easy to view and interpret data
  – Need to be able to view data and quickly make correlations
• Extensible
  – Easy to customize, e.g. monitor Publisher listening on port 4506 instead of 4503
  – Support for plugins and especially custom scripts, necessary for application-specific monitoring
• Other considerations
  – Can the setup configs be version controlled in Git?
  – Is there an API for the monitoring system, to create/modify configs?
Tip: everyone’s needs are different; use what makes sense for you!
Basic monitoring – CPU, network, disk
Good questions to ask when monitoring these
• CPU Load Average
  – What percentage of CPU is the application utilizing?
  – Is there surplus CPU capacity left?
• Network Statistics – e.g. Bytes in/out, Packets in/out
  – How much traffic are our servers receiving?
  – Do any network spikes correlate with slower application performance?
• Disk (IOPS, throughput)
  – How much is the application utilizing the disk?
  – Is the application hitting any Disk I/O thresholds?
Tip: benchmark your Network and Disk I/O thresholds to discover your hardware limitations.
Note: AEM may hit CPU limits even before CPU utilization reaches 100%. Threads often wait on another thread’s operation to complete, and until it does, they stay waiting or blocked. Slowness can therefore begin at 50%-75% CPU utilization.
Simple web monitoring – must-haves
HTTP code check
mc-macbook-2:~ mc$ curl -I http://www.citytechinc.com/us/en.html
HTTP/1.1 200 OK
Content-based checks
mc-macbook-2:~ mc$ curl -s http://www.citytechinc.com/us/en.html | grep 'CITYTECH, Inc. all rights reserved'
© 2015 CITYTECH, Inc. all rights reserved
Content-based checks with timeout
mc-macbook-2:~ mc$ curl --max-time 30 -s http://www.citytechinc.com/us/en.html | grep 'CITYTECH, Inc. all rights reserved'
© 2015 CITYTECH, Inc. all rights reserved
Response time check
mc-macbook-2:~ mc$ time curl -s http://www.citytechinc.com/us/en.html | grep 'CITYTECH, Inc. all rights reserved' >/dev/null 2>&1
real 0m0.195s
user 0m0.007s
sys 0m0.006s
Tip: install content-based checks on each Publish and Dispatcher instance. That way you can quickly detect which instance has a failure.
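The four curl checks above (status code, content, timeout, response time) can be folded into one script. The following is a minimal sketch, not a definitive implementation: the function names `evaluate` and `check_page` and the 5-second slowness threshold are assumptions made for illustration. Note that the `timeout` of `urllib.request.urlopen` applies per socket operation, so it is not an exact equivalent of curl's `--max-time`.

```python
import time
import urllib.request


def evaluate(status, body, elapsed_s, expected_text, max_elapsed_s):
    """Apply the status, content and response-time checks; no network access."""
    if status != 200:
        return False, f"unexpected HTTP status {status}"
    if expected_text not in body:
        return False, "expected text not found in page"
    if elapsed_s > max_elapsed_s:
        return False, f"slow response: {elapsed_s:.3f}s"
    return True, f"OK in {elapsed_s:.3f}s"


def check_page(url, expected_text, timeout_s=30.0, max_elapsed_s=5.0):
    """Fetch url once and run all checks (hypothetical helper, thresholds are assumptions)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # URLError, HTTPError (4xx/5xx) and socket timeouts are all OSError subclasses
        return False, f"request failed: {exc}"
    return evaluate(status, body, time.monotonic() - start, expected_text, max_elapsed_s)
```

Calling `check_page("http://www.citytechinc.com/us/en.html", "CITYTECH, Inc. all rights reserved")` against each Publish and Dispatcher instance returns an `(ok, reason)` pair that a monitoring plugin could map to its own exit codes.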
Simple web monitoring – Apache performance stats
Apache, mod_status module
• Provides performance statistics
• Note: the status path, e.g. /server-status, should not be accessible from the public internet
root@Client Prod CQ Disp 1a i-a678d2db:~# curl -s http://localhost/server-status | html2text|more
****** Apache Server Status for localhost ******
Server Version: Apache/2.2.15 (Unix) Communique/4.1.2 mod_ssl/2.2.15 OpenSSL/1.0.1e-fips
Server Built: Jul 18 2014 02:31:29
====================================================================
Current Time: Wednesday, 29-Jul-2015 01:37:00 GMT
Restart Time: Sunday, 26-Jul-2015 03:39:30 GMT
Parent Server Generation: 4
Server uptime: 2 days 21 hours 57 minutes 30 seconds
Total accesses: 3430869 - Total Traffic: 114.6 GB
CPU Usage: u43.79 s19.41 cu0 cs0 - .0251% CPU load
13.6 requests/sec - 477.1 kB/second - 35.0 kB/request
41 requests currently being processed, 21 idle workers
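For scripted monitoring, mod_status also serves a machine-readable variant at `/server-status?auto`, which returns plain `Key: value` lines and avoids the html2text step. The sketch below parses that format and computes worker utilization; the function names are made up for illustration, and it assumes the `BusyWorkers` and `IdleWorkers` fields are present in the response.

```python
def parse_server_status(text):
    """Parse the 'Key: value' lines of mod_status ?auto output into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def worker_utilization(stats):
    """Fraction of Apache workers that are busy, between 0.0 and 1.0."""
    busy = int(stats["BusyWorkers"])
    idle = int(stats["IdleWorkers"])
    return busy / (busy + idle)
```

With the numbers from the slide (41 busy, 21 idle workers) the utilization comes out at roughly two thirds, which is the kind of value worth graphing next to response times.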
Web monitoring – STM / RUM – nice to have
Synthetic Transaction Monitoring
• (also known as active monitoring) is website monitoring that is done using a web browser emulation or scripted recordings of web transactions.
• Examples
  – Selenium
  – Neustar
  – Keynote
• Advantages
  – Repeatable process
    • e.g. can ensure that the process of “login, add product to shopping cart, checkout” works between code releases
  – Can be used as a control
  – Cheap
• Disadvantages
  – Monitors only what you decided to test against
  – Not as thorough as RUM
Web monitoring – RUM / STM – nice to have
Real User Monitoring
• (RUM) is a passive monitoring technology that records all user interaction with a website or client interacting with a server or cloud-based application.
• Examples
  – Google Analytics
  – New Relic
  – Keynote
  – Many, many more
• Advantages
  – Real-user “testing” data
  – Monitoring for issues as they occur
  – Identifies browser-related issues
• Disadvantages
  – Expensive
  – Too much information (information overload)
Adobe WEM monitoring – basic checks for Author, Publisher
Ports to monitor (are they accessible?)
• Author – 4502
• Publisher – 4503
Suggested pages to monitor
• Sling login page - /system/sling/cqform/defaultlogin.html
  – Should always work!
  – Response times almost always the same
  – If the Sling login page is up but, for example, the homepage is not, this can indicate a content or code-related issue
  curl -s http://localhost:4503/system/sling/cqform/defaultlogin.html | grep QUICKSTART_HOMEPAGE
  <!-- QUICKSTART_HOMEPAGE - (string used for readyness detection, do not remove) -->
• Homepage, important landing pages
  – If Publisher hosts multiple farms & host-specific Sling mappings are used, you may need to pass a Host header:
  curl -H "Host: www.citytechinc.com" http://localhost:4503/us/en.html
  – The above example is another reason why a customizable monitoring solution is needed
    • Nagios has a check_http plugin that supports sending Host headers with requests
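The port and Host-header checks above can be sketched in a few lines. This is an illustrative example only: `port_open` and `request_with_host` are hypothetical helper names, and the 3-second connect timeout is an assumption.

```python
import socket
import urllib.request


def port_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def request_with_host(url, host_header):
    """Build a GET request that overrides the Host header, like curl -H 'Host: ...'."""
    return urllib.request.Request(url, headers={"Host": host_header})
```

For example, `port_open("localhost", 4503)` covers the Publisher port check, and `urllib.request.urlopen(request_with_host("http://localhost:4503/us/en.html", "www.citytechinc.com"), timeout=30)` mirrors the curl command with the Host header.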
Adobe WEM monitoring – error.log, critical errors
Files to monitor
• error.log, keywords (AEM 5.5, 5.6, although some may still be applicable to 6.x)
  – critical errors
    • OutOfMemoryMonitor CQ shutting down
    • StackOverflowError
    • Maximum threads reached
    • Java OutOfMemoryErrors, e.g.
      – java.lang.OutOfMemoryError: unable to create new native thread
    • too many open files
  – Non-critical errors (error count is useful)
    • RecursionTooDeepException
    • Failed to mmap tar file / java.lang.OutOfMemoryError: Map failed
Adobe WEM monitoring – error.log, repository related
Files to monitor
• error.log, repository-related keywords
  – critical errors
    • tar files read-only
  – Non-critical errors (error count is useful, with alarm set when threshold is exceeded)
    • failed to retrieve state of(.+)node
    • failed to retrieve state of intermediary node
    • Failed to read bundle
    • Repository error during page import
    • Unable to create version
    • lucene(.+)Unknown(.+)node
    • lucene(.+)query result node
Tip: When encountering important repository errors, make sure to update your monitoring software to detect them!
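The keyword lists from the two error.log slides can be turned into a small scanner: critical keywords are reported immediately, non-critical ones are counted and compared against a threshold. This is a sketch under assumptions: the function name `scan` and the default threshold of 10 are made up, and the patterns are matched case-sensitively exactly as written on the slides. The negative lookahead keeps "OutOfMemoryError: Map failed" in the non-critical group, as the slides do.

```python
import re
from collections import Counter

# Keywords taken from the slides; treated as regular expressions.
CRITICAL = [
    r"OutOfMemoryMonitor CQ shutting down",
    r"StackOverflowError",
    r"Maximum threads reached",
    r"java\.lang\.OutOfMemoryError(?!: Map failed)",
    r"too many open files",
    r"tar files read-only",
]
NON_CRITICAL = [
    r"RecursionTooDeepException",
    r"Failed to mmap tar file",
    r"java\.lang\.OutOfMemoryError: Map failed",
    r"failed to retrieve state of(.+)node",
    r"Failed to read bundle",
    r"Repository error during page import",
    r"Unable to create version",
    r"lucene(.+)Unknown(.+)node",
    r"lucene(.+)query result node",
]


def scan(lines, non_critical_threshold=10):
    """Return (critical lines, non-critical counts, counts above the threshold)."""
    critical_hits = []
    counts = Counter()
    for line in lines:
        for pattern in CRITICAL:
            if re.search(pattern, line):
                critical_hits.append(line.rstrip())
                break
        else:
            # Only reached when no critical pattern matched this line.
            for pattern in NON_CRITICAL:
                if re.search(pattern, line):
                    counts[pattern] += 1
                    break
    over = {p: n for p, n in counts.items() if n > non_critical_threshold}
    return critical_hits, counts, over
```

Running `scan(open("error.log"))` every few minutes on the newly written part of the log gives both the alert list and the error counts to graph.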
Adobe WEM monitoring – access.log
Files to monitor
• access.log
  – HTTP code frequency, e.g.
    – 200 Success
    – 302 Redirect
    – 403 Forbidden
    – 404 Not Found
    – 500 Internal Server Error
# tail access.log
127.0.0.1 - anonymous 04/Aug/2015:20:30:54 +0000 "GET /us/en.html HTTP/1.1" 200 22572 "-" "-"
127.0.0.1 - anonymous 04/Aug/2015:20:30:55 +0000 "GET /content/citytech/global/en.html HTTP/1.1" 200 22598 "-" "curl agent, CTMSP monitoring"
Tip: put these stats into a graph in order to correlate trends, possible page issues, or anomalies (RUM and ELK do this excellently)
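Counting status codes from access.log lines in the format shown above takes only a regular expression and a counter. A minimal sketch, assuming the quoted request line is always followed by the three-digit status code; `status_counts` is a name chosen for this example.

```python
import re
from collections import Counter

# Matches: "METHOD /path PROTOCOL" STATUS
ACCESS_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def status_counts(lines):
    """Count HTTP status codes found in access.log lines."""
    counts = Counter()
    for line in lines:
        match = ACCESS_LINE.search(line)
        if match:
            counts[match.group("status")] += 1
    return counts
```

Feeding the counts per interval into a graphing system gives the HTTP code frequency the slide describes.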
Adobe WEM monitoring – access.log, cont.
Files to monitor
• access.log
  – Cache-busting requests
    • Contains query strings, e.g.
      – http://www.citytechinc.com/us/en.html?hi=test
    • Extensionless, e.g.
      – GET /athletes/athletes.34360.html/career
  – Extensions
    • .js, .css
    • Images - .bmp, .jpg, .jpeg, .png
Tip: calculate the percentage of cache-busting requests over time as a baseline to compare against.
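The baseline from the tip can be computed from the request paths in access.log. The sketch below uses only the two criteria named on the slide (a query string, or a last path segment without an extension); whether such a request really bypasses the Dispatcher cache depends on the Dispatcher configuration, so treat this as an approximation. The function names are assumptions.

```python
import posixpath
from urllib.parse import urlsplit


def is_cache_busting(request_path):
    """True if the path has a query string or its last segment has no extension."""
    parts = urlsplit(request_path)
    if parts.query:
        return True
    last_segment = parts.path.rsplit("/", 1)[-1]
    return posixpath.splitext(last_segment)[1] == ""


def cache_busting_percentage(paths):
    """Percentage (0 to 100) of paths classified as cache-busting."""
    paths = list(paths)
    if not paths:
        return 0.0
    return 100.0 * sum(is_cache_busting(p) for p in paths) / len(paths)
```

Recording this percentage per hour or per day gives a baseline, so a sudden jump after a release stands out.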
Adobe WEM monitoring – request.log
Files to monitor
• request.log
  – Looks like
root@Citytech Prod CQ Pub 1c i-cbdd5ba9:/var/log/cq5# tail request.log
26/Jul/2015:01:10:23 +0000 [2774404] -> GET /content/citytech/global/en.html HTTP/1.1
26/Jul/2015:01:10:23 +0000 [2774404] <- 200 text/html 229ms
26/Jul/2015:01:10:25 +0000 [2774405] -> GET /system/sling/cqform/defaultlogin.html HTTP/1.1
26/Jul/2015:01:10:25 +0000 [2774405] <- 200 text/html 3ms
26/Jul/2015:01:10:28 +0000 [2774407] -> GET /us/en.html HTTP/1.1
26/Jul/2015:01:10:28 +0000 [2774407] <- 200 text/html 222ms
  – Can obtain list of response times with rlog.jar
java -jar /opt/adobe-cq5.6.1/publish/crx-quickstart/opt/helpers/rlog.jar -n 50 -xdev /var/log/cq5/request.log
Tip: create a top 100 list of slowest page requests over 5 minute intervals in order to spot poorly performing pages
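If rlog.jar is not at hand, the same top-N list can be built by pairing the `->` (request) and `<-` (response) lines of request.log on the request ID in square brackets. A minimal sketch, assuming the line layout shown above; `slowest_requests` is a name chosen for this example.

```python
import re

REQUEST = re.compile(r"\[(\d+)\] -> (\S+) (\S+)")
RESPONSE = re.compile(r"\[(\d+)\] <- (\d{3}) .*?(\d+)ms\s*$")


def slowest_requests(lines, top_n=100):
    """Return up to top_n tuples (milliseconds, status, method, path), slowest first."""
    pending = {}   # request id -> (method, path), waiting for the response line
    finished = []
    for line in lines:
        match = REQUEST.search(line)
        if match:
            pending[match.group(1)] = (match.group(2), match.group(3))
            continue
        match = RESPONSE.search(line)
        if match and match.group(1) in pending:
            method, path = pending.pop(match.group(1))
            finished.append((int(match.group(3)), int(match.group(2)), method, path))
    return sorted(finished, reverse=True)[:top_n]
```

Running this over the lines written during the last 5 minutes yields the list described in the tip.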
Adobe WEM monitoring – request.log
07/24/2015 01:50:27 PM
------------- Fri Jul 24 18:50:26 GMT 2015 --------------
*Info * Parsed 1135 requests.
*Info * Time for parsing: 72ms
*Info * Time for sorting: 3ms
*Info * Total Memory: 110mb
*Info * Free Memory: 109mb
*Info * Used Memory: 1mb
------------------------------------------------------
7165ms 24/Jul/2015:18:48:16 +0000 200 GET /en/home/products/show_products.html?tag=bikes-us%3Abrand%2Ftrek&tag=bikes-us%3Aproduct%2Fulocks%2Fshimano text/html
7020ms 24/Jul/2015:18:46:54 +0000 200 GET /en/home/products/show_products.html?tag=trek-americas%3Abrand%2Ftrek&tag=trek-americas%3Aproduct%2Fulocks%2FnonLocking text/html
6643ms 24/Jul/2015:18:46:54 +0000 200 GET /en/home/style/show_products.html?tag=bikes-us%3Abrand%2Ftrek&tag=bikes-us%3AstyleCollection%2Fcamelot text/html
6001ms 24/Jul/2015:18:46:54 +0000 200 GET /en/home/style/show_products.html?tag=bikes-us%3Abrand%2Ftrek&tag=bikes-us%3AstyleCollection%2Fbrookshire text/html
4979ms 24/Jul/2015:18:46:53 +0000 200 GET /en/home/products/show_products.html?tag=trek-americas%3Abrand%2Ftrek&tag=trek-americas%3Aproduct%2Fhandlesets%2FtwoSidesKeyed text/html
4074ms 24/Jul/2015:18:48:16 +0000 200 GET /en/home/products/show_products.html?tag=bikes-us%3Abrand%2Ftrek&tag=bikes-us%3Aproduct%2Fknobs%2FnonLocking text/html
3357ms 24/Jul/2015:18:46:39 +0000 200 GET /en/home/products/show_products.html?tag=bikes-us%3Abrand%2Ftrek&tag=bikes-us%3Aproduct%2Fshifters&tag=bikes-us%3Aproduct%2Fshifters%2FoneSideKeyed text/html
1031ms 24/Jul/2015:18:46:02 +0000 200 GET /en/home/products/show_products.html?tag=bikes-us:brand/trek&tag=bikes-us:product/shifters&tag=bikes-us:product/shifters/oneSideKeyed text/html
925ms 24/Jul/2015:18:46:44 +0000 200 GET /content/bikes-us/en/home/search.html?searchQuery=user+and+alarm+programming text/html
818ms 24/Jul/2015:18:46:36 +0000 200 GET /en/home/style/design-guides/style-evolution-2014.html text/html
528ms 24/Jul/2015:18:49:38 +0000 200 GET /en/home/products/F51ACCFFF.html?bck=@@bikes-us:brand/trek@@bikes-us:product/ulocks/keyedLock@@bikes-us:product/ulocks text/html
456ms 24/Jul/2015:18:47:00 +0000 200 GET /content/dam/bikes-us/product-images/F10%20%28F75%29/F10ACC622ADD.jpg/_jcr_content/renditions/cq5dam.thumbnail.319.319.png image/png
361ms 24/Jul/2015:18:49:38 +0000 200 GET /content/bikes-us/en/home/search.html?searchQuery=AL+SERIES text/html
306ms 24/Jul/2015:18:47:01 +0000 200 GET /content/dam/bikes-us/product-images/F10%20%28F75%29/F10BRW625GRW.jpg/_jcr_content/renditions/cq5dam.thumbnail.319.319.png image/png
292ms 24/Jul/2015:18:49:42 +0000 200 GET /en/home/faq.html?id=42 text/html
272ms 24/Jul/2015:18:47:01 +0000 200 GET /content/dam/bikes-us/product-images/BE469NX/BE469NXCEN626.jpg/_jcr_content/renditions/cq5dam.thumbnail.319.319.png image/png
244ms 24/Jul/2015:18:47:27 +0000 200 GET /content/bikes-us/en/home.html text/html
236ms 24/Jul/2015:18:47:02 +0000 200 GET /content/dam/bikes-us/product-images/F10%20%28F75%29/F10PLY716GRW.jpg/_jcr_content/renditions/cq5dam.thumbnail.319.319.png image/png
Adobe WEM monitoring – thread count
WEM request thread count
• Why is this important?
  – The default maximum number of request threads is 200
  – Hitting the maximum can indicate a spike in traffic or application slowness
• How do I view?
System console: http://i-cbdd5ba9.citytech-prod.ctmsp.com:4503/system/console/status-Threads
Thread #768010/10.87.66.63 [1437866326422] <closed> [priority=5, alive=true, daemon=true, interrupted=false, loader=cqse-httpservice [22]]
Thread #2228/127.0.0.1 [1437868798545] GET /us/en.html HTTP/1.1 [priority=5, alive=true, daemon=true, interrupted=false, loader=org.apache.sling.commons.classloader.impl.ClassLoaderFacade@2d58350a]
Thread #2196/127.0.0.1 [1437868798613] GET /us/en.html HTTP/1.1 [priority=5, alive=true, daemon=true, interrupted=false, loader=org.apache.sling.commons.classloader.impl.ClassLoaderFacade@2d58350a]
Thread #768030/127.0.0.1 [1437868798881] GET /us/en.html HTTP/1.1 [priority=5, alive=true, daemon=true, interrupted=false, loader=org.apache.sling.commons.classloader.impl.ClassLoaderFacade@2d58350a]
Thread #937774/127.0.0.1 [1437868798896] GET /us/en.html HTTP/1.1 [priority=5, alive=true, daemon=true, interrupted=false, loader=org.apache.sling.commons.classloader.impl.ClassLoaderFacade@2d58350a]
Thread #767978/127.0.0.1 [1437868798909] GET /us/en.html HTTP/1.1 [priority=5, alive=true, daemon=true, interrupted=false, loader=org.apache.sling.commons.classloader.impl.ClassLoaderFacade@2d58350a]
Thread #767927/127.0.0.1 [1437868802472] <closed> [priority=5, alive=true, daemon=true, interrupted=false, loader=cqse-httpservice [22]]
Thread #767940/64.6.160.57 [1437868802408] GET /system/console/status-Threads HTTP/1.1 [priority=5, alive=true, daemon=true, interrupted=false, loader=cqse-httpservice [22]]
Thread #767982/64.6.160.57 [1437868802410] <parse> [priority=5, alive=true, daemon=true, interrupted=false, loader=cqse-httpservice [22]]
Thread #87/ActivityServiceImpl [priority=5, alive=true, daemon=true, interrupted=false, loader=java.net.URLClassLoader@42472d48]
Thread #109/Adobe Granite Offloading job cloner queue processor [priority=5, alive=true, daemon=true, interrupted=false, loader=java.net.URLClassLoader@42472d48]
• Obtaining request thread count (easy curl command)
curl -s -u 'admin:_insert_password_here' http://localhost:4503/system/console/status-Threads | grep -E 'GET|POST' | wc -l
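The grep above also counts the monitoring request itself (the GET for /system/console/status-Threads) and any other line that happens to contain GET or POST. A slightly stricter count can be sketched as follows, assuming the thread line layout shown on the slide; the function name and the `ignore_paths` parameter are made up for this example.

```python
import re

# Thread #<id>/<client> [<timestamp>] GET|POST <path> ...
REQUEST_THREAD = re.compile(r"^Thread #\d+/\S+ \[\d+\] (GET|POST) (\S+)")


def count_request_threads(status_text, ignore_paths=("/system/console/status-Threads",)):
    """Count threads currently serving a GET or POST, skipping the ignored paths."""
    count = 0
    for line in status_text.splitlines():
        match = REQUEST_THREAD.match(line.strip())
        if match and match.group(2) not in ignore_paths:
            count += 1
    return count
```

The result can be compared against the default maximum of 200 request threads to alert well before the limit is reached.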
Adobe WEM monitoring – JCR queries
Slow Queries, Popular Queries
• AEM built-in
• Displays top 15 slowest JCR queries
• Example:
/usr/bin/java -jar /usr/local/bin/cmdline-jmxclient-0.10.3.jar - localhost:12345 com.adobe.granite:type=QueryStat SlowQueries
Adobe WEM monitoring – JCR queries
------------- 07/26/2015 01:44:33 +0000 org.archive.jmx.Client SlowQueries: --------------
creationTime: Sun Jul 26 01:40:06 GMT 2015
duration: 2788ms
language: xpath
occurrenceCount: 1
position: 1
statement: /jcr:root/content/trek-us/en/home/products//element(*, cq:Page)[jcr:contains(., '*') and (jcr:content/@cq:template = '/apps/trek-americas/templates/productDetail-page') and ((jcr:content/@cq:tags = 'trek-americas:product/shifters' and jcr:content/@cq:tags = 'trek-americas:product/shifters/nonLocking' and jcr:content/@cq:tags = 'trek-americas:brand/trek'))]

creationTime: Sun Jul 26 01:36:34 GMT 2015
duration: 1766ms
language: xpath
occurrenceCount: 8729
position: 2
statement: /jcr:root/var/eventing/jobs//element(*, slingevent:Job)[jcr:contains(., '/com/day/cq/replication/job') and not(@slingevent:finished)]

creationTime: Sun Jul 26 01:40:33 GMT 2015
duration: 809ms
language: xpath
occurrenceCount: 1
position: 3
statement: /jcr:root/content/trek-us/en/home/products//element(*, cq:Page)[jcr:contains(., '*') and (jcr:content/@cq:template = '/apps/trek-americas/templates/productDetail-page') and ((jcr:content/@cq:tags = 'trek-americas:product/shifters' and jcr:content/@cq:tags = 'trek-americas:product/shifters/nonLocking' and jcr:content/@cq:tags = 'trek-americas:brand/trek'))]

creationTime: Sun Jul 26 01:41:15 GMT 2015
duration: 790ms
language: xpath
occurrenceCount: 1
position: 4
statement: /jcr:root/content/trek-us/en/home/products//element(*, cq:Page)[jcr:contains(., '*') and (jcr:content/@cq:template = '/apps/trek-americas/templates/productDetail-page') and ((jcr:content/@cq:tags = 'trek-americas:brand/trek' and jcr:content/@cq:tags = 'trek-americas:product/ulocks' and jcr:content/@cq:tags = 'trek-americas:product/ulocks/titanium'))]

creationTime: Sun Jul 26 01:40:05 GMT 2015
duration: 782ms
language: xpath
occurrenceCount: 1
position: 5
statement: /jcr:root/content/trek-us/en/home/products//element(*, cq:Page)[jcr:contains(., '*') and (jcr:content/@cq:template = '/apps/trek-americas/templates/productDetail-page') and ((jcr:content/@cq:tags = 'trek-americas:product/levers' and jcr:content/@cq:tags = 'trek-americas:electric/zwave' and jcr:content/@cq:tags = 'trek-americas:product/ulocks/titanium'))] order by jcr:content/content-par/productdetail/@releasedate descending
Tip: By default, the slow query statistic covers all queries since AEM startup. However, it can be reset if you want, for example, 10-minute “summaries” of the slowest queries.
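To alert on slow queries, the SlowQueries output can be parsed into records and filtered by duration. The sketch below assumes the JMX client prints one `field: value` pair per line and that every record starts with `creationTime`; both are assumptions based on the output shown on the slide, and the function names are hypothetical.

```python
FIELDS = {"creationTime", "duration", "language", "occurrenceCount", "position", "statement"}


def parse_slow_queries(text):
    """Split SlowQueries output into a list of dicts, one per query record."""
    records = []
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        key, sep, value = line.partition(": ")
        if sep and key in FIELDS:
            if key == "creationTime":
                records.append({})          # a new record starts here
            if records:
                records[-1][key] = value
                last_key = key
        elif line and records and last_key:
            # Anything else is treated as a wrapped continuation of the previous field.
            records[-1][last_key] += " " + line
    return records


def slower_than(records, min_ms):
    """Keep only records whose duration is at least min_ms milliseconds."""
    return [r for r in records if int(r.get("duration", "0ms").rstrip("ms")) >= min_ms]
```

For instance, `slower_than(parse_slow_queries(output), 1000)` would keep the first two queries from the slide (2788ms and 1766ms).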
Adobe WEM monitoring – misc.
Other possible things to monitor
• Running workflows
• Bundle status - installed, active
• Replication queues - total, blocked
Tip: data for all of the above can be retrieved via curl!
JVM monitoring – heap usage
Heap usage
• Useful for viewing AEM memory usage and GC issues
• Can be obtained via JMX
  – Example using free cmdline-jmxclient.jar tool:
  # java -jar /usr/local/bin/cmdline-jmxclient.jar - i-d4bb64dd.ct-prod.ctmsp.com:12345 'java.lang:name=PS Old Gen,type=MemoryPool' Usage
  07/26/2015 20:12:20 +0000 org.archive.jmx.Client Usage:
  committed: 4462215168
  init: 894828544
  max: 14316601344
  used: 4158743792
• Also viewable via jmap command
  # jmap -heap 31470
  Attaching to process ID 31470, please wait...
  Debugger attached successfully.
  Server compiler detected.
  JVM version is 20.5-b03
  using thread-local object allocation.
  Parallel GC with 1 thread(s)
  - additional output trimmed -
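The Usage values returned over JMX are raw byte counts, so a monitoring script usually converts them into a percentage before graphing or alerting. A minimal sketch, assuming the `committed`/`init`/`max`/`used` lines shown above and a defined (positive) `max`; the function names are chosen for this example.

```python
def parse_usage(text):
    """Parse 'name: bytes' lines of a JMX MemoryUsage dump into a dict of ints."""
    usage = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep and value.strip().isdigit():
            usage[key] = int(value)
    return usage


def heap_used_percent(usage):
    """Used memory as a percentage of the pool maximum."""
    return 100.0 * usage["used"] / usage["max"]
```

With the values from the slide, the old generation is about 29% full (4158743792 of 14316601344 bytes).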
JVM monitoring – heap usage issues
JVM Monitoring – GC pause times
Why monitor JVM pause times?
• These are “stop-the-world” events where the application is unresponsive due to JVM garbage collection
• Sometimes JVM garbage collection is not successful, and thus constant GCs occur since memory cannot be freed; this incurs serious CPU usage
• Should be monitored since it can be a performance hit
How to monitor?
• Pause times can be added to stdout via JVM options
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
  2015-07-27T18:50:30.212+0000: [Full GC [PSYoungGen: 98121K->0K(6107264K)] [ParOldGen: 6144935K->1561525K(6291456K)] 6243056K->1561525K(12398720K) [PSPermGen: 193509K->193465K(193600K)], 5.7558230 secs] [Times: user=22.98 sys=0.00, real=5.75 secs]
  2015-07-27T18:50:42.432+0000: [GC [PSYoungGen: 5916288K->81734K(5998080K)] 7477813K->1643259K(12289536K), 0.1018320 secs] [Times: user=0.52 sys=0.00, real=0.10 secs]
• Pause times also can be added via:
  -XX:+PrintGCApplicationStoppedTime
  Total time for which application threads were stopped: 0.0001780 seconds
  Total time for which application threads were stopped: 0.0001920 seconds
Tip: Even if you don’t have time to enable monitoring via JMX, at least print GC output to a log file for later analysis when AEM is slowing down!
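Once GC output is written to a log, the pause times can be extracted with two regular expressions. The sketch below assumes the log line formats shown above (one event per line, as produced by the listed JVM options on this JVM generation); newer JVMs use a different unified logging format. The function names and the 1-second threshold are assumptions.

```python
import re

GC_EVENT = re.compile(r"\[(Full GC|GC)\b.*\breal=(\d+\.\d+) secs\]")
STOPPED = re.compile(r"Total time for which application threads were stopped: (\d+\.\d+) seconds")


def gc_pauses(lines):
    """Return a list of (kind, wall-clock seconds) for each GC event line."""
    pauses = []
    for line in lines:
        match = GC_EVENT.search(line)
        if match:
            pauses.append((match.group(1), float(match.group(2))))
    return pauses


def long_pauses(pauses, threshold_s=1.0):
    """Keep only pauses at or above the threshold."""
    return [p for p in pauses if p[1] >= threshold_s]


def total_stopped_seconds(lines):
    """Sum the pauses reported by -XX:+PrintGCApplicationStoppedTime."""
    return sum(float(m.group(1)) for m in map(STOPPED.search, lines) if m)
```

On the two sample events from the slide, only the 5.75-second Full GC exceeds a 1-second threshold, which is exactly the kind of pause worth alerting on.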
Summary
• Monitor all homepage & landing pages, for all individual Publishers and Dispatchers
• Use AEM logs and tools to provide info on AEM status and performance
  – access/error/request logs, rlog.jar, thread status, slow queries page
  – Customize your monitoring to record this data
• Use JMX and verbose GC logging to record JVM memory heap usage, and GC pause times
References
• http://smartbear.com/articles/what-is-real-user-monitoring/
• https://en.wikipedia.org/wiki/Synthetic_monitoring
• https://docs.adobe.com/docs/en/cq/5-6-1/deploying/performance.html
Contact Info: Michael Chan [email protected]