The Mobile Web is Structurally Different
Apoorva Jindal
USC
Chris Crutchfield
MIT
Samir Goel
Google Inc
Ravi Jain
Google Inc
Ravi Kolluri
Google Inc
The Mobile Web?
Web pages designed for consumption on mobile wireless devices: CHTML, XHTML, WML
All other pages are referred to as the fixed web
Becoming more important: better devices, better networks, cheaper plans
Different from the fixed web? Smaller pages, fewer hyperlinks, fewer images
Structurally?
Web graph: pages ↔ nodes, hyperlinks ↔ edges
Properties of this graph: in-degree distribution, out-degree distribution, strongly connected component size distribution, …
Importance: used in basic algorithms to implement search, such as crawling and ranking the web pages
Studied in detail for the fixed web
INFOCOM 2008
Bow-tie Structure [Broder et al 2000]
Model to describe the structure of the fixed web.
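The bow-tie decomposition can be sketched in a few lines. This is a minimal in-memory illustration, not the tooling used in the study: the function names (`reachable`, `largest_scc`, `bow_tie`) are hypothetical, the core is taken to be the largest strongly connected component, and "tendrils" here means every remaining node that is weakly connected to the core (so tubes are counted as tendrils, which is an assumption).

```python
from collections import defaultdict, deque

def reachable(start, adj):
    """Return all nodes reachable from the nodes in `start` by following `adj`."""
    seen = set(start)
    queue = deque(start)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def largest_scc(nodes, fwd, rev):
    """Kosaraju's algorithm with explicit stacks; returns the largest SCC."""
    order, visited = [], set()
    for s in nodes:                      # pass 1: record DFS finishing order
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(fwd[s]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(fwd[v])))
                    break
            else:                        # all successors explored
                order.append(u)
                stack.pop()
    best, assigned = set(), set()
    for s in reversed(order):            # pass 2: sweep the reversed graph
        if s in assigned:
            continue
        comp, stack = {s}, [s]
        assigned.add(s)
        while stack:
            u = stack.pop()
            for v in rev[u]:
                if v not in assigned:
                    assigned.add(v)
                    comp.add(v)
                    stack.append(v)
        if len(comp) > len(best):
            best = comp
    return best

def bow_tie(edges):
    """Split the nodes of a directed edge list into bow-tie components."""
    fwd, rev, und = defaultdict(list), defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
        und[u].append(v)
        und[v].append(u)
        nodes.update((u, v))
    scc = largest_scc(nodes, fwd, rev)
    out = reachable(scc, fwd) - scc      # reachable from the core
    in_ = reachable(scc, rev) - scc      # can reach the core
    weak = reachable(scc, und)           # weakly connected to the core
    return {
        "SCC": scc,
        "IN": in_,
        "OUT": out,
        "TENDRILS": weak - scc - in_ - out,
        "DISCONNECTED": nodes - weak,
    }

example_edges = [("a", "b"), ("b", "c"), ("c", "a"),  # core cycle
                 ("i", "a"), ("c", "o"),              # IN and OUT
                 ("i", "t"),                          # tendril hanging off IN
                 ("x", "y")]                          # disconnected pair
parts = bow_tie(example_edges)
```

Pages with no links at all never appear in an edge list, so a real measurement would need the node set supplied separately.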
Methodology
Google’s mobile web index, June 2007: CHTML and XHTML + WML
Webbase 2001
Google’s fixed web index, July 2007
In-degree and out-degree distributions: computed with tools based on MapReduce; [Clauset et al 2006] is used to infer the power-law coefficient
Bow-tie structure properties: determined with COSIN tools [Donato et al 2004]
Limitation: cannot handle the Google fixed web 2007 graph at page level, so all pages in a domain are collapsed to one node and tools based on MapReduce are used
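The degree-distribution step can be illustrated with a small single-machine sketch. The study used MapReduce-based tools; the code below only shows the arithmetic. The function names are hypothetical, and the estimator is the standard maximum-likelihood approximation for discrete power-law data, alpha ≈ 1 + n / Σ ln(x_i / (x_min − 0.5)). Whether this matches the exact procedure used in the study is an assumption, and the cutoff `x_min` is simply passed in by the caller here rather than fitted.

```python
import math
from collections import Counter

def degree_sequences(edges):
    """Count in-degree and out-degree per node from a directed edge list."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg

def power_law_alpha(degrees, x_min=1):
    """Approximate MLE of the power-law exponent for degrees >= x_min."""
    tail = [d for d in degrees if d >= x_min]
    if x_min < 1 or not tail:
        raise ValueError("need x_min >= 1 and at least one degree >= x_min")
    # Each term is positive because d >= x_min > x_min - 0.5.
    log_sum = sum(math.log(d / (x_min - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum

in_deg, out_deg = degree_sequences([("a", "b"), ("a", "c"), ("b", "c")])
alpha_out = power_law_alpha(out_deg.values(), x_min=1)
```

A larger exponent means the distribution falls off faster, which is how the coefficient tables are read.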
Page-level Graph properties – Degree Distributions
Corpus       Avg Node Degree   Coefficient of power-law distribution
                               In-degree   Out-degree
XHTML+WML    3.75              2.00        3.49
CHTML        5.06              1.99        4.06
Webbase      7.0               2.1         2.7
Mobile web is sparser
CHTML lies between XHTML+WML and fixed web
Out-degree distribution falls off faster for mobile web
Page-level Graph properties – Bow-tie structure
Corpus       SCC     IN      OUT     Tendrils   Disconnected
XHTML+WML    10.5%   18%     10.4%   18.3%      42.7%
CHTML        22%     25.9%   14.2%   22%        15.8%
Webbase      33%     11%     39%     13%        4%
Mobile web: smaller SCC, larger IN and smaller OUT, bigger Disconnected + Tendrils
Connectivity: Fixed Web > CHTML > XHTML/WML
Language Properties
Sub-graph of pages that share a common trait, like keyword or location
Called Thematically Unified Clusters (TUCs)
In the fixed web, they retain the structural properties of the entire graph
Mobile web?
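Extracting a language sub-graph amounts to taking the induced sub-graph on pages with a given label. A minimal sketch follows, assuming each page already carries a language label in a dictionary; the function name and the `page_language` mapping are hypothetical and say nothing about how languages were detected in the study.

```python
def language_subgraph(edges, page_language, language):
    """Keep only edges whose source and target pages are both in `language`."""
    return [
        (src, dst)
        for src, dst in edges
        if page_language.get(src) == language
        and page_language.get(dst) == language
    ]

page_language = {"p1": "zh", "p2": "zh", "p3": "en", "p4": "ru"}
edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p2", "p1")]
zh_edges = language_subgraph(edges, page_language, "zh")
```

Links that cross languages are dropped, so a sub-graph can be far more disconnected than the graph it came from.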
Corpus   Language   Fraction of Nodes
XHTML    Chinese    42.6%
         English    22.3%
         Russian    13.4%
         French     3.4%
         German     2.3%
CHTML    Japanese   92.3%
         English    5.9%
Corpus       SCC     IN     OUT     Tendrils   Disconnected
XHTML+WML    10.5%   18%    10.4%   18.3%      42.7%
Chinese      13%     22%    9%      14%        42%
English      2%      3%     7%      25%        63%
Russian      22%     40%    8%      11%        19%
Japanese is not studied separately: its properties are the same as CHTML
Domain-level Graph Properties
Domain-level graph: collapse all nodes for a domain into a single super-node
Compare mobile web 2007 and fixed web 2007
Advantages
Allows us to understand the differences at a much coarser level
Allows us to compare present-day fixed and mobile webs
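The collapse into super-nodes can be sketched as follows. This is an illustration under stated assumptions: the URL hostname stands in for "domain", links within a domain are dropped, and parallel links between two domains are merged into one edge. None of these choices is confirmed by the slides, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

def collapse_to_domains(page_edges):
    """Map page-level links to a set of domain-level edges."""
    domain_edges = set()
    for src, dst in page_edges:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Skip unparseable URLs and links that stay inside one domain.
        if src_host and dst_host and src_host != dst_host:
            domain_edges.add((src_host, dst_host))
    return domain_edges

page_edges = [
    ("http://a.example/1", "http://a.example/2"),  # intra-domain, dropped
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/y"),  # duplicate domain edge
    ("http://b.example/x", "http://c.example/"),
]
domain_edges = collapse_to_domains(page_edges)
```

The resulting edge set is much smaller than the page-level graph, which is what makes the 2007 fixed web tractable at this level.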
Corpus           Avg Node Degree   SCC     IN      OUT     Tendrils + Disconn.
XHTML+WML        3.91              40.6%   40.7%   2.73%   15.9%
CHTML            5.56              83%     16.4%   0.22%   0.36%
Fixed web 2007   35.75             93.9%   5.62%   0.4%    0.03%
Observations
Domain-level graphs are better connected.
XHTML+WML has a much larger Disconnected component.
CHTML properties lie between those of XHTML+WML and the fixed web.
Structural differences between the domain-level fixed web and mobile web are the same as the differences between the page-level fixed web and mobile web.
Application: Impact on Crawling
Crawling is resource-intensive, so efficiency is important
Higher level of disconnectedness: need a larger and more diverse seed set
Covering the IN component requires special care
A depth-first strategy risks spending a disproportionate amount of time in the Tendrils and Disconnected components
Different languages have different levels of disconnectedness
Require a larger seed set for English pages than for Russian pages
Crawl depth can be reduced for the Russian sub-graph
Sparseness can also give an advantage: the chance of encountering the same page again during a crawl is smaller
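The effect of seed-set size and crawl depth on coverage can be illustrated with a toy breadth-first crawl over an in-memory link map. This is a sketch only: `crawl_coverage`, the `out_links` dictionary, and the tiny example graph are hypothetical, and a real crawler fetches pages rather than reading a prebuilt graph.

```python
from collections import deque

def crawl_coverage(out_links, seeds, max_depth):
    """Return the set of pages reached from `seeds` within `max_depth` hops."""
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        page, depth = frontier.popleft()
        if depth == max_depth:
            continue                      # do not expand beyond the depth limit
        for nxt in out_links.get(page, ()):
            if nxt not in seen:           # each page is visited at most once
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

# Two components that no single seed can cover.
out_links = {"a": ["b"], "b": ["c"], "x": ["y"]}
one_seed = crawl_coverage(out_links, ["a"], max_depth=5)
two_seeds = crawl_coverage(out_links, ["a", "x"], max_depth=5)
shallow = crawl_coverage(out_links, ["a"], max_depth=1)
```

With one seed the disconnected pair `x`, `y` is never found, no matter how deep the crawl goes; adding a seed in that component is the only way to reach it.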
Conclusions
Mobile web graph is structurally different: sparser, more disconnected, smaller SCC and OUT
CHTML properties lie between those of XHTML+WML and the fixed web
Surprising preponderance of Chinese pages
English sub-graph is extremely disconnected
Future Work
Only a first step
Results motivate the need for a deeper and more extensive analysis
Propose alternatives to bow-tie model for mobile web
Better understanding of language sub-graphs
Quantitatively characterize the impact of differences in structure on different search algorithms