Mining Twitter

1.9, 1.10

1131036001 김종명

1.9 Making Robust Twitter Requests

• Problem
– You want to write a long-running script that harvests large amounts of data, such as the friend and follower ids for a very popular Twitterer; however, the Twitter API is inherently unreliable and imposes rate limits that require you to always expect the unexpected.

• Solution
– Write an abstraction for making Twitter requests that accounts for rate limiting and other types of HTTP errors, so that you can focus on the problem at hand. A rate-limit response is just one specific kind of HTTP error.
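The abstraction described in the solution can be sketched roughly as below. This is a minimal illustration, not the book's recipe: `HTTPError`, `make_robust_request`, the `reset_in` attribute, and the default wait times are all hypothetical names and values chosen for the example. The wrapped function is assumed to raise `HTTPError` with the status code of the failed request.

```python
import time


class HTTPError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status, reset_in=None):
        super().__init__("HTTP %d" % status)
        self.status = status
        # Seconds until the rate-limit window resets, if the caller knows it.
        self.reset_in = reset_in


RATE_LIMITED = {420, 429}          # rate-limit statuses from the table
RETRYABLE = {500, 502, 503, 504}   # server-side errors worth retrying


def make_robust_request(func, *args, max_retries=5, base_wait=2.0,
                        sleep=time.sleep, **kwargs):
    """Call func, retrying on rate limits and transient server errors."""
    wait = base_wait
    for attempt in range(max_retries + 1):
        try:
            return func(*args, **kwargs)
        except HTTPError as e:
            if attempt == max_retries:
                raise  # out of retries: let the caller see the error
            if e.status in RATE_LIMITED:
                # Assumption: fall back to a 15-minute window if unknown.
                sleep(e.reset_in if e.reset_in is not None else 15 * 60)
            elif e.status in RETRYABLE:
                sleep(wait)
                wait *= 2  # exponential backoff between attempts
            else:
                raise  # 401, 404, etc. will not fix themselves
```

Passing `sleep` as a parameter keeps the wrapper easy to exercise without actually waiting; in a real harvesting script the default `time.sleep` would be used.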

Error Codes & Responses

Code | Text | Description
200 | OK | Success!
304 | Not Modified | There was no new data to return.
400 | Bad Request | The request was invalid. An accompanying error message will explain why. This is the status code returned during version 1.0 rate limiting. In API v1.1, a request without authentication is considered invalid and you will get this response.
401 | Unauthorized | Authentication credentials were missing or incorrect.
403 | Forbidden | The request is understood, but it has been refused or access is not allowed. An accompanying error message will explain why. This code is used when requests are being denied due to update limits.
404 | Not Found | The URI requested is invalid or the resource requested, such as a user, does not exist. Also returned when the requested format is not supported by the requested method.
406 | Not Acceptable | Returned by the Search API when an invalid format is specified in the request.
410 | Gone | This resource is gone. Used to indicate that an API endpoint has been turned off. For example: "The Twitter REST API v1 will soon stop functioning. Please migrate to API v1.1."
420 | Enhance Your Calm | Returned by the version 1 Search and Trends APIs when you are being rate limited.
422 | Unprocessable Entity | Returned when an image uploaded to POST account/update_profile_banner is unable to be processed.
429 | Too Many Requests | Returned in API v1.1 when a request cannot be served due to the application's rate limit having been exhausted for the resource. See Rate Limiting in API v1.1.
500 | Internal Server Error | Something is broken. Please post to the group so the Twitter team can investigate.
502 | Bad Gateway | Twitter is down or being upgraded.
503 | Service Unavailable | The Twitter servers are up, but overloaded with requests. Try again later.
504 | Gateway Timeout | The Twitter servers are up, but the request couldn't be serviced due to some failure within our stack. Try again later.

Error Messages

• JSON: {"errors":[{"message":"Sorry, that page does not exist","code":34}]}

• XML: <?xml version="1.0" encoding="UTF-8"?>
  <errors><error code="34">Sorry, that page does not exist</error></errors>
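As a rough illustration of how a script might read the JSON form of such an error message, the sketch below extracts `(code, message)` pairs from a response body. The helper name `parse_twitter_errors` is hypothetical; only the `errors`, `code`, and `message` keys come from the example above.

```python
import json


def parse_twitter_errors(body):
    """Return a list of (code, message) pairs from a JSON error body.

    Returns an empty list if the body is not JSON or has no "errors" list.
    """
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []
    if not isinstance(payload, dict):
        return []
    errors = payload.get("errors", [])
    if not isinstance(errors, list):
        return []
    return [(e.get("code"), e.get("message"))
            for e in errors if isinstance(e, dict)]


body = '{"errors":[{"message":"Sorry, that page does not exist","code":34}]}'
for code, message in parse_twitter_errors(body):
    print(code, message)
```

Checking the numeric `code` rather than the message text is the more stable choice, since the codes are what the error table documents.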

Error Codes

Code | Text | Description
32 | Could not authenticate you | Your call could not be completed as dialed.
34 | Sorry, that page does not exist | Corresponds with an HTTP 404 - the specified resource was not found.
88 | Rate limit exceeded | The request limit for this resource has been reached for the current rate limit window.
89 | Invalid or expired token | The access token used in the request is incorrect or has expired. Used in API v1.1.
130 | Over capacity | Corresponds with an HTTP 503 - Twitter is temporarily over capacity.
131 | Internal error | Corresponds with an HTTP 500 - an unknown internal error occurred.
135 | Could not authenticate you | Corresponds with an HTTP 401 - it means that your oauth_timestamp is either ahead of or behind our acceptable range.
215 | Bad authentication data | Typically sent with 1.1 responses with HTTP code 400. The method requires authentication but it was not presented or was wholly invalid.

Normal execution

Nonexistent page: HTTP 404, error code 34

Rate limit reached: HTTP 429, error code 88

URL Error

• Change DNS

1.10

• Problem
– You want to harvest and store tweets from a collection of id values, or harvest entire timelines of tweets.

• Solution
– Use the /statuses/show resource to fetch a single tweet by its id value; the various /statuses/*_timeline methods can be used to fetch timeline data. CouchDB is a great option for persistent storage, and also provides a map/reduce processing paradigm and built-in ways to share your analysis with others.

• Document-based distributed database
– Cluster Of Unreliable Commodity Hardware (the expansion of "Couch")

Document-oriented DB

• MongoDB (C++)
• RavenDB (C#)
• CouchDB (Erlang)

Document

    {
      "_id": "tansac",
      "_rev": "1",
      "profile": {
        "nickname": "tansanc",
        "name": {
          "firstname": "종명",
          "lastname": "김"
        },
        "birthdate": "1987-05-31"
      }
    }

Schema Free

    {
      "_id": "tansac",
      "_rev": "2",
      "profile": {
        "nickname": "tansanc",
        "name": {
          "firstname": "종명",
          "lastname": "김"
        },
        "birthdate": "1987-05-31",
        "hasBrother": true
      }
    }

Typical 3-Tier Architecture

2-Tier Architecture with CouchDB

No Locking

• Multi-Version Concurrency Control (MVCC)
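The idea behind "no locking" can be illustrated with a toy in-memory store: every write must carry the revision it was based on, and a write based on a stale revision is rejected instead of blocking. This is only a sketch of the concept, not CouchDB's implementation; `TinyDocStore` and `ConflictError` are hypothetical names, and revisions are simplified to the plain counters "1", "2" used in the example documents (real CouchDB revision strings are longer, and a stale update is answered with HTTP 409 Conflict).

```python
class ConflictError(Exception):
    """Raised when a write is based on an outdated revision."""


class TinyDocStore:
    """Toy document store with revision checks instead of locks."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        # Return a copy so callers edit their own version of the document.
        return dict(self._docs[doc_id])

    def put(self, doc):
        doc_id = doc["_id"]
        current = self._docs.get(doc_id)
        current_rev = current["_rev"] if current else None
        # The writer must present the revision it read; otherwise reject.
        if doc.get("_rev") != current_rev:
            raise ConflictError("stale revision for %r" % doc_id)
        new_rev = str(int(current_rev) + 1) if current_rev else "1"
        self._docs[doc_id] = dict(doc, _rev=new_rev)
        return new_rev


store = TinyDocStore()
store.put({"_id": "tansac", "profile": {"nickname": "tansanc"}})

first = store.get("tansac")    # two readers fetch revision "1"
second = store.get("tansac")

first["hasBrother"] = True
store.put(first)               # succeeds and creates revision "2"

try:
    store.put(second)          # still based on revision "1"
except ConflictError:
    print("conflict: re-read the document and retry")
```

Readers are never blocked in this scheme; the cost is that a losing writer has to fetch the latest revision and reapply its change.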

/statuses/show

• public_timeline()
• user_timeline()
• home_timeline()

tweepy get timeline

• API.public_timeline()
– Returns the 20 most recent statuses from non-protected users who have set a custom user icon. The public timeline is cached for 60 seconds, so requesting it more often than that is a waste of resources.
– Parameters: None
– Returns: list of Status objects

• API.home_timeline()
– Returns the 20 most recent statuses, including retweets, posted by the authenticating user and that user's friends. This is the equivalent of /timeline/home on the Web.
– Parameters: since_id, max_id, count, page
– Returns: list of Status objects

• API.friends_timeline()
– Returns the 20 most recent statuses posted by the authenticating user and that user's friends.
– Parameters: since_id, max_id, count, page
– Returns: list of Status objects

• API.user_timeline()
– Returns the 20 most recent statuses posted by the authenticating user. It's also possible to request another user's timeline via the id parameter.
– Parameters: (id or user_id or screen_name), since_id, max_id, count, page
– Returns: list of Status objects

• http://pythonhosted.org/tweepy/html/api.html#timeline-methods
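To harvest an entire timeline rather than only the most recent statuses, the `max_id` and `count` parameters listed above can be combined into a paging loop. The sketch below assumes a hypothetical `fetch_page(count=..., max_id=...)` callable that returns statuses newest first as dicts with an `id` key; in practice it would wrap a timeline call such as `API.user_timeline`, whose Status objects expose the id as an attribute instead. The default `count` and `max_pages` values are illustrative.

```python
def harvest_timeline(fetch_page, count=200, max_pages=16):
    """Collect statuses page by page, walking backwards with max_id."""
    tweets = []
    max_id = None  # None means "start from the newest status"
    for _ in range(max_pages):
        page = fetch_page(count=count, max_id=max_id)
        if not page:
            break  # nothing older is available
        tweets.extend(page)
        # max_id is inclusive, so step just below the oldest id seen.
        max_id = min(t["id"] for t in page) - 1
    return tweets


# Stand-in for a real API call: ten fake statuses with ids 10 down to 1.
FAKE_TIMELINE = [{"id": i} for i in range(10, 0, -1)]


def fake_fetch(count, max_id):
    eligible = [t for t in FAKE_TIMELINE
                if max_id is None or t["id"] <= max_id]
    return eligible[:count]


print(len(harvest_timeline(fake_fetch, count=3)))
```

Paging by `max_id` rather than `page` avoids duplicates and gaps when new statuses arrive while the harvest is running.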

home_timeline()

• API.friends_timeline()
• API.public_timeline()
• API.user_timeline()
• API.mentions_timeline()