FCS Documentation
Release 1.0

AGH-GLK

November 06, 2014


Contents

1 Quickstart

2 FCS basics
  2.1 Registration
  2.2 Main page
  2.3 List of tasks
  2.4 Create new task
  2.5 Edit existing task
  2.6 Send feedback
  2.7 Download crawling results

3 Management module (fcs.manager)

4 Crawling Unit module (fcs.crawler)

5 Task Server module (fcs.server)

6 Crawling results decoder (fcs.content_file_decoder)

7 Indices and tables


CHAPTER 1

Quickstart

A short guide to launching Focused Crawling Search (FCS).

Note: A Unix-based operating system and Vagrant (preferably 1.35 or higher) are required.

1. Download the project code from the GitHub repository.

2. Change directory to /fcs.

3. On the command line, type vagrant up. A virtual machine with all requirements will be provisioned. Its IP address is 192.168.0.2.

4. Start a second shell, then in both of them:

• connect to the machine with vagrant ssh,

• activate the Python virtual environment: source ./fcs/bin/activate,

• move to the FCS Management web application's main directory: cd /vagrant/fcs.

5. In the first terminal:

• create the database: python manage.py syncdb,

• apply database migrations: python manage.py migrate,

• set Userena permissions: python manage.py check_permissions,

• start the web application server on local port 8000: python manage.py runserver 192.168.0.2:8000.

6. In the second terminal window start the Autoscaling module: python manage.py autoscaling 192.168.0.2.

7. Open a browser and go to http://192.168.0.2:8000.

8. Register a new user. The activation mail should be displayed in the console.

9. Log in.

10. Click Tasks → Add new. Fill the form. Confirm with Add button.

11. The crawling process will begin soon. You can monitor it in the terminal windows and in the crawler and server logs located in ./fcs/fcs.

12. In the task details (hyperlink in the tasks table) you can download the crawling results.


CHAPTER 2

FCS basics

2.1 Registration

1. Click the Register button on the main page.

2. Fill in the fields with a user name, a valid email address and a password (the same one twice).

3. Confirm with the Register button.

4. Check your email. A registration message should be waiting for you. Click the link in the email.

5. Your account is activated. Now you can log in with your email or user name and password.


2.2 Main page

On the main page the following options are available:

• API - see REST API documentation,

• Tasks - display information about crawling tasks,

• Change password - change your password,

• Change your data - modify details of your account,

• Show quota - check your task creation limits,

• API keys - view keys required for using REST API,

• Logout - finish working with the system.

2.3 List of tasks

This page presents all tasks of the current user. They can be active (yellow rows), paused (grey rows) or finished (green rows). To reduce the number of elements in the table, filter them with the two select lists and the Filter button.

2.4 Create new task

1. Click the Add button under the task list table.


2. Fill in the form below. Unless stated otherwise, all fields are mandatory.

3. In the first row, specify the task's name.

4. Fill the priority field with a number from 1 to 10. The higher it is, the more important this task is in comparison with the user's other tasks.

5. Provide start links separated by whitespace.

6. In the whitelist field you can specify a list of regular expressions (separated by commas) describing URLs that may be processed. If you leave this input empty, all URLs will be crawled.

7. The blacklist is a list of regular expressions describing URLs that must not be crawled. Optional.

8. In the next field set the maximum number of pages that can be crawled.

9. Select the task's expiration date in the Expire field.

10. In the last input you can type a list of MIME types that should be processed by the crawler.

11. Submit the form with the Add button.

12. If you see a message like the one below, the task was created successfully.
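The whitelist/blacklist semantics of steps 6 and 7 can be sketched in Python. The function name and the exact matching rule (substring search via re.search) are assumptions for illustration, not FCS's actual implementation:

```python
import re

def url_allowed(url, whitelist, blacklist):
    """Decide whether a URL may be crawled.

    whitelist/blacklist are comma-separated strings of regular
    expressions, as entered in the task form. An empty whitelist
    means every URL is allowed (unless it is blacklisted).
    """
    black = [p.strip() for p in blacklist.split(",") if p.strip()]
    if any(re.search(p, url) for p in black):
        return False
    white = [p.strip() for p in whitelist.split(",") if p.strip()]
    if not white:
        return True
    return any(re.search(p, url) for p in white)

# Example: crawl only wikipedia.org, but skip /Talk: pages
print(url_allowed("http://en.wikipedia.org/wiki/Crawler",
                  r"wikipedia\.org", r"/Talk:"))            # True
print(url_allowed("http://example.com/page",
                  r"wikipedia\.org", ""))                   # False
```

The blacklist is checked first, so a URL matched by both lists is rejected.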

2.5 Edit existing task

1. Click one of the rows in the task table.

2. If the task is finished, you cannot change anything. The view should look like the one below:


3. If the task is running or paused, you can change some of its parameters, pause/resume it, stop it, or get crawling results:

4. After modifying the task, click Save changes.

2.6 Send feedback

If the task is running or paused, you can rate pages on the task edit page. To send feedback to the Task Server, specify a URL and a rating. A rating higher than 3 means the link is valuable; a lower one means the link is useless. Confirm with the Send feedback button.

2.7 Download crawling results

On the same page you can also download crawling results. Click Get data. In the window that appears, set the size in MB of the file containing part of the results. Click OK and the download should begin.
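The size limit chosen in the Get data dialog presumably caps each downloaded file. A generic sketch of such size-based chunking (illustrative only, not FCS's actual code):

```python
def chunk_records(records, max_bytes):
    """Group serialized records into chunks of at most max_bytes each.

    A single record larger than max_bytes still becomes its own chunk,
    so no record is ever split or dropped.
    """
    chunks, current, size = [], [], 0
    for rec in records:
        if current and size + len(rec) > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        chunks.append(current)
    return chunks

data = [b"a" * 400, b"b" * 400, b"c" * 400]
print([len(c) for c in chunk_records(data, 1000)])  # [2, 1]
```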


CHAPTER 3

Management module (fcs.manager)

The Management module is a web application implemented in the Django framework. It is responsible for managing user accounts and handling crawling requests from clients. The Management module provides:

• accounts management,

• user’s tasks management,

• email notification system,

• client REST API.

admin

api_urls

api_views

autoscale_views

forms

middleware

models

tasks

urls

views

management/commands/autoscaling


CHAPTER 4

Crawling Unit module (fcs.crawler)

The fcs.crawler module contains classes that implement the Crawling Unit. Crawling Units execute clients' tasks. Each Crawling Unit receives from a Task Server a pool of URIs to fetch. A single Crawling Unit can perform several crawling tasks simultaneously. Crawling results and other information (such as errors) are returned to the Task Server.
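The fan-out described above (one unit working through a pool of URIs concurrently) can be sketched with a thread pool. fetch_url here is a stand-in for a real HTTP fetch, not part of the module's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_url(url):
    # Stand-in for a real HTTP fetch; returns (url, content, error).
    return (url, "<html>%s</html>" % url, None)

def crawl_pool(uri_pool, workers=4):
    """Fetch every URI in the pool concurrently and collect the
    results, to be returned to the Task Server."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input pool.
        return list(pool.map(fetch_url, uri_pool))

results = crawl_pool(["http://a.example", "http://b.example"])
print(len(results))  # 2
```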

content_parser

crawler

mime_content_type

thread_with_exc

web_interface


CHAPTER 5

Task Server module (fcs.server)

This module contains the implementation of the Task Server. Each Task Server is responsible for handling just one task at a time. However, this does not mean that one physical machine corresponds to only one Task Server, since the model is logical. Each Task Server contains its own database for storing links and crawled data.
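The role of the per-server link database (deduplicating links and queueing each new one exactly once) might be illustrated as follows; the class name and methods are invented for this sketch and are not taken from fcs.server.link_db:

```python
from collections import deque

class LinkDB:
    """Toy per-Task-Server link store: remembers every link it has
    seen and keeps a frontier of links not yet handed to a crawler."""
    def __init__(self):
        self.seen = set()
        self.frontier = deque()

    def add_links(self, links):
        for link in links:
            if link not in self.seen:
                self.seen.add(link)
                self.frontier.append(link)

    def next_link(self):
        return self.frontier.popleft() if self.frontier else None

db = LinkDB()
db.add_links(["http://a", "http://b", "http://a"])  # duplicate ignored
print(db.next_link())      # http://a
print(len(db.frontier))    # 1
```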

content_db

crawling_depth_policy

data_base_policy_module

graph_db

link_db

task_server

url_processor

web_interface


CHAPTER 6

Crawling results decoder (fcs.content_file_decoder)

A script that unpacks *.dat files, the results of crawling. Usage:

python script.py <file_location> <unpacked_directories_structure_location>

The script creates a tree of directories. In every leaf directory there are two files: url_links.txt (the page URL in the first line, extracted links separated by whitespace in the second) and content.dat, containing the resource decoded from Base64. At the higher level, directories with integer names are stored. Additionally, the file index.txt maps each directory name to a page URL.
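Assuming exactly the layout described above, a small Python sketch can walk an unpacked tree and rebuild the URL → (links, content) mapping. This traversal code is illustrative and not part of FCS:

```python
import os

def read_unpacked(root):
    """Walk the decoder's output tree. Each leaf directory holds
    url_links.txt (URL on line 1, extracted links on line 2) and
    content.dat (already Base64-decoded by the decoder script)."""
    pages = {}
    for dirpath, dirnames, filenames in os.walk(root):
        if "url_links.txt" in filenames:
            with open(os.path.join(dirpath, "url_links.txt")) as f:
                url = f.readline().strip()
                links = f.readline().split()
            with open(os.path.join(dirpath, "content.dat"), "rb") as f:
                content = f.read()
            pages[url] = (links, content)
    return pages
```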


CHAPTER 7

Indices and tables

• genindex

• modindex

• search
