Knowlesys Web Newshub System
Powered by the world's leading data acquisition technology, the Knowlesys Web Newshub System enables web editors to find the latest newsworthy information systematically, quickly and in large volumes every day.
- Overview
- Benefits
- Composition
- Functional description of automatic acquisition sub-system
- Functional description of presentation sub-system
- Implementation
I. Overview
Knowlesys Web Newshub System is an editorial platform that automatically collects, summarizes and identifies critical information in real time from numerous target websites (e.g. news sites, BBS, blogs, microblogs) to find newsworthy information, and provides functions for subsequent editing and review.
Its system architecture is illustrated below:
Fig.1. Architecture of Knowlesys Web Newshub System
Compared to the current manual reprint process, it has the following advantages:
Indicator for comparison | Knowlesys Web Newshub System | Manual reprint
Target sites | Hundreds, thousands, even tens of thousands | Dozens
Labor cost | Network information is collected automatically; only a few editors are needed to view and analyze content manually within the private LAN | A large number of editors must log in to every site and copy and paste content by hand, which is tiresome
News source identification | Manual confirmation based on automatic identification | Manual review and confirmation required item by item
Information storage | Accurate, full coverage, easy to track | Fragmented, errors unavoidable
Data storage | All stored in a large relational database under centralized management | Pasted ad hoc, hard to manage
Work report | Based on automated statistical analysis | Ambiguous, unclear, no quantitative analysis
Reprint effect | Systematic, large-scale reprinting from partner media or exposures from users; traffic and ranking boosted quickly | Unsystematic, small in volume
II. Benefits
1. The latest information from all news sites, print media, BBS, blogs and video sites is presented automatically;
2. The system finds valuable information immediately, and it can be selected with a single click;
3. Editors have more time for in-depth editing or original writing;
4. Daily reprint volume increases by dozens or hundreds of times, and so do website traffic and ranking.
III. Composition
Knowlesys Web Newshub System consists of three sub-systems: the extraction sub-system, the analysis sub-system and the presentation sub-system. Their connections are shown below:
Fig. 2. System composition
The network topology of Knowlesys Web Newshub System is shown below. It can be deployed separately on the Internet-facing LAN and the private LAN as needed.
Fig. 3: Network topology
IV. Functional description of automatic acquisition sub-system
The automatic acquisition sub-system can collect content from any target website automatically.
Examples include Xinhua net, bbs.people.com.cn, tianya.cn, xici.net, club.163.com, bbs.sina.com.cn, club.sohu.com, ifeng.com, tieba.baidu.com, and other sites specified by users. It can extract all news articles or threads, or only the content of the latest thread. It can also extract all replies to a thread, or only the content of the latest reply. It can monitor specified target websites, monitor websites around the world without specified targets, or combine the two modes. It can monitor not only domestic websites but also foreign ones, e.g. BBC and CNN.
The back-end database can be any mainstream relational database, e.g. Oracle, IBM DB2, MS SQL Server, MySQL or Sybase, as well as a file-based database, e.g. Access.
The all-round monitoring function of the automatic acquisition sub-system is illustrated below:
Fig. 4. All-round monitoring of extraction sub-system
The automatic acquisition sub-system has the following features:
1. World's leading automatic data mining function
Knowlesys' web data mining technology is among the world's leading and can accurately collect any data on any web page. Every day, Knowlesys provides data mining services covering all kinds of websites to clients inside and outside China. This requires an efficient and stable acquisition platform.
2. All targets can be monitored.
News, BBS, blogs, public chat rooms, search engines, message boards, applications, electronic editions of newspapers and websites can be monitored in real-time.
3. Thousands of news websites can be monitored without additional configuration.
Thanks to the built-in configuration for worldwide website monitoring, titles and texts are acquired automatically once key words are entered.
4. Powerful multi-language centralized processing function
Information in multiple languages, such as Chinese, English, French, German, Japanese and Korean, can be automatically processed and stored.
5. Smart article extraction
Article texts, titles and release dates can be extracted directly from article-type web pages without additional configuration, while irrelevant content such as adverts, columns and copyright information is removed automatically.
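The idea of keeping the title and body text while dropping page furniture can be illustrated with a deliberately naive sketch. It is not the product's extraction algorithm, which is not described here. It assumes the title sits in an `h1` element and that adverts, navigation and copyright notices live in `script`, `style`, `nav`, `aside` or `footer` elements. The class name `ArticleTextParser` is hypothetical.

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Naive extractor: keeps <h1> as the title, skips assumed boilerplate tags."""

    # Assumption: these elements hold adverts, menus and copyright notices.
    SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.in_title = False
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "h1":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "h1":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth > 0:
            return
        if self.in_title:
            self.title += text
        else:
            self.paragraphs.append(text)


def extract_article(html):
    """Return (title, body_text) for one article-type page."""
    parser = ArticleTextParser()
    parser.feed(html)
    return parser.title, "\n".join(parser.paragraphs)
```

A production extractor would more likely rely on text-density or template-learning heuristics than on fixed tag names; the sketch only shows the input and output shape of such a step.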
6. All web page conditions are supported:
Popular Web 2.0 AJAX dynamic websites
Auto-login with user ID and password
Form queries
Automatic navigation to the next page
Automatic extraction and combination of article contents that extend over several pages
Automatic downloading of images contained in texts and of various attachments
Option to save an original snapshot for review
Multiple Internet protocols supported: HTTP, HTTPS and FTP
Multiple web file formats supported: HTML/XML/CSV/TEXT/RSS/ATOM
7. Automatic deduplication function
For the same URL, each run collects only the article contents or replies that have not yet been collected; content already acquired is ignored. Automatic deduplication can also be applied to reprinted articles.
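A minimal sketch of this behavior, under stated assumptions, is given below. It assumes every article or reply carries a stable identifier within its URL, and it treats two texts as the same reprint when they match after case and whitespace normalization. The class `SeenStore` and its in-memory sets are hypothetical; the actual system stores data in a relational database.

```python
import hashlib


class SeenStore:
    """In-memory record of what has already been collected (illustrative only)."""

    def __init__(self):
        self.seen_items = set()     # (url, item_id) pairs already collected
        self.seen_content = set()   # fingerprints of texts already stored

    @staticmethod
    def fingerprint(text):
        # Normalize case and whitespace so trivially reformatted reprints collide.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def filter_new(self, url, items):
        """Return only (item_id, text) pairs that are new for this URL and not reprints."""
        fresh = []
        for item_id, text in items:
            key = (url, item_id)
            digest = self.fingerprint(text)
            if key in self.seen_items or digest in self.seen_content:
                continue
            self.seen_items.add(key)
            self.seen_content.add(digest)
            fresh.append((item_id, text))
        return fresh
```

Calling `filter_new` twice with the same thread returns only the replies added between the two runs. Exact-match fingerprints will not catch reprints that have been edited; near-duplicate detection would need a similarity measure instead.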
8. Various built-in data post-processing functions
After data are acquired from web pages, they can be further split into finer data fields, or integrated, replaced or summarized. Examples include the extraction of key words, street addresses, province/city names, postal codes, telephone numbers, fax numbers, e-mail addresses, QQ/MSN/Skype accounts and URLs.
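Field extraction of this kind is often done with regular expressions. The sketch below covers only three of the listed field types and makes simplifying assumptions: e-mail addresses and URLs follow common ASCII patterns, and postal codes are standalone six-digit numbers. The function name `extract_fields` and the patterns are illustrative, not the product's rules.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")
# Assumption: a postal code is exactly six digits with no digit on either side.
POSTAL_RE = re.compile(r"(?<!\d)\d{6}(?!\d)")


def extract_fields(text):
    """Pull e-mail addresses, URLs and six-digit postal codes out of free text."""
    return {
        "emails": EMAIL_RE.findall(text),
        # Trailing punctuation usually belongs to the sentence, not the URL.
        "urls": [u.rstrip(".,;)") for u in URL_RE.findall(text)],
        "postal_codes": POSTAL_RE.findall(text),
    }
```

Street addresses, place names and key words generally need dictionaries or language-specific rules rather than a single pattern, so they are left out of the sketch.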
9. Automatic, unattended acquisition around the clock
The system can operate either on a schedule or around the clock (24/7), at intervals as short as 1 minute.
10. Users can add target websites themselves.
With the acquisition platform provided by the system, users can visually analyze target websites, configure acquisition task files and add them to the deployment, so that any monitored target can be modified, added or removed freely.
V. Functional description of presentation sub-system
The presentation sub-system allows the latest information from all possible source sites to be presented on users' desktop browsers. Its functional architecture is illustrated below:
Fig. 5. Functional architecture of presentation sub-system
The presentation sub-system has the following distinct features:
1. Working in collaboration
Different users view different contents, execute different operations and perform different duties.
2. Displaying article elements
For news and blogs, titles, texts, authors, release time and sources can be collected.
Key words are highlighted, and title lists can be displayed for quick viewing.
3. Displaying post elements
For posts on BBS, titles, texts, posting time, view counts, number of replies and poster IP addresses can be collected.
Key words are highlighted, and title lists can be displayed for quick viewing.
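Key word highlighting can be sketched as a case-insensitive substitution that wraps each hit in markers. The function `highlight` and the `[[` / `]]` markers are assumptions for illustration; a browser-based front end would more likely emit HTML tags.

```python
import re


def highlight(text, keywords, open_mark="[[", close_mark="]]"):
    """Wrap every case-insensitive occurrence of a key word in marker strings."""
    # Longest key words first, so "flood warning" wins over "flood".
    escaped = [re.escape(k) for k in sorted(keywords, key=len, reverse=True) if k]
    if not escaped:
        return text
    pattern = re.compile("|".join(escaped), re.IGNORECASE)
    return pattern.sub(lambda m: f"{open_mark}{m.group(0)}{close_mark}", text)
```

The original capitalization of each hit is preserved because the replacement reuses the matched text rather than the key word as typed.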
4. Classifying and compiling
The acquired content can be filtered, classified, annotated and compiled for subsequent management and analysis.
5. Powerful search function
Users can perform precise or fuzzy searches, and can search by category or by source.
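The difference between the two search modes can be sketched as follows. The record layout (dictionaries with `title`, `category` and `source` keys), the function name `search` and the similarity threshold are all assumptions. Fuzzy matching here compares the query with each title word using `difflib.SequenceMatcher` from the Python standard library, which is one possible technique and not necessarily the one the product uses.

```python
from difflib import SequenceMatcher


def search(articles, query, category=None, source=None, fuzzy=False, threshold=0.8):
    """Filter articles by optional category/source, then match the query on titles."""
    needle = query.lower()
    results = []
    for article in articles:
        if category is not None and article["category"] != category:
            continue
        if source is not None and article["source"] != source:
            continue
        title = article["title"].lower()
        if fuzzy:
            # A hit is any title word that is similar enough to the query.
            hit = any(
                SequenceMatcher(None, needle, word).ratio() >= threshold
                for word in title.split()
            )
        else:
            # Precise search: the query must appear verbatim in the title.
            hit = needle in title
        if hit:
            results.append(article)
    return results
```

With this sketch a misspelled query such as "earthqake" finds a title containing "earthquake" only when `fuzzy=True`.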
6. Supporting manual adding
Articles can be added manually, and monitoring of news sites, BBS and blogs is supported.
7. Handling website restrictions
Without further configuration, the system can collect from foreign websites that are blocked in China and from websites that restrict source IP addresses or access frequency, and it can collect proxy IP addresses automatically.
VI. Implementation
The system is intended mainly for portal operators.
Due to the complexity of the Internet, implementing the Knowlesys Web Newshub System requires communication and cooperation with users.
We provide the following implementation services to meet user requirements:
Number | Service | Content
1 | Turn-key project | Provide a full package of software and documentation for the Knowlesys Web Newshub System
2 | Training | E-training or training at clients' premises
3 | Subsequent services | Provide configuration parameter files after target websites are updated
4 | Technical support | Answer user questions and give technical support via e-mail, QQ/MSN/Skype