Web scraping

Web scraping, web harvesting, or web data extraction is a form of data scraping used to extract data from websites.^[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser. Although web scraping can be done manually by a software user, the term usually refers to an automated copying process in which specific data is gathered and copied from the web, typically into a central local database, for later retrieval or analysis.

Scraping a web page involves fetching it and then extracting data from it.^[1]^[2] Fetching is the downloading of a page (which a browser does when a user views the page). Web crawling is therefore a main component of web scraping, as it fetches pages for later processing. Once a page is fetched, extraction can take place: its content may be parsed, searched, or reformatted, its data copied into a spreadsheet, and so on. Web scrapers typically take something out of a page to use it for another purpose elsewhere. An example is finding and copying names and phone numbers, or companies and their URLs, to a list (contact scraping).

Web scraping is used for contact scraping and as a component of applications for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashups, and web data integration.

Web pages are built using text-based markup languages (HTML and XHTML) and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end users, not for ease of automated use. Because of this, toolkits that scrape web content were created. A web scraper is a tool, often exposed as an application programming interface (API), for extracting data from a website. Companies such as Amazon (through AWS) and Google provide web scraping tools, services, and public data free of cost to end users.

Newer forms of web scraping involve listening to data feeds from web servers. For example, JSON is commonly used as a data interchange format between the client and the web server.
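As a minimal sketch of reading such a feed, the Python snippet below parses a JSON payload of the kind a server might send to a page's client-side script. The payload, its field names (`listings`, `title`, `price`), and the `extract_listings` helper are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical JSON body, standing in for a response captured from a server.
SAMPLE_FEED = """
{"listings": [
  {"title": "Studio apartment", "price": 950},
  {"title": "Two-bedroom flat", "price": 1400}
]}
"""

def extract_listings(payload: str) -> list[tuple[str, int]]:
    """Return (title, price) pairs from a feed with a 'listings' array."""
    data = json.loads(payload)
    # A missing 'listings' key yields an empty result instead of an error.
    return [(item["title"], item["price"]) for item in data.get("listings", [])]

for title, price in extract_listings(SAMPLE_FEED):
    print(f"{title}: {price}")
```

Because the feed is already structured, no HTML parsing is needed; the scraper only has to know the shape of the JSON document.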

Some websites use methods to prevent web scraping, such as detecting bots and blocking them from crawling (viewing) their pages. In response, some web scraping systems rely on DOM parsing, computer vision, and natural language processing to simulate human browsing, so that web page content can be gathered for offline parsing.

History

Techniques

Web scraping is the process of automatically mining data or collecting information from the World Wide Web. It is a field under active development that shares a common goal with the semantic web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence, and human-computer interaction. Current web scraping solutions range from ad hoc approaches that require human effort to fully automated systems that can convert entire websites into structured information, with limitations.

Human copy-and-paste

Sometimes even the best web scraping technology cannot replace a human's manual examination and copy-and-paste, and this may be the only workable solution when the target websites explicitly set up barriers to prevent machine automation.

Text pattern matching

A simple yet powerful approach to extract information from web pages can be based on the UNIX grep command or regular expression-matching facilities of programming languages (for instance Perl or Python).
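A minimal sketch of this approach in Python, using the standard `re` module, is shown below. The page fragment, the contact format, and the `find_contacts` helper are assumptions made for illustration; real pages vary widely, and a pattern like this breaks as soon as the markup changes.

```python
import re

# Hypothetical page fragment; the names and numbers are invented.
SAMPLE_HTML = """
<ul>
  <li>Alice Example, tel. 555-0100</li>
  <li>Bob Sample, tel. 555-0199</li>
</ul>
"""

# Group 1: the name (text up to the comma).
# Group 2: a phone number of the assumed form NNN-NNNN.
CONTACT_RE = re.compile(r"<li>([^,<]+),\s*tel\.\s*(\d{3}-\d{4})</li>")

def find_contacts(html: str) -> list[tuple[str, str]]:
    """Return (name, phone) pairs for every list item matching the pattern."""
    return CONTACT_RE.findall(html)

for name, phone in find_contacts(SAMPLE_HTML):
    print(name, phone)
```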

HTTP programming

Static and dynamic web pages can be retrieved by sending HTTP requests to the remote web server using socket programming.
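The following Python sketch illustrates the idea with the standard `socket` module. It assumes plain HTTP on port 80 (no TLS), asks the server to close the connection so that the whole response can be read, and does not decode chunked transfer encoding. The helper names `build_get_request`, `split_response`, and `fetch` are hypothetical.

```python
import socket

def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request as raw bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to close so recv() reaches EOF
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def split_response(raw: bytes) -> tuple[str, dict[str, str], bytes]:
    """Split a raw HTTP response into status line, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

def fetch(host: str, path: str = "/", port: int = 80, timeout: float = 10.0) -> bytes:
    """Send the request over a TCP socket and read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)
```

A call such as `split_response(fetch("example.com"))` would then return the status line, a dictionary of headers, and the page body as bytes. In practice, higher-level HTTP libraries handle redirects, encryption, and encodings that this sketch leaves out.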

HTML parsing

Many websites have large collections of pages generated dynamically from an underlying structured source such as a database. Data of the same category are typically encoded into similar pages by a common script or template. In data mining, a program that detects such templates in a particular information source, extracts its content, and translates it into a relational form is called a wrapper. Wrapper generation algorithms assume that the input pages of a wrapper induction system conform to a common template and that they can be easily identified in terms of a common URL scheme.^[3] Moreover, some semi-structured data query languages, such as XQuery and HTQL, can be used to parse HTML pages and to retrieve and transform page content.
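To make the notion of a wrapper concrete, the sketch below shows a hand-written one in Python, built on the standard `html.parser` module. It does not perform wrapper induction: the template (a `span` with class `name` and another with class `price`) is assumed to be known in advance, and the pages, class names, and `extract_row` helper are hypothetical.

```python
from html.parser import HTMLParser

class ProductWrapper(HTMLParser):
    """Hand-written wrapper for a hypothetical product-page template."""

    FIELDS = {"name", "price"}  # class names the assumed template uses

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in self.FIELDS:
            self._current = cls

    def handle_data(self, data):
        if self._current:
            self.record[self._current] = self.record.get(self._current, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

def extract_row(page: str) -> tuple[str, str]:
    """Translate one template-generated page into a (name, price) row."""
    wrapper = ProductWrapper()
    wrapper.feed(page)
    wrapper.close()
    return wrapper.record.get("name", "").strip(), wrapper.record.get("price", "").strip()

# Two pages produced by the same assumed template.
PAGES = [
    '<html><body><h1><span class="name">Kettle</span></h1>'
    '<p>Price: <span class="price">24.99</span></p></body></html>',
    '<html><body><h1><span class="name">Toaster</span></h1>'
    '<p>Price: <span class="price">31.50</span></p></body></html>',
]

rows = [extract_row(page) for page in PAGES]
print(rows)
```

Each page becomes one row of a relation with columns (name, price), which is the "relational form" referred to above.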

DOM parsing

By embedding a full-fledged web browser, such as the Internet Explorer or Mozilla browser control, programs can retrieve dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, from which programs can retrieve parts of the pages.
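The retrieval step can be sketched in Python with the standard `xml.dom.minidom` module, which exposes a DOM tree for well-formed XHTML. This is only a stand-in for a browser control: it does not execute client-side scripts, so it illustrates navigating the tree rather than rendering dynamic content. The sample document and the `items_under` helper are hypothetical.

```python
from xml.dom import minidom

# Hypothetical well-formed XHTML fragment.
SAMPLE_XHTML = """<html><body>
<div id="prices">
  <p class="item">Kettle</p>
  <p class="item">Toaster</p>
</div>
<p class="item">Unrelated</p>
</body></html>"""

def items_under(doc_text: str, container_id: str) -> list[str]:
    """Return the text of <p class="item"> nodes inside the given <div>."""
    dom = minidom.parseString(doc_text)
    for div in dom.getElementsByTagName("div"):
        if div.getAttribute("id") == container_id:
            return [
                p.firstChild.data
                for p in div.getElementsByTagName("p")
                if p.getAttribute("class") == "item"
                and p.firstChild is not None
                and p.firstChild.nodeType == p.TEXT_NODE
            ]
    return []  # no container with that id

print(items_under(SAMPLE_XHTML, "prices"))
```

Selecting by position in the tree is what excludes the "Unrelated" paragraph, which a flat text search for `class="item"` would also match.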

Vertical aggregation

Several companies have developed vertical-specific harvesting platforms. These platforms create and monitor a multitude of "bots" for specific verticals with no "man in the loop" (no direct human involvement) and no work related to a specific target site. The preparation involves establishing the knowledge base for the entire vertical, after which the platform creates the bots automatically. The platform's robustness is measured by the quality of the information it retrieves (usually the number of fields) and its scalability (how quickly it can scale up to hundreds or thousands of sites). This scalability is mostly used to target the long tail of sites that common aggregators find complicated or too labor-intensive to harvest content from.

Semantic annotation recognition

The pages being scraped may contain metadata or semantic markup and annotations, which can be used to locate specific data snippets. If the annotations are embedded in the pages, as with microformats, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer,^[4] are stored and managed separately from the web pages, so scrapers can retrieve the data schema and instructions from this layer before scraping the pages.
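For the embedded case, the sketch below reads hCard-style microformat annotations (the `vcard`, `fn`, and `tel` class names) with Python's standard `html.parser`. It is a simplification that takes only the first text chunk inside each annotated element and ignores nesting; the sample markup and the `parse_hcards` helper are hypothetical.

```python
from html.parser import HTMLParser

class HCardParser(HTMLParser):
    """Collect 'fn' and 'tel' properties from elements inside a 'vcard'."""

    PROPERTIES = {"fn", "tel"}

    def __init__(self):
        super().__init__()
        self.cards = []
        self._prop = None  # property whose text is expected next

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if "vcard" in classes:
            self.cards.append({})
        matched = self.PROPERTIES.intersection(classes)
        if matched and self.cards:
            self._prop = matched.pop()

    def handle_data(self, data):
        if self._prop:
            self.cards[-1][self._prop] = data.strip()
            self._prop = None

def parse_hcards(html: str) -> list[dict[str, str]]:
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.cards

# Hypothetical annotated markup; the names and numbers are invented.
SAMPLE = (
    '<div class="vcard"><span class="fn">Alice Example</span> '
    '<span class="tel">555-0100</span></div>'
    '<div class="vcard"><span class="fn">Bob Sample</span> '
    '<span class="tel">555-0199</span></div>'
)

print(parse_hcards(SAMPLE))
```

Because the annotations name the fields directly, the scraper does not need to know anything about the page layout beyond the class names.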

Computer vision web-page analysis

There are efforts using machine learning and computer vision that attempt to identify and extract information from web pages by interpreting pages visually as a human being might.^[5]

Software

Example tools

cURL – command-line tool and library for transferring (including getting) data with URLs, supporting a wide range of HTTP features (GET, POST, cookies, etc.).
Data Toolbar – web scraping add-on for the Internet Explorer, Mozilla Firefox, and Google Chrome web browsers that collects structured data from web pages and converts it into a tabular format that can be loaded into a spreadsheet or database management program.
Diffbot – uses computer vision and machine learning to automatically extract data from web pages by interpreting pages visually as a human being might.
Heritrix – gets pages (lots of them). It is a web crawler designed for web archiving, written by the Internet Archive (see Wayback Machine).
HtmlUnit – headless browser that can be used for retrieving web pages, web scraping, and more.
HTTrack – free and open source Web crawler and offline browser, designed to download websites.
iMacros – a browser extension to record, code, share and replay browser automation (JavaScript).
Selenium – a portable software-testing framework for web applications.
Jaxer
Mozenda – WYSIWYG software that offers cloud, onsite, and data wrangling services.
nokogiri – HTML, XML, SAX, and Reader parser based on XPath or CSS Selectors.
OutWit Hub – Web scraping application including built-in data, image, document extractors and editors for custom scrapers and automatic exploration and extraction jobs (free and paid versions).
watir – open source Ruby library for automating tests and interacting with a browser the same way people do.
Wget – computer program that retrieves content from web servers. It is part of the GNU Project. It supports downloading via the HTTP, HTTPS, and FTP protocols.
WSO2 Mashup Server
Yahoo! Query Language (YQL)

JavaScript tools

Greasemonkey
Node.js
PhantomJS – scripted, headless browser used for automating web page interaction.
jQuery

Legal issues

United States

European Union

Australia

Methods to prevent web scraping

See also

References

[1] Boeing, G.; Waddell, P. (2016). "New Insights into Rental Housing Markets across the United States: Web Scraping and Analyzing Craigslist Rental Listings". Journal of Planning Education and Research (0739456X16664789). arXiv:1605.05397. doi:10.1177/0739456X16664789.
[2] Vargiu & Urru (2013). "Exploiting web scraping in a collaborative filtering-based approach to web advertising". Artificial Intelligence Research. 2 (1). doi:10.5430/air.v2n1p44.
[3] Song, Ruihua; Microsoft Research (Sep 14, 2007). "Joint Optimization of Wrapper Generation and Template Detection" (PDF). The 13th International Conference on Knowledge Discovery and Data Mining.
[4] Semantic annotation based web scraping
[5] Roush, Wade (2012-07-25). "Diffbot Is Using Computer Vision to Reinvent the Semantic Web". www.xconomy.com. Retrieved 2013-03-15.
