Manual Chapter: Detecting and Preventing Web Scraping

Applies To: BIG-IP ASM 12.1.6, 12.1.5, 12.1.4, 12.1.3, 12.1.2, 12.1.1, 12.1.0

Overview: Detecting and preventing web scraping

Web scraping is a technique for extracting information from web sites, often using automated programs, or bots (short for web robots), that open many sessions or initiate many transactions. You can configure Application Security Manager™ (ASM) to detect and prevent various web scraping activities on the web sites that it is protecting.

Note: The BIG-IP® system can accurately detect web scraping anomalies only when response caching is turned off.

ASM™ provides the following methods to address web scraping attacks. These methods can work independently of each other, or they can work together to detect and prevent web scraping attacks.

  • Bot detection investigates whether a web client source is human by detecting human interaction events such as mouse movements and keyboard strokes, by detecting irregular sequences of those events, and by detecting rapid surfing.
  • Session opening detects an anomaly either when the rate at which sessions are opened from an IP address increases sharply or when the number of sessions opened from an IP address exceeds a threshold. Session opening can also detect an attack when the number of inconsistencies or session resets exceeds the configured threshold within the defined time period. This method also identifies as an attack an open session that sends requests that do not include an ASM cookie.
  • Session transactions anomaly captures sessions that request too much traffic compared to other sessions. This is based on counting the transactions per session and comparing that count to the average number of transactions per session observed in the web application.
  • Fingerprinting captures information about browser attributes in order to identify a client. It is used when the system fails to detect web scraping anomalies by examining IP addresses, ASM cookies, or persistent device identification.
  • Suspicious clients, used together with fingerprinting, specifies how the system identifies and protects against potentially malicious clients; for example, by detecting scraper extensions installed in a browser.
  • Persistent client identification prevents attackers from circumventing web scraping protection by resetting sessions and re-sending requests.

Task Summary

Prerequisites for configuring web scraping

For web scraping detection to work properly, you should understand the following prerequisites:

  • The web scraping mitigation feature requires that the DNS server is on the DNS lookup server list so the system can verify the IP addresses of legitimate bots. Go to System > Configuration > Device > DNS to see if the DNS lookup server is on the list. If not, add it and restart the system.
  • Client browsers need to have JavaScript enabled and support cookies for anomaly detection to work.
  • Consider disabling response caching. If response caching is enabled, the system does not protect cached content against web scraping.
  • The Application Security Manager™ does not perform web scraping detection on legitimate search engine traffic. If your web application has its own search engine, we recommend that you add it to the allowed search engines list: go to Security > Options > Application Security > Advanced Configuration > Search Engines , and add it to the list.

Adding allowed search engines

The Application Security Manager™ does not perform web scraping detection on traffic from search engines that the system recognizes as being legitimate. You can optionally add other legitimate search engines to the search engines list.
  1. On the Main tab, click Security > Options > Application Security > Advanced Configuration > Search Engines .
    The Search Engines screen opens, and displays a list of the search engines that are considered legitimate.
  2. Click Create.
    The New Search Engine screen opens.
  3. In the Search Engine field, type the name.
  4. In the Bot Name field, type the search engine bot name, such as googlebot.
    Tip: You can get this name from the user-agent header of a request that the search engine sends.
  5. In the Domain Name field, type the search engine crawler’s domain name; for example, yahoo.net.
  6. Click Create.
    Note: For this feature to work, the DNS server must be on the DNS lookup server list on the BIG-IP® system ( System > Configuration > Device > DNS ). The system uses reverse DNS lookup to verify search engine requests.
    The system adds the search engine to the list.
The system does not perform web scraping detection on traffic originating from the search engines on the search engines list.
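
The verification mentioned in the note above relies on a standard reverse-plus-forward DNS check: resolve the client IP address to a hostname, confirm the hostname belongs to the crawler's domain, and then resolve the hostname forward to confirm it maps back to the same IP address. The following Python sketch illustrates that general technique only; it is not ASM code, and the example address and default domain are assumptions for illustration.

    import socket

    def is_legitimate_crawler(client_ip: str, crawler_domain: str = "googlebot.com") -> bool:
        """Reverse-resolve the client IP, check the crawler domain, then confirm forward resolution."""
        try:
            hostname, _, _ = socket.gethostbyaddr(client_ip)          # reverse DNS lookup
        except socket.herror:
            return False
        if not hostname.rstrip(".").endswith(crawler_domain):
            return False                                              # hostname is not in the crawler's domain
        try:
            _, _, forward_ips = socket.gethostbyname_ex(hostname)     # forward DNS lookup
        except socket.gaierror:
            return False
        return client_ip in forward_ips                               # hostname must map back to the client IP

    # A client that merely spoofs a search engine user-agent string fails this check,
    # because its IP address does not reverse-resolve into the crawler's domain.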

Allowed search engines

By default, Application Security Manager™ (ASM) allows requests from these well-known search engines and legitimate web robots:

  • Ask (.ask.com)
  • Baidu (.baidu.com, .baidu.jp)
  • Bing (.msn.com)
  • Google (.googlebot.com)
  • Yahoo (.yahoo.net)
  • Yandex (.yandex.com, .yandex.net, .yandex.ru)

You can add other search engines to the allowed search engine list; for example, if your web application uses an additional search engine. The list applies globally to all security policies on the system. ASM does not perform web scraping detection on traffic from any search engine listed.

Detecting web scraping based on bot detection

Application Security Manager™ can mitigate web scraping on application web sites by investigating whether a client is human or a web robot. This is called bot detection. The bot detection method protects web applications against rapid surfing by measuring how frequently URLs are accessed and whether pages are refreshed too often. The system can also detect non-human clients by detecting irregular sequences of human interaction events.
  1. On the Main tab, click Security > Application Security > Anomaly Detection > Web Scraping .
    The Web Scraping screen opens.
  2. In the Current edited policy list near the top of the screen, verify that the edited security policy is the one you want to work on.
  3. For the Bot Detection setting, select either Alarm or Alarm and Block to indicate how you want the system to react when it detects that a bot is sending requests to the web application.
    If you choose Alarm and Block, the security policy enforcement mode needs to be set to Blocking before the system blocks web scraping attacks.
    Note: The system can accurately detect a human user only if clients have JavaScript enabled and support cookies in their browsers.
    The screen displays the Bot Detection tab and more settings.
  4. If you plan to use fingerprinting to monitor behavior by browser (and collect its attributes) rather than by session, select the Fingerprinting Usage check box.
    The screen displays additional settings; a separate task explains how to use fingerprinting to detect web scraping.
  5. If you want to protect client identification data (when using Bot Detection or Session Opening detection), specify the persistence settings.
    1. Select the Persistent Client Identification check box.
    2. For Persistent Data Validity Period, type, in minutes, how long you want the client data to persist. The default value is 120 minutes.
    Note: This setting enforces persistent storage on the client and prevents easy removal of client data. Be sure that this behavior is compatible with the application privacy policy.
    The system maintains client data and prevents removal of this data from persistent storage for the validity period specified.
  6. For the IP Address Whitelist setting, add the IP addresses and subnets from which traffic is known to be safe.
    Important: The system adds any whitelist IP addresses to the centralized IP address exceptions list. The exceptions list is common to both brute force prevention and web scraping detection configurations.
  7. On the Bot Detection tab, for the Rapid Surfing setting, specify the maximum number of page refreshes or different pages that can be loaded within a specified number of seconds before the system suspects a bot.
    The default is a maximum of 120 page refreshes, or 30 different pages loaded, within 30 seconds.
  8. For Grace Interval, type the number of requests to allow while determining whether a client is human.
    The default value is 100.
  9. For Blocking Period, type the number of requests that cause the Web Scraping Detected violation if no human activity was detected during the grace period.
    The default value is 500.
    Reaching this interval causes the system to reactivate the grace period.
  10. For Safe Interval, type the number of requests to allow after human activity is detected, and before reactivating the grace interval to check again for non-human clients.
    The default value is 2000.
  11. To track event sequences in the browser and detect irregular sequences, for Event Sequence Enforcement, select Enabled.
  12. Click Save to save your settings.
  13. On the Main tab, click Security > Application Security > Policy Building > Learning and Blocking Settings .
    The Learning and Blocking Settings screen opens.
  14. To ensure that web scraping violations are learned, in the Policy Building Settings area, expand Bot Detection and select Web scraping detected, if it is not already selected.
  15. Click Save to save your settings.
  16. To put the security policy changes into effect immediately, click Apply Policy.
The system checks for rapid surfing, and if too many pages are loaded too quickly, it logs Web Scraping detected violations in the event log and specifies the attack type as Bot Detected. If you enabled event sequence enforcement, the system tracks the sequence of events in the browser, thus preventing bots that can imitate human mouse, keyboard, or touch events. If an irregular sequence of events is detected during the grace period, the client continues to receive the JavaScript challenge that tries to detect a human. When the grace period is over, if the sequences were irregular, the system starts to block requests from the client.
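
The Grace Interval, Blocking Period, and Safe Interval settings work as per-client request counters. The following Python sketch models one plausible reading of the flow described above, using the default values from this procedure; it is an illustrative assumption, not ASM's implementation.

    GRACE_INTERVAL = 100      # requests allowed while determining whether the client is human
    BLOCKING_PERIOD = 500     # requests that raise the violation when no human activity was seen
    SAFE_INTERVAL = 2000      # requests allowed after human activity before re-checking

    class ClientState:
        def __init__(self):
            self.mode = "grace"   # "grace", "blocking", or "safe"
            self.count = 0

        def on_request(self, human_activity_detected: bool) -> str:
            """Return 'allow' or 'block' for one request from this client."""
            self.count += 1
            if self.mode == "grace":
                if human_activity_detected:
                    self.mode, self.count = "safe", 0      # human confirmed: enter the safe interval
                    return "allow"
                if self.count >= GRACE_INTERVAL:
                    self.mode, self.count = "blocking", 0  # grace interval exhausted with no human activity
                return "allow"
            if self.mode == "blocking":
                if self.count >= BLOCKING_PERIOD:
                    self.mode, self.count = "grace", 0     # blocking period exhausted: reactivate the grace interval
                return "block"                             # each blocked request raises Web Scraping Detected
            # "safe" mode: human activity was already detected
            if self.count >= SAFE_INTERVAL:
                self.mode, self.count = "grace", 0         # safe interval exhausted: re-check for non-human clients
            return "allow"
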
After setting up bot detection, you can also set up fingerprinting, session opening, and session transactions anomaly detection for the same security policy. If you want to implement proactive bot defense for additional protection against web robots, you can set that up by configuring DoS protection.

Detecting web scraping based on session opening

You can configure how the system protects your web application against session opening web scraping violations, which result from too many sessions originating from a specific IP address, from inconsistencies detected in persistent storage, or from too many session resets occurring within the defined time period.
  1. On the Main tab, click Security > Application Security > Anomaly Detection > Web Scraping .
    The Web Scraping screen opens.
  2. In the Current edited policy list near the top of the screen, verify that the edited security policy is the one you want to work on.
  3. For the Session Opening setting, select either Alarm or Alarm and Block to indicate how you want the system to react when it detects a large increase in the number of sessions opened from a specific IP address, or when the number of session resets or inconsistencies exceeds the set threshold.
    If you choose Alarm and Block, the security policy enforcement mode needs to be set to Blocking before the system blocks web scraping attacks.
    The screen displays the Session Opening tab and more settings.
  4. If you plan to use fingerprinting to monitor behavior by browser (and collect its attributes) rather than by session, select the Fingerprinting Usage check box.
    The screen displays additional settings; a separate task explains how to use fingerprinting to detect web scraping.
  5. If you want to protect client identification data (when using Bot Detection or Session Opening detection), specify the persistence settings.
    1. Select the Persistent Client Identification check box.
    2. For Persistent Data Validity Period, type, in minutes, how long you want the client data to persist. The default value is 120 minutes.
    Note: This setting enforces persistent storage on the client and prevents easy removal of client data. Be sure that this behavior is compatible with the application privacy policy.
    The system maintains client data and prevents removal of this data from persistent storage for the validity period specified.
  6. For the IP Address Whitelist setting, add the IP addresses and subnets from which traffic is known to be safe.
    Important: The system adds any whitelist IP addresses to the centralized IP address exceptions list. The exceptions list is common to both brute force prevention and web scraping detection configurations.
  7. To detect session opening anomalies by IP address, on the Session Opening tab, select the Session Opening Anomaly check box.
  8. For the Prevention Policy setting, select one or more options to direct how the system should handle a session opening anomaly attack.
    Option Description
    Client Side Integrity Defense: When enabled, the system determines whether a client is a legitimate browser or an illegal script by sending a JavaScript challenge to each new session request. Legitimate browsers can respond to the challenge; scripts cannot.
    Rate Limiting: When enabled, the system drops random sessions exhibiting suspicious behavior until the session opening rate returns to the historical legitimate value. If you select this option, the screen displays an option for dropping requests from IP addresses with a bad reputation.
    Drop IP Addresses with bad reputation: This option is available only if you have enabled Rate Limiting. When enabled, the system drops requests originating from IP addresses that are in the system’s IP address intelligence database when the attack is detected; no rate limiting occurs for those requests. (Attacking IP addresses that do not have a bad reputation undergo rate limiting, as usual.) You also need to set up IP address intelligence, and at least one of the IP intelligence categories must have its Alarm or Block flag enabled.
  9. For the Detection Criteria setting, specify the criteria under which the system considers traffic to be a session opening anomaly attack.
    Option Description
    Sessions opened per second increased by: The system considers traffic to be an attack if the number of sessions opened per second increased by this percentage. The default value is 500%.
    Sessions opened per second reached: The system considers traffic to be an attack if the number of sessions opened per second is equal to or greater than this number. The default value is 50 sessions opened per second.
    Minimum sessions opened per second threshold for detection: The system considers traffic to be an attack only if this value is exceeded and one of the other sessions opened criteria is met. The default value is 25 sessions opened per second.
    Note: The Detection Criteria values all work together. The minimum sessions value and one of the sessions opened values must be met for traffic to be considered an attack. If the minimum sessions value is not reached, traffic is never considered an attack even if the Sessions opened per second increased by value is met. (How these criteria combine is illustrated in a short sketch after this procedure.)
  10. For Prevention Duration, type a number that indicates how long the system prevents an anomaly attack by logging or blocking requests. The default is 1800 seconds.
    If the attack ends before this number of seconds, the system also stops attack prevention.
  11. If you enabled Persistent Client Identification and you want to detect session opening anomalies based on inconsistencies, select the Device Identification Integrity check box, and set the maximum number of integrity fault events to allow within a specified number of seconds.
    The system tracks the number of inconsistent device integrity events within the time specified, and if too many events occurred within that time, a Web scraping detection violation occurs. (This rolling time-window check is sketched after this procedure.)
  12. If you enabled Persistent Client Identification and you want to track cookie deletion events, in the Cookie Deletion Detection setting, specify how to detect cookie deletion. You can use either one or both options.
    1. To use persistent device identification to detect cookie deletion events, select the Enabled by Persistent Device Identification check box, and set the maximum number of cookie deletions to allow within a specified number of seconds.
    2. If you enabled Fingerprinting Usage, to use fingerprinting to detect cookie deletion events, select the Enabled by Fingerprinting check box, and set the maximum number of cookie deletions to allow within a specified number of seconds.
    The system tracks the number of cookie deletion events that occur within the time specified, and if too many cookies were deleted within the time, a Web scraping detection violation occurs.
  13. For Prevention Duration, type a number that indicates how long the system prevents an anomaly attack by logging or blocking requests. The default is 1800 seconds.
    If the attack ends before this number of seconds, the system also stops attack prevention.
  14. Click Save to save your settings.
  15. To put the security policy changes into effect immediately, click Apply Policy.
The system checks for too many sessions being opened from one IP address, too many cookie deletions, and persistent storage inconsistencies, depending on the options you selected. The system logs violations in the web scraping event log along with information about the attack, including the attack type (Session Opening Anomaly by IP Address or Session Resets by Persistent Client Identification) and when the attack began and ended. The log also includes the type of violation (Device Identification Integrity or Cookie Deletion Detection) and the violation numbers.
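
To make the Detection Criteria in step 9 concrete, the following Python sketch shows how the three values can combine, using the defaults from this procedure. It is an illustrative reading of the note in that step, not ASM code.

    SESSIONS_INCREASED_BY_PERCENT = 500   # Sessions opened per second increased by
    SESSIONS_REACHED_PER_SECOND = 50      # Sessions opened per second reached
    MINIMUM_SESSIONS_PER_SECOND = 25      # Minimum sessions opened per second threshold for detection

    def is_session_opening_attack(current_rate: float, historical_rate: float) -> bool:
        """Rates are sessions opened per second from a single IP address."""
        if current_rate < MINIMUM_SESSIONS_PER_SECOND:
            return False                  # below the minimum, traffic is never considered an attack
        increase_percent = (float("inf") if historical_rate == 0
                            else (current_rate - historical_rate) / historical_rate * 100)
        return (increase_percent >= SESSIONS_INCREASED_BY_PERCENT
                or current_rate >= SESSIONS_REACHED_PER_SECOND)

    # Examples: is_session_opening_attack(60, 5) -> True (minimum met, 1100% increase);
    # is_session_opening_attack(20, 1) -> False (minimum of 25 sessions per second not reached).
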
After setting up web scraping detection by session opening, you can also set up bot detection, fingerprinting, and session transactions anomaly detection for the same security policy.
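
Steps 11 and 12 in this procedure both count events within a rolling time window (device identification integrity faults and cookie deletions, respectively). The following Python sketch shows one way such a check can work; the logic is assumed for illustration and is not ASM's implementation.

    from collections import deque
    import time

    class EventWindow:
        """Track timestamped events and flag when more than max_events occur within window_seconds."""
        def __init__(self, max_events: int, window_seconds: float):
            self.max_events = max_events
            self.window = window_seconds
            self.events = deque()

        def record(self, now: float = None) -> bool:
            """Record one event; return True if the configured threshold is exceeded."""
            now = time.monotonic() if now is None else now
            self.events.append(now)
            while self.events and now - self.events[0] > self.window:
                self.events.popleft()                      # discard events that fell out of the window
            return len(self.events) > self.max_events      # too many events in the window -> violation

    # Example: EventWindow(max_events=2, window_seconds=4) mirrors the "more than 2 cookie
    # deletions in 4 seconds" case shown in the attack examples later in this chapter.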

Detecting web scraping based on session transactions

You can configure how the system protects your web application against harvesting, which can cause session transaction anomalies. Harvesting is detected by counting the number of transactions per session and comparing that number to the average number of transactions per session across all sessions.
  1. On the Main tab, click Security > Application Security > Anomaly Detection > Web Scraping .
    The Web Scraping screen opens.
  2. In the Current edited policy list near the top of the screen, verify that the edited security policy is the one you want to work on.
  3. For the Session Transactions Anomaly setting, select either Alarm or Alarm and Block to indicate how you want the system to react when it detects a large increase in the number of transactions per session.
    If you choose Alarm and Block, the security policy enforcement mode needs to be set to Blocking before the system blocks web scraping attacks.
    The screen displays the Session Transactions Anomaly tab and more settings.
  4. For the IP Address Whitelist setting, add the IP addresses and subnets from which traffic is known to be safe.
    Important: The system adds any whitelist IP addresses to the centralized IP address exceptions list. The exceptions list is common to both brute force prevention and web scraping detection configurations.
  5. On the Session Transactions Anomaly tab, for the Detection Criteria setting, specify the criteria under which the system considers traffic to be a session transactions anomaly attack.
    Option Description
    Session transactions above normal by: The system considers traffic in a session to be an attack if the number of transactions in the session is above normal by this percentage (and the minimum session transactions value is met). Normal refers to the average number of transactions per session for the whole site during the last hour. The default value is 500%.
    Session transactions reached: The system considers traffic to be an attack if the number of transactions per session is equal to or greater than this number (and the minimum session transactions value is met). The default value is 400 transactions.
    Minimum session transactions threshold for detection: The system considers traffic to be an attack only if the number of transactions per session is equal to or greater than this number, and at least one of the other session transactions criteria is met. The default value is 200 transactions.
    Important: The Detection Criteria values all work together. The minimum session transactions value and one of the other session transactions values must be met for traffic to be considered an attack. If the Minimum session transactions threshold is not reached, traffic is never considered an attack even if the Session transactions above normal by value is met. (How these criteria combine is illustrated in a short sketch after this procedure.)
  6. For Prevention Duration, type a number that indicates how long the system prevents an anomaly attack by logging or blocking requests. The default is 1800 seconds.
    If the attack ends before this number of seconds, the system also stops attack prevention.
  7. Click Save to save your settings.
  8. To put the security policy changes into effect immediately, click Apply Policy.
When the system detects a session that requests too many transactions (as compared to normal), all transactions from the attacking session cause the Web Scraping detected violation to occur until the end of the attack or until the prevention duration expires.
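
The following Python sketch shows how the Detection Criteria in step 5 can combine, using the defaults from this procedure. It is an illustrative reading of the Important note in that step, not ASM code.

    ABOVE_NORMAL_PERCENT = 500       # Session transactions above normal by
    TRANSACTIONS_REACHED = 400       # Session transactions reached
    MINIMUM_TRANSACTIONS = 200       # Minimum session transactions threshold for detection

    def is_transaction_anomaly(session_transactions: int, site_average: float) -> bool:
        """site_average is the average transactions per session for the whole site over the last hour."""
        if session_transactions < MINIMUM_TRANSACTIONS:
            return False             # minimum threshold not reached: never considered an attack
        above_normal_percent = (float("inf") if site_average == 0
                                else (session_transactions - site_average) / site_average * 100)
        return (above_normal_percent >= ABOVE_NORMAL_PERCENT
                or session_transactions >= TRANSACTIONS_REACHED)

    # Example: with a site average of 30 transactions per session,
    # is_transaction_anomaly(250, 30) -> True (minimum met and roughly 733% above normal).
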
After setting up web scraping detection based on session transactions, you can also set up bot detection, fingerprinting, and session opening anomaly detection for the same security policy.

Using fingerprinting to detect web scraping

Application Security Manager™ (ASM) can identify web scraping attacks on web sites that ASM™ protects by using information gathered about clients through fingerprinting or persistent identification. Fingerprinting collects browser attributes and uses them to associate the detected behavior with a specific browser. The system can use the collected information to identify suspicious clients (potential bots) and recognize web scraping attacks more quickly.
  1. On the Main tab, click Security > Application Security > Anomaly Detection > Web Scraping .
    The Web Scraping screen opens.
  2. In the Current edited policy list near the top of the screen, verify that the edited security policy is the one you want to work on.
  3. To have the system detect browsers and bots by collecting browser attributes, select the Fingerprinting Usage check box.
    The screen enables fingerprinting and displays the Suspicious Clients setting, which works together with the fingerprinting feature.
  4. If you want the system to detect suspicious clients using fingerprinting data, for the Suspicious Clients setting, select Alarm and Block.
    The system displays a new Suspicious Clients tab.
  5. To configure how the system determines which clients are suspicious, adjust the settings on the Suspicious Clients tab:
    1. For the Scraping Plugins setting, select the Detect browsers with Scraping Extensions check box, and move the browser extensions you do not want to allow to the Disallowed Extensions list.
      If ASM detects a browser with a disallowed extension, the client is considered suspicious, and ASM logs and blocks requests from this client to the web application.
    2. In the Prevention Duration field, type the number of seconds for which the system prevents requests from a client after ASM determines it to be suspicious.
  6. Click Save to save your settings.
  7. To put the security policy changes into effect immediately, click Apply Policy.
The system now collects browser attributes to help with web scraping detection, and tracks behavior by browser rather than session. This way a bot cannot stay undetected by opening a new session. If you also enabled the Suspicious Clients setting, when the system detects suspicious clients using information obtained by fingerprinting, the system records the attack data, and blocks the suspicious requests.
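
The general idea behind fingerprinting can be sketched as hashing a set of collected browser attributes into a stable identifier, so behavior is tracked per browser rather than per session. The Python sketch below is purely illustrative; the attribute names are assumptions, and this chapter does not document which attributes ASM actually collects or how it combines them.

    import hashlib
    import json

    def browser_fingerprint(attributes: dict) -> str:
        """Derive a stable identifier from collected browser attributes."""
        canonical = json.dumps(attributes, sort_keys=True)            # order-independent encoding
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

    # Two requests presenting the same attributes map to the same fingerprint, even if the
    # client deleted its cookies and opened a new session.
    fingerprint = browser_fingerprint({
        "user_agent": "Mozilla/5.0 ...",          # hypothetical attribute values
        "screen_resolution": "1920x1080",
        "timezone_offset": -120,
        "plugins": ["pdf-viewer"],
    })
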
In addition to using fingerprinting, you can also set up bot detection, session opening, and session transactions anomaly detection for the same security policy.

Displaying web scraping event logs

You can display event logs to see whether web scraping attacks have occurred, and view information about the attacks.
  1. On the Main tab, click Security > Event Logs > Application > Web Scraping Statistics .
    The Web Scraping Statistics event log opens.
  2. If the log is long, you can filter the list by security policy and time period to show more specific entries.
  3. Review the list of web scraping attacks to see the web scraping attack type that occurred, the IP address of the client that caused the attack, which security policy detected the attack, and the start and end times of the attack.
  4. Examine the web scraping statistics shown, and click the attack type links to see what caused the attack.
  5. To learn more about the requests that caused the web scraping attack, click the number of violating requests.
    The Requests screen opens where you can investigate the requests that caused the web scraping attacks.

Web scraping attack examples

This figure shows a Web Scraping Statistics event log on an Application Security Manager™ (ASM) system where several web scraping attacks, with different attack types, have occurred.

Web scraping statistics event log

The next figure shows details on a web scraping attack started on November 17 at 7:14PM. The attack type was Session Resets by Persistent Client Identification, and it occurred when the number of cookie deletions detected through the use of fingerprinting exceeded the configured threshold.

Example cookie deletion attack (fingerprinting)

The next figure shows details on a web scraping attack started on November 17 at 7:20PM. The attack type was Session Resets by Persistent Client Identification. It occurred when the number of cookie deletions detected through the use of persistent client identification exceeded the configured threshold (more than 2 in 4 seconds).

Example cookie deletion attack (persistent client ID)

The next figure shows details on a web scraping attack started on November 17 at 7:24PM. The attack type was Session Resets by Persistent Client Identification. It occurred when the number of integrity fault events detected through the use of persistent client identification exceeded the configured threshold (more than 3 in 25 seconds).

Example device ID integrity attack

The next figure shows details on a suspicious clients attack that occurred when a client installed the disallowed Scraper browser plug-in.

Example disallowed plug-in attack

Web scraping attack types

Web scraping statistics specify the attack type so that you have more information about why the attack occurred. The following describes the web scraping attack types that can appear in the web scraping event log.

Attack Type Description
Bot activity detected: Indicates that there are more JavaScript injections than JavaScript replies. Click the attack type link to display the detected injection ratio and the injection ratio threshold. (A sketch of this ratio follows the list.)
Note: You cannot configure the Bot activity detected ratio values. This attack type can occur only when the security policy is in Transparent mode.
Bot Detected: Indicates that the system suspects that the web scraping attack was caused by a web robot.
Session Opening Anomaly by IP: Indicates that the web scraping attack was caused by too many sessions being opened from one IP address. Click the attack type link to display the number of sessions opened per second from the IP address, the number of legitimate sessions, and the attack prevention state.
Session Resets by Persistent Client Identification: Indicates that the web scraping attack was caused by too many session resets or inconsistencies occurring within a specified time. Click the attack type link to display the number of resets or inconsistencies that occurred within a number of seconds.
Suspicious Clients: Indicates that the web scraping attack was caused by web scraping extensions on the browser. Click the attack type link to display the scraping extensions found in the browser.
Transactions per session anomaly: Indicates that the web scraping attack was caused by too many transactions being opened during one session. Click the attack type link to display the number of transactions detected on the session.
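
For the Bot activity detected attack type, the ratio compared against the threshold can be thought of as JavaScript injections per JavaScript reply. The following Python sketch is an assumption about how such a ratio can be computed for illustration; it is not ASM code.

    def injection_ratio(js_injections: int, js_replies: int) -> float:
        """Injections per reply; a high ratio suggests clients that never execute the injected JavaScript."""
        return js_injections / max(js_replies, 1)

    # Example: 900 injections but only 100 replies gives a ratio of 9.0, which the system
    # would compare against the (non-configurable) injection ratio threshold shown in the log.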

Viewing web scraping statistics

Before you can look at the web scraping attack statistics, you need to have configured web scraping protection.
You can display charts that show information about web scraping attacks that have occurred against protected applications.
  1. On the Main tab, click Security > Reporting > Application > Web Scraping Statistics .
    The Web Scraping Statistics screen opens.
  2. From the Time Period list, select the time period for which you want to view information about web scraping attacks.
  3. If you want to export the report to a file or send it by email, click Export and select the options.
    To send reports by email, you need to specify an SMTP configuration ( System > Configuration > Device > SMTP ).
The statistics show the total number of web scraping attacks, violations, and rejected requests that occurred. You can review the details about the attacks and see that mitigation is in place.

Web scraping statistics chart

This figure shows a Web Scraping Statistics chart on an Application Security Manager™ (ASM) test system where many web scraping attacks occurred during a short period of time.

Web scraping statistics chart

You can use this chart to see the number of rejected requests, web scraping attacks, and total violations that occurred on the web applications protected using the five security policies listed at the bottom.

Implementation Result

When you have completed the steps in this implementation, you have configured the Application Security Manager™ to protect against web scraping. The system examines mouse and keyboard activity for non-human actions. Depending on your configuration, the system detects web scraping attacks based on bot detection, session opening violations, session transaction violations, and fingerprinting.

After traffic is flowing to the system, you can check whether web scraping attacks are being logged or prevented, and investigate them by viewing web scraping event logs and statistics.

If fingerprinting is enabled, the system uses browser attributes to help with detecting web scraping. If you are using fingerprinting with the Suspicious Clients setting set to Alarm and Block, the system collects browser attributes and blocks suspicious requests using information obtained by fingerprinting. If you enabled event sequence enforcement, the system looks for irregular event sequences to detect bots.

If you chose Alarm and Block for the web scraping configuration and the security policy is in the Blocking enforcement mode, the system drops requests that cause the Web scraping detected violation. If you chose Alarm only (or the policy is in Transparent mode), web scraping attacks are logged but not blocked.