What is an appropriate way to datamine the total number of results of a keyword search?
newbie programmer and lurker here, hoping for some sensible advice. :)
Using a combination of Python, BeautifulSoup, and the Bing API, I was able to find what I wanted with the following code:
    import urllib2
    from BeautifulSoup import BeautifulStoneSoup

    Appid = "YOUR_APPID"  # My Appid
    query = "YOUR_QUERY"  # My query

    # Request the XML results and read the total result count from the web:total element
    url = ("http://api.search.live.net/xml.aspx?Appid=" + Appid +
           "&query=" + query + "&sources=web")
    soup = BeautifulStoneSoup(urllib2.urlopen(url))
    totalResults = soup.find('web:total').text
I'd like to do this across a few thousand search terms and was wondering:
- whether doing this request a thousand times would be construed as hammering the server,
- what steps I should take to avoid hammering said servers (what are the best practices?), and
- whether there is a cheaper way (in terms of data) to do this using any of the major search engine APIs.
It just seems unnecessarily expensive to grab all that data just to grab one number per keyword and I was wondering if I missed anything.
FWIW, I did some homework and tried the Google Search API (deprecated) and Yahoo's BOSS API (soon to be deprecated and replaced with a paid service) before settling on the Bing API. I understand direct scraping of a page is considered poor form, so I'll pass on scraping search engines directly.
Asked by: Chloe515 | Posted: 30-11-2021
There are three approaches I can think of that have helped previously when I had to do large-scale URL resolution.
- HTTP Pipelining (another snippet here)
- Rate-limiting server requests per IP (i.e., each IP can only issue 3 requests / second). Some suggestions can be found here: How to limit rate of requests to web services in Python?
- Issuing requests through an internal proxy service, using http_proxy to redirect all requests to said service. This proxy service will then iterate over a set of network interfaces and issue rate-limited requests. You can use Twisted for that.
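As a rough illustration of the rate-limiting option above, here is a minimal sketch of a throttle that spaces calls evenly. The class name `RateLimiter` and its `clock` and `sleep` parameters are hypothetical names chosen for this example, not part of any library, and the code is written for Python 3 rather than the Python 2 used in the question.

```python
import time


class RateLimiter:
    """Allow at most `max_per_second` calls per second by spacing them evenly."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        # Minimum number of seconds that must separate two consecutive calls.
        self.min_interval = 1.0 / max_per_second
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

Calling `limiter.wait()` immediately before each request keeps a single process under the chosen rate. It does nothing to coordinate several processes or machines sharing one IP, which is where the proxy approach comes in.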
With regard to your question 1, Bing has an API Basics PDF file that summarizes the terms and conditions in human-readable form. Its "What you must do" section includes the following statement:
Restrict your usage to less than 7 queries per second (QPS) per IP address. You may be permitted to exceed this limit under some conditions, but this must be approved through discussion with email@example.com.
If this is just a one-off script, you don't need to do anything more complex than adding a sleep between requests, so that you're making only a couple of requests a second. If the situation is more complex, e.g. these requests are being made as part of a web service, the suggestions in Mahmoud Abdelkader's answer should help you.
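For the one-off case, the sleep-between-requests idea might look something like the sketch below. `collect_totals` and `fetch_total` are hypothetical names: `fetch_total` stands for whatever function wraps the request from the question and returns the total for one term. The default `delay` of 0.5 seconds is an assumption that gives about two requests per second, well under the 7 QPS limit quoted above. The code is written for Python 3.

```python
import time


def collect_totals(terms, fetch_total, delay=0.5, sleep=time.sleep):
    """Return {term: total}, pausing `delay` seconds between requests."""
    totals = {}
    for i, term in enumerate(terms):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        totals[term] = fetch_total(term)
    return totals
```

With a few thousand terms and a half-second pause, a full run would take on the order of tens of minutes, which is usually acceptable for a one-off job.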