Why are SQL aggregate functions so much slower than Python and Java (or Poor Man's OLAP)

I need a real DBA's opinion. Postgres 8.3 takes 200 ms to execute this query on my Macbook Pro while Java and Python perform the same calculation in under 20 ms (350,000 rows):

SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;

Is this normal behaviour when using a SQL database?

The schema (the table holds responses to a survey):

CREATE TABLE tuples (id integer primary key, a integer, b integer, c integer, d integer);

\copy tuples from '350,000 responses.csv' delimiter as ','

I wrote some tests in Java and Python for context and they crush SQL (except for pure python):

java   1.5 threads ~ 7 ms    
java   1.5         ~ 10 ms    
python 2.5 numpy   ~ 18 ms  
python 2.5         ~ 370 ms
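For context, the "pure python" variant of the benchmark described above (generate four arrays, time only the averaging step) might look roughly like this; the column count matches the schema, but the value range is an assumption:

```python
import random
import time

N = 350_000

# Generate four integer columns up front; generation is excluded from
# the timing, matching the methodology described in the question.
cols = [[random.randint(0, 100) for _ in range(N)] for _ in range(4)]

start = time.perf_counter()
count = len(cols[0])
averages = [sum(c) / len(c) for c in cols]
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"count={count} averages={averages} elapsed={elapsed_ms:.1f} ms")
```

The numpy variant replaces the list comprehensions with `np.random.randint` and `arr.mean()`, which is where the ~20x speedup over pure Python comes from.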

Even sqlite3 is competitive with Postgres despite assuming all columns are strings (for contrast: just switching from integer to numeric columns in Postgres results in a 10x slowdown)

Tunings I've tried without success (blindly following some advice from the web):

increased the shared memory available to Postgres to 256MB    
increased the working memory to 2MB
disabled connection and statement logging
used a stored procedure via CREATE FUNCTION ... LANGUAGE SQL

So my question is, is my experience here normal, and this is what I can expect when using a SQL database? I can understand that ACID must come with costs, but this is kind of crazy in my opinion. I'm not asking for realtime game speed, but since Java can process millions of doubles in under 20 ms, I feel a bit jealous.

Is there a better way to do simple OLAP on the cheap (both in terms of money and server complexity)? I've looked into Mondrian and Pig + Hadoop, but I'm not super excited about maintaining yet another server application and I'm not sure they would even help.

No the Python code and Java code do all the work in house so to speak. I just generate 4 arrays with 350,000 random values each, then take the average. I don't include the generation in the timings, only the averaging step. The java threads timing uses 4 threads (one per array average), overkill but it's definitely the fastest.

The sqlite3 timing is driven by the Python program and is running from disk (not :memory:)
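A sqlite3 harness like the one described, driven from Python, might look like this (using `:memory:` here so the sketch is self-contained; swap in a file path to reproduce the on-disk variant, and note the row count and values are placeholders):

```python
import sqlite3

# Build a table matching the question's schema and run the same aggregate query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tuples (id INTEGER PRIMARY KEY, "
    "a INTEGER, b INTEGER, c INTEGER, d INTEGER)"
)
conn.executemany(
    "INSERT INTO tuples VALUES (?, ?, ?, ?, ?)",
    [(i, i % 7, i % 11, i % 13, i % 17) for i in range(10_000)],  # smaller sample
)

row = conn.execute(
    "SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples"
).fetchone()
print(row)
```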

I realize Postgres is doing much more behind the scenes, but most of that work doesn't matter to me since this is read only data.

The Postgres query doesn't change timing on subsequent runs.

I've rerun the Python tests to include spooling the data off the disk. The timing slows down considerably, to nearly 4 seconds. But I'm guessing that Python's file handling code is mostly C (though maybe not the csv lib?), so this indicates to me that Postgres isn't streaming from the disk either (or you are correct and I should bow down before whoever wrote their storage layer!)

Asked by: Madaline906 | Posted: 27-01-2022

Answer 1

I would say your test scheme is not really useful. To fulfill the db query, the db server goes through several steps:

  1. parse the SQL
  2. work up a query plan, i.e. decide which indices to use (if any), optimize, etc.
  3. if an index is used, search it for the pointers to the actual data, then go to the appropriate location in the data or
  4. if no index is used, scan the whole table to determine which rows are needed
  5. load the data from disk into a temporary location (hopefully, but not necessarily, memory)
  6. perform the count() and avg() calculations

So, creating an array in Python and getting the average basically skips all these steps save the last one. As disk I/O is among the most expensive operations a program has to perform, this is a major flaw in the test (see also the answers to this question I asked here before). Even if you read the data from disk in your other test, the process is completely different and it's hard to tell how relevant the results are.

To obtain more information about where Postgres spends its time, I would suggest the following tests:

  • Compare the execution time of your query to a SELECT without the aggregate functions (i.e. cut step 6)
  • If you find that the aggregation causes a significant slowdown, test whether Python does it faster by obtaining the raw data through the plain SELECT from the comparison.
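One way to run that comparison from Python (sketched here with sqlite3 standing in for Postgres; against Postgres itself you would use a driver such as psycopg2, and the row count and values are placeholders):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tuples (id INTEGER PRIMARY KEY, "
    "a INTEGER, b INTEGER, c INTEGER, d INTEGER)"
)
conn.executemany(
    "INSERT INTO tuples VALUES (?, ?, ?, ?, ?)",
    [(i, 1, 2, 3, 4) for i in range(50_000)],
)

# Aggregation done inside the database engine.
t0 = time.perf_counter()
db_row = conn.execute("SELECT count(id), avg(a), avg(b) FROM tuples").fetchone()
t_db = time.perf_counter() - t0

# Plain SELECT: pull the raw rows out and aggregate client-side in Python.
t0 = time.perf_counter()
rows = conn.execute("SELECT a, b FROM tuples").fetchall()
count = len(rows)
avg_a = sum(r[0] for r in rows) / count
avg_b = sum(r[1] for r in rows) / count
t_py = time.perf_counter() - t0

print(f"db:     {db_row} in {t_db * 1000:.1f} ms")
print(f"python: ({count}, {avg_a}, {avg_b}) in {t_py * 1000:.1f} ms")
```

The difference between the two timings isolates roughly how much of the total is the aggregation itself versus fetching and serializing the rows.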

To speed up your query, reduce disk access first. I doubt very much that it's the aggregation that takes the time.

There's several ways to do that:

  • Cache data (in memory!) for subsequent access, either via the db engine's own capabilities or with tools like memcached
  • Reduce the size of your stored data
  • Optimize the use of indices. Sometimes this means skipping indices altogether (after all, index lookups are disk access, too). For MySQL, I seem to remember that it's recommended to skip indices if you expect the query to fetch more than 10% of all the data in the table.
  • If your query makes good use of indices, I know that for MySQL databases it helps to put indices and data on separate physical disks. However, I don't know whether that's applicable for Postgres.
  • There also might be more sophisticated problems such as swapping rows to disk if for some reason the result set can't be completely processed in memory. But I would leave that kind of research until I run into serious performance problems that I can't find another way to fix, as it requires knowledge about a lot of little under-the-hood details in your process.


I just realized that you seem to have no use for indices in the above query and most likely aren't using any either, so my advice on indices probably wasn't helpful. Sorry. Still, I'd say that the aggregation is not the problem; disk access is. I'll leave the index advice in anyway; it might still be of some use.

Answered by: Daryl247 | Posted: 28-02-2022

Answer 2

Postgres is doing a lot more than it looks like (maintaining data consistency for a start!)

If the values don't have to be 100% spot on, or if the table is updated rarely, but you are running this calculation often, you might want to look into Materialized Views to speed it up.

(Note: I have not used materialized views in Postgres; they look a little hacky, but they might suit your situation.)


Also consider the overhead of actually connecting to the server and the round trip required to send the request to the server and back.

I'd consider 200 ms for something like this to be pretty good. A quick test on my Oracle server with the same table structure, about 500k rows, and no indexes takes about 1-1.5 seconds, which is almost all just Oracle sucking the data off disk.

The real question is, is 200ms fast enough?

-------------- More --------------------

I was interested in solving this with materialized views, since I've never really played with them. This is in Oracle.

First I created a MV which refreshes every minute.

create materialized view mv_so_x 
build immediate 
refresh complete 
 as select count(*),avg(a),avg(b),avg(c),avg(d) from so_x;

While it's refreshing, no rows are returned:

SQL> select * from mv_so_x;

no rows selected

Elapsed: 00:00:00.00

Once it refreshes, it's MUCH faster than running the raw query:

SQL> select count(*),avg(a),avg(b),avg(c),avg(d) from so_x;

  COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
---------- ---------- ---------- ---------- ----------
   1899459 7495.38839 22.2905454 5.00276131 2.13432836

Elapsed: 00:00:05.74
SQL> select * from mv_so_x;

  COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
---------- ---------- ---------- ---------- ----------
   1899459 7495.38839 22.2905454 5.00276131 2.13432836

Elapsed: 00:00:00.00

If we insert into the base table, the result is not immediately visible via the MV.

SQL> insert into so_x values (1,2,3,4,5);

1 row created.

Elapsed: 00:00:00.00
SQL> commit;

Commit complete.

Elapsed: 00:00:00.00
SQL> select * from mv_so_x;

  COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
---------- ---------- ---------- ---------- ----------
   1899459 7495.38839 22.2905454 5.00276131 2.13432836

Elapsed: 00:00:00.00

But wait a minute or so and the MV will update behind the scenes, and the result is returned as fast as you could want.

SQL> /

  COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
---------- ---------- ---------- ---------- ----------
   1899460 7495.35823 22.2905352 5.00276078 2.17647059

Elapsed: 00:00:00.00

This isn't ideal. For a start, it's not realtime: inserts/updates will not be immediately visible. Also, you've got a query running to update the MV whether you need it or not (the refresh can be tuned to whatever time frame, or run on demand). But this does show how much faster an MV can make things seem to the end user, if you can live with values that aren't quite up-to-the-second accurate.
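Since Postgres of that era lacked built-in materialized views, the same idea can be approximated at the application level. A minimal sketch (sqlite3 standing in for the database; the refresh interval and table contents are assumptions, not taken from the answer):

```python
import sqlite3
import time

REFRESH_SECONDS = 60  # assumed refresh interval, mirroring the MV above

_cache = {"stamp": 0.0, "row": None}

def cached_aggregates(conn):
    """Poor man's materialized view: recompute at most once per interval."""
    now = time.monotonic()
    if _cache["row"] is None or now - _cache["stamp"] > REFRESH_SECONDS:
        _cache["row"] = conn.execute(
            "SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples"
        ).fetchone()
        _cache["stamp"] = now
    return _cache["row"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tuples (id INTEGER PRIMARY KEY, "
    "a INTEGER, b INTEGER, c INTEGER, d INTEGER)"
)
conn.executemany(
    "INSERT INTO tuples VALUES (?, ?, ?, ?, ?)",
    [(i, 1, 2, 3, 4) for i in range(1000)],
)

print(cached_aggregates(conn))  # first call computes the aggregates
print(cached_aggregates(conn))  # second call is served from the cache
```

This has the same staleness trade-off as the Oracle MV: readers may see values up to one refresh interval old.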

Answered by: Richard352 | Posted: 28-02-2022

Answer 3

I retested with MySQL specifying ENGINE = MEMORY and it doesn't change a thing (still 200 ms). sqlite3 with an in-memory db gives similar timings as well (250 ms).

The math here looks correct (at least the size, as that's how big the sqlite db is :-)

I'm just not buying the disk-causes-slowness argument as there is every indication the tables are in memory (the postgres guys all warn against trying too hard to pin tables to memory as they swear the OS will do it better than the programmer)

To clarify the timings: the Java code is not reading from disk, making it a totally unfair comparison if Postgres is reading from disk and computing a complicated query. But that's really beside the point; the DB should be smart enough to bring a small table into memory and precompile a stored procedure, IMHO.

UPDATE (in response to the first comment below):

I'm not sure how I'd test the query without the aggregate functions in a way that would be fair, since selecting all of the rows would spend tons of time serializing and formatting everything. I'm not saying the slowness is due to the aggregate functions; it could still just be overhead from concurrency, integrity, and friends. I just don't know how to isolate the aggregation as the sole independent variable.

Answered by: Melanie303 | Posted: 28-02-2022

Answer 4

Those are very detailed answers, but they mostly dodge the question: how do I get these benefits without leaving Postgres, given that the data easily fits into memory, requires concurrent reads but no writes, and is queried with the same query over and over again?

Is it possible to precompile the query and optimization plan? I would have thought the stored procedure would do this, but it doesn't really help.

To avoid disk access it's necessary to cache the whole table in memory. Can I force Postgres to do that? I think it's already happening, though, since the query executes in just 200 ms on repeated runs.

Can I tell Postgres that the table is read only, so it can optimize any locking code?

I think it's possible to estimate the query construction costs by running against an empty table (timings range from 20-60 ms).

I still can't see why the Java/Python tests are invalid. Postgres just isn't doing that much more work (though I still haven't addressed the concurrency aspect, just the caching and query construction)

UPDATE: I don't think it's fair to compare the SELECTs as suggested: pulling 350,000 rows through the driver and serialization steps into Python to run the aggregation, or even omitting the aggregation, adds formatting and display overhead that's hard to separate from the timing. If both engines are operating on in-memory data it should be an apples-to-apples comparison, but I'm not sure how to guarantee that's already happening.

I can't figure out how to add comments; maybe I don't have enough reputation?

Answered by: Lana863 | Posted: 28-02-2022

Answer 5

I'm a MS-SQL guy myself, and we'd use DBCC PINTABLE to keep a table cached, and SET STATISTICS IO to see that it's reading from cache, and not disk.

I can't find anything on Postgres to mimic PINTABLE, but pg_buffercache seems to give details on what is in the cache - you may want to check that, and see if your table is actually being cached.

A quick back-of-the-envelope calculation makes me suspect that you're paging from disk. Assuming Postgres uses 4-byte integers, you have (5 * 4) = 20 bytes of column data per row (plus per-row overhead), so your table is a minimum of (20 * 350,000) bytes ~ 7 MB. Assuming 40 MB/s sustained throughput on your HDD, you're looking at right around 175-200 ms just to read the data (which, as pointed out, should be where almost all of the time is being spent).

Unless I screwed up my math somewhere, I don't see how it's possible that you are able to read ~7 MB into your Java app and process it in the times you're showing - unless that file is already cached by either the drive or your OS.
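The estimate is easy to check in a few lines (the 5-column count comes from the schema in the question; the 40 MB/s disk throughput is an assumed figure, and per-row storage overhead is deliberately ignored):

```python
# Back-of-envelope: minimum table size and time to read it from disk.
ROWS = 350_000
COLUMNS = 5        # id, a, b, c, d
BYTES_PER_INT = 4
DISK_MB_PER_S = 40  # assumed sustained HDD throughput

table_mb = ROWS * COLUMNS * BYTES_PER_INT / 1_000_000
read_ms = table_mb / DISK_MB_PER_S * 1000
print(f"~{table_mb:.1f} MB, ~{read_ms:.0f} ms to read")  # ~7.0 MB, ~175 ms
```

Real on-disk size is larger once Postgres's per-row header is included, so the disk-read estimate is a lower bound.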

Answered by: Elise125 | Posted: 28-02-2022

Answer 6

I don't think that your results are all that surprising -- if anything it is that Postgres is so fast.

Does the Postgres query run faster a second time once it has had a chance to cache the data? To be a little fairer your test for Java and Python should cover the cost of acquiring the data in the first place (ideally loading it off disk).

If this performance level is a problem for your application in practice but you need a RDBMS for other reasons then you could look at memcached. You would then have faster cached access to raw data and could do the calculations in code.

Answered by: Owen369 | Posted: 28-02-2022

Answer 7

One other thing that an RDBMS generally does for you is to provide concurrency by protecting you from simultaneous access by another process. This is done by placing locks, and there's some overhead from that.

If you're dealing with entirely static data that never changes, and especially if you're in a basically "single user" scenario, then using a relational database doesn't necessarily gain you much benefit.

Answered by: Grace596 | Posted: 28-02-2022

Answer 8

Are you using TCP to access Postgres? In that case, Nagle's algorithm may be messing with your timing.
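If Nagle's algorithm were the culprit, disabling it on the client socket is a one-liner. A sketch (illustrative only: real Postgres drivers manage their own sockets, and a local connection over a Unix domain socket avoids TCP entirely):

```python
import socket

# Create a TCP socket and disable Nagle's algorithm, so small writes are
# sent immediately instead of being coalesced into larger packets.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
```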

Answered by: Alissa392 | Posted: 28-02-2022

Answer 9

You need to increase Postgres' caches to the point where the whole working set fits into memory before you can expect to see performance comparable to doing it in memory with a program.

Answered by: Julian860 | Posted: 28-02-2022

Answer 10

Thanks for the Oracle timings, that's the kind of stuff I'm looking for (disappointing though :-)

Materialized views are probably worth considering as I think I can precompute the most interesting forms of this query for most users.

I don't think query round-trip time should be very high, as I'm running the queries on the same machine that runs Postgres, so it can't add much latency.

I've also done some checking into the cache sizes, and it seems Postgres relies on the OS to handle caching. They specifically mention BSD as the ideal OS for this, so I'm thinking Mac OS ought to be pretty smart about bringing the table into memory. Unless someone has more specific parameters in mind, I think the caching specifics are out of my control.

In the end I can probably put up with 200 ms response times, but knowing that 7 ms is a possible target makes me feel unsatisfied, as even 20-50 ms would enable more users to have more up-to-date queries and get rid of a lot of caching and precomputation hacks.

I just checked the timings using MySQL 5 and they are slightly worse than Postgres. So barring some major caching breakthroughs, I guess this is what I can expect going the relational db route.

I wish I could up vote some of your answers, but I don't have enough points yet.

Answered by: Victoria171 | Posted: 28-02-2022
