Writing a Domain Specific Language for selecting rows from a table

I'm writing a server that I expect to be run by many different people, not all of whom I will have direct contact with. The servers will communicate with each other in a cluster. Part of the server's functionality involves selecting a small subset of rows from a potentially very large table. The exact choice of what rows are selected will need some tuning, and it's important that it's possible for the person running the cluster (eg, myself) to update the selection criteria without getting each and every server administrator to deploy a new version of the server.

Simply writing the function in Python isn't really an option, since nobody is going to want to install a server that downloads and executes arbitrary Python code at runtime.

What I need are suggestions on the simplest way to implement a Domain Specific Language to achieve this goal. The language needs to be capable of simple expression evaluation, as well as querying table indexes and iterating through the returned rows. Ease of writing and reading the language is secondary to ease of implementing it. I'd also prefer not to have to write an entire query optimiser, so something that explicitly specifies what indexes to query would be ideal.

The interface that this will have to compile against will be similar in capabilities to what the App Engine datastore exports: You can query for sequential ranges on any index on the table (eg, less-than, greater-than, range and equality queries), then filter the returned row by any boolean expression. You can also concatenate multiple independent result sets together.

I realise this question sounds a lot like I'm asking for SQL. However, I don't want to require that the datastore backing this data be a relational database, and I don't want the overhead of trying to reimplement SQL myself. I'm also dealing with only a single table with a known schema. Finally, no joins will be required. Something much simpler would be far preferable.

Edit: Expanded description to clear up some misconceptions.

Asked by: Julian831 | Posted: 05-10-2021

Answer 1

Building a DSL to be interpreted by Python.

Step 1. Build the run-time classes and objects. These classes will have all the cursor loops and SQL statements and all of that algorithmic processing tucked away in their methods. You'll make heavy use of the Command and Strategy design patterns to build these classes. Most things are a command, options and choices are plug-in strategies. Look at the design for Apache Ant's Task API -- it's a good example.

Step 2. Validate that this system of objects actually works. Be sure that the design is simple and complete. You're tests will construct the Command and Strategy objects, and then execute the top-level Command object. The Command objects will do the work.

At this point you're largely done. Your run-time is just a configuration of objects created from the above domain. [This isn't as easy as it sounds. It requires some care to define a set of classes that can be instantiated and then "talk among themselves" to do the work of your application.]

Note that what you'll have will require nothing more than declarations. What's wrong with procedural? One you start to write a DSL with procedural elements, you find that you need more and more features until you've written Python with different syntax. Not good.

Further, procedural language interpreters are simply hard to write. State of execution, and scope of references are simply hard to manage.

You can use native Python -- and stop worrying about "getting out of the sandbox". Indeed, that's how you'll unit test everything, using a short Python script to create your objects. Python will be the DSL.

["But wait", you say, "If I simply use Python as the DSL people can execute arbitrary things." Depends on what's on the PYTHONPATH, and sys.path. Look at the site module for ways to control what's available.]

A declarative DSL is simplest. It's entirely an exercise in representation. A block of Python that merely sets the values of some variables is nice. That's what Django uses.

You can use the ConfigParser as a language for representing your run-time configuration of objects.

You can use JSON or YAML as a language for representing your run-time configuration of objects. Ready-made parsers are totally available.

You can use XML, too. It's harder to design and parse, but it works fine. People love it. That's how Ant and Maven (and lots of other tools) use declarative syntax to describe procedures. I don't recommend it, because it's a wordy pain in the neck. I recommend simply using Python.

Or, you can go off the deep-end and invent your own syntax and write your own parser.

Answered by: Ada651 | Posted: 06-11-2021

Answer 2

I think we're going to need a bit more information here. Let me know if any of the following is based on incorrect assumptions.

First of all, as you pointed out yourself, there already exists a DSL for selecting rows from arbitrary tables-- it is called "SQL". Since you don't want to reinvent SQL, I'm assuming that you only need to query from a single table with a fixed format.

If this is the case, you probably don't need to implement a DSL (although that's certainly one way to go); it may be easier, if you are used to Object Orientation, to create a Filter object.

More specifically, a "Filter" collection that would hold one or more SelectionCriterion objects. You can implement these to inherit from one or more base classes representing types of selections (Range, LessThan, ExactMatch, Like, etc.) Once these base classes are in place, you can create column-specific inherited versions which are appropriate to that column. Finally, depending on the complexity of the queries you want to support, you'll want to implement some kind of connective glue to handle AND and OR and NOT linkages between the various criteria.

If you feel like it, you can create a simple GUI to load up the collection; I'd look at the filtering in Excel as a model, if you don't have anything else in mind.

Finally, it should be trivial to convert the contents of this Collection to the corresponding SQL, and pass that to the database.

However: if what you are after is simplicity, and your users understand SQL, you could simply ask them to type in the contents of a WHERE clause, and programmatically build up the rest of the query. From a security perspective, if your code has control over the columns selected and the FROM clause, and your database permissions are set properly, and you do some sanity checking on the string coming in from the users, this would be a relatively safe option.

Answered by: Jared544 | Posted: 06-11-2021

Answer 3

"implement a Domain Specific Language"

"nobody is going to want to install a server that downloads and executes arbitrary Python code at runtime"

I want a DSL but I don't want Python to be that DSL. Okay. How will you execute this DSL? What runtime is acceptable if not Python?

What if I have a C program that happens to embed the Python interpreter? Is that acceptable?

And -- if Python is not an acceptable runtime -- why does this have a Python tag?

Answered by: Lucas931 | Posted: 06-11-2021

Answer 4

Why not create a language that when it "compiles" it generates SQL or whatever query language your datastore requires ?

You would be basically creating an abstraction over your persistence layer.

Answered by: Kellan308 | Posted: 06-11-2021

Answer 5

You mentioned Python. Why not use Python? If someone can "type in" an expression in your DSL, they can type in Python.

You'll need some rules on structure of the expression, but that's a lot easier than implementing something new.

Answered by: Emma434 | Posted: 06-11-2021

Answer 6

You said nobody is going to want to install a server that downloads and executes arbitrary code at runtime. However, that is exactly what your DSL will do (eventually) so there probably isn't that much of a difference. Unless you're doing something very specific with the data then I don't think a DSL will buy you that much and it will frustrate the users who are already versed in SQL. Don't underestimate the size of the task you'll be taking on.

To answer your question however, you will need to come up with a grammar for your language, something to parse the text and walk the tree, emitting code or calling an API that you've written (which is why my comment that you're still going to have to ship some code).

There are plenty of educational texts on grammars for mathematical expressions you can refer to on the net, that's fairly straight forward. You may have a parser generator tool like ANTLR or Yacc you can use to help you generate the parser (or use a language like Lisp/Scheme and marry the two up). Coming up with a reasonable SQL grammar won't be easy. But google 'BNF SQL' and see what you come up with.

Best of luck.

Answered by: Carlos993 | Posted: 06-11-2021

Answer 7

It really sounds like SQL, but perhaps it's worth to try using SQLite if you want to keep it simple?

Answered by: Cherry271 | Posted: 06-11-2021

Answer 8

It sounds like you want to create a grammar not a DSL. I'd look into ANTLR which will allow you to create a specific parser that will interpret text and translate to specific commands. ANTLR provides libraries for Python, SQL, Java, C++, C, C# etc.

Also, here is a fine example of an ANTLR calculation engine created in C#

Answered by: Kelvin555 | Posted: 06-11-2021

Answer 9

A context-free grammar usually has a tree like structure and functional programs have a tree like structure too. I don't claim the following would solve all of your problems, but it is a good step in the direction if you are sure that you don't want to use something like SQLite3.

from functools import partial
def select_keys(keys, from_):
    return ({k : fun(v, row) for k, (v, fun) in keys.items()}
            for row in from_)

def select_where(from_, where):
    return (row for row in from_
            if where(row))

def default_keys_transform(keys, transform=lambda v, row: row[v]):
    return {k : (k, transform) for k in keys}

def select(keys=None, from_=None, where=None):
    SELECT v1 AS k1, 2*v2 AS k2 FROM table WHERE v1 = a AND v2 >= b OR v3 = c

    translates to 

    select(dict(k1=(v1, lambda v1, r: r[v1]), k2=(v2, lambda v2, r: 2*r[v2])
        , from_=table
        , where= lambda r : r[v1] = a and r[v2] >= b or r[v3] = c)
    assert from_ is not None
    idfunc = lambda k, t : t
    select_k = idfunc if keys is None  else select_keys
    if isinstance(keys, list):
        keys = default_keys_transform(keys)
    idfunc = lambda t, w : t
    select_w = idfunc if where is None else select_where
    return select_k(keys, select_w(from_, where))

How do you make sure that you are not giving users ability to execute arbitrary code. This framework admits all possible functions. Well, you can right a wrapper over it for security that expose a fixed list of function objects that are acceptable.

ALLOWED_FUNCS = [ operator.mul, operator.add, ...] # List of allowed funcs

def select_secure(keys=None, from_=None, where=None):
    if keys is not None and isinstance(keys, dict):
       for v, fun keys.values:
           assert fun in ALLOWED_FUNCS
    if where is not None:
       assert_composition_of_allowed_funcs(where, ALLOWED_FUNCS)
    return select(keys=keys, from_=from_, where=where)

How to write assert_composition_of_allowed_funcs. It is very difficult to do that it in python but easy in lisp. Let us assume that where is a list of functions to be evaluated in a lips like format i.e. where=(operator.add, (operator.getitem, row, v1), 2) or where=(operator.mul, (operator.add, (opreator.getitem, row, v2), 2), 3).

This makes it possible to write a apply_lisp function that makes sure that the where function is only made up of ALLOWED_FUNCS or constants like float, int, str.

def apply_lisp(where, rowsym, rowval, ALLOWED_FUNCS):
    assert where[0] in ALLOWED_FUNCS
    return apply(where[0],
          [ (apply_lisp(w, rowsym, rowval, ALLOWED_FUNCS)
            if isinstance(w, tuple)
            else rowval if w is rowsym
            else w if isinstance(w, (float, int, str))
            else None ) for w in where[1:] ])

Aside, you will also need to check for exact types, because you do not want your types to be overridden. So do not use isinstance, use type in (float, int, str). Oh boy we have run into:

Greenspun's Tenth Rule of Programming: any sufficiently complicated C or Fortran program contains an ad hoc informally-specified bug-ridden slow implementation of half of Common Lisp.

Answered by: Rafael461 | Posted: 06-11-2021

Similar questions

python - Selecting specific column in each row from array

I am trying to select specific column elements for each row of a numpy array. For example, in the following example: In [1]: a = np.random.random((3,2)) Out[1]: array([[ 0.75670668, 0.1283942 ], [ 0.51326555, 0.59378083], [ 0.03219789, 0.53612603]]) I would like to select the first element of the first row, the second element of the second row, and the first element of th...

python - Selecting specific users

I have a a model that stores users: class SubItem(models.Model): created = models.DateTimeField(default=datetime.now) created_by = models.ForeignKey(User) I have a group in the auth admin section called 'advisors'. How do I only show the advisors in the pulldown that django creates. (I have a feeling I have to override the init in forms.py)

Selecting specific text from a webpage using Python

Although I love the program, I've gotten extremely tired of Calibre's weekly updating habit. To counteract that problem I'm trying to work with a python script that will automate the process. I have successfully opened the document, but I have trouble figuring out how to capture a specific piece of it for a string. Since Calibre's download link depends on the version number that needs to be retrieved. Currently ...

parsing - Selecting specific columns from df -h output in python

I'm trying to create a simple script that will select specific columns from the unix df - h command. I can use awk to do this but how can we do this in python? Here is df -h output: Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg_base-lv_root 28G 4.8G 22G 19% / tmpfs 814M 176K 814M 1% /dev/shm /dev/sda1 ...

python csv reader selecting specific rows

Suppose we have a text file as given below: sfgsdgfs >sfsf > "assfgs.jpg">sggw.sgw sgsdfghsg>sdgsgsgsg[] werw>"erqwer.jpg">egfwrewrw How to extract the rows that contain .jpg? What is wrong with the following code? import csv data = csv.reader (open ('outfile.txt', 'r'), delimiter = '"') for row in data: if '.jpg' in row: print (row)

Python - Selecting specific values from CSV web page

import csv import urllib url = "https://www.national-lottery.co.uk/player/lotto/results/downloadResultsCSV.ftl" webpage = urllib.urlopen(url) datareader = csv.reader((webpage),delimiter=',') I am trying to select specific values from the CSV file on this URL. However I can't seem to select values from the second row of the file, instead it seems to be giving me just values from the top row, i.e. B...

Python CSV: selecting only specific column headers

Regarding the csv module in Python: Example - the content of the first file: file1.csv, file2.csv, header1, header2 file3.csv, file4.csv, header1, header2, header5 file5.csv, file6.csv, header2 I read the first file (file1.csv) and need to only export the columns with header1 and header2. For the next file (file3.csv), I need to only export header1, header2 and header5 to ...

python - Opening a file and selecting specific column in file

Hello all I am new to python and really need help I have a set of data as below all large values are column 1 and xx.x are column 2 19600110 28.6 19600111 28.9 19600112 29.2 19600113 28.6 19600114 28.6 19600115 28.4 19600116 28.6 19600117 28.6 stored as station.txt I am trying to get python to only present the first column of data (19600115 etc) which is labelled dates.I ope...

Selecting specific words from a file in Python

I am completly new to python, and I need a specific answer. I need to print only the authors name from a list of reference in a file by using python code.

python - Selecting specific lists in a file

I've been working with a Python dictionary to replace md5 values with COG/NOG identifiers. This is what I have done so far... #!/usr/bin/python import sys fil = sys.argv[1] # load md5 -> COG into dictionary with open(fil) as fin: rows = ( line.strip().split('\t') for line in fin ) d = { row[0]:row[1] for row in rows } # open blast output, replace md5 with COG by looking up md5 in the dictio...

asp.net - Python selecting a value in a combo box and HTTP POST

In Python, I'm trying to read the values on http://utahcritseries.com/RawResults.aspx. How can I read years other than the default of 2002? So far, using mechanize, I've been able to reference the SELECT and list all of its available options/values but am unsure how to change its value and resubmit the form. I'm sure this is ...

python - Selecting related objects in django

I have following problem: My application have 2 models: 1) class ActiveList(models.Model): user = models.ForeignKey(User, unique=True) updatedOn = models.DateTimeField(auto_now=True) def __unicode__(self): return self.user.username ''' GameClaim class, to store game requests. ''' class GameClaim(models.Model): me = models.ForeignKey(ActiveList, related_name='gameclai...

python - Selecting rows from a NumPy ndarray

I want to select only certain rows from a NumPy array based on the value in the second column. For example, this test array has integers from 1 to 10 in the second column. >>> test = numpy.array([numpy.arange(100), numpy.random.randint(1, 11, 100)]).transpose() >>> test[:10, :] array([[ 0, 6], [ 1, 7], [ 2, 1...

python - Selecting indices for a 2d array in numpy

This works quite well in 1 dimension: # This will sort bar by the order of the values in foo (Pdb) bar = np.array([1,2,3]) (Pdb) foo = np.array([5,4,6]) (Pdb) bar[np.argsort(foo)] array([2, 1, 3]) But how do I do that in two dimensions? Argsort works nicely, but the select no longer works: (Pdb) foo = np.array([[5,4,6], [9,8,7]]) (Pdb) bar = np.array([[1,2,3], [1,2,3]]) (P...

python - Selecting specific column in each row from array

I am trying to select specific column elements for each row of a numpy array. For example, in the following example: In [1]: a = np.random.random((3,2)) Out[1]: array([[ 0.75670668, 0.1283942 ], [ 0.51326555, 0.59378083], [ 0.03219789, 0.53612603]]) I would like to select the first element of the first row, the second element of the second row, and the first element of th...

python - Randomly selecting lines from files

I have bunch of files and very file has a header of 5 lines. In the rest of the file, pair of line form an entry. I need to randomly select entry from these files. How can i select random files and random entry(pair of line, excluding header) ?

python - Selecting Widgets

In Tkinter I'm trying to make it so when a command is run a widget is automatically selected, so that a one may bind events to the newly selected widget. Basically I want it so when I press a button a text widget appears. When it appears normally one would have to click the text widget to facilitate the running of events bound to the text widget. I want that behavior to automatically happen when the user clicks th...

python - Django - Selecting related set : how many times does it hit the database?

I took this sample code here : Django ORM: Selecting related set polls = Poll.objects.filter(category='foo') choices = Choice.objects.filter(poll__in=polls) My question is very simple : do you hit twice the database when you finally use the queryset choices ?

python - Selecting and printing specific rows of text file

I have a very large (~8 gb) text file that has very long lines. I would like to pull out lines in selected ranges of this file and put them in another text file. In fact my question is very similar to this and this but I keep gettin...

mysql - Selecting a Python Web Framework

This may seem like a subjective question. But it is not (that's not the idea, at least). I'm developing an Advertising software (like AdWords, AdBrite, etc) and i've decide to use Python. And would like to use one of those well known web frameworks (Django, Cherrypy, pylons, etc). The question is: Given that it will have just a few Models (seven or eight), which has the best cache support? and What ...

Still can't find your answer? Check out these communities...

PySlackers | Full Stack Python | NHS Python | Pythonist Cafe | Hacker Earth | Discord Python