Most appropriate data structure (Python)

I'm new to Python and have what is probably a very basic question about the 'best' way to store data in my code. Any advice much appreciated!

I have a long .csv file in the following format (one row per scenario/year/month combination; the sample values here are illustrative):

Scenario,Year,Month,Value
1,1961,1,0.5
1,1961,2,0.2
...
My scenario values run from 1 to 100, year goes from 1961 to 1990 and month goes from 1 to 12. My file therefore has 100*30*12 = 36000 rows, each with an associated value.

I'd like to read this file into some kind of Python data structure so that I can access a 'Value' by specifying the 'Scenario', 'Year' and 'Month'. What's the best way to do this please (or what are the various options)?

In my head I think of this data as a kind of 'number cuboid' with axes for Scenario, Year and Month, so that each Value is located at co-ordinates (Scenario, Year, Month). For this reason, I'm tempted to try to read these values into a 3D numpy array and use Scenario, Year and Month as indices. Is this a sensible thing to do?
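The 3D-array idea is sensible. A minimal sketch of that "number cuboid", assuming the ranges from the question and mapping each real-world label onto a 0-based index (the helper names are illustrative):

```python
import numpy as np

# Axes are (scenario, year, month); offsets map labels to 0-based indices.
N_SCENARIOS, N_YEARS, N_MONTHS = 100, 30, 12
cube = np.zeros((N_SCENARIOS, N_YEARS, N_MONTHS))

def set_value(cube, scenario, year, month, value):
    """Store a value at its (scenario, year, month) coordinates."""
    cube[scenario - 1, year - 1961, month - 1] = value

def get_value(cube, scenario, year, month):
    """Look up a value by its real-world labels."""
    return cube[scenario - 1, year - 1961, month - 1]

set_value(cube, 1, 1961, 3, 4.2)
print(get_value(cube, 1, 1961, 3))
```

One advantage of this layout over a dict is that slicing comes for free, e.g. `cube[0]` is the whole 30x12 grid for scenario 1.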

I guess I could also make a dictionary where the keys are something like

(scenario, year, month)
Would this be better? Are there other options?

(By 'better' I suppose I mean 'faster to access', although if one method is much less memory intensive than another it'd be good to know about that too).

Thanks very much!

Asked by: Lily882 | Posted: 30-11-2021

Answer 1

I'd use a dict keyed by tuples. It's simple, fast, and a single hash-table look-up retrieves a value:

import csv

data = {}
with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # skip the header row
    for row in reader:
        # The first three columns (scenario, year, month) form the key
        key = tuple(int(v) for v in row[:-1])
        data[key] = float(row[-1])

# Retrieve a value
print(data[1, 1961, 3])

Answered by: Adelaide236 | Posted: 01-01-2022

Answer 2

I would use sqlite3 for storing the data to disk. You'll be able to read in the full data set or subsets through SQL queries. You can then load that data into a numpy array or other Python data structure -- whatever is most convenient for the task.

If you do choose to use sqlite, also note that sqlite has a TIMESTAMP data type. It may be a good idea to combine the year and month into one TIMESTAMP. When you read TIMESTAMPs into Python, sqlite3 can be told to automatically convert the TIMESTAMPs into datetime.datetime objects, which would reduce some of the boilerplate code you'd otherwise have to write. It will also make it easier to form SQL queries which ask for all the rows between two dates.
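A minimal sqlite3 sketch of this approach, keeping the year and month as separate columns for clarity (the table name and sample rows are illustrative; an in-memory database stands in for a file on disk):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # use a filename to persist to disk
conn.execute("""
    CREATE TABLE readings (
        scenario INTEGER,
        year     INTEGER,
        month    INTEGER,
        value    REAL,
        PRIMARY KEY (scenario, year, month)
    )
""")

# A couple of sample rows; in practice, executemany over the parsed CSV rows.
rows = [(1, 1961, 3, 4.2), (1, 1961, 4, 3.7)]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)

# Look up a single value by the triplet...
(value,) = conn.execute(
    "SELECT value FROM readings WHERE scenario=? AND year=? AND month=?",
    (1, 1961, 3),
).fetchone()
print(value)

# ...or pull a whole subset with an ordinary SQL query.
subset = conn.execute(
    "SELECT value FROM readings WHERE year=? ORDER BY month", (1961,)
).fetchall()
```

The composite primary key doubles as an index, so single-value look-ups stay fast even with all 36000 rows loaded.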

Answered by: John879 | Posted: 01-01-2022

Answer 3

sqlite is a nice option if you're going to access your values by different parameters each time.

If that's not the case, and you'll always access by this triplet (scenario, year, month), you can use a tuple (an immutable sequence) as your key, with the value as your value.

In code it would look like:

d = {}
d[1, 1961, 12] = 0.5

or in more generic loop code:

d[scenario, year, month] = value

later on you can just access it with:

print(d[scenario, year, month])

Python will automatically create the tuple for you.

Answered by: Kimberly825 | Posted: 01-01-2022

Answer 4

Make a dictionary of dictionaries of dictionaries like you described. If you need the data as numbers, convert them to numbers once when you read them and store the numbers in the dicts; it will be faster than using strings as keys. Let me know if you need help with the code.
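A minimal sketch of this nested-dict approach, assuming the same CSV layout as in the question (the sample rows stand in for the parsed file):

```python
from collections import defaultdict

# Three-level mapping: data[scenario][year][month] -> value.
# defaultdict saves creating each inner dict by hand.
def year_level():
    return defaultdict(dict)

data = defaultdict(year_level)

# Sample rows standing in for the parsed CSV lines (strings, as csv yields).
rows = [("1", "1961", "3", "4.2"), ("1", "1962", "1", "0.9")]
for scenario, year, month, value in rows:
    # Convert keys to int once, on read, as suggested above.
    data[int(scenario)][int(year)][int(month)] = float(value)

print(data[1][1961][3])
```

Compared with the flat tuple-keyed dict, this makes it easy to grab everything for one scenario (`data[1]`) or one year (`data[1][1961]`) without scanning all the keys.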

Answered by: David565 | Posted: 01-01-2022
