Finding appropriate cut-off values

I am trying to implement Hampel tanh estimators to normalize highly asymmetric data. To do this, I need to perform the following calculation:

Given x, a sorted list of numbers, and m, the median of x, I need to find a such that approximately 70% of the values in x fall into the range (m-a; m+a). Nothing is known about the distribution of the values in x. I am writing in Python with numpy, and the best idea I have had so far is some sort of stochastic iterative search (for example, the one described by Solis and Wets), but I suspect there is a better approach, either a better algorithm or a ready-made function. I searched the numpy and scipy documentation but couldn't find any useful hint.


Seth suggested using scipy.stats.mstats.trimboth; however, in my test on a skewed distribution this suggestion didn't work:

from scipy.stats.mstats import trimboth
import numpy as np

theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)

trimmedList = trimboth(theList, proportiontocut=0.15)
a = (trimmedList.max() - trimmedList.min()) * 0.5

# check what fraction of the elements falls into the range
sel = (theList > (theMedian - a)) & (theList < (theMedian + a))

print(np.sum(sel) / float(len(theList)))

The output is 0.79 (roughly 80% instead of the desired 70%).
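One way to see why the trimmed range overshoots is to compare the distances from the median to the two ends of the trimmed data. The sketch below assumes that trimming 15% from each side amounts to dropping the 15 smallest and 15 largest of the 100 values; the names k, trimmed, left and right are illustrative only.

```python
import numpy as np

x = np.sort(np.log10(1 + np.arange(.1, 100)))   # same skewed sample as above
m = np.median(x)

# Assumption: cutting 15% from each side drops k values from each end.
k = int(0.15 * len(x))
trimmed = x[k:len(x) - k]

# Distances from the median to the two ends of the trimmed data
left = m - trimmed.min()
right = trimmed.max() - m
print(left, right)
```

For skewed data the two distances differ, so half of the trimmed range is not the half-width of an interval centred on the median that holds 70% of the values.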

Asked by: Freddie826 | Posted: 30-11-2021

Answer 1

You first need to symmetrize your distribution by folding all values less than the median over to the right. Then you can use the standard scipy.stats functions on this one-sided distribution:

from scipy.stats import scoreatpercentile
import numpy as np

theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)

oneSidedList = theList.copy()           # real copy; theList[:] would only be a view of the same array
# fold over to the right all values left of the median
belowMedian = theList < theMedian
oneSidedList[belowMedian] = 2*theMedian - theList[belowMedian]

# find the 70th centile of the one-sided distribution
a = scoreatpercentile(oneSidedList, 70) - theMedian

# check what fraction of the elements falls into the range
sel = (theList > (theMedian - a)) & (theList < (theMedian + a))

print(np.sum(sel) / float(len(theList)))

This gives the result of 0.7 as required.
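Folding a value x below the median to 2*m - x places it at m + |x - m|, so the same a can also be read off as a percentile of the absolute deviations from the median. Below is a minimal sketch of that idea, assuming numpy's np.percentile with its default interpolation; the helper name half_width_for_coverage is hypothetical and not part of numpy or scipy.

```python
import numpy as np

def half_width_for_coverage(values, coverage=70):
    """Return (median, a) so that about `coverage` percent of the values
    lie within the interval (median - a, median + a)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # Absolute deviation from the median: the "folded" one-sided distribution
    distances = np.abs(values - median)
    a = np.percentile(distances, coverage)
    return median, a

theList = np.log10(1 + np.arange(.1, 100))
theMedian, a = half_width_for_coverage(theList, 70)

# Fraction of elements strictly inside (theMedian - a, theMedian + a)
fraction = np.mean(np.abs(theList - theMedian) < a)
print(fraction)
```

Because it works on a new array of distances, this variant never modifies the original data.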

Answered by: Joyce328 | Posted: 01-01-2022

Answer 2

Restate the problem slightly. You know the length of the list, and what fraction of the numbers in the list to consider. Given that, you can determine the difference between the first and last indices in the list that give you the desired range. The goal then is to find the indices that will minimize a cost function corresponding to the desired symmetric values about the median.

Let the smaller index be n1 and the larger index be n2; these are not independent. The values from the list at these indices are x[n1] = m-b and x[n2] = m+c. You now want to choose n1 (and thus n2) so that b and c are as close as possible, which happens when (b - c)**2 is minimal. That's easy to find with numpy.argmin. Paralleling the example in the question, here's an interactive session illustrating the approach:

$ python
Python 2.6.5 (r265:79063, Jun 12 2010, 17:07:01)
[GCC 4.3.4 20090804 (release) 1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> theList = np.log10(1+np.arange(.1, 100))
>>> theMedian = np.median(theList)
>>> listHead = theList[0:30]
>>> listTail = theList[-30:]
>>> b = np.abs(listHead - theMedian)
>>> c = np.abs(listTail - theMedian)
>>> squaredDiff = (b - c) ** 2
>>> np.argmin(squaredDiff)
25
>>> listHead[25] - theMedian, listTail[25] - theMedian
(-0.2874888056626983, 0.27859407466756614)
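The session above hard-codes the head and tail lengths for a 100-element list. A rough generalization could look like the sketch below; the function name symmetric_window and its return convention are assumptions made for illustration, and the input is assumed to be sorted already.

```python
import numpy as np

def symmetric_window(sorted_values, fraction=0.7):
    """Find indices n1 <= n2 spanning about `fraction` of the data such that
    x[n1] and x[n2] are as close to equidistant from the median as possible."""
    x = np.asarray(sorted_values, dtype=float)
    m = np.median(x)
    n = int(round(fraction * len(x)))     # number of elements inside the window
    starts = np.arange(len(x) - n + 1)    # every candidate for n1
    b = m - x[starts]                     # distance from the median down to x[n1]
    c = x[starts + n - 1] - m             # distance from the median up to x[n2]
    best = int(np.argmin((b - c) ** 2))   # n1 where b and c are closest
    n1 = best
    n2 = best + n - 1
    a = 0.5 * (b[best] + c[best])         # average of the two distances
    return n1, n2, a

theList = np.log10(1 + np.arange(.1, 100))
theMedian = np.median(theList)
n1, n2, a = symmetric_window(theList, 0.7)
print(n1, n2, a)
```

Here the window holds exactly n elements counting both end points, whereas the session above pairs index i with index 70+i, so the two can pick slightly different indices.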

Answered by: Cadie532 | Posted: 01-01-2022

Answer 3

What you want is scipy.stats.mstats.trimboth. Set proportiontocut=0.15. After trimming, take (max-min)/2.

Answered by: Kellan838 | Posted: 01-01-2022
