• Welcome to the new Internet Infidels Discussion Board, formerly Talk Freethought.

Programming puzzle: highest proportion of 1s in a substring

Jayjay

Contributor
Joined
Apr 7, 2002
Messages
7,173
Location
Finland
Basic Beliefs
An accurate worldview or philosophy
I recently had a programming test as part of a job interview. I failed this particular puzzle. This bothered me, so I tried to google for an answer afterwards, but no luck, so maybe I could throw it to you geniuses.

Given number k, and a string of 1s and 0s of at least size k, find a substring of at least size k with the highest proportion of 1s. If there are multiple possible answers, any one of them will do.

It's trivial to just test all the substrings, but the algorithm needs to run faster than that. During the actual test, I failed this criterion for some inputs (I was not given the actual test cases). Since the test is no longer online, I can't refer to it or run other experiments.

So for example, if k=4, and the given string is "0110011" the answer is substring "110011".

How would you go about solving the problem? I'm not looking for a rigorous mathematical proof, just an outline or an idea would do.
 
Programming puzzles are fun! (If I want to contribute some, should I start my own thread or can I post them here?)

Did the test require you to create working code, or just an English-language summary?

The obvious approach involves a double loop [ O(N2) ] and I don't see any way to avoid this. You can chop away some of the cases to consider (e.g. discard initial 11100 when you've already found a 60% solution) but depending on detailed characteristics of the data this might cost more time than it saves. And the "bit-twiddling" to extract 1's and 0's from a byte might be a major cost in some cases.

A pre-processing step which might reduce the work is to transform 1a0b1c0d1e0f ... into (a,b), (c,d), (e, f), ...

The best approach for small k might be very different from the best approach for large k. It sounds like you were asked for a "one size fits all" solution.

If k is very large, you might start with larger chunks, e.g. replacing (a,b), (c,d) above with (a+c, b+d). BTW, in practical problems of this type, it is often good enough to output a nearly-optimal solution rather than the exact optimum.

Summary: I'm stumped. This strikes me as an unusually difficult exercise for an interview. There's probably some clever trick I'm overlooking.
 
Programming puzzles are fun! (If I want to contribute some, should I start my own thread or can I post them here?)

Did the test require you to create working code, or just an English-language summary?
Working code. It was an automated test, where you submit code, and then it is run on the server side with a set of test cases. I was not shown what the test cases were, only that at least one of them failed due to exceeded time limit. The specifics of the test were that the string is at most 100,000 characters, and it has to run within 1 second, and not use any more memory than one megabyte. But of course the actual runtime depends on the server environment so that's a fuzzy goal. There was 48 hour time limit to do the test (there were two simpler problems, and of course I didn't sit two days on my computer, so the actual time I used to ponder this problem was about 5 or 6 hours) and I could attempt to submit the solution as many times as possible within that time.

Here is my last revision in C++:

Code:
#include <iostream>

using namespace std;

// helper to count number of ones in a string of ones and zeroes
int count_hits(const string& s)
{
    int n = 0;
    for(char c : s)
        if(c == '1')
            ++n;
    return n;
}

int main()
{
    int k;
    string s;
    cin >> k >> s;
    
    double max_accuracy = 0.0;
    int f = 1, l = k;
    for(size_t len = k; len <= s.length(); ++len)
    {
        unsigned int hits = count_hits(s.substr(0, len));
        unsigned int max_hits = hits;
        int max_pos = 0;
        for(size_t pos = 1; pos <= s.length() - len; ++pos)
        {
            int decrement = s[pos - 1] == '1' ? 1 : 0;
            int increment = s[pos + len - 1] == '1' ? 1 : 0;
            int diff = increment - decrement;
            hits += diff;
            
            if(hits > max_hits)
            {
                max_hits = hits;
                max_pos = pos;
                
                if(max_hits == len)
                    break;
            }
        }
        double accuracy = max_hits / static_cast<double>(len);
        if(accuracy > max_accuracy)
        {
            max_accuracy = accuracy;
            f = max_pos + 1; // change to 1-based index
            l = len;
            
            if(max_accuracy == 1.0)
                break;
        }
    }
    
    // print the output
    cout << f << " " << l;
}

The output is the position (f) and length (l) of the substring.

I also tried switching the loops to check first for each position in the outer loop, and each length of substring in the inner loop. Didn't work.

The obvious approach involves a double loop [ O(N2) ] and I don't see any way to avoid this. You can chop away some of the cases to consider (e.g. discard initial 11100 when you've already found a 60% solution) but depending on detailed characteristics of the data this might cost more time than it saves. And the "bit-twiddling" to extract 1's and 0's from a byte might be a major cost in some cases.

A pre-processing step which might reduce the work is to transform 1a0b1c0d1e0f ... into (a,b), (c,d), (e, f), ...

The best approach for small k might be very different from the best approach for large k. It sounds like you were asked for a "one size fits all" solution.

If k is very large, you might start with larger chunks, e.g. replacing (a,b), (c,d) above with (a+c, b+d). BTW, in practical problems of this type, it is often good enough to output a nearly-optimal solution rather than the exact optimum.

Summary: I'm stumped. This strikes me as an unusually difficult exercise for an interview. There's probably some clever trick I'm overlooking.
That's what I'm thinking too. But of course, it's possible that the given data just happens to favor some other way of doing things, or that I could have shaved off enough with some very minor optimizations.
 
I can't believe you failed that. this is very common problem which happens all the time.

In the first pass you calculate integral of your f(x)=0,1 function.
Then look for the max of Integ(x+k)-Integ(x)
....
Oh, string has to be at least k long.

Ok, i need to think
 
Last edited:
Way too difficult for me. I needed a long time to come up with any kind of workable algorithm, let alone something that is fast and efficient.

Here's my solution, in Python because I don't know C++.

It's hideously slow.



Code:
def get_substring(digits, min_length):
    """
        Get the substring with greatest ratio of ones to zeroes and with 
        given minimum length
    """
    optimal_start_position = None
    optimal_length = 0
    optimal_ratio = 0
    one_positions = get_one_positions(digits)
    start_positions = list(enumerate(one_positions))
    end_positions = list(enumerate(one_positions))
    for start_index, start_position in start_positions:
        for end_index, end_position in end_positions:
            # Count the number of ones between the start and end positions.
            count = end_index - start_index + 1
            # Count the total number of digits between the start and end positions.
            length = end_position - start_position + 1
            if count > 0 and length >= min_length:
                ratio = count / length
                if ratio > optimal_ratio:
                    optimal_ratio = ratio
                    optimal_start_position = start_position
                    optimal_length = length
    return optimal_start_position, optimal_length


def get_one_positions(digits):
    positions = []
    for position, digit in enumerate(digits):
        if digit == "1":
            positions.append(position)
    return positions

 
Last edited:
Summary: I'm stumped. This strikes me as an unusually difficult exercise for an interview. There's probably some clever trick I'm overlooking.

I had the same thought, it seems like a counter-productive exercise for an interview to me. An unusually difficult and uncommon problem that has to be completed under undue pressure.
 
I can't believe you failed that. this is very common problem which happens all the time.

In the first pass you calculate integral of your f(x)=0,1 function.
Then look for the max of Integ(x+k)-Integ(x)
....
Oh, string has to be at least k long.

Ok, i need to think

Well, I'm glad I'm not the only person who though this was easy, then discovered it was not.
 
OK, Here is a solution:
First of all, it's obvious that solution is generally not unique.
let's try to find shortest string.
Given the minimum length=k it is trivial to show that solution is no longer than 2*k-1.
So now we find solution for length=k exactly and then I suspect we can prove that solution MUST contain this as a substring.
If we can prove that then simple O(k) scan around this substring gives us solution.

If it can't be proven then you can scan for solutions for length=k.....2*k-1
 
Ninja'ed!

[Bardos' solution occurred to me while watching a movie just now. The movie finished; I typed it in; and only then saw Bardos had beat me to the draw.]

And yes, I think we can prove this:
"So now we find solution for length=k exactly and then I suspect we can prove that solution MUST contain this as a substring."
 
Well, I have not proven that It MUST contain length=k solution.
But I think it's very strong starting point even if it's not true. I mean one can start with solutions for few intermediate length=k, 1.25*k, 1.5*k, 1.75*k and see around them. It's hard to imagine solution would not contain one of these substrings.


...
Actually, I got an example where it's not true. Sorry, that solution does not work.
even modified k, 1.25*k, 1.5*k, 1.75*k would not work without some minor modifications.
 
OK, Here is a solution:
First of all, it's obvious that solution is generally not unique.
let's try to find shortest string.
Given the minimum length=k it is trivial to show that solution is no longer than 2*k-1.
So now we find solution for length=k exactly and then I suspect we can prove that solution MUST contain this as a substring.
If we can prove that then simple O(k) scan around this substring gives us solution.

If it can't be proven then you can scan for solutions for length=k.....2*k-1

Very insightful! I didn't occur to me, but of course if there was a substring of length 2k or longer, you could always split it in two and one of the shorter strings would have either a higher ratio of 1s than the original, or they would both have an equal ratio. (Just writing it out because it was not trivial to me)

I think just being able to limit the length of the solution to 2k-1 might have been enough. Too bad I can't actually try these out anymore, unless I write my own test and come up with worst case scenarios.
 
Trying to go equationy doesn't work. I think brute force is the way to go, but there are ways to simplify it.

1110101001001110011111


This can be simplified to:
30101001003005

The benefit of this is that if you enter a k value that is equal to or less than your max value above, you already have your solution. IE, above k <= 5 it is 1,1,1,1,1.

If k is greater than 5 then what about just contracting the numbers?

1110101001001110011111

Becomes...

21111021122

Which becomes...

32135

This tells you where most of your ones are located. Target a brute force search at the end.
 
Trying to go equationy doesn't work. I think brute force is the way to go, but there are ways to simplify it.

1110101001001110011111


This can be simplified to:
30101001003005

The benefit of this is that if you enter a k value that is equal to or less than your max value above, you already have your solution. IE, above k <= 5 it is 1,1,1,1,1.

If k is greater than 5 then what about just contracting the numbers?

1110101001001110011111

Becomes...

21111021122

Which becomes...

32135

This tells you where most of your ones are located. Target a brute force search at the end.
That's an obvious optimization which does not change complexity class of the task.
 
Too late, the business already lost out to it's competitors. But actually the business really only needed the image to be moved 20 pixels to the left.
 
OK, Here is a solution:
First of all, it's obvious that solution is generally not unique.
let's try to find shortest string.
Given the minimum length=k it is trivial to show that solution is no longer than 2*k-1.
So now we find solution for length=k exactly and then I suspect we can prove that solution MUST contain this as a substring.
If we can prove that then simple O(k) scan around this substring gives us solution.

If it can't be proven then you can scan for solutions for length=k.....2*k-1

Very insightful! I didn't occur to me, but of course if there was a substring of length 2k or longer, you could always split it in two and one of the shorter strings would have either a higher ratio of 1s than the original, or they would both have an equal ratio. (Just writing it out because it was not trivial to me)

I think just being able to limit the length of the solution to 2k-1 might have been enough. Too bad I can't actually try these out anymore, unless I write my own test and come up with worst case scenarios.
Actually, I am not sure I am right about 2k-1. .... No, I am right about 2k-1.

So you are right, if k is fixed number then it's O(N) where N is a size of the set. That could have been enough. But if k is a fraction of an N then it's O(N^2)
 
Trying to go equationy doesn't work. I think brute force is the way to go, but there are ways to simplify it.

1110101001001110011111


This can be simplified to:
30101001003005

The benefit of this is that if you enter a k value that is equal to or less than your max value above, you already have your solution. IE, above k <= 5 it is 1,1,1,1,1.

If k is greater than 5 then what about just contracting the numbers?

1110101001001110011111

Becomes...

21111021122

Which becomes...

32135

This tells you where most of your ones are located. Target a brute force search at the end.
That's an obvious optimization which does not change complexity class of the task.
So why is the solution is no longer than 2*k-1?
 
So why is the solution is no longer than 2*k-1?
Because you can split your long solution into two strings. And at least one of these strings would have higher or at least the same density as your original longer "solution". You can do this splitting over and over until it can't be split anymore.
 
Way too difficult for me. I needed a long time to come up with any kind of workable algorithm, let alone something that is fast and efficient.

Here's my solution, in Python because I don't know C++.

It's hideously slow.



Code:
def get_substring(digits, min_length):
    """
        Get the substring with greatest ratio of ones to zeroes and with 
        given minimum length
    """
    optimal_start_position = None
    optimal_length = 0
    optimal_ratio = 0
    one_positions = get_one_positions(digits)
    start_positions = list(enumerate(one_positions))
    end_positions = list(enumerate(one_positions))
    for start_index, start_position in start_positions:
        for end_index, end_position in end_positions:
            # Count the number of ones between the start and end positions.
            count = end_index - start_index + 1
            # Count the total number of digits between the start and end positions.
            length = end_position - start_position + 1
            if count > 0 and length >= min_length:
                ratio = count / length
                if ratio > optimal_ratio:
                    optimal_ratio = ratio
                    optimal_start_position = start_position
                    optimal_length = length
    return optimal_start_position, optimal_length


def get_one_positions(digits):
    positions = []
    for position, digit in enumerate(digits):
        if digit == "1":
            positions.append(position)
    return positions

I don't have much practice in python but you should use range() instead of creating list of positions
Here is my code which seems to be working:
Code:
#!/usr/bin/python3
def scan_array(arr,k):
  sum=0.
  pos=0
  for i in range(k):
    sum+=arr[i]
  best=sum
  for i in range(k,len(arr)):
    sum-=arr[i-k]
    sum+=arr[i]
    if sum>best:
      pos=i-k+1
      best=sum
  return pos, best/float(k)

def scan_2k(arr,k):
  k2=min(2*k,len(arr))
  best=0
  for i in range(k,k2):
    p, b=scan_array(arr,i)
    if(b>best):
      best=b
      pos=p
      bl=i
  print("Best",best, 'pos', pos)
  print("substring", arr[pos:(pos+bl)])

scan_2k([0,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1], 4)
 
I don't have much practice in python but you should use range() instead of creating list of positions
Here is my code which seems to be working:
Code:
#!/usr/bin/python3
def scan_array(arr,k):
  sum=0.
  pos=0
  for i in range(k):
    sum+=arr[i]
  best=sum
  for i in range(k,len(arr)):
    sum-=arr[i-k]
    sum+=arr[i]
    if sum>best:
      pos=i-k+1
      best=sum
  return pos, best/float(k)

def scan_2k(arr,k):
  k2=min(2*k,len(arr))
  best=0
  for i in range(k,k2):
    p, b=scan_array(arr,i)
    if(b>best):
      best=b
      pos=p
      bl=i
  print("Best",best, 'pos', pos)
  print("substring", arr[pos:(pos+bl)])

scan_2k([0,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1], 4)

Sum needs to be initialised as an integer, but other than that it seems to work quite well.

There's an idea in there that I wouldn't have thought of: You scan all positions for solutions with length k, then k+1, up to 2*k -1. By scanning every start position for the best solution of a fixed length, each iteration of the inner loop only needs two list access operations: the digit you added and the digit you removed.
 
Even less additions here:
Code:
#!/usr/bin/python3
def scan_array(arr,k, sum):
  pos=k-1
  best=sum
  for i in range(k,len(arr)):
    sum+=(arr[i]-arr[i-k])
    if sum>best:
      pos=i
      best=sum
  return pos-k+1, best/float(k)

def scan_2k(arr,k):
  sum=0
  for i in range(k):
    sum+=arr[i]
  k2=min(2*k,len(arr))
  best=0
  for i in range(k,k2):
    p, b=scan_array(arr,i,sum)
    sum+=arr[i]
    if(b>best):
      best=b
      pos=p
      bl=i
  print("Best",best, 'pos', pos)
  print("substring", arr[pos:(pos+bl)])

str="0110111011101111100011"
scan_2k([int(x) for x in str], 4)
 
Last edited:
Back
Top Bottom