Programming puzzle: highest proportion of 1s in a substring

Jayjay · Dec 3, 2020

I recently had a programming test as part of a job interview. I failed this particular puzzle. This bothered me, so I tried to google for an answer afterwards, but no luck, so maybe I could throw it to you geniuses.

Given number k, and a string of 1s and 0s of at least size k, find a substring of at least size k with the highest proportion of 1s. If there are multiple possible answers, any one of them will do.

It's trivial to just test all the substrings, but the algorithm needs to run faster than that. During the actual test, I failed this criterion for some inputs (I was not given the actual test cases). Since the test is no longer online, I can't refer to it or run other experiments.

So for example, if k=4, and the given string is "0110011" the answer is substring "110011".

How would you go about solving the problem? I'm not looking for a rigorous mathematical proof, just an outline or an idea would do.

Swammerdami · Dec 4, 2020

Programming puzzles are fun! (If I want to contribute some, should I start my own thread or can I post them here?)

Did the test require you to create working code, or just an English-language summary?

The obvious approach involves a double loop [ O(N²) ] and I don't see any way to avoid this. You can chop away some of the cases to consider (e.g. discard initial 11100 when you've already found a 60% solution) but depending on detailed characteristics of the data this might cost more time than it saves. And the "bit-twiddling" to extract 1's and 0's from a byte might be a major cost in some cases.

A pre-processing step which might reduce the work is to transform 1^a0^b1^c0^d1^e0^f ... into (a,b), (c,d), (e, f), ...

The best approach for small k might be very different from the best approach for large k. It sounds like you were asked for a "one size fits all" solution.

If k is very large, you might start with larger chunks, e.g. replacing (a,b), (c,d) above with (a+c, b+d). BTW, in practical problems of this type, it is often good enough to output a nearly-optimal solution rather than the exact optimum.

Summary: I'm stumped. This strikes me as an unusually difficult exercise for an interview. There's probably some clever trick I'm overlooking.

Jayjay · Dec 4, 2020

Swammerdami said:
Programming puzzles are fun! (If I want to contribute some, should I start my own thread or can I post them here?)

Did the test require you to create working code, or just an English-language summary?

Working code. It was an automated test, where you submit code, and then it is run on the server side with a set of test cases. I was not shown what the test cases were, only that at least one of them failed due to exceeded time limit. The specifics of the test were that the string is at most 100,000 characters, and it has to run within 1 second, and not use any more memory than one megabyte. But of course the actual runtime depends on the server environment so that's a fuzzy goal. There was 48 hour time limit to do the test (there were two simpler problems, and of course I didn't sit two days on my computer, so the actual time I used to ponder this problem was about 5 or 6 hours) and I could attempt to submit the solution as many times as possible within that time.

Here is my last revision in C++:

Show

The output is the position (f) and length (l) of the substring.

I also tried switching the loops to check first for each position in the outer loop, and each length of substring in the inner loop. Didn't work.

Swammerdami said:
The obvious approach involves a double loop [ O(N²) ] and I don't see any way to avoid this. You can chop away some of the cases to consider (e.g. discard initial 11100 when you've already found a 60% solution) but depending on detailed characteristics of the data this might cost more time than it saves. And the "bit-twiddling" to extract 1's and 0's from a byte might be a major cost in some cases.

A pre-processing step which might reduce the work is to transform 1^a0^b1^c0^d1^e0^f ... into (a,b), (c,d), (e, f), ...

The best approach for small k might be very different from the best approach for large k. It sounds like you were asked for a "one size fits all" solution.

If k is very large, you might start with larger chunks, e.g. replacing (a,b), (c,d) above with (a+c, b+d). BTW, in practical problems of this type, it is often good enough to output a nearly-optimal solution rather than the exact optimum.

Summary: I'm stumped. This strikes me as an unusually difficult exercise for an interview. There's probably some clever trick I'm overlooking.

That's what I'm thinking too. But of course, it's possible that the given data just happens to favor some other way of doing things, or that I could have shaved off enough with some very minor optimizations.

barbos · Dec 4, 2020

I can't believe you failed that. this is very common problem which happens all the time.

In the first pass you calculate integral of your f(x)=0,1 function.
Then look for the max of Integ(x+k)-Integ(x)
....
Oh, string has to be at least k long.

Ok, i need to think

bigfield · Dec 4, 2020

Way too difficult for me. I needed a long time to come up with any kind of workable algorithm, let alone something that is fast and efficient.

Here's my solution, in Python because I don't know C++.

It's hideously slow.

Show

rousseau · Dec 4, 2020

Swammerdami said:
Summary: I'm stumped. This strikes me as an unusually difficult exercise for an interview. There's probably some clever trick I'm overlooking.

I had the same thought, it seems like a counter-productive exercise for an interview to me. An unusually difficult and uncommon problem that has to be completed under undue pressure.

bigfield · Dec 4, 2020

barbos said:
I can't believe you failed that. this is very common problem which happens all the time.

In the first pass you calculate integral of your f(x)=0,1 function.
Then look for the max of Integ(x+k)-Integ(x)
....
Oh, string has to be at least k long.

Ok, i need to think

Well, I'm glad I'm not the only person who though this was easy, then discovered it was not.

barbos · Dec 4, 2020

OK, Here is a solution:
First of all, it's obvious that solution is generally not unique.
let's try to find shortest string.
Given the minimum length=k it is trivial to show that solution is no longer than 2*k-1.
So now we find solution for length=k exactly and then I suspect we can prove that solution MUST contain this as a substring.
If we can prove that then simple O(k) scan around this substring gives us solution.

If it can't be proven then you can scan for solutions for length=k.....2*k-1

Swammerdami · Dec 4, 2020

Ninja'ed!

[Bardos' solution occurred to me while watching a movie just now. The movie finished; I typed it in; and only then saw Bardos had beat me to the draw.]

And yes, I think we can prove this:
"So now we find solution for length=k exactly and then I suspect we can prove that solution MUST contain this as a substring."

barbos · Dec 4, 2020

Well, I have not proven that It MUST contain length=k solution.
But I think it's very strong starting point even if it's not true. I mean one can start with solutions for few intermediate length=k, 1.25*k, 1.5*k, 1.75*k and see around them. It's hard to imagine solution would not contain one of these substrings.

...
Actually, I got an example where it's not true. Sorry, that solution does not work.
even modified k, 1.25*k, 1.5*k, 1.75*k would not work without some minor modifications.

Jayjay · Dec 4, 2020

barbos said:
OK, Here is a solution:
First of all, it's obvious that solution is generally not unique.
let's try to find shortest string.
Given the minimum length=k it is trivial to show that solution is no longer than 2*k-1.
So now we find solution for length=k exactly and then I suspect we can prove that solution MUST contain this as a substring.
If we can prove that then simple O(k) scan around this substring gives us solution.

If it can't be proven then you can scan for solutions for length=k.....2*k-1

Very insightful! I didn't occur to me, but of course if there was a substring of length 2k or longer, you could always split it in two and one of the shorter strings would have either a higher ratio of 1s than the original, or they would both have an equal ratio. (Just writing it out because it was not trivial to me)

I think just being able to limit the length of the solution to 2k-1 might have been enough. Too bad I can't actually try these out anymore, unless I write my own test and come up with worst case scenarios.

Jimmy Higgins · Dec 4, 2020

Trying to go equationy doesn't work. I think brute force is the way to go, but there are ways to simplify it.

1

0

1

0

1

0

1

0

1

0

1

This can be simplified to:

3

0

1

0

1

0

1

0

3

0

5

The benefit of this is that if you enter a k value that is equal to or less than your max value above, you already have your solution. IE, above k <= 5 it is 1,1,1,1,1.

If k is greater than 5 then what about just contracting the numbers?

1

0

1

0

1

0

1

0

1

0

1

Becomes...

2

1

0

2

1

2

Which becomes...

3

2

1

3

5

This tells you where most of your ones are located. Target a brute force search at the end.

barbos · Dec 4, 2020

Jimmy Higgins said:
Trying to go equationy doesn't work. I think brute force is the way to go, but there are ways to simplify it.

1 1 1 0 1 0 1 0 0 1 0 0 1 1 1 0 0 1 1 1 1 1

This can be simplified to:

3 0 1 0 1 0 0 1 0 0 3 0 0 5

The benefit of this is that if you enter a k value that is equal to or less than your max value above, you already have your solution. IE, above k <= 5 it is 1,1,1,1,1.

If k is greater than 5 then what about just contracting the numbers?

1 1 1 0 1 0 1 0 0 1 0 0 1 1 1 0 0 1 1 1 1 1

Becomes...

2 1 1 1 1 0 2 1 1 2 2

Which becomes...

3 2 1 3 5

This tells you where most of your ones are located. Target a brute force search at the end.

That's an obvious optimization which does not change complexity class of the task.

rousseau · Dec 4, 2020

Too late, the business already lost out to it's competitors. But actually the business really only needed the image to be moved 20 pixels to the left.

barbos · Dec 4, 2020

Jayjay said:
barbos said:

OK, Here is a solution:
First of all, it's obvious that solution is generally not unique.
let's try to find shortest string.
Given the minimum length=k it is trivial to show that solution is no longer than 2*k-1.
So now we find solution for length=k exactly and then I suspect we can prove that solution MUST contain this as a substring.
If we can prove that then simple O(k) scan around this substring gives us solution.

If it can't be proven then you can scan for solutions for length=k.....2*k-1

Click to expand...

Very insightful! I didn't occur to me, but of course if there was a substring of length 2k or longer, you could always split it in two and one of the shorter strings would have either a higher ratio of 1s than the original, or they would both have an equal ratio. (Just writing it out because it was not trivial to me)

I think just being able to limit the length of the solution to 2k-1 might have been enough. Too bad I can't actually try these out anymore, unless I write my own test and come up with worst case scenarios.

Actually, I am not sure I am right about 2k-1. .... No, I am right about 2k-1.

So you are right, if k is fixed number then it's O(N) where N is a size of the set. That could have been enough. But if k is a fraction of an N then it's O(N^2)

Jimmy Higgins · Dec 4, 2020

barbos said:
Jimmy Higgins said:

Trying to go equationy doesn't work. I think brute force is the way to go, but there are ways to simplify it.

1 1 1 0 1 0 1 0 0 1 0 0 1 1 1 0 0 1 1 1 1 1

This can be simplified to:

3 0 1 0 1 0 0 1 0 0 3 0 0 5

The benefit of this is that if you enter a k value that is equal to or less than your max value above, you already have your solution. IE, above k <= 5 it is 1,1,1,1,1.

If k is greater than 5 then what about just contracting the numbers?

1 1 1 0 1 0 1 0 0 1 0 0 1 1 1 0 0 1 1 1 1 1

Becomes...

2 1 1 1 1 0 2 1 1 2 2

Which becomes...

3 2 1 3 5

This tells you where most of your ones are located. Target a brute force search at the end.

Click to expand...

That's an obvious optimization which does not change complexity class of the task.

So why is the solution is no longer than 2*k-1?

barbos · Dec 4, 2020

Jimmy Higgins said:
So why is the solution is no longer than 2*k-1?

Because you can split your long solution into two strings. And at least one of these strings would have higher or at least the same density as your original longer "solution". You can do this splitting over and over until it can't be split anymore.

barbos · Dec 6, 2020

bigfield said:

Way too difficult for me. I needed a long time to come up with any kind of workable algorithm, let alone something that is fast and efficient.

Here's my solution, in Python because I don't know C++.

It's hideously slow.

Show

I don't have much practice in python but you should use range() instead of creating list of positions
Here is my code which seems to be working:

Code:

#!/usr/bin/python3
def scan_array(arr,k):
  sum=0.
  pos=0
  for i in range(k):
    sum+=arr[i]
  best=sum
  for i in range(k,len(arr)):
    sum-=arr[i-k]
    sum+=arr[i]
    if sum>best:
      pos=i-k+1
      best=sum
  return pos, best/float(k)

def scan_2k(arr,k):
  k2=min(2*k,len(arr))
  best=0
  for i in range(k,k2):
    p, b=scan_array(arr,i)
    if(b>best):
      best=b
      pos=p
      bl=i
  print("Best",best, 'pos', pos)
  print("substring", arr[pos:(pos+bl)])

scan_2k([0,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1], 4)

bigfield · Dec 7, 2020

barbos said:

I don't have much practice in python but you should use range() instead of creating list of positions
Here is my code which seems to be working:

Code:

#!/usr/bin/python3
def scan_array(arr,k):
  sum=0.
  pos=0
  for i in range(k):
    sum+=arr[i]
  best=sum
  for i in range(k,len(arr)):
    sum-=arr[i-k]
    sum+=arr[i]
    if sum>best:
      pos=i-k+1
      best=sum
  return pos, best/float(k)

def scan_2k(arr,k):
  k2=min(2*k,len(arr))
  best=0
  for i in range(k,k2):
    p, b=scan_array(arr,i)
    if(b>best):
      best=b
      pos=p
      bl=i
  print("Best",best, 'pos', pos)
  print("substring", arr[pos:(pos+bl)])

scan_2k([0,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1], 4)

Sum needs to be initialised as an integer, but other than that it seems to work quite well.

There's an idea in there that I wouldn't have thought of: You scan all positions for solutions with length k, then k+1, up to 2*k -1. By scanning every start position for the best solution of a fixed length, each iteration of the inner loop only needs two list access operations: the digit you added and the digit you removed.

barbos · Dec 7, 2020

Even less additions here:

Code:

#!/usr/bin/python3
def scan_array(arr,k, sum):
  pos=k-1
  best=sum
  for i in range(k,len(arr)):
    sum+=(arr[i]-arr[i-k])
    if sum>best:
      pos=i
      best=sum
  return pos-k+1, best/float(k)

def scan_2k(arr,k):
  sum=0
  for i in range(k):
    sum+=arr[i]
  k2=min(2*k,len(arr))
  best=0
  for i in range(k,k2):
    p, b=scan_array(arr,i,sum)
    sum+=arr[i]
    if(b>best):
      best=b
      pos=p
      bl=i
  print("Best",best, 'pos', pos)
  print("substring", arr[pos:(pos+bl)])

str="0110111011101111100011"
scan_2k([int(x) for x in str], 4)

Programming puzzle: highest proportion of 1s in a substring

Contributor

Squadron Leader

Contributor

Contributor

the baby-eater

Contributor

the baby-eater

Contributor

Squadron Leader

Contributor

Contributor

Contributor

Contributor

Contributor

Contributor

Contributor

Contributor

Contributor

the baby-eater

Contributor