I can do one quickly. I need to take affiliations from papers and work out which organisation(s) they're talking about. How would you solve this problem, assuming the affiliations are already extracted for you?
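To make that concrete: a naive first pass a candidate might sketch is normalising the strings and fuzzy-matching against a curated organisation list. A minimal sketch, assuming you have such a list; the `KNOWN_ORGS` entries, the helper names, and the 0.85 cutoff are all made up for illustration:

```python
# Naive first pass: normalise affiliation strings and fuzzy-match them
# against a curated list of organisation names. KNOWN_ORGS and the 0.85
# cutoff are illustrative placeholders, not real data or a tuned value.
from difflib import SequenceMatcher

KNOWN_ORGS = ["University of Cambridge", "Max Planck Society", "CERN"]

def normalize(s: str) -> str:
    return " ".join(s.lower().replace(",", " ").split())

def match_affiliation(affiliation: str, threshold: float = 0.85):
    """Return the best-matching known organisation, or None below the cutoff."""
    best_org, best_score = None, 0.0
    for org in KNOWN_ORGS:
        score = SequenceMatcher(None, normalize(affiliation), normalize(org)).ratio()
        if score > best_score:
            best_org, best_score = org, score
    return best_org if best_score >= threshold else None

print(match_affiliation("University of Cambridge, UK"))  # -> University of Cambridge
print(match_affiliation("Dept. of Physics, MIT"))        # -> None (not in the list)
```

Even this toy version surfaces the failure modes you'd want a candidate to spot: abbreviations, departments versus parent institutions, multiple affiliations in one string, and so on.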
What are the top-level concerns, how do you break the problem down, how might it scale, etc.? There are a lot of questions I'd expect to get to, and this would be done with the team, as if we were all working on the problem together.
I find it useful to see how well people can talk through the problem; it leads easily into questions about licensing and rights of reuse, types of errors, etc. If they suggest an approach they've used before, can they explain the likely failure cases and benefits? Are there workarounds, detection methods? For example, if you're doing text classification then tfidf+svm is a solid first thing to try, and there are easy ways it can fail which we could talk about.
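For anyone unfamiliar, that baseline is only a few lines with scikit-learn. A minimal sketch, with toy texts and labels standing in for real data:

```python
# The tfidf+svm baseline mentioned above, as a scikit-learn pipeline.
# The texts and labels are toy placeholders; the real work is in the
# data and the evaluation, not in these few lines.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "deep learning for protein structure prediction",
    "novel results in particle physics",
    "invoice for lab equipment",
    "quarterly budget meeting notes",
]
labels = ["science", "science", "admin", "admin"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["new paper on protein folding"]))  # -> ['science'] on this toy data
```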
There's a lot you can cover in an hour, and it tests whether someone can explain a potential solution to the team effectively, just as they would have to day-to-day. We can bring up specific types of problems we face within it, what we've tried, and we can constrain the problem or lead someone to starting points if it's a bit overwhelming.
edit - I guess this would fall under data science fundamentals, but I think the approach works for CS fundamentals too. What data structures could you use? What are the tradeoffs? It's not about finding the one optimal solution, but about how to proceed.
> For example, if you're doing text classification then tfidf+svm is a solid first thing to try
This is screening for specific ___domain knowledge (text processing) not general programming aptitude. That's ok if you want specific kinds of prior knowledge on Day 1 but it is not a way to hire generally smart people.
> I guess this would fall under data science fundamentals, but I think the approach works for CS fundamentals too. What data structures could you use? What are the tradeoffs? It's not about finding the one optimal solution, but about how to proceed.
This is exactly how most algorithm interview questions work.
I'm trying to understand what the OP meant by "real problems" not "academic puzzles". It sounded like they avoided hard algorithms yet "got a sense" of CS fundamentals somehow.
> This is screening for specific ___domain knowledge (text processing) not general programming aptitude. That's ok if you want specific kinds of prior knowledge on Day 1 but it is not a way to hire generally smart people.
Not really; the ___domain knowledge here is much more the bibliometrics stuff, which we usually don't need. What I do need is someone who knows they can't just take any data source they find, throw the latest deep learning hotness at it, and call it a day because the F score is over some arbitrary threshold.
You can absolutely use this to hire generally smart people. What you're right about is that I won't be hiring people who are generally smart but have no basic understanding of the types of solutions they'll need to work on and with. Given our team size and where we are, that's completely fine for me right now.
I think this helps find people that:
1. Are able to talk through a problem
2. Understand the kinds of issues they may face, and can discuss how to deal with them (including business-level work, when to use humans and when not to, etc.)
3. Have experience working on the kinds of problems they are going to face
The tfidf+svm example was not intended as "ah yes, they said the algorithm I wanted" but as a springboard into further discussion. Maybe they talk about word vectors instead; whatever it is, can they explain the pros and cons? Where might it fail, or more importantly, what would they want to test? Where do we get training data, how long might that take for reasonable quality, how do you measure that, etc.?
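On "how do you measure that": one concrete answer I'd hope to hear is cross-validated F1 on labelled data rather than a single lucky train/test split. A sketch, with generated placeholder data standing in for whatever you actually collect:

```python
# Measuring the baseline with k-fold cross-validation on macro F1,
# rather than trusting a single train/test split. The generated texts
# and labels are placeholders for real labelled data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [f"protein folding study number {i}" for i in range(10)] + \
        [f"quarterly budget report number {i}" for i in range(10)]
labels = ["science"] * 10 + ["admin"] * 10

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(clf, texts, labels, cv=5, scoring="f1_macro")
print(f"macro F1: {scores.mean():.2f} +/- {scores.std():.2f}")
```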
> This is exactly how most algorithm interview questions work.
The puzzles I think they're talking about are "here is a theoretical problem, find the optimal solution". Like "you have X eggs and need to find the highest floor you can drop them from without them breaking", or the classic "implement a doubly linked list with a single pointer", despite the fact that _almost nobody_ would actually implement that.
I'm talking about taking an actual issue they're likely to face and talking it through. Maybe that's "how would you implement the user flow for X, given that we've got institutional customers with multiple clients" or "we need to do rate limiting and have X servers, how would you go about that?". For the latter, that might lead into a discussion of the complexity tradeoffs of different approaches, the cost of letting someone exceed their rate, of cutting someone off early, etc. Those aren't really my field, so maybe the questions are a bit off, but the point is: can they contribute to a discussion on the way forward for a problem that represents something they will realistically face?
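To make the rate-limiting one concrete (with the caveat that it's not my field): a single-server token bucket is the usual starting point, and the interesting part of the discussion is what breaks when you try to coordinate it across X servers. A sketch; the class and parameter names are hypothetical:

```python
# A single-server token bucket, the usual starting point for the
# rate-limiting discussion above. Coordinating buckets across X servers
# (shared store? sticky routing? approximate per-server limits?) is
# exactly the tradeoff conversation the question is meant to open.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)   # 5 req/s, bursts of 10
print([bucket.allow() for _ in range(12)])  # roughly the first 10 True, then False
```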