
Succeed in Interviews. Land Your Dream Data Job.

Created by interviewers at top companies like Google and Meta for data engineers, data scientists, and ML Engineers


Never prep alone. Prep with coaches and peers like you.

Trusted by talent with $240K+ compensation offers at companies like Google.

How we can help you.

Never enter data scientist and MLE interviews blindfolded. We give you exclusive insights, a TON of practice questions, case studies, and SQL drills to help you hone your interviewing game!

📚 Access Premium Courses

Practice with 60+ cases in detailed video and text-based courses! The areas covered are applied statistics, machine learning, product sense, A/B testing, and business cases. Plus, you can watch 5+ hours of pre-recorded mock interviews with real candidates preparing for data scientist and MLE interviews.

Join Premium Courses


⭐ Prep with a Coach

Prep with our coaches, who understand the ins and outs of technical and behavioral interviewing. The coaching calls are personalized with practice questions and detailed feedback to help you ace your upcoming interviews.

Book a Session


🖥️ Practice SQL

Practice 100 actual SQL interview questions on a slick, interactive SQL pad in your browser. The solutions are written by data scientists at Google and Meta.

Access SQL Pad


💬 Slack Study Group

Never study alone! Join a community of peers and instructors to practice interview questions, find mock interview buddies, and swap job hunt tips!

Join Community

Join the success tribe.

Datainterview.com is a phenomenal resource for data scientists aspiring to earn a role at top tech firms in Silicon Valley. Daniel has laid out the entire material in a curriculum format. In my opinion, if you are interviewing at any of the top tech firms (Facebook, Google, LinkedIn, etc.), all you have to do is go through his entire coursework thoroughly. No need to look for other resources. Daniel has laid down everything in a very straightforward manner.


Designed for candidates interviewing for data roles, the subscription course is packed with SQL, statistics, ML, and product-sense questions asked by the top tech companies I interviewed with. The comprehensive solutions with example dialogues between interviewers and candidates were incredibly helpful!


Datainterview was extremely helpful during my preparation for the product data science interview at Facebook. The prep is designed to test your understanding of key concepts in statistics, modeling and product sense. In addition, the mock interviews with an interview coach were valuable for technical interviews.


A great resource for someone who wants to get into the field of data science, as the prep materials and mock interviews not only describe the questions but also provide guidance for answering questions in a structured and clear way.


DataInterview was a key factor in my success. The level of depth in the A/B testing material helped me stand out over generic answers. The case studies provided a solid foundation for how to respond, and the Slack channel gave me an amazing network to do over 50 mocks with! If you haven't signed up yet, you're missing out!


DataInterview is one of the best resources that helped me land a job at Apple. The case study course helped me not only understand the best ways to answer a case but also helped me understand how an interviewer evaluates the response and the difference between a good and bad response. This was the edge that I needed to go from landing an interview to converting into an offer.


Get Started Today

Don't leave success up to chance.

20+ Data Science Case Study Interview Questions (with Solutions)

2024 Guide: 20+ Essential Data Science Case Study Interview Questions

Case studies are often the most challenging aspect of data science interview processes. They are crafted to resemble a company’s existing or previous projects, assessing a candidate’s ability to tackle prompts, convey their insights, and navigate obstacles.

To excel in data science case study interviews, practice is crucial. It will enable you to develop strategies for approaching case studies, asking the right questions to your interviewer, and providing responses that showcase your skills while adhering to time constraints.

The best way of doing this is by using a framework for answering case studies. For example, you could use the product metrics framework and the A/B testing framework to answer most case studies that come up in data science interviews.

There are four main types of data science case studies:

  • Product Case Studies - This type of case study tackles a specific product or feature offering, often tied to the interviewing company. Interviewers are generally looking for business sense geared toward product metrics.
  • Data Analytics Case Study Questions - Data analytics case studies ask you to propose possible metrics in order to investigate an analytics problem. Additionally, you must write a SQL query to pull your proposed metrics, and then perform analysis using the data you queried, just as you would do in the role.
  • Modeling and Machine Learning Case Studies - Modeling case studies are more varied and focus on assessing your intuition for building models around business problems.
  • Business Case Questions - Similar to product questions, business cases tackle issues or opportunities specific to the organization that is interviewing you. Often, candidates must assess the best option for a certain business plan being proposed, and formulate a process for solving the specific problem.

How Case Study Interviews Are Conducted

Oftentimes as an interviewee, you want to know the setting and format in which to expect the above questions to be asked. Unfortunately, this is company-specific: Some prefer real-time settings, where candidates actively work through a prompt after receiving it, while others offer some period of days (say, a week) before settling in for a presentation of your findings.

It is therefore important to have a system for answering these questions that will accommodate all possible formats, such that you are prepared for any set of circumstances (we provide such a framework below).

Why Are Case Study Questions Asked?

Case studies assess your thought process in answering data science questions. Specifically, interviewers want to see that you have the ability to think on your feet, and to work through real-world problems that likely do not have a right or wrong answer. Real-world case studies that are affecting businesses are not binary; there is no black-and-white, yes-or-no answer. This is why it is important that you can demonstrate decisiveness in your investigations, as well as show your capacity to consider impacts and topics from a variety of angles. Once you are in the role, you will be dealing directly with the ambiguity at the heart of decision-making.

Perhaps most importantly, case interviews assess your ability to effectively communicate your conclusions. On the job, data scientists exchange information across teams and divisions, so a significant part of the interviewer’s focus will be on how you process and explain your answer.

Quick tip: Because case questions in data science interviews tend to be product- and company-focused, it is extremely beneficial to research the company's current projects and developments across different divisions, as these initiatives might end up as the case study topic.


How to Answer Data Science Case Study Questions (The Framework)


There are four main steps to tackling case questions in data science interviews, regardless of the type: clarify, make assumptions, propose a solution, and provide data points and analysis.

Step 1: Clarify

Clarifying is used to gather more information. More often than not, these case studies are designed to be confusing and vague. There will be unorganized data intentionally supplemented with extraneous or omitted information, so it is the candidate's responsibility to dig deeper, filter out bad information, and fill gaps. Interviewers will be observing how an applicant asks questions and reaches a solution.

For example, with a product question, you might take into consideration:

  • What is the product?
  • How does the product work?
  • How does the product align with the business itself?

Step 2: Make Assumptions

Once you have evaluated and understood the dataset, start investigating and discarding possible hypotheses. Developing insights on the product at this stage complements your ability to glean information from the dataset, and exploring your ideas is paramount to forming a successful hypothesis. Communicate your hypotheses to the interviewer so that they can provide clarifying remarks on how the business views the product and help you discard unworkable lines of inquiry. If we continue to think about a product question, some important questions to evaluate and draw conclusions from include:

  • Who uses the product? Why?
  • What are the goals of the product?
  • How does the product interact with other services or goods the company offers?

The goal of this is to reduce the scope of the problem at hand, and ask the interviewer questions upfront that allow you to tackle the meat of the problem instead of focusing on less consequential edge cases.

Step 3: Propose a Solution

Now that you have formed a hypothesis that incorporates the dataset and an understanding of the business context, it is time to apply that knowledge to form a solution. Remember, the hypothesis is simply a refined version of the problem that uses the data on hand as the basis for its solution. The solution you create can target this narrower problem, and you can have full confidence that it addresses the core of the case study question.

Keep in mind that there isn’t a single expected solution, and as such, there is a certain freedom here to determine the exact path for investigation.

Step 4: Provide Data Points and Analysis

Finally, providing data points and analysis in support of your solution involves choosing and prioritizing a main metric. As with all prior steps, this one must be tied back to the hypothesis and the main goal of the problem. From that foundation, trace through and analyze different examples derived from the main metric in order to validate the hypothesis.

Quick tip: Every case question tends to have multiple solutions. Therefore, you should absolutely consider and communicate any potential trade-offs of your chosen method. Be sure you are communicating the pros and cons of your approach.

Note: In some special cases, solutions will also be assessed on the ability to convey information in layman's terms. Regardless of the structure, applicants should always be prepared to work through the framework outlined above in order to answer the prompt.

The Role of Effective Communication

Interviewers who run the data science case study portion have written and discussed it at length, and they consistently boil success in case studies down to one main factor: effective communication.

All the analysis in the world will not help if interviewees cannot verbally work through and highlight their thought process within the case study. At this stage of the hiring process, interviewers are specifically looking for well-developed soft skills and problem-solving capabilities. Demonstrating those traits is key to succeeding in this round.

To this end, the best advice possible is to actively practice working through example case studies, such as those available in the Interview Query question bank. Exploring different topics with a friend in an interview-like setting with cold recall (no Googling in between!) will be uncomfortable and awkward, but it will also help reveal weaknesses in fleshing out the investigation.

Don’t worry if the first few times are terrible! Developing a rhythm will help with gaining self-confidence as you become better at assessing and learning through these sessions.


Product Case Study Questions


With product data science case questions, the interviewer wants to get an idea of your product-sense intuition. Specifically, these questions assess your ability to identify which metrics should be proposed in order to understand a product.

1. How would you measure the success of private stories on Instagram, where only certain close friends can see the story?

Start by answering: What is the goal of the private story feature on Instagram? You can't evaluate "success" without knowing what the initial objective of the product was.

One specific goal of this feature would be to drive engagement. A private story could potentially increase interactions between users, and grow awareness of the feature.

Now, what types of metrics might you propose to assess user engagement? For a high-level overview, we could look at:

  • Average stories per user per day
  • Average Close Friends stories per user per day

However, we would also want to further bucket our users to see the effect that Close Friends stories have on user engagement. By bucketing users by age, date joined, or another metric, we could see how engagement is affected within certain populations, giving us insight on success that could be lost if looking at the overall population.
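
To make the bucketing concrete, here is a minimal pandas sketch. The `stories` and `users` tables, their column names, and the age buckets are hypothetical stand-ins for whatever instrumentation Instagram actually has.

```python
import pandas as pd

def story_engagement(stories: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    # stories: user_id, created_at (timestamp), is_close_friends (bool)
    # users:   user_id, age_bucket
    stories = stories.assign(day=stories["created_at"].dt.date)
    daily = (
        stories.groupby(["user_id", "day"])
        .agg(total_stories=("is_close_friends", "size"),
             close_friends_stories=("is_close_friends", "sum"))
        .reset_index()
    )
    # Average per active day for each user, then roll up by age bucket.
    per_user = daily.groupby("user_id")[["total_stories", "close_friends_stories"]].mean()
    return per_user.join(users.set_index("user_id")["age_bucket"]).groupby("age_bucket").mean()
```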

2. How would you measure the success of acquiring new users through a 30-day free trial at Netflix?

More context: Netflix is offering a promotion where users can enroll in a 30-day free trial. After 30 days, customers will automatically be charged based on their selected package. How would you measure acquisition success, and what metrics would you propose to measure the success of the free trial?

One way we can frame the concept specifically to this problem is to think about controllable inputs, external drivers, and then the observable output. Start with the major goals of Netflix:

  • Acquiring new users to their subscription plan.
  • Decreasing churn and increasing retention.

Looking at acquisition output metrics specifically, there are several top-level stats that we can look at, including:

  • Conversion rate percentage
  • Cost per free trial acquisition
  • Daily conversion rate

With these conversion metrics, we would also want to bucket users by cohort. This would help us see the percentage of free users who were acquired, as well as retention by cohort.
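
A hedged pandas sketch of that cohort view, assuming a hypothetical `trials` table with one row per free-trial signup:

```python
import pandas as pd

def trial_conversion_by_cohort(trials: pd.DataFrame) -> pd.DataFrame:
    # trials: user_id, trial_start (timestamp), converted (bool), acquisition_cost (float)
    trials = trials.assign(cohort=trials["trial_start"].dt.to_period("M"))
    return (
        trials.groupby("cohort")
        .agg(signups=("user_id", "nunique"),
             conversion_rate=("converted", "mean"),      # share who paid after day 30
             cost_per_trial=("acquisition_cost", "mean"))
    )
```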

3. How would you measure the success of Facebook Groups?

Start by considering the key function of Facebook Groups. You could say that Groups are a way for users to connect with other users through a shared interest or real-life relationship. Therefore, the user's goal is to experience a sense of community, which will also drive our business goal of increasing user engagement.

What general engagement metrics can we associate with this value? An objective metric like Groups monthly active users would help us see if the Facebook Groups user base is increasing or decreasing. Plus, we could monitor metrics like posting, commenting, and sharing rates.

There are other products that Groups impact, however, specifically the Newsfeed. We need to consider Newsfeed quality and examine whether updates from Groups clog up the content pipeline and whether users prioritize those updates over other Newsfeed items. This evaluation will give us a better sense of whether Groups actually contribute to higher engagement levels.

4. How would you analyze the effectiveness of a new LinkedIn chat feature that shows a “green dot” for active users?

Note: Given engineering constraints, the new feature is impossible to A/B test before release. When you approach case study questions, always remember to clarify any vague terms. In this case, "effectiveness" is very vague. To help define that term, first consider the goal of adding a green dot to LinkedIn chat.

Data Science Product Case Study (LinkedIn InMail, Facebook Chat)

5. How would you diagnose why weekly active users are up 5%, but email notification open rates are down 2%?

What assumptions can you make about the relationship between weekly active users and email open rates? With a case question like this, you would want to first answer that line of inquiry before proceeding.

Hint: Open rate can decrease when its numerator decreases (fewer people open emails) or its denominator increases (more emails are sent overall). Taking these two factors into account, what are some hypotheses we can make about our decrease in the open rate compared to our increase in weekly active users?
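
A toy numeric illustration of that decomposition (all figures are invented purely for the arithmetic):

```python
# open rate = emails_opened / emails_sent
before = {"wau": 1_000_000, "emails_sent": 2_000_000, "emails_opened": 400_000}
after  = {"wau": 1_050_000, "emails_sent": 2_300_000, "emails_opened": 414_000}

for label, d in [("before", before), ("after", after)]:
    print(label, round(d["emails_opened"] / d["emails_sent"], 3))
# before 0.2, after 0.18: more active users triggered more notification emails (a larger
# denominator), so the open rate fell even though the absolute number of opens grew.
```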

Data Analytics Case Study Questions

Data analytics case studies ask you to dive into analytics problems. Typically these questions ask you to examine metrics trade-offs or investigate changes in metrics. In addition to proposing metrics, you also have to write SQL queries to generate the metrics, which is why they are sometimes referred to as SQL case study questions.

6. Using the provided data, generate some specific recommendations on how DoorDash can improve.

In this DoorDash analytics case study take-home question, you are provided with the following dataset:

  • Customer order time
  • Restaurant order time
  • Driver arrives at restaurant time
  • Order delivered time
  • Customer ID
  • Amount of discount
  • Amount of tip

With a dataset like this, there are numerous recommendations you can make. A good place to start is by thinking about the DoorDash marketplace, which includes drivers, customers, and merchants. How could you analyze the data to increase revenue, driver and customer retention, and engagement in that marketplace?
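
One possible starting point, sketched in pandas against hypothetical column names that mirror the fields listed above, is to derive stage-by-stage delivery times and check how they relate to tips and discounts:

```python
import pandas as pd

def delivery_breakdown(orders: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical columns: customer_order_time, restaurant_order_time,
    # driver_at_restaurant_time, order_delivered_time, customer_id, discount, tip
    out = orders.copy()
    out["prep_wait_min"] = (out["driver_at_restaurant_time"]
                            - out["restaurant_order_time"]).dt.total_seconds() / 60
    out["delivery_min"] = (out["order_delivered_time"]
                           - out["customer_order_time"]).dt.total_seconds() / 60
    # One concrete recommendation angle: do long waits and end-to-end times depress tips?
    return out[["delivery_min", "prep_wait_min", "tip", "discount"]].corr()
```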

7. After implementing a notification change, the total number of unsubscribes increases. Write a SQL query to show how unsubscribes are affecting login rates over time.

This is a Twitter data science interview question, and let's say you implemented this new feature using an A/B test. You are provided with two tables: events (which includes the actions login, nologin, and unsubscribe) and variants (which indicates whether a user is in the control or variant bucket).

We are tasked with comparing multiple different variables at play here. There is the new notification system, along with its effect of creating more unsubscribes. We can also see how login rates compare for unsubscribes for each bucket of the A/B test.

Given that we want to measure two different changes, we know we have to use GROUP BY for the two variables: date and bucket variant. What comes next?
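
One possible shape of the query, written against a hypothetical schema (the column names are assumptions, not the actual Twitter tables):

```python
# events(user_id, action, created_at) with action in ('login', 'nologin', 'unsubscribe')
# variants(user_id, bucket) with bucket in ('control', 'variant')
LOGIN_RATE_BY_DAY_AND_BUCKET = """
SELECT
    DATE(e.created_at)                                        AS event_date,
    v.bucket,
    AVG(CASE WHEN e.action = 'login' THEN 1.0 ELSE 0.0 END)   AS login_rate,
    SUM(CASE WHEN e.action = 'unsubscribe' THEN 1 ELSE 0 END) AS unsubscribes
FROM events e
JOIN variants v
    ON v.user_id = e.user_id
GROUP BY DATE(e.created_at), v.bucket
ORDER BY event_date, v.bucket;
"""
```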

8. Write a query to disprove the hypothesis: Data scientists who switch jobs more often end up getting promoted faster.

More context: You are provided with a table of user experiences representing each person’s past work experiences and timelines.

This question requires a bit of creative problem-solving to understand how we can prove or disprove the hypothesis. The hypothesis is that a data scientist who switches jobs more often gets promoted faster.

Therefore, in analyzing this dataset, we can test the hypothesis by segmenting data scientists based on how often they have switched jobs.

For example, if we looked at data scientists who have been in the field for five years, the hypothesis would be supported if the share of managers rises with the number of job switches (a query sketch follows the list below):

  • Never switched jobs: 10% are managers
  • Switched jobs once: 20% are managers
  • Switched jobs twice: 30% are managers
  • Switched jobs three times: 40% are managers
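
The query sketch referenced above, against a hypothetical `user_experiences(user_id, company, title, start_date, end_date)` table; a fuller answer would also control for total years of experience:

```python
MANAGER_RATE_BY_JOB_SWITCHES = """
WITH per_user AS (
    SELECT
        user_id,
        COUNT(DISTINCT company) - 1                              AS job_switches,
        MAX(CASE WHEN title LIKE '%Manager%' THEN 1 ELSE 0 END)  AS became_manager
    FROM user_experiences
    GROUP BY user_id
)
SELECT
    job_switches,
    AVG(became_manager) AS manager_rate
FROM per_user
GROUP BY job_switches
ORDER BY job_switches;
"""
```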

9. Write a SQL query to investigate the hypothesis: Click-through rate is dependent on search result rating.

More context: You are given a table with search results on Facebook, which includes query (search term), position (the search position), and rating (human rating from 1 to 5). Each row represents a single search and includes a column has_clicked that represents whether a user clicked or not.

This question requires us to do two things: define a metric that addresses the hypothesis and then actually compute it.

Think about the data we want to display to prove or disprove the hypothesis. Our output metric is CTR (click-through rate). If CTR is high when search result ratings are high and low when the ratings are low, the hypothesis is supported. If the opposite is true, or there is no clear correlation between the two, the hypothesis is not supported.

With that structure in mind, we can look at the results split into search-rating buckets. If we measure the CTR for results rated 1, then for results rated 2, and so on, we can see whether an increase in rating is correlated with an increase in CTR.
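
A sketch of that bucketing, assuming a hypothetical `search_results(query, position, rating, has_clicked)` table:

```python
CTR_BY_RATING = """
SELECT
    rating,
    AVG(CASE WHEN has_clicked THEN 1.0 ELSE 0.0 END) AS ctr,   -- AVG(has_clicked) if stored as 0/1
    COUNT(*)                                         AS impressions
FROM search_results
GROUP BY rating
ORDER BY rating;
"""
```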

10. How would you help a supermarket chain determine which product categories should be prioritized in their inventory restructuring efforts?

You're working as a data scientist on a local grocery chain's data science team. The business team has decided to allocate store floor space by product category (e.g., electronics, sports and travel, food and beverages). Help the team understand which product categories to prioritize, as well as answer questions such as how customer demographics affect sales and how sales per product category differ by city.

Check out our Data Analytics Learning Path .

Modeling and Machine Learning Case Questions

Machine learning case questions assess your ability to build models to solve business problems. These questions can range from applying machine learning to a specific case scenario to assessing the validity of a hypothetical existing model. A modeling case study requires a candidate to evaluate and explain a given part of the model-building process.

11. Describe how you would build a model to predict Uber ETAs after a rider requests a ride.

Common machine learning case study problems like this ask you to explain how you would build a model. Often the discussion can be scoped down to specific parts of the model-building process. Examining the example above, we could break it up into:

How would you evaluate the predictions of an Uber ETA model?

What features would you use to predict the Uber ETA for ride requests?

Our recommended framework breaks down a modeling and machine learning case study to individual steps in order to tackle each one thoroughly. In each full modeling case study, you will want to go over:

  • Data processing
  • Feature Selection
  • Model Selection
  • Cross Validation
  • Evaluation Metrics
  • Testing and Roll Out
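
As a rough illustration of the evaluation step, here is a hedged scikit-learn sketch; the feature names and the `actual_eta_min` label are invented for the example, and mean absolute error is just one reasonable choice of metric:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def evaluate_eta_model(rides: pd.DataFrame) -> float:
    # Hypothetical features for an ETA model; the label is the realized wait in minutes.
    features = ["trip_distance_km", "hour_of_day", "day_of_week",
                "pickup_zone_demand", "available_drivers_nearby"]
    X, y = rides[features], rides["actual_eta_min"]
    model = GradientBoostingRegressor(random_state=0)
    # MAE is intuitive here: "how many minutes off are we on average?"
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    return -scores.mean()
```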

12. How would you build a model that sends bank customers a text message when fraudulent transactions are detected?

Additionally, the customer can approve or deny the transaction via text response.

Let's start by understanding what kind of model needs to be built. We know that since we are working with fraud, each transaction is either fraudulent or it is not.

Hint: This problem is a binary classification problem. Given the problem scenario, what considerations do we have to think about when first building this model? What would the bank fraud data look like?
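
A minimal scikit-learn sketch of such a binary classifier, with invented feature names and `class_weight="balanced"` as one simple way to acknowledge that fraud is rare:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def train_fraud_model(txns: pd.DataFrame) -> LogisticRegression:
    # Hypothetical transaction features; is_fraud is a highly imbalanced binary label.
    features = ["amount", "merchant_risk_score", "distance_from_home_km",
                "txns_last_hour", "is_foreign"]
    X, y = txns[features], txns["is_fraud"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))
    return model  # downstream, positive predictions would trigger the approve/deny text
```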

13. How would you design the inputs and outputs for a model that detects potential bombs at a border crossing?

Additional questions: How would you test the model and measure its accuracy? Recall the definitions of precision and recall:

Precision = TP / (TP + FP), Recall = TP / (TP + FN)

Because we cannot afford false negatives (a missed bomb), recall should be high when assessing the model.

14. Which model would you choose to predict Airbnb booking prices: Linear regression or random forest regression?

Start by answering this question: What are the main differences between linear regression and random forest?

Random forest regression is based on the ensemble machine learning technique of bagging. The two key concepts of random forests are:

  • Random sampling of training observations when building trees.
  • Random subsets of features for splitting nodes.

Random forest regressions also effectively discretize continuous variables, since they are built from decision trees that split continuous features at learned thresholds, and they can handle both categorical and continuous variables.

Linear regression, on the other hand, is the standard regression technique in which relationships are modeled using a linear predictor function, the most common example represented as y = Ax + B.

Let’s see how each model is applicable to Airbnb’s bookings. One thing we need to do in the interview is to understand more context around the problem of predicting bookings. To do so, we need to understand which features are present in our dataset.

We can assume the dataset will have features like:

  • Location features.
  • Seasonality.
  • Number of bedrooms and bathrooms.
  • Private room, shared, entire home, etc.
  • External demand (conferences, festivals, sporting events).

Which model would be the best fit for this feature set?
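
One way to answer empirically, sketched with scikit-learn under the assumption of a hypothetical `listings` table with the features above encoded as shown:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def compare_price_models(listings: pd.DataFrame) -> dict:
    categorical = ["neighbourhood", "room_type", "season"]
    numeric = ["bedrooms", "bathrooms", "external_demand_index"]
    X, y = listings[categorical + numeric], listings["booking_price"]
    prep = ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
                             remainder="passthrough")
    results = {}
    for name, reg in [("linear", LinearRegression()),
                      ("random_forest", RandomForestRegressor(n_estimators=200, random_state=0))]:
        pipe = Pipeline([("prep", prep), ("model", reg)])
        scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_absolute_error")
        results[name] = -scores.mean()
    return results  # lower mean absolute error wins for this feature set
```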

15. Using a binary classification model that pre-approves candidates for a loan, how would you give each rejected application a rejection reason?

More context: You do not have access to the feature weights. Start by thinking about the problem like this: How would the problem change if we had ten, one thousand, or ten thousand applicants who had gone through the loan qualification program?

Pretend that we have three people: Alice, Bob, and Candace, who have all applied for a loan. Simplifying the financial lending model, let us assume the only features are the total number of credit cards, the dollar amount of current debt, and credit age. Here is a scenario:

Alice: 10 credit cards, 5 years of credit age, $\$20K$ in debt

Bob: 10 credit cards, 5 years of credit age, $\$15K$ in debt

Candace: 10 credit cards, 5 years of credit age, $\$10K$ in debt

If Candace is approved, we can logically point to the fact that Candace’s $\$10K$ in debt swung the model to approve her for a loan. How did we reason this out?

If the sample size analyzed was instead thousands of people who had the same number of credit cards and credit age with varying levels of debt, we could figure out the model’s average loan acceptance rate for each numerical amount of current debt. Then we could plot these on a graph to model the y-value (average loan acceptance) versus the x-value (dollar amount of current debt). These graphs are called partial dependence plots.
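
A hedged sketch of that idea using scikit-learn's partial dependence tooling; the training data and column names are hypothetical, and a gradient-boosted classifier stands in for the (unknown) production model:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

def debt_rejection_curve(applications: pd.DataFrame):
    # Hypothetical features mirroring the toy example above; approved is 0/1.
    features = ["num_credit_cards", "credit_age_years", "current_debt"]
    X, y = applications[features], applications["approved"]
    model = GradientBoostingClassifier(random_state=0).fit(X, y)
    # The partial dependence of approval on current_debt is exactly the
    # "average acceptance vs. debt" curve described above.
    return PartialDependenceDisplay.from_estimator(model, X, features=["current_debt"])
```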

Business Case Questions

In data science interviews, business case study questions task you with addressing problems as they relate to the business. You might be asked about topics like estimation and calculation, as well as applying problem-solving to a larger case. One tip: Be sure to read up on the company’s products and ventures before your interview to expose yourself to possible topics.

16. How would you estimate the average lifetime value of customers at a business that has existed for just over one year?

More context: You know that the product costs $\$100$ per month, averages 10% in monthly churn, and the average customer stays for 3.5 months.

Remember that lifetime value is the predicted net revenue attributed to the entire future relationship with a customer, averaged across all customers. Therefore, $\$100$ * 3.5 = $\$350$… But is it that simple?

Because this company is so new, our average customer tenure (3.5 months) is biased by the short window in which anyone could have been a customer (one year at most). How would you then model LTV knowing the churn rate and product cost?
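
A minimal sketch of the churn-based correction, under the simplifying assumption of a constant 10% monthly churn (geometric lifetime):

```python
monthly_price = 100
monthly_churn = 0.10

naive_ltv = monthly_price * 3.5                 # $350, biased by the right-censored 3.5-month average
expected_lifetime_months = 1 / monthly_churn    # 10 months under a constant-churn (geometric) model
churn_based_ltv = monthly_price * expected_lifetime_months  # $1,000

print(naive_ltv, churn_based_ltv)
```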

17. How would you go about removing duplicate product names (e.g. iPhone X vs. Apple iPhone 10) in a massive database?

See the full solution for this Amazon business case question on YouTube:


18. What metrics would you monitor to know if a 50% discount promotion is a good idea for a ride-sharing company?

This question has no correct answer and is rather designed to test your reasoning and communication skills related to product/business cases. First, start by stating your assumptions. What are the goals of this promotion? It is likely that the goal of the discount is to grow revenue and increase retention. A few other assumptions you might make include:

  • The promotion will be applied uniformly across all users.
  • The 50% discount can only be used for a single ride.

How would we be able to evaluate this pricing strategy? An A/B test between a control group (no discount) and a test group (discount) would allow us to compare long-term revenue against the average cost of the promotion. Using these two metrics, how could we measure whether the promotion is a good idea?

19. A bank wants to create a new partner card (e.g., a Whole Foods Chase credit card). How would you determine what the next partner card should be?

More context: Say you have access to all customer spending data. With this question, there are several approaches you can take. As your first step, think about the business reason for credit card partnerships: they help increase acquisition and customer retention.

One of the simplest solutions would be to sum all transactions grouped by merchant. This would identify the merchants who see the highest spending amounts. However, one issue might be that some merchants have high spend but low transaction volume. How could we counteract this potential pitfall? Is the volume of transactions even an important factor in our credit card business? The more questions you ask, the more may spring to mind.
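
A query sketch against a hypothetical `transactions(customer_id, merchant, amount)` table that surfaces both spend and volume, so the high-spend/low-volume pitfall stays visible:

```python
PARTNER_CANDIDATES = """
SELECT
    merchant,
    SUM(amount)                               AS total_spend,
    COUNT(DISTINCT customer_id)               AS unique_customers,
    SUM(amount) / COUNT(DISTINCT customer_id) AS spend_per_customer
FROM transactions
GROUP BY merchant
ORDER BY total_spend DESC;
"""
```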

20. How would you assess the value of keeping a TV show on a streaming platform like Netflix?

Say that Netflix is working on a deal to renew the streaming rights for a show like The Office , which has been on Netflix for one year. Your job is to value the benefit of keeping the show on Netflix.

Start by trying to understand the reasons why Netflix would want to renew the show. Netflix mainly has three goals for what their content should help achieve:

  • Acquisition: To increase the number of subscribers.
  • Retention: To increase the retention of active subscribers and keep them on as paying members.
  • Revenue: To increase overall revenue.

One solution to value the benefit would be to estimate a lower and upper bound to understand the percentage of users that would be affected by The Office being removed. You could then run these percentages against your known acquisition and retention rates.

21. How would you determine which products are to be put on sale?

Let’s say you work at Amazon. It’s nearing Black Friday, and you are tasked with determining which products should be put on sale. You have access to historical pricing and purchasing data from items that have been on sale before. How would you determine what products should go on sale to best maximize profit during Black Friday?

To start with this question, aggregate data from previous years for products that have been on sale during Black Friday or similar events. You can then compare elements such as historical sales volume, inventory levels, and profit margins.

Learn More About Feature Changes

This course is designed to teach you everything you need to know about feature changes.

More Data Science Interview Resources

Case studies are one of the most common types of data science interview questions . Practice with the data science course from Interview Query, which includes product and machine learning modules.

Enhancing data preparation: insights from a time series case study

Open access | Published: 25 July 2024


Camilla Sancricca, Giovanni Siracusa & Cinzia Cappiello

Data play a key role in AI systems that support decision-making processes. Data-centric AI highlights the importance of having high-quality input data to obtain reliable results. However, preparing data well for machine learning is becoming difficult due to the variety of data quality issues and available data preparation tasks. For this reason, approaches that help users in performing this demanding phase are needed. This work proposes DIANA, a framework for data-centric AI to support data exploration and preparation, suggesting suitable cleaning tasks to obtain valuable analysis results. We design an adaptive self-service environment that can handle the analysis and preparation of different types of sources, i.e., tabular and streaming data. The central component of our framework is a knowledge base that collects evidence related to the effectiveness of the data preparation actions along with the type of input data and the considered machine learning model. In this paper, we first describe the framework, the knowledge base model, and its enrichment process. Then, we show the experiments conducted to enrich the knowledge base in a particular case study: time series data streams.


1 Introduction

AI systems are widely employed in organizations that want to gain a competitive advantage. Data analysis pipelines that exploit machine learning (ML) techniques are designed to extract useful insights to support decision-making processes.

Within this scenario, the emerging concept is Data-centric AI (Jarrahi et al., 2023 ), which is based on the idea that data and their quality are important aspects to consider for the proper functioning of AI systems. Data preparation is fundamental to ensure high-quality input data (Hameed & Naumann, 2020 ) and, thus, guarantee the dependability of the analysis outcomes. Data preparation is the most challenging among the multiple stages of a data analysis pipeline, i.e., data acquisition, preparation, modeling and analysis, and evaluation. In fact, designing a data preparation pipeline requires addressing various data quality (DQ) issues and dealing with the plethora of available data preparation techniques. The complexity of this task is also demonstrated by the fact that a data scientist usually spends most (about 80%) of the total analysis time to prepare the datasets for the analysis stage (Hameed & Naumann, 2020 ).

In this paper, we present DIANA, a framework for aDaptIve dAta-ceNtric AI that aims to support users in designing the data preparation pipeline by recommending the data quality improvement techniques to adopt. Currently, approaches with similar goals already exist (see Section 2). However, they aim to automate the data preparation process completely, while we adopt a human-centered approach, i.e., involving the users in all stages, supporting their decisions, and leaving them in control of the process.

Indeed, automation could speed up the process and thus enterprises’ productivity, but the risk is a potential disengagement of humans in crucial aspects of data analysis, e.g., decision-making, and a lack of transparency and explainability of such decisions. It has been proved that including human-centered approaches in the data science process achieves more effective and trustworthy systems (Garibay et al., 2023 ).

However, when the user is integrated into the process, the necessity of offering different treatments/support based on users’ skills and needs arises. For example, more experienced users are probably more confident using data preparation tools than non-expert users. Well-designed systems should, therefore, consider such differences and offer different levels of support depending on the users’ capabilities (Shneiderman, 2020 ).

Considering the complexity of the data preparation phase and the need to interact with the users, the proposed framework aims to provide:

(a) Data Exploration, Data Profiling, and DQ Assessment functionalities to make the users aware of the characteristics and anomalies of the datasets they want to analyze. It is worth noticing that DQ is not the only facet that we consider if sensitive data are also used. Ethical aspects also come into play: data should not contain biases (e.g., data not representative of the entire population or unbalanced data, where subgroups are over/underrepresented);

(b) Recommendations on the sequence of data-cleaning techniques that best fit the user’s analysis;

(c) Support for different types of data sources ; currently, our framework can deal with tabular data and time series;

(d) Sliding Autonomy that is the system’s ability to incorporate human intervention when needed (Côté et al., 2012 ). In our approach, we envision increasing the system support for non-expert users and decreasing it when the users aim to have full control of the design choices.

DIANA is fed with a Knowledge Base (KB) to offer these features. It contains information considered fundamental for the system’s correct functioning. For example, it stores descriptive information regarding the DQ dimensions, metrics, data preparation techniques, and the data needed to generate suggestions on the optimal data preparation tasks. In this paper, we describe the graph database model we designed for the KB and present the experiments we performed to gather the values for its enrichment. In particular, we show the experiments to conduct for enriching the KB by using, as an example, a case study related to a specific data source type: time series.

The paper is structured as follows: Section 2 discusses existing literature contributions and highlights the novel aspects of the presented approach, which is thoroughly described in Section 3 . Section 4 focuses on the KB presenting its structure. Section 5 describes the way in which it is possible to gather the values for enriching the KB; a time series case study has been considered. Section 6 discusses future work and concludes the paper.

2 Related work

Within the recent spread of data analysis applications, we are witnessing an increase in the number of tools that help users explore, prepare, and analyze data. In the literature, there are several approaches that aim to support the early stages of the data analysis pipeline, such as data exploration, profiling, and DQ assessment. For instance, Ehrlinger and Wöß (2022) provide a comprehensive survey of tools for profiling and DQ measurement and monitoring; Issa et al. (2021) focus on the identification and exploration of data inconsistencies; Shrivastava et al. (2019) and Patel et al. (2023) implement approaches able to support the exploration and the DQ assessment for data used in AI systems.

Other approaches focus instead on the entire data preparation pipeline. Rekatsinas et al. (2017), Cui et al. (2022), and Melgar et al. (2021) propose systems able to automate a data preparation pipeline. In Berti-Équille (2019, 2020), the design of the optimal data preparation pipeline is based on a reinforcement learning approach. Other systems leverage the knowledge related to data preparation pipelines performed in the past to define and suggest promising pipelines for new datasets (Mahdavi et al., 2019; Chu et al., 2015; Mahdavi & Abedjan, 2021).

This paper presents DIANA, a framework with the same objective as the previous contributions. We also use a KB to gather recommendations on the optimal data preparation tasks to apply. However, recommendations are based on knowledge acquired empirically from experiments and past users’ actions (see Section 4 ).

Some AutoML approaches also offer the possibility to perform data preparation automatically (Neutatz et al., 2022 ; Feurer et al., 2022 ; Shchur et al., 2023 ). However, like most of the work mentioned above, these approaches aim to fully automate the data preparation process, keeping the user unaware of how the data were prepared. As previously stated, instead, such processes need to be designed to exploit human involvement, increasing effectiveness and trust.

On this subject, some tools have recently been developed to perform data preparation with more interactive approaches: Luo et al. ( 2020 ) develops a data visualization process exploiting user feedback to find new data errors and suggest possible repairs; Martin et al. ( 2019 ) builds an iterative process in which the data analyst progressively identifies DQ issues and addresses them; Krishnan et al. ( 2016 ) proposes an iterative data cleaning approach to support statistical modeling.

In these approaches, the interaction is very limited: they allow user intervention only at specific times of the process, e.g., providing feedback or answering specific questions, and they are not adaptive to different types of users and needs. For this reason, our approach aims to engage users at all stages of the data preprocessing pipeline and offer different levels of support depending on their needs.

Another set of relevant contributions focuses on studying the effect of DQ errors and DQ improvement techniques on the performance of ML applications: Qi et al. ( 2021 ); Qi and Wang ( 2021 ); Foroni et al. ( 2021 ) analyze the impact that different DQ errors have on the prediction outcome of different ML applications. In contrast, Li et al. ( 2021 ) investigates the impact of data cleaning on ML classification models. Note that to enrich the content of our KB, we needed to implement a set of experiments using an approach that systematically pollutes and then cleans the data, similar to those implemented by Qi et al. ( 2021 ); Qi and Wang ( 2021 ); Foroni et al. ( 2021 ); Li et al. ( 2021 ). Similarly, our previous work (Sancricca & Cappiello, 2022 ) demonstrates the different impact that DQ errors can have depending on the dataset and the ML model used and discusses preliminary results that led us to realize the framework presented in Section 3 .

A valuable set of contributions is related to DQ assessment and improvement of data streams, e.g., sensor data, or time series. Sibai et al. ( 2017 ); Yu et al. ( 2019 ); Pérez-Castillo et al. ( 2018 ) propose a methodology for facilitating the DQ management of sensors and streaming data, while Ramírez-Gallego et al. ( 2017 ) summarizes, categorizes, and analyzes contributions related to data preprocessing of streaming data. It is evident that controlling and increasing DQ is also essential for streaming data to have accurate analysis and predictions. However, there is a lack of systems that can handle the exploration and preparation of data streams.

Table 1 compares the above-mentioned approaches aimed to automate or suggest an optimal data preparation pipeline, including our framework. As can be seen from Table 1 , DIANA is the framework with the highest number of capabilities: (i) the number of adaptive approaches is low (only the Data Quality Advisor (DQA) (Shrivastava et al., 2019 ) works for both tabular and streaming data); (ii) our approach covers the majority of the data preparation actions and data analysis tasks (only Learn2clean (Berti-Équille, 2019 , 2020 ) is comparable to our approach); finally, (iii) DIANA is one of the few human-centered approaches (only (Martin et al., 2019 ; Luo et al., 2020 ) considers users’ interaction).

3 The DIANA framework

This section describes the environment for aDaptIve dAta-ceNtric AI (DIANA) we designed. Figure 1 depicts the functional architecture of DIANA. It includes the two main steps of a traditional data science pipeline: Data Preparation, which collects and processes data guaranteeing a certain level of DQ, and Data Analysis. We mainly focus on Data Preparation. As stated in Section 1, DIANA is able to: (a) provide data exploration, data profiling, and DQ assessment functionalities to make users aware of errors and/or anomalies, and biases, (b) facilitate data preparation, providing recommendations of the data preparation activities and the order in which they have to be performed, to achieve better analysis results, (c) adapt to different data source types, and (d) change the provided support based on users' needs. The approach is human-centered and implements human-in-the-loop techniques to include human interaction at various stages.

Fig. 1: The DIANA architecture

The suggestions rely on a KB containing the information needed to recommend the most appropriate pipeline for the specific user's analysis context. The analysis context is defined as the combination of (i) the data source profile and (ii) the type of analysis, e.g., the ML algorithm that the user intends to run on the data. The system also warns users of potential biases. Moreover, DIANA can deal with different types of data sources, namely tabular data and data streams. It also regulates the support provided based on users' preferences: the support can range from completely automatic to fully manual design. The user is potentially involved in all the phases through an interactive interface. The following describes all the phases in detail.

Fig. 2: Window-based approach for time series

Source adaptation analysis

At the beginning of the process, the targeted users load the Data Sources they want to analyze. Here, the Source Adaptation and Analysis component registers the type of data (tabular or a data stream) to trigger the right methods to consider in the following phases. In fact, in the case of data streams, since the data continuously arrive, the phases of Data Exploration, Data Profiling, DQ Assessment, and Data Preparation are executed following a window-based approach (Arasu & Manku, 2004). The window-based approach is highlighted in Fig. 2: once a time series is connected to our system, a window W is fixed. Note that W can represent the number of incoming samples or a specific time period. Only the observations in such a window are considered for executing all the framework activities. Every time a new window is uploaded, the Data Exploration, Profiling, and DQ Assessment are performed, and the results are updated. The data preparation pipeline is, instead, designed with the first window. Then, if something changes, e.g., the DQ decreases over time, the recommendations might also change. In this case, the system alerts the user, who can change the pipeline. Otherwise, the same pipeline is performed on all the incoming windows. Once a set of observations is cleaned, the Data Analysis phase can be performed and the history of the results related to each window is saved.
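
A minimal sketch of this window-based loop (not the authors' implementation; the component functions are passed in as placeholders for the corresponding DIANA modules):

```python
from typing import Callable, Iterable, List, Sequence

def process_stream(
    windows: Iterable[Sequence[dict]],                    # each window = W samples or a time slice
    assess: Callable[[Sequence[dict]], float],            # Profiling + DQ assessment -> quality score
    design_pipeline: Callable[[Sequence[dict]], Callable],
    analyze: Callable[[Sequence[dict]], dict],
    alert_user: Callable[..., Callable],
    quality_drop: float = 0.1,
) -> List[dict]:
    pipeline, baseline, history = None, None, []
    for window in windows:
        quality = assess(window)
        if pipeline is None:                              # pipeline is designed on the first window only
            pipeline, baseline = design_pipeline(window), quality
        elif baseline - quality > quality_drop:
            pipeline = alert_user(window, quality, pipeline)  # user may revise the pipeline
        history.append(analyze(pipeline(window)))         # per-window analysis results are kept
    return history
```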

Data exploration & goal definition

First, the system provides a Data Exploration engine with interactive visualizations. It also allows users to enter the User Preferences : the subset of the most relevant features (i.e., dataset’s attributes) and values, and the type of analysis to be performed (if the user already knows it) ; in case the analysis purpose is not yet known, the Goal Definition engine can help to understand what kinds of analysis are suitable for the data source at hand. We envision different degrees of support in line with the sliding autonomy approach explained in Section 1 , as shown in Fig. 3 . If the user does not know the goal of the analysis, we can provide a list of possible use cases through the use of a generative AI tool. On the contrary, if the user already knows the analysis to perform, the system simply provides feedback on its initial suitability. Note that both functionalities are always available so that each user can choose the needed support level. Once all the preferences have been collected, they are saved in the KB.

Fig. 3: Sliding autonomy approach

Data profiling, data quality assessment & bias awareness

Data can be inspected via the Data Profiling, DQ Assessment, and Bias Awareness tools. Data Profiling extracts metadata and allows users to check the dataset's content, providing basic statistics and interactive visualizations on value distributions, correlations, or missing values. The DQ Assessment assesses the level of several DQ aspects (e.g., completeness, accuracy, consistency) and reveals the presence of DQ issues that might compromise data analysis results. The Bias Awareness phase provides insights about possible biases. The results of these phases should help users understand the datasets' content and their initial suitability for the task at hand.

Knowledge base

Our framework generates recommendations to support the design of the optimal data preparation pipeline by exploiting the content of a KB. In detail, the KB stores the results of some experiments and empirical evidence useful to feed the recommendation tools. In particular, the KB provides data related to (i) the impact of DQ errors on ML models and (ii) the effectiveness of data preparation techniques on ML models. Such impact and effectiveness are expressed in terms of ML model performance (e.g., Precision, Recall, etc.). The idea behind the recommendations is as follows: given the user dataset and – eventually – the ML analysis (an unknown analysis context ), our goal is to exploit all previously collected data to train a set of regression models to predict the expected analysis performance in two cases: (i) in presence of particular DQ errors, and (ii) after a specific data preparation technique is applied. By sorting the results of the regression models, we can suggest: (a) the order in which the DQ dimensions should be improved; (b) the data preparation technique that could achieve the highest performance during analysis. In this way, we identify the most effective pipeline to recommend for a specific analysis context based on the dataset profiles in the KB that are most similar to the input dataset. The conceptual model of our KB is formalized in Section 4 , while the process we performed for enriching it is discussed in Section 5 .

Data preparation & bias mitigation

The results of the previous phases are sent to the Data Preparation engine, which has three main goals: recommending the most appropriate task to perform, designing the preparation pipeline, and executing the DQ improvement methods. At this point, the KB is queried to extract (i) the order in which the DQ dimensions should be improved and (ii) for each DQ dimension, a recommendation about the best improvement technique to apply. In the envisioned approach, we assume that the users are free to follow the suggestions or not, letting them change the order of the suggested actions or substitute them by selecting other options from the available ones. Once the data preparation pipeline is defined, the included tasks are executed, and the cleaned dataset is available for download. Also, in this last phase, different support is provided for the sliding autonomy principle. As shown in Fig. 3 , we provide three support levels: in (i) Manual Data Preparation, we let the user build the pipeline autonomously, choosing from the list of all the available activities; in the (ii) Assisted Data Preparation, suggestions on the data preparation activities and their order are given to the user, which always has the possibility to change the pipeline; instead, the (iii) Automatic Data Preparation performs the suggested pipeline automatically and directly outputs the cleaned dataset.

4 The knowledge base model

The KB contains (i) descriptive data and (ii) results of experiments. The former concern the DQ Dimensions \(DQ = \{dq_1,...,dq_i\}\); each DQ dimension \(dq_i\) can be assessed by using specific Data Quality Metrics \(DQM = \{dqm_{i1},...,dqm_{ij}\}\), and is associated with corresponding Data Preparation Activities \(DP = \{dp_1,...,dp_k\}\) that could improve or worsen it. Each Data Preparation Activity \(dp_k\) has different Data Preparation Methods \(DPM = \{dpm_{k1},...,dpm_{kn}\}\) in which it could be performed. For example, data imputation could be executed using different methods, e.g., imputation of the mean, mode, median, or ML-based techniques. Moreover, it stores the list of considered Data Analysis Applications \(A = \{a_1,...,a_p\}\), e.g., queries or ML models, and the Data Properties \(P = \{p_1,...,p_m\}\) of the Data Objects \(D = \{d_1,...,d_x\}\) under analysis. A property \(p_m\) describes a specific dataset characteristic and could be extracted using one of the available Data Profiling Activities \(DPA = \{dpa_{m1},...,dpa_{my}\}\). The latter are the results of experiments fundamental to support the generation of suggestions. In particular, as partially presented in Sancricca and Cappiello (2022), we run two types of experiments.

The first one aimed to find, for each combination of a Data Object \(d_x\) and Data Analysis Application \(a_p\) , i.e., an ML model, the impact of DQ errors on the quality of the results. We found that different DQ dimensions can impact the outcome performance differently depending on the analysis context . We assume that the data preparation pipeline should first address the DQ issue that has the highest impact on the result’s performance. Our previous work (Sancricca & Cappiello, 2022 ) shows that improving the DQ dimensions in order of importance for that specific analysis context could give better final results. However, this approach still has limitations: it can be ineffective if (i) two DQ dimensions have a very similar impact or (ii) the dataset contains many errors. In the first case, the DQ dimensions can switch positions; in the latter case, the trained model may perform very differently from the original one. In this way, we can identify a DQ dimension ranking that specifies the DQ dimensions that must be prioritized in a specific context.

The second experiment has been designed to identify the corresponding top-k data preparation methods for each combination of Data Object \(d_x\) , Data Analysis Application \(a_p\) , and DQ dimension \(dq_i\) to improve. In this case, the goodness of the preparation method has been evaluated by considering the quality of the data analysis results.

Fig. 4: (a) Knowledge base model: graph database schema. (b) An example of the KB content, considering the results of Section 5

In order to make recommendations based on the acquired knowledge, we compare the analysis context given as input with all the previously analyzed contexts stored in the KB. The idea is to learn from the past: we consider the analysis context that is closest to the input one and adopt the same DQ dimension ranking. Moreover, for each DQ dimension, we aim to use the knowledge acquired on the data preparation methods to train a classifier able to extract the top-k data preparation actions for a new context; in this way, we will be able to build the suggested pipeline.

Figure 4 (a) shows the schema of the KB modeled as a property graph database (Angles, 2018 ). To follow, we present the definition of each element of the KB.

Definition 1

A Data Quality Dimension \(dq_i \in DQ\) is related to a specific DQ aspect of the data. In our KB model, DQ Dimensions represent the DQ aspects that have been considered to pollute the datasets during the experiments. They associate each Data Object with the DQ errors (related to a specific DQ Dimension) that have been injected into it.

Definition 2

A Data Quality Metric \(dqm_{ij} \in DQM\) expresses ways in which a DQ Dimension can be assessed. A DQ Metric is calculated using a formula and assigned to a numerical value representing the DQ level achieved by a specific DQ Dimension.

Definition 3

A Data Preparation Activity \(dp_k \in DP\) represents a transformation or cleaning operation for DQ improvement that the user can apply in order to clean the data.

Definition 4

A Data Preparation Method \(dpm_{kn} \in DPM\) indicates a possible way by which a specific Data Preparation Activity can be performed.

Definition 5

A Data Object \(d_x \in D\) represents a database in which one or more DQ errors have been injected with specific percentages. Furthermore, it may be a dirty version and thus contain errors, or it may represent a cleaned version of the dirtied one. In the latter case, one or more Data Preparation Methods were executed on it.

Definition 6

A Data Property \(p_m \in P\) indicates a specific characteristic of a Data Object. They represent the dataset’s profiling properties and are all assigned to numerical values. The complete set of Data Properties represents the Data Profile of a given Data Object.

Definition 7

A Data Profiling Activity \(dpa_{my} \in DPA\) is a technique that calculates a specific Data Property of a Data Object and returns as output its associated numerical value.

Definition 8

A Data Analysis Application \(a_p \in A\) represents the type of analysis that has been performed on a Data Object. It may represent a simple query or a more complex analysis, such as an ML model. For an ML model, the performance reached by executing such a model on the Data Objects is stored as a numerical value.

Once all the elements of the KB have been defined, a formalization of the relationships, represented by the set of edges \(E = \{e_1,e_2,...\}\) and their related properties, is listed below. Function \(\rho \) returns the subject and object of an edge, respectively, function \(\lambda \) returns the name of a relationship, while the ( e ,  property ) notation denotes the value of a specific property assigned to an edge or node of the graph. Note that the formalization of relationships is only shown in the direction in which the edges appear in the property graph.

Relationship 1

has . A DQ Dimension has at least one or more DQ Metrics; however, a Data Quality Metric is associated with a unique Data Quality Dimension. Moreover, a Data Preparation Activity has at least one or more Data Preparation Methods, while a Data Preparation Method is associated with a unique Data Preparation Activity.

Relationship 2

generates . A Data Profiling Activity generates a unique Data Property, while a Data Property is generated by a unique Data Profiling Activity.

Relationship 3

improve , affects . Each Data Quality Dimension can be improved or negatively affected by the application of a specific Data Preparation Activity. In fact, the application of a Data Preparation Activity that improves a specific DQ Dimension, e.g., data imputation for improving completeness, can negatively affect other ones, e.g., the accuracy dimension. Moreover, a Data Preparation Activity cannot both improve and negatively affect the same DQ Dimension.

Relationship 4

isAssociated . Each Data Object is associated with at least one Data Quality Dimension whose errors have been injected into it.

Relationship 5

isGenerated . A Data Object could also be the result of the application of one or more Data Preparation Methods.

Relationship 6

propertyValue . The whole set of the Data Properties is saved for each Data Object. Moreover, this relation includes a numerical value \(\in \Re \) , which represents the computation of the Data Property for a Data Object.

Relationship 7

metricValue . The whole set of the Data Quality Metrics is stored for each Data Object. Moreover, this relation includes a numerical value between 0 and 1, representing the DQ assessment level of a Data Object for a specific DQ Metric.

Relationship 8

performanceValue . A Data Object can be given in input to a Data Analysis Application (in this case, an ML model). Once the ML model is computed, two properties are saved: the extracted performance value \(\in \Re \) and the name of the performance metric. The same ML model can be executed on the same Data Object multiple times.
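To make the schema more concrete, the following minimal sketch shows how a small fragment of such a KB could be represented as a property graph in Python using networkx. All node names, relationship keys, and attribute values below are illustrative assumptions and do not reflect the actual implementation or the real experimental results.

```python
import networkx as nx

# Minimal, illustrative property-graph fragment of the KB (assumed names and values).
kb = nx.MultiDiGraph()

# Nodes, with a "label" attribute mirroring the KB entities defined above.
kb.add_node("completeness", label="DQDimension")
kb.add_node("missing_ratio", label="DQMetric")
kb.add_node("imputation", label="DataPreparationActivity")
kb.add_node("linear_interpolation", label="DataPreparationMethod")
kb.add_node("neweather_p30", label="DataObject")          # e.g. a polluted dataset version
kb.add_node("rf_classifier", label="DataAnalysisApplication")

# Edges mirroring the relationships defined above (keys are the relationship names).
kb.add_edge("completeness", "missing_ratio", key="has")
kb.add_edge("imputation", "linear_interpolation", key="has")
kb.add_edge("imputation", "completeness", key="improve")
kb.add_edge("neweather_p30", "completeness", key="isAssociated")
kb.add_edge("neweather_p30", "missing_ratio", key="metricValue", value=0.70)
kb.add_edge("neweather_p30", "rf_classifier", key="performanceValue",
            value=0.85, metric="F1")

# Example query: retrieve all performance values recorded for a given Data Object.
for _, application, key, data in kb.out_edges("neweather_p30", keys=True, data=True):
    if key == "performanceValue":
        print(application, data["metric"], data["value"])
```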

5 Knowledge base enrichment: a time series case study

This section presents the enrichment procedure outlined to fill the KB described in Section 4 . These experiments aim to enrich only a subset of the abovementioned entities and relationships. In particular, the work presented in this paper focuses on the KB enrichment procedure for suggesting the best data preparation actions to apply in a specific analysis context .

In detail, the obtained results will feed the Relationship 8: performanceValue (see Section 4 ), in which: a Data Object is associated with the different polluted and cleaned versions of the considered datasets (NEWeather and AirQuality); a Data Analysis Application is represented by Random Forest and K-Nearest Neighbors algorithms; the performanceValue is reflected by the F1, Precision, Recall, and R2 metrics (see Section 5.1 for more details on the experimental setup) whose values have been extracted during the Procedure Execution (see Section 5.2 for more details on the experimental procedure). Figure 4 (b) shows an example of the filled KB content.

Note that we perform the experiments focusing on a subset of selected DQ dimensions, data preparation actions, data sources, and ML algorithms. Integrating these experimental results into the KB will be fundamental to suggesting the top-k data preparation activities. As the framework is adaptive, the KB will also contain information collected from different types of sources. Since part of the enrichment process has already been developed for tabular data (Sancricca & Cappiello, 2022 ), the results presented in this paper focus on time series. For this reason, all the experiments have been developed following a window-based approach, as described in Section 5.2 . Section 5.1 presents the DQ dimensions, data preparation actions, data sources, and ML algorithms we consider; Section 5.2 describes the experimental procedure; Section 5.3 lists the experiments we designed; finally, Section 5.4 discusses the obtained results.

5.1 Experimental setup

DQ dimensions

In this work, we focus on the Completeness and the Accuracy dimensions. Completeness is defined as the degree to which a given data collection includes the data describing the corresponding set of real-world objects (Wang & Strong, 1996 ). It is affected by missing values and can be improved, for instance, by applying data imputation techniques. Note that the input dataset is considered to be free of DQ problems. For this reason, we have to inject missing values to test the different data imputation techniques. Accuracy is defined as the closeness between a data value v and a data value v’ , considered as the correct representation of the real-life phenomenon that the data value v aims to represent (Batini & Scannapieco, 2020 ). Considering that the dataset values are mostly numerical in the context of the time series, the accuracy is mainly related to the presence of outliers: values that differ significantly from the other observations. Outliers can be detected by applying suitable outlier detection techniques. Cleaning outliers can be performed by removing them or substituting them with an estimated corrected value, e.g., using data imputation techniques. Again, we must inject outliers to test the data preparation techniques.
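As a rough illustration of how these two dimensions could be assessed on a single window, consider the sketch below; the metric definitions, the z-score threshold, and the synthetic data are our assumptions rather than the exact metrics used in the experiments.

```python
import numpy as np
import pandas as pd

def assess_completeness(window: pd.DataFrame) -> float:
    """Fraction of non-missing cells in the window (1.0 = fully complete)."""
    return 1.0 - window.isna().to_numpy().mean()

def assess_accuracy_zscore(window: pd.DataFrame, threshold: float = 3.0) -> float:
    """Rough accuracy proxy: fraction of rows with no feature beyond |z| > threshold."""
    numeric = window.select_dtypes(include=np.number)
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    outlier_rows = (z.abs() > threshold).any(axis=1)
    return 1.0 - outlier_rows.mean()

# Quick check on synthetic data with some injected missing values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"temp": rng.normal(20, 5, 200), "press": rng.normal(1013, 10, 200)})
df.loc[rng.choice(200, 20, replace=False), "temp"] = np.nan
print(assess_completeness(df), assess_accuracy_zscore(df))
```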

Two datasets containing multivariate time series have been selected for the experiments: AirQuality Footnote 1 and NEWeather. Footnote 2

AirQuality . The Beijing Multi-Site Air-Quality dataset contains hourly air impurity measurements from 12 nationally controlled monitoring sites of the Beijing Municipal Environmental Monitoring Center. This dataset already contains outliers and missing values; however, we performed the experiment starting with a cleaned version. It has 15 features and 9357 samples with all numerical features, except for Date and Time . The ML task that can be performed on this dataset is regression, and the target variable is PM2.5 , representing the concentration of a particular type of atmospheric pollutant.

NEWeather . The Weather dataset records daily weather measurements from hundreds of locations situated at the Offutt Air Force Base in Nebraska and contains over 50 years of data. It is a subset of the National Oceanic and Atmospheric Administration database (NOAA). Daily measurements refer to temperature, pressure, visibility, and wind speed. It has 9 features and 18159 samples with all numerical features, except for the targeted class rain . Note that this dataset is already cleaned. The ML task that can be performed is the binary classification of the target variable representing the presence of rain on a given day.

Machine learning algorithms

Two ML algorithms have been selected: Random Forest and K-Nearest Neighbors.

Random Forest (RF) . It is an ensemble learning method that combines the output of multiple decision trees to make predictions. This technique is capable of handling both regression and classification problems. We perform all the experiments using the RF algorithm.

K-Nearest Neighbors (KNN) . It is a distance-based algorithm that, like RF, can be employed for classification and regression.

In this work, we use the scikit-learn (Pedregosa et al., 2011 ) implementations of RandomForestRegressor, RandomForestClassifier, KNeighborsRegressor and KNeighborsClassifier. Before applying the ML model, the dataset (in our case, a single window) is divided into training and testing sets, where 67% of the window is assigned to training and the remaining 33% to testing.
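A minimal sketch of how a single window could be evaluated with these estimators is shown below. The helper name, the use of train_test_split, and the default hyperparameters are assumptions; only the 67/33 split ratio follows the setup described above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def evaluate_window(window, target_col, model=None, test_size=0.33, seed=0):
    """Fit a model on 67% of a single window and score it on the remaining 33%."""
    X = window.drop(columns=[target_col])
    y = window[target_col]
    # Whether the split is shuffled or kept in temporal order is not detailed in the
    # paper; a plain random split is used here for illustration.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
    model = model or RandomForestClassifier(random_state=seed)
    model.fit(X_tr, y_tr)
    return f1_score(y_te, model.predict(X_te))
```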

Performance metrics

The two metrics that have been selected to assess the performance of classification and regression tasks are the following:

R2 . It assesses how effectively a regression model predicts a target variable. R2 represents the proportion of the variance in the dependent variable that is explained by the independent variables, i.e., the amount of variance explained by the model w.r.t. the total variance.

F1 . It assesses the classification performance by providing the harmonic mean of precision and recall. Precision is the ratio of True Positives (TP) to all predicted positives, indicating the model’s ability to identify only the data points that actually belong to a class. Conversely, Recall is the ratio of TP to the sum of TP and False Negatives (FN), measuring the model’s capability to detect all data points in the relevant class.

The Precision , Recall , and F1 measures have also been used to evaluate the effectiveness and performance of outlier detection methods, providing a comprehensive assessment of their ability to identify anomalies within a given dataset.

Data imputation methods

The following data imputation techniques were used for both data imputation and correcting outliers:

Dropping . It removes data observations displaying anomalies or missing values.

Last Observation Carried Forward (LOCF) . Missing values are replaced by propagating the most recent previous valid observation forward. This simple imputation scheme helps to maintain data continuity.

Mean . It replaces missing data points or outliers using the aggregated mean. Since we apply a window-based approach, the mean is computed for each window. This method can introduce discontinuities in the data.

Linear Interpolation . It is a curve-fitting technique that uses linear polynomials. It imputes the value of a generic point ( x ,  y ) that is situated between two other points \((x_1,y_1)\) and \((x_2,y_2)\) with the following formula: \(y = y_1 + (x - x_1) \frac{(y_2 - y_1)}{(x_2 - x_1)}\) . Unlike averaging, linear interpolation can maintain data continuity in time series.
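The four strategies map naturally onto pandas operations, as in the sketch below; this is illustrative only, and the exact implementation used in the experiments may differ.

```python
import pandas as pd

def impute(window: pd.DataFrame, method: str) -> pd.DataFrame:
    """Apply one of the imputation strategies described above to a single window."""
    if method == "dropping":
        return window.dropna()                      # remove observations with missing values
    if method == "locf":
        return window.ffill()                       # last observation carried forward
    if method == "mean":
        return window.fillna(window.mean(numeric_only=True))  # per-window mean
    if method == "linear":
        return window.interpolate(method="linear")  # linear interpolation
    raise ValueError(f"unknown method: {method}")
```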

Outlier detection methods

The following outlier detection techniques have been selected.

z-score . It is a statistics-based approach that uses the mean and standard deviation to classify a data point as an outlier. The score is computed as \(Z = \frac{x - \mu }{\sigma }\) , in which x is the analyzed value, \(\mu \) is the mean of all the values, and \(\sigma \) is the standard deviation. Since our datasets are multivariate, the score is computed for each feature. A data point is flagged as an outlier if at least one z-score exceeds a fixed threshold, which we set to 3.

Local Outlier Factor (LOF) . It is a density-based approach that uses the concept of local density, computed as the distance between that point and its k-nearest neighbors . The algorithm compares an object’s local density with its neighbors’ local density, detecting the points with a lower density w.r.t. their neighbors as outliers.

Isolation Forest (iforest) . It returns the outlier score of each sample using the Isolation Forest algorithm. Isolation Forest isolates observations starting from a randomly selected set of features. For each selected feature, a split value between the maximum and the minimum of that feature is chosen at random. The splitting is recursive and can be represented by a tree structure. The path length of each observation is a measure of its normality : splits produce significantly shorter paths for anomalies; thus, the lower the outlier score, the more likely the observation is an anomaly (Liu et al., 2008 ).

Half Space Trees (HST) . It is an incremental tree-based anomaly detector. Each tree partitions the data space into regions (windows); the majority of points fall into densely populated regions, and points that fall outside them are considered anomalies. The algorithm generates many trees, and the prediction is computed by majority voting (Tan et al., 2011 ).
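A possible way to wrap the first three detectors behind a common interface is sketched below; the function name, the fixed threshold, and the hyperparameters are assumptions. HST is not part of scikit-learn; an incremental implementation is available, for instance, in the river library, so it is only referenced in a comment.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def detect_outliers(X: np.ndarray, method: str, z_threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask over the rows of a window (True = flagged as outlier)."""
    if method == "zscore":
        z = (X - X.mean(axis=0)) / X.std(axis=0)
        return (np.abs(z) > z_threshold).any(axis=1)
    if method == "lof":
        return LocalOutlierFactor(n_neighbors=20).fit_predict(X) == -1
    if method == "iforest":
        return IsolationForest(random_state=0).fit_predict(X) == -1
    # Half Space Trees: see, e.g., river.anomaly.HalfSpaceTrees for a streaming version.
    raise ValueError(f"unknown method: {method}")
```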

5.2 Experimental procedure

The experiments follow the procedure outlined below. This process aims to understand the influence of data preparation on a time series ML analysis.

Step 1: Error injection

As the initial phase of the experimental procedure, we apply a controlled injection of DQ errors into the dataset. The original dataset is polluted with errors concerning a specific DQ dimension. Several polluted instances of the original dataset are generated. Each of these instances contains a different percentage of DQ errors. We inject the errors for a percentage range P from 10% to 50%, with an increasing step of 10%. Therefore, we created five instances of each dataset for every DQ dimension. Note that the distribution we use to inject the DQ errors is uniform; thus, the errors have been injected completely at random. We injected two different errors: missing values and outliers, respectively, related to the completeness and accuracy DQ dimensions.

Three different data pollution functions have been created: Missing Values Injection , which inserts missing values into P % of the dataset values; Outliers Injection , which introduces outliers into P % of the rows, with a probability p for each value of the observation; and Missing Values and Outliers Injection , in which both functions are applied at the same time and the value P % is split equally between the two DQ errors: P /2% missing values and P /2% outliers are injected.
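The sketch below illustrates what a uniform, completely-at-random pollution of a dataset could look like; the outlier magnitude, the handling of non-numeric columns, and the function signatures are simplifying assumptions rather than the exact pollution functions used in the experiments.

```python
import numpy as np
import pandas as pd

def inject_missing_values(df: pd.DataFrame, p: float, seed: int = 0) -> pd.DataFrame:
    """Set roughly p*100% of the cells, chosen uniformly at random, to NaN."""
    rng = np.random.default_rng(seed)
    mask = rng.random(df.shape) < p
    return df.mask(mask)

def inject_outliers(df: pd.DataFrame, p: float, magnitude: float = 5.0, seed: int = 0) -> pd.DataFrame:
    """Shift the numeric values of roughly p*100% of the rows by `magnitude` standard deviations."""
    rng = np.random.default_rng(seed)
    polluted = df.copy()
    rows = rng.random(len(polluted)) < p
    numeric_cols = polluted.select_dtypes(include=np.number).columns
    polluted.loc[rows, numeric_cols] += magnitude * polluted[numeric_cols].std()
    return polluted
```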

Step 2: Procedure execution

The procedure’s input is a polluted time series created in Step 1 , in which data points are collected within a window of size W . Two different processes have been implemented to execute the experiments.

Performance Evaluation Process . The data preparation actions are applied to each window, which is split into train and test sets. The train set is used to fit the ML model, while the test set is employed to execute the prediction task. Once this phase is over, the W analyzed data points are discarded, and the results are saved. Every time a number W of new data points is detected, a new window is initialized. The results displayed in Section 5.4 represent the average over all the analyzed windows. Once the entire series is processed, all the computed performance metrics are compared to understand the effect of the different data preparation actions. This evaluation aims to determine which data preparation action is the most effective. Two distinct data preparation actions are executed in the experiments concerning the injection of more than one type of DQ error, i.e., Missing Values and Outliers Injection .

Outlier Detection Evaluation Process . This process aims to assess the effectiveness of the outlier detection methods. This experiment slightly differs from the one described before. Here, no ML task is performed; this pipeline aims to evaluate the capacity to detect outliers correctly. Once an outlier detection method is performed on the window, the Precision , Recall , and F1 metrics are computed, comparing the detected outliers with the injected ones.

To reduce the random effect related to the random injection of errors, Steps 1 and 2 have been executed 8 times using different random seeds in the Error Injection phase. Then, the results of each Procedure Execution were averaged.
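Putting the pieces together, a window-based evaluation run could look roughly like the sketch below. It reuses the hypothetical helpers sketched earlier (inject_missing_values, an impute-style prepare function, and an evaluate_window-style scoring function); the window size, injection percentage, and minimum-window check are assumptions.

```python
import numpy as np

def windowed_evaluation(series_df, window_size, prepare_fn, score_fn, n_seeds=8, p=0.30):
    """Average the ML performance over all windows and over several injection seeds."""
    scores = []
    for seed in range(n_seeds):
        polluted = inject_missing_values(series_df, p=p, seed=seed)   # Step 1: error injection
        for start in range(0, len(polluted) - window_size + 1, window_size):
            window = polluted.iloc[start:start + window_size]
            cleaned = prepare_fn(window)          # data preparation applied per window
            if len(cleaned) < 10:                 # skip windows emptied by dropping
                continue
            scores.append(score_fn(cleaned))      # e.g. fit/score the ML model on 67/33
    return float(np.mean(scores))
```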

5.3 List of experiments

Six different types of experiments have been performed to evaluate the impact of data preparation on time series, following the two procedures explained in Section 5.2 and using the setup listed in Section 5.1 .

They are summarized in Table 2 , which includes the DQ Dimensions involved, the pollution function used for Error Injection (Step 1) , the process employed in the Procedure Execution (Step 2) , and the data preparation tasks involved:

Experiment 1 (data imputation). It aims to determine which data imputation technique is better in terms of ML model performance.

Experiment 2 (outlier detection). It aims to assess the detection capacity of the outlier detection methods.

Experiment 3 (outlier detection and standard value correction). After evaluating the outlier detection technique, the detected outliers are set to a standard value before executing the analysis with the selected ML model.

Experiment 4 (outlier detection and correction). It aims to identify the best combination of outlier detection and correction techniques in terms of ML model performance. The identified outliers are corrected using the selected data imputation techniques.

Experiment 5 (outlier detection and correction, then data imputation). It aims to assess the best combination of data preparation actions to apply in order to (i) correct both outliers and missing values and (ii) achieve the best performance. Outliers are first detected and corrected with the data imputation techniques; finally, missing values are imputed.

Experiment 6 (data imputation, then outlier detection and correction). It is similar to the previous experiment: the only change is the order in which the data cleaning methods are applied. Missing values are first imputed, and then outliers are detected and corrected with the data imputation techniques.

5.4 Results

This section discusses the results of the experiments described in Section 5.3 . For each of the experiments, a table with the results has been created. In the tables, the columns represent the percentage of DQ, while rows correspond to the different data imputation and outlier detection methods, or the combination of the two, performed on data. In addition to the data imputation and outlier detection methods listed in Section 5.1 , we also calculate the ML model’s performance in the case of considering the erroneous dataset as input. This case is labeled as none in the tables. The numerical values in the cells describe the ratio between (i) the performance achieved by executing that particular experiment (with that specific combination of a dataset, data preparation methods, and an ML algorithm) and (ii) the original performance obtained by using the clean dataset. The performance of the two clean datasets within the two selected ML models is shown in Table 3 .

Experiment 1: data imputation

The values in Table 4 demonstrate that applying any imputation method for the classification task is better than keeping the missing values. For the regression, very similar performance values are obtained for all the imputation methods, but the best ones are LOCF and linear interpolation.

Since the analyzed data is a multivariate time series, imputation techniques that maintain continuity in the trend of the values reach a higher performance. Regression works better by keeping the missing values rather than replacing them with the mean since the model is more sensitive to data discontinuity.

It is worth noticing that dropping the rows with missing values achieves performance higher than the original; this suggests that performance increases as the dataset shrinks, at the risk of obtaining an overfitted/less general model and incurring a considerable loss of information.

Experiment 2: outlier detection

Table 5 outlines that, for the NEWeather dataset, HST outperforms the other methods. However, this method shows an inverse trend w.r.t. the others: it is able to recognize more outliers when the quality is poor. This behavior has also been reported in the literature (Tan et al., 2011 ). A possible cause is that the algorithm recognizes more data points as anomalies than the real ones. This is confirmed by the high value of Precision and the lower value of Recall. For the AirQuality dataset, the most effective method is LOF. This demonstrates that the goodness of an outlier detection method depends on the dataset characteristics and how the values are distributed. iforest detects almost the same number of outliers for both ML tasks. In the AirQuality dataset, the number of detected anomalies almost matches the number injected at a quality level of 80%, which explains why performance is higher at 80% of quality than at 90%. This also occurs in the NEWeather dataset; however, the number of identified outliers corresponds to the number of outliers at 90% of quality. The z-score method performs badly due to its sensitivity to high-variance values: for this method, it is easier to detect outliers when the quality of the dataset is higher and, thus, the standard deviation is lower. The distance-based approach (LOF) generally outperforms the tree-based approaches (iforest and HST).

Experiment 3: outlier detection and standard value correction

Here, detected outliers are standardized to a unique value. Table 6 partially reflects the results obtained in Experiment 2 . Since RF is robust to outliers, the results show that it is better to keep all the outliers in the dataset rather than standardize them with a unique value. Since the z-score almost never finds outliers, its performance is comparable to keeping them. The outlier detection methods behave similarly across the two datasets, except for iforest: for the AirQuality dataset, the regression performance gets worse even with a low number of outliers; this happens since this method recognizes a greater number of anomalies than the real ones, as confirmed by the iforest Precision value (see Table 5 ); thus, standardizing correct values worsens the performance. To summarize, the performance improves as the level of DQ improves (except for iforest, as explained before). This means that the presence of outliers affects the performance of the ML models.

Experiment 4: outlier detection and correction

In general, better performance is achieved by dropping the detected outliers.

For RF (Table 7 ), correcting the outliers with the imputation methods always yields worse results than keeping them. We notice only a slight variation in performance as the number of injected outliers grows, which supports the assumption that this algorithm is robust to the presence of outliers. We repeated part of the experiments with the KNN algorithm and, as Table 8 demonstrates, outlier correction reaches higher performance than keeping the outliers in the dataset. This happens because KNN is less robust to outliers. Nonetheless, these results are comparable to those obtained with RF.

Removing outliers detected with HST, iforest, and LOF on the NEWeather dataset improves the performance. In particular, HST and iforest reach higher performance than the original dataset. This happens since more data points than the real outliers are considered anomalies and thus deleted, again at the risk of high information loss and overfitted/less generalized models. Instead, for the AirQuality dataset, dropping the detected outliers improves the performance only when applying LOF and z-score. These results are comparable to those obtained in the previous experiments: LOF is the best outlier detector for the AirQuality dataset, thus identifying the right outliers; z-score finds very few outliers, as reported by the F1 value in Table 5 , but they are all correct.

For AirQuality, using iforest and correcting with the mean leads to a negative R2 score. Since iforest flags too many data points as anomalies, many correct values are replaced by the mean, causing a discontinuity in the time series and worsening the performance. This is not the case for NEWeather, which probably has a less pronounced variance; therefore, the computed mean better fits the general trend of the time series.

As in the previous experiments, z-score hardly recognizes outliers; thus, it yields outcomes similar to the none case.

Experiment 5 and 6: outlier detection and correction, data imputation, and vice-versa

In Tables 9 and 10 , columns report the total level of DQ. For example, 50% of DQ means that 25% outliers and 25% missing values have been injected.

12 combinations of outlier detection, outlier correction, and data imputation methods have been investigated. Only the techniques that achieved the best results in the previous experiments were selected. Since dropping missing values (which obtained good performance) would have led to too great a loss of information, it was not considered in this part.

In general, the performance achieved by combining the data preparation techniques is very similar for all the combinations, and the results reflect those obtained in Experiments 1 and 4:

For the classification task, the data imputation methods behaved the same, while the best outlier detection and correction combinations were HST and LOF with outlier removal; z-score did not detect any outliers.

For the regression task, LOCF and linear interpolation were the best imputation methods, and LOF with outlier removal was the best combination for outlier detection and correction.

Our results suggest that, for time series, the order in which the two data preparation techniques are performed has a very low impact on the overall outcome. This behavior may be due to the limited size of the window on which the data preparation tasks are applied. Instead, we demonstrated (Sancricca & Cappiello, 2022 ) that for tabular datasets, the order in which data preparation actions are performed significantly impacts the analysis performance.

6 Conclusions and future work

This paper presents DIANA, the framework we designed to support users in effectively preparing data.

The experiments’ results led to significant findings related to the impact of data preparation on data streams: (i) the goodness of the data imputation or outlier detection methods depends on the dataset characteristics and its value distributions; (ii) in general, dropping outliers reaches higher performance but increases the risk of obtaining an overfitted and less generalized model; finally (iii) the order in which the data preparation techniques are performed has a very low impact on the analysis outcome.

The DIANA prototype is currently under development. Thus, future work includes the preparation of a complete first version, the definition of a user study for validating it, and its comparison with similar tools such as Berti-Équille ( 2019 ).

To let the system evolve and learn, as soon as the system starts to be used, we aim to extend the KB model to include users’ profiles, goals, and past actions (i.e., provenance). Future work will also focus on exploiting past users’ experiences and feedback to improve the recommendations.

Data Availability

The supporting data, code, and associated results are available in the dedicated GitHub repository: https://github.com/camillasancricca/diana-experiments-time-series.git .

Footnote 1: https://archive.ics.uci.edu/dataset/501/beijing+multi+site+air+quality+data

Footnote 2: https://www.ncei.noaa.gov/pub/data/gsod/2012/

Angles, R. (2018). The property graph database model. In Proceedings of the 12th Alberto Mendelzon International Workshop on Foundations of Data Management . CEUR Workshop Proceedings, vol. 2100. https://ceur-ws.org/Vol-2100/paper26.pdf

Arasu, A., & Manku, G. S. (2004). Approximate counts and quantiles over sliding windows. In C. Beeri, & A. Deutsch (eds.) Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems , (pp. 286–296). ACM. https://doi.org/10.1145/1055558.1055598

Batini, C., & Scannapieco, M. (2016). Data and information quality - dimensions, principles and techniques. Data-Centric Systems and Applications. Springer. https://doi.org/10.1007/978-3-319-24106-7

Berti-Équille, L. (2019) Learn2clean: Optimizing the sequence of tasks for web data preparation. In The World Wide Web Conference, WWW 2019 , (pp. 2580–2586). ACM. https://doi.org/10.1145/3308558.3313602

Berti-Équille, L. (2020). Active reinforcement learning for data preparation: Learn2clean with human-in-the-loop. In CIDR 2020 Proceedings . www.cidrdb.org . http://cidrdb.org/cidr2020/gongshow2020/gongshow/abstracts/cidr2020_abstract59.pdf

Chu, X., Morcos, J., Ilyas, I.F., Ouzzani, M., Papotti, P., Tang, N., & Ye, Y. (2015). KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In Proc. of the 2015 ACM SIGMOD , (pp. 1247–1261). ACM. https://doi.org/10.1145/2723372.2749431

Côté, N., Canu, A., Bouzid, M., & Mouaddib, A. (2012) Humans-robots sliding collaboration control in complex environments with adjustable autonomy. In 2012 IAT , (pp. 146–153). IEEE Computer Society. https://doi.org/10.1109/WI-IAT.2012.215

Cui, Q., Zheng, W., Hou, W., Sheng, M., Ren, P., Chang, W., & Li, X. (2022). Holocleanx: A multi-source heterogeneous data cleaning solution based on lakehouse. In HIS 2022, Proceedings. LNCS , vol. 13705, (pp. 165–176). Springer. https://doi.org/10.1007/978-3-031-20627-6_16

Ehrlinger, L., & Wöß, W. (2022). A survey of data quality measurement and monitoring tools. Frontiers Big Data, 5 , 850611. https://doi.org/10.3389/FDATA.2022.850611

Feurer, M., Eggensperger, K., Falkner, S., Lindauer, M., & Hutter, F. (2022). Auto-sklearn 2.0: Hands-free automl via meta-learning. Journal of Machine Learning Research, 23 , 261–126161.

Foroni, D., Lissandrini, M., & Velegrakis, Y. (2021). Estimating the extent of the effects of data quality through observations. In ICDE 2021 , (pp. 1913–1918). IEEE. https://doi.org/10.1109/ICDE51399.2021.00176

Garibay, Ö. Ö., Winslow, B., et al. (2023). Six human-centered artificial intelligence grand challenges. International Journal of Human–Computer Interaction, 39 (3), 391–437. https://doi.org/10.1080/10447318.2022.2153320

Hameed, M., & Naumann, F. (2020). Data preparation: A survey of commercial tools. SIGMOD Record, 49 (3), 18–29. https://doi.org/10.1145/3444831.3444835

Issa, O., Bonifati, A., & Toumani, F. (2021). INCA: inconsistency-aware data profiling and querying. In SIGMOD ’21 , (pp. 2745–2749). ACM. https://doi.org/10.1145/3448016.3452760

Jarrahi, M. H., Memariani, A., & Guha, S. (2023). The principles of data-centric AI. Communications of the ACM, 66 (8), 84–92. https://doi.org/10.1145/3571724

Krishnan, S., Wang, J., Wu, E., Franklin, M. J., & Goldberg, K. (2016). Activeclean: Interactive data cleaning for statistical modeling. Proceedings of the VLDB Endowment, 9 (12), 948–959. https://doi.org/10.14778/2994509.2994514

Li, P., Rao, X., Blase, J., Zhang, Y., Chu, X., & Zhang, C. (2021). CleanML: A study for evaluating the impact of data cleaning on ML classification tasks. In ICDE 2021 , (pp. 13–24). IEEE. https://doi.org/10.1109/ICDE51399.2021.00009

Liu, F. T., Ting, K. M., & Zhou, Z. (2008). Isolation forest. In Proceedings of ICDM , pp. 413–422. IEEE Computer Society. https://doi.org/10.1109/ICDM.2008.17

Luo, Y., Chai, C., Qin, X., Tang, N., & Li, G. (2020). Interactive cleaning for progressive visualization through composite questions. In ICDE 2020 (pp. 733–744). IEEE. https://doi.org/10.1109/ICDE48307.2020.00069

Mahdavi, M., & Abedjan, Z. (2021). Semi-supervised data cleaning with raha and baran. In 11th Conference on Innovative Data Systems Research, CIDR 2021 . www.cidrdb.org . http://cidrdb.org/cidr2021/papers/cidr2021_paper14.pdf

Mahdavi, M., Neutatz, F., Visengeriyeva, L., & Abedjan, Z. (2019). Towards automated data cleaning workflows. In Proc. of the Conference on "Lernen, Wissen, Daten, Analysen" . CEUR Workshop Proceedings, vol. 2454, (pp. 10–19). CEUR-WS.org. https://ceur-ws.org/Vol-2454/paper_8.pdf

Martin, N., Martinez-Millana, A., Valdivieso, B., & Fernández-Llatas, C. (2019). Interactive data cleaning for process mining: A case study of an outpatient clinic’s appointment system. In BPM 2019 International Workshops . LNBIP, vol. 362, pp. 532–544. Springer. https://doi.org/10.1007/978-3-030-37453-2_43

Melgar, L. A., & Dao, D., et al. (2021). Ease.ml: A lifecycle management system for machine learning. In CIDR 2021 . www.cidrdb.org . http://cidrdb.org/cidr2021/papers/cidr2021_paper26.pdf

Neutatz, F., Chen, B., Alkhatib, Y., Ye, J., & Abedjan, Z. (2022). Data cleaning and automl: Would an optimizer choose to clean? Datenbank-Spektrum, 22 (2), 121–130. https://doi.org/10.1007/S13222-022-00413-2

Patel, H., Guttula, S. C., Gupta, N., Hans, S., Mittal, R. S., & Nagalapatti, L. (2023). A data-centric AI framework for automating exploratory data analysis and data quality tasks. ACM Journal of Data and Information Quality, 15 (4), 44–14426. https://doi.org/10.1145/3603709

Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12 , 2825–2830. https://doi.org/10.5555/1953048.2078195

Pérez-Castillo, R., Carretero, A. G., Caballero, I., Rodríguez, M., Piattini, M., Mate, A., Kim, S., & Lee, D. (2018). DAQUA-MASS: an ISO 8000–61 based data quality management methodology for sensor data. Sensors, 18 (9), 3105. https://doi.org/10.3390/S18093105

Qi, Z., & Wang, H. (2021). Dirty-data impacts on regression models: An experimental evaluation. In DASFAA 2021 . LNCS, vol. 12681, (pp. 88–95). Springer. https://doi.org/10.1007/978-3-030-73194-6_6

Qi, Z., Wang, H., & Wang, A. (2021). Impacts of dirty data on classification and clustering models: An experimental evaluation. Journal of Computer Science and Technology, 36 (4), 806–821. https://doi.org/10.1007/S11390-021-1344-6

Ramírez-Gallego, S., Krawczyk, B., García, S., Wozniak, M., & Herrera, F. (2017). A survey on data preprocessing for data stream mining: Current status and future directions. Neurocomputing, 239 , 39–57. https://doi.org/10.1016/J.NEUCOM.2017.01.078

Rekatsinas, T., Chu, X., Ilyas, I. F., & Ré, C. (2017). Holoclean: Holistic data repairs with probabilistic inference. Proceedings of the VLDB Endowment, 10 (11), 1190–1201. https://doi.org/10.14778/3137628.3137631

Sancricca, C., & Cappiello, C. (2022). Supporting the design of data preparation pipelines. In Proceedings of SEBD 2022 . CEUR Workshop Proceedings, vol. 3194, (pp. 149–158). CEUR-WS.org. https://ceur-ws.org/Vol-3194/paper18.pdf

Shchur, O., Türkmen, A.C., Erickson, N., Shen, H., Shirkov, A., Hu, T., & Wang, B. (2023). Autogluon-timeseries: Automl for probabilistic time series forecasting. In Proc. of the International Conference on Automated Machine Learning , vol. 228, (pp. 9–121). PMLR. https://proceedings.mlr.press/v228/shchur23a.html

Shneiderman, B. (2020). Human-centered artificial intelligence: Reliable, safe & trustworthy. International Journal of Human–Computer Interaction, 36 (6), 495–504. https://doi.org/10.1080/10447318.2020.1741118

Shrivastava, S., et al. (2019) DQA: scalable, automated and interactive data quality advisor. In Proc. of 2019 (IEEE BigData) , (pp. 2913–2922). IEEE. https://doi.org/10.1109/BIGDATA47090.2019.9006187

Sibai, R. E., Chabchoub, Y., Chiky, R., Demerjian, J., & Barbar, K. (2017). Assessing and improving sensors data quality in streaming context. In ICCCI 2017 , Nicosia. LNCS, vol. 10449, (pp. 590–599). Springer. https://doi.org/10.1007/978-3-319-67077-5_57

Tan, S. C., Ting, K. M., & Liu, F. T. (2011). Fast anomaly detection for streaming data. In T. Walsh (ed.) IJCAI 2011 , (pp. 1511–1516). IJCAI/AAAI. https://doi.org/10.5591/978-1-57735-516-8/IJCAI11-254

Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12 (4), 5–33. https://doi.org/10.1080/07421222.1996.11518099

Yu, M., Wu, C., & Tsung, F. (2019). Monitoring the data quality of data streams using a two-step control scheme. IISE Transactions, 51 (9), 985–998. https://doi.org/10.1080/24725854.2018.1530487


Acknowledgements

Not applicable

Open access funding provided by Politecnico di Milano within the CRUI-CARE Agreement. This research was supported by EU Horizon Framework grant agreement 101069543 (CS-AWARE-NEXT).

Author information

Authors and affiliations.

Department of Electronics, Information and Bioengineering, Politecnico di Milano, Piazza Leonardo da Vinci, 20133, Milan, Italy

Camilla Sancricca, Giovanni Siracusa & Cinzia Cappiello


Contributions

All authors contributed equally.

Corresponding author

Correspondence to Camilla Sancricca .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Sancricca, C., Siracusa, G. & Cappiello, C. Enhancing data preparation: insights from a time series case study. J Intell Inf Syst (2024). https://doi.org/10.1007/s10844-024-00867-8


Received : 20 March 2024

Revised : 01 June 2024

Accepted : 08 July 2024

Published : 25 July 2024

DOI : https://doi.org/10.1007/s10844-024-00867-8


Keywords:
  • Data-centric AI
  • Data preparation
  • Data quality
  • Machine learning pipelines


Data science case interviews (what to expect & how to prepare)

Data science case study

Data science case studies are tough to crack: they’re open-ended, technical, and specific to the company. Interviewers use them to test your ability to break down complex problems and your use of analytical thinking to address business concerns.

So we’ve put together this guide to help you familiarize yourself with case studies at companies like Amazon, Google, and Meta (Facebook), as well as how to prepare for them, using practice questions and a repeatable answer framework.

Here’s the first thing you need to know about tackling data science case studies: always start by asking clarifying questions before jumping into your plan.

Let’s get started.

  • What to expect in data science case study interviews
  • How to approach data science case studies
  • Sample cases from FAANG data science interviews
  • How to prepare for data science case interviews

Click here to practice 1-on-1 with ex-FAANG interviewers

1. What to expect in data science case study interviews

Before we get into an answer method and practice questions for data science case studies, let’s take a look at what you can expect in this type of interview.

Of course, the exact interview process for data scientist candidates will depend on the company you’re applying to, but case studies generally appear in both the pre-onsite phone screens and during the final onsite or virtual loop.

These questions may take anywhere from 10 to 40 minutes to answer, depending on the depth and complexity that the interviewer is looking for. During the initial phone screens, the case studies are typically shorter and interspersed with other technical and/or behavioral questions. During the final rounds, they will likely take longer to answer and require a more detailed analysis.

While some candidates may have the opportunity to prepare in advance and present their conclusions during an interview round, most candidates work with the information the interviewer offers on the spot.

1.1 The types of data science case studies

Generally, there are two types of case studies:

  • Analysis cases , which focus on how you translate user behavior into ideas and insights using data. These typically center around a product, feature, or business concern that’s unique to the company you’re interviewing with.
  • Modeling cases , which are more overtly technical and focus on how you build and use machine learning and statistical models to address business problems.

The number of case studies that you’ll receive in each category will depend on the company and the position that you’ve applied for. Facebook , for instance, typically doesn’t give many machine learning modeling cases, whereas Amazon does.

Also, some companies break these larger groups into smaller subcategories. For example, Facebook divides its analysis cases into two types: product interpretation and applied data . 

You may also receive in-depth questions similar to case studies, which test your technical capabilities (e.g. coding, SQL), so if you’d like to learn more about how to answer coding interview questions, take a look here .

We’ll give you a step-by-step method that can be used to answer analysis and modeling cases in section 2 . But first, let’s look at how interviewers will assess your answers.

1.2 What interviewers are looking for

We’ve researched accounts from ex-interviewers and data scientists to pinpoint the main criteria that interviewers look for in your answers. While the exact grading rubric will vary per company, this list from an ex-Google data scientist is a good overview of the biggest assessment areas:

  • Structure : candidate can break down an ambiguous problem into clear steps
  • Completeness : candidate is able to fully answer the question
  • Soundness : candidate’s solution is feasible and logical
  • Clarity : candidate’s explanations and methodology are easy to understand
  • Speed : candidate manages time well and is able to come up with solutions quickly

You’ll be able to improve your skills in each of these categories by practicing data science case studies on your own, and by working with an answer framework. We’ll get into that next.

2. How to approach data science case studies

Approaching data science cases with a repeatable framework will not only add structure to your answer, but also help you manage your time and think clearly under the stress of interview conditions.

Let’s go over a framework that you can use in your interviews, then break it down with an example answer.

2.1 Data science case framework: CAPER

We've researched popular frameworks used by real data scientists, and consolidated them to be as memorable and useful in an interview setting as possible.

Try using the framework below to structure your thinking during the interview. 

  • Clarify : Start by asking questions. Case questions are ambiguous, so you’ll need to gather more information from the interviewer, while eliminating irrelevant data. The types of questions you’ll ask will depend on the case, but consider: what is the business objective? What data can I access? Should I focus on all customers or just in X region?
  • Assume : Narrow the problem down by making assumptions and stating them to the interviewer for confirmation. (E.g. the statistical significance is X%, users are segmented based on XYZ, etc.) By the end of this step you should have constrained the problem into a clear goal.
  • Plan : Now, begin to craft your solution. Take time to outline a plan, breaking it into manageable tasks. Once you’ve made your plan, explain each step that you will take to the interviewer, and ask if it sounds good to them.
  • Execute : Carry out your plan, walking through each step with the interviewer. Depending on the type of case, you may have to prepare and engineer data, code, apply statistical algorithms, build a model, etc. In the majority of cases, you will need to end with business analysis.
  • Review : Finally, tie your final solution back to the business objectives you and the interviewer had initially identified. Evaluate your solution, and whether there are any steps you could have added or removed to improve it. 

Now that you’ve seen the framework, let’s take a look at how to implement it.

2.2 Sample answer using the CAPER framework

Below you’ll find an answer to a Facebook data science interview question from the Applied Data loop. This is an example that comes from Facebook’s data science interview prep materials, which you can find here .

Try this question:

Imagine that Facebook is building a product around high schools, starting with about 300 million users who have filled out a field with the name of their current high school. How would you find out how much of this data is real?

First, we need to clarify the question, eliminating irrelevant data and pinpointing what is the most important. For example:

  • What exactly does “real” mean in this context?
  • Should we focus on whether the high school itself is real, or whether the user actually attended the high school they’ve named?

After discussing with the interviewer, we’ve decided to focus on whether the high school itself is real first, followed by whether the user actually attended the high school they’ve named.

Next, we’ll narrow the problem down and state our assumptions to the interviewer for confirmation. Here are some assumptions we could make in the context of this problem:

  • The 300 million users are likely teenagers, given that they’re listing their current high school
  • We can assume that a high school that is listed too few times is likely fake
  • We can assume that a high school that is listed too many times (e.g. 10,000+ students) is likely fake

The interviewer has agreed with each of these assumptions, so we can now move on to the plan.

Next, it’s time to make a list of actionable steps and lay them out for the interviewer before moving on.

First, there are two approaches that we can identify:

  • A high precision approach, which provides a list of people who definitely went to a confirmed high school
  • A high recall approach, more similar to market sizing, which would provide a ballpark figure of people who went to a confirmed high school

As this is for a product that Facebook is currently building, the product use case likely calls for an estimate that is as accurate as possible. So we can go for the first approach, which will provide a more precise estimate of confirmed users listing a real high school. 

Now, we list the steps that make up this approach:

  • To find whether a high school is real: Draw a distribution with the number of students on the X axis, and the number of high schools on the Y axis, in order to find and eliminate the lower and upper bounds
  • To find whether a student really went to a high school: use a user’s friend graph and location to determine the plausibility of the high school they’ve named

The interviewer has approved the plan, which means that it’s time to execute.

4. Execute 

Step 1: Determining whether a high school is real

Going off of our plan, we’ll first start with the distribution.

We can use x1 to denote the lower bound, below which the number of times a high school is listed would be too small for a plausible school. x2 then denotes the upper bound, above which the high school has been listed too many times for a plausible school.

Here is what that would look like:

Data science case study illustration

Be prepared to answer follow up questions. In this case, the interviewer may ask, “looking at this graph, what do you think x1 and x2 would be?”

Based on this distribution, we could say that x1 is approximately the 5th percentile, or somewhere around 100 students. So, out of 300 million students, if fewer than 100 students list “Applebee” high school, then this is most likely not a real high school.

x2 is likely around the 95th percentile, or potentially as high as the 99th percentile. Based on intuition, we could estimate that number around 10,000. So, if more than 10,000 students list “Applebee” high school, then this is most likely not real. Here is how that looks on the distribution:

Data science case study illustration 2
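If you want to sanity-check this kind of percentile reasoning offline, a quick sketch with purely synthetic counts (the distribution and all numbers below are made up for illustration) could look like this:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical number of users listing each of 100,000 distinct high-school names,
# drawn from a heavy-tailed distribution purely for illustration.
listings_per_school = rng.lognormal(mean=6.5, sigma=1.2, size=100_000).astype(int)

x1 = np.percentile(listings_per_school, 5)    # lower bound: too few listings to be plausible
x2 = np.percentile(listings_per_school, 95)   # upper bound: too many listings to be plausible
plausible = (listings_per_school >= x1) & (listings_per_school <= x2)
print(f"x1 ~ {x1:.0f}, x2 ~ {x2:.0f}, plausible schools: {plausible.mean():.0%}")
```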

At this point, the interviewer may ask more follow-up questions, such as “how do we account for different high schools that share the same name?”

In this case, we could group by the schools’ name and location, rather than name alone. If the high school does not have a dedicated page that lists its location, we could deduce its location based on the city of the user that lists it. 

Step 2: Determining whether a user went to the high school

A strong signal as to whether a user attended a specific high school would be their friend graph: a set number of friends would have to have listed the same current high school. For now, we’ll set that number at five friends.

Don’t forget to call out trade-offs and edge cases as you go. In this case, there could be a student who has recently moved, and so the high school they’ve listed does not reflect their actual current high school. 

To solve this, we could rely on users to update their location to reflect the change. If users do not update their location and high school, this would present an edge case that we would need to work out later.

To conclude, we could use the data from both the friend graph and the initial distribution to confirm the two signifiers: a high school is real, and the user really went there.

If enough users in the same location list the same high school, then it is likely that the high school is real, and that the users really attend it. If there are not enough users in the same location that list the same high school, then it is likely that the high school is not real, and the users do not actually attend it.

3. Sample cases from FAANG data science interviews

Having worked through the sample problem above, try out the different kinds of case studies that have been asked in data science interviews at FAANG companies. We’ve divided the questions into types of cases, as well as by company.

For more information about each of these companies’ data science interviews, take a look at these guides:

  • Facebook data scientist interview guide
  • Amazon data scientist interview guide
  • Google data scientist interview guide

Now let’s get into the questions. This is a selection of real data scientist interview questions, according to data from Glassdoor.

Data science case studies

Facebook - Analysis (product interpretation)

  • How would you measure the success of a product?
  • What KPIs would you use to measure the success of the newsfeed?
  • Friends acceptance rate decreases 15% after a new notifications system is launched - how would you investigate?

Facebook - Analysis (applied data)

  • How would you evaluate the impact for teenagers when their parents join Facebook?
  • How would you decide to launch or not if engagement within a specific cohort decreased while all the rest increased?
  • How would you set up an experiment to understand feature change in Instagram stories?

Amazon - modeling

  • How would you improve a classification model that suffers from low precision?
  • When you have time series data by month, and it has large data records, how will you find significant differences between this month and previous month?

Google - Analysis

  • You have a google app and you make a change. How do you test if a metric has increased or not?
  • How do you detect viruses or inappropriate content on YouTube?
  • How would you compare if upgrading the android system produces more searches?

4. How to prepare for data science case interviews

Understanding the process and learning a method for data science cases will go a long way in helping you prepare. But this information is not enough to land you a data science job offer. 

To succeed in your data scientist case interviews, you're also going to need to practice under realistic interview conditions so that you'll be ready to perform when it counts. 

For more information on how to prepare for data science interviews as a whole, take a look at our guide on data science interview prep .

4.1 Practice on your own

Start by answering practice questions alone. You can use the list in section 3 , and interview yourself out loud. This may sound strange, but it will significantly improve the way you communicate your answers during an interview. 

Play the role of both the candidate and the interviewer, asking questions and answering them, just like two people would in an interview. This will help you get used to the answer framework and get used to answering data science cases in a structured way.

4.2 Practice with peers

Once you’re used to answering questions on your own , then a great next step is to do mock interviews with friends or peers. This will help you adapt your approach to accommodate for follow-ups and answer questions you haven’t already worked through.

This can be especially helpful if your friend has experience with data scientist interviews, or is at least familiar with the process.

4.3 Practice with ex-interviewers

Finally, you should also try to practice data science mock interviews with expert ex-interviewers, as they’ll be able to give you much more accurate feedback than friends and peers.

If you know a data scientist or someone who has experience running interviews at a big tech company, then that's fantastic. But for most of us, it's tough to find the right connections to make this happen. And it might also be difficult to practice multiple hours with that person unless you know them really well.

Here's the good news. We've already made the connections for you. We’ve created a coaching service where you can practice 1-on-1 with ex-interviewers from leading tech companies. Learn more and start scheduling sessions today .


Data Preparation: From Raw Data to Ready-to-Use Insights

Data preparation is crucial for gaining true insights because it ensures accuracy and consistency in the data, which are the foundations for any reliable analysis. Without proper cleaning and structuring, data can be misleading, leading to incorrect conclusions and decisions.

If you collect information at scale, you're probably pulling it from different places and in all sorts of formats. This means the data you get can be a bit all over the place and hard to make sense of right away.

But did you know that messy data can cost companies an average of $12.9 million ? And that's not even the worst part. Using unorganized data can also lead to lost opportunities, disillusioned customers, and damaged reputation.

So, how do you transform raw information into insights that drive value? For this purpose, companies usually run data preparation . But what exactly is it, and why should you care? Jump into this article to learn more.

What is data preparation?

Amazon defines the preparation of data as a transformation of raw data into a format suitable for further processing. Gartner's data preparation definition suggests that this is an iterative-agile process aimed at examining, cleaning, transforming, and then merging raw information into curated datasets.

IBM singles out the automated data preparation process and describes it as a simplified way to get information ready for analysis. Within this process, you:

  • Analyze your data points
  • Identify fixes
  • Screen out problematic or useless fields
  • Derive new attributes
  • Improve performance through advanced screening techniques.

Why is data preparation important for analytics?

You might be surprised, but data preparation is the least favorite task of 76% of data scientists. Still, investments in solutions for processing messy data continue to grow. That's because the difference between prepared and unprepared data shows up directly in the analytical results you get.

  • Ever looked at raw data? Then you know that it's full of errors, inconsistencies, and irrelevant information. Preparing the data means you're making choices based on the real deal, not the clutter.
  • If you're into machine learning, prepping your data gives it a boost. When data scientists get polished data, they can build spot-on ML models.
  • Data preparation tools catch errors before any processing occurs. This proactive approach prevents potential issues down the line.
  • When data is neat and tidy, it's like an open book: easy for anyone to read. This means faster, smoother analysis without the headaches.

What are the data preparation process steps?

Abraham Lincoln once said, "If I had eight hours to chop down a tree, I’d spend the first six of them sharpening my axe." So it comes as no surprise that most data specialists claim to spend 70% to 80% of their time preparing data.

So, what is done to the data in the preparation stage? Here are the common data preparation steps that will ensure you get actionable insights.

  • Collection
  • Profiling
  • Cleaning
  • Structuring
  • Transformation & enrichment
  • Visualization

1. Collection

Data collection, also known as data harvesting, is about gathering the right info to hit certain goals. And trust us, the better the info, the cooler the insights you get from it. So, what types of information can you collect?

  • Quantitative data refers to things you can count (how many orders are placed on your website, the cost of a similar product/service at your competitors, and the age of your prospects).
  • Qualitative data is more about characteristics or qualities (what do your customers say about your product?).
  • Primary data means information collected firsthand for a specific purpose.
  • Secondary data is information that someone else has already collected and that you're reusing.

2. Profiling

Once you've collected the data, you've got to give it a thorough examination. You'll want to get a better idea of what it contains and which data preparation steps to take next.

So, at this stage, you make sure the data is consistent, accurate, and free from anomalies.

3. Cleaning

The goal of data cleaning is to make your dataset as accurate as possible. Why should you care? Because how well your data performs is directly related to how clean it is. For example, 25% of contact records contain critical errors, which directly affect sales and deal closure.

The key steps to a clean dataset include the following (a brief sketch follows the list):

  • Remove irrelevant data
  • Fix structural errors
  • Fill in missing data
  • Standardize data entry
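
A minimal pandas sketch of these four steps on made-up contact records (the column names and values are hypothetical):

```python
import pandas as pd

contacts = pd.DataFrame({
    "name": ["Ann Lee", "bob ray", "Ann Lee"],
    "email": ["ann@example.com", None, "ann@example.com"],
    "age": ["34", "29", "34"],            # numbers stored as text by mistake
    "internal_note": ["test row", "", ""],
})

# Remove irrelevant data: drop a column the analysis doesn't need
contacts = contacts.drop(columns=["internal_note"])

# Fix structural errors: convert text-typed numbers to a numeric dtype
contacts["age"] = pd.to_numeric(contacts["age"])

# Fill in missing data: flag unknown emails explicitly
contacts["email"] = contacts["email"].fillna("unknown")

# Standardize data entry: consistent casing and no duplicate records
contacts["name"] = contacts["name"].str.title()
contacts = contacts.drop_duplicates()

print(contacts)
```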

4. Structuring

Once you have clean data, the next step is to organize it. The process of collecting various types of data (both structured and unstructured) and then converting it into usable, meaningful information is known as data structuring. In other words, your goal is to organize the data so that you can do what you want with it.

Usually, you have multiple methods to structure your vast amounts of data: linear and non-linear. While linear structures store data elements in a sequence, non-linear ones organize data in a hierarchical manner (like trees or graphs).
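
As a small illustration, here is how nested (non-linear) records might be flattened into a linear, tabular structure with pandas; the field names are made up:

```python
import pandas as pd

# Non-linear structure: each order nests its customer details (a small tree)
orders = [
    {"order_id": 1, "total": 40.0, "customer": {"id": "C1", "city": "Austin"}},
    {"order_id": 2, "total": 15.5, "customer": {"id": "C2", "city": "Boston"}},
]

# Flatten into a linear row-and-column structure that's easy to analyze
table = pd.json_normalize(orders)
print(table)  # columns: order_id, total, customer.id, customer.city
```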

5. Transformation & enrichment

As you dive into the organization and preparation of raw data for data analysis, you should also pay attention to data transformation and enrichment. So, what are these notions about?

Data transformation refers to the process of tweaking the data format or values to make it fit better for analysis. The common ways to do this are through normalization, scaling, and encoding.
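
For instance, here is a minimal pandas sketch of min-max scaling and one-hot encoding on a toy dataset; the columns are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "income": [30_000, 52_000, 75_000],
    "age": [22, 35, 58],
    "plan": ["basic", "premium", "basic"],
})

# Normalization / scaling: rescale income into the 0-1 range (min-max)
df["income_scaled"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)

# Encoding: turn the categorical plan column into numeric indicator columns
df = pd.get_dummies(df, columns=["plan"])

print(df)
```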

Data enrichment, on the other hand, means enhancing collected data with information from external sources. Here is how it works. Say a telecom company wants to tailor data plans based on how and where its subscribers use the internet. It has its own set of basic user info but decides to learn more. It obtains insights from another company about popular apps in various areas and how much data those apps typically use, then blends this new information with its own. Now the company can paint a clearer picture of its subscribers' internet habits based on where they are.
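
In pandas terms, that kind of enrichment is essentially a join. Here is a minimal sketch, with invented region-level data standing in for the external source:

```python
import pandas as pd

# Internal subscriber data
subscribers = pd.DataFrame({
    "subscriber_id": [101, 102, 103],
    "region": ["North", "South", "North"],
    "monthly_gb": [12.0, 4.5, 30.0],
})

# Hypothetical external dataset: popular apps and typical usage per region
external = pd.DataFrame({
    "region": ["North", "South"],
    "top_app": ["video streaming", "messaging"],
    "avg_app_gb": [18.0, 3.0],
})

# Blend the external attributes into the internal records
enriched = subscribers.merge(external, on="region", how="left")
print(enriched)
```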

6. Visualization

Instead of sifting through rows and columns of raw data, visuals (charts, graphs, and maps) help make sense of it all. Data visualization is especially handy when you're dealing with a mountain of complex data, and you want to quickly grasp what's going on. Harvard Business Review breaks down data visualization into four main types based on its purpose:

  • Idea generation — helps see business operations or other aspects in a fresh light.
  • Idea illustration — enables you to represent an idea in a visual format.
  • Everyday dataviz — assists with routine tasks and decisions.
  • Visual discovery — allows exploring data visually to discover new insights, patterns, or anomalies.

So, preparing data for analysis is like laying the foundation for a house: it's not something you can skip if you want accurate results. Leave one of the steps out of the data preparation flow, and you might end up making wrong decisions based on faulty data. The reverse is also true: when the data is neatly laid out and trustworthy, it's easier to ask the right questions and get meaningful answers.

Remember the saying, "You get out what you put in"? It's the same with data. If you put in the effort to prepare it well, the results will be top-notch.

What is data preparation? An in-depth guide

Craig Stedman, Industry Editor

Data preparation is the process of gathering, combining, structuring and organizing data for use in business intelligence , analytics and data science applications. It's done in stages that include data preprocessing, profiling, cleansing, transformation and validation. Data preparation often also involves pulling together data from both an organization's internal systems and external sources.

IT, BI and data management teams do data preparation work as they integrate data sets to load into data warehouses, data lakes or other repositories. They then refine the prepared data sets as needed when new analytics applications are developed. In addition, data scientists, data engineers, data analysts and business users increasingly  use self-service data preparation tools  to collect and prepare data themselves.

Data preparation is often referred to informally as data prep. Alternatively, it's also known as data wrangling. But some practitioners use the latter term in a narrower sense to refer to cleansing, structuring and transforming data, which distinguishes data wrangling from the data preprocessing stage.

This comprehensive guide to data preparation further explains what it is, how to do it and the benefits it provides in organizations. You'll also find information on data preparation tools, best practices and common challenges faced in preparing data. Throughout the guide, hyperlinks point to related articles that provide more information on the covered topics.

Why is data preparation important?

One of the main purposes of data preparation is to ensure that raw data being processed for analytics uses is accurate and consistent. Data is commonly created with missing values, inaccuracies or other errors. Also, separate data sets often have different formats that must be reconciled when they're combined. Correcting data errors, improving data quality and consolidating data sets are big parts of data preparation projects that help generate valid analytics results.

Data preparation also involves finding relevant data to ensure that analytics applications deliver meaningful information and actionable insights for business decision-making. The data often is enriched and optimized to make it more informative and useful -- for example, by blending internal and external data sets, creating new data fields, eliminating outlier values and addressing imbalanced data sets that could skew analytics results.

In addition, BI and data management teams use the data preparation process to curate data sets for business users to analyze. Doing so helps streamline and guide  self-service BI  applications for business analysts, executives and workers.

What are the benefits of data preparation?

Data scientists often complain that they spend much of their time gathering, cleansing and structuring data. A big benefit of an effective data preparation process is that they and other end users can focus more on  data mining  and data analysis -- the parts of their job that generate business value. For example, data preparation can be done more quickly, and prepared data can automatically be fed to users for recurring analytics applications.

Done properly, data preparation also helps an organization do the following to gain business benefits:

  • Ensure the data used in analytics applications produces reliable results.
  • Identify and fix data issues that otherwise might not be detected.
  • Enable more informed decision-making by business executives and operational workers.
  • Reduce data management and analytics costs.
  • Avoid duplication of effort in preparing data for use in multiple applications.
  • Get a higher ROI from BI and data science initiatives.

Effective data preparation is particularly beneficial in  big data  environments that store a combination of structured, semistructured and unstructured data to support machine learning (ML), predictive analytics and other forms of advanced analytics. Those applications typically involve large amounts of data, which is often stored in raw form in a data lake until it's needed for specific analytics uses. As a result,  preparing data for machine learning can be more time-consuming than creating the ML algorithms to run against the data -- a situation that a well-managed data prep process helps rectify.

Steps in the data preparation process

Data preparation is done in a series of steps. There's some variation in the data preparation steps listed by different data professionals and software vendors, but the process typically involves the following tasks:

1. Data collection

Relevant data is gathered from operational systems, data warehouses, data lakes and other data sources. During the data collection step, data scientists, data engineers, BI team members, other data professionals and end users should confirm that the data they're gathering is a good fit for the objectives of planned analytics applications.

2. Data discovery and profiling

The next step is exploring the collected data to better understand what it contains and what needs to be done to prepare it for the intended uses. To help with that,  data profiling  identifies relationships, connections and other attributes in data sets. It also finds inconsistencies, anomalies, missing values and other data quality issues. While they sound somewhat similar, profiling differs from data mining , which is a separate process for identifying patterns and correlations in data sets as part of analytics applications.
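
A very small profiling pass with pandas might look like the sketch below; the dataset is invented, and dedicated profiling tools go much further:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [25.0, None, 30.0, -5.0],
    "country": ["US", "US", "DE", "DE"],
})

print(df.dtypes)                               # what each column holds
print(df.describe(include="all"))              # summary statistics
print(df.isna().sum())                         # missing values per column
print(df.duplicated(subset="order_id").sum())  # duplicate keys
print((df["amount"] < 0).sum())                # suspicious values to review
```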

3. Data cleansing

Next, the identified data errors and issues are corrected to create complete and accurate data sets. For example, as part of  data cleansing work, faulty data is removed or fixed, missing values are filled in and inconsistent entries are harmonized.
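
Continuing the toy example from the profiling step, a cleansing pass along those lines could look like this in pandas; the rules and mappings are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [25.0, None, 30.0, -5.0],
    "country": ["US", "U.S.", "Germany", "DE"],
})

# Remove faulty data: negative amounts are treated as invalid orders here
df = df[df["amount"].isna() | (df["amount"] >= 0)].copy()

# Fill in missing values: use the median order amount
df["amount"] = df["amount"].fillna(df["amount"].median())

# Harmonize inconsistent entries: map spelling variants to one country code
df["country"] = df["country"].replace({"U.S.": "US", "Germany": "DE"})

print(df)
```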

4. Data structuring

At this point, the data needs to be modeled and organized to meet analytics requirements. For example, data stored in comma-separated values files or other file formats must be converted into tables to make it accessible to BI and analytics tools.
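
As a minimal sketch of that step, the snippet below loads a CSV file and stores it as a relational table that BI tools can query; the file and table names are hypothetical:

```python
import sqlite3

import pandas as pd

# Read the raw comma-separated values file into a DataFrame
sales = pd.read_csv("raw_sales.csv")  # hypothetical input file

# Store it as a structured table in a small local database
conn = sqlite3.connect("analytics.db")
sales.to_sql("sales", conn, if_exists="replace", index=False)
conn.close()
```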

5. Data transformation and enrichment

In addition to being structured, the data typically must be transformed into a unified and usable format. For example,  data transformation  might involve creating new fields or columns that aggregate values from existing ones. Data enrichment further enhances and optimizes data sets as needed, through measures such as augmenting and adding data.
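
A short pandas sketch of that idea, converting values into a unified format and creating a new aggregated field; the column names are invented:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2"],
    "order_value_cents": [4000, 6000, 2500],
})

# Transform into a unified, usable format: cents -> dollars
orders["order_value"] = orders["order_value_cents"] / 100

# Create a new field that aggregates values from existing ones
orders["customer_total"] = (
    orders.groupby("customer_id")["order_value"].transform("sum")
)

print(orders)
```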

6. Data validation and publishing

In this last step, automated data validation routines are run against the data to check its consistency, completeness and accuracy. The prepared data is then stored in a data warehouse, a data lake or another repository, where it's either used by whoever prepared it or made available for other users to access.
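
A toy version of such a validation routine in pandas might look like this; the rules are illustrative, and production pipelines typically rely on dedicated validation frameworks:

```python
import pandas as pd

prepared = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [25.0, 30.0, 12.5],
    "country": ["US", "DE", "US"],
})

# Completeness: no missing values in required columns
assert prepared[["order_id", "amount"]].notna().all().all()

# Consistency: country codes come from an agreed list
assert prepared["country"].isin(["US", "DE", "FR"]).all()

# Accuracy (sanity check): amounts fall within a plausible range
assert prepared["amount"].between(0, 10_000).all()

print("Validation passed; the data set can be published.")
```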

Data preparation can also incorporate or feed into  data curation  work that creates ready-to-use data sets for BI and analytics applications. Data curation involves tasks such as indexing, cataloging and maintaining data sets and their associated metadata to help users find and access the data. In some organizations, data curator is a formal role that works collaboratively with data scientists, business analysts, other users and the IT and data management teams. In others, data might be curated by data stewards, data engineers, database administrators or data scientists and business users themselves. 

Key steps in the data preparation process

What are the challenges of data preparation?

Data preparation is inherently complicated. Data sets pulled together from different source systems are likely to have numerous data quality, accuracy and consistency issues to resolve. The data also must be manipulated to make it usable, and irrelevant data needs to be weeded out.

As noted above, doing so is often a lengthy process: In the past, a common maxim was that data scientists spent about 80% of their time collecting and preparing data and only 20% analyzing it. That might not be the case now, partly due to the increased availability of data preparation tools. But in the 2023 edition of an annual survey conducted by data science platform vendor Anaconda, 1,071 data science practitioners ranked data preparation and data cleansing as the two most time-consuming tasks in analytics applications.

The following are seven common data preparation challenges faced by data scientists and others involved in the process:

  • Inadequate or nonexistent data profiling.  If data isn't properly profiled, errors, anomalies and other problems might not be identified, which can result in flawed analytics.
  • Missing or incomplete data.  Data sets often have missing values and other forms of incomplete data; such issues need to be assessed as possible errors and addressed if so.
  • Invalid data values.  Misspellings, other typos and wrong numbers are examples of invalid entries that frequently occur in data and must be fixed to ensure analytics accuracy.
  • Name and address standardization.  Names and addresses might be inconsistent in different systems, with variations that can affect views of customers and other entities if the data isn't standardized.
  • Inconsistent data across enterprise systems.  Other inconsistencies in data sets drawn from multiple source systems, such as different terminology and unique identifiers, are also pervasive issues to contend with in data preparation efforts.
  • Data enrichment issues.  Deciding how to enrich a data set -- for example, what to add to it -- is a complex task that requires a strong understanding of business needs and analytics goals.
  • Maintaining and expanding data prep processes.  Data preparation work often becomes a recurring process that needs to be sustained and enhanced on an ongoing basis.

List of top data preparation challenges

Data preparation tools

Data preparation can pull skilled BI, analytics and data management practitioners away from more high-value work, especially as the volume of data used in analytics applications continues to grow. However, the self-service tools now offered by various software vendors automate data preparation methods. That enables both data professionals and business users to get data ready for analysis in a streamlined and interactive way.

The tools run data sets through a workflow that follows the steps of the data preparation process. They also feature GUIs designed to further simplify the required tasks and functions. In addition to speeding up data preparation and leaving more time for related analytics work, the self-service software might help organizations increase the number of BI and data science applications they're able to run, thus opening up new analytics scenarios.

In 2023, consulting firm Gartner removed data preparation tools from its annual Hype Cycle report on emerging data management technologies, saying they had reached full maturity and mainstream adoption. Initially sold as separate products by several vendors that focused on data preparation, the tools have now largely been incorporated into broader data management software suites and BI or data science platforms. For example, Gartner lists data preparation as a core capability of BI platforms.

The following list includes some prominent BI, analytics and data management vendors that offer data preparation tools or capabilities -- it's based on market reports from Gartner and Forrester Research plus additional research by TechTarget editors:

  • Informatica.
  • Tibco Software.

One caveat from data management consultants and practitioners: Don't look at self-service data preparation software as a replacement for traditional  data integration  technologies, particularly extract, transform and load (ETL) tools. While data prep tools enable users to integrate relevant data sets for analytics applications, ETL ones provide heavy-duty capabilities for integrating large amounts of data, transforming it and moving it into a data store. The two technologies often complement one another: A data management team might use ETL software to combine and initially prepare data sets, then data scientists or business analysts can use a self-service tool to do more specific data preparation work.

Main features of self-service data preparation tools

Data preparation trends

While effective data preparation is crucial in machine learning applications, AI and machine learning algorithms are also increasingly being used to help prepare data. For example, tools with augmented data preparation capabilities based on AI and ML can automatically profile data, fix errors and recommend other data cleansing, transformation and enrichment measures. In its 2023 Hype Cycle report, Gartner said organizations should make such augmented features a must-have item when buying new data management tools.

Automated data prep features are also included in the augmented analytics technologies now offered by many BI vendors. The automation is particularly helpful for self-service BI users and citizen data scientists -- business analysts and other workers who don't have formal data science training but do some advanced analytics work. But it also speeds up data preparation by skilled data scientists and data engineers.

In addition, generative AI (GenAI) tools are starting to be incorporated into data management processes, including data preparation. For example, GenAI offers the potential for conversational interfaces that enable data prep tasks to be performed using natural language. It could also be used to write integration scripts, fix data errors and create data quality rules as part of data preparation work. Conversely, the deployment of GenAI applications further increases data prep workloads in organizations.

There's also a growing focus on cloud-based data preparation, as vendors now commonly offer cloud services for preparing data. Another ongoing trend involves integrating data preparation capabilities into DataOps processes that aim to streamline the creation of data pipelines for BI and analytics.

How to get started on data preparation

Donald Farmer, principal at consultancy TreeHive Strategy, listed the following six data preparation best practices to adopt as starting points for a successful initiative:

  • Think of data preparation as part of data analysis.  Data preparation and analysis are "two sides of the same coin," according to Farmer. That means data can't be properly prepared without knowing what analytics use it needs to fit.
  • Define what data preparation success means.  Desired data accuracy levels and other data quality metrics should be set as goals and then balanced against projected costs to create a data prep plan that's appropriate to each use case.
  • Prioritize data sources based on the application.  Resolving differences in data from multiple source systems is an important element of data preparation that also should be based on the planned analytics use case.
  • Use the right tools for the job and your skill level.  Self-service data preparation tools aren't the only option available -- other tools and technologies can also be used, depending on an organization's internal skills and data needs.
  • Be prepared for failures when preparing data.  Error-handling capabilities need to be built into the data preparation process to prevent it from going awry or getting bogged down when problems occur.
  • Keep an eye on data preparation costs.  The cost of software licenses, data processing and storage resources, and the people involved in preparing data should be watched closely to ensure that expenses don't get out of hand.

Craig Stedman is an industry editor who creates in-depth packages of content on analytics, data management, cybersecurity and other technology areas for TechTarget Editorial.

Ed Burns, a former executive editor at TechTarget, and freelance journalist Mary K. Pratt contributed to this article.

Data Preparation for Analytics Using SAS by Gerhard Svolba

Case Studies

Chapter 25   Case Study 1—Building a Customer Data Mart

Chapter 26   Case Study 2—Deriving Customer Segmentation Measures from Transactional Data

Chapter 27   Case Study 3—Preparing Data for Time Series Analysis

Chapter 28   Case Study 4—Preparing Data in SAS Enterprise Miner

Introduction

In this part we will cover four case studies for data preparation. These case studies refer to the content we presented in earlier chapters and put together in the context of a concrete question. In the case studies we will show example data and complete SAS code to create from the input data the respective output data mart. The following case studies will be examined:

•  building a customer data mart

•  deriving customer segmentation ...

Case Interview: The Free Preparation Guide (2024)

The case interview is a challenging interview format that simulates the job of a management consultant , testing candidates across a wide range of problem-solving dimensions.

McKinsey, BCG and Bain – along with other top consulting firms – use the case interview because it’s a statistically proven predictor of how well a candidate will perform in the role. The format is not only used by management consulting firms. Other types of organizations – like tech companies, financial services institutions, and non-profits – often use case interviews to assess candidates who are interviewing for roles focused on shaping strategic initiatives.

If you’re preparing to face a case interview, you may be feeling a little apprehensive. The format is notoriously demanding and unlike any other type of recruitment assessment you may have experienced before. However, with the right preparation and investment of time and effort, it is possible to master.

In this guide, we break down everything you need to know about the case interview, outlining exactly what you need to do to prepare effectively and ace the case.

Key takeaways

  • The classic case interview format follows the same steps that a management consultant would encounter on a client project. The interview is a little like a role-play where the interviewer plays the role of a client and the candidate plays the part of the consultant hired to solve the problem.
  • Some firms occasionally deviate from the classic case interview format. Popular alternatives include written case studies – which require candidates to review paper documents and then prepare and deliver a presentation – and market sizing case interviews, which require candidates to estimate a number.
  • Case interviews test candidates against a set of six problem-solving dimensions: structuring, math, judgment and insights, creativity, synthesis, and case leadership. The interviewer uses a scorecard to assess the candidate’s performance in each of these areas.
  • Case interview questions can be about almost any type of challenge or opportunity. However, our research indicates that there are 10 types of questions that are asked most frequently at top consulting firms. These include questions on profit improvement, revenue growth, and market entry.
  • To do well in a case interview, it’s vital to create custom interview structures that meet the conditions of the ‘AIM’ test. It helps to have a good working knowledge of key case interview frameworks, but this alone is not sufficient.
  • A strong grasp of case math is also crucial when it comes to case interview performance. While only high-school level math skills are required, it’s an aspect of the case interview that many candidates find challenging.
  • Successful candidates are able to summarize their findings effectively. They also demonstrate strong case leadership by progressing through the case proactively and remaining focused on its overarching objectives.
  • To prepare for a case interview, it’s essential to learn every problem-solving skill that will be assessed. We teach all of these skills in our Interview Prep Course , which contains all the video lectures, sample interviews, case material, and practice tools you’ll need to ace any case interview.
  • Most candidates who go on to receive an offer from a top consulting firm like McKinsey, BCG or Bain complete at least 25 live practice sessions with a partner before their interview. You’ll find over 100 high-quality cases in our Case Library and a diverse community of candidates available for practice in our Practice Room .
  • Some candidates choose to supplement their preparation by working with a coach who has been an interviewer at a top consulting firm. Here at CaseCoach, our coaches have all been handpicked from the alumni of top firms such as McKinsey, BCG and Bain.
  • Although the world’s top consulting firms all test candidates using similar methods, none of them approach the interview process in exactly the same way. If you’re preparing to interview at a top consulting firm, it’s important to do your research and find out what you can expect.

An introduction to the case interview

The case interview format

The classic case interview

The vast majority of case interviews follow the same steps that management consultants encounter on real client projects.

  • Brief: The interviewer gives the candidate a brief for the case. They explain the context in which the client is operating, and outline the challenge they’re facing.
  • Clarification: The candidate then has the chance to ask clarifying questions. They might do this to ensure they’ve understood the context of the problem correctly or to confirm the client’s goals.
  • Reflection: The candidate takes 60 to 90 seconds or so to reflect and lay out a structured approach to solving the case.
  • Analysis: The candidate and interviewer then work through the case together, carrying out analyses and moving toward a recommendation. This is the part of the case where you’ll be handling numerical questions, reviewing exhibits, coming up with creative ideas, and so on. It comprises the vast majority of the time you’ll spend on the case.
  • Synthesis: The case concludes with the candidate synthesizing their findings and making an overall recommendation to the client.

So what does this unique interview format look and feel like? In reality, a consulting case interview is a little like a role-play. The interviewer plays the role of a manager or client, and the candidate plays the part of the consultant hired to solve the problem. However, a case interview shouldn’t feel like a performance. The most successful candidates treat it as a natural conversation between two professional people.

In the video below you can see an example of exceptional case interview performance in action. The candidate and interviewer in the video are both former McKinsey interviewers.

Interviewer-led vs candidate-led cases

Although the classic case interview has an established format and assesses a specific set of skills, cases can be delivered in different ways. Some are more candidate-led, while others are more interviewer-led.

In a candidate-led case, the candidate is in the driver’s seat and is free to explore different aspects of the problem. Interviewers don’t tell candidates what to focus on next. Instead, they provide additional information – like an exhibit or a new fact – when asked. The candidate then analyzes the information and suggests next steps to get to the answer.

In an interviewer-led case, the interviewer may interrupt the candidate and ask them to either perform a specific investigation or focus on a different aspect of the problem. This doesn’t mean the interview is going badly; the interviewer is simply following a script. As a result, in an interviewer-led case, candidates are less likely to take the wrong path.

It’s difficult to predict which style of case you’ll receive. Some firms are known for using one style of interview more frequently than another. However, in practice, most interviews fall somewhere between the two extremes, depending on the style of the interviewer and the case material they’re using. You should therefore always be ready to suggest next steps and have a view about how to get to the answer.

Other case interview formats

While the classic case interview is most common, there are a couple of other interview formats that top consulting firms use from time to time:

The written case study

Some management consulting firms use written case studies to simulate the experience of carrying out consulting work even more accurately than the classic, verbal case interview. In some locations, BCG and Bain have been known to adopt this approach for a small minority of candidates.

In written cases, candidates review a series of paper documents and then structure the problem, run some numbers, generate ideas and, finally, deliver a short presentation. You can learn more in our article on how to crack written case studies .

Market sizing case interviews

Management consulting firms and other employers sometimes use market sizing questions – also known as estimation questions – as a standalone interview format to assess candidates on a wide range of problem-solving dimensions.

In a market sizing interview, you’ll be asked to estimate a number. This might be something like the revenue of a sandwich store or how many ATMs there are in a certain city. The ability to size a market is also a skill required for solving many case interview questions. You can learn more in our article on how to nail market sizing case interviews .
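
As a purely illustrative example of the arithmetic involved: to estimate the number of ATMs in a city of about 1 million people, you might assume that one ATM serves roughly 2,000 residents, giving 1,000,000 ÷ 2,000 = 500 ATMs, and then sanity-check that figure against what you know about the number of bank branches, stores, and transit hubs.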

Some key differences to expect

While case interviews are highly codified, it’s important to remember that every interview is unique.

In the final round of interviews, for example, cases may feel less scripted than they did in the first stage. Partners – who are part of the interviewing group in the final round – often use the same case for years at a time. This means they can deliver it without a script and, as a result, tend to give candidates more room to take the lead. You can learn more in our article on the differences between a first and final-round interview at McKinsey, BCG and Bain .

In addition, each firm or office might bring their own nuance or style to the classic case interview format. It’s important to do your research and find out what you can expect from the interview experience at your target firm or office. You can learn more in our article on how the interviews at McKinsey, BCG and Bain differ .

The skills assessed in case interviews

Case interviews are primarily about testing a set of problem-solving skills. The interviewer uses a scorecard to assess a candidate’s performance in the following dimensions:

  • Structuring: This is the ability to break problems down into logical drivers. It’s most obviously required at the beginning of a case, where you can pause and take a moment to come up with an approach. But it’s also tested each time you have to consider a new aspect of the problem.
  • Math: Most cases contain a quantitative component, such as estimation questions, break-even questions, or other calculations. To do well in this dimension, you need to lay out a clear and efficient approach, run calculations quickly and accurately, and then state their implications for the case.
  • Judgment and insights: This dimension is about extracting insights from data, usually by interpreting information in a chart. Performing well in this area involves processing new information quickly, prioritizing what’s important, and connecting your findings to develop sound recommendations.
  • Creativity: Cases often have a creative thinking component. Sharing numerous, varied and sound ideas – ideally in a structured way – can help you succeed here.
  • Synthesis: This is all about wrapping up the case with a clear and practical recommendation, and delivering it convincingly.
  • Case leadership: This dimension is about progressing through the case efficiently and staying focused on its objectives. Case leadership involves gathering facts effectively and building on new findings to develop a recommendation. It’s a particularly important dimension in candidate-led cases.

Questions to expect

If you’re preparing to interview at a top management consulting firm like McKinsey, BCG or Bain, you’re probably curious about the kind of case interview questions you can expect to receive.

To identify the most common case interview questions , we surveyed CaseCoach users who interviewed at either McKinsey, BCG or Bain for a generalist role in 2023. We found that of the 260+ case interviews reported by respondents:

  • 20% focused on profit improvement
  • 15% focused on revenue growth
  • 12% focused on market entry
  • 10% focused on cost cutting
  • 9% focused on process optimization

These topics align with the typical challenges and opportunities faced by CEOs. Because the job of a management consultant is to help CEOs find solutions to these problems, it’s vital for candidates to demonstrate that they understand the issues behind these questions.

However, while there are some recurring topics, the context and nuances of each individual case mean that no two case questions are the same. Increasingly, firms are testing candidates on questions that fall outside of these recurring topics. One way they’re doing this is by focusing on non-traditional areas, like the public sector. If you’re interviewing for a generalist management consulting role, it’s therefore important to be ready for almost any type of case question.

If you’re interviewing for a role that’s focused on a specific industry or function, like financial services , you’ll likely be given a case focused on that particular area.

How to ace the case

Case interviews require you to think on your feet to solve a complex problem that you’ve never seen before, while being assessed against a number of problem-solving dimensions. Here’s what you need to do to rise to the challenge and ace the case:

1. Create case interview structures that meet the AIM test

Of all the case interview assessment dimensions, structuring is perhaps the most challenging, particularly for those who are just starting out. It requires candidates to propose a prioritized and insightful approach to the case that’s composed of a comprehensive set of independent drivers. Structuring plays a foundational role in the interview, setting the course for the entire conversation.

So, what does good case structuring look like? An effective structure should meet the conditions of the ‘AIM’ test. ‘AIM’ stands for:

  • Answer-focused: The structure should identify the client’s goal and the question to solve. It should also provide an approach to answering that question.
  • Insightful: The structure should be tailored to the specifics of the client or to the problem in question. You shouldn’t be able to apply it to another case of the same type.
  • MECE: This is a well-known acronym among consultants. It stands for ‘mutually exclusive and collectively exhaustive’. In plain English, if a structure is ‘MECE’ it has been broken down into an exhaustive set of independent drivers.

2. Know key case interview frameworks

In a case interview, you’ll be asked to structure a variety of problems. There are a number of frameworks that can help you do this, whether the problem you’re structuring corresponds to a common case question or a different topic entirely:

Business frameworks

You can use established business frameworks to craft custom structures for the most common types of case questions. These include frameworks for mastering profitability questions , answering revenue growth questions and nailing market sizing questions .

Academic frameworks

For unusual case questions that don’t relate to an obvious business framework, it can be helpful to draw on an academic framework like supply and demand, ‘the three Cs’, or Porter’s Five Forces. You can learn more about all of these in our ultimate guide to case interview frameworks . The article includes other business and academic frameworks that you can use to craft custom structures for case questions.

Logical frameworks

Finally, logical frameworks can help you look at the big picture in order to structure your approach. These options can be particularly useful when you’re faced with an unusual case question that doesn’t lend itself to a business or academic framework. Some examples of logical frameworks include:

  • Structuring with equations: This approach is most helpful for quantitative case questions. Listen out for introductions that focus on a number. These cases can often be broken down into an equation and then structured along its variables (see the worked example after this list).
  • Structuring based on hypotheses: This approach is most helpful for structuring qualitative cases. It involves laying out what you most need to believe in order to validate a specific recommendation. These beliefs form your set of key hypotheses, which you then test as you progress through the case.
  • Structuring with root causes: This approach works well for structuring cases that require identifying the reasons for a problem. It involves laying out its potential causes in a way that is mutually exclusive and collectively exhaustive (i.e. MECE).
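
To illustrate the equation-based approach with a standard example: a case about falling profits can be written as Profit = Revenue − Costs = (Price × Volume) − (Fixed costs + Variable cost per unit × Volume), which immediately gives you a MECE set of drivers (price, volume, fixed costs, variable costs) to structure your investigation around.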

How to apply these frameworks

While business, academic, and logical frameworks can be helpful when it comes to structuring a problem, learning how to use them correctly is a skill in itself. Simply applying a framework to a case interview question in a ‘cookie-cutter’ fashion is not enough. To impress your interviewer and pass the AIM test, your structure will need to be heavily tailored to the situation at hand. In fact, many case questions can be best answered by combining different frameworks.

Ultimately, interviewers want to understand how your mind works and see you think on your feet. You’ll therefore need to demonstrate that you can propose a custom case interview structure to any question.

3. Be comfortable with simple math

Management consulting firms expect you to navigate mathematical problems confidently and reliably in case interviews. Regardless of your academic background or past experience, you’ll need to be able to set an approach to solve the problem, perform calculations quickly and accurately, and state the implications of your solution.

The good news is that you’ll only be required to demonstrate a high-school level of math skills in case interviews. However, with no calculators allowed and an interviewer looking over your shoulder, it’s natural to find this aspect of the experience a little intimidating.

So, what can you expect from case math? The problems you’ll be asked to solve may take the form of straight calculations, exhibits that require calculations, word problems, and estimation questions.

To do well in this part of the case interview, you’ll need to have a strong understanding of:

  • The four operations: addition, subtraction, multiplication, and division
  • Key math concepts such as fractions, percentages, and weighted averages
  • Business math concepts such as income statements, investments, and valuations

To stand out to your interviewer, you’ll also need to work through math problems confidently and efficiently. Here are our top tips for doing this (a short worked example follows the list):

  • Keep track of zeros: Case questions often involve large numbers, sometimes in the millions or even billions. Keeping close track of your zeros is therefore crucial. We recommend either counting the zeros in your calculation, using scientific notation, or assigning letter units to zeros.
  • Simplify your calculations: This will help you work through problems quickly and efficiently while reducing the potential for mistakes. One way of simplifying calculations is by rounding numbers up or down to make them more ‘friendly’.
  • Memorize frequently-used fractions: Some fraction values are used so frequently in case math that knowing them – along with their percentage value and decimal conversions – can save you significant time. We recommend memorizing the fraction and corresponding percentage and decimal values of 1/2, 1/3, all the way through to 1/10.
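
As a purely illustrative example that combines these tips: to estimate 21% of $4.2 billion, you could round 21% to 1/5, write $4.2 billion as 4,200M, and compute 4,200M ÷ 5 = 840M, i.e. about $0.84 billion. Keeping every figure in millions makes it much harder to drop or add a zero along the way.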

You can learn more in our guide to mastering case interview math .

4. Summarize your findings

Synthesis is a key skill assessed by interviewers, predominantly at the end of a case interview. You need to provide a clear and sound recommendation that answers the overall question convincingly. You must also describe the key supporting points that informed your recommendation and then outline any further steps you would advise the client to take.

When it comes to concluding cases effectively, this four-step framework can be extremely helpful:

  • Quickly play the case question back to your interviewer.
  • Answer the question directly and briefly by distilling your response into a single sentence, if possible.
  • List the points that support your conclusion.
  • Outline the next steps that you recommend to the client.

You can learn more in our article on how to conclude a case study interview .

5. Bring it all together with strong case leadership

Case leadership, more than any other dimension, will give your interviewer an indication of how independently you could handle your workstream as a consultant. It’s a particularly important skill in candidate-led cases, where you’ll set the course of the discussion without the interviewer steering you in a particular direction.

Demonstrating strong case leadership means progressing the case efficiently and staying focused on its overarching objectives. Using a ‘tracker page’ to capture your structure and organize your notes throughout the case will help you in this regard.

Another aspect of case leadership is gathering facts effectively. This includes making reasonable assumptions, requesting missing information, and asking probing questions.

Finally, you’ll be expected to build on new findings to develop your recommendation, adapt your approach, and suggest next steps.

Effective case leadership is all about showing your interviewer that you have a strong command of the problem-solving process. After investigating each key driver in your structure, you need to be able to articulate where you are in your overall approach to solving the problem, and what the next steps should be.

To do this, we recommend using a five-step process to handle every kind of analysis you conduct during the case, whether you’re responding to a numerical question, the data in an exhibit, or something else.

Here’s what that five-step process looks like:

  • Set your approach. Define what you’re going to do upfront. It’s particularly important to be explicit here, especially if the analysis is in any way complex or ambiguous.
  • Conduct your analysis. Your approach here will vary according to the kind of question you’re working through.
  • State your findings. You may also want to make a note of your findings on your tracker page.
  • State the implications of your findings. Explain how they impact both your answer to the question and the client’s broader goal.
  • Suggest next steps. Your findings will sometimes change how you want to approach the rest of the case. This may mean altering your initial structure and editing it on your tracker page.

6. Be your best on the day

When the day of your interview comes around, you’ll want to be at your very best. But what exactly does this mean?

First, you should present yourself in a professional manner. It goes without saying that you should arrive on time but, ideally, you should plan to arrive early. You should also come equipped with the right material: a pen, a squared A4 or letter-size paper pad, and copies of your resume. It’s also vital to dress appropriately for the occasion. Usually this means formal business dress, but expectations differ by location, so we recommend doing some research to find out what consultants wear at your target firm and office.

To be at your best on the day of your interview, you must be well rested. Sadly, tiredness is one of the most common reasons for underperformance in consulting interviews. The day before is not the time to cram in further preparation. Instead, aim to have a quiet day and to get plenty of sleep at night.

Ultimately, consulting firms want to hire people who can represent the firm and interact with clients at every level, from the shop floor to the C-suite. Successful candidates treat the case interview as an opportunity to play the role of a management consultant advising a client (i.e. the interviewer). This means exhibiting a great deal of confidence and credibility, together with effective communication and an engaging attitude. It’s vital to stay focused on the overall problem and to drive the resolution of the case while being receptive to the interviewer’s input.

There are a lot of balls to juggle in a case interview, with the added pressure of a potentially life-changing outcome, but successful candidates don’t let their nerves get the better of them. We’ve provided some helpful hints and tips in our article on handling the stress of consulting interviews .

Being your best on the day of your interview requires extensive preparation. It means mastering each dimension of the case interview scorecard to the extent that the skills become second nature to you. It also means completing sufficient case practice to be able to focus on the big picture of the case you’re solving, rather than on simply demonstrating a set of skills.

How to prepare for case interview success

Delivering a standard of performance worthy of an offer from a top firm requires extensive case interview prep. In our experience, most successful candidates invest around 60 hours – or 10 hours each week over a six-week period – in their preparation. Failing to put this effort in is among the most common reasons why many candidates are unsuccessful.

Here’s what effective case interview preparation involves:

Learning the skills

In a case interview, your performance is assessed against a set of common problem-solving dimensions. To recap, these are structuring, math, judgment and insights, creativity, synthesis, and case leadership. It’s important to:

  • gain a precise understanding of the expectations on each of these skills
  • learn the techniques that will allow you to meet these expectations
  • practice until your performance meets the required standard

We teach all these skills in our Interview Prep Course . In our bite-sized video lectures, we map out each of the key skills assessed in a case, and explain what you need to know to demonstrate each skill. We also share our tips on how to improve in each dimension, going above and beyond the advice we’ve included in this article.

In addition, our Interview Prep Course includes many more sample interviews that show real candidates – who went on to join top consulting firms – solving cases. Former consulting interviewers explain what the candidates did well on each dimension and where they could have improved.

Math is a critical prerequisite to handling cases and is something you should be comfortable with before you begin practicing. Our Case Math Course – provided as part of the Interview Prep Course – will help you brush up your skills. It contains 21 video lectures that cover everything you need to know, including the four operations, key math concepts, our pro tips, and business math.

After watching all our Interview Prep and Case Math video lectures, we recommend heading to the ‘Drills’ area of CaseCoach, where you can start practicing specific skills. Drills are interactive exercises that pose rapid-fire questions and provide instant feedback. They help you build your skills and confidence in specific case dimensions quickly, allowing you to make the most of your live case practice with partners. Our Interview Prep Course includes a comprehensive set of drills in four key areas: structuring, calculation, case math and chart interpretation.

When it comes to succeeding in a case interview, nothing beats live practice with a partner. Most candidates who go on to receive an offer from a top consulting firm like McKinsey, BCG or Bain complete at least 25 live practice sessions before their interview.

To practice live cases with a partner, you’ll need access to both case material and practice partners. In our Case Library , you’ll find over 100 cases – complete with solutions – developed by former management consultants. You can download eight of these cases right away by creating a free CaseCoach account. You’ll find a diverse community of fellow candidates who are all available for case interview practice in our Practice Room , where we facilitate over 3,000 practice sessions a week.

You can learn more in our article on how to practice case interviews .

Working with a coach

Some candidates choose to supplement their preparations by working with a consulting interview coach who has been an interviewer at a top firm.

These coaches have the skills and experience to gauge your level of performance and help you identify your areas of strength and weakness. They can also provide you with accurate and helpful feedback on your case-solving skills. This insight can help you accelerate your preparation and improve your performance. Getting used to interviewing with a professional should also help to reduce the stress of the consulting interview experience.

Here at CaseCoach, our coaches are all former consultants and interviewers who have been handpicked from the alumni of top firms such as McKinsey, BCG and Bain.

Do your research

Although employers who use case interviews all test candidates using similar methods, none of them approach the interview process in exactly the same way.

For instance, if you expect to interview with McKinsey, Bain or BCG, it’s helpful to know that these firms all give cases of similar complexity. However, there are some key differences. For example:

  • Bain has been known to use estimation questions, such as market sizing, in interviews for its most junior (i.e. Associate Consultant level) roles.
  • BCG and Bain occasionally use written cases.
  • When it comes to the ‘fit’ interview, McKinsey uses its Personal Experience Interview format, while most Bain offices now use a ‘behavioral interview’. Only BCG consistently uses the classic fit interview format.

Other differences include the number of rounds of interviews each firm conducts, and their preference for using interviewer-led or candidate-led cases. Wherever you interview, it’s vital to do your research and find out what you might be able to expect.

When it comes to getting ready for the case interview, knowing what you will be assessed on, learning how to succeed, and having access to the best practice resources can all go a long way. Now, you need to put in the hard work and prepare! Good luck.

1. The key to landing your consulting job.

Case interviews - where you are asked to solve a business case study under scrutiny - are the core of the selection process right across McKinsey, Bain and BCG (the “MBB” firms). This interview format is also used pretty much universally across other high-end consultancies, including LEK, Kearney, Oliver Wyman and the consulting wings of the “Big Four”.

If you want to land a job at any of these firms, you will have to ace multiple case interviews.

It is increasingly likely that you will also have to solve online cases given by chatbots. You might need to pass these before making it to interview, or be asked to sit them alongside your first round interviews.

Importantly, case studies aren’t something you can just wing . Firms explicitly expect you to have thoroughly prepared and many of your competitors on interview day will have been prepping for months.

Don’t worry though - MCC is here to help!

This article will take you through a full overview of everything you’ll need to know to do well, linking to more detailed articles and resources at each stage to let you really drill down into the details.

As well as traditional case interviews, we’ll also attend to the new formats in which cases are being delivered and otherwise make sure you’re up to speed with recent trends in this overall part of consulting recruitment.

Before we can figure out how to prepare for a case interview, though, we will first have to properly understand in detail what exactly you are up against. What format does a standard consulting case interview take? What is expected of you? How will you be assessed?

Let's dive right in and find out!

Professional help

Before going further, if this sounds like a lot to get your head around on your own, don't worry - help is available!

Our Case Academy course gives you everything you need to know to crack cases like a pro:

Case Academy Course

To put what you learn into practice (and secure some savings in the process) you can add mock interview coaching sessions with experienced MBB consultants:

Coaching options

And, if you just want an experienced consultant to take charge of the whole selection process for you, you can check out our comprehensive mentoring programmes:

Explore mentoring

Now, back to the article!

2. What is a case interview?

Before we can hope to tackle a case interview, we have to understand what one is.

In short, a case interview simulates real consulting work by having you solve a business case study in conversation with your interviewer.

This case study will be a business problem where you have to advise a client - that is, an imaginary business or similar organisation in need of guidance.

You must help this client solve a problem and/or make a decision. This requires you to analyse the information you are given about that client organisation and figure out a final recommendation for what they should do next.

Business problems in general obviously vary in difficulty. Some are quite straightforward and can be addressed with fairly standard solutions. However, consulting firms exist precisely to solve the tough issues that businesses have failed to deal with internally - and so consultants will typically work on complex, idiosyncratic problems requiring novel solutions.

Some examples of case study questions might be:

  • How much would you pay for a banking licence in Ghana?
  • Estimate the potential value of the electric vehicle market in Germany
  • How much gas storage capacity should a UK domestic energy supplier build?

Consulting firms need the brightest minds they can find to put to work on these important, difficult projects. You can expect the case studies you have to solve in interview, then, to echo the unique, complicated problems consultancies deal with every day. As we’ll explain here, this means that you need to be ready to think outside the box to figure out genuinely novel solutions.

2.1. Where are case interviews in the consulting selection process?

Not everyone who applies to a consulting firm will have a case interview - far from it!

In fact, case interviews are pretty expensive and inconvenient for firms to host, requiring them to take consultants off active projects and even fly them back to the office from location for in-person interviews (although this happens less frequently now). Ideally, firms want to cut costs and save time by narrowing down the candidate pool as much as possible before any live interviews.

As such, there are some hoops to jump through before you make it to interview rounds.

Firms will typically eliminate as much as 80% of the applicant pool before interviews start . For most firms, 50%+ of applicants might be cut based on resumes, before a similar cut is made on those remaining based on aptitude tests. McKinsey currently gives their Solve assessment to most applicants, but will use their resulting test scores alongside resumes to cut 70%+ of the candidate pool before interviews.

You'll need to be on top of your game to get as far as a case interview with a top firm. Getting through the resume screen and any aptitude tests is an achievement in itself!

Note, too, that the general timeline of an application can differ depending on a series of factors, including which position you apply for, your background, and the office you are applying to. For example, an undergraduate applying for a Business Analyst position (the entry level job at McKinsey) will most likely be part of a recruitment cycle and as such have pretty fixed dates for the pre-screening test and the first and second round interviews (see more on those below). Conversely, an experienced hire will most likely have a much greater choice of test and interview dates, as well as more time at their disposal to prepare.

For readers not yet embroiled in the selection process themselves, let’s put case interviews in context and take a quick look at each stage in turn. Importantly, note that you might also be asked to solve case studies outside interviews as well…

2.1.1. Application screen

It’s sometimes easy to forget that such a large cut is made at the application stage. At larger firms, this means your resume and cover letter are looked at by some combination of AI tools, recruitment staff and junior consulting staff (often someone from your own university).

Only the best applications will be passed to later stages, so make sure to check out our free resume and cover letter guides, and potentially get help with editing , to give yourself the best chance possible.

2.1.2. Aptitude tests and online cases

This part of the selection process has been changing quickly in recent years and is increasingly beginning to blur into the traditionally separate case interview rounds.

In the past, GMAT or PST style tests were the norm. Firms then used increasingly sophisticated and often gamified aptitude tests, like the Pymetrics test currently used by several firms, including BCG and Bain, and the original version of McKinsey’s Solve assessment (then branded as the Problem Solving Game).

Now, though, there is a move towards delivering relatively sophisticated case studies online. For example, McKinsey has replaced half the old Solve assessment with an online case. BCG’s Casey chatbot case now directly replaces a live first round case interview, and in the new era of AI chatbots, we expect these online cases to quickly become more realistic and increasingly start to relieve firms of some of the costs of live case interviews.

Our consultants collectively reckon that, over time, 50% of case interviews are likely to be replaced with these kinds of cases . We give some specific advice for online cases in section six. However, the important thing to note is that these are still just simulations of traditional case interviews - you still need to learn how to solve cases in precisely the same way, and your prep will largely remain the same.

2.1.3. Rounds of Interviews

Now, let’s not go overboard with talk of AI. Even in the long term, the client facing nature of consulting means that firms will have live case interviews for as long as they are hiring anyone. And in the immediate term, case interviews are still absolutely the core of consulting selection.

Before landing an offer at McKinsey, Bain, BCG or any similar firm, you won’t just have one case interview, but will have to complete four to six case interviews, usually divided into two rounds, with each interview lasting approximately 50-60 minutes .

Being invited to first round usually means two or three case interviews. As noted above, you might also be asked to complete an online case or similar alongside your first round interviews.

If you ace first round, you will be invited to second round to face the same again, but more gruelling. Only then - after up to six case interviews in total - can you hope to receive an offer.

2.2. Differences between first and second round interviews

Despite case interviews in the first and second round following the same format, second/final round interviews will be significantly more intense . The seniority of the interviewer, time pressure (with up to three interviews back-to-back), and the sheer value of the job at stake will likely make a second round consulting case interview one of the most challenging moments of your professional life.

There are three key differences between the two rounds:

  • Time Pressure : Final round case interviews test your ability to perform under pressure, with as many as three interviews in a row and often only very small breaks between them.
  • Focus : Second round interviewers tend to be more senior (usually partners with 12+ years’ experience) and will be more interested in your personality and ability to handle challenges independently. Some partners will drill down into your experiences and achievements to the extreme. They want to understand how you react to challenges and your ability to identify and learn from past mistakes.
  • Psychological Pressure: While case interviews in the first round are usually more focused on you simply cracking the case, second round interviewers often employ a "bad cop" strategy to test the way you react to challenges and uncertainty.

2.3. What skills do case interviews assess?

Reliably impressing your interviewers means knowing what they are looking for. This means understanding the skills you are being assessed against in some detail.

Overall, it’s important always to remember that, with case studies, there are no strict right or wrong answers. What really matters is how you think problems through, how confident you are with your conclusions and how quick you are with the back of the envelope arithmetic.

The objective of this kind of interview isn’t to get to one particular solution, but to assess your skillset. This is even true of modern online cases, where sophisticated AI algorithms score how you work as well as the solutions you generate.

If you visit McKinsey , Bain and BCG web pages on case interviews, you will find that the three firms look for very similar traits, and the same will be true of other top consultancies.

Broadly speaking, your interviewer will be evaluating you across five key areas:

2.3.1. Probing mind

Showing intellectual curiosity by asking relevant and insightful questions that demonstrate critical thinking and a proactive nature. For instance, if we are told that revenues for a leading supermarket chain have been declining over the last ten years, a successful candidate would ask:

“ We know revenues have declined. This could be due to price or volume. Do we know how they changed over the same period? ”

This is as opposed to a laundry list of questions like:

  • Did customers change their preferences?
  • Which segment has shown the decline in volume?
  • Is there a price war in the industry?

2.3.2. Structure

Structure in this context means structuring a problem. This, in turn, means creating a framework - that is, a series of clear, sequential steps in order to get to a solution.

As with the case interview in general, the focus with case study structures isn’t on reaching a solution, but on how you get there.

This is the trickiest part of the case interview and the single most common reason candidates fail.

We discuss how to properly structure a case in more detail in section three. In terms of what your interviewer is looking for at a high level, though, the key pieces of your structure should be:

  • Proper understanding of the objective of the case - Ask yourself: "What is the single crucial piece of advice that the client absolutely needs?"
  • Identification of the drivers - Ask yourself: "What are the key forces that play a role in defining the outcome?"

Our Problem Driven Structure method, discussed in section three, bakes this approach in at a fundamental level. This is as opposed to the framework-based approach you will find in older case-solving material: focusing on going through memorised sequences of steps too often means failing to develop a full understanding of the case and its real key drivers.

At this link, we run through a case to illustrate the difference between a standard framework-based approach and our Problem Driven Structure method.

2.3.3. Problem Solving

You’ll be tested on your ability to identify problems and drivers, isolate causes and effects, demonstrate creativity and prioritise issues. In particular, the interviewer will look for the following skills:

  • Prioritisation - Can you distinguish relevant and irrelevant facts?
  • Connecting the dots - Can you connect new facts and evidence to the big picture?
  • Establishing conclusions - Can you establish correct conclusions without rushing to inferences not supported by evidence?

2.3.4. Numerical Agility

In case interviews, you are expected to be quick and confident with both precise and approximated numbers. This translates to:

  • Performing simple calculations quickly - Essential to solve cases quickly and impress clients with quick estimates and preliminary conclusions.
  • Analysing data - Extract data from graphs and charts, elaborate and draw insightful conclusions.
  • Solving business problems - Translate a real world case to a mathematical problem and solve it.

Our article on consulting math is a great resource here, though the extensive math content in our MCC Academy is the best and most comprehensive material available.

2.3.5. Communication

Real consulting work isn’t just about the raw analysis to come up with a recommendation - this then needs to be sold to the client as the right course of action.

Similarly, in a case interview, you must be able to turn your answer into a compelling recommendation. This is just as essential to impressing your interviewer as your structure and analysis.

Consultants already comment on how difficult it is to find candidates with the right communication skills. Add to this the current direction of travel, where AI will be able to automate more and more of the routine analytic side of consulting, and communication becomes a bigger and bigger part of what consultants are being paid for.

So, how do you make sure that your recommendations are relevant, smart, and engaging? The answer is to master what is known as CEO-level communication .

This art of speaking like a CEO can be quite challenging, as it often involves presenting information in effectively the opposite way to how you might normally.

To get it right, there are three key areas to focus on in your communications:

  • Top down : A CEO wants to hear the key message first. They will only ask for more details if they think that will actually be useful. Always consider what is absolutely critical for the CEO to know, and start with that. You can read more in our article on the Pyramid Principle .
  • Concise : This is not the time for "boiling the ocean" or listing an endless number of possible solutions. CEOs, and thus consultants, want a structured, quick and concise recommendation for their business problem that they can implement immediately.
  • Fact-based : Consultants share CEOs' hatred of opinions based on gut feel rather than facts. They want recommendations based on facts to make sure they are actually in control. Always go on to back up your conclusions with the relevant facts.

Being concise and to the point is key in many areas, networking being one of them. For more detail on all this, check out our full article on delivering recommendations.


3. Types of case interview

While most case interviews share a similar structure, firms will have some differences in the particular ways they like to do things in terms of both the case study and the fit component.

As we’ll see, these differences aren’t hugely impactful in terms of how you prepare. That said, it's always good to know as much as possible about what you will be going up against.

3.1. Different case objectives

A guiding thread throughout this article and our approach in general will be to treat each case as a self-contained problem and not try to pigeonhole it into a certain category. Having said that, there are of course similarities between cases and we can identify certain parameters and objectives.

Broadly speaking, cases can be divided into issue-based cases and strategic decision cases. In the former you will be asked to solve a certain issue, such as declining profits or low productivity, whereas in the latter you will be asked whether your client should or should not do something, such as enter a specific market or acquire another company. The chart below is a good breakdown of these different objectives:

Case Focus

3.2. How do interviewers craft cases?

While interviewers will very likely be given a case bank to choose from by their company, a good number of them will also choose to adapt the cases they are currently working on to a case interview setting. The difference is that the latter cases will be harder to pigeonhole and apply standard frameworks to, so a tailored approach will be paramount.

If you’ve applied for a specific practice or type of consulting - such as operational consulting, for example - it’s very likely that you will receive a case geared towards that particular area alongside a ‘generalist’ consulting case (however, if that’s the case, you will generally be notified). The other main distinction when it comes to case interviews is between interviewer-led and candidate-led.

3.3. Candidate-led cases

Most consulting case interview questions test your ability to crack a broad problem, with a case prompt often going something like:

" How much would you pay to secure the rights to run a restaurant in the British Museum? "

You, as a candidate, are then expected to identify your path to solve the case (that is, provide a structure), leveraging your interviewer to collect the data and test your assumptions.

This is known as a “candidate-led” case interview and is used by Bain, BCG and other firms. From a structuring perspective, it’s easier to lose direction in a candidate-led case as there are no sign-posts along the way. As such, you need to come up with an approach that is both broad enough to cover all of the potential drivers in a case but also tailored enough to the problem you are asked to solve. It’s also up to you to figure out when you need to delve deeper into a certain branch of the case, brainstorm or ask for data. The following case from Bain is an excellent example of how to navigate a candidate-led case.

3.4. Interviewer-led cases

This type of case - employed most famously by McKinsey - is slightly different, with the interviewer controlling the pace and direction of the conversation much more than with other case interviews.

At McKinsey, your interviewer will ask you a set of pre-determined questions, regardless of your initial structure. For each question, you will have to understand the problem, come up with a mini structure, ask for additional data (if necessary) and come to the conclusion that answers the question. This more structured format of case also shows up in online cases by other firms - notably including BCG’s Casey chatbot (with the amusing result that practising McKinsey-style cases can be a great addition when prepping for BCG).

Essentially, these interviewer-led case studies are large cases made up of lots of mini-cases. You still use basically the same method as you would for standard (or candidate-led) cases - the main difference is simply that, instead of using that method to solve one big case, you are solving several mini-cases sequentially.

These cases are easier to follow as the interviewer will guide you in the right direction. However, this doesn’t mean you should pay less attention to structure and deliver a generic framework! Also, usually (but not always!) the first question will ask you to map your approach and is the equivalent of the structuring question in candidate-led cases.

Sometimes, if you’re missing key elements, the interviewer might prompt you in the right direction - so make sure to take those prompts seriously as they are there to help you get back on track (ask for 30 seconds to think on the prompt and structure your approach). Other times - and this is a less fortunate scenario - the interviewer might say nothing and simply move on to the next question. This is why you should put just as much thought (if not more) into the framework you build for interviewer-led cases, as you may be penalized if you produce something too generic or that doesn’t encompass all the issues of the case.

3.5. Case and fit

The standard case interview can be thought of as splitting into two standalone sub-interviews. Thus “case interviews” can be divided into the case study itself and a “fit interview” section, where culture fit questions are asked.

This can lead to a bit of confusion, as the actual case interview component might take up as little as half of your scheduled “case interview”. You need to make sure you are ready for both aspects.

To illustrate, here is the typical case interview timeline:

Case interview breakdown

  • First 15-30 minutes: Fit Interview - with questions assessing your motivation to be a consultant in that specific firm and your traits around leadership and teamwork. Learn more about the fit interview in our in-depth article here .
  • Next 30-40 minutes: Case Interview - solving a case study
  • Last 5 minutes: Fit Interview again - this time focussing on your questions for your interviewer.

Both the Case and Fit interviews play crucial roles in the final hiring decision. There is no “average” taken between case and fit interviews: if your performance is not up to scratch in either of the two, you will not be able to move on to the next interview round or get an offer.

NB: No case without fit

Note that, even if you have only been told you are having a case interview or otherwise are just doing a case study, always be prepared to answer fit questions. At most firms, it is standard practice to include some fit questions in all case interviews, even if there are also separate explicit fit interviews, and interviewers will almost invariably include some of these questions around your case. This is perfectly natural - imagine how odd and artificial it would be to show up to an interview, simply do a case and leave again, without talking about anything else with the interviewer before or after.

3.5.1 Differences between firms

For the most part, a case interview is a case interview. However, firms will have some differences in the particular ways they like to do things in terms of both the case study and the fit component.

3.5.2. The McKinsey PEI

McKinsey brands its fit aspect of interviews as the Personal Experience Interview or PEI. Despite the different name, this is really much the same interview you will be going up against in Bain, BCG and any similar firms.

McKinsey does have a reputation for pushing candidates a little harder with fit or PEI questions , focusing on one story per interview and drilling down further into the specific details each time. We discuss this tendency more in our fit interview article . However, no top end firm is going to go easy on you and you should absolutely be ready for the same level of grilling at Bain, BCG and others. Thus any difference isn’t hugely salient in terms of prep.

3.6. What is different in 2023?

For the foreseeable future, you are going to have to go through multiple live case interviews to secure any decent consulting job. These might increasingly happen via Zoom rather than in person, but they should remain largely the same otherwise.

However, things are changing and the rise of AI in recent months seems pretty much guaranteed to accelerate existing trends.

Even before the explosive development of AI chatbots like ChatGPT that we have seen in recent months, automation was already starting to change the recruitment process.

As we mentioned, case interviews are expensive and inconvenient for firms to run . Ideally, then, firms will try to reduce the number of interviews required for recruitment as far as possible. For many years, tests of various kinds served to cut down the applicant pool and thus the number of interviews. However, these tests had a limited capacity to assess candidates against the full consulting skillset in the way that case interviews do so well.

More recently, though, the development of online testing has allowed for more and more advanced assessments. Top consulting firms have been leveraging screening tests that better and better capture the same skillset as case interviews. Eventually this is converging on automated case studies. We see this very clearly with the addition of the Redrock case to McKinsey’s Solve assessment.

As these digital cases become closer to the real thing, the line between test and case interview blurs. Online cases don’t just reduce the number of candidates to case interview, but start directly replacing them.

Case in point here is BCG’s Casey chatbot . Previously, BCG had deployed less advanced online cases and similar tests to weed out some candidates before live case interviews began. Now, though, Casey actually replaces one first round case interview.

Casey, at time of writing, is still a relatively “basic” chatbot, essentially running through a pre-set script. The WhatsApp-like interface does a lot of work to make it feel like one is chatting to a “real person” - the chatbot itself, though, cannot provide feedback or nudges to candidates as a human interviewer would.

We fully expect that, as soon as BCG and other firms can train a truer AI, these online cases will become more widespread and start replacing more live interviews.

We discuss the likely impacts of advanced AI on consulting recruitment and the industry more broadly in our blog.

Here, though, the real message is that you should expect to run into digital cases as well as traditional case interviews.

Luckily, despite any changes in specific case interview format, you will still need to master the same fundamental skills and prepare in much the same way.

We’ll cover a few ways to help prepare for chatbot cases in section four. Ultimately, though, firms are looking for the same problem solving ability and mindset as a real interviewer. Especially as chatbots get better at mimicking a real interviewer, candidates who are well prepared for case cracking in general should have no problem with AI administered cases.

3.6.1. Automated fit interviews

Analogous to online cases, in recent years there has been a trend towards automated, “one way” fit interviews, with these typically being administered for consultancies by specialist contractors like HireVue or SparkHire.

These are kind of like Zoom interviews, but where the interviewer doesn’t show up. Instead you will be given fit questions to answer and must record your answers via your computer’s webcam. Your responses will then go on to be assessed by an algorithm, scoring both what you say and how you say it.

Again, with advances in AI, it is easy to imagine these automated fit interviews going from fully scripted interactions, where all candidates are asked the same list of questions, to a more interactive experience. Thus, we might soon arrive at a point where you are being grilled on the details of your stories - McKinsey PEI style - but by a bot rather than a human.

We include some tips on this kind of “one way” fit interview in section six here.

4. How to solve cases with the Problem-Driven Structure?

If you look around online for material on how to solve case studies, a lot of what you find will set out framework-based approaches. However, as we have mentioned, these frameworks tend to break down with more complex, unique cases - with these being exactly the kind of tough case studies you can expect to be given in your case interviews.

To address this problem, the MyConsultingCoach team has synthesized a new approach to case cracking that replicates how top management consultants approach actual engagements.

MyConsultingCoach’s Problem Driven Structure approach is a universal problem solving method that can be applied to any business problem , irrespective of its nature.

As opposed to just selecting a generic framework for each case interview, the Problem Driven Structure approach works by generating a bespoke structure for each individual question and is a simplified version of the roadmap McKinsey consultants use when working on engagements.

The canonical seven steps from McKinsey on real projects are simplified to four for case interview questions, as the analysis required for a six-month engagement is somewhat less than that needed for a 45-minute case study. However, the underlying flow is the same (see the method in action in the video below).

Let's zoom in to see how our method actually works in more detail:

4.1. Identify the problem

Identifying the problem means properly understanding the prompt/question you are given, so you get to the actual point of the case.

This might sound simple, but cases are often very tricky, and many candidates irretrievably mess things up within the first few minutes of starting. Often, they won’t notice this has happened until they are getting to the end of their analysis. Then, they suddenly realise that they have misunderstood the case prompt - and have effectively been answering the wrong question all along!

With no time to go back and start again, there is nothing to do. Even if there were time, making such a silly mistake early on will make a terrible impression on their interviewer, who might well have written them off already. The interview is scuppered and all the candidate’s preparation has been for nothing.

This error is so galling as it is so readily avoidable.

Our method prevents this problem by placing huge emphasis on a full understanding of the case prompt. This lays the foundations for success as, once we have identified the fundamental, underlying problem our client is facing, we focus our whole analysis around finding solutions to this specific issue.

Now, some case interview prompts are easy to digest. For example, “Our client, a supermarket, has seen a decline in profits. How can we bring them up?”. However, many of the prompts given in interviews for top firms are much more difficult and might refer to unfamiliar business areas or industries. For example, “How much would you pay for a banking license in Ghana?” or “What would your key areas of concern be when setting up an NGO?”

Don’t worry if you have no idea how you might go about tackling some of these prompts!

In our article on identifying the problem and in our full lesson on the subject in our MCC Academy course, we teach a systematic, four step approach to identifying the problem , as well as running through common errors to ensure you start off on the right foot every time!

This is summarised here:

Four Steps to Identify the Problem

Following this method lets you excel where your competitors mess up and get off to a great start in impressing your interviewer!

4.2. Build your problem driven structure

After you have properly understood the problem, the next step in successfully cracking a case is to draw up a bespoke structure that captures all the unique features of the case.

This is what will guide your analysis through the rest of the case study and is precisely the same method used by real consultants working on real engagements.

Of course, it might be easier here to simply roll out an old-fashioned framework, and a lot of candidates will do so. This is likely to be faster at this stage and requires a lot less thought than our problem-driven structure approach.

However, whilst our problem driven structure approach requires more work from you, our method has the advantage of actually working in the kind of complex case studies where generic frameworks fail - exactly the kind of cases you can expect at an MBB interview.

Since we effectively start from first principles every time, we can tackle any case with the same overarching method. Simple or complex, every case is the same to you and you don’t have to gamble a job on whether a framework will actually work.

4.2.1 Issue trees

Issue trees break down the overall problem into a set of smaller problems that you can then solve individually. Representing this on a diagram also makes it easy for both you and your interviewer to keep track of your analysis.

To see how this is done, let’s look at the issue tree below breaking down the revenues of an airline:

Frame the Airline Case Study

These revenues can be segmented as the number of customers multiplied by the average ticket price. The number of customers can be further broken down into a number of flights multiplied by the number of seats, times average occupancy rate. The node corresponding to the average ticket price can then be segmented further.
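
To make this segmentation concrete, here is a quick illustration with entirely made-up numbers (any real airline’s figures will of course differ):

Revenue = Flights x Seats per flight x Occupancy rate x Average ticket price
= 100,000 x 180 x 80% x $150 ≈ $2.2bn per year

Laying the tree out numerically like this makes it easy to see which lever - more flights, fuller planes or higher prices - moves revenue the most.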

4.2.2 Hypothesis trees

Hypothesis trees are similar, the only difference being that, rather than just trying to break up the issue into smaller issues, you are assuming that the problem can be solved and formulating solutions.

In the example above, you would assume revenues can be increased by either increasing the average ticket price or the number of customers . You can then hypothesize that you can increase the average occupancy rate in three ways: align the schedule of short and long haul flights, run a promotion to boost occupancy in off-peak times, or offer early bird discounts.

Frame the Airline Case Study Hypothesis

4.2.3 Other structures: structured lists

Structured lists are simply subcategories of a problem into which you can fit similar elements. This McKinsey case answer starts off by identifying several buckets such as retailer response, competitor response, current capabilities and brand image and then proceeds to consider what could fit into these categories.

Buckets can be a good way to start the structure of a complex case but when using them it can be very difficult to be MECE and consistent, so you should always aim to then re-organize them into either an issue or a hypothesis tree.

It is worth noting that the same problem can be structured in multiple valid ways by choosing different means to segment the key issues. Ultimately, all these lists are methods to set out a logical hierarchy among elements.

4.2.4 Structures in practice

That said, not all valid structures are equally useful in solving the underlying problem. A good structure fulfils several requirements - including MECE-ness, level consistency, materiality, simplicity, and actionability. It’s important to put in the time to master segmentation, so you can choose a scheme that is not only valid, but actually useful in addressing the problem.

After taking the effort to identify the problem properly, an advantage of our method is that it will help ensure you stay focused on that same fundamental problem throughout. This might not sound like much, but many candidates end up getting lost in their own analysis, veering off on huge tangents and returning with an answer to a question they weren’t asked.

Another frequent issue - particularly with certain frameworks - is that candidates finish their analysis and, even if they have successfully stuck to the initial question, they have not actually reached a definite solution. Instead, they might simply have generated a laundry list of pros and cons, with no clear single recommendation for action.

Clients employ consultants for actionable answers, and this is what is expected in the case interview. The problem driven structure excels in ensuring that everything you do is clearly related back to the key question in a way that will generate a definitive answer. Thus, the problem driven structure builds in the hypothesis driven approach so characteristic of real consulting practice.

You can learn how to set out your own problem driven structures in our article here and in our full lesson in the MCC Academy course.

4.3. Lead the analysis

A problem driven structure might ensure we reach a proper solution eventually, but how do we actually get there?

We call this step " leading the analysis ", and it is the process whereby you systematically navigate through your structure, identifying the key factors driving the issue you are addressing.

Generally, this will mean continuing to grow your tree diagram, further segmenting what you identify as the most salient end nodes and thus drilling down into the most crucial factors causing the client’s central problem.

Once you have gotten right down into the detail of what is actually causing the company’s issues, solutions can then be generated quite straightforwardly.

To see this process in action, we can return to our airline revenue example:

Lead the analysis for the Airline Case Study

Let’s say we discover the average ticket price to be a key issue in the airline’s problems. Looking closer at the drivers of average ticket price, we find that the problem lies with economy class ticket prices. We can then further segment that price into the base fare and additional items such as food.

Having broken down the issue to such a fine-grained level and considering the 80/20 rule (see below), solutions occur quite naturally. In this case, we can suggest incentivising the crew to increase onboard sales, improving the assortment in the plane, or offering discounts for online purchases.

Our article on leading the analysis is a great primer on the subject, with our video lesson in the MCC Academy providing the most comprehensive guide available.

4.4. Provide recommendations

So you have a solution - but you aren’t finished yet!

Now, you need to deliver your solution as a final recommendation.

This should be done as if you are briefing a busy CEO and thus should be a one minute, top-down, concise, structured, clear, and fact-based account of your findings.

The brevity of the final recommendation belies its importance. In real life consulting, the recommendation is what the client has potentially paid millions for - from their point of view, it is the only thing that matters.

In a case interview, your performance in this final summing up of your case is going to significantly colour your interviewer’s parting impression of you - and thus your chances of getting hired!

So, how do we do it right?

Barbara Minto's Pyramid Principle elegantly sums up almost everything required for a perfect recommendation. The answer comes first , as this is what is most important. This is then supported by a few key arguments , which are in turn buttressed by supporting facts .

Across the whole recommendation, the goal isn’t to just summarise what you have done. Instead, you are aiming to synthesize your findings to extract the key "so what?" insight that is useful to the client going forward.

All this might seem like common sense, but it is actually the opposite of how we relay results in academia and other fields. There, we typically move from data, through arguments and eventually to conclusions. As such, making good recommendations is a skill that takes practice to master.

We can see the Pyramid Principle illustrated in the diagram below:

The Pyramid principle often used in consulting

To supplement the basic Pyramid Principle scheme, we suggest candidates add a few brief remarks on potential risks and suggested next steps . This helps demonstrate the ability for critical self-reflection and lets your interviewer see you going the extra mile.
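
To make this concrete, here is an illustrative one-minute recommendation for the airline example used earlier (the figures are invented purely for illustration):

“I recommend we focus on raising ancillary revenue in economy class, which we estimate could add around $50m per year. Three facts support this: first, our analysis shows the revenue shortfall is concentrated in economy class; second, the base fare is broadly in line with competitors, so the gap lies in onboard and ancillary sales; third, quick wins such as crew incentives, a better onboard assortment and online purchase discounts require little capital. The main risk is a negative impact on customer experience, so as a next step I would pilot the new assortment on a small set of routes and measure both sales and satisfaction.”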

The combination of logical rigour and communication skills that is so definitive of consulting is particularly on display in the final recommendation.

Despite it only lasting 60 seconds, you will need to leverage a full set of key consulting skills to deliver a really excellent recommendation and leave your interviewer with a good final impression of your case solving abilities.

Our specific article on final recommendations and the specific video lesson on the same topic within our MCC Academy are great, comprehensive resources. Beyond those, our lesson on consulting thinking and our articles on MECE and the Pyramid Principle are also very useful.

4.5. What if I get stuck?

Naturally, with case interviews being difficult problems, there may be times when you’re unsure what to do or which direction to take. The most common scenario is that you will get stuck midway through the case, and there are essentially two things that you should do:

  • 1. Go back to your structure
  • 2. Ask the interviewer for clarification

Your structure should always be your best friend - after all, this is why you put so much thought and effort into it: if it’s MECE it will point you in the right direction. This may seem abstract but let’s take the very simple example of a profitability case interview: if you’ve started your analysis by segmenting profit into revenue minus costs and you’ve seen that the cost side of the analysis is leading you nowhere, you can be certain that the declining profit is due to a decline in revenue.

Similarly, when you’re stuck on the quantitative section of the case interview, make sure that your framework for calculations is set up correctly (you can confirm this with the interviewer) and see what it is you’re trying to solve for. For example, if you’re trying to find the price at which the client should sell their new t-shirt in order to break even on their investment, you should realize that what you’re looking for is the break-even point, so you can start by calculating either the costs or the revenues. You have all the data for the costs side and you know they’re trying to sell 10,000 t-shirts, so you can simply set up the equation with x being the price.

As we’ve emphasised on several occasions, your case interview will be a dialogue. As such, if you don’t know what to do next or don’t understand something, make sure to ask the interviewer (and as a general rule always follow their prompts, as they are trying to help, not trick you). This is especially true for the quantitative questions, where you should really understand what data you’re looking at before you jump into any calculations. Ideally you should ask your questions before you take time to formulate your approach, but don’t be afraid to ask for further clarification if you really can’t make sense of what’s going on.

It’s always good to walk your interviewer through your approach before you start doing the calculations, and it’s no mistake to make sure that you both have the same understanding of the data. For example, when confronted with the chart below, you might ask what GW (in this case gigawatt) means from the get-go and ask to confirm the different metrics (i.e. whether 1 GW = 1000 megawatts). You will never be penalised for asking a question like that.

Getting stuck

5. What to remember in case interviews

If you’re new to case cracking you might feel a bit hopeless when you see a difficult case question, not having any idea where to start.

In fact though, cracking case interviews is much like playing chess. The rules you need to know to get started are actually pretty simple. What will make you really proficient is time and practice.

In this section, we’ll run through a high level overview of everything you need to know, linking to more detailed resources at every step.

5.1. An overall clear structure

You will probably hear this more than you care for, but it is the most important thing to keep in mind as you start solving cases: not only is it a key evaluation criterion, it is also the greatest tool you will have at your disposal. The ability to build a clear structure in all aspects of the case interview will be the difference between breezing through a complicated case and struggling at its every step. Let’s look a bit closer at the key areas where you should be structured!

5.1.1 Structured notes

Every case interview starts with a prompt, usually verbal, and as such you will have to take some notes. And here is where your foray into structure begins, as the notes you take should be clear, concise and structured in a way that will allow you to repeat the case back to the interviewer without writing down any unnecessary information.

This may sound very basic but you should absolutely not be dismissive about it: taking clear and organized notes will keep you on track throughout the case. What we found helps is to have separate sections for:

  • The case brief
  • Follow-up questions and answers
  • Numerical data
  • Case structure (the most crucial part when solving the case)
  • Any scrap work during the case (usually calculations)

When solving the case - or, as we call it here, in the Lead the Analysis step - it is highly recommended to keep feeding and integrating your structure, so that you never get lost. Maintaining a clear high level view is one of the most critical aspects in case interviews, as it is a key skill in consulting: by constantly keeping track of where you are in your structure, you’ll never lose your focus on the end goal.

In the case of an interviewer-led case, you can also have separate sheets for each question (e.g. Question 1. What factors can we look at that drive profitability?). If you develop a system like this you’ll know exactly where to look for each point of data rather than rummage around in untidy notes. There are a couple more sections that you may have, depending on preference - we’ll get to these in the next sections.

5.1.2 Structured communication

There will be three main types of communication in cases:

  • 1. Asking and answering questions
  • 2. Walking the interviewer through your structure (either the case or calculation framework - we’ll get to that in a bit!)
  • 3. Delivering your recommendation

Asking and answering questions will be the most common of these and the key thing to do before you speak is ask for some time to collect your thoughts and get organised. What you want to avoid is a ‘laundry list’ of questions or anything that sounds too much like a stream of consciousness.

Different systems work for different candidates but a sure-fire way of being organised is numbering your questions and answers. So rather than saying something like ‘I would like to ask about the business model, operational capacity and customer personas’ it’s much better to break it down and say something along the lines of ‘I’ve got three key questions. Firstly I would like to inquire into the business model of our client. Secondly I would like to ask about their operational capacity. Thirdly I would like to know more about the different customer personas they are serving’.

A similar principle should be applied when walking the interviewer through your structure, and this is especially true of online case interviews (more and more frequent now) when the interviewer can’t see your notes. Even if you have your branches or buckets clearly defined, you should still use a numbering system to make it obvious to the interviewer. So, for example, when asked to identify whether a company should make an acquisition, you might say ‘I would like to examine the following key areas. Firstly, the financial aspects of this issue; secondly, the synergies; and thirdly, the client’s expertise’.

The recommendation should be delivered top-down (see section 4.4 for specifics) and should employ the same numbering principle. To do so in a speedy manner, you should circle or mark the key facts that you encounter throughout the case so you can easily pull them out at the end.

5.1.3 Structured framework

It’s very important that you have a systematic approach - or framework - for every case. Let’s get one thing straight: there is a difference between having a problem-solving framework for your case and trying to force a case into a predetermined framework. Doing the former is an absolute must , whilst doing the latter will most likely have you unceremoniously dismissed.

We have seen there are several ways of building a framework, from identifying several categories of issues (or ‘buckets’) to building an issue or hypothesis tree (which is the most efficient type of framework). For the purpose of organization, we recommend having a separate sheet for the framework of the case, or, if it’s too much to manage, you can have it on the same sheet as the initial case prompt. That way you’ll have all the details as well as your proposed solution in one place.

5.1.4 Structured calculations

Whether it’s interviewer or candidate-led, at some point in the case you will get a bunch of numerical data and will have to perform some calculations (for the specifics of the math you’ll need in consulting interviews, have a look at our Consulting Math Guide). Here’s where we urge you to take your time and not dive straight into calculating! And here’s why: while your numerical agility is sure to impress interviewers, what they’re actually looking for is your logic and the calculations you need to perform in order to solve the problem. So it’s ok if you make a small mistake, as long as you’re solving for the right thing.

As such, make it easy for them - and yourself. Before you start, write down in steps the calculations you need to perform. Here’s an example: let’s say you need to find out by how much profits will change if variable costs are reduced by 10%. Your approach should look something like:

  • 1. Calculate current profits: Profits = Revenues - (Variable costs + Fixed costs)
  • 2. Calculate the reduction in variable costs: Variable costs x 0.9
  • 3. Calculate new profits: New profits = Revenues - (New variable costs + Fixed costs)
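
To see how that plan might play out, here is a quick worked example with made-up numbers (Revenues = $100m, Variable costs = $40m, Fixed costs = $30m):

  • 1. Current profits = 100 - (40 + 30) = $30m
  • 2. New variable costs = 40 x 0.9 = $36m
  • 3. New profits = 100 - (36 + 30) = $34m

So a 10% reduction in variable costs increases profits by $4m, or roughly 13%.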

Of course, there may be more efficient ways to do that calculation, but what’s important - much like in the framework section - is to show your interviewer that you have a plan, in the form of a structured approach. You can write your plan on the sheet containing the data, then perform the calculations on a scrap sheet and fill in the results afterward.

5.2. Common business knowledge and formulas

Although some consulting firms claim they don’t evaluate candidates based on their business knowledge, familiarity with basic business concepts and formulae is very useful in terms of understanding the case studies you are given in the first instance and drawing inspiration for structuring and brainstorming.

If you are coming from a business undergrad, an MBA or are an experienced hire, you might well have this covered already. For those coming from a different background, it may be useful to cover some of the basics.

Luckily, you don’t need a degree-level understanding of business to crack case interviews , and a lot of the information you will pick up by osmosis as you read through articles like this and go through cases.

However, some things you will just need to sit down and learn. We cover everything you need to know in some detail in our Case Academy course. Some examples of things you need to learn are:

  • Basic accounting (particularly how to understand all the elements of a balance sheet)
  • Basic economics
  • Basic marketing
  • Basic strategy

Below we include a few elementary concepts and formulae so you can hit the ground running in solving cases. We should note that you should not memorise these and indeed a good portion of them can be worked out logically, but you should have at least some idea of what to expect as this will make you faster and will free up much of your mental computing power. In what follows we’ll tackle concepts that you will encounter in the private business sector as well as some situations that come up in cases that feature clients from the NGO or governmental sector.

5.2.1 Business sector concepts

These concepts are the bread and butter of almost any business case so you need to make sure you have them down. Naturally, there will be specificities and differences between cases but for the most part here is a breakdown of each of them.

5.2.1.1. Revenue

The revenue is the money that the company brings in and is usually equal to the number of products they sell multiplied by the price per item. It can be expressed with the following equation:

Revenue = Volume x Price

Companies may have various sources of revenue or indeed multiple types of products, all priced differently, which is something you will need to account for in your case interview. Let’s consider some situations. A clothing company such as Nike will derive most of their revenue from the number of products they sell times the average price per item. Conversely, for a retail bank, revenue is measured as the volume of loans multiplied by the interest rate at which the loans are given out. As we’ll see below, we might consider primary revenues and ancillary revenues: in the case of a football club, we might calculate primary revenues by multiplying the number of tickets sold by the average ticket price, and ancillary revenues as those coming from sales of merchandise (similarly, let’s say average t-shirt price times the number of t-shirts sold), TV rights and sponsorships.
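
As a quick illustration of the football club example, with made-up numbers:

Primary revenues = 20 home games x 30,000 tickets x $25 average price = $15m
Ancillary revenues = 100,000 t-shirts x $40 = $4m, plus TV rights and sponsorship deals

The exact split will vary from club to club, but breaking revenue out like this is exactly what you would do at the structuring stage of a case.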

These are but a few examples and another reminder that you should always aim to ask questions and understand the precise revenue structure of the companies you encounter in cases.

5.2.1.2. Costs

The costs are the expenses that a company incurs during its operations. Generally, they can be broken down into fixed and variable costs :

Costs = Fixed Costs + Variable Costs

As their name implies, fixed costs do not change based on the number of units produced or sold. For example, if you produce shoes and are renting the space for your factory, you will have to pay the rent regardless of whether you produce one pair or 100. On the other hand, variable costs depend on the level of activity, so in our shoe factory example they would be equivalent to the materials used to produce each pair of shoes and would increase the more we produce.
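
As a quick illustration for the shoe factory, with made-up numbers:

Costs = $5,000 monthly rent (fixed) + $20 of materials x 1,000 pairs (variable) = $25,000 per month

If production doubled to 2,000 pairs, the variable portion would double to $40,000 while the rent would stay at $5,000.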

These concepts are of course guidelines used in order to simplify the analysis in cases, and you should be aware that in reality often the situation can be more complicated. However, this should be enough for case interviews. Costs can also be quasi-fixed, in that they increase marginally with volume. Take the example of a restaurant which has a regular staff, incurring a fixed cost but during very busy hours or periods they also employ some part-time workers. This cost is not exactly variable (as it doesn’t increase with the quantity of food produced) but also not entirely fixed, as the number of extra hands will depend on how busy the restaurant is. Fixed costs can also be non-linear in nature. Let’s consider the rent in the same restaurant: we would normally pay a fixed amount every month, but if the restaurant becomes very popular we might need to rent out some extra space so the cost will increase. Again, this is not always relevant for case interviews.

5.2.1.3. Profit and profit margin

The profit is the amount of money a company is left with after it has paid all of its expenses and can be expressed as follows:

Profit = Revenue - Costs

It’s very likely that you will encounter a profitability issue in one of your case interviews, namely you will be asked to increase a company’s profit. There are two main ways of doing this: increasing revenues and reducing costs , so these will be the two main areas you will have to investigate. This may seem simple but what you will really need to understand in a case are the key drivers of a business (and this should be done through clarifying questions to the interviewer - just as a real consultant would question their client).

For example, if your client is an airline you can assume that the main source of revenue is ticket sales, but you should inquire how many types of tickets the specific airline sells. You may naturally consider economy and business class tickets, but you may find out that there is a more premium option - such as first class - and several in-between options. Similarly to our football club example, there may be ancillary revenues from sales of food and beverages as well as from advertising certain products or services on flights.

You may also come across the profit margin in case interviews. This is simply the profit expressed as a percentage of revenue and can be written as follows:

Profit margin = Profit/Revenue x 100
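For example, a company with $10 million in revenue and $2 million in profit has a profit margin of 2/10 x 100 = 20%.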

5.2.1.4. Break-even point

An ancillary concept to profit, the break-even point is the point at which revenues equal costs, making the profit zero. It can be expressed as the following equation:

Revenues = Costs = Fixed costs + Variable costs

This formula will be useful when you are asked questions such as ‘What is the minimum price at which I should sell product X?’ or ‘What quantity do I need to sell in order to recoup my investment?’. Let’s say that in a case interview the owner of a sandwich store asks us to figure out how many salami and cheese sandwiches she needs to sell in order to break even. She’s spending $4 on salami and $2 on cheese and lettuce per sandwich, and believes she can sell the sandwiches at around $7. The cost of utilities and personnel is around $5,000 per month. We could lay this all out in the break-even equation:

7 x Q (price x quantity) = (4 + 2) x Q + 5,000 (variable + fixed costs)

Solving gives 7Q - 6Q = 5,000, so she needs to sell Q = 5,000 sandwiches per month to break even.

In a different scenario, we may be asked to calculate the break-even price . Let’s consider our sandwich example and say our owner knows she has enough ingredients for about 5000 sandwiches per month but is not sure how much to sell them for. In that case, if we know our break-even equation, we can simply make the following changes:

P (price) x 5,000 = (4 + 2) x 5,000 + 5,000

By solving the equation (P x 5,000 = 30,000 + 5,000 = 35,000) we get a price of $7 per sandwich.

5.2.1.5. Market share and market size

We can also consider the market closely with profit, as in fact the company’s performance in the market is what drives profits. The market size is the total number of potential customers for a certain business or product, whereas the market share is the percentage of that market that your business controls (or could control, depending on the case).

There is a good chance you will have to estimate the market size in one of your case interviews and we get into more details on how to do that below. You may be asked to estimate this in either number of potential customers or total value . The latter simply refers to the number of customers multiplied by the average value of the product or service.

To calculate the market share you will have to divide the company’s sales by the total market size and multiply by 100:

Market share = Company sales/Total market size x 100

Note, though, that learning the very basics of business is the beginning rather than the end of your journey. Once you are able to “speak business” at a rudimentary level, you should try to “become fluent” and immerse yourself in reading/viewing/listening to as wide a variety of business material as possible, getting a feel for all kinds of companies and industries - and especially the kinds of problems that can come up in each context and how they are solved. The material put out by the consulting firms themselves is a great place to start, but you should also follow the business news and find out about different companies and sectors as much as possible between now and interviews. Remember, if you’re going to be a consultant, this should be fun rather than a chore!

5.3 Public sector and NGO concepts

As we mentioned, there will be some cases (see section 6.6 for a more detailed example) where the key performance indicators (or KPIs for short) are not connected to profit. The most common ones involve the government of a country or an NGO, but they can be far more diverse and require more thought and application of first principles. We have laid out a couple of the key concepts or KPIs that come up below.

5.3.1 Quantifiability

In many such scenarios you will be asked to make an important strategic decision of some kind or to optimise a process. Of course these tasks are not restricted to non-private-sector cases, but this is where they really come into their own, as there can be great variation in the type of decision and the fields involved.

While there may be no familiar business concepts to anchor yourself to, one concept that is essential is quantifiability. However qualitative the decision might seem, consultants rely on data, so you should always aim to frame aspects of a decision in ways that can be quantified, even if the data doesn’t present itself in a straightforward manner.

Let’s take a practical example. Your younger sibling asks you to help them decide which university they should choose if they want to study engineering. One way to structure your approach would be to segment the problem into factors affecting your sibling’s experience at university and experience post-university. Within the ‘at uni’ category you might think about the following:

  • Financials : How much are tuition costs and accommodation costs?
  • Quality of teaching and research : How are possible universities ranked in the QS guide based on teaching and research?
  • Quality of resources : How well stocked is their library, are the labs well equipped etc.?
  • Subject ranking : How is engineering at different unis ranked?
  • Life on campus and the city : What are the living costs in the city where the university is based? What are the extracurricular opportunities and would your sibling like to live in that specific city based on them?

Within the ‘out of uni’ category you might think about:

  • Exit options : What are the fields in which your sibling could be employed and how long does it take the average student of that university to find a job?
  • Alumni network : What percentage of alumni are employed by major companies?
  • Signal : What percentage of applicants from the university get an interview in major engineering companies and related technical fields?

You will perhaps notice that all the buckets discussed pose quantifiable questions meant to provide us with the data necessary to make a decision. There is no point in asking ‘Which university has the nicest teaching staff?’, as that is a very subjective metric.

5.3.2 Impact

Another key concept to consider when dealing with sectors other than the private one is how impactful a decision or a line of inquiry is on the overarching issue , or whether all our branches in our issue tree have a similar impact. This can often come in the form of impact on lives, such as in McKinsey’s conservation case discussed below, namely how many species can we save with our choice of habitat.

5.4 Common consulting concepts

Consultants use basic business concepts on an everyday basis, as these help them articulate their frameworks for problems. However, they also use some consulting-specific tools to quality-check their analysis and work in the most efficient way possible. These principles can be applied to all aspects of a consultant’s work, but for brevity we can say they mostly shape a consultant’s systematic approach and communication - two very important things that are also tested in case interviews. Therefore, it’s imperative that you not only get to know them, but learn how and when to use them, as they are at the very core of good casing. They are MECE-ness, the Pareto Principle and the Pyramid Principle, and are explained briefly below - you should, however, go on to study them in depth in their respective articles.

5.4.1 MECE

Perhaps the central pillar of all consulting work and an invaluable tool for solving cases, MECE stands for Mutually Exclusive and Collectively Exhaustive. It can refer to almost any aspect of a case but is most often used when talking about structure. We have a detailed article explaining the concept here, but the short version is that MECE-ness ensures that there is no overlap between elements of a structure (the Mutually Exclusive component) and that the structure covers all the drivers or areas of a problem (Collectively Exhaustive). It is a concept that can be applied to any segmentation that divides a set into subsets which cover it wholly but do not overlap.

Let’s take a simple example and then a case framework example. In simple terms, when we are asked to break down the set ‘cars’ into subsets, dividing cars into ‘red cars’ and ‘sports cars’ is neither mutually exclusive (as there are indeed red sports cars) nor exhaustive of the whole set (there are also, say, yellow non-sports cars that are not covered by this segmentation). A MECE way to segment would be ‘cars produced before 2000’ and ‘cars produced in or after 2000’, as this segmentation allows for no overlap and covers all the cars in existence.

Dividing cars is simple enough, but how can we ensure MECE-ness in a case interview, i.e. in a business situation? While the same principles apply, a good tip for ensuring that your structure is MECE is to think about all the stakeholders - i.e. everyone a specific venture involves.

Let’s consider that our client is a soda manufacturer who wants to move from a business-to-business strategy, i.e. selling to large chains of stores and supermarkets, to a business-to-consumer strategy where it sells directly to consumers. In doing so they would like to retrain part of their account managers as direct salespeople and need to know what factors to consider.

A stakeholder-driven approach would be to consider the workforce and customers and move further down the issue tree, thinking about individual issues that might affect them. In the case of the workforce, we might consider how the shift would affect their workload and whether it takes their skillset into account. As for the customers, we might wonder whether existing customers would be satisfied with this move: will the remaining B2B account managers be able to provide for the needs of all their clients and will the fact that the company is selling directly to consumers now not cannibalise their businesses? We see how by taking a stakeholder-centred approach we can ensure that every single perspective and potential issue arising from it is fully covered.

5.4.2 The Pareto Principle

Also known as the 80/20 rule, this principle is important when gauging the impact of a decision or a factor in your analysis. It simply states that in business (but not only) 80% of outcomes come from 20% of causes. What this means is you can make a few significant changes that will impact most of your business organisation, sales model, cost structure etc.

Let’s have a look at 3 quick examples to illustrate this:

  • 80% of all accidents are caused by 20% of drivers
  • 20% of a company’s products account for 80% of the sales
  • 80% of all results in a company are driven by 20% of its employees

The 80/20 rule is a very good guideline in real engagements as well as in case interviews, as it essentially points to the easiest and most straightforward way of doing things. Let’s say one of the questions in a case asks you to come up with an approach to understand the appeal of a new beard trimmer. Obviously you can’t interview the whole male population, so you might think about setting up a webpage and asking people to comment with their thoughts. But what you would get is a laundry list of data that is difficult to sift through.

Using an 80/20 approach you would segment the population based on critical factors (age groups, grooming habits etc.) and then approach a significant sample size of each (e.g. 20), analysing the data and reaching a conclusion.

5.4.3 The Pyramid Principle

This principle refers to organising your communication in a top-down, efficient manner. While it is generally applicable, the Pyramid Principle will most often be employed when delivering the final recommendation to your client. As the name implies, you organise your recommendation (and communication in general) as a pyramid: state the conclusion or most important element at the top, then go down the pyramid listing three supporting arguments, and then further (ideally also three) arguments supporting each of those.

Let’s look at this in practice in a case interview context: your client is a German air-conditioning unit manufacturer looking to expand into the French market. However, after your analysis you’ve determined that the market share they were hoping to capture would not be feasible. A final recommendation using the Pyramid Principle would sound something like this: ‘I recommend that we do not enter the French market for the following three reasons. Firstly, the market is too small for our ambitions of $50 million. Secondly, the market is heavily concentrated, being controlled by three major players, and our 5-year goal would amount to controlling 25% of the market, a share larger than that of any of these players. Thirdly, the alternative of going into the corporate market would not be feasible, as it has high barriers to entry.’ Then, if needed, we could delve deeper into each of our categories.

6. Case examples or building blocks?

As we mentioned before, in your case interview preparation you will undoubtedly find preparation resources that claim that there are several standard types of cases and that there is a general framework that can be applied to each type of case. While there are indeed cases that are straightforward at least in appearance and seemingly invite the application of such frameworks, the reality is never that simple and cases often involve multiple or more complicated components that cannot be fitted into a simple framework.

At MCC we don’t want you to get into the habit of trying to identify which case type you’re dealing with and pull out a framework, but we do recognize that there are recurring elements in frameworks that are useful - such as the profitability of a venture (with its revenues and costs), the valuation of a business, estimating and segmenting a market and pricing a product.

We call these building blocks because they can be used to build case frameworks but are not a framework in and of themselves, and they can be shuffled around and rearranged in any way necessary to be tailored to our case. Hence, our approach is not to make you think in terms of case types but work from first principles and use these building blocks to build your own framework. Let’s take two case prompts to illustrate our point.

The first is from the Bain website, where the candidate is asked whether they think it’s a good idea for their friend to open a coffee shop in Cambridge UK (see the case here ). The answer framework provided here is a very straightforward profitability analysis framework, examining the potential revenues and potential costs of the venture:

Profitability framework

While this is a good starting point for your case interview (especially taken together with the clarifying questions), we notice that this approach needs more tailoring to the case - for example, the quantity of coffee will be determined by the market of coffee drinkers in Cambridge, which we have to size based on preferences. We are in England, so a lot of people will be drinking tea, but we are also in a university town, so perhaps more people than average drink coffee, as it provides a better boost when studying. These are some much-needed case-tailored hypotheses that we can build on top of the initial approach.

Just by looking at this case we might be tempted to say that we can take a generic profitability framework and apply it without any issues. However, that generic framework is just a starting point, and in reality we would need to tailor it much further, in the way we started to above, in order to get to a satisfactory answer. For example, the framework for this specific case interview doesn’t cover aspects such as the client’s expertise: does the friend have any knowledge of the coffee business, such as where to source coffee and how to prepare it? We could also argue that there may be some legal factors to consider, such as approvals needed from the city council to run a coffee shop on the site, or specific trade licences - none of which is really covered by the basic profitability framework.

Let’s take a different case, however, from the McKinsey website. In this scenario, the candidate is asked to identify the factors relevant to choosing where to focus the client’s conservation efforts. We immediately realise that this case doesn’t lend itself to any pre-packaged framework and that we will need to come up with something from scratch - take a look at McKinsey’s suggested answer for the areas to focus on:

Conservation case

We notice immediately that this framework is 100% tailored to the case. Of course there are elements which we encounter in other cases, such as costs and risks, but again these are applied in an organic way. It’s pretty clear that while no standard framework would work for this case, the aforementioned concepts - costs and risks - and the way to approach them (a.k.a. building blocks) are fundamentally similar across cases (with the obvious specificities of each case).

In what follows, we’ll give a brief description of each building block starting from the Bain example discussed previously, in order to give you a general idea of what they are and their adaptability, but you should make sure to follow the link to the in-depth articles to learn all their ins and outs.

6.1 Estimates and segmentation

This building block will come into play mostly when you’re thinking about the market for a certain product (but make sure to read the full article for more details). Let’s take our Bain Cambridge coffee example. As we mentioned under the quantity bucket we need to understand what the market size for coffee in Cambridge would be - so we can make an estimation based on segmentation .

The key to a good estimation is the ability to logically break down the problem into more manageable pieces. This will generally mean segmenting a wider population to find a particular target group. We can start with the population of Cambridge, which we estimate at 100,000. In reality the population is closer to 150,000, but that doesn’t matter - the estimation has to be reasonable, not accurate, so unless the interviewer gives you a reason to reconsider you can follow your instinct. We can divide that population into people who do and don’t drink coffee. Given our arguments above, we can assume that 80% of them, i.e. 80,000 people, drink coffee. Then we can further segment into those who drink regularly - say every day - and those who drink occasionally - say once a week. Based on the earlier assumptions about the student population needing coffee to function, and with Cambridge having a large student population, we can assume that 80% of coffee drinkers are regular drinkers, which gives us 64,000 regular drinkers and 16,000 occasional drinkers. We can then decide whom we want to target and what our strategy needs to be:

Coffee segmentation

This type of estimation and segmentation can be applied to any case specifics - hence why it is a building block.

6.2 Profitability

We have already had several looks at this building block (see an in-depth look here), as it shows up in most case interview scenarios, since profit is a key element of any company’s strategy. As we have seen, the starting point of this analysis is to consider both the costs and revenues of a company, and to determine whether revenues need to be improved or costs need to be lowered. In the coffee example, the revenues are dictated by the average price per coffee x the number of coffees sold, whereas costs can be split into fixed and variable.

Some examples of fixed costs would be the rent for the store and the cost of personnel and utilities, while the most obvious variable costs would be the coffee beans used and the takeaway containers (when needed). We may further split revenues in this case into main revenues - i.e. the sales of coffee - and ancillary revenues, which can be divided into sales of food products (pastries, sandwiches etc., each with the same price x quantity schema) and revenues from events - i.e. renting out the coffee shop for events and catering for those events. Bear in mind that revenues will be heavily influenced by the penetration rate, i.e. the share of the market which we can capture.

6.3 Pricing

Helping a company determine how much they should charge for their goods or services is another theme that comes up frequently in cases. While it may seem less complicated than the other building blocks, we assure you it’s not - you will have to understand and consider several factors, such as the costs a company is incurring, their general strategic positioning, availability, market trends as well as the customers’ willingness to pay (or WTP in short) - so make sure to check out our in-depth guide here .

Pricing Basics

In our example, we may determine that the cost per cup (coffee beans, staff, rent) is £1. We want to be student-friendly, so we should consider how much students would be willing to pay for a coffee as well as how much our competitors are charging. Based on those factors, it would be reasonable to charge on average £2 per cup of coffee. It’s true that our competitors are charging £3, but they are targeting mostly the adult market, whose willingness to pay is higher, so their pricing model takes that into account as well as the lower volume of customers in that demographic.

6.4. Valuation

A variant of the pricing building block, a valuation problem generally asks the candidate to determine how much a client should pay for a specific company (the target of an acquisition) as well as what other factors to consider. The two most important factors (but not the only ones - for a comprehensive review see our Valuation article ) to consider are the net present value (in consulting interviews usually in perpetuity) and the synergies .

In short, the net present value of a company is the cash flow it currently brings in, divided by a rate capturing how much that cash flow is worth less in the future, and can be represented with the equation below:

Net Present Value
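A common way to write this, in the perpetuity form usually assumed in case interviews, is: Net Present Value = Annual cash flow / (Discount rate - Growth rate), which reduces to Annual cash flow / Discount rate when no growth is assumed.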

The synergies refer to what could be achieved should the companies operate as one, and can be divided into cost and revenue synergies .

Let’s expand our coffee example a bit to understand these. Imagine that our friend manages to open a chain of coffee shops in Cambridge and in the future considers acquiring a chain of take-out restaurants. The most straightforward example of revenue synergies would be cross-selling, in this case selling coffee in the restaurants as well as in the dedicated stores, and thus getting an immediate boost in market share by using the existing customers of the restaurant chain. A cost synergy would be merging the delivery services of the two businesses to deliver both food and coffee, thus avoiding redundancies and reducing costs associated with twice the number of drivers and vehicles.

6.5. Competitive interaction

This component of cases deals with situations where the market in which a company is operating changes and the company must decide what to do. These changes often have to do with a new player entering the market (again for more details make sure to dive into the Competitive Interaction article ).

Let’s assume that our Cambridge coffee shop has now become a chain and has flagged to competitors that Cambridge is a blooming market for coffee. As such, Starbucks has decided to open a few stores in Cambridge to test this market. The question posed to the candidate might be: what should our coffee chain do? One way (and a MECE one) to approach the problem is to decide between doing something and doing nothing. We might consider merging with another coffee chain and pooling our resources, or playing to our strengths and repositioning ourselves as ‘your student-friendly shop around the corner’. Just as easily, we might wait the situation out and see whether Starbucks really is cutting into our market share - after all, the advantages of our product and service might speak for themselves and Starbucks might end up tanking. Both of these are viable options if argued well, depending on the further specifics of the case.

Competitive Interaction Structure

6.6. Special cases

Most cases deal with the private sector, where the overarching objective entails profit in some form. However, as hinted above, there are cases dealing with other sectors where different KPIs are in place. The former will usually contain one or several of these building blocks, whereas the latter may well contain none of them. This latter category is arguably the one that will stretch your analytical and organisational skills to the limit, since there will be very little familiarity to fall back on (McKinsey famously employs such cases in its interview process).

So how do we tackle the structure for such cases? The short answer would be starting from first principles and using the problem driven structure outlined above, but let’s look at a quick example in the form of a McKinsey case :

McKinsey Diconsa Case

The first question addressed to the candidate is the following:

McKinsey Diconsa Case

This is in fact asking us to build a structure for the case. So what should we have in mind here? Most importantly, we should start with a structure that is MECE, and we should remember to do that by considering all the stakeholders. They are, on the one hand, the government and affiliated institutions, and on the other, the population. We might then consider which issues might arise for each stakeholder and what the benefits and risks would be for them. This approach is illustrated in the answer McKinsey provides as well:

McKinsey Framework

More than anything, this type of case shows us how important it is to practise and build different types of structures, and think about MECE ways of segmenting the problem.

7. How do I prepare for case interviews?

In consulting fashion, the overall preparation can be structured into theoretical preparation and practical preparation , with each category then being subdivided into individual prep and prep with a partner .

As a general rule, the level and intensity of the preparation will differ based on your background - naturally, if you have a business background (and have been part of a consulting club or something similar) your preparation will be less intensive than if you’re starting from scratch. The way we suggest you go about it is to start with theoretical preparation, which means learning about case interviews, business and basic consulting concepts (you can do this using free resources - such as the ones we provide - or, if you want a more thorough preparation, you can consider joining our Case Academy as well).

You can then move on to the practical preparation which should start with doing solo cases and focusing on areas of improvement, and then move on to preparation with a partner , which should be another candidate or - ideally - an ex-consultant.

Let’s go into more details with respect to each type of preparation.

7.1. Solo practice

The two most important areas of focus in solo preparation are:

  • Solving cases
  • Mental math

As we mentioned briefly, the best use of your time is to focus on solving cases. You can start with the cases listed on MBB sites, since they are clearly stated and come with worked solutions (Bain’s site, for example, is a good place to start), and then move on to more complex cases (our Case Library also offers a range of cases of different complexities). To build your confidence, start out on easier case questions, work through them with the solutions, and don't worry about time. As you get better, you can move on to more difficult cases and try to get through them more quickly. Aim to work through around eight case studies on your own.

Another important area of practice is your mental mathematics as this skill will considerably increase your confidence and is neglected by many applicants - much to their immediate regret in the case interview. Find our mental math tool here or in our course, and practice at least ten minutes per day, from day one until the day before the interview.

7.2. Preparation with a partner

There are aspects of a case interview - such as asking clarifying questions - which you cannot do alone and this is why, after you feel comfortable, you should move on to practice with another person. There are two options here:

  • Practicing with a peer
  • Practicing with an ex-consultant

In theory the two can be complementary - especially if your peer is also preparing for consulting interviews - and each has advantages and disadvantages. A peer is likely to practice with you for free for longer; however, you may end up reinforcing bad habits or be unable to get actionable feedback. A consultant will be able to provide the latter, but having their help for the same number of hours as a peer will come at a higher cost. Let’s look at each option in more detail.

7.2.1. Peer preparation

Once you have worked through eight cases solo, you should be ready to simulate the case interview more closely and start working with another person.

Here, many candidates turn to peer practice - that is, doing mock case interviews with friends, classmates or others also applying to consulting. If you’re at university, and especially at business school, there will very likely be a consulting club for you to join and do lots of case practice with. If you don’t have anyone to practice with, though, or if you just want to get a bit more volume in with others, our free meeting board lets you find fellow applicants from around the world with whom to practice. We recommend practicing around 10 to 15 ‘live’ cases to really get to a point where you feel comfortable.

7.2.2. Preparation with a consultant

You can do a lot of practising by yourself and with peers. However, nothing will bring up your skills as quickly and profoundly as working with a real consultant.

Perhaps think about it like boxing. You can practice drills and work on punch bags all you want, but at some point you need to get into the ring and do some actual sparring if you ever want to be ready to fight.

Practicing with an ex-consultant is essentially a simulation of a case interview. Of course, it isn’t possible to secure the time of experienced top-tier consultants for free. However, when considering whether to invest in boosting your chances of success, it is worth weighing the difference in your salary over even just a few years between getting into a top-tier firm versus a second-tier one. In the light of thousands in increased annual earnings (easily accumulating into millions over multiple years), it becomes clear that getting expert interview help really is one of the best investments you can make in your own future.

Should you decide to make this step, MyConsultingCoach can help, offering bespoke mentoring programmes , where you are paired with a 5+ year experienced, ex-MBB mentor of your choosing, who will then oversee your whole case interview preparation from start to finish - giving you your best possible chance of landing a job!

7.3. Practice for online interviews

Standard preparation for interview case studies will carry directly over to online cases.

However, if you want to do some more specific prep, you can work through cases solo against a timer, using a calculator and/or Excel (online cases generally allow calculators and second computers to help you, whilst these are banned in live case interviews).

Older PST-style questions also make great prep, but a particularly good simulation is the self-assessment tests included in our Case Academy course . These multiple choice business questions conducted with a strict time limit are great preparation for the current crop of online cases.

7.4. Fit interviews

As we’ve noted, even something billed as a case interview is very likely to contain a fit interview as a subset.

We have an article on fit interviews and also include a full set of lessons on how to answer fit questions properly as a subset of our comprehensive Case Academy course .

Here, though, the important thing to convey is that you should take preparing for fit questions every bit as seriously as you do case prep.

Since they sound the same as the questions you might encounter when interviewing for other industries, the temptation is to regard these as “just normal interview questions”.

However, consulting firms take your answers to these questions a good deal more seriously than elsewhere.

This isn’t just for fluffy “corporate culture” reasons. The long hours and close teamwork, as well as the client-facing nature of management consulting, mean that your personality and ability to get on with others is going to be a big part of making you a tolerable and effective co-worker.

If you know you’ll have to spend 14+ hour working days with someone you hire and that your annual bonus depends on them not alienating clients, you better believe you’ll pay attention to their character in interview.

There are also hard-nosed financial reasons for the likes of McKinsey, Bain and BCG to drill down so hard on your answers.

In particular, top consultancies have huge issues with staff retention. The average management consultant only stays with these firms for around two years before moving on to a new industry.

In some cases, consultants bail out because they can’t keep up with the arduous consulting lifestyle of long hours and endless travel. In many instances, though, departing consultants are lured away by exit opportunities - such as the well trodden paths towards internal strategy roles, private equity or becoming a start-up founder.

Indeed, many individuals will intentionally use a two year stint in consulting as something like an MBA they are getting paid for - giving them accelerated exposure to the business world and letting them pivot into something new.

Consulting firms want to get a decent return on investment for training new recruits. Thus, they want hires who not only intend to stick with consulting longer-term, but also have a temperament that makes this feasible and an overall career trajectory where it just makes sense for them to stay put.

This should hammer home the point that, if you want to get an offer, you need to be fully prepared to answer fit questions - and to do so excellently - any time you have a case interview.

8. Interview day - what to expect, with tips

Of course, all this theory is well and good, but a lot of readers might be concerned about what exactly to expect in real life . It’s perfectly reasonable to want to get as clear a picture as possible here - we all want to know what we are going up against when we face a new challenge!

Indeed, it is important to think about your interview in more holistic terms, rather than just focusing on small aspects of analysis. Getting everything exactly correct is less important than the overall approach you take to reasoning and how you communicate - and candidates often lose sight of this fact.

In this section, then, we’ll run through the case interview experience from start to finish, directing you to resources with more details where appropriate. As a supplement to this, the following video from Bain is excellent. It portrays an abridged version of a case interview, but is very useful as a guide to what to expect - not just from Bain, but from McKinsey, BCG and any other high-level consulting firm.

8.1. Getting started

Though you might be shown through to the office by a staff member, usually your interviewer will come and collect you from a waiting area. Either way, when you first encounter them, you should greet your interviewer with a warm smile and a handshake (unless they do not offer their hand). Be confident without verging into arrogance. You will be asked to take a seat in the interviewer’s office, where the case interview can then begin.

8.1.1. First impressions

In reality, your assessment begins before you even sit down at your interviewer’s desk. Whether at a conscious level or not, the impression you make within the first few seconds of meeting your interviewer is likely to significantly inform the final hiring decision (again, whether consciously or not).

Your presentation and how you hold yourself and behave are all important. If this seems strange, consider that, if hired, you will be personally responsible for many clients’ impressions of the firm. These things are part of the job! Much of the material on the fit interview is useful here, whilst we also cover first impressions and presentation generally in our article on what to wear to interview.

As we have noted above, your interview might start with a fit segment - that is, with the interviewer asking questions about your experiences, your soft skills, and motivation to want to join consulting generally and that firm in particular. In short, the kinds of things a case study can’t tell them about you. We have a fit interview article and course to get you up to speed here.

8.1.2. Down to business

Following an initial conversation, your interviewer will introduce your case study , providing a prompt for the question you have to answer. You will have a pen and paper in front of you and should (neatly) note down the salient pieces of information (keep this up throughout the interview).

It is crucial here that you don’t delve into analysis or calculations straight away . Case prompts can be tricky and easy to misunderstand, especially when you are under pressure. Rather, ask any questions you need to fully understand the case question and then validate that understanding with the interviewer before you kick off any analysis. Better to eliminate mistakes now than experience that sinking feeling of realising you have gotten the whole thing wrong halfway through your case!

This process is covered in our article on identifying the problem and in greater detail in our Case Academy lesson on that subject.

8.1.3. Analysis

Once you understand the problem, you should take a few seconds to set your thoughts in order and draw up an initial structure for how you want to proceed. You might benefit from utilising one or more of our building blocks here to make a strong start. Present this to your interviewer and get their approval before you get into the nuts and bolts of analysis.

We cover the mechanics of how to structure your problem and lead the analysis in our articles here and here and more thoroughly in the MCC Case Academy . What it is important to convey here, though, is that your case interview is supposed to be a conversation rather than a written exam . Your interviewer takes a role closer to a co-worker than an invigilator and you should be conversing with them throughout.

Indeed, how you communicate with your interviewer and explain your rationale is a crucial element of how you will be assessed. Case questions in general, are not posed to see if you can produce the correct answer, but rather to see how you think . Your interviewer wants to see you approach the case in a structured, rational fashion. The only way they are going to know your thought processes, though, is if you tell them!

To demonstrate this point, here is another excellent video from Bain, where candidates are compared.

Note that multiple different answers to each question are considered acceptable and that Bain is primarily concerned with each candidate’s thought process.

Another reason why communication is absolutely essential to case interview success is the simple reason that you will not have all the facts you need to complete your analysis at the outset. Rather, you will usually have to ask the interviewer for additional data throughout the case to allow you to proceed .

NB: Don't be let down by your math!

Your ability to quickly and accurately interpret charts and other figures under pressure is one of the skills being assessed. You will also need to make any calculations with the same speed and accuracy (without a calculator!). As such, be sure that you are up to speed on your consulting math.

8.1.4. Recommendation

Finally, you will be asked to present a recommendation. This should be delivered in a brief, top-down "elevator pitch" format , as if you are speaking to a time-pressured CEO. Again here, how you communicate will be just as important as the details of what you say, and you should aim to speak clearly and with confidence.

For more detail on how to give the perfect recommendation, take a look at our articles on the Pyramid Principle and providing recommendations , as well the relevant lesson within MCC Academy .

8.1.5. Wrapping up

After your case is complete, there might be a few more fit questions - including a chance for you to ask some questions of the interviewer . This is your opportunity to make a good parting impression.

We deal with the details in our fit interview resources. However, it is always worth bearing in mind just how many candidates your interviewers are going to see giving similar answers to the same questions in the same office. A pretty obvious pre-requisite to being considered for a job is that your interviewer remembers you in the first place. Whilst you shouldn't do something stupid just to be noticed, asking interesting parting questions is a good way to be remembered.

Now, with the interview wrapped up, it’s time to shake hands, thank the interviewer for their time and leave the room .

You might have other case interviews or tests that day, or you might be heading home. Either way, if you know that you did all you could to prepare, you can leave content in the knowledge that you have the best possible chance of receiving an email with a job offer. This is our mission at MCC - to provide all the resources you need to realise your full potential and land your dream consulting job!

8.2. Remote and one-way interview tips

Zoom case interviews and “one-way” automated fit interviews are becoming more common as selection processes are increasingly remote, with these new formats being accompanied by their own unique challenges.

Obviously you won’t have to worry about lobbies and shaking hands for a video interview. However, a lot remains the same. You still need to do the same prep in terms of getting good at case cracking and expressing your fit answers. The specific considerations around remote case interviews are, in effect, around making sure you come across as effectively as you would in person.

8.2.1. Connection

It sounds trivial, but a successful video case interview of any kind presupposes a functioning computer with a stable and sufficient internet connection.

Absolutely don’t forget to have your laptop plugged in, as your battery will otherwise definitely let you down mid-interview. Similarly, make sure any housemates or family know not to use the microwave, vacuum cleaner or anything else that makes the wifi cut out (or makes a lot of noise, obviously).

If you have to connect on a platform you don’t use much (for example, if it’s on Teams and you’re used to Zoom), make sure you have the up to date version of the app in advance, rather than having to wait for an obligatory download and end up late to join. Whilst you’re at it, make sure you’re familiar with the controls etc. At the risk of being made fun of, don’t be afraid to have a practice call with a friend.

8.2.2. Dress

You might get guidance on a slightly more relaxed dress code for a Zoom interview. However, if in doubt, dress as you would for the real thing (see our article here ).

Either way, always remember that presentation is part of what you are being assessed on - the firm needs to know you can be presentable for clients. Taking this stuff seriously also shows respect for your interviewer and their time in interviewing you.

8.2.3. Lighting

An aspect of presentation that you have to devote some thought to for a Zoom case interview is your lighting.

Hopefully, you long ago nailed a lighting set-up during the Covid lockdowns. However, make sure to check your lighting in advance with your webcam - bearing in mind what time of day your case interview actually is. If your case interview is late afternoon, don’t just check in the morning. Make sure you aren’t going to be blinded by light coming in a window behind your screen, or end up with weird shadow stripes from blinds all over your face.

Natural light is always best, but if there won’t be much of that during your interview, you’ll likely want to experiment with moving some lamps around.

8.2.4. Clarity

The actual stories you tell in an automated “one-way” fit interview will be the same as for a live equivalent. If anything, things should be easier, as you can rattle off a practised monologue without an interviewer interrupting you to ask for clarifications.

You can probably also assume that the algorithm assessing your performance is sufficiently capable that it will be observing you at much the same level as a human interviewer. However, it is probably still worth speaking as clearly as possible with these kinds of interviews and paying extra attention to your lighting to ensure that your face is clearly visible.

No doubt the AIs scoring these interviews are improving all the time, but you still want to make their job as easy as possible. Just think about the same things as you would with a live Zoom case interview, but more so.

9. How we can help

There are lots of great free resources on this site to get you started with preparation, from all our articles on case solving and consulting skills to our free case library and peer practice meeting board .

To step your preparation up a notch, though, our Case Academy course will give you everything you need to know to solve the most complex of cases - whether those are in live case interviews, with chatbots, written tests or any other format.

Whatever kind of case you end up facing, nothing will bring up your skillset faster than the kind of acute, actionable feedback you can get from a mock case interview with a real MBB consultant. Whilst it's possible to get by without this kind of coaching, it does tend to be the single biggest difference-maker for successful candidates.

You can find out more on our coaching page:

Explore Coaching

Of course, for those looking for a truly comprehensive programme, with a 5+ year experienced MBB consultant overseeing their entire prep personally, from networking and applications right through to your offer, we have our mentoring programmes.

You can read more here:

Comprehensive Mentoring


Management Consulting Case Library

In our Case Library, you will find 200+ case studies for your case interview preparation. Solve challenging problems around market sizing, pricing, or sustainability.



Any Open Questions Left? Check Out Our FAQ

Almost all resources on PrepLounge (including the case studies) can be used with freemium access, which means you can try them to a certain degree but have to upgrade to Premium to use them to the full extent. However, you have free and unlimited access to the cases included in the free Basic Membership.

Our selection of case studies mirrors the wide variety of real case interviews. Thus, you can solve challenging problems around market sizing, market entry, pricing, or operations strategy. Moreover, all provided cases are marked with a level of difficulty and the respective case style. This allows you to filter for interviewer-led (as used by McKinsey) and candidate-led cases. In addition, some of our corporate partners have contributed great original cases that have been used in real interviews.

There is no set number of cases you should solve before your interview. However, you will find that the more practice you get and the more different scenarios you tackle, the more confident you will become in dealing with case prompts. In addition to self-study, we recommend that you act out real interview situations: just choose a partner from over 370,000 candidates and practice cases from the case library together.

Next to each case you will find a button that allows you to post a question directly to the case in our Consulting Forum . Our experts will answer you as soon as possible and support you with your problem. You can also contact the experts directly. 

Are you excited to start practicing for your case interview with real case studies?


Data Preparation for Regression – Pricing Case Study Example (Part 2)

In the last post, we started a case study example of regression analysis to help an investment firm make money through property price arbitrage (read part 1: regression case study example). This is an interactive case study example and requires your help to move forward. Here are some of the observations from exploratory analysis that you shared in the comments on the last part (download the data here):

Katya Chomakova: The house prices are approximately normally distributed. All values except the three outliers lie between 1492000 and 10515000. Among all numeric variables, house prices are most highly correlated with Carpet (0.9) and Builtup (0.75).

Mani: Initially, it appears as if housing price has good correlation with built up and carpet. But, once we remove all observations having missing values (which is just ~4% of total obs), I find that the correlation drops down very low (~0.09 range).

Punch and Regression Analysis - by Roopam

Data Preparation for Regression Analysis – by Roopam

Katya and Mani noticed something unusual about missing observations and outliers in the data, and how their presence or absence changed the results dramatically. This is why data preparation is an important exercise for any machine learning or statistical analysis: it is what makes the results consistent. We will learn about data preparation for regression analysis in this part of the case study. Before we explore this in detail, let’s take a slight detour to understand the crux of stability and talk about the fall of heroes.

Every kid needs a hero. I had many when I was growing up. This is a story of how I used a concept from physics called ‘center of gravity’ to choose one of my heroes, by staging an imaginary competition between:

Mike Tyson Vs. Bop Bag

The Champion: Mike Tyson was the undisputed heavyweight boxing champion in the late 1980s. He was no Muhammad Ali, but he was on his way to coming closest to The Greatest. This is where things went wrong for Tyson; he was convicted of rape and spent 3 years in prison. Out of jail and desperate to regain his glory days, Tyson challenged Evander Holyfield, the then undisputed champion. What followed was a disgrace for any sport: during their rematch Tyson bit off a part of Holyfield’s ear and was disqualified.

The Challenger: Most of us have played with a bop bag, or punching toy, as kids. It is designed in such a way that when punched, it topples for a moment but eventually stands back up on its own. The bop bag is a perfect example of an object whose center of gravity is highly grounded and stays within its body. You can punch it, kick it, or perturb it in any possible way, but the bop bag will stand back up after a fall - yes, with that cute, funny smile too. On the other hand, like Mike Tyson, most of us struggle big time after a fall, possibly because our center of gravity lies outside our body, in other people’s opinions of us. Tyson was mostly driven by the praise of others after a win rather than by his love for the game.

The Winner: Center of gravity helped me choose my hero: the bop bag. This cute toy reminds me every day to keep my center grounded and inside my body and not to let others perturb my core - even when punched. I wish I could always wear a sincere smile like my hero.

The bop bag also holds important lessons for data preparation for machine learning and data science models. The data used for modeling needs to display stability similar to the bop bag’s, and must not give completely different results with different subsets of observations. Katya and Mani noticed a major instability in our data in their exploratory analysis: they highlighted the presence of missing data and outliers. We will explore these ideas further in this part as we work through data preparation for regression analysis. Now, let’s go back to our case study example.

Data Preparation for Regression – Case Study Example

Regression model

In your effort to create a price estimation model, you have gathered this data. The next step is data preparation for regression analysis, before the development of a model. This requires us to prepare a robust and logically consistent dataset for analysis.

We will do our analysis for this case study example in R. For this, I recommend you install R and RStudio on your system. However, you could also try this code on an online R engine such as R-Fiddle.

We will first import the data into R and then prepare a summary report for all the variables using this command:
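The exact command is not reproduced in the original post; a minimal sketch along these lines (the file name is an assumption - point it at wherever you saved the downloaded data) produces the summary shown below:

# Import the data (file name/path assumed) and summarise every variable
house <- read.csv("house_prices.csv", stringsAsFactors = TRUE)
summary(house)  # five-number summary, mean and NA count per numeric variable; level counts for factors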

A version of the summary report is displayed here. Remember, there are 932 observations in total in this data set.

          Dist_Taxi  Dist_Market  Dist_Hospital  Carpet  Builtup
Min.            146         1666           3227     775      932
1st Qu.        6476         9354          11302    1318     1583
Median         8230        11161          13163    1480     1774
Mean           8230        11019          13072    1512     1795
3rd Qu.        9937        12670          14817    1655     1982
Max.          20662        20945          23294   24300    12730
NA's             13           13              1       8       15

Look at the last row, where all of the above variables have some missing data. Parking and City_Category are categorical variables, hence we get counts of their levels instead. Notice that Parking has missing data as well, marked as ‘Not Provided’.

Parking               City_Category
Covered     : 188     CAT A: 329
No Parking  : 145     CAT B: 365
Not Provided: 227     CAT C: 238
Open        : 372

          Rainfall  House_Price
Min.           110        30000
1st Qu.        600      4658000
Median         780      5866000
Mean         785.6      6084695
3rd Qu.        970      7187250
Max.          1560    150000000
NA's             0            0

The first thing we will do is remove observations with missing values from this dataset. We will explore later whether removing missing data is a good strategy or not. We will also calculate how many observations we lose by removing the missing data.
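The original post shows only the result of this step; a sketch of one way to do it in R (reusing the 'house' data frame from above) is:

# Drop every row with an NA in any column. 'Not Provided' in Parking is a regular
# factor level, not an NA, so those rows survive this step
house_clean <- na.omit(house)
nrow(house) - nrow(house_clean)  # observations lost
nrow(house_clean)                # observations remaining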

We have lost 34 observations by removing missing data, so the data set is now down to 898 observations. That is roughly 4% of observations, as Mani pointed out in his comment. Also notice that ‘missing’ values of the categorical variable (Parking) are not removed - can you reason why?

In the next step, we will plot a box plot of the housing price to identify outliers in the dependent variable.
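A sketch of the corresponding command (the resulting plot is shown below):

# Box plot of the dependent variable to spot extreme values
boxplot(house_clean$House_Price, main = "House Price", ylab = "Price")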

Data Preparation for Regression Analysis

Let’s try to look at this extreme outlier by fetching this observation.
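One way to fetch it, assuming the extreme point is simply the maximum house price:

# Pull out the row with the highest house price
house_clean[which.max(house_clean$House_Price), ]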

This observation seems to be a large mansion somewhere in the countryside, as can be seen when it is compared with the summary statistics of the other observations.

Dist_Taxi  Dist_Market  Dist_Hospital  Carpet  Builtup  Parking  City_Category  Rainfall  House_Price
    20662        20945          23294   24300    12730  Covered          CAT B      1130    150000000

There is no point in keeping this super-rich property in the data while preparing a model for middle-class housing, hence we will remove this observation. The next step is to look at box plots of all the numerical variables in the model to find unusual observations. We will normalize the data to bring all variables onto the same scale.
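Again, only the resulting plot appears in the post; a sketch of how this step could be done is:

# Remove the extreme outlier, then box-plot all numeric variables on a common scale
house_clean <- house_clean[-which.max(house_clean$House_Price), ]
num_cols <- sapply(house_clean, is.numeric)
boxplot(scale(house_clean[, num_cols]), las = 2,
        main = "Normalized numeric variables")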

Data preparation for regression analysis

Sign-off Note

In this part, we have primarily spent our time on univariate analysis for data preparation for regression. In the next part, we will explore patterns through bivariate analysis before developing multivariate models. Here are some questions you may want to ponder, and share your views on, before the next part:

1) We removed 34 observations with missing data. What impact can the removal of missing data have on our analysis? Could we do something to minimize this impact?

2) Why did we not remove missing values from the categorical variable, i.e., Parking?

3) What impact could the extreme outlier, a large mansion, have on the model we are developing for middle-class house prices? Was it a good idea to remove that outlier?

9 thoughts on “Data Preparation for Regression – Pricing Case Study Example (Part 2)”

In my opinion, it is not always a good idea to exclude missing values from our data, for several reasons:

1. We can lose a great amount of data, which is not our case here but is still an issue. Losing data is a problem because the amount of data that we use for modeling is crucial for the statistical significance of our results.

2. Sometimes the fact that a value is missing carries a message. For example, a missing value for Dist_Hospital could mean that there is no hospital nearby. The same is true for Dist_Taxi and so on. When this is the case, we can impute values to missing data that represent the underlying reason for the data to be missing. For example, missing values for Dist_Hospital might be replaced with a huge number, meaning that there is no hospital in the near distance.

3. When we cannot consider the reason behind missing values but we still don’t want to exclude them, a reasonable approach is to impute the average value of the corresponding variable.

Regarding outliers, we should first examine whether they are also high-leverage points. If so, these observations have the potential to exert an influence on our model and it is recommended to exclude them. If they are not high-leverage points, whether to exclude them depends on the modeling technique we would like to use. Decision trees are more sensitive to outliers than regression models. I am looking forward to your next post on the topic.

I do agree with Katya. Here is my opinion about (1) missing values and (2) outliers:

1. Missing values can cause serious problems. Most statistical procedures automatically eliminate cases with missing values, so you may not have enough data to perform the statistical analysis. Alternatively, although the analysis might run, the results may not be statistically significant because of the small amount of input data. Missing values can also cause misleading results by introducing bias.

2. Values may not be identically distributed because of the presence of outliers. Outliers are inconsistent values in the data. Outliers may have a strong influence over the fitted slope and intercept, giving a poor fit to the bulk of the data points. Outliers tend to increase the estimate of residual variance, lowering the chance of rejecting the null hypothesis.

Below is my opinion on the three points:

1) Instead of removing the missing data, can’t we substitute some value, like the average of the other observations in that column? I am suggesting this because we might lose some important information by directly removing the observations with missing values.

2) Flats having no parking availability can lead to useful information: for example, a middle-class family may not require a parking space and may buy a flat with no parking facility. I am not quite sure about this point. Kindly correct me if I am wrong.

3) Outliers can lead our analysis in the wrong direction, but sometimes they are useful for identifying hidden patterns in the data. In our case, this extreme outlier can have a bad impact on our analysis as we are focused on middle-class house prices, so in my opinion it is a good idea to remove it.

Reply – Missing data can bias the model, especially when values are missing at random or missing not at random. We can use Little’s test to confirm whether the data are missing completely at random. If yes, we can go ahead with listwise deletion. If the data are missing at random, we can use multiple imputation methods that use the other variables to impute the values, and then build models on the completed data sets.
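A minimal sketch of multiple imputation, assuming the mice package is installed (the object names continue the hypothetical ones used above):

library(mice)  # multiple imputation by chained equations

# Create 5 imputed data sets using predictive mean matching
imputed <- mice(housing_data, m = 5, method = "pmm", seed = 123)

# Extract one completed data set to work with
housing_complete <- complete(imputed, 1)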

Reply – This extreme outlier can highly influence the model parameters. Checking Cook’s distance and h (the leverage of an observation) indicates that the observation is a high-leverage point and, therefore, we should remove it before training our model.

If I am not wrong, the scale function works like (x − mean)/sd. So when do we use this scaling process and when do we use a log transformation for normalization? Is there any rule?

Scaling and log transformation have completely different purposes and they cannot be used interchangeably. One use of scale() is in cluster analysis (read this article), where you are trying to bring different variables onto the same scale. On the other hand, a log transform is used to make a skewed distribution closer to normal, e.g., in ARIMA and other time series models.
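A minimal sketch contrasting the two transformations on House_Price, under the same naming assumptions as the earlier code:

# scale(): (x - mean)/sd, keeps the shape of the distribution but changes its units
z <- scale(data_without_missing$House_Price)

# log(): compresses the long right tail of a skewed variable
logged <- log(data_without_missing$House_Price)

par(mfrow = c(1, 2))
hist(z, main = "Scaled")
hist(logged, main = "Log-transformed")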

Thanks 🙂 it’s a great help 🙂

boxplot(scale(data_without_missing[data_without_missing$House_Price<10^8,c(2:6,9:10)]),col="Orange")

Throwing this error Error in `[.data.frame`(data_without_missing, data_without_missing$House_Price < : undefined columns selected

boxplot(scale(data_without_missing[data_without_missing$House_Price<10^8,c(2:6,9:10)]),col="Orange") — I need the Python equivalent of this code.


Top 5 Data Preparation Tools In 2024

Data analysis demands precision, and at its core lies the vital role of data preparation tools. These tools serve as the cornerstone for ensuring your data is accurate, consistent, and reliable. Before embarking on your data analysis journey, it’s crucial to choose the right tool for the job. This guide will introduce you to the top five data preparation tools currently shaping the market.

What Are Data Preparation Tools?

Data preparation tools are software or platforms that automate and streamline the entire data preparation process. These user-friendly tools collect, clean, transform, and organize raw and incomplete data into a suitable and consistent format for further data processing, modeling, and analysis tasks. Data preparation tools help users clean and transform large volumes of data faster and more efficiently than manual processes.

Key Features of a Good Data Preparation Tool

Here are some essential features of good data preparation software:

Connectors for Diverse Data Sources

A quality data preparation tool connects to in-demand relational databases such as Azure, Oracle, Redshift, and SQL Server. It should also have connectors for various CRM systems, CSV/JSON files, and multi-structured sources such as log files, PDFs, images, texts, etc.

Built-in connectivity for these sources allows for easier data extraction and integration, as users will be able to retrieve complex data with only a few clicks.

Data Security

Data security and privacy checks protect sensitive data from unauthorized access, theft, or manipulation. Despite intensive regulations, data breaches continue to result in significant financial losses for organizations every year. According to IBM research, in 2022, organizations lost an average of $4.35 million as a result of data breaches. This was up 2.6% from the previous year. Data security is necessary to keep this number down.

Most data preparation tools allow for access control. With access controls defined, only authorized users can access sensitive data. Additionally, access can be customized based on the user’s role or level of access needed. By limiting access to sensitive data pipelines or architectures, preparation tools can enhance accuracy by reducing the risk of errors and ensure compliance with data protection regulations.

End-to-End Process Automation

One of the main reasons organizations turn to data preparation solutions is to automate all the manual data preparation tasks and processes. Businesses significantly improve efficiency and productivity by automating data integration, cleaning, standardization, transformation, and storage tasks. Preparing reliable data can normally take weeks or months; however, automation can reduce this cycle to just a few hours or days.

Easy-to-Use, Code-Free Environment

By eliminating the need for writing complex code, data preparation tools reduce the risk of errors. These tools allow users to manipulate and transform data without the potential pitfalls of manual coding. This improves data quality and saves valuable time and resources that would otherwise be devoted to error detection and correction.

Interoperability

Once you’ve accessed, cleansed, and organized your data, the next crucial step is to utilize it effectively within your analytics infrastructure. While all data transformation solutions can generate flat files in CSV or similar formats, the most efficient data prep implementations will also integrate easily with your other productivity and business intelligence (BI) tools.

Manual export and import steps in a system can add complexity to your data pipeline. When evaluating data preparation tools, look for solutions that connect easily to data visualization and BI reporting applications, e.g., Power BI, Tableau, to guide your decision-making processes.

Flexibility and Adaptability

Flexibility is the tool’s ability to work with various data sources, formats, and platforms without compromising performance or quality. An agile tool that can easily adopt various data architecture types and integrate with different providers will increase the efficiency of data workflows and ensure that data-driven insights can be derived from all relevant sources.

Adaptability is another important requirement. As businesses grow and evolve, so do their data requirements. This means that a data preparation automation tool should be capable of scaling and adapting to the organization’s changing needs. It should be able to adjust to new technologies, handle increasing data volumes, and accommodate new business goals.

Top 5 Data Preparation Tools for 2024

1. Astera

Astera is a unified data management platform with advanced data preparation, extraction, integration, warehousing, electronic data exchange, and API management capabilities. The platform’s easy-to-use visual interface allows you to design and develop end-to-end data pipelines without coding.

Astera’s dynamic platform includes rigorous data cleaning, transformation, and preparation features. The solution lets you connect to various data sources, including databases, files, and APIs, to access raw data easily. With its preview-focused interface, you can perform various data-cleaning activities, such as removing duplicates, handling missing values, and correcting inconsistencies.

Astera supports advanced transformations such as filtering, sorting, joining, and aggregating to restructure and improve data quality. The integrity and quality of the prepared data can be verified using custom validation rules, data profiling, and verification checks to ensure reliability and consistency. Once satisfied, you can easily export the organized data to various formats or integrate it with downstream systems for analysis, visualization, or consumption with just a few clicks.

Key Features:

  • Point-and-Click Navigation/ No-Code Interface
  • Interactive Data Grid with Agile Correction Capabilities
  • Real-Time Data Health Checks
  • Effortless Integration of cleaned data with external systems
  • Workflow Automation
  • Data Quality Assurance with Comprehensive Checks and Rules
  • Rich Data Transformations
  • Connectors for a wide range of on-premises and cloud-based sources
  • AI-powered Data Extraction

2. Altair Monarch 

Altair Monarch is a self-service tool that supports desktop and server-based data preparation capabilities. The tool can clean and prepare data from a wide range of data sources and formats, including spreadsheets, PDFs, and big data repositories. Altair Monarch has a no-code interface to clean, transform, and prepare data. It supports data source access, profiling and classification, metadata management, and data joining.

  • No-code, visual interface
  • Workflow automation
  • Pre-built data transformation features
  • Reusable custom models

3. Alteryx 

The Alteryx data preparation tool offers a visual interface with hundreds of no-code/low-code features to perform various data preparation tasks. The tool allows users to easily connect to various sources, including data warehouses, cloud applications, and spreadsheets. Alteryx can conduct predictive, statistical, and spatial analysis of the retrieved data. The tool also lets users visually explore data through data exploration and profiling. Alteryx is available both as a cloud-based solution and on-premises.

  • AI-infused data quality enhancement recommendations
  • Data Exploration & Profiling
  • Data connectors for on-premises and cloud
  • User-friendly interface

4. Talend

Talend’s data prep module is a self-service data preparation application that uses machine learning algorithms for standardization, cleansing, and reconciliation activities. The tool’s browser-based interface and machine-learning-enabled data prep features let users clean and prepare data. Talend connects to various data sources such as databases, CRM systems, FTP servers, and files, enabling data consolidation.

  • No-Code self-service interface
  • Role-based access for data security and governance
  • Real-time data quality monitoring

5. Datameer 

Datameer is a SaaS platform designed for data preparation within the Snowflake environment. The tool gives users the option to prepare data using SQL code or through a drag-and-drop, Excel-like interface to ingest and prepare data. Datameer uses a graphical formula builder for data transformations, profiling, and more. The tool also integrates with BI tools for further analysis.

  • No-code or SQL-code
  • Snowflake centered
  • Excel-like Interface
  • Runtime validation
  • Support for all data formats (structured, semi-structured, and unstructured)
  • Data Profiling and Transformations

How to Choose the Right Data Preparation Tool for Your Needs

Choosing the right data preparation tool is an important task. There are some key factors you must keep in mind to find a solution that fits your data requirements.

Consider the complexity of your data and the level of technical expertise available within your organization. Some tools are more suitable for technical users, while others focus on simplicity and ease of use for non-technical users. Additionally, evaluate the performance and scalability of the tool, as well as its compatibility with your existing infrastructure.

Evaluate the volume and variety of your data and the frequency of data updates. Consider whether you require real-time data integration, advanced data profiling capabilities, or specific data transformation functions.

Emerging Trends in Data Preparation

The rise of big data and the increasing complexity of data sources have led to the development of intelligent data preparation tools. These tools leverage AI and machine learning algorithms to automate data cleansing and transformation tasks, making the data preparation process more efficient and accurate. Additionally, data preparation tools are becoming more integrated with other data analytics technologies, such as data visualization and predictive analytics, enabling organizations to derive more value from their data.

Advancements in technology, such as cloud computing and distributed processing, are also revolutionizing the data preparation process. Integrating data preparation tools with data lakes and warehouses enables organizations to leverage the power of distributed processing, making data preparation faster and more efficient than ever before.

Streamline Your Data Preparation with Self-Service Tools

Data preparation is a critical step in the data analysis process. With the right data preparation tool, you can ensure data quality, consistency, and accuracy, leading to more reliable insights and informed decision-making. By considering the key features and evaluating your specific needs, you can choose a data preparation tool that suits your requirements.

As technology advances, the future of data preparation looks promising, with intelligent tools and seamless integration shaping how we prepare and analyze data.

Astera is a powerful, AI-powered platform that enables self-service data preparation for users with varying levels of technical expertise. You can automate repetitive tasks, such as data cleansing, transformation, and enrichment, reducing manual effort and saving time. With advanced data preparation capabilities, Astera is invaluable in any data-driven operation. It bridges the gap between data and analysis, accelerating business time-to-insights.

Experience how Astera can make your data preparation tasks easier and quicker. Sign up for our 14-day free trial or a free demo today!



A space for data science professionals to engage in discussions and debates on the subject of data science.

Data Science Interview case study prep tips

I have an upcoming interview for a data scientist role.

I am preparing for it by revising ISLR, ESLR, and the Kaggle courses to brush up on pandas and sklearn functions.

Is there any platform or book where I can find case study prompts and tips on how to work through such prompts?


Amazon Prime Day 2024: Results & Data Analysis for Sellers


Amazon’s 10th Prime Day ran on July 16th and 17th this year, solidifying its position as a major shopping event with ever-increasing popularity and scale. Despite expanding into a two-day event over the years, the retail holiday retained its distinct name – Prime Day. Consumers recognize Prime Day as a round-the-clock deals extravaganza, requiring them to regularly check Amazon throughout the event to snag limited-time discounts before they disappear.

As in previous years, savings opportunities started early with a variety of Early Prime Day Deals available on July 15th, the day before the main event, and many of the deepest discounts on Early Prime Day Deals were for Amazon’s own popular devices.

Figure: carousel of Amazon Prime Day 2024 offers (Source: Amazon)

In this post, we’ll analyze some of Tinuiti’s initial findings on Prime Day 2024 and offer actionable insights to help with preparations for next year’s event. We’ll dive into Prime Day’s history, the results and key takeaways from 2024, how (and when) to prepare for the next Prime Day, and more.

What is Prime Day?

Prime Day is a two-day shopping event hosted by Amazon exclusively for its Prime members. It typically features significant discounts on a vast array of products, from electronics and home goods to fashion and toys. Originally started as a celebration of Amazon’s Prime membership anniversary, it has evolved into a global shopping phenomenon rivaling Black Friday and Cyber Monday.

Prime Day 2024 promotional banner on Amazon.com

Key Data & Results for Prime Day 2024

Just a day after the sale wrapped, Amazon announced that Prime Day 2024 was Amazon’s biggest Prime Day shopping event ever, with millions more Prime members shopping the two-day shopping event compared to Prime Day 2023.

Prime Day By the Numbers 

Amazon didn’t reveal exact numbers, but Adobe Analytics reported that consumers spent $14.2 billion during the Prime Day event, an 11% jump from last year’s $12.7 billion total. According to Numerator , the average order size on Prime Day 2024 was $57.97 and 60% of households shopping Prime Day placed 2+ separate orders, bringing the average household spend to roughly $152.33.

Bar chart showing average order size during Amazon Prime Day 2024, with 21% spending approximately $20 and 17% spending approximately $100

Numerator noted that the top items on Prime Day 2024 were Amazon Fire TV Sticks, Premier Protein Shakes, and Liquid IV Packets. Apparel and shoes , home goods, and household essentials were also popular purchases among consumers. It was also reported that more than half of shoppers took advantage of sales to buy items they had been eyeing prior to the event.

Most popular categories and product types on Amazon Prime Day 2024, with 27% of purchases including Apparel & Shoes and 53% of shoppers waiting to buy items until they go on sale

Back-to-school shopping was also a focus for many consumers shopping on Prime Day as the sale ran later in the month compared to last year. Parents flocked to the platform for deals on everything from clothing and electronics to school supplies and dorm essentials. In fact, back-to-school shopping significantly boosted sales this year, with spending on related items surging 216% compared to average daily sales in June, according to Adobe.

“Prime Day 2024 was a huge success thanks to the millions of Prime members globally who turned to Amazon for fantastic deals, and our much-appreciated employees, delivery partners, and sellers around the world who helped bring the event to life for customers.”  Doug Herrington CEO of Worldwide Amazon Stores

Marketplace Competition

Amazon’s Prime Day faced stiff competition from a range of retailers. Walmart , Target, and Best Buy launched competing sales events, all happening before Prime Day, offering comparable deals and leveraging their own strengths like in-store pickup.

Numerator reports that over half (54%) of Prime Day shoppers compared prices with other retailers prior to placing their orders, and about 35% shopped during last week’s Target Circle Week sale and/or Walmart Deals sale. 

It’s clear shoppers are looking everywhere for the best deals on Prime Day, so it’s important to offer competitive prices and promotions. If you’re thinking about adding emerging marketplaces to your strategy, it might be worth exploring your options before next year’s sales events pick up ( and we can help ).

Price comparison shopping trends during Amazon Prime Day 2024, with a small majority checking prices at other retailers.

AI Integration: Rufus

This year, Rufus , Amazon’s generative AI-powered shopping assistant now available to all U.S. customers, enhanced the customer shopping experience for many Prime Day buyers. Rufus offers a conversational way to search for products, compare options, and get personalized recommendations. Users can easily track orders, find detailed product information, and even get help with customer service inquiries within the Amazon Shopping app.

“The introduction of Amazon’s AI-powered shopping assistant, Rufus, was a welcome addition to this year’s Prime Day. In my own Prime Day shopping experience, Rufus provided much faster information about the product I was viewing than if I scrolled through the entire product detail page. I am very interested to see how this feature impacts conversion rates moving forward. It definitely seems like a nice win for the customer shopping experience.” Joe O’Connor Sr. Innovation & Growth Director at Tinuiti

Example of Amazon’s generative AI, Rufus, on a product detail page during Prime Day 2024

Ad Console Outage 

It’s worth noting that during this year’s Prime Day event, the Amazon Ads console experienced downtime on the first day, which disrupted the ability to log into the platform. At Tinuiti, our teams noted intermittent load-time issues about two hours before the full crash, which occurred at 8:33 pm PST; the console came back up around 9:53 pm PST.

Prime Day 2024 Hot Takes From Tinuiti Experts 

We’ve covered the numbers, but let’s hear what stood out most to our Tinuiti experts during Prime Day 2024… 

Amazon Marketing Cloud Proved Invaluable for Advertisers

“Getting creative with AMC (Amazon Marketing Cloud) audiences allowed brands to drive incremental purchases with high ROAS . Most notably, we saw scale behind consumers who added a product to a registry and who added a product to their wishlist. We saw high efficiency targeting consumers who added a product to their cart over the last 2 weeks before Prime Day. Being able to target those consumers who are in the brands shopping funnel already with creatives pulling in promotions resulted in efficient clickthrough rates.”  Karen Hopkins Marketplaces Strategist at Tinuiti
“Implementing an effective lead-in strategy is crucial for scaling your lower funnel campaigns effectively during the Prime Day event. It is essential to build a sufficient pool of qualified consumers in advance in order to reach them starting on day one. Leveraging AMC audiences to engage customers through innovative approaches proved to be advantageous throughout the event.”  Jordan Lampi Senior Specialist, Commerce Media at Tinuiti 

Optimizing Budgets for the Time of the Day Was Key

“Account for the traffic flow when mapping out your budgets. The morning of day one will bring strong volume but also some of the pricier CPCs. Take advantage of lead in deals and a heavier spend push the day before the event starts, as some consumers who are browsing for Prime Day deals will make that purchase if a good deal is there while browsing the day before (I’m guilty of such behavior). Fast forward through a slower middle part of the day for Day one, evening hours offers a critical time to be visible. But day two is arguably the most critical for budget planning, as the morning and afternoon can be a bit of a deadzone for Prime Day, with the grand finale coming in the core post working hours as people get in their last minute deals. Make sure you are loaded up to spend in those hours, as they are the most critical of the entire event”  Ken Magner Strategist, Commerce Media at Tinuiti

Agility and Pre-Prime Day Preparation Proved to be Crucial 

“Despite the Amazon outage on Tuesday, leveraging our partnered third-party tools, like Skai, was essential for our Prime strategy. By using these tools, we efficiently managed budgets and set ourselves up for a successful second day. As a result, we not only achieved but exceeded our revenue goal, bringing in 114% of our target for this year.” Brooke Martinez Commerce Media Specialist at Tinuiti 
“Every Prime Day there is a need to pivot due to unforeseen issues. Taking time to work through the scenarios with a client ahead of time creates more flexibility during the event. If we know what the next steps are if a product runs out of budget or if we’re not hitting goals then we don’t need to wait for approval.”  Karie Casper Strategist, Commerce Media at Tinuiti

Prepping for Prime Day 2025

To prepare for Prime Day 2025 and future sales events, sellers should focus on several key areas. First, prioritize building a strong product foundation by collecting customer reviews and optimizing product detail pages with relevant keywords. Second, invest in customer retention strategies such as retargeting ads to recapture potential buyers. Third, anticipate increased competition by monitoring deals across platforms and considering expanding to new marketplaces. Finally, be sure to leverage data from the previous Prime Day to refine your strategy for the upcoming year, testing new tactics and optimizing performance.

If you’re interested in learning more about Tinuiti’s Commerce Services, or if you’d like to learn more about emerging marketplaces, contact us today . Also, be sure to sign up for our upcoming webinar where we’ll take a deeper dive into how you can crush Q4—and keep an eye out for our Prime Day Data Recap coming soon.


Estimating Urban Traffic Safety and Analyzing Spatial Patterns through the Integration of City-Wide Near-Miss Data: A New York City Case Study


1. Introduction
2. Literature Review
3. Methodology
3.1. Grid-Based Method
3.2. Empirical Bayes Method
3.3. Spatial Analysis
4. Data Preparation
4.1. Near-Misses
4.2. Crash Data
4.3. Other Data
4.3.1. Traffic Exposure Data
4.3.2. Road Network and Transport Facility
4.3.3. Land Use
4.3.4. Population Density
4.4. Variable Summary
5. Results and Discussion
5.1. Correlation Analysis
5.2. Crash Prediction Model
5.3. Spatial Analysis
5.3.1. Spatial Distribution
5.3.2. Spatial Autocorrelation
5.3.3. Observation and Prediction Difference
6. Conclusions
Author Contributions, Institutional Review Board Statement, Informed Consent Statement, Data Availability Statement, Acknowledgments, Conflicts of Interest



Variable            Mean     S.D.     Median   Min    Max
Crash_tot           1.43     2.02     1        0      15
CW_ME8              30.87    38.30    19       0      504
CW_OEM              1.24     2.01     0        0      17
Intersection        0.93     1.40     1        0      18
Subway_station      0.026    0.17     0        0      2
Bus_stop            0.34     0.65     0        0      4
Road_length (ft)    407.97   176.82   355.22   3.64   1032.05
VMT (mi × veh)      1259     1869     723      0      15,777
Highway             0.08     0.27     0        0      1
Res_r               15%      20%      5%       0%     98%
Comm_r              16%      23%      2%       0%     87%
Open_r              9%       26%      0%       0%     100%
Mix_rc_r            14%      16%      8%       0%     91%
Population          259      226      227      0      1367
Variable         Estimate   Marginal Effects   Std. Error   z-Value   Pr(>|z|)
Intercept        −0.88                         0.09         −9.389    0.00 **
CW_ME8           0.01       0.0074             0.00         10.694    0.00 **
Intersection     0.12       0.1322             0.02         6.893     0.00 **
Subway_station   0.20       0.2225             0.14         1.506     0.132
Bus_stop         0.19       0.2062             0.04         4.878     0.00 **
Road_length      0.01       0.0065             0.00         10.152    0.00 **
Res_r            −0.73      −0.8001            0.17         −4.419    0.00 **
Open_r           −1.04      −1.1295            0.16         −6.358    0.00 **
Mix_rc_r         0.28       0.3002             0.17         1.587     0.112
Ψ                1.53
Std. err         0.12
AIC              6045.2
2 × log-likelihood   −6025.2

Xu, C.; Gao, J.; Zuo, F.; Ozbay, K. Estimating Urban Traffic Safety and Analyzing Spatial Patterns through the Integration of City-Wide Near-Miss Data: A New York City Case Study. Appl. Sci. 2024, 14, 6378. https://doi.org/10.3390/app14146378



COMMENTS

  1. DataInterview

    DataInterview is one of the best resources that helped me land a job at Apple. The case study course helped me not only understand the best ways to answer a case but also helped me understand how an interviewer evaluates the response and the difference between a good and bad response.

  2. 20+ Data Science Case Study Interview Questions (with Solutions)

Product Case Studies - This type of case study tackles a specific product or feature offering, often tied to the interviewing company. Interviewers are generally looking for business sense geared toward product metrics. Data Analytics Case Study Questions - Data analytics case studies ask you to propose possible metrics in order to investigate an analytics problem.

  3. Enhancing data preparation: insights from a time series case study

    Data play a key role in AI systems that support decision-making processes. Data-centric AI highlights the importance of having high-quality input data to obtain reliable results. However, well-preparing data for machine learning is becoming difficult due to the variety of data quality issues and available data preparation tasks. For this reason, approaches that help users in performing this ...

  4. (PDF) Enhancing data preparation: insights from a time series case study

    Data play a key role in AI systems that support decision-making processes. Data-centric AI highlights the importance of having high-quality input data to obtain reliable results.

  5. Data science case interviews (what to expect & how to prepare)

    Overview of data science case study interviews at companies like Amazon, Google, and Meta (Facebook), as well as how to prepare for them. Includes answer framework, practice questions, and preparation steps.

  6. Data Preparation Steps for Analytics: An In-depth Guide

Machine learning and artificial intelligence (AI) can be used to predict customer churn. The data science case study performed in this article describes a machine learning predictive model trained on fintech and digital data to predict customer attrition in advance.

  7. What is Data Preparation? An In-Depth Guide

Data preparation is often referred to informally as data prep. Alternatively, it's also known as data wrangling. But some practitioners use the latter term in a narrower sense to refer to cleansing, structuring and transforming data, which distinguishes data wrangling from the data preprocessing stage. This comprehensive guide to data preparation further explains what it is, how to do it and ...

  8. Doing Data Science: A Framework and Case Study

    Figure 1. Data science framework. The data science framework starts with the research question, or problem identification, and continues through the following steps: data discovery—inventory, screening, and acquisition; data ingestion and governance; data wrangling—data profiling, data preparation and linkage, and data exploration; fitness-for-use assessment; statistical modeling and ...

  9. Part 5 Case Studies

    Part 5 Case Studies Chapter 25 Case Study 1—Building a Customer Data Mart Chapter 26 Case Study 2—Deriving Customer Segmentation Measures from Transactional Data Chapter 27 Case Study 3—Preparing Data … - Selection from Data Preparation for Analytics Using SAS [Book]

  10. Data in Action: 7 Data Science Case Studies Worth Reading

    Case studies can be an invaluable resource for students to gain an understanding of the data science field. Such studies can provide insights into the real-world application of data science, as well as the necessary skills like programming language skills and statistical models.

  11. Case Interview: The Free Preparation Guide (2024)

    The case interview is a challenging interview format that simulates the job of a management consultant, testing candidates across a wide range of problem-solving dimensions.. McKinsey, BCG and Bain - along with other top consulting firms - use the case interview because it's a statistically proven predictor of how well a candidate will perform in the role.

  12. Case Study Method: A Step-by-Step Guide for Business Researchers

    Although case studies have been discussed extensively in the literature, little has been written about the specific steps one may use to conduct case study research effectively (Gagnon, 2010; Hancock & Algozzine, 2016).Baskarada (2014) also emphasized the need to have a succinct guideline that can be practically followed as it is actually tough to execute a case study well in practice.

  13. Case Interview: all you need to know (and how to prepare)

    1. The key to landing your consulting job. Case interviews - where you are asked to solve a business case study under scrutiny - are the core of the selection process right across McKinsey, Bain and BCG (the "MBB" firms).

  14. Consulting Case Library

    In our Case Library, you will find 200+ case studies for your case interview preparation. Solve challenging problems around market sizing, pricing, or sustainability.

  15. How to write a case study

    Whether pulling from client testimonials or data-driven results, case studies tend to have more impact on new business because the story contains information that is both objective (data) and subjective (customer experience) — and the brand doesn't sound too self-promotional.

  16. Case Interview Prep

    Case interviews help you experience the type of work we do and show off your problem-solving skills. Explore BCG's case interview preparation tools today.

  17. Data Preparation for Regression

In the last post, we started a case study example for regression analysis to help an investment firm make money through property price arbitrage (read part 1: regression case study example). This is an interactive case study example and requires your help to move forward. These are some of your observations from the exploratory analysis that you shared. Read More...

  18. Top 5 Data Preparation Tools In 2024

    Data Solutions 2.0: Embracing the AI-driven Automation Era. Learn More About the Transformative Impact of AI and Automation on Data Management. Watch Webinar

  19. Data Science Interview case study prep tips : r/datascience

    46 votes, 12 comments. true. Some Data Case Study Tips: Clarify assumptions - make sure you understand what the business goal is (way too easy to get lost in open-ended questions talking about technical things)

  20. Data Science Case Study Interview Prep

    The data science case study interview is usually the last step in a long and arduous process. This may be at a consulting firm that offers its consulting services to different companies looking for business guidance.

  21. Data Prep for Data Science in Minutes—A Real World Use Case Study of

    Data exploration is an iterative and discovery oriented process. Data scientists spend an inordinate amount of time shaping and feature engineering their dat...

  22. Amazon Prime Day 2024: Results & Data Analysis for Sellers

Shoppers spent more than $14 billion during Prime Day 2024, an 11% increase from 2023. Get the statistics and key insights from this year's big event.

  23. Applied Sciences

    City-wide near-miss data can be beneficial for traffic safety estimation. In this study, we evaluate urban traffic safety and examine spatial patterns by incorporating city-wide near-miss data (59,277 near-misses). Our methodology employs a grid-based method, the Empirical Bayes (EB) approach, and spatial analysis tools including global Moran's I and local Moran's I.