Data Collection – Methods, Types and Examples

Data Collection

Definition:

Data collection is the process of gathering information from various sources so that it can be analyzed and used to make informed decisions. It can involve a variety of methods, such as surveys, interviews, experiments, and observation.

In order for data collection to be effective, it is important to have a clear understanding of what data is needed and what the purpose of the data collection is. This can involve identifying the population or sample being studied, determining the variables to be measured, and selecting appropriate methods for collecting and recording data.

Types of Data Collection

Types of Data Collection are as follows:

Primary Data Collection

Primary data collection is the process of gathering original and firsthand information directly from the source or target population. This type of data collection involves collecting data that has not been previously gathered, recorded, or published. Primary data can be collected through various methods such as surveys, interviews, observations, experiments, and focus groups. The data collected is usually specific to the research question or objective and can provide valuable insights that cannot be obtained from secondary data sources. Primary data collection is often used in market research, social research, and scientific research.

Secondary Data Collection

Secondary data collection is the process of gathering information from existing sources that have already been collected and analyzed by someone else, rather than conducting new research to collect primary data. Secondary data can be collected from various sources, such as published reports, books, journals, newspapers, websites, government publications, and other documents.

Qualitative Data Collection

Qualitative data collection is used to gather non-numerical data such as opinions, experiences, perceptions, and feelings, through techniques such as interviews, focus groups, observations, and document analysis. It seeks to understand the deeper meaning and context of a phenomenon or situation and is often used in social sciences, psychology, and humanities. Qualitative data collection methods allow for a more in-depth and holistic exploration of research questions and can provide rich and nuanced insights into human behavior and experiences.

Quantitative Data Collection

Quantitative data collection is used to gather numerical data that can be analyzed using statistical methods. This data is typically collected through surveys, experiments, and other structured data collection methods. Quantitative data collection seeks to quantify and measure variables, such as behaviors, attitudes, and opinions, in a systematic and objective way. This data is often used to test hypotheses, identify patterns, and establish correlations between variables. Quantitative data collection methods allow for precise measurement and generalization of findings to a larger population. It is commonly used in fields such as economics, psychology, and natural sciences.
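To illustrate how quantitative data feed into statistical analysis, here is a minimal sketch in Python; the variable names and values are invented purely for this example.

```python
# Minimal sketch: quantifying the association between two numeric variables.
# The variable names and values below are invented purely for illustration.
from scipy import stats

study_hours = [2, 4, 6, 8, 10, 12]        # hypothetical predictor
exam_scores = [55, 60, 68, 74, 81, 85]    # hypothetical outcome

r, p_value = stats.pearsonr(study_hours, exam_scores)
print(f"Pearson r = {r:.2f}, p = {p_value:.4f}")  # r near +1 indicates a strong positive association
```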

Data Collection Methods

Data Collection Methods are as follows:

Surveys

Surveys involve asking questions to a sample of individuals or organizations to collect data. Surveys can be conducted in person, over the phone, or online.

Interviews

Interviews involve a one-on-one conversation between the interviewer and the respondent. Interviews can be structured or unstructured and can be conducted in person or over the phone.

Focus Groups

Focus groups are group discussions that are moderated by a facilitator. Focus groups are used to collect qualitative data on a specific topic.

Observation

Observation involves watching and recording the behavior of people, objects, or events in their natural setting. Observation can be done overtly or covertly, depending on the research question.

Experiments

Experiments involve manipulating one or more variables and observing the effect on another variable. Experiments are commonly used in scientific research.

Case Studies

Case studies involve in-depth analysis of a single individual, organization, or event. Case studies are used to gain detailed information about a specific phenomenon.

Secondary Data Analysis

Secondary data analysis involves using existing data that was collected for another purpose. Secondary data can come from various sources, such as government agencies, academic institutions, or private companies.

How to Collect Data

The following are some steps to consider when collecting data:

  • Define the objective: Before you start collecting data, you need to define the objective of the study. This will help you determine what data you need to collect and how to collect it.
  • Identify the data sources: Identify the sources of data that will help you achieve your objective. These sources can be primary sources, such as surveys, interviews, and observations, or secondary sources, such as books, articles, and databases.
  • Determine the data collection method: Once you have identified the data sources, you need to determine the data collection method. This could be through online surveys, phone interviews, or face-to-face meetings.
  • Develop a data collection plan: Develop a plan that outlines the steps you will take to collect the data. This plan should include the timeline, the tools and equipment needed, and the personnel involved.
  • Test the data collection process: Before you start collecting data, test the data collection process to ensure that it is effective and efficient.
  • Collect the data: Collect the data according to your data collection plan. Make sure you record the data accurately and consistently.
  • Analyze the data: Once you have collected the data, analyze it to draw conclusions and make recommendations.
  • Report the findings: Report the findings of your data analysis to the relevant stakeholders. This could be in the form of a report, a presentation, or a publication.
  • Monitor and evaluate the data collection process: After the data collection process is complete, monitor and evaluate the process to identify areas for improvement in future data collection efforts.
  • Ensure data quality: Ensure that the collected data is of high quality and free from errors. This can be achieved by validating the data for accuracy, completeness, and consistency (a minimal sketch of such checks appears after this list).
  • Maintain data security: Ensure that the collected data is secure and protected from unauthorized access or disclosure. This can be achieved by implementing data security protocols and using secure storage and transmission methods.
  • Follow ethical considerations: Follow ethical considerations when collecting data, such as obtaining informed consent from participants, protecting their privacy and confidentiality, and ensuring that the research does not cause harm to participants.
  • Use appropriate data analysis methods: Use appropriate data analysis methods based on the type of data collected and the research objectives. This could include statistical analysis, qualitative analysis, or a combination of both.
  • Record and store data properly: Record and store the collected data properly, in a structured and organized format. This will make it easier to retrieve and use the data in future research or analysis.
  • Collaborate with other stakeholders: Collaborate with other stakeholders, such as colleagues, experts, or community members, to ensure that the data collected is relevant and useful for the intended purpose.
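Parts of this workflow can be automated. Below is a minimal sketch in Python of checking collected records for completeness and plausibility before storing them in a structured format; the field names, validation rules, and file name are assumptions made for the example.

```python
# Minimal sketch: validating collected records before storing them.
# The field names, validation rules, and file name are assumptions for the example.
import csv
from datetime import date

REQUIRED_FIELDS = ["respondent_id", "age", "satisfaction"]

def is_valid(record):
    """Basic quality checks: completeness, plausible age, in-range rating."""
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return False                                   # completeness
    if not 0 < int(record["age"]) < 120:
        return False                                   # plausibility
    return record["satisfaction"] in {"1", "2", "3", "4", "5"}  # consistency with the scale

def store(records, path="responses.csv"):
    """Write validated records to a structured CSV file for later analysis."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=REQUIRED_FIELDS + ["collected_on"])
        writer.writeheader()
        for record in (r for r in records if is_valid(r)):
            record["collected_on"] = date.today().isoformat()  # note when the data were captured
            writer.writerow(record)

store([
    {"respondent_id": "001", "age": "34", "satisfaction": "4"},  # kept
    {"respondent_id": "002", "age": "", "satisfaction": "5"},    # dropped: incomplete
])
```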

Applications of Data Collection

Data collection methods are widely used in different fields, including social sciences, healthcare, business, education, and more. Here are some examples of how data collection methods are used in different fields:

  • Social sciences : Social scientists often use surveys, questionnaires, and interviews to collect data from individuals or groups. They may also use observation to collect data on social behaviors and interactions. This data is often used to study topics such as human behavior, attitudes, and beliefs.
  • Healthcare : Data collection methods are used in healthcare to monitor patient health and track treatment outcomes. Electronic health records and medical charts are commonly used to collect data on patients’ medical history, diagnoses, and treatments. Researchers may also use clinical trials and surveys to collect data on the effectiveness of different treatments.
  • Business : Businesses use data collection methods to gather information on consumer behavior, market trends, and competitor activity. They may collect data through customer surveys, sales reports, and market research studies. This data is used to inform business decisions, develop marketing strategies, and improve products and services.
  • Education : In education, data collection methods are used to assess student performance and measure the effectiveness of teaching methods. Standardized tests, quizzes, and exams are commonly used to collect data on student learning outcomes. Teachers may also use classroom observation and student feedback to gather data on teaching effectiveness.
  • Agriculture : Farmers use data collection methods to monitor crop growth and health. Sensors and remote sensing technology can be used to collect data on soil moisture, temperature, and nutrient levels. This data is used to optimize crop yields and minimize waste.
  • Environmental sciences : Environmental scientists use data collection methods to monitor air and water quality, track climate patterns, and measure the impact of human activity on the environment. They may use sensors, satellite imagery, and laboratory analysis to collect data on environmental factors.
  • Transportation : Transportation companies use data collection methods to track vehicle performance, optimize routes, and improve safety. GPS systems, on-board sensors, and other tracking technologies are used to collect data on vehicle speed, fuel consumption, and driver behavior.

Examples of Data Collection

Examples of Data Collection are as follows:

  • Traffic Monitoring: Cities collect real-time data on traffic patterns and congestion through sensors on roads and cameras at intersections. This information can be used to optimize traffic flow and improve safety.
  • Social Media Monitoring : Companies can collect real-time data on social media platforms such as Twitter and Facebook to monitor their brand reputation, track customer sentiment, and respond to customer inquiries and complaints in real-time.
  • Weather Monitoring: Weather agencies collect real-time data on temperature, humidity, air pressure, and precipitation through weather stations and satellites. This information is used to provide accurate weather forecasts and warnings.
  • Stock Market Monitoring : Financial institutions collect real-time data on stock prices, trading volumes, and other market indicators to make informed investment decisions and respond to market fluctuations in real-time.
  • Health Monitoring : Medical devices such as wearable fitness trackers and smartwatches can collect real-time data on a person’s heart rate, blood pressure, and other vital signs. This information can be used to monitor health conditions and detect early warning signs of health issues.

Purpose of Data Collection

The purpose of data collection can vary depending on the context and goals of the study, but generally, it serves to:

  • Provide information: Data collection provides information about a particular phenomenon or behavior that can be used to better understand it.
  • Measure progress : Data collection can be used to measure the effectiveness of interventions or programs designed to address a particular issue or problem.
  • Support decision-making : Data collection provides decision-makers with evidence-based information that can be used to inform policies, strategies, and actions.
  • Identify trends : Data collection can help identify trends and patterns over time that may indicate changes in behaviors or outcomes.
  • Monitor and evaluate : Data collection can be used to monitor and evaluate the implementation and impact of policies, programs, and initiatives.

When to use Data Collection

Data collection is used when there is a need to gather information or data on a specific topic or phenomenon. It is typically used in research, evaluation, and monitoring and is important for making informed decisions and improving outcomes.

Data collection is particularly useful in the following scenarios:

  • Research : When conducting research, data collection is used to gather information on variables of interest to answer research questions and test hypotheses.
  • Evaluation : Data collection is used in program evaluation to assess the effectiveness of programs or interventions, and to identify areas for improvement.
  • Monitoring : Data collection is used in monitoring to track progress towards achieving goals or targets, and to identify any areas that require attention.
  • Decision-making: Data collection is used to provide decision-makers with information that can be used to inform policies, strategies, and actions.
  • Quality improvement : Data collection is used in quality improvement efforts to identify areas where improvements can be made and to measure progress towards achieving goals.

Characteristics of Data Collection

Data collection can be characterized by several important characteristics that help to ensure the quality and accuracy of the data gathered. These characteristics include:

  • Validity : Validity refers to the accuracy and relevance of the data collected in relation to the research question or objective.
  • Reliability : Reliability refers to the consistency and stability of the data collection process, ensuring that the results obtained are consistent over time and across different contexts.
  • Objectivity : Objectivity refers to the impartiality of the data collection process, ensuring that the data collected is not influenced by the biases or personal opinions of the data collector.
  • Precision : Precision refers to the degree of accuracy and detail in the data collected, ensuring that the data is specific and accurate enough to answer the research question or objective.
  • Timeliness : Timeliness refers to the efficiency and speed with which the data is collected, ensuring that the data is collected in a timely manner to meet the needs of the research or evaluation.
  • Ethical considerations : Ethical considerations refer to the ethical principles that must be followed when collecting data, such as ensuring confidentiality and obtaining informed consent from participants.

Advantages of Data Collection

There are several advantages of data collection that make it an important process in research, evaluation, and monitoring. These advantages include:

  • Better decision-making : Data collection provides decision-makers with evidence-based information that can be used to inform policies, strategies, and actions, leading to better decision-making.
  • Improved understanding: Data collection helps to improve our understanding of a particular phenomenon or behavior by providing empirical evidence that can be analyzed and interpreted.
  • Evaluation of interventions: Data collection is essential in evaluating the effectiveness of interventions or programs designed to address a particular issue or problem.
  • Identifying trends and patterns: Data collection can help identify trends and patterns over time that may indicate changes in behaviors or outcomes.
  • Increased accountability: Data collection increases accountability by providing evidence that can be used to monitor and evaluate the implementation and impact of policies, programs, and initiatives.
  • Validation of theories: Data collection can be used to test hypotheses and validate theories, leading to a better understanding of the phenomenon being studied.
  • Improved quality: Data collection is used in quality improvement efforts to identify areas where improvements can be made and to measure progress towards achieving goals.

Limitations of Data Collection

While data collection has several advantages, it also has some limitations that must be considered. These limitations include:

  • Bias : Data collection can be influenced by the biases and personal opinions of the data collector, which can lead to inaccurate or misleading results.
  • Sampling bias : Data collection may not be representative of the entire population, resulting in sampling bias and inaccurate results.
  • Cost : Data collection can be expensive and time-consuming, particularly for large-scale studies.
  • Limited scope: Data collection is limited to the variables being measured, which may not capture the entire picture or context of the phenomenon being studied.
  • Ethical considerations : Data collection must follow ethical principles to protect the rights and confidentiality of the participants, which can limit the type of data that can be collected.
  • Data quality issues: Data collection may result in data quality issues such as missing or incomplete data, measurement errors, and inconsistencies.
  • Limited generalizability : Data collection may not be generalizable to other contexts or populations, limiting the generalizability of the findings.

Secondary research: definition, methods, & examples.

This ultimate guide to secondary research helps you understand changes in market trends, customers' buying patterns, and your competition using existing data sources.

In situations where you’re not involved in the data gathering process (primary research), you have to rely on existing information and data to arrive at specific research conclusions or outcomes. This approach is known as secondary research.

In this article, we’re going to explain what secondary research is, how it works, and share some examples of it in practice.


What is secondary research?

Secondary research, also known as desk research, is a research method that involves compiling existing data sourced from a variety of channels. This includes internal sources (e.g., in-house research) or, more commonly, external sources (such as government statistics, organizational bodies, and the internet).

Secondary research comes in several formats, such as published datasets, reports, and survey responses, and can also be sourced from websites, libraries, and museums.

The information is usually free — or available at a limited access cost — and gathered using surveys, telephone interviews, observation, face-to-face interviews, and more.

When using secondary research, researchers collect, verify, and analyze the existing data, then incorporate it into their own work to address their research goals.

As well as the above, it can be used to review previous research into an area of interest. Researchers can look for patterns across data spanning several years and identify trends — or use it to verify early hypothesis statements and establish whether it’s worth continuing research into a prospective area.

How to conduct secondary research

There are five key steps to conducting secondary research effectively and efficiently:

1.    Identify and define the research topic

First, understand what you will be researching and define the topic by thinking about the research questions you want to be answered.

Ask yourself: What is the point of conducting this research? Then, ask: What do we want to achieve?

This may indicate an exploratory reason (why something happened) or confirm a hypothesis. The answers may indicate ideas that need primary or secondary research (or a combination) to investigate them.

2.    Find research and existing data sources

If secondary research is needed, think about where you might find the information. This helps you narrow down your secondary sources to those that help you answer your questions. What keywords do you need to use?

Which organizations are closely working on this topic already? Are there any competitors that you need to be aware of?

Create a list of the data sources, information, and people that could help you with your work.

3.    Begin searching and collecting the existing data

Now that you have the list of data sources, start accessing the data and collect the information into an organized system. This may mean you start setting up research journal accounts or making telephone calls to book meetings with third-party research teams to verify the details around data results.

As you search and access information, remember to check the data’s date, the credibility of the source, the relevance of the material to your research topic, and the methodology used by the third-party researchers. Start small and as you gain results, investigate further in the areas that help your research’s aims.
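As a rough illustration of this screening step, the sketch below filters a list of candidate sources by date, credibility, and relevance; the sources, credibility labels, and age threshold are invented for the example.

```python
# Minimal sketch: screening candidate secondary sources before collecting them.
# The sources, credibility labels, and age threshold are invented for the example.
from datetime import date

candidate_sources = [
    {"name": "Government statistics portal", "published": date(2023, 6, 1), "credibility": "high", "relevant": True},
    {"name": "Anonymous blog post",          "published": date(2016, 2, 9), "credibility": "low",  "relevant": True},
    {"name": "Trade body report",            "published": date(2022, 11, 3), "credibility": "high", "relevant": False},
]

def keep(source, max_age_years=5):
    """Keep sources that are recent enough, credible, and on-topic."""
    age_years = (date.today() - source["published"]).days / 365
    return source["relevant"] and source["credibility"] == "high" and age_years <= max_age_years

shortlist = [s for s in candidate_sources if keep(s)]
print([s["name"] for s in shortlist])  # only sources passing all three checks remain
```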

4.    Combine the data and compare the results

When you have your data in one place, you need to understand, filter, order, and combine it intelligently. Data may come in different formats where some data could be unusable, while other information may need to be deleted.

After this, you can start to look at different data sets to see what they tell you. You may find that you need to compare the same datasets over different periods for changes over time or compare different datasets to notice overlaps or trends. Ask yourself: What does this data mean to my research? Does it help or hinder my research?
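To make this step concrete, here is a minimal sketch using pandas that combines two hypothetical dataset releases on a shared key and compares them over time; the regions and population figures are invented.

```python
# Minimal sketch: combining two hypothetical dataset releases and comparing them over time.
# The regions and population figures are invented for the example.
import pandas as pd

stats_2010 = pd.DataFrame({"region": ["North", "South", "East"],
                           "population": [120_000, 95_000, 88_000]})
stats_2020 = pd.DataFrame({"region": ["North", "South", "West"],
                           "population": [131_000, 97_000, 40_000]})

# Combine on the shared key; regions missing from either release drop out here.
combined = stats_2010.merge(stats_2020, on="region", suffixes=("_2010", "_2020"))
combined["pct_change"] = (combined["population_2020"] - combined["population_2010"]) \
                         / combined["population_2010"] * 100
print(combined)  # North and South appear in both releases, with their percentage change
```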

5.    Analyze your data and explore further

In this last stage of the process, look at the information you have and ask yourself if this answers your original questions for your research. Are there any gaps? Do you understand the information you’ve found? If you feel there is more to cover, repeat the steps and delve deeper into the topic so that you can get all the information you need.

If secondary research can’t provide these answers, consider supplementing your results with data gained from primary research. As you explore further, add to your knowledge and update your findings. This will help you present clear, credible information.

Primary vs secondary research

Unlike secondary research, primary research involves creating data first-hand by directly working with interviewees, target users, or a target market. Primary research focuses on the method for carrying out research, asking questions, and collecting data using approaches such as:

  • Interviews (panel, face-to-face or over the phone)
  • Questionnaires or surveys
  • Focus groups

Using these methods, researchers can get in-depth, targeted responses to questions, making results more accurate and specific to their research goals. However, primary research takes time to plan and administer.

Unlike primary research, secondary research uses existing data, which also includes published results from primary research. Researchers summarize the existing research and use the results to support their research goals.

Both primary and secondary research have their places. Primary research can support the findings found through secondary research (and fill knowledge gaps), while secondary research can be a starting point for further primary research. Because of this, these research methods are often combined for optimal research results that are accurate at both the micro and macro level.

Primary research | Secondary research
First-hand research to collect data; may require a lot of time | Collects existing, published data; may require little time
Creates raw data that the researcher owns | The researcher has no control over the data method or ownership
Relevant to the goals of the research | May not be relevant to the goals of the research
The researcher conducts the research and may be subject to researcher bias | The researcher collects results, with no information on what researcher bias exists
Can be expensive to carry out | More affordable due to access to free data

Sources of Secondary Research

There are two types of secondary research sources: internal and external. Internal data refers to in-house data that can be gathered from the researcher’s organization. External data refers to data published outside of and not owned by the researcher’s organization.

Internal data

Internal data is a good first port of call for insights and knowledge, as you may already have relevant information stored in your systems. Because you own this information — and it won’t be available to other researchers — it can give you a competitive edge . Examples of internal data include:

  • Database information on sales history and business goal conversions
  • Information from website applications and mobile site data
  • Customer-generated data on product and service efficiency and use
  • Previous research results or supplemental research areas
  • Previous campaign results

External data

External data is useful when you: 1) need information on a new topic, 2) want to fill in gaps in your knowledge, or 3) want data that breaks down a population or market for trend and pattern analysis. Examples of external data include:

  • Government, non-government agencies, and trade body statistics
  • Company reports and research
  • Competitor research
  • Public library collections
  • Textbooks and research journals
  • Media stories in newspapers
  • Online journals and research sites

Three examples of secondary research methods in action

How and why might you conduct secondary research? Let’s look at a few examples:

1.    Collecting factual information from the internet on a specific topic or market

There are plenty of sites that hold data for people to view and use in their research. For example, Google Scholar, ResearchGate, or Wiley Online Library all provide previous research on a particular topic. Researchers can create free accounts and use the search facilities to look into a topic by keyword, before following the instructions to download or export results for further analysis.

This can be useful for exploring a new market that your organization wants to consider entering. For instance, by viewing the U.S. Census Bureau demographic data for that area, you can see what the demographics of your target audience are, and create compelling marketing campaigns accordingly.

2.    Finding out the views of your target audience on a particular topic

If you’re interested in seeing the historical views on a particular topic, for example, attitudes to women’s rights in the US, you can turn to secondary sources.

Textbooks, news articles, reviews, and journal entries can all provide qualitative reports and interviews covering how people discussed women’s rights. There may be multimedia elements like video or documented posters of propaganda showing biased language usage.

By gathering this information, synthesizing it, and evaluating the language, who created it and when it was shared, you can create a timeline of how a topic was discussed over time.

3.    When you want to know the latest thinking on a topic

Educational institutions, such as schools and colleges, create a lot of research-based reports on younger audiences or their academic specialisms. Dissertations from students can also be submitted to research journals, making them useful places to see the latest insights from a new generation of academics.

Information can be requested — and sometimes academic institutions may want to collaborate and conduct research on your behalf. This can provide key primary data in areas that you want to research, as well as secondary data sources for your research.

Advantages of secondary research

There are several benefits of using secondary research, which we’ve outlined below:

  • Easily and readily available data – There is an abundance of readily accessible data sources that have been pre-collected for use, in person at local libraries and online using the internet. This data is usually sorted by filters or can be exported into spreadsheet format, meaning that little technical expertise is needed to access and use the data.
  • Faster research speeds – Since the data is already published and in the public arena, you don’t need to collect this information through primary research. This can make the research easier to do and faster, as you can get started with the data quickly.
  • Low financial and time costs – Most secondary data sources can be accessed for free or at a small cost to the researcher, so the overall research costs are kept low. In addition, by saving on preliminary research, the time costs for the researcher are kept down as well.
  • Secondary data can drive additional research actions – The insights gained can support future research activities (like conducting a follow-up survey or specifying future detailed research topics) or help add value to these activities.
  • Secondary data can be useful pre-research insights – Secondary source data can provide pre-research insights and information on effects that can help resolve whether research should be conducted. It can also help highlight knowledge gaps, so subsequent research can consider this.
  • Ability to scale up results – Secondary sources can include large datasets (like Census data results across several states) so research results can be scaled up quickly using large secondary data sources.

Disadvantages of secondary research

The disadvantages of secondary research are worth considering in advance of conducting research :

  • Secondary research data can be out of date – Secondary sources can be updated regularly, but if you’re exploring the data between two updates, the data can be out of date. Researchers will need to consider whether the data available provides the right research coverage dates, so that insights are accurate and timely, or if the data needs to be updated. Also, in fast-moving markets, secondary data may expire very quickly.
  • Secondary research needs to be verified and interpreted – Where there’s a lot of data from one source, a researcher needs to review and analyze it. The data may need to be verified against other data sets or your hypotheses for accuracy and to ensure you’re using the right data for your research.
  • The researcher has had no control over the secondary research – As the researcher has not been involved in the secondary research, invalid data can affect the results. It’s therefore vital that the methodology and controls are closely reviewed so that the data is collected in a systematic and error-free way.
  • Secondary research data is not exclusive – As data sets are commonly available, there is no exclusivity and many researchers can use the same data. This can be problematic where researchers want to have exclusive rights over the research results and risk duplication of research in the future.

When do we conduct secondary research?

Now that you know the basics of secondary research, when do researchers normally conduct secondary research?

It’s often used at the beginning of research, when the researcher is trying to understand the current landscape . In addition, if the research area is new to the researcher, it can form crucial background context to help them understand what information exists already. This can plug knowledge gaps, supplement the researcher’s own learning or add to the research.

Secondary research can also be used in conjunction with primary research. Secondary research can become the formative research that helps pinpoint where further primary research is needed to find out specific information. It can also support or verify the findings from primary research.

You can use secondary research where high levels of control aren’t needed by the researcher, but a lot of knowledge on a topic is required from different angles.

Secondary research should not be used in place of primary research as both are very different and are used for various circumstances.

Questions to ask before conducting secondary research

Before you start your secondary research, ask yourself these questions:

  • Is there similar internal data that we have created for a similar area in the past?

If your organization has past research, it’s best to review this work before starting a new project. The older work may provide you with the answers, and give you a starting dataset and context of how your organization approached the research before. However, be mindful that the work is probably out of date and view it with that note in mind. Read through and look for where this helps your research goals or where more work is needed.

  • What am I trying to achieve with this research?

When you have clear goals, and understand what you need to achieve, you can look for the perfect type of secondary or primary research to support the aims. Different secondary research data will provide you with different information – for example, looking at news stories for a breakdown of your market’s buying patterns won’t be as useful as internal or external e-commerce and sales data sources.

  • How credible will my research be?

If you are looking for credibility, you want to consider how accurate the research results will need to be, and if you can sacrifice credibility for speed by using secondary sources to get you started. Bear in mind which sources you choose — low-credibility data sites, like political party websites that are highly biased to favor their own party, would skew your results.

  • What is the date of the secondary research?

When you’re looking to conduct research, you want the results to be as useful as possible, so using data that is 10 years old won’t be as accurate as using data that was created a year ago. Since a lot can change in a few years, note the date of your research and look for the most recent datasets that give an up-to-date picture of results. One caveat to this is using data collected over a long-term period for comparisons with earlier periods, which can tell you about the rate and direction of change.

  • Can the data sources be verified? Does the information you have check out?

If you can’t verify the data by looking at the research methodology, speaking to the original team or cross-checking the facts with other research, it could be hard to be sure that the data is accurate. Think about whether you can use another source, or if it’s worth doing some supplementary primary research to replicate and verify results to help with this issue.



Conducting High-Value Secondary Dataset Analysis: An Introductory Guide and Resources

Alexander k. smith.

1 Division of Geriatrics, Department of Medicine, University of California, San Francisco, 4150 Clement St (181G), 94121 San Francisco, CA USA

2 Veterans Affairs Medical Center, San Francisco, CA USA

John Z. Ayanian

3 Harvard Medical School, Boston, MA USA

4 Department of Health Care Policy, Harvard School of Public Health, Boston, MA USA

5 Division of General Medicine, Brigham and Women’s Hospital, Boston, MA USA

Kenneth E. Covinsky

Bruce E. Landon

6 Division of General Medicine and Primary Care, Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA USA

Ellen P. McCarthy

Christina C. Wee

Michael A. Steinman

Secondary analyses of large datasets provide a mechanism for researchers to address high impact questions that would otherwise be prohibitively expensive and time-consuming to study. This paper presents a guide to assist investigators interested in conducting secondary data analysis, including advice on the process of successful secondary data analysis as well as a brief summary of high-value datasets and online resources for researchers, including the SGIM dataset compendium ( www.sgim.org/go/datasets ). The same basic research principles that apply to primary data analysis apply to secondary data analysis, including the development of a clear and clinically relevant research question, study sample, appropriate measures, and a thoughtful analytic approach. A real-world case description illustrates key steps: (1) define your research topic and question; (2) select a dataset; (3) get to know your dataset; and (4) structure your analysis and presentation of findings in a way that is clinically meaningful. Secondary dataset analysis is a well-established methodology. Secondary analysis is particularly valuable for junior investigators, who have limited time and resources to demonstrate expertise and productivity.

INTRODUCTION

Secondary data analysis is analysis of data that was collected by someone else for another primary purpose. Increasingly, generalist researchers start their careers conducting analyses of existing datasets, and some continue to make this the focus of their career. Using secondary data enables one to conduct studies of high-impact research questions with dramatically less time and resources than required for most studies involving primary data collection. For fellows and junior faculty who need to demonstrate productivity by completing and publishing research in a timely manner, secondary data analysis can be a key foundation to successfully starting a research career. Successful completion demonstrates content and methodological expertise, and may yield useful data for future grants. Despite these attributes, conducting high-quality secondary data research requires a distinct skill set and substantial effort. However, few frameworks are available to guide new investigators as they conduct secondary data analyses.1–3

In this article we describe key principles and skills needed to conduct successful analysis of secondary data and provide a brief description of high-value datasets and online resources. The primary target audience of the article is investigators with an interest but limited prior experience in secondary data analysis, as well as mentors of these investigators, who may find this article a useful reference and teaching tool. While we focus on analysis of large, publicly available datasets, many of the concepts we cover are applicable to secondary analysis of proprietary datasets. Datasets we feature in this manuscript encompass a wide range of measures, and thus can be useful to evaluate not only one disease in isolation, but also its intersection with other clinical, demographic, and psychosocial characteristics of patients.

REASONS TO CONDUCT OR TO AVOID A SECONDARY DATA ANALYSIS

Many worthwhile studies simply cannot be done in a reasonable timeframe and cost with primary data collection. For example, if you wanted to examine racial and ethnic differences in health services utilization over the last 10 years of life, you could enroll a diverse cohort of subjects with chronic illness and wait a decade (or longer) for them to die, or you could find a dataset that includes a diverse sample of decedents. Even for less dramatic examples, primary data collection can be difficult without incurring substantial costs, including time and money—scarce resources for junior researchers in particular. Secondary datasets, in contrast, can provide access to large sample sizes, relevant measures, and longitudinal data, allowing junior investigators to formulate a generalizable answer to a high impact question. For those interested in conducting primary data collection, beginning with a secondary data analysis may provide a “bird’s eye view” of epidemiologic trends that future primary data studies examine in greater detail.

Secondary data analyses, however, have disadvantages that are important to consider. In a study focused on primary data, you can tightly control the desired study population, specify the exact measures that you would like to assess, and examine causal relationships (e.g., through a randomized controlled design). In secondary data analyses, the study population and measures collected are often not exactly what you might have chosen to collect, and the observational nature of most secondary data makes it difficult to assess causality (although some quasi-experimental methods, such as instrumental variable or regression discontinuity analysis, can partially address this issue). While not unique to secondary data analysis, another disadvantage to publicly available datasets is the potential to be “scooped,” meaning that someone else publishes a similar study from the same data set before you do. On the other hand, intentional replication of a study in a different dataset can be important in that it either supports or refutes the generalizability of the original findings. If you do find that someone has published the same study using the same dataset, try to find a unique angle to your study that builds on their findings.

STEPS TO CONDUCTING A SUCCESSFUL SECONDARY DATA ANALYSIS

The same basic research principles that apply to studies using primary data apply to secondary data analysis, including the development of a clear research question, study sample, appropriate measures, and a thoughtful analytic approach. For purposes of secondary data analysis, these principles can be conceived as a series of four key steps, described in Table  1 and the sections below. Table  2 provides a glossary of terms used in secondary analysis including dataset types and common sampling terminology.

Table 1

A Practical Approach to Successful Research with Large Datasets

(1) Define your research topic and question
  • Start with a thorough literature review.
  • Ensure that the research question has clinical or policy relevance and is based on sound a priori reasoning. A good question is what makes a study good, not a large sample size.
  • Be flexible: adapt your question to the strengths and limitations of the potential datasets.

(2) Select a dataset
  • Use a resource such as the Society of General Internal Medicine’s Online Compendium (www.sgim.org/go/datasets; see Table 3).
  • To increase the novelty of your work, consider selecting a dataset that has not been widely used in your field, or link datasets together to gain a fresh perspective.
  • Factor in the complexity of the dataset.
  • Factor in the dataset’s cost and the time needed to acquire the actual dataset.
  • Consider selecting a dataset your mentor has used previously.

(3) Get to know your dataset
  • Learn the answers to the following questions: Why does the database exist? Who reports the data? What are the incentives for accurate reporting? How are the data audited, if at all? Can you link your dataset to other large datasets?
  • Read everything you can about the database.
  • Check whether your measures have been validated against other sources.
  • Get a close feel for the data by analyzing it yourself, or by closely reviewing outputs if someone else is doing the programming.

(4) Structure your analysis and presentation of findings in a way that is clinically meaningful
  • Think carefully about the clinical implications of your findings.
  • Be cautious when interpreting statistical significance (i.e., p-values). Large sample sizes can yield associations that are highly statistically significant but not clinically meaningful.
  • Consult with a statistician for complex datasets and analyses.
  • Think carefully about how you portray the data. A nice figure sometimes tells the story better than rows of data.
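To make the caution about p-values in Table 1 concrete, the following sketch shows how a very large sample can make a clinically trivial difference look highly statistically significant; all numbers are invented for illustration.

```python
# Minimal sketch: a very large sample makes a clinically trivial difference
# "statistically significant". All numbers are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000
group_a = rng.normal(loc=120.0, scale=15.0, size=n)  # e.g., systolic BP in mmHg
group_b = rng.normal(loc=120.3, scale=15.0, size=n)  # mean only 0.3 mmHg higher

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"difference = {group_b.mean() - group_a.mean():.2f} mmHg, p = {p_value:.2g}")
# The p-value is tiny, yet a 0.3 mmHg difference has no clinical importance.
```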

Table 2

Glossary of Terms Used in Secondary Dataset Analysis Research

Types of datasets (not mutually exclusive)

  • Administrative or claims data: Datasets generated from reimbursement claims, such as ICD-9 codes used to bill for clinical encounters, or discharge data such as discharge diagnoses.
  • Longitudinal data: Datasets that measure factors of interest within the same subjects over time.
  • Clinical registries: Datasets generated from registries of specific clinical conditions, such as regional cancer registries used to create the Surveillance Epidemiology and End Results Program (SEER) dataset.
  • Population-based survey: A target population is available and well-defined, and a systematic approach is used to select members of that population to take part in the study. For example, SEER is a population-based survey because it aims to include data on all individuals with cancer cared for in the included regions.
  • Nationally representative survey: A survey sample designed to be representative of the target population on a national level; often uses a complex sampling scheme. The Health and Retirement Study (HRS), for example, is nationally representative of community-dwelling adults over age 50.
  • Panel survey: A longitudinal survey in which data are collected in the same panel of subjects over time. As one panel reaches the middle or end of its participation, a panel of new participants is enrolled. In the Medical Expenditure Panel Survey (MEPS), for example, individuals in the same household are surveyed several times over the course of 2 years.

Statistical sampling terms

  • Clustering: Even simple random samples can be prohibitively expensive for practical reasons such as geographic distance between selected subjects. Identifying subjects within defined clusters, such as geographic regions or subjects treated by the same physicians, reduces cost and improves the feasibility of the study but may decrease the precision of the estimated variance (e.g., wider confidence intervals).
  • Complex survey design: A survey design that is not a simple random selection of subjects. Surveys that incorporate stratification, clustering, and oversampling (with patient weights) are examples of complex designs. Statistical software is available that can account for complex survey designs and is often needed to generate accurate findings.
  • Oversampling: Intentionally sampling a greater proportion of a subgroup, increasing the precision of estimates for that subgroup. For example, in the HRS, African-Americans, Latinos, and residents of Florida are oversampled (see also survey weights).
  • Stratification: The target population is divided into relatively homogeneous groups, and a pre-specified number of subjects is sampled from within each stratum. For example, in the National Ambulatory Medical Care Survey, physicians are divided by specialty within each geographic area targeted for the survey, and a certain number of each type of physician is then identified to participate and provide data about their patients.
  • Survey weights: Weights are used to account for the unequal probability of subject selection due to purposeful over- or under-sampling of certain types of subjects and non-response bias. The survey weight is the inverse probability of being selected. By applying survey weights, the effects of over- and under-sampling of certain types of patients can be corrected such that the data are representative of the entire target population.
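As a small illustration of the survey-weight and oversampling terms above, the sketch below shows how inverse-probability weights correct a simple estimate when one subgroup is oversampled; the respondent outcomes and weights are invented.

```python
# Minimal sketch: survey weights correcting for an oversampled subgroup.
# Respondent outcomes and weights are invented for illustration.
import numpy as np

# Suppose the subgroup of interest is 10% of the population but 50% of the sample.
in_subgroup = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
outcome     = np.array([80, 75, 78, 82, 79, 60, 62, 58, 61, 59])
# Weight = inverse of the (relative) probability of selection for each group.
weights = np.where(in_subgroup == 1, 0.2, 1.8)

print(f"unweighted mean = {outcome.mean():.1f}")                        # reflects the sample mix
print(f"weighted mean   = {np.average(outcome, weights=weights):.1f}")  # reflects the population mix
```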

Define your Research Topic and Question

A fellow in general medicine has a strong interest in studying palliative and end-of-life care. Building on his interest in racial and ethnic disparities, he wants to examine disparities in use of health services at the end of life. He is leaning toward conducting a secondary data analysis and is not sure if he should begin with a more focused research question or a search for a dataset.

Investigators new to secondary data research are frequently challenged by the question “which comes first, the question or the dataset?” In general, we advocate that researchers begin by defining their research topic or question. A good question is essential—an uninteresting study with a huge sample size or extensively validated measures is still uninteresting. The answer to a research question should have implications for patient care or public policy. Imagine the possible findings and ask the dreaded question: "so what?" If possible, select a question that will be interesting regardless of the direction of the findings: positive or negative. Also, determine a target audience who would find your work interesting and useful.

It is often useful to start with a thorough literature review of the question or topic of interest. This effort both avoids duplicating others’ work and develops ways to build upon the literature. Once the question is established, identify datasets that are the best fit, in terms of the patient population, sample size, and measures of the variables of interest (including predictors, outcomes, and potential confounders). Once a candidate dataset has been identified, we recommend being flexible and adapting the research question to the strengths and limitations of the dataset, as long as the question remains interesting and specific and the methods to answer it are scientifically sound. Be creative. Some measures of interest may not have been ascertained directly, but data may be available to construct a suitable proxy. In some cases, you may find a dataset that initially looked promising lacks the necessary data (or data quality) to answer research questions in your area of interest reliably. In that case, you should be prepared to search for an alternative dataset.

A specific research question is essential to good research. However, many researchers have a general area of interest but find it difficult to identify specific research questions without knowing the specific data available. In that case, combing research documentation for unexamined yet interesting measures in your area of interest can be fruitful. Beginning with the dataset and no focused area of interest may lead to data dredging—simply creating cross tabulations of unexplored variables in search of significant associations is bad science. Yet, in our experience, many good studies have resulted from a researcher with a general topic area of interest finding a clinically meaningful yet underutilized measure and having the insight to frame a research question that uses that measure to answer a novel and clinically compelling question (see references for examples). 4 – 8 Dr. Warren Browner once exhorted, “just because you were not smart enough to think of a research question in advance doesn’t mean it’s not important!” [quote used with permission].

Select a Dataset

Case continued.

After a review of available datasets that fit his topic area of interest, the fellow decides to use data from the Surveillance Epidemiology and End Results Program linked to Medicare claims (SEER-Medicare).

The range and intricacy of large datasets can be daunting to a junior researcher. Fortunately, several online compendia are available to guide researchers (Table  3 ), including one recently developed by this manuscript’s authors for the Society of General Internal Medicine (SGIM) ( www.sgim.org/go/datasets ). The SGIM Research Dataset Compendium was developed and is maintained by members of the SGIM research committee. SGIM Compendium developers consulted with experts to identify and profile high-value datasets for generalist researchers. The Compendium includes a description of and links to over 40 high-value datasets used for health services, clinical epidemiology, and medical education research. The SGIM Compendium provides detailed information of use in selecting a dataset, including sample sizes and characteristics, available measures and how data was measured, comments from expert users, links to the dataset, and example publications (see Box for example). A selection of datasets from this Compendium is listed in Table  4 . SGIM members can request a one-time telephone consultation with an expert user of a large dataset (see details on the Compendium website).


Table 3

Online Compendia of Secondary Datasets

  • Society of General Internal Medicine (SGIM) Research Dataset Compendium (www.sgim.org/go/datasets): Designed to assist investigators conducting research on existing datasets, with a particular emphasis on health services research, clinical epidemiology, and research on medical education. Includes information on strengths and weaknesses of datasets and the insights of experienced users about making best use of the data.
  • National Information Center on Health Services Research and Health Care Technology (NICHSR): This group of sites provides links to a wide variety of data tools and statistics, including research datasets, data repositories, health statistics, survey instruments, and more. It is sponsored by the National Library of Medicine.
  • Inter-University Consortium for Political and Social Research (ICPSR): The world’s largest archive of digital social science data, including many datasets with extensive information on health and health care. ICPSR includes many sub-archives on specific topic areas, including minority health, international data, substance abuse, mental health, and more.
  • Partners in Information Access for the Public Health Workforce: Provides links to a variety of national, state, and local health and public health datasets, as well as to sites providing a wide variety of health statistics, information on health information technology and standards, and other resources. Sponsored by a collaboration of US government agencies, public health organizations, and health sciences libraries.
  • Canadian Research Data Centres: Links to datasets available for analysis through Canada’s Research Data Centres (RDC) program.
  • Directory of Health and Human Services Data Resources (US Department of Health and Human Services): Provides brief information and links to almost all datasets from the National Institutes of Health (NIH), Centers for Disease Control and Prevention (CDC), Centers for Medicare and Medicaid Services (CMS), Agency for Healthcare Research and Quality (AHRQ), Food and Drug Administration (FDA), and other agencies of the US Department of Health and Human Services.
  • National Center for Health Statistics (NCHS): Links to a variety of NCHS datasets, several of which are profiled in Table 4. These datasets are available for download at no cost.
  • Medicare Research Data Assistance Center (RESDAC) and Centers for Medicare and Medicaid Services (CMS) Research, Statistics, Data & Systems: These sites link to a variety of datasets from CMS.
  • Veterans Affairs (VA) data: A series of datasets using administrative and computerized clinical data to describe care provided in the VA health care system, including information on outpatient visits, pharmacy data, inpatient data, cost data, and more. With some exceptions, use is generally restricted to researchers with VA affiliations (this can include a co-investigator with a VA affiliation).

Table 4

Examples of High Value Datasets

Datasets are grouped by cost, availability, and complexity; each entry lists the dataset, a description, and sample publications.

Free. Readily available. Population-based survey with cross-sectional design. Does not require special statistical techniques to address complex sampling:

  • Surveillance, Epidemiology and End Results Program (SEER): Population-based multi-regional cancer registry database. SEER data are updated annually. Can be linked to Medicare claims and files (see Medicare below). Sample publications: Trends in breast-conserving surgery among Asian Americans and Pacific Islanders, 1992–2000; Treatment and outcomes of gastric cancer among US-born and foreign-born Asians and Pacific Islanders.

Free. Readily available. Requires statistical considerations to account for complex sampling design and use of survey weights:

  • National Ambulatory Medical Care Survey (NAMCS) & National Hospital Ambulatory Care Survey (NHAMCS): Nationally-representative serial cross-sectional surveys of outpatient and emergency department visits. Can combine survey years to increase sample sizes (e.g., for uncommon conditions) or evaluate temporal trends. Provides national estimates. The NAMCS and NHAMCS are conducted annually. Do not link to other datasets. Sample publications: Preventive health examinations and preventive gynecological examinations in the US; Primary care physician office visits for depression by older Americans.

  • National Health Interview Survey (NHIS): Nationally-representative serial cross-sectional survey of individuals and families including information on health status, injuries, health insurance, and access and utilization. The NHIS is conducted annually. Can combine survey years to look at rare conditions. Can be linked to National Center for Health Statistics Mortality Data; Medicare enrollment and claims data; Social Security Benefit History Data; Medical Expenditure Panel Survey (MEPS) data; and National Immunization Provider Records Check Survey (NIPRCS) data from 1997–1999. Sample publications: Psychological distress in long-term survivors of adult-onset cancer: results from a national survey; Diabetes and Cardiovascular Disease among Asian Indians in the US.

  • Behavioral Risk Factor Surveillance System (BRFSS): Serial cross-sectional nationally-representative survey of health risk behaviors, preventive health practices, and health care access. Provides national and state estimates. Since 2002, the Selected Metropolitan/Micropolitan Area Risk Trends (SMART) project has also used BRFSS data to identify trends in selected metropolitan and micropolitan statistical areas (MMSAs) with 500 or more respondents. BRFSS data are collected monthly. Does not link to other datasets. Sample publications: Perceived discrimination in health care and use of preventive health services; Use of recommended ambulatory care services: is the Veterans Affairs quality gap narrowing?

Free or minimal cost. Readily available. Can do more complex studies by combining data from multiple waves and/or records. Accounting for complex sampling design and use of survey weights can be more complex when using multiple waves—seek support from a statistician. Or can restrict the sample to single waves for ease of use:

  • Nationwide Inpatient Sample (NIS): The largest US database of inpatient hospital stays that incorporates data from all payers, containing data from approximately 20% of US community hospitals. The sampling frame includes approximately 90% of discharges from US hospitals. NIS data are collected annually. For most states, the NIS includes hospital identifiers that permit linkages to the American Hospital Association (AHA) Annual Survey Database and county identifiers that permit linkages to the Area Resource File (ARF). Sample publications: Factors associated with patients who leave acute-care hospitals against medical advice; Impact of hospital volume on racial disparities in cardiovascular procedure mortality.

  • National Health and Nutrition Examination Survey (NHANES): Nationally-representative series of studies combining data from interviews, physical examinations, and laboratory tests. NHANES data are collected annually. Can be linked to National Death Index (NDI) mortality data; Medicare enrollment and claims data; Social Security Benefit History Data; Medical Expenditure Panel Survey (MEPS) data; and Dual Energy X-Ray Absorptiometry (DXA) Multiple Imputation Data Files from 1999–2004. Sample publications: Demographic differences and trends of vitamin D insufficiency in the US population, 1988–2004; Association of hypertension, diabetes, dyslipidemia, and metabolic syndrome with obesity: findings from the National Health and Nutrition Examination Survey, 1999 to 2004.

  • The Health and Retirement Study (HRS): A nationally-representative longitudinal survey of adults older than 50 designed to assess health status, employment decisions, and economic security during retirement. HRS data are collected every 2 years. Can be linked to Social Security Administration data; Internal Revenue Service data; Medicare claims data (see Medicare below); and Minimum Data Set (MDS) data. Sample publications: Chronic conditions and mortality among the oldest old; Advance directives and surrogate decision making before death.

  • Medical Expenditure Panel Survey (MEPS): Serial nationally-representative panel survey of individuals, families, health care providers, and employers covering a variety of topics. MEPS data are collected annually. Can be linked, by request to the Agency for Healthcare Research and Quality, to numerous datasets including the NHIS, Medicare data, and Social Security data. Sample publications: Loss of health insurance among non-elderly adults in Medicaid; Influence of patient-provider communication on colorectal cancer screening.

Data costs are in the thousands to tens of thousands of dollars. Requires an extensive application, and time to acquire data is on the order of months at a minimum. Databases frequently have observations on the order of 100,000 to >1,000,000. Require additional statistical considerations to account for complex sampling design, use of survey weights, or longitudinal analysis. Multiple records per individual. The complex database structure requires a higher degree of analytic and programming skill to create a study dataset efficiently:

  • Medicare claims data (alone), SEER-Medicare, and HRS-Medicare: Claims data on Medicare beneficiaries including demographics and resource utilization in a wide variety of inpatient and outpatient settings. Medicare claims data are collected continually and made available annually. Can be linked to other Medicare datasets that use the same unique identifier numbers for patients, providers, and institutions, for example, the Medicare Current Beneficiary Survey, the Long-Term Care Minimum Data Set, the American Hospital Association Annual Survey, and others. SEER and the HRS offer linkages to Medicare data as well (as described above). Sample publications: Long-term outcomes and costs of ventricular assist devices among Medicare beneficiaries; Association between the Medicare Modernization Act of 2003 and patient wait times and travel distance for chemotherapy.

  • Medicare Current Beneficiary Survey (MCBS): Panel survey of a nationally-representative sample of Medicare beneficiaries including health status, health care use, health insurance, socioeconomic and demographic characteristics, and health expenditures. MCBS data are collected annually. Can be linked to other Medicare data. Sample publications: Cost-related medication nonadherence and spending on basic needs following implementation of Medicare Part D; Medicare beneficiaries and free prescription drug samples: a national survey.

Dataset complexity, cost, and time to acquire the data and obtain institutional review board (IRB) approval are critical considerations for junior researchers, who are new to secondary analysis, have few financial resources, and have limited time to demonstrate productivity. Table 4 illustrates the complexity and cost of large datasets across a range of high value datasets used by generalist researchers. Dataset complexity increases with the number of subjects, the file structure (e.g., single versus multiple records per individual), and the complexity of the survey design. Many publicly available datasets are free, while others can cost tens of thousands of dollars to obtain. Time to acquire the dataset and obtain IRB approval varies. Some datasets can be downloaded from the web, others require multiple layers of permission and security, and in some cases data must be analyzed in a central data processing center. If the project requires linking new data to an existing database, this linkage will add to the time needed to complete the project and will probably require enhanced data security. One advantage of most secondary studies using publicly available datasets is the rapid time to IRB approval. Many publicly available large datasets contain de-identified data and are therefore eligible for expedited review or exempt status. If you can download the dataset from the web, it is probably exempt, but your local IRB must make this determination.

Linking datasets can be a powerful method for examining an issue by providing multiple perspectives of patient experience. Many datasets, including SEER, for example, can be linked to the Area Resource File to examine regional variation in practice patterns. However, linking datasets together increases the complexity and cost of data management. A new researcher might consider first conducting a study only on the initial database, and then conducting their next study using the linked database. For some new investigators, this approach can progressively advance programming skills and build confidence while demonstrating productivity.
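As a rough illustration of what such a linkage involves, the sketch below (Python with pandas) merges a case-level registry extract with a county-level Area Resource File extract on a shared county identifier. The file names and column names are hypothetical, not the actual SEER or ARF field names, so treat this as a pattern rather than a recipe.

    import pandas as pd

    # Hypothetical extracts; real SEER and ARF files use their own field names
    cases = pd.read_csv("registry_extract.csv")   # one row per cancer case
    arf = pd.read_csv("arf_extract.csv")          # one row per county

    # Attach county-level characteristics to each case record
    linked = cases.merge(arf, on="county_fips", how="left", validate="many_to_one")

    # Always check how many cases failed to match before analyzing
    print("Unmatched cases:", linked["physicians_per_capita"].isna().sum())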

Get to Know your Dataset

The fellow’s primary mentor encourages him to closely examine the accuracy of the primary predictor for his study—race and ethnicity—as reported in SEER-Medicare. The fellow has a breakthrough when he finds an entire issue of the journal Medical Care dedicated to SEER-Medicare, including a whole chapter on the accuracy of coding of sociodemographic factors. 9

In an analysis of primary data you select the patients to be studied and choose the study measures. This process gives you a close familiarity with the study subjects, and with how and what data were collected, that is invaluable in assessing the validity of your measures, the potential bias in measuring associations between predictor and outcome variables (internal validity), and the generalizability of your findings to target populations (external validity). The importance of this familiarity with the strengths and weaknesses of the dataset cannot be overemphasized. Secondary data research requires considerable effort to obtain the same level of familiarity with the data. Therefore, knowing your data in detail is critical. Practically, this objective requires scouring online documentation and technical survey manuals, searching PubMed for validation studies, and closely reading previous studies using your dataset, to answer the following types of questions: Who collected the data, and for what purpose? How did subjects get into your dataset? How were they followed? Do your measures capture what you think they capture?

We strongly recommend taking advantage of help offered by the dataset managers, typically described on the dataset’s website. For example, the Research Data Assistance Center (ResDAC) is a dedicated resource for researchers using data from the Centers for Medicare and Medicaid Services (CMS).

Assessing the validity of your measures is one of the central challenges of large dataset research. For large survey datasets, a good first step in assessing the validity of your measures is to read the questions as they were asked in the survey. Some questions simply have face validity. Others, unfortunately, were collected in a way that makes the measure meaningless, problematic, or open to a range of interpretations. These ambiguities can occur in how the question was asked or in how the data were recorded into response categories.

Another essential step is to search the online documentation and published literature for previous validation studies. A PubMed search using the dataset name or measure name/type and the publication type “validation studies” is a good starting point. The key question for a validity study relates to how and why the question was asked and the data were collected (e.g., self-report, chart abstraction, physical measurements, billing claims) in relation to a gold standard. For example, if you are using claims data you should recognize that the primary purpose of those data was not research, but reimbursement. Consequently, claims data are limited by the scope of services that are reimbursable and the accuracy of coding by clinicians completing encounter forms for billing or by coders in the claims departments of hospitals and clinics. Some clinical measures can be assessed by asking subjects if they have the condition of interest, such as a self-reported diagnosis of hypertension. Self-reported data may be adequate for some research questions (e.g., does a diagnosis of hypertension lead people to exercise more?), but inadequate for others (e.g., the prevalence of hypertension among people with diabetes). Even measured data, such as blood pressure, have limitations in that methods of measurement for a study may differ from methods used to diagnose a disorder in the clinician’s office. In the National Health and Nutrition Examination Survey, for example, a subject’s blood pressure is based on the average of several measures in a single visit. This differs from the standard clinical practice of measuring blood pressure at separate office visits before diagnosing hypertension. Rarely do available measures capture exactly what you are trying to study. In our experience, measures in existing datasets are often good enough to answer the research question, with proper interpretation to account for what the measures actually assess and how they differ from the underlying constructs.

Finally, we suggest paying close attention to the completeness of measures, and evaluating whether missing data are random or non-random (the latter might result in bias, whereas the former is generally acceptable). Statistical approaches to missing data are beyond the scope of this paper, and most statisticians can help you address this problem appropriately. However, pay close attention to “skip patterns”; some data are missing simply because the survey item is only asked of a subset for which it applies. For example, in the Health and Retirement Study the question about need for assistance with toileting is only asked of subjects who respond that they have difficulty using the toilet. If you were unaware of this skip pattern and attempted to study assistance with toileting, you would be distressed to find over three-quarters of respondents had missing responses for this question (because they reported no difficulty using the toilet).
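As a concrete illustration, the short sketch below (Python with pandas) shows one way to distinguish a skip pattern from true nonresponse by cross-tabulating the screening item against missingness in the follow-up item. The file name and variable names are hypothetical stand-ins, not actual HRS field names.

    import pandas as pd

    # Hypothetical extract: 'toilet_difficulty' is the screening item;
    # 'toilet_help' is only asked of respondents who report difficulty.
    df = pd.read_csv("hrs_extract.csv")

    # Overall proportion missing for the follow-up item
    print(df["toilet_help"].isna().mean())

    # If nearly all missing values occur where the screener is "no",
    # the "missing" data reflect a skip pattern rather than nonresponse.
    print(pd.crosstab(df["toilet_difficulty"], df["toilet_help"].isna(), dropna=False))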

Fellows and other trainees usually do their own computer programming. Although this may be daunting, we encourage this practice so fellows can get a close feel for the data and become more skilled in statistical analysis. Datasets, however, range in complexity (Table  4 ). In our experience, fellows who have completed introductory training in SAS, Stata, SPSS, or other similar statistical software have been highly successful analyzing datasets of moderate complexity without the ongoing assistance of a statistical programmer. However, if you do have a programmer who will do much of the coding, be closely involved and review all data cleaning and statistical output as if you had programmed it yourself. Close attention can reveal all sorts of patterns, problems, and opportunities with the data that are obscured by focusing only on the final outputs prepared by a statistical programmer. Programmers and statisticians are not clinicians; they will often not recognize when the values of variables or patterns of missingness don’t make sense. If estimates seem implausible or do not match previously published estimates, then the analytic plan, statistical code, and measures should be carefully rechecked.

Keep in mind that “the perfect may be the enemy of the good.” No one expects perfect measures (this is also true for primary data collection). The closer you are to the data, the more you see the warts—don’t be discouraged by this. The measures need to pass the sniff test; in other words, they should have clinical validity, based primarily on the judgment that they make sense clinically or scientifically, but supported where possible by validation procedures, reference to auditing procedures, or other studies that have independently validated the measures of interest.

Structure your Analysis and Presentation of Findings in a Way that Is Clinically Meaningful

Case continued.

The fellow finds that Blacks are less likely to receive chemotherapy in the last 2 weeks of life (Blacks 4%, Whites 6%, p < 0.001). He debates the meaning of this statistically significant 2% absolute difference.

Often, the main challenge for investigators who are new to secondary data analysis is carefully structuring the analysis and presentation of findings in a way that tells a meaningful story. Based on what you’ve found, what is the story that you want your target audience to understand? When appropriate, it can be useful to conduct carefully planned sensitivity analyses to evaluate the robustness of your primary findings. A sensitivity analysis assesses the effect of variation in assumptions on the outcome of interest. For example, if 10% of subjects did not answer a “yes” or “no” question, you could conduct sensitivity analyses to estimate the effects of excluding the missing responses, or of categorizing them as all “yes” or all “no.” Because large datasets may contain multiple measures of interest, co-variates, and outcomes, a frequent temptation is to present huge tables with multiple rows and columns. This is a mistake. These tables can be challenging to sort through, and the clinical importance of the story resulting from the analysis can be lost. In our experience, a thoughtful figure often captures the take-home message in a way that is more interpretable and memorable to readers than rows of data tables.
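For instance, a minimal sketch of the bounding approach described above (Python with pandas; the item and the toy data are invented for illustration) might look like this:

    import pandas as pd

    # Hypothetical yes/no item with missing responses
    responses = pd.Series(["yes", "no", None, "yes", "no", "no", None, "yes", "no", "yes"])

    # Complete-case estimate: exclude missing responses
    complete_case = (responses == "yes").sum() / responses.notna().sum()

    # Bounding scenarios: treat all missing as "yes", then all as "no"
    all_yes = (responses.fillna("yes") == "yes").mean()
    all_no = (responses.fillna("no") == "yes").mean()

    print(f"complete case: {complete_case:.2f}")
    print(f"upper bound (missing = yes): {all_yes:.2f}")
    print(f"lower bound (missing = no): {all_no:.2f}")

If the study's conclusions hold across all three estimates, the missing responses are unlikely to be driving the findings.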

You should keep careful track of subjects you decide to exclude from the analysis and why. Editors, reviewers, and readers will want to know this information. The best way to keep track is to construct a flow diagram from the original denominator to the final sample.
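One low-tech way to do this is to record the sample size after every exclusion so the flow diagram can be reproduced directly from the analysis code. The sketch below (Python with pandas) uses hypothetical file and variable names and invented exclusion criteria.

    import pandas as pd

    df = pd.read_csv("analytic_file.csv")        # hypothetical starting cohort
    flow = [("Original denominator", len(df))]

    df = df[df["age"] >= 65]                     # hypothetical exclusion 1
    flow.append(("Age 65 or older", len(df)))

    df = df[df["continuous_enrollment"] == 1]    # hypothetical exclusion 2
    flow.append(("Continuously enrolled", len(df)))

    for step, n in flow:
        print(f"{step}: n = {n}")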

Don’t confuse statistical significance with clinical importance in large datasets. Because of large sample sizes, associations may be statistically significant but not clinically meaningful. Be mindful of what is meaningful from a clinical or policy perspective. One concern that frequently arises at this stage in large database research is the acceptability of “exploratory” analyses, or the practice of examining associations between multiple factors of interest. On the one hand, exploratory analyses risk finding a significant association by chance alone from testing multiple associations (a false-positive result). On the other hand, the critical issue is not a statistical one, but rather whether the question is important. 10 Exploratory analyses are acceptable if done in a thoughtful way that serves an a priori hypothesis, but not if they amount to mere data dredging in search of associations.
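To see how large samples drive significance, the sketch below (Python with statsmodels) tests an invented 4% versus 6% difference in two groups of 20,000; the counts are made up for illustration and are not the study's actual numbers.

    from statsmodels.stats.proportion import proportions_ztest

    # Invented counts: 4% of 20,000 vs 6% of 20,000
    counts = [800, 1200]
    n_obs = [20000, 20000]

    z_stat, p_value = proportions_ztest(counts, n_obs)
    print(f"z = {z_stat:.1f}, p = {p_value:.1e}")  # p is far below 0.001

    # Statistical significance here says nothing about whether a
    # 2-percentage-point absolute difference is clinically meaningful.

Whether that difference matters is a clinical and policy judgment, not a statistical one.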

We recommend consulting with a statistician when using data from a complex survey design (see Table  2 ) or developing a conceptually advanced study design, for example, using longitudinal data, multilevel modeling with clustered data, or survival analysis. The value of input (even if informal) from a statistician or other advisor with substantial methodological expertise cannot be overstated.

CONCLUSIONS

Case conclusion.

Two years after he began the project the fellow completes the analysis and publishes the paper in a peer-reviewed journal. 11

A 2-year timeline from inception to publication is typical for large database research. Academic potential is commonly assessed by the ability to see a study through to publication in a peer-reviewed journal. This timeline allows a fellow who began a secondary analysis at the start of a 2-year training program to search for a job with an article under review or in press.

In conclusion, secondary dataset research has tremendous advantages, including the ability to assess outcomes that would be difficult or impossible to study using primary data collection, such as those involving exceptionally long follow-up times or rare outcomes. For junior investigators, the potential for a shorter time to publication may help secure a job or career development funding. Some of the time “saved” by not collecting data yourself, however, needs to be “spent” becoming familiar with the dataset in intimate detail. Ultimately, the same factors that apply to successful primary data analysis apply to secondary data analysis, including the development of a clear research question, study sample, appropriate measures, and a thoughtful analytic approach.

Contributors

The authors would like to thank Sei Lee, MD, Mark Freidberg, MD, MPP, and J. Michael McWilliams, MD, PhD, for their input on portions of this manuscript.

Grant Support

Dr. Smith is supported by a Research Supplement to Promote Diversity in Health Related Research from the National Institute on Aging (R01AG028481), the National Center for Research Resources UCSF-CTSI (UL1 RR024131), and the National Palliative Care Research Center. Dr. Steinman is supported by the National Institute on Aging and the American Federation for Aging Research (K23 AG030999). An unrestricted grant from the Society of General Internal Medicine (SGIM) supported development of the SGIM Research Dataset Compendium.

Prior Presentations

An earlier version of this work was presented as a workshop at the Annual Meeting of the Society of General Internal Medicine in Minneapolis, MN, April 2010.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Conflict of Interest

None disclosed.

Research-Methodology

Data Collection Methods

Data collection is a process of collecting information from all the relevant sources to find answers to the research problem, test the hypothesis (if you are following deductive approach ) and evaluate the outcomes. Data collection methods can be divided into two categories: secondary methods of data collection and primary methods of data collection.

Secondary Data Collection Methods

Secondary data is a type of data that has already been published in books, newspapers, magazines, journals, online portals and so on. There is an abundance of data available in these sources about your research area in business studies, almost regardless of the nature of the research area. Therefore, applying an appropriate set of criteria to select the secondary data to be used in the study plays an important role in increasing the levels of research validity and reliability.

These criteria include, but are not limited to, the date of publication, the credentials of the author, the reliability of the source, the quality of the discussion, the depth of analysis, and the extent of the text’s contribution to the development of the research area. Secondary data collection is discussed in greater depth in the Literature Review chapter.

Secondary data collection methods offer a range of advantages such as saving time, effort and expense. However, they have a major disadvantage: secondary research does not contribute to the expansion of the literature by producing fresh (new) data.

Primary Data Collection Methods

Primary data is data that has not existed before; it comprises the original findings of your research. Primary data collection and analysis typically require more time and effort than secondary data research. Primary data collection methods can be divided into two groups: quantitative and qualitative.

Quantitative data collection methods are based on mathematical calculations in various formats. Methods of quantitative data collection and analysis include questionnaires with closed-ended questions, correlation and regression methods, and measures such as the mean, mode and median.

Quantitative methods are cheaper to apply and can be applied within a shorter duration of time than qualitative methods. Moreover, due to the high level of standardisation of quantitative methods, it is easy to compare findings.

Qualitative research methods, on the contrary, do not involve numbers or mathematical calculations. Qualitative research is closely associated with words, sounds, feelings, emotions, colours and other non-quantifiable elements.

Qualitative studies aim to ensure a greater depth of understanding, and qualitative data collection methods include interviews, questionnaires with open-ended questions, focus groups, observation, games or role-playing, case studies and so on.

Your choice between quantitative or qualitative methods of data collection depends on the area of your research and the nature of research aims and objectives.

My e-book, The Ultimate Guide to Writing a Dissertation in Business Studies: a step by step assistance, offers practical assistance to complete a dissertation with minimum or no stress. The e-book covers all stages of writing a dissertation, starting from the selection of the research area to submitting the completed version of the work within the deadline.

John Dudovskiy

Data Collection Methods



Data Collection | Definition, Methods & Examples

Published on June 5, 2020 by Pritha Bhandari. Revised on June 21, 2023.

Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem .

While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider:

  • The  aim of the research
  • The type of data that you will collect
  • The methods and procedures you will use to collect, store, and process the data

To collect high-quality data that is relevant to your purposes, follow these four steps.

Table of contents

  • Step 1: Define the aim of your research
  • Step 2: Choose your data collection method
  • Step 3: Plan your data collection procedures
  • Step 4: Collect the data
  • Frequently asked questions about data collection

Before you start the process of data collection, you need to identify exactly what you want to achieve. You can start by writing a problem statement : what is the practical or scientific issue that you want to address and why does it matter?

Next, formulate one or more research questions that precisely define what you want to find out. Depending on your research questions, you might need to collect quantitative or qualitative data :

  • Quantitative data is expressed in numbers and graphs and is analyzed through statistical methods .
  • Qualitative data is expressed in words and analyzed through interpretations and categorizations.

If your aim is to test a hypothesis , measure something precisely, or gain large-scale statistical insights, collect quantitative data. If your aim is to explore ideas, understand experiences, or gain detailed insights into a specific context, collect qualitative data. If you have several aims, you can use a mixed methods approach that collects both types of data.

  • Your first aim is to assess whether there are significant differences in perceptions of managers across different departments and office locations.
  • Your second aim is to gather meaningful feedback from employees to explore new ideas for how managers can improve.



Based on the data you want to collect, decide which method is best suited for your research.

  • Experimental research is primarily a quantitative method.
  • Interviews , focus groups , and ethnographies are qualitative methods.
  • Surveys , observations, archival research and secondary data collection can be quantitative or qualitative methods.

Carefully consider what method you will use to gather data that helps you directly answer your research questions.

Data collection methods
Method | When to use | How to collect data
Experiment | To test a causal relationship. | Manipulate variables and measure their effects on others.
Survey | To understand the general characteristics or opinions of a group of people. | Distribute a list of questions to a sample online, in person, or over the phone.
Interview/focus group | To gain an in-depth understanding of perceptions or opinions on a topic. | Verbally ask participants open-ended questions in individual interviews or focus group discussions.
Observation | To understand something in its natural setting. | Measure or survey a sample without trying to affect them.
Ethnography | To study the culture of a community or organization first-hand. | Join and participate in a community and record your observations and reflections.
Archival research | To understand current or historical events, conditions or practices. | Access manuscripts, documents or records from libraries, depositories or the internet.
Secondary data collection | To analyze data from populations that you can’t access first-hand. | Find existing datasets that have already been collected, from sources such as government agencies or research organizations.

When you know which method(s) you are using, you need to plan exactly how you will implement them. What procedures will you follow to make accurate observations or measurements of the variables you are interested in?

For instance, if you’re conducting surveys or interviews, decide what form the questions will take; if you’re conducting an experiment, make decisions about your experimental design (e.g., determine inclusion and exclusion criteria ).

Operationalization

Sometimes your variables can be measured directly: for example, you can collect data on the average age of employees simply by asking for dates of birth. However, often you’ll be interested in collecting data on more abstract concepts or variables that can’t be directly observed.

Operationalization means turning abstract conceptual ideas into measurable observations. When planning how you will collect data, you need to translate the conceptual definition of what you want to study into the operational definition of what you will actually measure.

  • You ask managers to rate their own leadership skills on 5-point scales assessing the ability to delegate, decisiveness and dependability.
  • You ask their direct employees to provide anonymous feedback on the managers regarding the same topics.

You may need to develop a sampling plan to obtain data systematically. This involves defining a population , the group you want to draw conclusions about, and a sample, the group you will actually collect data from.

Your sampling method will determine how you recruit participants or obtain measurements for your study. To decide on a sampling method you will need to consider factors like the required sample size, accessibility of the sample, and timeframe of the data collection.
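For illustration, here is a minimal sketch of drawing a simple random sample from a sampling frame (Python with pandas; the file name, frame contents, and sample size are hypothetical):

    import pandas as pd

    frame = pd.read_csv("employee_frame.csv")       # hypothetical population frame
    sample = frame.sample(n=200, random_state=42)   # fixed seed for reproducibility
    sample.to_csv("survey_sample.csv", index=False)
    print(f"Population: {len(frame)}, sample: {len(sample)}")

Stratified or cluster designs would require grouping the frame (for example, by department or location) before sampling.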

Standardizing procedures

If multiple researchers are involved, write a detailed manual to standardize data collection procedures in your study.

This means laying out specific step-by-step instructions so that everyone in your research team collects data in a consistent way – for example, by conducting experiments under the same conditions and using objective criteria to record and categorize observations. This helps you avoid common research biases like omitted variable bias or information bias .

This helps ensure the reliability of your data, and you can also use it to replicate the study in the future.

Creating a data management plan

Before beginning data collection, you should also decide how you will organize and store your data.

  • If you are collecting data from people, you will likely need to anonymize and safeguard the data to prevent leaks of sensitive information (e.g. names or identity numbers); a minimal sketch of one such safeguard follows this list.
  • If you are collecting data via interviews or pencil-and-paper formats, you will need to perform transcriptions or data entry in systematic ways to minimize distortion.
  • You can prevent loss of data by having an organization system that is routinely backed up.
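As promised above, here is a minimal sketch of one such safeguard (Python; the file and column names are hypothetical): direct identifiers are replaced with salted hash codes before the analysis file is stored.

    import hashlib
    import pandas as pd

    df = pd.read_csv("raw_responses.csv")     # hypothetical raw data with identifiers
    SALT = "project-specific-secret"          # keep this value out of shared code

    # Replace the direct identifier with a pseudonymous code
    df["participant_code"] = df["name"].apply(
        lambda x: hashlib.sha256((SALT + x).encode()).hexdigest()[:12]
    )

    df = df.drop(columns=["name", "email"])   # drop remaining direct identifiers
    df.to_csv("analysis_file.csv", index=False)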

Finally, you can implement your chosen methods to measure or observe the variables you are interested in.

The closed-ended questions ask participants to rate their manager’s leadership skills on scales from 1–5. The data produced is numerical and can be statistically analyzed for averages and patterns.

To ensure that high quality data is recorded in a systematic way, here are some best practices:

  • Record all relevant information as and when you obtain data. For example, note down whether or how lab equipment is recalibrated during an experimental study.
  • Double-check manual data entry for errors.
  • If you collect quantitative data, you can assess the reliability and validity to get an indication of your data quality (see the sketch after this list).
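To make that last point concrete, the sketch below (Python with pandas) computes the average rating for each 1–5 leadership item and one common internal-consistency check, Cronbach's alpha. The file and item names are hypothetical.

    import pandas as pd

    items = ["delegation", "decisiveness", "dependability"]  # hypothetical 1-5 scale items
    df = pd.read_csv("manager_ratings.csv")[items]

    print(df.mean())  # average rating per item

    # Cronbach's alpha as a rough internal-consistency (reliability) check
    k = len(items)
    item_variances = df.var(axis=0, ddof=1).sum()
    total_variance = df.sum(axis=1).var(ddof=1)
    alpha = (k / (k - 1)) * (1 - item_variances / total_variance)
    print(f"Cronbach's alpha: {alpha:.2f}")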


Data collection is the systematic process by which observations or measurements are gathered in research. It is used in many different contexts by academics, governments, businesses, and other organizations.

When conducting research, collecting original data has significant advantages:

  • You can tailor data collection to your specific research aims (e.g. understanding the needs of your consumers or user testing your website)
  • You can control and standardize the process for high reliability and validity (e.g. choosing appropriate measurements and sampling methods )

However, there are also some drawbacks: data collection can be time-consuming, labor-intensive and expensive. In some cases, it’s more efficient to use secondary data that has already been collected by someone else, but the data might be less reliable.

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to systematically measure variables and test hypotheses . Qualitative methods allow you to explore concepts and experiences in more detail.

Reliability and validity are both about how well a method measures something:

  • Reliability refers to the  consistency of a measure (whether the results can be reproduced under the same conditions).
  • Validity   refers to the  accuracy of a measure (whether the results really do represent what they are supposed to measure).

If you are doing experimental research, you also have to consider the internal and external validity of your experiment.

Operationalization means turning abstract conceptual ideas into measurable observations.

For example, the concept of social anxiety isn’t directly observable, but it can be operationally defined in terms of self-rating scores, behavioral avoidance of crowded places, or physical anxiety symptoms in social situations.

Before collecting data , it’s important to consider how you will operationalize the variables that you want to measure.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .


Secondary Data Collection Methods


Data is physical or digital information; information is knowledge and knowledge is power! But to leverage that powerful data and execute a successful strategy, businesses need to first gather the data—simply known as data collection.

Collecting data is more than just searching on Google. Although our society is heavily dependent on data, the importance of collecting it still eludes many. Accurately collecting data is crucial for ensuring quality assurance, maintaining research integrity and making informed business decisions. There are methods, goals, time and money involved. Researchers have to take a data-driven approach to achieve the desired end results. Only after having a clear picture of the objective can a researcher decide whether to use primary or secondary data and where the primary or secondary data can be collected from.

But before we learn about the sources of secondary data in research methodology, we must first understand the meaning of data collection.

What Is Data Collection?

  • What is secondary data collection?
  • Various methods of collecting secondary data
  • How to use sources of secondary data in research methodology
  • Advantages of secondary data collection methods
  • Disadvantages of secondary data collection methods
  • Secondary data collection examples

Data collection is a crucial element of statistical research. The process involves collecting information from available sources to come up with solutions to a problem, evaluate outcomes and predict future trends and possibilities. Researchers start by collecting the most basic data related to the problem and then expand the volume and type of data collected.

There are two methods of data collection—primary data collection methods and secondary data collection methods. Data collection involves identifying data types, their sources and the methods being used. Different collection methods are used across commercial, governmental and research fields, and various sources can be accessed from which primary and secondary data are collected. Whether it’s for academic research or promoting a new product, data collection helps us make better choices and get better results.

In this article, we’ll discuss secondary data collection, the various methods of collecting secondary data, its advantages and disadvantages, secondary data collection examples and sources of secondary data in research methodology.

Secondary data collection refers to gathering information that’s already available. The data was previously collected, has undergone the necessary statistical analysis and isn’t owned by the researcher. This data was usually collected from primary sources and later made available for everyone to access. In other words, secondary data is second-hand information collected by third parties. A researcher may ask others to collect data or obtain it from other sources. Existing data is typically collated and summarized to boost the overall effectiveness of the research.

There are two types of secondary data collection—qualitative secondary data collection and quantitative secondary data collection. Qualitative data deals with intangibles and covers factors such as quality, color, preference or appearance. Quantitative data deals with numbers, statistics and percentages. Although the end goal determines which of the two types a researcher chooses, secondary data collection is mostly concerned with quantitative data.

Let’s look at the common secondary data collection methods:

Collecting Information Available On The Internet 

One of the most popular methods of collecting secondary data is using the internet. Readily available data can be accessed with the click of a button, which makes the internet one of the best places to collect secondary data from. It’s practically free of cost, although some websites may charge money—usually low prices. However, organizations and individuals must look out for inauthentic and untrustworthy sources of information.

Collecting Data Available In Government And Non-Government Agencies 

Government and non-government agencies such as Census bureaus, government printing offices and business development centers store relevant data and valuable information that both individuals and organizations can access.

Accessing Public Libraries 

Public libraries house copies of research, public documents and statistical information. Although services may vary, libraries usually have a vast collection of publications highlighting market statistics, business directories and newsletters. 

Using Data From Educational Institutions

Educational institutions are often overlooked when deciding on a data collection method, even though they conduct more research than any other sector. Universities have a plethora of primary data that can act as vital information for secondary research.

Using Sources Of Commercial Information 

Secondary data collection methods are cost-effective and hence quite popular among businesses and individuals. Small businesses that can’t afford expensive research have to resort to a cheaper method of data collection. They can request and obtain data from anywhere it’s available to identify prospective clients and have a wider reach when promoting products and services.

Here are the steps to conduct research using sources of secondary data collection:

  • Identify the topic of research, make a list of research attributes and define the purpose of research. 
  • Information sources have to be narrowed down and identified to access the most relevant data applicable to the research. 
  • Once the secondary data sources are narrowed down, check and collect all existing data related to the research from similar sources. 
  • After collecting the data, check for duplication before assembling it into a usable format. 
  • Analyze the collected data and check if it answers all questions crucial to meet the objective. 

The most important aspect of secondary research is looking out for any inauthentic source or incorrect data that may hamper the research.

These are the advantages of secondary data collection:

  • Most of the data and information is readily available and there are plenty of sources of secondary data collection.

  • The process is less expensive compared to the primary method. There’s minimum expenditure associated with obtaining data from authentic sources. 
  • Data collected for secondary research can give a fair idea about how effective the primary research was. Businesses can hypothesize and evaluate the cost of primary research. 
  • Re-evaluating data from another person’s point of view can uncover things that may have been overlooked. This may lead to discovering new features or fixing a bug in an app. 
  • Secondary data collection is less time-consuming as the data doesn’t need to be collected from the root. Hence, data collection time is significantly lower than primary methods. 
  • Longitudinal and comparative studies are easier to conduct with secondary data as we don’t have to wait to draw conclusions. For example, to compare the population difference in a country across five years, we can simply compare the present census with that of five years back. 

  • Researchers can look to collect data from both internal and external sources, which prevents relying on any special or specific data collection method.

Let’s discuss the disadvantages of secondary data collection:

  • Data may be readily available but the credibility of sources is under constant scrutiny. Research can break down due to a lack of credible and authentic information
  • Most secondary data sources don’t offer the latest statistics, studies or reports. Accurate data doesn’t necessarily mean updated data
  • As a researcher has no control over the primary source or quality of information, the success of secondary research heavily depends on the quality of the primary research that was conducted 

Primary data collection may often be expensive but the credibility, accuracy and quality of information is seldom questionable. 

Here are some secondary data collection examples:

  • Journals and blogs are popular examples of secondary sources of data collection today. Both are regularly updated, but blogs run the risk of being less authentic than journals, as the latter are backed by periodically updated information and new publications.
  • Newspapers have been at the top of the most reliable and authentic sources of secondary data collection for centuries. Although they mostly cover economic, educational and political information, there is specialized content available with newspapers dedicated to covering topics such as science, environment and sports. 
  • Podcasts are the new-age alternative to radio and are widely becoming a common source of secondary information. Presenters talk to the audience about specific topics or conduct interviews on the show. With the digital media boom, interactive podcasts have become wildly common and popular.

Some other examples of secondary data collection are letters, books, government records and columns.

Secondary data finds use across the fields of business, research and statistics. Researchers may choose secondary data due to financial constraints, availability, research needs or time. Due to various factors, secondary data may sometimes be the only data available. In such cases, collecting authentic and relevant data and coming up with solutions to meet the objective may come down to a manager’s capacity for critical thinking.

Using secondary data has its drawbacks and data collection is concerned with finding solutions. Managers need to go behind the scenes to fully understand the process of problem-solving. Learn to make research foolproof and analyze scenarios error-free with Harappa’s Create New Solutions pathway. Continuously seek, absorb and interpret new information. Lay down insightful questions, look for relevant data and use smart analyses to create working solutions. Strive to get all available information first and then make the best possible decision. Make well-reasoned and clearly articulated arguments that are backed by logic and evidence. 



Data Collection Methods: Types & Examples


Data is a collection of facts, figures, objects, symbols, and events from different sources. Organizations collect data using various methods to make better decisions. Without data, it would be difficult for organizations to make appropriate decisions, so data is collected from different audiences at various times.

For example, an organization must collect data on product demand, customer preferences, and competitors before launching a new product. If data is not collected beforehand, the organization’s newly launched product may fail for many reasons, such as less demand and inability to meet customer needs. 

Although data is a valuable asset for every organization, it does not serve any purpose until it is analyzed or processed to achieve the desired results.

What are Data Collection Methods?

Data collection methods are techniques and procedures for gathering information for research purposes. They can range from simple self-reported surveys to more complex quantitative or qualitative experiments.

Some common data collection methods include surveys , interviews, observations, focus groups, experiments, and secondary data analysis . The data collected through these methods can then be analyzed to support or refute research hypotheses and draw conclusions about the study’s subject matter.

Understanding Data Collection Methods

Data collection methods encompass a variety of techniques and tools for gathering quantitative and qualitative data. These methods are integral to the data collection process and ensure accurate and comprehensive data acquisition. 

Quantitative data collection methods involve systematic approaches, such as numerical data, surveys, polls and statistical analysis, to quantify phenomena and trends. 

Conversely, qualitative data collection methods focus on capturing non-numerical information, such as interviews, focus groups, and observations, to delve deeper into understanding attitudes, behaviors, and motivations. 

Combining quantitative and qualitative data collection techniques can enrich organizations’ datasets and yield comprehensive insights into complex phenomena.

Effective utilization of accurate data collection tools and techniques enhances the accuracy and reliability of collected data, facilitating informed decision-making and strategic planning.


Importance of Data Collection Methods

Data collection methods play a crucial role in the research process as they determine the quality and accuracy of the data collected. Here are some major importance of data collection methods.

  • Quality and Accuracy: The choice of data collection technique directly impacts the quality and accuracy of the data obtained. Properly designed methods help ensure that the data collected is error-free and relevant to the research questions.
  • Relevance, Validity, and Reliability: Effective data collection methods help ensure that the data collected is relevant to the research objectives, valid (measuring what it intends to measure), and reliable (consistent and reproducible).
  • Bias Reduction and Representativeness: Carefully chosen data collection methods can help minimize biases inherent in the research process, such as sampling or response bias. They also aid in achieving a representative sample, enhancing the findings’ generalizability.
  • Informed Decision Making: Accurate and reliable data collected through appropriate methods provide a solid foundation for making informed decisions based on research findings. This is crucial for both academic research and practical applications in various fields.
  • Achievement of Research Objectives: Data collection methods should align with the research objectives to ensure that the collected data effectively addresses the research questions or hypotheses. Properly collected data facilitates the attainment of these objectives.
  • Support for Validity and Reliability: Validity and reliability are essential to credible research. The choice of data collection methods can either enhance or detract from the validity and reliability of research findings. Therefore, selecting appropriate methods is critical for ensuring the credibility of the research.

The importance of data collection methods cannot be overstated, as they play a key role in the research study’s overall success and internal validity .

Types of Data Collection Methods

The choice of data collection method depends on the research question being addressed, the type of data needed, and the resources and time available. Data collection methods can be categorized into primary and secondary methods.

Data Collection Methods

1. Primary Data Collection Methods

Primary data is collected from first-hand experience and has not been used in the past. The data gathered through primary data collection methods is highly accurate and specific to the research’s motive.

Primary data collection methods can be divided into two categories: quantitative and qualitative.

Quantitative Methods:

Quantitative techniques for market research and demand forecasting usually use statistical tools. In these techniques, demand is forecasted based on historical data. These methods of primary data collection are generally used to make long-term forecasts. Statistical analysis methods are highly reliable as subjectivity is minimal.

  • Time Series Analysis: A time series refers to a sequential order of values of a variable, known as a trend, at equal time intervals. Using patterns, an organization can predict the demand for its products and services over a projected time period. 
  • Smoothing Techniques: Smoothing techniques can be used in cases where the time series lacks significant trends. They eliminate random variation from the historical demand, helping identify patterns and demand levels to estimate future demand. The most common methods used in smoothing demand forecasting are the simple moving average and weighted moving average methods (see the sketch after this list). 
  • Barometric Method: Also known as the leading indicators approach, researchers use this method to speculate future trends based on current developments. When past events are considered to predict future events, they act as leading indicators.
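As referenced in the smoothing item above, here is a brief sketch of the two smoothing methods (Python with pandas). The demand series and the weights are invented for illustration.

    import pandas as pd

    demand = pd.Series([120, 132, 128, 140, 135, 150, 145, 160])  # invented monthly demand

    # Simple moving average over a 3-period window
    sma = demand.rolling(window=3).mean()

    # Weighted moving average: the most recent period weighted most heavily
    weights = [0.2, 0.3, 0.5]
    wma = demand.rolling(window=3).apply(lambda x: (x * weights).sum())

    # One-step-ahead forecasts from the latest window
    print(f"SMA forecast: {sma.iloc[-1]:.1f}, WMA forecast: {wma.iloc[-1]:.1f}")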


Qualitative Methods:

Qualitative data collection methods are especially useful when historical data is unavailable or when numbers or mathematical calculations are unnecessary.

Qualitative research is closely associated with words, sounds, feelings, emotions, colors, and non-quantifiable elements. These techniques are based on experience, judgment, intuition, conjecture, emotion, etc.

Quantitative methods do not provide the motive behind participants’ responses, often don’t reach underrepresented populations, and require long periods of time to collect the data. Hence, it is best to combine quantitative methods with qualitative methods.

1. Surveys: Surveys collect data from the target audience and gather insights into their preferences, opinions, choices, and feedback related to the organization’s products and services. Most survey software offers a wide range of question types.

You can also use a ready-made survey template to save time and effort. Online surveys can be customized to match the business’s brand by changing the theme, logo, etc. They can be distributed through several channels, such as email, website, offline app, QR code, social media, etc. 

You can select the channel based on your audience’s type and source. Once the data is collected, survey software can generate various reports and run analytics algorithms to discover hidden insights. 

A survey dashboard can give you statistics related to response rate, completion rate, demographics-based filters, export and sharing options, etc. Integrating survey builders with third-party apps can further streamline online real-time data collection.

Practical business intelligence relies on the synergy between analytics and reporting, where analytics uncovers valuable insights, and reporting communicates these findings to stakeholders.
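As an illustration of the response-rate and completion-rate figures a survey dashboard reports, here is a minimal sketch in plain Python. The invitation records and email addresses are hypothetical; this is not tied to any particular survey tool.

```python
# A minimal sketch of the kind of statistics a survey dashboard reports,
# computed from a hypothetical list of survey invitations. Each record notes
# whether the recipient started and finished the survey.

invitations = [
    {"respondent": "a@example.com", "started": True,  "completed": True},
    {"respondent": "b@example.com", "started": True,  "completed": False},
    {"respondent": "c@example.com", "started": False, "completed": False},
    {"respondent": "d@example.com", "started": True,  "completed": True},
]

sent      = len(invitations)
started   = sum(r["started"] for r in invitations)
completed = sum(r["completed"] for r in invitations)

response_rate   = started / sent        # share of invitees who opened the survey
completion_rate = completed / started   # share of starters who finished it

print(f"Response rate:   {response_rate:.0%}")
print(f"Completion rate: {completion_rate:.0%}")
```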

2. Polls: Polls consist of a single question, usually multiple-choice. They are useful when you need to get a quick pulse of the audience’s sentiments. Because they are short, it is easier to get responses from people.

Like surveys, online polls can be embedded into various platforms. Once the respondents answer the question, they can also be shown how their responses compare to others’.

3. Interviews: In face-to-face interviews, the interviewer asks a series of questions to the interviewee in person and notes down responses. If it is not feasible to meet the person, the interviewer can go for a telephone interview. 

This form of data collection is suitable for only a few respondents. It is too time-consuming and tedious to repeat the same process if there are many participants.

4. Delphi Technique: In the Delphi method, market experts are provided with the estimates and assumptions of other industry experts’ forecasts. Based on this information, experts may reconsider and revise their estimates and assumptions. The consensus of all experts on demand forecasts constitutes the final demand forecast.
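The Delphi process is iterative, and the toy simulation below illustrates the general idea rather than any particular tool: in each round, every expert sees the group median of the previous round’s demand forecasts and revises their own estimate partway toward it, until the estimates converge. The pull factor, tolerance, and initial forecasts are all made up.

```python
# A toy simulation of the Delphi technique: each round, experts revise their
# demand forecasts partway toward the group median until the spread is small.
from statistics import median

def delphi_rounds(estimates, pull=0.5, tolerance=1.0, max_rounds=10):
    for round_no in range(1, max_rounds + 1):
        group_median = median(estimates)
        estimates = [e + pull * (group_median - e) for e in estimates]
        spread = max(estimates) - min(estimates)
        print(f"Round {round_no}: median={group_median:.1f}, spread={spread:.1f}")
        if spread <= tolerance:
            break
    return median(estimates)  # the consensus forecast

# Hypothetical initial demand forecasts (in thousands of units) from five experts.
print("Consensus:", round(delphi_rounds([80, 95, 100, 110, 130]), 1))
```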

5. Focus Groups: Focus groups are a widely used qualitative method. In a focus group, a small group of people, around 8-10 members, discuss the common areas of the research problem. Each individual provides his or her insights on the issue concerned.

A moderator regulates the discussion among the group members. At the end of the discussion, the group reaches a consensus.

6. Questionnaire: A questionnaire is a printed set of open-ended or closed-ended questions that respondents answer based on their knowledge and experience of the issue. A questionnaire can be part of a survey, but a questionnaire’s end goal is not necessarily a survey.

2. Secondary Data Collection Methods

Secondary data is data that has already been collected and used in the past. The researcher can obtain it from data sources both internal and external to the organization.

Internal sources of secondary data:

  • Organization’s health and safety records
  • Mission and vision statements
  • Financial Statements
  • Sales Report
  • CRM Software
  • Executive summaries

External sources of secondary data:

  • Government reports
  • Press releases
  • Business journals

Secondary data collection methods can also involve quantitative and qualitative techniques. Secondary data is easily available, less time-consuming, and less expensive to obtain than primary data. However, the authenticity of the data gathered cannot be verified using these methods.

Regardless of the data collection method of your choice, there must be direct communication with decision-makers so that they understand and commit to acting according to the results.

For this reason, we must pay special attention to the analysis and presentation of the information obtained. Remember that the data must be useful and actionable, and the data collection method you choose has a great deal to do with that.

Steps in the Data Collection Process

The data collection process typically involves several key steps to ensure the accuracy and reliability of the data gathered. These steps provide a structured approach to gathering and analyzing data effectively. Here are the key steps in the data collection process:

  • Define the Objectives: Clearly outline the goals of the data collection. What questions are you trying to answer?
  • Identify Data Sources: Determine where the data will come from. This could include surveys, interviews, existing databases, or observational data.
  • Choose Data Collection Methods: Select the methods best suited to your objectives, such as surveys and questionnaires, interviews (structured or unstructured), focus groups, observations, or document analysis.
  • Develop Data Collection Instruments: Create or adapt tools for collecting data, such as questionnaires or interview guides. Ensure they are valid and reliable.
  • Select a Sample: If you are not collecting data from the entire population, determine how to select your sample. Consider sampling methods like random, stratified, or convenience sampling (a minimal sampling sketch follows this list).
  • Collect Data: Execute your data collection plan, following ethical guidelines and maintaining data integrity.
  • Store Data: Organize and store collected data securely, ensuring it’s easily accessible for analysis while maintaining confidentiality.
  • Analyze Data: After collecting the data, process and analyze it according to your objectives, using appropriate statistical or qualitative methods.
  • Interpret Results: Draw conclusions from your analysis, relating them back to your original objectives and research questions.
  • Report Findings: Present your findings clearly and in an organized way, using visuals and summaries to communicate insights effectively.
  • Evaluate the Process: Reflect on the data collection process. Assess what worked well and what could be improved for future studies.
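For the sampling step above, here is a minimal sketch of stratified random sampling in Python. The departments, head counts, and sample size are hypothetical; the idea is simply that each stratum is represented in proportion to its share of the population.

```python
# A minimal sketch of stratified random sampling. The departments and head
# counts below are hypothetical.
import random

population = (
    [{"dept": "Sales",       "id": i} for i in range(200)] +
    [{"dept": "Engineering", "id": i} for i in range(120)] +
    [{"dept": "Support",     "id": i} for i in range(80)]
)

def stratified_sample(people, strata_key, sample_size, seed=42):
    """Draw a sample whose strata are represented in proportion to the population."""
    random.seed(seed)
    strata = {}
    for person in people:
        strata.setdefault(person[strata_key], []).append(person)
    sample = []
    for group in strata.values():
        share = round(sample_size * len(group) / len(people))
        sample.extend(random.sample(group, share))
    return sample

sample = stratified_sample(population, "dept", sample_size=40)
print({dept: sum(p["dept"] == dept for p in sample) for dept in ("Sales", "Engineering", "Support")})
```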

Recommended Data Collection Tools

Choosing the right data collection tools depends on your specific needs, such as the type of data you’re collecting, the scale of your project, and your budget. Here are some widely used tools across different categories:

Survey Tools

  • QuestionPro: Offers advanced survey features and analytics.
  • SurveyMonkey: User-friendly interface with customizable survey options.
  • Google Forms: Free and easy to use, suitable for simple surveys.

Interview and Focus Group Tools

  • Zoom: Great for virtual interviews and focus group discussions.
  • Microsoft Teams: Offers features for collaboration and recording sessions.

Observation and Field Data Collection

  • Open Data Kit (ODK): This is for mobile data collection in field settings.
  • REDCap: A secure web application for building and managing online surveys.

Mobile Data Collection

  • KoboToolbox: Designed for humanitarian work, useful for field data collection.
  • SurveyCTO: Provides offline data collection capabilities for mobile devices.

Data Analysis Tools

  • Tableau: Powerful data visualization tool to analyze survey results.
  • SPSS: Widely used for statistical analysis in research.

Qualitative Data Analysis

  • NVivo: For analyzing qualitative data like interviews or open-ended survey responses.
  • Dedoose: Useful for mixed-methods research, combining qualitative and quantitative data.

General Data Collection and Management

  • Airtable: Combines spreadsheet and database functionalities for organizing data.
  • Microsoft Excel: A versatile tool for data entry, analysis, and visualization.

If you are interested in purchasing, we invite you to visit our article, where we dive deeper and analyze the best data collection tools in the industry.

How Can QuestionPro Help to Create Effective Data Collection?

QuestionPro is a comprehensive online survey software platform that can greatly assist in various data collection methods. Here’s how it can help:

  • Survey Creation: QuestionPro offers a user-friendly interface for creating surveys with various question types, including multiple-choice, open-ended, Likert scale, and more. Researchers can customize surveys to fit their specific research needs and objectives.
  • Diverse Distribution Channels: The platform provides multiple channels for distributing surveys, including email, web links, social media, and embedding surveys on websites. This enables researchers to reach a wide audience and collect data efficiently.
  • Panel Management: QuestionPro offers panel management features, allowing researchers to create and manage panels of respondents for targeted data collection. This is particularly useful for longitudinal studies or when targeting specific demographics.
  • Data Analysis Tools: The platform includes robust data analysis tools that enable researchers to analyze survey responses in real time. Researchers can generate customizable reports, visualize data through charts and graphs, and identify trends and patterns within the data.
  • Data Security and Compliance: QuestionPro prioritizes data security and compliance with regulations such as GDPR and HIPAA. The platform offers features such as SSL encryption, data masking, and secure data storage to ensure the confidentiality and integrity of collected data.
  • Mobile Compatibility: With the increasing use of mobile devices, QuestionPro ensures that surveys are mobile-responsive, allowing respondents to participate in surveys conveniently from their smartphones or tablets.
  • Integration Capabilities: QuestionPro integrates with various third-party tools and platforms, including CRMs, email marketing software, and analytics tools. This allows researchers to streamline their data collection processes and incorporate survey data into their existing workflows.
  • Customization and Branding: Researchers can customize surveys with their branding elements, such as logos, colors, and themes, enhancing the professional appearance of surveys and increasing respondent engagement.

The conclusion you obtain from your investigation will set the course of the company’s decision-making, so present your report clearly and list the steps you followed to obtain those results.

Make sure that whoever will take the corresponding actions understands the importance of the information collected and that it gives them the solutions they expect.

QuestionPro offers a comprehensive suite of features and tools that can significantly streamline the data collection process, from survey creation to analysis, while ensuring data security and compliance. Remember that at QuestionPro, we can help you collect data easily and efficiently. Request a demo and learn about all the tools we have for you.

Frequently Asked Questions (FAQs)

Q: What are the most common data collection methods?

A: Common methods include surveys, interviews, observations, focus groups, and experiments.

Q: Why is data collection important?

A: Data collection helps organizations make informed decisions and understand trends, customer preferences, and market demands.

Q: How do quantitative and qualitative methods differ?

A: Quantitative methods focus on numerical data and statistical analysis, while qualitative methods explore non-numerical insights like attitudes and behaviors.

Q: Can quantitative and qualitative methods be combined?

A: Yes, combining methods can provide a more comprehensive understanding of the research topic.

Q: How does technology support data collection?

A: Technology streamlines data collection with tools like online surveys, mobile data gathering, and integrated analytics platforms.


A guide on primary and secondary data-collection methods

Whether you’re collecting data for business or academic research, the first step is to identify the type of data you need to collect and what method you’ll use to do so. In general, there are two data types — primary and secondary — and you can gather both with a variety of effective collection methods.

Primary data refers to original, firsthand information, while secondary data refers to information retrieved from already existing sources. Peter Drow, head of marketing at NCCuttingTools , explains that “original findings are primary data, whereas secondary data refers to information that has already been reported in secondary sources, such as books, newspapers, periodicals, magazines, web portals, etc.”

Both primary and secondary data-collection methods have their pros, cons, and particular use cases. Read on for an explanation of your options and a list of some of the best methods to consider.

Primary data-collection methods

As mentioned above, primary data collection involves gathering original and firsthand source information. Primary data-collection methods help researchers or service providers obtain specific and up-to-date information about their research subjects. These methods involve reaching out to a targeted group of people and sourcing data from them through surveys, interviews, observations, experiments, etc.

You can collect primary data using quantitative or qualitative methods. Let’s take a closer look at the two:

Quantitative data-collection methods involve collecting information that you can analyze numerically. Closed-ended surveys and questionnaires with predefined options are usually the ways researchers collect quantitative information. They can then analyze the results using mathematical calculations such as means, modes, and grouped frequencies. An example is a simple poll. It’s easy to quickly determine or express the number of participants who choose a specific option as a percentage of the whole.
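Following the simple-poll example above, here is a minimal sketch of how such closed-ended responses can be tallied and expressed as percentages of the whole. The responses are hypothetical.

```python
# A minimal sketch of analysing a simple poll: count how many respondents chose
# each predefined option and express each count as a percentage of the whole.
from collections import Counter

responses = ["Yes", "No", "Yes", "Yes", "Undecided", "No", "Yes", "Yes"]

counts = Counter(responses)
total = len(responses)

for option, count in counts.most_common():
    print(f"{option}: {count} responses ({count / total:.0%})")
```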

Qualitative data collection involves retrieving nonmathematical data from primary sources. Unlike quantitative data-collection methods where subjects are limited to predefined options, qualitative data-collection methods give subjects a chance to freely express their thoughts about the research topic. As a result, the data researchers collect via these methods is unstructured and often nonquantifiable.

Here’s an important difference between the two: While quantitative methods focus on understanding “what,” “who,” or “how much,” qualitative methods focus on understanding “why” and “how.” For example, quantitative research on parents may show trends that are specific to fathers or mothers, but it may not uncover why those trends exist.

Drow explains that applying quantitative methods is faster and cheaper than applying qualitative methods. “It is simple to compare results because quantitative approaches are highly standardized. In contrast, qualitative research techniques rely on words, sounds, feelings, emotions, colors, and other intangible components.”

Drow emphasizes that the field of your study and the goals and objectives of your research will influence your decision about whether to use quantitative or qualitative methodologies for data collection.

Below are some examples of primary data-collection methods:

1. Questionnaires and surveys

While researchers often use the terms “survey” and “questionnaire” interchangeably, the two mean slightly different things.

A questionnaire refers specifically to the set of questions researchers use to collect information from respondents. It may include closed-ended questions, which means respondents are limited to predefined answers, or open-ended questions, which allow respondents to give their own answers.

A survey includes the entire process of creating questionnaires, collecting responses, and analyzing the results.

Jotform’s free survey maker makes it easy to conduct surveys. Using any of Jotform’s customizable survey templates, you can quickly create a questionnaire and share your survey with respondents using a shareable link. You can also analyze survey results in easy-to-read spreadsheets, charts, and more.

2. Interviews

An interview is a conversation in which one participant asks questions and the other provides answers. Interviews work best for small groups and help you understand the opinions and feelings of respondents.

Interviews may be structured or unstructured. Structured interviews are similar to questionnaires and involve asking predetermined questions with specific multiple-choice answers. Unstructured interviews, on the other hand, give subjects the freedom to provide their own answers. You can conduct interviews in person or via recorded video or audio conferencing.

3. Focus groups 

A focus group is a small group of people who have an informal discussion about a particular topic, product, or idea. The researcher selects participants with similar interests, gives them topics to discuss, and records what they say.

Focus groups can help you better understand the results of a large-group quantitative study. For example, a survey of 1,000 respondents may help you spot trends and patterns, but a focus group of 10 respondents will provide additional context for the results of the large-group survey.

4. Observation

Observation involves watching participants or their interactions with specific products or objects. It’s a great way to collect data from a group when they’re unwilling or unable to participate in interviews — children are a good example.

You can conduct observations covertly or overtly. The former involves discreetly observing people’s behavior without their knowledge. This allows you to see them acting naturally. On the other hand, you have to conduct overt observation openly, and it may cause the subjects to behave unnaturally.

Advantages of primary data-collection methods

  • Accuracy: You collect data firsthand from the target demographic, which leaves less room for error or misreporting.
  • Recency: Sourcing primary data ensures you have the most up-to-date information about the research subject.
  • Control: You have full control over the data-collection process and can make adjustments where necessary to improve the quality of the data you collect.
  • Relevance: You can ask specific questions that are directly relevant to your research.
  • Privacy: You can control access to the research results and maintain the confidentiality of respondents.

Disadvantages of primary data collection

  • Cost: Collecting primary data can be expensive, especially if you’re working with a large group.
  • Labor: Collecting raw data can be labor intensive. When you’re gathering data from large groups, you need more skilled hands. And if you’re researching something arcane or unusual, it might be difficult to find people with the appropriate expertise.
  • Time: Collecting primary data takes time. If you’re conducting surveys, for example, participants have to fill out questionnaires. This could take anywhere from a few days to several months, depending on the size of the study group, how you deliver the survey, and how quickly participants respond. Post-survey activities, such as organizing and cleaning data to make it usable, also add up.

Secondary data-collection methods

Secondary data collection involves retrieving already available data from sources other than the target audience. When working with secondary data, the researcher doesn’t “collect” data; instead, they consult secondary data sources.

Secondary data sources are broadly categorized into published and unpublished data. As the names suggest, published data has been published and released for public or private use, while unpublished data comprises unreleased private information that researchers or individuals have documented.

When choosing public data sources, Drow strongly recommends considering the date of publication, the author’s credentials, the source’s dependability, the text’s level of discussion and depth of analysis, and the impact it has had on the growth of the field of study.

Below are some examples of secondary data sources:

1. Online journals, records, and publications

Data that reputable organizations have collected from research is usually published online. Many of these sources are freely accessible and serve as reliable data sources. But it’s best to search for the latest editions of these publications because dated ones may provide invalid data.

2. Government records and publications

Periodically, government institutions collect data from people. The information can range from population figures to organizational records and other statistical information such as age distribution. You can usually find information like this in government libraries and use it for research purposes.

3. Business and industry records

Industries and trade organizations usually release revenue figures and periodic industry trends in quarterly or biannual publications. These records serve as viable secondary data sources since they’re industry-specific.

Previous business records, such as companies’ sales and revenue figures, can also be useful for research. While some of this information is available to the public, you may have to get permission to access other records.

4. Newspapers

Newspapers often publish data they’ve collected from their own surveys. Due to the volume of resources you’ll have to sift through, some surveys may be relevant to your niche but difficult to find on paper. Luckily, most newspapers are also published online, so looking through their online archives for specific data may be easier.

5. Unpublished sources

These include diaries, letters, reports, records, and figures belonging to private individuals; these sources aren’t in the public domain. Since authoritative bodies haven’t vetted or published the data, it can often be unreliable.

Advantages of secondary data-collection methods

Below are some of the benefits of secondary data-collection methods and their advantages over primary methods.

  • Speed: Secondary data-collection methods are efficient because delayed responses and data documentation don’t factor into the process. Using secondary data, analysts can go straight into data analysis.
  • Low cost: Using secondary data is easier on the budget when compared to primary data collection. Secondary data often allows you to avoid logistics and other survey expenses.
  • Volume: There are thousands of published resources available for data analysis. You can sift through the data that several individual research efforts have produced to find the components that are most relevant to your needs.
  • Ease of use: Secondary data, especially data that organizations and the government have published, is usually clean and organized. This makes it easy to understand and extract.
  • Ease of access: It’s generally easier to source secondary data than primary data. A basic internet search can return relevant information at little or no cost.

Disadvantages of secondary data collection

  • Lack of control: Using secondary data means you have no control over the survey process. Already published data may not include the questions you need answers to. This makes it difficult to find the exact data you need.
  • Lack of specificity: There may not be many available reports for new industries, and government publications often have the same problems. Furthermore, if there’s no available data for the niche your service specializes in, you’ll encounter problems using secondary data.
  • Lack of uniqueness: Using secondary sources may not give you the originality and uniqueness you need from data. For instance, if your service or product hinges on innovation and uses an out-of-the-norm approach to problem-solving, you may be disappointed by the generic nature of the data you collect.
  • Age: Because user preferences change over time, data can evolve. The secondary data you retrieve can become invalid. When this happens, it becomes difficult to source new data without conducting a hands-on survey.

A simplified data-collection process with Jotform

Whether you’re collecting primary or secondary data, Jotform’s collection of templates makes it easier to organize and track your data. You can quickly design survey forms with Jotform’s powerful form builder . You can also create databases that allow you to easily sort, filter, and group your data. Plus, you can import data from existing sources and create stunning visual reports at the click of a button.


Data Collection Methods: A Comprehensive View

  • Written by John Terra
  • Updated on February 21, 2024


Companies that want to be competitive in today’s digital economy enjoy the benefit of countless reams of data available for market research. In fact, thanks to the advent of big data, there’s a veritable tidal wave of information ready to be put to good use, helping businesses make intelligent decisions and thrive.

But before that data can be used, it must be processed. But before it can be processed, it must be collected, and that’s what we’re here for. This article explores the subject of data collection. We will learn about the types of data collection methods and why they are essential.

We will detail primary and secondary data collection methods and discuss data collection procedures. We’ll also share how you can learn practical skills through online data science training.

But first, let’s get the definition out of the way. What is data collection?

What is Data Collection?

Data collection is the act of collecting, measuring and analyzing different kinds of information using a set of validated standard procedures and techniques. The primary objective of data collection procedures is to gather reliable, information-rich data and analyze it to make critical business decisions. Once the desired data is collected, it undergoes a process of data cleaning and processing to make the information actionable and valuable for businesses.

Your choice of data collection method (also called a data-gathering procedure) depends on the research questions you’re working on, the type of data required, and the time and resources available. You can categorize data-gathering procedures into two main methods:

  • Primary data collection. Primary data is collected via first-hand experiences and does not draw on data that has been gathered or published before. The data obtained by primary data collection methods is exceptionally accurate and geared to the research’s motive. They are divided into two categories: quantitative and qualitative. We’ll explore the specifics later.
  • Secondary data collection. Secondary data is the information that’s been used in the past. The researcher can obtain data from internal and external sources, including organizational data.

Let’s take a closer look at specific examples of both data collection methods.

Also Read: Why Use Python for Data Science?

The Specific Types of Data Collection Methods

As mentioned, primary data collection methods are split into quantitative and qualitative. We will examine each method’s data collection tools separately. Then, we will discuss secondary data collection methods.

Quantitative Methods

Quantitative techniques for demand forecasting and market research typically use statistical tools. When using these techniques, historical data is used to forecast demand. These primary data-gathering procedures are most often used to make long-term forecasts. Statistical analysis methods are highly reliable because they carry minimal subjectivity.

  • Barometric Method. Also called the leading indicators approach, data analysts and researchers employ this method to speculate on future trends based on current developments. When current events are used to predict future events, they are considered leading indicators (a rough sketch of this idea follows the list).
  • Smoothing Techniques. Smoothing techniques can be used in cases where the time series lacks significant trends. These techniques eliminate random variation from historical demand and help identify demand levels and patterns to estimate future demand. The most popular methods used in these techniques are the simple moving average and the weighted moving average methods.
  • Time Series Analysis. The term “time series” refers to the sequential order of values in a variable, also known as a trend, at equal time intervals. Using patterns, organizations can predict customer demand for their products and services during the projected time.
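To illustrate the barometric idea referenced above, the following Python snippet checks at which lag a hypothetical leading indicator (building permits) correlates most strongly with product demand. Both series, the lag range, and the variable names are made up for illustration; this is a sketch of the concept, not a production forecasting routine.

```python
# A rough sketch of the barometric (leading indicators) idea: find the lag at
# which an indicator series best correlates with demand, so recent indicator
# movements can be used to anticipate the direction of future demand.
import numpy as np

building_permits = np.array([100, 108, 115, 112, 120, 128, 135, 131, 140, 148])  # indicator
product_demand   = np.array([ 60,  62,  66,  70,  69,  74,  78,  82,  80,  86])  # follows it

def best_lead(indicator, demand, max_lag=4):
    """Return the lag (in periods) at which the indicator correlates most with demand."""
    correlations = {}
    for lag in range(1, max_lag + 1):
        correlations[lag] = np.corrcoef(indicator[:-lag], demand[lag:])[0, 1]
    return max(correlations, key=correlations.get), correlations

lag, correlations = best_lead(building_permits, product_demand)
print(f"Indicator leads demand by ~{lag} period(s); correlations by lag: "
      + ", ".join(f"{k}: {v:.2f}" for k, v in correlations.items()))
```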

Qualitative Methods

Qualitative data collection methods are instrumental when no historical information is available, or numbers and mathematical calculations aren’t required. Qualitative research is closely linked to words, emotions, sounds, feelings, colors, and other non-quantifiable elements. These techniques rely on experience, conjecture, intuition, judgment, emotion, etc. Quantitative methods do not provide motives behind the participants’ responses. Additionally, they often don’t reach underrepresented populations and usually involve long data collection periods. Therefore, you get the best results using quantitative and qualitative methods together.

  • Questionnaires . Questionnaires are a printed set of either open-ended or closed-ended questions. Respondents must answer based on their experience and knowledge of the issue. A questionnaire is a part of a survey, while the questionnaire’s end goal doesn’t necessarily have to be a survey.
  • Surveys. Surveys collect data from target audiences, gathering insights into their opinions, preferences, choices, and feedback on the organization’s goods and services. Most survey software has a wide range of question types, or you can also use a ready-made survey template that saves time and effort. Surveys can be distributed via different channels such as e-mail, offline apps, websites, social media, QR codes, etc.

Once researchers collect the data, survey software generates reports and runs analytics algorithms to uncover hidden insights. Survey dashboards give you statistics relating to completion rates, response rates, filters based on demographics, export and sharing options, etc. Practical business intelligence depends on the synergy between analytics and reporting. Analytics uncovers valuable insights while reporting communicates these findings to the stakeholders.

  • Polls. Polls consist of one or more multiple-choice questions. Marketers can turn to polls when they want to take a quick snapshot of the audience’s sentiments. Since polls tend to be short, getting people to respond is more manageable. Like surveys, online polls can be embedded into various media and platforms. Once the respondents answer the question(s), they can be shown how they stand concerning other people’s responses.
  • Delphi Technique. The name is a callback to the Oracle of Delphi, a priestess at Apollo’s temple in ancient Greece, renowned for her prophecies. In this method, marketing experts are given the forecast estimates and assumptions made by other industry experts. The first batch of experts may then use the information provided by the other experts to revise and reconsider their estimates and assumptions. The total expert consensus on the demand forecasts creates the final demand forecast.
  • Interviews. In this method, interviewers talk to the respondents either face-to-face or by telephone. In the first case, the interviewer asks the interviewee a series of questions in person and notes the responses. The interviewer can opt for a telephone interview if the parties cannot meet in person. This data collection form is practical for use with only a few respondents; repeating the same process with a considerably larger group takes longer.
  • Focus Groups. Focus groups are one of the primary examples of qualitative data in education. In focus groups, small groups of people, usually around 8-10 members, discuss the research problem’s common aspects. Each person provides their insights on the issue, and a moderator regulates the discussion. When the discussion ends, the group reaches a consensus.

Also Read: A Beginner’s Guide to the Data Science Process

Secondary Data Collection Methods

Secondary data is information that has been collected and used in the past. Secondary data collection methods can include quantitative and qualitative techniques. In addition, secondary data is easily available, so it’s less time-consuming and less expensive to obtain than primary data. However, the authenticity of data gathered with secondary data collection tools cannot be verified.

Internal secondary data sources:

  • CRM Software
  • Executive summaries
  • Financial Statements
  • Mission and vision statements
  • Organization’s health and safety records
  • Sales Reports

External secondary data sources:

  • Business journals
  • Government reports
  • Press releases

The Importance of Data Collection Methods

Data collection methods play a critical part in the research process as they determine the quality and accuracy of the collected data. Here’s a sample of some reasons why data collection procedures are so important:

  • They determine the quality and accuracy of collected data
  • They ensure the data and the research findings are valid, relevant and reliable
  • They help reduce bias and increase the sample’s representation
  • They are crucial for making informed decisions and arriving at accurate conclusions
  • They provide accurate data, which facilitates the achievement of research objectives

Also Read: What Is Data Processing? Definition, Examples, Trends

So, What’s the Difference Between Data Collecting and Data Processing?

Data collection is the first step in the data processing process. Data collection involves gathering information (raw data) from various sources such as interviews, surveys, questionnaires, etc. Data processing describes the steps taken to organize, manipulate and transform the collected data into a useful and meaningful resource. This process may include tasks such as cleaning and validating data, analyzing and summarizing data, and creating visualizations or reports.

So, data collection is just one step in the overall data processing chain of events.

Do You Want to Become a Data Scientist?

If this discussion about data collection and the professionals who conduct it has sparked your enthusiasm for a new career, why not check out this online data science program ?

The Glassdoor.com jobs website shows that data scientists in the United States typically make an average yearly salary of $129,127 plus additional bonuses and cash incentives. So, if you’re interested in a new career or are already in the field but want to upskill or refresh your current skill set, sign up for this bootcamp and prepare to tackle the challenges of today’s big data.


Data Collection Methods | Step-by-Step Guide & Examples

Published on 4 May 2022 by Pritha Bhandari .

Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental, or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem .

While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider:

  • The  aim of the research
  • The type of data that you will collect
  • The methods and procedures you will use to collect, store, and process the data

To collect high-quality data that is relevant to your purposes, follow these four steps.

Table of contents

  • Step 1: Define the aim of your research
  • Step 2: Choose your data collection method
  • Step 3: Plan your data collection procedures
  • Step 4: Collect the data
  • Frequently asked questions about data collection

Before you start the process of data collection, you need to identify exactly what you want to achieve. You can start by writing a problem statement : what is the practical or scientific issue that you want to address, and why does it matter?

Next, formulate one or more research questions that precisely define what you want to find out. Depending on your research questions, you might need to collect quantitative or qualitative data :

  • Quantitative data is expressed in numbers and graphs and is analysed through statistical methods .
  • Qualitative data is expressed in words and analysed through interpretations and categorisations.

If your aim is to test a hypothesis , measure something precisely, or gain large-scale statistical insights, collect quantitative data. If your aim is to explore ideas, understand experiences, or gain detailed insights into a specific context, collect qualitative data.

If you have several aims, you can use a mixed methods approach that collects both types of data.

For example, in a study of how employees perceive their managers:

  • Your first aim is to assess whether there are significant differences in perceptions of managers across different departments and office locations.
  • Your second aim is to gather meaningful feedback from employees to explore new ideas for how managers can improve.


Based on the data you want to collect, decide which method is best suited for your research.

  • Experimental research is primarily a quantitative method.
  • Interviews , focus groups , and ethnographies are qualitative methods.
  • Surveys , observations, archival research, and secondary data collection can be quantitative or qualitative methods.

Carefully consider what method you will use to gather data that helps you directly answer your research questions.

Data collection methods

  • Experiment — When to use: to test a causal relationship. How to collect data: manipulate variables and measure their effects on others.
  • Survey — When to use: to understand the general characteristics or opinions of a group of people. How to collect data: distribute a list of questions to a sample online, in person, or over the phone.
  • Interview/focus group — When to use: to gain an in-depth understanding of perceptions or opinions on a topic. How to collect data: verbally ask participants open-ended questions in individual interviews or focus group discussions.
  • Observation — When to use: to understand something in its natural setting. How to collect data: measure or survey a sample without trying to affect them.
  • Ethnography — When to use: to study the culture of a community or organisation first-hand. How to collect data: join and participate in a community and record your observations and reflections.
  • Archival research — When to use: to understand current or historical events, conditions, or practices. How to collect data: access manuscripts, documents, or records from libraries, depositories, or the internet.
  • Secondary data collection — When to use: to analyse data from populations that you can’t access first-hand. How to collect data: find existing datasets that have already been collected, from sources such as government agencies or research organisations.

When you know which method(s) you are using, you need to plan exactly how you will implement them. What procedures will you follow to make accurate observations or measurements of the variables you are interested in?

For instance, if you’re conducting surveys or interviews, decide what form the questions will take; if you’re conducting an experiment, make decisions about your experimental design .

Operationalisation

Sometimes your variables can be measured directly: for example, you can collect data on the average age of employees simply by asking for dates of birth. However, often you’ll be interested in collecting data on more abstract concepts or variables that can’t be directly observed.

Operationalisation means turning abstract conceptual ideas into measurable observations. When planning how you will collect data, you need to translate the conceptual definition of what you want to study into the operational definition of what you will actually measure.

For example, to operationalise the abstract concept of leadership quality:

  • You ask managers to rate their own leadership skills on 5-point scales assessing the ability to delegate, decisiveness, and dependability.
  • You ask their direct employees to provide anonymous feedback on the managers regarding the same topics.

You may need to develop a sampling plan to obtain data systematically. This involves defining a population , the group you want to draw conclusions about, and a sample, the group you will actually collect data from.

Your sampling method will determine how you recruit participants or obtain measurements for your study. To decide on a sampling method you will need to consider factors like the required sample size, accessibility of the sample, and time frame of the data collection.

Standardising procedures

If multiple researchers are involved, write a detailed manual to standardise data collection procedures in your study.

This means laying out specific step-by-step instructions so that everyone in your research team collects data in a consistent way – for example, by conducting experiments under the same conditions and using objective criteria to record and categorise observations.

This helps ensure the reliability of your data, and you can also use it to replicate the study in the future.

Creating a data management plan

Before beginning data collection, you should also decide how you will organise and store your data.

  • If you are collecting data from people, you will likely need to anonymise and safeguard the data to prevent leaks of sensitive information (e.g. names or identity numbers).
  • If you are collecting data via interviews or pencil-and-paper formats, you will need to perform transcriptions or data entry in systematic ways to minimise distortion.
  • You can prevent loss of data by having an organisation system that is routinely backed up.

Finally, you can implement your chosen methods to measure or observe the variables you are interested in.

The closed-ended questions ask participants to rate their manager’s leadership skills on scales from 1 to 5. The data produced is numerical and can be statistically analysed for averages and patterns.
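As a minimal sketch of analysing such 1-to-5 ratings, the snippet below averages each manager’s scores. The managers and scores are hypothetical.

```python
# A minimal sketch of analysing closed-ended ratings: averaging each manager's
# 1-5 leadership scores across respondents.
from statistics import mean

ratings = {
    "Manager A": [4, 5, 3, 4, 4, 5],
    "Manager B": [2, 3, 3, 2, 4, 3],
}

for manager, scores in ratings.items():
    print(f"{manager}: average rating {mean(scores):.2f} from {len(scores)} respondents")
```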

To ensure that high-quality data is recorded in a systematic way, here are some best practices:

  • Record all relevant information as and when you obtain data. For example, note down whether or how lab equipment is recalibrated during an experimental study.
  • Double-check manual data entry for errors.
  • If you collect quantitative data, you can assess the reliability and validity to get an indication of your data quality (one common reliability check is sketched after this list).
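As an example of such a check, here is a minimal sketch of Cronbach’s alpha, a common measure of internal consistency for multi-item scales. The three-item questionnaire scores are hypothetical, and this is only one of several ways to assess reliability.

```python
# A minimal sketch of Cronbach's alpha, computed with NumPy on hypothetical
# 1-5 responses to a three-item questionnaire (rows = respondents, columns = items).
import numpy as np

scores = np.array([
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
])

k = scores.shape[1]                               # number of items
item_variances = scores.var(axis=0, ddof=1)       # variance of each item
total_variance = scores.sum(axis=1).var(ddof=1)   # variance of respondents' total scores

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha: {alpha:.2f}")           # values near 1 suggest high internal consistency
```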

Data collection is the systematic process by which observations or measurements are gathered in research. It is used in many different contexts by academics, governments, businesses, and other organisations.

When conducting research, collecting original data has significant advantages:

  • You can tailor data collection to your specific research aims (e.g., understanding the needs of your consumers or user testing your website).
  • You can control and standardise the process for high reliability and validity (e.g., choosing appropriate measurements and sampling methods ).

However, there are also some drawbacks: data collection can be time-consuming, labour-intensive, and expensive. In some cases, it’s more efficient to use secondary data that has already been collected by someone else, but the data might be less reliable.

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to test a hypothesis by systematically collecting and analysing data, while qualitative methods allow you to explore ideas and experiences in depth.

Reliability and validity are both about how well a method measures something:

  • Reliability refers to the  consistency of a measure (whether the results can be reproduced under the same conditions).
  • Validity   refers to the  accuracy of a measure (whether the results really do represent what they are supposed to measure).

If you are doing experimental research , you also have to consider the internal and external validity of your experiment.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .

Operationalisation means turning abstract conceptual ideas into measurable observations.

For example, the concept of social anxiety isn’t directly observable, but it can be operationally defined in terms of self-rating scores, behavioural avoidance of crowded places, or physical anxiety symptoms in social situations.

Before collecting data , it’s important to consider how you will operationalise the variables that you want to measure.


Data Collection Methods


To analyze and make decisions about a certain business, sales, etc., data will be collected. This collected data helps in drawing conclusions about the performance of a particular business. Thus, data collection is essential for analyzing the performance of a business unit, solving a problem, and making assumptions about specific things when required. Before going into the methods of data collection, let us understand what data collection is and how it helps in various fields.

What is Data Collection?

In Statistics, data collection is a process of gathering information from all the relevant sources to find a solution to the research problem. It helps to evaluate the outcome of the problem. Data collection methods allow a person to arrive at an answer to the relevant question. Most organizations use data collection methods to make assumptions about future probabilities and trends. Once the data is collected, it is necessary to undergo the data organization process.

Data can be classified into two types, namely primary data and secondary data. The primary importance of data collection in any research or business process is that it helps to determine many important things about the company, particularly its performance. So, the data collection process plays an important role in all streams. Depending on the type of data, the data collection method is divided into two categories, namely,

  • Primary Data Collection methods
  • Secondary Data Collection methods

In this article, the different types of data collection methods and their advantages and limitations are explained.

Primary Data Collection Methods

Primary data or raw data is a type of information that is obtained directly from the first-hand source through experiments, surveys or observations. The primary data collection method is further classified into two types. They are

  • Quantitative data collection methods
  • Qualitative data collection methods

Let us discuss the different methods performed to collect the data under these two data collection methods.

Quantitative data collection is based on mathematical calculations in various formats, such as closed-ended questions, correlation and regression methods, and mean, median or mode measures. This method is cheaper than qualitative data collection methods, and it can be applied within a short duration of time.
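As a minimal sketch of the measures named above, the snippet below computes the mean, median, and mode of a hypothetical set of closed-ended (1-to-5) responses using Python’s standard statistics module.

```python
# A minimal sketch of the basic quantitative measures: mean, median, and mode
# of a hypothetical set of closed-ended (1-5) responses.
from statistics import mean, median, mode

responses = [3, 4, 4, 5, 2, 4, 3, 5, 4, 1]

print("Mean:  ", mean(responses))    # arithmetic average
print("Median:", median(responses))  # middle value when sorted
print("Mode:  ", mode(responses))    # most frequent response
```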

Qualitative data collection, by contrast, does not involve any mathematical calculations. This method is closely associated with elements that are not quantifiable. Qualitative data collection methods include interviews, questionnaires, observations, case studies, etc. There are several methods to collect this type of data. They are:

Observation Method

Observation method is used when the study relates to behavioural science. This method is planned systematically. It is subject to many controls and checks. The different types of observations are:

  • Structured and unstructured observation
  • Controlled and uncontrolled observation
  • Participant, non-participant and disguised observation

Interview Method

The interview method collects data through verbal responses. It is carried out in two ways:

  • Personal Interview – In this method, a person known as an interviewer is required to ask questions face to face to the other person. The personal interview can be structured or unstructured, direct investigation, focused conversation, etc.
  • Telephonic Interview – In this method, an interviewer obtains information by contacting people on the telephone to ask the questions or views, verbally.

Questionnaire Method

In this method, a set of questions is mailed to the respondents, who should read, reply, and subsequently return the questionnaire. The questions are printed in a definite order on the form. A good questionnaire should have the following features:

  • Short and simple
  • Should follow a logical sequence
  • Provide adequate space for answers
  • Avoid technical terms
  • Should have good physical appearance such as colour, quality of the paper to attract the attention of the respondent

Schedule Method

This method is similar to the questionnaire method, with a slight difference: enumerators are specially appointed for the purpose of filling in the schedules. The enumerator explains the aims and objects of the investigation and may remove misunderstandings, if any come up. Enumerators should be trained to perform their job with hard work and patience.

Secondary Data Collection Methods

Secondary data is data collected by someone other than the actual user; the information is already available and has been analyzed by someone else. Sources of secondary data include magazines, newspapers, books, journals, etc. Secondary data may be either published or unpublished.

Published data are available in various resources including

  • Government publications
  • Public records
  • Historical and statistical documents
  • Business documents
  • Technical and trade journals

Unpublished data includes

  • Unpublished biographies, etc.


The Essential Guide to Doing Your Research Project

Student resources – Steps in secondary data analysis: stepping your way through effective secondary data analysis.

Determine your research question – As indicated above, this means knowing exactly what you are looking for.

Locating data – Knowing what is out there and whether you can gain access to it. A quick Internet search, possibly with the help of a librarian, will reveal a wealth of options.

Evaluating relevance of the data  – Considering things like the data’s original purpose, when it was collected, population, sampling strategy/sample, data collection protocols, operationalization of concepts, questions asked, and form/shape of the data.

Assessing credibility of the data  – Establishing the credentials of the original researchers, searching for full explication of methods including any problems encountered, determining how consistent the data is with data from other sources, and discovering whether the data has been used in any credible published research.

Analysis –  This will generally involve a range of statistical processes as discussed in Chapter 13.
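As a rough illustration of the locating, evaluating and analysis steps above, the Python sketch below loads a hypothetical secondary dataset and inspects its coverage and completeness before any analysis; the file name and the year_collected column are assumptions, not part of the guide.

```python
# A minimal sketch, assuming a hypothetical CSV file of secondary data.
import pandas as pd

df = pd.read_csv("secondary_dataset.csv")   # hypothetical source located earlier

# Evaluate relevance: what was measured and over what period.
print(df.columns.tolist())
print(df["year_collected"].min(), df["year_collected"].max())   # assumed column

# Assess usability: completeness of the variables you intend to analyze.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share.head(10))

# First analysis step: descriptive statistics before any modelling.
print(df.describe(include="all"))
```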


Improving Clerkship to Enhance Patients’ Quality of care (ICEPACQ): a baseline study

  • Kennedy Pangholi 1 ,
  • Enid Kawala Kagoya 2 ,
  • Allan G Nsubuga 3 ,
  • Irene Atuhairwe 3 ,
  • Prossy Nakattudde 3 ,
  • Brian Agaba 3 ,
  • Bonaventure Ahaisibwe 3 ,
  • Esther Ijangolet 3 ,
  • Eric Otim 3 ,
  • Paul Waako 4 ,
  • Julius Wandabwa 5 ,
  • Milton Musaba 5 ,
  • Antonina Webombesa 6 ,
  • Kenneth Mugabe 6 ,
  • Ashley Nakawuki 7 ,
  • Richard Mugahi 8 ,
  • Faith Nyangoma 1 ,
  • Jesca Atugonza 1 ,
  • Elizabeth Ajalo 1 ,
  • Alice Kalenda 1 ,
  • Ambrose Okibure 1 ,
  • Andrew Kagwa 1 ,
  • Ronald Kibuuka 1 ,
  • Betty Nakawuka 1 ,
  • Francis Okello 2 &
  • Proscovia Auma 2  

BMC Health Services Research, volume 24, Article number: 852 (2024)


Proper and complete clerkships for patients have long been shown to contribute to correct diagnosis and improved patient care. All sections of the clerkship must be carefully and fully completed to guide the diagnosis and the plan of management; moreover, one section guides the next. Failure to perform a complete clerkship has been shown to lead to misdiagnosis, with unpleasant outcomes such as delayed recovery, prolonged inpatient stay, high cost of care and, at worst, death.

The objectives of the study were to determine the gap in clerkship and the impact of incomplete clerkship on the length of hospital stay, and to explore the causes of the gap in the clerkship of patients and the strategies that can be used to improve the clerkship of patients admitted to, treated on and discharged from the gynecological ward of Mbale RRH.

Methodology

This was a mixed methods study involving the collection of secondary data via the review of patients’ files and the collection of qualitative data via key informant interviews. The files of patients who were admitted from August 2022 to December 2022, treated and discharged were reviewed using a data extraction tool. The descriptive statistics of the data were analyzed using STATA version 15, while the qualitative data were analyzed via deductive thematic analysis using Atlas ti version 9.

Data were collected from 612 patient files. For qualitative data, a total of 8 key informant interviews were conducted. Social history had the most participants with no information provided at all (83.5% not recorded), with biodata and vital sign examination (20% not recorded) having the least number. For the patients’ biodata, at least one parameter was recorded in all the patients, with the greatest gap noted in terms of recording the nearest health facility of the patient (91% not recorded). In the history, the greatest gap was noted in the history of current pregnancy (37.5% not provided at all); however, there was also a large gap in the past gynecological history (71% not recorded at all), past medical history (71% not recorded at all), past surgical history (73% not recorded at all) and family history (80% not recorded at all). The physical examination revealed the greatest gap in the abdominal examination (43%), with substantial gaps in the general examination (38.5% not recorded at all) and vaginal examination (40.5% not recorded at all), and the vital sign examination revealed the least gap. There was no patient who received a complete clerkship. There was a significant association between clerkships and the length of hospital stay. The causes of the gap in clerkships were multifactorial and included those related to the hospital, those related to the health worker, those related to the health care system and those related to the patient. The strategies to improve the clerkship of patients also included measures taken by health care workers, measures taken by hospitals and measures taken by the government.

Conclusion and recommendation

There is a gap in the clerkships of patients at the gynecological ward that is recognized by the stakeholders at the ward, with some components of the clerkship being better recorded than others, and no patients who received a complete clerkship. There was a significant association between clerkships and the length of hospital stay.

The following are recommended: the provision of clerkship tools, such as a standardized clerkship guide and equipment for patient examination; continuous education of health workers on clerkship and training on how to use the available tools; the development of SOPs for patient clerkship; the promotion of a clerkship culture; and the supervision of health workers.


Introduction

A complete clerkship is the core upon which a medical diagnosis is made, and this depends on the patient’s medical history, the signs noticed on physical examination, and the results of laboratory investigations [ 1 ]. These sections of the clerkship should be completed carefully and appropriately to obtain a correct diagnosis; moreover, one part guides the next. A complete gynecological clerkship comprises the patient’s biodata, presenting complaint, history of presenting complaint, review of systems, past gynecological history, past obstetric history, past medical history, past surgical history, family history, social history, physical examination, laboratory investigation, diagnosis and management plan [ 2 , 3 ].

History taking, also known as medical interviews, is a brief personal inquiry and interrogation about bodily complaints by the doctor to the patient in addition to personal and social information about the patient [ 4 ]. It is estimated that 70-90% of a medical diagnosis can be determined by history alone [ 5 , 6 ]. Physical examination, in addition to the patient’s history, is equally important because it helps to discover more objective aspects of the disease [ 7 ]. The investigation of the patient should be guided by the findings that have been obtained on history taking and the physical examination [ 1 ].

Failure to establish a good, complete and appropriate clerkship for patients leads to diagnostic uncertainties, which are associated with unfavorable outcomes. Some of the effects of poor clerkship include delayed diagnosis and inappropriate investigations, which lead to unnecessary expenditures on irrelevant tests and drugs, as well as other effects such as delayed recovery, prolonged inpatient stays, high costs of care and, at worst, death [8, 9]. Despite health care workers receiving training in medical school about the relevance of the physical examination, it is poorly practiced and has been replaced with advanced imaging techniques such as ultrasounds, CT scans, and MRIs, which continue to make health care services unaffordable for most populations in developing countries [6]. In a study conducted to determine the prevalence and classification of misdiagnosis among hospitalized patients in five general hospitals in central Uganda, 9.2% of inpatients were misdiagnosed; these misdiagnoses were linked to inadequate medical history and examination, and the most common conditions were the most commonly misdiagnosed [9].

At Mbale RRH, there has been a progressive increase in the number of patients attending the gynecology department, which is expected to have compromised the quality of the clerkships that patients receive at the hospital [10]. However, there is limited information about the quality and completeness of clerkships for patients admitted to and treated at Mbale RRH. The current study therefore aimed to determine the gap in patient clerkships and the possible causes of these gaps and to suggest strategies for improving clerkships.

Methods and materials

Study design

This was a baseline study, part of a quality improvement project aimed at improving the clerkship of patients admitted and treated at Mbale RRH. It was a mixed cross-sectional survey, carried out from August 2022 to December 2022, in which quantitative techniques were used to quantify the gap in clerkship and qualitative methods were then used to explain the reasons for the observed gaps and to suggest strategies to improve clerkship; the two approaches were combined to triangulate the results.

Study setting

The study was carried out in Mbale RRH, at the gynecologic ward. The hospital is in Mbale Municipal Council, 214 km to the east of the capital city of Kampala. It is the main regional referral hospital in the Elgon zone in eastern Uganda, a geographic area that borders the western part of Kenya. The Mbale RRH serves a catchment population of approximately 5 million people from 16 administrative districts. It is the referral hospital for the districts of Busia, Budaka, Kibuku, Kapchorwa, Bukwo, Butaleja, Manafwa, Mbale, Pallisa, Sironko and Tororo. The hospital is situated at an altitude of 1140 m within a range of 980–1800 m above sea level. Over 70% of inhabitants in this area are of Bantu ethnicity, and the great majority are part of rural agrarian communities. The Mbale RRH is a government-run, not-for-profit, charge-free hospital with a 470-bed capacity and four major medical specialties: Obstetrics and Gynecology, Surgery, Internal Medicine, and Pediatrics and Child Health.

Study population, sample size and sampling strategy

We collected the files of patients who were admitted to the gynecology ward at Mbale RRH from August 2022 to December 2022. All the files were selected for review. We also interviewed health workers involved in patient clerkships at the gynecological ward. For qualitative data, participants were recruited until data saturation was reached.

Data collection

We collected both secondary and primary data. Secondary data were collected by reviewing the patients’ files. We identified research assistants who were trained in the data entry process, and the data collection tool on Google Forms was distributed to the devices given to the assistants for data entry. The qualitative data were collected via key informant interviews of the health workers involved in the clerkship of patients, and the interviews were conducted by the investigators. The selection of participants was purposive, as we opted for those who clerk patients. After the participants provided informed consent, the interviews proceeded, with a voice recorder used to capture the data and brief key notes made by the interviewer.

Data collection tool

A data abstraction tool was developed and fed into Google Forms, which were used to collect information about patients’ clerkships from patients’ files. The tool was developed by the investigators based on the requirements of a full clerkship, and it acted as a checklist for the parameters of clerkships that were provided or not provided. The validity of this tool was first determined by using it to collect information from ten patients’ files, which were not included in the study, and the tool was adjusted accordingly. The tool for collecting the qualitative information was an interview guide that was developed by the interviewer and was piloted with two health workers. Then, the guide was adjusted before it was used for data collection.

Variable handling

The dependent variable in the current study was the length of hospital stay, calculated from the date of admission and the date of discharge. There were two outcomes: “prolonged hospital stay” and “not prolonged”. A prolonged hospital stay was defined as a stay longer than the 75th percentile, following a study conducted in Ethiopia [9]; in the current study this corresponded to more than 5 (five) days. The independent variables were the components of the clerkship.
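A minimal sketch of how this variable could be derived is shown below. The column names and dates are hypothetical, and the 75th-percentile cut-off is applied exactly as defined above (more than 5 days in this study).

```python
# A minimal sketch (hypothetical column names and dates) of deriving length of
# stay and the binary "prolonged hospital stay" outcome described above.
import pandas as pd

files = pd.DataFrame({
    "admission_date": ["2022-08-01", "2022-08-03", "2022-09-10"],
    "discharge_date": ["2022-08-04", "2022-08-12", "2022-09-12"],
})
files["admission_date"] = pd.to_datetime(files["admission_date"])
files["discharge_date"] = pd.to_datetime(files["discharge_date"])

files["los_days"] = (files["discharge_date"] - files["admission_date"]).dt.days

# Prolonged stay: above the 75th percentile of length of stay
# (which corresponded to more than 5 days in the study described above).
threshold = files["los_days"].quantile(0.75)
files["prolonged_stay"] = files["los_days"] > threshold
print(files)
```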

Data analysis

Data analysis was performed using STATA version 15. Univariate, bivariate and multivariate analyses were performed. Continuous variables were summarized using measures of central tendency and measures of dispersion, while categorical variables were summarized using frequencies and proportions. Bivariate analysis was performed using chi-square or Fisher’s exact tests, one-way ANOVA and independent t tests, with the level of significance set at a p value of ≤ 0.2. Multivariate analysis was performed using logistic regression, with the level of significance set at a p value of ≤ 0.05.
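The study used STATA, but the same two steps can be sketched in Python; the data below are simulated and the variable names (complete_abdominal_exam, complete_vitals) are hypothetical, so this only illustrates the chi-square screening and the logistic regression with odds ratios described above, not the study's actual analysis.

```python
# A minimal sketch of bivariate screening and multivariate logistic regression,
# on simulated data with hypothetical variable names.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "prolonged_stay": rng.integers(0, 2, 200),
    "complete_abdominal_exam": rng.integers(0, 2, 200),
    "complete_vitals": rng.integers(0, 2, 200),
})

# Bivariate screening with a chi-square test (variables with p <= 0.2 retained).
table = pd.crosstab(df["complete_abdominal_exam"], df["prolonged_stay"])
chi2, p, _, _ = chi2_contingency(table)
print("chi-square p =", p)

# Multivariate logistic regression; exponentiated coefficients are odds ratios,
# judged significant at p <= 0.05.
result = smf.logit("prolonged_stay ~ complete_abdominal_exam + complete_vitals", data=df).fit()
print(np.exp(result.params))
```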

Qualitative data were analyzed using Atlas Ti version 9 via deductive thematic analysis. The audio recordings were transcribed, and the transcripts were then imported into Atlas Ti.

Quantitative results

The files of a total of 612 patients were reviewed.

The gap in the clerkships of patients

Patient biodata.

As shown in Fig. 1 below, at least one parameter under patient biodata was recorded for all the patients. The largest gap was identified in the recording of the nearest health facility of the patient, where 91% of the patients did not have this recorded, and the smallest gap was in the recording of the name and age, where less than 1% had this not recorded.

Figure 1: The gap in patients’ biodata

Presenting complaint, HPC and ROS

As shown in Fig. 2 below, the largest gap here was in recording the history of presenting complaint, which was not recorded in 32% of the participants. The least gap was in the review of systems, where it was not recorded in only 10% of the patients.

Figure 2: Gap in the presenting complaints, HPCs and ROS

As shown in Fig. 3 below, in the past obstetric history, the greatest gap was in recording the gestational age at delivery of each pregnancy (89% not recorded), while the least gap was in recording the number of pregnancies (43% not recorded). In the history of current pregnancy, the greatest gap was in recording whether hematinics were given to the mother (92% not recorded), while the least gap was in recording the date of the first day of the last normal menstrual period (LNMP) (44% not recorded). In the past gynecological history, the largest gap was in recording the history of gynecological procedures (88% not recorded), while the least gap was in the history of abortions (73% not recorded). In the past medical history, the largest gaps were in the history of medication allergies and the history of previous admissions (86% not recorded), and the smallest gap was in the history of chronic illnesses (72% not recorded). In the past surgical history, the largest gap was in the history of trauma (84% not recorded), while the least gap was in the history of blood transfusion (76% not recorded). In the family history, there was a greater gap in the family history of twin pregnancies (86% not recorded) than in the family history of familial illnesses (83% not recorded). In the social history, neither alcohol intake nor smoking was recorded for 84% of the patients.

Figure 3: Gap in history

Physical examination

As shown in Fig. 4 below, the least recorded vital sign was oxygen saturation (SPO2), with 76% of the patients’ SPO2 not being recorded, while blood pressure was the most frequently recorded (21% not recorded). On the general examination, checking for edema had the greatest gap (63% not recorded), while checking for pallor had the least gap (45% not recorded). On abdominal examination, auscultation had the greatest gap (76% not recorded), while inspection of the abdomen had the least gap (56% not recorded). On vaginal examination, the greatest gap was in examining the vaginal os (57% not recorded), while the least gap was in checking for vaginal bleeding (47% not recorded).

Figure 4: Gap in physical examination

Investigations, provisional diagnosis and management plan

As shown in Fig. 5 below, the least commonly performed investigation was the malaria test (76% not performed), while the most commonly performed was the CBC test (41% not performed). A provisional diagnosis was not recorded for 20% of the patients, and a management plan was not provided for approximately 4–5% of the patients.

Figure 5: Gap in the provisional diagnosis and management plan

Summary of the gap in clerkships

As shown in Fig. 6 below, most participants had a social history with no information provided at all, while biodata and vital sign examinations had the least number of participants with no information provided at all. There was no patient who had a complete clerkship.

Figure 6: Summary of the gaps in clerkships

Days of hospitalization

The days of hospitalization were not normally distributed and were positively skewed, with a median of 3 days (interquartile range 2–5). The mean length of hospitalization was 6.2 (±11.1) days. As shown in Fig. 7 below, 20% of the patients had prolonged hospitalization.

Figure 7: Duration of hospitalization

The effect of the clerkship gap on the number of days of hospital stay

As shown in Tables 1 and 2 below, the clerkship components that had a significant association with the days of hospitalization at the bivariate level included the vital sign examination, abdominal examination, history of presenting complaint and treatment plan.

As shown in Table 3 , the only clerkship component that had a significant association with the days of hospitalization at the multivariate level was abdominal examination. People who had partial abdominal examinations were 1.9 times more likely to have prolonged hospital stays than those who had complete abdominal examinations.

Qualitative results

We conducted a total of 8 key informant interviews with the characteristics shown in Table 4 below.

The qualitative results are summarized in Table 5 below.

The quality of clerkships on wards

It was reported that both the quality and completeness of clerkships on the ward are poor.

“…most are not clerking fully the patients, just put in like biodata three items name, age address, then they go on the present complaint, diagnosis then treatment; patient clerkship is missing out some important information…” (KIISAMW 2)

It was, however, noted that the quality of a clerkship depends on several factors, such as who is clerking, how sick the patient is, the number of patients to be seen that particular day and the number of hours a person clerks.

“…so, the quality of clerkship is dependent on who is clerking but also how sick the patient is…” (KIIMO 3)

Which people usually clerk patients on the ward?

The following were identified as those who clerk patients: midwives, medical students, junior house officers, medical officers and specialists.

“…everyone clerks patients here; nurses, midwives, doctors, medical students, specialists, everyone as long as you are a health care provider…” (KIIMO 2)

Causes of the gaps in clerkships

These factors were divided into factors related to health workers, hospital-related factors, health system-related factors and patient-related factors.

Hospital-related factors

The absence of clerkship tools such as a standardized clerkship guide and equipment for the examination of patients, such as blood pressure machines, thermometers, and glucometers, among others, were among the reasons for the poor clerkships of the patients.

“…of course, there are other things like BP machines, thermometers; sometimes you want to examine a patient, but you don’t have those examining tools…” (KIIMO 1)

The tools that were available were plain and played little role in facilitating clerkship. Respondents reported that they end up using small exercise books with limited space and no guidance for easy clerkship.

“…most of our tools have these questions that are open ended and not so direct, so the person who is not so knowledgeable in looking out for certain things may miss out on certain data…” (KIIOG 1)

The reluctance of some health workers to clerk patients fully was also reported to stem from its being the new normal: everyone follows the bandwagon of collecting only limited information from patients because there is no one to follow up or supervise.

“…you know when you go to a place, what you find people doing is what you also end up doing; I think it is because of what people are doing and no one is being held accountable for poor clerkship…” (KIIMO 3)

The absence of specialist doctors in the outpatient department (OPD) forces most patients, even stable ones who could be managed in the OPD, onto the ward, crowding it and making complete clerkships for all patients difficult. Poor triaging of the patients was also noted as one of the causes of poor clerkship, as emergency cases are mixed with stable cases.

“…and this gyn ward is supposed to see emergency gynecological cases, but you find even cases which are supposed to be in the gyn clinic are also here; so, it creates large numbers of people who need services…” (KIIMO 1)

Clerkship being performed by the wrong people was also noted. It was emphasized that only a medical doctor can perform a good clerkship, and that other cadres who clerk patients contribute to poor clerkship on the ward.

Health worker-related factors

A poor attitude of health workers was reported; many health workers were found to consider complete clerkship to be a practice performed by people who do not know what to look for to make a diagnosis.

A lack of knowledge about clerkship was another reported factor. Some health workers were said to forget some components of the clerkship and hence record only what they remember at the time of clerking.

A lack of confidence among some health workers and students, which creates a fear of committing to a diagnosis and a management plan, was reported to hinder them from performing a complete clerkship of the patients.

“…a nurse or a student may clerk, but they don’t know the diagnosis; so, they don’t want to commit themselves to a diagnosis…” (KIIMO 2)

Some health workers reported finding the process of taking notes while clerking tedious; hence, they collected only limited information that they could write within a short period of time.

Health system-related factors

Understaffing of the ward was noted to cause a low health worker-to-patient ratio. This overworked the health workers due to the large numbers of patients to be seen.

“…due to the thin human resource for health, many patients have to be seen by the same health worker, and it becomes difficult for one to clerk adequately; they tend to look out for key things majorly…” (KIIOG 1)

It was noted that in the morning or at the start of a shift, the clerkship can be fair, but as the day progresses, the quality of the clerkship decreases due to exhaustion.

“…you can’t clerk the person you are seeing at 5 pm the same way you clerked the person you saw at 9 am…” (KIIMO 3)

The large numbers of patients were also associated with other factors, such as an inefficient referral system, in which patients who could be managed in lower health facilities are also referred to Mbale RRH. It was also stated that some patients do not understand the referral system, leading to self-referral to the RRH. Other factors that contributed to the poor referral system were limited patient trust, drug stockouts, a limited number of skilled health workers, and limited laboratory facilities in the lower health facilities.

“…so, everyone comes in from wherever they can, even unnecessary referrals from those lower health facilities make the numbers very high…” (KIIMO 1)

Patient-related factors

It was reported that the nature of some cases, for example emergencies, does not allow the health worker to collect all the information from the patient. However, some respondents stated that the emergent nature of a case can contribute to a more complete clerkship, as the person clerking such a case is more likely to call for help and therefore needs enough information on the patient; additionally, they do not want to leave gaps in the care of a critical patient.

“…usually, a more critical patient gets a more elaborate clerkship compared to a more stable one, where we will get something quick…” (KIIMO 3)

The poor financial status of some patients makes them unable to afford the files and books in which clerkship notes are taken.

“…a patient has no money, and they have to buy books where to write, then you start writing on ten pages; does it make sense...” (KIIMO 2)

Strategies to improve patients’ clerkships

These were divided into measures to be taken by the health workers, those to be taken by the hospital leadership and those to be taken by the government.

Measures to be taken by health workers.

Holding each other accountable with respect to clerkship quality and completeness was suggested, including providing feedback from fellow health workers and from the records department.

“…like everyone I think should just be held accountable for their clerkship and give each other feedback…” (KIIMO 3)

It was also suggested that medical students be mentored by senior doctors on the ward on clerkship; the students would clerk the patients and present them to the senior doctors for guidance on the diagnosis and the management plan. This approach was believed to save time for senior doctors who may not have time to collect information from patients, to facilitate the learning of students and, most importantly, to ensure the complete clerkship of patients.

“…students can give us a very good clerkship if supervised well, then we can discuss issues of diagnosis, the investigations to be done and the management…” (KIIMO 1)

Changes in the attitudes of health workers toward clerkship were suggested. A change in attitude was also encouraged for laboratory staff, so that the investigations required to guide diagnosis and management can be performed.

“…our lab has the equipment, but they need to change their attitude toward doing the investigations…” (KIIMO 1)

Measures to be taken by hospital leaders

The provision of tools to be used in clerkship was suggested as one measure. The suggested tools included a standardized clerkship guide and equipment for examining patients, such as blood pressure machines and thermometers, among others. It was also suggested that a printer be used to print the clerkship guide to ensure the sustainability and availability of the tools, and that electronic clerkship be provided to reduce tedious paperwork, especially for those who are comfortable with it.

“…if the stakeholders, especially those who have funds, can help us to make sure that these tools are always available, it is a starting point…” (KIIOG 1)

Continuous education of clinicians about clerkship was suggested, through CMEs and the routine morning meetings that are always held on the ward. It was also suggested that clinicians who clerk patients best be rewarded, to motivate them.

“…for the staff, we can may be continuously talking about it during our Monday morning meetings about how to clerk well and the importance of clerking…” (KIIOG 1)

They also suggested providing a separate, conducive room for the examination of patients to ensure privacy, as this would allow a more detailed examination of the patients by the clinicians.

It was also suggested that closer supervision of clerkship be carried out and that a culture of good clerkship be developed to make it the norm.

“…as leaders of the ward and of the department, we should not get tired to talk about the importance of clerkship, not only in this hospital but also in the whole country…” (KIIOG 1)

Proper record-keeping was also suggested, so that those clerking can be assured that the information will not be discarded soon afterwards.

“…because how good is it to make these notes yet we can’t keep them properly...” (KIIMO 2)

It was also suggested that a records assistant be allocated to take notes for the clinicians to reduce their workload.

Developing SOPs was also suggested, for example, placing checkpoints that ensure a patient is fully clerked before the next step.

“…we can say, before a patient accesses theater or before a mother enters second stage room, they must be fully clerked, and there is a checklist at that point…” (KIIOG 1)

Measures to be taken by the government

Improving the staffing level was strongly suggested to increase the health worker-to-patient ratio. This, it was believed, would take some of the workload off the health workers and allow them to give more time to the patients.

“…we also need more staffing for the scan because the person who is there is overwhelmed…” (KIIMO 1)

Staff motivation was encouraged through the enhancement of staff salaries and allowances. It was believed that it would be easier to supervise health workers when they are motivated.

“…employ more health workers, pay them well then you can supervise them well…” (KIIMO 1)

Providing refresher courses to clinicians was also suggested so that they could be kept up to date on the clerkship process.

Streamlining the referral system was also suggested, with greater use of lower health facilities so that minor cases can be managed there, reducing the overcrowding of patients at the RRH.

“…we need to also streamline the referral system, the way people come to the RRH; some of these cases can be handled in the lower health facilities; we need to see only patients who have been referred…” (KIIMO 2)

The qualitative results are further summarized in Fig. 8 below.

Figure 8: Scheme of the clerkship of patients, including the causes of the clerkship gap and the strategies to improve the clerkship at Mbale RRH

Discussion of results

This study highlights a gap in the clerkships of patients admitted, treated, and discharged from the gynecological ward, with varying gaps in the different sections. This could be because some sections of the clerkship are considered more important than others. A study performed in Turkey revealed that physicians tended to record more information that aided their diagnostic tasks [ 11 ]. This is also reflected in the qualitative findings where participants expressed that particular information is required to make the diagnosis and not everything must be collected.

Biodata for patients were generally well recorded, and name and age were recorded for almost all the patients. A similar finding was found in the UK, where 100% of the patients had their personal details fully recorded [ 12 ]. Patient information should be carefully and thoroughly recorded because it enables health workers to create good rapport with patients and creates trust [ 13 ]. This information is also required for every interaction with the patient at the ward.

The presenting complaint, history of presenting complaint and the review of systems were fairly well recorded, with each of them missing in less than 40% of the patients. The presenting complaint is crucial in every interaction with the patient, to the extent that a diagnosis can rarely be made without knowing the chief complaint [14, 15]; this applies to the history of presenting complaint as well [16]. For the 30% who did not have the presenting complaint recorded, this could mean that even the patient’s primary problem was not given adequate attention.

In the history, the greatest gap was noted in the history of current pregnancy, where many parameters were not recorded in most patients. This is, however, expected since the study was conducted on a gynecological ward, where only a few pregnant women are expected to visit, as they are supposed to go to their antenatal clinics [ 17 ]. However, there was also a large gap in past gynecological history, which is expected to be fully explored in the gynecology ward. A good medical history is key to obtaining a good diagnosis, in addition to a good clinical examination [ 3 , 18 ]. Past obstetric history, past medical history, past surgical history, and family history also had large gaps, yet they are very important in the management of these patients.

During the physical examination, vital signs, especially the pulse rate and blood pressure, were the most frequently recorded parameters, while the abdominal examination parameters were the least frequently recorded; there were also substantial gaps in the general examination and the vaginal examination. The small gap in the vital sign examination likely reflects the close monitoring performed for most patients admitted to the ward, given the nature of the patients, some of whom are emergency cases [19].

Among the investigations, 29% of the patients had no investigation performed at all. The least commonly performed investigations were pelvic USS and malaria tests, while the complete blood count (CBC) was the most commonly performed. Genital infections are among the most common reasons for women’s visits to health care facilities [20]. Therefore, most women in the gynecological ward are suspected to have genital tract infections, which could account for why the CBC is most commonly performed.

The limited number of other investigations, such as pelvic ultrasound scans (USS), underscores the greater relative contribution of medical history and physical examination, compared with laboratory investigations and imaging studies, to making a diagnosis [1]. However, it may also highlight the system challenge of limited access to quality laboratory services in low- and middle-income countries [21]. This was also highlighted by one of the key informants, who reported that ultrasound staff are available on some days but not all; on days when the ultrasound department does not operate, USS is not performed, even when needed.

We found that 20% of patients experienced prolonged hospitalization. This percentage is lower than the 24% reported in a study conducted in Ethiopia [22]; however, that study was conducted in a surgical ward. The median length of hospital stay was the same as that in a study conducted in Eastern Sudan among mothers following cesarean delivery [23]. A prolonged hospital stay has a negative impact not only on patients but also on the hospital [24, 25]. Therefore, health systems should aim to reduce the length of hospital stay for patients as much as possible to improve the effectiveness of health services.

At the multivariate level, abdominal examination was significantly associated with length of hospital stay, with patients whose abdominal examination was not complete being more likely to have a prolonged hospital stay. This underscores the importance of good examination in the development of proper management plans that improve the care of patients, hence reducing the number of days of hospital stay [ 5 , 26 ].

There is a gap in the clerkships of patients at the gynecological ward, which is recognized by the stakeholders at the ward. Some components of clerkships were recorded better than others, with the reasoning that clerkships should be targeted. There were no patients who received a complete clerkship. There was a significant association between clerkships and the length of hospital stay. The causes of the gap in clerkships were multifactorial and included those related to the hospital, those related to the health worker, those related to the health care system and those related to the patient. The strategies to improve the clerkship of patients also included measures taken by health care workers, measures taken by hospitals and measures taken by the government.

Recommendations

We recommend the provision of clerkship tools, such as a standardized clerkship guide and equipment for patient examination; continuous education of health workers on clerkship and training on how to use the available tools; the development of SOPs for patient clerkship; the promotion of a clerkship culture; and the supervision of health workers.

Strengths of the study

This was a mixed-methods study, which allowed for the triangulation of results.

Study limitations

The quantitative data, being secondary, are subject to bias from documentation errors. We assessed the completeness of clerkship without considering the nature of patient admission: we did not record whether a case was an emergency or a stable case, which could be an important confounder. However, this study gives a good insight into the status of clerkship in the gynecological ward and can lay a foundation for future research on the subject.

Availability of data and materials

The data and materials are available upon request from the corresponding author via the email provided.

References

Hampton JR, Harrison M, Mitchell JR, Prichard JS, Seymour C. Relative contributions of history-taking, physical examination, and laboratory investigation to diagnosis and management of medical outpatients. Br Med J. 1975;2(5969):486.


Kaufman MS, Holmes JS, Schachel PP, Latha G. Stead. First aid for the obstetrics and gynecology clerkship. 2011.

Potter L. Gynaecology history taking. 2010. Available from: https://geekymedics.com/gynaecology-history-taking/.

Stoeckle JD, Billings JA. A history of history-taking: the medical interview. J Gen Intern Med. 1987;2(2):119–27.


Muhrer JC. The importance of the history and physical in diagnosis. The Nurse Practitioner. 2014;39(4):30–5.


Foster DW. Every patient tells a story: medical mysteries and the art of diagnosis. J Clin Investig. 2010;120(1):4.


Elder AT, McManus IC, Patrick A, Nair K, Vaughan L, Dacre J. The value of the physical examination in clinical practice: an international survey. Clin Med (Lond). 2017;17(6):490–8.

Katongole SP, Anguyo RD, Nanyingi M, Nakiwala SR. Common medical errors and error reporting systems in selected Hospitals of Central Uganda. 2015.


Katongole SP, Akweongo P, Anguyo R, Kasozi DE, Adomah-Afari A. Prevalence and Classification of Misdiagnosis Among Hospitalised Patients in Five General Hospitals of Central Uganda. Clin Audit. 2022;14:65–77. https://doi.org/10.2147/CA.S370393 .


Kirinya A. Patients Overwhelm Mbale Regional Referral Hospital. 2022.

Yusuff KB, Tayo F. Does a physician’s specialty influence the recording of medication history in patients’ case notes? Br J Clin Pharmacol. 2008;66(2):308–12.


Wethers G, Brown J. Does an admission booklet improve patient safety? J Mental Health. 2011;20(5):438–44.

Flugelman MY. History-taking revisited: Simple techniques to foster patient collaboration, improve data attainment, and establish trust with the patient. GMS J Med Educ. 2021;38(6):Doc109.


Gehring C, Thronson R. The Chief “Complaint” and History of Present Illness. In: Wong CJ, Jackson SL, editors. The Patient-Centered Approach to Medical Note-Writing. Cham: Springer International Publishing; 2023. p. 83–103.


Virden TB, Flint M. Presenting Problem, History of Presenting Problem, and Social History. In: Segal DL, editor. Diagnostic Interviewing. New York: Springer US; 2019. p. 55-75.

Shah N. Taking a history: Introduction and the presenting complaint. BMJ. 2005;331(Suppl S3):0509314.

Uganda MOH. Essential Maternal and Newborn Clinical Care Guidelines for Uganda, May 2022. 2022.

Waller KC, Fox J. Importance of Health History in Diagnosis of an Acute Illness. J Nurse Pract. 2020;16(6):e83–4.

Brekke IJ, Puntervoll LH, Pedersen PB, Kellett J, Brabrand M. The value of vital sign trends in predicting and monitoring clinical deterioration: a systematic review. PloS One. 2019;14(1):e0210875.

Mujuzi H, Siya A, Wambi R. Infectious vaginitis among women seeking reproductive health services at a sexual and reproductive health facility in Kampala, Uganda. BMC Womens Health. 2023;23(1):677.

Nkengasong JN, Yao K, Onyebujoh P. Laboratory medicine in low-income and middle-income countries: progress and challenges. Lancet. 2018;391(10133):1873–5.

Fetene D, Tekalegn Y, Abdela J, Aynalem A, Bekele G, Molla E. Prolonged length of hospital stay and associated factors among patients admitted at a surgical ward in selected Public Hospitals Arsi Zone, Oromia, Ethiopia, 2022. 2022.


Hassan B, Mandar O, Alhabardi N, Adam I. Length of hospital stay after cesarean delivery and its determinants among women in Eastern Sudan. Int J Womens Health. 2022;14:731–8.

LifePoint Health. The impact prolonged length of stay has on hospital financial performance. 2023. Retrieved from: https://lifepointhealth.net/insights-and-trends/the-impact-prolonged-length-of-stay-has-on-hospital-financialperformance .

Kelly S. Patient discharge delays pose threat to health outcomes, AHA warns. Healthcare Dive. 2022. Retrieved from: https://www.healthcaredive.com/news/discharge-delay-American-Hospital-Association/638164/ .

Eskandari M, Alizadeh Bahmani AH, Mardani-Fard HA, Karimzadeh I, Omidifar N, Peymani P. Evaluation of factors that influenced the length of hospital stay using data mining techniques. BMC Med Inform Decis Mak. 2022;22(1):1–11.


Funding

The study did not receive any funding.

Author information

Authors and Affiliations

Faculty of Health Science, Busitema University, P.O. Box 1460, Mbale, Uganda

Kennedy Pangholi, Faith Nyangoma, Jesca Atugonza, Elizabeth Ajalo, Alice Kalenda, Ambrose Okibure, Andrew Kagwa, Ronald Kibuuka & Betty Nakawuka

Institute of Public Health, Department of Community Health, Faculty of Health Sciences, Busitema University, P.O. Box 1460, Mbale, Uganda

Enid Kawala Kagoya, Francis Okello & Proscovia Auma

Seed Global Health, P.O. Box 124991, Kampala, Uganda

Allan G Nsubuga, Irene Atuhairwe, Prossy Nakattudde, Brian Agaba, Bonaventure Ahaisibwe, Esther Ijangolet & Eric Otim

Department of Pharmacology and Therapeutics, Busitema University, Faculty of Health Science, P.O. Box 1460, Mbale, Uganda

Paul Waako

Department of Obstetrics and Gynecology, Busitema University, Faculty of Health Sciences, P.O. Box 1460, Mbale, Uganda

Julius Wandabwa & Milton Musaba

Department of Obstetrics and Gynecology, Mbale Regional Referral Hospital, P.O. Box 921, Mbale, Uganda

Antonina Webombesa & Kenneth Mugabe

Department of Nursing, Busitema University, Faculty of Health Sciences, P.O. Box 1460, Mbale, Uganda

Ashley Nakawuki

Ministry of Health, Plot 6, Lourdel Road, Nakasero, P.O. Box 7272, Kampala, Uganda

Richard Mugahi


Contributions

P.K. came up with the concept and design of the work and coordinated the team. K.E.K. and A.P. helped with the interpretation of the data. O.F. and O.A. helped with the analysis of the data. N.A.G., A.I., N.P., W.P., W.J., M.M., A.W., M.K., N.F., A.J., A.E., M.R., K.A., K.A., A.B., A.B., I.E., O.E., N.A., K.R. and N.B. substantially revised the work.

Corresponding author

Correspondence to Kennedy Pangholi .

Ethics declarations

Ethics approval and consent to participate.

The study was conducted according to the Declaration of Helsinki and in line with the principles of Good Clinical Practice and Human Subject Protection. Prior to collecting the data, ethical approval was obtained from the Research Ethics Committee of Mbale RRH, approval number MRRH-2023-300. The confidentiality of the participant information was ensured throughout the research process. Permission was obtained from the hospital administration before the data were collected from the patients’ files, and informed consent was obtained from the participants before the qualitative data were collected. After entry of the data, the devices were returned to the principal investigator at the end of the day, and they were given to the data entrants the next day.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary material 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Pangholi, K., Kagoya, E.K., Nsubuga, A.G. et al. Improving Clerkship to Enhance Patients’ Quality of care (ICEPACQ): a baseline study. BMC Health Serv Res 24 , 852 (2024). https://doi.org/10.1186/s12913-024-11337-w


Received : 25 September 2023

Accepted : 22 July 2024

Published : 26 July 2024

DOI : https://doi.org/10.1186/s12913-024-11337-w


Keywords: Gynecology ward


Assessing the Influence of Soil Composition on Plant Growth and Development in USA

  • Harper Olivia
  • Published in the American Journal of Physical Sciences, 17 June 2024
  • DOI: 10.47604/ajps.2665


Predictive models for personalized precision medical intervention in spontaneous regression stages of cervical precancerous lesions

  • Simin He 1 , 2 ,
  • Guiming Zhu 1 , 2 ,
  • Ying Zhou 3 ,
  • Boran Yang 1 , 2 ,
  • Juping Wang 1 , 2 ,
  • Zhaoxia Wang 3 &
  • Tong Wang   ORCID: orcid.org/0000-0002-9403-7167 1 , 2  

Journal of Translational Medicine, volume 22, Article number: 686 (2024)


During the prolonged period from Human Papillomavirus (HPV) infection to cervical cancer development, the Low-Grade Squamous Intraepithelial Lesion (LSIL) stage provides a critical opportunity for cervical cancer prevention, given the high potential for reversal at this stage. However, there is little research and a lack of clear guidelines on appropriate intervention strategies at this stage, underscoring the need for real-time prognostic predictions and personalized treatments to promote lesion reversal.

We have established a prospective cohort. Since 2018, we have been collecting clinical data and pathological images of HPV-infected patients, followed by tracking the progression of their cervical lesions. In constructing our predictive models, we applied logistic regression and six machine learning models, evaluating each model’s predictive performance using metrics such as the Area Under the Curve (AUC). We also employed the SHAP method for interpretative analysis of the prediction results. Additionally, the model identifies key factors influencing the progression of the lesions.
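As a rough, hypothetical sketch of this modelling workflow, the snippet below fits a few of the classifier families mentioned (logistic regression, random forest, SVM) on simulated stand-ins for the clinical parameters and compares them by AUC; the pathological-image branch and the SHAP interpretation step are omitted, and none of the feature names or data reflect the actual cohort.

```python
# A minimal sketch: compare several classifiers by AUC on simulated data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))        # simulated clinical parameters (hypothetical)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)  # simulated outcome

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm": SVC(probability=True, random_state=0),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(name, round(auc, 3))
```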

Model comparisons highlighted the superior performance of Random Forests (RF) and Support Vector Machines (SVM), both in clinical parameter and pathological image-based predictions. Notably, the RF model, which integrates pathological images and clinical multi-parameters, achieved the highest AUC of 0.866. Another significant finding was the substantial impact of sleep quality on the spontaneous clearance of HPV and regression of LSIL.

Conclusions

In contrast to current cervical cancer prediction models, our model’s prognostic capabilities extend to the spontaneous regression stage of cervical cancer. This model aids clinicians in real-time monitoring of lesions and in developing personalized treatment or follow-up plans by assessing individual risk factors, thus fostering lesion spontaneous reversal and aiding in cervical cancer prevention and reduction.


Introduction

Cervical cancer, a significant health threat to women, is primarily induced by Human Papillomavirus (HPV) infection. Over 80% of women experience at least one HPV infection in their lifetime [ 1 ], with the majority manifesting as asymptomatic and transient infections. However, in certain instances, the infection persists, potentially leading to mild cervical cell abnormalities known as Low-Grade Squamous Intraepithelial Lesion (LSIL) or Cervical Intraepithelial Neoplasia grade 1 (CIN1). This stage is generally reversible, with studies indicating that approximately 60–70% of LSIL cases spontaneously regress within one year, and this figure approaches approximately 90% within two years [ 2 ]. Should LSIL fail to resolve and HPV infection persists, the lesion may advance to High-Grade Squamous Intraepithelial Lesion (HSIL), encompassing Cervical Intraepithelial Neoplasia grades 2 and 3 (CIN2, CIN3). The probability of reversal at this stage is lower, yet spontaneous regression can still occur, particularly in younger women. Without appropriate treatment and management, HSIL will ultimately progress to cervical cancer. The prolonged period from HPV infection to cervical cancer development, highlighted by the high potential for reversal during the LSIL stage, provides a critical opportunity for cervical cancer prevention. Rational intervention at the LSIL stage can significantly enhance the prevention capabilities, effectively reducing the incidence of cervical cancer.

At present, intervention strategies for patients in the LSIL stage remain a contentious topic. According to the 2019 guidelines of the American Society for Colposcopy and Cervical Pathology (ASCCP), observation is the preferred approach for patients diagnosed with HPV-induced cervicitis and LSIL. Additionally, treatment is discretionary for high-risk HPV (hrHPV), a more significant proportion of high-grade lesions, and prolonged infection duration, based on patient preferences. With the implementation of screening policies and heightened health awareness, a significant number of women are detected with positive HPV infection or diagnosed at the LSIL stage during opportunistic screening. Due to the guidelines’ ambiguity regarding treatment, clinical practice currently relies more on empirical estimation of patient outcomes, subsequently guiding observation or pharmacological/surgical intervention. However, this intervention approach may lead to two adverse outcomes. For high-risk populations potentially progressing to HSIL, mere observation could miss the early treatment window, thereby increasing the risk of cancerous transformation [ 3 , 4 ]. Conversely, aggressive interventions such as pharmacological, physical, and surgical treatments could lead to overtreatment, entailing associated complications [ 5 , 6 ] and psychological [ 7 ] and financial burdens [ 8 ]. Moreover, current intervention plans do not fully consider individual differences. In resource-limited healthcare systems, treatments may consume resources that could be allocated to patients at higher risk.

Currently, intervention strategies and medical needs are progressively shifting towards novel treatment methods, precision medicine, early screening, and pre-cancer intervention strategies. The treatment and prevention of cervical cancer are trending towards personalized therapy, aiming to ensure oncological safety while minimizing the incidence rate. A recent comprehensive study outlined the global burden of cervical cancer, emphasizing the significance of early detection and intervention in improving prognosis, particularly in developing countries where rational intervention and treatment during early or pre-cancer stages are crucial in reducing the societal burden of cervical cancer [ 9 ]. Another review on the management of CIN stages also highlights the importance of CIN stages in cervical cancer prevention, stressing the necessity of predictive screening, appropriate intervention, and the application of precision medicine [ 4 ]. These findings underscore the pivotal role of the LSIL stage in the prevention, diagnosis, and treatment of cervical cancer. By accurately predicting the risk of lesion progression, these models assist in guiding personalized treatment decisions, optimizing resource allocation, and enhancing public awareness and participation in cervical cancer screening.

In early studies of cervical cancer prediction models, researchers primarily focused on predicting the survival or recurrence rates of cervical cancer patients [ 10 , 11 ]. In recent years, attention has gradually shifted towards predicting the onset of cervical cancer, leading to the construction of numerous models for early risk prediction and screening [ 12 ]. Particularly, the application of machine learning methods has significantly improved the performance and accuracy of these models [ 13 , 14 ]. With increasing recognition of the spontaneous regression potential of precancerous lesions, more researchers are acknowledging the importance of the CIN stage in cervical cancer prevention and control. Prediction models for this stage are gradually being developed. Austin et al. [ 15 ]. developed the Pittsburgh cervical cancer screening model based on 19 relevant variables, employing dynamic Bayesian methods to quantitatively estimate the risk of HSIL and carcinoma in situ in patients. Similarly, Charlton et al. [ 16 ]. utilized multivariate logistic regression to construct a model with HSIL and carcinoma in situ as the predicted outcomes, using basic clinical information to forecast the risk of cervical abnormality progression in patients with Atypical Squamous Cells of Undetermined Significance (ASCUS)/LSIL. However, the evaluation of this model revealed an Area Under the Curve (AUC) of only 0.63, indicating significant potential for enhancing its predictive performance. Koeneman et al. [ 17 ]. also applied multivariate logistic regression, basing their predictions on patient demographics and laboratory results to forecast spontaneous regression in CIN2 cases. The evaluation of their model similarly showed a relatively low AUC of 0.692. A recent study developed a predictive model using five variables: TCT results, HPV status, and the proportion of samples with acetowhite epithelium, abnormal blood vessels, and mosaicity [ 18 ]. The model demonstrated a good predictive performance (AUC = 0.851) for the prognosis of patients at the HSIL stage. However, as previously mentioned, the probability of regression for HSIL patients is lower compared to those at the LSIL stage. Although the model aims to predict the risk of cervical cancer in HSIL patients to reduce unnecessary surgeries and minimize side effects for low-risk patients, its effectiveness in promoting lesion regression and preventing cervical cancer may be somewhat limited.

The aforementioned models clearly demonstrate the effectiveness of predictive models in early identification of patients at high risk for cervical cancer. However, given the higher potential for reversal exhibited during the LSIL stage, models that advance their predictive endpoints to this earlier stage can facilitate the identification of high-risk LSIL cases, which is crucial for early intervention and the prevention of cervical cancer progression [ 18 , 19 ]. Particularly for developing countries, such models can assist in optimizing screening protocols, focusing resources on high-risk populations, thereby reducing the risks of missed diagnoses and misdiagnoses. Furthermore, prediction models can assist physicians in developing personalized treatment plans for LSIL patients at varying levels of risk. For patients at lower risk, the model can guide the adoption of more conservative observation strategies, while those at higher risk may necessitate more proactive interventions. Additionally, these prediction models can inform public health strategies, aiding health departments in more effective resource allocation and providing data support to policymakers. This assistance is crucial in formulating more efficient cervical cancer screening and prevention programs, thereby enhancing the overall efficacy of cervical cancer prevention and treatment efforts.

In this study, our research primarily focuses on the spontaneously regressive stages of lesion development during the progression from HPV infection to cervical cancer, especially during the LSIL phase. Our objective is to construct a predictive model to assess the prognosis risk of patients with HPV infection or LSIL, and to identify key factors influencing the infection status. Our study is based on data from both clinical information and pathological images. The inclusion of pathological images, serving as the “gold standard” for determining lesion status, is crucial in ensuring the predictive accuracy of the model. Through the development of this predictive model, we aim to achieve real-time monitoring of the prognostic status of patients in the LSIL stage, assisting clinicians in making rational medical decisions and in devising personalized follow-up or treatment plans for patients. This approach is intended to effectively prevent disease progression and promote natural regression of the infection, thereby reducing the incidence and prevalence of cervical cancer.

Patients, selection criteria, and follow-up

In this prospective cohort study, we systematically collected data from patients presenting with initial, persistent, or recurrent HPV infection, establishing an extensive prospective cohort initiated at the First Hospital of Shanxi Medical University in January 2018. The study adhered to strict exclusion criteria: (a) patients with concurrent mental disorders; (b) patients incapable of comprehending or completing the questionnaire due to speech or intellectual disabilities; (c) patients with concurrent life-threatening diseases. Following these inclusion and exclusion criteria, a total of 511 patients diagnosed with HPV infection were enrolled in the study. All participating patients signed informed consent forms. Enrolled subjects underwent regular follow-ups every three months for meticulous documentation of HPV infection status, transition to negativity, duration of recurrent and persistent infections, disease progression, and clinical outcomes. Follow-up was mainly conducted through outpatient visits. For patients unable to attend clinic follow-up, such as those not residing locally, we conducted follow-ups by telephone ( n  = 68). The phone call involved asking patients about their health status and collecting detailed and complete records of their HPV tests and cervical lesion examination results from local tertiary hospitals within the three-month interval. Additionally, to ensure consistency in the assessment of cervical lesions, all follow-up results, whether from outpatient visits or telephone follow-ups, were evaluated by the same professional physician.

Clinical multi-parameters

All patients enrolled in the study underwent the ThinPrep Cytologic Test (TCT) and HPV testing. Their infection status and specific HPV types were accurately determined based on the results of these diagnostic assessments. The TCT and HPV testing were performed during the patient’s non-menstrual period, with cell samples scraped from the cervix. For the TCT, the collected cell samples were processed using a liquid-based cytology processor (ThinPrep 2000) to create thin-layer cell smears. The cell smears were then stained with Papanicolaou stain, and two cytopathology experts independently examined the smears under a microscope. For HPV testing, the cervical samples were collected using ThinPrep PreservCyt Solution to preserve the integrity of the nucleic acids within the cells. HPV testing was performed with a fully automatic nucleic acid hybridization detector (Yaneng BIO YN-HR96) to determine the presence, viral load, and type of HPV.

Informed by existing studies [ 3 , 20 , 21 ], we selected pertinent factors associated with cervical cancer and HPV infection to construct our questionnaire. The questionnaire comprehensively covered demographic data, socioeconomic status, family medical history, personal cervical disease history, sexual and reproductive history, health risk behaviors, anxiety and mental health, and sleep status. Exercise, sleep quality, anxiety, and mental health were assessed using standardized scales, specifically the IPAQ (International Physical Activity Questionnaire), PSQI (Pittsburgh Sleep Quality Index), SRSS (Self-Rating Scale of Sleep), and SAS (Self-Rating Anxiety Scale). The remaining variables were obtained through patient statements or self-reporting. For patients with incomplete questionnaires ( n  = 43), missing data were supplemented through timely telephone interviews to ensure the integrity and authenticity of the data. The details regarding the missing data in the questionnaire survey can be found in Supplementary Table S1 .

Patient clinical data were entered using EpiData software (version 3.1), with the entry process independently completed by two trained operators and subjected to consistency verification. The clinical data of the patients were statistically described and compared between groups. Following a normality test, quantitative variables with a normal distribution were described as mean ± standard deviation ( \(\overline{X} \pm S\) ), and group comparisons were conducted using the independent-sample t -test. Categorical variables were described using percentages, and group comparisons were performed using the χ² test. The significance level was set at α = 0.05.
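As a rough illustration of this group-comparison step, the sketch below shows how an independent-sample t-test and a χ² test might be run in Python. The file name, outcome coding, and column names (hpv_cleared, age, smoking) are hypothetical, and the original analysis was not necessarily performed in Python.

```python
import pandas as pd
from scipy import stats

# Hypothetical dataset: one row per patient, with a binary clearance outcome.
df = pd.read_csv("clinical_data.csv")              # assumed file name
cleared = df[df["hpv_cleared"] == 1]               # assumed outcome coding
persistent = df[df["hpv_cleared"] == 0]

# Independent-sample t-test for a normally distributed quantitative variable (e.g. age).
t_stat, p_age = stats.ttest_ind(cleared["age"], persistent["age"])

# Chi-square test for a categorical variable (e.g. smoking status).
contingency = pd.crosstab(df["hpv_cleared"], df["smoking"])
chi2, p_smoking, dof, expected = stats.chi2_contingency(contingency)

print(f"age: t = {t_stat:.2f}, p = {p_age:.3f}")
print(f"smoking: chi2 = {chi2:.2f}, p = {p_smoking:.3f}")
```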

Pathological image acquisition and image feature extraction

Pathological biopsy is acknowledged as the gold standard for evaluating the extent of cervical intraepithelial lesions. In this study, pathological sections were obtained from patients during their non-menstrual phase. The acquired samples were processed with hematoxylin-eosin staining and subsequently underwent pathological examination. High-resolution images of the pathological sections were captured using an image acquisition system (Axio Scope.A1) at a magnification of 100× (resolution of 2048 × 1536 pixels). Each section was labeled independently by two experienced pathologists.

The analysis of the pathological images involved color feature extraction and texture feature extraction, tailored to the specific properties of the pathological tissue sections. Color feature extraction in this study encompassed three distinct color features: RGB features [ 22 ], HSV features [ 23 ], and Lab features [ 22 ]. The extraction of color features from pathological images was based on the color histogram. This involved grouping the pixels in the image by color and then counting the number of pixels within each color group to generate a color histogram, which effectively represents the color distribution characteristics of the image. The variables extracted included the mean and standard deviation for each channel of R, G, B, H, S, V, L, a, and b, in addition to the overall mean and variance. Texture feature extraction incorporated two texture attributes: the Grey Level Co-occurrence Matrix (GLCM) [ 24 ] and Gabor texture features [ 25 ]. To ensure precision and classification accuracy in texture feature extraction, GLCM contrast, energy, and correlation features were extracted at angles of 0, 45, 90, and 135 degrees. Image feature extraction was carried out using the OpenCV library within the Python 3.7.0 environment. After extraction, these image features were integrated with clinical data to form a unified dataset, enabling further analysis and model development.
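To make the feature-extraction pipeline more concrete, the following minimal sketch computes per-channel color statistics in the RGB, HSV, and Lab spaces and GLCM contrast, energy, and correlation at the four stated angles. It is an illustration only: the authors report using OpenCV under Python 3.7, but the GLCM step here relies on scikit-image (graycomatrix/graycoprops), the histogram-based color features and Gabor filters are omitted for brevity, and the file name is hypothetical.

```python
import cv2
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # scikit-image >= 0.19

def color_features(img_bgr):
    """Mean and standard deviation of each channel in the RGB, HSV, and Lab spaces."""
    feats = {}
    spaces = {
        "rgb": cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB),
        "hsv": cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV),
        "lab": cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB),
    }
    for name, img in spaces.items():
        for i, ch in enumerate(name):  # e.g. 'r', 'g', 'b' for the RGB image
            feats[f"{name}_{ch}_mean"] = float(img[..., i].mean())
            feats[f"{name}_{ch}_std"] = float(img[..., i].std())
    return feats

def glcm_features(img_bgr):
    """GLCM contrast, energy, and correlation at 0, 45, 90, and 135 degrees."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    glcm = graycomatrix(gray, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    feats = {}
    for prop in ("contrast", "energy", "correlation"):
        for j, angle in enumerate((0, 45, 90, 135)):
            feats[f"glcm_{prop}_{angle}"] = float(graycoprops(glcm, prop)[0, j])
    return feats

img = cv2.imread("section_0001.png")  # hypothetical file name
features = {**color_features(img), **glcm_features(img)}
```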

Screening of variables

The predictive efficacy of a model is considerably influenced by the variables it incorporates. Factors such as the number of variables, inter-variable correlations, and the inclusion of critical variables significantly affect the accuracy and efficiency of the prediction model. Hence, the process of variable selection is crucial in the construction of an effective prediction model. In our study, we initially conducted a test for multicollinearity among the variables. Specifically, we assessed the presence of multicollinearity using the Corrected Generalized Variance Inflation Factor (CGVIF) [ 26 ], establishing a threshold where a CGVIF less than 5 indicates an acceptable level of multicollinearity. Subsequently, we employed the Boruta method [ 27 ] and the biosigner method [ 28 ] within the framework of the Random Forest (RF) variable importance scoring approach for variable screening. The set of influential factors to be included in the prediction model was determined based on the outcomes of this variable screening process.
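A hedged sketch of how such a screening step could look in Python is given below. It substitutes the ordinary variance inflation factor (statsmodels) for the corrected generalized VIF used in the paper and uses the BorutaPy package for the Boruta algorithm; the biosigner method is an R/Bioconductor tool with no direct Python equivalent and is therefore omitted. File and column names are hypothetical, and an all-numeric predictor matrix is assumed.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy                          # pip install Boruta (assumed available)
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.read_csv("features.csv")                      # hypothetical predictor matrix
y = pd.read_csv("labels.csv")["hpv_cleared"].values  # hypothetical binary outcome

# Plain VIF as a simplified stand-in for the corrected generalized VIF (CGVIF);
# values below 5 are treated as acceptable, mirroring the threshold in the text.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
X = X.loc[:, vif < 5]

# Boruta wraps a random forest and compares real features against shadow features.
rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5)
boruta = BorutaPy(rf, n_estimators="auto", random_state=42)
boruta.fit(X.values, y)

selected = X.columns[boruta.support_].tolist()
print("Boruta-confirmed variables:", selected)
```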

The results of the biosigner and Boruta methods can be influenced by the quality and size of the dataset. Specifically, when the dataset is either too small or of suboptimal quality, the variable screening results might be inaccurate. Concurrently, the selection of an excessive number of variables by these methods can potentially lead to model overfitting. To mitigate these issues, we conducted an additional 500 iterations of 10-fold cross-validation to evaluate the RF error rate. Ultimately, the optimal set of variables to be incorporated into the model was determined in consultation with the clinical chief physician, ensuring relevance and practical applicability.
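Under the stated assumptions, the repeated cross-validation of the RF error rate could be sketched roughly as follows; X and y are the screened predictor matrix and outcome from the previous step, and this is not the authors' exact code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X, y: screened predictor matrix and binary outcome (assumed defined).
errors = []
for seed in range(500):                               # 500 repetitions of 10-fold CV
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    acc = cross_val_score(RandomForestClassifier(random_state=seed),
                          X, y, cv=cv, scoring="accuracy")
    errors.append(1.0 - acc.mean())                   # error rate = 1 - accuracy

print(f"mean CV error over 500 runs: {np.mean(errors):.3f} ± {np.std(errors):.3f}")
```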

Model settings

In this study, we developed three distinct model scenarios: Model 1, which incorporates only clinical feature data; Model 2, comprising solely pathological image features; and Model 3, a combination of both clinical feature data and pathological image feature data. For the modeling approach, we selected logistic regression along with six widely utilized machine learning algorithms: Decision Tree (DT), RF, Naive Bayesian (NB), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Neural Network (NN). Logistic regression was chosen for its well-established application in binary outcome prediction and its ease of interpretation, both of which are highly informative in clinical contexts. The application of multiple machine learning methods is based on their distinct advantages and capabilities in handling different types of data: DT offers interpretability and simplicity in visualizing decision paths; RF is known for its robustness to overfitting and its ability to handle high-dimensional data effectively; SVM is effective in high-dimensional spaces and in cases where the number of dimensions exceeds the number of samples; KNN is simple to implement and understand, with effectiveness in local decision boundaries; NB is efficient with small datasets and assumes independence between predictors, which can simplify the modeling process; and NN is powerful in capturing complex non-linear relationships within the data.
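Assuming a scikit-learn implementation (the paper does not state the exact library), the seven candidate estimators could be instantiated roughly as follows; the specific settings shown are illustrative defaults, not the authors' tuned configurations.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# One candidate estimator per modelling approach named in the text.
models = {
    "Logistic":      LogisticRegression(max_iter=1000),
    "DecisionTree":  DecisionTreeClassifier(random_state=0),
    "RandomForest":  RandomForestClassifier(random_state=0),
    "NaiveBayes":    GaussianNB(),
    "SVM":           SVC(probability=True, random_state=0),  # probability=True enables ROC/AUC
    "KNN":           KNeighborsClassifier(),
    "NeuralNetwork": MLPClassifier(max_iter=2000, random_state=0),
}
```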

Model training and evaluation

The dataset in our study was randomly divided into training and testing sets at a ratio of 8:2, and hyperparameter tuning was conducted using Grid Search Cross-Validation (GridSearchCV). To ascertain the optimal model, we performed 500 iterations of 10-fold cross-validation. The model evaluation metrics in our study encompassed a comprehensive range of indicators: sensitivity, specificity, Youden’s index, accuracy, balanced accuracy, precision, recall, Kappa coefficient, F1 score, Receiver Operating Characteristic curve (ROC), and AUC. Figure  1 depicts the detailed process of model construction.
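A minimal sketch of the 8:2 split, grid search, and evaluation metrics is shown below for a single candidate model (SVM); the hyperparameter grid is hypothetical, and the full study repeated tuning and evaluation across all seven algorithms with 500 iterations of 10-fold cross-validation.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, cohen_kappa_score,
                             f1_score, precision_score, recall_score, roc_auc_score)
from sklearn.svm import SVC

# X, y assumed defined; 8:2 train/test split as described in the text.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

# Hypothetical hyperparameter grid for one of the candidate models (SVM).
grid = GridSearchCV(SVC(probability=True),
                    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]},
                    cv=10, scoring="roc_auc")
grid.fit(X_train, y_train)

pred = grid.predict(X_test)
prob = grid.predict_proba(X_test)[:, 1]
sens = recall_score(y_test, pred)                # sensitivity = recall for the positive class
spec = recall_score(y_test, pred, pos_label=0)   # specificity = recall for the negative class
print({
    "sensitivity": sens,
    "specificity": spec,
    "youden": sens + spec - 1,
    "accuracy": accuracy_score(y_test, pred),
    "balanced_accuracy": balanced_accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "kappa": cohen_kappa_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "auc": roc_auc_score(y_test, prob),
})
```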

Figure 1. The flow chart of model construction

Interpretability analysis

To illuminate the significance of different features within our model, we employed Shapley Additive Explanations (SHAP), a post-hoc explanatory tool designed to decipher ‘black-box’ models using the Shapley value, a concept rooted in game theory. This approach quantifies the contribution of each feature to the model’s predictions, thereby enabling a detailed understanding of the impact of individual variables [ 29 ]. SHAP conceptualizes the Shapley value as a linear model where feature contributions are additive, transforming the model’s output into a sum of values attributed to each feature. It not only highlights the importance of each feature within the model but also delineates the directionality of their influence on the model’s predictive decision-making process.
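In practice, a SHAP analysis of a fitted tree-based model can be sketched roughly as follows; `model` and `X_test` are assumed to come from the earlier training step, and the exact return types vary between shap versions.

```python
import numpy as np
import shap

# `model` is a fitted tree-based classifier (e.g. the RF above) and X_test a pandas
# DataFrame of held-out features; both are assumed to be defined already.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
base_value = explainer.expected_value

# Depending on the shap version, binary classifiers may return one array (or list
# entry) per class; keep the contribution to the positive class in that case.
if isinstance(shap_values, list):
    shap_values, base_value = shap_values[1], base_value[1]
elif np.ndim(shap_values) == 3:
    shap_values, base_value = shap_values[..., 1], base_value[1]

shap.summary_plot(shap_values, X_test)                        # global view (cf. Fig. 6)
shap.force_plot(base_value, shap_values[0], X_test.iloc[0])   # one patient (cf. Fig. 7)
```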

Description of clinical data

The study encompassed a total of 511 patients. By the end of the follow-up period, 244 patients had achieved HPV clearance, while 267 continued to exhibit persistent HPV infection. We conducted a statistical analysis of the clinical data from all 511 participants, comparing characteristics between the HPV clearance group and the persistent infection group. The findings are presented in Table  1 .

The results revealed significant differences between the two groups in several aspects, including age, frequency of sexual activity, pregnancy history, reproductive history, sleep quality (as measured by the Self-Rating Scale of Sleep, SRSS), and anxiety levels (as assessed by the Self-Rating Anxiety Scale, SAS). Specifically, the average age in the persistent infection group (47.33 ± 10.54 years) was higher compared to the HPV clearance group (44.62 ± 11.10 years). The frequency of sexual activity in the persistent infection group (4.674 ± 3.98 times/month) was lower than that in the HPV clearance group (5.988 ± 4.09 times/month). Regarding pregnancy and childbirth history, the HPV clearance group had fewer instances compared to the persistent infection group. In terms of sleep and anxiety status, the patients in the HPV clearance group reported better sleep quality and lower levels of anxiety compared to those in the persistent infection group.

Variable screening

The results showed that no variable had a CGVIF exceeding 5, indicating the absence of collinear variables (Supplementary Table S2 ). Consequently, we utilized the biosigner and Boruta methods within the RF algorithm for variable importance scoring to screen all variables. The selection of influencing factors for the prediction model was based on the outcomes of the variable screening process. In Model 1, which focused solely on clinical variables, the Boruta method identified nine influential factors, including sleep quality. When pathological image features were incorporated in Model 3, the Boruta method expanded its selection to 29 variables, comprising five clinical variables and 24 image features. In contrast, the biosigner method, known for selecting the minimal optimal set of variables as features [ 28 ], identified only sleep quality as the influential factor in both Model 1 and Model 3. The detailed results of this variable selection process are depicted in Figs.  2 and 3 .

Figure 2. Boruta method variable screening results: (a) model 1; (b) model 2; (c) model 3

Figure 3. Biosigner method variable screening results: (a) model 1; (b) model 2; (c) model 3

Given that the variable screening outcomes from the biosigner and Boruta methods can be affected by the size and quality of the dataset, we executed 500 iterations of 10-fold cross-validation to evaluate the RF error rate. The results are presented in Fig.  4 . Ultimately, guided by clinicians, we determined the optimal set of variables to be included in the model. The model was thus constructed with a total of 31 feature variables, comprising 7 clinical features; color features from the RGB, HSV, and Lab color spaces; GLCM contrast, energy, and correlation features; and Gabor texture features. Table  2 delineates this optimal feature set of the model.

Figure 4. Cross-validation error rate results: (a) model 1; (b) model 2; (c) model 3

Model evaluation and comparison

In Model 1, which incorporated only clinical data, Logistic Regression and DT demonstrated superior performance across various comprehensive evaluation metrics. Logistic Regression achieved the highest Youden’s Index (0.555), balanced accuracy (0.778), and Kappa coefficient (0.555). Meanwhile, DT excelled with the highest accuracy (0.779), F1 score (0.798), and a Kappa coefficient equal to that of Logistic Regression (0.555). Additionally, the RF method presented the highest AUC value (0.841). In Model 2, which utilized only pathological image data, RF and SVM surpassed other methods. SVM recorded the highest values in Youden’s Index (0.617), accuracy (0.809), balanced accuracy (0.809), F1 score (0.817), Kappa coefficient (0.617), and precision (0.820). RF, on the other hand, showed the highest sensitivity (0.826), recall (0.826), and AUC (0.812). For the comprehensive Model 3, integrating both clinical and imaging data, RF and SVM again displayed better predictive performance. In this model, SVM had the highest scores in Youden’s Index (0.575), accuracy (0.789), balanced accuracy (0.788), Kappa coefficient (0.576), and precision (0.790). RF led with the highest F1 score (0.800) and AUC (0.866). The results are depicted in Fig.  5 . Comparing all three models, Model 3 generally outperformed Models 1 and 2, especially with SVM and RF, indicating the best predictive capability among the three. Model 2 exhibited the weakest predictive performance in comparison to the other models. The detailed results of model evaluation are outlined in Table  3 .

Figure 5. Model evaluation results: (a) model 1; (b) model 2; (c) model 3

Results of interpretability analysis

We utilized SHAP for the interpretability analysis of our predictive model and assigned importance rankings to the variables included in the model. The results, depicted in Fig.  6 , are based on interpretability analysis conducted across the entire sample population. Each patient’s attribution for each feature is represented by a colored dot, where orange signifies high-risk values and purple signifies low-risk values. Across the entire sample, among the many factors influencing HPV clearance, the top five factors by importance were sleep quality, medication usage, short-term sleep problems, type of work, and sexual frequency. Among these, sleep quality is particularly noteworthy, exhibiting a significant impact on HPV clearance. This insight underscores the critical role of sleep quality in the prognosis of HPV infection outcomes.

Figure 6. Overall model interpretability analysis

In clinical practice, SHAP interpretability analysis for individual patients can help clinicians gain a more intuitive understanding of the patient’s prognosis and its main influencing factors, which in turn enables them to provide personalized medical intervention strategies for the patient. For example, Fig.  7 displays the results of the SHAP explanatory analysis for an individual patient. In this case, the model computed a SHAP value of 0.471, with a corresponding prediction score for the patient of 0.734. This indicates a higher likelihood that this patient will undergo spontaneous HPV clearance or lesion regression. Among the numerous factors that may influence spontaneous HPV clearance, smoking, sleep quality, medication, and sexual frequency exhibited the greatest explanatory power.

Figure 7. Model interpretability analysis of individuals

HPV has been identified as a significant factor contributing to the development of cervical cancer. However, HPV-driven cancer is a relatively rare occurrence, as most infections are transient and can be spontaneously cleared by the host immune system. In cases of persistent HPV infection, it may take decades to progress to cervical cancer. This extended temporal window provides a golden opportunity for clinical prevention and early intervention. This study focuses on the stages of HPV infection with a higher potential for spontaneous regression during the progression to cervical cancer, particularly HPV infection accompanied by chronic inflammation and LSIL. In response to the currently ambiguous treatment scenario at this stage and its pivotal role in cervical cancer prevention, we developed a personalized precision medicine prediction model. This model not only predicts the progression of the disease based on the patient’s current condition and individual factors but also identifies the pivotal factors influencing the disease during its development. The evaluation demonstrates that the predictive model we developed exhibits commendable prediction capabilities, with an AUC of 0.866. Additionally, the identification of key influencing factors within the study confirmed the significant role of sleep quality in the spontaneous regression of LSIL and the spontaneous clearance of HPV infections.

In preceding studies, researchers have recognized the significant role of the HSIL stage in the development of cervical cancer. Austin et al. [ 15 ] employed 19 relevant variables, including HPV vaccination data, Papanicolaou test results, high-risk HPV infection, operation data, and histopathological results, to construct the Pittsburgh Cervical Cancer Screening model. Utilizing dynamic Bayesian methods, the researchers quantitatively assessed the risks associated with HSIL and adenocarcinoma in situ. However, this model primarily relies on molecular markers and laboratory indicators to predict the occurrence of cervical cancer, and in practical clinical applications its use may be limited by the accessibility of variable information. Following this, Charlton et al. [ 16 ] also focused on predicting the occurrence of HSIL and adenocarcinoma in situ, utilizing basic clinical information (age, smoking habits, number of sexual partners, pregnancies, immune status, etc.) to predict the risk of cervical abnormalities in ASCUS or LSIL patients. They employed multivariate logistic regression to construct the model, and the results indicated the predictive role of abnormal Papanicolaou test results in the further progression of cervical lesions. However, the model evaluation results showed an AUC of only 0.63, suggesting considerable room for improvement in predictive performance. In a similar vein, Koeneman et al. [ 17 ] applied multivariate logistic regression to forecast spontaneous regression in HSIL patients, based on demographic and laboratory data. The derived model, however, also exhibited a relatively low AUC of 0.692.

Our study presents two key enhancements building upon existing research. Firstly, our model advances the prediction outcome to the LSIL stage, given that this stage exhibits a higher potential for spontaneous regression. Consequently, timely detection and appropriate intervention at this stage hold greater significance and impact for the prevention of cervical cancer. With the assistance of this model, clinicians can monitor the lesion conditions of patients in real time and, by evaluating individual risk factors, guide the development of personalized treatment or follow-up plans. For LSIL patients with a low risk of progression, this approach can avoid unnecessary treatments, thereby significantly reducing potential side effects and medical costs. Secondly, our predictive model integrates multidimensional data, including clinical information, laboratory test results, and pathological imaging. The inclusion of such a comprehensive range of variables substantially strengthens the predictive efficacy of the model. The evaluation results also indicate that our model exhibits a substantial improvement in predictive performance compared to existing models. Additionally, in designing the model, we not only focused on ensuring its accuracy but also considered its clinical practicality. The variables included in our model are all accessible through routine diagnostic and treatment processes for LSIL patients or through simple questionnaires administered during regular clinical visits. This design aims to ensure that the model can be easily integrated into existing clinical workflows while minimizing additional economic and medical burdens on patients and clinicians. However, the practical application of the model in clinical settings must also be comprehensively considered. Given that the optimal model in this study still includes some laboratory test indicators and pathological images, in relatively resource-limited settings where laboratory test indicators are unavailable, Model 1, which has a lower dependency on laboratory tests while maintaining relatively good predictive performance, can be considered a practical alternative. Conversely, when laboratory conditions permit, it is recommended to prioritize Model 3, which integrates pathological images into the LSIL prediction model and offers better predictive performance than Model 1.

Another significant finding of our study is the identification of the role of sleep quality in the spontaneous clearance process following HPV infection. Both the variable importance screening and the SHAP analyses of the predictive model highlight that enhanced sleep quality markedly facilitates the spontaneous clearance of HPV and the reversal of LSIL. This finding may be explained from several perspectives. Firstly, sleep quality may influence the outcome of HPV infection by affecting immune function. The immune system plays a crucial role in clearing HPV and preventing the infection from progressing to more severe lesions; poor sleep quality could weaken the immune system’s efficacy, thus potentially reducing the body’s ability to clear HPV infection [ 30 , 31 , 32 ]. Secondly, poor sleep quality may increase chronic inflammation in the body [ 33 , 34 ], which is considered a significant factor in the development of cervical cancer [ 35 , 36 ]. Therefore, sleep quality might affect HPV clearance and LSIL reversal through inflammatory responses. Thirdly, sleep quality can affect the balance of hormones such as estrogen and progesterone, which have been implicated in the course of HPV infection and the pathogenesis of cervical lesions [ 37 , 38 , 39 ]. Fourthly, sleep issues may increase psychological stress in patients, and research shows that psychological stress can prolong the duration and exacerbate the severity of HPV-related diseases [ 40 , 41 , 42 ]. Lastly, sleep is involved in the regulation of cellular repair processes, including DNA repair. Inadequate sleep may impair the body’s ability to repair DNA damage in cervical cells caused by HPV, leading to an increased risk of lesion progression [ 43 , 44 ]. Similar findings have been corroborated in other studies. Sims et al. [ 45 ] highlighted that age over 65, insufficient sleep, disruption of the circadian rhythm, and chronic stress may contribute to stress-related insomnia in women, thereby elevating susceptibility to cervical-vaginal infections and the associated risk of cervical cancer. Garbarino et al. [ 46 ] and Moscicki et al. [ 47 ] independently validated the impact of sleep quality on cervical cancer, examining it through the lenses of microbiology and cytology, respectively. Additionally, Li et al. [ 38 ] identified a correlation indicating that a diminished sleep quality score is associated with an elevated risk of cervical cancer, based on their prospective study within the Kailuan Cohort. These results provide novel perspectives for clinical interventions at this stage. In clinical work, integrating the assessment and management of sleep quality into the process of diagnosis and treatment can improve patients’ sleep quality, which can better promote the reversal of lesions and the spontaneous clearance of HPV, thereby effectively preventing cervical cancer.

With the rapid advancement of big data, machine learning has become a vital analytical tool in emerging data science. In this study, we used six machine learning methods and logistic regression to develop a predictive model. Among the machine learning methods, RF stood out, benefiting from its ability to process high-dimensional data and to capture nonlinear relationships between features. RF enhances accuracy by constructing multiple decision trees and ensembling their predictions, thereby mitigating the risk of overfitting and augmenting the model’s generalization capacity. However, it is important to recognize that machine learning does not supplant traditional statistical analysis. Specifically, parametric models have the potential to outperform machine learning algorithms, especially when dealing with small datasets. This is also reflected in our study, where the predictive performance of the logistic regression model was better than that of machine learning algorithms such as NN and KNN. This may be because, as the complexity of the model increases, the fitting error on the training data decreases, which can lead to problems of overfitting and generalization error [ 48 ]. Another important issue to consider in the application of machine learning models is the privacy of personal health data. In this study, the model was constructed with full attention to data privacy. In future applications, strict measures such as anonymization should be employed to ensure the protection of data privacy.

Several limitations should be considered. First, the models were developed based on the dataset derived from a single center. Due to the relatively limited focus on the LSIL stage and the availability of related datasets, external validation was not feasible within the scope of this study. We have planned to undertake this crucial step in future research. Despite these challenges, we have made efforts to ensure the generalizability of our predictive model. Firstly, our study cohort was sourced from a tertiary A-level hospital, with patients coming from a wide geographic area, which improves sample representativeness. Meanwhile, we performed extensive cross-validation to enhance the model’s robustness. In the future, we will further expand the scope of cohort studies while actively seeking partnerships for multicenter research. This will allow for more extensive external independent validation to confirm the model’s generalizability and robustness. Furthermore, this study focused on predicting disease progression following HPV infection based on pathological images and clinical features. While external environmental factors play a role in disease occurrence and development, the impact of genetics on disease progression cannot be overlooked. In subsequent studies, we plan to incorporate genetic influences into the model. Beyond genetics, the model can be enriched with epidemiological data, biomarker information, various omics data, and other characteristic data related to individual infection risk. These multi-omics data and broader data types can provide a more comprehensive view of the biological processes involved in the progression of LSIL, help identify new biomarkers, and elucidate complex interactions that might not be evident through single-omics analysis. Enhancing the model with these diverse data types can further improve the comprehensiveness of risk assessment and increase its practical value. This, in turn, provides clearer and more effective guidance for intervention strategies against cervical cancer and its precancerous lesions. Additionally, developing a dynamic model of lesion progression is an important research direction. Establishing such a model is crucial for understanding real-time disease progression in patients and adjusting medical intervention strategies accordingly. In the future, we will consider further developing dynamic models to enhance the disease risk assessment process. Our aim is to achieve dynamic evaluation of the entire disease progression, providing more effective tools for treatment intervention during the LSIL stage and cervical cancer prevention.

This study specifically targeted crucial spontaneous regression stages in the progression from HPV infection to cervical cancer. The developed prognostic model for cervical precancerous lesions ensures high predictive performance and includes interpretability analysis. The model aids in individualized risk prediction for patients. Combined with existing clinical guidelines, this model can help clinicians gain a more intuitive understanding of a patient’s current disease progression status in clinical practice. The model’s predictions can assist in determining more personalized intervention strategies for patients. In comparison to existing cervical cancer prediction models, this study advances the predicted outcomes to the spontaneous regression stages in the disease development process. This innovation holds significant implications for enhancing comprehensive prevention capabilities, ultimately contributing to a reduction in the societal burden of cervical cancer.

Data availability

All data and code relevant to the study that are not in the article and supplementary material are available from the corresponding author on reasonable request.

Abbreviations

ASCCP: American Society for Colposcopy and Cervical Pathology
ASCUS: Atypical squamous cell of undetermined significance
AUC: Area under the ROC curve
CIN: Cervical intraepithelial neoplasia
DT: Decision tree
GLCM: Grey Level Co-occurrence Matrix
HPV: Human papillomavirus
HR-HPV: High-risk HPV
HSIL: High-grade squamous intraepithelial lesions
KNN: K nearest neighbor
LSIL: Low-grade squamous intraepithelial lesions
NB: Naive Bayesian
NN: Neural network
RF: Random forests
ROC: Receiver operating characteristic curve
SAS: Self-Rating Anxiety Scale
SHAP: Shapley additive explanations
SRSS: Self-Rating Scale of Sleep
SVM: Support vector machines
TCT: ThinPrep cytologic testing

Amin FAS, Un Naher Z, Ali PSS. Molecular markers predicting the progression and prognosis of human papillomavirus-induced cervical lesions to cervical cancer. J Cancer Res Clin Oncol. 2023;149(10):8077–86.


Skinner SR, Wheeler CM, Romanowski B, Castellsague X, Lazcano-Ponce E, Del Rosario-Raymundo MR, Vallejos C, Minkina G, Pereira Da Silva D, McNeil S, et al. Progression of HPV infection to detectable cervical lesions or clearance in adult women: analysis of the control arm of the VIVIANE study. Int J Cancer. 2016;138(10):2428–38.


Wang F, Wang Z, Zhao L, Xin L, He S, Wang T. A prospective cohort study on the effect of circadian clock rhythm related factors on female HPV negative conversion. Chin J Oncol Prev Treat. 2022;14(05):552–7.


Gupta S, Nagtode N, Chandra V, Gomase K. From diagnosis to treatment: exploring the latest management Trends in Cervical Intraepithelial Neoplasia. Cureus. 2023;15(12):e50291.


Kacerovsky M, Musilova I, Baresova S, Kolarova K, Matulova J, Wiik J, Sengpiel V, Jacobsson B. Cervical excisional treatment increases the risk of intraamniotic infection in subsequent pregnancy complicated by preterm prelabor rupture of membranes. Am J Obstet Gynecol 2022.

Santesso N, Mustafa RA, Wiercioch W, Kehar R, Gandhi S, Chen Y, Cheung A, Hopkins J, Khatib R, Ma B, et al. Systematic reviews and meta-analyses of benefits and harms of cryotherapy, LEEP, and cold knife conization to treat cervical intraepithelial neoplasia. Int J Gynaecol Obstet. 2016;132(3):266–71.

Yang Y, Hu T, Ming X, Yang E, Min W, Li Z. REBACIN(R) is an optional intervention for persistent high-risk human papillomavirus infection: a retrospective analysis of 364 patients. Int J Gynaecol Obstet. 2021;152(1):82–7.

Sparic R, Bukumiric Z, Stefanovic R, Tinelli A, Kostov S, Watrowski R. Long-term quality of life assessment after excisional treatment for cervical dysplasia. J Obstet Gynaecol. 2022;42(7):3061–6.

Singh D, Vignat J, Lorenzoni V, Eslahi M, Ginsburg O, Lauby-Secretan B, Arbyn M, Basu P, Bray F, Vaccarella S. Global estimates of incidence and mortality of cervical cancer in 2020: a baseline analysis of the WHO Global Cervical Cancer Elimination Initiative. Lancet Glob Health. 2023;11(2):e197–206.

Lee YY, Kim TJ, Kim JY, Choi CH, Do IG, Song SY, Sohn I, Jung SH, Bae DS, Lee JW, et al. Genetic profiling to predict recurrence of early cervical cancer. Gynecol Oncol. 2013;131(3):650–4.

Paik ES, Lim MC, Kim MH, Kim YH, Song ES, Seong SJ, Suh DH, Lee JM, Lee C, Choi CH. Prognostic model for survival and recurrence in patients with early-stage cervical Cancer: a Korean Gynecologic Oncology Group Study (KGOG 1028). Cancer Res Treat. 2020;52(1):320–33.

Rothberg MB, Hu B, Lipold L, Schramm S, Jin XW, Sikon A, Taksler GB. A risk prediction model to allow personalized screening for cervical cancer. Cancer Causes Control. 2018;29(3):297–304.

Ijaz MF, Attique M, Son Y. Data-Driven Cervical Cancer Prediction Model with Outlier Detection and Over-sampling methods. Sens (Basel) 2020, 20(10).

Al Mudawi N, Alazeb A. A model for Predicting Cervical Cancer using machine learning algorithms. Sens (Basel) 2022, 22(11).

Austin RM, Onisko A, Druzdzel MJ. The Pittsburgh Cervical Cancer Screening Model: a risk assessment tool. Arch Pathol Lab Med. 2010;134(5):744–50.

Charlton BM, Carwile JL, Michels KB, Feldman S. A cervical abnormality risk prediction model: can we use clinical information to predict which patients with ASCUS/LSIL pap tests will develop CIN 2/3 or AIS? J Low Genit Tract Dis. 2013;17(3):242–7.


Koeneman MM, van Lint FHM, van Kuijk SMJ, Smits LJM, Kooreman LFS, Kruitwagen R, Kruse AJ. A prediction model for spontaneous regression of cervical intraepithelial neoplasia grade 2, based on simple clinical parameters. Hum Pathol. 2017;59:62–9.

Sheng B, Yao D, Du X, Chen D, Zhou L. Establishment and validation of a risk prediction model for high-grade cervical lesions. Eur J Obstet Gynecol Reprod Biol. 2023;281:1–6.

Ferrara P, Dallagiacoma G, Alberti F, Gentile L, Bertuccio P, Odone A. Prevention, diagnosis and treatment of cervical cancer: a systematic review of the impact of COVID-19 on patient care. Prev Med. 2022;164:107264.

Omone OM, Kozlovszky M. The Associations Between HPV-infections Associated Risk Factors and Cervical Cancer Associated Risk Factors Using Chi-square Method. In: 2022 IEEE 26th International Conference on Intelligent Engineering Systems (INES): 12–15 Aug. 2022 2022 ; 2022: 000225–000230.

Baltzer N, Nygård M, Sundström K, Dillner J, Nygård J, Komorowski J. Stratifying Cervical Cancer Risk with Registry Data. In: 2018 IEEE 14th International Conference on e-Science (e-Science): 29 Oct.-1 Nov. 2018 2018 ; 2018: 288–289.

Gonzalez R, Woods R, Ruan Q, Ruan Y. Digital Image Processing (Second Edition). Publishing House of Electronics Industry; 2003.

Smith AR. Color gamut transform pairs. ACM Siggraph Comput Graphics. 1978;12(3):12–9.


Huang ZC, Chan P, Ng W, Yeung DS. Content-Based Image Retrieval using Color Moment and Gabor Texture Feature. In: International Conference on Machine Learning & Cybernetics : 2012; 2012.

Turner MR. Texture discrimination by Gabor functions. Biol Cybern. 1986;55(2–3):71–82.

Fox J, Monette G. Generalized Collinearity Diagnostics. J Am Stat Assoc. 1992;87(417):178–83.


Kursa MB, Jankowski A, Rudnicki WR. Boruta–a system for feature selection. Fundamenta Informaticae. 2010;101(4):271–85.

Rinaudo P, Boudah S, Junot C, Thevenot EA. Biosigner: a New Method for the Discovery of Significant Molecular signatures from Omics Data. Front Mol Biosci. 2016;3:26.

Lundberg S, Lee SI. A Unified Approach to interpreting model predictions. Adv Neural Inf Process Syst 2017(30).

Besedovsky L, Lange T, Haack M. The Sleep-Immune Crosstalk in Health and Disease. Physiol Rev. 2019;99(3):1325–80.

Besedovsky L, Lange T, Born J. Sleep and immune function. Pflugers Arch. 2012;463(1):121–37.

McAlpine CS, Kiss MG, Zuraikat FM, Cheek D, Schiroli G, Amatullah H, Huynh P, Bhatti MZ, Wong LP, Yates AG et al. Sleep exerts lasting effects on hematopoietic stem cell function and diversity. J Exp Med 2022, 219(11).

Irwin MR. Sleep and inflammation: partners in sickness and in health. Nat Rev Immunol. 2019;19(11):702–15.

Garbarino S, Lanteri P, Bragazzi NL, Magnavita N, Scoditti E. Role of sleep deprivation in immune-related disease risk and outcomes. Commun Biol. 2021;4(1):1304.

Vitkauskaite A, Urboniene D, Celiesiute J, Jariene K, Skrodeniene E, Nadisauskiene RJ, Vaitkiene D. Circulating inflammatory markers in cervical cancer patients and healthy controls. J Immunotoxicol. 2020;17(1):105–9.

Deivendran S, Marzook KH, Radhakrishna Pillai M. The role of inflammation in cervical cancer. Adv Exp Med Biol. 2014;816:377–99.

Cheng YS, Tseng PT, Wu MK, Tu YK, Wu YC, Li DJ, Chen TY, Su KP, Stubbs B, Carvalho AF, et al. Pharmacologic and hormonal treatments for menopausal sleep disturbances: a network meta-analysis of 43 randomized controlled trials and 32,271 menopausal women. Sleep Med Rev. 2021;57:101469.

Li W, Li C, Liu T, Wang Y, Ma X, Xiao X, Zhang Q, Qu J. Self-reported sleep disorders and the risk of all cancer types: evidence from the Kailuan Cohort study. Public Health. 2023;223:209–16.

Beroukhim G, Esencan E, Seifer DB. Impact of sleep patterns upon female neuroendocrinology and reproductive outcomes: a comprehensive review. Reprod Biol Endocrinol. 2022;20(1):16.

Fang CY, Miller SM, Bovbjerg DH, Bergman C, Edelson MI, Rosenblum NG, Bove BA, Godwin AK, Campbell DE, Douglas SD. Perceived stress is associated with impaired T-cell response to HPV16 in women with cervical dysplasia. Ann Behav Med. 2008;35(1):87–96.

Lugovic-Mihic L, Cvitanovic H, Djakovic I, Kuna M, Seserko A. The influence of psychological stress on HPV infection manifestations and carcinogenesis. Cell Physiol Biochem. 2021;55(S2):71–88.

Poller WC, Downey J, Mooslechner AA, Khan N, Li L, Chan CT, McAlpine CS, Xu C, Kahles F, He S, et al. Brain motor and fear circuits regulate leukocytes during acute stress. Nature. 2022;607(7919):578–84.


Zada D, Bronshtein I, Lerer-Goldshtein T, Garini Y, Appelbaum L. Sleep increases chromosome dynamics to enable reduction of accumulating DNA damage in single neurons. Nat Commun. 2019;10(1):895.

Zada D, Sela Y, Matosevich N, Monsonego A, Lerer-Goldshtein T, Nir Y, Appelbaum L. Parp1 promotes sleep, which enhances DNA repair in neurons. Mol Cell. 2021;81(24):4979–e49934977.

Sims TT, Colbert LE, Klopp AH. The role of the cervicovaginal and gut microbiome in cervical intraepithelial neoplasia and cervical cancer. J Immunotherapy Precision Oncol. 2021;4(2):72–8.

Garbarino S, Lanteri P, Bragazzi NL, Magnavita N, Scoditti E. Role of sleep deprivation in immune-related disease risk and outcomes. Commun Biology. 2021;4(1):1304.

Moscicki A-B, Shi B, Huang H, Barnard E, Li H. Cervical-vaginal microbiome and associated cytokine profiles in a prospective study of HPV 16 acquisition, persistence, and clearance. Front Cell Infect Microbiol. 2020;10:569022.

Yongmiao H, Shouyang W. Big data, machine learning and statistics: challenges and opportunities. China J Econometrics. 2021;1(1):17.


Acknowledgements

We thank all the patients for their cooperation in this study.

This study was supported by the National Natural Science Foundation of China [grant numbers: 81872715 and 82073674] and the Fundamental Research Program of Shanxi Province [grant number: 202203021212382].

Author information

Authors and affiliations

Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, 030001, China

Simin He, Guiming Zhu, Boran Yang, Juping Wang & Tong Wang

Key Laboratory of Coal Environmental Pathogenicity and Prevention, Shanxi Medical University, Ministry of Education, Taiyuan, 030001, China

Department of Obstetrics and Gynecology, First Hospital of Shanxi Medical University, Taiyuan, 030001, China

Ying Zhou & Zhaoxia Wang


Contributions

T.W. conceived the idea and contributed to the interpretation of the results. T.W. and S.H. developed the model. S.H. conducted the analysis with assistance from G.Z.; interpreted the results with assistance from B.Y. and J.W.; and drafted and revised the manuscript with input from all other authors. Z.W. and Y.Z. acquired data and helped interpretation of data. All authors approved the final manuscript.

Corresponding author

Correspondence to Tong Wang.

Ethics declarations

Ethics approval and consent to participate

The study protocol has been reviewed and approved by the Institutional Review Board in the First Hospital of Shanxi Medical University. All subjects involved in the study signed the informed consent.

Consent for publication

Not applicable.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

He, S., Zhu, G., Zhou, Y. et al. Predictive models for personalized precision medical intervention in spontaneous regression stages of cervical precancerous lesions. J Transl Med 22 , 686 (2024). https://doi.org/10.1186/s12967-024-05417-y


Received : 24 January 2024

Accepted : 19 June 2024

Published : 26 July 2024

DOI : https://doi.org/10.1186/s12967-024-05417-y


  • Predictive model
  • HPV infection
  • Cervical cancer
  • Machine learning
  • Pathological images



  • Open access
  • Published: 29 July 2024

Critical thresholds of long-pressure reactivity index and impact of intracranial pressure monitoring methods in traumatic brain injury

  • Erik Hong,
  • Logan Froese,
  • Emeli Pontén,
  • Alexander Fletcher-Sandersjöö,
  • Charles Tatter,
  • Emma Hammarlund,
  • Cecilia A. I. Åkerlund,
  • Jonathan Tjerkaski,
  • Peter Alpkvist,
  • Jiri Bartek Jr,
  • Rahul Raj,
  • Caroline Lindblad,
  • David W. Nelson,
  • Frederick A. Zeiler &
  • Eric P. Thelin

Critical Care volume 28, Article number: 256 (2024)


Moderate-to-severe traumatic brain injury (TBI) has a global mortality rate of about 30%, resulting in acquired life-long disabilities in many survivors. To potentially improve outcomes in this TBI population, the management of secondary injuries, particularly the failure of cerebrovascular reactivity (assessed via the pressure reactivity index, PRx, a correlation between intracranial pressure (ICP) and mean arterial blood pressure (MAP)), has gained interest in the field. However, derivation of PRx requires high-resolution data and expensive technological solutions, as calculations use a short time-window, which has resulted in it being used in only a handful of centers worldwide. As a solution to this, a low-resolution (longer time-window) PRx has been suggested, known as Long-PRx or LPRx. Though LPRx has been proposed, little is known about the best methodology to derive this measure, with different thresholds and time-windows proposed. Furthermore, the impact of the ICP monitoring method on cerebrovascular reactivity measures is poorly understood. Hence, this observational study establishes critical thresholds of LPRx associated with long-term functional outcome, comparing different time-windows for calculating LPRx and evaluating LPRx determined through external ventricular drain (EVD) vs intraparenchymal pressure device (IPD) ICP monitoring.

The study included a total of n = 435 TBI patients from the Karolinska University Hospital. Patients were dichotomized into alive vs. dead and favorable vs. unfavorable outcomes based on 1-year Glasgow Outcome Scale (GOS). Pearson’s chi-square values were computed for incrementally increasing LPRx or ICP thresholds against outcome. The thresholds that generated the greatest chi-squared value for each LPRx or ICP parameter had the highest outcome discriminatory capacity. This methodology was also completed for the segmentation of the population based on EVD, IPD, and time of data recorded in hospital stay.
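As an illustration of this threshold-sweep procedure (not the authors' code), a chi-square statistic could be computed for each candidate LPRx cut-off against the dichotomized outcome roughly as follows; the file and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical per-patient data: mean LPRx over the monitoring period and a
# dichotomized 1-year outcome (1 = unfavorable, 0 = favorable).
df = pd.read_csv("lprx_outcomes.csv")

best = (None, 0.0)
for thr in np.arange(-0.5, 0.8, 0.05):                 # incrementally increasing thresholds
    above = df["lprx_mean"] > thr
    table = pd.crosstab(above, df["unfavorable"])
    if table.shape != (2, 2):                          # skip degenerate 2x2 tables
        continue
    chi2, p, dof, expected = chi2_contingency(table)
    if chi2 > best[1]:
        best = (thr, chi2)

print(f"threshold with greatest discriminative chi-square: LPRx > {best[0]:.2f}")
```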

LPRx calculated with 10–120-min windows behaved similarly, with maximal chi-square values occurring at an LPRx of around 0.25–0.35 for both survival and favorable outcome. When investigating the temporal relations of the LPRx-derived thresholds, the first 4 days appeared to be the most associated with outcome. Segmentation of the data based on intracranial monitoring method found limited differences between EVD and IPD, with similar LPRx values of around 0.3.

Our work suggests that the underlying prognostic factors causing impairment in cerebrovascular reactivity can, to some degree, be detected using lower-resolution PRx metrics (similar threshold values were found), with LPRx derived clinically from windows of as few as 10 minute-by-minute samples of MAP and ICP. Furthermore, EVD-derived LPRx with intermittent cerebrospinal fluid drainage seems to present similar outcome capacity as IPD. This low-resolution, low-sample LPRx method appears to be an adequate substitute for the clinical prognostic value of PRx and may be implemented independent of ICP monitoring method when PRx is not feasible, though further research is warranted.

Introduction

Moderate-to-severe traumatic brain injury (TBI) is a deleterious condition with a global mortality rate of about 30%, resulting in acquired life-long disabilities in many survivors [ 1 ]. Specialized neuro-critical care units (NCCU), where invasive monitoring is employed, have been shown to improve outcomes as compared to treatment in conventional critical care units [ 2 , 3 ]. However, despite improvements in monitoring, about 40% of severe TBI patients deteriorate, presumably due to secondary brain injuries caused by a deranged metabolism, inadequate perfusion, and other intracranial insults [ 4 , 5 ]. In our regional TBI database, we have seen 39% of patients present with secondary lesions, not seen on admission imaging, that are predominantly lesions of an ischemic nature [ 6 ]. Thus, better monitoring is required to improve outcomes and prevent potentially irreversible secondary cerebral injuries in severe TBI patients.

The pressure reactivity index (PRx), as a surrogate for cerebrovascular reactivity, has been suggested as a metric that could be monitored in order to prevent secondary insults such as pressure-passive ischemia or hyperemia by taking the intracranial autoregulatory capacity into consideration [ 7 ]. PRx is commonly calculated as a moving Pearson’s correlation between intracranial pressure (ICP) and mean arterial pressure (MAP), each averaged over 10-s periods, using 5-min moving time-windows [ 7 , 8 , 9 , 10 ]. PRx ranges from − 1 (intact autoregulatory capacity) to 1 (impaired autoregulatory capacity), with established critical thresholds where PRx > 0.35 and > 0.25 have been associated with mortality and > 0.05 with unfavorable outcome at 6 months [ 11 , 12 , 13 , 14 , 15 ].
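For orientation, PRx-type indices of this kind are typically derived as a moving correlation between ICP and MAP. A minimal pandas sketch, with hypothetical series names, no artifact handling, and window settings taken from the descriptions above, might look like this.

```python
import pandas as pd

# `icp` and `map_` are pandas Series of invasive ICP and MAP indexed by time,
# here assumed to be available at (at least) 10-second resolution.
icp_10s = icp.resample("10s").mean()
map_10s = map_.resample("10s").mean()

# Standard PRx: moving Pearson correlation over a 5-min window (30 ten-second means).
prx = icp_10s.rolling(window=30, min_periods=30).corr(map_10s)

# LPRx: the same correlation computed from minute-by-minute means over a longer
# window, e.g. 20 one-minute samples for a 20-min window.
icp_1m = icp.resample("1min").mean()
map_1m = map_.resample("1min").mean()
lprx_20 = icp_1m.rolling(window=20, min_periods=20).corr(map_1m)
```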

However, the problem with PRx is that it requires high-resolution data and potentially expensive information technology (IT) solutions, which has resulted in it being used clinically in only a handful of centers worldwide. As a solution to this, a low-resolution PRx has been suggested, known as Long-PRx or LPRx [ 16 , 17 ]. Previous studies have looked at time-windows from 5 to 240 min and found that LPRx holds similar outcome-predictive capacity to PRx [ 16 , 18 , 19 , 20 , 21 ]. Yet, studies calculating both PRx and LPRx in the same cohort found PRx to have stronger associations with outcome than LPRx [ 22 , 23 ]. However, in a smaller cohort we analyzed down-sampled PRx and ICP/MAP values and found that, although much of the granularity in the data is lost when moving to minute-by-minute ICP/MAP data with 20-min time-windows for LPRx derivation, a similar time-domain statistical structure exists for PRx and LPRx [ 21 , 24 ]. Thus, the vector-domain temporal relationships between ICP and MAP are preserved, providing confidence in the ability of LPRx to assess some aspects of cerebral autoregulation [ 24 ]. However, such work has been limited to date, and thus LPRx as a measure remains underexplored.

To date, only a single-center cohort study has investigated the critical thresholds of LPRx in TBI [ 21 ]; furthermore, several studies have used different cut-offs and time-windows [ 16 , 18 , 19 , 20 , 21 , 22 , 23 ]. Thus, it is still unclear which time-window and threshold of LPRx is most appropriate, or whether existing published critical thresholds for standard PRx can be used for LPRx monitoring. Moreover, almost all previous studies combine ICP monitoring from intraparenchymal devices (IPD) and external ventricular drains (EVD). Worldwide, ICP is still frequently measured using EVDs (only about 15% in Europe, but EVDs are believed to constitute the majority in low- and middle-income countries (LMIC)) [ 25 , 26 , 27 ], making it important to establish a method that works for both types of acquired ICP.

Hence, this observational study aims to explore LPRx within a large TBI database to (A) establish critical thresholds of LPRx that are associated with long-term functional outcome, (B) determine which time-window for calculating LPRx is optimal for outcome prediction, and (C) investigate whether LPRx derived from EVDs differs from that derived from intraparenchymal ICP devices. Our hypothesis is that thresholds similar to those seen for PRx will be valid for LPRx, and that time-windows up to 20 min will perform similarly to 5-min time-windows.

Materials and methods

Study design

Between January 1, 2006 and December 31, 2019, patients admitted to the adult NCCU at Karolinska University Hospital, Stockholm, Sweden, a level one trauma center, with moderate to severe TBI (Glasgow Coma Scale (GCS) ≤ 8) and aged > 15 years were included in this study. All patients had invasive ICP and MAP monitoring for more than 6 h that was archived at high frequency (1–5-min median levels) and were retrospectively analyzed (n = 435) in this observational study. Treatment followed local guidelines in general concordance with those of the Brain Trauma Foundation (BTF) [2, 28, 29], and is described in detail elsewhere [30]. These patients were mechanically ventilated, with arterial partial pressure of CO2 (PaCO2) targets used, where normal to mild hyperventilation (defined here as PaCO2 4.5–5 kPa) was commonly applied as one of several measures to manage increased ICP. The head of the bed was commonly elevated 30 degrees, and cerebral perfusion pressure (CPP) was calculated with the arterial pressure transducer placed at the level of the tragus (some patients had a dual transducer to measure arterial blood pressure at both the cardiac and cerebral level) [31]. As part of our local patient registry, the Glasgow Outcome Scale (GOS) was prospectively acquired through questionnaires and telephone interviews at about 12 months following injury [32].

This study was approved by the Swedish Ethical Review Authority (#2020–05227) on November 17, 2020 and adheres to the Helsinki Declaration of 1975.

Data collection

The patient data collection was identical to that previously described [33]. In summary, all patient demographic, injury and treatment information was either manually collected by a medical professional from the electronic hospital chart system Take Care (CompuGroup Medical Sweden AB, Stockholm, Sweden) or automatically recorded using Clinisoft (Centricity Critical Care, CCC, General Electric Company, Boston, USA). The worst pre-sedation/intubation GCS score was used. Pre-hospital hypoxia (oxygen saturation < 90%) or hypotension (systolic blood pressure < 90 mmHg) was registered from the scene of the accident or at hospital admission [3]. The admission computerized tomography (CT) scan was assessed using the Marshall CT classification [34]. Primary decompressive craniectomy was defined as a craniectomy performed as the initial surgery (i.e. where the bone flap was not returned following initial evacuation surgery, or due to diffuse injury and brain swelling), while a secondary decompressive craniectomy was defined as a hemicraniectomy performed at least 48 h after trauma due to refractory high ICP [35].

MAP was obtained through either radial or femoral arterial lines connected to pressure transducers (Baxter Healthcare Corp. CardioVascular Group, Irvine, CA, or similar devices). ICP was acquired via an intra-parenchymal strain gauge probe (Codman ICP MicroSensor; Codman & Shurtleff Inc., Raynham, MA, USA), a Raumedic Neurovent-P catheter (Raumedic AG, Münchberg, Germany), a parenchymal fiber optic pressure sensor (Camino ICP Monitor, Integra Life Sciences, Plainsboro, NJ, USA; https://www.integralife.com/) or an EVD (Medtronic, Minneapolis, MN, USA). Both the MAP and ICP data were cleaned of artifacts using direct visual inspection and threshold limits (0 < ICP < 80 mmHg and 0 < MAP < 400 mmHg). ICP data recorded while drains were open were identified by manual annotations in CCC, verified by manual inspection, and removed. Thus, all periods during which an EVD was open for cerebrospinal fluid drainage were excluded.
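A minimal sketch of the stated threshold limits is shown below; visual inspection and removal of open-drain periods are manual steps and are not reproduced here, and the array names are illustrative assumptions.

```python
import numpy as np

def apply_threshold_limits(icp, abp):
    """Set samples outside the stated limits (0 < ICP < 80 mmHg,
    0 < MAP < 400 mmHg) to NaN so window-based calculations skip them."""
    icp = np.asarray(icp, dtype=float).copy()
    abp = np.asarray(abp, dtype=float).copy()
    bad = ~((icp > 0) & (icp < 80) & (abp > 0) & (abp < 400))
    icp[bad] = np.nan
    abp[bad] = np.nan
    return icp, abp
```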

Signal processing

Collected data were stored in the database as the median for each sampling period, predominantly 2 min but ranging from 0.5 to 5 min (median 1 min, interquartile range 1–2 min), generating unevenly sampled time-series data (see Appendix A for details). It should be noted that CCC was not designed as a research tool, so the reasons why ICP and MAP values were sampled irregularly are hard to fully identify; they include data recording policy, adjustments in storage allotments, and modified sampling rates of the CCC system over the years. We performed two complete analyses on this database: one in which we imputed the data to give regularly sampled data, and one that used the data as-is (with the sporadic sampling). Further details on the imputation method can be found in Appendix B. Given that the overall results were nearly identical (statistically similar for all key thresholds), the sporadic (non-imputed) data are presented and referenced for the rest of this manuscript. In all tables, the data are represented as grand means for each patient, summarized with median levels and interquartile ranges.
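The imputation used for the companion analysis is described in Appendix B and is not reproduced here; the sketch below shows only one plausible way to place irregular 0.5–5-min median values onto a regular 1-min grid (a simplifying assumption on our part, not the authors' method).

```python
import pandas as pd

def to_minute_grid(timestamps, values, max_gap_min=5):
    """Align irregularly sampled median values to a 1-min grid and
    interpolate over gaps no longer than `max_gap_min` minutes."""
    s = pd.Series(values, index=pd.to_datetime(timestamps)).sort_index()
    grid = s.resample("1min").mean()          # regular 1-min bins
    return grid.interpolate(method="time", limit=max_gap_min)
```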

From the values of ICP and MAP, low-frequency PRx (LPRx) was derived via the moving Pearson’s correlation coefficient of multiple consecutive minute-by-minute samples, and calculated every minute [ 18 , 20 , 36 ]. LPRx values range from − 1 to 1, with higher values indicating increasingly impaired cerebrovascular reactivity as indicated by slow fluctuation responses of ICP to MAP changes. LPRx was calculated using 10, 15, 20, 30, 60, 90 and 120 consecutive samples (10–120 min of time) and labeled as: LPRx_10, LPRx_15, LPRx_20, LPRx_30, LPRx_60, LPRx_90 and LPRx_120; in line with previous literature on LPRx in TBI [ 18 , 20 , 36 ].
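A sketch of this derivation is given below, under the same hedges as above: `icp_min` and `map_min` are assumed minute-by-minute ICP and MAP series, and each window length matches one of the LPRx variants listed in the text.

```python
import pandas as pd

def lprx_windows(icp_min, map_min, windows=(10, 15, 20, 30, 60, 90, 120)):
    """LPRx as a moving Pearson correlation of minute-by-minute ICP and MAP,
    updated every minute, for each window length used in the study."""
    df = pd.DataFrame({"icp": icp_min, "map": map_min})
    out = pd.DataFrame(index=df.index)
    for w in windows:
        out[f"LPRx_{w}"] = df["icp"].rolling(w).corr(df["map"])
    return out
```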

Statistical analysis

All statistical analyses were performed using R statistical computing software (R Foundation for Statistical Computing (2020), Vienna, Austria, http://www.R-project.org/). This manuscript explores the relationships between final GOS (the last registered GOS) and various overall mean cerebral/physiological responses. From these data, overall mean values for LPRx and ICP were calculated for the entire patient recording, for the first 24/48/96/144 h (1/2/4/6 days), and for each day (days 1–7).

Pearson's chi-squared test was used to find the best threshold for ICP and LPRx values, analogous to past work [11, 37]. The data were dichotomized above/below thresholds from −0.5 to 0.7 (in increments of 0.05) for LPRx and from 0 to 40 mmHg (in increments of 0.5 mmHg) for ICP. Chi-squared tests were then performed between each dichotomized threshold and outcome. Outcomes were defined as survival (GOS 1 vs. 2–5) [38] or favorable outcome (GOS 1–3 vs. 4–5) [32]. For each threshold, a chi-squared statistic was calculated. The threshold with the highest chi-squared statistic was assumed to have the best discriminative value for outcome, indicating that this threshold value gave the most accurate categorization of the patient population. This procedure was repeated for all time periods (mean values over the full monitoring time, the first 1/2/4/6 days, and each of the first 7 days) as well as after creating subgroups according to EVD vs IPD. We also performed chi-squared analyses on patients without a decompressive craniectomy.
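A minimal sketch of this sequential chi-square search follows; the per-patient mean LPRx array and the binary outcome vector are assumed inputs, and details such as continuity correction may differ from the authors' R implementation.

```python
import numpy as np
from scipy.stats import chi2_contingency

def best_threshold(mean_lprx, died, thresholds=np.arange(-0.5, 0.75, 0.05)):
    """Dichotomize patients by each candidate mean-LPRx threshold,
    cross-tabulate against outcome (e.g. died vs survived), and keep
    the threshold with the largest chi-square statistic."""
    mean_lprx = np.asarray(mean_lprx, dtype=float)
    died = np.asarray(died, dtype=bool)
    best = (None, -np.inf)
    for t in thresholds:
        above = mean_lprx > t
        table = np.array([
            [np.sum(above & died),  np.sum(above & ~died)],
            [np.sum(~above & died), np.sum(~above & ~died)],
        ])
        if (table.sum(axis=0) == 0).any() or (table.sum(axis=1) == 0).any():
            continue                      # skip degenerate splits
        chi2, p, _, _ = chi2_contingency(table)
        if chi2 > best[1]:
            best = (round(float(t), 2), chi2)
    return best                           # (threshold, chi-square statistic)
```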

Next, using the same chi-squared technique described above, the analysis was repeated for the % of time LPRx spent above key thresholds (> 0, > 0.2 and > 0.3). These thresholds were chosen based on previously defined PRx thresholds (which are similar to those found in this manuscript) [12, 23, 37, 39]. Again, the threshold with the highest chi-squared statistic was assumed to have the best discriminative value for outcome, indicating that this % time of LPRx above the threshold gave the most accurate categorization of the patient population. This procedure was repeated for all time periods (% time LPRx above threshold over the full monitoring time, the first 1/2/4/6 days, and each of the first 7 days) as well as after creating subgroups according to EVD vs IPD.
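The per-patient burden measure is simple to compute; a short sketch is given below (input name is illustrative), after which the same chi-square sweep as above can be run on the resulting percentages.

```python
import numpy as np

def pct_time_above(lprx_series, threshold):
    """% of valid monitoring time a patient's LPRx spent above a
    key threshold (e.g. 0, 0.2 or 0.3)."""
    x = np.asarray(lprx_series, dtype=float)
    x = x[~np.isnan(x)]
    return 100.0 * np.mean(x > threshold) if x.size else np.nan
```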

Basic physiological statistics for MAP, ICP and LPRx were compared between the survival and favorable-outcome groups using Mann–Whitney U tests of their overall distributions. P values were not adjusted for multiple comparisons; the overall alpha for significance was set to 0.05.
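For completeness, a one-line sketch of this unadjusted group comparison (group arrays are assumed inputs):

```python
from scipy.stats import mannwhitneyu

def compare_groups(values_group_a, values_group_b):
    """Unadjusted Mann-Whitney U comparison of a physiological variable
    (e.g. per-patient mean LPRx) between two outcome groups; alpha = 0.05."""
    u, p = mannwhitneyu(values_group_a, values_group_b, alternative="two-sided")
    return u, p
```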

Patient characteristics

In total, 435 patients were eligible for the final analysis (Fig. 1), of whom 207 had an IPD and 228 an EVD. One patient had both monitors placed; for this patient, only the EVD data were used for analysis. The median age was 51 years (interquartile range; IQR: 33–62.5 years), with 338 (77.7%) being male (Table 1). 277 patients had at least 6 days of recorded physiology, and 432 had at least a full day of recording. Artifact removal resulted, on average, in less than 1% data loss per patient; however, for some patients (mostly EVD drainage patients) this was up to 40% of the monitoring time (this was rare, occurring in 10 patients). In total, 260 (59.8%) had intracranial mass lesions removed, and 44 (10.1%) had either a primary or secondary decompressive hemicraniectomy. TBI demographics are in keeping with typical TBI cohorts. Appendix C describes admission characteristics and type of monitoring for each year of recording, including outcome.

figure 1

Patient selection. Selection of the patient data from Stockholm; patients with inadequate monitoring, i.e. limited physiological data (< 6 h) or missing data for key physiologies, were excluded. The remaining n = 435 patients represent ICU TBI patients requiring invasive monitoring to optimize recovery. EVD, external ventricular drain; IPD, intraparenchymal monitor; TBI, traumatic brain injury

Critical thresholds for outcome prediction for LPRx

The sequential chi-squared method was performed for each LPRx window (LPRx_10/LPRx_15/LPRx_20/LPRx_30/LPRx_60/LPRx_90/LPRx_120, i.e. correlation coefficients derived over 10- to 120-min windows). Plots presenting the chi-squared values for incremental thresholds of mean LPRx over the full recording of each patient were generated for each parameter, for both Alive vs Dead and Favorable vs Unfavorable, and are presented in Fig. 2. For each plot, the threshold yielding the highest chi-squared value was identified as the critical threshold.

figure 2

Outcome for LPRx. The optimal dichotomization threshold across all LPRx windows ranged from 0.25 to 0.35, using both EVD and IPD monitoring over the full monitoring period (whole measurement period for all patients). EVD, external ventricular drain; IPD, intraparenchymal monitor; LPRx, long pressure reactivity index

For most of the cerebrovascular reactivity indices, similar critical thresholds were found for Alive vs Dead and Favorable vs Unfavorable outcome. The LPRx_10 and LPRx_15 plots produced peaks at 0.25 or 0.3 for both outcome types, though the Alive vs Dead dichotomization had improved chi-squared values compared to Favorable vs Unfavorable outcome. The longer periods of time (LPRx_30, LPRx_90, and LPRx_120) had a slightly higher critical threshold of 0.3 or 0.35 and a strong chi-squared for the Alive vs Dead categorization. This is in alignment with the findings tabulated in Table  1 and the Appendices.

Finally, the longer the LPRx time window, the lower the overall chi-squared value, with the 10-min window having the most significant chi-squared value.

There was a limited impact of monitoring time on LPRx thresholds, with thresholds varying between 0.2 and 0.35 for monitoring durations of 1 to 6 days (Appendix D/E). However, when investigating critical LPRx thresholds based solely on individual daily mean values (0–24/24–48/48–72… hours), chi-squared values were notably lower, and the threshold decreased as the recording moved further from the initial time of care (Appendix F/G, with Appendix H summarizing the daily patient demographics). This is most notable in the 4th to 5th day of recording, with almost all significance of LPRx lost after the 5th day (lower overall chi-squared values and increased p values).

Critical thresholds of LPRx – Impact of ICP Monitoring Method

The sequential chi-squared method was performed for each ICP monitoring method: EVD, IPD, and both combined into one group. Figure 2 shows EVD and IPD as one group, and Fig. 3 shows EVD alone and IPD alone for the full monitoring time (Appendix I/J shows patient demographics for IPD and EVD). Overall, there was a similar response between EVD- and IPD-monitored ICP and the derived LPRx measures (both overall mean values and identified thresholds), with peak values at thresholds of 0.2–0.35. There was limited difference in patients without a decompressive craniectomy (Appendix P).

figure 3

LPRx for different ICP monitoring methods. The figure displays different LPRx windows and the resulting thresholds for the different ICP monitoring methods, with similar overall results for IPD and EVD. Note also that as the LPRx window length increases, the chi-squared values decrease. EVD, external ventricular drain; ICP, intracranial pressure; IPD, intraparenchymal monitoring; LPRx, long pressure reactivity index; _10, 10-min window; _15, 15 min; _30, 30 min; _60, 60 min; _90, 90 min; _120, 120 min

Critical thresholds for outcome prediction for ICP

The sequential chi-squared method was performed for ICP, for patients with both EVD + IPD, EVD only, and IPD only (Appendix K). As indicated in Appendix K, the EVD-based critical threshold for mean ICP (16.5 mmHg) was lower than the IPD-based critical threshold (20/24 mmHg).

Appendix K/L/M show the time in the NCCU and the overall impact of the ICP critical thresholds. Of note, the first 2 days of care appeared to carry the most significance, with ICP data after the 4th day losing much of their discriminative capacity (as reflected in a reduced overall chi-squared magnitude).

% Time of LPRx above critical threshold

All analyses for this aspect can be found in Appendices Q–Y, which document the association between the % of time LPRx spent above each key threshold (> 0, > 0.2 and > 0.3) and outcome. Overall, spending about 50% of the time with LPRx > 0.3 was an indicator of poor outcome; for LPRx > 0, the corresponding figure was about 70–80% of the time, in line with what would be expected. Around the third to fourth day, the overall discriminative capacity of LPRx decreased, with the longer LPRx calculation windows (LPRx_60 to LPRx_120) having lower overall peak chi-squared values.

Discussion

This is the first manuscript to extensively explore and derive key critical thresholds for outcome association of LPRx over various time windows. Preceded by recent work from Riemann et al., in which three time windows were explored [21], this study offers unique insights into this surrogate measure of cerebral autoregulation. This preliminary work provides confirmatory findings on LPRx, indicating that it has overall prognostic thresholds similar to standard PRx; thus, as a clinical measure, LPRx is likely a viable substitute for PRx. Our results confirmed similar LPRx thresholds between IPDs and closed EVDs, though external validation of our results will be important.

As this was the second study to investigate critical thresholds for LPRx in a large TBI population, it bears highlighting that LPRx calculated over 10- to 120-min windows (with 1–2-min samples) displayed thresholds of 0.2–0.35, similar to those of PRx (which uses 10-s samples) [11, 12, 14, 21]. Although Riemann's past work on LPRx also saw a significant drop in chi-squared magnitude for larger LPRx calculation windows, they found LPRx to have highly variable critical thresholds and an overall lack of statistical significance (for more information, see Table 2) [21]. It is therefore likely that LPRx windows capture many of the prognostic factors associated with PRx, with larger LPRx windows reducing discriminative capacity. Furthermore, LPRx may be a viable substitute for PRx in situations where PRx cannot be calculated, including artifact-prone signals (e.g., monitoring effects from nursing interventions), and may provide a route to personalized cerebral autoregulation assessment when monitoring data are diminished (low MAP/ICP yields or sampling intervals of a minute or more). While these data come from a low-resolution legacy system (CCC), we are not trying to promote a specific system but rather to highlight that low-resolution PRx is a viable surrogate where PRx cannot be obtained, displaying chi-squared thresholds similar to historical values [11, 12, 14, 21]. However, for prospective monitoring of cerebrovascular reactivity measures today, bedside systems with these data available are more suitable.

Furthermore, past work has shown that patients who spend long periods of time in a dysautoregulated state (PRx > 0–0.3) have overall worse outcomes [12, 15, 37, 39, 42, 43], a finding reciprocated in our work. This highlights that LPRx likely carries descriptive information about outcome similar to that of PRx. All of this is in line with past work on the lower limit of autoregulation (LLA), where the LLA describes low systemic blood pressure linked with dysautoregulation [44, 45, 46, 47, 48, 49, 50]. The LLA has been clearly documented in animal models in which systemic blood pressure was decreased [45, 49, 50, 51, 52], and in some of this work the LLA was linked with a PRx value of ~0.3 [45, 49]. Such work has been expanded to develop individualized measures of care, with the optimal cerebral perfusion pressure (CPPopt) gaining extensive exploration in TBI care [36, 53, 54]. CPPopt uses the association between systemic blood pressure and PRx to provide a targetable, personalized value of care. Though still in its feasibility stage, CPPopt has demonstrated prognostic associations with outcome, with emerging work evaluating its impact [36, 54, 55]. All of this work may benefit from LPRx as a substitute for PRx where the more momentary assessment of PRx (10-s samples) is not feasible.

However, it should be noted that although we have demonstrated a prognostic similarity between LPRx and PRx, this does not mean that these measures are fully interchangeable. Many of the fast vasogenic components captured at PRx calculation timescales would likely be diminished at the larger time windows used by LPRx [56, 57]. This may account for the decrease in overall chi-squared values as the LPRx calculation window increases. Moreover, cerebrovascular reactivity factors at higher frequency ranges (< 1 min) would be impossible for LPRx to capture [56, 57]. Fundamentally, current PRx/LPRx measures are derived from correlated MAP and ICP values, and although factors that dramatically influence blood pressure likely influence PRx, recent work on decompressive craniectomies has demonstrated that PRx carried similar pathophysiological information pre/post treatment (reciprocated in our work) [58, 59].

With the current cerebrovascular reactivity measures, there is a limitation of the direct thresholding method used in this study, as there is a wide individual range of optimal patient thresholds (significant values range from 0 to 0.5). Moreover, compared to the strong relationship with survival, favorable outcome had lower overall chi-squared values and a less distinct peak (in keeping with past literature) [12, 13, 14, 15, 60]. Presumably, the immediate deranged intracranial dynamics play a smaller part in the long-term outcome prediction of survivors than of those who succumb to their injuries. Thus, although the dichotomization method for determining a threshold for a global population has some value, the more individual factors that drive LPRx in each patient need to be explored (an issue for all cerebrovascular reactivity measures currently in use) [11, 12, 24]. It should be re-emphasized that the post-hoc chi-squared analyses we performed are mainly intended to compare our low-resolution PRx with that of other PRx papers, and thus directly replicate the thresholding analyses underlying the currently widely referenced PRx thresholds in the literature and clinical guidelines [11, 12, 14, 21]. It must be mentioned that this method only provides a prognostic threshold, as the dichotomization focuses on long-term outcome scoring and thus does not necessarily represent a pure physiological threshold, but an epidemiological one. Nevertheless, pre-clinical literature does support some relation between cerebrovascular reactivity thresholds of ~+0.2–0.3 and identification of the LLA during systemic hypotension and intracranial hypertension, using both ICP- and infrared-based metrics [45, 49].

When analyzing monitoring time and these individual LPRx measures, they indicated similar overall critical thresholds associated with outcome (i.e. the first 1–6 days had similar LPRx thresholds). However, when splitting the data into individual daily measures, from day one to day six, the dichotomization thresholding methodology lost its significance the further the time was from day one. This is likely due to a number of factors, primarily that extreme patients (either those who died or those with fast recovery) would drop out of the recording, focusing the analysis on more dynamic patient cases as time goes on. Moreover, the longer a patient spends in the NCCU, the more their intracranial physiology would in theory normalize, making thresholding over these later periods less indicative of the initial NCCU state and less responsive to overall physiological derangement.

Given the nature of this population, we had the unique opportunity to evaluate the two most common ICP monitoring methodologies, namely EVD and IPD. As EVD monitoring allows for additional routes of care, such as cerebrospinal fluid (CSF) drainage, populations with only EVD monitoring are relevant to study. During periods of closed EVDs, the LPRx measures performed similarly regardless of monitoring device. This is in line with previous work showing that EVD use and CSF drainage have a limited overall impact on derived cerebrovascular reactivity indices [61, 62, 63]. However, it should be noted that EVD patients had slightly lower chi-squared values and lower overall LPRx/ICP critical thresholds compared to those with LPRx derived from IPD devices, which may be explained by lower ICP values in the EVD group than in the IPD group. EVD drainage allows for a simple modification of brain pressure (particularly the lowering of overall ICP values, reflected in the lower overall ICP thresholds observed in this population). Nevertheless, a significant threshold was still seen with the EVD-based measures.

Limitations

Despite the over 400 patients in this analyzed population, there are still significant limitations regarding heterogeneity and cofactor considerations. The segmentation of the data by ICP monitoring method resulted in about 200 patients in each category. Although effective for overall gross mean assessments, the cohort has considerable heterogeneity regarding TBI injury pattern, demographics and overall patient care, factors not accounted for in this analysis. To evaluate the effects of these potential confounders, a larger patient cohort would be needed.

Although this study uses similar methodology to earlier studies, it has its limitations, particularly in that the more individualized, momentary physiological aspects associated with patient care are not accounted for. As time and care progress, the direct response of these impaired states would in theory be mitigated or at least minimized, and thus the associations noted from extreme cases (i.e., the first days) would likely not be seen in the later days, as observed in this population. To address this, more momentary assessment and personalized evaluation of physiological treatment should be undertaken.

Past exploratory work on PRx has used the chi-squared approach to approximate the threshold with the best discriminative capacity in the data. Though useful for exploring data, this technique favors the best bifurcation of the data and does not account for potentially more relevant clinical factors (such as at what threshold a patient is in danger, or outlier patients who may have higher overall risk). Thus, for future work defining clinical thresholds for variables, other methods need to be explored. Methods based on the area under the curve that preserve sensitivity while maximizing specificity provide more clinically relevant information, because they are less prone to withholding needed treatment and favor identifying cases where the patient may be in danger. Therefore, when implementing LPRx/PRx in larger data modeling, exploring the optimal threshold through a diagnostic accuracy approach would be of benefit.

Next, PRx as a method of determining cerebrovascular reactivity is less robust than newer methods such as the pulse amplitude index or wavelet PRx [39, 64, 65, 66]. However, LPRx in this population appeared sufficiently accurate to yield critical threshold values similar to those of PRx; thus, when data limitations exist, the LPRx method can be considered.

Finally, owing to the retrospective nature of this study, it is likely that some patients had treatment withheld due to severe injuries that were not deemed survivable, or per the known wishes of the patient or of the next-of-kin. This is difficult to fully adjust for, but treatment withdrawal is generally uncommon at our institution. Likewise, we know from previous experience with the same cohort that few of the in-hospital mortality cases were due to withdrawal of treatment for the TBI itself; they were more likely due to multi-organ failure [67].

Conclusions

For LPRx determined over 10- to 20-min windows, we found a critical threshold of 0.25, which is similar to past PRx thresholding values, indicating that our LPRx carries clinical prognostic value similar to that of PRx. Therefore, in clinical settings where high-frequency PRx cannot be determined, LPRx is likely a sufficient substitute. As LPRx is derived from only minute-by-minute samples of MAP and ICP (with as few as 10 samples), this opens up the use of LPRx to more clinical centers globally. Further, as EVD- and IPD-derived LPRx performed similarly, LPRx can still be clinically determined despite intermittent CSF drainage. Therefore, as a clinical prognostic measure, LPRx is an adequate substitute for PRx, though more research is warranted to study its association with higher-resolution metrics of cerebrovascular reactivity.

Data availability

The data were collected from Swedish medical institutions and contain private patient information; for access, please contact Eric P Thelin for more details.

Maas AIR, Menon DK, Adelson PD, Andelic N, Bell MJ, Belli A, et al. Traumatic brain injury: integrated approaches to improve prevention, clinical care, and research. Lancet Neurol. 2017;16:987–1048.


Carney N, Totten AM, O’Reilly C, Ullman JS, Hawryluk GWJ, Bell MJ, et al. Guidelines for the management of severe traumatic brain injury, fourth edition. Neurosurgery. 2017;80:6–15.


McCredie VA, Alali AS, Scales DC, Rubenfeld GD, Cuthbertson BH, Nathens AB. Impact of ICU structure and processes of care on outcomes after severe traumatic brain injury: a multicenter cohort study. Crit Care Med. 2018;46:1139–49.

Narayan RK, Michel ME, Ansell B, Baethmann A, Biegon A, Bracken MB, et al. Clinical trials in head injury. J Neurotrauma. 2002;19:503–57.

Werner C, Engelhard K. Pathophysiology of traumatic brain injury. Br J Anaesth. 2007;99:4–9.


Thelin EP, Nelson DW, Bellander B-M. Secondary peaks of S100B in serum relate to subsequent radiological pathology in traumatic brain injury. Neurocrit Care. 2014;20:217–29.

Czosnyka M, Smielewski P, Kirkpatrick P, Laing RJ, Menon D, Pickard JD. Continuous assessment of the cerebral vasomotor reactivity in head injury. Neurosurgery. 1997;41:11–9.

Calviello LA, Donnelly J, Zeiler FA, Thelin EP, Smielewski P, Czosnyka M. Cerebral autoregulation monitoring in acute traumatic brain injury: what’s the evidence? Minerva Anestesiol. 2017;83:844–57.

Zeiler FA, Aries M, Czosnyka M, Smielewski P. Cerebral autoregulation monitoring in traumatic brain injury: an overview of recent advances in personalized medicine. J Neurotrauma. 2022;39:1477–94.

Zeiler FA, Ercole A, Czosnyka M, Smielewski P, Hawryluk G, Hutchinson PJA, et al. Continuous cerebrovascular reactivity monitoring in moderate/severe traumatic brain injury: a narrative review of advances in neurocritical care. Br J Anaesth. 2020;124:440–53.

Sorrentino E, Diedler J, Kasprowicz M, Budohoski KP, Haubrich C, Smielewski P, et al. Critical thresholds for cerebrovascular reactivity after traumatic brain injury. Neurocrit Care. 2012;16:258–66.

Zeiler FA, Donnelly J, Smielewski P, Menon DK, Hutchinson PJ, Czosnyka M. Critical thresholds of intracranial pressure-derived continuous cerebrovascular reactivity indices for outcome prediction in noncraniectomized patients with traumatic brain injury. J Neurotrauma. 2018;35:1107–15.

Zeiler FA, Ercole A, Beqiri E, Cabeleira M, Thelin EP, Stocchetti N, et al. Association between cerebrovascular reactivity monitoring and mortality is preserved when adjusting for baseline admission characteristics in adult traumatic brain injury: a center-TBI study. J Neurotrauma. 2020;37:1233–41.


Stein KY, Froese L, Sekhon M, Griesdale D, Thelin EP, Raj R, et al. Intracranial pressure-derived cerebrovascular reactivity indices and their critical thresholds: a canadian high resolution-traumatic brain injury validation study. J Neurotrauma. 2023;41:910–23.

Zeiler FA, Ercole A, Cabeleira M, Zoerle T, Stocchetti N, Menon DK, et al. Univariate comparison of performance of different cerebrovascular reactivity indices for outcome association in adult TBI: a CENTER-TBI study. Acta Neurochir (Wien). 2019;161:1217–27.

Sánchez-Porras R, Santos E, Czosnyka M, Zheng Z, Unterberg AW, Sakowitz OW. “Long” pressure reactivity index (L-PRx) as a measure of autoregulation correlates with outcome in traumatic brain injury patients. Acta Neurochir (Wien). 2012;154:1575–81.

Hasen M, Gomez A, Froese L, Dian J, Raj R, Thelin EP, et al. Alternative continuous intracranial pressure-derived cerebrovascular reactivity metrics in traumatic brain injury: a scoping overview. Acta Neurochir. 2020;162:1647–62.

Santos E, Diedler J, Sykora M, Orakcioglu B, Kentar M, Czosnyka M, et al. Low-frequency sampling for PRx calculation does not reduce prognostication and produces similar CPPopt in intracerebral haemorrhage patients. Acta Neurochir (Wien). 2011;153:2189–95.

Gritti P, Bonfanti M, Zangari R, Farina A, Longhi L, Rasulo FA, et al. Evaluation and application of ultra-low-resolution pressure reactivity index in moderate or severe traumatic brain injury. J Neurosurg Anesthesiol. 2023;35:313–21.

Depreitere B, Güiza F, Van den Berghe G, Schuhmann MU, Maier G, Piper I, et al. Pressure autoregulation monitoring and cerebral perfusion pressure target recommendation in patients with severe traumatic brain injury based on minute-by-minute monitoring data. J Neurosurg. 2014;120:1451–7.

Riemann L, Beqiri E, Younsi A, Czosnyka M, Smielewski P. Predictive and discriminative power of pressure reactivity indices in traumatic brain injury. Neurosurgery. 2020;87:655–63.

Lang EW, Kasprowicz M, Smielewski P, Santos E, Pickard J, Czosnyka M. Short pressure reactivity index versus long pressure reactivity index in the management of traumatic brain injury. J Neurosurg. 2015;122:588–94.

Riemann L, Beqiri E, Smielewski P, Czosnyka M, Stocchetti N, Sakowitz O, et al. Low-resolution pressure reactivity index and its derived optimal cerebral perfusion pressure in adult traumatic brain injury: a CENTER-TBI study. Crit Care. 2020;24:266.

Thelin EP, Raj R, Bellander B-M, Nelson D, Piippo-Karjalainen A, Siironen J, et al. Comparison of high versus low frequency cerebral physiology for cerebrovascular reactivity assessment in traumatic brain injury: a multi-center pilot study. J Clin Monit Comput. 2020;34:971–94.

Volovici V, Pisică D, Gravesteijn BY, Dirven CMF, Steyerberg EW, Ercole A, et al. Comparative effectiveness of intracranial hypertension management guided by ventricular versus intraparenchymal pressure monitoring: a CENTER-TBI study. Acta Neurochir. 2022;164:1693–705.

Clark D, Joannides A, Ibrahim Abdallah O, Olufemi Adeleye A, Hafid Bajamal A, Bashford T, et al. Management and outcomes following emergency surgery for traumatic brain injury - A multi-centre, international, prospective cohort study (the Global Neurotrauma Outcomes Study). Int J Surg Protoc. 2020;20:1–7.

Robba C, Graziano F, Rebora P, Elli F, Giussani C, Oddo M, et al. Intracranial pressure monitoring in patients with acute brain injury in the intensive care unit (SYNAPSE-ICU): an international, prospective observational cohort study. The Lancet Neurology. 2021;20:548–58.

Nordström C-H. Physiological and biochemical principles underlying volume-targeted therapy–the “Lund concept.” Neurocrit Care. 2005;2:83–95.

Grände P-O. The Lund concept for the treatment of patients with severe traumatic brain injury. J Neurosurg Anesthesiol. 2011;23:358–62.

Thelin EP, Jeppsson E, Frostell A, Svensson M, Mondello S, Bellander B-M, et al. Utility of neuron-specific enolase in traumatic brain injury; relations to S100B levels, outcome, and extracranial injury severity. Crit Care. 2016;20:285.

Mikkonen E, Blixt J, Ercole A, Alpkvist P, Sköldbring R, Bellander B-M, et al. A solution to the cerebral perfusion pressure transducer placement conundrum in neurointensive care? The dual transducer. Neurocrit Care. 2023;40:391–4.

Jennett B, MacMillan R. Epidemiology of head injury. Br Med J (Clin Res Ed). 1981;282:101–4.

Froese L, Gomez A, Sainbhi AS, Vakitbilir N, Marquez I, Amenta F, et al. Temporal relationship between vasopressor and sedative administration and cerebrovascular response in traumatic brain injury: a time-series analysis. Intensive Care Med Exp. 2023. https://doi.org/10.1186/s40635-023-00515-5.

Marshall LF, Marshall SB, Klauber MR, van Berkum Clark M, Eisenberg HM, Jane JA, et al. A new classification of head injury based on computerized tomography. J Neurosurg. 1991;75:14–20.

Raj R, Wennervirta JM, Tjerkaski J, Luoto TM, Posti JP, Nelson DW, et al. Dynamic prediction of mortality after traumatic brain injury using a machine learning algorithm. NPJ Digit Med. 2022;5:96.

Depreitere B, Güiza F, Van den Berghe G, Schuhmann MU, Maier G, Piper I, et al. Can optimal cerebral perfusion pressure in patients with severe traumatic brain injury be calculated based on minute-by-minute data monitoring? Acta Neurochir Suppl. 2016;122:245–8.

Sorrentino E, Budohoski KP, Kasprowicz M, Smielewski P, Matta B, Pickard JD, et al. Critical thresholds for transcranial Doppler indices of cerebral autoregulation in traumatic brain injury. Neurocrit Care. 2011;14:188–93.

Hyam JA, Welch CA, Harrison DA, Menon DK. Case mix, outcomes and comparison of risk prediction models for admissions to adult, general and specialist critical care units for head injury: a secondary analysis of the ICNARC Case Mix Programme Database. Crit Care. 2006;10(Suppl 2):S2.

Stein KY, Froese L, Gomez A, Sainbhi AS, Batson C, Mathieu F, et al. Association between cerebrovascular reactivity in adult traumatic brain injury and improvement in patient outcome over time: an exploratory analysis. Acta Neurochir (Wien). 2022;164:3107–18.

Zeiler FA, Cardim D, Donnelly J, Menon DK, Czosnyka M, Smielewski P. Transcranial doppler systolic flow index and ICP-derived cerebrovascular reactivity indices in traumatic brain injury. J Neurotrauma. 2018;35:314–22.

Gomez A, Froese L, Griesdale D, Thelin EP, Raj R, van Iperenburg L, et al. Prognostic value of near-infrared spectroscopy regional oxygen saturation and cerebrovascular reactivity index in acute traumatic neural injury: a CAnadian high-resolution traumatic brain injury (CAHR-TBI) Cohort Study. Crit Care. 2024;28:78.

Balestreri M, Czosnyka M, Steiner LA, Hiler M, Schmidt EA, Matta B, et al. Association between outcome, cerebral pressure reactivity and slow ICP waves following head injury. Acta Neurochir Suppl. 2005;95:25–8.

Budohoski KP, Czosnyka M, Kirkpatrick PJ, Reinhard M, Varsos GV, Kasprowicz M, et al. Bilateral failure of cerebral autoregulation is related to unfavorable outcome after subarachnoid hemorrhage. Neurocrit Care. 2015;22:65–73.

Beqiri E, Brady KM, Lee JK, Donnelly J, Zeiler FA, Czosnyka M, et al. Lower limit of reactivity assessed with PRx in an experimental setting. Acta Neurochir Suppl. 2021;131:275–8.

Brady KM, Lee JK, Kibler KK, Easley RB, Koehler RC, Czosnyka M, et al. The lower limit of cerebral blood flow autoregulation is increased with elevated intracranial pressure. Anesth Analg. 2009;108:1278–83.

Vavilala MS, Lee LA, Lam AM. The lower limit of cerebral autoregulation in children during sevoflurane anesthesia. J Neurosurg Anesthesiol. 2003;15:307–12.

Liu X, Akiyoshi K, Nakano M, Brady K, Bush B, Nadkarni R, et al. Determining thresholds for three indices of autoregulation to identify the lower limit of autoregulation during cardiac surgery. Crit Care Med. 2021;49:650–60.

Rivera-Lara L, Zorrilla-Vaca A, Healy RJ, Ziai W, Hogue C, Geocadin R, et al. Determining the upper and lower limits of cerebral autoregulation with cerebral oximetry autoregulation curves: a case series. Crit Care Med. 2018;46:e473–7.

Lee JK, Kibler KK, Benni PB, Easley RB, Czosnyka M, Smielewski P, et al. Cerebrovascular reactivity measured by near-infrared spectroscopy. Stroke. 2009;40:1820–6.

Sainbhi AS, Froese L, Gomez A, Batson C, Stein KY, Alizadeh A, et al. Continuous time-domain cerebrovascular reactivity metrics and discriminate capacity for the upper and lower limits of autoregulation: a scoping review of the animal literature. Neurotrauma Rep. 2021;2:639–59.

Brady KM, Lee JK, Kibler KK, Smielewski P, Czosnyka M, Easley RB, et al. Continuous time-domain analysis of cerebrovascular autoregulation using near-infrared spectroscopy. Stroke. 2007;38:2818–25.

Brady KM, Mytar JO, Kibler KK, Hogue CW, Lee JK, Czosnyka M, et al. Noninvasive autoregulation monitoring with and without intracranial pressure in the Naïve piglet brain. Anesth Analg. 2010;111:191–5.

Aries MJ, Czosnyka M, Budohoski K, Steiner L, Lavinio A, Kolias A, et al. Continuous determination of optimal cerebral perfusion pressure in traumatic brain injury*. Crit Care Med. 2012;40:2456–63.

Beqiri E, Smielewski P, Robba C, Czosnyka M, Cabeleira MT, Tas J, Ercole A. Feasibility of individualised severe traumatic brain injury management using an automated assessment of optimal cerebral perfusion pressure: the COGiTATE phase II study protocol. BMJ Open. 2019;9(9):e030727.

Zeiler FA, Ercole A, Cabeleira M, Carbonara M, Stocchetti N, Menon DK, et al. Comparison of performance of different optimal cerebral perfusion pressure parameters for outcome prediction in adult traumatic brain injury: a collaborative european neurotrauma effectiveness research in traumatic brain injury (CENTER-TBI) study. J Neurotrauma. 2019;36:1505–17.

Howells T, Johnson U, McKelvey T, Enblad P. An optimal frequency range for assessing the pressure reactivity index in patients with traumatic brain injury. J Clin Monit Comput. 2015;29:97–105.

Fraser CD, Brady KM, Rhee CJ, Easley RB, Kibler K, Smielewski P, et al. The frequency response of cerebral autoregulation. J Appl Physiol. 2013;115:52–6.

Zeiler FA, Aries M, Cabeleira M, van Essen TA, Stocchetti N, Menon DK, et al. Statistical cerebrovascular reactivity signal properties after secondary decompressive craniectomy in traumatic brain injury: a CENTER-TBI pilot analysis. J Neurotrauma. 2020;37:1306–14.

Wang EC, Ang BT, Wong J, Lim J, Ng I. Characterization of cerebrovascular reactivity after craniectomy for acute brain injury. Br J Neurosurg. 2006;20:24–30.

Donnelly J, Czosnyka M, Adams H, Cardim D, Kolias AG, Zeiler FA, et al. Twenty-five years of intracranial pressure monitoring after severe traumatic brain injury: a retrospective, single-center analysis. Neurosurgery. 2019;85:E75–82.


Aries MJH, de Jong SF, van Dijk JMC, Regtien J, Depreitere B, Czosnyka M, et al. Observation of autoregulation indices during ventricular csf drainage after aneurysmal subarachnoid hemorrhage: a pilot study. Neurocrit Care. 2015;23:347–54.

Klein SP, Bruyninckx D, Callebaut I, Depreitere B. Comparison of intracranial pressure and pressure reactivity index obtained through pressure measurements in the ventricle and in the parenchyma during and outside cerebrospinal fluid drainage episodes in a manipulation-free patient setting. Acta Neurochir Suppl. 2018;126:287–90.

Howells T, Johnson U, McKelvey T, Ronne-Engström E, Enblad P. The effects of ventricular drainage on the intracranial pressure signal and the pressure reactivity index. J Clin Monit Comput. 2017;31:469–78.

Batson C, Stein KY, Gomez A, Sainbhi AS, Froese L, Alizadeh A, et al. Intracranial pressure-derived cerebrovascular reactivity indices, chronological age, and biological sex in traumatic brain injury: a scoping review. Neurotrauma Rep. 2022;3:44–56.

Batson C, Froese L, Sekhon MS, Griesdale DE, Gomez A, Thelin EP, et al. Impact of chronological age and biological sex on cerebrovascular reactivity in moderate/severe traumatic brain injury: a canadian high-resolution traumatic brain injury (CAHR-TBI) study. J Neurotrauma. 2023;40:1098–111.

Liu X, Donnelly J, Czosnyka M, Aries MJH, Brady K, Cardim D, et al. Cerebrovascular pressure reactivity monitoring using wavelet analysis in traumatic brain injury patients: a retrospective study. PLoS Med. 2017;14: e1002348.

Tjerkaski J, Nyström H, Raj R, Lindblad C, Bellander B-M, Nelson DW, et al. Extended analysis of axonal injuries detected using magnetic resonance imaging in critically Ill traumatic brain injury patients. J Neurotrauma. 2022;39:58–66.


Acknowledgements

This work was directly supported through the Natural Sciences and Engineering Research Council of Canada (NSERC) (DGECR-2022-00260, RGPIN-2022-03621 and ALLRP-576386-22) and the Manitoba Public Insurance (MPI) Neuroscience Research Operating Fund.

Open access funding provided by Karolinska Institute. FAZ is supported through the Endowed Manitoba Public Insurance (MPI) Chair in Neuroscience, NSERC (DGECR-2022–00260, RGPIN-2022–03621, ALLRP-578524–22, ALLRP-576386–22, I2IPJ 586104–23, and ALLRP 586244–23), Canadian Institutes of Health Research (CIHR), the MPI Neuroscience Research Operating Fund, the Health Sciences Centre Foundation Winnipeg, the Pan Am Clinic Foundation (Winnipeg, MB), the Canada Foundation for Innovation (CFI) (Project #: 38583), Research Manitoba (Grant #: 3906 and 5429) and the University of Manitoba VPRI Research Investment Fund (RIF). EPT acknowledges funding support from Karolinska Institutet Funds (2022–01576), Region Stockholm ALF (FoUI-962566), Hjärnfonden/Swedish Brain Foundation (FO2023-0124), The Swedish Society of Medicine (SLS-985504) and Region Stockholm Clinical Research Appointment (ALF Klinisk Forskare, FoUI-981490). LF is supported through the University of Manitoba Biomedical Engineering (BME) Fellowship Grant, Research Manitoba Health Sciences PhD Studentship, Brain Canada Dr. Hubert van Tol Travel Fellowship, NSERC (ALLRP-576386–22), the University of Manitoba Graduate Enhancement of Tri-Agency Stipend (GETS) program and NSERC PDF. RR is supported by research grants from Helsinki University Hospital (TYH2023330), Finska Läkaresällskapet and The Swedish Cultural Foundation in Finland.

Author information

Erik Hong and Logan Froese contributed equally as shared first authors.

Frederick A Zeiler and Eric P Thelin contributed equally as shared last authors.

Authors and Affiliations

Department of Clinical Neuroscience, Karolinska Institutet, Stockholm, Sweden

Erik Hong, Logan Froese, Alexander Fletcher-Sandersjöö, Charles Tatter, Emma Hammarlund, Peter Alpkvist, Jiri Bartek Jr, Caroline Lindblad, Frederick A. Zeiler & Eric P. Thelin

Department of Neurosurgery, Karolinska University Hospital, Stockholm, Sweden

Erik Hong, Alexander Fletcher-Sandersjöö, Peter Alpkvist & Jiri Bartek Jr

Biomedical Engineering, Faculty of Engineering, University of Manitoba, Winnipeg, MB, Canada

Logan Froese & Frederick A. Zeiler

Department of Molecular Medicine and Surgery (MMK), Karolinska Institutet, Stockholm, Sweden

Emeli Pontén

Department of Neurosurgery, Skåne University Hospital, Lund, Sweden

Department of Radiology, Södersjukhuset, Stockholm, Sweden

Charles Tatter

Department of Perioperative Medicine and Intensive Care, Karolinska University Hospital, Stockholm, Sweden

Emma Hammarlund, Cecilia A. I. Åkerlund & David W. Nelson

Section of Perioperative Medicine and Intensive Care, Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden

Cecilia A. I. Åkerlund & David W. Nelson

Department of Cardiology, Danderyd’s Hospital, Stockholm, Sweden

Jonathan Tjerkaski

Department of Neurosurgery, University of Helsinki, Helsinki, Finland

Department of Neurosurgery, Uppsala University Hospital, Uppsala, Sweden

Caroline Lindblad

Department of Medical Sciences, Uppsala University, Uppsala, Sweden

Section of Neurosurgery, Department of Surgery, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada

Frederick A. Zeiler

Pan Am Clinic Foundation, Winnipeg, MB, Canada

Centre on Aging, University of Manitoba, Winnipeg, Canada

Department of Neurology, Karolinska University Hospital, Stockholm, Sweden

Eric P. Thelin


Contributions

EH and LF wrote the main manuscript and were responsible for data preparation and presentation. FAZ and EPT provided direct supervision and oversight of the project. Data collection was facilitated by EPT, with support from DWN, EH, AFS, CAIA, and JT. All authors reviewed the manuscript and contributed recommendations on its presentation.

Corresponding author

Correspondence to Logan Froese.

Ethics declarations

The study was approved by the Swedish Ethical Review Authority (#2020–05227) on November 17, 2020 and adheres to the Helsinki Declaration of 1975. The Swedish Ethical Review Authority waived the need for informed consent.

Consent for Publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Hong, E., Froese, L., Pontén, E. et al. Critical thresholds of long-pressure reactivity index and impact of intracranial pressure monitoring methods in traumatic brain injury. Crit Care 28 , 256 (2024). https://doi.org/10.1186/s13054-024-05042-7


Received : 12 May 2024

Accepted : 16 July 2024

Published : 29 July 2024

DOI : https://doi.org/10.1186/s13054-024-05042-7


  • Pressure reactivity index
  • Traumatic brain injury
  • Neuro-monitoring
  • Intracranial pressure
  • Functional outcome

Critical Care

ISSN: 1364-8535



  • Open access
  • Published: 26 July 2024

Potential assessment of CO 2 source/sink and its matching research during CCS process of deep unworkable seam

  • Huihuang Fang 1 , 2 , 3 ,
  • Yujie Wang 1 , 2 ,
  • Shuxun Sang 4 , 5 , 6 ,
  • Shua Yu 1 , 2 ,
  • Huihu Liu 1 , 2 ,
  • Jinran Guo 1 , 2 &
  • Zhangfei Wang 1 , 2  

Scientific Reports volume 14, Article number: 17206 (2024)


  • Energy and society
  • Environmental economics
  • Solid Earth sciences
  • Stratigraphy

It is of great significance for the engineering popularization of CO2-ECBM technology to evaluate the potential of CCUS sources and sinks and to study the matching of the pipeline network for deep unworkable seams. In this study, the deep unworkable seam was taken as the research object. Firstly, the evaluation method for CO2 storage potential in deep unworkable seams was discussed. Secondly, the CO2 storage potential was analyzed. Then, the matching of CO2 sources and sinks was carried out, and the pipe network design was optimized. Finally, suggestions for the design of the pipe network are put forward from the perspective of time and space scales. The results show that the average annual CO2 emissions of the coal-fired power plants vary greatly, with total emissions of 58.76 million tons. The CO2 storage potential in the deep unworkable seam is huge, with a total of 762 million tons, sufficient to store CO2 for 12.97 years. Over a 10-year period, the deep unworkable seam can store 587.6 million tons of CO2; the cumulative pipeline length is 251.61 km, requiring cumulative capital of $4.26 × 10^10. In the process of CO2 source–sink matching, the cumulative mileage saved is 98.75 km and the cumulative cost saved is $25.669 billion, accounting for 39.25% and 60.26% of the total mileage and cost, respectively. Based on the three-step approach, the whole CO2 source–sink network in the Huainan coalfield can be completed in stages and by region, and all CO2 transportation and storage can be realized. CO2 pipelines include gas collection and distribution branch lines, intra-regional trunk lines, and inter-regional trunk lines. Based on a reasonable layout of CO2 pipelines, a variety of CCS applications can be carried out simultaneously, intra-regional and inter-regional CO2 transport network demonstrations can be built, and integrated business models of CO2 transport and storage can be built on land and at sea. The research results can provide a reference for the evaluation of the CO2 sequestration potential of China's coal bases and lay a foundation for the deployment of CCUS clusters.


Introduction

CCUS stands for CO2 capture, utilization and storage1. On the one hand, CCUS technology can reduce CO2 emissions to the atmosphere and lower the concentration of greenhouse gases2,3. On the other hand, it can help industries with high CO2 emissions achieve low-carbon development and promote economic transformation4,5. Therefore, CCUS technology has broad application prospects in the global energy and environment field. CO2 emissions from coal are the largest source of carbon emissions in China, and it will take a long time for China to transform its energy mix6,7. Therefore, reducing CO2 emissions from coal on the basis of CCUS technology is of profound significance and can promote the realization of China's dual-carbon strategy.

CO2 geological sequestration, a core component of CCUS, is an effective way to achieve large-scale decarbonization8,9. Scientific evaluation of the CO2 storage potential of sedimentary basins and realization of source–sink matching are the basis of CCUS cluster deployment10,11. Major sedimentary basins in China have great potential for CO2 storage, and the storage forms are diverse12. However, owing to the lack of unified methods for assessing CO2 storage potential in Chinese sedimentary basins, estimates of this potential vary greatly13. The CO2 sequestration potential of geological bodies in China, such as oil and gas fields, deep unrecoverable seams, producing and closed mines, and goaf areas, is unclear and needs to be evaluated in detail.

Carbon emission sources in China's coal bases are concentrated, and CO2 emission sources and CO2 storage sinks overlap to a high degree14, which provides favorable conditions for CCUS cluster deployment. CCUS technology is the only way for coal bases to achieve near-zero CO2 emissions in the future, and the deployment of "coal base + CCUS" clusters has scale and agglomeration effects15. The geographical proximity of CO2 sources and sinks can save CO2 transportation costs, and the geographical concentration of a large number of CO2 sources and sinks is also conducive to large-scale, clustered engineering practice. Geological bodies such as deep unrecoverable seams are the most typical form of CO2 storage in coal bases16,17. However, CO2 geological storage in such seams is still in the exploration stage, and there are few studies on its storage potential18. Therefore, it is necessary to establish potential assessment methods suited to the characteristics of China's coal bases.

The CO2 sequestration process can be simplified as the reverse of the CBM extraction process, and its core mechanism is the dynamic process of CO2 adsorption and displacement of CBM19,20. Therefore, the mechanism of CO2 geological storage in unworkable seams is mainly that of CO2 adsorption and desorption in the coal seam21. The coal resources of the Huainan and Huaibei coalfields account for 97.7% of the total resources of the province, and their distribution is concentrated22,23,24. The Huainan coalfield was therefore selected for estimating CO2 storage in this study. Owing to technical and geological limitations, the depth of coal mining in Anhui province is currently limited to less than 1000 m; coal seams at 1000–2000 m are resources to be exploited in the next stage and at present belong to the deep unworkable seam16. That is, the geological reserves with a burial depth of 1000–2000 m are used to estimate the CO2 storage potential of Anhui province.

In this study, the deep unworkable seam in the Huainan coalfield was taken as the research object. Firstly, the evaluation method for CO2 storage potential in deep unworkable seams was discussed. Secondly, the CO2 geological storage potential was analyzed. Then, based on a lowest-cost objective function and an improved mileage-saving method (the classic savings idea is sketched below), the matching of CO2 sources and sinks for CO2 geological storage was carried out, and the pipe network design was optimized. Finally, from the perspective of time and space scales, suggestions on the network planning of CCS sources and sinks in the Huainan coalfield are put forward. The research innovations are as follows: (1) an evaluation method for CO2 storage potential in deep unworkable seams is discussed; (2) the matching of CO2 sources and sinks is studied, and the pipe network design is optimized; (3) a design approach for network planning of CCS sources and sinks is systematically proposed. The results can provide a reference for the evaluation of the CO2 sequestration potential of coal bases in China and lay a foundation for CCUS cluster deployment.
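The improved mileage-saving method itself is not specified in detail in this section; the sketch below illustrates only the classic Clarke–Wright "savings" idea on which such methods are typically based. The names, coordinates and straight-line distance function are illustrative assumptions, not the authors' implementation, and a real design would add constraints such as pipeline capacity and cost per km.

```python
import math
from itertools import combinations

def distance_km(a, b):
    """Illustrative straight-line distance between two (x, y) points in km."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def savings_ranking(sources, sink):
    """Clarke-Wright style 'mileage saving' for pairs of CO2 sources sharing a
    trunk line to one storage sink instead of two separate lines:
    s(i, j) = d(i, sink) + d(j, sink) - d(i, j); pairs ranked by saving."""
    savings = []
    for i, j in combinations(sources, 2):
        s = (distance_km(sources[i], sink) + distance_km(sources[j], sink)
             - distance_km(sources[i], sources[j]))
        savings.append(((i, j), round(s, 2)))
    return sorted(savings, key=lambda kv: kv[1], reverse=True)

# Hypothetical coordinates (km) for three plants and one storage sink
plants = {"D1": (0.0, 10.0), "D2": (3.0, 12.0), "D3": (20.0, 5.0)}
print(savings_ranking(plants, sink=(10.0, 0.0)))
```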

Geological setting and analysis method

Geological background of the study area

Based on regional structural analysis, the Huainan coalfield is located at the southern margin of the North China Plate. In the west-east direction, the coalfield is bounded by the Kouziji-Nanzhaoji faults and the Xinchengkou-Changfeng faults. From north to south, it is bounded by the Shangtangming-Longshan faults and the Yingshang-Dingyuan faults (Fig. 1) 25, 26. The coalfield is a near east-west trending tectonic basin formed by opposing thrusts, with imbricate fans composed of nappe structures on both flanks and a simple synclinal structure in the interior (Fig. 1).

Figure 1. Geological background of the Huainan coalfield and distribution of CO2 source-sink geological points in deep unrecoverable coal seams.

The coal-bearing strata are the Taiyuan Formation of the Upper Carboniferous, the Shanxi and Xiashihezi Formations of the Lower Permian, and the Shangshihezi Formation of the Upper Permian, with a total thickness of about 900 m and about 40 coal seams 27, 28. Within these strata, there are on average 9–18 coal seams with a single-seam thickness greater than 0.7 m; the maximum thickness is 12 m and the total thickness is 23–36 m, distributed in the Shanxi Formation, the Xiashihezi Formation and the lower part of the Shangshihezi Formation. In this study, the CO2 emission sources are 10 coal-fired power plants within the coalfield, numbered D1–D10. The deep unworkable seams are the CO2 storage sinks, which are bounded by faults and numbered B1–B15 (Fig. 1).

Evaluation method of CO 2 geological storage potential

In deep unworkable seams, geologically stored CO2 occurs mainly in adsorbed, dissolved and free states 29, and adsorption is the main storage form in coal seams 30. Considering the storage differences among the different phases of CO2, the following potential assessment model of CO2 storage can be adopted 16, 31:
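The display equation itself did not survive extraction from the source page. Based on the variable definitions that follow, a plausible reconstruction (the 10^-3 factor that converts kilograms to tonnes is our assumption about how the original handled units) is:

\[ M_{\mathrm{CO}_2} = 10^{-3}\,\rho_{\mathrm{CO}_2}\,M_{\mathrm{coal}}\,\bigl(m_{ab} + m_{d} + m_{f}\bigr) \]

with \(\rho_{\mathrm{CO}_2}\) evaluated at the same reference conditions as the volumetric terms, so that a gas volume stored per tonne of coal is converted into a stored CO2 mass.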

where \(M_{\mathrm{CO}_2}\) is the CO2 storage capacity, t; \(\rho_{\mathrm{CO}_2}\) is the CO2 density, kg/m3; M_coal is the proved coal reserve, t; and m_ab, m_d and m_f are the volumes of CO2 stored per unit mass of coal in the adsorbed, dissolved and free states, respectively, m3/t.

In the unit mass coal, the storage potential of CO 2 adsorbed state in deep unworkable seam can be characterized by the following formula 16 , 31 :

where P is the reservoir pressure, which is also the CO2 adsorption pressure, MPa; T_c is the CO2 critical temperature, K; Z is the CO2 compressibility factor; p_c is the CO2 critical pressure, MPa; T is the reservoir temperature, which is also the CO2 adsorption temperature, K; and m_ex is the CO2 excess adsorption amount per unit mass of coal, m3/t, which can be calculated using the following D-R adsorption model 16, 31:

where m 0 is the maximum CO 2 adsorption capacity of coal per unit mass tested by adsorption experiment, m 3 /t; ρ f and ρ a are the densities of free and adsorbed CO 2 under the real temperature and pressure conditions, kg/m 3 ; D is the adsorption constant, and k is the constant associated with Henry's Law.
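The adsorption equations themselves are missing from the extracted text. One common density-based formulation that is consistent with the variables defined above is the modified D-R model with a Henry's-law term for the excess adsorption, combined with the usual excess-to-absolute correction; the exact expressions used in refs. 16 and 31 may differ, so the following should be read as a sketch rather than the authors' equations:

\[ m_{ex} = m_{0}\exp\!\left[-D\left(\ln\frac{\rho_{a}}{\rho_{f}}\right)^{2}\right]\left(1-\frac{\rho_{f}}{\rho_{a}}\right) + k\,\rho_{f}\left(1-\frac{\rho_{f}}{\rho_{a}}\right), \qquad m_{ab} = \frac{m_{ex}}{1-\rho_{f}/\rho_{a}} \]

In this reading, the free-phase density \(\rho_{f}\) at reservoir conditions is estimated from the real-gas law using P, T and the compressibility factor Z (itself obtainable from the reduced pressure p/p_c and reduced temperature T/T_c), which would explain why those quantities appear in the variable list above.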

In coal reservoir, CO 2 density is a function of pressure and temperature, which can be expressed as ρ f  =  f(p, T) , and can be further characterized as follows 16 , 31 , 32 :

where δ = ρ_f/ρ_c is the CO2 reduced density; ρ_c is the CO2 critical density, kg/m3; τ = T_c/T is the inverse reduced temperature; and ϕ(δ, τ) is the dimensionless Helmholtz free energy, which is controlled by temperature and density 16, 31, 32:

where ϕ^o(δ, τ) is the ideal-gas part of the Helmholtz free energy and ϕ^r(δ, τ) is the residual part.
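The corresponding relations did not survive extraction. In the Span-Wagner equation of state cited as ref. 32, pressure, temperature and density are linked through the residual part of the reduced Helmholtz energy, so \(\rho_{f}\) is obtained by solving the following implicit relation at the reservoir pressure and temperature, with R the specific gas constant of CO2 (about 0.1889 kJ/(kg·K)); this is offered as a reconstruction consistent with that reference rather than a verbatim copy of the original equations:

\[ \phi(\delta,\tau) = \phi^{o}(\delta,\tau) + \phi^{r}(\delta,\tau), \qquad p = \rho_{f}\,R\,T\left[1 + \delta\,\frac{\partial \phi^{r}(\delta,\tau)}{\partial \delta}\right] \]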

In deep unworkable seam, the storage potential of dissolved CO 2 per unit mass of coal is a function of coal porosity, water saturation, coal density and CO 2 solubility, which can be characterized as follows 16 , 31 :
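The display equation is missing; a straightforward reconstruction from the stated dependence (porosity times water saturation times solubility, normalized by coal density so that the result is a gas volume per unit coal mass) is given below, where \(S_{\mathrm{CO}_2}\) is taken as the dissolved CO2 volume per unit volume of formation water and ρ_coal is expressed in t/m3 (or an equivalent unit-conversion factor is applied); this is an inferred form, not necessarily the authors' exact expression:

\[ m_{d} = \frac{\varphi\,S_{w}\,S_{\mathrm{CO}_2}}{\rho_{\mathrm{coal}}} \]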

where φ is the coal porosity, %; S w is the water saturation, %; \(S_{{{\text{CO}}_{2} }}\) is the CO 2 solubility, and ρ coal is the coal density, kg/m 3 .

According to Boyle-Mariotte law, the free CO 2 storage potential per unit mass of coal in deep unworkable seam can be characterized as follows 16 , 31 :
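The display equation is missing; applying the Boyle-Mariotte (ideal-gas) conversion of the in-situ free-gas volume in the gas-filled pore space to standard conditions gives the following reconstruction (the placement of the compressibility factor Z and the unit handling of ρ_visual are assumptions):

\[ m_{f} = \frac{\varphi\,S_{g}}{\rho_{\mathrm{visual}}}\cdot\frac{P\,T_{0}}{P_{0}\,T\,Z} \]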

where S g is the gas saturation, %; P 0 is the standard atmospheric pressure, MPa; T 0 is the temperature under the standard condition, K; and ρ visual is the coal apparent density, kg/m 3 .

Construction of matching model of CO 2 source-sink

CO2 source and sink matching

CO 2 source-sink matching is the basis of CCUS cluster deployment and its pipe network design and construction, with the goal of minimizing CO 2 transportation cost and maximizing carbon removal. Its essence is the optimization planning of CCUS cluster system 33 , 34 . Based on CO 2 emission source, storage sink, storage geological process, transport network connecting source and sink and corresponding parameter data, the dynamic optimal matching between CO 2 source and sink can be achieved in terms of target quantity, continuity and economic efficiency (Fig.  2 ).

Figure 2. Schematic diagram of the connotation of CO2 source and sink matching.

The matching of CO2 sources and sinks must account for the large number, varied types and scattered locations of CO2 emission sources (e.g., thermal power, steel, cement and chemical plants) and storage sinks (e.g., saline aquifers, CO2-ECBM, CO2-EOR, MCO2-ILU, CO2-SDR). Based on the discussion of constraint conditions and the determination of the objective function, the influence of regional geographical conditions, traffic, population density, transportation cost and transportation mode on CO2 transport between emission sources and storage sinks is fully considered in the CCUS system. The optimal matching of CO2 emission sources, storage sinks and transportation parameters is then realized, so as to determine a scientific and reasonable CO2 source-sink matching scheme (Fig. 2).

Objective functions

Based on network analysis theory in operations research, theoretical models of CO2 source-sink matching within CCUS technology can be constructed for the Huainan coalfield using the minimum spanning tree method. The construction of the theoretical models should meet the following basic assumptions: (1) the source-sink pair with the lowest cost is matched first; (2) one source may be matched with multiple sinks, and one sink with multiple sources; (3) the sequestration sinks must meet the requirements of the CCUS planning period.

In this study, the lowest total cost of matching of CO 2 source-sink in CCS technology is taken as the objective function, namely:
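The objective function (formula (8)) did not survive extraction; from the description, it minimizes the sum of the capture, transport and sequestration costs over the CO2 transport allocation X_ij:

\[ \min_{X_{ij}}\; C = C_{C} + C_{T} + C_{S} \]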

where i refers to the i-th CO2 source; j refers to the j-th CO2 sink; m is the number of CO2 sources, with a value of 10; and n is the number of CO2 sinks, with a value of 15.

CO2 capture cost (C_C)

Based on the industrial sources report published by the National Energy Technology Laboratory of the United States, the average CO2 capture cost for coal-fired power plants is $64.35/t 30, 35. Therefore, the capture cost of the CO2 sources in the Huainan coalfield can be characterized as follows:
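The display equation (formula (9)) is missing; in symbols it is presumably the capture cost summed over all source-sink pairs:

\[ C_{C} = \sum_{i=1}^{m}\sum_{j=1}^{n}\omega_{ij}\,X_{ij} \]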

where ω_ij is the CO2 capture cost at the i-th coal-fired power plant, $/t; and X_ij is the amount of CO2 transported from the i-th coal-fired power plant to the j-th sequestration sink, t.

CO2 transportation cost (C_T)

CO2 is most commonly transported by pipeline, ship and tanker truck. Pipeline transportation is suitable for directional transport with large capacity, long distance and stable load, and its cost mainly comprises the construction cost and the operation and maintenance cost. The operation and maintenance cost accounts for about 1.5% of the construction cost 35; the two components can be calculated according to formulas (10) and (11), respectively.

where L is the distance of pipeline transportation, km.

where N represents the transportation cycle of the pipeline, year.

Therefore, CO 2 transport cost can be characterized as follows:
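Formulas (10)-(12) are missing from the extracted text. A hedged reconstruction is given below, where \(\alpha\) is a placeholder for the unit pipeline construction cost per kilometre used in the paper (its numerical value is not recoverable here) and the 1.5% operation-and-maintenance share is assumed to accrue annually over the N-year transport cycle:

\[ C_{con} = \alpha\,L, \qquad C_{om} = 0.015\,\alpha\,L\,N, \qquad C_{T} = \sum_{i=1}^{m}\sum_{j=1}^{n}\alpha\,L_{ij}\,(1 + 0.015\,N) \]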

CO2 sequestration cost (C_S)

The cost of CO2 geological storage is closely related to the amount of CO2 stored and the type of storage site, and the average storage cost coefficient is $5.59/t 30, 35. Therefore, the cost of CO2 geological storage in coal reservoirs can be characterized as follows:
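The display equation (formula (13)) is missing; in symbols it is presumably the sequestration cost summed over all source-sink pairs:

\[ C_{S} = \sum_{i=1}^{m}\sum_{j=1}^{n}\varepsilon_{ij}\,X_{ij} \]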

where \(\varepsilon_{ij}\) is the sequestration cost factor of transporting CO 2 from coal-fired power plant i to sequestration sink j , $/t.

In summary, by substituting formulas ( 9 ), ( 12 ) and ( 13 ) into formula ( 8 ), the minimum objective function of total cost of CO 2 source-sink matching in CCS technology can be obtained:
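The combined objective (formula (14)) also did not survive extraction. Combining the reconstructions above, it would take a form along the lines of the following, with the caveat that the transport term here depends only on pipeline length because that is all the surviving text specifies:

\[ \min_{X_{ij}}\; C = \sum_{i=1}^{m}\sum_{j=1}^{n}\Bigl[\bigl(\omega_{ij} + \varepsilon_{ij}\bigr)\,X_{ij} + \alpha\,L_{ij}\,(1 + 0.015\,N)\Bigr] \]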

Constraint conditions

Based on the basic assumptions of the theoretical model, the constraint conditions of the lowest-total-cost objective function in planning the CO2 source-sink matching pipe network with CCS technology are as follows:

The amount of CO2 captured at each CO2 emission source is equal to the amount it delivers into the pipeline network, that is:
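In symbols (reconstructing the missing display equation):

\[ \sum_{j=1}^{n} X_{ij} = a_{i}, \qquad i = 1, 2, \dots, m \]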

where a i is the CO 2 capture amount of the i th coal-fired power plant.

The amount of CO2 transported by pipeline to each storage site shall not exceed the storage capacity of that storage sink, that is:
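In symbols (reconstructing the missing display equation):

\[ \sum_{i=1}^{m} X_{ij} \le b_{j}, \qquad j = 1, 2, \dots, n \]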

where b j is the storage capacity of the j th storage sink.

The amount of CO 2 captured in all coal-fired power plants must not exceed the total capacity of all potential sequestration sinks, that is:
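In symbols (reconstructing the missing display equation):

\[ \sum_{i=1}^{m} a_{i} \le \sum_{j=1}^{n} b_{j} \]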

Non-negativity constraint: the amount of CO2 transported through each pipeline is non-negative, that is:
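In symbols (reconstructing the missing display equation):

\[ X_{ij} \ge 0, \qquad \forall\, i, j \]

Taken together, the objective and the four constraints form a small transportation-type linear program. The sketch below shows how such a model can be solved numerically; the source, sink, cost and capacity numbers are purely hypothetical, and the per-pair unit cost is assumed to already bundle capture, transport and storage, which is a simplification of the formulation described above.

```python
# A minimal sketch of the source-sink matching model as a linear program,
# using hypothetical cost and capacity numbers (not the paper's data).
import numpy as np
from scipy.optimize import linprog

n_src, n_snk = 3, 2                       # hypothetical: 3 sources, 2 sinks
a = np.array([5.0, 3.0, 2.0])             # CO2 captured at each source (Mt)
b = np.array([7.0, 6.0])                  # storage capacity of each sink (Mt)
# hypothetical combined unit cost (capture + transport + storage) per pair
c = np.array([[70.0, 95.0],
              [80.0, 75.0],
              [90.0, 85.0]]).ravel()      # flattened to match the X_ij ordering

# Equality constraints: all CO2 captured at source i must be shipped out.
A_eq = np.zeros((n_src, n_src * n_snk))
for i in range(n_src):
    A_eq[i, i * n_snk:(i + 1) * n_snk] = 1.0
# Inequality constraints: CO2 delivered to sink j cannot exceed its capacity.
A_ub = np.zeros((n_snk, n_src * n_snk))
for j in range(n_snk):
    A_ub[j, j::n_snk] = 1.0

res = linprog(c, A_ub=A_ub, b_ub=b, A_eq=A_eq, b_eq=a,
              bounds=(0, None), method="highs")
print(res.x.reshape(n_src, n_snk))        # optimal X_ij allocation
print(res.fun)                            # minimum total cost
```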

Optimization of matching pipe network of CO 2 source-sink

The core idea of the mileage-saving algorithm is to merge two transportation loops into one so that the merge reduces the transportation distance, and to keep cycling until the limiting condition is reached, thereby reducing the transportation cost. Specifically, consider three points A, B and C, with goods transported from A to B and from A to C, where the distance from A to B is L_AB (km), from A to C is L_AC (km), and from B to C is L_BC (km). If the deliveries from A to B and from A to C are completed separately, the transportation distance including the return trips is 2 × (L_AB + L_AC) (Fig. 3a). If instead the route runs from A to B, then from B to C, and finally from C back to A, the transport distance is L_AB + L_BC + L_AC (Fig. 3a), so the distance saved is 2 × (L_AB + L_AC) − (L_AB + L_BC + L_AC) = L_AB + L_AC − L_BC > 0.

Figure 3. Optimization of the CCUS source-sink matching pipe network. (a) Traditional mileage-saving method; (b) improved mileage-saving method.

In CO2 source-sink matching, each sink is taken as a distribution center and is connected with its associated source points. The basic principle is similar to the mileage-saving method, except that there is only a transportation network from source to sink and no return pipeline. Accordingly, the idea of the mileage-saving method is introduced in this study and improved to meet the needs of CO2 source-sink matching and transportation network optimization. As shown in Fig. 3b, the CO2 emitted from points B and C is transported to storage sink A. The most direct way is from B to A and from C to A, with a total pipeline length of L_AB + L_AC (Fig. 3b). If the CO2 is instead routed from B to C and then from C to A, or from C to B and then from B to A (Fig. 3b), the pipeline length is L_BC + L_AC or L_BC + L_AB, respectively. L_AB and L_AC are compared so that the routing with the smaller total length is chosen. If L_BC is smaller than the direct link it replaces (L_AB or L_AC), then L_AB − L_BC (or L_AC − L_BC) is the saving; if L_BC is larger, the difference is negative and there is no saving (Fig. 3b).
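The improved rule can be prototyped in a few lines. The sketch below is illustrative only: the point names, distances and the greedy one-reroute-per-source policy are assumptions for demonstration rather than the paper's implementation, and sink capacity constraints are ignored.

```python
# A minimal sketch of the improved mileage-saving idea for one storage sink
# serving several emission sources: start from direct source-to-sink links and
# reroute a source through a neighbouring source whenever that shortens the
# total pipeline length. Distances are hypothetical and symmetric.
from itertools import combinations

def improved_savings(sink, sources, dist):
    """dist[(x, y)] gives the pipeline distance between two points."""
    d = lambda x, y: dist[(x, y)] if (x, y) in dist else dist[(y, x)]
    links = {s: sink for s in sources}        # source -> next node towards sink
    savings = []
    for b, c in combinations(sources, 2):
        # reroute the source with the longer direct link through the other one
        far, near = (b, c) if d(b, sink) >= d(c, sink) else (c, b)
        gain = d(far, sink) - d(far, near)    # saving = L_far,sink - L_far,near
        if gain > 0:
            savings.append((gain, far, near))
    # apply the largest savings first, at most one reroute per source
    for gain, far, near in sorted(savings, reverse=True):
        if links[far] == sink and links[near] == sink:
            links[far] = near
    return links

# hypothetical example with sink "A" and sources "B", "C"
dist = {("A", "B"): 40.0, ("A", "C"): 25.0, ("B", "C"): 10.0}
print(improved_savings("A", ["B", "C"], dist))   # {'B': 'C', 'C': 'A'}
```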

CO 2 source and sink characteristics

Characteristics of CO2 sources

In the Huainan coalfield, the CO2 emission sources are 10 coal-fired power plants within the coalfield, of which 9 are in operation and 1 has finished commissioning and is planned to enter operation. According to the “Greenhouse Gas Emission Accounting Methods and Reporting Guidelines for Chinese Power Generation Enterprises (Trial)” and related methods, the carbon emission intensity of the coal-fired power plants was calculated, and on this basis the average annual CO2 emissions of each plant were estimated. The installed capacity of China's coal-fired power plants is mainly 300 MW, 600 MW and 1000 MW, with CO2 emission intensities of 0.845 t/MWh, 0.807 t/MWh and 0.768 t/MWh, respectively; in this study, the mean value is taken as the basis for estimation 36, 37. Based on the average annual power generation statistics of each power plant, the average annual CO2 emissions of each coal-fired power plant can be estimated (Table 1).
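The estimate described above is a simple generation-times-intensity calculation. The sketch below illustrates it with a hypothetical annual generation figure; the function name and the example value are not taken from the paper.

```python
# A rough sketch of the emission estimate described above; intensity is the
# mean of the three capacity-class values (t CO2 per MWh).
mean_intensity = (0.845 + 0.807 + 0.768) / 3          # ~0.807 t CO2 per MWh

def annual_co2_mt(annual_generation_gwh):
    """Annual CO2 emissions in million tonnes from generation in GWh."""
    return annual_generation_gwh * 1_000 * mean_intensity / 1_000_000

# hypothetical multi-unit plant generating 21,200 GWh per year -> about 17.1 Mt
print(annual_co2_mt(21_200))
```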

As can be seen from Table 1, the average annual CO2 emissions of the coal-fired power plants vary greatly, ranging from 0.36 million tons to 17.12 million tons. Among them, the average annual CO2 emissions of the D7 power plant reach 17.12 million tons, accounting for about 30% of the total annual CO2 emissions. The total annual CO2 emissions of all coal-fired power plants are 58.76 million tons, which includes 5.28 million tons from the proposed D6 power plant (Table 1).

Assessment of CO 2 sink

The core parameters for the assessment of CO2 geological storage potential are mainly derived from engineering data, test data, experimental data and published research (Table 2) 16, 31, 38, 39. In this study, for the deep unworkable seams in the Huainan coalfield, the reserves with burial depth ≤ 1500 m are proved reserves obtained from coal exploration, and the reserves with burial depth > 1500 m are reserves predicted by the resource management department. The geothermal gradient is 3.10 °C/100 m. When the coal seam is shallower than 1000 m, the pressure gradient is 0.95 MPa/100 m; when it is deeper than 1000 m, the pressure gradient is 1.08 MPa/100 m 16, 31. The core parameters of the CO2 geological storage potential assessment are detailed in Table 2 16, 31, 38, 39.

The CO2 geological storage potential of the deep unworkable seams in the Huainan coalfield is huge, with a total of 762 million tons. The adsorbed, free and dissolved states can store 685 million tons, 53 million tons and 24 million tons of CO2, respectively. Adsorbed-state storage in the deep unworkable seams is dominant, accounting for 89.895% of the total. For burial depths ≤ 1500 m and > 1500 m, the total CO2 geological storage is 253 million tons and 510 million tons, accounting for 33.17% and 66.83% of the total, respectively. Regardless of the state in which CO2 is stored, the amount stored at burial depths greater than 1500 m exceeds that stored in the same state at depths less than 1500 m (Table 3).

For burial depths > 1500 m and ≤ 1500 m, the proved coal reserves are 4.03 billion tons and 1.99 billion tons, respectively, a ratio of 2.025. For the total CO2 geological storage and its adsorbed, free and dissolved components, the corresponding ratios between depths > 1500 m and ≤ 1500 m are 2.016, 1.996, 2.312 and 2.000, respectively. The main reason why the ratios for total storage and adsorbed-state storage are lower than 2.025 is that, although the CO2 geological storage potential of deep unworkable seams is positively correlated with the proved coal reserves, the maximum CO2 adsorption capacity at depths ≤ 1500 m is much higher than that at depths > 1500 m. Conversely, as burial depth increases, the reservoir pressure gradually increases and the free-state CO2 storage potential in the pore structure gradually increases, which makes the free-state ratio considerably greater than 2.025.

Matching characteristics of CO 2 source-sink

Plane distribution characteristics of CO2 sinks

The total CO2 storage potential of the deep unworkable seams in the Huainan coalfield is 762 million tons (Table 3). At the average annual CO2 emission rate of the 10 coal-fired power plants, this capacity corresponds to 12.97 years of storage. The deep unworkable seams are therefore the most promising geological body for CO2 storage in the Huainan coalfield. The unworkable coal seams with burial depth ≤ 1500 m can meet the CO2 geological storage requirements of the coal-fired power plants for 4.31 years. Considering the technical challenges and implementation costs of CO2 storage in coal seams at different burial depths, the unworkable coal seams with burial depths ≤ 1500 m should be the main target reservoir for the implementation of CO2-ECBM technology in the next five years.

With fault structures as boundaries, the deep unworkable seams can be divided into 15 CO2 storage blocks, and the plane distribution of the CO2 storage sinks can be compared according to block area (Fig. 4). The main blocks for CO2 geological storage are B9, B12, B8 and B5, with storable capacities of 124 million tons, 114 million tons, 97 million tons and 85 million tons, respectively; the two largest blocks, B9 and B12, alone can store the CO2 emissions of the 10 coal-fired power plants for nearly four years. These four larger blocks are also the main blocks in the CO2 source-sink matching.

Figure 4. Plane distribution of CO2 storage sinks in the unrecoverable coal seams of the Huainan coalfield.

According to the preliminary potential assessment, the deep unworkable seams can store the average annual CO2 emissions of the 10 coal-fired power plants in the Huainan coalfield for 12.97 years. Therefore, in this study, the CO2 source-sink matching was conducted on the basis of the cumulative CO2 emissions of the 10 coal-fired power plants over 10 years being stored in the deep unworkable seams (Fig. 5).

Figure 5. CCS source-sink matching of cumulative CO2 emissions from the 10 coal-fired power plants in the Huainan coalfield during the 10-year cycle.

Based on the CO2 source-sink matching results for the 10-year cycle in the Huainan coalfield, the CO2 from coal-fired power plant D1 is mainly stored in blocks B2, B3, B4 and B7, with stored amounts of 20.2 million tons, 19.7 million tons, 30.9 million tons and 10.8 million tons, respectively. D2 is mainly matched to block B5, with a stored amount of 3.6 million tons. D3 is mainly matched to blocks B7 and B10, with stored amounts of 25.8 million tons and 51 million tons, respectively. D4 is mainly matched to blocks B8 and B9, with stored amounts of 10.9 million tons and 12.3 million tons, respectively. D5 is mainly matched to block B9, with a stored amount of 58.9 million tons. D6 is mainly matched to block B9, with a stored amount of 52.8 million tons. D7 is mainly matched to blocks B8, B12 and B14, with stored amounts of 61.1 million tons, 58.3 million tons and 51.8 million tons, respectively. D8 is mainly matched to block B8, with a stored amount of 15.5 million tons. D9 is mainly matched to block B13, with a stored amount of 48.2 million tons. D10 is mainly matched to block B12, with a stored amount of 56.0 million tons (Fig. 5). Over the 10-year cycle, up to 587.6 million tons of CO2 can be stored in the deep unworkable seams, with a cumulative planned pipeline length of 251.61 km requiring a cumulative capital of $4.26 × 10^10.

Discussions

Analysis of the CO2 source-sink matching pipe network

Analysis of the CO2 source-sink matching pipe network for the deep unworkable seams shows that the transportation routes of pipelines 9, 4, 16, 5 and 8 are relatively long, together accounting for 53.65% of the total route length (Fig. 6). Because the transportation cost is proportional to the route length, optimizing the lengths of pipelines 9, 4, 16, 5 and 8 is important for reducing the total cost.

Figure 6. Number, length and proportion of the CO2 source-sink matching pipe network in the deep unrecoverable coal seams.

Analysis of the CO2 storage and transport costs and their proportions for the deep unworkable seams shows that the transport costs of blocks B8, B7, B12 and B13 are the highest, accounting for 36.96%, 14.01%, 11.60% and 11.86% of the total CO2 storage and transport cost, respectively. The transportation cost of these four CO2 storage sinks accounts for 74.43% of the total cost. Therefore, blocks B8, B7, B12 and B13 of the deep unworkable seams are the focus of the optimization of the CO2 source-sink matching pipe network. Blocks B1, B6, B11 and B15 do not need to bear CO2 geological storage for the time being and can serve as alternative blocks for CO2 storage (Figs. 5 and 7).

Figure 7. Transportation cost and proportion of CO2 storage sinks in the CO2 source-sink matching for the deep unrecoverable coal seams.

Based on the improved mileage-saving method, the optimized CO2 source-sink matching pipe network for the deep unworkable seams is obtained (Fig. 8). The unchanged pipe network paths are D1–B4, D1–B7, D3–B10, D4–B8, D4–B9 and D7–B13 (Fig. 8); for the remaining source-sink routes, the minimum total transportation cost is taken as the objective function and the pipe network is optimized subject to the emission-source and storage-capacity constraints (Fig. 8).

Figure 8. Optimization results of the CO2 source-sink matching pipe network in the Huainan coalfield.

Based on the optimization results for the CO2 source-sink matching pipe network in the Huainan coalfield, the accumulated mileage saved is 98.75 km and the accumulated cost saved is $25.669 billion, accounting for 39.25% and 60.26% of the total pipeline mileage and cost, respectively (Table 4). Among them, the mileage and cost savings of blocks B13 and B14 in the deep unworkable seams are the most obvious, accounting for 10.43% and 10.10% of the total mileage and 16.20% and 16.01% of the total cost, respectively (Table 4).

Planning and design of matching pipe network of CO 2 source-sink

Pipeline network planning on the time scale

Analysis of the optimized CO2 source-sink matching pipe network in the Huainan coalfield and of the amount of CO2 transported by each line shows that the entire pipe network is concentrated in the eastern and western regions, and that the transport volume of the eastern network is significantly greater than that of the western one (Fig. 9). The thicker a route line is drawn, the greater its transport volume (Fig. 9). The planning and design of the CO2 source-sink matching pipe network should refer to the thickness of the transportation lines, that is, the amount of CO2 transported (Figs. 10, 11, 12). The planning and design of the CO2 source-sink matching pipe network in the Huainan coalfield is proposed in three steps:

Figure 9. CO2 transport statistics of the CCS source-sink matching pipe networks in the Huainan coalfield.

Figure 10. Three-step planning and design of the CO2 source-sink matching pipe network in the Huainan coalfield (first step).

First step: it is recommended to preferentially plan the pipeline route D9–D8–D7–B12–D6–D4–B8 in the eastern region, and the routes D3–B10 and D1–B4 in the western region. This planned pipeline can effectively connect the coal-fired power plants D9, D8, D7, D6 and D4 and the unworkable blocks B12, B8, B10 and B4 of the Huainan coalfield (Fig. 10). At this step, the total amount of CO2 that can be transported by the pipeline network is 6.65 billion tons, and the total amount of CO2 that can be stored is 2.27 billion tons, accounting for 56.99% and 38.74% of the total CO2 transportation and storage, respectively.

Second step: it is recommended to further plan the pipeline lines D10–D9, D7–B13, D7–B14, D4–B9, D5–B9, B10–B7 and B4–B3–B2, which further connect the deep unworkable seams in the eastern, central and western areas (Fig. 11). After this step of pipeline network planning, the total amount of CO2 transported can reach 10.345 billion tons and the total amount of CO2 stored can reach 5.84 billion tons, accounting for 88.66% and 99.39% of the total CO2 transport and storage, respectively.

Figure 11. Three-step planning and design of the CO2 source-sink matching pipe network in the Huainan coalfield (second step).

Third step: complete the design of all remaining pipelines to connect the deep unworkable seams in the east and west of the study area. It is suggested to add the B3 and B4 pipelines so that all CO2 emission sources and CO2 storage sinks in the Huainan coalfield are connected and all CO2 transportation and geological storage can be realized (Fig. 12).

Figure 12. Three-step planning and design of the CO2 source-sink matching pipe network in the Huainan coalfield (third step).

Pipeline network planning at the spatial scale

In this study, the location of each point in the deep unworkable seams was taken as the geometric center of its block (Fig. 1), but in the actual well layout the regional center is often not the only consideration. Therefore, analysis of the type of CCS pipeline within each region and planning of the CCS pipeline network between regions are very important (Fig. 13).

Figure 13. Schematic diagram of four types of CO2 pipelines connecting carbon sources and carbon sinks.

According to their location and use in the pipe network, CO2 pipelines can be divided into the following four types (Fig. 13): (1) gas collection branches, that is, pipelines connecting a CO2 source with a transfer point, with the transport phase determined by its economics; (2) distribution branches, that is, pipelines from the end of a trunk line to a carbon sequestration point; (3) intra-regional trunk lines, that is, trunk pipelines from a transfer point to the carbon sequestration points within a region; and (4) inter-regional trunk lines, that is, shared pipelines connecting regions. For the Huainan coalfield, at the spatial scale, priority should be given to planning the intra-regional pipe networks within the fault-bounded unworkable seam blocks, that is, the pipe networks within blocks B1–B15 (Fig. 13).

Whether at the scale of the Huainan coalfield or of China as a whole, the CCS pipe network layout should follow these ideas. First, small carbon sources within a region should be routed to main pipelines through gas collection branch lines, and commercial CO2 pipeline demonstration projects can be built. Second, the collection and distribution pipelines of regional carbon sources can be planned within a basin to form a shared backbone pipeline, and multiple CCS sequestration applications can be carried out simultaneously to build an inter-regional transport network demonstration. Then, for areas that lack storage conditions, inter-regional trunk pipelines should be built to gradually form a cross-regional onshore carbon network that fully meets source-sink matching and transport. Offshore CO2 storage resources should also be developed: suitable coastal injection points should be selected, marine transport pipelines and ship transport should be developed in parallel, and integrated transport-and-storage business models covering both land and sea should be established (Fig. 13).

Conclusions

In this study, the deep unworkable seams in the Huainan coalfield were taken as the research object. Firstly, the evaluation method for CO2 storage potential in deep unworkable seams was discussed. Secondly, the CO2 geological storage potential was analyzed. Then, CO2 source-sink matching for geological storage was carried out and the pipe network design was optimized. Finally, suggestions for the planning of the CCS source-sink network in the Huainan coalfield were put forward. The main conclusions are as follows:

The total annual CO2 emissions of the coal-fired power plants are 58.76 million tons, and the average annual CO2 emissions of individual plants vary greatly, ranging from 0.356 million tons to 17.12 million tons. The CO2 geological storage potential of the deep unworkable seams is huge, with a total of 762 million tons: 685 million tons, 53 million tons and 24 million tons of CO2 can be stored in the adsorbed, free and dissolved states, respectively. At the average annual CO2 emission rate of the coal-fired power plants, the deep unworkable seams can provide storage for 12.97 years. Over the 10-year period, the deep unworkable coal seams can store 587.6 million tons, with a cumulative planned pipeline length of 251.61 km requiring a cumulative capital of $4.26 × 10^10.

The main blocks for CO2 geological storage are B9, B12, B8 and B5, with storable capacities of 124 million tons, 114 million tons, 97 million tons and 85 million tons, respectively. The CO2 source-sink matching saved 98.75 km of pipeline and $25.67 billion in cost, accounting for 39.25% and 60.26% of the total mileage and cost, respectively. The mileage and cost savings in blocks B13 and B14 are the most obvious, accounting for 10.43% and 10.10% of the total mileage and 16.20% and 16.01% of the total cost, respectively.

Based on the three-step approach, the pipeline network connecting all CO2 emission sources and CO2 storage sinks in the Huainan coalfield can be completed in stages and by region, realizing all CO2 transportation and geological storage. The CO2 pipelines include gas collection branch lines, gas distribution branch lines, intra-regional trunk lines and inter-regional trunk lines. With a reasonable layout of these pipeline types, a variety of CCS sequestration applications can be carried out simultaneously, intra-regional and inter-regional demonstration networks for CO2 transport can be built, and integrated business models of CO2 transport and storage can be established on both land and sea.

Data availability

All data generated or analysed during this study are included in this published article (Please refer to the manuscript that has been uploaded).

Kong, H. et al. The development path of direct coal liquefaction system under carbon neutrality target: Coupling green hydrogen or CCUS technology. Appl. Energy 347, 121451 (2023).


Wang, X. et al. Research on CCUS business model and policy incentives for coal-fired power plants in China. Int. J. Greenh. Gas Control 125 , 103871 (2023).

Wang, F. et al. Carbon emission reduction accounting method for a CCUS-EOR project. Pet. Explor. Dev. 50 (4), 989–1000 (2023).


Han, J. et al. Coal-fired power plant CCUS project comprehensive benefit evaluation and forecasting model study. J. Clean. Prod. 385 , 135657 (2022).

Fan, J. et al. Modelling plant-level abatement costs and effects of incentive policies for coal-fired power generation retrofitted with CCUS. Energ Policy 165 , 112959 (2022).

Xu, S. et al. Repowering coal power in China by nuclear energy-implementation strategy and potential. Energies 15 (3), 1072 (2022).

Tian, Y. et al. Evolution dynamic of intelligent construction strategy of coal mine enterprises in China. Heliyon 8 (10), e10933 (2022).


Zhan, J. et al. Suitability evaluation of CO 2 geological sequestration based on unascertained measurement. Arab. J. Sci. Eng. 47 (9), 11453–11467 (2022).

Hou, L. et al. Self-sealing of caprocks during CO 2 geological sequestration. Energy 252 , 124064 (2022).

Sun, L. & Chen, W. Impact of carbon tax on CCUS source-sink matching: Finding from the improved China CCS DSS. J. Clean Prod. 333 , 130027 (2022).

Fan, J. et al. Near-term CO 2 storage potential for coal-fired power plants in China: A county-level source-sink matching assessment. Appl. Energy 279 , 115878 (2020).

Li, Y. et al. Grading evaluation and ranking of CO 2 sequestration capacity in place (CSCIP) in China’s major oil basins: Theoretical, effective, practical and CCUS-EOR. ACTA Geol. Sin.-Engl. 97 (3), 873–888 (2023).

Ming, X. et al. Thin-film dawsonite in Jurassic coal measure strata of the Yaojie coalfield, Minhe Basin, China: A natural analogue for mineral carbon storage in wet supercritical CO 2 . Int. J. Coal Geol. 180 , 83–99 (2017).

Fan, J. et al. Carbon reduction potential of China’s coal-fired power plants based on a CCUS source-sink matching model. Resour. Conserv. Recycl. 168 , 105320 (2021).

Liu, S. et al. Emission reduction path for coal-based enterprises via carbon capture, geological utilization, and storage: China energy group. Energy 273 , 127222 (2023).

Liu, S. et al. Evaluation of carbon dioxide geological sequestration potential in coal mining area. Int. J. Greenh. Gas. Control 122 , 103814 (2023).

Wang, F. et al. Mechanism of supercritical CO 2 on the chemical structure and composition of high-rank coals with different damage degrees. Fuel 344 , 128027 (2023).

Omotilewa, O. et al. Evaluation of enhanced coalbed methane recovery and carbon dioxide sequestration potential in high volatile bituminous coal. Gas Sci. Eng. 91 , 103979 (2021).

Liu, X. et al. Mechanistic insight into the optimal recovery efficiency of CBM in sub-bituminous coal through molecular simulation. Fuel 266 , 117137 (2020).

Li, Y. et al. Variation in permeability during CO 2 -CH 4 displacement in coal seams: Part 1-Experimental insights. Fuel 263 , 116666 (2020).

Li, J. et al. Simulation of adsorption-desorption behavior in coal seam gas reservoirs at the molecular level: A comprehensive review. Energy Fuel 34 (3), 2619–2642 (2020).

Hou, H. et al. Pore structure characterization of middle- and high-ranked coal reservoirs in northern China. AAPG Bull. 107 (2), 213–241 (2023).

Liu, H. et al. Insight into difference in high-pressure adsorption-desorption of CO 2 and CH 4 from low permeability coal seam of Huainan-Huaibei coalfield, China. J. Environ. Chem. Eng. 16 (6), 108846 (2022).

Yu, K. et al. Influence of sedimentary environment on the brittleness of coal-bearing shale: Evidence from geochemistry and micropetrology. J. Petrol. Sci. Eng. 185 , 106603 (2020).

Wang, G. et al. Pore structure characteristics of coal-bearing shale using fluid invasion methods: A case study in the Huainan-Huaibei Coalfield in China. Mar. Pet. Geol. 62 , 1–13 (2015).

Fang, H. et al. Numerical analysis of permeability rebound and recovery evolution with THM multi-physical field models during CBM extraction in crushed soft coal with low permeability and its indicative significance to CO 2 geological sequestration. Energy 262 , 125395 (2023).

Xiong, S., Lu, J. & Qin, Y. Prediction of coal-bearing strata characteristics using multi-component seismic data-a case study of Guqiao coalmine in China. Arab. J. Geosci. 11 (15), 408 (2018).

Zhang, K. et al. Experimental study on the influence of effective stress on the adsorption-desorption behavior of tectonically deformed coal compared with primary undeformed coal in Huainan coalfield, China. Energies 15 (18), 6501 (2022).

Wang, M. et al. Current research into the use of supercritical CO 2 technology in shale gas exploitation. Int. J. Min. Sci. Technol. 29 (5), 739–744 (2019).

Zhu, Q. et al. Optimal matching between CO 2 sources in Jiangsu province and sinks in Subei-Southern South Yellow Sea basin, China. Greenh. Gases 9 (1), 95–105 (2019).

Xu, H. et al. CO 2 storage capacity of anthracite coal in deep burial depth conditions and its potential uncertainty analysis: A case study of the No. 3 coal seam in the Zhengzhuang Block in Qinshui Basin, China. Geosci. J. 25 (5), 715–729 (2021).


Span, R. & Wagner, W. A new equation of state for carbon dioxide covering the fluid region from the triple-point temperature to 1100 K at pressures up to 800 MPa. J. Phys. Chem. Ref. Data 25 (6), 1509–1596 (1996).

Wang, J. Optimal Design of CCUS Source Sink Matching Pipe Network for Coal Fired Power Plants in North China (Chengdu University of Technology, Chengdu, 2021).


Sang, S. et al. Research progress on technical basis of synergy between CO 2 geological storage potential and energy resources. J. China Coal Soc. 48 (7), 2700–2716 (2023) ( in Chinese with English abstract ).

Mo, H., Liu, S. & Sang, S. Matching of CO 2 geological sequestration source and sink for industrial fixed emission source in Subei-Southern Yellow Sea Basin. Geol. Rev. 69 (S1), 128–130 (2023) ( in Chinese with English abstract ).

Liu, M. et al. Assessing the cost reduction potential of CCUS cluster projects of coal-fired plants in Guangdong province in China. Front. Earth Sci. 17 (3), 844–855 (2023).


Kong, H. et al. The development path of direct coal liquefaction system under carbon neutrality target: Coupling green hydrogen or CCUS technology. Appl. Energy 347 , 121451 (2023).

Huang, D., Hou, X. & Wu, Y. The mechanism and capacity evaluation on CO 2 sequestration in antiquated coal mine gob. Environ. Eng. 32 (S1), 1076–1080 (2014) ( in Chinese with English abstract ).

Sun, W., Zhang, E. & Wu, H. An analysis of production potential of residual CBM in Huainan mining area. Coal Geol. China 22 (12), 24–28 (2010) ( in Chinese with English abstract ).


Acknowledgements

We would like to express our gratitude to the anonymous reviewers for offering their constructive suggestions and comments which improved this manuscript in many aspects. This work was financially supported by the Natural Science Research Project of Anhui Educational Committee (2023AH040154), the Anhui Provincial Natural Science Foundation (2308085Y30), the Anhui Provincial Key Research and Development Project (2023z04020001), the National Natural Science Foundation of China (No. 42102217; 42277483), and the University Synergy Innovation Program of Anhui Province (No. GXXT-2021-018).

Author information

Authors and affiliations

School of Earth and Environment, Anhui University of Science and Technology, Huainan, 232001, Anhui, China

Huihuang Fang, Yujie Wang, Shua Yu, Huihu Liu, Jinran Guo & Zhangfei Wang

Institute of Energy, Hefei Comprehensive National Science Center, Hefei, 230000, China

Department of Geological Sciences, University of Saskatchewan, Saskatoon, SK, S7N 5E2, Canada

Huihuang Fang

Carbon Neutrality Institute, China University of Mining and Technology, Xuzhou, 221008, China

Shuxun Sang

School of Resources and Geosciences, China University of Mining and Technology, Xuzhou, 221116, China

Jiangsu Key Laboratory of Coal-based Greenhouse Gas Control and Utilization, China University of Mining and Technology, Xuzhou, 221008, China


Contributions

H.F. and S.S.: The conception and design of the study, revising it critically for important intellectual content, final approval of the version to be submitted. H.F. and Y.W.: Drafting the article. J.G. and H.L.: Drawing of all figures. H.F. and Z.W.: Collection and analysis of the field data. S.Y.: Derivation of mathematical models.

Corresponding author

Correspondence to Huihuang Fang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article

Fang, H., Wang, Y., Sang, S. et al. Potential assessment of CO 2 source/sink and its matching research during CCS process of deep unworkable seam. Sci Rep 14 , 17206 (2024). https://doi.org/10.1038/s41598-024-67968-w


Received : 29 April 2024

Accepted : 18 July 2024

Published : 26 July 2024

DOI : https://doi.org/10.1038/s41598-024-67968-w


Keywords

  • Carbon capture, utilization and storage (CCUS)
  • Source-sink matching model
  • CO 2 geological storage
  • Mileage saving method
  • Deep unworkable seam
  • Huainan coalfield


