Open Access | Published: 19 April 2024

Measuring, visualizing, and diagnosing reference bias with biastools

  • Mao-Jan Lin,
  • Sheila Iyer,
  • Nae-Chyun Chen &
  • Ben Langmead (ORCID: orcid.org/0000-0003-2437-1976)

Genome Biology, volume 25, Article number: 101 (2024)


Many bioinformatics methods seek to reduce reference bias, but no methods exist to comprehensively measure it. Biastools analyzes and categorizes instances of reference bias. It works in various scenarios: when the donor’s variants are known and reads are simulated; when donor variants are known and reads are real; and when variants are unknown and reads are real. Using biastools , we observe that more inclusive graph genomes result in fewer biased sites. We find that end-to-end alignment reduces bias at indels relative to local aligners. Finally, we use biastools to characterize how T2T references improve large-scale bias.

Most sequencing data analyses start by aligning sequencing reads to a reference genome. This strategy comes with a drawback called reference bias: the aligner tends to miss alignments or report incorrect alignments for reads containing non-reference alleles. This can lead to confounded measurements and incorrect results, especially for analyses of hypervariable regions [ 4 ], allele-specific effects [ 10 , 33 , 34 , 38 ], ancient DNA [ 17 , 26 ], or epigenomic signals [ 16 ].

Recent tools seek to reduce this bias by indexing collections of reference genome sequences, i.e., pangenomes. By including many known genetic variants in the pangenome, such methods remove alignment penalties incurred by known alternate alleles. This has spurred research in indexing graphs (e.g., the definition and use of Wheeler Graphs [ 13 , 35 ]) and repetitive collections of strings, e.g.,  r -index [ 21 ] and hybrid indexes [ 37 ]. These ideas are used in practical tools like HISAT2 [ 20 ], VG [ 14 ] and VG-Giraffe [ 35 ]. Mitigating reference bias is also the stated motivation for the Human Pangenome Reference Consortium’s project to create a human pangenome [ 25 ].

However, the topic of “reference bias” itself — what it means and how it happens — has received comparatively little attention. Studies proposing bias-reducing tools have evaluated and visualized reference bias in divergent ways. There are no standard tools or metrics, and no methods exist to trace specific causes of reference bias events.

We present Biastools , a tool for measuring and diagnosing reference bias in datasets from diploid individuals such as humans. In its simulate mode, biastools enables users to set up and run simulation experiments to (a) compare different alignment programs and reference representations in terms of the bias they yield, and (b) categorize instances of reference bias according to their cause, which might be primarily due to genetic differences, repetitiveness, local coordinate ambiguity due to gaps, or other causes. In its predict mode, biastools enables users to analyze real sequencing datasets derived from donors with known genetic variants, both quantifying the overall level of reference bias and predicting which specific sites are most affected by bias. In its scan mode, biastools enables users to analyze real sequencing datasets from individuals with no foreknowledge of their genetic variants, identifying regions of higher reference bias.

We use biastools to study reference bias in various scenarios, including using aligners like Bowtie 2 [ 22 ], BWA-MEM [ 24 ] and the pangenome graph aligner VG Giraffe [ 35 ]. Our results support previous studies that found that including more variants in a pangenome graph reference reduces reference bias [ 6 , 30 ]. Interestingly, we also find that end-to-end alignment modes of popular tools like Bowtie 2 and BWA-MEM (a local aligner by default, but with the ability to penalize non-end-to-end alignments) are particularly effective in reducing bias at insertions and deletions. By contrast, aligners that favor local alignments, with no penalty on “soft clipping,” exhibit more bias around gaps. Finally, we find that applying biastools ’s scan mode reveals large-scale differences in reference bias between alignment to the GRCh38 assembly [ 7 ] alone and a workflow that combines the GRCh38 and T2T-CHM13 [ 27 ] assemblies.

Ideally, a read aligner would map each read to its true point of origin with no bias toward one haplotype or the other. Also, an ideal method for analyzing read alignments and tallying the reference (REF) and alternate (ALT) alleles covering a given site would do so without introducing bias. However, real aligners, reference genomes and assignment methods are imperfect, and several factors interact to produce distinct reference-bias signatures. We describe how biastools can measure and plot reference bias. We focus on bias in the context of diploid individuals (i.e., human) being sequenced using high-quality short reads, e.g., from Illumina instruments.

Measuring sources of bias in simulation

We performed a simulation experiment using biastools ’s simulate mode, detailed in the “ Methods ” “ Biastools workflow ” section. We started from a Variant Call Format (VCF) file describing HG002’s variants as determined by the Q100 project [ 31 , 32 ], a collaboration between the Telomere-to-Telomere (T2T) consortium, the Human Pangenome Reference Consortium (HPRC), and the Genome in a Bottle (GIAB) project. We generated a diploid personalized reference genome for HG002 using bcftools consensus . We used biastools --simulate , which in turn uses mason2 [ 19 ], to simulate Illumina-like whole genome sequencing (WGS) data to a total of ∼30× average coverage, taking ∼15× evenly from the two haplotypes. We used standard read aligners including Bowtie 2 [ 22 ], BWA-MEM [ 24 ], and Minimap 2 [ 23 ] to align to the GRCh38 reference genome [ 7 ]. We used VG Giraffe [ 35 ] to align to various graph pangenomes.

Types of allelic balance

After simulation and alignment, we measured three types of allelic balance at each heterozygous (HET) variant site (Fig. 1 ). We measured simulation balance (SB) as the proportion of simulated reads overlapping the HET that originated from the REF-carrying haplotype. SB is computed purely from the simulator output; the simulator annotates reads with their haplotype and point of origin. We measured mapping balance (MB) as the allelic balance at each HET site considering only the reads that both truly originated from the HET (as reported by the simulator) and that overlapped it after read alignment. An overlapping read that originated from the REF-carrying haplotype contributes a REF allele, and likewise for an ALT-carrying read and ALT allele. MB ignores fine-grained details about how individual bases line up to the HET site in the pileup. Note that the simulation balance and mapping balance both use information from the simulator.

Finally, we measured assignment balance (AB) as the allelic balance after using an assignment algorithm to determine the haplotype of origin for each read overlapping the HET site. This does not make use of information from the simulator, and so can be measured for real reads as well as simulated ones. Assignment balance depends on the particular algorithm used to assign alignments to haplotypes. We tried two distinct algorithms, a “naive” assignment algorithm and a “context-aware” algorithm. The naive algorithm simply examines the nucleotides from each read that align across the HET site and computes a ratio according to how many of those sequences matched the REF allele versus how many matched the ALT allele. That is, the naive algorithm trusts that the aligner is correct and precise in how it places each base in the alignment and pileup.

Fig. 1: Illustration of the types of balance measurement — SB, MB, and AB — with respect to read simulation, read mapping, and haplotype assignment. Note that mismapped reads are excluded when calculating MB, and reads assigned “Others” are also excluded when calculating AB. Columns indicate distinct types of bias event. “Loss*” indicates a bias event due to reads with ALT alleles failing to align. “Loss**” indicates a bias event due to reads mapping elsewhere than their true point of origin. “Flux” indicates bias from gaining mismapped reads from other sites. “Local” indicates that local repeat content, as well as sequencing errors, combine to make a gap placement ambiguous

The context-aware algorithm, on the other hand, does not trust the aligner’s decisions, instead revisiting and possibly changing those decisions in light of all the alignments and the ploidy of the donor. It is a multi-part algorithm that decides whether each read is contributing a REF or ALT allele, or whether to exclude the read from consideration for lack of context. Assignment algorithms are detailed in the “ Methods ” “ Assignment method ” section.

Types of reference bias

To categorize instances of reference bias, we computed the following combinations of simulation balance (SB), mapping balance (MB), and assignment balance (AB):

Normalized mapping balance (NMB) \(\equiv\) MB - SB. NMB > 0 implies that mapping creates more bias toward the REF allele compared to simulation, while NMB < 0 means mapping creates bias toward ALT.

Normalized assignment balance (NAB) \(\equiv\) AB - SB. NAB > 0 implies that alignment and assignment together create more bias toward the REF allele compared to simulation, while NAB < 0 means mapping and assignment create bias toward ALT.
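As a minimal illustration (not the biastools implementation), the following Python sketch computes NMB and NAB from the three balance measures; all inputs are REF-allele fractions in [0, 1].

```python
# Illustrative sketch: NMB and NAB for one HET site, given the three
# balance measures defined above (all REF-allele fractions in [0, 1]).

def normalized_balances(sb: float, mb: float, ab: float) -> tuple[float, float]:
    """Return (NMB, NAB), where NMB = MB - SB and NAB = AB - SB."""
    return mb - sb, ab - sb

# Example: simulation gives a 0.50 REF fraction, mapping 0.68, assignment 0.71.
# NMB of about 0.18 and NAB of about 0.21 both indicate bias toward REF.
nmb, nab = normalized_balances(0.50, 0.68, 0.71)
```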

To demonstrate the utility of these measures, we examined the read alignments produced by Bowtie 2. We measured and plotted allelic balance at HET sites according to their NMB (horizontal) and NAB (vertical) (Fig. 2 ). Since SNVs and gaps exhibited distinct bias profiles, we plotted them separately. In this plot, HET sites with little or no bias will appear close to the origin. We called sites “balanced” and colored them green if they were within \(\pm 0.1\) of 0 for NMB and NAB.

Fig. 2: Normalized mapping balance to normalized assignment balance (NMB-NAB) plot of a SNV sites with the naive assignment method, b SNV sites with the context-aware assignment method, c insertion and deletion sites with the naive assignment method, and d insertion and deletion sites with the context-aware assignment method. Each dot represents a variant site in HG002 chromosome 20. The simulated reads are aligned using Bowtie 2 with default parameters. The balance and bias subcategories are classified based on the position of the dots (“ Biased-site classification ” section). For visual clarity, sites with no correctly-mapped REF reads are omitted; the full plot including these sites is available as Additional file 1 : Fig. S1

We next categorized HET sites that appeared far from the origin and along the diagonal (colored orange), the bulk of which were in the upper-right quadrant. Proximity to the diagonal indicates MB and AB are equally distant from SB. We inferred that this bias signature was likely introduced in the mapping stage, when reads systematically failed to align to the ALT-carrying haplotype. We called this “loss” bias. Most loss events appear in the upper right (as opposed to the lower left) because the ALT allele is usually harder for the aligner to map across, causing the aligner to fail more often.

We next categorized HET sites that were vertically above or below the origin. These sites had near-zero NMB, meaning that mapping did not introduce significant bias. The combination of near-zero NMB with non-zero NAB indicates that the reads overlapping the site are roughly evenly drawn from the REF and ALT alleles, but that the assignment algorithm has a bias in which allele it assigns. For points above the origin, there is a bias toward the REF allele after assignment.

We further divided these into “flux” and “local” events. Flux events (colored blue) involve reads with low mapping quality, indicating that the read aligner had nearly-equally-good choices for where to map these reads. Such reads may be placed incorrectly, leading to the true evidence for REFs and ALTs being spread (and averaged) over many copies of a repeat. Flux events were more common for SNVs and rarer for insertions and deletions.

Local events (colored purple) are those where the evidence comes from mostly high-mapping-quality reads. In these cases, we hypothesized that the bias was caused by the assignment step. When using the naive assignment method (Fig. 2 a, c), most local bias events were caused by short tandem repeats, which created many equally good gap placements. Out of 3228 local bias events (SNVs and gaps combined), 2561 (79%) were at sites annotated by RepeatMasker: 1012 in simple repeats (microsatellites), 302 in LINEs, and 934 in SINEs.

When gap placement decisions are not consistent from read to read, this interferes with correct tallying of REF and ALT evidence and contributes to bias. This bias can potentially be avoided post facto by reconsidering and modifying the base-by-base alignments in light of the expected ploidy of the donor and the other alignments. This is the goal of past work on “local realignment” or “indel realignment,” sometimes implemented in standalone tools [ 1 , 18 ] or as components of larger variant-calling systems [ 8 , 11 ].

A small number of sites did not belong to any of the above categories, and we called these “outliers” (colored gray). These can result from the co-occurrence of multiple of the above causes. A visual representation of all these categories is shown in the “ Biased-site classification ” section.

Observations on local bias

Comparing Fig. 2 panels a and c (naive assignment) versus panels b and d (context-aware assignment), we observed that the context-aware method yielded fewer local-bias events than the naive method, especially for insertions and deletions. This was expected, since gap-placement ambiguity can cause the aligner to place a gap in a position that differs from its VCF position. The context-aware method avoids this by disregarding the aligner’s gap decisions and scanning reads directly for variant sequences. Further, we stratified panels c and d by gap length (Additional file 1 : Fig. S2). The three rows, from top to bottom, show gaps longer than 10, 20, and 50 bases, assigned by the naive or context-aware method. The longer the gap, the larger the fraction of variants classified as “local” or “flux” bias under naive assignment. The context-aware method, by contrast, classified the majority of variants as “balanced” or “loss” in all scenarios.

We also observed that the context-aware method did not totally avoid local bias (Fig. 2 b, d). Since this method requires that a substring of the read have an exact match to the REF or ALT allele at the site (“ Assignment method ” section), sequencing errors can affect the assignment balance either by artificially boosting the evidence for REF or ALT (if an error spuriously creates a match), or more frequently by attenuating the evidence (if an error disrupts a match). This effect is more severe for longer insertions or deletions, since more opportunities exist for a position to mismatch. For long insertions, we expect the shorter REF allele to be less vulnerable to disruption by sequencing errors and so to be over-represented. For long deletions, we expect the ALT allele to be over-represented.

When multiple variants are situated near each other with respect to the reference, the read aligner can make decisions that cause context-aware assignment to fail. This can happen when a collection of nearby variants including gaps can be “explained” using fewer gaps and mismatches, causing portions of the read to shift with respect to the reference. An example is presented in Additional file 1 : Fig. S3. The shifting is more likely to happen in the ALT allele, whereas sequencing errors happen roughly evenly in REF and ALT haplotypes.

Visualizing bias for indels

We evaluated reference bias as a function of insertion and deletion length using the bias-by-allele-length plot (Fig. 3 ), modeled on a plot made in previous publications [ 9 , 14 , 35 ]. Here, the vertical axis is the ratio of alternate alleles observed spanning HET sites. That is, the vertical axis is the ratio ALT/(ALT+REF), where ALT and REF refer to the number of reads supporting the alternate and reference alleles respectively. For SNVs (length = 0), all measurements were well centered on 0.5. The naive assignment method (red) exhibited substantial bias across indel lengths, whereas both mapping balance (orange) and balance from context-aware assignment (green) stayed close to the simulation balance. This occurs for the same reasons that we see more local bias events for the naive assignment method in Fig. 2 .
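To make the construction of such a plot concrete, here is a small Python sketch (our own illustration; the (length, ref_count, alt_count) tuple layout is an assumption, not biastools’ internal format) that aggregates the ALT fraction per variant-length stratum.

```python
# Illustrative sketch: aggregate ALT/(ALT+REF) per variant-length stratum
# for a bias-by-allele-length plot. `sites` holds (length, ref_count,
# alt_count) tuples; length is 0 for SNVs, positive for insertions, and
# negative for deletions. Lengths beyond +/-25 are collapsed, as in Fig. 3.
from collections import defaultdict
from statistics import median

def alt_fraction_by_length(sites):
    strata = defaultdict(list)
    for length, ref, alt in sites:
        if ref + alt == 0:
            continue                           # no informative reads here
        length = max(-25, min(25, length))     # collapse long indels
        strata[length].append(alt / (ref + alt))
    # median ALT fraction per stratum (the dots in the plot)
    return {l: median(fracs) for l, fracs in sorted(strata.items())}
```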

Measuring bias across aligners

We performed the above analysis using multiple read aligners, including Bowtie 2 (in its default end-to-end alignment mode) [ 22 ], BWA-MEM [ 24 ], BWA-MEM with option “ -L 30 ” (to encourage end-to-end alignment) and the VG Giraffe graph aligner [ 35 ]. For VG Giraffe, we performed alignment using four different indexes.

Giraffe-linear: a graph consisting only of the linear reference genome GRCh38 [ 7 ].

Giraffe-major: a graph consisting of the GRCh38 reference but with major alleles added. With the addition of the major alleles, the graph contains 1,998,961 polymorphic sites.

Giraffe-pop5: A graph consisting of all the variants from 5 pre-built haplotype genomes based on the “RandFlow-LD” pangenome used in the Reference Flow study [ 6 ]. Each haplotype genome is based on a 1000 Genomes Project (1KGP) super-population. At each polymorphic site, the ALT allele is chosen with probability equal to its allele frequency. Linkage disequilibrium is preserved within each 1000 bp chunk. There are a total of 6,461,708 polymorphic sites across the 5 pre-built haplotype genomes combined.

Giraffe-1KGP: A graph containing all the phase-3 variants from the 1KGP with allele frequency greater than 0.01, using GRCh38 as the reference. This graph contains a total of 13,511,768 polymorphic sites.

While Giraffe-linear uses the Giraffe graph aligner, the “graph” consists of a single linear genome in that case. The linear and major indexes serve as baselines to highlight how the inclusion of more variation (i.e., for Giraffe-pop5 and Giraffe-1KGP) impacts bias.

Fig. 3: Bias-by-allele-length plots showing Simulation Balance (blue), Mapping Balance (orange), Assignment Balance using context-aware assignment (green), and Assignment Balance using naive assignment (red). Variant length varies along the x-axis, with positive values standing for insertions, negative values for deletions, and 0 for SNVs. Alignment was done with Bowtie 2 on HG002 simulated data. Top: balance for all four measures. Dots represent the median of the distribution and the whiskers indicate the first and third quartiles. Middle: zoom-in on Mapping Balance and context-aware Assignment Balance with data normalized by subtracting the median SB in each stratum. Bottom: number of variants of each length. Gaps exceeding 25 bp are collapsed into the −25 or +25 strata

Fig. 4: Bias-by-allele-length for 8 alignment workflows. We used simulated and real WGS datasets derived from HG002. We subsetted to reads aligning to HET sites on chromosome 20. Variants are arranged according to their length, with positive values standing for insertions and negative values standing for deletions. Zero indicates SNVs. a Fraction of ALT alleles in the simulation (blue) and after mapping of simulated reads (other colors). b Fraction of ALT alleles after mapping and context-aware assignment using simulated reads. c Fraction of ALT alleles after mapping and context-aware assignment using real reads. d The number of variants of each size

These experiments use the same simulated HG002 WGS dataset as in the previous section. In all cases, we used the context-aware assignment method to analyze allelic balance with respect to Q100 project-called variants for HG002 chromosome 20. Table 1 tallies and categorizes reference-bias events at chromosome-20 HET sites using the same classification strategy as in Fig. 2 . The only category where aligners produced substantially different tallies was “loss,” consistent with this category being directly related to the mapping of reads. Since Bowtie 2’s default alignment mode is end-to-end alignment (which does not perform soft clipping) whereas the default mode for all other tools was local alignment (allowing soft clipping), we hypothesized that end-to-end alignment was a less biased strategy for gaps. To test this, we included results for BWA-MEM with the -L 30 option, which increases the threshold for clipping from its default of -L 5 . Specifically, BWA-MEM allows clipping only in cases where the increase in alignment score is greater than the number specified with -L . Consistent with our hypothesis, BWA-MEM with the -L 30 option achieved the most balanced events for gaps compared to all other methods, including the end-to-end aligner, Bowtie 2, which achieved the second-most. The difference between the BWA-MEM modes is illustrated in Additional file 1 : Fig. S4. BWA-MEM -L 30 generally performed somewhat better than Minimap2 in all categories.

Comparing results for the various Giraffe indexes, we observed that the number of biased sites decreased as we moved from the linear reference (Giraffe-linear) to the references inclusive of more genetic variation (major, pop5 and 1KGP), with the reduction being chiefly due to loss events. The trend holds for both SNVs and gaps. We repeated the analysis on chromosome 16, giving similar results as for chromosome 20 (Additional file 1 : Table S1).

Figure 4 shows bias-by-allele-length plots for each aligner, along with the SB baseline (blue). Note that all of the balance measurements are modified to put the ALT count in the numerator, for consistency with past studies. Panel a shows mapping balance (MB), and b shows assignment balance (AB) using the context-aware algorithm. In all cases, the lines tend to diverge more for the more extreme-length insertions and deletions. The bias for longer insertions appears greater than that for longer deletions. Note that reads carrying inserted sequence contain fewer bases that align to the reference, which in turn makes them harder to align correctly. This is in contrast to reads spanning deletions, which still align well to the reference genome, albeit with a deletion-sized gap. In addition, reads carrying insertions sometimes fail to span the whole insertion, leaving only one end overlapping the reference. BWA-MEM -L 30 stays closest to the simulation balance, followed by Bowtie 2 and VG Giraffe, while BWA-MEM with default settings is the most biased. Across the different Giraffe indexes, the balance improves from the linear to the major, pop5, and 1KGP indexes.

Measuring bias using real reads on well-characterized genome

Biastools can also be applied to study reference bias in real datasets. Here we discuss biastools ’s usage when reads come from a well-studied individual for which we have foreknowledge of HET sites. Since simulation balance (SB) and mapping balance (MB) rely on information from the simulator, we do not use them here. We continue to use assignment balance (AB), computed with the context-aware assignment algorithm.

Visualizing bias

We made the bias-by-allele-length plot shown in Fig. 4 c. Since simulation balance is not available as a baseline, we used an ALT fraction of 0.5 as the baseline. The trends observed were similar to those observed for simulated data (Fig. 4 a, b). BWA-MEM with the -L 30 option and Bowtie 2 had the most even balance for longer insertions and deletions. For VG Giraffe, the indexes that included more variants had less bias than the indexes with fewer variants.

Classifying biased sites

Given a set of read alignments, biastools can predict which sites were affected by reference bias. To do this, biastools first performs context-aware assignment and measures allelic balance at the HET sites. Biastools also measures the mapping quality of the alignments overlapping each HET site, since low mapping-quality reads indicate possible mis-mapping due to repeats.

We hypothesized that a combination of (a) allelic balance and (b) the average mapping quality of the overlapping reads could be used to predict if a variant site is affected by reference bias. We combined allelic balance (which varies from 0 to 1) with average mapping quality (normalized to vary from 0 to 1) using both addition and multiplication, then used these to rank sites according to their likelihood to be affected by bias. We applied these both to the simulated read data aligned by Bowtie 2, and to the real reads aligned by Bowtie 2. At each HET, we applied the classifier and compared its true/false categorization to the categorization obtained using the NMB-NAB analysis detailed in the “ Biased-site classification ” section. Recall that the NMB-NAB categorization uses information about the simulated points of origin to classify sites as balanced or as one of several bias-event categories: loss, flux, local, or outlier. In this evaluation, we collapse these into a single “biased” category.
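As a rough sketch of this scoring idea (our own illustration: the feature transformations and the MAPQ ceiling of 60 are assumptions, not biastools’ exact formula), a site can be scored and ranked as follows.

```python
# Illustrative sketch of the two-feature bias classifier. The features are
# (a) allelic balance and (b) average MAPQ of overlapping reads; here they
# are turned into an "imbalance" term and a "low-MAPQ" term and combined by
# multiplication or addition. The exact transformation used by biastools
# may differ; this only conveys the ranking idea.

def bias_score(allelic_balance: float, avg_mapq: float,
               combine: str = "mul", max_mapq: float = 60.0) -> float:
    imbalance = abs(allelic_balance - 0.5) * 2.0          # 0 = even, 1 = extreme
    low_mapq = 1.0 - min(avg_mapq, max_mapq) / max_mapq   # 1 = highly ambiguous
    return imbalance * low_mapq if combine == "mul" else imbalance + low_mapq

# Rank HET sites from most to least likely biased (hypothetical sites).
sites = {"chr20:1000": (0.92, 12.0), "chr20:2000": (0.52, 58.0)}
ranked = sorted(sites, key=lambda s: bias_score(*sites[s]), reverse=True)
```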

Fig. 5: The receiver operating characteristic (ROC) curve and the precision-recall (PR) curve of the biastools classifier on Bowtie 2 alignments. a ROC curve for SNVs, b PR curve for SNVs, c ROC curve for gaps, d PR curve for gaps. The four lines are for simulated (blue and orange) and real data (green and red) with multiplication scoring (mul) and addition scoring (add). AUC: area under curve

While we lack ground-truth information about which HET sites are biased for the real reads, we assumed that bias events observed in the HG002 simulation would also occur in the real HG002 reads. That is, we transferred the ground-truth bias labels from the simulation to the real data. Figure 5 shows the receiver operating characteristic (ROC) curve and precision/recall (PR) curve evaluating our two-feature classifier. Panels a and b show the resulting curves for SNVs. Panel a shows that the classifier had area-under-curve (AUC) above 0.95 in all cases, whether we used addition or multiplication to combine features, and whether we evaluated on simulated or real reads. The PR curve for SNVs (panel b) had area-under-precision-recall-curve (AUPRC) ranging from 0.87 to 0.91. Further, the PR curves showed a more pronounced difference whereby classification accuracy for real data was lower than for simulated data.

For gaps, however, the ROC (Fig. 5 c) and PR (Fig. 5 d) curves were noticeably worse than for SNVs, with AUC of ROC ranging from 0.83 to 0.89, and AUPRC ranging from 0.57 to 0.66. That is expected, since the majority of biased SNV sites are loss or flux events that are well characterized by our allelic balance and average mapping quality features. For gaps, however, a larger proportion of the bias comes from loss or local events, and our features are only partially effective at capturing local bias.

Measuring bias using real reads from an uncharacterized genome

While the above experiments used either simulated reads or foreknowledge of HET sites, a common scenario is that the reads come from a donor individual whose variants are unknown. We hypothesized that biastools could still detect biased regions based on three measures: (a) read depth, (b) density of ALT alleles detected, and (c) frequency of sites for which the evidence is inconsistent with a diploid state. We expect some or all of these measures to become extreme in areas affected by reference bias. For example, if a donor has multiple copies of a segmental duplication that exists in a single copy in GRCh38, reads from the duplicates will accumulate in a single region on GRCh38, leading to higher depth and, due to the collapsed evidence, some non-diploid variants.

Biastools ’s scan mode computes windowed running statistics over the pileup. In each window, it computes a read depth (RD) score, a variant density (VD) score, and a non-diploidy (ND) score, each of which is ultimately transformed to a Z score. The Z scores are then combined by taking their sum. Regions with a combined score ≥ 5 are called “biased” and regions with a score in the interval [3, 5] are called “suspicious.” When biased regions are close to each other (within 1 kbp), they are combined into one longer biased region. The same combining is applied to suspicious regions. Details are in the “ Methods ” “ Sliding window approach of scan mode ” section.
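A minimal sketch of this windowed scoring, assuming per-window RD, VD, and ND measures have already been computed (the input layout and helper names are ours, not biastools’):

```python
# Illustrative sketch of scan mode's windowed scoring (inputs assumed;
# thresholds follow the text above).
import statistics

def combined_scores(rd, vd, nd):
    """rd, vd, nd: per-window read-depth, variant-density, and non-diploidy
    measures (equal-length lists). Returns the summed Z scores per window."""
    def z(xs):
        mu, sd = statistics.mean(xs), statistics.pstdev(xs) or 1.0
        return [(x - mu) / sd for x in xs]
    return [sum(t) for t in zip(z(rd), z(vd), z(nd))]

def call_biased_regions(scores, starts, window, merge_dist=1000, threshold=5.0):
    """Windows with score >= threshold are biased; nearby biased windows
    (within merge_dist bp) are merged into one longer region."""
    regions = []
    for start, score in zip(starts, scores):
        if score < threshold:
            continue
        if regions and start - regions[-1][1] <= merge_dist:
            regions[-1][1] = start + window    # extend the previous region
        else:
            regions.append([start, start + window])
    return regions
```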

To evaluate scan mode, we ran it on the simulated HG002 read data from the “ Measuring sources of bias in simulation ” section, aligned by Bowtie 2 to the full GRCh38 reference. It reported 72,165 biased regions of average length 872 bp and 90,368 suspicious regions of average length 326 bp across the genome. While the input to this experiment was simulated data, our analysis does not use any information from the simulator, nor does it use foreknowledge of HG002’s variants. Focusing on chromosome 20, we compared the regions called biased by scan mode with the variant sites called biased using context-aware assignment as described in the “ Results ” “ Measuring sources of bias in simulation ” section. Scan mode called 3384 biased regions on chromosome 20, covering 4.9% of its bases. Of the SNVs and gaps on chromosome 20 that the biastools classifier (which does use simulation information and foreknowledge of HETs) calls balanced, 81% and 74%, respectively, fell outside the regions called biased by scan mode. On the other hand, 75% of SNV sites and 78% of gap sites called biased by the NMB-NAB analysis fell inside regions called biased by scan mode (Table 2 ).

In this way, scan mode reproduced the results of the per-site classifier in part, but not completely. This is expected since scan mode lacks foreknowledge of HET locations.

Bias near structural differences

As a further demonstration of scan mode, we ran it on the real HG002 Bowtie 2 alignments used in “ Results ” “ Measuring bias using real reads on well-characterized genome ” section. Biastools scan marked \(4.6\%\) of the GRCh38 primary assembly (considering the 22 autosomes and the two sex chromosomes) as belonging to biased regions. Since bias can be caused by missing or incorrectly collapsed sequence in the reference, we hypothesized that the biased regions would have a tendency to be in or near HG002’s structural variants (SVs). We examined HG002’s SVs as called by Human Genome Structural Variation Consortium (HGSVC) [ 12 ] with length over 100 bp, finding 9709 insertions and 5521 deletions. We found that 6183 insertions ( \(64\%\) ) and 2690 deletions ( \(49\%\) ) fell inside or within 100 bp of regions called biased by scan mode (Table 3 ). The greater enrichment of insertions in biased regions was expected, since reads containing inserted non-reference sequence are more likely to align incorrectly.

Bias due to incomplete reference representations

We used scan mode to compare two different reference representations and alignment strategies. The first used Bowtie 2 to align directly to GRCh38, as we did above. We call this the “direct-to-GRC” method. The second used a workflow that additionally makes use of the complete telomere-to-telomere (T2T) CHM13 human genome assembly. The second method uses Bowtie 2 to align first to the more complete T2T-CHM13 assembly. Then, for reads that fail to align unambiguously to T2T-CHM13, it additionally aligns those to the GRCh38 assembly. For reads that align successfully to both, the alignment with the higher alignment score (to its original target, not necessarily to GRCh38) is chosen. After merging, all alignments are ultimately “lifted” to GRCh38, i.e., translated into GRCh38 coordinates. We call the second method — the one that uses both T2T-CHM13 and GRCh38 — the “LevioSAM 2” method, since it was first proposed in the LevioSAM 2 study by Chen et al. [ 5 ].

We found that 4.0% of the LevioSAM 2 alignments fell into regions classified as biased by biastools scan (125,498 biased regions, average length 974 bp), compared to 4.5% for the direct-to-GRC alignments (130,771 regions, average length 1071 bp). We used bedtools subtract to find the regions called biased by one method (direct-to-GRC or LevioSAM 2) but not the other. Out of the 130,771 regions called biased in direct-to-GRC, 27,831 (21%) had more than 25% of their bases improved when using LevioSAM 2. Conversely, 11,447 (9%) of the biased regions in LevioSAM 2 were aligned in a more balanced way in direct-to-GRC.

Since the improved performance of the LevioSAM 2 workflow is related to the completeness of the T2T-CHM13 reference relative to GRCh38, we hypothesized that the improvements would tend to be in regions where the T2T-CHM13 assembly is known to be superior, such as centromeres. We define a biased region as near a centromere if it lies inside the centromeric region or within 500 kbp of it. Together, these extended centromeric regions span 86,076,358 bp, around 3% of the whole genome. We collected all the biased regions improved by more than 25% in LevioSAM 2 and measured how many of the improved bases came from regions near centromeres and how many did not. Thirty-eight percent of the improved bases were near centromeres, a strong enrichment relative to the 3% baseline (Table 4 ). Furthermore, if we consider only biased regions longer than 1000 bp, the fraction of bases near centromeres rises to 40%, indicating that biased regions near centromeres tend to be longer.

Fig. 6: Biased regions of HG002 called by biastools for two different alignment methods. The tracks from top to bottom are: combined Z-score for direct-to-GRC alignment, combined Z-score for LevioSAM 2 alignment, IGV read arrangement for direct alignment, read arrangement for LevioSAM 2, “biased regions” for direct alignment, and “biased regions” for LevioSAM 2. Combined Z scores include read depth, variant density, and non-diploidy. Scores above 10 are truncated in the panel to show the details between 0 and 10. Note that the read coverage tracks use different scales: for direct alignment the track ranges from 0 to 254, while that of LevioSAM 2 ranges from 0 to 60

Figure 6 illustrates a region near a centromere where the direct-to-GRC method yields more reference bias compared to LevioSAM 2. Non-gray colors (blue, red, green, orange) in the IGV pileup denote places where alignments carried an ALT allele relative to GRCh38. The top pileup shows that direct-to-GRC alignment created a dense area of ALT alleles (evident from the density of non-gray coloring). Further, the direct-to-GRC alignments tended to cover the region to much higher depth compared to the LevioSAM 2 alignments, evident from the scaling of the top (0–254) and bottom (0–60) coverage tracks. These factors indicate that, for direct-to-GRC alignment, reads from more than one region of the donor genome have aligned in a “collapsed” fashion to this single region, creating extreme values for RD, VD, and ND and causing biastools ’s scan mode to mark the entire region as biased.

The LevioSAM 2 pileup exhibits much less bias, though biastools ’s scan mode reports some small biased regions here, as can be seen in the bottommost panel. The contrast between the combined RD, VD, and ND scores is illustrated toward the top of the screenshot, where the blue curves show the combined score, truncated to remain in the interval [0, 10]. The threshold for calling a region biased is 5. Most of the regional scores for direct alignment (upper track) are actually above 10, while only a few regions for LevioSAM 2 (lower track) reach 10.

Computational performance

To test the computational efficiency of biastools, we performed experiments using the simulated WGS data on a Linux x86_64 system with a single thread (Table 5 ). While the various alignment tools take different amounts of time, Minimap 2 was the fastest. As a result, we used Minimap 2, plus the necessary alignment-sorting task, as the baseline for our measurements. After alignment and sorting, context-aware assignment and generation of the bias report ( sim mode) took 9.24 h with a peak memory footprint of 37.90 GB. When run on the real WGS reads (without ground-truth information), biastools’ assignment phase took 8.46 h and used a peak memory footprint of 15.39 GB.

To run biastools_scan --scan , the input file must be in mpileup format. Transforming .bam format to mpileup with samtools and then performing biastools_scan --scan took 47.38 h, while performing biastools_scan --scan on an existing mpileup file took 6.42 h. The peak memory usage in either case was around 368 GB.

We presented biastools , a novel method and tool that directly measures and categorizes instances of reference bias. In a simulation setting, we demonstrated its utility for identifying different categories of reference-bias events, and used this facility for comparing some well known alignment methods. Using real data, we showed its accuracy in a range of situations, including when we either do or don’t have foreknowledge of the donor individual’s HET sites.

As the bioinformatics community continues to develop new bias-avoiding methods [ 15 ] we expect biastools ’s ability to measure and categorize bias events will be essential. Direct measurement of reference bias will lead to clearer interpretation and evaluation compared to the alternative of measuring accuracy in a downstream result like variant calling. Findings obtained using biastools will help in designing the next generation of reference representations and alignment algorithms. For instance, our finding that end-to-end alignment leads to less bias in some circumstances could indicate that future algorithms should favor end-to-end alignments in more situations.

By measuring reference bias at an early point in the alignment process, biastools can disentangle reference bias due to the aligner and reference representation from any bias caused by downstream tools. This is particularly important since downstream tools can themselves be tuned (or trained) to counteract reference bias, sometimes “learning” the bias, when the more effective measure would be to analyze and remove the bias upstream. An example is the DeepVariant variant caller, which can refuse to call variants in bias-prone regions of the genome [ 5 ].

In the future, it will be important to refine biastools ’s models for predicting whether a given site is experiencing reference bias. In particular, the model presented here in the “ Results ” “ Measuring bias using real reads on well-characterized genome ” section performs well for relatively simple variants like SNVs, but not as well for gaps. To improve the utility of biastools , it will be important to include more information in this model to allow for more accurate predictions. In particular, a future task is to develop models that both identify relevant features (beyond coverage and MAPQ) and combine them to make a prediction in an automated way, possibly using deep learning. Indeed, such models may exist within the larger models already developed for variant calling in tools like DeepVariant [ 29 ]. To date and to our knowledge, no existing model is designed for the specific task of measuring reference bias, which is key to understanding how well upstream tools are fulfilling their stated purpose.

Currently, biastools supports only diploid genomes, since most of the work on reference-bias avoidance has focused on human and other diploid genomes. However, biastools can in principle be extended to genomes of higher ploidy. For instance, the simulation and assignment methods would be essentially the same for a triploid, with the expected allelic balance ratios being 1:2 or 1:1:1. Note that the problem of distinguishing reference bias from sequencing error becomes harder as the ploidy increases.

This study focuses on short reads, since their shorter length makes them more prone to reference bias. However, biastools ’s methods are applicable to long-read alignments as well. Reference bias will manifest differently for long reads compared to short ones. Since long-read aligners have the benefit of longer sequence length and more anchors, scattered pockets of dense ALT alleles are less likely to affect the aligner’s ability to place the read correctly. In light of this, we expect biastools ’s scan mode to be particularly well suited to identifying the larger-scale bias events that are likely to dominate the reference bias landscape for long reads.

Conclusions

Biastools is a novel method and tool that directly measures and categorizes instances of reference bias. As new reference representations and alignment tools continue to be developed, biastools can help to standardize and formally measure the degree to which they address the reference-bias problem.

Biastools workflow

Biastools analyzes, measures, and reports instances of reference bias in short-read alignments. Biastools focuses on bias with respect to diploid genomes, though the constituent methods could be generalized to other ploidies. If genetic variants are not known for the donor genome, biastools ’s scan mode reports regions that are “biased” or “suspicious.” If the donor has known variants, biastools ’s predict mode performs a more detailed analysis, taking bias measurements at each heterozygous site. Biastools ’s simulate mode goes a step further by first running a read simulator, then analyzing the simulated reads with one or more read alignment workflows. This allows for detailed categorization of bias events (e.g., whether they are due to loss, flux, etc.), and for comparative studies of bias caused by different tools and reference representations.

simulate mode

To obtain a diploid reference from which reads can be simulated, biastools --simulate first uses bcftools consensus to generate the two FASTA-format haplotypes for the donor individual from a reference genome and a set of phased variant calls in VCF format. biastools --simulate then uses mason2 to simulate Illumina-like short reads from the autosomes of the two haplotypes. biastools --simulate uses different random seeds for the two haplotypes to avoid correlation between the read coverage profiles. Note that mason2 annotates simulated reads with their haplotype and point of origin. In the experiments of the “ Results ” “ Measuring sources of bias in simulation ” and “ Measuring bias across aligners ” sections, the individual with high-quality variant calls was HG002, and the VCF file used was from the Q100 project, which provides phased variant information for HG002. We filtered out variants that had been placed in any “FILTER” category, including variants that lacked evidence on one haplotype.

Simulated reads are then aligned to the GRCh38 primary assembly with one or more user-specified read alignment workflows. Bowtie 2 and BWA-MEM align directly to an index of GRCh38. VG Giraffe aligns to a graph based on GRCh38, with all read alignments ultimately surjected (“lifted”) onto GRCh38. For each variant site, biastools analyzes the site using both its naive and its context-aware assignment methods, detailed in the “ Methods ” “ Assignment method ” section. Given the evidence supporting the REF and ALT alleles, three levels of allelic balance are calculated: the simulated balance (SB), mapping balance (MB), and assigned balance (AB). SB and MB require information about the reads’ true haplotype and point of origin, which are provided by the simulator, whereas AB is based only on the results of the context-aware assignment method (“ Methods ” “ Assignment method ” section). These measures in turn allow biastools to categorize HET sites, as detailed in the “ Methods ” “ Biased-site classification ” section.

predict mode

This mode, biastools --predict , uses its context-aware assignment method to analyze each variant site. Since we lack simulated ground truth, only the AB measure is computed. This is sufficient to predict instances of reference bias (see the “ Results ” “ Measuring bias using real reads on well-characterized genome ” section), and to create diagnostic plots like the bias-by-allele-length plot (Figs. 3  and  4 ).

As presented in the “ Results ” “ Measuring bias using real reads on well-characterized genome ” section, biastools can predict which HET sites are affected by reference bias using a simple model. The model uses two inputs computed by biastools --predict : (a) the average mapping quality (MAPQ) of all the reads overlapping the site, and (b) the allelic balance at the variant site. This model is too simplistic to divide instances of bias into categories such as flux and loss. Still, our evaluations of the simple model, using simulated data to obtain ground truth for testing, indicate that it performs quite well on data derived from HG002 and aligned to GRCh38.

scan mode

biastools_scan --scan first uses samtools mpileup to transform the alignments into the column-wise mpileup format. Biastools then scans the mpileup file, performing a windowed analysis and seeking regions with unusual degrees of (a) depth of coverage, (b) SNV variant density, or (c) instances where the evidence is inconsistent with a diploid donor genome. The three measurements are combined into a single score by adding or multiplying them. Regions with a combined score above a threshold are marked as “biased.” We cross-checked scan mode using both simulated and real data.

Assignment method

Biastools contains two algorithms (the “naive” and the “context-aware” algorithms) for assigning reads to haplotypes. Both examine each read that aligns across a given site and assign each read to the reference-allele-carrying (REF) or the alternate-allele-carrying (ALT) haplotype. This problem is made difficult by the presence of sequencing errors, ambiguity in placement of alignment gaps, and the presence of repetitive sequence. While both algorithms attempt to assign each read to one haplotype or the other, they can fail in the case of some reads, ultimately assigning them to neither haplotype.

Before describing these assignment methods, we first describe how biastools computes two different baselines for understanding allelic balance.

Simulated balance (SB)

SB is computed as the number of ground-truth REF reads simulated from across the site, divided by the total number of reads simulated across the site. That is, it is the ratio REF/(REF+ALT), where REF and ALT are obtained by examining the simulated reads and simply counting the number that overlap the site and come from the REF-carrying haplotype and ALT-carrying haplotype.

Mapping balance (MB)

MB is computed as the fraction of reads overlapping the site that both (a) originated overlapping the site, and (b) aligned overlapping the site. Information from the read simulator is used to determine the read’s haplotype and point of origin. Reads that aligned overlapping the site but that were actually simulated from elsewhere in the genome are not counted in the MB measure. The MB measure differs from the SB measure since some reads truly originating from the site will fail to align there.
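A minimal sketch of how SB and MB can be computed at a single HET site from simulator-annotated reads (the read-record layout is assumed for illustration and is not biastools’ internal representation):

```python
# Illustrative sketch: SB and MB at one HET site. Each read record stores
# the haplotype it was simulated from ("REF" or "ALT" at this site), its
# true origin span, and its aligned span (None if unaligned).

def overlaps(span, pos):
    return span is not None and span[0] <= pos < span[1]

def sb_and_mb(reads, pos):
    truth = [r for r in reads if overlaps(r["true_span"], pos)]
    if not truth:
        return None, None
    # SB: REF fraction among reads truly simulated across the site
    sb = sum(r["hap"] == "REF" for r in truth) / len(truth)
    # MB: same fraction, restricted to truth reads whose alignments also
    # overlap the site; mismapped and unaligned reads are dropped
    mapped = [r for r in truth if overlaps(r["aln_span"], pos)]
    mb = sum(r["hap"] == "REF" for r in mapped) / len(mapped) if mapped else None
    return sb, mb
```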

VCF files can contain nearby variants that are interdependent in a way that prevents the sites from varying independently. For example, a deletion could extend through and cover an SNV; that is, the deletion removes the SNV site, making the SNV neither REF nor ALT. Some VCF files use “./.” to represent such cases. To avoid the complications that arise from these cases, we identified instances of overlapping variants and removed them from consideration by ignoring all of the polymorphisms involved.

Naive assignment method

Given all of the reads that aligned overlapping a given site, the naive assignment method examines which base(s) from the reads align to the variant’s exact reference coordinates. From those, it tallies the REF/(REF+ALT) fraction. For insertions and deletions, the method only tallies a read if its sequence exactly matches the ALT or REF allele. If the sequence is different from both reference and alternative allele, e.g., if the sequence was affected by one or more sequencing errors or if the placement of gaps or insertions was different from the VCF, the read is classified as “other” and is not counted.

Note that this method uses the exact base-by-base alignment information reported by the read aligner. In other words, decisions made by this assignment method are essentially the same as those that would be made by examining the pileup columns corresponding to the variant. The following context-aware method improves upon the naive method by reanalyzing the read sequences.
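A minimal sketch of the naive tally at one variant site, assuming the per-read pileup sequences have already been extracted from the alignments:

```python
# Illustrative sketch of the naive assignment tally for one variant site.
# `piled_alleles` holds, for each overlapping read, the sequence that the
# aligner placed at the variant's exact reference coordinates.

def naive_assign(piled_alleles, ref_allele, alt_allele):
    ref = sum(a == ref_allele for a in piled_alleles)
    alt = sum(a == alt_allele for a in piled_alleles)
    other = len(piled_alleles) - ref - alt    # errors or shifted gaps: not counted
    balance = ref / (ref + alt) if ref + alt else None
    return ref, alt, other, balance
```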

Context-aware assignment method

This method works by searching for the REF and ALT alleles, together with some of their flanking sequence, within the sequences of all the reads that aligned overlapping the variant. As a first step, this method extracts variant information from the VCF, constructing strings that represent the REF/ALT alleles together with their flanking sequence. We use the term “allelic context sequence” to describe the allele together with its flanking sequence. The default flanking sequence length is 5 bp. Note that flanking sequences are drawn from the same haplotype as the allele; e.g., if two phased SNV variants are within 5 bp of each other, each will appear – phased appropriately – in the other’s flanking sequence.

To determine whether a read overlapping a variant site supports the REF or ALT allele, the read sequence is scanned for the allelic context sequences for REF and/or ALT. If exactly one of the two (REF or ALT) context sequences is found, the read is classified accordingly. The allelic context sequence need not appear in its entirety; it is sufficient for a suffix or prefix to appear, as long as a suffix or prefix of the other does not also appear. A read may contain context sequences but lack the context to distinguish REF from ALT. That is, the read sequence may contain equally good matches for both alleles. This is particularly common in regions with tandem repeats. In this case, the read is classified as “both” REF and ALT for the purpose of tallying bias. In cases where the read sequence contains neither of the allelic context sequences, the read is classified as “other.” “Both” and “other” reads are excluded from the AB calculation.
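A simplified sketch of this classification step (full-string matching only; the prefix/suffix relaxation, cohorts, anchoring, and effective-variant extension described below are omitted, and the example alleles are hypothetical):

```python
# Illustrative sketch: classify one read against the REF and ALT allelic
# context sequences, i.e., the allele plus flanking sequence on each side.

def context_sequences(left_flank, ref_allele, alt_allele, right_flank):
    return (left_flank + ref_allele + right_flank,
            left_flank + alt_allele + right_flank)

def classify_read(read_seq, ref_ctx, alt_ctx):
    has_ref = ref_ctx in read_seq
    has_alt = alt_ctx in read_seq
    if has_ref and has_alt:
        return "both"    # cannot distinguish, e.g., inside a tandem repeat
    if has_ref:
        return "REF"
    if has_alt:
        return "ALT"
    return "other"       # sequencing error or no usable context

# Hypothetical example: a 1-bp deletion (REF "AT", ALT "A") with 5 bp flanks.
ref_ctx, alt_ctx = context_sequences("GGCAC", "AT", "A", "TTGCA")
print(classify_read("ACTGGCACATTTGCATT", ref_ctx, alt_ctx))   # -> "REF"
```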

Subtleties can arise when many variants are clustered close together, with some variants (i.e., indels) affecting the coordinates at which others occur. In such cases, the evidence for any one of the variants is best understood in the context of the entire phased cluster of variants. This type of method has been adopted by multiple previous tools when analyzing variant combinations that might involve indels [ 2 , 36 ]. The context-aware assignment method clusters variants appearing within a short distance (default: 25 bp) of each other into a “cohort.” The cohort extends in each direction until no further variants can be reached within that distance. For such variants, the context-aware assignment algorithm first takes the entire (clustered) REF and ALT strings and searches for them within the sequences of the overlapping reads. A read assigned in this way is tallied with respect to all of the variant sites making up the cohort. That is, if three phased variants are involved in a cohort and the REF allele string is found in a read, that read counts toward the REF tally for all three variants.

While some overlapping reads can be tallied in this way, some overlapping reads might not overlap all or much of the cohort. For reads that cannot be assigned using the entire cohort string, the assignment algorithm falls back on the variant-by-variant strategy described previously.

When comparing the context sequences to the REF and ALT sequences, there may be a need to try different anchoring points, especially when gaps and tandem repeats are present. This is illustrated in Fig. 7. On the left, the coordinate of read1 is the same as the reference because its insertion is anchored on both sides and correctly placed. But for read2 and read3, the inserted sequence cannot be anchored on one side, causing their coordinates to shift with respect to the reference. To deal with this, the context-aware method first tries anchoring the read at the left-side boundary of the variant. If no match is found between the context sequence and either REF or ALT, the method tries anchoring the read at the right-side boundary of the variant (Fig. 7). In the same fashion, when comparing the read sequence against the cohort of a set of variants, both the left and right ends of the cohort are tried as anchors. In this way, the aligner’s placement of gaps does not affect the comparison as long as the alignment beyond one of the variant boundaries is correct.

Note that the context-aware comparison method has limitations in cases where the variant calling file provides only partial information. For instance, when true variants are missing from the VCF, bias measurements at nearby sites can be inaccurate because biastools lacks the accurate flanking sequences needed for context-aware assignment. Similarly, absence of accurate (or any) phasing information can interfere with biastools ’s ability to establish accurate flanking sequences for assignment.

Repetitive context

When a variant is situated in or near a tandem repeat, it may not be possible to distinguish the REF and ALT alleles simply by taking a fixed sequence context. For example, in Fig. 8 a, the REF haplotype contains attc repeated 7 times in tandem. The ALT haplotype has the same sequence repeated only 6 times. If we compare only the variant region defined in the VCF, which differs by a single attc, it is easy to mistake a read carrying the one-attc deletion for a reference read if the aligner does not place the gap in exactly that position.

To cope with this complication, we defined the concept of an “effective variant.” When building the variant map, if one context sequence (REF or ALT) is a prefix, suffix, or substring of the other, the “context-aware” method keeps extending the variant. If the repetitiveness is on the right side, that is, one context sequence is the other’s prefix, the method extends the variant to the right until the first difference is encountered. For example, in Fig. 8 a, the effective variant becomes the whole repetitive region of attc s. Similarly, the method extends the variant to the left if the repetitiveness is on the left, that is, if one context sequence is the other’s suffix. Occasionally, the repetitiveness is on both sides (Fig. 8 b). In these cases, the method chooses the side where the extension is shorter. A read that does not cover the entire effective variant is classified as “both,” reflecting the fact that we cannot determine the true origin of a read that does not cover the whole repetitive region. Reads that only partially cover the effective variant are not evaluated in our simulation experiments, since the assignment method cannot determine their haplotype and including them would only confound the results. Variants whose effective variant is longer than 70 bp are also disregarded in the analysis.
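A minimal sketch of the right-side extension (the left-side and both-side cases described above are handled analogously and are omitted here; this is an illustration, not the biastools code):

```python
# Illustrative sketch: extend an "effective variant" to the right through a
# tandem repeat until the REF and ALT strings first differ, so that one is
# no longer a prefix of the other.

def extend_right(ref_seq, alt_seq, right_ref_seq):
    """right_ref_seq: reference sequence to the right of the variant."""
    i = 0
    while (ref_seq.startswith(alt_seq) or alt_seq.startswith(ref_seq)) \
            and i < len(right_ref_seq):
        ref_seq += right_ref_seq[i]   # the shared reference sequence is
        alt_seq += right_ref_seq[i]   # appended to both alleles
        i += 1
    return ref_seq, alt_seq

# Hypothetical example: REF "ATAT", ALT "AT" inside an AT repeat followed by GGG.
print(extend_right("ATAT", "AT", "ATATATGGG"))
# -> ('ATATATATATG', 'ATATATATG'): the effective variant spans the repeat.
```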

Fig. 7: Aligned reads and variants in alignment coordinates and expansion coordinates. In expansion coordinates, the expansion can be anchored on either the left or the right side of the variant

Fig. 8: Two examples of repetitive context. a The repetition extends to the right, so the effective variant is extended to the right until the ALT context sequence is no longer a prefix of the REF context sequence. b A case where the original ALT context sequence is a substring of the REF context sequence. There are two choices of effective variant; biastools chooses the shorter one (choice 2)

Biased-site classification

For simulated reads, we can diagnose the cause of the bias by examining our bias measures (AB from the two assignment methods) as well as our baseline measures (SB and MB). We divide biased sites into three categories (or “events”): loss, flux, and local. Loss events are caused by ALT-carrying reads that fail to align to their true point of origin. Flux events are caused by reads that aligned to a site but originated from another site on the genome. Local events occur when the aligner places reads in roughly the correct place but the assignment method determines their haplotype incorrectly. This can happen when the assignment method is misled by the placement of gaps, which accounts for most “local” bias cases under naive assignment. “Local” bias also occurs when the aligner shifts a read by a few bases because of tandem repeats, or when sequencing errors fall unevenly across the variant region.

Fig. 9 Illustration of bias categorization with NAB and NMB. Variants within the green circle of radius 0.1 centered at the origin are classified as balanced. Variants in the yellow region along the diagonal are categorized as “loss” bias. The blue region, where \(|\text {NMB}| > 0.1\) and excluding the “loss” region, classifies variants as either “flux” or “local” bias; the distinction between the two is determined by whether more than 5 reads are mismapped to the site. Variants falling outside these regions are classified as outliers. NAB: normalized assignment balance; NMB: normalized mapping balance

We rely on three combined measures to classify the biases. One is the normalized mapping balance (NMB), equal to MB - SB. NMB measures bias that manifests due to read alignment. Another combined measure is the normalized assignment balance (NAB), equal to AB - SB. NAB measures bias that manifests due to either read alignment or a failure to correctly tally the evidence present in the overlapping aligned reads, e.g., due to ambiguity caused by gaps and tandem repeats. A final measure is the number of reads that aligned to the site incorrectly due to having ambiguous alignments.

Our bias categories are defined based on these three measures. Figure 9 illustrates how categories are determined. Most sites do not exhibit reference bias and so tend to appear near the origin of the plot, meaning that MB and AB are both close to SB. Specifically, any site falling within the circle of radius 0.1 about the origin is classified as “balanced.”

The yellow region surrounding the diagonal \(y=x\) line in Fig. 9 (but excluding the “balanced” circle around the origin) demarcates the sites categorized as “loss” events; its boundary is defined by two lines with slopes of 2 and 1/2. A position along the diagonal means that NMB and NAB are close to each other, indicating that the assignment method reflects the balance of reads mapping to the site. A position in the upper-right quadrant means that the site is biased toward the reference, which results from the loss of ALT-carrying reads. In rare cases, the reads carrying the reference allele are lost instead, and the variant site falls in the lower-left quadrant. The blue region in Fig. 9 is where variants with discordant NMB and NAB are located. In most such cases, NMB is close to zero while NAB is positive, meaning that read mapping closely matches the simulation but the assignment is incorrect. Because flux and local biases occupy similar positions, we introduce a third measure, the number of mismapped reads, to differentiate the two categories: for a variant site in the blue region, if more than 5 reads come from elsewhere in the genome, the site is classified as “flux”; otherwise it is classified as “local.” Sites not included in the green, yellow, or blue regions are classified as outliers.
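The decision flow can be summarized with a short sketch. The balanced circle and the loss band (slopes 1/2 and 2) follow the text; the boundary of the blue flux/local region is taken from the Fig. 9 caption. This is an approximation of the published figure, not biastools’s exact implementation.

```python
import math

def classify_site(nmb: float, nab: float, n_mismapped: int,
                  radius: float = 0.1, mismap_cutoff: int = 5) -> str:
    """Assign a simulated variant site to a bias category from its NMB, NAB,
    and number of mismapped reads, approximating the regions in Fig. 9."""
    if math.hypot(nmb, nab) <= radius:                 # green circle at the origin
        return "balanced"
    if nmb * nab > 0 and 0.5 <= nab / nmb <= 2.0:      # yellow band around y = x
        return "loss"
    if abs(nmb) > radius:                              # blue region per the Fig. 9 caption
        return "flux" if n_mismapped > mismap_cutoff else "local"
    return "outlier"
```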

Construction of pangenome graphs

To construct the pangenome graph, we used vg autoindex --workflow giraffe with the GRCh38 reference and the target VCF file. Then we used vg giraffe with the index files to align both the simulated and real reads. The option -o BAM was used to project the alignment result back to the linear reference GRCh38.

The command we used to filter the 1KGP variants with allele frequency greater than 0.01 was bcftools view --min-af 0.01 . The command leaves only the non-reference alleles with frequency greater than 0.01 in the population.

Evaluating biased-site predictions

As mentioned in Results, the absence of phasing information in the “truth” VCF can create problems for biastools’s algorithms. Before evaluating the performance of the prediction model on real read data, we first filtered out sites potentially affected by incomplete phasing information. To identify these, we classified each HET site as “affected” if more than \(90\%\) of reads covering the site contained an “other” (i.e., neither REF nor ALT) allele, or if evidence for one of the two HET alleles was completely absent and more than \(40\%\) of reads contained an “other” allele. We then omitted the affected HETs from further analysis.
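A minimal sketch of this filter, assuming per-site counts of REF-, ALT-, and other-allele reads are available (the thresholds are those stated above; the counting inputs are a simplification for illustration):

```python
def affected_by_missing_phasing(n_ref: int, n_alt: int, n_other: int) -> bool:
    """Return True if a HET site should be excluded before evaluation."""
    total = n_ref + n_alt + n_other
    if total == 0:
        return False
    other_frac = n_other / total
    if other_frac > 0.90:                      # almost all reads carry an "other" allele
        return True
    # One of the two HET alleles is entirely absent and "other" alleles are common.
    return (n_ref == 0 or n_alt == 0) and other_frac > 0.40
```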

Since these real reads have no known point of origin, the measures used previously — e.g., NMB, NAB, and the number of mismapped reads — are not available. We can still evaluate AB for each variant using biastools’s assignment methods. We found that the most relevant measures for real read alignments are the average read mapping quality and the AB of the variant. Mapping quality serves as a proxy for whether reads from elsewhere in the genome align to the site, or reads originating at the site align elsewhere. AB captures whether a variant suffers from biased read loss or gain. AB does not capture the reason for the bias; i.e., sites with an unbalanced REF-to-ALT ratio can result from random sequencing error or from systematic bias. Still, we found that variant sites with extreme AB and low average mapping quality were likely to be biased sites.

We found that transforming AB and average mapping quality into Z scores and combining them provided a useful measure of bias. We used two methods to combine the Z scores: multiplication and addition.

Note that 42 is the maximum possible score for the Bowtie 2 and BWA-MEM aligners; for VG Giraffe, this was adjusted to Giraffe’s maximum of 60. We observed that the two combinations performed similarly when predicting bias at SNV variants (Fig. 5).
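As an illustration only (the displayed equations from the published article are not reproduced here), the sketch below assumes both quantities are standardized across all HET sites, oriented so that larger values indicate more suspicious sites, and then combined by multiplication and by addition. It omits the role of the aligner-specific maximum score mentioned above, which enters the published formulas.

```python
import numpy as np

def combined_bias_scores(ab, avg_mapq):
    """Standardize AB and average MAPQ across all HET sites and combine them.
    Higher values of either combined score suggest a more suspicious site."""
    ab = np.asarray(ab, dtype=float)
    mq = np.asarray(avg_mapq, dtype=float)
    z_ab = (ab - ab.mean()) / ab.std()
    z_low_mq = -(mq - mq.mean()) / mq.std()      # low mapping quality -> large value
    return z_ab * z_low_mq, z_ab + z_low_mq      # multiplicative and additive scores
```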

Sliding window approach of scan mode

In scan mode, biastools uses bcftools mpileup to obtain an alignment pileup over the target region. Biastools scans the region with a sliding window (default: 400 bp), computing windowed averages for three measures: read depth, variant density, and instances of non-diploid pileups. The three measures are combined into a bias score as follows:

The read depth (RD) as a Z score: (window mean RD - total mean RD)/(total RD std). Since we are interested only in cases where RD is much greater than average, any Z score less than 1 is rounded to 0.

The variant density (VD) as a Z score: (window mean VD - total mean VD)/(total VD std). Since we are interested only in cases where density is greater than average, negative Z scores are truncated to 0.

The non-diploid (ND) score as a Z score: (window mean ND - total mean ND)/(total ND std). Since we are interested only in cases where the evidence is inconsistent with a diploid state, negative Z scores are truncated to 0.

The non-diploid (ND) score is calculated from the ratio of nucleotides appearing at each individual position in the window. For a given position, any nucleotide appearing with greater than \(15\%\) frequency is considered an allele (i.e., it is not likely to be a coincidence of sequencing errors). Any position with more than one allele is considered a SNV. A position is called non-diploid if either (a) more than two alleles are present at the \(>15\%\) level, or (b) the most frequent allele has a frequency more than twice that of the second most frequent.
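A sketch of the per-position test, assuming base counts from the pileup are available as a Counter (the function name and inputs are illustrative):

```python
from collections import Counter

def is_non_diploid(base_counts: Counter, min_frac: float = 0.15) -> bool:
    """Per-position non-diploid test, e.g. base_counts = Counter({'A': 14, 'G': 5, 'T': 1})."""
    depth = sum(base_counts.values())
    if depth == 0:
        return False
    # Alleles are the bases exceeding the 15% frequency threshold, largest first.
    alleles = sorted((c for c in base_counts.values() if c / depth > min_frac),
                     reverse=True)
    if len(alleles) > 2:                                  # more than two alleles
        return True
    return len(alleles) == 2 and alleles[0] > 2 * alleles[1]   # strongly skewed pair
```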

A region is classified as “biased” if the sum of the RD, VD, and ND scores is ≥ 5, and as “suspicious” if the sum is ≥ 3 and \(<5\). If two nearby biased regions lie within 1 kbp of each other, scan mode chains them into one single, longer biased region. In a similar fashion, windows with unusually high but not extreme scores are classified as suspicious and are likewise linked together when they fall within 1 kbp of each other.
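A minimal sketch of the window labeling and chaining steps, assuming the truncated Z scores have already been computed; the label names and interval representation are illustrative.

```python
def label_window(z_rd: float, z_vd: float, z_nd: float) -> str:
    """Label one sliding window from its (already truncated) Z scores."""
    score = z_rd + z_vd + z_nd
    if score >= 5:
        return "biased"
    if score >= 3:
        return "suspicious"
    return "unbiased"

def chain_regions(regions, max_gap: int = 1000):
    """Merge flagged regions, given as (start, end) pairs, that lie within 1 kbp."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```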

Note that the transformation to Z scores requires that biastools determine (or estimate) the scores’ means and standard deviations for the dataset. The user can choose to have these computed automatically using a sampling method, which by default samples \(1/1000^{\text {th}}\) of the genome sequence and estimates the parameters from the alignment data in that subset. Alternatively, the user can specify pre-calculated means and standard deviations.

The final score of a sliding window is the sum of the three truncated Z scores: \(\text {score} = Z_{\text {RD}} + Z_{\text {VD}} + Z_{\text {ND}}\).

Comparing two alignment workflows with scan mode

To compare alignments from two alignment workflows, we first obtained a single set of average and standard-deviation parameters, derived jointly from the alignments generated by both workflows. We found that using independently sampled parameters, i.e., obtaining separate average and standard-deviation parameters for each workflow, would create an imbalance. For example, since “LevioSAM 2” produced an overall less biased set of alignments in our experiment, the average and std values of RD, VD, and ND were lower. The biased and suspicious regions reported by biastools scan were therefore less extreme for “LevioSAM 2” than for the more biased workflow that aligned directly to GRCh38.

To obtain these joint parameters, biastools scan samples from both alignment BAM files, creating a sample drawn half from one workflow and half from the other, so that scan mode can be applied to both BAM files with the same scoring.
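A sketch of the joint parameter estimation, assuming equally sized samples of a given measure (RD, VD, or ND) have been drawn from each workflow’s alignments; the names and pooling details are illustrative.

```python
import statistics

def joint_mean_std(sample_from_a, sample_from_b):
    """Pool equally sized samples of one measure from the two workflows' BAM files
    and return a single mean/std used to score both workflows consistently."""
    n = min(len(sample_from_a), len(sample_from_b))
    pooled = list(sample_from_a[:n]) + list(sample_from_b[:n])
    return statistics.mean(pooled), statistics.stdev(pooled)
```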

When comparing the biased regions from the two alignment workflows, regions with low or no read depth were excluded, since it was difficult to interpret these as being improved by one workflow or the other. An example of such a dubious “improvement” is illustrated in Additional file 1: Fig. S5. To classify a region identified as “biased” in one workflow as “improved” by the other workflow, biastools scan requires that at least 25% of the bases in the region be both well covered (over \(1/5^{\text {th}}\) of the overall average read depth) and not classified as biased.
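A sketch of this criterion, assuming per-base depths for the region and a per-base biased/unbiased mask from the other workflow are available (the inputs and function name are illustrative):

```python
def region_improved(depths, biased_mask, overall_mean_depth, min_frac=0.25):
    """A region flagged as 'biased' in one workflow counts as improved by the other
    if at least min_frac of its bases are well covered (> 1/5 of the overall mean
    read depth) and not flagged as biased in the other workflow's alignment."""
    if not depths:
        return False
    good = sum(1 for depth, biased in zip(depths, biased_mask)
               if depth > overall_mean_depth / 5 and not biased)
    return good >= min_frac * len(depths)
```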

Availability of data and materials

The VCF file of HG002 from the Q100 project was downloaded from the GIAB HG002 GRCh38 assembly-based small and structural variants draft benchmark sets [ 28 ] with the URL https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.012-20231107/GRCh38_HG002-T2TQ100-V1.0_smvar.vcf.gz .

The real short read sequencing data for HG002 was downloaded from Google brain genomics sequencing dataset for benchmarking and development [ 3 ] with the URL https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/30x/ .

The software biastools is available at https://github.com/maojanlin/biastools with the Zenodo DOI: 10.5281/zenodo.10819028 and https://pypi.org/project/biastools/ under the MIT license. Scripts for the experiments described in this paper are at https://github.com/maojanlin/biastools_experiment , with the Zenodo DOI 10.5281/zenodo.10818966 .

Anson EL, Myers EW. ReAligner: a program for refining DNA sequence multi-alignments. J Comput Biol. 1997;4(3):369–83.

Assmus J, Kleffe J, Schmitt AO, Brockmann GA. Equivalent indels-ambiguous functional classes and redundancy in databases. PLoS ONE. 2013;8(5):e62803.

Baid G, Nattestad M, Kolesnikov A, Goel S, Yang H, Chang PC, et al. Google Brain Genomics Sequencing Dataset for Benchmarking and Development. Dataset. 2020. https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/30x . Accessed 15 Apr 2024.

Brandt DY, Aguiar VR, Bitarello BD, Nunes K, Goudet J, Meyer D. Mapping Bias Overestimates Reference Allele Frequencies at the HLA Genes in the 1000 Genomes Project Phase I Data. G3 (Bethesda). 2015;5(5):931–41.

Chen NC, Paulin LF, Sedlazeck FJ, Koren S, Phillippy AM, Langmead B. Improved sequence mapping using a complete reference genome and lift-over. Nat Methods. 2024;21(1):41–9.

Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021;22(1):1–17.

Church DM, Schneider VA, Steinberg KM, Schatz MC, Quinlan AR, Chin CS, et al. Extending reference assembly models. Genome Biol. 2015;16:13.

Cooke DP, Wedge DC, Lunter G. A unified haplotype-based method for accurate and comprehensive variant calling. Nat Biotechnol. 2021;39(7):885–92.

Crysnanto D, Pausch H. Bovine breed-specific augmented reference graphs facilitate accurate sequence read mapping and unbiased variant discovery. Genome Biol. 2020;21(1):184.

Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, et al. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics. 2009;25(24):3207–12.

DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491–8.

Ebert P, Audano PA, Zhu Q, Rodriguez-Martin B, Porubsky D, Bonder MJ, et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 2021;372(6537):eabf7117.

Gagie T, Manzini G, Sirén J. Wheeler graphs: a framework for BWT-based data structures. Theor Comput Sci. 2017;698:67–78.

Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol. 2018;36(9):875–9.

Garrison E, Guarracino A. Unbiased pangenome graphs. Bioinformatics. 2023;39(1):btac743.

Groza C, Kwan T, Soranzo N, Pastinen T, Bourque G. Personalized and graph genomes reveal missing signal in epigenomic data. Genome Biol. 2020;21(1):1–22.

Günther T, Nettelblad C. The presence and impact of reference bias on population genomic studies of prehistoric human populations. PLoS Genet. 2019;15(7):e1008302.

Hagiwara K, Edmonson MN, Wheeler DA, Zhang J. indelPost: harmonizing ambiguities in simple and complex indel alignments. Bioinformatics. 2022;38(2):549–51.

Holtgrewe M. Mason: a read simulator for second generation sequencing data. Technical Reports of Institut für Mathematik und Informatik, Freie Universität Berlin; 2010. TR-B-10-06.

Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37(8):907–15.

Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient construction of a complete index for pan-genomics read alignment. J Comput Biol. 2020;27(4):500–13.

Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357.

Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–100.

Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013. arXiv preprint arXiv:1303.3997 .

Liao WW, Asri M, Ebler J, Doerr D, Haukness M, Hickey G, et al. A draft human pangenome reference. Nature. 2023;617(7960):312–24.

Martiniano R, Garrison E, Jones ER, Manica A, Durbin R. Removing reference bias and improving indel calling in ancient DNA data analysis by mapping to a sequence variation graph. Genome Biol. 2020;21(1):250.

Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44–53.

Olson ND, Zook JM. GIAB HG002 GRCh38 Assembly-Based Small and Structural Variants Draft Benchmark Sets. Dataset. 2023. https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.012-20231107/ . Accessed 15 Apr 2024.

Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. BioRxiv. 2018:201178.

Pritt J, Chen NC, Langmead B. FORGe: prioritizing variants for graph genomes. Genome Biol. 2018;19(1):220.

Rautiainen M, Nurk S, Walenz BP, Logsdon GA, Porubsky D, Rhie A, et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat Biotechnol. 2023;41(10):1474–82.

Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, et al. The complete sequence of a human Y chromosome. Nature. 2023;621(7978):344–54.

Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, et al. AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol Syst Biol. 2011;7(1):522.

Salavati M, Bush SJ, Palma-Vera S, Mcculloch MEB, Hume DA, Clark EL. Elimination of reference mapping bias reveals robust immune related allele-specific expression in cross-bred sheep. Front Genet. 2019;10:863.

Sirén J, Monlong J, Chang X, Novak AM, Eizenga JM, Markello C, et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science. 2021;374(6574):abg8871.

Sun C, Medvedev P. VarMatch: robust matching of small variant datasets using flexible scoring schemes. Bioinformatics. 2017;33(9):1301–8.

Valenzuela D, Norri T, Välimäki N, Pitkänen E, Mäkinen V. Towards pan-genome read alignment to improve variation calling. BMC Genomics. 2018;19(Suppl 2):87.

Van De Geijn B, McVicker G, Gilad Y, Pritchard JK. WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat Methods. 2015;12(11):1061–3.

Acknowledgements

This work was carried out at the Advanced Research Computing at Hopkins (ARCH) core facility, which is supported by the National Science Foundation (NSF) grant number OAC 1920103.

Peer review information

Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history

The review history is available as Additional file 2 .

ML, SI, NC and BL were supported by NIH grant R01HG011392 to BL.

Author information

Authors and Affiliations

Department of Computer Science, Johns Hopkins University, Baltimore, USA

Mao-Jan Lin, Sheila Iyer, Nae-Chyun Chen & Ben Langmead

Contributions

ML, SI, NC and BL designed the method. ML and SI wrote the software and performed the experiments. ML and BL wrote the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Mao-Jan Lin or Ben Langmead .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1:

  Figure S1. Full normalized mapping balance to normalized assignment balance (NMB-NAB) plot. Figure S2. Normalized mapping balance to normalized assignment balance (NMB-NAB) plot stratified by allele length. Figure S3. Example of local decision by Bowtie 2 and BWA MEM. Figure S4. Example of local decision by default BWA MEM and BWA MEM with option -L 30. Figure S5. An example of the low coverage result of LevioSAM 2 and direct-to-GRC methods. Table S1. Number of balanced sites and different categories of biased sites on chromosome 16.

Additional file 2:

 Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Lin, MJ., Iyer, S., Chen, NC. et al. Measuring, visualizing, and diagnosing reference bias with biastools. Genome Biol 25, 101 (2024). https://doi.org/10.1186/s13059-024-03240-8

Received : 13 September 2023

Accepted : 04 April 2024

Published : 19 April 2024

DOI : https://doi.org/10.1186/s13059-024-03240-8

Keywords

  • Reference bias
  • Sequence alignment
  • Pangenomics

Genome Biology

ISSN: 1474-760X

data assignment method

The Assignment Method: Definition, Applications, and Implementation Strategies

Last updated 03/15/2024 by

Fact checked by

Understanding the assignment method

Optimized resource utilization, enhanced production efficiency, maximized profitability, applications of the assignment method, workforce allocation, production planning, sales territory management, resource budgeting.

  • Optimizes resource utilization
  • Enhances production efficiency
  • Maximizes profitability
  • Requires thorough analysis of past performance and market conditions
  • Potential for misallocation of resources if not executed properly

Frequently asked questions

How does the assignment method differ from other resource allocation methods, what factors should organizations consider when implementing the assignment method, can the assignment method be applied to non-profit organizations or public sector agencies, what role does technology play in implementing the assignment method, are there any ethical considerations associated with the assignment method, key takeaways.

  • The assignment method optimizes resource allocation to enhance efficiency and profitability.
  • Applications include workforce allocation, production planning, sales territory management, and resource budgeting.
  • Effective implementation requires thorough analysis of past performance and market conditions.
  • Strategic allocation of resources can drive overall performance and revenue growth.

Show Article Sources

You might also like.

  • Privacy Policy

Research Method

Home » Data Collection – Methods Types and Examples

Data Collection – Methods Types and Examples

Table of Contents

Data collection

Data Collection

Definition:

Data collection is the process of gathering and collecting information from various sources to analyze and make informed decisions based on the data collected. This can involve various methods, such as surveys, interviews, experiments, and observation.

In order for data collection to be effective, it is important to have a clear understanding of what data is needed and what the purpose of the data collection is. This can involve identifying the population or sample being studied, determining the variables to be measured, and selecting appropriate methods for collecting and recording data.

Types of Data Collection

Types of Data Collection are as follows:

Primary Data Collection

Primary data collection is the process of gathering original and firsthand information directly from the source or target population. This type of data collection involves collecting data that has not been previously gathered, recorded, or published. Primary data can be collected through various methods such as surveys, interviews, observations, experiments, and focus groups. The data collected is usually specific to the research question or objective and can provide valuable insights that cannot be obtained from secondary data sources. Primary data collection is often used in market research, social research, and scientific research.

Secondary Data Collection

Secondary data collection is the process of gathering information from existing sources that have already been collected and analyzed by someone else, rather than conducting new research to collect primary data. Secondary data can be collected from various sources, such as published reports, books, journals, newspapers, websites, government publications, and other documents.

Qualitative Data Collection

Qualitative data collection is used to gather non-numerical data such as opinions, experiences, perceptions, and feelings, through techniques such as interviews, focus groups, observations, and document analysis. It seeks to understand the deeper meaning and context of a phenomenon or situation and is often used in social sciences, psychology, and humanities. Qualitative data collection methods allow for a more in-depth and holistic exploration of research questions and can provide rich and nuanced insights into human behavior and experiences.

Quantitative Data Collection

Quantitative data collection is a used to gather numerical data that can be analyzed using statistical methods. This data is typically collected through surveys, experiments, and other structured data collection methods. Quantitative data collection seeks to quantify and measure variables, such as behaviors, attitudes, and opinions, in a systematic and objective way. This data is often used to test hypotheses, identify patterns, and establish correlations between variables. Quantitative data collection methods allow for precise measurement and generalization of findings to a larger population. It is commonly used in fields such as economics, psychology, and natural sciences.

Data Collection Methods

Data Collection Methods are as follows:

Surveys involve asking questions to a sample of individuals or organizations to collect data. Surveys can be conducted in person, over the phone, or online.

Interviews involve a one-on-one conversation between the interviewer and the respondent. Interviews can be structured or unstructured and can be conducted in person or over the phone.

Focus Groups

Focus groups are group discussions that are moderated by a facilitator. Focus groups are used to collect qualitative data on a specific topic.

Observation

Observation involves watching and recording the behavior of people, objects, or events in their natural setting. Observation can be done overtly or covertly, depending on the research question.

Experiments

Experiments involve manipulating one or more variables and observing the effect on another variable. Experiments are commonly used in scientific research.

Case Studies

Case studies involve in-depth analysis of a single individual, organization, or event. Case studies are used to gain detailed information about a specific phenomenon.

Secondary Data Analysis

Secondary data analysis involves using existing data that was collected for another purpose. Secondary data can come from various sources, such as government agencies, academic institutions, or private companies.

How to Collect Data

The following are some steps to consider when collecting data:

  • Define the objective : Before you start collecting data, you need to define the objective of the study. This will help you determine what data you need to collect and how to collect it.
  • Identify the data sources : Identify the sources of data that will help you achieve your objective. These sources can be primary sources, such as surveys, interviews, and observations, or secondary sources, such as books, articles, and databases.
  • Determine the data collection method : Once you have identified the data sources, you need to determine the data collection method. This could be through online surveys, phone interviews, or face-to-face meetings.
  • Develop a data collection plan : Develop a plan that outlines the steps you will take to collect the data. This plan should include the timeline, the tools and equipment needed, and the personnel involved.
  • Test the data collection process: Before you start collecting data, test the data collection process to ensure that it is effective and efficient.
  • Collect the data: Collect the data according to the plan you developed in step 4. Make sure you record the data accurately and consistently.
  • Analyze the data: Once you have collected the data, analyze it to draw conclusions and make recommendations.
  • Report the findings: Report the findings of your data analysis to the relevant stakeholders. This could be in the form of a report, a presentation, or a publication.
  • Monitor and evaluate the data collection process: After the data collection process is complete, monitor and evaluate the process to identify areas for improvement in future data collection efforts.
  • Ensure data quality: Ensure that the collected data is of high quality and free from errors. This can be achieved by validating the data for accuracy, completeness, and consistency.
  • Maintain data security: Ensure that the collected data is secure and protected from unauthorized access or disclosure. This can be achieved by implementing data security protocols and using secure storage and transmission methods.
  • Follow ethical considerations: Follow ethical considerations when collecting data, such as obtaining informed consent from participants, protecting their privacy and confidentiality, and ensuring that the research does not cause harm to participants.
  • Use appropriate data analysis methods : Use appropriate data analysis methods based on the type of data collected and the research objectives. This could include statistical analysis, qualitative analysis, or a combination of both.
  • Record and store data properly: Record and store the collected data properly, in a structured and organized format. This will make it easier to retrieve and use the data in future research or analysis.
  • Collaborate with other stakeholders : Collaborate with other stakeholders, such as colleagues, experts, or community members, to ensure that the data collected is relevant and useful for the intended purpose.

Applications of Data Collection

Data collection methods are widely used in different fields, including social sciences, healthcare, business, education, and more. Here are some examples of how data collection methods are used in different fields:

  • Social sciences : Social scientists often use surveys, questionnaires, and interviews to collect data from individuals or groups. They may also use observation to collect data on social behaviors and interactions. This data is often used to study topics such as human behavior, attitudes, and beliefs.
  • Healthcare : Data collection methods are used in healthcare to monitor patient health and track treatment outcomes. Electronic health records and medical charts are commonly used to collect data on patients’ medical history, diagnoses, and treatments. Researchers may also use clinical trials and surveys to collect data on the effectiveness of different treatments.
  • Business : Businesses use data collection methods to gather information on consumer behavior, market trends, and competitor activity. They may collect data through customer surveys, sales reports, and market research studies. This data is used to inform business decisions, develop marketing strategies, and improve products and services.
  • Education : In education, data collection methods are used to assess student performance and measure the effectiveness of teaching methods. Standardized tests, quizzes, and exams are commonly used to collect data on student learning outcomes. Teachers may also use classroom observation and student feedback to gather data on teaching effectiveness.
  • Agriculture : Farmers use data collection methods to monitor crop growth and health. Sensors and remote sensing technology can be used to collect data on soil moisture, temperature, and nutrient levels. This data is used to optimize crop yields and minimize waste.
  • Environmental sciences : Environmental scientists use data collection methods to monitor air and water quality, track climate patterns, and measure the impact of human activity on the environment. They may use sensors, satellite imagery, and laboratory analysis to collect data on environmental factors.
  • Transportation : Transportation companies use data collection methods to track vehicle performance, optimize routes, and improve safety. GPS systems, on-board sensors, and other tracking technologies are used to collect data on vehicle speed, fuel consumption, and driver behavior.

Examples of Data Collection

Examples of Data Collection are as follows:

  • Traffic Monitoring: Cities collect real-time data on traffic patterns and congestion through sensors on roads and cameras at intersections. This information can be used to optimize traffic flow and improve safety.
  • Social Media Monitoring : Companies can collect real-time data on social media platforms such as Twitter and Facebook to monitor their brand reputation, track customer sentiment, and respond to customer inquiries and complaints in real-time.
  • Weather Monitoring: Weather agencies collect real-time data on temperature, humidity, air pressure, and precipitation through weather stations and satellites. This information is used to provide accurate weather forecasts and warnings.
  • Stock Market Monitoring : Financial institutions collect real-time data on stock prices, trading volumes, and other market indicators to make informed investment decisions and respond to market fluctuations in real-time.
  • Health Monitoring : Medical devices such as wearable fitness trackers and smartwatches can collect real-time data on a person’s heart rate, blood pressure, and other vital signs. This information can be used to monitor health conditions and detect early warning signs of health issues.

Purpose of Data Collection

The purpose of data collection can vary depending on the context and goals of the study, but generally, it serves to:

  • Provide information: Data collection provides information about a particular phenomenon or behavior that can be used to better understand it.
  • Measure progress : Data collection can be used to measure the effectiveness of interventions or programs designed to address a particular issue or problem.
  • Support decision-making : Data collection provides decision-makers with evidence-based information that can be used to inform policies, strategies, and actions.
  • Identify trends : Data collection can help identify trends and patterns over time that may indicate changes in behaviors or outcomes.
  • Monitor and evaluate : Data collection can be used to monitor and evaluate the implementation and impact of policies, programs, and initiatives.

When to use Data Collection

Data collection is used when there is a need to gather information or data on a specific topic or phenomenon. It is typically used in research, evaluation, and monitoring and is important for making informed decisions and improving outcomes.

Data collection is particularly useful in the following scenarios:

  • Research : When conducting research, data collection is used to gather information on variables of interest to answer research questions and test hypotheses.
  • Evaluation : Data collection is used in program evaluation to assess the effectiveness of programs or interventions, and to identify areas for improvement.
  • Monitoring : Data collection is used in monitoring to track progress towards achieving goals or targets, and to identify any areas that require attention.
  • Decision-making: Data collection is used to provide decision-makers with information that can be used to inform policies, strategies, and actions.
  • Quality improvement : Data collection is used in quality improvement efforts to identify areas where improvements can be made and to measure progress towards achieving goals.

Characteristics of Data Collection

Data collection can be characterized by several important characteristics that help to ensure the quality and accuracy of the data gathered. These characteristics include:

  • Validity : Validity refers to the accuracy and relevance of the data collected in relation to the research question or objective.
  • Reliability : Reliability refers to the consistency and stability of the data collection process, ensuring that the results obtained are consistent over time and across different contexts.
  • Objectivity : Objectivity refers to the impartiality of the data collection process, ensuring that the data collected is not influenced by the biases or personal opinions of the data collector.
  • Precision : Precision refers to the degree of accuracy and detail in the data collected, ensuring that the data is specific and accurate enough to answer the research question or objective.
  • Timeliness : Timeliness refers to the efficiency and speed with which the data is collected, ensuring that the data is collected in a timely manner to meet the needs of the research or evaluation.
  • Ethical considerations : Ethical considerations refer to the ethical principles that must be followed when collecting data, such as ensuring confidentiality and obtaining informed consent from participants.

Advantages of Data Collection

There are several advantages of data collection that make it an important process in research, evaluation, and monitoring. These advantages include:

  • Better decision-making : Data collection provides decision-makers with evidence-based information that can be used to inform policies, strategies, and actions, leading to better decision-making.
  • Improved understanding: Data collection helps to improve our understanding of a particular phenomenon or behavior by providing empirical evidence that can be analyzed and interpreted.
  • Evaluation of interventions: Data collection is essential in evaluating the effectiveness of interventions or programs designed to address a particular issue or problem.
  • Identifying trends and patterns: Data collection can help identify trends and patterns over time that may indicate changes in behaviors or outcomes.
  • Increased accountability: Data collection increases accountability by providing evidence that can be used to monitor and evaluate the implementation and impact of policies, programs, and initiatives.
  • Validation of theories: Data collection can be used to test hypotheses and validate theories, leading to a better understanding of the phenomenon being studied.
  • Improved quality: Data collection is used in quality improvement efforts to identify areas where improvements can be made and to measure progress towards achieving goals.

Limitations of Data Collection

While data collection has several advantages, it also has some limitations that must be considered. These limitations include:

  • Bias : Data collection can be influenced by the biases and personal opinions of the data collector, which can lead to inaccurate or misleading results.
  • Sampling bias : Data collection may not be representative of the entire population, resulting in sampling bias and inaccurate results.
  • Cost : Data collection can be expensive and time-consuming, particularly for large-scale studies.
  • Limited scope: Data collection is limited to the variables being measured, which may not capture the entire picture or context of the phenomenon being studied.
  • Ethical considerations : Data collection must follow ethical principles to protect the rights and confidentiality of the participants, which can limit the type of data that can be collected.
  • Data quality issues: Data collection may result in data quality issues such as missing or incomplete data, measurement errors, and inconsistencies.
  • Limited generalizability : Data collection may not be generalizable to other contexts or populations, limiting the generalizability of the findings.

About the author

' src=

Muhammad Hassan

Researcher, Academic Writer, Web developer

You may also like

Research Techniques

Research Techniques – Methods, Types and Examples

Appendix in Research Paper

Appendix in Research Paper – Examples and...

Conceptual Framework

Conceptual Framework – Types, Methodology and...

Limitations in Research

Limitations in Research – Types, Examples and...

Data Interpretation

Data Interpretation – Process, Methods and...

Research Methodology

Research Methodology – Types, Examples and...

Have a language expert improve your writing

Run a free plagiarism check in 10 minutes, generate accurate citations for free.

  • Knowledge Base

Methodology

Research Methods | Definitions, Types, Examples

Research methods are specific procedures for collecting and analyzing data. Developing your research methods is an integral part of your research design . When planning your methods, there are two key decisions you will make.

First, decide how you will collect data . Your methods depend on what type of data you need to answer your research question :

  • Qualitative vs. quantitative : Will your data take the form of words or numbers?
  • Primary vs. secondary : Will you collect original data yourself, or will you use data that has already been collected by someone else?
  • Descriptive vs. experimental : Will you take measurements of something as it is, or will you perform an experiment?

Second, decide how you will analyze the data .

  • For quantitative data, you can use statistical analysis methods to test relationships between variables.
  • For qualitative data, you can use methods such as thematic analysis to interpret patterns and meanings in the data.

Table of contents

Methods for collecting data, examples of data collection methods, methods for analyzing data, examples of data analysis methods, other interesting articles, frequently asked questions about research methods.

Data is the information that you collect for the purposes of answering your research question . The type of data you need depends on the aims of your research.

Qualitative vs. quantitative data

Your choice of qualitative or quantitative data collection depends on the type of knowledge you want to develop.

For questions about ideas, experiences and meanings, or to study something that can’t be described numerically, collect qualitative data .

If you want to develop a more mechanistic understanding of a topic, or your research involves hypothesis testing , collect quantitative data .

Qualitative to broader populations. .
Quantitative .

You can also take a mixed methods approach , where you use both qualitative and quantitative research methods.

Primary vs. secondary research

Primary research is any original data that you collect yourself for the purposes of answering your research question (e.g. through surveys , observations and experiments ). Secondary research is data that has already been collected by other researchers (e.g. in a government census or previous scientific studies).

If you are exploring a novel research question, you’ll probably need to collect primary data . But if you want to synthesize existing knowledge, analyze historical trends, or identify patterns on a large scale, secondary data might be a better choice.

Primary . methods.
Secondary

Descriptive vs. experimental data

In descriptive research , you collect data about your study subject without intervening. The validity of your research will depend on your sampling method .

In experimental research , you systematically intervene in a process and measure the outcome. The validity of your research will depend on your experimental design .

To conduct an experiment, you need to be able to vary your independent variable , precisely measure your dependent variable, and control for confounding variables . If it’s practically and ethically possible, this method is the best choice for answering questions about cause and effect.

Descriptive . .
Experimental

Here's why students love Scribbr's proofreading services

Discover proofreading & editing

Research methods for collecting data
Research method Primary or secondary? Qualitative or quantitative? When to use
Primary Quantitative To test cause-and-effect relationships.
Primary Quantitative To understand general characteristics of a population.
Interview/focus group Primary Qualitative To gain more in-depth understanding of a topic.
Observation Primary Either To understand how something occurs in its natural setting.
Secondary Either To situate your research in an existing body of work, or to evaluate trends within a research topic.
Either Either To gain an in-depth understanding of a specific group or context, or when you don’t have the resources for a large study.

Your data analysis methods will depend on the type of data you collect and how you prepare it for analysis.

Data can often be analyzed both quantitatively and qualitatively. For example, survey responses could be analyzed qualitatively by studying the meanings of responses or quantitatively by studying the frequencies of responses.

Qualitative analysis methods

Qualitative analysis is used to understand words, ideas, and experiences. You can use it to interpret data that was collected:

  • From open-ended surveys and interviews , literature reviews , case studies , ethnographies , and other sources that use text rather than numbers.
  • Using non-probability sampling methods .

Qualitative analysis tends to be quite flexible and relies on the researcher’s judgement, so you have to reflect carefully on your choices and assumptions and be careful to avoid research bias .

Quantitative analysis methods

Quantitative analysis uses numbers and statistics to understand frequencies, averages and correlations (in descriptive studies) or cause-and-effect relationships (in experiments).

You can use quantitative analysis to interpret data that was collected either:

  • During an experiment .
  • Using probability sampling methods .

Because the data is collected and analyzed in a statistically valid way, the results of quantitative analysis can be easily standardized and shared among researchers.

Research methods for analyzing data
Research method Qualitative or quantitative? When to use
Quantitative To analyze data collected in a statistically valid manner (e.g. from experiments, surveys, and observations).
Meta-analysis Quantitative To statistically analyze the results of a large collection of studies.

Can only be applied to studies that collected data in a statistically valid manner.

Qualitative To analyze data collected from interviews, , or textual sources.

To understand general themes in the data and how they are communicated.

Either To analyze large volumes of textual or visual data collected from surveys, literature reviews, or other sources.

Can be quantitative (i.e. frequencies of words) or qualitative (i.e. meanings of words).

Prevent plagiarism. Run a free check.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Chi square test of independence
  • Statistical power
  • Descriptive statistics
  • Degrees of freedom
  • Pearson correlation
  • Null hypothesis
  • Double-blind study
  • Case-control study
  • Research ethics
  • Data collection
  • Hypothesis testing
  • Structured interviews

Research bias

  • Hawthorne effect
  • Unconscious bias
  • Recall bias
  • Halo effect
  • Self-serving bias
  • Information bias

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to systematically measure variables and test hypotheses . Qualitative methods allow you to explore concepts and experiences in more detail.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .

A sample is a subset of individuals from a larger population . Sampling means selecting the group that you will actually collect data from in your research. For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students.

In statistics, sampling allows you to test a hypothesis about the characteristics of a population.

The research methods you use depend on the type of data you need to answer your research question .

  • If you want to measure something or test a hypothesis , use quantitative methods . If you want to explore ideas, thoughts and meanings, use qualitative methods .
  • If you want to analyze a large amount of readily-available data, use secondary data. If you want data specific to your purposes with control over how it is generated, collect primary data.
  • If you want to establish cause-and-effect relationships between variables , use experimental methods. If you want to understand the characteristics of a research subject, use descriptive methods.

Methodology refers to the overarching strategy and rationale of your research project . It involves studying the methods used in your field and the theories or principles behind them, in order to develop an approach that matches your objectives.

Methods are the specific tools and procedures you use to collect and analyze data (for example, experiments, surveys , and statistical tests ).

In shorter scientific papers, where the aim is to report the findings of a specific study, you might simply describe what you did in a methods section .

In a longer or more complex research project, such as a thesis or dissertation , you will probably include a methodology section , where you explain your approach to answering the research questions and cite relevant sources to support your choice of methods.

Is this article helpful?

Other students also liked, writing strong research questions | criteria & examples.

  • What Is a Research Design | Types, Guide & Examples
  • Data Collection | Definition, Methods & Examples

More interesting articles

  • Between-Subjects Design | Examples, Pros, & Cons
  • Cluster Sampling | A Simple Step-by-Step Guide with Examples
  • Confounding Variables | Definition, Examples & Controls
  • Construct Validity | Definition, Types, & Examples
  • Content Analysis | Guide, Methods & Examples
  • Control Groups and Treatment Groups | Uses & Examples
  • Control Variables | What Are They & Why Do They Matter?
  • Correlation vs. Causation | Difference, Designs & Examples
  • Correlational Research | When & How to Use
  • Critical Discourse Analysis | Definition, Guide & Examples
  • Cross-Sectional Study | Definition, Uses & Examples
  • Descriptive Research | Definition, Types, Methods & Examples
  • Ethical Considerations in Research | Types & Examples
  • Explanatory and Response Variables | Definitions & Examples
  • Explanatory Research | Definition, Guide, & Examples
  • Exploratory Research | Definition, Guide, & Examples
  • External Validity | Definition, Types, Threats & Examples
  • Extraneous Variables | Examples, Types & Controls
  • Guide to Experimental Design | Overview, Steps, & Examples
  • How Do You Incorporate an Interview into a Dissertation? | Tips
  • How to Do Thematic Analysis | Step-by-Step Guide & Examples
  • How to Write a Literature Review | Guide, Examples, & Templates
  • How to Write a Strong Hypothesis | Steps & Examples
  • Inclusion and Exclusion Criteria | Examples & Definition
  • Independent vs. Dependent Variables | Definition & Examples
  • Inductive Reasoning | Types, Examples, Explanation
  • Inductive vs. Deductive Research Approach | Steps & Examples
  • Internal Validity in Research | Definition, Threats, & Examples
  • Internal vs. External Validity | Understanding Differences & Threats
  • Longitudinal Study | Definition, Approaches & Examples
  • Mediator vs. Moderator Variables | Differences & Examples
  • Mixed Methods Research | Definition, Guide & Examples
  • Multistage Sampling | Introductory Guide & Examples
  • Naturalistic Observation | Definition, Guide & Examples
  • Operationalization | A Guide with Examples, Pros & Cons
  • Population vs. Sample | Definitions, Differences & Examples
  • Primary Research | Definition, Types, & Examples
  • Qualitative vs. Quantitative Research | Differences, Examples & Methods
  • Quasi-Experimental Design | Definition, Types & Examples
  • Questionnaire Design | Methods, Question Types & Examples
  • Random Assignment in Experiments | Introduction & Examples
  • Random vs. Systematic Error | Definition & Examples
  • Reliability vs. Validity in Research | Difference, Types and Examples
  • Reproducibility vs Replicability | Difference & Examples
  • Reproducibility vs. Replicability | Difference & Examples
  • Sampling Methods | Types, Techniques & Examples
  • Semi-Structured Interview | Definition, Guide & Examples
  • Simple Random Sampling | Definition, Steps & Examples
  • Single, Double, & Triple Blind Study | Definition & Examples
  • Stratified Sampling | Definition, Guide & Examples
  • Structured Interview | Definition, Guide & Examples
  • Survey Research | Definition, Examples & Methods
  • Systematic Review | Definition, Example, & Guide
  • Systematic Sampling | A Step-by-Step Guide with Examples
  • Textual Analysis | Guide, 3 Approaches & Examples
  • The 4 Types of Reliability in Research | Definitions & Examples
  • The 4 Types of Validity in Research | Definitions & Examples
  • Transcribing an Interview | 5 Steps & Transcription Software
  • Triangulation in Research | Guide, Types, Examples
  • Types of Interviews in Research | Guide & Examples
  • Types of Research Designs Compared | Guide & Examples
  • Types of Variables in Research & Statistics | Examples
  • Unstructured Interview | Definition, Guide & Examples
  • What Is a Case Study? | Definition, Examples & Methods
  • What Is a Case-Control Study? | Definition & Examples
  • What Is a Cohort Study? | Definition & Examples
  • What Is a Conceptual Framework? | Tips & Examples
  • What Is a Controlled Experiment? | Definitions & Examples
  • What Is a Double-Barreled Question?
  • What Is a Focus Group? | Step-by-Step Guide & Examples
  • What Is a Likert Scale? | Guide & Examples
  • What Is a Prospective Cohort Study? | Definition & Examples
  • What Is a Retrospective Cohort Study? | Definition & Examples
  • What Is Action Research? | Definition & Examples
  • What Is an Observational Study? | Guide & Examples
  • What Is Concurrent Validity? | Definition & Examples
  • What Is Content Validity? | Definition & Examples
  • What Is Convenience Sampling? | Definition & Examples
  • What Is Convergent Validity? | Definition & Examples
  • What Is Criterion Validity? | Definition & Examples
  • What Is Data Cleansing? | Definition, Guide & Examples
  • What Is Deductive Reasoning? | Explanation & Examples
  • What Is Discriminant Validity? | Definition & Example
  • What Is Ecological Validity? | Definition & Examples
  • What Is Ethnography? | Definition, Guide & Examples
  • What Is Face Validity? | Guide, Definition & Examples
  • What Is Non-Probability Sampling? | Types & Examples
  • What Is Participant Observation? | Definition & Examples
  • What Is Peer Review? | Types & Examples
  • What Is Predictive Validity? | Examples & Definition
  • What Is Probability Sampling? | Types & Examples
  • What Is Purposive Sampling? | Definition & Examples
  • What Is Qualitative Observation? | Definition & Examples
  • What Is Qualitative Research? | Methods & Examples
  • What Is Quantitative Observation? | Definition & Examples
  • What Is Quantitative Research? | Definition, Uses & Methods

What is your plagiarism score?

data assignment method

Quantitative Data Analysis 101

The lingo, methods and techniques, explained simply.

By: Derek Jansen (MBA)  and Kerryn Warren (PhD) | December 2020

Quantitative data analysis is one of those things that often strikes fear in students. It’s totally understandable – quantitative analysis is a complex topic, full of daunting lingo , like medians, modes, correlation and regression. Suddenly we’re all wishing we’d paid a little more attention in math class…

The good news is that while quantitative data analysis is a mammoth topic, gaining a working understanding of the basics isn’t that hard , even for those of us who avoid numbers and math . In this post, we’ll break quantitative analysis down into simple , bite-sized chunks so you can approach your research with confidence.

Quantitative data analysis methods and techniques 101

Overview: Quantitative Data Analysis 101

  • What (exactly) is quantitative data analysis?
  • When to use quantitative analysis
  • How quantitative analysis works
  • The two “branches” of quantitative analysis
  • Descriptive statistics 101
  • Inferential statistics 101
  • How to choose the right quantitative methods
  • Recap & summary

What is quantitative data analysis?

Despite being a mouthful, quantitative data analysis simply means analysing data that is numbers-based – or data that can be easily “converted” into numbers without losing any meaning.

For example, category-based variables like gender, ethnicity, or native language could all be “converted” into numbers without losing meaning – for example, English could equal 1, French 2, etc.
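To make that concrete, here is a minimal pandas sketch of coding a category-based variable as numbers; the column name and categories are invented for illustration.

```python
# A minimal sketch (column name and categories invented) of coding a
# category-based variable as numbers with pandas.
import pandas as pd

df = pd.DataFrame({"native_language": ["English", "French", "English", "Spanish"]})

codes, categories = pd.factorize(df["native_language"])  # integer code per category
df["language_code"] = codes

print(df)
print(list(categories))  # which category each integer code stands for
```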

This contrasts against qualitative data analysis, where the focus is on words, phrases and expressions that can’t be reduced to numbers. If you’re interested in learning about qualitative analysis, check out our post and video here .

What is quantitative analysis used for?

Quantitative analysis is generally used for three purposes.

  • Firstly, it’s used to measure differences between groups . For example, the popularity of different clothing colours or brands.
  • Secondly, it’s used to assess relationships between variables . For example, the relationship between weather temperature and voter turnout.
  • And third, it’s used to test hypotheses in a scientifically rigorous way. For example, a hypothesis about the impact of a certain vaccine.

Again, this contrasts with qualitative analysis , which can be used to analyse people’s perceptions and feelings about an event or situation. In other words, things that can’t be reduced to numbers.

How does quantitative analysis work?

Well, since quantitative data analysis is all about analysing numbers , it’s no surprise that it involves statistics . Statistical analysis methods form the engine that powers quantitative analysis, and these methods can vary from pretty basic calculations (for example, averages and medians) to more sophisticated analyses (for example, correlations and regressions).

Sounds like gibberish? Don’t worry. We’ll explain all of that in this post. Importantly, you don’t need to be a statistician or math wiz to pull off a good quantitative analysis. We’ll break down all the technical mumbo jumbo in this post.


As I mentioned, quantitative analysis is powered by statistical analysis methods . There are two main “branches” of statistical methods that are used – descriptive statistics and inferential statistics . In your research, you might only use descriptive statistics, or you might use a mix of both , depending on what you’re trying to figure out. In other words, depending on your research questions, aims and objectives . I’ll explain how to choose your methods later.

So, what are descriptive and inferential statistics?

Well, before I can explain that, we need to take a quick detour to explain some lingo. To understand the difference between these two branches of statistics, you need to understand two important words. These words are population and sample .

First up, population . In statistics, the population is the entire group of people (or animals or organisations or whatever) that you’re interested in researching. For example, if you were interested in researching Tesla owners in the US, then the population would be all Tesla owners in the US.

However, it’s extremely unlikely that you’re going to be able to interview or survey every single Tesla owner in the US. Realistically, you’ll likely only get access to a few hundred, or maybe a few thousand owners using an online survey. This smaller group of accessible people whose data you actually collect is called your sample .

So, to recap – the population is the entire group of people you’re interested in, and the sample is the subset of the population that you can actually get access to. In other words, the population is the full chocolate cake , whereas the sample is a slice of that cake.

So, why is this sample-population thing important?

Well, descriptive statistics focus on describing the sample , while inferential statistics aim to make predictions about the population, based on the findings within the sample. In other words, we use one group of statistical methods – descriptive statistics – to investigate the slice of cake, and another group of methods – inferential statistics – to draw conclusions about the entire cake. There I go with the cake analogy again…

With that out the way, let’s take a closer look at each of these branches in more detail.

Descriptive statistics vs inferential statistics

Branch 1: Descriptive Statistics

Descriptive statistics serve a simple but critically important role in your research – to describe your data set – hence the name. In other words, they help you understand the details of your sample . Unlike inferential statistics (which we’ll get to soon), descriptive statistics don’t aim to make inferences or predictions about the entire population – they’re purely interested in the details of your specific sample .

When you’re writing up your analysis, descriptive statistics are the first set of stats you’ll cover, before moving on to inferential statistics. But, that said, depending on your research objectives and research questions , they may be the only type of statistics you use. We’ll explore that a little later.

So, what kind of statistics are usually covered in this section?

Some common statistics covered in this branch include the following:

  • Mean – this is simply the mathematical average of a range of numbers.
  • Median – this is the midpoint in a range of numbers when the numbers are arranged in numerical order. If the data set contains an odd number of values, the median is the value right in the middle of the set. If it contains an even number of values, the median is the midpoint between the two middle values.
  • Mode – this is simply the most commonly occurring number in the data set.
  • Standard deviation – this measures how spread out the numbers are around the mean. In cases where most of the numbers are quite close to the average, the standard deviation will be relatively low. Conversely, in cases where the numbers are scattered all over the place, the standard deviation will be relatively high.
  • Skewness . As the name suggests, skewness indicates how symmetrical a range of numbers is. In other words, do they tend to cluster into a smooth bell curve shape in the middle of the graph, or do they skew to the left or right?

Feeling a bit confused? Let’s look at a practical example using a small data set.

Descriptive statistics example data

On the left-hand side is the data set. This details the bodyweight of a sample of 10 people. On the right-hand side, we have the descriptive statistics. Let’s take a look at each of them.

First, we can see that the mean weight is 72.4 kilograms. In other words, the average weight across the sample is 72.4 kilograms. Straightforward.

Next, we can see that the median is very similar to the mean (the average). This suggests that this data set has a reasonably symmetrical distribution (in other words, a relatively smooth, centred distribution of weights, clustered towards the centre).

In terms of the mode , there is no mode in this data set. This is because each number is present only once and so there cannot be a “most common number”. If there were two people who were both 65 kilograms, for example, then the mode would be 65.

Next up is the standard deviation. A value of 10.6 indicates that there’s quite a wide spread of numbers. We can see this quite easily by looking at the numbers themselves, which range from 55 to 90 – quite a stretch from the mean of 72.4.

And lastly, the skewness of -0.2 tells us that the data is very slightly negatively skewed. This makes sense since the mean and the median are slightly different.

As you can see, these descriptive statistics give us some useful insight into the data set. Of course, this is a very small data set (only 10 records), so we can’t read into these statistics too much. Also, keep in mind that this is not a list of all possible descriptive statistics – just the most common ones.
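If you wanted to reproduce this kind of summary yourself, a minimal pandas sketch might look like the following; the ten weights below are invented for illustration and are not the article’s data set.

```python
# A minimal pandas sketch of the descriptive statistics above. The ten weights
# are invented for illustration; they are not the article's data set.
import pandas as pd

weights = pd.Series([55, 61, 66, 68, 71, 73, 76, 79, 85, 90])  # kilograms

print("mean:  ", weights.mean())
print("median:", weights.median())
print("mode:  ", weights.mode().tolist())  # with no repeated value, every value is returned
print("std:   ", weights.std())            # sample standard deviation
print("skew:  ", weights.skew())
```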

But why do all of these numbers matter?

While these descriptive statistics are all fairly basic, they’re important for a few reasons:

  • Firstly, they help you get both a macro and micro-level view of your data. In other words, they help you understand both the big picture and the finer details.
  • Secondly, they help you spot potential errors in the data – for example, if an average is way higher than you’d expect, or responses to a question are highly varied, this can act as a warning sign that you need to double-check the data.
  • And lastly, these descriptive statistics help inform which inferential statistical techniques you can use, as those techniques depend on the skewness (in other words, the symmetry and normality) of the data.

Simply put, descriptive statistics are really important , even though the statistical techniques used are fairly basic. All too often at Grad Coach, we see students skimming over the descriptives in their eagerness to get to the more exciting inferential methods, and then landing up with some very flawed results.

Don’t be a sucker – give your descriptive statistics the love and attention they deserve!

Examples of descriptive statistics

Branch 2: Inferential Statistics

As I mentioned, while descriptive statistics are all about the details of your specific data set – your sample – inferential statistics aim to make inferences about the population . In other words, you’ll use inferential statistics to make predictions about what you’d expect to find in the full population.

What kind of predictions, you ask? Well, there are two common types of predictions that researchers try to make using inferential stats:

  • Firstly, predictions about differences between groups – for example, height differences between children grouped by their favourite meal or gender.
  • And secondly, relationships between variables – for example, the relationship between body weight and the number of hours a week a person does yoga.

In other words, inferential statistics (when done correctly), allow you to connect the dots and make predictions about what you expect to see in the real world population, based on what you observe in your sample data. For this reason, inferential statistics are used for hypothesis testing – in other words, to test hypotheses that predict changes or differences.

Inferential statistics are used to make predictions about what you’d expect to find in the full population, based on the sample.

Of course, when you’re working with inferential statistics, the composition of your sample is really important. In other words, if your sample doesn’t accurately represent the population you’re researching, then your findings won’t necessarily be very useful.

For example, if your population of interest is a mix of 50% male and 50% female , but your sample is 80% male , you can’t make inferences about the population based on your sample, since it’s not representative. This area of statistics is called sampling, but we won’t go down that rabbit hole here (it’s a deep one!) – we’ll save that for another post .

What statistics are usually used in this branch?

There are many, many different statistical analysis methods within the inferential branch and it’d be impossible for us to discuss them all here. So we’ll just take a look at some of the most common inferential statistical methods so that you have a solid starting point.

First up are T-Tests. T-tests compare the means (the averages) of two groups of data to assess whether they’re statistically significantly different. In other words, is the difference between the two group means larger than you’d expect from random variation alone?

This type of testing is very useful for understanding just how similar or different two groups of data are. For example, you might want to compare the mean blood pressure between two groups of people – one that has taken a new medication and one that hasn’t – to assess whether they are significantly different.
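As a rough illustration of that blood-pressure comparison, here is a short sketch using scipy.stats.ttest_ind; the measurements are invented.

```python
# A rough sketch of the blood-pressure comparison described above; the
# measurements are invented for illustration.
from scipy import stats

medicated = [118, 121, 115, 119, 122, 117, 116, 120]
control = [128, 131, 126, 133, 127, 130, 129, 132]

t_stat, p_value = stats.ttest_ind(medicated, control)
print(t_stat, p_value)  # a small p-value suggests the group means differ significantly
```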

Kicking things up a level, we have ANOVA, which stands for “analysis of variance”. This test is similar to a T-test in that it compares the means of various groups, but ANOVA allows you to analyse multiple groups, not just two. So it’s basically a t-test on steroids…
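A minimal one-way ANOVA sketch, assuming three invented groups and using scipy.stats.f_oneway, might look like this:

```python
# A minimal one-way ANOVA sketch with three invented groups.
from scipy import stats

group_a = [23, 25, 27, 24, 26]
group_b = [30, 29, 31, 32, 28]
group_c = [22, 24, 23, 25, 21]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)  # a small p-value suggests at least one group mean differs
```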

Next, we have correlation analysis. This type of analysis assesses the relationship between two variables. In other words, if one variable increases, does the other variable also increase, decrease or stay the same? For example, if the average temperature goes up, do average ice cream sales increase too? We’d expect some sort of relationship between these two variables intuitively, but correlation analysis allows us to measure that relationship scientifically.
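Here is a hedged sketch of that temperature and ice cream example using a Pearson correlation from scipy.stats; the data points are made up.

```python
# A sketch of the temperature and ice cream example using a Pearson correlation;
# the data points are made up.
from scipy import stats

temperature = [18, 21, 24, 27, 30, 33]          # degrees Celsius
ice_cream_sales = [120, 135, 160, 180, 210, 240]

r, p_value = stats.pearsonr(temperature, ice_cream_sales)
print(r, p_value)  # r close to +1 indicates a strong positive relationship
```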

Lastly, we have regression analysis – this is quite similar to correlation in that it assesses the relationship between variables, but it goes a step further: it models how one variable predicts another, which helps you probe cause and effect rather than just whether the variables move together. In other words, does the one variable actually cause the other one to move, or do they just happen to move together thanks to another force? Just because two variables correlate doesn’t necessarily mean that one causes the other.
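And a matching simple linear regression sketch, again with invented data, using scipy.stats.linregress (remember that a good fit alone still doesn’t prove causation):

```python
# A simple linear regression sketch on the same invented data, fitting ice cream
# sales as a function of temperature. A good fit alone does not prove causation.
from scipy import stats

temperature = [18, 21, 24, 27, 30, 33]
ice_cream_sales = [120, 135, 160, 180, 210, 240]

result = stats.linregress(temperature, ice_cream_sales)
print(result.slope, result.intercept)  # fitted line: sales = slope * temperature + intercept
print(result.rvalue ** 2)              # R-squared: share of variance explained by the line
```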

Stats overload…

I hear you. To make this all a little more tangible, let’s take a look at an example of a correlation in action.

Here’s a scatter plot demonstrating the correlation (relationship) between weight and height. Intuitively, we’d expect there to be some relationship between these two variables, which is what we see in this scatter plot. In other words, the results tend to cluster together in a diagonal line from bottom left to top right.

Sample correlation

As I mentioned, these are just a handful of inferential techniques – there are many, many more. Importantly, each statistical method has its own assumptions and limitations.

For example, some methods (parametric methods) only work with normally distributed data, while others (non-parametric methods) are designed for data that doesn’t follow a normal distribution. And that’s exactly why descriptive statistics are so important – they’re the first step to knowing which inferential techniques you can and can’t use.

Remember that every statistical method has its own assumptions and limitations,  so you need to be aware of these.

How to choose the right analysis method

To choose the right statistical methods, you need to think about two important factors :

  • The type of quantitative data you have (specifically, the level of measurement and the shape of the data)
  • Your research questions and hypotheses

Let’s take a closer look at each of these.

Factor 1 – Data type

The first thing you need to consider is the type of data you’ve collected (or the type of data you will collect). By data types, I’m referring to the four levels of measurement – namely, nominal, ordinal, interval and ratio. If you’re not familiar with this lingo, check out the video below.

Why does this matter?

Well, because different statistical methods and techniques require different types of data. This is one of the “assumptions” I mentioned earlier – every method has its assumptions regarding the type of data.

For example, some techniques work with categorical data (for example, yes/no type questions, or gender or ethnicity), while others work with continuous numerical data (for example, age, weight or income) – and, of course, some work with multiple data types.

If you try to use a statistical method that doesn’t support the data type you have, your results will be largely meaningless. So, make sure that you have a clear understanding of what types of data you’ve collected (or will collect), and then check which statistical methods support those data types.

If you haven’t collected your data yet, you can work in reverse and look at which statistical method would give you the most useful insights, and then design your data collection strategy to collect the correct data types.

Another important factor to consider is the shape of your data . Specifically, does it have a normal distribution (in other words, is it a bell-shaped curve, centred in the middle) or is it very skewed to the left or the right? Again, different statistical techniques work for different shapes of data – some are designed for symmetrical data while others are designed for skewed data.

This is another reminder of why descriptive statistics are so important – they tell you all about the shape of your data.
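As a rough sketch of checking the shape of a variable in practice, you might compute its skewness and run a normality test; the sample values below are invented.

```python
# A rough sketch of checking the "shape" of a variable: its skewness plus a
# Shapiro-Wilk normality test. The sample values are invented.
import numpy as np
from scipy import stats

values = np.array([55, 61, 66, 68, 71, 73, 76, 79, 85, 90])

print("skewness:", stats.skew(values))
w_stat, p_value = stats.shapiro(values)
print("Shapiro-Wilk p-value:", p_value)  # a large p-value gives no evidence against normality
```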

Factor 2: Your research questions

The next thing you need to consider is your specific research questions, as well as your hypotheses (if you have some). The nature of your research questions and research hypotheses will heavily influence which statistical methods and techniques you should use.

If you’re just interested in understanding the attributes of your sample (as opposed to the entire population), then descriptive statistics are probably all you need. For example, if you just want to assess the means (averages) and medians (centre points) of variables in a group of people.

On the other hand, if you aim to understand differences between groups or relationships between variables and to infer or predict outcomes in the population, then you’ll likely need both descriptive statistics and inferential statistics.

So, it’s really important to get very clear about your research aims and research questions, as well your hypotheses – before you start looking at which statistical techniques to use.

Never shoehorn a specific statistical technique into your research just because you like it or have some experience with it. Your choice of methods must align with all the factors we’ve covered here.

Time to recap…

You’re still with me? That’s impressive. We’ve covered a lot of ground here, so let’s recap on the key points:

  • Quantitative data analysis is all about  analysing number-based data  (which includes categorical and numerical data) using various statistical techniques.
  • The two main  branches  of statistics are  descriptive statistics  and  inferential statistics . Descriptives describe your sample, whereas inferentials make predictions about what you’ll find in the population.
  • Common  descriptive statistical methods include  mean  (average),  median , standard  deviation  and  skewness .
  • Common  inferential statistical methods include  t-tests ,  ANOVA ,  correlation  and  regression  analysis.
  • To choose the right statistical methods and techniques, you need to consider the  type of data you’re working with , as well as your  research questions  and hypotheses.


Python Numerical Methods


This notebook contains an excerpt from the Python Programming and Numerical Methods - A Guide for Engineers and Scientists , the content is also available at Berkeley Python Numerical Methods .



Variables and Assignment ¶

When programming, it is useful to be able to store information in variables. A variable is a string of characters and numbers associated with a piece of information. The assignment operator , denoted by the “=” symbol, is the operator that is used to assign values to variables in Python. The line x=1 takes the known value, 1, and assigns that value to the variable with name “x”. After executing this line, this number will be stored into this variable. Until the value is changed or the variable deleted, the character x behaves like the value 1.

TRY IT! Assign the value 2 to the variable y. Multiply y by 3 to show that it behaves like the value 2.
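The original notebook cell isn’t shown in this excerpt; a minimal version of it might be:

```python
y = 2
print(y * 3)  # prints 6, showing that y behaves like the value 2
```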

A variable is more like a container to store the data in the computer’s memory, the name of the variable tells the computer where to find this value in the memory. For now, it is sufficient to know that the notebook has its own memory space to store all the variables in the notebook. As a result of the previous example, you will see the variable “x” and “y” in the memory. You can view a list of all the variables in the notebook using the magic command %whos .

TRY IT! List all the variables in this notebook
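Again, the cell itself isn’t reproduced here; in a Jupyter/IPython notebook it would simply be the magic command shown below.

```python
# IPython/Jupyter magic command (not plain Python): lists all variables
# currently stored in the notebook's workspace, with their types and values.
%whos
```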

Note that the equal sign in programming is not the same as a truth statement in mathematics. In math, the statement x = 2 declares the universal truth within the given framework: x is 2. In programming, the statement x=2 means a known value is being associated with a variable name: store 2 in x. Although it is perfectly valid to say 1 = x in mathematics, assignment in Python always goes to the left: the value on the right of the equal sign is assigned to the variable on the left of the equal sign. Therefore, 1=x will generate an error in Python. The assignment operator is always last in the order of operations relative to mathematical, logical, and comparison operators.

TRY IT! The mathematical statement x=x+1 has no solution for any value of x . In programming, if we initialize the value of x to be 1, then the statement makes perfect sense. It means, “Add x and 1, which is 2, then assign that value to the variable x”. Note that this operation overwrites the previous value stored in x .
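A minimal version of that cell (output omitted) might be:

```python
x = 1
x = x + 1   # add x and 1 (giving 2), then store the result back into x
print(x)    # 2: the previous value of x has been overwritten
```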

There are some restrictions on the names variables can take. Variables can only contain alphanumeric characters (letters and numbers) as well as underscores, and the first character of a variable name must be a letter or an underscore. Spaces within a variable name are not permitted, and variable names are case-sensitive (e.g., x and X will be considered different variables).

TIP! Unlike in pure mathematics, variables in programming almost always represent something tangible. It may be the distance between two points in space or the number of rabbits in a population. Therefore, as your code becomes increasingly complicated, it is very important that your variables carry a name that can easily be associated with what they represent. For example, the distance between two points in space is better represented by the variable dist than x , and the number of rabbits in a population is better represented by nRabbits than y .

Note that when a variable is assigned, it has no memory of how it was assigned. That is, if the value of a variable, y , is constructed from other variables, like x , reassigning the value of x will not change the value of y .

EXAMPLE: What value will y have after the following lines of code are executed?
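The code referred to here is not included in this excerpt; one plausible version, consistent with the paragraph above, is the following (this is a guess, not the original cell):

```python
x = 1
y = x + 1   # y is computed from x, so y is 2
x = 2       # reassigning x afterwards does NOT change y
print(y)    # still 2
```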

WARNING! You can overwrite variables or functions that have been stored in Python. For example, the command help = 2 will store the value 2 in the variable with name help . After this assignment help will behave like the value 2 instead of the function help . Therefore, you should always be careful not to give your variables the same name as built-in functions or values.

TIP! Now that you know how to assign variables, it is important that you learn to never leave unassigned commands. An unassigned command is an operation that has a result, but that result is not assigned to a variable. For example, you should never use 2+2. You should instead assign it to some variable, x=2+2. This allows you to “hold on” to the results of previous commands and will make your interaction with Python much less confusing.

You can clear a variable from the notebook using the del function. Typing del x will clear the variable x from the workspace. If you want to remove all the variables in the notebook, you can use the magic command %reset .

In mathematics, variables are usually associated with unknown numbers; in programming, variables are associated with a value of a certain type. There are many data types that can be assigned to variables. A data type is a classification of the type of information that is being stored in a variable. The basic data types that you will utilize throughout this book are boolean, int, float, string, list, tuple, dictionary, set. A formal description of these data types is given in the following sections.

Data-Driven Traffic Assignment: A Novel Approach for Learning Traffic Flow Patterns Using Graph Convolutional Neural Network

  • Published: 24 July 2023
  • Volume 5 , article number  11 , ( 2023 )


  • Rezaur Rahman 1 &
  • Samiul Hasan 1  


We present a novel data-driven approach of learning traffic flow patterns of a transportation network given that many instances of origin to destination (OD) travel demand and link flows of the network are available. Instead of estimating traffic flow patterns assuming certain user behavior (e.g., user equilibrium or system optimal), here we explore the idea of learning those flow patterns directly from the data. To implement this idea, we have formulated the traditional traffic assignment problem (from the field of transportation science) as a data-driven learning problem and developed a neural network-based framework known as Graph Convolutional Neural Network (GCNN) to solve it. The proposed framework represents the transportation network and OD demand in an efficient way and utilizes the diffusion process of multiple OD demands from nodes to links. We validate the solutions of the model against analytical solutions generated from running static user equilibrium-based traffic assignments over Sioux Falls and East Massachusetts networks. The validation results show that the implemented GCNN model can learn the flow patterns very well with less than 2% mean absolute difference between the actual and estimated link flows for both networks under varying congested conditions. When the training of the model is complete, it can instantly determine the traffic flows of a large-scale network. Hence, this approach can overcome the challenges of deploying traffic assignment models over large-scale networks and open new directions of research in data-driven network modeling.


Data availability.

The data that support the findings of this study are available from the corresponding author, [[email protected]], upon reasonable request.

Abdelfatah AS, Mahmassani HS (2002) A simulation-based signal optimization algorithm within a dynamic traffic assignment framework 428–433. https://doi.org/10.1109/itsc.2001.948695

Alexander L, Jiang S, Murga M, González MC (2015) Origin-destination trips by purpose and time of day inferred from mobile phone data. Transp Res Part C Emerg Technol 58:240–250. https://doi.org/10.1016/j.trc.2015.02.018


Atwood J, Towsley D (2015) Diffusion-convolutional neural networks. Adv Neural Inf Process Syst. https://doi.org/10.5555/3157096.3157320

Ban XJ, Liu HX, Ferris MC, Ran B (2008) A link-node complementarity model and solution algorithm for dynamic user equilibria with exact flow propagations. Transp Res Part B Methodol 42:823–842. https://doi.org/10.1016/j.trb.2008.01.006

Bar-Gera H (2002) Origin-based algorithm for the traffic assignment problem. Transp Sci 36:398–417. https://doi.org/10.1287/trsc.36.4.398.549


Barthélemy J, Carletti T (2017) A dynamic behavioural traffic assignment model with strategic agents. Transp Res Part C Emerg Technol 85:23–46. https://doi.org/10.1016/j.trc.2017.09.004

Ben-Akiva ME, Gao S, Wei Z, Wen Y (2012) A dynamic traffic assignment model for highly congested urban networks. Transp Res Part C Emerg Technol 24:62–82. https://doi.org/10.1016/j.trc.2012.02.006

Billings D, Jiann-Shiou Y (2006) Application of the ARIMA Models to Urban Roadway Travel Time Prediction-A Case Study. Systems, Man and Cybernetics, 2006. SMC’06. IEEE International Conference on 2529–2534

Boyles S, Ukkusuri SV, Waller ST, Kockelman KM (2006) A comparison of static and dynamic traffic assignment under tolls: a study of the dallas-fort worth network. 85th Annual Meeting of 14

Brandes U (2001) A faster algorithm for betweenness centrality. J Math Sociol 25:163–177. https://doi.org/10.1080/0022250X.2001.9990249

Cai P, Wang Y, Lu G, Chen P, Ding C, Sun J (2016) A spatiotemporal correlative k-nearest neighbor model for short-term traffic multistep forecasting. Transp Res Part C Emerg Technol 62:21–34. https://doi.org/10.1016/j.trc.2015.11.002

Chien SI-J, Kuchipudi CM (2003) Dynamic travel time prediction with real-time and historic data. J Transp Eng 129:608–616. https://doi.org/10.1061/(ASCE)0733-947X(2003)129:6(608)

Cui Z, Henrickson K, Ke R, Wang Y (2018a) Traffic graph convolutional recurrent neural network: a deep learning framework for network-scale traffic learning and forecasting. IEEE Trans Intell Transp Syst. https://doi.org/10.1109/TITS.2019.2950416

Cui Z, Ke R, Pu Z, Wang Y (2018b) Deep bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. International Workshop on Urban Computing (UrbComp) 2017

Cui Z, Lin L, Pu Z, Wang Y (2020) Graph Markov network for traffic forecasting with missing data. Transp Res Part C Emerg Technol 117:102671. https://doi.org/10.1016/j.trc.2020.102671

Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. Adv Neural Inf Process Syst 3844–3852

Deshpande M, Bajaj PR (2016) Performance analysis of support vector machine for traffic flow prediction. 2016 International Conference on Global Trends in Signal Processing, Information Computing and Communication (ICGTSPICC) 126–129. https://doi.org/10.1109/ICGTSPICC.2016.7955283

Elhenawy M, Chen H, Rakha HA (2014) Dynamic travel time prediction using data clustering and genetic programming. Transp Res Part C Emerg Technol 42:82–98. https://doi.org/10.1016/j.trc.2014.02.016

Foytik P, Jordan C, Robinson RM (2017) Exploring simulation based dynamic traffic assignment with large scale microscopic traffic simulation model

Friesz TL, Luque J, Tobin RL, Wie BW (1989) Dynamic network traffic assignment considered as a continuous time optimal control problem. Oper Res 37:893–901. https://doi.org/10.2307/171471


Gundlegård D, Rydergren C, Breyer N, Rajna B (2016) Travel demand estimation and network assignment based on cellular network data. Comput Commun 95:29–42. https://doi.org/10.1016/j.comcom.2016.04.015

Guo S, Lin Y, Feng N, Song C, Wan H (2019) Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. Proceed AAAI Conf Artif Intell 33:922–929. https://doi.org/10.1609/aaai.v33i01.3301922

Guo K, Hu Y, Qian Z, Sun Y, Gao J, Yin B (2020) Dynamic graph convolution network for traffic forecasting based on latent network of Laplace matrix estimation. IEEE Trans Intell Transp Syst 23:1009–1018. https://doi.org/10.1109/tits.2020.3019497

Guo K, Hu Y, Qian Z, Liu H, Zhang K, Sun Y, Gao J, Yin B (2021) Optimized graph convolution recurrent neural network for traffic prediction. IEEE Trans Intell Transp Syst 22:1138–1149. https://doi.org/10.1109/TITS.2019.2963722

Hammond DK, Vandergheynst P, Gribonval R (2011) Wavelets on graphs via spectral graph theory. Appl Comput Harmon Anal 30:129–150. https://doi.org/10.1016/j.acha.2010.04.005

He X, Guo X, Liu HX (2010) A link-based day-to-day traffic assignment model. Transp Res Part B 44:597–608. https://doi.org/10.1016/j.trb.2009.10.001

Huang Y, Kockelman KM (2019) Electric vehicle charging station locations: elastic demand, station congestion, and network equilibrium. Transp Res D Transp Environ. https://doi.org/10.1016/j.trd.2019.11.008

Innamaa S (2005) Short-term prediction of travel time using neural networks on an interurban highway. Transportation (amst) 32:649–669. https://doi.org/10.1007/s11116-005-0219-y

Jafari E, Pandey V, Boyles SD (2017) A decomposition approach to the static traffic assignment problem. Transp Res Part b Methodol 105:270–296. https://doi.org/10.1016/j.trb.2017.09.011

Janson BN (1989) Dynamic traffic assignment for urban road networks. Transp Res Part B Methodol 25:143–161

Ji X, Shao C, Wang B (2016) Stochastic dynamic traffic assignment model under emergent incidents. Procedia Eng 137:620–629. https://doi.org/10.1016/j.proeng.2016.01.299

Jiang Y, Wong SC, Ho HW, Zhang P, Liu R, Sumalee A (2011) A dynamic traffic assignment model for a continuum transportation system. Transp Res Part b Methodol 45:343–363. https://doi.org/10.1016/j.trb.2010.07.003

Kim H, Oh JS, Jayakrishnan R (2009) Effects of user equilibrium assumptions on network traffic pattern. KSCE J Civ Eng 13:117–127. https://doi.org/10.1007/s12205-009-0117-5

Kim TS, Lee WK, Sohn SY (2019) Graph convolutional network approach applied to predict hourly bike-sharing demands considering spatial, temporal, and global effects. PLoS ONE 14:e0220782. https://doi.org/10.1371/journal.pone.0220782

Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. 5th International Conference on Learning Representations, ICLR 2017- Conference Track Proceedings

LeBlanc LJ, Morlok EK, Pierskalla WP (1975) An efficient approach to solving the road network equilibrium traffic assignment problem. Transp Res 9:309–318. https://doi.org/10.1016/0041-1647(75)90030-1

Lee YLY (2009) Freeway travel time forecast using artifical neural networks with cluster method. 2009 12th International Conference on Information Fusion 1331–1338

Leon-Garcia A, Tizghadam A (2009) A graph theoretical approach to traffic engineering and network control problem. 21st International Teletraffic Congress, ITC 21: Traffic and Performance Issues in Networks of the Future-Final Programme

Leurent F, Chandakas E, Poulhès A (2011) User and service equilibrium in a structural model of traffic assignment to a transit network. Procedia Soc Behav Sci 20:495–505. https://doi.org/10.1016/j.sbspro.2011.08.056

Li Y, Yu R, Shahabi C, Liu Y (2018) Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. 6th International Conference on Learning Representations, ICLR 2018-Conference Track Proceedings 1–16

Li G, Knoop VL, van Lint H (2021) Multistep traffic forecasting by dynamic graph convolution: interpretations of real-time spatial correlations. Transp Res Part C Emerg Technol 128:103185. https://doi.org/10.1016/j.trc.2021.103185

Lin L, He Z, Peeta S (2018) Predicting station-level hourly demand in a large-scale bike-sharing network: a graph convolutional neural network approach. Transp Res Part C 97:258–276. https://doi.org/10.1016/j.trc.2018.10.011

Liu HX, Ban X, Ran B, Mirchandani P (2007) Analytical dynamic traffic assignment model with probabilistic travel times and perceptions. Transp Res Record 1783:125–133. https://doi.org/10.3141/1783-16

Liu Y, Zheng H, Feng X, Chen Z (2017) Short-term traffic flow prediction with Conv-LSTM. 2017 9th International Conference on Wireless Communications and Signal Processing, WCSP 2017-Proceedings. https://doi.org/10.1109/WCSP.2017.8171119

Lo HK, Szeto WY (2002) A cell-based variational inequality formulation of the dynamic user optimal assignment problem. Transp Res Part b Methodol 36:421–443. https://doi.org/10.1016/S0191-2615(01)00011-X

Luo X, Li D, Yang Y, Zhang S (2019) Spatiotemporal traffic flow prediction with KNN and LSTM. J Adv Transp. https://doi.org/10.1155/2019/4145353

Ma X, Tao Z, Wang Y, Yu H, Wang Y (2015) Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transp Res Part C Emerg Technol 54:187–197. https://doi.org/10.1016/j.trc.2015.03.014

Ma X, Dai Z, He Z, Ma J, Wang Y, Wang Y (2017) Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction. Sensors 17:818. https://doi.org/10.3390/s17040818

Mahmassani HaniS (2001) Dynamic network traffic assignment and simulation methodology for advanced system management applications. Netw Spat Econ 1:267–292. https://doi.org/10.1023/A:1012831808926

Merchant DK, Nemhauser GL (1978) A model and an algorithm for the dynamic traffic assignment problems. Transp Sci 12:183–199. https://doi.org/10.1287/trsc.12.3.183

Mitradjieva M, Lindberg PO (2013) The stiff is moving—conjugate direction Frank-Wolfe methods with applications to traffic assignment * . Transp Sci 47:280–293. https://doi.org/10.1287/trsc.1120.0409

Myung J, Kim D-K, Kho S-Y, Park C-H (2011) Travel time prediction using k nearest neighbor method with combined data from vehicle detector system and automatic toll collection system. Transp Res Record 2256:51–59. https://doi.org/10.3141/2256-07

Nie YM, Zhang HM (2010) Solving the dynamic user optimal assignment problem considering queue spillback. Netw Spat Econ 10:49–71. https://doi.org/10.1007/s11067-007-9022-y

Peeta S, Mahmassani HS (1995) System optimal and user equilibrium time-dependent traffic assignment in congested networks. Ann Oper Res. https://doi.org/10.1007/BF02031941

Peeta S, Ziliaskopoulos AK (2001) Foundations of dynamic traffic assignment: the past, the present and the future. Netw Spat Econ 1:233–265

Peng H, Wang H, Du B, Bhuiyan MZA, Ma H, Liu J, Wang L, Yang Z, Du L, Wang S, Yu PS (2020) Spatial temporal incidence dynamic graph neural networks for traffic flow forecasting. Inf Sci (n y) 521:277–290. https://doi.org/10.1016/j.ins.2020.01.043

Peng H, Du B, Liu M, Liu M, Ji S, Wang S, Zhang X, He L (2021) Dynamic graph convolutional network for long-term traffic flow prediction with reinforcement learning. Inf Sci (n y) 578:401–416. https://doi.org/10.1016/j.ins.2021.07.007


Polson NG, Sokolov VO (2017) Deep learning for short-term traffic flow prediction. Transp Res Part C Emerg Technol 79:1–17. https://doi.org/10.1016/j.trc.2017.02.024

Primer A (2011) Dynamic traffic assignment. Transportation Network Modeling Committee 1–39. https://doi.org/10.1016/j.trd.2016.06.003

PyTorch [WWW document], 2016. URL https://pytorch.org/ . Accessed 10 Jun 2020

Rahman R, Hasan S (2018). Short-term traffic speed prediction for freeways during hurricane evacuation: a deep learning approach 1291–1296. https://doi.org/10.1109/ITSC.2018.8569443

Ran BIN, Boyce DE, Leblanc LJ (1993) A new class of instantaneous dynamic user-optimal traffic assignment models. Oper Res 41:192–202

Sanchez-Gonzalez A, Godwin J, Pfaff T, Ying R, Leskovec J, Battaglia PW (2020) Learning to simulate complex physics with graph networks. 37th International Conference on Machine Learning, ICML 2020 PartF16814, 8428–8437

Shafiei S, Gu Z, Saberi M (2018) Calibration and validation of a simulation-based dynamic traffic assignment model for a large-scale congested network. Simul Model Pract Theory 86:169–186. https://doi.org/10.1016/j.simpat.2018.04.006

Sheffi Y (1985) Urban transportation networks. Prentice-Hall Inc., Englewood Cliffs. https://doi.org/10.1016/0191-2607(86)90023-3


Tang C, Sun J, Sun Y, Peng M, Gan N (2020) A General traffic flow prediction approach based on spatial-temporal graph attention. IEEE Access 8:153731–153741. https://doi.org/10.1109/ACCESS.2020.3018452

Teng SH (2016) Scalable algorithms for data and network analysis. Found Trends Theor Comput Sci 12:1–274. https://doi.org/10.1561/0400000051

Tizghadam A, Leon-garcia A (2007) A robust routing plan to optimize throughput in core networks 117–128

Transportation Networks for Research Core Team (2016) Transportation Networks for Research [WWW Document]. 2016. URL https://github.com/bstabler/TransportationNetworks (Accessed 8 Jul 2018)

Ukkusuri SV, Han L, Doan K (2012) Dynamic user equilibrium with a path based cell transmission model for general traffic networks. Transp Res Part b Methodol 46:1657–1684. https://doi.org/10.1016/j.trb.2012.07.010

Waller ST, Fajardo D, Duell M, Dixit V (2013) Linear programming formulation for strategic dynamic traffic assignment. Netw Spat Econ 13:427–443. https://doi.org/10.1007/s11067-013-9187-5

Wang HW, Peng ZR, Wang D, Meng Y, Wu T, Sun W, Lu QC (2020) Evaluation and prediction of transportation resilience under extreme weather events: a diffusion graph convolutional approach. Transp Res Part C Emerg Technol 115:102619. https://doi.org/10.1016/j.trc.2020.102619

Wardrop J (1952) Some theoretical aspects of road traffic research. Proc Inst Civil Eng Part II 1:325–378. https://doi.org/10.1680/ipeds.1952.11362

Watling D, Hazelton ML (2003) The dynamics and equilibria of day-to-day assignment models. Netw Spat Econ. https://doi.org/10.1023/A:1025398302560

Wu CH, Ho JM, Lee DT (2004) Travel-time prediction with support vector regression. IEEE Trans Intell Transp Syst 5:276–281. https://doi.org/10.1109/TITS.2004.837813

Yasdi R (1999) Prediction of road traffic using a neural network approach. Neural Comput Appl 8:135–142. https://doi.org/10.1007/s005210050015

Yu B, Song X, Guan F, Yang Z, Yao B (2016) k-nearest neighbor model for multiple-time-step prediction of short-term traffic condition. J Transp Eng 142:4016018. https://doi.org/10.1061/(ASCE)TE.1943-5436.0000816

Yu B, Yin H, Zhu Z (2018) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, California pp. 3634–3640. https://doi.org/10.24963/ijcai.2018/505

Zhang Y, Haghani A (2015) A gradient boosting method to improve travel time prediction. Transp Res Part C Emerg Technol 58:308–324. https://doi.org/10.1016/j.trc.2015.02.019

Zhang S, Tong H, Xu J, Maciejewski R (2019) Graph convolutional networks: a comprehensive review. Comput Soc Netw. https://doi.org/10.1186/s40649-019-0069-y

Zhao L, Song Y, Zhang C, Liu Y, Wang P, Lin T, Deng M, Li H (2020) T-GCN: a temporal graph convolutional network for traffic prediction. IEEE Trans Intell Transp Syst 21:3848–3858. https://doi.org/10.1109/TITS.2019.2935152

Zhou J, Cui G, Zhang Z, Yang C, Liu Z, Wang L, Li C, Sun M (2020) Graph neural networks: a review of methods and applications. AI Open.  https://doi.org/10.1016/j.aiopen.2021.01.001


This study was supported by the U.S. National Science Foundation through the grant CMMI #1917019. However, the authors are solely responsible for the facts and accuracy of the information presented in the paper.

Author information

Authors and Affiliations

Department of Civil, Environmental, and Construction Engineering, University of Central Florida, Orlando, FL, USA

Rezaur Rahman & Samiul Hasan


Contributions

The authors confirm their contribution to the paper as follows: study conception and design: RR, SH; analysis and interpretation of results: RR, SH; draft manuscript preparation: RR, SH. All authors reviewed the results and approved the final version of the manuscript.

Corresponding author

Correspondence to Rezaur Rahman .

Ethics declarations

Conflict of interest.

The authors declare that they have no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: modeling traffic flows using spectral graph convolution

In spectral graph convolution, a spectral convolutional filter is used to learn traffic flow patterns inside a transportation network in response to travel demand variations. The spectral filter is derived from spectrum of the Laplacian matrix, which consists of eigenvalues of the Laplacian matrix. So to construct the spectrum, we must calculate the eigenvalues of` the Laplacian matrix. For a symmetric graph, we can compute the eigenvalues using Eigen decomposition of the Laplacian matrix. In this problem, we consider the transportation network as a symmetric-directed graph, same number of links getting out and getting inside a node, which means the in-degree and out-degree matrices of the graph are similar. Thus, the Laplacian matrix of this graph is diagonalizable as follows using Eigen decomposition

where \(\boldsymbol{\Lambda }\) is a diagonal matrix with eigenvalues, \({\lambda }_{0},{\lambda }_{1},{\lambda }_{2}, . . . ,{\lambda }_{N}\) and \({\varvec{U}}\) indicates the eigen vectors, \({u}_{0},{u}_{1},{u}_{2}, . . . ,{u}_{N}\) . Eigen values represent characteristics of transportation network in terms of strength of a particular node based on its position, distance between adjacent nodes, and dimension of the network. The spectral graph convolution filter can be defined as follows:

where \(\theta\) is the parameter for the convolution filter shared by all the nodes of the network and \(K\) is the size of the convolution filter. Now the spectral graph convolution over the graph signal ( \({\varvec{X}})\) is defined as follows:

According to spectral graph theory, the shortest path distance i.e., minimum number of links connecting nodes \(i\) and \(j\) is longer than \(K\) , such that \({L}^{K}\left(i, j\right) = 0\) (Hammond et al. 2011 ). Consequently, for a given pair of origin ( \(i\) ) and destination ( \(j)\) nodes, a spectral graph filter of size K has access to all the nodes on the shortest path of the graph. It means that the spectral graph convolution filter of size \(K\) captures flow propagation through each node on the shortest path. So the spectral graph convolution operation can model the interdependency between a link and its \(i\) th order adjacent nodes on the shortest paths, given that 0 ≤  i  ≤  K .

The computational complexity of calculating \({{\varvec{L}}}_{{\varvec{w}}}^{{\varvec{k}}}\) is high due to K times multiplication of \({L}_{w}\) . A way to overcome this challenge is to approximate the spectral filter \({g}_{\theta }\) with Chebyshev polynomials up to ( \(K-1\) )th order (Hammond et al. 2011 ). Defferrard et al. (Defferrard et al. 2016 ) applied this approach to build a K -localized ChebNet, where the convolution is defined as

in which \(\overline{\boldsymbol{L}} = 2\boldsymbol{L}_{sym}/\lambda_{max} - \boldsymbol{I}\). \(\overline{\boldsymbol{L}}\) is a rescaling of the graph Laplacian that maps the eigenvalues from \([0, \lambda_{max}]\) to \([-1, 1]\). \(\boldsymbol{L}_{sym}\) is the symmetric normalization of the Laplacian matrix, \(\boldsymbol{D}_{w}^{-1/2}\boldsymbol{L}_{w}\boldsymbol{D}_{w}^{-1/2}\). \(T_{k}\) and \(\theta_{k}\) denote the Chebyshev polynomials and the Chebyshev coefficients, respectively. The Chebyshev polynomials are defined recursively by \(T_{k}(\overline{\boldsymbol{L}}) = 2\overline{\boldsymbol{L}}\, T_{k-1}(\overline{\boldsymbol{L}}) - T_{k-2}(\overline{\boldsymbol{L}})\) with \(T_{0}(\overline{\boldsymbol{L}}) = \boldsymbol{I}\) and \(T_{1}(\overline{\boldsymbol{L}}) = \overline{\boldsymbol{L}}\); these form the Chebyshev polynomial basis. Kipf and Welling (Kipf and Welling 2016) simplified this model by restricting the filter to first order (\(K = 2\)) and approximating the largest eigenvalue \(\lambda_{max}\) of \(\boldsymbol{L}_{sym}\) as 2. In this way, the convolution becomes

$$ g_{\theta} \ast \boldsymbol{X} \approx \theta_{0}\,\boldsymbol{X} - \theta_{1}\,\boldsymbol{D}_{w}^{-1/2}\boldsymbol{A}_{w}\boldsymbol{D}_{w}^{-1/2}\,\boldsymbol{X} $$
where the single Chebyshev coefficient \(\theta = \theta_{0} = -\theta_{1}\). All details about the assumptions behind the Chebyshev polynomial approximation and their implications can be found in (Hammond et al. 2011). The simplified graph convolution can now be written as

$$ g_{\theta} \ast \boldsymbol{X} \approx \theta\,\left(\boldsymbol{I} + \boldsymbol{D}_{w}^{-1/2}\boldsymbol{A}_{w}\boldsymbol{D}_{w}^{-1/2}\right)\boldsymbol{X} $$
Since \(\boldsymbol{I} + \boldsymbol{D}_{w}^{-1/2}\boldsymbol{A}_{w}\boldsymbol{D}_{w}^{-1/2}\) has eigenvalues in the range \([0, 2]\), repeatedly applying it in a deep neural network can lead to exploding or vanishing gradients. To alleviate this problem, Kipf and Welling (Kipf and Welling 2016) use a renormalization trick, replacing \(\boldsymbol{I} + \boldsymbol{D}_{w}^{-1/2}\boldsymbol{A}_{w}\boldsymbol{D}_{w}^{-1/2}\) with \(\overline{\boldsymbol{D}}_{w}^{-1/2}\overline{\boldsymbol{A}}_{w}\overline{\boldsymbol{D}}_{w}^{-1/2}\), where \(\overline{\boldsymbol{A}}_{w} = \boldsymbol{A}_{w} + \boldsymbol{I}\), which is equivalent to adding a self-loop to every node, and \(\overline{\boldsymbol{D}}_{w}\) is the corresponding degree matrix. The spectral graph convolution can now be simplified as

$$ \boldsymbol{Z} = \overline{\boldsymbol{D}}_{w}^{-1/2}\,\overline{\boldsymbol{A}}_{w}\,\overline{\boldsymbol{D}}_{w}^{-1/2}\,\boldsymbol{X}\,\boldsymbol{\Theta} $$
where \(\boldsymbol{\Theta} \in \mathbb{R}^{N \times N}\) denotes the parameters of the convolution filter to be learned during training. From Eq. 21, we can observe that spectral graph convolution is a special case of diffusion convolution (Li et al. 2018); the only difference is that spectral convolution symmetrically normalizes the adjacency matrix.
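To make the renormalized propagation rule concrete, the following is a minimal NumPy sketch of one simplified spectral graph convolution; the adjacency matrix, feature values, and parameter shapes are illustrative assumptions rather than values from this study.

```python
import numpy as np

# Toy symmetric weighted adjacency matrix A_w for a 4-node network (illustrative values).
A_w = np.array([[0., 1., 0., 1.],
                [1., 0., 1., 0.],
                [0., 1., 0., 1.],
                [1., 0., 1., 0.]])

X = np.random.rand(4, 4)       # graph signal (e.g., demand-related features per node)
Theta = np.random.rand(4, 4)   # learnable filter parameters (random here, for illustration)

# Renormalization trick: add self-loops, then symmetrically normalize.
A_bar = A_w + np.eye(A_w.shape[0])                   # A_bar = A_w + I
deg = A_bar.sum(axis=1)                              # diagonal of D_bar
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))

# Simplified spectral graph convolution: Z = D_bar^{-1/2} A_bar D_bar^{-1/2} X Theta
Z = D_inv_sqrt @ A_bar @ D_inv_sqrt @ X @ Theta
print(Z.shape)  # (4, 4)
```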

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Rahman, R., Hasan, S. Data-Driven Traffic Assignment: A Novel Approach for Learning Traffic Flow Patterns Using Graph Convolutional Neural Network. Data Sci. Transp. 5 , 11 (2023). https://doi.org/10.1007/s42421-023-00073-y


Received : 21 May 2023

Revised : 09 June 2023

Accepted : 13 June 2023

Published : 24 July 2023

DOI : https://doi.org/10.1007/s42421-023-00073-y


Keywords

  • Traffic assignment problem
  • Data-driven method
  • Deep learning
  • Graph convolutional neural network

CSE 163, Summer 2020: Homework 3: Data Analysis

In this assignment, you will apply what you've learned so far to a more extensive "real-world" dataset using more powerful features of the Pandas library. As in HW2, this dataset is provided in CSV format. We have cleaned up the data some, but you will need to handle more edge cases common to real-world datasets, including null cells that represent unknown information.

Note that there is no graded testing portion of this assignment. We still recommend writing tests to verify the correctness of the methods that you write in Part 0, but it will be difficult to write tests for Part 1 and 2. We've provided tips in those sections to help you gain confidence about the correctness of your solutions without writing formal test functions!

This assignment is supposed to introduce you to various parts of the data science process involving being able to answer questions about your data, how to visualize your data, and how to use your data to make predictions for new data. To help prepare for your final project, this assignment has been designed to be wide in scope so you can get practice with many different aspects of data analysis. While this assignment might look large because there are many parts, each individual part is relatively small.

Learning Objectives

After this homework, students will be able to:

  • Work with basic Python data structures.
  • Handle edge cases appropriately, including addressing missing values/data.
  • Practice user-friendly error-handling.
  • Read plotting library documentation and use example plotting code to figure out how to create more complex Seaborn plots.
  • Train a machine learning model and use it to make a prediction about the future using the scikit-learn library.

Expectations

Here are some baseline expectations we expect you to meet:

Follow the course collaboration policies

If you are developing on Ed, all the files are there. The files included are:

  • hw3-nces-ed-attainment.csv : A CSV file that contains data from the National Center for Education Statistics. This is described in more detail below.
  • hw3.py : The file for you to put solutions to Part 0, Part 1, and Part 2. You are required to add a main method that parses the provided dataset and calls all of the functions you are to write for this homework.
  • hw3-written.txt : The file for you to put your answers to the questions in Part 3.
  • cse163_utils.py : Provides utility functions for this assignment. You probably don't need to use anything inside this file except importing it if you have a Mac (see comment in hw3.py )

If you are developing locally, you should navigate to Ed and in the assignment view open the file explorer (on the left). Once there, you can right-click to select the option to "Download All" to download a zip and open it as the project in Visual Studio Code.

The dataset you will be processing comes from the National Center for Education Statistics. You can find the original dataset here . We have cleaned it a bit to make it easier to process in the context of this assignment. You must use our provided CSV file in this assignment.

The original dataset is titled: Percentage of persons 25 to 29 years old with selected levels of educational attainment, by race/ethnicity and sex: Selected years, 1920 through 2018 . The cleaned version you will be working with has columns for Year, Sex, Educational Attainment, and race/ethnicity categories considered in the dataset. Note that not all columns will have data starting at 1920.

Our provided hw3-nces-ed-attainment.csv looks like: (⋮ represents omitted rows):

Year,Sex,Min degree,Total,White,Black,Hispanic,Asian,Pacific Islander,American Indian/Alaska Native,Two or more races
1920,A,high school,---,22.0,6.3,---,---,---,---,---
1940,A,high school,38.1,41.2,12.3,---,---,---,---,---
⋮
2018,F,master's,10.7,12.6,6.2,3.8,29.9,---,---,---

Column Descriptions

  • Year: The year this row represents. Note there may be more than one row for the same year to show the percent breakdowns by sex.
  • Sex: The sex of the students this row pertains to, one of "F" for female, "M" for male, or "A" for all students.
  • Min degree: The degree this row pertains to. One of "high school", "associate's", "bachelor's", or "master's".
  • Total: The total percent of students of the specified gender to reach at least the minimum level of educational attainment in this year.
  • White / Black / Hispanic / Asian / Pacific Islander / American Indian or Alaska Native / Two or more races: The percent of students of this race and the specified gender to reach at least the minimum level of educational attainment in this year.

Interactive Development

When using data science libraries like pandas , seaborn , or scikit-learn it's extremely helpful to actually interact with the tools you're using so you can get a better idea about the shape of your data. The preferred practice in industry is to use a Jupyter Notebook, like we have been using in lecture, to play around with the dataset and figure out how to answer the questions you want to answer. This is incredibly helpful when you're first learning a tool, as you can experiment and get real-time feedback on whether the code you wrote does what you want.

We recommend that you try figuring out how to solve these problems in a Jupyter Notebook so you can actually interact with the data. We have made a Playground Jupyter Notebook for you that has the data uploaded. At the top-right of this page in Ed is a "Fork" button (looks like a fork in the road). This will make your own copy of this Notebook so you can run the code and experiment with anything there! When you open the Workspace, you should see a list of notebooks and CSV files. You can always access this launch page by clicking the Jupyter logo.

Part 0: Statistical Functions with Pandas

In this part of the homework, you will write code to perform various analytical operations on data parsed from a file.

Part 0 Expectations

  • All functions for this part of the assignment should be written in hw3.py .
  • For this part of the assignment, you may import and use the math and pandas modules, but you may not use any other imports to solve these problems.
  • For all of the problems below, you should not use ANY loops or list/dictionary comprehensions. The goal of this part of the assignment is to use pandas as a tool to help answer questions about your dataset.

Problem 0: Parse data

In your main method, parse the data from the CSV file using pandas. Note that the file uses '---' as the entry to represent missing data. You do NOT need to do anything fancy like setting a datetime index.

The function to read a CSV file in pandas takes a parameter called na_values that takes a str to specify which values are NaN values in the file. It will replace all occurrences of those characters with NaN. You should specify this parameter to make sure the data parses correctly.
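A minimal sketch of this parsing step might look like the following; the variable name is only a placeholder.

```python
import pandas as pd

# '---' marks missing values in the provided file, so tell pandas to treat it as NaN.
data = pd.read_csv('hw3-nces-ed-attainment.csv', na_values='---')
```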

Problem 1: compare_bachelors_1980

What were the percentages for women vs. men having earned a Bachelor's Degree in 1980? Call this method compare_bachelors_1980 and return the result as a DataFrame with a row for men and a row for women with the columns "Sex" and "Total".

     Sex  Total
112    M   24.0
180    F   21.0

The index of the DataFrame is shown as the left-most column above.

Problem 2: top_2_2000s

What were the two most commonly awarded levels of educational attainment awarded between 2000-2010 (inclusive)? Use the mean percent over the years to compare the education levels in order to find the two largest. For this computation, you should use the rows for the 'A' sex. Call this method top_2_2000s and return a Series with the top two values (the index should be the degree names and the values should be the percent).

For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data , then top_2_2000s(data) will return the following Series (shows the index on the left, then the value on the right)

Hint: The Series class also has a method nlargest that behaves similarly to the one for the DataFrame , but does not take a column parameter (as Series objects don't have columns).

Our assert_equals only checks that floating point numbers are within 0.001 of each other, so your floats do not have to match exactly.

Optional: Why 0.001?

Whenever you work with floating point numbers, it is very likely you will run into the imprecision of floating point arithmetic . You have probably run into this with your everyday calculator! If you take 1, divide by 3, and then multiply by 3 again, you could get something like 0.99999999 instead of the 1 you would expect.

This is due to the fact that there is only a finite number of bits to represent floats so we will at some point lose some precision. Below, we show some example Python expressions that give imprecise results.
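For instance, expressions like the following give slightly imprecise results in Python (these are standard illustrations of IEEE-754 double behavior, shown here as examples):

```python
# These all look like they should be exact, but binary floating point cannot
# represent most decimal fractions exactly.
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
print(1.1 * 3)           # 3.3000000000000003
```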

Because of this, you can never safely check if one float is == to another. Instead, we only check that the numbers match within some small delta that is permissible by the application. We kind of arbitrarily chose 0.001, and if you need really high accuracy you would want to only allow for smaller deviations, but equality is never guaranteed.

Problem 3: percent_change_bachelors_2000s

What is the difference between total percent of bachelor's degrees received in 2000 as compared to 2010? Take a sex parameter so the client can specify 'M', 'F', or 'A' for evaluating. If a call does not specify the sex to evaluate, you should evaluate the percent change for all students (sex = ‘A’). Call this method percent_change_bachelors_2000s and return the difference (the percent in 2010 minus the percent in 2000) as a float.

For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data , then the call percent_change_bachelors_2000s(data) will return 2.599999999999998 . Our assert_equals only checks that floating point numbers are within 0.001 of each other, so your floats do not have to match exactly.

Hint: For this problem you will need to use the squeeze() function on a Series to get a single value from a Series of length 1.
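As a small illustration of that hint (the mini-DataFrame and its numbers are hypothetical placeholders, not values from the real dataset), squeeze() reduces a length-1 Series to a plain scalar:

```python
import pandas as pd

# Hypothetical one-row slice standing in for a filtered piece of the NCES data.
filtered = pd.DataFrame({'Year': [2000], 'Total': [29.1]})

# Selecting one column of a one-row DataFrame gives a length-1 Series;
# squeeze() turns it into a single float.
value_2000 = filtered['Total'].squeeze()
print(value_2000)  # 29.1
```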

Part 1: Plotting with Seaborn

Next, you will write functions to generate data visualizations using the Seaborn library. For each of the functions save the generated graph with the specified name. These methods should only take the pandas DataFrame as a parameter. For each problem, only drop rows that have missing data in the columns that are necessary for plotting that problem ( do not drop any additional rows ).

Part 1 Expectations

  • When submitting on Ed, you DO NOT need to specify the absolute path (e.g. /home/FILE_NAME ) for the output file name. If you specify absolute paths for this assignment your code will not pass the tests!
  • You will want to pass the parameter value bbox_inches='tight' to the call to savefig to make sure edges of the image look correct!
  • For this part of the assignment, you may import the math , pandas , seaborn , and matplotlib modules, but you may not use any other imports to solve these problems.
  • For all of the problems below, you should not use ANY loops or list/dictionary comprehensions.
  • Do not use any of the other seaborn plotting functions for this assignment besides the ones we showed in the reference box below. For example, even though the documentation for relplot links to another method called scatterplot , you should not call scatterplot . Instead use relplot(..., kind='scatter') like we showed in class. This is not an issue of stylistic preference, but these functions behave slightly differently. If you use these other functions, your output might look different than the expected picture. You don't yet have the tools necessary to use scatterplot correctly! We will see these extra tools later in the quarter.

Part 1 Development Strategy

  • Print your filtered DataFrame before creating the graph to ensure you’re selecting the correct data.
  • Call the DataFrame describe() method to see some statistical information about the data you've selected. This can sometimes help you determine what to expect in your generated graph.
  • Re-read the problem statement to make sure your generated graph is answering the correct question.
  • Compare the data on your graph to the values in hw3-nces-ed-attainment.csv. For example, for problem 0 you could check that the generated line goes through the point (2005, 28.8) because of this row in the dataset: 2005,A,bachelor's,28.8,34.5,17.6,11.2,62.1,17.0,16.4,28.0

Seaborn Reference

Of all the libraries we will learn this quarter, Seaborn is by far the best documented. We want to give you experience reading real world documentation to learn how to use a library so we will not be providing a specialized cheat-sheet for this assignment. What we will do to make sure you don't have to look through pages and pages of documentation is link you to some key pages you might find helpful for this assignment; you do not have to use every page we link, so part of the challenge here is figuring out which of these pages you need. As a data scientist, a huge part of solving a problem is learning how to skim lots of documentation for a tool that you might be able to leverage to solve your problem.

We recommend to read the documentation in the following order:

  • Start by skimming the examples to see the possible things the function can do. Don't spend too much time trying to figure out what the code is doing yet, but you can quickly look at it to see how much work is involved.
  • Then read the top paragraph(s) that give a general overview of what the function does.
  • Now that you have a better idea of what the function is doing, go look back at the examples and look at the code much more carefully. When you see an example like the one you want to generate, look carefully at the parameters it passes and go check the parameter list near the top for documentation on those parameters.
  • It sometimes (but not always) helps to skim the other parameters in the list just so you have an idea of what the function is capable of doing

As a reminder, you will want to refer to the lecture/section material to see the additional matplotlib calls you might need in order to display/save the plots. You'll also need to call the set function on seaborn to get everything set up initially.

Here are the seaborn functions you might need for this assignment:

  • Bar/Violin Plot ( catplot )
  • Plot a Discrete Distribution ( distplot ) or Continuous Distribution ( kdeplot )
  • Scatter/Line Plot ( relplot )
  • Linear Regression Plot ( regplot )
  • Compare Two Variables ( jointplot )
  • Heatmap ( heatmap )
Make sure you read the bullet point at the top of the page warning you to only use these functions!
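As a rough sketch of the overall plotting pattern (not a solution to any specific problem below; the filter and output file name are placeholders), a Seaborn plot is typically created and saved like this:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()  # set up seaborn's default styling

data = pd.read_csv('hw3-nces-ed-attainment.csv', na_values='---')

# Placeholder filter: one degree level for the 'A' (all students) rows.
subset = data[(data['Sex'] == 'A') & (data['Min degree'] == 'high school')]

sns.relplot(x='Year', y='Total', data=subset, kind='line')
plt.xlabel('Year')
plt.ylabel('Percentage')
plt.title('Example title')
plt.savefig('example_plot.png', bbox_inches='tight')
```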

Problem 0: Line Chart

Plot the total percentage of people whose minimum educational attainment is a bachelor's degree over time as a line chart. To select all people, you should filter to rows where sex is 'A'. Label the x-axis "Year", the y-axis "Percentage", and title the plot "Percentage Earning Bachelor's over Time". Name your method line_plot_bachelors and save your generated graph as line_plot_bachelors.png .

result of line_plot_bachelors

Problem 1: Bar Chart

Plot the total percentages of women, men, and total people with a minimum education of high school degrees in the year 2009. Label the x-axis "Sex", the y-axis "Percentage", and title the plot "Percentage Completed High School by Sex". Name your method bar_chart_high_school and save your generated graph as bar_chart_high_school.png .

Do you think this bar chart is an effective data visualization? Include your reasoning in hw3-written.txt as described in Part 3.

result of bar_chart_high_school

Problem 2: Custom Plot

Plot the results of how the percent of Hispanic individuals with degrees has changed between 1990 and 2010 (inclusive) for high school and bachelor's degrees with a chart of your choice. Make sure you label your axes with descriptive names and give a title to the graph. Name your method plot_hispanic_min_degree and save your visualization as plot_hispanic_min_degree.png .

Include a justification of your choice of data visualization in hw3-written.txt , as described in Part 3.

Part 2: Machine Learning using scikit-learn

Now you will be making a simple machine learning model for the provided education data using scikit-learn . Complete this in a function called fit_and_predict_degrees that takes the data as a parameter and returns the test mean squared error as a float. This may sound like a lot, so we've broken it down into steps for you:

  • Filter the DataFrame to only include the columns for year, degree type, sex, and total.
  • Do the following pre-processing: Drop rows that have missing data for just the columns we are using; do not drop any additional rows . Convert string values to their one-hot encoding. Split the columns as needed into input features and labels.
  • Randomly split the dataset into 80% for training and 20% for testing.
  • Train a decision tree regressor model to take in year, degree type, and sex to predict the percent of individuals of the specified sex to achieve that degree type in the specified year.
  • Use your model to predict on the test set. Calculate the accuracy of your predictions using the mean squared error of the test dataset.

You do not need to do anything fancy like finding the optimal parameter settings to maximize performance. We just want you to start simple and train a model from scratch! The reference below has all the methods you will need for this section!
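A rough scikit-learn sketch of the steps above follows; this is an outline under the column names described earlier, not a reference solution, and a real submission still needs a main method and may differ in details.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


def fit_and_predict_degrees(data):
    # Keep only the columns we model on and drop rows missing any of them.
    data = data[['Year', 'Min degree', 'Sex', 'Total']].dropna()

    # One-hot encode the string columns; 'Total' is the label.
    features = pd.get_dummies(data[['Year', 'Min degree', 'Sex']])
    labels = data['Total']

    # Random 80/20 train/test split.
    features_train, features_test, labels_train, labels_test = \
        train_test_split(features, labels, test_size=0.2)

    model = DecisionTreeRegressor()
    model.fit(features_train, labels_train)

    predictions = model.predict(features_test)
    return mean_squared_error(labels_test, predictions)
```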

scikit-learn Reference

You can find our reference sheet for machine learning with scikit-learn in ScikitLearnReference . This reference sheet has information about general scikit-learn calls that are helpful, as well as how to train the tree models we talked about in class. At the top-right of this page in Ed is a "Fork" button (looks like a fork in the road). This will make your own copy of this Notebook so you can run the code and experiment with anything there! When you open the Workspace, you should see a list of notebooks and CSV files. You can always access this launch page by clicking the Jupyter logo.

Part 2 Development Strategy

Like in Part 1, it can be difficult to write tests for this section. Machine Learning is all about uncertainty, and it's often difficult to write tests to know what is right. This requires diligence and making sure you are very careful with the method calls you make. To help you with this, we've provided some alternative ways to gain confidence in your result:

  • Print your test y values and your predictions to compare them manually. They won't be exactly the same, but you should notice that they have some correlation. For example, I might be concerned if my test y values were [2, 755, …] and my predicted values were [1022, 5...] because they seem to not correlate at all.
  • Calculate your mean squared error on your training data as well as your test data. The error should be lower on your training data than on your testing data.

Optional: ML for Time Series

Since this is technically time series data, we should point out that our method for assessing the model's accuracy is slightly wrong (but we will keep it simple for our HW). When working with time series, it is common to use the last rows for your test set rather than random sampling (assuming your data is sorted chronologically). The reason is when working with time series data in machine learning, it's common that our goal is to make a model to help predict the future. By randomly sampling a test set, we are assessing the model on its ability to predict in the past! This is because it might have trained on rows that came after some rows in the test set chronologically. However, this is not a task we particularly care that the model does well at. Instead, by using the last section of the dataset (the most recent in terms of time), we are now assessing its ability to predict into the future from the perspective of its training set.

Even though it's not the best approach to randomly sample here, we ask you to do it anyways. This is because random sampling is the most common method for all other data types.

Part 3: Written Responses

Review the source of the dataset here . For the following reflection questions consider the accuracy of data collected, and how it's used as a public dataset (e.g. presentation of data, publishing in media, etc.). All of your answers should be complete sentences and show thoughtful responses. "No" or "I don't know" or any response like that are not valid responses for any questions. There is not one particularly right answer to these questions, instead, we are looking to see you use your critical thinking and justify your answers!

  • Do you think the bar chart from part 1b is an effective data visualization? Explain in 1-2 sentences why or why not.
  • Why did you choose the type of plot that you did in part 1c? Explain in a few sentences why you chose this type of plot.
  • Datasets can be biased. Bias in data means it might be skewed away from or portray a wrong picture of reality. The data might contain inaccuracies or the methods used to collect the data may have been flawed. Describe a possible bias present in this dataset and why it might have occurred. Your answer should be about 2 or 3 sentences long.

Context : Later in the quarter we will talk about ethics and data science. This question is supposed to be a warm-up to get you thinking about our responsibilities having this power to process data. We are not trying to train to misuse your powers for evil here! Most misuses of data analysis that result in ethical concerns happen unintentionally. As preparation to understand these unintentional consequences, we thought it would be a good exercise to think about a theoretical world where you would willingly try to misuse data.

Congrats! You just got an internship at Evil Corp! Your first task is to come up with an application or analysis that uses this dataset to do something unethical or nefarious. Describe a way that this dataset could be misused in some application or analysis (potentially using the bias you identified for the last question). Regardless of what nefarious act you choose, evil still has rules: You need to justify why using the data in this way is a misuse and why a regular person who is not evil (like you in the real world outside of this problem) would think using the data in this way would be wrong. There are no right answers here about what defines something as unethical, which is why you need to justify your answer! Your response should be 2 to 4 sentences long.

Turn in your answers to these questions by writing them in hw3-written.txt and submitting them on Ed.

Your submission will be evaluated on the following dimensions:

  • Your solution correctly implements the described behaviors. You will have access to some tests when you turn in your assignment, but we will withhold other tests to test your solution when grading. All behavior we test is completely described by the problem specification or shown in an example.
  • No method should modify its input parameters.
  • Your main method in hw3.py must call every one of the methods you implemented in this assignment. There are no requirements on the format of the output, besides that it should save the files for Part 1 with the proper names specified in Part 1.
  • We can run your hw3.py without it crashing, and it should produce no errors or warnings.
  • All files submitted pass flake8
  • All program files should be written with good programming style. This means your code should satisfy the requirements within the CSE 163 Code Quality Guide .
  • Any expectations on this page or the sub-pages for the assignment are met as well as all requirements for each of the problems are met.

Make sure you carefully read the bullets above as they may or may not change from assignment to assignment!

A note on allowed material

A lot of students have been asking questions like "Can I use this method or can I use this language feature in this class?". The general answer to this question is it depends on what you want to use, what the problem is asking you to do and if there are any restrictions that problem places on your solution.

There is no automatic deduction for using some advanced feature or using material that we have not covered in class yet, but if it violates the restrictions of the assignment, it is possible you will lose points. It's not possible for us to list out every possible thing you can't use on the assignment, but we can say for sure that you are safe to use anything we have covered in class so far as long as it meets what the specification asks and you are appropriately using it as we showed in class.

For example, some things that are probably okay to use even though we didn't cover them:

  • Using the update method on the set class even though I didn't show it in lecture. It was clear we talked about sets and that you are allowed to use them on future assignments and if you found a method on them that does what you need, it's probably fine as long as it isn't violating some explicit restriction on that assignment.
  • Using something like a ternary operator in Python. This doesn't make a problem any easier, it's just syntax.

For example, some things that are probably not okay to use:

  • Importing some random library that can solve the problem we ask you to solve in one line.
  • If the problem says "don't use a loop" to solve it, it would not be appropriate to use some advanced programming concept like recursion to "get around" that restriction.

These are not allowed because they might make the problem trivially easy or violate what the learning objective of the problem is.

You should think about what the spec is asking you to do and as long as you are meeting those requirements, we will award credit. If you are concerned that an advanced feature you want to use falls in that second category above and might cost you points, then you should just not use it! These problems are designed to be solvable with the material we have learned so far so it's entirely not necessary to go look up a bunch of advanced material to solve them.

tl;dr; We will not be answering every question of "Can I use X" or "Will I lose points if I use Y" because the general answer is "You are not forbidden from using anything as long as it meets the spec requirements. If you're unsure if it violates a spec restriction, don't use it and just stick to what we learned before the assignment was released."

This assignment is due by Thursday, July 23 at 23:59 (PDT) .

You should submit your finished hw3.py , and hw3-written.txt on Ed .

You may submit your assignment as many times as you want before the late cutoff (remember submitting after the due date will cost late days). Recall on Ed, you submit by pressing the "Mark" button. You are welcome to develop the assignment on Ed or develop locally and then upload to Ed before marking.


Assignment Method: Examples of How Resources Are Allocated


What Is the Assignment Method?

The assignment method is a way of allocating organizational resources in which each resource is assigned to a particular task. The resource could be monetary, personnel , or technological.

Understanding the Assignment Method

The assignment method is used to determine what resources are assigned to which department, machine, or center of operation in the production process. The goal is to assign resources in such a way as to enhance production efficiency, control costs, and maximize profits.

The assignment method has various applications in maximizing resources, including:

  • Allocating the proper number of employees to a machine or task
  • Allocating a machine or a manufacturing plant and the number of jobs that a given machine or factory can produce
  • Assigning a number of salespersons to a given territory or territories
  • Assigning new computers, laptops, and other expensive high-tech devices to the areas that need them the most while lower priority departments would get the older models

Companies can make budgeting decisions using the assignment method since it can help determine the amount of capital or money needed for each area of the company. Allocating money or resources can be done by analyzing the past performance of an employee, project, or department to determine the most efficient approach.

Regardless of the resource being allocated or the task to be accomplished, the goal is to assign resources to maximize the profit produced by the task or project.
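When the cost (or profit) of every resource–task pairing can be estimated, the allocation itself can be computed with the classic assignment (Hungarian) algorithm. The sketch below uses invented numbers purely for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical expected profit (in $000s) of assigning each of 3 salespeople
# to each of 3 territories.
profit = np.array([[90, 76, 75],
                   [35, 85, 55],
                   [125, 95, 90]])

# linear_sum_assignment minimizes total cost, so negate profit to maximize it.
rows, cols = linear_sum_assignment(-profit)
for r, c in zip(rows, cols):
    print(f"Salesperson {r} -> Territory {c} (expected profit {profit[r, c]})")
print("Total expected profit:", profit[rows, cols].sum())
```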

Example of Assignment Method

A bank is allocating its sales force to grow its mortgage lending business. The bank has over 50 branches in New York but only ten in Chicago. Each branch has a staff that is used to bring in new clients.

The bank's management team decides to perform an analysis using the assignment method to determine where its newly hired salespeople should be allocated. Past performance results show that the Chicago area has produced fewer new clients than New York, a consequence of the bank's smaller market presence in Chicago.

As a result, the management decides to allocate the new hires to the New York region, where it has a greater market share to maximize new client growth and, ultimately, revenue.



5.1.1.3 Resource Assignment Methods

You can use shares or allocations to assign resources in an IORM plan.

A share value represents the relative importance of each entity. With share-based resource allocation, a higher share value implies higher priority and more access to the I/O resources. For example, a database with a share value of 2 gets twice the resource allocation of a database with a share value of 1.

Valid share values are 1 to 32, with 1 being the lowest share, and 32 being the highest share. The sum of all share values in a plan cannot be greater than 32768.

Share-based resource allocation is the recommended method for the interdatabase plan ( dbplan ). For the cluster plan ( clusterplan ), share-based resource allocation is the only option.

With allocation-based resource management, an allocation specifies the resource allocation as a percentage (0-100). Each allocation is associated with a level . Valid level values are from 1 to 8, and the sum of allocation values cannot exceed 100 for each level. Resources are allocated to level 1 first, and then remaining resources are allocated to level 2, and so on.

Though not recommended, allocation-based resource management can be used in the interdatabase plan ( dbplan ). For the category plan ( catplan ), allocation-based resource management is the only option.
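As a simple illustration of the share arithmetic (a hypothetical Python sketch, not an Exadata interface), each entity's fraction of contended I/O is, roughly speaking, its share divided by the sum of all shares when every entity is actively issuing I/O:

```python
# Hypothetical share values for three databases in an interdatabase plan.
shares = {'sales': 8, 'finance': 4, 'reporting': 2}

total = sum(shares.values())
for db, share in shares.items():
    # A database with share 8 gets twice the allocation of one with share 4.
    print(f"{db}: {share}/{total} = {share / total:.1%} of contended I/O")
```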


Pedagogy in Action


Assignment Design


To speed your work, check out the Discipline- and method-specific activity collections or browse activities that emphasize teaching with data in a variety of disciplines.

Begin with explicitly articulated student learning outcomes for your course

What are the 4 to 7 key learning goals for this course? And how does teaching with data support students' achievement of those goals? It is easy to become excited by a new pedagogy when the research suggests it can notably enhance student learning. And when we are excited we want to dive right in. But before jumping in to revising assignments or course modules it is important to step back and identify the ways the new activities will connect with the primary objectives for the course.

This is true for at least two reasons. First, syllabi are very tight. Most faculty have never met a colleague who concluded a term saying, "With two weeks left in the term I had covered everything I felt I needed to cover. The rest of the term was just filler." Across all fields, faculty are keenly aware of disciplinary expectations of content coverage for courses. This sense of obligation is particularly sharp in courses which are pre-requisites for other courses in the curriculum. With so much external pressure, any aspect of a course that doesn't align with primary learning goals must be quickly tossed aside, no matter how much excitement surrounded its initial addition to the syllabus.

Second (and more importantly), innovations in teaching should never be done for the sake of "change." The goal is to teach students more effectively. While teaching with data often achieves that end, it is very time intensive (for both the faculty member and students) and so needs to be used deliberately when and where it is aligned with course goals.

Read more about course design

Consider having students work in teams to mitigate skills gaps

Sometimes teams can allow an instructor to work around variations in student experience with tools or methods. If you survey your class at the beginning of the term to find out who is comfortable with what, you can assign students to teams designed to ensure that each team has an "expert" in each tool and/or method required in the assignment.

If you choose to proceed in this direction, make sure to ask yourself whether you want all of the students to end up equally well-prepared with the required methods and tools. Is it okay if the one student who has experience with this instrument collects the data while others look on (and never learn to do it themselves)? If not, then be sure to build in time for peer instruction.

Read more about cooperative learning Read more about peer-led team learning

Consider "scaffolding" your teach-with-data assignment


Scaffolding allows you to correct fundamental errors before they are inserted into the larger, final product. This serves two purposes. First, it helps students organize their work. Research shows that when students are asked to take on new tasks, they may experience regression in previously mastered skills. This predictably follows from having their attention devoted to the new task at hand. Scaffolding helps students see their work as a series of more manageable pieces.

The second purpose for scaffolding is more pragmatic. Teaching-with-data assignments often involve a complex interaction of tasks. (Indeed, that is often the point!) When students make a fundamental mistake in the first step, the resulting final product can be incredibly difficult to grade. While much of what the students did after making the early error "made sense," the final product may be irreparably damaged. The tension between "what follows makes sense given the error" and "the error leads to a ridiculous end point" can be very difficult to resolve.

Scaffolding also allows you to increase the complexity of assignments over the course of the term. For example, if some skills are not consistently taught in prior classes, then you can use assignments early in the term to teach students the methods you want them to use in the final project.

Of course, scaffolding need not be an either/or proposition. You may choose to scaffold assignments at the beginning of the term and then eliminate this guardrail as students gain confidence and competence. The main point is that instructors must equip students with the skills they need before they take on any new task.

Read more about scaffolding and sequencing

Provide instruction for the methods/tools students will need

Sometimes this principle can go without saying because the goal of the teaching-with-data activity is to teach students the new method or tool. But often our goal is to get students to wrestle with the data or the ideas behind the data. The analysis tool is just a vehicle. For example, you may ask students to explore the correlations between several variables using Excel. Despite fluently mastering hundreds of apps on their smartphones, a large fraction of students have never used a spreadsheet. Even a "simple" task like plotting the data in a scatter plot can pose significant challenges. Without instruction, students can waste time that was intended for data exploration.

While students often require instruction, this does not necessarily mean you need to provide it personally during class time. Most information technology departments can provide introductions to software tools. And many such tools have online tutorials which can be assigned to students who lack experience. However, the more specific the tool or method is to your discipline the more likely you will have to teach it yourself. Two areas that commonly require teaching: the use of statistical tools and understanding and creating graphs .

You get what you teach: Provide explicit instruction on data analysis and presentation


Providing examples of high quality work will help some students. Better yet, provide a set of examples that demonstrates a range of quality. But most will need you to "walk them through" those examples to help them see what makes good work good. If you have a grading rubric, strongly consider sharing it with your students. As one colleague of mine said about her early (and bumpy) experience as a scholar, "I wasn't producing C work because I wanted to. It was just that no one had shown me how to produce A work!"

  • The literature on learning shows that students need two or more examples to distinguish surface characteristics from underlying principles--and you probably want them to focus on the principles! See Gick and Holyoak 1983 for the seminal research on the importance of multiple examples.
  • Jane Miller's Chicago Guide to Writing about Numbers provides a great model for showing students a set of examples of varying quality.

Model the behaviors you want your students to adopt

Let your students see you engage data. Sometimes this may be staged. For example, when presenting a lecture you might show students a plot of the raw data and then "work through" the process of analysis in front of them. But don't be afraid to occasionally let them see you take on real, open-ended problems. Many students find it very powerful to see their teacher in the process, generating hypotheses in "real time." While it may feel risky to teach from material for which we don't know the answers, it teaches students the important lesson (particularly in the sciences) that scholarship is not about mastering a canon. Rather, it is about generating and exploring important new questions for which we do not have clear answers.

Either way, as you model the practices you value, be sure to call students' attention to the moves you are making (and that you wish them to copy).

Gick, M. L. and Holyoak, K. J. 1983. "Schema induction and analogical transfer," Cognitive Psychology , 15(1): 1-38.

Miller, J. E. 2004. The Chicago Guide to Writing about Numbers: The Effective Presentation of Quantitative Information. Chicago: University of Chicago Press.


  • Open access
  • Published: 23 July 2024

Outcome risk model development for heterogeneity of treatment effect analyses: a comparison of non-parametric machine learning methods and semi-parametric statistical methods

  • Edward Xu 1 ,
  • Joseph Vanghelof 2 ,
  • Yiyang Wang 1 ,
  • Anisha Patel 1 ,
  • Jacob Furst 1 ,
  • Daniela Stan Raicu 1 ,
  • Johannes Tobias Neumann 4 , 5 ,
  • Rory Wolfe 6 , 7 ,
  • Caroline X. Gao 7 , 8 , 9 ,
  • John J. McNeil 6 ,
  • Raj C. Shah 2 , 3 &
  • Roselyne Tchoua 1  

BMC Medical Research Methodology volume  24 , Article number:  158 ( 2024 ) Cite this article

17 Accesses

1 Altmetric

Metrics details

In randomized clinical trials, treatment effects may vary, and this possibility is referred to as heterogeneity of treatment effect (HTE). One way to quantify HTE is to partition participants into subgroups based on individual’s risk of experiencing an outcome, then measuring treatment effect by subgroup. Given the limited availability of externally validated outcome risk prediction models, internal models (created using the same dataset in which heterogeneity of treatment analyses also will be performed) are commonly developed for subgroup identification. We aim to compare different methods for generating internally developed outcome risk prediction models for subject partitioning in HTE analysis.

Three approaches were selected for generating subgroups for the 2,411 participants from the United States enrolled in the ASPirin in Reducing Events in the Elderly (ASPREE) randomized controlled trial. An extant proportional hazards-based outcome risk prediction model, developed on the overall ASPREE cohort of 19,114 participants, was identified and used to partition the United States participants by risk of experiencing a composite outcome of death, dementia, or persistent physical disability. Next, two supervised non-parametric machine learning outcome classifiers, decision trees and random forests, were used to develop multivariable risk prediction models and partition participants into subgroups with varied risks of experiencing the composite outcome. Then, we assessed how the partitioning from the proportional hazards model compared with the partitions generated by the machine learning models in an HTE analysis of the 5-year absolute risk reduction (ARR) and hazard ratio for aspirin vs. placebo in each subgroup. Cochran's Q test was used to detect whether ARR varied significantly by subgroup.

The proportional hazards model was used to generate 5 subgroups using the quintiles of the estimated risk scores; the decision tree model was used to generate 6 subgroups (6 automatically determined tree leaves); and the random forest model was used to generate 5 subgroups using the quintiles of the prediction probability as risk scores. Using the semi-parametric proportional hazards model, the ARR at 5 years was 15.1% (95% CI 4.0–26.3%) for participants with the highest 20% of predicted risk. Using the random forest model, the ARR at 5 years was 13.7% (95% CI 3.1–24.4%) for participants with the highest 20% of predicted risk. The highest outcome risk group in the decision tree model also exhibited a risk reduction, but the confidence interval was wider (5-year ARR = 17.0%, 95% CI −5.4% to 39.4%). Cochran's Q test indicated that ARR varied significantly only by subgroups created using the proportional hazards model. The hazard ratio for aspirin vs. placebo therapy did not significantly vary by subgroup in any of the models. The highest risk groups for the proportional hazards model and random forest model contained 230 participants each, while the highest risk group in the decision tree model contained 41 participants.

Conclusions

The choice of technique for internally developed models for outcome risk subgroups influences HTE analyses. The rationale for the use of a particular subgroup determination model in HTE analyses needs to be explicitly defined based on desired levels of explainability (with features importance), uncertainty of prediction, chances of overfitting, and assumptions regarding the underlying data structure. Replication of these analyses using data from other mid-size clinical trials may help to establish guidance for selecting an outcomes risk prediction modelling technique for HTE analyses.

Peer Review reports

By design, randomized clinical trials (RCTs) provide information about the average treatment effect for an intervention. Heterogeneity of treatment effect (HTE) refers to the circumstance in which treatment outcomes vary within a population. For example, in an RCT, it may be the case that certain types of participants experience a large decrease in mortality, while the majority experience a modest increase in mortality. In that study, the average treatment effect would indicate a moderate decrease in mortality, but would be poorly representative of the experiences of participants. The discrepancy in treatment effect is a defining characteristic of HTE.

One traditional way to assess HTE in RCTs is to conduct “one-variable-at-a-time” subgroup analyses, evaluating whether treatment effect differs across demographics or baseline risk factors [ 1 ]. This approach has been criticized for its tendency to produce false positives due to the many tests performed, false negatives when subgroups are limited in size, and limited clinical applicability since research participants have many traits that simultaneously influence outcomes [ 2 ].

A proposed alternative to conventional subgroup analysis is to create subgroups based on research participants’ baseline predicted risk of experiencing an event [ 3 ]. In this framework, investigators (1) identify an externally validated predictive risk model for the primary outcome of interest; (2) compute the predicted risk for each participant; (3) partition the participants into subgroups based on the predicted risk; and (4) test for HTE by subgroup. [ 3 ]. This approach offers an opportunity to demonstrate HTE in the circumstance that treatment outcomes are correlated with the baseline risk of experiencing that outcome.

Investigators assessing HTE with this approach may discover there are limited externally derived risk prediction tools applicable to their study population or outcome of interest. In such cases, prediction tools may be developed using data from the cohort they are investigating (i.e., an internally developed outcome risk prediction model) [ 3 ]. For internal models, development and accuracy are dependent on the number of samples and the frequency of the outcome of interest. Such models must be fine-tuned to prevent overfitting. Traditionally, predictive risk models have been developed using logistic regression, a fully parametric approach, or Cox proportional hazard regression, a semi-parametric approach. However, these models require assumptions of the underlying data structure and significant expert clinical knowledge.

More recently, supervised machine learning models, such as random forests and decision trees, have shown utility for the development of predictive risk models. Both approaches are ubiquitous, nonparametric (i.e., they make no assumptions about the data distribution), and explainable predictive models that can provide insights into feature importance with respect to predicted outcomes. For example, decision trees can be translated into human-readable "if-then" rules. These approaches also produce partitions in the data which maximize homogeneity with respect to the predicted outcome. For example, decision trees use a partition-based algorithm which separates subjects into homogeneous subgroups with respect to the outcome. The random forest is an ensemble of decision trees designed to reduce variability by aggregating results from multiple decision trees. Whether supervised learning models perform comparably to semi-parametric models for partitioning participants in HTE modelling has not been fully explored.

The ASPirin in Reducing Events in the Elderly (ASPREE) study (Clinical trial registry number: NCT01038583) was a double-blind, randomized controlled trial that assigned participants to aspirin 100 mg daily or placebo starting in 2010 [ 4 ]. A total of 19,114 study participants were recruited from Australia and the United States (US). Of the total, 2,411 participants were from the US. Participants must have been at least 70 years of age, or at least 65 years of age if African American or Hispanic in the US, and free of diagnoses of cardiovascular disease, dementia, or physical disability. The primary outcome was a composite of death, dementia, or persistent physical disability. We will refer to this as disability-free longevity. The overall finding of the ASPREE study was that daily low-dose aspirin conferred neither benefit nor harm on the primary outcome (HR = 1.01 95% CI: 0.92–1.11 p  = 0.79) [ 5 ]. A conventional subgroup analysis was conducted for 12 pre-specified measures. Treatment benefit had statistically significant variation by frailty, but by none of the other measures [ 5 ].

After publication of the main ASPREE findings described above, a semi-parametric model for risk prediction was developed with the overall study data and published [ 6 ]. This work provided an opportunity to examine the properties of the predictive risk model in HTE analyses to use as a standard for determining if partition-based supervised machine learning models (decision trees and random forests) yielded comparable HTE conclusions on the absolute and relative scales. We were interested in partition model performance using a medium sized dataset reflecting a typical clinical trial [ 7 ].

To conduct the comparative analyses of outcome risk models, we divided our process into four steps: (1) data preparation; (2) models for generating subgroups; (3) assessment of model predictive ability; and (4) model performance in heterogeneity of treatment effect analyses.

Data preparation

As shown in Fig.  1 , US participants who did not have any missing features were selected from the ASPREE dataset. The entire dataset was used for the analyses for the extant, semi-parametric, proportional hazards predictive risk model. The cohort was then split 50%/50% into two sets: (1) a training and validation set, which was used to develop the machine learning models; and (2) a testing set, used for assessing model performance.

We used a stratified sampling approach to ensure the sets retained a similar ratio of the composite outcome. In ASPREE, only about 10% of participants experienced the outcome by the end of the study. Machine learning techniques tend to learn more about the outcome type for which they have more examples. This can result in models which have poor sensitivity for underrepresented outcome types yet exhibit high overall accuracy. To address this potential for biased learning, we created an augmented training and validation set by randomly oversampling participants who experienced the outcome with replacement until the count matched that of the participants with disability free longevity. The test set was not altered and was representative of the original participant population (10% who had the composite outcome).
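A sketch of this preparation step using scikit-learn conventions follows; the variable and column names are placeholders, not those of the ASPREE dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# `df` is assumed to hold the US participants with no missing features,
# with a binary column 'outcome' for the composite endpoint.
def prepare_sets(df):
    # 50/50 split, stratified so both sets keep ~10% outcome prevalence.
    train_val, test = train_test_split(
        df, test_size=0.5, stratify=df['outcome'], random_state=0)

    # Randomly oversample participants who experienced the outcome, with
    # replacement, until the count matches those with disability-free longevity.
    events = train_val[train_val['outcome'] == 1]
    non_events = train_val[train_val['outcome'] == 0]
    events_up = resample(events, replace=True,
                         n_samples=len(non_events), random_state=0)
    augmented_train_val = pd.concat([non_events, events_up])

    return augmented_train_val, test
```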

Models for generating subgroups

Three approaches for generating subgroups were selected: (1) a proportional hazards model; (2) a decision tree model; and (3) a random forest model. The outcome for all models was a composite of experiencing death, dementia, or persistent physical disability. The proportional hazards model accounted for time to the event and censoring, while the machine learning models accounted solely for whether the event occurred or not.

Figure 1. Participant flow diagram and methodology overview

Extant semi-parametric proportional hazards predictive risk model:

A literature search was conducted to identify published models predicting the primary composite outcome or individual components. Neumann et al. used Cox proportional hazard regression to predict the 5-year risk of the primary composite endpoint in ASPREE [ 6 ]. Proportional hazards regressions are semi-parametric, time-to-event models: a non-parametric component specifies a baseline hazard function, and a parametric portion specifies how the log of the hazard function varies linearly with the covariates. The authors selected 24 baseline measures as candidate predictors in their analysis [ 6 ], indicated in Appendix 1 . The candidate features were used to create two models, one for men and one for women (Appendix 2 ). To create subgroups for assessing HTE, the sex-specific models were used to generate a risk prediction score for each US ASPREE participant with non-missing data. Then, participants were stratified into subgroups by risk quintile, with group 1 containing the fifth with lowest predicted risk, and group 5 containing the fifth with highest predicted risk.
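The quintile-based subgrouping itself is straightforward; a sketch under the assumption that the per-participant risk scores are held in a pandas Series (random numbers below, purely for illustration) might look like:

```python
import numpy as np
import pandas as pd

# Placeholder risk scores for 2,411 participants (random, for illustration only).
risk_score = pd.Series(np.random.rand(2411))

# Group 1 = lowest-risk fifth, ..., Group 5 = highest-risk fifth.
subgroup = pd.qcut(risk_score, q=5, labels=[1, 2, 3, 4, 5])
print(subgroup.value_counts().sort_index())
```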

Supervised non-parametric machine learning outcome classifiers

While supervised models are classically used to predict outcomes, we used them for subgrouping, based on outcome. As such, while we tuned our models to prevent overfitting, we focused on generating stable subgroups rather than sensitivity analysis and optimization of accuracy. As shown in Appendix 1 , the machine learning models were developed using a total of 26 baseline measures, 21 overlapping with the proportional hazard model [ 6 ], and an additional 5 which were prespecified in the statistical analysis plan [ 8 ]; two of which had similar properties to measures used in the proportional hazard model.

Decision tree classification:

We trained classification trees on 30 bootstraps of the augmented training and validation set (one on each bootstrap) to predict the primary composite outcome and provide confidence intervals. To prevent overfitting of the trees and check the stability of the results, we tuned basic parameters, such as the tree depth and the minimum number of data points per leaf, using cross-validation, and determined that a maximum of 6 leaves was optimal. In other words, a typical decision tree model for this method has 6 terminal nodes representing 6 groups in the data. We then selected the decision tree with median test accuracy as our representative model to partition the set-aside test data into 6 leaves with different distributions of outcome, creating subgroups for assessing HTE. Decision tree analysis was performed using the rpart library in R [ 9 ].

Random Forest classification:

We trained a random forest classifier using 30 bootstraps of the augmented training and validation data to predict the primary composite outcome and obtain confidence intervals. Random forests, an ensemble model, are designed to reduce overfitting in the decision tree algorithm while maintaining its advantages [ 10 ]. We tuned the parameters using 10-fold cross-validation to prevent overfitting and check the stability of the results. The random forest models used 100 decision trees as base classifiers, with each tree pruned to a maximum of 10 terminal nodes. The algorithm classified an instance by a majority vote across the classification outputs of the individual decision trees, with votes weighted by the individual predicted probabilities of the positive class before being aggregated. Unlike a standalone decision tree, a random forest does not yield a single set of leaf-node groupings that can be used directly to identify subgroups. Therefore, using the classification probabilities (i.e., the probability of reaching an endpoint vs. not) as risk scores, participants were stratified into subgroups by risk quintile, with group 1 containing the fifth with the lowest predicted risk and group 5 containing the fifth with the highest predicted risk. Random forests were trained using the randomForest library in R [ 11 ]; a brief sketch follows.
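Again as a hedged sketch under the same assumed variable names (not the study's code), the probability-based quintile grouping with randomForest might look like this:

```r
# Illustrative sketch; variable names and settings are assumptions.
library(randomForest)

augmented$composite_outcome <- factor(augmented$composite_outcome)

rf <- randomForest(composite_outcome ~ ., data = augmented,
                   ntree = 100, maxnodes = 10)

risk <- predict(rf, newdata = test_set, type = "prob")[, "1"]   # P(composite outcome)
test_set$rf_subgroup <- cut(risk,
                            breaks = quantile(risk, probs = seq(0, 1, 0.2)),
                            include.lowest = TRUE,
                            labels = 1:5)   # group 1 = lowest fifth, group 5 = highest fifth
```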

Assessment of model predictive ability

We used the proportional hazards model to predict the risk of the composite outcome for US ASPREE participants in the testing set. The accuracy, sensitivity, specificity, and positive predictive value were computed at a risk prediction threshold of 50%. The area under the receiver operating characteristic curve (AUC ROC) was computed as the time-dependent AUC at 5 years after randomization in SAS 9.4 TS1M6 using proc phreg, participants' predicted risk probabilities, and the nearest neighbor method. This procedure was repeated for the decision tree and random forest models. Calibration was assessed by comparing the mean predicted risk in each subgroup to the observed event rate in that subgroup. No formal tests were conducted to assess whether significant differences existed between models. These metrics were used to assess the reliability of the subgroups generated by the models but were not, in themselves, indicative of a model's ability to reveal HTE. A sketch of the threshold-based metrics appears below.
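The threshold-based metrics reduce to a 2x2 confusion matrix at a 50% cutoff. A minimal base-R sketch, assuming a vector of predicted probabilities `risk` (from any of the three models) and observed outcomes in the hypothetical `test_set`; the time-dependent AUC at 5 years was computed in SAS and is not reproduced here.

```r
# Confusion-matrix metrics at a 50% risk threshold; names are assumptions.
pred <- as.integer(risk >= 0.5)
obs  <- as.integer(as.character(test_set$composite_outcome))

tp <- sum(pred == 1 & obs == 1); fn <- sum(pred == 0 & obs == 1)
fp <- sum(pred == 1 & obs == 0); tn <- sum(pred == 0 & obs == 0)

accuracy    <- (tp + tn) / length(obs)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)   # positive predictive value
```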

Model performance in heterogeneity of treatment effect analyses

We assessed HTE on the absolute scale by computing the 5-year absolute risk reduction imparted by aspirin. Starting with the groups developed with the extant proportional hazards model, disability-free longevity at 5 years was computed using the Kaplan–Meier estimator for each combination of treatment and subgroup assignment. The 5-year event rate was then calculated as one minus the 5-year disability-free longevity rate. Last, the 5-year absolute risk reduction (ARR) was computed as the event rate in the group assigned to placebo minus the event rate in the group assigned to aspirin therapy, with the 95% confidence interval computed as defined in the equation in Appendix 3 . A meta-analysis was conducted to identify whether ARR varied by subgroup, and Cochran's Q-test was interpreted to determine whether significant HTE was detected on the absolute scale. We assessed HTE on the relative scale by computing the hazard ratio for aspirin therapy in each subgroup; the Wald chi-squared test for the interaction of subgroup and treatment assignment was interpreted to determine whether significant HTE was detected on the relative scale. This procedure was repeated for the decision tree and random forest models. A brief sketch of these steps appears below.
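A hedged sketch of the two scales follows, assuming hypothetical columns `time_years`, `event`, `treatment` (aspirin vs. placebo), and a `subgroup` assignment in the test set; this is not the study's code, and the interaction test shown via anova() is a likelihood-ratio approximation to the Wald test reported here.

```r
# Illustrative sketch; column names and time units are assumptions.
library(survival)

# Absolute scale: 5-year event rate (1 - Kaplan-Meier survival) per treatment x subgroup
km    <- survfit(Surv(time_years, event) ~ treatment + subgroup, data = test_set)
at5   <- summary(km, times = 5, extend = TRUE)
rates <- data.frame(stratum = at5$strata, event_rate_5y = 1 - at5$surv)
# ARR per subgroup = event rate on placebo minus event rate on aspirin

# Relative scale: hazard ratio for aspirin and the treatment-by-subgroup interaction
cox_int <- coxph(Surv(time_years, event) ~ treatment * factor(subgroup), data = test_set)
anova(cox_int)   # test of the interaction term (the study reports a Wald chi-squared test)
```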

Participants

In total, 19,114 participants enrolled in the ASPREE study, 2,411 of whom were from the United States. A participant flow diagram is shown in Fig.  1 . After excluding 120 participants due to missing data, 2,291 participants were analyzed: 1,141 were assigned to the Training & Validation set and 1,150 to the Testing set. The baseline characteristics of the final study population by treatment group are shown in Table  1 .

Model predictive ability

Model accuracy, sensitivity, specificity, positive predictive value, and area under the curve (AUC) are displayed in Table  2 , and receiver operating characteristic (ROC) curves in Appendix 4 . The accuracy and AUC of the proportional hazards model were 0.89 and 0.674, respectively. The decision tree model had a lower accuracy but a similar AUC (0.69, 0.672). The random forest model had an accuracy similar to the proportional hazards model but a higher AUC (0.88, 0.732). The sensitivity of the proportional hazards model was 0.12. Sensitivity was much greater in the decision tree model (0.64) but, in the random forest model (0.15), again similar to the proportional hazards model. The positive predictive value of the proportional hazards model was 0.44; it was much lower in the decision tree model (0.20) and again similar to the proportional hazards model in the random forest model (0.36). The predicted and observed risks were most similar in the proportional hazards model; in the decision tree and random forest models, the predicted risk was much greater than the observed values, as shown in Appendix 5 .

Significant HTE was detected on the absolute scale in the proportional hazards model ( p  = 0.033), Appendix 6 ; the findings are shown graphically in Fig.  2 . Using the proportional hazards model, participants in group 5 (the fifth with the highest predicted risk) experienced significantly fewer events on aspirin therapy compared to placebo (ARR = 15.1%; 95% CI 4.0–26.3%). Using the decision tree model, every subgroup had an absolute risk difference whose 95% confidence interval included zero. Similar to the proportional hazards model, when using the random forest classifier, participants in group 5 experienced fewer events on the absolute scale when assigned to aspirin therapy compared to placebo (ARR = 13.7%; 95% CI 3.1–24.4%); however, the difference across groups was not significant ( p  = 0.085). The numbers needed to treat for group 5 in the proportional hazards and random forest models were 6.6 (3.8 to 25.1) and 7.3 (4.1 to 32.5), respectively. None of the models exhibited HTE on the relative scale (p-values for the Wald chi-squared subgroup-by-treatment interaction test ranged from 0.28 to 0.72), Appendix 6 .

Fig. 2 Absolute risk difference at 5 years by model; values reflect data in Appendix 6 . *Each proportional hazards group and each random forest group contained 230 participants. The decision tree model contained 56 participants in group 1, 41 in group 2, 668 in group 3, 100 in group 4, 244 in group 5, and 41 in group 6

We investigated non-parametric approaches (supervised machine learning models), as compared to a standard semi-parametric approach, for creating subgroups, and then compared how the resulting subgroups performed in HTE analyses. Although externally developed outcome risk models (models developed independently of the cohort to which they will be applied) are preferred in HTE analyses, internally developed prediction models are appropriate when high-quality external models are not readily available [ 3 ]. To the best of our knowledge, non-parametric machine learning approaches have not been compared to the more widely utilized Cox proportional hazards model for stratifying risk to discover potential treatment heterogeneity. To permit a more equal comparison between techniques, we limited our machine learning approaches to candidate factors that had been commonly used in previous assessments of ASPREE (the measures indicated in Appendix 1 ). Our modeling implementations included participants in both treatment arms, not just the control arm, for model development, as recommended [ 3 ].

Our internally developed, non-parametric random forest algorithm performed similarly to a previously developed proportional hazards-based model in terms of both outcome discrimination and HTE identification. For both models, participants in the group with the highest predicted risk experienced fewer events on the absolute scale when treated with aspirin compared to placebo; however, the overall difference across groups was significant only for the proportional hazards model. Although confidence intervals were wide, at least in part a consequence of the limited number of participants in the subgroups, the point estimate for the absolute risk reduction was also greater in participants with a higher predicted risk by the decision tree model.

Supervised machine learning models offer benefits over proportional hazards models as well as limitations. First, the supervised machine learning models subgroup based on baseline data alone, whereas the proportional hazards model also takes time into consideration, which adds complexity and an additional dynamic variable. Second, the supervised machine learning models make no assumptions about the distribution of the data, while the semi-parametric proportional hazards model does. Third, less data pre-processing is required for the supervised machine learning models than for the proportional hazards model. Fourth, supervised machine learning models provide a ranking of variables based on their ability to discriminate between research participants with and without the outcome, whereas coefficient magnitude is used as a proxy for feature importance in semi-parametric proportional hazards models. Fifth, supervised machine learning models avoid a potential source of data leakage in HTE analyses: when a proportional hazards model is used to generate the subgroups, two Cox proportional hazards models with the same outcome end up being applied to the same data, a situation the supervised machine learning models do not create. However, the supervised machine learning models used in these analyses predict only the occurrence of the outcome, while the proportional hazards model predicts time-to-event.

The choice between the supervised learning models (random forest vs. decision tree) also involves benefits and limitations. The random forest is designed to decrease the variability of the decision tree and provide more stable predictions. Decision tree models are the most explainable, as they can be translated directly into human-understandable rules: a decision tree follows an "if-then" format in which conditions on variables are evaluated in sequence to determine the final prediction.

A limitation of the study is that the selected supervised machine learning models did not account for time-to-event information; models such as survival trees and random survival forests can account for time-to-event data and censoring more directly [ 12 , 13 ]. Other families of machine learning approaches, such as causal forests, have been proposed to identify individualized treatment rules, although recent work has shown poor agreement between them [ 14 ]. In addition, the supervised learning models did not take censoring into account. However, we chose to first examine the more ubiquitous and well-understood methods of random forests and decision trees. Strengths of this study include describing a process for comparing different outcome risk model methods for generating subgroups for HTE analyses. The key learning point of our work is that the choice of outcome risk modelling to generate subgroups for HTE analyses is a balance of trade-offs that must be explicitly stated in the methods section of a manuscript. As other options for outcome risk modelling become available, they will have to be compared and contrasted with existing methods to better appreciate the trade-offs. In addition, we highlight that utilizing multiple methods of outcome risk modelling may be beneficial in determining the robustness of HTE analysis results.

Potential future work includes comparing the feature importance characteristics of outcome prediction models for HTE analyses, as this effort could identify mechanistic pathways that explain HTE analysis findings. Such work could generate further hypotheses for the tailored application of health interventions. Confirmation of these findings regarding the trade-offs of outcome risk model choices in HTE analyses in other mid-sized clinical trial datasets is also needed.

This study evaluated non-parametric machine learning models as risk predictors for HTE subgrouping, with a previously developed proportional hazards model as a comparator. Non-parametric, partition-based machine learning methods can generate internal subgroups for HTE analysis that exhibit performance similar to conventional regression-based approaches. Supervised machine learning models may be promising contenders for internally developed subgroup models when compared with a traditionally used, risk-based, semi-parametric model: they may produce comparable groupings based on outcome risk with less training data, fewer variables (omitting time and self-selecting important features), and fewer assumptions about the underlying structure of the data.

Data availability

For access to the ASPirin in Reducing Events in the Elderly (ASPREE) project data, visit ams.aspree.org. Code for this project and final hyperparameters are available at: https://anonymous.4open.science/r/P428_HTE-781 A/README.md.

Angus DC, Chang CCH. Heterogeneity of treatment effect: estimating how the effects of interventions vary across individuals. JAMA. 2021;326(22):2312–3.

Burke JF, Sussman JB, Kent DM, Hayward RA. Three simple rules to ensure reasonably credible subgroup analyses. BMJ. 2015;351. https://doi.org/10.1136/bmj.h5651 . PMID: 26537915; PMCID: PMC4632208.

Kent DM, Paulus JK, Van Klaveren D, D’Agostino R, Goodman S, Hayward R, Ioannidis JP, Steyerberg EW, et al. The predictive approaches to treatment effect heterogeneity (PATH) statement. Ann Intern Med. 2020;172(1):35–45. https://doi.org/10.7326/M18-3667 . PMID: 31711134; PMCID: PMC7531587.

ASPREE Investigator Group. Study design of ASPirin in reducing events in the Elderly (ASPREE): a randomized, controlled trial. Contemp Clin Trials. 2013;36(2):555–64. https://doi.org/10.1016/j.cct.2013.09.014 .

McNeil JJ, Woods RL, Nelson MR, et al. Effect of aspirin on disability-free survival in the healthy Elderly. N Engl J Med. 2018;379(16):1499–508. https://doi.org/10.1056/NEJMoa1800722 .

Neumann JT, Thao LTP, Murray AM, et al. Prediction of disability-free survival in healthy older people. GeroScience. 2022;44(3):1641–55. https://doi.org/10.1007/s11357-022-00547-x .

Gresham G, Meinert JL, Gresham AG, Meinert CL. Assessment of trends in the design, accrual, and completion of trials registered in ClinicalTrials.gov by sponsor type, 2000–2019. JAMA Netw Open. 2020;3(8):e2014682.

Wolfe R, Murray AM, Woods RL, Kirpach B, Gilbertson D, Shah RC, Nelson MR, Reid CM, Ernst ME, Lockery J, Donnan GA, Williamson J, McNeil JJ. The aspirin in reducing events in the elderly trial: statistical analysis plan. Int J Stroke. 2018;13(3):335–8.

Therneau T, Atkinson B, Ripley B, Ripley MB. Package 'rpart'. 2015. Available online: cran.ma.ic.ac.uk/web/packages/rpart/rpart.pdf (accessed on 20 April 2016).

Breiman L. Random forests. Mach Learn. 2001;45:5–32.

Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2(3):18–22. https://CRAN.R-project.org/doc/Rnews/ .

Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Annals Appl Stat. 2008;2(3):841–860. https://doi.org/10.1214/08-AOAS169 .

LeBlanc M, Crowley J. Relative risk trees for censored Survival Data. Biometrics. 1992;48(2):411–25. https://doi.org/10.2307/2532300 .

Bouvier F, Peyrot E, Balendran A, Ségalas C, Roberts I, Petit F, Porcher R. Do machine learning methods lead to similar individualized treatment rules? A comparison study on real data. Stat Med. 2024. https://doi.org/10.1002/sim.10059. Epub ahead of print. PMID: 38472745.

Acknowledgements

The authors recognize the significant contributions made by the research participants, staff, and investigators for the ASPirin in Reducing Events in the Elderly clinical trial.

Support : This work was funded by the NIH (NIA U19AG062682, UL1TR002389). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Author information

Authors and Affiliations

Jarvis College of Computing and Digital Media, DePaul University, Chicago, IL, United States of America

Edward Xu, Yiyang Wang, Anisha Patel, Jacob Furst, Daniela Stan Raicu & Roselyne Tchoua

Rush Alzheimer’s Disease Center, Rush University Medical Center, Chicago, IL, United States of America

Joseph Vanghelof & Raj C. Shah

Department of Family & Preventive Medicine, Rush University Medical Center, Chicago, IL, United States of America

Raj C. Shah

Department of Cardiology, University Heart & Vascular Centre Hamburg, Hamburg, Germany

Johannes Tobias Neumann

German Centre for Cardiovascular Research (DZHK), Partner Site Hamburg/Kiel/Lübeck, Hamburg, Germany

Department of Epidemiology and Preventive Medicine, School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC, Australia

Rory Wolfe & John J. McNeil

Monash University Clinical Trials Centre, Monash University, Melbourne, VIC, Australia

Rory Wolfe & Caroline X. Gao

Centre for Youth Mental Health, University of Melbourne, Parkview, VIC, Australia

Caroline X. Gao

Orygen, Parkview, VIC, Australia

Contributions

EX, JV, YW, AP, JF, DSR, RCS, and RT all participated in the design of the project and the critical writing and editing of the manuscript. JTN, RW, CXG, and JJM all participated in the critical writing and editing of the manuscript. EX, JV, YW, and AP also participated in the data analyses.

Corresponding author

Correspondence to Roselyne Tchoua.

Ethics declarations

Ethics approval and consent to participate.

The ASPREE-XT study which included analyses of data from the ASPREE clinical trial was approved by the University of Iowa Institutional Review Board. Informed consent was obtained for all participants.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Xu, E., Vanghelof, J., Wang, Y. et al. Outcome risk model development for heterogeneity of treatment effect analyses: a comparison of non-parametric machine learning methods and semi-parametric statistical methods. BMC Med Res Methodol 24 , 158 (2024). https://doi.org/10.1186/s12874-024-02265-8

Received : 15 December 2023

Accepted : 14 June 2024

Published : 23 July 2024

DOI : https://doi.org/10.1186/s12874-024-02265-8


Keywords

  • Heterogeneity of treatment effect
  • Random forest
  • Decision tree
  • Outcome risk modelling
  • Disability-free longevity
  • Clinical trial

Quantitative Data Analysis: A Comprehensive Guide

By: Ofem Eteng | Published: May 18, 2022


A healthcare giant successfully introduces the most effective drug dosage through rigorous statistical modeling, saving countless lives. A marketing team predicts consumer trends with uncanny accuracy, tailoring campaigns for maximum impact.

These trends and dosages are not just any numbers but are a result of meticulous quantitative data analysis. Quantitative data analysis offers a robust framework for understanding complex phenomena, evaluating hypotheses, and predicting future outcomes.

In this blog, we’ll walk through the concept of quantitative data analysis, the steps required, its advantages, and the methods and techniques that are used in this analysis. Read on!

What is Quantitative Data Analysis?

Quantitative data analysis is a systematic process of examining, interpreting, and drawing meaningful conclusions from numerical data. It involves the application of statistical methods, mathematical models, and computational techniques to understand patterns, relationships, and trends within datasets.

Quantitative data analysis methods typically work with algorithms, mathematical analysis tools, and software to gain insights from the data, answering questions such as how many, how often, and how much. Data for quantitative data analysis is usually collected from closed-ended surveys, questionnaires, and polls. It can also be obtained from sales figures, email click-through rates, website visitor counts, and percentage revenue increases.

Quantitative Data Analysis vs Qualitative Data Analysis

When we talk about data, we usually think about patterns, relationships, and connections between datasets, in short, about analyzing the data. Data analysis therefore falls broadly into two types: quantitative data analysis and qualitative data analysis.

Quantitative data analysis revolves around numerical data and statistics, which are suitable for functions that can be counted or measured. In contrast, qualitative data analysis includes description and subjective information – for things that can be observed but not measured.

Let us differentiate between quantitative data analysis and qualitative data analysis for a better understanding.

  • Data: Quantitative uses numerical data (statistics, counts, metrics, measurements); qualitative uses text data (customer feedback, opinions, documents, notes, audio/video recordings).
  • Collection: Quantitative relies on closed-ended surveys, polls, and experiments; qualitative relies on open-ended questions and descriptive interviews.
  • Questions answered: Quantitative answers what? how much? and, to a certain extent, why?; qualitative answers how? why? and what are individual experiences and motivations?
  • Tools: Quantitative uses statistical programming software such as R, Python, and SAS and data visualization tools such as Tableau and Power BI; qualitative uses NVivo or Atlas.ti for coding, plus word processors, highlighters, mind maps, and visual canvases.
  • Best used for: Quantitative suits large sample sizes and quick answers; qualitative suits small to mid-sized samples and descriptive insights.

Data Preparation Steps for Quantitative Data Analysis

Quantitative data has to be gathered and cleaned before it can be analyzed. Below are the steps to prepare data for quantitative analysis:

  • Step 1: Data Collection

Before beginning the analysis process, you need data. Data can be collected through rigorous quantitative research, which includes methods such as interviews, focus groups, surveys, and questionnaires.

  • Step 2: Data Cleaning

Once the data is collected, begin the data cleaning process by scanning the entire dataset for duplicates, errors, and omissions. Keep a close eye out for outliers (data points that differ significantly from the majority of the dataset), because they can skew your analysis results if they are not removed.

This data-cleaning process ensures data accuracy, consistency and relevancy before analysis.

  • Step 3: Data Analysis and Interpretation

Now that you have collected and cleaned your data, it is now time to carry out the quantitative analysis. There are two methods of quantitative data analysis, which we will discuss in the next section.

However, if you have data from multiple sources, collecting and cleaning it can be a cumbersome task. This is where Hevo Data steps in. With Hevo, extracting, transforming, and loading data from source to destination becomes a seamless task, eliminating the need for manual coding. This not only saves valuable time but also enhances the overall efficiency of data analysis and visualization, empowering users to derive insights quickly and with precision

Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integration with 150+ Data Sources (40+ free sources), we help you not only export data from sources & load data to the destinations but also transform & enrich your data, & make it analysis-ready.

Now that you are familiar with what quantitative data analysis is and how to prepare your data for analysis, the focus will shift to the purpose of this article, which is to describe the methods and techniques of quantitative data analysis.

Methods and Techniques of Quantitative Data Analysis

Broadly, quantitative data analysis employs two techniques to extract meaningful insights from datasets. The first is descriptive statistics, which summarizes and portrays essential features of a dataset, such as the mean, median, and standard deviation.

Inferential statistics, the second method, extrapolates insights and predictions from a sample dataset to make broader inferences about an entire population, such as hypothesis testing and regression analysis.

An in-depth explanation of both the methods is provided below:

  • Descriptive Statistics
  • Inferential Statistics

1) Descriptive Statistics

Descriptive statistics, as the name implies, is used to describe a dataset. It helps you understand the details of your data by summarizing it and finding patterns in the specific data sample. Descriptive statistics provide absolute numbers obtained from a sample but do not necessarily explain the rationale behind those numbers, and they are mostly used for analyzing single variables. The methods used in descriptive statistics include the following (a short R illustration follows the list):

  • Mean:   This calculates the numerical average of a set of values.
  • Median: This is used to get the midpoint of a set of values when the numbers are arranged in numerical order.
  • Mode: This is used to find the most commonly occurring value in a dataset.
  • Percentage: This is used to express how a value or group of respondents within the data relates to a larger group of respondents.
  • Frequency: This indicates the number of times a value is found.
  • Range: This shows the spread between the highest and lowest values in a dataset.
  • Standard Deviation: This is used to indicate how dispersed a range of numbers is, meaning, it shows how close all the numbers are to the mean.
  • Skewness: It indicates how symmetrical a range of numbers is, showing if they cluster into a smooth bell curve shape in the middle of the graph or if they skew towards the left or right.
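As a quick, self-contained illustration of these measures (using a made-up vector rather than any dataset referenced above), most of them are one-liners in R:

```r
# Descriptive statistics on a small made-up sample
x <- c(12, 15, 15, 18, 22, 22, 22, 30, 41)

mean(x)                          # numerical average
median(x)                        # midpoint of the ordered values
names(which.max(table(x)))       # mode: the most frequently occurring value
table(x)                         # frequency of each value
table(x) / length(x) * 100       # percentage of the sample at each value
range(x)                         # lowest and highest values
sd(x)                            # standard deviation (dispersion around the mean)
# Skewness needs an add-on package, e.g. e1071::skewness(x)
```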

2) Inferential Statistics

In quantitative analysis, the goal is to turn raw numbers into meaningful insight. Descriptive statistics explains the details of a specific dataset using numbers, but it does not explain the motives behind those numbers; hence the need for further analysis using inferential statistics.

Inferential statistics aim to make predictions or highlight possible outcomes from the analyzed data obtained from descriptive statistics. They are used to generalize results and make predictions between groups, show relationships that exist between multiple variables, and are used for hypothesis testing that predicts changes or differences.

There are various statistical analysis methods used within inferential statistics; a few are discussed below, followed by a short R illustration.

  • Cross Tabulations: Cross tabulation or crosstab is used to show the relationship that exists between two variables and is often used to compare results by demographic groups. It uses a basic tabular form to draw inferences between different data sets and contains data that is mutually exclusive or has some connection with each other. Crosstabs help understand the nuances of a dataset and factors that may influence a data point.
  • Regression Analysis: Regression analysis estimates the relationship between a set of variables. It shows the correlation between a dependent variable (the variable or outcome you want to measure or predict) and any number of independent variables (factors that may impact the dependent variable). Therefore, the purpose of the regression analysis is to estimate how one or more variables might affect a dependent variable to identify trends and patterns to make predictions and forecast possible future trends. There are many types of regression analysis, and the model you choose will be determined by the type of data you have for the dependent variable. The types of regression analysis include linear regression, non-linear regression, binary logistic regression, etc.
  • Monte Carlo Simulation: Monte Carlo simulation, also known as the Monte Carlo method, is a computerized technique of generating models of possible outcomes and showing their probability distributions. It considers a range of possible outcomes and then tries to calculate how likely each outcome will occur. Data analysts use it to perform advanced risk analyses to help forecast future events and make decisions accordingly.
  • Analysis of Variance (ANOVA): This is used to test the extent to which two or more groups differ from each other. It compares the mean of various groups and allows the analysis of multiple groups.
  • Factor Analysis:   A large number of variables can be reduced into a smaller number of factors using the factor analysis technique. It works on the principle that multiple separate observable variables correlate with each other because they are all associated with an underlying construct. It helps in reducing large datasets into smaller, more manageable samples.
  • Cohort Analysis: Cohort analysis can be defined as a subset of behavioral analytics that operates from data taken from a given dataset. Rather than looking at all users as one unit, cohort analysis breaks down data into related groups for analysis, where these groups or cohorts usually have common characteristics or similarities within a defined period.
  • MaxDiff Analysis: This is a quantitative data analysis method that is used to gauge customers’ preferences for purchase and what parameters rank higher than the others in the process. 
  • Cluster Analysis: Cluster analysis is a technique used to identify structures within a dataset. Cluster analysis aims to be able to sort different data points into groups that are internally similar and externally different; that is, data points within a cluster will look like each other and different from data points in other clusters.
  • Time Series Analysis: This is a statistical analytic technique used to identify trends and cycles over time. It is simply the measurement of the same variables at different times, like weekly and monthly email sign-ups, to uncover trends, seasonality, and cyclic patterns. By doing this, the data analyst can forecast how variables of interest may fluctuate in the future. 
  • SWOT Analysis: This is a quantitative data analysis method that assigns numerical values to the strengths, weaknesses, opportunities, and threats of an organization, product, or service, giving a clearer picture of the competition and supporting better business strategies.
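As a brief, generic illustration of a few of the techniques above (regression, ANOVA, and cross tabulation), using R's built-in mtcars dataset rather than any data discussed in this article:

```r
# Inferential techniques illustrated on R's built-in mtcars data
data(mtcars)

# Regression analysis: does vehicle weight (independent) predict fuel efficiency (dependent)?
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)                                             # coefficients, p-values, R-squared

# Analysis of variance (ANOVA): does mean fuel efficiency differ across cylinder counts?
summary(aov(mpg ~ factor(cyl), data = mtcars))           # F test across the groups

# Cross tabulation: transmission type by cylinder count
table(transmission = mtcars$am, cylinders = mtcars$cyl)
```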

How to Choose the Right Method for your Analysis?

Choosing between descriptive statistics and inferential statistics can often be confusing. Consider the following factors before choosing the right method for your quantitative data analysis:

1. Type of Data

The first consideration in data analysis is understanding the type of data you have. Different statistical methods have specific requirements based on these data types, and using the wrong method can render results meaningless. The choice of statistical method should align with the nature and distribution of your data to ensure meaningful and accurate analysis.

2. Your Research Questions

When deciding on statistical methods, it’s crucial to align them with your specific research questions and hypotheses. The nature of your questions will influence whether descriptive statistics alone, which reveal sample attributes, are sufficient or if you need both descriptive and inferential statistics to understand group differences or relationships between variables and make population inferences.

Pros and Cons of Quantitative Data Analysis

Pros

1. Objectivity and Generalizability:

  • Quantitative data analysis offers objective, numerical measurements, minimizing bias and personal interpretation.
  • Results can often be generalized to larger populations, making them applicable to broader contexts.

Example: A study using quantitative data analysis to measure student test scores can objectively compare performance across different schools and demographics, leading to generalizable insights about educational strategies.

2. Precision and Efficiency:

  • Statistical methods provide precise numerical results, allowing for accurate comparisons and prediction.
  • Large datasets can be analyzed efficiently with the help of computer software, saving time and resources.

Example: A marketing team can use quantitative data analysis to precisely track click-through rates and conversion rates on different ad campaigns, quickly identifying the most effective strategies for maximizing customer engagement.

3. Identification of Patterns and Relationships:

  • Statistical techniques reveal hidden patterns and relationships between variables that might not be apparent through observation alone.
  • This can lead to new insights and understanding of complex phenomena.

Example: A medical researcher can use quantitative analysis to pinpoint correlations between lifestyle factors and disease risk, aiding in the development of prevention strategies.

Cons

1. Limited Scope:

  • Quantitative analysis focuses on quantifiable aspects of a phenomenon, potentially overlooking important qualitative nuances, such as emotions, motivations, or cultural contexts.

Example: A survey measuring customer satisfaction with numerical ratings might miss key insights about the underlying reasons for their satisfaction or dissatisfaction, which could be better captured through open-ended feedback.

2. Oversimplification:

  • Reducing complex phenomena to numerical data can lead to oversimplification and a loss of richness in understanding.

Example: Analyzing employee productivity solely through quantitative metrics like hours worked or tasks completed might not account for factors like creativity, collaboration, or problem-solving skills, which are crucial for overall performance.

3. Potential for Misinterpretation:

  • Statistical results can be misinterpreted if not analyzed carefully and with appropriate expertise.
  • The choice of statistical methods and assumptions can significantly influence results.

This blog discusses the steps, methods, and techniques of quantitative data analysis. It also gives insights into the methods of data collection, the type of data one should work with, and the pros and cons of such analysis.

Gain a better understanding of data analysis with these essential reads:

  • Data Analysis and Modeling: 4 Critical Differences
  • Exploratory Data Analysis Simplified 101
  • 25 Best Data Analysis Tools in 2024

Carrying out successful data analysis requires prepping the data and making it analysis-ready. That is where Hevo steps in.


Ofem Eteng is a seasoned technical content writer with over 12 years of experience. He has held pivotal roles such as System Analyst (DevOps) at Dagbs Nigeria Limited and Full-Stack Developer at Pedoquasphere International Limited. He specializes in data science, data analytics and cutting-edge technologies, making him an expert in the data industry.

Data Collection Methods | Step-by-Step Guide & Examples

Published on 4 May 2022 by Pritha Bhandari .

Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental, or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem .

While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider:

  • The  aim of the research
  • The type of data that you will collect
  • The methods and procedures you will use to collect, store, and process the data

To collect high-quality data that is relevant to your purposes, follow these four steps.

Table of contents

  • Step 1: Define the aim of your research
  • Step 2: Choose your data collection method
  • Step 3: Plan your data collection procedures
  • Step 4: Collect the data
  • Frequently asked questions about data collection

Before you start the process of data collection, you need to identify exactly what you want to achieve. You can start by writing a problem statement : what is the practical or scientific issue that you want to address, and why does it matter?

Next, formulate one or more research questions that precisely define what you want to find out. Depending on your research questions, you might need to collect quantitative or qualitative data :

  • Quantitative data is expressed in numbers and graphs and is analysed through statistical methods .
  • Qualitative data is expressed in words and analysed through interpretations and categorisations.

If your aim is to test a hypothesis , measure something precisely, or gain large-scale statistical insights, collect quantitative data. If your aim is to explore ideas, understand experiences, or gain detailed insights into a specific context, collect qualitative data.

If you have several aims, you can use a mixed methods approach that collects both types of data.

  • Your first aim is to assess whether there are significant differences in perceptions of managers across different departments and office locations.
  • Your second aim is to gather meaningful feedback from employees to explore new ideas for how managers can improve.

Based on the data you want to collect, decide which method is best suited for your research.

  • Experimental research is primarily a quantitative method.
  • Interviews , focus groups , and ethnographies are qualitative methods.
  • Surveys , observations, archival research, and secondary data collection can be quantitative or qualitative methods.

Carefully consider what method you will use to gather data that helps you directly answer your research questions.

Data collection methods

  • Experiment. When to use: to test a causal relationship. How to collect data: manipulate variables and measure their effects on others.
  • Survey. When to use: to understand the general characteristics or opinions of a group of people. How to collect data: distribute a list of questions to a sample online, in person, or over the phone.
  • Interview/focus group. When to use: to gain an in-depth understanding of perceptions or opinions on a topic. How to collect data: verbally ask participants open-ended questions in individual interviews or focus group discussions.
  • Observation. When to use: to understand something in its natural setting. How to collect data: measure or survey a sample without trying to affect them.
  • Ethnography. When to use: to study the culture of a community or organisation first-hand. How to collect data: join and participate in a community and record your observations and reflections.
  • Archival research. When to use: to understand current or historical events, conditions, or practices. How to collect data: access manuscripts, documents, or records from libraries, depositories, or the internet.
  • Secondary data collection. When to use: to analyse data from populations that you can't access first-hand. How to collect data: find existing datasets that have already been collected, from sources such as government agencies or research organisations.

When you know which method(s) you are using, you need to plan exactly how you will implement them. What procedures will you follow to make accurate observations or measurements of the variables you are interested in?

For instance, if you’re conducting surveys or interviews, decide what form the questions will take; if you’re conducting an experiment, make decisions about your experimental design .

Operationalisation

Sometimes your variables can be measured directly: for example, you can collect data on the average age of employees simply by asking for dates of birth. However, often you’ll be interested in collecting data on more abstract concepts or variables that can’t be directly observed.

Operationalisation means turning abstract conceptual ideas into measurable observations. When planning how you will collect data, you need to translate the conceptual definition of what you want to study into the operational definition of what you will actually measure.

  • You ask managers to rate their own leadership skills on 5-point scales assessing the ability to delegate, decisiveness, and dependability.
  • You ask their direct employees to provide anonymous feedback on the managers regarding the same topics.

You may need to develop a sampling plan to obtain data systematically. This involves defining a population , the group you want to draw conclusions about, and a sample, the group you will actually collect data from.

Your sampling method will determine how you recruit participants or obtain measurements for your study. To decide on a sampling method you will need to consider factors like the required sample size, accessibility of the sample, and time frame of the data collection.

Standardising procedures

If multiple researchers are involved, write a detailed manual to standardise data collection procedures in your study.

This means laying out specific step-by-step instructions so that everyone in your research team collects data in a consistent way – for example, by conducting experiments under the same conditions and using objective criteria to record and categorise observations.

This helps ensure the reliability of your data, and you can also use it to replicate the study in the future.

Creating a data management plan

Before beginning data collection, you should also decide how you will organise and store your data.

  • If you are collecting data from people, you will likely need to anonymise and safeguard the data to prevent leaks of sensitive information (e.g. names or identity numbers).
  • If you are collecting data via interviews or pencil-and-paper formats, you will need to perform transcriptions or data entry in systematic ways to minimise distortion.
  • You can prevent loss of data by having an organisation system that is routinely backed up.

Finally, you can implement your chosen methods to measure or observe the variables you are interested in.

The closed-ended questions ask participants to rate their manager’s leadership skills on scales from 1 to 5. The data produced is numerical and can be statistically analysed for averages and patterns.

To ensure that high-quality data is recorded in a systematic way, here are some best practices:

  • Record all relevant information as and when you obtain data. For example, note down whether or how lab equipment is recalibrated during an experimental study.
  • Double-check manual data entry for errors.
  • If you collect quantitative data, you can assess the reliability and validity to get an indication of your data quality.

Data collection is the systematic process by which observations or measurements are gathered in research. It is used in many different contexts by academics, governments, businesses, and other organisations.

When conducting research, collecting original data has significant advantages:

  • You can tailor data collection to your specific research aims (e.g., understanding the needs of your consumers or user testing your website).
  • You can control and standardise the process for high reliability and validity (e.g., choosing appropriate measurements and sampling methods ).

However, there are also some drawbacks: data collection can be time-consuming, labour-intensive, and expensive. In some cases, it’s more efficient to use secondary data that has already been collected by someone else, but the data might be less reliable.

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to test a hypothesis by systematically collecting and analysing data, while qualitative methods allow you to explore ideas and experiences in depth.

Reliability and validity are both about how well a method measures something:

  • Reliability refers to the  consistency of a measure (whether the results can be reproduced under the same conditions).
  • Validity   refers to the  accuracy of a measure (whether the results really do represent what they are supposed to measure).

If you are doing experimental research , you also have to consider the internal and external validity of your experiment.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .

Operationalisation means turning abstract conceptual ideas into measurable observations.

For example, the concept of social anxiety isn’t directly observable, but it can be operationally defined in terms of self-rating scores, behavioural avoidance of crowded places, or physical anxiety symptoms in social situations.

Before collecting data , it’s important to consider how you will operationalise the variables that you want to measure.

Cite this Scribbr article

Bhandari, P. (2022, May 04). Data Collection Methods | Step-by-Step Guide & Examples. Scribbr. Retrieved 22 July 2024, from https://www.scribbr.co.uk/research-methods/data-collection-guide/

Data Collection Methods: Types & Examples

Data is a collection of facts, figures, objects, symbols, and events from different sources. Organizations collect data using various methods to make better decisions. Without data, it would be difficult for organizations to make appropriate decisions, so data is collected from different audiences at various times.

For example, an organization must collect data on product demand, customer preferences, and competitors before launching a new product. If data is not collected beforehand, the organization's newly launched product may fail for many reasons, such as lower-than-expected demand and an inability to meet customer needs.

Although data is a valuable asset for every organization, it does not serve any purpose until it is analyzed or processed to achieve the desired results.

What are Data Collection Methods?

Data collection methods are techniques and procedures for gathering information for research purposes. They can range from simple self-reported surveys to more complex quantitative or qualitative experiments.

Some common data collection methods include surveys , interviews, observations, focus groups, experiments, and secondary data analysis . The data collected through these methods can then be analyzed to support or refute research hypotheses and draw conclusions about the study’s subject matter.

Understanding Data Collection Methods

Data collection methods encompass a variety of techniques and tools for gathering quantitative and qualitative data. These methods are integral to the data collection and ensure accurate and comprehensive data acquisition. 

Quantitative data collection methods involve systematic approaches, such as numerical measurements, surveys, polls, and statistical analysis, to quantify phenomena and trends.

Conversely, qualitative data collection methods focus on capturing non-numerical information, such as interviews, focus groups, and observations, to delve deeper into understanding attitudes, behaviors, and motivations. 

Combining quantitative and qualitative data collection techniques can enrich organizations’ datasets and gain comprehensive insights into complex phenomena.

Effective utilization of accurate data collection tools and techniques enhances the accuracy and reliability of collected data, facilitating informed decision-making and strategic planning.

Importance of Data Collection Methods

Data collection methods play a crucial role in the research process as they determine the quality and accuracy of the data collected. Here are some major importance of data collection methods.

  • Quality and Accuracy: The choice of data collection technique directly impacts the quality and accuracy of the data obtained. Properly designed methods help ensure that the data collected is error-free and relevant to the research questions.
  • Relevance, Validity, and Reliability: Effective data collection methods help ensure that the data collected is relevant to the research objectives, valid (measuring what it intends to measure), and reliable (consistent and reproducible).
  • Bias Reduction and Representativeness: Carefully chosen data collection methods can help minimize biases inherent in the research process, such as sampling or response bias. They also aid in achieving a representative sample, enhancing the findings’ generalizability.
  • Informed Decision Making: Accurate and reliable data collected through appropriate methods provide a solid foundation for making informed decisions based on research findings. This is crucial for both academic research and practical applications in various fields.
  • Achievement of Research Objectives: Data collection methods should align with the research objectives to ensure that the collected data effectively addresses the research questions or hypotheses. Properly collected data facilitates the attainment of these objectives.
  • Support for Validity and Reliability: Validity and reliability are essential to sound research. The choice of data collection methods can either enhance or detract from the validity and reliability of research findings. Therefore, selecting appropriate methods is critical for ensuring the credibility of the research.

The importance of data collection methods cannot be overstated, as they play a key role in the research study’s overall success and internal validity .

Types of Data Collection Methods

The choice of data collection method depends on the research question being addressed, the type of data needed, and the resources and time available. Data collection methods can be categorized into primary and secondary methods.

1. Primary Data Collection Methods

Primary data is collected from first-hand experience and has not been used before. The data gathered by primary data collection methods is highly accurate and specific to the research's motive.

Primary data collection methods can be divided into two categories: quantitative and qualitative.

Quantitative Methods:

Quantitative techniques for market research and demand forecasting usually use statistical tools. In these techniques, demand is forecasted based on historical data. These methods of primary data collection are generally used to make long-term forecasts. Statistical analysis methods are highly reliable as subjectivity is minimal.

  • Time Series Analysis: A time series refers to a sequential order of values of a variable, known as a trend, at equal time intervals. Using patterns, an organization can predict the demand for its products and services over a projected time period. 
  • Smoothing Techniques: Smoothing techniques can be used in cases where the time series lacks significant trends. They eliminate random variation from the historical demand, helping identify patterns and demand levels that can be used to estimate future demand. The most common smoothing methods in demand forecasting are the simple moving average and the weighted moving average (a brief sketch of both follows this list).
  • Barometric Method: Also known as the leading indicators approach, researchers use this method to speculate future trends based on current developments. When past events are considered to predict future events, they act as leading indicators.
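To make the two smoothing techniques concrete, here is a small, generic R sketch on made-up monthly demand figures; the numbers, window length, and weights are illustrative and not drawn from any dataset mentioned above.

```r
# Simple and weighted moving averages over made-up monthly demand
demand <- c(120, 135, 128, 150, 144, 160, 152, 170)

# Simple 3-period moving average: each value is the mean of the last 3 observations
sma <- stats::filter(demand, rep(1/3, 3), sides = 1)

# Weighted 3-period moving average: more weight on recent periods
w   <- c(0.2, 0.3, 0.5)                    # oldest .. newest, sums to 1
wma <- stats::filter(demand, rev(w), sides = 1)

data.frame(demand, sma = as.numeric(sma), wma = as.numeric(wma))
```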

Qualitative Methods:

Qualitative data collection methods are especially useful when historical data is unavailable or when numbers or mathematical calculations are unnecessary.

Qualitative research is closely associated with words, sounds, feelings, emotions, colors, and non-quantifiable elements. These techniques are based on experience, judgment, intuition, conjecture, emotion, etc.

Quantitative methods do not provide the motive behind participants’ responses, often don’t reach underrepresented populations, and require long periods of time to collect the data. Hence, it is best to combine quantitative methods with qualitative methods.

1. Surveys: Surveys collect data from the target audience and gather insights into their preferences, opinions, choices, and feedback related to their products and services. Most survey software offers a wide range of question types.

You can also use a ready-made survey template to save time and effort. Online surveys can be customized to match the business’s brand by changing the theme, logo, etc. They can be distributed through several channels, such as email, website, offline app, QR code, social media, etc. 

You can select the channel based on your audience’s type and source. Once the data is collected, survey software can generate various reports and run analytics algorithms to discover hidden insights. 

A survey dashboard can give you statistics on response rate, completion rate, demographics-based filters, export and sharing options, and more. Integrating the survey builder with third-party apps can maximize the return on the effort spent on online real-time data collection.

Practical business intelligence relies on the synergy between analytics and reporting , where analytics uncovers valuable insights, and reporting communicates these findings to stakeholders.

2. Polls: Polls consist of a single question, either single-select or multiple-choice. They are useful when you need a quick pulse of the audience’s sentiment, and because they are short, it is easier to get responses.

Like surveys, online polls can be embedded into various platforms. Once the respondents answer the question, they can also be shown how their responses compare to others’.

3. Interviews: In face-to-face interviews, the interviewer asks a series of questions to the interviewee in person and notes down responses. If it is not feasible to meet the person, the interviewer can go for a telephone interview. 

This form of data collection is suitable for only a few respondents. It is too time-consuming and tedious to repeat the same process if there are many participants.

4. Delphi Technique: In the Delphi method, market experts are provided with the estimates and assumptions of other industry experts’ forecasts. Based on this information, experts may reconsider and revise their estimates and assumptions. The consensus of all experts on demand forecasts constitutes the final demand forecast.

5. Focus Groups: Focus groups are a common qualitative method, used in fields from market research to education. In a focus group, a small group of around 8-10 people discusses the common areas of the research problem, and each individual provides his or her insights on the issue concerned.

A moderator regulates the discussion among the group members. At the end of the discussion, the group reaches a consensus.

6. Questionnaire: A questionnaire is a printed set of open-ended or closed-ended questions that respondents answer based on their knowledge of and experience with the issue. A questionnaire is often administered as part of a survey, but a questionnaire by itself does not necessarily constitute a survey.

2. Secondary Data Collection Methods

Secondary data is data that has already been collected and used in the past. The researcher can obtain it from data sources both internal and external to the organization.

Internal sources of secondary data:

  • Organization’s health and safety records
  • Mission and vision statements
  • Financial Statements
  • Sales Report
  • CRM Software
  • Executive summaries

External sources of secondary data:

  • Government reports
  • Press releases
  • Business journals

Secondary data collection methods can also involve quantitative and qualitative techniques. Secondary data is easily available, less time-consuming to gather, and less expensive than primary data. However, the authenticity of data gathered through these methods cannot always be verified.

Regardless of the data collection method of your choice, there must be direct communication with decision-makers so that they understand and commit to acting according to the results.

For this reason, we must pay special attention to the analysis and presentation of the information obtained. The data must be useful and actionable, and the choice of data collection method has a great deal to do with that.

Steps in the Data Collection Process

The data collection process typically involves several key steps to ensure the accuracy and reliability of the data gathered. These steps provide a structured approach to gathering and analyzing data effectively. Here are the key steps in the data collection process:

  • Define the Objectives: Clearly outline the goals of the data collection. What questions are you trying to answer?
  • Identify Data Sources: Determine where the data will come from. This could include surveys and questionnaires, interviews (structured or unstructured), focus groups, observations, document analysis, or existing databases.
  • Develop Data Collection Instruments: Create or adapt tools for collecting data, such as questionnaires or interview guides. Ensure they are valid and reliable.
  • Select a Sample: If you are not collecting data from the entire population, determine how to select your sample. Consider sampling methods like random, stratified, or convenience sampling (a minimal sketch of the first two follows this list).
  • Collect Data: Execute your data collection plan, following ethical guidelines and maintaining data integrity.
  • Store Data: Organize and store collected data securely, ensuring it’s easily accessible for analysis while maintaining confidentiality.
  • Analyze Data: After collecting the data, process and analyze it according to your objectives, using appropriate statistical or qualitative methods.
  • Interpret Results: Draw conclusions from your analysis, relating them back to your original objectives and research questions.
  • Report Findings: Present your findings in a clear, organized way, using visuals and summaries to communicate insights effectively.
  • Evaluate the Process: Reflect on the data collection process. Assess what worked well and what could be improved for future studies.
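To make the sampling step concrete, here is a minimal Python sketch using pandas. The respondent frame, column names, and sample sizes are hypothetical and only illustrate the difference between simple random and stratified sampling.

import pandas as pd

# Hypothetical sampling frame of 100 respondents (illustrative only)
frame = pd.DataFrame({
    "respondent_id": range(1, 101),
    "region": ["North", "South", "East", "West"] * 25,
})

# Simple random sample: 20 respondents drawn uniformly at random
random_sample = frame.sample(n=20, random_state=42)

# Stratified sample: exactly 5 respondents drawn from each region
stratified_sample = frame.groupby("region").sample(n=5, random_state=42)

print(random_sample["region"].value_counts())      # counts vary by chance
print(stratified_sample["region"].value_counts())  # exactly 5 per region

Convenience sampling has no programmatic analogue: it simply takes whichever respondents are easiest to reach, at the cost of representativeness.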

Recommended Data Collection Tools

Choosing the right data collection tools depends on your specific needs, such as the type of data you’re collecting, the scale of your project, and your budget. Here are some widely used tools across different categories:

Survey Tools

  • QuestionPro: Offers advanced survey features and analytics.
  • SurveyMonkey: User-friendly interface with customizable survey options.
  • Google Forms: Free and easy to use, suitable for simple surveys.

Interview and Focus Group Tools

  • Zoom: Great for virtual interviews and focus group discussions.
  • Microsoft Teams: Offers features for collaboration and recording sessions.

Observation and Field Data Collection

  • Open Data Kit (ODK): This is for mobile data collection in field settings.
  • REDCap: A secure web application for building and managing online surveys.

Mobile Data Collection

  • KoboToolbox: Designed for humanitarian work, useful for field data collection.
  • SurveyCTO: Provides offline data collection capabilities for mobile devices.

Data Analysis Tools

  • Tableau: Powerful data visualization tool to analyze survey results.
  • SPSS: Widely used for statistical analysis in research.

Qualitative Data Analysis

  • NVivo: For analyzing qualitative data like interviews or open-ended survey responses.
  • Dedoose: Useful for mixed-methods research, combining qualitative and quantitative data.

General Data Collection and Management

  • Airtable: Combines spreadsheet and database functionalities for organizing data.
  • Microsoft Excel: A versatile tool for data entry, analysis, and visualization.

If you are considering purchasing one of these tools, visit our article where we dive deeper and analyze the best data collection tools in the industry.

How Can QuestionPro Help to Create Effective Data Collection?

QuestionPro is a comprehensive online survey software platform that can greatly assist in various data collection methods. Here’s how it can help:

  • Survey Creation: QuestionPro offers a user-friendly interface for creating surveys with various question types, including multiple-choice, open-ended, Likert scale, and more. Researchers can customize surveys to fit their specific research needs and objectives.
  • Diverse Distribution Channels: The platform provides multiple channels for distributing surveys, including email, web links, social media, and embedded website surveys. This enables researchers to reach a wide audience and collect data efficiently.
  • Panel Management: QuestionPro offers panel management features, allowing researchers to create and manage panels of respondents for targeted data collection. This is particularly useful for longitudinal studies or when targeting specific demographics.
  • Data Analysis Tools: The platform includes robust data analysis tools that enable researchers to analyze survey responses in real time. Researchers can generate customizable reports, visualize data through charts and graphs, and identify trends and patterns within the data.
  • Data Security and Compliance: QuestionPro prioritizes data security and compliance with regulations such as GDPR and HIPAA. The platform offers features such as SSL encryption, data masking, and secure data storage to ensure the confidentiality and integrity of collected data.
  • Mobile Compatibility: With the increasing use of mobile devices, QuestionPro ensures that surveys are mobile-responsive, allowing respondents to participate in surveys conveniently from their smartphones or tablets.
  • Integration Capabilities: QuestionPro integrates with various third-party tools and platforms, including CRMs, email marketing software, and analytics tools. This allows researchers to streamline their data collection processes and incorporate survey data into their existing workflows.
  • Customization and Branding: Researchers can customize surveys with their branding elements, such as logos, colors, and themes, enhancing the professional appearance of surveys and increasing respondent engagement.

The conclusion you obtain from your investigation will set the course of the company’s decision-making, so present your report clearly and list the steps you followed to obtain those results.

Make sure that whoever will take the corresponding actions understands the importance of the information collected and that it gives them the solutions they expect.

QuestionPro offers a comprehensive suite of features and tools that can significantly streamline the data collection process, from survey creation to analysis, while ensuring data security and compliance. Remember that at QuestionPro, we can help you collect data easily and efficiently. Request a demo and learn about all the tools we have for you.

Frequently Asked Questions (FAQs)

Q: What are some common data collection methods?
A: Common methods include surveys, interviews, observations, focus groups, and experiments.

Q: Why is data collection important?
A: Data collection helps organizations make informed decisions and understand trends, customer preferences, and market demands.

Q: How do quantitative and qualitative methods differ?
A: Quantitative methods focus on numerical data and statistical analysis, while qualitative methods explore non-numerical insights like attitudes and behaviors.

Q: Can quantitative and qualitative methods be combined?
A: Yes, combining methods can provide a more comprehensive understanding of the research topic.

Q: How does technology support data collection?
A: Technology streamlines data collection with tools like online surveys, mobile data gathering, and integrated analytics platforms.

Pandas DataFrame assign() Method | Create new Columns in DataFrame

Python is a great language for data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages, making importing and analyzing data much easier.

The DataFrame.assign() method assigns new columns to a DataFrame, returning a new object (a copy) with the new columns added to the original ones.

Existing columns that are re-assigned will be overwritten. The length of any newly assigned column must match the number of rows in the DataFrame.

Syntax: DataFrame.assign(**kwargs)

Parameters: kwargs — the keyword names become the new column names. If a value is callable, it is computed on the DataFrame and assigned to the new column; the callable must not change the input DataFrame (though pandas does not check this). If a value is not callable (e.g., a Series, scalar, or array), it is simply assigned.

Returns: A new DataFrame with the new columns in addition to all the existing columns.

Let’s look at some Python programs to learn how to use the assign() method of the pandas library to create new columns in a DataFrame.

Example 1: Assign a new column called Revised_Salary containing a 10% increment of the original Salary column.
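Below is a minimal sketch of this example, assuming a small DataFrame with Name and Salary columns; the names and salary values are invented for illustration.

import pandas as pd

# Hypothetical employee data (illustrative values only)
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Chandra"],
    "Salary": [50000, 60000, 55000],
})

# assign() returns a copy with the new column; the original df is unchanged
df_revised = df.assign(Revised_Salary=lambda d: d["Salary"] * 1.10)
print(df_revised)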

Example 2: Assigning more than one column at a time.
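Below is a minimal sketch of this example, continuing with the same hypothetical df as above; the Bonus and Currency columns are invented additions for illustration. Within a single assign() call the keyword arguments are applied in order, so a later callable can refer to a column created earlier in the same call, and a scalar is broadcast to every row.

# Several new columns in one assign() call
df_multi = df.assign(
    Revised_Salary=lambda d: d["Salary"] * 1.10,
    Bonus=lambda d: d["Revised_Salary"] * 0.05,  # uses the column created just above
    Currency="USD",                              # scalar broadcast to every row
)
print(df_multi)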

Pandas DataFrame assign() Method | Create new Columns in DataFrame – FAQs

How to Use Regex Replace in a Pandas DataFrame

You can use the replace() method in pandas, or the str.replace() string accessor shown below, with a regex pattern to replace values in a DataFrame or a Series. This approach is very versatile and can handle complex string patterns:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'text': ['foo123', 'bar321', 'baz123']})

# Replace runs of digits with 'XYZ' using regex
df['text'] = df['text'].str.replace(r'\d+', 'XYZ', regex=True)
print(df)

How to Replace Values with Regex

To replace values using regex in a DataFrame, pass regex=True to the replace() or str.replace() call, as shown above. This allows pattern matching and replacement within strings:

# Replace a specific pattern (e.g., anything starting with 'ba')
df['text'] = df['text'].str.replace(r'^ba\w+', 'Matched', regex=True)
print(df)

How to Replace Values in Pandas DataFrame

For simple, non-regex replacements, you can use replace() without enabling the regex functionality:

# Replace 'Matched' with 'Found'
df['text'] = df['text'].replace('Matched', 'Found')
print(df)

How to Make a Regex Pattern in Python

Creating a regex pattern in Python involves using raw strings (prefixed with r) to avoid having to escape backslashes. Here’s how to make a simple pattern and confirm that it compiles:

import re

# Regex pattern to match an email address
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

# Check if the pattern is valid
try:
    re.compile(pattern)
    is_valid = True
except re.error:
    is_valid = False

print("Is the regex pattern valid?", is_valid)

How to Compare String with Regex in Python

To compare a string against a regex pattern in Python, you can use re.match() or re.search() from the re module (the address user@example.com below is just an illustrative placeholder):

import re

# Email pattern from the previous answer
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

text = "user@example.com"  # placeholder address
match = re.match(pattern, text)
if match:
    print("The string matches the pattern.")
else:
    print("The string does not match the pattern.")

# Using re.search() to find a pattern anywhere in the string
search_result = re.search(pattern, "User email is user@example.com.")
if search_result:
    print("Found a match.")
else:
    print("No match found.")
