
Annotator Values and Bias in Machine Learning Datasets: A Literature Review

Author: Research Team
Last Updated: December 24, 2025
Focus Areas: Annotator bias, human values in AI, age bias, dataset documentation, value-sensitive design


Table of Contents

  1. Introduction
  2. Annotator Bias and Identity in Machine Learning
  3. Beyond Identity Politics: Positionality and Standpoint
  4. Crowdworker Demographics and Annotation Quality
  5. Values Over Time: Social Movements and Algorithmic Systems
  6. Value-Sensitive Design Approaches
  7. Dataset and Model Documentation
  8. Annotation Task Design and Survey Methodology
  9. Age Bias in Algorithmic Systems
  10. Challenges in Subjective Annotation Tasks
  11. Conclusions and Future Directions

1. Introduction {#introduction}

This literature review examines the emerging body of research on how human values and biases shape machine learning datasets through the data annotation process. Despite the critical role that annotators play in creating training data, relatively little attention has been paid to the social experiences, identities, and standpoints of data annotators, all of which influence annotation behavior and subsequently impact model performance and fairness (Gebru et al., 2018; Mitchell et al., 2019).

The central question this review addresses is: how can datasets—and the models built upon them—be intentionally curated to express values in accordance with desired end users and stakeholders, given that data cannot be generated independent of human biases? This review synthesizes research across multiple domains including machine learning fairness, human-computer interaction, social computing, and value-sensitive design to provide a comprehensive understanding of current knowledge and identify critical gaps.

Research Scope and Objectives

This review has three primary aims:

  1. To surface current understanding of annotator biases and methods for measuring them
  2. To examine how these biases influence annotation judgments and resulting model behavior
  3. To identify best practices for selecting annotators and designing annotation tasks, particularly for subjective concepts

2. Annotator Bias and Identity in Machine Learning {#annotator-bias-and-identity}

2.1 The Problem of Annotator Invisibility

Broadly speaking, little information is catalogued about who data annotators are, despite evidence that different groups of annotators will annotate the same content differently (Sen et al., 2015). Recent research has demonstrated that annotation bias—the systematic patterns of judgment introduced by human annotators—represents a fundamental challenge in machine learning systems (arXiv:2511.14662, 2025; arXiv:1908.07898, 2019).

A 2025 study examining bias in multilingual large language models proposed a comprehensive typology of annotation bias encompassing instruction bias, annotator bias, and contextual or cultural bias, arguing that trends in multilingual and multimodal benchmarks have "exposed a structural vulnerability: annotation bias" (arXiv:2511.14662, 2025). This systematic examination reveals that bias does not merely arise during model training or deployment, but is embedded from the earliest stages of data collection.

2.2 Annotator Identity as Signal, Not Noise

Treating annotators as uniform, interchangeable components of the ML pipeline limits researchers' and developers' ability to produce documentation to guide ethical use of models and datasets (Mitchell et al., 2019; Gebru et al., 2018). Research by Geva et al. (2019) pointed out that common NLP datasets tend to be composed of data produced by a relatively small number of crowdworkers. Critically, they found that models often perform worse on examples labeled by annotators withheld from training data and that models can learn individual annotators' biases.

That same influential study, published as "Are We Modeling the Task or the Annotator?", also demonstrated that model performance improves when annotator identifiers are included as training features, suggesting that what is often dismissed as "noise" in annotation data may actually contain meaningful signal about annotator perspectives (arXiv:1908.07898, 2019). This finding challenges the conventional view that seeks to eliminate annotator variability rather than understand and appropriately incorporate it.
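
The sketch below illustrates the general idea in a minimal form; it is not Geva et al.'s setup, and the dataset, column names, and model choice are illustrative assumptions. It treats the annotator identifier as an explicit categorical feature alongside text features in a scikit-learn pipeline, which makes it easy to compare against a text-only baseline by dropping the annotator column.

```python
# Minimal sketch (illustrative data, not Geva et al.'s experiments): include the
# annotator ID as a one-hot categorical feature next to TF-IDF text features.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical annotated dataset: the text, who labeled it, and the label.
df = pd.DataFrame({
    "text": ["clearly offensive remark", "friendly greeting",
             "borderline sarcastic comment", "neutral statement",
             "hostile insult", "polite question"],
    "annotator_id": ["a1", "a2", "a1", "a3", "a2", "a3"],
    "label": [1, 0, 1, 0, 1, 0],
})

features = ColumnTransformer([
    ("text", TfidfVectorizer(), "text"),
    ("annotator", OneHotEncoder(handle_unknown="ignore"), ["annotator_id"]),
])
model = Pipeline([("features", features),
                  ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(
    df[["text", "annotator_id"]], df["label"],
    test_size=0.33, random_state=0, stratify=df["label"])
model.fit(X_train, y_train)
print("accuracy with annotator feature:", model.score(X_test, y_test))
```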

2.3 Sources of Annotation Bias

Recent research has identified that annotation bias often begins even before annotators see the data. A 2022 study titled "Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions" hypothesizes that annotators pick up on patterns in the crowdsourcing instructions, which bias them toward writing many similar examples that are then over-represented in the collected data (arXiv:2205.00415, 2022).

Furthermore, research on "Blind Spots and Biases: Exploring the Role of Annotator Cognitive Biases in NLP" systematically examines how human cognitive biases—confirmation bias, anchoring effects, and the availability heuristic—influence annotation outcomes (arXiv:2404.19071, 2024). This work demonstrates that annotator bias is not merely a matter of individual prejudice, but is rooted in fundamental cognitive processes that affect all human judgment.

2.4 From Bias Elimination to Bias Awareness

A significant paradigm shift has emerged in recent years from pursuing "unbiased" datasets toward using datasets with awareness of inherent biases. Research on "Situated Ground Truths" argues for embracing the nuanced, situated nature of annotations rather than seeking a single universal truth (arXiv:2406.07583, 2024). This perspective acknowledges that for many tasks—particularly those involving subjective judgments about content like toxicity, sentiment, or appropriateness—the notion of a singular "gold standard" is philosophically and practically untenable.


3. Beyond Identity Politics: Positionality and Standpoint {#beyond-identity-politics}

3.1 The Limitations of Identity Categories Alone

Although documenting the demographic makeup of annotator pools is an important step in cataloguing who is represented in data, recent scholarship emphasizes that social identity alone may not sufficiently account for differences in annotation behavior and resulting model performance. As several researchers note, social identity serves as a proxy for a set of likely social experiences that inform standpoints, but social identity is not, in and of itself, equivalent to standpoint.

Focusing on social identity categories rather than lived experiences or attitudes correlated with those categories can be problematic because individuals from stigmatized social identity groups can internalize discriminatory attitudes. This phenomenon has been documented in age bias research, where older adults themselves produced age-biased annotations in sentiment analysis tasks.

3.2 Positionality in Data Annotation

Contemporary research on positionality in machine learning has drawn from feminist epistemology and standpoint theory to argue that all knowledge production is situated. A 2024 study titled "How Data Workers Shape Datasets: The Role of Positionality in Data Collection and Annotation for Computer Vision" found that tech workers—engineers, data scientists, and researchers—introduce their own positionalities when defining "identity" concepts, specifically instilling "the way they understand and are shaped by the world around them" (ACM CSCW, 2024).

Given the links among positionality, standpoint, knowledge production, and power relations, researchers have turned to the notion of reflexivity as a way to recognize one's own positionalities as well as those of research participants. The framework of "model positionality and computational reflexivity" has been proposed to help data scientists reflect on and communicate the social and cultural context of a model's development, the data annotators and their annotations, and the data scientists themselves (arXiv:2203.07031, 2022).

3.3 Lived Experience as Domain Expertise

Patton and colleagues' groundbreaking research on gang-related social media content provides compelling evidence for the value of lived experience in data annotation. Their work used both qualitative and quantitative methods to show how social work graduate students and community members differently annotate online interactions between gang members (Patton et al., 2017). Importantly, the differences in annotation arose even after graduate students were trained in the social and historical context of the tweet authors, on top of their specialized training as social work graduate students.

In their 2018 follow-up study "Multimodal Social Media Analysis for Gang Violence Prevention," Patton and colleagues partnered computer scientists with social work researchers who have domain expertise in gang violence to develop a rigorous methodology for collecting and annotating tweets, gathering 1,851 tweets and accompanying annotations related to visual concepts and psychosocial codes including aggression, loss, and substance use (arXiv:1807.08465, 2018).

Graduate students in Patton's studies annotated images as aggressive less often than community annotators did. The researchers posited two potential explanations: (1) social work graduate students might be averse to associating marginalized communities with labels that might further marginalize them, or (2) community members, by virtue of their proximity to gang violence, might be more concerned about missing genuine aggression than about over-labeling it, because an "aggression" annotation might lead to intervention that could prevent violence and save lives. These competing explanations represent fundamentally different value orientations toward the annotation task: prioritizing the avoidance of harmful labeling versus prioritizing potential safety.

3.4 Activist Expertise in Hate Speech Detection

Waseem (2017) took a significant step toward capturing annotator standpoint by recruiting feminist and anti-racism activists as expert annotators for a hate speech annotation task. In that work, activist annotations of tweets were collected alongside crowdsourced annotations, with labels of racist, sexist, neither, or both.

Studies comparing expert and amateur annotators have found meaningful differences: amateur annotators recruited on CrowdFlower without selection criteria tended to misclassify content as hate speech more often than their expert counterparts. Furthermore, hate speech detection datasets labeled by general annotators contain systematic bias because such annotators cannot effectively account for differences in language use among speakers, whereas experts produce far less biased annotations.

However, important questions remain about how "expert" should be defined. Within the range of "expert" annotators, possibilities include paper authors, linguistics experts, activists, and political experts—each bringing different forms of knowledge and potentially different biases to the annotation task.


4. Crowdworker Demographics and Annotation Quality {#crowdworker-demographics}

4.1 Quality Challenges in Crowdsourced Data

Crowdsourced data displays significant variability in quality compared to expert-collected data, which is particularly problematic in machine learning where crowdsourcing is widely used to obtain annotations for creating ML datasets. Noisy labels arise from limited expertise or unreliability of annotators, and research has documented that this noise can substantially impact model performance.

A 2024 study on "Data Quality in Crowdsourcing and Spamming Behavior Detection" examined various quality assurance mechanisms and their effectiveness in identifying and mitigating low-quality annotations (arXiv:2404.17582, 2024). The research emphasized that it is important to assess the quality of crowd-provided data to improve analysis performance and reduce biases in subsequent machine learning tasks.
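
The sketch below illustrates one common quality-assurance heuristic in this spirit; it is not the specific mechanism evaluated in the cited study, and the data and threshold are illustrative. Each annotator's agreement with the per-item majority label is computed, and annotators who fall below a threshold are flagged for review.

```python
# Illustrative quality-check heuristic (not the cited paper's method): flag
# annotators whose agreement with the per-item majority label is low.
from collections import Counter, defaultdict

# Hypothetical (item_id, annotator_id, label) triples.
annotations = [
    ("i1", "a1", "pos"), ("i1", "a2", "pos"), ("i1", "a3", "neg"),
    ("i2", "a1", "neg"), ("i2", "a2", "neg"), ("i2", "a3", "pos"),
    ("i3", "a1", "pos"), ("i3", "a2", "pos"), ("i3", "a3", "pos"),
]

# Majority label per item.
by_item = defaultdict(list)
for item, annotator, label in annotations:
    by_item[item].append(label)
majority = {item: Counter(labels).most_common(1)[0][0]
            for item, labels in by_item.items()}

# Per-annotator agreement with the majority.
agreement = defaultdict(lambda: [0, 0])  # annotator -> [matches, total]
for item, annotator, label in annotations:
    agreement[annotator][0] += int(label == majority[item])
    agreement[annotator][1] += 1

THRESHOLD = 0.5
for annotator, (matches, total) in sorted(agreement.items()):
    rate = matches / total
    status = "FLAG for review" if rate < THRESHOLD else "ok"
    print(f"{annotator}: agreement {rate:.2f} -> {status}")
```

Note that agreement with the majority is itself biased toward majority perspectives, which is precisely the concern raised in the next subsection; in practice such heuristics are best combined with gold questions and qualitative review.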

4.2 The Myth of One Truth

The notion of "one truth" in crowdsourcing responses is increasingly recognized as a myth. Research on "Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks" argues that disagreement among annotators in subjective tasks is expected because annotation outcomes depend on the annotator's perspective and lived experience, the item being evaluated, task circumstances, and instruction clarity (arXiv:2311.09743, 2023).

The notion of objective ground-truth does not always hold, especially in domains such as medical diagnosis, sentiment analysis, and artistic evaluation where labels could be fairly subjective. A classifier that does not take variety in human labels into account risks overlooking minority perspectives by ignoring potentially informative differences between annotators.

4.3 Beyond Majority Vote

Traditional approaches to handling disagreement among annotators typically rely on majority voting, but research published in the Proceedings of the ACM on Human-Computer Interaction found that a pipeline combining pairwise comparison labeling with Elo scoring outperforms majority-voting methods in reducing both random error and measurement bias in subjective constructs (ACM CSCW, 2023).

This finding suggests that the aggregation method used to synthesize multiple annotators' judgments can significantly impact the quality of the resulting training data, and that more sophisticated approaches than simple majority vote may be necessary for subjective annotation tasks.
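
As a rough illustration of the pairwise approach, the sketch below assigns each item an Elo rating that is updated from pairwise annotator judgments; it shows the general mechanism only, not the cited pipeline's implementation, and the comparison data are fabricated.

```python
# Minimal Elo aggregation over pairwise annotator judgments (general idea only,
# not the cited study's pipeline). Items start at a common baseline rating and
# each "A ranks above B" judgment nudges both ratings, producing a continuous
# ordering instead of a majority-vote label.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that item A 'wins' the comparison under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings for items A and B after one comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Hypothetical judgments: (item_a, item_b, annotator ranked item_a higher).
comparisons = [
    ("post1", "post2", True),
    ("post2", "post3", True),
    ("post1", "post3", True),
    ("post3", "post1", False),
]

ratings = {"post1": 1500.0, "post2": 1500.0, "post3": 1500.0}
for a, b, a_won in comparisons:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], a_won)

for item, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {rating:.1f}")
```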


5. Values Over Time: Social Movements and Algorithmic Systems {#values-over-time}

5.1 Temporal Dynamics of Social Norms

Historical and social context remains a challenge for algorithmic learning. Social movements produce new language and ways of communicating, adding a challenging layer to algorithmic language detection and understanding. Social movements can fundamentally shift societal understanding or acceptance of social norms and behaviors that many algorithmic technologies are designed to detect and analyze.

5.2 The #MeToo Movement as Case Study

The #MeToo movement provides a compelling case study of how social movements reshape the normative boundaries that algorithmic systems must encode. The movement has spurred far-reaching conversations about inappropriate sexual behavior by men in power, as well as by men more generally. These conversations directly challenge behaviors that were historically tolerated or even blamed on women.

Researchers at Caltech developed machine learning tools to detect online harassment related to the #MeToo movement, using GloVe word embeddings to understand context and analyze how words like "female" were used differently across platforms. A cross-platform study analyzed sexual violence disclosures across social media, using machine learning to identify 2,927 disclosures for content analysis during the #MeToo period and surrounding timeframes.
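
As a sketch of how such embedding-based analysis works in general (not the Caltech study's code), the snippet below loads pretrained GloVe vectors from a local text file, whose path is a placeholder, and compares a target word's cosine similarity to a few probe words; repeating this with embeddings trained on different platforms' text is one way to surface differences in usage.

```python
# Illustrative only: probe how a target word relates to other words in a
# pretrained GloVe space. The file path is a placeholder for any standard
# GloVe release (one word followed by its vector per line).
import numpy as np

def load_glove(path: str) -> dict:
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

glove = load_glove("glove.6B.100d.txt")  # hypothetical local path
target = "female"
for probe in ["woman", "victim", "survivor", "leader"]:
    if target in glove and probe in glove:
        print(probe, round(cosine(glove[target], glove[probe]), 3))
```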

Historical moments invoked by #MeToo and related conversations shift mainstream notions of sexually inappropriate behavior and therefore force reassessments of how algorithmic systems encode these concepts. However, societal debate surrounding how normative behavior should be defined inherently complicates and politicizes decisions and definitions operationalized into algorithms.

5.3 Gender Differences in Harassment Perception

Research has documented that men and women make significantly different assessments of sexual harassment online (Duggan, 2017). An algorithmic definition of what constitutes inappropriately sexual communication will inherently be concordant with some views and discordant with others, further emphasizing a need to validate technological systems in relation to particular social contexts and marginalized perspectives.

5.4 Algorithmic Influence on Social Movement Visibility

The relationship between algorithms and social movements is bidirectional. Social media algorithms shaped the visibility of #MeToo's intersectional origins, since ranking systems are much more likely to favor certain types of content over others. The viral spread of the original #MeToo tweet has itself been linked to "the algorithmic distribution of a tweet," highlighting how algorithms affect social movement visibility.

5.5 Conversational AI and Harassment

Research examining how conversational AI systems like Amazon's Alexa respond to sexual harassment found that commercial systems mainly avoid answering, rule-based chatbots deflect, while data-driven systems "are often non-coherent, but also run the risk of being interpreted as flirtatious and sometimes react with counter-aggression" (ACL Anthology, 2018). This finding demonstrates how training data and system design choices embed particular responses to harassment that may or may not align with user needs or societal values.


6. Value-Sensitive Design Approaches {#value-sensitive-design}

6.1 Foundations of Value-Sensitive Design

Value-sensitive design (VSD) is an established method for integrating values into technical design, developed by Batya Friedman and Peter Kahn at the University of Washington starting in the late 1980s and early 1990s. VSD offers a helpful way of thinking 'beyond identity politics': it illuminates the significance of human values in relation to standpoint, and it provides a framework for outlining the ways in which humans and computational systems express values both separately and in tandem.

6.2 Application to Machine Learning

VSD has been applied to different technologies and, more recently, to artificial intelligence. However, machine learning, in particular, poses two challenges: First, humans may not understand how an AI system learns certain things, which requires paying attention to values such as transparency, explicability, and accountability. Second, ML may lead to AI systems adapting in ways that 'disembody' the values embedded in them.

6.3 Mapping VSD to AI for Social Good

Research published in the journal AI and Ethics examined "Mapping value sensitive design onto AI for social good principles," finding important alignments between VSD methodologies and AI4SG (AI for Social Good) frameworks (AI and Ethics, 2021). The paper is available through PMC (PMC7848675) and represents an important bridge between ethical design frameworks and practical AI development.

6.4 Socio-Technical Design Processes

A critical review titled "Designing value-sensitive AI: a critical review and recommendations for socio-technical design processes" examined how different socio-technical design processes for AI-based systems support the creation of value-sensitive AI (VSAI) (AI and Ethics, 2023). This work emphasizes that value sensitivity cannot be achieved through technical means alone, but requires careful attention to the social processes of design and deployment.

6.5 Integration with Responsible AI

Recent research has examined how VSD approaches are compatible with and advantageous to the development of Responsible AI (RAI) toolkits. A paper published in the Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems provided "Guidelines for Integrating Value Sensitive Design in Responsible AI Toolkits," offering practical guidance for practitioners (CHI 2024).

6.6 Values and the Fallacy of "Unbiased" Data

At least for social data and data produced by humans, thinking about bias in terms of values helps delineate the fallacy of "unbiased" data or models. Intuitively, "valuelessness" does not make sense when human judgment is involved. Acknowledging that a system cannot be "value-less," researchers and practitioners must figure out how to carefully curate values in ways that serve marginalized communities and other stakeholders.


7. Dataset and Model Documentation {#documentation-practices}

7.1 Datasheets for Datasets

The machine learning community currently has no standardized process for documenting datasets, which can lead to severe consequences in high-stakes domains. To address this gap, Gebru et al. (2018) proposed "Datasheets for Datasets"—a framework in which every dataset would be accompanied by a datasheet that documents its motivation, composition, collection process, recommended uses, and other critical information (arXiv:1803.09010, 2018).

The proposal was inspired by the electronics industry, where every component, no matter how simple or complex, is accompanied by a datasheet that describes its operating characteristics, test results, recommended uses, and other information. The authors—Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna M. Wallach, Hal Daumé III, and Kate Crawford—argued that datasheets for datasets will facilitate better communication between dataset creators and dataset consumers, and encourage the machine learning community to prioritize transparency and accountability.

The paper was initially posted to arXiv on March 23, 2018, and later published in Communications of the ACM, becoming one of the most influential frameworks for dataset documentation.
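
As a loose illustration of the kinds of fields such documentation records (a simplified sketch, not the official Gebru et al. template; all field names and example values are invented), a datasheet can be represented as a small structured object that travels with the dataset:

```python
# Simplified, illustrative datasheet structure (not the official template).
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Datasheet:
    motivation: str                   # why and by whom the dataset was created
    composition: str                  # what instances represent, counts, splits
    collection_process: str           # how and when the data was gathered
    annotator_information: str        # who annotated, recruitment, compensation
    recommended_uses: list = field(default_factory=list)
    known_limitations: list = field(default_factory=list)

sheet = Datasheet(
    motivation="Benchmark toxicity classifiers for a community moderation tool.",
    composition="20,000 English-language forum comments, three labels per comment.",
    collection_process="Sampled from public forum posts over a six-month window.",
    annotator_information="12 crowdworkers; demographics and attitudes surveyed with consent.",
    recommended_uses=["moderation research", "model auditing"],
    known_limitations=["English only", "annotator pool skews young"],
)
print(json.dumps(asdict(sheet), indent=2))
```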

7.2 Model Cards for Model Reporting

Complementing datasheets for datasets, Mitchell et al. (2019) proposed "Model Cards for Model Reporting"—short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (FAT* 2019, arXiv:1810.03993).

The paper was authored by Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru, and was presented at the Conference on Fairness, Accountability, and Transparency on January 29-31, 2019, in Atlanta, GA.

The goal of model cards is to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited. This framework encourages transparent model reporting and has become influential in promoting responsible AI development practices.
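
The disaggregated evaluation at the heart of a model card can be sketched as follows; the groups, labels, and predictions are fabricated, and the metric choice is illustrative rather than prescribed by the framework.

```python
# Illustrative per-group evaluation of the kind a model card reports.
import pandas as pd
from sklearn.metrics import accuracy_score

eval_df = pd.DataFrame({
    "group":  ["18-29", "18-29", "18-29", "65+", "65+", "65+"],
    "y_true": [1, 0, 1, 1, 0, 1],
    "y_pred": [1, 0, 1, 0, 0, 0],
})

for group, rows in eval_df.groupby("group"):
    acc = accuracy_score(rows["y_true"], rows["y_pred"])
    print(f"{group}: accuracy {acc:.2f} (n={len(rows)})")
```

Reporting the metric per group, with group sizes, makes performance gaps visible that a single aggregate number would hide.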

7.3 Documenting Annotator Information

Both datasheets and model cards create opportunities to document annotator demographics, standpoints, and values—information that is currently rarely collected or reported. However, these documentation frameworks raise important questions about privacy, informed consent, and the potential for annotator information to be used in ways that could harm annotators themselves.


8. Annotation Task Design and Survey Methodology {#annotation-task-design}

8.1 Structural Similarity to Surveys

There is a relatively rich literature on survey methodology that directly applies to annotation task design, since annotation tasks frequently resemble surveys. A 2023 paper titled "Quality aspects of annotated data" published in the AStA Wirtschafts- und Sozialstatistisches Archiv establishes the structural similarity between annotation tasks and surveys, exploring how data annotation could benefit from survey methodology research (Springer, 2023).

The paper examines quality confounders from two perspectives: annotator composition and data collection strategy, providing a framework for thinking systematically about annotation quality.

8.2 Insights from Survey Methodology

Research titled "Position: Insights from Survey Methodology can Improve Training Data" makes the case that label collection is similar to survey data collection and develops hypotheses about data collection task facets that may impact label quality (arXiv:2403.01208, 2024).

Annotation instruments have similarities to web surveys, and research has shown that wording, ordering, and annotator effects impact both survey data and model training annotations. A 2023 presentation at City, University of London on "Bringing Survey Methodology to Machine Learning" discussed how these parallels can inform better annotation design.

8.3 Breaking Down Complex Concepts

There is a need for guidance around designing annotation tasks that is rooted in survey design principles. For example, asking whether something is hate speech, versus asking separately whether it is pejorative and whether it targets an identity group (breaking the concept into component parts), represents two fundamentally different task designs that may yield annotations of different quality.

Similarly, asking whether a person feels represented by an image (rather than asking whether an image set is 'diverse') gets at the goal of the annotation/model building process more directly. This principle—operationalizing the actual construct of interest rather than proxies for it—is well-established in survey methodology but underutilized in annotation task design.
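
A minimal sketch of the decomposition described above (the question wording, field names, and derivation rule are illustrative assumptions, not a validated instrument): annotators answer two narrower questions, and the composite hate speech label is derived afterwards.

```python
# Illustrative decomposed annotation schema and derived composite label.
questions = [
    {"id": "pejorative",
     "text": "Does this post use pejorative or demeaning language?"},
    {"id": "identity_target",
     "text": "Is the post directed at an identity group (e.g., race, gender, age)?"},
]

def derive_hate_speech_label(responses: dict) -> bool:
    """Composite label: pejorative AND directed at an identity group."""
    return responses["pejorative"] and responses["identity_target"]

# Hypothetical responses from one annotator for one item.
example = {"pejorative": True, "identity_target": False}
print("hate speech:", derive_hate_speech_label(example))  # False
```

Recording the component judgments also preserves information that a single yes/no question would discard, such as posts that are pejorative but not identity-directed.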

8.4 Question Design Choices

How specific questions are designed matters significantly: a single check-all-that-apply question versus a sequence that first asks whether something is pejorative and then asks whether it is about an identity label can produce different response patterns. These design choices, well studied in survey methodology, have received insufficient attention in the machine learning annotation literature.

8.5 Large Language Models for Annotation

A 2024 comprehensive survey titled "Large Language Models for Data Annotation and Synthesis: A Survey" examined the emerging practice of using LLMs for annotation tasks (arXiv:2402.13446, 2024). While LLMs offer potential efficiency gains, the survey raises important questions about whether LLM annotations replicate human biases, introduce new biases, or fail to capture important human variability in subjective judgments.


9. Age Bias in Algorithmic Systems {#age-bias}

9.1 AI Ageism: A Critical Roadmap

The concept of "AI ageism" has been introduced to expand understanding of inclusion and exclusion in AI to include age. Research published in PMC defines AI ageism as "practices and ideologies operating within AI that exclude, discriminate, or neglect the interests, experiences, and needs of older populations" (PMC9527733).

AI ageism manifests in five interconnected forms:

  1. Age biases in algorithms and datasets
  2. Age stereotypes of AI actors
  3. Invisibility of old age in AI discourses
  4. Discriminatory effects of AI technology on different age groups
  5. Exclusion as users of AI technology

9.2 The Algorithmic Divide

The research emphasizes that algorithms are neither neutral, fair, nor objective, as they reproduce the assumptions and beliefs of those who decide about their design and deployment. This evidence suggests continuous exclusion of older populations on a group level—an "algorithmic divide" understood as an extension of the digital divide, with effects believed to threaten various political, social, economic, cultural, educational, and career opportunities provided by machine learning and AI.

9.3 Age Bias in Sentiment Analysis

Preliminary research has found that, in re-annotation tasks, older adults produced annotations on age-related content such that a sentiment model trained on their data rated references to older age more negatively than references to younger age in counterfactual tests. This mirrored the bias produced by the original annotations, although the two models' error patterns differed significantly.

This finding suggests that simply recruiting annotators from an affected demographic group (older adults) is insufficient to eliminate bias—annotators can internalize societal biases against their own groups, highlighting the need to assess standpoints and values rather than identity alone.
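
A counterfactual test of this kind can be sketched as follows; the templates, age terms, and placeholder scoring function are illustrative assumptions, with the placeholder standing in for whatever sentiment model is under audit.

```python
# Illustrative counterfactual age-bias probe for a sentiment model.
AGE_TERMS = {
    "younger": ["young person", "teenager"],
    "older": ["older adult", "senior citizen"],
}
TEMPLATES = [
    "I sat next to a {} on the bus.",
    "The {} asked a question during the meeting.",
]

def score_sentiment(text: str) -> float:
    """Placeholder in [-1, 1]; replace with the trained model being audited."""
    return 0.0

def mean_score(group: str) -> float:
    scores = [score_sentiment(template.format(term))
              for template in TEMPLATES for term in AGE_TERMS[group]]
    return sum(scores) / len(scores)

gap = mean_score("younger") - mean_score("older")
print(f"mean sentiment gap (younger - older): {gap:+.3f}")
# A consistently positive gap indicates that references to older age are scored
# more negatively, the pattern described above.
```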

9.4 Research Gaps

Critical questions remain underexplored:

  • How does experience with ageism correlate with annotation behavior?
  • How does identifying as an older adult correlate with annotation behavior?
  • Does slicing the annotator pool by age versus by attitudes toward aging and age discrimination produce significantly different data annotations?

10. Challenges in Subjective Annotation Tasks {#subjective-tasks}

10.1 Inter-Rater Reliability in Subjective Contexts

Inter-Rater Consistency (IRC) measures the degree of agreement among different human annotators when labeling or judging the same data. For tasks like emotion detection, sentiment analysis, or summarizing text—tasks that rely heavily on human judgment—people may see things differently, especially in complex or subjective tasks.

AI systems depend on human-labeled data to learn and improve, but when human annotators disagree, the data becomes unreliable, and so do the benchmarks used to judge model performance. Low Cohen's kappa values may reflect ambiguous instructions or inherently subjective questions rather than rater error.

10.2 Impact on Model Training

When IRC is high, the labels are consistent and models can learn from a stable target. When IRC is low, the labels are noisy and models may be penalized even when their answers are reasonable. Research by Anthropic observed that more sophisticated topics receive lower agreement, suggesting that task difficulty and subjectivity are inherently linked.

10.3 Common Metrics and Their Limitations

Several statistical measures are used to assess inter-rater reliability:

  • Cohen's Kappa: Measures agreement between two raters for categorical items
  • Fleiss' Kappa: Measures consistency between multiple annotators
  • Krippendorff's Alpha: A more universal method suitable for any number of raters and data types

However, these metrics assume that agreement is desirable and that disagreement represents error. This assumption is increasingly challenged in the context of subjective tasks where legitimate differences in perspective may be valuable signal rather than noise.
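
As a small worked example of the first two metrics (the toy ratings below are fabricated; Krippendorff's alpha usually requires a dedicated package and is omitted here), scikit-learn and statsmodels can compute agreement on a panel of categorical labels:

```python
# Toy example: 6 items rated by 3 annotators with categorical codes
# (0 = not toxic, 1 = toxic).
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
])  # rows = items, columns = raters

# Cohen's kappa for one pair of raters.
print("Cohen's kappa (raters 0 vs 1):",
      round(cohen_kappa_score(ratings[:, 0], ratings[:, 1]), 3))

# Fleiss' kappa across all three raters.
counts, _ = aggregate_raters(ratings)
print("Fleiss' kappa (all raters):", round(fleiss_kappa(counts), 3))
```

Low values on such a panel would not by themselves distinguish rater error from legitimate perspectival disagreement, which is the limitation noted above.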


11. Conclusions and Future Directions {#conclusions}

11.1 Summary of Key Findings

This literature review has synthesized research across multiple domains to examine how annotator values and biases shape machine learning datasets. Several key themes emerge:

  1. Annotator bias is systematic and measurable: Research demonstrates that annotation bias is not random noise but reflects systematic patterns related to annotator identity, positionality, cognitive processes, and task design.

  2. Identity alone is insufficient: While annotator demographics matter, lived experience, standpoint, and values provide more nuanced understanding of annotation behavior than identity categories alone.

  3. The myth of objectivity: For many annotation tasks, particularly those involving subjective judgments, the notion of a single objective ground truth is philosophically untenable. Disagreement may reflect legitimate perspectival differences.

  4. Values change over time: Social movements and historical context shift the meaning of concepts that algorithmic systems are designed to detect, requiring ongoing reassessment of training data.

  5. Documentation is critical: Frameworks like datasheets for datasets and model cards provide mechanisms for transparency, but questions remain about what annotator information should be collected and how to protect annotator privacy.

  6. Design matters: Annotation task design, informed by survey methodology principles, significantly impacts annotation quality and the values embedded in resulting datasets.

11.2 Critical Research Gaps

Despite growing research attention, significant gaps remain:

Empirical studies linking annotator characteristics to model behavior: While research has established that annotator identity and values affect annotations, more work is needed to trace these effects through to deployed model behavior and downstream harms.

Best practices for annotator selection: Clear guidance is lacking on how to determine whose values should be represented in annotation pools for different tasks and deployment contexts.

Privacy-preserving methods for collecting annotator information: Balancing the need for transparency about annotator standpoints with annotator privacy and safety requires methodological innovation.

Handling internalized bias: When annotators from marginalized groups have internalized discriminatory attitudes, how should their annotations be weighted or contextualized?

Temporal validity: How long do annotations remain valid as social norms evolve? How should datasets be updated to reflect changing values?

11.3 Practical Implications

For practitioners developing machine learning systems, this research suggests several actionable recommendations:

  1. Document annotator information systematically: At minimum, collect demographics, but ideally also assess relevant attitudes, values, and lived experiences.

  2. Design annotation tasks using survey methodology principles: Break complex concepts into components, use clear language, and pilot test instructions.

  3. Embrace disagreement as signal: Rather than forcing consensus through majority vote, consider whether disagreement reflects important perspectival differences.

  4. Situate datasets and models: Use datasheets and model cards to document who annotated data, under what conditions, and for what purposes.

  5. Consider value alignment: Explicitly articulate what values the system should embody and assess whether annotation processes support those values.

11.4 Toward Value-Conscious AI

The research reviewed here points toward a fundamental reframing of the relationship between human values and machine learning. Rather than treating values as "bias" to be eliminated, value-conscious AI development would:

  • Acknowledge that all datasets embody values
  • Make those values explicit and transparent
  • Align dataset values with the needs of affected communities
  • Build systems that respect multiple legitimate perspectives
  • Continuously reassess value alignment as social contexts evolve

This represents a shift from the pursuit of impossible objectivity to the practice of responsible value curation in service of fairness and equity.


References

  • Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daumé, H., & Crawford, K. (2018). Datasheets for Datasets. arXiv:1803.09010. Communications of the ACM. https://arxiv.org/abs/1803.09010
  • Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., & Gebru, T. (2019). Model Cards for Model Reporting. FAT* 2019, arXiv:1810.03993. https://arxiv.org/abs/1810.03993


Document End

This literature review was compiled on December 24, 2025, drawing from academic papers, conference proceedings, and peer-reviewed journals. Citations include arXiv identifiers and publication venues for verification and further reading.