imdb_reviews

Description:

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

Additional Documentation: Explore on Papers With Code

Homepage: http://ai.stanford.edu/~amaas/data/sentiment/

Source code: tfds.datasets.imdb_reviews.Builder

Versions:

  • 1.0.0 (default): New split API (https://tensorflow.org/datasets/splits)

Download size: 80.23 MiB

Auto-cached (documentation): Yes

Supervised keys (see as_supervised doc): ('text', 'label')

Figure (tfds.show_examples): Not supported.

imdb_reviews/plain_text (default config)

Config description: Plain text

Dataset size: 129.83 MiB

imdb_reviews/bytes

Config description: Uses byte-level text encoding with tfds.deprecated.text.ByteTextEncoder

Dataset size: 129.88 MiB

imdb_reviews/subwords8k

Config description: Uses tfds.deprecated.text.SubwordTextEncoder with 8k vocab size

Dataset size: 54.72 MiB

imdb_reviews/subwords32k

Config description: Uses tfds.deprecated.text.SubwordTextEncoder with 32k vocab size

Dataset size: 50.33 MiB

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies.

Last updated 2022-12-10 UTC.

Movie Review Data

Sentiment polarity datasets.

  • polarity dataset v2.0 (3.0Mb) (includes README v2.0): 1000 positive and 1000 negative processed reviews. Introduced in Pang/Lee ACL 2004. Released June 2004.
  • Pool of 27886 unprocessed html files (81.1Mb) from which the polarity dataset v2.0 was derived. (This file is identical to movie.zip from data release v1.0.)
  • sentence polarity dataset v1.0 (includes sentence polarity dataset README v1.0): 5331 positive and 5331 negative processed sentences / snippets. Introduced in Pang/Lee ACL 2005. Released July 2005.
  • polarity dataset v1.1 (2.2Mb) (includes README.1.1): approximately 700 positive and 700 negative processed reviews. Released November 2002. This alternative version was created by Nathan Treloar, who removed a few non-English/incomplete reviews and changed some of the labels (judging some polarities to be different from the original author's rating). The complete list of changes made to v1.1 can be found in diff.txt.
  • polarity dataset v0.9 (2.8Mb) (includes a README): 700 positive and 700 negative processed reviews. Introduced in Pang/Lee/Vaithyanathan EMNLP 2002. Released July 2002. Please read the "Rating Information - WARNING" section of the README.
  • movie.zip (81.1Mb): all html files we collected from the IMDb archive.

Sentiment scale datasets

  • Sep 30, 2009: Yanir Seroussi points out that due to some misformatting in the raw html files, six reviews are misattributed to Dennis Schwartz (29411 should be Max Messier, 29412 should be Norm Schrager, 29418 should be Steve Rhodes, 29419 should be Blake French, 29420 should be Pete Croatto, 29422 should be Rachel Gordon) and one (23982) is blank.

Subjectivity datasets

  • subjectivity dataset v1.0 (508K) (includes subjectivity README v1.0): 5000 subjective and 5000 objective processed sentences. Introduced in Pang/Lee ACL 2004. Released June 2004.
  • Pool of unprocessed source documents (9.3Mb) from which the sentences in the subjectivity dataset v1.0 were extracted. Note: On April 2, 2012, we replaced the original gzipped tarball with one in which the subjective files are now in the correct directory (so that the subjectivity directory is no longer empty; the subjective files were mistakenly placed in the wrong directory, although distinguishable by their different naming scheme).

If you have any questions or comments regarding this site, please send email to Bo Pang or Lillian Lee.

Datasets: rotten_tomatoes

Dataset Card for "rotten_tomatoes"

Dataset Summary

Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005.

Supported Tasks and Leaderboards

More Information Needed

Dataset Structure

Data Instances

  • Size of downloaded dataset files: 0.49 MB
  • Size of the generated dataset: 1.34 MB
  • Total amount of disk used: 1.84 MB

An example of 'validation' looks as follows.

Data Fields

The data fields are the same among all splits.

  • text : a string feature.
  • label : a classification label, with possible values including neg (0), pos (1).

Data Splits

Reads Rotten Tomatoes sentences and splits them into 80% train, 10% validation, and 10% test, following the practice set out in Jinfeng Li, "TEXTBUGGER: Generating Adversarial Text Against Real-world Applications."
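The 80/10/10 split described above can be sketched in a few lines of Python. This is only an illustration of the proportions: the function name `split_80_10_10`, the seed, and the use of placeholder integer ids in place of sentences are assumptions, and the result is not claimed to match the exact split published with the dataset.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of items and cut it into 80% / 10% / 10% parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = len(shuffled) * 8 // 10      # integer arithmetic avoids float rounding
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # the remainder goes to test
    return train, validation, test


# Placeholder ids standing in for the 5,331 + 5,331 sentences.
sentences = list(range(5331 + 5331))
train, validation, test = split_80_10_10(sentences)
print(len(train), len(validation), len(test))
```

Because the remainder is assigned to the test part, the three parts always cover every item exactly once, even when the total is not divisible by ten.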

Thanks to @thomwolf, @jxmorris12 for adding this dataset.

Models trained or fine-tuned on rotten_tomatoes

  • sileod/deberta-v3-base-tasksource-nli
  • sileod/deberta-v3-small-tasksource-nli
  • sileod/deberta-v3-large-tasksource-nli
  • Hazqeel/electra-small-finetuned-sst2-rotten_tomatoes-distilled
  • solwol/my-awesome-adapter
  • AdapterHub/bert-base-uncased-pf-rotten_tomatoes

IMDB movie review sentiment classification dataset

load_data function

Loads the IMDB dataset.

This is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

As a convention, "0" does not stand for a specific word, but instead is used to encode the pad token.

Arguments

  • path: where to cache the data (relative to ~/.keras/dataset).
  • num_words: integer or None. Words are ranked by how often they occur (in the training set) and only the num_words most frequent words are kept. Any less frequent word will appear as the oov_char value in the sequence data. If None, all words are kept. Defaults to None.
  • skip_top: skip the top N most frequently occurring words (which may not be informative). These words will appear as the oov_char value in the dataset. When 0, no words are skipped. Defaults to 0.
  • maxlen: int or None. Maximum sequence length. Any longer sequence will be truncated. None means no truncation. Defaults to None.
  • seed: int. Seed for reproducible data shuffling.
  • start_char: int. The start of a sequence will be marked with this character. 0 is usually the padding character. Defaults to 1.
  • oov_char: int. The out-of-vocabulary character. Words that were cut out because of the num_words or skip_top limits will be replaced with this character.
  • index_from: int. Index actual words with this index and higher.

Returns

  • Tuple of Numpy arrays: (x_train, y_train), (x_test, y_test).

x_train, x_test: lists of sequences, which are lists of indexes (integers). If the num_words argument was specified, the maximum possible index value is num_words - 1. If the maxlen argument was specified, the largest possible sequence length is maxlen.

y_train, y_test: lists of integer labels (1 or 0).

Note: The 'out of vocabulary' character is only used for words that were present in the training set but are not included because they're not making the num_words cut here. Words that were not seen in the training set but are in the test set have simply been skipped.
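The effect of num_words and skip_top described above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper name `filter_sequence`, the value 2 chosen for oov_char, and the example review are assumptions made for illustration.

```python
def filter_sequence(seq, num_words=None, skip_top=0, oov_char=2):
    """Replace every index outside [skip_top, num_words) with oov_char.

    oov_char=2 is an assumed value for this sketch.
    """
    upper = num_words if num_words is not None else float("inf")
    return [w if skip_top <= w < upper else oov_char for w in seq]


# Hypothetical encoded review: small indexes are very frequent words,
# large indexes are rare words.
review = [1, 14, 22, 9999, 16, 43, 12000, 5]

# "Only consider the top 10,000 most common words, but eliminate the
# top 20 most common words."
filtered = filter_sequence(review, num_words=10000, skip_top=20)
print(filtered)  # [2, 2, 22, 9999, 2, 43, 2, 2]
```

With num_words=10000 the largest index that survives is 9999, which agrees with the statement that the maximum possible index value is num_words - 1.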

get_word_index function

Retrieves a dict mapping words to their index in the IMDB dataset.

The word index dictionary. Keys are word strings, values are their index.
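A word index of this shape can be inverted to turn an encoded review back into text. The sketch below uses a tiny made-up dictionary instead of the real one, and it assumes index_from=3 and oov_char=2 (start_char=1 is the documented default); the helper name `decode_review` and the placeholder tokens are hypothetical.

```python
def decode_review(seq, word_index, index_from=3, start_char=1, oov_char=2):
    """Map a list of integer indexes back to a space-separated string."""
    # Invert word -> index, shifting by index_from to match the encoded data.
    inverted = {idx + index_from: word for word, idx in word_index.items()}
    # Reserved indexes get readable placeholder tokens.
    inverted[0] = "<pad>"
    inverted[start_char] = "<start>"
    inverted[oov_char] = "<oov>"
    return " ".join(inverted.get(i, "<unk>") for i in seq)


# Toy word index; the real one is returned by get_word_index().
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

decoded = decode_review([1, 4, 5, 6, 7, 2, 0], toy_word_index)
print(decoded)  # <start> the movie was great <oov> <pad>
```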

MR (MR Movie Reviews)

MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.

GroupLens Research has collected and made available rating data sets from the MovieLens web site ( https://movielens.org ). The data sets were collected over various periods of time, depending on the size of the set. Before using these data sets, please review their README files for the usage licenses and other details.

Seeking permission? If you are interested in obtaining permission to use MovieLens datasets, please first read the terms of use that are included in the README file. Then, please fill out this form to request use. We typically do not permit public redistribution (see Kaggle for an alternative download location if you are concerned about availability).

recommended for new research

MovieLens 25M Dataset

MovieLens 25M movie ratings. Stable benchmark dataset. 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. Includes tag genome data with 15 million relevance scores across 1,129 tags. Released 12/2019.

  • ml-25m.zip (size: 250 MB, checksum)

Permalink: https://grouplens.org/datasets/movielens/25m/

MovieLens Tag Genome Dataset 2021

10.5 million computed tag-movie relevance scores from a pool of 1,084 tags applied to 9,734 movies. Released 12/2021. This dataset also contains input necessary to generate the tag genome using both the original process (Vig et al. 2012) and a more recent improvement (Kotkov et al. 2021)

  • genome_2021_readme.txt
  • genome_2021.zip (size: 1.8GB)

Permalink: https://grouplens.org/datasets/movielens/tag-genome-2021

recommended for education and development

MovieLens Latest Datasets

These datasets will change over time, and are not appropriate for reporting research results. We will keep the download links stable for automated downloads. We will not archive or make available previously released versions.

Small: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

  • README.html
  • ml-latest-small.zip (size: 1 MB)

Full : approximately 33,000,000 ratings and 2,000,000 tag applications applied to 86,000 movies by 330,975 users. Includes tag genome data with 14 million relevance scores across 1,100 tags. Last updated 9/2018.

  • ml-latest.zip (size: 335 MB)

Permalink: https://grouplens.org/datasets/movielens/latest/

synthetic datasets

MovieLens 1B Synthetic Dataset

MovieLens 1B is a synthetic dataset that is expanded from the 20 million real-world ratings from ML-20M, distributed in support of MLPerf. Note that these data are distributed as .npz files, which you must read using python and numpy.

  • ml-20mx16x32.tar (3.1 GB)
  • ml-20mx16x32.tar.md5
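Since a .md5 file is listed next to the archive, a download can be checked before unpacking. The sketch below assumes the .md5 file follows the common "hex digest, whitespace, file name" layout, which is not confirmed by the page; the function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's digest with the first token stored in a .md5 file.

    Assumes the .md5 file starts with the hex digest (an assumption about
    its layout, not something stated on the download page).
    """
    with open(md5_path, "r", encoding="utf-8") as fh:
        expected = fh.read().split()[0].lower()
    return md5_of_file(data_path) == expected


# Example call, once both files have been downloaded:
# matches_md5_file("ml-20mx16x32.tar", "ml-20mx16x32.tar.md5")
```

Reading in chunks keeps memory use flat, which matters for an archive of several gigabytes.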

The code for the expansion algorithm is available here: https://github.com/mlperf/training/tree/master/data_generation

To create the dataset above, we ran the algorithm (using commit 1c6ae725a81d15437a2b2df05cac0673fde5c3a4) as described in the README under the section “Running instructions for the recommendation benchmark”.

Permalink: https://grouplens.org/datasets/movielens/movielens-1b/

older datasets

MovieLens 100K Dataset

MovieLens 100K movie ratings. Stable benchmark dataset. 100,000 ratings from 1000 users on 1700 movies. Released 4/1998.

  • ml-100k.zip (size: 5 MB, checksum)
  • Index of unzipped files

Permalink: https://grouplens.org/datasets/movielens/100k/

MovieLens 1M Dataset

MovieLens 1M movie ratings. Stable benchmark dataset. 1 million ratings from 6000 users on 4000 movies. Released 2/2003.

  • ml-1m.zip (size: 6 MB, checksum)

Permalink: https://grouplens.org/datasets/movielens/1m/

MovieLens 10M Dataset

MovieLens 10M movie ratings. Stable benchmark dataset. 10 million ratings and 100,000 tag applications applied to 10,000 movies by 72,000 users. Released 1/2009.

  • ml-10m.zip (size: 63 MB, checksum)

Permalink: https://grouplens.org/datasets/movielens/10m/

MovieLens 20M Dataset

MovieLens 20M movie ratings. Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data.

  • ml-20m.zip (size: 190 MB, checksum)

Also see the MovieLens 20M YouTube Trailers Dataset for links between MovieLens movies and movie trailers hosted on YouTube.

Permalink: https://grouplens.org/datasets/movielens/20m/

MovieLens Tag Genome Dataset 2014

11 million computed tag-movie relevance scores from a pool of 1,100 tags applied to 10,000 movies. Released 3/2014.

Also consider using the MovieLens 20M or latest datasets, which also contain (more recent) tag genome data, or the Tag Genome 2021 dataset.

  • tag-genome.zip (size: 41 MB)

Permalink: https://grouplens.org/datasets/movielens/tag-genome/

13 Best Movie data sets for Machine Learning Projects

July 21, 2021

After the year inside that was 2020, it’s safe to say that just about all of us are film buffs. That’s why we at iMerit have compiled this list of movie data sets for machine learning for the film buffs among us. These data sets are perfect for anyone looking to experiment and master basic machine learning concepts, and are decidedly more interesting than the typical data set one might leverage in such an endeavor.

The data that’s most useful for machine learning purposes contained within these data sets include cast and crew member information, script, plot, screen time, reviews, and more. Each of these can be leveraged for different machine learning purposes including natural language processing, sentiment analysis, and more. 

Here are iMerit's top 13 movie data sets for machine learning basics.

Movie data sets for Machine Learning

IMDB Reviews: Ideal for sentiment analysis, this movie data set contains 5,000 movie reviews. It has a perfect usability rating of 10 from the nearly 7,000 people who have downloaded it, making it a good data set to test with.

IMDB Film Reviews data set: Designed for binary sentiment classification, this movie data set contains substantially more data than the previous IMDB entry on this list. The data set contains 25,000 highly polar movie reviews for training with another 25,000 for testing. It also contains some unlabeled data and raw text for those looking to cut their teeth in annotation.

MovieLens 25M data set : Collected from the MovieLens website, this movie data set contains 25 million ratings along with one million tag applications that have been applied to over 62,000 movies. 

OMDB API : This web service is a crowdsourced movie database that continuously updates with the most current movies. It contains content and images for various films including over 280,000 posters.

Film data set from UCI : Containing over 10,000 films, this movie data set was donated back in 1997 to the University of California, Irvine. It contains information around casting, roles, actors, writers, producers, cinematographers, remakes, and studios involved. 

Cornell Film Review Data : Featuring movie-review data that’s perfect for anyone looking to conduct sentiment-analysis experiments, this body of data contains over 220,000 conversations between 10,000+ pairs of movie characters. 

Full MovieLens data set on Kaggle : This movie data set contains metadata for the 45,000 films that are listed on the Full MovieLens Dataset. Information contained within pertains to films released on or before July 2017 that focuses on cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages. It also contains 26 million ratings from over 270,000 users for every film.

French National Cinema Center data sets: This data set focuses exclusively on French films gathered by the CNC (Centre National du Cinema) and features 33 data sets around movie attendance, television demand, cinematographic practices and establishments, blockbuster films, and more.

Linguistic Data of 32k Film Subtitles with IMDb Meta-Data: Linguistic data from more than 32,000 films with all meta-data matched to word-count categories from subtitle files.

Movie Industry: This data repository includes 6820 movies (220 movies per year between 1986 and 2016). The following attributes are recorded for each film: budget, company, year, writer, star, votes, score, runtime, reviews, release date, rating, name, gross, genre, director, and country.

Indian Movie Theaters: This data set features detailed information on Indian theaters and their corresponding theatre capacities, screen sizes, average ticket prices, and local coordinates.

Movie Body Counts: This data set contains a tally of the number of on-screen deaths, bodies, kills, and violent action across a slew of classic Hollywood sci-fi, fantasy, and action films.

IMDb Non-Commercial Datasets

Subsets of IMDb data are available for access to customers for personal and non-commercial use. You can hold local copies of this data, and it is subject to our terms and conditions. Please refer to the Non-Commercial Licensing and copyright/license and verify compliance.

As of March 18, 2024 the datasets on this page are backed by a new data source. There has been no change in location or schema, but if you encounter issues with the datasets following the March 18th update, please contact [email protected].

Data Location

The dataset files can be accessed and downloaded from https://datasets.imdbws.com/ . The data is refreshed daily.

IMDb Dataset Details

Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:

title.akas.tsv.gz

  • titleId (string) - a tconst, an alphanumeric unique identifier of the title
  • ordering (integer) - a number to uniquely identify rows for a given titleId
  • title (string) - the localized title
  • region (string) - the region for this version of the title
  • language (string) - the language of the title
  • types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
  • attributes (array) - Additional terms to describe this alternative title, not enumerated
  • isOriginalTitle (boolean) - 0: not original title; 1: original title

title.basics.tsv.gz

  • tconst (string) - alphanumeric unique identifier of the title
  • titleType (string) - the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
  • primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release
  • originalTitle (string) - original title, in the original language
  • isAdult (boolean) - 0: non-adult title; 1: adult title
  • startYear (YYYY) - represents the release year of a title. In the case of TV Series, it is the series start year
  • endYear (YYYY) - TV Series end year. '\N' for all other title types
  • runtimeMinutes - primary runtime of the title, in minutes
  • genres (string array) - includes up to three genres associated with the title

title.crew.tsv.gz

  • directors (array of nconsts) - director(s) of the given title
  • writers (array of nconsts) - writer(s) of the given title

title.episode.tsv.gz

  • tconst (string) - alphanumeric identifier of episode
  • parentTconst (string) - alphanumeric identifier of the parent TV Series
  • seasonNumber (integer) - season number the episode belongs to
  • episodeNumber (integer) - episode number of the tconst in the TV series

title.principals.tsv.gz

  • nconst (string) - alphanumeric unique identifier of the name/person
  • category (string) - the category of job that person was in
  • job (string) - the specific job title if applicable, else '\N'
  • characters (string) - the name of the character played if applicable, else '\N'

title.ratings.tsv.gz

  • averageRating - weighted average of all the individual user ratings
  • numVotes - number of votes the title has received

name.basics.tsv.gz

  • primaryName (string) - name by which the person is most often credited
  • birthYear - in YYYY format
  • deathYear - in YYYY format if applicable, else '\N'
  • primaryProfession (array of strings) - the top-3 professions of the person
  • knownForTitles (array of tconsts) - titles the person is known for
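The file format described above (gzipped, tab-separated, UTF-8, header line first, '\N' for missing values) can be read with the Python standard library alone. The following is a minimal sketch: the helper name `read_imdb_tsv` is hypothetical, the sample rows are invented for the demonstration, and the sample's tconst column is an assumption because the title.ratings field list above names only averageRating and numVotes.

```python
import csv
import gzip
import io


def read_imdb_tsv(fileobj):
    """Yield one dict per data row, mapping the '\\N' marker to None.

    fileobj may be a path or a binary file object holding gzipped TSV data.
    """
    with gzip.open(fileobj, mode="rt", encoding="utf-8", newline="") as text:
        # QUOTE_NONE: treat quote characters as ordinary text, since the
        # format is described as plain tab-separated values.
        reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield {key: (None if value == r"\N" else value)
                   for key, value in row.items()}


# Invented sample shaped like a ratings file, compressed in memory.
sample = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t2000\n"
    "tt0000002\t\\N\t\\N\n"
)
buffer = io.BytesIO(gzip.compress(sample.encode("utf-8")))
rows = list(read_imdb_tsv(buffer))
print(rows[1])  # missing fields come back as None
```

All values are returned as strings or None; converting averageRating to float or numVotes to int is left to the caller, since a column may contain the null marker.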

IMDB Large Movie Review Dataset

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg).

http://ai.stanford.edu/~amaas/data/sentiment/

Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.

Character. Return training ("train") data or testing ("test") data. Defaults to "train".

Logical, set TRUE to delete dataset.

Logical, set TRUE to return the path of the dataset.

Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.

Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.

A tibble with 25,000 rows and 2 variables:

Character, denoting the sentiment

Character, text of the review

In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and <= 5.
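The labeling rule stated above can be written as a small function. This is a sketch of the thresholds only; the function name `label_from_score` and the choice to return None for excluded reviews are assumptions for illustration.

```python
def label_from_score(score):
    """Map a 1-10 review score to the label used in the train/test sets.

    Returns "neg" for scores <= 4, "pos" for scores >= 7, and None for
    the neutral scores (5 and 6) that the labeled sets leave out.
    """
    if score <= 4:
        return "neg"
    if score >= 7:
        return "pos"
    return None


labels = {score: label_from_score(score) for score in range(1, 11)}
print(labels)
```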

When using this dataset, please cite the ACL 2011 paper

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

COMMENTS

  1. IMDB Dataset of 50K Movie Reviews

    Large Movie Review Dataset.

  2. IMDb Movie Reviews Dataset

    The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10.

  3. IMDb Movie Reviews Dataset

    This dataset contains nearly 1 Million unique movie reviews from 1150 different IMDb movies spread across 17 IMDb genres - Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Fantasy, History, Horror, Music, Mystery, Romance, Sci-Fi, Sport, Thriller and War. The dataset also contains movie metadata such as date of release of the movie, run length, IMDb rating, movie rating (PG-13, R ...

  4. Preparing IMDB Movie Review Data for NLP Experiments

    The Large Movie Review Dataset is the primary storage site for the raw IMDB movie reviews data, but you can also find it at other locations using an internet search. If you click on the link on the web page, you will download an 80 MB file in tar-GNU-zip format named aclImdb_v1.tar.gz.

  5. How to Prepare Movie Review Data for Sentiment Analysis (Text

    The reviews were originally released in 2002, but an updated and cleaned up version was released in 2004, referred to as "v2.0". The dataset is comprised of 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at IMDB. The authors refer to this dataset as the "polarity ...

  6. MovieNet (ECCV 2020)

    In this paper, we introduce MovieNet -- a holistic dataset for movie understanding. MovieNet contains 1,100 movies with a large amount of multi-modal data, e.g. trailers, photos, plot descriptions, etc. Besides, different aspects of manual annotations are provided in MovieNet, including 1.1M characters with bounding boxes and identities, 42K ...

  7. Sentiment Analysis on IMDB Movie Reviews

    Notebook to train an XLNet model to perform sentiment analysis. The dataset used is a balanced collection of (50,000 - 1:1 train-test ratio) IMDB movie reviews with binary labels: positive or negative from the paper by Maas et al. (2011). The current state-of-the-art model on this dataset is XLNet by Yang et al. (2019) which has an accuracy of 96.2%. We get an accuracy of 92.2% due to the ...

  8. Mastering Word Embeddings in 10 Minutes with IMDB Reviews

    Download the IMDB Reviews Dataset. IMDB Reviews Dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB. The IMDB Reviews dataset is used for binary sentiment classification, whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing.

  9. There are 93 movie datasets available on data.world

    Dataset of 15506 Indian movies taken from IMDb. These are all the Indian movies on IMDb as of 16/06/2021. Also listed: Movies pos/neg reviews.

  10. Rotten Tomatoes movies and critic reviews dataset

    17k+ movies and their related critic reviews scraped from Rotten Tomatoes.