An Overview of Data Warehouse and Data Lake in Modern Enterprise Data Management


1. Introduction

2. Definition: Big Data Analytics, Data Warehouses, and Data Lakes

2.1. Big Data Analytics

  • Volume, or the available amount of data;
  • Velocity, or the speed of data processing;
  • Variety, or the different types of big data;
  • Volatility, or the variability of the data;
  • Veracity, or the accuracy of the data;
  • Visualization, or the depiction of big data-generated insights through visual representation;
  • Value, or the benefits organizations derive from the data.

2.2. Data Warehouses

2.3. Data Lake

2.4. The Difference between Data Warehouses and Data Lakes

2.5. Literature Review

3. Architecture

3.1. Data Warehouse Architecture

  • Single-tier architecture : This single-layer model minimizes the amount of data stored and helps remove data redundancy. Its disadvantage is the lack of a component separating analytical from transactional processing, which is why this architecture is rarely used in practice.
  • Two-tier architecture : This model physically separates the available sources from the data warehouse by means of a staging area, which ensures that all data loaded into the warehouse are in an appropriately cleansed format. However, this architecture is neither expandable nor able to support many end users, and it suffers from connectivity problems due to network limitations.
  • Three-tier architecture : This is the most widely used architecture for data warehouses [ 56 , 57 ]. It consists of a top, middle, and bottom tier. In the bottom tier, data are cleansed, transformed, and loaded via backend tools; this tier serves as the database of the data warehouse. The middle tier is an OLAP server that presents an abstract view of the database, acting as a mediator between the end user and the database. The top tier, the front-end client layer, consists of the tools and APIs used to connect to and retrieve data from the data warehouse (e.g., query tools, reporting tools, managed query tools, analysis tools, and data mining tools).
  • Data warehouse database : The core foundation of the data warehouse environment is its central database, typically implemented with RDBMS technology [ 58 ]. Such implementations are limited by the fact that traditional RDBMSs are optimized for transactional processing rather than data warehousing. Alternatives include (1) relational databases run in parallel, enabling shared memory on various multiprocessor configurations or parallel processors; (2) new index structures that avoid full relational table scans and improve speed; and (3) multidimensional databases (MDDBs), which circumvent the limitations of relational data warehouse models.
  • Extract, transform, and load (ETL) tools : All the conversions, summarizations, and changes required to transform data into a unified warehouse format are carried out via extract, transform, and load (ETL) tools [ 59 ]. The ETL process helps the data warehouse achieve enhanced system performance and business intelligence, timely access to data, and a high return on investment:
    – Extraction : Connecting to source systems and collecting the data needed for analytical processing;
    – Transformation : Converting the extracted data into a standard format;
    – Loading : Importing the transformed data into the data warehouse.
    As per regulatory stipulations, ETL anonymizes confidential and sensitive information before loading it into the target data store [ 60 ], and it prevents unwanted data in operational databases from being loaded into the DW. ETL tools amend data arriving from different sources, calculate summaries and derived data, and generate background jobs, COBOL programs, shell scripts, etc. that regularly update the data in the warehouse; they also help maintain the metadata. (A minimal code sketch of the three ETL stages follows this list.)
  • Metadata : Metadata are the data about the data that define the data warehouse [ 61 ]. They cover high-level technological concepts and help with building, maintaining, and managing the data warehouse. Metadata play an important role in transforming data into knowledge, since they define the source, usage, values, and features of the data warehouse and how its data are updated and processed. Metadata tools are the most difficult to choose owing to the lack of a clear standard, although data warehousing tool vendors are making efforts to unify a metadata model. One category, technical metadata, contains information about the warehouse used by its designers and administrators, whereas business metadata contains details that enable end users to understand the information stored in the data warehouse.
  • Query tools : Query tools allow users to interact with the DW system and collect business-relevant information for strategic decisions. Such tools come in several types:
    – Query and reporting tools : These help organizations generate regular operational reports and support high-volume batch jobs such as printing and calculating. Popular reporting tools include Brio, Oracle, Powersoft, and SAS Institute. Query tools additionally shield end users from pitfalls in SQL and database structure by inserting a meta-layer between the users and the database.
    – Application development tools : Beyond the built-in graphical and analytical tools, application development tools are leveraged to satisfy an organization's analytical needs.
    – Data mining tools : These automate the discovery of meaningful new correlations and structures by mining large amounts of data.
    – OLAP tools : Online analytical processing (OLAP) tools exploit the concept of a multidimensional database and help analyze data using complex multidimensional views [ 28 , 62 ]. There are two types of OLAP tools: multidimensional OLAP (MOLAP) and relational OLAP (ROLAP) [ 63 ]:
      * MOLAP : A cube is pre-aggregated from the relational data source, so the MOLAP tool returns prompt results for user report requests, since all the data are already aggregated within the cube [ 64 ].
      * ROLAP : The ROLAP engine acts as a smart SQL generator and comes with a “designer” piece wherein the administrator specifies the associations between the relational tables, attributes, and hierarchy map and the underlying database tables [ 65 ].
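To make the ETL flow described above concrete, here is a minimal sketch of the three stages against an in-memory SQLite database standing in for the warehouse RDBMS. The CSV source, table names, and the hash-based anonymization rule are illustrative assumptions, not part of the original survey.

```python
import csv
import hashlib
import sqlite3

SOURCE_CSV = "orders.csv"  # hypothetical source: order_id, customer_email, amount

def extract(path):
    """Extract: read raw records from an operational source."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: standardize formats and anonymize sensitive fields."""
    for row in rows:
        yield {
            "order_id": int(row["order_id"]),
            # Pseudonymize the e-mail before it reaches the warehouse,
            # mirroring the regulatory anonymization step described above.
            "customer_key": hashlib.sha256(row["customer_email"].encode()).hexdigest()[:16],
            "amount": round(float(row["amount"]), 2),
        }

def load(rows, conn):
    """Load: import the conformed records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders (order_id INTEGER, customer_key TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO fact_orders VALUES (:order_id, :customer_key, :amount)", rows
    )
    conn.commit()

if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for the warehouse RDBMS
    load(transform(extract(SOURCE_CSV)), warehouse)
```

In a production pipeline, the same three functions would typically be scheduled by the background jobs mentioned above rather than run ad hoc.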

3.2. Data Lake Architecture

  • Raw data layer : This layer is also known as the ingestion layer or landing area because it acts as the sink of the data lake. The prime goal is to ingest raw data as quickly and as efficiently as possible, and no transformations are allowed at this stage. With the help of the archive, it is possible to get back to the raw data as they existed at a given point in time. Overriding (i.e., handling duplicate versions of the same data) is not permitted, and end users are not granted access to this layer: the data here are not ready to use and require considerable knowledge to consume correctly. (A brief code sketch of these zones follows this list.)
  • Standardized data layer : This layer is optional in most implementations but a good option if fast growth of the data lake is expected. Its prime objective is to boost the performance of data transfer from the raw layer to the curated layer. Whereas the raw layer stores data in their native format, the standardized layer stores them in whatever format best fits cleansing.
  • Cleansed layer or curated layer : In this layer, data are transformed into consumable data sets and stored in files or tables. This is one of the most complex parts of the whole data lake solution since it requires cleansing, transformation, denormalization, and consolidation of different objects. Furthermore, the data are organized by purpose, type, and file structure. Usually, end users are granted access only to this layer.
  • Application layer : This is also known as the trusted layer, secure layer, or production layer. This is sourced from the cleansed layer and enforced with requisite business logic. In case the applications use machine learning models on the data lake, they are obtained from here. The structure of the data is the same as in the cleansed layer.
  • Sandbox data layer : This is also another optional layer that is meant for analysts’ and data scientists’ work to carry out experiments and search for patterns or correlations. The sandbox data layer is the proper place to enrich the data with any source from the Internet.
  • Security : Even though data lakes are not exposed to a broad audience, security is of great importance, especially during the initial phase and architecture; data lakes are not like relational databases, which come with an arsenal of built-in security mechanisms.
  • Governance : Monitoring and logging of operations become crucial once the lake is used for analysis at scale.
  • Metadata : These are the data about the data. Most schemas are accompanied by additional details on the purpose of the data, with descriptions of how they are meant to be exploited.
  • Stewardship : Depending on the required scale, stewardship is handled either by creating a separate role or by delegating the responsibility to the users, possibly through some metadata solutions.
  • Master Data : This is an essential part of serving ready-to-use data. It can be either stored on the data lake or referenced while executing ELT processes.
  • Archive : Data lakes keep some archive data originating from data warehousing; otherwise, performance and storage-related problems may occur.
  • Offload : In the case of relational data warehousing solutions, this area absorbs time- and resource-consuming ETL processes offloaded from the warehouse to the data lake.
  • Orchestration and ELT processes : Once the data are pushed from the raw layer through the cleansed layer and to the sandbox and application layers, a tool is required to orchestrate the flow. Either an orchestration tool or some additional resources to execute them are leveraged in this regard.
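The following minimal sketch illustrates the flow through the raw and curated zones with plain files. The directory layout, the date partitioning, and the JSON-to-CSV "cleansing" step are illustrative assumptions rather than a prescribed design.

```python
import csv
import json
from datetime import date
from pathlib import Path

LAKE = Path("lake")  # hypothetical lake root

def ingest_raw(payload: bytes, source: str) -> Path:
    """Raw/landing zone: persist the payload untouched, partitioned by date."""
    target = LAKE / "raw" / source / date.today().isoformat()
    target.mkdir(parents=True, exist_ok=True)
    path = target / "events.json"
    path.write_bytes(payload)  # no transformation permitted in this zone
    return path

def curate(raw_path: Path) -> Path:
    """Cleansed/curated zone: parse, select fields, store a consumable table."""
    records = json.loads(raw_path.read_text())
    target = LAKE / "curated" / "events"
    target.mkdir(parents=True, exist_ok=True)
    out = target / "events.csv"
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "ts", "value"])
        writer.writeheader()
        for r in records:
            writer.writerow({k: r.get(k) for k in ("id", "ts", "value")})
    return out

if __name__ == "__main__":
    raw = ingest_raw(b'[{"id": 1, "ts": "2022-09-25T10:00:00", "value": 42}]', "sensor_api")
    print("curated table at", curate(raw))
```

In a real deployment, the hand-off from `ingest_raw` to `curate` would be driven by the orchestration and ELT processes described above.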

4. Design Aspects

4.1. Data Warehouse Design Considerations for Business Needs

  • User needs and appropriate data model : The very first design consideration in a data warehouse is the business and user needs. During the design phase, integration of the data warehouse with existing business processes and compatibility with long-term strategies have to be ensured. Enterprises must clearly comprehend the purpose of their data warehouse, the technical requirements, the benefits end users gain from the system, and the improved means of reporting for business intelligence (BI) and analytics. A clear notion of what information is important to the business is quintessential to the success of the data warehouse. To facilitate this, creating an appropriate data model of the business is a key design aspect (e.g., with SQL Developer Data Modeler (SDDM)), and a data flow diagram can help depict how data move within the company.
  • Adopting a standard data warehouse architecture and methodology : Another important practical consideration when designing a DW is to leverage a recognized DW modeling standard (e.g., 3NF, star schema (dimensional), or Data Vault) [ 73 ]. Selecting such a standard architecture and sticking to it augments the efficiency of a data warehouse development approach. An agile data warehouse methodology is similarly important in practice: with proper planning, DW projects can be compartmentalized into smaller pieces that deliver value faster, which helps re-prioritize the DW as the business's needs change.
  • Cloud vs. on-premises storage : Enterprises can opt for either an on-premises architecture or a cloud data warehouse [ 13 ]. The former requires setting up the physical environment, including all the servers necessary to power ETL processes, storage, and analytic operations, whereas the latter skips this step. A few circumstances still favor an on-premises approach: for example, critical databases that are old and kept on-premises may not work well with cloud-based data warehouses, and strict regulatory requirements, which might forbid offshore data storage, can make an on-premises setting the better choice. Nevertheless, cloud-based services provide the most flexible data warehousing on the market in terms of storage and their pay-as-you-go nature.
  • Data tool ecosystem and data modeling : The organization's tool ecosystem plays a key role. Adopting a DW automation tool (Wherescape ( https://www.wherescape.com , accessed on 25 September 2022), AnalytixDS, Ajilius ( https://tracxn.com/d/companies/ajilius.com , accessed on 25 September 2022), etc.) ensures efficient usage of IT resources, faster project implementation, and better support by enforcing coding standards. The data modeling planning step imparts detailed, reusable documentation of a data warehouse's implementation; specifically, it assesses the data structures, investigates how to efficiently represent these sources in the data warehouse, and specifies OLAP requirements.
  • ETL or ELT design : Selecting the appropriate ETL or ELT solution is another design concern [ 39 ]. When businesses use expensive in-house analytics systems, much prep work, including transformations, can be conducted up front, as in the ETL scheme. ELT, however, is the better approach when the destination is a cloud data warehouse: once data are colocated, the power of a single cloud engine can be leveraged to perform integrations and transformations efficiently, and organizations can transform their raw data at any time according to the use case rather than as a fixed step in the data pipeline (see the sketch after this list).
  • Semantic and reporting layers : Based on previously documented data models, the OLAP server is implemented to facilitate the analytical queries of the users and to empower BI systems. In this regard, data engineers should carefully consider time-to-analysis and latency requirements to assess the analytical processing capabilities of the data warehouse. Similarly, while designing the reporting layer, the implementation of reporting interfaces or delivery methods as well as permissible access have to be set by the administrator.
  • Ease of scalability : Understanding current business needs is critical to business intelligence and decision making. This includes how much data the organization currently has and how quickly its needs are likely to grow. Staffing and vendor costs need to be taken into consideration while deciding the scale of growth.
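As a minimal illustration of the ELT pattern mentioned above, the sketch below loads raw records into the warehouse engine first and transforms them in-engine with SQL on demand. SQLite stands in for a cloud warehouse engine, and the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse engine

# Load: land the raw, untransformed records directly in the warehouse.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "19.90", "de"), ("2", "5.00", "DE"), ("3", "7.25", "fr")],
)

# Transform: performed inside the engine, per use case, not as a fixed pipeline step.
conn.execute(
    """
    CREATE VIEW orders_by_country AS
    SELECT UPPER(country) AS country, SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_orders
    GROUP BY UPPER(country)
    """
)
print(conn.execute("SELECT * FROM orders_by_country").fetchall())
```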

4.2. Data Lake Design Aspects for Enterprise Data Management

  • Focus on business objectives rather than technology : By anchoring on business objectives, a data lake effort can prioritize work and outcomes accordingly. For a particular business objective, some data may be more valuable than others, and this kind of comprehension and analysis is the key to an enterprise's data lake success. With such goal orientation, data lakes can start small and then learn, adapt, and produce accelerated outcomes for the business. Key factors in this regard include (1) whether the lake solves an actual business problem, (2) whether it imparts new capabilities, and (3) the access to or ownership of the data, among others.
  • Scalability and durability are two more major criteria [ 74 ]. Scalability enables scaling to any size of data while importing them in real time. This is an essential criterion for a data lake since it is a centralized data repository for an entire organization. Another important aspect (i.e., durability) deals with providing consistent uptime while ensuring no loss or corruption of data.
  • Another key design aspect in a data lake is its capability to store unstructured, semi-structured, and structured data , which helps organizations to transfer anything from raw, unprocessed data to fully aggregated analytical outcomes [ 75 ]. In particular, the data lake has to deliver business-ready data. Practically speaking, data by themselves have no meaning. Although file formats and schemas can parse the data (e.g., JSON and XML), they fail at delivering insight into their meaning. To circumvent such a limitation, a critical component of any data lake technical design is the incorporation of a knowledge catalog. Such a catalog helps in finding and understanding information assets. The knowledge catalog’s contents include the semantic meaning of the data, format and ownership of data, and data policies, among other elements.
  • Security considerations are also of prime importance in a data lake in the cloud. The three domains of security are encryption, network-level security, and access control. Network-level security imparts a robust defense strategy by denying inappropriate access at the network level, whereas encryption ensures security at least for those types of data that are not publicly available. Security should be part of data lake design from the beginning. Compliance standards that regulate data protection and privacy are incorporated in many industries, such as the Payment Card Industry Data Security Standard (PCI DSS) for financial services and Health Insurance Portability and Accountability Act (HIPAA) for healthcare [ 76 ]. Furthermore, two of the biggest regulations regarding consumer privacy (i.e., California’s Consumer Privacy Act (CCPA) and the European Union’s General Data Protection Regulation (GDPR)) restrict the ownership, use, and management of personal and private data.
  • A data lake design must include metadata storage functionality to help users search and learn about the data sets in the lake [ 77 ]. A data lake stores all data independent of any fixed schema ; instead, data are read at processing time and parsed into a schema only as necessary (schema-on-read; see the sketch after this list). This feature saves enterprises plenty of up-front modeling time.
  • Architecture in motion is another interesting concept (i.e., the architecture will likely include more than one data lake and must be adaptable to address changing requirements). For instance, on-premises work with Hadoop could be moved to the cloud or a hybrid platform in the future. By facilitating the innovation of multi-cloud storage, a data lake can be easily upgraded to be used across data centers, on premises, and in private clouds. In addition, machine learning and automation can augment the data flow capabilities of an enterprise’s data lake design.
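A minimal sketch of schema-on-read follows: heterogeneous JSON records are stored as-is, and a schema is imposed only at query time, for only the fields the analysis needs. The record contents, field names, and `Reading` dataclass are illustrative assumptions.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Records land in the lake as-is; no schema is enforced on write.
RAW_LINES = [
    '{"id": 1, "temp_c": 21.5, "city": "Berlin"}',
    '{"id": 2, "city": "Paris"}',                       # missing field is fine on write
    '{"id": 3, "temp_c": "18.0", "city": "Rome", "extra": true}',
]

@dataclass
class Reading:
    """Schema applied on read, only for the fields this analysis needs."""
    id: int
    temp_c: Optional[float]
    city: str

def read_with_schema(lines):
    for line in lines:
        rec = json.loads(line)
        raw_temp = rec.get("temp_c")
        yield Reading(
            id=int(rec["id"]),
            temp_c=float(raw_temp) if raw_temp is not None else None,
            city=str(rec["city"]),
        )

for r in read_with_schema(RAW_LINES):
    print(r)
```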

5. Tools and Utilities

5.1. Popular Data Warehouse Tools and Services

  • Amazon Web Services (AWS) data warehouse tools : AWS is one of the major leaders in data warehousing solutions [ 78 ] ( https://aws.amazon.com/training/classroom/data-warehousing-on-aws/ , accessed on 25 September 2022). AWS offers many services, such as AWS Redshift, AWS S3, and Amazon RDS, making it a very cost-effective and highly scalable platform. AWS Redshift suits businesses that require very advanced capabilities and exploit high-end tools [ 79 ], particularly those with an in-house team able to navigate AWS's extensive menu of services. Amazon Simple Storage Service (AWS S3) is a low-cost storage solution with industry-leading scalability, performance, and security features. Amazon Relational Database Service (Amazon RDS) is an AWS cloud data storage service that runs and scales a relational database; it offers resizable, cost-effective capacity for an industry-standard relational database and manages all database administration activities.
  • Google data warehouse tools : Google is highly acclaimed for its data management skills along with its dominance as a search engine ( https://cloud.google.com , accessed on 25 September 2022). Google's data warehouse tools ( https://research.google/research-areas/data-management/ , accessed on 25 September 2022) excel in cutting-edge data management and analytics by incorporating machine intelligence. Google BigQuery is a business-level, cloud-based data warehousing platform designed to store and query large data sets, running SQL searches against multi-terabyte data sets in seconds and offering customers real-time insights (a brief client-code sketch follows this list). Google Cloud Data Fusion is a fully managed cloud ETL solution that allows data integration at any size through a visual point-and-click interface. Dataflow is another cloud-based data-processing service that can stream data in batches or in real time, and Google Data Studio turns data into fully customizable, easy-to-read reports and dashboards.
  • Microsoft Azure Data Warehouse tools : Microsoft Azure is a recent cloud computing platform that provides Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) as well as 200+ products and cloud services [ 80 ] ( https://azure.microsoft.com/en-in/ , accessed on 25 September 2022). Azure SQL Database is suitable for data warehousing applications with up to 8 TB of data volume and a large number of active users, facilitating advanced query processing. Azure Synapse Analytics consists of data integration, big data analytics, and enterprise data warehousing capabilities by also integrating machine learning technologies.
  • Oracle Autonomous Data Warehouse : Oracle Autonomous Data Warehouse [ 81 ] is a cloud-based data warehouse service that manages the complexities associated with data warehouse development, data protection, and data application development. The setup, safeguarding, regulation, and backup of data are all automated by this technology, which is easy to use, secure, responsive, and scalable.
  • Snowflake : Snowflake [ 82 ] is a cloud-based data warehouse tool offering a quick, easy-to-use, and adaptable platform ( https://www.snowflake.com , accessed on 25 September 2022). It has a comprehensive Software as a Service (SaaS) architecture, since it runs entirely in the cloud. This simplifies data processing by letting users work in a single language, SQL, for data blending, analysis, and transformation across a variety of data types. Snowflake's multi-tenant design enables real-time data exchange throughout the enterprise without relocating data.
  • IBM Data Warehouse tools : IBM is a preferred choice for large business clients due to its huge install base, vertical data models, various data management solutions, and real-time analytics ( https://www.ibm.com/in-en/analytics , accessed on 25 September 2022). One DW tool (i.e., IBM DB2 Warehouse ) is a cloud DW that enables self-scaling data storage and processing and deployment flexibility. Another tool is IBM Datastage , which can take data from a source system, transform it, and feed it into a target system. This enables the users to merge data from several corporate systems using either an on-premises or cloud-based parallel architecture.
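As an illustration of how such warehouse services are typically driven from code, the sketch below issues an aggregate query through the Google BigQuery Python client. The project, dataset, and table names are hypothetical, and the client library (google-cloud-bigquery) plus application-default credentials are assumed to be installed and configured.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Assumes application-default credentials; project/table names are hypothetical.
client = bigquery.Client(project="my-analytics-project")

query = """
    SELECT country, SUM(revenue) AS total_revenue
    FROM `my-analytics-project.sales.orders`
    GROUP BY country
    ORDER BY total_revenue DESC
    LIMIT 10
"""

# The query runs inside the warehouse engine; only the result rows come back.
for row in client.query(query).result():
    print(row["country"], row["total_revenue"])
```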

5.2. Popular Data Lake Tools and Services

  • Azure Data Lake : Azure Data Lake makes it easy for developers and data scientists to store data of any size, shape, and speed and conduct all types of processing and analytics across platforms and languages ( https://azure.microsoft.com/en-in/solutions/data-lake/ , accessed on 25 September 2022). It removes the complexities associated with ingesting and storing the data and makes it faster to bring up and execute with batch, streaming, and interactive analytics [ 85 ]. Some of the key features of Azure Data Lake include unlimited scale and data durability, on-par performance even with demanding workloads, high security with flexible mechanisms, and cost optimization through independent scaling of storage.
  • AWS : Amazon Web Services claims to provide “the most secure, scalable, comprehensive, and cost-effective portfolio of services for customers to build their data lake in the cloud” ( https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc , accessed on 25 September 2022). AWS Lake Formation helps set up a secure data lake that can collect and catalog data from databases and object storage, move the data into a new Amazon Simple Storage Service (S3) data lake, and clean and classify the data using ML algorithms (a brief sketch of landing data in S3 follows this list). It offers the scalability, agility, and flexibility that companies require to fuse data and analytics approaches. AWS customers include Netflix, Zillow, NASDAQ, Yelp, and iRobot.
  • Google BigLake : BigLake is a storage engine that unifies data warehouses and lakes ( https://cloud.google.com/biglake , accessed on 25 September 2022). It removes the need to duplicate or move data, thus making the system efficient and cost-effective. BigLake provides detailed access controls and performance acceleration across BigQuery and multi-cloud data lakes, with open formats to ensure a unified, flexible, and cost-effective lakehouse architecture. The top features of BigLake include (1) users being able to enforce consistent access controls across most analytics engines with a single copy of data and (2) unified governance and management at scale. Users can extend BigQuery to multi-cloud data lakes and open formats with fine-grained security controls without setting up a new infrastructure.
  • Cloudera : Cloudera SDX is a data lake service for creating safe, secure, and governed data lakes with protective rings around the data wherever they are stored, from object stores to the Hadoop Distributed File System (HDFS) ( https://www.cloudera.com , accessed on 25 September 2022). It provides the capabilities needed for (1) data schema and metadata information, (2) metadata governance and management, (3) data access authorization and authentication, and (4) compliance-ready access auditing.
  • Snowflake : Snowflake’s cross-cloud platform breaks down silos and enables a data lake strategy ( https://www.snowflake.com/workloads/data-lake/ , accessed on 25 September 2022). Data scientists, analysts, and developers can seamlessly leverage governed data self-service for a variety of workloads. The key features of Snowflake include (1) all data on one platform that combines structured, semi-structured, and unstructured data of any format across clouds and regions, (2) fast, reliable processing and querying, simplifying the architecture with an elastic engine to power many workloads, and (3) secure collaboration via easy integration of external data without ETL.
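To show what landing data in a cloud object-store lake looks like in practice, here is a brief sketch using the AWS SDK for Python (boto3). The bucket name and key layout are hypothetical, and credentials are assumed to be configured in the environment.

```python
import boto3  # pip install boto3

s3 = boto3.client("s3")  # credentials resolved from the environment

# Hypothetical raw-zone key layout, partitioned by source and ingestion date.
bucket = "acme-data-lake"
key = "raw/clickstream/2022/09/25/events.json"

# Land the file untouched in the raw zone of the lake.
s3.upload_file("events.json", bucket, key)

# List what has landed under today's partition.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="raw/clickstream/2022/09/25/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```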

6. Challenges

6.1. Challenges in Big Data Analytics

6.2. Data Warehouse Implementation Challenges

6.3. Data Lake Implementation Challenges

7. Opportunities and Future Directions

7.1. Data Warehouses: Opportunities and Future Directions

  • All the data are accessible from a single location;
  • The capability to outsource the task of maintaining the service's high availability for all customers;
  • Governance based on policies;
  • Platforms with high user experience (UX) discoverability;
  • Platforms that cater to all customers.

7.2. Data Lakes: Opportunities and Future Directions

8. Conclusions

Author Contributions

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

  • Tsai, C.W.; Lai, C.F.; Chao, H.C.; Vasilakos, A.V. Big data analytics: A survey. J. Big Data 2015, 2, 21.
  • Big Data—Statistics & Facts. Available online: https://www.statista.com/topics/1464/big-data/ (accessed on 27 October 2022).
  • Wise, J. Big Data Statistics 2022: Facts, Market Size & Industry Growth. Available online: https://earthweb.com/big-data-statistics/ (accessed on 27 October 2022).
  • Jain, A. The 5 V’s of Big Data. 2016. Available online: https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data/ (accessed on 27 October 2022).
  • Gandomi, A.; Haider, M. Beyond the hype: Big data concepts, methods, and analytics. Int. J. Inf. Manag. 2015, 35, 137–144.
  • Sun, Z.; Zou, H.; Strang, K. Big Data Analytics as a Service for Business Intelligence. In Open and Big Data Management and Innovation; Springer International Publishing: Cham, Switzerland, 2015; Volume 9373, pp. 200–211.
  • Big Data and Analytics Services Global Market Report. Available online: https://www.reportlinker.com/p06246484/Big-Data-and-Analytics-Services-Global-Market-Report.html (accessed on 27 October 2022).
  • BI & Analytics Software Market Value Worldwide 2019–2025. Available online: https://www.statista.com/statistics/590054/worldwide-business-analytics-software-vendor-market/ (accessed on 27 October 2022).
  • Kumar, S. What Is a Data Repository and What Is it Used for? 2019. Available online: https://stealthbits.com/blog/what-is-a-data-repository-and-what-is-it-used-for/ (accessed on 27 October 2022).
  • Khine, P.P.; Wang, Z.S. Data lake: A new ideology in big data era. ITM Web Conf. 2018, 17, 03025.
  • Arif, M.; Mujtaba, G. A Survey: Data Warehouse Architecture. Int. J. Hybrid Inf. Technol. 2015, 8, 349–356.
  • El Aissi, M.E.M.; Benjelloun, S.; Loukili, Y.; Lakhrissi, Y.; Boushaki, A.E.; Chougrad, H.; Elhaj Ben Ali, S. Data Lake Versus Data Warehouse Architecture: A Comparative Study. In WITS 2020; Bennani, S., Lakhrissi, Y., Khaissidi, G., Mansouri, A., Khamlichi, Y., Eds.; Springer: Singapore, 2022; Volume 745, pp. 201–210.
  • Rehman, K.U.U.; Ahmad, U.; Mahmood, S. A Comparative Analysis of Traditional and Cloud Data Warehouse. VAWKUM Trans. Comput. Sci. 2018, 6, 34–40.
  • Devlin, B.A.; Murphy, P.T. An architecture for a business and information system. IBM Syst. J. 1988, 27, 60–80.
  • Garani, G.; Chernov, A.; Savvas, I.; Butakova, M. A Data Warehouse Approach for Business Intelligence. In Proceedings of the 2019 IEEE 28th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), Napoli, Italy, 12–14 June 2019; pp. 70–75.
  • Gupta, V.; Singh, J. A Review of Data Warehousing and Business Intelligence in different perspective. Int. J. Comput. Sci. Inf. Technol. 2014, 5, 8263–8268.
  • Sagiroglu, S.; Sinanc, D. Big data: A review. In Proceedings of the 2013 International Conference on Collaboration Technologies and Systems (CTS), San Diego, CA, USA, 20–24 May 2013; pp. 42–47.
  • Miloslavskaya, N.; Tolstoy, A. Application of Big Data, Fast Data, and Data Lake Concepts to Information Security Issues. In Proceedings of the 2016 IEEE 4th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW), Vienna, Austria, 22–24 August 2016; pp. 148–153.
  • Giebler, C.; Stach, C.; Schwarz, H.; Mitschang, B. BRAID—A Hybrid Processing Architecture for Big Data. In Proceedings of the 7th International Conference on Data Science, Technology and Applications, Porto, Portugal, 26–28 July 2018; pp. 294–301.
  • Lin, J. The Lambda and the Kappa. IEEE Internet Comput. 2017, 21, 60–66.
  • Devlin, B. Thirty Years of Data Warehousing—Part 1. 2020. Available online: https://www.irmconnects.com/thirty-years-of-data-warehousing-part-1/ (accessed on 27 October 2022).
  • Inmon, W.H. Building the Data Warehouse, 4th ed.; Wiley Publishing: Indianapolis, IN, USA, 2005.
  • Chandra, P.; Gupta, M.K. Comprehensive survey on data warehousing research. Int. J. Inf. Technol. 2018, 10, 217–224.
  • Simões, D.M. Enterprise Data Warehouses: A conceptual framework for a successful implementation. In Proceedings of the Canadian Council for Small Business & Entrepreneurship Annual Conference, Calgary, AB, Canada, 28–30 October 2010.
  • Al-Debei, M.M. Data Warehouse as a Backbone for Business Intelligence: Issues and Challenges. Eur. J. Econ. Financ. Adm. Sci. 2011, 33, 153–166.
  • Report by Market Research Future (MRFR). Available online: https://finance.yahoo.com/news/data-warehouse-dwaas-market-predicted-153000649.html (accessed on 27 October 2022).
  • Chaudhuri, S.; Dayal, U. An overview of data warehousing and OLAP technology. ACM Sigmod Rec. 1997, 26, 65–74.
  • Codd, E.F.; Codd, S.B.; Salley, C.T. Providing OLAP to User-Analysts: An IT Mandate; Codd & Associates: Ladera Ranch, CA, USA, 1993; pp. 1–26.
  • The Best Applications of Data Warehousing. 2020. Available online: https://datachannel.co/blogs/best-applications-of-data-warehousing/ (accessed on 27 October 2022).
  • Hai, R.; Quix, C.; Jarke, M. Data lake concept and systems: A survey. arXiv 2021, arXiv:2106.09592.
  • Zagan, E.; Danubianu, M. Data Lake Approaches: A Survey. In Proceedings of the 2020 International Conference on Development and Application Systems (DAS), Suceava, Romania, 21–23 May 2020; pp. 189–193.
  • Cherradi, M.; El Haddadi, A. Data Lakes: A Survey Paper. In Innovations in Smart Cities Applications; Ben Ahmed, M., Boudhir, A.A., Karaș, R., Jain, V., Mellouli, S., Eds.; Lecture Notes in Networks and Systems; Springer International Publishing: Cham, Switzerland, 2022; Volume 5, pp. 823–835.
  • Dixon, J. Pentaho, Hadoop, and Data Lakes. 2010. Available online: https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/ (accessed on 27 October 2022).
  • King, T. The Emergence of Data Lake: Pros and Cons. 2016. Available online: https://solutionsreview.com/data-integration/the-emergence-of-data-lake-pros-and-cons/ (accessed on 27 October 2022).
  • Alrehamy, H.; Walker, C. Personal Data Lake with Data Gravity Pull. In Proceedings of the IEEE Fifth International Conference on Big Data and Cloud Computing, Beijing, China, 26–28 August 2015.
  • Yang, Q.; Ge, M.; Helfert, M. Analysis of Data Warehouse Architectures: Modeling and Classification. In Proceedings of the 21st International Conference on Enterprise Information Systems, Heraklion, Greece, 3–5 May 2019; pp. 604–611.
  • Yessad, L.; Labiod, A. Comparative study of data warehouses modeling approaches: Inmon, Kimball and Data Vault. In Proceedings of the 2016 International Conference on System Reliability and Science (ICSRS), Paris, France, 15–18 November 2016; pp. 95–99.
  • Oueslati, W.; Akaichi, J. A Survey on Data Warehouse Evolution. Int. J. Database Manag. Syst. 2010, 2, 11–24.
  • Ali, F.S.E. A Survey of Real-Time Data Warehouse and ETL. Int. J. Sci. Eng. Res. 2014, 5, 3–9.
  • Aftab, U.; Siddiqui, G.F. Big Data Augmentation with Data Warehouse: A Survey. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 2785–2794.
  • Alsqour, M.; Matouk, K.; Owoc, M. A survey of data warehouse architectures—Preliminary results. In Proceedings of the Federated Conference on Computer Science and Information Systems, Wroclaw, Poland, 9–12 September 2012; pp. 1121–1126.
  • Rizzi, S.; Abelló, A.; Lechtenbörger, J.; Trujillo, J. Research in data warehouse modeling and design: Dead or alive? In Proceedings of the 9th ACM International Workshop on Data Warehousing and OLAP, DOLAP ’06, Arlington, VA, USA, 10 November 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 3–10.
  • Maccioni, A.; Torlone, R. KAYAK: A Framework for Just-in-Time Data Preparation in a Data Lake. In Advanced Information Systems Engineering; Krogstie, J., Reijers, H.A., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; pp. 474–489.
  • Gao, Y.; Huang, S.; Parameswaran, A. Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets. In Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018; ACM: Houston, TX, USA, 2018; pp. 943–958.
  • Astriani, W.; Trisminingsih, R. Extraction, Transformation, and Loading (ETL) Module for Hotspot Spatial Data Warehouse Using Geokettle. Procedia Environ. Sci. 2016, 33, 626–634.
  • Halevy, A.V.; Korn, F.; Noy, N.F.; Olston, C.; Polyzotis, N.; Roy, S.; Whang, S.E. Managing Google’s data lake: An overview of the Goods system. IEEE Data Eng. Bull. 2016, 39, 5–14.
  • Dehne, F.; Robillard, D.; Rau-Chaplin, A.; Burke, N. VOLAP: A Scalable Distributed System for Real-Time OLAP with High Velocity Data. In Proceedings of the 2016 IEEE International Conference on Cluster Computing (CLUSTER), Taipei, Taiwan, 13–15 September 2016; pp. 354–363.
  • Hurtado, C.A.; Gutierrez, C.; Mendelzon, A.O. Capturing summarizability with integrity constraints in OLAP. ACM Trans. Database Syst. 2005, 30, 854–886.
  • Farid, M.; Roatis, A.; Ilyas, I.F.; Hoffmann, H.F.; Chu, X. CLAMS: Bringing Quality to Data Lakes. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, San Francisco, CA, USA, 26 June–1 July 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 2089–2092.
  • Zhang, Y.; Ives, Z.G. Juneau: Data lake management for Jupyter. Proc. VLDB Endow. 2019, 12, 1902–1905.
  • Zhu, E.; Deng, D.; Nargesian, F.; Miller, R.J. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD ’19, Amsterdam, The Netherlands, 30 June–5 July 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 847–864.
  • Beheshti, A.; Benatallah, B.; Nouri, R.; Chhieng, V.M.; Xiong, H.; Zhao, X. CoreDB: A Data Lake Service. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, Singapore, 6–10 November 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 2451–2454.
  • Hai, R.; Geisler, S.; Quix, C. Constance: An Intelligent Data Lake System. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, San Francisco, CA, USA, 26 June–1 July 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 2097–2100.
  • Ahmed, A.S.; Salem, A.M.; Alhabibi, Y.A. Combining the Data Warehouse and Operational Data Store. In Proceedings of the Eighth International Conference on Enterprise Information Systems, Paphos, Cyprus, 23–27 May 2006; pp. 282–288.
  • Software Architecture: N Tier, 3 Tier, 1 Tier, 2 Tier Architecture. Available online: https://www.appsierra.com/blog/url (accessed on 27 October 2022).
  • Han, S.W. Three-Tier Architecture for Sentinel Applications and Tools: Separating Presentation from Functionality. Ph.D. Thesis, University of Florida, Gainesville, FL, USA, 1997.
  • What Is Three-Tier Architecture. Available online: https://www.ibm.com/in-en/cloud/learn/three-tier-architecture (accessed on 27 October 2022).
  • Phaneendra, S.V.; Reddy, E.M. Big Data—Solutions for RDBMS Problems—A Survey. Int. J. Adv. Res. Comput. Commun. Eng. 2013, 2, 3686–3691.
  • Simitsis, A.; Vassiliadis, P.; Sellis, T. Optimizing ETL processes in data warehouses. In Proceedings of the 21st International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 5–8 April 2005; pp. 564–575.
  • Prasser, F.; Spengler, H.; Bild, R.; Eicher, J.; Kuhn, K.A. Privacy-enhancing ETL-processes for biomedical data. Int. J. Med. Inform. 2019, 126, 72–81.
  • Rousidis, D.; Garoufallou, E.; Balatsoukas, P.; Sicilia, M.A. Metadata for Big Data: A preliminary investigation of metadata quality issues in research data repositories. Inf. Serv. Use 2014, 34, 279–286.
  • Mailvaganam, H. Introduction to OLAP—Slice, Dice and Drill! Data Warehousing Review, 2007. Available online: https://web.archive.org/web/20180928201202/http://dwreview.com/OLAP/Introduction_OLAP.html (accessed on 25 September 2022).
  • Pendse, N. What is OLAP? Available online: https://dssresources.com/papers/features/pendse04072002.htm (accessed on 27 October 2022).
  • Xu, J.; Luo, Y.Q.; Zhou, X.X. Solution for Data Growth Problem of MOLAP. Appl. Mech. Mater. 2013, 321–324, 2551–2556.
  • Dehne, F.; Eavis, T.; Rau-Chaplin, A. Parallel multi-dimensional ROLAP indexing. In Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), Tokyo, Japan, 12–15 May 2003; pp. 86–93.
  • Shvachko, K.; Kuang, H.; Radia, S.; Chansler, R. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA, 3–7 May 2010; pp. 1–10.
  • Luo, Z.; Niu, L.; Korukanti, V.; Sun, Y.; Basmanova, M.; He, Y.; Wang, B.; Agrawal, D.; Luo, H.; Tang, C.; et al. From Batch Processing to Real Time Analytics: Running Presto® at Scale. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 1598–1609.
  • Sethi, R.; Traverso, M.; Sundstrom, D.; Phillips, D.; Xie, W.; Sun, Y.; Yegitbasi, N.; Jin, H.; Hwang, E.; Shingte, N.; et al. Presto: SQL on Everything. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1802–1813.
  • Kinley, J. The Lambda Architecture: Principles for Architecting Realtime Big Data Systems. 2013. Available online: http://jameskinley.tumblr.com/post/37398560534/thelambda-architecture-principles-for (accessed on 27 October 2022).
  • Ferrera Bertran, P. Lambda Architecture: A State-of-the-Art. Datasalt, 17 January 2014. Available online: https://github.com/pereferrera/trident-lambda-splout (accessed on 25 September 2022).
  • Carbone, P.; Katsifodimos, A.; Ewen, S.; Markl, V.; Haridi, S.; Tzoumas, K. Apache Flink™: Stream and Batch Processing in a Single Engine. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. 2015, 36, 28–38.
  • Kreps, J. Questioning the Lambda Architecture. 2014. Available online: https://www.oreilly.com/radar/questioning-the-lambda-architecture/ (accessed on 27 October 2022).
  • Data Vault vs Star Schema vs Third Normal Form: Which Data Model to Use? Available online: https://www.matillion.com/resources/blog/data-vault-vs-star-schema-vs-third-normal-form-which-data-model-to-use (accessed on 27 October 2022).
  • Patranabish, D. Data Lakes: The New Enabler of Scalability in Cross Channel Analytics—Tech-Talk by Durjoy Patranabish | ET CIO. Available online: http://cio.economictimes.indiatimes.com/tech-talk/data-lakes-the-new-enabler-of-scalability-in-cross-channel-analytics/585 (accessed on 27 October 2022).
  • Nargesian, F.; Zhu, E.; Miller, R.J.; Pu, K.Q.; Arocena, P.C. Data lake management: Challenges and opportunities. Proc. VLDB Endow. 2019, 12, 1986–1989.
  • A Brief Look at 4 Major Data Compliance Standards: GDPR, HIPAA, PCI DSS, CCPA. Available online: https://www.pentasecurity.com/blog/4-data-compliance-standards-gdpr-hipaa-pci-dss-ccpa/ (accessed on 27 October 2022).
  • Sawadogo, P.; Darmont, J. On data lake architectures and metadata management. J. Intell. Inf. Syst. 2021, 56, 97–120.
  • Overview of Amazon Web Services: AWS Whitepaper. 2022. Available online: https://d1.awsstatic.com/whitepapers/aws-overview.pdf (accessed on 27 October 2022).
  • Pandis, I. The evolution of Amazon Redshift. Proc. VLDB Endow. 2021, 14, 3162–3174.
  • Microsoft Azure Documentation. Available online: http://azure.microsoft.com/en-us/documentation/ (accessed on 27 October 2022).
  • Automate Your Data Warehouse. Available online: https://www.oracle.com/autonomous-database/autonomous-data-warehouse/ (accessed on 27 October 2022).
  • Dageville, B.; Cruanes, T.; Zukowski, M.; Antonov, V.; Avanes, A.; Bock, J.; Claybaugh, J.; Engovatov, D.; Hentschel, M.; Huang, J.; et al. The Snowflake Elastic Data Warehouse. In Proceedings of the 2016 International Conference on Management of Data, San Francisco, CA, USA, 26 June–1 July 2016; ACM: San Francisco, CA, USA, 2016; pp. 215–226.
  • Mathis, C. Data Lakes. Datenbank-Spektrum 2017, 17, 289–293.
  • Zagan, E.; Danubianu, M. Cloud DATA LAKE: The new trend of data storage. In Proceedings of the 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Online, 11–13 June 2021; IEEE: Ankara, Turkey, 2021; pp. 1–4.
  • Ramakrishnan, R.; Sridharan, B.; Douceur, J.R.; Kasturi, P.; Krishnamachari-Sampath, B.; Krishnamoorthy, K.; Li, P.; Manu, M.; Michaylov, S.; Ramos, R.; et al. Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, Chicago, IL, USA, 14–19 May 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 51–63.
  • Elgendy, N.; Elragal, A. Big Data Analytics: A Literature Review Paper. In Advances in Data Mining. Applications and Theoretical Aspects; Perner, P., Ed.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 214–227.
  • Jin, X.; Wah, B.W.; Cheng, X.; Wang, Y. Significance and Challenges of Big Data Research. Big Data Res. 2015, 2, 59–64.
  • Agrawal, R.; Nyamful, C. Challenges of big data storage and management. Glob. J. Inf. Technol. Emerg. Technol. 2016, 6, 1–10.
  • Padgavankar, M.H.; Gupta, S.R. Big Data Storage and Challenges. Int. J. Comput. Sci. Inf. Technol. 2014, 5, 2218–2223.
  • Kadadi, A.; Agrawal, R.; Nyamful, C.; Atiq, R. Challenges of data integration and interoperability in big data. In Proceedings of the 2014 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 27–30 October 2014; IEEE: Washington, DC, USA, 2014; pp. 38–40.
  • Best Data Integration Tools. Available online: https://www.peerspot.com/categories/data-integration-tools (accessed on 27 October 2022).
  • Toshniwal, R.; Dastidar, K.G.; Nath, A. Big Data Security Issues and Challenges. Int. J. Innov. Res. Adv. Eng. 2014, 2, 15–20.
  • Demchenko, Y.; Ngo, C.; de Laat, C.; Membrey, P.; Gordijenko, D. Big Security for Big Data: Addressing Security Challenges for the Big Data Infrastructure. In Secure Data Management; Jonker, W., Petković, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 76–94.
  • Chen, E.T. Implementation issues of enterprise data warehousing and business intelligence in the healthcare industry. Commun. IIMA 2012, 12, 3.
  • Cuzzocrea, A.; Bellatreche, L.; Song, I.Y. Data warehousing and OLAP over big data: Current challenges and future research directions. In Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP, DOLAP ’13, San Francisco, CA, USA, 28 October 2013; Association for Computing Machinery: New York, NY, USA, 2013; pp. 67–70.
  • Singh, R.; Singh, K. A Descriptive Classification of Causes of Data Quality Problems in Data Warehousing. Int. J. Comput. Sci. Issues 2010, 7, 41.
  • Longbottom, C.; Bamforth, R. Optimising the Data Warehouse. 2013. Available online: https://www.it-daily.net/downloads/WP_Optimising-the-data-warehouse.pdf (accessed on 27 October 2022).
  • Santos, R.J.; Bernardino, J.; Vieira, M. A survey on data security in data warehousing: Issues, challenges and opportunities. In Proceedings of the 2011 IEEE EUROCON—International Conference on Computer as a Tool, Lisbon, Portugal, 27–29 April 2011; pp. 1–4.
  • Responsibilities of a Data Warehouse Governance Committee. Available online: https://docs.oracle.com/cd/E29633_01/CDMOG/GUID-7E43F311-4510-4F1E-A17E-693F94BD0EC7.htm (accessed on 28 October 2022).
  • Gupta, S.; Giri, V. Practical Enterprise Data Lake Insights: Handle Data-Driven Challenges in an Enterprise Big Data Lake, 1st ed.; Apress: Berkeley, CA, USA, 2018.
  • Giebler, C.; Gröger, C.; Hoos, E.; Schwarz, H.; Mitschang, B. Leveraging the Data Lake: Current State and Challenges. In Big Data Analytics and Knowledge Discovery; Ordonez, C., Song, I.Y., Anderst-Kotsis, G., Tjoa, A.M., Khalil, I., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; pp. 179–188.
  • Lock, M. Maximizing Your Data Lake with a Cloud or Hybrid Approach. 2016. Available online: https://technology-signals.com/wp-content/uploads/download-manager-files/maximizingyourdatalake.pdf (accessed on 27 October 2022).
  • Kumar, N. Cloud Data Warehouse Is the Future of Data Storage. 2020. Available online: https://www.sigmoid.com/blogs/cloud-data-warehouse-is-the-future-of-data-storage/ (accessed on 27 October 2022).
  • Kahn, M.G.; Mui, J.Y.; Ames, M.J.; Yamsani, A.K.; Pozdeyev, N.; Rafaels, N.; Brooks, I.M. Migrating a research data warehouse to a public cloud: Challenges and opportunities. J. Am. Med. Inform. Assoc. 2022, 29, 592–600.
  • Mishra, N.; Lin, C.C.; Chang, H.T. A Cognitive Adopted Framework for IoT Big-Data Management and Knowledge Discovery Prospective. Int. J. Distrib. Sens. Netw. 2015, 2015, 1–12.
  • Alserafi, A.; Abelló, A.; Romero, O.; Calders, T. Keeping the Data Lake in Form: DS-kNN Datasets Categorization Using Proximity Mining. In Model and Data Engineering; Schewe, K.D., Singh, N.K., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; pp. 35–49.
  • Bogatu, A.; Fernandes, A.A.A.; Paton, N.W.; Konstantinou, N. Dataset Discovery in Data Lakes. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020; IEEE: Dallas, TX, USA, 2020; pp. 709–720.
  • Armbrust, M.; Ghodsi, A.; Xin, R.; Zaharia, M. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. In Proceedings of the Conference on Innovative Data Systems Research, Virtual Event, 11–15 January 2021.


Table: Data warehouse vs. data lake.

| Parameters | Data Warehouse | Data Lake |
| --- | --- | --- |
| Data | Focuses only on business processes | Stores everything |
| Processing | Highly processed data | Data are mainly unprocessed |
| Type of data | Mostly in tabular form and structured | Can be unstructured, semi-structured, or structured |
| Task | Optimized for data retrieval | Shared data stewardship |
| Agility | Less agile, with a fixed configuration compared with data lakes | Highly agile; can be configured and reconfigured as needed |
| Users | Widely used by business professionals and business analysts | Used by data scientists, data developers, and business analysts |
| Storage | Expensive storage that gives fast response times | Designed for low-cost storage |
| Security | Allows better control of the data | Offers less control |
| Schema | Schema-on-write (predefined schemas) | Schema-on-read (no predefined schemas) |
| Data processing | Time-consuming to introduce new content | Fast ingestion of new data |
| Data granularity | Data at the summary or aggregated level of detail | Data at a low level of detail or granularity |
| Tools | Mostly commercial tools | Can use open-source tools such as Hadoop or MapReduce |
Table: Survey papers on data warehouses and data lakes.

| Topic | Survey Papers | Contributions |
| --- | --- | --- |
| Data warehouse | [ ] | Data warehouse concepts, multilingualism issues in data warehouse design, and solutions |
| Data warehouse | [ ] | Data warehouse architecture modeling and classifications |
| Data warehouse and big data | [ ] | A comprehensive survey on big data, big data analytics, augmentation, and big data warehouses |
| Data warehouse | [ ] | Data warehouse survey |
| Data warehouse | [ ] | Real-time data warehouse and ETL |
| Data warehouse | [ ] | Architectures of data warehouses (DWs) and their selection |
| Data warehouse | [ ] | Data warehouse (DW) evolution |
| Data warehouse | [ ] | Data warehouse modeling and design |
| Data warehouse | [ ] | Comparative study on data warehouse architectures |
| Data lake | [ ] | A survey on designing, implementing, and applying data lakes |
| Data lake | [ ] | Recent approaches and architectures using data lakes |
| Data lake | [ ] | Overview of data lake definitions, architectures, and technologies |
| Data lake vs. data warehouse | [ ] | Explores the two architectures of data warehouses and data lakes |
| Systems or Topic Area | Data Warehouse | Data Lake | Function or Work Performed | Reference |
| --- | --- | --- | --- | --- |
| OLAP | ✓ | | Online analytical processing (OLAP) | Providing OLAP to User-Analysts: An IT Mandate [ ] |
| GEMMS | | ✓ | Metadata extraction, metadata modeling | Metadata Extraction and Management in Data Lakes with GEMMS [ ] |
| KAYAK | | ✓ | Dataset preparation and organization | KAYAK: A Framework for Just-in-Time Data Preparation in a Data Lake [ ] |
| DWHA | ✓ | | Modeling and classification of DWs | Analysis of Data Warehouse Architectures: Modelling and Classification [ ] |
| DATAMARAN | | ✓ | Metadata extraction | Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets [ ] |
| Geokettle | ✓ | | Data warehouse architecture, design, and testing | Extraction, Transformation, and Loading (ETL) Module for Hotspot Spatial Data Warehouse Using Geokettle [ ] |
| GOODS | | ✓ | Dataset preparation and organization, metadata enrichment | Managing Google's Data Lake: An Overview of the Goods System [ ] |
| VOLAP | ✓ | | OLAP, query processing, and optimization | VOLAP: A Scalable Distributed System for Real-Time OLAP with High-Velocity Data [ ] |
| Dimension constraints | ✓ | | Multidimensional data modeling, OLAP, query processing, and optimization | Capturing Summarizability with Integrity Constraints in OLAP [ ] |
| CLAMS | | ✓ | Data quality improvement | CLAMS: Bringing Quality to Data Lakes [ ] |
| Juneau | | ✓ | Dataset preparation and organization, discovery of related data sets, and query-driven data discovery | Juneau: Data Lake Management for Jupyter [ ] |
| JOSIE | | ✓ | Discovery of related data sets and query-driven data discovery | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes [ ] |
| CoreDB | | ✓ | Metadata enrichment and querying of heterogeneous data | CoreDB: A Data Lake Service [ ] |
| Constance | | ✓ | Unified interface for query processing and data exploration | Constance: An Intelligent Data Lake System [ ] |
| ODS | ✓ | | Operational data store | Combining the Data Warehouse and Operational Data Store [ ] |


Toward data lakes as central building blocks for data management and analysis

Data lakes are a fundamental building block for many industrial data analysis solutions and are becoming increasingly popular in research. Often associated with big data use cases, data lakes are, for example, used as central data management systems of research institutions or as the core entity of machine learning pipelines. The basic underlying idea of retaining data in its native format within a data lake facilitates a large range of use cases and improves data reusability, especially when compared to the schema-on-write approach applied in data warehouses, where data is transformed prior to the actual storage to fit a predefined schema. Storing such massive amounts of raw data, however, comes with its very own challenges, spanning from general data modeling and indexing for concise querying to the integration of suitable and scalable compute capabilities. In this contribution, influential papers of the last decade have been selected to provide a comprehensive overview of developments and obtained results. The papers are analyzed with regard to the applicability of their input to data lakes that serve as central data management systems of research institutions. To achieve this, contributions to data lake architectures, metadata models, data provenance, workflow support, and FAIR principles are investigated. Last but not least, these capabilities are mapped onto the requirements of two common research personas to identify open challenges. With that, potential research topics are determined that have to be tackled to make data lakes applicable as central building blocks for research data management.

1. Introduction

In recent years, data lakes have become increasingly popular in various industrial and academic domains. In particular for academia, data lakes come with the promise to provide solutions for several data management challenges at once. Similar to Data Warehouses (Devlin and Murphy, 1988 ; Inmon, 2005 ), data lakes aim at integrating heterogeneous data from different sources into a single, homogeneous data management system. This allows data holders to overcome the limits of disparate and isolated data silos and enforce uniform data governance.

Data Warehouses have a fixed schema, which implies a so-called schema-on-write approach to feed data into them. Extract-Transform-Load (ETL) processes are therefore needed to extract the raw data from its source, transform it, e.g., to clean it or to fit it into the predefined schema, and then load it into the Data Warehouse (El-Sappagh et al., 2011). Although there are some known challenges when using these ETL processes (Munappy et al., 2020), the main drawback is the loss of information during the transformation to fit data into the fixed schema. To prevent this information loss, which limits the reuse of the data, e.g., for research questions outside the original scope, James Dixon proposed the data lake concept in Dixon (2010). Here, in contrast to the schema-on-write approach of a Data Warehouse, data is retained in its original format and a schema is only inferred when a subsequent process reads the data, an approach which is termed schema-on-read.
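The practical difference between the two approaches can be made concrete with a small sketch. The following Python snippet, with invented file contents and column names, contrasts schema-on-write, where a predefined schema decides at load time what survives, with schema-on-read, where the raw data is kept and a schema is only inferred by the consumer:

```python
import io
import sqlite3

import pandas as pd

# Schema-on-write (data warehouse style): the target schema is fixed up front,
# and every record has to be transformed to fit it before loading.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sensor (ts TEXT, temperature REAL)")  # predefined schema
raw_row = {"ts": "2023-05-01T12:00:00", "temperature": "21.4", "note": "door open"}
# The free-text "note" field has no column and is silently dropped, which is
# exactly the kind of information loss the data lake concept tries to avoid.
con.execute("INSERT INTO sensor VALUES (?, ?)",
            (raw_row["ts"], float(raw_row["temperature"])))

# Schema-on-read (data lake style): the raw file is stored untouched, and a
# schema is only inferred at the moment a consumer reads the data.
raw_csv = "ts,temperature,note\n2023-05-01T12:00:00,21.4,door open\n"
df = pd.read_csv(io.StringIO(raw_csv))  # stands in for a raw file in the lake
print(df.dtypes)  # column types are inferred now, not at ingestion time
```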

The necessity for low-cost and highly scalable mass storage with the ability to be integrated into parallelized computations was recognized as a key feature already at the advent of data lakes, leading to a close connection between data lakes and Apache Hadoop (Khine and Wang, 2018). This approach was at some point challenged by large cloud providers like Amazon or Microsoft and their proprietary data lake solutions like AWS Lake Formation or Azure Data Lake (Hukkeri et al., 2020; Aundhkar and Guja, 2021). These products introduced, among other features, the separation of storage and compute and offered customers well-known cloud features such as the pay-as-you-go payment model.

Although a data lake implements schema-on-read semantics, some modeling is mandatory to ensure proper data integration, comprehensibility, and quality (Hai et al., 2018). Such data modeling typically consists of a conceptual model, which should facilitate frequent changes and therefore should not enforce a fixed schema (Mathis, 2017; Khine and Wang, 2018). The required metadata can be gathered by extracting descriptive information from the data itself, for instance by reading out header information, or additional metadata can be extracted from the source along with the original raw data. In addition, data can be continuously enriched with metadata during its lifetime in the data lake, for instance by identifying relationships among the different data sets (Hai et al., 2016; Sawadogo et al., 2019) or by auditing provenance information.

A considerable body of literature exists on the use of data lakes in different industries (Terrizzano et al., 2015; Golec, 2019; Hukkeri et al., 2020), particularly with the intent to manage big amounts of data (Miloslavskaya and Tolstoy, 2016). However, there is also huge potential for the adoption of data lakes in research institutions. One benefit is, for example, that data silos, which quickly arise when different research teams work independently, can be prevented or integrated. This also enables novel analysis approaches across a homogeneous data set, which are not possible with distributed and isolated data silos. Another advantage is that common data governance can be enforced on an overarching level, like an institute or a research project, to guarantee a predefined data quality level and to assist researchers in adhering to good scientific practices.

When scaling out the usage of a data lake across an entire research institution to serve as the central research data management system, one encounters different use cases and users with a diverse skill set. In this paper, we explore the current state of the art of data lakes and, based on this survey, analyze the applicability of the presented works in the context of a large research institution. For this, papers were collected that make unique contributions to at least one of the topics presented below. We start in Section 2 with a discussion of existing data lake architectures, which offers an overview of the highest level of organization and abstraction of a given implementation. In the following Section 3, different metadata models are presented; these lie conceptually one layer below the general architecture and ensure the correct data organization in the data lake, which involves semantic information about the data itself as well as metadata describing the relationships among the data. One of, if not the, most important relationships to be modeled in a data lake is data lineage, which is discussed in detail in Section 4. Closely related to provenance auditing is the general ability to perform automated data analytics workflows on the data lake, ideally in a scalable manner, which is discussed in Section 5. In Section 6.2, two disparate data lake users, i.e., a domain researcher and a data scientist, are used to perform an applicability analysis of the previously presented works. In addition, a comparison based on common as well as topic-specific criteria extends the generic applicability analysis. Based on the general survey of each of these topics, future challenges are identified.

2. Data lake architectures

To date, a lot of development and analysis has been conducted in the area of data lake architectures, where the so-called zone architecture (Patel et al., 2017; Ravat and Zhao, 2019), including the pond architecture (Inmon, 2016), became the most cited and used. These architectures have already been surveyed by Hai et al. (2021) and Sawadogo and Darmont (2021); both proposed a functional architecture, and Sawadogo and Darmont (2021) additionally derived a maturity-based and a hybrid architecture. These surveys, however, did not include recent works like the definition of a zone reference model (Giebler et al., 2020) or a data lake architecture based on FAIR Digital Objects (FDOs) (Nolte and Wieder, 2022).

2.1. Definition of the term data lake architecture

The term data lake architecture was defined by Giebler et al. (2021) to represent the comprehensive design of a data lake, including the infrastructure, data storage, data flow, data modeling, data organization, data processes, metadata management, data security and privacy, and data quality. In this data lake architecture framework, only data security and privacy and data quality are considered to be purely conceptual, whereas the other aspects include a conceptual and a physical, i.e., system-specific, dimension. As stated more generically by Madera and Laurent (2016), a data lake generally has a logical and a physical organization. In this paper, we refer to the term data lake architecture only with respect to the conceptual organization of a data lake at the highest level of abstraction, since this makes this work more comparable to the existing literature, although there is a strong dependency on other aspects of a data lake, like the metadata modeling.

2.2. Zone architecture

The general idea of dividing a data lake into different zones arises from the necessity to automatically run standardized pre-processing pipelines, organize the resulting pre-processed data, and make it available to subsequent processing steps, like reporting, Online Analytical Processing (OLAP), and particularly advanced analytics. This is achieved by assigning data to different zones based on the degree of processing and, sometimes, the intended future use case. Therefore, it is common to have a raw data zone, where, according to the original idea of a data lake, data is retained in its raw format to facilitate the repetition of processes or the application of new methods based on the original data. Pre-processed data is then usually collected in a dedicated zone for pre-processed, or refined, data, sometimes called the staging zone (Zikopoulos, 2015) or processing zone (Ravat and Zhao, 2019). Data that requires additional governance can be collected in a dedicated zone of its own, called the trusted zone (Zikopoulos, 2015) or sensitive zone (Gorelik, 2019).

The most extensive analysis of the zone architecture was conducted by Giebler et al. (2020), where five different data lakes based on the zone architecture (Madsen, 2015; Zikopoulos, 2015; Patel et al., 2017; Sharma, 2018; Gorelik, 2019; Ravat and Zhao, 2019) were analyzed with respect to their design differences, specific features, and individual use cases in order to derive a generic meta-model for a zone and to specify a zone reference model based on it. Giebler et al. identified that a zone is uniquely defined by the characteristics of the data contained in that zone, the intrinsic properties a zone enforces on the data, the user groups intended to work in that zone, the modeling approach to organize the corresponding metadata, and the data sources as well as destinations. In the presented zone reference model, Giebler et al. propose to split the zones into a raw zone and a harmonized zone, which are use case-independent, a use case-specific distilled zone, which serves data to the final delivery zone to support reporting and OLAP tasks, and an explorative zone to support advanced analytics. Each zone hosts a protected area for data that requires special governance. The actual implementation and the deployed systems can vary in each zone, including the storage, the metadata model, and the metadata management system itself. This also implies that the user interface potentially changes from zone to zone.

2.3. Lambda architecture

The Lambda Architecture has been proposed to enhance the capability of a data lake to process data streams in near real-time instead of fully ingesting hot data into the data lake and performing batch processing with a certain time delay (Mathis, 2017). However, retaining all raw data in its native format is the core idea of a data lake. In order to resolve this contradiction, the Lambda Architecture (Warren and Marz, 2015) implements two processing streams in parallel: data is processed in near real-time in the speed layer, whereas the batch layer ingests data into the data lake and performs predefined processing steps. Numerous implementations of data lakes utilizing the Lambda Architecture have been proposed (Hasani et al., 2014; Villari et al., 2014; Batyuk and Voityshyn, 2016). In the following, two particular works that build on top of public cloud offerings are presented.
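The interplay of the two streams can be reduced to a minimal, illustrative Python sketch; all names are invented, and real systems implement these layers with frameworks such as Spark, Flume, or Kafka:

```python
raw_archive = []   # master data set: all raw events are retained
speed_view = {}    # speed layer output: incremental, near real-time
batch_view = {}    # batch layer output: exact, recomputed from scratch

def ingest(event: dict) -> None:
    """Every incoming event feeds both layers in parallel."""
    raw_archive.append(event)                      # batch layer: retain raw data
    key = event["sensor"]
    speed_view[key] = speed_view.get(key, 0) + 1   # speed layer: cheap update

def run_batch() -> None:
    """Periodic batch run: recompute the exact view over all raw data."""
    batch_view.clear()
    for event in raw_archive:
        batch_view[event["sensor"]] = batch_view.get(event["sensor"], 0) + 1
    speed_view.clear()  # results of the speed layer are now superseded

def query(key: str) -> int:
    """Serving layer: merge the exact batch view with the recent speed view."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

ingest({"sensor": "s1", "value": 3.2})
ingest({"sensor": "s1", "value": 3.4})
print(query("s1"))  # 2
```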

A Lambda Architecture was used by Munshi and Mohamed (2018) to build a data lake for smart grid data analytics using Google's cloud computing as Infrastructure as a Service (IaaS). Here, the data is collected by a dedicated Data Collecting Layer, in this particular case realized by Apache Flume. From there, the data is sent to the core of this specific data lake implementation, a Hadoop cluster. The master node stores the data on HDFS (Borthakur, 2007) and computes arbitrary, predefined functions using MapReduce (Dean and Ghemawat, 2008). The speed layer is implemented using Apache Spark (Zaharia et al., 2010). The serving layer combines the output of the batch and the speed layer and provides a batch view of the relevant data, using, e.g., Hive (Thusoo et al., 2009), Impala as shown by Li (2014), and Spark SQL (Armbrust et al., 2015).

Similarly, Pérez-Arteaga et al. (2018) compared three different implementations based on the Software as a Service (SaaS) offerings, with a focus on serverless delivery, of three public cloud providers, i.e., Google Cloud Platform, Microsoft Azure, and Amazon Web Services. On AWS, the speed layer accepts data via Kinesis Data Streams and processes it using Kinesis Analytics and AWS Lambda. The results are stored in a dedicated S3-Speed-Bucket. Similarly, the batch layer uses Kinesis Firehose to ingest data into AWS Lambda, from where it is stored in an S3-Batch-Bucket. From there the data is read by AWS Glue and stored in an S3-Result-Bucket. The serving layer is realized by Athena, which reads the data from both the S3-Result-Bucket and the S3-Speed-Bucket. In the Google Cloud implementation, data is ingested by Pub/Sub into the speed and the batch layer, which are both realized using Dataflow. On the batch layer, an additional Datastore is employed to retain the raw incoming datasets. The serving layer uses BigQuery. On Microsoft Azure, data is received and ingested by EventHub. The speed layer uses Stream Analytics and forwards directly into the serving layer, which is Cosmos DB. The batch layer also uses Stream Analytics to store the raw data in Data Lake Store (Ramakrishnan et al., 2017). From there it is read by Data Lake Analytics, which also stores its results in Cosmos DB.

2.4. Lakehouse

Lakehouses, as described by Armbrust et al. (2021), are a consequence of the general observation that in some cases the raw data from a data lake is used as input for an ETL process to populate a data warehouse. The first step toward a more unified setup was provided by Delta Lake (Armbrust et al., 2020), which provides ACID (atomicity, consistency, isolation, durability) transactions for tables stored on cloud object storage. These tables can be accessed from different systems, like Spark, Hive, Presto (Sethi et al., 2019), and others. This approach retains, among other things, the advantage of separating storage and compute. On top of ACID transactions, lakehouses offer direct access to the storage with traditional database semantics, e.g., SQL, using open file formats like Apache Parquet (Vohra, 2016) or ORC. Therefore, a metadata layer on top of the cloud storage can provide convenient SQL-like access to tables, while compute-intensive, non-SQL code, like machine learning, can directly access the files on the storage devices and thereby achieve higher performance.
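As an illustration of the table-plus-open-file-format idea, the following sketch uses the open-source delta-spark package to write and read a Delta table on local storage; it assumes that pyspark and delta-spark are installed, and the table path is arbitrary:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Configure a local Spark session with the Delta Lake extensions; the helper
# from the delta-spark package pulls in the required packages.
builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame([(1, "raw"), (2, "curated")], ["id", "stage"])

# ACID write to a table that is physically stored as Parquet files plus a
# transaction log on plain (object) storage.
df.write.format("delta").mode("overwrite").save("/tmp/demo_delta_table")

# SQL-style access goes through the metadata layer, while non-SQL workloads
# could read the underlying Parquet files directly for higher performance.
spark.read.format("delta").load("/tmp/demo_delta_table").show()
```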

2.5. FAIR digital object-based architecture

Using a FAIR Digital Object-based architecture, as proposed by Nolte and Wieder (2022), the data lake is not divided into different zones but realizes a flat, homogeneous, and uniform research data management from the user's point of view. To allow for the segregation of data with different processing pedigrees, the FAIR Digital Object encapsulating the corresponding data has a certain type that represents the delimitation between different data points. In a simple example, this can mean that in practice there is a Scanner-X Raw data type and a Scanner-X Preprocessed data type. This leads to a much more fine-grained partition of the data lake compared to the zone architecture. This highly segregated data lake, however, does not entail a correlated increase in system complexity and administrative effort, since only one type, or a necessary subset of types, of the FAIR Digital Objects needs to be implemented and can then be inherited from, thereby reusing the existing implementation on the underlying infrastructure. Since everything in this data lake is a FAIR Digital Object, not only data but also workflows and execution environments, the user interface is completely homogeneous: the user interacts with these objects by calling predefined functions. Each of these data objects is equally suited as input for automated workflows or user-defined advanced analytics. The requirement for additional governance or security measures can be defined on a per-object basis and can be globally enforced based on the typed attributes describing the metadata of the FAIR Digital Object.
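The type hierarchy described above can be illustrated with a few Python classes; all class names, attributes, and the validation rule below are invented for illustration and do not reflect the actual implementation by Nolte and Wieder (2022):

```python
from dataclasses import dataclass, field

@dataclass
class FairDigitalObject:
    """Minimal stand-in for a FAIR Digital Object: a persistent identifier,
    typed metadata attributes, and a reference to the actual bit sequence."""
    pid: str
    attributes: dict = field(default_factory=dict)
    data_ref: str = ""

    def validate(self) -> bool:
        # Governance can be enforced globally on the typed attributes.
        return all(key in self.attributes for key in ("creator", "created"))

class ScannerXRaw(FairDigitalObject):
    """Type for raw output of 'Scanner X' (the example type from the text)."""

class ScannerXPreprocessed(ScannerXRaw):
    """Derived type: it inherits the parent's implementation, so data lake
    functionality implemented once for the parent can simply be reused."""

obj = ScannerXRaw("pid/1234",
                  {"creator": "lab-a", "created": "2023-05-01"},
                  "s3://lake/raw/scan-1234")
assert obj.validate()
```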

2.6. Functional and maturity-based architectures

The classification into functional and maturity-oriented data lake architectures does not, unlike the zone, lambda, lakehouse, and FAIR Digital Object-based architectures, represent yet another design concept, but rather serves as an improved way of classifying the different architectural approaches. The goal is to allow for a more modular comparison of existing data lake solutions, to better plan the data life cycle, and to help match the individual functionality of the architectural pieces that build up the data lake, like zones or objects, onto the required infrastructure.

Within a functional architecture classification, the data lake is analyzed with respect to the operations performed on the data as it moves through the general data lake workflow. Hai et al. (2021) define three layers, ingestion, maintenance, and exploration, into which the corresponding functions are then sub-grouped. A similar definition is provided by Sawadogo and Darmont (2021), where the four main components of a data lake are defined as ingestion, storage, processing, and querying.

Following the maturity-based architecture classification, the degree of processing of the data is the central point of consideration. This classification is only helpful for the discrimination and organization of different data sets; it completely lacks consideration of workflows and processing capabilities. Nevertheless, Sawadogo and Darmont (2021) highlight its advantage for planning the data life cycle. Therefore, a hybrid architecture was proposed by Sawadogo and Darmont (2021) alongside the functional and maturity-based classifications. Within this architecture, the individual components are uniquely identified by the data refinement and the possible functionality that can be performed on the specific set of data.

3. Metadata models

Proper metadata management is key to preventing a data lake from turning into a data swamp and is thus the most important component for ensuring continuous operation and usability (Walker and Alrehamy, 2015; Khine and Wang, 2018). Due to the generally flat hierarchy and the requirement to store any data in its native format, there is always the risk of losing the overall comprehensibility of the data lake. This comprehensibility is lost if data cannot be found or the relationship to other data sets cannot be retrieved. One of the most severe consequences is the inability to define concise queries to select the data one is looking for in a fine-grained manner. As a consequence, numerous metadata models and systems tailor-made for usage in data lakes have been proposed. These models and systems originate from different use cases, represent various viewpoints, and therefore differ regarding their feature sets. From this wide variety of available options, a few distinct works have been selected and are discussed in the following sections.

3.1. Data vault

Data modeling in a data vault was proposed by Linstedt in the 1990s and published in the 2000s to allow for a more agile metadata evolution, i.e., the continuous development of the metadata schema, in data warehouses, compared to star or snowflake schemata (Lindstedt and Graziano, 2011). This ensemble modeling traditionally uses relational database systems and combines the third normal form with the star schema. All data is stored in three different types of tables. Hubs describe a business concept, are implemented as lists of unique keys, and can be populated by different data sources. Links describe relationships between the aforementioned hubs. Satellites contain all attributes that describe the properties of a hub or a link. Evolving the data vault over time then mainly implies adding additional satellite tables to links and hubs. Therefore, there is no need to migrate existing tables, which facilitates the continuous addition of metadata over time.
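A minimal sketch of the three table types, expressed as SQLite DDL issued from Python and using invented concepts from a laboratory context, could look as follows:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hub: the list of unique business keys for one business concept.
con.execute("""CREATE TABLE hub_sample (
    sample_hk  TEXT PRIMARY KEY,   -- hash key
    sample_id  TEXT UNIQUE,        -- business key
    load_ts    TEXT,
    record_src TEXT)""")
con.execute("""CREATE TABLE hub_instrument (
    instrument_hk TEXT PRIMARY KEY,
    instrument_id TEXT UNIQUE,
    load_ts TEXT, record_src TEXT)""")

# Link: a relationship between two hubs.
con.execute("""CREATE TABLE link_measurement (
    measurement_hk TEXT PRIMARY KEY,
    sample_hk      TEXT REFERENCES hub_sample(sample_hk),
    instrument_hk  TEXT REFERENCES hub_instrument(instrument_hk),
    load_ts TEXT, record_src TEXT)""")

# Satellite: descriptive attributes of a hub (or link). Schema evolution
# means adding further satellites instead of migrating existing tables.
con.execute("""CREATE TABLE sat_sample_details (
    sample_hk TEXT REFERENCES hub_sample(sample_hk),
    load_ts   TEXT,
    material  TEXT,
    mass_mg   REAL,
    PRIMARY KEY (sample_hk, load_ts))""")
```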

Due to these characteristics of the data vault concept, it was also applied in data lakes. Nogueira et al. (2018) explained the definition of a data vault for a specific use case and discussed the advantages of porting it to a NoSQL database by comparing benchmark results against a SQL database. They also exemplify how new data sources can be added by defining new hubs, links, and particularly satellites. Giebler et al. (2019) proposed to split the one central, data lake-wide data vault into three distinct sub-data vaults: the Raw Vault, the Business Vault, and the Data Mart, whereby the latter does not necessarily need to be modeled as a data vault, but could also be a flat table or a star schema. The authors reported that the agile approach, along with the ability to make incremental updates, serves the needs of a data lake implementation well. However, they pointed out that it can be hard to enforce business rules across the independent sub-data vaults they use, that managing ambiguous keys cannot be fully solved, and that high-frequency data can critically inflate satellites.

3.2. GEMMS

GEMMS is proposed by Quix et al. (2016) as a Generic and Extensible Metadata Management System with a particular focus on scientific data management and, in this context, specifically on the domain of life sciences. The key component of GEMMS is an abstract entity called Data Unit, which consists of raw data and its associated metadata. The stated main advantages are flexibility during ingestion and a user interface that abstracts from singular files. These Data Units can be annotated with semantic metadata according to a suitable ontology. The core, however, is described with structure metadata. Mappings are only discussed for semi-structured files, like CSV, XML, or spreadsheets; however, it seems straightforward to extend this to other use cases.

3.3. MEDAL and goldMEDAL

A graph-based metadata model was presented by Sawadogo et al. (2019), where a subset of data, called an object, is represented as a hypernode that contains all information about that particular object, like the version, semantic information, or so-called representations. Representations present the data in a specific way, for instance as a word cloud for textual data. At least one representation is required per object, which is connected to this object by a Transformation. These representations can be transformed, which is represented as a directed edge in the hypergraph. This edge contains information about the transformation, i.e., a script description or similar. Data versioning is performed at the attribute level of these hyperedges connecting two different representations. Additionally, it is possible to define undirected hyperedges representing the similarity of two objects, provided that the two data sets are comparable.

This approach was revised by Scholly et al. (2021). Here, the concept was simplified to only use data entities, processes, links, and groupings. Processes also generate new data entities, dropping the rather complicated idea of representations. These concepts are again mapped onto a hypergraph. Both models require global metadata, such as ontologies or thesauri.

3.4. CODAL

The data lake, and in particular the utilized metadata model called CODAL (Sawadogo et al., 2019), was purpose-built for textual data. It combines a graph model connecting all ingested data sets with a data vault describing an individual data set. One core component is the xml manifest, which is divided into three parts: i) atomic metadata, ii) non-atomic metadata, and iii) a division for physical relational metadata. Metadata of the first category can be described as key-value pairs, whereas non-atomic metadata only contain references to a specific entity on a different system; they are "stored in a specific format in the filesystem" (Sawadogo et al., 2019). Additional information about the link strength, which models the relational metadata, is stored in a dedicated graph database. Here, each node represents one document with a reference to the corresponding xml manifest.

3.5. Network-based models

A network-based model, which extends the simple categorization by Oram (2015) into three distinct types of metadata, i.e., Business Metadata, Operational Metadata, and Technical Metadata, was proposed by Diamantini et al. (2018) to improve the integration of different data sources ingesting heterogeneous and unstructured data into the data lake. Here, the notion of objects, or nodes in the resulting graph, is used as well, which are defined by the corresponding source typology. Based on these objects, links are generated, containing a structural, a similarity, or a lemma (Navigli and Ponzetto, 2012) relationship. In this approach, a node is not only created for each source but also for each tag used in the structural relationship modeling. Lexical similarities are derived if two nodes have a common lemma in a thesaurus, while string similarities are computed using a suitable metric; in that particular case, N-grams (Peterlongo et al., 2005) were used. Similar nodes are merged. Due to this merge, synonyms in user queries can be detected and appropriately handled.
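For illustration, a Jaccard similarity over character n-grams, a simple stand-in for the N-gram metric referenced above, fits into a few lines of Python:

```python
def char_ngrams(s: str, n: int = 3) -> set:
    s = f"  {s.lower()} "  # pad so that short strings still yield n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over character n-grams: a simplified stand-in for
    the string-similarity metric used when merging graph nodes."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Two tag nodes that a merge step would likely consider similar:
print(ngram_similarity("customer_address", "customer-addresses"))
```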

3.6. CoreKG

CoreKG (Beheshti et al., 2018) contextualizes the metadata in the data catalog. To this end, four features have been identified that constitute this curation service (Beheshti et al., 2017b): Extraction, Enrichment, Linking, and Annotation. The Extraction functionality extracts information from raw data containing natural language, like the names of persons, locations, or organizations. Enrichment first provides synonyms and stems for the extracted features by using lexical knowledge bases like WordNet (Miller, 1995). These extracted and enriched features then need to be linked to external knowledge bases, like Wikidata (Vrandečić, 2012). This enables CoreKG, if, for instance, the name of a certain politician was extracted, to link it to the corresponding country that politician is active in, i.e., to set it into context. Additionally, users can annotate the data items.

3.7. GOODS

GOODS is the internal data lake of Google (Halevy et al., 2016a, b). It is unique compared to all other systems presented, since it gathers all its information in a post-hoc manner. This means that the individual teams continue working with their specific tools within their established data silos, while GOODS extracts metadata about each dataset by crawling through the corresponding processing logs or storage-system catalogs. The central entity of this data lake is a data set, which can be additionally annotated by users or a special data-stewardship team. These datasets are then connected by a knowledge graph (Singhal, 2012) to represent their relationships. Within these relationships, the dataset containment enables splitting up data sets, as it allows bigtable column families (Chang et al., 2008) to be data lake entities themselves, alongside the entire bigtable. Due to efficient naming conventions for file paths, GOODS can build up logical clusters, depending on whether data sets are regularly, e.g., daily, generated, whether they are replicated across different compute centers, or whether they are sharded into smaller data sets. In addition, the data sets are linked by content similarity as well. Since the entire data lake contains more than 20 billion data sets, with the creation/deletion of 1 billion data sets per day, no pairwise similarity computation can be performed. Instead, locality-sensitive hash values are generated for individual fields of the data sets and compared.
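The idea behind such locality-sensitive signatures can be sketched with a hand-rolled MinHash in Python; this is heavily simplified compared to a production implementation, but it shows why expensive pairwise set comparisons can be avoided:

```python
import hashlib

def minhash_signature(tokens: set, num_hashes: int = 64) -> list:
    """One signature per field: for each of num_hashes salted hash functions,
    keep the minimum hash value over all tokens."""
    return [
        min(int(hashlib.md5(f"{salt}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for salt in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions approximates the Jaccard
    similarity of the original token sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"user_id", "ts", "url", "referrer"})
b = minhash_signature({"user_id", "ts", "url", "status"})
print(estimated_jaccard(a, b))  # roughly 0.6 = |intersection| / |union|
```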

3.8. Constance

Constance (Hai et al., 2016) is a data lake service that extracts explicit and implicit metadata from the ingested data, allows semantic annotations, provides derived metadata matching and enrichment for a continuous improvement of the available metadata, and enables inexperienced users to work with simple keyword-based queries by providing a query rewriting engine (Hai et al., 2018). As is typically done in data lakes, data is ingested in raw format. The next step is to extract as much metadata from it as possible, which is easier for structured data like XML, since schema definitions can be directly extracted. In the case of semi-structured data, like JSON or CSV files, a two-step process called Structural Metadata Discovery is necessary. First, it is checked whether metadata is encoded in the raw file itself, as in a self-describing spreadsheet, or in the filename or file path. In a second step, relationships between the different datasets are discovered during the lifetime of the data lake, for instance based on the frequencies of join operations. Semantic Metadata Matching is provided by a graph model and should use a common ontology. In addition, schemata can be grouped based on their similarity, which is useful in highly heterogeneous data lakes.
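The first step of such a structural metadata discovery can be sketched in Python; the path layout and the handling of JSON and CSV files below are assumptions for illustration, not the actual Constance implementation:

```python
import csv
import json
import pathlib

def discover_structural_metadata(path: str) -> dict:
    """Harvest metadata that the raw file and its path already encode,
    without transforming the data itself."""
    p = pathlib.Path(path)
    meta = {"source_path": str(p), "format": p.suffix.lstrip(".")}
    if p.suffix == ".json":
        record = json.loads(p.read_text())       # assumes one JSON object
        meta["fields"] = sorted(record)          # keys as implicit schema
    elif p.suffix == ".csv":
        with open(p, newline="") as fh:
            meta["fields"] = next(csv.reader(fh))  # header row as schema
    # Metadata encoded in the file path, assuming .../<project>/<device>/file:
    if len(p.parts) >= 3:
        meta["project"], meta["device"] = p.parts[-3], p.parts[-2]
    return meta
```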

4. Provenance

One of the most important metadata attributes in the context of linked data is provenance (Hartig and Zhao, 2010). Data provenance, or data lineage, contains information about the origin of a dataset, e.g., how it was created, by whom, and when. There has been an effort by the W3C to standardize the representation of provenance information by the use of an OWL2 ontology, as well as a general data model, among other documents completing their specification called PROV (Belhajjame et al., 2013; Missier et al., 2013). Provenance is also becoming increasingly important in science, as it is a natural way to make scientific work more comprehensible and reproducible. This can be exemplified by the adoption of research objects (Bechhofer et al., 2010) and reusable research objects (Yuan et al., 2018), which focus even more on precise provenance information and the repeatability of computational experiments. Apart from this, provenance is considered key in data lakes to organize, track, and link data sets across different transformations and thereby ensure the maintainability of a data lake.
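The PROV data model can be exercised directly with the prov Python package; the following minimal example records a hypothetical cleaning step, answering how, from what, and by whom a data set was derived:

```python
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

raw = doc.entity("ex:raw-measurement-042")
clean = doc.entity("ex:cleaned-measurement-042")
run = doc.activity("ex:cleaning-run-7")
alice = doc.agent("ex:alice")

doc.used(run, raw)                 # the activity consumed the raw data ...
doc.wasGeneratedBy(clean, run)     # ... and produced the cleaned data set
doc.wasDerivedFrom(clean, raw)     # direct lineage between the two entities
doc.wasAssociatedWith(run, alice)  # responsibility of the agent

print(doc.serialize(indent=2))     # PROV-JSON representation
```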

4.1. CoreDB and CoreKG

CoreDB (Beheshti et al., 2017a) and CoreKG (Beheshti et al., 2018) are data lake services with a main emphasis on a comprehensive REST API to organize, index, and query data across multiple databases. At the highest level, the main entities of this data lake are data sets, which can be either of type relational or of type NoSQL. In order to enable simultaneous querying capabilities, the CoreDB web service itself sits in front of all the other employed services. On this layer, queries are translated between SQL and NoSQL. A particular focus is the lineage tracing of these entities. The recorded provenance is modeled by a directed acyclic graph, where users/roles and entities are nodes, while connecting edges represent the interactions. The employed definition is given by the Temporal Provenance Model (Beheshti et al., 2012) and can answer when, from where, by whom, and how a data set was created, read, updated, deleted, or queried.

4.2. GOODS

The GOODS metadata model has a particular focus on provenance (Halevy et al., 2016a, b). In order to build up the provenance graph, production logs are analyzed in a post-hoc manner. Then the transitive closure is calculated to determine the linkage between the data sets themselves. Since the number of data-access events in those logs is extremely high, only a sample is actually processed, and the transitive closure is reduced to a limited number of hops.

4.3. Komadu-based provenance auditing

Suriarachchi and Plale ( 2016a , b ) proposed a data lake reference architecture to track data lineage across the lake by utilizing a central provenance collection subsystem. This subsystem enables stream processing of provenance events by providing a suitable Ingest API along with a Query API . In order to centrally collect provenance and process it, Komadu (Suriarachchi et al., 2015 ) is used. Hereby, distributed components can send provenance information via RabbitMQ and web service channels. These single events are then assembled into a global directed acyclic provenance graph, which can be visualized as forward or backward provenance graphs. Using this central subsystem, the need for provenance stitching (Missier et al., 2010 ) is circumvented.

4.4. HPCSerA

A data lake use case is described in the work of Bingert et al. (2021). Here, a user specifies a so-called job manifest, which unambiguously describes the job to be computed. This includes the actual compute command, the compute environments provided by Singularity containers (Kurtzer et al., 2017), git repositories to be cloned and potentially built at run-time, environment variables, user annotations, and, most importantly, the input and expected output data. This job manifest, written as a JSON document, is then sent to the data lake, which is here represented by a dedicated web application that takes control of the actual synchronization with the underlying services, like the high-performance compute cluster or the databases. The data lake generates all necessary scripts, which are divided into three phases: i) pre-processing, ii) run, and iii) post-processing. These scripts are submitted to the compute cluster, where within the pre-processing step the compute environment is built on the front-end, and data from a remote S3 storage is staged on a fast parallel file system. Within this step, all possible degrees of freedom, like the input data or the git commit, are recorded and submitted to the data lake, where they are indexed. Due to this mechanism, jobs are searchable later on, and a provenance graph is automatically created, which connects the artifacts, via the job manifest as edges, to their input or raw data. Due to this recording, as well as the wrapping job manifest, each job is precisely reproducible, since one can submit the exact same job without any unknowns.
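A job manifest of the described shape might look as follows; all field names and values are invented for illustration and do not necessarily match the actual HPCSerA format:

```python
import json

# Hypothetical job manifest: everything that determines the computation is
# made explicit, so the job is searchable and exactly reproducible later on.
job_manifest = {
    "command": "python analyze.py --input measurement-042.nc",
    "container": "registry.example.org/env/analysis:1.4",   # Singularity image
    "repositories": [{"url": "https://git.example.org/lab/analysis.git",
                      "commit": "9f2c1ab"}],                # built at run-time
    "environment": {"OMP_NUM_THREADS": "16"},
    "annotations": {"campaign": "2023-05"},
    "input_data": ["s3://lake/raw/measurement-042.nc"],
    "expected_output": ["s3://lake/derived/measurement-042-summary.json"],
}
print(json.dumps(job_manifest, indent=2))
```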

4.5. JUNEAU

JUNEAU is built on top of Jupyter Notebooks by replacing the backend and customizing the user interface (Zhang and Ives, 2019). It is therefore specifically targeted at data scientists who are already familiar with Jupyter Notebooks. The main constituents of the data lake are tables, or data frames, whose transformations are tracked. For this, the notebook itself is considered to be the workflow, and each executed cell within it is a task. The provenance information is captured when the code within a cell is transmitted to the used kernel. Based on this, the notebook is reformatted into a versioned data flow graph, where procedural code is transformed into a declarative form (Ives and Zhang, 2019). Using a modified top-k threshold algorithm (Fagin et al., 2003), similar data sets can be found with respect to their individual provenance.

4.6. DCPAC

In order to manage automotive sensor data, Robert Bosch GmbH has built a data lake (Dibowski et al., 2020). Although the paper mainly focuses on their extensive DCPAC (Data Catalog, Provenance, Access Control) ontology to build the semantic layer, a dedicated data processing mechanism is provided. Data processing is done using containerized applications, which can access data in the data lake and either create a new data resource from it or curate existing data sets. The semantic data catalog is updated via Apache Kafka messages. Hereby, new data items are integrated and their provenance is automatically recorded.

4.7. DataHub

DataHub (Bhardwaj et al., 2014) combines a dataset version control system, capable of tracking which operations were performed on which dataset by whom, as well as their dependencies, with a hosted platform on top of it. DataHub uses tables, which contain records, as its primary entities. Records consist of a key along with any number of typed, named attributes. In the case of completely unstructured data, the key could simply refer to an entire file; in the case of structured or semi-structured files like XML or JSON, the schema can be (partially) modeled into the record. These individual tables can then be linked to form data sets under specification of the corresponding relationships. The version information of a table or data set is managed using a version graph, i.e., a directed acyclic graph where the nodes are data sets and the edges contain provenance information. In order to query multiple versions at a time, a SQL-based query language called VQL is provided, which extends SQL with the knowledge that there are different tables for the different versions of a data set.

Along with DataHub, ProvDB (Miao et al., 2017; Miao and Deshpande, 2018) is being developed. It incorporates a provenance data model (Chavan et al., 2015) that consists of a Conceptual Data Model and a Physical Property Graph Data Model. The first model considers a data science project as a working directory where all files are either of type ResultFile, DataFile, or ScriptFile. These files can be further annotated with properties, i.e., JSON files. This model is then mapped onto a property graph, where the edges represent the relationships, e.g., parenthood. Provenance ingestion is possible in three ways. The first option is to prefix shell commands with provdb ingest, which then forwards audited information to different specialized collectors. Secondly, users can provide annotations. Lastly, there are so-called File Views, which allow defining virtual files as a transformation of an existing file. This can be the execution of a script or of an SQL query.

5. Support for workflows and automation

Although the first major challenge in building a data lake is the aforementioned metadata management, when scaling toward big amounts of data, the (automated) operation and manageability of the data lake become increasingly important. For example, the extraction of metadata related to data being ingested into a data lake requires a scalable solution and highly automated processes that can be integrated into work- or data flows wherever necessary (Mathis, 2017). As in the case of metadata extraction, it is often more convenient to split a complicated analysis into a workflow consisting of different steps. This has the additional advantage that different parallelization techniques (Pautasso and Alonso, 2006; de Oliveira et al., 2012) can then be applied to improve the scalability of the implemented analysis.

5.1. KAYAK

KAYAK (Maccioni and Torlone, 2017, 2018) offers so-called primitives to analyze newly inserted data in a data lake in an ad-hoc manner. KAYAK itself is a layer on top of the file system and offers a user interface for interactions. The respective primitives are defined by a workflow of atomic and consistent tasks and can range from inserting or searching for a data set in the data lake to computing k-means or performing an outlier analysis. Tasks can be executed either by KAYAK itself, or a third-party tool can be triggered, like Apache Spark (Zaharia et al., 2016) or Metanome (Papenbrock et al., 2015). Furthermore, tasks can be sub-divided into individual steps. By defining a directed acyclic graph consisting of consecutive, dependent primitives, so-called pipelines can be constructed with KAYAK. Here, output data is not immediately used as input for a consecutive primitive; instead, output data is first stored back into the data lake, and the corresponding metadata in the data catalog is updated. Users can define a time-to-action to specify the maximum time they are willing to wait for a result or preview, or they define a tolerance, which specifies the minimal accuracy they demand. A preview is a preliminary result of a step. In order to enable these features, each step has to expose a confidence, to quantify the uncertainty about the correctness of a preview, and a cost function, to provide information about the run-time necessary to achieve a certain confidence. KAYAK enables the parallel execution of steps by managing dependencies between tasks. These dependencies are modeled as a directed acyclic graph for each primitive. By decomposing these dependency graphs into singular steps, they can be scheduled by a queue manager. KAYAK enables the asynchronous execution of tasks by utilizing a messaging system to schedule these tasks on task executors, which are typically provided multiple times on a cluster to allow for parallel processing.
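The decomposition of a dependency graph into schedulable steps can be sketched with Python's standard-library topological sorter; the step names are invented, and KAYAK itself dispatches the ready steps via a messaging system to distributed task executors instead of executing them in-process:

```python
from graphlib import TopologicalSorter  # Python >= 3.9

# Steps of one primitive with their dependencies (edges of the DAG).
steps = {
    "ingest": set(),
    "profile": {"ingest"},
    "kmeans": {"profile"},
    "outliers": {"profile"},
    "update_catalog": {"kmeans", "outliers"},
}

ts = TopologicalSorter(steps)
ts.prepare()
while ts.is_active():
    ready = list(ts.get_ready())
    # All steps in 'ready' are mutually independent and could be dispatched
    # in parallel to task executors on a cluster.
    for step in ready:
        print("executing", step)
        ts.done(step)
```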

5.2. Klimatic

Klimatic (Skluzacek et al., 2016) integrates over 10,000 different geo-spatial data sets from numerous online repositories. It accesses these data sets via HTTP or Globus GridFTP. This is done in a manner that allows capturing path-based provenance information and identifying relevant data sets based on file extensions, like NetCDF or CSV. It then pre-processes these heterogeneous data sets to integrate them into a single, homogeneous data lake, while ensuring topological, geo-spatial, and user-defined constraints (Elmasri and Navathe, 1994; Cockcroft, 1997; Borges et al., 1999). The pre-processing is done automatically within a three-phase data ingestion pipeline. The first step consists of crawling and scraping, where Docker containers are deployed in a scalable pool. These crawlers retrieve a URL from a crawling queue and then process any data found at that URL, while adding newly discovered URLs back into the crawling queue. Using this approach, it is sufficient to start with a limited number of initial repositories, like those of the National Oceanic and Atmospheric Administration or the University Corporation for Atmospheric Research. After data sets have been successfully discovered, they are submitted to an extraction queue. Elements of this queue are then read by extractor instances, which are also Docker containers that can be elastically deployed. These extract metadata with suitable libraries/tools, like UK Gemini 2.2, and then load the extracted metadata into a PostgreSQL database. Using these automated processes, a user is, for instance, able to query for data in a certain range of latitudes and longitudes, and Klimatic will estimate the time needed to extract all data from the different data sets within the specified range and will then provide the user with an integrated data set using focal operations (Shashi and Sanjay, 2003).
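The crawl-and-extract pipeline can be reduced to a single-process Python sketch; in Klimatic, crawlers and extractors are pools of Docker containers, and the link-fetching function below is injected only to keep the sketch self-contained:

```python
from collections import deque
from urllib.parse import urljoin

RELEVANT = (".nc", ".csv")  # identify candidate data sets by file extension

def crawl(seed_urls, fetch_links):
    """fetch_links(url) -> iterable of hrefs found at that URL (injected so
    the sketch does not hard-code a particular HTTP client)."""
    crawl_queue = deque(seed_urls)
    extract_queue = deque()
    seen = set()
    while crawl_queue:
        url = crawl_queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        for href in fetch_links(url):
            target = urljoin(url, href)
            if target.endswith(RELEVANT):
                extract_queue.append(target)  # handed over to the extractors
            else:
                crawl_queue.append(target)    # keep crawling
    return extract_queue

# Extractor instances would then pop URLs from the returned queue, pull the
# files, extract metadata, and load it into a PostgreSQL database.
```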

6. Discussion: Selected use cases and resulting challenges

In this section, two different use cases, or user groups, which can be considered representative of larger research institutions, are presented. Based on these use cases, the previously discussed data lake systems are analyzed for their applicability. In addition, for each section, the presented systems are analyzed and compared to each other. For this, four standard criteria are chosen. Generality measures how easily the presented system can be used to cover all kinds of different (research) data. Administrative Effort estimates how much work is needed to host the system and the necessary backend services without actually doing the domain research itself. This is complemented by Ease of Use, where the accessibility from a pure user's perspective is analyzed. Lastly, since data lakes are also commonly used in the context of big data, Scalability is a crucial criterion to estimate the worth of deployment. In addition to these four criteria, more topic-specific criteria might be added with regard to the actual focus of the particular section.

6.1. User groups

In the following, two disparate user groups are presented, which mainly differ in their technical proficiency.

6.1.1. Data scientists

In this use case, mainly data scientists with an above-average understanding of technology interact with the data lake. Their motivation to use a data lake can come from a big data background, where a massive amount of data should be stored and processed, as well as from standardizing processes and their provenance auditing to enhance the reproducibility of their experiments. Data scientists have the knowledge to work with SQL, NoSQL, and graph databases, and to interact with mass storage like CEPH (Weil et al., 2006). In order to perform their computations, they rely on ad-hoc, local execution of code, e.g., in Jupyter Notebooks, but need to massively scale out their computations at a later stage. Therefore, they need to be able to work either in a cloud environment or on high-performance compute clusters, which are much more cost-efficient and purpose-built for large parallel applications.

6.1.2. Domain scientists

In this use case, mainly domain scientists are using the data lake. It can be assumed that they are working in a laboratory, or something similar in their respective field. Their motivation to use a data lake is driven by the general necessity of having proper research data management. These users are generally less experienced in dealing with more complicated applications like databases and are generally not required to program large applications or efficient, parallel code for high-performance compute clusters. Data lineage does not only need to be recorded for digital transformations, i.e., monitoring which artifacts were created by which processes based on which input data, but also along measurements and analysis steps that happen in laboratories or comparable settings. Here, a data lake should be able to manage all experiment data and associated data, e.g., an electronic notebook corresponding to an experiment, and track, for instance, a sample across the creation and subsequent measurement cycle.

6.2. Applicability analysis of the presented data lakes

The following applicability analysis of the previously presented data lakes will be done based on the two provided use cases as well as their perceived administrative effort.

6.3. Architecture

6.3.0.1. Zone architecture:

The Zone Architecture divides a data lake into different zones to allow for some organization of different data types, which allows for easier automation of repetitive tasks. These zones can be physically implemented on different machines. Users do not necessarily have access to all zones, which means that they cannot, for instance, directly access raw data on their own. This entails an administrative effort to serve all users. Additionally, by design there is no assistance for domain scientists and no built-in guidance for reproducible analysis for data scientists.

6.3.0.2. Lambda architecture:

The Lambda Architecture has some similarities with the zone architecture but generally has a reduced complexity. The rather rigidly coupled batch and speed layers prevent agile development by scientists but are ideally suited to provide production systems for established workflows while retaining all raw data sets for later reuse.

6.3.0.3. Lakehouse:

The Lakehouse adds a valuable metadata layer on top of an object store and retains the advantage of the separation of storage and compute. This, however, entails a limited set of supported file formats and therefore use cases. The completely flat hierarchy and homogeneous usage make this architecture well suited for data scientists and domain scientists alike.

6.3.0.4. FAIR-DO-based architecture:

The FAIR Digital Object-based Architecture offers a fine-grained refinement based on the types, which also provide a clear abstraction, increasing the general comprehensibility. The administrative effort is decreased, since new data types are derived from existing ones, and general data lake functionalities only need to be implemented once for the parent objects; afterwards, they can be reused in user space. The flat architecture does not intrinsically restrict access and offers a homogeneous interface across all stages. This allows implementing customized but homogeneous user interfaces for domain researchers, covering the entire life cycle of a certain experiment. Meanwhile, data scientists can work with the well-known abstraction of objects they are familiar with from the object-oriented programming paradigm. The possibility to globally enforce data governance based on the typed attributes of these FAIR Digital Objects is well suited to integrate different data sources or silos into a single research data management system.

6.3.0.5. Functional and maturity-based architectures

The classification into functional-based, maturity-based, or hybrid data lakes underlines the importance of sorting data based on their refinement, i.e., on the degree of processing they were subjected to, while also stressing the importance of formulating activities on the data lake as functions. This has the advantage of a high degree of standardization, which eases the administrative overhead while guaranteeing minimal data quality and adherence to institutional policies. It is hard to distinguish here between domain researchers and data scientists, since it is not clear how actual implementations of these concepts would respect the different needs of those user groups.

6.3.0.6. Qualitative comparison

Looking at the four presented architectures, one can make a qualitative comparison as shown in Table 1. As discussed, the zone architecture has the highest complexity of all four and therefore scores poorly on administrative effort and ease of use, but it has a high generality. The Lambda architecture reduces the complexity compared to the zone architecture and is therefore easier to maintain and use, but it is less versatile, since it mainly serves production systems with established workflows. Similar arguments apply to the lakehouse, which can only support a limited set of file formats. The FAIR Digital Object-based architecture has a high generality, since it can wrap any existing file. Always offering Digital Objects to interact with is comfortable for the users but requires more administrative work, particularly at the beginning. One can also see that all architectures fulfill the general requirement for a data lake to be scalable.

Table 1. Comparing the four presented architectures.

Architecture | Generality | Administrative effort | Ease of use | Scalability
Zone         | +          | -                     | -           | +
Lambda       | 0          | +                     | +           | +
Lakehouse    | 0          | +                     | +           | +
FAIR-DO      | +          | 0                     | +           | +

Putting the presented architectures into the context of the overall evolution of data lakes, which were in their first years mostly realized using Hadoop clusters and the associated software stack (Khine and Wang, 2018), one can see a clear development toward more abstracted systems. The first level of abstraction was proposed by functional-based architectures, which can also be mapped onto zone architectures by associating a certain functionality with a certain zone. This idea was greatly advanced by the FAIR-DO-based architecture, where users do not see the actual system they are working on but only trigger the execution of predefined functions using a REST API. This approach lowers the entry barrier, particularly for domain researchers, while reducing the general risk of a loss of consistency across the data lake. The general idea to organize the data in the data lake according to their pedigree of applied processing has clearly prevailed, as it is also a fundamental part of the newer architectures, i.e., the FAIR-DO-based architecture and the maturity-based architecture. Since the lambda architecture only offers the serving layer, this holds there as well. Although data lakes strove to separate storage and compute, storage performance is becoming more important in the recent developments around lakehouses. Future work should therefore include active storage in the overall concept. Promising ideas are shown by Chakraborty et al. (2022), who extend the capabilities of a Ceph cluster.

6.3.1. Metadata models

6.3.1.1. Data Vault

Although Data Vaults may seem old at first glance, they offer a generic and flexible way of modeling diverse data sets in a single model. However, designing a proper Data Vault is very challenging and requires deep knowledge about data modeling in general as well as about the usage of databases. Therefore, this model seems better suited for data scientists than for domain researchers, while the administrative overhead depends on the system used.
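
For readers unfamiliar with the pattern, the following sketch shows the three classic Data Vault building blocks (hubs, links, and satellites) as SQLite tables; the concrete schema is a generic textbook illustration, not the model of any system evaluated here.

```python
# Minimal hub/link/satellite sketch of a Data Vault in SQLite; the concrete
# tables are a generic textbook pattern, not the schema of any system above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_experiment (          -- hub: business key only
    experiment_key TEXT PRIMARY KEY,
    load_date      TEXT,
    record_source  TEXT
);
CREATE TABLE hub_sample (
    sample_key     TEXT PRIMARY KEY,
    load_date      TEXT,
    record_source  TEXT
);
CREATE TABLE link_measured_in (        -- link: relates hubs
    experiment_key TEXT REFERENCES hub_experiment(experiment_key),
    sample_key     TEXT REFERENCES hub_sample(sample_key),
    load_date      TEXT
);
CREATE TABLE sat_sample_details (      -- satellite: descriptive attributes
    sample_key     TEXT REFERENCES hub_sample(sample_key),
    load_date      TEXT,
    material       TEXT,
    temperature_k  REAL
);
""")
```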

6.3.1.2. GEMMS

The Generic and Extensible Metadata Management System was particularly designed for research data. The concept of a data unit seems straightforward; however, semantic annotations are only possible with a suitable ontology. Although this increases the quality of the resulting metadata catalog, it entails challenges like ontology merging by administrators and the effort required from domain researchers to get their vocabulary into these ontologies. There was also no example provided of how this model can be used in conjunction with unstructured data.

6.3.1.3. MEDAL and goldMEDAL

In these models as well, global metadata like ontologies and thesauri are necessary. The improved version, goldMEDAL, seems matured, as it only uses straightforward concepts such as data entities, processes, links, and groupings. Less intelligible concepts like representations have been dropped. The presented implementation used Neo4J, which is within the defined realm of data scientists. An open challenge seems to be the integration of fully automated processes that adhere to global governance, in order to ensure data quality and to guide inexperienced users through the data lake.
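
A loose sketch of how goldMEDAL's four concepts could be mapped onto a property graph is shown below; it uses networkx instead of Neo4J for self-containedness, and all attribute names are assumptions rather than goldMEDAL's normative vocabulary.

```python
# Loose sketch mapping goldMEDAL's four concepts (data entities, processes,
# links, groupings) onto a property graph; attribute names are assumptions.
import networkx as nx

g = nx.MultiDiGraph()
g.add_node("raw.csv",   kind="data_entity")
g.add_node("clean.csv", kind="data_entity")
g.add_node("cleaning",  kind="process")
g.add_edge("raw.csv", "cleaning", kind="link", role="input")
g.add_edge("cleaning", "clean.csv", kind="link", role="output")
g.nodes["raw.csv"]["grouping"] = "project-A"    # groupings as node attributes
g.nodes["clean.csv"]["grouping"] = "project-A"

inputs = [u for u, _, d in g.in_edges("cleaning", data=True) if d["role"] == "input"]
print(inputs)  # ['raw.csv']
```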

6.3.1.4. CODAL

CODAL was purpose-built for textual data and therefore lacks the capacity to scale to generic use cases. The combination of a Data Vault with a graph database, along with the usage of XML documents, seems suited only for experienced users, i.e., data scientists. Combining these two models is powerful for the specific use case but entails a corresponding administrative overhead.

6.3.1.5. Network-based models

This is also a graph-based model, which aims to integrate heterogeneous and particularly unstructured data into a single research data management system. The notion of objects, represented as nodes, offers the necessary versatility to adapt to different use cases. The required definition of the corresponding source typology might not be easily implementable for domain scientists, but the overhead of including experienced administrators for the initial setup seems reasonable. However, the powerful relationship modeling using structural, similarity, and lemma relationships will introduce considerable maintenance overhead. This model therefore seems more appropriate for well-experienced data scientists who can ensure correct implementation and operation.
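
The similarity relationships mentioned above could, as the comparison in Section 6.3.1.9 notes, be computed from N-gram overlap; the following purely illustrative sketch uses the Jaccard coefficient over character trigrams.

```python
# Purely illustrative: a similarity link between two documents via the Jaccard
# overlap of their character trigrams, as one possible N-gram-based measure.
def ngrams(text: str, n: int = 3) -> set:
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: str, b: str, n: int = 3) -> float:
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

print(jaccard("measurement of sample S-42", "measurements of sample S-42"))
```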

6.3.1.6. CoreKG

This data model was discussed in detail for the case of data containing natural language. The proposed model and the presented workflow to implement it are very convincing, with the big restriction that it is only meaningful and implementable for text documents containing natural language. Once the workflow is set up, it seems useful for data scientists as well as domain researchers.

6.3.1.7. GOODS

GOODS builds up a data catalog in a post-hoc manner. It creates data sets that are enriched with information from logs and user annotations. These data sets are then connected by a knowledge graph to represent their relationships. Although the idea to build up a data lake in a post-hoc manner seems very promising for any larger institution, it comes with large challenges. Each log format and naming convention, also on the file path level, needs to be understood by the data lake. This requires, for instance, domain researchers to strictly follow global data governance, which usually requires additional auditing. Also on the administrative side, such a setup is as difficult to implement as it is compelling to have. Accessing all systems from one central data lake also comes with certain security risks which need to be addressed, increasing the complexity even more. Therefore, as impressive as this internal data lake of Google is, it is most likely out of reach for other, much smaller research institutions.

6.3.1.8. Constance

The metadata model presented in Constance was applied to structured and semi-structured data, where metadata was automatically extracted from the raw data itself or from the full file path. This approach lacks the ability to upload an associated metadata file, as could be done by a domain researcher who uploads an electronic lab book containing all metadata of a certain experiment. If such metadata needs to be indexed to enable a semantic search over it, a corresponding mechanism needs to be provided. Furthermore, the usage of ontologies enables semantic metadata matching on the one side, while on the other side it might be hard to implement. Problems here are that data scientists, rather than domain researchers, are trained to use them, and that the broader a data lake becomes, the more likely the introduction of an additional ontology becomes, which might then require complicated merges of ontologies (Noy and Musen, 2003; Hitzler et al., 2005). Therefore, this approach seems more feasible for data scientists who are operating on a restricted set of data sources.

6.3.1.9. Qualitative comparison

In Table 2 a qualitative comparison of the nine discussed models is provided. In addition to the previously used criteria, Similar Dataset Exploration and Semantic Enrichment are added. The first describes the possibility to find new data in a data lake that is similar to data a user has already found. This is important, for instance, for statistical analyses like machine learning, where it allows increasing the input data set. Semantic Enrichment describes the possibility to, ideally continuously, add semantic information to data in the data lake to improve findability. Implementing and using a Data Vault on top of a SQL or NoSQL database requires a manageable amount of time. In addition, it is scalable, can describe generic entities and their relations, and allows for evolution over time, therefore enabling a discrete, though not continuous, semantic enrichment. GEMMS has not yet been extended by default to support arbitrary file types, and the use of ontologies has certain disadvantages, like ontology merging and user consultation. It is also not completely clear how similar data sets can be found and how a continuous semantic enrichment can be performed. MEDAL is a rather complicated model with a steep learning curve. Relying on Representations which are derived by Transformations will probably limit the usage scenarios. Similarity links allow for easy dataset exploration and for semantic enrichment. The revised version goldMEDAL improves usability compared to MEDAL by dropping the complicated Representation and Transformation relationships and reducing them to simpler Processes. CODAL was purpose-built for textual data and thus lacks generality. In addition, relying on a filesystem limits scalability. Updating semantic metadata in an XML file, however, allows for continuous semantic enrichment, and connecting all data entities with nodes representing relationships allows for good dataset exploration. The network-based models can describe generic data; however, the more complicated notion of source typologies decreases the ease of use. Using N-grams to compute similarities, similar data sets can be detected, and structural metadata can be added arbitrarily to allow for semantic enrichment. CoreKG is again a purpose-built metadata model for textual data and therefore not generalizable. Setting up the full curation service requires some administrative effort but afterwards offers an easy and powerful model. The enrichment and linking services enable continuous data curation and exploration. GOODS requires that the necessary metadata is encoded in log files or in storage-system catalogs, which limits the generality. The administrative effort here is enormous; however, the ease of use for the users is great, since no change to their data silos and employed techniques is necessary. The capability to scale across dozens or hundreds of data centers is leading, and the integration into an existing knowledge graph enables similar dataset exploration. The evaluation of semantic enrichment is difficult due to the high velocity of the data. Constance offers a generic metadata model; however, the structural metadata discovery did not explicitly include unstructured data like images. The query rewriting engine eases the use drastically and offers similar dataset exploration and semantic enrichment.

Table 2. Comparing the nine presented metadata models.

Model          | Generality | Administrative effort | Ease of use | Scalability | Similar dataset exploration | Semantic enrichment
Data Vault     | +          | 0                     | 0           | 0           | +                           | 0
GEMMS          | 0          | 0                     | 0           | 0           | -                           | -
MEDAL          | 0          | 0                     | -           | 0           | +                           | +
goldMEDAL      | 0          | 0                     | 0           | 0           | +                           | +
CODAL          | -          | 0                     | 0           | -           | +                           | +
Network Models | +          | 0                     | -           | 0           | +                           | +
CoreKG         | -          | -                     | +           | 0           | +                           | +
GOODS          | 0          | -                     | +           | +           | +                           | 0
Constance      | 0          | 0                     | +           | 0           | +                           | +

To summarize, the discussed metadata models offer diverse approaches for data modeling. However, there are common patterns across these models. All of them have an atomic entity around which the entire modeling revolves. In order to harmonize these different atomic entities, in some models called objects, ontologies are commonly utilized. This increases the entry barrier, particularly for domain researchers, and can always lead to the necessity of performing ontology merging. These models also always employed a method to model relationships between their atomic entities or aggregations of them. As a general observation, one can state that a proper metadata model for data lakes has to offer means to describe its entities on their own, also semantically, as well as their relationships toward each other. This has led to the simultaneous usage of different database systems, i.e., SQL, NoSQL, and graph databases, within a single data lake, which introduces the challenge of answering a single user query across these different systems. More powerful query-rewriting engines and/or suitable meta-languages that support the integration of semantic meaning within this process are among the challenges metadata modeling in data lakes is currently facing. In addition, semantic metadata extraction from particularly unstructured data, like images, is a key challenge for improving the usability and adoption of data lakes.

6.3.2. Provenance

6.3.2.1. CoreDB(KG)

The employed Temporal Provenance Model is suitable for data scientists and domain researchers alike, although it is not a widely used standard. However, no details about the technical implementation and the versatility are given. Therefore, no final assessment of the actual applicability is possible.

6.3.2.2. GOODS

Apart from the already discussed challenges of employing a post-hoc analysis to populate a central data lake, an additional challenge arises when using this approach to gather provenance information: the analysis software used needs to write suitable logs. Since this is not the case for all scientific software, this approach is hard to implement for domain researchers, while data scientists might be able to cope with this issue when using self-written code or wrappers around existing software suites. Interestingly, only GOODS calculates the transitive closure of the provenance graph, which seems very useful.

6.3.2.3. Komadu

This data lake implementation offers a dedicated Ingest API which uses a RabbitMQ messaging queue to retrieve lineage information about the performed processing steps. The transparent assembly of these singular tasks into a global provenance graph is comfortable and useful. As in the GOODS discussion, data scientists can use custom-built software or write wrappers around existing software to utilize the messaging system. Domain researchers will probably have a hard time when their scientific software suite does not support this kind of provenance auditing.
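
To give an impression of such message-based lineage capture, the sketch below publishes a lineage event to a RabbitMQ queue with pika; the queue name and the payload fields are assumptions for illustration and do not reflect Komadu's actual ingest schema.

```python
# Sketch of an application-side wrapper pushing a lineage event to RabbitMQ
# with pika; the queue name and payload fields are assumptions, not Komadu's
# actual ingest schema.
import json
import pika

def publish_lineage(activity: str, inputs: list, outputs: list,
                    queue: str = "lineage") -> None:
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=queue, durable=True)
    event = {"activity": activity, "inputs": inputs, "outputs": outputs}
    channel.basic_publish(exchange="", routing_key=queue,
                          body=json.dumps(event))
    connection.close()

publish_lineage("fft_analysis", ["raw/scan_001.dat"], ["derived/spectrum_001.csv"])
```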

6.3.2.4. HPCSerA

Here, users are required to describe the analysis they want to run in a job manifest. This job is then executed on an HPC system. By using Singularity containers and enabling the dynamic build and integration of arbitrary git commits, this integrates well with a typical HPC workflow. These systems are often used by data scientists, who benefit from the transparent provenance auditing of completely generic jobs and the ability to re-run a previous analysis to reproduce the results. This mechanism can also be extended for better Cloud support; however, an ad-hoc analysis mode with similar provenance auditing is lacking, which might be important for domain researchers.
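
A hypothetical sketch of what submitting such a job manifest to a REST endpoint might look like is given below; the URL and all manifest fields are invented for illustration and are not HPCSerA's actual API.

```python
# Hypothetical sketch of submitting a job manifest to a REST endpoint; the
# URL and all manifest fields are invented for illustration, not HPCSerA's
# actual API.
import requests

manifest = {
    "container": "oras://registry.example.org/analysis:1.4",  # Singularity image
    "git_commit": "9f2c1ab",          # code version to build and run
    "command": "python run_analysis.py --input scan_001.dat",
    "inputs": ["datalake://raw/scan_001.dat"],
    "outputs": ["datalake://derived/spectrum_001.csv"],
}

response = requests.post("https://datalake.example.org/api/jobs",
                         json=manifest, timeout=30)
response.raise_for_status()
print(response.json().get("job_id"))
```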

6.3.2.5. JUNEAU

This modification of Jupyter Notebook offers an ad-hoc experience for users who are working with Python and with tabular data. Since Jupyter Notebook is broadly utilized by data scientists and domain researchers alike, it is generally suited for both groups. However, this approach only works for tabular data and only supports Python, which limits the possible use cases. In addition, the presented data lake implementation fell short on detailed data and metadata handling and mainly focused on ad-hoc processing. It remains unclear how well this implementation can serve as a central research data management system for a variety of data sources and user groups.

6.3.2.6. DCPAC

Data lineage in DCPAC is recorded by custom-built containers which send messages to Apache Kafka. This approach requires users to containerize their applications and implement a messaging mechanism. This is a suitable method for data scientists but is challenging for domain researchers. Particularly for domain researchers, it would be necessary to check the messages for quality and consistency.

6.3.2.7. DataHub

This platform, which offers a data set version control system, is a good solution for all researchers. The representation of the provenance within a version graph is interesting and straightforward. In addition, the possibility to use ProvDB offers more detailed modeling capabilities on the basis of files. The ingestion of the provenance data, however, is not generic enough. Using the shell command will not offer a suitable depth, for instance, to ensure full reproducibility, while on the other hand, the file views are only suitable for data scientists familiar with SQL. The third provided option, i.e., user annotations, is very error-prone and therefore unsuited as a guideline for good scientific practice.

6.3.2.8. Qualitative comparison

In Table 3 a qualitative comparison of the seven discussed models is provided. In addition to the previously used criteria, Reproducibility is added. This specifies whether, based on the gathered provenance information, a result can always be reproduced, which becomes increasingly relevant for scientific publications. CoreDB/CoreKG offers a REST API and thereby an easy-to-use interface, and the distinction between the types SQL and NoSQL offers a great generality. However, the employed Temporal Provenance Model is not an established standard and does not aim to guarantee reproducibility, but rather comprehensibility. GOODS relies on production logs, based on which heuristics are used to calculate the provenance graph. It aims for scalability and efficiency, not for reproducibility. Komadu relies on RabbitMQ messaging and is therefore not generally applicable. The provided REST API is a useful user interface. However, reproducibility relies on the quality of the messages; it thus depends on the analytics job being run and is independent of the data lake itself. The data lake which uses HPCSerA can execute arbitrary scripts and offers a REST API to work with. Its strength lies in its transparent lineage auditing on an HPC system by using job manifests. By storing and linking all entities, every artifact can be reproduced. The inclusion of HPC systems makes this setup very scalable. JUNEAU is extremely user-friendly by building directly on top of the well-known Jupyter Notebooks. However, it lacks generality and scalability, since it depends on data frames and is limited to the resources of a single notebook. The transparent lineage recording during the submission of the code in a cell to the kernel allows reproducibility. DCPAC works on arbitrary data; however, the usage of extensive ontologies requires users to familiarize themselves with them. The usage of Docker containers is scalable and offers fair reproducibility. DataHub can deal with any kind of data and offers a comfortable graphical user interface. The provenance information, although limited, in conjunction with the version control of the data allows for decent reproducibility. In conclusion, while all previously discussed data lake implementations share the common idea of building a global provenance graph, there exists a wealth of different provenance models and auditing methods. In the future, data lakes should focus more on established models for provenance representation to enhance interoperability. Furthermore, from the fact that each data lake implementation found a unique provenance auditing approach, it becomes clear that in each case a specific processing mechanism was in mind, like an HPC system in HPCSerA or a Jupyter Notebook in JUNEAU. This means that not a single data lake offered provenance auditing capabilities over the entire life cycle of a data-driven project for generic applications. A future challenge here is to support provenance auditing in ad-hoc analytics, like within a Jupyter Notebook, as well as in larger computations that run in a Cloud or HPC environment, and to integrate this into a homogeneous, global provenance graph, ideally in a reproducible manner. These single tasks then need to be linked to workflows with the same provenance granularity. A similar challenge is to support generic applications without relying on a built-in messaging or logging functionality.
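
As an example of such an established model, the following minimal sketch expresses a simple lineage relation in the W3C PROV data model (Missier et al., 2013) using the community-maintained `prov` Python package; all identifiers are example values.

```python
# Minimal W3C PROV-DM sketch with the community `prov` package; all
# identifiers are example values.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

raw = doc.entity("ex:raw_scan")
spectrum = doc.entity("ex:spectrum")
analysis = doc.activity("ex:fft_analysis")
scientist = doc.agent("ex:alice")

doc.used(analysis, raw)                       # the activity consumed the raw data
doc.wasGeneratedBy(spectrum, analysis)        # ... and produced the spectrum
doc.wasAssociatedWith(analysis, scientist)
doc.wasDerivedFrom(spectrum, raw)

print(doc.get_provn())   # human-readable PROV-N serialization
```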

Table 3. Comparing the seven provenance models.

Model         | Generality | Administrative effort | Ease of use | Scalability | Reproducibility
CoreDB/CoreKG | +          | 0                     | +           | +           | 0
GOODS         | 0          | -                     | +           | +           | -
Komadu        | -          | 0                     | +           | 0           | 0
HPCSerA       | +          | 0                     | +           | +           | +
JUNEAU        | -          | 0                     | +           | -           | +
DCPAC         | +          | 0                     | 0           | +           | 0
DataHub       | +          | 0                     | +           | 0           | +

6.3.3. Support for workflows and automation

6.3.3.1. KAYAK

KAYAK offers a very sophisticated mechanism to implement parallelizable and automatable workflows. The decomposition of an abstract workflow into tasks, which can then be combined into primitives and finally chained into entire pipelines, requires some prior knowledge. Although powerful, it is rather suited for data scientists and not so much for domain researchers, since the decomposition of established scientific software suites into these entities is not straightforward. Furthermore, although the idea of a tolerance and a time-to-action is very useful in a data lake, it is only suitable for the subset of methods that are iterative by nature. From the viewpoint of a generic scientific application, this might simply add additional overhead and increase the entry barrier. Therefore, this well-designed data lake is mostly suitable for experienced data scientists.
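
The decomposition idea can be illustrated with plain Python composition, as sketched below; this is only meant to convey the tasks-primitives-pipelines layering and is not KAYAK's actual API.

```python
# Illustrates the decomposition idea only (tasks -> primitives -> pipelines);
# this is not KAYAK's actual API, just plain Python composition.
from functools import reduce

def task_profile(dataset):          # a task: one atomic operation
    dataset["profiled"] = True
    return dataset

def task_dedupe(dataset):
    dataset["rows"] = list(dict.fromkeys(dataset["rows"]))
    return dataset

def primitive(*tasks):              # a primitive: a reusable bundle of tasks
    return lambda ds: reduce(lambda acc, t: t(acc), tasks, ds)

prepare = primitive(task_profile, task_dedupe)
pipeline = [prepare]                # a pipeline: chained primitives

dataset = {"rows": ["a", "b", "a"]}
for step in pipeline:
    dataset = step(dataset)
print(dataset)   # {'rows': ['a', 'b'], 'profiled': True}
```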

6.3.3.2. Klimatic

The automated processing capabilities of Klimatic are based on Docker containers. The generic approach to split a pipeline into distinct stages which are linked by dedicated queues can be adopted to serve other use cases as well. Although it was only used for data set exploration within this particular implementation, data analysis pipelines could also be set up using this approach. This would require building containers that push their output back into a certain queue, or the current data lake implementation could be extended to offer a generic wrapping method that accepts arbitrary containers and then orchestrates the communication with the queuing system. One can see how this data lake could be extended to serve domain researchers from other sciences as well.
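
A standard-library sketch of stages linked by dedicated queues is shown below; it illustrates the general pattern only and is not Klimatic's Docker-based implementation.

```python
# Standard-library sketch of pipeline stages linked by dedicated queues;
# not Klimatic's implementation, which uses Docker containers and remote hosts.
import queue
import threading

ingest_q, process_q = queue.Queue(), queue.Queue()

def ingest_stage():
    while (item := ingest_q.get()) is not None:
        process_q.put(item.strip().lower())   # placeholder transformation
    process_q.put(None)                       # propagate shutdown signal

worker = threading.Thread(target=ingest_stage)
worker.start()

for raw in ["  Temp_Grid_A ", "  Temp_Grid_B "]:
    ingest_q.put(raw)
ingest_q.put(None)                            # shutdown sentinel

worker.join()
while (result := process_q.get()) is not None:
    print(result)
```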

6.3.3.3. Qualitative comparison

In Table 4 a qualitative comparison of the two discussed processing concepts is provided. KAYAK has a scalable and parallelizable approach to executing a workflow. Since user-defined Primitives are supported, it is also very generic. In addition, the user interface on top of the filesystem eases the use; however, the entire execution model with its depth of options requires some time to familiarize oneself with. Klimatic presents a specific use case, and it is not completely clear how generalizable the overall approach is. Setting up all queues and ingestion pipelines for the first time requires some administrative effort but is afterwards more comfortable for the users. The usage of remote Docker hosts to serve multiple workers which get their jobs from a queue is also very scalable.

Table 4. Comparing the two workflow and automation tools.

Tool     | Generality | Administrative effort | Ease of use | Scalability
KAYAK    | +          | 0                     | 0           | +
Klimatic | 0          | -                     | +           | +

In conclusion, there is only a limited amount of work focusing on workflows and automation for processes on data lakes. The challenge here is to incorporate a scalable back-end to support the compute-intensive operations associated with big data. Using portable containers is a well-suited approach. Future developments, however, would largely benefit from a modularized approach allowing the integration of different back-ends in a suitable manner, i.e., native support for individual machines as well as for Cloud and HPC environments. This extends explicitly to the employed workflow engines, which should similarly prevent a lock-in effect, as envisioned by CWL (Amstutz et al., 2016).

7. Summary and outlook

This paper presents and summarizes the most relevant papers connected to data lakes and analyzes past developments to identify future challenges. This is done with a particular focus on the applicability for larger research institutions, which are characterized by diverse user groups, represented for simplicity by domain researchers and data scientists within this paper.

One can see in Section 2 that there is a trend toward an abstraction of the underlying systems. This allows the data life cycle to be modeled conceptually and increases the usability by defining a certain functionality within this life cycle. Furthermore, by only exposing predefined functions to users, the consistency of the data lake can be ensured even when used by inexperienced users. To broaden the general user group of a data lake, it is important that the metadata model is similarly easy to use and yet generic enough to be suitable for diverse use cases.

In Section 3 it was seen that there is a general need to model relationships between some atomic data lake entities. In addition, these atomic entities also need to be described by semantic metadata, which will be more intuitive, particularly for domain researchers. The most important challenge here is to find a metadata model which offers a low entry barrier for domain researchers to fill in their data, but enough depth for experienced users to utilize the sophisticated methods for special use cases, as they were presented.

By analyzing the papers presented in Section 4, it becomes clear that there is an open quest to develop a uniform provenance auditing mechanism which is able to capture homogeneous lineage information along the entire project life cycle, reaching from first ad-hoc scripts to large-scale parallel applications.

Also, in Section 5 there is a clear trend toward containerized applications to enable processing on data lakes. The advantages are manifold, reaching from portability to increased reproducibility. The provided mechanisms that allow for parallel and asynchronous executions are convincing. The next challenge is to enable these methods to use different back-ends, reaching from single systems to public or private clouds and HPC systems.

The concluding analysis in Section 6 took additional criteria into account and compared the presented data lake systems based on them. Here one could see that there is no single data lake system which fulfills all requirements. Instead, most of the systems are to some extent purpose-built systems which compromise in some aspects to excel at others. In addition, two different user groups from the viewpoint of a larger research institution were defined in Section 6.1. Based on these user groups, a more subjective analysis was done with the purpose of assessing the general accessibility of certain data lake implementations to these user groups. This might be useful in order to improve systems to attract more users, which is currently the largest challenge in data lake development. It would also have the largest benefit: getting diverse user groups excited about the idea of data lakes. This will lead to an increased influx of new (meta-)data and methods, and the integration of previously siloed data will enable novel analyses which have not been possible before.

Author contributions

HN: writing original draft, analysis, and interpretation of existing literature. PW: writing original draft, writing review, and supervision. Both authors reviewed the results and approved the final version of the manuscript.

Funding

We gratefully acknowledge funding by the Niedersächsisches Vorab funding line of the Volkswagen Foundation and by Nationales Hochleistungsrechnen (NHR), a network of eight computing centers in Germany providing computing capacity and promoting methodological skills.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1. https://flume.apache.org/
2. https://orc.apache.org/

References

  • Amstutz P., Crusoe M. R., Tijanić N. (2016). Common Workflow Language, v1.0. Available online at: https://www.commonwl.org/v1.0/Workflow.html (accessed August 07, 2022).
  • Armbrust M., Das T., Sun L., Yavuz B., Zhu S., Murthy M., et al. (2020). Delta lake: high-performance ACID table storage over cloud object stores. Proc. VLDB Endowment 13, 3411–3424. 10.14778/3415478.3415560
  • Armbrust M., Ghodsi A., Xin R., Zaharia M. (2021). Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics, in Proceedings of CIDR.
  • Armbrust M., Xin R. S., Lian C., Huai Y., Liu D., Bradley J. K., et al. (2015). Spark SQL: relational data processing in Spark, in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (Melbourne, VIC), 1383–1394.
  • Aundhkar A., Guja S. (2021). A review on enterprise data lake solutions. J. Sci. Technol. 6, 11–14. 10.46243/jst.2021.v6.i04.pp11-14
  • Batyuk A., Voityshyn V. (2016). Apache Storm based on topology for real-time processing of streaming data from social networks, in 2016 IEEE First International Conference on Data Stream Mining and Processing (DSMP) (Lviv: IEEE), 345–349.
  • Bechhofer S., De Roure D., Gamble M., Goble C., Buchan I. (2010). Research objects: toward exchange and reuse of digital knowledge. Nat. Preced. 10.1038/npre.2010.4626.1
  • Beheshti A., Benatallah B., Nouri R., Chhieng V. M., Xiong H., Zhao X. (2017a). CoreDB: a data lake service, in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (Singapore), 2451–2454.
  • Beheshti A., Benatallah B., Nouri R., Tabebordbar A. (2018). CoreKG: a knowledge lake service. Proc. VLDB Endowment 11, 1942–1945. 10.14778/3229863.3236230
  • Beheshti S.-M.-R., Motahari-Nezhad H. R., Benatallah B. (2012). Temporal provenance model (TPM): model and query language. arXiv preprint arXiv:1211.5009. 10.48550/arXiv.1211.5009
  • Beheshti S.-M.-R., Tabebordbar A., Benatallah B., Nouri R. (2017b). On automating basic data curation tasks, in Proceedings of the 26th International Conference on World Wide Web Companion (Perth, WA), 165–169.
  • Belhajjame K., B'Far R., Cheney J., Coppens S., Cresswell S., Gil Y., et al. (2013). PROV-DM: the PROV data model. Technical Report.
  • Bhardwaj A., Bhattacherjee S., Chavan A., Deshpande A., Elmore A. J., Madden S., et al. (2014). DataHub: collaborative data science and dataset version management at scale. arXiv preprint arXiv:1409.0798. 10.48550/arXiv.1409.0798
  • Bingert S., Köhler C., Nolte H., Alamgir W. (2021). An API to include HPC resources in workflow systems, in INFOCOMP 2021, The Eleventh International Conference on Advanced Communications and Computation (Porto), ed C.-P. Rückemann, 15–20.
  • Borges K. A., Laender A. H., Davis C. A., Jr (1999). Spatial data integrity constraints in object oriented geographic data modeling, in Proceedings of the 7th ACM International Symposium on Advances in Geographic Information Systems (Kansas City, MO), 1–6.
  • Borthakur D. (2007). The Hadoop distributed file system: architecture and design. Hadoop Project Website 11, 21.
  • Chakraborty J., Jimenez I., Rodriguez S. A., Uta A., LeFevre J., Maltzahn C. (2022). Skyhook: towards an Arrow-native storage system. arXiv preprint arXiv:2204.06074. 10.1109/CCGrid54584.2022.00017
  • Chang F., Dean J., Ghemawat S., Hsieh W. C., Wallach D. A., Burrows M., et al. (2008). Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 1–26. 10.1145/1365815.1365816
  • Chavan A., Huang S., Deshpande A., Elmore A., Madden S., Parameswaran A. (2015). Towards a unified query language for provenance and versioning, in 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15) (Edinburgh).
  • Cockcroft S. (1997). A taxonomy of spatial data integrity constraints. Geoinformatica 1, 327–343. 10.1023/A:1009754327059
  • de Oliveira D., Ogasawara E., Ocaña K., Baião F., Mattoso M. (2012). An adaptive parallel execution strategy for cloud-based scientific workflows. Concurrency Comput. 24, 1531–1550. 10.1002/cpe.1880
  • Dean J., Ghemawat S. (2008). MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113. 10.1145/1327452.1327492
  • Devlin B. A., Murphy P. T. (1988). An architecture for a business and information system. IBM Syst. J. 27, 60–80. 10.1147/sj.271.0060
  • Diamantini C., Giudice P. L., Musarella L., Potena D., Storti E., Ursino D. (2018). A new metadata model to uniformly handle heterogeneous data lake sources, in European Conference on Advances in Databases and Information Systems (Nicosia: Springer), 165–177.
  • Dibowski H., Schmid S., Svetashova Y., Henson C., Tran T. (2020). Using semantic technologies to manage a data lake: data catalog, provenance and access control, in SSWS@ISWC (Athens), 65–80.
  • Dixon J. (2010). Pentaho, Hadoop, and Data Lakes. Available online at: https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/ (accessed April 22, 2022).
  • Elmasri R., Navathe S. (1994). Fundamentals of Database Systems.
  • El-Sappagh S. H. A., Hendawi A. M. A., El Bastawissy A. H. (2011). A proposed model for data warehouse ETL processes. J. King Saud Univer. Comput. Inf. Sci. 23, 91–104. 10.1016/j.jksuci.2011.05.005
  • Fagin R., Lotem A., Naor M. (2003). Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66, 614–656. 10.1016/S0022-0000(03)00026-6
  • Giebler C., Gröger C., Hoos E., Eichler R., Schwarz H., Mitschang B. (2021). The data lake architecture framework: a foundation for building a comprehensive data lake architecture, in Proceedings der 19. Fachtagung für Datenbanksysteme für Business, Technologie und Web (BTW 2021).
  • Giebler C., Gröger C., Hoos E., Schwarz H., Mitschang B. (2019). Modeling data lakes with data vault: practical experiences, assessment, and lessons learned, in International Conference on Conceptual Modeling (Salvador: Springer), 63–77.
  • Giebler C., Gröger C., Hoos E., Schwarz H., Mitschang B. (2020). A zone reference model for enterprise-grade data lake management, in Proceedings of the 24th IEEE Enterprise Computing Conference (EDOC 2020) (Eindhoven: IEEE).
  • Golec D. (2019). Data lake architecture for a banking data model, in ENTRENOVA-ENTerprise REsearch InNOVAtion, Vol. 5 (Zagreb), 112–116.
  • Gorelik A. (2019). The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science. Sebastopol, CA: O'Reilly Media.
  • Hai R., Geisler S., Quix C. (2016). Constance: an intelligent data lake system, in Proceedings of the 2016 International Conference on Management of Data (San Francisco, CA), 2097–2100.
  • Hai R., Quix C., Jarke M. (2021). Data lake concept and systems: a survey. arXiv preprint arXiv:2106.09592. 10.48550/arXiv.2106.09592
  • Hai R., Quix C., Zhou C. (2018). Query rewriting for heterogeneous data lakes, in European Conference on Advances in Databases and Information Systems (Budapest: Springer), 35–49.
  • Halevy A., Korn F., Noy N. F., Olston C., Polyzotis N., Roy S., et al. (2016a). Goods: organizing Google's datasets, in Proceedings of the 2016 International Conference on Management of Data (San Francisco, CA), 795–806.
  • Halevy A. Y., Korn F., Noy N. F., Olston C., Polyzotis N., Roy S., et al. (2016b). Managing Google's data lake: an overview of the Goods system. IEEE Data Eng. Bull. 39, 5–14. 10.1145/2882903.2903730
  • Hartig O., Zhao J. (2010). Publishing and consuming provenance metadata on the web of linked data, in International Provenance and Annotation Workshop (Troy: Springer), 78–90.
  • Hasani Z., Kon-Popovska M., Velinov G. (2014). Lambda architecture for real time big data analytic, in ICT Innovations (Ohrid), 133–143.
  • Hitzler P., Krötzsch M., Ehrig M., Sure Y. (2005). What is ontology merging?, in American Association for Artificial Intelligence (Palo Alto, CA: AAAI Press), 4.
  • Hukkeri T. S., Kanoria V., Shetty J. (2020). A study of enterprise data lake solutions, in International Research Journal of Engineering and Technology (IRJET), Vol. 7 (Tamilnadu).
  • Inmon B. (2016). Data Lake Architecture: Designing the Data Lake and Avoiding the Garbage Dump. Basking Ridge, NJ: Technics Publications.
  • Inmon W. H. (2005). Building the Data Warehouse. Indianapolis, IN: John Wiley & Sons.
  • Ives Z. G., Zhang Y. (2019). Dataset relationship management, in Proceedings of Conference on Innovative Database Systems Research (CIDR 19) (Monterey, CA).
  • Khine P. P., Wang Z. S. (2018). Data lake: a new ideology in big data era. ITM Web Conf. 17, 03025. 10.1051/itmconf/20181703025
  • Kurtzer G. M., Sochat V., Bauer M. W. (2017). Singularity: scientific containers for mobility of compute. PLoS ONE 12, e0177459. 10.1371/journal.pone.0177459
  • Li J. (2014). Design of real-time data analysis system based on Impala, in 2014 IEEE Workshop on Advanced Research and Technology in Industry Applications (WARTIA) (Ottawa, ON: IEEE), 934–936.
  • Lindstedt D., Graziano K. (2011). Super Charge Your Data Warehouse: Invaluable Data Modeling Rules to Implement Your Data Vault. North Charleston, SC: CreateSpace.
  • Maccioni A., Torlone R. (2017). Crossing the finish line faster when paddling the data lake with Kayak. Proc. VLDB Endowment 10, 1853–1856. 10.14778/3137765.3137792
  • Maccioni A., Torlone R. (2018). KAYAK: a framework for just-in-time data preparation in a data lake, in International Conference on Advanced Information Systems Engineering (Tallinn: Springer), 474–489.
  • Madera C., Laurent A. (2016). The next information architecture evolution: the data lake wave, in Proceedings of the 8th International Conference on Management of Digital Ecosystems (Biarritz), 174–180.
  • Madsen M. (2015). How to Build an Enterprise Data Lake: Important Considerations Before Jumping in. San Mateo, CA: Third Nature Inc.
  • Mathis C. (2017). Data lakes. Datenbank Spektrum 17, 289–293. 10.1007/s13222-017-0272-7
  • Miao H., Chavan A., Deshpande A. (2017). ProvDB: lifecycle management of collaborative analysis workflows, in Proceedings of the 2nd Workshop on Human-in-the-Loop Data Analytics (Chicago, IL), 1–6.
  • Miao H., Deshpande A. (2018). ProvDB: provenance-enabled lifecycle management of collaborative data analysis workflows. IEEE Data Eng. Bull. 41, 26–38. 10.1145/3077257.3077267
  • Miller G. A. (1995). WordNet: a lexical database for English. Commun. ACM 38, 39–41. 10.1145/219717.219748
  • Miloslavskaya N., Tolstoy A. (2016). Big data, fast data and data lake concepts. Procedia Comput. Sci. 88, 300–305. 10.1016/j.procs.2016.07.439
  • Missier P., Belhajjame K., Cheney J. (2013). The W3C PROV family of specifications for modelling provenance metadata, in Proceedings of the 16th International Conference on Extending Database Technology (Genoa), 773–776.
  • Missier P., Ludäscher B., Bowers S., Dey S., Sarkar A., Shrestha B., et al. (2010). Linking multiple workflow provenance traces for interoperable collaborative science, in The 5th Workshop on Workflows in Support of Large-Scale Science (New Orleans, LA: IEEE), 1–8.
  • Munappy A. R., Bosch J., Olsson H. H. (2020). Data pipeline management in practice: challenges and opportunities, in Product-Focused Software Process Improvement, eds Morisio M., Torchiano M., Jedlitschka A. (Cham: Springer International Publishing), 168–184.
  • Munshi A. A., Mohamed Y. A.-R. I. (2018). Data lake lambda architecture for smart grids big data analytics. IEEE Access 6, 40463–40471. 10.1109/ACCESS.2018.2858256
  • Navigli R., Ponzetto S. P. (2012). BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell. 193, 217–250. 10.1016/j.artint.2012.07.001
  • Nogueira I. D., Romdhane M., Darmont J. (2018). Modeling data lake metadata with a data vault, in Proceedings of the 22nd International Database Engineering and Applications Symposium (Villa San Giovanni), 253–261.
  • Nolte H., Wieder P. (2022). Realising data-centric scientific workflows with provenance-capturing on data lakes. Data Intell. 4, 426–438. 10.1162/dint_a_00141
  • Noy N. F., Musen M. A. (2003). The PROMPT suite: interactive tools for ontology merging and mapping. Int. J. Hum. Comput. Stud. 59, 983–1024. 10.1016/j.ijhcs.2003.08.002
  • Oram A. (2015). Managing the Data Lake: Moving to Big Data Analysis. Sebastopol, CA: O'Reilly Media.
  • Papenbrock T., Bergmann T., Finke M., Zwiener J., Naumann F. (2015). Data profiling with Metanome. Proc. VLDB Endowment 8, 1860–1863. 10.14778/2824032.2824086
  • Patel P., Wood G., Diaz A. (2017). Data lake governance best practices, in The DZone Guide to Big Data-Data Science and Advanced Analytics, Vol. 4 (Durham, NC), 6–7.
  • Pautasso C., Alonso G. (2006). Parallel computing patterns for grid workflows, in 2006 Workshop on Workflows in Support of Large-Scale Science (Paris: IEEE), 1–10.
  • Pérez-Arteaga P. F., Castellanos C. C., Castro H., Correal D., Guzmán L. A., Denneulin Y. (2018). Cost comparison of lambda architecture implementations for transportation analytics using public cloud software as a service, in Special Session on Software Engineering for Service and Cloud Computing (Porto), 855–862.
  • Peterlongo P., Pisanti N., Boyer F., Sagot M.-F. (2005). Lossless filter for finding long multiple approximate repetitions using a new data structure, the bi-factor array, in International Symposium on String Processing and Information Retrieval (Buenos Aires: Springer), 179–190.
  • Quix C., Hai R., Vatov I. (2016). GEMMS: a generic and extensible metadata management system for data lakes, in CAiSE Forum, Vol. 129 (Ljubljana).
  • Ramakrishnan R., Sridharan B., Douceur J. R., Kasturi P., Krishnamachari-Sampath B., Krishnamoorthy K., et al. (2017). Azure Data Lake Store: a hyperscale distributed file service for big data analytics, in Proceedings of the 2017 ACM International Conference on Management of Data (Chicago, IL), 51–63.
  • Ravat F., Zhao Y. (2019). Data lakes: trends and perspectives, in International Conference on Database and Expert Systems Applications (Linz: Springer), 304–313.
  • Sawadogo P., Darmont J. (2021). On data lake architectures and metadata management. J. Intell. Inf. Syst. 56, 97–120. 10.1007/s10844-020-00608-7
  • Sawadogo P. N., Scholly E., Favre C., Ferey E., Loudcher S., Darmont J. (2019). Metadata systems for data lakes: models and features, in European Conference on Advances in Databases and Information Systems (Bled: Springer), 440–451.
  • Scholly E., Sawadogo P., Liu P., Espinosa-Oviedo J. A., Favre C., Loudcher S., et al. (2021). Coining goldMEDAL: a new contribution to data lake generic metadata modeling. arXiv preprint arXiv:2103.13155. 10.48550/arXiv.2103.13155
  • Sethi R., Traverso M., Sundstrom D., Phillips D., Xie W., Sun Y., et al. (2019). Presto: SQL on everything, in 2019 IEEE 35th International Conference on Data Engineering (ICDE) (Macao: IEEE), 1802–1813.
  • Sharma B. (2018). Architecting Data Lakes: Data Management Architectures for Advanced Business Use Cases. Sebastopol, CA: O'Reilly Media.
  • Shekhar S., Chawla S. (2003). Spatial Databases: A Tour. Upper Saddle River, NJ: Prentice Hall.
  • Singhal A. (2012). Introducing the knowledge graph: things, not strings. Off. Google Blog 5, 16.
  • Skluzacek T. J., Chard K., Foster I. (2016). Klimatic: a virtual data lake for harvesting and distribution of geospatial data, in 2016 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems (PDSW-DISCS) (Salt Lake City, UT: IEEE), 31–36.
  • Suriarachchi I., Plale B. (2016a). Crossing analytics systems: a case for integrated provenance in data lakes, in 2016 IEEE 12th International Conference on e-Science (e-Science) (Baltimore, MD: IEEE), 349–354.
  • Suriarachchi I., Plale B. (2016b). Provenance as essential infrastructure for data lakes, in International Provenance and Annotation Workshop (McLean, VA: Springer), 178–182.
  • Suriarachchi I., Zhou Q., Plale B. (2015). Komadu: a capture and visualization system for scientific data provenance. J. Open Res. Software 3, e4. 10.5334/jors.bq
  • Terrizzano I. G., Schwarz P. M., Roth M., Colino J. E. (2015). Data wrangling: the challenging journey from the wild to the lake, in CIDR (Asilomar).
  • Thusoo A., Sarma J. S., Jain N., Shao Z., Chakka P., Anthony S., et al. (2009). Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endowment 2, 1626–1629. 10.14778/1687553.1687609
  • Villari M., Celesti A., Fazio M., Puliafito A. (2014). AllJoyn Lambda: an architecture for the management of smart environments in IoT, in 2014 International Conference on Smart Computing Workshops (Hong Kong: IEEE), 9–14.
  • Vohra D. (2016). Apache Parquet, in Practical Hadoop Ecosystem. New York, NY: Springer.
  • Vrandečić D. (2012). Wikidata: a new platform for collaborative data collection, in Proceedings of the 21st International Conference on World Wide Web (Lyon), 1063–1064.
  • Walker C., Alrehamy H. (2015). Personal data lake with data gravity pull, in 2015 IEEE Fifth International Conference on Big Data and Cloud Computing (Dalian: IEEE), 160–167.
  • Warren J., Marz N. (2015). Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Shelter Island, NY: Simon and Schuster.
  • Weil S. A., Brandt S. A., Miller E. L., Long D. D., Maltzahn C. (2006). Ceph: a scalable, high-performance distributed file system, in Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Seattle, WA), 307–320.
  • Yuan Z., Ton That D. H., Kothari S., Fils G., Malik T. (2018). Utilizing provenance in reusable research objects. Informatics 5, 14. 10.3390/informatics5010014
  • Zaharia M., Chowdhury M., Franklin M. J., Shenker S., Stoica I. (2010). Spark: cluster computing with working sets, in 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10) (New York, NY).
  • Zaharia M., Xin R. S., Wendell P., Das T., Armbrust M., Dave A., et al. (2016). Apache Spark: a unified engine for big data processing. Commun. ACM 59, 56–65. 10.1145/2934664
  • Zhang Y., Ives Z. G. (2019). Juneau: data lake management for Jupyter. Proc. VLDB Endowment 12, 3352095. 10.14778/3352063.3352095
  • Zikopoulos P. (2015). Big Data Beyond the Hype: A Guide to Conversations for Today's Data Center. New York, NY: McGraw-Hill Education.


Data Descriptor | Open access | Published: 01 June 2023

SciSciNet: A large-scale open data lake for the science of science research

Zihang Lin, Yian Yin, Lu Liu & Dashun Wang

Scientific Data volume 10, Article number: 315 (2023)


The science of science has attracted growing research interests, partly due to the increasing availability of large-scale datasets capturing the innerworkings of science. These datasets, and the numerous linkages among them, enable researchers to ask a range of fascinating questions about how science works and where innovation occurs. Yet as datasets grow, it becomes increasingly difficult to track available sources and linkages across datasets. Here we present SciSciNet, a large-scale open data lake for the science of science research, covering over 134M scientific publications and millions of external linkages to funding and public uses. We offer detailed documentation of pre-processing steps and analytical choices in constructing the data lake. We further supplement the data lake by computing frequently used measures in the literature, illustrating how researchers may contribute collectively to enriching the data lake. Overall, this data lake serves as an initial but useful resource for the field, by lowering the barrier to entry, reducing duplication of efforts in data processing and measurements, improving the robustness and replicability of empirical claims, and broadening the diversity and representation of ideas in the field.


Background & Summary

Modern databases capturing the innerworkings of science have been growing exponentially over the past decades, offering new opportunities to study scientific production and use at larger scales and finer resolution than previously possible. Fuelled in part by the increasing availability of large-scale datasets, the science of science community turns scientific methods on science itself1,2,3,4,5,6, helping us understand in a quantitative fashion a range of important questions that are central to scientific progress—and of great interest to scientists themselves—from the evolution of individual scientific careers7,8,9,10,11,12,13,14,15,16,17,18 to collaborations19,20,21,22,23,24,25 and science institutions26,27,28 to the evolution of science2,3,5,29,30,31,32,33,34 to the nature of scientific progress and impact35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55.

Scholarly big data have flourished over the past decade, with several large-scale initiatives providing researchers free access to data. For example, CiteSeerX56, one of the earliest digital library search engines, offers a large-scale scientific library focusing on the literature in computer and information science. Building on a series of advanced data mining techniques, AMiner57 indexes and integrates a wide range of data about academic social networks58. Crossref (https://www.crossref.org/)59, as well as other initiatives in the open metadata community, have collected metadata such as Digital Object Identifier (DOI) in each publication record and linked them to a broad body of event data covering scholarly discussions. OpenAlex (https://openalex.org/)60, based on Microsoft Academic Graph (MAG)61,62,63, aims to build a large-scale open catalog for the global research system, incorporating scholarly entities and their connections across multiple datasets. In addition to data on scientific publications and citations capturing within-science dynamics, researchers have also tracked interactions between science and other socioeconomic spheres by tracing, for example, how science is referenced in patented inventions64,65,66, regarding both front-page and in-text citations from patents to publications67,68. Table 1 summarizes several exemplary datasets commonly used in the science of science literature, with information on their coverage and accessibility.

The rapid growth of the science of science community69,70,71, combined with its interdisciplinary nature, raises several key challenges confronting researchers in the field. First, it becomes increasingly difficult to keep track of available datasets and their potential linkages across disparate sources, raising the question of whether there are research questions that are underexplored simply due to a lack of awareness of the data. Second, as data and their linkages become more complex, there are substantial data pre-processing steps involved prior to analyses. Many of these steps are often too detailed to document in publications, with researchers making their own analytical choices when processing the data. Third, as tools and techniques used in the science of science grow in sophistication, measurements on these datasets can be computationally involved, requiring substantial investment of time and resources to compute these measures.

All these challenges highlight the need for a common data resource designed for research purposes, which could benefit the community in several important ways. First, it provides a large-scale empirical basis for research, helping to strengthen the level of evidence supporting new findings as well as increase the replicability and robustness of these findings. Second, it helps to reduce duplication of efforts across the community in data preprocessing and common measurements. Third, by compiling various datasets, linkages, and measurements, the data resource significantly lowers the barrier to entry, hence has the potential to broaden the diversity and representation of new ideas in the field.

To support these needs in the community, we present SciSciNet, a large-scale open data lake for the science of science research. The data lake not only incorporates databases that capture scientific publications, researchers, and institutions, but also tracks their linkages to related entities, ranging from upstream funding sources like NIH and NSF to downstream public uses, including references of scientific publications in patents, clinical trials, and media and social media mentions (see Fig.  1 and Table  2 for more details of entities and their relationships). Building on this collection of linked databases, we further calculate a series of commonly used measurements in the science of science, providing benchmark measures to facilitate further investigations while illustrating how researchers can further contribute collectively to the data lake. Finally, we validate the data lake using multiple approaches, including internal data validation, cross-database verification, as well as reproducing canonical results in the literature.

Figure 1. The entity relationship diagram of SciSciNet. SciSciNet includes “SciSciNet_Papers” as the main data table, with linkages to other tables capturing data from a range of sources. For clarity, here we show a subset of the tables (see Data Records section for a more comprehensive view of the tables). PK represents primary key, and FK represents foreign key.

The data lake, SciSciNet, is freely available at Figshare 72. At the core of the data lake is the Microsoft Academic Graph (MAG) dataset 61,62,63. MAG is one of the largest and most comprehensive bibliometric datasets in the world, and a popular dataset for science of science research. However, MAG was sunset by Microsoft at the end of 2021. Since then, there have been several important efforts in the community to ensure the continuity of data and services. For example, there are mirror datasets 73 available online for MAG, and the OpenAlex (https://openalex.org) initiative builds on the MAG data, not only making it open to all but also providing continuous updates 60. While these efforts have minimized potential disruptions, the sunsetting of MAG has also accelerated the need to construct open data resources designed for research purposes. Indeed, large-scale systematic datasets for the science of science mostly come in the form of raw data, which requires further pre-processing and filtering operations to extract fine-grained, high-quality research data. It usually takes substantial effort and expertise to clean the data, and many of these steps are often too detailed to document in publications, with researchers making their own analytical choices. This suggests that there is value in constructing an open data lake that aims to continue to extend the usefulness of MAG, with substantial data pre-processing steps documented. Moreover, the data lake links together several disparate sources and pre-computed measures commonly used in the literature, serving as an open data resource for researchers interested in the quantitative studies of science and innovation.

Importantly, the curated data lake is not meant to be exhaustive; rather it represents an initial step toward a common data resource to which researchers across the community can collectively contribute. Indeed, as more data and measurements in the science of science become available, researchers can help to contribute to the continuous improvement of this data lake by adding new data, measurements, and linkages, thereby further increasing the utility of the data lake. For example, if a new paper reports a new measurement, the authors could publish a data file linking the new measurement with SciSciNet IDs, which would make it much easier for future researchers to build on their work.

Data selection and curation from MAG

The Microsoft Academic Graph (MAG) dataset 61,62,63 covers a wide range of publication records, authors, institutions, and citation records among publications. MAG has a rich set of prominent features, including the application of advanced machine learning algorithms to classify fields of study in large-scale publication records, identify paper families, and disambiguate authors and affiliations. Here we use the edition released by MAG on December 6th, 2021, covering 270,694,050 publication records in total.

The extensive nature of the MAG data highlights a common challenge. Indeed, using the raw data for research often requires substantial pre-processing and data-cleaning steps to arrive at a research-ready database. For example, one may need to perform a series of data selection and curation operations, including the selection of scientific publications with reliable sources, aggregation of family papers, and redistribution of citation and reference counts. After going through these steps, one may generate a curated publication data table, which serves as the primary scientific publication data table in SciSciNet (Table  3 , “SciSciNet_Papers”). However, each of these steps requires us to make specific analytical choices, but given the detailed nature of these steps, the specific choices made through these steps have remained difficult to document through research publications.

Here we document in detail the various procedures we took in constructing the data lake. From the original publication data in MAG, we use MAG Paper ID as the primary key and consider a subset of main attributes, including DOI (Digital Object Identifier), document type, and publication year. As we are mainly interested in scientific publications within MAG, we first remove paper records whose document type is marked as patent. We also remove those with neither document type nor DOI information. Each scientific publication in the database may be represented by different entities (e.g., preprint and conference), indicated as a paper “family” in MAG. To avoid duplication, we aggregate all papers in the same family into one primary paper. We also do not include retracted papers in the primary paper table in SciSciNet. Instead, we include records of retracted papers and affiliated papers in paper families in another data table, “SciSciNet_PaperDetails” (Table 8), linked to the primary paper table, recording DOIs, titles, original venue names, and original counts for citations and references in MAG. Following these steps, the primary data table “SciSciNet_Papers” contains 134,129,188 publication records with unique primary Paper IDs, including 90,764,813 journal papers, 4,629,342 books, 3,932,366 book chapters, 5,123,597 conference papers, 145,594 datasets, 3,083,949 repositories, 5,998,509 thesis papers, and 20,451,018 other papers with DOI information.
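
To make these analytical choices concrete, the snippet below sketches the selection logic in pandas on a toy stand-in for the MAG Papers table. The column names (PaperId, Doi, DocType, FamilyId) follow MAG's published schema, but the code is an illustration of the steps described above rather than the verbatim production pipeline (which, for example, chooses family representatives more carefully).

```python
import pandas as pd

# Toy stand-in for the MAG Papers table (columns follow MAG's schema).
papers = pd.DataFrame({
    "PaperId":  [1, 2, 3, 4, 5],
    "Doi":      ["10.1/a", None, "10.1/c", None, "10.1/e"],
    "DocType":  ["Journal", "Patent", None, "Conference", None],
    "FamilyId": [None, None, None, 3.0, None],  # paper 4 belongs to family 3
})

# Step 1: remove patent records, and records with neither DocType nor DOI.
papers = papers[papers["DocType"] != "Patent"]
papers = papers[~(papers["DocType"].isna() & papers["Doi"].isna())]

# Step 2: aggregate paper families, mapping each record to a primary ID.
papers["PrimaryId"] = papers["FamilyId"].fillna(papers["PaperId"]).astype(int)
primary_papers = papers.drop_duplicates(subset="PrimaryId")
print(primary_papers[["PrimaryId", "Doi", "DocType"]])
```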

For consistency, we recalculate the citation and reference counts within the subset of 134 M primary papers, such that each citation or reference record is also included in this subset and can be found in “SciSciNet_PaperReferences” (Table  5 ). For papers in the same family, we aggregate their citations and references into the primary paper and drop duplicated citation pairs. Building on the updated citations, we recalculate the number of references and citations for each primary paper.
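
A corresponding sketch of the recount, again on toy data, assuming the citation pairs have already been remapped to primary IDs and de-duplicated as described above:

```python
import pandas as pd

# Citation pairs restricted to primary papers (CitingPaperId -> CitedPaperId),
# mimicking "SciSciNet_PaperReferences" after family aggregation.
refs = pd.DataFrame({
    "CitingPaperId": [1, 1, 3, 5, 5],
    "CitedPaperId":  [3, 5, 5, 3, 1],
}).drop_duplicates()

reference_count = refs.groupby("CitingPaperId").size()  # outgoing references
citation_count  = refs.groupby("CitedPaperId").size()   # incoming citations
```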

MAG also contains information on authors, institutions, and fields. While author disambiguation 58,74,75,76,77,78,79 remains a major challenge, we adopt the author disambiguation method from MAG and create an author table, which offers a baseline for future studies of individual careers. We also supplement the author table with empirical name-gender associations to support gender research, drawing from work by Van Buskirk et al. 80; this allows us to build “SciSciNet_Authors_Gender” (Table 9) with 134,197,162 author records including their full names.

For fields, we use the fields of study records from MAG and focus on the records related to the selected primary papers (19 Level-0 fields and 292 Level-1 fields, Table  6 ). We incorporate this information into two tables, the “SciSciNet_PaperAuthorAffiliations” (Table  4 ) and “SciSciNet_PaperFields” (Table  7 ), with 413,869,501 and 277,494,994 records, respectively.

We further use the information of “PaperExtendedAttributes” table from MAG to construct high-quality linkages between MAG Paper ID and PubMed Identifier (PMID). We drop duplicate links by only keeping the MAG primary paper record (if one PMID was linked to multiple MAG Paper IDs) or the latest updated PubMed record (if one MAG Paper ID was linked to multiple PMIDs), obtaining 31,230,206 primary MAG Paper ID-PMID linkages (95.6% of the original records) to further support linkage with external sources.

Together, the resulting SciSciNet includes 134,129,188 publications (Table 3), 134,197,162 authors (Table 9), 26,998 institutions (Table 10), 49,066 journals (Table 21), 4,551 conference series (Table 22), 19 top-level fields of study, 292 subfields (Table 6), and the internal links between them, including 1,588,739,703 paper-reference records (Table 5), 413,869,501 paper-author-affiliation records (Table 4), and 277,494,994 paper-field records (Table 7).

Linking publication data with external sources

While the main paper table captures citation relationships among scientific publications, there has been growing interest in studying how science interacts with other socioeconomic institutions 35,36,41,55,81,82. Here, we further trace references of scientific publications in data sources that go beyond publication datasets, tracking linkages from papers to their upstream funding sources and to their downstream uses in public domains. Specifically, we link papers to the grants they acknowledge from NSF and NIH, as well as to public uses of science by tracking references of scientific publications in patents, clinical trials, and news and social media.

NIH funding

The National Institutes of Health (NIH) is the largest public funder of biomedical research in the world. The past decade has witnessed increasing interest in understanding the role of NIH funding in the advancement of biomedicine 81,82 and its impact on individual career development 83,84. NIH ExPORTER provides bulk NIH RePORTER (https://report.nih.gov/) data on research projects funded by the NIH and other major HHS operating divisions. The database also provides link tables (updated on May 16, 2021) that connect funded projects with resulting publications over the past four decades.

To construct the funded project-paper linkages between SciSciNet Paper ID and NIH Project Number, we use the PMID of MAG papers (from our previously curated “PaperExtendedAttributes” table based on MAG) as the intermediate key, matching more than 98.9% of the original NIH link table records to primary Paper ID in SciSciNet. After dropping duplicate records, we end up with a collection of 6,013,187 records (Table  11 ), linking 2,636,061 scientific papers (identified by primary MAG Paper IDs) to 379,014 NIH projects (identified by core NIH-funded project numbers).

NSF funding

Beyond biomedical research, the National Science Foundation (NSF) funds approximately 25% of all federally supported basic research conducted by the United States’ colleges and universities across virtually all fields of science and engineering. NSF provides downloadable information on research projects it has funded, including awardee, total award amount, investigator, and so forth, but no information on funded research publications. While Federal RePORTER offered downloadable files on NSF awards with links to supported publications (662,072 NSF award-publication records by 2019), it only covered a limited time period and was retired in March 2022. To obtain a more comprehensive coverage of records linking NSF awards to supported papers, we crawl the webpages of all NSF awards to retrieve information on their resulting publications. In particular, we first created a comprehensive list of all NSF award numbers from https://www.nsf.gov/awardsearch/download.jsp. We then iterate over this list to download the entire webpage document of each NSF award (from the URL https://www.nsf.gov/awardsearch/showAward?AWD_ID=[Award number]) and use the “Publications as a result of this research” column to identify scientific publications related to the award. We then extract paper titles and relevant information by using the Python library ElementTree to navigate and parse the webpage document structurally. We end up collecting 489,446 NSF awards since 1959 (Table 20), including linkages between 131,545 awards and 1,350,915 scientific publications.
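
The following sketch illustrates the crawling step under stated assumptions: it uses only the URL pattern quoted above, and the parsing logic is hypothetical, since ElementTree requires well-formed markup and the actual element tags must be inspected on the live page.

```python
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://www.nsf.gov/awardsearch/showAward?AWD_ID="

def fetch_award_page(award_id: str) -> str:
    """Download the raw page for one NSF award."""
    with urllib.request.urlopen(BASE + award_id, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def extract_publication_strings(page: str) -> list:
    """Hypothetical parse: return raw reference strings near the
    'Publications as a result of this research' section. ElementTree needs
    well-formed XML, so real pages may first require cleanup or an
    HTML-tolerant parser such as lxml."""
    try:
        root = ET.fromstring(page)
    except ET.ParseError:
        return []
    # Placeholder traversal; the exact tags/attributes are assumptions.
    return [el.text.strip() for el in root.iter("p") if el.text]
```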

To process information crawled from NSF.gov, which is presented as raw text strings, we design a text-based multi-level matching process to link NSF awards to SciSciNet scientific publications:

For records with DOI information in the raw texts of funded research publications, we perform an exact match with SciSciNet primary papers through DOI. If the DOI in an NSF publication record matches that of one primary paper, we create a linkage between the NSF Award Number and the primary Paper ID. We matched 458,463 records from NSF awards to SciSciNet primary papers in which each DOI appeared only once in the entire primary paper table, thus enabling association with a unique Paper ID (exact match). After dropping duplicates where the same DOI appears repeatedly in the same NSF award, this step yields 350,611 records (26.0%) linking NSF awards to SciSciNet primary papers.

To process the rest of the records, we then use the title information of each article for further matching. After extracting the title from NSF records and performing a standardization procedure (e.g., converting each letter into lowercase and removing punctuation marks, extra spaces, tabs, and newline characters), our exact matches between paper titles in the NSF award data and SciSciNet primary paper data yield 246,701 unique matches (18.3% in total) in this step.
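
For instance, a minimal version of the standardization routine might look as follows (an illustration of the operations listed above, not the exact implementation):

```python
import re
import string

def standardize_title(title: str) -> str:
    """Normalize a title for exact matching: lowercase, strip punctuation,
    and collapse extra spaces, tabs, and newline characters."""
    title = title.lower()
    title = title.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", title).strip()

standardize_title("Hot Streaks in Artistic,\n Cultural, and Scientific Careers ")
# -> 'hot streaks in artistic cultural and scientific careers'
```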

We further develop a search engine for records that are not matched in the preceding steps. Here we use Elasticsearch, a free and open search and analytics engine, to index detailed information (paper title, author, journal or conference name, and publication year) of all SciSciNet primary papers. We then feed the raw texts of the crawled NSF publications into the system and obtain the results with the top two highest scores among the indexed primary papers. Similar to a previous study 55, we use the scores of the second-matched primary papers as a null model, and identify the first-matched primary paper as a match if its score is significantly higher than the right-tail cutoff of the second score distribution (P = 0.05). Following this procedure, we match 467,159 of the remaining records (34.6%) with significantly higher scores (Fig. 2a). Note that this procedure likely represents a conservative strategy that prioritizes precision over recall. Manually inspecting the rest of the potential matchings, we find that those with large differences between the top two Z-scores (Fig. 2b) are also likely to be correct matches. To this end, we also include these heuristic links, together with the differences in their Z-scores, as fuzzy-matching linkages between SciSciNet papers and NSF awards.
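
As a numerical illustration of this acceptance rule, the sketch below uses synthetic stand-ins for the retrieval scores of the top two Elasticsearch hits; the actual pipeline works with Z-scores of these retrieval scores rather than the raw values shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
first_scores  = rng.normal(12.0, 2.0, 10_000)  # stand-in top-1 scores
second_scores = rng.normal(8.0, 2.0, 10_000)   # stand-in top-2 scores (null)

# Accept a top-1 hit only if it exceeds the right-tail cutoff (P = 0.05)
# of the null distribution formed by the top-2 scores.
cutoff = np.quantile(second_scores, 0.95)
accepted = first_scores > cutoff
print(f"accepted {accepted.mean():.1%} of candidate matches")
```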

Figure 2. Matching NSF reference strings to MAG records. (a) Distribution of Z-scores for papers matched in Elasticsearch with the first and second highest scores. The vertical red line denotes the right-tail cutoff of the second score distribution (P = 0.05). (b) Distribution of pairwise Z-score differences for papers matched in the search engine but with the first score no higher than the right-tail cutoff of the second score distribution (P = 0.05).

We further supplement these matchings with information from the Crossref data dump, an independent dataset that links publications to over 30,000 funders, including NSF. We collect all paper-grant pairs where the funder is identified as NSF. We then use the raw grant numbers from Crossref and link paper records between Crossref and SciSciNet using DOIs. We obtain 305,314 records after cleaning, including 196,509 SciSciNet primary papers with DOIs matched to 83,162 NSF awards.

By combining records collected from all these steps, we collect 1,130,641 unique linkages with high confidence levels and 178,877 additional possible linkages from fuzzy matches (Table  12 ). Together these links connect 148,148 NSF awards and 929,258 SciSciNet primary papers.

Patent citations to science

The process by which knowledge transfers from science to marketplace applications has received much attention in the science and innovation literature 35,41,85,86,87,88. The United States Patent and Trademark Office (USPTO) makes patenting activity data publicly accessible, with the PatentsView platform providing extensive metadata on patent assignees, inventors, and lawyers, along with patents’ internal citations and full-text information. The European Patent Office (EPO) also provides open access to patent data containing rich attributes.

Building on recent advances in linking papers to patents 35,67,68, Marx and Fuegi developed a large-scale dataset of over 40 M citations from USPTO and EPO patents to scientific publications in MAG. Using this corpus (version v34 as of December 24, 2021), we merge 392 K patent citations received by affiliated MAG papers into their respective primary IDs in the same paper family. Dropping possible duplicate records with the same pair of primary Paper ID and Patent ID results in 38,740,313 paper-patent citation pairs between 2,360,587 patents from USPTO and EPO and 4,627,035 primary papers in SciSciNet (Table 15).

Clinical trials citations to science

Understanding bench-to-bed-side translation is essential for biomedical research 81 , 89 . ClinicalTrials.gov provides publicly available clinical study records covering 50 U.S. states and 220 countries, sourced from the U.S. National Library of Medicine. The Clinical Trials Transformation Initiative (CTTI) makes available clinical trials data through a database for Aggregate Analysis of ClinicalTrials.gov (AACT), an aggregated relational database helping researchers better study drugs, policies, publications, and other related items to clinical trials.

Overall, the data cover 686,524 records linking clinical trials to background or result papers (as of January 26th, 2022). We select the 480,893 records in which papers serve as background references supporting clinical trials, of which 451,357 records contain 63,281 unique trials matched to 345,797 reference papers with PMIDs. Similar to the process of linking scientific publications to NIH-funded projects, we again establish linkages between SciSciNet primary Paper IDs and NCT Numbers (National Clinical Trial Numbers) via PMID, aided by the curated “PaperExtendedAttributes” table as the intermediary. After standardizing the data format of the intermediate index PMID to merge publications and clinical trials, we obtain 438,220 paper-clinical trial linkages between 61,447 NCT clinical trials and 337,430 SciSciNet primary papers (Table 13).

News and social mentions of science

Understanding how science is mentioned in the media has been another important research direction in the science of science community 44,90. The Newsfeed mentions in Crossref Event Data link scientific papers in Crossref 59 with DOIs to news articles or blog posts in RSS and Atom feeds, providing access to the latest scientific news mentions from multiple sources, including Scientific American, The Guardian, Vox, The New York Times, and others. Similarly, Twitter mentions in Crossref Event Data link scientific papers to tweets created by Twitter users, offering an opportunity to explore scientific mentions on Twitter.

We use the Crossref Event API to collect 947,160 records between 325,396 scientific publications and 387,578 webpages from news blogs or posts (from April 5th, 2017 to January 16th, 2022) and 59,593,281 records between 4,661,465 scientific publications and 58,099,519 tweets (from February 7th, 2017 to January 17th, 2022).

For both news media and social media mentions, we further link Crossref’s publication records to SciSciNet’s primary papers. To do so, we first normalize the DOI format of these data records and convert all alphabetic characters to lowercase. We use the normalized DOI as the intermediate index, as detailed below:

For news media mentions, we construct linkages between primary Paper ID and Newsfeed Object ID (i.e., the webpage of news articles or blog posts) by inner joining normalized DOIs. We successfully link 899,323 records from scientific publications to news webpages in the Newsfeed list, accounting for 94.9% of the total records. The same news mention may be collected multiple times. After removing duplicate records, we end up with 595,241 records, linking 307,959 papers to 370,065 webpages from Newsfeed (Table  17 ).

Similarly, for social media mentions, we connect primary Paper IDs with Tweet IDs through inner joining normalized DOIs, yielding 56,121,135 records, more than 94% of the total records. After dropping duplicate records, we keep 55,846,550 records, linking 4,329,443 papers to 53,053,505 tweets (Table  16 ).

We also provide metadata of paper-news linkages, including the mention time and the detailed mention information in Newsfeed, to better support future research on this topic (Table  18 ). Similarly, we also offer the metadata of paper-tweet links, including the mention time and the original collected Tweet ID so that interested researchers can merge with further information from Twitter using the Tweet ID (Table  19 ).

Nobel Prize data from the dataset of publication records for Nobel laureates

We integrate a recent dataset by Li et al. 91 into the data lake, containing the publication records of Nobel laureates in science from 1900 to 2016, including both Nobel prize-winning works and other papers produced over their careers. After mapping affiliated MAG Paper IDs to primary ones, we obtain 87,316 publication records of Nobel laureates in the SciSciNet primary paper table (20,434 in physics, 38,133 in chemistry, and 28,749 in physiology/medicine, Table 14).

Calculation of commonly used measurements

Using the constructed dataset, we further calculate a range of commonly used measurements of scientific ideas, impacts, careers, and collaborations. Interested readers can find more details and validations of these measurements in the literature 15 , 19 , 20 , 46 , 47 , 48 , 92 , 93 , 94 , 95 , 96 , 97 , 98 .

Publication-level

The number of researchers and institutions in a scientific paper.

Building on team science literature 19 , 27 , we calculate the number of authors and the number of institutions for each paper as recorded in our data lake. We group papers by primary Paper ID in the selected “SciSciNet_PaperAuthorAffiliations” table and aggregate the unique counts of Author IDs and Affiliation IDs as the number of researchers (team size) and institutions, respectively.
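
In pandas-like terms, this aggregation reduces to a group-by over the paper-author-affiliation table, as in this toy sketch (SciSciNet-style column names):

```python
import pandas as pd

# Toy stand-in for "SciSciNet_PaperAuthorAffiliations".
paa = pd.DataFrame({
    "PaperID":       [1, 1, 1, 2, 2],
    "AuthorID":      [10, 11, 11, 12, 13],
    "AffiliationID": [100, 100, 101, 102, 102],
})
team_size    = paa.groupby("PaperID")["AuthorID"].nunique()
institutions = paa.groupby("PaperID")["AffiliationID"].nunique()
```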

Five-year citations (c5), ten-year citations (c10), normalized citations (cf), and hit papers

The number of citations of a paper evolves over time 46,48,99,100. Here we calculate c5 and c10, defined as the number of citations a paper received within 5 and 10 years of publication, respectively. For the primary papers, we calculate c5 for all papers published up to 2016 (as the last version of the MAG publication data extends through 2021) by counting the number of citation pairs with a time difference of at most 5 years. Similarly, we calculate c10 for all papers published up to 2011.
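
A sketch of this window-based counting on toy citation pairs, assuming publication years have been joined onto both sides of each pair:

```python
import pandas as pd

pairs = pd.DataFrame({
    "CitedPaperId": [1, 1, 1, 2],
    "CitedYear":    [2000, 2000, 2000, 2005],
    "CitingYear":   [2003, 2004, 2012, 2006],
})

# c5: citations arriving within 5 years of publication; c10 is analogous.
within5 = pairs[pairs["CitingYear"] - pairs["CitedYear"] <= 5]
c5 = within5.groupby("CitedPaperId").size()
```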

To compare citation counts across disciplines and time, Radicchi et al. 48 proposed the relative citation indicator cf, defined as the total number of citations c divided by the average number of citations c0 of papers in the same field and year. Here we calculate this normalized citation indicator for each categorized paper in both top-level fields and subfields, known as Level-0 fields (19 in total) and Level-1 fields (292 in total) as categorized by MAG. Note that each paper may be associated with multiple fields; hence, we report the calculated normalized citations for each paper-field pair in the “SciSciNet_PaperFields” data table.

Another citation-based measure widely used in the science of science literature 16,19,83 is the “hit paper”, defined as a paper in the top 5% of citations within the same field and year. Similar to our calculation of cf, we use the same grouping by fields and years and identify all papers with citations greater than the top 5% citation threshold. We also perform similar operations for top 1% and top 10% hit papers.
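
Both the normalized citation and the hit-paper indicator reduce to group-level statistics within each field-year stratum, as in this toy sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "PaperID": range(8),
    "FieldID": [1, 1, 1, 1, 2, 2, 2, 2],
    "Year":    [2000] * 8,
    "C":       [0, 2, 10, 40, 1, 3, 5, 100],
})
grp = df.groupby(["FieldID", "Year"])["C"]
df["c_f"] = df["C"] / grp.transform("mean")          # c / c0 within field-year
df["hit5"] = df["C"] > grp.transform(lambda s: s.quantile(0.95))
```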

Citation dynamics

A model developed by Wang, Song, and Barabási (the WSB model) 46 captures the long-term citation dynamics of individual papers by incorporating three fundamental mechanisms: preferential attachment, aging, and fitness. The model predicts the cumulative citations received by paper i at time t after publication: \(c_i^t = m\left[e^{\lambda_i \Phi\left(\frac{\ln t - \mu_i}{\sigma_i}\right)} - 1\right]\), where \(\Phi(x)\) is the standard cumulative normal distribution of x, m captures the average number of references per paper, and \(\mu_i\), \(\sigma_i\), and \(\lambda_i\) indicate the immediacy, longevity, and fitness parameters characterizing paper i, respectively.

We implement the WSB model with prior for papers published in the fields of math and physics. Following the method proposed by Shen et al. 92, we adopt the Bayesian approach to calculate the conjugate prior, which follows a gamma distribution. The method allows us to better predict long-term impact through the posterior estimation of \(\lambda_i\), while helping to avoid potential overfitting problems. Fitting this model to empirical data, we compute the immediacy \(\mu_i\), the longevity \(\sigma_i\), and the ultimate impact \(c_i^\infty = m\left[e^{\lambda_i} - 1\right]\) for all math and physics papers with at least 10 citations within 10 years after publication (published no later than 2011). To facilitate research on citation dynamics across different fields 48, we have also used the same procedure to fit the citation sequences of papers that received at least 10 citations within 10 years across all fields of study from the 1960s to the 1990s.
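
A direct implementation of the model's citation curve is straightforward; the sketch below evaluates \(c_i^t\) for given parameters (parameter fitting and the Bayesian prior of Shen et al. are omitted, and m = 30 is an illustrative choice for the average number of references):

```python
import numpy as np
from scipy.stats import norm

def wsb_cumulative_citations(t, lam, mu, sigma, m=30):
    """Cumulative citations c_t = m * (exp(lam * Phi((ln t - mu)/sigma)) - 1),
    where Phi is the standard normal CDF."""
    return m * (np.exp(lam * norm.cdf((np.log(t) - mu) / sigma)) - 1.0)

# Example: a 20-year citation trajectory for illustrative parameters.
years = np.arange(1, 21)
trajectory = wsb_cumulative_citations(years, lam=2.0, mu=1.5, sigma=1.0)
```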

Sleeping beauty coefficient

Sometimes it may take years or even decades for papers to gain attention from the scientific community, a phenomenon known as “Sleeping Beauty” in science 93. The sleeping beauty coefficient B is defined as \(B = \sum_{t=0}^{t_m} \frac{\frac{c_{t_m} - c_0}{t_m} \cdot t + c_0 - c_t}{\max(1, c_t)}\), where the paper receives its maximum yearly citations \(c_{t_m}\) in year \(t_m\) and \(c_0\) citations in the year of publication. Here we calculate the sleeping beauty coefficient from the yearly citation records of a paper. We match the publication years for each citing-cited paper pair published in journals and then aggregate yearly citations since publication for each cited paper. Next, we group the “SciSciNet_PaperReferences” table by each cited paper and compute the coefficient B, along with the awakening time. As a result, we obtain 52,699,363 records with sleeping beauty coefficients for journal articles with at least one citation.
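
The coefficient can be computed directly from a paper's yearly citation counts, as in this sketch following the definition above:

```python
import numpy as np

def sleeping_beauty_B(yearly_citations):
    """B from yearly citation counts c_t for t = 0..T since publication."""
    c = np.asarray(yearly_citations, dtype=float)
    t_m = int(np.argmax(c))          # year of maximum yearly citations
    if t_m == 0:
        return 0.0                   # peaks immediately: no dormancy
    t = np.arange(t_m + 1)
    line = (c[t_m] - c[0]) / t_m * t + c[0]   # reference line from c_0 to c_tm
    return float(np.sum((line - c[:t_m + 1]) / np.maximum(1.0, c[:t_m + 1])))

sleeping_beauty_B([0, 0, 1, 0, 2, 1, 30])  # delayed recognition -> large B
```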

Novelty and conventionality

Research shows that the highest-impact papers in science tend to be grounded in exceptionally conventional combinations of prior work yet simultaneously feature an intrusion of atypical combinations 47. Following this work 47, we calculate the novelty and conventionality scores of each paper by computing the Z-score for each combination of journal pairs. We further calculate the distribution of journal-pair Z-scores by traversing all possible duos of references cited by a particular paper. A paper’s median Z-score characterizes the median conventionality of the paper, whereas a paper’s 10th percentile Z-score captures the tail novelty of the paper’s atypical combinations.

More specifically, we first use the publication years of each citing-cited paper pair, both published in journals, and shuffle the reference records within each citing-cited year group to generate 10 randomized citation networks, while controlling for the naturally skewed citation distributions. We then traverse each focal paper published in the same year. We further aggregate the frequency of referenced journal pairs for papers in the real citation network and the 10 randomized citation networks, calculating the Z-score of each referenced journal pair for papers published in the same year. Finally, for each focal paper, we obtain the 10th percentile and median of its Z-score distribution, yielding 44,143,650 publication records with novelty and conventionality measures for journal papers from 1950 to 2021.

Disruption score

The disruption index quantifies the extent to which a paper disrupts or develops the existing literature 20,51. Disruption, or D, is calculated through citation networks. For a given paper, one can separate subsequent works into three types: those that cite only the focal paper while ignoring the references the paper builds upon, those that cite both the focal paper and its references, and those that cite only the focal paper’s references. D is expressed as \(D = p_i - p_j = \frac{n_i - n_j}{n_i + n_j + n_k}\), where \(n_i\) is the number of subsequent works that cite only the focal paper, \(n_j\) is the number of subsequent works that cite both the focal paper and its references, and \(n_k\) is the number of subsequent works that cite only the references of the focal paper. Following this definition, we calculate disruption scores for all papers that have at least one forward and backward citation (48,581,274 in total).
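
Given the sets of citing papers, the index is a few set operations, as this sketch shows:

```python
def disruption(focal_citers, reference_citers):
    """D = (n_i - n_j) / (n_i + n_j + n_k) from two sets of citing papers:
    those citing the focal paper and those citing its references."""
    focal_citers, reference_citers = set(focal_citers), set(reference_citers)
    n_i = len(focal_citers - reference_citers)  # cite focal only
    n_j = len(focal_citers & reference_citers)  # cite focal and its references
    n_k = len(reference_citers - focal_citers)  # cite references only
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else None

disruption(focal_citers={1, 2, 3}, reference_citers={3, 4})  # -> 0.25
```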

The number of NSF and NIH supporting grants

For external linkages from scientific publications to upstream supporting funding sources, we calculate the number of NSF/NIH grants associated with each primary paper in SciSciNet.

The number of patent citations, Newsfeed mentions, Twitter mentions, and clinical trial citations

For external linkages from scientific publications to downstream public uses of science, we also calculate the number of citations each primary paper in SciSciNet received from domains that go beyond science, including patents from USPTO and EPO, news and social media mentions from Newsfeed and Twitter, and clinical trials from ClinicalTrials.gov.

Individual- and Institutional-level measures

Productivity.

Scientific productivity is a widely used measure for quantifying individual careers 9,15. Here we aggregate the unique primary Paper IDs in SciSciNet, after grouping the records in the “SciSciNet_PaperAuthorAffiliations” data table by Author ID or Affiliation ID, and calculate the number of publications produced by each author or affiliation.

The H-index is a popular metric to estimate a researcher’s career impact. The index of a scientist is h if h of her papers have at least h citations each and the remaining papers have fewer than h citations 94,101. Here we compile the full publication list associated with each author, sort these papers by their total number of citations in descending order, and calculate the maximum value satisfying the condition above as the H-index. By repeating the same procedure for each research institution, we also provide an institution-level H-index.
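
For reference, a compact H-index computation consistent with this definition:

```python
def h_index(citations):
    """h such that h papers have >= h citations each (remaining have fewer)."""
    cs = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(cs, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

h_index([10, 8, 5, 4, 3])  # -> 4
```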

Scientific impact

Building on our c10 measure at the paper level, here we further calculate the average c10 (⟨c10⟩) for each author and affiliation, which offers a proxy for individual- and institutional-level scientific impact. Similarly, we calculate the average log c10 (⟨log c10⟩), which is closely related to the Q parameter 15 of individual scientific impact.

Here we group by Author ID and Affiliation ID in the “PaperAuthorAffiliations” table, and then aggregate the c10 and log c10 values (pre-calculated at the paper level) of all papers published under the same ID. Following previous works 15,16,102, to avoid taking the logarithm of zero, we increase c10 by one when calculating ⟨log c10⟩.

Name-gender associations

The availability of big data also enables a range of studies focusing on gender disparities, ranging from scientific publications and careers 17,103,104,105,106 to collaboration patterns 25,107 and the effects of the pandemic on women scientists 45,108,109,110. Here we apply the method from a recent statistical model 80 to infer author gender based on first names in the original author table. The method feeds unique author names into a cultural consensus model of name-gender associations incorporating 36 separate sources across over 150 countries. Note that of the 134,197,162 authors, 23.26% (31,224,458) have only first initials and are excluded from the inference. By fine-tuning the annotated names from these data sources following the original method, we obtain 409,809 unique names, with the maximum uncertainty threshold set to 0.26 and 85% of the sample classified. Finally, we merge these name-gender inference records into the original SciSciNet_Authors table, resulting in a SciSciNet_Authors_Gender table, which contains 86,286,037 authors with the inferred probability that a name belongs to an individual gendered female, denoted P(gf), as well as the number of inference source datasets and empirical counts. Together, by combining new statistical models with our systematic authorship information, this new table provides name-gender information useful for studying gender-related questions. It is important to note that name-based gender inference algorithms, including the one used here as well as other popular tools such as genderize.io, have limitations and are necessarily imperfect. These limitations should be considered carefully when applying such methods 96.

Data Records

The data lake, SciSciNet, is freely available at Figshare 72 .

Data structure

Table  2 presents the size and descriptions of these data files.

Table 3 describes “SciSciNet_Papers”, the data lake’s primary paper table, containing information on the primary scientific publications, including Paper ID, DOI, and other attributes, along with the Journal ID or Conference Series ID, which links papers to the corresponding journals or regularly held conference series. The short description of each data field includes the corresponding explanation of that field.

Tables 4–22 include the data fields and corresponding descriptions of each data table. Each data field is identified by its index name, and fields that share the same ID name across tables can be used to link those tables. Further, the data link tables provide linkages from scientific publications to external socioeconomic institutions. For example, the paper with primary “PaperID” 246319838, which studied hereditary spastic paraplegia 111, links to three core NIH project numbers, “R01NS033645”, “R01NS036177”, and “R01NS038713”, in Table 11 “SciSciNet_Link_NIH”. We can not only extract detailed information and metrics for the paper from the data lake (e.g., its title from Table 8 “SciSciNet_PaperDetails”, or citation counts from the primary paper Table 3 “SciSciNet_Papers”) but also obtain further information about the funded projects, such as the total funding amount, from NIH RePORTER (https://report.nih.gov).

Descriptive statistics

Next, we present a set of descriptive statistics derived from the data lake. Figure  3a–c show the distribution of papers across 19 top-level fields, the exponential growth of scientific publications in SciSciNet over time, and the average team size of papers by field over time.

Figure 3. Summary statistics of scientific publications in SciSciNet. (a) The number of publications in 19 top-level fields. For clarity, we aggregated the field classification into the top level (e.g., a paper is counted as a physics paper if it is associated with physics or any of its subfields). (b) The exponential growth of science over time. (c) Average team size by field from 1950 to 2020. The bold black line is for papers in all 19 top-level fields; each colored line indicates one of the 19 fields (color coded according to (a)).

Building on the external linkages we constructed, Fig.  4a–f show the distribution of paper-level upstream funding sources from NIH and NSF, and downstream applications and mentions of science, including USPTO/EPO patents, clinical trials, news mentions from Newsfeed, and social media mentions from Twitter.

Figure 4. Linking scientific publications with socioeconomic institutions. Panels (a, b and d, e) show the distribution of paper-level downstream applications (a: Twitter mentions; b: Newsfeed mentions; d: patents; e: clinical trials). Panels (c and f) show the distribution of supporting scientific grants from NIH (c) and NSF (f).

Figure  5 presents the probability distributions of various commonly used metrics in the science of science using our data lake, which are broadly consistent with the original studies in the literature.

Figure 5. Commonly used metrics in SciSciNet. (a) The distribution of disruption scores for 48,581,274 papers 20 (50,000 bins in total). (b) Cumulative distribution function (CDF) of 44,143,650 journal papers’ 10th percentile and median Z-scores 47. (c) Distribution of \(e^{\langle \log c_{10} \rangle}\) for scholars 15 with at least 10 publications in SciSciNet. The red line corresponds to a log-normal fit with μ = 2.14 and σ = 1.14. (d) Survival distribution function of sleeping beauty coefficients 93 for 52,699,363 papers, with a power-law fit: exponent α = 2.40. (e) Data collapse for a selected subset of papers with more than 30 citations within 30 years across journals in physics in the 1960s, based on the WSB model 46. The red line corresponds to the cumulative distribution function of the standard normal distribution.

Technical Validation

Validation of publication and citation records.

As we select the primary papers from the original MAG dataset, we have re-counted the citations and references within the subset of primary papers. To test the reliability of the updated citation and reference counts in SciSciNet, here we compare the two versions (i.e., raw MAG counts and redistributed SciSciNet counts) by calculating the Spearman correlation coefficients for both citations and references. The Spearman correlation coefficients are 0.991 for citations and 0.994 for references, indicating that these metrics are highly correlated before and after the redistribution process.
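
The check itself is a one-line rank correlation once the two count vectors are aligned by Paper ID, e.g.:

```python
from scipy.stats import spearmanr

mag_counts       = [10, 0, 3, 25, 7]   # toy stand-ins for aligned count vectors
sciscinet_counts = [11, 0, 3, 24, 6]
rho, _ = spearmanr(mag_counts, sciscinet_counts)
```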

We also examine the coverage of our publication data through cross-validation with an external dataset, Dimensions 112. Using DOI as a standardized identifier, we find that the two databases contain a similar number of papers, with 106,517,016 papers in Dimensions and 98,795,857 papers in SciSciNet associated with unique DOIs. We further compare the overlap of the two databases, finding that the two data sources share a vast majority of papers in common (84,936,278 papers with common DOIs, accounting for 79.74% of Dimensions and 85.97% of SciSciNet).

Further, the citation information recorded in the two datasets appears highly consistent. Within the 84.9 M papers matched by common DOIs, SciSciNet records a similar, yet slightly higher, number of citations on average (16.75) compared with Dimensions (14.64). Our comparison also reveals a high degree of consistency in paper-level citation counts between the two independent corpora, with a Spearman correlation coefficient of 0.946 and a concordance coefficient 98,113 of 0.940. Together, these validations provide further support for the coverage of the data lake.

Validation of external data linkages

We further perform additional cross-validation to assess the reliability of data linkages from scientific publications to external data sources. Here we focus on the NSF-SciSciNet publication linkages, which we created end to end, from raw data collection to final data linkage. We use the same approach to validate the NIH-SciSciNet publication linkages.

Here we compare the distribution and coverage of paper-grant linkages between SciSciNet and Dimensions—one of the state-of-the-art commercial databases for publication-grant linkages 112. Figure 6a,b present the distribution of the number of papers matched to each NSF award and NIH grant, showing that our open-source approach offers a comparable degree of coverage. We further perform grant-level analysis by comparing the number of papers matched to each grant reported by the two sources (Fig. 6c,d), again finding high degrees of consistency (Spearman correlation coefficients: 0.973 for NIH grants and 0.714 for NSF grants).

Figure 6. Validation of data linkages between SciSciNet and Dimensions. Panels (a, b): the distribution of the number of papers matched to each NIH and NSF grant, respectively. Panels (c, d): the number of papers matched to each NIH and NSF grant, respectively. All panels are based on data in a 20-year period (2000–2020).

We further calculate the confusion matrices of linkages from SciSciNet and Dimensions. By connecting the two datasets through paper DOIs and NSF/NIH grant project numbers, we compare their overlaps and differences in grant-paper pairs. For NSF, the confusion matrix is shown in Table 23. The two datasets provide a similar level of coverage, with Dimensions containing 670,770 pairs and SciSciNet containing 632,568 pairs. 78.9% of the pairs in Dimensions (and 83.7% of the pairs in SciSciNet) can be found in the other dataset, documenting a high degree of consistency between the two sources. While there are data links contained in Dimensions that are not in SciSciNet, we also find a similar number of data records in SciSciNet that are not in Dimensions. Table 24 shows the confusion matrix of NIH grant-paper pairs between the two datasets. Again, the two datasets share a vast majority of grant-paper pairs in common, and 95.3% of the pairs in Dimensions (and 99.7% of the pairs in SciSciNet) can also be found in the other dataset. These validations further support the overall quality and coverage of data linkages in SciSciNet.

Validation of calculations of commonly used measurements

We also seek to validate the calculated metrics included in SciSciNet. Beyond manually inspecting independent data samples during data processing and presenting the corresponding distributions of indicators in the Descriptive statistics section, which capture general patterns, we further check the calculated results of these popular measurements by reproducing canonical results in the science of science through a series of standardized and transparent processes.

For disruption scores, we plot the median disruption percentile and average citations against team size for 48,581,274 publications with at least one citation and one reference record in SciSciNet. As shown in Fig. 7a, as team size increases, the disruption percentile decreases while the average citations increase, consistent with the empirical finding that small teams disrupt whereas large teams develop 20. In addition, the probability of being among the top 5% most disruptive publications is negatively correlated with team size, while the probability of being among the most impactful publications is positively correlated with team size (Fig. 7b). These results demonstrate consistency with results obtained in the literature.

Figure 7. Calculating commonly used measurements in the science of science literature. (a, b) Small teams disrupt while large teams develop in SciSciNet. (c) The cumulative distribution functions (CDFs) of the proportion of external citations for papers with high (top 10,000, B > 307.55), medium (from 10,001st to the top 2% of SBs, 33 < B ≤ 307.55), and low (B ≤ 33) sleeping beauty coefficients. (d) The probability of a top 5% hit paper, conditional on novelty and conventionality, for all journal articles in SciSciNet from 1950 to 2000.

The combination of conventional wisdom and atypical knowledge tends to predict higher citation impact 47. Here we repeat the original analysis by categorizing papers based on (1) median conventionality: whether the median Z-score of a paper is in the upper half, and (2) tail novelty: whether the paper is within the top 10th percentile of the novelty score. We then identify hit papers (within the subset of our analysis), defined as papers ranking in the top 5% of ten-year citations within the same top-level field and year. The four quadrants in Fig. 7d show that papers with high median conventionality and high tail novelty have a hit rate of 7.32% among SciSciNet papers published from 1950 to 2000. Papers with high median conventionality but low tail novelty show a hit rate of 4.18%, roughly similar to the baseline rate of 5%, while those with low median conventionality but high tail novelty display a hit rate of 6.48%. Meanwhile, papers with both low median conventionality and low tail novelty exhibit a hit rate of 3.55%. These results are broadly consistent with the canonical results reported in 47.

In Fig. 5e, we select 36,802 physics papers published in the 1960s with more than 30 citations within 30 years of publication. By rescaling their citation dynamics using the fitted parameters, we find a remarkable collapse of the rescaled citation dynamics, which appears robust across fields and decades. We further validate the predictive power of the model with prior, based on Shen et al. 92, by calculating the out-of-sample prediction accuracy. We find that with a training period of 15 years, the predictive accuracy (defined with a strict absolute tolerance threshold of 0.1) stays above 0.65 for 10 years after the training period, and the Mean Absolute Percentage Error (MAPE) is less than 0.1. The MAPE stays below 0.15 for 20 years after the training period.

Sleeping beauty

We first fit the distribution of the sleeping beauty coefficients in SciSciNet (Fig. 5d) to a power-law form using maximum likelihood estimation 114, obtaining a power-law exponent α = 2.40 and minimum value Bm = 23.59. Using the fine-grained subfield information provided by MAG, we further calculate the proportion of external citations. Consistent with the original study 93, we find that papers with high B scores are more likely to have a higher proportion of external citations from other fields (Fig. 7c).

Usage Notes

Note that, recognizing the recent surge of interest in the quantitative understanding of science 95,97,98,115,116, the measurements currently covered in the data lake are not meant to be comprehensive; rather, they serve as examples to illustrate how researchers from the broader community can collectively contribute to and enrich the data lake. There are also limitations that readers should keep in mind when using the data lake. For example, our grant-publication linkage is focused on scientific papers supported by NSF and NIH; the patent-publication linkage is limited to citations from USPTO and EPO patents; the clinical trial-publication linkage is derived from clinicaltrials.gov (where the geographical distribution may be heterogeneous across countries, Table 25); and the media-publication linkage is based on sources tracked by Crossref. Further, while our data linkages are based on state-of-the-art methods of data extraction and cleaning, as with any matching, the methods are necessarily imperfect and may be further improved through integration with complementary commercial products such as Altmetric and Dimensions. Finally, our data inherently represent a static snapshot, drawing primarily from the final edition of MAG (Dec 2021 version). While this snapshot is already sufficient for answering many of the research questions that arise in the field, future work may engage in continuous improvement and updating of the data lake to maximize its potential.

Overall, this data lake serves as an initial step toward serving the community in studying publications, funding, and broader impact. At the same time, there are several promising directions for future work expanding the present effort. For example, the rapid development of natural language processing (NLP) models and techniques, accompanied by the increasing availability of text information from scientific articles, offers new opportunities to collect and curate more detailed content information. For example, one can link SciSciNet to other sources such as OpenAlex or Semantic Scholar to analyze large-scale data of abstracts, full texts, or text-based embeddings. Such efforts will not only enrich the metadata associated with each paper but also enable more precise identification and linkage of bio/chemical entities studied in these papers 117. Further, although platforms like MAG have implemented advanced algorithms for name disambiguation and topic/field classification at scale, these algorithms are inherently imperfect and not necessarily consistent across datasets; hence, it is essential to further validate and improve the accuracy of name disambiguation and topic classification 118. Relatedly, in this paper we primarily focus on paper-level linkages across different datasets. Using these linkages as intermediary information, one can further construct and enrich individual-level profiles, allowing us to combine the professional information of researchers (e.g., education background, grants, publications, and other broad impact) with important demographic dimensions (e.g., gender, age, race, and ethnicity). Finally, the data lake could contribute to an ecosystem for the collective community of the science of science. For example, there are synergies with the development of related programming packages, such as pySciSci 119. By making the data lake fully open, we also hope it inspires other researchers to contribute to the data lake and enrich its coverage. For example, when a research team publishes a new measure, they could put out a data file that computes their measure based on SciSciNet, effectively adding a new column to the data lake. Lastly, science forms a complex social system and often offers an insightful lens for examining broader social science questions, suggesting that SciSciNet may see greater utility by benefiting adjacent fields such as computational social science 120,121, network science 122,123, complex systems 124, and more 125.

Code availability

The source code for data selection and curation, data linkage, and metrics calculation is available at https://github.com/kellogg-cssi/SciSciNet .

Liu, L., Jones, B. F., Uzzi, B. & Wang, D. Measurement and Empirical Methods in the Science of Science. Nature Human Behaviour , https://doi.org/10.1038/s41562-023-01562-4 (2023).

Fortunato, S. et al . Science of science. Science 359 , eaao0185 (2018).

Wang, D. & Barabási, A.-L. The science of science . (Cambridge University Press, 2021).

Zeng, A. et al . The science of science: From the perspective of complex systems. Physics reports 714 , 1–73 (2017).

Azoulay, P. et al . Toward a more scientific science. Science 361 , 1194–1197 (2018).

Clauset, A., Larremore, D. B. & Sinatra, R. Data-driven predictions in the science of science. Science 355 , 477–480 (2017).

Liu, L., Dehmamy, N., Chown, J., Giles, C. L. & Wang, D. Understanding the onset of hot streaks across artistic, cultural, and scientific careers. Nature communications 12 , 1–10 (2021).

Jones, B. F. The burden of knowledge and the “death of the renaissance man”: Is innovation getting harder? The Review of Economic Studies 76 , 283–317 (2009).

Way, S. F., Morgan, A. C., Clauset, A. & Larremore, D. B. The misleading narrative of the canonical faculty productivity trajectory. Proceedings of the National Academy of Sciences 114 , E9216–E9223, https://doi.org/10.1073/pnas.1702121114 (2017).

Jones, B. F. & Weinberg, B. A. Age dynamics in scientific creativity. Proceedings of the National Academy of Sciences 108 , 18910–18914 (2011).

Malmgren, R. D., Ottino, J. M. & Amaral, L. A. N. The role of mentorship in protege performance. Nature 465 , 622–U117 (2010).

Liénard, J. F., Achakulvisut, T., Acuna, D. E. & David, S. V. Intellectual synthesis in mentorship determines success in academic careers. Nature communications 9 , 1–13 (2018).

Petersen, A. M. et al . Reputation and Impact in Academic Careers. Proceedings of the National Academy of Science USA 111 , 15316–15321 (2014).

Ma, Y., Mukherjee, S. & Uzzi, B. Mentorship and protégé success in STEM fields. Proceedings of the National Academy of Sciences 117 , 14077–14083 (2020).

Sinatra, R., Wang, D., Deville, P., Song, C. M. & Barabasi, A. L. Quantifying the evolution of individual scientific impact. Science 354 (2016).

Liu, L. et al . Hot streaks in artistic, cultural, and scientific careers. Nature 559 , 396–399 (2018).

Larivière, V., Ni, C., Gingras, Y., Cronin, B. & Sugimoto, C. R. Bibliometrics: Global gender disparities in science. Nature News 504 , 211 (2013).

Sugimoto, C. R. et al . Scientists have most impact when they’re free to move. Nature 550 , 29–31 (2017).

Wuchty, S., Jones, B. F. & Uzzi, B. The increasing dominance of teams in production of knowledge. Science 316 , 1036–1039 (2007).

Wu, L., Wang, D. & Evans, J. A. Large teams develop and small teams disrupt science and technology. Nature 566 , 378–382, https://doi.org/10.1038/s41586-019-0941-9 (2019).

Milojevic, S. Principles of scientific research team formation and evolution. Proceedings of the National Academy of Sciences 111 , 3984–3989 (2014).

Newman, M. E. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences 98 , 404–409 (2001).

AlShebli, B. K., Rahwan, T. & Woon, W. L. The preeminence of ethnic diversity in scientific collaboration. Nature communications 9 , 1–10 (2018).

Shen, H.-W. & Barabási, A.-L. Collective credit allocation in science. Proceedings of the National Academy of Sciences 111 , 12325–12330 (2014).

Leahey, E. From Sole Investigator to Team Scientist: Trends in the Practice and Study of Research Collaboration. Annual Review of Sociology 42, 81–100 (2016).

Clauset, A., Arbesman, S. & Larremore, D. B. Systematic inequality and hierarchy in faculty hiring networks. Science advances 1 , e1400005 (2015).

Jones, B. F., Wuchty, S. & Uzzi, B. Multi-university research teams: shifting impact, geography, and stratification in science. science 322 , 1259–1262 (2008).

Deville, P. et al . Career on the move: Geography, stratification, and scientific impact. Scientific reports 4 (2014).

Chu, J. S. & Evans, J. A. Slowed canonical progress in large fields of science. Proceedings of the National Academy of Sciences 118 (2021).

Azoulay, P., Fons-Rosen, C. & Graff Zivin, J. S. Does science advance one funeral at a time? American Economic Review 109 , 2889–2920 (2019).

Jin, C., Ma, Y. & Uzzi, B. Scientific prizes and the extraordinary growth of scientific topics. Nature communications 12 , 1–11 (2021).

Nagaraj, A., Shears, E. & de Vaan, M. Improving data access democratizes and diversifies science. Proceedings of the National Academy of Sciences 117 , 23490–23498 (2020).

Evans, J. A. & Reimer, J. Open access and global participation in science. Science 323 , 1025–1025 (2009).

Peng, H., Ke, Q., Budak, C., Romero, D. M. & Ahn, Y.-Y. Neural embeddings of scholarly periodicals reveal complex disciplinary organizations. Science Advances 7 , eabb9004 (2021).

Ahmadpoor, M. & Jones, B. F. The dual frontier: Patented inventions and prior scientific advance. Science 357 , 583–587 (2017).

Yin, Y., Gao, J., Jones, B. F. & Wang, D. Coevolution of policy and science during the pandemic. Science 371 , 128–130 (2021).

Ding, W. W., Murray, F. & Stuart, T. E. Gender differences in patenting in the academic life sciences. science 313 , 665–667 (2006).

CAS   PubMed   Google Scholar  

Bromham, L., Dinnage, R. & Hua, X. Interdisciplinary research has consistently lower funding success. Nature 534 , 684 (2016).

Larivière, V., Vignola-Gagné, E., Villeneuve, C., Gélinas, P. & Gingras, Y. Sex differences in research funding, productivity and impact: an analysis of Québec university professors. Scientometrics 87 , 483–498 (2011).

Li, D., Azoulay, P. & Sampat, B. N. The applied value of public investments in biomedical research. Science 356 , 78–81 (2017).

Fleming, L., Greene, H., Li, G., Marx, M. & Yao, D. Government-funded research increasingly fuels innovation. Science 364 , 1139–1141, https://doi.org/10.1126/science.aaw2373 (2019).

Lazer, D. M. et al . The science of fake news. Science 359 , 1094–1096 (2018).

Scheufele, D. A. & Krause, N. M. Science audiences, misinformation, and fake news. Proceedings of the National Academy of Sciences 116 , 7662–7669 (2019).

Kreps, S. E. & Kriner, D. L. Model uncertainty, political contestation, and public trust in science: Evidence from the COVID-19 pandemic. Science advances 6 , eabd4563 (2020).

Myers, K. R. et al . Unequal effects of the COVID-19 pandemic on scientists. Nature Human Behaviour https://doi.org/10.1038/s41562-020-0921-y (2020).

Wang, D. S., Song, C. M. & Barabasi, A. L. Quantifying Long-Term Scientific Impact. Science 342 , 127–132 (2013).

Uzzi, B., Mukherjee, S., Stringer, M. & Jones, B. Atypical combinations and scientific impact. Science 342 , 468–472 (2013).

Radicchi, F., Fortunato, S. & Castellano, C. Universality of citation distributions: Toward an objective measure of scientific impact. Proceedings of the National Academy of Sciences 105 , 17268–17272 (2008).

de Solla Price, D. J. Networks of Scientific Papers. Science 149 , 510–515 (1965).

Article   ADS   Google Scholar  

Price, D. d. S. A general theory of bibliometric and other cumulative advantage processes. Journal of the American society for Information science 27 , 292–306 (1976).

Funk, R. J. & Owen-Smith, J. A Dynamic Network Measure of Technological Change. Management Science 63 , 791–817 (2017).

Thelwall, M., Haustein, S., Larivière, V. & Sugimoto, C. R. Do altmetrics work? Twitter and ten other social web services. PloS one 8 (2013).

Wang, R. et al . in Proceedings of the 27th ACM International Conference on Information and Knowledge Management 1487–1490 (Association for Computing Machinery, Torino, Italy, 2018).

Tan, Z. et al . in Proceedings of the 25th International Conference Companion on World Wide Web 437–442 (International World Wide Web Conferences Steering Committee, Montréal, Québec, Canada, 2016).

Yin, Y., Dong, Y., Wang, K., Wang, D. & Jones, B. F. Public use and public funding of science. Nature Human Behaviour https://doi.org/10.1038/s41562-022-01397-5 (2022).

Wu, J. et al . CiteSeerX: AI in a Digital Library Search Engine. AI Magazine 36 , 35–48, https://doi.org/10.1609/aimag.v36i3.2601 (2015).

Wan, H., Zhang, Y., Zhang, J. & Tang, J. AMiner: Search and Mining of Academic Social Networks. Data Intelligence 1 , 58–76, https://doi.org/10.1162/dint_a_00006 (2019).

Zhang, Y., Zhang, F., Yao, P. & Tang, J. in Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining . 1002–1011.

Hendricks, G., Tkaczyk, D., Lin, J. & Feeney, P. Crossref: The sustainable source of community-owned scholarly metadata. Quantitative Science Studies 1 , 414–427 (2020).

Priem, J., Piwowar, H. & Orr, R. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833 (2022).

Sinha, A. et al . in Proceedings of the 24th International Conference on World Wide Web 243–246 (Association for Computing Machinery, Florence, Italy, 2015).

Wang, K. et al . A Review of Microsoft Academic Services for Science of Science Studies. Frontiers in Big Data 2 , 45 (2019).

Wang, K. et al . Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies 1 , 396–413 (2020).

Pinski, G. & Narin, F. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information processing & management 12 , 297–312 (1976).

Carpenter, M. P., Cooper, M. & Narin, F. Linkage between basic research literature and patents. Research Management 23 , 30–35 (1980).

Narin, F., Hamilton, K. S. & Olivastro, D. The increasing linkage between US technology and public science. Research policy 26 , 317–330 (1997).

Marx, M. & Fuegi, A. Reliance on science: Worldwide front‐page patent citations to scientific articles. Strategic Management Journal 41 , 1572–1594 (2020).

Marx, M. & Fuegi, A. Reliance on science by inventors: Hybrid extraction of in‐text patent‐to‐article citations. Journal of Economics & Management Strategy (2020).

de Solla Price, D. Little science, big science . (Columbia University Press, 1963).

Sinatra, R., Deville, P., Szell, M., Wang, D. & Barabási, A.-L. A century of physics. Nature Physics 11 , 791–796 (2015).

de Solla Price, D. Science since babylon . (Yale University Press, 1961).

Lin, Z., Yin, Y., Liu, L. & Wang, D. SciSciNet: A large-scale open data lake for the science of science research, Figshare , https://doi.org/10.6084/m9.figshare.c.6076908.v1 (2022).

Microsoft Academic. Microsoft Academic Graph. Zenodo , https://doi.org/10.5281/zenodo.6511057 (2022).

Smalheiser, N. R. & Torvik, V. I. Author name disambiguation. Annual review of information science and technology 43 , 1–43 (2009).

Tang, J., Fong, A. C., Wang, B. & Zhang, J. A unified probabilistic framework for name disambiguation in digital library. IEEE Transactions on Knowledge and Data Engineering 24 , 975–987 (2011).

Ferreira, A. A., Gonçalves, M. A. & Laender, A. H. A brief survey of automatic methods for author name disambiguation. Acm Sigmod Record 41 , 15–26 (2012).

Sanyal, D. K., Bhowmick, P. K. & Das, P. P. A review of author name disambiguation techniques for the PubMed bibliographic database. Journal of Information Science 47 , 227–254 (2021).

Morrison, G., Riccaboni, M. & Pammolli, F. Disambiguation of patent inventors and assignees using high-resolution geolocation data. Scientific data 4 , 1–21 (2017).

Tekles, A. & Bornmann, L. Author name disambiguation of bibliometric data: A comparison of several unsupervised approaches1. Quantitative Science Studies 1 , 1510–1528, https://doi.org/10.1162/qss_a_00081 (2020).

Van Buskirk, I., Clauset, A. & Larremore, D. B. An Open-Source Cultural Consensus Approach to Name-Based Gender Classification. arXiv preprint arXiv:2208.01714 (2022).

Cleary, E. G., Beierlein, J. M., Khanuja, N. S., McNamee, L. M. & Ledley, F. D. Contribution of NIH funding to new drug approvals 2010–2016. Proceedings of the National Academy of Sciences 115 , 2329–2334 (2018).

Packalen, M. & Bhattacharya, J. NIH funding and the pursuit of edge science. Proceedings of the National Academy of Sciences 117 , 12011–12016, https://doi.org/10.1073/pnas.1910160117 (2020).

Wang, Y., Jones, B. F. & Wang, D. Early-career setback and future career impact. Nature communications 10 , 1–10 (2019).

Hechtman, L. A. et al . NIH funding longevity by gender. Proceedings of the National Academy of Sciences 115 , 7943–7948 (2018).

Agrawal, A. & Henderson, R. Putting patents in context: Exploring knowledge transfer from MIT. Management science 48 , 44–60 (2002).

Bekkers, R. & Freitas, I. M. B. Analysing knowledge transfer channels between universities and industry: To what degree do sectors also matter? Research policy 37 , 1837–1853 (2008).

Owen-Smith, J. & Powell, W. W. To patent or not: Faculty decisions and institutional success at technology transfer. The Journal of Technology Transfer 26 , 99–114 (2001).

Mowery, D. C. & Shane, S. Introduction to the special issue on university entrepreneurship and technology transfer. Management Science 48 , v–ix (2002).

Williams, R. S., Lotia, S., Holloway, A. K. & Pico, A. R. From Scientific Discovery to Cures: Bright Stars within a Galaxy. Cell 163 , 21–23, https://doi.org/10.1016/j.cell.2015.09.007 (2015).

Article   CAS   PubMed   Google Scholar  

Hmielowski, J. D., Feldman, L., Myers, T. A., Leiserowitz, A. & Maibach, E. An attack on science? Media use, trust in scientists, and perceptions of global warming. Public Understanding of Science 23 , 866–883 (2014).

Li, J., Yin, Y., Fortunato, S. & Wang, D. A dataset of publication records for Nobel laureates. Scientific data 6 , 33 (2019).

Shen, H., Wang, D., Song, C. & Barabási, A.-L. in Proceedings of the AAAI Conference on Artificial Intelligence .

Ke, Q., Ferrara, E., Radicchi, F. & Flammini, A. Defining and identifying Sleeping Beauties in science. Proceedings of the National Academy of Sciences , 201424329 (2015).

Hirsch, J. E. An index to quantify an individual’s scientific research output. Proceedings of the National academy of Sciences of the United States of America 102 , 16569–16572 (2005).

Article   ADS   CAS   PubMed   PubMed Central   MATH   Google Scholar  

Waltman, L., Boyack, K. W., Colavizza, G. & van Eck, N. J. A principled methodology for comparing relatedness measures for clustering publications. Quantitative Science Studies 1 , 691–713, https://doi.org/10.1162/qss_a_00035 (2020).

Santamaría, L. & Mihaljević, H. Comparison and benchmark of name-to-gender inference services. PeerJ Computer Science 4 , e156 (2018).

Bornmann, L. & Williams, R. An evaluation of percentile measures of citation impact, and a proposal for making them better. Scientometrics 124 , 1457–1478, https://doi.org/10.1007/s11192-020-03512-7 (2020).

Haunschild, R., Daniels, A. D. & Bornmann, L. Scores of a specific field-normalized indicator calculated with different approaches of field-categorization: Are the scores different or similar? Journal of Informetrics 16 , 101241, https://doi.org/10.1016/j.joi.2021.101241 (2022).

Yin, Y. & Wang, D. The time dimension of science: Connecting the past to the future. Journal of Informetrics 11 , 608–621 (2017).

Stringer, M. J., Sales-Pardo, M. & Amaral, L. A. N. Statistical validation of a global model for the distribution of the ultimate number of citations accrued by papers published in a scientific journal. Journal of the American Society for Information Science and Technology 61 , 1377–1385 (2010).

Bornmann, L. & Daniel, H.-D. What do we know about the h index? Journal of the American Society for Information Science and Technology 58 , 1381–1385, https://doi.org/10.1002/asi.20609 (2007).

Li, J., Yin, Y., Fortunato, S. & Wang, D. Nobel laureates are almost the same as us. Nature Reviews Physics 1 , 301 (2019).

Abramo, G., D’Angelo, C. & Caprasecca, A. Gender differences in research productivity: A bibliometric analysis of the Italian academic system. Scientometrics 79 , 517–539 (2009).

Huang, J., Gates, A. J., Sinatra, R. & Barabási, A.-L. Historical comparison of gender inequality in scientific careers across countries and disciplines. Proceedings of the National Academy of Sciences 117 , 4609–4616 (2020).

Dworkin, J. D. et al . The extent and drivers of gender imbalance in neuroscience reference lists. Nature neuroscience 23 , 918–926 (2020).

Squazzoni, F. et al . Peer review and gender bias: A study on 145 scholarly journals. Science advances 7 , eabd0299 (2021).

Yang, Y., Tian, T. Y., Woodruff, T. K., Jones, B. F. & Uzzi, B. Gender-diverse teams produce more novel and higher-impact scientific ideas. Proceedings of the National Academy of Sciences 119 , e2200841119 (2022).

Squazzoni, F. et al . Only second-class tickets for women in the COVID-19 race. A study on manuscript submissions and reviews in 2329 Elsevier journals. A study on manuscript submissions and reviews in 2329 (2020).

Vincent-Lamarre, P., Sugimoto, C. R. & Larivière, V. The decline of women’s research production during the coronavirus pandemic. Nature index 19 (2020).

Staniscuaski, F. et al . Gender, race and parenthood impact academic productivity during the COVID-19 pandemic: from survey to action. Frontiers in psychology 12 , 663252 (2021).

Fink, J. K. Hereditary spastic paraplegia. Neurologic Clinics 20 , 711–726, https://doi.org/10.1016/S0733-8619(02)00007-5 (2002).

Herzog, C., Hook, D. & Konkiel, S. Dimensions: Bringing down barriers between scientometricians and data. Quantitative Science Studies 1 , 387–395 (2020).

Lawrence, I. & Lin, K. A concordance correlation coefficient to evaluate reproducibility. Biometrics , 255–268 (1989).

Clauset, A., Shalizi, C. R. & Newman, M. E. Power-law distributions in empirical data. SIAM review 51 , 661–703 (2009).

Bornmann, L. & Wohlrabe, K. Normalisation of citation impact in economics. Scientometrics 120 , 841–884, https://doi.org/10.1007/s11192-019-03140-w (2019).

van Eck, N. J. & Waltman, L. Citation-based clustering of publications using CitNetExplorer and VOSviewer. Scientometrics 111 , 1053–1070, https://doi.org/10.1007/s11192-017-2300-7 (2017).

Xu, J. et al . Building a PubMed knowledge graph. Scientific Data 7 , 205, https://doi.org/10.1038/s41597-020-0543-2 (2020).

Torvik, V. I. & Smalheiser, N. R. Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD) 3 , 1–29 (2009).

Reproducible Science of Science at scale: pySciSci Abstract Quantitative Science Studies 1-17, https://doi.org/10.1162/qss_a_00260 .

Lazer, D. M. et al . Computational social science: Obstacles and opportunities. Science 369 , 1060–1062 (2020).

Lazer, D. et al . Computational social science. Science 323 , 721–723 (2009).

Article   CAS   PubMed   PubMed Central   Google Scholar  

Barabási, A.-L. Network science . (Cambridge University, 2015).

Newman, M. Networks: an introduction . (Oxford University Press, 2010).

Castellano, C., Fortunato, S. & Loreto, V. Statistical physics of social dynamics. Reviews of modern physics 81 , 591 (2009).

Dong, Y., Ma, H., Shen, Z. & Wang, K. in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . 1437–1446 (ACM).

Download references

Acknowledgements

The authors thank Alanna Lazarowich, Krisztina Eleki, Jiazhen Liu, Huawei Shen, Benjamin F. Jones, Brian Uzzi, Alex Gates, Daniel Larremore, YY Ahn, Lutz Bornmann, Ludo Waltman, Vincent Traag, Caroline Wagner, and all members of the Center for Science of Science and Innovation (CSSI) at Northwestern University for their help. This work is supported by the Air Force Office of Scientific Research under award number FA955017-1-0089 and FA9550-19-1-0354, National Science Foundation grant SBE 1829344, the Alfred P. Sloan Foundation G-2019-12485, and Peter G. Peterson Foundation 21048. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Authors and affiliations.

Center for Science of Science and Innovation, Northwestern University, Evanston, IL, USA

Zihang Lin, Yian Yin, Lu Liu & Dashun Wang

Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL, USA

Kellogg School of Management, Northwestern University, Evanston, IL, USA

School of Computer Science, Fudan University, Shanghai, China

McCormick School of Engineering, Northwestern University, Evanston, IL, USA

Yian Yin & Dashun Wang

You can also search for this author in PubMed   Google Scholar

Contributions

D.W. and Y.Y. conceived the project and designed the experiments; Z.L. and Y.Y. collected the data; Z.L. performed data pre-processing, statistical analyses, and validation with help from Y.Y., L.L. and D.W.; Z.L., Y.Y. and D.W. wrote the manuscript; all authors edited the manuscript.

Corresponding author

Correspondence to Dashun Wang .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information, rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Lin, Z., Yin, Y., Liu, L. et al. SciSciNet: A large-scale open data lake for the science of science research. Sci Data 10 , 315 (2023). https://doi.org/10.1038/s41597-023-02198-9

Download citation

Received : 13 July 2022

Accepted : 02 May 2023

Published : 01 June 2023

DOI : https://doi.org/10.1038/s41597-023-02198-9

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

This article is cited by

Publication, funding, and experimental data in support of human reference atlas construction and usage.

  • Yongxin Kong
  • Katy Börner

Scientific Data (2024)

Women’s strength in science: exploring the influence of female participation on research impact and innovation

  • Wenxuan Shi

Scientometrics (2024)

Gender assignment in doctoral theses: revisiting Teseo with a method based on cultural consensus theory

  • Nataly Matias-Rayme
  • Iuliana Botezan
  • Rodrigo Sánchez-Jiménez

Unveiling the dynamics of team age structure and its impact on scientific innovation

  • Alex J. Yang

Forecasting the future of artificial intelligence with machine learning-based link prediction in an exponentially growing knowledge network

  • Mario Krenn
  • Lorenzo Buffoni
  • Michael Kopp

Nature Machine Intelligence (2023)

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

data lake research paper

Apple skips Nvidia's GPUs for its AI models, uses thousands of Google TPUs instead

Recently released research paper reveals the details.

Google TPUv4 in the data center

Apple has revealed that it didn’t use Nvidia’s hardware accelerators to develop its recently revealed Apple Intelligence features. According to an official Apple research paper (PDF), it instead relied on Google TPUs to crunch the training data behind the Apple Intelligence Foundation Language Models.

Systems packing Google TPUv4 and TPUv5 chips were instrumental to the creation of the Apple Foundation Models (AFMs). These models, AFM-server and AFM-on-device, were designed to power the online and offline Apple Intelligence features heralded at WWDC 2024 in June.

Apple Intelligence Foundation Language Models

AFM-server is Apple’s biggest LLM, and thus it remains online only. According to the recently released research paper, Apple’s AFM-server was trained on 8,192 TPUv4 chips “provisioned as 8 × 1,024 chip slices, where slices are connected together by the data-center network (DCN).” Pre-training was a triple-stage process, starting with 6.3T tokens, continuing with 1T tokens, and then context-lengthening using 100B tokens.
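For a sense of scale, here is a tiny Python sketch tallying the staged token budget just described; the stage labels are ours, not Apple's.

```python
# Staged pre-training token budget for AFM-server, per Apple's paper.
# Stage labels are illustrative; the paper describes a triple-stage process.
stages = {
    "stage 1 (core pre-training)":   6.3e12,  # 6.3T tokens
    "stage 2 (continued training)":  1.0e12,  # 1T tokens
    "stage 3 (context lengthening)": 100e9,   # 100B tokens
}
total = sum(stages.values())
print(f"total training tokens: {total / 1e12:.1f}T")  # -> 7.4T
```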

Apple said the data used to train its AFMs included info gathered from the Applebot web crawler (heeding robots.txt) plus various licensed “high-quality” datasets. It also leveraged carefully chosen code, math, and public datasets.
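On the robots.txt point, here is a minimal sketch of how any crawler can honor a site's robots.txt using Python's standard library; the user agent and URLs are placeholders, not Applebot's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Placeholder crawler identity and target site; not Applebot's real settings.
USER_AGENT = "ExampleBot"

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

url = "https://example.com/some/page.html"
if robots.can_fetch(USER_AGENT, url):
    print(f"{USER_AGENT} may crawl {url}")
else:
    print(f"robots.txt disallows {USER_AGENT} from crawling {url}")
```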

Of course, the AFM-on-device model is significantly pruned, but Apple reckons its knowledge distillation techniques have optimized this smaller model’s performance and efficiency. The paper reveals that AFM-on-device is a 3B parameter model, distilled from a 6.4B server model, which was trained on the full 6.3T tokens.
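Apple's exact distillation recipe isn't detailed here; as a generic illustration of knowledge distillation, the sketch below implements the standard soft-target objective in PyTorch, with arbitrary example values for the temperature and mixing weight.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard knowledge-distillation objective: blend the usual
    cross-entropy on hard labels with a KL term pushing the student's
    softened distribution toward the teacher's."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable to the hard-label term
    return alpha * hard + (1.0 - alpha) * soft

# Toy usage: batch of 4 examples over a 10-token vocabulary.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```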

Unlike AFM-server training, Google TPUv5 clusters were harnessed to prepare the AFM-on-device model. The paper reveals that “AFM-on-device was trained on one slice of 2,048 TPUv5p chips.”

It is interesting to see Apple release such a detailed paper revealing the techniques and technologies behind Apple Intelligence. The company isn’t renowned for transparency, but it seems to be trying hard to impress in AI, perhaps because it arrived late to the game.


According to Apple’s in-house testing, AFM-server and AFM-on-device excel in benchmarks such as Instruction Following, Tool Use, and Writing; the paper includes a Writing Benchmark chart as one example.

If you are interested in some deeper details regarding the training and optimizations used by Apple, as well as further benchmark comparisons, check out the PDF linked in the intro.



Paper on creating friendly color maps for color vision deficiency accepted for publication.


Timothy Lang (ST11) is a co-author on an article titled “Effective Visualization of Radar Data for Users Impacted by Color Vision Deficiency”, which was recently accepted for publication in Bulletin of the American Meteorological Society. The article is led by Zachary Sherman of Argonne National Laboratory (ANL), and it is an outgrowth of a long-standing collaboration on open science between ANL, MSFC, and other institutions that predates NASA Science Policy Directive (SPD) 41a and the Transform to Open Science (TOPS) campaign. Color Vision Deficiency (CVD) affects up to 8% of genetic males and 0.5% of genetic females, and traditional color maps used in radar meteorology and other Earth sciences often lack perceptual accuracy and clarity when viewed by those affected by CVD. The article reviews new color maps that convey useful and clear scientific information whether viewed by those with normal color perception or those with CVD. These color maps are available in open-source repositories like cmweather ( https://github.com/openradar/cmweather ) and pyart ( http://arm-doe.github.io/pyart/ ). The article and the open-source CVD-friendly color maps are excellent examples of the greater inclusivity fostered when open science is practiced. Read the paper at: https://journals.ametsoc.org/view/journals/bams/aop/BAMS-D-23-0056.1/BAMS-D-23-0056.1.xml .
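As a generic illustration (not a figure from the paper), the sketch below renders a toy radar-like field with matplotlib's perceptually uniform viridis colormap, one widely used CVD-friendly choice; the cmweather repository linked above supplies radar-specific alternatives.

```python
import numpy as np
import matplotlib.pyplot as plt

# import cmweather  # assumption: importing registers its CVD-friendly radar
#                   # colormaps with matplotlib (see the repo linked above)

# Toy reflectivity-like field standing in for real radar data.
x, y = np.meshgrid(np.linspace(-1, 1, 200), np.linspace(-1, 1, 200))
field = 60 * np.exp(-(x**2 + y**2) / 0.2)

# viridis is perceptually uniform and stays readable under common forms of
# color vision deficiency, unlike classic rainbow radar palettes.
plt.imshow(field, cmap="viridis", origin="lower", extent=(-1, 1, -1, 1))
plt.colorbar(label="toy reflectivity (dBZ)")
plt.title("CVD-friendly colormap example")
plt.show()
```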


Consumer Credit Reporting Data

Since the 2000s, economists across fields have increasingly used consumer credit reporting data for research. We introduce readers to the economics of and the institutional details of these data. Using examples from the literature, we provide practical guidance on how to use these data to construct economic measures of borrowing, consumption, credit access, financial distress, and geographic mobility. We explain what credit scores measure, and why. We highlight how researchers can access credit reporting data via existing datasets or by creating new datasets, including by linking credit reporting data with surveys and external datasets.
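As a generic illustration of the kind of measure construction the authors describe, the sketch below derives a total-borrowing measure and a credit-card-utilization measure from toy tradeline records; the schema and column names are hypothetical, not those of any actual bureau dataset.

```python
import pandas as pd

# Hypothetical tradeline-level records; real credit bureau schemas differ
# and these column names are illustrative only.
tradelines = pd.DataFrame({
    "consumer_id":  [1, 1, 2, 2, 2],
    "product":      ["credit_card", "auto_loan", "credit_card", "mortgage", "credit_card"],
    "balance":      [1200.0, 8500.0, 300.0, 150000.0, 50.0],
    "credit_limit": [5000.0, None, 2000.0, None, 1000.0],
})

# One common borrowing measure: total outstanding balance per consumer.
total_debt = tradelines.groupby("consumer_id")["balance"].sum()

# A simple utilization measure: revolving balance over revolving limit.
cards = tradelines[tradelines["product"] == "credit_card"]
utilization = (cards.groupby("consumer_id")["balance"].sum()
               / cards.groupby("consumer_id")["credit_limit"].sum())

print(total_debt)
print(utilization)
```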


Data Lake: A New Ideology in Big Data Era

  • December 2017
  • Conference: 2017 4th International Conference on Wireless Communication and Sensor Network (WCSN 2017), Wuhan, China

Pwint Phyu Khine, University of Yangon

Abstract and Figures

Figure: Comparison of Data Warehouse and Data Lake


Ecosystem services provision through nature-based solutions: A sustainable development pathway to environmental stewardship as evidenced in the Protecting Lake Hawassa Partnership in Ethiopia

  • Research Article
  • Published: 06 August 2024


Mulugeta Dadi Belete, Nathalie Richards & Alisa Gehrels
The imperative to foster environmental stewardship amidst escalating challenges has driven the adoption of nature-based solutions (NBS) for landscape restoration. This paper explores the implementation and impacts of ecohydrological NBS interventions within the context of sustainable development, focusing on restoring locally valued ecosystem services and catalyzing environmental stewardship to drive environmental, economic, and social sustainability. It posits that if development interventions effectively deliver ecosystem services valuable to stakeholders, impacted communities are likely to support local environmental protection. Ethiopia’s Protecting Lake Hawassa Partnership (PLHP) employed a participatory approach guided by the Natural Resources Risk and Action Framework (NRAF) to engage a diverse network of local stakeholders in restoring and protecting the watershed. Implemented ecohydrological NBS enhanced ecosystem functionality in targeted hillslopes, gullies, and degraded farmlands. Multiple ecosystem services addressing soil erosion, water scarcity, and agricultural productivity were delivered, including productivity enhancement, flood regulation, land preservation, co-benefits from plantation, and moisture conservation. Landscape Functionality Analysis (LFA) revealed significant improvements in ecosystem stability, infiltration, and nutrient cycling. Qualitative assessments of the communities’ perception of ecosystem services emphasized the importance of aligning development project outcomes with local needs. Results underscored the robust nexus between NBS, ecosystem services, and environmental stewardship, highlighting the role of perceived benefits in fostering community engagement. The study advocates that environmental management practices, including NBS, which tangibly improve ecosystem services prioritized by local communities, drive stewardship and, therefore, the long-term sustainability of improved environmental protection. Further research is warranted to explore the scalability and cost-effectiveness of NBS interventions in diverse socioeconomic contexts, and to enhance understanding of trade-offs and synergies between economic development, ecological conservation, and social equity in development projects.




Indonesia, EU reconcile forest data ahead of new rules on deforestation-free trade


  • The Indonesian government and its European Union counterparts are ironing out differences in their forest and commodity supply chain data ahead of a looming deadline that could shut Indonesian commodities out of the EU market.
  • Under the EU’s Deforestation Regulation, commodities associated with deforestation will be barred as of next year from entering the EU market; Indonesia is a major producer of four of the seven listed commodities: palm oil, coffee, cocoa and rubber.
  • To be allowed to export these commodities to the EU, producers and traders must be able to show that they weren’t sourced from land that was deforested to grow them, but the forest maps used by Indonesia and the EU have several differences that need to be reconciled.
  • The EU ambassador to Indonesia says his side is working with local authorities to resolve the matter, which he attributes to the differing definition of “forest” as used by the European and Indonesian authorities.

JAKARTA — The Indonesian government is working on improving and synchronizing its forest and supply chain data to comply with increasingly strict sustainability standards and requirements in the markets it exports to, including the European Union.

Earlier this year, the Indonesian government discovered discrepancies between the forest map and data that it uses, and those used by the EU as a reference for the implementation of the European Union Deforestation Regulation (EUDR).

The EUDR bans imports of seven forest-related commodities — soy, palm oil, coffee, cocoa, timber, rubber and beef — associated with deforestation and illegality. It requires producers and companies trading these commodities into the EU to  provide detailed evidence  proving they weren’t produced from land deforested since 2020. The new regulations give producers and companies until Dec. 30, 2024, to fully comply.

To implement the EUDR, the EU is using forest data published on its Forest Observatory platform, which monitors changes in the world’s forest cover and their drivers.

Meanwhile, the Indonesian government has its own forest monitoring system called SIMONTANA. It also has its own definition and classification of forest and deforestation.

When comparing the EU Forest Observatory maps and the maps in SIMONTANA, the government found discrepancies, with the EU overestimating Indonesia’s forest cover. The government found that some shrublands and farmland, like oil palm plantations and coffee estates, had been categorized as forest cover by the EU.
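To make this kind of discrepancy concrete, the sketch below compares two toy land-cover grids, one standing in for the EU Forest Observatory map and one for SIMONTANA, and counts the cells that the first classifies as forest but the second does not; the class codes and values are invented for illustration.

```python
import numpy as np

# Toy land-cover grids standing in for the EU Forest Observatory map and
# Indonesia's SIMONTANA map; class codes here are invented for illustration.
FOREST, SHRUB, PLANTATION = 0, 1, 2
eu_map  = np.array([[FOREST, FOREST,     SHRUB],
                    [FOREST, FOREST,     FOREST],
                    [FOREST, PLANTATION, FOREST]])
sim_map = np.array([[FOREST, SHRUB,      SHRUB],
                    [FOREST, PLANTATION, FOREST],
                    [FOREST, PLANTATION, SHRUB]])

# Cells the EU map calls forest but SIMONTANA does not -- the kind of
# overestimate described above (e.g., plantations labeled as forest cover).
overestimate = (eu_map == FOREST) & (sim_map != FOREST)
print(f"disagreement on {overestimate.sum()} of {eu_map.size} cells "
      f"({100 * overestimate.mean():.0f}%)")
```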

These discrepancies could make it difficult for Indonesian producers to comply with the EUDR and to export their products to the European market, said WWF Indonesia CEO Aditya Bayunanda.

“Getting the right map together could help us to comply [with the EUDR]. Otherwise there could be debate [on whose data are correct] during every shipment,” he told Mongabay in Jakarta in June.

These data discrepancies could also result in the EU wrongly categorizing Indonesia as a high-risk country, Indonesian Environment and Forestry Minister Siti Nurbaya Bakar said as quoted by state-owned media.

The EUDR adopts a classification system that will categorize exporting countries based on their deforestation risk. Low-risk countries will have a simpler due diligence procedure, while higher-risk countries will have to go through more rigorous checks. The checks will make use of geolocation coordinates, satellite monitoring tools and DNA analysis that can trace the origin of products entering the EU.

There are concerns that a high-risk label for Indonesia — the world’s biggest palm oil producer and also a major exporter of timber, coffee, cocoa and rubber — will make it more difficult for producers in the country to export their goods to the EU.

To iron out these differences, the Indonesian government is working with EU authorities, said Denis Chaibi, the EU ambassador to Indonesia.

“Yes, the government approached the EU, indicating that the maps that were prepared by our joint research center contains a mistake according to the Indonesian authorities,” he told reporters in Jakarta in June. “So yesterday we had a meeting to compare notes and maps, and I think there’d be a follow-up so that we can continue our work to make sure that our data are close to each other.”

Chaibi said the data discrepancies stem from the differences in forest definition adopted by the EU and the Indonesian government.

The EU uses the Food and Agriculture Organization’s definition of forest, which Chaibi said is the one that most people use. Meanwhile, the Indonesian government has its own definition.

“So we have to narrow down the differences between our understanding of what constitutes a forest,” Chaibi said.

The looming deadline to comply by the end of the year means both sides have less than six months to iron out their differences, he added.

“[W]e are working hard to make sure we can move the data January 1, which is the start of the implementation of the new regulation,” Chaibi said.

Deforestation in East Kalimantan for oil palm plantations.

Traceability dashboard

Another effort the Indonesian government is undertaking to make it easier for its producers to prove that their products are deforestation-free is the development of a supply chain traceability system.

The system will be in the form of an online dashboard, set for launch in September, which will collect and synchronize all data and maps related to various commodities, such as palm oil, coffee, cocoa and rubber, at all stages of the supply chain.

The EUDR has received backlash from producer countries like Indonesia, which have accused the bloc of treating their products unfairly in the European market. One of the more contentious aspects of the EUDR is the requirement for producers and traders to provide precise geographical coordinates for all plots of land from which their products are sourced.

The idea is that buyers in the EU can trace commodities back to the farm where they were grown, to make sure they weren’t produced on land cleared of forest.
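At its core, the due-diligence check this implies is a geometric one: does a sourcing plot overlap any area mapped as deforested after the 2020 cutoff? Below is a minimal sketch of such a plot-level screen, assuming GeoJSON-style inputs; the "year" property field and the simplified cutoff handling are hypothetical, not the EU’s actual implementation.

```python
# Hypothetical sketch of an EUDR-style plot screen using shapely.
# Inputs are GeoJSON-like dicts; the "year" property is an assumed field.
from shapely.geometry import shape

def plot_is_compliant(plot_feature: dict, deforestation_features: list[dict]) -> bool:
    """True if the plot overlaps no deforestation polygon dated after 2020."""
    plot = shape(plot_feature["geometry"])
    for feature in deforestation_features:
        if feature["properties"].get("year", 0) > 2020:  # EUDR cutoff year
            if plot.intersects(shape(feature["geometry"])):
                return False
    return True
```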

However, achieving full traceability for commodities like palm oil in Indonesia has proven challenging due to various factors, including bureaucratic hurdles, overlapping land claims, and lack of documentation for palm oil transactions.

For one, Indonesia doesn’t have a mandatory traceability system for palm oil producers. It does have a mandatory sustainable palm oil standard, ISPO, but this doesn’t impose any traceability requirements, though there is a plan to add them in the next iteration of the standard.

The industry, however, has largely taken it upon itself to build traceability systems. But these often trace the commodity only as far back as the processing mill where the palm fruit is pressed, not the plantation where it was cultivated. This gap is largely due to the lack of publicly available data on plantations, which the government withholds on the grounds of data privacy.

The lack of traceability to the plantation level is especially acute for farms managed by independent smallholders, who produce up to 40% of Indonesia’s palm oil but often don’t document their transactions, selling their palm fruit through an informal network of intermediaries and middlemen. This makes their supply particularly challenging to trace, WWF’s Aditya said.

There’s also the issue of overlapping land ownership data, with different government agencies possessing different data on the size of oil palm plantations.

The new dashboard aims to address these shortcomings.

Eloise O’Carroll, program manager for forestry, natural resources and energy at the EU delegation to Indonesia, welcomed the government initiative. She said the dashboard will be helpful not just in the context of the EUDR, but also for other export markets that increasingly demand sustainable products.

“[We’re] also pleased to know that under the dashboard, the government will have data on deforestation using the FAO’s definition, but also using Indonesia’s definition and classification of forests,” O’Carroll said.

Banner image: Deforestation for oil palm plantation in Riau, Indonesia. Image by Rhett A. Butler/Mongabay.



Title: Evaluating SAM2's Role in Camouflaged Object Detection: From SAM to SAM2

Abstract: The Segment Anything Model (SAM), introduced by Meta AI Research as a generic object segmentation model, quickly garnered widespread attention and significantly influenced the academic community. To extend its application to video, Meta further develops Segment Anything Model 2 (SAM2), a unified model capable of both video and image segmentation. SAM2 shows notable improvements over its predecessor in terms of applicable domains, promptable segmentation accuracy, and running speed. However, this report reveals a decline in SAM2's ability to perceive different objects in images without prompts in its auto mode, compared to SAM. Specifically, we employ the challenging task of camouflaged object detection to assess this performance decrease, hoping to inspire further exploration of the SAM model family by researchers. The results of this paper are provided in \url{ this https URL }.
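As a rough illustration of the prompt-free protocol the abstract describes, evaluation typically reduces to generating masks in automatic mode and scoring the best candidate against a camouflaged-object ground truth. The sketch below covers only the scoring step in plain NumPy; how the candidate masks are produced (SAM or SAM2 in auto mode) is left out, and the best-IoU criterion is one common convention, not necessarily the paper's exact metric.

```python
# Score a set of automatically generated masks against one ground-truth
# camouflaged-object mask by taking the best intersection-over-union.
import numpy as np

def best_iou(pred_masks: list[np.ndarray], gt_mask: np.ndarray) -> float:
    """Best IoU over all candidate masks vs. a single ground-truth mask."""
    gt = gt_mask.astype(bool)
    best = 0.0
    for pred in pred_masks:
        m = pred.astype(bool)
        union = np.logical_or(m, gt).sum()
        if union == 0:
            continue  # both masks empty; nothing to score
        iou = np.logical_and(m, gt).sum() / union
        best = max(best, iou)
    return best
```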
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: [cs.CV]

