Data Archive Geo (DAG)

Q&A

Welcome to the Q&A for the Data Archive Geosciences. This page lists all the Q&As that we have available; you can use the search tool to help you find questions. The same questions also appear in the relevant sections of other pages.

[collapse title=”What is considered research data?”]

Research data is data that is directly or indirectly related to a research project or program. DAG is intended for research data that is no longer active.

Feedback on this Question
[/collapse]

[collapse title=”What is considered (in)active data?”]

The archived data is read-only, so the archive is suitable for data that you no longer (need to) change because your project is finished, or because the data has reached a final stage (e.g. raw, cleaned or processed).

If the data is still active, and you expect to make changes to it, then you can use Yoda for Geosciences, the faculty network drive or another approved storage location.

Related Links:
UU Geosciences Yoda
UU ICT Storage Finder Tool

Feedback on this Question
[/collapse]

[collapse title=”How does DAG relate to other archives and repositories?”]

DAG is an internal archive for the faculty of Geosciences. It is intended to safeguard research data that cannot or may not be published as FAIR data in public repositories such as Pangaea or Yoda.

Related Links:
UU ICT Data Storage Finder Tool
UU Geosciences Repository Listing

Feedback on this Question
[/collapse]

[collapse title=”How does DAG relate to Yoda?”]

Although DAG and Yoda run on the same infrastructure and both use iRODS technology, there are some fundamental differences. DAG is an internal data archive that is accessible only to members of the faculty of Geosciences, while Yoda is an institutional repository that is accessible to users from Utrecht University and beyond. DAG focuses on the long-term preservation of data that can be considered static. Yoda also preserves data for the long term in its vault, but in addition data can be stored in the Yoda workspace for sharing and collaboration purposes. Data in the Yoda vault can also be made publicly available, so that a DOI can be assigned and the data can be found online through the registered metadata. Data in DAG can be found as well, but only within DAG itself. It is, however, possible to make data from DAG publicly available by using a publication workflow.

Related Links:
What is Yoda?

Feedback on this Question
[/collapse]

[collapse title=”Is data in DAG FAIR?”]

The data in DAG are only findable and accessible for researchers of the faculty and this makes the data less FAIR than data that is in most public repositories.

However, there is a lot of data in the faculty that remains on external disks or isolated network folders, in many cases because the data is not suitable or allowed to be published, even with restricted access.

DAG provides a solution for these data, to ensure that these data remain available. By providing metadata, access controls and guidance we maximize the findability, accessibility, interoperability and reusability of the data.

Feedback on this Question
[/collapse]

[collapse title=”Who may use DAG?”]

These guidelines were created in cooperation with researchers and support staff from different divisions of the faculty:

  • Data Manager and Stewards: Vincent Brunst, Garrett Speed, Ilja Kocken 
  • Pilot Group sensitive data
  • Pilot Group lab data
  • Pilot Group big data
  • Yoda Team: Maarten Hoogerwerf, Erik Hakvoort, Monic Hodes

We welcome any feedback, suggestions, etc. Users can give feedback on questions and pages using the feedback link on each page. 

Feedback on this Question
[/collapse]

[collapse title=”What is the sensitivity of my data?”]

Data is considered sensitive if it contains information on persons, groups of at-risk persons, endangered species, or paleontological and archaeological sites; if it is subject to legal or contractual obligations; if it is commercial data or data owned by a third party; if it contains state secrets; or if it is under embargo. If any of these conditions applies, the data must be categorized in DAG as sensitive.

Otherwise, if the data has no foreseeable risks associated with it, it can be considered non-sensitive.   
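The rule above can be read as a simple checklist. The following is a minimal sketch for illustration only and is not part of DAG: the condition labels and the function name `classify_sensitivity` are hypothetical and merely mirror the conditions listed in this answer.

```python
# Hypothetical labels for the conditions listed in this answer.
# DAG itself does not define these names.
SENSITIVE_CONDITIONS = frozenset({
    "persons",
    "at_risk_groups",
    "endangered_species",
    "paleontological_or_archaeological_sites",
    "legal_or_contractual_obligations",
    "commercial_data",
    "third_party_ownership",
    "state_secrets",
    "embargo",
})


def classify_sensitivity(conditions):
    """Return 'sensitive' if any listed condition applies, else 'non-sensitive'."""
    applicable = set(conditions)
    unknown = applicable - SENSITIVE_CONDITIONS
    if unknown:
        # Refuse to guess: an unrecognized label should be discussed with a data steward.
        raise ValueError(f"Unknown condition(s): {sorted(unknown)}")
    return "sensitive" if applicable else "non-sensitive"
```

For example, `classify_sensitivity(["embargo"])` returns `"sensitive"`, while an empty list returns `"non-sensitive"`. When in doubt, the assessment should still be made together with your data steward.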

Feedback on this Question
[/collapse]

[collapse title=”Can I archive sensitive data?”]

Yes, you may archive data that is sensitive. However, you should restrict access to your data via the metadata.

If your data contains personal data, then you must indicate that in the metadata, so that we can manage it in compliance with GDPR. 

Feedback on this Question
[/collapse]

[collapse title=”What is personal data?”]

Personal data means data that can be related to an identified or identifiable natural person. This data needs to comply with the regulations of the GDPR / AVG. Other sensitive data is data that has been labelled as sensitive for ethical, commercial, or valorization reasons and should be treated as such. Please consult your data steward or the faculty privacy officer if you are unsure whether your data may be considered personal data.

Related Links:
UU RDM Handling Personal Data

Feedback on this Question
[/collapse]

[collapse title=”How does DAG safeguard my data?”]

The DAG is built upon the Yoda platform, which complies with Utrecht University’s Information Security policy for data classified as public, internal use or sensitive.  

As a depositor you can restrict data access to either the whole faculty, or to yourself (and data managers).

Access restrictions are set in the metadata through the personal data field and the data sensitivity field. Note that metadata can be searched by the whole faculty.  

Feedback on this Question
[/collapse]

[collapse title=”Who can find my data?”]

Once your data is submitted and stored in the archive, it can be found by all members of the faculty after they log in to DAG. The metadata is not shared with other systems, so it cannot be found outside DAG. If you do need your data to be findable outside DAG, you should consider publishing it in a public repository such as Yoda or Pangaea. You can find public repositories with the UU Repository Finder tool, or you can contact your data steward for help.

Related Links:
UU Repository Decision Tool

Contact Information:
UU Geosciences Data Team (Data Stewards, Data Manager, Privacy Officer)

Feedback on this Question
[/collapse]

[collapse title=”Is there a maximum size to the data that I can archive?”]

In principle there is no maximum size to the files or dataset that you need to archive. However, there are some things that you need to consider: 

  • Is it worth archiving this specific (large) amount of data?
  • How are you going to transfer the data efficiently?  
  • Does DAG have enough capacity to store the data? 

You should have no trouble archiving data up to a few GB with the file explorer, or up to hundreds of GB if you use iCommands. If you plan to upload more than 1 TB of data, you need to contact the DAG management so that we can reserve sufficient storage capacity and help you optimize the data package.
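The size guidance above can be summarized as a small decision rule. This is a rough sketch only: the text gives no exact cut-offs, so the 5 GB limit for the file explorer is an assumption, as is treating everything between that limit and 1 TB as suitable for iCommands. The function name `suggest_transfer` is hypothetical.

```python
GB = 10**9
TB = 10**12

# Assumed cut-off: the guidance only says "a few GB" for the file explorer.
FILE_EXPLORER_LIMIT = 5 * GB
# Stated in the guidance: above 1 TB, contact the DAG management first.
CONTACT_THRESHOLD = 1 * TB


def suggest_transfer(size_bytes):
    """Suggest how to transfer a data package of the given size in bytes."""
    if size_bytes < 0:
        raise ValueError("size_bytes must not be negative")
    if size_bytes > CONTACT_THRESHOLD:
        return "contact DAG management before uploading"
    if size_bytes <= FILE_EXPLORER_LIMIT:
        return "file explorer"
    return "iCommands"
```

For instance, `suggest_transfer(300 * GB)` returns `"iCommands"`. The thresholds are easy to adjust once you have discussed your situation with the DAG management.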

Feedback on this Question
[/collapse]

[collapse title=”What are the roles of users in DAG?”]

Within DAG the following roles are distinguished:

  

  • Data Owner – The principal researcher of the project or the research group leader. The owner is recorded in the metadata and is responsible for establishing access controls to the data.
  • Data Depositor – The person who uploads and documents the data in DAG on behalf of the data owner. Ideally this is the person who knows the context and processes for creating and using the data (the data creator). The depositor is recorded in the metadata and will also be used as the primary contact for questions about the data.
  • Data Manager – The person(s) who curate the data, monitor access controls, support the fulfilment of data access requests, and possess deeper technical knowledge about the backend of DAG.
  • Data Consumer – A faculty staff member who is interested in reusing research data from DAG. The data consumer can search the data and access it (or request access). The data consumer should respect any conditions specified by the data owner.

Feedback on this Question
[/collapse]

[collapse title=”How to determine the owner or main responsible person for my data?”]

Primary responsibility for complying with these guidelines lies with the researcher who is responsible for generating the research data. This also applies to PhD candidates and postdoctoral researchers. For research master’s students, their promotor or daily supervisor is responsible.

Feedback on this Question
[/collapse]

[collapse title=”Who should deposit the data?”]

Data creators and data owners have the primary responsibility to deposit data, because they have inside knowledge about the content of the data and how it originated. They should write the data documentation, decide in what structure the data should be recorded and create the metadata.

Data managers, data stewards, data custodians and/or administrative/support staff are not responsible for depositing data, but they are available to help. They do not have a full understanding of the structure and content of the data, so they should focus on supporting the data depositor, so that the depositor can deliver high-quality data, metadata and data documentation to DAG.

Feedback on this Question
[/collapse]

[collapse title=”What will happen when the retention period expires?”]

Good open data and research integrity principles suggest archiving research data for at least 10 years, sometimes longer. Once archived, data should not be deleted until the specified period has passed. Normally only data owners with permanent contracts stay in the faculty that long, and normally only they can determine what to do with the data beyond the archiving period. However, principal investigators and group leaders (data owners) are also responsible during the period that the data must be kept. If any of them leaves the faculty, consider creating a data ownership succession plan that lays out who takes over responsibility for the data at UU.

Feedback on this Question
[/collapse]

[collapse title=”What sensitive data can I archive?”]

Selecting data means making choices about what to keep for the long term, and what data to archive securely. This means that you have to decide whether your dataset contains data that needs to be deleted or separated. Reasons to exclude data from publishing include (but are not limited to):   

  • The data is redundant   
  • Data concern temporary byproducts, which are irrelevant for future use   
  • Data is privacy-sensitive under the GDPR/AVG, such as consent forms, voice recordings, transcripts, DNA data, or any other data that contains information on specific people
  • Data contains state secrets
  • Data is commercially sensitive with regard to competition, or preserving the data for the long term would breach contractual arrangements with your consortium partners or other parties involved

Feedback on this Question
[/collapse]

[collapse title=”Should I archive all data from each processing stage?”]

Where possible, the original (primary/raw) data should be archived, ideally together with the code, processing scripts, or processing instructions needed to consult the data. Next in priority are permanent enriched data files which are derived from the primary data and can be used for analysis as described in the methodology section of the research. After that, results from data analysis that substantiate findings described in research articles, papers or theses should be deposited.

Feedback on this Question
[/collapse]

[collapse title=”When should I start archiving my data?”]

It is recommended to start archiving when a set of data will no longer be adjusted for the research project itself and can be considered static. At the very latest, archive the data by the end of the project or when a lead researcher or data collector leaves the project or institute, whichever comes first.

Feedback on this Question
[/collapse]

[collapse title=”What data format should I use for my data and what options do I have?”]

For maintenance purposes and to ensure long-term accessibility, it is preferable to archive data files in ‘sustainable’ file formats following the FAIR guidelines, where possible. Lists of sustainable file formats can be found in the related links below, or you can contact your data steward for assistance in finding an open and sustainable file format. When it is not feasible to convert your data to a file format that is considered open, please mention details about the data formats and the software used in the data documentation.

Related Links:
DANS File Format List
4TU Preferred File Formats (PDF)
UK Data Service Open Data Formats

Feedback on this Question
[/collapse]

[collapse title=”What data – or folder structure should I apply to my data?”]

If your field of study has a generally accepted folder structure for datasets, you should use that structure. If not, you should group things logically, for example by sample type, data type, date of data collection, or site. Determining a folder structure can be difficult, so you can consult your data steward, who can help you create an appropriate structure that keeps you organized. Also be sure to document your folder structure, so that it is clear which data is in which folder.
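One way to document a folder structure is to generate a plain-text outline and paste it into the data documentation. The sketch below is only an illustration using Python's standard library; the function name `folder_outline` and the output layout are assumptions, not a DAG requirement.

```python
from pathlib import Path


def folder_outline(root, indent="  "):
    """Return an indented text outline of the folders and files under root."""
    root = Path(root).resolve()
    lines = [root.name + "/"]

    def walk(folder, depth):
        # List folders before files, each sorted by name, for a stable outline.
        entries = sorted(folder.iterdir(), key=lambda p: (p.is_file(), p.name.lower()))
        for entry in entries:
            suffix = "/" if entry.is_dir() else ""
            lines.append(indent * depth + entry.name + suffix)
            if entry.is_dir():
                walk(entry, depth + 1)

    walk(root, 1)
    return "\n".join(lines)
```

Calling `print(folder_outline("my_dataset"))` on a hypothetical folder `my_dataset` prints the folder name followed by its contents, indented one level per subfolder. You can then add a short description next to each folder.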

Contact Information:
UU Geosciences Data Team (Data Stewards, Data Manager, Privacy Officer)

Feedback on this Question
[/collapse]

[collapse title=”How do I determine the granularity level of my data package(s)?”]

Try to divide the data into several packages according to data type, processing stage and size. Make sure that these are logical units that can each be described as one set by means of metadata and data documentation.

Feedback on this Question
[/collapse]

[collapse title=”What topics should be documented?”]

The following topics should be included in the data documentation:

  • General content description: a brief description of the content of the data package.
  • Folder structure / relations (also mentioned below): an explanation of the structure of folders and subfolders, so that the recorded data files can be distinguished. When two or more (sub)folders are related in any way, also mention these relations.
  • Folder contents: a description of the included data files and where they are located within the folders. Also make clear how the data can be used and what its purpose is.
  • Abbreviations/acronyms: a list of the abbreviations and acronyms used in files, columns and filenames.
  • Codebook (when not provided in separate files): a description of how the data was obtained and which machine settings were used to obtain it. This should also explain what the categories in the columns of the data files mean and what processing took place to produce the files.
  • Description of workflow: an explanation of which processing steps have taken place, which analyses were performed and which methods were applied to get the data into the form in which it was added to the data package.
  • Ethical review (if applicable): when an ethical assessment has been carried out by an ethics review board or committee, details about the assessment and its results.
  • Software or instrument-specific information needed to understand or interpret the data, including software and hardware version numbers. Include measurement details, standards used and calibration information, if appropriate.

If the dataset includes multiple files that relate to one another, describe the relationship between the files or the file structure that holds them. Also mention any related data that was collected but is not part of the described dataset.

There should be a description of any quality-assurance procedures performed on the data (for instance, definitions of codes or symbols used to flag low-quality, questionable, outlying or missing data that people should be aware of).

Also, documents dealing with data management and privacy aspects around the project can be considered as data documentation and should be included with the data package. 
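To make sure none of the topics above is forgotten, you could start from an empty documentation skeleton. The following sketch is illustrative only: the function name `readme_skeleton`, the shortened heading wording and the plain-text layout are assumptions, not a prescribed DAG template.

```python
# Headings paraphrased from the topics listed in this answer.
DOCUMENTATION_TOPICS = [
    "General content description",
    "Folder structure / relations",
    "Folder contents",
    "Abbreviations and acronyms",
    "Codebook",
    "Description of workflow",
    "Ethical review (if applicable)",
    "Software and instrument information",
    "Quality assurance",
]


def readme_skeleton(package_title):
    """Return a plain-text documentation skeleton with one section per topic."""
    lines = [package_title, "=" * len(package_title), ""]
    for topic in DOCUMENTATION_TOPICS:
        # Each section gets an underlined heading and a placeholder to fill in.
        lines += [topic, "-" * len(topic), "TODO", ""]
    return "\n".join(lines)
```

Writing the result of `readme_skeleton("My data package")` to a text file gives you a starting point that you can fill in section by section.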

Feedback on this Question
[/collapse]

[collapse title=”What files should I add as documentation?”]

The general parameters of the project should be documented in discrete files. For instance, the following information / files are required for a researcher who is not part of the research team to interpret and understand a study:

  • Proposal   
  • The data collection / generation methods to contextualize the space and time of the study   
  • Final Data Collection Tools     
  • Analytical and procedural information (such as fieldnotes, observations, codebook development)    
  • Codebook explaining variable definitions, units of measurement, names and schemas  
  • Permissions or licenses from copyright holders from partner organizations (if any)   
  • Any assumptions made during analysis  

When the data collection methods, objectives, etc. are included in the project proposal or accompanying publications, there is no need to add them to the data documentation.

Feedback on this Question
[/collapse]

[collapse title=”Should sensitive data be treated differently compared to non-sensitive data when depositing both to DAG?”]

In preparing your dataset for archiving, the first step is to determine which parts of your data are sensitive or highly sensitive (see data access), so that they can be separated from the other data. Also, data with a contractual obligation to delete, temporary data and incomplete data should be left out of the data package that will be archived.

Feedback on this Question
[/collapse]

[collapse title=”How can I improve the quality of my data?”]

Throughout the research cycle, the quality and durability of data must be ensured through careful management. However, there is no uniform definition of data quality or its criteria. What is clear is that quality depends not only on the data's own features, but also on the users and processes that the data is used for. Therefore, the scheme of data quality indicators presented below is used to make clear which aspects contribute to the quality level of deposited data. The appendix describes the indicators listed under the main headings, so that it is clear how these aspects apply within DAG to give substance to the data quality of the included datasets.

Feedback on this Question
[/collapse]

[collapse title=”Do I have a DMP?”]

According to the university's policy framework for research data, it is the responsibility of the researcher to draw up a data management plan (DMP). Please first find out in your research team or from your supervisor whether a plan has already been created. If this is not the case, it would be good to draw one up right now. Please visit the links below to learn how you can create a DMP.

Related Links:
UU RDM – Data Management Planning
UU Policy – Framework for Research Data Management (PDF)

Feedback on this Question
[/collapse]

[collapse title=”What restrictions on the inclusion of data in DAG might apply?”]

Not all research data can be included in DAG. This is because the faculty (owner of DAG) takes over the responsibility for making the data available from the data owner. The data owner still has the rights and the responsibility to determine what happens to the data stored in DAG. Data for which the UU (faculty or data owner) does not have the ownership or intellectual property rights cannot be included in DAG. This could also apply to data to which third parties, such as a funder, publisher, data supplier or consortium partner, have (shared) claims or rights. In that case it must be agreed with the relevant party whether the data can be included in DAG. In addition, commercial considerations or valorisation purposes could be reasons not to include data in DAG. Find out whether any of these reasons apply before you deposit data.

Feedback on this Question
[/collapse]

[collapse title=”What are the costs for archiving, and how are these covered?”]

There are costs for archiving your data, the most important being the cost of the data storage.

The storage costs are currently covered by the faculty and will not be charged to the individual researcher or research group.   

The storage usage will be monitored, and when inefficient usage is detected, we will contact the data depositor and data owners to discuss how the usage of DAG can be optimized.

Related Links:
UU Guide – Cost of Data Management

Feedback on this Question
[/collapse]

[collapse title=”What access category should I choose?”]

Open: Data is to be accessible to all faculty members, free to use/download and modify. Data has been de-identified and has an appropriate license.    

  • Open for analysis   
  • Open for reuse   
  • Open for redistribution   
  • Open to adapt   
  • Open for redistribution of adapted version   
  • Open with obligation to cite   
  • Open except for commercial purposes   

Restricted: Data is not directly accessible to all faculty members. Stored data should be accessible to approved researchers only. Access can be given upon request by the data owner/custodian. Metadata is findable by all faculty members.

  • Available upon request   
  • Conditionally available   

Feedback on this Question
[/collapse]

[collapse title=”How can access rights be transferred to a successor?”]

The succession plan for the role of data owner is a hierarchical list of people who assume responsibility for data modifications, evaluating access requests, authorizing access, tracking data usage, and deleting a data package. A minimum of one successor is required. An example of a hierarchical list for succession is:

  • Principal Investigator
  • Group Leader 
  • Third-Party Data Owner 
  • Section Head 
  • Head of Research 
  • Head of Department 
  • Dean 

Feedback on this Question
[/collapse]

[collapse title=”Can I publish data that is archived in DAG?”]

Yes. It is possible to make data from DAG publicly available by using a publication workflow.

Feedback on this Question
[/collapse]

[collapse title=”How will responsibilities be succeeded after I leave the university?”]

The succession of responsibilities for the data should be part of the data management plan (DMP). More information on the DMP can be found here. If there is no DMP, the data depositor should provide a plan of succession to deal with changes in personnel (update this plan when the project finishes).

Normally the principal investigator will be registered in the metadata as data owner. 

Feedback on this Question
[/collapse]
Feedback? Please tell us what you think on this page