
A self-contained and self-explanatory DNA storage system

As shown in Fig. 2, the DNA storage process can in general be divided into two interconnected parts, a digital part and a bio-laboratory part; this research focuses squarely on the digital part. One of the key problems in this part is how to integrate external tools into the encoded DNA sequence while keeping the storage density high. To evaluate the effectiveness of our approach, we select three widely used compression programs (rar, 7z and zip) as the tools and a set of data of different types as the target data to be stored. The symbols used are listed in Table 3 in the "Appendix".

Storage overhead

Data self-containment inevitably brings some storage overhead. To evaluate this impact, we first define the compression efficiency, denoted by \(e_c\), as the metric measuring the worst-case space impact of the selected compression programs; the program with the best performance under this metric is then selected as the main test program for evaluating our proposed methods.

Figure 8

Compression efficiency (data types). This figure shows how the \(e_c\) values of the three selected compression methods change with respect to six types of test data (txt, mp3, jpg, pdf, bmp and other files such as exe, dll, html, etc.) for two cases: the data is either redundant (a) or not (b). The data sizes used in these two panels are 4 MB (with redundancy) and 2 MB (without redundancy), respectively. In particular, (a) shows that the overall compression efficiency of 7z is the best among the three, while the relative performance of rar and zip varies across situations. From (b), one can still see that 7z is the best, but its advantage diminishes when data redundancy is absent.

Formally, the compression efficiency is defined as \(e_c = 1 - e\), where \(e\) is given by \(e = (r_c \times S_o + S_T)/S_o\). Here, \(S_o\) represents the original data size, \(S_T\) is the size of the tool file, and \(r_c\) is the compression ratio, defined on a per-program basis as \(r_c = S_c/S_o\), where \(S_c\) is the size of the file after compression. Given this definition, we can make two key observations.

Observation 1

Not only the compressed data but also the external tool is considered in the computation of \(e_c\). In particular, when no self-containment is involved, i.e., \(S_T = 0\), \(e_c\) reduces to \(e_c = 1 - r_c\), which represents the ideal case for the overall compression effect.

Observation 2

It is not always beneficial to use the self-containment method for DNA storage, as \(e_c\) can be less than or equal to 0 when \(S_T \ge (1 - r_c)S_o\). In other words, \(e_c\) provides guidance on the use of our method in the worst case.
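To make the definition and Observation 2 concrete, the following minimal Python sketch implements the metric; the function names are ours, not from any tool used in the paper, and sizes can be in any consistent unit (bytes are used in the example further below).

```python
def compression_efficiency(s_o: float, s_t: float, r_c: float) -> float:
    """e_c = 1 - (r_c * S_o + S_T) / S_o, with all sizes in the same unit."""
    return 1.0 - (r_c * s_o + s_t) / s_o


def self_containment_beneficial(s_o: float, s_t: float, r_c: float) -> bool:
    """Observation 2: e_c <= 0 exactly when S_T >= (1 - r_c) * S_o."""
    return s_t < (1.0 - r_c) * s_o
```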

The compression efficiencies of the selected programs with respect to different types of data files are compared in Fig. 8, which depicts \(e_c\) against six common types of test data files. In the figure, the storage overhead \(S_T\) is about 201 KB for 7z, 303.5 KB for rar and 254 KB for zip, and the test data with and without redundancy are 4 MB and 2 MB, respectively. The higher the value of \(e_c\), the better the overall compression effect, i.e., the smaller the relative space overhead of the compression program. On the other hand, the figure also confirms that the self-containment method is not always beneficial in the worst case, as shown for the bmp file.
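For example, plugging the tool sizes above into the sketch from Observation 2 (taking \(r_c = 0.34\), the 7z ratio reported later for text data, purely for illustration):

```python
KB, MB = 1024, 1024 ** 2

# Tool sizes from Fig. 8; r_c = 0.34 is an illustrative assumption.
for name, s_t in [("7z", 201 * KB), ("rar", 303.5 * KB), ("zip", 254 * KB)]:
    e_c = compression_efficiency(s_o=4 * MB, s_t=s_t, r_c=0.34)
    print(f"{name}: e_c = {e_c:.3f}")   # 7z ~0.61, rar ~0.59, zip ~0.60
```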

Self-contained data inevitably brings space overhead (i.e., \(S_T > 0\)). Given the results of Fig. 8, we deliberately choose 7z as the compression tool and txt as the test data due to their relatively good performance, and then conduct a test to measure the overhead of the proposed self-contained and self-explanatory method. The results are shown in Fig. 9, which depicts how the values of \(e_c\) in the ideal case (in blue) and in the self-containment case (in red) change with respect to \(S_o\) and \(r_c\).

Figure 9

Compression efficiency (data size). We calculate the compression efficiency of 7z when the size of the original data \(S_o\) gradually increases. Here, \(r_c\) of 7z is 0.34 when a 500 MB text file is compressed. Given that \(S_T\) of 7z is 201 KB, we obtain the results shown in (a), where \(e_c\) is calculated for the data storage method in the ideal case by \(e_c = 1 - r_c = 0.66\) and in the self-containment case by \(e_c = 1 - (r_c \times S_o + 201\text{ KB})/S_o\). To validate the results, we further pack a set of test data files of different types (e.g., txt, jpg, exe, mp3) into a single compressed package and compare their actual values of \(e_c\) with a fixed \(r_c = 0.34\), as shown in (b). The results are roughly consistent with the calculated ones in (a).

It can be seen from Fig. 9 that the self-containment method is not always beneficial for saving storage space; it can even incur more storage than the default approach in which no compression is employed at all. However, once the amount of data exceeds a certain threshold, the self-containment method approaches the ideal case, and the storage overhead it incurs becomes almost negligible.
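This threshold behaviour is easy to reproduce. With the 7z figures from Fig. 9 (\(S_T = 201\) KB, \(r_c = 0.34\)), \(e_c\) is negative for small inputs and converges towards the ideal \(1 - r_c = 0.66\) as \(S_o\) grows; the size grid below is our own choice for illustration.

```python
S_T = 201 * 1024        # 7z tool size in bytes (Fig. 9)
R_C = 0.34              # 7z compression ratio for text (Fig. 9)

for s_o_mb in (0.1, 0.5, 1, 5, 50, 500):
    s_o = s_o_mb * 1024 ** 2
    e_c = 1.0 - (R_C * s_o + S_T) / s_o
    print(f"S_o = {s_o_mb:>5} MB  ->  e_c = {e_c:+.3f}   (ideal: +0.660)")
```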

Data storage

In contrast to the previous section, which focused on the compression efficiencies of the selected compression programs in the worst (intuitive and 1-CS) cases, this section evaluates the proposed 1-1CS, M-1CI and 1-MCI methods when they are used to achieve self-containment and self-explanation. For a fair comparison, we deliberately assume that all three methods are used to store an equal number of \(n\) data files \(D_i, i \in \{1, 2, \ldots, n\}\). Due to the limitation on the length of a synthetic DNA fragment, the range of \(n\) is determined by 1-MCI and limited to \(0 < n < \lfloor L_s/L_p \rfloor\), where \(L_s\) represents the maximal length of a DNA segment in nt and \(L_p\) the length of a primer in nt. We first analyse, in theory, the amount of data to be synthesised and sequenced under each proposed method, and then carry out a DNA data storage process on a real platform to validate the conclusions.

Method analysis

With this in mind, we first give the amount of binary data \(Sb_M\) that needs to be stored by 1-1CS, M-1CI and 1-MCI as follows:

$$\begin{aligned} Sb_{\text{1-1CS}} &= \sum_{i=1}^{n} r_c \times S_{D_i} + n \times (S_h + S_T) \end{aligned}$$

(1)

$$\begin{aligned} Sb_{\text{M-1CI}} &= \sum_{i=1}^{n} r_c \times S_{D_i} + n \times (S_h + 2L_p) + S_T \end{aligned}$$

(2)

$$\begin{aligned} Sb_{\text{1-MCI}} &= \sum_{i=1}^{n} r_c \times S_{D_i} + n \times S_h + S_T. \end{aligned}$$

(3)

Except for the 1-MCI tool file, the DNA fragments of all files contain a primer at both ends, so the data payload of each fragment is \(L_s - 2L_p\); the DNA fragment of the 1-MCI method's tool file contains primers for the \(n\) data files plus a universal primer, so its payload is \(L_s - (n+1)L_p\). The total numbers of bases \(Sd_M\) after adding the primers to the segmented fragments produced by the three methods are:

$$\begin{aligned} Sd_{\text{1-1CS}} &= Sb_{\text{1-1CS}} \times a + \frac{Sb_{\text{1-1CS}} \times a}{L_s - 2L_p} \times 2L_p = a \times Sb_{\text{1-1CS}} \times \left(1 + \frac{2L_p}{L_s - 2L_p}\right) \end{aligned}$$

(4)

$$\begin{aligned} Sd_{\text{M-1CI}} &= Sb_{\text{M-1CI}} \times a + \frac{Sb_{\text{M-1CI}} \times a}{L_s - 2L_p} \times 2L_p = a \times Sb_{\text{M-1CI}} \times \left(1 + \frac{2L_p}{L_s - 2L_p}\right) \end{aligned}$$

(5)

$$\begin{aligned} Sd_{\text{1-MCI}} &= Sb_{\text{1-MCI}} \times a + \frac{\left(\sum_{i=1}^{n} r_c \times S_{D_i} + n \times S_h\right) \times a}{L_s} \times 2L_p + \frac{S_T \times a}{L_s - (n+1)L_p} \times (n+1)L_p \end{aligned}$$

(6)

where \(a\) is the base factor, i.e., the number of bases needed to represent one bit.
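These six equations translate directly into code. Below is a minimal Python transcription of Eqs. (1) to (6); the function and parameter names are ours, and all sizes are assumed to be in consistent units (e.g., bits).

```python
def sb(method, sizes, s_h, s_t, r_c, l_p):
    """Binary data to store, Eqs. (1)-(3); sizes = [S_D1, ..., S_Dn]."""
    n = len(sizes)
    data = sum(r_c * s for s in sizes)          # compressed payload
    if method == "1-1CS":                       # tool stored once per file
        return data + n * (s_h + s_t)
    if method == "M-1CI":                       # one tool, per-file primer pair
        return data + n * (s_h + 2 * l_p) + s_t
    if method == "1-MCI":                       # one tool, indexed by primers
        return data + n * s_h + s_t
    raise ValueError(f"unknown method: {method}")


def sd(method, sizes, s_h, s_t, r_c, l_p, l_s, a):
    """Total bases after adding primers, Eqs. (4)-(6).

    The paper bounds n by 0 < n < floor(L_s / L_p)."""
    n = len(sizes)
    b = sb(method, sizes, s_h, s_t, r_c, l_p)
    if method in ("1-1CS", "M-1CI"):
        # Eqs. (4)/(5): every fragment carries one primer pair (2 * L_p).
        return a * b * (1 + 2 * l_p / (l_s - 2 * l_p))
    # Eq. (6): data fragments carry 2 * L_p each; the single tool file's
    # fragments carry n + 1 primers each.
    data_part = sum(r_c * s for s in sizes) + n * s_h
    return (b * a
            + data_part * a / l_s * 2 * l_p
            + s_t * a / (l_s - (n + 1) * l_p) * (n + 1) * l_p)
```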

To simplify the calculation, we assume that each \(D_i\) has an equal size \(S_D\). Figure 10 presents the quantitative results for the number of bases that need to be stored by each method, i.e., \(Sd_M\), under different settings. In Fig. 10a, we can observe that all three methods show a clear linear trend, and the 1-1CS method requires the highest DNA base storage compared with its counterparts. Both 1-MCI and M-1CI require relatively fewer bases, which demonstrates the effectiveness of using primers to avoid data redundancy. Figure 10b presents the trend of the three methods when the \(L_s/L_p\) ratio varies. A clear decreasing trend is observed, and the descent slows down as the ratio becomes larger. Again, 1-1CS requires the largest amount of DNA storage, followed by M-1CI; 1-MCI achieves the lowest thanks to its avoidance of data redundancy.

Generally speaking, when the three methods are compared, 1-1CS has the largest storage overhead since it needs to store multiple copies of the compression tool. By avoiding this data redundancy, 1-MCI and M-1CI perform better than 1-1CS in terms of \(Sd\).
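As a quick sanity check of this ordering, the transcription above can be evaluated with the parameter values reported in the caption of Fig. 10; the choice \(n = 5\) is ours, made so that \(L_s - (n+1)L_p > 0\) holds in Eq. (6).

```python
# Parameters from the Fig. 10 caption; n = 5 files of equal size S_D.
params = dict(s_h=10, s_t=2010, r_c=0.34, l_p=20, l_s=220, a=1 / 1.6)
sizes = [100_000] * 5
for m in ("1-1CS", "M-1CI", "1-MCI"):
    print(f"{m}: Sd = {sd(m, sizes, **params):,.0f} bases")
```

The printed values reproduce the ordering \(Sd_{\text{1-1CS}} > Sd_{\text{M-1CI}} > Sd_{\text{1-MCI}}\).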

Figure 10

Number of bases that need to be stored for each self-containment method. (a) illustrates how \(Sd_M\) changes when the number of files \(n\) varies. All three methods show a clear linear trend, and the 1-1CS method requires the highest DNA storage capacity compared with its counterparts. Both 1-MCI and M-1CI show a relatively lower DNA base storage capacity. (b) presents the trend of \(Sd_M\) when the \(L_s/L_p\) ratio changes. A clear decreasing trend is observed, and the descent slows down as the ratio becomes larger. Again, 1-1CS requires the highest amount of DNA storage, followed by M-1CI; 1-MCI achieves the lowest DNA base storage thanks to its avoidance of data redundancy. Note that we set the parameters \(S_h = 10\), \(S_T = 2010\), \(L_p = 20\) and \(a = 1/1.6\) during the experiments. \(S_D\) is set to 100000 to make sure that the size of the data far exceeds the size of the primers. \(r_c\) is 0.34 for the 7z compression algorithm. \(L_s\), which is always greater than \(L_p\), varies between 100 and 400 to achieve a ratio ranging between 5 and 20.

Figure 11

Storage density. (a) presents the storage density \(d = S_o/S_b\) when storing different files using the three methods. In the ideal case, only the compressed data is stored and its compressor is ignored. Note that the M-1CI and 1-MCI methods exhibit the same storage density, hence they are represented by a single bar. Thanks to the avoidance of data redundancy, the 1-MCI and M-1CI methods demonstrate higher storage density than 1-1CS. (b) shows the storage density of these methods when storing all the data. Again, both 1-MCI and M-1CI present nearly the same storage density, which outperforms the 1-1CS method.

Method tests

As opposed to the above studies, which focus on theoretical calculations, we carried out a DNA data storage process on a real platform to validate the conclusions made above. The experiment is configured with an Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz, 16.0 GB of RAM and 64-bit Windows 10, and the experimental environment is Python 3.8. Since our method mainly concerns the indexing of files and the related tools, the design of the coding algorithm itself is not our focus; we used the open-source tool Chamaeleo [28] to test our method.

In this experiment, we selected two compression programs, 7z (194 KB in size) and zip (87 KB in size). Although 7z is relatively large, its compression effect on files with more redundant data is better than that of zip. We established five file directories (Cat, Jane, MonaLisa, Leo, and mix) to store data files of three types (bmp, txt, and jpg) as independent data items for testing. The references for the dataset we used are given in the "Appendix" section.

Experiment design: As discussed, in this paper we are mainly concerned with the digital part of the DNA storage process; the actual bio-synthetic storage is deliberately omitted. Similarly, we assume that the data can be successfully amplified by PCR and sequenced, and then decoded and decompressed after being read out, thereby realising its self-interpretation.

(a) Primer design: Since the file indexing in our method mainly relies on the allocation of primers, we also carried out different primer designs for the experiments. When designing the primers, we must not only respect the relevant biological constraints but also address the issue of homology [15]. Considering the impact on time and space, we selected the Grass [17] encoding for the DNA files used in the primer design. Every data file, under all methods, needs a pair of primers. For 1-1CS and M-1CI, we additionally needed to design a pair of primers for each tool file; for 1-MCI, we only needed to design one pair of universal primers (PrimerU') for the two tool files. Finally, adding the primers to the data fragments completes the computational phase of data self-containment.
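Such screening can be automated. The sketch below is a simplified illustration of the kind of filters involved (GC balance, homopolymer runs, and pairwise distance as a crude homology proxy), not the actual design pipeline used in the paper; the thresholds are illustrative assumptions.

```python
import itertools
import random


def gc_fraction(p):
    return (p.count("G") + p.count("C")) / len(p)


def has_homopolymer(p, k=4):
    return any(sum(1 for _ in g) >= k for _, g in itertools.groupby(p))


def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))


def pick_primers(n_pairs, length=20, min_dist=8, seed=0):
    """Randomly sample primer pairs passing simple GC-content,
    homopolymer and pairwise-distance filters (illustrative only)."""
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < 2 * n_pairs:
        p = "".join(rng.choice("ACGT") for _ in range(length))
        if not 0.4 <= gc_fraction(p) <= 0.6:
            continue                     # unbalanced GC content
        if has_homopolymer(p):
            continue                     # run of >= 4 identical bases
        if any(hamming(p, q) < min_dist for q in chosen):
            continue                     # too similar to an accepted primer
        chosen.append(p)
    return [(chosen[2 * i], chosen[2 * i + 1]) for i in range(n_pairs)]
```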

For both 1-1CS and M-1CI, there are primer sequences at the left and right ends of the fragments of both data files and tool files. We set the primer length to \(L_p = 20\), so the effective data payload does not exceed \(220 - 40 = 180\) bp. Taking into account the space required for the index, we set the final data payload to 160 bp; the space occupied by the index varies with the file size. Similarly, for 1-MCI, the two ends of a data file's DNA fragment are still primer sequences, and the data payload is likewise 160 bp. However, a tool file's DNA fragment needs to include the primers of its data files and a universal primer, so the 7z and zip payloads are at most \(220 - 2 \times 20 - 20 = 160\) bp and \(220 - 3 \times 20 - 20 = 140\) bp, respectively. Considering the space occupied by the index, we set the final data lengths for 7z and zip to 140 and 120 bp, respectively. The index length changes with the file size.
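The payload arithmetic above can be summarised in a few lines; note that the primer counts per tool fragment (2 for 7z, 3 for zip) are our reading of the arithmetic in the text, not stated explicitly.

```python
L_S, L_P, IDX = 220, 20, 20     # fragment, primer and index lengths (nt)

# Data files (all methods): one primer pair plus the index.
data_payload = L_S - 2 * L_P - IDX            # 220 - 40 - 20 = 160

# 1-MCI tool fragments: one primer per referencing data file plus one
# universal primer, then the index.
payload_7z  = L_S - 2 * L_P - L_P - IDX       # 220 - 40 - 20 - 20 = 140
payload_zip = L_S - 3 * L_P - L_P - IDX       # 220 - 60 - 20 - 20 = 120
```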

(b) Process design: We first use the different compression programs to compress a set of test data (5 data files of different types and 2 compression programs), and then adopt a classic coding scheme to encode the compressed data using the Chamaeleo software. More specifically, we assume that the length of a synthetic DNA sequence is within 220 bp. The encoding process first reads in each test file as binary data, divides it into fragments, adds an index to identify the position of each fragment, encodes each fragment into a DNA sequence with the selected encoding algorithm, and finally adds the designed primers at both ends. As the error problem is not the focus of this paper, we do not add an error-correction field to the fragments.
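To make the pipeline concrete, here is a minimal sketch of the fragment-and-encode step. The trivial 2-bits-per-base mapping stands in for the actual Grass [17] code driven through Chamaeleo, and, as in the experiment, no error-correction field is added. With a 40-bit index (20 nt) and a 320-bit payload (160 nt), each strand comes to 20 + 20 + 160 + 20 = 220 nt.

```python
def encode_file(path, primer_f, primer_r, payload_bits=320, index_bits=40):
    """Fragment a binary file and wrap each fragment as a DNA strand."""
    base = {"00": "A", "01": "C", "10": "G", "11": "T"}
    with open(path, "rb") as f:
        bits = "".join(f"{b:08b}" for b in f.read())
    strands = []
    for i in range(0, len(bits), payload_bits):
        chunk = bits[i:i + payload_bits].ljust(payload_bits, "0")
        # The index field records the fragment's position in the file.
        record = f"{i // payload_bits:0{index_bits}b}" + chunk
        dna = "".join(base[record[j:j + 2]]
                      for j in range(0, len(record), 2))
        strands.append(primer_f + dna + primer_r)
    return strands
```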

Experimental results: With the above designs, we store the five test files in the DNA storage using the different proposed methods and then measure their respective storage densities and the time taken, in order to reveal the advantages of the proposed methods. The storage density is defined as \(d = S_o/S_b = 1/a\), where \(S_b\) is the total number of bases obtained by the selected encoding scheme and \(a\) is the base factor. This metric indicates how many bits, including all the data, the tool, and their respective index primers, are represented by one DNA base, and thus reflects how effectively the self-containment methods can be combined with DNA coding schemes. By this definition, the higher the storage density, the better the method. The relevant results are shown in Fig. 11, which compares the storage densities of the five test files of different types; these files are first compressed by 7z, then encoded with the coding algorithm of Grass et al. [17], and finally stored with the proposed self-contained and self-explanatory methods. Figure 11a presents the storage density \(d\) when storing different files using the three methods. In the ideal case, only the compressed data is stored and its compressor tool is ignored. By avoiding data redundancy, the 1-MCI and M-1CI methods demonstrate higher storage density than the 1-1CS method, which indicates the effectiveness of using primers to reduce redundant data storage and boost the storage density. The same conclusion is supported by Fig. 11b, which shows the storage density \(d\) of these methods when storing all the data files; again, both 1-MCI and M-1CI outperform 1-1CS in terms of storage density. Figure 12a presents the time taken by the encoding and decoding processes for different files under the 1-MCI and M-1CI storage methods during the digital part. As we can observe, 1-MCI and M-1CI present nearly the same, reasonable time performance, which verifies that the proposed methods run efficiently when processing data files.

Summary: In summary, the self-contained and self-explanatory technology brings a certain data overhead, but it greatly improves the integrity of the data and ensures its reliable storage in unreliable external environments. In general, when the tool files are relatively small, the 1-1CS method can be used. When the tools are relatively large, the M-1CI method can be used to minimise the redundant data overhead and reduce the number of sequencing passes for cost reduction. However, if a number of data files use the same tool, one can adopt the 1-MCI method as an effective solution.

Figure 12

Time efficiency. This figure presents the time taken by the encoding and decoding processes for different files under the 1-MCI and M-1CI storage methods. Since 1-MCI and M-1CI present nearly the same time performance, the bar chart does not distinguish the two methods as an extra dimension.
