fs-storage : Add FolderStorage struct #81

Merged: 17 commits, Sep 2, 2024
2 changes: 1 addition & 1 deletion fs-storage/Cargo.toml
@@ -17,7 +17,7 @@ serde_json = "1.0.82"
serde = { version = "1.0.138", features = ["derive"] }
jni = { version = "0.21.1", optional = true }
jnix = { version = "0.5.1", features = ["derive"], optional = true }

bincode = "1.3"
data-error = { path = "../data-error" }


378 changes: 378 additions & 0 deletions fs-storage/src/folder_storage.rs
@@ -0,0 +1,378 @@
use serde::{Deserialize, Serialize};
use std::fs::{self, File};
use std::io::{Read, Write};
use std::time::SystemTime;
use std::{
collections::BTreeMap,
path::{Path, PathBuf},
};

use crate::base_storage::{BaseStorage, SyncStatus};
use crate::monoid::Monoid;
// use crate::utils::read_version_2_fs;
use crate::utils::remove_files_not_in_ram;
use data_error::{ArklibError, Result};

/*
Note on `FolderStorage` Versioning:

`FolderStorage` is a basic key-value storage system that persists data to disk,
where the key is the path of the file inside the directory.


In version 2, `FolderStorage` stored data in a plaintext format.
Starting from version 3, data is stored in JSON format.

For backward compatibility, we provide a helper function `read_version_2_fs` to read version 2 format.
*/
const STORAGE_VERSION: i32 = 3;

/// Represents a folder storage system that persists data to disk.
pub struct FolderStorage<K, V>
where
K: Ord,
Collaborator

Since the key in FolderStorage is the filename, we should add a constraint that K must implement std::fmt::Display, right?

Member

Good point. Technically, we could solve the task for an arbitrary key type by storing the keys together with the values, but that would complicate the solution unnecessarily for now. We'll be able to loosen the constraint when we implement "chunked storage".

Collaborator

ChunkedStorage is meant to store n key-value pairs per file. FolderStorage is a special case where n = 1. We can optimise the special case by storing the key only in the in-memory mapping. This means the key only needs the constraints required to store it in the BTreeMap, which would be Ord + Eq.

Member

@kirillt, Aug 18, 2024

@twitu Yes, but the easiest way to persist the keys is in the filenames.

<storage_root>
|- qwe
|- asd
|- zxc

"qwe", "asd", "zxc" are the keys here, e.g. file <storage_root>/qwe contains value assigned to key "qwe"

Do we store data like this currently? We could just fix the key type to String, but we want to use the storage with numerical keys too, so Display is the most practical generalization.

If we allow keys to be just Ord, then we need to serialize, store, load and deserialize them. I would postpone such work till we start working on the chunked storage.
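
A minimal sketch of the key-in-filename idea described above, assuming the `.bin` extension that appears later in this diff. The helper names `key_to_path` and `path_to_key` are hypothetical and not part of the PR; they only illustrate why `Display` (to write the name) and `FromStr` (to read it back) are the two bounds needed. Keys whose text contains a path separator would not round-trip this way.

```rust
use std::fmt::Display;
use std::path::{Path, PathBuf};
use std::str::FromStr;

/// Build the on-disk path for a key: `<storage_root>/<key>.bin`.
/// Hypothetical helper; the `.bin` extension mirrors the one used in this PR.
fn key_to_path<K: Display>(root: &Path, key: &K) -> PathBuf {
    root.join(format!("{}.bin", key))
}

/// Recover a key from a file path.
/// Returns `None` if the file name is not valid UTF-8 or does not parse as `K`.
fn path_to_key<K: FromStr>(path: &Path) -> Option<K> {
    path.file_stem()?.to_str()?.parse::<K>().ok()
}
```

With these two helpers, both `String` and numeric keys map to file names and back without serializing the key into the file contents.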

{
/// Label for logging
label: String,
/// Path to the underlying folder where data is persisted
path: PathBuf,
/// `ram_timestamps` can be used to track the last time a file was modified in memory,
/// where the key is the path of the file inside the directory.
ram_timestamps: BTreeMap<K, SystemTime>,
/// `disk_timestamps` can be used to track the last time a file was written to or read from disk,
/// where the key is the path of the file inside the directory.
disk_timestamps: BTreeMap<K, SystemTime>,
data: FolderStorageData<K, V>,
}

/// A struct that represents the data stored in a [`FolderStorage`] instance.
///
///
/// This is the data that is serialized and deserialized to and from disk.
#[derive(Serialize, Deserialize)]
pub struct FolderStorageData<K, V>
where
K: Ord,
{
version: i32,
Collaborator

Where should we store the version information? In FileStorage, it is serialized in a separate field. For FolderStorage, we need a way to track the version to ensure backward compatibility in the future. I have two ideas:

  • Embed the storage version in each file's content as a separate field.
  • Use a special file in the folder to specify the version.

Member

Files storing the values should not be used for this. Examples of values stored in a FolderStorage range from JSON properties to binary image data.

Storing the version in a separate file sounds like a better idea to me. But do we really need versioning here?

Collaborator

We might manage without versioning information in FolderStorage, but if we decide to update to a newer version later (for instance, if binary serialization turns out to be unsuitable), implementing backward compatibility will become more challenging, though not impossible.

We've already dealt with a similar situation in FileStorage.
I don't have a strong opinion on this. If you prefer to omit versioning information, that's fine with me.

Member

Agreed. We might change the folder structure in the future, too. Let's add a version file just in case.

On the other hand, versioning of the file format should probably be delegated to the user's side, because we should just write bytes using a writer that the user passes to us. So JSON could be written as text or in binary form. Can we implement it so that the user controls how JSON is stored?
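
A rough sketch of the "version file" option, using only the standard library. The file name `.version` and the helpers `write_version` / `read_version` are assumptions made for illustration; the PR does not define them. A name like `.version` has no extension as far as `Path::extension` is concerned, so a loader that only picks up `*.bin` files would skip it.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical name of the version marker file (not taken from the PR).
const VERSION_FILE: &str = ".version";

/// Write the storage version as plain text into `<root>/.version`,
/// creating the folder if needed.
fn write_version(root: &Path, version: i32) -> io::Result<()> {
    fs::create_dir_all(root)?;
    fs::write(root.join(VERSION_FILE), version.to_string())
}

/// Read the storage version back.
/// Returns `Ok(None)` when the marker file does not exist.
fn read_version(root: &Path) -> io::Result<Option<i32>> {
    match fs::read_to_string(root.join(VERSION_FILE)) {
        Ok(text) => text
            .trim()
            .parse::<i32>()
            .map(Some)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}
```

Treating a missing marker as `None` leaves room to interpret folders written before the marker existed as an implicit first version.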

Collaborator

The proper way to let the user pick the serialization format would be to introduce another generic parameter, either selecting one of a given set of formats or being an implementation of Writer. However, I believe this can be a feature for later.

Currently, we can pick JSON as the default format since it is human-readable. In fact, when we later extend the implementation to support multiple writers through generics, we won't need a separate version file, because the generic parameter can default to JSON.

So for v1, I think not making it generic and using serde_json as the serialization format will be a good approach.
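
For the "later" variant, one possible shape is a format trait plus a defaulted generic parameter. Everything below is a hypothetical sketch: `Format`, `Text`, `Binary` and `Store` are illustrative names, `Text` stands in for a human-readable default (the thread suggests JSON via serde_json; plain text keeps the sketch dependency-free), and an in-memory map stands in for the files on disk.

```rust
use std::collections::BTreeMap;
use std::convert::TryFrom;
use std::marker::PhantomData;

/// Hypothetical trait: how a value is turned into bytes and back.
trait Format<V> {
    fn encode(value: &V) -> Vec<u8>;
    fn decode(bytes: &[u8]) -> Option<V>;
}

/// Human-readable default format (stand-in for JSON).
struct Text;

impl Format<i64> for Text {
    fn encode(value: &i64) -> Vec<u8> {
        value.to_string().into_bytes()
    }
    fn decode(bytes: &[u8]) -> Option<i64> {
        std::str::from_utf8(bytes).ok()?.trim().parse().ok()
    }
}

/// Fixed-width little-endian binary format.
struct Binary;

impl Format<i64> for Binary {
    fn encode(value: &i64) -> Vec<u8> {
        value.to_le_bytes().to_vec()
    }
    fn decode(bytes: &[u8]) -> Option<i64> {
        Some(i64::from_le_bytes(<[u8; 8]>::try_from(bytes).ok()?))
    }
}

/// Storage parameterised by format; `F` defaults to `Text`,
/// so `Store<i64>` means `Store<i64, Text>`.
struct Store<V, F = Text> {
    /// In-memory stand-in for one file per key.
    files: BTreeMap<String, Vec<u8>>,
    _marker: PhantomData<(V, F)>,
}

impl<V, F: Format<V>> Store<V, F> {
    fn new() -> Self {
        Store { files: BTreeMap::new(), _marker: PhantomData }
    }
    fn set(&mut self, key: &str, value: &V) {
        self.files.insert(key.to_owned(), F::encode(value));
    }
    fn get(&self, key: &str) -> Option<V> {
        F::decode(self.files.get(key)?)
    }
}
```

Callers that do not care about the format write `Store<i64>` and get the readable default; callers that need compact files opt in with `Store<i64, Binary>`.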

entries: BTreeMap<K, V>,
}

impl<K, V> AsRef<BTreeMap<K, V>> for FolderStorageData<K, V>
where
K: Ord,
{
fn as_ref(&self) -> &BTreeMap<K, V> {
&self.entries
}
}

impl<K, V> FolderStorage<K, V>
where
K: Ord
+ Clone
+ serde::Serialize
+ serde::de::DeserializeOwned
+ std::str::FromStr
+ std::fmt::Display,
V: Clone
+ serde::Serialize
+ serde::de::DeserializeOwned
+ std::str::FromStr
+ Monoid<V>,
{
/// Create a new folder storage with a diagnostic label and directory path
/// The storage will be initialized using the disk data, if the path exists
///
/// Note: if the folder storage already exists, the data will be read from the folder
/// without overwriting it.
pub fn new(label: String, path: &Path) -> Result<Self> {
let mut storage = Self {
label,
path: PathBuf::from(path),
ram_timestamps: BTreeMap::new(),
disk_timestamps: BTreeMap::new(),
data: FolderStorageData {
version: STORAGE_VERSION,
entries: BTreeMap::new(),
},
};

if Path::exists(path) {
storage.read_fs()?;
}

Ok(storage)
}

/// Load mapping from folder storage
fn load_fs_data(&mut self) -> Result<FolderStorageData<K, V>> {
if !self.path.exists() {
return Err(ArklibError::Storage(
self.label.clone(),
"Folder does not exist".to_owned(),
));
}

if !self.path.is_dir() {
return Err(ArklibError::Storage(
self.label.clone(),
"Path is not a directory".to_owned(),
));
}

let mut data = FolderStorageData {
version: STORAGE_VERSION,
entries: BTreeMap::new(),
};

self.disk_timestamps.clear();
self.ram_timestamps.clear();

// read_version_2_fs : unimplemented!()

for entry in fs::read_dir(&self.path)? {
let entry = entry?;
let path = entry.path();
if path.is_file()
&& path.extension().map_or(false, |ext| ext == "bin")
{
let key = path
.file_stem()
.unwrap()
.to_str()
.unwrap()
.parse::<K>()
.map_err(|_| {
ArklibError::Storage(
self.label.clone(),
"Failed to parse key from filename".to_owned(),
)
})?;

let mut file = File::open(&path)?;
let mut buffer = Vec::new();
file.read_to_end(&mut buffer)?;

let value: V = bincode::deserialize(&buffer).map_err(|e| {
ArklibError::Storage(
self.label.clone(),
format!("Failed to deserialize value: {}", e),
)
})?;
data.entries.insert(key.clone(), value);

if let Ok(metadata) = fs::metadata(&path) {
if let Ok(modified) = metadata.modified() {
self.disk_timestamps.insert(key.clone(), modified);
self.ram_timestamps.insert(key, modified);
}
}
}
}
Ok(data)
}
}
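
The directory scan in `load_fs_data` calls `unwrap()` on `file_stem()` and `to_str()`, which would panic on a file name that is not valid UTF-8. Below is a std-only sketch of the same scan without panics. `scan_keys` is a hypothetical helper, and unlike the PR code it skips files whose names do not parse as `K` instead of returning an error; that is a design choice of this sketch, not the PR's behaviour.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;
use std::str::FromStr;
use std::time::SystemTime;

/// Collect `key -> modified time` for every `<key>.bin` file under `root`.
/// Files whose name is not UTF-8 or does not parse as `K` are skipped.
fn scan_keys<K: Ord + FromStr>(root: &Path) -> io::Result<BTreeMap<K, SystemTime>> {
    let mut found = BTreeMap::new();
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        let is_bin = path.extension().map_or(false, |ext| ext == "bin");
        if !path.is_file() || !is_bin {
            continue;
        }
        // No unwraps: any failure along the chain yields `None`.
        let key = match path
            .file_stem()
            .and_then(|stem| stem.to_str())
            .and_then(|stem| stem.parse::<K>().ok())
        {
            Some(key) => key,
            None => continue,
        };
        let modified = fs::metadata(&path)?.modified()?;
        found.insert(key, modified);
    }
    Ok(found)
}
```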

impl<K, V> BaseStorage<K, V> for FolderStorage<K, V>
where
K: Ord
+ Clone
+ serde::Serialize
+ serde::de::DeserializeOwned
+ std::str::FromStr
+ std::fmt::Display,
V: Clone
+ serde::Serialize
+ serde::de::DeserializeOwned
+ std::str::FromStr
+ Monoid<V>,
{
/// Set a key-value pair in the internal mapping
fn set(&mut self, key: K, value: V) {
self.data.entries.insert(key.clone(), value);
self.ram_timestamps.insert(key, SystemTime::now());
}

/// Remove an entry from the internal mapping given a key
fn remove(&mut self, id: &K) -> Result<()> {
self.data.entries.remove(id).ok_or_else(|| {
ArklibError::Storage(self.label.clone(), "Key not found".to_owned())
})?;
// self.ram_timestamps.remove(id);
// OR
self.ram_timestamps
.insert(id.clone(), SystemTime::now());
Ok(())
}

/// Compare the modification timestamps of the storage files on disk
/// with the timestamps of the in-memory entries
/// to determine whether either side requires syncing.
fn sync_status(&self) -> Result<SyncStatus> {
let mut ram_newer = false;
let mut disk_newer = false;

for (key, ram_timestamp) in &self.ram_timestamps {
let file_path = self.path.join(format!("{}.bin", key));

if let Ok(metadata) = fs::metadata(&file_path) {
if let Ok(disk_timestamp) = metadata.modified() {
match ram_timestamp.cmp(&disk_timestamp) {
std::cmp::Ordering::Greater => ram_newer = true,
std::cmp::Ordering::Less => disk_newer = true,
std::cmp::Ordering::Equal => {}
}
} else {
// If we can't read the disk timestamp, assume RAM is newer
ram_newer = true;
}
} else {
// If the file doesn't exist on disk, RAM is newer
ram_newer = true;
}

// If we've found both RAM and disk modifications, we can stop checking
if ram_newer && disk_newer {
break;
}
}

// Check for files on disk that aren't in RAM
for entry in fs::read_dir(&self.path)? {
let entry = entry?;
let path = entry.path();
if path.is_file()
&& path.extension().map_or(false, |ext| ext == "bin")
{
let key = path
.file_stem()
.unwrap()
.to_str()
.unwrap()
.parse::<K>()
.map_err(|_| {
ArklibError::Storage(
self.label.clone(),
"Failed to parse key from filename".to_owned(),
)
})?;
if !self.ram_timestamps.contains_key(&key) {
disk_newer = true;
break;
}
}
}

let status = match (ram_newer, disk_newer) {
(false, false) => SyncStatus::InSync,
(true, false) => SyncStatus::StorageStale,
(false, true) => SyncStatus::MappingStale,
(true, true) => SyncStatus::Diverge,
};

log::info!("{} sync status is {}", self.label, status);
Ok(status)
}

/// Sync the in-memory storage with the storage on disk
fn sync(&mut self) -> Result<()> {
match self.sync_status()? {
SyncStatus::InSync => Ok(()),
SyncStatus::MappingStale => self.read_fs().map(|_| ()),
SyncStatus::StorageStale => self.write_fs().map(|_| ()),
SyncStatus::Diverge => {
let data = self.load_fs_data()?;
self.merge_from(&data)?;
self.write_fs()?;
Ok(())
}
}
}

/// Read the data from folder storage
fn read_fs(&mut self) -> Result<&BTreeMap<K, V>> {
let data = self.load_fs_data()?;
self.data = data;
Ok(&self.data.entries)
}

/// Get a value from the internal mapping
fn get(&self, id: &K) -> Option<&V> {
self.data.entries.get(id)
}

/// Write the data to folder
///
/// Update the modified timestamp in file metadata to avoid OS timing issues
/// https://github.com/ARK-Builders/ark-rust/pull/63#issuecomment-2163882227
fn write_fs(&mut self) -> Result<()> {
fs::create_dir_all(&self.path)?;

for (key, value) in &self.data.entries {
let file_path = self.path.join(format!("{}.bin", key));
let encoded: Vec<u8> = bincode::serialize(value).map_err(|e| {
ArklibError::Storage(
self.label.clone(),
format!("Failed to serialize value: {}", e),
)
})?;

let mut file = File::create(&file_path)?;
file.write_all(&encoded)?;
file.flush()?;

let new_timestamp = SystemTime::now();
file.set_modified(new_timestamp)?;
file.sync_all()?;

self.disk_timestamps
.insert(key.clone(), new_timestamp);
}

// Remove files for keys that no longer exist
remove_files_not_in_ram(&self.path, &self.label, &self.data.entries);

log::info!(
"{} {} entries have been written",
self.label,
self.data.entries.len()
);
Ok(())
}

/// Erase the folder and its contents from disk
fn erase(&self) -> Result<()> {
// `remove_dir` only removes empty directories, so use `remove_dir_all`
fs::remove_dir_all(&self.path).map_err(|err| {
ArklibError::Storage(self.label.clone(), err.to_string())
})
}

/// Merge the data from another folder storage instance into this folder storage instance
fn merge_from(&mut self, other: impl AsRef<BTreeMap<K, V>>) -> Result<()>
where
V: Monoid<V>,
{
let other_entries = other.as_ref();
for (key, value) in other_entries {
if let Some(existing_value) = self.data.entries.get(key) {
let resolved_value = V::combine(existing_value, value);
self.set(key.clone(), resolved_value);
} else {
self.set(key.clone(), value.clone())
}
self.ram_timestamps
.insert(key.clone(), SystemTime::now());
}
Ok(())
}
}
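
The decision logic in `sync_status` can be read in isolation from the filesystem calls. The sketch below is a simplified model of it: the local `SyncStatus` enum only mirrors the variant names used in the diff, and `classify` is a hypothetical function that takes the timestamps as plain data.

```rust
use std::cmp::Ordering;
use std::time::SystemTime;

/// Local stand-in mirroring the variant names used in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyncStatus {
    InSync,
    MappingStale,
    StorageStale,
    Diverge,
}

/// Classify the sync state from per-key timestamps.
/// Each pair is (RAM timestamp, disk timestamp); `None` means the file is missing on disk.
/// `unknown_files_on_disk` is true when the folder holds files with no RAM entry.
fn classify(
    pairs: &[(SystemTime, Option<SystemTime>)],
    unknown_files_on_disk: bool,
) -> SyncStatus {
    let mut ram_newer = false;
    let mut disk_newer = unknown_files_on_disk;

    for (ram, disk) in pairs {
        match disk {
            // A key that exists only in RAM still has to be written out.
            None => ram_newer = true,
            Some(disk) => match ram.cmp(disk) {
                Ordering::Greater => ram_newer = true,
                Ordering::Less => disk_newer = true,
                Ordering::Equal => {}
            },
        }
    }

    match (ram_newer, disk_newer) {
        (false, false) => SyncStatus::InSync,
        (true, false) => SyncStatus::StorageStale,
        (false, true) => SyncStatus::MappingStale,
        (true, true) => SyncStatus::Diverge,
    }
}
```

Keeping the classification pure like this makes the four outcomes straightforward to unit test without touching the disk.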

impl<K, V> AsRef<BTreeMap<K, V>> for FolderStorage<K, V>
where
K: Ord,
{
fn as_ref(&self) -> &BTreeMap<K, V> {
&self.data.entries
}
}
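
`merge_from` accepts anything that exposes a `BTreeMap<K, V>` and resolves conflicting keys through `V::combine`. The sketch below shows that merge rule on plain maps. The `Combine` trait is a hypothetical local stand-in for the crate's `Monoid` (only the `combine` call is visible in this diff), and taking the maximum for `u32` is just an example policy.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the crate's `Monoid` trait:
/// only the `combine` operation used by `merge_from` is modelled.
trait Combine {
    fn combine(a: &Self, b: &Self) -> Self;
}

/// Example policy: when both sides have a value, keep the larger one.
impl Combine for u32 {
    fn combine(a: &Self, b: &Self) -> Self {
        (*a).max(*b)
    }
}

/// Merge `theirs` into `ours`: keys present on both sides are combined,
/// keys present only in `theirs` are copied over.
fn merge_maps<K, V>(ours: &mut BTreeMap<K, V>, theirs: &BTreeMap<K, V>)
where
    K: Ord + Clone,
    V: Clone + Combine,
{
    for (key, value) in theirs {
        let merged = match ours.get(key) {
            Some(existing) => V::combine(existing, value),
            None => value.clone(),
        };
        ours.insert(key.clone(), merged);
    }
}
```

Keys that exist only in `ours` are left untouched, so the merge never drops data from either side.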
1 change: 1 addition & 0 deletions fs-storage/src/lib.rs
@@ -1,6 +1,7 @@
pub mod base_storage;
pub mod btreemap_iter;
pub mod file_storage;
pub mod folder_storage;
#[cfg(feature = "jni-bindings")]
pub mod jni;
pub mod monoid;