{
"cells": [
{
"cell_type": "markdown",
"id": "87adae48-90b8-4132-bccf-fb7ec200df97",
"metadata": {},
"source": [
"# Abstract\n",
"\n",
"We evaluate the performance of a NAR-based, a file-based (uncompressed and xz-compressed), and a casync-based substitution mechanism across three scenarios:\n",
"\n",
"1. The impact of a curl-induced mass rebuild on a NixOS machine closure.\n",
"1. The impact of a single derivation version bump (Firefox) on the said derivation.\n",
"1. The impact of a stable -> unstable channel jump on a NixOS machine closure.\n",
"\n",
"For each of these scenarios, we compare how much data each substitution technique needs to transfer against how much data the NAR-based substitution needs to transfer.\n",
"\n",
"Unsurprisingly, the mass rebuild scenario shows the biggest improvement: nix-casync cuts the amount of downloaded data by 48.7%, while the xz-compressed file-based substitution cuts it by 38.4%.\n",
"\n",
"Surprisingly, we also see an improvement in the case of a Nixpkgs stable (21.11) -> Nixpkgs unstable (22.05 pre-release) jump: xz-compressed file-based substitution cuts the amount of downloaded data by 18%, Casync by 17.2%.\n",
"\n",
"We see almost no improvement in the derivation bump scenario: 1% for xz-compressed file-based substitution, 1% for Casync.\n",
"\n",
"For file-based substitution, compression is crucial to overall performance: uncompressed file-based substitution is consistently two orders of magnitude worse than NAR-based substitution.\n",
"\n",
"We can conclude that reducing the substitution granularity, either via Casync or via xz-compressed file-based substitution, consistently reduces (by 1% to 48.7%) the amount of transferred data across three common scenarios."
]
},
{
"cell_type": "markdown",
"id": "f1eb8600-b572-498a-8f8b-15ad6b91d2b4",
"metadata": {
"tags": []
},
"source": [
"# Import Benchmark Data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f9d5564c-dab3-458d-923f-848e82459c6e",
"metadata": {},
"outputs": [],
"source": [
"%config InlineBackend.figure_formats = ['svg']\n",
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"plt.style.use('dark_background')\n",
"\n",
"def toMb(b):\n",
" # Convert a byte count to megabytes (binary: 1/1024**2 ~= 9.537e-7).\n",
" return b * (9.537e-7)"
]
},
{
"cell_type": "markdown",
"id": "4b2cb125-b30b-4049-a6d8-ea2468963085",
"metadata": {},
"source": [
"First, let's import the data generated by the `../companeSubsEfficiency` benchmark."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f9e97b01-f776-4439-af58-ac7f1313d1a0",
"metadata": {},
"outputs": [],
"source": [
"results_dir='bench-results'\n",
"def importBenchmarkCSVs(contentDir):\n",
" return {\n",
"\"casync\": pd.read_csv(f\"{contentDir}/casync.csv\", sep=\";\"),\n",
"\"file\": pd.read_csv(f\"{contentDir}/file.csv\", sep=\";\"),\n",
"\"compressed-file\": pd.read_csv(f\"{contentDir}/file-xz-compressed.csv\", sep=\";\"),\n",
"\"nar\": pd.read_csv(f\"{contentDir}/nar.csv\", sep=\";\"),\n",
" }\n",
"\n",
"b = {\n",
" \"massRebuild\": {\n",
" \"before\": importBenchmarkCSVs(f\"{results_dir}/before-mass-rebuild\"),\n",
" \"after\": importBenchmarkCSVs(f\"{results_dir}/after-mass-rebuild\"),\n",
" },\n",
" \"channelJump\": {\n",
" \"before\": importBenchmarkCSVs(f\"{results_dir}/nixpkgs-stable-channel\"),\n",
" \"after\": importBenchmarkCSVs(f\"{results_dir}/nixpkgs-unstable-channel\")\n",
" },\n",
" \"firefoxBump\": {\n",
" \"before\": importBenchmarkCSVs(f\"{results_dir}/before-firefox-bump\"),\n",
" \"after\": importBenchmarkCSVs(f\"{results_dir}/after-firefox-bump\")\n",
" },\n",
" \"gimpBump\": {\n",
" \"before\": importBenchmarkCSVs(f\"{results_dir}/before-gimp-bump\"),\n",
" \"after\": importBenchmarkCSVs(f\"{results_dir}/after-gimp-bump\")\n",
" },\n",
" \"emacsBump\": {\n",
" \"before\": importBenchmarkCSVs(f\"{results_dir}/before-emacs-bump\"),\n",
" \"after\": importBenchmarkCSVs(f\"{results_dir}/after-emacs-bump\")\n",
" },\n",
" \"openmpiBump\": {\n",
" \"before\": importBenchmarkCSVs(f\"{results_dir}/before-openmpi-bump\"),\n",
" \"after\": importBenchmarkCSVs(f\"{results_dir}/after-openmpi-bump\")\n",
" },\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "81b08f0a-40e6-4379-89a8-8e14102d2c39",
"metadata": {
"tags": []
},
"source": [
"# Methodology\n",
"\n",
"For each of these benchmarks, we evaluate several store path substitution techniques and compare their efficiency.\n",
"\n",
"The following benchmarks consist of building a NixOS machine configuration against two Nixpkgs commits. We'll simulate a NixOS machine update from the first commit to the second one.\n",
"\n",
"We're going to evaluate the following four substitution techniques:\n",
"\n",
"1. **NAR substitution**: This is the substitution model currently used by both the NixOS and Guix projects. It consists of serializing a full store path into a NAR archive and compressing it with xz. In this benchmark, we identify each NAR by its filename, which is derived from the `sha256` sum of its content.\n",
"1. **Casync substitution**: This is an experimental substitution method implemented by the [nix-casync](https://github.com/flokli/nix-casync) project. Starting from a NAR, we uncompress it and chunk it into smaller pieces. In this benchmark, we identify each casync chunk by its filename, which is derived from the `sha256` sum of its content.\n",
"1. **File-based substitution**: This is a substitution method [the Guix](https://lists.gnu.org/archive/html/guix-devel/2021-01/msg00079.html) project brainstormed around: each store file is served separately. In this benchmark, we identify these files by the `sha256` sum of their content.\n",
"1. **XZ-compressed file-based substitution**: Similar to the file-based substitution, but with each file individually compressed with xz at compression level 6 extreme (`-6e`).\n",
"\n",
"Note: we use NixOS/Nixpkgs for all these benchmarks. However, since Guix currently uses the same substitution mechanism, you can safely assume the same conclusions hold true for it as well."
]
},
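{
"cell_type": "markdown",
"id": "3c9a1f2e-5b6d-4c7e-9f10-2a3b4c5d6e7f",
"metadata": {},
"source": [
"All four techniques rely on content addressing: a substitution atom (NAR, chunk or file) is named after the `sha256` sum of its content, so identical content shared between the \"before\" and \"after\" closures resolves to the same atom name and does not need to be downloaded again. A minimal sketch of that idea (the file contents below are made up for illustration):\n",
"\n",
"```python\n",
"import hashlib\n",
"\n",
"# Two hypothetical store files with byte-identical content hash to the\n",
"# same name, so a client holding one copy can skip downloading the other.\n",
"blob_before = b\"#!/bin/sh\\necho hello\\n\"\n",
"blob_after = b\"#!/bin/sh\\necho hello\\n\"\n",
"name_before = hashlib.sha256(blob_before).hexdigest()\n",
"name_after = hashlib.sha256(blob_after).hexdigest()\n",
"assert name_before == name_after\n",
"```"
]
},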
{
"cell_type": "code",
"execution_count": 3,
"id": "2af8dcec-8665-490a-9146-86b81dd108b0",
"metadata": {},
"outputs": [],
"source": [
"def analyse_benchmark_results(i, file):\n",
" \"\"\"\n",
" Analyse the results of one benchmark scenario.\n",
" \n",
" :param i: benchmark dataframes. Expects a \"before\" and an \"after\" entry.\n",
" :param file: when True, also include the uncompressed file-based substitution in the report.\n",
" \n",
" Each benchmark simulates the substitutions triggered by a transition between two\n",
" Nix closures, a \"before\" and an \"after\" one.\n",
" \n",
" For each substitution mechanism, we then simulate what we can re-use and what we have\n",
" to download by diff-ing the substitution atoms (files, chunks or NARs).\n",
" \"\"\"\n",
" \n",
" _a_nar = i[\"after\"][\"nar\"]\n",
" _b_nar = i[\"before\"][\"nar\"]\n",
" _a_casync = i[\"after\"][\"casync\"]\n",
" _b_casync = i[\"before\"][\"casync\"]\n",
" _a_file = i[\"after\"][\"file\"]\n",
" _b_file = i[\"before\"][\"file\"]\n",
" _a_compressed_file = i[\"after\"][\"compressed-file\"]\n",
" _b_compressed_file = i[\"before\"][\"compressed-file\"]\n",
"\n",
" nar_closure_size = _a_nar[\"Nar Size\"].sum()\n",
" casync_closure_size = _a_casync[\"Chunk Size\"].sum()\n",
" file_closure_size = _a_file[\"Size\"].sum()\n",
" compressed_file_closure_size = _a_compressed_file[\"Size\"].sum()\n",
" \n",
" _nar_merged = _a_nar.merge(_b_nar, how = \"left\", on=\"Nar Name\", indicator=True, suffixes=(\"_after\",\"_before\"))\n",
" nar_dl_size = _nar_merged.loc[_nar_merged[\"_merge\"] == \"left_only\"][\"Nar Size_after\"].sum()\n",
" nar_reused_size = _nar_merged.loc[_nar_merged[\"_merge\"] == \"both\"][\"Nar Size_after\"].sum()\n",
" nar_nar_savings = 0\n",
" \n",
" _casync_merged = _a_casync.merge(_b_casync, how=\"left\", on=\"Chunk Name\", indicator=True, suffixes=(\"_after\",\"_before\"))\n",
" casync_dl_size = _casync_merged.loc[_casync_merged[\"_merge\"]==\"left_only\"][\"Chunk Size_after\"].sum()\n",
" casync_reused_size = _casync_merged.loc[_casync_merged[\"_merge\"]==\"both\"][\"Chunk Size_after\"].sum()\n",
" casync_nar_savings = (nar_dl_size - casync_dl_size) / nar_dl_size\n",
" \n",
" _file_merged = _a_file.merge(_b_file, how=\"left\", on=\"Sha256\", indicator=True, suffixes=(\"_after\",\"_before\"))\n",
" file_dl_size = _file_merged.loc[_file_merged[\"_merge\"]==\"left_only\"][\"Size_after\"].sum()\n",
" file_reused_size = _file_merged.loc[_file_merged[\"_merge\"]==\"both\"][\"Size_after\"].sum()\n",
" file_nar_savings = (nar_dl_size - file_dl_size) / nar_dl_size\n",
"\n",
" _compressed_file_merged = _a_compressed_file.merge(_b_compressed_file, how=\"left\", on=\"Sha256\", indicator=True, suffixes=(\"_after\",\"_before\"))\n",
" compressed_file_dl_size = _compressed_file_merged.loc[_compressed_file_merged[\"_merge\"]==\"left_only\"][\"Size_after\"].sum()\n",
" compressed_file_reused_size = _compressed_file_merged.loc[_compressed_file_merged[\"_merge\"]==\"both\"][\"Size_after\"].sum()\n",
" compressed_file_nar_savings = (nar_dl_size - compressed_file_dl_size) / nar_dl_size\n",
" if file:\n",
" return pd.DataFrame( data = {\n",
" \"Name\": [\"NAR\", \"Casync\", \"File\", \"Compressed File\"],\n",
" \"Closure Size (MB)\": [toMb(nar_closure_size), toMb(casync_closure_size), toMb(file_closure_size), toMb(compressed_file_closure_size)],\n",
" \"Downloaded Size (MB)\": [toMb(nar_dl_size), toMb(casync_dl_size), toMb(file_dl_size), toMb(compressed_file_dl_size)],\n",
" \"Re-used Size (MB)\": [toMb(nar_reused_size), toMb(casync_reused_size), toMb(file_reused_size), toMb(compressed_file_reused_size)],\n",
" \"DL Savings Compared to NAR (%)\": [nar_nar_savings * 100, casync_nar_savings * 100, file_nar_savings * 100, compressed_file_nar_savings * 100]\n",
" })\n",
" else:\n",
" return pd.DataFrame( data = {\n",
" \"Name\": [\"NAR\", \"Casync\", \"Compressed File\"],\n",
" \"Closure Size (MB)\": [toMb(nar_closure_size), toMb(casync_closure_size), toMb(compressed_file_closure_size)],\n",
" \"Downloaded Size (MB)\": [toMb(nar_dl_size), toMb(casync_dl_size), toMb(compressed_file_dl_size)],\n",
" \"Re-used Size (MB)\": [toMb(nar_reused_size), toMb(casync_reused_size), toMb(compressed_file_reused_size)],\n",
" \"DL Savings Compared to NAR (%)\": [nar_nar_savings * 100, casync_nar_savings * 100, compressed_file_nar_savings * 100]\n",
" })\n",
"\n",
"def gen_perf_pie(dataframe, key):\n",
" idx = dataframe.query(f'Name == \"{key}\"').index[0]\n",
" pd.DataFrame(data={\"data\":[dataframe[\"Downloaded Size (MB)\"][idx],dataframe[\"Re-used Size (MB)\"][idx]]},\\\n",
" index=[\"Downloaded\",\"Re-Used\"])\\\n",
" .plot.pie(figsize=(6,6), y=\"data\", ylabel=\"\", title=f\"{key} Downloaded/Re-Used Data\") "
]
},
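{
"cell_type": "markdown",
"id": "7d2e4f60-1a2b-4c3d-8e9f-0a1b2c3d4e5f",
"metadata": {},
"source": [
"The diff at the heart of `analyse_benchmark_results` is a left merge with `indicator=True`: rows tagged `left_only` exist only in the \"after\" closure and must be downloaded, while rows tagged `both` are already available locally and can be re-used. A toy example with made-up NAR listings (the atom names and sizes are hypothetical):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical \"before\" and \"after\" atom listings, mirroring the NAR CSV schema.\n",
"before = pd.DataFrame({\"Nar Name\": [\"a\", \"b\"], \"Nar Size\": [10, 20]})\n",
"after = pd.DataFrame({\"Nar Name\": [\"b\", \"c\"], \"Nar Size\": [20, 30]})\n",
"\n",
"merged = after.merge(before, how=\"left\", on=\"Nar Name\", indicator=True, suffixes=(\"_after\", \"_before\"))\n",
"# \"c\" exists only in the new closure and must be downloaded;\n",
"# \"b\" is shared with the old closure and can be re-used.\n",
"dl_size = merged.loc[merged[\"_merge\"] == \"left_only\"][\"Nar Size_after\"].sum()\n",
"reused_size = merged.loc[merged[\"_merge\"] == \"both\"][\"Nar Size_after\"].sum()\n",
"```"
]
},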
{
"cell_type": "markdown",
"id": "6e632d28-b1b7-450b-abe0-274337a6dfbb",
"metadata": {
"tags": []
},
"source": [
"# Benchmark Scenarios\n",
"\n",
"\n",
"## 1. Mass Rebuild\n",
"\n",
"Let's build the same NixOS machine description using two Nixpkgs commits: one before and one after the merge of the [staging next 2021-12-03](https://github.com/NixOS/nixpkgs/pull/148396) iteration into master. This staging iteration contains, among other things, a `curl` version bump. That `curl` version bump triggers an almost complete Nixpkgs mass rebuild: both `nix` and `stdenv` depend on it.\n",
"\n",
"This mass-rebuild scenario represents a long-standing issue in terms of substitution performance."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "85c720c5-4688-4d7e-84d3-4006700a69fb",
"metadata": {},
"outputs": [],
"source": [
"mass_rebuild_results = analyse_benchmark_results(b[\"massRebuild\"], True)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "401c1c9d-b264-4266-ac92-de660f4577bc",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"p = mass_rebuild_results.plot.bar(figsize=(12,5), x=\"Name\",y=\"Downloaded Size (MB)\",title=\"Volume to Download for the Mass Rebuild Update (less is better)\", xlabel=\"\", ylabel=\"Size in MB\", color=\"#ff6a00\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "326255a1-a681-4405-9a40-889651392948",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_ = mass_rebuild_results.plot.bar(figsize=(12,5), x=\"Name\",y=\"DL Savings Compared to NAR (%)\",title=\"DL Savings Compared to NAR (more is better)\", xlabel=\"\", ylabel=\"Savings in %\", color=\"#ff6a00\")"
]
},
{
"cell_type": "markdown",
"id": "49e237d1-6ed7-4e2b-bc9d-f51d9a9eed2f",
"metadata": {},
"source": [
"We can see a massive performance gain for both Casync (48.4%) and xz-compressed files (38.4%). We can also see that compression plays a massive role in substitution performance: the uncompressed files perform almost 90% worse than the plain NAR substitution."
]
},
{
"cell_type": "markdown",
"id": "2ca085b5-a768-468c-ab8a-76bde3de6818",
"metadata": {},
"source": [
"## 2. Firefox Bump\n",
"\n",
"In this scenario, we simulate a Firefox update, taking the Firefox 97.0 -> 97.0.1 bump [7e23a7fb8268f16e83ef60bbd2708e1d57fd49ef](https://github.com/NixOS/nixpkgs/commit/7e23a7fb8268f16e83ef60bbd2708e1d57fd49ef) as a test example."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "318921b6-bd97-46e5-9aa4-b89e01831bb1",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"