We’ve all experienced the frustration of a poor internet connection. You may recall the disappointment of a large file download failing after 24 hours of waiting. Even worse, discovering that the download is not resumable.

Responsibility for resumable downloads doesn’t solely rest on the client side with the correct setting of HTTP headers. It’s equally, if not more, important for the backend to correctly enable several headers and implement the associated logic.

While I won’t delve into the detailed implementation in a specific language, understanding the headers discussed below will equip you with the knowledge to easily implement this feature if you wish.

Client

The only aspect you need to focus on is the Range HTTP request header. This header specifies the portions of a resource that the server should return. That’s all there is to it.

Range: <unit>=<range-start>-

On the client side, the only requirement is to properly implement the Range HTTP request header. This involves using the correct unit and determining the starting point of the range. The server then knows which portion of the file to send. There’s no need to worry about specifying the range end, as the typical use case involves resuming and downloading the entire file.
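For illustration, here is a minimal Python sketch of the client side using the requests library (the URL, file name, and chunk size are placeholder assumptions, not part of any particular API):

import os
import requests

url = "https://example.com/large_file.zip"   # placeholder URL
local_file = "large_file.zip"

# resume from however many bytes we already have on disk
resume_from = os.path.getsize(local_file) if os.path.exists(local_file) else 0
headers = {"Range": f"bytes={resume_from}-"} if resume_from else {}

with requests.get(url, headers=headers, stream=True) as r:
    # 206 Partial Content means the server honored the Range header;
    # anything else means we start over from the beginning
    mode = "ab" if r.status_code == 206 else "wb"
    with open(local_file, mode) as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)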

Server

Now, things start to get more complicated.

The ETag (also known as entity tag) HTTP response header serves as an identifier for a specific version of a resource.

ETag: "<etag_value>"

If your target client includes a browser, then you need to set the ETag. Modern browsers expect to see this value; otherwise, the browser will simply download the entire file from scratch.

The Content-Range response HTTP header signifies the position of a partial message within the full body message.

Content-Range: <unit> <range-start>-<range-end>/<size>

Imagine you are downloading a file of 500 bytes, but due to an unstable internet connection, the download is interrupted after only 100 bytes. In this scenario, you would expect the server to send the remaining 400 bytes of the file. Consequently, you would anticipate seeing the appropriate header in the server’s response.

Content-Range: bytes 100-499/500

Check out MDN to understand those numbers; I won’t explain them here.

The Accept-Ranges HTTP response header acts as a signal from the server, indicating its capability to handle partial requests from the client for file downloads.

Essentially, this header communicates to the client, “Hey, I am capable of handling this, let’s proceed.”

Don’t ask me why, you just need it.

Accept-Ranges: <range-unit>

I suggest simply using bytes.

Accept-Ranges: bytes

The Content-Length header signifies the size of the message body, measured in bytes, that is transmitted to the recipient.

In layman’s terms, it is the size in bytes of the remaining part of the file.

Content-Length: <length>

Let’s continue the same example from above: the server is going to send the remaining 400 bytes of the file.

Content-Length: 400
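To tie these response headers together, here is a minimal, framework-agnostic Python sketch of the server-side logic (the function name and the simplistic parsing are my own assumptions; a real implementation needs validation and multi-range handling):

def build_partial_response_headers(range_header, file_size, etag):
    # expects a resume-style header such as "bytes=100-"
    unit, _, rng = range_header.partition("=")
    start_str, _, end_str = rng.partition("-")
    start = int(start_str)
    end = int(end_str) if end_str else file_size - 1

    headers = {
        "ETag": f'"{etag}"',
        "Accept-Ranges": "bytes",
        "Content-Range": f"bytes {start}-{end}/{file_size}",
        "Content-Length": str(end - start + 1),
    }
    # the body should carry file bytes start..end, sent with status 206 Partial Content
    return headers, start, end

# the interrupted-download example above: 100 bytes already received out of 500
headers, start, end = build_partial_response_headers("bytes=100-", 500, "v1")
# headers["Content-Range"] == "bytes 100-499/500", headers["Content-Length"] == "400"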

This is merely an introduction.

There are many complex considerations to take into account. For instance, when dealing with ETags, you must strategize on how to assign a unique ID to each resource. Additionally, you need to determine how to update the ETag when a resource is upgraded to a newer version.
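One common approach, and this is just one option among many, is to derive the ETag from the file’s metadata or content, so it changes automatically whenever the resource changes. A minimal sketch:

import hashlib
import os

def compute_etag(path):
    # size + modification time is cheap; a full content hash is stronger but slower
    stat = os.stat(path)
    return hashlib.md5(f"{stat.st_size}-{stat.st_mtime_ns}".encode()).hexdigest()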

Understanding those HTTP headers is a good start.

Before everything else, you need to install the Selenium package, of course.

pip install selenium

Or, if you hate dealing with anti-bot measures, you can just use this instead.

pip install undetected-chromedriver

Then, add the user data directory to the ChromeOptions object. It is the path to your Chrome profile. For macOS, it is located at ‘~/Library/Application Support/Google/Chrome’.

import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument(f"--user-data-dir={'Path_to_your_Chrome_profile'}")
driver = uc.Chrome(options=options)

driver.get('https://www.example.com')

The --user-data-dir argument is kind of cheating because it allows you to bypass the login process without actually logging in.

Cookie is your friend.

But sometimes, you need to handle the login process, for instance, when you have to switch between multiple accounts.

First of all, take care of your credentials. Use an .env file.

import os
from dotenv import load_dotenv

load_dotenv()

USERNAME = os.getenv('USERNAME_ENV_VAR')
PASSWORD = os.getenv('PASSWORD_ENV_VAR')

Then, you can use the send_keys method to fill in the username and password fields. I added a while loop to wait for the element in case the script runs too fast.

import time
from selenium.webdriver.common.by import By

while True:
    try:
        driver.find_element(by=By.ID, value="username").send_keys(USERNAME)
        break
    except Exception:
        time.sleep(1)

driver.find_element(by=By.ID, value="password").send_keys(PASSWORD)
driver.find_element(by=By.ID, value="submit").click()

After logging in, Chrome usually pops up a dialog asking if you want to save the password. It is annoying.

You can try to disable it by adding the --disable-save-password-bubble or --disable-popup-blocking argument to the ChromeOptions object. I don’t think it works. But you can try.

In the end, I just used a hack: open a new tab and immediately close it, and the popup goes away.

# open a new tab
driver.execute_script("window.open('','_blank')")

time.sleep(1) # 1 second wait is enough I guess
driver.switch_to.window(driver.window_handles[1])

# say goodbye to the new tab
driver.close()

# now switch back to the original tab
driver.switch_to.window(driver.window_handles[0])

That’s it.

Oh, one more thing.

Adding a user-agent to the ChromeOptions object is also a good idea. And please do not forget to specify version_main for the driver to match your current Chrome version.
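A minimal sketch of those two options (the user-agent string and the version number below are placeholders; use the values reported by your own Chrome installation):

import undetected_chromedriver as uc

options = uc.ChromeOptions()
# placeholder user-agent; copy the one your real browser sends
options.add_argument("--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

# version_main should match the major version of your installed Chrome, e.g. 120
driver = uc.Chrome(options=options, version_main=120)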

Raycast is a productivity tool for macOS. It allows you to quickly access files, folders, and applications. It’s great, but only available on macOS. If you already use Raycast, you know how useful it is. If you don’t, you should give it a try if you have a Mac.

For daily work, I also use Windows, and I was trying to implement a similar workflow on Windows. The thing I missed the most was the ability to search and open previously used workspaces in VS Code or remote machines with a few keystrokes.

You can guess my excitement when I found out about Microsoft PowerToys.

OK.

Enable VS Code search in the settings for the PowerToys Run utility.

Then, you can use the shortcut Alt + Space to search for your workspaces.

{ THE_WORKSPACE_NAME_YOU_WANT_TO_OPEN

Now I have to find the equivalent of zsh-autosuggestions on Windows. Wish me luck.

Links to the tools mentioned in this post:

Raycast
https://www.raycast.com

Microsoft PowerToys
https://github.com/microsoft/PowerToys

The Problem

  • Microsoft only provides an Intel-version ISO file for Windows 11.
  • For Windows 11 Insider Preview, the ARM version is provided only in VHDX format.

Workaround

Luckily, we can get the ESD file and then convert it into an ISO file that can be used in VMware.

Steps

Go to the Parallels website to download their Mac app. Alternatively, you can get the DMG link from the Homebrew API.

The link looks like this. After downloading, double-click the DMG file but don’t install Parallels. You just need to mount the DMG file.

Then open the terminal and run the following commands:

sudo ditto /Volumes/Parallels\ Desktop\ 19/Parallels\ Desktop.app/Contents/MacOS/prl_esd2iso /usr/local/bin/prl_esd2iso
sudo ditto /Volumes/Parallels\ Desktop\ 19/Parallels\ Desktop.app/Contents/Frameworks/libwimlib.1.dylib /usr/local/lib/libwimlib.1.dylib

We can thank Parallels for providing these amazing tools later. Unmount and delete the DMG file.

To figure out the download link for the Windows 11 ESD file:

cd ~/Downloads/ && curl -L "https://go.microsoft.com/fwlink?linkid=2156292" -o products_Win11.cab && tar -xf products_Win11.cab products.xml && cat products.xml | grep ".*_CLIENTCONSUMER_RET_A64FRE_en-us.esd" | sed -e s/"<FileName>"//g -e s/"<\/FileName>"//g -e s/\ //g -e s/"<FilePath>"//g -e s/"<\/FilePath>"//g -e s/\ //g | head -n 2

By the way, I assume your current working directory is ~/Downloads/. If not, please change it accordingly.

Use curl to download the ESD file:

curl http://dl.delivery.mp.microsoft.com/filestreamingservice/files/f16733c5-e9f8-4613-9fe6-d331c8dd6e28/22621.1702.230505-1222.ni_release_svc_refresh_CLIENTCONSUMER_RET_A64FRE_en-us.esd --output win11.esd

Convert the ESD file into ISO file.

prl_esd2iso ~/Downloads/win11.esd ~/Downloads/win11.iso

Now you can insert the ISO file into VMware Fusion, which is free to use with a personal license. You can find the license key after you register/log in on their website.

Install VMware Fusion with Homebrew. Yes, you need to have Homebrew installed, but I guess you have already done that.

brew install --cask vmware-fusion

If you run into a chown issue like:

Chown /Applications/VMware Fusion.app: Operation not permitted

Please double-check that Full Disk Access is granted for Terminal.app in System Settings under Privacy & Security.

Drag and drop the Windows 11 ISO file into VMware. You can go with UEFI and the default values for the rest of the settings.

Pay attention to the message on the screen: if it says press any key to continue, don’t wait. You only have about five seconds to hit a key, so be fast. I will not go over the basic steps of installing Windows 11; I trust you can install the operating system with the GUI.

When you reach the internet connection setup step, press shift + fn + F10 to invoke CMD. Input:

OOBE\BYPASSNRO

The setup will restart automatically, and this time you can choose the option I don’t have internet (yup, you actually don’t), then Continue with limited setup. If everything goes well, you will end up on the Windows 11 desktop.

Run PowerShell as Administrator and type:

Set-ExecutionPolicy RemoteSigned

Insert the VMware Tools CD image into the virtual machine. Run the setup script with PowerShell.

In case you want to set the Execution Policy back:

Set-ExecutionPolicy Restricted

If VMware Tools installs successfully, the internet connection will work inside the virtual machine. Adjust settings as you wish; for example, set the display resolution to 2880 x 1800 and the scale to 200%.

A fully operational Windows 11 on Mac is all yours.

Enjoy.

You probably have the same headache as me when AWS sends you an email saying that this month you have yet another bill for all the unknown active resources on AWS.

But wait, where are they? How can I find them?

I have been asking myself these questions time and time again. Now I have finally found a simple way to deal with it.

  • Open AWS Resource Groups: https://console.aws.amazon.com/resource-groups/
  • In the navigation pane, on the left side of the screen, choose Tag Editor.
  • For Regions, choose All regions.
  • For Resource types, choose All supported resource types.
  • Choose Search resources.

Then, you will see all the resources that are still active in your account.
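If you prefer to do the same from code, here is a minimal boto3 sketch (it assumes boto3 is installed and your AWS credentials are configured; you would need to repeat it for each region you care about):

import boto3

client = boto3.client("resourcegroupstaggingapi", region_name="us-east-1")

# page through every taggable resource in this region and print its ARN
paginator = client.get_paginator("get_resources")
for page in paginator.paginate():
    for resource in page["ResourceTagMappingList"]:
        print(resource["ResourceARN"])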

You need to terminate them one by one.

Good luck!

Introduction

BBR (“Bottleneck Bandwidth and Round-trip propagation time”) aims to improve network performance and reduce latency. BBR estimates the available network bandwidth and the round-trip time (RTT) to adjust the TCP sending rate dynamically, reducing queuing delays and reducing packet loss.

Prerequisites

Check if your Linux kernel version is 4.9 or higher.

uname -r

Congestion Control Status

sysctl net.ipv4.tcp_available_congestion_control

If the output includes bbr, for example net.ipv4.tcp_available_congestion_control = bbr cubic reno, then BBR is available in your kernel. To see which algorithm is actually in use, check net.ipv4.tcp_congestion_control; you can check it again after we enable BBR below.

Enable BBR

echo "net.core.default_qdisc=fq" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_congestion_control=bbr" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

The first line enables Fair Queueing (FQ), which is a network scheduler that improves network performance by reducing latency and jitter. The second line enables BBR.

The last line reloads the configuration file for the changes to take effect.

References

Optimizing HTTP/2 prioritization with BBR and tcp_notsent_lowat:
https://blog.cloudflare.com/http-2-prioritization-with-nginx/

TCP BBR congestion control comes to GCP – your Internet just got faster:
https://cloud.google.com/blog/products/networking/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster

The ftplib module is included in Python’s batteries and can be used to implement the client side of the FTP protocol. It’s compact and easy to use, but it’s missing a user-friendly progress display. For example, if you have a long-running connection to upload or download a large file to or from an FTP server, the terminal tells you nothing about the progress of the file transfer.

Don’t panic.

Luckily, we have Rich, a Python library for showing rich text (with color and style) in the terminal. In particular, it can display continuously updated information about the progress of long-running tasks, file copies, etc. That is perfect for the scenario of transferring files with an FTP server.

The main challenge is that the Progress object in rich.progress has to be updated every time you want to refresh the UI, and we have to, at the same time, keep it in sync with the actual progress of the FTP file transfer.

OK, show me the code.

First, make sure you have the rich library installed.

Then, double-check that you have these dependencies imported.

import time
from ftplib import FTP_TLS
from rich import print
from rich.progress import Progress, SpinnerColumn, TotalFileSizeColumn, TransferSpeedColumn, TimeElapsedColumn

The implementation is straightforward with help from the callback in FTP.retrbinary(). The callback function is called for each block of data received, and that is when we take the chance to update and render the progress display.

Here is an example of downloading from an FTP server.

def download_from_ftp(file_path):
    # ftplib.FTP_TLS() connects to FTP server with username and password
    ftp = FTP_TLS(host=FTP_HOST, user=FTP_USER, passwd=FTP_PASSWD)

    # Securing the data connection requires the user to explicitly ask for it by calling prot_p()
    ftp.prot_p()

    # prepare a file object on local machine to write data into
    f = open(file_path, 'wb')

    # initialize the ftp_progress with file_size, ftp connection and the file object
    # you may need to work out how to get the actual file size
    # Hint: FTP.dir() produces a directory listing as returned by the LIST command
    tracker = ftp_progress(file_size, ftp, f)

    # the trick to update rich progress display is using the callback function in retrbinary()
    ftp.retrbinary('RETR example_file.zip', callback=tracker.handle)

    # stop progress display and also close the file object
    tracker.stop()

    # send a QUIT command to the server and close the connection
    ftp.quit()
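As for getting file_size, the comment above hints at FTP.dir(); another option, if your server supports the SIZE command, is FTP.size(), for example:

# switch to binary mode first so SIZE reports the real byte count
ftp.voidcmd('TYPE I')
file_size = ftp.size('example_file.zip')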

If you go through the comments I wrote for the above function, the class below should be fairly self-explanatory. The handle() method is where we reflect the changes on each iteration, yes, in callbacks.

One thing you should be aware of is that FTP uses two separate TCP connections: one to carry commands and the other to transfer data. So in the case of a long file transfer, you need to talk to the command channel once in a while to keep it connected. The NOOP command is designed for this, to prevent the client from being automatically disconnected (by the server) for being idle.

class ftp_progress:
    def __init__(self, file_size, ftp, f):
        self.file_size = file_size
        self.ftp = ftp
        self.f = f
        self.size_written = 0
        self.time = time.time()
        self.progress = Progress(
            SpinnerColumn(),
            *Progress.get_default_columns(),
            TotalFileSizeColumn(),
            TransferSpeedColumn(),
            TimeElapsedColumn(),
        )
        self.task_download = self.progress.add_task("[red]Download...", total=self.file_size)
        self.progress.start()

    def stop(self):
        self.progress.stop()
        self.f.close()

    def handle(self, data):
        self.f.write(data)
        # advance by the actual block size; the last block is usually smaller than 8192
        self.size_written += len(data)
        self.progress.update(self.task_download, advance=len(data))

        # keep FTP control connection alive
        if time.time() - self.time > 60:
            self.time = time.time()
            self.ftp.putcmd('NOOP')
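The same trick works for uploads: FTP.storbinary() also accepts a callback, which is called with each block after it is sent. Here is a minimal sketch (the file names are placeholders; it reuses the Rich columns imported above rather than the ftp_progress class, since there is nothing to write locally on upload):

def upload_to_ftp(local_path, remote_name, file_size):
    ftp = FTP_TLS(host=FTP_HOST, user=FTP_USER, passwd=FTP_PASSWD)
    ftp.prot_p()

    progress = Progress(
        SpinnerColumn(),
        *Progress.get_default_columns(),
        TotalFileSizeColumn(),
        TransferSpeedColumn(),
        TimeElapsedColumn(),
    )
    task = progress.add_task("[green]Upload...", total=file_size)
    progress.start()

    with open(local_path, 'rb') as f:
        # advance the bar by the size of each block storbinary() reports as sent
        ftp.storbinary(f'STOR {remote_name}', f, blocksize=8192,
                       callback=lambda block: progress.update(task, advance=len(block)))

    progress.stop()
    ftp.quit()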

As a final note, be careful with passing by reference in Python. If you don’t close or maintain your FTP connections with the server correctly, strange things (not the TV show) could happen.

And, stay away from the nested callbacks, always.

Ref:

ftplib — FTP protocol client
https://docs.python.org/3/library/ftplib.html

Rich’s documentation
https://rich.readthedocs.io/en/stable/index.html

Progress Display (Rich)
https://rich.readthedocs.io/en/stable/progress.html

Seven years ago, I wrote the first post in this blog to explain how to build this blog. Basically, I still use the Hexo framework for blogging. Here is a summary of how you can rebuild the blog today.

Versions

  • Node.js v17.6.0
  • npm 8.5.1
  • hexo 6.0.0
  • hexo-theme-next 8.10.0

If you use different versions, things may break.

Create GitHub Repo for the blog folder

It will be convenient to use git to version-control your posts. You can just create an empty private repo. Let’s call the folder your_blog_folder/.

Create GitHub Repo for hosting the blog GitHub Page

A public repo named YOUR_NAME.github.io, where YOUR_NAME is your GitHub account name.

Install Node, Hexo and NexT

Install Node.js with Homebrew

brew install node

Install Hexo and NexT with npm

npm install -g hexo-cli
npm install hexo-theme-next

Initialize Hexo

Make sure you are in the target folder

cd your_blog_folder
hexo init
npm install

Configure Hexo and NexT

your_blog_folder/_config.yml is the config file for Hexo. You also need to copy the NexT config file at node_modules/hexo-theme-next/_config.yml to the root of your_blog_folder/ and rename it to _config.next.yml, so the two files sit in the same folder. Here are some of the key configuration changes.

For _config.yml

# Site
timezone: 'America/Toronto'
# URL
url: http://YOUR_NAME.github.io
# Extensions
theme: next
# Deployment
deploy:
  type: git
  repo: https://github.com/YOUR_NAME/YOUR_NAME.github.io
  branch: main

For _config.next.yml

# Creative Commons 4.0 International License.
creative_commons:
  sidebar: true
  post: true
# Menu Settings
menu:
  home: / || fa fa-home
  archives: /archives/ || fa fa-archive
# Sidebar Settings
sidebar:
  position: right
# Social Links
social:
  GitHub: https://github.com/YOUR_NAME || fab fa-github
# Subscribe through Telegram Channel, Twitter, etc.
follow_me:
  RSS: /atom.xml || fa fa-rss
# Misc Theme Settings
codeblock:
  theme:
    light: tomorrow-night-bright
    dark: tomorrow-night-bright
  copy_button:
    enable: true
    style: mac
# Reading progress bar
reading_progress:
  enable: true

Write a post

hexo n 'Your_post_name'

And if you are using VS Code, you will find Your_post_name.md in your_blog_folder/source/_posts/, where you can write in Markdown.

Generate the post html

hexo g

You can view the post locally and make changes if you need.

Deploy the post to GitHub

hexo d

It will automatically generate the GitHub Pages site for you, which you can see by visiting YOUR_NAME.github.io.

When you try to deploy the blog, if you see the message ERROR Deployer not found: git, or other error messages (for example, that you need the RSS feed ready for your blog), please double-check that you have all the required node modules installed.

npm install hexo-deployer-git --save
npm install hexo-generator-feed --save

Reminder

If you git clone the private blog repo to write a new post, remember to run npm install before you try to deploy it to GitHub. The moment you git clone to your local machine, the node_modules folder is empty; it is not synced with git by default, as set in the .gitignore generated by Hexo.

rm -rf node_modules && npm install --force
hexo d -g

Ref:

Hexo
https://hexo.io

Theme NexT
https://theme-next.js.org

hexo-generator-feed
https://github.com/hexojs/hexo-generator-feed

For some reason, you need to use an HTTP proxy.

Let me assume you have a Microsoft Azure account.

Create VM on Azure

First, go to the Azure Portal and create a Linux virtual machine, say, Ubuntu 20.04 LTS. The default config will be fine during your VM setup.

Connect to VM

Connect to the virtual machine via SSH with a client; in my case, I use Terminal on macOS. Let’s rename the private key file, which you downloaded in the previous VM creation step, to azureuser.pem. Use chmod 400 to ensure you have read-only access to the private key.

chmod 400 azureuser.pem
ssh -i /Path/To/Some/Folder/azureuser.pem azureuser@vps_IP

Install Squid

sudo apt-get update -y
sudo apt-get upgrade -y
sudo apt-get install squid -y

Update Squid Config file

sudo nano /etc/squid/squid.conf

Find the line http_access deny all and insert the two lines below BEFORE it, to allow your IP to use the proxy.

acl client src your_IP
http_access allow client

By default, Squid uses port 3128. If you are going to change it, just remember to double-check whether any other application is already using that port.

http_port PORT_NUMBER

By default, Squid will append your original IP address to the HTTP requests it forwards, something like X-Forwarded-For: 192.1.2.3. So, if you don’t want the destination server to know you are using a proxy, and you want to strip some of the request headers that Squid passes on to the destination server, you can press Ctrl + W to find and uncomment these:

# forwarded_for off
# request_header_access Authorization allow all
# request_header_access Proxy-Authorization allow all
# request_header_access Cache-Control allow all
# request_header_access Content-Length allow all
# request_header_access Content-Type allow all
# request_header_access Date allow all
# request_header_access Host allow all
# request_header_access If-Modified-Since allow all
# request_header_access Pragma allow all
# request_header_access Accept allow all
# request_header_access Accept-Charset allow all
# request_header_access Accept-Encoding allow all
# request_header_access Accept-Language allow all
# request_header_access Connection allow all
# request_header_access All deny all

All done. Ctrl + O to save the file, and Ctrl + X to exit. Restart Squid, or you can just restart the VM.

sudo service squid restart

Open Port for Squid

Lastly, go to your VM on the Azure portal and open the Networking tab in the settings. You need to add an inbound port rule to make the Squid http_port number (by default, 3128) accessible from your IP.

Use HTTP Proxy in Python

import requests

http_proxy = {
'http': 'http://vps_IP:PORT_NUMBER',
}

r = requests.get(url, proxies=http_proxy)
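If the target URL uses https, also map the https scheme to the same proxy; requests will then tunnel the traffic through Squid with a CONNECT request (Squid allows CONNECT to port 443 by default):

http_proxy = {
    'http': 'http://vps_IP:PORT_NUMBER',
    'https': 'http://vps_IP:PORT_NUMBER',
}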

Use HTTP Proxy in Terminal on MacOS (Bash)

Enter the command below into the terminal for session-only use.

export http_proxy=http://vps_IP:PORT_NUMBER

If you want to make the proxy permanent in the Terminal, you can add it to your ~/.bash_profile.

Use HTTP Proxy on iOS

Go to the settings page of the WiFi you are currently using. Configure HTTP Proxy to Manual and enter the details.

By the way, save your time and do not use Squid to bypass GFW. It will fail, technically speaking. I’m sorry.

Squid is not designed to encrypt internet traffic by any means. On the contrary, a Squid server can and will perform SSL interception, because Squid is literally the man-in-the-middle (MITM). Keep that in mind if you consider making use of someone else’s HTTP proxy.

Ref:

Ubuntu documentation for Squid:
https://help.ubuntu.com/community/Squid

Squid man page:
http://manpages.ubuntu.com/manpages/focal/en/man8/squid.8.html

Official Squid site:
http://www.squid-cache.org

I have been thinking about how to store customers’ login details in the database.

I mean, how to arrange the tables to store the passwords securely enough so that even if one day the database is leaked (and it will be), no actual passwords are exposed.

On second thought, managing the passwords by myself could be a bad idea.

Hand it over to Google or Facebook.

OK.

Let’s say, for some unknown reasons, I have to do it.

The million-dollar question is, HOW.

  • If I store the plain-text password anywhere in the database, the company would probably just fire the guy who designed the database, that’s me, for good.
  • If I store the hashed password with the username, just like that, it is vulnerable to dictionary or rainbow table attacks. But maybe I can keep my job, for now.
  • If I use SHA512, instead of MD5, as the hash function, the computational power required to crack the passwords is significantly different.

Solution (perhaps)

Add a random salt to each individual password, and then calculate the SHA512 hash values. Remember to generate the salt again once the customer changes the password. Also, I need a cryptographically secure random function to generate the salt.

LoginID   LoginName   Salt    HashedPasswordWithSalt
0000001   Alice       T7#jd   RncFuVDvUtVxXUFrvOHPfiF
0000002   Bob         $1Yo2   UZ0CkHkEccFErZujyAl3wys
0000003   Charlie     UWp*1   Pt4a1176FY2zcewmbcvEuAN

In practice, use a longer salt, I guess.
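A minimal Python sketch of this scheme (the salt length and the mapping to the table columns above are illustrative assumptions):

import hashlib
import secrets

def hash_password(password):
    # cryptographically secure random salt, longer than the toy values in the table above
    salt = secrets.token_hex(16)
    digest = hashlib.sha512((salt + password).encode('utf-8')).hexdigest()
    return salt, digest    # store these in the Salt and HashedPasswordWithSalt columns

def verify_password(password, salt, stored_digest):
    candidate = hashlib.sha512((salt + password).encode('utf-8')).hexdigest()
    # constant-time comparison avoids leaking information through timing
    return secrets.compare_digest(candidate, stored_digest)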

Ref:

Password Cracking
https://www.youtube.com/watch?v=7U-RbOKanYs

How NOT to Store Passwords!
https://www.youtube.com/watch?v=8ZtInClXe1Q

Rainbow table
https://en.wikipedia.org/wiki/Rainbow_table
