(This post is just a note to improve my knowledge of #systemdesign).
In any system that we design, we may need to identify each object uniquely. Very often, this token generator module itself can become a performance and concurrency bottleneck.
Will be outlining all the methods that I come across for unique token generation in this post from simple to hard and complicated. As always, I expect to update this page from time as whenever I find new ways to generate unique ids.
- use a monotonically increasing counter and assign whenever a new request for a unique resource id is made
+ simple to use
— additional mechanism needed to use concurrently; counterspace needs to be partitioned to be used in parallel
— last assigned value needs to be persisted in stable storage - use hashing function like MD5|SHA1|SHA2 or equivalent to generate a unique hash function.
+ easy to use and low probability of collisions
— MD5 (128bit|16bytes) SHA1–1(160bit|40bytes),SHA2(28..64bytes) still too long to uniquely identify a resource in certain usecases viz. URL shortener|pastebin
— computation of the cryptographic hash needs CPU cycles
— id tied to URL, cannot generate more than 1 shorturl or custom id to identify resource
— cannot be precomputed|batch computed - use UUID / GUID
+ no centralized authority is required to administer them
+ can be independently generated without any fear of collisions as they are generated unique to systems MAC id.
+ do not depend on input so can be batch generated and preallocated.
— UUID(128bits|32bytes) too long for certain usecases viz bitly|pastebin - Use PRIMARY key in DB and AUTOINCREMENT in MySQL
* MongoDB provides ObjectID, MySQL|MariaDB provide AUTO_INCREMENT, MS SQL Server provides IDENTITY
* Is a form of Persistence Layer Generated IDs
— generated ID is not known to your code without a roundtrip to the database
— extra roundtrip to the RDBMS may slow down your application - SnowflakeID aka TwitterID
* Snowflake ID is a form of unique identifier (64bits|8bytes) used in distributed computing. * Form of ID server|s. Implementation can be a single server creating Ids or a cluster of servers creating high numbers of IDs per second.
* created by Twitter and is used for the IDs of tweets, adopted by other companies, including Discord and Instagram
* Instagram uses a modified version( 41 bits for a timestamp, 13 bits for a shard ID, and 10 bits for a sequence number.)
* Twitter was looking for a minimum of 10000 IDs per sec per process with a response rate of <2ms.
* ID servers should require no network coordination at all between them and generated Ids should be, roughly, time-ordered.
* Format: Time(41 bits timestamp with ms precision)+ 10 bits(machine ID)
+ 12 bits(per-machine sequence number, rolls every 4096 per machine), to allow creation of multiple snowflakes in the same ms.
* Final number is generally serialized in decimal
* Sortable by time, because they are based on the time they were created
* Time a snowflake was created can be calculated from the snowflake.
* More on it at https://en.wikipedia.org/wiki/Snowflake_ID - ULID: Unique Lexicographically Sorted Individual Identifiers
* Relatively new mechanism used by Shopify, 50% faster then UUID generation
* ULIDs have a time component and sortable
* ULIDS have a compact string representation
* https://github.com/ulid/spec
- 1.21e+24 unique ULIDs per millisecond
- Lexicographically sortable!
- Canonically encoded as a 26 character string, as opposed to the 36 character UUID
- Case insensitive
- Uses Crockford’s base32 for better efficiency and readability (5 bits per character) incl. all 0–9 digits but alphabet excludes the letters I, L, O, and U to avoid confusion and abuse.
- No special characters (URL safe)
- Monotonic sort order (correctly detects and handles the same millisecond)
* https://www.youtube.com/watch?v=f53-Iw_5ucA
> ttttttttttrrrrrrrrrrrrrrrr
where
t is Timestamp (10 characters)
r is Randomness (16 characters)
- https://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
- https://ebrary.net/64743/computer_science/partitioning_hash
- https://dba.stackexchange.com/questions/166843/what-index-to-use-with-lots-of-duplicate-values
- https://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
- https://betterprogramming.pub/uuid-generation-snowflake-identifiers-unique-2aed8b1771bc?gi=605a222fa212
- https://blog.bytebytego.com/p/ep36-types-of-databases-and-use-cases
- https://towardsdatascience.com/choosing-the-right-database-in-a-system-design-interview-b8af9c6dc525?gi=0379cabc721e
- https://medium.com/timesopen/collective-decision-making-with-ahp-3ef819e5bc2a
- http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
- https://firebase.googleblog.com/2015/02/the-2120-ways-to-ensure-unique_68.html
- https://levelup.gitconnected.com/how-to-generate-unique-ids-in-distributed-systems-6-key-strategies-37a8ab3b367d