url shortening algo

Hi,

Most of the popular URL shortening services simply take the ID in the database of the URL and then convert it to either Base 36 [a-z0-9] (case insensitive) or Base 62 (case sensitive).

how can i fix the encoding/decoding in the following url shortening algo? it seems it overflows unsigned long long...

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <cassert>
using namespace std;

class URLShortner {
private:
	vector<char> alphabet;
	size_t base;

public:
	URLShortner(std::string list): alphabet(list.begin(), list.end()), 
	                               base(list.size()) {}

	string encode(unsigned long long num) {
		vector<char> sb;

		while (num > 0) {
			size_t index = num % base;

			sb.push_back(alphabet[index]);
			num /= base;
		}

		string result(sb.cbegin(), sb.cend());

		reverse_copy(sb.begin(), sb.end(), result.begin());

		return result;
	}

	int alphabet_indexof(char s) {
		int index = -1;

		auto it = std::find(alphabet.begin(), alphabet.end(), s);
		if (it != alphabet.end()) {
  			index = std::distance(alphabet.begin(), it);
		}

		return index;
	}

	unsigned long long decode(const string &str) {
	    unsigned long long num = 0;

	    for(int i = 0; i < str.size(); i++) {
	      int index = alphabet_indexof(str[i]);
	      assert(index != -1);
	      num = num * base + alphabet_indexof(str[i]);
	    }

	    return num;
	}
};

int main() {
	// your code goes here

	URLShortner s{"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789."};

	unsigned long long int id = s.decode("www.google.com");
	cout << id << endl;
	string url_string1 = s.encode(id);
	cout << url_string1 << endl;

	string url_string2 = s.encode(12345);
	cout << url_string2 << endl;
	cout << s.decode(url_string2) << endl;

	return 0;
}

allen schrieb:

Most of the popular URL shortening services simply take the ID in the database of the URL and then convert it to either Base 36 [a-z0-9] (case insensitive) or Base 62 (case sensitive).

I think you mixed up "ID in the database" (which is a rather small integer) with "URL" (witch may be a string of any length) in your implementation.

SeppJ

Your algorithm will not shorten anything. It just makes a numerical representation out of a string and vice versa. In fact, the number (when written in base 10) will be longer, as you only have log_2(10) bits per character. That is also the reason, why even an unsigned long long will overflow. The numbers you are handling here are just way too large.

what does that mean in terms of code?
take the ID in the database of the URL and then convert it to either Base 36 [a-z0-9] (case insensitive) or Base 62 (case sensitive).

hustbaer

allen schrieb:

what does that mean in terms of code?
take the ID in the database of the URL and then convert it to either Base 36 [a-z0-9] (case insensitive) or Base 62 (case sensitive).

Well, using a database would be a start.
Getting a basic understanding of programming would also help a lot.

base32 is what this function should do... the id would come from the database... would that work?

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
using namespace std;

string decode(unsigned long long id) {
vector<char> sb;
vector<char> alphabet = {'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f','g','h','j','k','m','n','p','q','r','s','t','v','w','x','y','z'};
size_t base = alphabet.size();

while (id > 0) {
size_t index = id % base;

sb.push_back(alphabet[index]);
id /= base;
}

string result(sb.cbegin(), sb.cend());

reverse_copy(sb.begin(), sb.end(), result.begin());

return result;
}

int main() {
// your code goes here

cout << decode(123) << endl;

return 0;
}

allen schrieb:

base32 is what this function should do... the id would come from the database... would that work?

Yes, assuming the id is nonzero.

Here's a cleaner version of your code:

#include <iostream>
#include <string>
#include <algorithm>
#include <cassert>
using namespace std;

string decode(unsigned long long id) {
  if (id == 0) return "0";

  const char alphabet[] = "0123456789abcdefghjkmnpqrstvwxyz";
  const size_t base = 32;
  assert(end(alphabet) - begin(alphabet) - 1 == base);

  string code;
  while (id > 0) {
    code.push_back(alphabet[id % base]);
    id /= base;
  }
  reverse(begin(code), end(code));
  return code;
}

int main() {
  cout << decode(123) << '\n';
}

SeppJ

allen schrieb:

base32 is what this function should do... the id would come from the database... would that work?

It does base36.

SeppJ schrieb:

allen schrieb:

base32 is what this function should do... the id would come from the database... would that work?

It does base36.

Wow, I didn't notice that he left out the letters 'i', 'l', 'o' and 'u' in his alphabet array. What he does is neither base32 nor base36.

SeppJ

during schrieb:

SeppJ schrieb:

allen schrieb:

base32 is what this function should do... the id would come from the database... would that work?

It does base36.

Wow, I didn't notice that he left out the letters 'i', 'l', 'o' and 'u' in his alphabet array. What he does is neither base32 nor base36.

Oops, me neither. So it is neither base36, nor what is commonly known as base32. It appears to be Crockford's Base32, but without checksums.

Wikipedia schrieb:

It also excludes the letter U to reduce the likelihood of accidental obscenity.

i missed out some characters...the following should fix the base32?

vector<char> alphabet = {'0','1','2','3','4','5','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'};

SeppJ

Does the exact encoding matter? You seem to understand the important principles, the choice of alphabet is just details.

What happens, if the ID is 0 or the last digit in base_whatever is 0 ?