Monday, May 26, 2014

[Google] How a linkedlist is maintained by multiple machines?

Question:

Answer:
任何一个获得互斥锁(这个锁禁止其他节点读取数据)的节点更改了数据,之后如果lock manager[[可以理解为一个全局的锁的管理者]]要收回这个锁,这意味着有其他节点要读取或写入数据,之前那个节点需要以piggyback的形式把更改的数据送回,这样其他节点(那些获得读写权限的节点)就可以获得最新的数据版本了。
2

We can tackle this problem using a master for synchronization and N nodes for actual storage of the linked list.

Every time a node wants to do something on the linked list (since we are using a linked list we can suppose the operations available are append on both ends/remove on both ends), it has first to acquire a lock which state is controlled by the master.

This will work but won't allow for much concurrency as no parallel operations would be possible.

If we want to allow parallel operations to occur, we can make the operations reported (marked as => below) to the master before they are actually performed on any node:
Operation 1 - Node i => Append 3 to the linked list
Operation 2 - Node j => Remove the top element of the linked list

The master would allow the operation 1 and start propagating it to every N nodes.
Then it would see the operation 2 and would not allow it as it is not compatible with operation 1 (add/delete on the same side cannot be done concurrently).

Another sequence could be:
Operation 1 - Node i => Append 3 to the linked list
Operation 2 - Node j => Append 5 to the linked list

In that case, the master could allow each operation to be performed but would propagate operation 1 before it actually starts propagating operation 2.

In my opinion, the key point here is to ask:
- should we allow parallel operations and which ones ? if no, then a master lock will do fine
- does it matter if the order of insertion/deletion is not respected ? (i.e. we can permute two consecutive append or two consecutive delete) If not, then we can allow for some level of concurrency.

Sunday, May 18, 2014

Surrounded Region -- Leetcode

Question:
Given a 2D board containing 'X' and 'O', capture all regions surrounded by 'X'.
A region is captured by flipping all 'O's into 'X's in that surrounded region.
For example,
X X X X
X O O X
X X O X
X O X X
After running your function, the board should be:
X X X X
X X X X
X X X X
X O X X
Answer:
Similar to "Blood Fill" method used in computer game and graphics.

1. DFS (Recursion method), can pass small data set, but time exceeded for large set:
1.1 Directly find 'O', reverse into 'X', then DFS, if there exists an 'O' in corner in this DFS process, then backtrack to 'O'.
class Solution {
public:
    void solve(vector<vector<char>> &board) {
        if(!board.size())return;
     
        int m= board.size();
        int n= board[0].size();
        for(int i=0; i<m; ++i){
            for(int j= 0; j<n; ++j){
                if(board[i][j]=='O'){
                    DFS(i,j,board);
                }
            }
        }
        return;
    }
    bool DFS(int i, int j,vector<vector<char>> &board){
        int m= board.size();
        int n= board[0].size();
        if(i-1<0 || j-1<0 || i+1>m-1 || j+1>n-1) return false;   //corner case: 'O'
     
        board[i][j]= 'X';
     
        if(i-1>=0 && board[i-1][j]=='O'){
            if(!DFS(i-1,j,board)){
                board[i-1][j]= 'O';                      //Backtrack!!!
                return false;
            }
        }
        if(j-1>=0 && board[i][j-1]=='O'){
            if(!DFS(i,j-1,board)){
                board[i][j]= 'O';
                return false;
            }
        }
        if(i+1<=m-1 && board[i+1][j]=='O'){
            if(!DFS(i+1,j,board)){
                board[i][j]= 'O';
                return false;
            }
        }
        if(j+1<=n-1 && board[i][j+1]=='O'){
            if(!DFS(i,j+1,board)){
                board[i][j]= 'O';
                return false;
            }
        }
        return true;
    }
};

1.2 Find 'O' in the edge firstly, then DFS these "invalid" 'O's, mark them as 'C'. Finally reverse all 'O' into 'X', 'C' into 'O'.
class Solution {
public:
    void solve(vector<vector<char>> &board) {
        if(!board.size())return;
     
        int m= board.size();
        int n= board[0].size();
        for(int i=0; i<m; ++i){
            if(board[i][0]=='O')
                DFS(i,0,board);
            if(board[i][n-1]=='O')
                DFS(i,n-1,board);
        }
        for(int i=0; i<n; ++i){
            if(board[0][i]=='O')
                DFS(0,i,board);
            if(board[m-1][i]=='O')
                DFS(m-1,i,board);
        }
     
        for(int i=0;i<m;++i){
            for(int j=0;j<n;++j){
                if(board[i][j]=='O')
                    board[i][j]= 'X';
                else if(board[i][j]=='C')
                    board[i][j]= 'O';
            }
        }
        return;
    }
    void DFS(int i, int j,vector<vector<char>> &board){            //Mark all corner 'O' into 'C'.
        int m= board.size();
        int n= board[0].size();
     
        board[i][j]= 'C';
     
        if(i-1>=0 && board[i-1][j]=='O')
            DFS(i-1,j,board);
        if(j-1>=0 && board[i][j-1]=='O')
            DFS(i,j-1,board);
        if(i+1<=m-1 && board[i+1][j]=='O')
            DFS(i+1,j,board);
        if(j+1<=n-1 && board[i][j+1]=='O')
            DFS(i,j+1,board);
        return;
    }
};

2. BFS: Using queue, iterative, no recursion, can pass large set.
class Solution {
    struct pair{
        int x,y;
        pair(int i,int j):x(i),y(j){};
    };
public:
    void solve(vector<vector<char>> &board) {
        if(!board.size())return;
     
        int m= board.size();
        int n= board[0].size();
     
        for(int i=0; i<m; ++i){        //For 'O' in edge case.
            if(board[i][0]=='O'){
                BFS(i,0,board);
            }
            if(board[i][n-1]=='O'){
                BFS(i,n-1,board);
            }
        }
        for(int i=0; i<n; ++i){
            if(board[0][i]=='O'){
                BFS(0,i,board);
            }
            if(board[m-1][i]=='O'){
                BFS(m-1,i,board);
            }
        }
     
        for(int i=0;i<m;++i){
            for(int j=0;j<n;++j){
                if(board[i][j]=='O')         //which 'O' should be flipped to 'X'.
                    board[i][j]= 'X';
                else if(board[i][j]=='C')    //which 'O' shouldn't be flipped.
                    board[i][j]= 'O';
            }
        }
        return;
    }
    void BFS(int a,int b,vector<vector<char>> &board){        //Mark all corner 'O' into 'C'.
        int m= board.size();
        int n= board[0].size();
        queue<pair> que;
     
        board[a][b]= 'C';
        que.push(pair(a,b));
     
        while(!que.empty()){
            pair t= que.front();
            que.pop();
            int i= t.x;
            int j= t.y;
            if(i-1>=0 && board[i-1][j]=='O'){
                board[i-1][j]= 'C';
                que.push(pair(i-1,j));
            }
            if(j-1>=0 && board[i][j-1]=='O'){
                board[i][j-1]= 'C';
                que.push(pair(i,j-1));
            }
            if(i+1<=m-1 && board[i+1][j]=='O'){
                board[i+1][j]= 'C';
                que.push(pair(i+1,j));
            }
            if(j+1<=n-1 && board[i][j+1]=='O'){
                board[i][j+1]= 'C';
                que.push(pair(i,j+1));
            }
        }
        return;
    }
};

Saturday, May 10, 2014

Word Search -- Leetcode

Question:
Given a 2D board and a word, find if the word exists in the grid.
The word can be constructed from letters of sequentially adjacent cell, where "adjacent" cells are those horizontally or vertically neighboring. The same letter cell may not be used more than once.
For example,
Given board =
[
  ["ABCE"],
  ["SFCS"],
  ["ADEE"]
]
word = "ABCCED", -> returns true,
word = "SEE", -> returns true,
word = "ABCB", -> returns false.
Answer:
class Solution {
public:
    bool exist(vector<vector<char> > &board, string word) {
        int m=board.size();
        int n=board[0].size();
        vector<vector<bool> > visited(m, vector<bool>(n,false) );
        
        for(int i=0;i<m;++i){
            for(int j=0;j<n;++j){
                if(board[i][j]==word[0]){
                    if(DFS(board,i,j,visited,0,word))
                        return true;
                }
            }
        }
        return false;
    }
    
    bool DFS(vector<vector<char>> &board, int i,int j,vector<vector<bool>> &visited,int index,string &word){  
        if(word[index]!=board[i][j])return false;
        
        //word[index] == board[i][j]
        if(index == word.length()-1){   //the last char of word, end case.
            visited[i][j]=true;
            return true;
        }
        else{
            visited[i][j]= true;
            
            if(j-1>=0 && !visited[i][j-1]){
               if(DFS(board,i,j-1,visited,index+1,word))
                   return true;
            }
            if(i-1>=0 && !visited[i-1][j]){
               if(DFS(board,i-1,j,visited,index+1,word))
                   return true;
            }
            if(j+1<board[0].size() && !visited[i][j+1]){
               if(DFS(board,i,j+1,visited,index+1,word))
                   return true;
            }
            if(i+1<board.size() && !visited[i+1][j]){
               if(DFS(board,i+1,j,visited,index+1,word))
                   return true;
            }
            
            visited[i][j] = false;   //backtrack!!!
        }
        return false;
    }

};

Friday, May 2, 2014

Google : GFS,mapreduce,Bigtable

分布式系统  一 —— Google三驾马车: GFS,mapreduce,Bigtable

谈到分布式系统,就不得不提Google的三驾马车:Google fs[1],Mapreduce[2],Bigtable[3]。
虽然Google没有公布这三个产品的源码,但是他发布了这三个产品的详细设计论文。而且,Yahoo资助的Hadoop也有按照这三篇论文的开源Java实现:Hadoop对应Mapreduce, Hadoop Distributed File System (HDFS)对应Google fs,Hbase对应Bigtable。不过在性能上Hadoop比Google要差很多,参见表1。
Experiment
HBase20070916
BigTable
random reads
272
1212
random reads (mem)
Not implemented
10811
random writes
1460
8850
sequential reads
267
4425
sequential writes
1278
8547
Scans
3692
15385
表1。Hbase和BigTable性能比较(来源于http://wiki.apache.org/lucene-hadoop/Hbase/PerformanceEvaluation)
以下分别介绍这三个产品:
一 Google fs
GFS是一个可扩展的分布式文件系统,用于大型的、分布式的、对大量数据进行访问的应用。它运行于廉价的普通硬件上,提供容错功能。
分布式系统漫谈一 <wbr>—— Google三驾马车: <wbr>GFS,mapreduce,Bigtable
图1 GFS Architecture
(1)GFS的结构
1. GFS的结构图见图1,由一个master和大量的chunkserver构成,
2. 不像Amazon Dynamo的没有主的设计,Google设置一个主来保存目录和索引信息,这是为了简化系统结果,提高性能来考虑的,但是这就会造成主成为单点故障或者瓶颈。为了消除主的单点故障Google把每个chunk设置的很大(64M),这样,由于代码访问数据的本地性,application端和master的交互会减少,而主要数据流量都是Application和chunkserver之间的访问。
3. 另外,master所有信息都存储在内存里,启动时信息从chunkserver中获取。提高了master的性能和吞吐量,也有利于master当掉后,很容易把后备j机器切换成master。
4. 客户端和chunkserver都不对文件数据单独做缓存,只是用linux文件系统自己的缓存
The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas.”

“Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However,we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.”

“Neither the client nor the chunkserver caches file data.Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues.(Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.”

(2)GFS的复制
GFS典型的复制到3台机器上,参看图2
分布式系统漫谈一 <wbr>—— Google三驾马车: <wbr>GFS,mapreduce,Bigtable
图2 一次写操作的控制流和数据流
(3) 对外的接口
和文件系统类似,GFS对外提供create, delete,open, close, read, 和 write 操作
另外,GFS还新增了两个接口snapshot and record append,snapshot是做一个
“Moreover, GFS has snapshot and record append operations.
Snapshot creates a copy of a file or a directory tree at low cost.
Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append.”

二 Mapreduce
Mapreduce是针对分布式并行计算的一套编程模型。
讲到并行计算,就不能不谈到微软的Herb Sutter在2005年发表的文章” The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software”[6],主要意思是通过提高cpu主频的方式来提高程序的性能很快就要过去了,cpu的设计方向也主要是多核,超线程等并发上。但是以前的程序并不能自动的得到多核的好处,只有编写并发程序,才能真正获得多核的好处。分布式计算也是一样。
分布式系统漫谈一 <wbr>—— Google三驾马车: <wbr>GFS,mapreduce,Bigtable
图3 Mapreduce Execution overview

1.Mapreduce是由Map和reduce组成,来自于Lisp,Map是影射,把指令分发到多个worker上去,reduce是规约,把Map的worker计算出来的结果合并。(参见图3)
2.Google的Mapreduce实现使用GFS存储数据。
3.Mapreduce可用于Distributed Grep,Count of URL Access Frequency,ReverseWeb-Link Graph,Distributed Sort,Inverted Index
三 Bigtable
就像文件系统需要数据库来存储结构化数据一样,GFS也需要Bigtable来存储结构化数据。
1.  BigTable 是建立在 GFS ,Scheduler ,Lock Service 和 MapReduce 之上的。
2.  每个Table都是一个多维的稀疏图
3.  为了管理巨大的Table,把Table根据行分割,这些分割后的数据统称为:Tablets。每个Tablets大概有 100-200 MB,每个机器存储100个左右的 Tablets。底层的架构是:GFS。由于GFS是一种分布式的文件系统,采用Tablets的机制后,可以获得很好的负载均衡。比如:可以把经常响应的表移动到其他空闲机器上,然后快速重建。


参考文献
[1]       The Google File System; http://labs.google.com/papers/gfs-sosp2003.pdf
[2]       MapReduce: Simplifed Data Processing on Large Clusters; http://labs.google.com/papers/mapreduce-osdi04.pdf
[3]       Bigtable: A Distributed Storage System for Structured Data;http://labs.google.com/papers/bigtable-osdi06.pdf
[4]       Hadoop ; http://lucene.apache.org/hadoop/
[5]       Hbase: Bigtable-like structured storage for Hadoop HDFS;http://wiki.apache.org/lucene-hadoop/Hbase
[6]       The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software;http://www.gotw.ca/publications/concurrency-ddj.htm

转自:http://blog.sina.com.cn/s/blog_4ed630e801000bi3.html

Week4: DNS Performance and the Effectiveness of Caching

----For EECS345 Distributed System (CS Northwestern course) weekly reading report

                            Week4: DNS Performance and the Effectiveness of Caching

       This paper has presented a detailed analysis of traces of DNS and associated TCP traffic collected on the Internet based on the data form traced links of MIT Laboratory and KAIST, then discusses trace-driven simulations to study the effectiveness of DNS caching as a function of TTL and degree of cache sharing.

       By analyse the collected data, the authors have these results:
1.                    Distribution of popular names following the Zipf’s law fails to make use of caching with larger TTL values.
2.                    Sharing cache among groups of clients has limited gain in terms of cache hit after the total member count crosses 20-25.
3.                   The client-perceived latency is adversely affected by number of referrals, and caching NS records to reduce the number of referrals will decrease latency as well as load on the root servers.
4.                   Distribution of names causing negative responses follows a heavy tailed distribution as well. As a result, hit rate of negative caching is also limited.

       Using trace-driven simulations algorithm, the author wants to find how useful to share DNS caches among many client machines and what is the impact of choice of TTL on caching effectiveness. The authors quantify two important statistics:  the distribution of name popularity and TTL values in the trace data. Both determine cache hit rates. And draw a conclusion that for A-records, lower TTL seems won’t harm the hit rates, caching appears to have limited effectiveness. But for NS-records, it would increase the load on root server and harm DNS scalability.

At the end, the paper draw a conclusion that the  widespread use of dynamic, lower-TTL A-record bindings should not harm DNS performance. Such that the scalability of DNS are less dependent on the hierarchical design of its name space or good A-record caching(originally believed).